IEEE TRANSACTIONS ON SMART GRID, VOL. XX, NO. XX, MONTH 2022 1
Joint Optimization and Learning Approach for
Smart Operation of Hydrogen-based
Building Energy Systems
Liang Yu, Member, IEEE, Zhanbo Xu, Member, IEEE, Xiaohong Guan, Fellow, IEEE,
Qianchuan Zhao, Senior Member, IEEE, Chunxia Dou, Member, IEEE, and Dong Yue, Fellow, IEEE
Abstract—In recent years, hydrogen-based multi-energy sys-
tems (HMESs) have received wide attention. However, existing
works on the optimal operation of HMESs neglect building
thermal dynamics, which means that the flexibility of thermal
loads can not be utilized for reducing system operation cost.
In this paper, we investigate an optimal operation problem
of an HMES with the consideration of building thermal dy-
namics. Specifically, we first formulate an expected operational
cost minimization problem related to an HMES. Due to the
existence of uncertain parameters, inexplicit building thermal
dynamics models, spatially and temporally coupled operational
constraints, and nonlinear constraints, it is challenging to solve
the formulated problem. Then, we propose an algorithm to
solve the problem based on model-based optimization and data-
driven based learning. The key idea of the proposed algorithm
is summarized as follows: (1) transforming the long-term cost
minimization problem into several single-slot subproblems using
Lyapunov optimization techniques; (2) dividing each single-slot
subproblem into two parts according to the availability of model
information; (3) solving one part based on convex optimization
and solving another part using multi-agent attention-based deep
deterministic policy gradient. Simulation results based on real-
world traces show the effectiveness of the proposed algorithm.
Index Terms—Building energy systems, operational cost, car-
bon emission, uncertainty, hydrogen energy storage, deep rein-
forcement learning, Lyapunov optimization techniques
NOMENCLATURE
Indices
This work was supported in part by the Basic Research Project of Lead-
ing Technology of Jiangsu Province under Grant BK20202011, in part by
the National Natural Science Foundation of China under Grant 62192751,
Grant 61972214, Grant 62122062, Grant 62192750, and Grant 61425027,
in part by the 111 International Collaboration Program of China under
Grant BP2018006, in part by China Postdoctoral Science Foundation under
Grant 2020M673406, in part by Qinlan Project of Jiangsu Province (2022),
and in part by 1311 Talent Project of Nanjing University of Posts and
Telecommunications. (Corresponding authors are Liang Yu and Dong Yue).
L. Yu is with the Faculty of Electronic and Information Engineering, Xi’an
Jiaotong University, Xi’an 710049, China, and is also with the College of Au-
tomation & College of Artificial Intelligence, Nanjing University of Posts and
Telecommunications, Nanjing 210003, China. (email: liang.yu@njupt.edu.cn)
Z. Xu and X. Guan are with Systems Engineering Institute, Ministry of
Education Key Lab for Intelligent Networks and Network Security, Xi’an
Jiaotong University, Xi’an 710049, China.
Q. Zhao is with the Center for Intelligent and Networked Systems (CFINS),
Department of Automation and BNRist, Tsinghua University, 100084 China.
C. Dou is with the Institute of Advanced Technology, Nanjing University of
Posts and Telecommunications, Nanjing 210003, China.
D. Yue is with the Institute of Advanced Technology, Nanjing Univer-
sity of Posts and Telecommunications, Nanjing 210003, China. (email:
medongy@vip.163.com)
$t$  Time slot index.
$i$  Building index, agent index.
Parameters and Constants
$\eta_{pv}$  PV system generation efficiency.
$h_{pv}$  Total radiation area of solar panels (m$^2$).
$\varsigma_t$  Solar radiation intensity at slot $t$ (W/m$^2$).
$P_{gb}^{\max}$  Maximum heat power output of gas boiler (kW).
$\eta_{bc}, \eta_{bd}$  Charging efficiency, discharging efficiency.
$\Delta t$  Duration of a time slot (hour).
$B^{\min}$  Minimum BESS energy level (kWh).
$B^{\max}$  Maximum BESS energy level (kWh).
$P_{bc}^{\max}$  Maximum BESS charging power (kW).
$P_{bd}^{\max}$  Maximum BESS discharging power (kW).
$\eta_{tc}$  Injection efficiency of CWT.
$\eta_{td}$  Release efficiency of CWT.
$Q_{th}^{\max}$  CWT capacity (kWh).
$P_{td}^{\max}$  Maximum CWT released power (kW).
$P_{tc}^{\max}$  Maximum CWT injected power (kW).
$H^{\max}$  HESS storage capacity (Nm$^3$).
$\omega_{el}$  Conversion coefficient of electrolyzer (Nm$^3$/kWh).
$\omega_{fc}$  Conversion coefficient of fuel cell (kWh/Nm$^3$).
$P_{el}^{\max}$  Rated input power of electrolyzer (kW).
$P_{fc}^{\max}$  Rated output power of fuel cell (kW).
$\eta_{h2e}$  Heat-to-electricity ratio.
$\eta_{hr}$  Heat recovery efficiency.
$\beta_i^{\min}$  Lower limit of comfortable temperature range (°C).
$\beta_i^{\max}$  Upper limit of comfortable temperature range (°C).
$P_{sp,i}^{\max}$  Maximum thermal input power in building $i$ (kW).
$N$  The number of buildings.
$\eta_{h2c}$  AC transformation efficiency.
$\mu_c$  A weighted carbon emission parameter (RMB/kg).
$\psi_{BESS}$  Battery depreciation coefficient (RMB/kW).
$\eta_{gb}$  Gas-to-heat conversion efficiency.
$V$  Control parameter related to operational cost.
Variables
$P_{pv,t}$  Maximum PV generation output at slot $t$ (kW).
$P_{gb,t}$  Heat power output of gas boiler at slot $t$ (kW).
$B_t$  Stored energy level in the BESS at slot $t$ (kWh).
$P_{bc,t}$  BESS charging power at slot $t$ (kW).
$P_{bd,t}$  BESS discharging power at slot $t$ (kW).
$Q_{th,t}$  Stored thermal energy in CWT at slot $t$ (kWh).
$P_{tc,t}$  CWT charging power at slot $t$ (kW).
$P_{td,t}$  CWT discharging power at slot $t$ (kW).
$H_t$  Storage level of hydrogen tank at slot $t$ (Nm$^3$).
This article has been accepted for publication in IEEE Transactions on Smart Grid. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TSG.2022.3197657
© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Nanjing Univ of Post & Telecommunications. Downloaded on August 10,2022 at 22:50:32 UTC from IEEE Xplore. Restrictions apply.
$P_{el,t}$  Charging power of electrolyzer at slot $t$ (kW).
$P_{fc,t}$  Discharging power of fuel cell at slot $t$ (kW).
$Q_{fc,t}$  Thermal output power of fuel cell at slot $t$ (kWh).
$P_{sp,i,t}$  Thermal input power for building $i$ at slot $t$ (kW).
$\beta_{in,i,t}$  Building indoor temperature at slot $t$ (°C).
$\beta_{out,t}$  Outdoor temperature at slot $t$ (°C).
$\varrho_{i,t}$  Random thermal disturbance at slot $t$ (°C).
$P_{buy,t}$  Purchasing power of the HBMES at slot $t$ (kW).
$P_{sell,t}$  Selling power of the HBMES at slot $t$ (kW).
$P_{load,t}$  Power demand at slot $t$ (kW).
$P_g^{\max}$  Maximum transaction power (kW).
$v_t$  Buying electricity price (RMB/kWh).
$\tau_t$  Selling electricity price (RMB/kWh).
$C_{1,t}$  Energy cost of electricity buying or selling (RMB).
$C_{2,t}$  Carbon emission cost (RMB).
$C_{3,t}$  BESS depreciation cost at slot $t$ (RMB).
$C_{4,t}$  HESS related cost at slot $t$ (RMB).
$C_{5,t}$  CWT depreciation cost at slot $t$ (RMB).
$C_{6,t}$  Gas purchasing cost at slot $t$ (RMB).
$\delta_x^{on}$  Operation cost of component $x$ ($x \in \{el, fc\}$).
$\delta_x^{su}$  Startup cost of component $x$ ($x \in \{el, fc\}$).
$\delta_x^{sd}$  Shutdown cost of component $x$ ($x \in \{el, fc\}$).
$\mu_{e,t}$  Carbon emission rate at slot $t$ (kg/kWh).
$\lambda_{g,t}$  Gas price at slot $t$ (RMB/kWh).
$X_{B,t}$  BESS virtual queue length at slot $t$.
$X_{H,t}$  HESS virtual queue length at slot $t$.
$F_i(\cdot)$  Thermal dynamics model of building $i$.
$o_{i,t}$  Local observation of agent $i$ at slot $t$.
$a_{i,t}$  Action of agent $i$ at slot $t$.
$r_{th,i,t}$  Reward of agent $i$ at slot $t$.
$\Lambda_t$  One-slot conditional Lyapunov drift at slot $t$.
I. INTRODUCTION
Buildings account for a large portion of total energy con-
sumption and total carbon emission in the world. For example,
global buildings consumed about 30% of the total energy and
generated about 28% of the total carbon emission in 2019
[1]. Since the global energy supply mainly depends on fossil
fuels, energy and environmental issues are incurred [2]. Due
to many advantages (e.g., pollution-free operation, extensive sources,
convenient storage and transportation), hydrogen energy has
attracted widespread attention and is recognized as a promising
alternative to fossil fuels [2]–[4]. Moreover, the coordination
of hydrogen energy storage system (HESS) and other energy
storage systems (ESSs) (e.g., thermal energy storage and elec-
tric energy storage) contributes to the improvement of building
energy efficiency [2]. Therefore, it is of great importance to
optimize the operation of a hydrogen-based building multi-
energy system (HBMES) [5], [6].
In the literature, many approaches have been used for the
planning or operation of multi-energy systems, e.g., mixed-
integer linear programming (MILP) [7], nonconvex quadrati-
cally constrained programming [8], stochastic programming
[9] [10], robust optimization [11] [12] [13], Benders de-
composition [14], model-predictive control (MPC) [15] [16],
and deep reinforcement learning (DRL) [17]. Although some
efforts have been made, the above-mentioned studies did
not consider the utilization of hydrogen energy storage. To
promote the development of hydrogen energy storage, some
works have investigated the optimal planning or operation
problem of hydrogen-based multi-energy systems [2] [3] [5],
[6], [18], [19] and adopted many optimization approaches,
e.g., MILP [2], two-stage stochastic programming [5], mixed
integer programming [19], two-stage robust optimization [3],
distributed optimization [6], and DRL [20]. In existing works
on the optimal operation of hydrogen-based multi-energy
systems, building thermal dynamics and thermal comfort of
occupants are neglected, which means that the flexibility of
building thermal loads can not be utilized for reducing system
operational costs.
Based on the above observation, we investigate an op-
timal operation problem related to an HBMES with the
consideration of building thermal dynamics. To be specific,
we intend to minimize the long-term operational cost of
an HBMES by intelligently scheduling thermal loads and
various ESSs, including hydrogen, thermal, and electric ESSs.
However, several challenges are involved in achieving the
above aim. Firstly, there are many uncertain parameters, e.g.,
renewable generation output, electric load, electricity price,
outdoor temperature, and carbon emission rate. Secondly,
there are spatially coupled operational constraints related to
power balance and heat balance. Thirdly, there are temporally
coupled operational constraints related to several ESSs and
indoor temperature. Fourthly, there are nonlinear constraints
related to power transactions and ESS operations. Finally, it is
difficult to obtain explicit building thermal dynamics models
that are accurate and efficient enough for building control. To
overcome these challenges, we propose a solving algorithm
based on Lyapunov optimization techniques (LOT) [21] and
multi-agent deep reinforcement learning (MADRL) [22]. The
key idea of the proposed algorithm is to transform the long-
term operational cost minimization problem using LOT into
several single-slot subproblems and solve these subproblems
using convex optimization and multi-agent attention-based
deep deterministic policy gradient (MAADDPG), which can
utilize the advantages of model-based optimization and data-
driven based learning.
The main contributions of this paper are summarized as
follows.
• Taking hydrogen/thermal/electric energy storage, inexplicit
building thermal dynamics models, and thermal comfort
into consideration, we formulate an expected operational
cost minimization problem under uncertainties, where the
operational cost consists of electricity cost, carbon
emission penalty, natural gas purchasing cost, and ESS
operation costs.
• We propose an online operation algorithm with polynomial-time
computational complexity to solve the formulated
problem based on LOT and MAADDPG. Moreover,
we analyze the algorithmic feasibility and closed-form
expressions of hyper-parameters for controlling ESSs.
Note that the proposed algorithm does not require any
prior knowledge of uncertain parameters or explicit
building thermal dynamics models.
• Simulation results based on real-world traces show that
the proposed algorithm can reduce average operation cost
by 26.35%-37.11% while maintaining comfortable indoor
temperature ranges compared with a rule-based scheme
and several DRL-based schemes.
The rest of this paper is organized as follows. In Section II,
we introduce related works. In Section III, we describe the
system model and formulate an expected operational cost
minimization problem. In Section IV, we propose an operation
algorithm to solve the formulated problem. In Section V,
algorithm feasibility and key control parameters are analyzed.
In Section VI, performance evaluations are conducted. Finally,
we draw a conclusion and point out future work in Section VII.
II. RELATED WORKS
There have been many studies on the planning or opera-
tion of hydrogen-based multi-energy systems. For example,
Liu et al. investigated the optimal planning problem of a
hydrogen-based multi-energy system, which was formulated
by MILP and solved by Gurobi optimizer [2]. Similarly, Pan
et al. studied the optimal planning problem for an electricity-hydrogen
integrated energy system considering seasonal
storage based on two-stage robust optimization [3]. Different
from the planning problem, the operation problem mainly
focuses on optimal scheduling of distributed resources for
system operational cost reduction under the given resource
configurations, e.g., generation capacity and storage capacity.
In [5], Langeroudi et al. studied the optimal operation of
power, heat, and hydrogen-based microgrid with the consider-
ation of a plug-in electric vehicle and proposed an operation
algorithm based on two-stage stochastic programming. In [6],
Langeroudi et al. proposed a distributed optimization method
for integrated electricity and hydrogen energy sharing so
that the total social welfare caused by energy dispatching
could be maximized. Since the above-mentioned operation
methods need to know the prior information of uncertain
parameters (e.g., predicted values, probability distribution, and
maximum/minimum values), DRL-based operation methods
have been proposed, which can operate without requiring any
prior knowledge of uncertain parameters. In [23], François-Lavet
et al. proposed a deep Q-network (DQN)-based algorithm
to schedule electric and hydrogen ESSs for
minimizing overall levelized energy cost without knowing
future information about electricity consumption and solar
generation. Since sustainability is also an important metric for
energy systems, Desportes et al. studied the carbon impact
minimization problem in an electric/hydrogen hybrid energy
storage system based on deep deterministic policy gradient
(DDPG) algorithm [20]. Although some advances have been
made in the above-mentioned studies, they did not consider
building thermal dynamics, which means that the flexibility of
thermal loads can not be utilized for operational cost reduction.
To overcome the limitations in existing works, we inves-
tigate an expected operational cost minimization problem of
a hydrogen-based multi-energy system with the consideration
of building thermal dynamics under uncertainties. Due to the
existence of many challenges caused by uncertain parame-
ters, spatially and temporally coupled constraints, nonlinear
constraints, and inexplicit building thermal dynamics models,
we propose a solving algorithm based on LOT and MADRL.
Recently, some works have used LOT and DRL in the field of
edge computing. For example, Bi et al. proposed Lyapunov-
guided DRL for stable online computation offloading, where
DRL is used for solving the mixed integer non-linear program-
ming (MINLP) subproblem [24]. In [25], Dai et al. proposed
a method for stochastic computation offloading in digital twin
networks based on LOT and asynchronous actor-critic (AAC),
where the AAC algorithm is adopted for solving single-
slot subproblems. In [26], Zhuang et al. proposed a method
to solve the network routing problem in multi-access edge
computing based on LOT and DQN. The differences between
these methods and our algorithm are summarized as follows.
Firstly, DRL methods in their studies were adopted for solving
deterministic single-slot subproblems obtained by LOT. In
contrast, the single-slot subproblems in this paper have inex-
plicit constraints related to building thermal dynamics models.
Secondly, the feasible hyper-parameters used for controlling
virtual queues and costs were not derived in existing studies,
while we analyze closed-form expressions of hyper-parameters
in LOT. Thirdly, we design a solving approach for each single-slot
subproblem in this paper by exploiting its special structure
and decompose the subproblem into two parts, which can be
solved by linear programming and MAADDPG, respectively.
Consequently, the proposed algorithm has better performance
than existing learning-based methods (e.g., DQN, MADDPG
[27]).
III. SYSTEM MODEL AND PROBLEM FORMULATION
We consider an HBMES in Fig. 1, where the main grid,
photovoltaic (PV) generation, battery energy storage system
(BESS), electrical load, electrolyzer, hydrogen tank, fuel cell,
gas boiler, cold water tank (CWT), and thermal loads can
be identified. Among these components, there are four kinds
of energy flows, i.e., electricity flow, hydrogen flow, heat
flow, and cooling flow. In electricity flow, electrical load (e.g.,
electric vehicles, electric water heaters, and computers) can be
served by the main grid, PV generators, BESS, and fuel cell.
Moreover, it can be seen that hydrogen flow appears in HESS,
which consists of an electrolyzer, a hydrogen tank, and a fuel
cell. To be specific, the hydrogen generated by the electrolyzer
can be stored in the hydrogen tank, which will discharge
hydrogen to drive the fuel cell for generating electricity and
heat simultaneously. The heat generated by the fuel cell and
gas boiler can be transformed into cold water by an absorption
chiller (AC). Next, cold water can be stored in CWT and used
for cooling buildings. In the following parts, we first introduce
the models related to PV generation, gas boiler, energy storage,
thermal load, power/energy balance, and operational cost.
Then, we formulate an expected operational cost minimization
problem related to the HBMES.
A. PV Generation Model
Let $P_{pv,t}$ be the maximum generation output of the PV system
at slot $t$. Then, its value can be estimated by [28]
$$P_{pv,t} = \eta_{pv} h_{pv} \varsigma_t, \tag{1}$$
Fig. 1. Illustration of an HBMES
where $\eta_{pv}$ denotes the PV system generation efficiency, $h_{pv}$ is
the total radiation area of solar panels, and $\varsigma_t$ is the solar radiation
intensity at slot $t$.
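As a small numerical illustration of Eq. (1), the sketch below evaluates the PV output for one slot. The function name, the argument names, and the numbers are hypothetical. Since the irradiance is given in W/m$^2$ while $P_{pv,t}$ is listed in kW, the sketch assumes that a division by 1000 is applied when converting units.

```python
def pv_output_kw(eta_pv: float, h_pv_m2: float, irradiance_w_m2: float) -> float:
    """Maximum PV output per Eq. (1), converted from W to kW."""
    if not 0.0 <= eta_pv <= 1.0:
        raise ValueError("eta_pv must lie in [0, 1]")
    if h_pv_m2 < 0.0 or irradiance_w_m2 < 0.0:
        raise ValueError("area and irradiance must be non-negative")
    power_w = eta_pv * h_pv_m2 * irradiance_w_m2  # Eq. (1), result in W
    return power_w / 1000.0                       # W -> kW (assumed convention)


# Hypothetical numbers: 20% efficiency, 100 m^2 of panels, 800 W/m^2.
p_pv_kw = pv_output_kw(eta_pv=0.2, h_pv_m2=100.0, irradiance_w_m2=800.0)
print(f"P_pv = {p_pv_kw:.1f} kW")
```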
B. Gas Boiler Model
Let $P_{gb,t}$ denote the heat power output of the gas boiler at
slot $t$. Then, we have
$$0 \le P_{gb,t} \le P_{gb}^{\max}, \tag{2}$$
where $P_{gb}^{\max}$ is the maximum heat power output of the gas boiler.
C. Energy Storage Model
1) Battery Energy Storage Model: Let $B_t$ be the stored
energy level in the BESS at slot $t$. Then, the dynamics of
the energy level in the BESS can be described by [29]
$$B_{t+1} = B_t + \Big(\eta_{bc} P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\Big)\Delta t, \tag{3}$$
where $\eta_{bc}$ and $\eta_{bd}$ are the charging and discharging efficiency
coefficients, respectively; $P_{bc,t}$ and $P_{bd,t}$ are the charging power
and discharging power of the BESS, respectively; $\Delta t$ denotes the
duration of a time slot.
To ensure that the energy level of the BESS fluctuates within
a normal range at any time, we have
$$B^{\min} \le B_t \le B^{\max}, \tag{4}$$
where $B^{\min}$ and $B^{\max}$ are the minimum and maximum energy
levels of the BESS, respectively.
Let $P_{bc}^{\max}$ and $P_{bd}^{\max}$ be the maximum charging power and
maximum discharging power, respectively. Then, we have
$$0 \le P_{bc,t} \le P_{bc}^{\max}, \tag{5}$$
$$0 \le P_{bd,t} \le P_{bd}^{\max}. \tag{6}$$
Taking the round-trip inefficiency into consideration, simul-
taneous charging and discharging are not allowed (note that
the nonlinear constraint (7) can be removed for the purpose of
simplifying BESS dispatch, and related methods can be found
in [30]). Then, we have
$$P_{bc,t} \cdot P_{bd,t} = 0. \tag{7}$$
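The BESS model in (3)-(7) can be sketched as a one-slot update that rejects infeasible decisions. This is only an illustrative sketch: the class name, function name, and all parameter values are hypothetical and are not taken from the paper's simulation setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BessParams:
    eta_bc: float    # charging efficiency
    eta_bd: float    # discharging efficiency
    b_min: float     # minimum energy level (kWh)
    b_max: float     # maximum energy level (kWh)
    p_bc_max: float  # maximum charging power (kW)
    p_bd_max: float  # maximum discharging power (kW)
    dt: float        # slot duration (hour)


def bess_step(b_t: float, p_bc: float, p_bd: float, prm: BessParams) -> float:
    """Return B_{t+1} from Eq. (3); raise ValueError if (4)-(7) would be violated."""
    if not 0.0 <= p_bc <= prm.p_bc_max:
        raise ValueError("charging power violates (5)")
    if not 0.0 <= p_bd <= prm.p_bd_max:
        raise ValueError("discharging power violates (6)")
    if p_bc > 0.0 and p_bd > 0.0:
        raise ValueError("simultaneous charging and discharging violates (7)")
    b_next = b_t + (prm.eta_bc * p_bc - p_bd / prm.eta_bd) * prm.dt
    if not prm.b_min <= b_next <= prm.b_max:
        raise ValueError("energy level would violate (4)")
    return b_next


# Hypothetical parameter values for illustration only.
prm = BessParams(eta_bc=0.95, eta_bd=0.95, b_min=10.0, b_max=100.0,
                 p_bc_max=20.0, p_bd_max=20.0, dt=1.0)
b1 = bess_step(50.0, p_bc=10.0, p_bd=0.0, prm=prm)  # charge for one slot
b2 = bess_step(b1, p_bc=0.0, p_bd=9.5, prm=prm)     # discharge for one slot
```

Because of the two efficiencies, charging 10 kW for one hour stores only 9.5 kWh, while delivering 9.5 kW for one hour drains 10 kWh. This round-trip loss is the reason constraint (7) forbids simultaneous charging and discharging.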
2) Thermal Energy Storage Model: Let $Q_{th,t}$ be the stored
thermal energy in the CWT at slot $t$. Then, its dynamics can be
described by
$$Q_{th,t+1} = Q_{th,t} + \Big(P_{tc,t}\eta_{tc} - \frac{P_{td,t}}{\eta_{td}}\Big)\Delta t, \tag{8}$$
where $\eta_{tc}$ and $\eta_{td}$ are the injection efficiency and release efficiency
of the CWT, respectively; $P_{tc,t}$ and $P_{td,t}$ are the injected power and
released power at slot $t$, respectively.
To ensure the normal operation of the CWT, the following
operational constraints should be satisfied, i.e.,
$$0 \le Q_{th,t} \le Q_{th}^{\max}, \tag{9}$$
$$0 \le P_{td,t} \le P_{td}^{\max}, \tag{10}$$
$$0 \le P_{tc,t} \le P_{tc}^{\max}, \tag{11}$$
$$P_{td,t} \cdot P_{tc,t} = 0, \tag{12}$$
where $Q_{th}^{\max}$ denotes the capacity of the CWT; $P_{td}^{\max}$ and
$P_{tc}^{\max}$ are the maximum released power and injected power,
respectively; (9) denotes that the stored thermal energy level
should fluctuate within a feasible range; (10) and (11) denote
the effective ranges of released power and injected power,
respectively; and (12) means that releasing and injecting cold water
cannot happen simultaneously, so that meaningless thermal
loss can be avoided.
3) Hydrogen Energy Storage Model: Let $H_t$ be the storage
level of hydrogen in the tank at slot $t$ (in Nm$^3$). Then, the
dynamics of the hydrogen storage level can be described by [31]
$$H_{t+1} = H_t + \Big(\omega_{el} P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\Big)\Delta t, \tag{13}$$
where $P_{el,t}$ and $P_{fc,t}$ are the charging power of the electrolyzer
and the discharging power of the fuel cell at slot $t$, respectively; $\omega_{el}$
(in Nm$^3$/kWh) and $\omega_{fc}$ (in kWh/Nm$^3$) denote the conversion
coefficients of the electrolyzer and fuel cell, respectively.
Since the maximum storage level of the hydrogen tank is
limited by its tolerable tank pressure [2], we have
$$0 \le H_t \le H^{\max}, \tag{14}$$
where $H^{\max}$ is the storage capacity of the hydrogen tank.
To maintain the efficiency of the HESS, we assume that the electrolyzer
and fuel cell cannot operate simultaneously. Then,
we have
$$P_{el,t} \cdot P_{fc,t} = 0. \tag{15}$$
In addition, the power consumption of the electrolyzer and
the electric power output of the fuel cell should satisfy the following
physical constraints, which can be given by
$$0 \le P_{el,t} \le P_{el}^{\max}, \tag{16}$$
$$0 \le P_{fc,t} \le P_{fc}^{\max}, \tag{17}$$
where $P_{el}^{\max}$ and $P_{fc}^{\max}$ are the rated powers of the electrolyzer
and fuel cell, respectively.
Since the fuel cell generates electricity and heat simulta-
neously, the electrical output power of the fuel cell Pfc,t is
coupled with the corresponding thermal output power $Q_{fc,t}$.
Then, we have [2]
$$Q_{fc,t} = \eta_{hr}\eta_{h2e} P_{fc,t}\Delta t, \tag{18}$$
where $\eta_{h2e}$ and $\eta_{hr}$ are the heat-to-electricity ratio and the heat
recovery efficiency, respectively.
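The hydrogen tank dynamics (13)-(15) and the coupled heat output (18) can be sketched as follows. The function names and every numerical value are hypothetical and chosen only to keep the arithmetic easy to follow; the rated-power limits (16) and (17) are omitted for brevity.

```python
def hess_step(h_t: float, p_el: float, p_fc: float,
              omega_el: float, omega_fc: float, h_max: float, dt: float) -> float:
    """Return H_{t+1} from Eq. (13); raise ValueError if (14) or (15) is violated."""
    if p_el < 0.0 or p_fc < 0.0:
        raise ValueError("powers must be non-negative")
    if p_el > 0.0 and p_fc > 0.0:
        raise ValueError("electrolyzer and fuel cell cannot run together (15)")
    h_next = h_t + (omega_el * p_el - p_fc / omega_fc) * dt
    if not 0.0 <= h_next <= h_max:
        raise ValueError("hydrogen level would violate (14)")
    return h_next


def fuel_cell_heat(p_fc: float, eta_hr: float, eta_h2e: float, dt: float) -> float:
    """Recovered heat Q_{fc,t} from Eq. (18), in kWh."""
    return eta_hr * eta_h2e * p_fc * dt


# Hypothetical values: 0.2 Nm^3 per kWh in, 1.5 kWh per Nm^3 out, 1-hour slots.
OMEGA_EL, OMEGA_FC, H_MAX, DT = 0.2, 1.5, 100.0, 1.0
h1 = hess_step(10.0, p_el=50.0, p_fc=0.0,
               omega_el=OMEGA_EL, omega_fc=OMEGA_FC, h_max=H_MAX, dt=DT)
h2 = hess_step(h1, p_el=0.0, p_fc=15.0,
               omega_el=OMEGA_EL, omega_fc=OMEGA_FC, h_max=H_MAX, dt=DT)
q_fc = fuel_cell_heat(15.0, eta_hr=0.8, eta_h2e=1.2, dt=DT)
```

With these numbers, 50 kWh of electricity yields 10 Nm$^3$ of hydrogen, which returns 15 kWh of electricity plus recovered heat. The heat term is what links the HESS to the cooling side of the system.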
D. Thermal Load Model
Let $P_{sp,i,t}$ be the thermal input power for cooling demand
in building $i$ at slot $t$, which will affect the building indoor
temperature $\beta_{in,i,t}$. To provide a comfortable temperature range
for occupants in building $i$, the following constraints should
be satisfied [29], i.e.,
$$\beta_i^{\min} \le \beta_{in,i,t} \le \beta_i^{\max}, \tag{19}$$
$$\beta_{in,i,t+1} = F_i(P_{sp,i,t}, \beta_{out,t}, \beta_{in,i,t}, \varrho_{i,t}), \tag{20}$$
$$0 \le P_{sp,i,t} \le P_{sp,i}^{\max}, \tag{21}$$
where $\beta_i^{\min}$ and $\beta_i^{\max}$ are the lower limit and upper limit of
the comfortable temperature range in building $i$, respectively;
$\beta_{out,t}$ and $\varrho_{i,t}$ are the outdoor temperature and random thermal
disturbance at slot $t$, respectively; $F_i(\cdot)$ denotes a thermal
dynamics model of building $i$, and $P_{sp,i}^{\max}$ denotes the maximum
thermal input power in building $i$.
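Because $F_i(\cdot)$ is treated as inexplicit, a controller can only query it as a black box. The sketch below illustrates that interface for (19)-(21). The linear `toy_thermal_model` is a purely hypothetical stand-in so that the snippet runs; it is not the building model used in this paper, and its coefficients are invented.

```python
from typing import Callable, Tuple

# F_i maps (P_sp, beta_out, beta_in, disturbance) to the next indoor temperature.
ThermalModel = Callable[[float, float, float, float], float]


def toy_thermal_model(p_sp: float, beta_out: float,
                      beta_in: float, disturbance: float) -> float:
    """Hypothetical linear stand-in for F_i; NOT the paper's model."""
    # Indoor temperature drifts toward outdoor temperature; cooling input lowers it.
    return beta_in + 0.1 * (beta_out - beta_in) - 0.05 * p_sp + disturbance


def thermal_step(model: ThermalModel, p_sp: float, beta_out: float, beta_in: float,
                 disturbance: float, p_sp_max: float,
                 beta_min: float, beta_max: float) -> Tuple[float, bool]:
    """Apply (20) through a black-box model and report whether (19) holds next slot."""
    if not 0.0 <= p_sp <= p_sp_max:
        raise ValueError("thermal input power violates (21)")
    beta_next = model(p_sp, beta_out, beta_in, disturbance)
    return beta_next, beta_min <= beta_next <= beta_max


# Hypothetical scenario: 35 C outside, 25 C inside, comfort band [22, 25.5] C.
cooled, ok_cooled = thermal_step(toy_thermal_model, 20.0, 35.0, 25.0, 0.0,
                                 50.0, 22.0, 25.5)
idle, ok_idle = thermal_step(toy_thermal_model, 0.0, 35.0, 25.0, 0.0,
                             50.0, 22.0, 25.5)
```

In this toy scenario the cooled building stays inside the comfort band while the idle one drifts above it, which is the kind of feedback a learning agent would observe without ever seeing the model's equations.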
E. Power/Energy Balance Model
To maintain the electric power balance at each slot $t$, we
have
$$P_{buy,t} + P_{pv,t} + P_{fc,t} + P_{bd,t} = P_{sell,t} + P_{el,t} + P_{load,t} + P_{bc,t}, \tag{22}$$
where $P_{buy,t}$ and $P_{sell,t}$ represent the purchasing power and
selling power of the HBMES at slot $t$, respectively; $P_{load,t}$
denotes the power demand at slot $t$. Moreover, we assume
that simultaneous purchasing and selling of electricity is not
permitted, i.e.,
$$P_{buy,t} \cdot P_{sell,t} = 0. \tag{23}$$
Since electricity transactions between the HBMES and the main
grid are limited by transmission line capacities, we have
$$0 \le P_{buy,t} \le P_g^{\max}, \tag{24}$$
$$0 \le P_{sell,t} \le P_g^{\max}, \tag{25}$$
where $P_g^{\max}$ denotes the maximum transaction power.
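Given the local decisions in one slot, constraints (22)-(25) determine the grid exchange uniquely. The following minimal sketch shows this; the function name and the numbers are hypothetical.

```python
from typing import Tuple


def grid_exchange(*, p_pv: float, p_fc: float, p_bd: float, p_el: float,
                  p_load: float, p_bc: float, p_g_max: float) -> Tuple[float, float]:
    """Return (P_buy, P_sell) satisfying (22)-(25) for the given local decisions."""
    # Net demand seen by the grid: consumption minus local supply, from Eq. (22).
    net = (p_el + p_load + p_bc) - (p_pv + p_fc + p_bd)
    p_buy = max(net, 0.0)    # buy only when demand exceeds local supply
    p_sell = max(-net, 0.0)  # sell only the surplus, so (23) holds by construction
    if p_buy > p_g_max or p_sell > p_g_max:
        raise ValueError("grid exchange exceeds the transaction limit (24)/(25)")
    return p_buy, p_sell


# Hypothetical slot with a deficit: 65 kW of demand against 30 kW of PV.
deficit = grid_exchange(p_pv=30.0, p_fc=0.0, p_bd=0.0,
                        p_el=10.0, p_load=50.0, p_bc=5.0, p_g_max=100.0)
# Hypothetical slot with a surplus: 80 kW of PV against 50 kW of load.
surplus = grid_exchange(p_pv=80.0, p_fc=0.0, p_bd=0.0,
                        p_el=0.0, p_load=50.0, p_bc=0.0, p_g_max=100.0)
```

Taking the positive and negative parts of the net demand is one simple way to satisfy the complementarity constraint (23) without introducing a binary variable.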
Similarly, thermal energy balance at each slot $t$ can be
depicted by the following constraint, i.e.,
$$Q_{fc,t}\eta_{h2c} \ge \Big(P_{tc,t} + \sum_{i=1}^{N} P_{sp,i,t} - P_{td,t} - P_{gb,t}\eta_{h2c}\Big)\Delta t, \tag{26}$$
where $N$ denotes the number of buildings and $\eta_{h2c}$ denotes
the AC transformation efficiency from heating to cooling.
F. Operational Cost Model
The operational cost of the HBMES consists of six parts,
i.e., the energy cost of electricity buying or selling C1,t, carbon
emission cost C2,t, BESS depreciation cost C3,t , HESS related
cost C4,t, CWT depreciation cost C5,t , and gas purchasing cost
C6,t.
Let $v_t$ and $\tau_t$ be the buying and selling prices of electricity,
respectively. Then, $C_{1,t}$ is expressed by
$$C_{1,t} = (v_t P_{buy,t} - \tau_t P_{sell,t})\Delta t. \tag{27}$$
Let $\mu_{e,t}$ (in kg/kWh) be the carbon emission rate of the
main grid at slot $t$. Then, the carbon emission generated by
the HBMES at slot $t$ can be given by $\mu_{e,t} P_{g,t} \Delta t$. Accordingly, the
carbon emission cost is calculated by [19]
$$C_{2,t} = \mu_c \mu_{e,t} P_{g,t} \Delta t, \tag{28}$$
where $\mu_c$ is a weighted parameter in RMB/kg, which denotes
the importance of carbon emission with respect to energy cost.
Since overly frequent charging or discharging shortens the
lifetime of the BESS, a BESS depreciation cost is adopted [29], i.e.,
$$C_{3,t} = \psi_{BESS}(P_{bc,t} + P_{bd,t}), \tag{29}$$
where $\psi_{BESS}$ is the battery depreciation coefficient in RMB/kW.
According to [32], the startup and shutdown cycles have
degradation effects on electrolyzer and fuel cell. Thus, startup
and shutdown costs are considered in this paper. Let $\delta_x^{on}$, $\delta_x^{su}$,
and $\delta_x^{sd}$ be the operation cost, startup cost, and shutdown cost
of component $x$ ($x \in \{el, fc\}$) in the HESS, respectively, where
“el” and “fc” denote electrolyzer and fuel cell, respectively.
Then, $C_{4,t}$ can be calculated by [32]
$$C_{4,t} = \sum_{x \in \{el,fc\}} \big(\delta_x^{on} I_{x,t}^{on} + \delta_x^{su} I_{x,t}^{su} + \delta_x^{sd} I_{x,t}^{sd}\big), \tag{30}$$
where $I_{x,t}^{on}$, $I_{x,t}^{su}$, and $I_{x,t}^{sd}$ are logical indicator variables
related to the ON/OFF state, startup state, and shutdown state of
component $x$, respectively; $I_{x,t}^{su} = \max\{I_{x,t}^{on} - I_{x,t-1}^{on}, 0\}$ and
$I_{x,t}^{sd} = \max\{I_{x,t-1}^{on} - I_{x,t}^{on}, 0\}$.
Similar to the BESS, the CWT depreciation cost can be captured
by [29]
$$C_{5,t} = \psi_{CWT}(P_{tc,t} + P_{td,t}), \tag{31}$$
where $\psi_{CWT}$ is the CWT depreciation coefficient in RMB/kW.
Let $\eta_{gb}$ and $\lambda_{g,t}$ be the gas-to-heat conversion efficiency and gas
price (in RMB/kWh), respectively. Then, the gas purchasing
cost at slot $t$ can be given by [17]
$$C_{6,t} = \lambda_{g,t}\frac{P_{gb,t}\Delta t}{\eta_{gb}}. \tag{32}$$
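The six cost terms (27)-(32) can be evaluated together for a single slot, as sketched below. All names and numbers are hypothetical. Two points are assumptions rather than statements from the text: the power $P_{g,t}$ in (28) is taken to be the purchased power $P_{buy,t}$, and the ON/OFF indicators are passed in directly instead of being inferred from the power levels.

```python
from typing import Dict, List


def slot_costs(*, p_buy: float, p_sell: float, v_t: float, tau_t: float,
               mu_c: float, mu_e: float, p_bc: float, p_bd: float, psi_bess: float,
               on_now: Dict[str, int], on_prev: Dict[str, int],
               delta: Dict[str, Dict[str, float]],
               p_tc: float, p_td: float, psi_cwt: float,
               p_gb: float, lam_g: float, eta_gb: float, dt: float) -> List[float]:
    """Return [C1, ..., C6] for one slot, following Eqs. (27)-(32)."""
    c1 = (v_t * p_buy - tau_t * p_sell) * dt   # Eq. (27)
    c2 = mu_c * mu_e * p_buy * dt              # Eq. (28), assuming P_g = P_buy
    c3 = psi_bess * (p_bc + p_bd)              # Eq. (29)
    c4 = 0.0                                   # Eq. (30)
    for x in ("el", "fc"):
        i_su = max(on_now[x] - on_prev[x], 0)  # startup indicator
        i_sd = max(on_prev[x] - on_now[x], 0)  # shutdown indicator
        c4 += (delta[x]["on"] * on_now[x]
               + delta[x]["su"] * i_su
               + delta[x]["sd"] * i_sd)
    c5 = psi_cwt * (p_tc + p_td)               # Eq. (31)
    c6 = lam_g * p_gb * dt / eta_gb            # Eq. (32)
    return [c1, c2, c3, c4, c5, c6]


# Hypothetical slot: electrolyzer starts up, fuel cell shuts down.
costs = slot_costs(
    p_buy=35.0, p_sell=0.0, v_t=0.8, tau_t=0.3, mu_c=0.1, mu_e=0.6,
    p_bc=5.0, p_bd=0.0, psi_bess=0.01,
    on_now={"el": 1, "fc": 0}, on_prev={"el": 0, "fc": 1},
    delta={"el": {"on": 0.5, "su": 1.0, "sd": 0.2},
           "fc": {"on": 0.4, "su": 0.8, "sd": 0.3}},
    p_tc=0.0, p_td=10.0, psi_cwt=0.005, p_gb=20.0, lam_g=0.3, eta_gb=0.9, dt=1.0,
)
total_cost = sum(costs)
```

The sum `total_cost` corresponds to the per-slot quantity $\sum_{j=1}^{6} C_{j,t}$ whose long-term average is minimized.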
G. Expected Operational Cost Minimization Problem
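Based on the above models, we can formulate an expected
operational cost minimization problem of an HBMES as
follows,
$$\text{(P1)}\quad \min\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\{\sum_{j=1}^{6} C_{j,t}\Big\} \tag{33a}$$
$$\text{s.t.}\quad (1)\text{--}(26), \tag{33b}$$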
where the expectation operator $\mathbb{E}$ is taken over the randomness
of system parameters (i.e., PV generation output $P_{pv,t}$, power
demand $P_{load,t}$, carbon emission rate $\mu_{e,t}$, thermal load $Q_{load,t}$,
buying/selling price $v_t$/$\tau_t$), and possible stochastic control
decisions (i.e., $P_{gb,t}$, $P_{bc,t}$, $P_{bd,t}$, $P_{tc,t}$, $P_{td,t}$, $P_{el,t}$, $P_{fc,t}$,
$P_{sp,i,t}|_{1 \le i \le N}$, $P_{buy,t}$, and $P_{sell,t}$).
IV. THE PROPOSED OPERATION ALGORITHM
Solving P1 is a nontrivial task due to the following reasons.
Firstly, there are many uncertain parameters and it is often
difficult to know the statistical distributions of all combinations
in practice. Secondly, there are several temporally coupled
operational constraints (e.g., (3), (8), (13), and (20)). Thirdly,
there are some spatially coupled operational constraints (e.g.,
(22), (26)). Finally, it is challenging to obtain an explicit
building thermal dynamics model $F_i(\cdot)$ that is accurate and
efficient enough for building control [33].
To address the first challenge, methods such as stochastic programming, robust optimization, and model predictive control can be adopted. However, these methods either require prior knowledge of the uncertain parameters (e.g., probability distributions, or maximum and minimum values) or need to predict/approximate the random parameters. To deal with the second challenge, typical methods are based on dynamic programming, which suffers from the "curse of dimensionality". When LOT is adopted, temporally coupled operational constraints can be decoupled and an online algorithm can be designed without any prior knowledge of the uncertain parameters. However, because the building thermal dynamics models are inexplicit, such an online algorithm cannot be realized directly. To overcome the above-mentioned challenges, model-free DRL methods can be adopted [34], [35], which enable agents to learn optimal policies by interacting with building environments. Once learned, these policies can operate without any prior information about the uncertain parameters or explicit building thermal dynamics models.
Although model-free DRL methods have some advantages, their stability and performance may degrade as the action space and the number of heterogeneous agents grow. Instead of solving P1 directly using DRL methods (e.g., DQN, DDPG, MADDPG), we reduce the size of the action space and the number of heterogeneous agents by exploiting the fact that the inexplicit building thermal dynamics model $F_i(\cdot)$ only appears in the thermal energy flow. We therefore propose an operation algorithm that solves P1 by combining model-based optimization with data-driven learning.
To be specific, the key idea of the proposed online algorithm is illustrated in Fig. 2, where three steps can be identified: transformation, decomposition, and solving. First, the original problem P1 is relaxed to P2 with time-average constraints. Next, P2 is equivalently transformed into a queue stability problem P3. Then, we design an operation algorithm for P3 based on LOT, which requires solving an online optimization problem P4 in each slot. Because the building thermal dynamics models are inexplicit, P4 cannot be solved directly using model-based optimization methods. To solve P4 efficiently, we decompose it into two subproblems, P5 and P6. Since solving P5 presupposes that reasonable values of the hyper-parameters $V$, $W_B$, and $W_H$ are known, we transform it into P7, which has a convex objective function. P7 is then decomposed into eight linear programming subproblems by considering different combinations of buying/selling electricity, BESS charging/discharging, and HESS charging/discharging. To solve P6 efficiently, we reformulate it as a Markov game and propose a MAADDPG-based algorithm to solve the game. Based on the above key idea, we design an online operation algorithm that has polynomial-time computational complexity and requires neither prior information about the uncertain parameters nor an explicit building thermal dynamics model. In the following parts, we introduce the three steps in detail.
Fig. 2. The key idea of the proposed algorithm.
A. Step-1: Transformation
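Before conducting the problem transformation, we make the following assumptions, which ensure that the electric-hydrogen subsystem is controllable under the framework of LOT:
$$v^{\max} > \tau^{\max}, \tag{34}$$
$$v^{\min} > \tau^{\min}, \tag{35}$$
$$\eta_{bc}\eta_{bd}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\psi_{BESS}}{\Delta t}\Big) > \nu_1, \tag{36}$$
$$B^{\max} - B^{\min} - \eta_{bc}P_{bc}^{\max}\Delta t - \frac{P_{bd}^{\max}\Delta t}{\eta_{bd}} > 0, \tag{37}$$
$$\omega_{el}\omega_{fc}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}\Delta t}\Big) > \nu_2, \tag{38}$$
$$H^{\max} - H^{\min} - \omega_{el}P_{el}^{\max}\Delta t - \frac{P_{fc}^{\max}\Delta t}{\omega_{fc}} > 0, \tag{39}$$
where $v^{\max} = \max_t v_t$, $\tau^{\max} = \max_t \tau_t$, $v^{\min} = \min_t v_t$, $\tau^{\min} = \min_t \tau_t$, $\mu_e^{\min} = \min_t \mu_{e,t}$, $\mu_e^{\max} = \max_t \mu_{e,t}$, $\nu_1 = \tau^{\min} + \mu_c\mu_e^{\min} + \frac{\psi_{BESS}}{\Delta t}$, and $\nu_2 = \tau^{\min} + \mu_c\mu_e^{\min} + \frac{\delta_{el}^{on}}{P_{el}^{\max}\Delta t}$.
Note that assumptions (34) and (35) are mild since the buying price $v_t$ is typically higher than the selling price $\tau_t$ in all time slots [36]; i.e., it is unrealistic to make a profit by buying electricity at a low price and simultaneously selling it at a high price. Assumptions (36)-(39) are adopted to ensure that the control parameters $V_B^{\max}$ and $V_H^{\max}$ defined in Theorems 1 and 2 of Section V are positive.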
Since the LOT framework can handle stochastic programs with time-average constraints and objectives, we transform P1 into an optimization problem with time-average constraints. To be specific, (3), (4), (13), and (14) in P1 are used to derive the following constraints:
$$\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}, \qquad \omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}, \tag{40}$$
where $\overline{P}_x = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[P_{x,t}]$ and $x\in\{bc, bd, el, fc\}$. These quantities represent the time-average expected values of BESS charging, BESS discharging, HESS charging, and HESS discharging, respectively, under any feasible control algorithm of P1.
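The process of obtaining (40) is explained as follows. Taking the BESS as an example, summing (3) over $t\in\{0, 1, \dots, T-1\}$, taking expectations of both sides, dividing both sides by $T$, and taking the limit as $T\to\infty$, we have
$$\lim_{T\to\infty}\frac{\mathbb{E}[B_T - B_0]}{T} = \lim_{T\to\infty}\mathbb{E}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\Big(\eta_{bc}P_{bc,t}\Delta t - \frac{P_{bd,t}\Delta t}{\eta_{bd}}\Big)\Big] = \eta_{bc}\overline{P}_{bc}\Delta t - \frac{\overline{P}_{bd}\Delta t}{\eta_{bd}}. \tag{41}$$
According to (4), $B^{\min} - B^{\max} \le B_T - B_0 \le B^{\max} - B^{\min}$. Thus, $\lim_{T\to\infty}\frac{\mathbb{E}[B_T - B_0]}{T} = 0$, and $\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}$. In the same way, $\omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}$ can be proved.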
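Based on the above description, P1 can be relaxed to P2 as follows:
$$(\mathbf{P2})\quad \min\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\{\sum_{j=1}^{6}C_{j,t}\Big\} \tag{42a}$$
$$\text{s.t.}\quad (1), (2), (5)-(12), (15)-(26), \tag{42b}$$
$$\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}, \qquad \omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}. \tag{42c}$$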
To ensure the feasibility of (42c), we construct two virtual queues related to $B_t$ and $H_t$ and make them mean rate stable. To be specific, we define the two virtual queues as $X_{B,t} = B_t + W_B$ and $X_{H,t} = H_t + W_H$, where $W_B$ and $W_H$ are constants whose values are derived in the next section. Moreover, according to (3) and (13), the dynamics of these virtual queues can be written as
$$X_{B,t+1} = X_{B,t} + \Big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\Big)\Delta t, \tag{43}$$
$$X_{H,t+1} = X_{H,t} + \Big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\Big)\Delta t. \tag{44}$$
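As a minimal sketch of the queue dynamics in (43) and (44), the following Python function performs one slot of the update. The function and argument names are illustrative assumptions, and the numeric values in the usage lines (including the shift $W_B=-50$ and the unit conversion factors) are arbitrary placeholders rather than settings from this paper.

```python
def update_virtual_queues(x_b, x_h, p_bc, p_bd, p_el, p_fc,
                          eta_bc, eta_bd, omega_el, omega_fc, dt):
    """One-slot update of the virtual queues, following (43) and (44)."""
    x_b_next = x_b + (eta_bc * p_bc - p_bd / eta_bd) * dt
    x_h_next = x_h + (omega_el * p_el - p_fc / omega_fc) * dt
    return x_b_next, x_h_next


# Illustrative values only (not taken from the paper).
b0, w_b = 0.0, -50.0          # X_{B,0} = B_0 + W_B
x_b1, x_h1 = update_virtual_queues(
    b0 + w_b, 0.0, p_bc=10.0, p_bd=0.0, p_el=0.0, p_fc=0.0,
    eta_bc=0.95, eta_bd=0.95, omega_el=1.0, omega_fc=1.0, dt=1.0)
```

Because the queues are shifted copies of the storage levels, charging moves $X_{B,t}$ upward by exactly the energy stored in the BESS during the slot.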
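According to (43), we have
$$X_{B,l+1} - X_{B,l} = \Big(\eta_{bc}P_{bc,l} - \frac{P_{bd,l}}{\eta_{bd}}\Big)\Delta t. \tag{45}$$
Summing (45) over $l\in\{0, 1, \dots, t-1\}$ for $t > 0$, we have
$$X_{B,t} - X_{B,0} = \sum_{l=0}^{t-1}\Big(\eta_{bc}P_{bc,l}\Delta t - \frac{P_{bd,l}\Delta t}{\eta_{bd}}\Big). \tag{46}$$
Taking expectations of (46), dividing both sides by $t$, and taking the limit as $t\to\infty$, we have
$$\begin{aligned}
\lim_{t\to\infty}\mathbb{E}\Big[\frac{1}{t}\sum_{l=0}^{t-1}\Big(\eta_{bc}P_{bc,l}\Delta t - \frac{P_{bd,l}\Delta t}{\eta_{bd}}\Big)\Big]
&= \lim_{t\to\infty}\mathbb{E}\Big[\frac{X_{B,t} - X_{B,0}}{t}\Big] = \lim_{t\to\infty}\mathbb{E}\Big[\frac{X_{B,t}}{t}\Big] \\
&\le \lim_{t\to\infty}\Big|\mathbb{E}\Big[\frac{X_{B,t}}{t}\Big]\Big| \le \lim_{t\to\infty}\frac{\mathbb{E}|X_{B,t}|}{t},
\end{aligned} \tag{47}$$
where $X_{B,0} = B_0 + W_B$ is a constant since $B^{\min}\le B_0\le B^{\max}$. When $X_{B,t}$ is mean rate stable, we have $\lim_{t\to\infty}\frac{\mathbb{E}[|X_{B,t}|]}{t} = 0$ according to the definition of mean rate stability [21]. As a result, (47) becomes $\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}$. Similarly, we can prove that $\omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}$ holds if $X_{H,t}$ is mean rate stable.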
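Based on the above conclusion, P2 can be equivalently transformed into a queue stability problem P3 as follows:
$$(\mathbf{P3})\quad \min\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big\{\sum_{j=1}^{6}C_{j,t}\Big\} \tag{48a}$$
$$\text{s.t.}\quad (1), (2), (5)-(12), (15)-(26), \tag{48b}$$
$$X_{B,t}\ \text{and}\ X_{H,t}\ \text{are mean rate stable}. \tag{48c}$$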
According to LOT, P3 can be solved by constructing a drift-plus-penalty function and minimizing its upper bound. To this end, we first define a Lyapunov function as follows:
$$L(t) \triangleq \frac{1}{2}\big(X_{B,t}^2 + \xi X_{H,t}^2\big), \tag{49}$$
where $\xi$ is a weighting parameter, introduced because $X_{B,t}$ and $X_{H,t}$ have different units.
Then, the one-slot conditional Lyapunov drift is
$$\Lambda_t = \mathbb{E}\{L(t+1) - L(t)\,|\,X(t)\}, \tag{50}$$
where $X(t) = (X_{B,t}, X_{H,t})$.
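Based on the observed $X(t)$, $L(t+1) - L(t)$ can be bounded as follows:
$$\begin{aligned}
L(t+1) - L(t) &= \frac{1}{2}\Big[X_{B,t+1}^2 - X_{B,t}^2 + \xi\big(X_{H,t+1}^2 - X_{H,t}^2\big)\Big] \\
&\le X_{B,t}\Big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\Big)\Delta t + \zeta_B + \xi X_{H,t}\Big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\Big)\Delta t + \zeta_H,
\end{aligned} \tag{51}$$
where $\zeta_B = \frac{1}{2}\big(\Delta t\max\{\eta_{bc}P_{bc}^{\max}, \frac{P_{bd}^{\max}}{\eta_{bd}}\}\big)^2$ and $\zeta_H = \frac{\xi}{2}\big(\Delta t\max\{\omega_{el}P_{el}^{\max}, \frac{P_{fc}^{\max}}{\omega_{fc}}\}\big)^2$. Then, we have
$$\Lambda_t \le \zeta_B + \zeta_H + \mathbb{E}\{\Gamma_{0,t}\,|\,X(t)\}, \tag{52}$$
where $\Gamma_{0,t} = X_{B,t}\big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\big)\Delta t + \xi X_{H,t}\big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\big)\Delta t$, with the weight $\xi$ on the HESS term as in (51).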
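The one-slot bound behind (51) and (52) can be checked numerically. The sketch below samples random feasible charging/discharging powers and verifies that the Lyapunov difference never exceeds the linear term plus $\zeta_B + \zeta_H$. It assumes that the HESS term of the linear part carries the weight $\xi$, as in (51); all parameter values and names are illustrative assumptions.

```python
import random


def lyapunov(x_b, x_h, xi):
    """Lyapunov function (49)."""
    return 0.5 * (x_b ** 2 + xi * x_h ** 2)


def drift_bound_holds(x_b, x_h, p_bc, p_bd, p_el, p_fc, prm):
    """Check L(t+1) - L(t) <= Gamma_0 + zeta_B + zeta_H for one sample."""
    dt, xi = prm["dt"], prm["xi"]
    d_b = (prm["eta_bc"] * p_bc - p_bd / prm["eta_bd"]) * dt    # (43)
    d_h = (prm["omega_el"] * p_el - p_fc / prm["omega_fc"]) * dt  # (44)
    drift = lyapunov(x_b + d_b, x_h + d_h, xi) - lyapunov(x_b, x_h, xi)
    zeta_b = (dt * max(prm["eta_bc"] * prm["p_bc_max"],
                       prm["p_bd_max"] / prm["eta_bd"])) ** 2 / 2
    zeta_h = xi * (dt * max(prm["omega_el"] * prm["p_el_max"],
                            prm["p_fc_max"] / prm["omega_fc"])) ** 2 / 2
    gamma_0 = x_b * d_b + xi * x_h * d_h
    return drift <= gamma_0 + zeta_b + zeta_h + 1e-9


# Illustrative parameters only (not the paper's settings).
prm = {"dt": 1.0, "xi": 4.0, "eta_bc": 0.95, "eta_bd": 0.95,
       "omega_el": 0.2, "omega_fc": 0.25,
       "p_bc_max": 10.0, "p_bd_max": 10.0, "p_el_max": 10.0, "p_fc_max": 10.0}

random.seed(0)
all_ok = all(
    drift_bound_holds(random.uniform(-100, 100), random.uniform(-100, 100),
                      random.uniform(0, 10), random.uniform(0, 10),
                      random.uniform(0, 10), random.uniform(0, 10), prm)
    for _ in range(1000))
```

The bound holds because the squared one-slot change of each queue is at most the square of its largest possible charging or discharging increment.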
Then, the drift-plus-penalty function can be derived by combining (52) with the "penalty" related to the objective function:
$$Y(t) = \Lambda_t + V\,\mathbb{E}\Big\{\sum_{i=1}^{6}C_{i,t}\,\Big|\,X(t)\Big\} \le \zeta_B + \zeta_H + \mathbb{E}\Big\{\Gamma_{0,t} + V\sum_{i=1}^{6}C_{i,t}\,\Big|\,X(t)\Big\}. \tag{53}$$
Finally, we design an online operation algorithm (i.e., Algorithm 1) for P3 by minimizing the right-hand side of (53) subject to the constraints in P3, which is equivalent to solving P4 based on the observed states at each slot $t$:
$$(\mathbf{P4})\quad \min\ \Gamma_{0,t} + V\sum_{i=1}^{6}C_{i,t} \tag{54a}$$
$$\text{s.t.}\quad (1), (2), (5)-(12), (15)-(26). \tag{54b}$$
Algorithm 1: The proposed online operation algorithm for HMBES
Input: Control parameters $V$, $W_B$, $W_H$
Output: Control decisions at each slot $t$, i.e., $P_{all,t}$
1  Initialize $X_{B,0} = B_0 + W_B$ and $X_{H,0} = H_0 + W_H$;
2  for each time slot $t$ ($0\le t\le T-1$) do
3      Observe system states: $X(t)$ and $S_{sys,t}$;
4      Solve P4 using the methods in Section IV-C;
5      $P_{all,t} = (P_{upper,t}, P_{lower,t})$;
6      Update $X_{B,t}$ and $X_{H,t}$ according to (43) and (44);
7  end
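The control flow of Algorithm 1 can be sketched in Python as follows. This is only a skeleton under stated assumptions: `observe`, `solve_upper`, and `solve_lower` are hypothetical callables standing in for the state observation, the solution of P7, and Algorithm 3, respectively, and the stub implementations and parameter values at the bottom exist purely to make the example runnable.

```python
def run_online_algorithm(b0, h0, w_b, w_h, horizon, observe,
                         solve_upper, solve_lower, prm):
    """Skeleton of Algorithm 1 with pluggable (hypothetical) solvers."""
    x_b, x_h = b0 + w_b, h0 + w_h                    # line 1
    decisions = []
    for t in range(horizon):                         # line 2
        state = observe(t)                           # line 3
        upper = solve_upper(x_b, x_h, state)         # line 4: P7
        lower = solve_lower(upper["p_fc"], state)    # line 4: Algorithm 3
        decisions.append({**upper, **lower})         # line 5
        # line 6: virtual queue updates (43) and (44)
        x_b += (prm["eta_bc"] * upper["p_bc"]
                - upper["p_bd"] / prm["eta_bd"]) * prm["dt"]
        x_h += (prm["omega_el"] * upper["p_el"]
                - upper["p_fc"] / prm["omega_fc"]) * prm["dt"]
    return decisions, (x_b, x_h)


# Stub solvers for illustration only: charge the BESS while X_B is negative.
def _observe(t):
    return {"t": t}


def _solve_upper(x_b, x_h, state):
    return {"p_bc": 10.0 if x_b < 0 else 0.0, "p_bd": 0.0, "p_el": 0.0,
            "p_fc": 0.0, "p_buy": 10.0, "p_sell": 0.0}


def _solve_lower(p_fc, state):
    return {"p_sp": [0.0], "p_tc": 0.0, "p_td": 0.0, "p_gb": 0.0}


prm = {"dt": 1.0, "eta_bc": 0.95, "eta_bd": 0.95,
       "omega_el": 1.0, "omega_fc": 1.0}
decisions, (x_b, x_h) = run_online_algorithm(
    0.0, 0.0, -15.0, 0.0, 3, _observe, _solve_upper, _solve_lower, prm)
```

Keeping the two solvers behind separate callables mirrors the split between the model-based upper subproblem and the learned lower subproblem.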
In Algorithm 1, the control parameters $V$, $W_B$, and $W_H$ are taken as algorithmic inputs, and the control decisions $P_{all,t}$ are taken as algorithmic outputs. Here, $P_{all,t} = (P_{gb,t}, P_{bc,t}, P_{bd,t}, P_{tc,t}, P_{td,t}, P_{el,t}, P_{fc,t}, P_{sp,i,t}|_{1\le i\le N}, P_{buy,t}, P_{sell,t})$. In each time slot $t$, the virtual queue length vector $X(t)$ and the system states $S_{sys,t} = (P_{pv,t}, P_{load,t}, \beta_{in,i,t}|_{1\le i\le N}, \beta_{out,t}, \mu_{e,t}, v_t, \tau_t, t)$ are observed, as shown in line 3. Then, P4 is solved using the methods introduced in Section IV-C, i.e., $P_{upper,t} = (P_{bc,t}, P_{bd,t}, P_{el,t}, P_{fc,t}, P_{buy,t}, P_{sell,t})$ is obtained by solving P7 and $P_{lower,t} = (P_{sp,i,t}|_{1\le i\le N}, P_{tc,t}, P_{td,t}, P_{gb,t})$ is obtained using Algorithm 3. Next, the control decisions $P_{all,t} = (P_{upper,t}, P_{lower,t})$ are determined and the two virtual queues are updated. Although line 6 of Algorithm 1 ensures the feasibility of (3) and (13), the proposed online algorithm may be infeasible for P1 since (4) and (14) are neglected. However, in the next section, we prove that constraints (4) and (14) are satisfied if reasonable values of $V$, $W_B$, and $W_H$ are selected.
B. Step-2: Decomposition
Since an explicit building thermal dynamics model $F_i(\cdot)$ is unavailable, P4 cannot be solved using traditional optimization techniques. To solve P4 efficiently, we decompose it into two subproblems according to the availability of model information, i.e., an upper subproblem P5 related to the electric-hydrogen subsystem and a lower subproblem P6 related to the thermal subsystem. First, we solve the upper subproblem using model-based optimization. Then, its decision on $P_{fc,t}$ is taken as a state component in the lower subproblem, which is solved by MAADDPG. To be specific, P5 and P6 are given as follows:
$$(\mathbf{P5})\quad \min\ \Gamma_{0,t} + V\sum_{i=1}^{4}C_{i,t} \tag{55a}$$
$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22)-(25). \tag{55b}$$
$$(\mathbf{P6})\quad \min\ V(C_{5,t} + C_{6,t}) \tag{56a}$$
$$\text{s.t.}\quad (8)-(12), (18)-(21), (26). \tag{56b}$$
C. Step-3: Solving two subproblems
1) The solution to P5: P5 can be transformed into a mixed-integer linear programming problem by introducing several auxiliary binary variables and linear constraints. However, solving P5 in this way presupposes that reasonable values of $V$, $W_B$, and $W_H$ are known. To facilitate the derivation of reasonable values of $V$, $W_B$, and $W_H$, we solve the following problem P7, which has a convex objective function:
$$(\mathbf{P7})\quad \min\ \Gamma_{0,t} + V\Big(\sum_{i=1}^{3}C_{i,t} + \delta_{el}^{on}\frac{P_{el,t}}{P_{el}^{\max}} + \delta_{fc}^{on}\frac{P_{fc,t}}{P_{fc}^{\max}}\Big) \tag{57a}$$
$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22)-(25), \tag{57b}$$
where the gap between the objective function of P7 and that of P5 varies within the range $[0, \max\{\delta_{el}^{on} + \delta_{el}^{su} + \delta_{fc}^{sd},\ \delta_{fc}^{on} + \delta_{fc}^{su} + \delta_{el}^{sd},\ \delta_{el}^{sd} + \delta_{fc}^{sd}\}]$. Although the optimal solution of P7 may differ from that of P5, reasonable values of $V$, $W_B$, and $W_H$ can be derived by analyzing the structure of P7 (see the next section), and the proposed algorithm still shows good performance in the simulation results.
The solution of P7 can be derived by considering two cases, i.e., buying electricity and not buying electricity. In other words, the following two problems P8 and P9 should be solved. By further considering four possible combinations of BESS and HESS operations (i.e., $P_{bd,t} = P_{fc,t} = 0$, $P_{bd,t} = P_{el,t} = 0$, $P_{bc,t} = P_{fc,t} = 0$, and $P_{bc,t} = P_{el,t} = 0$), P8 and P9 can be transformed into eight linear programming subproblems. Each of them has 3 variables and 6 constraints and can be solved efficiently by an interior point method in polynomial time, namely $O(3^{3.5}L_{in})$, where $L_{in}$ denotes the number of bits of input data [38]. Finally, the optimal solution to P7 is that of the subproblem with the smallest objective function value.
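The case enumeration described above can be sketched as follows. The code only builds the eight mode combinations (two grid modes times four storage modes) and selects the best result; the linear programs themselves are delegated to a hypothetical `solve_lp` callable, because constraints (1)-(25) are defined elsewhere in the paper. All names are illustrative assumptions.

```python
from itertools import product

# Variable fixed to zero in each grid mode (P8: buying, P9: not buying).
GRID_MODES = {"buy": "p_sell", "no_buy": "p_buy"}
# The four BESS/HESS combinations listed in the text.
STORAGE_MODES = [("p_bd", "p_fc"), ("p_bd", "p_el"),
                 ("p_bc", "p_fc"), ("p_bc", "p_el")]


def enumerate_subproblems():
    """Return the eight LP cases as sets of variables pinned to zero."""
    cases = []
    for (grid, grid_zero), storage_zero in product(GRID_MODES.items(),
                                                   STORAGE_MODES):
        cases.append({"grid": grid,
                      "fixed_to_zero": {grid_zero, *storage_zero}})
    return cases


def solve_p7(solve_lp):
    """Solve every case with a user-supplied LP solver and keep the best.

    `solve_lp(case)` is hypothetical: it should return a dict with an
    "objective" entry, or None when the case is infeasible.
    """
    results = [solve_lp(case) for case in enumerate_subproblems()]
    feasible = [r for r in results if r is not None]
    if not feasible:
        raise ValueError("no feasible subproblem")
    return min(feasible, key=lambda r: r["objective"])


cases = enumerate_subproblems()
# Dummy solver for illustration: objective = index of the case.
best = solve_p7(lambda case: {"objective": cases.index(case), "case": case})
```

Pinning three of the six upper-level variables to zero in each case leaves three free variables, which matches the subproblem size stated in the text.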
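$$(\mathbf{P8})\quad \min\ \Gamma_{1,t} \tag{58a}$$
$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22), (24), \tag{58b}$$
$$P_{sell,t} = 0, \tag{58c}$$
where $\Gamma_{1,t} = X_{B,t}\big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\big)\Delta t + \xi X_{H,t}\big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\big)\Delta t + V(v_t + \mu_c\mu_{e,t})P_{buy,t}\Delta t + V\psi_{BESS}(P_{bc,t} + P_{bd,t}) + V\big(\delta_{el}^{on}\frac{P_{el,t}}{P_{el}^{\max}} + \delta_{fc}^{on}\frac{P_{fc,t}}{P_{fc}^{\max}}\big)$.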
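$$(\mathbf{P9})\quad \min\ \Gamma_{2,t} \tag{59a}$$
$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22), (25), \tag{59b}$$
$$P_{buy,t} = 0, \tag{59c}$$
where $\Gamma_{2,t} = X_{B,t}\big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\big)\Delta t + \xi X_{H,t}\big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\big)\Delta t - V(\tau_t + \mu_c\mu_{e,t})P_{sell,t}\Delta t + V\psi_{BESS}(P_{bc,t} + P_{bd,t}) + V\big(\delta_{el}^{on}\frac{P_{el,t}}{P_{el}^{\max}} + \delta_{fc}^{on}\frac{P_{fc,t}}{P_{fc}^{\max}}\big)$.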
2) The solution to P6: To solve P6 efficiently, we reformulate it as a Markov game, which is a general modeling framework for multi-agent decision-making problems under uncertainty [22]. Specifically, a Markov game with $N$ agents is defined by a set of states $\mathcal{S}$; a collection of action sets $\mathcal{A}_1, \dots, \mathcal{A}_N$ (one for each agent in the environment); a state transition function $\mathcal{F}: \mathcal{S}\times\mathcal{A}_1\times\dots\times\mathcal{A}_N \to \Pi(\mathcal{S})$, which defines the probability distribution over possible next states given the current state and the actions of all agents; and a reward function for each agent $i$ ($1\le i\le N$), $\mathcal{R}_i: \mathcal{S}\times\mathcal{A}_1\times\dots\times\mathcal{A}_N \to \mathbb{R}$. In a Markov game, each agent $i$ takes an action $a_i\in\mathcal{A}_i$ based on its local observation $o_i\in\mathcal{O}_i$, where $o_i$ contains partial information of the global state $s\in\mathcal{S}$. The aim of agent $i$ is to maximize its own expected return by learning a policy $\pi_i: \mathcal{O}_i \to \Pi(\mathcal{A}_i)$, which maps the agent's local observation $o_i\in\mathcal{O}_i$ into a distribution over its set of actions. Here, the return is the sum of discounted future rewards, i.e., $\sum_{j=0}^{\infty}\gamma^j r_{i,t+j+1}(s_t, a_{1,t}, \dots, a_{N,t})$, where $\gamma\in[0,1]$ is a discount factor and $r_{i,t+1}\in\mathcal{R}_i$ is the reward received by agent $i$ at slot $t$. Since the state transition function need not be known when solving P6 with MAADDPG, only three components (i.e., state, action, and reward function) are designed.
Environment State: According to (19), the temperature deviation should be penalized for each agent so that a comfortable temperature range can be maintained. Moreover, to promote coordination among all thermal-load agents, $C_{5,t}$ and $C_{6,t}$ should be considered in the reward design of each thermal-load agent. Since the temperature deviation, $C_{5,t}$, and $C_{6,t}$ depend on $\beta_{in,i,t}$, $\beta_{out,t}$, $Q_{th,t}$, and $Q_{fc,t}$, the environment state of the $i$-th thermal-load agent is designed as $o_{i,t} = (Q_{fc,t}, Q_{th,t}, \beta_{in,i,t}, \beta_{out,t}, t)$, where $Q_{fc,t}$ is obtained from (18) and $P_{fc,t}$ is obtained from the solution of P7.
Action: Since each thermal-load agent needs to make a decision on $P_{sp,i,t}$, the action of the $i$-th thermal-load agent is designed as $a_{i,t} = P_{sp,i,t}$. To speed up the learning of agents, the following rule is adopted:
$$a_{i,t} = \begin{cases} 0, & \text{if } \beta_{in,i,t}\le\beta_i^{\min} \text{ or } \beta_{out,t}\le\beta_i^{\max}, \\ P_{sp,i,t}, & \text{otherwise}. \end{cases} \tag{60}$$
Remark 1: After the actions of the thermal-load agents are taken, the actions of the gas boiler and the CWT can be decided accordingly. To be specific, when $Q_{fc,t}\eta_{h2c} > \sum_{i=1}^{N}P_{sp,i,t}\Delta t$, the CWT operates in charging mode (i.e., $P_{td,t} = 0$) and the thermal power input is $P_{tc,t} = \min\big(\frac{Q_{fc,t}\eta_{h2c}}{\Delta t} - \sum_{i=1}^{N}P_{sp,i,t},\ P_{tc}^{\max},\ \frac{Q_{th}^{\max} - Q_{th,t}}{\eta_{tc}\Delta t}\big)$. In this situation, $P_{gb,t} = 0$. When $Q_{fc,t}\eta_{h2c} \le \sum_{i=1}^{N}P_{sp,i,t}\Delta t$, the CWT operates in discharging mode (i.e., $P_{tc,t} = 0$) and the thermal power output is $P_{td,t} = \min\big(\sum_{i=1}^{N}P_{sp,i,t} - \frac{Q_{fc,t}\eta_{h2c}}{\Delta t},\ P_{td}^{\max},\ \frac{Q_{th,t}\eta_{td}}{\Delta t}\big)$. In this situation, $P_{gb,t} = \min\big(\sum_{i=1}^{N}P_{sp,i,t} - \frac{Q_{fc,t}\eta_{h2c}}{\Delta t} - P_{td,t},\ P_{gb}^{\max}\big)$. Consequently, the total thermal power for cooling buildings is $P_{thermal,t} = P_{gb,t} + P_{td,t} + \frac{Q_{fc,t}\eta_{h2c}}{\Delta t}$. When $\sum_{i=1}^{N}P_{sp,i,t} > P_{thermal,t}$, the actual thermal input of building $i$ is $P_{thermal,t}\frac{P_{sp,i,t}}{\sum_{i=1}^{N}P_{sp,i,t}}$.
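The rule-based dispatch in Remark 1 can be sketched as a single Python function. The function name, the dictionary keys, and the numeric values in the usage lines are illustrative assumptions; the case in which the requested power does not exceed the available thermal power is handled by simply delivering the requested values, which is one reading of the remark.

```python
def dispatch_thermal(p_sp, q_fc, q_th, prm):
    """Decide CWT and gas boiler actions from the agents' requests p_sp."""
    dt = prm["dt"]
    demand = sum(p_sp)
    fc_cooling = q_fc * prm["eta_h2c"] / dt   # Q_fc * eta_h2c / dt
    if fc_cooling > demand:
        # Surplus: CWT charging mode, gas boiler off.
        p_td, p_gb = 0.0, 0.0
        p_tc = min(fc_cooling - demand, prm["p_tc_max"],
                   (prm["q_th_max"] - q_th) / (prm["eta_tc"] * dt))
    else:
        # Deficit: CWT discharging mode, gas boiler covers the remainder.
        p_tc = 0.0
        p_td = min(demand - fc_cooling, prm["p_td_max"],
                   q_th * prm["eta_td"] / dt)
        p_gb = min(demand - fc_cooling - p_td, prm["p_gb_max"])
    p_thermal = p_gb + p_td + fc_cooling
    if demand > p_thermal:
        # Scale each request proportionally when supply is insufficient.
        actual = [p_thermal * p / demand for p in p_sp]
    else:
        actual = list(p_sp)
    return {"p_tc": p_tc, "p_td": p_td, "p_gb": p_gb,
            "p_thermal": p_thermal, "actual": actual}


# Illustrative parameter values only.
prm = {"dt": 1.0, "eta_h2c": 0.7, "eta_tc": 0.9, "eta_td": 0.9,
       "q_th_max": 50.0, "p_tc_max": 10.0, "p_td_max": 10.0,
       "p_gb_max": 20.0}
deficit = dispatch_thermal([5.0, 5.0, 10.0, 10.0], q_fc=0.0, q_th=0.0, prm=prm)
surplus = dispatch_thermal([2.0, 2.0, 2.0, 2.0], q_fc=20.0, q_th=0.0, prm=prm)
```

In the deficit example the tank is empty, so the gas boiler runs at its limit and every building receives the same fraction of its request.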
Reward: As mentioned in the description of the state design, the reward of the $i$-th agent consists of three parts, i.e., the penalties imposed on temperature deviation, CWT depreciation cost, and gas purchasing cost. Therefore, the reward of each agent $i$ is designed as $r_{th,i,t} = -\Big((C_{5,t} + C_{6,t})\frac{P_{sp,i,t}}{\sum_{i=1}^{N}P_{sp,i,t}} + \varpi_{i,t}\Big)$, where $\varpi_{i,t} = \kappa_{th}\big([\beta_{in,i,t+1} - \beta^{\max}]^+ + [\beta^{\min} - \beta_{in,i,t+1}]^+\big)$, $\kappa_{th}$ denotes a positive penalty coefficient, and $[\cdot]^+ = \max(\cdot, 0)$.
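A minimal sketch of the reward computation is given below. The function and argument names are illustrative assumptions, and assigning a zero cost share when no agent requests power is an added convention to avoid division by zero, not something stated in the paper.

```python
def thermal_reward(i, p_sp, c5, c6, beta_next, beta_min, beta_max, kappa_th):
    """Reward of agent i: negative cost share plus comfort penalty."""
    total = sum(p_sp)
    # Assumption: zero cost share when the total request is zero.
    share = p_sp[i] / total if total > 0 else 0.0
    penalty = kappa_th * (max(beta_next - beta_max, 0.0)
                          + max(beta_min - beta_next, 0.0))
    return -((c5 + c6) * share + penalty)


# Illustrative values: agent 1 requests 3 of 4 kW and overshoots by 1 degree.
r = thermal_reward(1, [1.0, 3.0], c5=0.2, c6=0.6,
                   beta_next=26.0, beta_min=20.0, beta_max=25.0, kappa_th=1.0)
r_idle = thermal_reward(0, [0.0, 0.0], c5=0.0, c6=0.0,
                        beta_next=22.0, beta_min=20.0, beta_max=25.0,
                        kappa_th=1.0)
```

Splitting $C_{5,t} + C_{6,t}$ in proportion to each agent's request charges the shared cost to the agents that cause it.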
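[Figure 3 diagram: each agent $i$ has an actor (input $o_i$, output $a_i$) and a critic consisting of an encoder, an attention module, and an MLP; the attention module of agent $i$ takes the encodings $e_j$ of the other agents, and the critic outputs $Q_i(o, a)$, which is used to update the actor and the critic.]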
Fig. 3. The framework of MAADDPG.
To solve the Markov game mentioned above, a MAADDPG-based algorithm is proposed based on an attention mechanism and MADDPG; its framework is shown in Fig. 3. Compared with MADDPG, MAADDPG has higher scalability since the output size of the attention module in Fig. 3 is fixed and thus independent of the total number of agents. In Fig. 3, each agent consists of an actor network and a critic network. The input of the actor network of agent $i$ is the local observation $o_i$ and its output is the action $a_i$. The input of the critic network of agent $i$ consists of $o_i$, $a_i$, and $e_j$ ($j\ne i$, $1\le j\le N$), where $e_i$ denotes the encoding of the local observation and action of agent $i$, and its output is the action-value function $Q_i(o, a)$, where $o = (o_1, \dots, o_N)$ and $a = (a_1, \dots, a_N)$. In the critic network of agent $i$, the input of the attention module is $e_i$ ($1\le i\le N$) and its output is $x_i$, which represents the contribution of the other agents, i.e.,
$$x_i = \sum_{j\ne i} w_j\,\hbar(W_{value,j}e_j), \tag{61}$$
where $W_{value,j}$ is a value transformation matrix related to agent $j$, $\hbar$ is a non-linear activation function, and $w_j$ is the attention weight associated with agent $j$, which reflects the similarity between $e_i$ and $e_j$. To be specific,
$$w_j = \frac{\exp\big((W_{key,i}e_j)^{T}W_{query,i}e_i\big)}{\sum_{j=1}^{N}\exp\big((W_{key,i}e_j)^{T}W_{query,i}e_i\big)}, \quad \forall j, \tag{62}$$
where $W_{key,i}$ and $W_{query,i}$ are the key and query transformation matrices related to agent $i$, respectively.
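The attention computation in (61) and (62) can be sketched in plain Python with lists as vectors and matrices. The helper names are illustrative, the choice of a leaky ReLU for the activation $\hbar$ is an assumption (the paper only states that it is non-linear), and the identity matrices in the usage lines are placeholders for learned parameters.

```python
import math


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leaky_relu(v, slope=0.01):
    # Assumed form of the activation in (61).
    return [x if x > 0 else slope * x for x in v]


def attention_for_agent(i, e, w_key, w_query, w_value):
    """Return (weights, x_i) for agent i following (61) and (62).

    e: list of encodings e_j; w_key, w_query: matrices of agent i;
    w_value: list of value matrices, one per agent j.
    """
    query = matvec(w_query, e[i])
    scores = [dot(matvec(w_key, e_j), query) for e_j in e]
    top = max(scores)                       # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)                        # normalization over j = 1..N
    weights = [x / norm for x in exps]
    x_i = [0.0] * len(w_value[0])
    for j, e_j in enumerate(e):
        if j == i:
            continue                        # (61) sums over j != i
        value = leaky_relu(matvec(w_value[j], e_j))
        x_i = [acc + weights[j] * v for acc, v in zip(x_i, value)]
    return weights, x_i


identity = [[1.0, 0.0], [0.0, 1.0]]
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, x_0 = attention_for_agent(0, encodings, identity, identity,
                                   [identity] * 3)
```

The length of `x_0` is set by the value matrices alone, which illustrates why the output size of the attention module does not grow with the number of agents.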
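Let $\theta_i$ be the network parameter of agent $i$. Then, the loss function used for updating the critic network is given by
$$\mathcal{L}(\theta_i) = \mathbb{E}_{(o, a, o', r)}\big[Q_i^{\pi}(o, a) - y\big]^2, \tag{63}$$
where $(o, a, o', r)$ denotes an experience transition in the memory buffer $\mathcal{D}$, $\pi$ denotes the policies of the agents, $y = r_i + \gamma Q_i^{\pi'}(o', a')|_{a' = \pi'(o')}$, and $\pi'$ denotes the target policies of the agents. Moreover, the policy gradient used for updating the actor network is given by
$$\nabla_{\theta_i}J(\pi_i) = \mathbb{E}_{o, a}\big[\nabla_{\theta_i}\pi_i(a_i|o_i)\,\nabla_{a_i}Q_i^{\pi}(o, a)|_{a_i = \pi_i(o_i)}\big]. \tag{64}$$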
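To make the critic update concrete, the sketch below computes the temporal-difference target $y$ and the mean squared loss of (63) for a mini-batch of scalar critic outputs. The function names are illustrative assumptions, and the networks themselves are abstracted away: the critic and target-critic values are passed in as plain numbers.

```python
def td_target(r_i, gamma, q_target_next):
    """Target y = r_i + gamma * Q'_i(o', a') evaluated with target networks."""
    return r_i + gamma * q_target_next


def critic_loss(q_values, targets):
    """Sample average of the squared error in (63) over a mini-batch."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Illustrative numbers only.
y = td_target(r_i=-1.0, gamma=0.995, q_target_next=-10.0)
loss = critic_loss([1.0, 2.0], [1.0, 4.0])
```

In practice the expectation in (63) is replaced by this kind of average over the $K$ transitions sampled from the replay memory.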
The MAADDPG-based training algorithm for solving the Markov game related to P6 is shown in Algorithm 2. To be specific, a replay memory $\mathcal{D}$ is initialized in line 1. Then, the preprocessing function $\phi(o)$ is introduced to normalize the environment state $o_t$ as in [29], which facilitates the learning process of the proposed algorithm. In line 3, an Ornstein-Uhlenbeck (OU) process is used to generate the random noise $\mathcal{N}$. In lines 4-5, we initialize the weight parameters of the actors/critics and the target actors/critics. Each actor network has one input layer, three hidden layers of the same size $N_a^h$, and one output layer. Each critic network involves three modules, i.e., encoder, attention, and MLP. Here, the MLP has one input layer, one hidden layer of size $N_c^h$, and one output layer. In lines 7-8, the environment state $o$ is initialized and the scale of the OU noise is adjusted so that it decreases linearly with the episode index. In line 10, each thermal-load agent takes an action based on the current policy and exploration noise. In line 11, after receiving the joint action of all thermal-load agents, the environment returns a new state $o'$ and a reward $r$. Next, the experience transition tuple $(\phi(o), a, r, \phi(o'))$ is stored in the memory $\mathcal{D}$. When the number of transition tuples $M_{size}$ reaches $N_m$, the multi-agent training process is triggered. However, to stabilize the learning process, the training frequency is reduced by adopting another condition, i.e., $\mathrm{mod}(ep, T_{fre}) = 0$, where $ep$ denotes the episode index and $T_{fre}$ means that training is conducted every $T_{fre}$ episodes. In lines 16-18, each agent updates its actor and critic parameters based on a sampled mini-batch of $K$ transition tuples. In line 20, the target network parameters are updated.
Once the training process is finished, the obtained policy $\pi_i$ can be used to solve P6 online without any solution search, as shown in Algorithm 3. Since only forward propagation is involved, the computational complexity of Algorithm 3 is low. To be specific, forward propagation involves three basic computations, i.e., addition, multiplication, and activation. Let $N_{in}$ and $N_{out}$ be the numbers of neurons in the input layer and output layer, respectively. For the first neuron in the first hidden layer of the above-mentioned actor network, the numbers of additions, multiplications, and activations are $N_{in}$, $N_{in}$, and 1, respectively. Then, the total number of computations in the first hidden layer is $(2N_{in} + 1)N_a^h$. Similarly, the total number of computations in each of the second and third hidden layers is $(2N_a^h + 1)N_a^h$. For the output layer, the total number of computations is $(2N_a^h + 1)N_{out}$. Finally, the computational complexity of Algorithm 3 is $O(2N_{in}N_a^h + 4(N_a^h)^2 + 2N_{out}N_a^h + 3N_a^h + N_{out})$. Since $N_{in} = 5$ and $N_{out} = 1$ in this paper, the computational complexity of Algorithm 3 is $O(4(N_a^h)^2 + 15N_a^h + 1)$. When the computational complexity of P7 is taken into consideration, the computational complexity of the proposed online algorithm for solving P4 is $O(4(N_a^h)^2 + 15N_a^h + 3^{3.5}L_{in} + 1)$.
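The operation count above can be reproduced with a few lines of Python. The function name is an illustrative assumption; it counts additions, multiplications, and activations layer by layer exactly as described in the text and compares the result with the closed form $4(N_a^h)^2 + 15N_a^h + 1$ for $N_{in}=5$ and $N_{out}=1$.

```python
def actor_forward_ops(n_in, n_hidden, n_out, hidden_layers=3):
    """Count basic operations in one forward pass of the actor network."""
    ops = (2 * n_in + 1) * n_hidden                        # first hidden layer
    ops += (hidden_layers - 1) * (2 * n_hidden + 1) * n_hidden  # other hidden
    ops += (2 * n_hidden + 1) * n_out                      # output layer
    return ops


n_h = 64  # hidden size N_a^h listed in Table I
ops = actor_forward_ops(n_in=5, n_hidden=n_h, n_out=1)
closed_form = 4 * n_h ** 2 + 15 * n_h + 1
```

With a hidden size of 64, both expressions give the same count, so the per-agent inference cost is dominated by the two hidden-to-hidden layers.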
Algorithm 2: MAADDPG-based Training Algorithm for Thermal-load Agents
Input: real-world traces (e.g., price, PV generation, power demand, and outdoor temperature), $P_{fc,t}$ obtained from the solution of P7;
Output: actor networks $\pi_i(a_i|\phi(o))$
1  Initialize replay memory $\mathcal{D}$ with size $N_m$;
2  Initialize preprocess function $\phi(o)$;
3  Initialize random noise $\mathcal{N}$ for action exploration;
4  Randomly initialize critic networks $Q_i^{\pi}(\phi(o), a)$ and actor networks $\pi_i(a_i|\phi(o))$ with parameter $\theta_i$, respectively;
5  Initialize target critic networks $Q_i^{\pi'}(\phi(o), a)$ and actor networks $\pi'_i(a_i|\phi(o))$ with parameter $\theta'_i$, respectively;
6  for $ep = 1, 2, \dots, M$ do
7      Receive the initial environment state $o$;
8      Adjust the scale of the Ornstein-Uhlenbeck process;
9      for $t = 0, 1, \dots, T-1$ do
10         Each agent $i$ selects an action: $a_i = \pi_{\theta_i}(\phi(o_i)) + \mathcal{N}_t$;
11         Execute action $a$ and obtain next state $o'$ and reward $r$ from the environment;
12         Store $(\phi(o), a, r, \phi(o'))$ in $\mathcal{D}$;
13         $o \leftarrow o'$;
14         if $M_{size} \ge N_m$ and $\mathrm{mod}(ep, T_{fre}) = 0$ then
15             for agent $i = 1, \dots, N$ do
16                 Sample a mini-batch of $K$ transitions $(\phi(o^k), a^k, r^k, \phi(o'^k))$ from $\mathcal{D}$;
17                 Update critic network by minimizing the loss function in (63);
18                 Update actor network using the sampled policy gradient in (64);
19             end
20             Update target network parameters for each agent $i$: $\theta'_i \leftarrow \rho\theta_i + (1-\rho)\theta'_i$;
21         end
22     end
23 end
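Two small pieces of Algorithm 2 can be sketched without any deep learning library: the training trigger in line 14 and the soft target update in line 20. Network parameters are represented here as flat lists of floats, the function names are illustrative, and reading the trigger as "buffer length at least $N_m$" is an assumption about the comparison operator.

```python
def should_train(buffer_len, n_m, episode, t_fre):
    """Line 14: train only with a full buffer and every t_fre episodes.

    Assumption: the buffer condition is read as buffer_len >= n_m.
    """
    return buffer_len >= n_m and episode % t_fre == 0


def soft_update(target, online, rho):
    """Line 20: theta' <- rho * theta + (1 - rho) * theta'."""
    return [rho * w + (1.0 - rho) * w_t for w, w_t in zip(online, target)]


# Illustrative values; rho = 0.001 and T_fre = 5 follow Table I.
new_target = soft_update(target=[0.0, 0.0], online=[1.0, 2.0], rho=0.001)
train_now = should_train(120000, 120000, episode=10, t_fre=5)
train_later = should_train(120000, 120000, episode=11, t_fre=5)
```

With a small $\rho$ the target networks track the online networks slowly, which is the usual way to keep the target $y$ in (63) stable during training.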
Algorithm 3: Real-time Algorithm for Solving P6
1  Input: Actor networks and $P_{fc,t}$ obtained from Algorithm 2 and the solution of P7, respectively;
2  Output: $P_{sp,i,t}$, $P_{tc,t}$, $P_{td,t}$, $P_{gb,t}$;
3  $Q_{fc,t} = \eta_{hr}\eta_{h2e}P_{fc,t}\Delta t$;
4  Observe system states $Q_{th,t}$, $\beta_{in,i,t}$, $\beta_{out,t}$;
5  Obtain $o_{i,t} = (Q_{fc,t}, Q_{th,t}, \beta_{in,i,t}, \beta_{out,t}, t)$;
6  Each thermal-load agent $i$ makes its local action $P_{sp,i,t} = \pi_i(a_i|\phi(o_{i,t}))$ in parallel;
7  Determine $P_{tc,t}$, $P_{td,t}$, $P_{gb,t}$ according to Remark 1;
V. ALGORITHMIC FEASIBILITY
In this section, we provide three lemmas and two theorems, which show that constraints (4) and (14) are satisfied by the proposed online algorithm. Moreover, reasonable values of $V$, $W_B$, and $W_H$ are derived.
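Lemma 1. If $P_{buy,t} > 0$, the optimal solution of P7 has the following properties:
1) If $\Upsilon_{B,1} > 0$, the optimal BESS charging decision is $P_{bc,t} = 0$; if $\Upsilon_{B,2} < 0$, the optimal BESS discharging decision is $P_{bd,t} = 0$.
2) If $\Upsilon_{H,1} > 0$, the optimal HESS charging decision is $P_{el,t} = 0$; if $\Upsilon_{H,2} < 0$, the optimal HESS discharging decision is $P_{fc,t} = 0$.
Here,
$$\Upsilon_{B,1} = X_{B,t}\eta_{bc}\Delta t + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t + V\psi_{BESS},$$
$$\Upsilon_{B,2} = \frac{X_{B,t}\Delta t}{\eta_{bd}} + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t - V\psi_{BESS},$$
$$\Upsilon_{H,1} = X_{H,t}\xi\omega_{el}\Delta t + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t + \frac{V\delta_{el}^{on}}{P_{el}^{\max}},$$
$$\Upsilon_{H,2} = \frac{X_{H,t}\xi\Delta t}{\omega_{fc}} + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t - \frac{V\delta_{fc}^{on}}{P_{fc}^{\max}}.$$
Proof: See Appendix A.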
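Lemma 2. If $P_{buy,t} = 0$, the optimal solution of P7 has the following properties:
1) If $\Upsilon_{B,3} > 0$, the optimal BESS charging decision is $P_{bc,t} = 0$; if $\Upsilon_{B,4} < 0$, the optimal BESS discharging decision is $P_{bd,t} = 0$.
2) If $\Upsilon_{H,3} > 0$, the optimal HESS charging decision is $P_{el,t} = 0$; if $\Upsilon_{H,4} < 0$, the optimal HESS discharging decision is $P_{fc,t} = 0$.
Here,
$$\Upsilon_{B,3} = X_{B,t}\eta_{bc}\Delta t + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t + V\psi_{BESS},$$
$$\Upsilon_{B,4} = \frac{X_{B,t}\Delta t}{\eta_{bd}} + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t - V\psi_{BESS},$$
$$\Upsilon_{H,3} = X_{H,t}\xi\omega_{el}\Delta t + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t + \frac{V\delta_{el}^{on}}{P_{el}^{\max}},$$
$$\Upsilon_{H,4} = \frac{X_{H,t}\xi\Delta t}{\omega_{fc}} + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t - \frac{V\delta_{fc}^{on}}{P_{fc}^{\max}}.$$
Proof: See Appendix B.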
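Lemma 3. Based on Lemma 1 and Lemma 2, the optimal solution to P7 has the following properties:
1) If $X_{B,t} > X_B^{high} = -\frac{V(\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \psi_{BESS})}{\eta_{bc}\Delta t}$, the optimal BESS charging decision is $P_{bc,t} = 0$; if $X_{B,t} < X_B^{low} = -\frac{V\eta_{bd}(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \psi_{BESS})}{\Delta t}$, the optimal BESS discharging decision is $P_{bd,t} = 0$.
2) If $X_{H,t} > X_H^{high} = -\frac{V\big(\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \frac{\delta_{el}^{on}}{P_{el}^{\max}}\big)}{\xi\omega_{el}\Delta t}$, the optimal HESS charging decision is $P_{el,t} = 0$; if $X_{H,t} < X_H^{low} = -\frac{V\omega_{fc}\big(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}}\big)}{\xi\Delta t}$, the optimal HESS discharging decision is $P_{fc,t} = 0$.
Proof: See Appendix C.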
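The BESS part of Lemma 3 amounts to two thresholds on the virtual queue. The sketch below computes them and reports which decisions are pinned to zero. It assumes that both thresholds carry a leading minus sign, which is what the conditions $\Upsilon > 0$ and $\Upsilon < 0$ in Lemmas 1 and 2 imply when solved for $X_{B,t}$; the price extremes and the value of $V$ in the usage lines are hypothetical.

```python
def bess_thresholds(v_ctrl, prm):
    """Return (x_low, x_high) for the BESS virtual queue (Lemma 3, part 1)."""
    dt = prm["dt"]
    x_high = -v_ctrl * (prm["tau_min"] * dt
                        + prm["mu_c"] * prm["mu_e_min"] * dt
                        + prm["psi_bess"]) / (prm["eta_bc"] * dt)
    x_low = -v_ctrl * prm["eta_bd"] * (prm["v_max"] * dt
                                       + prm["mu_c"] * prm["mu_e_max"] * dt
                                       - prm["psi_bess"]) / dt
    return x_low, x_high


def pinned_bess_decisions(x_b, v_ctrl, prm):
    """Names of BESS decisions that Lemma 3 forces to zero."""
    x_low, x_high = bess_thresholds(v_ctrl, prm)
    pinned = []
    if x_b > x_high:
        pinned.append("p_bc")   # queue high: no charging
    if x_b < x_low:
        pinned.append("p_bd")   # queue low: no discharging
    return pinned


# v_max is a hypothetical peak price; other values are illustrative.
prm = {"dt": 1.0, "tau_min": 0.1, "v_max": 1.4, "mu_c": 0.01,
       "mu_e_min": 0.968, "mu_e_max": 0.968, "psi_bess": 0.01,
       "eta_bc": 0.95, "eta_bd": 0.95}
x_low, x_high = bess_thresholds(10.0, prm)
```

Under assumption (36) the low threshold lies below the high one, so at most one of the two BESS decisions is pinned to zero in any slot.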
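Based on Lemma 3, reasonable values of $V$, $W_B$, and $W_H$ can be derived in the next two theorems.
Theorem 1. Given the control parameters $V\in(0, V_B^{\max}]$ and $W_B\in[W_B^{\min}, W_B^{\max}]$, the proposed algorithm ensures the feasibility of (4), i.e., $B^{\min}\le B_t\le B^{\max}$ for all slots, where
$$V_B^{\max} = \frac{B^{\max} - B^{\min} - \eta_{bc}P_{bc}^{\max}\Delta t - \frac{P_{bd}^{\max}\Delta t}{\eta_{bd}}}{\chi_B},$$
$$\chi_B = \eta_{bd}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\psi_{BESS}}{\Delta t}\Big) - \frac{\tau^{\min} + \mu_c\mu_e^{\min} + \frac{\psi_{BESS}}{\Delta t}}{\eta_{bc}},$$
$$W_B^{\min} = -\frac{V\big(\tau^{\min} + \mu_c\mu_e^{\min} + \frac{\psi_{BESS}}{\Delta t}\big)}{\eta_{bc}} + \eta_{bc}P_{bc}^{\max}\Delta t - B^{\max},$$
$$W_B^{\max} = -V\eta_{bd}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\psi_{BESS}}{\Delta t}\Big) - \frac{P_{bd}^{\max}\Delta t}{\eta_{bd}} - B^{\min}.$$
Proof: See Appendix D.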
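Theorem 2. Given the control parameters $V\in(0, V_H^{\max}]$ and $W_H\in[W_H^{\min}, W_H^{\max}]$, the proposed algorithm ensures the feasibility of (14), i.e., $H^{\min}\le H_t\le H^{\max}$ for all slots, where
$$V_H^{\max} = \frac{H^{\max} - H^{\min} - \omega_{el}P_{el}^{\max}\Delta t - \frac{P_{fc}^{\max}\Delta t}{\omega_{fc}}}{\chi_H},$$
$$\chi_H = \frac{\omega_{fc}\big(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}}\big)}{\xi\Delta t} - \frac{\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \frac{\delta_{el}^{on}}{P_{el}^{\max}}}{\xi\omega_{el}\Delta t},$$
$$W_H^{\min} = -\frac{V\big(\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \frac{\delta_{el}^{on}}{P_{el}^{\max}}\big)}{\xi\omega_{el}\Delta t} + \omega_{el}P_{el}^{\max}\Delta t - H^{\max},$$
$$W_H^{\max} = -\frac{V\omega_{fc}\big(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}}\big)}{\xi\Delta t} - \frac{P_{fc}^{\max}\Delta t}{\omega_{fc}} - H^{\min}.$$
Proof: See Appendix E.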
Remark 2: To ensure that the operational constraints of the BESS and the HESS are both feasible, the control parameter should be selected within the range $0 < V\le V^{\max} = \min\{V_B^{\max}, V_H^{\max}\}$. According to the objective function in P4, a larger $V$ means a higher "priority" of minimizing the operational cost. Thus, this paper chooses $V = V^{\max}$, similar to existing works [36], [37]. In addition, $V_B^{\max}$ is derived from the equation $W_B^{\min} = W_B^{\max}$, and $V_H^{\max}$ is derived from the equation $W_H^{\min} = W_H^{\max}$. In other words, when $V^{\max} = V_B^{\max}$, the gap between $W_B^{\min}$ and $W_B^{\max}$ is zero. Similarly, when $V^{\max} = V_H^{\max}$, the gap between $W_H^{\min}$ and $W_H^{\max}$ is zero. Since $W_B$ and $W_H$ mainly affect the average BESS/HESS energy level rather than the exploitation of price diversity (i.e., charging/discharging the BESS/HESS when the price is low/high), we choose $W_B = W_B^{\max}$ and $W_H = W_H^{\max}$ for simplicity in this paper.
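As a worked example of Theorem 1 and Remark 2, the sketch below evaluates $\chi_B$, $V_B^{\max}$, $W_B^{\min}$, and $W_B^{\max}$ and confirms that the admissible interval for $W_B$ collapses to a single point at $V = V_B^{\max}$. It assumes that the $V$-dependent terms of $W_B^{\min}$ and $W_B^{\max}$ enter with a minus sign (consistent with thresholds that are negative multiples of $V$); the BESS values follow Table I, while the peak buying price `v_max` is a hypothetical number.

```python
def bess_control_parameters(prm, v_ctrl=None):
    """Return (V_B_max, W_B_min, W_B_max); uses V = V_B_max by default."""
    dt = prm["dt"]
    hi = prm["v_max"] + prm["mu_c"] * prm["mu_e_max"] - prm["psi_bess"] / dt
    lo = prm["tau_min"] + prm["mu_c"] * prm["mu_e_min"] + prm["psi_bess"] / dt
    chi_b = prm["eta_bd"] * hi - lo / prm["eta_bc"]
    room = (prm["b_max"] - prm["b_min"]
            - prm["eta_bc"] * prm["p_bc_max"] * dt
            - prm["p_bd_max"] * dt / prm["eta_bd"])
    v_b_max = room / chi_b
    v = v_b_max if v_ctrl is None else v_ctrl
    w_min = (-v * lo / prm["eta_bc"]
             + prm["eta_bc"] * prm["p_bc_max"] * dt - prm["b_max"])
    w_max = (-v * prm["eta_bd"] * hi
             - prm["p_bd_max"] * dt / prm["eta_bd"] - prm["b_min"])
    return v_b_max, w_min, w_max


prm = {"dt": 1.0, "v_max": 1.4,            # v_max: hypothetical peak price
       "tau_min": 0.1, "mu_c": 0.01, "mu_e_min": 0.968, "mu_e_max": 0.968,
       "psi_bess": 0.01, "eta_bc": 0.95, "eta_bd": 0.95,
       "b_max": 100.0, "b_min": 0.0, "p_bc_max": 10.0, "p_bd_max": 10.0}
v_b_max, w_min_at_vmax, w_max_at_vmax = bess_control_parameters(prm)
_, w_min_half, w_max_half = bess_control_parameters(prm, v_ctrl=v_b_max / 2)
```

For any smaller $V$ the interval $[W_B^{\min}, W_B^{\max}]$ has positive width, which is the trade-off behind choosing $V = V^{\max}$ in Remark 2.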
VI. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed
algorithm. To be specific, we first describe the simulation
setup. Next, five benchmarks are adopted for performance
comparisons. Then, two performance metrics are defined.
Finally, simulation results and discussions are provided.
A. Simulation setup
Real-world traces of electricity price, power load, PV generation, and outdoor temperature are adopted in the simulations and are shown in Fig. 4. To be specific, the retail commercial price in Beijing between June 1 and Sept. 30, 2019 is used¹. Moreover, power demand and outdoor temperature data from the Pecan Street database² between June 1 and Sept. 30, 2018 are used. Note that this database is the largest real-world open energy database and consists of data related to the Mueller neighborhood in Austin, TX, USA. Since we focus on the cooling mode in summer, solar irradiance data between June 1 and Sept. 30, 2019 from the NREL Solar Radiation Research Laboratory³ are used. In these traces, the data of 90 days and 30 days are used for training and testing, respectively. The main simulation parameters are summarized in Table I, where $lr_a$ and $lr_c$ are the learning rates of the actor network and the critic network, respectively.
¹ http://fgw.beijing.gov.cn/
² https://www.pecanstreet.org/
³ https://midcdmz.nrel.gov/
Python-based simulations are conducted on a desktop computer with an Intel Core(TM) i9-9900 CPU and 64 GB RAM. To simulate the building thermal dynamics, the following model $F$ is adopted, similar to many existing works [40], [41]:
$$\beta_{in,i,t+1} = \varepsilon_{hvac}\beta_{in,i,t} + (1 - \varepsilon_{hvac})\Big(\beta_{out,t} - \frac{P_{sp,i,t}\eta_{hvac}}{A_i}\Big).$$
Note that the above model structure is not used for energy planning/optimization as in model-based methods (e.g., model predictive control [42] and Lyapunov optimization techniques [43]), but only to generate environment data for model-free learning. In addition, adopting the above-mentioned model facilitates the performance comparison with an optimal scheme that solves a deterministic model with perfect information of the uncertain parameters.
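A minimal sketch of the simulated thermal dynamics is given below: one function for a single step of the model and a short rollout helper. The function names are illustrative assumptions, the coefficients in the usage lines follow Table I, and the outdoor temperature of 30 degrees is a hypothetical input.

```python
def next_indoor_temperature(beta_in, beta_out, p_sp, eps_hvac, eta_hvac, a_i):
    """One step of the building thermal dynamics model used in simulation."""
    return (eps_hvac * beta_in
            + (1.0 - eps_hvac) * (beta_out - p_sp * eta_hvac / a_i))


def rollout(beta_in0, beta_out_trace, p_sp_trace, eps_hvac, eta_hvac, a_i):
    """Simulate indoor temperature over a trace of outdoor temps and inputs."""
    temps = [beta_in0]
    for beta_out, p_sp in zip(beta_out_trace, p_sp_trace):
        temps.append(next_indoor_temperature(temps[-1], beta_out, p_sp,
                                             eps_hvac, eta_hvac, a_i))
    return temps


# eps_hvac, eta_hvac, and A follow Table I; temperatures are illustrative.
no_cooling = next_indoor_temperature(21.0, 30.0, 0.0, 0.8, 2.5, 0.5)
with_cooling = next_indoor_temperature(21.0, 30.0, 2.0, 0.8, 2.5, 0.5)
trace = rollout(21.0, [30.0, 30.0], [0.0, 2.0], 0.8, 2.5, 0.5)
```

An environment built on this function only needs to expose the resulting temperatures to the agents, which keeps the learning side model-free.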
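TABLE I
MAIN PARAMETER SETTINGS
PV generation, gas boiler, and carbon emission:
    $\eta_{pv}$ = 0.2 [28], $h_{pv}$ = 100 m², $\eta_{gb}$ = 0.95 [17], $\lambda_{gb}$ = 0.287 RMB/kWh [17], $P_{gb}^{\max}$ = 20 kW, $\mu_{e,t}$ = 0.968 kg/kWh, $\mu_c$ = 0.01 RMB/kg, $\tau_t$ = 0.1 RMB/kWh
BESS:
    $B^{\min}$ = 0 kWh, $B_0$ = 0 kWh, $B^{\max}$ = 100 kWh, $P_{bc}^{\max}$ = 10 kW, $P_{bd}^{\max}$ = 10 kW, $\eta_{bc}$ = $\eta_{bd}$ = 0.95 [29], $\psi_{BESS}$ = 0.01 RMB/kW [29]
CWT:
    $\eta_{tc}$ = 0.9, $\eta_{td}$ = 0.9, $Q_{th}^{\max}$ = 50 kWh, $Q_{th}^{init}$ = 0 kWh, $P_{tc}^{\max}$ = 10 kWh, $P_{td}^{\max}$ = 10 kWh, $\psi_{CWT}$ = 0.05 RMB/kW
HESS:
    $\omega_{fc}$ = 0.23 Nm³/kWh [31], $\omega_{el}$ = 1.4985 kWh/Nm³, $\eta_{hr}$ = 0.7 [2], $\eta_{h2e}$ = 1.4 [2], $\eta_{h2c}$ = 0.7 [2], $P_{el}^{\max}$ = 10 kW, $P_{fc}^{\max}$ = 10 kW, $\delta_{el}^{on}$ = 0.158 RMB [32], $\delta_{el}^{su}$ = 0.97 RMB [32], $\delta_{el}^{sd}$ = 0.049 RMB [32], $\delta_{fc}^{on}$ = $\delta_{fc}^{su}$ = 0.079 RMB [32], $\delta_{fc}^{sd}$ = 0.0395 RMB [32], $H^{\max}$ = 100 Nm³, $H_0$ = $H^{\min}$ = 0 Nm³
Thermal load:
    $N$ = 4, $\beta^{init}$ = [21, 20, 22, 21.5] °C, $\beta_i^{\min}$ = 20 °C, $\beta_i^{\max}$ = 25 °C, $\eta_{hvac}$ = 2.5, $A$ = 0.5 kW/°F, $\varepsilon_{hvac}$ = 0.8, $P_{sp}^{\max}$ = 20 kW
Training algorithm:
    $\gamma$ = 0.995, $N_a^h$ = $N_c^h$ = 64, $N_m$ = 120000, $M$ = 100000, $\kappa_{th}$ = 1 RMB/°F, $\xi$ = 4, $lr_a$ = 0.0005, $lr_c$ = 0.005, $T$ = 24, $\Delta t$ = 1 h, $K$ = 128, $\rho$ = 0.001, $T_{test}$ = 720, $T_{fre}$ = 5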
B. Benchmarks
Baseline 1 (B1): This scheme controls BESS and HESS
using an algorithm similar to [44], i.e., charging BESS
and HESS greedily when there is a surplus of renewable
energy and discharging them otherwise. Moreover, this
scheme adopts ON/OFF strategy [45] for building cool-
ing, i.e., Psp,i,t=0 if βin