IEEE TRANSACTIONS ON SMART GRID, VOL. XX, NO. XX, MONTH 2022 1

Joint Optimization and Learning Approach for

Smart Operation of Hydrogen-based

Building Energy Systems

Liang Yu, Member, IEEE, Zhanbo Xu, Member, IEEE, Xiaohong Guan, Fellow, IEEE,

Qianchuan Zhao, Senior Member, IEEE, Chunxia Dou, Member, IEEE, and Dong Yue, Fellow, IEEE

Abstract—In recent years, hydrogen-based multi-energy sys-

tems (HMESs) have received wide attention. However, existing

works on the optimal operation of HMESs neglect building

thermal dynamics, which means that the ﬂexibility of thermal

loads cannot be utilized for reducing system operation cost.

In this paper, we investigate an optimal operation problem

of an HMES with the consideration of building thermal dy-

namics. Speciﬁcally, we ﬁrst formulate an expected operational

cost minimization problem related to an HMES. Due to the

existence of uncertain parameters, inexplicit building thermal

dynamics models, spatially and temporally coupled operational

constraints, and nonlinear constraints, it is challenging to solve

the formulated problem. Then, we propose an algorithm to

solve the problem based on model-based optimization and data-driven learning. The key idea of the proposed algorithm

is summarized as follows: (1) transforming the long-term cost

minimization problem into several single-slot subproblems using

Lyapunov optimization techniques; (2) dividing each single-slot

subproblem into two parts according to the availability of model

information; (3) solving one part based on convex optimization

and solving the other part using multi-agent attention-based deep

deterministic policy gradient. Simulation results based on real-

world traces show the effectiveness of the proposed algorithm.

Index Terms—Building energy systems, operational cost, car-

bon emission, uncertainty, hydrogen energy storage, deep rein-

forcement learning, Lyapunov optimization techniques

NOMENCLATURE

Indices

This work was supported in part by the Basic Research Project of Leading Technology of Jiangsu Province under Grant BK20202011, in part by the National Natural Science Foundation of China under Grant 62192751, Grant 61972214, Grant 62122062, Grant 62192750, and Grant 61425027, in part by the 111 International Collaboration Program of China under Grant BP2018006, in part by China Postdoctoral Science Foundation under Grant 2020M673406, in part by Qinlan Project of Jiangsu Province (2022), and in part by 1311 Talent Project of Nanjing University of Posts and Telecommunications. (Corresponding authors are Liang Yu and Dong Yue).

L. Yu is with the Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China, and is also with the College of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210003, China. (email: liang.yu@njupt.edu.cn)

Z. Xu and X. Guan are with Systems Engineering Institute, Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China.

Q. Zhao is with the Center for Intelligent and Networked Systems (CFINS), Department of Automation and BNRist, Tsinghua University, 100084 China.

C. Dou is with the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China.

D. Yue is with the Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China. (email: medongy@vip.163.com)

$t$ Time slot index.

$i$ Building index, agent index.

Parameters and Constants

$\eta_{pv}$ PV system generation efficiency.
$h_{pv}$ Total radiation area of solar panels (m$^2$).
$\varsigma_t$ The solar radiation intensity at slot $t$ (W/m$^2$).
$P_{gb}^{\max}$ Maximum heat power output of gas boiler (kW).
$\eta_{bc}$, $\eta_{bd}$ Charging efficiency, discharging efficiency.
$\Delta t$ The duration of a time slot (hour).
$B^{\min}$ Minimum BESS energy level (kWh).
$B^{\max}$ Maximum BESS energy level (kWh).
$P_{bc}^{\max}$ Maximum BESS charging power (kW).
$P_{bd}^{\max}$ Maximum BESS discharging power (kW).
$\eta_{tc}$ Injection efficiency of CWT.
$\eta_{td}$ Release efficiency of CWT.
$Q_{th}^{\max}$ CWT capacity (kWh).
$P_{td}^{\max}$ Maximum CWT released power (kW).
$P_{tc}^{\max}$ Maximum CWT injected power (kW).
$H^{\max}$ HESS storage capacity (Nm$^3$).
$\omega_{el}$ Conversion coefficient of electrolyzer (Nm$^3$/kWh).
$\omega_{fc}$ Conversion coefficient of fuel cell (kWh/Nm$^3$).
$P_{el}^{\max}$ Rated input power of electrolyzer (kW).
$P_{fc}^{\max}$ Rated output power of fuel cell (kW).
$\eta_{h2e}$ Heat-to-electricity ratio.
$\eta_{hr}$ Heat recovery efficiency.
$\beta_i^{\min}$ Lower limit of comfortable temperature range (°C).
$\beta_i^{\max}$ Upper limit of comfortable temperature range (°C).
$P_{sp,i}^{\max}$ Maximum thermal input power in building $i$ (kW).
$N$ The number of buildings.
$\eta_{h2c}$ AC transformation efficiency.
$\mu_c$ A weighted carbon emission parameter (RMB/kg).
$\psi_{BESS}$ Battery depreciation coefficient (RMB/kW).
$\eta_{gb}$ Gas-to-heat conversion efficiency.
$V$ Control parameter related to operational cost.

Variables

$P_{pv,t}$ Maximum PV generation output at slot $t$ (kW).
$P_{gb,t}$ Heat power output of gas boiler at slot $t$ (kW).
$B_t$ Stored energy level in the BESS at slot $t$ (kWh).
$P_{bc,t}$ BESS charging power at slot $t$ (kW).
$P_{bd,t}$ BESS discharging power at slot $t$ (kW).
$Q_{th,t}$ Stored thermal energy in CWT at slot $t$ (kWh).
$P_{tc,t}$ CWT charging power at slot $t$ (kW).
$P_{td,t}$ CWT discharging power at slot $t$ (kW).
$H_t$ Storage level of hydrogen tank at slot $t$ (Nm$^3$).

This article has been accepted for publication in IEEE Transactions on Smart Grid. This is the author's version which has not been fully edited and

content may change prior to final publication. Citation information: DOI 10.1109/TSG.2022.3197657

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Nanjing Univ of Post & Telecommunications. Downloaded on August 10,2022 at 22:50:32 UTC from IEEE Xplore. Restrictions apply.


$P_{el,t}$ Charging power of electrolyzer at slot $t$ (kW).
$P_{fc,t}$ Discharging power of fuel cell at slot $t$ (kW).
$Q_{fc,t}$ Thermal output power of fuel cell at slot $t$ (kWh).
$P_{sp,i,t}$ Thermal input power for building $i$ at slot $t$ (kW).
$\beta_{in,i,t}$ Building indoor temperature at slot $t$ (°C).
$\beta_{out,t}$ Outdoor temperature at slot $t$ (°C).
$\varrho_{i,t}$ Random thermal disturbance at slot $t$ (°C).
$P_{buy,t}$ Purchasing power of the HBMES at slot $t$ (kW).
$P_{sell,t}$ Selling power of the HBMES at slot $t$ (kW).
$P_{load,t}$ Power demand at slot $t$ (kW).
$P_g^{\max}$ Maximum transaction power (kW).
$v_t$ Buying electricity price (RMB/kWh).
$\tau_t$ Selling electricity price (RMB/kWh).
$C_{1,t}$ Energy cost of electricity buying or selling (RMB).
$C_{2,t}$ Carbon emission cost (RMB).
$C_{3,t}$ BESS depreciation cost at slot $t$ (RMB).
$C_{4,t}$ HESS related cost at slot $t$ (RMB).
$C_{5,t}$ CWT depreciation cost at slot $t$ (RMB).
$C_{6,t}$ Gas purchasing cost at slot $t$ (RMB).
$\delta_x^{on}$ Operation cost of component $x$ ($x \in \{el, fc\}$).
$\delta_x^{su}$ Startup cost of component $x$ ($x \in \{el, fc\}$).
$\delta_x^{sd}$ Shutdown cost of component $x$ ($x \in \{el, fc\}$).
$\mu_{e,t}$ Carbon emission rate at slot $t$ (kg/kWh).
$\lambda_{g,t}$ Gas price at slot $t$ (RMB/kWh).
$X_{B,t}$ BESS virtual queue length at slot $t$.
$X_{H,t}$ HESS virtual queue length at slot $t$.
$F_i(\cdot)$ Thermal dynamics model of building $i$.
$o_{i,t}$ Local observation of agent $i$ at slot $t$.
$a_{i,t}$ Action of agent $i$ at slot $t$.
$r_{th,i,t}$ Reward of agent $i$ at slot $t$.
$\Lambda_t$ One-slot conditional Lyapunov drift at slot $t$.

I. INTRODUCTION

Buildings account for a large portion of total energy con-

sumption and total carbon emission in the world. For example,

global buildings consumed about 30% of the total energy and

generated about 28% of the total carbon emission in 2019

[1]. Since the global energy supply mainly depends on fossil

fuels, energy and environmental issues are incurred [2]. Due

to many advantages (e.g., no pollution, extensive sources,

convenient storage and transportation), hydrogen energy has

attracted widespread attention and is recognized as a promising

alternative to fossil fuels [2]–[4]. Moreover, the coordination

of hydrogen energy storage system (HESS) and other energy

storage systems (ESSs) (e.g., thermal energy storage and elec-

tric energy storage) contributes to the improvement of building

energy efﬁciency [2]. Therefore, it is of great importance to

optimize the operation of a hydrogen-based building multi-

energy system (HBMES) [5], [6].

In the literature, many approaches have been used for the

planning or operation of multi-energy systems, e.g., mixed-

integer linear programming (MILP) [7], nonconvex quadrati-

cally constrained programming [8], stochastic programming

[9] [10], robust optimization [11] [12] [13], Benders de-

composition [14], model-predictive control (MPC) [15] [16],

and deep reinforcement learning (DRL) [17]. Although some

efforts have been made, the above-mentioned studies did

not consider the utilization of hydrogen energy storage. To

promote the development of hydrogen energy storage, some

works have investigated the optimal planning or operation

problem of hydrogen-based multi-energy systems [2] [3] [5],

[6], [18], [19] and adopted many optimization approaches,

e.g., MILP [2], two-stage stochastic programming [5], mixed

integer programming [19], two-stage robust optimization [3],

distributed optimization [6], and DRL [20]. In existing works

on the optimal operation of hydrogen-based multi-energy

systems, building thermal dynamics and thermal comfort of

occupants are neglected, which means that the ﬂexibility of

building thermal loads cannot be utilized for reducing system

operational costs.

Based on the above observation, we investigate an op-

timal operation problem related to an HBMES with the

consideration of building thermal dynamics. To be speciﬁc,

we intend to minimize the long-term operational cost of

an HBMES by intelligently scheduling thermal loads and

various ESSs, including hydrogen, thermal, and electric ESSs.

However, several challenges are involved in achieving the

above aim. Firstly, there are many uncertain parameters, e.g.,

renewable generation output, electric load, electricity price,

outdoor temperature, and carbon emission rate. Secondly,

there are spatially coupled operational constraints related to

power balance and heat balance. Thirdly, there are temporally

coupled operational constraints related to several ESSs and

indoor temperature. Fourthly, there are nonlinear constraints

related to power transactions and ESS operations. Finally, it is

difﬁcult to obtain explicit building thermal dynamics models

that are accurate and efﬁcient enough for building control. To

overcome these challenges, we propose a solving algorithm

based on Lyapunov optimization techniques (LOT) [21] and

multi-agent deep reinforcement learning (MADRL) [22]. The

key idea of the proposed algorithm is to transform the long-

term operational cost minimization problem using LOT into

several single-slot subproblems and solve these subproblems

using convex optimization and multi-agent attention-based

deep deterministic policy gradient (MAADDPG), which can

utilize the advantages of both model-based optimization and data-driven learning.

The main contributions of this paper are summarized as

follows.

•Taking hydrogen/thermal/electric energy storage, inex-

plicit building thermal dynamics model, and thermal

comfort into consideration, we formulate an expected op-

erational cost minimization problem under uncertainties,

where operational cost consists of electricity cost, carbon

emission penalty, natural gas purchasing cost, and ESS

operation costs.

•We propose an online operation algorithm with a polyno-

mial time computational complexity to solve the formu-

lated problem based on LOT and MAADDPG. Moreover,

we analyze the algorithmic feasibility and closed-form

expressions of hyper-parameters for controlling ESSs.

Note that the proposed algorithm does not require any

prior knowledge of uncertain parameters and explicit

building thermal dynamics models.

•Simulation results based on real-world traces show that


the proposed algorithm can reduce average operation cost

by 26.35%-37.11% while maintaining comfortable indoor

temperature ranges compared with a rule-based scheme

and several DRL-based schemes.

The rest of this paper is organized as follows. In Section II,

we introduce related works. In Section III, we describe the

system model and formulate an expected operational cost

minimization problem. In Section IV, we propose an operation

algorithm to solve the formulated problem. In Section V,

algorithm feasibility and key control parameters are analyzed.

In Section VI, performance evaluations are conducted. Finally,

we draw a conclusion and point out future work in Section VII.

II. RELATED WORKS

There have been many studies on the planning or opera-

tion of hydrogen-based multi-energy systems. For example,

Liu et al. investigated the optimal planning problem of a

hydrogen-based multi-energy system, which was formulated

by MILP and solved by Gurobi optimizer [2]. Similarly, Pan

et al. studied the optimal planning problem for an electricity-hydrogen integrated energy system considering seasonal storage based on two-stage robust optimization [3]. Different

from the planning problem, the operation problem mainly

focuses on optimal scheduling of distributed resources for

system operational cost reduction under the given resource

conﬁgurations, e.g., generation capacity and storage capacity.

In [5], Langeroudi et al. studied the optimal operation of

power, heat, and hydrogen-based microgrid with the consider-

ation of a plug-in electric vehicle and proposed an operation

algorithm based on two-stage stochastic programming. In [6],

Langeroudi et al. proposed a distributed optimization method

for integrated electricity and hydrogen energy sharing so

that the total social welfare caused by energy dispatching

could be maximized. Since the above-mentioned operation

methods need to know the prior information of uncertain

parameters (e.g., predicted values, probability distribution, and

maximum/minimum values), DRL-based operation methods

have been proposed, which can operate without requiring any

prior knowledge of uncertain parameters. In [23], François-Lavet et al. proposed a deep Q-network (DQN)-based algorithm to schedule electric and hydrogen ESSs for

minimizing overall levelized energy cost without knowing

future information about electricity consumption and solar

generation. Since sustainability is also an important metric for

energy systems, Desportes et al. studied the carbon impact

minimization problem in an electric/hydrogen hybrid energy

storage system based on deep deterministic policy gradient

(DDPG) algorithm [20]. Although some advances have been

made in the above-mentioned studies, they did not consider

building thermal dynamics, which means that the ﬂexibility of

thermal loads cannot be utilized for operational cost reduction.

To overcome the limitations in existing works, we inves-

tigate an expected operational cost minimization problem of

a hydrogen-based multi-energy system with the consideration

of building thermal dynamics under uncertainties. Due to the

existence of many challenges caused by uncertain parame-

ters, spatially and temporally coupled constraints, nonlinear

constraints, and inexplicit building thermal dynamics models,

we propose a solving algorithm based on LOT and MADRL.

Recently, some works have used LOT and DRL in the ﬁeld of

edge computing. For example, Bi et al. proposed Lyapunov-

guided DRL for stable online computation ofﬂoading, where

DRL is used for solving the mixed integer non-linear program-

ming (MINLP) subproblem [24]. In [25], Dai et al. proposed

a method for stochastic computation ofﬂoading in digital twin

networks based on LOT and asynchronous actor-critic (AAC),

where the AAC algorithm is adopted for solving single-

slot subproblems. In [26], Zhuang et al. proposed a method

to solve the network routing problem in multi-access edge

computing based on LOT and DQN. The differences between

these methods and our algorithm are summarized as follows.

Firstly, DRL methods in their studies were adopted for solving

deterministic single-slot subproblems obtained by LOT. In

contrast, the single-slot subproblems in this paper have inex-

plicit constraints related to building thermal dynamics models.

Secondly, the feasible hyper-parameters used for controlling

virtual queues and costs were not derived in existing studies,

while we analyze closed-form expressions of hyper-parameters

in LOT. Thirdly, we design a solving approach for each single-slot subproblem in this paper by exploiting its special structure

and decompose the subproblem into two parts, which can be

solved by linear programming and MAADDPG, respectively.

Consequently, the proposed algorithm has better performance

than existing learning-based methods (e.g., DQN, MADDPG

[27]).

III. SYSTEM MODEL AND PROBLEM FORMULATION

We consider an HBMES in Fig. 1, where the main grid,

photovoltaic (PV) generation, battery energy storage system

(BESS), electrical load, electrolyzer, hydrogen tank, fuel cell,

gas boiler, cold water tank (CWT), and thermal loads can

be identiﬁed. Among these components, there are four kinds

of energy ﬂows, i.e., electricity ﬂow, hydrogen ﬂow, heat

ﬂow, and cooling ﬂow. In electricity ﬂow, electrical load (e.g.,

electric vehicles, electric water heaters, and computers) can be

served by the main grid, PV generators, BESS, and fuel cell.

Moreover, it can be seen that hydrogen ﬂow appears in HESS,

which consists of an electrolyzer, a hydrogen tank, and a fuel

cell. To be speciﬁc, the hydrogen generated by the electrolyzer

can be stored in the hydrogen tank, which will discharge

hydrogen to drive the fuel cell for generating electricity and

heat simultaneously. The heat generated by the fuel cell and

gas boiler can be transformed into cold water by an absorption

chiller (AC). Next, cold water can be stored in CWT and used

for cooling buildings. In the following parts, we ﬁrst introduce

the models related to PV generation, gas boiler, energy storage,

thermal load, power/energy balance, and operational cost.

Then, we formulate an expected operational cost minimization

problem related to the HBMES.

A. PV Generation Model

Let $P_{pv,t}$ be the maximum generation output of the PV system at slot $t$. Then, its value can be estimated by [28]

$$P_{pv,t} = \eta_{pv} h_{pv} \varsigma_t, \tag{1}$$


Fig. 1. Illustration of an HBMES

where $\eta_{pv}$ denotes the PV system generation efficiency, $h_{pv}$ is the total radiation area of solar panels, and $\varsigma_t$ is the solar radiation intensity at slot $t$.
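As a minimal sketch of Eq. (1), the function below evaluates the PV output for one slot. The function name, the argument names, and the numerical values are hypothetical. The division by 1000 is an assumption inferred from the nomenclature, which lists the irradiance in W/m$^2$ and the PV output in kW.

```python
def pv_output_kw(eta_pv: float, h_pv_m2: float, irradiance_w_per_m2: float) -> float:
    """Eq. (1): P_pv,t = eta_pv * h_pv * varsigma_t.

    The division by 1000 converts W to kW; this unit conversion is an
    assumption based on the units listed in the nomenclature.
    """
    if not 0.0 <= eta_pv <= 1.0:
        raise ValueError("eta_pv must lie in [0, 1]")
    return eta_pv * h_pv_m2 * irradiance_w_per_m2 / 1000.0


# Hypothetical values for illustration only (not taken from the paper).
p_pv = pv_output_kw(eta_pv=0.2, h_pv_m2=100.0, irradiance_w_per_m2=800.0)
print(f"P_pv = {p_pv:.2f} kW")
```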

B. Gas Boiler Model

Let $P_{gb,t}$ denote the heat power output of the gas boiler at slot $t$. Then, we have

$$0 \le P_{gb,t} \le P_{gb}^{\max}, \tag{2}$$

where $P_{gb}^{\max}$ is the maximum heat power output of the gas boiler.

C. Energy Storage Model

1) Battery Energy Storage Model: Let $B_t$ be the stored energy level in the BESS at slot $t$. Then, the dynamics of energy level in BESS can be described by [29]

$$B_{t+1} = B_t + \left(\eta_{bc} P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\right)\Delta t, \tag{3}$$

where $\eta_{bc}$ and $\eta_{bd}$ are the charging and discharging efficiency coefficients, respectively; $P_{bc,t}$ and $P_{bd,t}$ are charging power and discharging power of BESS, respectively; $\Delta t$ denotes the duration of time slot $t$.

To ensure that the energy level of the BESS fluctuates within a normal range at any time, we have

$$B^{\min} \le B_t \le B^{\max}, \tag{4}$$

where $B^{\min}$ and $B^{\max}$ are the minimum and maximum energy levels of BESS, respectively.

Let $P_{bc}^{\max}$ and $P_{bd}^{\max}$ be the maximum charging power and maximum discharging power, respectively. Then, we have

$$0 \le P_{bc,t} \le P_{bc}^{\max}, \tag{5}$$

$$0 \le P_{bd,t} \le P_{bd}^{\max}. \tag{6}$$

Taking the round-trip inefficiency into consideration, simultaneous charging and discharging are not allowed (note that the nonlinear constraint (7) can be removed for the purpose of simplifying BESS dispatch, and related methods can be found in [30]). Then, we have

$$P_{bc,t} \cdot P_{bd,t} = 0. \tag{7}$$
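The BESS constraints (3)-(7) can be sketched as a one-slot state update that rejects infeasible actions. This is only an illustration under stated assumptions: the function name, the argument names, and all numerical values are hypothetical, and raising an exception on violation is a design choice of this sketch rather than part of the model.

```python
def bess_step(b_t, p_bc, p_bd, *, eta_bc, eta_bd, dt,
              b_min, b_max, p_bc_max, p_bd_max):
    """Advance the BESS energy level by one slot using Eq. (3).

    Raises ValueError if the action violates (5)-(7) or if the resulting
    energy level leaves the range required by (4).
    """
    if not (0.0 <= p_bc <= p_bc_max and 0.0 <= p_bd <= p_bd_max):
        raise ValueError("power limits (5)-(6) violated")
    if p_bc * p_bd != 0.0:
        raise ValueError("simultaneous charging and discharging violates (7)")
    b_next = b_t + (eta_bc * p_bc - p_bd / eta_bd) * dt
    if not (b_min <= b_next <= b_max):
        raise ValueError("energy level bounds (4) violated")
    return b_next


# Hypothetical parameters: charge at 10 kW for a 1-hour slot.
b_next = bess_step(50.0, 10.0, 0.0, eta_bc=0.9, eta_bd=0.9, dt=1.0,
                   b_min=10.0, b_max=100.0, p_bc_max=20.0, p_bd_max=20.0)
print(f"B_(t+1) = {b_next:.2f} kWh")
```

The same update structure applies to the CWT dynamics in (8) and the hydrogen tank dynamics in (13), with the corresponding efficiencies or conversion coefficients substituted.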

2) Thermal Energy Storage Model: Let $Q_{th,t}$ be the stored thermal energy in CWT at slot $t$. Then, its dynamics can be described by

$$Q_{th,t+1} = Q_{th,t} + \left(P_{tc,t}\eta_{tc} - \frac{P_{td,t}}{\eta_{td}}\right)\Delta t, \tag{8}$$

where $\eta_{tc}$ and $\eta_{td}$ are the injection efficiency and release efficiency of CWT, respectively; $P_{tc,t}$ and $P_{td,t}$ are the injected power and released power at slot $t$, respectively.

To ensure the normal operation of CWT, the following operational constraints should be satisfied, i.e.,

$$0 \le Q_{th,t} \le Q_{th}^{\max}, \tag{9}$$

$$0 \le P_{td,t} \le P_{td}^{\max}, \tag{10}$$

$$0 \le P_{tc,t} \le P_{tc}^{\max}, \tag{11}$$

$$P_{td,t} \cdot P_{tc,t} = 0, \tag{12}$$

where $Q_{th}^{\max}$ denotes the capacity of the CWT; $P_{td}^{\max}$ and $P_{tc}^{\max}$ are the maximum released power and injected power, respectively; (9) denotes that the stored thermal energy level should fluctuate within a feasible range; (10) and (11) denote the effective range of released power and injected power, respectively; (12) means that releasing and injecting cold water cannot happen simultaneously so that meaningless thermal loss can be avoided.

3) Hydrogen Energy Storage Model: Let $H_t$ be the storage level of hydrogen in the tank at slot $t$ (in Nm$^3$). Then, the dynamics of hydrogen storage level can be described by [31]

$$H_{t+1} = H_t + \left(\omega_{el} P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\right)\Delta t, \tag{13}$$

where $P_{el,t}$ and $P_{fc,t}$ are the charging power of the electrolyzer and the discharging power of the fuel cell at slot $t$, respectively; $\omega_{el}$ (in Nm$^3$/kWh) and $\omega_{fc}$ (in kWh/Nm$^3$) denote the conversion coefficients of the electrolyzer and fuel cell, respectively.

Since the maximum storage level of the hydrogen tank is limited by its tolerable tank pressure [2], we have

$$0 \le H_t \le H^{\max}, \tag{14}$$

where $H^{\max}$ is the storage capacity of the hydrogen tank.

To maintain the efficiency of the HESS, we assume that the electrolyzer and fuel cell cannot operate simultaneously. Then, we have

$$P_{el,t} \cdot P_{fc,t} = 0. \tag{15}$$

In addition, the power consumption of the electrolyzer and the electric power output of the fuel cell should satisfy the following physical constraints, which can be given by

$$0 \le P_{el,t} \le P_{el}^{\max}, \tag{16}$$

$$0 \le P_{fc,t} \le P_{fc}^{\max}, \tag{17}$$

where $P_{el}^{\max}$ and $P_{fc}^{\max}$ are the rated powers of the electrolyzer and fuel cell, respectively.

Since the fuel cell generates electricity and heat simultaneously, the electrical output power of the fuel cell $P_{fc,t}$ is coupled with the corresponding thermal output power $Q_{fc,t}$. Then, we have [2]

$$Q_{fc,t} = \eta_{hr}\eta_{h2e} P_{fc,t}\Delta t, \tag{18}$$

where $\eta_{h2e}$ and $\eta_{hr}$ are the heat-to-electricity ratio and the heat recovery efficiency, respectively.
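The HESS relations (13)-(18) can be sketched as a single step that returns both the next tank level and the recovered heat. The class name `HessParams`, the function name `hess_step`, and all parameter values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HessParams:
    omega_el: float   # electrolyzer conversion coefficient (Nm3/kWh)
    omega_fc: float   # fuel cell conversion coefficient (kWh/Nm3)
    h_max: float      # tank capacity (Nm3)
    p_el_max: float   # rated electrolyzer input power (kW)
    p_fc_max: float   # rated fuel cell output power (kW)
    eta_h2e: float    # heat-to-electricity ratio
    eta_hr: float     # heat recovery efficiency


def hess_step(h_t, p_el, p_fc, dt, p: HessParams):
    """Return (H_{t+1}, Q_fc,t) from Eqs. (13) and (18), checking (14)-(17)."""
    if p_el * p_fc != 0.0:
        raise ValueError("electrolyzer and fuel cell both active, violates (15)")
    if not (0.0 <= p_el <= p.p_el_max and 0.0 <= p_fc <= p.p_fc_max):
        raise ValueError("rated power limits (16)-(17) violated")
    h_next = h_t + (p.omega_el * p_el - p_fc / p.omega_fc) * dt
    if not (0.0 <= h_next <= p.h_max):
        raise ValueError("tank level bounds (14) violated")
    q_fc = p.eta_hr * p.eta_h2e * p_fc * dt
    return h_next, q_fc


# Hypothetical parameters: run the fuel cell at 3 kW for a 1-hour slot.
params = HessParams(omega_el=0.2, omega_fc=1.5, h_max=50.0,
                    p_el_max=10.0, p_fc_max=10.0, eta_h2e=1.2, eta_hr=0.8)
h_next, q_fc = hess_step(10.0, 0.0, 3.0, 1.0, params)
print(f"H_(t+1) = {h_next:.2f} Nm3, Q_fc = {q_fc:.2f} kWh")
```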

D. Thermal Load Model

Let $P_{sp,i,t}$ be the thermal input power for cooling demand in building $i$ at slot $t$, which will affect the building indoor temperature $\beta_{in,i,t}$. To provide a comfortable temperature range for occupants in building $i$, the following constraints should be satisfied [29], i.e.,

$$\beta_i^{\min} \le \beta_{in,i,t} \le \beta_i^{\max}, \tag{19}$$

$$\beta_{in,i,t+1} = F_i(P_{sp,i,t}, \beta_{out,t}, \beta_{in,i,t}, \varrho_{i,t}), \tag{20}$$

$$0 \le P_{sp,i,t} \le P_{sp,i}^{\max}, \tag{21}$$

where $\beta_i^{\min}$ and $\beta_i^{\max}$ are the lower limit and upper limit of the comfortable temperature range in building $i$, respectively; $\beta_{out,t}$ and $\varrho_{i,t}$ are the outdoor temperature and random thermal disturbance at slot $t$, respectively; $F_i(\cdot)$ denotes a thermal dynamics model of building $i$, and $P_{sp,i}^{\max}$ denotes the maximum thermal input power in building $i$.

E. Power/Energy Balance Model

To maintain the electric power balance at each slot $t$, we have

$$P_{buy,t} + P_{pv,t} + P_{fc,t} + P_{bd,t} = P_{sell,t} + P_{el,t} + P_{load,t} + P_{bc,t}, \tag{22}$$

where $P_{buy,t}$ and $P_{sell,t}$ represent the purchasing power and selling power of the HBMES at slot $t$, respectively; $P_{load,t}$ denotes the power demand at slot $t$. Moreover, we assume that simultaneous purchasing and selling electricity is not permitted, i.e.,

$$P_{buy,t} \cdot P_{sell,t} = 0. \tag{23}$$

Since electricity transactions between HBMES and main grid are limited by transmission line capacities, we have

$$0 \le P_{buy,t} \le P_g^{\max}, \tag{24}$$

$$0 \le P_{sell,t} \le P_g^{\max}, \tag{25}$$

where $P_g^{\max}$ denotes the maximum transaction power.
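One consequence of (22)-(25) is that, once the other electrical quantities in a slot are fixed, the grid exchange follows from the net demand: a positive net demand is bought and a negative one is sold. The sketch below illustrates this reading; the function name and the numbers are hypothetical, and it is not the scheduling method proposed in the paper.

```python
def grid_exchange(*, p_pv, p_fc, p_bd, p_el, p_load, p_bc, p_g_max):
    """Return (P_buy, P_sell) satisfying (22)-(25) for fixed other quantities."""
    # Net demand = right-hand side of (22) without P_sell
    #              minus left-hand side of (22) without P_buy.
    net = (p_el + p_load + p_bc) - (p_pv + p_fc + p_bd)
    if abs(net) > p_g_max:
        raise ValueError("required grid exchange exceeds P_g^max, (24)-(25)")
    p_buy = max(net, 0.0)
    p_sell = max(-net, 0.0)
    return p_buy, p_sell  # at most one is nonzero, so (23) holds


# Hypothetical slot: 10 kW load, 5 kW PV, 2 kW battery discharge.
p_buy, p_sell = grid_exchange(p_pv=5.0, p_fc=0.0, p_bd=2.0,
                              p_el=0.0, p_load=10.0, p_bc=0.0, p_g_max=20.0)
print(p_buy, p_sell)
```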

Similarly, thermal energy balance at each slot $t$ can be depicted by the following constraint, i.e.,

$$Q_{fc,t}\eta_{h2c} \ge \left(P_{tc,t} + \sum_{i=1}^{N} P_{sp,i,t} - P_{td,t} - P_{gb,t}\eta_{h2c}\right)\Delta t, \tag{26}$$

where $N$ denotes the number of buildings and $\eta_{h2c}$ denotes AC transformation efficiency from heating to cooling.
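A minimal sketch of checking constraint (26) is given below. It returns the slack (left-hand side minus right-hand side), so a nonnegative value means the cooling supply covers the cooling demand in that slot. The function name and all numbers are hypothetical.

```python
def cooling_balance_slack(*, q_fc, p_tc, p_sp, p_td, p_gb, eta_h2c, dt):
    """Left-hand side minus right-hand side of (26).

    p_sp is a sequence holding P_sp,i,t for the N buildings.
    """
    lhs = q_fc * eta_h2c
    rhs = (p_tc + sum(p_sp) - p_td - p_gb * eta_h2c) * dt
    return lhs - rhs


# Hypothetical slot with two buildings.
slack = cooling_balance_slack(q_fc=2.0, p_tc=0.0, p_sp=[3.0, 2.0],
                              p_td=1.0, p_gb=8.0, eta_h2c=0.5, dt=1.0)
print("(26) satisfied" if slack >= 0.0 else "(26) violated")
```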

F. Operational Cost Model

The operational cost of the HBMES consists of six parts, i.e., the energy cost of electricity buying or selling $C_{1,t}$, carbon emission cost $C_{2,t}$, BESS depreciation cost $C_{3,t}$, HESS related cost $C_{4,t}$, CWT depreciation cost $C_{5,t}$, and gas purchasing cost $C_{6,t}$.

Let $v_t$ and $\tau_t$ be buying and selling prices of electricity, respectively. Then, $C_{1,t}$ is expressed by

$$C_{1,t} = (v_t P_{buy,t} - \tau_t P_{sell,t})\Delta t. \tag{27}$$

Let $\mu_{e,t}$ (in kg/kWh) be the carbon emission rate of the main grid at slot $t$. Then, the carbon emission generated by the HBMES at slot $t$ can be given by $\mu_{e,t} P_{g,t} \Delta t$. Then, the carbon emission cost is calculated by [19]

$$C_{2,t} = \mu_c \mu_{e,t} P_{g,t} \Delta t, \tag{28}$$

where $\mu_c$ is a weighted parameter in RMB/kg, which denotes the importance of carbon emission with respect to energy cost.

Since too frequent charging or discharging will damage the life of the BESS, BESS depreciation cost is adopted [29], i.e.,

$$C_{3,t} = \psi_{BESS}(P_{bc,t} + P_{bd,t}), \tag{29}$$

where $\psi_{BESS}$ is the battery depreciation coefficient in RMB/kW.

According to [32], the startup and shutdown cycles have degradation effects on electrolyzer and fuel cell. Thus, startup and shutdown costs are considered in this paper. Let $\delta_x^{on}$, $\delta_x^{su}$, and $\delta_x^{sd}$ be the operation cost, startup cost, and shutdown cost of component $x$ ($x \in \{el, fc\}$) in HESS, respectively, where “el” and “fc” denote electrolyzer and fuel cell, respectively. Then, $C_{4,t}$ can be calculated by [32]

$$C_{4,t} = \sum_{x \in \{el, fc\}} \left(\delta_x^{on} I_{x,t}^{on} + \delta_x^{su} I_{x,t}^{su} + \delta_x^{sd} I_{x,t}^{sd}\right), \tag{30}$$

where $I_{x,t}^{on}$, $I_{x,t}^{su}$, and $I_{x,t}^{sd}$ are logical indicator variables related to ON/OFF state, startup state, and shutdown state of component $x$, respectively; $I_{x,t}^{su} = \max\{I_{x,t}^{on} - I_{x,t-1}^{on}, 0\}$ and $I_{x,t}^{sd} = \max\{I_{x,t-1}^{on} - I_{x,t}^{on}, 0\}$.
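The indicator logic in (30) can be sketched as follows: the startup and shutdown indicators are derived from the ON/OFF states of two consecutive slots. The function name, the dictionary layout, and the cost values are hypothetical.

```python
def hess_cost(on_now, on_prev, delta_on, delta_su, delta_sd):
    """Evaluate C_4,t in Eq. (30).

    Each argument is a dict keyed by component name "el" or "fc".
    on_now and on_prev hold the 0/1 ON states at slots t and t-1.
    """
    total = 0.0
    for x in ("el", "fc"):
        i_on = on_now[x]
        i_su = max(on_now[x] - on_prev[x], 0)   # 1 only on an OFF-to-ON change
        i_sd = max(on_prev[x] - on_now[x], 0)   # 1 only on an ON-to-OFF change
        total += delta_on[x] * i_on + delta_su[x] * i_su + delta_sd[x] * i_sd
    return total


# Hypothetical slot: the electrolyzer starts up and the fuel cell shuts down.
c4 = hess_cost(on_now={"el": 1, "fc": 0}, on_prev={"el": 0, "fc": 1},
               delta_on={"el": 1.0, "fc": 2.0},
               delta_su={"el": 5.0, "fc": 6.0},
               delta_sd={"el": 3.0, "fc": 4.0})
print(f"C_4 = {c4:.2f} RMB")
```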

Similar to BESS, CWT depreciation cost can be captured by [29]

$$C_{5,t} = \psi_{CWT}(P_{tc,t} + P_{td,t}), \tag{31}$$

where $\psi_{CWT}$ is the CWT depreciation coefficient in RMB/kW.

Let $\eta_{gb}$ and $\lambda_{g,t}$ be gas-to-heat conversion efficiency and gas price (in RMB/kWh), respectively. Then, the gas purchasing cost at slot $t$ can be given by [17]

$$C_{6,t} = \lambda_{g,t} \frac{P_{gb,t}\Delta t}{\eta_{gb}}. \tag{32}$$
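As a rough illustration of how the six cost terms (27)-(32) combine in one slot, the sketch below evaluates each term from given decisions. The function name and all values are hypothetical. $C_{4,t}$ is passed in directly, and `p_g` stands for $P_{g,t}$ in (28), which is supplied by the caller because the symbol is not defined further in this section.

```python
def slot_costs(*, v_t, tau_t, p_buy, p_sell, mu_c, mu_e, p_g,
               psi_bess, p_bc, p_bd, c4, psi_cwt, p_tc, p_td,
               lam_g, p_gb, eta_gb, dt):
    """Return the six cost components of one slot as a dict."""
    return {
        "C1": (v_t * p_buy - tau_t * p_sell) * dt,   # Eq. (27)
        "C2": mu_c * mu_e * p_g * dt,                # Eq. (28)
        "C3": psi_bess * (p_bc + p_bd),              # Eq. (29)
        "C4": c4,                                    # Eq. (30), computed elsewhere
        "C5": psi_cwt * (p_tc + p_td),               # Eq. (31)
        "C6": lam_g * p_gb * dt / eta_gb,            # Eq. (32)
    }


# Hypothetical values for a 1-hour slot.
costs = slot_costs(v_t=1.0, tau_t=0.5, p_buy=10.0, p_sell=0.0,
                   mu_c=0.1, mu_e=0.5, p_g=10.0,
                   psi_bess=0.01, p_bc=4.0, p_bd=0.0, c4=0.0,
                   psi_cwt=0.01, p_tc=2.0, p_td=0.0,
                   lam_g=0.3, p_gb=9.0, eta_gb=0.9, dt=1.0)
total_cost = sum(costs.values())
print(f"total slot cost = {total_cost:.2f} RMB")
```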

G. Expected Operational Cost Minimization Problem

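Based on the above models, we can formulate an expected operational cost minimization problem of an HBMES as follows,

$$(\mathbf{P1}) \quad \min \ \limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left\{\sum_{j=1}^{6} C_{j,t}\right\} \tag{33a}$$

$$\text{s.t.} \quad (1)-(26), \tag{33b}$$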


where the expectation operator $\mathbb{E}$ is taken over the randomness of system parameters (i.e., PV generation output $P_{pv,t}$, power demand $P_{load,t}$, carbon emission rate $\mu_{e,t}$, thermal load $Q_{load,t}$, buying/selling price $v_t$/$\tau_t$), and possible stochastic control decisions (i.e., $P_{gb,t}$, $P_{bc,t}$, $P_{bd,t}$, $P_{tc,t}$, $P_{td,t}$, $P_{el,t}$, $P_{fc,t}$, $P_{sp,i,t}|_{1 \le i \le N}$, $P_{buy,t}$, and $P_{sell,t}$).

IV. THE PROPOSED OPERATION ALGORITHM

Solving P1 is a nontrivial task due to the following reasons.

Firstly, there are many uncertain parameters and it is often

difﬁcult to know the statistical distributions of all combinations

in practice. Secondly, there are several temporally coupled

operational constraints (e.g., (3), (8), (13), and (20)). Thirdly,

there are some spatially coupled operational constraints (e.g.,

(22), (26)). Finally, it is challenging to obtain an explicit

building thermal dynamics model Fi(·)that is accurate and

efﬁcient enough for building control [33].

To address the ﬁrst challenge, some methods can be

adopted, e.g., stochastic programming, robust optimization,

and model predictive control. However, these methods either

need to know prior knowledge (e.g., probability distribution,

maximum and minimum values) of uncertain parameters or

predict/approximate random parameters. To deal with the

second challenge, typical methods are based on dynamic

programming, which suffers from “the curse of dimension-

ality” problem. When LOT is adopted, temporally coupled

operational constraints could be decoupled and an online

algorithm can be designed without knowing any prior knowl-

edge of uncertain parameters. However, due to the existence

of inexplicit building thermal dynamics models, the above-

mentioned online algorithm can not be realized. To overcome

the above-mentioned challenges, many model-free DRL meth-

ods can be adopted [34] [35], which can enable agents to learn

optimal policies from the process of interacting with building

environments. Once optimal policies are learned, they can

operate without knowing any prior information about uncertain

parameters and explicit building thermal dynamics models.

Although model-free DRL methods have some advantages,

their stability and performance may decrease with the increase

of action spaces and the number of heterogeneous agents.

Instead of solving P1 directly using DRL methods (e.g., DQN,

DDPG, MADDPG), we intend to reduce the size of action

space and the number of heterogeneous agents by exploiting

the fact that inexplicit building thermal dynamics model Fi(·)

only exists in thermal energy ﬂow. Therefore, we can propose

an operation algorithm to solve P1 based on model-based

optimization and data-driven based learning.

To be speciﬁc, the key idea of the proposed online algorithm

can be illustrated by Fig. 2, where three steps could be

identiﬁed, i.e., transformation, decomposition, and solving.

Firstly, the original problem P1 is relaxed to P2 with time-

average constraints. Next, P2 is equivalently transformed into a

queue stability problem P3. Then, we can design an operation

algorithm for P3 based on LOT, which needs to solve

an online optimization problem P4. Since there are inexplicit

building thermal dynamics models, P4 cannot be solved

directly using the model-based optimization methods. To solve

P4 efﬁciently, we decompose it into two subproblems P5

and P6. Since the premise of solving P5 is that reasonable

values of hyper-parameters $V$, $W_B$, and $W_H$ are known, we transform it into P7 with a convex objective function. Subsequently, P7 can be decomposed into eight linear programming subproblems by considering different combinations

of buying/selling electricity, BESS charging/discharging, and

HESS charging/discharging. To solve P6 efﬁciently, we re-

formulate it as a Markov game and propose a MAADDPG-

based algorithm to solve the game. Based on the above key

idea, we can design an online operation algorithm that has

a polynomial time computational complexity and does not

require any prior information of uncertain parameters and an

explicit building thermal dynamics model. In the following

parts, we will introduce three steps in detail.

Fig. 2. The key idea of the proposed algorithm.

A. Step-1: Transformation

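Before conducting the problem transformation, the following assumptions are made, which ensure that the electric-hydrogen subsystem is controllable under the framework of LOT, i.e.,

$$v^{\max} > \tau^{\max}, \tag{34}$$

$$v^{\min} > \tau^{\min}, \tag{35}$$

$$\eta_{bc}\eta_{bd}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\psi_{BESS}}{\Delta t}\Big) > \nu_1, \tag{36}$$

$$B^{\max} - B^{\min} - \eta_{bc}P_{bc}^{\max}\Delta t - \frac{P_{bd}^{\max}\Delta t}{\eta_{bd}} > 0, \tag{37}$$

$$\omega_{el}\omega_{fc}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}\Delta t}\Big) > \nu_2, \tag{38}$$

$$H^{\max} - H^{\min} - \omega_{el}P_{el}^{\max}\Delta t - \frac{P_{fc}^{\max}\Delta t}{\omega_{fc}} > 0, \tag{39}$$

where $v^{\max} = \max_t v_t$, $\tau^{\max} = \max_t \tau_t$, $v^{\min} = \min_t v_t$, $\tau^{\min} = \min_t \tau_t$, $\mu_e^{\min} = \min_t \mu_{e,t}$, $\mu_e^{\max} = \max_t \mu_{e,t}$, $\nu_1 = \tau^{\min} + \mu_c\mu_e^{\min} + \frac{\psi_{BESS}}{\Delta t}$, and $\nu_2 = \tau^{\min} + \mu_c\mu_e^{\min} + \frac{\delta_{el}^{on}}{P_{el}^{\max}\Delta t}$.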

Note that assumptions (34) and (35) are mild since the buying price $v_t$ is typically higher than the selling price $\tau_t$ at all time slots [36], i.e., making a profit by simultaneously buying electricity at a low price and selling it at a high price is unrealistic. Assumptions (36)-(39) are adopted to ensure that the control parameters $V_B^{\max}$ and $V_H^{\max}$ defined in Theorems 1 and 2 of Section V are positive.

content may change prior to final publication. Citation information: DOI 10.1109/TSG.2022.3197657


Since the LOT framework can deal with stochastic programming with time-average constraints and objectives, we transform P1 into an optimization problem with time-average constraints. Specifically, (3), (4), (13), and (14) in P1 are considered to derive the following constraints, i.e.,

$$\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}, \quad \omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}, \tag{40}$$

where $\overline{P}_x = \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[P_{x,t}]$ and $x \in \{bc, bd, el, fc\}$, which represent the time-average expected values of BESS charging, BESS discharging, HESS charging, and HESS discharging under any feasible control algorithm of P1, respectively.

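The process of obtaining (40) is explained as follows. Taking BESS as an example, summing (3) over $t \in \{0, \ldots, T-1\}$, taking expectations of both sides, dividing both sides by $T$, and taking the limit as $T \to \infty$, we have

$$\lim_{T\to\infty}\frac{\mathbb{E}[B_T - B_0]}{T} = \lim_{T\to\infty}\mathbb{E}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\Big(\eta_{bc}P_{bc,t}\Delta t - \frac{P_{bd,t}}{\eta_{bd}}\Delta t\Big)\Big] = \eta_{bc}\overline{P}_{bc}\Delta t - \frac{\overline{P}_{bd}}{\eta_{bd}}\Delta t. \tag{41}$$

According to (4), $B^{\min} - B^{\max} \le B_T - B_0 \le B^{\max} - B^{\min}$. Thus, $\lim_{T\to\infty}\frac{\mathbb{E}[B_T - B_0]}{T} = 0$, and $\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}$. In the same way, $\omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}$ can be proved.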

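Based on the above description, P1 can be relaxed to P2 as follows,

$$(\text{P2})\quad \min\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big[\sum_{j=1}^{6}C_{j,t}\Big] \tag{42a}$$

$$\text{s.t.}\quad (1), (2), (5)-(12), (15)-(26), \tag{42b}$$

$$\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}, \quad \omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}. \tag{42c}$$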

To ensure the feasibility of (42c), we construct two virtual queues related to $B_t$ and $H_t$ and make them mean rate stable. Specifically, we define the two virtual queues as $X_{B,t} = B_t + W_B$ and $X_{H,t} = H_t + W_H$, where $W_B$ and $W_H$ are constants whose values are derived in the next section. Moreover, according to (3) and (13), the dynamics of these virtual queues can be written as follows,

$$X_{B,t+1} = X_{B,t} + \Big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\Big)\Delta t, \tag{43}$$

$$X_{H,t+1} = X_{H,t} + \Big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\Big)\Delta t. \tag{44}$$

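According to (43), we have

$$X_{B,l+1} - X_{B,l} = \Big(\eta_{bc}P_{bc,l} - \frac{P_{bd,l}}{\eta_{bd}}\Big)\Delta t. \tag{45}$$

Summing (45) over $l \in \{0, \ldots, t-1\}$ for $t > 0$, we have

$$X_{B,t} - X_{B,0} = \sum_{l=0}^{t-1}\Big(\eta_{bc}P_{bc,l}\Delta t - \frac{P_{bd,l}}{\eta_{bd}}\Delta t\Big). \tag{46}$$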

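Taking expectations of both sides of (46), dividing by $t$, and taking the limit as $t \to \infty$, we have

$$\lim_{t\to\infty}\mathbb{E}\Big[\frac{1}{t}\sum_{l=0}^{t-1}\Big(\eta_{bc}P_{bc,l}\Delta t - \frac{P_{bd,l}}{\eta_{bd}}\Delta t\Big)\Big] = \lim_{t\to\infty}\frac{\mathbb{E}[X_{B,t} - X_{B,0}]}{t} = \lim_{t\to\infty}\frac{\mathbb{E}[X_{B,t}]}{t} \le \lim_{t\to\infty}\frac{|\mathbb{E}[X_{B,t}]|}{t} \le \lim_{t\to\infty}\frac{\mathbb{E}[|X_{B,t}|]}{t}, \tag{47}$$

where $X_{B,0} = B_0 + W_B$ is a constant since $B^{\min} \le B_0 \le B^{\max}$. When $X_{B,t}$ is mean rate stable, we have $\lim_{t\to\infty}\frac{\mathbb{E}[|X_{B,t}|]}{t} = 0$ according to the definition of mean rate stability [21]. The right-hand side of (47) is then zero, and since $\mathbb{E}[X_{B,t}] \ge -\mathbb{E}[|X_{B,t}|]$, the same limit is also bounded below by zero. As a result, (47) becomes $\eta_{bc}\overline{P}_{bc} = \frac{\overline{P}_{bd}}{\eta_{bd}}$. Similarly, we can prove that $\omega_{el}\overline{P}_{el} = \frac{\overline{P}_{fc}}{\omega_{fc}}$ holds if $X_{H,t}$ is mean rate stable.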

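Based on the above conclusion, P2 can be equivalently transformed into a queue stability problem P3 as follows,

$$(\text{P3})\quad \min\ \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\Big[\sum_{j=1}^{6}C_{j,t}\Big] \tag{48a}$$

$$\text{s.t.}\quad (1), (2), (5)-(12), (15)-(26), \tag{48b}$$

$$X_{B,t} \text{ and } X_{H,t} \text{ are mean rate stable}. \tag{48c}$$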

According to LOT, P3 can be solved by constructing a drift-plus-penalty function and minimizing its upper bound. To this end, we first define a Lyapunov function as follows,

$$L(t) \triangleq \frac{1}{2}\big(X_{B,t}^2 + \xi X_{H,t}^2\big), \tag{49}$$

where $\xi$ is a weighting parameter, introduced because $X_{B,t}$ and $X_{H,t}$ have different units.

Then, we can calculate the one-slot conditional Lyapunov drift as follows,

$$\Lambda_t = \mathbb{E}\{L(t+1) - L(t) \mid X(t)\}, \tag{50}$$

where $X(t) = (X_{B,t}, X_{H,t})$.

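Based on the observed $X(t)$, $L(t+1) - L(t)$ can be bounded as follows, i.e.,

$$\begin{aligned} L(t+1) - L(t) &= \frac{1}{2}\Big[X_{B,t+1}^2 - X_{B,t}^2 + \xi\big(X_{H,t+1}^2 - X_{H,t}^2\big)\Big] \\ &\le X_{B,t}\Big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\Big)\Delta t + \zeta_B + \xi X_{H,t}\Big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\Big)\Delta t + \zeta_H, \end{aligned} \tag{51}$$

where $\zeta_B = \frac{1}{2}\big(\Delta t \max\{\eta_{bc}P_{bc}^{\max}, \frac{P_{bd}^{\max}}{\eta_{bd}}\}\big)^2$ and $\zeta_H = \frac{\xi}{2}\big(\Delta t \max\{\omega_{el}P_{el}^{\max}, \frac{P_{fc}^{\max}}{\omega_{fc}}\}\big)^2$. Then, we have

$$\Lambda_t \le \zeta_B + \zeta_H + \mathbb{E}\{\Gamma_{0,t} \mid X(t)\}, \tag{52}$$

where $\Gamma_{0,t} = X_{B,t}\big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\big)\Delta t + \xi X_{H,t}\big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\big)\Delta t$.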
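The step from the equality to the inequality in (51) can be made explicit. The following is a short worked sketch for the BESS term only; $d_{B,t}$ is shorthand introduced here for the one-slot queue increment in (43), and the power bounds $0 \le P_{bc,t} \le P_{bc}^{\max}$ and $0 \le P_{bd,t} \le P_{bd}^{\max}$ are assumed to be among the operational constraints of P1. The HESS term follows the same pattern with the weight $\xi$.

```latex
\begin{aligned}
d_{B,t} &= \Big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\Big)\Delta t,
\qquad X_{B,t+1} = X_{B,t} + d_{B,t}, \\
\tfrac{1}{2}\big(X_{B,t+1}^2 - X_{B,t}^2\big)
  &= \tfrac{1}{2}\big[(X_{B,t} + d_{B,t})^2 - X_{B,t}^2\big]
   = X_{B,t}\,d_{B,t} + \tfrac{1}{2}d_{B,t}^2, \\
-\frac{P_{bd}^{\max}}{\eta_{bd}}\Delta t \le d_{B,t} \le \eta_{bc}P_{bc}^{\max}\Delta t
  &\;\Longrightarrow\;
  \tfrac{1}{2}d_{B,t}^2 \le \tfrac{1}{2}\Big(\Delta t\,\max\Big\{\eta_{bc}P_{bc}^{\max},
  \frac{P_{bd}^{\max}}{\eta_{bd}}\Big\}\Big)^2 = \zeta_B .
\end{aligned}
```

The cross term $X_{B,t}\,d_{B,t}$ is what enters $\Gamma_{0,t}$, while the squared increment is absorbed into the constant $\zeta_B$.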


Then, the drift-plus-penalty function can be derived by combining (52) with the "penalty" related to the objective function,

$$\Delta Y(t) = \Lambda_t + V\,\mathbb{E}\Big\{\sum_{i=1}^{6}C_{i,t} \,\Big|\, X(t)\Big\} \le \zeta_B + \zeta_H + \mathbb{E}\Big\{\Gamma_{0,t} + V\sum_{i=1}^{6}C_{i,t} \,\Big|\, X(t)\Big\}. \tag{53}$$

Finally, we can design an online operation algorithm (i.e., Algorithm 1) for P3 by minimizing the right-hand side of (53) subject to the constraints in P3, which is equivalent to solving P4 based on the observed states at each slot $t$, i.e.,

$$(\text{P4})\quad \min\ \Gamma_{0,t} + V\sum_{i=1}^{6}C_{i,t} \tag{54a}$$

$$\text{s.t.}\quad (1), (2), (5)-(12), (15)-(26). \tag{54b}$$

Algorithm 1: The proposed online operation algorithm for HMBES
Input: Control parameters $V$, $W_B$, $W_H$
Output: Control decisions at each slot $t$, i.e., $P_{all,t}$
1 Initialize $X_{B,0} = B_0 + W_B$ and $X_{H,0} = H_0 + W_H$.
2 for each time slot $t$ ($0 \le t \le T-1$) do
3     Observe system states: $X(t)$ and $S_{sys,t}$;
4     Solve P4 using the methods in Section IV-C;
5     $P_{all,t} = (P_{upper,t}, P_{lower,t})$;
6     Update $X_{B,t}$ and $X_{H,t}$ according to (43) and (44);
7 end

In Algorithm 1, the control parameters $V$, $W_B$, and $W_H$ are taken as algorithmic inputs, and the control decisions $P_{all,t}$ are taken as algorithmic outputs. Here, $P_{all,t} = (P_{gb,t}, P_{bc,t}, P_{bd,t}, P_{tc,t}, P_{td,t}, P_{el,t}, P_{fc,t}, P_{sp,i,t}|_{1\le i\le N}, P_{buy,t}, P_{sell,t})$. In each time slot $t$, the virtual queue length vector $X(t)$ and the system states $S_{sys,t} = (P_{pv,t}, P_{load,t}, \beta_{in,i,t}|_{1\le i\le N}, \beta_{out,t}, \mu_{e,t}, v_t, \tau_t, t)$ are observed, as shown in line 3. Then, P4 is solved based on the methods introduced in Section IV-C, i.e., obtaining $P_{upper,t} = (P_{bc,t}, P_{bd,t}, P_{el,t}, P_{fc,t}, P_{buy,t}, P_{sell,t})$ by solving P7 and obtaining $P_{lower,t} = (P_{sp,i,t}|_{1\le i\le N}, P_{tc,t}, P_{td,t}, P_{gb,t})$ using Algorithm 3. Next, the control decisions $P_{all,t} = (P_{upper,t}, P_{lower,t})$ are decided and the two virtual queues are updated. Although line 6 of Algorithm 1 ensures the feasibility of (3) and (13), the proposed online algorithm may be infeasible for P1 since (4) and (14) are neglected. However, in the next section, we prove that constraints (4) and (14) are satisfied if reasonable values of $V$, $W_B$, and $W_H$ are selected.
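The control loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `solve_p4` is a hypothetical callable standing in for the methods of Section IV-C, `Decision` is a container introduced here for the four storage-related decisions, and only the virtual queue bookkeeping of (43) and (44) is implemented.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Storage-related part of the control decision (hypothetical container)."""
    p_bc: float  # BESS charging power
    p_bd: float  # BESS discharging power
    p_el: float  # electrolyzer (HESS charging) power
    p_fc: float  # fuel cell (HESS discharging) power


def update_queue(x, p_in, p_out, eff_in, eff_out, dt):
    """Virtual queue update of (43)/(44): X <- X + (eff_in*p_in - p_out/eff_out)*dt."""
    return x + (eff_in * p_in - p_out / eff_out) * dt


def run_online(solve_p4, states, b0, h0, w_b, w_h,
               eta_bc, eta_bd, omega_el, omega_fc, dt=1.0):
    """Run the slot-by-slot loop; solve_p4 is a placeholder for Section IV-C."""
    x_b, x_h = b0 + w_b, h0 + w_h          # line 1: initialise virtual queues
    history = []
    for s in states:                        # line 2: one iteration per slot
        d = solve_p4(x_b, x_h, s)           # lines 3-5: observe and decide
        x_b = update_queue(x_b, d.p_bc, d.p_bd, eta_bc, eta_bd, dt)      # line 6
        x_h = update_queue(x_h, d.p_el, d.p_fc, omega_el, omega_fc, dt)  # line 6
        history.append((d, x_b, x_h))
    return history
```

Because $X_{B,t} = B_t + W_B$, the physical energy level can be recovered at any slot as `x_b - w_b`; the same holds for the hydrogen level with `x_h - w_h`.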

B. Step-2: Decomposition

Since an explicit building thermal dynamics model $F_i(\cdot)$ is unavailable, P4 cannot be solved using traditional optimization techniques. To solve P4 efficiently, we decompose it into two subproblems according to the availability of model information, i.e., an upper subproblem P5 related to the electric-hydrogen subsystem and a lower subproblem P6 related to the thermal subsystem. First, we solve the upper subproblem using model-based optimization. Then, its decision on $P_{fc,t}$ is taken as a state component in the lower subproblem, which is solved by MAADDPG. Specifically, P5 and P6 are given as follows.

$$(\text{P5})\quad \min\ \Gamma_{0,t} + V\sum_{i=1}^{4}C_{i,t} \tag{55a}$$

$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22)-(25). \tag{55b}$$

$$(\text{P6})\quad \min\ V(C_{5,t} + C_{6,t}) \tag{56a}$$

$$\text{s.t.}\quad (8)-(12), (18)-(21), (26). \tag{56b}$$

C. Step-3: Solving two subproblems

1) The solution to P5: P5 can be transformed into a mixed integer linear programming problem by adopting several auxiliary binary variables and linear constraints. However, solving P5 in this way presupposes that reasonable values of $V$, $W_B$, and $W_H$ are known. To facilitate the derivation of reasonable values of $V$, $W_B$, and $W_H$, we solve the following problem P7, which has a convex objective function.

$$(\text{P7})\quad \min\ \Gamma_{0,t} + V\Big(\sum_{i=1}^{3}C_{i,t} + \delta_{el}^{on}\frac{P_{el,t}}{P_{el}^{\max}} + \delta_{fc}^{on}\frac{P_{fc,t}}{P_{fc}^{\max}}\Big) \tag{57a}$$

$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22)-(25), \tag{57b}$$

where the gap between the objective function of P7 and that of P5 varies within the range $[0, \ell]$, with $\ell = \max\{\delta_{el}^{on} + \delta_{el}^{su} + \delta_{fc}^{sd},\ \delta_{fc}^{on} + \delta_{fc}^{su} + \delta_{el}^{sd},\ \delta_{el}^{sd} + \delta_{fc}^{sd}\}$. Although the optimal solution of P7 may differ from that of P5, reasonable values of $V$, $W_B$, and $W_H$ can be derived by analyzing the structure of P7 in the next section, and the proposed algorithm still shows good performance, as shown in the simulation results.

The solution of P7 can be derived by considering two cases, i.e., buying electricity and not buying electricity. In other words, the following two problems P8 and P9 should be solved. By further considering four possible combinations of BESS and HESS operations (i.e., $P_{bd,t} = P_{fc,t} = 0$, $P_{bd,t} = P_{el,t} = 0$, $P_{bc,t} = P_{fc,t} = 0$, and $P_{bc,t} = P_{el,t} = 0$), P8 and P9 can be transformed into eight linear programming subproblems. Each of them has 3 variables and 6 constraints and can be solved efficiently by the interior point method in polynomial time, namely $O(3^{3.5}L_{in})$, where $L_{in}$ denotes the number of bits of input data [38]. Finally, the optimal solution to P7 is that of the subproblem with the smallest objective function value.

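$$(\text{P8})\quad \min\ \Gamma_{1,t} \tag{58a}$$

$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22), (24), \tag{58b}$$

$$P_{sell,t} = 0, \tag{58c}$$

where $\Gamma_{1,t} = X_{B,t}\big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\big)\Delta t + \xi X_{H,t}\big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\big)\Delta t + V(v_t + \mu_c\mu_{e,t})P_{buy,t}\Delta t + V\psi_{BESS}(P_{bc,t} + P_{bd,t}) + V\big(\delta_{el}^{on}\frac{P_{el,t}}{P_{el}^{\max}} + \delta_{fc}^{on}\frac{P_{fc,t}}{P_{fc}^{\max}}\big)$.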


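$$(\text{P9})\quad \min\ \Gamma_{2,t} \tag{59a}$$

$$\text{s.t.}\quad (1), (2), (5)-(7), (15)-(17), (22), (25), \tag{59b}$$

$$P_{buy,t} = 0, \tag{59c}$$

where $\Gamma_{2,t} = X_{B,t}\big(\eta_{bc}P_{bc,t} - \frac{P_{bd,t}}{\eta_{bd}}\big)\Delta t + \xi X_{H,t}\big(\omega_{el}P_{el,t} - \frac{P_{fc,t}}{\omega_{fc}}\big)\Delta t - V(\tau_t + \mu_c\mu_{e,t})P_{sell,t}\Delta t + V\psi_{BESS}(P_{bc,t} + P_{bd,t}) + V\big(\delta_{el}^{on}\frac{P_{el,t}}{P_{el}^{\max}} + \delta_{fc}^{on}\frac{P_{fc,t}}{P_{fc}^{\max}}\big)$.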

2) The solution to P6: To solve P6 efficiently, we reformulate it as a Markov game, which is a general modeling framework for multi-agent decision-making problems under uncertainty [22]. Specifically, a Markov game with $N$ agents is defined by a set of states $\mathcal{S}$; a collection of action sets $\mathcal{A}_1, \cdots, \mathcal{A}_N$ (one for each agent in the environment); a state transition function $\mathcal{F}: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \Pi(\mathcal{S})$, which defines the probability distribution over possible next states given the current state and the actions of all agents; and a reward function for each agent $i$ ($1 \le i \le N$), $\mathcal{R}_i: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$. In a Markov game, each agent $i$ takes action $a_i \in \mathcal{A}_i$ based on its local observation $o_i \in \mathcal{O}_i$, where $o_i$ contains partial information of the global state $s \in \mathcal{S}$. The aim of agent $i$ is to maximize its own expected return by learning a policy $\pi_i: \mathcal{O}_i \to \Pi(\mathcal{A}_i)$, which maps the agent's local observation $o_i \in \mathcal{O}_i$ into a distribution over its set of actions. Here, the return is the sum of discounted rewards received over the future, i.e., $\sum_{j=0}^{\infty}\gamma^j r_{i,t+j+1}(s_t, a_{1,t}, \cdots, a_{N,t})$, where $\gamma \in [0, 1]$ is a discount factor and $r_{i,t+1} \in \mathcal{R}_i$ is the reward received by agent $i$ at slot $t$. Since the state transition function does not need to be known when solving P6 with MAADDPG, only three components (i.e., state, action, and reward function) are designed.
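As a small numerical illustration of the return defined above, the following sketch evaluates the discounted sum for a finite reward sequence. Truncating the infinite sum to the length of the supplied list is an assumption made here for illustration; the function name is not from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**j * rewards[j], evaluated backwards for numerical economy.

    `rewards` is a finite list, so this is a truncated version of the
    infinite-horizon return used in the Markov game definition.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a discount factor close to 1 (Table I uses $\gamma = 0.995$), rewards many slots ahead still carry substantial weight in the return.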

Environment State According to (19), the temperature deviation should be penalized for each agent so that the comfortable range can be maintained. Moreover, to promote coordination among all thermal-load agents, $C_{5,t}$ and $C_{6,t}$ should be considered in the reward design of each thermal-load agent. Since the temperature deviation, $C_{5,t}$, and $C_{6,t}$ depend on $\beta_{in,i,t}$, $\beta_{out,t}$, $Q_{th,t}$, and $Q_{fc,t}$, the environment state of the $i$-th thermal-load agent is designed as $o_{i,t} = (Q_{fc,t}, Q_{th,t}, \beta_{in,i,t}, \beta_{out,t}, t)$, where $Q_{fc,t}$ is obtained from (18) and $P_{fc,t}$ is obtained from the solution of P7.

Action Since each thermal-load agent needs to make a decision on $P_{sp,i,t}$, the action of the $i$-th thermal-load agent is designed as $a_{i,t} = P_{sp,i,t}$. To speed up the learning of agents, the following rule is adopted, i.e.,

$$a_{i,t} = \begin{cases} 0, & \text{if } \beta_{in,i,t} \le \beta_i^{\min} \text{ or } \beta_{out,t} \le \beta_i^{\max}, \\ P_{sp,i,t}, & \text{otherwise}. \end{cases} \tag{60}$$

Remark 1: After the actions of the thermal-load agents are taken, the actions of the gas boiler and CWT can be decided accordingly. Specifically, when $Q_{fc,t}\eta_{h2c} > \sum_{i=1}^{N}P_{sp,i,t}\Delta t$, the CWT operates in charging mode (i.e., $P_{td,t} = 0$) and the thermal power input is $P_{tc,t} = \min\big(\frac{Q_{fc,t}\eta_{h2c}}{\Delta t} - \sum_{i=1}^{N}P_{sp,i,t},\ P_{tc}^{\max},\ \frac{Q_{th}^{\max} - Q_{th,t}}{\eta_{tc}\Delta t}\big)$. In this situation, $P_{gb,t} = 0$. When $Q_{fc,t}\eta_{h2c} \le \sum_{i=1}^{N}P_{sp,i,t}\Delta t$, the CWT operates in discharging mode (i.e., $P_{tc,t} = 0$) and the thermal power output is $P_{td,t} = \min\big(\sum_{i=1}^{N}P_{sp,i,t} - \frac{Q_{fc,t}\eta_{h2c}}{\Delta t},\ P_{td}^{\max},\ \frac{Q_{th,t}\eta_{td}}{\Delta t}\big)$. In this situation, $P_{gb,t} = \min\big(\sum_{i=1}^{N}P_{sp,i,t} - \frac{Q_{fc,t}\eta_{h2c}}{\Delta t} - P_{td,t},\ P_{gb}^{\max}\big)$. Consequently, the total thermal power for cooling buildings is $P_{thermal,t} = P_{gb,t} + P_{td,t} + \frac{Q_{fc,t}\eta_{h2c}}{\Delta t}$. When $\sum_{i=1}^{N}P_{sp,i,t} > P_{thermal,t}$, the actual thermal input of building $i$ is $P_{thermal,t}\frac{P_{sp,i,t}}{\sum_{i=1}^{N}P_{sp,i,t}}$.
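The rule in Remark 1 is fully deterministic, so it can be written down directly. The sketch below is an illustration rather than the authors' code: the function name, the returned dictionary, and the default parameter values (taken from the CWT, gas boiler, and HESS rows of Table I) are choices made here.

```python
def dispatch_thermal(q_fc, q_th, p_sp, *, dt=1.0, eta_h2c=0.7, eta_tc=0.9,
                     eta_td=0.9, p_tc_max=10.0, p_td_max=10.0,
                     p_gb_max=20.0, q_th_max=50.0):
    """Decide CWT and gas boiler actions from the agents' requests p_sp.

    q_fc: heat recovered from the fuel cell in this slot (Q_fc,t)
    q_th: energy currently stored in the CWT (Q_th,t)
    p_sp: list of requested thermal powers, one per building (P_sp,i,t)
    """
    demand = sum(p_sp)
    fc_supply = q_fc * eta_h2c / dt            # fuel cell contribution as power
    if fc_supply > demand:                     # surplus: charge the CWT
        p_td, p_gb = 0.0, 0.0
        p_tc = min(fc_supply - demand, p_tc_max,
                   (q_th_max - q_th) / (eta_tc * dt))
    else:                                      # deficit: discharge, then boiler
        p_tc = 0.0
        p_td = min(demand - fc_supply, p_td_max, q_th * eta_td / dt)
        p_gb = min(demand - fc_supply - p_td, p_gb_max)
    p_thermal = p_gb + p_td + fc_supply
    if demand > p_thermal:                     # shortage: share proportionally
        delivered = [p_thermal * p / demand for p in p_sp]
    else:
        delivered = list(p_sp)
    return {"p_tc": p_tc, "p_td": p_td, "p_gb": p_gb,
            "p_thermal": p_thermal, "delivered": delivered}
```

For example, with an empty tank, no fuel cell heat, and four buildings each requesting 10 kW, the gas boiler saturates at its 20 kW limit and each building receives a proportional share of that supply.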

Reward As mentioned in the description of the state design, the reward of the $i$-th agent consists of three parts, i.e., the penalties imposed on temperature deviation, CWT depreciation cost, and gas purchasing cost. Therefore, the reward of each agent $i$ is designed as $r_{th,i,t} = -\Big((C_{5,t} + C_{6,t})\frac{P_{sp,i,t}}{\sum_{i=1}^{N}P_{sp,i,t}} + \varpi_{i,t}\Big)$, where $\varpi_{i,t} = \kappa_{th}\big([\beta_{in,i,t+1} - \beta^{\max}]^+ + [\beta^{\min} - \beta_{in,i,t+1}]^+\big)$, $\kappa_{th}$ denotes a positive penalty coefficient, and $[\cdot]^+ = \max(\cdot, 0)$.
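A minimal sketch of this reward computation is given below. The function name and argument order are illustrative. The handling of the case where all requests are zero (each cost share is set to zero to avoid a division by zero) is an assumption made here, since the text does not specify it.

```python
def thermal_rewards(c5, c6, p_sp, beta_next, beta_min, beta_max, kappa_th=1.0):
    """Per-agent rewards r_th,i,t for one slot.

    c5, c6:    CWT depreciation cost and gas purchasing cost of the slot
    p_sp:      requested thermal power of each agent
    beta_next: indoor temperature of each building at slot t+1
    """
    total = sum(p_sp)
    rewards = []
    for p, beta in zip(p_sp, beta_next):
        # Assumption: zero cost share when nobody requests cooling.
        share = p / total if total > 0 else 0.0
        # Comfort penalty: deviation above beta_max or below beta_min.
        penalty = kappa_th * (max(beta - beta_max, 0.0)
                              + max(beta_min - beta, 0.0))
        rewards.append(-((c5 + c6) * share + penalty))
    return rewards
```

Each agent is charged a share of the common costs in proportion to its own request, which ties the individual reward to the coordination objective, while the comfort penalty remains purely local.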

Fig. 3. The framework of MAADDPG.

To solve the Markov game mentioned above, a MAADDPG-based algorithm is proposed based on an attention mechanism and MADDPG, and its framework can be found in Fig. 3. Compared with MADDPG, MAADDPG has higher scalability since the output size of the attention module in Fig. 3 is fixed, i.e., it does not depend on the total number of agents. In Fig. 3, each agent consists of an actor network and a critic network. The actor network of agent $i$ takes the local observation $o_i$ as input and outputs the action $a_i$. The critic network of agent $i$ takes $o_i$, $a_i$, and $e_j$ ($j \ne i$, $1 \le j \le N$) as input, where $e_i$ denotes the encoding of the local observation and action of agent $i$, and outputs the action-value function $Q_i(o, a)$, where $o = (o_1, \cdots, o_N)$ and $a = (a_1, \cdots, a_N)$. In the critic network of agent $i$, the input of the attention module is $e_i$ ($1 \le i \le N$) and its output is $x_i$, which represents the contribution of other agents, i.e.,

$$x_i = \sum_{j \ne i} w_j\,\hbar(W_{value,j}e_j), \tag{61}$$

where $W_{value,j}$ is a value transformation matrix related to agent $j$, $\hbar$ is a non-linear activation function, and $w_j$ is the attention weight associated with agent $j$, which reflects the similarity between $e_i$ and $e_j$. Specifically,

$$w_j = \frac{\exp\big((W_{key,i}e_j)^T W_{query,i}e_i\big)}{\sum_{j=1}^{N}\exp\big((W_{key,i}e_j)^T W_{query,i}e_i\big)}, \quad \forall j, \tag{62}$$

where $W_{key,i}$ and $W_{query,i}$ are the key and query transformation matrices related to agent $i$, respectively.
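The attention computation in (61) and (62) can be sketched with NumPy as follows. Several choices here are assumptions for illustration only: a single value matrix is shared by all agents (the text uses one per agent), `tanh` stands in for the unspecified activation function, and the dimensions in the usage example are arbitrary.

```python
import numpy as np


def attention_contribution(e, i, w_key, w_query, w_value, act=np.tanh):
    """Return (x_i, w) for agent i.

    e:        (N, d) array whose row j is the encoding e_j
    w_key:    (d_k, d) key matrix of agent i
    w_query:  (d_k, d) query matrix of agent i
    w_value:  (d_v, d) value matrix, shared by all agents in this sketch
    act:      element-wise activation (tanh is an arbitrary stand-in)
    """
    query = w_query @ e[i]                       # W_query,i e_i
    scores = (e @ w_key.T) @ query               # (W_key,i e_j)^T W_query,i e_i
    scores = scores - scores.max()               # stabilise exp; ratios unchanged
    w = np.exp(scores) / np.exp(scores).sum()    # (62): normalise over j = 1..N
    values = act(e @ w_value.T)                  # activation of W_value e_j
    others = np.arange(e.shape[0]) != i          # (61): sum over j != i
    return w[others] @ values[others], w


# Usage: the output size depends on d_v only, not on the number of agents.
rng = np.random.default_rng(0)
d, d_k, d_v = 6, 4, 3
w_key = rng.normal(size=(d_k, d))
w_query = rng.normal(size=(d_k, d))
w_value = rng.normal(size=(d_v, d))
for n_agents in (4, 9):
    x, w = attention_contribution(rng.normal(size=(n_agents, d)), 0,
                                  w_key, w_query, w_value)
    print(n_agents, x.shape)
```

The loop at the end illustrates the scalability argument made in the text: changing the number of agents changes the number of attention weights but not the size of $x_i$ fed to the critic's MLP.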


Let $\theta_i$ be the network parameter of agent $i$. Then, the loss function used for updating the critic network is given by

$$\mathcal{L}(\theta_i) = \mathbb{E}_{(o,a,o',r)}\big[(Q_i^{\pi}(o, a) - y)^2\big], \tag{63}$$

where $(o, a, o', r)$ denotes an experience transition in the memory buffer $\mathcal{D}$, $\pi$ denotes the policies of agents, $y = r_i + \gamma Q_i^{\pi'}(o', a')|_{a' = \pi'(o')}$, and $\pi'$ denotes the target policies of agents. Moreover, the policy gradient used for updating the actor network is given by

$$\nabla_{\theta_i}J(\pi_i) = \mathbb{E}_{o,a}\big[\nabla_{\theta_i}\pi_i(a_i|o_i)\nabla_{a_i}Q_i^{\pi}(o, a)|_{a_i = \pi_i(o_i)}\big]. \tag{64}$$

The MAADDPG-based training algorithm for solving the Markov game related to P6 is shown in Algorithm 2. Specifically, a replay memory $\mathcal{D}$ is initialized in line 1. Then, the preprocessing function $\phi(o)$ is introduced to normalize the environment state $o_t$ as in [29], which facilitates the learning process of the proposed algorithm. In line 3, an Ornstein-Uhlenbeck (OU) process is used to generate random noise $\mathcal{N}$. In lines 4-5, we initialize the weight parameters of the actors/critics and the target actors/critics. Each actor network has one input layer, three hidden layers with the same size $N_a^h$, and one output layer. Each critic network involves three modules, i.e., encoder, attention, and MLP. Here, the MLP has one input layer, one hidden layer with size $N_c^h$, and one output layer. In lines 7-8, the environment state $o$ is initialized and the scale of the OU noise is adjusted, which decreases linearly as the episode index increases. In line 10, each thermal-load agent takes an action based on the current policy and exploration noise. In line 11, after receiving the joint action of all thermal-load agents, the environment returns a new state $o'$ and a reward $r$. Next, the experience transition tuple $(\phi(o), a, r, \phi(o'))$ is stored in the memory $\mathcal{D}$. When the number of transition tuples $M_{size}$ reaches $N_m$, the multi-agent training process is triggered. However, to stabilize the learning process, the training frequency is decreased by adopting another condition, i.e., $\mathrm{mod}(ep, T_{fre}) = 0$, where $ep$ denotes the episode index and $T_{fre}$ means that training is conducted every $T_{fre}$ episodes. In lines 16-18, each agent updates its actor and critic parameters based on a sampled mini-batch of $K$ transition tuples. In line 20, the target network parameters are updated.
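Two small pieces of this training procedure are easy to isolate: the trigger condition for a training step and the soft update of the target parameters, $\theta'_i \leftarrow \rho\theta_i + (1-\rho)\theta'_i$. The sketch below uses plain Python lists of floats as stand-ins for network weights, and the function names are illustrative rather than taken from the paper.

```python
def should_train(buffer_size, n_m, episode, t_fre):
    """True when the buffer holds at least n_m transitions and the episode
    index is a multiple of t_fre (the condition in line 14 of Algorithm 2)."""
    return buffer_size >= n_m and episode % t_fre == 0


def soft_update(target, online, rho):
    """Element-wise theta' <- rho * theta + (1 - rho) * theta'.

    `target` and `online` are equally long lists of floats standing in for
    the target and online network parameters.
    """
    return [rho * w + (1.0 - rho) * w_t for w, w_t in zip(online, target)]
```

With the small $\rho$ listed in Table I ($\rho = 0.001$), the target parameters move only slightly towards the online parameters at each update, which keeps the bootstrapped target $y$ in (63) slowly varying.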

Once the training process is finished, the obtained policy $\pi_i$ can be used for solving P6 online without any solution search, as shown in Algorithm 3. Since only forward propagation is involved, the computational complexity of Algorithm 3 is low. Specifically, forward propagation involves three basic computations, i.e., addition, multiplication, and activation. Let $N_{in}$ and $N_{out}$ be the number of neurons in the input layer and output layer, respectively. For the first neuron in the first hidden layer of the above-mentioned actor network, the numbers of additions, multiplications, and activations are $N_{in}$, $N_{in}$, and 1, respectively. Then, the total number of computations in the first hidden layer is $(2N_{in} + 1)N_a^h$. Similarly, the total number of computations in each of the second and third hidden layers is $(2N_a^h + 1)N_a^h$. For the output layer, the total number of computations is $(2N_a^h + 1)N_{out}$. Finally, the computational complexity of Algorithm 3 is $O(2N_{in}N_a^h + 4N_a^hN_a^h + 2N_{out}N_a^h + 3N_a^h + N_{out})$. Since $N_{in} = 5$ and $N_{out} = 1$ in this paper, the computational complexity of Algorithm 3 is $O(4(N_a^h)^2 + 15N_a^h + 1)$. When the computational complexity of P7 is taken into consideration, the computational complexity of the proposed online algorithm for solving P4 is $O(4(N_a^h)^2 + 15N_a^h + 3^{3.5}L_{in} + 1)$.

Algorithm 2: MAADDPG-based Training Algorithm for Thermal-load Agents
Input: real-world traces (e.g., price, PV generation, power demand, and outdoor temperature), $P_{fc,t}$ obtained from the solution of P7;
Output: actor networks $\pi_i(a_i|\phi(o))$
1 Initialize replay memory $\mathcal{D}$ with size $N_m$;
2 Initialize preprocess function $\phi(o)$;
3 Initialize random noise $\mathcal{N}$ for action exploration;
4 Randomly initialize critic networks $Q_i^{\pi}(\phi(o), a)$ and actor networks $\pi_i(a_i|\phi(o))$ with parameter $\theta_i$, respectively;
5 Initialize target critic networks $Q_i^{\pi'}(\phi(o), a)$ and actor networks $\pi'_i(a_i|\phi(o))$ with parameter $\theta'_i$, respectively;
6 for $ep = 1, 2, \cdots, M$ do
7     Receive the initial environment state $o$;
8     Adjust the scale of the Ornstein-Uhlenbeck process;
9     for $t = 0, 1, \cdots, T-1$ do
10        Each agent $i$ selects an action: $a_i = \pi_{\theta_i}(\phi(o_i)) + \mathcal{N}_t$;
11        Execute action $a$ and obtain next state $o'$ and reward $r$ from the environment;
12        Store $(\phi(o), a, r, \phi(o'))$ in $\mathcal{D}$;
13        $o \leftarrow o'$;
14        if $M_{size} \ge N_m$ and $\mathrm{mod}(ep, T_{fre}) = 0$ then
15            for agent $i = 1, \cdots, N$ do
16                Sample a mini-batch of $K$ transitions $(\phi(o^k), a^k, r^k, \phi(o'^k))$ from $\mathcal{D}$;
17                Update critic network by minimizing the loss function in (63);
18                Update actor network using the sampled policy gradient in (64);
19            end
20            Update target network parameters for each agent $i$: $\theta'_i \leftarrow \rho\theta_i + (1 - \rho)\theta'_i$;
21        end
22    end
23 end

Algorithm 3: Real-time Algorithm for Solving P6
1 Input: Actor networks and $P_{fc,t}$ obtained from Algorithm 2 and the solution of P7, respectively;
2 Output: $P_{sp,i,t}$, $P_{tc,t}$, $P_{td,t}$, $P_{gb,t}$;
3 $Q_{fc,t} = \eta_{hr}\eta_{h2e}P_{fc,t}\Delta t$;
4 Observe system states $Q_{th,t}$, $\beta_{in,i,t}$, $\beta_{out,t}$;
5 Obtain $o_{i,t} = (Q_{fc,t}, Q_{th,t}, \beta_{in,i,t}, \beta_{out,t}, t)$;
6 Each thermal-load agent $i$ makes its local action $P_{sp,i,t} = \pi_i(a_i|\phi(o_{i,t}))$ in parallel;
7 Determine $P_{tc,t}$, $P_{td,t}$, $P_{gb,t}$ according to Remark 1;


V. ALGORITHMIC FEASIBILITY

In this section, we provide the following three lemmas and two theorems, which show that constraints (4) and (14) are satisfied by the proposed online algorithm. Moreover, reasonable values of $V$, $W_B$, and $W_H$ are derived.

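Lemma 1. If $P_{buy,t} > 0$, the optimal solution of P7 has the following properties:
1) If $\Upsilon_{B,1} > 0$, the optimal BESS charging decision is $P_{bc,t} = 0$; if $\Upsilon_{B,2} < 0$, the optimal BESS discharging decision is $P_{bd,t} = 0$.
2) If $\Upsilon_{H,1} > 0$, the optimal HESS charging decision is $P_{el,t} = 0$; if $\Upsilon_{H,2} < 0$, the optimal HESS discharging decision is $P_{fc,t} = 0$.
Here, $\Upsilon_{B,1} = X_{B,t}\eta_{bc}\Delta t + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t + V\psi_{BESS}$, $\Upsilon_{B,2} = \frac{X_{B,t}\Delta t}{\eta_{bd}} + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t - V\psi_{BESS}$, $\Upsilon_{H,1} = X_{H,t}\xi\omega_{el}\Delta t + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t + \frac{V\delta_{el}^{on}}{P_{el}^{\max}}$, and $\Upsilon_{H,2} = \frac{X_{H,t}\xi\Delta t}{\omega_{fc}} + Vv_t\Delta t + V\mu_c\mu_{e,t}\Delta t - \frac{V\delta_{fc}^{on}}{P_{fc}^{\max}}$.
Proof: See Appendix A.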

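Lemma 2. If $P_{buy,t} = 0$, the optimal solution of P7 has the following properties:
1) If $\Upsilon_{B,3} > 0$, the optimal BESS charging decision is $P_{bc,t} = 0$; if $\Upsilon_{B,4} < 0$, the optimal BESS discharging decision is $P_{bd,t} = 0$.
2) If $\Upsilon_{H,3} > 0$, the optimal HESS charging decision is $P_{el,t} = 0$; if $\Upsilon_{H,4} < 0$, the optimal HESS discharging decision is $P_{fc,t} = 0$.
Here, $\Upsilon_{B,3} = X_{B,t}\eta_{bc}\Delta t + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t + V\psi_{BESS}$, $\Upsilon_{B,4} = \frac{X_{B,t}\Delta t}{\eta_{bd}} + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t - V\psi_{BESS}$, $\Upsilon_{H,3} = X_{H,t}\xi\omega_{el}\Delta t + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t + \frac{V\delta_{el}^{on}}{P_{el}^{\max}}$, and $\Upsilon_{H,4} = \frac{X_{H,t}\xi\Delta t}{\omega_{fc}} + V\tau_t\Delta t + V\mu_c\mu_{e,t}\Delta t - \frac{V\delta_{fc}^{on}}{P_{fc}^{\max}}$.
Proof: See Appendix B.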

Lemma 3. Based on Lemmas 1 and 2, the optimal solution to P7 has the following properties:
1) If $X_{B,t} > X_B^{high} = \frac{-V(\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \psi_{BESS})}{\eta_{bc}\Delta t}$, the optimal BESS charging decision is $P_{bc,t} = 0$; if $X_{B,t} < X_B^{low} = \frac{-V\eta_{bd}(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \psi_{BESS})}{\Delta t}$, the optimal BESS discharging decision is $P_{bd,t} = 0$.
2) If $X_{H,t} > X_H^{high} = \frac{-V(\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \frac{\delta_{el}^{on}}{P_{el}^{\max}})}{\xi\omega_{el}\Delta t}$, the optimal HESS charging decision is $P_{el,t} = 0$; if $X_{H,t} < X_H^{low} = \frac{-V\omega_{fc}(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}})}{\xi\Delta t}$, the optimal HESS discharging decision is $P_{fc,t} = 0$.
Proof: See Appendix C.

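Based on Lemma 3, reasonable values of $V$, $W_B$, and $W_H$ can be derived in the next two theorems.

Theorem 1. Given the control parameters $V \in (0, V_B^{\max}]$ and $W_B \in [W_B^{\min}, W_B^{\max}]$, the proposed algorithm ensures the feasibility of (4), i.e., $B^{\min} \le B_t \le B^{\max}$ for all slots, where

$$V_B^{\max} = \frac{B^{\max} - B^{\min} - \eta_{bc}P_{bc}^{\max}\Delta t - \frac{P_{bd}^{\max}\Delta t}{\eta_{bd}}}{\chi_B},$$

$$\chi_B = \eta_{bd}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\psi_{BESS}}{\Delta t}\Big) - \frac{\tau^{\min} + \mu_c\mu_e^{\min} + \frac{\psi_{BESS}}{\Delta t}}{\eta_{bc}},$$

$$W_B^{\min} = \frac{-V\big(\tau^{\min} + \mu_c\mu_e^{\min} + \frac{\psi_{BESS}}{\Delta t}\big)}{\eta_{bc}} + \eta_{bc}P_{bc}^{\max}\Delta t - B^{\max},$$

$$W_B^{\max} = -V\eta_{bd}\Big(v^{\max} + \mu_c\mu_e^{\max} - \frac{\psi_{BESS}}{\Delta t}\Big) - \frac{P_{bd}^{\max}\Delta t}{\eta_{bd}} - B^{\min}.$$

Proof: See Appendix D.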
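The quantities in Theorem 1 are closed-form, so they can be evaluated directly. The sketch below is illustrative: the class and function names are introduced here, the default values follow the BESS and price rows of Table I, and `v_max` is a hypothetical placeholder because the maximum buying price comes from the price trace and is not listed in the table.

```python
from dataclasses import dataclass


@dataclass
class BessParams:
    b_min: float = 0.0
    b_max: float = 100.0
    p_bc_max: float = 10.0
    p_bd_max: float = 10.0
    eta_bc: float = 0.95
    eta_bd: float = 0.95
    psi: float = 0.01        # psi_BESS
    dt: float = 1.0
    mu_c: float = 0.01
    mu_e_min: float = 0.968
    mu_e_max: float = 0.968
    tau_min: float = 0.1
    v_max: float = 1.4       # hypothetical value, not taken from the paper


def _price_terms(p):
    low = p.tau_min + p.mu_c * p.mu_e_min + p.psi / p.dt    # nu_1 in (36)
    high = p.v_max + p.mu_c * p.mu_e_max - p.psi / p.dt
    return low, high


def v_b_max(p):
    """Upper limit V_B^max of the control parameter V."""
    low, high = _price_terms(p)
    chi_b = p.eta_bd * high - low / p.eta_bc
    if chi_b <= 0:
        raise ValueError("assumption (36) is violated for these parameters")
    slack = (p.b_max - p.b_min - p.eta_bc * p.p_bc_max * p.dt
             - p.p_bd_max * p.dt / p.eta_bd)
    return slack / chi_b


def w_b_range(v, p):
    """Admissible interval [W_B^min, W_B^max] for a given V."""
    low, high = _price_terms(p)
    w_min = -v * low / p.eta_bc + p.eta_bc * p.p_bc_max * p.dt - p.b_max
    w_max = -v * p.eta_bd * high - p.p_bd_max * p.dt / p.eta_bd - p.b_min
    return w_min, w_max
```

Evaluating `w_b_range` at `v_b_max(p)` returns an interval of zero width, while any smaller positive `v` leaves a non-empty interval, since $W_B^{\max} - W_B^{\min}$ decreases linearly in $V$ with slope $-\chi_B$.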

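Theorem 2. Given the control parameters $V \in (0, V_H^{\max}]$ and $W_H \in [W_H^{\min}, W_H^{\max}]$, the proposed algorithm ensures the feasibility of (14), i.e., $H^{\min} \le H_t \le H^{\max}$ for all slots, where

$$V_H^{\max} = \frac{H^{\max} - H^{\min} - \omega_{el}P_{el}^{\max}\Delta t - \frac{P_{fc}^{\max}\Delta t}{\omega_{fc}}}{\chi_H},$$

$$\chi_H = \frac{\omega_{fc}\big(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}}\big)}{\xi\Delta t} - \frac{\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \frac{\delta_{el}^{on}}{P_{el}^{\max}}}{\xi\omega_{el}\Delta t},$$

$$W_H^{\min} = \frac{-V\big(\tau^{\min}\Delta t + \mu_c\mu_e^{\min}\Delta t + \frac{\delta_{el}^{on}}{P_{el}^{\max}}\big)}{\xi\omega_{el}\Delta t} + \omega_{el}P_{el}^{\max}\Delta t - H^{\max},$$

$$W_H^{\max} = \frac{-V\omega_{fc}\big(v^{\max}\Delta t + \mu_c\mu_e^{\max}\Delta t - \frac{\delta_{fc}^{on}}{P_{fc}^{\max}}\big)}{\xi\Delta t} - \frac{P_{fc}^{\max}\Delta t}{\omega_{fc}} - H^{\min}.$$

Proof: See Appendix E.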

Remark 2: To ensure that the operational constraints of BESS and HESS are both feasible, the control parameter should be selected within the range $0 < V \le V^{\max} = \min\{V_B^{\max}, V_H^{\max}\}$. According to the objective function in P4, a larger $V$ means a higher "priority" for minimizing operational cost. Thus, this paper chooses $V = V^{\max}$, similar to existing works [36], [37]. In addition, $V_B^{\max}$ is derived from the equation $W_B^{\min} = W_B^{\max}$, and $V_H^{\max}$ is derived from the equation $W_H^{\min} = W_H^{\max}$. In other words, when $V^{\max} = V_B^{\max}$, the gap between $W_B^{\min}$ and $W_B^{\max}$ is zero. Similarly, when $V^{\max} = V_H^{\max}$, the gap between $W_H^{\min}$ and $W_H^{\max}$ is zero. Since $W_B$ and $W_H$ mainly affect the average BESS/HESS energy level rather than price diversity (i.e., charging/discharging BESS/HESS when the price is low/high), we choose $W_B = W_B^{\max}$ and $W_H = W_H^{\max}$ for simplicity in this paper.

VI. PERFORMANCE EVALUATION

In this section, we evaluate the performance of the proposed algorithm. Specifically, we first describe the simulation setup. Next, five benchmarks are adopted for performance comparisons. Then, two performance metrics are defined. Finally, simulation results and discussions are provided.

A. Simulation setup

Real-world traces related to electricity price, power load, PV generation, and outdoor temperature are adopted in simulations, which are shown in Fig. 4. Specifically, the retail commercial price between June 1 and Sept. 30 of 2019 in Beijing is used¹. Moreover, power demand and outdoor temperature data from the Pecan Street database² between June 1 and Sept. 30 of 2018 are used. Note that this database is the largest real-world open energy database and consists of data related to the Mueller neighborhood in Austin, TX, USA. Since we focus on the cooling mode in summer, solar irradiance data between June 1 and Sept. 30 of 2019 from the NREL Solar Radiation Research Laboratory³ are used. In these traces, the data of 90 days and 30 days are used for training and testing, respectively. The main simulation parameters are summarized in Table I, where $lr_a$ and $lr_c$ are the learning rates of the actor network and critic network, respectively.

¹ http://fgw.beijing.gov.cn/

² https://www.pecanstreet.org/

³ https://midcdmz.nrel.gov/


Python-based simulations are conducted on a desktop computer with an Intel Core(TM) i9-9900 CPU and 64 GB RAM. To simulate the building thermal dynamics, the following model $F$ is adopted, similar to many existing works [40], [41], i.e., $\beta_{in,i,t+1} = \varepsilon_{hvac}\beta_{in,i,t} + (1 - \varepsilon_{hvac})(\beta_{out,t} - P_{sp,i,t}\eta_{hvac}/A_i)$. Note that the above model structure is not used for energy planning/optimization as in model-based methods (e.g., model predictive control [42] and Lyapunov optimization techniques [43]), but only to generate environment data for model-free learning. In addition, adopting the above-mentioned model facilitates the performance comparison with an optimal scheme that solves a deterministic model with perfect information of uncertain parameters.
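For reference, one step of this simulation model can be written as a single function. The sketch below is illustrative: the function name is introduced here, and the default values of $\varepsilon_{hvac}$, $\eta_{hvac}$, and $A$ follow the thermal load row of Table I.

```python
def next_indoor_temp(beta_in, beta_out, p_sp,
                     eps_hvac=0.8, eta_hvac=2.5, a_i=0.5):
    """One-slot update of the indoor temperature of a single building.

    beta_in:  current indoor temperature
    beta_out: current outdoor temperature
    p_sp:     thermal (cooling) power supplied to the building in this slot
    """
    return eps_hvac * beta_in + (1.0 - eps_hvac) * (beta_out - p_sp * eta_hvac / a_i)


# Usage: roll the model forward over a short, made-up outdoor temperature trace.
temps = [21.0]
for beta_out, p_sp in [(30.0, 0.0), (31.0, 2.0), (32.0, 2.0)]:
    temps.append(next_indoor_temp(temps[-1], beta_out, p_sp))
```

In the learning setup, the agents only see the resulting temperatures as part of their observations; the update rule itself stays hidden inside the environment.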

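TABLE I
MAIN PARAMETER SETTINGS

PV generation, gas boiler, and carbon emission:
$\eta_{pv} = 0.2$ [28], $h_{pv} = 100$ m², $\eta_{gb} = 0.95$ [17], $\lambda_{gb} = 0.287$ RMB/kWh [17], $P_{gb}^{\max} = 20$ kW, $\mu_{e,t} = 0.968$ kg/kWh, $\mu_c = 0.01$ RMB/kg, $\tau_t = 0.1$ RMB/kWh

BESS:
$B^{\min} = 0$ kWh, $B_0 = 0$ kWh, $B^{\max} = 100$ kWh, $P_{bc}^{\max} = 10$ kW, $P_{bd}^{\max} = 10$ kW, $\eta_{bc} = \eta_{bd} = 0.95$ [29], $\psi_{BESS} = 0.01$ RMB/kW [29]

CWT:
$\eta_{tc} = 0.9$, $\eta_{td} = 0.9$, $Q_{th}^{\max} = 50$ kWh, $Q_{th}^{init} = 0$ kWh, $P_{tc}^{\max} = 10$ kWh, $P_{td}^{\max} = 10$ kWh, $\psi_{CWT} = 0.05$ RMB/kW

HESS:
$\omega_{fc} = 0.23$ Nm³/kWh [31], $\omega_{el} = 1.4985$ kWh/Nm³, $\eta_{hr} = 0.7$ [2], $\eta_{h2e} = 1.4$ [2], $\eta_{h2c} = 0.7$ [2], $P_{el}^{\max} = 10$ kW, $P_{fc}^{\max} = 10$ kW, $\delta_{el}^{on} = 0.158$ RMB [32], $\delta_{el}^{su} = 0.97$ RMB [32], $\delta_{el}^{sd} = 0.049$ RMB [32], $\delta_{fc}^{on} = \delta_{fc}^{su} = 0.079$ RMB [32], $\delta_{fc}^{sd} = 0.0395$ RMB [32], $H^{\max} = 100$ Nm³, $H_0 = H^{\min} = 0$ Nm³

Thermal load:
$N = 4$, $\beta^{init} = [21, 20, 22, 21.5]$ °C, $\beta_i^{\min} = 20$ °C, $\beta_i^{\max} = 25$ °C, $\eta_{hvac} = 2.5$, $A = 0.5$ kW/°F, $\varepsilon_{hvac} = 0.8$, $P_{sp}^{\max} = 20$ kW

Training algorithm:
$\gamma = 0.995$, $N_a^h = N_c^h = 64$, $N_m = 120000$, $M = 100000$, $\kappa_{th} = 1$ RMB/°F, $\xi = 4$, $lr_a = 0.0005$, $lr_c = 0.005$, $T = 24$, $\Delta t = 1$ h, $K = 128$, $\rho = 0.001$, $T_{test} = 720$, $T_{fre} = 5$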

B. Benchmarks

•Baseline 1 (B1): This scheme controls BESS and HESS

using an algorithm similar to [44], i.e., charging BESS

and HESS greedily when there is a surplus of renewable

energy and discharging them otherwise. Moreover, this

scheme adopts ON/OFF strategy [45] for building cool-

ing, i.e., Psp,i,t=0 if βin