FiDRL: Flexible Invocation-Based Deep Reinforcement
Learning for DVFS Scheduling in Embedded Systems
Jingjin Li, Weixiong Jiang, Yuting He, Qingyu Yang, Anqi Gao, Yajun Ha,
Ender Özcan, Ruibin Bai, Tianxiang Cui, Heng Yu
Abstract—Deep Reinforcement Learning (DRL)-based Dy-
namic Voltage Frequency Scaling (DVFS) has shown great
promise for energy conservation in embedded systems. While
many works were devoted to validating its efficacy or improving
its performance, few discuss the feasibility of the DRL agent de-
ployment for embedded computing. State-of-the-art approaches
focus on the miniaturization of agents’ inferential networks,
such as pruning and quantization, to minimize their energy and
resource consumption. However, this spatial-based paradigm still
proves inadequate for resource-stringent systems. In this paper,
we address the feasibility from a temporal perspective, where
FiDRL, a flexible invocation-based DRL model is proposed to
judiciously invoke itself to minimize the overall system energy
consumption, given that the DRL agent incurs non-negligible
energy overhead during invocations. Our approach is three-
fold: (1) FiDRL that extends DRL by incorporating the agent’s
invocation interval into the action space to achieve invocation
flexibility; (2) a FiDRL-based DVFS approach for both inter- and
intra-task scheduling that minimizes the overall execution energy
consumption; and (3) a FiDRL-based DVFS platform design
and an on/off-chip hybrid algorithm specialized for training the
DRL agent for embedded systems. Experiment results show that
FiDRL achieves 55.1% agent invocation cost reduction, under
23.3% overall energy reduction, compared to state-of-the-art
approaches.
Index Terms—Deep reinforcement learning, DVFS, lightweight
AI, energy optimization.
I. INTRODUCTION
A. Background
REINFORCEMENT Learning (RL) greatly bolsters the
intelligence level of embedded devices, in terms of both
user-level applications and system-level operations. In the
context of runtime energy management, RL-based Dynamic
Voltage Frequency Scaling (DVFS) has been proven effective
for energy optimization, which benefits the battery life of
embedded devices [1]–[7]. Moreover, Deep Reinforcement
Jingjin Li, Yuting He, Qingyu Yang, Anqi Gao, Ruibin Bai, Tianxiang Cui,
and Heng Yu are with the School of Computer Science, University of Not-
tingham Ningbo China, Ningbo 315100, China. Email: {jingjin.li, yuting.he,
qingyu.yang, scyag1, ruibin.bai, tianxiang.cui, heng.yu}@nottingham.edu.cn
Weixiong Jiang is with the School of Information Science and Tech-
nology, ShanghaiTech University, Shanghai 201210, China, and also with
the School of Electronic, Electrical and Communication Engineering, Uni-
versity of Chinese Academy of Sciences, Beijing 101408, China. Email:
jiangwx@shanghaitech.edu.cn
Yajun Ha is with the School of Information Science and Technology
and the Shanghai Engineering Research Center of Energy Efficient and
Custom AI IC, ShanghaiTech University, Shanghai 201210, China. E-mail:
hayj@shanghaitech.edu.cn
Ender Özcan is with the School of Computer Science, University of Not-
tingham, Nottingham NG8 1BB, UK. Email: ender.ozcan@nottingham.ac.uk
Corresponding author: Heng Yu.
Fig. 1. Illustration of the repetitive decisions in the trace of a DRL-based
DVFS policy inference [1]. The left figure shows the sequential decisions of
the DVFS policy, while the right one shows the percentage of the repetitive
decisions of the DVFS scheduling policy.
Learning (DRL)-based DVFS scheduling is increasingly in-
vestigated, given its capacity to manage unpredictability and
complexity faced by embedded systems interacting with the
sophisticated world [8]–[16]. Study [12] demonstrates that
DRL-based DVFS achieves superior performance and robust-
ness on complex states compared to traditional RL-based,
energy model-based, and heuristic-based DVFS scheduling
approaches.
A natural challenge of deploying DRL onto embedded
devices is the contradiction between DRL’s high resource
demand and on-chip resource scarcity. The cost of invoking the
DRL agent itself can be non-negligible, particularly under the
energy optimization scenarios. Specifically, the cost comprises
both spatial and temporal overheads that eventually lead to
extra energy consumption. Spatial overhead arises with the
memory or circuit implementations of the DRL agent. Tem-
poral overhead refers to the periodical invocations of the DRL
agent for decision-making, which contributes to considerable
energy consumption during the deep learning-based policy
inference. Contemporary DRL-based DVFS studies primarily
emphasize validating DRL’s effectiveness in energy optimiza-
tion, with a few discussing its deployment feasibility. For
example, works [8] and [13] implement and validate the DRL
agents in the OS kernel and GEM5 [17], respectively. [18]
proposes a distributed Q-learning-based heterogeneous com-
puting paradigm to reduce energy consumption considering
communication overhead. However, the above works ignore
the cost of agent invocations.
In order to minimize the cost of the DRL agent, there
exists a spectrum of spatial-based miniaturization techniques
that aim at reducing the size of the deep inference network,
such as quantization [19], compression and pruning [20], etc.
However, this category of work inevitably incurs accuracy
losses. On the other hand, from an orthogonal, temporal angle,
Fig. 2. Energy overhead caused by agent invocation could be non-negligible
in contemporary RL/DRL-based DVFS approaches.
it can be observed that repetitive decisions tend to appear
in consecutive invocations of the DRL agent, given that a
chip’s physical state change (such as ambient temperature)
could be less swift compared to the frequency of an agent’s
invocation [21]. An empirical study is presented below to
demonstrate this phenomenon.
B. Motivational Examples
Fig. 1 (left) demonstrates the snapshot of a decision trace
of executing a DRL-based DVFS policy proposed in [1],
which is employed to optimize the energy consumption of
executing an FFT application. The x-axis represents the de-
cision timestamp of the DRL agent, and the y-axis denotes
the output voltage/frequency settings decided by the agent. It
is shown that repetitive decisions indeed occur in the learnt
policy as represented by the red points. Extensive runs of the
same example reveal that the percentage of repetitive decisions
is 59%, as shown in Fig. 1 (right). It is thus energy-beneficial
to remove this kind of temporal overhead by minimizing the
agent’s intermediate invocations. To the best of our knowledge,
current DRL-based DVFS approaches have not investigated or
optimized the agent’s invocation cost.
To demonstrate the impact of the energy overhead gener-
ated by agent invocations, Fig. 2 shows empirical results of
energy consumption among several available RL/DRL-based
DVFS works [1], [6], [22]. All of the results are normalized
on the baseline energy consumption by executing applications
without DVFS. It can be observed that all works reduce the
overall energy consumption (grey shaded area), yet they all
incur additional agent costs (red shaded area) besides the
energy consumed by application executions. According to this
red-shaded area from various RL-based methods, the agent
energy accounts for around 10% of the overall system energy,
reflecting potential reduction opportunities.
C. Scope and Contributions
To effectively reduce the invocation overhead of the DRL
agent in DVFS scheduling, which is an overlooked issue in this
field, in this paper we propose a Flexible invocation-based
Deep Reinforcement Learning (FiDRL) scheme to tackle the
DVFS-based energy optimization problem in the temporal
aspect. Our approach extends the conventional DRL by incor-
porating the determination of the agent’s invocation interval
into the action space, allowing the agent to determine its next
invocation time with its normal actions to take. As a result,
unlike the conventional RL paradigm that exhibits a clear
separation between the agent and its actuated environment,
the agent in the proposed FiDRL scheme actuates itself as part
of the environment. We thus necessarily prove its property of
convergence to theoretically support its feasibility. The FiDRL
scheme is then formulated for the DVFS problem, to achieve
the overall system energy minimization by being aware of
agent invocation overheads. Specifically, our FiDRL model is
adaptive for both inter-task and intra-task DVFS scheduling.
Moreover, given that state-of-the-art RL-based DVFS
scheduling studies mostly adopt simulations to justify the per-
formance, in this work, we propose an FPGA-based emulating
platform, including the neural network-based FiDRL agent,
the DVFS and system controllers, and IP cores or tasks to be
scheduled, for fast implementation and realistic verification of
the proposed DVFS scheduling approaches. It is imperative to
note that under stringent resources in embedded devices, direct
on-chip training of the agent is infeasible. Yet completely
off-chip training on a remote host machine may not capture
the states of the system on-the-fly, leading to biased training
results. We present an on/off-chip hybrid training algorithm
that synergizes on-chip data collection and off-chip training, to
satisfy both the training feasibility and accuracy requirements.
The contributions of our study are three-fold:
1) We propose FiDRL, a flexible invocation-based DRL
scheme that temporally reduces the repetitive agent invo-
cations by incorporating the agent’s invocation interval
into the action space, with its convergence analysis.
2) We optimize the system’s overall energy consumption
by formulating and solving the FiDRL-based DVFS
problem in terms of inter- and intra-task scheduling.
3) We present the design and implementation of an FPGA-
based FiDRL-DVFS system, together with an on/off-
chip hybrid training algorithm specialized for training
DRL agents for embedded systems.
Our FiDRL-based DVFS attains 23.3% overall energy re-
duction and saves on average 55.1% of DRL agent energy
overhead compared to state-of-the-art approaches.
The remainder of this paper is organized as follows: Sec-
tion II reviews existing studies. Section III provides DRL
preliminaries, proposes and proves the FiDRL model. Sec-
tion IV demonstrates the FiDRL-based DVFS formulation and
training, and Section V elaborates FiDRL-based DVFS system
design and implementation. Section VI presents and analyzes
experimental results. Section VII concludes the paper.
II. RELATED WORKS
DVFS-based energy management is a long-studied problem,
and a spectrum of effective strategies have been proposed
for various embedded systems [23]–[26]. As RL-based DVFS
scheduling techniques continue to prove their efficacy in
system energy consumption management, they have become
widely studied and applied in embedded devices. Q-learning
is a mainstream algorithm used in RL agent training. The re-
alization of the Q-learning-based agent can be: (i) simulation-
based [27], (ii) instantiated in the Linux kernel [22], [28], or
TABLE I
HIGHLIGHTS OF LEARNING-BASED DVFS STUDIES

Studies        | Advantages                        | Drawbacks
RL-based DVFS  | 1. Less hardware resource.        | 1. Cannot handle complicated
               | 2. Performs well under simple     |    formulated problems.
               |    problem formulations.          | 2. Sensitive to data noise.
DRL-based DVFS | 1. Able to solve complicated      | 1. More hardware resources.
               |    formulated problems.           | 2. Required to be well-trained.
               | 2. Good generalizability.         |
(iii) implemented in Processing Logic (PL) [29]. All these
studies apply basic RL algorithms to DVFS scheduling and
verify their effectiveness compared to heuristic-based ones.
However, the states and actions are simple and implemented
as a look-up Q-table, which brings concerns for scalability and
capability of dealing with complicated environments faced by
embedded systems.
DRL models have been integrated with DVFS to improve
the energy/performance in embedded devices. Huang et al. [8]
build a double-Q-network governor in the Linux kernel for DVFS
scheduling, reducing the run-time energy on the CPU. [10],
[13] implement deep-Q-network agents for DVFS energy
or reliability management in the simulator, verifying that
their DRL-DVFS outperformed Q-learning-based counterparts.
In [30], a DRL-based job scheduling algorithm is designed
to reduce energy consumption and improve the quality of
service in cloud data centres. Yu et al. [31] employs policy-
based learning to train the agent that manages the power
of IoT devices. Those DRL-DVFS works train and deploy
the network on the CPU side, without further investigating
the training and deployment feasibility on resource-scarce
embedded devices. Specifically, they have not studied the
overhead generated by the agent’s invocation and execution.
[12] uses a deep-Q-network to instruct DVFS for adaptive
application quality improvement and evaluates the overhead of
the agent only on a simulator. Table I highlights the advantages
and drawbacks of learning-based DVFS studies.
A few works from the AI community study
the flexible invocation paradigm. Lakshminarayanan
et al. [32] propose a dynamic repetition scheme for deep-
Q-network, which is a value-based DRL algorithm1. The
DRL agent selects actions together with its repetitions,
enhancing the learning speed and performance. The same
group further proposes a policy-based DRL with variable
decision periods [33], constructing a two-network model:
one for outputting the action and another for its repetition
size. Biedenkapp et al. [34] propose TempoRL, an improved
version of the above two works evaluated in the Gym [35]
benchmarks. As shown in the left of Fig. 3, it also employs
a two-network architecture to output a specific action a and
a skipping factor j to escape unnecessary decisions. A key
issue when applying TempoRL in embedded scenarios is
“state drifting”, which describes the phenomenon that when
Net2 makes inference on j based on a and the Net2-input state
1 The conventional RL training algorithms can be classified into value-
based and policy-based, depending on the usage of the trajectories. Value-
based algorithms build on single-step interactions between the agent and the
environment, while policy-based ones employ the whole interactive trajectory.
s, the expected s vector deviates from its Net1-input
version, due to the non-negligible execution time of Net1. This
may not pose an issue to the generic RL scenarios that
TempoRL assumes, but for RL in embedded computing,
where the environment (e.g., chip state) changes at a
speed comparable to the agent computation, this deviation
may cause a problem. A naive solution is to re-observe
the Net2-input state, as shown in the middle of Fig. 3, but
then a produced by Net1 mismatches j due to the state
alteration. Thus, in this work, a unified one-step network is
proposed to resolve the above issue for embedded computing.
It is worth noting that the unified network determines
whether to invoke itself, making the agent itself part of
the environment of the RL paradigm. For this reason, this
work provides a theoretical foundation for the proposed scheme.
Fig. 3. Left: Illustration of the two-network structure of the TempoRL
model [34]; Middle: A naive state re-observation modification to overcome
the state drifting problem; Right: One-network scheme proposed in this work.
III. FLEXIBLE INVOCATION-BASED DRL
In this section, we introduce preliminaries of DRL, present
the FiDRL model, and prove its convergence properties.
A. DRL Preliminaries
RL is an effective approach to solving sequential decision-
making problems with the Markov decision property, which de-
fines that the probability of reaching the current state is solely
determined by the previous state and its action. RL has the
following essential components: (i) environment (env): the
external environment for the agent to interact with, which
can either be a simulated or a real-world system; (ii) state
(s ∈ S): a state can be a multidimensional vector extracted
from env that represents its features; the space of the state
could be discrete or continuous (i.e., finite or infinite); (iii)
action (a ∈ A): a one- or multidimensional vector whose
physical form actuates state transitions in the environment;
the space of the action could be discrete or continuous; (iv)
reward (r ∈ R): the reward denotes the value that the agent
receives upon an action. Its formulation can be manually or
automatically adjusted to facilitate the agent to learn correctly;
(v) agent: the main entity that selects corresponding actions to
maximize the long-term reward, which can be denoted as:
$E\left[\sum_{t=0}^{T} \gamma^t r_t(s_t, a_t) \mid s_t \in S,\ a_t \in A,\ r_t \in R\right]$  (1)
(vi) policy (π(a|s), a ∈ A, s ∈ S): the policy denotes the
probability for the agent to choose the action a given a state
s; (vii) value (V(s), s ∈ S): the value function is used to
evaluate the long-term return given a state s, which can be
noted as:
$V(s) = E\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t) \mid s_t = s\right]$  (2)
In DRL, a deep neural network is employed as the policy or
value network to estimate the policy or value functions after
encoding the state, action, and reward. During the training
process, the agent obtains trajectorized data by interacting
with the environment and employs this data to update network
parameters.
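For concreteness, the long-term return in Eqns. 1 and 2 can be computed directly from a recorded reward trajectory. The following minimal sketch (ours, for illustration only; the reward values and γ are placeholders, not taken from the paper) shows the computation.

# Minimal sketch (illustrative only): discounted return of Eqns. 1-2
# over one finite reward trajectory; gamma and the rewards are placeholders.
def discounted_return(rewards, gamma=0.95):
    """Return sum_{t=0}^{T} gamma^t * r_t for a finite trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Example: three decision steps with (negative-energy-style) rewards.
print(discounted_return([-0.2, -0.1, -0.3]))  # -0.56575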
B. Proposed FiDRL Scheme
1) Mechanism Description: To enable flexible invocations
of the agent, we propose to extend the action space by
incorporating a feature named a_interval, which stands for
the time interval between the current and the next agent
invocation. Therefore, the modified action a_flexible can be
denoted as:
$a_{flexible} = (a_{original}, a_{interval}), \quad a_{flexible} \in A$  (3)
where a_original represents the action a in conventional
DRL. The definition of a_interval encompasses both discrete
and continuous terms, to cater for the inter-task and the more
generalized intra-task scheduling scenarios, respectively:
$a_{interval} = \begin{cases} \{n \times interval_{min}\},\ n \in \mathbb{Z}^{+} & \text{[discrete]} \\ [interval_{min},\ interval_{max}] & \text{[continuous]} \end{cases}$  (4)
where interval_min denotes the minimum decision interval,
which can be expressed in different attributes, including (i) time spent,
(ii) iterations spent, and (iii) tasks spent, as specified by the design.
n is defined as a variable integer representing the skip factor of
the decision interval, with n ≤ n_max, where n_max is a user-set
value that restricts the upper bound of the decision interval. Under
a continuous a_interval, interval_max (> interval_min) is a user-
set value that restricts the upper bound of the decision interval.
With the definition of a_interval, it is then possible to explore
the full or fractional “skipping” of repetitive actions over
consecutive agent invocations. This adds an extra dimension
to the search space, enabling an even better energy reduction
policy due to fewer agent invocations.
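To make the extended action space concrete, the following sketch (ours, not the authors' code) encodes a_flexible for both the discrete and the continuous case of Eqn. 4; the V/F list, INTERVAL_MIN, N_MAX, and INTERVAL_MAX are illustrative placeholders.

# Illustrative sketch of the extended FiDRL action in Eqns. 3-4.
# The constants below are placeholder values, not the paper's exact settings.
from dataclasses import dataclass

VF_LEVELS = list(range(10))      # indices of the available V/F pairs (a_original)
INTERVAL_MIN = 250               # minimum decision interval (tasks, iterations, or time)
N_MAX = 5                        # discrete case: upper bound of the skip factor n
INTERVAL_MAX = 5 * INTERVAL_MIN  # continuous case: upper bound of a_interval

@dataclass
class FlexibleAction:
    vf_level: int        # a_original: which V/F pair to apply
    interval: float      # a_interval: tasks/time until the next agent invocation

def discrete_action(vf_level: int, n: int) -> FlexibleAction:
    """Inter-task case: a_interval = n * interval_min, with 1 <= n <= n_max."""
    assert 1 <= n <= N_MAX and vf_level in VF_LEVELS
    return FlexibleAction(vf_level, n * INTERVAL_MIN)

def continuous_action(vf_level: int, interval: float) -> FlexibleAction:
    """Intra-task case: a_interval clipped to [interval_min, interval_max]."""
    interval = min(max(interval, INTERVAL_MIN), INTERVAL_MAX)
    return FlexibleAction(vf_level, interval)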
To formalize FiDRL, we describe its Markov Decision
Process (MDP) M_F, denoted as M_F = ⟨S, A, P, R⟩. In this
MDP, s ∈ S represents the encoded state in our problem, and
a_flexible ∈ A represents the modified action in FiDRL, given
by Eqn. 3. Furthermore, p(s, a_flexible, s′) ∈ P represents the
transition probability from state s to s′ when the agent takes
action a_flexible, and r(s, a_flexible, s′) ∈ R denotes the reward
obtained by the agent in the transition from s to s′ taking
action a_flexible. The goal of this MDP is similar to that of
conventional DRL, which is to find an optimal policy
π : S → A that maximizes the reward $E_{\pi}\big[\sum_{t=0}^{T-1} \gamma^t r_t\big]$,
where t is the decision step of the agent and T is the total
number of decisions until the episode ends. The Q-function is
formulated as follows:
$Q_{\pi}(s, a) = E[r_t + \gamma Q_{\pi}(s_{t+1}, a_{t+1}) \mid s = s_t, a], \quad a = a_{flexible}$  (5)
In other words, the MDP used in FiDRL encompasses a
wider range of scenarios compared to classical DRL, by having
a_interval accompany normal actions to enable flexible invo-
cations of the agent. On the other hand, classical DRL can be
seen as a special case with a_flexible = (a_original, interval_min),
where interval_min is a given constant (namely, a hyper-
parameter) representing the fixed invocation interval of the
agent.
2) Convergence Analysis: With the action a_flexible ex-
tended, it is necessary to prove that our framework has an
equal ability of convergence compared to conventional DRL.
Definition: A contraction mapping is a function f : M → M
in a metric space M that satisfies the condition:
$d(f(x), f(y)) \leq k \times d(x, y)$
for all x, y in M, where 0 ≤ k < 1 is a constant and d(x, y)
represents the distance between x and y.
Traditional RL convergence analysis starts from the con-
traction mapping of the Bellman Expectation Equation [36], [37],
which proves that the expected reward value has the property
of converging to a fixed point during the agent's training
update process. Similarly, we first prove the usability of the
Bellman Expectation Equation with our extended RL design.
We then prove the convergence of both value and policy
iteration, which are two core algorithms to solve the Markov
decision process in traditional RL and provide a systematic
way to optimize the policy [37].
Lemma 1. The Bellman Expectation Equation is still usable
with the extension of the action space in a_flexible.
Proof. Firstly, with the premise of a finite space of states
and actions, the extension of the actions does not change the
finiteness of the states and actions.
It is useful to first provide the proof of contraction
mapping for the conventional Bellman Expectation Equation [36],
then extend it to the addition of a_flexible. Let T denote the
Bellman expectation update operator; for any value function
V:
$(TV)(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V(s')]$  (6)
Consider the difference between two value functions V and
V′, denoting $\Delta V = d(V, V') = \max_{s \in S} |V - V'|$:
$|(TV)(s) - (TV')(s)| = \Big|\sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,[\gamma (V(s') - V'(s'))]\Big| \leq \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,[\gamma\,|V(s') - V'(s')|] \leq \gamma \max_{s' \in S} |V(s') - V'(s')| = \gamma \Delta V$  (7)
As γ is the discount factor in RL, which lies in the range from
0 to 1, $d(TV, TV') \leq \gamma\, d(V, V')$. The property of contraction
mapping depends only on γ and has no relationship with
the action space. Thus, the contraction mapping property
of the Bellman Expectation Equation still holds no matter how
the action space changes. Therefore, the Bellman Expectation
Equation is also usable for a_flexible.
Lemma 2. The value function can converge to the optimum
under value iteration of flexible invocation-based RL.
Proof. By Lemma 1, we first replace the action a with a_f for
our action representation, which is the short form of a_flexible.
Let V represent the value function, and V′ the value
function in the next iteration; then the Bellman Expectation
Equation can be denoted as:
$V'(s) = E[r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s]$  (8)
where γ is the discount factor, r_{t+1} is the reward, and s_{t+1} is
the next state. By using the contraction mapping of the Bellman
Expectation Equation and Banach's Fixed Point Theorem [38],
the Bellman Expectation Equation possesses a unique V*:
$V^{*}(s) = E[r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s]$  (9)
Therefore, from any initial V_0, by continuously applying the
Bellman Expectation Equation:
$\lim_{n \to \infty} V_n(s) = V^{*}(s)$  (10)
Lemma 3. The policy can converge to the optimum under
policy iteration of the flexible invocation-based RL.
Proof. For every state s ∈ S, one selects the action a_f that
maximizes the action-value function Q_π(s, a_f):
$\pi'(s) = \arg\max_{a_f} Q_{\pi}(s, a_f)$  (11)
Hence, we can assert:
$V_{\pi'}(s) = \max_{a_f} Q_{\pi}(s, a_f) \geq \sum_{a_f} \pi(a_f|s)\, Q_{\pi}(s, a_f) = V_{\pi}(s)$  (12)
Thus, $V_{\pi'}(s) \geq V_{\pi}(s)$ for the improved policy π′.
Given the finiteness of the policies and states, by calculating
the value function V_π given a policy π, policy iteration
converges to a policy that does not change after any subse-
quent policy improvement attempts, implying optimality of
the policy.
Theorem. Flexible invocation-based RL holds the ability of
convergence.
Proof. By Lemmas 2 and 3, after extending the action space,
the value function and the policy can both converge to the
optimum. This means that both value iteration and policy iteration
converge. Hence, our framework holds an equal ability of
convergence to conventional RL under a finite
space of states and actions.
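As a numerical illustration of the above result (our own toy check, not part of the paper), value iteration on a small random MDP whose actions carry an extra interval component still converges, as the contraction argument predicts; the states, rewards, and transition probabilities below are arbitrary.

# Toy check (ours): value iteration converges with an interval-extended action set.
import numpy as np

n_states = 2
actions = [(vf, n) for vf in range(2) for n in range(1, 3)]  # (a_original, skip factor n)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, len(actions)))  # P[s, a, s']
R = rng.uniform(-1.0, 0.0, size=(n_states, len(actions)))            # reward(s, a)
gamma = 0.95

V = np.zeros(n_states)
for it in range(1000):
    Q = R + gamma * P @ V          # Q[s, a] = r(s, a) + gamma * sum_s' P[s, a, s'] V[s']
    V_new = Q.max(axis=1)          # greedy backup over the extended action set
    if np.max(np.abs(V_new - V)) < 1e-8:   # the contraction drives the gap to zero
        print(f"converged after {it} iterations")
        break
    V = V_new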
IV. FIDRL-BASED DVFS SCHEDULING
In this section, we introduce formulations of the FiDRL-based
DVFS problem, aiming at minimizing the system’s overall
energy consumption. We then describe the training for FiDRL-
based DVFS in terms of inter-task and intra-task scheduling.
A. FiDRL-based DVFS Encoding
To adopt FiDRL to the DVFS scheduling problem, we define
the state s_t, action a_t, and reward r_t at decision t as follows:
Action: a_t = (vf_t, I_t), where vf_t, corresponding to a_original
in Eqn. 3, refers to the operating Voltage and Frequency (V/F)
pair on the processing core. I_t, corresponding to a_interval in
Eqn. 3, specifies the number of tasks executed by the
processing core (discrete scenario) or the time spent by the
processing core (continuous scenario) at this V/F level.
State: $s_t = (T_t, \bar{P}_t, \Delta T_t, \Delta \bar{P}_t, vf_{t-1}, I_{t-1})$, where T_t rep-
resents the temperature of the system at t; $\bar{P}_t$ indicates the
average power consumption from the last state to the current,
namely from t−1 to t; $\Delta T_t = T_t - T_{t-1}$ and $\Delta \bar{P}_t = \bar{P}_t - \bar{P}_{t-1}$
denote the changes of temperature and power from the last
observation (t−1) to the current (t); vf_{t−1} is the operating V/F
level from the last state to the current; I_{t−1} denotes the number
of tasks completed or the time spent by the processing core
from the last state to the current. vf_{t−1} and I_{t−1} together can
be seen as the last action that the agent chose. Each element
of the state vector is normalized to (0, 1).
Reward: r_t = −ξE_t, where E_t denotes the energy con-
sumption from the last state to the current, and ξ is a scaling
factor for normalization to stabilize the learning process.
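The following sketch (ours) illustrates one possible realization of this encoding; the normalization bounds T_MAX, P_MAX, I_MAX and the number of V/F levels are assumed values, not the paper's exact settings.

# Illustrative state/reward encoding for Section IV-A (assumed bounds).
import numpy as np

T_MAX, P_MAX = 100.0, 10.0          # assumed sensor ranges used for normalization
VF_LEVELS, I_MAX = 10, 1250.0       # assumed number of V/F pairs and max interval
XI = 1.0                            # reward scaling factor xi (Table III)

def encode_state(T_t, P_t, T_prev, P_prev, vf_prev, I_prev):
    """s_t = (T_t, P_t, dT_t, dP_t, vf_{t-1}, I_{t-1}), each element scaled into (0, 1)."""
    return np.array([
        T_t / T_MAX,
        P_t / P_MAX,
        (T_t - T_prev + T_MAX) / (2 * T_MAX),   # shift the deltas into (0, 1)
        (P_t - P_prev + P_MAX) / (2 * P_MAX),
        vf_prev / (VF_LEVELS - 1),
        I_prev / I_MAX,
    ], dtype=np.float32)

def reward(energy_since_last_decision):
    """r_t = -xi * E_t: lower energy between invocations gives a higher reward."""
    return -XI * energy_since_last_decision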
B. FiDRL-based DVFS Training
After encoding the essential elements of FiDRL, we adopt
Proximal Policy Optimization (PPO) as our training algorithm
for FiDRL. PPO is effective in restricting the updated degree
of the policy, resulting in better convergence than others [39].
PPO can also be combined with other techniques such as the
Actor-Critic (AC) architecture [40] to improve the training
stability and reduce the training variance. Importantly, it
supports both discrete and continuous action spaces, which
match our FiDRL framework (see Eqn. 4). We thus introduce
the training for inter- and intra-task scheduling, featuring two
different action spaces (different aintervals), in the following
two subsections.
1) Training for Inter-task Scheduling: Algorithm 1 states
the FiDRL training process employing PPO/AC. It is required
to train two networks, namely the policy and value networks,
as defined in the AC architecture. The policy network is
an actor which is the main agent that interacts with the
environment, and it outputs the probability of all the actions
when the action space is discrete:
$\pi_{\theta}(a|s) = \mathrm{softmax}_{a \in A}(f_{\theta}(s))$  (13)
As Eqn. 13 shows, f_θ is the actor-network with parameters θ,
accepting a state s and outputting the probabilities of
all actions a. The softmax function [41] is employed
to represent the probability distribution over all the actions.
Specifically, a_flexible = (a_origin, a_interval) is discrete. If
we assume a_origin has n dimensions and a_interval has m
dimensions, then the output size of this actor-network should
be n × m.
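As an illustration of how a single n × m output head can represent the joint discrete action, the following sketch (ours, with assumed n, m, and index ordering) maps a flat softmax index back to the (a_origin, a_interval) pair.

# Sketch (assumed decoding order): flat index <-> joint discrete action.
n, m = 10, 5   # n V/F levels, m interval choices (skip factors 1..n_max)

def decode_action(flat_index: int):
    """Map a flat index in [0, n*m) back to (vf_level, skip_factor)."""
    vf_level = flat_index // m
    skip = flat_index % m + 1     # skip factor in [1, m]
    return vf_level, skip

def encode_action(vf_level: int, skip: int) -> int:
    return vf_level * m + (skip - 1)

assert decode_action(encode_action(7, 3)) == (7, 3)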
After collecting trajectories by using the actor-network,
the empirical advantage A should be calculated by using
the value of each state in collected trajectories following
Fig. 4. The policy and value network architectures in the FiDRL-DVFS problem, which contain four-layer fully connected networks with 32-bit precision.
The post-processing steps after the output layer differ, depending on the discrete/continuous actions (policy networks) or the value of the input state (value network).
Dim(.) is the function to get the dimension. As for action selection, Policy-A outputs a discrete space with n × m possible actions, and Policy-B outputs
one value from a continuous space [x, n × y].
Algorithm 1 FiDRL Training under PPO/AC
Require: Learning rate α, discount factor γ, General Advan-
tage Estimation (GAE) parameter λ, PPO-Clip parameter
ϵ, epochs E, trajectory size J, mini-batch size N.
Ensure: Policy parameters θ, value function parameters w.
1: Initialize policy network parameters θ and value network
parameters w.
2: Initialize experience buffer D.
3: for each epoch E do
4: Collect a set of complete trajectories {τ1, τ2, ..., τJ} by
executing the latest policy πθ in the environment.
5: Calculate the advantage function At for each timestep
t in each trajectory.
6: for every mini-batch N in the collected J trajectories do
7: Calculate the losses Lvalue(w) and Lpolicy(θ, w).
8: Update θ and w using gradient descent and the Adam
optimizer.
9: end for
10: end for
Eqns. 14 and 15 (line 5 in Algorithm 1). Specifically, we
use General Advantage Estimation (GAE) to calculate At
following Eqn. 15, which balances the trade-off between bias
and variance of the estimation [39].
$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$  (14)
$A_t = \sum_{k=0}^{K} (\gamma\lambda)^k \delta_{t+k}$, where t specifies the decision index in [0, K]  (15)
$V_w(s) = f_w(s)$  (16)
Eqn. 16 defines the value network f_w with parameters w,
accepting the state s and outputting the value of this state.
of the policy and value networks are given as Eqn. 20 and 21:
bj
t(θ) = πθ(aj
t|sj
t)
πθold (aj
t|sj
t)(17)
Lclip(θ) = 1
J
J
X
j=1
(X
t
min(bj
t(θ)Aj
t,clip(bj
t(θ),1ϵ, 1+ϵ)Aj
t))
(18)
Lentropy(θ) = 1
J
J
X
j=1
(X
tX
a
πθ(aj
t|sj
t) log πθ(aj
t|sj
t))
(19)
Lvalue(w) = 1
J
J
X
j=1
(X
t
(Aj
tVw(sj
t))2)(20)
Lpolicy (θ, w) = ˆ
Et,j [Lclip(θ)c1Lvalue(w) + c2Lentropy(θ)]
(21)
The design of the above loss functions follows the PPO
mechanism. Lclip and Lentropy are used to control the degree
of updating the policy, for better convergence of the training.
In Eqn. 18 to 20, we firstly use Ptto accumulate the loss of
decision steps for each trajectory within the mini-batch size.
Then, we get the average result of all trajectories with size
Jcollected in line 4. We finally use these losses to update
networks’ parameters once in this mini-batch.
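To make the update step concrete, the sketch below (our own condensed version, not the authors' released code) computes the GAE advantages of Eqns. 14-15 and the combined PPO loss of Eqns. 17-21 for one mini-batch; tensor shapes, the zero bootstrap at the end of a trajectory, and the use of batch means are our assumptions, while c1, c2, and ϵ follow Table III.

# Condensed sketch (ours) of GAE and the PPO losses for one mini-batch.
import torch

def gae(rewards, values, gamma=0.95, lam=0.95):
    """Eqns. 14-15: TD residuals and the (gamma*lambda)-discounted advantage."""
    adv, running = torch.zeros_like(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0  # assumed zero bootstrap
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_losses(new_logp, old_logp, entropy, adv, values, c1=0.5, c2=0.01, eps=0.2):
    """Eqns. 17-21: clipped surrogate, value loss, and entropy bonus."""
    ratio = torch.exp(new_logp - old_logp)                        # b_t(theta), Eqn. 17
    l_clip = torch.min(ratio * adv,
                       torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
    l_value = ((adv - values) ** 2).mean()                        # (A - V)^2 form of Eqn. 20
    l_entropy = entropy.mean()                                    # per-sample policy entropies
    return -(l_clip - c1 * l_value + c2 * l_entropy)              # minimize the negative objective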
2) Training for Intra-task Scheduling: For intra-task DVFS
scheduling, which employs the continuous action space, the
basic process follows Algorithm 1 as well. However, the policy
network differs in the form of its output:
$f_{\theta}(s) = \langle \mu(s),\ \log(\sigma(s)) \rangle, \quad \sigma(s) = \exp(\log(\sigma(s)))$  (22)
$\pi_{\theta}(a|s) = \frac{1}{\sigma(s)\sqrt{2\pi}} \exp\Big(-\frac{(a - \mu(s))^2}{2\sigma^2(s)}\Big)$  (23)
Eqn. 22 denotes the output of the actor-network f_θ with
parameters θ; it has two values representing the average
value of the action, µ(s), and the logarithm of the action’s
standard deviation, σ(s), respectively. Hence, the action can
be sampled from the normal distribution [42], given its average
value and standard deviation according to the actor-network.
Specifically, a_flexible = (a_origin, a_interval) is continuous. If
we assume a_origin has n dimensions and a_interval is a
continuous range from x to y, then the output of this actor-
network should be a continuous range from x to n × y. Line
4 in Algorithm 1 uses the actor-network to collect the
Fig. 5. The detailed FiDRL-DVFS system, based on the heterogeneous ZYNQ
platform. Blue components represent inter-task-based DVFS, green ones
represent intra-task ones, purple ones represent FiDRL agent control, and
yellow represents the related drivers of the above three.
trajectories based on the policy. The network should be altered
to adapt to the continuous action space, as Eqn. 22 shows.
As for the value network, the same settings as in Eqns. 14-16
are used under the intra-task DVFS scheduling scenarios, and
the loss functions are calculated as in Eqns. 17-21.
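For illustration, sampling a continuous flexible action from the Gaussian head of Eqns. 22-23 can be sketched as follows (ours; the clipping to [x, n × y] follows Fig. 4, and the function name and bounds are placeholders).

# Sketch (ours): sample a continuous action from the Gaussian head of Eqns. 22-23.
import torch
from torch.distributions import Normal

def sample_continuous_action(mu, log_sigma, low, high):
    """mu, log_sigma: outputs of Policy Network-B for the current state."""
    dist = Normal(mu, torch.exp(log_sigma))     # the density of Eqn. 23
    action = dist.sample()
    log_prob = dist.log_prob(action)            # used in the ratio of Eqn. 17
    return action.clamp(low, high), log_prob    # clip into the assumed [x, n*y] range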
3) Network Architecture for Training: Fig. 4 illustrates
the network architectures we designed for the policy and value
networks. Both networks are designed as four-layer fully
connected neural networks. The input layer follows the dimen-
sion of the state that we encoded in Section IV-A. The two hidden
layers have the same size and can be altered for higher learning
performance. The output layer has three variants:
Policy network-A: the distribution over all actions when the
action space is discrete; the softmax function is then used to
select the most probable action (see Eqn. 13).
Policy network-B: the average and standard deviation
of the action when the action space is continuous; the
normal distribution is then used to sample one of the actions
using the average µ and standard deviation σ (see Eqn. 23).
Value network: the value of the current state (see
Eqn. 16).
Before training, we select one policy network
between A and B according to the action space characteristics,
together with the value network.
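A minimal PyTorch rendering of these three networks is sketched below (ours, not the released implementation); the state dimension, n, m, and the hidden size H = 64 follow the text and Table III, while treating Policy-B's output as one (µ, log σ) pair for a single combined continuous action is our reading of Fig. 4.

# Minimal sketch (ours) of the four-layer networks of Fig. 4.
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # input layer, two hidden layers, output layer (the four-layer structure of Fig. 4)
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, n, m = 6, 10, 5
policy_a = mlp(state_dim, n * m)   # discrete: logits over n*m joint actions (softmax applied afterwards)
policy_b = mlp(state_dim, 2)       # continuous: one (mu, log_sigma) pair for the action
value_net = mlp(state_dim, 1)      # value of the input state (Eqn. 16)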
V. FIDRL-BASED DVFS SYSTEM
In this section, we provide the FiDRL-based DVFS system
design and implementation, together with a hybrid training
algorithm for training DRL agents on embedded systems.
A. System Design and Implementation
We design and implement our FiDRL-based DVFS schedul-
ing system on the Xilinx ZYNQ UltraScale+ heterogeneous
embedded platform, consisting of the Processing System (PS,
i.e., CPU) and the Processing Logic (PL, i.e., FPGA). Fig. 5
illustrates the entire system. Specifically, the system monitor
and all the related drivers are implemented on the PS side,
and the FiDRL agent is deployed on the PL side. The tasks
Algorithm 2 On/Off-Chip Hybrid FiDRL-DVFS Training
Require: Off-chip training host Moff, on-chip embedded
device Mon, trajectory size J.
Ensure: Policy parameters θ, value function parameters w.
1: Initialize policy parameters θ on both Mon and Moff and
value function parameters w on Moff.
2: Initialize necessary variables on Moff following Algo-
rithm 1.
3: for each epoch E do
4: On Mon do
5: Execute tasks, collecting a set of complete trajectories
{τ1, τ2, ..., τJ} by executing the latest policy πθ.
6: Send the trajectories {τ1, τ2, ..., τJ} to Moff.
7: On Moff do
8: Update the policy and value networks following lines
5 to 9 in Algorithm 1.
9: Send the latest policy parameters θ to Mon.
10: end for
being scheduled respectively for the inter- and intra-task DVFS
scenarios are realized in different processing cores: inter-task
scheduled tasks are implemented as FPGA-based processing
cores, while intra-task scheduled tasks are implemented as
threads in CPUs. This separation is due to the DVFS capability
provided by the ZYNQ platform: V/F can only be applied per
task for FPGA cores, while on CPUs the V/F regulation could
happen at any time of the task execution [43].
The procedure of the FiDRL-based DVFS scheduling com-
prises four stages at runtime: (i) the system monitor cap-
tures the states; (ii) the FiDRL agent is invoked and accepts the
states from the system monitor through the AXI interconnection
between PS and PL, and then outputs the V/F level and its
a_interval; (iii) the corresponding DVFS controller regulates
the given V/F of the processing core; (iv) the processing core
runs tasks at this regulated V/F for a_interval. Specific setups
of these hardware modules are described in Section VI-A.
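The four stages can be summarized by the following pseudocode-style sketch (ours); monitor, agent, dvfs, and core are placeholder handles for the system monitor, the PL-side FiDRL agent IP, the DVFS controller, and the processing core described above.

# Runtime loop sketch (ours); all handles are placeholders, not the real driver API.
def fidrl_dvfs_runtime(monitor, agent, dvfs, core):
    while core.has_work():
        state = monitor.read_state()            # (i) capture temperature/power/etc.
        vf_level, interval = agent.infer(state) # (ii) invoke the FiDRL agent over AXI
        dvfs.apply(vf_level)                    # (iii) regulate the V/F of the core
        core.run_for(interval)                  # (iv) execute at this V/F until the
                                                #      next scheduled agent invocation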
B. Hybrid On/Off-Chip Training
The DRL training process of the system includes two basic
phases: (1) data collection and (2) network update. It is
important to note that the overall system is regarded as the
environment that the FiDRL agent faces. The training data thus
needs to be collected on chip by undertaking the inference
process of the agent. On the other hand, the network update
(namely, the training) scheme comes with two design choices
that are discussed below.
On-chip network update (infeasible): The first reason
for the infeasibility is that the resources on targeted
embedded devices are stringent, making it impractical to
perform network updates on the chip. Secondly, on-chip
network updates could impact system states, since on-chip
computations bring noisy temperature or power altera-
tions to the chip, leading to biased state observations.
Off-chip network update (feasible): As described in
Section II, RL training could be value-based or policy-
based, requiring single-step data and the whole trajectory,
Fig. 6. FiDRL workflow on DVFS scheduling in embedded devices, including
both training and inference stages.
respectively. For the single-step paradigm, the network is
updated after each step, resulting in substantial commu-
nications between the system and the remote machine.
The excessive communication delays the states captured
by the agent and disrupts the timing, which specifically
hampers the training results of the time-sensitive FiDRL
scheme. It is thus advantageous to update
the network with a whole trajectory, which corresponds
to the policy-based approach.
Based on the above discussion, we run the inference
process on chip to record the data trajectories and update the
neural networks off chip. Algorithm 2 shows our proposed
FiDRL-based DVFS training approach in an on/off-chip hybrid
way. It is noted that only the policy network is required
to be deployed on the on-chip device Mon to collect the
data, while on the training host Moff, a copy of
the policy network in Mon and the value network are both
needed for the training process (line 1 in Algorithm 2). We
regard Mon as the actor in the AC scheme, and it sends
data trajectories to Moff (lines 4-6 in Algorithm 2). Then
we refer to lines 5-9 in Algorithm 1 for the off-chip network
update. Mon only needs to accept the latest parameters of the
policy network at each epoch. This on/off-chip hybrid training
paradigm is generalizable to embedded system-targeted DRL
agent deployment.
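From the host's perspective, Algorithm 2 reduces to the following loop (our sketch; the helper names and the parameter-transfer mechanism are assumptions, the actual transport in our setup being SCP over SSH as described in Section VI-A).

# Host-side sketch (ours) of the on/off-chip hybrid loop in Algorithm 2.
def hybrid_training(host_trainer, device, epochs, trajectory_size):
    device.load_policy(host_trainer.policy_parameters())     # line 1: shared initial theta
    for _ in range(epochs):                                   # line 3
        trajs = device.collect_trajectories(trajectory_size)  # lines 4-6: on-chip inference
        host_trainer.update(trajs)                            # lines 7-8: PPO/AC update off chip
        device.load_policy(host_trainer.policy_parameters())  # line 9: push the latest theta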
Fig. 6 summarizes the workflow of our proposed FiDRL-
based DVFS system. At the inference stage, the applications
execute on the processing core iteratively. Our FiDRL agent is
flexibly invoked, accepting the system state and outputting
the voltage, frequency, and the execution interval until its next
invocation. At the training stage, we propose an on/off-chip
hybrid training scheme. A specific advantage of this design
is to allow the off-chip updating to employ various models
for training without considering on-chip resource constraints.
Importantly, the on-chip agent only takes the inference role,
causing no additional timing overhead at run-time.
VI. EXPERIMENT RESULTS AND ANALYSIS
In this section, we first describe the experiment setups and
then evaluate experimental results from several perspectives.
TABLE II
STATE-OF-THE-ART DVFS SCHEDULING METHODS FOR COMPARISON.

No-DVFS (Baseline) | Execution at the given V/F without DVFS.
Heuristic          | V/F levels are decided based on given mappings of the absolute
                   | values or the delta changes of temperature and power.
DATE'22 [1]        | Conventional DRL-based DVFS with fixed invocation intervals
                   | for power management.
MobiCom'23 [16]    | Conventional DRL-based DVFS with fixed invocation intervals
                   | for makespan optimization.
TMC'24 [15]        | Conventional DRL-based DVFS with fixed invocation intervals
                   | for thermal management.
ICML'21 [34]       | TempoRL on DVFS scheduling with variable invocation intervals.
FiDRL (Ours)       | Our FiDRL-based DVFS scheduling.

All the DRL-based methods use our hybrid training algorithm to fully
utilize the data generated on the chip.
A. Setups
We employ an x86-Linux PC with Intel(R) Core(TM) i7-
12700 CPU as our training machine, and a Xilinx ZYNQ
UltraScale+ ZCU104 Evaluation Board as the verification
platform. The application benchmarks running on the PL of
ZCU104 are neural network accelerators (F-01 to F-08 shown
in Table IV), which are used to verify the FiDRL-based inter-
task DVFS scheduling. The benchmarks running on the PS
are adopted from SPLASH-2 [44] (C-01 to C-08 shown in
Table IV), which are used to evaluate the FiDRL-based intra-
task DVFS scheduling.
Our FiDRL agent consists of a four-layer float32 fully
connected network with 64 neurons in the middle two layers.
For agents and inter-task benchmarks, they are synthesized
and exported as IP cores through Xilinx Vitis HLS 2022.1.
For the intra-task benchmarks, we use cpufreq in Linux to
scale the runtime frequency during the execution. As for the
DVFS controller used for PL, we employ MMCM, which can
be found in the Xilinx Vivado IP Repository for frequency
scaling, and PMIC which is hardened on the ZCU104 for
voltage scaling [43]. We integrate the above IP cores in
the PL under Xilinx Vivado 2022.1. With available onboard
sensors, we obtain the state values defined in Section III-B
through I2C. We adopt PYNQ2 to invoke our designed IPs
and transfer the necessary data between PS and PL. For
data transmission between the training host and ZCU104, the
SCP command via SSH is adopted. The data communication
bandwidths are 430 Mbps (training host → ZCU104) and 368
Mbps (ZCU104 → training host). Bandwidth variation does
not impact FiDRL’s training quality because our hybrid
training algorithm utilizes on-chip inference trajectories for
the off-chip network update, which is triggered once
the required trajectories are successfully collected, whether the transfer
is fast or slow. In our experiment, the maximum data volume per
transfer between the training host and the embedded device is
4 MB (<10 ms with set bandwidth).
For F-01 to F-08 shown in Table IV, each iteratively clas-
sifies or detects 300k images. The unit of the ainterval is the
2 “Python Productivity for Zynq” (https://www.pynq.io/), an open-source
project from Xilinx to use Python to control Zynq, a heterogeneous platform
containing PS (CPU) and PL (FPGA).
Fig. 7. Sample cases of the inter- and intra-task scheduling using FiDRL.
Fig. 8. Average energy consumption (J) and execution time (s) over different V/F
settings (MHz/mV); the left applies to inter-task benchmarks, the right to intra-task ones.
number of images processed. The possible aoriginal is chosen
from {80/650, 100/650, 120/675, 140/700, 160/750, 180/800,
200/850, 220/875, 240/900, 260/900}MHz/mV. Those bench-
marks with all possible settings above satisfy the STA con-
straints and produce no timing errors or data hazards. For C-
01 to C-08 shown in Table IV, each is iteratively executed for
30k times. The unit of the ainterval is in milliseconds. The
possible aoriginal is chosen from {300/650, 400/700, 600/750,
1200/900}MHz/mV, which are the available frequency set-
tings of ARM Cortex-A53 in ZCU104. Fig. 7 illustrates the
snapshots of inter- and intra-task scheduling by FiDRL for F-
08 and C-08, respectively. In the upper of Fig. 7, the voltage
and frequency adjustments happen between tasks. In the lower
of Fig. 7, those adjustments occur at any time when the
processing core is executing tasks.
The total energy consumption, including both dynamic
and static energy, and execution time of the benchmarks are
selected as our main evaluation metrics. The state-of-the-art
approaches adopted for comparison are summarized in Table
II. Note that RL-related methods are all trained with our
proposed hybrid training mechanism, for the sake of fairness
in evaluation. The essential hyper-parameter settings of FiDRL
training are shown in Table III. Results are scaled on the No-
DVFS approach. To figure out the energy consumption and
execution time of No-DVFS, we empirically evaluate its results
at each stable V/F setting, as shown in Fig. 8. It shows that
(i) the execution time decreases with the increase of the V/F
level; (ii) vf3 (140/700 MHz/mV) in all inter-task benchmarks
and vf2 (400/700 MHz/mV) in all intra-task benchmarks on
average lead to the lowest energy consumption of all V/F
settings. We thus choose those two V/F settings as the No-
DVFS execution baseline.
B. Results and Analysis
1) Evaluating Energy and Makespan Reduction: Table IV
lists the results on the normalized energy consumption of the
TABLE III
HYPER-PARAMETER SETTINGS FOR FIDRL-BASED DVFS

NAME                 | SYMBOL       | VALUE
epoch                | E            | 200
mini-batch size      | N            | 128
trajectory size      | J            | 64
learning rate        | α            | 0.01
discount factor      | γ            | 0.95
GAE parameter        | λ            | 0.95
PPO-clip parameter   | ϵ            | 0.2
hidden layer size    | H            | 64
skip factor          | n_max        | 5
reward degree        | ξ            | 1
minimum interval     | interval_min | 250
value loss weight    | c1           | 0.5
entropy bonus weight | c2           | 0.01
overall system (E1) and the DRL agent (E2), as well as the
execution time of the whole system (T). Compared to the
baseline No-DVFS, our method results in the most energy
reduction (23.3%) on average. Yet, the Heuristic, [1], [15],
[16], and [34] save on average 9.5%, 13.6%, 8.5%, -4.9%, and
15.3%, respectively. Generally, the DRL-based methods lead
to higher energy reduction than the heuristic except [16]. The
reason for energy increment in [16] is that it aims to increase
the utility instead of decreasing the overall energy. All methods
reduce more energy in the intra-task DVFS scheduling than the
inter-task one, given higher temporal flexibility when V/F can
be altered.
Looking into the energy consumption of the DRL agent,
we use [1] as the baseline and normalize the agent’s en-
ergy consumption of all the methods based on it. As
the columns labelled E2 show, [15] and [16] have agent
overheads similar to the baseline, as they are conventional
DRL-based methods without flexible invocations. [34] re-
duces agent energy consumption by 11.5%, while our method
reduces it by 55.1%. Specifically, the inter- and intra-
task benchmarks show a similar agent energy reduction propor-
tion when employing our method. This indi-
cates that our FiDRL-DVFS stably and effectively re-
duces the agent’s overhead under different DVFS scenarios.
TABLE V
DVFS OVERHEAD EVALUATION

Avg. Time (DFS) | Avg. Time (DVS) | E_dvfs/E_agent | E_dvfs/E_sys
3.3 µs          | 7.1 ms          | 23.5%          | 3.7%
Notably, the overhead dur-
ing the execution not only in-
cludes the DRL agent, but also
the V/F switching for DVFS.
Table V reports the overhead
caused by the DVFS process.
In terms of the average delay
of the DVFS process, DFS costs 3.3µs, and DVS costs 7.1ms
measured by our tools proposed in [43]. It is also noted that
DVFS energy overhead is equivalent to 23.4% of the agent
execution energy and 3.7% of the overall energy. Importantly,
our FiDRL is able to temporally reduce the DVFS overhead
because the DVFS switching is aligned with the agent invo-
cations.
We evaluate the benchmarks’ execution time under different
methods, to investigate the parasitic performance loss due to en-
ergy reduction. The results indicate that our method gains the
energy benefit not at the cost of performance reduction, while
TABLE IV
NORMALIZED ENERGY CONSUMPTION & EXECUTION TIME USING DIFFERENT METHODS. ↓ MEANS SMALLER IS BETTER, ↑ MEANS LARGER IS BETTER.

Benchmark | E1 (Heuristic, [1], [15], [16], [34], Ours) | E2 ([15], [16], [34], Ours) | T (Heuristic, [1], [15], [16], [34], Ours)
F-01 FM 0.95 0.90 0.92 1.06 0.88 0.83 1.02 0.99 0.92 0.47 1.02 1.07 1.10 0.90 0.96 0.93
F-02 CM 0.97 0.89 0.93 1.02 0.89 0.84 1.03 1.01 0.91 0.41 1.02 1.07 1.09 0.95 1.01 1.01
F-03 CR 0.96 0.89 0.98 1.12 0.88 0.82 1.02 0.98 0.80 0.43 0.97 1.06 0.99 0.92 1.06 0.98
F-04 C’R 0.93 0.91 0.92 1.09 0.91 0.82 1.00 0.97 0.93 0.50 1.04 1.01 1.02 0.99 1.05 0.99
F-05 YV 0.97 0.90 0.95 0.99 0.89 0.81 0.98 0.99 0.79 0.42 1.05 1.08 1.07 0.95 1.09 1.03
F-06 MV 0.94 0.87 0.90 1.05 0.91 0.80 1.04 1.02 0.81 0.42 1.03 1.00 0.99 0.95 0.94 0.93
F-07 YC 0.95 0.91 0.94 1.07 0.87 0.80 1.01 0.97 0.94 0.49 1.02 1.06 1.05 0.96 1.07 1.08
F-08 MC 0.92 0.89 0.94 1.02 0.85 0.81 1.00 0.99 0.80 0.46 0.99 1.06 1.12 0.92 0.95 0.98
C-01 bar. 0.85 0.80 0.88 1.04 0.79 0.71 0.98 0.97 0.82 0.41 1.08 1.08 1.07 0.91 1.00 0.90
C-02 fft 0.83 0.81 0.89 0.99 0.82 0.73 1.01 0.96 0.89 0.47 0.95 1.07 1.03 0.93 1.03 0.92
C-03 ray. 0.88 0.82 0.90 1.03 0.79 0.72 0.97 0.99 0.95 0.48 0.99 0.97 1.02 0.94 1.03 0.92
C-04 vol. 0.89 0.84 0.91 1.02 0.84 0.70 1.02 1.01 0.96 0.43 0.98 1.02 1.05 0.94 1.08 0.98
C-05 lu 0.85 0.82 0.87 1.10 0.83 0.70 1.02 0.98 0.94 0.42 1.10 1.00 0.98 0.94 1.01 0.94
C-06 rad. 0.86 0.84 0.89 1.01 0.80 0.71 0.99 0.96 0.95 0.49 0.95 1.06 0.99 0.93 1.03 1.01
C-07 oce. 0.84 0.87 0.91 1.07 0.81 0.72 0.99 1.02 0.91 0.47 0.97 1.07 1.04 0.93 1.00 0.91
C-08 hmm 0.89 0.85 0.92 1.09 0.79 0.74 1.03 1.04 0.83 0.42 1.02 1.02 1.01 0.94 1.08 1.04
AR inter (%) 5.0 10.0 6.5 -5.3 11.0 18.0 -1.25 1 14.0 55.0 -2.0 -5.0 -5.4 5.8 -2.0 1.0
AR intra (%) 14.0 17.0 10.4 -4.4 19.0 28.0 -0.13 0.8 9.0 55.0 -1.0 -4.0 -2.4 6.7 -3.0 4.8
AR total (%) 9.5 13.6 8.5 -4.9 15.3 23.3 -0.7 0.9 11.5 55.1 -1.6 -4.4 -3.9 6.2 -2.4 2.9
F-01 to F-08 represent benchmarks featuring inter-task scheduling; C-01 to C-08 represent benchmarks featuring intra-task scheduling.
In F-xx benchmarks, the first letter denotes the network architecture, and the second letter denotes the dataset used.
(First letter) F: Feed-forward network; C: Simple CNN; C’: CNN with more layers; Y: YOLOTiny-V3; M: MobileNetV2.
(Second letter) M: MNIST; R: CIFAR-10; V: VOC2017; C: COCO.
E1 is the system’s overall energy, normalized on the No-DVFS method: E1 = E_used method / E_No-DVFS.
T is the whole execution time, normalized on the No-DVFS method: T = T_used method / T_No-DVFS.
E2 is the agent energy, normalized on [1]: E2 = E_used method / E_[1].
AR: average reduction ratio compared to the baseline; the baseline of E1 and T is No-DVFS, and the baseline of E2 is [1]. An AR larger than zero means
reduction, and smaller than zero means increase. "inter", "intra", and "total" mean the average over inter-task, intra-task, and both, respectively.
Note: All methods use the optimized parameters we tuned. Values in bold are maximal/minimal for their set of experiments.
TABLE VI
DETAILED COMPARISON BETWEEN FIDRL AND [34].

           | [34] - LIGHT | [34] - ALTER | [34] - BOTH |    Ours
           |   E1    E2   |   E1    E2   |   E1    E2  |  E1    E2
inter-task |  1.24  0.55  |  0.97  0.98  |  1.19  0.56 | 0.90  0.50
intra-task |  1.31  0.59  |  0.96  0.97  |  1.24  0.58 | 0.85  0.51

E1 and E2 are the same as in Table IV. They are normalized with respect to [34]. The two
rows of data represent inter- and intra-task, respectively.
other methods increase the benchmarks’ average execution
time. Specifically, [1] increases the benchmarks' execution
time the most (by 4.4% on average) due to the non-optimized
invocation of the DRL agent. [16] reduces the total execution
time the most, as it focuses on makespan optimization. It is
noteworthy that our proposed FiDRL even reduces the total
execution time by 2.9% on average. This benefits from our
method's aim of reducing the agent invocations, which are
a contributing factor to the timing overheads during benchmark
execution.
To further compare the performance of [34] and
FiDRL, we perform extended analysis and comparison. Ac-
cording to the internal differences between the mechanisms of [34]
and FiDRL, given in Fig. 3, we provide two modified versions
of [34]: (1) LIGHT represents the pruned version of [34],
whose total parameters across its two networks are set similarly to
ours, in order to keep the spatial footprint similar between the
two-network structure of [34] and our one-network scheme;
and (2) ALTER represents that the state input to Net2 is
re-observed. Table VI shows the results of these two modified
Fig. 9. Evaluation on three state encoding methods over all benchmarks
(normalized energy consumption with different state encodings).
versions and our FiDRL in terms of overall energy E1 and
agent energy E2. Our method outperforms both of them: LIGHT
increases the average overall energy by 27.5%
compared to [34] due to loss of accuracy in policy inference
after pruning. ALTER reduces the overall energy by
3.5% relative to [34], but is still not as good as FiDRL due to the mis-
matched a and j, as shown in Fig. 3. When we implement
both modifications (BOTH), FiDRL still shows an advantage in overall
energy reduction.
2) Evaluating the State Encoding: Since the state encoding
plays a vital role in the FiDRL-DVFS formulation, we eval-
uate different state encoding options. As Section IV-A men-
tioned, the state contains the system’s temperature and average
power consumption, which are important factors in the DVFS
scheduling. Thus, we investigated different combinations of
the absolute value or the delta change of the temperature and
average power consumption. Fig. 9 demonstrates the impacts
of different state encodings. We employ three different state
encoding mechanisms, namely (i) “absolute”, representing
that the current values of the temperature and average power
consumption are used in the state; (ii) “delta”, meaning that the
differences of the temperature and average power consumption
from the last state to the current are included in the state; and
(iii) “proposed”, our approach combining both “absolute” and
“delta”. The results in Fig. 9 indicate that more energy
reduction is achieved with our proposed approach, reducing
energy consumption by 7% on average.
3) Evaluating the Range of the Decision Interval: We investi-
gate invocation-aware hyper-parameter settings of FiDRL. As
Eqn. 4 shows, two parameters decide the range of the invoca-
tion interval of FiDRL, namely (1) the minimum invocation
interval interval_min, and (2) the maximum invocation interval
n_max or interval_max. To determine the
range of the decision interval, the right of Fig. 10 demonstrates
the results of a hyperparameter optimization (HPO) [45] approach
to find the best interval_min and n_max. We employ grid
search, a common HPO technique that exhaustively searches
over a set of hyperparameters. We first empirically set the ranges
of the two parameters (interval_min ∈ [20, 5000], n_max ∈ [1, 10])
with regard to the length of one complete epoch (300k images
or 30k iterations in our setups), and then sweep all the possible
combinations of the two parameters within the ranges. It is
noted that the optimal settings of both parameters are (250,
5), on average over all benchmarks.
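The grid search itself can be sketched as follows (ours; the candidate grids mirror the interval_min and n_max values discussed in this subsection, and evaluate() is a placeholder for one full training-plus-measurement run).

# Grid-search sketch (ours) over (interval_min, n_max); evaluate() is a placeholder.
import itertools

INTERVAL_MIN_GRID = [20, 50, 250, 2000, 5000]
N_MAX_GRID = [1, 2, 5, 7, 10]

def grid_search(evaluate):
    best = None
    for interval_min, n_max in itertools.product(INTERVAL_MIN_GRID, N_MAX_GRID):
        energy = evaluate(interval_min, n_max)   # train + measure energy at this setting
        if best is None or energy < best[0]:
            best = (energy, interval_min, n_max)
    return best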
To evaluate the hyper-parameters’ effect in different bench-
marks, the left of Fig. 10 provides more details on each bench-
mark’s energy consumption at the optimal interval settings
(250, 5). Specifically, the upper left of Fig. 10 shows the energy consumption under different intervalmin with a fixed maximum interval (5 × intervalmin). It can be observed that, for all benchmarks, moderate intervalmin settings outperform extreme settings, by 20% compared to the worst setting. The lower and higher ends of intervalmin correspond to too fine-grained (e.g., 20, 50) and too coarse-grained (e.g., 2000, 5000) invocations, respectively. A too fine-grained interval means that the agent is still invoked frequently, while a too coarse-grained one means the agent is rarely invoked.
The lower left of Fig. 10 shows the energy consumption at various maximum invocation intervals while keeping intervalmin = 250. nmax and intervalmax represent the upper-bound intervals of the inter-task and intra-task scenarios, respectively. The result shows that energy is reduced the most (by 12% compared to the worst setting, where intervalmin = intervalmax) when the maximum decision interval is 5 times intervalmin. The energy optimization exhibits a similar trend as in the upper part of Fig. 10, i.e., moderate interval settings outperform extreme settings. Additionally, setting nmax to 1, i.e., intervalmin = intervalmax, fixes the decision interval, making the scheme effectively the same as conventional DRL. This setting has the worst performance, justifying the necessity of the “flexible invocation” in the FiDRL-DVFS approach.
4) Evaluating the Learnt Policy: Fig. 11 shows the action distribution of FiDRL's learnt policy under the inter-task benchmarks, with different action spaces. The x-axis represents the possible decision period (whose minimum is intervalmin), and the y-axis represents the V/F levels (i.e., aoriginal). The results demonstrate that FiDRL covers the action space well, and the frequently selected actions concentrate on the middle values of both the V/F level and the decision period, which is consistent with Figs. 8 and 10 showing that the middle settings deliver higher performance.
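To make the joint action space concrete, the sketch below illustrates one possible way of flattening a (V/F level, invocation interval) pair into a single discrete action index, in the spirit of folding the invocation interval into the action space. The number of V/F levels and the interval grid are assumptions for illustration, not the paper's exact construction (which follows Eqn. 4).

```python
# Hypothetical decoding of a flat discrete action into (V/F level, interval).
N_VF_LEVELS = 8       # assumed number of V/F levels (size of a_original)
INTERVAL_MIN = 250    # the optimal minimum interval found by the grid search
N_MAX = 5             # maximum multiple of INTERVAL_MIN (inter-task case)

def decode_action(joint_action: int):
    """Split a flat action index in [0, N_VF_LEVELS * N_MAX) into its parts."""
    vf_level = joint_action % N_VF_LEVELS       # which V/F setting to apply
    n = joint_action // N_VF_LEVELS + 1         # n in {1, ..., N_MAX}
    interval = n * INTERVAL_MIN                 # steps until the next invocation
    return vf_level, interval

# Example: action 19 maps to V/F level 3, with the agent invoked again
# after 3 * 250 steps.
print(decode_action(19))
```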
TABLE VII
CONVERGENCE EPOCHS OVER DIFFERENT DECISION INTERVALS

nmax   | 1   | 2   | 5   | 7   | 10
Cinter | 120 | 115 | 140 | 160 | 210
Cintra | 125 | 130 | 135 | 152 | 235
5) Evaluating the Training Process: Fig. 12 reports the learning processes of the different DRL schemes. The results indicate that all tested schemes converge. The approach from [1], based on conventional DRL, converges the fastest given its small action space, but it attains the lowest accumulated reward as it does not optimize the overhead of the agent. For the two invocation-aware DRL approaches: (i) [34] converges the slowest among all DRL-based methods because its network architecture is the most complicated; (ii) our FiDRL method converges to the highest accumulated reward of all. This is because FiDRL searches a larger action space for more policy exploration and exploitation, although its learning speed is not as fast as the conventional DRL-based method. Besides, our method learns more quickly in the intra-task scenarios. However, FiDRL's learning stability under the intra-task benchmarks is lower than under the inter-task ones, as indicated by its larger shaded area in Fig. 12.
Additionally, Fig. 12 records the training process of the PPO-w/o-AC architecture. For the inter-task scenario in Fig. 12, PPO-w/o-AC learns a good policy with performance similar to [1] and [34]; however, its learning speed is the lowest among all the methods. For the intra-task scenario, PPO-w/o-AC performs better than [1] and [34] in terms of accumulated reward and learning speed, but its learning process is less stable, as indicated by its larger shaded area in Fig. 12. Adopting the AC architecture thus effectively improves both the learning performance and the learning speed of the agent's DVFS scheduling policy for inter- and intra-task scenarios.
Table VII shows how the agent's invocation interval affects the convergence rate. We record C, the number of epochs after which the learning process first converges, under various nmax; Cinter and Cintra denote C for the inter- and intra-task benchmarks, respectively. The results indicate that a larger decision interval makes the learning process converge more slowly due to the enlarged search space. When nmax increases from 1 to 10, C increases by 81.6%, whereas C increases by only 12.2% when nmax increases from 1 to 5. This is a worthwhile trade-off, as the 12.2% increase in C comes with a 23.3% energy reduction.
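For completeness, the convergence epoch C can be measured with a simple plateau test on the per-epoch reward curve; the sketch below is one illustrative way to do so. The smoothing window and tolerance are assumed values, not the criterion used in the paper.

```python
import numpy as np

def convergence_epoch(epoch_rewards, window=10, tol=0.01):
    """Return the first epoch at which the moving-average reward stops
    improving by more than `tol` (relative), as one possible measure of C."""
    rewards = np.asarray(epoch_rewards, dtype=np.float64)
    if rewards.size < 2 * window:
        return None                      # not enough epochs to decide
    smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
    for i in range(window, smoothed.size):
        prev, curr = smoothed[i - window], smoothed[i]
        if abs(curr - prev) <= tol * max(abs(prev), 1e-8):
            return i + window            # map back to the raw epoch index
    return None
```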
6) Evaluating the Resource Utilization: The evaluation of the resource utilization depends on the hardware configuration. We record the time proportion of each of the four main stages in one training epoch.
Fig. 10. The right figure shows grid search results (energy consumption in 10² J, averaged over all benchmarks) for possible combinations of (intervalmin, nmax); each data point is obtained by averaging all benchmarks at one combination of nmax and intervalmin. In terms of the results of each benchmark, the upper left figure shows energy consumption over different intervalmin with a fixed nmax (or intervalmax), and the lower left figure shows energy consumption over different nmax with a fixed intervalmin. Each value is accumulated by running the same experiment 300k and 30k times for the inter- and intra-task scenarios, respectively.
TABLE VIII
RESOURCE USAGE OF THE DVFS SYSTEM AND THREE DRL AGENT IMPLEMENTATIONS.

Hardware  | Agent in [1],[15],[16] | Agent in [34] | Our Agent | Entire System (#/%)     | Agent/System Ratio (%)  | Total
Resources | (#/%)                  | (#/%)         | (#/%)     | with F-01 | with F-08   | with F-01 | with F-08   | (#/%)
BRAM      | 49/8                   | 114/17        | 52/8      | 87/14     | 192/31      | 59.8      | 27.0        | 624/100
DSP       | 7/0                    | 18/0          | 7/0       | 367/21    | 587/34      | 1.9       | 1.2         | 1728/100
FF        | 17530/4                | 35702/8       | 17616/4   | 50893/10  | 60755/13    | 34.6      | 29.0        | 460800/100
LUT       | 11289/5                | 22605/10      | 11897/5   | 47671/19  | 58785/26    | 25.0      | 20.2        | 230400/100
Fig. 11. Action space distribution in the FiDRL-based DVFS policy; the left and right panels apply to nmax = 5 and nmax = 10, respectively.
Fig. 12. Normalized training curves of the different DRL-based algorithms. The shaded area represents the reward variance across different runs; a larger variance implies lower training stability of the agent. The left panel applies to the inter-task benchmarks, the right to the intra-task benchmarks.
The four stages are data collection (95.01%), data transfer (0.453%), policy update (4.533%), and parameter synchronization (0.004%). It is observed that the on-chip data collection stage takes the highest percentage of the total training time. The off-chip policy update takes 4.5% of the training time and depends on the scale of the collected data and the employed DRL training algorithm. Finally, the communication process, comprising data transfer (on-chip to off-chip) and parameter synchronization (off-chip to on-chip), together occupies 0.46% of the training time, as the FiDRL network architecture and the amount of data required in one epoch are both small in scale.
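This stage breakdown reflects the structure of the on/off-chip hybrid training loop. The sketch below outlines that loop under stated assumptions: the four stage functions are hypothetical placeholders for the platform-specific implementations, and the timing comments simply echo the measured proportions reported above.

```python
def hybrid_training(num_epochs, collect_on_chip, transfer_to_host,
                    ppo_update_on_host, sync_params_to_chip):
    """One possible shape of the on/off-chip hybrid training loop.
    The four callables are hypothetical placeholders for the platform steps."""
    for _ in range(num_epochs):
        trajectories = collect_on_chip()         # on-chip rollouts (~95.0% of epoch time)
        batch = transfer_to_host(trajectories)   # on-chip -> off-chip transfer (~0.45%)
        new_params = ppo_update_on_host(batch)   # off-chip PPO policy update (~4.5%)
        sync_params_to_chip(new_params)          # off-chip -> on-chip sync (~0.004%)
```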
Table VIII reports the FPGA resource usage of our designed
system, containing the DRL agent and benchmark accelerator
implementations. The agent takes up 7.9% in BRAM, 0% in
DSP, 3.8% in FF, and 4.9% in LUT. The resource usages of our
agent in BRAM, FF, and LUT are slightly higher than those of the [1] agent, given that the larger action space requires more neurons in the network. [34] consumes approximately twice the resources
of FiDRL due to its complicated network architecture that is
not optimized for embedded deployment. Overall, our FiDRL
agent causes a slight increase in resource utilization, with the
benefit of promising energy reduction.
To demonstrate the system-wide hardware utilization, we
examine the system implementation with two inter-task bench-
marks, namely F-01 (a small-scale feed-forward network) and
F-08 (MobileNetV2 as a larger-scale network). For the system
with F-01, BRAM is the resource on which the agent accounts for the highest share (59.8%) of the entire FiDRL-based DVFS system. For the system with F-08, on the other hand, FF is the resource with the highest agent share (29.0%).
7) Evaluating the Thermal Performance: Fig. 13 evaluates the processor temperature of FiDRL against the thermal-aware scheduling based on conventional DRL [15]. The FPGA and CPU temperatures are recorded for the inter- and intra-task benchmarks, respectively. The results show that the maximum temperature under FiDRL is 0.6 °C and 0.8 °C higher than under [15] for the FPGA and CPU, respectively, with the average temperature of FiDRL being 1.1% higher than [15].
Fig. 13. Thermal evaluation on FPGA (running MobileNetV2) and CPU (running fft) using FiDRL and DRL-based thermal management [15]. Recorded temperatures (°C):
FPGA — Ours: Max. 54.6, Min. 51.1, Avg. 53.1; [15]: Max. 54.0, Min. 51.0, Avg. 52.6.
CPU — Ours: Max. 55.8, Min. 50.9, Avg. 53.7; [15]: Max. 55.0, Min. 51.0, Avg. 53.0.
However, FiDRL achieves 14.8% more energy reduction than the thermal-aware approach [15], as shown in Table IV.
VII. CONCLUSION
In this paper, we propose FiDRL, a temporally lightweight DRL scheme that is feasible to deploy on resource-constrained embedded devices for DVFS scheduling. Based on it, we formulate a FiDRL-based DVFS energy optimization approach for inter- and intra-task scheduling scenarios, corresponding to the discrete and continuous invocation intervals in FiDRL. We further provide the design and implementation of the FiDRL-DVFS integrated system and an on/off-chip hybrid training algorithm. Experimentally, our FiDRL-based DVFS scheduling achieves on average 23.3% overall energy reduction and 55.1% agent overhead reduction compared to state-of-the-art DRL-based DVFS scheduling approaches. FiDRL paves the way for deploying lightweight DRL on resource-constrained devices.
ACKNOWLEDGMENT
The authors would like to thank the reviewers for their
valuable feedback to improve the work. This work was
supported in part by the Natural Science Foundation of
China under Grant 62220106011, and in part by the Zhejiang
Natural Science Foundation under Grant LY24F020006.
REFERENCES
[1] L. Chen et al., “Improve the stability and robustness of power manage-
ment through model-free deep reinforcement learning,” in DATE, 2022.
[2] H. Yu et al., “Dvfs-based quality maximization for adaptive applications
with diminishing return,” IEEE TC, 2021.
[3] J. Luis Nunez-Yanez et al., “Energy optimization in commercial fpgas with voltage, frequency and logic scaling,” IEEE TC, 2016.
[4] B. Salami et al., “Fairness-aware energy efficient scheduling on hetero-
geneous multi-core processors,” IEEE TC, 2021.
[5] F. M. M. u. Islam and M. Lin, “Hybrid dvfs scheduling for real-time
systems based on reinforcement learning,” IEEE Syst. Journal, 2017.
[6] Z. Wang et al., “Modular reinforcement learning for self-adaptive energy efficiency optimization in multicore system,” in Proc. of the 22nd ASPDAC, 2017.
[7] Y. Wang et al., “Online power management for multi-cores: A reinforce-
ment learning based approach,” IEEE TPDS, 2021.
[8] H. Huang et al., “Autonomous power management with double-q
reinforcement learning method,” IEEE TII, 2019.
[9] S. K. Panda et al., “Energy-efficient computation offloading with dvfs
using deep reinforcement learning for time-critical iot applications in
edge computing,” IEEE IoTJ, 2023.
[10] A. Zou et al., “F-lemma: Fast learning-based energy management for
multi-/many-core processors,” IEEE TCAD, 2023.
[11] Q. Zhang et al., “Deep reinforcement learning towards real-world
dynamic thermal management of data centers,” Applied Energy, 2023.
[12] F. Chen et al., “Quality optimization of adaptive applications via deep
reinforcement learning in energy harvesting edge devices, IEEE TCAD,
2022.
[13] Y. Cao et al., “An efficient and flexible learning framework for dynamic power and thermal co-management,” in Proc. of the 2020 MLCAD, 2020.
[14] Q. Fettes et al., “Dynamic voltage and frequency scaling in nocs with
supervised and reinforcement learning techniques,” IEEE TC, 2019.
[15] T. Tan and G. Cao, “Thermal-aware scheduling for deep learning on mobile devices with npu,” IEEE TMC, 2024.
[16] C. Lin et al., “A workload-aware dvfs robust to concurrent tasks for mobile devices,” in Proc. of the 29th MobiCom, 2023.
[17] N. Binkert et al., “The gem5 simulator,” SIGARCH Comput. Archit.
News, 2011.
[18] Y. Xiao et al., “Self-optimizing and self-programming computing sys-
tems: A combined compiler, complex networks, and machine learning
approach,” IEEE TVLSI, 2019.
[19] M. Nagel et al., “A white paper on neural network quantization,” arXiv
preprint arXiv:2106.08295, 2021.
[20] S. Han et al., “Deep compression: Compressing deep neural networks
with pruning, trained quantization and huffman coding, in ICLR, 2016.
[21] R. Zhang et al., “Hotspot 6.0: Validation, acceleration and extension,” University of Virginia, Tech. Rep., 2015.
[22] A. Das et al., “Reinforcement learning-based inter- and intra-application thermal optimization for lifetime improvement of multicore systems,” in Proc. of the 51st DAC, 2014.
[23] P. Bogdan, “Mathematical modeling and control of multifractal work-
loads for data-center-on-a-chip optimization,” in Proc. of the 9th NOCS,
2015.
[24] P. Bogdan et al., “Dynamic power management for multidomain system-
on-chip platforms: An optimal control approach,” ACM TODAES, 2013.
[25] R. David et al., “Dynamic power management for multicores: Case study
using the intel scc,” in Proc. of the 20th VLSI-SoC, 2012.
[26] R. Li et al., “Dvfs-based scrubbing scheduling for reliability maximiza-
tion on parallel tasks in sram-based fpgas,” in Proc. of the 57th DAC,
2020.
[27] Y. Tan et al., “Adaptive power management using reinforcement learn-
ing,” in ICCAD, 2009.
[28] R. A. Shafik et al., “Learning transfer-based adaptive energy minimiza-
tion in embedded systems,” IEEE TCAD, 2016.
[29] E. Kwon et al., “Reinforcement learning-based power management
policy for mobile device systems: Late breaking results, in Proc. of
the 57th DAC, 2020.
[30] J. Yan et al., “Energy-aware systems for real-time job scheduling in
cloud data centers: A deep reinforcement learning approach,” Computers
and Electrical Engineering, 2022.
[31] Z. Yu et al., “Multi-objective optimization approach using deep rein-
forcement learning for energy efficiency in heterogeneous computing
system,” arXiv preprint arXiv:2302.00168, 2023.
[32] A. Lakshminarayanan et al., “Dynamic action repetition for deep rein-
forcement learning,” in Proc. of the 31st AAAI, 2017.
[33] S. Sharma et al., “Learning to repeat: Fine grained action repetition for
deep reinforcement learning,” in ICLR, 2017.
[34] A. Biedenkapp et al., “TempoRL: Learning when to act,” in ICML, 2021.
[35] G. Brockman et al., “Openai gym,” 2016.
[36] E. Barron and H. Ishii, “The bellman equation for minimizing the
maximum cost.” Nonlinear Anal. Theory Methods Applic., 1989.
[37] R. S. Sutton et al., Introduction to Reinforcement Learning. MIT Press, Cambridge, 1998.
[38] A. T. Bharucha-Reid, “Fixed point theorems in probabilistic analysis,”
Bulletin of the AMS, 1976.
[39] J. Schulman et al., “Proximal policy optimization algorithms,” arXiv
preprint arXiv:1707.06347, 2017.
[40] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” in NIPS, 1999.
[41] E. Jang et al., “Categorical reparameterization with gumbel-softmax,”
in ICLR, 2016.
[42] S. Nadarajah, “A generalized normal distribution,” J. Appl. Stat., 2005.
[43] W. Jiang et al., “Aos: An automated overclocking system for high
performance cnn accelerator through timing delay measurement on
fpga,” IEEE TCAD, 2023.
[44] S. C. Woo et al., “The splash-2 programs: Characterization and method-
ological considerations,” SIGARCH Comput. Archit. News, 1995.
[45] M. Feurer and F. Hutter, “Hyperparameter optimization,” AutoML, 2019.
Jingjin Li received the B.S. degree of Computer
Science with Artificial Intelligence from the School
of Computer Science, University of Nottingham
Ningbo China, in 2022. He is currently pursuing
the Ph.D. degree with the School of Computer Sci-
ence, University of Nottingham Ningbo China. His
research interests include AI in embedded systems,
deep learning, and adaptive computing.
Weixiong Jiang received the B.S. degree from
Harbin Institute of Technology, Harbin, China, in
2017, and the Ph.D. degree with ShanghaiTech Uni-
versity, Reconfigurable and Intelligent Computing
Lab, Shanghai, and Shanghai Institute of Microsys-
tem and Information Technology, Chinese Academy
of Sciences, Shanghai, and the University of Chinese
Academy of Sciences, Beijing, in 2022. His current
research interests include chip design for algorithms
related to autonomous driving and neural radiance fields.
Yuting He received the B.S. degree of Computer
Science with Artificial Intelligence from the School
of Computer Science, University of Nottingham
Ningbo China, in 2022. She is currently pursuing
the Ph.D. degree with the School of Computer
Science, University of Nottingham Ningbo China.
Her research interests include edge computing and
deep learning.
Qingyu Yang received the B.Eng. degree in Mea-
surement, Control Technique and Instruments from
the Harbin Institute of Technology, China, in 2017,
and the M.Sc. degree in Advanced Microelectronic
Systems Engineering from the University of Bristol,
UK, in 2019. He is currently pursuing the Ph.D.
degree with the School of Computer Science, Uni-
versity of Nottingham Ningbo China. His current
research interests include computer-aided design and
reliability-aware scheduling for integrated circuits.
Anqi Gao is currently an undergraduate student
in the School of Computer Science, University of
Nottingham Ningbo China. Her research interest is
machine learning and neural networks.
Yajun Ha (Senior Member, IEEE) received the B.S.
degree from Zhejiang University, China, in 1996,
the M.Eng. degree from the National University
of Singapore (NUS), Singapore, in 1999, and the
Ph.D. degree from Katholieke Universiteit Leuven,
Belgium, in 2004, all in electrical engineering. He
is currently a Professor at ShanghaiTech University,
China. Before this, he was the Director, I2R-BYD
Joint Lab at Institute for Infocomm Research, Sin-
gapore, and an Adjunct Associate Professor at the
Department of Electrical & Computer Engineering,
NUS. Before this, he was an Assistant Professor at NUS. His research
interests include reconfigurable computing, ultra-low power digital circuits
and systems, embedded system architecture, and design tools for applications
in robots, smart vehicles and intelligent systems. He has published around
100 internationally peer-reviewed journal/conference papers on these topics.
He serves as the Editor-in-Chief for the IEEE Trans. on Circuits and
Systems II: Express Briefs (TCAS-II, 2022–2023), the Associate Editor-in-
Chief for IEEE TCAS-II (2020–2021), the Associate Editor for IEEE TCAS-I
(2016–2019), IEEE TCAS-II (2011–2013), IEEE TVLSI (2013–2014), and
JLPE (since 2009). He has served as the TPC Co-Chair for ISICAS 2020, the
General Co-Chair for ASP-DAC 2014, the Program Co-Chair for FPT 2010
and FPT 2013, and the Chair for the Singapore Chapter of the IEEE Circuits
and Systems (CAS) Society, in 2011 and 2012.
Ender Özcan (Senior Member, IEEE) received his
Ph.D. degree from the Department of Computer and
Information Science, Syracuse University, USA, in
1998. He is a Professor with the Computational
Optimisation and Learning (COL) Lab, School of
Computer Science, University of Nottingham, U.K.
He is currently the Director of the Faculty of Science
Artificial Intelligence Doctoral Training Centre. His
research interests include interface of computer sci-
ence, artificial intelligence, and operational research
with a focus on intelligent decision support systems
combining data science techniques and (hyper/meta)heuristics applied to real-
world problems. He is a prolific researcher with over 200 publications in his
field of expertise. Prof. Özcan is a Co-Founder and Co-Chair of the EURO
Working Group on Data Science Meets Optimization. He is an Associate
Editor of the Journal of Scheduling, Engineering Applications of Artificial
Intelligence Journal and the Journal of Applied Metaheuristic Computing.
For more information, visit https://people.cs.nott.ac.uk/pszeo/.
Ruibin Bai (Senior Member, IEEE) received the
B.Sc. and M.Sc. degrees from Northwestern Poly-
technic University, China, in 1999 and 2002, re-
spectively, and the Ph.D. degree from University of
Nottingham, U.K., in 2005. He is a Professor and
Head of the School of Computer Science, University
of Nottingham Ningbo China. He leads the Artificial
Intelligence and Optimisation (AIOP) group and
Ningbo Digital Port Technologies Key Lab. His main
research interests include computational intelligence,
reinforcement learning, operations research, schedul-
ing and optimisation with a special focus on transportation systems and port
logistics.
Tianxiang Cui (Senior Member, IEEE) received
the B.Sc. (Hons) degree in computer science and
the M.Sc. degree in advanced computing from the
University of Bristol, U.K., in 2010 and 2011, re-
spectively, and the Ph.D. degree from University of
Nottingham, U.K., in 2016. He was a Senior AI
Engineer with Huawei, Shanghai, China, and a Se-
nior Algorithm Researcher with PingAn, Shanghai,
China. He was involved in some frontier industrial
projects, including autonomous driving, and quanti-
tative trading. He is currently an Assistant Professor
in the School of Computer Science, University of Nottingham Ningbo China.
He has authored or coauthored a number of research papers in several
prestigious and reputable international journals, including IEEE Transactions
on Industrial Informatics, European Journal of Operational Research, Tech-
nological Forecasting and Social Change, Resources Policy, etc. His research
interests include development of novel computational intelligence techniques
in industrial decision-making problems, and Fintech applications.
Heng Yu received his B.Eng. and Ph.D. degrees
in Electrical and Computer Engineering from the
National University of Singapore (NUS), in 2006
and 2011 respectively. He was a research scien-
tist at the University of Erlangen-Nuremberg, and
subsequently a research fellow at NUS. He was an
Assistant Professor at the United Arab Emirates Uni-
versity, and briefly a Xinghai Associate Professor at
Dalian Maritime University. He is now an Assistant
Professor in the School of Computer Science, Uni-
versity of Nottingham Ningbo China. His research
interests focus on adaptive and reliable embedded systems, as well as AI in
embedded computing. His work received the best paper award or nominations
at ACM CF’17, FPT’13, and SAFECOMP’12. He served as an Associate
Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs.
He has been the PI/Co-PI for multiple national/provincial/municipal research
grants in China.