Deep Reinforcement Learning for
System-on-Chip: Myths and Realities
TEGG TAEKYONG SUNG¹ AND BO RYU²
¹EpiSys Science, Inc., Poway, CA 92064, USA (e-mail: tegg@episyscience.com)
²EpiSys Science, Inc., Poway, CA 92064, USA (e-mail: boryu@episyscience.com)
Corresponding author: Bo Ryu (e-mail: boryu@episyscience.com)
ABSTRACT Neural schedulers based on deep reinforcement learning (DRL) have shown considerable potential for solving real-world resource allocation problems, having demonstrated significant performance gains in the domain of cluster computing. In this paper, we investigate the feasibility of neural schedulers for the domain of System-on-Chip (SoC) resource allocation through extensive experiments and comparison with non-neural, heuristic schedulers. Our key finding is three-fold. First, neural schedulers designed for the cluster computing domain do not work well for SoC, due to i) the heterogeneity of SoC computing resources and ii) the variable action set caused by randomness in incoming jobs. Second, our novel neural scheduler technique, Eclectic Interaction Matching (EIM), overcomes these challenges and thus significantly improves existing neural schedulers; we also rationalize the underlying reasons behind the performance gain of the EIM-based neural scheduler. Third, we discover that the ratio of the average processing element (PE) switching delay to the average PE computation time significantly impacts the performance of neural SoC schedulers even with EIM. Consequently, future neural SoC scheduler designs must consider this metric as well as their implementation overhead for practical utility.
INDEX TERMS System-on-chip scheduling, resource allocation, deep reinforcement learning, neural
scheduler, heuristic scheduler
I. INTRODUCTION
APPROACHING the limit of Moore's Law has spurred tremendous advances in System-on-Chip (SoC) design, which bestows unprecedented gains in computational and energy efficiency for a wide range of applications through an integrated architecture of general-purpose and specialized processors [19]. In particular, the domain-specific SoC (DSSoC), a class of heterogeneous chip architecture, enables exploitation of the distinct characteristics of different compute resources (i.e., CPU, GPU, FPGA, accelerator, etc.) for speed maximization and energy efficiency via intelligent resource allocation [25], [33], [34]. The primary goal of a DSSoC scheduling policy is to optimally assign a variety of hierarchically structured jobs, derived from many-core platforms executing streaming applications from wireless communications and radar systems, to heterogeneous resources or processing elements (PEs). Over the years, researchers have demonstrated effective performance for DSSoC with expert-crafted, heuristic rules [5], [28].
While heuristic schedulers have been dominant for resource allocation in a wide range of domains, recent efforts in scheduling algorithm development have undergone a paradigm shift toward neural approaches, which demonstrate state-of-the-art performance in complex resource management domains [17], [20]. In particular, recent successes in applying deep reinforcement learning (DRL) to scheduling heterogeneous (cloud) cluster resources [11], [32] have further motivated similar DRL approaches for task scheduling on DSSoC, which obtain noticeable performance gains over well-known heuristic schedulers under certain operational conditions [37], [38], [40]. Through extensive experimentation with both DRL and heuristic schedulers under extremely wide ranges of DSSoC scenarios, we present an in-depth comparative analysis between neural schedulers and their heuristic counterparts for the DSSoC domain. The key contribution of our research is the finding that the high performance of DRL schedulers previously observed in both cloud cluster and DSSoC domains is highly sensitive to the ratio of the average PE switching delay to the average PE computation time. Specifically, when this ratio is close to one, neural schedulers tend to outperform their heuristic counterparts under various operational scenarios.
| Algorithm | Application | Approach | Job | Resource | Objective |
|---|---|---|---|---|---|
| TetriSched [45] | Cluster | Estimates job run-time heuristically and plans for placement options | MapReduce jobs | Heterogeneous clusters | Minimization of errors in job execution timing |
| Decima [32] | Cluster | Allocates a quantity of resources to ready tasks using graph-structured information | Spark jobs | Homogeneous clusters | Minimization in avg. job completion time |
| Gandiva [49] | Cluster | Exploits job predictability to time-slice resources efficiently across multiple jobs | DLT jobs | Heterogeneous clusters | Improvement on cluster utilization |
| Tiresias [18] | Cluster | Assigns job priority using Gittins index to schedule distributed jobs | DLT jobs | Homogeneous clusters | Minimization in avg. job completion time |
| Themis [30] | Cluster | Uses a two-level architecture to capture placement sensitivity and ensure efficiency | DLT jobs | Heterogeneous clusters | Improvement on cluster utilization |
| AlloX [27] | Cluster | Transforms the scheduling problem into a min-cost bipartite matching problem | DLT jobs | Heterogeneous clusters | Minimization in avg. job completion time |
| Gavel [35] | Cluster | Generalizes existing scheduling policies by expressing them as optimization problems | DLT jobs | Heterogeneous clusters | Minimization in avg. job completion time |
| DeepRM [31] | Cluster | Includes backlog information on remaining jobs; trains the agent using REINFORCE | Cluster jobs | Single cluster | Minimization in avg. slowdown |
| SCARL [11] | Cluster | Employs attentive embedding and schedules tasks using factorization of action | Cluster jobs | Heterogeneous clusters | Minimization in avg. slowdown |
| DRM [37] | SoC | Iteratively maps tasks to resources and updates the agent using REINFORCE | Single synthetic job | Heterogeneous resources | Minimization in job completion time |
| DeepSoCS [38] | SoC | Re-arranges task orders using graph-structured information and greedily maps tasks to resources | Synthetic and SoC jobs | Heterogeneous resources | Minimization in avg. latency |
| SoCRATES [40] | SoC | Iteratively maps tasks to resources and aligns post-processed returns to corresponding tasks | Synthetic and SoC jobs | Heterogeneous resources | Minimization in avg. latency |

TABLE 1. Design features of cluster and DSSoC scheduling approaches (DLT: Deep Learning Training).
On the other hand, when the ratio is much less than one and subject to other operational conditions, the anticipated high performance of neural schedulers does not materialize. We attribute this to two major factors: (i) the heterogeneity of SoC computing resources and (ii) the variable action set caused by randomness in incoming jobs. Combined, they exacerbate the problem of delayed rewards, because the accumulated rewards are likely to disrupt the backpropagation-based optimization method. With this finding, we present a realistic avenue for future DRL-based resource scheduler design.
II. RELATED WORK
The design of high-performance SoC resource schedulers has been an active area for many years [5], [28]. Scheduling algorithms are mostly heuristic in nature, with specific optimization goals. Examples include First Come First Served (FCFS), Earliest Task First (ETF) [6], Minimum Execution Time (MET) [9], and Heterogeneous Earliest Finish Time (HEFT) [44]. While both MET and STF (Shortest Time First) schedule tasks to the PEs that take the shortest execution time, HEFT schedules tasks by considering both task computation time and data transmission delays. HetSched [4], a real-time heterogeneity-aware scheduler with task- and meta-scheduling components that takes multiple static DAG-represented jobs as input, is built for autonomous vehicle applications. A new pruning Monte-Carlo Tree Search (MCTS)-based algorithm [26] has been applied to workflow scheduling; it improves makespan over heuristics such as Improved Predict Priority Task Scheduling (IPPTS) [15] and a meta-heuristic Genetic Algorithm approach [22]. However, much of the gain depends on the specific heuristics and the nature of the job configurations.
Cluster resource management for cloud computing (e.g., YARN [47] or Kubernetes [8]) is another, orthogonal approach to resource allocation. It is primarily designed to schedule big-data, time-persistent jobs (i.e., MapReduce [12] or Deep Learning Training (DLT) jobs, where a neural network represents a job and each operation, such as a matrix multiplication or nonlinear function, acts as a task). A list of scheduling approaches along with their design features is summarized in Table 1. Themis [30] and Tiresias [18] allocate tasks from distributed DLT jobs to clusters using two-dimensional scheduling algorithms. Gandiva [49] schedules a set of heterogeneous DLT jobs on a fixed set of GPU clusters; it allows preemption so that overloaded jobs can share available resource slots. AlloX [27] transforms a heterogeneous resource scheduling problem into a min-cost bipartite matching problem in order to provide performance optimization and fairness to users in Kubernetes. TetriSched [45] estimates job run-time heuristically to plan placement options. Gavel [35] transforms existing scheduling policies into heterogeneity-aware optimization problems for generalization and improves the diversity of policy objectives. Such cluster schedulers enhance run-time performance by exploiting the characteristics of their target workloads.
Neural schedulers have begun to surpass hand-crafted algorithms and show significant performance gains on the cluster scheduling problem. DeepRM, the first DRL-based cluster scheduler reported in the literature,
shows a significant reduction in job slowdown (the ratio of actual to ideal job duration) over heuristics [31]. Compared with DeepRM, Decima [32] proposes an end-to-end neural scheduler for a more realistic cluster environment with hierarchical cluster jobs. It extracts hierarchical job information with graph neural networks (GNNs) [16] and decides how many resources execute each task. Decima addresses the varying action selection caused by hierarchical jobs using a placeholder implementation [2], but it considers only homogeneous clusters. SCARL [11] aims to schedule jobs to heterogeneous resources by exploiting attentive embedding [46] in its policy networks. However, SCARL cannot schedule hierarchical jobs, which differentiates it from Decima, and it is not applicable to the realistic environment of [32]. Spear [21] applies MCTS to plan task scheduling, with a DRL model for guidance in the expansion and rollout steps of MCTS.
Building on the success of neural schedulers for the cluster environment as described above, novel neural approaches have been proposed for the domain of SoC. Deep Resource Management (DRM) [37] is considered the first DRL-based SoC scheduler; it schedules hierarchical jobs to heterogeneous resources in a scenario with a single synthetic job. DeepSoCS [38], adapted from Decima, is proposed to handle more realistic SoC scenarios where multiple synthetic and real-world SoC jobs are continuously generated. It is a hybrid approach that rearranges tasks using the graph-structured information extracted by GNNs and maps them to resources using a heuristic algorithm. However, the performance gain achieved by DeepSoCS depends on operational conditions, as it inherently imitates the expert policy with the exhaustive search employed by heuristic schedulers. To explore the feasibility of an end-to-end neural SoC scheduler with the goal of achieving significant performance gains over heuristic schedulers, the authors proposed SoCRATES [40] with a novel technique, Eclectic Interaction Matching (EIM). EIM remedies the concurrency problem in receiving observations and reward gains by matching the time-varying interaction steps with the simulation time steps. Consequently, SoCRATES achieves considerable performance enhancement over prior neural schedulers [37], [38]. In this paper, we present key insights into how such a performance gain is achieved by SoCRATES through extensive comparative experimentation.
III. MOTIVATION
Despite the significant performance gains demonstrated by neural schedulers for cluster computing management, they generally suffer from limited extensibility. For example, prior cluster schedulers address non-hierarchical workloads [27], [49] and homogeneous resources [31], [32], which cannot fully capture SoC resource allocation. Although a series of studies on cluster applications employs heterogeneous machines [27], [30], [45], [49], their complexity is considerably lower than that of DSSoC. Schedulers in cluster applications allocate jobs to a set of CPUs or GPUs with different performance, whereas schedulers in DSSoC applications map a range of domain-specific jobs to various types of PEs, e.g., CPUs, GPUs, accelerators, and memory, each with different performance and supported functionalities. Cluster schedulers decide how many machines execute the incoming tasks, whereas SoC schedulers decide which SoC computing resource executes each incoming task; the scheduler must therefore be aware of unsupported actions for an individual task. Hence, directly applying neural schedulers to SoC is non-trivial due to disparities in environment properties, such as the structures of jobs/resources and the scheduling mechanisms.
In contrast, heuristic scheduling algorithms in the SoC domain steadily show state-of-the-art performance. We discovered that their significant performance gains come from rescheduling task assignments with exhaustive searches, such as PE availability checks or searches for gaps between consecutive task assignments (see Section VI-B for more details). However, such rule-based algorithms generally have limited robustness. For instance, heuristic schedulers are vulnerable to system perturbation from external forces in the setting of a single job execution [37]. Motivated by the robust and significant performance gains of neural schedulers in the domain of cluster computing, we are interested in extending them to SoC. While neural algorithms generally adapt to dynamic system changes and show robust performance [41], subsequent works have moved toward a more complicated and practical scenario with continuous job injection [38], [40]. In this paper, we investigate the challenges of designing a DRL scheduling policy in the SoC domain. With the recently introduced EIM technique overcoming these challenges, we rationalize the underlying reasons behind the performance improvement of existing neural schedulers by examining PE usage and action designs. Furthermore, we investigate which operational conditions impact the performance of neural SoC schedulers with EIM. To this end, the questions we consider in this paper are the following:
• What is the main difference between SoC and other domains?
• How must the design of a neural scheduling policy change in the SoC domain?
• Under which operational conditions do neural schedulers perform well, and under which do they not?
• How does the EIM technique improve the performance of neural schedulers?
• What are the strengths and weaknesses of neural schedulers?
IV. BACKGROUND AND SYSTEM MODEL
A large body of scheduling research exists for a broad range of domains. Cluster management in datacenters allocates Spark or DLT jobs to a set of CPU and GPU machines. This paper contributes to the domain of heterogeneous DSSoC and its emulator, the high-fidelity Domain-Specific SoC Simulator (DS3) [5], [39]. DS3, which supports the heterogeneous SoC computing platform Odroid-XU3 [1], enables the allocation of a set of communication and radar jobs to various types of resources, such as general-purpose cores, hardware accelerators, and memory.
[Figure 1: the job-DAG diagram is omitted; the task computation times from the right-hand table are reproduced below.]

| Task | P0 | P1 | P2 |
|------|----|----|----|
| 0 | 14 | 16 | 9 |
| 1 | 13 | 19 | 18 |
| 2 | 11 | 13 | 19 |
| 3 | 13 | 8 | 17 |
| 4 | 12 | 13 | 10 |
| 5 | 13 | 16 | 9 |
| 6 | 7 | 15 | 11 |
| 7 | 5 | 11 | 14 |
| 8 | 18 | 12 | 20 |
| 9 | 21 | 7 | 16 |

FIGURE 1. An illustration of a set of synthetic job and resource profiles. The diagram on the left depicts a job DAG, where a node represents a task by its ID and an edge represents data transmission delay by its weight. The table on the right shows a set of heterogeneous PEs with different computation times for each task.
The ARM heterogeneous big.LITTLE architecture of the cores enables both performance-oriented and energy-efficient runs (the big 2.1 GHz Cortex-A15 cores are performance-oriented; the LITTLE 1.5 GHz Cortex-A7 cores are energy-efficient). DS3 integrates system-level design features for hierarchical jobs and heterogeneous resources. The job and resource profiles are given as lists specifying properties; the system parses them and generates the workloads and PEs using the job and resource models.
A. SYSTEM-ON-CHIP SIMULATION
1) Job Model
We define a job as a collection of interleaved tasks. Jobs in DS3 implement real-world applications of wireless communication and radar processing. The tasks represent operations such as waveform generation, Fast Fourier Transform, vector multiplication, or decoding [5]. A job is structured as a directed acyclic graph (DAG), illustrated in Fig. 1 [44]. A job is denoted by $G = (N, E)$, where $N$ is a set of nodes and $E$ is a set of edges. Each node $n_i \in N$ represents a heterogeneous task in the job, and each directed edge $e_{i,j} \in E$ connects node $n_i$ to node $n_j$. We use the terms "task" and "node" interchangeably unless there is confusion. An edge encodes a task dependency: if edge $e_{i,j}$ exists, task $n_j$ can start execution only after task $n_i$ finishes. Here, we call node $n_i$ a parent of node $n_j$, and $n_j$ a child of $n_i$. We denote the set of parents (predecessors) of $n_j$ by $pred(n_j)$ and the set of children (successors) of $n_i$ by $succ(n_i)$. A node may have multiple parents or children, and nodes can be executed simultaneously. Each edge $e_{i,j}$ has a weight $w_{i,j}$ that represents the data transmission delay between $n_i$ and $n_j$. This delay is added to the task duration when the scheduler assigns task $j$ to a PE different from that of task $i$. The labels HEAD and TAIL refer to the root parent node and the terminal leaf node, respectively. Assume a job $G$ has $v$ tasks, $N = \{n_1, \dots, n_v\}$; then the job is considered complete when all tasks in $N$ have been completed. Here, $n_1$ is the HEAD node and $n_v$ is the TAIL node. According to the aforementioned job model, multiple jobs are generated. Each job is generated based on the following parameters [44] (a small construction sketch follows the list):
1) $v$: the number of tasks in the directed acyclic graph.
2) $\alpha$: the shape parameter of the graph. $\alpha$ controls the width and depth of the graph structure. We sample the average width of each level in a graph from a normal distribution with mean $\sqrt{v} \times \alpha$, and the depth of the graph is $\sqrt{v}/\alpha$ (see Appendix A for details on job DAG construction). If $\alpha \gg 1.0$, a shallow but wide graph is generated; if $\alpha \ll 1.0$, a deep but narrow graph is generated.
3) $\nu$: the average communication delay. The weight of $e_{i,j}$, representing the communication delay, is set to $\max(1, \lfloor |w| \rfloor)$, where $w \sim \mathcal{N}(\nu, 0)$.
4) $CCR$: the communication-to-computation ratio. We calculate the average communication cost as the sum of the scheduled PE bandwidth and the weights of the edges between the current task and the previous task; the average computation cost is defined in the SoC job profile. If the $CCR$ value of a DAG is high, the job is a communication-intensive workload; conversely, if the $CCR$ value is low, the job is a computation-intensive workload.
5) $d_{in}$: the average in-degree of nodes.
6) $d_{out}$: the average out-degree of nodes.
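To make the roles of $v$, $\alpha$, and $\nu$ concrete, the following sketch generates a random layered job DAG in the spirit of the construction above. It is a minimal illustration under our own naming (e.g., `sample_job_dag`), not the DS3 implementation; the edge-weight noise variance is likewise an assumption.

```python
import numpy as np

def sample_job_dag(v=10, alpha=0.8, nu=5.0, rng=None):
    """Sample a layered job DAG following the v/alpha/nu parameters.

    Returns (levels, edges): `levels` groups task IDs by depth, and
    `edges` maps (parent, child) -> communication-delay weight.
    """
    rng = rng or np.random.default_rng()
    depth = max(2, round(np.sqrt(v) / alpha))   # graph depth ~ sqrt(v)/alpha
    mean_width = np.sqrt(v) * alpha             # average width per level

    # Partition v tasks into `depth` levels of roughly `mean_width` tasks,
    # always leaving at least one task for each remaining level.
    levels, remaining, task_id = [], v, 0
    for lvl in range(depth):
        if lvl == depth - 1:
            width = remaining
        else:
            upper = remaining - (depth - lvl - 1)
            width = int(np.clip(round(rng.normal(mean_width, 1.0)), 1, upper))
        levels.append(list(range(task_id, task_id + width)))
        task_id += width
        remaining -= width

    # Connect each task to a parent on the previous level; edge weights
    # model max(1, floor(|w|)) with w ~ N(nu, sigma) (sigma assumed here).
    edges = {}
    for lvl in range(1, depth):
        for child in levels[lvl]:
            parent = int(rng.choice(levels[lvl - 1]))
            edges[(parent, child)] = max(1, int(abs(rng.normal(nu, 1.0))))
    return levels, edges

levels, edges = sample_job_dag(v=10, alpha=0.8, nu=5.0)
print(levels, edges)
```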
2) Resource Model
The resource profile defines the characteristics of the PEs; each PE is defined with a fixed set of supported tasks and operating performance points (OPPs). An OPP is a utilization setting given as a tuple of power consumption and task run-time frequency. The OPPs for PE $q$, for instance, can be defined by a set of voltage-frequency pairs, $OPP^q = \{(V^q_1, f^q_1), \dots, (V^q_O, f^q_O)\}$, where $O$ is the number of operating points. Once the frequency parameter is given, the resource model creates the corresponding PE. Since a PE running at a higher frequency generally executes tasks faster but consumes more power and energy, a trade-off exists between run-time performance and energy efficiency. Moreover, each PE has a bandwidth that contributes to the communication delay when the simulator switches between PEs during task execution.
3) Objective
DS3 is heavily shaped by the peculiarities of the SoC domain. It comes with real-world reference applications from the wireless communications and radar processing domains. Each supported workload consists of various operations (i.e., tasks), each of short duration. The run-time overhead of each task includes the task duration and the data transmission delay. Allocating different processors to task $n_i$ and its parent task set $\{n_j\} = pred(n_i)$ incurs a data transmission delay.
[Figure 2: workflow diagram omitted.]

FIGURE 2. An overview of the DS3 workflow. At initialization, a set of workloads and PEs is generated from the given job and resource profiles. The job generator continuously generates multiple jobs using the set of workloads and distributes them to the task queues. The scheduler takes any tasks in the ready queue and maps each task to one of the PEs. If the PE is idle, it starts task execution. The task dependency graph prescribes which task moves next onto the ready queue after the completion of its predecessors.
Let task $n_i$ be mapped to PE $P_i$, and denote its task computation time at operating frequency $f^i_o$ by $comp(n_i \mid P_i, f^i_o)$. Then, the overall task duration is given by

$$exec(n_i) = \mu \cdot comp(n_i \mid P_i, f^i_o) + delay(n_i), \tag{1}$$

where $\mu$ is a scaling parameter for extending the task execution time. On the right-hand side, the first term is the task computation time on a PE, and the second term is the data communication delay, given by

$$delay(n_i) = \max_{n_j \in pred(n_i)} \frac{w_{i,j}}{B(P_i, P_j)}, \tag{2}$$

where $w_{i,j}$ is the weight of the edge between tasks $i$ and $j$, and $B(P_i, P_j)$ is the PE bandwidth from $P_i$ to $P_j$. The self-loop bandwidth of the same processor is assumed to be negligible, $B(P_i, P_i) = 0$. Due to the communication delay, frequent resource switching leads to an increasing loss in task completion time. The objective to optimize, average latency, is given by

$$L = \frac{\sum_{G \in \mathcal{G}_{comp}} \sum_{n_i \in G} exec(n_i)}{|\mathcal{G}_{comp}|}, \tag{3}$$

where $\mathcal{G}_{comp}$ is the set of completed job DAGs, and $|G|$ is the number of tasks in job $G$.
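For concreteness, the toy fragment below evaluates (1)-(3) on a two-task job; the dictionaries standing in for the job and resource profiles (`comp_time`, `bandwidth`, `placement`, and so on) are illustrative assumptions rather than DS3 data structures.

```python
# Toy evaluation of Eqs. (1)-(3): task duration with data transmission
# delay, and the average-latency objective over completed jobs.
mu = 0.5                                           # scaling parameter mu

comp_time = {("t1", "P0"): 14, ("t2", "P1"): 19}   # comp(n_i | P_i, f_o^i)
bandwidth = {("P0", "P1"): 2.0}                    # B(P_i, P_j), i != j
edge_weight = {("t1", "t2"): 16}                   # w_{i,j}
pred = {"t1": [], "t2": ["t1"]}                    # task dependency graph
placement = {"t1": "P0", "t2": "P1"}               # scheduler's task-PE map

def delay(task):
    """Eq. (2): max over parents of w_{i,j} / B(P_i, P_j); 0 on the same PE."""
    best = 0.0
    for parent in pred[task]:
        pi, pj = placement[parent], placement[task]
        if pi != pj:  # self-loop delay assumed negligible
            best = max(best, edge_weight[(parent, task)] / bandwidth[(pi, pj)])
    return best

def exec_time(task):
    """Eq. (1): mu * comp(n_i | P_i, f_o^i) + delay(n_i)."""
    return mu * comp_time[(task, placement[task])] + delay(task)

completed_jobs = [["t1", "t2"]]                    # G_comp: one completed DAG
latency = sum(exec_time(t) for g in completed_jobs for t in g) / len(completed_jobs)
print(latency)   # (0.5*14 + 0) + (0.5*19 + 16/2.0) = 24.5
```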
Previous work [44] introduced additional evaluation metrics of run-time overhead for a single completed job: Schedule Length Ratio (SLR) and Speedup. They are given by

$$SLR = \frac{makespan}{\sum_{n_i \in CP_{MIN}} \min_{p_j \in Q} \{w_{i,j}\}} \tag{4}$$

$$Speedup = \frac{\min_{p_j \in Q} \left\{ \sum_{n_i \in v} w_{i,j} \right\}}{makespan}, \tag{5}$$
Algorithm 1 DS3 Environment
1: Input: job inter-arrival rate scale, clock signal clk, maximum simulation length CLK, job model M_J, resource model M_R, job capacity C, job queue Q_job, ready task queue Q_ready, job profile job, resource profile resource, number of jobs W, number of PEs Q
2: Output: average latency L
3: for each episode do
4:     clk ← 0
5:     {G_i}_{i=1:W} ← M_J(job)
6:     {P_i}_{i=1:Q} ← M_R(resource)
7:     repeat
8:         # Generate jobs
9:         if |Q_job| < C then
10:            clk_inj ∼ Exp(scale)
11:            Q_job ← G at clk_inj
12:        end if
13:        for each task i in Q_ready do
14:            # Schedule tasks in the ready list to PEs
15:        end for
16:        if P is idle then
17:            # PE execution: start P executing the scheduled tasks
18:        end if
19:        clk ← clk + 1
20:    until clk = CLK
21:    Compute L using (3)
22: end for
where the denominator of SLR represents the ideal lower-bound schedule time for the job DAG, $CP_{MIN}$ is the minimal critical path of the job DAG, and $Q$ is the number of PEs. The numerator of Speedup represents the overall task computation time when each of the $v$ tasks in a job DAG is scheduled onto the same processor; it indicates the ability of the algorithm to exploit parallelism. The lower the SLR and the higher the Speedup, the better the scheduling performance.
Since this paper seeks to evaluate performance over multiple jobs, we average SLR and Speedup over all jobs completed within the simulation length. Let the set of completed jobs be $\{G_i\}_{i=1}^{|\mathcal{G}_{comp}|}$; each completed job corresponds to one of the workloads generated at DS3 initialization. Since heterogeneous jobs are generated, each job likely has a different minimal critical path and a different parallel performance on the same processors. The average SLR and the average Speedup are given by

$$\overline{SLR} = \frac{\sum_{k=1}^{|\mathcal{G}_{comp}|} SLR_k}{|\mathcal{G}_{comp}|} \tag{6}$$

$$\overline{Speedup} = \frac{\sum_{k=1}^{|\mathcal{G}_{comp}|} Speedup_k}{|\mathcal{G}_{comp}|}. \tag{7}$$
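A minimal sketch of the averaged metrics (6)-(7), assuming the per-job makespans, critical-path costs, and per-processor totals have already been extracted (all names below are illustrative):

```python
import numpy as np

# Per-job inputs (illustrative): minimum cost of each CP_MIN task on its
# best processor (SLR denominator) and per-processor total computation
# time (Speedup numerator).
jobs = [
    {"makespan": 120.0, "cp_min_costs": [9, 8, 7], "per_pe_totals": [130, 150, 110]},
    {"makespan": 90.0,  "cp_min_costs": [10, 6],   "per_pe_totals": [100, 95, 140]},
]

slr = [j["makespan"] / sum(j["cp_min_costs"]) for j in jobs]        # Eq. (4)
speedup = [min(j["per_pe_totals"]) / j["makespan"] for j in jobs]   # Eq. (5)

print(np.mean(slr), np.mean(speedup))                               # Eqs. (6)-(7)
```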
The DS3 workflow is given in Algorithm 1 and Fig. 2.
After initialization, DS3 continuously generates indefinite
| | DS3 (synthetic) | DS3 (real-world) | Spark |
|---|---|---|---|
| **Job characteristics** | | | |
| Number of tasks | 10 | 27 | 1137.6 |
| Data transmission delays | 16.6 ± 5.0 | 3.38 ± 2.4 | 2000 |
| Average job DAG level | 4 | 7 | 5.8 ± 2.6 |
| Number of types | 1 | 1 | 154 |
| Average job arrival time | 25 | 25 | 25000 |
| Job duration | Varied | Varied | 1127.3 ± 441.9 |
| **Resource characteristics** | | | |
| Structure | Heterogeneous | Heterogeneous | Homogeneous |
| Task computation time | 13.3 ± 4.1 | 40.0 ± 83.6 | - |
| Number of types | 4 | 17 | 1 |

TABLE 2. A comparison between DS3 and Spark properties. Due to the differences in applicability, DS3 and Spark have different job and resource characteristics. For DS3, a representative real-world profile, WiFi-TX, is included. Note that the shape parameter α controls the diversity in the number of job types and their average graph levels.
hierarchically structured workloads according to the job model, at stochastic job inter-arrival rates. While the number of injected workloads is below the job capacity $C$, the job generator injects a mix of multiple instances of the workloads in a streaming fashion, $\mathcal{G} = \{G_1, \dots, G_W\}$, where $W \leq C$. The workloads are generated at every $clk_{inj}$, where $clk_{inj} \sim Exp(scale)$ and $scale$ is the mean of the job inter-arrival rate; a large value leads to high-frequency job injection. DS3 then loads tasks that have no dependencies onto the ready queue, and all other tasks onto the outstanding queue. Each ready task, derived dynamically from the prior task scheduling, is ready to be assigned to a PE by the scheduling policy. The task then moves to the executable queue, and the corresponding PE, if idle, starts executing it. Tasks are non-preemptive: DS3 does not interrupt a task once its execution starts. The job generator, the distributed PEs, and the simulation kernel, all of which execute in parallel, share the same clock signal. A small sketch of the injection process is given below.
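As a small illustration of the injection process (the names are ours, not DS3's): jobs arrive with exponential inter-arrival gaps of mean `scale` while the job queue is below capacity $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
scale, C, CLK = 25.0, 3, 10_000   # mean inter-arrival, job capacity, horizon

clk, job_queue, arrivals = 0.0, [], []
while clk < CLK:
    clk += rng.exponential(scale)          # clk_inj ~ Exp(scale)
    if len(job_queue) < C:                 # inject only below capacity
        job_queue.append(f"job@{clk:.0f}")
        arrivals.append(clk)
    # (a full simulator would also pop completed jobs from job_queue here)

print(len(arrivals), [round(a) for a in arrivals])
```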
B. SIMULATION ANALYSIS
To elucidate the distinctions in simulation behavior, we compare DS3 against Spark [32], one of the representative realistic simulations for cluster applications. Both simulations support the scheduling of multiple graph-structured jobs but differ greatly in the mechanism of resource allocation in their respective domains. We list the job and resource characteristics after normalization in Table 2.
The scheduling policy in Spark decides how many resource machines to allocate to the ready tasks with respect to the given job profile. The DS3 scheduling policy, on the other hand, decides which PE executes the ready tasks, and the task run-time depends solely on the selected PE's performance. After task completion, Spark applies a static moving delay, whereas DS3 applies a dynamic data transmission delay. Cluster jobs generally consist of numerous tasks and last a long time. Spark, for instance, supports 154
FIGURE 3. The edge density and chain ratio of cluster and SoC workloads.
The results of TPC-DS and TPC-H are reproduced by referring to [43].
types of jobs with approximately 5.8 levels (DAG depth). DSSoC jobs, by contrast, execute over short durations. DS3 provides several types of real-world job profiles, but this paper focuses on one real-world job, WiFi-TX, and one synthetic job; these jobs have 7 and 4 levels, respectively. Endowed with heterogeneous resources, DS3 has 4 PEs in the synthetic profile and 17 PEs in the real-world profile, each with different run-time performance and supported functionalities for tasks. In that sense, for each task being scheduled, the scheduler must check whether the task can be executed on a given PE. Regarding CCR, the synthetic profile has a similar range of computation and communication costs. In contrast, the real-world profile is chain-structured and compute-intensive: the communication time for the synthetic job is up to 22x larger than that of the real-world job, and the task computation time for real-world resources is up to 13.4x larger than that of synthetic resources. In practice, we modify the job characteristics using the shaping parameters α, µ, and ν to grant more variability (see Section IV-A1 for the parameter descriptions). The difference between the mechanisms of the two domains limits the scope of applicability of each scheduling algorithm, and the extent or range of run-times is largely different.
[Figure 4: timing diagram omitted.]

FIGURE 4. An illustration of irregular interactions. Although Tasks 1 and 3 have been completed earlier, the next Tasks 4, 5, and 6 are scheduled only after Task 2 has been completed. As a result, the reward gains for the scheduling decisions for Tasks 1 and 3 are truncated due to the task dependencies.
Additional metrics are given here for hierarchical job DAGs [43]: (i) the edge density measures the sparsity of a job DAG and is computed as $\frac{2E}{V(V-1)}$, where $E$ denotes the number of edges, $V$ the number of vertices (tasks), and $V(V-1)$ the maximum possible number of edges in the DAG; a higher value indicates a denser, more complex job DAG. (ii) The chain ratio measures the prevalence of chained tasks and is computed as $\frac{C}{V}$, where $C$ denotes the number of chained tasks, i.e., tasks with exactly one child and one parent. Fig. 3 reports that the synthetic job DAG is relatively sparse and that SoC jobs have a larger number of chains than cluster jobs. Both metrics are easy to reproduce, as sketched below.
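The sketch below assumes a networkx installation and is independent of DS3.

```python
import networkx as nx

def edge_density(g: nx.DiGraph) -> float:
    """2E / (V(V-1)): sparsity of the job DAG."""
    v = g.number_of_nodes()
    return 2 * g.number_of_edges() / (v * (v - 1))

def chain_ratio(g: nx.DiGraph) -> float:
    """C / V, where C counts tasks with exactly one parent and one child."""
    chained = sum(1 for n in g if g.in_degree(n) == 1 and g.out_degree(n) == 1)
    return chained / g.number_of_nodes()

g = nx.DiGraph([(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)])  # toy job DAG
print(edge_density(g), chain_ratio(g))                    # 0.5, 0.4
```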
A major difficulty in designing a DS3 scheduler is that the number of available actions (scheduling decisions) varies over time due to the mix of incoming heterogeneous jobs and their different task dependency graphs. Fig. 4 illustrates an exemplary scenario where tasks 4, 5, and 6 are children of task 2. Although tasks 1 and 3 have been completed earlier, per the dependency graph, the next observation can be received only after completing task 2; the immediate reward signals for tasks 1 and 3 are thus naturally delayed. With heterogeneous-PE- and dependency-graph-induced variations incurring abruptly changing task run-times, properly pairing each action with its reward to compute returns becomes an entangled affair. Indeed, this entanglement of the task dependency graph and heterogeneous resources leads to a misalignment in the order and timing of observations and reward gains: the agent's reward timestamps do not match the actual simulation clock signal. As a result, the interactions become inconsistent, and rewards (returns) are incorrectly assessed and backpropagated.
Based on the above analysis, DS3 exhibits dynamic, realistic operational behaviors but differs significantly from other resource allocation domains. Due to the task dependencies in a mix of various jobs, the scheduler must address a variable action set (i), as listed below. The distributed PEs execute each scheduled task accordingly (ii). Combining (i) and (ii), the agent naturally faces delayed rewards that are likely to disrupt DRL optimization, which is the last difficulty (iii).
1) Variable action sets: Mixes of various jobs with different task dependency graphs cause variable action sets. Since the job queue holds multiple heterogeneous jobs, the agent must recognize multiple job graphs and respond to the fact that the action sets are irregular. At every scheduling interaction, the agent receives the dependency-free tasks for the given state.
2) Heterogeneous resources: Given the heterogeneity in both jobs and SoC computing resources, the DS3 scheduler must consider different task execution times and data transmission delays. The SoC scheduler decides which task is executed on which PE; under the task-PE mappings of a scheduling policy, the average job duration becomes highly unpredictable.
3) Delayed rewards: The combination of varying actions, caused by randomness in the incoming jobs, and heterogeneous resources exacerbates the problem of delayed rewards. The accumulated rewards tend to disrupt DRL optimization. With prior action commitments and a varying number of observations, the returns must be matched between the interaction steps and the actual simulation clock signal.
C. BENCHMARK SCHEDULERS
1) Rule-based schedulers
The task duration depends on the task computation time on a PE and on the communication delay computed from the PE bandwidth and the data transmission delays in the job DAG, as described in (1). Shortest Time First (STF) and Minimum Execution Time (MET) [7] iteratively schedule ready tasks to the PE with the minimal execution time. After scheduling, MET additionally checks whether the selected PE is busy or idle; if it is busy, MET revises the task assignment to an alternate PE. Heterogeneous Earliest Finish Time (HEFT) [44] is effective at hierarchical job scheduling. HEFT first sorts the ready tasks by their upward rank values, which act as importance weights, and then greedily maps tasks to heterogeneous PEs. The upward rank of a ready task $n_i$ can be recursively calculated by

$$rank_u(n_i) = \overline{w_i} + \max_{n_j \in succ(n_i)} \left( \overline{c_{i,j}} + rank_u(n_j) \right), \tag{8}$$

where $succ(n_i)$ is the set of successors of task $n_i$, $\overline{c_{i,j}}$ is the average communication cost of edge $(i, j)$, and $\overline{w_i}$ is the average computation cost of task $n_i$. Essentially, the upward rank is the length of the critical path from task $n_i$ to the exit task. While the original work seeks the critical path [44], a job DAG in DS3 is deemed complete when all of its tasks are finished. Therefore, the performance of HEFT relies heavily on the heuristic task-PE mapping, which iteratively computes the earliest execution finish time (EFT) of a ready task. The EFT
the earliest execution finish time (EFT) of a ready task. EFT
of task niand processor pkis equated by
EF T (ni, pk) = max{avail[k],
max
nj∈pred(ni)(AF T (nj) + ci,j )},(9)
where avail[k]is the earliest time at which the processor pk
is ready for task execution, pred(ni)is the set of predecessor
tasks of ni, and AF T (nj)is the actual finish time of the task
nj.ci,j =wi,j /B(pi, pj), where wi,j is the weight of edge
(i, j)and Bis bandwidth between given processors, is the
data transmission delay as referred to (2). Essentially, EFT
algorithm calculates the actual delay-aware computation time and exhaustively schedules the task to the PE with minimal cost. HEFT additionally applies an insertion-based policy that checks whether the scheduled task can be executed before a previously assigned task: if HEFT finds residual gaps caused by the transmission delays of previous scheduling decisions, it reschedules tasks into them. A recent improvement via an execution-focused heuristic for dynamic run-time scenarios resulted in a run-time variant of HEFT, HEFTRT [28]. A compact sketch of (8) and (9) follows.
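The following sketch implements the upward rank (8) and the earliest-start portion of (9); the average costs and availability map are illustrative stand-ins for the profile data, and the finish time additionally requires the task's computation time on the candidate PE.

```python
from functools import lru_cache

succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
w_bar = {"A": 10, "B": 8, "C": 12, "D": 6}                # avg computation cost
c_bar = {("A", "B"): 3, ("A", "C"): 5,
         ("B", "D"): 2, ("C", "D"): 4}                    # avg communication cost

@lru_cache(maxsize=None)
def rank_u(n):
    """Eq. (8): critical-path length from task n to the exit task."""
    if not succ[n]:
        return w_bar[n]
    return w_bar[n] + max(c_bar[(n, m)] + rank_u(m) for m in succ[n])

def earliest_start(n, pe, avail, aft, comm_delay):
    """EST portion of Eq. (9): respects PE availability and parent AFTs.

    The earliest finish time adds comp(n | pe) to this value.
    """
    return max([avail[pe]] + [aft[p] + comm_delay(p, n, pe) for p in pred[n]])

order = sorted(succ, key=rank_u, reverse=True)   # HEFT's task priority order
print(order, [rank_u(n) for n in order])         # ['A','C','B','D'], [37,22,16,6]
```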
2) Neural schedulers
DeepSoCS [38] was the first neural scheduler introduced for DSSoC in a realistic setting. It sorts tasks using the topological knowledge extracted by graph neural networks and then maps each task to a PE using an exhaustive search, the EFT algorithm [44]. DeepSoCS shows promising results in the SoC application by exploiting the insertion policy and imitating the expert policy, HEFT. However, in DS3 the task-to-PE mapping impacts performance more crucially than the task ordering, because a job is counted as complete only after all of its tasks are finished. SCARL [11] is designed for scheduling single-level job inputs to heterogeneous machines with a pre-defined number of injected jobs. SCARL employs attentive embedding [46] to share representations between the job and resource embeddings and allocates each job to an available machine. Its experiments were conducted in an extended version of a simple cluster simulation [31].
V. PROPOSED METHOD
The critical challenges in designing a DRL scheduler for DS3 are that the scheduler must i) adaptively allocate a varying number of tasks to heterogeneous PEs while considering system dynamics and data transmission delays, and ii) correctly align task returns with scheduling actions according to the respective time-varying agent experiences. The overall systematic workflow of DS3 with scheduling policies is depicted in Fig. 5. After tasks enter the ready queue, the scheduler receives an observation and maps each task to a corresponding PE. In the following subsections, we provide the state, action, and reward statements tailored to DS3. As the set of actions varies in order and time, we present alternative action designs. We also delineate the straightforward and effective EIM technique and how it addresses the alignment of returns.
A. AGENT DESCRIPTION
Applying RL to sequential decision-making problems is natural, as RL collects experiences via interactions with the environment. Conventional RL is formalized as a Markov Decision Process (MDP), which consists of a 5-tuple $\langle \mathcal{S}, \mathcal{A}, R, P, \gamma \rangle$ [41]. Here, $\mathcal{S} \in \mathbb{R}^d$ is the state space, $\mathcal{A} \in \mathbb{R}^n$ is the action space, and $R \in \mathbb{R}$ is the reward signal, generally defined over states or state-action pairs. $P : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is a matrix representing the transition probabilities to next states given a state and an action. $\gamma \in [0, 1]$ is the discount factor determining how much to weigh rewards, from maximizing immediate rewards myopically to weighing future rewards more heavily. RL aims to discover an optimal policy $\pi$ that maximizes the expected cumulative (discounted) rewards, or (discounted) returns. At every interaction, the RL agent samples a (discrete) action from its policy, which is the probability distribution over actions given a state, $a_t \sim \pi(s_t)$. The agent then computes the return $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^{t-1} R(s_t, a_t)\big]$, where $t$ is the interaction time step. In this paper, we assume a finite-state, finite-action, finite-horizon problem.
1) State
The state representation is designed to capture information about the simulation dynamics. Considering SoC domain-specific knowledge, we select attributes of the overlapping tasks/jobs and resource information. The observation features at every interaction are

$$Concat\Big( \big( P^G_n, Stat^G_n, TWT^G_n, |pred^G_n| \big)^{v, W}_{n=0, G=0}, \; \big( Dep^G, JWT^G \big)^W_{G=0}, \; N_{child} \Big), \tag{10}$$

where $n$ is a task in job $G$, $v$ is the number of tasks in job $G$, and $W$ is the number of job DAGs in the job queue. Each observation feature is described as follows (a small assembly sketch follows the list):
• $P^G_n$: the assigned PE ID.
• $Stat^G_n$: the one-hot embedded task status, classified as one of ready, running, or outstanding.
• $TWT^G_n$: the relative task waiting time from the ready status to the current time.
• $|pred^G_n|$: the number of remaining predecessors.
• $Dep^G$: the number of hops (levels) of the remaining tasks in the task dependency graph.
• $JWT^G$: the relative job waiting time from injection into the system to execution.
• $N_{child}$: the number of awaiting child tasks in the outstanding and ready statuses.
The times in the observation features refer to the actual clock signal of the SoC simulation. Depending on the choice of neural architecture, the state representation may also include graph embeddings that capture topological information using graph neural networks [11], [32], [38].
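A minimal sketch of assembling the flat observation (10) from per-task and per-job features follows; the container layout and feature values are illustrative assumptions, not the exact DS3 encoding.

```python
import numpy as np

def build_observation(jobs, n_child):
    """Concatenate per-task and per-job features as in Eq. (10).

    `jobs` is a list of dicts holding per-task features (assigned PE ID,
    status one-hot, task waiting time, #remaining predecessors) and
    per-job features (dependency depth, job waiting time).
    """
    feats = []
    for job in jobs:
        for task in job["tasks"]:
            feats.extend([task["pe_id"], *task["status_onehot"],
                          task["twt"], task["n_pred"]])
        feats.extend([job["dep"], job["jwt"]])
    feats.append(n_child)
    return np.asarray(feats, dtype=np.float32)

jobs = [{"tasks": [{"pe_id": 1, "status_onehot": [1, 0, 0], "twt": 4.0, "n_pred": 0},
                   {"pe_id": -1, "status_onehot": [0, 0, 1], "twt": 0.0, "n_pred": 1}],
         "dep": 2, "jwt": 7.0}]
print(build_observation(jobs, n_child=3))
```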
2) Action
At every task assignment, shown in the top-middle stage of Fig. 2, the agent performs a scheduling decision for each individual task that is free of dependencies. Since the number of ready tasks varies with the previous scheduling decisions and their dependencies, the feasible action set varies. Let a ready task be $a_t \in \mathcal{T}_{ready}$, where $\mathcal{T}_{ready}$ is the set of ready tasks. For every task $\{a_i\}_{i=1}^{|\mathcal{T}_{ready}|}$, an action $i$ is sampled from the policy distribution with parameter $\theta$, $a_i \sim \pi_\theta(a \mid s)$, which can be represented as a multinomial distribution, $\pi_\theta(a \mid s) \stackrel{d}{=} Multinomial(p, m)$. Here, $p \in \mathbb{R}^{1 \times Q}$ is the vector of probabilities over PEs, $Q$ is the number of PEs, and $m \in \mathbb{R}^{1 \times Q}$ is a masking vector that filters out PEs not supporting the task.
[Figure 5: architecture diagram omitted.]

FIGURE 5. The architecture of neural schedulers applied to the DS3 simulator. Schedulers receive N tasks in the ready queue and map each task to SoC computing resources. Due to the varying number of tasks, scheduling policies feed each task iteratively. SoCRATES applies Eclectic Interaction Matching to post-process the return (bottom-left). DeepSoCS returns sorted tasks and uses the EFT algorithm to map them to resources (bottom-right).
One approach is to treat the set of actions as a group action. The group action at RL interaction time step $t$ can be represented by $a_{i,t} \sim \pi_\theta(s_t)$, where the set of actions is sampled from the same probabilities with respect to the policy distribution. In practice, we set the size of the action vector to a sufficiently large number and apply zero-padding whenever the number of ready tasks is smaller [48]. An alternative approach is to treat each ready task as an individual action. In lieu of the group action, the agent iteratively draws a PE selection for each task with respect to a different policy distribution, $a_{i,t} \sim \pi_{\theta,(i)}(s_t)$; in this case, each action is sampled from different probabilities. The masked PE-selection step is sketched below.
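The masked PE selection can be sketched as follows, assuming a PyTorch policy head that emits one logit per PE (the framework choice and names are ours):

```python
import torch

def sample_pe(logits: torch.Tensor, mask: torch.Tensor) -> int:
    """Sample a PE index from pi_theta(a|s) = Multinomial(p, m).

    `logits` has shape (Q,); `mask` is 1 for PEs that support the task,
    0 otherwise. Masked-out PEs receive probability zero.
    """
    masked_logits = logits.masked_fill(mask == 0, float("-inf"))
    probs = torch.softmax(masked_logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([0.2, 1.5, -0.3, 0.8])   # Q = 4 PEs
mask = torch.tensor([1, 0, 1, 1])              # PE 1 cannot run this task
ready_tasks = ["t4", "t5", "t6"]
actions = [sample_pe(logits, mask) for _ in ready_tasks]  # one PE per ready task
print(actions)
```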
3) Reward
DS3 schedulers aim to minimize the average latency over the simulation length. As described in (3), the number of completed jobs largely determines the latency. While the negative job duration is an adequate reward metric in a cluster environment [32], it is not effective for latency: minimizing the elapsed time of the completed jobs is a local optimization, whereas increasing the number of completed jobs is the global optimization that entails overall latency minimization. Moreover, maximizing the number of completed tasks is not an adequate objective, because leaving even one task of a job unfinished contributes nothing to job completion. We therefore state the reward function as

$$R(clk) = C_1 \cdot |\hat{\mathcal{G}}_{comp}| + C_2 \cdot clk, \tag{11}$$

where $|\hat{\mathcal{G}}_{comp}|$ is the number of newly completed jobs at $clk$, and $C_1$ and $C_2$ are the weights of the job completion bonus and the clock-signal penalty, respectively. Empirically, we set $C_1 = +50$ and $C_2 = -0.5$; the penalty term encourages the agent to complete jobs quickly. Note that this reward function is computed per clock signal to enable the EIM technique, which is discussed in the following section. For the standard RL approach, the reward is computed per interaction step $t$ instead of per clock signal $clk$.
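With our empirical constants, the per-clock reward (11) reduces to a one-liner:

```python
C1, C2 = 50.0, -0.5   # job-completion bonus and per-clock penalty (empirical)

def reward(n_newly_completed_jobs: int, clk: int) -> float:
    """Eq. (11), evaluated at every simulation clock signal."""
    return C1 * n_newly_completed_jobs + C2 * clk

print(reward(0, 10), reward(2, 10))   # -5.0 vs. 95.0
```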
B. ECLECTIC INTERACTION MATCHING
Conventional RL environments formalized under the standard MDP assumption return the next observation and the action's consequences right after the previous action has completed. However, as introduced in the example of Fig. 4, an action in DS3 is performed on the ready tasks, and the interaction is highly irregular due to the variability of the task dependencies in a mix of incoming jobs. As a result, the next observation is not generated immediately after the previously scheduled task executes. Moreover, treating the reward and the next observation as arriving at the same time leads to incorrect
[Figure 6: (a) state-action-return with the standard strategy; (b) state-action-return with the EIM strategy. Diagram omitted.]

FIGURE 6. An experience under different strategies. The top figure depicts the experiences and the rearranged state-action-return sequential triplets after processing with the two strategies: standard and EIM. EIM preserves the integrity of task execution for return calculation by accounting for the returns spanning each task's duration. The bottom figure describes three tasks arriving with concurrency and inconsistent interaction due to task dependencies and heterogeneous resources; its two orthogonal axes show the interaction time step and the actual simulation clock signal. Under the standard strategy, delayed consequences are discarded, as depicted by the shaded regions.
reward propagation, because the scheduler assigns multiple tasks at the same time, and each scheduled task may complete at a different time due to the differing performance of heterogeneous PEs. In that sense, the task dependencies and the differing task durations inherently cause delayed rewards, and this phenomenon leads to incorrect reward propagation in the optimization updates. Therefore, the scheduling agent must handle a varying number of action sets and the mismatch between the interaction and the action's effects during the action decision stage, which we address below.
A standard RL experience comes down to a sequence $\langle s, a, r, s' \rangle$. We first decouple the received reward from the next state to obtain i) a sequence $\{s_i, a_i\}_{i=1}^{T}$, where $T$ is the last interaction step, and ii) a list of rewards collected at each simulation clock signal, $\{r_{clk}\}_{clk=1}^{CLK} = \{R(clk)\}_{clk=1}^{CLK}$, where $R(clk)$ is the reward function described in (11). As discussed in Section IV-B, the number of interactions $T$ and the total clock signal $CLK$ generally do not match, due to the differing task dependencies in a mix of hierarchical jobs and the performance of heterogeneous PEs. We compute the immediate reward per clock signal, independent of the interaction step. Additionally, we append a 'start' flag to the state-action tuple and store it in the buffer. While traversing the simulation, upon the completion of any scheduled task, we store a 'complete' flag and the completion clock signal $\omega$ in the buffer. Hence, the experiences in the buffer are described as $\{s_t, \{a_{(n,t)}\}_{n=1}^{n'}, \{\omega_{(n,t)}\}_{n=1}^{n'}\}_{t=1}^{T}$, where $n'$ is the number of ready tasks at interaction step $t$.
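A compact sketch of the EIM return alignment follows: each action's return discounts the per-clock rewards up to its completion clock $\omega$, as in (12). The buffer layout is our simplification.

```python
import numpy as np

def eim_returns(rewards_per_clk, omegas, gamma=0.98):
    """Align a Monte-Carlo return with each action's completion clock.

    `rewards_per_clk[c]` holds R(clk = c + 1); `omegas[i]` is the
    completion clock of the i-th scheduled task. Per Eq. (12), the i-th
    return discounts rewards from clk = 0 through omega_i.
    """
    discounts = gamma ** np.arange(len(rewards_per_clk))
    cumulative = np.cumsum(discounts * np.asarray(rewards_per_clk, dtype=float))
    return [cumulative[min(w, len(cumulative) - 1)] for w in omegas]

rewards = [-0.5] * 9 + [49.5]                 # a job completes at the last clock
print(eim_returns(rewards, omegas=[4, 9]))    # the earlier task sees less of the bonus
```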
Fig. 6 showcases an exemplary experience of scheduling three ready tasks and the return computations under the two strategies. In the upper-left diagram, an agent receives an observation and sequentially selects actions; the state, action, and task start/completion clock signals are marked in green. Next, we compute returns based on the accumulated rewards. In the standard approach, which computes the Monte-Carlo return from the accumulated rewards [42], the partial reward sequences that overlap ongoing tasks and subsequent observations are not counted; the missing sequences incur incorrect return matching and instability in training.
The EIM technique instead aligns the Monte-Carlo returns with the committed actions spanning each individual task's duration, referring to the 'start' and 'end' task signals. The return for each action reflects the length of the task's duration, and each action is correctly matched to its outcome without any discarded information. Moreover, the task flags and actual clock signals allow the agent to sequentially select actions within a set of varying actions. The EIM technique thus enables the agent to receive a correct form of state-action-return triplets, regardless of the varying action sets. EIM is a straightforward post-process that proves effective in training an agent when the agent interactions and the simulation clock signal are inconsistent. The bottom diagram of Fig. 6 depicts the tasks and actions with the return computation; the x-axis denotes the simulation clock signal, and the y-axis the RL interaction time step. Portions of the second and third tasks' durations are discarded for return assignment in the standard approach. EIM, by contrast, properly pairs each state-action tuple with its return by aligning returns to task assignments.
In training, we use the Actor-Critic algorithm [24]. We use shared neural networks for both actor and critic and update the parameters with REINFORCE [42]. While the actor network selects actions with respect to the policy distribution $\pi_\theta$, the critic network estimates the value using the value function $\hat{V}^{\pi_\theta}$.
Algorithm 2 SoCRATES Scheduler
1: Input: clock signal clk, job queue J, ready task queue T_ready
2: for each episode do
3:     state-action buffer B_SA ← ∅
4:     clock buffer B_clk ← ∅
5:     reward buffer B_R ← ∅
6:     for each task i in T_ready do
7:         Construct state s_t
8:         a_{i,t} ∼ π_{θ,(i)}(s_t)
9:         Assign a_{i,t} to a PE for task i
10:        B_SA ← (s_t, a_{i,t})
11:    end for
12:    if task i completes then
13:        B_clk ← (i, ω)
14:    end if
15:    B_R ← r_clk
16: end for
17: # Update the agent model
18: θ ← θ + η γ^t ∇_θ L^{SoC}_t(θ)
At the end of the episode, EIM post-processes the expected returns as described in (12):

$$\hat{G}(s_t, \omega) = \sum_{clk=0}^{\omega} \gamma^{clk} \, r_{clk+1}. \tag{12}$$

The actor loss is given by

$$L^{ACT}_t(\theta) = -\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) \Big[ \hat{G}(s_t, \omega) - \hat{V}^{\pi_\theta}(s_t) \Big], \tag{13}$$

and the critic loss is computed with the standard mean squared loss,

$$L^{CRI}_t(\theta) = \frac{1}{2} \Big( \hat{G}(s_t, \omega) - \hat{V}^{\pi_\theta}(s_t) \Big)^2. \tag{14}$$

The overall loss is given by

$$L^{SoC}_t(\theta) = L^{ACT}_t(\theta) + L^{CRI}_t(\theta) + \xi H(s_t), \tag{15}$$

where the last term on the right-hand side is the entropy regularization, $H(s_t) = \mathbb{E}_{\pi_\theta}[\log \pi_\theta(s_t)]$, with its coefficient $\xi$ introduced for exploration. Pseudocode for the proposed algorithm is given in Algorithm 2.
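A minimal PyTorch rendering of the losses (13)-(15), assuming the EIM-aligned returns $\hat{G}$ have already been computed; all tensor names are ours, and the entropy term is passed in as a precomputed scalar.

```python
import torch

def socrates_loss(log_probs, values, returns, entropy, xi=0.01):
    """Eqs. (13)-(15): actor loss + critic loss + entropy regularization.

    log_probs: log pi_theta(a_t|s_t), shape (T,)
    values:    critic estimates V(s_t), shape (T,)
    returns:   EIM-aligned returns G(s_t, omega), shape (T,)
    """
    advantage = returns - values.detach()                 # G - V; no critic grad in actor term
    actor_loss = -(log_probs * advantage).sum()           # Eq. (13)
    critic_loss = 0.5 * (returns - values).pow(2).sum()   # Eq. (14)
    return actor_loss + critic_loss + xi * entropy        # Eq. (15)

T = 3
log_probs = torch.randn(T, requires_grad=True)   # stand-ins for policy outputs
values = torch.randn(T, requires_grad=True)
returns = torch.tensor([5.0, 3.0, 1.0])
loss = socrates_loss(log_probs, values, returns, entropy=torch.tensor(0.1))
loss.backward()
print(float(loss))
```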
VI. EVALUATION
This section demonstrates the feasibility of neural schedulers in a high-fidelity SoC simulation, DS3. We present evaluations in three ways: (a) we revisit rule-based schedulers and observe where their performance benefits come from, (b) we verify the efficacy of the EIM technique on neural schedulers by investigating PE usage under various reward functions and different action designs, and (c) we empirically validate that neural schedulers can achieve competitive and generalized run-time performance in a series of experiments where the job DAG topology and PE performance are varied. Specifically,
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| γ (discount factor) | 0.98 |
| η (learning rate) | 0.0003 |
| ξ (entropy coefficient) | 0.01 |
| α (job structure) | 0.8 |
| µ (scale to PE performance) | 0.5 |
| ν (avg. comm. delay) | 0.0 |
| Number of workloads | 200 |
| Simulation length | 10,000 |
| C (job capacity) | 3 |
| Gradient clip | 1 |
| Scale | 25 |

TABLE 3. Hyperparameters used for training the neural schedulers.
we examine in which operational conditions existing neural
schedulers with EIM have significant performance gains.
A. EXPERIMENTAL SETUP
Table 3 lists the training parameters. We use the Adam optimizer [23] and clip gradients to avoid gradient explosion. To engender more interactions and a more dynamic environment, we randomly inject jobs from a set of 200 workloads. Since we empirically found that the job DAG topologies of the synthetic and real-world profiles are structured with α = 0.8, we synthesize job structures with α = 0.8 based on the given job profile. The job inter-arrival rate (scale) is set to 25; the system stochastically injects a job every 25 clock signals on average. Each episode runs the simulation for 10,000 clock signals. To reduce the training and evaluation time, we conduct all experiments from an initial quasi-steady-state condition; that is, each experiment begins with all jobs already stacked in the job queue [38]. All evaluations are conducted over 20 trials with different random seeds.
B. REVISITING RULE-BASED SCHEDULERS
Rule-based algorithms have continued to demonstrate state-of-the-art performance in SoC run-time scheduling [5], [28]. To establish a baseline for the comparative study, we first extensively investigate the run-time performance of existing heuristic schedulers. As described in Section IV-C1, the main difference between STF and MET is that MET revises the scheduling assignment by checking whether the selected PE is busy or idle. Likewise, HEFTRT iteratively computes the actual run-time of the given tasks using computation time and data transmission delay; it then applies an insertion policy, which exhaustively searches for a possible empty slot between existing task assignments, as sketched below.
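As a concrete illustration of such an insertion policy, the following sketch searches the gaps between tasks already scheduled on a single PE for the earliest slot that can hold a new task. The data layout (a time-sorted list of (start, end) intervals per PE) is our own assumption, not the HEFTRT implementation itself.

def earliest_insertion_slot(schedule, ready_time, duration):
    """Find the earliest start time for a task of `duration` on one PE.

    schedule: time-sorted list of (start, end) intervals already
        assigned to this PE.
    ready_time: earliest clock at which the task's inputs (including
        data transmission) are available.
    Exhaustively checks every gap, mirroring an insertion policy.
    """
    candidate = ready_time
    for start, end in schedule:
        if candidate + duration <= start:   # the task fits in this gap
            return candidate
        candidate = max(candidate, end)     # otherwise move past this task
    return candidate                        # append after the last task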
The top plot in Fig. 8 shows the overall run-time performance of different heuristic schedulers using synthetic (Syn) and real-world (RW) profiles. The x-axis indicates CCR, and the y-axis indicates the average latency. Jobs are communication-intensive if CCR > 1.0 and computation-intensive if CCR < 1.0.
FIGURE 7. Performance evaluation of neural and non-neural schedulers with various job structures and PE performances using synthetic profiles. The left and right figures show run-time performance for varying job DAG topologies and PE performances, respectively.
FIGURE 8. An experimental analysis of heuristic schedulers. The top figure compares run-time performance under different data transmission delays. The cross and triangle marks depict fixed job profiles from synthetic and real-world workloads. Results are reported with a solid line for real-world (RW) and a dotted line for synthetic (Syn) workloads as data transmission delay increases relative to task computation time; both lines are plotted with the job structure fixed at α = 0.8. The bottom figure shows the effectiveness of the insertion policy in the HEFTRT scheduler using violin charts: the insertion policy significantly improves average latency (bottom-left) and increases the number of completed jobs (bottom-right).
For synthetic profiles structured with a similar range of CCR, STF and MET have comparable performance and surpass HEFTRT. On the other hand, MET and HEFTRT significantly improve performance on real-world profiles, where jobs are computation-intensive, by checking PE availability and exhaustively searching for empty slots between task assignments. In particular, when CCR is increased on real-world profiles, MET and STF show similar performance and surpass HEFTRT. We observe that an increasing gap between computation time and communication delay leads to large variances in the distribution of task run-times. The high variance in profile statistics creates more opportunities to improve performance by rescheduling task assignments.
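Here, CCR follows the usual definition of average communication delay over average computation time across a job DAG; a minimal sketch under that assumption (the function and argument names are illustrative):

def ccr(comp_times, comm_delays):
    """Communication-to-computation ratio of a job DAG.

    comp_times: average execution time of each task (over all PEs).
    comm_delays: average data-transmission delay of each edge.
    CCR > 1 marks a communication-intensive job; CCR < 1 marks a
    computation-intensive one.
    """
    avg_comm = sum(comm_delays) / len(comm_delays)
    avg_comp = sum(comp_times) / len(comp_times)  # assumed nonzero
    return avg_comm / avg_comp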
The bottom plot in Fig. 8 shows an experimental result for the real-world profile using two variants of HEFTRT, with and without the insertion policy. The insertion policy effectively seeks better placements owing to the divergent distribution of task computation times. Hence, rescheduling task assignments in heuristic schedulers instrumentally improves run-time performance: when variations in task run-time are large, HEFTRT's exhaustive search can achieve nearly optimal performance within its myopic scope.
C. PERFORMANCE COMPARISON
This section describes our extensive evaluation of the performance of existing schedulers specifically designed for heterogeneous resources in the SoC domain. We compare two types of representative scheduling algorithms: i) SoCRATES [40], DeepSoCS [38], and SCARL [11] for neural schedulers, and ii) STF, MET [7], and HEFTRT [28], [44] for heuristics. Since SCARL does not support hierarchical workloads, we modified it as follows. (1) State: we use the same job representation as SoCRATES, and select PE performance, PE type, capacity, available time to execute tasks, remaining task execution time, idle rate, and normalized values of PE run-time and expected total task time as features of the PE representation. (2) Action: the original SCARL selects both a workload and a resource; since it does not support task selection for hierarchical workloads, we map the selected resource to tasks in sequence. (3) Reward: we use the same job-completion reward function described in (11). At the update stage, we compute the returns from the collected rewards after post-processing with EIM.
Throughout the evaluations, we primarily concentrate on average latency, which indicates average run-time performance. We observe how the schedulers behave across a wide range of experiments by varying the types and structures of the jobs, the transmission delay, and the performance of the heterogeneous PEs. Fig. 7 reports the run-time performance using a synthetic workload. The left and right plots depict the experimental results when varying job structures and PE performance, respectively. For the former case, we control the job
structure parameter α while holding the PE performance parameter µ fixed, and for the latter case, vice versa. A large α generates shallow but wide job graphs, while a small α generates deep but narrow job graphs.

FIGURE 9. Performance evaluation of neural and non-neural schedulers with various job structures and PE performances using real-world profiles. The left and right figures show run-time performance for varying job DAG topologies and PE performances, respectively.

All evaluations are
conducted with the highest job inter-arrival rate (the smallest scale value), leading to a high frequency of job injection. From a holistic viewpoint, the trends in SLR and Speedup follow the curve of the average latency. We observe that SoCRATES surpasses all other schedulers under a wide range of experimental settings. Since the neural schedulers are evaluated using a single trained model, this shows that SoCRATES generalizes competitively across various job structures and PE performances. As described in Section VI-B, CCR for the synthetic workload is close to 1.0, meaning that task computation time and data transmission delay lie in a similar range. As a result, task ordering in the heuristics has little impact, and their performance falls behind SoCRATES. When the number of tasks is varied, SCARL's attentive embedding of tasks and resources is unable to exploit its attentive representation and further deteriorates overall run-time performance; as a result, SCARL performs comparably to a random policy.
Synthetic and real-world profiles differ in the number of tasks and resources, job DAG topology, and the functionalities supported by individual resources. Table 2 indicates that the real-world profile has a much higher task computation cost than data transmission delay. Hence, the actual task run-time varies more, and rescheduling task assignments in heuristic schedulers can largely improve run-time performance. As a result, SCARL significantly improves its performance, and HEFTRT shows the best run-time behavior in the real-world profile, as shown in Fig. 9. We observe that SoCRATES surpassed the other schedulers in the synthetic profile, but its advantage was limited in the real-world profile. Although EIM remedies fundamental difficulties in DS3 and improves SoCRATES's performance, it cannot close the gap with schedulers that find near-optimal task placements within a myopic scope through exhaustive search. We hypothesize that the characteristics of the real-world profile, such as the varying availability of task executions across resources, pose compound challenges for designing a DRL scheduler. Additionally, a large number of tasks leads to increased complexity
in task dependency composition and to large variance, because completing all tasks counts as job completion. Hence, the end-to-end neural approach could not surpass HEFTRT when task durations have high variance and computing resources cannot execute all tasks. Although SoCRATES shows limited performance in the real-world profile, DeepSoCS achieves performance comparable to HEFTRT by imitating experiences from the expert algorithm.

FIGURE 10. Overall average latency under various job DAG topologies obtained by controlling α. The left plot shows the synthetic profile, and the right plot shows the real-world profile.
In conclusion, Fig. 10 summarizes the overall evaluation of neural and non-neural schedulers using synthetic and real-world profiles with various job DAG topologies controlled by α. The left figure shows that SoCRATES behaves markedly better when computation time and communication delay lie in a similar range. The right figure shows that DeepSoCS and HEFTRT perform best when task durations have large variance. Thus, by adaptively choosing between the EIM-based policy and the imitated expert policy depending on the scenario, a neural SoC scheduler can outperform the other neural and non-neural schedulers; a hypothetical dispatch rule is sketched below.
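Such an adaptive choice could be as simple as a dispatch on profile statistics. The thresholding rule below is a hypothetical reading of Figs. 9 and 10, not a mechanism proposed in this paper:

def pick_scheduler(ccr, task_time_cv, eim_policy, imitation_policy):
    """Choose between the EIM-trained policy and the expert-imitating one.

    ccr: communication-to-computation ratio of the incoming profile.
    task_time_cv: coefficient of variation of task durations.
    Per the evaluation, the EIM policy excels when computation and
    communication are balanced (CCR near 1), while the expert-imitating
    policy (DeepSoCS) is preferable under high task-duration variance.
    """
    if task_time_cv > 1.0 and ccr < 1.0:   # real-world-like profiles
        return imitation_policy
    return eim_policy                       # synthetic-like profiles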
D. ANATOMY OF SOCRATES
SoCRATES is a fully differentiable decision-making algorithm [40]. Its crucial component is the EIM technique, which alleviates both delayed rewards and variable action selection caused by hierarchical job graphs, mixes of different jobs, and heterogeneous computing resources. Although the recently introduced EIM technique overcomes these challenges, its efficacy has not yet been validated. In this section, we rationalize the underlying reasons behind the performance improvement of EIM by examining PE usage and action designs.
FIGURE 11. Experiments for the analysis of the EIM technique. The top plot shows PE selection and active/blocking times for PEs under different scheduling algorithms. The bottom plot compares latency performance using EIM and standard strategies with various sparse and dense reward functions. All experiments are conducted on synthetic workloads with a fixed job topology.
1) Analysis of Eclectic Interaction Matching
First, we examine how EIM affects the scheduling policy's decisions through PE selection behavior. The top plot in Fig. 11 shows the execution count for each PE and the total active and blocking times. Active time is the time a PE spends executing, and blocking time is the duration for which a PE is busy while its assigned tasks are ready; both are measured in simulation clock signals. Intuitively, PE usage is optimized when active time increases and blocking time decreases. SoCRATES with the EIM technique utilizes a greater number of PEs and achieves higher resource active time than the other policies. Its high blocking time derives from the fact that the policy weights favor long-term returns at the cost of immediate returns. The same policy without the EIM technique shows similar active and blocking times; however, its low PE execution count leads to poor PE utilization and latency behavior. MET also uses a large number of PEs, but its low active time introduces additional bottlenecks in PE usage. The random and SCARL policies show high active time, but their absolute PE execution counts are much lower; as a result, they perform poorly compared to the other schedulers.
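The PE-usage statistics above can be recovered from a simulator event trace. Below is a hedged sketch under the assumption that each executed task exposes (assigned, started, finished) timestamps; the trace format is our own invention, not the DS3 API:

def pe_usage(trace):
    """Aggregate active and blocking time per PE from an event trace.

    trace: iterable of (pe_id, assigned_clk, start_clk, end_clk) tuples,
        one per executed task. Active time is time spent executing;
        blocking time is time a ready task waited on a busy PE.
    Returns {pe_id: (count, active, blocking)}.
    """
    stats = {}
    for pe, assigned, start, end in trace:
        count, active, blocking = stats.get(pe, (0, 0, 0))
        stats[pe] = (count + 1,
                     active + (end - start),         # execution span
                     blocking + (start - assigned))  # wait while PE was busy
    return stats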
Next, as shown in the bottom plot, we train SoCRATES using various types of reward functions, with and without the EIM technique, to test generalization. We train the policy on the synthetic workload using two dense and two sparse reward functions (see Appendix B for details). The solid line represents the average values, and the shaded region bounds the maximum and minimum values among
8 runs with different random seeds.

FIGURE 12. A set of performance metrics for selecting a group action versus independent actions during training. Top-left: total rewards; top-right: returns; bottom-left: explained variance [13]; bottom-right: average job execution time (µs).
The flat, steady performance curve of the standard strategy indicates that it fails to train the model; the EIM strategy, by contrast, exhibits a clear learning curve. EIM iteratively matches each action's return with its respective task duration, at the cost of storing extra flags for task start and completion. This additional post-processing is computationally cheap and achieves substantially better latency performance than the standard strategy under every reward function considered. From the reward-function perspective, the sparse reward clearly exacerbates unstable latency performance due to its limited feedback to the RL agent; hence, sparse rewards are commonly converted into dense forms using the shaping technique [36].
2) Action design
To design a DRL agent in an environment with varying actions, one can set the action space to the maximal number of actions and mask out the invalid ones [48]. In the case of group actions, we distribute the returns, computed over the longest task duration, across the set of actions. It turns out, however, that the group-action approach is inadequate due to rapid convergence of the gradients: the action probabilities quickly collapse into local minima, and the policy and value losses grow exponentially. Moreover, a set of group actions and their respective returns cannot be distributed to individual actions, since each task has its own reward and estimated return (or value) over its task duration. Individual action selection with the EIM technique, by contrast, converts the varying-action problem into a conventional sequential decision-making problem, with no need for the agent to be aware of invalid actions. Fig. 12 shows that the policy with independent actions produces expected returns that are higher by 270% and improves run-time performance by 30%; we applied the EIM technique to both approaches. Additionally, we report the explained variance, $EV$, defined in (16):
$$EV(s_t) = 1 - \frac{\mathbb{V}[G(s_t) - \hat{G}(s_t)]}{\mathbb{V}[G(s_t)]}, \tag{16}$$
where $G(s_t)$ is the empirical return of state $s_t$ and $\hat{G}(s_t)$ is the predicted return of state $s_t$. $EV$ measures the discrepancy between the empirical return and the predicted return [14]. The decreasing explained variance for the group action empirically validates that the group-action design fails to capture the environment dynamics while reconciling the experiences.
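Explained variance as in (16) is straightforward to compute from the rollout buffers; below is a minimal NumPy sketch (the function name and array-based interface are our own illustration):

import numpy as np

def explained_variance(returns, predicted):
    """EV of Eq. (16): 1 - Var[G - G_hat] / Var[G].

    Values near 1 mean the critic predicts the empirical returns well;
    values at or below 0 mean it explains nothing (or worse).
    """
    returns, predicted = np.asarray(returns), np.asarray(predicted)
    var_g = np.var(returns)
    if var_g == 0:
        return float("nan")  # EV is undefined for constant returns
    return 1.0 - np.var(returns - predicted) / var_g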
VII. CONCLUSION AND FUTURE WORK
In this paper, we unveil myths and realities of DRL-based SoC job/task schedulers. We identify key practical challenges in designing high-fidelity neural SoC schedulers: (1) varying action sets, (2) a high degree of heterogeneity in incoming jobs and available SoC compute resources, and (3) misalignment between agent interactions and delayed rewards. We propose and analyze a novel end-to-end neural scheduler (SoCRATES), detailing its core technique (EIM), which aligns returns with the proper time-varying agent experiences. EIM successfully addresses the aforementioned challenges, endowing SoCRATES with a significant gain in average latency and generalized performance over a wide range of job structures and PE performances. We also rationalize the underlying reasons behind the substantial performance improvement of existing neural schedulers with EIM by examining actual PE usage and disparate action designs. Through extensive experiments, we discover that CCR significantly impacts the performance of neural SoC schedulers even with EIM. At the same time, we find that the rescheduling of task assignments by heuristic schedulers leads to significant performance gains under certain operational conditions, often outperforming neural counterparts.
With these findings, we intend to investigate further, both empirically and theoretically, whether the EIM technique can bring additional performance gains to other learning-based and planning algorithms. Motivated by the advantage of task rescheduling in heuristic schedulers, we plan to improve neural schedulers by converting this technique into a differentiable function and integrating it into the optimization. Alternatively, offline reinforcement learning using expert or trace replay [3] is another possible approach to improving neural schedulers. Moreover, leveraging the structure of the underlying action space to parameterize the policy is a candidate approach for tackling a varying action set [10]. We also plan to leverage GNNs to bestow structural knowledge from job DAGs [50], and to demonstrate the performance gain of the improved neural schedulers using the Compiler-Integrated Extensible DSSoC Runtime (CEDR) tool, a successor to the DS3 emulator, as it enables the gathering of low-level, fine-grained timing and performance-counter characteristics [29].
APPENDIX A JOB DAG CONSTRUCTION
The simulator can synthesize a variety of workloads given the job profile and the hyperparameters described in Section IV-A1. First, we compute the average widths, $w$, and depths, $d$, with the hyperparameter $\alpha$, based on the job model description in Section IV-A1.
FIGURE 13. An illustration of two types of job DAGs based on α. The left diagram shows that a small α generates a deep but narrow job graph; the dotted lines in the middle represent hidden nodes and edges. The right diagram shows that a large α generates a shallow but wide job graph.

We compute the number
of nodes per level by $w \sim \max(1, \mathcal{N}(\lfloor w \rfloor, 0))$ for each of the $d-2$ job levels; here, we exclude the two levels in which the HEAD and TAIL nodes are located. Then we check whether the total number of nodes matches $v$ (the target number of nodes). If the total is less than or greater than $v$, we randomly select nodes from the job DAG and add or remove them so that the graph has exactly $v$ nodes. As illustrated in Fig. 13, a small $\alpha$ generates deep but narrow job graphs (left figure), and a large $\alpha$ generates shallow but wide job graphs (right figure). Next, the job model generates task dependencies by the following iterative process. Let the number of predecessors of node $n_i$ and the number of nodes at level $l$ be denoted by $|pred(n_i)|$ and $|l|$, respectively. Then the number of dependent tasks for node $i$ at level $l$ is computed by $\max\big(1, \min\big(\mathcal{N}\big(\frac{|l-1|}{3}, 0\big), |l-1|\big)\big)$. We connect $n_i$ to $|pred(n_i)|$ randomly selected nodes in level $l-1$.
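A compact sketch of this construction is given below, with NetworkX used purely for illustration; the width/depth mapping from α and the helper names are our own assumptions, and the adjustment to exactly v interior nodes is elided.

import math
import random
import networkx as nx

def synthesize_job_dag(v, alpha):
    """Build a job DAG with roughly v tasks whose shape follows alpha.

    Large alpha -> shallow but wide graphs; small alpha -> deep but
    narrow ones, as in Appendix A. The width/depth mapping is illustrative.
    """
    depth = max(3, round(math.sqrt(v) / alpha))    # assumed mapping from alpha
    width = max(1.0, alpha * math.sqrt(v))
    g = nx.DiGraph()
    g.add_node("HEAD")
    levels, nid = [], 0
    for _ in range(depth - 2):                     # HEAD/TAIL levels excluded
        n = max(1, round(random.gauss(width, 0)))  # N(floor(w), 0) per the text
        level = [f"t{nid + i}" for i in range(n)]
        nid += n
        levels.append(level)
    # (The add/remove adjustment to exactly v interior nodes is omitted.)
    prev = ["HEAD"]
    for level in levels:
        for node in level:
            # Predecessor count: max(1, min(N(|l-1|/3, 0), |l-1|))
            k = max(1, min(round(random.gauss(len(prev) / 3, 0)), len(prev)))
            for p in random.sample(prev, k):
                g.add_edge(p, node)                # task dependency edge
        prev = level
    g.add_node("TAIL")
    for node in prev:
        g.add_edge(node, "TAIL")
    return g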
APPENDIX B DESCRIPTION OF REWARD FUNCTIONS
Two dense and two sparse reward functions are used to validate the efficacy of the EIM technique. The reward functions are defined as follows:
$$R_{dense}(clk) = C_1 \cdot |\hat{G}_{comp}| + C_2 \cdot clk \tag{17}$$

$$R_{dense2}(clk) = C_1 \cdot |\hat{G}_{comp}| \tag{18}$$

$$R_{sparse}(clk) = 0 \cdot \mathbb{1}[clk < CLK - m] + C_1 \cdot |\hat{G}_{comp}| \cdot \mathbb{1}[clk \geq CLK - m] \tag{19}$$

$$R_{sparse2}(clk) = 0 \cdot \mathbb{1}[clk \neq CLK] + C_1 \cdot |\hat{G}_{comp}| \cdot \mathbb{1}[clk = CLK], \tag{20}$$

where $|\hat{G}_{comp}|$ is the number of newly completed jobs at $clk$, $CLK$ is the end of the simulation length, and $m$ is the number of lastly completed tasks. $C_1$ and $C_2$ are the weights of the job completion bonus and the clock-signal penalty, respectively; we empirically set $C_1 = +50$ and $C_2 = -0.5$. All reward functions are computed per simulation clock signal.
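The four reward variants translate directly into code; a sketch with C1 and C2 hard-coded to the paper's empirical values (function and argument names are illustrative):

C1, C2 = 50.0, -0.5   # job-completion bonus and per-clock penalty

def r_dense(clk, n_completed):           # Eq. (17)
    return C1 * n_completed + C2 * clk

def r_dense2(clk, n_completed):          # Eq. (18)
    return C1 * n_completed

def r_sparse(clk, n_completed, CLK, m):  # Eq. (19): reward only near the end
    return C1 * n_completed if clk >= CLK - m else 0.0

def r_sparse2(clk, n_completed, CLK):    # Eq. (20): reward only at the end
    return C1 * n_completed if clk == CLK else 0.0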
REFERENCES
[1] Odroid-xu3. https://wiki.odroid.com/old_product/odroid-xu3/odroid-xu3. Accessed: 2019-03-03.
[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving,
Michael Isard, et al. Tensorflow: A system for large-scale machine
learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[3] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An
optimistic perspective on offline reinforcement learning. In International
Conference on Machine Learning, pages 104–114. PMLR, 2020.
[4] Aporva Amarnath, Subhankar Pal, Hiwot Tadese Kassa, Augusto Vega,
Alper Buyuktosunoglu, Hubertus Franke, John-David Wellman, Ronald
Dreslinski, and Pradip Bose. Heterogeneity-aware scheduling on socs for
autonomous vehicles. IEEE Computer Architecture Letters, 20(2):82–85,
2021.
[5] Samet E Arda, Anish Krishnakumar, A Alper Goksoy, Nirmal Kumbhare,
Joshua Mack, Anderson L Sartor, Ali Akoglu, Radu Marculescu, and
Umit Y Ogras. Ds3: A system-level domain-specific system-on-chip
simulation framework. IEEE Transactions on Computers, 69(8):1248–
1262, 2020.
[6] James Blythe, Sonal Jain, Ewa Deelman, Yolanda Gil, Karan Vahi, Anir-
ban Mandal, and Ken Kennedy. Task scheduling strategies for workflow-
based applications in grids. In CCGrid 2005. IEEE International Sympo-
sium on Cluster Computing and the Grid, 2005., volume 2, pages 759–767.
IEEE, 2005.
[7] Tracy D Braun, Howard Jay Siegel, Noah Beck, Ladislau L Bölöni, Muthu-
cumaru Maheswaran, Albert I Reuther, James P Robertson, Mitchell D
Theys, Bin Yao, Debra Hensgen, et al. A comparison of eleven static
heuristics for mapping a class of independent tasks onto heterogeneous
distributed computing systems. Journal of Parallel and Distributed com-
puting, 61(6):810–837, 2001.
[8] Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John
Wilkes. Borg, omega, and kubernetes: Lessons learned from three
container-management systems over a decade. Queue, 14(1):70–93, 2016.
[9] Giorgio C Buttazzo. Hard real-time computing systems: predictable
scheduling algorithms and applications, volume 24. Springer Science &
Business Media, 2011.
[10] Yash Chandak, Georgios Theocharous, Chris Nota, and Philip Thomas.
Lifelong learning with a changing action set. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 3373–3380, 2020.
[11] Mukoe Cheong, Hyunsung Lee, Ikjun Yeom, and Honguk Woo. Scarl:
Attentive reinforcement learning-based scheduling in a multi-resource
heterogeneous cluster. IEEE Access, 7:153432–153444, 2019.
[12] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data process-
ing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[13] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias
Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and
Peter Zhokhov. Openai baselines, 2017.
[14] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias
Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and
Peter Zhokhov. Openai baselines. https://github.com/openai/baselines,
2017.
[15] Hamza Djigal, Jun Feng, Jiamin Lu, and Jidong Ge. Ippts: an efficient
algorithm for scientific workflow scheduling in heterogeneous comput-
ing systems. IEEE Transactions on Parallel and Distributed Systems,
32(5):1057–1071, 2020.
[16] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and
George E Dahl. Neural message passing for quantum chemistry. In
International Conference on Machine Learning, pages 1263–1272. PMLR,
2017.
[17] A Alper Goksoy, Anish Krishnakumar, Md Sahil Hassan, Allen J Farcas,
Ali Akoglu, Radu Marculescu, and Umit Y Ogras. Das: Dynamic adaptive
scheduling for energy-efficient heterogeneous socs. IEEE Embedded
Systems Letters, 14(1):51–54, 2021.
[18] Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae
Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 485–500, 2019.
[19] John L Hennessy and David A Patterson. A new golden age for computer
architecture. Communications of the ACM, 62(2):48–60, 2019.
[20] Conrad Mestres Holt. Novel Learning-Based Task Schedulers for Domain-
Specific SoCs. PhD thesis, Arizona State University, 2020.
[21] Zhiming Hu, James Tu, and Baochun Li. Spear: Optimized dependency-
aware task scheduling with deep reinforcement learning. In 2019
IEEE 39th International Conference on Distributed Computing Systems
(ICDCS), pages 2037–2046. IEEE, 2019.
[22] Bahman Keshanchi, Alireza Souri, and Nima Jafari Navimipour. An
improved genetic algorithm for task scheduling in the cloud environments
using the priority queues: formal verification, simulation, and statistical
testing. Journal of Systems and Software, 124:1–21, 2017.
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in
neural information processing systems, 12, 1999.
[25] Anish Nallamur Krishnakumar. Design and Run-Time Resource Man-
agement of Domain-Specific Systems-on-Chip (DSSoCs). PhD thesis,
University of Wisconsin-Madison, 2022.
[26] Hok-Leung Kung, Shu-Jun Yang, and Kuo-Chan Huang. An improved
monte carlo tree search approach to workflow scheduling. Connection
Science, 34(1):1221–1251, 2022.
[27] Tan N Le, Xiao Sun, Mosharaf Chowdhury, and Zhenhua Liu. Allox:
compute allocation in hybrid clusters. In Proceedings of the Fifteenth
European Conference on Computer Systems, pages 1–16, 2020.
[28] Joshua Mack, Samet Arda, Umit Y Ogras, and Ali Akoglu. Performant,
multi-objective scheduling of highly interleaved task graphs on hetero-
geneous system on chip devices. IEEE Transactions on Parallel and
Distributed Systems, 2021.
[29] Joshua Mack, Sahil Hassan, Nirmal Kumbhare, Miguel Castro Gonzalez,
and Ali Akoglu. CEDR: A compiler-integrated, extensible DSSoC runtime.
ACM Transactions on Embedded Computing Systems (TECS), 2022.
[30] Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram
Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla.
Themis: Fair and efficient GPU cluster scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 289–304, 2020.
[31] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kan-
dula. Resource management with deep reinforcement learning. In
Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages
50–56. ACM, 2016.
[32] Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili
Meng, and Mohammad Alizadeh. Learning scheduling algorithms for
data processing clusters. In Proceedings of the 2019 ACM SIGCOMM
Conference. ACM, 2019.
[33] Marcel Mettler, Martin Rapp, Heba Khdr, Daniel Mueller-Gritschneder,
Jörg Henkel, and Ulf Schlichtmann. An fpga-based approach to evaluate
thermal and resource management strategies of many-core processors.
ACM Transactions on Architecture and Code Optimization (TACO),
19(3):1–24, 2022.
[34] Kasra Moazzemi. Runtime Resource Management of Emerging Appli-
cations in Heterogeneous Architectures. University of California, Irvine,
2020.
[35] Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phan-
ishayee, and Matei Zaharia. Heterogeneity-aware cluster scheduling
policies for deep learning workloads. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 481–498, 2020.
[36] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under
reward transformations: Theory and application to reward shaping. In
ICML, volume 99, pages 278–287, 1999.
[37] Tegg Taekyong Sung, Valliappa Chockalingam, Alex Yahja, and Bo Ryu.
Neural heterogeneous scheduler. arXiv preprint arXiv:1906.03724, 2019.
[38] Tegg Taekyong Sung, Jeongsoo Ha, Jeewoo Kim, Alex Yahja, Chae-Bong
Sohn, and Bo Ryu. Deepsocs: A neural scheduler for heterogeneous
system-on-chip (soc) resource scheduling. Electronics, 9(6):936, 2020.
[39] Tegg Taekyong Sung and Bo Ryu. A scalable and reproducible
system-on-chip simulation for reinforcement learning. arXiv preprint
arXiv:2104.13187, 2021.
[40] Tegg Taekyong Sung and Bo Ryu. Socrates: System-on-chip resource
adaptive scheduling using deep reinforcement learning. In 2021 20th
IEEE International Conference on Machine Learning and Applications
(ICMLA), pages 496–501. IEEE, 2021.
[41] Richard S Sutton and Andrew G Barto. Reinforcement learning: An
introduction. MIT press, 2018.
[42] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour,
et al. Policy gradient methods for reinforcement learning with function
approximation. Citeseer, 1999.
[43] Huangshi Tian, Yunchuan Zheng, and Wei Wang. Characterizing and
synthesizing task dependencies of data-parallel jobs in alibaba cloud. In
Proceedings of the ACM Symposium on Cloud Computing, pages 139–
151, 2019.
[44] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Task scheduling
algorithms for heterogeneous processors. In Proceedings. Eighth Hetero-
geneous Computing Workshop (HCW’99), pages 3–14. IEEE, 1999.
[45] Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael A Kozuch, Mor
Harchol-Balter, and Gregory R Ganger. Tetrisched: global rescheduling
with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceed-
ings of the Eleventh European Conference on Computer Systems, pages
1–16, 2016.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. Advances in neural information processing systems, 30,
2017.
[47] Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agar-
wal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh
Shah, Siddharth Seth, et al. Apache hadoop yarn: Yet another resource
negotiator. In Proceedings of the 4th annual Symposium on Cloud
Computing, pages 1–16, 2013.
[48] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexan-
der Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler,
John Agapiou, Julian Schrittwieser, et al. Starcraft ii: A new challenge for
reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
[49] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Si-
vathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu
Zhao, Quanlu Zhang, et al. Gandiva: Introspective cluster scheduling for
deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, 2018.
[50] Yue Yu, Jie Chen, Tian Gao, and Mo Yu. Dag-gnn: Dag structure learning
with graph neural networks. In International Conference on Machine
Learning, pages 7154–7163. PMLR, 2019.
TEGG TAEKYONG SUNG Dr. Sung is an expert
in the field of deep reinforcement learning with
over three years of research and industrial experi-
ence. Currently, he is working as a senior research
scientist at EpiSys Science, Inc. In the past, he
has worked as a research assistant at Kwangwoon
University and a visiting researcher at Electron-
ics and Telecommunications Research Institute,
South Korea. Dr. Sung’s research interest includes
real-world reinforcement learning, machine learn-
ing, and graph neural networks. He has authored and co-authored more than
10 publications. He received his Ph.D. from Kwangwoon University, South
Korea.
BO RYU Over the last two decades, Dr. Ryu
has accumulated a wealth of successful experi-
ences on high-risk high-pay-off R&D programs
sponsored by DARPA, ONR, AFRL, and various
Department of Defense agencies. Before founding
EpiSys Science, Inc. in 2013, he served in various
technical positions at Hughes, Boeing, San Diego
Research Center, and Argon ST. He was respon-
sible for spearheading internal research projects,
pursuing new government programs, and perform-
ing various government and industry-sponsored projects in the area of self-
organizing wireless networking systems. During his time at EpiSci, he has
been the PI of over $20M of R&D projects from government agencies.
He has authored and co-authored more than 40 publications and holds
twelve U.S. patents. He received two performance awards for his technical
achievements on DARPA’s Adaptive C4ISR Node program and is a recipient
of a Meritorious Award from Raytheon in 2001 for technical performance
recognition. He received his Ph.D. from Columbia University.