A Probabilistic Machine Learning Approach to
Scheduling Parallel Loops with
Bayesian Optimization
Kyurae Kim, Student Member, IEEE, Youngjae Kim, Member, IEEE, and Sungyong Park, Member, IEEE
Abstract—This paper proposes Bayesian optimization augmented factoring self-scheduling (BO FSS), a new parallel loop scheduling
strategy. BO FSS is an automatic tuning variant of the factoring self-scheduling (FSS) algorithm and is based on Bayesian optimization
(BO), a black-box optimization algorithm. Its core idea is to automatically tune the internal parameter of FSS by solving an optimization
problem using BO. The tuning procedure only requires online execution time measurement of the target loop. In order to apply BO, we
model the execution time using two Gaussian process (GP) probabilistic machine learning models. Notably, we propose a
locality-aware GP model, which assumes that the temporal locality effect resembles an exponentially decreasing function. By
accurately modeling the temporal locality effect, our locality-aware GP model accelerates the convergence of BO. We implemented BO
FSS on the GCC implementation of the OpenMP standard and evaluated its performance against other scheduling algorithms. Also, to
quantify our method’s performance variation on different workloads, or workload-robustness in our terms, we measure the minimax
regret. According to the minimax regret, BO FSS shows more consistent performance than other algorithms. Within the considered
workloads, BO FSS improves the execution time of FSS by as much as 22%, and by 5% on average.
Index Terms—Parallel Loop Scheduling, Bayesian Optimization, Parallel Computing, OpenMP
1 INTRODUCTION
LOOP parallelization is the de-facto standard method for
performing shared-memory data-parallel computation.
Parallel computing frameworks such as OpenMP [1] have
enabled the acceleration of advances in many scientific and
engineering fields such as astronomical physics [2], climate
analytics [3], and machine learning [4]. A major challenge
in enabling efficient loop parallelization is to deal with the
inherent imbalance in workloads [5]. Under the presence
of load imbalance, some computing units (CU) might end
up remaining idle for a long time, wasting computational
resources. It is thus critical to schedule the tasks to CUs
efficiently.
Early on, dynamic loop scheduling algorithms [6], [7],
[8], [9], [10], [11], [12] have emerged to attack the parallel
loop scheduling problem. However, these algorithms ex-
ploit a limited amount of information about the workloads,
resulting in inconsistent performance [13]. In our terms, they lack workload-robustness, as their performance varies across workloads.
K. Kim is with the Department of Electronics Engineering, Sogang University, Seoul, Republic of Korea. E-mail: msca8h@sogang.ac.kr
Y. Kim and S. Park are with the Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea. E-mail: {youkim, parksy}@sogang.ac.kr
Manuscript received June 26, 2020; revised August 21, 2020; accepted September 7, 2020. This work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (2017M3C4A7080245). This paper was presented in part at the 27th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’19) held in Rennes, France, 2019. (Corresponding author: Sungyong Park.)

Meanwhile, workload-aware scheduling methods have recently emerged. These methods, including the history-aware
self-scheduling (HSS, [14]) and bin-packing longest pro-
cessing time (BinLPT, [15], [16]) algorithms, utilize the static
imbalance information of workloads. Static imbalance is an
imbalance inherent to the workload that is usually caused
by algorithmic variations. Unlike dynamic imbalance, which
is caused by the environment of execution, static imbalance
can sometimes be accurately estimated before execution. In
these cases, workload-aware methods aim to exploit the
static imbalance information for scheduling. On the other
side of the coin, these methods are inapplicable when the
static imbalance information, or a workload-profile, is not
provided. In many high-performance computing (HPC) ap-
plications, static imbalance is often avoided by design. Even
if such imbalance is present, it is usually unknown unless
extensive profiling is performed. As a result, workload-aware methods can only be applied to a limited range of workloads, or are challenging to use at best.
As discussed, both dynamic and workload-aware meth-
ods have limitations. Thus, additional efforts must be made
to find the algorithm best-suited for a particular workload.
Practitioners often need to try out different scheduling
algorithms and manually tune them for the best perfor-
mance, which is tedious and time-consuming. To resolve the
issues of dynamic and workload-aware scheduling meth-
ods, we propose Bayesian optimization augmented factoring
self-scheduling (BO FSS), a workload-robust parallel loop
scheduling algorithm. BO FSS automatically infers proper-
ties of the target loop only using its execution time measure-
ments. Since BO FSS doesn’t rely on a workload-profile, it
applies to a wide range of workloads.
In this paper, we first show that it is possible to achieve
robust performance if we are able to appropriately tune the
internal parameters of a classic scheduling algorithm to each
workload individually. Based on this observation, BO FSS
tunes the parameter of factoring self-scheduling (FSS, [7]), a
classic dynamic scheduling algorithm, only using execution
time measurements of the target loop. This is achieved
by solving an optimization problem using a black-box
global optimization algorithm called Bayesian optimization
(BO, [17]). BO is notable for being data efficient; it requires a
minimal number of measurements until convergence [18].
It is also able to efficiently handle the presence of noise
in the measurements. These properties previously led to
successful applications such as compiler optimization flag
selection [19], garbage collector tuning [20], and cloud con-
figuration selection [18]. Based on these properties of BO,
our system is able to improve scheduling efficiency with
a minimal number of repeated executions of the target
workload.
To apply BO, we need to provide a surrogate model that
accurately describes the relationship between the schedul-
ing algorithm’s parameter and the resulting execution time.
By extending our previous work in [21], we propose two
types of probabilistic machine learning models as surro-
gates. First, we model the total execution time contribution
of a loop as Gaussian processes (GP). Second, for workloads
where the loops are executed multiple times in a single run,
we propose a locality-aware GP model. Based on the assump-
tion that the temporal locality effect resembles exponentially
decreasing functions, our locality-aware GP can accurately
model the execution time using exponentially decreasing
function kernels from [22]. As a result, it is able to achieve faster convergence of BO when applicable.
We implemented BO FSS as well as other classic scheduling algorithms such as chunk self-scheduling (CSS, [6]), FSS [7], trapezoid self-scheduling (TSS, [8]), and tapering self-scheduling (TAPER, [10]) on the GCC implementation [23] of the OpenMP parallelism framework. Then, we evaluate
BO FSS against these classical algorithms and workload-
aware methods including HSS and BinLPT. To quantify and
compare the robustness of BO FSS, we adopt the minimax
regret metric [24], [25]. We selected workloads from the Ro-
dinia 3.1 [26] and GAP [27] benchmark suites for evaluation.
Results show that our method outperforms other scheduling algorithms, improving the execution time of FSS by as much as 22%, and by 5% on average. In terms of workload-robustness, BO FSS achieves a regret of 22.34, which is the lowest among the considered methods.
The key contributions of this paper are as follows:
• We show that, when appropriately tuned, FSS can achieve workload-robust performance (Section 2). In contrast, the performance of dynamic scheduling and workload-aware methods varies across workloads.
• We apply BO to tune the internal parameter of FSS (Section 3). Results show that BO FSS achieves consistently good performance across workloads (Section 5).
• We propose to model the temporal locality effect of workloads using locality-aware GPs (Section 3.3). Our locality-aware GP incorporates the effect of temporal locality using exponentially decreasing function kernels.
• We implement BO FSS over the OpenMP parallel computing framework (Section 4). Our implementation includes other classic scheduling algorithms used for the evaluation and is publicly available online (see footnote 1).
• We propose to use minimax regret for quantifying the workload-robustness of scheduling algorithms (Section 5). According to the minimax regret criterion, BO FSS shows the most robust performance among the considered algorithms.
2 BACKGROUND AND MOTIVATION
In this section, we start by describing the loop schedul-
ing problem. Then, we show that dynamic scheduling
and workload-aware methods lack what we call workload-
robustness. Our analysis is followed by proposing a strategy
to solve this problem.
2.1 Background
Parallel loop scheduling. Loops in scientific comput-
ing applications are easily parallelizable because of
their embarrassingly-data-parallel nature. A parallel loop
scheduling algorithm attempts to map each task, or iteration, of a loop to CUs. The most basic scheduling strategy, called static scheduling (STATIC), divides the tasks ($T_i$) equally among the CUs at compile time. Usually, a barrier is implied at the end of a loop, forcing all the CUs to wait until all tasks finish computing. If imbalance is present across the tasks, some CUs may complete their computation before others, resulting in many CUs remaining idle. Since execution
time variance is abundant in practice because of control-
flow divergence and inherent noise in modern computer
systems [5], more advanced scheduling schemes are often
required.
Dynamic loop scheduling. Dynamic loop scheduling has
been introduced to solve the inefficiency caused by execution time variance. In dynamic scheduling schemes, each CU self-assigns a chunk of $K$ tasks at runtime by accessing a central task queue whenever it becomes idle. Each queue access incurs a small, constant runtime scheduling overhead. The case where $K = 1$ is called self-scheduling (SS, [28]). SS achieves the minimum amount of load imbalance. However, the amount of scheduling overhead grows proportionally to the number of tasks. Even when the per-access overhead is small, the total scheduling overhead can quickly become overwhelming. The problem
then boils down to finding the optimal tradeoff between
load imbalance and scheduling overhead. This problem has
been mathematically formalized in [6], [29], and a general
review of the problem is provided in [30].
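The tradeoff between load imbalance and scheduling overhead can be illustrated with a small simulation. The following Python sketch is not taken from the paper: the function name `makespan`, the fixed per-access cost `overhead`, and the toy task durations are assumptions made for illustration, and the model ignores contention on the queue.

```python
import heapq

def makespan(durations, n_cus, chunk_size, overhead):
    """Finish time when idle CUs repeatedly take `chunk_size` tasks from a central queue."""
    free_at = [0.0] * n_cus              # time at which each CU becomes idle
    heapq.heapify(free_at)
    for start in range(0, len(durations), chunk_size):
        chunk = durations[start:start + chunk_size]
        t = heapq.heappop(free_at)       # the earliest idle CU takes the next chunk
        heapq.heappush(free_at, t + overhead + sum(chunk))
    return max(free_at)

durations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # toy, imbalanced tasks
for overhead in (0.5, 3.0):
    for chunk_size in (1, 4):
        print(overhead, chunk_size, makespan(durations, 2, chunk_size, overhead))
```

In this toy setting with two CUs, a chunk size of 1 finishes earlier when the per-access overhead is small, while the larger chunk size wins once the overhead grows.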
2.2 Factoring Self-Scheduling
[Figure 1. Panels: (a) Accuracy of load estimation: load proportion (%) of each task ($T_i$) and of each chunk, measured vs. estimated. (b) Low static imbalance: execution time (sec) against parameter value ($\theta$) for Our Solution, FSS Solution, FSS, FAC2, STATIC, and GUIDED. (c) High static imbalance: execution time (sec) against parameter value ($\theta$) for Our Solution, FSS Solution, FSS, FAC2, BinLPT, and HSS.]
Fig. 1. ((a), top) Discrepancy between the workload-profile and actual execution time of the tasks. ((a), bottom) Discrepancy between the load of the chunks created by BinLPT and their actual execution time. (b-c) Effect of the internal parameter ($\theta$) of FSS on a workload with homogeneous tasks ((b), low static imbalance, lavaMD workload) and a workload with non-homogeneous tasks ((c), high static imbalance, pr-journal workload). The value of the parameter suggested by the original FSS algorithm is marked with a blue cross, while the actual optimal solution targeted by our proposed method is marked with a blue star. The error bands are the 95% empirical bootstrap confidence intervals of the execution time mean.

1. Source code available at https://github.com/Red-Portal/bosched

Among many dynamic scheduling algorithms, we focus on the factoring self-scheduling algorithm (FSS, [7]). Instead of using a constant chunk size $K$, FSS uses a chunk size that decreases along the loop execution. At the $i$th batch, the size of the next $P$ chunks, $K_i$, is determined according to
$$R_0 = N, \qquad R_{i+1} = R_i - P K_i, \qquad K_i = \frac{R_i}{x_i P} \tag{1}$$
$$b_i = \frac{P}{2 \sqrt{R_i}} \, \theta \tag{2}$$
$$x_0 = 1 + b_0^2 + b_0 \sqrt{b_0^2 + 4} \tag{3}$$
$$x_i = 2 + b_i^2 + b_i \sqrt{b_i^2 + 4}. \tag{4}$$
where $R_i$ is the number of remaining tasks at the $i$th batch. The parameter $\theta$ in (2) is crucial to the overall performance of FSS. The analysis in [31] indicates that $\theta = \sigma / \mu$ results in the best performance, where $\mu$ and $\sigma^2$ are the mean ($\mathbb{E}[T_i]$) and variance ($\mathbb{V}[T_i]$) of the tasks. However, in Section 2.3, we show that this $\theta$ does not always perform well. Instead, the essence of our work is a strategy to empirically determine a good $\theta$ for each individual workload by solving an optimization problem.
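To make the role of $\theta$ concrete, the recurrence in (1)-(4) can be evaluated directly. The Python sketch below is an illustration rather than the authors' implementation: the function name `fss_chunk_sizes` is hypothetical, and rounding each chunk size up to an integer of at least one task is an assumption that the equations above do not state.

```python
import math

def fss_chunk_sizes(n_tasks, n_cus, theta):
    """Chunk size K_i of every FSS batch, following Eqs. (1)-(4)."""
    sizes = []
    remaining = n_tasks                                    # R_0 = N
    while remaining > 0:
        b = n_cus / (2.0 * math.sqrt(remaining)) * theta   # Eq. (2)
        offset = 1.0 if not sizes else 2.0                 # Eq. (3) for i = 0, Eq. (4) otherwise
        x = offset + b * b + b * math.sqrt(b * b + 4.0)
        # Assumption: round up to an integer of at least one task per chunk.
        k = max(1, math.ceil(remaining / (x * n_cus)))     # Eq. (1)
        sizes.append(k)
        remaining -= min(remaining, n_cus * k)             # R_{i+1} = R_i - P K_i
    return sizes

print(fss_chunk_sizes(1000, 4, 0.0))   # theta = 0: a single batch of N / P tasks per CU
print(fss_chunk_sizes(1000, 4, 1.0))   # theta > 0: chunk sizes shrink batch by batch
```

Under these assumptions, $\theta = 0$ gives $x_0 = 1$, so the first batch hands out all tasks at once, while larger values of $\theta$ produce smaller and more numerous chunks.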
The FAC2 scheduling strategy. Since determining $\mu$ and $\sigma$ requires extensive profiling of the workload, the original authors of FSS suggest an unparameterized heuristic version [7]. This version is often abbreviated as FAC2 in the literature and has been observed to outperform the original FSS [9], [11] despite being a heuristic modification. Again, this observation supports the fact that the analytic solution $\theta = \sigma / \mu$ is neither always the best nor the only good solution.
2.3 Motivation
Limitations of workload-aware methods. The HSS and
BinLPT strategies have significant drawbacks despite being
able to fully incorporate the information about load im-
balance. First, both the HSS and BinLPT methods require
an accurate workload-profile. This is a significant limiting
factor since many HPC workloads are comprised of homo-
geneous tasks where the imbalance is caused dynamically
during runtime. This means there is no static imbalance in
the first place. Also, even if a workload-profile is present, it imposes a runtime memory overhead of $O(N)$ for each loop. For large-scale applications where the task count $N$ is huge, the memory overhead is a significant nuisance.
Moreover, both the HSS and BinLPT algorithms also
have their own caveats. The HSS algorithm has high
scheduling overhead [16]. In Section 5, we observe that
HSS performs well only when high levels of imbalance,
such as in the pr-wiki workload, are present. On the
other hand, BinLPT is highly sensitive to the accuracy of
the workload-profile. In practice, discrepancies between the
actual workload and the workload-profile are inevitable. We
illustrate this fact using the pr-journal graph analytics
workload in the upper plot of Fig. 1a. We estimated the load
of each task using the in-degree of the corresponding vertex
in the graph. The grey region is the estimated load of each
task, while the red region is the measured load. As shown in
the figure, the estimated load does not accurately describe
the actual load. Likewise, the chunks created by BinLPT
using these estimates are equally inaccurate, as shown in
the lower plot of Fig. 1a. If the number of tasks is small, some level of discrepancy may be acceptable. Indeed, the original analysis in [16] considers at most $N = 3074$ tasks. In practice, the number of tasks scales with data, leading to a very large $N$.
Effect of tuning the parameter of FSS. Similarly, classi-
cal scheduling algorithms such as FSS are not workload-
robust [13]. However, we reveal an interesting property
by tuning the parameter (𝜃) of FSS. Fig. 1b and Fig. 1c
illustrate the evaluation results of FSS using the lavaMD
(a workload with low static imbalance) and pr-journal
(a workload with high static imbalance) workloads with
different values of 𝜃, respectively. The solution suggested
in the original FSS algorithm (as discussed in Section 2.2)
is denoted by a blue cross. For the lavaMD workload
(Fig. 1b), this solution is arguably close to the optimal value.
However, for the pr-journal workload (Fig. 1c), it leads
to poor performance. The original FSS strategy is thus not
workload-robust since its performance varies greatly across
workloads.
In contrast, by using an optimal value of $\theta$ (blue star), FSS outperforms all other algorithms as shown in the plots. Even in Fig. 1c where HSS and BinLPT are equipped with an accurate workload-profile, FSS outperforms both methods.
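
Algorithm 1: Bayesian optimization
  Initial dataset $\mathcal{D}_0 = \{(\theta_0, \tau_0), \ldots, (\theta_N, \tau_N)\}$
  for $t \in [1, T]$ do
    1. Fit surrogate model using $\mathcal{D}_t$.
    2. Solve inner optimization problem. $\theta_{t+1} = \arg\max_{\theta} \alpha(\theta, \mathcal{D}_t)$
    3. Evaluate parameter. $\tau_{t+1} \sim T_{total}(S_{\theta_{t+1}})$
    4. Update dataset. $\mathcal{D}_{t+1} \leftarrow \mathcal{D}_t \cup (\theta_{t+1}, \tau_{t+1})$
  end
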
This means that tuning the parameter of FSS on a per-
workload basis can achieve robust performance.
Motivational remarks. Workload-aware methods and clas-
sical dynamic scheduling methods tend to vary in appli-
cability and performance. Meanwhile, classic scheduling
algorithms such as FSS achieve optimal performance when
they are appropriately tuned to the target workload. This
performance potential of FSS points towards the possibility
of creating a novel robust scheduling algorithm.
3 AUGMENTING FACTORING SELF-SCHEDULING
WITH BAYESIAN OPTIMIZATION
In this section, we describe BO FSS, a self-tuning variant of
the FSS algorithm. First, we provide an optimization per-
spective on the loop scheduling problem. Next, we describe
a solution to the optimization problem using BO. Since
solving our problem requires modeling of the execution
time using surrogate models, we describe two ways to
construct surrogate models.
3.1 Scheduling as an Optimization Problem
The main idea of our proposed method is to design an
optimal scheduling algorithm by finding its optimal con-
figurations based on execution time measurements. First,
we define a set of scheduling algorithms $\mathcal{S} = \{S_{\theta_1}, S_{\theta_2}, \ldots\}$ identified by a tunable parameter $\theta$. In our case, $\mathcal{S}$ is the set of configurations of the FSS algorithm with the parameter $\theta$ discussed in Section 2.2. Within this set of configurations, we choose the optimal configuration that minimizes the mean of the total execution time contribution ($T_{total}$) of a parallel loop. This yields the optimization problem
$$\underset{\theta}{\text{minimize}} \;\; \mathbb{E}[T_{total}(S_\theta)]. \tag{5}$$
Problem structure. Now that the optimization problem is formulated, the natural next step is to apply an optimization solver. However, this optimization problem is ill-formed, prohibiting the use of typical solvers. First, the objective function is noisy because of the inherent noise in computer systems. Second, we do not have enough knowledge about the structure of $T$. Different workloads interact differently with scheduling algorithms [13]. It is thus difficult to obtain an analytic model of $T$ that is universally accurate. Moreover, most conventional optimization algorithms require knowledge about the gradient $\nabla_\theta T$, which we do not have.
for (int l = 0; l < L; ++l)
{
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < N; ++i)
{
task(i);
}
}
Fig. 2. Visualization of our execution time models. The execution time of the parallel loop (red bracket) is denoted as $T$, while the execution time of the tasks in the parallel loop (green bracket) is denoted as $T_i$. The outer loop (blue bracket) represents repeated execution ($L$ times) of the parallel loop within the application, where $T_{total}$ is the total execution time contribution of the loop.
Solution using Bayesian optimization. To solve this problem, we leverage Bayesian optimization (BO). We initially attempted to apply other gradient-free optimization methods such as stochastic approximation [32]. However, the noise level in execution time is so extreme that most methods relying on gradient estimates failed to converge. Conveniently, BO has recently been shown to be effective for solving this kind of optimization problem [18], [19], [20]. Compared to other black-box optimization methods, BO requires fewer objective function evaluations and can handle the presence of noise well [18].
Description of Bayesian optimization. The overall flow of BO is shown in Algorithm 1. First, we build a surrogate model of $T_{total}$. Let $(\theta, \tau)$ denote a data point of an observation where $\theta$ is a parameter value, and $\tau$ is the resulting execution time measurement such that $\tau \sim T_{total}$. Based on a dataset of previous observations denoted as $\mathcal{D}_t = \{(\theta_1, \tau_1), \ldots, (\theta_t, \tau_t)\}$, a surrogate model provides a prediction of $T_{total}(\theta)$ and the uncertainty of the prediction. In our context, the prediction and uncertainty are given as the mean of the predictive distribution, denoted as $\mu(\theta \mid \mathcal{D}_t)$, and its variance, denoted as $\sigma^2(\theta \mid \mathcal{D}_t)$.
Using the surrogate model, we now solve what is known as the inner optimization problem. In this step, we choose whether to exploit our current knowledge about the optimal value or to explore entirely new values that we have not tried yet. In the extremes, minimizing $\mu(\theta \mid \mathcal{D}_t)$ gives us the optimal parameter given our current knowledge, while maximizing $\sigma^2(\theta \mid \mathcal{D}_t)$ gives us the parameter we are currently the most uncertain about. The optimal solution is given by a tradeoff between the two ends (often called the exploration-exploitation tradeoff), found by solving the optimization problem
$$\theta_{t+1} = \underset{\theta}{\arg\max} \;\; \alpha(\theta, \mathcal{D}_t) \tag{6}$$
where the function $\alpha$ is called the acquisition function. Based on the predictions and uncertainty estimates of the surrogate model, $\alpha$ returns our utility of trying out a specific value of $\theta$. Evidently, the quality of the prediction and uncertainty estimates of the surrogate model is crucial to the overall performance. By maximizing $\alpha$, we obtain the parameter value that has the highest utility, according to $\alpha$. In this work, we use the max-value entropy search (MES, [33]) acquisition function.
[Figure 3. Panels: (a) Locality effect, $\ell$-axis view: execution time (ms) against the index of loop execution ($\ell$). (b) Locality effect, $\theta$-axis view: execution time (ms) against parameter value ($\theta$) for executions 1~10 and executions 11~30. (c) Samples from the Exp. kernel.]
Fig. 3. (a) (b) Visualization of the temporal locality effect on the execution time of the kmeans workload. (a) $\ell$-axis view. The error bars are the 95% empirical confidence intervals. (b) $\theta$-axis view. The red squares are measurements of earlier executions ($\ell \le 10$) while the blue circles are measurements of later executions ($\ell > 10$). (c) Randomly sampled functions from a GP prior with an exponentially decreasing function kernel.

After solving the inner optimization problem, we obtain the next value to try out, $\theta_{t+1}$. We can then try out this parameter and append the result $(\theta_{t+1}, \tau_{t+1})$ to the dataset. For a comprehensive review of BO, please refer to [17]. We will later explain our OpenMP framework implementation of this overall procedure in Section 4.
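The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a hedged illustration only: the surrogate (`fit_nearest`) and the utility (`crude_utility`) are deliberately crude, hypothetical stand-ins rather than the GP surrogate and MES acquisition used in this work, the inner problem is solved by a plain grid search, and `noisy_time` is a synthetic objective instead of a real loop measurement.

```python
import random

def bayesian_optimization(initial_data, fit, acquisition, evaluate, candidates, n_iters):
    """Control flow of Algorithm 1 with pluggable components."""
    data = list(initial_data)
    for _ in range(n_iters):
        model = fit(data)                                                    # 1. fit surrogate
        theta_next = max(candidates, key=lambda th: acquisition(th, model))  # 2. inner problem
        tau_next = evaluate(theta_next)                                      # 3. run and measure
        data.append((theta_next, tau_next))                                  # 4. update dataset
    return data

# Toy stand-ins (hypothetical; not the GP surrogate or MES acquisition).
def fit_nearest(data):
    return list(data)   # the "model" simply remembers all observations

def crude_utility(theta, model, explore=2.0):
    near_theta, near_tau = min(model, key=lambda d: abs(d[0] - theta))
    # Low observed time nearby (exploitation) plus distance to known points (exploration).
    return -near_tau + explore * abs(near_theta - theta)

rng = random.Random(0)

def noisy_time(theta):
    return (theta - 0.3) ** 2 + 1.0 + rng.gauss(0.0, 0.01)   # synthetic execution time

grid = [i / 20 for i in range(21)]
history = bayesian_optimization(
    [(0.0, noisy_time(0.0)), (1.0, noisy_time(1.0))],
    fit_nearest, crude_utility, noisy_time, grid, n_iters=10)
best_theta, best_tau = min(history, key=lambda d: d[1])
print(best_theta, best_tau)
```

The four numbered comments correspond to the four steps of Algorithm 1; swapping in a GP surrogate and a principled acquisition function changes only the `fit` and `acquisition` arguments.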
3.2 Modeling Execution Time with Gaussian Processes
As previously stated, having a good surrogate model is
essential. Modeling the execution time of parallel programs
has been a classic problem in the field of performance
modeling. It is known that the execution time of parallel programs tends to follow a Gaussian distribution when the execution time variance is not very high [34]. This result follows from the central limit theorem (CLT), which states that the influence of multiple i.i.d. noise sources asymptotically forms a Gaussian distribution. Considering this, we model the total execution time contribution of a loop as
$$T_{total} = \sum_{\ell=1}^{L} T(S_\theta) + \epsilon \tag{7}$$
where $L$ is the total number of times a specific loop is executed within the application, indexed by $\ell$. Following the conclusions of [34], we naturally assume that $\epsilon$ follows a Gaussian distribution. Note that, at this point, we assume $T$ is independent of the index $\ell$. For an illustration of the models used in our discussion, please see Fig. 2.
Gaussian Process formulation. From the dataset $\mathcal{D}_t$, we infer the model of the execution time $T_{total}(\theta)$ using Gaussian processes (GPs). A GP is a nonparametric Bayesian probabilistic machine learning model for nonlinear regression. Unlike parametric models such as polynomial curve fitting and random forest, GPs automatically tune their complexity based on data [35]. Also, more importantly, GPs can naturally incorporate the assumption of additive noise (such as $\epsilon$ in (7)). The prediction of a GP is given as a univariate Gaussian distribution fully described by its mean $\mu(\theta \mid \mathcal{D}_t)$ and variance $\sigma^2(\theta \mid \mathcal{D}_t)$. These are computed in closed form as
$$\mu(\theta \mid \mathcal{D}_t) = \mathbf{k}(\theta)^{\top} (\mathbf{K} + \sigma_\epsilon^2 I)^{-1} \mathbf{y} \tag{8}$$
$$\sigma^2(\theta \mid \mathcal{D}_t) = k(\theta, \theta) - \mathbf{k}(\theta)^{\top} (\mathbf{K} + \sigma_\epsilon^2 I)^{-1} \mathbf{k}(\theta) \tag{9}$$
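where $\mathbf{y} = [\tau_1, \tau_2, \ldots, \tau_t]$, $\mathbf{k}(\theta)$ is a vector-valued function such that $[\mathbf{k}(\theta)]_i = k(\theta, \theta_i)$, $\forall \theta_i \in \mathcal{D}_t$, and $\mathbf{K}$ is the Gram matrix such that $[\mathbf{K}]_{i,j} = k(\theta_i, \theta_j)$, $\forall \theta_i, \theta_j \in \mathcal{D}_t$; $k(x, y)$ denotes the covariance kernel function, which is a design choice. We use the Matern 5/2 kernel, which is computed as
$$k(x, x'; \sigma^2, \rho^2) = \sigma^2 \left(1 + \sqrt{5}\, r + \frac{5}{3} r^2\right) \exp\left(-\sqrt{5}\, r\right) \tag{10}$$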
where
$$r = \frac{\lVert x - x' \rVert^2}{\rho^2}. \tag{11}$$
For a detailed introduction to GP regression, please refer
to [36].
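The closed-form predictions in (8) and (9) are short enough to evaluate by hand. The following pure-Python sketch is illustrative and rests on several assumptions: a zero prior mean as in (8), fixed hyperparameters, a hand-written linear solver in place of a numerical library, and the common Matern 5/2 convention in which $r$ is the distance divided by a length scale `rho` (the scaling in (11) may differ). All function names are hypothetical.

```python
import math

def matern52(x, y, sigma2=1.0, rho=1.0):
    """Matern 5/2 kernel as in Eq. (10); r = |x - y| / rho is an assumed scaling."""
    s = math.sqrt(5.0) * abs(x - y) / rho
    return sigma2 * (1.0 + s + s * s / 3.0) * math.exp(-s)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def gp_posterior(theta, data, noise_var=1e-2):
    """Predictive mean and variance at theta given data = [(theta_i, tau_i), ...]."""
    thetas = [d[0] for d in data]
    y = [d[1] for d in data]
    n = len(data)
    # Gram matrix K plus noise on the diagonal, i.e. K + sigma_eps^2 I.
    gram = [[matern52(thetas[i], thetas[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    k_star = [matern52(theta, t) for t in thetas]
    mean = sum(k * w for k, w in zip(k_star, solve(gram, y)))        # Eq. (8)
    var = matern52(theta, theta) - sum(
        k * w for k, w in zip(k_star, solve(gram, k_star)))          # Eq. (9)
    return mean, var

observations = [(0.2, 1.0), (0.5, 0.8), (0.8, 3.0)]   # toy (theta, tau) pairs
print(gp_posterior(0.5, observations))    # near a data point: small variance
print(gp_posterior(5.0, observations))    # far from the data: variance near the prior
```

Near an observation the variance is small, and far from all observations the prediction falls back to the prior, which is the behavior the acquisition function relies on.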
Non-Gaussian noise. Despite the remarks in [34] saying that the noise in parallel programs mostly follows Gaussian distributions, we experienced cases where the execution time of individual parallel loops did not quite follow Gaussian distributions. For example, occasional L2 and L3 cache misses result in large deviations, or outliers, in execution time. To correctly model these events, it is advisable to use heavy-tailed distributions such as the Student's t. More advanced methods for dealing with such outliers are described in [37] and [38]. However, to narrow the scope of our discussion, we stay within the Gaussian assumption.
3.3 Modeling with Locality-Aware Gaussian Processes
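Until now, we only considered acquiring samples of $T_{total}$ by summing our measurements of $T$. For the case where the parallel loop in question is executed more than once (that is, $L > 1$), we acquire $L$ observations of $T$ in a single run of the workload. By exploiting our model’s structure in (7), it is possible to utilize all $L$ samples instead of aggregating them into a single one. Since the sum of independent Gaussian random variables is also Gaussian, we can decompose the distribution of $T_{total}$ such that
$$T_{total} = \sum_{\ell=1}^{L} T(S_\theta, \ell) \tag{12}$$
$$\sim \sum_{\ell=1}^{L} \mathcal{N}\big(\mathbb{E}[T(S_\theta, \ell)],\; \mathbb{V}[T(S_\theta, \ell)]\big) \tag{13}$$
$$= \mathcal{N}\Big(\sum_{\ell=1}^{L} \mathbb{E}[T(S_\theta, \ell)],\; \sum_{\ell=1}^{L} \mathbb{V}[T(S_\theta, \ell)]\Big) \tag{14}$$
$$\approx \mathcal{N}\Big(\sum_{\ell=1}^{L} \mu(\theta, \ell \mid \mathcal{D}_t),\; \sum_{\ell=1}^{L} \sigma^2(\theta, \ell \mid \mathcal{D}_t)\Big). \tag{15}$$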
[Figure 4. Diagram components: BO FSS offline tuner (offline); disk (.json) holding the dataset and the parameter value to try out next; OpenMP runtime with runtime interface, loop scheduler, and scheduling algorithm (FSS) (online); loops in the workload annotated with `#pragma omp parallel for schedule(runtime)`. Arrow labels: loop id, scheduling parameter, scheduling, execution time measurement. Steps are numbered 1 to 4.]
Fig. 4. System overview of BO FSS. Online denotes the time we are actually executing the workload, while offline denotes the time we are not executing the workload. For a detailed description, refer to the text in Section 4.
Note the dependence of $T$ on the index of execution $\ell$. From (14), we can retrieve $T_{total}$ from the mean ($\mathbb{E}[T(S_\theta, \ell)]$) and variance ($\mathbb{V}[T(S_\theta, \ell)]$) estimates of $T$, which are given by modeling $T$ using GPs as denoted in (15).
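The aggregation in (14) and (15) amounts to summing the per-execution means and variances. A minimal Python sketch follows, assuming the per-execution predictions are already available as (mean, variance) pairs; the function name and the numbers are illustrative, and, as in (14), covariances between executions are not included.

```python
def total_time_distribution(per_execution):
    """Mean and variance of T_total from per-execution (mean, variance) pairs."""
    mean = sum(m for m, _ in per_execution)   # sum of means over l = 1..L
    var = sum(v for _, v in per_execution)    # sum of variances over l = 1..L
    return mean, var

# Toy predictions for L = 3 executions of the same loop.
predictions = [(3.0, 0.5), (2.0, 0.25), (1.0, 0.25)]
print(total_time_distribution(predictions))
```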
Temporal locality effect. However, this is not as simple as assuming that all $L$ measurements of $T$ are independent (ignoring the argument $\ell$ of $T$). The execution time distribution of a loop changes dramatically within a single application run because of the temporal locality effect. This is shown in Fig. 3 using measurements of a loop in the kmeans benchmark. In Fig. 3a, it is clear that earlier executions of the loop ($\ell \le 10$) are much longer than the later executions ($\ell > 10$). Also, different moments of execution are affected differently by $\theta$, as shown in Fig. 3b. It is thus necessary to accurately model the effect of $\ell$ to better distinguish the effect of $\theta$.
Exponentially decreasing function kernel. To model the
temporal locality effect, we expand our GP model to include
the index of execution $\ell$. Now, the model is a 2-dimensional GP receiving $\ell$ and $\theta$. Within the workloads we consider, the temporal locality effect shows an exponentially decreasing tendency. We thus assume that the locality effect can be represented with exponentially decreasing functions (Exp.) of the form $e^{-\lambda \ell}$. The kernel for these functions has been introduced in [22] for modeling the learning curves of machine learning algorithms. The exponentially decreasing function kernel is computed as
$$k(\ell, \ell') = \frac{\beta^\alpha}{(\ell + \ell' + \beta)^\alpha}. \tag{16}$$
Random functions sampled from the space induced by the Exp. kernel are shown in Fig. 3c. Notice the similarity of the sampled functions and the visualized locality effect in Fig. 3a. Modeling more complex locality effects such as periodicity can be achieved by combining additional kernels. An automatic procedure for doing this is described in [39].
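Equation (16) translates directly into code. The Python sketch below uses a hypothetical function name and arbitrary default values for $\alpha$ and $\beta$; it only illustrates how the covariance shrinks as the execution indices grow.

```python
def exp_decay_kernel(l1, l2, alpha=1.0, beta=1.0):
    """Exponentially decreasing function kernel of Eq. (16)."""
    return beta ** alpha / (l1 + l2 + beta) ** alpha

# Prior variance k(l, l) for a few execution indices: it decreases with l,
# so later executions are expected to vary less than early ones.
for l in (1, 5, 10, 30):
    print(l, exp_decay_kernel(l, l))
```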
Kernel of locality-aware GPs. Since the sum of covariance kernels is also a valid covariance kernel [36], we define our 2-dimensional kernel as
$$k(x, x') = k_{\text{Matern}}(\theta, \theta') + k_{\text{Exp}}(\ell, \ell') \tag{17}$$
$$\text{where} \quad x = [\theta, \ell], \quad x' = [\theta', \ell']. \tag{18}$$
Intuitively, this definition implies that we assume the effect of scheduling (resulting from $\theta$) and locality (resulting from $\ell$) to be superimposed (additive).
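The additive construction in (17) and (18) can be sketched as follows. The function names and hyperparameter defaults are hypothetical, and the Matern 5/2 term uses the common convention of distance divided by a length scale, which is an assumption rather than the exact scaling of (11).

```python
import math

def matern52(a, b, sigma2=1.0, rho=1.0):
    """Matern 5/2 kernel on the scheduling parameter theta (assumed scaling)."""
    s = math.sqrt(5.0) * abs(a - b) / rho
    return sigma2 * (1.0 + s + s * s / 3.0) * math.exp(-s)

def exp_decay(l1, l2, alpha=1.0, beta=1.0):
    """Exponentially decreasing function kernel on the execution index, Eq. (16)."""
    return beta ** alpha / (l1 + l2 + beta) ** alpha

def locality_aware_kernel(x, x_prime):
    """Eq. (17): sum of a kernel on theta and a kernel on the execution index."""
    (theta, l), (theta_p, l_p) = x, x_prime
    return matern52(theta, theta_p) + exp_decay(l, l_p)

# Same theta, different execution indices: only the locality term changes.
print(locality_aware_kernel((0.3, 1), (0.3, 1)))
print(locality_aware_kernel((0.3, 1), (0.3, 20)))
```

Because the two terms are added, a change in the execution index shifts the covariance by the same amount regardless of the value of $\theta$, which is the superposition assumption stated above.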
Reducing computational cost. The computational complexity of computing a GP is in $O(T^3)$, where $T$ represents the total number of BO iterations. The locality-aware construction uses all the independent loop executions, resulting in a computational complexity in $O((LT)^3)$. To reduce the computational cost, we subsample data along the axis of $\ell$ by using every $k$th measurement of the loop, such that $\ell \in \{1, k+1, 2k+1, \ldots, L\}$. As a result, the computational complexity is reduced by a constant factor such that it is in $O\big((\tfrac{L}{k} T)^3\big)$. In all of our experiments, we use a large value of $k$ so that $L/k = 4$.
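The subsampling step and its effect on the cubic cost can be sketched as follows. The sizes below are hypothetical and chosen so that $L/k = 4$; the helper names are not from the paper, and the index set produced here keeps every $k$th execution starting from the first, which does not necessarily include $L$ itself.

```python
def subsample_executions(n_executions, stride):
    """Indices l = 1, k+1, 2k+1, ... (1-based) that are kept for the GP."""
    return list(range(1, n_executions + 1, stride))

def gp_cost(n_points):
    """Cubic cost model for exact GP inference on n_points observations."""
    return n_points ** 3

L, T, k = 36, 20, 9                  # hypothetical sizes with L / k = 4
kept = subsample_executions(L, k)
print(kept)                                          # executions that enter the dataset
print(gp_cost(L * T) // gp_cost(len(kept) * T))      # cost reduction factor
```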
3.4 Treatment of Gaussian Process Hyperparameters
GPs have multiple hyperparameters that need to be
predetermined. The suitability of these hyperparameters is
directly related to the optimization performance of BO [40].
Unfortunately, whether a set of hyperparameters is
appropriate depends on the characteristics of the workload.
Since real-life workloads are very diverse, it is essential to
handle these parameters automatically.
The Matern 5/2 kernel has two hyperparameters $\rho$ and
$\sigma$, while the exponentially decreasing function kernel has
two hyperparameters $\alpha$ and $\beta$. GPs also have
hyperparameters themselves, the function mean $\mu$ and the noise
variance $\sigma_\epsilon^2$. We denote the hyperparameters using the
concatenation $\phi = [\mu, \sigma_\epsilon, \sigma, \rho, \dots]$.
Since the marginal likelihood $p(\mathcal{D}_t \mid \phi)$ is available in
closed form [36], we can infer the hyperparameters using
type-II maximum likelihood estimation or marginalization
by integrating out the hyperparameters. Marginalization
has empirically been shown to give better optimization
performance in the context of BO [40], [41]. It is performed by
approximating the integral
$$\alpha(x, \mathcal{D}_t) = \int \alpha(x, \phi, \mathcal{D}_t) \, p(\phi \mid \mathcal{D}_t) \, d\phi \quad (19)$$
$$\approx \frac{1}{N} \sum_{\phi_i \sim p(\phi \mid \mathcal{D}_t)} \alpha(x, \phi_i, \mathcal{D}_t), \quad (20)$$
using samples $\phi_i$ from the posterior, where $N$ is the number
of samples. For sampling from the posterior, we use the
No-U-Turn sampler (NUTS, [42]).
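The Monte Carlo approximation in (20) amounts to averaging the acquisition function over posterior samples of the hyperparameters. The sketch below assumes the samples have already been drawn (for example by NUTS); `toy_acquisition` is a hypothetical stand-in, not the acquisition function used in the paper.

```python
def marginalized_acquisition(acq, x, data, phi_samples):
    """Average acq(x, phi_i, data) over posterior samples phi_i, as in Eq. (20)."""
    values = [acq(x, phi, data) for phi in phi_samples]
    return sum(values) / len(values)


def toy_acquisition(x, phi, data):
    """Hypothetical acquisition: any callable of (x, phi, data) would do."""
    return phi * x


# Pretend these three values were sampled from p(phi | D_t).
samples = [1.0, 2.0, 3.0]
value = marginalized_acquisition(toy_acquisition, 2.0, None, samples)
```

The inner optimization then maximizes (or minimizes) this averaged quantity over $x$ instead of an acquisition function evaluated at a single point estimate of $\phi$.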
4 SYSTEM IMPLEMENTATION
We now describe our implementation of BO FSS. Our
implementation is based on the GCC implementation of the
OpenMP 4.5 framework [1], which is illustrated in Fig. 4.

TABLE 1
Benchmark Workloads
Workload  Workload-Profile  Characterization       # Tasks ($N$)  Application Domain  Benchmark Suite
lavaMD    Uninformative^1   N-Body                 8000           Molecular Dynamics  Rodinia 3.1
stream.   No                Dense Linear Algebra   65536          Data Mining         Rodinia 3.1
kmeans    Uninformative^2   Dense Linear Algebra   494020         Data Mining         Rodinia 3.1
srad v1   Uninformative^1   Structured Grid        229916         Image Processing    Rodinia 3.1
nn        Uninformative^1   Dense Linear Algebra   8192           Data Mining         Rodinia 3.1
cc-*      Yes               Sparse Linear Algebra  N/A^3          Graph Analytics     GAP
pr-*      Yes               Sparse Linear Algebra  N/A^3          Graph Analytics     GAP
^1 Uniformly partitioned workload.
^2 Imbalance present only in domain boundaries.
^3 Input data dependent; number of vertices of the input graph.

The overall workflow is as follows:
(0) First, we randomly generate initial scheduling parameters
$\theta_0, \dots, \theta_{N_0}$ using a Sobol quasi-random sequence [43].
(1) During execution, for each loop in the workload, we
schedule the parallel loop using the parameter $\theta_t$. We
measure the resulting execution time of the loop and
acquire a measurement $\tau_t$.
(2) Once we finish executing the workload, we store the pair
$(\theta_t, \tau_t)$ on disk in JSON format.
(3) Then, we run the offline tuner, which loads the dataset
$\mathcal{D}_t$ from disk.
(4) Using $\mathcal{D}_t$, we solve the inner optimization problem
in (6), obtaining the next scheduling configuration $\theta_{t+1}$.
(5) At the subsequent execution of the workload, we set
$t \leftarrow t + 1$ and go back to step (1).
Note that offline refers to the period after we have finished
executing the workload, while online refers to the period
during which we are executing the workload (runtime).
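The hand-off between the online and offline phases can be sketched as a small JSON-backed dataset. This is only an illustration: the file layout (one list of measurements per loop identifier), the file name, and the helper names are assumptions, not the format used by the actual implementation.

```python
import json
import os
import tempfile


def load_dataset(path):
    """Load all stored measurements; an absent file means an empty dataset."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def append_measurement(path, loop_id, theta, tau):
    """Store one (theta, tau) pair for a loop after the workload has finished."""
    dataset = load_dataset(path)
    dataset.setdefault(str(loop_id), []).append({"theta": theta, "tau": tau})
    with open(path, "w") as f:
        json.dump(dataset, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bofss_dataset.json")  # hypothetical file name
    append_measurement(path, loop_id=7, theta=0.5, tau=1.25)
    append_measurement(path, loop_id=7, theta=2.0, tau=1.10)
    demo = load_dataset(path)

# The parameter with the lowest measured execution time so far.
best = min(demo["7"], key=lambda m: m["tau"])
```

An offline tuner would read the same file, fit the surrogate model to the stored pairs, and write back the next parameter to try.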
Implementation of the offline tuner. We implemented the
offline tuner as a separate program written in Julia [44],
which is invoked by the user. When invoked, the tuner
solves the inner optimization problem and stores the results
on disk. For solving the inner optimization problem, we
use the DIRECT algorithm [45] implemented in the NLopt
library [46]. For marginalizing the GP hyperparameters, we
use the AdvancedHMC.jl implementation of NUTS [47].
Search space reparameterization. BO requires the domain
of the parameter to be bounded. However, in the case of
FSS, $\theta$ is not necessarily bounded. As a compromise, we
reparameterized $\theta$ into a fixed domain such that
$$\underset{x}{\text{minimize}} \;\; \mathbb{E}[T_{\mathrm{total}}(S_{\theta(x)})] \quad (21)$$
$$\text{where } \theta(x) = 2^{19x - 10}, \; 0 < x < 1. \quad (22)$$
This also effectively converts the search space to a
logarithmic scale. The reparameterized domain was determined
by empirically investigating feasible values of $\theta$.
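A minimal sketch of the mapping in (22) and its inverse is given below; the function names are hypothetical. It makes the implied range explicit: as $x$ moves over $(0, 1)$, $\theta$ moves over $(2^{-10}, 2^{9})$.

```python
import math


def theta_from_x(x):
    """Map the bounded search variable x in (0, 1) to theta = 2^(19x - 10)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return 2.0 ** (19.0 * x - 10.0)


def x_from_theta(theta):
    """Inverse mapping, useful for placing a known theta in the search space."""
    return (math.log2(theta) + 10.0) / 19.0


theta_mid = theta_from_x(0.5)        # 2^(-0.5)
x_unit = x_from_theta(1.0)           # the x for which theta equals 1
```

Equal steps in $x$ correspond to equal multiplicative steps in $\theta$, which is what searching on a logarithmic scale means here.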
User interface. BO FSS can be selected by setting the
OMP_SCHEDULE environment variable, or by the OpenMP
runtime API as in Listing 1.
Listing 1
Selecting a scheduling algorithm
omp_set_schedule(BO_FSS); // selects BO FSS
Modification of the OpenMP ABI. As previously described,
our system optimizes each loop in the workload indepen-
dently. Naturally, our system requires the identification of
the individual loops within the OpenMP runtime. However,
we encountered a major issue: the current OpenMP ABI
does not provide a way for such identification. Conse-
quently, we had to modify the GCC 8.2 [23] compiler’s
OpenMP code generation and the OpenMP ABI. The mod-
ified GCC OpenMP ABI is shown in Listing 2. During
compilation, a unique token for each loop is generated
and inserted at the end of the OpenMP procedure calls.
Using this token, we store and manage the state of each
loop. Measuring the loop execution time is done by start-
ing the system clock in OpenMP runtime entries such as
GOMP_parallel_runtime_start, and stopping in exits
such as GOMP_parallel_end.
Listing 2
Modified GCC OpenMP ABI
void GOMP_parallel_loop_runtime(void (*fn)(void *), void *data,
                                unsigned num_threads, long start, long end,
                                long incr, unsigned flags, size_t loop_id);
void GOMP_parallel_runtime_start(long start, long end, long incr,
                                 long *istart, long *iend, size_t loop_id);
void GOMP_parallel_end(size_t loop_id);
5 EVALUATION
In this section, we first describe the overall setup of our
experiments. Then, we compare the robustness of BO FSS
against other scheduling algorithms. After that, we evaluate
the performance of our BO augmentation scheme. Lastly, we
directly compare the execution time.
5.1 Experimental Setup
System setup. All experiments are conducted on a single
shared-memory computer with an AMD Ryzen Threadripper
1950X 3.4GHz CPU which has 16 cores and 32
threads with simultaneous multithreading enabled. It also
has 1.5MB of L1 cache, 8MB of L2 cache and 32MB
of last level cache. We use the Linux 5.4.36-lts kernel
with two 16GB DDR4 RAM modules (32GB total). Frequency
scaling is disabled with the cpupower frequency-set
performance setting. We use the GCC 8.3 compiler with
the -O3 and -march=native optimization flags enabled in all
of our benchmarks.

TABLE 2
Minimax Regret of Scheduling Algorithms
The values in the table are the percentage slowdown relative to the best performing algorithm. They can be
interpreted as the opportunity cost of using each algorithm. For more details, refer to the text in Section 5.1.
                       Ours     Static   Workload-Aware    Dynamic
Workload               BO FSS   STATIC   HSS      BinLPT   GUIDED   FSS      CSS      FAC2     TRAP1    TAPER3
lavaMD                 0.00     17.55    n/a      n/a      7.25     3.00     0.36     0.25     10.33    42.64
stream.                0.00     10.79    n/a      n/a      2.39     10.36    1.25     0.68     2.00     2.45
kmeans                 0.00     23.02    n/a      n/a      8.01     17.62    1.50     1.17     2.30     6.41
srad v1                22.34    10.92    n/a      n/a      16.75    11.74    26.03    0.00     16.43    17.61
nn                     4.76     5.06     n/a      n/a      0.00     0.55     7.00     6.06     4.39     5.14
cc-journal             0.00     2.88     66.98    196.63   11.94    2.47     2.98     6.15     3.65     0.66
cc-wiki                0.00     6.94     58.57    154.31   10.37    2.77     6.58     5.29     7.88     5.27
cc-road                0.00     8.57     81.88    251.71   7.19     1.37     1.55     1.23     1.97     1.71
cc-skitter             5.28     2.28     61.69    129.08   3.57     1.03     1.05     1.06     0.73     0.00
pr-journal             0.00     29.66    5.52     66.89    42.93    29.01    29.07    29.17    29.33    28.81
pr-wiki                15.30    45.20    0.00     42.26    85.34    46.99    47.28    46.82    46.53    46.87
pr-road                0.00     0.32     41.65    138.32   6.60     0.41     0.42     0.42     0.40     0.41
pr-skitter             0.00     11.51    23.21    68.91    29.97    11.66    11.21    11.34    12.06    11.26
$\mathcal{R}(S)$       22.34    45.20    81.88    251.71   85.34    46.99    47.28    46.83    46.53    46.87
$\mathcal{R}_{90}(S)$  13.30    28.33    71.75    213.15   40.34    26.73    28.46    25.60    26.75    39.87
BO FSS setup. We run BO FSS for 20 iterations starting from
4 random initial points. All results use the best parameter
found after the aforementioned number of iterations.
Baseline scheduling algorithms. We compare BO FSS
against the FSS [7], CSS [6], TSS [8], GUIDED [48],
TAPER [10], BinLPT [16], and HSS [14] algorithms. We use the
implementation of BinLPT and HSS provided by the authors
of BinLPT (see footnote 2). For the FSS and CSS algorithms, we
estimate the statistics of each workload ($\mu$, $\sigma$) beforehand from 64
executions. The scheduling overhead parameter $h$ is
estimated using the method described in [49]. We use the
default STATIC and GUIDED implementations of the OpenMP
4.5 framework using the static and guided scheduling
flags. For the TSS and TAPER schedules, we follow the
heuristic versions suggested in their original works, denoted
as TRAP1 and TAPER3, respectively.
Benchmark workloads. The workloads considered in our
experiments are summarized in Table 1. We select
workloads from the Rodinia 3.1 benchmark suite [26] (lavaMD,
streamcluster, kmeans, srad v1) where the STATIC
scheduling method performs worse than dynamic
scheduling methods. We also include workloads from the
GAP benchmark suite [27] (cc, pr) where the load is
predictable from the input graph.
Workload-profile availability. We characterize the
workload-profile availability of each workload in the
Workload-Profile column in Table 1. For workloads with
homogeneous tasks (lavaMD, stream., srad v1, nn),
static imbalance does not exist. Most of the imbalance
arises at runtime, rendering a workload-profile
uninformative. On the other hand, the static imbalance of
the kmeans workload is revealed during execution, not
before execution. We thus consider the workload-profile to
be effectively unavailable.
2. Retrieved from https://github.com/lapesd/libgomp
TABLE 3
Input Graph Datasets
                                                  $\deg^-(\mathbf{v})$^1, $\deg^+(\mathbf{v})$^2
Dataset        $|\mathcal{V}|$  $|\mathcal{E}|$   mean     std       max
journal [50]   4.0M             69.36M            17, 17   43, 43    15k, 15k
wiki [51]      3.57M            45.01M            13, 13   33, 250   7k, 187k
road [52]      24.95M           57.71M            2, 2     1, 1      9, 9
skitter [53]   1.70M            22.19M            13, 13   137, 137  35k, 35k
^1 In-degree of each vertex.
^2 Out-degree of each vertex.
Input graph datasets. We organize the graph datasets used
for the workloads from the GAP benchmark suite in Table 3,
acquired from [54]. $\mathcal{V}$ and $\mathcal{E}$ are the vertices and edges
in each graph, respectively. The load of each task $T_i$ in
the cc and pr workloads is proportional to the in-degree
and out-degree of each vertex [55]. We use this degree
information for forming the workload-profiles. Among the
datasets considered, wiki has the most extreme imbalance
while road has the least imbalance [55].
Workload-robustness measure. To quantify the notion
of workload-robustness, we use the minimax regret
measure [25]. The minimax regret quantifies robustness by
calculating the opportunity cost of using an algorithm, computed
as
$$\mathcal{R}(S, w) = \frac{C(S, w) - \min_{S \in \mathcal{S}} C(S, w)}{\min_{S \in \mathcal{S}} C(S, w)} \times 100 \quad (23)$$
$$\mathcal{R}(S) = \max_{w \in \mathcal{W}} \mathcal{R}(S, w) \quad (24)$$
where $C(S, w)$ is the cost of the scheduling algorithm $S$ on
the workload $w$, $\mathcal{S}$ is the set of scheduling algorithms
considered, and $\mathcal{W}$ is our set of workloads. We choose
$C(S, w)$ to be the execution time. In this case, $\mathcal{R}(S, w)$ can be
interpreted as the slowdown relative to the best performing
algorithm in percentages. Also, $\mathcal{R}(S)$ is the worst-case regret
of using $S$ on the set of workloads $\mathcal{W}$. Note that among
different robustness measures, the minimax regret is very
pessimistic [24], emphasizing worst-case performance. For
this reason, we additionally consider the 90th percentile of
the minimax regret, denoted as $\mathcal{R}_{90}(S)$.

Fig. 5. Execution time comparison of BO FSS, FSS and FAC2. We estimate the mean execution time from 256 executions. The error bars show
the 95% bootstrap confidence intervals. The results are normalized by the mean execution time of BO FSS. The methods with the lowest execution
time are marked with a star (*). Methods not significantly different from the best performing method are also marked with a star (Wilcoxon signed
rank test, 1% null-hypothesis rejection threshold).
[Plot: relative execution time (y-axis) of BO FSS (proposed), FSS, and FAC2 on the lavaMD, stream., kmeans, srad v1, nn, cc-journal, cc-wiki, cc-road, cc-skitter, pr-journal, pr-skitter, pr-wiki, and pr-road workloads.]
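The regret measures in (23) and (24) can be sketched in a few lines. The execution times in `example_times` are made-up numbers for illustration, while `bo_fss_regrets` is the BO FSS column of Table 2. The percentile rule is an assumption: the text does not state which interpolation is used, but linear interpolation between order statistics reproduces the tabulated 13.30 for BO FSS.

```python
def regret(times):
    """Percentage slowdown of each algorithm relative to the best one, Eq. (23).

    times maps an algorithm name to its execution time on a single workload.
    """
    best = min(times.values())
    return {s: (c - best) / best * 100.0 for s, c in times.items()}


def minimax_regret(regrets):
    """Worst-case regret of one algorithm over all workloads, Eq. (24)."""
    return max(regrets)


def percentile_linear(values, q):
    """q-th percentile with linear interpolation (assumed rule, see lead-in)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


example_times = {"A": 2.0, "B": 2.5}  # hypothetical execution times
example_regret = regret(example_times)

# Regrets of BO FSS over the 13 workloads, copied from Table 2.
bo_fss_regrets = [0.00, 0.00, 0.00, 22.34, 4.76, 0.00, 0.00,
                  0.00, 5.28, 0.00, 15.30, 0.00, 0.00]
worst = minimax_regret(bo_fss_regrets)
r90 = percentile_linear(bo_fss_regrets, 90)
```

The maximum is driven entirely by the single worst workload, whereas the 90th percentile discounts that one outlier, which is why both are reported.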
5.2 Evaluation of Workload-Robustness
Table 2 compares the minimax regrets of different
scheduling algorithms with that of BO FSS. Each entry in the table is
the regret subject to the workload and scheduling algorithm,
$\mathcal{R}(S, w)$. The final rows are the minimax regret $\mathcal{R}(S)$ and the
90th percentile minimax regret $\mathcal{R}_{90}(S)$ subject to the
scheduling algorithm. BO FSS achieves the lowest regret both in
terms of minimax regret (22 percentage points) and 90th percentile
minimax regret (13 percentage points). In contrast, both static and
dynamic scheduling methods achieve similar levels of regret.
This observation is consistent with previous findings [13];
none of the classic scheduling methods dominates the others.
It is worth noting that we selected workloads in which
STATIC performs poorly. Our robustness analysis thus only
holds for comparing dynamic and workload-aware
scheduling methods.
Remarks. The results for workload-robustness using the
minimax regret metric show that BO FSS achieves signifi-
cantly lower levels of regret compared to other scheduling
methods. As a result, BO FSS performs consistently well.
Even when BO FSS does not perform the best, its perfor-
mance is within an acceptable range.
5.3 Evaluation of Bayesian Optimization Augmentation
A fundamental part of the proposed method is that BO
FSS improves the performance of FSS by tuning its internal
parameter. In this section, we show how much BO augmen-
tation improves the performance of FSS and its heuristic
variant FAC2. We run BO FSS, FSS, and FAC2 on workloads
with both high and low static imbalances. The results are
shown in Fig. 5. Overall, we can see that BO FSS consistently
outperforms FSS and FAC2 with the exception of srad
v1 and cc-skitter. On workloads with high imbalance
such as pr-journal and pr-wiki, the execution time
improvements are as high as 30%.
Performance degradation on srad v1. Interestingly, BO
FSS does not perform well on two workloads: srad v1
and cc-skitter. While the performance difference in
cc-skitter is marginal, the difference in srad v1 is not.
This phenomenon is due to the large deviations in the
execution time measurements as shown in Fig. 6. That is, large
outliers near $\theta = 0.4$ and $\theta = 0.8$ distorted the GP predictions
(green line). Since GPs assume the noise to be Gaussian, they
are not well suited for this kind of workload. A possible
remedy is to use Student-T processes [37], [38], shown with
the blue line. In Fig. 6, the Student-T process is much less
affected by outliers, resulting in a tighter fit. Nonetheless,
GPs worked consistently well on other workloads.

Fig. 6. Parameter space and surrogate model fit on the srad v1
workload. The colored regions are the 95% predictive credible intervals of
a GP (green region) and a Student-T process (blue region). The red
circles are the data points used to fit both surrogate models.
[Plot: standardized execution time (y-axis) against parameter value $\theta$ (x-axis, 0 to 1); legend: data points, Student-T process, Gaussian process.]

Fig. 7. Convergence plot of the locality-unaware GP and the
locality-aware GP on the skitter workload. We can see the execution time
decreasing as we run BO. We ran BO 30 times with 10 iterations each,
and computed the 95% bootstrap confidence intervals.
Comparison of Gaussian Process Models. We now
compare the simple GP construction in Section 3.2 and the
locality-aware GP construction in Section 3.3. We equip
BO with each of the models, and run the autotuning
process from beginning to end 30 times. The convergence results
are shown in Fig. 7. We can see that the locality-aware
construction converges much more quickly. Note that the shown
results are averages. In the individual results, there are
cases where the locality-unaware version completely fails to
converge within a given budget. We thus suggest using the
locality-aware construction whenever possible. It achieves
consistent results at the expense of additional computation
during tuning.

Fig. 8. Execution time comparison of BO FSS against workload-aware methods. We estimate the mean execution time from 256 executions. The
error bars show the 95% bootstrap confidence intervals. The results are normalized by the mean execution time of BO FSS. The methods with the
lowest execution time are marked with a star (*). Methods not significantly different from the best performing method are also marked with a star
(Wilcoxon signed rank test, 1% null-hypothesis rejection threshold).
[Plot: relative execution time (y-axis) of BO FSS, STATIC, GUIDED, BinLPT, and HSS on the cc-journal, cc-wiki, cc-road, cc-skitter, pr-journal, pr-wiki, pr-road, and pr-skitter workloads.]

Fig. 9. Effect of mismatching the data used for tuning BO FSS and the
data used for execution. The rows are the data used for tuning of BO
FSS, while the columns are the data used for execution. The numbers
represent the percentage slowdown relative to the matched case. Colder
colors represent more slowdown (hotter the better).
[Heat map: performance drop (%) for each pair of tuning-time data (rows) and runtime data (columns) among road-usa, wiki, skitter, and journal.]

Fig. 10. Execution time comparison of BO FSS against dynamic
scheduling methods. We estimate the mean execution time from 256
executions. The error bars show the 95% bootstrap confidence intervals.
The results are normalized by the mean execution time of BO FSS.
The methods with the lowest execution time are marked with a star (*).
Methods not significantly different from the best performing method are
also marked with a star (Wilcoxon signed rank test, 1% null-hypothesis
rejection threshold).
[Plot: relative execution time (y-axis) of BO FSS (proposed), STATIC, GUIDED, CSS, TAPER3, and TRAP1 on the lavaMD, stream., kmeans, srad v1, and nn workloads.]
Remarks. Apart from srad v1, BO FSS performs better
than FSS and FAC2 on most workloads. This indicates that
the Gaussian assumption works fairly well in most cases.
We can conclude that our BO augmentation improves the
performance of FSS on workloads with both high and low
static imbalances. We now turn to how this
improvement compares against other scheduling algorithms.
5.4 Evaluation on Workloads Without Static Imbalance
This section compares the performance of BO FSS against
dynamic scheduling methods on workloads where a
workload-profile is unavailable or uninformative. The
benchmark results are shown in Fig. 10. BO FSS outperforms
all other methods on 3 of the 5 workloads considered.
On the nn workload, the difference between
all methods is insignificant. As discussed in Section 5.3, BO
FSS performs poorly on the srad v1 workload. Note that
the same tuning results are used both for Section 5.3 and
this experiment.
Remarks. Compared to other dynamic scheduling methods,
BO FSS achieves more consistent performance. However,
because of the turbulence in the tuning process, BO FSS
performs poorly on srad v1. It is thus important to ensure
that BO FSS correctly converges to a critical point before
applying it.
5.5 Evaluation on Workloads With Static Imbalance
This section evaluates the performance of BO FSS
against workload-aware methods using workloads with a
workload-profile. The evaluation results are shown in Fig. 8.
Except for the pr-wiki workload, BO FSS dominates all
considered baselines. Because of the large number of tasks,
both the HSS and BinLPT algorithms do not perform well
on these workloads. Meanwhile, the STATIC and GUIDED
strategies are very inconsistent in terms of performance. On
the pr-wiki and pr-journal workloads, both methods
are nearly 30% slower than BO FSS. This means that these
algorithms lack workload-robustness unlike BO FSS.
On the pr-wiki workload, which has the most extreme
level of static imbalance, HSS performs significantly better.
As discussed in Section 2.3, HSS has a very large critical
section, resulting in a large amount of scheduling
overhead. However, on the pr-wiki workload, the inefficiency
caused by load imbalance is so extreme compared to the
inefficiency caused by the scheduling overhead that HSS
gains a relative advantage.
Does the input data affect performance? BO FSS’s
performance is tightly related to the individual properties of
each workload. It is thus interesting to ask how much the
input data of the workload affects the behavior of BO FSS.
To analyze this, we interchange the data used to tune BO
FSS and the data used to measure the performance. If the
input data plays an important role, the discrepancy between
the tuning time data and the runtime data would degrade the
performance. The corresponding results are shown in Fig. 9,
where the entries are the percentage increase in execution
time relative to the matched case. Each row represents the
dataset used for tuning, while each column represents the
dataset used for execution. The anti-diagonal (bottom left
to top right) is the case when the data is matched. The
largest degradation occurs when we use
skitter for tuning and wiki during runtime. Using
journal for tuning and wiki during runtime
also significantly degrades the performance. Overall, the wiki
and road datasets turned out to be the most sensitive to
the match. Since both wiki and road resulted in high
degradation, the amount of imbalance in the data does not
determine how important the match is. However, judging
from the fact that the degradation is at most 1%, we can
conclude that BO FSS is more sensitive to the workload’s
algorithm than to its input data.
Remarks. Compared to the workload-aware methods, BO
FSS performed the best except for one workload, which has
the greatest imbalance. Excluding this extreme case,
the performance benefits of BO FSS are quite large. We also
evaluated the sensitivity of BO FSS to perturbations of the
workload. Results show that BO FSS is not affected much
by changes in the input data of the workload.
5.6 Discussions
Analysis of overhead. BO FSS has specific duties, both
online and offline. When online, BO FSS loads the
precomputed scheduling parameter $\theta_i$, measures the loop execution
time and stores the pair $(\theta_i, \tau_i)$ in the dataset $\mathcal{D}_t$. A storage
memory overhead of $O(T)$, where $T$ is the number of BO
iterations, is required to store $\mathcal{D}_t$. This is normally much
less than the $O(N)$ memory requirement, where $N$ is the
number of tasks, imposed by workload-aware methods.
When offline, BO FSS runs BO using the dataset $\mathcal{D}_t$ and
determines the next scheduling parameter $\theta_{i+1}$. Because
most of the actual work is performed offline, the online
overhead of BO FSS is almost identical to that of FSS. The
offline step is relatively expensive due to the computational
complexity of GPs. Fortunately, BO FSS converges within 10
to 20 iterations for most cases. This allows the computational
cost to stay within a reasonable range.
Limitation. When the target loop is not to be executed for a
significant amount of time, BO FSS does not provide significant
benefits, as it requires time for offline tuning. However, HPC
workloads are often long-running and reused over time. For
this reason, BO FSS should be applicable to many HPC
workloads.
Portability. When solving the optimization problem in (5)
with BO, the target system becomes part of the objective
function. As a result, BO FSS automatically takes into ac-
count the properties of the target system. This fact makes BO
FSS highly portable. At the same time, as the experimental
results of Fig. 9 imply, instead of directly operating on the
full target workload, it should be possible to use much
cheaper proxy workloads for tuning BO FSS.
6 RELATED WORKS
Classical dynamic loop scheduling methods. To improve
the efficiency of dynamic scheduling, many classical
algorithms have been introduced, such as CSS [6], FSS [7], TSS [8],
BOLD [9], TAPER [10] and BAL [11]. However, most of these
classic algorithms are derived in a limited theoretical context
with strict statistical assumptions. One such example is the
i.i.d. assumption imposed on the workload.
Adaptive and workload-aware methods. To resolve this
limitation, adaptive methods have been developed, starting from
the adaptive FSS algorithm [12]. Recently, workload-aware
methods including HSS [14] and BinLPT [15], [16] have been
introduced. These scheduling algorithms explicitly require a
workload-profile before execution and exploit this
knowledge in the scheduling process. On the flip side, this
requirement makes these methods difficult to use in practice
since the exact workload-profile may not always be
available beforehand. In contrast, our proposed method is more
convenient since we only need to measure the execution
time of a loop. Also, the overall concept of our method is
more flexible; it is possible to plug our framework into any
parameterized scheduling algorithm, directly improving its
robustness.
Machine learning based approaches. Machine learning has
yet to see many applications in parallel loop scheduling.
In [56], Wang and O’Boyle use compiler generated features
to train classifiers that select the best-suited scheduling
strategy for a workload. This approach contrasts with ours
since it does not improve the effectiveness of the cho-
sen scheduling algorithm. On the other hand, Khatami et
al. in [57] recently used a logistic regression model for
predicting the optimal chunk size for a scheduling strat-
egy, combining CSS and work-stealing. Similarly, Laberge
et al. [58] propose a machine-learning based strategy for
accelerating linear algebra applications. These supervised-
learning based approaches are limited in the sense that they
are not yet well understood: their performance is dependent
on the quality of the training data. It is unknown how well
these approaches generalize across workloads from different
application domains. In fact, quantifying and improving
generalization is still a central problem in supervised learn-
ing. Our method is free of these issues since we directly
optimize the performance for a target workload.
7 CONCLUSION
In this paper, we have presented BO FSS, a data-driven,
adaptive loop scheduling algorithm based on BO. The
proposed approach automatically tunes its performance to
the workload using execution time measurements. Also,
unlike the scheduling algorithms that are inapplicable to
some workloads, our approach is generally applicable. We
implemented our method on the OpenMP framework and
quantified its performance as well as its robustness on
realistic workloads. BO FSS has consistently performed well
on a wide range of real workloads, showing that it is
robust compared to other loop scheduling algorithms. Our
approach motivates the development of computer systems
that can automatically adapt to the target workload.
At the moment, BO FSS assumes that the properties
of the workload do not change during execution. For this
reason, BO FSS does not address some crucial scientific
workloads, such as adaptive mesh refinement methods.
These types of workloads dynamically change during ex-
ecution, depending on the computation results. It would be
interesting to investigate automatic tuning-based schedul-
ing algorithms that can target such types of workloads in
the future.
ACKNOWLEDGMENTS
The authors would like to thank the reviewers for providing
precious comments enriching our work, Pedro Henrique
Penna for the helpful discussions about the BinLPT schedul-
ing algorithm, Myoung Suk Kim for his insightful comments
about our statistical analysis and Rover Root for his helpful
comments about the scientific workloads considered in this
work.
REFERENCES
[1] L. Dagum and R. Menon, “OpenMP: An industry standard API
for shared-memory programming,” IEEE Comput. Sci. Eng., vol. 5,
no. 1, pp. 46–55, Jan.–Mar. 1998.
[2] J. Regier, K. Pamnany, K. Fischer, A. Noack, M. Lam, J. Revels,
S. Howard, R. Giordano, D. Schlegel, J. McAuliffe, R. C. Thomas,
and Prabhat, “Cataloging the visible universe through Bayesian
inference at petascale,” in Proc. Int. Parallel Distrib. Process. Symp.,
ser. IPDPS’18, 2018, pp. 44–53.
[3] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr,
E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica,
et al., “Exascale deep learning for climate analytics,” in Proc. Int.
Conf. High Perform. Comput. Networking, Storage, Anal., ser. SC ’18,
2018.
[4] A. G. Baydin, L. Shao, W. Bhimji, L. Heinrich, L. Meadows, J. Liu,
A. Munk, S. Naderiparizi, B. Gram-Hansen, G. Louppe, M. Ma,
X. Zhao, P. Torr, V. Lee, K. Cranmer, Prabhat, and F. Wood, “Etalu-
mis: Bringing probabilistic programming to scientific simulators at
scale,” in Proc. Int. Conf. High Perform. Comput. Networking, Storage,
Anal., ser. SC’19, Denver Colorado, Nov. 2019, pp. 1–24.
[5] D. Durand, T. Montaut, L. Kervella, and W. Jalby, “Impact of
memory contention on dynamic scheduling on NUMA multipro-
cessors,” IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 11, pp. 1201–
1214, Nov. 1996.
[6] C. P. Kruskal and A. Weiss, “Allocating independent subtasks on
parallel processors,” IEEE Trans. Softw. Eng., vol. SE-11, no. 10, pp.
1001–1016, Oct. 1985.
[7] S. F. Hummel, E. Schonberg, and L. E. Flynn, “Factoring: A method
for scheduling parallel loops,” Commun. ACM, vol. 35, no. 8, pp.
90–101, Aug. 1992.
[8] T. H. Tzen and L. M. Ni, “Trapezoid self-scheduling: A practical
scheduling scheme for parallel compilers,” IEEE Trans. Parallel
Distrib. Syst., vol. 4, no. 1, pp. 87–98, Jan. 1993.
[9] T. Hagerup, “Allocating independent tasks to parallel processors:
An experimental study,” J. Parallel Distrib. Comput., vol. 47, no. 2,
pp. 185–197, Dec. 1997.
[10] S. Lucco, “A dynamic scheduling method for irregular parallel
programs,” in Proc. ACM SIGPLAN 1992 Conf. Program. Lang. Des.
Implementation, ser. PLDI ’92. New York, NY, USA: ACM, 1992,
pp. 200–211.
[11] H. Bast, “On scheduling parallel tasks at twilight,” Theory Comput.
Syst., vol. 33, no. 5-6, pp. 489–563, Dec. 2000.
[12] I. Banicescu and V. Velusamy, “Load balancing highly irregular
computations with the adaptive factoring,” in Proc. Int. Parallel
Distrib. Process. Symp., ser. IPDPS’02, Ft. Lauderdale, FL, 2002, p.
12 pp.
[13] F. M. Ciorba, C. Iwainsky, and P. Buder, “OpenMP loop scheduling
revisited: Making a case for more schedules,” in Evolving OpenMP
for Evolving Architectures, ser. IWOMP’18. Springer, 2018, pp. 21–
36.
[14] A. Kejariwal, A. Nicolau, and C. D. Polychronopoulos, “History-
aware self-scheduling,” in Proc. Int. Conf. Parallel Process., ser.
ICPP’06. IEEE, 2006.
[15] P. H. Penna, M. Castro, P. Plentz, H. Cota de Freitas, F. Broquedis,
and J.-F. Méhaut, “BinLPT: A novel workload-aware loop
scheduler for irregular parallel loops,” in Proc. Simpósio Em Sistemas
Computacionais de Alto Desempenho, Campinas, Brazil, Oct. 2017.
[16] P. H. Penna, A. T. A. Gomes, M. Castro, P. D. M. Plentz,
H. C. Freitas, F. Broquedis, and J.-F. Méhaut, “A comprehensive
performance evaluation of the BinLPT workload-aware loop scheduler,”
Concurrency and Computation: Pract. and Experience, Feb. 2019.
[17] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and
N. de Freitas, “Taking the human out of the loop: A review of Bayesian
optimization,” Proc. IEEE, vol. 104, no. 1, pp. 148–175, Jan. 2016.
[18] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and
M. Zhang, “CherryPick: Adaptively unearthing the best cloud
configurations for big data analytics,” in Proc. 14th USENIX Symp.
Networked Syst. Des. Implementation, ser. NSDI’17. Boston, MA:
USENIX Association, 2017, pp. 469–482.
[19] B. Letham, B. Karrer, G. Ottoni, and E. Bakshy, “Constrained
Bayesian optimization with noisy experiments,” Bayesian Anal.,
vol. 14, no. 2, pp. 495–519, Aug. 2018.
[20] V. Dalibard, M. Schaarschmidt, and E. Yoneki, “BOAT: Building
auto-tuners with structured Bayesian optimization,” in Proc. 26th
Int. Conf. World Wide Web, ser. WWW ’17. Perth, Australia: ACM
Press, 2017, pp. 479–488.
[21] K.-r. Kim, Y. Kim, and S. Park, “Towards robust data-driven par-
allel loop scheduling using Bayesian optimization,” in IEEE 27th
Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst. Rennes,
FR: IEEE, 2019, pp. 241–248.
[22] K. Swersky, J. Snoek, and R. P. Adams, “Freeze-thaw Bayesian
optimization,” arXiv:1406.3896 [cs, stat], Jun. 2014.
[23] GCC, “GCC, the GNU compiler collection,” Jul. 2018.
[24] C. McPhail, H. R. Maier, J. H. Kwakkel, M. Giuliani, A. Castelletti,
and S. Westra, “Robustness Metrics: How Are They Calculated,
When Should They Be Used and Why Do They Give Different
Results?” Earth’s Future, vol. 6, no. 2, pp. 169–191, Feb. 2018.
[25] L. J. Savage, “The theory of statistical decision,” J. Am. Stat. Assoc.,
vol. 46, no. 253, pp. 55–67, 1951.
[26] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and
K. Skadron, “A characterization of the Rodinia benchmark suite
with comparison to contemporary CMP workloads,” in Proc. IEEE
Int. Symp. Workload Characterization, ser. IISWC’10. Atlanta, GA,
USA: IEEE, Dec. 2010, pp. 1–11.
[27] S. Beamer, K. Asanović, and D. Patterson, “The GAP benchmark
suite,” arXiv:1508.03619 [cs], May 2017.
[28] P. Tang and P. C. Yew, “Processor self-scheduling for multiple-
nested parallel loops,” in Proc. Int. Conf. Parallel Process., ser.
ICPP’86. IEEE, Dec. 1986, pp. 528–535.
[29] H. Bast, “Provably optimal scheduling of similar tasks,”
Ph.D. thesis, Universität des Saarlandes, Saarbrücken, 2000.
[30] K. K. Yue and D. J. Lilja, “Parallel loop scheduling for high
performance computers,” Adv. Parallel Comput., vol. 10, pp. 243–
264, 1995.
[31] L. E. Flynn and S. F. Hummel, “Scheduling variable-length parallel
subtasks,” IBM Research T. J. Watson Research Center, Tech. Rep.,
Feb. 1990.
[32] J. C. Spall, “An overview of the simultaneous perturbation method
for efficient optimization,” Johns Hopkins APL Tech. Dig., vol. 19,
no. 4, pp. 482–492, 1998.
[33] Z. Wang and S. Jegelka, “Max-value entropy search for efficient
Bayesian optimization,” in Proc. 34th Int. Conf. Mach. Learn., ser.
ICML’17, vol. 70. JMLR.org, 2017, pp. 3627–3635.
[34] V. S. Adve and M. K. Vernon, “The influence of random delays on
parallel execution times,” in Proc. 1993 ACM SIGMETRICS Conf.
Meas. Model. Comp. Syst., ser. SIGMETRICS’93. New York, NY,
USA: ACM, 1993, pp. 61–73.
[35] C. E. Rasmussen and Z. Ghahramani, “Occam’s razor,” in Adv.
Neural Inf. Process. Syst. 13, ser. NIPS’13. MIT Press, 2001, pp.
294–300.
[36] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Ma-
chine Learning, ser. Adaptive Comput. Mach. Learn. Cambridge,
Mass: MIT Press, 2006.
[37] R. Martinez-Cantin, K. Tee, and M. McCourt, “Practical Bayesian
optimization in the presence of outliers,” arXiv:1712.04567 [cs,
stat], Dec. 2017.
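TABLE 4
Implementation Details of Considered Baselines

Type | Chunk Size Equation | Parameter Setting
CSS [6] | $K = \left( \frac{h \sqrt{2} N}{\sigma P \sqrt{\log P}} \right)^{2/3}$ | $h, \sigma, \mu$ (measured values)
TAPER [10] | $v_\alpha = \alpha \frac{\sigma}{\mu}$, $x_i = \frac{R_i}{P} + \frac{K_{\min}}{2}$, $R_{i+1} = R_i - K_i$, $K_i = \max\left( K_{\min},\; x_i + \frac{v_\alpha^2}{2} - v_\alpha \sqrt{2 x_i + \frac{v_\alpha^2}{4}} \right)$ | $v_\alpha = 3$, $K_{\min} = 1$
TSS [8] | $\delta = \frac{K_f - K_l}{N - 1}$, $K_0 = K_f$, $K_{i+1} = \max(K_i - \delta, K_l)$ | $K_f = \frac{N}{2P}$, $K_l = 1$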
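As a rough illustration of the chunk size rules in Table 4, the following Python sketch computes the chunk sequences they produce. It is not the implementation evaluated in this paper. The function names, the rounding up to whole iterations, the clamping of each chunk to the remaining iterations, and the use of the natural logarithm in CSS are assumptions made for the example. For TSS, the sketch takes N to be the total number of loop iterations in both $\delta$ and $K_f$, which is also an assumption about the table's notation.

```python
import math


def css_chunk_size(n_tasks, n_procs, h, sigma):
    """CSS: one fixed chunk size, K = (sqrt(2) N h / (sigma P sqrt(log P)))^(2/3)."""
    if n_procs < 2 or h <= 0 or sigma <= 0:
        raise ValueError("need P >= 2, h > 0 and sigma > 0")
    ratio = math.sqrt(2) * n_tasks * h / (
        sigma * n_procs * math.sqrt(math.log(n_procs))  # assumption: natural log
    )
    # Assumption: round up and clamp to [1, N].
    return max(1, min(n_tasks, math.ceil(ratio ** (2 / 3))))


def taper_chunks(n_tasks, n_procs, v_alpha=3.0, k_min=1):
    """TAPER: K_i = max(K_min, x_i + v^2/2 - v sqrt(2 x_i + v^2/4)),
    with x_i = R_i / P + K_min / 2 and R_{i+1} = R_i - K_i."""
    if n_procs < 1 or k_min < 1:
        raise ValueError("need P >= 1 and K_min >= 1")
    remaining, chunks = n_tasks, []
    while remaining > 0:
        x = remaining / n_procs + k_min / 2
        k = x + v_alpha**2 / 2 - v_alpha * math.sqrt(2 * x + v_alpha**2 / 4)
        # Assumption: round up, then never exceed what is left.
        size = min(remaining, max(k_min, math.ceil(k)))
        chunks.append(size)
        remaining -= size
    return chunks


def tss_chunks(n_tasks, n_procs, k_last=1):
    """TSS: K_0 = K_f = N / (2P), K_{i+1} = max(K_i - delta, K_l),
    with delta = (K_f - K_l) / (N - 1) read literally from the table."""
    if n_procs < 1 or k_last < 1:
        raise ValueError("need P >= 1 and K_l >= 1")
    k_first = n_tasks / (2 * n_procs)
    delta = (k_first - k_last) / (n_tasks - 1) if n_tasks > 1 else 0.0
    remaining, chunks, k = n_tasks, [], k_first
    while remaining > 0:
        size = min(remaining, max(k_last, math.ceil(k)))
        chunks.append(size)
        remaining -= size
        k = max(k - delta, k_last)
    return chunks


if __name__ == "__main__":
    print(css_chunk_size(1000, 4, h=1.0, sigma=1.0))
    print(taper_chunks(1000, 4)[:5])
    print(tss_chunks(1000, 4)[:5])
```

In this sketch every generated sequence sums to the number of iterations, so the chunks can be handed out to idle workers one at a time until the loop is exhausted.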
[38] A. Shah, A. G. Wilson, and Z. Ghahramani, “Bayesian Optimiza-
tion using Student-t Processes,” in NIPS Workshop on Bayesian
Optimisation, 2013.
[39] D. Duvenaud, J. Lloyd, R. Grosse, J. Tenenbaum, and Z. Ghahramani,
“Structure discovery in nonparametric regression through compo-
sitional kernel search,” in Proc. 30th Int. Conf. Mach. Learn., ser.
ICML’13, vol. 28. Atlanta, Georgia, USA: PMLR, Jun. 2013, pp.
1166–1174.
[40] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani,
“Predictive Entropy Search for Efficient Global Optimization of
Black-box Functions,” in Adv. Neural Inf. Process. Syst. 27, ser.
NIPS’14, 2014, pp. 918–926.
[41] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian
optimization of machine learning algorithms,” in Adv. Neural Inf.
Process. Syst. 25, ser. NIPS’12. USA: Curran Associates Inc., 2012,
pp. 2951–2959.
[42] M. D. Hoffman and A. Gelman, “The No-U-Turn sampler: Adaptively
setting path lengths in Hamiltonian Monte Carlo,” J. Mach.
Learn. Res., vol. 15, no. 47, pp. 1593–1623, 2014.
[43] I. Sobol’, “On the distribution of points in a cube and the approx-
imate evaluation of integrals,” USSR Comput. Math. Math. Phys.,
vol. 7, no. 4, pp. 86–112, Jan. 1967.
[44] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A
fresh approach to numerical computing,” SIAM Rev., vol. 59, no. 1,
pp. 65–98, 2017.
[45] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, “Lipschitzian
optimization without the Lipschitz constant,” J. Optim. Theory Appl.,
vol. 79, no. 1, pp. 157–181, Oct. 1993.
[46] S. G. Johnson, The NLopt Nonlinear-Optimization Package, 2011.
[47] H. Ge, K. Xu, and Z. Ghahramani, “Turing: A language for
flexible probabilistic inference,” in Int. Conf. Artif. Intell. Statist.,
ser. AISTATS’18, 2018, pp. 1682–1690.
[48] C. D. Polychronopoulos and D. J. Kuck, “Guided self-scheduling:
A practical scheduling scheme for parallel supercomputers,” IEEE
Trans. Comput., vol. C-36, no. 12, pp. 1425–1439, Dec. 1987.
[49] J. M. Bull, “Measuring synchronisation and scheduling overheads
in OpenMP,” in Proc. 1st Eur. Workshop OpenMP, ser. IWOMP’99,
1999, pp. 99–105.
[50] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, “Group
formation in large social networks: Membership, growth, and
evolution,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, ser. KDD’06. New York, NY, USA: Association for
Computing Machinery, 2006, pp. 44–54.
[51] D. Gleich, “Wikipedia-20070206,” 2007.
[52] C. Demetrescu, A. Goldberg, and D. Johnson, “9th DIMACS
implementation challenge - shortest paths,” 2006.
[53] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time:
Densification laws, shrinking diameters and possible explana-
tions,” in Proc. 11th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, ser. KDD ’05. New York, NY, USA: Association
for Computing Machinery, 2005, pp. 177–187.
[54] T. A. Davis and Y. Hu, “The University of Florida sparse matrix
collection,” ACM Trans. Math. Softw., vol. 38, no. 1, Dec. 2011.
[55] S. Bak, Y. Guo, P. Balaji, and V. Sarkar, “Optimized execution of
parallel loops via user-defined scheduling policies,” in Proc. 48th
Int. Conf. Parallel Process., ser. ICPP’19. Kyoto, Japan: ACM, Aug.
2019, pp. 1–10.
[56] Z. Wang and M. F. O’Boyle, “Mapping parallelism to multi-cores:
A machine learning based approach,” in Proc. 14th ACM SIGPLAN
Symp. Princ. Pract. Parallel Program., ser. PPoPP’09. New York, NY,
USA: ACM, 2009, pp. 75–84.
[57] Z. Khatami, L. Troska, H. Kaiser, J. Ramanujam, and A. Serio,
“HPX smart executors,” in Proc. 3rd Int. Workshop Extreme Scale
Program. Models Middleware, ser. ESPM2’17, 2017.
[58] G. Laberge, S. Shirzad, P. Diehl, H. Kaiser, S. Prudhomme, and
A. S. Lemoine, “Scheduling optimization of parallel linear algebra
algorithms using supervised learning,” in IEEE/ACM Workshop
Mach. Learn. High Perform. Comput. Environ., ser. MLHPC’19. Den-
ver, CO, USA: IEEE, Nov. 2019, pp. 31–43.
Khu-rai Kim (Student Member, IEEE) is working towards his B.S. degree with the Department of Electronics Engineering, Sogang University, Seoul, South Korea.
His research interests lie in the duality of machine learning and computer systems, including parallel computing, compiler runtime environments, probabilistic machine learning, and Bayesian inference methods.
Youngjae Kim (Member, IEEE) received the B.S. degree in computer science from Sogang University, South Korea, in 2001, the MS degree in computer science from KAIST, in 2003, and the PhD degree in computer science and engineering from Pennsylvania State University, University Park, Pennsylvania, in 2009.
He is currently an associate professor with the Department of Computer Science and Engineering, Sogang University, Seoul, South Korea. Before joining Sogang University, he was an R&D staff scientist with the US Department of Energy’s Oak Ridge National Laboratory (2009–2015) and an assistant professor at Ajou University, Suwon, South Korea (2015–2016). His research interests include operating systems, file and storage systems, parallel and distributed systems, computer systems security, and performance evaluation.
Sungyong Park (Member, IEEE) received the B.S. degree in computer science from Sogang University, Seoul, South Korea, and both the MS and PhD degrees in computer science from Syracuse University, Syracuse, New York.
He is currently a professor with the Department of Computer Science and Engineering, Sogang University, Seoul, South Korea. From 1987 to 1992, he worked for LG Electronics, South Korea, as a research engineer. From 1998 to 1999, he was a research scientist at Bellcore, where he developed network management software for optical switches. His research interests include cloud computing and systems, high performance I/O and storage systems, parallel and distributed systems, and embedded systems.