
A Probabilistic Machine Learning Approach to Scheduling Parallel Loops with Bayesian Optimization

Kyurae Kim, Student Member, IEEE, Youngjae Kim, Member, IEEE, and Sungyong Park, Member, IEEE

Abstract—This paper proposes Bayesian optimization augmented factoring self-scheduling (BO FSS), a new parallel loop scheduling

strategy. BO FSS is an automatic tuning variant of the factoring self-scheduling (FSS) algorithm and is based on Bayesian optimization

(BO), a black-box optimization algorithm. Its core idea is to automatically tune the internal parameter of FSS by solving an optimization

problem using BO. The tuning procedure only requires online execution time measurement of the target loop. In order to apply BO, we

model the execution time using two Gaussian process (GP) probabilistic machine learning models. Notably, we propose a

locality-aware GP model, which assumes that the temporal locality effect resembles an exponentially decreasing function. By

accurately modeling the temporal locality effect, our locality-aware GP model accelerates the convergence of BO. We implemented BO

FSS on the GCC implementation of the OpenMP standard and evaluated its performance against other scheduling algorithms. Also, to

quantify our method’s performance variation on different workloads, or workload-robustness in our terms, we measure the minimax

regret. According to the minimax regret, BO FSS shows more consistent performance than other algorithms. Within the considered

workloads, BO FSS improves the execution time of FSS by as much as 22% and 5% on average.

Index Terms—Parallel Loop Scheduling, Bayesian Optimization, Parallel Computing, OpenMP


1 INTRODUCTION

Loop parallelization is the de facto standard method for

performing shared-memory data-parallel computation.

Parallel computing frameworks such as OpenMP [1] have

enabled the acceleration of advances in many scientiﬁc and

engineering ﬁelds such as astronomical physics [2], climate

analytics [3], and machine learning [4]. A major challenge

in enabling efﬁcient loop parallelization is to deal with the

inherent imbalance in workloads [5]. Under the presence

of load imbalance, some computing units (CU) might end

up remaining idle for a long time, wasting computational

resources. It is thus critical to schedule the tasks to CUs

efﬁciently.

Early on, dynamic loop scheduling algorithms [6], [7],

[8], [9], [10], [11], [12] have emerged to attack the parallel

loop scheduling problem. However, these algorithms ex-

ploit a limited amount of information about the workloads,

resulting in inconsistent performance [13]. In our terms,

they lack workload-robustness, as their performance varies across

workloads.

∙ K. Kim is with the Department of Electronics Engineering, Sogang University, Seoul, Republic of Korea. E-mail: msca8h@sogang.ac.kr
∙ Y. Kim and S. Park are with the Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea. E-mail: {youkim, parksy}@sogang.ac.kr

Manuscript received June 26, 2020; revised August 21, 2020; accepted September 7, 2020. This work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (2017M3C4A7080245). This paper was presented in part at the 27th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'19) held in Rennes, France, 2019. (Corresponding author: Sungyong Park.)

Meanwhile, workload-aware scheduling methods have recently emerged. These methods, including the history-aware

self-scheduling (HSS, [14]) and bin-packing longest pro-

cessing time (BinLPT, [15], [16]) algorithms, utilize the static

imbalance information of workloads. Static imbalance is an

imbalance inherent to the workload that is usually caused

by algorithmic variations. Unlike dynamic imbalance, which

is caused by the environment of execution, static imbalance

can sometimes be accurately estimated before execution. In

these cases, workload-aware methods aim to exploit the

static imbalance information for scheduling. On the other

side of the coin, these methods are inapplicable when the

static imbalance information, or a workload-proﬁle, is not

provided. In many high-performance computing (HPC) ap-

plications, static imbalance is often avoided by design. Even

if such imbalance is present, it is usually unknown unless

extensive profiling is performed. As a result, workload-aware methods can only be applied to a limited range of workloads, or are challenging to use at best.

As discussed, both dynamic and workload-aware meth-

ods have limitations. Thus, additional efforts must be made

to ﬁnd the algorithm best-suited for a particular workload.

Practitioners often need to try out different scheduling

algorithms and manually tune them for the best perfor-

mance, which is tedious and time-consuming. To resolve the

issues of dynamic and workload-aware scheduling meth-

ods, we propose Bayesian optimization augmented factoring

self-scheduling (BO FSS), a workload-robust parallel loop

scheduling algorithm. BO FSS automatically infers proper-

ties of the target loop only using its execution time measure-

ments. Since BO FSS does not rely on a workload-profile, it

applies to a wide range of workloads.

In this paper, we ﬁrst show that it is possible to achieve

robust performance if we are able to appropriately tune the


internal parameters of a classic scheduling algorithm to each

workload individually. Based on this observation, BO FSS

tunes the parameter of factoring self-scheduling (FSS, [7]), a

classic dynamic scheduling algorithm, only using execution

time measurements of the target loop. This is achieved

by solving an optimization problem using a black-box

global optimization algorithm called Bayesian optimization

(BO, [17]). BO is notable for being data efﬁcient; it requires a

minimal number of measurements until convergence [18].

It is also able to efﬁciently handle the presence of noise

in the measurements. These properties previously led to

successful applications such as compiler optimization ﬂag

selection [19], garbage collector tuning [20], and cloud con-

ﬁguration selection [18]. Based on these properties of BO,

our system is able to improve scheduling efﬁciency with

a minimal number of repeated executions of the target

workload.

To apply BO, we need to provide a surrogate model that

accurately describes the relationship between the schedul-

ing algorithm’s parameter and the resulting execution time.

By extending our previous work in [21], we propose two

types of probabilistic machine learning models as surro-

gates. First, we model the total execution time contribution

of a loop as Gaussian processes (GP). Second, for workloads

where the loops are executed multiple times in a single run,

we propose a locality-aware GP model. Based on the assump-

tion that the temporal locality effect resembles exponentially

decreasing functions, our locality-aware GP can accurately

model the execution time using exponentially decreasing

function kernels from [22]. As a result, it achieves faster convergence of BO when applicable.

We implemented BO FSS as well as other classic schedul-

ing algorithms such as chunk self-scheduling (CSS, [6]),

FSS [7], trapezoid self-scheduling (TSS, [8]), tapering self-

scheduling (TAPER, [10]) on the GCC implementation [23]

of the OpenMP parallelism framework. Then, we evaluate

BO FSS against these classical algorithms and workload-

aware methods including HSS and BinLPT. To quantify and

compare the robustness of BO FSS, we adopt the minimax

regret metric [24], [25]. We selected workloads from the Ro-

dinia 3.1 [26] and GAP [27] benchmark suites for evaluation.

Results show that our method outperforms other scheduling

algorithms, improving the execution time of FSS by as much as 22% and by 5% on average.

BO FSS achieves a regret of 22.34, which is the lowest among

the considered methods.

The key contributions of this paper are as follows:

∙We show that, when appropriately tuned, FSS can

achieve workload-robust performance (Section 2). In

contrast, the performance of dynamic scheduling and

workload-aware methods varies across workloads.

∙We apply BO to tune the internal parameter of FSS

(Section 3). Results show that BO FSS achieves consis-

tently good performance across workloads (Section 5).

∙We propose to model the temporal locality effect of

workload using locality-aware GPs (Section 3.3). Our

locality-aware GP incorporates the effect of temporal lo-

cality using exponentially decreasing function kernels.

∙We implement BO FSS over the OpenMP parallel

computing framework (Section 4). Our implementation

includes other classic scheduling algorithms used for

the evaluation and is publicly available online¹.

∙We propose to use minimax regret for quantifying

workload-robustness of scheduling algorithms (Sec-

tion 5). According to the minimax regret criterion, BO

FSS shows the most robust performance among consid-

ered algorithms.

2 BACKGROUND AND MOTIVATION

In this section, we start by describing the loop schedul-

ing problem. Then, we show that dynamic scheduling

and workload-aware methods lack what we call workload-

robustness. Our analysis is followed by proposing a strategy

to solve this problem.

2.1 Background

Parallel loop scheduling. Loops in scientiﬁc comput-

ing applications are easily parallelizable because of

their embarrassingly-data-parallel nature. A parallel loop

scheduling algorithm attempts to map each task, or itera-

tion, of a loop to CUs. The most basic scheduling strategy

called static scheduling (STATIC) divides the tasks (𝑇ᵢ) equally among the CUs at compile time. Usually, a barrier is implied at the end of a loop, forcing all the CUs to wait until all tasks finish computing. If imbalance is present across the tasks, some CUs may complete their computation before others, leaving many CUs idle. Since execution

time variance is abundant in practice because of control-

ﬂow divergence and inherent noise in modern computer

systems [5], more advanced scheduling schemes are often

required.

Dynamic loop scheduling. Dynamic loop scheduling has

been introduced to solve the inefﬁciency caused by execu-

tion time variance. In dynamic scheduling schemes, each

CU self-assigns a chunk of 𝐾 tasks at runtime by accessing

a central task queue whenever it becomes idle. The queue

access causes a small runtime scheduling overhead, denoted

by the constant ℎ. The case where 𝐾 = 1 is called self-

scheduling (SS, [28]). For SS, we can achieve the mini-

mum amount of load imbalance. However, the amount of

scheduling overhead grows proportionally to the number

of tasks. Even for small values of ℎ, the total scheduling

overhead can quickly become overwhelming. The problem

then boils down to ﬁnding the optimal tradeoff between

load imbalance and scheduling overhead. This problem has

been mathematically formalized in [6], [29], and a general

review of the problem is provided in [30].
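To make the imbalance-overhead tradeoff concrete, the following is a minimal C sketch of dynamic self-scheduling with a fixed chunk size 𝐾. The shared atomic counter stands in for the central task queue, and the function names are illustrative, not part of any production runtime:

#include <stdatomic.h>

/* Illustrative sketch: each worker repeatedly self-assigns the next chunk
 * of K tasks from a shared atomic counter until the loop is drained.
 * Each fetch-and-add models one access to the central task queue. */
void worker(atomic_long *next_task, long N, long K, void (*task)(long)) {
    for (;;) {
        long begin = atomic_fetch_add(next_task, K); /* self-assign a chunk */
        if (begin >= N)
            break;                                   /* no tasks remain */
        long end = (begin + K < N) ? begin + K : N;
        for (long i = begin; i < end; ++i)
            task(i);                                 /* execute the tasks */
    }
}

Every fetch-and-add is one scheduling operation of cost roughly ℎ, so a larger 𝐾 trades better load balance for fewer queue accesses.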

2.2 Factoring Self-Scheduling

Among many dynamic scheduling algorithms, we focus on

the factoring self-scheduling algorithm (FSS, [7]). Instead of

using a constant chunk size 𝐾, FSS uses a chunk size that

decreases along the loop execution. At the 𝑖th batch, the size

of the next 𝑃chunks, 𝐾𝑖, is determined according to

1. Source code available at https://github.com/Red-Portal/bosched


Fig. 1. ((a), top) Discrepancy between the workload-profile and the actual execution time of the tasks. ((a), bottom) Discrepancy between the load of the chunks created by BinLPT and their actual execution time. (b, c) Effect of the internal parameter (𝜃) of FSS on a workload with homogeneous tasks ((b), low static imbalance, lavaMD workload) and a workload with non-homogeneous tasks ((c), high static imbalance, pr-journal workload). The value of the parameter suggested by the original FSS algorithm is marked with a blue cross, while the actual optimal solution targeted by our proposed method is marked with a blue star. The error bands are the 95% empirical bootstrap confidence intervals of the execution time mean.

$$R_0 = N, \qquad R_{i+1} = R_i - P K_i, \qquad K_i = \frac{R_i}{x_i P} \tag{1}$$
$$b_i = \frac{P}{2\sqrt{R_i}}\,\theta \tag{2}$$
$$x_0 = 1 + b_0^2 + b_0 \sqrt{b_0^2 + 4} \tag{3}$$
$$x_i = 2 + b_i^2 + b_i \sqrt{b_i^2 + 4}. \tag{4}$$

where 𝑅ᵢ is the number of remaining tasks at the 𝑖th batch. The parameter 𝜃 in (2) is crucial to the overall performance of FSS. The analysis in [31] indicates that 𝜃 = 𝜎/𝜇 results in the best performance, where 𝜇 and 𝜎² are the mean (𝔼[𝑇ᵢ]) and variance (𝕍[𝑇ᵢ]) of the tasks. However, in Section 2.3, we show that this 𝜃 does not always perform well. Instead, the essence of our work is a strategy to empirically determine a good 𝜃 for each individual workload by solving an optimization problem.
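For illustration, the recurrence (1)-(4) can be computed ahead of time as in the following C sketch; the rounding and termination guards are our own illustrative choices rather than the paper's implementation:

#include <math.h>

/* Sketch of the FSS batch-size recurrence in (1)-(4).
 * N: total tasks, P: number of CUs, theta: tunable parameter.
 * chunks[] receives the chunk size K_i of each batch; returns the
 * number of batches. Sizes are rounded up so that the loop drains. */
int fss_chunks(long N, int P, double theta, long *chunks, int max_batches) {
    long R = N;                 /* R_0 = N remaining tasks */
    int i = 0;
    while (R > 0 && i < max_batches) {
        double b = theta * P / (2.0 * sqrt((double)R));        /* (2) */
        double x = (i == 0)
            ? 1.0 + b * b + b * sqrt(b * b + 4.0)              /* (3) */
            : 2.0 + b * b + b * sqrt(b * b + 4.0);             /* (4) */
        long K = (long)ceil(R / (x * P));                      /* (1) */
        if (K < 1) K = 1;
        chunks[i++] = K;
        R -= (long)P * K;                                      /* R_{i+1} */
        if (R < 0) R = 0;
    }
    return i;
}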

The FAC2 scheduling strategy. Since determining 𝜇and

𝜎requires extensive proﬁling of the workload, the original

authors of FSS suggest an unparameterized heuristic ver-

sion [7]. This version is often abbreviated as FAC2 in the

literature and has been observed to outperform the original

FSS [9], [11] despite being a heuristic modiﬁcation. Again,

this observation supports the fact that the analytic solution

𝜃 = 𝜎/𝜇 is neither always the best nor the only good solution.

2.3 Motivation

Limitations of workload-aware methods. The HSS and

BinLPT strategies have signiﬁcant drawbacks despite being

able to fully incorporate the information about load im-

balance. First, both the HSS and BinLPT methods require

an accurate workload-profile. This is a significant limiting factor since many HPC workloads are composed of homo-

geneous tasks where the imbalance is caused dynamically

during runtime. This means there is no static imbalance in

the ﬁrst place. Also, even if a workload-proﬁle is present, it

imposes a runtime memory overhead of 𝑂(𝑁)for each loop.

For large-scale applications where the task count 𝑁is huge,

the memory overhead is a signiﬁcant nuisance.

Moreover, the HSS and BinLPT algorithms each

have their own caveats. The HSS algorithm has high

scheduling overhead [16]. In Section 5, we observe that

HSS performs well only when high levels of imbalance,

such as in the pr-wiki workload, are present. On the

other hand, BinLPT is highly sensitive to the accuracy of

the workload-proﬁle. In practice, discrepancies between the

actual workload and the workload-proﬁle are inevitable. We

illustrate this fact using the pr-journal graph analytics

workload in the upper plot of Fig. 1a. We estimated the load

of each task using the in-degree of the corresponding vertex

in the graph. The grey region is the estimated load of each

task, while the red region is the measured load. As shown in

the ﬁgure, the estimated load does not accurately describe

the actual load. Likewise, the chunks created by BinLPT

using these estimates are equally inaccurate, as shown in

the lower plot of Fig. 1a. If the number of tasks is small, some level of discrepancy may be acceptable. Indeed, the original analysis in [16] considers at most 𝑁 = 3074 tasks. In

practice, the number of tasks scales with data, leading to a

very large 𝑁.

Effect of tuning the parameter of FSS. Similarly, classi-

cal scheduling algorithms such as FSS are not workload-

robust [13]. However, we reveal an interesting property

by tuning the parameter (𝜃) of FSS. Fig. 1b and Fig. 1c

illustrate the evaluation results of FSS using the lavaMD

(a workload with low static imbalance) and pr-journal

(a workload with high static imbalance) workloads with

different values of 𝜃, respectively. The solution suggested

in the original FSS algorithm (as discussed in Section 2.2)

is denoted by a blue cross. For the lavaMD workload

(Fig. 1b), this solution is arguably close to the optimal value.

However, for the pr-journal workload (Fig. 1c), it leads

to poor performance. The original FSS strategy is thus not

workload-robust since its performance varies greatly across

workloads.

In contrast, by using an optimal value of 𝜃(blue star),

FSS outperforms all other algorithms as shown in the plots.

Even in Fig. 1c where HSS and BinLPT are equipped with an

accurate workload-proﬁle, FSS outperforms both methods.


Algorithm 1: Bayesian optimization

Input: initial dataset 𝒟_0 = {(θ_0, τ_0), …, (θ_N, τ_N)}
for t ∈ [1, T] do
    1. Fit surrogate model ℳ using 𝒟_t.
    2. Solve the inner optimization problem: θ_{t+1} = argmax_θ α(θ | ℳ, 𝒟_t).
    3. Evaluate the parameter: τ_{t+1} ∼ T_total(S_{θ_{t+1}}).
    4. Update the dataset: 𝒟_{t+1} ← 𝒟_t ∪ {(θ_{t+1}, τ_{t+1})}.
end

This means that tuning the parameter of FSS on a per-

workload basis can achieve robust performance.

Motivational remarks. Workload-aware methods and clas-

sical dynamic scheduling methods tend to vary in appli-

cability and performance. Meanwhile, classic scheduling

algorithms such as FSS achieve optimal performance when

they are appropriately tuned to the target workload. This

performance potential of FSS points towards the possibility

of creating a novel robust scheduling algorithm.

3 AUGMENTING FACTORING SELF-SCHEDULING WITH BAYESIAN OPTIMIZATION

In this section, we describe BO FSS, a self-tuning variant of

the FSS algorithm. First, we provide an optimization per-

spective on the loop scheduling problem. Next, we describe

a solution to the optimization problem using BO. Since

solving our problem requires modeling of the execution

time using surrogate models, we describe two ways to

construct surrogate models.

3.1 Scheduling as an Optimization Problem

The main idea of our proposed method is to design an

optimal scheduling algorithm by ﬁnding its optimal con-

ﬁgurations based on execution time measurements. First,

we define a set of scheduling algorithms 𝒮 = {𝑆_𝜃₁, 𝑆_𝜃₂, …} identified by a tunable parameter 𝜃. In our case, 𝒮 is the set

of conﬁgurations of the FSS algorithm with the parameter 𝜃

discussed in Section 2.2. Within this set of conﬁgurations,

we choose the optimal conﬁguration that minimizes the mean

of the total execution time contribution (𝑇𝑡𝑜𝑡𝑎𝑙) of a parallel

loop. This problem is now of the form of an optimization

problem denoted as

$$\underset{\theta}{\mathrm{minimize}} \quad \mathbb{E}[T_{\mathrm{total}}(S_\theta)]. \tag{5}$$

Problem structure. Now that the optimization problem

is formulated, we are supposed to apply an optimization

solver. However, this optimization problem is ill-formed,

prohibiting the use of any typical solver. First, the objective

function is noisy because of the inherent noise in computer

systems. Second, we do not have enough knowledge about

the structure of 𝑇. Different workloads interact differently

with scheduling algorithms [13]. It is thus difﬁcult to obtain

an analytic model of 𝑇that is universally accurate. More-

over, most conventional optimization algorithms require

knowledge about the gradient ∇𝜃𝑇, which we do not have.

for (int l = 0; l < L; ++l)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; ++i)
    {
        task(i);
    }
}

Fig. 2. Visualization of our execution time models. The execution time

of the parallel loop (red bracket) is denoted as 𝑇, while the execution

time of the tasks in the parallel loop (green bracket) is denoted as 𝑇𝑖.

The outer loop (blue bracket) represents repeated execution (𝐿times) of

the parallel loop within the application, where 𝑇𝑡𝑜𝑡𝑎𝑙 is the total execution

time contribution of the loop.

Solution using Bayesian optimization. For solving this

problem, we leverage Bayesian optimization (BO). We initially attempted to apply other gradient-free optimization methods such as stochastic approximation [32], but the noise level in the execution time measurements is so extreme that these methods fail to converge. Conveniently, BO has recently been shown to be effective for solving such kinds of optimization problems [18], [19], [20]. Compared to other black-box optimization methods, BO requires fewer objective function evaluations and handles the presence of noise well [18].

Description of Bayesian optimization. The overall ﬂow of

BO is shown in Algorithm 1. First, we build a surrogate

model ℳof 𝑇total. Let (𝜃,𝜏)denote a data point of an

observation where 𝜃is a parameter value, and 𝜏is the

resulting execution time measurement such that 𝜏∼𝑇total.

Based on a dataset of previous observations denoted as

𝒟𝑡= {(𝜃1,𝜏1),…,(𝜃𝑡,𝜏𝑡)}, a surrogate model provides a

prediction of 𝑇total(𝜃)and the uncertainty of the prediction.

In our context, the prediction and uncertainty are given as the mean of the predictive distribution, denoted 𝜇(𝜃 ∣ 𝒟ₜ), and its variance, denoted 𝜎²(𝜃 ∣ 𝒟ₜ).

Using ℳ, we now solve what is known as the inner

optimization problem. In this step, we choose to exploit our

current knowledge about the optimal value or explore en-

tirely new values that we have not tried yet. In the extremes,

minimizing 𝜇(𝜃𝒟𝑡)gives us the optimal parameter given

our current knowledge, while minimizing 𝜎2(𝜃𝒟𝑡)gives us

the parameter we are currently the most uncertain. The

optimal solution is given by a tradeoff of the two ends

(often called the exploration-exploitation tradeoff), found by

solving the optimization problem

$$\theta_{t+1} = \underset{\theta}{\mathrm{argmax}}\; \alpha(\theta \mid \mathcal{M}, \mathcal{D}_t) \tag{6}$$

where the function 𝛼is called the acquisition function. Based

on the predictions and uncertainty estimates of ℳ,𝛼returns

our utility of trying out a speciﬁc value of 𝜃. Evidently, the

quality of the prediction and uncertainty estimates of ℳ

are crucial to the overall performance. By maximizing 𝛼,

we obtain the parameter value that has the highest utility,

according to 𝛼. In this work, we use the max-value entropy

search (MES, [33]) acquisition function.

After solving the inner optimization problem, we obtain

the next value to try out, 𝜃𝑡+1. We can then try out this

parameter and append the result (𝜃𝑡+1,𝜏𝑡+1) to the dataset.

For a comprehensive review of BO, please refer to [17]. We will later explain our OpenMP framework implementation of this overall procedure in Section 4.
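The following C sketch mirrors the structure of Algorithm 1; gp_model, fit_gp(), maximize_acquisition(), and run_workload() are hypothetical stand-ins for the surrogate fit, the inner optimization, and a timed execution of the target loop:

/* Hypothetical sketch of Algorithm 1's outer loop; the declared helpers
 * are illustrative stand-ins, not the paper's actual code. */
typedef struct { double theta, tau; } observation;
typedef struct gp_model gp_model;

gp_model *fit_gp(const observation *obs, int n);   /* step 1: fit M on D_t */
double maximize_acquisition(const gp_model *m);    /* step 2: argmax alpha */
double run_workload(double theta);                 /* step 3: tau ~ T_total */

void bayesian_optimization(observation *obs, int n0, int T) {
    int n = n0;  /* dataset D_t starts with the initial observations */
    for (int t = 0; t < T; ++t) {
        gp_model *m = fit_gp(obs, n);
        double theta = maximize_acquisition(m);
        double tau = run_workload(theta);
        obs[n].theta = theta;                      /* step 4: append */
        obs[n].tau = tau;
        ++n;
    }
}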


Fig. 3. (a), (b) Visualization of the temporal locality effect on the execution time of the kmeans workload. (a) 𝓁-axis view; the error bars are the 95% empirical confidence intervals. (b) 𝜃-axis view; the red squares are measurements of earlier executions (𝓁 ≤ 10), while the blue circles are measurements of later executions (𝓁 > 10). (c) Randomly sampled functions from a GP prior with an exponentially decreasing function kernel.


3.2 Modeling Execution Time with Gaussian Processes

As previously stated, having a good surrogate model ℳis

essential. Modeling the execution time of parallel programs

has been a classic problem in the ﬁeld of performance

modeling. It is known that parallel programs tend to follow

a Gaussian distribution when the execution time variance is

not very high [34]. This result follows from the central limit

theorem (CLT), which states that the inﬂuence of multiple

i.i.d. noise sources asymptotically form a Gaussian distribu-

tion. Considering this, we model the total execution time

contribution of a loop as

𝑇𝑡𝑜𝑡𝑎𝑙 =𝐿

𝓁=1 𝑇(𝑆𝜃)+𝜖(7)

where 𝐿is the total number of times a speciﬁc loop is

executed within the application, indexed by 𝓁. Following

the conclusions of [34], we naturally assume that 𝜖follows

a Gaussian distribution. Note that, at this point, we assume

𝑇 is independent of the index 𝓁. For an illustration of the models used in our discussion, please see Fig. 2.

Gaussian Process formulation. From the dataset 𝒟𝑡, we

infer the model of the execution time 𝑇𝑡𝑜𝑡𝑎𝑙(𝜃)using Gaussian

processes (GPs). A GP is a nonparametric Bayesian prob-

abilistic machine learning model for nonlinear regression.

Unlike parametric models such as polynomial curve ﬁtting

and random forest, GPs automatically tune their complexity

based on data [35]. Also, more importantly, GPs can natu-

rally incorporate the assumption of additive noise (such as

𝜖 in (7)). The prediction of a GP is given as a univariate Gaussian distribution fully described by its mean 𝜇(𝜃 ∣ 𝒟ₜ) and variance 𝜎²(𝜃 ∣ 𝒟ₜ). These are computed in closed form as
$$\mu(\theta \mid \mathcal{D}_t) = \mathbf{k}(\theta)^{\top} (\mathbf{K} + \sigma_\epsilon^2 I)^{-1} \mathbf{y} \tag{8}$$
$$\sigma^2(\theta \mid \mathcal{D}_t) = k(\theta, \theta) - \mathbf{k}(\theta)^{\top} (\mathbf{K} + \sigma_\epsilon^2 I)^{-1} \mathbf{k}(\theta) \tag{9}$$

where 𝐲 = [𝜏₁, 𝜏₂, …, 𝜏ₜ], 𝐤(𝜃) is a vector-valued function such that [𝐤(𝜃)]ᵢ = 𝑘(𝜃, 𝜃ᵢ) for all 𝜃ᵢ ∈ 𝒟ₜ, and 𝐊 is the Gram matrix such that [𝐊]ᵢ,ⱼ = 𝑘(𝜃ᵢ, 𝜃ⱼ) for all 𝜃ᵢ, 𝜃ⱼ ∈ 𝒟ₜ; 𝑘(⋅,⋅) denotes the covariance kernel function, which is a design choice. We use the Matern 5/2 kernel, which is computed as
$$k(x, x'; \sigma^2, \rho) = \sigma^2 \left(1 + \sqrt{5}\,r + \tfrac{5}{3} r^2\right) \exp(-\sqrt{5}\,r) \tag{10}$$
$$\text{where } r = \lVert x - x' \rVert_2 / \rho. \tag{11}$$

For a detailed introduction to GP regression, please refer

to [36].
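For concreteness, a one-dimensional C sketch of the Matern 5/2 kernel in (10)-(11):

#include <math.h>

/* Matern 5/2 covariance between two parameter values (equation (10));
 * sigma2 is the signal variance, rho the length scale. */
double matern52(double x, double xp, double sigma2, double rho) {
    double r = fabs(x - xp) / rho;               /* (11), 1-D case */
    double s = sqrt(5.0) * r;
    return sigma2 * (1.0 + s + (5.0 / 3.0) * r * r) * exp(-s);
}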

Non-Gaussian noise. Despite the remarks in [34] saying that the noise in parallel programs mostly follows Gaussian distributions, we experienced cases where the execution time of individual parallel loops did not quite follow Gaussian distributions. For example, occasional L2 and L3 cache misses result in large deviations, or outliers, in execution time. To correctly model these events, it is advisable to

use heavy-tail distributions such as the Student-T. More

advanced methods for dealing with such outliers are de-

scribed in [37] and [38]. However, to narrow the scope of

our discussion, we stay within the Gaussian assumption.

3.3 Modeling with Locality-Aware Gaussian Processes

Until now, we only considered acquiring samples of 𝑇𝑡𝑜𝑡𝑎𝑙

by summing our measurements of 𝑇. For the case where the

parallel loop in question is executed more than once (that is,

𝐿>1), we acquire 𝐿observations of 𝑇in a single run of the

workload. By exploiting our model’s structure in (7), it is

possible to utilize all 𝐿samples instead of aggregating them

into a single one. Since the Gaussian distribution is additive,

we can decompose the distribution of 𝑇𝑡𝑜𝑡𝑎𝑙 such that

𝑇𝑡𝑜𝑡𝑎𝑙 =𝐿

𝓁=1𝑇(𝑆𝜃,𝓁)(12)

∼𝐿

𝓁=1𝒩(𝔼[𝑇(𝑆𝜃,𝓁)],𝕍[𝑇(𝑆𝜃,𝓁)],)(13)

=𝒩(𝐿

𝓁=1𝔼[𝑇(𝑆𝜃,𝓁)],𝐿

𝓁=1𝕍[𝑇(𝑆𝜃,𝓁)]) (14)

≈𝒩(𝐿

𝓁=1𝜇(𝜃,𝓁𝒟𝑡),𝐿

𝓁=1𝜎2(𝜃,𝓁𝒟𝑡)).(15)


Fig. 4. System overview of BO FSS. Online denotes the time we are actually executing the workload, while offline denotes the time we are not executing the workload. For a detailed description, refer to the text in Section 4.

Note the dependence of 𝑇 on the index of execution 𝓁. From (14), we can retrieve 𝑇_total from the mean (𝔼[𝑇(𝑆_𝜃, 𝓁)]) and variance (𝕍[𝑇(𝑆_𝜃, 𝓁)]) estimates of 𝑇, which are given by modeling 𝑇 using GPs as denoted in (15).
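To make (15) concrete, a minimal sketch that sums the per-execution GP posteriors; gp_mean() and gp_var() are hypothetical accessors for the fitted model, not part of our implementation:

double gp_mean(double theta, int l);  /* hypothetical: mu(theta, l | D_t) */
double gp_var(double theta, int l);   /* hypothetical: sigma^2(theta, l | D_t) */

/* Predictive mean and variance of T_total under (15): sums of the
 * per-execution GP posteriors over l = 1..L. */
void predict_total(double theta, int L, double *mean, double *var) {
    *mean = 0.0;
    *var = 0.0;
    for (int l = 1; l <= L; ++l) {
        *mean += gp_mean(theta, l);
        *var += gp_var(theta, l);
    }
}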

Temporal locality effect. However, this is not as simple as

assuming that all 𝐿measurements of 𝑇are independent (ig-

noring the argument 𝓁of 𝑇). The execution time distribution

of a loop changes dramatically within a single application

run because of the temporal locality effect. This is shown

in Fig. 3 using measurements of a loop in the kmeans

benchmark. In Fig. 3a, it is clear that earlier executions of

the loop (𝓁≤10) are much longer than the later executions

(𝓁 > 10). Also, different moments of execution are affected

differently by 𝜃, as shown in Fig. 3b. It is thus necessary

to accurately model the effect of 𝓁to better distinguish the

effect of 𝜃.

Exponentially decreasing function kernel. To model the

temporal locality effect, we expand our GP model to include

the index of execution 𝓁. Now, the model is a 2-dimensional

GP receiving 𝓁 and 𝜃. Within the workloads we consider, the temporal locality effect shows an exponentially decreasing tendency. We thus assume that the locality effect

can be represented with exponentially decreasing functions

(Exp.) of the form $e^{-\lambda \ell}$. The kernel for these functions has

been introduced in [22] for modeling the learning curves of

machine learning algorithms. The exponentially decreasing

function kernel is computed as

$$k(\ell, \ell') = \frac{\beta^\alpha}{(\ell + \ell' + \beta)^\alpha}. \tag{16}$$

Random functions sampled from the space induced by the

Exp. kernel are shown in Fig. 3c. Notice the similarity of

the sampled functions and the visualized locality effect in

Fig. 3a. Modeling more complex locality effects such as

periodicity can be achieved by combining additional

kernels. An automatic procedure for doing this is described

in [39].

Kernel of locality-aware GPs. Since the sum of covariance

kernels is also a valid covariance kernel [36], we deﬁne our

2-dimensional kernel as

$$k(x, x') = k_{\mathrm{Matern}}(\theta, \theta') + k_{\mathrm{Exp}}(\ell, \ell') \tag{17}$$
$$\text{where } x = [\theta, \ell], \quad x' = [\theta', \ell']. \tag{18}$$

Intuitively, this deﬁnition implies that we assume the effect

of scheduling (resulting from 𝜃) and locality (resulting from

𝓁) to be superimposed (additive).
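A minimal C sketch of the additive kernel in (16)-(17), reusing the matern52() sketch from above (all names are illustrative):

#include <math.h>

double matern52(double x, double xp, double sigma2, double rho); /* sketch above */

/* Exponentially decreasing function kernel (16) over execution indices. */
double k_exp(double l, double lp, double alpha, double beta) {
    return pow(beta, alpha) / pow(l + lp + beta, alpha);
}

/* Additive locality-aware kernel (17) on points x = [theta, l]. */
double k_locality(double theta, double l, double thetap, double lp,
                  double sigma2, double rho, double alpha, double beta) {
    return matern52(theta, thetap, sigma2, rho)   /* scheduling effect */
         + k_exp(l, lp, alpha, beta);             /* locality effect */
}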

Reducing computational cost. The computational complexity of computing a GP is in 𝑂(𝑇³), where 𝑇 represents the total number of BO iterations. The locality-aware construction uses all the independent loop executions, resulting in a computational complexity in 𝑂((𝐿𝑇)³). To reduce the computational cost, we subsample the data along the axis of 𝓁 by using every 𝑘th measurement of the loop, such that 𝓁 ∈ {1, 𝑘+1, 2𝑘+1, …, 𝐿}. As a result, the computational complexity is reduced by a constant factor to 𝑂(((𝐿/𝑘)𝑇)³). In all of our experiments, we use a large value of 𝑘 so that 𝐿/𝑘 = 4.
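A sketch of the subsampling step, keeping every 𝑘th measurement along the 𝓁-axis:

/* Keep every k-th measurement along the l-axis so the GP sees about
 * L/k points per run instead of L. tau[l] is the l-th loop time. */
int subsample(const double *tau, int L, int k, double *kept) {
    int n = 0;
    for (int l = 0; l < L; l += k)
        kept[n++] = tau[l];   /* l = 1, k+1, 2k+1, ... in 1-based indexing */
    return n;                 /* number of retained measurements */
}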

3.4 Treatment of Gaussian Process Hyperparameters

GPs have multiple hyperparameters that need to be pre-

determined. The suitability of these hyperparameters is

directly related to the optimization performance of BO [40].

Unfortunately, whether a set of hyperparameters is ap-

propriate depends on the characteristics of the workload.

Since real-life workloads are very diverse, it is essential to

automatically handle these parameters.

The Matern 5/2 kernel has two hyperparameters 𝜌and

𝜎, while the exponentially decreasing function kernel has

two hyperparameters 𝛼and 𝛽. GPs also have hyperparame-

ters themselves: the function mean 𝜇 and the noise variance 𝜎²_𝜖. We denote the hyperparameters using the concatenation 𝜙 = [𝜇, 𝜎_𝜖, 𝜎, 𝜌, …].

Since the marginal likelihood 𝑝(𝒟ₜ ∣ 𝜙) is available in closed form [36], we can infer the hyperparameters using type-II maximum likelihood estimation, or marginalize them by integrating them out. Marginalization has empirically been shown to give better optimization performance in the context of BO [40], [41]. It is performed by approximating the integral
$$\alpha(x \mid \mathcal{M}, \mathcal{D}_t) = \int \alpha(x \mid \mathcal{M}, \phi, \mathcal{D}_t)\, p(\phi \mid \mathcal{D}_t)\, d\phi \tag{19}$$
$$\approx \frac{1}{N} \sum_{\phi_i \sim p(\phi \mid \mathcal{D}_t)} \alpha(x \mid \mathcal{M}, \phi_i, \mathcal{D}_t), \tag{20}$$

using samples 𝜙ᵢ from the posterior, where 𝑁 is the number of samples. For sampling from the posterior, we use the No-U-Turn Sampler (NUTS, [42]).
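A minimal sketch of the Monte Carlo average in (20); hyperparams and acquisition() are hypothetical stand-ins, and the samples 𝜙ᵢ are assumed to have been drawn by NUTS beforehand:

/* Hypothetical container for the GP hyperparameters
 * phi = [mu, sigma_eps, sigma, rho, alpha, beta]. */
typedef struct { double mu, sigma_eps, sigma, rho, alpha, beta; } hyperparams;

/* Hypothetical acquisition value alpha(x | M, phi_i, D_t) under one
 * hyperparameter sample; MES in the actual system. */
double acquisition(double x, const hyperparams *phi);

/* Monte Carlo approximation (20): average the acquisition over N
 * posterior samples phi[0..N-1]. */
double marginalized_acquisition(double x, const hyperparams *phi, int N) {
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += acquisition(x, &phi[i]);
    return sum / N;
}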

4 SYSTEM IMPLEMENTATION

TABLE 1
Benchmark Workloads

| Workload | Profile | Characterization | # Tasks (N) | Application Domain | Benchmark Suite |
| lavaMD | Uninformative¹ | N-Body | 8000 | Molecular Dynamics | Rodinia 3.1 |
| stream. | No | Dense Linear Algebra | 65536 | Data Mining | Rodinia 3.1 |
| kmeans | Uninformative² | Dense Linear Algebra | 494020 | Data Mining | Rodinia 3.1 |
| srad v1 | Uninformative¹ | Structured Grid | 229916 | Image Processing | Rodinia 3.1 |
| nn | Uninformative¹ | Dense Linear Algebra | 8192 | Data Mining | Rodinia 3.1 |
| cc-* | Yes | Sparse Linear Algebra | N/A³ | Graph Analytics | GAP |
| pr-* | Yes | Sparse Linear Algebra | N/A³ | Graph Analytics | GAP |

¹ Uniformly partitioned workload.
² Imbalance present only in domain boundaries.
³ Input data dependent; number of vertices of the input graph.

We now describe our implementation of BO FSS. Our implementation is based on the GCC implementation of the OpenMP 4.5 framework [1], which is illustrated in Fig. 4.

The overall workﬂow is as follows:

0) First, we randomly generate initial scheduling parameters 𝜃₀, …, 𝜃_{𝑁₀} using a Sobol quasi-random sequence [43].
1) During execution, for each loop in the workload, we schedule the parallel loop using the parameter 𝜃ₜ. We measure the resulting execution time of the loop and acquire a measurement 𝜏ₜ.
2) Once we finish executing the workload, we store the pair (𝜃ₜ, 𝜏ₜ) on disk in JSON format.
3) Then, we run the offline tuner, which loads the dataset 𝒟ₜ from disk.
4) Using 𝒟ₜ, we solve the inner optimization problem in (6), obtaining the next scheduling configuration 𝜃ₜ₊₁.
5) At the subsequent execution of the workload, 𝑡 ← 𝑡 + 1, and we go back to step 1).

Note that offline refers to the time when we are not executing the workload, while online refers to the time when we are executing the workload (runtime).

Implementation of the ofﬂine tuner. We implemented the

ofﬂine tuner as a separate program written in Julia [44],

which is invoked by the user. When invoked, the tuner

solves the inner optimization problem and stores the results on disk. For solving the inner optimization problem, we

use the DIRECT algorithm [45] implemented in the NLopt

library [46]. For marginalizing the GP hyperparameters, we

use the AdvancedHMC.jl implementation of NUTS [47].

Search space reparameterization. BO requires the domain

of the parameter to be bounded. However, in the case of

FSS, 𝜃is not necessarily bounded. As a compromise, we

reparameterized 𝜃into a ﬁxed domain such that

$$\underset{x}{\mathrm{minimize}} \quad \mathbb{E}[T_{\mathrm{total}}(S_{\theta(x)})] \tag{21}$$
$$\text{where } \theta(x) = 2^{19x - 10}, \quad 0 < x < 1. \tag{22}$$

This also effectively converts the search space to be in a loga-

rithmic scale. The reparameterized domain was determined

by empirically investigating feasible values of 𝜃.
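The mapping in (22) is straightforward to compute; for example:

#include <math.h>

/* Reparameterization (22): BO searches x in (0, 1) and the FSS
 * parameter is recovered as theta = 2^(19x - 10), i.e. a logarithmic
 * scale spanning roughly [2^-10, 2^9]. */
double theta_of_x(double x) {
    return pow(2.0, 19.0 * x - 10.0);
}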

User interface. BO FSS can be selected by setting the

OMP_SCHEDULE environment variable, or by the OpenMP

runtime API as in Listing 1.

Listing 1

Selecting a scheduling algorithm

omp_set_schedule(BO_FSS); // selects BO FSS

Modiﬁcation of the OpenMP ABI. As previously described,

our system optimizes each loop in the workload indepen-

dently. Naturally, our system requires the identiﬁcation of

the individual loops within the OpenMP runtime. However,

we encountered a major issue: the current OpenMP ABI

does not provide a way for such identiﬁcation. Conse-

quently, we had to modify the GCC 8.2 [23] compiler’s

OpenMP code generation and the OpenMP ABI. The mod-

iﬁed GCC OpenMP ABI is shown in Listing 2. During

compilation, a unique token for each loop is generated

and inserted at the end of the OpenMP procedure calls.

Using this token, we store and manage the state of each

loop. Measuring the loop execution time is done by start-

ing the system clock in OpenMP runtime entries such as

GOMP_parallel_runtime_start, and stopping in exits

such as GOMP_parallel_end.

Listing 2

Modiﬁed GCC OpenMP ABI

void GOMP_parallel_loop_runtime(void (*fn)(void *), void *data,
                                unsigned num_threads, long start,
                                long end, long incr, unsigned flags,
                                size_t loop_id);

void GOMP_parallel_runtime_start(long start, long end, long incr,
                                 long *istart, long *iend,
                                 size_t loop_id);

void GOMP_parallel_end(size_t loop_id);
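As an illustration of how such a token could key per-loop state, the following hypothetical sketch accumulates wall-clock time per loop_id; the state table and its bounds are assumptions, not the actual GCC modification:

#include <time.h>

/* Hypothetical per-loop timing keyed by the compiler-generated loop_id:
 * start the clock on loop entry, accumulate elapsed time on loop exit. */
#define MAX_LOOPS 1024
static struct timespec loop_start[MAX_LOOPS];
static double loop_time[MAX_LOOPS];

void loop_timer_start(size_t loop_id) {
    if (loop_id >= MAX_LOOPS) return;
    clock_gettime(CLOCK_MONOTONIC, &loop_start[loop_id]);
}

void loop_timer_end(size_t loop_id) {
    if (loop_id >= MAX_LOOPS) return;
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    loop_time[loop_id] += (now.tv_sec - loop_start[loop_id].tv_sec)
                        + (now.tv_nsec - loop_start[loop_id].tv_nsec) * 1e-9;
}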

5 EVALUATION

In this section, we ﬁrst describe the overall setup of our

experiments. Then, we compare the robustness of BO FSS

against other scheduling algorithms. After that, we evaluate

the performance of our BO augmentation scheme. Lastly, we

directly compare the execution time.

5.1 Experimental Setup

System setup. All experiments are conducted on a single

shared-memory computer with an AMD Ryzen Thread-

ripper 1950X 3.4GHz CPU which has 16 cores and 32

threads with simultaneous multithreading enabled. It also

has 1.5MB of L1 cache, 8MB of L2 cache and 32MB

of last-level cache. We use the Linux 5.4.36-lts kernel with two 16 GB DDR4 RAM modules (32 GB total). Frequency scaling is disabled with the cpupower frequency-set performance setting.


TABLE 2
Minimax Regret of Scheduling Algorithms

The values in the table are the percentage slowdown relative to the best performing algorithm. They can be interpreted as the opportunity cost of using each algorithm. Column groups: Ours (BO FSS); Static (STATIC); Workload-Aware (HSS, BinLPT); Dynamic (GUIDED, FSS, CSS, FAC2, TRAP1, TAPER3). For more details, refer to the text in Section 5.1.

| Workload | BO FSS | STATIC | HSS | BinLPT | GUIDED | FSS | CSS | FAC2 | TRAP1 | TAPER3 |
| lavaMD | 0.00 | 17.55 | n/a | n/a | 7.25 | 3.00 | 0.36 | 0.25 | 10.33 | 42.64 |
| stream. | 0.00 | 10.79 | n/a | n/a | 2.39 | 10.36 | 1.25 | 0.68 | 2.00 | 2.45 |
| kmeans | 0.00 | 23.02 | n/a | n/a | 8.01 | 17.62 | 1.50 | 1.17 | 2.30 | 6.41 |
| srad v1 | 22.34 | 10.92 | n/a | n/a | 16.75 | 11.74 | 26.03 | 0.00 | 16.43 | 17.61 |
| nn | 4.76 | 5.06 | n/a | n/a | 0.00 | 0.55 | 7.00 | 6.06 | 4.39 | 5.14 |
| cc-journal | 0.00 | 2.88 | 66.98 | 196.63 | 11.94 | 2.47 | 2.98 | 6.15 | 3.65 | 0.66 |
| cc-wiki | 0.00 | 6.94 | 58.57 | 154.31 | 10.37 | 2.77 | 6.58 | 5.29 | 7.88 | 5.27 |
| cc-road | 0.00 | 8.57 | 81.88 | 251.71 | 7.19 | 1.37 | 1.55 | 1.23 | 1.97 | 1.71 |
| cc-skitter | 5.28 | 2.28 | 61.69 | 129.08 | 3.57 | 1.03 | 1.05 | 1.06 | 0.73 | 0.00 |
| pr-journal | 0.00 | 29.66 | 5.52 | 66.89 | 42.93 | 29.01 | 29.07 | 29.17 | 29.33 | 28.81 |
| pr-wiki | 15.30 | 45.20 | 0.00 | 42.26 | 85.34 | 46.99 | 47.28 | 46.82 | 46.53 | 46.87 |
| pr-road | 0.00 | 0.32 | 41.65 | 138.32 | 6.60 | 0.41 | 0.42 | 0.42 | 0.40 | 0.41 |
| pr-skitter | 0.00 | 11.51 | 23.21 | 68.91 | 29.97 | 11.66 | 11.21 | 11.34 | 12.06 | 11.26 |
| ℛ(S) | 22.34 | 45.20 | 81.88 | 251.71 | 85.34 | 46.99 | 47.28 | 46.83 | 46.53 | 46.87 |
| ℛ90(S) | 13.30 | 28.33 | 71.75 | 213.15 | 40.34 | 26.73 | 28.46 | 25.60 | 26.75 | 39.87 |

We use the GCC 8.3 compiler with the -O3 and -march=native optimization flags enabled in all

the -O3,-march=native optimization ﬂags enabled in all

of our benchmarks.

BO FSS setup. We run BO FSS for 20 iterations starting from

4 random initial points. All results use the best parameter

found after the aforementioned number of iterations.

Baseline scheduling algorithms. We compare BO FSS

against the FSS [7], CSS [6], TSS [8], GUIDED [48], TA-

PER [10], BinLPT [16], HSS [14] algorithms. We use the

implementation of BinLPT and HSS provided by the authors

of BinLPT2. For the FSS and CSS algorithms, we estimate

the statistics of each workloads (𝜇,𝜎) beforehand from 64

executions. The scheduling overhead parameter ℎis esti-

mated using the method described in [49]. We use the de-

fault STATIC and GUIDED implementations of the OpenMP

4.5 framework using the static and guided scheduling

ﬂags. For the TSS and TAPER schedules, we follow the

heuristic versions suggested in their original works, denoted

as TRAP1 and TAPER3, respectively.

Benchmark workloads. The workloads considered in our

experiments are summarized in Table 1. We select work-

loads from the Rodinia 3.1 benchmark suite [26] (lavaMD, streamcluster, kmeans, srad v1) where the STATIC

scheduling method performs worse than other dynamic

scheduling methods. We also include workloads from the

GAP benchmark suite [27] (cc,pr) where the load is pre-

dictable from the input graph.

Workload-proﬁle availability. We characterize the

workload-proﬁle availability of each workload in the

Workload-Proﬁle column in Table 1. For workloads with

homogeneous tasks (lavaMD,stream.,srad v1,nn),

static imbalance does not exist. Most of the imbalance

is caused during runtime, rendering a workload-profile uninformative. On the other hand, the static imbalance of

the kmeans workload is revealed during execution, not

2. Retrieved from https://github.com/lapesd/libgomp

before execution. We thus consider the workload-proﬁle to

be effectively unavailable.

TABLE 3
Input Graph Datasets

The last three columns report (deg⁻(v)¹, deg⁺(v)²) pairs.

| Dataset | 𝒱 | ℰ | mean | std | max |
| journal [50] | 4.0M | 69.36M | 17, 17 | 43, 43 | 15k, 15k |
| wiki [51] | 3.57M | 45.01M | 13, 13 | 33, 250 | 7k, 187k |
| road [52] | 24.95M | 57.71M | 2, 2 | 1, 1 | 9, 9 |
| skitter [53] | 1.70M | 22.19M | 13, 13 | 137, 137 | 35k, 35k |

¹ In-degree of each vertex.
² Out-degree of each vertex.

Input graph datasets. We organize the graph datasets used

for the workloads from the GAP benchmark suite in Table 3,

acquired from [54]. 𝒱 and ℰ are the sets of vertices and edges in each graph, respectively. The load of each task 𝑇ᵢ in

the cc and pr workloads is proportional to the in-degree

and out-degree of each vertex [55]. We use this degree

information for forming the workload-proﬁles. Among the

datasets considered, wiki has the most extreme imbalance

while road has the least imbalance [55].

Workload-robustness measure. To quantify the notion

of workload-robustness, we use the minimax regret mea-

sure [25]. The minimax regret quantiﬁes robustness by calcu-

lating the opportunity cost of using an algorithm, computed

as

$$\mathcal{R}(S, w) = \frac{C(S, w) - \min_{S' \in \mathcal{S}} C(S', w)}{\min_{S' \in \mathcal{S}} C(S', w)} \times 100 \tag{23}$$
$$\mathcal{R}(S) = \max_{w \in \mathcal{W}} \mathcal{R}(S, w) \tag{24}$$

where 𝐶(𝑆, 𝑤) is the cost of the scheduling algorithm 𝑆 on the workload 𝑤, and 𝒲 is our set of workloads. We choose 𝐶(𝑆, 𝑤) to be the execution time. In this case, ℛ(𝑆, 𝑤) can be interpreted as the slowdown relative to the best performing algorithm in percentages. Also, ℛ(𝑆) is the worst-case regret of using 𝑆 on the set of workloads 𝒲. Note that among


Fig. 5. Execution time comparison of BO FSS, FSS, and FAC2. We estimate the mean execution time from 256 executions. The error bars show the 95% bootstrap confidence intervals. The results are normalized by the mean execution time of BO FSS. The methods with the lowest execution time are marked with a star (*). Methods not significantly different from the best performing method are also marked with a star (Wilcoxon signed-rank test, 1% null-hypothesis rejection threshold).

different robustness measures, the minimax regret is very

pessimistic [24], emphasizing worst-case performance. For

this reason, we additionally consider the 90th percentile of

the minimax regret denoted as ℛ90(𝑆).
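A minimal sketch of how the regret table can be computed from measured execution times, following (23)-(24); the array layout is an illustrative choice:

/* Sketch of (23)-(24): cost[s][w] is the measured execution time of
 * scheduler s on workload w; returns the minimax regret of scheduler s
 * in percent. The 13 workloads match the rows of Table 2. */
#define N_WORKLOADS 13

double minimax_regret(const double cost[][N_WORKLOADS], int n_sched, int s) {
    double worst = 0.0;
    for (int w = 0; w < N_WORKLOADS; ++w) {
        double best = cost[0][w];                  /* min over schedulers */
        for (int t = 1; t < n_sched; ++t)
            if (cost[t][w] < best) best = cost[t][w];
        double regret = (cost[s][w] - best) / best * 100.0;   /* (23) */
        if (regret > worst) worst = regret;                   /* (24) */
    }
    return worst;
}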

5.2 Evaluation of Workload-Robustness

Table 2 compares the minimax regrets of different schedul-

ing algorithms with that of BO FSS. Each entry in the table is

the regret subject to the workload and scheduling algorithm,

ℛ(𝑆,𝑤). The ﬁnal rows are the minimax regret ℛ(𝑆)and the

90th percentile minimax regret ℛ90(𝑆)subject to the schedul-

ing algorithm. BO FSS achieves the lowest regret in terms of both the minimax regret (22 percentage points) and the 90th percentile minimax regret (13 percentage points). In contrast, both static and dynamic scheduling methods achieve similar levels of regret. This observation is in line with previous findings [13]; none of the classic scheduling methods dominate each other. It is worth noting that we selected workloads in which STATIC performs poorly. Our robustness analysis thus only

holds for comparing dynamic and workload-aware schedul-

ing methods.

Remarks. The results for workload-robustness using the

minimax regret metric show that BO FSS achieves signiﬁ-

cantly lower levels of regret compared to other scheduling

methods. As a result, BO FSS performs consistently well.

Even when BO FSS does not perform the best, its perfor-

mance is within an acceptable range.

5.3 Evaluation of Bayesian Optimization Augmentation

A fundamental part of the proposed method is that BO

FSS improves the performance of FSS by tuning its internal

parameter. In this section, we show how much BO augmen-

tation improves the performance of FSS and its heuristic

variant FAC2. We run BO FSS, FSS, and FAC2 on workloads

with both high and low static imbalances. The results are

shown in Fig. 5. Overall, we can see that BO FSS consistently

outperforms FSS and FAC2 with the exception of srad

v1 and cc-skitter. On workloads with high imbalance

such as pr-journal and pr-wiki, the execution time

improvements are as high as 30%.

Performance degradation on srad v1. Interestingly, BO

FSS does not perform well on two workloads: srad v1

and cc-skitter. While the performance difference in

cc-skitter is marginal, the difference in srad v1 is not.

This phenomenon is due to the large deviations in the exe-

cution time measurements as shown in Fig. 6. That is, large

Fig. 6. Parameter space and surrogate model fit on the srad v1 workload. The colored regions are the 95% predictive credible intervals of a GP (green region) and a Student-T process (blue region). The red circles are the data points used to fit both surrogate models.

Fig. 7. Convergence plot of the locality-unaware GP and the locality-aware GP on the skitter workload. We can see the execution time decreasing as we run BO. We ran BO 30 times with 10 iterations each, and computed the 95% bootstrap confidence intervals.

outliers near 𝜃 = 0.4 and 𝜃 = 0.8 skewed the GP predictions

(green line). Since GPs assume the noise to be Gaussian, they

are not well suited for this kind of workload. A possible

remedy is to use Student-T processes [37], [38], shown with

the blue line. In Fig. 6, the Student-T process is much less

affected by outliers, resulting in a tighter ﬁt. Nonetheless,

GPs worked consistently well on other workloads.

Comparison of Gaussian Process Models. We now com-

pare the simple GP construction in Section 3.2 and the

locality-aware GP construction in Section 3.3. We equip

BO with each of the models and run the autotuning process from beginning to end 30 times. The convergence results are shown in Fig. 7. We can see that the locality-aware construction converges much more quickly. Note that the shown

results are averages. In the individual results, there are

cases where the locality-unaware version completely fails to


Fig. 8. Execution time comparison of BO FSS against workload-aware methods. We estimate the mean execution time from 256 executions. The error bars show the 95% bootstrap confidence intervals. The results are normalized by the mean execution time of BO FSS. The methods with the lowest execution time are marked with a star (*). Methods not significantly different from the best performing method are also marked with a star (Wilcoxon signed-rank test, 1% null-hypothesis rejection threshold).

Fig. 9. Effect of mismatching the data used for tuning BO FSS and the data used for execution. The rows are the datasets (road-usa, wiki, skitter, journal) used for tuning BO FSS, while the columns are the datasets used for execution. The numbers represent the percentage slowdown relative to the matched case. Colder colors represent more slowdown (the hotter the better).

Fig. 10. Execution time comparison of BO FSS against dynamic scheduling methods. We estimate the mean execution time from 256 executions. The error bars show the 95% bootstrap confidence intervals. The results are normalized by the mean execution time of BO FSS. The methods with the lowest execution time are marked with a star (*). Methods not significantly different from the best performing method are also marked with a star (Wilcoxon signed-rank test, 1% null-hypothesis rejection threshold).

converge within a given budget. We thus suggest using the

locality-aware construction whenever possible. It achieves

consistent results at the expense of additional computation

during tuning.

Remarks. Apart from srad v1, BO FSS performs better

than FSS and FAC2 on most workloads. This indicates that

the Gaussian assumption works fairly well in most cases.

We can conclude that our BO augmentation improves the

performance of FSS on workloads with both high and low static imbalance. Our interest is now to see how this im-

provement compares against other scheduling algorithms.

5.4 Evaluation on Workloads Without Static Imbalance

This section compares the performance of BO FSS against

dynamic scheduling methods on workloads where a

workload-profile is unavailable or uninformative. The benchmark results are shown in Fig. 10. BO FSS outperforms all other methods on 3 of the 5 workloads considered. On the nn workload, the difference between

all methods is insigniﬁcant. As discussed in Section 5.3, BO

FSS performs poorly on the srad v1 workload. Note that

the same tuning results are used both for Section 5.3 and

this experiment.

Remarks. Compared to other dynamic scheduling methods,

BO FSS achieves more consistent performance. However,

because of the instability in the tuning process, BO FSS

performs poorly on srad v1. It is thus important to ensure

that BO FSS correctly converges to a critical point before

applying it.

5.5 Evaluation on Workloads With Static Imbalance

This section evaluates the performance of BO FSS

against workload-aware methods using workloads with a

workload-proﬁle. The evaluation results are shown in Fig. 8.

Except for the pr-wiki workload, BO FSS dominates all

considered baselines. Because of the large number of tasks,

both the HSS and BinLPT algorithms do not perform well

on these workloads. Meanwhile, the STATIC and GUIDED

strategies are very inconsistent in terms of performance. On

the pr-wiki and pr-journal workloads, both methods

are nearly 30% slower than BO FSS. This means that these

algorithms lack workload-robustness unlike BO FSS.

On the pr-wiki workload, which has the most extreme level of static imbalance, HSS performs significantly better. As discussed in Section 2.3, HSS has a very large critical section, resulting in a large amount of scheduling overhead. However, on the pr-wiki workload, the inefficiency caused by load imbalance far outweighs the inefficiency caused by the scheduling overhead, giving HSS a relative advantage.

Does the input data affect performance? BO FSS’s per-

formance is tightly related to the individual property of

each workload. It is thus interesting to ask how much the

input data of the workload affects the behavior of BO FSS.

To analyze this, we interchange the data used to tune BO

FSS and the data used to measure the performance. If the

input data plays an important role, the discrepancy between


the tuning time data and the runtime data would degrade the

performance. The corresponding results are shown in Fig. 9

where the entries are the percentage increase in execution

time relative to the matched case. Each row represents the

dataset used for tuning, while each column represents the

dataset used for execution. The anti-diagonal (bottom left

to top right) is the case when the data is matched. The

maximum amount of degradation is caused when we use

skitter for tuning and wiki during runtime. Also, the

case of using journal for tuning and wiki during runtime

signiﬁcantly degrades the performance. Overall, the wiki

and road datasets turned out to be the pickiest about

the match. Since both wiki and road resulted in high

degradation, the amount of imbalance in the data does not

determine how important the match is. However, judging

from the fact that the degradation is at most 1%, we can

conclude that BO FSS is more sensitive to the workload’s

algorithm rather than its input data.

Remarks. Compared to the workload-aware methods, BO

FSS performed the best except for the one workload with the most imbalance. Excluding this extreme case, the performance benefits of BO FSS are quite large. We also evaluated the sensitivity of BO FSS to perturbations of the

workload. Results show that BO FSS is not affected much

by changes in the input data of the workload.

5.6 Discussions

Analysis of overhead. BO FSS has speciﬁc duties, both

online and ofﬂine. When online, BO FSS loads the precom-

puted scheduling parameter 𝜃𝑖, measures the loop execution

time and stores the pair (𝜃𝑖,𝜏𝑖)in the dataset 𝒟𝑡. A storage

memory overhead of 𝑂(𝑇), where 𝑇is the number of BO

iterations, is required to store 𝒟𝑡. This is normally much

less than the 𝑂(𝑁)memory requirement, where 𝑁is the

number of tasks, imposed by workload-aware methods.

When ofﬂine, BO FSS runs BO using the dataset 𝒟𝑡and

determines the next scheduling parameter 𝜃𝑖+1. Because

most of the actual work is performed ofﬂine, the online

overhead of BO FSS is almost identical to that of FSS. The

offline step is relatively expensive due to the computational complexity of GPs. Fortunately, BO FSS converges within 10

to 20 iterations for most cases. This allows the computational

cost to stay within a reasonable range.

Limitation. When the target loop will not be executed for a significant amount of time, BO FSS does not provide significant benefits, as it requires time for offline tuning. However, HPC

workloads are often long-running and reused over time. For

this reason, BO FSS should be applicable for many HPC

workloads.

Portability. When solving the optimization problem in (5)

with BO, the target system becomes part of the objective

function. As a result, BO FSS automatically takes into ac-

count the properties of the target system. This fact makes BO

FSS highly portable. At the same time, as the experimental

results of Fig. 9 imply, instead of directly operating on the

full target workload, it should be possible to use much

cheaper proxy workloads for tuning BO FSS.

6 RELATED WORKS

Classical dynamic loop scheduling methods. To improve

the efﬁciency of dynamic scheduling, many classical algo-

rithms were introduced, such as CSS [6], FSS [7], TSS [8], BOLD [9], TAPER [10], and BAL [11]. However, most of these classic algorithms are derived in a limited theoretical context with strict statistical assumptions. One such example is the i.i.d. assumption imposed on the workload.

Adaptive and workload-aware methods. To resolve this

limitation, adaptive methods were developed, starting from the adaptive FSS algorithm [12]. Recently, workload-aware methods including HSS [14] and BinLPT [15], [16] were introduced. These scheduling algorithms explicitly require a

workload-proﬁle before execution and exploit this knowl-

edge in the scheduling process. On the ﬂip side, this re-

quirement makes these methods difﬁcult to use in practice

since the exact workload-proﬁle may not always be avail-

able beforehand. In contrast, our proposed method is more

convenient since we only need to measure the execution

time of a loop. Also, the overall concept of our method is

more ﬂexible; it is possible to plug in our framework to any

parameterized scheduling algorithm, directly improving its

robustness.

Machine learning based approaches. Machine learning has

yet to see many applications in parallel loop scheduling.

In [56], Wang and O’Boyle use compiler generated features

to train classiﬁers that select the best-suited scheduling

strategy for a workload. This approach contrasts with ours

since it does not improve the effectiveness of the cho-

sen scheduling algorithm. On the other hand, Khatami et

al. in [57] recently used a logistic regression model for

predicting the optimal chunk size for a scheduling strat-

egy, combining CSS and work-stealing. Similarly, Laberge

et al. [58] propose a machine-learning based strategy for

accelerating linear algebra applications. These supervised-

learning based approaches are limited in the sense that they

are not yet well understood: their performance is dependent

on the quality of the training data. It is unknown how well

these approaches generalize across workloads from different

application domains. In fact, quantifying and improving

generalization is still a central problem in supervised learn-

ing. Our method is free of these issues since we directly

optimize the performance for a target workload.

7 CONCLUSION

In this paper, we have presented BO FSS, a data-driven,

adaptive loop scheduling algorithm based on BO. The

proposed approach automatically tunes its performance to

the workload using execution time measurements. Also,

unlike scheduling algorithms that are inapplicable to

some workloads, our approach is generally applicable. We

implemented our method on the OpenMP framework and

quantiﬁed its performance as well as its robustness on

realistic workloads. BO FSS has consistently performed well

on a wide range of real workloads, showing that it is

robust compared to other loop scheduling algorithms. Our

approach motivates the development of computer systems

that can automatically adapt to the target workload.


At the moment, BO FSS assumes that the properties of the workload do not change during execution. For this reason, it does not address some crucial scientific workloads, such as adaptive mesh refinement methods, which change dynamically during execution depending on the computation results. It would be interesting to investigate automatic tuning-based scheduling algorithms that can target such workloads in the future.

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their valuable comments that enriched our work, Pedro Henrique Penna for the helpful discussions about the BinLPT scheduling algorithm, Myoung Suk Kim for his insightful comments about our statistical analysis, and Rover Root for his helpful comments about the scientific workloads considered in this work.

REFERENCES

[1] L. Dagum and R. Menon, “OpenMP: An industry standard API

for shared-memory programming,” IEEE Comput. Sci. Eng., vol. 5,

no. 1, pp. 46–55, Jan.–Mar. 1998.

[2] J. Regier, K. Pamnany, K. Fischer, A. Noack, M. Lam, J. Revels,

S. Howard, R. Giordano, D. Schlegel, J. McAuliffe, R. C. Thomas,

and Prabhat, “Cataloging the visible universe through Bayesian

inference at petascale,” in Proc. Int. Parallel Distrib. Process. Symp.,

ser. IPDPS’18, 2018, pp. 44–53.

[3] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr,

E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica, et al., “Exascale deep learning for climate analytics,” in Proc. Int.

Conf. High Perform. Comput. Networking, Storage, Anal., ser. SC ’18,

2018.

[4] A. G. Baydin, L. Shao, W. Bhimji, L. Heinrich, L. Meadows, J. Liu,

A. Munk, S. Naderiparizi, B. Gram-Hansen, G. Louppe, M. Ma,

X. Zhao, P. Torr, V. Lee, K. Cranmer, Prabhat, and F. Wood, “Etalu-

mis: Bringing probabilistic programming to scientiﬁc simulators at

scale,” in Proc. Int. Conf. High Perform. Comput. Networking, Storage,

Anal., ser. SC’19, Denver, CO, USA, Nov. 2019, pp. 1–24.

[5] D. Durand, T. Montaut, L. Kervella, and W. Jalby, “Impact of

memory contention on dynamic scheduling on NUMA multipro-

cessors,” IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 11, pp. 1201–

1214, Nov. 1996.

[6] C. P. Kruskal and A. Weiss, “Allocating independent subtasks on

parallel processors,” IEEE Trans. Softw. Eng., vol. SE-11, no. 10, pp.

1001–1016, Oct. 1985.

[7] S. F. Hummel, E. Schonberg, and L. E. Flynn, “Factoring: A method

for scheduling parallel loops,” Commun ACM, vol. 35, no. 8, pp.

90–101, Aug. 1992.

[8] T. H. Tzen and L. M. Ni, “Trapezoid self-scheduling: A practical

scheduling scheme for parallel compilers,” IEEE Trans. Parallel

Distrib. Syst., vol. 4, no. 1, pp. 87–98, Jan. 1993.

[9] T. Hagerup, “Allocating independent tasks to parallel processors:

An experimental study,” J. Parallel Distrib. Comput., vol. 47, no. 2,

pp. 185–197, Dec. 1997.

[10] S. Lucco, “A dynamic scheduling method for irregular parallel

programs,” in Proc. ACM SIGPLAN 1992 Conf. Program. Lang. Des.

Implementation, ser. PLDI ’92. New York, NY, USA: ACM, 1992,

pp. 200–211.

[11] H. Bast, “On scheduling parallel tasks at twilight,” Theory Comput.

Syst., vol. 33, no. 5-6, pp. 489–563, Dec. 2000.

[12] I. Banicescu and V. Velusamy, “Load balancing highly irregular

computations with the adaptive factoring,” in Proc. Int. Parallel

Distrib. Process. Symp., ser. IPDPS’02, Ft. Lauderdale, FL, 2002, 12 pp.

[13] F. M. Ciorba, C. Iwainsky, and P. Buder, “OpenMP loop scheduling

revisited: Making a case for more schedules,” in Evolving OpenMP

for Evolving Architectures, ser. IWOMP’18. Springer, 2018, pp. 21–

36.

[14] A. Kejariwal, A. Nicolau, and C. D. Polychronopoulos, “History-

aware self-scheduling,” in Proc. Int. Conf. Parallel Process., ser.

ICPP’06. IEEE, 2006.

[15] P. H. Penna, M. Castro, P. Plentz, H. Cota de Freitas, F. Broquedis,

and J.-F. Méhaut, “BinLPT: A novel workload-aware loop scheduler for irregular parallel loops,” in Proc. Simpósio Em Sistemas Computacionais de Alto Desempenho, Campinas, Brazil, Oct. 2017.

[16] P. H. Penna, A. T. A. Gomes, M. Castro, P. D. M. Plentz, H. C. Freitas, F. Broquedis, and J.-F. Méhaut, “A comprehensive perfor-

mance evaluation of the BinLPT workload-aware loop scheduler,”

Concurrency and Computation: Pract. and Experience, Feb. 2019.

[17] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Fre-

itas, “Taking the human out of the loop: A review of Bayesian

optimization,” Proc. IEEE, vol. 104, no. 1, pp. 148–175, Jan. 2016.

[18] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and

M. Zhang, “CherryPick: Adaptively unearthing the best cloud

conﬁgurations for big data analytics,” in Proc. 14th USENIX Symp.

Networked Syst. Des. Implementation, ser. NSDI’17. Boston, MA:

USENIX Association, 2017, pp. 469–482.

[19] B. Letham, B. Karrer, G. Ottoni, and E. Bakshy, “Constrained

Bayesian optimization with noisy experiments,” Bayesian Anal.,

vol. 14, no. 2, pp. 495–519, Aug. 2018.

[20] V. Dalibard, M. Schaarschmidt, and E. Yoneki, “BOAT: Building

auto-tuners with structured Bayesian optimization,” in Proc. 26th

Int. Conf. World Wide Web, ser. WWW ’17. Perth, Australia: ACM

Press, 2017, pp. 479–488.

[21] K.-r. Kim, Y. Kim, and S. Park, “Towards robust data-driven par-

allel loop scheduling using Bayesian optimization,” in IEEE 27th

Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst. Rennes,

FR: IEEE, 2019, pp. 241–248.

[22] K. Swersky, J. Snoek, and R. P. Adams, “Freeze-thaw Bayesian

optimization,” arXiv:1406.3896 [cs, stat], Jun. 2014.

[23] GCC, “GCC, the GNU compiler collection,” Jul. 2018.

[24] C. McPhail, H. R. Maier, J. H. Kwakkel, M. Giuliani, A. Castelletti,

and S. Westra, “Robustness Metrics: How Are They Calculated,

When Should They Be Used and Why Do They Give Different

Results?” Earth’s Future, vol. 6, no. 2, pp. 169–191, Feb. 2018.

[25] L. J. Savage, “The theory of statistical decision,” J. Am. Stat. Assoc.,

vol. 46, no. 253, pp. 55–67, 1951.

[26] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, Liang Wang, and

K. Skadron, “A characterization of the Rodinia benchmark suite

with comparison to contemporary CMP workloads,” in Proc. IEEE

Int. Symp. Workload Characterization, ser. IISWC’10. Atlanta, GA,

USA: IEEE, Dec. 2010, pp. 1–11.

[27] S. Beamer, K. Asanović, and D. Patterson, “The GAP benchmark

suite,” arXiv:1508.03619 [cs], May 2017.

[28] P. Tang and P. C. Yew, “Processor self-scheduling for multiple-

nested parallel loops,” in Proc. Int. Conf. Parallel Process., ser.

ICPP’86. IEEE, Dec. 1986, pp. 528–535.

[29] H. Bast, “Provably optimal scheduling of similar tasks,” Ph.D. thesis, Universität des Saarlandes, Saarbrücken, 2000.

[30] K. K. Yue and D. J. Lilja, “Parallel loop scheduling for high

performance computers,” Adv. Parallel Comput., vol. 10, pp. 243–

264, 1995.

[31] L. E. Flynn and S. F. Hummel, “Scheduling variable-length parallel

subtasks,” IBM Research T. J. Watson Research Center, Tech. Rep.,

Feb. 1990.

[32] J. C. Spall, “An overview of the simultaneous perturbation method

for efﬁcient optimization,” Johns Hopkins Apl Tech. Dig., vol. 19,

no. 4, pp. 482–492, 1998.

[33] Z. Wang and S. Jegelka, “Max-value entropy search for efﬁcient

Bayesian optimization,” in Proc. 34th Int. Conf. Mach. Learn., ser.

ICML’17, vol. 70. JMLR.org, 2017, pp. 3627–3635.

[34] V. S. Adve and M. K. Vernon, “The inﬂuence of random delays on

parallel execution times,” in Proc. 1993 ACM SIGMETRICS Conf.

Meas. Model. Comp. Syst., ser. SIGMETRICS’93. New York, NY,

USA: ACM, 1993, pp. 61–73.

[35] C. E. Rasmussen and Z. Ghahramani, “Occam’s razor,” in Adv.

Neural Inf. Process. Syst. 13, ser. NIPS’13. MIT Press, 2001, pp.

294–300.

[36] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Ma-

chine Learning, ser. Adaptive Comput. Mach. Learn. Cambridge,

Mass: MIT Press, 2006.

[37] R. Martinez-Cantin, K. Tee, and M. McCourt, “Practical Bayesian

optimization in the presence of outliers,” arXiv:1712.04567 [cs,

stat], Dec. 2017.


TABLE 4
Implementation Details of Considered Baselines

CSS [6]: chunk size $K = \left( \frac{\sqrt{2}\, N h}{\sigma P \sqrt{\log P}} \right)^{2/3}$; parameters $h$, $\sigma$, $\mu$ (measured values).

TAPER [10]: $v_\alpha = \alpha \frac{\sigma}{\mu}$, $x_i = \frac{R_i}{P} + \frac{K_{\min}}{2}$, $R_{i+1} = R_i - K_i$, $K_i = \max\left( K_{\min},\; x_i + \frac{v_\alpha^2}{2} - v_\alpha \sqrt{2 x_i + \frac{v_\alpha^2}{4}} \right)$; parameters $v_\alpha = 3$, $K_{\min} = 1$.

TSS [8]: $\delta = \frac{K_f - K_l}{N - 1}$, $K_0 = K_f$, $K_{i+1} = \max(K_i - \delta, K_l)$; parameters $K_f = \frac{N}{2P}$, $K_l = 1$.
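As a reading aid, here is a hedged C transcription of the three chunk-size rules as reconstructed in Table 4; the variable names mirror the table's symbols, the placement of the radicals in the CSS row is our best reading of the original, and the example inputs in main are arbitrary.

```c
#include <math.h>
#include <stdio.h>

/* CSS [6]: a single fixed chunk size computed from the scheduling
 * overhead h and the iteration-time standard deviation sigma. */
static double css_chunk(double N, double P, double h, double sigma) {
    return pow(sqrt(2.0) * N * h / (sigma * P * sqrt(log(P))), 2.0 / 3.0);
}

/* TAPER [10]: decreasing chunks; v_a = alpha * sigma / mu. */
static double taper_chunk(double R_i, double P, double v_a, double K_min) {
    double x_i = R_i / P + K_min / 2.0;
    double K_i = x_i + v_a * v_a / 2.0
               - v_a * sqrt(2.0 * x_i + v_a * v_a / 4.0);
    return K_i > K_min ? K_i : K_min;
}

/* TSS [8]: chunks decrease linearly from K_f to K_l in steps of delta. */
static double tss_chunk(double K_i, double delta, double K_l) {
    double K_next = K_i - delta;
    return K_next > K_l ? K_next : K_l;
}

int main(void) {
    /* Arbitrary example inputs: N = 1e6 iterations, P = 8 workers. */
    printf("CSS:   K   = %.1f\n", css_chunk(1e6, 8.0, 1e-3, 1.0));
    printf("TAPER: K_0 = %.1f\n", taper_chunk(1e6, 8.0, 3.0, 1.0));
    printf("TSS:   K_1 = %.1f\n", tss_chunk(1e6 / 16.0, 10.0, 1.0));
    return 0;
}
```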

[38] A. Shah, A. G. Wilson, and Z. Ghahramani, “Bayesian Optimiza-

tion using Student-t Processes,” in NIPS Workshop on Bayesian

Optimisation, 2013.

[39] D. Duvenaud, J. Lloyd, R. Grosse, J. Tenenbaum, and G. Zoubin,

“Structure discovery in nonparametric regression through compo-

sitional kernel search,” in Proc. 30th Int. Conf. Mach. Learn., ser.

ICML’13, vol. 28. Atlanta, Georgia, USA: PMLR, Jun. 2013, pp.

1166–1174.

[40] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani,

“Predictive Entropy Search for Efﬁcient Global Optimization of

Black-box Functions,” in Adv. Neural Inf. Process. Syst. 27, ser.

NIPS’14, 2014, pp. 918–926.

[41] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian

optimization of machine learning algorithms,” in Adv. Neural Inf.

Process. Syst. 25, ser. NIPS’12. USA: Curran Associates Inc., 2012,

pp. 2951–2959.

[42] M. D. Hoffman and A. Gelman, “The No-U-Turn sampler: Adap-

tively setting path lengths in Hamiltonian Monte Carlo,” J. Mach.

Learn. Res., vol. 15, no. 47, pp. 1593–1623, 2014.

[43] I. Sobol’, “On the distribution of points in a cube and the approx-

imate evaluation of integrals,” USSR Comput. Math. Math. Phys.,

vol. 7, no. 4, pp. 86–112, Jan. 1967.

[44] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A

fresh approach to numerical computing,” SIAM Rev., vol. 59, no. 1,

pp. 65–98, 2017.

[45] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, “Lipschitzian

optimization without the Lipschitz constant,” J Optim Theory Appl,

vol. 79, no. 1, pp. 157–181, Oct. 1993.

[46] S. G. Johnson, The NLopt Nonlinear-Optimization Package, 2011.

[47] H. Ge, K. Xu, and Z. Ghahramani, “Turing: A language for

ﬂexible probabilistic inference,” in Int. Conf. Artif. Intell. Statist.,

ser. AISTATS’18, 2018, pp. 1682–1690.

[48] C. D. Polychronopoulos and D. J. Kuck, “Guided self-scheduling:

A practical scheduling scheme for parallel supercomputers,” IEEE

Trans. Comput., vol. C-36, no. 12, pp. 1425–1439, Dec. 1987.

[49] J. M. Bull, “Measuring synchronisation and scheduling overheads

in OpenMP,” in Proc. 1st Eur. Workshop OpenMP, ser. IWOMP’99,

1999, pp. 99–105.

[50] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, “Group

formation in large social networks: Membership, growth, and

evolution,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery

Data Mining, ser. KDD’06. New York, NY, USA: Association for

Computing Machinery, 2006, pp. 44–54.

[51] D. Gleich, “Wikipedia-20070206,” 2007.

[52] C. Demetrescu, A. Goldberg, and D. Johnson, “9th DIMACS

implementation challenge - shortest paths,” 2006.

[53] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time:

Densiﬁcation laws, shrinking diameters and possible explana-

tions,” in Proc. 11th ACM SIGKDD Int. Conf. Knowl. Discovery

Data Mining, ser. KDD ’05. New York, NY, USA: Association

for Computing Machinery, 2005, pp. 177–187.

[54] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, Dec. 2011.

[55] S. Bak, Y. Guo, P. Balaji, and V. Sarkar, “Optimized execution of

parallel loops via user-deﬁned scheduling policies,” in Proc. 48th

Int. Conf. Parallel Process., ser. ICPP’19. Kyoto, Japan: ACM, Aug.

2019, pp. 1–10.

[56] Z. Wang and M. F. O’Boyle, “Mapping parallelism to multi-cores:

A machine learning based approach,” in Proc. 14th ACM SIGPLAN

Symp. Princ. Pract. Parallel Program., ser. PPoPP’09. New York, NY,

USA: ACM, 2009, pp. 75–84.

[57] Z. Khatami, L. Troska, H. Kaiser, J. Ramanujam, and A. Serio,

“HPX smart executors,” in Proc. 3rd Int. Workshop Extreme Scale

Program. Models Middleware, ser. ESPM2’17, 2017.

[58] G. Laberge, S. Shirzad, P. Diehl, H. Kaiser, S. Prudhomme, and

A. S. Lemoine, “Scheduling optimization of parallel linear algebra

algorithms using supervised learning,” in IEEE/ACM Workshop

Mach. Learn. High Perform. Comput. Environ., ser. MLHPC’19. Den-

ver, CO, USA: IEEE, Nov. 2019, pp. 31–43.

Khu-rai Kim (Student Member, IEEE) is working towards his B.S. degree with the Department of Electronics Engineering, Sogang University, Seoul, South Korea.

His research interests lie in the interplay between machine learning and computer systems, including parallel computing, compiler runtime environments, probabilistic machine learning, and Bayesian inference methods.

Youngjae Kim (Member, IEEE) received the B.S. degree in computer science from Sogang University, South Korea, in 2001, the M.S. degree in computer science from KAIST in 2003, and the Ph.D. degree in computer science and engineering from Pennsylvania State University, University Park, Pennsylvania, in 2009.

He is currently an associate professor with the Department of Computer Science and Engineering, Sogang University, Seoul, South Korea. Before joining Sogang University, he was an R&D staff scientist at the US Department of Energy's Oak Ridge National Laboratory (2009–2015) and an assistant professor at Ajou University, Suwon, South Korea (2015–2016). His research interests include operating systems, file and storage systems, parallel and distributed systems, computer systems security, and performance evaluation.

Sungyong Park (Member, IEEE) received the B.S. degree in computer science from Sogang University, Seoul, South Korea, and the M.S. and Ph.D. degrees in computer science from Syracuse University, Syracuse, New York.

He is currently a professor with the Department of Computer Science and Engineering, Sogang University, Seoul, South Korea. From 1987 to 1992, he worked for LG Electronics, South Korea, as a research engineer. From 1998 to 1999, he was a research scientist at Bellcore, where he developed network management software for optical switches. His research interests include cloud computing and systems, high-performance I/O and storage systems, parallel and distributed systems, and embedded systems.