Received: 31 December 2021 Revised: 5 May 2022 Accepted: 13 July 2022
DOI: 10.1002/cpe.7285
SPECIAL ISSUE PAPER
Efficient exact algorithms for continuous bi-objective
performance-energy optimization of applications with linear
energy and monotonically increasing performance profiles on
heterogeneous high performance computing platforms
Hamidreza Khaleghzadeh1, Ravi Reddy Manumachu2, Alexey Lastovetsky2
1School of Computing, University of Portsmouth, Portsmouth, UK
2School of Computer Science, University College Dublin, Belfield, Dublin, Ireland
Correspondence
Ravi Reddy Manumachu, School of Computer
Science, University College Dublin, Belfield,
Dublin 4, Ireland.
Email: manumachu.reddy@gmail.com
Funding information
Science Foundation Ireland, Grant/Award
Number: 14/IA/2474
Abstract
Performance and energy are the two most important objectives for optimization on heterogeneous high performance computing platforms. This work studies a mathematical problem motivated by the bi-objective optimization of data-parallel applications on such platforms for performance and energy. First, we formulate the problem and present an exact algorithm of polynomial complexity solving the problem where all the application profiles of objective type one are continuous and strictly increasing, and all the application profiles of objective type two are linear increasing. We then apply the algorithm to develop solutions for two related optimization problems of parallel applications on heterogeneous hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. Our proposed solution methods are then employed to solve the two bi-objective optimization problems for two data-parallel applications, matrix multiplication and gene sequencing, on a hybrid platform employing five heterogeneous processors, namely, two different Intel multicore CPUs, an Nvidia K40c GPU, an Nvidia P100 PCIe GPU, and an Intel Xeon Phi.
KEYWORDS
bi-objective optimization, energy optimization, high performance computing, min-max
optimization, min-sum optimization, performance optimization
1 INTRODUCTION
Performance and energy are the two most important objectives for optimization on modern parallel platforms such as supercomputers, heterogeneous high performance computing (HPC) clusters, and cloud infrastructures.1-4 State-of-the-art solutions for the bi-objective optimization problem for performance and energy on such platforms can be broadly classified into system-level and application-level categories.
System-level solution methods aim to optimize the performance and energy of the environment where the applications are executed. The methods employ application-agnostic models and hardware parameters as decision variables. The dominant decision variable in this category is Dynamic Voltage and Frequency Scaling (DVFS).2,3,5-8
Abbreviations: GPU, graphics processing unit; HPC, high performance computing.
These authors contributed equally to this study.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided
the original work is properly cited.
© 2022 The Authors. Concurrency and Computation: Practice and Experience published by John Wiley & Sons Ltd.
Research works9-12 propose application-level solution methods that employ decision variables, which include the number of processes,
number of threads, loop blocking factor, and workload distribution. The solution methods proposed in References 10,11 solve the bi-objective
optimization problem of an application for performance and energy on homogeneous clusters of modern multicore CPUs. The solution method9
considers the effect of heterogeneous workload distribution on bi-objective optimization of data analytics applications by simulating heterogeneity
on homogeneous clusters.
Khaleghzadeh et al.12 study bi-objective optimization of data-parallel applications for performance and energy on heterogeneous processors.
The main contribution of this work is a bi-objective optimization algorithm for the case of discrete performance and energy functions with any arbi-
trary shape. The algorithm returns the Pareto front of load imbalanced solutions and best load balanced solutions. The authors also briefly study
the continuous bi-objective optimization problem but only for the simple case of two heterogeneous processors with linear execution time and linear
dynamic energy functions. They propose an algorithm to find the Pareto front and show that it is linear, containing an infinite number of solutions.
While one solution is load balanced, the rest are load imbalanced. However, they do not present an algorithm to determine the workload distribution
(solution in the decision vector space) corresponding to a point in the Pareto front (solution in the objective space).
In Reference 13, we study a more general continuous bi-objective optimization problem for a generic case of k heterogeneous processors. The problem is motivated by the bi-objective optimization for the performance and energy of data-parallel applications on heterogeneous HPC platforms. We now present a use case that highlights the problem.
Consider, for example, the bi-objective optimization of a highly optimized matrix multiplication application on a heterogeneous computing platform for performance and energy. The application computes the matrix product, C = α × A × B + β × C, where A, B, and C are matrices of size M × N, N × N, and M × N, and α and β are constant floating-point numbers. The application invokes CUBLAS library functions for Nvidia GPUs and Intel MKL DGEMM library functions for CPUs and Intel Xeon Phi. The Intel MKL and CUDA versions used are 2017.0.2 and 9.2.148. Workload sizes range from 64 × 10,112 to 19,904 × 10,112 with a step size of 64 for the first dimension M.
The platform consists of five heterogeneous processors: Intel Haswell E5-2670V3 multi-core CPU (CPU_1), Intel Xeon Gold 6152 multi-core CPU (CPU_2), NVIDIA K40c GPU (GPU_1), NVIDIA P100 PCIe GPU (GPU_2), and Intel Xeon Phi 3120P (XeonPhi_1). Figure 1 presents the details of the computing platform.
Figure 2 shows the execution time functions {f_0(x), ..., f_4(x)} and the dynamic energy functions {g_0(x), ..., g_4(x)} of the processors against the workload size (x). Briefly, there are two types of energy consumption in computing platforms, static and dynamic. The static energy consumption is equal to the execution time of the application multiplied by the static power consumption of the platform. The application's dynamic energy consumption is equal to the platform's total energy consumption (E_T) minus the static energy consumption. The static energy consumption is the idle power of the platform (P_S) multiplied by the application's execution time (t). The static and dynamic energy consumptions during an application execution are obtained using power meters, which is considered the most accurate method of energy measurement.14
The execution time function shapes are continuous and strictly increasing. The energy function shapes can be approximated accurately by linear increasing functions. While the execution time profiles of the two CPUs are close to each other, the energy profile of CPU_1 is significantly higher
FIGURE 1 Specifications of the five heterogeneous processors, Intel Haswell multicore CPU, Nvidia K40c, Intel Xeon Phi 3120P, Intel Skylake
multicore CPU and Nvidia P100 PCIe
FIGURE 2 (A) and (B) contain the execution time and energy profiles of the five heterogeneous processors (Figure 1) employed in the matrix
multiplication application. (C) and (D) are the same except that they do not contain the profiles for Xeon Phi whose energy profile dominates the
other energy profiles. While the execution time profiles of the two CPUs are close to each other, the energy profile of CPU_1 is significantly higher
than that of CPU_2
than that of CPU_2. The optimization goal is to find workload distributions of the workload size n ({x_0, ..., x_4}, ∑_{i=0}^{4} x_i = n) minimizing the execution time (max_{i=0}^{4} f_i(x_i)) and the dynamic energy consumption (∑_{i=0}^{4} g_i(x_i)) during the parallel execution of the application.
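To make the optimization goal concrete, the following minimal Python sketch evaluates a candidate workload distribution against the two objectives; the linear coefficients are hypothetical placeholders, not the measured profiles of the five processors in Figure 2.

```python
# Minimal sketch: evaluating a candidate workload distribution against the two objectives.
# The linear coefficients below are hypothetical placeholders, not the measured profiles.
f = [lambda x, a=a: a * x for a in (1.0, 1.1, 0.4, 0.3, 2.0)]   # execution time models f_i
g = [lambda x, b=b: b * x for b in (3.0, 1.5, 0.8, 0.6, 4.0)]   # dynamic energy models g_i

def objectives(x):
    """Return (T(X), E(X)) for a workload distribution X = {x_0, ..., x_4}."""
    t = max(fi(xi) for fi, xi in zip(f, x))   # parallel execution time: slowest processor
    e = sum(gi(xi) for gi, xi in zip(g, x))   # dynamic energy: summed over all processors
    return t, e

n = 10_000
print(objectives([n / 5] * 5))                # even split, generally not Pareto-optimal
```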
In Reference 13, we solve the continuous optimization problem for such shapes of performance and dynamic energy functions. We first formulate the mathematical problem, which for a given positive real number n aims to find a vector X = {x_0, ..., x_{k−1}} ∈ ℝ^k_{≥0} such that ∑_{i=0}^{k−1} x_i = n, minimizing the max of the k-dimensional vector of functions of objective type one and the sum of the k-dimensional vector of functions of objective type two. We then propose an exact algorithm of polynomial complexity solving the case where all the functions of objective type one are continuous and strictly increasing, and all the functions of objective type two are linear increasing.
In this work, we apply our proposed algorithm to solve two related optimization problems of parallel applications on heterogeneous
hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. We believe both optimiza-
tion problems are pertinent to optimizing applications on modern heterogeneous hybrid HPC platforms due to the following reasons. First, the
thermal design power (TDP) of the multicore CPUs and accelerators has either remained the same or increased with each new genera-
tion. For example, the TDPs of the K40 and P100 GPUs used in our experiments are 235 and 250 W. The TDP of the latest generation
A100 GPU is 250 W. The TDPs of Intel Xeon Gold 6152 and Intel Xeon E5-2670 v3 used in our experiments are 140 and 120 W. The
TDP of the latest generation Icelake Xeon Gold 6354 is 205 W. Second, although the static energy consumption of devices is decreas-
ing with every new generation, the total static energy consumption of a heterogeneous hybrid HPC platform with two or more such
devices is still significant. Finally, the improvements in dynamic power consumption are not similar in magnitude to those for static power
consumption.
We first formulate and solve a bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance
and dynamic energy. The solution to the problem is a straightforward application of our algorithm proposed in Reference 13.
We then formulate a bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and total energy. We prove a theorem that lays the foundation for our algorithm solving the problem. The theorem states that a solution vector X = {x_0, ..., x_{k−1}} ∈ ℝ^k_{≥0}, ∑_{i=0}^{k−1} x_i = n, is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector X1 such that X1 is Pareto-optimal for execution time and dynamic energy and E_T(X1) < E_T(X). Then, we propose an algorithm for solving the problem. The correctness of the algorithm follows from the theorem. Finally, we prove the algorithm has polynomial complexity.
The proposed algorithms (in Reference 13 and this work) are then employed to solve the two bi-objective optimization problems for two
data-parallel applications, matrix multiplication and gene sequencing, employing five heterogeneous processors, two Intel multicore CPUs, an
Nvidia K40c GPU, an Nvidia P100 PCIe GPU, and an Intel Xeon Phi (Figure 1). For the workloads and the platform employed in our experiments, the
algorithms provide continuous piecewise linear Pareto fronts for performance and dynamic energy and performance and total energy where the
performance-optimal point is the load balanced configuration of the application.
Based on our experiments, the maximum dynamic energy savings can be up to 17% while tolerating a performance degradation of 5% (an energy savings of 106 J for an execution time increase of 0.05 seconds) for the matrix multiplication application. The maximum total energy savings is 8%. The dynamic energy and total energy savings for the gene sequencing application accepting a 1% performance hit are 23% and 16%, respectively.
The main original contributions of this work are:
• Mathematical formulations of the bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and dynamic energy and for performance and total energy;
• A theorem that lays the foundation for our algorithm solving the bi-objective optimization problem for performance and total energy. The theorem states that a solution vector X is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector X1 such that X1 is Pareto-optimal for execution time and dynamic energy and E_T(X1) < E_T(X);
• An exact algorithm of polynomial complexity solving the bi-objective optimization problem for performance and total energy and whose correctness follows from the theorem;
• Experimental study of the practical efficacy of our proposed algorithms for optimization of two data-parallel applications, matrix multiplication and gene sequencing, on a platform comprising five heterogeneous processors that include two multicore CPUs, two GPUs, and one Intel Xeon Phi. We demonstrate that the algorithms provide continuous piecewise linear Pareto fronts for performance and dynamic energy and performance and total energy for the workloads and the platform employed in our experiments.
The rest of the paper is organized as follows. We discuss the related work in Section 2. The formulation of the bi-objective optimization problem is presented in Section 3. In Section 4, we propose an efficient and exact algorithm solving the bi-objective optimization problem. Section 5 presents the application of our proposed algorithm to optimization of heterogeneous parallel applications for performance and energy. Section 6 contains the experimental results. Finally, we conclude the paper in Section 7.
2 RELATED WORK
2.1 Bi-objective optimization: background
A bi-objective optimization problem can be mathematically formulated as:15,16
minimize {T(X), E(X)}, subject to X ∈ S,
where there are two objective functions, T: ℝ^k → ℝ and E: ℝ^k → ℝ. We denote the vector of objective functions by f(X) = (T(X), E(X))^T. The decision vectors X = (x_0, ..., x_{k−1})^T belong to the (nonempty) feasible region (set) S, which is a subset of the decision variable space ℝ^k. We denote the image of the feasible region by Z (Z = f(S)), and call it a feasible objective region. It is a subset of the objective space ℝ^2. The elements of Z are called objective (function) vectors or criterion vectors and denoted by f(X) or z = (z_1, z_2)^T, where z_1 = T(X) and z_2 = E(X) are objective (function) values or criterion values.
The objective is to minimize both the objective functions simultaneously. The objective functions are at least partly conflicting or incommensu-
rable, due to which it is impossible to find a single solution that would be optimal for all the objectives simultaneously.
Definition 1. A decision vector X* is Pareto optimal if there does not exist another decision vector X such that T(X) ≤ T(X*), E(X) ≤ E(X*) and either T(X) < T(X*) or E(X) < E(X*) or both.15
An objective vector z* is Pareto optimal if there is not another objective vector z such that z_1 ≤ z*_1, z_2 ≤ z*_2 and z_j < z*_j for at least one index j.
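The definition translates directly into a dominance test; a minimal Python sketch applied to a finite set of illustrative objective vectors follows.

```python
# Minimal sketch of the dominance test behind Definition 1, applied to a finite set of
# objective vectors (t, e). The sample points are purely illustrative.

def dominates(z, z_star):
    """True if z Pareto-dominates z_star: no worse in both objectives, better in at least one."""
    return z[0] <= z_star[0] and z[1] <= z_star[1] and (z[0] < z_star[0] or z[1] < z_star[1])

def pareto_front(points):
    """Keep only the non-dominated objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

print(pareto_front([(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0)]))
# [(1.0, 9.0), (2.0, 5.0), (4.0, 2.0)] -- (3.0, 6.0) is dominated by (2.0, 5.0)
```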
There are several classifications for methods solving bi-objective optimization problems.15,16 Since the set of Pareto optimal solutions is partially ordered, one classification is based on the involvement of the decision-maker in the solution method to select specific solutions. There are four categories in this classification: No preference, A priori, A posteriori, and Interactive. The algorithms solving bi-objective optimization problems can be divided into two major categories, exact methods and metaheuristics. While branch-and-bound is the dominant technique in the first category, the genetic algorithm (GA) is popular in the second category.
2.2 Bi-objective optimization for performance and energy on HPC platforms
There are two principal categories of methods for optimizing applications on HPC platforms for performance and energy.
2.2.1 System-level methods
The first category of system-level solution methods aims to optimize the performance and energy of the executing environment of the applica-
tions. The dominant decision variable in this category is DVFS. DVFS reduces the dynamic power consumed by a processor by throttling its clock
frequency.
Rong et al.17 present a runtime system (CPU MISER) based on DVFS that provides energy savings with minimal performance degradation by
using a performance model. Huang et al.18 propose an eco-friendly daemon that employs workload characterization as a guide to DVFS to reduce
power and energy consumption with little impact on application performance. Mezmaz et al.19 propose a parallel bi-objective GA to maximize the
performance and minimize the energy consumption in cloud computing infrastructures. Fard et al.1 present a four-objective case study comprising
performance, economic cost, energy consumption, and reliability for optimization of scientific workflows in heterogeneous computing environ-
ments. Beloglazov et al.20 propose heuristics that consider twin objectives of energy efficiency and Quality of Service for provisioning data center
resources.
Durillo et al.3 propose a multi-objective workflow scheduling algorithm for optimization of applications executing in heterogeneous high-performance parallel and distributed computing systems. Performance and energy consumption are among the objectives. Das et al.21 propose task mapping to optimize for energy and reliability on multiprocessor systems-on-chip with performance as a constraint. Kolodziej et al.5 propose multi-objective GAs that aim to maximize performance and minimize energy consumption of applications executing in green grid clusters and clouds. The performance is modeled using the computation speed of a processor. The decision variable is the DVFS level. Vaibhav et al.22 present a runtime system that performs both processor and DRAM frequency scaling and demonstrate total energy savings with minimal performance loss. Abdi et al.23 propose multicriteria optimization where they minimize the execution time under three constraints: the reliability, the power consumption, and the peak temperature. DVFS is a key decision variable in all of these research works.
The methods proposed in References 6-8 optimize for performance under an energy budget or optimize for energy under an execution time constraint. The methods proposed in References 2,3,5 solve bi-objective optimization for performance and energy with no time constraint or energy budget.
2.2.2 Application-level methods
The second category of application-level solution methods9-12,24,25 uses application-level decision variables and models. The most popular decision variables include the loop tile size, workload distribution, number of processors, and number of threads.
Tarplee et al.26 employ task-mapping as a decision variable for bi-objective optimization of applications for performance and energy in an HPC platform. Aba et al.27 present an approximation algorithm for bi-objective optimization of parallel applications running on a heterogeneous resource system for performance and total energy. The decision variable is task scheduling. Their algorithm ignores all solutions where energy consumption exceeds a given constraint and returns the solution with minimum execution time.
Reddy et al.11,25 study bi-objective optimization of data-parallel applications for performance and energy on homogeneous clusters of multicore CPUs employing only one decision variable, the workload distribution. They propose an efficient solution method. The method
accepts as input the number of available processors, the discrete function of the processor's energy consumption against the workload size, and the discrete function of the processor's performance against the workload size. It outputs a Pareto-optimal set of workload distributions. Chakraborti et al.9 consider the effect of heterogeneous workload distribution on bi-objective optimization of data analytics applications by simulating heterogeneity on homogeneous clusters. The performance is represented by a linear function of problem size and the total energy is predicted using historical data tables. Khaleghzadeh et al.12 propose a solution method solving the bi-objective optimization problem of data-parallel applications for performance and energy on heterogeneous processors, comprising two principal components. The first component is a data partitioning algorithm that takes as input discrete performance and dynamic energy functions with no shape assumptions. The second component is a novel methodology employed to build the discrete dynamic energy profiles of individual computing devices, which are input to the algorithm.
Khokhriakhov et al.28 propose a novel solution method for bi-objective optimization of multithreaded data-parallel applications for perfor-
mance and dynamic energy on a single multicore processor. The method uses two decision variables, the number of identical multithreaded kernels
(threadgroups) executing the application and the number of threads per threadgroup, with a given workload partitioned equally between the
threadgroups.
3 FORMULATION OF THE BI-OBJECTIVE OPTIMIZATION PROBLEM
Given a positive real number n ∈ ℝ_{>0} and two sets of k functions each, F = {f_0, f_1, ..., f_{k−1}} and G = {g_0, g_1, ..., g_{k−1}}, where f_i, g_i: ℝ_{≥0} → ℝ_{≥0}, i ∈ {0, ..., k−1}, the problem is to find a vector X = {x_0, ..., x_{k−1}} ∈ ℝ^k_{≥0} such that ∑_{i=0}^{k−1} x_i = n, minimizing the objective functions T(X) = max_{i=0}^{k−1} f_i(x_i) and E(X) = ∑_{i=0}^{k−1} g_i(x_i). We use T×E to denote the objective space of this problem, ℝ_{≥0} × ℝ_{≥0}.
Thus, the problem can be formulated as follows:
BOPGV(n, k, F, G):
    T(X) = max_{i=0}^{k−1} f_i(x_i),   E(X) = ∑_{i=0}^{k−1} g_i(x_i)
    minimize_X {T(X), E(X)}
    s.t. x_0 + x_1 + ··· + x_{k−1} = n.    (1)
We aim to solve BOPGV by finding both the Pareto front containing the optimal objective vectors in the objective space T×E and the decision vector for a point in the Pareto front. Thus, our solution finds a set of triplets Ψ = {(T(X), E(X), X)} such that X is a Pareto-optimal decision vector, and the projection of Ψ onto the objective space T×E is the Pareto front symbolized by Ψ_{T×E}.
4 BI-OBJECTIVE OPTIMIZATION PROBLEM FOR MAX OF CONTINUOUS FUNCTIONS AND
SUM OF LINEAR FUNCTIONS
In this section, we solve BOPGV for the case where all functions in the set F are continuous and strictly increasing, and all functions in the set G are linear increasing, that is, G = {g_0, ..., g_{k−1}}, g_i(x) = b_i × x, b_i ∈ ℝ_{>0}, i = 0, ..., k−1. Without loss of generality, we assume that the functions in G are sorted in the decreasing order of coefficients, b_0 ≥ b_1 ≥ ··· ≥ b_{k−1}.
Our solution consists of two algorithms, Algorithms 1 and 2. The first one, which we call LBOPA, constructs the Pareto front of the optimal solutions in the objective space, Ψ_{T×E}. The second algorithm finds the decision vector for a given point in the Pareto front.
The inputs to LBOPA (see Algorithm 1 for pseudo-code) are two sets of k functions each, F and G, and an input value, n ∈ ℝ_{>0}. LBOPA constructs a continuous Pareto front, consisting of k−1 segments {s_0, s_1, ..., s_{k−2}}. Each segment s_i has two endpoints, (t_i, e_i) and (t_{i+1}, e_{i+1}), which are connected by the curve Pf(t) = b_i × n − ∑_{j=i+1}^{k−1} (b_i − b_j) × f_j^{−1}(t) (0 ≤ i ≤ k−2). Figure 3 illustrates the functions in the sets F and G when all functions in F are linear, f_i(x) = a_i × x. In this particular case, the Pareto front returned by LBOPA will be piecewise linear, Pf(t) = b_i × n − t × ∑_{j=i+1}^{k−1} (b_i − b_j)/a_j (0 ≤ i ≤ k−2), as shown in Figure 3.
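As a concrete illustration with hypothetical coefficients (not the measured profiles), take k = 2, f_0(x) = x, f_1(x) = 2x, g_0(x) = 3x, g_1(x) = x (so b_0 ≥ b_1), and n = 10. The load-balanced distribution x_0 = 20/3, x_1 = 10/3 gives t_min = 20/3 with dynamic energy 3 × 20/3 + 10/3 = 70/3, while assigning the whole workload to the energy-cheaper processor gives t_max = f_1(10) = 20 with energy b_1 × n = 10. The single segment of the front is Pf(t) = b_0 × n − t × (b_0 − b_1)/a_1 = 30 − t on t ∈ [20/3, 20], which indeed passes through both endpoints.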
The main loop of Algorithm 1 computes the k end-points of the segments of the Pareto front (Lines 3-7). In an iteration i, the minimum value of objective T, t_i, is obtained using the algorithm solving the single-objective min-max optimization problem, min_X {max_{j=i}^{k−1} f_j(x_j)}. We do not present the details of this algorithm. Depending on the shapes of the functions {f_0, ..., f_{k−1}}, one of the existing polynomial algorithms solving this problem can be employed.29,30
The end point (t_min, e_max) = (t_0, e_0) represents decision vectors with the minimum value of objective T and the maximum value of objective E, while the end point (t_max, e_min) = (t_{k−1}, e_{k−1}) represents decision vectors with the maximum value of objective T and the minimum value of objective E (as illustrated for the case of all linear increasing functions in Figure 3).
FIGURE 3 Sets F and G of k linear increasing functions each. Functions in G are arranged in the decreasing order of slopes. LBOPA returns a piecewise linear Pareto front shown in Figure (C) comprising a chain of k−1 linear segments.
Algorithm 1. Algorithm LBOPA constructing the Pareto front of the optimal solutions, minimizing the max of continuous and strictly increasing functions and the sum of linear increasing functions, in the objective space T×E
1: function LBOPA(n, k, F, G)
2:   S ← ∅
3:   for i ← 0, k−1 do
4:     t_i ← min_X {max_{j=i}^{k−1} f_j(x_j)}
5:     e_i ← b_i × n − ∑_{j=i+1}^{k−1} (b_i − b_j) × f_j^{−1}(t_i)
6:     S ← S ∪ (t_i, e_i)
7:   end for
8:   for i ← 0, k−2 do
9:     Connect (t_i, e_i) and (t_{i+1}, e_{i+1}) by the curve b_i × n − ∑_{j=i+1}^{k−1} (b_i − b_j) × f_j^{−1}(t)
10:   end for
11:   return S
12: end function
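A minimal Python sketch of LBOPA for the all-linear case (f_i(x) = a_i × x, g_i(x) = b_i × x with hypothetical coefficients) is given below; for general continuous strictly increasing f_i, the load-balancing line would be replaced by one of the polynomial min-max solvers of References 29,30.

```python
# Minimal sketch of LBOPA for the all-linear case f_i(x) = a_i * x, g_i(x) = b_i * x,
# with b sorted in decreasing order as the algorithm assumes. The coefficients are
# hypothetical; for general continuous strictly increasing f_i, the load-balancing
# line below would be replaced by a polynomial min-max solver.

def lbopa_linear(n, a, b):
    k = len(a)
    endpoints = []
    for i in range(k):
        # t_i: optimal min-max time when n is distributed over processors i..k-1 only
        t_i = n / sum(1.0 / a[j] for j in range(i, k))
        # e_i: minimal dynamic energy among distributions with T(X) = t_i
        e_i = b[i] * n - sum((b[i] - b[j]) * t_i / a[j] for j in range(i + 1, k))
        endpoints.append((t_i, e_i))
    # consecutive endpoints are connected by the linear pieces Pf(t) given in the text
    return endpoints

print(lbopa_linear(10, a=[1.0, 2.0], b=[3.0, 1.0]))
# [(6.666..., 23.333...), (20.0, 10.0)] -- matches the two-processor example above
```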
Given an input t ∈ [t_0, t_{k−1}], PARTITION (Algorithm 2) finds a decision vector X = {x_0, x_1, ..., x_{k−1}} such that ∑_{i=0}^{k−1} x_i = n, max_{i=0}^{k−1} f_i(x_i) = t, and ∑_{i=0}^{k−1} g_i(x_i) is minimal. The algorithm first initializes X with {x_0, x_1, ..., x_{k−1} ∣ x_i = f_i^{−1}(t)} (Line 2) so that f_i(x_i) = t for all i ∈ {0, ..., k−1}. For this initial X the condition max_{i=0}^{k−1} f_i(x_i) = t is already satisfied but ∑_{i=0}^{k−1} x_i may be either equal to n or greater than n. If ∑_{i=0}^{k−1} x_i = n, then this initial X will be the only decision vector such that ∑_{i=0}^{k−1} x_i = n and max_{i=0}^{k−1} f_i(x_i) = t and hence the unique (Pareto-optimal) solution. Otherwise, ∑_{i=0}^{k−1} x_i = n + n_plus where n_plus > 0. In that case, this initial vector X will maximize both ∑_{i=0}^{k−1} x_i and ∑_{i=0}^{k−1} g_i(x_i) in the set of all vectors in the decision space satisfying the condition max_{i=0}^{k−1} f_i(x_i) = t.
The algorithm then iteratively reduces elements of vector X until their sum becomes equal to n. Obviously, each such reduction will also reduce ∑_{i=0}^{k−1} g_i(x_i). To achieve the maximum reduction of ∑_{i=0}^{k−1} g_i(x_i), the algorithm starts from the vector element x_i, the reduction of which by an arbitrary amount Δx will result in the maximum reduction of ∑_{i=0}^{k−1} g_i(x_i). In our case, it will be x_0 as the functions in G are sorted in the decreasing order of coefficients b_i. Thus, at the first reduction step, the algorithm will try to reduce x_0 by n_plus. If x_0 ≥ n_plus, it will succeed and find a Pareto-optimal decision vector X = {x_0 − n_plus, x_1, ..., x_{k−1}}. If x_0 < n_plus, it will reduce n_plus by x_0, set x_0 = 0 and move to the second step. At the second step, it will try to reduce x_1 by the reduced n_plus, and so on. This way the algorithm minimizes ∑_{i=0}^{k−1} g_i(x_i), preserving max_{i=0}^{k−1} f_i(x_i) = t and achieving ∑_{i=0}^{k−1} x_i = n.
Algorithm 2. Algorithm finding a Pareto-optimal decision vector X = {x_0, x_1, ..., x_{k−1}} for the problem BOPGV(n, k, F, G), where functions in F are continuous and strictly increasing and functions in G are linear increasing, for a given point (t, e) from the Pareto front of this problem, (t, e) ∈ Ψ_{T×E}. Only the first coordinate of the input point, t, is required for this algorithm
1: function PARTITION(n, k, F, t)
2:   X ← {x_0, ..., x_{k−1} ∣ x_i ← f_i^{−1}(t)}
3:   n_plus ← ∑_{i=0}^{k−1} x_i − n
4:   if n_plus < 0 then return (0, 0, ∅) end if
5:   i ← 0
6:   while (n_plus > 0) ∧ (i < k−1) do
7:     if x_i ≥ n_plus then
8:       x_i ← x_i − n_plus
9:       n_plus ← 0
10:     else
11:       n_plus ← n_plus − x_i
12:       x_i ← 0
13:       i ← i + 1
14:     end if
15:   end while
16:   if n_plus > 0 then return (0, 0, ∅) end if
17:   return X
18: end function
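A minimal Python sketch of PARTITION for the linear case f_i(x) = a_i × x (so that f_i^{−1}(t) = t/a_i), with hypothetical coefficients, is shown below; it mirrors the pseudocode but is not the authors' implementation.

```python
# Minimal sketch of PARTITION for the linear case f_i(x) = a_i * x, so f_i^{-1}(t) = t/a_i.
# It returns a Pareto-optimal decision vector X for a time coordinate t taken from the
# Pareto front, or None when no such vector exists. The coefficients are hypothetical.

def partition_linear(n, a, t):
    k = len(a)
    x = [t / a[i] for i in range(k)]          # initial X with f_i(x_i) = t for every i
    n_plus = sum(x) - n
    if n_plus < 0:                            # t is too small for the given n
        return None
    i = 0
    while n_plus > 0 and i < k - 1:           # shave the excess off the costliest processors first
        if x[i] >= n_plus:
            x[i] -= n_plus
            n_plus = 0
        else:
            n_plus -= x[i]
            x[i] = 0
            i += 1
    if n_plus > 0:                            # t is too big: no Pareto-optimal X with T(X) = t
        return None
    return x

print(partition_linear(10, a=[1.0, 2.0], t=8.0))
# [6.0, 4.0]: f_0(6) = 6 <= 8, f_1(4) = 8, and the parts sum to n = 10
```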
The correctness of LBOPA and PARTITION is proved in Theorem 1.
Theorem 1. Consider the bi-objective optimization problem BOPGV(n, k, F, G) where all functions in F are continuous and strictly increasing and G = {g_i(x) ∣ g_i(x) = b_i × x, b_i ∈ ℝ_{>0}, i ∈ {0, ..., k−1}}. Then, the piecewise function S, returned by LBOPA(n, k, F, G) (Algorithm 1) and consisting of k−1 segments, is the Pareto front of this problem, Ψ_{T×E}, and for any (t, e) ∈ Ψ_{T×E}, Algorithm 2 returns a Pareto-optimal decision vector X such that T(X) = t and E(X) = e.
Proof. First, consider Algorithm 2 and arbitrary input parameters n > 0 and t > 0. If after the initialization of X (Line 2) we will have ∑_{i=0}^{k−1} x_i < n, it means that t is too small for the given n, and for any vector Y = {y_0, y_1, ..., y_{k−1}} such that ∑_{i=0}^{k−1} y_i = n, max_{i=0}^{k−1} f_i(y_i) > t. In this case, there is no solution to the optimization problem, and the algorithm terminates abnormally.
Otherwise, the algorithm enters the while loop (Line 6). If i < k−1 upon exit from this loop, then the elements of vector X will be calculated as
x_j = 0 for j < i,   x_j = n − ∑_{m=j+1}^{k−1} f_m^{−1}(t) for j = i,   x_j = f_j^{−1}(t) for j > i,    (2)
and therefore satisfy the conditions ∑_{j=0}^{k−1} x_j = n and max_{j=0}^{k−1} f_j(x_j) = t. Moreover, the total amount of n will be distributed in X between vector elements with higher indices, which have lower G cost, g_i(x), because b_i ≥ b_{i+1}, i ∈ {0, ..., k−2}. Therefore, for any other vector Y = {y_0, y_1, ..., y_{k−1}} satisfying these two conditions, we will have ∑_{i=0}^{k−1} g_i(y_i) ≥ ∑_{i=0}^{k−1} g_i(x_i). Indeed, such a vector Y can be obtained from X by relocating certain amounts from vector elements with higher indices to vector elements with lower indices, which will increase the G cost of the relocated amounts. Thus, when the algorithm exits from the while loop with i < k−1, it returns a Pareto-optimal vector X.
If the algorithm exits from the while loop with i = k−1, it will mean that t is too big for the given n. We would still have n_plus > 0 to take off the last vector element, x_{k−1}, but if we did it, we would make max_{j=0}^{k−1} f_j(x_j) < t. This way we would construct for the given n a decision vector, which minimizes ∑_{i=0}^{k−1} g_i(x_i) but whose max_{j=0}^{k−1} f_j(x_j) will be less than t, which means that no decision vector X such that max_{j=0}^{k−1} f_j(x_j) = t can be Pareto optimal. Therefore, in this case the algorithm also terminates abnormally.
Thus, for any t ∈ T, Algorithm 2 either finds a Pareto-optimal decision vector X such that T(X) = t and E(X) = ∑_{i=0}^{k−1} b_i × x_i = e, or returns abnormally if such a vector does not exist. Let Algorithm 2 return normally, and the loop variable i be equal to s upon exit from the loop. Then, according to Formula (2), e = ∑_{i=0}^{k−1} b_i × x_i = b_s × (n − ∑_{i=s+1}^{k−1} f_i^{−1}(t)) + ∑_{i=s+1}^{k−1} (b_i × f_i^{−1}(t)) = b_s × n − ∑_{i=s+1}^{k−1} (b_s − b_i) × f_i^{−1}(t), where s, n, b_i, b_s, a_i are all known constants. Therefore, the Pareto front e = Pf(t) can be expressed as follows:
e = Pf(t) = b_s × n − ∑_{i=s+1}^{k−1} (b_s − b_i) × f_i^{−1}(t),
t_min = min_X {max_{j=0}^{k−1} f_j(x_j)},   t_max = f_{k−1}(n),
t ∈ [t_min, t_max],   s ∈ ℤ ∩ [0, k−2],
which is the analytical expression of the piecewise function constructed by Algorithm 1 (LBOPA).
Theorem 2. The time complexity of LBOPA (Algorithm 1) is O(k^3 × log_2 n). The time complexity of PARTITION (Algorithm 2) is O(k).
Proof. The for loop in LBOPA (Algorithm 1, Lines 3-7) has k iterations. At each iteration i, the computation of t_i has a time complexity of O(k^2 × log_2 n),29 the computation of e_i has a time complexity of O(k), and the insertion of the point in the set has complexity O(1). Therefore, the time complexity of the loop is O(k^3 × log_2 n). The time complexity of the loop (Lines 8-10) is O(k). Therefore, the time complexity of LBOPA is O(k^3 × log_2 n).
Let us consider the PARTITION algorithm. The initialization of X (Line 2) and the computation of n_plus each have a time complexity of O(k). The while loop (Lines 6-15) iterates as long as n_plus > 0 and i < k−1, of which i < k−1 is the worst case scenario. The time complexity of the loop is, therefore, O(k). Therefore, the time complexity of PARTITION is bounded by O(k).
5 APPLICATION OF THE BI-OBJECTIVE ALGORITHMS TO OPTIMIZATION OF
HETEROGENEOUS PARALLEL APPLICATIONS FOR PERFORMANCE AND ENERGY
In this section, we apply the LBOPA and PARTITION algorithms to optimization of heterogeneous parallel applications for performance and energy. We look at two bi-objective optimization problems, performance and dynamic energy and performance and total energy.
5.1 Bi-objective optimization for performance and dynamic energy
The bi-objective optimization problem for performance and dynamic energy is a direct application of BOPGV where the functions in F and G are the execution time and dynamic energy functions, respectively. f_i(x) is the execution time of the problem size x on processor P_i and g_i(x) represents the amount of dynamic energy consumed by P_i to execute the problem size x.
The solution to the problem is given by the LBOPA and PARTITION algorithms for the case where all functions in the set F are continuous and strictly increasing, and all functions in the set G are linear increasing. The algorithms determine a set of tuples, Ψ_DE = {(T(X), E(X), X)}, where X is a Pareto-optimal decision vector and T(X) and E(X) are the execution time and the dynamic energy consumption corresponding to X. The projection of Ψ_DE onto the objective space T×E is the Pareto front symbolized by Ψ_DE^{T×E}. The projection of Ψ_DE onto the decision vector space is the set of workload distributions signified by Ψ_DE^{X}.
If the application requires integer solutions, X_I = {x_0, ..., x_{k−1}} ∈ ℤ^k_{≥0} such that ∑_{i=0}^{k−1} x_i = n, we will find the closest approximation to the real-valued solution vector, X, output by PARTITION, in the Euclidean space.
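A simple way to obtain such an integer approximation is sketched below in Python; the rounding-and-residual-redistribution heuristic and its helper name are illustrative, and only approximate the exact Euclidean-closest search.

```python
# Minimal sketch, assuming integer workload units: round the real-valued distribution
# returned by PARTITION and redistribute the rounding residual so that the sum is n.
# This simple heuristic approximates (but is not identical to) the Euclidean-closest
# integer vector mentioned above.

def round_to_integer_distribution(x, n):
    xi = [int(round(v)) for v in x]
    diff = n - sum(xi)                                  # residual caused by rounding
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    j = 0
    step = 1 if diff > 0 else -1
    while diff != 0:
        i = order[j % len(order)]
        if xi[i] + step >= 0:                           # keep the distribution non-negative
            xi[i] += step
            diff -= step
        j += 1
    return xi

print(round_to_integer_distribution([3333.4, 3333.3, 3333.3], 10_000))
# [3334, 3333, 3333]
```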
5.2 Bi-objective optimization for performance and total energy
We start with the formulation of the bi-objective optimization problem for performance and total energy. It is an extension of BOPGV.
5.2.1 Problem formulation
Consider a workload size n executed using k heterogeneous processors, whose execution time and dynamic energy functions are given by the two sets, F and G. Let P_S ∈ ℝ_+ be the static power consumption of the platform, which is a constant. The problem is then to find a vector X = {x_0, ..., x_{k−1}} ∈ ℝ^k_{≥0} such that ∑_{i=0}^{k−1} x_i = n, minimizing the objective functions T(X) and E_T(X) = E(X) + P_S × T(X). We use T×E_T to denote the objective space of this problem, ℝ_{≥0} × ℝ_{≥0}.
Thus, the problem can be formulated as follows:
BOPPTE(n, k, F, G, P_S):
    T(X) = max_{i=0}^{k−1} f_i(x_i),   E(X) = ∑_{i=0}^{k−1} g_i(x_i)
    minimize_X {T(X), E(X) + P_S × T(X)}
    s.t. x_0 + x_1 + ··· + x_{k−1} = n.    (3)
Our solution for BOPPTE finds a set of tuples, Ψ_TE = {(T(X), E_T(X), X)}, where X is a Pareto-optimal decision vector, and T(X) and E_T(X) are the execution time and the total energy consumption corresponding to X. The projection of Ψ_TE onto the objective space T×E_T is the Pareto front symbolized by Ψ_TE^{T×E_T}. The projection of Ψ_TE onto the decision vector space is the set of solutions (workload distributions) represented by Ψ_TE^{X}.
5.2.2 Solution using the bi-objective algorithms LBOPA and PARTITION
In this section, we propose a solution that employs the LBOPA and PARTITION algorithms to solve BOPPTE for the case where all functions in the
set Fare continuous and strictly increasing, and all functions in the set Gare linear increasing.
The solution is based on the following observations. Let S be the space of all feasible solutions to the BOPGV problem, that is, a set consisting of all vectors X = {x_0, ..., x_{k−1}} ∈ ℝ^k_{≥0} such that ∑_{i=0}^{k−1} x_i = n. Let Z be the image of S in the objective space T×E. Then, S will also be the space of all feasible solutions to the BOPPTE problem. However, its image in the BOPPTE objective space, T×E_T, will be different, obtained by moving each point (t, e) ∈ Z vertically by P_S × t. This transformation guarantees that no solution X, which is not Pareto optimal for BOPGV, will become Pareto optimal for BOPPTE. Indeed, as we consider the case when all functions in F are continuous and strictly increasing, and all functions in G are linear increasing, the Pareto front constructed by LBOPA for the BOPGV problem will be a continuous decreasing function. Therefore, there exists a BOPGV Pareto-optimal X′, which dominates X so that T(X′) = T(X) and E(X′) < E(X). The images of X and X′ in the BOPPTE objective space will be (T(X), E(X) + P_S × T(X)) and (T(X′), E(X′) + P_S × T(X′)), respectively. As (T(X′), E(X′) + P_S × T(X′)) < (T(X), E(X) + P_S × T(X)), X′ will be dominating X in the BOPPTE space as well.
Thus, solutions which are not BOPGV Pareto optimal cannot be BOPPTE Pareto optimal. It means that if a solution is BOPPTE Pareto optimal, then it must be BOPGV Pareto optimal; that is, the BOPPTE set of Pareto-optimal solutions is a subset of the BOPGV set of Pareto-optimal solutions. By construction, the image of the set of BOPGV Pareto-optimal solutions in the BOPPTE objective space will be a continuous function of objective T but not necessarily decreasing. Therefore, different points belonging to this function will have different T coordinates but may have the same E_T coordinate. Obviously, if we have two points with the same E_T coordinate, the one with the greater T coordinate will be the image of the inferior solution, which should be removed from the BOPPTE set of Pareto-optimal solutions.
These observations can be summarised as the following theorem.
Theorem 3. A solution vector X is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector X1 such that X1 is Pareto-optimal for execution time and dynamic energy and E_T(X1) < E_T(X).
We now propose an algorithm LBOPA-TE that along with PARTITION solves BOPPTE for the case where all functions in the set F are continuous and strictly increasing, and all functions in the set G are linear increasing. It returns the Pareto front for execution time and total energy. The correctness of LBOPA-TE follows from Theorem 3. It invokes LBOPA to obtain the piecewise linear Pareto front for execution time and dynamic energy.
LBOPA-TE (Algorithm 3) takes as input the workload size n; the two sets of k functions, {F, G}; the base power of the computing platform, P_S; and the machine precision, ε. It returns a piecewise linear Pareto front for execution time and total energy. The Pareto front is a set of segments, S_TE, where a segment s_i is represented by a 4-tuple, (t_i, e_i, t_{i+1}, e_{i+1}), with the left endpoint, (t_i, e_i), and the right endpoint, (t_{i+1}, e_{i+1}). The coordinates of the points in each segment have indices {0, 1, 2, 3}. Therefore, S_TE[i][j] gives the jth coordinate in the ith segment.
Given the first coordinate of a point, (t, e_t) ∈ Ψ_TE^{T×E_T}, the workload size n, and the set F, the PARTITION algorithm returns the workload distribution associated with the point.
We now explain the main steps of LBOPA-TE (Algorithm 3). Line 2 contains the call to LBOPA to determine the piecewise linear Pareto front for execution time and dynamic energy. In the for loop (Lines 4-7), the structure S_tmp, containing tuples (line segments) for execution time and total energy, is initialized. The slopes of the line segments are determined and stored in ∇tmp (Line 6). In Line 9, the output Pareto front, S_TE, is initialized with the performance-optimal point, (t_min, e_max, 𝜙, 𝜙), corresponding to the load-balanced workload distribution. The second for loop (Lines 10-20) iterates through the line segments, ∀i ∈ S_tmp, i = {0, ..., k−2}. The first if condition checks whether the slope of the line segment i, ∇tmp[i], is greater than or equal to 0, or the energy of its right endpoint is greater than e_min. If so, the line segment i is not added to S_TE since all the solutions lying on it are dominated by the solution with the energy e_min.
Algorithm 3. Algorithm LBOPA-TE determining the Pareto front for execution time and total energy
1: function LBOPA-TE(n, k, F, G, P_S, ε)
2:   {(t_i, e_i, t_{i+1}, e_{i+1}) ∣ i = 0, ..., k−2} ← LBOPA(n, k, F, G)
3:   S_tmp ← Φ; ∇tmp ← Φ
4:   for i ← 0, k−2 do
5:     S_tmp[i] ← (t_i, e_i + P_S × t_i, t_{i+1}, e_{i+1} + P_S × t_{i+1})
6:     ∇tmp[i] ← (e_{i+1} − e_i + P_S × (t_{i+1} − t_i)) / (t_{i+1} − t_i)
7:   end for
8:   t_min ← S_tmp[0][0]; e_max ← S_tmp[0][1]; e_min ← S_tmp[0][3]
9:   S_TE ← {(t_min, e_max, 𝜙, 𝜙)}
10:   for i ← 0, k−2 do
11:     if (∇tmp[i] ≥ 0) Or (S_tmp[i][3] > e_min) then
12:       continue
13:     else if (S_tmp[i][1] > e_min) And (S_tmp[i][3] < e_min) then
14:       e_poi ← e_min − ε; t_poi ← S_tmp[i][0] + (e_poi − S_tmp[i][1]) / ∇tmp[i]
15:       S_TE[i] ← (t_poi, e_poi, S_tmp[i][2], S_tmp[i][3])
16:       e_min ← S_tmp[i][3]
17:     else
18:       S_TE[i] ← S_tmp[i]; e_min ← S_tmp[i][3]
19:     end if
20:   end for
21:   return S_TE
22: end function
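A minimal Python sketch of the post-processing performed by LBOPA-TE follows, assuming the segments come from LBOPA as (t_i, e_i, t_{i+1}, e_{i+1}) tuples; it mirrors the pseudocode above but is not the authors' implementation.

```python
# Minimal sketch of the LBOPA-TE post-processing: lift the time/dynamic-energy Pareto
# front by the static energy P_S * t and drop the parts dominated in total energy.
# Segment endpoints are assumed to come from LBOPA as (t_i, e_i, t_{i+1}, e_{i+1}) tuples.

def lbopa_te(segments, p_s, eps=1e-9):
    lifted = [(t0, e0 + p_s * t0, t1, e1 + p_s * t1) for (t0, e0, t1, e1) in segments]
    e_min = lifted[0][3]                  # total energy of the right endpoint of segment 0
    out = []                              # kept (fragments of) segments of the total-energy front
    for (t0, e0, t1, e1) in lifted:
        slope = (e1 - e0) / (t1 - t0)
        if slope >= 0 or e1 > e_min:
            continue                      # every point of this segment is dominated
        if e0 > e_min and e1 < e_min:
            t_cut = t0 + (e_min - eps - e0) / slope   # intersection with y = e_min - eps
            out.append((t_cut, e_min - eps, t1, e1))
        else:
            out.append((t0, e0, t1, e1))
        e_min = e1
    # if out is empty, only the performance-optimal (load-balanced) point remains Pareto-optimal
    return out
```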
Lines 13-17 represent the case when e_min lies between the energies of the endpoints of the line segment i, signifying that a fragment of the line segment satisfies Pareto-optimality. The point of intersection, (t_poi, e_min − ε), of the line y = e_min − ε and the line segment with the slope ∇tmp[i] is determined. The line segment represented by the tuple (t_poi, e_poi, S_tmp[i][2], S_tmp[i][3]) is Pareto-optimal and is added to S_TE. Line 18 represents the case of the line segment whose points satisfy Pareto-optimality. Therefore, it is stored in S_TE at index i.
If no solutions are added to S_TE in the for loop, then S_TE will contain only the performance-optimal point corresponding to the load-balanced workload distribution.
We illustrate LBOPA-TE using an example shown in Figure 4. The number of processors employed in the example is four. The static power consumption, P_S, is assumed to be 5 W. The Pareto front for execution time and dynamic energy (Ψ_DE^{T×E}) is given by the blue line in Figure 4A. It contains four segments. The static energy consumption as a function of execution time (5 × t) is shown as an orange line. Figure 4B shows the execution time versus total energy curve (highlighted in green) obtained by adding the static energy consumptions to the energies in the execution time and dynamic energy Pareto front. In Figure 4C, the solutions highlighted in red are the non-Pareto-optimal solutions removed by LBOPA-TE in Lines 13-17. The output Pareto front for execution time and total energy, Ψ_TE^{T×E_T}, is shown in Figure 4D.
Theorem 4. The time complexity of LBOPA-TE is O(k^3 × log_2 n).
Proof. LBOPA-TE invokes LBOPA in Line 2. The time complexity of LBOPA is O(k^3 × log_2 n). In the loop (Lines 4-7), the insertion of a line segment represented by a 4-tuple in Line 5 and the insertion of the segment slope in Line 6 each have a time complexity of O(1). In the loop (Lines 9-20), the time complexity of adding a line segment to S_TE is also O(1). Therefore, the for loops (Lines 4-7, Lines 9-20) each have a time complexity of O(k). Hence, the time complexity of LBOPA-TE is O(k^3 × log_2 n).
6 EXPERIMENTAL RESULTS AND DISCUSSION
We analyze the proposed algorithms for two data-parallel applications, matrix multiplication and gene sequencing, executed on a platform comprising the five heterogeneous processors illustrated in Figure 1.
We first describe the methodology to construct the discrete execution time and the dynamic energy profiles based on system-level physical
power measurements using power meters for the processors involved in the execution of our applications. We then present the applications and
the experimental results.
FIGURE 4 Example illustrating LBOPA-TE using four processors. The static power assumed is 5 W. (A) shows the piecewise linear Pareto
front for the execution time and dynamic energy. (B) displays the piecewise linear curve for execution time and total energy obtained by adding the
static energy consumption to the Pareto front for the execution time and dynamic energy in the first for loop of the algorithm. (C) shows the
non-Pareto-optimal solutions highlighted in red that are removed. The non-Pareto-optimal solutions are dominated by the solution displayed as a
solid green circle. (D) shows the output Pareto front for execution time and total energy.
Our platform is equipped with WattsUpPro power meters between the wall A/C outlets and the input power sockets. The power meters cap-
ture the total power consumption of the node. They have data cables connected to USB ports of the node. A Perl script collects the data from the
power meter using the serial USB interface. The execution of these scripts is nonintrusive and consumes insignificant power. The power meters are
periodically calibrated using an ANSI C12.20 revenue-grade power meter, Yokogawa WT210. The maximum sampling speed of the power meters is
one sample every second. The accuracy specified in the data sheets is ±3%. The minimum measurable power is 0.5 watts. The accuracy at 0.5 W is
±0.3 W.
To ensure the reliability of our results, we follow a statistical methodology where a sample mean for a response variable (energy, time, PMC,
utilization variables) is obtained from multiple experimental runs. The sample mean is calculated by executing the application repeatedly until it lies
in the 95% confidence interval and a precision of 0.025 (2.5%) is achieved. For this purpose, Student’s t-test is used assuming that the individual
observations are independent and their population follows the normal distribution. We verify the validity of these assumptions using Pearson’s
chi-squared test.
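A minimal sketch of such a measurement loop follows, assuming SciPy is available for the Student's t quantile; run_experiment() is a placeholder for one timed or metered application run, not part of our tooling.

```python
# Minimal sketch of the measurement loop, assuming SciPy: repeat an experiment until the
# sample mean lies within the 95% confidence interval with 2.5% relative precision.
# run_experiment() is a placeholder for one timed or metered application run.
import statistics
from scipy import stats

def measure(run_experiment, precision=0.025, confidence=0.95, min_runs=3, max_runs=1000):
    samples = []
    for _ in range(max_runs):
        samples.append(run_experiment())
        if len(samples) < min_runs:
            continue
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5      # standard error of the mean
        t_crit = stats.t.ppf((1 + confidence) / 2, df=len(samples) - 1)
        if t_crit * sem <= precision * mean:                       # CI half-width within 2.5%
            return mean
    return statistics.mean(samples)
```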
6.1 Methodology to construct execution time and dynamic energy profiles
We employ an experimental methodology14,31 that accurately models the energy consumption by a hybrid data-parallel application executing on
a heterogeneous HPC platform containing different computing devices using system-level power measurements provided by power meters. The
automated software tool, HCLWATTSUP,32 provides the dynamic and total energy consumptions based on system-level physical power measure-
ments using power meters. The tool has no overhead and, therefore, does not influence the energy consumption of the application. HCLWATTSUP
gives the static power consumption of our platform when it does not execute any application. Based on our measurements, it is 410 W.
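For reference, a minimal sketch of how the dynamic energy of a run is derived from the metered total energy and the static power; the function name is illustrative.

```python
# Minimal sketch: derive the dynamic energy of a run from the metered total energy and
# the static (idle) power of the platform (410 W on our platform). Names are illustrative.
STATIC_POWER_W = 410.0

def dynamic_energy(total_energy_joules, exec_time_seconds, p_static=STATIC_POWER_W):
    return total_energy_joules - p_static * exec_time_seconds
```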
A data-parallel application executing on this heterogeneous hybrid platform consists of several kernels (generally speaking, multithreaded) running in parallel on different computing devices of the platform. The proposed algorithms for solving the bi-objective optimization problem for performance and energy require the individual performance and dynamic energy profiles of all the kernels.
Due to tight integration and severe resource contention in heterogeneous hybrid platforms, the load of one computational kernel may signif-
icantly impact others’ performance to the extent of preventing the ability to model the performance and energy consumption of each kernel in
hybrid applications individually.33 Therefore, we only consider configurations where one CPU kernel or accelerator kernel runs on the correspond-
ing device. Each group of cores executing an individual kernel of the application is modeled as an abstract processor so that the executing platform is
represented as a set of heterogeneous abstract processors. We ensure that the sharing of system resources is maximized within groups of compu-
tational cores representing the abstract processors and minimized between the groups. This way, the contention and mutual dependence between
abstract processors are minimized.
We thus model our platform by five abstract processors, CPU_1, GPU_1, XeonPhi_1, CPU_2, and GPU_2. CPU_1 contains 22 (out of the total 24 physical) CPU cores. GPU_1 involves the Nvidia K40c GPU and a host CPU core connected to this GPU via a dedicated PCI-E link. CPU_2 comprises 10 (out of the total 12 physical) CPU cores. XeonPhi_1 is made up of one Intel Xeon Phi 3120P and its host CPU core connected via a dedicated PCI-E link. GPU_2 involves the Nvidia P100 PCI-E GPU and a host CPU core connected to this GPU via a dedicated PCI-E link. Since there should be a one-to-one mapping between the abstract processors and computational kernels, any hybrid application executing on the node should consist of five kernels, one kernel per computational device, running in parallel. Because the abstract processors contain
CPU cores that share some resources such as main memory and QPI, they cannot be considered entirely independent. Therefore, the perfor-
mance of these loosely coupled abstract processors must be measured simultaneously, thereby taking into account the influence of resource
contention.
The execution time profiles of the abstract processors are experimentally built separately using an automated build procedure using OpenMP
threads where one thread is mapped to one abstract processor. To account for the influence of resource contention, all the abstract processors
execute the same workload simultaneously and their execution times are measured. The execution time for accelerators includes the time taken to
transfer data between the host and devices.
The dynamic energy profiles of the abstract processors are constructed using the additive approach.12 In the additive approach, the
dynamic energy profiles of the five processors are constructed serially. The combined profile where the individual dynamic energy consump-
tions are totaled for each data point is then obtained. Then, the dynamic energy profile employing all the processors in parallel is built.
The difference between the parallel and combined dynamic energy profiles is observed. We find that the average difference between par-
allel and combined dynamic energy profiles is around 2.5% for the applications and within the statistical accuracy threshold set in our
experiments. Both the parallel and combined profiles also follow the same pattern. Therefore, we conclude that the processors in our
experiments satisfy the additive hypothesis: the abstract processors are loosely coupled and do not interfere during the application. Thus,
we conclude that the dynamic energy profiles of the five processors can be constructed serially or in parallel for our experimental platform and
applications.
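A minimal sketch of this additive-hypothesis check follows, assuming the profiles are available as mappings from workload size to dynamic energy; the data layout is illustrative.

```python
# Minimal sketch of the additive-hypothesis check: compare the parallel dynamic energy
# profile against the sum of the serially built per-processor profiles, point by point.
# Profiles are assumed to be mappings from workload size to dynamic energy (in joules).

def additive_difference(parallel_profile, per_processor_profiles):
    diffs = []
    for x, e_parallel in parallel_profile.items():
        e_combined = sum(profile[x] for profile in per_processor_profiles)
        diffs.append(abs(e_parallel - e_combined) / e_combined * 100.0)
    return sum(diffs) / len(diffs)        # average percentage difference over all data points
```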
6.1.1 Precautions to rule out interference of other components in dynamic energy consumption
Several precautions are taken in computing energy measurements to eliminate any potential interference of the computing elements that are not
part of the given abstract processor running the given application kernel. First, we group abstract processors so that a given abstract processor
constitutes solely the computing elements involved to run a given application kernel. Hence, the dynamic energy consumption will solely reflect the
work done by the computing elements of the given abstract processor executing the application kernel.
Consider the DGEMM application kernel executing on the abstract processor CPU_1, which comprises CPU and DRAM. The HCLWattsUp API
function gives the total energy consumption of the server during the execution of an application. The energy consumption includes the contribution
from all components such as NIC, SSDs, and fans. We ensure that the application exercises only the CPUs and DRAM and not the other components
so that the dynamic energy consumption reflects the contribution of only these two components. The following steps are employed to achieve this
goal:
• The disk consumption is monitored before and during the application run to ensure no I/O is performed by the application, using tools such as sar and iotop;
• The problem size used in executing an application does not exceed the main memory, and swapping (paging) does not occur;
• The application does not use the network, which is monitored using tools such as sar and atop;
• The application kernel's CPU affinity mask is set using the SCHED API's system call, SCHED_SETAFFINITY(). To bind the DGEMM application kernel, we set its CPU affinity mask to 11 physical CPU cores of Socket 1 and 11 physical CPU cores of Socket 2 (a sketch of such a binding follows this list).
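A minimal sketch of such a binding, assuming Linux and Python's os.sched_setaffinity; the core IDs are placeholders, not our exact mapping.

```python
# Minimal sketch, assuming Linux: pin the calling process (and threads it spawns, e.g.,
# OpenMP/MKL threads) to an explicit set of cores, analogous to the sched_setaffinity()
# binding described above. The core IDs below are placeholders, not our exact mapping.
import os

cpu_1_cores = set(range(0, 11)) | set(range(12, 23))   # hypothetical: 11 cores per socket
os.sched_setaffinity(0, cpu_1_cores)                   # pid 0 = the calling process
print(sorted(os.sched_getaffinity(0)))
```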
Fans are also a significant contributor to energy consumption. On our platform, fans are controlled in two zones: (a) zone 0: CPU or system fans, and (b) zone 1: peripheral zone fans. There are four levels to control the speed of the fans:
Standard: BMC control of both fan zones, with the CPU zone based on CPU temp (target speed 50%) and Peripheral zone based on PCH temp
(target speed 50%);
Optimal: BMC control of the CPU zone (target speed 30%), with Peripheral zone fixed at low speed (fixed 30%);
Heavy IO: BMC control of CPU zone (target speed 50%), Peripheral zone fixed at 75%;
Full: all fans are running at 100%.
We set the fans at full speed before launching the experiments to rule out the fans' contribution to dynamic energy consumption. When set at full speed, the fans run consistently at 13,400 rpm, so they consume a constant amount of power that is included in the static power of the platform. Furthermore, we monitor the server's temperatures and fan speeds using Intelligent Platform Management Interface (IPMI) sensors, both with and without the application running. We find no significant differences in temperature, and the fan speeds are the same in both scenarios.
Thus, we ensure that the dynamic energy consumption measured reflects the contribution solely by the abstract processor executing the given
application kernel.
6.2 Applications used in the experiments
The matrix multiplication application computes C = α × A × B + β × C, where A, B, and C are matrices of size m × n, n × n, and m × n, and α and β are floating-point constants. The application uses Intel MKL DGEMM for CPUs, the ZZGEMMOOC out-of-card package34 for Nvidia GPUs, and the XeonPhiOOC out-of-card package34 for Intel Xeon Phis. The ZZGEMMOOC and XeonPhiOOC packages reuse CUBLAS and MKL BLAS for in-card DGEMM calls. The out-of-card packages allow the GPUs and Xeon Phis to execute computations of arbitrary size. The Intel MKL and CUDA versions used on HCLServer01 are 2017.0.2 and 7.5, and on HCLServer02 are 2017.0.2 and 9.2.148. Workload sizes range from 64 × 10,112 to 28,800 × 10,112 with a step size of 64 for the first dimension m. The speed of execution of a given problem size m × n is calculated as (2 × m × n²)/t, where t is the execution time.
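As an illustration of this speed calculation, the following C sketch times a single DGEMM data point through the CBLAS interface (which Intel MKL provides) and reports the speed as (2 × m × n²)/t in GFLOPS; it omits the out-of-card logic of the ZZGEMMOOC and XeonPhiOOC packages, and the matrix initialization and timing helper are illustrative.

#include <stdlib.h>
#include <time.h>
#include <cblas.h>   /* Intel MKL provides the same CBLAS interface via <mkl.h> */

/* Measure the speed (in GFLOPS) of C = alpha*A*B + beta*C for an m x n problem,
 * where A is m x n, B is n x n, and C is m x n, as in the experiments. */
static double dgemm_speed(int m, int n)
{
    double *A = malloc((size_t)m * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)m * n * sizeof *C);
    for (size_t i = 0; i < (size_t)m * n; i++) A[i] = C[i] = 1.0;
    for (size_t i = 0; i < (size_t)n * n; i++) B[i] = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, n, 1.0, A, n, B, n, 1.0, C, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double t = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free(A); free(B); free(C);
    /* speed = (2 * m * n^2) / t, reported in GFLOPS */
    return 2.0 * m * (double)n * n / t / 1e9;
}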
The gene sequencing application deals with the alignment of DNA or protein sequences. It employs the Smith-Waterman algorithm (SW),35,36 which uses a dynamic programming (DP) approach to determine the optimal local alignment score of two sequences, a query sequence of length m and a database sequence of length n. The time and space complexities of the SW DP algorithm are O(m × n) and O(m), where m < n, assuming the use of refined linear-space methods. The speed of execution of the application for a given workload size, (m + n), is calculated as (m × n)/t, where t is the execution time. The speed is usually measured in GCUPS, which stands for billions of cell updates per second.
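For illustration, the following C sketch computes the optimal local alignment score in O(m) space and reports the speed in GCUPS as (m × n)/t; it uses a simple linear gap penalty and scalar code rather than the affine-gap, SIMD-optimized scoring of SWIPE, CUDASW++ 3.0, and SWAPHI, so it is a minimal sketch and not one of the routines used in the experiments.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Smith-Waterman local alignment score in O(m) space with a linear gap
 * penalty: +match for identical characters, -mismatch otherwise, -gap per
 * gap symbol. query has length m and the database sequence has length n. */
static long sw_score(const char *query, long m, const char *db, long n,
                     long match, long mismatch, long gap)
{
    long *H = calloc((size_t)m + 1, sizeof *H); /* one DP column, H[0] = 0 */
    long best = 0;
    for (long j = 1; j <= n; j++) {
        long diag = 0;                     /* H[i-1][j-1] */
        for (long i = 1; i <= m; i++) {
            long prev = H[i];              /* H[i][j-1], next iteration's diagonal */
            long s = (query[i - 1] == db[j - 1]) ? match : -mismatch;
            long val = diag + s;
            if (H[i - 1] - gap > val) val = H[i - 1] - gap; /* gap in database */
            if (prev - gap > val)     val = prev - gap;     /* gap in query   */
            if (val < 0)              val = 0;
            H[i] = val;
            if (val > best) best = val;
            diag = prev;
        }
    }
    free(H);
    return best;
}

int main(void)
{
    const char *query = "ACACACTA", *db = "AGCACACA";   /* toy sequences */
    long m = (long)strlen(query), n = (long)strlen(db);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long score = sw_score(query, m, db, n, 2, 1, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double t = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* speed = (m * n) / t cell updates per second, reported in GCUPS */
    printf("score = %ld, speed = %.3f GCUPS\n", score, (double)m * n / t / 1e9);
    return 0;
}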
The application employs the five heterogeneous processors, CPU_1, GPU_1, xeonphi_1, CPU_2, and GPU_2. The application invokes optimized SW routines provided by SWIPE for multicore CPUs,37 CUDASW++ 3.0 for Nvidia GPU accelerators,38 and SWAPHI for Intel Xeon Phi accelerators.39 All the computations are in-card.
The performance and dynamic energy profiles for the matrix multiplication and gene sequencing applications are shown in Figures 2 and 5. The input performance and dynamic energy functions, (F, G), to LBOPA and PARTITION are linear approximations of these profiles.
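One straightforward way to obtain such a linear approximation from a measured profile is an ordinary least-squares fit of dynamic energy against workload size, sketched below in C; this is an illustrative construction and not necessarily the exact procedure used to prepare the inputs to LBOPA and PARTITION.

/* Least-squares fit e(x) = a*x + b of a measured dynamic energy profile,
 * where x[i] is the workload size and e[i] the measured dynamic energy. */
static void fit_linear_profile(const double *x, const double *e, int npts,
                               double *a, double *b)
{
    double sx = 0, se = 0, sxx = 0, sxe = 0;
    for (int i = 0; i < npts; i++) {
        sx  += x[i];
        se  += e[i];
        sxx += x[i] * x[i];
        sxe += x[i] * e[i];
    }
    *a = (npts * sxe - sx * se) / (npts * sxx - sx * sx);
    *b = (se - *a * sx) / npts;
}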
To demonstrate the practical efficacy and the most interesting aspects of our algorithms, we select two workloads, 12,352 × 10,112 and 15,552 × 10,112, for the matrix multiplication application and the workload 29,312 × 163,841 for the gene sequencing application. First, the workloads provide shapes of Pareto fronts with steep slopes and a wide range of performance-energy tradeoffs, both for performance and dynamic energy and for performance and total energy. Second, the workloads allow us to demonstrate scenarios where the set of Pareto-optimal solutions for performance and total energy is equal to the set of Pareto-optimal solutions for performance and dynamic energy (Ψ_TE = Ψ_DE) and where it is only a proper subset (Ψ_TE ⊂ Ψ_DE).
6.3 Performance-dynamic energy Pareto fronts
Figure 6 shows the Pareto fronts for the matrix multiplication application for two workloads, 12,352 × 10,112 and 15,552 × 10,112. Each Pareto front contains four linear segments, each connecting two endpoints. All the points lying on a segment are Pareto-optimal solutions in the objective space. The solution (shown as a circle) with the minimal execution time is the load-balancing solution.
For the workload 12,352 × 10,112, 17% dynamic energy savings are gained while allowing 5% performance degradation. Similarly, for the workload 15,552 × 10,112, 13% energy savings are achieved while tolerating 5% performance degradation.
FIGURE 5 The execution time and dynamic energy profiles of the five heterogeneous processors (Figure 1) employed in the gene sequencing application.
FIGURE 6 Pareto fronts for the matrix multiplication application using five heterogeneous processors for two workloads. (A) and (B) contain the Pareto fronts for execution time and dynamic energy. Each Pareto front contains four linear segments. (C) and (D) contain the Pareto fronts for execution time and total energy. The non-Pareto-optimal solutions due to high static energy consumption are highlighted in red. The solution (represented as a circle) with the minimal execution time is the load-balancing solution.
FIGURE 7 Pareto fronts for the gene sequencing application using the five heterogeneous processors for the workload 29,312 × 163,841. (A) contains the Pareto front for execution time and dynamic energy. It has four linear segments. (B) displays the Pareto front for execution time and total energy. The solution (represented as a circle) with the minimal execution time is the load-balancing solution.
The first linear segment has a steep slope, signifying significant dynamic energy savings for a slight increase in execution time. The energy savings are 93 and 106 J for execution time increases of 0.03 and 0.05 s for the two workloads. The energy-performance tradeoff (i.e., the gain in energy savings for a corresponding increase in execution time) decreases with each subsequent linear segment.
Figure 7 shows the Pareto front for the gene sequencing application solving a problem size of 29,312 × 163,841. The dynamic energy saving obtained by accepting a 1% performance hit is 23%.
6.4 Performance-total energy Pareto fronts
Figure 6 presents the Pareto fronts for execution time and total energy for the matrix multiplication application for two workloads, 12,352 × 10,112 and 15,552 × 10,112. The solution (represented as a circle) with the minimal execution time is the load-balancing solution. The dotted red lines show the non-Pareto-optimal solutions removed by LBOPA-TE due to high static energy consumption.
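A minimal sketch of this filtering step is shown below: given Pareto-optimal points for execution time and dynamic energy sorted by increasing time, it adds the static energy P_s × t to each point and keeps only the points that remain non-dominated for execution time and total energy. The discrete point representation and the static power parameter are illustrative; LBOPA-TE itself operates on the piecewise linear front.

typedef struct {
    double t;  /* execution time (s) */
    double e;  /* dynamic energy (J) */
} point_t;

/* Given np Pareto-optimal points for (time, dynamic energy), sorted by
 * increasing time (and hence decreasing dynamic energy), keep only those
 * that are also Pareto-optimal for (time, total energy), where
 * total energy = dynamic energy + p_static * time. Returns the number of
 * points kept; the kept points are written to `kept` with their total energy. */
static int te_pareto_filter(const point_t *de_front, int np, double p_static,
                            point_t *kept)
{
    int nkept = 0;
    double best_te = 0.0;
    for (int i = 0; i < np; i++) {
        double te = de_front[i].e + p_static * de_front[i].t;
        /* Points are scanned in order of increasing time, so a point is
         * Pareto-optimal for (time, total energy) iff its total energy is
         * strictly lower than that of every faster point already scanned. */
        if (nkept == 0 || te < best_te) {
            kept[nkept].t = de_front[i].t;
            kept[nkept].e = te;
            nkept++;
            best_te = te;
        }
    }
    return nkept;
}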
Figure 7 shows the Pareto front for execution time and total energy for the gene sequencing application for the workload 29,312 × 163,841. The total energy saving obtained by accepting a 1% performance hit is 16%.
6.5 Discussion
Following are our salient observations:
The set of Pareto-optimal solutions (workload distributions) for execution time and dynamic energy is optimal for total energy only for the first two linear segments starting from the performance-optimal endpoint for the matrix multiplication application. The third and fourth linear segments, with positive slopes, contain non-Pareto-optimal solutions due to high static energy consumption;
The shapes of the two Pareto fronts, for execution time and dynamic energy and for execution time and total energy, are similar, suggesting that the qualitative conclusions apply to all workloads;
For the gene-sequencing application, the set of Pareto-optimal solutions (workload distributions) for execution time and dynamic energy is also
optimal for total energy for the workload employed in our experiments;
Based on an input user-specified energy-performance tradeoff, one can selectively focus on a specific segment of the Pareto fronts to return the Pareto-optimal solutions (workload distributions); a minimal sketch of such a tradeoff query is given at the end of this discussion. A steep slope in the line segment with the load-balanced solution as the performance-optimal endpoint provides significant energy savings while tolerating little performance degradation. It signifies that introducing a small load imbalance can provide good energy savings.
The execution times of our proposed algorithms range from milliseconds to 1 s to find Pareto-optimal solutions for the workload sizes used in the experiments. These execution times are insignificant compared to the execution times of the applications where our proposed algorithms are employed to find the workload distribution.
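The following C sketch illustrates the tradeoff query mentioned above on a piecewise linear Pareto front: given the front's endpoints sorted by increasing execution time and a user-specified tolerated performance degradation, it interpolates along the appropriate segment and returns the corresponding optimal energy. The endpoint values in the example are illustrative, not measured data.

#include <stdio.h>

typedef struct {
    double t;  /* execution time (s) */
    double e;  /* energy (J)         */
} pf_point_t;

/* Given the endpoints of a piecewise linear Pareto front sorted by increasing
 * execution time (energy decreasing), return the optimal energy achievable
 * when tolerating a performance degradation of `deg` (e.g., 0.05 for 5%)
 * relative to the performance-optimal (load-balanced) endpoint front[0]. */
static double energy_at_degradation(const pf_point_t *front, int np, double deg)
{
    double t_budget = front[0].t * (1.0 + deg);
    if (t_budget >= front[np - 1].t)
        return front[np - 1].e;                 /* beyond the slowest endpoint */
    for (int i = 1; i < np; i++) {
        if (t_budget <= front[i].t) {
            /* linear interpolation on the segment [front[i-1], front[i]] */
            double w = (t_budget - front[i - 1].t) / (front[i].t - front[i - 1].t);
            return front[i - 1].e + w * (front[i].e - front[i - 1].e);
        }
    }
    return front[np - 1].e;                     /* not reached */
}

int main(void)
{
    /* Illustrative front with five endpoints (four linear segments). */
    pf_point_t front[] = {{1.00, 600}, {1.03, 507}, {1.10, 470},
                          {1.25, 450}, {1.50, 445}};
    printf("energy at 5%% degradation: %.1f J\n",
           energy_at_degradation(front, 5, 0.05));
    return 0;
}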
7CONCLUSION
Performance and energy are the two most important objectives for optimization on heterogeneous HPC platforms. Khaleghzadeh et al.12 studied
bi-objective optimization of data-parallel applications for performance and energy on heterogeneous processors. They proposed an algorithm for
the case of discrete performance and energy functions with any arbitrary shape. They also briefly studied the continuous bi-objective optimization
problem but only for the simple case of two heterogeneous processors with linear execution time and linear dynamic energy functions. They proposed
an algorithm to find the Pareto front and showed that it is linear, containing an infinite number of solutions. While one solution is load balanced, the
rest are load imbalanced.
In Reference 13, we studied a more general continuous bi-objective optimization problem for a generic case of k heterogeneous processors. The problem is motivated by the bi-objective optimization for the performance and dynamic energy of data-parallel applications on heterogeneous HPC platforms. We first formulated the problem, which, for a given positive real number n, aims to find a vector X = {x_0, ... , x_{k-1}} ∈ R^k_{≥0} such that Σ_{i=0}^{k-1} x_i = n, minimizing the max of the k-dimensional vector of functions of objective type one and the sum of the k-dimensional vector of functions of objective type two. We then proposed an exact algorithm of polynomial complexity solving the problem where all the functions of objective type one are continuous and strictly increasing, and all the functions of objective type two are linear increasing.
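As an illustration of the performance-optimal endpoint of this problem, the following C sketch computes the load-balanced workload distribution by bisection on the common execution time T, assigning x_i = f_i^{-1}(T) and adjusting T until Σ x_i = n; bisection applies because the time functions are continuous and strictly increasing. This is a minimal sketch of the load-balancing endpoint only, with illustrative linear time functions, and not the exact Pareto-front algorithm of Reference 13.

#include <stdio.h>

#define K 5                    /* number of heterogeneous processors */

/* Illustrative strictly increasing execution time functions f_i(x) = c_i * x. */
static const double c[K] = {1.0, 0.5, 0.25, 0.8, 0.3};

static double f(int i, double x)      { return c[i] * x; }
static double f_inv(int i, double t)  { return t / c[i]; }

/* Load-balanced (performance-optimal) distribution: find T such that
 * sum_i f_i^{-1}(T) = n, then x_i = f_i^{-1}(T). */
static void load_balance(double n, double x[K])
{
    double lo = 0.0, hi = f(0, n);     /* f_0(n) bounds the balanced time */
    for (int it = 0; it < 100; it++) {
        double t = 0.5 * (lo + hi), total = 0.0;
        for (int i = 0; i < K; i++) total += f_inv(i, t);
        if (total < n) lo = t; else hi = t;
    }
    for (int i = 0; i < K; i++) x[i] = f_inv(i, hi);
}

int main(void)
{
    double x[K];
    load_balance(10112.0, x);
    for (int i = 0; i < K; i++) printf("x[%d] = %.2f\n", i, x[i]);
    return 0;
}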
In this work, we applied the problem and the algorithm proposed in Reference 13 to solve two related optimization problems of parallel appli-
cations on heterogeneous hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. First, we
formulated and solved the bi-objective optimization problem for performance and dynamic energy. The problem and the solution are a direct
application of the problem and algorithm proposed in Reference 13.
We then formulated the bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and total energy. We proved a theorem stating that a solution vector X is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector X1 such that X1 is Pareto-optimal for execution time and dynamic energy, TE(X1) ≤ TE(X), and ET(X1) < ET(X). Finally, we proposed an algorithm of polynomial complexity to solve the problem, whose correctness follows from the theorem.
Using the algorithms (proposed in Reference 13 and this work), we solved the two bi-objective optimization problems for two applications,
matrix multiplication and gene sequencing, employing five heterogeneous processors, two Intel multicore CPUs, an Nvidia K40c GPU, an Nvidia
P100 PCIe GPU, and an Intel Xeon Phi. For the workloads and the platform employed in our experiments, the algorithms provide continuous piecewise linear Pareto fronts for performance and dynamic energy and for performance and total energy, where the performance-optimal point is the load-balanced configuration of the application.
Finally, 17% dynamic energy savings were achieved while tolerating a performance degradation of 5% (a saving of 106 J for an execution time increase of 0.05 s) for the matrix multiplication application. The dynamic energy and total energy savings for the gene sequencing application accepting a 1% performance hit were 23% and 16%, respectively.
In our future work, we will study bi-objective performance-energy optimization of applications with both continuous performance and energy
profiles on heterogeneous hybrid HPC platforms.
ACKNOWLEDGMENTS
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number
14/IA/2474. Open access funding provided by IReL.
CONFLICT OF INTEREST
The authors declare no potential conflict of interest.
DATA AVAILABILITY STATEMENT
Data are available on request from the authors.
ORCID
Hamidreza Khaleghzadeh https://orcid.org/0000-0003-4070-7468
Ravi Reddy Manumachu https://orcid.org/0000-0001-9181-3290
REFERENCES
1. Fard HM, Prodan R, Barrionuevo JJD, Fahringer T. A multi-objective approach for workflow scheduling in heterogeneous environments. Paper presented
at: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Ccgrid 2012) CCGRID’12; IEEE Computer
Society; 2012; Ottawa, ON: 300–309.
2. Kessaci Y, Melab N, Talbi EG. A pareto-based metaheuristic for scheduling HPC applications on a geographically distributed cloud federation. Cluster
Comput. 2013;16(3):451-468.
3. Durillo JJ, Nae V, Prodan R. Multi-objective energy-efficient workflow scheduling using list-based heuristics. Future Gener Comput Syst. 2014;36:
221-236.
4. Rossi FD, Xavier MG, De Rose CA, Calheiros RN, Buyya R. E-eco: performance-aware energy-efficient cloud data center orchestration. J Netw Comput Appl. 2017;78:83-96.
5. Kołodziej J, Khan SU, Wang L, Zomaya AY. Energy efficient genetic-based schedulers in computational grids. Concurr Pract Exp. 2015;27(4):
809-829.
6. Yu L, Zhou Z, Wallace S, Papka ME, Lan Z. Quantitative modeling of power performance tradeoffs on extreme scale systems. J Parallel Distrib Comput.
2015;84:1-14.
7. Gholkar N, Mueller F, Rountree B. Power tuning HPC jobs on power-constrained systems. Paper presented at: Proceedings of the 2016 International
Conference on Parallel Architectures and Compilation ACM; 2016; Haifa, Israel: 179–191.
8. Rountree B, Lowenthal DK, Funk S, Freeh VW, de Supinski BR, Schulz M. Bounding energy consumption in large-scale MPI programs. Paper presented
at: SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing; 2007; Reno, NV: 1–9.
9. Chakrabarti A, Parthasarathy S, Stewart C. A pareto framework for data analytics on heterogeneous systems: implications for green energy usage and
performance. Paper presented at: Parallel Processing (ICPP), 2017 46th International Conference on IEEE; 2017; Bristol, UK: 533–542.
10. Lastovetsky A, Reddy R. New model-based methods and algorithms for performance and energy optimization of data parallel applications on homoge-
neous multicore clusters. IEEE Trans Parallel Distrib Syst. 2017;28(4):1119-1133.
11. Manumachu RR, Lastovetsky A. Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy.
IEEE Trans Comput. 2018;67(2):160-177.
12. Khaleghzadeh H, Fahad M, Shahid A, Manumachu RR, Lastovetsky A. Bi-objective optimization of data-parallel applications on heterogeneous HPC
platforms for performance and energy through workload distribution. IEEE Trans Parallel Distrib Syst. 2021;32(3):543-560.
13. Khaleghzadeh H, Manumachu RR, Lastovetsky A. A novel algorithm for bi-objective performance-energy optimization of applications with continuous
performance and linear energy profiles on heterogeneous HPC platforms. Paper presented at: Euro-Par 2021 Workshops, Lecture Notes in Computer
Science. Springer; 2022; Gottingen, Germany.
14. Fahad M, Shahid A, Manumachu RR, Lastovetsky A. A comparative study of methods for measurement of energy of computing. Energies.
2019;12(11):2204.
15. Miettinen K. Nonlinear Multiobjective Optimization. Kluwer; 1999.
16. Talbi EG. Metaheuristics: From Design to Implementation. Vol 74. John Wiley & Sons; 2009.
17. Ge R, Feng X, Feng WC, Cameron KW. CPU MISER: a performance-directed, run-time system for power-aware clusters. Paper presented at: International
Conference on Parallel Processing. IEEE Computer Society; 2007; Xi’an, China.
18. Huang S, Feng W. Energy-efficient cluster computing via accurate workload characterization. Paper presented at: 9th IEEE/ACM International Sympo-
sium on Cluster Computing and the Grid, IEEE Computer Society; 2009; Shanghai, China.
19. Mezmaz M, Melab N, Kessaci Y, et al. A parallel bi-objective hybrid metaheuristic for energy-aware scheduling for cloud computing systems. J Parallel
Distrib Comput. 2011;71(11):1497-1508.
20. Beloglazov A, Abawajy J, Buyya R. Energy-aware resource allocation heuristics for efficient management of data centers for Cloud computing. Future
Gener Comput Syst. 2012;28(5):755-768. Special Section: Energy efficiency in large-scale distributed systems.
21. Das A, Kumar A, Veeravalli B, Bolchini C, Miele A. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in
MPSoCs. Paper presented at: 2014 Design, Automation Test in Europe Conference Exhibition (DATE); 2014; Dresden, Germany: 1–6.
22. Sundriyal V, Sosonkina M. Joint frequency scaling of processor and DRAM. J Supercomput. 2016;72(4):1549-1569.
23. Abdi A, Girault A, Zarandi HR. ERPOT: a quad-criteria scheduling heuristic to optimize execution time, reliability, power consumption and temperature
in multicores. IEEE Trans Parallel Distrib Syst. 2019;30:2193-2210.
24. Lang J, Rünger G. An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration.
J Parallel Distrib Comput. 2014;74(9):2884-2897.
25. Reddy Manumachu R, Lastovetsky AL. Design of self-adaptable data parallel applications on multicore clusters automatically optimized for performance
and energy through load distribution. Concurr Comput Practice Exp. 2019;31(4):e4958.
26. Tarplee KM, Friese R, Maciejewski AA, Siegel HJ, Chong EK. Energy and makespan tradeoffs in heterogeneous computing systems using efficient linear
programming techniques. IEEE Trans Parallel Distrib Syst. 2016;27(6):1633-1646.
27. Aba MA, Zaourar L, Munier A. Approximation algorithm for scheduling a chain of tasks on heterogeneous systems. Paper presented at: European
Conference on Parallel Processing. Springer; 2017; Santiago de Compostela, Spain: 353–365.
28. Khokhriakov S, Manumachu RR, Lastovetsky A. Multicore processor computing is not energy proportional: an opportunity for bi-objective optimization
for energy and performance. Appl Energy. 2020;268:114957.
29. Lastovetsky A, Reddy R. Data partitioning with a realistic performance model of networks of heterogeneous computers. Paper presented at: 18th
International Parallel and Distributed Processing Symposium, 2004; 2004; Santa Fe, NM: 104.
30. Lastovetsky A, Reddy R. Data partitioning with a functional performance model of heterogeneous processors. Int J High Perform Comput Appl.
2007;21:76-90.
31. Fahad M, Shahid A, Manumachu RR, Lastovetsky A. Accurate energy modelling of hybrid parallel applications on modern heterogeneous computing
platforms using system-level measurements. IEEE Access. 2020;8:93793-93829.
32. Fahad M, Manumachu RR. HCLWattsUp: Energy API Using System-Level Physical Power Measurements Provided by Power Meters. Heterogeneous Computing
Laboratory, University College Dublin; 2021. https://csgitlab.ucd.ie/manumachu/hclwattsup
33. Zhong Z, Rychkov V, Lastovetsky A. Data partitioning on multicore and multi-GPU platforms using functional performance models. IEEE Trans Comput.
2015;64(9):2506-2518.
34. Khaleghzadeh H, Zhong Z, Reddy R, Lastovetsky A. Out-of-core implementation for accelerator kernels on heterogeneous clouds. J Supercomput.
2018;74(2):551-568.
35. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195-197.
36. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982;162(3):705-708.
37. Rognes T. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinform. 2011;12(1):1.
38. Liu Y, Wirawan A, Schmidt B. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions.
BMC Bioinform. 2013;14(1):1.
39. Liu Y, Schmidt B. SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors. Paper presented at: 2014 IEEE 25th International
Conference on Application-Specific Systems, Architectures and Processors. IEEE; 2014; Zurich, Switzerland: 184–185.
How to cite this article: Khaleghzadeh H, Reddy Manumachu R, Lastovetsky A. Efficient exact algorithms for continuous bi-objective
performance-energy optimization of applications with linear energy and monotonically increasing performance profiles on heterogeneous
high performance computing platforms. Concurrency Computat Pract Exper. 2023;35(20):e7285. doi: 10.1002/cpe.7285