Fast load balancing with the most to least loaded policy in dynamic networks
ABSTRACT Load balancing a distributed/parallel system consists in allocating work (load) to its processors so that they have to process
approximately the same amount of work or amounts in relation with their computation power. In this paper, we present a new
distributed algorithm that implements the Most to Least Loaded (M2LL) policy. This policy aims at indicating pairs of processors, that will exchange loads, taking into account actually broken
edges as well as the current load distribution in the system. The M2LL policy fixes the pairs of neighboring processors by
selecting in priority the most loaded and the least loaded processor of each neighborhood. Our first and main result is that
the M2LL distributed implementation terminates after at most $(n/2) \cdot d_t$ iterations, where $n$ and $d_t$ are respectively the number of nodes and the degree of the system at time $t$. We then present a performance comparison between Generalized Adaptive Exchange (GAE) that uses M2LL and Relaxed First Order Scheme (RFOS), two load balancing algorithms for dynamic networks in which only link failures are considered. The comparison is carried
out on a dedicated test bed that we have designed and implemented to this end. Our second important result is that although
generating more communications, the GAE algorithm with the M2LL policy is faster than RFOS in balancing the system load. In
addition, GAE M2LL is able to achieve a more stable balanced state than RFOS and scales well.
J Supercomput (2009) 49: 291–317
DOI 10.1007/s11227-008-0238-5
Abderrahmane Sider ·Raphaël Couturier
Published online: 9 October 2008
© Springer Science+Business Media, LLC 2008
Keywords Load balancing · Dynamic networks · Most to least loaded strategy ·
Relaxed first order scheme
A. Sider (✉)
Département d’Informatique, Université Abderrahmane Mira de Béjaia, Route de Targa Ouzemmour,
Béjaia 06000, Algérie
e-mail: abdr.sider@gmail.com
R. Couturier
Laboratoire d’Informatique de l’Université de Franche-Comté (LIFC), IUT de Belfort-Montbéliard,
University of Franche-Comte, BP 527, 90016 Belfort Cedex, France
e-mail: raphael.couturier@univ-fcomte.fr
1 Introduction
Solving large-size problems and reducing the execution times for small instances
are the main purpose of parallel algorithms and architectures. Nowadays, the need
for parallelism is becoming critical in many scientific fields ranging from simulating
fluid molecular dynamics and particle mechanics [1] to solving large optimization
and scientific problems [2]. The data-parallel model for parallelization is based on
splitting the data that has to be processed over several processing units. The amount
of data that is allocated to processors has to be controlled because of two main rea-
sons: The computation amount of each processor may increase or decrease depend-
ing on the computation being carried out and the processors may have heterogeneous
speeds. That is what makes load balancing a fundamental problem that has to be
addressed in the development of efficient parallel/distributed software. It consists in
allocating, according to a load balancing algorithm, some loads to processors in re-
lation with their computing powers. Load balancing algorithms have usually been
characterized as static/dynamic, global/local, sender/receiver initiated, and/or syn-
chronous/asynchronous [3–6].
Local load balancing algorithms are very attractive because with this scheme,
processors only know and use the load of their direct neighbors, and load is exchanged
only with direct neighbors. These algorithms are iterative by nature, since they tend
to balance the load globally (in the system) by successively balancing the load lo-
cally (in each neighborhood). The most popular local iterative algorithms are those
named First Order Scheme (FOS) [1, 7–11] and their derived form called Dimension
Exchange (DE) [7, 12, 13]. The difference between the diffusion scheme and the di-
mension exchange one lies in their ability to communicate with different nodes. If
a node can perform simultaneous communication with its neighbors then diffusion
is used to exchange load information and workload in parallel with all of them (in
a single step of the algorithm). If it is not the case, then dimension exchange must
be used and in this case, a node exchanges load information and workload with only
one of its neighbors per iteration. These algorithms were designed in a context of
spreading usage of computing clusters formed by connecting machines by a rapid
Ethernet-like network. But the recent evolution of network architectures toward the
use of the internet generates a new execution environment for distributed computing.
It is well known that the internet is subject to contention and temporary link failures
and processors may crash and recover. In this work, we deal with this issue suppos-
ing that the number of processors does not change and that a processor knows its
living links and to what other processors it can send/receive messages. A link is alive
if it can transmit a message in each direction [14]. FOS on dynamic networks has
been investigated in [15, 16] and an accelerated version of it called RFOS appeared
in [17]. Dimension exchange on hypercube architectures with broken edges has been
studied in [18]. Then research has focused on adapting the most efficient DE-type
algorithm named Generalized Dimension Exchange (GDE) [13] so that it could take
into account broken edges. This enhanced version of GDE, called Generalized Adaptive
Exchange (GAE), can be conducted according to several policies [19]. M2LL
is used by the GAE algorithm for load balancing and aims at determining, for each
iteration of GAE, all the pairs of nodes that will have to exchange work. This process
has to ensure that the determined pairs are those that minimize the local imbalance of
every neighborhood. In order to take into account broken edges at a given moment,
we propose the following solution: If a link is broken, then it is simply not considered
for finding a pair. Systematically, each processor looks for a pair, whatever edges are
actually broken in the network, taking into account only neighbor nodes that have not
chosen their pair yet.
In this paper, we present more extensively the first M2LL distributed implemen-
tation (a first and shorter version was presented in [20]) and then report and analyze
comparison results of GAE M2LL algorithm vs. the RFOS algorithm on dynamic
networks. To the best of our knowledge, this work is the first to compare the two ap-
proaches of load balancing in dynamic networks. On static networks, however, FOS
and GDE load balancing algorithms have been broadly investigated (e.g., see [7, 21])
and it is commonly agreed that GDE outperforms FOS in synchronous implementa-
tions.
The remainder of this paper is organized as follows. In Sect. 2, performance indi-
cators of load balancing algorithms are first recalled. Then detailed formulations of
both RFOS and GAE are given. Section 3 is devoted to presenting the M2LL distrib-
uted algorithm analysis, design foundations, and associated issues like termination.
Section 4 presents the developed framework first and also many parameters used to
test GAE and RFOS. Then Sect. 4.2 presents experimental results with regard to ter-
mination and efficiency of distributed M2LL. Subsequent subsections give main and
subsidiary results (such as scaling issues and different load patterns) along with detailed
discussions of the comparison metrics defined in Sect. 2. Finally, Sect. 5 concludes this
work and gives hints of our future work.
2 Preliminaries
A distributed-memory parallel system of $n$ processors linked with an interconnection network is modeled by a graph $G = (V, E)$ where the vertices of $V$ and the edges of $E$ represent the processors and the links between them, respectively. Let $E_B^t$ be the set of broken edges in the graph $G$ at time (iteration) $t$ and $N_i^t = \{j \in V : (i,j) \in E \wedge (i,j) \notin E_B^t\}$ the set of neighbor nodes of processor $i$ at time $t$. The workload of node $i$ is represented by a nonnegative integer scalar value $w_i$. At time $t$, the system’s load distribution is represented by the vector $W^t = \{w_1^t, w_2^t, w_3^t, \ldots, w_n^t\}$. The target of the load balancing process is to make this load distribution converge toward the uniform load distribution represented by the vector $\bar{W} = \{\bar{w}, \bar{w}, \bar{w}, \ldots, \bar{w}\}$ where $\bar{w}$ is the load that every node should have received if a global knowledge of the overall system’s load were known. If the system is built by assembling homogeneous processor powers and link bandwidths, then $\bar{w} = \sum_{i=1}^{n} w_i / n$. In the other case, suppose each processor’s power is represented by a value $s_i$. Then node $i$ should be allocated a workload which is proportional to its power; that is

$$\bar{w}_i = \frac{\sum_{k=1}^{n} w_k}{\sum_{k=1}^{n} s_k}\, s_i.$$
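The two target definitions above can be illustrated with a minimal sketch. The function name `target_loads` and the sample values are assumptions made for illustration only; they do not come from the paper.

```python
def target_loads(loads, speeds=None):
    """Return the target load of every node.

    loads  : current workloads w_i
    speeds : processor powers s_i, or None for a homogeneous system
    """
    total = sum(loads)
    if speeds is None:
        # Homogeneous case: every node aims at the average load.
        return [total / len(loads)] * len(loads)
    # Heterogeneous case: share the total load in proportion to power.
    total_speed = sum(speeds)
    return [total * s / total_speed for s in speeds]


# Hypothetical example values.
homogeneous = target_loads([10, 2, 6, 2])
heterogeneous = target_loads([8, 0, 4], speeds=[1, 1, 2])
```

In the heterogeneous example the node with twice the power is assigned twice the load, and in both cases the targets sum to the total load of the system.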
2.1 Performance metrics
Three fundamental properties must be considered when assessing the performance of
a load balancing algorithm: its termination, its efficiency, and its stability [13]. The
notion of termination relates to the ability of the algorithm to lead any initial load
distribution to the average load one. This is done mainly by means of formal proofs.
The efficiency is a subsidiary result of the termination proof because it either shows
how many iterations the algorithm can execute or what the execution time is until
this algorithm reaches a load balanced state. Finally, stability describes the quality
of the obtained global balance because the load is often not so evenly distributed
and there still subsists an unbalance between indirectly linked nodes. In practice, this
is modeled by some norm of the vector $\bar{W} - W^t$. The norm may be the max norm, noted $l_1 = \max_i \{|\bar{w} - w_i|\}$, or the quadratic (Euclidean) one, $l_2 = \left(\sum_i (\bar{w}_i - w_i)^2\right)^{1/2}$, which will be used in this work. In addition, the amount of workload moved over the network’s edges is an important criterion for a load balancing algorithm since it measures transmission costs and possibly affects the efficiency of the algorithm in the synchronous case.
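As a small illustration of the stability measure, the sketch below evaluates both norms on a sample distribution. The function names and the sample loads are hypothetical.

```python
import math


def max_norm(target, loads):
    """Largest absolute deviation of a node from its target load."""
    return max(abs(t - w) for t, w in zip(target, loads))


def quadratic_norm(target, loads):
    """Square root of the sum of squared deviations from the targets."""
    return math.sqrt(sum((t - w) ** 2 for t, w in zip(target, loads)))


# Hypothetical distribution of 20 load units over 4 homogeneous nodes.
loads = [8, 2, 5, 5]
target = [5, 5, 5, 5]
worst_deviation = max_norm(target, loads)
overall_deviation = quadratic_norm(target, loads)
```

Both values are zero exactly when every node holds its target load, so either can serve as a stopping test in an experiment.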
2.2 The relaxed first order scheme
Bahi et al. [17] presented an accelerated version of the well-known First Order Scheme (FOS) diffusion algorithm due to Cybenko in his fundamental contribution to iterative local load balancing algorithms [7]. According to this last scheme, on the hypercube static structure, a node $i$ exchanges a portion noted $\alpha$ of its load difference $|w_j^t - w_i^t|$ with its neighbor $j$ where $\alpha = \frac{1}{\Delta + 1}$; $\Delta$ being the maximum degree of $G$. Later, FOS had been optimally tuned for some general static structures called k-ary n-cubes by [8]. The tuning of FOS is achieved by looking for a diffusion parameter noted $\alpha_{opt}$ that is neither necessarily constant nor equal to $\frac{1}{\Delta + 1}$. Moreover, the authors gave general formulae to compute it for chain, ring, and mesh structures. Unfortunately, this is only possible for the so-called k-ary n-cubes static structures, but not in dynamic network topologies, which often occur in real world distributed-memory architectures when some communication links are broken even if the initial structure is supposed static. Boillat’s independently devised FOS version [1] is another approach for choosing the diffusion parameter, which becomes dependent on the degrees of the two considered processors; i.e., $\alpha_{i,j} = \frac{1}{\max(d_i, d_j) + 1}$ where $d_i$ and $d_j$ represent the degrees of nodes $i$ and $j$, respectively. Anyway, by the FOS diffusion algorithm, the load evolution of processor $i$ at time $t$ is done according to formula (1) where $\alpha_{i,j}$ is fixed according to one of the three above-cited ways.

$$w_i^t = w_i^{t-1} + \sum_{j \in N_i} \alpha_{i,j}\,(w_j^{t-1} - w_i^{t-1}). \tag{1}$$
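Formula (1) can be sketched as one synchronous iteration over all nodes. This is only an illustration: it assumes Boillat's choice of the diffusion parameter, real-valued loads, and a hypothetical adjacency-list representation named `neighbors`.

```python
def fos_step(loads, neighbors):
    """One synchronous FOS iteration (formula (1)).

    Uses alpha_ij = 1 / (max(d_i, d_j) + 1), one of the three choices
    mentioned in the text. neighbors[i] lists the neighbors of node i.
    """
    new_loads = []
    for i, w_i in enumerate(loads):
        flow = 0.0
        for j in neighbors[i]:
            alpha = 1.0 / (max(len(neighbors[i]), len(neighbors[j])) + 1)
            flow += alpha * (loads[j] - w_i)
        new_loads.append(w_i + flow)
    return new_loads


# Hypothetical chain 0 - 1 - 2 with all the load on node 0.
chain = {0: [1], 1: [0, 2], 2: [1]}
after_one_step = fos_step([9.0, 0.0, 0.0], chain)
```

Because $\alpha_{i,j} = \alpha_{j,i}$, what node $i$ gives is exactly what node $j$ receives, so the total load is preserved by each step.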
The first adaptation of FOS for dynamic networks has been done in [15]. The
authors showed that if the diffusion parameter is adapted to the current topology of
the network by considering only living edges, then the FOS diffusion algorithm will
converge under some realistic conditions: A necessary and sufficient condition is that
no part of the network (one or several processors) should ever (i.e., during the process
of iterative load balancing) be disconnected from the network. According to the new
algorithm, the load evolution of processor i at time t is done according to formula (2)
where $\alpha_{i,j}^t = \alpha_{i,j}$ if $(i,j) \notin E_B^t$ and $0$ if $(i,j) \in E_B^t$.

$$w_i^t = w_i^{t-1} + \sum_{j \in N_i^t} \alpha_{i,j}^t\,(w_j^{t-1} - w_i^{t-1}). \tag{2}$$
The Relaxed First Order Scheme (RFOS) diffusion algorithm [17] is an acceler-
ated version of this latter algorithm. Acceleration is achieved again by introducing a
relaxation factor noted β to the iterative formula (2) that results in a load evolving
according to formula (3).
$$w_i^t = w_i^{t-1} + \beta \sum_{j \in N_i^t} \alpha_{i,j}^t\,(w_j^{t-1} - w_i^{t-1}). \tag{3}$$
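A minimal sketch of the dynamic-network update follows. It assumes a single constant diffusion parameter `alpha`, a set `broken` of currently broken edges, and real-valued loads; these names and the sample values are illustrative and not taken from [15] or [17]. With `beta = 1` the step reduces to formula (2).

```python
def rfos_step(loads, neighbors, broken, alpha, beta=1.0):
    """One synchronous step of formula (3); beta = 1 gives formula (2).

    broken is a set of frozenset({i, j}) edges that are down at this step,
    which amounts to setting alpha_ij to 0 on those edges.
    """
    new_loads = []
    for i, w_i in enumerate(loads):
        flow = sum(alpha * (loads[j] - w_i)
                   for j in neighbors[i]
                   if frozenset((i, j)) not in broken)
        new_loads.append(w_i + beta * flow)
    return new_loads


# Hypothetical chain 0 - 1 - 2 whose link (1, 2) is currently broken.
chain = {0: [1], 1: [0, 2], 2: [1]}
down = {frozenset((1, 2))}
plain = rfos_step([9.0, 0.0, 0.0], chain, down, alpha=1 / 3)
relaxed = rfos_step([9.0, 0.0, 0.0], chain, down, alpha=1 / 3, beta=1.5)
```

In this example the relaxed step moves more load across the live edge than the plain step, while node 2 receives nothing because its only link is down.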
The authors gave an exact formula for computing the optimal value of $\beta$, noted $\beta_{opt}$, that should be used to reach the maximum acceleration rate. Under the assumption that the total system load does not vary with time, they determined that at time step $t$, $\beta_{opt}^t = \min\left(R^t, \frac{2}{2 - (s^t + l^t)}\right)$ where $s^t$ and $l^t$, respectively, represent the smallest and the largest diffusion matrix eigenvalue at time $t$ and $R^t$ is given by formula (4) that considers only processors $i$ having $\sum_{j \in N_i^t} \alpha_{i,j}^t\,(w_j^{t-1} - w_i^{t-1}) < 0$.

$$R^t = \min_i \frac{w_i^{t-1}}{\left(\sum_{j \in N_i^t} \alpha_{i,j}^t\right)(w_i^{t-1} - w_{min}^{t-1})}. \tag{4}$$
The authors noted very opportunistically that even if −1 is always the smallest
eigenvalue of the diffusion matrix (the connection graph is always disconnected),
then RFOS will converge while FOS will not. From a practical point of view, we can
add that RFOS with the beta parameter can be more suitable for dynamic networks
because a computing formula is known for it which is not the case for the optimal
diffusion parameter of Xu et al. since they considered only static networks. In spite
of this limitation, we still want to assess the use of an optimal diffusion parameter for
dynamic networks if its value is initially computed when the computation is launched
and then used throughout the overall load balancing steps without considering the
impact of link failures. We do this for a very important reason: It could be very easy
to use in practice.
2.3 The generalized adaptive exchange
Another approach for direct neighbor load balancing is to let a processor exchange
some load with only one of its neighbors at each load balancing step. Cybenko in [7]
presented the Dimension Exchange (DE) algorithm on the structure of a hypercube
interconnecting network. Under this scheme, a processor with id i will exchange
some load with some other node j (and vice versa) on a dimension of the hypercube
noted d and computed according to the formula d = (t mod D) + 1 where t and D
are respectively the current iteration number of the load balancing algorithm and the
number of dimensions of the hypercube; hence the name of the algorithm. Formula
(5) shows how the load of processor i evolves between iteration t −1 and t.
$$w_i^t = w_i^{t-1} + \frac{1}{2}\,(w_j^{t-1} - w_i^{t-1}). \tag{5}$$
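The dimension schedule and formula (5) can be sketched as follows. The sketch assumes the usual binary labeling of a hypercube, in which the neighbor of node $i$ along dimension $d$ is obtained by flipping bit $d-1$ of $i$; this labeling is an assumption, since the text does not spell it out.

```python
def de_step(loads, t, D):
    """One Dimension Exchange step on a D-dimensional hypercube.

    The active dimension is d = (t mod D) + 1 and every node averages
    its load with its neighbor along that dimension (formula (5)).
    """
    d = (t % D) + 1
    new_loads = list(loads)
    for i in range(len(loads)):
        j = i ^ (1 << (d - 1))  # neighbor of i along dimension d
        new_loads[i] = loads[i] + 0.5 * (loads[j] - loads[i])
    return new_loads


# Hypothetical 2-dimensional hypercube (4 nodes), all load on node 0.
state = [8.0, 0.0, 0.0, 0.0]
for step in range(2):
    state = de_step(state, step, D=2)
```

After one sweep over the $D = 2$ dimensions, the sample load is spread evenly over the four nodes.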
Later on, [12] generalized DE for arbitrary interconnection networks by simulating the $D$ dimensions with the $K$ colors necessary for coloring the edges of the graph representing the network topology. It is well known that the minimum number of colors is bounded: $\Delta \le K \le \Delta + 1$ [22]. Load exchange takes place between two nodes $i$ and $j$ iff $i$ and $j$ are the extremities of some edge $e \in E$ having color $k = t \bmod K + 1$. However, with DE and its generalized Hosseini version, the portion of load difference that is effectively exchanged is $\lambda = \frac{1}{2}$, and for this reason this scheme is called Averaging Dimension Exchange (ADE).
With Generalized Dimension Exchange (GDE) [13], Xu et al. proved that this would lead to the maximum convergence rate only on the hypercube. They go on to give an alternative way to fix the optimal value $\lambda_{opt}$ for the k-ary n-cube structures like the chain, the ring, the mesh, and the torus. Unfortunately again, the optimal exchange parameter is computed under the basic hypothesis of a static structure, and consequently may not lead to the highest convergence rate on a dynamic network.

$$w_i^t = w_i^{t-1} + \lambda_{opt}\,(w_j^{t-1} - w_i^{t-1}). \tag{6}$$
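The color-driven exchange can be sketched as below. The dictionary `colored_edges`, mapping each edge to a color in `1..K`, is a hypothetical representation of a proper edge coloring (no two edges sharing a node have the same color), and the value passed as `lam` stands for whichever exchange parameter is chosen; it is not an optimal value from [13].

```python
def gde_step(loads, colored_edges, t, K, lam=0.5):
    """One exchange step on the edges of color k = t mod K + 1.

    lam = 0.5 gives the averaging scheme (ADE); another value plays
    the role of the exchange parameter in formula (6).
    """
    k = t % K + 1
    new_loads = list(loads)
    for (i, j), color in colored_edges.items():
        if color == k:
            # Both end points apply the same rule to the old loads.
            new_loads[i] = loads[i] + lam * (loads[j] - loads[i])
            new_loads[j] = loads[j] + lam * (loads[i] - loads[j])
    return new_loads


# Hypothetical chain 0 - 1 - 2 colored with K = 2 colors.
coloring = {(0, 1): 1, (1, 2): 2}
after_color_1 = gde_step([8.0, 0.0, 0.0], coloring, t=0, K=2)
after_color_2 = gde_step(after_color_1, coloring, t=1, K=2)
```

Since the coloring is proper, every node takes part in at most one exchange per step, which matches the one-neighbor-per-iteration constraint of dimension exchange.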
The Generalized Adaptive Exchange (GAE) [19] is the adaptation of GDE for
dynamic networks in which broken edges are considered. Authors showed that this
can be done according to three different policies named arbitrary, random and most
to least loaded (M2LL), all of which are only a particular case of FOS on dynamic
networks, and hence the convergence of GAE is guaranteed under the same assump-
tions. The outcome of M2LL is a finite set of pairs such that each one represents
two processors which have the greatest unbalance of their neighborhood. By making
these two nodes exchange some load, the objective is to tackle all such unbalances,
and thus to reach a balanced state as quickly as a pair can be formed. The GAE algo-
rithm together with the M2LL policy can be seen as a per-iteration implementation
of GDE. In fact, M2LL creates, when it is invoked (generally before applying GAE)
a virtual ad hoc coloring in which nodes of each dimension are chosen in such a way
that the load they will exchange will lead to a fast convergence to the uniform load
distribution. Fixed pairs are different from one iteration to another according to load
differences between nodes and available live links.
3 The M2LL distributed algorithm
This section is dedicated to presenting the M2LL distributed algorithm. The distrib-
uted implementation is performed in two phases. The aim of the first one is to make
the processors see the same thing with regard to the most loaded or the least loaded
situation that we group into one concept called the best interesting processors of a
node. The second step is then to look among this set for a pair processor. This is done
in two substeps: during the first, one processor is chosen to be the most interesting
processor and its identity is communicated to other members of the set; and during
the last substep each node checks if the processor it has chosen in the first substep
has chosen it in turn. Other messages are communicated to make neighbors aware of
the status of each other.
In the four next subsections, we present an analysis of the problems faced by a
processor when it has to make the proper decision about a pair. Then we proceed by
giving the basic building concept behind the solution of the two first problems. Suc-
cessive subsections then detail the solution and examine in particular the termination
issue.
3.1 The point of view of a processor
A processor starts by exchanging its load level with each of its neighbors. This is achieved through a message having the form {id, iteration, load} where id is an identifier of the sending node and iteration, the current iteration of the GAE algorithm. Thus, each processor knows the load of each of its neighbors and can order them by increasing load: $w_{j_0}^t \le w_{j_1}^t \le \cdots \le w_{j_{b-2}}^t \le w_{j_{b-1}}^t$. From the point of view of processor $i$ which made it, this order implies that:
1. If $w_i^t \ge w_{j_{b-1}}^t$, then $i$ is the “most loaded” of its neighborhood (if the inequality is strict) or in the general case, belongs to the set of the “most loaded” nodes.
2. If $w_i^t \le w_{j_0}^t$, then $i$ is the “least loaded” of its neighborhood (if the inequality is strict) or more generally, belongs to the set of the “least loaded” nodes.
3. If $w_{j_0}^t < w_i^t < w_{j_{b-1}}^t$, then $i$ is neither the “most loaded” nor the “least loaded” of its neighborhood.
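The three cases above amount to comparing the load of node $i$ with the extreme loads of its neighbors. A minimal sketch, with a hypothetical function name and sample loads:

```python
def load_order_case(w_i, neighbor_loads):
    """Return 1, 2 or 3 according to the three cases of the load order.

    When every neighbor has the same load as i, cases 1 and 2 both hold;
    this sketch then reports case 1.
    """
    lowest, highest = min(neighbor_loads), max(neighbor_loads)
    if w_i >= highest:
        return 1  # i belongs to the "most loaded" nodes
    if w_i <= lowest:
        return 2  # i belongs to the "least loaded" nodes
    return 3      # i lies strictly between the two extremes


# Hypothetical neighborhood loads.
case_most = load_order_case(10, [3, 5, 7])
case_least = load_order_case(1, [3, 5, 7])
case_middle = load_order_case(5, [3, 5, 7])
```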
In the last case, processor $i$ cannot directly choose with which of its neighbors it will balance its load. The idea that will allow it to choose a pair is to progressively remove $j_0$ and/or $j_{b-1}$ by letting these nodes find, before node $i$ has time to do so, with which of their respective neighbors they will exchange some load, in other words their pairs. Whenever a neighbor of $i$ finds a pair, it is removed from the list of possible pairs of node $i$.
The removal is performed after receiving a “status message” of the form {id, iteration, subiteration, decided} sent by the processor id to all of its neighbors, and only if decided = true in this message. Subiteration is the current iteration of the M2LL algorithm, which is executed many times by a processor until it finds a pair. The Boolean value decided reports whether this has happened, that is, whether the processor id will run further M2LL subiterations or not. Processor $i$ has to send such a message after each M2LL subiteration to its neighbors which are still running M2LL (those whose last status message is true) so that they can update load orders of their respective neighborhoods. From now on, we will restrict ourselves to cases 1 and 2 of the load order, since a processor in case 3 will end up, after some M2LL subiterations (see Sect. 3.7), in one of them.
3.2 Problems faced by a processor
3.2.1 A problem of choice
In case 1 of possible load orders just enumerated in Sect. 3.1, processor i belongs to
the set of the “most loaded” nodes of its neighborhood. From its point of view, node
$j_0$ is the “least loaded” in this neighborhood. If another processor in $j_1, j_2, \ldots$ has the same load as $j_0$, it should also be considered. In case 2, the situation of processor $i$ is inverted relative to case 1. Indeed, $i$ now belongs to the set of the “least loaded” of its neighborhood, and consequently, it should take as pair node $j_{b-1}$ or nodes $j_{b-2}, \ldots$ if they have the same load as $j_{b-1}$. So, we can see that whenever some processors have the same (maximum or minimum) load and they belong to a common neighborhood, then it will be necessary to “choose” one of them by some means to be defined. This issue, referred to as the “problem of choice,” is solved through the preference concept and is explained in Sect. 3.6.
3.2.2 The problem of different points of views
Until now, we have investigated the situation of the neighborhood of processor $i$ from the point of view of $i$. However, since M2LL is distributed, every processor has its own point of view of the load order that prevails in its neighborhood. For example, let us suppose that node $i$ sees (views) it is in case 1. The question is how do processors $j_0, j_1, \ldots$ (that are the least loaded from the point of view of node $i$) see things in their respective neighborhoods. It is possible that they see processors $i$ and $j_{b-1}, j_{b-2}, \ldots$ as the “most loaded.” In this situation, nodes $j_0, j_1, \ldots$ should choose a processor among $i$ and $j_{b-1}, j_{b-2}, \ldots$ according to the devised solution for the “problem of choice.” But it may be also that one or each of them sees another processor, say $h$, more loaded than node $i$ is. In this latter case, M2LL specifies that processor $j_0$ and/or $j_1$ chooses the “most loaded” to them (node $h$) rather than choosing $i$ or one among $j_{b-1}, j_{b-2}, \ldots$. Moreover, if processors $i$ and $j_{b-1}, j_{b-2}, \ldots$ know that one or several processors among $j_0, j_1, \ldots$ consider another node more loaded than they are, then they can remove it (or them) from the list of the “least loaded” nodes of their respective load orders. This will somewhat simplify the resolution of the “problem of choice” since the set of equally loaded processors is reduced.
Case 2 shows two similar problems. Indeed, processor $j_{b-1}$ may see a node $h$ less loaded than $i$ and $j_0, j_1, \ldots$. Again M2LL states that it should choose $h$ and, if $i$ and $j_0, j_1, \ldots$ know this information, they can remove $j_{b-1}$.
3.2.3 The problem of centered load
In the previous section, regarding case 1, we have pointed out that $j_{b-1}$ may see a processor $h$ less loaded than $j_0$. It is also obvious that $h$ may be more loaded than $j_{b-1}$. We can see now that node $j_{b-1}$ must be provided with some means of measuring the “distance” that separates it from processors $h$ and $j_0$. This measure should apply to $h$ whatever its load is in comparison with the load of node $j_{b-1}$. Suppose that these distances are, respectively, $\mathrm{distance}_{j_{b-1}}(h)$ and $\mathrm{distance}_{j_{b-1}}(j_0)$. If $\mathrm{distance}_{j_{b-1}}(h) > \mathrm{distance}_{j_{b-1}}(j_0)$, then processor $j_{b-1}$ should choose $h$. If $\mathrm{distance}_{j_{b-1}}(h) < \mathrm{distance}_{j_{b-1}}(j_0)$, then processor $j_{b-1}$ should choose node $j_0$. A particular case arises when $\mathrm{distance}_{j_{b-1}}(h) = \mathrm{distance}_{j_{b-1}}(j_0)$ and $w_{j_0} < w_{j_{b-1}} < w_h$. We say that processor $j_{b-1}$ has a “centered load” between $j_0$ and $h$. Notice that this problematic situation is only visible to processor $j_{b-1}$. We have to care about it because it can cause a real deadlock for the algorithm if it happens for all processors and no global knowledge about it is permitted. Symmetrically, processor $j_0$ also may experience a “centered load” between a processor $h$ from its own neighborhood and node $j_{b-1}$. In the next section, we will define the distance used by M2LL and show how it enables the distributed solution of different points of views. Then we will look closer at the problem of choice taking into account the particular case of centered loads.
3.3 How problems are solved
3.3.1 The interest for load balancing
The interest of a processor $j$ for load balancing from the point of view of a neighbor $i$, at time $t$, is defined by a scalar value $\mathrm{interest}_i^t(j)$:

$$\mathrm{interest}_i^t(j) = \left| w_j^t - w_i^t \right| \ge 0. \tag{7}$$

In other words, it is the absolute value of the difference of their loads. It makes it possible for processor $i$ to measure its unbalance with its neighbor $j$. Notice that the interest is symmetric for both nodes on a given nonbroken edge, i.e., $\forall (i,j) \in E \setminus E_B^t : \mathrm{interest}_i^t(j) = \mathrm{interest}_j^t(i)$. If processors and links are either of heterogeneous computation powers or bandwidths, then the interest of processor $j$ for load balancing should be weighted by $f_{ij}$, the bandwidth of link $(i,j)$, and by the machines’ powers $s_i$ and $s_j$. Consequently, the interest for load balancing can be expressed by:

$$\mathrm{interest}_i^t(j) = \frac{1}{f_{ij}} \left| \frac{w_j^t}{s_j} - \frac{w_i^t}{s_i} \right|.$$
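Both forms of the interest can be written as a single function whose default arguments reduce the weighted expression to formula (7). The function name, parameter names, and sample values below are chosen for illustration only.

```python
def interest(w_i, w_j, s_i=1.0, s_j=1.0, f_ij=1.0):
    """Interest of processor j as seen by its neighbor i.

    With unit powers and unit bandwidth this is |w_j - w_i| (formula (7));
    otherwise loads are scaled by the powers s_i, s_j and the result is
    divided by the bandwidth f_ij of link (i, j).
    """
    return abs(w_j / s_j - w_i / s_i) / f_ij


# Hypothetical loads: node i holds 4 units, node j holds 10 units.
plain_interest = interest(4, 10)
weighted_interest = interest(4, 10, s_i=1, s_j=2, f_ij=2)
```

Swapping the roles of the two nodes (loads and powers together) leaves the value unchanged, which is the symmetry property noted in the text.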
3.3.2 The best interest
The best interest of processor $i$, noted $\mathrm{BestInterest}_i^t$, is the highest interest (unbalance) in the neighborhood $N_i^t$.

$$\mathrm{BestInterest}_i^t = \max_{k \in N_i^t} \{\mathrm{interest}_i^t(k)\}. \tag{8}$$

If, as we have mentioned before, any other processor $h$ is less loaded than node $j_0$, then considering node $j_{b-1}$, we obtain $\mathrm{interest}_{j_{b-1}}^t(h) > \mathrm{interest}_{j_{b-1}}^t(j_0)$ which implies $\mathrm{BestInterest}_{j_{b-1}}^t > \mathrm{interest}_{j_{b-1}}^t(j_0)$. The best interest represents a measure of a processor’s point of view and is the maximum unbalance it sees in its neighborhood. If it communicates this value to its neighbor nodes, they then can assess whether it can be a partner in the process of looking for a pair. This is performed by letting every processor send a message which has the form {id, iteration, subiteration, $\mathrm{Interested}_{for\,j}^t$, $\mathrm{BestInterest}_{id}^t$}. The meaning of the Interested Boolean is very simple: If the interest of neighbor $j$ is equal to $\mathrm{BestInterest}_{id}^t$, then $\mathrm{Interested}_{for\,j}^t$ is true which means that $i$ sees $j$ as being among the least or the most loaded neighbors. For example, in Fig. 1, nodes 0 and 1 have equal best interest (10), but both have computed it with node 2.
So, node 0 (resp. 1) is interested by node 2, but not by node 1 (resp. 0). This message type will be exchanged locally during a stage that we call the interest exchange phase. Based on the information collected after this exchange, processor $j_{b-1}$ selects among nodes $j_0, j_1, \ldots$ those that still consider it as the most loaded. For example, if node $j_0$ sends an interest message containing {$\mathrm{BestInterest}_{j_0}^t$, $\mathrm{Interested}_{j_{b-1}}^t$} such that $\mathrm{BestInterest}_{j_0}^t = \mathrm{BestInterest}_{j_{b-1}}^t$ and $\mathrm{Interested}_{j_{b-1}}^t = \mathrm{true}$, then node $j_{b-1}$ can be certain that it is more loaded than processor $j_0$. But if $\mathrm{BestInterest}_{j_0}^t > \mathrm{BestInterest}_{j_{b-1}}^t$ then $\mathrm{Interested}_{j_{b-1}}^t = \mathrm{false}$ and node $j_{b-1}$ now knows with certainty that processor $j_0$ has in view some other processor $h$ more loaded than $j_{b-1}$ and can proceed to remove $j_0$ from its list of processors having minimal load. Thus, in addition to enabling processors to solve the different points of view problem, the measure of the best interest offers to both nodes $j_{b-1}$ and $j_0$ the possibility of restricting their set of processors that present a problem of choice.

Fig. 1 Having equal best interests is not sufficient to be a valid candidate pair. Circles denote processors whose id is depicted outside. Load level is noted inside circles
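The interest exchange phase can be sketched for a homogeneous system as follows. The message is modeled as a plain dictionary, and the loads are hypothetical values chosen only to be consistent with the description of Fig. 1 (three mutually connected nodes, with nodes 0 and 1 both computing a best interest of 10 with node 2); the actual loads of the figure are not reproduced here.

```python
def best_interest(i, loads, neighbors):
    """Highest unbalance that node i sees in its neighborhood (formula (8))."""
    return max(abs(loads[k] - loads[i]) for k in neighbors[i])


def interest_messages(i, loads, neighbors, iteration=0, subiteration=0):
    """Build the interest message that node i sends to each neighbor j."""
    best = best_interest(i, loads, neighbors)
    return {
        j: {
            "id": i,
            "iteration": iteration,
            "subiteration": subiteration,
            # True when j realizes the best interest of i.
            "interested": abs(loads[j] - loads[i]) == best,
            "best_interest": best,
        }
        for j in neighbors[i]
    }


# Hypothetical loads consistent with the description of Fig. 1.
loads = {0: 15, 1: 15, 2: 5}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
from_node_0 = interest_messages(0, loads, neighbors)
```

Here node 0 announces the same best interest to both neighbors, but only the message addressed to node 2 carries a true Interested flag.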
3.3.3 The set of interesting processors
The set of interesting processors for node $i$ is defined by:
$$B_i^t = \{ j \in N_i^t : \mathrm{BestInterest}_j^t = \mathrm{BestInterest}_i^t \wedge \mathrm{Interested}_j^t = \mathrm{true} \}. \qquad (9)$$
$B_i^t$ is the set of processors that have the same maximal unbalance as $i$; they may be more or less loaded than $i$, and they are the only processors that will likely form a pair with $i$. From this stage on, node $i$ restricts the pair search to the set $B_i^t$.
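Formula (9) translates almost directly into a set comprehension. The sketch below is only illustrative: the function name `interesting_set` and the two dictionaries used to hold the values received during the interest exchange phase are assumptions, not part of the paper.

```python
def interesting_set(i, neighbors, best_interest, interested):
    """Return B_i as defined by formula (9).

    neighbors: ids in N_i (neighbors reachable through living links)
    best_interest: dict id -> BestInterest value (own value and received ones)
    interested: dict j -> Interested flag received from neighbor j
    """
    return {
        j
        for j in neighbors
        if best_interest[j] == best_interest[i] and interested[j]
    }


# Node 0 sees neighbors 1 and 2; all three report the same best interest,
# but only node 2 has flagged node 0 as interesting.
B0 = interesting_set(0, [1, 2], {0: 10, 1: 10, 2: 10}, {1: False, 2: True})
```

In this hypothetical situation, `B0` contains node 2 only, which matches the observation that equal best interests alone do not make a valid candidate pair.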
3.3.4 The most interesting processor
Let $B_i^t = \{b_0, b_1, b_2, \ldots\} \subseteq N_i^t$ be the set of interesting processors for node $i$ after the last interest exchange phase in a M2LL subiteration at GAE iteration $t$.
The solving of the problem of choice by a processor $i$ lets it deterministically find its most interesting processor, noted $\mathrm{MostInteresting}_i^t$, for the current M2LL subiteration. $\mathrm{MostInteresting}_i^t$ necessarily belongs to $B_i^t$.
It is easy to see that if a processor computes $B_i^t$ then $B_i^t \neq \emptyset$ and, at minimum, the cardinality of $B_i^t$ is equal to 1. In this case, this unique processor is the most interesting processor. Whenever the cardinality of $B_i^t$ is greater than 1, processor $i$ has to solve its problem of choice (see details in Sect. 3.6). Figure 2 shows three possible cases in a set $B_i^t$. In the first one (case (a)), processor 2 has a centered load between two or several processors. We can see that processor 2 has a centered load problem to solve. In order to detect it, it is sufficient that processor 2 verifies whether two or more interesting processors have different loads. Case (b) shows another situation where processor 2 is the least loaded against the totality of the $B_2^t$ set. In the third case (case (c)), the situation is inverted with respect to case (b); node 2 is the most loaded.
Fig. 2 The set of interesting processors and the problem of choice. (a) Load of processor 2 is centered between that of nodes 0,1 and 3,4. (b) Processor 2 is the least loaded in relation with all its interesting processors. (c) Processor 2 is the most loaded in relation with all its interesting processors
3.3.4.1 Case with a centered load problem
In Fig. 2(a), we let processor 2 choose one node among the least loaded ones, i.e., among 0 and 1. This implies that it is the most loaded in the pair being formed. Thus, this scheme makes it possible to ensure load sharing by favoring load migration from overloaded to underloaded regions. And finally, the problem of choice for processor 2 is between {0,1} and not {3,4}.
Let $L_i^t = \{ b_j \in B_i^t : w_{b_j}^t < w_i^t \}$ be the set of interesting processors that have a smaller load than that of $i$ and let Pref(.) be a given solution for the problem of choice. The most interesting processor for a node facing a problem of a centered load is defined according to formula (10).
$$\mathrm{MostInteresting}_i^t = \mathrm{Pref}(L_i^t) \quad \text{if } \left[ (|B_i^t| > 1) \wedge \neg (|L_i^t| = |B_i^t| \vee |L_i^t| = 0) \right]. \qquad (10)$$
3.3.4.2 Case without a centered load problem
If a processor does not face a problem of centered load, then it is necessarily in a second case: its load is minimum (cf. processor 2 in Fig. 2(b)) or maximum (Fig. 2(c)) against two or more neighbors. In this case, the choice of the most interesting processor is equivalent to solving the problem of choice. For example, processor 2 will have to choose from {0,1,3,4}. More generally, the most interesting processor is obtained by formula (11).
$$\mathrm{MostInteresting}_i^t = \mathrm{Pref}(B_i^t) \quad \text{if } \left[ (|B_i^t| > 1) \wedge (|L_i^t| = |B_i^t| \vee |L_i^t| = 0) \right]. \qquad (11)$$
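Formulas (10) and (11) can be read as one selection routine that differs only in the set handed to Pref(.). The following sketch is an illustration under stated assumptions: the function name `most_interesting` is hypothetical, the preference is passed in as an arbitrary callable, and the loads used in the example are invented to mimic cases (a) and (b) of Fig. 2.

```python
def most_interesting(my_load, B, loads, pref):
    """Select MostInteresting_i from the interesting set B (assumed non-empty).

    loads: dict id -> load w_b for each b in B
    pref: callable that solves the problem of choice on a set of ids
    """
    if len(B) == 1:
        return next(iter(B))  # unique candidate, no choice to make
    # L_i: interesting processors that are less loaded than i
    L = {b for b in B if loads[b] < my_load}
    centered = not (len(L) == len(B) or len(L) == 0)
    if centered:
        return pref(L)  # formula (10): choose among the less loaded ones
    return pref(B)      # formula (11): choose among all interesting ones


# Case (a): the load of processor 2 is centered between {0, 1} and {3, 4}.
case_a = most_interesting(10, {0, 1, 3, 4}, {0: 5, 1: 5, 3: 15, 4: 15}, pref=min)
# Case (b): processor 2 is the least loaded of all.
case_b = most_interesting(5, {0, 1, 3, 4}, {0: 15, 1: 15, 3: 15, 4: 15}, pref=max)
```

Here `min` and `max` over node ids stand in for the "lowest identity" and "last" arbitrary preferences; any other Pref(.) can be substituted without touching the routine.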
Now it is time to inform direct neighbors of the choice of the best interesting processor. A processor $i$ with id $id$ sends a message containing $\{id, iteration, \mathrm{MostInteresting}_i^t\}$ to all the nodes in $B_i^t$.
3.4 The pair processor
A processor $i$ that found node $j$ as its most interesting processor can conclude that $\mathrm{pair}_i^t = j$ iff $\mathrm{MostInteresting}_i^t = j$ and $\mathrm{MostInteresting}_j^t = i$. Whenever node $i$ finds its pair $j$, a simple comparison of their loads will make it clear which is the most loaded and which is the least loaded and will consequently give the direction of the load migration.
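The pairing condition is a mutual-choice test followed by a load comparison. The sketch below illustrates it with hypothetical helper names (`find_pair`, `migration_direction`); it assumes that the choices of the neighbors have already been received and stored in a dictionary.

```python
def find_pair(i, my_choice, choices):
    """Return the pair of node i, or None when the choices do not coincide.

    my_choice: MostInteresting value computed by node i
    choices: dict j -> MostInteresting value received from neighbor j
    """
    if my_choice is not None and choices.get(my_choice) == i:
        return my_choice
    return None


def migration_direction(i, j, loads):
    """Return (source, destination): load flows from the more loaded node."""
    return (i, j) if loads[i] > loads[j] else (j, i)


# Nodes 0 and 1 both chose node 2, but node 2 chose node 0 only.
pair_of_0 = find_pair(0, 2, {1: 2, 2: 0})
pair_of_1 = find_pair(1, 2, {0: 2, 2: 0})
```

In this example node 0 obtains node 2 as its pair while node 1 gets no pair in this subiteration and has to try again.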
3.5 The decision of a processor
A processor $j$ is declared to have taken its decision by a processor $i$, noted $\mathrm{decided}_i^t(j) = \mathrm{true}$, iff: (i) $\mathrm{interest}_i^t(j) \le 1$ or (ii) $(i,j) \in E_B^t$, i.e., the $(i,j)$ link is broken at time $t$, or (iii) $\mathrm{pair}_j^t$ exists according to Sect. 3.4. Moreover, a processor considers that it has taken its own decision, noted $\mathrm{decided}_i^t(i) = \mathrm{true}$, iff: (i) $\forall j \in N_i^t : \mathrm{decided}_i^t(j) = \mathrm{true}$ or (ii) $\mathrm{pair}_i^t$ exists according to Sect. 3.4. After one M2LL subiteration, each processor sends to all its neighbors a decision message that contains its current state $\mathrm{decided}_i^t(i)$ in the decision component. Based on this information, neighbors that have not taken their own decision yet can eliminate it from their respective load orders.
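The two decision predicates can be written as small Boolean functions. This is a sketch only; the function and parameter names are hypothetical, and the inputs are assumed to be already known locally by node i.

```python
def neighbor_decided(interest_ij, link_broken, neighbor_has_pair):
    """decided_i(j): neighbor j no longer takes part in the pair search of i.

    True when the load gap is at most 1, the (i, j) link is broken,
    or j has already found a pair.
    """
    return interest_ij <= 1 or link_broken or neighbor_has_pair


def self_decided(neighbor_flags, has_pair):
    """decided_i(i): node i has a pair or every neighbor is decided.

    neighbor_flags: dict j -> decided_i(j) for all j in N_i
    """
    return has_pair or all(neighbor_flags.values())


# Node i has no pair yet, and one of its two neighbors is still undecided.
still_searching = not self_decided({1: True, 2: False}, has_pair=False)
```

A node therefore leaves the M2LL loop either because it has found a pair or because none of its neighbors remains available.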
3.6 The preference of a processor
A processor i that considers node j as its most interesting processor according to
formulas (10) or (11) is said to have a preference for j.
The preference of a processor, simply noted Pref(.), is the process by which the
problem of choice is solved.
A very simple preference consists in: (i) arbitrarily choosing (the first, the last or the node with lower identity) or (ii) randomly choosing one of the conflicting elements.
In the following lines, we give a solution that makes it possible to maximize the number of formed pairs. Besides, we address the problem of centered load when it spans the network, which can lead to repetitive noncoinciding choices.
3.6.1 A choice based on the degree of freedom
The freedom degree of an interesting processor is defined by the number of nodes that present the best interest for it, that is, $|L_i^t|$ or $|B_i^t|$. The preference based on this number consists in favoring neighbor nodes that have a low degree of freedom when looking for the most interesting processor. The problem then amounts to choosing one node from $L_i^t$ or $B_i^t$ that has the minimum freedom degree. Moreover, in order to avoid the repetition of noncoinciding choices during two or more subiterations, we associate a memory with least loaded nodes to store the freedom degree of their most loaded neighbors. By iterating between nodes with no evolving freedom degrees, least loaded processors are ensured to get coinciding choices after some finite number of subiterations.
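A preference based on the freedom degree can be sketched as a minimum search. The example below is an assumption-laden illustration: it supposes that each candidate's freedom degree is already known locally (for instance because it was communicated by the candidate), it breaks ties by lowest id, and it leaves out the memory mechanism used against repeated noncoinciding choices. The name `pref_min_freedom` is hypothetical.

```python
def pref_min_freedom(candidates, freedom_degree):
    """Pick the candidate with the lowest freedom degree.

    candidates: set of ids taken from L_i or B_i (assumed non-empty)
    freedom_degree: dict id -> freedom degree of that candidate
    Ties are broken by lowest id, which is an arbitrary choice made here.
    """
    return min(candidates, key=lambda b: (freedom_degree[b], b))


# Candidate 1 has fewer alternatives than candidate 0, so it is favored.
choice = pref_min_freedom({0, 1}, {0: 3, 1: 1})
```

Favoring the candidate with the fewest alternatives leaves the more flexible candidates available to other nodes, which is the intuition behind maximizing the number of formed pairs.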
3.7 M2LL termination
The termination of a distributed algorithm is an essential property. The answer to the
question “Does M2LL terminate” can be given by answering that of “How much time
(subiterations) does a processor spend in M2LL”?
Proposition 1 If the network topology (broken edges) does not change during its execution, then M2LL terminates after a maximum of $(n/2)\, d_{\max}^t$ subiterations. Besides, the safety of M2LL is ensured by means of the decision concept and the choice based on the preference ensures its correctness.
Proof Let $d_{\max}^t$ be the degree of the graph $G = (V, E, E_B^t)$ at iteration $t$ of GAE and suppose it does not vary before $t+1$. If in the worst case noncoinciding choices arise in the network, our solution ensures that a pair of processors will take their own decision after a maximum of $d_{\max}^t$ subiterations. The number of possible pairs being at worst $(n/2)$, it follows that $(n/2)\, d_{\max}^t$ M2LL subiterations will be necessary if they should all be formed. □
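As a worked instance of this bound (an illustration only, not a figure reported by the authors), consider a 64-node hypercube and a 64-node ring. With no broken link their degrees are 6 and 2, respectively, so that:

```latex
\[
\text{hypercube: } \frac{n}{2}\, d_{\max}^{t} = \frac{64}{2} \times 6 = 192,
\qquad
\text{ring: } \frac{n}{2}\, d_{\max}^{t} = \frac{64}{2} \times 2 = 64 .
\]
```

Breaking links can only lower $d_{\max}^t$, so these values are upper limits on the worst-case number of subiterations for these two topologies.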
3.8 The GAE algorithm with the M2LL policy
Algorithm 1 shows the GAE load balancing algorithm with the M2LL policy. It can
easily be shown that the notion of decision in conjunction with the existence of a pair
is equivalent to the relation “node i communicates with node j” that has been used
in the definition of the GAE algorithm [19].
In the first stage (lines 4–5), each processor exchanges load information locally
on living links then keeps iterating (lines 10–27) within M2LL until it finds a pair
Algorithm 1 Generalized Adaptive Exchange (GAE) with the M2LL policy
 1: decided_i^t(i) = false;
 2: Pair_i^t = UNKNOWN; {GAE start}
 3: for all j ∈ N_i^t do
 4:    send(w_i^t, j);
 5:    receive(w_j^t);
 6: end for {exchange load information with all living links}
 7: bool localBalance_i^t = ∀j ∈ N_i^t : |w_j^t − w_i^t| ≤ 1;
 8: if (localBalance_i^t = false) then
 9:    {M2LL start}
10:    while ¬decided_i^t(i) do
11:       Find the processor MostInteresting_i^t; {with formulae 10 and 11}
12:       for all j ∈ N_i^t such that ¬decided_i^t(j) do
13:          send(MostInteresting_i^t, j);
14:          receive(MostInteresting_j^t);
15:       end for
16:       Pair_i^t = j ⇔ ∃j ∈ N_i^t : ¬decided_i^t(j) ∧ MostInteresting_i^t = j ∧ MostInteresting_j^t = i {Find pair_i^t}
17:       if (Pair_i^t ≠ UNKNOWN) then
18:          decided_i^t(i) = true;
19:       end if
20:       for all j ∈ N_i^t such that ¬decided_i^t(j) do
21:          send(decided_i^t(i), j);
22:          decided_i^t(j) = receive(decided_j^t(j));
23:       end for
24:       if (Pair_i^t = UNKNOWN) then
25:          decided_i^t(i) = ¬∃j ∈ N_i^t : ¬decided_i^t(j); {all neighbors decided, cf. Sect. 3.5}
26:       end if
27:    end while
28:    {M2LL end}
29:    if (decided_i^t(i) = true) then
30:       if (Pair_i^t = j ≠ UNKNOWN) then
31:          w_i^{t+1} = w_i^t + λ(w_j^t − w_i^t);
32:       else
33:          w_i^{t+1} = w_i^t;
34:       end if
35:    end if {GAE end}
36:    migrate-load();
37: end if
or knows that all its neighbors took their decision. During one M2LL subiteration, a processor exchanges two kinds of messages: interest exchange and decision messages. On line 13, the outcome of formula (10) or (11) is sent to adjacent nodes that are “not decided yet.” Finding the most interesting processor allows each node to eventually find a pair. The necessary condition is stated on line 16. In the last stage (lines 21–22), each processor indicates to its neighbors participating in the current M2LL subiteration, by a decision message, whether it has found a pair. If so, its neighbors that have not succeeded in taking their decision after the current subiteration remove it from their respective load orders.
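The load update on line 31 of Algorithm 1 is executed by both members of a pair, each from its own point of view. The sketch below illustrates this symmetric update; the function name `exchange` is hypothetical and the default λ = 1/2 is only an example value for the exchange parameter.

```python
def exchange(w_i, w_j, lam=0.5):
    """Apply the update of line 31 on both ends of a pair (i, j).

    Each node moves its load toward that of its partner by a fraction lam
    of the gap, so the total load of the pair is conserved.
    """
    new_i = w_i + lam * (w_j - w_i)
    new_j = w_j + lam * (w_i - w_j)
    return new_i, new_j


balanced = exchange(15, 5)           # lam = 1/2 equalizes the two loads
partial = exchange(15, 5, lam=0.25)  # a smaller lam closes only part of the gap
```

With λ = 1/2 the pair ends up perfectly balanced in one step, whereas other values of λ leave a residual gap that later iterations have to absorb.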
4 Experimental results
4.1 Framework
Because we want to compare the two algorithms when using optimal diffusion and exchange parameters, we are somewhat constrained, for the initial static networks, to k-ary n-cube structures since these optimal parameters can be derived only for this class of topologies. So, we shall focus on chain, ring, mesh, torus, and hypercube interconnecting networks on an architecture of 64 processors. On the other hand, these topologies are more likely to be encountered in real world scientific applications where the problem domain is often discretized by using one-, two-, or three-dimensional grids. These topologies differ by their maximum degree δ, their diameter D (see last column of Table 11) and the number of links in the static situation, and consequently their number in the dynamic case (see Table 1). Before each step of the simulation, a randomly generated dynamic topology is set up by breaking a subset of the static topology links; the subset size represents a percentage p of the edges of the considered network. We experiment with three edge failure probabilities: 10, 30, and 50% of links are broken, corresponding to a network that experiences low, medium, and high link dynamics, respectively. Each result presented hereafter is the mean outcome of ten distinct experiments. The opt subscript associated with GAE M2LL or RFOS respectively means an optimal exchange parameter and an optimal diffusion parameter. Likewise, the cyb subscript expresses the use of a standard default value of the exchange parameter λ = 1/2, while boi is for the use of Boillat’s diffusion parameter. Table 2 shows standard and optimal exchange values for the initial static topologies and Table 3 shows Boillat and optimal diffusion parameter values for the same topologies along with the computed β relaxation factor value [23]; an empty entry means that the value for this topology is variable. For general formulae to compute the optimal exchange parameter values, the reader should refer to [13] and references therein. For optimal diffusion parameter values, [8] is to be considered.
The global system’s load is 6,400 virtual load units, meaning that the expected
Table 1 Number of total links for the 64-node initial static topologies and the number of broken edges for p = 10%, 30%, and 50%

Topology   Total   p = 10%   p = 30%   p = 50%
Chain         63         7        19        32
Ring          64         7        20        32
2DMesh       112        12        34        56
3DMesh       144        15        44        72
2DTorus      128        13        39        64
HyperC       192        20        58        96
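The construction of a dynamic topology from a static one can be sketched as follows. This is an illustration under two assumptions: the number of broken links is the percentage p of the total rounded up, which is consistent with every entry of Table 1 but is not stated explicitly in the text, and the links to break are drawn uniformly at random. The function names are hypothetical.

```python
import random


def broken_edge_count(total_links, percent):
    """Number of links to break: percent % of the total, rounded up.

    The rounding rule is an assumption inferred from Table 1.
    """
    return -((-total_links * percent) // 100)  # integer ceiling


def break_links(edges, percent, rng=random):
    """Return a random subset of edges to be marked as broken."""
    k = broken_edge_count(len(edges), percent)
    return set(rng.sample(sorted(edges), k))


# 64-node ring: 64 links, of which 30% are broken.
ring = [(i, (i + 1) % 64) for i in range(64)]
broken = break_links(ring, 30, random.Random(0))
```

Passing a seeded `random.Random` instance makes a given dynamic topology reproducible, which is convenient when averaging over several runs.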