Trees and Butterflies Barriers in Distributed Simulation System: A Better Approach to Improve Latency and the Processor Idle Time
Syed S. Rizvi, K. M. Elleithy
Computer Science and Engineering Department
University of Bridgeport
Bridgeport, CT 06605, USA
{
iv,elleit
(
brdpr.ed
Aasia Riasat
Department of Computer Science
Institute of Business Management
Karachi, Pakistan 78100
aasia.
riasaitcbm.
edu.
pk
Abstract

Global virtual time (GVT) is used in parallel discrete event simulations to reclaim memory, commit output, detect termination, and handle errors. Mattern [1] proposed a GVT approximation with a distributed termination detection algorithm. This algorithm works well and gives optimal performance in terms of accurate GVT computation, at the expense of a slower execution rate. The slower execution rate results in a high GVT latency. Because of the high GVT latency, the processors involved in the communication remain idle during that period. As a result, the overall throughput of a parallel discrete event simulation system degrades significantly. The high GVT latency thus prevents the widespread use of this algorithm in parallel discrete event simulation systems. However, if the latency of the GVT computation could be improved, most parallel discrete event simulation systems would likely take advantage of this technique for its accurate GVT computation. In this paper, we examine the potential use of tree and butterfly barriers with Mattern's GVT structure, which uses a ring. Simulation results demonstrate that the use of tree barriers with Mattern's GVT structure can significantly improve the latency and thus increase the overall throughput of the parallel simulation system. The performance measures adopted in this paper are the achievable latency for a fixed number of processors and the number of messages transmitted during the GVT computation.

1. Introduction

The term distributed refers to distributing the execution of a single run of a simulation program across multiple processors [2]. One of the main problems associated with distributed simulation is the synchronization of the distributed execution. If not properly handled, synchronization problems may degrade the performance of a distributed simulation environment [5]. The situation becomes more severe when the synchronization algorithm must run a detailed logistics simulation in a distributed environment to simulate a huge amount of data, as specified in [6] (in press).

Event synchronization is an essential part of parallel simulation [2]. In general, synchronization protocols can be categorized into two families: conservative and optimistic. Time Warp is an optimistic protocol for synchronizing parallel discrete event simulations [3]. Global virtual time (GVT) is used in the Time Warp synchronization mechanism to reclaim memory, commit output, detect termination, and handle errors. GVT can be considered a global function that is computed many times during the course of a simulation. The time required to compute the value of GVT may result in performance degradation due to a slower execution rate [4].
On the other hand, a small GVT latency (the delay between its occurrence and its detection) reduces the processor's idle time and thus improves the overall throughput of a distributed simulation system.

Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:16:22 EST from IEEE Xplore. Restrictions apply.

Mattern [1] proposed a GVT approximation with a distributed termination detection algorithm. This algorithm works well and gives optimal performance in terms of accurate GVT computation, at the expense of a slower execution rate. The slower execution rate results in a high GVT latency. Because of the high GVT latency, the processors involved in the communication remain idle during that period. As a result, the overall throughput of a parallel discrete event simulation system degrades significantly. The high GVT latency thus prevents the widespread use of this algorithm in parallel discrete event simulation systems. However, if the latency of the GVT computation could be improved, most parallel discrete event simulation systems would likely take advantage of this technique for its accurate GVT computation. In this paper, we examine the potential use of tree and butterfly barriers with Mattern's GVT structure, which uses a ring. Simulation results demonstrate that the use of tree barriers with Mattern's GVT structure can significantly improve the latency and thus increase the overall throughput of the parallel simulation system. The performance measures adopted in this paper are the achievable latency for a fixed number of processors and the number of messages transmitted during the GVT computation.

Thus, this paper focuses on the implementation of tree and butterfly barrier structures; we do not focus on how the GVT is actually computed. Instead, our focus is on the parameters (if any) or factors that may improve or degrade the latency of the GVT computation. In addition, we briefly describe what changes (if any) the implementation of these new barrier structures may introduce that could affect the overall latency.

1.1. Key assumptions

Before presenting our discussion on implementation, it is worth mentioning some key assumptions and features of Mattern's original algorithm.

* Mattern's algorithm is asynchronous (i.e., it does not require global synchronization).
* Mattern's algorithm does not use message acknowledgement.
* Mattern's algorithm uses two cuts, C1 and C2. C1 is intended to inform each processor to begin recording the smallest time stamp, whereas C2 guarantees that no message generated prior to the first cut is in transit.
* For our analysis, we assume that $t_p$ is the time required to send one message from one processor to its neighbor (note that this neighboring processor might be a child for C1 and a parent for C2).
* In addition, we assume that both rounds of message transmission are required to compute the final value of GVT (i.e., the moment when the second cut has been fully constructed).

2. Comparative analysis of tree and butterfly barriers

To identify a processor's current state with respect to the two cuts C1 and C2, we use the same coloring scheme adopted by Mattern's original algorithm. We observe that both cuts, C1 and C2, have a direct impact on the latency of the GVT computation, whereas the processor idle time is related only to the second cut, C2. Next, we discuss the implementation of a tree structure in Mattern's GVT algorithm.

2.1. Analysis of a tree barrier

We assume that initially all processors (nodes) and their neighbors, organized in a minimal tree-based structure (i.e., one with no cycles), are colored white. We also assume that there is one initiator of the GVT computation, which may be considered the root of the tree (i.e., the node where message transmission starts). The moment the initiator processor initiates the GVT computation, it changes from white to red. At the same time, it starts a broadcast scheme to send control messages indirectly (i.e., from node to edges) to all connected processors. This first broadcast transmission (the process of making processors red) from the root (i.e., the initiator processor) to all its connected nodes is intended for the first cut, C1. According to our initial assumptions, Mattern's algorithm does not require acknowledgement messages, but it does require the construction of the second cut, C2. We assume that constructing the second cut C2 requires the same number of messages, which propagate from the processors (i.e., the edges of the tree) to the initiator (i.e., the root of the tree). This implies that any processor in the given design that is part of a balanced minimal tree must process two messages: one for constructing the first cut C1 and the other for constructing the second cut C2.
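The two message rounds described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' simulator: the binary tree layout, the function names `build_balanced_tree` and `two_cut_message_count`, and the use of a breadth-first broadcast are all assumptions made for the example. It colors every node red during the C1 broadcast and counts one message per tree edge in each round, so a tree with N nodes (and therefore N - 1 edges) carries 2(N - 1) messages in total.

```python
from collections import deque


def build_balanced_tree(n):
    """Children lists for a binary tree on nodes 0..n-1 (node 0 is the root).

    The binary layout is an assumption made for this sketch.
    """
    return {i: [c for c in (2 * i + 1, 2 * i + 2) if c < n] for i in range(n)}


def two_cut_message_count(n):
    """Simulate the C1 broadcast and the C2 reply round; return (c1, c2) counts."""
    children = build_balanced_tree(n)
    color = {i: "white" for i in range(n)}

    # First cut C1: the initiator (root) turns red and broadcasts down the tree.
    color[0] = "red"
    c1_messages = 0
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            c1_messages += 1      # one control message per tree edge
            color[child] = "red"  # the receiver changes from white to red
            queue.append(child)

    # Second cut C2: every non-root node sends one message towards the root,
    # which again amounts to one message per tree edge.
    c2_messages = sum(len(kids) for kids in children.values())

    assert all(c == "red" for c in color.values())
    return c1_messages, c2_messages


if __name__ == "__main__":
    for n in (1, 4, 7, 16):
        c1, c2 = two_cut_message_count(n)
        print(n, c1, c2, c1 + c2)
```

Counting messages per edge rather than per node keeps the sketch independent of the tree's shape; only the number of edges matters for the total.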
If we assume that a minimal tree contains N nodes in total and e edges in total (i.e., links to neighboring processors), then the number of messages follows from the relationship e = N - 1. This relationship shows that every processor in a minimal tree, except the initiator (i.e., the root of the tree), has to receive the broadcast message. The same is true for the second cut C2. Mathematically, this relationship can be expressed as:

$$2e = 2(N - 1) \tag{1}$$

This implies that a total of 2(N - 1) messages must be transmitted to successfully construct both cuts C1 and C2. Consequently, the time required (i.e., the latency for GVT) to finish this message transmission for both cuts (C1 and C2) in a minimal tree can be computed using (2).

$$2 t_p e = 2 t_p (N - 1) \tag{2}$$

Equation (2) gives approximately the same time as an implementation of this structure using a unidirectional ring, where only one message transmission path exists between any two processors or nodes, as shown in fig. 1. Note also that the latency of a simple minimal tree structure varies with the number of rounds, as shown in fig. 2. Both fig. 1 and fig. 2 satisfy the characteristics of (2).

Fig. 1. No. of Rounds versus Latency (seconds)
Fig. 2. No. of Rounds versus Latency (seconds)
[Figs. 1 and 2: plots of latency against the number of rounds; panel titles: Minimal Tree Structure, Unidirectional Ring]

The above discussion describes a variant of a tree-barrier-based organization that does not have cycles (i.e., it takes the same amount of time to propagate one message to all processors, whether they are part of a balanced tree or a unidirectional ring). Fig. 3 and fig. 4 show this organization.

Fig. 3. Ring Structure: Outer ring (blue line) is intended for cut C1 and the inner ring (red line) is intended for cut C2

In addition, our analysis demonstrates that one can achieve the same latency for Mattern's algorithm in a tree barrier if we assume that two rounds of messages propagate: one from the initiator to all processors (intended for C1) and one from all processors to the root (intended for C2). However, the latency can be improved if parallel traversal of the connected processors is allowed.

The above discussion can be extended to a tree structure whose left and right subtrees have different lengths. This structure requires $2 t_p e$ time to compute the final value of GVT, where e equals N - 1. However, if the same tree structure is organized in a way that allows parallel traversal of the left and right subtrees, the time required to compute GVT by distributing messages all the way from the initiator to the processors and back reduces to exactly (3).

$$2 t_p \log_2 (N - 1) \tag{3}$$

In other words, this latency does not depend on how many control rounds are initiated by the main node (i.e., the initiator). Instead, it takes exactly two rounds of message transmission time to compute the final value of GVT, as shown in fig. 5. This variation of the tree-based structure can therefore make a performance difference (from the latency point of view) if the latency of a ring-based structure varies widely with the number of control messages.

[Fig. 5: plot for the parallel tree; legend: Tree (Parallel); x-axis: tp (seconds)]

However, the latency gain can be reduced if a distributed system does not frequently require the second round of message transmission. On the other hand, the latency can be improved significantly if the construction of the two cuts is not consistent enough (note also that the cuts need not be consistent, as that may increase the GVT computation time).

As a final remark, the performance gain (time to compute the GVT value) due to the implementation of a tree barrier depends greatly on how the processors are organized in the tree. The performance of a balanced tree without cycles is exactly the same as that of a ring-based structure (i.e., both barrier structures have the same message complexity, which results in equal latency for computing the final value of GVT). However, fig. 5 shows that the latency of computing GVT can be improved if parallel traversal of the right and left subtrees is allowed. In particular, this implementation bounds the latency by the number of edges rather than by the control rounds of messages on a ring structure.

2.2. Analysis of a butterfly barrier

A butterfly barrier requires each processor to perform log2(N) pairwise synchronizations, such that a processor can complete all of its synchronizations only when all processors have reached the barrier. Our previous analysis indicates that the barriers (ring or tree) in a normal situation run in O(log N) time. Fig. 6 shows an example of a butterfly barrier in which four processors are organized and send/receive messages to each other. To make this clearer, the barrier requires log2(N) steps with the transmission of N[log2(N)] messages, since each processor must send and receive one message in each step of the algorithm. Thus, the asymptotic complexity of this barrier is clearly higher than that of the tree or ring structures, which in turn gives a higher latency.
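The step and message counts quoted above can be checked with a small simulation. The following is a minimal sketch, assuming the usual butterfly pairing in which processor i exchanges with processor i XOR 2^s in step s and assuming N is a power of two; the function name `butterfly_barrier` and the "knowledge set" bookkeeping are illustrative choices, not part of the paper.

```python
def butterfly_barrier(n):
    """Run a butterfly barrier over n processors (n must be a power of two).

    Returns (steps, messages, knowledge), where knowledge[i] is the set of
    processors whose arrival at the barrier processor i has learned about.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")

    knowledge = [{i} for i in range(n)]
    steps = 0
    messages = 0
    distance = 1
    while distance < n:
        # All exchanges in one step happen at the same time, so every
        # processor reads its partner's state from before the step.
        snapshot = [set(k) for k in knowledge]
        for i in range(n):
            partner = i ^ distance
            knowledge[i] |= snapshot[partner]  # i receives partner's view
            messages += 1                      # one message from partner to i
        steps += 1
        distance *= 2
    return steps, messages, knowledge


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        steps, messages, knowledge = butterfly_barrier(n)
        everyone = all(k == set(range(n)) for k in knowledge)
        print(n, steps, messages, everyone)
```

After the last step every processor has heard, directly or indirectly, from every other processor, which is what lets each one leave the barrier without a central coordinator. The price is N messages per step, which is where the N log2(N) total comes from.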
Fig. 4. Tree Barrier Organization: Blue line is intended for cut C1 and the red line is intended for cut C2
Fig. 6. Butterfly Barrier Organization: Arrows in the figure show that the node is arriving at/reaching the barrier to other processors. [Node labels: N1 to N4; axis: Time]
Fig. 7. No. of Processors (N) Versus Messages Transmitted [legend: Butterfly Barrier Structure, Tree (Parallel)]
Fig. 9. No. of Processors (N) Versus Messages Transmitted for 10-Level System Structure [legend: Butterfly Barrier Structure]

The complexity comparison in terms of message transmission of the butterfly barrier with the ring and tree structures can be analyzed in fig. 7.
In harmony with our expectations, as the number of processors N increased, the performance of the butterfly barrier degraded. However, the performance degradation of the butterfly barrier was small compared to the ring and tree structures for the first few values of N. For instance, consider N processors organized using a k-level butterfly structure. At each stage of this structure, the total number of transmitted messages can be computed using (4):

$$N \, \frac{2^k - 1}{2k} \tag{4}$$

where k represents the level of the structure and N is the total number of processors.

The simulation results of fig. 8 and fig. 9 demonstrate the behavior of message transmission with respect to a linear variation in the total number of processors (N) for the 5-level and 10-level system structures, respectively. In addition, as the value of N approaches twice the number of levels, the message transmission approaches $2^k - 1$. This relationship can be expressed as:

$$N \to 2k \;\Rightarrow\; (\text{Message Transmission}) \to (2^k - 1) \tag{5}$$

Fig. 8. No. of Processors (N) Versus Messages Transmitted for 5-Level System Structure [legend: Butterfly Barrier Structure (k = 5), Butterfly Barrier Structure (k = 10)]

The simulation results of fig. 8 and fig. 9 satisfy the characteristics of (5). It is worth mentioning that the asymptotic latency of the butterfly is exactly the same as that of the merge algorithm, where the total of N comparisons is analogous to the total of N messages transmitted in one direction. From the message complexity point of view, the latency of the butterfly barrier for Mattern's GVT algorithm lies in a logarithmic region with a constant N. This implies that for a small value of the constant N, the latencies of the tree/ring structures and the butterfly barrier almost overlap, but as we increase the number of processors in a system, the latency improvement due to the tree structure becomes obvious. Note also that we demonstrated that the ring and tree structures for Mattern's algorithm give the same latency, except in the one case where we consider parallel traversal of nodes, as shown in fig. 5.
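To see how quickly the three structures diverge as N grows, the latency expressions can be evaluated side by side. The sketch below is an illustration under stated assumptions, not the simulation used for the figures: the serial tree (or unidirectional ring) is taken as 2(N - 1) messages at $t_p$ each, the parallel tree as $2 t_p \log_2(N - 1)$, and the butterfly as N log2(N) messages at $t_p$ each, sent one after another. The function names are hypothetical.

```python
import math


def serial_tree_latency(n, tp):
    """Serial tree or unidirectional ring: 2 * tp * (N - 1)."""
    return 2 * tp * (n - 1)


def parallel_tree_latency(n, tp):
    """Tree with parallel traversal of subtrees: 2 * tp * log2(N - 1)."""
    return 2 * tp * math.log2(n - 1)


def butterfly_latency(n, tp):
    """Butterfly barrier, assuming N * log2(N) messages are sent in sequence."""
    return n * tp * math.log2(n)


if __name__ == "__main__":
    tp = 1.0  # time to send one message to a neighbor, in seconds
    print("N  serial  parallel  butterfly")
    for n in (3, 5, 9, 17, 33):
        print(
            n,
            round(serial_tree_latency(n, tp), 2),
            round(parallel_tree_latency(n, tp), 2),
            round(butterfly_latency(n, tp), 2),
        )
```

For the smallest values of N the three numbers are close together, and the gap widens as N increases, with the parallel tree growing only logarithmically.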
Therefore, the latency comparison between the tree and the butterfly barriers is the same as for the ring and the butterfly barriers.

Recall our previous assumption that two rounds of message transmission are required for constructing the two cuts C1 and C2. This implies that a total of N[log2(N)] messages will be transmitted for both C1 and C2 in Mattern's GVT computation if a butterfly barrier is implemented rather than a ring. Consequently, the total time required to determine the new value of GVT in a butterfly barrier structure can be computed by (6).

$$N t_p [\log_2 (N)] \tag{6}$$

where $t_p$ is the time required to propagate one message from one processor to the other neighboring processor. Fig. 10 demonstrates the required latency for the three approaches (i.e., the ring, tree, and butterfly barriers) with respect to the number of rounds.

Fig. 10. tp (seconds) Versus Latency (seconds) [legend: Butterfly Barrier Structure, Tree (Serial), Tree (Parallel)]

3. Conclusion

In this paper, we provide a comparative analysis of Mattern's GVT structure using a ring structure with the potential use of tree and butterfly barriers to improve the latency and/or the processor idle time. The simulation results have verified that the use of butterfly barriers is not appropriate with an asynchronous type of algorithm when the target is to improve the latency of the system. Since the latency is directly related to how many messages each processor sends, the butterfly barrier may not be a good candidate for improving the latency of the GVT computation. However, our experimental verifications have shown that the latency of the GVT computation can be improved if the tree-based structure is organized in a way that allows parallel traversal of the left and right subtrees simultaneously. In addition, our results suggest that for small values of N, the latencies of all barriers (serial tree and butterfly) overlap. However, as we increase the value of N, the performance differences (i.e., the time required to compute the final value of GVT) among these approaches become obvious.

4. References

[1] Mattern, F., Mehl, H., Schoone, A., Tel, G., "Global Virtual Time Approximation with Distributed Termination Detection Algorithms," Tech. Rep. RUU-CS-91-32, Department of Computer Science, University of Utrecht, The Netherlands, 1991.
[2] Friedemann Mattern, "Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation," Journal of Parallel and Distributed Computing, Vol. 18, No. 4, 1993.
[3] Ranjit Noronha and Abu-Ghazaleh, "Using Programmable NICs for Time-Warp Optimization," Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM, PP 613, 2002.
[4] D. Bauer, G. Yaun, C. Carothers, S. Kalyanaraman, "Seven-O'Clock: A new Distributed GVT Algorithm using Network Atomic Operations," 19th Workshop on Principles of Advanced and Distributed Simulation (PADS'05), PP 39-48.
[5] Syed S. Rizvi, K. M. Elleithy, Aasia Riasat, "Minimizing the Null Message Exchange in Conservative Distributed Simulation," International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering, CISSE 2006, Bridgeport CT, December 4-14, 2006.
[6] Lee A. Belfore, Saurav Mazumdar, and Syed S. Rizvi et al., "Integrating the joint operation feasibility tool with JFAST," Proceedings of the Fall 2006 Simulation Interoperability Workshop, Orlando Fl, September 10-15, 2006.