Conference Paper

Trees and Butterflies Barriers in Distributed Simulation System: A Better Approach to Improve Latency and the Processor Idle Time


Abstract

Global virtual time (GVT) is used in parallel discrete event simulations to reclaim memory, commit output, detect termination, and handle errors. Mattern [1] proposed a GVT approximation with a distributed termination detection algorithm. This algorithm works fine and gives optimal performance in terms of accurate GVT computation, at the expense of a slower execution rate. This slower execution rate results in a high GVT latency. Due to the high GVT latency, the processors involved in communication remain idle during that period of time. As a result, the overall throughput of a discrete event parallel simulation system degrades significantly. Thus, the high GVT latency prevents the widespread use of this algorithm in discrete event parallel simulation systems. However, if we could improve the latency of GVT computation, most discrete event parallel simulation systems would likely take advantage of this technique in terms of accurate GVT computation. In this paper, we examine the potential use of tree and butterfly barriers with Mattern's GVT structure using a ring. Simulation results demonstrate that the use of tree barriers with Mattern's GVT structure can significantly improve the latency and thus increase the overall throughput of the parallel simulation system. The performance measure adopted in this paper is the achievable latency for a fixed number of processors and the number of message transmissions during the GVT computation.
Syed S. Rizvi, K. M. Elleithy
Computer Science and Engineering Department
University of Bridgeport
Bridgeport, CT 06605, USA

Aasia Riasat
Department of Computer Science
Institute of Business Management
Karachi, Pakistan 78100

1. Introduction

The term distributed refers to distributing the execution of a single run of a simulation program across multiple processors [2]. One of the main problems associated with distributed simulation is the synchronization of distributed execution. If not properly handled, synchronization problems may degrade the performance of a distributed simulation environment [5]. This situation becomes more severe when the synchronization algorithm must run a detailed logistics simulation in a distributed environment to simulate a huge amount of data, as specified in [6] (in press).

Event synchronization is an essential part of parallel simulation [2]. In general, synchronization protocols can be categorized into two families: conservative and optimistic. Time Warp is an optimistic protocol for synchronizing parallel discrete event simulations [3]. Global virtual time (GVT) is used in the Time Warp synchronization mechanism to reclaim memory, commit output, detect termination, and handle errors. GVT can be considered a global function that is computed many times during the course of a simulation. The time required to compute the value of GVT may result in performance degradation due to a slower execution rate [4]. On the other hand, a small GVT latency (the delay between its occurrence and detection) reduces the processor's idle
Authorized licensed use limited to: University of Bridgeport. Downloaded on February 24,2010 at 13:16:22 EST from IEEE Xplore. Restrictions apply.
time and thus improves the overall throughput of a distributed simulation system.

Mattern [1] proposed a GVT approximation with a distributed termination detection algorithm. This algorithm works fine and gives optimal performance in terms of accurate GVT computation, at the expense of a slower execution rate. This slower execution rate results in a high GVT latency. Due to the high GVT latency, the processors involved in communication remain idle during that period of time. As a result, the overall throughput of a discrete event parallel simulation system degrades significantly. Thus, the high GVT latency prevents the widespread use of this algorithm in discrete event parallel simulation systems. However, if we could improve the latency of GVT computation, most discrete event parallel simulation systems would likely take advantage of this technique in terms of accurate GVT computation. In this paper, we examine the potential use of tree and butterfly barriers with Mattern's GVT structure using a ring. Simulation results demonstrate that the use of tree barriers with Mattern's GVT structure can significantly improve the latency and thus increase the overall throughput of the parallel simulation system. The performance measure adopted in this paper is the achievable latency for a fixed number of processors and the number of message transmissions during the GVT computation.

In this paper, we focus on the implementation of tree and butterfly barrier structures; that is, we do not focus on how the GVT is actually computed. Instead, our focus is on the parameters or factors (if any) that may improve or degrade the latency involved in GVT computation. In addition, we briefly describe what changes (if any) may be introduced by the implementation of these new barrier structures that may have an impact on the overall latency.

1.1. Key assumptions

Before presenting our discussion on implementation, it is worth mentioning some key assumptions and features of the original Mattern's algorithm:

* Mattern's algorithm is asynchronous (i.e., it does not require global synchronization).
* Mattern's algorithm does not use message acknowledgement.
* Mattern's algorithm uses two cuts, C1 and C2. C1 is intended to inform each processor to begin recording the smallest time stamp, whereas C2 guarantees that no message generated prior to the first cut is in transit.
* For our analysis, we assume that tp is the time required to send one message from one processor to its neighbor (note that this neighboring processor might be a child for C1 and a parent for C2).
* In addition, we assume that both rounds of message transmission are required to compute the final value of GVT (i.e., the moment when the second cut has been fully constructed).

2. Comparative analysis of tree and butterfly barriers

In order to identify a processor's current state with respect to the two cuts C1 and C2, we use the same coloring scheme adopted by the original Mattern's algorithm. It is observed that the two cuts C1 and C2 have a direct impact on the latency involved in GVT computation, whereas the processor idle time is related only to the second cut C2. Next, we present a discussion on the implementation of a tree structure in Mattern's GVT algorithm.

2.1. Analysis of a tree barrier

We assume that initially all processors (nodes) and their neighbors, organized in a minimal tree based structure (i.e., one with no cycles), are colored white. In addition, we assume that there is one initiator of the GVT computation, which may also be considered the root of the tree (i.e., the node where message transmission starts). When a processor initiates the GVT computation, it turns from white to red. At the same time, it starts a broadcast scheme to indirectly (i.e., from node to edges) send control messages to all connected processors. Thus, this first broadcast transmission (the process of turning nodes red) from the root (i.e., the initiator processor) to all its connected nodes is intended for the first cut C1.

According to our initial assumptions, Mattern's algorithm does not require acknowledgement messages, but it does require the construction of the second cut C2. We assume that, in order to construct the second cut C2, the same number of messages must propagate from the processors (i.e., the edges of the tree) to the initiator (i.e., the root of the tree). This implies that any processor in the given design that is part of a balanced minimal tree must process two messages: one for constructing the first cut C1 and the other for constructing the second cut C2.
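The two rounds described above can be illustrated with a minimal sketch. It assumes the tree is given as a dictionary of child lists; the function name `two_cut_messages` and the 7-node example tree are hypothetical and not taken from the paper. The sketch only counts control messages and tracks the white/red coloring; it does not compute GVT itself.

```python
from collections import deque


def two_cut_messages(children, root):
    """Count control messages for cut C1 (broadcast) and cut C2 (reply to root).

    children: dict mapping each node to a list of its child nodes (a tree).
    Returns (c1_messages, c2_messages, colors).
    """
    colors = {node: "white" for node in children}

    # Cut C1: the initiator turns red and broadcasts down the tree.
    colors[root] = "red"
    c1_messages = 0
    order = []                      # visiting order, root first
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            c1_messages += 1        # one control message per tree edge
            colors[child] = "red"   # receiving the C1 message turns the node red
            queue.append(child)

    # Cut C2: every non-root node sends one message toward the root, leaves first.
    c2_messages = 0
    for node in reversed(order):
        if node != root:
            c2_messages += 1        # one message per tree edge

    return c1_messages, c2_messages, colors


# Hypothetical 7-node balanced binary tree rooted at node 0.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}
c1, c2, colors = two_cut_messages(tree, root=0)
print(c1, c2)  # 6 6
```

In this sketch each tree edge carries exactly one message per cut, so every non-root processor handles one message for C1 and one for C2.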

If we assume that a minimal tree has a total of N nodes and e edges (i.e., links between neighboring processors), then the following relationship can be derived: e = N - 1. This relationship shows that every processor in a minimal tree, except the initiator (i.e., the root of the tree), has to receive the broadcast message. The same is true for the second cut C2. Mathematically, the total number of messages can be expressed as:

$$2e = 2(N-1) \qquad (1)$$

This implies that a total of 2(N-1) messages need to be transmitted to successfully construct both cuts C1 and C2. Consequently, the time required (i.e., the latency for GVT) to finish all of this message transmission for both cuts (C1 and C2) in a minimal tree can be computed using (2).

$$2 t_p e = 2 t_p (N-1) \qquad (2)$$

Equation (2) gives approximately the same time as if we implement this structure using a unidirectional ring, where only one message-transmission path exists between two processors or nodes, as shown in fig. 1. Also, note that the latency for a simple minimal tree structure varies with the number of rounds, as shown in fig. 2. Both fig. 1 and fig. 2 satisfy the characteristics of (2).

Fig. 1. No. of rounds versus latency (seconds).
Fig. 2. No. of rounds versus latency (seconds).

The above discussion describes a variant of a tree barrier based organization that does not have cycles (i.e., it takes the same amount of time to propagate one message to all processors, whether they are part of a balanced tree or a unidirectional ring). Fig. 3 and fig. 4 show this organization.

Fig. 3. Ring structure: the outer ring (blue line) is intended for cut C1 and the inner ring (red line) is intended for cut C2.

In addition, our analysis demonstrates that one can achieve the same latency for Mattern's algorithm if we assume that two rounds of messages propagate in a tree barrier: from the initiator to all processors (intended for C1) and from all processors to the root (intended for C2). However, the latency can be improved if parallel traversal of connected processors is allowed.

The above discussion can be extended to a tree structure in which the left and right subtrees have different lengths. Such a structure requires $2 t_p e$ time to compute the final value of GVT, where e equals N-1. However, if the same tree structure is organized in a way that allows parallel traversal of the left and right subtrees, the time required to compute GVT by distributing messages all the way from the initiator to the processors and back reduces to exactly (3).

$$2 t_p \log_2 (N-1) \qquad (3)$$

In other words, this latency does not depend on how many control rounds are initiated by the main node (i.e., the initiator). Instead, it takes exactly two rounds of message-transmission time to compute the final value of GVT, as shown in fig. 5. This variation of the tree based structure can therefore make a performance difference (from a latency point of view) if the latency of a ring based structure varies widely with the number of control messages.

[Figure: latency plot with legend "Tree (Parallel)"; x-axis: tp (seconds); caption not recoverable.]

However, the latency gain can be reduced if a distributed system does not require the second round of message transmission frequently. On the other hand, the latency can be improved significantly if the construction of the two cuts is not required to be consistent (note also that the cuts need not be consistent, since consistency may increase the GVT computation time).
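As a rough numerical illustration of the tree barrier analysis, the sketch below evaluates the message count 2(N-1) of (1), a serial latency of 2 tp (N-1), and the parallel-traversal latency 2 tp log2(N-1) of (3). Reading (2) as 2 tp (N-1) is an assumption based on the surrounding text, and the function names and the value tp = 1.0 are hypothetical choices made only for this example.

```python
import math


def tree_messages(n):
    """Total control messages for both cuts, per (1): 2e = 2(N - 1)."""
    return 2 * (n - 1)


def serial_latency(n, tp):
    """Latency without parallel traversal, assumed form of (2): 2 * tp * (N - 1)."""
    return 2 * tp * (n - 1)


def parallel_latency(n, tp):
    """Latency with parallel subtree traversal, per (3): 2 * tp * log2(N - 1)."""
    return 2 * tp * math.log2(n - 1)


tp = 1.0  # assumed per-message time in seconds (illustrative value)
for n in (3, 5, 9, 17):
    print(n, tree_messages(n), serial_latency(n, tp), parallel_latency(n, tp))
```

For N = 9 and tp = 1, this gives 16 messages, a serial latency of 16 and a parallel latency of 6, which shows how the gap between the two expressions widens as N grows.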

As a final remark, the performance gain (the time to compute the GVT value) due to the implementation of a tree barrier greatly depends on how the processors are organized in the tree. It should be noted that the performance of a balanced tree without cycles is exactly the same as that of a ring based structure (i.e., both barrier structures have the same message complexity, which results in equal latency for computing the final value of GVT). However, fig. 5 shows that the latency for computing GVT can be improved if parallel traversal of the right and left subtrees is allowed. In particular, this implementation bounds the latency by the number of edges rather than by the number of control rounds of messages on a ring structure.

2.2. Analysis of a butterfly barrier

A butterfly barrier requires each processor to perform log2(N) pairwise synchronizations, such that a processor can complete all of its synchronizations only when all processors have reached the barrier. As our previous analysis indicates, the barriers (ring or tree) in a normal situation run in time O(log N). Fig. 6 shows an example of a butterfly barrier in which four processors are organized and send/receive messages to each other. More precisely, this barrier requires log2(N) steps with the transmission of N log2(N) messages, since each processor must send and receive one message in each step of the algorithm. Thus, the asymptotic complexity of this barrier is clearly higher than that of the tree or ring structures, which in turn gives a higher latency.
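The step and message counts stated above can be checked with a small sketch. The pairing rule used here (in step s, processor i exchanges a message with processor i XOR 2^s) is the usual way a butterfly barrier is laid out and is an assumption, since the paper does not spell out the pairing; the function name `butterfly_schedule` is likewise hypothetical, and the sketch only handles N that is a power of two.

```python
import math


def butterfly_schedule(n):
    """Return, for each step, the list of (sender, receiver) pairs.

    Assumes n is a power of two; in step s, processor i sends a
    message to partner i XOR 2**s (and receives one from it).
    """
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    steps = int(math.log2(n))
    schedule = []
    for s in range(steps):
        pairs = [(i, i ^ (1 << s)) for i in range(n)]
        schedule.append(pairs)
    return schedule


schedule = butterfly_schedule(4)
for s, pairs in enumerate(schedule):
    print("step", s, pairs)

total_messages = sum(len(pairs) for pairs in schedule)
print(len(schedule), total_messages)  # 2 steps, 8 messages for N = 4
```

For the four-processor case, the schedule has log2(4) = 2 steps and 4 * 2 = 8 messages, matching the N log2(N) count.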

Fig. 4. Tree barrier organization: the blue line is intended for cut C1 and the red line is intended for cut C2.
Fig. 6. Butterfly barrier organization: arrows in the figure show that the node is arriving/reaching barrier to other processors.

The complexity comparison in terms of message-transmission of the butterfly barrier with the ring and tree structures can be analyzed in fig. 7. In line with our expectations, as the number of processors, N, increased, the performance of the butterfly barrier degraded. However, the performance degradation of the butterfly barrier was small compared to the ring and tree structures for the first few values of N. For instance, consider a system of N processors organized using a butterfly structure. It is observed that at each stage of this structure, the total number of transmitted messages can be computed by using (4)

[Equation (4): not recoverable from the source.]

where k represents the level of the structure and N is the total number of processors.

Fig. 7. No. of processors (N) versus messages transmitted.

The simulation results of fig. 8 and fig. 9 demonstrate the behavior of message-transmission with respect to a linear variation in the total number of processors (N) for the 5 and 10 level system structures, respectively. In addition, it should be noted that as the value of N reaches twice the number of level-structures, the message-transmission approaches a limiting value. This relationship can be expressed as:

[Equation (5): only partially recoverable from the source; legible fragments: "N approaches 2k", "(Message Transmission)", "(2k-1)", "2k".]

Fig. 8. No. of processors (N) versus messages transmitted for the 5-level system structure.
Fig. 9. No. of processors (N) versus messages transmitted for the 10-level system structure.

The simulation results of fig. 8 and fig. 9 satisfy the characteristics of (5). It is worth mentioning that the asymptotic latency of the butterfly is exactly the same as that of the merge algorithm, where a total of N comparisons is analogous to the total of N messages transmitted in one direction. From a message complexity point of view, it is obvious that the latency of the butterfly barrier for Mattern's GVT algorithm lies in a logarithmic region with a constant N. This implies that for a small value of the constant (N), the latencies of the tree/ring structures and the butterfly barrier almost overlap, but as we increase the number of processors in a system, the latency improvement due to the tree structure becomes obvious. Also, note that we demonstrated that the ring and tree structures for Mattern's algorithm give the same latency except in the one case where we consider parallel traversal of nodes, as shown in fig. 5. Therefore, the latency comparison between the tree
and the butterfly barriers is the same as for the ring and the butterfly barriers.

Recall our previous assumption that two rounds of message-transmission are required for constructing the two cuts C1 and C2. This implies that a total of N log2(N) messages will be transmitted for both C1 and C2 in Mattern's GVT computation if a butterfly barrier is implemented rather than a ring. Consequently, the total time required to determine the new value of GVT in a butterfly barrier structure can be computed by (6).

$$N t_p \log_2 (N) \qquad (6)$$

where tp is the time required to propagate one message from one processor to the other neighboring processor. Fig. 10 demonstrates the required latency for the three approaches (i.e., the ring, tree, and butterfly barriers) with respect to the number of rounds.

Fig. 10. tp (seconds) versus latency (seconds). Legend: Butterfly Barrier Structure, Tree (Serial), Tree (Parallel).

3. Conclusion

In this paper, we provide a comparative analysis of Mattern's GVT structure using a ring structure with the potential use of tree and butterfly barriers to improve the latency and/or the processor idle time. The simulation results have verified that the use of butterfly barriers is not appropriate with an asynchronous type of algorithm when the target is to improve the latency of the system. Since the latency is directly related to how many messages each processor is sending, the butterfly barrier may not be a good candidate to improve the latency of the GVT computation. However, our experimental verifications have shown that the latency of the GVT computation can be improved if the tree based structure is organized in a way that allows parallel traversal of the left and right subtrees simultaneously. In addition, our results suggest that for small values of N, the latencies of all barriers (serial tree and butterfly) overlap. However, as we increase the value of N, the performance differences (i.e., the time required to compute the final value of GVT) among these approaches become obvious.

4. References

[1] Mattern, F., Mehl, H., Schoone, A., Tel, G. Global Virtual Time Approximation with Distributed Termination Detection Algorithms. Tech. Rep. RUU-CS-91-32, Department of Computer Science, University of Utrecht, The Netherlands, 1991.
[2] Friedemann Mattern, "Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation," Journal of Parallel and Distributed Computing, Vol. 18, No. 4, 1993.
[3] Ranjit Noronha and Abu-Ghazaleh, "Using Programmable NICs for Time-Warp Optimization," Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and CD-ROM, pp. 6-13, 2002.
[4] D. Bauer, G. Yaun, C. Carothers, S. Kalyanaraman, "Seven-O'Clock: A new Distributed GVT Algorithm using Network Atomic Operations," 19th Workshop on Principles of Advanced and Distributed Simulation (PADS'05), pp. 39-48.
[5] Syed S. Rizvi, K. M. Elleithy, Aasia Riasat, "Minimizing the Null Message Exchange in Conservative Distributed Simulation," International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering, CISSE 2006, Bridgeport CT, December 4-14 2006.
[6] Lee A. Belfore, Saurav Mazumdar, and Syed S. Rizvi et al., "Integrating the joint operation feasibility tool with JFAST," Proceedings of the Fall 2006 Simulation Interoperability Workshop, Orlando Fl, September 10-15 2006.