A GVT Based Algorithm for Butterfly Barrier in
Parallel and Distributed Systems
Syed S. Rizvi, Shalini Potham, and Khaled M. Elleithy
Computer Science Department, University of Bridgeport, Bridgeport, CT 06601 USA
{srizvi, spotham, elleithy}@bridgeport.edu
Abstract-Mattern's GVT algorithm is a time management algorithm
that helps achieve synchronization in parallel and distributed
systems. The algorithm uses a ring structure to establish two cuts,
C1 and C2, from which the GVT is calculated. GVT latency is critical
in parallel/distributed systems, and it is extremely high when the
GVT is calculated with this algorithm. However, using synchronous
barriers with Mattern's algorithm can improve the GVT computation
process by minimizing the GVT latency. In this paper, we
incorporate the butterfly barrier to employ the two cuts C1 and C2
and obtain the resultant GVT at an affordable latency. Our analysis
shows that the proposed GVT computation algorithm significantly
improves the overall performance in terms of memory saving and
latency.
Keywords-Time management algorithm, latency, butterfly barrier
I. INTRODUCTION
A parallel and distributed system is an environment in which a
single large task is divided into several sub-tasks, with each
terminal receiving a sub-task to execute. The main problem faced
here is synchronization. All the processes need to be synchronized
because the main aim of a distributed system is that the final
output after execution of the entire task should be exactly the same
as the output obtained when the same task is executed sequentially
on a single machine.
Mattern's GVT algorithm helps keep all the processes synchronized
by finding the minimum of the time stamps of all the messages at a
given point. It also makes sure that there are no transient messages
during execution, as it waits for each process to receive all the
messages destined for it. The drawback of this algorithm is its high
latency, which has kept it from widespread use. The performance of a
parallel/distributed system can be degraded if the latency for
computing the GVT is high. Mattern's GVT algorithm also uses several
variables, which in turn increase the number of memory fetches.
In the proposed algorithm, we implement a mechanism similar to the
one suggested by Mattern [1], but with the use of a matrix. The
matrix eliminates two of the variables and an array used by the
original Mattern's GVT algorithm. Consequently, using the matrix
with Mattern's algorithm provides several advantages: it reduces the
number of memory fetches, saves memory, increases the processor
speed, and improves the latency. We incorporate the butterfly
barrier because it performs well compared with other barriers such
as the broadcast and centralized barriers [7]. When the barrier
completes under the proposed algorithm, the current simulation time
has already been updated. This implies that there is no need to
communicate the minimum time or the simulation time, which reduces
the number of message exchanges. This, in turn, improves the latency
at an affordable cost.
II. RELATED WORK
The term distributed refers to distributing the execution of a
single run of a simulation program across multiple processors
[2]. One of the main problems associated with distributed
simulation is the synchronization of a distributed execution. If
not properly handled, synchronization problems may degrade
the performance of a distributed simulation environment [5].
This situation becomes more severe when the synchronization
algorithm must run a detailed logistics simulation in a
distributed environment to simulate a huge amount of data [6].
Event synchronization is an essential part of parallel
simulation [2]. In general, synchronization protocols can be
categorized into two different families: conservative and
optimistic. Time Warp is an optimistic protocol for
synchronizing parallel discrete event simulations [3]. Global
virtual time (GVT) is used in the Time Warp synchronization
mechanism to reclaim memory, commit output, detect
termination, and handle errors. GVT can be considered as a
global function which is computed many times during the
course of a simulation. The time required to compute the value
of GVT may result in performance degradation due to a slower
execution rate [4].
On the other hand, a small GVT latency (the delay between its
occurrence and detection) reduces the processor's idle time and
thus improves the overall throughput of the distributed simulation
system. However, this reduction in latency is not consistent
and linear if it is used in its original form with the existing
distributed termination detection algorithm [7].
Mattern [1] proposed GVT approximation with a distributed
termination detection algorithm. This algorithm works well and
gives optimal performance in terms of accurate GVT computation at
the expense of a slower execution rate. This slower execution rate
results in a high GVT latency. Due to the high GVT latency, the
processors involved in communication remain idle during that period
of time. As a result, the overall throughput of a discrete event
parallel simulation system degrades significantly. Thus, the high
GVT latency prevents the widespread use of this algorithm in
discrete event parallel simulation systems.
However, if we could improve the latency of the GVT
computation, most discrete event parallel simulation systems
would likely take advantage of this technique in terms of
accurate GVT computation. In this paper, we examine the
potential use of butterfly barriers with Mattern's GVT
structure using a ring. Simulation results demonstrate that the
use of tree barriers with Mattern's GVT structure can
significantly improve the latency and thus increase the
overall throughput of the parallel simulation system. The
performance measures adopted in this paper are the achievable
latency for a fixed number of processors and the number of
message transmissions during the GVT computation.
Thus, the focus of this paper is on the implementation of
butterfly barrier structures. In other words, we do not focus on
how the GVT is actually computed. Instead, we study the
parameters or factors (if any) that may improve the latency
involved in GVT computation. In addition, we briefly describe
what changes (if any) the implementation of this new barrier
structure may introduce that have an impact on the overall
latency.
III. PROPOSED ALGORITHM
In this section, we present the proposed algorithm. For the
sake of simplicity, we divide the algorithm into two parts, one
for each of the cuts C1 and C2.
A. The Proposed Algorithm
ALGO GVT_FLY(V[N][N], Tmin, Tnow, Tred, Ts, n)
Begin
  Loop n times
  Begin
    // Green message sent by LP_i to LP_j
    V[i][j] = V[i][j] + 1
    // Green message received by LP_i
    V[i][i] = V[i][i] + 1
    // Calculate minimum time stamp
    Tmin = min(Tmin, Ts)
    CUT C1:
      // Messages exchanged at this point are the red messages
      // Red message sent by LP_i to LP_j
      V[i][j] = V[i][j] + 1
      // Red message received by LP_i
      V[i][i] = V[i][i] + 1
      // Calculate minimum time of red messages
      Tred = min(Tred, Ts)
      Forward token to appropriate LP
    CUT C2:
      // Wait until all messages are received
      Wait until (V[i][i] = sum over j of V[j][i] - V[i][i])
      Forward token to appropriate LP
      Tnow = min(Tred, Tmin)
  END LOOP
END ALGO
B. A Detailed Overview of the Proposed Algorithm
Mattern's algorithm uses N vectors of size N to keep track of
the messages exchanged among the LPs. It also uses an array of
size N to maintain a log of the number of messages a particular
LP needs to receive. On the whole, it uses (N+1) vectors of size N.
This increases the number of memory fetches, resulting in more
processor idle time.
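As a rough illustration of the storage difference (a back-of-the-envelope count, not a figure reported in the paper), assume one counter cell per vector entry. The symbols S_Mattern and S_matrix are introduced here only for this sketch; the first counts the (N+1) vectors of size N described above, the second the N x N matrix of the proposed algorithm.

```latex
S_{\text{Mattern}} = (N+1)\,N = N^{2} + N, \qquad
S_{\text{matrix}} = N^{2}, \qquad
S_{\text{Mattern}} - S_{\text{matrix}} = N .
```

For N = 4 this is 20 counters versus 16. The count covers the message counters only and ignores any other per-LP state.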
In our proposed algorithm, we use an N x N matrix to calculate
the GVT; the flow is shown in Fig. 1. First, the LPs exchange
green messages (i.e., messages that are safe for an LP to
process). Whenever LP_i sends a green message to LP_j, the cell
V[i][j] of the matrix is updated, as shown in Fig. 2. On the
other hand, when LP_i receives a message, the cell V[i][i] of the
matrix is updated. At this point, we also calculate the minimum
of all the time stamps of the event messages. After a certain
period of time, when the first cut point C1 is reached, the LPs
start exchanging red messages (i.e., messages referred to as
straggler or transient messages). These messages are handled as
shown in Fig. 3.
When LP_i sends a red message to LP_j, the cell V[i][j] of
the matrix is updated. On the other hand, when LP_i receives a
message, the cell V[i][i] of the matrix is updated. At this cut
point, we also calculate the minimum time stamp of all the red
messages, and control is then passed to the appropriate pair-wise
LP. Next, at the second cut point C2, each LP has to wait until
all the messages destined for it have been received, and then
calculates the current simulation time as the minimum of the
minimum time stamps calculated for the red and green messages.
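The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name GvtMatrix, its method names, the 0-based indexing, and the choice to update the running minima at send time are all assumptions made for the example.

```python
import math


class GvtMatrix:
    """Hypothetical helper: N x N message-count matrix plus running minima."""

    def __init__(self, n):
        self.v = [[0] * n for _ in range(n)]
        self.t_min = math.inf   # minimum time stamp of green messages
        self.t_red = math.inf   # minimum time stamp of red messages

    def send(self, i, j, ts, red=False):
        # LP_i sends a message to LP_j: count it in V[i][j].
        self.v[i][j] += 1
        # Assumption: the minimum is tracked when the message is sent.
        if red:
            self.t_red = min(self.t_red, ts)
        else:
            self.t_min = min(self.t_min, ts)

    def receive(self, i):
        # LP_i receives a message: count it on the diagonal V[i][i].
        self.v[i][i] += 1

    def t_now(self):
        # Current simulation time: min of the green and red minima.
        return min(self.t_red, self.t_min)


# Usage with made-up time stamps (0-based LP indices).
m = GvtMatrix(4)
m.send(0, 2, ts=10.0)            # green message LP_0 -> LP_2
m.receive(2)
m.send(1, 0, ts=7.0, red=True)   # red message LP_1 -> LP_0
print(m.t_now())
```

Keeping the send counts off the diagonal and the receive counts on it is what lets a single matrix replace the separate vectors and receive log.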
The control token is then forwarded to the appropriate pair-wise
LP. Since we are using the butterfly barrier, the entire
process is repeated log2 N times. In other words, the condition
for this algorithm is that the number of processes involved in
the system should be a power of 2 (i.e., N = 2^k for some positive
integer k).
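The pair-wise structure of a butterfly barrier can be sketched as follows. This uses the standard butterfly pairing, in which the partner at stage s is obtained by flipping bit s of the LP index; the paper does not spell out its pairing rule, so the function names, the 0-based ranks, and the simulated min-exchange are assumptions for illustration only.

```python
def butterfly_partners(rank, n):
    """Partner of LP `rank` at each of the log2(n) stages (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1          # equals log2(n)
    return [rank ^ (1 << s) for s in range(stages)]


def butterfly_min(local_times):
    """Simulate the pair-wise exchanges; each LP keeps the smaller value."""
    n = len(local_times)
    butterfly_partners(0, n)             # validates that n is a power of two
    values = list(local_times)
    for s in range(n.bit_length() - 1):
        values = [min(values[r], values[r ^ (1 << s)]) for r in range(n)]
    return values


# Four LPs with made-up local minimum time stamps.
print(butterfly_partners(0, 4))
print(butterfly_min([12, 5, 9, 7]))
```

After log2 N stages every LP holds the same minimum, which is consistent with the observation that no separate broadcast of the simulation time is needed once the barrier completes.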
Fig. 1. The flow of data with the matrix and butterfly barrier. [Flowchart boxes: Start; Event processing; Message exchanges; Cut C1; Cut C2; decision "Iterated log N times?" (Yes: reach sync point; No).]
Fig. 2. Handling green messages. [Flowchart: send and receive steps incrementing V[i][j] and V[i][i], followed by calculating the minimum of Ts.]
Fig. 3. Cut C1: handling red messages, which represent the transient or straggler messages. [Flowchart: send and receive steps incrementing V[i][j] and V[i][i], calculating the minimum of Tred, and forwarding the token to the pair-wise LP.]

For the sake of a comprehensive explanation of the proposed
algorithm, let us take an example of four LPs communicating
with each other, as shown in Fig. 5. Fig. 5 shows the four LPs
exchanging messages with respect to the simulation time. Let us
see how these exchanges modify the cells of the matrix, which are
initialized to zero. The first message is sent by LP1 to LP3. As a
result, the cell V[1][3] of the matrix is incremented, and since
the message is immediately received by LP3, the cell V[3][3] is
incremented. In the second round, the next message is sent by LP1
to LP4. Consequently, the cell V[1][4] is incremented and, since
the message is immediately received by LP4, the cell V[4][4] is
incremented. The next message is sent by LP3 to LP4, which
increments the cell V[3][4]. However, before this message can be
received by LP4, the next message is sent by LP1 to LP3. The
result of this transmission is an increment in the cell V[1][3].
As time progresses, LP4 receives the message, which results in an
increment in the cell V[4][4], and so on.
Table I shows the message exchanges up to point C1. Table II
shows the message exchanges after C1 and before C2, and
Table III shows the message exchanges after C2.
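The narrated part of this trace can be replayed against an empty matrix as a quick check. The sketch below is an illustration under stated assumptions: the helper names send and receive are hypothetical, and only the messages spelled out in the text are replayed, so cells that Table I attributes to later messages (the "and so on") are not reproduced.

```python
N = 4
V = [[0] * N for _ in range(N)]   # all cells start at zero


def send(i, j):
    """LP_i sends to LP_j (1-based LP numbers, as in the text)."""
    V[i - 1][j - 1] += 1


def receive(i):
    """LP_i receives a message: increment its diagonal cell."""
    V[i - 1][i - 1] += 1


send(1, 3)      # LP1 -> LP3
receive(3)      # received immediately by LP3
send(1, 4)      # LP1 -> LP4
receive(4)      # received immediately by LP4
send(3, 4)      # LP3 -> LP4, still in transit
send(1, 3)      # LP1 -> LP3, sent before LP4 receives
receive(4)      # LP4 now receives the message from LP3

for row in V:
    print(row)
```

The resulting values of V[1][3], V[1][4], V[3][3], V[3][4], and V[4][4] agree with the corresponding cells of Table I.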
At point C2, an LP has to wait until it receives all the
messages that are destined for it. This is done with the
condition that LP_i has to wait until the value of the cell
V[i][i] of the matrix is equal to the sum of all the other cells
of column i. In other words, LP_i has to wait until
$V[i][i] = \sum_{j=1}^{N} V[j][i] - V[i][i]$. As an example, if we
take V1 from Table II, then at cut point C2 it has to wait until
V[1][1] = V[2][1] + V[3][1] + V[4][1].
According to Table II, the value of V[1][1] is 1 and the sum of
the other cells of the first column is 2. This implies that LP1
has to wait until it receives the message that is destined to
reach it. Once it receives the message, it increments V[1][1]
and again verifies whether it has to wait. If not, it passes
the control token to the next appropriate pair-wise LP.
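The wait condition at C2 can be checked directly against the values in Tables II and III. The function name must_wait is hypothetical, and the matrices below are copied from those two tables; this is a sketch of the column test only, not of the token passing.

```python
def must_wait(v, i):
    """True while LP_i (1-based) has not received everything sent to it."""
    col = i - 1
    received = v[col][col]
    # Sum of the other cells of column i: messages sent to LP_i.
    sent_to_i = sum(v[row][col] for row in range(len(v)) if row != col)
    return received != sent_to_i


TABLE_II = [
    [1, 0, 2, 1],
    [2, 1, 0, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 2],
]
TABLE_III = [
    [2, 0, 2, 1],
    [2, 2, 0, 0],
    [0, 1, 3, 1],
    [0, 1, 1, 2],
]

print([must_wait(TABLE_II, i) for i in range(1, 5)])
print([must_wait(TABLE_III, i) for i in range(1, 5)])
```

With the Table II values only LP1 is still waiting; with the Table III values every column balances, so no LP has to wait.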
Every time a process forwards the control token, it also
updates the current simulation time; as a result, we do not
require additional instructions or time to calculate the
GVT. This eliminates the need to communicate the GVT
among the different LPs exchanging messages. This saves
time, which in turn improves the GVT latency. The algorithm
thus helps improve the performance of parallel and distributed
systems.

Fig. 4. Cut C2: handling green messages for synchronization. [Flowchart boxes: wait until every msg is received; calculate current simulation time; forward token to pair-wise LP.]
Fig. 5. Example of message exchanges between the four LPs (LP1 to LP4). C1 and C2 represent the two cuts for green and red messages. [Legend: green font, green messages (safe events); red font, red messages (transient/straggler messages).]

TABLE I: MATRIX OF 4 LPS EXCHANGING GREEN MESSAGES
      V1  V2  V3  V4
V1     1   0   2   1
V2     1   0   0   0
V3     0   0   1   1
V4     0   0   1   2

TABLE II: MATRIX OF 4 LPS AT CUT C1
      V1  V2  V3  V4
V1     1   0   2   1
V2     2   1   0   0
V3     0   1   3   1
V4     0   0   1   2

TABLE III: MATRIX OF 4 LPS AT CUT C2
      V1  V2  V3  V4
V1     2   0   2   1
V2     2   2   0   0
V3     0   1   3   1
V4     0   1   1   2
IV. CONCLUSION
In this paper, we presented an algorithm that optimizes memory
and processor utilization by using a matrix instead of N
different vectors of size N, in order to reduce the overall GVT
latency. The improved GVT latency can play a vital role in
improving the performance of parallel/distributed systems. In
the future, it will be interesting to develop an algorithm to
calculate GVT using tree barriers.
REFERENCES
[1] F. Mattern, H. Mehl, A. Schoone, G. Tel, "Global Virtual Time
Approximation with Distributed Termination Detection Algorithms," Tech.
Rep. RUU-CS-91-32, Department of Computer Science, University of
Utrecht, The Netherlands, 1991.
[2] Friedemann Mattern, "Efficient Algorithms for Distributed Snapshots and
Global Virtual Time Approximation," Journal of Parallel and Distributed
Computing, Vol. 18, No. 4, 1993.
[3] Ranjit Noronha and Abu-Ghazaleh, "Using Programmable NICs for
Time-Warp Optimization," Parallel and Distributed Processing
Symposium, Proceedings International, IPDPS 2002, Abstracts and
CD-ROM, pp. 6-13, 2002.
[4] D. Bauer, G. Yaun, C. Carothers, S. Kalyanaraman, "Seven-O'Clock: A
New Distributed GVT Algorithm using Network Atomic Operations,"
19th Workshop on Principles of Advanced and Distributed Simulation
(PADS'05), pp. 39-48.
[5] Syed S. Rizvi, Khaled M. Elleithy, Aasia Riasat, "Minimizing the Null
Message Exchange in Conservative Distributed Simulation,"
International Joint Conferences on Computer, Information, and Systems
Sciences, and Engineering, CISSE 2006, Bridgeport CT, pp. 443-448,
December 4-14, 2006.
[6] Lee A. Belfore, Saurav Mazumdar, and Syed S. Rizvi et al., "Integrating
the joint operation feasibility tool with JFAST," Proceedings of the Fall
2006 Simulation Interoperability Workshop, Orlando FL, September 10-15,
2006.
[7] Syed S. Rizvi, Khaled M. Elleithy, and Aasia Riasat, "Trees and
Butterflies Barriers in Mattern's GVT: A Better Approach to Improve the
Latency and the Processor Idle Time for Wide Range Parallel and
Distributed Systems," IEEE International Conference on Information and
Emerging Technologies (ICIET-2007), July 06-07, 2007, Karachi,
Pakistan.