Using Programmable NICs for Time-Warp Optimization
Ranjit Noronha and Nael B. Abu-Ghazaleh
Computer Science Department
State University of New York
Binghamton, NY 13902
{rnoronha,nael}@cs.binghamton.edu
Abstract
This paper explores optimization of Parallel Discrete Event Simulators (PDES) on a cluster of workstations with programmable Network Interface Cards (NICs). We explore reprogramming the firmware on the NIC to optimize the performance of distributed simulation. This is a new implementation model for distributed applications where: (i) application-specific communication optimizations can be implemented on the NIC; (ii) portions of the application that are most heavily communicating can be migrated to the NIC; (iii) some messages can be filtered out at the NIC without burdening the primary processor resources; and (iv) critical events are detected and handled early. The combined effect is to optimize the application communication behavior as well as reduce the load on the host processor resources. We explore this new model by implementing two optimizations to a Time-Warp simulator on the NIC: (1) the migration of the Global Virtual Time estimation algorithm to the NIC; and (2) early cancellation of messages in place upon early detection of rollbacks. We believe that the model generalizes to other distributed applications.
Keywords: Clusters, Programmable NIC, Time Warp,
Parallel Discrete Event Simulation
1. Introduction
The emergence and commercial success of clus-
tering technologies using commodity components al-
low scalable cost-effective parallel processing machines
to be built easily [3, 4, 24, 29]. Clusters approach
the performance of custom parallel machines by using
high-performance Local/System area networking tech-
nologies and standards (such as Myrinet [6], SCI [16], and others [5, 30, 31]) and low-overhead user-level communication protocols (such as the Basic Interface for Parallelism (BIP) [15], Illinois Fast Messages (IFM) [23], and others [11, 20]). (This work was partially supported by NSF Grant EIA-9911099.)
In a typical node in a workstation cluster, the NIC resides on the I/O bus, which is connected to the system bus using a bus adapter. Outgoing messages traverse the I/O bus twice: once while being transferred (using either DMA or programmed I/O) from the sender's host buffers to the NIC buffers, and again at the receiver's end, where the message is transferred from the NIC to the host. With networking technology improving rapidly, the network can now deliver messages at a rate that overwhelms the ability of the host workstation to handle them. This includes: (i) the I/O bus: at the full network bandwidth of a 4 Gb/sec Myrinet network, 100% of a typical I/O bus bandwidth (64-bit, 66 MHz PCI bus) will be consumed by network traffic; (ii) the system bus: this is a well-known bottleneck even without communication [7] that is significantly exacerbated by network traffic [19]; and (iii) the CPU: CPU time is needed to handle the messages (interrupt handling, context switches, parsing/generating headers, buffer management, checksums, etc.). For fine-grained applications, these factors result in low effective communication bandwidth and draw resources away from the application.
Recently, NIC vendors started developing high-end,
affordable, programmable NICs[1, 2, 28]. Providing the
programmability on the NIC opens the door for a new
system model for implementing distributed applications.
More specifically, application-specific customization of
the NIC becomes possible. More generally, any portion
of the application may be implemented on the NIC to
optimize performance in the following ways:
1. Migrate “distributed” portions of the application,
defined informally as the objects that are more
closely related to objects on other nodes than to the
local objects, to the NIC processor such that they
are able to tap into the network directly at low la-
tency and high bandwidth. This migration reduces
the traffic across the NIC-to-host interface, freeing up the host resources for local object processing;
2. Detect and react quickly to urgent or unexpected
events. Normally, such events carry this urgency
semantics at the application level. This requires
that they percolate through the system and up to the
application before they can be handled. Detecting
and handling them at the NIC bypasses this cost;
3. Filter (or generate) messages directly on the NIC,
further reducing the traffic to the host; and
4. Allow communication monitoring and profiling at
a low level not available to applications under the
traditional model.
Implementing intelligence in the I/O subsystem (or even specifically in the NIC) is not a new idea; in fact, the idea of DMA itself is an example of such intelligence. Our work differs in that we seek to make this intelligence application specific. There have been a small number of investigations into using the programmability features to implement application-specific communication primitives on the NIC [8, 13, 17, 32]. The SPINE Operating System provides mechanisms for off-loading user-level primitives to the NIC by extending the Operating System [12]. In this paper, we present initial experiments with this model using parallel discrete event simulation (PDES) as an application. We explore two optimizations to a Time-Warp simulator using this model: the migration of the Global Virtual Time estimation algorithm to the NIC and early cancellation of outgoing messages on the NIC after detecting incoming rollback messages.
2. Time Warp Simulation
Parallel Discrete Event Simulation (PDES) [14] can
potentially increase performance and capacity of the
simulation by partitioning the simulation model across a
collection of concurrent simulators called Logical Pro-
cesses (LPs). Each LP maintains a Local Virtual Time
(LVT) and communicates with other LPs by exchang-
ing time-stamped event messages. In optimistic simu-
lation (Time-Warp), no explicit synchronization is en-
forced among the LPs; LPs process local events in timestamp order. A causality error occurs when an arriving message has a timestamp lower than the LVT (a straggler event), forcing the LP to halt processing and roll back to the earlier virtual time of the straggler. Thus, each LP must maintain state and event histories to enable recovery from straggler events. The progress of the simulation is determined by the smallest timestamp of an unprocessed event in the simulation (taking into account messages in transit); this value is called the Global Virtual Time (GVT), and its estimation can be shown to be similar to the distributed snapshot algorithm [18]. Rollbacks to times earlier than GVT are not possible, since events cannot be generated in the past. GVT is used as a marker to garbage collect histories no longer needed for rollback. Time Warp simulations communicate heavily and at very fine granularity, which incurs high overhead and substantially degrades performance on distributed memory machines (and clusters) [9]. This makes them a suitable application for active NIC optimization.
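To make these mechanics concrete, here is a minimal sketch (not WARPED code; all names and types are illustrative) of an LP's event loop with straggler detection, rollback, and GVT-driven fossil collection:

#include <map>
#include <queue>
#include <vector>

// Illustrative event type; the WARPED kernel defines richer message classes.
struct Event {
    double recvTime;   // virtual time at which the event must be processed
    bool   isAnti;     // true for an anti-message cancelling an earlier event
};

struct LaterTimeFirst {
    bool operator()(const Event& a, const Event& b) const { return a.recvTime > b.recvTime; }
};

class LogicalProcess {
    double lvt = 0.0;                                   // Local Virtual Time
    std::priority_queue<Event, std::vector<Event>, LaterTimeFirst> inputQueue;
    std::map<double, int> stateHistory;                 // saved states keyed by LVT (int stands in for real state)

public:
    void receive(const Event& e) {
        if (e.recvTime < lvt) {
            // Straggler: causality error, roll back to the straggler's time.
            rollbackTo(e.recvTime);
        }
        inputQueue.push(e);
    }

    void processNext() {
        if (inputQueue.empty()) return;
        Event e = inputQueue.top();
        inputQueue.pop();
        lvt = e.recvTime;                               // advance LVT optimistically
        stateHistory[lvt] = 0;                          // checkpoint state before executing
        // ... execute the event, possibly sending new time-stamped events ...
    }

    void rollbackTo(double t) {
        // Discard history later than t; erroneously sent messages would be
        // cancelled with anti-messages here (aggressive cancellation).
        stateHistory.erase(stateHistory.upper_bound(t), stateHistory.end());
        lvt = t;
    }

    void fossilCollect(double gvt) {
        // No rollback can ever reach a time earlier than GVT, so the history
        // strictly before GVT can be reclaimed.
        stateHistory.erase(stateHistory.begin(), stateHistory.lower_bound(gvt));
    }
};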
3. Proposed Optimizations
In deciding what optimizations to implement on the
NIC, the following factors were considered: (i) NIC re-
sources are severely limited. The processor is the equivalent of 10-year-old technology and is already saddled with its other responsibilities. Furthermore, the available memory (1 MByte) is restrictive; and (ii) we would
like to demonstrate migration of distributed functional-
ity to the NIC and filtration of messages based on in-
formation available at the NIC. With the newer gener-
ation of NICs with better processors and more mem-
ory [28, 21], we expect a much larger scope for opti-
mizations to become available. We elected to implement
the following two optimizations as an initial demonstra-
tion of feasibility: migration of GVT computation to the
NIC; and early cancellation of erroneous messages on
the NIC.
3.1. NIC-level GVT
GVT represents the minimum on the timestamp of
all unprocessed messages at all the logical processes
including those in transit. GVT places a lower bound
on the progress of the simulation and therefore guaran-
tees that no earlier events can be generated. With this
knowledge the LPs can garbage collect event and state
histories. GVT is also used for termination detection.
GVT estimation operates concurrently with the simulation: if it is carried out aggressively, it incurs a higher overhead, but the obtained estimate is tighter, allowing more timely garbage collection. WARPED implements two GVT algorithms: pGVT [10] and Mattern's algorithm [18]. We use Mattern's algorithm because it has lower overhead and produces good estimates.
Mattern’s algorithm is similar to classic two-round
distributed snapshot algorithms. In Figure 1, C1 rep-
resents the point we decide to invoke the estimate while
C2 represents a point in the future where a consistent es-
Figure 1. Consistent Cuts: LP0 through LP3 with their LVTs and messages M1 through M6 crossing cuts C1 and C2.
timate is obtained; C1 and C2 should be close together to tightly bound GVT. GVT is the minimum over the LVTs of all the LPs and the timestamps of the messages sent between cuts C1 and C2, including messages that cross cut C2. All processes are WHITE
initially and in this state count the number of messages
(WHITE messages) they send out. A designated root LP
starts off the process with the root process LP0 turning
RED and sending a GVT token to process LP1. Process
LP1 on receiving the GVT token turns RED and for-
wards the token to LP2 and so on until the token returns
to the root process LP0. As the GVT token circulates,
each LP adds to a counter the number of WHITE mes-
sages it has sent, and subtracts from it the number of
WHITE messages it received. When the token reaches
LP0, all processes are RED and the counter contains the
number of messages that are in transit. LP0 initiates cir-
culation of the token for a second round. As each LP re-
ceives the token it subtracts the number of WHITE mes-
sages received since the token was last seen. Once the
token is received by LP0 again, if the counter is 0, the
token is circulated again to inform all nodes of the GVT
value. If the counter is larger than 0, additional rounds
are initiated until all the WHITE messages are received.
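The token circulation can be summarized by the following simplified sketch (GvtToken, LpGvtState, and onToken are illustrative names, not WARPED interfaces; additional rounds and the broadcast of the final value are only indicated by comments):

#include <algorithm>
#include <limits>

enum Colour { WHITE, RED };

// Token circulated LP0 -> LP1 -> ... -> LP(m-1) -> LP0.
struct GvtToken {
    long   count = 0;   // (# WHITE messages sent) - (# WHITE messages received)
    double tMin  = std::numeric_limits<double>::infinity();  // min timestamp of RED messages
    double tLvt  = std::numeric_limits<double>::infinity();  // min LVT seen so far
};

struct LpGvtState {
    Colour colour = WHITE;
    long   whiteSent = 0;     // WHITE messages sent since the token was last seen
    long   whiteRecv = 0;     // WHITE messages received since the token was last seen
    double redSendMin = std::numeric_limits<double>::infinity();
    double lvt = 0.0;
};

// Called when the GVT token arrives at this LP; returns the token to forward.
GvtToken onToken(LpGvtState& lp, GvtToken tok, bool isRoot, bool firstRound) {
    if (firstRound) lp.colour = RED;               // round 0 turns the LP RED
    tok.count += lp.whiteSent - lp.whiteRecv;      // account for in-transit WHITE messages
    tok.tMin   = std::min(tok.tMin, lp.redSendMin);
    tok.tLvt   = std::min(tok.tLvt, lp.lvt);
    lp.whiteSent = lp.whiteRecv = 0;

    if (isRoot && !firstRound && tok.count == 0) {
        // Every WHITE message has been accounted for:
        // GVT = min(all LVTs, timestamps of RED messages sent between the cuts).
        double gvt = std::min(tok.tLvt, tok.tMin);
        (void)gvt;  // circulate the token once more to inform all LPs of this value
    }
    return tok;     // otherwise forward the token to the next LP for another round
}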
GVT computation is not on the critical path of the
simulation and can be performed in the background. It
is also not computationally intensive and does not place
excessive load on the NIC. We can save bandwidth by
intercepting and preventing messages from going across
the I/O bus. Also, GVT information can be piggybacked on many of the normal message fields, which carry pointer information only useful on the originating LP. In WARPED, when running Mattern's GVT algorithm entirely on the host, it is not possible to piggyback this information on an outgoing message, since we cannot guarantee that LPn will send an event message to LP((n+1) mod m), as required by the algorithm, where m is the total number of LPs. Finally, the migration of GVT from the
host to the NIC is transparent to the applications being
simulated; in this case RAID and POLICE.
Figure 2 shows the division of the implementation of
the algorithm between the host processor and the NIC.
The host is responsible for deciding when to change
color, keeping track of LVT (all objects are on the host),
Figure 2. Implementation: division of GVT bookkeeping between the host and the NIC (tracking V, T, and Tmin across hosts; generating and receiving GVT messages; deciding on GVT termination; reporting new GVT values; keeping track of WHITE messages; tracking the minimum timestamp of RED messages sent; calculating LVT).
and finally tracking the minimum timestamp of all outgoing RED messages, to reduce the latency on the send side at the NIC. The NIC keeps track of the number of outstanding WHITE messages (V), the minimum timestamp received so far (Tmin, initially infinity), and the Local Virtual Time (LVT).
Consistency is a major issue introduced by this implementation model. On the receive side, messages that are seen by the NIC (causing an update in state, for example LVT) spend time in the OS, message library, and/or application buffers before they are seen by the application. The reverse is true on the send side. For example, the NIC may report LVT to be some value which is not the true LVT value due to the presence of a rollback message that has not yet been received by the host. We expect this consistency problem to arise whenever state is shared between the NIC and the host. The GVT implementation must take care of this inconsistency problem or erroneous GVT values may be computed. In the remainder of this section we describe the implementation.
Initially, each LP reports its rank to the NIC through
the global buffer shared between the host and the NIC.
The Communication Manager (CM), which is a module
in the LP responsible for communication, initializes to 0 its control flag indicating the piggybacking of a GVT token. The Mattern GVT Manager at the root (LP0) initiates GVT computation by reporting the values of V (the number of WHITE messages), Tmin (the minimum of the received timestamps), and T (the local virtual time estimate at the NIC) to the CM on the host processor and asking it to send out a control message. The CM in turn sets a bit in an outgoing event message and encodes the values of T, Tmin, and V in four unused fields of the Basic Event Message. The NIC at the root, on receiving the message for the first time, extracts the values of T, Tmin, and V and stores them temporarily. Handshaking is carried out to enforce consistency.
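As a rough illustration of the piggybacking step, the sketch below shows how the CM might mark an outgoing event message and stash T, Tmin, and V in otherwise unused fields, with the NIC stripping them off on the way out. The field names and layout are assumptions for illustration; the actual Basic Event Message format in WARPED differs.

#include <cstdint>

// Illustrative event-message layout: fields that normally carry only
// sender-local pointer information are reused for the GVT token.
struct BasicEventMessage {
    uint32_t flags;        // bit 0 (assumed): "GVT token piggybacked on this message"
    double   unusedA;      // reused for T    (local virtual time estimate at the NIC)
    double   unusedB;      // reused for Tmin (minimum timestamp of received messages)
    int64_t  unusedC;      // reused for V    (outstanding WHITE message count)
    int64_t  unusedD;      // spare field (e.g., a handshake sequence number)
    // ... normal event payload ...
};

// Host-side Communication Manager: piggyback the token on the next outgoing event.
void piggybackGvtToken(BasicEventMessage& msg, double T, double Tmin, long V) {
    msg.flags  |= 0x1;     // mark the message as carrying a GVT token
    msg.unusedA = T;
    msg.unusedB = Tmin;
    msg.unusedC = V;
}

// NIC firmware: on a marked outgoing message, extract and hold the values until
// they can be forwarded to the next LP in a dedicated GVT message.
void nicExtractToken(BasicEventMessage& msg, double& T, double& Tmin, long& V) {
    if (msg.flags & 0x1) {
        T = msg.unusedA;
        Tmin = msg.unusedB;
        V = static_cast<long>(msg.unusedC);
        msg.flags &= ~0x1u; // the message continues to the wire as a normal event
    }
}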
Whenever it gets a chance, the NIC marshals the values of T, Tmin, and V into a special GVT message and forwards it to LP1's host. The NIC also has the following variables: TimewarpInitialised: set by BIP to indicate that BIP has been initialized and the rank variable has been written to; GvtTokenPending: indicates whether we are in the middle of a GVT computation or not; ControlMessagePending: indicates
that a GVT control message has been received by the NIC and has been sent to the host for processing; ReceivedHostVariables: indicates that the control message which was pending was processed by the host and the values (T, Tmin, V) just came off the last outgoing message.
On receiving a GVT token in round 0, the receiving NIC extracts the values of T, Tmin, and V. The NIC reports these values to the host processor and also requests an update. When the Mattern GVT Manager receives this message in memory, it updates the WHITE message count and modifies its color. On receiving an incoming GVT message in any round other than round 0, the NIC first requests the values of V, T, and Tmin from the host and adds V to the value of V in the token. At the root LP, if the result is 0, the NIC broadcasts a new GVT message to all other logical processes. Following that, it reports the value of GVT to its own host. The full implementation details can be found elsewhere [22].
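The NIC-side bookkeeping around these flags can be pictured as a small polling routine in the firmware main loop. The sketch below is a simplification under the assumption of a mailbox-style shared buffer between host and NIC; it is not the actual LANai firmware.

// Flags follow the names in the text; everything else (the polling loop, the
// host mailbox, the send path to the next LP) is an assumption for illustration.
struct NicGvtState {
    bool timewarpInitialised   = false; // BIP is up and the LP rank has been written
    bool gvtTokenPending       = false; // a GVT computation is in progress
    bool controlMessagePending = false; // token handed to the host, awaiting its reply
    bool receivedHostVariables = false; // host replied; (T, Tmin, V) below are valid
    double T = 0.0, Tmin = 0.0;
    long   V = 0;
};

// Called from the firmware main loop "whenever it gets a chance".
void nicGvtStep(NicGvtState& s) {
    if (!s.timewarpInitialised || !s.gvtTokenPending) {
        return;                              // nothing to do outside a GVT round
    }
    if (s.controlMessagePending && s.receivedHostVariables) {
        // The host has processed the pending control message and piggybacked fresh
        // (T, Tmin, V) on its last outgoing event message; the handshake is complete.
        s.controlMessagePending = false;
        s.receivedHostVariables = false;
        // Marshal (T, Tmin, V) into a dedicated GVT message and forward it to the
        // next LP's host without crossing this host's I/O bus again.
    }
}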
3.2. Early Message Cancellation
Timewarp is based upon the optimistic strategy for
parallel discrete event simulation. More specifically,
Logical Processes proceed in parallel with no synchro-
nization and use a detect and recover strategy to deal
with causality errors. Recovery consists of restoring
the simulation to an earlier valid state and sending out
negative or anti-messages to cancel erroneously sent
messages. We use aggressive cancellation [27] where
erroneous messages are instantly canceled (via anti-
messages) when a causality error is detected.
Our second optimization, Early Message Cancella-
tion, explores deleting messages in the NIC buffer if a
rollback message with an earlier time is detected. Such
messages are overly optimistic and will have to be can-
celed using an anti-message once the rollback is de-
tected. Eliminating these messages in place therefore
saves the cost of sending them (and handling them at the
destination), as well as the later cost of canceling them.
This optimization was chosen to demonstrate how the
NIC could be used to intelligently filter out some traffic
and increase simulation efficiency. We believe that there
are numerous opportunities for such optimizations both
for PDES and other distributed applications; however,
we are currently limited by NIC speed and the immatu-
rity of the model and programming environment.
Any message sent from WARPED follows the path
shown in Figure 3(a), spending a substantial amount of
time in buffers before being actually sent out over the
network. At some point, the NIC or the host might
decide that an event message still in the buffers is not
needed and can be dropped. The technique we use is to peek at received anti-messages at the NIC level and discard messages from the send queue based on the receive timestamp of the anti-messages. An example of this is shown in Figure 3(b), where messages with timestamps 120, 115, 110 and 102 can be discarded because of the anti-message received with timestamp 100.

Figure 3. Early Message Cancellation. (a) Message send path: WARPED buffers, MPICH buffers (64K), NIC buffer (4K), then the network. (b) Early cancellation example: the NIC send buffer holds messages with timestamps 100, 85, 102, 110, 115, and 120; a straggler anti-message with timestamp 100 arrives in the NIC receive buffer, so the queued messages with timestamps greater than 100 are to be cancelled.
Before we describe the implementation, we discuss some of the problems experienced with the communication layers, BIP and MPICH, when packets are dropped. First, BIP maintains sequence numbers to help in the ordering of packets, making it necessary to turn off sequence numbers when implementing packet dropping. Other approaches include requiring the NIC to maintain state information about dropped packets or informing the host of packet drops, both of which are difficult in the current implementation due to the limited capabilities of the NIC and I/O bus contention. The second problem lies with the implementation of credit-based flow control in MPICH. Since additional credit is piggybacked on packets from the receiver back to the sender, dropped packets cause credit to be lost and the sender's window to close up. We address this problem by enabling sequence numbers in MPICH so that lost packets can be detected immediately and the receiver can update its estimate of the number of credits the sender has used up. Also, the NIC keeps track of the credit from dropped packets for a particular destination and updates the credit information on a subsequent packet headed for that destination. Finally, the sending window is increased, allowing the sender to send for longer periods of time and recover in the case of a block of dropped packets.
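A sketch of the credit bookkeeping described above follows. The header fields and the fixed-size destination table are invented for illustration; the real MPICH-over-BIP headers and firmware data structures differ.

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxDest = 64;   // assumed upper bound on cluster size

// Illustrative packet header; real MPICH/BIP headers differ.
struct PacketHeader {
    uint16_t dest;          // destination node
    uint16_t seq;           // sequence number (enabled so drops are detected)
    uint16_t creditReturn;  // flow-control credit piggybacked for the destination
};

// Credits that would have been carried by packets the NIC dropped in place.
std::array<uint32_t, kMaxDest> droppedCredit{};

void onNicDropPacket(const PacketHeader& h) {
    // The dropped packet's piggybacked credit will never reach the destination;
    // remember it so the destination's window does not close up.
    droppedCredit[h.dest] += h.creditReturn;
}

void onNicSendPacket(PacketHeader& h) {
    // Fold the saved credit into the next packet headed to the same destination.
    h.creditReturn = static_cast<uint16_t>(h.creditReturn + droppedCredit[h.dest]);
    droppedCredit[h.dest] = 0;
}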
The algorithm begins by scanning the receive queue
on the NIC for anti-messages whose receive timestamp
is then recorded. This time-stamp is compared to the
send-time stamp on all outgoing messages sent before
the anti-message is received at the host (the host reports
the last received anti-stamp to the NIC by piggybacking
its receive time-stamp on all outgoing messages). If the
received anti-stamp is less than the outgoing message
timestamp, the message is dropped. The event IDs of all dropped messages are recorded so that we can either prevent the sending of the corresponding anti-message at the host or drop it at the NIC. Due to space limitations,
we only discuss the implementation briefly.
Whenever the NIC receives a message, it checks
whether it is an anti-message. If so, it records the timestamp of that message in a variable, which is used by the send queue to cancel messages. The sim-
ulator’s Input Queue makes a note of the timestamp of
the last processed anti-message, required by the CM on
the sending side. It must record only the timestamps
of messages that have been processed by the NIC. That
is, it must record the timestamps of anti-messages from
remote objects. This differentiation is achieved by using the object ID defined by the applications RAID and POLICE, and it will need to be recoded for a new
application. The CM piggybacks the timestamp of the
last received anti-message on the Next object field of
every outgoing message. This is necessary to maintain
consistency. Otherwise, it would not be possible to dis-
criminate between messages generated before the anti-
message was processed by the host (should be canceled)
and ones generated after (should not be canceled).
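The filtering step can be sketched as follows; the queue layout and the field names are assumptions for illustration, since the actual firmware operates directly on the LANai send and receive buffers.

#include <deque>
#include <vector>

// Illustrative layout of a message queued in the NIC send buffer.
struct QueuedEvent {
    double sendTime;       // timestamp of this (positive) event message
    double hostAntiStamp;  // last anti-message timestamp the host had processed
                           // when it generated this message (piggybacked field)
    long   eventId;        // identity, used later to suppress the host's anti-message
    bool   isAnti;
};

double lastAntiStamp = -1.0;   // timestamp of the last anti-message seen by the NIC
std::vector<long> droppedIds;  // event IDs of messages cancelled in place

// Receive path: record the timestamp of an anti-message arriving from a remote LP.
void onNicReceiveAnti(double antiStamp) {
    lastAntiStamp = antiStamp;
}

// Send path: drop queued positive messages made obsolete by the pending rollback.
void filterSendQueue(std::deque<QueuedEvent>& sendQueue) {
    for (auto it = sendQueue.begin(); it != sendQueue.end();) {
        bool overlyOptimistic      = it->sendTime > lastAntiStamp;
        bool hostUnawareOfRollback = it->hostAntiStamp < lastAntiStamp;
        if (!it->isAnti && overlyOptimistic && hostUnawareOfRollback) {
            droppedIds.push_back(it->eventId);  // remember: the host will later try
            it = sendQueue.erase(it);           // to cancel this message with an anti-message
        } else {
            ++it;
        }
    }
}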
The logic on the send side queue of the NIC is the
most complicated. Whenever we drop a positive mes-
sage, we know that at some point in time the host will try to cancel this message, and therefore we need to track the event IDs of all canceled messages. For every object on the LP we allocate a buffer of size 10, which is declared in the global structures of the NIC, so that it can be accessed by both the host and the NIC. The host can avoid sending negative messages by accessing this buffer, while the NIC can filter out negative messages that the host sent before the corresponding positive message was dropped on the NIC. Finally, we describe the logic in the Timewarp object, which is responsible for the generation of anti-messages. We first scan the event ID buffer on the NIC; if the event ID is present in the buffer, the anti-message is not generated. Imple-
mentation details can be found elsewhere [22].
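Finally, a sketch of the shared event-ID buffer and the host-side check performed before an anti-message is generated. The ring-buffer layout is an assumption (only its size of 10 entries per object comes from the text); in the real implementation the buffer resides in the NIC's global structures, visible to both the host and the NIC.

#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kDroppedBufSize = 10;   // per-object buffer size quoted above

// Ring of recently dropped event IDs, shared between the host and the NIC.
struct DroppedEventBuffer {
    std::array<long, kDroppedBufSize> ids;
    std::size_t next = 0;

    DroppedEventBuffer() { ids.fill(-1); }    // -1 marks an empty slot (assumed invalid ID)

    void record(long eventId) {               // written by the NIC when it drops a message
        ids[next] = eventId;
        next = (next + 1) % kDroppedBufSize;
    }
    bool contains(long eventId) const {       // read by the host (and by the NIC filter)
        return std::find(ids.begin(), ids.end(), eventId) != ids.end();
    }
};

// Host-side TimeWarp object, about to cancel an erroneously sent event.
bool shouldSendAntiMessage(const DroppedEventBuffer& buf, long eventId) {
    // If the positive message was already dropped on the NIC, the anti-message is
    // unnecessary; otherwise it is generated (and may still be filtered at the NIC
    // if the drop happened after this check).
    return !buf.contains(eventId);
}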
4. Experimental Study
In this section, we present an experimental study of
the proposed optimizations on a Myrinet-connected cluster. The cluster has eight nodes; each node is a 2-way SMP with Pentium III 550 MHz processors running RedHat Linux 6.2. The machines are connected by a 1.2 Gbps Myrinet switch with LANai4 processors (66 MHz, 1 MB dual-ported SRAM); the optimizations were implemented by reprogramming the firmware of this processor. We used the Message Passing Interface (MPI) running on top of the Basic Interface for Parallelism (BIP) suite: a light-weight user-level communication protocol for Myrinet [15, 25]. BIP runs directly on top of the hardware (it bypasses TCP/IP). The optimizations were implemented for the WARPED simulation engine: a configurable Time-Warp parallel discrete event simulator [26]. We present results using two of the applications provided by the WARPED release: RAID, which models the operation of a RAID-5 disk array, and POLICE, a simple model of a traffic police telecommunications network.
4.1. NIC-level GVT
Figure 4. RAID GVT Execution Time: simulation time (sec) versus GVT period (events) for WARPED and NIC GVT.
RAID was simulated using 10 processes sending disk I/O requests to 8 forks, which in turn forward the requests to one of the 8 disks in the simulation. There are a total of 8 LPs. Figure 4 shows the performance of the simulation with and without the NIC-level implementation of GVT. When performing GVT aggressively (GVT COUNT = 1, effectively performing GVT after every event is processed), NIC-GVT outperforms the standard implementation. As we decrease the frequency of
GVT (increase GVT COUNT), the time required for ex-
ecution by NIC-GVT increases, while that required by
WARPED decreases, until the two implementations per-
form almost identically. A probable explanation of this
behavior is that when GVT is performed aggressively,
more GVT messages must be generated and sent in the
traditional implementation. These messages take up re-
sources (CPU and memory) and create additional con-
tention for the I/O bus. On the other hand, no additional
memory has to be allocated for a message for NIC-GVT
since all information is generated at the NIC and piggy-
backed on other messages. However, as we reduce the frequency of GVT computation, we see that NIC-GVT becomes slightly slower than WARPED.
Figure 5. Police GVT Performance. (a) Police execution time: simulation time (sec) versus GVT period (events) for WARPED and NIC GVT on 8 processors. (b) Police number of GVT rounds versus GVT count (events) for WARPED and NIC GVT.
This is due to the fact that the NIC has to perform GVT checks on each incoming and outgoing message, adding overhead even though GVT information is needed only infrequently.
The results for the Police model for 8 LPs are shown
in Figure 5(a). The same pattern observed in RAID
was seen for Police as well. At highly aggressive GVT,
the traditional implementation breaks down because the
communication traffic overwhelms the host processor
resources. Since the messages are generated by the
NIC, the optimized version does not break down. As
GVT is carried out less aggressively, the gap between
the two implementations narrows until they are almost
identical if GVT is performed highly infrequently. With
highly aggressive GVT, in addition to not requiring the
resources for generating GVT messages and delivering
them to the NIC, we found that the number of GVT
rounds being carried out at the NIC remained relatively
constant because the NIC opportunistically forwards the
GVT information (Figure 5(b)).
4.2. Early Cancellation
RAID was simulated using 16 source processes, 8
forks, and 8 disks spread across 8 LPs in the cluster. We have taken readings at 50000, 100000, 200000, and 400000 disk requests. The execution times
scale almost linearly with the number of requests. Fig-
ure 6(a) shows the percentage speedup obtained from
the optimization. A modest improvement in the simu-
lation time was obtained (less than 5%) due to the re-
duction in the number of messages generated. When we looked closer, the percentage of messages canceled in place was small (less than 1%); we expect to be able to drop significantly more messages with a better NIC processor. Despite this small percentage, the total number of messages sent is reduced by a more appreciable amount (Figure 6(b)) due to the elimination of some rollbacks by directly canceling the erroneous messages that cause them.
Figure 8. Overall Messages Generated (including messages that will be canceled): messages sent versus number of police stations for WARPED and Direct Cancelation.
The speedup obtained for Police was significantly
higher than that for RAID for several of the simula-
tion points (up to 27%; see Figure 7(a)). This improved
speedup is due to a large percentage of canceled mes-
sages being canceled by the NIC (Figure 7(b). More-
over, similar to RAID, the total message count (includ-
ing those that were canceled later) was reduced osten-
sibly because of the reduction in the rollbacks due to
the elimination of some of the anti-messages beforethey
cause erroneous computation at their destination (Fig-
ure 8).
5. Conclusions and Future Work
In this paper, we investigated the optimization of a PDES simulator by programming the firmware of the network interface cards of a cluster of workstations.
Figure 6. RAID Early Cancellation. (a) RAID performance: improvement in performance (%) versus number of RAID disk requests with NIC direct cancelation. (b) RAID message count: messages sent versus number of RAID disk requests for WARPED and Direct Cancelation.
Figure 7. Police Early Cancellation. (a) Police performance: improvement in performance (%) versus number of police stations with NIC direct cancelation. (b) Percentage of canceled messages dropped by the NIC versus number of police stations.
The processors on the NIC cards available for our experiments were not intended for general programmability (they are small CPUs with limited resources); therefore, we se-
lected two optimizations that are lightweight in order to
demonstrate the feasibility of the model and to under-
stand the challenges and issues. As programmable cards
with better processors continue to appear, it is possi-
ble that a significantly larger class of optimizations will
become feasible both for simulation and for other dis-
tributed applications.
The two optimizations we studied provided some im-
provement in the performance of our application (in
some instances significant improvement) despite these
limitations. In the process, we learned the following
lessons: (i) Consistency is a recurring problem in this
model if state is shared between the NIC and the host
processor. Enforcing strong consistency via shared vari-
ables will be expensive in most cases, and relaxed con-
sistency can be obtained by piggybacking handshaking
information on incoming and outgoing messages; and
(ii) There is a need for tools and programming models
to allow effective programming in this model. We are encouraged that newer NIC cards run mainstream OSes on the NIC processor.
We believe that the bottleneck on the transfer path be-
tween the NIC and the processor would make offloading
computation to the NIC more promising as network per-
formance continues to increase. This is especially true
if the programmable NIC resources continue to improve.
Making more resources available on the NIC will open the door for additional optimizations using this model, both for PDES and for other applications (distributed OSes, databases, filesystems, etc.); this is a focus of our
future research.
References
[1] 3Com EtherLink Server 10/100 PCI Net-
work Interface Card with 3XP Processor.
http://www.megahaus.com/tech/3com/
nics/specs/3cr990svr97_spec.shtml.
[2] Alcatric 100x4 Quad Port Server Adapter. http:
//www.bellmicro.com/fibrechannel/
connectivity/alacr/quad_port.htm.
[3] T. Anderson, D. Culler, and D. Patterson. The case for
NOW (network of workstations). IEEE Micro, 15(1),
Feb. 1995.
[4] D. Becker, T. Sterling, D. Savarese, J. Dorband,
U. Ranawak, and C. Packer. BEOWULF: A parallel
workstation for scientific computation. In International
Conference on Parallel Processing, 1995.
[5] M. Blumrich, R. Alpert, Y. Chen, D. Clark, S. Dami-
anakis, C. Dubnicki, E. Felten, L. Iftode, K. Li,
M. Martonosi, and R. Shillner. Design choices in the
SHRIMP system: An empirical study. In Proceedings of
the
Annual ACM/IEEE International Symposium
on Computer Architecture, June 1998.
[6] N. Boden, D. Cohen, and W. Su. Myrinet: A gigabit-
per-second local area network. IEEE Micro, 15(1), Feb.
1995.
[7] D. Burger, J. R. Goodman, and A. Kagi. Memory
Bandwidth Limitations of Future Microprocessors. In
23rd International Symposium on Computer Architec-
ture, May 1996.
[8] J. Chase, D. Anderson, A. Gallatin, A. Lebbeck, and
K. Yocum. Network i/o with trapeze. In Proceedings
of 1999 Hot Interconnects, Aug. 1999.
[9] M. Chetlur, N. Abu-Ghazaleh, R. Radhakrishnan, and
P. A. Wilsey. Optimizing communication in Time-Warp
simulators. In Proceedings of the 12th Workshop on Par-
allel and Distributed Simulation, pages 64–71. Society
for Computer Simulation, May 1998.
[10] L. M. D’Souza, X. Fan, and P. A. Wilsey. pGVT: An al-
gorithm for accurate GVT estimation. In Proc. of the 8th
Workshop on Parallel and Distributed Simulation (PADS
94), pages 102–109. Society for Computer Simulation,
July 1994.
[11] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and
K. Li. VMMC-2: Efficient support for reliable,
connection-oriented communication. In Hot Intercon-
nects V, Aug. 1997.
[12] M. Fiuczynski and B. Bershad. SPINE: A safe pro-
grammable and integrated network environment. In Pro-
ceedings of the Eighth ACM SIGOPS Workshop, 1998.
[13] M. Fiuczynski, R. Martin, T. Owa, and B. Ber-
shad. On using intelligent network interface
cards to support multimedia applications. In
Proceedings of NOSSDAV'98, 1998. http://www.cs.washington.edu/homes/mef/research/spine/reports/nossdav98/.
[14] R. Fujimoto. Parallel discrete event simulation. Com-
munications of the ACM, 33(10):30–53, Oct. 1990.
[15] P. Geoffray, L. Prylli, and B. Tourancheau. BIP-SMP:
High performance message passing over a cluster of
commodity SMPs. In Proceedings of Supercomputing
(SC99), Nov. 1999.
[16] M. Ibel, K. Schauser, C. Scheiman, and M. Weis. High
performance cluster computing using SCI. In Hot Inter-
connects V, Aug. 1997.
[17] R. Krishnamurthy, K. Schwan, R. West, and M. Rosu. A
network co-processor-based approach to scalable media
streaming in servers. In Proceeding of the International
Conference on Parallel Processing (ICPP’00), 2000.
[18] F. Mattern. Efficient algorithms for distributed snapshots
and global virtual time approximation. Journal of Par-
allel and Distributed Computing, 18(4):423–434, Aug.
1993.
[19] D. Mosberger, L. Peterson, and S. O’Malley. Protocol
latency: MIPS and reality. Technical Report TR-95-
02, Department of Computer Science, The University of Arizona, Tucson, AZ, 1995.
[20] M-VIA: Virtual interface architecture for linux, 2001.
http://www.nersc.gov/research/FTG/
via/.
[21] Myrinet, inc. home page, 2001. http://www.myri.
com.
[22] R. Noronha. Intelligent NICs – a feasibility study of im-
proving performance of distributed applications by pro-
gramming some of their components on the NIC. Mas-
ter’s thesis, Binghamton University, Binghamton, NY,
2001.
[23] S. Pakin, M. Lauria, and A. Chien. High performance
messaging on workstations: Illinois Fast Messages (FM)
for Myrinet. In Proceedings of Supercomputing (SC’95),
1995.
[24] G. Pfister. In Search of Clusters, 2nd Ed. Prentice Hall,
1998.
[25] L. Prylli. BIP messages user manual, 1998.
Available at http://lhpca.univ-lyon1.fr/
BIP-manual/index.html.
[26] R. Radhakrishnan, D. E. Martin, M. Chetlur, D. M. Rao,
and P. A. Wilsey. An Object-Oriented Time Warp Sim-
ulation Kernel. In Proceedings of the International Sym-
posium on Computing in Object-Oriented Parallel Envi-
ronments (ISCOPE’98), volume LNCS 1505, pages 13–
23. Springer-Verlag, Dec. 1998.
[27] R. Rajan and P. A. Wilsey. Dynamically switching be-
tween lazy and aggressive cancellation in a time warp
parallel simulator. In Proc. of the 28th Annual Simula-
tion Symposium, pages 22–30. IEEE Computer Society
Press, Apr. 1995.
[28] Intelligent ethernet interface solutions. http://www.
ramix.com/tech/intelethernet.html.
[29] Scalable Computing Lab. SCL cluster cookbook:
Building your own clusters for parallel computa-
tion, 1998. http://www.scl.ameslab.gov/
Projects/ClusterCookbook.
[30] Virtual interface architecture (VIA) specification, 2001.
http://www.viarch.org.
[31] M. Welsh, A. Basu, and T. von Eicken. Atm and fast eth-
ernet network interfaces for user-level communication.
In Proceedings of the Third High-Performance Com-
puter Architecture Conference (HPCA’97), Feb. 1997.
[32] K. Yocum and J. Chase. Payload caching: High speed
data forwarding for network intermediaries. In 2001
Usenix Conference, June 2001.