Using programmable NICs for time-warp optimization


This paper explores optimization of parallel discrete event simulators (PDES) on a cluster of workstations with programmable network interface cards (NICs). We explore reprogramming the firmware on the NIC to optimize the performance of distributed simulation. This is a new implementation model for distributed applications where: (i) application-specific communication optimizations can be implemented on the NIC; (ii) portions of the application that are most heavily communicating can be migrated to the NIC; (iii) some messages can be filtered out at the NIC without burdening the primary processor resources; and (iv) critical events are detected and handled early. The combined effect is to optimize the application communication behavior as well as reduce the load on the host processor resources. We explore this new model by implementing two optimizations to a time-warp simulator on the NIC: (1) the migration of the global virtual time estimation algorithm to the NIC; and (2) early cancellation of messages in place upon early detection of rollbacks. We believe that the model generalizes to other distributed applications.
1. Introduction
The emergence and commercial success of clustering technologies using commodity components allows scalable, cost-effective parallel processing machines to be built easily [3, 4, 24, 29]. Clusters approach the performance of custom parallel machines by using high-performance local/system area networking technologies and standards (such as Myrinet [6], SCI [16] and others [5, 30, 31]) and low-overhead user-level communication protocols (such as the Basic Interface for Parallelism (BIP) [15], Illinois Fast Messages (IFM) [23] and others [11, 20]).
This work was partially supported by NSF Grant EIA-9911099.
In a typical node in a workstation cluster, the NIC resides on the I/O bus, which is connected to the system bus using a bus adapter. Each message traverses an I/O bus twice: once at the sender, where it is transferred (using either DMA or programmed I/O) from the host buffers to the NIC buffers, and once at the receiver, where it is transferred from the NIC to the host. With networking technology improving rapidly, the network can now deliver messages at a rate that overwhelms the ability of the host workstation to handle them. The affected resources include: (i) the I/O bus: at the full network bandwidth of a 4 Gb/s Myrinet network, 100% of a typical I/O bus bandwidth (64-bit 66 MHz PCI bus) will be consumed by network traffic; (ii) the system bus: this is a well-known bottleneck even without communication [7] that is significantly exacerbated by network traffic [19]; and (iii) the CPU: CPU time is needed to handle the messages (interrupt handling, context switches, parsing/generating headers, buffer management, checksums, etc.). For fine-grained applications, these factors result in low effective communication bandwidth and draw resources away from the application.
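As a rough check of the I/O bus figure quoted above, the following sketch computes the theoretical peak of a 64-bit, 66 MHz parallel bus and the fraction of that peak a 4 Gb/s link would occupy. The function names are illustrative, and the calculation assumes one full-width transfer per clock with no arbitration or protocol overhead, so it gives an upper bound on usable bus bandwidth.

```python
def bus_peak_gbps(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak in Gb/s, assuming one width_bits transfer per clock."""
    return width_bits * clock_mhz * 1e6 / 1e9


def share_of_bus(network_gbps: float, bus_gbps: float) -> float:
    """Fraction of the bus peak consumed by traffic at network_gbps."""
    return network_gbps / bus_gbps


peak = bus_peak_gbps(64, 66)      # 64-bit, 66 MHz PCI
share = share_of_bus(4.0, peak)   # 4 Gb/s network link
print(f"bus peak: {peak:.3f} Gb/s, network share of peak: {share:.1%}")
```

Because a shared bus cannot sustain its theoretical peak in practice, a link running at full rate leaves essentially no spare bus capacity, which is consistent with the 100% figure in the text.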
Recently, NIC vendors have started developing high-end, affordable, programmable NICs [1, 2, 28]. Providing programmability on the NIC opens the door for a new system model for implementing distributed applications. More specifically, application-specific customization of the NIC becomes possible. More generally, any portion of the application may be implemented on the NIC to optimize performance in the following ways:
1. Migrate “distributed” portions of the application,
defined informally as the objects that are more
closely related to objects on other nodes than to the
local objects, to the NIC processor such that they
are able to tap into the network directly at low latency and high bandwidth. This migration reduces
the traffic across the NIC-to-host interface, freeing
up host resources for local object processing;
2. Detect and react quickly to urgent or unexpected
events. Normally, such events carry this urgency
semantics at the application level. This requires
that they percolate through the system and up to the
application before they can be handled. Detecting
and handling them at the NIC bypasses this cost;
3. Filter (or generate) messages directly on the NIC,
further reducing the traffic to the host; and
4. Allow communication monitoring and profiling at
a low level not available to applications under the
traditional model.
Implementing intelligence in the I/O subsystem (or even specifically in the NIC) is not a new idea; in fact, the idea of DMA itself is an example of such intelligence. Our work differs in that we seek to make this intelligence application specific. There have been a small number of investigations into using the programmability features to implement application-specific communication primitives on the NIC [8, 13, 17, 32]. The SPINE Operating System provides mechanisms for off-loading user-level primitives to the NIC by extending the Operating System [12]. In this paper, we present initial experiments with this model using parallel discrete event simulation (PDES) as an application. We explore two optimizations to a Time-Warp simulator using this model: the migration of the Global Virtual Time estimation algorithm to the NIC and early cancellation of outgoing messages on the NIC after detecting incoming rollback messages.
2. Time Warp Simulation
Parallel Discrete Event Simulation (PDES) [14] can potentially increase the performance and capacity of a simulation by partitioning the simulation model across a collection of concurrent simulators called Logical Processes (LPs). Each LP maintains a Local Virtual Time (LVT) and communicates with other LPs by exchanging time-stamped event messages. In optimistic simulation (Time-Warp), no explicit synchronization is enforced among the LPs; each LP processes its local events in timestamp order. A causality error occurs when an arriving message has a lower timestamp than LVT (a straggler event), forcing the LP to halt processing and roll back to the earlier virtual time of the straggler. Thus, each LP must maintain state and event histories to enable recovery from straggler events. The progress of the simulation is determined by the smallest timestamp of an unprocessed event in the simulation (taking into account messages in transit); this value is called the Global Virtual Time (GVT), and its estimation can be shown to be similar to the distributed snapshot algorithm [18]. Rollbacks to times before GVT are not possible since events cannot be generated in the past. GVT is used as a marker to garbage collect histories no longer needed for rollback. Time Warp simulations communicate heavily and at very fine granularity, which incurs high overhead and substantially degrades performance on distributed memory machines (and clusters) [9]. This makes them a suitable application for Active NIC optimization.
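The straggler and rollback behavior described above can be illustrated with a minimal sketch of a single LP. This is not WARPED code: the class, its toy integer state, and the method names are assumptions made for illustration, and anti-messages are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalProcess:
    lvt: float = 0.0
    state: int = 0                                  # toy state: sum of payloads
    processed: list = field(default_factory=list)   # (timestamp, payload, state_before)
    pending: list = field(default_factory=list)     # unprocessed (timestamp, payload)
    rollbacks: int = 0

    def receive(self, timestamp, payload):
        if timestamp < self.lvt:                    # straggler: causality error
            self._rollback(timestamp)
        self.pending.append((timestamp, payload))
        self.pending.sort()

    def _rollback(self, to_time):
        self.rollbacks += 1
        # Undo every event processed after the straggler's timestamp,
        # restoring the saved state and re-queueing the event.
        while self.processed and self.processed[-1][0] > to_time:
            ts, payload, state_before = self.processed.pop()
            self.state = state_before
            self.pending.append((ts, payload))
        self.lvt = to_time

    def process_next(self):
        ts, payload = self.pending.pop(0)
        self.processed.append((ts, payload, self.state))  # save history
        self.state += payload
        self.lvt = ts

    def fossil_collect(self, gvt):
        # History older than GVT can never be needed for a rollback.
        self.processed = [e for e in self.processed if e[0] >= gvt]


lp = LogicalProcess()
for ts in (10, 20, 30):
    lp.receive(ts, 1)
    lp.process_next()
lp.receive(15, 5)            # arrives with timestamp 15 while LVT is 30
while lp.pending:
    lp.process_next()
print(lp.lvt, lp.state, lp.rollbacks)
```

In this run the straggler forces the events at 20 and 30 to be undone and re-executed, which is the work that state and event histories exist to support.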
3. Proposed Optimizations
In deciding what optimizations to implement on the NIC, the following factors were considered: (i) NIC resources are severely limited. The processor is the equivalent of 10-year-old technology and is already saddled with other responsibilities. Furthermore, the available memory (1 MB) is restrictive; and (ii) we would like to demonstrate migration of distributed functionality to the NIC and filtration of messages based on information available at the NIC. With the newer generation of NICs with better processors and more memory [21, 28], we expect a much larger scope for optimizations to become available. We elected to implement the following two optimizations as an initial demonstration of feasibility: migration of GVT computation to the NIC; and early cancellation of erroneous messages on the NIC.
3.1. NIC-level GVT
GVT represents the minimum on the timestamp of
all unprocessed messages at all the logical processes
including those in transit. GVT places a lower bound
on the progress of the simulation and therefore guaran-
tees that no earlier events can be generated. With this
knowledge the LPs can garbage collect event and state
histories. GVT is also used for termination detection.
GVT estimation operates concurrently with the simulation: if it is carried out aggressively, it incurs a higher overhead but the obtained estimate is tighter, allowing more timely garbage collection. WARPED implements two GVT algorithms: pGVT [10] and Mattern’s algorithm [18]. We use Mattern’s algorithm because it has a lower overhead and produces good estimates.
Mattern’s algorithm is similar to classic two-round distributed snapshot algorithms. In Figure 1, C1 represents the point at which we decide to invoke the estimate, while C2 represents a point in the future where a consistent estimate is obtained; C1 and C2 should be close together to bound GVT tightly. GVT is the minimum of the timestamps of all the LPs and of the messages sent between cuts C1 and C2, including messages that cross cut C2. All processes are WHITE initially and in this state count the number of messages (WHITE messages) they send out. The designated root process, LP0, starts the computation by turning RED and sending a GVT token to process LP1. Process LP1, on receiving the GVT token, turns RED and forwards the token to LP2, and so on until the token returns to the root process LP0. As the GVT token circulates, each LP adds to a counter the number of WHITE messages it has sent, and subtracts from it the number of WHITE messages it has received. When the token reaches LP0, all processes are RED and the counter contains the number of messages that are in transit. LP0 initiates circulation of the token for a second round. As each LP receives the token, it subtracts the number of WHITE messages received since the token was last seen. Once the token is received by LP0 again, if the counter is 0, the token is circulated again to inform all nodes of the GVT value. If the counter is larger than 0, additional rounds are initiated until all the WHITE messages are received.
Figure 1. Consistent Cuts (cuts C1 and C2)
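The token circulation described above can be sketched as a small sequential simulation. This is a simplified illustration under stated assumptions, not the WARPED or NIC implementation: the `LP` fields, `token_round`, `estimate_gvt`, and the `deliver` callback are hypothetical names, the ring is walked in a single thread, and delivering a WHITE message is modeled by lowering the receiver's LVT to the message timestamp.

```python
import math
from dataclasses import dataclass


@dataclass
class LP:
    lvt: float
    white_sent: int = 0
    white_received: int = 0
    white_seen: int = 0              # WHITE receipts already reported to the token
    min_red_sent: float = math.inf   # smallest timestamp among RED messages sent
    red: bool = False


def token_round(lps, count, first_round):
    """Pass the token once around the ring; return (in-transit count, minimum time)."""
    t_min = math.inf
    for lp in lps:
        if first_round:
            lp.red = True
            count += lp.white_sent
        count -= lp.white_received - lp.white_seen
        lp.white_seen = lp.white_received
        t_min = min(t_min, lp.lvt, lp.min_red_sent)
    return count, t_min


def estimate_gvt(lps, deliver_in_transit, max_rounds=10):
    count, gvt = token_round(lps, 0, first_round=True)
    for rounds in range(2, max_rounds + 1):
        deliver_in_transit(lps)      # the network makes progress between rounds
        count, gvt = token_round(lps, count, first_round=False)
        if count == 0:               # every WHITE message has been received
            return gvt, rounds
    raise RuntimeError("WHITE messages still in transit")


# LP0 sent two WHITE messages; LP1 already received one, and the other
# (timestamp 7) is still in transit to LP2.
lps = [LP(lvt=10, white_sent=2), LP(lvt=15, white_received=1), LP(lvt=12)]
in_transit = [(2, 7.0)]              # (destination index, timestamp)


def deliver(lps):
    while in_transit:
        dest, ts = in_transit.pop()
        lps[dest].white_received += 1
        lps[dest].lvt = min(lps[dest].lvt, ts)


gvt, rounds = estimate_gvt(lps, deliver)
print(gvt, rounds)
```

After the first round the counter is 1, reflecting the message still in transit; the second round accounts for it, and the estimate drops to the timestamp of that message rather than the smallest LVT seen in the first round.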
GVT computation is not on the critical path of the simulation and can be performed in the background. It is also not computationally intensive and does not place excessive load on the NIC. We can save bandwidth by intercepting GVT messages at the NIC and preventing them from going across the I/O bus. Also, GVT information can be piggybacked on many of the normal message fields, which carry pointer information only useful on the originating LP. In WARPED, when Mattern’s GVT algorithm runs entirely on the host, it is not possible to piggyback information on an outgoing message since we cannot guarantee that LPn will send an event message to LP((n+1) mod m), as required by the algorithm, where m is the total number of LPs. Finally, the migration of GVT from the host to the NIC is transparent to the applications being simulated; in this case RAID and POLICE.
Figure 2 shows the division of the implementation of the algorithm between the host processor and the NIC. The host is responsible for deciding when to change color, keeping track of LVT (all objects are on the host), and finally tracking the minimum of the timestamps of all outgoing RED messages, to reduce the latency on the send side at the NIC. The NIC keeps track of the number of outstanding WHITE messages (V), the minimum timestamp received so far (Tmin, initially infinity), and the Local Virtual Time (LVT).
Figure 2. Implementation. One panel lists: (1) track V, T, Tmin across hosts; (2) generate and receive GVT messages; (3) decide on GVT termination; (4) report new values of GVT. The other panel lists: (1) keep track of WHITE messages; (2) track the minimum timestamp of all RED messages sent; (3) calculate LVT.
Consistency is a major issue introduced by this implementation model. On the receive side, messages that are seen by the NIC (causing an update in state, for example LVT) spend time in the OS, message library, and/or application buffers before they are seen by the application. The reverse is true on the send side. For example, the NIC may report LVT to be some value which is not the true LVT value due to the presence of a rollback message that has not been received by the host yet. We expect this consistency problem to arise whenever state is shared between the NIC and the host. The GVT implementation must take care of this inconsistency problem or erroneous GVT values may be computed. In the remainder of this section we describe the implementation.
Initially, each LP reports its rank to the NIC through the global buffer shared between the host and the NIC. The Communication Manager (CM), which is a module in the LP responsible for communication, initializes to 0 the control flag that indicates the piggybacking of a GVT token. The Mattern GVT Manager at the root (LP0) initiates GVT computation by reporting the values of V (number of WHITE messages), Tmin (minimum of the received timestamps) and T (local virtual time estimate at the NIC) to the CM on the host processor and asking it to send out a control message. The CM in turn sets a bit in an outgoing event message and encodes the values of T, Tmin, and V in four unused fields in the Basic Event Message. The NIC at the root, on receiving the message for the first time, extracts the values of T, Tmin, and V and stores them temporarily. The handshaking is carried out to enforce consistency.
Whenever it gets a chance, the NIC marshals the values of T, Tmin, and V into a special GVT message and forwards it to LP1’s host. The NIC also has the following variables: TimewarpInitialised, set by BIP to indicate that BIP has been initialized and the rank variable has been written to; GvtTokenPending, which indicates whether we are in the middle of a GVT computation; ControlMessagePending, which indicates that a GVT control message has been received by the NIC and has been sent to the host for processing; and ReceivedHostVaraibles, which indicates that the pending control message was processed by the host and the values (T, Tmin, V) just came off the last outgoing message.
On receiving a GVT token in round 0, the receiving NIC extracts the values of T, Tmin, and V. The NIC reports these values to the host processor and also requests an update. When the Mattern GVT Manager receives this message in memory, it updates the WHITE message count and modifies its color. On receiving an incoming GVT message in any round other than round 0, the NIC first requests the values of V, T, and Tmin from the host and adds V to the value of V in the token. At the root LP, if the result is 0, the NIC broadcasts a new GVT message to all other logical processes. Following that, it reports the value of GVT to its own host. The full implementation details can be found elsewhere [22].
3.2. Early Message Cancellation
Timewarp is based upon the optimistic strategy for
parallel discrete event simulation. More specifically,
Logical Processes proceed in parallel with no synchro-
nization and use a detect and recover strategy to deal
with causality errors. Recovery consists of restoring
the simulation to an earlier valid state and sending out
negative or anti-messages to cancel erroneously sent
messages. We use aggressive cancellation [27] where
erroneous messages are instantly canceled (via anti-
messages) when a causality error is detected.
Our second optimization, Early Message Cancella-
tion, explores deleting messages in the NIC buffer if a
rollback message with an earlier time is detected. Such
messages are overly optimistic and will have to be can-
celed using an anti-message once the rollback is de-
tected. Eliminating these messages in place therefore
saves the cost of sending them (and handling them at the
destination), as well as the later cost of canceling them.
This optimization was chosen to demonstrate how the NIC could be used to intelligently filter out some traffic and increase simulation efficiency. We believe that there are numerous opportunities for such optimizations both for PDES and other distributed applications; however, we are currently limited by NIC speed and the immaturity of the model and programming environment.
Any message sent from WARPED follows the path shown in Figure 3(a), spending a substantial amount of time in buffers before actually being sent out over the network. At some point, the NIC or the host might decide that an event message still in the buffers is not needed and can be dropped. The technique we use is to peek at received anti-messages at the NIC level and discard messages from the send queue based on the receive timestamp of the anti-messages. An example of this is shown in Figure 3(b), where messages with timestamps 120, 115, 110 and 102 can be discarded because of the anti-message received with timestamp 100.
Figure 3. Early Message Cancellation. (a) Message Send (labels: Warped buffers, MPICH buffers (64K), NIC buffer (4K)). (b) Early Cancellation Example (labels: NIC send buffer, NIC receive buffer, anti-message, messages with timestamps > 100 to be cancelled).
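The filtering rule in the Figure 3(b) example can be sketched as follows. This is an illustration only: the function name and the `(event_id, timestamp)` tuple layout are assumptions, and the real send queue lives in NIC memory rather than in a Python `deque`.

```python
from collections import deque


def cancel_in_place(send_queue, anti_timestamp):
    """Split queued (event_id, timestamp) entries into kept and dropped.

    An entry is dropped when its timestamp is later than the timestamp of
    the anti-message that was just seen on the receive side.
    """
    kept, dropped = deque(), []
    for event_id, timestamp in send_queue:
        if timestamp > anti_timestamp:
            dropped.append((event_id, timestamp))
        else:
            kept.append((event_id, timestamp))
    return kept, dropped


# Queue contents follow the example in the text, plus one earlier message (95).
send_queue = deque([(1, 120), (2, 115), (3, 95), (4, 110), (5, 102)])
kept, dropped = cancel_in_place(send_queue, anti_timestamp=100)
print([ts for _, ts in dropped])
print([ts for _, ts in kept])
```

The event IDs of the dropped entries are returned rather than discarded because the sender will later try to cancel those same messages, and that attempt has to be suppressed.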
Before we describe the implementation, we discuss some of the problems experienced with the communication layers BIP and MPICH when packets are dropped. First, BIP maintains sequence numbers to help in the ordering of packets, making it necessary to turn off sequence numbers while implementing packet dropping. Alternatives include requiring the NIC to maintain state information about dropped packets or informing the host of packet drops, both of which are difficult in the current implementation due to the limited capabilities of the NIC and I/O bus contention. The second problem lies with the implementation of credit-based flow control in MPICH. Since additional credit is piggybacked on packets from the receiver back to the sender, dropped packets cause credit to be lost and the sender’s window to close up. We address this problem by enabling sequence numbers in MPICH so that lost packets can be detected immediately and the receiver can update its estimate of the number of credits the sender has used up. Also, the NIC keeps track of credit from dropped packets for a particular destination and updates the credit information of a packet headed for that destination. Finally, the sending window is increased, allowing the sender to send for longer periods of time and to recover in the case of a block of dropped packets.
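The NIC-side credit bookkeeping mentioned above might look like the following sketch. The class name and the packet fields `dest` and `credit` are hypothetical and do not reflect the actual MPICH or BIP packet layout; the point is only that credit carried by a dropped packet is held back and attached to the next packet sent to the same destination.

```python
from collections import defaultdict


class NicCreditTracker:
    def __init__(self):
        # destination rank -> credit that was riding on dropped packets
        self.owed = defaultdict(int)

    def on_drop(self, packet):
        """Remember the credit carried by a packet that is being dropped."""
        self.owed[packet["dest"]] += packet["credit"]

    def on_send(self, packet):
        """Attach any outstanding credit for this destination before sending."""
        packet["credit"] += self.owed.pop(packet["dest"], 0)
        return packet


tracker = NicCreditTracker()
tracker.on_drop({"dest": 3, "credit": 2})
tracker.on_drop({"dest": 3, "credit": 1})
first = tracker.on_send({"dest": 3, "credit": 1})    # carries the lost credit
other = tracker.on_send({"dest": 5, "credit": 0})    # different destination
second = tracker.on_send({"dest": 3, "credit": 0})   # nothing left to add
print(first["credit"], other["credit"], second["credit"])
```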
The algorithm begins by scanning the receive queue on the NIC for anti-messages, whose receive timestamps are then recorded. This timestamp is compared to the send timestamp on all outgoing messages sent before the anti-message is received at the host (the host reports the last received anti-stamp to the NIC by piggybacking its receive timestamp on all outgoing messages). If the received anti-stamp is less than the outgoing message timestamp, the message is dropped. The event IDs of all dropped messages are recorded so that we can either prevent the sending of the corresponding anti-message at the host or drop it at the NIC. Due to space limitations, we only discuss the implementation briefly.
Whenever the NIC receives a message, it checks whether it is an anti-message. If so, it records the timestamp on that message in a variable, which is used by the send queue to cancel messages. The simulator’s Input Queue makes a note of the timestamp of the last processed anti-message, required by the CM on the sending side. It must record only the timestamps of messages that have been processed by the NIC. That is, it must record the timestamps of anti-messages from remote objects. This differentiation is achieved by using the object ID defined by the applications RAID and POLICE and will need to be recoded for any new application. The CM piggybacks the timestamp of the last received anti-message on the Next object field of every outgoing message. This is necessary to maintain consistency. Otherwise, it would not be possible to discriminate between messages generated before the anti-message was processed by the host (should be canceled) and ones generated after (should not be canceled).
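One plausible reading of the consistency check described above is sketched below. The equality test between the piggybacked stamp and the NIC's recorded stamp is an assumption about how "generated before" and "generated after" are told apart; the source does not give the exact comparison, and the function and parameter names are illustrative.

```python
from typing import Optional


def should_drop(msg_send_time: float,
                host_anti_stamp: Optional[float],
                nic_anti_stamp: Optional[float]) -> bool:
    """Decide whether the NIC should drop an outgoing message.

    host_anti_stamp: timestamp of the last anti-message the host had processed
                     when it generated this message (piggybacked by the CM).
    nic_anti_stamp:  timestamp of the last anti-message recorded by the NIC.
    """
    if nic_anti_stamp is None:
        return False                      # no anti-message seen yet
    if host_anti_stamp == nic_anti_stamp:
        return False                      # host already rolled back; message is valid
    return msg_send_time > nic_anti_stamp  # stale and later than the anti-stamp


print(should_drop(120, None, 100))   # generated before the host saw the rollback
print(should_drop(120, 100, 100))    # generated after the host processed it
```

Without the piggybacked stamp, both calls would look identical to the NIC, and a valid message regenerated after the rollback would be dropped along with the stale one.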
The logic on the send side queue of the NIC is the most complicated. Whenever we drop a positive message, we know that at some point the host will try to cancel this message, and therefore we need to track the event IDs of all canceled messages. For every object on the LP we allocate a buffer of size 10, which is declared in the global structures of the NIC so that it can be accessed by both the host and the NIC. The host can avoid sending negative messages by accessing this buffer, while the NIC can filter out the negative messages that the host sent before the corresponding positive message on the NIC was dropped. Finally, we describe the logic in the Timewarp object, which is responsible for the generation of anti-messages. We first scan the event ID buffer on the NIC; if the event ID is present in the buffer, the anti-message is not generated. Implementation details can be found elsewhere [22].
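The per-object buffer of dropped event IDs can be sketched as a small bounded structure. Only the capacity of 10 comes from the text; the class and method names are hypothetical, and the policy of overwriting the oldest entry when the buffer is full is an assumption, since the source does not say what happens on overflow.

```python
from collections import deque

BUFFER_SIZE = 10   # per-object capacity given in the text


class DroppedEventBuffer:
    def __init__(self, size: int = BUFFER_SIZE):
        # Assumption: when full, the oldest event ID is overwritten.
        self.ids = deque(maxlen=size)

    def record_drop(self, event_id: int) -> None:
        """Called when the NIC drops a positive message in place."""
        self.ids.append(event_id)

    def suppress_anti_message(self, event_id: int) -> bool:
        """Return True if no anti-message is needed for event_id.

        The same check serves the host (skip generating the anti-message)
        and the NIC (filter one that was already queued).
        """
        if event_id in self.ids:
            self.ids.remove(event_id)
            return True
        return False


buf = DroppedEventBuffer()
buf.record_drop(42)
print(buf.suppress_anti_message(42))   # positive message was dropped at the NIC
print(buf.suppress_anti_message(42))   # entry consumed; nothing left to suppress
print(buf.suppress_anti_message(7))    # message was actually sent; cancel normally
```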
4. Experimental Study
In this section, we present an experimental study of the proposed optimizations on a Myrinet-connected cluster. The cluster has eight nodes; each node is a 2-way SMP with Pentium III 550 MHz processors running Redhat Linux 6.2. The machines are connected by a 1.2 Gbps Myrinet switch with LanAi4 processors (66MHz, 1MB dual ported SRAM); the optimizations were implemented by reprogramming the firmware of this processor. We used the Message Passing Interface (MPI) running on top of the Basic Interface for Parallelism (BIP) suite: a light-weight user-level communication protocol for Myrinet [15, 25]. BIP runs directly on top of the hardware (it bypasses TCP/IP). The optimizations were implemented for the WARPED simulation engine: a configurable Time-Warp parallel discrete event simulator [26]. We present results using two of the applications provided by the WARPED release: RAID, which models the operation of a RAID-5 disk array, and POLICE, which is a simple model of a traffic police telecommunications
4.1. NIC-level GVT
Figure 4. RAID GVT Execution Time (RAID performance with NIC GVT: simulation time in seconds versus GVT period in events)
RAID was simulated using 10 processes sending disk I/O requests to 8 forks, which in turn forward the requests to one of the 8 disks in the simulation. There are a total of 8 LPs. Figure 4 shows the performance of the simulation with and without the NIC-level implementation of GVT. When performing GVT aggressively (GVT COUNT = 1, effectively performing GVT after every event is processed), NIC-GVT outperforms the standard implementation. As we decrease the frequency of GVT (increase GVT COUNT), the execution time of NIC-GVT increases while that of WARPED decreases, until the two implementations perform almost identically. A probable explanation of this behavior is that when GVT is performed aggressively, more GVT messages must be generated and sent in the traditional implementation. These messages take up resources (CPU and memory) and create additional contention for the I/O bus. On the other hand, no additional memory has to be allocated for a message for NIC-GVT since all information is generated at the NIC and piggybacked on other messages. However, as we reduce the frequency of GVT computation, we see that NIC-GVT becomes slightly slower than WARPED. This is due to the fact that the NIC has to perform GVT checks on each incoming and outgoing message, adding overhead for work that is needed only infrequently given the low frequency of GVT.
Figure 5. Police GVT Performance. (a) Police: Execution Time (POLICE performance with NIC GVT, 8 processors; simulation time in seconds versus GVT period in events). (b) Police: Number of Rounds (number of GVT rounds versus GVT count in events).
The results for the Police model for 8 LPs are shown
in Figure 5(a). The same pattern observed in RAID
was seen for Police as well. At highly aggressive GVT,
the traditional implementation breaks down because the
communication traffic overwhelms the host processor
resources. Since the messages are generated by the
NIC, the optimized version does not break down. As
GVT is carried out less aggressively, the gap between
the two implementations narrows until they are almost
identical if GVT is performed highly infrequently. With
highly aggressive GVT, in addition to not requiring the
resources for generating GVT messages and delivering
them to the NIC, we found that the number of GVT
rounds being carried out at the NIC remained relatively
constant because the NIC opportunistically forwards the
GVT information (Figure 5(b)).
4.2. Early Cancellation
RAID was simulated using 16 source processes, 8 forks, and 8 disks spread across 8 LPs in the cluster. We have taken readings at 50000, 100000, 200000 and 400000 disk requests. The execution times scale almost linearly with the number of requests. Figure 6(a) shows the percentage speedup obtained from the optimization. A modest improvement in the simulation time was obtained (less than 5%) due to the reduction in the number of messages generated. On closer inspection, the percentage of messages canceled in place was small (less than 1%); we expect to be able to drop significantly more messages with a better NIC processor. Despite this small percentage, the total number of messages sent is reduced by a more appreciable amount (Figure 6(b)) due to the elimination of some rollbacks by directly canceling the erroneous messages that cause them.
Figure 8. Overall Messages Generated (including messages that will be canceled): Police message count with NIC direct cancelation, messages sent versus number of police stations.
The speedup obtained for Police was significantly higher than that for RAID for several of the simulation points (up to 27%; see Figure 7(a)). This improved speedup is due to a large percentage of canceled messages being canceled by the NIC (Figure 7(b)). Moreover, similar to RAID, the total message count (including those that were canceled later) was reduced, ostensibly because of the reduction in rollbacks due to the elimination of some of the anti-messages before they cause erroneous computation at their destination (Figure 8).
Figure 6. RAID Early Cancellation. (a) RAID Performance (RAID performance with NIC direct cancelation: percentage improvement in performance versus number of RAID disk requests). (b) RAID Message Count (RAID message count with NIC direct cancelation: messages sent versus number of RAID disk requests).
Figure 7. Police Early Cancellation. (a) Police Performance (POLICE performance with NIC direct cancelation: percentage improvement in performance versus number of police stations). (b) Police Message Count (percentage of canceled messages dropped by the NIC versus number of police stations).
5. Conclusions and Future Work
In this paper, we investigated optimization of a PDES simulator by programming the firmware of the network interface cards of a cluster of workstations. The processors on the NICs available for our experiments were not intended for general programmability (they are small CPUs with limited resources); therefore, we selected two lightweight optimizations in order to demonstrate the feasibility of the model and to understand the challenges and issues. As programmable cards with better processors continue to appear, it is possible that a significantly larger class of optimizations will become feasible both for simulation and for other distributed applications.
The two optimizations we studied provided some improvement in the performance of our application (in some instances significant improvement) despite these limitations. In the process, we learned the following lessons: (i) Consistency is a recurring problem in this model if state is shared between the NIC and the host processor. Enforcing strong consistency via shared variables will be expensive in most cases, and relaxed consistency can be obtained by piggybacking handshaking information on incoming and outgoing messages; and (ii) There is a need for tools and programming models to allow effective programming in this model. We are encouraged that the new NICs offer mainstream OSes running on the NIC processor.
We believe that the bottleneck on the transfer path between the NIC and the processor will make offloading computation to the NIC more promising as network performance continues to increase. This is especially true if the programmable NIC resources continue to improve. Making more resources available on the NIC will open the door for additional optimizations using this model both for PDES and other applications (distributed OS, databases, filesystems, etc.); this is a focus of our future research.
