Fast Classification of MPI Applications using
Lamport’s Logical Clocks
Zhou Tong
Department of Computer Science
Florida State University
Tallahassee, Florida, USA
tong@cs.fsu.edu
Scott Pakin, Michael Lang
Computer, Computational, and Stat. Sci. Div.
Los Alamos National Laboratory
Los Alamos, NM
{pakin, mlang}@lanl.gov
Xin Yuan
Department of Computer Science
Florida State University
Tallahassee, Florida, USA
xyuan@cs.fsu.edu
Abstract—We present a novel trace-based analysis tool that
rapidly classifies an MPI application as bandwidth-bound,
latency-bound, load-imbalance-bound, or computation-bound for
different interconnection networks. The tool uses an extension of
Lamport’s logical clock to track application progress in the trace
replay. It has two unique features. First, it can predict application
performance for many latency and bandwidth parameters from
a single replay of the trace. Second, it infers the performance
characteristics of an application and classifies the application
using the predicted performance trend for a range of network
configurations instead of using the predicted performance for
a particular network configuration. We describe the techniques
used in the tool and its design and implementation, and report our
performance study of the tool and our experience with classifying
twelve applications and mini-apps from the DOE DesignForward
project as well as the NAS Parallel Benchmarks.
Index Terms—Parallel application; performance analysis; tool
I. INTRODUCTION
The performance of message-passing applications is often
studied through application tracing and trace replay and anal-
ysis. Tools for trace collection and analysis enable studying
a parallel application and the system on which it runs in
great detail by replaying the collected traces and simulating
the application and the underlying architecture. Simulating
an application on a large-scale system can take a significant
amount of time and computing resources but can be useful for
understanding the performance characteristics of the applica-
tion on the simulated platform. Such studies, however, offer
little insight into how an application will perform when run
on a different system.
We present a new trace-based performance-analysis tool
for fast classification of MPI [24] applications.1 Unlike
conventional trace-based performance-analysis tools that focus
on uncovering application performance characteristics for a
particular system configuration, our tool can rapidly reveal the
performance characteristics when the application runs on many
different interconnection networking technologies. In our tool,
each interconnect technology is represented by two terms,
bandwidth and latency, that can be measured by commonly-
used methods such as netperf or ping-pong [21].
1Available at https://github.com/ztong87/MFACT
Our tool classifies an application as bandwidth-bound,
latency-bound, communication-bound (both bandwidth- and
latency-bound), load-imbalance bound, or computation-bound
for different latencies and bandwidths by replaying the ap-
plication trace only once. The tool is implemented in MPI
and performs less communication in the trace replay than the
original MPI program. It is at least as scalable as the MPI
program under study. The tool can be used by application
developers to rapidly forecast communication-related perfor-
mance bottlenecks and by system designers to quickly gauge
the relative benefits of various networking options. It thereby
complements slower but more accurate approaches such as
cycle-accurate simulation.
The tool uses an extension of Lamport’s logical clock [13] to
keep track of time progress in the trace replay. Such a logical
clock not only maintains the happened-before relationship but
also tracks the predicted application execution time. The tool
maintains multiple sets of logical clock counters, one set
for each network configuration. Each set of counters records
different types of predicted times including computation time,
wait time, latency time, and bandwidth time of the application
for the particular interconnect configuration. During the one
pass of the trace replay, logical clocks for all network configu-
rations are maintained simultaneously, and all of the counters
are computed. In other words, by simulating the application
once, the predicted execution time for many interconnect
configurations is obtained. With the predicted execution times
for carefully selected network configurations, application clas-
sification is performed based on the performance trend across
a range of interconnect configurations instead of the predicted
performance for a particular interconnect configuration. This
approach provides a robust classification by tolerating the
inherent inaccuracy in predicting the performance of an in-
dividual network configuration. It makes it easy to gauge an
application’s relative sensitivity to latency, bandwidth, and
load imbalance by presenting performance trends over a range
of network configurations.
We will discuss the techniques used and describe the design
and implementation of the tool. We will then validate the
tool, report the performance of the tool, and describe our
experience with using the tool to classify twelve Department
of Energy (DOE) applications and mini-apps from the DesignForward
project as well as the NAS Parallel Benchmarks.
II. RELATED WORK
Performance analysis has always been a critical component
of high-performance computing (HPC). A large number of
profiling, trace-collection, simulation, analysis, and visualiza-
tion tools have been developed. Many tracing tools such as
DUMPI [22] and the Intel Trace Analyzer and Collector [9]
are available for collecting traces. These traces can be analyzed
by various trace analysis/replay tools such as the Structural
Simulation Toolkit (SST) [1] and BigSim [27]. In addition,
compression techniques can further reduce the volume of
communication traces while preserving structural information
for replay as in ScalaTrace [19].
These performance analysis infrastructures usually build
upon a low level event-driven interconnect simulator such as
BookSim [11] or ROSS [3]. Other tools, including Kojak [17],
Scalasca [6] and DIMEMAS [7], provide performance pre-
diction and analysis with source-level instrumentation. In
particular, Scalasca’s trace-replay analysis enables automatic
wait-state search and identifies the root cause of parallelization
bottlenecks such as load imbalance. These end-to-end simula-
tion tools can simulate the behavior of a full application on
a target system and generate performance summaries. Further-
more, the performance data can be visualized by tools such
as Vampir [18], Paraver [12], Jumpshot [4], and Ravel [10].
Ravel can infer an application’s communication pattern based
on logical happened-before relationships. Isaacs et al. [10]
define a lateness metric that quantifies the delay an event
experiences relative to its logical assignment. Our tool is different from
these existing tools in that it does not focus on obtaining
detailed performance for a particular system. Rather, it tries to
rapidly uncover the fundamental performance characteristics
of an application by quantifying its sensitivity to increases
and decreases in bandwidth and latency. It is therefore similar
in that regard to analytical performance modeling [2] but
represents an automated rather than a manual approach.
III. FAST CLASSIFICATION TOOL
A. Overview
The tool takes DUMPI [22] traces of an application as input,
simulates the events in the traces, obtains simulated times
for different configurations, and classifies the application for
different interconnect technologies. Our techniques can be ap-
plied to any MPI tracing format. We select the DUMPI format
because it is widely used in DOE laboratories, and DUMPI
traces for many applications and mini-apps are available from
the DOE DesignForward Web site.2 One limitation of DUMPI
traces, though, is that they lack information about the machine
topology, task mapping, and anything else below the MPI
level.
A DUMPI trace file records all MPI events in each rank
of the application. In replaying the traces, our simulation
tool creates one MPI process for each DUMPI trace file. An
2http://portal.nersc.gov/project/CAL/designforward.htm
extension of Lamport’s logical clock scheme [13] is used to
keep track of the progress of the processes in the simulation
of the application. Multiple sets of counters are maintained to
record the predicted time for different network configurations.
For each network configuration, four counters (computation
time, wait time, bandwidth time, and latency time) are used to
record the different types of time that a process experiences.
For the communication categories (wait time, bandwidth time,
and latency time), only the time that contributes to the overall
process execution time is recorded. Communication time that
overlaps with computation time is excluded. For example, if
a message is sent and arrives before the receiver enters the
MPI Recv operation, there is no wait, bandwidth, or latency
time experienced at the receiver. The times recorded in the
four counters are defined as follows:
1) Computation time: the time consumed by computation
and local (non-communicating) MPI routines. Computa-
tion time is measured as the gap from the exit time of an
MPI routine to the entry time of the next MPI routine as
well as the gap from the entry time of a local MPI routine
to the exit time of that routine. The computation time can
be scaled to model CPUs of different processing speeds.
Note that the entry and exit time of each MPI routine is
recorded in the DUMPI trace.
2) Wait time: the time spent waiting for a corresponding
party to start its communication. It occurs in both point-
to-point and collective MPI routines. For a blocking point-
to-point communication, the receiver may be blocked
waiting for the sender to start the communication. In
this case, the wait time is from the time when the
receiver enters the blocking receive routine to the time
when the sender enters the send routine. For a collective
communication, a process may be blocked waiting for
the last process in the operation to enter the collective
communication: the wait time for each process is from
the time when the process enters the collective routine
to the time when the last process enters the collective
routine.
3) Latency time: the fixed latency for the first byte of a
message to reach the receiver, independent of message
size. For a point-to-point communication, the latency time
is from the time when the sender enters the send operation
to the time when the first byte reaches the receiver (this
time is estimated with a model in the tool). The latency
time counter records the total latency time that is not
overlapped with computation.
4) Bandwidth time: the time needed for the whole message
to arrive at the receiver after the first byte arrives.
The tool estimates this time using a model based on
the message size and network bandwidth. The bandwidth
time counter records the total bandwidth time that is not
overlapped with computation.
To investigate the performance characteristics of an appli-
cation and classify the application for a particular intercon-
nect technology, a set of network configurations is carefully
selected, and the four counters are maintained for each configuration. The
performance statistics across the set of configurations are then
used to classify the application for the targeted interconnect
technology. In the text that follows we discuss the details of the
techniques used in the tool, including how the communication
is modeled, how the logical clock is maintained for each
configuration, how the counters are computed, how the predicted
times for multiple network configurations are advanced
simultaneously, how to select network configurations for a
targeted interconnect technology, and how the classification
is performed based on the performance statistics across the
configurations.
B. Modeling communication
The tool currently supports two types of communication:
intra-node and inter-node. For each type of communications
the tool supports two models of point-to-point communication:
a model based on table look-ups that provides more accurate
estimation about the communication time for a given system,
and a simple Hockney model [8] in which an n-byte message
takes α + nβ time to complete, where α is the communication
latency (the time to communicate a zero-byte message) and
β is the per-byte cost (reciprocal bandwidth).
A collective operation is modeled as a global synchronization
for all processes to be ready for the operation, and
then Thakur and Gropp's models [25] are used to estimate
the time for different collective operations. For example, in
collective operations such as MPI_Bcast and MPI_Reduce,
the communication time is modeled as (α + nβ) log2 P, where
α log2 P is the latency time and nβ log2 P is the bandwidth
time.
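The following minimal sketch (in Python) illustrates the two cost models as used above; the function names and the example parameter values are ours, not the tool's, and the collective estimate shown covers only the bcast/reduce case of the Thakur and Gropp models.

```python
import math

def hockney_time(n_bytes, alpha, beta):
    """Point-to-point Hockney estimate: fixed latency plus per-byte cost."""
    return alpha + n_bytes * beta

def bcast_time(n_bytes, alpha, beta, num_procs):
    """Illustrative (alpha + n*beta) * log2(P) estimate for MPI_Bcast or
    MPI_Reduce; alpha*log2(P) is charged to the latency counter and
    n*beta*log2(P) to the bandwidth counter."""
    steps = math.log2(num_procs)
    latency_time = alpha * steps
    bandwidth_time = n_bytes * beta * steps
    return latency_time + bandwidth_time

# Example: a 1 MiB message on a 10 Gb/s, 5 microsecond network (Table I values)
alpha = 5e-6          # seconds
beta = 8 / 10e9       # seconds per byte at 10 Gb/s
print(hockney_time(2**20, alpha, beta))    # ~0.00084 s
print(bcast_time(2**20, alpha, beta, 64))  # ~0.005 s
```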
In the rest of the paper, we will assume a Hockney model to
ease exposition. We note that neither model is entirely accurate.
However, because our tool classifies an application based on its
performance over a range of interconnect configurations, even
when an individual configuration is not modeled accurately,
the classification can still be quite robust.
C. Modeling computation
The tool measures computation time as the time from when
one MPI function returns control to the application to the time
the subsequent MPI call is invoked. Non-blocking and purely
local MPI functions are also treated as computation, with the
associated time being the time between when the function is
invoked and when it returns control to the application. To
model different CPU speeds, the tool supports computation
scaling by multiplying the default computation time by a
specified scaling factor. This scaling factor will be denoted
as fcomp in the rest of the paper.
D. Lamport’s logical clock
Lamport’s logical clock [13] provides a clock synchroniza-
tion mechanism that honors a happened-before relationship.
Each process represents time with a local counter. Before an
“event” (a computation or a communication operation), the
process increments its counter. The current counter value is
Fig. 1: Example of Lamport’s logical clock
transmitted with each message. A receiver sets its counter to
the maximum of its current value and the value received plus
one unit of latency.
Figure 1 illustrates how a logical clock enforces the ordering
of three causally-related events (a, b, and c) across
three processes (P0, P1, and P2). That is, an event that
happens before another event will have a smaller timestamp.
P1 sends message (a) at time 2, and it is received at P0 at
time max(3, 2+1) = 3. P2 sends message (b) also at time 2,
but because P0 already observed a few events, it is not received
until time max(5, 2+1) = 5. Although P1 is only at time 4
when it receives P0's message, it advances its counter to
time max(4, 6+1) = 7 because P0 sent the message at time 6.
In Figure 1, we use 4 → 7 (RECV) in the execution of P1 to
show the change of timestamp. This way, a partial ordering
over all events is established.
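A minimal sketch of this plain Lamport clock update, as used in Figure 1 with one unit of latency per message (class and method names are ours):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the counter before the event."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to piggyback on an outgoing message."""
        return self.tick()

    def receive(self, sent_time, latency=1):
        """Receive rule: the maximum of the local counter and the sender's
        timestamp plus one unit of latency."""
        self.time = max(self.time, sent_time + latency)
        return self.time

p0, p1 = LamportClock(), LamportClock()
t = p1.send()      # P1 sends at time 1
p0.receive(t)      # P0 advances to max(0, 1+1) = 2
```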
E. Maintaining the extended Lamport’s logical clock
Our tool extends Lamport’s logical clock to not only
preserve causal relationships, but also to maintain predicted
application time by incorporating non-unit computation time
and non-unit communication time. There is one logical clock
for each network configuration. The logical time for each
event represents the predicted time for the event for the
given network configuration. The tool assumes a starting event
such as the default MPI Init and computes the timing for all
following events.
Let there be N communication events in a trace, numbered
from 0 to N−1. We denote r^i_in to be the physical entry time
of the ith event, 0 ≤ i < N, logged in the DUMPI trace, r^i_out to
be the physical exit time, t^i_in to be the logical entry time, and
t^i_out to be the logical exit time. The tool maintains the logical
clock and calculates t^i_in and t^i_out for all events in the DUMPI
trace. The logical entry time is calculated as follows:

t^0_in = 0 and t^i_in = t^{i-1}_out + (r^i_in − r^{i-1}_out) × f_comp.
The starting logical time t^0_out is initialized to 0. The
logical entry time of an event is the logical exit time of the previous
event, t^{i-1}_out, plus the computation time (r^i_in − r^{i-1}_out) × f_comp
between the two events. This allows the logical clock to keep
track of the computation time of the application. The logical
exit time of an event depends on the type of event. For an MPI
non-blocking operation such as MPI_Isend, the operation is
treated as computation and the logical exit time is the logical
entry time plus the time spent in the event:

t^i_out = t^i_in + (r^i_out − r^i_in) × f_comp.
For an MPI point-to-point blocking send operation, the
exit time may be affected by the MPI protocol implemented.
In most MPI implementations, point-to-point communication
is realized by two protocols: an eager protocol for small
messages and a rendezvous protocol for large messages. With
an eager protocol, the sender copies the message into a
system buffer and exits the operation while the MPI library’s
progress engine asynchronously transfers the message to the
receiver. With a rendezvous protocol, the sender has to wait
for an acknowledgment from the receiver before initiating data
transfer. Consequently, the rendezvous protocol introduces an
additional dependence from the receiver to the sender, which
restricts our tool to processing only one configuration at a
time, thereby greatly limiting its utility. We design our tool
to be MPI-implementation oblivious. Because the eager proto-
col better reflects the inherent communication characteristics
(there is a dependence from the sender to the receiver but
not the other way around), communication assumes an eager
protocol by default although the rendezvous protocol is also
supported by the tool.
With the eager protocol, the blocking send operation com-
pletes when the message is copied into system memory:
t^i_out = t^i_in + memcopy_time(msg_size).
For a blocking, point-to-point receive operation
(e.g., MPI_Recv or MPI_Wait), the logical exit time
also depends on the time at which the matching send starts
the communication. If the logical entry time for the
matching send operation is called T, then the logical exit time
for the blocking wait is

t^i_out = max(t^i_in + 1, T + α + nβ).
In the formula, α + nβ is the time to transfer the message
from the sender to the receiver. This calculation captures both
the happens-before relation and the predicted communication
time. In the implementation, in order for the receiver to
obtain the value of T(the logical time for the matching
send event) in the trace replay, each send event transmits its
logical timestamp to the receiver, and each blocking receive
event performs the matching receive and updates its logical
clock based on the timestamp received. The deterministic
ordering in our trace-replay tool ensures reproducibility and
helps compare different networks fairly.
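To make the point-to-point replay rule concrete, the following is a minimal sketch assuming mpi4py and the Hockney model; the function names, the single-timestamp payload, and the fixed memory-copy cost are simplifying assumptions of ours, not the tool's interface.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def replay_send(dest, t_in, mem_copy_time):
    """Eager-protocol send: piggyback the sender's logical entry time on the
    replayed message; the send itself exits after the local memory copy."""
    comm.send(t_in, dest=dest, tag=0)
    return t_in + mem_copy_time            # logical exit time of MPI_Send

def replay_recv(src, t_in, msg_bytes, alpha, beta):
    """Blocking receive: exit at the later of local progress and the
    modeled delivery time T + alpha + n*beta of the matching send."""
    T = comm.recv(source=src, tag=0)       # sender's logical entry time
    return max(t_in + 1, T + alpha + msg_bytes * beta)
```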
We treat MPI_Waitall as an array of blocking wait requests.
The simulated process is blocked until all matching messages
are received and all requests are completed. For instance, if there
are k wait requests, each one estimates the logical exit time for
an individual blocking wait. The maximum logical exit time
of all blocking waits is the exit time of MPI_Waitall:

t^i_out = max_j(t^{i,j}_out), j ∈ [1, k].
If the ith event is a collective operation, the logical exit time
is computed as

t^i_out = max(t^c_in) + model_time(n).

The term max(t^c_in) is the largest logical entry time of the
collective operation among all processes, and model_time(n) is
the modeled time for the collective operation with message
size n, as in Thakur and Gropp [25]. In the trace replay,
an MPI_Allreduce operation is performed for each collective
operation to obtain the largest logical entry time max(t^c_in)
among all processes. Then, the exit time of each process can
be computed locally.
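A sketch of the collective replay step, again assuming mpi4py and the Hockney parameters; the log2(P) cost shown stands in for model_time(n) and corresponds only to the bcast/reduce-style estimate above.

```python
import math
from mpi4py import MPI

comm = MPI.COMM_WORLD

def replay_collective(t_in, msg_bytes, alpha, beta):
    """One MPI_Allreduce yields the latest logical entry time across all
    ranks; each rank then adds the modeled collective cost locally."""
    latest_entry = comm.allreduce(t_in, op=MPI.MAX)
    model_time = (alpha + msg_bytes * beta) * math.log2(comm.Get_size())
    return latest_entry + model_time
```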
F. Recording different types of communication times
Figure 2 depicts three distinct scenarios in MPI point-to-point
operations. The logical entry times t_in of each MPI
routine are marked, and the exit times t_out are estimated based
on the expected time at which P1 receives the message. As
discussed in Section III-A, the classification tool maintains
four time counters for each network configuration: compute
time, wait time, latency time, and bandwidth time. Figure 2
illustrates how the four time counters capture both communication
dependencies and the predicted communication time. In
this example we assume two units of time for latency, ten for
bandwidth, and two for the memory copy of a message into a
system buffer.
Early Sender: Figure 2a shows that P2 enters MPI_Send
at logical time 8. This MPI_Send has a predicted arrival time
of 22 (2 units of time apiece for latency and memory copy,
10 units of time for bandwidth). P1 enters MPI_Recv at logical
time 25. Therefore, P1 experiences zero wait time, latency time,
and bandwidth time.

Early Receiver: In Figure 2b, P1 enters MPI_Recv at
logical time 12 while P2 enters MPI_Send at logical time 18.
P1 spends 8 units of wait time for P2 to start sending the
message at time 20. The estimated delivery time at P1 is 32,
resulting in 10 units of bandwidth time and 2 units of latency
time experienced at P1.

Concurrent Sender/Receiver: In Figure 2c, P2 enters
MPI_Send at logical time 5 while P1 is busy computing until
time 10. This leads to 5 units of overlapped computation and
communication time. With an estimated delivery time of 19,
P1 experiences no wait time, no latency time, and 9 units
of bandwidth time (latency time is counted before bandwidth
time). In our tool, computation that overlaps communication
is considered by default to overlap first latency time and then
bandwidth time (lat-first). The alternative choice is to overlap
bandwidth time first and then latency time (bw-first), which is
also supported by the tool. Note that the default lat-first policy
can lead to applications being categorized as bandwidth-bound
when most or all of the latency is overlapped, even though
improvements to latency could improve overall application
performance. We will see how these two policies impact the
overall fast classification results in Section IV-B4.
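The following sketch shows how the three communication counters might be charged to a blocking receive under the default lat-first policy; variable names and the helper are ours, and t_send_start denotes the time the message actually begins to travel (sender entry plus memory-copy time).

```python
def charge_recv_counters(t_recv_enter, t_send_start, alpha, bw_time):
    """Split the receiver-visible time of one message into wait, latency,
    and bandwidth counters under the lat-first overlap policy."""
    arrival = t_send_start + alpha + bw_time

    if arrival <= t_recv_enter:                 # early sender: nothing exposed
        return {"wait": 0.0, "latency": 0.0, "bandwidth": 0.0}

    wait = max(0.0, t_send_start - t_recv_enter)

    # Transfer time that overlaps the receiver's computation is hidden;
    # lat-first means the hidden portion consumes latency before bandwidth.
    overlap = max(0.0, t_recv_enter - t_send_start)
    latency = max(0.0, alpha - overlap)
    bandwidth = max(0.0, bw_time - max(0.0, overlap - alpha))
    return {"wait": wait, "latency": latency, "bandwidth": bandwidth}

# Figure 2c: transfer starts at time 7 (send at 5 plus 2 units of memory copy),
# receiver computes until 10, latency = 2, bandwidth time = 10
print(charge_recv_counters(10, 7, 2, 10))  # {'wait': 0.0, 'latency': 0.0, 'bandwidth': 9.0}
```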
(a) Early Sender (b) Early Receiver (c) Concurrent Sender & Receiver
Fig. 2: Counting wait time, latency time, and bandwidth time in a point-to-point communication

Fig. 3: Counting wait time, latency time, and bandwidth time in a collective communication

Figure 3 illustrates an example for collective communication.
Different processes may start the operation at different
times. All processes that arrive before the last one must wait
until the last process arrives. After that, the modeled latency
and bandwidth time starts to count. In the figure, P1 starts at
time 3. It first waits for 15 units of time (until the last process,
P3, arrives at time 18) and then experiences 2 units of latency
time and 10 units of bandwidth time.
G. Progressing logical times for multiple network configura-
tions
With the default eager protocol for point-to-point commu-
nication, there are only two situations in which the replay
of a process can be blocked: blocking receive operations
and collective operations. In the blocking receive operation
a process must receive a logical timestamp from the matching
sending process in order to determine the logical exit time for
the blocking receive operation. In a collective operation, each
process must perform an MPI Allreduce operation to obtain
the largest logical entry time from all processes. Because the
traces are generated from a correctly executed MPI program,
a blocking receive operation always has a matching send
operation, and all collective operations are called by all par-
ticipating processes with no point-to-point routines spanning
the collective operation. In other words, all blocking point-
to-point communications issued before a collective operation
must complete before the collective operation begins. Hence,
in the trace replay, a blocking receive operation is
guaranteed to receive a timestamp from its matching send
operation without inducing deadlock. Similarly, in the replay
of a collective operation, the allreduce operation is guaranteed
to succeed, and the logical exit time of the collective operation
is guaranteed to be updated. This is independent of the network
configuration being simulated. Hence, the tool can maintain
any number of logical clocks corresponding to any number
of different network configurations and be assured that it can
communicate and update all of the logical timestamps for each
communication event in a single pass. This ability to predict
the performance for many network configurations is the key
to our fast classification tool.
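Because no replayed event ever blocks on the choice of network parameters, one replayed message can carry a whole vector of timestamps, one per candidate configuration. A minimal sketch of this idea follows (assuming mpi4py and NumPy; the configuration values and function names are illustrative, not the tool's interface).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# One (alpha, beta) pair per network configuration under study
configs = [(50e-6, 8 / 1e9), (5e-6, 8 / 10e9), (1.3e-6, 8 / 32e9)]
alphas = np.array([a for a, _ in configs])
betas = np.array([b for _, b in configs])

def replay_send(dest, t_in):
    """Ship the vector of logical entry times (one per configuration) in a
    single fixed-size message; application payloads are never replayed."""
    comm.Send(t_in, dest=dest, tag=0)

def replay_recv(src, t_in, msg_bytes):
    """A single receive updates every configuration's logical clock at once."""
    T = np.empty_like(t_in)
    comm.Recv(T, source=src, tag=0)
    return np.maximum(t_in, T + alphas + msg_bytes * betas)
```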
H. Selecting network configurations and classification

Each interconnect technology is characterized by its bandwidth
(BW) and latency (L). We denote the interconnect
technology by (BW, L). Table I lists nominal values for BW
and L for three commonly used interconnection network
technologies; we will use these values throughout our evaluation
in Section IV. In Table I, the MPI latencies of 1G and
10G Ethernet are the measured minimum latencies reported
by QLogic [21], and the latency for InfiniBand QDR is as
measured by Panda [20].

TABLE I: Interconnects in current production systems

Configurations           Latency (µs)   BW (Gbps)
1G Ethernet (E1G)              50            1
10G Ethernet (E10G)             5           10
InfiniBand QDR (QDR)          1.3           32
The tool uses the predicted performance on three sets of
network configurations, each consisting of three configurations,
to determine whether an application is bandwidth-bound,
latency-bound, communication-bound, load-imbalance-bound,
or computation-bound for the interconnection network (BW, L).

One set of configurations tests the application's sensitivity
to bandwidth at the given latency. This set consists of
configurations (BW/2, L), (BW, L), and (2BW, L). By examining
the performance trend across this range of bandwidths one
can determine whether an application's execution time is
sensitive to bandwidth changes. Moreover, even though the
communication model does not explicitly consider network
contention, bandwidth-sensitive applications will also be sensitive
to contention.

The second set of configurations tests the application's sensitivity
to latency. It consists of configurations (BW, 2L), (BW, L),
and (BW, L/2). The third set tests the application's sensitivity
to proportional changes in bandwidth and latency, consisting
of configurations (BW/2, 2L), (BW, L), and (2BW, L/2); a short
code sketch that generates these sets appears after the classification
list below. With the predicted performance on the three sets of configurations,
applications can be classified as follows:
applications can be classified as follows:
C1. Computation-bound: Observed computation time ac-
counts for a large percentage of total execution time. In
addition, a program’s predicted total execution time is
insensitive to speed-up or slow-down of bandwidth and la-
tency. The exact threshold values for the large percentage
of total time and the insensitivity are parameters decided
by the user. In Section IV, we give examples to show how
the threshold values are set.
C2. Load-imbalance-bound: The wait time accounts for a
significant percentage of the total execution time. In
addition, the wait time is insensitive to the speed-up
and slow-down of bandwidth and latency. Wait time can
be introduced by computational load imbalance, network
contention, system noise, algorithm design, etc.
C3. Bandwidth-Bound: Bandwidth-bound is defined relative to a
given bandwidth (BW). In this case, communication time
(latency time + bandwidth time + wait time) accounts for
a significant percentage of the total execution time and is
insensitive to latency changes. The communication time
decreases significantly as the bandwidth increases
across the range of BW (from BW/2 to 2BW). Again, the
exact threshold values can be set by the user.
C4. Latency-Bound: Latency-bound is defined relative to a given
latency (L). In this case, the communication time accounts
for a significant percentage of the total execution time and
is insensitive to bandwidth changes. The communication
time decreases significantly as the latency improves
across L (from 2L to L/2).
C5. Communication-bound: Communication bound is re-
lated to a given bandwidth (BW) and a given latency
(L). In this case, communication time (latency time +
bandwidth time + wait time) accounts for a significant
portion of the total execution time. The communication
time decreases significantly as the bandwidth speed in-
creases across BW, or as latency time improves across L,
or both.
The classification method can be further fine-tuned. For
example, if one knows that the communication pattern in
an application is unlikely to cause contention, then the targeted
(BW, L) can be used as the anchor for observing the performance
trend. If, however, one knows that the communication
in an application suffers significant contention, then some slower
configuration (e.g., half the bandwidth) may be used as the
anchor. In this study, we choose to consider a range of network
speeds from X/2 to 2X. The range may be extended if that
helps the classification.
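The three configuration sets can be generated mechanically from the target interconnect; a small illustrative helper (names and layout are ours) is shown below. Its output for 10G Ethernet reproduces the ENET-10G column of Table II.

```python
def sensitivity_configs(bw, lat):
    """Configuration sets used to probe an interconnect (BW, L): a bandwidth
    sweep at fixed latency, a latency sweep at fixed bandwidth, and a
    combined sweep that scales both proportionally."""
    return {
        "bandwidth": [(bw / 2, lat), (bw, lat), (2 * bw, lat)],
        "latency":   [(bw, 2 * lat), (bw, lat), (bw, lat / 2)],
        "combined":  [(bw / 2, 2 * lat), (bw, lat), (2 * bw, lat / 2)],
    }

# 10G Ethernet from Table I: 10 Gbps bandwidth, 5 microsecond latency
print(sensitivity_configs(10, 5))
```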
IV. EVALUATION
We report the results of various experiments with our fast
classification tool. We first describe our experimental setup,
specifically the set of parallel benchmarks and applications
whose performance we characterize and the definitions of
numeric thresholds we assign to application classifications C1–
C5 from the previous section. Then, we study and validate
the tool, and demonstrate the robustness of our technique.
After that, we report how our complete set of applications
and benchmarks are classified based on their performance
trends, and apply the tool to classify applications for exascale
systems. Finally, we evaluate the run-time performance of our
classification tool itself.
A. Experimental Setup
The parallel programs used in this study include extracted
kernels, mini-apps, and full-sized applications from the Department
of Energy (DOE)'s DesignForward project. The extracted
kernels used are Big FFT and Crystal Router (CR); the mini-apps
used are AMG, MiniFE, and CLAMR; and the full-sized
applications are MultiGrid (MG), AMR Boxlib, PARTISN,
and FillBoundary (FB). CLAMR3 from Los Alamos National
Laboratory (LANL) is also used. These applications represent
the diverse communication patterns observed in DOE's production
systems. More information about these programs, as well
as their DUMPI traces, is available on the DOE DesignForward project
Web site. In addition, we also use the NAS Parallel Benchmarks4
(NPB) and the mini-apps LULESH, CNS, CMC, and
Nekbone from DOE's EXMATEX5, CESAR6, and EXACT7
codesign centers. All traces used in this study except NPB
and CLAMR are publicly available on the DesignForward
project website. Traces and timing measurements of NPB and
CLAMR were generated on Cielito, a 64-node (1024-core)
Cray XE6 system at LANL.
We classify the applications on three interconnect technolo-
gies: 1G Ethernet, 10G Ethernet and InfiniBand QDR. Each
network configuration is represented by (BW, Latency), where
BW is in units of Gbps and Latency is in units of
microseconds. Based on the bandwidth/latency data shown in
Table I, 1G Ethernet is represented as (1, 50), 10G Ethernet is
represented as (10, 5), and InfiniBand QDR is represented as
(32, 1.3). Memory copy time is obtained by running memory
benchmarks on each host machine. The configurations used in
the experiments are shown in Table II.

TABLE II: Configurations (Bandwidth, Latency) for classification
(bandwidth is in units of Gbps while latency is in units of microseconds)

Sensitivity   ENET-1G       ENET-10G      QDR
Latency       (1, 25)       (10, 2.5)     (32, 0.65)
              (1, 50)       (10, 5)       (32, 1.3)
              (1, 100)      (10, 10)      (32, 2.6)
BW            (0.5, 50)     (5, 5)        (16, 1.3)
              (1, 50)       (10, 5)       (32, 1.3)
              (2, 50)       (20, 5)       (64, 1.3)
Comm          (0.5, 100)    (5, 10)       (16, 2.6)
              (1, 50)       (10, 5)       (32, 1.3)
              (2, 25)       (20, 2.5)     (64, 0.65)

3 https://github.com/losalamos/CLAMR
4 https://www.nas.nasa.gov/publications/npb.html
5 https://portal.nersc.gov/project/CAL/exmatex.htm
6 https://portal.nersc.gov/project/CAL/cesar.htm
7 https://portal.nersc.gov/project/CAL/exact.htm

Based on the prediction results of these configurations, the applications are classified as follows:

C1. Computation-bound (Comp.): Observed computation
time accounts for at least 90% of total execution time. In addition,
a program's predicted total execution time is insensitive
to the speed-up and slow-down of network bandwidth
and latency: less than 15% difference for the network
bandwidth and latency ranging from X/2 to 2X.
C2. Load-imbalance-bound (Imb.): The wait time accounts
for at least 25% of the total execution time. In addition, the wait
time is insensitive to the speed-up and slow-down of
network bandwidth and latency: less than 15% difference
for the network bandwidth and latency ranging from X/2
to 2X.
C3. Bandwidth-Bound (BW): For the targeted bandwidth,
communication time (latency time + bandwidth time +
wait time) accounts for at least 25% of the total execution
time. The communication time is insensitive to latency
change: less than 5% difference over the 4X latency change
from 2L to L/2. The communication time decreases by a
factor of 2 when the bandwidth increases from BW/2 to
2BW.
C4. Latency-Bound (Latency): Given a latency L, the communication
time accounts for at least 25% of the total
execution time. The communication time is insensitive
to bandwidth change: less than 5% difference over the 4X
bandwidth change. The communication time decreases by
a factor of 2 as the latency improves from 2L to L/2.
C5. Communication-bound (Comm.): Given BW and L,
communication time (latency time + bandwidth time +
wait time) accounts for at least 25% of the total execution time.
The communication time decreases by at least a factor
of 2 when the network bandwidth and latency speed up
from X/2 to 2X.
For some applications, the total communication accounts for
less than 25% but more than 10% of the total time. We further
classify such applications as load-imbalance-sensitive (Imb.-s),
bandwidth-sensitive (BW-s), latency-sensitive (Latency-s), and
communication-sensitive (Comm.-s), following definitions
analogous to those of the load-imbalance-bound, bandwidth-bound,
latency-bound, and communication-bound classes. A sketch of
this decision logic is given below.
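The following illustrative decision function (ours, not the tool's code) encodes the thresholds above for one target interconnect; a complete implementation would also consult the bandwidth-only and latency-only sweeps to separate the bandwidth-bound and latency-bound cases (C3 and C4).

```python
def classify(counters, total_sweep, comm_sweep, wait_sweep):
    """counters: the four logical-time counters at the anchor (BW, L).
    *_sweep: (X/2, X, 2X) values of total, communication, and wait time
    over the combined bandwidth+latency sweep."""
    total = sum(counters.values())
    comm = counters["wait"] + counters["latency"] + counters["bandwidth"]

    def improvement(sweep):
        """Relative improvement when going from the X/2 to the 2X configuration."""
        return (sweep[0] - sweep[2]) / sweep[0]

    if counters["compute"] >= 0.90 * total and improvement(total_sweep) < 0.15:
        return "computation-bound"          # C1
    if counters["wait"] >= 0.25 * total and improvement(wait_sweep) < 0.15:
        return "load-imbalance-bound"       # C2
    if comm >= 0.25 * total and improvement(comm_sweep) >= 0.5:
        return "communication-bound"        # C5
    if 0.10 * total <= comm < 0.25 * total:
        return "communication-sensitive"    # weaker '-s' variant
    return "unclassified"
```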
B. Validation
1) Eager vs. Rendezvous: We show that when accurate
latency and bandwidth information is available, our classifica-
tion tool can accurately predict application performance. 64-rank
runs of three programs, CLAMR, CR, and FB, which
represent different types of communications, are used in this study.

TABLE III: Predicted and measured communication and total
application time (in seconds) of 64-rank CLAMR, CR, and FB on Cielito

                CLAMR(64)         CR(64)            FB(64)
              eag.    eag.*     eag.    eag.*     eag.    eag.*    rend.
Act. Comp.    0.23    0.23      0.26    0.26      0.86    0.86     0.86
Pred. Comm.   0.76    0.90      0.07    0.05      0.15    0.19     0.35
Act. Comm.    0.89    0.89      0.06    0.06      0.36    0.36     0.36
Comm. Err.% -14.61   +1.12    +16.67  -16.67    -58.33  -47.22    -2.78
Pred. Tot.    0.99    1.13      0.33    0.31      1.01    1.05     1.21
Act. Tot.     1.12    1.12      0.32    0.32      1.22    1.22     1.22
Pred. Err.% -11.61   +0.89     +3.13   -3.13    -17.21  -13.93    -0.82

To obtain accurate communication performance information,
we used Intel's MPI ping-pong benchmark8 with two
communicating processes mapped onto either the same node
or separate ones, over hundreds of repetitions. We built a look-
up table that allows for accurate extrapolation of message
communication times (latency plus bandwidth times) of any
size on this machine. For comparison, the parameters in the
Hockney model for Cielito are derived using least-squares
fitting from the measured data. Memory copy speed is obtained
with a microbenchmark.
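For reference, the least-squares derivation of the Hockney parameters from ping-pong measurements can be sketched as below (NumPy-based; the measurement values shown are synthetic, not Cielito data).

```python
import numpy as np

def fit_hockney(msg_sizes, measured_times):
    """Least-squares fit of t(n) = alpha + n*beta; returns (alpha, beta)."""
    A = np.column_stack([np.ones_like(msg_sizes, dtype=float), msg_sizes])
    (alpha, beta), *_ = np.linalg.lstsq(A, measured_times, rcond=None)
    return alpha, beta

# Synthetic one-way times (seconds) consistent with a 10 Gb/s, 5 us link
sizes = np.array([1e3, 1e4, 1e5, 1e6])
times = 5e-6 + sizes * 8 / 10e9
print(fit_hockney(sizes, times))   # approximately (5e-6, 8e-10)
```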
Table III presents the predicted and measured communica-
tion time and total application time as well as the prediction
errors. Each column compares the predicted performance with
different models with the measured time. For each benchmark,
the first column (eag.) compares the predicted performance
using the default eager protocol with the Hockney model. The
second column (eag.*) compares the predicted performance us-
ing the default eager protocol with the more accurate look-up
table. For FB, there is a third column that contains prediction
results with the rendezvous protocol. With the default eager
protocol and the Hockney model, the baseline results show
that the communication prediction errors of CLAMR, CR and
FB are -14.61%, 16.67% and -58.33%, respectively, and that
the total application time prediction errors are -11.61%, 3.13%,
and -17.21%, respectively. A negative prediction error means
under-estimation while a positive prediction error means over-
estimation. With the look-up table, the communication prediction
errors of CLAMR, CR, and FB change to +1.12%, -16.67%,
and -47.22%, respectively, while the overall execution
time prediction errors change to +0.89%, -3.13%, and -13.93%. FB still shows a
significantly larger prediction error of -47.22% for communication
and -13.93% overall. This discrepancy is caused by
the fact that message sizes in FB are up to 4 MB (so the
rendezvous protocol dominates the communications in the
program) whereas message sizes in CLAMR and CR are under
4 KB. When the model for the rendezvous protocol is assumed,
the prediction accuracy improves to -2.78% for communication
time and -0.82% overall for FB.
These results indicate that although our tool does not
consider network contention, when accurate communication
information is used and modeled, the tool can predict the
performance of applications fairly accurately, which is more
8https://software.intel.com/en-us/articles/intel-mpi-benchmarks
than sufficient to classify applications and understand their
performance-limiting factors. In the rest of the experiments
involving classification we use the Hockney model. Although
this model is not as accurate as the look-up table used for
validation, it provides close approximations. Moreover, our
tool classifies applications based on their performance trend
over a range of network configurations, which alleviates the
problems caused by the inaccurate modeling. The inaccuracy
introduced by the specific MPI implementation (e.g., eager protocol
vs. rendezvous protocol) poses a greater challenge. We believe
that the use of a rendezvous protocol is an implementation
choice, and does not represent inherent communication char-
acteristics. We therefore use exclusively the eager protocol for
classification. Note that observing the performance trend with
a range of network speeds also alleviates the impact of such
inaccuracy.
2) Our tool vs. SST/Macro: Table IV compares the pre-
diction results of our tool with an established HPC applica-
tion simulation infrastructure, SST/Macro 3.0’s packet-level
simulation [23]. Some applications such as AMR, CLAMR,
and PARTISN are not included in this evaluation because
SST/Macro 3.0 is unable to handle complex MPI grouping
operations and MPI multi-threading.
For each application we model the MPI communication
using the network configuration of the host machine where
the DUMPI traces are collected and compare our prediction
to SST/Macro’s simulation results. Most traces were collected
by others on LBNL’s Hopper [15] and Edison [14] super-
computers. Traces marked by “†” were collected by us on
LANL’s Cielito testbed [26]. In our model, we set the network
bandwidth and latency, (10Gbps, 2500ns) for Cielito, (35Gbps,
2575ns) for Hopper and (24Gbps, 1300ns) for Edison. For
SST/Macro, the machine configurations of Hopper and Edison
are included in SST/Macro's default build, and we derive the
configuration for Cielito based on the machine performance
data at [26]. To compare the results of our tool and SST/Macro,
MPI Init is set as the starting event for both tools. Some traces
such as CR(10) do not contain MPI Init. For such cases, we
added an MPI Init to the beginning of the traces.
Table IV shows the communication and total time for
various applications as estimated by each of SST/Macro and
our tool. We define δ_comm and δ_total as the absolute percentage
difference between our tool's and SST/Macro's estimations of
the communication and total time, respectively. In terms of
communication time, our tool's estimate is on average lower
than SST/Macro's, with an average δ_comm of 11.31%. Our tool's
estimated total time is very close to SST/Macro's simulation
results, with an average δ_total of 4.31%. This indicates that the
modeling results from our tool are not significantly different
from those of the traditional, more detailed simulation tool.
Note that since our tool does not take network contention
into account and SST does, it should normally under-estimate
the communication time in comparison to SST. However, there
are two notable situations in which our tool can over-estimate
the communication time. The first situation is collective
communication. Our tool assumes a global synchronization
before the collective operation can start. In practice, some MPI
collective operations such as MPI_Reduce and MPI_Bcast do
not require synchronization. For such operations, our tool can
predict a larger time than SST. The other case relates
to message size and the Hockney model, in which the two
parameters α and β are estimated over all message sizes. When
the message sizes in an application are clustered, the Hockney
model can introduce systematic biases for all communications
in either direction. This is observed in the CR application, which
contains a large number of small messages.

TABLE IV: Predicted communication and total execution times

App(MPI ranks)    Communication (s)               Total (s)
                  SST     ours    δ_comm          SST     ours    δ_total
DT(512)†         18.29   16.90     7.60%         20.34   18.95     6.83%
MiniFE(1152)      4.19    3.88     7.30%         61.01   60.71     0.50%
AMG(216)          0.17    0.17     4.22%          0.83    0.84     0.84%
CR(100)           0.13    0.18    34.46%          0.35    0.39    12.89%
LULESH(512)       7.18    7.20     0.18%         48.27   48.29     0.03%
CNS(1024)        27.18   23.56    13.31%         64.29   60.67     5.63%
CMC(1024)        17.75   17.77     0.13%         55.17   55.19     0.04%
Nekbone(256)      1.55    1.19    23.25%          3.52    3.16    10.25%
Average                           11.31%                           4.31%

† = traced by us on Cielito, not by others on Edison/Hopper
3) Classification examples: We select three NAS
benchmarks—EP, IS, and DT—to demonstrate how different
programs are classified. It is well known that EP is
embarrassingly parallel; IS is communication-intensive; and
DT is load-imbalanced.
Figure 4 shows performance relevant to the classification of
the applications. For each figure, the y-axis is for the timing
results for different counters. The figures show the sensitivity
of the programs to the network speed (both bandwidth and
latency), EP(64) and IS(64) on 1G Ethernet and DT(512) on
10G Ethernet. Figure 4a shows that EP(64) is dominated by
computation, and that speeding up the bandwidth and latency
from X/2 to 2Xalmost has no impact on the total time. Thus,
EP(64) is classified as computation-bound on 1G Ethernet.
For IS(64) in Figure 4b, communication time (latency time +
bandwidth time + wait time) accounts for around 60% of the
total time on the 1G Ethernet; and the communication time
decreases by a factor of more than 3 as the network speed
(both bandwidth and latency) increases from X/2 to 2X. The
tool classifies IS(64) as communication-bound. For DT(512),
the wait time accounts for 74% of total time for 1X network.
The time is not sensitive to changes in network bandwidth
and latency: the wait time decreased by less than 3% when
the network speed increases from X/2 to 2X. DT(512) is
thus classified as load-imbalanced-bound. As can be seen from
these examples, the performance trend for the range of network
configurations can show the type of applications clearly.
(a) EP on 1G Ethernet (sensitivity to both bandwidth and latency)
(b) IS on 1G Ethernet (sensitivity to both bandwidth and latency)
(c) DT on 10G Ethernet (sensitivity to both bandwidth and latency)
Fig. 4: Classification of three applications

4) Impacts of latency-first and bandwidth-first overlapping:
When considering communication-computation overlaps, latency
time, bandwidth time, or both may be overlapped with
computation. The tool supports overlapping either latency first
or bandwidth first. This section reports the results of our
study of the impacts of these two models of communication-
computation overlap.
In general, we found that the two different models of over-
lapping exhibit only marginal differences. Table V shows the
latency time and bandwidth time changes with the two models.
As shown in the table, the change of latency and bandwidth
time counters between latency-first and bandwidth-first are
within 1.5% of the estimated total time for all applications, ex-
cept for CR(100). CR(100) exhibits a slightly higher difference
of 5.40% and 2.53% for 1G and 10G Ethernet, respectively.
TABLE V: Changes of bandwidth and latency time counters
between latency-first and bandwidth-first for 1G and 10G
Ethernet and QDR InfiniBand

App(ranks)       ENET-1G    ENET-10G   IB-QDR
CG(64)             1.38%      0.27%     0.06%
MG(64)            <0.01%     <0.01%    <0.01%
BT(64)             0.20%     <0.01%    <0.01%
SP(64)             0.99%      0.05%     0.01%
DT(512)           <0.01%     <0.01%    <0.01%
CNS(1024)         <0.01%     <0.01%    <0.01%
CR(100)            5.40%      2.53%     0.59%
LULESH(512)        0.02%     <0.01%    <0.01%
Nekbone(256)       0.04%      0.01%    <0.01%
TABLE VI: Classification results of overlapping latency-first
vs. bandwidth-first for 1G and 10G Ethernet and InfiniBand QDR

                       Latency-first                  Bandwidth-first
App(ranks)       ENET-1G  ENET-10G  IB-QDR     ENET-1G  ENET-10G  IB-QDR
BT(64)           Comm.-s  Comp.     Comp.      Comm.-s  Comp.     Comp.
SP(64)           Comm.-s  Comp.     Comp.      Comm.-s  Comp.     Comp.
CG(64)           Comm.    Comm.-s   Comp.      Comm.    Comm.-s   Comp.
MG(64)           Comm.    Comp.     Comp.      Comm.    Comp.     Comp.
DT(512)          Comm.    Imb.      Imb.       Comm.    Imb.      Imb.
CNS(1024)        Imb.     Imb.      Imb.       Imb.     Imb.      Imb.
CR(100)          Comm.    Comm.     Imb.-s     Comm.    Comm.     Imb.-s
LULESH(512)      Imb.-s   Imb.-s    Imb.-s     Imb.-s   Imb.-s    Imb.-s
Nekbone(256)     Comm.    Comm.     Imb.-s     Comm.    Comm.     Imb.-s
However, as shown in Table VI, classification results for all
programs remain the same in both cases. Communication
overlap is mostly determined by how the application
is written: in a given application, either little overlap occurs
or the whole communication (both bandwidth time and latency
time) is overlapped with computation. As
a result, latency-first or bandwidth-first makes little difference
as evidenced by the results in this experiment. We will use the
latency-first policy for the rest of this study.
C. Classification Results
Table VII summarizes the classification results of NAS bench-
marks, DOE mini-apps, extracted kernels, and production
applications for 1G Ethernet, 10G Ethernet, and InfiniBand
QDR. Traces marked by “†” are ones we collected ourselves
on Cielito, while the rest are taken from the DesignForward
Web site.
With the slow 1Gbps Ethernet, communication has
significant impacts on applications. Many programs are
communication-, latency-, or bandwidth-bound (or at least -
sensitive). As the network becomes faster, in InfiniBand QDR,
the latency is sufficiently small and the bandwidth is suffi-
ciently large such that these programs become computation-
bound or load-imbalance-bound. IS, DT, AMG, BigFFT,
CLAMR, PARTISN, LULESH, CNS, CMC, and Nekbone
are either load-imbalance-bound or -sensitive for InfiniBand
QDR. For instance, one might naively diagnose PARTISN
as having communication issues due to the large amount of
time it spends in MPI communication, but our tool can more
accurately classify it as load-imbalanced due to its 2D wavefront
communication pattern with data dependencies. In this
case, improving communication performance in the absence of
other program-configuration changes will not improve overall
application performance. Instead, improving these programs'
load-balance property is the key to improving their scalability
on modern HPC systems.

TABLE VII: Classification results of the selected NAS benchmarks,
mini-apps, extracted kernels, and production applications

App(ranks)        ENET-1G       ENET-10G    IB-QDR
BT(64)†           Comm.-s       Comp.       Comp.
EP(64)†           Comp.         Comp.       Comp.
SP(64)†           Comm.-s       Comp.       Comp.
CG(64)†           Comm.         Comm.-s     Comp.
MG(64)†           Comm.         Comp.       Comp.
FT(64)†           BW            Comp.       Comp.
IS(64)†           Comm.         Imb.-s      Imb.-s
DT(512)†          Comm.         Imb.        Imb.
AMG(216)          Imb.-s        Imb.-s      Imb.-s
BigFFT(100)       Imb./Comm.    Imb.        Imb.
BigFFT(1024)      Imb.          Imb.        Imb.
MiniFE(1152)      Comp.         Comp.       Comp.
AMR(64)           Comm.-s       Comp.       Comp.
CLAMR(64)†        Comm.         Imb.        Imb.
CR(100)           Comm.         Comm.       Imb.-s
FB(64)†           BW            Comm.-s     Imb.-s
FB(125)           BW-s          Imb.-s      Imb.-s
MG(1000)          Comm.-s       Comp.       Comp.
PARTISN(168)      Imb.          Imb.        Imb.
LULESH(512)       Imb.-s        Imb.-s      Imb.-s
CNS(1024)         Imb.          Imb.        Imb.
CMC(1024)         Imb.          Imb.        Imb.
Nekbone(256)      Comm.         Comm.       Imb.-s

† = traced by us on Cielito, not by others on Edison/Hopper

Fig. 5: Sensitivity of MiniFE(1152) to communication (bandwidth and latency) on 10G Ethernet
We can see that communication is eliminated as the
performance-limiting factor for some applications as the in-
terconnection network becomes faster. For example, CR(100)
is communication-bound on 1G Ethernet, but becomes load-
imbalance-sensitive on the faster InfiniBand QDR; FB(64)
is bandwidth-bound on 1G Ethernet, but becomes load-
imbalance-sensitive on InfiniBand QDR.
Next, we will use selected results to show how some
applications are classified. Figure 5 shows the sensitivity of
MiniFE(1152) to communication (bandwidth and latency) on
10G Ethernet. In this case, the 1X configuration is (10Gbps,
5µs); the X/2 configuration is (5Gbps, 10µs); the 2X configuration
is (20Gbps, 2.5µs). It clearly shows that this program
is dominated by computation and is not sensitive to bandwidth
or latency. As a result, MiniFE(1152) is classified as
a computation-bound program on 10G Ethernet.

Fig. 6: Sensitivity of BigFFT(100) to communication (bandwidth and latency) on InfiniBand QDR

TABLE VIII: Classification of MiniFE, CR, and AMG at exascale

App(ranks)       Current                           Exascale
                 ENET-1G   ENET-10G   IB-QDR       Exa-1x    Exa-8x
MiniFE(1152)     Comp.     Comp.      Comp.        Comp.     Comp.
CR(100)          Comm.     Comm.      Imb.-s       Imb.-s    Imb.
AMG(216)         Imb.-s    Imb.-s     Imb.-s       Imb.-s    Imb.-s
Figure 6 shows the sensitivity of BigFFT(100) to commu-
nication (bandwidth and latency) on InfiniBand QDR. We
observe that the wait time accounts for more than 25% of the
total execution time. Moreover, the wait time is insensitive
(less than 5%) to the factor of 2 speed-up and slow-down
of bandwidth and latency. This is in agreement with the
communication pattern of BigFFT(100), which consists of
a small number of all-to-all and global barrier operations.
Because of this communication pattern, a significant amount
of time is spent waiting for peers to synchronize. As such, the
program is classified as a load-imbalance-bound program for
InfiniBand QDR based on the classification criteria.
As can be seen from these experiments, by examining
the performance trends for an application on a range of
network configurations, important application performance
characteristics are revealed. The tool is effective in classifying
applications even given modeling inaccuracy for individual
networks.
D. Classifying Applications on Exascale Systems
With the exascale era approaching, we expect to see high-
throughput network technology with increased network band-
widths and reduced network latencies in both InfiniBand
and Ethernet-based networks.

(a) CR on Exa-1x (b) CR on Exa-8x
(c) AMG on Exa-1x (d) AMG on Exa-8x
Fig. 7: Sensitivity of CR(100) and AMG(216) to communication (bandwidth and latency) in Exa-1x and Exa-8x

With 200Gbps InfiniBand HDR rolled out in early 2017 [16], it is a modest estimate that
network bandwidths will reach 100GB/s (IB-NDR), and end-
to-end latencies will be on the order of 100ns, based on the
roadmap of exascale supercomputers [5], [16]. Meanwhile,
with the frequency of individual CPU cores stagnating, we
expect no dramatic improvement in computing power per core,
but rather better overlap of computation and communication
and improved algorithm design and implementation.
To classify applications for future exascale systems, we
assume that an exascale system has a network bandwidth of
400Gbps and an end-to-end latency of 0.3µs, which can be
represented as (400Gbps, 300ns). This gives a factor of 12.5
increase in bandwidth and a factor of 4.33 reduction in latency.
We consider two compute speeds: Exa-1x with f_comp = 1, when
the future system has the same compute speed as the current
system, and Exa-8x with f_comp = 1/8, when the future system is
8 times faster.
We use MiniFE(1152), AMG(216), CR(100) for this
study because they represent applications of different
performance-limiting factors. For example, on 10G Ethernet,
MiniFE(1152) is classified as computation-bound, CR(100)
as communication-bound, and AMG(216) as load-imbalance-
sensitive. Table VIII shows the classification results of the
selected applications targeted at the current generation of
interconnect technologies and the projected exascale systems
Exa-1x and Exa-8x. From Table VIII, MiniFE(1152) remains
computation-bound while CR(100) and AMG(216) become
load-imbalance-bound or -sensitive in exascale systems.
For MiniFE(1152), computation time dominates the total
execution time from 1G Ethernet to Exa-1x. In Exa-8x, the
computation time reduces by a factor of 8, yet it remains the
main performance limiting factor for such an “embarrassingly
parallel” application. Therefore, MiniFE(1152) is classified
as computation-bound even for Exa-8x. CR(100) is classified
as communication-bound in 1G Ethernet and 10G Ethernet,
but its inherent load-balancing issues start to surface when
its bandwidth and latency time are reduced significantly by
the faster InfiniBand QDR interconnect. This trend continues
through exascale as bandwidth is further increased and latency
is further decreased.
Figure 7 shows the logical time counters of wait, com-
munication (wait+bandwidth+latency), computation and total
(communication+computation) for CR(100) in Exa-1x and
Exa-8x. From Figures 7a and 7b, the wait time in CR(100)
decreases from about 75 seconds in Exa-1x to about
38 seconds in Exa-8x. However, as can be seen in the figures,
for both Exa-1x and Exa-8x, the wait time is not sensitive
to the network speed: this application will not benefit from
a faster interconnect in either exascale configuration. A similar
observation is made for AMG(216) in Figures 7c and 7d
although the wait time accounts for a smaller percentage of
the overall application time. A faster interconnect will not help
this application either.
This study shows that load-balancing will be a significant
challenge for exascale systems and applications. We note that
the applications considered are current-generation programs.
As exascale systems become available, exascale applications
will emerge that have different characteristics from current
HPC applications.

Fig. 8: Simulation speedup vs. application runtime of NPB benchmarks (class C, 64 ranks)
E. Performance of the classification tool
We study the performance of our classification tool itself
using the NPB with both 64 and 4096 ranks. The 64-rank
runs use Class C inputs while the 4096-rank runs use the larger
Class D inputs. For this study, both our tool and the original
programs are executed on Cielito with the same numbers of
ranks per node.
Because trace replay replaces arbitrarily large computation
with trivial counter updates and arbitrarily large messages
with fixed-size counter transmissions, replay runs faster than
normal execution time. This is in contrast to cycle-accurate
simulation, which runs substantially slower than normal exe-
cution time. Figure 8 shows the speed-up of simulation time
over application execution time for the NPB benchmarks with
64 ranks (16 ranks per node times 4 nodes) on Cielito. The
speed-up is computed as the total execution time (the sum of the execution times over all ranks) for an application divided by the total simulation time (the sum of the simulation times over all ranks). As can be seen from the figure, the speed-up ranges
from 3 to 45 over the application time. Figure 9 presents the
simulation time and the application execution time for 4096-
rank runs, also on Cielito. In this case we see a speed-up
ranging from 2 to 14.
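As an illustration of the metric (the numbers here are made up, not measurements): if the 64 ranks of a benchmark together accumulate 600 seconds of application execution time and the corresponding replay accumulates 40 seconds of simulation time across those ranks, the reported speed-up is 600/40 = 15.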
Figure 10 shows the simulation time for the benchmarks
when different numbers of network configurations are sim-
ulated in each run. This experiment uses the 64-rank runs.
With more network configurations, more timestamps need to
be computed and communicated among processes. We can see
that the execution time increases only slightly as the number of
network configurations increases from 16 to 256. The implica-
tion is that the extra cost of maintaining and communicating a larger number of timestamps is not a major performance-limiting factor for trace replay. Hence, it is quite feasible
for our tool to predict the performance of many network
configurations with a single pass through a communication
trace.
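To make this mechanism concrete, the following is a minimal Python sketch of how one rank's replay can keep a separate logical clock for every network configuration while all clocks honor the same happened-before ordering. It assumes a simple latency + size/bandwidth message cost and a compute scaling factor, consistent with the parameters discussed above; the class, method names, and numeric values are illustrative and are not the tool's actual implementation.

    class MultiClockReplay:
        # One logical clock per (latency, bandwidth, f_comp) configuration.
        def __init__(self, configs):
            self.configs = configs              # list of (latency_s, bandwidth_Bps, f_comp)
            self.now = [0.0] * len(configs)     # current logical time per configuration
            self.wait = [0.0] * len(configs)    # accumulated wait time per configuration

        def compute(self, measured_seconds):
            # A traced computation interval, scaled by the compute factor.
            for i, (_lat, _bw, f_comp) in enumerate(self.configs):
                self.now[i] += f_comp * measured_seconds

        def recv(self, sender_clocks, message_bytes):
            # sender_clocks: the sender's per-configuration clock values,
            # piggybacked on the fixed-size replayed message.
            for i, (lat, bw, _f) in enumerate(self.configs):
                arrival = sender_clocks[i] + lat + message_bytes / bw
                if arrival > self.now[i]:       # receiver blocked waiting for the message
                    self.wait[i] += arrival - self.now[i]
                    self.now[i] = arrival

    # Illustrative use: three hypothetical networks evaluated in a single replay pass.
    replay = MultiClockReplay([(50e-6, 0.125e9, 1.0),   # roughly 1G-Ethernet-like
                               (10e-6, 1.25e9,  1.0),   # roughly 10G-Ethernet-like
                               ( 1e-6, 5.0e9,   1.0)])  # roughly InfiniBand-QDR-like
    replay.compute(0.002)                               # 2 ms of traced computation
    replay.recv([0.0015, 0.0015, 0.0015], 1 << 20)      # receive a 1 MiB message
    print(replay.now, replay.wait)

Because each trace event only adds a loop over the configurations, the per-event work grows linearly in the number of clocks, which is consistent with the observation above that going from 16 to 256 configurations increases the replay time only slightly.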
Fig. 9: Simulation time vs. application runtime of 5 NPB
benchmarks on 4,096 ranks
Fig. 10: Simulation time of NPB benchmarks for 16, 32, 64,
128 and 256 network configurations
V. CONCLUSIONS
We have presented a fast, trace-based and communication-
centric classification tool for MPI programs. Our innovation
is to use a modified Lamport logical clock scheme that uses
non-unit computation and communication times to predict
overall application execution time. By maintaining multiple
independent logical clocks that are parameterized differently
but honor the same happened-before relationship, the tool can
predict execution time on many network configurations in
nearly the same time needed to predict execution time for a
single configuration.
This multiple-prediction capability enables new analyses to
be performed that would be too computationally expensive to
perform with traditional, one-configuration-at-a-time simula-
tion. In particular, overall application time can be attributed to
the time spent in computation, communication, and load im-
balance; and applications can be analyzed for their sensitivity
to compute speed, latency, and bandwidth.
The importance of this work is that application developers finally have the information needed to determine whether load imbalance in their codes limits performance more than interconnect speed does, or vice versa; and system architects finally
have the information needed to determine what hardware
upgrades would yield the greatest improvement in application
performance.
VI. ACKNOWLEDGEMENT
We gratefully acknowledge the support of the U.S. Depart-
ment of Energy through grant DE-SC0016039 for this work.
Los Alamos National Laboratory is operated by Los Alamos
National Security, LLC for the U.S. Department of Energy
under contract DE-AC52-06NA25396.
REFERENCES
[1] H. Adalsteinsson, S. Cranford, D. A. Evensky, J. P. Kenny, J. Mayo,
A. Pinar, and C. L. Janssen. A simulator for large-scale parallel computer
architectures. Int. J. Distrib. Syst. Technol., 1(2):57–73, Apr. 2010.
[2] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin,
and J. C. Sancho. Using performance modeling to design large-scale
systems. Computer, 42(11):42–49, 2009.
[3] C. Carothers, D. Bauer, and S. Pearce. ROSS: A high-performance, low memory, modular time warp system. In Proceedings of the Fourteenth Workshop on Parallel and Distributed Simulation (PADS 2000), pages 53–60, 2000.
[4] A. Chan, D. Ashton, R. Lusk, and W. Gropp. Jumpshot-4 users guide.
[Online; accessed 14-AUG-2015].
[5] L. Dickman. InfiniBand on the road to exascale computing. Jan 2011.
[Online; accessed 17-June-2016].
[6] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr. The Scalasca performance toolset architecture. Concurr. Comput.: Pract. Exper., 22(6):702–719, Apr. 2010.
[7] S. Girona and J. Labarta. Sensitivity of performance prediction of
message passing programs. In Proc. International Conference on Parallel and Distributed Processing Techniques and Applications, pages
620–626, 1999.
[8] R. W. Hockney. The communication challenge for MPP: Intel Paragon
and Meiko CS-2. Parallel Comput., 20(3):389–398, Mar. 1994.
[9] Intel Corporation. Introduction to Ethernet latency. 2014. [Online;
accessed 14-AUG-2015].
[10] K. Isaacs, P.-T. Bremer, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, and B. Hamann. Combing the communication hairball: Visualizing parallel execution traces using logical time. IEEE Transactions on Visualization and Computer Graphics, 20(12):2349–2358, Dec 2014.
[11] N. Jiang, D. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. Shaw, J. Kim, and W. Dally. A detailed and flexible cycle-accurate network-on-chip simulator. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 86–96, April 2013.
[12] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris. DiP: A par-
allel program development environment. In Proceedings of the Second
International Euro-Par Conference on Parallel Processing-Volume II,
Euro-Par ’96, pages 665–674, London, UK, UK, 1996. Springer-Verlag.
[13] L. Lamport. Time, clocks, and the ordering of events in a distributed
system. Communications of the ACM, 21(7):558–565, 1978.
[14] Lawrence Berkeley National Lab. Edison system interconnect. 2016.
[Online; accessed 6-Aug-2016].
[15] Lawrence Berkeley National Lab. Hopper system interconnect. 2016.
[Online; accessed 6-Aug-2016].
[16] Mellanox Technologies. Introducing 200G HDR InfiniBand solutions. Jan 2017. [Online; accessed 5-April-2017].
[17] B. Mohr and F. Wolf. KOJAK—a tool set for automatic performance
analysis of parallel applications. In Euro-Par, 2003.
[18] W. E. Nagel, A. Arnold, M. Weber, H. C. Hoppe, and K. Solchenbach.
VAMPIR: Visualization and analysis of MPI resources. In Supercom-
puter, volume 12, January 1996.
[19] M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. Sca-
laTrace: Scalable compression and replay of communication traces for
high-performance computing. J. Parallel Distrib. Comput., 69(8):696–
710, Aug. 2009.
[20] D. Panda. MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE.
2014. [Online; accessed 6-Oct-2015].
[21] QLogic Corp. Introduction to Ethernet latency: An explanation of latency and latency measurement. 2014. [Online; accessed 6-Oct-2015].
[22] Sandia National Lab. SST: The structural simulation toolkit. 2014.
[Online; accessed 14-DEC-2012].
[23] Sandia National Lab. SST/macro 3.0: User’s manual. May 2016.
[Online; accessed 14-JUN-2016].
[24] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI:
The Complete Reference, volume 1, The MPI Core. The MIT Press,
Cambridge, Massachusetts, 2nd edition, Sept. 1998.
[25] R. Thakur and W. D. Gropp. Improving the performance of MPI
collective communication on switched networks. 11/2002 2002.
[26] B. Tomlinson, J. Cerutti, R. A. Ballance, M. Vigil, J. Johnson, K. Haskel,
and R. A. Ballance. Cielo usage model. July 2012. [Online; accessed
6-Aug-2016].
[27] G. Zheng, T. Wilmarth, P. Jagadishprasad, and L. V. Kalé. Simulation-
based performance prediction for large parallel machines. Int. J. Parallel
Program., 33(2):183–207, June 2005.