Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems
ABSTRACT The interconnection network communicates and links together the processing units of modern high-performance computing systems. In this context, network faults have an extremely high impact since most routing algorithms were not designed to tolerate faults. Because of this, just a single fault may stall messages in the network, preventing the finalization of applications, or may lead to deadlocked configurations.
In this paper we introduce a fault-tolerant routing method designed to tolerate a large number of dynamic permanent and non-permanent link faults. As failures appear randomly during system operation, our method provides escape paths for the stalled messages and, at the same time, avoids deadlock occurrences. Our proposal avoids faulty areas by means of multipath routing approaches, taking advantage of the communication path redundancy, as long as alternative paths are available.
Performance evaluation consists of synthetic test scenarios for proving correctness, and test scenarios based on the availability traces of real high-performance systems. Experiments show that our method allows applications to successfully complete their executions even in the presence of a large number of faults, with performance degradations below 3% for a 1024-node system with up to 200 simultaneous link failures.
Gonzalo Zarza, Diego Lugones, Daniel Franco, and Emilio Luque
Computer Architecture and Operating Systems Department,
Universitat Autònoma de Barcelona, Edifici Q, Barcelona (08193), Spain
{gonzalo.zarza, diego.lugones, daniel.franco, emilio.luque}@uab.es
Keywords: Interconnection Networks, Fault Tolerance, Adap-
tive Routing
1. Introduction
Over recent decades, the demand for computing power has
shown a steady increase. This increase
originates in the execution of a growing number of complex
and computationally intensive applications. At first,
computing power was dedicated almost exclusively to
scientific research fields. However, during the last few
years new application areas have also begun to require
larger amounts of computational power, highlighting the
necessity of high-performance computing (HPC) systems.
These emerging application areas include DNA
sequencing, weather forecasting, geological studies, etc.
At this moment, the importance of HPC systems is
undeniable since they have set a trend in modeling
the daily behavior and lifestyle of modern societies. This
is evident considering that even the simplest Google
search is based on HPC systems [1]. Indeed, given the
importance of these systems, it is essential to avoid service
interruptions, particularly in sensitive systems such as those
that involve mission-critical operations, banking and
computation-intensive applications, among others [2].
(Supported by the MEC-Spain under contract TIN2007-64974.)
Clearly, the performance of such systems is closely
related to the dependability and robustness of the fault
tolerance mechanisms on which they rely. Unfortunately,
the steady increase in complexity and number of compo-
nents of these HPC systems significantly increases failure
rates. Questions arise from the analysis of this situation
such as: how do failures affect HPC systems? What kinds
of failures appear on real systems? Are those systems able
to maintain their operation and performance standards in
spite of failure occurrences? If they are not, what should
the solution be? What are the best options to achieve fault
tolerance and system service continuity?
The mere posing of these questions highlights the im-
portance of fault tolerance, and the need to address this
problem in current HPC systems. Above all, it is critical
to ensure the successful completion of every application,
even in the presence of multiple failures. In this paper, we
address the problem of fault tolerance for interconnection
networks (INs) because network failures constitute a major
drawback for current HPC systems. Notice that just a
single fault may prevent the finalization of an application.
On account of this potentially high impact, many methods
have been designed to treat faults in INs; however,
only a few of them have been implemented in real networks.
Fault tolerance for INs is mainly based on three
approaches: component redundancy [3], network reconfig-
uration [4], [5], and fault-tolerant routing algorithms [6],
[7]. The latter approach emerges as the most interesting
and suitable option because it allows systems to reach bet-
ter performance results at lower costs. However, designing
fault-tolerant routing algorithms represents a challenging
problem.
For this reason, most fault-tolerant routing algorithms
have different kinds of constraints and their fault tolerance
capabilities are often limited. These limitations are heavily
influenced by the fault-tolerant model assumed by each
method, particularly by two model attributes: the failure
type (link, node, etc.) and the failure mode (static or
dynamic) [8].
The structure, complexity and power of fault-tolerant
routing methods are closely tied to the failure mode
attribute of the fault-tolerant model to be used. For this
reason, it is one of the most important attributes to be con-
sidered when designing fault-tolerant routing algorithms.
In a few words, if a static failure mode is assumed, all
failures are static and they are present in the network when
the system is started (or restarted). In contrast, if a dynamic
failure mode is used, failures may appear at random times
and locations during system operation.
When assuming a static failure mode, the design of routing
algorithms can be simplified because the distribution
of faults is known in advance, as in [6]. However, this
failure mode still presents a major drawback: to
avoid packet drops, each time a fault is detected the
system needs to be stopped, the information about faults
updated, and the system restarted. All these actions have
an extra cost.
In turn, routing algorithms based on the dynamic failure
mode do not need to know the location of faults in
advance, thus avoiding packet drops, and stopping and
restarting the system. However, adaptability poses a new
and extremely important problem, deadlock occurrences,
because “the presence of faults renders existing solutions
to deadlock- and livelock-free routing ineffective” [8].
Up to this moment, one of the few proposals capable
of dealing with dynamic faults is the one presented in [7].
Some drawbacks of this method are: its low scalability,
which stems from the use of virtual channels to support fully
adaptive routing; packet drops, which are allowed under certain
situations; and performance results bounded by the routing
mechanism based on a variation of the turn-model [9].
In this paper, we present a multipath routing method for
treating a large number of link failures in HPC systems.
The purpose of our method is to allow applications to
successfully complete their executions, even in the pres-
ence of several dynamic link failures. This new method is
based on the work presented in [10] but enhanced with new
capabilities and modifications to deal simultaneously with
permanent and non-permanent faults, in order to improve
the overall system performance. In addition, the resulting
method has been validated with the availability traces of
real HPC systems.
Our proposal exploits the available path redundancy of
current INs in order to treat dynamic link failures. At the
same time, the method avoids deadlock occurrences and
treats the congestion problems caused by the occurrences
of such failures. Our work is based on source-destination
path information and consists of three phases. The first
phase is responsible for on-line fault diagnosis and uses
physical level monitoring at network nodes along the
source-destination path. If a message encounters a faulty
link along the path, the second phase immediately reroutes
that message to the destination through an alternative path.
In the third and last phase, the source node is notified
about the link failure in order to disable the faulty path,
and to establish new alternative paths for the following
messages to be sent to that destination. Initially,
failures are considered and treated as non-permanent. If a
failure persists over time, its state is changed from
non-permanent to permanent.
The main contribution of this work is the ability to
maximize the use of system resources by detecting
permanent and non-permanent faults and treating them
differently. Furthermore, the method does
not require the use of virtual channels, thus allowing a significant
reduction in costs and overheads. The combination
of all these features in a single method is unique in the
context of interconnection networks for high-performance
computing systems.
The experimental evaluation of the method is based on
a set of test scenarios with up to 200 simultaneous link
failures in a 1024-node 2D torus network. In addition, real-
based test scenarios were evaluated using the availability
traces of parallel and distributed systems obtained from
the public failure data repositories CFDR [11] and FTA
[12]. Evaluation results show an average performance
degradation below 3%.
The rest of the paper is organized as follows. Section
2 describes the new method for tolerating dynamic faults,
and details its behavior. Evaluation environment, test sce-
narios and results are presented in Section 3. Finally, some
conclusions and future work are drawn in Section 4.
2. Fault-Tolerant Routing Method
This section begins with a brief introductory explanation
of the method. The configuration of alternative paths
is then explained. Finally, the
detailed behavior of the method is presented.
It is necessary to introduce some initial assumptions be-
fore describing the method. First of all, faults on network
nodes were not considered since our method is intended to
treat fail-stop link failures. In addition, the situation where
a network node is completely disconnected due to multiple
link failures has not been taken into account since, in such
circumstances, it is not possible to reach that node. These
are common assumptions in the context of fault-tolerant
routing for HPC.
Conceptually, our method is based on the state informa-
tion of source-destination paths. This information includes
latency values of the path, and the state information of
the links along the path. If there are no link failures along
the path, each message records the latency information
about the path it traverses. Once the message reaches the
destination, this node sends the latency information of the
path back to the source node, using an ACK message.
If there is at least one link failure along the source-
destination path, it is discovered when a message tries
to use the faulty link, as in the example of Fig. 1(a). In
fault tolerance theory, this first step would correspond
to the error detection phase. After this phase has been
completed, damage confinement and error recovery must
be provided. To this end, the node which discovers the
failure sends back a special ACK message, in order to
alert the source node about the failure in the path, as
shown in Fig. 1(b). This latter action corresponds to the
damage confinement phase (the second step in our fault-
tolerant routing method). At the same time, still as part
of the second step in Fig. 1(b), those messages that have
been already sent through the path where the link failure
has been discovered are rerouted towards the destination
node. This corresponds to the error recovery phase of
fault tolerance theory. As this rerouting action is intended
to be a fast and temporary response to link failures, it
may not be the optimal solution. For this reason, our
method includes a third and last step, which represents the
fault treatment and service continuity phase. At this step,
shown in Fig. 1(c), the source node disables the faulty path
and reconfigures new paths for the following messages,
in order to avoid faults, ease routing paths and improve
performance.
Once those new paths have been configured, their latency
values are recorded and then sent back from the
destination to the source node. Based on this information,
the source node is able to calculate the number
of alternative paths that must be used and to distribute
messages among them, according to the network traffic
burden. These actions are shown in Fig. 1(c), where two
alternative paths were set by the source node. The set of
alternative paths between each source-destination pair is
called multipath or metapath [13].
Using one or more alternative paths, the method is able
to avoid and/or circumvent link failures, while improving
the system performance by means of distributing and
balancing communications among these alternative paths.
The configuration and use of multipaths are explained in
subsection 2.1.
As mentioned above, when a link failure is detected by
a node along the source-destination path, that node sends
back the link failure information to the corresponding
source node by means of an ACK message. At first, source
nodes consider link failures as non-permanent. If a source
node receives multiple failure notifications regarding the
same link, that node will treat such failure as permanent.
In other words, a failure on a link will be considered as
permanent only after receiving a predefined number of
failure notifications regarding that specific link. Notice that
the number of failure notifications is a threshold, namely,
a modifiable parameter of our method. This set of actions
is summarized in the flow diagram of Fig. 2.
Upon receiving an ACK message, each source node
processes the information contained in that message. If
the message carries information about a link failure, that
information is stored in the Link Fault Information list of
the source node. If there is previous information about
the link failure, that information is updated as shown in
Fig. 2. If, after the update process, the counter of failure
notifications of that link has reached the threshold value,
its state is changed from non-permanent to permanent.
In contrast, if the ACK message carries latency infor-
mation about a source-destination path, the state of every
link along that path is changed to fault-free (their entries
are removed from the Link Fault Information list). This
is because the reception of latency information indicates
that at least one message has reached the destination node
through that path; therefore, the failures have disappeared.
Fig. 1: Example of the method behavior with 1 link failure: (a) Step 1, (b) Step 2, (c) Step 3.
Fig. 2: Reception of ACK messages.
Given this information, each source node is able to know
the state of source-destination paths through which it has
sent messages (with a fairly good accuracy). Therefore,
if there are no permanent link failures along the original
path, each new message is sent to the destination through
that path. If all the failures have disappeared, an ACK
message carrying the path latency information will be
received. In these circumstances, every link along that path
is non-faulty, thus their state will be changed to fault-free.
Otherwise, if there is at least one permanent link failure,
the source node sets one or more new paths and sends the
message through them.
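The ACK handling and path choice described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and method names are hypothetical, links are identified by the (node ID, port ID) pair carried in the failure ACK, and the threshold value of 3 is only an assumed setting for the configurable parameter.

```python
from dataclasses import dataclass, field

# A link is identified by the fault location carried in the ACK: (node ID, port ID).
Link = tuple[int, int]


@dataclass
class SourceNode:
    # Assumed value; the text only states that the threshold is configurable.
    threshold: int = 3
    # Link Fault Information list: link -> number of failure notifications.
    fault_info: dict[Link, int] = field(default_factory=dict)

    def on_fault_ack(self, link: Link) -> None:
        """Store or update the failure notification counter of a link."""
        self.fault_info[link] = self.fault_info.get(link, 0) + 1

    def on_latency_ack(self, path_links: list[Link]) -> None:
        """A latency ACK proves the path works: mark all its links fault-free."""
        for link in path_links:
            self.fault_info.pop(link, None)

    def state(self, link: Link) -> str:
        count = self.fault_info.get(link, 0)
        if count == 0:
            return "fault-free"
        return "permanent" if count >= self.threshold else "non-permanent"

    def use_original_path(self, path_links: list[Link]) -> bool:
        """The original path is kept unless it contains a permanent failure."""
        return all(self.state(link) != "permanent" for link in path_links)
```

With a threshold of 2, for instance, the first notification about a link leaves it non-permanent and the original path in use, the second one marks it as permanent and forces new paths, and a later latency ACK for that path removes the entry again.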
2.1 Configuration of Alternative Paths
Alternative paths are created using some intermediate
destination nodes. Those nodes have two different pur-
poses. One of these purposes is to circumvent dynamic
faults on-the-fly, as shown in Fig. 1(b). The second purpose
is to allow the configuration of alternative segmented
source-destination paths to avoid faulty areas, as in Fig.
1(c). In this last approach, intermediate nodes are used as
scattering and gathering areas from source and destination
nodes.
Intermediate nodes are chosen according to their distance
to the node that has detected the fault (to circumvent
faults on-the-fly) or to the source and destination
nodes (to avoid faulty areas), as appropriate. The nodes at
1-hop distance are considered first, then nodes at 2-hop
distance, etc.
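The distance-based ordering of candidate intermediate nodes can be sketched as follows for a k x k 2D torus. This is only an illustration under stated assumptions: the function names and the (x, y) coordinate representation are hypothetical, and the order among nodes at the same distance is arbitrary here because the text does not specify a tie-breaking rule.

```python
def torus_distance(a, b, k):
    """Minimal hop count between nodes a and b on a k x k torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Wraparound links allow going either way around each ring.
    return min(dx, k - dx) + min(dy, k - dy)


def candidates_by_distance(ref, k, max_hops):
    """Nodes within max_hops of ref, ordered 1-hop first, then 2-hop, etc."""
    nodes = [(x, y) for x in range(k) for y in range(k) if (x, y) != ref]
    near = [n for n in nodes if torus_distance(ref, n, k) <= max_hops]
    return sorted(near, key=lambda n: torus_distance(ref, n, k))
```

On a 32x32 torus, node (0, 0) has four candidates at 1-hop distance and eight at 2-hop distance, so `candidates_by_distance((0, 0), 32, 2)` returns twelve nodes with the four direct neighbors first.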
In this work, the number of intermediate nodes that
can be used is not limited, so a path can be
segmented several times in order to avoid link failures.
Segmented paths are called multistep paths (MSPs), and
use minimal static routing based on dimension-order routing
(DOR) at each segment. By using this predefined
minimal routing, our method avoids the updating of
routing tables and simplifies the network design.
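A multistep path can be pictured as a concatenation of DOR segments through the chosen intermediate nodes. The sketch below assumes a k x k torus, X-before-Y dimension order, and a fixed tie-break when both ring directions are equally short; none of these details are stated in the text, and all function names are hypothetical.

```python
def step_toward(cur, dst, k):
    """One minimal step along a ring of size k (ties go in the increasing direction)."""
    if cur == dst:
        return cur
    forward = (dst - cur) % k
    return (cur + 1) % k if forward <= k - forward else (cur - 1) % k


def dor_path(src, dst, k):
    """Dimension-order route on a k x k torus: correct X first, then Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x = step_toward(x, dst[0], k)
        path.append((x, y))
    while y != dst[1]:
        y = step_toward(y, dst[1], k)
        path.append((x, y))
    return path


def multistep_path(src, dst, intermediates, k):
    """Concatenate DOR segments src -> i1 -> ... -> dst into one MSP."""
    waypoints = [src, *intermediates, dst]
    path = [src]
    for a, b in zip(waypoints, waypoints[1:]):
        path.extend(dor_path(a, b, k)[1:])  # skip the repeated segment start
    return path
```

For example, on a 4x4 torus the direct DOR route from (0, 0) to (2, 0) uses the links (0, 0)-(1, 0) and (1, 0)-(2, 0). Routing through the intermediate node (0, 1) yields (0, 0), (0, 1), (1, 1), (2, 1), (2, 0), which avoids both links without any routing table update.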
By using MSPs, our routing algorithm is able to change
the routing direction several times in order to avoid fail-
ures. Therefore, deadlock freedom becomes a key issue.
To avoid deadlock occurrences, our method implements
the deadlock avoidance technique presented in [14]. This
technique avoids the saturation of output buffers by
adding a small Deadlock Avoidance Queue, and
applying a simple set of actions when accessing output
buffers with limited free space to avoid cyclic dependencies.
The size of the deadlock avoidance queue is one slot
per router input link.
2.2 Detailed Method Behavior
The behavior of the method, including all its functionalities,
can be seen in Fig. 3. The behavior diagram shown
in this figure consists of four main blocks: Source endnode;
Message routing; ACK routing; and Destination endnode.
The source and destination endnode blocks contain the
actions implemented at the source and destination nodes,
respectively, while the message and ACK routing blocks represent
actions carried out by the routers along source-destination
paths. Each block is composed of several
elements (stage boxes and decision elements), where the
colorless elements represent the set of actions performed
in the absence of failures, and the colored ones correspond
to the additional features for fault tolerance and congestion
control.
When a source node injects a message in the inter-
connection network, it traverses a set of routers before
reaching the destination node. Two monitoring actions are
conducted at each router along the source-destination path:
link state and traffic load monitoring. Link state monitor-
ing is performed directly over router physical channels,
while traffic load monitoring is accomplished by the router
over the message. These actions are represented in Fig. 3
by the two decision elements in the Message routing block.
If a message tries to use a faulty link, two actions
are triggered. As a first step, the message is rerouted
to its destination through an escape path (Escape Path
Selection element in the Message routing block). At the
same time, the ACK Injection (faulty link) element sends
back a special ACK message to the source node. This
ACK message is sent by means of the ACK routing block
and carries information about the fault location (node ID
+ port ID), in order to inform the source node about the
link failure and to avoid the use of the faulty path. This
action makes sense since there is at least one faulty link in
the original path and it should not be used by the source
node to send new messages. Those triggered actions were
previously illustrated in Fig. 1(b).
From this path state information, the Multipath Config-
uration and the Multistep Path Selection elements in the
Source endnode block choose a MSP for each message,
avoiding the use of faulty paths.
If there are no faults in the source-destination path, each
message registers and transports the accumulated latency
information about the path it traverses, by means of the
Latency Accumulation element in Message routing block.
When the message reaches the destination node, and the
path is fault-free, the accumulated latency value is obtained
from the packet and sent back to the source node by means
of an ACK message (colored elements in the Destination
endnode block) in order to notify the source node about
the network traffic burden.
Those two kinds of ACK messages have higher priority
in the routing unit, and they rely on the same fault
tolerance mechanism as the rest of the messages, as shown
in the ACK routing block in Fig. 3. Furthermore, their sizes
are very small compared with the data messages of current
applications because they only transport control information: a
latency value or failure information. Notice that only one
of those ACK messages is sent for each data message, as
appropriate.
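The per-hop actions and the two kinds of ACK messages can be sketched as follows. This is a simplified illustration, not the simulated router model: the `Message` and `Ack` types, the `rerouted` flag used to decide which ACK is sent, and the way the escape port and queueing delay are passed in are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    src: int
    dst: int
    latency: float = 0.0    # latency accumulated along the path
    rerouted: bool = False  # assumed flag: set once a faulty link was met


@dataclass
class Ack:
    to: int
    kind: str        # "latency" or "fault"
    payload: object  # latency value, or fault location (node ID, port ID)


def route_at_router(msg, node_id, port, link_up, queue_delay, escape_port):
    """One hop: return (output port, ACK to inject or None)."""
    if link_up[(node_id, port)]:
        msg.latency += queue_delay  # Latency Accumulation
        return port, None           # Message Forward
    # Faulty link: Escape Path Selection plus ACK Injection (faulty link).
    msg.rerouted = True
    return escape_port, Ack(to=msg.src, kind="fault", payload=(node_id, port))


def deliver(msg):
    """Destination endnode: send a latency ACK only for a fault-free path."""
    if msg.rerouted:
        return None  # the failure ACK has already been sent for this message
    return Ack(to=msg.src, kind="latency", payload=msg.latency)
```

In this sketch each data message produces exactly one ACK: a failure ACK from the router that found the faulty link, or a latency ACK from the destination when the whole path was fault-free.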
Using the set of collected latencies and the failure
information, the number of alternative paths needed for a
specific source-destination pair is determined at the Mul-
tipath Configuration element. This is done based on the
actions shown in the flow diagram of Fig. 2. The complete
failure information (location and state) is stored in the Link
Fault Information element at the Source endnode block.
By means of these actions the method avoids the use of
faulty paths and fairly distributes the communication load
over the multipath. The communication load distribution
is accomplished by selecting the appropriate MSPs at the
Multistep Path Selection element.
The outcome of this phase is then used by the source
node to distribute the load among all the MSPs on the basis
of their latency. The path with the lowest latency is used most
frequently, and messages are distributed over the
MSPs according to their relative latency values.
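One simple way to realize such a latency-based distribution is to select each MSP with a probability inversely proportional to its latency. The text does not give the exact weighting rule, so the inverse-latency weights and the function names below are assumptions used only to illustrate the idea.

```python
import random


def path_weights(latencies):
    """Selection probability per path, inversely proportional to its latency."""
    inverse = [1.0 / lat for lat in latencies]
    total = sum(inverse)
    return [w / total for w in inverse]


def choose_path(paths, latencies, rng=random):
    """Pick the MSP for the next message according to the latency weights."""
    return rng.choices(paths, weights=path_weights(latencies), k=1)[0]
```

With latencies of 10, 20 and 20 time units, the weights are 0.5, 0.25 and 0.25, so the lowest-latency path carries about half of the messages while the other two paths share the rest.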
Fig. 3: Behavior diagram of the method (blocks: Source endnode, Message routing, ACK routing, and Destination endnode, connected through the Interconnection Network).
The set of actions at node level of our method have
low overheads because they are simple (comparisons
and accumulations), locally performed, and do not delay
send/receive primitives. As shown in Fig. 3, the message is
forwarded with no overheads when the output link is non-
faulty. The escape path mechanism is invoked only when
faults are detected, and latency updates are performed
while messages are waiting in the queue. Hence, these
operations are performed concurrently with packet delivery.
Furthermore, interconnection networks are usually not
designed to continuously operate at their saturation point,
thus small overheads can be tolerated to avoid faults (if
necessary).
Our method relies on physical level information about
link state. This information is already available on almost
all modern network devices. Current devices test
and control their ports and links by means of physical
parameters such as potential difference, impedance, etc.
For example, the InfiniBand architecture offers four link
states: LinkDown, LinkInitialize, LinkArm and LinkActive
[15]. Even the simplest Ethernet router makes available
the link state information.
3. Performance Evaluation
This section describes the test scenarios used to evaluate
our method and provides the explanation of experimental
results.
The simulation environment is provided by the com-
mercial modeling and simulation tool OPNET Modeler
[16]. This tool gives support for modeling communication
networks, and allows the injection of failures in model
components. The behavior of network devices is defined
through a Finite State Machine approach, which supports
detailed specification of protocols, resources, applications,
algorithms, and queuing policies. All the actions and
functionalities of our proposal have been modeled using
this tool.
Experimentation is based on direct networks, specifically
on the 2D torus, chosen mainly due to its multiple
alternative paths between nodes and its current popularity
(6 of the first 10 HPC systems in the Top500 List use torus
topologies [17]). The network was modeled based on
interconnection elements, connected among them through
links, and endnodes that provide the interface to connect
processing nodes to the network.
The simulations were conducted for a 1024-node network
arranged in a 32x32 torus topology. Several standard
packet sizes with a constant packet injection rate were
used. Link bandwidth was set to 1 Gbps, and router buffers
to 2 MB.
Two different approaches were used to evaluate the
method from different points of view:
1) Performance evaluation of the method for permanent
and non-permanent faults when dealing with some
standard communication patterns.
2) Evaluation of real-based scenarios using availability
traces of current HPC systems.
Standard communication patterns were used because
they are commonly applied in computationally intensive
scientific applications [18]. In this approach, specific
communication patterns between pairs of nodes were
used to model the behavior of real applications. These
communication patterns are: Uniform, Matrix Transpose,
and Complement.
The evaluation of real scenarios is based on failure
traces of real systems belonging to the Los Alamos Na-
tional Laboratory (LANL) [19] and the Pacific Northwest
National Laboratory (PNNL) [20]. These failure traces
were obtained from the public failure data repositories
CFDR [11] and FTA [12]. Four HPC systems were chosen
to be simulated, taking into account their number of nodes
and network failures (detailed in Table 1).
Table 1: Characteristics of the HPC systems.
Machine      Nodes   Procs.   Net Faults   Trace duration
LANL 12        512     1024           52   09/2003-11/2005
LANL 18       1024     4096           62   05/2002-11/2005
LANL 19       1024     4096           58   10/2002-11/2005
PNNL MPP2      980     1960           89   11/2003-09/2007
The method evaluation was conducted in two steps.
First, each scenario was simulated thirty times with no
link failures. Second, up to 200 failures were injected in the
scenarios used in the first step (for each approach). Finally,
performance degradation was measured as the difference
between latency values obtained from the faulty and fault-free
scenarios. As the aim of these experiments is to
evaluate the functionality of the method (and congestion
problems caused by the occurrence of failures), the simulations
were conducted using moderate traffic loads.
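The degradation metric can be written down as a short Python sketch. Since the results are reported as percentages, the sketch assumes that the latency difference is taken between the mean latencies of the repeated runs and normalized by the fault-free mean; the text does not state this normalization explicitly, and the function name is hypothetical.

```python
from statistics import mean


def degradation_percent(fault_free_latencies, faulty_latencies):
    """Relative increase (in %) of mean latency caused by the injected failures."""
    base = mean(fault_free_latencies)  # mean over the fault-free runs
    return 100.0 * (mean(faulty_latencies) - base) / base
```

For instance, fault-free runs averaging 100 time units and faulty runs averaging 103 time units give a degradation of 3%.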
Test scenarios and evaluation results are detailed in
subsections 3.1 and 3.2. When evaluating standard communication
patterns, faults were randomly injected on
network links with equal probability. The numbers of link
failures used in those scenarios were 50, 100, 150, and 200.
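To put these failure counts in proportion, the following sketch computes the fraction of failed links in the simulated network. It assumes that each node-to-node connection of the 32x32 torus is counted once as a bidirectional link, which gives 2 x 32 x 32 = 2048 links; the function names are hypothetical.

```python
def torus_links(k):
    """Bidirectional links in a k x k 2D torus (k > 2): one +X and one +Y link per node."""
    return 2 * k * k


def failed_link_percent(failures, k):
    """Percentage of torus links affected by the given number of link failures."""
    return round(100 * failures / torus_links(k), 1)
```

Under this counting, 50, 100, 150 and 200 failures correspond to about 2.4%, 4.9%, 7.3% and 9.8% of the links, which is consistent with the figure of up to 10% of network links failed given in the conclusions.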
On the other hand, the number and lengths of link
failures in the test scenarios based on information of real
systems vary according to the failure traces of such
systems (see Table 1). Notice that failure lengths have
been normalized to the simulation time. Since the PNNL
MPP2 trace provides no information about failure
length, its failures were all considered as permanent. However,
as the information about the length of each fault is available
for the LANL systems, their failures were considered as
non-permanent.
As most current routing approaches only treat static
failures, it is not possible to conduct realistic performance
comparisons against them. One of the few proposals
dealing with dynamic failures was presented in [7] and
[21]. According to their performance evaluation graphs,
their method achieves average latency values of about
82%. When applying the same evaluation conditions (3
link failures in a 16x16 torus network with 90% of traffic
load), the average latency value of our method is about
95%. Our method obtains an improvement of about 70%
because the method presented in [7] is based on a variation
of the turn-model routing, which uses non-faulty paths but
lacks a suitable load balancing technique [9].
3.1 Results of Communication Patterns
Evaluation results of the Uniform communication pat-
tern are shown in Fig. 4 while results of the Complement
and the Matrix Transpose patterns are shown in Fig. 5.
There are two important points to emphasize about those
results. First, the method achieves low degradation,
below 5% in the worst case (Uniform
traffic in Fig. 4). Second, performance degradation is
gradual, avoiding abrupt transitions in performance and
average latency of the interconnection network.
Fig. 4: Evaluation results of the uniform comm. pattern.
Fig. 5: Evaluation results of other standard comm. patterns.
The promising results obtained from the evaluation
of these three communication patterns are based on the
use of alternative paths and an effective and successful
distribution of communication load through those paths.
Performance improvements achieved by the differential
treatment of permanent and non-permanent failures can
be clearly seen in Fig. 4. These improvements have been
achieved through a better utilization of available resources
over time, reaching up to 2%. Notice that this improvement
of 2% represents an improvement of 40% relative to the
method we have presented in [10], where all failures
were considered as permanent. Results of the Complement
traffic pattern (Fig. 5) are also in line with this situation.
In contrast, as explained below, results of the Matrix
Transpose pattern are slightly different.
In the evaluation of the Matrix Transpose pattern,
there was no congestion in the absence of failures, so
additional alternative paths were not needed. However,
the occurrence of failures reduces the bisection width of
the network and increases the utilization of other links,
generating congestion problems. In the presence of fail-
ures, several additional alternative paths were established
automatically between some source-destination pairs, in
order to address congestion problems. In particular, for the
Matrix Transpose pattern, alternative paths present lower
latency values compared with the original paths (due to a
lower overlapping level between paths). Therefore, when
non-permanent failures disappear, the alternative paths are
no longer used. For this reason, performance is higher for
permanent failures, giving rise to the values in Fig. 5.
3.2 Results of Scenarios Based on Real Faults
The performance results of test scenarios based on the
four systems listed in Table 1 are shown in Fig. 6.
The impact of differences in the number and the dura-
tion of failures can be clearly seen in Fig. 6, especially
for the PNNL MPP2 machine. The degradation is higher
because the PNNL MPP2 machine presents the highest
number of failures and those failures were considered as
permanent (unlike the other three scenarios in Fig. 6).
The average performance degradation for the set of real-
based scenarios is less than 2%. In the worst case, when
considering all the failures as permanent, the performance
degradation is about 3%. In contrast, when failures were
considered as non-permanent, the degradation was below
1%, as for the LANL machines.
4. Conclusions
In this paper, we have proposed a fault-tolerant routing
method, designed to deal with a large number of perma-
nent and non-permanent dynamic link failures in high-
performance computing systems. The method is based on a
multipath routing approach and it does not degrade system
performance in the absence of failures.
Some common communication patterns together with
failure information obtained from real systems were used
to evaluate the method. Evaluation results show perfor-
mance degradation below 3% for several test scenarios
with up to 10% of network links failed. From these results
we can conclude that, as long as fault-free paths are
available between any source-destination pair, our proposal
is able to allow applications to successfully complete their
execution.
Future work includes adapting the method
to other network topologies such as k-ary n-trees. We also
plan to study the influence of failures when the network is
operated near its saturation point, in order to tune the
congestion control mechanism of our method to this situation.
Acknowledgment
We would like to thank the Computing, Communications,
and Networking Division at Los Alamos National Laboratory
(LANL) for providing us with the real-systems failure data, and
specifically Gary Grider, Laura Davey and James Nunez for
their efforts and help.
Also, we would like to thank Evan Felix and David Brown
from the Pacific Northwest National Laboratory (PNNL) for
collecting the data and sharing it. The data was collected and
made available using the MSC Facility in the William R. Wiley
Environmental Molecular Sciences Laboratory (sponsored by the
U.S. Department of Energy’s Office of Biological and Environ-
mental Research).
Finally, we thank OPNET Technologies, Inc. for providing us
with the modeler licenses to perform the experimental evaluation of
our work.
Fig. 6: Results of real-based scenarios for all the machines.
References
[1] L. Barroso, J. Dean, and U. Holzle, “Web search for a planet: The
google cluster architecture,” Micro, IEEE, vol. 23, no. 2, pp. 22–28,
March-April 2003.
[2] M. Abd-El-Barr, Design and analysis of reliable and fault-tolerant
computer systems. London, UK: Imperial College Press, 2007.
[3] F. Sem-Jacobsen, T. Skeie, O. Lysne, et al., “Siamese-twin: A
dynamically fault-tolerant fat-tree,” in Intl. Parallel and Distributed
Processing Symp. (IPDPS 2005), April 2005, p. 100b.
[4] O. Lysne, J. Montanana, J. Flich, J. Duato, T. Pinkston, and
T. Skeie, “An efficient and deadlock-free network reconfiguration
protocol,” Computers, IEEE Transactions on, vol. 57, no. 6, pp.
762–779, June 2008.
[5] V. Puente and J. A. Gregorio, “Immucube: Scalable fault-tolerant
routing for k-ary n-cube networks,” IEEE Transactions on Parallel
and Distributed Systems, vol. 18, no. 6, pp. 776–788, 2007.
[6] C. Gómez, M. E. Gómez, P. López, and J. Duato, “An efficient fault-
tolerant routing methodology for fat-tree interconnection networks,”
in ISPA, ser. LNCS, vol. 4742. Springer, 2007, pp. 509–522.
[7] N. A. Nordbotten and T. Skeie, “A routing methodology for
dynamic fault tolerance in meshes and tori,” in International
Conference on High Performance Computing, ser. LNCS 4873,
2007, pp. 514–527.
[8] J. Duato et al., Interconnection networks. An Engineering
Approach. Morgan Kaufmann, 2003, ch. 6, pp. 287–357.
[9] W. J. Dally and B. Towles, Principles and practices of
interconnection networks. Morgan Kaufmann Publishers, 2004.
[10] G. Zarza, D. Lugones, D. Franco, and E. Luque, “FT-DRB: A
method for tolerating dynamic faults in high-speed interconnection
networks,” in 18th Euromicro International Conference on Parallel,
Distributed and Network-Based Computing, Feb 2010, pp. 77–84.
[11] USENIX, “The computer failure data repository (CFDR),”
http://cfdr.usenix.org/, Feb 2010.
[12] FTA, “Failure Trace Archive,” http://fta.inria.fr/, Feb 2010.
[13] D. Franco, I. Garcés, and E. Luque, “Distributed routing
balancing for interconnection network communication,” in International
Conference On High Performance Computing, 1998, pp. 253–261.
[14] G. Zarza, D. Lugones, D. Franco, and E. Luque, “Deadlock
avoidance for interconnection networks with multiple dynamic faults,” in
18th Euromicro International Conference on Parallel, Distributed
and Network-Based Computing, Feb 2010, pp. 276–280.
[15] InfiniBand Trade Association, InfiniBand architecture specification:
release 1.2. InfiniBand Trade Association, 2004, vol. 1.
[16] OPNET Technologies, “Opnet modeler accelerating network
R&D,” Feb 2010. [Online]. Available: http://www.opnet.com/
[17] TOP500 Supercomputing Site, “TOP500 List,” November 2009.
[18] J. Duato et al., Interconnection networks. An Engineering
Approach. Morgan Kaufmann, 2003, ch. 9, pp. 475–558.
[19] LANL, “Los Alamos National Laboratory,” http://www.lanl.gov/.
[20] PNNL, “Pacific Northwest National Laboratory,” http://pnnl.gov/.
[21] N. A. Nordbotten, “Fault-tolerant routing in interconnection
networks,” Ph.D. dissertation, University of Oslo, 2008.