Conference Proceeding

Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems

07/2010; pp. 144-150. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2010), Las Vegas, Nevada, USA, July 12-15, 2010, Volume 2 (of 2)
Source: DBLP

ABSTRACT The interconnection network links the processing units of modern high-performance computing systems and carries the communication between them. In this context, network faults have an extremely high impact, since most routing algorithms were not designed to tolerate them. As a result, even a single fault may stall messages in the network, preventing applications from completing, or may lead to deadlocked configurations.
In this paper we introduce a fault-tolerant routing method designed to tolerate a large number of dynamic permanent and non-permanent link faults. As failures appear randomly during system operation, our method provides escape paths for stalled messages and, at the same time, avoids deadlock. As long as alternative paths are available, our proposal routes around faulty areas by means of multipath routing, taking advantage of communication path redundancy.
The performance evaluation consists of synthetic test scenarios that demonstrate correctness and test scenarios based on availability traces of real high-performance systems. Experiments show that our method allows applications to complete successfully even in the presence of a large number of faults, with performance degradation below 3% for a 1024-node system with up to 200 simultaneous link failures.
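
As a rough illustration of the path-redundancy idea in the abstract, the sketch below searches for an alternative route around failed links. It is not the routing method of the paper: the 2D mesh topology, the breadth-first search, and the names `mesh_neighbors` and `find_path` are all assumptions made for this example, and the sketch does not address deadlock avoidance.

```python
from collections import deque


def mesh_neighbors(node, width, height):
    """Yield the neighbors of `node` in a width x height 2D mesh (no wraparound)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def find_path(src, dst, width, height, faulty_links):
    """Return a shortest path from src to dst that avoids faulty links, or None.

    Hypothetical helper: a plain breadth-first search, used here only to show
    that redundant paths let a message bypass failed links.
    """
    # Links are treated as bidirectional, so each is stored as an unordered pair.
    faulty = {frozenset(link) for link in faulty_links}
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in mesh_neighbors(node, width, height):
            if nxt not in parent and frozenset((node, nxt)) not in faulty:
                parent[nxt] = node
                queue.append(nxt)
    # No alternative path exists: the destination is cut off by the faults.
    return None
```

For example, on a 3 x 3 mesh, `find_path((0, 0), (2, 0), 3, 3, [((0, 0), (1, 0))])` returns a detour through the second row, while marking both links of node `(0, 0)` as faulty makes the function return `None`, which matches the abstract's condition that alternative paths must be available.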

Keywords

200 simultaneous link failures
availability traces
avoids deadlock occurrences
dynamic permanent
interconnection network communicates
modern high-performance
processing units
real high-performance systems
synthetic test scenarios
test scenarios