ABSTRACT: Interconnection networks communicate and link together the processing units of modern high-performance computing systems. In this context, network faults have an extremely high impact since most routing algorithms have not been designed to tolerate faults. Because of this, as few as one single link failure may stall messages in the network, leading to deadlock configurations or, even worse, prevent the finalization of applications running on computing systems. In this thesis we present fault-tolerant routing policies based on concepts of adaptability and deadlock freedom, capable of serving interconnection networks affected by a large number of link failures. Two contributions are presented throughout this thesis, namely: a multipath fault-tolerant routing method, and a novel and scalable deadlock avoidance technique. The first contribution of this thesis is the adaptive multipath routing method Fault-tolerant Distributed Routing Balancing (FT-DRB). This method has been designed to exploit the communication path redundancy available in many network topologies, allowing interconnection networks to perform in the presence of a large number of faults. The second contribution is the scalable deadlock avoidance technique Non-blocking Adaptive Cycles (NAC), specifically designed for interconnection networks suffering from a large number of failures. This technique has been designed and implemented with the aim of ensuring freedom from deadlocks in the proposed fault-tolerant routing method FT-DRB.
07/2011, Degree: Ph. D., Supervisor: Dr. Daniel Franco Puntes