Conference Paper

Enhanced Self-Configurability and Yield in Multicore Grids

Dept. of Inf., Univ. of Piraeus, Piraeus, Greece
DOI: 10.1109/IOLTS.2009.5195986 · Conference: On-Line Testing Symposium (IOLTS 2009), Volume: 1, pp. 75-80
Source: IEEE Xplore


As we move deeper into the nanotechnology era, computer architecture is called upon to handle tremendous numbers of devices per chip with high defect densities. These trends provide new computing opportunities, but exploiting them efficiently will require a shift towards novel, highly parallel architectures. Fault-tolerance mechanisms will have to be integrated into the design to deal with the low yield of future nanofabrication processes. In this paper we consider multiprocessor grid (MPG) architectures that ensure scalability beyond hundreds of cores per chip. We study self-diagnosis and self-configuration methods at the architectural level and propose an enhanced self-configuration methodology that enables the use of a maximum percentage of the available fault-free cores in MPGs with high defect densities. We show that our approach makes all fault-free cores usable when the routers are fault-free, whereas previous work was effective only for defect densities of up to 20-25% defective cores. We also address the case of faulty routers, making almost all fault-free nodes (fault-free cores having a fault-free router) usable at very high defect densities in both the cores and the routers.
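The abstract's notion of a usable node can be illustrated with a minimal sketch. This is not the paper's self-configuration protocol. It is a plain breadth-first search under assumptions made for illustration only: a 2D grid, a single I/O port at a corner, messages that travel only through fault-free routers, and routers that keep forwarding even when their own core is defective. The names `reachable_usable_nodes`, `usability`, `core_ok`, `router_ok`, and `io_port` are hypothetical.

```python
from collections import deque


def reachable_usable_nodes(core_ok, router_ok, io_port=(0, 0)):
    """Return the usable nodes reachable from io_port.

    A node (row, col) counts as usable when both its core and its router
    are fault-free, following the abstract's definition of a fault-free
    node. Assumption (not from the paper): traffic crosses only
    fault-free routers, and a router forwards traffic even if its core
    is defective.
    """
    rows, cols = len(router_ok), len(router_ok[0])
    if not router_ok[io_port[0]][io_port[1]]:
        return set()  # the port's own router is faulty: nothing is reachable
    seen = {io_port}
    queue = deque([io_port])
    while queue:
        r, c = queue.popleft()
        # Visit the four mesh neighbours (south, north, east, west).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and router_ok[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    # Keep only the reached routers whose core is also fault-free.
    return {(r, c) for (r, c) in seen if core_ok[r][c]}


def usability(core_ok, router_ok, io_port=(0, 0)):
    """Fraction of fault-free nodes that are reachable from io_port."""
    total = sum(
        1
        for r, row in enumerate(core_ok)
        for c, ok in enumerate(row)
        if ok and router_ok[r][c]
    )
    if total == 0:
        return 0.0
    return len(reachable_usable_nodes(core_ok, router_ok, io_port)) / total
```

Under these assumptions, a grid whose routers are all fault-free always yields a usability of 1.0, whatever the number of defective cores, which matches the fault-free-router case described in the abstract. With faulty routers, some fault-free nodes can be cut off from the port, so the fraction can drop below 1.0.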


Available from: Jacques Henri Collet, May 09, 2014
  • "It should be noted that our proposed strategy is different with the solution [8] that proposed a mechanism for discovering the faultless paths between an I/O port and the faultfree cores in a MP2SoC. This centralized discovering process is piloted by the smart I/O port, that is a critical resource."
    ABSTRACT: In this paper, we present a software approach for the localization of faulty components in a 2D-mesh Network-on-Chip, targeting fault tolerance in a shared-memory MP2SoC architecture. We use a pre-existing, distributed hardware infrastructure that supports self-test and deactivation of the faulty components (routers and communication channels), which are transformed into “black holes”. We detail the software method used to localize these “black holes” and to centralize the information at a single point, where a modified global routing function can be defined. This embedded software makes extensive use of a distributed fault-tolerant configuration firmware assisted by a Distributed Cooperative Configuration Infrastructure (DCCI), which is also presented. Finally, “black hole” detection and localization coverage is evaluated.
    Conference Paper · Jun 2011
  • ABSTRACT: This session brings together specialists from the DfT, DfY and DfR domains that will address key problems together with their solutions for the 14nm node and beyond, dealing with extremely complex chips affected by high defect levels, unpredictable and heterogeneous timing behavior, circuit degradation over time, including extreme situations related to the ultimate CMOS nodes, where all processor nodes, routers and links of single-chip massively parallel tera-device processors could comprise timing faults (such as delay faults or clock skews); a large percentage of these parts are affected by catastrophic failures; all parts experience significant performance degradations over time; and new catastrophic failures occur at low MTBF.
    Article · Mar 2012
  • ABSTRACT: This paper addresses the important issue of fault tolerance in network-on-chip (NoC) and presents an on-the-field test and configuration infrastructure for a 2-D-mesh NoC, which can be used in many generic shared-memory many-core tiled architectures and MPSoCs. This paper also details all the hardware and software means needed to: 1) initialize the NoC in a clean state (self-deactivation of faulty NoC components using a controlled built-in self-test strategy) and 2) set up a distributed collaborative configuration infrastructure that can be used to make the chip autonomously determine, during its initialization, the operational degraded architecture, identify and bypass black holes. Experimental results prove that the approach is effective and lightweight in terms of additional software and hardware resources.
    Article · Jun 2014 · IEEE Transactions on Very Large Scale Integration (VLSI) Systems