Conference Paper

An Analytical Model for Reliability Evaluation of NoC Architectures.

Univ. of Tehran, Tehran
DOI: 10.1109/IOLTS.2007.13 Conference: 13th IEEE International On-Line Testing Symposium (IOLTS 2007), 8-11 July 2007, Heraklion, Crete, Greece
Source: DBLP

ABSTRACT This paper proposes an analytical model to assess Reliability Factor of an NoC based System-on-Chip design. Reliability Factor is the probability that faults in the NoC infrastructure can be recovered without any effect on system functionality. The proposed method classifies switch faults of an NoC according to their impact on system functionality. Based on this classification, the contribution of each transient fault lowering the reliability of the NoC is calculated. This model can be used to decide which fault tolerant techniques cause more improvement on system reliability.

1 Bookmark
  • [Show abstract] [Hide abstract]
    ABSTRACT: Reliability is a growing fundamental challenge in the design of multiprocessor Systems-on-Chip (MPSoCs). This trend is accelerated by the increasingly adverse process variations and wearout mechanisms that result in an increased number of errors. Previously proposed fault-tolerant techniques are ad-hoc and target processors or Networks-on-Chip (NoC) separately. Because each of these two units may become a reliability bottleneck for NoC based multiprocessor SoCs, it is imperative that the reliability of SoCs be evaluated and addressed in a unified manner, as a combination of communication and computational units. Using this holistic approach, in this paper, we propose a new architecture level unified reliability evaluation methodology for MPSoCs. At the core of the reliability estimation engine lies a Monte Carlo algorithm which works with failure times for time-dependent dielectric breakdown (TDDB) and negative bias temperature instability (NBTI) modeled as Weibull distributions. To demonstrate its usefulness, we utilize the proposed methodology to explore the impact of NoC router layout on the failure time of the system running the same set of benchmarks. In addition, we investigate the failure time of the system when the NoC as the communication unit of the MPSoC is taken or not - as in previous work - into consideration. Our simulation framework can be very helpful to architecture designers, who could use it to identify architectural characteristics and to develop design techniques meant to improve system's lifetime.
    Green Computing Conference (IGCC), 2012 International; 01/2012
  • Source
    ACM/IEEE International Symposium on Networks-on-Chip; 01/2011
  • [Show abstract] [Hide abstract]
    ABSTRACT: Advances in technology scaling increasingly make Network-on-Chips (NoCs) more susceptible to failures that cause various reliability challenges. With increasing area occupied by different on-chip memories, strategies for maintaining fault-tolerance of distributed on-chip memories become a major design challenge. We propose a system-level design methodology for scalable fault-tolerance of distributed on-chip memories in NoCs. We introduce a novel reliability clustering model for fault-tolerance analysis and shared redundancy management of on-chip memory blocks. We perform extensive design space exploration applying the proposed reliability clustering on a block-redundancy fault-tolerant scheme to evaluate the tradeoffs between reliability, performance, and overheads. Evaluations on a 64-core chip multiprocessor (CMP) with an 8x8 mesh NoC show that distinct strategies of our case study may yield up to 20% improvements in performance gains and 25% improvement in energy savings across different benchmarks, and uncover interesting design configurations.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013; 01/2013