Conference Paper

Cross-layer resilience using wearout aware design flow

DOI: 10.1109/DSN.2011.5958226 Conference: Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2011, Hong Kong, China, June 27-30, 2011
Source: DBLP


As process technology shrinks devices, circuits experience accelerated wearout. Monitoring wearout will be critical for improving the efficiency of error detection and correction. The most effective wearout monitoring approach relies on continuously checking only the most critical circuit paths to detect timing degradation. However, circuits optimized for power and area efficiency can exhibit a steep critical path wall: many paths cluster near the critical delay. Furthermore, wearout depends on dynamic conditions, such as the processor's operating environment and the application-specific path utilization profile. The dynamic nature of wearout, coupled with steep critical path walls, may result in an excessive number of paths that need to be monitored. In this paper we propose a novel cross-layer circuit design flow that uses path timing information and runtime path utilization data to significantly enhance monitoring efficiency. The proposed methodology uses the application-specific path utilization profile to select only a few paths to be monitored for wearout. We propose and evaluate four novel algorithms for selecting paths to be monitored. These four approaches allow designers to select the best group of paths under varying power, area, and monitoring budget constraints.
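As a loose illustration of budget-constrained path selection (this is a hypothetical greedy heuristic, not one of the paper's four algorithms), one might rank candidate paths by profiled utilization weighted by timing criticality, per unit monitoring cost, and pick paths until the budget is spent:

```python
# Hypothetical sketch of budgeted path selection for wearout monitoring.
# The Path fields and the scoring heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    slack_ps: float      # timing slack in ps; smaller slack = closer to critical
    utilization: float   # fraction of cycles the path is exercised (from profiling)
    cost: float          # monitoring overhead (e.g., sensor area units)

def select_paths(paths, budget):
    """Greedy: rank paths by (utilization / (slack + 1)) per unit cost,
    then take paths in order until the monitoring budget is exhausted."""
    ranked = sorted(paths,
                    key=lambda p: (p.utilization / (p.slack_ps + 1.0)) / p.cost,
                    reverse=True)
    chosen, spent = [], 0.0
    for p in ranked:
        if spent + p.cost <= budget:
            chosen.append(p)
            spent += p.cost
    return chosen

paths = [
    Path("alu_carry", 5.0, 0.60, 2.0),
    Path("mul_final", 2.0, 0.10, 3.0),
    Path("lsu_tag",  20.0, 0.90, 1.0),
]
print([p.name for p in select_paths(paths, budget=4.0)])
# → ['alu_carry', 'lsu_tag']
```

The score deliberately rewards paths that are both near-critical and heavily exercised by the application, mirroring the abstract's point that static timing alone is insufficient without runtime utilization data.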

Available from:
  • Source
    ABSTRACT: Circuit-level timing speculation has been proposed as a technique to reduce dependence on design margins and eliminate power/performance overheads. Recent work has proposed microarchitectural methods to dynamically detect and recover from timing errors in processor logic. To a large extent, existing work has relied on statistical error models and has not evaluated the potential disparity of error rates at the level of static instructions. In this paper, we analyze gate-level hardware models for an execution pipeline and demonstrate pronounced locality in instruction-level error rates due to value locality and data dependences. We propose timing error prediction to dynamically anticipate timing errors at the instruction level and error padding techniques to avoid the full recovery cost of timing errors. We show that with simple prediction strategies our mechanism can eliminate, on average, 80% of the performance penalty incurred by error recovery. This allows us to alleviate some limitations of timing speculation and improve energy efficiency by 21% when compared to baseline timing speculation techniques using the same dynamic adaptive tuning mechanism.
    Preview · Article · Jan 2011
  • Source
    ABSTRACT: The amount of charge stored in an SRAM cell shrinks rapidly with each technology generation, increasingly exposing caches to soft errors. Benchmarking the FIT rate of caches due to soft errors is critical to evaluate the relative merits of the plethora of protection schemes being proposed to protect against soft errors. Benchmarking cache reliability introduces a unique challenge compared to internal processor storage structures, such as the load/store queue. In the case of internal processor structures, the time a data bit resides in the structure is so short that it is generally safe to assume that no more than one soft error strike can occur; thus the reliability of such structures is overwhelmingly dominated by single-bit errors. By contrast, a memory block may reside for millions of cycles in a last-level cache. In this case it is important to consider the impact of the spatial and temporal distribution of multiple errors within the lifetime of a cache block in the presence of error protection. This paper introduces a unified reliability benchmarking framework called PARMA (Precise Analytical Reliability Model for Architecture). PARMA is a rigorous analytical framework that accurately accounts for the distribution of multiple errors to measure the failure rate under any protection scheme. In a single simulation run, PARMA provides a precise FIT rate (expected number of failures in one billion hours) measurement for storage structures where the effect of multiple errors cannot be neglected. We have implemented the PARMA framework on top of a cycle-accurate out-of-order processor simulator (sim-outorder) to benchmark L2 cache failure rates for a set of CPU 2000 benchmarks. The effectiveness of three protection schemes is compared in terms of L2 cache FIT rate: parity, word-level Single Error Correcting Double Error Detecting (SECDED) code, and block-level SECDED. Exploiting the accuracy of PARMA, we demonstrate that current techniques to evaluate cache FIT rates in the presence of SECDED, such as accelerated fault injection simulations and first-principle derivations based on Architectural Vulnerability Factor (AVF), can overestimate FIT rates by vast amounts. Based on the insights gained during this research, we also introduce a new approximate analytical model that can quickly and more accurately estimate cache FIT rate in the presence of SECDED.
    Preview · Conference Paper · Jun 2011
  •
    ABSTRACT: This paper proposes a fault injection method for the accurate prediction of failure rates of embedded applications. The presented approach relies on Mixture Importance Sampling; hence, it is very efficient and requires far fewer samples than standard Monte Carlo. We utilize the presented injection method to link a technology-aware fault model for cache soft errors to a system-level simulation. We demonstrate this cross-layer approach by analyzing the fault tolerance of an autonomous robot. The presented approach is a step towards designing fault-tolerant embedded systems with reduced protection mechanisms to save power and area, as this requires efficient methods to predict system failure rates.
    No preview · Conference Paper · Oct 2013
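The importance-sampling idea behind the last abstract can be sketched with a toy rare-event estimate. The Gaussian setup, threshold, and single-proposal distribution below are illustrative stand-ins, not the paper's Mixture Importance Sampling over fault-injection sites:

```python
# Toy importance-sampling sketch for rare failure-rate estimation.
# Standard Monte Carlo would need ~millions of samples to see P(Z > 4);
# sampling from a proposal centered in the rare region and reweighting
# by the likelihood ratio converges with far fewer samples.
import math
import random

random.seed(1)

def failure_prob_is(n=20000, t=4.0):
    """Estimate P(Z > t) for Z ~ N(0,1) by sampling X ~ N(t,1)
    and weighting each hit by the likelihood ratio N(0,1)/N(t,1)."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(t, 1.0)                # biased proposal near the rare event
        if x > t:
            total += math.exp(t * t / 2 - t * x)  # likelihood ratio exp(t^2/2 - t*x)
    return total / n

est = failure_prob_is()
exact = 0.5 * math.erfc(4.0 / math.sqrt(2))     # P(Z > 4), about 3.2e-5
print(est, exact)
```

The same reweighting principle lets a fault-injection campaign over-sample the fault sites most likely to cause failure while still producing an unbiased failure-rate estimate.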