Conference Paper

Cross-layer resilience using wearout aware design flow.

DOI: 10.1109/DSN.2011.5958226 Conference: Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2011), Hong Kong, China, June 27-30, 2011
Source: DBLP

ABSTRACT As process technology shrinks devices, circuits experience accelerated wearout. Monitoring wearout will be critical for improving the efficiency of error detection and correction. The most effective wearout-monitoring approach relies on continuously checking only the most critical circuit paths to detect timing degradation. However, some designs optimized for power and area efficiency exhibit a steep critical path wall: a large number of paths whose delays sit very close to the critical path. Furthermore, wearout depends on dynamic conditions, such as the processor's operating environment and the application-specific path utilization profile. The dynamic nature of wearout, coupled with steep critical path walls, may result in an excessive number of paths that need to be monitored. In this paper we propose a novel cross-layer circuit design flow that uses path timing information and runtime path utilization data to significantly enhance monitoring efficiency. The proposed methodology uses the application-specific path utilization profile to select only a few paths to be monitored for wearout. We propose and evaluate four novel algorithms for selecting the paths to be monitored. These four approaches allow designers to select the best group of paths under varying power, area, and monitoring budget constraints.
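The abstract does not spell out the four selection algorithms, so the following is only a minimal Python sketch of the general idea: given each path's timing slack, a profiled utilization count, and a per-monitor cost, greedily pick near-critical paths with the best utilization-per-cost ratio until a monitoring budget is exhausted. The Path fields, the cost model, and the greedy policy are all illustrative assumptions, not the paper's algorithms.

```python
from dataclasses import dataclass

@dataclass
class Path:
    path_id: int
    slack_ps: float      # timing slack in picoseconds; smaller = closer to critical
    utilization: int     # profiled count of how often the application exercises the path
    monitor_cost: float  # assumed area/power cost of attaching a wearout monitor

def select_paths(paths, budget, slack_threshold_ps):
    """Greedy sketch: among near-critical paths, prefer the highest
    utilization per unit of monitoring cost until the budget is spent."""
    candidates = [p for p in paths if p.slack_ps <= slack_threshold_ps]
    candidates.sort(key=lambda p: p.utilization / p.monitor_cost, reverse=True)
    selected, spent = [], 0.0
    for p in candidates:
        if spent + p.monitor_cost <= budget:
            selected.append(p)
            spent += p.monitor_cost
    return selected
```

A budget-constrained heuristic like this is one natural reading of "varying power, area and monitoring budget constraints"; the paper's four algorithms may weigh these factors quite differently.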

    ABSTRACT: The amount of charge stored in an SRAM cell shrinks rapidly with each technology generation, increasingly exposing caches to soft errors. Benchmarking the FIT rate of caches due to soft errors is critical for evaluating the relative merits of the many protection schemes being proposed. Benchmarking cache reliability introduces a unique challenge compared to internal processor storage structures, such as the load/store queue. In internal processor structures, the time a data bit resides in the structure is so short that it is generally safe to assume that no more than one soft error strike can occur; the reliability of such structures is therefore overwhelmingly dominated by single-bit errors. By contrast, a memory block may reside for millions of cycles in a last-level cache. In this case it is important to consider the impact of the spatial and temporal distribution of multiple errors within the lifetime of a cache block in the presence of error protection. This paper introduces a unified reliability benchmarking framework called PARMA (Precise Analytical Reliability Model for Architecture). PARMA is a rigorous analytical framework that accurately accounts for the distribution of multiple errors to measure the failure rate under any protection scheme. In a single simulation run, PARMA provides a precise FIT rate (expected number of failures in one billion hours) measurement for storage structures where the effect of multiple errors cannot be neglected. We have implemented the PARMA framework on top of a cycle-accurate out-of-order processor simulator (sim-outorder) to benchmark L2 cache failure rates for a set of CPU 2000 benchmarks. The effectiveness of three protection schemes is compared in terms of L2 cache FIT rate: parity, word-level Single Error Correcting Double Error Detecting (SECDED) code, and block-level SECDED. Exploiting the accuracy of PARMA, we demonstrate that current techniques to evaluate cache FIT rates in the presence of SECDED, such as accelerated fault-injection simulations and first-principle derivations based on Architectural Vulnerability Factor (AVF), can overestimate FIT rates by vast amounts. Based on the insights gained during this research, we also introduce a new approximate analytical model that can quickly and more accurately estimate the cache FIT rate in the presence of SECDED.
    Proceedings of the 2011 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2011), San Jose, CA, USA, June 7-11, 2011 (co-located with FCRC 2011).
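As a rough illustration of why multi-bit accumulation matters for long-lived cache blocks, here is a first-order, back-of-envelope Python sketch, not PARMA itself: assuming independent Poisson-distributed single-bit upsets, a SECDED-protected word fails only when two or more raw errors land on it during its residency. The parameters and the model are illustrative assumptions; PARMA accounts for effects this sketch ignores (reads mid-residency, scrubbing, spatial multi-bit upsets, vulnerability windows).

```python
import math

def word_failure_prob(fit_per_bit, bits_per_word, residency_hours):
    """P[>= 2 raw bit errors hit one SECDED word during its residency],
    under an independent Poisson single-bit-upset model (SECDED corrects
    any single error, so one strike alone never causes a failure)."""
    # Expected strikes on the word over the residency window; FIT is
    # failures per 1e9 device-hours, hence the 1e9 scaling.
    lam = fit_per_bit * bits_per_word * residency_hours / 1e9
    return 1.0 - math.exp(-lam) * (1.0 + lam)  # 1 - P[0 strikes] - P[1 strike]

def cache_fit(fit_per_bit, bits_per_word, num_words, avg_residency_hours):
    """First-order cache FIT: each word goes through ~1/residency block
    lifetimes per hour, each failing with word_failure_prob."""
    p_fail = word_failure_prob(fit_per_bit, bits_per_word, avg_residency_hours)
    failures_per_hour = num_words * p_fail / avg_residency_hours
    return failures_per_hour * 1e9  # failures per billion hours
```

For small lambda, word_failure_prob is roughly lambda^2 / 2, so the per-block failure rate grows with residency time; this is exactly why a last-level cache cannot be treated like a short-lived queue entry dominated by single-bit errors.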
    ABSTRACT: Circuit-level timing speculation has been proposed as a technique to reduce dependence on design margins and eliminate power/performance overheads. Recent work has proposed microarchitectural methods to dynamically detect and recover from timing errors in processor logic. Existing work has largely relied on statistical error models and has not evaluated the potential disparity of error rates across static instructions. In this paper, we analyze gate-level hardware models for an execution pipeline and demonstrate pronounced locality in instruction-level error rates due to value locality and data dependences. We propose timing error prediction to dynamically anticipate timing errors at the instruction level, and error padding techniques to avoid the full recovery cost of timing errors. We show that with simple prediction strategies our mechanism can eliminate, on average, 80% of the performance penalty incurred by error recovery. This allows us to alleviate some limitations of timing speculation and improves energy efficiency by 21% compared to baseline timing speculation techniques using the same dynamic adaptive tuning mechanism.
    01/2011.
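Instruction-level locality in error rates suggests a predictor indexed by the static instruction, much like a bimodal branch predictor. The Python sketch below is a hypothetical model of that idea, not the paper's mechanism: a PC-indexed table of 2-bit saturating counters predicts whether an instruction is likely to trip a timing error, so the pipeline could pad its timing up front instead of paying the full recovery penalty.

```python
class TimingErrorPredictor:
    """Hypothetical PC-indexed table of 2-bit saturating counters
    (bimodal-predictor style) for anticipating instruction-level
    timing errors; table size and update policy are assumptions."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0] * entries  # counter values 0..3

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset within a word

    def predict_error(self, pc):
        return self.table[self._index(pc)] >= 2  # weak/strong "error" states

    def update(self, pc, error_occurred):
        i = self._index(pc)
        if error_occurred:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

An instruction predicted to err would then be "padded" (for example, given extra timing margin for that operation), trading a small fixed cost against the much larger pipeline-recovery cost of an actual timing error, consistent with the abstract's error padding idea.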
