Cross-layer Resilience Using Wearout Aware Design Flow
Bardia Zandian, Murali Annavaram
Electrical Engineering Department, University of Southern California
Los Angeles, CA
Abstract—As process technology shrinks devices, circuits
experience accelerated wearout. Monitoring wearout will be
critical for improving the efficiency of error detection and
correction. The most effective wearout monitoring approach
relies on continuously checking only the most critical circuit
paths to detect timing degradation. However, circuits
optimized for power and area efficiency have a steep critical
path wall in some designs. Furthermore, wearout depends on
dynamic conditions, such as the processor’s operating
environment and the application-specific path utilization profile.
The dynamic nature of wearout, coupled with steep critical
path walls, may result in an excessive number of paths that need
be monitored. In this paper we propose a novel cross-layer
circuit design flow that uses path timing information and
runtime path utilization data to significantly enhance
monitoring efficiency. The proposed methodology uses the
application-specific path utilization profile to select only a few
paths to be monitored for wearout. We propose and evaluate
four novel algorithms for selecting paths to be monitored.
These four approaches allow designers to select the best group
of paths under varying power, area, and monitoring budgets.
Keywords-Wearout; Timing margin; Cross-layer design
I. INTRODUCTION
As devices scale to nanometer dimensions, a processor’s
lifetime reliability is reduced due to increased stress factors
such as higher current density, electric field, and operation
temperature. Reliability degradation manifests in the
form of many electro-physical phenomena such as
Electromigration, Time Dependent Dielectric Breakdown
(TDDB), Hot Carrier Injection (HCI), and Negative Bias
Temperature Instability (NBTI). The net result of these
phenomena is gradual timing degradation and eventual
breakdown of circuits [3-5], which is referred to as wearout
or aging. Designers estimate the expected wearout during the
lifetime of a processor and use guardbands to proactively
reduce the clock frequency (and increase supply voltage) to
account for worst-case wearout. But wearout prediction is
becoming increasingly challenging as process variations lead
to random device characteristics both within and across
chips. Dynamically changing environmental conditions and
path utilization further exacerbate the problem of wearout
estimation. While these uncertainties existed even before, the
severity of their impact is increased as devices are scaled [6, 7].
There are different models which explain the rate of wearout
and its dependence on different static and dynamic parameters.
All these models and physical device level experiments show a
gradual degradation which happens over long time periods [3-5].
Given that wearout occurs at a glacial time scale compared to
processor cycle time, it is best to monitor wearout first
before deploying expensive error correction mechanisms.
With continuous monitoring, a wearout related error is
predicted before it occurs and preventive adjustments
are made to the circuit’s operation
point (e.g. changing clock frequency and supply voltage; or
deploying modular redundancy) in order to avoid errors [8,
9]. High accuracy error prediction allows for designing a
truly reliability-aware circuit which can adapt to the in-field
reliability state of the hardware.
Given the promise of the prediction approach, several
prediction techniques have been proposed. Some prediction
techniques use “canary” circuits which are designed to fail
before the actual circuit. Other techniques use sensors
inserted into the circuit at design time which are capable of
detecting wearout by sensing increased circuit delay [11, 12]
or changes in other parameters, such as threshold voltage
(Vth). Canary circuits do not test the actual signal paths
in the circuit; rather they only act as proxies for the primary
circuit wearout. More recently researchers proposed wearout
prediction based on monitoring the signal paths in the
primary circuit itself. In WearMon [14], stored test vectors
that are specifically selected to sensitize the critical paths of
the circuit are used for runtime tests that capture the timing
margin (also called timing slack) of these paths. Another in
situ circuit checking method uses a Built-in Self-Test
(BIST) mechanism to perform runtime circuit tests. The
main advantage of monitoring actual signal paths using
stored test vectors compared to static sensory circuit
insertion is the ability to capture the effects of actual circuit
lifetime utilization at a lower cost and with higher flexibility
for online adaptation. For instance, test coverage can be
optimized during in-field operation with little or no overhead
by simply updating the stored test vectors.
Wearout monitoring mechanisms generally make the
following two basic assumptions: (1) In any given circuit
there are only a few circuit paths that have critical timing
margins. Hence, to accurately predict imminent timing
failures only a few circuit paths with the least timing margin
need to be monitored. (2) Circuit paths with least timing
margin have a higher probability of being among the first to
violate timing. Hence, monitoring prioritizes paths purely
based on the timing margin measured at design time. The
first assumption indicates that selection of only a few paths
for monitoring would be sufficient for robust monitoring.
This assumption may hold well in some designs that use
automatic design tools to synthesize, place and route the design.
Figure 1. Design time and runtime cross-layer interaction.
In the absence of a knowledgeable designer’s input
these tools typically do not create steep critical path walls
these tools typically do not create steep critical path walls
[16, 17], where a large number of paths have small timing
margin. However, custom design optimizations for
maximizing power and area efficiency, particularly
employed in high performance processors, may result in the
creation of a steep critical path wall in several circuit blocks.
In the presence of a steep critical path wall the number of
paths that need to be monitored can be very large, thereby
increasing the monitoring overhead. The second assumption
made by in situ approaches results in the selection of paths
purely based on design stage timing margin. However, it has
been shown that wearout depends on dynamic runtime
utilization of the processor and many of the causes of
wearout get exacerbated with increased circuit utilization [6,
7]. Path selection purely based on timing margin neglects
this important dependence. Thus the robustness of the
monitoring approach that relies purely on timing margin can
be compromised due to the dynamic nature of path
utilization. Hence, we conclude that in order for monitoring
approaches to be more broadly applied (beyond the low-cost
computing segment) there is a need for a symbiotic
interaction between circuit design tools, monitoring
hardware and the high-level application software. Only
through such an interaction is it possible to identify circuit
paths which are the slowest at design time and also have
higher lifetime utilization, resulting in the most wearout-induced
timing failures.
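Assumption (2) above amounts to ranking paths by their design-time slack alone and monitoring the top few. A minimal sketch of that baseline selection follows; the path records here are hypothetical stand-ins for static timing analysis output, not data from this work.

```python
# Baseline monitoring selection: rank circuit paths purely by design-time
# timing margin (slack) and pick the k most critical ones. Path data is
# hypothetical; real slack values come from static timing analysis.

def select_by_slack(paths, k):
    """Return the k paths with the least timing margin (slack, in ns)."""
    return sorted(paths, key=lambda p: p["slack"])[:k]

paths = [
    {"name": "p0", "slack": 0.05},
    {"name": "p1", "slack": 0.40},
    {"name": "p2", "slack": 0.07},
    {"name": "p3", "slack": 0.06},
]

# With a steep critical path wall, many paths cluster near the minimum
# slack, so k must grow large for this scheme to remain robust.
monitored = select_by_slack(paths, 2)
```

Note that this selection ignores how often each path is actually exercised, which is exactly the weakness the cross-layer flow addresses.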
In this paper we propose a novel cross-layer circuit
design flow methodology that combines static path timing
information with runtime path utilization data to significantly
enhance monitoring efficiency and robustness. Fig. 1 shows
the layered framework consisting of two phases:
(1) Cross-layer design flow (CLDF) phase: This phase
(marked as “Design Time” in the figure) uses representative
application inputs to derive the circuit path utilization profile.
The microarchitecture specification provides monitoring
budget, such as the amount of chip area or the power
consumption allocated for monitoring. CLDF also derives a
timing profile from static timing analysis of the circuit’s design.
The wearout aware algorithm then combines information
from software, microarchitecture and circuit layers to drive
circuit design optimizations with the explicit goal of making
a circuit amenable for robust and efficient monitoring. The
algorithm selects a refined group of paths along with a robust
set of input vectors for wearout monitoring.
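The core CLDF idea of combining static slack with runtime utilization can be sketched as follows. The scoring function, the `wear_rate` weight, and the path data are illustrative assumptions of ours, not the paper's actual four algorithms; they only show how utilization can reorder a slack-based ranking.

```python
# Sketch of cross-layer path selection: weight each path's static slack by
# its runtime utilization so that heavily exercised near-critical paths are
# prioritized for monitoring. Scoring function and weights are assumptions.

def cldf_select(paths, n_monitor, wear_rate=0.5):
    """Pick the n_monitor paths most likely to violate timing first."""
    def score(p):
        # Lower score = earlier expected failure: small slack eroded
        # faster by high lifetime utilization (utilization in [0, 1]).
        return p["slack"] * (1.0 - wear_rate * p["utilization"])
    return sorted(paths, key=score)[:n_monitor]

paths = [
    {"name": "a", "slack": 0.05, "utilization": 0.01},  # critical, rarely used
    {"name": "b", "slack": 0.06, "utilization": 0.90},  # near-critical, hot
    {"name": "c", "slack": 0.30, "utilization": 0.95},
]

# Path "b" outranks "a": slightly larger slack, but far higher lifetime
# utilization means it is expected to wear out sooner.
selected = cldf_select(paths, 2)
```

A purely slack-based ranking would have chosen "a" first; the utilization profile reverses that order, which is the behavior the cross-layer flow exploits.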
(2) Wearout monitoring phase: A runtime wearout
monitoring phase, similar to that proposed in WearMon [14],
continuously monitors the paths selected from the CLDF
phase. The information about the circuit paths which need to
be monitored, obtained from the CLDF phase, is used in the
runtime phase for wearout detection.
The focus of this research work is to develop the CLDF
framework. As such, we assume that a wearout monitoring
mechanism exists in the underlying microarchitecture. CLDF
significantly enhances the applicability of existing runtime
monitoring approaches. For example, where wearout sensors
or canary circuits are used for monitoring, CLDF will
identify circuit paths that are most susceptible to failure
thereby allowing the designer to select the most appropriate
location of the wearout sensors or canary circuitry. When in
situ monitoring approaches are used [14, 15, 18, 19] only the
most susceptible circuit paths reported by the CLDF
framework are monitored. It should be noted that although
the CLDF framework can be used with all the above
mentioned reliability monitoring approaches, throughout this
paper we assume that the underlying microarchitecture uses
an in situ monitoring approach similar to WearMon [14] to
illustrate how our design phase optimizations can enhance
runtime monitoring efficiency.
The main contributions of this work are:
1. We design and implement a novel cross-layer circuit
design flow methodology that combines static path timing
information with runtime path utilization data to significantly
enhance monitoring efficiency. This framework uses path
utilization profile, path delay characteristics, and number of
devices in critical paths to optimize the circuit using
selective path constraint adjustments (i.e. increasing the
timing margin of selected group of paths). This optimization
results in a new implementation of the circuit which is more
amenable for low overhead monitoring of wearout-induced
timing failures.
2. We propose four algorithms for selecting the best
group of paths to be observed as early indicators of wearout
induced timing failures. Each of these algorithms allows the
designer to tradeoff area and power overhead of monitoring
with robustness and efficiency of monitoring.
3. We develop a hybrid hierarchical emulation/simulation
infrastructure to study the effects of application level events
on gate-level utilization profile. This setup provides a fast
and accurate framework to study system utilization across
multiple layers of the system stack using a combination of
FPGA emulation and gate-level simulation.
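The "selective path constraint adjustment" named in the first contribution can be sketched as below. The slack threshold and margin boost are hypothetical values of ours, and a real flow would emit tool-specific timing constraints for re-synthesis rather than return a Python dictionary; the sketch only shows the selection logic.

```python
# Sketch of selective path constraint adjustment: near-critical paths that
# are NOT chosen for monitoring get a tightened (larger) slack target, so
# re-synthesis pushes them out of the critical path wall and the monitored
# group remains the earliest indicator of wearout. Threshold and boost
# values below are hypothetical.

def adjust_constraints(paths, monitored, wall_slack=0.10, boost=0.08):
    """Return {path_name: new_required_slack} for paths to re-optimize."""
    adjustments = {}
    for p in paths:
        if p["name"] not in monitored and p["slack"] < wall_slack:
            adjustments[p["name"]] = p["slack"] + boost
    return adjustments

paths = [
    {"name": "p0", "slack": 0.05},
    {"name": "p1", "slack": 0.06},
    {"name": "p2", "slack": 0.25},
]

# Monitor p0; p1 sits inside the critical path wall but is unmonitored,
# so it receives extra margin; p2 already has ample slack.
adj = adjust_constraints(paths, monitored={"p0"})
```

Only the unmonitored near-critical path (`p1` here) is re-constrained, which is how the approaches trade a modest area overhead for monitoring robustness.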
In an era when computers are built from an increasing
number of components with decreasing reliability, multi-layer
resiliency is becoming a requirement for all computer
systems. In this work we design and implement a low cost
and scalable solution in which different layers of the
computer system stack can communicate and adapt both at
design phase and during the runtime of the system. Our
proposed cross-layer design flow approach is discussed in
Section II. Section III shows our hybrid cross-layer
evaluation infrastructure, followed by the evaluation results
in Section IV. Section V describes the most relevant prior
work, followed by conclusions in Section VI.
TABLE II. COMPARISON OF FOUR APPROACHES FOR DIFFERENT FUBS
(Table II compares the % area overhead of the four approaches for three circuit blocks: Exec. control logic, ST buffer control, and Reg. management logic.)
sparc_exu_ecl and lsu_stb_ctl FUBs do not show a
steep critical path wall. Hence the area overhead of all
approaches for these FUBs is almost zero. In essence, our
approaches do not introduce any additional overhead when
the FUB is already well suited for monitoring. On the other
hand, sparc_ifu_dec and sparc_exu_rml have a steep critical
path wall as shown in Fig. 6 (a) and (d). As a result there are
overheads associated with all 4 approaches for these FUBs.
The area overhead of Approach 1 is always higher for these
FUBs since the majority of paths are optimized. In Approach 2
increased correlation between the paths monitored and paths
optimized results in a more robust monitoring that has less
reliance on path utilization profile. However, it increases the
area overhead in some cases, since 25 additional paths are
optimized. To increase monitoring robustness Approach 1
and 2 optimize many paths. As a result these two approaches
can dramatically alter the timing distribution. This shift is
also clearly seen in Fig. 6 (a) and (d) for Approaches 1 and 2
where timing redistribution looks significantly different.
Approaches 3 and 4 limit the number of paths to be
optimized and hence the area overhead is significantly
reduced. Approaches 3 and 4 enhance monitoring robustness
by jointly optimizing the paths and relying on elevated test
frequency to control the area overhead. Since Approaches 3
and 4 only change the timing distribution of a fixed number
of paths (nMonitor paths) they do not fundamentally alter the
initial timing distribution. This can be clearly seen in Fig. 6
where the path distribution after Approaches 3 and 4 are very
similar to the initial distribution.
In summary, it has been shown that different FUBs
benefit from different approaches. These differences are due
to the differences in the timing margin distribution and
utilization profile of the FUBs. Aggressive optimization of the
FUB is not always an outcome of the algorithms presented;
in scenarios where a FUB does not have a steep critical path
wall, the ideal group of paths to be monitored can be selected
without adding any additional optimization overhead.
One interesting aspect of these four different approaches is
that their overheads are dependent on the initial timing and
utilization profile of the circuit and hence each approach is
more suitable for a category of circuits with specific
structural, timing, and utilization characteristics.
V. RELATED WORK
Predictions by the International Technology Roadmap for
Semiconductors (ITRS) for more severe wearout in
future technology generations have resulted in increased
research efforts in modeling, detecting, and predicting
wearout. Some methods have been specifically developed for
prediction of wearout related timing failures [11, 12, 14].
These methods tackle the problem of wearout at
microarchitecture level or by making circuit level
enhancements. While these approaches focus only at one
layer of the system design, in this work we presented a novel
approach which correlates microarchitectural wearout
prediction techniques with the circuit design implementation.
The resulting circuit design is aware of the presence of
wearout monitoring and hence can make monitoring more
robust and efficient. Our work takes the utilization of the
circuit paths driven by application level information to
change circuit implementation.
Research in using runtime behavior during circuit design
time has expanded in the recent years. Many of these efforts
target improved power consumption or operation at reduced
error rates. Design time error rate analysis has been used for
improving reliability in the presence of variations. Circuit
modifications have also been proposed to make the
implementation of a circuit more suitable for timing speculation.
In BlueShift [24], targeted acceleration of frequently exercised
paths is used to change the circuit implementation with the goal
of improving the performance of timing speculation even in the
presence of a critical path wall. Other authors presented
a design time optimization with the goal of reducing error
rates even when the circuit is operating at a reduced
supply voltage. These approaches allow for more
aggressive voltage scaling and increased power savings
without impacting reliability.
In many of these prior studies the circuit is intentionally
operated at a higher than nominal clock frequency resulting
in some circuit paths not meeting the timing constraint. In
contrast our approach does not do timing speculation. Rather
the goal is to continuously monitor circuit wearout
efficiently. Hence, the design changes necessary for wearout
monitoring are quite different than those necessary for timing
speculation. We have exploited runtime path utilization
information, as was done in prior work, which had the
different end goal of improving timing speculation. We use
runtime information about how the design is used to reduce the
number of paths needed for monitoring. We have taken
advantage of the gradual nature of wearout and its
dependence on utilization to correlate design time
optimization efforts with runtime wearout monitoring
enhancements. The resulting framework improves the
effectiveness and efficiency of in situ circuit monitoring
techniques.
VI. CONCLUSION
As device sizes in a processor continue to shrink with
each new process technology, there is a growing concern for
reliability. While reliability issues can take different forms,
wearout is a prevalent degradation where circuit timing
margins gradually decrease over the lifetime of the circuit.
Continuously monitoring wearout will become critical. By
monitoring the amount of timing margin left in a circuit it is
possible to enable just-in-time error detection and correction
solutions. Since monitoring itself will be done continuously
it is necessary to improve monitoring efficiency. Selecting
only the most critical paths to monitor can reduce monitoring
overhead. But in the presence of critical path timing wall,
monitoring overhead can be significant due to the need to
monitor many paths. This research addresses this serious
bottleneck to monitoring in the presence of critical path
We present a cross-layer design flow that uses
application knowledge to separate more frequently used
critical paths from the ones with low utilization. Since
wearout is a function of utilization, using application level
information to derive path utilization provides new ways to
improve monitoring efficiency and robustness. We describe
four approaches to redistribute the timing of circuit paths that
take advantage of this cross-layer utilization information. All
these approaches provide the designer the ability to tradeoff
monitoring robustness with power and area overheads.
The proposed design is implemented in a novel
evaluation framework that allows application level
information and circuit design tools to interact and exchange
information. Our evaluation framework provides an
automated mechanism to generate the best set of paths that
need to be monitored given the design constraints. Using
OpenSPARC T1 processor FUBs we evaluated the four
proposed approaches. Our results show that all four
approaches have unique capabilities that allow them to be
applied to FUBs with different characteristics.
ACKNOWLEDGMENT
This work was supported by National Science
Foundation grants CAREER-0954211, CCF-0834798, CCF-
0834799, and an IBM Faculty Fellowship.
REFERENCES
[1] S. Borkar, "Designing reliable systems from unreliable components: The challenges of transistor variability and degradation," IEEE Micro, vol. 25, pp. 10-16, 2005.
[2] S. Borkar, "Electronics beyond nano-scale CMOS," Design Automation Conference, pp. 807-808, 2006.
[3] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, "The case for lifetime reliability-aware microprocessors," International Symposium on Computer Architecture, pp. 276-287, 2004.
[4] J. Shin, V. Zyuban, Z. Hu, J. A. Rivers, and P. Bose, "A framework for architecture-level lifetime reliability modeling," Dependable Systems and Networks, pp. 534-543, 2007.
[5] W. P. Wang, V. Reddy, A. T. Krishnan, R. Vattikonda, S. Krishnan, and Y. Cao, "Compact modeling and simulation of circuit reliability for 65-nm CMOS technology," IEEE Transactions on Device and Materials Reliability, vol. 7, pp. 509-517, 2007.
[6] C. Kenyon, A. Kornfeld, K. Kuhn, M. Liu, and A. Maheshwari, "Managing process variation in Intel's 45nm CMOS technology," Intel Technology Journal, vol. 12, pp. 93-109, 2008.
[7] J. Hicks, D. Bergstrom, M. Hattendorf, J. Jopling, and J. Maiz, "45nm transistor reliability," Intel Technology Journal, vol. 12, pp. 131-.
[8] J. Abella, X. Vera, and A. Gonzalez, "Penelope: The NBTI-aware processor," International Symposium on Microarchitecture, pp. 85-96.
[9] S. V. Kumar, C. H. Kim, and S. S. Sapatnekar, "Adaptive techniques for overcoming performance degradation due to aging in digital circuits," Asia and South Pacific Design Automation Conference.
[10] A. C. Cabe, Z. Y. Qi, S. N. Wooters, T. N. Blalock, and M. R. Stan, "Small Embeddable NBTI Sensors (SENS) for tracking on-chip performance decay," International Symposium on Quality Electronic Design, pp. 1-6, 2009.
[11] M. Agarwal, B. C. Paul, M. Zhang, and S. Mitra, "Circuit failure prediction and its application to transistor aging," VLSI Test Symposium, pp. 277-286, 2007.
[12] J. Blome, S. Feng, S. Gupta, and S. Mahlke, "Self-calibrating online wearout detection," International Symposium on Microarchitecture, pp. 109-122, 2007.
[13] E. Karl, P. Singh, D. Blaauw, and D. Sylvester, "Compact in situ sensors for monitoring negative-bias-temperature-instability effect and oxide degradation," Solid-State Circuits Conference, pp. 410-.
[14] B. Zandian, W. Dweik, S. H. Kang, T. Punihaole, and M. Annavaram, "WearMon: Reliability monitoring using adaptive critical path testing," Dependable Systems and Networks, pp. 151-160, 2010.
[15] S. Shyam, K. Constantinides, S. Phadke, V. Bertacco, and T. Austin, "Ultra low-cost defect protection for microprocessor pipelines," ACM SIGPLAN Notices, vol. 41, pp. 73-82, Nov. 2006.
[16] J. Patel, "CMOS process variations: A critical operation point hypothesis," online presentation, 2008.
[17] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, "Slack redistribution for graceful degradation under voltage overscaling," Asia and South Pacific Design Automation Conference, pp. 825-831.
[18] Y. J. Li, S. Makar, and S. Mitra, "CASP: Concurrent autonomous chip self-test using stored test patterns," Design, Automation and Test in Europe, pp. 885-890, 2008.
[19] A. H. Baba and S. Mitra, "Testing for transistor aging," VLSI Test Symposium, pp. 215-220, 2009.
[20] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, "Razor: A low-power pipeline based on circuit-level timing speculation," International Symposium on Microarchitecture, pp. 7-18, 2003.
[21] A. Kahng, S. Kang, R. Kumar, and J. Sartori, "Designing processors from the ground up to allow voltage/reliability tradeoffs," High Performance Computer Architecture, pp. 1-11, 2010.
[22] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas, "EVAL: Utilizing processors with variation-induced timing errors," International Symposium on Microarchitecture, pp. 423-434, 2008.
[23] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge, "Opportunities and challenges for better than worst-case design," Asia and South Pacific Design Automation Conference, pp. 2-7, 2005.
[24] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. M. Chen, and C. Zilles, "BlueShift: Designing processors for timing speculation from the ground up," High Performance Computer Architecture, pp. 213-224, 2009.