Shuai Wang

Nanjing University, Nanjing, Jiangsu, China

Publications (21) · 4.91 Total impact

  • Guangshan Duan, Shuai Wang
    ABSTRACT: Due to their high cell density, low leakage power consumption, and lower vulnerability to soft errors, non-volatile memory technologies are among the most promising alternatives for replacing the traditional DRAM and SRAM technologies used in implementing main memory and caches in modern microprocessors. However, one of the difficulties is the limited write endurance of most non-volatile memory technologies. In this paper, we propose to exploit narrow-width values to improve the lifetime of non-volatile last level caches. A leading-zeros masking scheme is first proposed to reduce the write stress on the upper half of the narrow-width data. To balance the write variations between the upper half and the lower half of the narrow-width data, two swap schemes, swap on write (SW) and swap on replacement (SRepl), are proposed. To further reduce the write stress on non-volatile caches, we adopt two optimization schemes, multiple dirty bit (MDB) and read before write (RBW), to improve their lifetime. Our experimental results show that by combining all our proposed schemes, the lifetime of non-volatile caches can be improved by 245% on average.
    Design Automation and Test in Europe; 01/2014
  • ABSTRACT: The negative bias temperature instability (NBTI) in CMOS devices is one of the most prominent aging mechanisms, which can pose severe threats to the reliability of modern processors at deep submicron semiconductor technologies. Due to the unbalanced duty cycle ratio of the SRAM cells, the data cache suffers a heavy NBTI stress and this will further exacerbate the aging effect in the data cache. In this paper, an aging-aware design is proposed to combat the NBTI-induced aging in the data cache. First, the detailed lifetime behaviors of the cachelines in the data cache are studied. Then, different schemes are proposed to mitigate the negative aging effects by balancing the duty cycle ratio of the SRAM cells in the cachelines according to their different lifetime phases. By applying our proposed idle-time-based cacheline invalidation, early write-back, and bit-flipping schemes, the duty cycle ratio of the data cache can be well balanced. By adopting the drowsy scheme for invalidated cachelines, our design can also reduce the power consumption significantly, which will further optimize the thermal behavior and aging effect of data caches.
    Proceedings of the 23rd ACM international conference on Great lakes symposium on VLSI; 05/2013
  • ABSTRACT: With shrinking transistor feature sizes and lower nodal capacitance and supply voltage at new technology generations, microprocessors are becoming more vulnerable to single-event upsets and transients, a.k.a. soft errors. While the chip-multiprocessor (CMP) architecture has been employed in mainstream microprocessors and the number of on-chip processor cores keeps increasing, the system-level reliability of chip-multiprocessors degrades in inverse proportion to the core number. In this work, we propose to exploit abundant on-chip processor cores for redundant hardware transaction processing, which provides native support for error detection and recovery in transactional chip-multiprocessors (TxCMPs) against soft errors. The proposed transactional processor cores execute everything as transactions and TxCMPs execute redundant transactions on different cores. To alleviate the performance overhead due to transaction commits, we further propose two architectural optimizations, namely early partial commit packet transmission and speculative transaction execution in reliable computing mode. Our experimental evaluation confirms the effectiveness of our optimized TxCMPs in achieving low cost reliable computing against soft errors.
    Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2012 IEEE International Symposium on; 01/2012
  • ABSTRACT: The degradation of CMOS devices over their lifetime can pose a severe threat to system performance and reliability at deep submicron semiconductor technologies. The negative bias temperature instability (NBTI) is among the most important sources of the aging mechanisms. Applying the traditional guardbanding technique to address the decreased speed of devices is too costly. Due to the presence of narrow-width values, integer register files in high-performance microprocessors suffer a very high NBTI stress. In this paper, we propose an aging-aware register file (AARF) design to combat the NBTI-induced aging in integer register files. The proposed AARF design can mitigate the negative aging effects by balancing the duty cycle ratio of the internal bits in register files. By gating the leading bits of the narrow-width values during the register accesses, our AARF can also achieve a significant power reduction, which will further reduce the temperature and NBTI degradation of integer register files. Our experimental results show that AARF can effectively reduce the NBTI stress with a 36.9% power saving for integer register files.
    01/2012;
  • Tao Jin, Shuai Wang
    ABSTRACT: The degradation of CMOS devices over their lifetime can pose a severe threat to system performance and reliability at deep submicron semiconductor technologies. The negative bias temperature instability (NBTI) is among the most important sources of the aging mechanisms. Applying the traditional guardbanding technique to address the decreased speed of devices is too costly. Due to the unbalanced duty cycle ratio of the SRAM cells, the instruction cache suffers a heavy NBTI stress and this will further exacerbate the aging effect in the instruction cache. In this paper, we propose an aging-aware design to combat the NBTI-induced aging in the instruction cache. First, the detailed lifetime behaviors of the cache lines in the instruction cache are studied. Then, different schemes are proposed to mitigate the negative aging effects by balancing the duty cycle ratio of the SRAM cells in the cache lines according to their different lifetime phases. By applying our proposed idle-time-based cache line invalidation and bit-flipping/complementing schemes, the duty cycle ratio of the instruction cache can be well balanced and the NBTI stress will be significantly reduced.
    VLSI (ISVLSI), 2012 IEEE Computer Society Annual Symposium on; 01/2012
  • Shuai Wang, Jie Hu, Sotirios G Ziavras
    ABSTRACT: Protecting on-chip cache memories against soft errors has become an increasing challenge in designing new generation reliable microprocessors. Previous efforts have mainly focused on improving the reliability of the cache data arrays. Due to its crucial importance to the correctness of cache accesses, the tag array also demands high reliability against soft errors. Exploiting the address locality of memory accesses, we propose to duplicate most recently accessed tag entries in a small tag replication buffer (TRB) thus to protect the information integrity of the tag array in the data cache. Experimental results show that our proposed TRB scheme achieves a high 90% access-with-replica (AWR) rate with low performance (0%), energy (16.3%), and area (19.9%) overheads. We also conduct a detailed design space exploration for the TRB design and propose a selective TRB scheme that achieves a higher AWR rate (97.4%) for the dirty cachelines with negligible overheads. To provide a comprehensive evaluation of the tag-array reliability, we further conduct an architectural vulnerability factor (AVF) analysis for the tag array in the data cache and propose a refined metric, detected-without-replica-AVF (DOR-AVF), which combines the AVF and AWR analysis. Based on our DOR-AVF analysis, a selective TRB scheme with early write-back (S-TRB-EWB) is proposed, which achieves a zero DOR-AVF and 100% AWR rate at a negligible performance overhead. Results from statistical fault/error injection experiments also confirm the effectiveness of our TRB schemes and the achieved reliability of the cache tag array that recovers 100% of detected errors. Index Terms: Cache tag array, reliability, soft error, tag replication buffer (TRB).
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2012; 1. · 1.22 Impact Factor
  • ABSTRACT: With continuous technology scaling, current and next generation microprocessors are becoming more vulnerable to transient errors such as soft errors induced by energetic particle strikes. While mainstream microprocessors are employing multi-/many-core architectures targeting high-performance parallel computing applications, the transistor/area share of on-chip caches keeps increasing. As cache memories are the major victims of soft errors, it is of paramount importance to characterize the on-chip cache's vulnerability in this context for devising potential reliability optimizations, especially under the interaction with cache coherence protocols. In this work, we develop a lifetime model for the private L1 data cache in chip-multiprocessors (CMPs), which is based on the cache activities and the states of cache lines. This lifetime model is then applied to characterize and predict the cache's vulnerability trend in CMPs. Our experimental evaluation shows that cache vulnerable phases due to remote accesses increase dramatically as the number of processor cores increases. Based on vulnerable phase analysis, we propose a protocol enhancement to prematurely invalidate cache lines in modified (M) state for minimizing the vulnerability factor due to remote reads to modified cachelines.
    IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2011, 4-6 July 2011, Chennai, India; 01/2011
  • Shuai Wang
    ABSTRACT: With continuous scaling down of the semiconductor technology, the soft errors induced by energetic particles have become an increasing challenge in designing current and next-generation reliable microprocessors. Due to their large share of the transistor budget and die area, cache memories suffer from an increasing vulnerability to soft errors. Previous work based on the vulnerability factor (VF) analysis proposed analytical models to evaluate the reliability of on-chip data and instruction caches. However, a system-level study on the vulnerability of instruction caches is still lacking. In this paper, we propose a new analytical model to estimate the system-level vulnerability factor for on-chip instruction caches. In our model, the error masking/detection effects in instructions based on the Instruction Set Architecture (ISA) are studied. Our experimental results using the SPEC benchmark suite show that the self-error-masking/detection in instructions will reduce the VF of the instruction caches compared to the previous study. We also conduct an evaluation of the effectiveness of the reliability optimization techniques for instruction caches under our system-level VF characterization. Our proposed vulnerability model can provide insightful guidance for reliable instruction cache and ISA design.
    2011 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2011, Vancouver, BC, Canada, October 3-5, 2011; 01/2011
  • ABSTRACT: Protecting the on-chip cache memories against soft errors has become an increasing challenge in designing new generation reliable microprocessors. Previous efforts have mainly focused on improving the reliability of the cache data arrays. Due to its crucial importance to the correctness of cache accesses, the tag array demands high reliability against soft errors while the data array is fully protected. Exploiting the address locality of memory accesses, we propose to duplicate most recently accessed tag entries in a small Tag Replication Buffer (TRB) thus to protect the information integrity of the tag array in the data cache with low performance, energy and area overheads. A Selective-TRB scheme is further proposed to protect only tag entries of dirty cache lines. The experimental results show that the Selective-TRB scheme achieves a higher access-with-replica (AWR) rate of 97.4% for the dirty-cache line tags. To provide a comprehensive evaluation of the tag-array reliability, we also conduct an architectural vulnerability factor (AVF) analysis for the tag array and propose a refined metric, detected-without-replica-AVF (DOR-AVF), which combines the AVF and AWR analysis. Based on our DOR-AVF analysis, a TRB scheme with early write-back (EWB) is proposed, which achieves a zero DOR-AVF at a negligible performance overhead.
    IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2010, 5-7 July 2010, Lixouri Kefalonia, Greece; 01/2010
  • ABSTRACT: Soft errors induced by energetic particle strikes in on-chip cache memories have become an increasing challenge in designing new generation reliable microprocessors. Previous efforts have exploited information redundancy via parity/ECC codings or cacheline duplication for information integrity in on-chip cache memories. Due to various performance, area/size, and energy constraints in various target systems, many existing unoptimized protection schemes may eventually prove significantly inadequate and ineffective. In this paper, we propose a new framework for conducting comprehensive studies and characterization of the reliability behavior of cache memories, in order to provide insight into cache vulnerability to soft errors as well as design guidance to architects for highly efficient reliable on-chip cache memory design. Our work is based on the development of new lifetime models for data and tag arrays residing in both the data and instruction caches. Those models facilitate the characterization of cache vulnerability of stored items at various lifetime phases. We then exemplify this design methodology by proposing reliability schemes targeting specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of our approach.
    IEEE Transactions on Computers 09/2009; 58:1171-1184. · 1.38 Impact Factor
  • Jie Hu, Shuai Wang, S.G. Ziavras
    ABSTRACT: Protecting the register value and its data buses is crucial to reliable computing in high-performance microprocessors due to the increasing susceptibility of CMOS circuitry to soft errors induced by high-energy particle strikes. Since the register file is in the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or the register file access latency is not desirable. In this paper, we propose to exploit narrow-width register values, which represent the majority of the generated values, for making a duplicate of the value within the same data item; this in-register duplication (IRD) eliminates the requirement for additional copy registers. The data path pipeline is augmented to efficiently incorporate parity encoding and parity checking such that error recovery is seamlessly supported in IRD and the parity checking is overlapped with the execution stage to avoid increasing the critical path. A detailed architectural vulnerability factor (AVF) analysis shows that IRD significantly reduces the AVF from 8.4% in a conventional unprotected register file to 0.1% in an IRD register file. Our experimental evaluation using the SPEC CINT2000 benchmark suite also shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection as compared to previous reliability schemes, while only incurring a small power overhead.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 08/2009; · 1.22 Impact Factor
  • ABSTRACT: Localized heating-up creates thermal hotspots across the chip, with the integer register file ranked as the hottest unit in high-performance microprocessors. In this paper, we perform a detailed study on the thermal behavior of a low-power value-aware register file (VARF) that is subjected to internal fine-grain hotspots. To further optimize its thermal behavior, we propose and evaluate three thermal-aware control schemes, thermal sensor (TS), access counter (AC), and register-id (ID) based, to balance the access activity and thus the temperature across different partitions in the VARF. The simulation results using SPEC CINT2000 benchmarks show that the register-id controlled VARF (ID-VARF) scheme achieves optimized thermal behavior at minimum cost as compared to the other schemes. We further evaluate the performance impact of the thermal-aware VARF design with dynamic thermal management (DTM). The experimental results show that the ID-VARF can improve the performance by 26.1% and 7.2% over the conventional register file and the original VARF design, respectively.
    Design, Automation and Test in Europe, DATE 2009, Nice, France, April 20-24, 2009; 01/2009
  • Shuai Wang, Jie Hu, S.G. Ziavras
    ABSTRACT: Soft-error induced reliability problems have become a major challenge in designing new generation microprocessors. Due to the on-chip caches' dominant share in die area and transistor budget, protecting them against soft errors is of paramount importance. Recent research has focused on the design of cost-effective reliable data caches in terms of performance, energy, and area overheads, based on the assumption of fixed error rates. However, for systems in operating environments that vary with time or location, those schemes will be either insufficient or overdesigned for the changing error rates. In this paper, we explore the design of a self-adaptive reliable data cache that dynamically adapts its employed reliability schemes to the changing operating environments thus to maintain a target reliability. The proposed data cache is implemented with three levels of error protection schemes, a monitoring mechanism, and a control component that decides whether to upgrade, downgrade, or keep the current protection level based on the feedback from the monitor. Our experimental evaluation using a set of SPEC CPU2000 benchmarks shows that our self-adaptive data cache achieves similar reliability to a cache protected by the most reliable scheme, while simultaneously minimizing the performance and power overheads.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 09/2008; · 1.09 Impact Factor
  • ABSTRACT: Powerful branch predictors along with a large branch target buffer (BTB) are employed in superscalar processors for instruction-level parallelism exploitation. However, the large BTB not only dominates the predictor energy consumption, but also becomes a major roadblock in achieving faster clock frequencies at deep sub-micron technologies. In this paper, we propose a filtering scheme to reduce the accesses to the BTB to achieve a significant dynamic energy reduction in the BTB while maintaining the performance. Our experimental evaluation using the SPEC2000 benchmark suite shows that our BTB Access Filtering (BAF) design achieves an 88.5% dynamic energy reduction over a default 2K-entry 2-way BTB at the cost of a negligible 0.1% performance loss, on average across all benchmarks. We also studied the leakage behavior and its control in our BAF design. The results show that by applying a drowsy strategy, we can achieve very effective leakage control in the BTB: an 83% leakage reduction at a marginal 0.3% performance overhead. For high performance design, our BAF can also improve the BTB's performance scalability at new technologies. In deeply-pipelined designs, the BAF design yields a 2.7% (and 8.1%) performance improvement over a conventional 2-cycle (and 3-cycle) BTB, with its energy efficiency fully exploited.
    IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2008, 7-9 April 2008, Montpellier, France; 01/2008
  • ABSTRACT: Designing high-performance low-energy register files is of critical importance to the continuation of current performance advances in wide-issue and deeply pipelined superscalar microprocessors. In this paper, we propose a new microarchitecture, the asymmetrically banked value-aware register file (AB-VARF), to exploit the prevailing narrow-width register values for low-latency and energy-efficient register file designs. The register bit-widths of different banks in our AB-VARF register files are specifically customized to capture different narrow-width values. Augmented with a value width predictor, the register renaming logic is slightly tuned to rename predicted narrow-width registers to the corresponding narrow-width banks. Our experimental evaluation with the SPEC CINT2000 benchmark suite shows that AB-VARF reduces the energy consumption by 78.4% over a conventional register file, on average, at the cost of a 0.7% performance loss to an ideal 1-cycle monolithic register file.
    Microprocessors and Microsystems. 01/2008;
  • ABSTRACT: In this paper, we propose and implement a vector processing system that includes two identical vector microprocessors embedded in two FPGA chips. Each vector microprocessor supports floating-point calculations and efficient sparse matrix operations. Dense matrix-matrix multiplication and sparse matrix-vector multiplication with benchmark matrices from various application domains were run on the system to evaluate its performance. The resulting calculation times are compared with those of a commercial PC to show the effectiveness of our approach.
    2007 IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2007), May 9-11, 2007, Porto Alegre, Brazil; 01/2007
  • ABSTRACT: Designing high-performance low-power register files is of critical importance to the continuation of current performance advances in wide-issue and deeply-pipelined superscalar microprocessors. In this paper, we propose a new microarchitecture, the asymmetrically-banked value-aware register file (AB-VARF), to exploit the prevailing narrow-width register values for low-latency and power-efficient register file designs. The register bit-widths of different banks in our AB-VARF register files are specifically customized to capture different narrow-width values. Augmented with a value width predictor, the register renaming logic is slightly tuned to rename predicted narrow-width registers to the corresponding narrow-width banks. Our experimental evaluation with the SPEC CINT2000 benchmark suite shows that AB-VARF reduces the energy consumption by 92.6% over a conventional register file, on average, at the cost of a 6.6% performance loss to an ideal 1-cycle monolithic register file.
    2007 IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2007), May 9-11, 2007, Porto Alegre, Brazil; 01/2007
  • ABSTRACT: Protecting the register value and its data buses is crucial to reliable computing in high-performance microprocessors due to the increasing susceptibility of CMOS circuitry to soft errors induced by high-energy particle strikes. Since the register file is in the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or the register file access latency is not desirable. In this paper, we propose to exploit narrow-width register values, which represent the majority of the generated values, for duplicating a copy of the value within the same data item, called in-register duplication (IRD), eliminating the requirement of additional copy registers. The datapath pipeline is augmented to efficiently incorporate parity encoding and parity checking such that error recovery is seamlessly supported in IRD and the parity checking is overlapped with the execution stage to avoid increasing the critical path. Our experimental evaluation using the SPEC CINT2000 benchmark suite shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection as compared to previous reliability schemes.
    2006 International Conference on Dependable Systems and Networks (DSN 2006), 25-28 June 2006, Philadelphia, Pennsylvania, USA, Proceedings; 01/2006
  • ABSTRACT: Energetic-particle induced soft errors in on-chip cache memories have become a major challenge in designing new generation reliable microprocessors. Uniformly applying conventional protection schemes such as error correcting codes (ECC) to SRAM caches may not be practical where performance, power, and die area are highly constrained, especially for embedded systems. In this paper, we propose to analyze the lifetime behavior of the data cache to identify its temporal vulnerability. For this vulnerability analysis, we develop a new lifetime model. Based on the new lifetime model, we evaluate the effectiveness of several existing schemes in reducing the vulnerability of the data cache. Furthermore, we propose to periodically invalidate clean cache lines to reduce the probability of errors being read in by the CPU. Combined with previously proposed early writeback strategies (1), our schemes achieve a substantially low vulnerability in the data cache, which indicates the necessity of different protection schemes for data items during various phases in their lifetime.
    Proceedings of 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2006), Samos, Greece, July 17-20, 2006; 01/2006
  • ABSTRACT: Increasing microprocessor vulnerability to soft errors induced by neutron and alpha particle strikes prevents aggressive scaling and integration of transistors in future technologies if left unaddressed. Previously proposed instruction-level redundant execution, as a means of detecting errors, suffers from a severe performance loss due to the resource shortage caused by the large number of redundant instructions injected into the superscalar core. In this paper, we propose to apply three architectural enhancements, namely 1) floating-point unit sharing (FUS), 2) prioritizing primary instructions (PRI), and 3) early retiring of redundant instructions (ERT), that enable transient-fault detecting redundant execution in superscalar microarchitectures with a much smaller performance penalty, while maintaining the original full coverage of soft errors. In addition, our enhancements are compatible with many other proposed techniques, allowing for further performance improvement.
    Advances in Computer Systems Architecture, 10th Asia-Pacific Conference, ACSAC 2005, Singapore, October 24-26, 2005, Proceedings; 01/2005

Publication Stats

108 Citations
4.91 Total Impact Points

Institutions

  • 2011–2013
    • Nanjing University
      • Department of Computer Science & Technology
      Nanjing, Jiangsu, China
  • 2009
    • Siena Heights University
      Newark, New Jersey, United States
  • 2005–2008
    • New Jersey Institute of Technology
      • Department of Electrical and Computer Engineering
      Newark, NJ, United States