Zhigang Mao

Shanghai Jiao Tong University, Shanghai, Shanghai Shi, China

Publications (27)

  • ABSTRACT: The heavily threaded data processing demands of streaming multiprocessors (SMs) in a GPGPU require a large register file (RF). The fast-increasing size of the RF makes the area cost and power consumption unaffordable for traditional SRAM designs in future technologies. In this paper, we propose to use embedded DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time of eDRAM poses new challenges: periodic refresh operations are needed to maintain data integrity, and this is exacerbated by the scaling of eDRAM density, process variations, and temperature. Unlike conventional CPUs, which use multi-ported RFs, most RFs in modern GPGPUs are heavily banked but not multi-ported, to reduce hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubble and bank walk-through. In addition, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of destructive reads. The analysis shows that our schemes offer better energy efficiency, scalability, and variation tolerance than traditional SRAM-based designs.
    International Symposium on Computer Architecture; 06/2013
  • ABSTRACT: In this paper, we propose a parallel stack algorithm for MIMO detection that reduces memory and achieves high throughput. By partitioning the global stack into multiple local stacks and assigning one to each non-leaf layer of the tree, the proposed algorithm applies the best-first strategy to all stacks in parallel. The parallel processing reduces the number of iteration loops per detection, and the node pruning rule and the leaf enumeration method help decrease the total stack size. Moreover, a Dual-term APP approach is designed for the proposed algorithm to improve BER performance without extra iteration loops. The simulation results demonstrate that, for a 4×4 64QAM MIMO system, the proposed algorithm reduces the minimum required stack size and the average number of iteration loops per detection of the advanced stack algorithm by 50% and 30%, respectively, while achieving the same BER performance as STS-SD.
    Signal Processing Systems (SiPS), 2013 IEEE Workshop on; 01/2013
  • Zhi Yue, Guanghui He, Jun Ma, Zhigang Mao
    ABSTRACT: Multiple-input multiple-output (MIMO) technology can enhance spectral efficiency significantly at the cost of high detection complexity. The stack algorithm can minimize the average complexity and achieve optimal performance, but it requires a large memory to store the candidate nodes. In this paper, we propose a memory-reduced stack algorithm for soft-output MIMO detection. With the leaf enumeration scheme and the parallel hypotheses update method, the proposed algorithm stores only non-leaf nodes in the stack, and the leaf nodes are used for updating the soft output. The proposed node pruning rule simplifies the search process and reduces the memory size further. The simulation results show that, for a 4×4 64QAM MIMO system, the proposed algorithm reduces the memory required by the advanced stack algorithm by 50% while achieving the same BER performance as STS-SD.
    Signal Processing, Communication and Computing (ICSPCC), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: Real-time constraints pose a new challenge when mapping real-time applications onto a Network-on-Chip. To address this problem, we first propose a new task graph description that enables both computation mapping and communication scheduling. Based on the proposed graph, we then propose a contention- and energy-aware mapping algorithm that eliminates communication conflicts and reduces energy cost, thus delivering higher throughput and lower energy consumption on communication links. In the experiments, we show that different real-time constraints significantly affect the on-chip network, and that our algorithm is able to find a better mapping and scheduling solution for a given real-time application and on-chip network structure. For example, compared with traditional mapping that does not consider timing constraints, our algorithm reduces energy by up to 44% on average. It also improves system throughput by up to 25%.
    SoC Design Conference (ISOCC), 2012 International; 01/2012
  • ABSTRACT: A Pareto-optimal temporal partition methodology was developed for splitting and mapping large data flow graphs (DFGs) onto a coarse-grained reconfigurable architecture (CGRA). A multi-objective genetic algorithm (MOGA) derived from the SPEA-II algorithm was introduced to the temporal partition realm for the first time to simultaneously optimize multiple mutually exclusive objectives. Experiments carried out on the ESL (electronic system level) model of the REmus processor show that the MOGA-based temporal partition algorithm is superior to the heuristic algorithm, reducing execution delay by 5%-28% and communication overheads by 16%-37% without degrading resource efficiency. Furthermore, comparisons with a weight-based multi-objective simulated annealing algorithm show that the Pareto-optimal algorithm achieves a slightly better latency objective (3%) while decreasing communication overheads by at most 21%, without worsening resource efficiency.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International; 01/2012
  • Wei Jin, Sheng Lu, Weifeng He, Zhigang Mao
    ABSTRACT: As a major sequential logic element, the D flip-flop is an indispensable cell in a logic cell library. In this paper, we propose two improved sub-threshold D flip-flop circuits (the mTGMS and emC2MOS D flip-flops) after conducting a robustness analysis of several typical flip-flop circuits. Using SMIC 0.18 µm CMOS technology, the simulation results show that the minimum operating voltages of the proposed mTGMS and emC2MOS are 0.19 V and 0.18 V, the minimum average powers are 13.2 pW and 14.1 pW, and the minimum Power-Delay Products (PDP) are 13 aJ and 4.35 aJ, respectively.
    IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, VLSI-SoC 2011, Kowloon, Hong Kong, China, October 3-5, 2011; 01/2011
  • ABSTRACT: High-speed signal transmission over on-chip global interconnects requires elaborate transceiver design. The design goal of the transceiver is to improve latency and power, the two most important factors in high-speed transmission. In this paper, we present an efficient structured transceiver that implements low-swing technology based on a differential structure, with accurate modeling of the on-chip global interconnect. We give the principles of the structure and compare the optimized simulation results with the traditional inverter insertion method used to decrease the delay of the global interconnect. Our transceiver design is based on 90 nm CMOS technology and the TSMC 90 nm interconnect structure on Metal 5. The global interconnect length we focus on is the typical length of 10 mm. Compared to repeater insertion, this system has a latency advantage of 17% and a power advantage of up to 33.9%.
    01/2011;
  • ABSTRACT: This paper studies SEU (Single Event Upset) faults in SRAM-based FPGAs. Considering detailed fault behavior on various circuit elements in a post-layout FPGA application, we develop a simulation-based SEU evaluation tool that quantifies the fault contribution of each configuration bit. Using this tool and MCNC benchmark circuits, we study the fault characteristics of FPGA circuits and architectures. We show that interconnects not only contribute the lion's share of functional failures, but also have a higher failure rate per configuration bit than LUTs. In particular, multiplexers in local interconnects have the highest failure rate per bit. We find that tuning LUT and cluster sizes helps to reduce the rate (by up to 38% in our experiments). In addition, we evaluate two recent fault mitigation algorithms, IPD and IPF, which reduce LUT faults by an average of 74% and 15%, respectively. However, when interconnects are taken into account, the reduction via IPD, which considers only LUT faults, is merely 6% at chip level, whereas the reduction via IPF, which implicitly considers interconnect faults, is still around 15%. Therefore, synthesis algorithms should be evaluated with interconnect faults, and future algorithms should be developed with explicit consideration of interconnect faults.
    International Conference on Field Programmable Logic and Applications, FPL 2011, September 5-7, Chania, Crete, Greece; 01/2011
  • Wei Jin, Sheng Lu, Weifeng He, Zhigang Mao
    ABSTRACT: A customized design flow for an ultra-low-power cell library and an 8-bit ultra-low-power microprocessor for wireless sensor network applications are presented in this paper. According to the logic pre-synthesis results of the 8-bit microprocessor HDL code, frequently used standard cells are collected to develop a customized sub-threshold cell library through size and structure modifications. The final transistor-level netlist of the processor is generated from the RTL code re-synthesis results through cell substitution with the sub-threshold cell library. SPICE simulation results show that our 8-bit microprocessor can work at a supply voltage as low as 230 mV over the full temperature range at all technology corners, consuming only 79 nW at a frequency of 10 kHz at 230 mV and room temperature. As a result, the proposed microprocessor provides a feasible solution for emerging energy-constrained applications.
    IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, VLSI-SoC 2011, Kowloon, Hong Kong, China, October 3-5, 2011; 01/2011
  • Jianpeng Yu, Jun Ma, Zhigang Mao
    ABSTRACT: A novel soft extension of the fixed-complexity sphere decoding (FSD) algorithm is proposed, taking the Soft FSD (SFSD) algorithm as a starting point. SFSD suffers from the multiple-tree-search problem and the so-called iteration problem, that is, its performance degrades as the number of receiver iterations increases. The proposed parallel SFSD (PSFSD) performs a single tree search by taking partial best nodes to generate counter-hypotheses. To make use of a priori information in the tree search and overcome the iteration problem, a soft-hard combination (SHC) enumeration method is presented as a low-complexity algorithm to find the best child with the prior term added. The proposed scheme is suitable for hardware implementation and can be easily extended to other tree search schemes.
    Proceedings of the IEEE Workshop on Signal Processing Systems, SiPS 2011, October 4-7, 2011, Beirut, Lebanon; 01/2011
  • ABSTRACT: The quadratic permutation polynomial (QPP) interleaver is a contention-free interleaver that is suitable for parallel turbo decoder implementation. In this paper, a systematic recursive method for designing a configurable QPP interleaving multistage network is proposed based on the properties of QPP. Due to the nature of recursion, the proposed network for a 2^n-level parallel turbo decoder can be used for any 2^i parallelism (0 < i ≤ n) without the need to redesign an additional network for each level of parallelism. The address generator is modified to provide control signals to the network. Furthermore, the proposed QPP architecture is generalized to support arbitrary contention-free interleavers by appending an additional specially designed network. When the whole network is used in a multi-standard design, the appended network can be turned off in QPP interleaver mode to reduce dynamic power by more than 49% for parallelism greater than 16.
    Proceedings of the IEEE Workshop on Signal Processing Systems, SiPS 2011, October 4-7, 2011, Beirut, Lebanon; 01/2011
  • Naifeng Jing, Weifeng He, Zhigang Mao
    IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, VLSI-SoC 2011, Kowloon, Hong Kong, China, October 3-5, 2011; 01/2011
  • ABSTRACT: A Simulated Annealing Genetic Algorithm (SAGA) with a greedy mapping mechanism is developed to solve task partitioning problems in coarse-grained reconfigurable systems. A fitness function combining multiple objectives (communication cost, number of partitions, and number of bypass nodes) is constructed to optimize the execution time. Experimental results show that SAGA produces better solutions than traditional level-based or clustering-based partitioning algorithms. The operation time saved is up to 5% compared with the level-based algorithm, and critical parameters such as communication cost and number of partitions are reduced by 10% on average.
    01/2011;
  • ABSTRACT: Modern SRAM-based FPGAs (Field Programmable Gate Arrays) use multiplexer-based unidirectional routing, and the SRAM configuration cells in these multiplexers contribute the majority of soft errors in FPGAs. In this paper, we formulate an In-Placed inVersion (IPV) on LUT (Look-Up Table) logic polarities to reduce the Soft Error Rate (SER) at chip level, and reveal the locality and NP-hardness of the IPV problem. We then develop an exact algorithm based on binary integer linear programming (ILP) and a heuristic based on simulated annealing (SA), both enabled by the locality. We report results for the 10 largest MCNC combinational benchmarks, synthesized by ABC and then placed and routed by VPR. The results show that IPV obtains close to 4x chip-level SER reduction on average, and that SA is highly effective, obtaining the same SER reduction as ILP. A recent work, IPD, has the largest LUT-level SER reduction in the literature, 2.7x, but its chip-level SER reduction is merely 7% due to the dominance of interconnects. In contrast, SA-based IPV obtains nearly 4x chip-level SER reduction and runs 30x faster. Furthermore, combining IPV and IPD leads to a chip-level SER reduction of 5.3x. This does not change placement and routing, and does not affect design closure. To the best of our knowledge, our work is the first in-depth study of SER reduction for modern multiplexer-based FPGA routing by in-placed logic re-synthesis.
    2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, California, USA, November 7-10, 2011; 01/2011
  • ABSTRACT: The reliability of SRAM-based Field Programmable Gate Arrays (FPGAs) is susceptible to Single Event Upset (SEU) faults. To investigate the impact of these faults, particularly faults in interconnects, on FPGA functionality, this paper proposes an SEU fault analysis framework that evaluates faults with a unified metric. This metric, termed criticality, quantifies the sensitivity of FPGA functional failure to SEU faults on logic and interconnect configuration bits. Considering post-layout information, our framework can characterize SEU faults with respect to different FPGA architectures and CAD algorithms, so that the sensitivity of FPGA functional failure can be investigated in detail during the design phase. The experimental results show quantitatively that the configuration bits in interconnects dominate those in LUTs by several times, both in bit count and in criticality contribution. The ratio of their criticalities is even higher when the LUT input size increases from 4 to 6. The higher criticality of interconnects relative to LUTs is due to their inherent sensitivity to functional failure rather than their larger number of bits. In addition, it is shown that, among the three common types of switch boxes, the Subset switch box is less fault tolerant than Wilton and Universal.
    Proceedings of the ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, FPGA 2011, Monterey, California, USA, February 27 - March 1, 2011; 01/2011
  • ABSTRACT: Reconfigurable computing arrays combine flexibility with high performance for regular and computation-intensive algorithms in multimedia processing. However, the efficiency of irregular and control-intensive algorithms becomes the performance bottleneck of reconfigurable multimedia systems. In this paper, we propose the design and VLSI implementation of a novel memory-efficient macroblock prediction and boundary strength (Bs) calculation engine. The control-intensive algorithms, including intra mode prediction, motion vector prediction, and Bs calculation, are implemented with a 4x4 block-level pipeline to achieve real-time decoding for the H.264/AVC high profile and the Chinese AVS Jizhun profile. Compared with existing designs, our design achieves a 60% reduction in registers for neighboring block load and update. Implementation results indicate that the proposed architecture can support 1920x1088@30fps H.264 and AVS decoding at 86 MHz.
    International Symposium on Circuits and Systems (ISCAS 2011), May 15-19 2011, Rio de Janeiro, Brazil; 01/2011
  • Li Xie, Weifeng He, Naifeng Jing, Zhigang Mao
    ABSTRACT: This paper presents a task-level mapping flow for a coarse-grained dynamically reconfigurable array processor based on static thermal-aware mapping techniques. The flow is composed of a front-end SUIF tool, a temporal partitioning algorithm, thermal-aware sub-graph mapping algorithms, and a back-end RAM compiler, and automatically compiles an HLL task into binary code for the processor. Using a compact thermal model, the temperature distribution of each task sub-graph on the reconfigurable RC array is pre-estimated statically. The runtime sequence of all task sub-graphs is then generated with a random searching algorithm to balance the reconfigurable array's temperature. Experimental results show that the average maximum temperature and the temperature distribution range can be reduced by about 6.3°C and 12°C, respectively.
    International Symposium on Circuits and Systems (ISCAS 2011), May 15-19 2011, Rio de Janeiro, Brazil; 01/2011
  • Naifeng Jing, Weifeng He, Zhigang Mao
    ABSTRACT: This paper proposes a heuristic algorithm for the resource-constrained mapping problem in coarse-grained reconfigurable computing, in which the partition number and communications are co-optimized to improve application performance. Our approach modifies the network flow algorithm with an embedded customized mapping procedure to satisfy the micro-architecture resource constraints. Additionally, we use integer linear programming to establish an optimal baseline for the problem. Our algorithms reformulate the cost function after identifying some flaws in the previous literature. The experimental results qualify the benefit gained by our proposed approach.
    Annual IEEE International SoC Conference, SoCC 2010, September 27-29, 2010, Las Vegas, NV, USA, Proceedings; 01/2010
  • ABSTRACT: To exploit the advantages in on-chip communication introduced by Network-on-Chip, many optimization algorithms have been proposed for joint optimization of power and performance in communication mapping and routing. However, the optimality of the solutions produced by these algorithms has been neglected in previous studies. To address this problem, this paper proposes, for the first time, an early estimation approach to evaluate the optimality of the solutions. This approach is based on a statistical property: the overall solutions in the solution space conform to a quasi-Gaussian distribution, which can be predicted from two parameters with a computational complexity of O(n^4), as presented. The generality of our proposed approach makes it extensible to other on-chip network options. Experiments on real and synthetic application benchmarks demonstrate an average error ratio of less than 7%, which tends to be even smaller as the problem scales up. These results validate our early estimation approach to optimality evaluation as credible and efficient, boosting its utility in Network-on-Chip design.
    Integration, the VLSI Journal. 01/2010;
  • Xu Wang, Weifeng He, Zhigang Mao
    ABSTRACT: In this paper, two novel structures for 200 mV, 0.18 µm sub-threshold full adders are proposed for wireless sensor network nodes and medical electronics. They use a three-state gate to improve the transition time and drivability of the carry-out signal. Simulation results show that the transition time of the proposed structure using the three-state gate is 60% of that of the old structure using a transmission gate. The proposed full adders are employed in an 8-bit ripple carry adder, and the simulation results exhibit a 20% power saving and a 34% Power-Delay Product (PDP) saving compared to a typical CMOS adder.
    01/2010;