Zhigang Mao

Shanghai Jiao Tong University, Shanghai, Shanghai Shi, China

Publications (51) · 10.8 Total impact

  • Zhiting Yan · Guanghui He · Yifan Ren · Weifeng He · Jianfei Jiang · Zhigang Mao ·
    ABSTRACT: This paper proposes a flexible dual-mode soft-output multiple-input multiple-output (MIMO) detector to support open-loop and closed-loop operation in the Chinese enhanced ultra high throughput (EUHT) wireless local area network (LAN) standard. The proposed detector uses minimum mean square error (MMSE) sorted QR decomposition (MMSE-SQRD) to produce the channel preprocessing result, which is realized by a modified systolic array architecture with concurrent sorting. Moreover, the adopted square-root MMSE algorithm for closed-loop reuses the MMSE-SQRD preprocessing to largely save hardware overhead. In addition, an optimized K-Best detection algorithm is proposed for open-loop, which increases throughput by odd-even parallel sorting and produces high-quality soft output with discarded paths (DPs). A flexible VLSI architecture is designed for the proposed dual-mode detector, which supports 1×1 to 4×4 antennas and BPSK to 64-QAM modulation configurations. Implemented in SMIC 65 nm CMOS technology, the detector is capable of running at 550 MHz, with a maximum throughput of 2.64 Gb/s for K-Best detection and 3.3 Gb/s for linear MMSE detection. The proposed detector is competitive with recently published works and meets the data-rate requirement of the EUHT standard.
    Circuits and Systems I: Regular Papers, IEEE Transactions on 09/2015; 62(11):1-12. DOI:10.1109/TCSI.2015.2479055 · 2.40 Impact Factor
  • Zhiting Yan · Guanghui He · Weifeng He · Shuaijie Wang · Zhigang Mao ·
    ABSTRACT: In this paper, a high performance parallel turbo decoder is designed to support the 188 block sizes in the 3rd Generation Partnership Project (3GPP) long term evolution (LTE) standard. A novel configurable quadratic permutation polynomial (QPP) multistage network and address generator are proposed to reduce the complexity of interleaving. This 2^n-input network can be configured to support any 2^i-input network. Furthermore, it can flexibly support arbitrary contention-free interleavers by cascading an additional specially designed network. In addition, an optimized decoding schedule scheme is presented to reduce the performance loss caused by high parallelism. The memory architecture and address mapping method are optimized to avoid memory access contention for small blocks. Moreover, a dual-mode add–compare–select (ACS) unit implementing both radix-2 and radix-4 recursion is proposed to support the block sizes that are not divisible by 16. Implemented in 130 nm CMOS technology, the design achieves 384.3 Mbps peak throughput at a clock rate of 290 MHz with 5.5 iterations. Consuming 4.02 mm² core area and 716 mW power, the decoder has a 1.81 bits/cycle/iteration/mm² architecture efficiency and a 0.34 nJ/bit/iteration energy efficiency, which is competitive with other recent works.
    Integration the VLSI Journal 06/2015; DOI:10.1016/j.vlsi.2015.05.003 · 0.66 Impact Factor
  • Jing Guo · Liyi Xiao · Tianqi Wang · Shanshan Liu · Xu Wang · Zhigang Mao ·
    ABSTRACT: Radiation-induced single event upsets (SEUs), or soft errors, have become a dominant factor in the reliability degradation of nanoscale memories. In this paper, based on the SEU physics mechanism, and reasonable layout-topology, a novel soft error hardened memory cell is proposed in 65 nm Complementary Metal Oxide Semiconductor (CMOS) technology. The design comparisons for several hardened memory cells in terms of access time (read access time and write access time), power consumption, and layout area are also executed. The main advantage of the proposed cell is that it can provide 100% fault tolerance, which is very useful for memory applications in severe radiation environments. Furthermore, Monte Carlo simulations are carried out to evaluate the effects of process, voltage, and temperature (PVT) variations. From simulations, we confirmed that the proposed cell has exhibited a sufficient multiple-node upset tolerance capability even under PVT variations.
    IEEE Transactions on Reliability 06/2015; 64(2):1-7. DOI:10.1109/TR.2015.2410275 · 1.93 Impact Factor
  • Jianfei Jiang · Weifeng He · Jizeng Wei · Qin Wang · Zhigang Mao ·
    ABSTRACT: On-chip global wires are a speed and power bottleneck in state-of-the-art chips. The AC coupling technique is an efficient way to reduce interconnection delay and power. This paper proposes a new capacitive-resistively driven AC coupling global link. The bandwidth performance of the proposed wire is analyzed and an optimization algorithm for the capacitive-resistively driven wire is presented. Simulation results show that our optimization methodology can improve the bandwidth. By applying our optimization algorithm, the data rate can be improved from 2 Gb/s to 2.5 Gb/s in the implemented transceiver circuit. The proposed optimization algorithm can be applied in high speed global communication.
    IEICE Electronics Express 04/2015; 12(8):20150111-20150111. DOI:10.1587/elex.12.20150111 · 0.32 Impact Factor
  • Zhiting Yan · Guanghui He · Weifeng He · Zhigang Mao ·
    ABSTRACT: Co-channel interference (CCI) is becoming a challenging factor that causes performance degradation in modern communication systems. A receiver equipped with multiple antennas can suppress such interference by exploiting spatial correlation. However, it is difficult to estimate the spatial covariance matrix (SCM) of CCI accurately with a limited number of known symbols. To address this problem, this paper first proposes an improved SCM estimation method that shrinks the variance of the eigenvalues. In addition, based on breadth-first tree search schemes and improved channel updating, a low complexity iterative detector is presented with channel preprocessing, which not only considers the existence of CCI but also reduces the computational complexity in terms of visited nodes in a search tree. Furthermore, by scaling the extrinsic soft information which is fed back to the input of the detector, the detection performance loss due to the max-log approximation is compensated. Simulation results show that the proposed iterative receiver provides improved signal to interference ratio (SIR) gain with low complexity, which demonstrates that the proposed scheme is attractive for practical implementation.
    IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences 01/2015; E98.A(2):776-782. DOI:10.1587/transfun.E98.A.776 · 0.23 Impact Factor
  • Zhiting Yan · Guanghui He · Xi Chen · Weifeng He · Zhigang Mao ·
    ABSTRACT: A novel bit-interleaved coded modulation with iterative detection and decoding (BICM-IDD) receiver for multiple-input and multiple-output (MIMO) systems using the Max-Log-MAP algorithm is proposed. This receiver improves the detection and decoding performance by an improved turbo principle, in which pre-scaling of the detector's input information is performed at each iteration. From an information theory perspective, the proposed scheme is proved to outperform the traditional iterative architecture. Simulation results show that the proposed receiver significantly reduces the performance loss incurred by the suboptimal Max-Log-MAP detection and decoding algorithm with small additional complexity.
    IEICE Electronics Express 10/2014; 11(21):20140800-20140800. DOI:10.1587/elex.11.20140800 · 0.32 Impact Factor
  • Jing Guo · Liyi Xiao · Zhigang Mao ·
    ABSTRACT: In this paper, a novel low-power and highly reliable radiation hardened memory cell (RHM-12T) using 12 transistors is proposed to provide enough immunity against single event upsets in TSMC 65 nm CMOS technology. The obtained results show that the proposed cell can not only tolerate an upset at any of its sensitive nodes regardless of upset polarity and strength, but also recover from a multiple-node upset induced by charge sharing on the fixed nodes, independent of the stored value. Moreover, the proposed cell has comparable or lower overheads in terms of static power, area and access time compared with previous radiation hardened memory cells.
    Circuits and Systems I: Regular Papers, IEEE Transactions on 07/2014; 61(7):1994-2001. DOI:10.1109/TCSI.2014.2304658 · 2.40 Impact Factor
  • Ziyou Yao · Weifeng He · Liang Hong · Guanghui He · Zhigang Mao ·
    ABSTRACT: High Efficiency Video Coding (HEVC) is a new video coding standard beyond H.264/AVC. In this paper, an area- and throughput-efficient 2-D IDCT/IDST VLSI architecture for the HEVC standard is presented. Adopting the proposed data flow scheduling and shared constant multiplication structure, the architecture supports variable block size IDCT from 4×4 to 32×32 pixels as well as 4×4 pixel IDST. Using 65 nm technology, the synthesis results show that the maximum working frequency is 500 MHz and the architecture's hardware cost is about 145.4K gates. Compared with previous work, our design achieves more than 50% reduction in hardware cost and 66% improvement in throughput efficiency. Experimental results show that the proposed architecture is able to handle real-time HEVC IDCT/IDST of a 4K×2K (4096×2048) @ 30 fps video sequence at 412 MHz on average. Consequently, it offers a cost-effective solution for future UHDTV applications.
    2014 IEEE International Symposium on Circuits and Systems (ISCAS); 06/2014
  • Jing Guo · Liyi Xiao · Zhigang Mao · Qiang Zhao ·
    ABSTRACT: Transient multiple cell upsets (MCUs) are becoming major issues in the reliability of memories exposed to radiation environments. To prevent MCUs from causing data corruption, more complex error correction codes (ECCs) are widely used to protect memory, but the main problem is that they require higher delay overhead. Recently, matrix codes (MCs) based on Hamming codes have been proposed for memory protection. The main issue is that they are double error correction codes and the error correction capabilities are not improved in all cases. In this paper, a novel decimal matrix code (DMC) based on divide-symbol is proposed to enhance memory reliability with lower delay overhead. The proposed DMC utilizes a decimal algorithm to obtain the maximum error detection capability. Moreover, the encoder-reuse technique (ERT) is proposed to minimize the area overhead of extra circuits without disturbing the whole encoding and decoding processes. ERT uses the DMC encoder itself as part of the decoder. The proposed DMC is compared to well-known codes such as the existing Hamming, MC, and punctured difference set (PDS) codes. The obtained results show that the mean time to failure (MTTF) of the proposed scheme is 452.9%, 154.6%, and 122.6% of Hamming, MC, and PDS, respectively. At the same time, the delay overhead of the proposed scheme is 73.1%, 69.0%, and 26.2% of Hamming, MC, and PDS, respectively. The only drawback of the proposed scheme is that it requires more redundant bits for memory protection.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2014; 22(1):127-135. DOI:10.1109/TVLSI.2013.2238565 · 1.36 Impact Factor
  • Yuxiao Ling · Zheng Guo · Zhimin Zhang · Zhigang Mao · Zeleng Zhuang ·
    ABSTRACT: In this paper, we present a block cipher circuit design that resists power analysis. The design combines the usual masking and hiding methods. For XOR, permutation, and other linear layers, masking is used for protection; for the S-box and other non-linear layers, hiding is used because masking them requires a lot of hardware. We completed a hardware implementation and carried out power analysis, and the test results showed that the design has a strong capacity to resist power analysis. In our attack simulation, 200,000 curves were extracted, and the key successfully resisted complete recovery.
    Proceedings of the 2013 Ninth International Conference on Computational Intelligence and Security; 12/2013
  • Jing Guo · Liyi Xiao · Zhigang Mao · Qiang Zhao ·
    ABSTRACT: Error-correcting codes (ECCs) are commonly used to protect static RAM (SRAM) from soft errors induced by particle radiation. The traditional single-error correction, double-error detection (SEC-DED) codes are not enough because transient multiple-cell upsets (MCUs) are becoming major issues in SRAM reliability. One-step majority-logic decodable (OS-MLD) codes that can correct MCUs are a good choice due to low complexity and latency. However, these codes restrict the choice of code-word lengths such that they can't be directly used to protect memories with common word lengths (that is, a power of two). This article proposes novel mixed codes (MCs), which are constructed by doubly transitive invariant (DTI) and Hamming codes, to mitigate MCUs in common memories with lower overheads and higher code rates. For the Hamming codes, the encoder-reuse technique (ERT) is used to minimize area overhead without disturbing the whole encoding and decoding process. In addition, the puncturing technique is used to increase the code rates of the proposed codes. As an application example, the authors evaluate a (64, 42) double-error correction (DEC) MC and compare it with the existing DEC codes. The results show that the proposed MC with higher code rate can not only effectively mitigate MCUs in memories but also reduce the overheads of the extra circuits and memory cells.
    IEEE Micro 11/2013; 33(6):66-74. DOI:10.1109/MM.2013.125 · 1.52 Impact Factor
  • Zheng Tang · Jing Xie · Zhigang Mao ·
    ABSTRACT: Processor architecture design plays an important role in the high performance DSP era, where balancing power consumption against computing ability is always a great concern. In this paper we propose an architecture scheme with a VLIW instruction driven adaptive pipeline coupling technique for a multi-core processor design, to achieve high computing performance with low power consumption. Combined with the loop buffering design and implementation, the scheme is evaluated with a typical DSP application, and the results show that the performance is improved by about 43.4% while the power consumption is reduced by 48.7% on average.
    2013 IEEE 10th International Conference on ASIC (ASICON 2013); 10/2013
  • Jieliang Lu · Qin Wang · Jing Xie · Zhigang Mao ·
    ABSTRACT: The 3D integration technique offers a promising method of overcoming the increasing problems of interconnect wire length and power consumption. In the design of a 3D-IC, the floorplanning algorithm determines the performance of the circuit. In this paper, we present a floorplanning algorithm that considers both the critical wire length and the number of TSVs. MCNC floorplan circuits are used as benchmarks. The results show that the algorithm can reduce the critical wire length by 40.1% on average and reduce the number of TSVs by 24.8% under the same critical length. The algorithm can be widely used in the design of 3D integrated circuits.
    2013 IEEE 10th International Conference on ASIC (ASICON 2013); 10/2013
  • Zhi Yue · Guanghui He · Jiangpeng Li · Jun Ma · Zhigang Mao ·
    ABSTRACT: In this paper, we propose a parallel stack algorithm for MIMO detection to reduce memory and achieve high throughput. Through partitioning the global stack into multiple local stacks and assigning them to each non-leaf layer of the tree, the proposed algorithm performs the best-first strategy on all stacks in parallel. The parallel processing reduces the iteration loops per detection, and the node pruning rule and the leaf enumeration method help decrease the total stack size. Moreover, a Dual-term APP approach is designed for the proposed algorithm to improve BER performance without extra iteration loops. The simulation results demonstrate that the proposed algorithm reduces the minimum required stack size and the average number of iteration loops per detection of advanced stack algorithm by 50% and 30% respectively to achieve the same BER performance with STS-SD for a 4×4 64QAM MIMO system.
    Signal Processing Systems (SiPS), 2013 IEEE Workshop on; 10/2013
  • Haopeng Liu · Weiguang Sheng · Weifeng He · Zhigang Mao ·
    ABSTRACT: Reconfiguration delay seriously degrades a coarse-grained reconfigurable processor's performance because large numbers of cycles are needed to transmit the configuration contexts. Therefore, two delay-hiding techniques, configuration context reuse and differential reconfiguration, are introduced into our REmus coarse-grained reconfigurable processor by exploiting the repeatability and similarity between subsequent configuration contexts. An on-chip Scratchpad Configuration Memory (SCM) of less than 4KB is embedded to buffer the repeated obsolete configuration contexts, which can be retransmitted to the reconfigurable elements much faster than rereading the configuration words from the outer main memory. Furthermore, a partial reconfiguration mechanism is designed to mitigate the transmission overheads of two adjacent or non-adjacent similar configuration contexts by transmitting only the distinct parts of the succeeding context (differential reconfiguration). At the same time, a corresponding compiling scheme for fully utilizing the two features is being designed and integrated into the original task compiler. Preliminary experiments on the REmus processor indicate that a speed-up of at most 35% is achieved over the original design without the two enhancing fabrics.
    2013 IEEE 10th International Conference on ASIC (ASICON 2013); 10/2013
  •
    ABSTRACT: The heavily-threaded data processing demands of streaming multiprocessors (SM) in a GPGPU require a large register file (RF). The fast increasing size of the RF makes the area cost and power consumption unaffordable for traditional SRAM designs in the future technologies. In this paper, we propose to use embedded-DRAM (eDRAM) as an alternative in future GPGPUs. Compared with SRAM, eDRAM provides higher density and lower leakage power. However, the limited data retention time in eDRAM poses new challenges. Periodic refresh operations are needed to maintain data integrity. This is exacerbated with the scaling of eDRAM density, process variations and temperature. Unlike conventional CPUs which make use of multi-ported RF, most of the RFs in modern GPGPU are heavily banked but not multi-ported to reduce the hardware cost. This provides a unique opportunity to hide the refresh overhead. We propose two different eDRAM implementations based on 3T1D and 1T1C memory cells. To mitigate the impact of periodic refresh, we propose two novel refresh solutions using bank bubble and bank walk-through. Plus, for the 1T1C RF, we design an interleaved bank organization together with an intelligent warp scheduling strategy to reduce the impact of the destructive reads. The analysis shows that our schemes present better energy efficiency, scalability and variation tolerance than traditional SRAM-based designs.
    International Symposium on Computer Architecture; 06/2013
  • Zhi Yue · Guanghui He · Jun Ma · Zhigang Mao ·
    ABSTRACT: Multiple-input multiple-output (MIMO) technology can enhance the spectral efficiency significantly at the cost of high detection complexity. The stack algorithm can minimize the average complexity and achieve the optimal performance, but it suffers from the large memory size to store the candidate nodes. In this paper, we propose a memory reduced stack algorithm for soft-output MIMO detection. With the leaf enumeration scheme and parallel hypotheses update method, the proposed algorithm only stores non-leaf nodes in the stack and the leaf nodes are used for updating the soft-output. The proposed node pruning rule can simplify the search process and reduce the memory size further. The simulation results show that the proposed algorithm can reduce the demanded memory size of the advanced stack algorithm by 50% to achieve the same BER performance with the STS-SD for a 4 × 4 64QAM MIMO system.
    Signal Processing, Communication and Computing (ICSPCC), 2013 IEEE International Conference on; 01/2013
  • Naifeng Jing · Ju-Yueh Lee · Zhe Feng · Weifeng He · Zhigang Mao · Lei He ·
    ABSTRACT: Reliability has become an increasingly important concern for SRAM-based field programmable gate arrays (FPGAs). Targeting SEU (single event upset) in SRAM-based FPGAs, this article first develops an SEU evaluation framework that can quantify the failure sensitivity for each configuration bit during design time. This framework considers detailed fault behavior and logic masking on a post-layout FPGA application and performs logic simulation on various circuit elements for fault evaluation. Applying this framework on MCNC benchmark circuits, we first characterize SEUs with respect to different FPGA circuits and architectures, for example, bidirectional routing and unidirectional routing. We show that in both routing architectures, interconnects not only contribute to the lion's share of the SEU-induced functional failures, but also present higher failure rates per configuration bit than LUTs. Particularly, local interconnect multiplexers in logic blocks have the highest failure rate per configuration bit. Then, we evaluate three recently proposed SEU mitigation algorithms, IPD, IPF, and IPV, which are all logic resynthesis-based with little or no overhead on placement and routing. Different fault mitigating capabilities at the chip level are revealed, and it demonstrates that algorithms with explicit consideration for interconnect significantly mitigate the SEU at the chip level, for example, IPV achieves 61% failure rate reduction on average against IPF with about 15%. In addition, the combination of the three algorithms delivers over 70% failure rate reduction on average at the chip level. The experiments also reveal that in order to improve fault tolerance at the chip level, it is necessary for future fault mitigation algorithms to concern not only LUT or interconnect faults, but also their interactions. We envision that our framework can be used to cast more useful insights for more robust FPGA circuits, architectures, and better synthesis algorithms.
    ACM Transactions on Design Automation of Electronic Systems 01/2013; 18(1). DOI:10.1145/2390191.2390204 · 0.69 Impact Factor
  • Bingjing Ge · Naifeng Jing · Weifeng He · Zhigang Mao ·
    ABSTRACT: Real-time constraints pose a new challenge when mapping real-time applications onto a Network-on-Chip. To address this problem, we first propose a new task graph description that enables both computation mapping and communication scheduling. Based on the proposed graph, we then propose a contention- and energy-aware mapping algorithm to eliminate communication conflicts and reduce energy cost, thus delivering higher throughput and lower energy consumption on communication links. In the experiments, we show that different real-time constraints strongly affect the on-chip network, and that our algorithm is able to find a better mapping and scheduling solution for a given real-time application and on-chip network structure. For example, compared to traditional mapping without considering timing constraints, our algorithm reduces the energy by up to 44% on average. It also improves the throughput of the system by up to 25%.
    SoC Design Conference (ISOCC), 2012 International; 01/2012
  • Weiguang Sheng · Weifeng He · Jianfei Jiang · Zhigang Mao ·
    ABSTRACT: A Pareto-optimal temporal partition methodology was developed for splitting and mapping large data flow graphs (DFGs) to the coarse-grained reconfigurable architecture (CGRA). A multi-objective genetic algorithm (MOGA) derived from the SPEA-II algorithm was introduced for the first time to the temporal partition realm for simultaneously optimizing multiple mutually exclusive objectives. Experiments carried out on the ESL (electronic system level) model of the REmus processor show that the MOGA-based temporal partition algorithm is superior to the heuristic algorithm, reducing execution delay by 5%-28% and communication overheads by 16%-37% without degrading resource efficiency. Furthermore, comparisons with a weight-based multi-objective simulated annealing algorithm show that the Pareto-optimal algorithm achieves a slightly better latency objective (3%) while dramatically decreasing the communication overheads by at most 21%, and the resource efficiency does not get worse.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International; 01/2012