Zhibin Xiao

University of California, Davis, Davis, California, United States

Publications (10) · 4.88 Total impact

  • Zhibin Xiao, B. Baas
    ABSTRACT: 2-Dimensional meshes are the most commonly used Network-on-Chip (NoC) topology for on-chip communication in many-core processor arrays due to their low complexity and excellent match to rectangular processor tiles. However, 2D meshes may incur local traffic congestion for applications with significant levels of traffic with non-neighboring cores, resulting in long latencies and high power consumption. In this paper, we propose an 8-neighbor mesh topology and a 6-neighbor topology with hexagonal-shaped processor tiles. A 16-bit DSP processor and the corresponding processor arrays are implemented in all three topologies. The hexagonal processor tile and arrays of tiles are laid out using industry-standard CAD tools and automatic place and route flow without full-custom design, and result in DRC-clean and LVS-clean layout. A 1080p H.264/AVC residual video encoder and a 54 Mbps 802.11a/11g OFDM wireless LAN baseband receiver are mapped onto all topologies. The 6-neighbor hexagonal grid topology incurs a 2.9% area increase per tile compared to the 4-neighbor 2D mesh, but its much more effective inter-processor interconnect yields an average total application area reduction of 21%, an average power reduction of 17%, and a total application inter-processor communication distance reduction of 19%.
    VLSI and System-on-Chip (VLSI-SoC), 2012 IEEE/IFIP 20th International Conference on; 01/2012
  • Zhibin Xiao, Stephen Le, Bevan Baas
    ABSTRACT: The emerging many-core architecture provides a flexible solution for rapidly evolving multimedia applications that demand both high performance and high energy efficiency. However, developing parallel multimedia applications that can efficiently harness and utilize many-core architectures is the key challenge for scalable computing. We contribute to this challenge by presenting a fully-parallel H.264/AVC baseline encoder on a 167-core asynchronous array of simple processors (AsAP) computation platform. By exploiting fine-grained data and task level parallelism in the algorithms, we partition and map the dataflow of the H.264/AVC encoder to an array of 115 small processors coupled with two shared memories and a hardware accelerator for motion estimation. The proposed parallel H.264/AVC encoder is capable of encoding video sequences with variable frame sizes. The encoder presented is capable of encoding VGA (640 × 480) video at 21 frames per second (fps) with 931 mW average power consumption by adjusting each processor to workload-based optimal clock frequencies and dual supply voltages with less than 1 dB loss in resolution.
    Circuits, Systems and Computers, 1977. Conference Record. 1977 11th Asilomar Conference on 01/2011;
  • Zhibin Xiao, Bevan M. Baas
    ABSTRACT: This paper presents a baseline residual encoder for H.264/AVC on a programmable fine-grained many-core processing array that utilizes no application-specific hardware. The software encoder contains integer transform, quantization, and context-based adaptive variable length coding functions. By exploiting fine-grained data and task-level parallelism, the residual encoder is partitioned and mapped to an array of 25 small processors. The proposed encoder encodes video sequences with variable frame sizes and can encode 1080p high-definition television at 30 f/s with 293 mW average power consumption by adjusting each processor to workload-based optimal clock frequencies and dual supply voltages, a 38.4% power reduction compared to operation with only one clock frequency and supply voltage. In comparison to published implementations on the TI C642 digital signal processing platform, the design has approximately 2.9-3.7 times higher scaled throughput, 11.2-15.0 times higher throughput per chip area, and 4.5-5.8 times lower energy per pixel. Compared to a heterogeneous single instruction, multiple data architecture customized for H.264, the presented design has 2.8-3.6 times greater throughput, 4.5-5.9 times higher area efficiency, and similar energy efficiency. The proposed fine-grained parallelization methodology provides a new approach to programming a large number of simple processors, allowing for a higher level of parallelization and energy efficiency for video encoding than conventional processors while avoiding the cost and design time of implementing an application-specific integrated circuit or other application-specific hardware.
    IEEE Transactions on Circuits and Systems for Video Technology 01/2011; 21:890-902. · 1.82 Impact Factor
  • ABSTRACT: A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories; and is implemented in 65 nm CMOS. All processors and shared memories are clocked by local fully independent, dynamically haltable, digitally-programmable oscillators and are interconnected by a configurable circuit-switched network which supports long-distance communication. Programmable processors occupy 0.17 mm² and operate at a maximum clock frequency of 1.2 GHz at 1.3 V. At 1.2 V, they operate at 1.07 GHz and consume 47.5 mW when 100% active, resulting in an energy dissipation of 44 pJ per operation. At 0.675 V, they operate at 66 MHz and consume 608 μW when 100% active, resulting in a total energy dissipation of 9.2 pJ per ALU or MAC operation.
    IEEE Journal of Solid-State Circuits 01/2009; 44(4):1130-1144. · 3.06 Impact Factor
  • Zhibin Xiao, B. Baas
    ABSTRACT: This paper presents a high-performance parallel context-based adaptive variable length coding (CAVLC) encoder implemented on a fine-grained many-core system. The software encoder is designed for an H.264/AVC baseline profile encoder. By utilizing arithmetic table elimination and compression techniques, the data-flow of the CAVLC encoder has been partitioned and mapped to an array of 15 small processors. The parallel workload of each processor is characterized and balanced for further throughput optimization. The proposed parallel CAVLC encoder achieves the real-time processing requirement of 30 frames per second for 720p HDTV. Our experiments show that the presented CAVLC encoder has 4.86 to 6.83 times higher throughput and requires far smaller chip area than the identical encoder implemented on state-of-the-art general-purpose processors. In comparison to published implementations on common DSP processors, the design has approximately 1.0 to 6.15 times higher throughput while requiring less than 6 times smaller area.
    Computer Design, 2008. ICCD 2008. IEEE International Conference on; 11/2008
  • 08/2008
  • 08/2008
  • ABSTRACT: A 167-processor 65 nm computational platform well suited for DSP, communication, and multimedia workloads contains 164 programmable processors with dynamic supply voltage and dynamic clock frequency circuits, three algorithm-specific processors, and three 16 KB shared memories, all clocked by independent oscillators and connected by configurable long-distance-capable links.
    VLSI Circuits, 2008 IEEE Symposium on; 07/2008
  • ABSTRACT: With deep-submicron technology nodes, the traditional scaling factors that held in the pre-submicron era no longer apply, and other methods are needed to obtain scaling factors. This work presents scaling factors among major technology nodes from 180 nm to 22 nm operating at voltages from 1.8 V to 0.7 V. Common operating data for these technologies were taken from the International Technology Roadmap for Semiconductors (ITRS). HSpice simulations that rely on the Predictive Technology Model (PTM) for transistor characteristics were used to find the scaling factors.
  • Zhibin Xiao, Stephen Le, Bevan Baas
    ABSTRACT: The emerging many-core architecture provides a flexible solution for rapidly evolving multimedia applications that demand both high performance and high energy efficiency. However, developing parallel multimedia applications that can efficiently harness and utilize many-core architectures is the key challenge for scalable computing. We contribute to this challenge by presenting a fully-parallel H.264/AVC baseline encoder on a 167-core asynchronous array of simple processors (AsAP) computation platform. By exploiting fine-grained data and task level parallelism in the algorithms, we partition and map the dataflow of the H.264/AVC encoder to an array of 115 small processors coupled with two shared memories and a hardware accelerator for motion estimation. Due to the large number of independent processors available, the video encoding process can be divided into three main stages: prediction, entropy encoding, and reconstruction, with the entropy encoding and reconstruction stages done in parallel and pipelined with the prediction stage. Within each stage, each independent procedure is mapped onto an individual processor for greater parallelization and efficiency. The proposed parallel H.264/AVC encoder is capable of encoding video sequences with variable frame sizes. The preliminary implementation is capable of encoding CIF (352 × 288) video at 54 frames per second (fps) with 925 mW average power consumption by adjusting each processor to workload-based optimal clock frequencies and dual supply voltages with less than 1 dB loss in resolution.
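
The energy-per-operation figures quoted in the 167-processor platform abstract (IEEE Journal of Solid-State Circuits, 2009) can be cross-checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes one operation per clock cycle at 100% activity, and it reads the low-voltage power figure as 608 μW, which is the reading consistent with the quoted 9.2 pJ. The function name `energy_per_op_pj` is hypothetical.

```python
def energy_per_op_pj(power_w: float, freq_hz: float) -> float:
    """Energy per operation in picojoules.

    Assumption (not stated in the abstract): the processor completes
    one operation per clock cycle when 100% active.
    """
    return power_w / freq_hz * 1e12


# Operating points quoted in the abstract:
# (supply voltage in V, clock frequency in Hz, power in W at 100% activity).
# The 0.675 V power is assumed to be 608 microwatts.
operating_points = [
    (1.2, 1.07e9, 47.5e-3),
    (0.675, 66e6, 608e-6),
]

for vdd, freq, power in operating_points:
    print(f"{vdd} V: {energy_per_op_pj(power, freq):.1f} pJ per operation")
```

Under these assumptions the two operating points come out at about 44.4 pJ and 9.2 pJ per operation, in line with the 44 pJ and 9.2 pJ stated in the abstract.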