Zhiyi Yu

Fudan University, Shanghai, Shanghai Shi, China

Publications (12) · 11.89 Total impact

  • Article: A Low-Area Multi-Link Interconnect Architecture for GALS Chip Multiprocessors
    Zhiyi Yu, B.M. Baas
    ABSTRACT: A new inter-processor communication architecture for chip multiprocessors is proposed which has a low area cost, flexible routing capability, and supports globally asynchronous locally synchronous (GALS) clocking styles. To achieve a low area cost, the proposed statically-configurable asymmetric architecture assigns large buffer resources to only the nearest neighbor interconnect and much smaller buffer resources for long distance interconnect. To maintain flexible routing capability, each neighboring processor pair has multiple connecting links. The architecture supports long distance communication in GALS systems by transferring the source clock with the data signals along the entire path for write synchronization. Compared to a traditional dynamically-configurable interconnect architecture with symmetric buffer allocation and single-links between neighboring processor pairs, this implementation has approximately two times smaller communication circuitry area with a similar routing capability. Area and speed estimates are obtained with the physical design of seven chips in 0.18-µm CMOS.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 06/2010; · 1.22 Impact Factor
  • Article: A 167-Processor Computational Platform in 65 nm CMOS
    ABSTRACT: A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories; and is implemented in 65 nm CMOS. All processors and shared memories are clocked by local fully independent, dynamically haltable, digitally-programmable oscillators and are interconnected by a configurable circuit-switched network which supports long-distance communication. Programmable processors occupy 0.17 mm² and operate at a maximum clock frequency of 1.2 GHz at 1.3 V. At 1.2 V, they operate at 1.07 GHz and consume 47.5 mW when 100% active, resulting in an energy dissipation of 44 pJ per operation. At 0.675 V, they operate at 66 MHz and consume 608 µW when 100% active, resulting in a total energy dissipation of 9.2 pJ per ALU or MAC operation.
    IEEE Journal of Solid-State Circuits 05/2009; · 3.23 Impact Factor
  • Article: A 167-Processor Computational Platform in 65 nm CMOS
    Solid-State Circuits, IEEE Journal of. 04/2009; 44:1130-1144.
  • Article: High Performance, Energy Efficiency, and Scalability With GALS Chip Multiprocessors
    Zhiyi Yu, B.M. Baas
    ABSTRACT: Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally-intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to use clock and voltage scaling jointly in system submodules to achieve high energy efficiencies, and can also result in easily scalable clocking systems. However, its use typically also introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-in first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show it has a less than 1% performance (throughput) reduction on average compared to the corresponding synchronous system for many DSP workloads. Furthermore, adaptive clock and voltage scaling for each processor provides an approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which compared to the corresponding synchronous uniprocessor, has a reported greater than 10% performance (throughput) reduction and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general purpose applications.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 02/2009; · 1.22 Impact Factor
  • Conference Proceeding: A 167-processor 65 nm computational platform with per-processor dynamic supply voltage and dynamic clock frequency scaling
    ABSTRACT: A 167-processor 65 nm computational platform well suited for DSP, communication, and multimedia workloads contains 164 programmable processors with dynamic supply voltage and dynamic clock frequency circuits, three algorithm-specific processors, and three 16 KB shared memories, all clocked by independent oscillators and connected by configurable long-distance-capable links.
    VLSI Circuits, 2008 IEEE Symposium on; 07/2008
  • Conference Proceeding: A low-area interconnect architecture for chip multiprocessors
    Zhiyi Yu, B.M. Baas
    ABSTRACT: A new inter-processor communication architecture for chip multiprocessors is proposed which has a low area cost and flexible routing capability. To achieve a low area cost, the proposed statically-configurable asymmetric architecture assigns large buffer resources only to the nearest neighbor interconnect and much smaller buffer resources for long distance interconnect. To maintain flexible routing capability, each neighboring processor pair has two connecting links. Compared to a traditional dynamically-configurable interconnect architecture with symmetric buffer allocation and single-links between neighboring processor pairs, this implementation has approximately 2 times smaller communication circuitry area with a similar routing capability. Area and speed estimates are obtained with the physical design of seven chips in 0.18 µm CMOS.
    Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on; 06/2008
  • Article: AsAP: An Asynchronous Array of Simple Processors
    ABSTRACT: An array of simple programmable processors is implemented in 0.18 µm CMOS and contains 36 asynchronously clocked independent processors. Each processor occupies 0.66 mm² and is fully functional at a clock rate of 520-540 MHz at 1.8 V and over 600 MHz at 2.0 V. Processors dissipate an average of 32 mW under typical conditions at 1.8 V and 475 MHz, and 2.4 mW at 0.9 V and 116 MHz while executing applications such as a JPEG encoder core and a fully compliant IEEE 802.11a/g wireless LAN baseband transmitter.
    IEEE Journal of Solid-State Circuits 04/2008; · 3.23 Impact Factor
  • Article: A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary and Haltable Clock Domains
    ABSTRACT: A robust, scalable, and power efficient dual-clock first-in first-out (FIFO) architecture which is useful for transferring data between modules operating in different clock domains is presented. The architecture supports correct operation in applications where multiple clock cycles of latency exist between the data producer, FIFO, and the data consumer; and with arbitrary clock frequency changes, halting, and restarting in either or both clock domains. The architecture is demonstrated in both a 0.18-µm CMOS full-custom design and a 0.18-µm CMOS standard cell design used in a globally asynchronous locally synchronous array processor. It achieves 580-MHz operation and 10.3-mW power dissipation while performing simultaneous FIFO read and write operations at 1.8 V.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 11/2007; · 1.22 Impact Factor
  • Conference Proceeding: Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles
    Zhiyi Yu, B. Baas
    ABSTRACT: This paper investigates implementation techniques for tile-based chip multiprocessors with Globally Asynchronous Locally Synchronous (GALS) clocking styles. These architectures can simplify the physical design flow since they allow focusing on a single processor when designing an entire chip. However, they also introduce challenges to maintain system robustness and scalability. We propose a physical design flow for these architectures, investigate timing issues for robust implementations, and propose methods to take full advantage of their potential scalability. As a design example, we present data from a recently implemented single-chip 6 x 6 tile-based GALS processing array.
    Computer Design, 2006. ICCD 2006. International Conference on; 11/2007
  • Article: AsAP: A Fine-Grained Many-Core Platform for DSP Applications
    ABSTRACT: Many emerging and future applications require significant levels of complex digital signal processing and operate within limited power budgets. Moreover, dramatically rising VLSI fabrication and design costs make programmable and reconfigurable solutions increasingly attractive. The AsAP project addresses these challenges with a chip multiprocessor composed of simple processors with small memories, achieving high energy efficiency and throughput in a small chip area.
    IEEE Micro 04/2007; · 1.78 Impact Factor
  • Conference Proceeding: Performance and power analysis of globally asynchronous locally synchronous multiprocessor systems
    Zhiyi Yu, B.M. Baas
    ABSTRACT: This paper investigates the performance and power dissipation of globally asynchronous locally synchronous (GALS) multi-processor systems. We show that communication loops are a source of significant throughput degradation in communications links and that there is no degradation whatsoever under certain conditions for one-way links, and that it is possible to design GALS multiprocessors without this performance penalty. Independent clock domains and unbalanced computation in the GALS multiprocessor allow scaling of the clock frequency and supply voltage to achieve high energy efficiency. The synchronization overhead between independent clock domains results in a less than 1% performance reduction compared to a globally synchronous system over a number of DSP and numerical applications. Clock and voltage scaling can achieve an approximately 40% power savings with no reduction of performance. These results compare favorably with the 25% power savings and more than 10% performance reduction reported for GALS uniprocessors.
    Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on; 04/2006
  • Conference Proceeding: An asynchronous array of simple processors for DSP applications
    ABSTRACT: An array of simple programmable processors designed for DSP applications is implemented in 0.18 µm CMOS and contains 36 asynchronously clocked independent processors. The processors operate at 475 MHz, and each processor has a maximum power of 144 mW at 1.8 V and occupies 0.66 mm².
    Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International; 03/2006