ABSTRACT: A new inter-processor communication architecture for chip multiprocessors is proposed which has a low area cost, flexible routing capability, and supports globally asynchronous locally synchronous (GALS) clocking styles. To achieve a low area cost, the proposed statically-configurable asymmetric architecture assigns large buffer resources to only the nearest neighbor interconnect and much smaller buffer resources for long distance interconnect. To maintain flexible routing capability, each neighboring processor pair has multiple connecting links. The architecture supports long distance communication in GALS systems by transferring the source clock with the data signals along the entire path for write synchronization. Compared to a traditional dynamically-configurable interconnect architecture with symmetric buffer allocation and single-links between neighboring processor pairs, this implementation has approximately two times smaller communication circuitry area with a similar routing capability. Area and speed estimates are obtained with the physical design of seven chips in 0.18-μm CMOS.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 06/2010; · 1.22 Impact Factor
ABSTRACT: Traditional uni-core processors have met tremendous challenges to improve their performance and energy efficiency, and to
adapt to deep-submicron fabrication technology. Meanwhile, traditional ASIC implementations are also often prohibitive
due to their inherent inflexibility and high design cost. On the other hand, rapidly advancing fabrication technologies have
enabled the integration of many processors into a single chip, called multi-core processors, and promise a platform with high
performance, high energy efficiency, and high flexibility.
This chapter will discuss the motivations of shifting from traditional IC systems (including uni-core processors and ASIC
implementations) to multi-core processors, investigate the design cases of multi-core processors and their key features, and
look ahead to future work.
ABSTRACT: Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally-intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to use clock and voltage scaling jointly in system submodules to achieve high energy efficiencies, and can also result in easily scalable clocking systems. However, its use typically also introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-in first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show it has a less than 1% performance (throughput) reduction on average compared to the corresponding synchronous system for many DSP workloads. Furthermore, adaptive clock and voltage scaling for each processor provides an approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which, compared to the corresponding synchronous uniprocessor, has a reported greater than 10% performance (throughput) reduction and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general purpose applications.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 02/2009; · 1.22 Impact Factor
ABSTRACT: A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories; and is implemented in 65 nm CMOS. All processors and shared memories are clocked by local fully independent, dynamically haltable, digitally-programmable oscillators and are interconnected by a configurable circuit-switched network which supports long-distance communication. Programmable processors occupy 0.17 mm² and operate at a maximum clock frequency of 1.2 GHz at 1.3 V. At 1.2 V, they operate at 1.07 GHz and consume 47.5 mW when 100% active, resulting in an energy dissipation of 44 pJ per operation. At 0.675 V, they operate at 66 MHz and consume 608 μW when 100% active, resulting in a total energy dissipation of 9.2 pJ per ALU or MAC operation.
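The energy-per-operation figures quoted in this abstract follow from dividing active power by clock frequency. The sketch below reproduces that arithmetic under two assumptions that are not stated in the abstract: each processor completes one operation per clock cycle, and the low-voltage power figure is read as 608 μW. The function name `energy_per_op_pj` is illustrative only.

```python
def energy_per_op_pj(power_watts: float, clock_hz: float,
                     ops_per_cycle: float = 1.0) -> float:
    """Return energy per operation in picojoules.

    Assumes (hypothetically) ops_per_cycle operations complete every cycle,
    so energy per operation = power / (frequency * ops_per_cycle).
    """
    return power_watts / (clock_hz * ops_per_cycle) * 1e12


# Operating points quoted in the abstract (100% active processors).
high = energy_per_op_pj(47.5e-3, 1.07e9)  # 1.2 V, 1.07 GHz, 47.5 mW
low = energy_per_op_pj(608e-6, 66e6)      # 0.675 V, 66 MHz, 608 uW (assumed unit)

print(f"{high:.1f} pJ/op at 1.2 V, {low:.1f} pJ/op at 0.675 V")
```

Under these assumptions the results come out near 44 pJ and 9.2 pJ, matching the rounded values in the abstract.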
ABSTRACT: A 167-processor 65 nm computational platform well suited for DSP, communication, and multimedia workloads contains 164 programmable processors with dynamic supply voltage and dynamic clock frequency circuits, three algorithm-specific processors, and three 16 KB shared memories, all clocked by independent oscillators and connected by configurable long-distance-capable links.
ABSTRACT: A new inter-processor communication architecture for chip multiprocessors is proposed which has a low area cost and flexible routing capability. To achieve a low area cost, the proposed statically-configurable asymmetric architecture assigns large buffer resources only to the nearest neighbor interconnect and much smaller buffer resources for long distance interconnect. To maintain flexible routing capability, each neighboring processor pair has two connecting links. Compared to a traditional dynamically-configurable interconnect architecture with symmetric buffer allocation and single-links between neighboring processor pairs, this implementation has approximately two times smaller communication circuitry area with a similar routing capability. Area and speed estimates are obtained with the physical design of seven chips in 0.18 μm CMOS.
Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on; 06/2008
ABSTRACT: An array of simple programmable processors is implemented in 0.18 μm CMOS and contains 36 asynchronously clocked independent processors. Each processor occupies 0.66 mm² and is fully functional at a clock rate of 520-540 MHz at 1.8 V and over 600 MHz at 2.0 V. Processors dissipate an average of 32 mW under typical conditions at 1.8 V and 475 MHz, and 2.4 mW at 0.9 V and 116 MHz while executing applications such as a JPEG encoder core and a fully compliant IEEE 802.11a/g wireless LAN baseband transmitter.
IEEE Journal of Solid-State Circuits 04/2008; · 3.06 Impact Factor
ABSTRACT: This paper presents the architecture of an Asynchronous Array of simple Processors (AsAP), and evaluates its key architectural features as well as its performance and energy efficiency. The AsAP processor calculates DSP applications with high energy efficiency, is capable of high performance, is easily scalable, and is well suited to future fabrication technologies. It is composed of a 2-D array of simple single-issue programmable processors interconnected by a reconfigurable mesh network. Processors are designed to capture the kernels of many DSP algorithms with very little additional overhead. Each processor contains its own tunable and haltable clock oscillator, and processors operate completely asynchronously with respect to each other in a globally asynchronous locally synchronous (GALS) fashion. A 6 × 6 AsAP array has been designed and fabricated in a 0.18 μm CMOS technology. Each processor occupies 0.66
Journal of Signal Processing Systems 01/2008; 53:243-259. · 0.55 Impact Factor
ABSTRACT: This paper investigates implementation techniques for tile-based chip multiprocessors with Globally Asynchronous Locally Synchronous (GALS) clocking styles. These architectures can simplify the physical design flow since they allow focusing on a single processor when designing an entire chip. However, they also introduce challenges to maintain system robustness and scalability. We propose a physical design flow for these architectures, investigate timing issues for robust implementations, and propose methods to take full advantage of their potential scalability. As a design example, we present data from a recently implemented single-chip 6 x 6 tile-based GALS processing array.
Computer Design, 2006. ICCD 2006. International Conference on; 11/2007
ABSTRACT: A robust, scalable, and power efficient dual-clock first-in first-out (FIFO) architecture which is useful for transferring data between modules operating in different clock domains is presented. The architecture supports correct operation in applications where multiple clock cycles of latency exist between the data producer, FIFO, and the data consumer; and with arbitrary clock frequency changes, halting, and restarting in either or both clock domains. The architecture is demonstrated in both a 0.18-μm CMOS full-custom design and a 0.18-μm CMOS standard cell design used in a globally asynchronous locally synchronous array processor. It achieves 580-MHz operation and 10.3-mW power dissipation while performing simultaneous FIFO read and write operations at 1.8 V.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 11/2007; · 1.22 Impact Factor
ABSTRACT: Many emerging and future applications require significant levels of complex digital signal processing and operate within limited power budgets. Moreover, dramatically rising VLSI fabrication and design costs make programmable and reconfigurable solutions increasingly attractive. The AsAP project addresses these challenges with a chip multiprocessor composed of simple processors with small memories, achieving high energy efficiency and throughput in a small chip area.
ABSTRACT: An array of simple programmable processors designed for DSP applications is implemented in 0.18 μm CMOS and contains 36 asynchronously clocked independent processors. The processors operate at 475 MHz, and each processor has a maximum power of 144 mW at 1.8 V and occupies 0.66 mm².
ABSTRACT: This paper investigates the performance and power dissipation of Globally Asynchronous Locally Synchronous (GALS) multi-processor systems. We show that communication loops are a source of significant throughput degradation in communications links and that there is no degradation whatsoever under certain conditions for one-way links, and that it is possible to design GALS multi-processors without this performance penalty. Independent clock domains and unbalanced computation in the GALS multi-processor allow scaling of the clock frequency and supply voltage to achieve high energy efficiency. The synchronization overhead between independent clock domains results in a less than 1% performance reduction compared to a globally synchronous system over a number of DSP and numerical applications. Clock and voltage scaling can achieve an approximately 40% power savings with no reduction of performance. These results compare favorably with the 25% power savings and more than 10% performance reduction reported for GALS uniprocessors.
2006 IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2006), 2-3 March 2006, Karlsruhe, Germany; 01/2006