-
TACO. 01/2011; 8:16.
-
Journal of Circuits, Systems, and Computers. 01/2010; 19:1817-1834.
-
ABSTRACT: Multithreading and multicore processing are powerful ways to take advantage of parallelism in applications in order to boost a system's performance. However, exploring sufficient parallelism and achieving data locality with low communication overhead are still important research issues in embedded multithreading/multicore design. This paper introduces the design of a fast data switching mechanism between multilevel storage structures in a new multicore architecture. This paper makes several contributions to the development of contemporary sophisticated multimedia applications with advanced standards such as H.264. The first contribution, collaborative-multithreading, tightly unifies reduced instruction set computer and collaborative multithreading digital signal processing (DSP) in order to exploit high parallelism to provide sufficient computing power to applications. Each collaborative thread of our DSP is constructed by a heterogeneous-simultaneously multithreading single instruction, multiple data structure, and four media processing cores, which is connected by a fast switch for providing a fast data exchange mechanism among correlative streams on a thread-level basis. Our second contribution is one-stop streaming processing, which aims to keep data in the system for as long as possible until it is no longer needed, thus making data more efficient to access. Our third contribution is a chunk threading programming model, including a thread management library and threading communication directives for reducing data communication and synchronization overhead. By a combination of coarse-grained and fine-grained threading, programmers can choose various threading levels based on the amount of data exchange in a program. With our proposed techniques and an appropriate programming model, we can reduce processing time by 54.9% in H.264 video encoding (common intermediate format video at 16.574 f/s) with the 1-virtual independent and streaming processing by open collaborative multithreading configuration, compared to the Texas Instruments C62 core that has 8 function units. We realize our design as a prototype by chip implementation, and fabricate it as a chip based on the Taiwan Semiconductor Manufacturing Company Ltd. 0.13 µm process. The die size of the processor core is 16.12 mm², including 414 k logic transistors and 34.4 kB of on-chip static random access memory. The processor runs at 180 MHz/1.2 V and consumes 245 mW according to postsimulation results.
IEEE Transactions on Circuits and Systems for Video Technology 12/2009; · 1.65 Impact Factor
-
IEEE Trans. Circuits Syst. Video Techn. 01/2009; 19:1633-1645.
-
ACM Trans. Design Autom. Electr. Syst. 01/2008; 13.
-
ABSTRACT: Pipeline scaling provides an attractive solution for the increasingly serious branch misprediction penalties in deep pipeline processors. In this paper we investigate Adaptive Pipeline Scaling (APS) techniques that are related to reducing branch misprediction penalties. We present a dual supply-voltage architecture framework that can be efficiently exploited in a deep pipeline processor to reduce pipeline depth depending on the confidence level of branches in the pipeline. We also propose two techniques, Dual Path Index Table (DPIT) and Step-By-Step (STEP) manner, that increase the efficiency of pipeline scaling. With these techniques, we then show that APS not only provides fast branch misprediction recovery, but also speeds up the resolution of mispredicted branches. The evaluation of APS in a 13-stage superscalar processor with benchmarks from SPEC2000 applications shows a performance improvement (between 3% and 12%, average 8%) over a baseline processor that does not exploit APS.
07/2007: pages 105-119;
-
Proceedings of the 44th Design Automation Conference, DAC 2007, San Diego, CA, USA, June 4-8, 2007; 01/2007
-
ABSTRACT: To support various bandwidth requirements for mobile multimedia services for future heterogeneous mobile environments, such as portable notebooks, personal digital assistants (PDAs), and 3G cellular phones, a transcoding video proxy is usually necessary to provide mobile clients with adapted video streams by not only transcoding videos to meet different needs on demand, but also caching them for later use. Traditional proxy technology is not applicable to a video proxy because it is less cost-effective to cache the complete videos to fit all kinds of clients in the proxy. Since transcoded video objects have inheritance dependency between different bit-rate versions, we can use this property to amortize the retransmission overhead from transcoding other objects cached in the proxy. In this paper, we propose the object relation graph (ORG) to manage the static relationships between video versions and an efficient replacement algorithm to dynamically manage video segments cached in the proxy. Specifically, we formulate a transcoding time constrained profit function to evaluate the profit from caching each version of an object. The profit function considers not only the sum of the costs of caching individual versions of an object, but also the transcoding relationship among these versions. In addition, an effective data structure, cached object relation tree (CORT), is designed to facilitate the management of multiple versions of different objects cached in the transcoding proxy. Experimental results show that the proposed algorithm outperforms companion schemes in terms of the byte-hit ratios and the startup latency.
Journal of Systems Architecture. 01/2007;
-
ABSTRACT: As the number of cores on a chip increases, power consumed by the communication structures takes a significant portion of the overall power budget. We first propose a novel 2D segmented interconnect architecture, which uses crossroad switches to dynamically construct a dedicated communication path between any two cores. We then present two application-specific bus operation schemes (normal mode and lease line mode). Each switch may operate with a "lease line", which can dynamically offer a dedicated path between two highly communicative cores for a specific period according to the application characteristics. Finally, we present a concept of wrappers to help designers use our crossroad architecture as a communication backbone. We take the JPEG and MPEG4 reference codes as our case studies, and experimental results show that power consumption can be reduced if we dynamically control NoC switches when the behavior of the embedded software is well known.
Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on; 06/2006
-
Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, ICME 2006, July 9-12 2006, Toronto, Ontario, Canada; 01/2006
-
Proceedings of the 43rd Design Automation Conference, DAC 2006, San Francisco, CA, USA, July 24-28, 2006; 01/2006
-
International Symposium on Circuits and Systems (ISCAS 2006), 21-24 May 2006, Island of Kos, Greece; 01/2006
-
Proceedings of the 2006 International Conference on Embedded Systems & Applications, Las Vegas, Nevada, USA, June 26-29, 2006; 01/2006
-
ABSTRACT: For the success of an SoC design, a design platform and some key design technologies are needed. For this purpose, a research project in "Technology Development Program for Academia" was granted recently in Taiwan to develop the following technologies toward a high-performance low-power SoC design platform. First, a soft intellectual property (soft IP) and related RTOS, compiler, and integrated design environment (IDE) software of an Advanced Taiwan VLIW DSP Processor Core, which can be used as the Star IP of an SoC design platform, will be developed. Second, low-power and low-voltage digital and mixed-signal circuit design technologies will be developed based on the advanced MTCMOS process. Third, some key multimedia and communication soft or hard IPs will be developed. The developed advanced key technologies can be used as the technology driver to facilitate the design of SoC-based products, and they will also help to enhance the design capability of the Taiwan SoC industry. In this paper, we introduce the research project and focus on the architecture and software technologies being developed in one of the subitems.
Embedded and Real-Time Computing Systems and Applications, 2005. Proceedings. 11th IEEE International Conference on; 09/2005
-
ABSTRACT: As the number of cores on a chip increases, power consumed by the communication structures takes a significant portion of the overall power budget. The individual components of the SoCs will be heterogeneous in nature with widely varying functionality and communication requirements. The communication topology should, where possible, match the communication workflows among these components. In this paper, the authors first proposed an interconnection architecture for SoC, which uses crossroad switches to construct a dedicated communication path dynamically between any two cores. Then a design methodology for constructing network on chip (NoC) was presented for application-specific computer systems with profiled communication characteristics. A core placement tool, which automatically maps cores to a communication topology such that the total communication energy can be minimized, was proposed. Experimental results show that the design methodology can generate optimized on-chip networks with fewer resources than meshes and tori, and that the power saving is approximately 40%.
Low Power Electronics and Design, 2005. ISLPED '05. Proceedings of the 2005 International Symposium on; 09/2005
-
ABSTRACT: In this paper, a new CMOS design scheme called the single-low-VDD CMOS (SLVCMOS) is proposed. With this scheme, a CMOS design implemented in a multi-VTH CMOS technology can be operated with a very low external supply voltage, say 0.5 V, with a sleep current at the level of only picoampere per gate. The key items for a single-chip SLVCMOS design include a sleepless mixed-VTH flip-flop, a boosted sleeping clock signal, and three low-power hard blocks. Analysis shows that additional benefits of using the SLVCMOS include higher performance and lower power consumption in the active mode, smaller leakage current in the sleep mode, shorter wake-up time and reduced wake-up energy during the sleep-to-active transition, and a reduced number of sleep-control signals, saving precious routing resources and reducing the chip area. A dual-rail SLVCMOS cell library and two test chips, one 32-b RISC core and the other verifying the design of hard blocks, are designed and implemented to show the feasibility of the proposed design scheme and the design techniques.
IEEE Journal of Solid-State Circuits 06/2005; · 3.23 Impact Factor
-
ABSTRACT: To meet varying requirements for a wide variety of contemporary media applications with evolving standards, a generic and flexible media-processing design platform is essential to manage increasing system-on-a-chip (SoC) design complexity with minimal design cost and time-to-market. A possible solution to the problem is combining several reconfigurable hardware resources with a programmable processor into a single-chip device such that flexibility and performance can be pursued at the same time. While several innovative reconfigurable architectures have been reported in the literature, none of these architectures are poised to provide the ease of software development for sophisticated SoC-based media processing applications. In this paper, we propose a new flexible heterogeneous multicore architecture system, which comprises a main reduced-instruction set computer (RISC) processor and a reconfigurable controller along with other configurable hardware blocks such as DSP processors or intellectual property blocks (IPBs). The key idea is that the interactions of those hardware blocks are grouped together and instructions are defined to express a middle-grained parallelism among intellectual property blocks in terms of several sequences of customized long instruction words (CLIW). A CLIW ROM is reconfigurable in response to application changes. We design the instruction set architecture, called reconfigurable controller, and show the implementation details. In addition, we demonstrate the necessary software tools that are needed to generate the suitable CLIW instruction code for applications.
IEEE Transactions on Circuits and Systems for Video Technology 06/2005; · 1.65 Impact Factor
-
Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, ICME 2005, July 6-9, 2005, Amsterdam, The Netherlands; 01/2005
-
Proceedings of the 2005 International Conference on Pervasive Systems and Computing, PSC 2005, Las Vegas, Nevada, June 27-30, 2005; 01/2005
-
IEEE Trans. Circuits Syst. Video Techn. 01/2005; 15:659-672.