ABSTRACT: A 36 mm² graphics processor with fixed-point programmable vertex shader is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics applications. The graphics processor contains an ARM-10 compatible 32-bit RISC processor, a 128-bit programmable fixed-point single-instruction-multiple-data (SIMD) vertex shader, a low-power rendering engine, and a programmable frequency synthesizer (PFS). Different from conventional graphics hardware, the proposed graphics processor implements ARM-10 co-processor architecture with dual operations so that user-programmable vertex shading is possible for advanced graphics algorithms and various streaming multimedia processing in mobile applications. The circuits and architecture of the graphics processor are optimized for fixed-point operations and achieve low power consumption with the help of instruction-level power management of the vertex shader and pixel-level clock gating of the rendering engine. The PFS with a fully balanced voltage-controlled oscillator (VCO) controls the clock frequency from 8 MHz to 271 MHz continuously and adaptively for low-power modes by software. The chip shows 50 Mvertices/s and 200 Mtexels/s peak graphics performance, dissipating 155 mW in 0.18-μm 6-metal standard CMOS logic process.
ABSTRACT: A full 3D graphics pipeline is investigated, and optimizations of graphics architecture are assessed for satisfying the performance requirements and overcoming the limited system resources found in mobile terminals. Two mobile 3D graphics processor architectures, RAMP and DigiAcc, are proposed based on the analysis, and a prototype development platform (REMY) is implemented. REMY includes a software graphics library and simulation environment developed for more flexible realization of mobile 3D graphics. The experimental results demonstrate the feasibility of mobile 3D graphics with 3.6 Mpolygons/s at 155 mW power consumption for full 3D operation.
ABSTRACT: A fixed-point multimedia coprocessor is designed and integrated into an ARM-10 based mobile graphics processor for portable 2D and 3D multimedia applications. The user-programmable SIMD vertex shader with ARM-10 co-processor architecture realizes advanced 3D graphics algorithms and various multimedia functions. Different from conventional ARM coprocessor architecture, the multimedia coprocessor implements dual operations, by which parallel and streaming multimedia processing is enabled in mobile applications. For low power consumption, a fixed-point SIMD datapath is designed with instruction-wise clock gating. The co-processor takes 10.2 mm² in 0.18μm 6-metal standard CMOS logic process and achieves 50Mvertices/s graphics performance with 75.4mW power consumption.
Solid-State Circuits Conference, 2005. ESSCIRC 2005. Proceedings of the 31st European; 10/2005
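As a rough illustration of the fixed-point vertex processing described in the abstract above, the sketch below transforms one vertex by a 4x4 matrix using integer arithmetic only. The Q16.16 number format, the four-lane layout, and all function names are assumptions made for this example; the abstract does not state the coprocessor's actual format or instruction set.

```python
# Illustrative fixed-point vertex transform (assumed Q16.16 format, not taken
# from the paper). Each row-by-vector product models one 4-lane multiply-add.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Convert a float to Q16.16 (hypothetical helper)."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a Q16.16 value back to a float."""
    return q / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values and rescale the product."""
    return (a * b) >> FRAC_BITS


def transform(matrix, vertex):
    """Multiply a 4x4 fixed-point matrix by a 4-component fixed-point vertex."""
    return [sum(fx_mul(m, v) for m, v in zip(row, vertex)) for row in matrix]


# Translate the point (1, 2, 3) by (0.5, -1, 2).
translate = [
    [ONE, 0, 0, to_fixed(0.5)],
    [0, ONE, 0, to_fixed(-1.0)],
    [0, 0, ONE, to_fixed(2.0)],
    [0, 0, 0, ONE],
]
vertex = [to_fixed(v) for v in (1.0, 2.0, 3.0, 1.0)]
moved = [to_float(q) for q in transform(translate, vertex)]
```

Python integers do not overflow, so this model does not show the wrap-around or saturation behavior that a real 32-bit lane would have.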
ABSTRACT: A user-programmable mobile 3D graphics processor with a 32 bit RISC, a 128 bit fixed-point SIMD vertex shader and a rendering engine is implemented. A programmable frequency synthesizer controls the clock frequency continuously and adaptively for low power. The chip achieves 50 Mvertices/s and 200 Mtexels/s, dissipating 155 mW in a 0.18 μm 6M CMOS process.
ABSTRACT: A low-power three-dimensional (3-D) rendering engine with two texture units and 29-Mb embedded DRAM is designed and integrated into an LSI for mobile third-generation (3G) multimedia terminals. Bilinear MIPMAP texture-mapped 3-D graphics can be realized with the help of a low-power pipeline structure, optimization of the datapath, extensive clock gating, texture address alignment, and the distributed activation of embedded DRAM. The scalable performance reaches up to 100 Mpixels/s and 400 Mtexels/s at 50 MHz. The chip is implemented with 0.16-μm pure DRAM process to reduce the fabrication cost of the embedded-DRAM chip. The logic with DRAM takes 46 mm² and consumes 140 mW at 33-MHz operation. The 3-D graphics images are successfully demonstrated by using the fabricated chip on the prototype PDA board.
IEEE Journal of Solid-State Circuits 08/2004; · 3.06 Impact Factor
ABSTRACT: A 121-mm² graphics LSI is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics and MPEG-4 applications. The LSI contains a RISC processor with a multiply-accumulate unit (MAC), a 3-D rendering engine, a programmable power optimizer, and 29-Mb embedded DRAM. The chip is built in a 0.16-μm pure DRAM technology to reduce the fabrication cost. Texture-mapped 3-D graphics with perspective-correct address calculation and bilinear MIPMAP filtering can be realized while consuming low power with the help of depth-first clock gating, address alignment logic, and embedded DRAM. Programmable clocking allows the LSI to operate in lower power modes for various applications. The chip consumes less than 210 mW, delivering 66 Mpixels/s and 264 Mtexels/s texture-mapped pixels with real-time special effects such as full-scene antialiasing and motion blur.
IEEE Journal of Solid-State Circuits 03/2004; · 3.06 Impact Factor
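The pixel and texel rates quoted in the two rendering-engine abstracts above (100 Mpixels/s with 400 Mtexels/s, and 66 Mpixels/s with 264 Mtexels/s) both work out to four texels per pixel. The short check below assumes that this ratio reflects one 2x2 bilinear footprint per textured pixel; that interpretation and the function name are assumptions, not statements from the abstracts.

```python
# Assumption: each textured pixel fetches one 2x2 bilinear footprint.
TEXELS_PER_PIXEL = 4


def texel_rate(mpixels_per_s: float, texels_per_pixel: int = TEXELS_PER_PIXEL) -> float:
    """Return the texel rate in Mtexels/s implied by a pixel rate in Mpixels/s."""
    return mpixels_per_s * texels_per_pixel


rates = {mpix: texel_rate(mpix) for mpix in (100, 66)}
```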
ABSTRACT: Real-time 3D graphics has become one of the attractive applications for 3G wireless terminals, although their battery lifetime and memory bandwidth limit the system resources for graphics processing. Instead of using a dedicated hardware engine with complex functions, we propose an efficient hardware architecture for a low-power programmable vertex shader. Our architecture includes the following three features: I) a fixed-point SIMD datapath to exploit parallelism in vertex processing while keeping the power consumption low, II) a multithreaded coprocessor interface to decrease unwanted stalls between the main processor and the vertex shader, reducing power consumption by instruction-level power management, III) a programmable vertex engine to increase the datapath throughput by concurrent operations with the main processor. Simulation results show that the full 3D geometry pipeline can be performed at 7.2M vertices/sec with 115mW power consumption for polygons using the OpenGL lighting model. The improvement is about 10 times greater than that of the latest graphics core with floating-point datapath for wireless applications in terms of processing speed normalized by power consumption, Kvertices/sec per milliwatt.
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware 2004, Grenoble, France, August 29-30, 2004; 01/2004
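The figure of merit named in the abstract above, Kvertices/sec per milliwatt, can be computed directly from the two simulated numbers it quotes. The helper below is only a sketch of that arithmetic; the function name is hypothetical, and the abstract does not give the reference core's figures, so the 10x comparison is not reproduced here.

```python
def kvertices_per_mw(vertices_per_s: float, power_mw: float) -> float:
    """Processing speed normalized by power, in Kvertices/s per milliwatt."""
    return vertices_per_s / 1e3 / power_mw


# Simulated full geometry pipeline from the abstract: 7.2M vertices/s at 115 mW.
efficiency = kvertices_per_mw(7.2e6, 115.0)  # about 62.6 Kvertices/s per mW
```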
ABSTRACT: A low-power 3D rendering engine with 2 texture units and 29Mb embedded DRAM is designed and integrated into an LSI for portable 3G multimedia terminals. Texture-mapped 3D graphics with perspective-correct address calculation and bilinear MIPMAP filtering can be realized while consuming low power with the help of clock gating, precision-controlled look-up table dividers, texture address alignment and embedded DRAM. The performance is scalable and reaches up to 100Mpixels/s and 400 Mtexels/s at 50MHz. The chip is implemented with 0.16μm pure DRAM process to reduce the fabrication cost. The logic and DRAM occupy 46 mm² and consume 140mW at 33MHz operation. The 3D graphics images are successfully demonstrated by the fabricated chip on the PDA system board.
Solid-State Circuits Conference, 2003. ESSCIRC '03. Proceedings of the 29th European; 10/2003
ABSTRACT: A 121 mm² graphics LSI is designed for portable 2D/3D graphics and MPEG4 applications. The LSI contains a RISC processor with MAC, a 3D rendering engine, and 29Mb DRAM, and is built in a 0.16μm pure DRAM technology. Programmable clocking allows the LSI to operate in several power modes for various applications. In lower cost mode, power consumption is under 210mW, delivering 264M texture mapped pixels per second.
Digest of Technical Papers - IEEE International Solid-State Circuits Conference 01/2003;
ABSTRACT: A low-power three-dimensional (3-D) rendering engine is implemented as part of a mobile personal digital assistant (PDA) chip. Six-megabit embedded DRAM macros attached to 8-pixel-parallel rendering logic are logically localized with a 3.2-GB/s runtime reconfigurable bus, reducing the area by 25% compared with conventional local frame-buffer architectures. The low power consumption is achieved by polygon-dependent access to the embedded DRAM macros with line-block mapping providing read-modify-write data transaction. The 3-D rendering engine with 2.22-Mpolygons/s drawing speed was fabricated using 0.18-μm CMOS embedded memory logic technology. Its area is 24 mm² and its power consumption is 120 mW.
IEEE Journal of Solid-State Circuits 11/2002; · 3.06 Impact Factor
ABSTRACT: Recently, the level of realism in PC graphics applications has
been approaching that of high-end graphics workstations, necessitating a
more sophisticated texture data cache memory to overcome the finite
bandwidth of the AGP or PCI bus. This paper proposes a multilevel
parallel texture cache memory to reduce the required data bandwidth on
the AGP or PCI bus and to accelerate the operations of parallel graphics
pipelines in PC graphics cards. The proposed cache memory is fabricated
by 0.16-μm DRAM-based SOC technology. It is composed of four
components: an 8-MB DRAM L2 cache, 8-way parallel SRAM L1 caches,
pipelined texture data filters, and a serial-to-parallel loader. For
high-speed parallel L1 cache data replacement, the internal bus
bandwidth has been maximized up to 75 GB/s with a newly proposed hidden
double data transfer scheme. In addition, the cache memory has a
reconfigurable architecture in its line size for optimal caching
performance in various graphics applications from three-dimensional
(3-D) games to high-quality 3-D movies.
IEEE Journal of Solid-State Circuits 06/2002; · 3.06 Impact Factor
ABSTRACT: The optimal architecture of a personal digital assistant (PDA) system for real-time 3D graphics was analyzed by simulating 3D applications on various Advanced RISC Machines (ARM) processor platforms. Simulation results show that for 256x256 screen resolution, even a 200MHz StrongARM with a 160MHz floating point unit (FPU) delivers only 1.78% of the performance required for the full 3D pipeline. To realize real-time 3D graphics on a PDA, the optimal architecture must contain a hardware acceleration engine with embedded DRAM as the rendering stage. In this architecture, a MAC-enhanced ARM9 without FPU, used as a host processor, can provide the necessary geometry operations. We verified this architecture by implementing a PDA chip.
Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on; 01/2002
ABSTRACT: A low-power multimedia processor for mobile applications is
presented. An 80-MHz 32-b RISC with enhanced multiplier, two 20-MHz
hardware accelerators with 7.125-Mb embedded DRAM for MPEG-4 visual
SP@L1 decoding and 3-D graphics processing, 2-kB dual-port SRAM, and
peripheral blocks are integrated together on a single chip. MPEG-4 SP@L1
video decoding and 3-D graphics rendering with 16-b depth-buffer,
alpha-blending, double-buffering, and Gouraud-shading features at
2.2-Mpolygons/s speed are realized with the help of the dedicated hardware
accelerators. The architecture of the processor is optimized in terms of
power consumption and performance, and various low-power circuit
techniques are adopted in each hardware block. The chip is implemented
using 0.18-μm embedded memory logic (EML) technology. Its area is 84
mm², and power consumption is 160 mW when all of the
functions are activated.
IEEE Journal of Solid-State Circuits 12/2001; · 3.06 Impact Factor
ABSTRACT: An 84 mm² 160 mW programmable processor in 0.18 μm EML technology consists of a 32 b RISC with MAC, a 20 MHz motion compensation accelerator for MPEG-4 at SP, a 3D rendering engine with 2.2 M polygon/s at 20 MHz, and 7.125 Mb embedded DRAM with single bitline writing scheme.
ABSTRACT: We implemented POPeye (Probe of Performance+eye), a system
analysis simulator to evaluate DRAM performance in a personal computer
environment. When running real-life application programs such as
Microsoft Office and Paint Shop Pro on Windows OS, POPeye simulates
detailed transactions between a CPU and a memory system. Using this
tool, we comparatively analyzed the performance of a DDR-SDRAM, a
D-RDRAM, and a DDR-FCRAM.
Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on; 02/2001
ABSTRACT: A 16.3 mW low power motion compensation (MC) block IP with 1.25
Mbit embedded DRAM macro is implemented using 0.18 μm EML technology
for portable video applications. For low power consumption, its
frequency is lowered to 20 MHz by utilizing parallelism in datapath.
Embedded DRAM frame buffer eliminates external data I/O. In addition,
distributed nine-tiled mapping (DNTM) with partial activation scheme
reduces power for accessing the frame buffer up to 31% compared to
conventional 1-bank tiled mapping. Adaptive fetch control (AFC) in data
buffer reduces power up to 29% by eliminating unnecessary switching in
ABSTRACT: In this paper, a high-speed 64-bit carry look-ahead adder is
implemented by race logic for fast carry generation. G¹g/G¹k (Level 1
Group Generate/Kill) and G²g/G²k (Level 2 Group Generate/Kill) stages
are designed by race logic. The adder consists of four stages, and the
clk-S63 delay is 480 ps with 0.18 μm CMOS technology.
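To make the two levels of group generate/kill signals in the abstract above concrete, here is a bit-level software model of a 64-bit carry look-ahead adder. It models only the logic function, not race logic or circuit timing. The group sizes (4 bits at level 1, 16 bits at level 2) and all names are assumptions for illustration, since the abstract does not give the actual partitioning.

```python
def combine(hi, lo):
    """Merge (generate, kill) of a higher block with those of the block below it."""
    g_hi, k_hi = hi
    g_lo, k_lo = lo
    p_hi = 1 - (g_hi | k_hi)  # the higher block propagates if it neither generates nor kills
    return (g_hi | (p_hi & g_lo), k_hi | (p_hi & k_lo))


def group(signals):
    """Reduce a list of (generate, kill) pairs, least significant first."""
    acc = signals[0]
    for s in signals[1:]:
        acc = combine(s, acc)
    return acc


def carry_out(gk, cin):
    """Carry leaving a block: generated, or passed through when not killed."""
    g, k = gk
    return g | ((1 - k) & cin)


def cla_add64(a: int, b: int, cin: int = 0):
    """Return (64-bit sum, carry out) using two levels of group generate/kill."""
    bits = [(((a >> i) & (b >> i)) & 1, ((~(a >> i)) & (~(b >> i))) & 1)
            for i in range(64)]
    level1 = [group(bits[i:i + 4]) for i in range(0, 64, 4)]    # 16 groups of 4 bits
    level2 = [group(level1[i:i + 4]) for i in range(0, 16, 4)]  # 4 groups of 16 bits
    total = 0
    c2 = cin
    for j in range(4):                      # level 2 groups
        c1 = c2
        for m in range(4):                  # level 1 groups inside this level 2 group
            c = c1
            for n in range(4):              # bits inside this level 1 group
                i = 16 * j + 4 * m + n
                total |= (((a >> i) ^ (b >> i) ^ c) & 1) << i
                c = carry_out(bits[i], c)
            c1 = carry_out(level1[4 * j + m], c1)  # skip ahead using the group signal
        c2 = carry_out(level2[j], c2)
    return total, c2
```

The carry into each group comes from the group-level signals rather than from rippling through every lower bit, which is the property a look-ahead structure exploits in hardware.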
ABSTRACT: A dedicated single-chip multilevel parallel graphics cache memory
for high-speed parallel texture mapping in PC graphics has been
fabricated by a 0.16 μm DRAM technology. The proposed cache
architecture is composed of four components: 1) an 8 MB DRAM L2 cache,
2) eight 16 KB SRAM L1 parallel caches, 3) eight pipelined texture data
filters, 4) serial-to-parallel latches. The refill bandwidth of the
parallel L1 cache is maximized up to 75 GB/sec by a hidden double data
transfer scheme between the L2 and L1 caches. Furthermore, by an adaptive
sub-wordline activation scheme, the line sizes of the L2 and L1 caches
are reconfigurable for achieving optimal cache miss rate and lower power
consumption. The parallel pipelined structures of the SRAM L1 caches and
the texture filters result in higher system performance.
ABSTRACT: An embedded 3D graphics rendering engine (E3GRE) is implemented as
a part of a mobile PDA-chip. 6 Mb embedded DRAM (eDRAM) macros attached
to 8-pixel-parallel rendering logic are logically localized with a 3.2
GByte/s runtime reconfigurable bus, by which the area is reduced by 25%.
Polygon-dependent access to eDRAM macros with line-block mapping reduces
the power consumption by 70% with the read-modify-write data
transaction. E3GRE with 2.22 M polygons/s drawing speed was fabricated
using 0.18 μm CMOS embedded memory logic technology. Its area and
power consumption are 24 mm² and 120 mW, respectively.