Kyeounsoo Kim’s research while affiliated with Hanyang University and other places


Publications (12)


Fig. 12. R gen and CL circuit (in dash boxes) of the T4PFB join stage.
Fig. 16. Speculative delay matching templates.
Fig. 18. Matrix multiplication with a 5 stage asynchronous pipeline.
Fig. 19. Mask signals generation unit based on static logic.
Efficient asynchronous bundled-data pipelines for DCT matrix-vector multiplication
Article · Full-text available

May 2005 · 404 Reads · 10 Citations

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Youpyo Hong · Daewook Kim · [...]

This paper demonstrates the design of efficient asynchronous bundled-data pipelines for the matrix-vector multiplication core of discrete cosine transforms (DCTs). The architecture is optimized for both zero and small-valued data, typical in DCT applications, yielding both high average performance and low average power. The proposed bundled-data pipelines include novel data-dependent delay lines with integrated control circuitry to efficiently implement speculative completion sensing. The control circuits are based on a novel control-circuit template that simplifies the design of such nonlinear pipelines. Extensive post-layout back-end timing analysis was performed to gain confidence in the timing margins as well as to quantify performance and energy. Comparison with a synchronous counterpart suggests that our best asynchronous design yields 30% higher average throughput with negligible energy overhead.
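The idea of a data-dependent delay line can be illustrated with a small behavioral model. The following is only a sketch under stated assumptions, not the circuit from the paper: the function names, the three delay classes, the threshold for "small" data, and all delay values are hypothetical and chosen purely for illustration.

```python
def select_delay(operand, small_bits=4, delays=(1.0, 4.0, 10.0)):
    """Pick a matched delay (arbitrary units) from the operand's magnitude.

    delays = (zero, small, worst-case); the values are made up, not measured.
    """
    zero_delay, small_delay, worst_delay = delays
    if operand == 0:
        return zero_delay          # nothing to compute, fastest path
    if abs(operand) < (1 << small_bits):
        return small_delay         # only the low-order bit slices are active
    return worst_delay             # full-width operation


def average_delay(operands, **kwargs):
    """Average stage delay over a stream of operands."""
    return sum(select_delay(x, **kwargs) for x in operands) / len(operands)


# Hypothetical DCT-like stream: mostly zeros and small values.
samples = [0, 0, 3, 0, -5, 0, 120, 0, 1, 0]
print("average delay:", average_delay(samples))
print("worst-case delay:", select_delay(120))
```

A synchronous design would have to clock every operand at the worst-case delay, whereas the model above pays that cost only for the rare large operand.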


Figure 4. Proposed asynchronous fine-grained carry-save hardwired multiplier for 0.35352 × x′′′′_1, where 0.35352 × x′′′′_1 is expressed as (2^-9 × x′′′′_1) + (2^-7 × x′′′′_1) + (2^-5 × x′′′′_1) + (2^-4 × x′′′′_1) + (2^-2 × x′′′′_1).
An asynchronous matrix-vector multiplier for discrete cosine transform

February 2000 · 79 Reads · 12 Citations

This paper proposes an efficient asynchronous hardwired matrix-vector multiplier for the two-dimensional discrete cosine transform and inverse discrete cosine transform (DCT/IDCT). The design achieves low power and high performance by taking advantage of the typically large fraction of zero and small-valued data in DCT and IDCT applications. In particular, it skips multiplication by zero and dynamically activates/deactivates required bit-slices of fine-grain bit-partitioned adders using simplified, static-logic-based speculative completion sensing. The results extracted by both bit-level analysis and HSPICE simulations indicate significant improvements compared to traditional designs.
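The shift-and-add decomposition given in the Figure 4 caption, combined with the zero skipping described in the abstract, can be sketched in software as follows. This is a behavioral illustration only: the function names and the fixed-point format (9 fractional bits) are assumptions, and the carry-save and bit-slice structure of the actual hardware is not modeled.

```python
# Shift amounts taken from the Figure 4 caption: 2^-2 + 2^-4 + 2^-5 + 2^-7 + 2^-9.
SHIFTS = (2, 4, 5, 7, 9)


def coefficient():
    """Value of the constant represented by the shift-and-add terms."""
    return sum(2.0 ** -s for s in SHIFTS)


def hardwired_multiply(x, frac_bits=9):
    """Multiply integer x by the constant using shifts and adds only.

    The result is a fixed-point integer with `frac_bits` fractional bits
    (an assumed format), so no term loses precision.
    """
    if x == 0:
        return 0  # zero skipping: no shifts or additions are performed
    return sum(x << (frac_bits - s) for s in SHIFTS)


print(coefficient())                     # constant realized by the five terms
print(hardwired_multiply(100) / 2 ** 9)  # fixed-point result converted to float
```

Because the constant is fixed, the multiplier reduces to five wired shifts and an adder tree, which is why no general-purpose multiplier is needed.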



MSB-Controlled Inversion Coding for a Low-Power Matrix Transposer

October 1999 · 70 Reads · 1 Citation

Introduction: The increasing demand for portable and wireless multimedia applications that rely on limited battery energy has made low-power architectures and designs for these applications critical. Since real-time matrix transposition consumes a large fraction of the power in multi-dimensional image and signal processing, low-power matrix transposers are particularly important. When a digital circuit is implemented in CMOS, most of the energy is typically consumed by dynamic power dissipation, which is expressed as (1/2) T C_L V_dd^2 f, where T is the circuit's transition activity, C_L is the total capacitance, and V_dd and f are the supply voltage and the frequency of operation, respectively. The product T C_L is referred to as the total switching capacitance of the circuit [1]. Various coding schemes, including bus-invert coding, re
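As a quick numerical illustration of the dynamic power expression above, the sketch below evaluates (1/2) T C_L V_dd^2 f for a made-up operating point. The function name and every numeric value are assumptions used only to show that power scales linearly with the transition activity T.

```python
def dynamic_power(activity, c_load, v_dd, freq):
    """Dynamic power in watts: (1/2) * T * C_L * V_dd^2 * f."""
    return 0.5 * activity * c_load * v_dd ** 2 * freq


# Hypothetical operating point: T = 0.5, C_L = 1 pF, V_dd = 3.3 V, f = 54 MHz.
baseline = dynamic_power(0.5, 1e-12, 3.3, 54e6)

# Halving the transition activity halves the dynamic power.
reduced = dynamic_power(0.25, 1e-12, 3.3, 54e6)
print(baseline, reduced)
```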


An efficient frame memory interface of MPEG-2 video encoder ASIC chip

September 1999 · 14 Reads · 1 Citation

IEEE Transactions on Consumer Electronics

This paper presents an efficient frame memory interface for an MPEG-2 video encoder. The design reduces the interface buffer size through efficient memory map organization and access timing schedules, avoids unnecessary small buffers, and simplifies their control circuits. It uses 0.5 μm CMOS TLM (triple layer metal) standard cells as design libraries, with a VHDL simulator and logic synthesis tools for hardware design and verification; a hardware emulator, which is a C-language model of the proposed architecture, is used for test vector generation and functional verification. The improved frame memory interface module takes about 58% less hardware area than the previous design (Kim et al. 1997) and reduces the total hardware area of the video encoder ASIC chip by up to 24.3%. We also reduced the random memory accesses to save the power consumed by transitions on the system-level I/O buses.


Statistically Optimized Asynchronous Barrel Shifters for Variable Length Codecs

August 1999 · 55 Reads · 20 Citations

This paper presents low-power asynchronous barrel shifters for variable length encoders and decoders useful in portable applications using multimedia standards. Our approach is to create multi-level asynchronous barrel shifters optimized for the skewed shift control statistics often found in these codecs. For common shifts, data passes through one level, whereas for rare shifts, data passes through multiple levels. We compare our optimized designs with the straightforward asynchronous and synchronous designs. Both pre- and post-layout HSPICE simulation results indicate that, compared to their synchronous counterparts, our designs provide over a 40% savings in average energy consumption for a given average performance.

1 Introduction

Asynchronous circuits sometimes can consume very low average energy for a given average performance partly because of their ability to adapt to variations in chip temperature and voltage supply level. To achieve this goal, however, the asynchronous circu...


Table 2.
Table 3. Area reports of the proposed architecture
A design of DPCM hybrid coding loop using single 1-D DCT in MPEG-2 video encoder

August 1999 · 63 Reads · 2 Citations

In this paper, a VLSI architecture for a DPCM Hybrid Coding Loop (DHCL), which consists of 2-D DCT, quantization, scan conversion, inverse quantization, and 2-D IDCT, is presented. The DHCL architecture is designed to handle macroblock data within 1320 cycles and is suitable for an MPEG-2 video encoder accepting NTSC and PAL image formats. Only a single 1-D DCT/IDCT is used in place of the 2-D DCT and IDCT to reduce the hardware size, and a 3-bit serial distributed arithmetic architecture is adopted for the 1-D IDCT to reduce the processing time. As a result, maximum hardware utilization can be achieved and power consumption can be minimized. The proposed design can operate with an 80 MHz clock. The area is 50% smaller than that of previous methods using 2-D DCT and IDCT. The experimental results show that the accuracy of the DCT and IDCT meets the IEEE specification.


An area efficient DCT architecture for MPEG-2 video encoder

March 1999 · 13 Reads · 27 Citations

IEEE Transactions on Consumer Electronics

This paper presents an area-efficient VLSI architecture of a transform coding module for an MPEG-2 video encoder. This module consists of 2-D DCT and 2-D IDCT, Q and IQ, and zigzag and alternate scan conversion circuits. The hardware cost and performance of this module are mainly affected by the 2-D DCT and 2-D IDCT. In the proposed architecture, it is shown that a single 1-D DCT/IDCT can take the roles of the 2-D DCT and 2-D IDCT. The architecture reuses a single 1-D DCT/IDCT four times, based on the row-column decomposition technique, which is achieved through precise timing schedules. Intuitively, three 1-D DCT/IDCT units and a matrix transposition memory could be saved as compared to the conventional architectures, which usually use two one-dimensional transforms and a transposition memory. Even though there are some extra circuits due to timing controls and processing sequence schedules, this architecture takes about 24% and 50% less area, respectively, than the architectures published by Miyazaki et al. (1993) and by Matsiu et al. (1994). This design and implementation are applicable to an MPEG-2 video encoder accepting NTSC and PAL image formats, in which the number of clocks allocated during a macroblock period is 1320 for a 54 MHz operating clock. To reduce its processing time, the proposed architecture uses a 3-bit serial distributed arithmetic method. As a result, this architecture maximizes the utilization of the hardware resources and can be used for encoders having a structure similar to that of the MPEG-2 video encoder. It can also be applied to ASIC chips for multimedia services, especially those requiring low hardware complexity.
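The row-column decomposition mentioned in the abstract can be sketched in a few lines: a 2-D DCT is obtained by applying the same 1-D DCT routine to the rows, transposing, and applying it again. The code below is a floating-point software illustration under stated assumptions (orthonormal DCT-II scaling, hypothetical function names); it does not model the 3-bit serial distributed arithmetic or the timing schedules of the proposed hardware.

```python
import math


def dct_1d(v):
    """Orthonormal 1-D DCT-II of a list of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def transpose(m):
    """Matrix transposition (the role of the transposition memory)."""
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    """2-D DCT built from two passes through the same 1-D routine."""
    row_pass = [dct_1d(row) for row in block]             # first use of dct_1d
    col_pass = [dct_1d(col) for col in transpose(row_pass)]  # second use
    return transpose(col_pass)


# A constant 8x8 block has all of its energy in the DC coefficient.
flat = [[1.0] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

The inverse transform can reuse the same structure with a 1-D IDCT, which is how one shared 1-D unit can serve both directions in a coding loop.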


A low-power matrix transposer using MSB-controlled inversion coding

February 1999 · 47 Reads · 2 Citations

This paper proposes a low-overhead MSB-controlled inversion coding technique to reduce the transition activity in a matrix transposer, a commonly used component in 2-dimensional discrete cosine transform (DCT) and inverse DCT (IDCT) applications. A family of designs is identified in which this technique is applied to different bit slices of the matrix data, and the optimal design within the family is determined using transition activity analysis driven by real image sequences. Our results suggest that the optimal design using MSB-controlled inversion coding yields power savings of 33% for DCT data and 46% for IDCT data. These results are remarkable since existing bus-invert coding techniques have high overheads and are only effective for system-level high-capacitive buses.
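The abstract does not spell out the encoding, so the following is only one plausible reading, stated as an assumption: for two's-complement data, the lower bits of a word are inverted whenever its MSB is 1, so that small negative values no longer carry long runs of 1s. The function names, the 8-bit width, and the sample data are all hypothetical, and the sketch applies the inversion to all lower bits rather than to a selected bit slice.

```python
def to_twos(value, width=8):
    """Two's-complement encoding of a signed integer."""
    return value & ((1 << width) - 1)


def msb_invert(word, width=8):
    """Invert all bits below the MSB when the MSB is set (assumed scheme).

    The MSB itself is unchanged, so applying the function twice restores
    the original word and no extra control line is needed.
    """
    msb = 1 << (width - 1)
    if word & msb:
        return word ^ (msb - 1)
    return word


def transitions(words):
    """Total number of bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Small-valued signed data that alternates in sign, as is common after a DCT.
raw = [to_twos(v) for v in [1, -1, 2, -2, 0, -1]]
coded = [msb_invert(w) for w in raw]
print(transitions(raw), transitions(coded))
```

Unlike bus-invert coding, this kind of scheme needs no majority voter and no extra invert line, which is consistent with the low-overhead claim in the abstract.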


Statistically optimized asynchronous barrel shifters for variable length codecs

February 1999 · 25 Reads · 17 Citations

This paper presents low-power asynchronous barrel shifters for variable length encoders and decoders useful in portable applications using multimedia standards. Our approach is to create multi-level asynchronous barrel shifters optimized for the skewed shift control statistics often found in these codecs. For common shifts, data passes through one level, whereas for rare shifts, data passes through multiple levels. We compare our optimized designs with the straightforward asynchronous and synchronous designs. Both pre- and post-layout HSPICE simulation results indicate that, compared to their synchronous counterparts, our designs provide over a 40% savings in average energy consumption for a given average performance.
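The benefit of a multi-level shifter under skewed shift statistics can be illustrated with a small behavioral model. This is a sketch under assumptions, not the circuit from the paper: the two-level split, the threshold `common_max`, the 16-bit width, and the sample shift amounts are hypothetical.

```python
def shift_left(word, amount, width=16):
    """Logical left shift within a fixed word width."""
    return (word << amount) & ((1 << width) - 1)


def multilevel_shift(word, amount, width=16, common_max=3):
    """Return (shifted word, number of levels traversed).

    Shifts up to `common_max` finish in the first level; larger (rarer)
    shifts pass through a second level that applies the remainder.
    """
    if amount <= common_max:
        return shift_left(word, amount, width), 1
    partial = shift_left(word, common_max, width)
    return shift_left(partial, amount - common_max, width), 2


def average_levels(amounts, **kwargs):
    """Average number of levels traversed over a stream of shift amounts."""
    return sum(multilevel_shift(1, a, **kwargs)[1] for a in amounts) / len(amounts)


# Hypothetical skewed shift-control stream: short shifts dominate.
stream = [1, 1, 2, 1, 5, 1, 3, 9]
print(average_levels(stream))
```

When short shifts dominate, the average number of levels stays close to one, so average delay and switching activity track the common case instead of the worst case.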


Citations (8)


... Gate level techniques are more efficient than other techniques because signal gating and bypassing cannot be used at architecture level. The 2-dimensional signal gating techniques can achieve power savings for low-precision input data with large dynamic range [31][59][60]. Using a typically large fraction of zero and small valued input, a signal gating approach can achieve power savings by deactivating slices. ...

Reference:

Power Optimization of Sum-of-Products Design in Signal Processing Applications
An asynchronous matrix-vector multiplier for discrete cosine transform

... In [70] a review of some of the encoding techniques has been undertaken. The encoding function can be optimized for specific access patterns such as sequential access (Gray [71,72], T0 [68], Pyramid [73]) or random data (Bus Invert [61]), for special data types such as floating point numbers in DSP applications [74]. The encoding may be fully customized to the target data (Working Zone [75], Beach [76]). ...

A low-power matrix transposer using MSB-controlled inversion coding

... dissipation. An analysis finds that the DA architecture provides higher speed than a hardwired multiplier but also dissipates more power [34]. The advantages of the DA make it a popular choice for high-speed, low area implementations [12] [36]. ...

A high-performance low-power asynchronous matrix-vector multiplier for discrete cosine transform

... Asynchronous processors such as Atlas, MU-5, Iliac, and Iliac II, were also available commercially (Nowick & Singh, 2015). From the mid-1970s to the early 1980s, the next era did not witness any significant development in the asynchronous design technique, and the research work during this duration was more focused on the development of synchronous circuits [5]. However, the asynchronous approach caught the attention of researchers once again between the mid-1980s to late 1990s, and initial EDA tools to design asynchronous circuits were developed. ...

Statistically optimized asynchronous barrel shifters for variable length codecs
Citing Conference Paper · February 1999

... The authors in [11] present a design for the quantization for AVS. The design in [12] describes an MPEG-2 encoder. In [13], another JPEG encoder is implemented for images where the quantization block is designed using multiplication and shift operation instead of division. ...

A design of DPCM hybrid coding loop using single 1-D DCT in MPEG-2 video encoder

... Asynchronous circuits in the pipeline style can be classified into two groups, which are [4,5]: a) bundled-data (BD) [6][7][8][9][10][11][12], where acknowledge and request signals are used as handshaking signals and the data is transmitted using the single-rail encoding, i.e., one signal per data bit. Handshaking signals are generated in the time required (matched delay) for the data to be processed; b) Data-Drive (DD) [13][14][15], where the data is encoded, for example, as dual-rail, and the data bit is represented by two signals. ...

Efficient asynchronous bundled-data pipelines for DCT matrix-vector multiplication

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

... We consider a set of video coders and decoders, which share a set of cores to perform some common signal processing operations. In particular, we include two legacy standards, MPEG-2 [31], [32] and H.263 [33], and the more recent MPEG-4 [34], [35] format. The codecs are supposed to be configured on an FPGA device (in this example, we opted for a Xilinx XC4VLX60) to encode or decode a video stream with one of the supported formats, but they cannot fit on the device at the same time because of area and power-related issues. ...

An area efficient DCT architecture for MPEG-2 video encoder
Citing Article · March 1999

IEEE Transactions on Consumer Electronics

... A different approach has been used to implement a few asynchronous circuits using the bundled data scheme with completion detection techniques [112, 161-165], which can indicate the data validity as soon as the process is complete. A speculative completion detection scheme is designed for asynchronous fixed-point adders [161,162], and for barrel shifters [163], where the datapath channel is implemented with multiple delay models, including the worst-case delay. ...

Statistically Optimized Asynchronous Barrel Shifters for Variable Length Codecs