-
IEEE 13th International Workshop on Multimedia Signal Processing (MMSP 2011), Hangzhou, China, October 17-19, 2011; 01/2011
-
ABSTRACT: Since motion-compensated temporal filtering (MCTF) has become an important temporal prediction scheme in video coding algorithms, this paper presents an efficient temporal prediction engine that is not only the first MCTF hardware work but also supports the traditional motion-compensated prediction (MCP) scheme to provide computation scalability. For the prediction stage of the MCTF and MCP schemes, a modified extended double current frames scheme is adopted to reduce the system memory bandwidth, and a frame-interleaved macroblock pipelining scheme is proposed to eliminate the induced data buffer overhead. In addition, the proposed update stage architecture, with pipelined scheduling and motion estimation (ME)-like motion compensation (MC) using the level C+ scheme, saves about half of the external memory bandwidth and eliminates irregular memory access for MC. Moreover, 76.4% of the hardware area of the update stage is saved by reusing the hardware resources of the prediction stage. This MCTF chip can process CIF at 30 fps in real time with a searching range of [-32, 32) for 5/3 MCTF with four decomposition levels, and it also supports 1/3 MCTF, hierarchical B-frames, and MCP coding schemes in JSVM and H.264/AVC. The gate count is 352 K with 16.8 KBytes of internal memory, and the maximum operating frequency is 60 MHz.
IEEE Transactions on Circuits and Systems for Video Technology 02/2008; · 1.65 Impact Factor
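The prediction and update stages named in the abstract above are the two lifting steps of 5/3 MCTF. The following is a minimal sketch of one decomposition level, assuming zero motion (motion compensation is omitted entirely) and clamping at the group boundaries. The function names `mctf_53_analyze` and `mctf_53_synthesize` are hypothetical, and this is a software illustration of the general lifting structure, not the hardware architecture described in the paper.

```python
def mctf_53_analyze(frames):
    """One level of 5/3 temporal lifting, zero motion assumed.

    frames: an even-length list of frames, each frame a flat list of samples.
    Returns (low, high): the low-pass and high-pass temporal subbands.
    """
    n = len(frames)
    assert n >= 2 and n % 2 == 0, "need an even number of frames"
    even, odd = frames[0::2], frames[1::2]
    half = n // 2

    # Prediction stage: H[k] = odd[k] - (even[k] + even[k+1]) / 2
    high = []
    for k in range(half):
        nxt = even[min(k + 1, half - 1)]  # clamp at the boundary (assumption)
        high.append([o - (a + b) / 2 for o, a, b in zip(odd[k], even[k], nxt)])

    # Update stage: L[k] = even[k] + (H[k-1] + H[k]) / 4
    low = []
    for k in range(half):
        prev = high[max(k - 1, 0)]  # clamp at the boundary (assumption)
        low.append([e + (p + h) / 4 for e, p, h in zip(even[k], prev, high[k])])
    return low, high


def mctf_53_synthesize(low, high):
    """Invert mctf_53_analyze by undoing the lifting steps in reverse order."""
    half = len(low)
    even = []
    for k in range(half):
        prev = high[max(k - 1, 0)]
        even.append([l - (p + h) / 4 for l, p, h in zip(low[k], prev, high[k])])
    odd = []
    for k in range(half):
        nxt = even[min(k + 1, half - 1)]
        odd.append([h + (a + b) / 2 for h, a, b in zip(high[k], even[k], nxt)])
    frames = []
    for e, o in zip(even, odd):
        frames.extend([e, o])
    return frames
```

Applying `mctf_53_analyze` again to the returned `low` frames would give the next decomposition level. Because each lifting step is undone exactly in `mctf_53_synthesize`, the transform is invertible regardless of how the boundary is handled, as long as analysis and synthesis use the same rule.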
-
Signal Processing Systems. 01/2008; 53:285-300.
-
ABSTRACT: The on-chip line buffer dominates the total area and power of line-based 2-D discrete wavelet transform (DWT). In this paper, a memory-efficient VLSI implementation scheme for line-based 2-D DWT is proposed, which consists of two parts: the wordlength analysis methodology and the multiple-lifting scheme. The required wordlength of the on-chip memory is first determined using the proposed wordlength analysis methodology, and a memory-efficient VLSI implementation scheme for line-based 2-D DWT, named the multiple-lifting scheme, is then proposed. The proposed wordlength analysis methodology guarantees that coefficients do not overflow, and the average difference between the predicted and experimental quality levels is only 0.1 dB in terms of PSNR. The proposed multiple-lifting scheme reduces the on-chip memory bandwidth by at least 50% and the line buffer area in the 2-D DWT module by about 50%.
IEEE Transactions on Circuits and Systems for Video Technology 08/2007; · 1.65 Impact Factor
-
ABSTRACT: Multimedia intellectual property (IP) cores play a critical role in a successful multimedia SOC design. This chapter focuses on the design of image and video codec IPs, which usually require a large amount of computational power. From theory to practice and from algorithm to hardware architecture, design methodologies toward an optimized architecture, as well as real design cases, are presented. Both top-down system analysis and bottom-up core module design are emphasized. Following theoretical discussions of the overall scenario, key building blocks of image and video codecs proposed in the literature are reviewed. Examples cover motion estimation, discrete cosine transform, discrete wavelet transform, and the entropy coder. Then, complete image and video codec designs are explored, with JPEG, JPEG 2000, and H.264/AVC as the three case studies. This chapter is intended to provide an overview, from theory to practice, of how to design efficient multimedia IPs.
05/2007: pages 19-72;
-
ABSTRACT: A computation-aware motion estimation algorithm is proposed in this paper. Its goal is to find the best block-matching results in a computation-limited and computation-variant environment. Our algorithm is characterized by a one-pass flow with an adaptive search strategy. In the prior scheme, Tsai et al. propose that all macroblocks are processed simultaneously, and more computation is allocated, step by step, to the macroblock with the largest distortion in the entire frame. This implies that random access of macroblocks is required, and the related information of neighboring macroblocks cannot be used for prediction. The random access flow requires a huge memory to store the up-to-date minimum distortions, best motion vectors, and searching steps of all macroblocks. In contrast, our one-pass flow processes the macroblocks one by one, which not only significantly reduces the memory size but also effectively utilizes the context information of neighboring macroblocks to achieve faster speed and better quality. Moreover, in order to improve the video quality when the computation resource is still sufficient, the search pattern is allowed to change adaptively from diamond search to three-step search, and then to full search. Last but not least, traditional block-matching speed-up methods are also combined to provide much better computation-distortion curves.
IEEE Transactions on Multimedia 09/2006; · 1.93 Impact Factor
-
ABSTRACT: H.264/AVC significantly outperforms previous video coding standards with many new coding tools. However, the better performance comes at the price of extraordinarily high computational complexity and memory access requirements, which make it difficult to design a hardwired encoder for real-time applications. In addition, due to the complex, sequential, and highly data-dependent characteristics of the essential algorithms in H.264/AVC, the use of both pipelining and parallel processing techniques is constrained. The hardware utilization and throughput are also decreased because of the block/MB/frame-level reconstruction loops. In this paper, we describe our techniques for designing an H.264/AVC video encoder for HDTV applications. On the system design level, in consideration of the characteristics of the key components and the reconstruction loops, a four-stage macroblock pipelined system architecture is first proposed with efficient scheduling and memory hierarchy. On the module design level, the design considerations of the significant modules are addressed, followed by the hardware architectures, including low-bandwidth integer motion estimation, parallel fractional motion estimation, a reconfigurable intrapredictor generator, a dual-buffer block-pipelined entropy coder, and a deblocking filter. With these techniques, the prototype chip of the efficient H.264/AVC encoder is implemented with 922.8 K logic gates and 34.72-KB SRAM at a 108-MHz operation frequency.
IEEE Transactions on Circuits and Systems for Video Technology 07/2006; · 1.65 Impact Factor
-
ABSTRACT: Motion-compensated temporal filtering (MCTF) is an open-loop prediction scheme, so frame-level data reuse is possible for MCTF. In this paper, we propose two general frame-level data reuse schemes that minimize the memory bandwidth of the current and reference frames, respectively. The relationships between the required memory bandwidth and the number of searching range buffers are also formulated under the constraint of the data dependency in the joint scalable video model. Finally, we extend our analysis to pyramid MCTF, and the impact of the inter-layer prediction scheme is also considered.
Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on; 06/2006
-
ABSTRACT: Memory bandwidth reduction for motion estimation is important because of the power consumption and limited memory bandwidth in video coding systems. In this paper, we propose a Level C+ scheme that fully reuses the overlapped searching region in the horizontal direction and partially reuses the overlapped searching region in the vertical direction to save more memory bandwidth than the Level C scheme. However, direct implementation of the Level C+ scheme may conflict with some important coding tools and thus lower the hardware efficiency of video coding systems. Therefore, we propose an n-stitched zigzag scan for the Level C+ scheme and discuss two types of 2-stitched zigzag scan for MPEG-4 and H.264 as examples. They can reduce memory bandwidth and resolve the conflicts. When the specification is HDTV 720p, where the searching range is [-128,128), the required memory bandwidth is only 54% of that of the traditional Level C data reuse scheme, and the on-chip memory size increases by only 12%.
IEEE Transactions on Circuits and Systems for Video Technology 05/2006; · 1.65 Impact Factor
-
ABSTRACT: Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip area overhead than previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design has fewer reference pixel registers and a shorter critical path. The second design utilizes a two-dimensional distortion array and one adder tree with a reference buffer that can maximize the data reuse between successive searching candidates. The first design is suitable for low resolution or a small search range, and the second design has the advantages of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing ability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, namely the huge memory bandwidth, is overcome. We are able to save 99.9% of the off-chip memory bandwidth and 99.22% of the on-chip memory bandwidth. We demonstrate a 720p, 30-fps solution at 108 MHz with a 330.2k gate count and 208k bits of on-chip memory.
Circuits and Systems I: Regular Papers, IEEE Transactions on 04/2006; · 1.97 Impact Factor
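Supporting VBSME, as discussed in the abstract above, commonly relies on the fact that the SAD of a larger block is the sum of the SADs of the 4x4 sub-blocks it contains, so the pixel differences only need to be computed once per search candidate. The following is a minimal software sketch of that reuse for the seven H.264/AVC block sizes. The function names (`sad_4x4_grid`, `variable_block_sads`) are hypothetical, and the sketch illustrates the general idea only, not either of the hardware architectures proposed in the paper.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a 16x16 macroblock.

    cur, ref: 16x16 lists of pixel values (current block and one reference candidate).
    """
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


# Block shapes as (height, width) in units of 4x4 sub-blocks; names are WIDTHxHEIGHT.
BLOCK_SHAPES = {
    "16x16": (4, 4), "16x8": (2, 4), "8x16": (4, 2), "8x8": (2, 2),
    "8x4": (1, 2), "4x8": (2, 1), "4x4": (1, 1),
}


def variable_block_sads(cur, ref):
    """Derive the SADs of all 41 variable-size blocks from the sixteen 4x4 SADs."""
    grid = sad_4x4_grid(cur, ref)
    result = {}
    for name, (h, w) in BLOCK_SHAPES.items():
        sads = []
        # Visit the blocks of this shape in raster order within the macroblock.
        for top in range(0, 4, h):
            for left in range(0, 4, w):
                sads.append(sum(grid[r][c]
                                for r in range(top, top + h)
                                for c in range(left, left + w)))
        result[name] = sads
    return result
```

A full search would call `variable_block_sads` once per candidate position and keep, for each of the 41 blocks, the candidate with the smallest SAD.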
-
ABSTRACT: A de-interlacing algorithm using an adaptive 4-field global/local motion-compensated approach is presented. It consists of block-based directional edge interpolation, same-parity 4-field motion detection, and global/local motion estimation and compensation. The edges are sharper when the directional edge interpolation is adopted. The same-parity 4-field motion detection and the 4-field local motion estimation detect static areas and fast motion using four reference fields, and the global motion estimation detects camera panning and zooming motions. The global and local motion compensation convert the interlaced video to progressive video. Experimental results show that the peak signal-to-noise ratio of our proposed algorithm is 2∼3 dB higher than that of previous studies and that it attains the best subjective quality.
IEEE Transactions on Circuits and Systems for Video Technology 01/2006; · 1.65 Impact Factor
-
International Symposium on Circuits and Systems (ISCAS 2006), 21-24 May 2006, Island of Kos, Greece; 01/2006
-
VLSI Signal Processing. 01/2006; 42:297-320.
-
Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, ICME 2006, July 9-12 2006, Toronto, Ontario, Canada; 01/2006
-
ABSTRACT: Motion-compensated temporal filtering (MCTF) is an innovative prediction scheme for video coding and has become the core technology of the coming video coding standard, MPEG-21 part 13 - scalable video coding (SVC). This paper provides a system analysis of MCTF for VLSI implementation, which includes computational complexity, external memory access, external storage size, and coding delay. One-level MCTF is analyzed first, and a modified double current frames scheme is introduced to address the external memory access penalty that results from fractional-pel motion compensation (MC). Then the analysis is extended to multi-level MCTF, in which many important system issues are explored. Finally, a real-life test case is given to compare the system requirements of many different MCTF schemes and the prediction scheme of H.264/AVC.
Image Processing, 2005. ICIP 2005. IEEE International Conference on; 10/2005
-
IEEE Communications Magazine 09/2005; · 3.79 Impact Factor
-
ABSTRACT: Global motion compensation (GMC) is an important coding tool in the MPEG-4 advanced simple profile (ASP). In this paper, we propose an efficient GMC hardware architecture for MPEG-4 ASP@L5. Based on an analysis of the affine model, the proposed memory arrangement and cascaded scheduling reduce the impact of irregular memory access and improve processing ability. It can process 30 fps at only 25 MHz. The implementation result shows that the total gate count is 19.3 K and the internal memory size is 1.28 Kb. It is suitable for integration into MPEG-4 ASP encoders and decoders.
Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on; 06/2005
-
ABSTRACT: A computation-aware motion estimation algorithm is proposed. Its goal is to find the best block-matching results in a computation-limited and computation-variant environment. Our new features are a one-pass flow and adaptive search strategies. The prior scheme allocates more computation, step by step, to the macroblock with the highest distortion in the entire frame. This implies that random access of macroblocks is inevitable, and the search pattern must be determined in advance. The random access flow requires a large amount of memory to store the up-to-date minimum distortions, best motion vectors, and searching steps of all macroblocks. In contrast, the one-pass flow not only significantly reduces the memory size but also effectively uses the context information of neighboring macroblocks to achieve faster convergence and better quality. Moreover, to improve video quality when the computation resource is still sufficient, the search strategy is allowed to change adaptively from diamond search to three-step search, and then to full search. Last but not least, traditional block-matching speedup methods are combined to provide much better computation-distortion curves.
Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on; 06/2005
-
ABSTRACT: A four-field variable block size motion-compensated adaptive de-interlacing method is proposed to improve the accuracy of the motion vectors and reduce the occlusions of motion-compensated de-interlacing. The proposed de-interlacing method consists of variable block size motion estimation/compensation with four-field SAD, interlaced block mode decision, and new block modes. The variable block size motion estimation and compensation improve the accuracy of the motion vectors, especially for spatially periodic patterns. The new block modes and the interlaced block mode decision make block decisions more precise, and special patterns that motion compensation cannot handle are correctly de-interlaced by these two methods. Subjective evaluation shows an improvement in the accuracy of the motion vectors and the correctness of the mode decision.
Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on; 04/2005 · 4.63 Impact Factor
-
ABSTRACT: To the best of the authors' knowledge, this paper presents the first work on memory analysis of VLSI architectures for motion-compensated temporal filtering (MCTF). The open-loop MCTF prediction scheme has led a revolution in hybrid video coding methods, which are mainly based on the closed-loop MC prediction (MCP) scheme, and it has also become the core technology of the coming video coding standard, MPEG-21 part 13 - scalable video coding (SVC). In this paper, the macroblock (MB)-level and frame-level data reuse schemes are analyzed for MCTF. The MB-level data reuse is mainly for motion estimation (ME), and the level C+ scheme is proposed, which can further reduce the memory bandwidth of the conventional level C scheme. Frame-level data reuse schemes for MCTF are proposed according to the open-loop prediction nature.
Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on; 04/2005 · 4.63 Impact Factor