Conference Paper

Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture

Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
DOI: 10.1109/ISCAS.2004.1329261
Conference: Circuits and Systems, 2004. ISCAS '04. Proceedings of the 2004 International Symposium on, Volume: 2
Source: IEEE Xplore

ABSTRACT This paper presents a new macroblock (MB) pipelining scheme for an H.264/AVC encoder. Conventional video encoders adopt two-stage MB pipelines, which are not suitable for H.264/AVC because of its long encoding path, sequential procedures, and large bandwidth requirement. Based on our analysis of the encoding process, an H.264/AVC accelerator is divided into five major functional blocks with a four-stage MB pipeline to greatly increase processing capability and hardware utilization. By adopting shared memories between adjacent pipeline stages with sophisticated task scheduling, the bus bandwidth is further reduced by 55%. In addition, hardware-oriented algorithms are proposed that remove the data dependencies preventing parallel processing and MB pipelining, without loss of video quality. The H.264/AVC Baseline Profile Level 3 encoder, which requires a computational complexity of 1.8 tera-instructions per second (TIPS), is successfully mapped into hardware with our MB pipeline scheme at 100 MHz.
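The scheduling argument in the abstract can be illustrated with a little arithmetic. The Python sketch below assumes the H.264/AVC Level 3 limit of 40,500 macroblocks per second (a figure taken from the standard, not from the abstract) and uses made-up per-stage cycle counts; all function names are hypothetical. It shows how splitting the same amount of work across more, better-balanced stages shortens the pipeline slot so that it fits inside the per-MB cycle budget at 100 MHz. It is not a model of the paper's actual stage partition.

```python
# Sketch: per-macroblock cycle budget and the effect of MB pipeline depth.
# The stage cycle counts used in the demo are hypothetical, not from the paper.

CLOCK_HZ = 100_000_000          # 100 MHz, as stated in the abstract
LEVEL3_MAX_MB_PER_SEC = 40_500  # assumed: H.264/AVC Level 3 max MB rate


def cycle_budget(clock_hz, mb_per_sec):
    """Whole clock cycles available per macroblock at the given MB rate."""
    return clock_hz // mb_per_sec


def slot_cycles(stage_cycles):
    """In a lockstep MB pipeline the slot length is set by the slowest stage."""
    return max(stage_cycles)


def pipeline_cycles(stage_cycles, num_mbs):
    """Total cycles to push num_mbs macroblocks through all stages."""
    # Fill latency of (stages - 1) slots, then one MB completes per slot.
    return slot_cycles(stage_cycles) * (len(stage_cycles) + num_mbs - 1)


def meets_budget(stage_cycles, budget):
    """True if the steady-state cost per MB fits in the cycle budget."""
    return slot_cycles(stage_cycles) <= budget


if __name__ == "__main__":
    budget = cycle_budget(CLOCK_HZ, LEVEL3_MAX_MB_PER_SEC)
    two_stage = [5000, 3000]                # hypothetical: 8000 cycles of work
    four_stage = [2000, 2000, 2000, 2000]   # same work, four balanced stages
    for name, stages in (("two-stage", two_stage), ("four-stage", four_stage)):
        print(name, "slot:", slot_cycles(stages),
              "fits budget:", meets_budget(stages, budget))
```

With these illustrative numbers the total work per MB is identical in both cases; only the length of the slowest stage changes, and that is what determines steady-state throughput.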

  • ABSTRACT: Fractional motion estimation (FME) is an important part of the H.264/AVC video encoding standard. The algorithm can significantly increase the compression ratio of video encoders while preserving high video quality. The full-search FME algorithm, however, is computationally expensive and can account for over 45% of the total motion estimation process. To maximise the performance and efficiency of FME implementations on field-programmable gate arrays (FPGAs), one needs to efficiently exploit the inherent parallelism in the algorithm. The authors investigate the scalability of the full-search FME algorithm on FPGAs and implement six scaled versions of the algorithm on Xilinx Virtex-5 FPGAs. They find that scaling the algorithm vertically within a 4 × 4 sub-block is more efficient than scaling horizontally across several sub-blocks. It is shown that, with four reference frames, the best vertically scaled design can achieve 96 frames per second (fps) while encoding full 1920 × 1088 progressive HDTV video, and the design consumes only 25.5 K LUTs and 28.7 K registers.
    IET Computers & Digital Techniques 03/2012; 6(2):95-104. DOI:10.1049/iet-cdt.2010.0167 · 0.36 Impact Factor
  • ABSTRACT: Fractional Motion Estimation (FME) is an important part of the H.264/AVC video encoding standard. The algorithm can significantly increase the compression ratio of video encoders while improving video quality. However, it is computationally expensive and can account for over 45% of the total motion estimation runtime. To maximize the performance and utilization of FME implementations on Field-Programmable Gate Arrays (FPGAs), one needs to effectively exploit the inherent parallelism in the algorithm. In this work, we explore two approaches to parallelizing the FME algorithm in order to effectively increase the processing power of the computing hardware. We call the first method vertical scaling and the second horizontal scaling. We implemented six scaled FME designs on a Xilinx XC5VLX85T (Virtex-5) FPGA. We found that scaling vertically within a 4 × 4 sub-block is more efficient than scaling horizontally across several sub-blocks. As a result, we were able to achieve higher video resolutions at lower hardware resource cost. In particular, it is shown that the best vertically scaled design can achieve 30 fps of QSXGA video with 4 reference frames using only 25.5 K LUTs and 28.7 K registers.
    Integration, the VLSI Journal 09/2012; 45(4):427–438. DOI:10.1016/j.vlsi.2011.11.017 · 0.53 Impact Factor
  • ABSTRACT: This paper presents a new video encoder architecture for H.264 and AVS that adopts a novel macroblock (MB) encoding order. In place of the Level C+ zigzag coding order, the so-called Level C+ slash scan coding order with NOP insertion is used for MB scheduling to remove the MB-level data dependency of the pipeline, so that the left MB's coded results, such as the motion vector (MV) and reconstructed pixels, are available early in the motion estimation (ME) stages. As a result, by sharing the reconstruction (REC) loop, sequential intra prediction (INTRA) can be split into multiple pipeline stages to exploit more block-level parallelism, and rate-distortion optimization (RDO) based mode decision becomes easier to implement. The exact MV predictors (MVPs) obtained in motion estimation not only improve coding performance but also allow a pre-skip ME algorithm to be applied in this architecture for low-power applications. Because the proposed scheme builds on Level C+ data reuse, the bandwidth is greatly reduced. A real-time high-definition (HD) 1080p AVS encoder implemented on an FPGA verification board, with a search range of [−128, 128] × [−96, 96] and two reference frames at an operating frequency of 160 MHz, validates the efficiency of the proposed architecture.
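
The two FME abstracts above quote throughput as a frame rate at different resolutions (96 fps at 1920 × 1088 and 30 fps at QSXGA). To compare them on a common footing, the Python sketch below converts each to macroblocks per second. It assumes 16 × 16 macroblocks and takes QSXGA to be 2560 × 2048 pixels, which the abstracts do not state; the function names are hypothetical.

```python
# Sketch: convert (resolution, frame rate) into a macroblock rate.
# Assumptions: 16x16 macroblocks; QSXGA taken as 2560x2048 pixels.

MB_SIZE = 16


def ceil_div(a, b):
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def mbs_per_frame(width, height):
    """Macroblocks needed to cover a frame, rounding partial MBs up."""
    return ceil_div(width, MB_SIZE) * ceil_div(height, MB_SIZE)


def mb_rate(width, height, fps):
    """Macroblocks that must be processed per second."""
    return mbs_per_frame(width, height) * fps


if __name__ == "__main__":
    cases = {
        "1920x1088 @ 96 fps": (1920, 1088, 96),
        "QSXGA (2560x2048) @ 30 fps": (2560, 2048, 30),
    }
    for label, (w, h, fps) in cases.items():
        print(label, "->", mb_rate(w, h, fps), "MB/s")
```

Rounding partial macroblocks up is also why 1080-line video is commonly coded as 1088 lines: 1080 is not a multiple of 16.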