Conference Paper

Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture

Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
DOI: 10.1109/ISCAS.2004.1329261
Conference: Circuits and Systems, 2004. ISCAS '04. Proceedings of the 2004 International Symposium on, Volume: 2
Source: IEEE Xplore


This paper presents a new macroblock (MB) pipelining scheme for H.264/AVC encoders. Conventional video encoders adopt two-stage MB pipelines, which are not suitable for H.264/AVC because of its long encoding path, sequential procedures, and large bandwidth requirement. Based on our analysis of the encoding process, an H.264/AVC accelerator is divided into five major functional blocks arranged in a four-stage MB pipeline, which greatly increases processing capability and hardware utilization. Sharing memories between adjacent pipeline stages, combined with sophisticated task scheduling, reduces the bus bandwidth by a further 55%. In addition, hardware-oriented algorithms are proposed that remove the data dependencies preventing parallel processing and MB pipelining, without loss of video quality. An H.264/AVC Baseline Profile Level 3 encoder, which requires a computational complexity of 1.8 tera-instructions per second (TIPS), is successfully mapped into hardware with our MB pipelining scheme at 100 MHz.
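
To make the pipelining argument concrete, the following is a minimal sketch of the cycle budget involved. It is not taken from the paper. It assumes the Level 3 limit of 40,500 macroblocks per second from the H.264/AVC level table, the 100 MHz clock quoted in the abstract, and a simple lock-step pipeline model in which every stage slot lasts as long as the slowest stage. The function names and the per-stage cycle counts in the usage note are hypothetical and chosen only for illustration.

```python
# Assumption: Level 3 allows at most 40,500 macroblocks per second
# (value from the H.264/AVC level limits, not from the abstract).
MAX_MB_PER_SEC_LEVEL_3 = 40_500
CLOCK_HZ = 100_000_000  # 100 MHz, as quoted in the abstract


def cycles_per_mb(clock_hz: int, mb_per_sec: int) -> int:
    """Whole clock cycles available per macroblock at the given MB rate."""
    return clock_hz // mb_per_sec


def sequential_cycles(stage_cycles: list[int], num_mbs: int) -> int:
    """Total cycles when each MB passes through all stages before the next starts."""
    return sum(stage_cycles) * num_mbs


def pipelined_cycles(stage_cycles: list[int], num_mbs: int) -> int:
    """Total cycles for a lock-step MB pipeline.

    Each slot lasts as long as the slowest stage; the pipeline needs
    (number of stages - 1) extra slots to fill and drain.
    """
    slot = max(stage_cycles)
    return slot * (num_mbs + len(stage_cycles) - 1)
```

For example, `cycles_per_mb(CLOCK_HZ, MAX_MB_PER_SEC_LEVEL_3)` gives a budget of 2469 cycles per macroblock. With four hypothetical stages of 2000 cycles each and 1620 macroblocks (one 720×576 frame), `sequential_cycles` gives 12,960,000 cycles while `pipelined_cycles` gives 3,246,000, which is why balancing the stages matters for meeting the per-macroblock budget.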

  • Source
    • "The first part performs intra prediction and intra mode decision using original pixels while the second part performs prediction with reconstructed pixels for the selected best mode. However, sometimes this modification causes severe quality degradation especially for the sequences which needs many intra MBs, like an action movie in which motion estimation often fails to find a good match [1]. Figure 1. "
    ABSTRACT: This paper presents a new video encoder architecture for H.264 and AVS, which adopts a novel macroblock (MB) encoding order. As a replacement for the Level C+ zigzag coding order, the so-called Level C+ slash scan coding order with NOP insertion is used for MB scheduling. It removes the MB-level data dependency of the pipeline, so that the left MB's coded results, such as the motion vector (MV) and reconstructed pixels, can be obtained early in the motion estimation (ME) stages. As a result, by sharing the reconstruction (REC) loop, sequential intra prediction (INTRA) can be split into multiple pipeline stages to exploit more block-level parallelism, and rate distortion optimization (RDO) based mode decision becomes easy to implement. The exact MV predictors (MVP) obtained in motion estimation not only improve coding performance but also allow a pre-skip ME algorithm to be applied in this architecture for low-power applications. Because the proposed scheme builds on Level C+ data reuse, the bandwidth is greatly reduced. A real-time high-definition (HD) 1080P AVS encoder implementation on an FPGA verification board, with search range [−128, 128]×[−96, 96] and two reference frames at an operating frequency of 160 MHz, validates the efficiency of the proposed architecture.
    Full-text · Article · May 2012
  • Source
    • "An efficient design would be comprised of a system that can process a lot of operations while keeping the energy consumption to a relatively low level. A general hardware assisted approach was presented by Chen et al. [3]. Their hardware accelerator was connected to the system bus, so that the processing time might be influenced if other peripherals would like to communicate with processor at the same time. "
    ABSTRACT: In this paper, an advanced video coding acceleration based on software-hardware co-design for low-power embedded systems is proposed. Today, people enjoy HD video formats all over the world, but compressing them into a portable format (such as H.264) takes too much time. In embedded systems, it is very costly to transform the entire software application into a hardware solution, especially if it will consume a large amount of power. Thus, we studied the famous H.264 model in order to identify the hotspot functions and balance the tradeoff between speed and energy consumption. The idea is to transform only the more readily used functions into hardware by designing a coprocessor and implementing it on a Virtex 5 Field-Programmable Gate Array (FPGA) platform. The experimental results from this hardware implementation showed a 5 times increase in coding speed while minimizing the energy consumption to around 81 percent.
    Preview · Article · Mar 2012
  • Source
    • "Nevertheless, such powerful characteristics come at a huge cost in terms of computational power, memory size and bandwidth. To alleviate such constraints and achieve realtime video coding, many embedded implementations are based on parallel software applications supported by multi-core hardware structures [2], [3]. Such structures typically consist of an array of heterogeneous processing elements, integrated in a System-on-Chip (SoC) architecture, often composed by a General Purpose Processor (GPP) and multiple dedicated or specialized processors, used as hardware accelerators for the most critical parts of the video coding algorithm. "
    ABSTRACT: A highly modular framework for developing parallel H.264/AVC video encoders in multi-core systems is presented. The framework implements an efficient hardware/software co-design methodology, which enables replacing the software implementation of any operation in the video encoder application by a corresponding system call to a hardware accelerator. To achieve this goal, the design strategy adopts a simple and straightforward method to model all functional blocks of the video encoder as self-contained software modules. The method takes into consideration not only the data structures required to implement the considered operations, but also the available interface of the target hardware structure. To prove the validity of the proposed framework, an implementation of a multi-core H.264/AVC video encoder using an ASIP IP core as an ME hardware accelerator is presented. The results show the advantages of this methodology and demonstrate the performance gains it can provide. For the considered system, speedup factors greater than 15 were obtained for the ME operation.
    Full-text · Conference Paper · Oct 2010