Conference Paper

Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H.264/AVC for Generic GPU

Nat. Taiwan Univ., Taipei
DOI: 10.1109/ICME.2007.4284972
Conference: Multimedia and Expo, 2007 IEEE International Conference on
Source: IEEE Xplore

ABSTRACT In this paper, multi-pass and frame-parallel algorithms are proposed to accelerate various motion estimation (ME) tools in H.264 on the graphics processing unit (GPU). By using the multi-pass method to unroll and rearrange the multiple nested loops, integer-pel ME can be implemented as a two-pass process on the GPU. Fractional ME needs six passes for frame interpolation with a six-tap filter and motion vector refinement. ME with multiple reference frames can be implemented as a two-pass process with a frame-level parallel scheme that uses the GPU's SIMD vector operations. Experimental results show that, compared with CPU-only implementations, speed-ups of about 6 to 56 times can be achieved for different ME algorithms.
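
As a rough illustration of how the nested loops of integer-pel ME can be split into two passes, here is a minimal CPU-side sketch in Python. The abstract does not say what each pass computes, so the split used here (one sum of absolute differences per block and candidate vector, then a minimum reduction per block) is an assumption, not the authors' GPU implementation. The function names `sad`, `pass1_sad_table`, and `pass2_best_vectors`, the block size `n`, and the search range `r` are all hypothetical.

```python
def sad(cur, ref, by, bx, dy, dx, n):
    """Sum of absolute differences between the n x n block of cur at (by, bx)
    and the block of ref displaced by the candidate vector (dy, dx)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def pass1_sad_table(cur, ref, n, r):
    """Pass 1 (assumed split): one independent SAD per (block, candidate) pair.

    Every table entry depends only on its own key, so on a GPU each entry
    could be computed by a separate thread or fragment.
    """
    h, w = len(cur), len(cur[0])
    table = {}
    for by in range(0, h - n + 1, n):
        for bx in range(0, w - n + 1, n):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # Skip candidates that fall outside the reference frame.
                    if 0 <= by + dy <= h - n and 0 <= bx + dx <= w - n:
                        table[(by, bx, dy, dx)] = sad(cur, ref, by, bx, dy, dx, n)
    return table


def pass2_best_vectors(table):
    """Pass 2 (assumed split): reduce the SAD table to the minimum-cost
    vector per block. Ties keep the first candidate encountered."""
    best = {}
    for (by, bx, dy, dx), cost in table.items():
        if (by, bx) not in best or cost < best[(by, bx)][0]:
            best[(by, bx)] = (cost, (dy, dx))
    return {block: mv for block, (cost, mv) in best.items()}
```

Frames are plain lists of rows of integer pixel values. Calling `pass2_best_vectors(pass1_sad_table(cur, ref, 4, 2))` returns a dictionary that maps each block origin `(by, bx)` to its best integer-pel vector `(dy, dx)`.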

  • ABSTRACT: The high computational demands and overall encoding complexity make real-time processing of high-definition video sequences hard to achieve. In this manuscript, we target an efficient parallelization and RD performance analysis of H.264/AVC inter-loop modules and their collaborative execution in hybrid multi-core CPU and multi-GPU systems. The proposed dynamic load balancing algorithm allows efficient and concurrent video encoding across several heterogeneous devices by relying on realistic run-time performance modeling and module-device execution affinities when distributing the computations. Because load balancing decisions are adjusted online, this approach also adapts itself to different execution scenarios. Experimental results show the proposed algorithm's ability to achieve real-time encoding for different resolutions of high-definition sequences on various heterogeneous platforms. Speed-up values of up to 2.6 were obtained when compared with video inter-loop encoding on a single GPU device, and up to 8.5 when compared with a highly optimized multi-core CPU execution. Moreover, the proposed algorithm also automatically tunes the encoding parameters in order to meet strict encoding constraints.
    IEEE Transactions on Multimedia 01/2014; · 1.75 Impact Factor
  • ABSTRACT: H.264/MVC is a standard for supporting the sensation of 3D, based on coding from 2 (stereo) to N views. H.264/MVC adopts many coding options inherited from single-view H.264/AVC, and thus its complexity is even higher, mainly because the number of views to process is higher. In this manuscript, we aim at an efficient parallelization of the most computationally intensive video encoding module for stereo sequences, namely inter prediction, and its collaborative execution on a heterogeneous platform. The proposal is based on an efficient dynamic load balancing algorithm and on breaking encoding dependencies. Experimental results demonstrate the proposed algorithm's ability to reduce the encoding time for different stereo high-definition sequences. Speed-up values of up to 90× were obtained when compared with the reference encoder on the same platform. Moreover, the proposed algorithm is also more energy-efficient, requiring less energy than the sequential reference algorithm.
    Computers & Electrical Engineering 01/2013; · 0.93 Impact Factor
  • ABSTRACT: Video compression has been receiving much deserved attention due to the widespread adoption of digital video technology and the need to optimize the storage and transmission of such media. In this paper, we are concerned with the optimization of one step of the H.264 compression standard, namely motion estimation, in which motion vectors describing the movement of macroblocks (or sub-macroblocks) between two frames are computed. Specifically, we present a comparative study of two architectures used to implement the full search (FS) algorithm at single-pixel precision according to the H.264/AVC standard. We are particularly concerned with the area × throughput relation of the two architectures. We report on experiments performed on CIF, SD and full HD data, comparing the maximum throughput achieved and the bandwidth required by the architectures.
    NORCHIP, 2012; 01/2012