A LOW COST SINGLE-PASS FRACTIONAL MOTION ESTIMATION ARCHITECTURE USING BIT CLIPPING FOR H.264 VIDEO CODEC

Giwon Kim, Jaemoon Kim, Chong-Min Kyung

Dept. of EECS at KAIST
{gwkim, jmkim}@vslab.kaist.ac.kr, kyung@ee.kaist.ac.kr

ABSTRACT

As video resolution increases, the high computational complexity of fractional motion estimation (FME) makes it difficult to meet real-time constraints in video coding. In this paper, we propose a single-pass FME algorithm and its architecture with low hardware cost and negligible loss of image quality. The proposed algorithm directly searches only the surroundings of both the predicted fractional motion vector and the search center. A bit clipping scheme is applied to the processing units of the proposed FME architecture, reducing their hardware cost by 25%. Experimental results show that the proposed algorithm provides almost the same rate-distortion performance as the full-search algorithm. The hardware implementation shows that quad full high definition (QFHD) video (4096 × 2160) can be processed in real time (24 frame/sec) using 134k gates at an operating frequency of 250 MHz. Compared with the recent work supporting QFHD video [8], the proposed FME architecture reduces the hardware cost by 70%.

Keywords—H.264/AVC, Fractional Motion Estimation, Quad Full High Definition, Video Coding

1. INTRODUCTION

Variable block size motion estimation (VBSME) with quarter-pixel accuracy, which consists of integer motion estimation (IME) and fractional motion estimation (FME), achieves substantially higher compression efficiency in video coding. IME searches for the integer motion vector (IMV) of each of the 41 sub-blocks at coarse resolution, and FME refines each of the 41 IMVs to quarter-pixel accuracy. FME is divided into two search passes: a fractional motion vector (FMV) search with half-pixel accuracy around the best IMV (first pass) and an FMV search with quarter-pixel accuracy around the best half-pixel FMV (second pass). The PSNR improvement due to FME with quarter-pixel accuracy is significant, i.e., up to 4 dB [1].

H.264/AVC adopts seven block modes for each macroblock (MB), namely 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4. Because FME refines all of these block modes to quarter-pixel accuracy, it accounts for over 45% of the computational complexity of the total encoding process [2]. An FME architecture based on the full search [1], which refines each block mode sequentially, cannot meet the real-time requirement of high resolutions such as HD1080p or QFHD. For example, the architecture proposed in [1] must operate at no less than 405 MHz and 1.4 GHz for HD1080p and QFHD video coding, respectively. Because the operating frequency in [1] is 100 MHz, these conditions are difficult to meet. Many FME architectures proposed to overcome this problem have high hardware cost. To achieve real-time video coding at high resolution, parallel architectures are adopted for the FME operation [3] [4]. However, parallel FME architectures are very expensive; they occupy 43.6% of the whole H.264/AVC encoder in [3].

To reduce the increased hardware cost of FME, single-pass FME algorithms have been proposed [5] [6] [8]. The scheme in [5], which directly searches six quarter pixels, reduces the hardware cost because the number of processing units (PUs) is reduced to six. To allow real-time operation at HD1080p, an additional mode reduction scheme is used in [6]: after the seven block modes are divided into two groups, FME is processed only for the best mode of each group, i.e., two modes in total. In [8], a two-tap finite impulse response (FIR) filter is used instead of the six-tap FIR filter to generate half pixels, which leads to high throughput; however, the image quality is degraded despite the increased hardware needed to handle all 49 quarter pixels in the FME search range.

In this paper, we propose a single-pass fractional motion estimation (SPFME) algorithm and its hardware architecture with low hardware cost and negligible loss of image quality. The proposed SPFME algorithm, to increase the throughput with negligible loss of image quality, directly searches ten quarter pixels as candidates of the best fractional motion vectors. A bit clipping scheme is employed to reduce the hardware cost of the proposed FME architecture. Since the difference between current pixel value and reference pixel value is small, the proposed bit clipping scheme reduces the size of each processing unit without degrading the image quality.

This paper is organized as follows. Section 2 proposes the SPFME algorithm and bit clipping scheme. Section 3 explains SPFME architecture. In Section 4, rate-distortion performance of the proposed algorithm and the result of hardware implementation are given, followed by the conclusion in Section 5.

2. PROPOSED ALGORITHM

2.1. Single-Pass Fractional Motion Estimation (SPFME)

There are many fast FME algorithms based on prediction, including a hardware-friendly prediction scheme proposed in [7]. In H.264/AVC, the predicted motion vector (\(\text{pred}_\text{mv}\)) is defined as the median of three neighboring motion vectors. The predicted fractional motion vector (\(\text{pred}_\text{frac}\)) is extracted from \(\text{pred}_\text{mv}\) and the best integer motion vector (\(\text{mv}\)),

\[
\text{pred}_\text{frac} = (\text{pred}_\text{mv} - \text{mv}) \mod 4 \quad (1)
\]

where the \(\mod\) operation extracts the fractional component by removing the integer part. Obtaining \(\text{pred}_\text{frac}\) according to equation (1) rests on the assumption that most of the best fractional motion vectors (\(\text{best}_\text{frac}\)) lie on either \(\text{pred}_\text{frac}\) or one of its four neighbors (top, down, left, and right).
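As a minimal sketch of equation (1), the snippet below computes \(\text{pred}_\text{frac}\) per component. It assumes that both vectors are given in quarter-pixel units, that the integer motion vector is already scaled by four, and that the mod operation truncates toward zero so that the result stays within ±3 quarter pixels. The text does not state the sign convention, so this is an assumption, and the function names are illustrative only.

```python
def trunc_mod4(d):
    """Remainder of d / 4 that keeps the sign of d (assumed convention)."""
    r = abs(d) % 4
    return r if d >= 0 else -r


def predicted_fraction(pred_mv, mv):
    """Fractional part of (pred_mv - mv), per component, in quarter pixels."""
    return tuple(trunc_mod4(p - m) for p, m in zip(pred_mv, mv))


# Hypothetical vectors in quarter-pixel units: pred_mv = (9, -6), mv = (8, -4)
print(predicted_fraction((9, -6), (8, -4)))  # (1, -2)
```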

To observe the probability distribution of \(\text{best}_\text{frac}\) when it is located at neither \(\text{pred}_\text{frac}\) nor its four neighbors (top, down, left, and right), we extracted statistics from six HD1080p video sequences, classified into three groups, over 100 frames with a quantization parameter (QP) of 24. The first group includes the Aspen and sunflower sequences with low motion content. The second group includes RushFieldCuts and station2 with moderate motion content. The third group includes pedestrian_area and tractor with high motion content. The average probability distribution over the six sequences is shown in Fig. 1, where the x-axis and y-axis are in quarter-pixel resolution. It shows that the distribution of \(\text{best}_\text{frac}\), when it is located at neither \(\text{pred}_\text{frac}\) nor its four neighbors, is concentrated at the search center, \((0,0)\), and its surroundings. Because content without motion, such as background, occupies a substantial part of the total image, \(\text{best}_\text{frac}\) can often be found at either the search center, \((0,0)\), or its surroundings.

We then performed an experiment to observe the effect on image quality of searching \((0,0)\) and its neighbors in addition to \(\text{pred}_\text{frac}\) with its four neighbors, for the HD1080p sequences over 100 frames when QP is 24. Fig. 2 shows three different algorithms for fractional motion vector search. (a) searches \(\text{pred}_\text{frac}\), its four neighbors (top, down, left, and right), and \((0,0)\) [5]. (b) is our scheme, which additionally searches \((0,0)\) and its four neighbors in the diagonal directions. (c) is the full search scheme of the reference software [9]. Table 1 compares the prediction ratio, which denotes the probability of \(\text{best}_\text{frac}\) being found among the candidate MV locations, and the \(\Delta\text{PSNR}\) of Fig. 2 (a) and (b) relative to the full search in Fig. 2 (c). In (b), which is the single-pass fractional motion estimation (SPFME) algorithm proposed in this paper, 10 candidates are directly searched. The average PSNR degradation of the SPFME algorithm compared to the full search is negligible, i.e., 0.034dB, while the improvement of image quality of (b) over (a) is 0.092dB. Because \(\text{best}_\text{frac}\), when it is located at neither \(\text{pred}_\text{frac}\) nor its four neighbors, is concentrated at \((0,0)\) and its surroundings as shown in Fig. 1, including \((0,0)\) with its four neighboring quarter-pixels has proven to be quite effective. We select the four diagonal neighbors of \((0,0)\) (left-up, right-up, left-down, and right-down) as the additional search candidates to avoid the duplication of search candidates when \(\text{pred}_\text{frac}\) is \((0,0)\).
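The candidate set of scheme (b) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: coordinates are in quarter-pixel units relative to the best integer motion vector, no clamping to the search range is applied, and the function name is hypothetical. Duplicates are removed, so fewer than 10 positions are returned when the two groups overlap.

```python
def spfme_candidates(pred_frac):
    """Candidate positions: pred_frac with its cross neighbors, plus
    the search center (0, 0) with its diagonal neighbors."""
    px, py = pred_frac
    cross = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]
    diagonal = [(0, 0), (-1, -1), (1, -1), (-1, 1), (1, 1)]

    group_pred = [(px + dx, py + dy) for dx, dy in cross]
    group_center = diagonal  # already relative to (0, 0)

    # Keep the first occurrence of each position, preserving order.
    seen, unique = set(), []
    for cand in group_pred + group_center:
        if cand not in seen:
            seen.add(cand)
            unique.append(cand)
    return unique


print(len(spfme_candidates((2, 2))))  # 10
print(len(spfme_candidates((0, 0))))  # 9
```

When \(\text{pred}_\text{frac}\) is \((0,0)\), only the center itself is shared between the two groups, which is the case the diagonal choice is meant to handle.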

With the two-pass algorithm shown in Fig. 2 (c), it is difficult to achieve throughput sufficient for HD video because of the large cycle count per MB for FME. The proposed SPFME algorithm reduces the cycle count per MB for FME to approximately half. To further reduce the hardware cost, a bit clipping scheme is proposed as follows.

2.2. Bit Clipping Scheme

Generally, each pixel of an image has a value ranging from 0 to 255 and is represented in eight bits. As the residue is the difference between the current pixel value and the reference pixel value, it ranges from -255 to 255, and nine bits are normally required to represent it. However, the bit length of the residue can be reduced if the temporal correlation of pixels between the current and reference frames is exploited. To observe this temporal correlation, the accumulated probability distribution of residue values was obtained from six HD1080p video sequences over 100 frames when QP is 24, as shown in Fig. 3, where the x-axis denotes the absolute residue and the y-axis denotes the accumulated probability. Fig. 3 shows that the probability of the absolute residue being less than 32 is close to 100% for all six sequences, i.e., the difference between current and reference pixels can be represented in 6 bits with almost no loss of information.

The effect of the 3-bit clipping scheme on the R-D performance and hardware cost is shown in Table 2. 3-bit clipping confines all residues within [-31, 31], while 0-bit clipping maintains the original residue. 3-bit clipping reduces the gate count per PU by 25% in comparison with the 0-bit clipping scheme, while ΔPSNR and Δrate are negligible. Table 2 shows that, due to the high temporal correlation between current and reference pixels, residue values are mostly limited to ±31. This 3-bit clipping scheme helps suppress the hardware cost of the proposed SPFME.
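The bit-width arithmetic behind the scheme can be checked with a short sketch. It assumes a representation of one sign bit plus magnitude bits, and the helper names are illustrative, not taken from the paper.

```python
def signed_bits(max_abs):
    """Bits needed for a sign bit plus the magnitude of max_abs."""
    return 1 + max_abs.bit_length()


def clipped_residue(cur, ref, limit=31):
    """Residue cur - ref, saturated to [-limit, limit]."""
    residue = cur - ref
    return max(-limit, min(limit, residue))


print(signed_bits(255), signed_bits(31))  # 9 6
print(clipped_residue(120, 117))          # 3
print(clipped_residue(200, 10))           # 31
```

Saturating at ±31 therefore removes three bits from every residue, which is where the name 3-bit clipping comes from.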

Increasing the pixel width increases the throughput through parallelization. However, as the number of pixels processed in parallel (n) increases, the numbers of horizontal and vertical FIR filters also increase (n+1 and 2n+3, respectively).

Table 3 also shows the gate count of the FME architecture and the latency of FME processing according to the pixel width when the operating frequency is 250MHz. In our FME architecture, ‘block mode culling’ is used: only four block modes, i.e., 16×16, 16×8, 8×16, and 8×8, are processed instead of the seven block modes of H.264/AVC. Table 4 shows the effect of ‘block mode culling’ on PSNR for various video resolutions when QP is 24. As the video resolution increases, the PSNR degradation decreases. Although the PSNR reduction is noticeable at low video resolution, it is negligible at high resolution. Therefore, we choose four block modes to improve the throughput of the FME architecture.

In Table 3, as the pixel width increases from 4 to 8, the cycle count per MB is halved, while the gate count of the FME architecture roughly doubles. When the pixel width increases from 8 to 16, the gate count again roughly doubles, but the cycle count per MB is not halved because the interpolation unit is underutilized in the 8×16 and 8×8 block modes. Each block mode is processed independently because it has a different integer motion vector. In the 8×16 and 8×8 block modes, nine horizontal half-pixels are generated, while the interpolation unit has 17 horizontal FIR filters when the pixel width is 16; eight of the horizontal FIR filters are therefore left unused in these two block modes. As shown in Table 3, real-time encoding of HD1080p video is supported when the pixel width is 4, while QFHD cannot be supported in real time. A pixel width of 8 or 16 enables real-time encoding at both QFHD and HD1080p resolution. We chose a pixel width of 8, i.e., the eight-pixel interpolation unit.

A wider interpolation unit requires additional hardware elements, i.e., horizontal and vertical FIR filters, as shown in Fig. 4. When the pixel width is four, five horizontal FIR filters and eleven vertical FIR filters are required, as denoted by white circles and white vertical bars, respectively. Table 3 lists the number of horizontal and vertical FIR filters required for each pixel width.

Fig. 4. The block diagram of the half-pixel interpolation unit and the additional hardware elements required as pixel_width increases.

Fig. 5. The number of required cycles per macroblock (MB) in each mode. (a) 16×16 and 8×16 block mode requiring 22×2 = 44 cycles, (b) 16×8 and 8×8 block mode requiring (14×2)×2 = 56 cycles. Dark-shaded region denotes the current sub-macroblock, while the interpolation window denoted as light-shaded region represents a group of integer pixels used to generate fractional pixels. Arrow denotes the direction of interpolation.

Table 3. The number of horizontal and vertical FIR’s, gate count of FME architecture, latency for FME processing, and support of HD1080p (30 frame/sec) and QFHD (24 frame/sec) according to the pixel_width. Maximum latency for HD1080p and QFHD video resolution is 1054 cycles/MB and 301 cycles/MB, respectively.

<table>
<thead>
<tr>
<th>pixel_width (n)</th>
<th>The number of horizontal FIR (n+1)</th>
<th>The number of vertical FIR (2n+3)</th>
<th>gate count (k)</th>
<th>latency (cycles/MB)</th>
<th>HD1080p support</th>
<th>QFHD support</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>5</td>
<td>11</td>
<td>69</td>
<td>512</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>8</td>
<td>9</td>
<td>19</td>
<td>134</td>
<td>256</td>
<td>Yes</td>
<td>Yes</td>
</tr>
<tr>
<td>16</td>
<td>17</td>
<td>35</td>
<td>267</td>
<td>150</td>
<td>Yes</td>
<td>Yes</td>
</tr>
</tbody>
</table>
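The relationships tabulated in Table 3 can be reproduced with a small sketch. The latency values and the maximum-latency bounds are copied from Table 3 and its caption; the function names and the dictionary layout are illustrative assumptions.

```python
CLOCK_HZ = 250e6  # operating frequency stated for Table 3
MAX_LATENCY = {"HD1080p": 1054, "QFHD": 301}  # cycles/MB, from the caption


def fir_counts(n):
    """Horizontal (n + 1) and vertical (2n + 3) FIR filters for pixel width n."""
    return n + 1, 2 * n + 3


def supports(latency_cycles):
    """Real-time support per resolution for a given latency in cycles/MB."""
    return {res: latency_cycles <= limit for res, limit in MAX_LATENCY.items()}


def throughput_mb_per_sec(latency_cycles, clock_hz=CLOCK_HZ):
    """Macroblocks processed per second at the given clock."""
    return clock_hz / latency_cycles


for n, latency in [(4, 512), (8, 256), (16, 150)]:
    print(n, fir_counts(n), supports(latency),
          round(throughput_mb_per_sec(latency) / 1000), "kMB/sec")
```

For a pixel width of 8, the throughput evaluates to about 977 kMB/sec, which agrees with the figure quoted in the conclusion.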

Table 4. Effect of ‘block mode culling’ on PSNR (ΔPSNR in dB) for various video resolutions, as compared to the case of ‘no culling’.

<table>
<thead>
<tr>
<th>Sequences</th>
<th>QCIF</th>
<th>CIF</th>
<th>HD720p</th>
<th>HD1080p</th>
</tr>
</thead>
<tbody>
<tr>
<td>Aspen</td>
<td>-0.172</td>
<td>-0.123</td>
<td>-0.062</td>
<td>-0.056</td>
</tr>
<tr>
<td>sunflower</td>
<td>-0.077</td>
<td>-0.087</td>
<td>-0.042</td>
<td>-0.029</td>
</tr>
<tr>
<td>RushFieldCuts</td>
<td>-0.161</td>
<td>-0.176</td>
<td>-0.146</td>
<td>-0.108</td>
</tr>
<tr>
<td>station2</td>
<td>-0.132</td>
<td>-0.122</td>
<td>-0.055</td>
<td>-0.013</td>
</tr>
<tr>
<td>pedestrian_area</td>
<td>-0.223</td>
<td>-0.153</td>
<td>-0.033</td>
<td>-0.019</td>
</tr>
<tr>
<td>tractor</td>
<td>-0.227</td>
<td>-0.185</td>
<td>-0.026</td>
<td>-0.021</td>
</tr>
<tr>
<td>Average</td>
<td>-0.165</td>
<td>-0.141</td>
<td>-0.061</td>
<td>-0.048</td>
</tr>
</tbody>
</table>

Fig. 6 shows the proposed FME architecture based on the SPFME algorithm described earlier. The “half-pixel interpolation unit”, which consists of the horizontal and vertical six-tap FIR filters shown in Fig. 4, generates half-pixels from reference pixels. According to the predicted fractional motion vector, pred_frac_mv, given by (1), the generated half-pixels are adaptively selected to interpolate 10 quarter-pixels (pred_frac_mv and (0, 0), each with its four neighbors). The quarter-pixels are generated by averaging the values of two neighboring half-pixels. Because the quarter-pixels are obtained directly from reference pixels (integer-pixels), FME can be completed in a single pass.
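The two interpolation steps can be illustrated with a minimal sketch. It assumes the standard H.264/AVC six-tap coefficients (1, -5, 20, 20, -5, 1)/32 with rounding and clipping to eight bits, and a rounded average for the quarter-pixel; the text above only states that six-tap FIR filters and averaging are used, so the exact rounding offsets are an assumption, and the function names are hypothetical.

```python
def clip_pixel(value):
    """Saturate to the 8-bit pixel range."""
    return max(0, min(255, value))


def half_pixel(a, b, c, d, e, f):
    """Half-pixel between c and d from six neighboring integer pixels."""
    acc = a - 5 * b + 20 * c + 20 * d - 5 * e + f
    return clip_pixel((acc + 16) >> 5)  # divide by 32 with rounding


def quarter_pixel(p, q):
    """Quarter-pixel as the rounded average of two neighboring samples."""
    return (p + q + 1) >> 1


print(half_pixel(100, 100, 100, 100, 100, 100))  # 100
print(quarter_pixel(100, 103))                   # 102
```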

Fig. 6. Proposed FME architecture based on the SPFME algorithm.

Fig. 7. (a) Proposed bit-clipped processing unit (BCPU). ABS represents the operation taking the absolute value. ref denotes the quarter-pixel in reference MB and cur denotes the integer-pixel in current MB. The number next to slash denotes the required number of bits. (b) Bit clipping unit (BCU) architecture.

The bit-clipped processing unit (BCPU) generates residues and extracts the sum of absolute transformed differences (SATD) for each candidate. Ten BCPUs are divided into two groups. One group generates residues and extracts SATDs for pred_frac_mv and its four neighbors; the other group does the same for (0, 0) and its four neighbors. Fig. 7(a) shows the details of the BCPU, which is based on two 4×4 blocks to accommodate the 8×8 block, because block modes below 8×8 are culled in our design. In the BCPU, residues are obtained from the difference between candidates in the reference MB and integer-pixels in the current MB. To reduce the hardware cost of the BCPU, we apply the bit clipping scheme described earlier. The bit clipping unit (BCU), which confines the nine-bit residue [-255, 255] within a six-bit residue [-31, 31], can be realized with a simple architecture. In Fig. 7(b), the clipped residue is selected between the lower five bits of the nine-bit residue (from $r_4$ to $r_0$) and 31, according to the value obtained from the combination of $r_7$, $r_6$, and $r_5$. The reduced bit-width in the BCPU resulting from bit clipping leads to the reduction of hardware cost.
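The selection described for Fig. 7(b) can be modeled at the bit level as follows. The sketch assumes a sign-magnitude layout in which $r_8$ is the sign and $r_7$ to $r_0$ hold the magnitude; the text does not spell out the number format, so this layout, like the function name, is an assumption.

```python
def bcu(residue):
    """Clip a nine-bit residue in [-255, 255] to six bits in [-31, 31]."""
    sign = 1 if residue < 0 else 0        # r8 in the assumed layout
    magnitude = abs(residue)              # r7..r0
    overflow = (magnitude >> 5) & 0b111   # r7, r6, r5: set if magnitude >= 32
    low5 = magnitude & 0b11111            # r4..r0
    clipped = 31 if overflow else low5    # mux between 31 and the low five bits
    return -clipped if sign else clipped


print(bcu(7), bcu(-40), bcu(255))  # 7 -31 31
```

Under this layout, the mux select is simply the OR of $r_7$, $r_6$, and $r_5$, which is consistent with the small hardware cost claimed for the unit.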

The “best fractional motion vector decision unit” selects the candidate with the minimum value among the ten costs obtained from the BCPUs. After the motion vector with the minimum cost is decided for each mode, the “best block mode decision unit” selects the block mode with the minimum accumulated block mode cost as the best block mode. Finally, motion compensation (MC) is performed for the selected best block mode.
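The two decision stages can be sketched in software as two nested minimum searches. The data layout (a list of per-partition cost dictionaries for each mode), the function names, and the example costs are hypothetical and serve only to illustrate the selection order described above.

```python
def best_candidate(costs):
    """Return (candidate, cost) with the minimum cost for one partition."""
    return min(costs.items(), key=lambda item: item[1])


def best_mode(mode_costs):
    """Pick the block mode whose accumulated best-candidate cost is lowest."""
    per_mode = {}
    for mode, partitions in mode_costs.items():
        chosen = [best_candidate(costs) for costs in partitions]
        total = sum(cost for _, cost in chosen)
        per_mode[mode] = (total, [cand for cand, _ in chosen])
    winner = min(per_mode, key=lambda mode: per_mode[mode][0])
    return winner, per_mode[winner]


# Hypothetical SATD costs per candidate, one dictionary per partition.
example = {
    "16x16": [{(0, 0): 400, (1, 0): 380}],
    "16x8": [{(0, 0): 180, (0, 1): 170}, {(0, 0): 190, (1, 1): 200}],
}
print(best_mode(example))  # ('16x8', (360, [(0, 1), (0, 0)]))
```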

4. EXPERIMENTAL RESULTS

4.1. Performance Comparison

The proposed algorithm and the algorithms in [6] and [8] were implemented in JM 14.0, the H.264/AVC reference software [9]. The simulation environment is as follows:

1. main profile,
2. CABAC is enabled,
3. RDO is disabled,
4. motion vector search range is $\pm 64, \pm 64$,
5. the number of reference frames is one,
6. tested QP is from 20 to 32,
7. the total number of frames per QP is 100, and
8. IPPP GOP (Group of Pictures) structure is used.

Fig. 8 compares the R-D performance curves for station2 and tractor. The difference between the R-D curves of the proposed SPFME algorithm and the full search in the reference software is negligible, while the R-D curves of [6] and [8] show substantial degradation. Table 5 compares the proposed algorithm with the single-pass algorithms [6] [8] in terms of the average $\Delta$PSNR and the average $\Delta$rate for five HD1080p sequences when QP is from 20 to 32. The average $\Delta$PSNR of the proposed SPFME algorithm is -0.062dB and the average $\Delta$rate is -3.29%. On the other hand, the average $\Delta$PSNR of [6] is -0.157dB with a $\Delta$rate of 1.07%, and the average $\Delta$PSNR of [8] is -0.058dB with a $\Delta$rate of 5.71%. Compared with [6], the R-D improvement of the proposed algorithm comes with additional hardware cost, so hardware cost and throughput are evaluated together using TPUA, which is defined as follows.

\[ TPUA = \frac{\text{Throughput}}{\text{Gate count}} \quad \text{(MB/sec/gate)} \quad (2) \]

In [6], TPUA has a minimum and a maximum value because the latency varies with the selected block mode. Compared with [6] and [8] in terms of TPUA, our design achieves the highest throughput at low hardware cost. With the proposed design, real-time constraints can be met at QFHD as well as HD1080p resolution at a low hardware cost.
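As a worked instance of equation (2), the sketch below evaluates TPUA for the proposed design using figures stated elsewhere in the paper (250 MHz clock, 256 cycles/MB for a pixel width of 8, 134k gates). Deriving throughput as clock frequency divided by latency is an assumption of this sketch, as is the function name.

```python
def tpua(clock_hz, latency_cycles_per_mb, gate_count):
    """Throughput (MB/sec) divided by gate count, as in equation (2)."""
    throughput = clock_hz / latency_cycles_per_mb
    return throughput / gate_count


print(round(tpua(250e6, 256, 134_000), 2))  # 7.29
```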

5. CONCLUSION

In an H.264/AVC coder, an FME architecture with low hardware cost and high throughput is important for real-time coding of high-resolution video. In this paper, we proposed an FME architecture and its algorithm with low hardware cost and negligible loss of image quality. Our SPFME algorithm achieves high throughput with negligible loss of image quality. We also proposed a bit clipping scheme that reduces the hardware cost of the processing unit (PU) in the FME architecture by 25%. In our design, using only block modes of 8×8 and larger leads to a substantial reduction of the cycle count per MB. Our FME design has a throughput of 977 kMB/sec with 0.062dB PSNR degradation. In TSMC 0.13um technology, the total gate count of the proposed architecture is 134K at 250MHz. As a result, our design can support QFHD video resolution with low hardware cost.

6. REFERENCES


Table 5.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>PSNR (dB)</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Gate count</td>
<td>156</td>
<td>7.29</td>
<td>0.13</td>
</tr>
<tr>
<td>Latency (cycles/MB)</td>
<td>134k</td>
<td>7.29</td>
<td>0.13</td>
</tr>
<tr>
<td>Throughput (kMB/sec)</td>
<td>488k</td>
<td>7.29</td>
<td>0.13</td>
</tr>
<tr>
<td>TPUA (MB/sec/gate)</td>
<td>134k</td>
<td>7.29</td>
<td>0.13</td>
</tr>
</tbody>
</table>

1) The left and the right part of slash are the worst and the best case, respectively.