ABSTRACT: This paper proposes an efficient architecture that performs multiple 8×8 transforms for both H.264/AVC and VC-1 decoders. Hardware designs that support multiple standards are becoming increasingly important. By designing a unique data flow for the VC-1 8×8 inverse transform, the H.264/AVC and VC-1 8×8 inverse transforms are realized in a hardware-sharing architecture. The proposed multiple-transform architecture contains fast one-dimensional (1-D) transforms and rounding operations. Simulation results show that the proposed architecture takes 6,702 gates, far fewer than the individual designs for the H.264/AVC and VC-1 8×8 inverse transforms.
ABSTRACT: In this paper, a combined kernel architecture for efficiently decoding the residual data in the H.264/AVC baseline decoder is proposed. The kernel architecture in the H.264/AVC decoder consists of a context-based adaptive variable length code (CAVLC) decoder, an inverse quantization (IQ) unit, and inverse transform (IT) units. Since the decoding speeds of these kernel units vary with the data, traditional methods require data buffers between them. The first proposed architecture efficiently combines the CAVLC decoding and IQ procedures. The multiple 2-D transforms architecture is applied to all inverse transforms, including the 4×4 inverse integer transform, the 4×4 inverse Hadamard transform, and the 2×2 inverse Hadamard transform, to attain a lower gate count than existing transform designs. Simulation results show that the total number of gates is 14.1 k and the maximum operating frequency is 130 MHz. For real-time requirements, in the worst case, the proposed architectures allow the H.264/AVC decoder to operate at up to 4VGA@30 frames/sec in 4:2:0 format.
Article · Feb 2009 · IEEE Transactions on Circuits and Systems for Video Technology
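For orientation, the 4×4 inverse integer transform named in the abstract above can be sketched in software. This is a minimal sketch of the widely described butterfly form of the H.264/AVC 4×4 inverse core transform, not the shared hardware architecture of the paper; the function names `inverse_1d` and `inverse_4x4` are hypothetical, and the input is assumed to be already dequantized.

```python
def inverse_1d(w):
    """One 4-point inverse core transform pass in butterfly form.

    Uses only additions, subtractions, and right shifts by one,
    which is why the transform maps well onto small hardware.
    """
    w0, w1, w2, w3 = w
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> 1) - w3
    e3 = w1 + (w3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_4x4(block):
    """Apply the 1-D pass to rows, then to columns, then round.

    `block` is a 4x4 list of dequantized integer coefficients
    (assumption: scaling was already applied upstream).
    """
    rows = [inverse_1d(r) for r in block]
    cols = [inverse_1d(list(c)) for c in zip(*rows)]
    out = [list(r) for r in zip(*cols)]  # transpose back to row order
    # Final rounding: add 32 and shift right by 6.
    return [[(v + 32) >> 6 for v in row] for row in out]


# A DC-only block spreads evenly over all 16 residual samples.
dc_block = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
residual = inverse_4x4(dc_block)
```

The Hadamard transforms mentioned in the same abstract follow the same add/subtract structure without the shifts, which is one reason a single datapath can plausibly serve all of them.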
ABSTRACT: In this paper, we propose an approximate square criterion for H.264/AVC intra mode decision. The sum of square difference (SSD) criterion achieves the best video quality but is computationally expensive because of the square operation. A sum of approximate square difference (SASD) criterion is proposed to maintain the video quality while reducing the computation. By exploiting the characteristics of the SSD criterion, the SASD criterion achieves rate-distortion performance close to that of the SSD method in simulation. For the hardware implementation, synthesis results show that the proposed approximate square operation reduces area cost by 75% and timing delay by 61% relative to the exact square function.
ABSTRACT: In this paper, we design a novel architecture for computing all transforms required in the H.264/AVC high profile decoder. This flexible architecture computes all transforms, including the 8-point and 4-point integer transforms as well as the 4-point and 2-point Hadamard transforms, which reduces the implementation chip area dramatically. With a throughput of 8 pixels/cycle, the proposed design completes the computation for one macroblock in 95 clock cycles when the 8×8 inverse transform is involved, or in 54 clock cycles when it is not. Simulation results show that the implemented area is 18.5 k gates and the maximum clock frequency is 125 MHz. For the real-time requirement, the architecture can handle all existing frame sizes in 4:2:0 format. For example, when operated at 106 MHz, it achieves 4096×2304@30 frames/sec.
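The 106 MHz figure in the abstract above can be cross-checked with simple arithmetic. The sketch below assumes 16×16 macroblocks, that every macroblock takes the worst-case 95 cycles, and that there are no stalls between macroblocks; the helper names are hypothetical and the calculation is a consistency check, not the authors' timing analysis.

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks covering a frame (rounded up)."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def required_clock_hz(width, height, fps, cycles_per_mb):
    """Minimum clock rate if every macroblock costs `cycles_per_mb` cycles.

    Assumption: macroblocks are processed back to back with no idle cycles.
    """
    return macroblocks_per_frame(width, height) * fps * cycles_per_mb


mbs = macroblocks_per_frame(4096, 2304)         # 256 * 144 macroblocks
clock = required_clock_hz(4096, 2304, 30, 95)   # worst case, 8x8 transform used
```

Under these assumptions the required rate comes to about 105.1 MHz, which is consistent with the quoted 106 MHz operating point.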
ABSTRACT: This paper proposes a combined decoding architecture and a high-throughput flexible transform design to effectively decode the residual data for H.264/AVC decoders. The inverse quantization (IQ) procedure is combined with the context-based adaptive variable length coding (CAVLC) decoder to simplify the design. In addition, a flexible transform architecture is proposed for effective computation of all transforms needed in H.264/AVC decoders. Since all the transforms are realized in the same architecture, the flexible transform design with the throughput of 8 pixels/sec needs fewer logic gates. Simulation results show that the implemented gate count is 18.6k and the maximum operating frequency is 125 MHz. For real-time requirements, the proposed design achieves 4VGA (1280×960)@30 frames/sec in the worst case.
ABSTRACT: In this paper, we propose a high-throughput, data-reuse architecture for the de-blocking filter in H.264/AVC. The design uses two SRAMs: a 144×32-bit single-port SRAM and a 16×32-bit two-port SRAM. We use the group-of-pixel access method to store the pixels in the SRAMs instead of the column-of-pixel or row-of-pixel approach. At the algorithm level, we modify the filtering order in the de-blocking filter without violating the H.264/AVC standard. As a result, data reuse efficiently reduces the access frequency of the SRAMs. We implement this architecture with the UMC 0.18 μm cell library, and the maximum clock frequency we can achieve is 100 MHz. The simulation results show that the total logic gate count is 16.6k. At a clock frequency of 100 MHz, the design can process 14619 macroblocks in 1/30 second. In other words, we achieve 4XGA (2048×1536)@30 frames/sec when the clock frequency is set to 85 MHz.
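The throughput figures in the abstract above are mutually consistent, as a short calculation suggests. The sketch assumes 16×16 macroblocks and a constant per-macroblock cycle cost derived from the quoted 14619 macroblocks per 1/30 second at 100 MHz; the variable names are hypothetical.

```python
FPS = 30
CLOCK_HZ = 100_000_000           # quoted maximum clock frequency
MBS_PER_FRAME_TIME = 14619       # quoted macroblocks processed in 1/30 s

# Average cycles spent per macroblock at 100 MHz (roughly 228).
cycles_per_mb = CLOCK_HZ / FPS / MBS_PER_FRAME_TIME

# 4XGA frame size expressed in 16x16 macroblocks.
mbs_4xga = (2048 // 16) * (1536 // 16)   # 128 * 96

# Clock needed to process one 4XGA frame every 1/30 s at that cycle cost.
needed_hz = mbs_4xga * FPS * cycles_per_mb
```

With these numbers the required clock is a little over 84 MHz, which matches the 85 MHz setting quoted for 4XGA at 30 frames/sec.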
ABSTRACT: This paper proposes an efficient architecture that combines the context-based adaptive variable length coding (CAVLC) decoder and inverse quantization (IQ) to simplify the H.264/AVC decoder. The IQ function is moved into the run_before stage of the CAVLC decoder. With this arrangement, the interface between the CAVLC decoder and IQ can be implemented without additional logic circuits. The authors also use pipelining to improve performance. Because of data dependencies in the CAVLC decoder, the algorithm in the standard must be modified to enable pipelining. The authors implement this architecture with the UMC 0.18 μm cell library. The simulation results show that the operating frequency can reach 200 MHz. The total logic gate count is 9.23k. For the real-time requirement, the design achieves 1080HD (1920×1088)@30 frames/sec with the clock frequency set to 195 MHz.
ABSTRACT: In this paper, low-complexity hardware architectures of 4×4 forward and inverse transforms with integrated quantizer and dequantizer for H.264 advanced video coders (AVC) are proposed. By exploiting the regularity of the quantization matrix, the quantization can be merged into the transform step, which reduces the hardware complexity of the VLSI implementation. The proposed integrated transforms have been synthesized with TSMC 0.35 μm technology. Simulation results show that the design achieves 256 M samples/sec at 32 MHz in the encoder part and 448 M samples/sec at 56 MHz in the decoder part.
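Dividing the quoted sample rates by the quoted clock rates gives the implied per-cycle throughput of each part. The sketch below is only that arithmetic, with "M" taken to mean 10^6 (an assumption); the helper name is hypothetical.

```python
def samples_per_cycle(samples_per_sec, clock_hz):
    """Implied number of samples processed per clock cycle."""
    return samples_per_sec / clock_hz


encoder_rate = samples_per_cycle(256e6, 32e6)  # encoder part
decoder_rate = samples_per_cycle(448e6, 56e6)  # decoder part
```

Both parts work out to the same number of samples per cycle, so the difference in quoted sample rates reflects the different clock frequencies rather than a different degree of parallelism.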