Article

Overview of the H.264/AVC Video Coding Standard

Authors:
  • Thomas Wiegand
  • Gary J. Sullivan
  • Gisle Bjøntegaard
  • Ajay Luthra

Abstract

H.264/AVC is the newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goals of the H.264/AVC standardization effort have been enhanced compression performance and provision of a "network-friendly" video representation addressing "conversational" (video telephony) and "nonconversational" (storage, broadcast, or streaming) applications. H.264/AVC has achieved a significant improvement in rate-distortion efficiency relative to existing standards. This article provides an overview of the technical features of H.264/AVC, describes profiles and applications for the standard, and outlines the history of the standardization process.


... Additionally, we explore various applications of our method, including video compression and video denoising tasks. With quantization-aware training and entropy coding, MetaNeRV outperforms widely used video codecs such as H.264 (Wiegand et al. 2003) and HEVC (Sullivan et al. 2012) and performs comparably with state-of-the-art video compression algorithms. ...
... Video compression: Visual data compression, a cornerstone of computer vision and image processing, has been extensively studied over several decades. Traditional video compression algorithms like H.264 (Wiegand et al. 2003) and HEVC (Sullivan et al. 2012) have achieved remarkable success. Some works have approached video compression as an image interpolation problem, introducing competitive interpolation networks (Wu, Singhal, and Krahenbuhl 2018), generalized optical flow to scale-space flow for enhanced uncertainty modeling (Agustsson et al. 2020; Yang et al. 2020b), and employed temporal hierarchical structures with neural networks for various components (Yang et al. 2020a). ...
... We also apply additional neural network parameter pruning with various prune ratios for the different NeRV-based methods to evaluate video compression performance. In addition, we compare the compression ability of our methods with many popular methods, including H.264 (Wiegand et al. 2003), HEVC (Sullivan et al. 2012), HLVC (Yang et al. 2020a), Scale-space (Agustsson et al. 2020), Wu et al. (Wu, Singhal, and Krahenbuhl 2018), NeRV (Chen et al. 2021), and PS-NeRV (Bai et al. 2023). ...
Preprint
Full-text available
Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, which represents videos as neural networks with frame indexes as inputs. However, NeRV-based methods are time-consuming when adapting to a large number of diverse videos, as each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods must spatially generate a high-dimensional signal (i.e., an entire image) from a low-dimensional timestamp input, while a video typically consists of tens of frames with only minor changes between adjacent ones. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos, named MetaNeRV, a novel framework for fast NeRV representation for unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of the video modality, we further introduce spatial-temporal guidance to improve the representation capabilities of MetaNeRV. Specifically, the spatial guidance with a multi-resolution loss aims to capture information from different resolution stages, and the temporal guidance with an effective progressive learning strategy gradually refines the number of fitted frames during the meta-learning process. Extensive experiments conducted on multiple datasets demonstrate the superiority of MetaNeRV for video representation and video compression.
... This hybrid technique, developed over the years, has become the subject of numerous standardization projects of the ISO/IEC MPEG and ITU-T VCEG expert groups (e.g. the MPEG-2 [1], H.263 [2], AVC [3,4], and HEVC [5,6] standards). Constantly changing requirements imposed on video compression algorithms, as well as the improving capabilities of hardware, result in ever faster-changing generations of video compression technology. ...
... Today, the state of the art in the field of video compression is High Efficiency Video Coding (HEVC) technology [6], which has been jointly developed by ISO/IEC and ITU-T and published in 2013 simultaneously as the international standard ISO/IEC MPEG-H Part 2 and recommendation ITU-T H.265 [5]. The new technique is the successor of the widely used and extremely successful Advanced Video Coding (AVC) technology [4,7]. Relative to that technique, HEVC allows up to a 2-fold reduction in the size of the encoded images without compromising video quality [8], and, more importantly, it supports compression of ultra-high-definition video, which is believed to be the direction in which future video systems will evolve. ...
... HEVC exploits a finite-precision approximation of the DCT transformation [4]. Taking into consideration both the coding efficiency and complexity aspects, the HEVC standard requires a 16-bit depth for values obtained at each stage of the transform process, including the sign bit (for 8-bit input image sample representation). ...
Preprint
The paper presents a quantitative analysis of the video quality losses in a homogeneous HEVC video transcoder. Using the HM15.0 reference software and a set of test video sequences, the cascaded pixel-domain video transcoder (CPDT) concept was used to gather all the data needed for the analysis. The experiment was performed for a wide range of source and target bitrates. The essential result of the work is an extensive evaluation of the CPDT, commonly used as a reference in works on effective video transcoding; until now, no such extensive study has been available in the literature. The quality degradation between the transcoded video and the video that would result from direct compression of the original video at the same bitrate as the transcoded one is reported, and the dependency between the quality degradation caused by transcoding and the bitrate changes of the transcoded data stream is clearly presented in graphs.
... Lossy compression in modern video coding standards, such as HEVC [1] or H.264 [2], is achieved with a block-based approach. First, a block of pixels is predicted using pixels either from a previously coded frame (inter prediction) or from previously coded regions of the current frame (intra prediction). ...
... Block-based spatial prediction, commonly called intra prediction, is a widely used technique for predictive coding of intra-frames in modern video coding standards [2], [29]. In this well-known method, a block of pixels is predicted by copying the block's spatially neighboring pixels (which reside in the previously reconstructed left and upper blocks) along a predefined direction inside the block [29]. ...
... Of course, our solution to the optimization problem can be modified so that only ordered branch pairs that can be implemented in parallel are used in the search. In this case, the best transform with a total of L = 4 rotations becomes the one with ordered branch pairs of (2,4), (1,3), (3,4) and (1,2), and achieves a coding gain of -0.1206 dB relative to the KLT. ...
Preprint
Video coding standards are primarily designed for efficient lossy compression, but it is also desirable to support efficient lossless compression within video coding standards using small modifications to the lossy coding architecture. A simple approach is to skip transform and quantization, and simply entropy code the prediction residual. However, this approach is inefficient at compression. A more efficient and popular approach is to skip transform and quantization but also process the residual block with DPCM, along the horizontal or vertical direction, prior to entropy coding. This paper explores an alternative approach based on processing the residual block with integer-to-integer (i2i) transforms. I2i transforms can map integer pixels to integer transform coefficients without increasing the dynamic range and can be used for lossless compression. We focus on lossless intra coding and develop novel i2i approximations of the odd type-3 DST (ODST-3). Experimental results with the HEVC reference software show that the developed i2i approximations of the ODST-3 improve lossless intra-frame compression efficiency with respect to HEVC version 2, which uses the popular DPCM method, by an average 2.7% without a significant effect on computational complexity.
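The DPCM option described in this abstract is simple to picture: each residual sample is predicted from its already-coded neighbor along the chosen direction, so only integer differences are entropy coded and the mapping is exactly invertible. A minimal numpy sketch of the horizontal case (an illustration, not the HEVC implementation; names are mine):

    import numpy as np

    def dpcm_horizontal(residual: np.ndarray) -> np.ndarray:
        # Replace each sample with its difference from the left neighbor.
        # The first column is kept as-is; integers map to integers.
        out = residual.copy()
        out[:, 1:] = residual[:, 1:] - residual[:, :-1]
        return out

    def dpcm_horizontal_inverse(diff: np.ndarray) -> np.ndarray:
        # A cumulative sum along each row undoes the differencing exactly.
        return np.cumsum(diff, axis=1)

    residual = np.random.randint(-32, 32, size=(4, 4))
    assert np.array_equal(dpcm_horizontal_inverse(dpcm_horizontal(residual)), residual)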
... In the first direction, it has been shown [2], [3] that the emerging High Efficiency Video Coding (HEVC/H.265) standard [4] is capable of significantly improving the video compression efficiency at a given reconstructed video quality compared to the existing H.264 standard [5], [6], albeit at the cost of an increased computational complexity [2], [3]. However, an increased compression efficiency usually makes the coded video stream more vulnerable to packet losses [3], [7]. ...
... At the transmit side, a standard video encoder, such as H.264 [5], [6] or H.265 [2]-[4], may first be invoked at each user for video compression. Then, the output video streams of the video encoders are segmented into packets of different sizes for transport over IP networks. ...
... We note that in video communications, CRC codes are typically used for detecting whether a bitstream is error-free at the output of the channel decoder. This feature is supported by most video compression standards, such as H.264/Advanced Video Coding (AVC) [5], [6] and H.265/HEVC [2]-[4], [15]. At the receiver, each channel-decoded/MIMO-detected NALU failing to pass the CRC check is removed during the packet combining process, namely prior to video decoding. ...
Preprint
A wireless video transmission architecture relying on the emerging large-scale multiple-input-multiple-output (LS-MIMO) technique is proposed. Upon using the most advanced High Efficiency Video Coding (HEVC) (also known as H.265), we demonstrate that the proposed architecture invoking the low-complexity linear zero-forcing (ZF) detector and dispensing with any channel coding is capable of significantly outperforming the conventional small-scale MIMO based architecture, even if the latter employs the high-complexity optimal maximum-likelihood (ML) detector and a rate-1/3 recursive systematic convolutional (RSC) channel codec. Specifically, compared to the conventional small-scale MIMO system, the effective system throughput of the proposed LS-MIMO based scheme is increased by a factor of up to three, and the quality of the reconstructed video quantified in terms of the peak signal-to-noise ratio (PSNR) is improved by about 22.5 dB at a channel-SNR of E_b/N_0 ≈ 6 dB for delay-tolerant video-file delivery applications, and by about 20 dB for lip-synchronized real-time interactive video applications. Alternatively, viewing the attainable improvement from a power-saving perspective, a channel-SNR gain as high as ΔE_b/N_0 ≈ 5 dB is observed at a PSNR of 36 dB for the scenario of delay-tolerant video applications and, again, an even higher gain is achieved in the real-time video application scenario. Therefore, we envisage that LS-MIMO aided wireless multimedia communications are capable of dispensing with the power-thirsty channel codec altogether!
... However, this is not enough, since the development of RDH has mainly focused on speech and images. Along with the development of video compression standards, such as H.264/Advanced Video Coding (AVC) [29] and High Efficiency Video Coding (HEVC) [30], it is necessary to develop RDH techniques for videos. ...
... In most cases, zero coefficients, i.e., zero coefficient-pairs, may be changed to nonzero coefficients in our proposed scheme. Furthermore, according to H.264/AVC [29], such coefficients require more bits to code, which significantly increases the bit-rate. In Liu et al.'s scheme, only a portion of the nonzero coefficients are changed, which leads to a smaller increase in bit-rate. ...
... This subsection defines the zero coefficient-pair. Before doing so, the H.264/AVC video compression standard [29] is briefly reviewed. H.264/AVC uses the macroblock as the operation unit to compress videos, which are composed of many frames. ...
Preprint
H.264/Advanced Video Coding (AVC) is currently one of the most commonly used video compression standards. In this paper, we propose a Reversible Data Hiding (RDH) method based on H.264/AVC videos. In the proposed method, the macroblocks with intra-frame 4×4 prediction modes in intra frames are first selected as embeddable blocks. Then, the last zero Quantized Discrete Cosine Transform (QDCT) coefficients in all 4×4 blocks of the embeddable macroblocks are paired. Next, a modification mapping rule that makes full use of the modification directions is given. Finally, each zero coefficient-pair is changed by combining the given mapping rule with the to-be-embedded information bits. Since most of the last QDCT coefficients in the 4×4 blocks are zero and located in the high-frequency area, the proposed method can obtain high embedding capacity and low distortion.
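To make the "last zero QDCT coefficient" selection step concrete, here is a small sketch (illustrative only; not the authors' exact pairing or mapping rule) that scans a 4×4 block in the standard zig-zag order and locates the last zero coefficient that such a scheme would pair for embedding:

    import numpy as np

    def zigzag4(block):
        # Zig-zag scan: walk anti-diagonals, alternating direction
        # (matches the H.264 4x4 scan order).
        idx = sorted(((i, j) for i in range(4) for j in range(4)),
                     key=lambda p: (p[0] + p[1],
                                    p[0] if (p[0] + p[1]) % 2 else p[1]))
        return [block[i, j] for i, j in idx]

    def last_zero_scan_position(block):
        # Scan-order index of the last zero coefficient, or None.
        scan = zigzag4(np.asarray(block))
        zeros = [k for k, c in enumerate(scan) if c == 0]
        return zeros[-1] if zeros else None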
... This poses grand challenges in the underlying data management tasks, including I/O and data transfer, which necessitates effective data compression. Data compression has been widely used in reducing data volumes [18,37,40,49] and accelerating queries [7,9,53,54] in database and data management systems. Nonetheless, challenges arise when applying existing compression techniques in the context of scientific data. ...
... On the one hand, while lossless compression techniques [1, 6,10,31,37] can recover data to the exact precision, they suffer from limited compression ratios in floating-point scientific data and thus fail to meet the desired reduction level in data size. On the other hand, lossy compressors for natural vision data and time-series databases, such as JPEGs [43,48], H.26x series [8,40,49], SummaryStore [4], and ModelarDB [18,20], cannot provide a quantifiable bound on the errors in the decompressed data, leading to uncertainties in the downstream data visualization and analytics. Under these circumstances, error-bounded lossy compression has been proposed as a viable way to reduce the size of scientific data while providing guaranteed error control. ...
... Traditional compression techniques for database systems: Various data compressors have been proposed for large-scale databases in diverse data domains and formats. Generic lossless compressors such as GZIP [11], ZSTD [10], and LZ4 [36] are widely used to compress various types of data to reduce storage requirements [18,37,40,49] and accelerate queries [7,9,53,54]. With the increasing amount of floating-point data in applications, several compressors have been proposed to specifically deal with floating-point data. ...
Preprint
Error-bounded lossy compression has been widely adopted in many scientific domains because it can address the challenges in storing, transferring, and analyzing the unprecedented amount of scientific data. Although error-bounded lossy compression offers general data distortion control by enforcing strict error bounds on raw data, it may fail to meet the quality requirements on the results of downstream analysis derived from raw data, a.k.a. Quantities of Interest (QoIs). This may lead to uncertainties and even misinterpretations in scientific discoveries, significantly limiting the use of lossy compression in practice. In this paper, we propose QPET, a novel, versatile, and portable framework for QoI-preserving error-bounded lossy compression, which overcomes the challenges of modeling diverse QoIs by leveraging numerical strategies. QPET features (1) high portability to multiple existing lossy compressors, (2) versatile preservation of most differentiable univariate and multivariate QoIs, and (3) significant compression improvements in QoI-preservation tasks. Experiments with six real-world datasets demonstrate that QPET outperformed existing QoI-preserving compression frameworks in terms of speed, and integrating QPET into state-of-the-art error-bounded lossy compressors can yield compression-ratio improvements of up to 250% over the original compressors and of up to 75% over existing QoI-integrated scientific compressors. Under the same level of peak signal-to-noise ratio in the QoIs, QPET can improve the compression ratio by up to 102%.
... Classic video and image compression standards, such as JPEGs [30,26], MPEG [16], and H.264 [33], have been used for video encoding/decoding for decades. These methods employ transform-based approaches, utilizing the wavelet and the discrete cosine transform [24,32] combined with motion compensation [21]. ...
... However, NeRV [7] often faces challenges in capturing fine-grained temporal dynamics, highlighting the need for more advanced approaches. Notable ones are FFNeRV [17], which incorporates optical flow into the architecture in order to capitalize on temporal redundancy, and HiNeRV [15], which uses attention and currently sets the benchmark for frame-wise INR methods, surpassing even traditional codecs like HEVC x265 [33]. ...
Preprint
Full-text available
In the field of video compression, the pursuit of better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representation (INR) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure the network outputs. While pixel-based methods are better for upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques. The method is based on the separation of the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. A byproduct of this integration is an unsupervised segmentation of the video sequence. Object motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
... (1) Adrenaline estimates the user-side visual quality with a given RQ and compression parameter on the server side. To estimate the user-side visual quality, we train a regression model on a quality metric [40] and standard video codec [34], and Adrenaline leverages the pretrained model for its RQ optimization ( §4.2). (2) When serving multiple users with different network conditions and workloads, the optimization process prioritizes and coordinates the RQ adaptation among them to maximize the aggregate gaming service quality and efficiency of the resource usage. To enable this, we propose a scoring mechanism that quantifies the efficiency of RQ settings with respect to the rendering cost and estimated visual quality ( §4.3). ...
... When the game frames are streamed, it involves lossy compression, e.g., H.264 [34] and HEVC [31], for efficient transmission. The streaming module with modern streaming methods such as WebRTC [10], estimates the available bandwidth by using congestion control algorithms, e.g., Google congestion control [15], and adapts its compression parameter to the estimated bandwidth [2]. ...
Preprint
Cloud gaming requires a low-latency network connection, making it a prime candidate for being hosted at the network edge. However, an edge server is provisioned with a fixed compute capacity, causing an issue for multi-user service and resulting in users having to wait before they can play when the server is occupied. In this work, we present a new insight that when a user's network condition results in the use of lossy compression, the end-to-end visual quality degrades more for frames of high rendering quality, wasting the server's computing resources. We leverage this observation to build Adrenaline, a new system which adaptively optimizes the game rendering qualities by considering the user-side visual quality and server-side rendering cost. The rendering quality optimization of Adrenaline is done via a scoring mechanism quantifying the effectiveness of server resource usage on the user-side gaming quality. Our open-sourced implementation of Adrenaline demonstrates easy integration with modern game engines. In our evaluations, Adrenaline achieves up to 24% higher service quality and 2x more users served with the same resource footprint compared to other baselines.
... The motion vector was determined by the amount of motion variation between the original block and the matched block. Building on these advancements, the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group jointly proposed H.264/AVC [32] in 2003. H.264/AVC divided the image into fixed 16 × 16 pixel blocks for intra-frame and inter-frame prediction and employed entropy coding techniques such as CABAC and context-adaptive variable-length coding (CAVLC). ...
... It was one of the earlier approaches to compress the current frame by utilizing temporal correlations between the previous and current frames, akin to the use of adjacent frame correlations in our work [16]. Comparing our algorithm with them allowed for a direct assessment of performance levels, particularly in terms of compression efficiency and video quality [32,33]. ...
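The block-matching motion estimation summarized in the excerpts above is easy to sketch. Below is a hedged full-search example over one 16×16 macroblock, using SAD as the matching cost and an exhaustive ±8-pixel window (both common choices, assumed here for illustration rather than taken from the cited papers):

    import numpy as np

    def full_search_mv(ref, cur, top, left, block=16, radius=8):
        # Exhaustive SAD search around (top, left) in the reference frame.
        target = cur[top:top + block, left:left + block].astype(np.int64)
        best, best_sad = (0, 0), None
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                    continue
                cand = ref[y:y + block, x:x + block].astype(np.int64)
                sad = int(np.abs(target - cand).sum())
                if best_sad is None or sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best, best_sad  # motion vector (dy, dx) and its matching cost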
Article
Full-text available
In recent years, the rapid growth of video data posed challenges for storage and transmission. Video compression techniques provided a viable solution to this problem. In this study, we proposed a bidirectional coding video compression model named DeepBiVC, which was based on two-stage learning. Firstly, we conducted preprocessing on the video data by segmenting the video flow into groups of continuous image frames, with each group comprising five frames. Then, in the first stage, we developed an image compression module based on an invertible neural network (INN) model to compress the first and last frames of each group. In the second stage, we designed a video compression module that compressed the intermediate frames using bidirectional optical flow estimation. Experimental results indicated that DeepBiVC outperformed other state-of-the-art video compression methods regarding PSNR and MS-SSIM metrics. Specifically, on the VUG dataset at bpp = 0.3, DeepBiVC achieved a PSNR of 37.16 and an MS-SSIM of 0.98.
... Firstly, the methodology relies on an offline training phase, where 360° video instances involving different video encoding configurations are processed to derive objective forward prediction models for video quality, bitrate demands, and encoding time. Both the H.264/AVC [22] and H.265/HEVC [23] video compression standards are used for this purpose. Figure 1 captures a subset of the dataset's video content diversity. ...
... In this study, we use the open-source x264 and x265 implementations [25], [26], which facilitate encoding optimizations enabling real-time encoding performance, being orders of magnitude faster than the JM and HM reference implementations of the H.264/AVC [22] and HEVC/H.265 [23] video compression standards, respectively [18]. ...
Conference Paper
Full-text available
360° video streaming is one of the prevalent communication technologies for enhancing user experience and has thus seen widespread adoption in virtual and mixed reality applications. However, delivering content at scale while securing the quality of wirelessly communicated 360° videos in real-time poses significant challenges. 360° videos come in ultra-high definition, necessitate unprecedented bitrate demands, and involve high encoding complexity. The time-varying nature of the underlying wireless channels further introduces a destabilizing factor, calling for video systems to seamlessly adjust to varying bandwidth throughput to maintain an adequate quality of service and experience. To address this issue, in this study we have developed a multi-objective optimization framework for real-time video encoding adaptation. The objective is to optimize both video quality and encoding efficiency while minimizing the required bitrate, subject to real-time application constraints. To achieve this, we relied on generating (offline) precise forward prediction models of video quality, bitrate demands, and encoding time that can be used to select the optimum encoding configuration in real-time. To validate our methods, we implemented an adaptive video encoding controller and ran emulations employing actual network traces from 5G mobile video streaming scenarios, using the popular open-source x264 and x265 codecs for video encoding. A dataset of 4K omnidirectional videos at 30 frames per second was used.
... However, because of storage and data-transfer limitations, all camera chipsets and video processing pipelines provide compressed-domain video formats like MPEG/ITU-T AVC/H.264 [8] and HEVC [9], or open-source video formats like VP9 [10] and AOMedia Video 1 instead of uncompressed (pixel-domain) video. Alas, the state-of-the-art in CNN-based classification and recognition in video [4]- [6] ignores the fact that video codecs can be tuned at the macroblock (MB) level. ...
... For our temporal stream input, we extract and retain only P-type MB MVs, i.e., uni-directionally predicted MBs [8], [9]. The standard UCF-101 [30] and HMDB-51 [31] datasets are composed of 320 × 240 RGB pixels per frame. ...
Preprint
We investigate video classification via a two-stream convolutional neural network (CNN) design that directly ingests information extracted from compressed video bitstreams. Our approach begins with the observation that all modern video codecs divide the input frames into macroblocks (MBs). We demonstrate that selective access to MB motion vector (MV) information within compressed video bitstreams can also provide for selective, motion-adaptive, MB pixel decoding (a.k.a., MB texture decoding). This in turn allows for the derivation of spatio-temporal video activity regions at extremely high speed in comparison to conventional full-frame decoding followed by optical flow estimation. In order to evaluate the accuracy of a video classification framework based on such activity data, we independently train two CNN architectures on MB texture and MV correspondences and then fuse their scores to derive the final classification of each test video. Evaluation on two standard datasets shows that the proposed approach is competitive to the best two-stream video classification approaches found in the literature. At the same time: (i) a CPU-based realization of our MV extraction is over 977 times faster than GPU-based optical flow methods; (ii) selective decoding is up to 12 times faster than full-frame decoding; (iii) our proposed spatial and temporal CNNs perform inference at 5 to 49 times lower cloud computing cost than the fastest methods from the literature.
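To make the temporal-stream input concrete, the sketch below rasterizes per-macroblock motion vectors into a dense two-channel field of the kind a CNN can ingest. The (mb_row, mb_col, mv_y, mv_x) tuple format and the 320×240 frame with 16×16 MBs are assumptions for illustration, not the paper's exact pipeline:

    import numpy as np

    def mv_field(mvs, frame_h=240, frame_w=320, mb=16):
        # One (mv_y, mv_x) pair per macroblock; MBs without a P-type MV stay zero.
        field = np.zeros((frame_h // mb, frame_w // mb, 2), dtype=np.float32)
        for mb_row, mb_col, mv_y, mv_x in mvs:
            field[mb_row, mb_col] = (mv_y, mv_x)
        return field

    field = mv_field([(0, 0, 1.5, -2.0), (3, 7, 0.0, 4.25)])
    print(field.shape)  # (15, 20, 2)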
... Considering the road map of video coding evolution through the last decades, two video coding standards stand out as the state-of-the-art: H.266/Versatile Video Coding (VVC) [14] and AOMedia Video 1 (AV1) [15]. The H.266/VVC (released in 2020) is the successor of H.265/High-Efficiency Video Coding (HEVC) (2013) [16] and the H.264/Advanced Video Coding (AVC) (2003) [17]. This thread of standardization is led by a series of joint collaborative groups from ITU-T and ISO experts. ...
Article
Full-text available
Video coding standards are the key enablers of recent video applications, such as video conferencing, video on demand, and immersive video. The recent standards have become more efficient over the years at compressing such video data with higher quality for the user. However, the computational effort of video encoders has also increased due to the new coding tools and the support of higher video resolutions and frame rates. The design of dedicated hardware accelerators, associated with approximate computing and storage techniques, is used today to deal with such an increase in the computational effort and memory requirements of recent video encoders. This survey article reviews state-of-the-art works that propose approximate computing and storage techniques and apply them to video coding systems.
... To address this, we compress the long videos to one frame per second to reduce data redundancy. Subsequently, we extract keyframes using I-frame detection methods [30]. I-frames, which are the least compressible and do not require other frames for decoding, contain most of the visual information in a video. ...
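I-frame extraction of this kind is commonly done with ffmpeg's select filter. A hedged sketch (assuming an ffmpeg binary on PATH and a hypothetical input.mp4; this is a standard recipe, not necessarily the exact method of [30]):

    import subprocess

    # Keep only I-frames and write them out as numbered images;
    # -vsync vfr drops the timestamps of discarded frames instead of
    # duplicating frames to fill them.
    subprocess.run([
        "ffmpeg", "-i", "input.mp4",
        "-vf", "select=eq(pict_type\\,I)",
        "-vsync", "vfr",
        "keyframe_%04d.png",
    ], check=True)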
Preprint
Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) the lack of large-scale benchmark datasets. In this paper, we focus on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
... Versatile video coding (VVC) [1,2] is an international video coding standard that provides superior compression capabilities compared with advanced video coding (AVC) [3,4] and high efficiency video coding (HEVC) [5,6]. It was jointly designed by the joint video experts team (JVET), a collaborative effort between two international standardization organizations, the ITU-T video coding experts group (VCEG) and the ISO/IEC moving picture experts group (MPEG). ...
Article
Full-text available
Versatile video coding (VVC) is the next-generation video coding standard. VVC proposes a new partitioning block structure called quadtree with nested multi-type tree (QTMT), which introduces more flexible partition shapes using quadtree (QT) and nested multi-type tree (MTT) splitting compared to the previous splitting algorithms, namely the quadtree-plus-binary-tree and QT structures adopted in High Efficiency Video Coding. QTMT significantly improves coding efficiency, but it brings considerable computational complexity, which limits VVC's practical applications. To efficiently address the problem of redundant processing in QTMT structures in inter-mode prediction, in this paper we propose a fast QTMT inter-partitioning algorithm based on a machine learning approach, namely gradient boosting machines (GBM). The proposed algorithm is divided into three steps. In the first step, the average local variance (ALV) is extracted from each coding unit (CU) to determine its homogeneity. Then, a classification-based GBM is employed to analyze and build a binary classification model from the extracted ALV features; the GBM model is used to efficiently obtain suitable thresholds for each QT CU size and a threshold between QT and MTT modes. In the last step, a fast QTMT partition decision algorithm is performed based on the extracted thresholds. The experimental results show that the proposed algorithm reduces the encoding time significantly, while the loss in coding efficiency is negligible.
... In general, when lossless data compression technologies are used, the original data can be reconstructed without any loss of information. On the other hand, when a certain level of information loss is tolerated, lossy data compression technologies can usually achieve better compression ratios than lossless ones [4,5], such as the JPEG standard for images [16,17], and the H.263 standard for videos [18]. Sometimes, the differences between the reconstructed voice, pictures, and movies and the original data are undetectable by human eyes and ears. ...
Article
Full-text available
Today, huge amounts of time series data are sensed continuously by AIoT devices, transmitted to edge nodes, and on to data centers. It costs a lot of energy to transmit these data, store them, and process them. Data compression technologies are commonly used to reduce the data size and thus save energy. When a certain level of data accuracy is sacrificed, lossy compression technologies can achieve better compression ratios. However, different applications may have different requirements for data accuracy. Instead of keeping multiple compressed versions of a time series w.r.t. different error bounds, HIRE hierarchically maintains a tree, where the root records a constant function to approximate the whole time series, and each other node records a constant function to approximate a part of the residual function of its parent for a particular time period. To retrieve data w.r.t. a specific error bound, it traverses the tree from the root down to certain levels according to the requested error bound and aggregates the constant functions on the visited nodes to generate a new bounded-error compressed version dynamically. However, the number of nodes to be visited is unknown before the tree traversal completes, and thus so is the data size of the new version. In this paper, a time series is progressively decomposed into multiple piecewise linear functions. The first function is an approximation of the original time series w.r.t. the largest error bound. The second function is an approximation of the residual function between the original time series and the first function w.r.t. the second largest error bound, and so forth. The sum of the first through m-th functions is an approximation of the original time series w.r.t. the m-th error bound. For each iteration, Swing-RR is used to generate a Bounded Error Piecewise Linear Approximation (BEPLA); Resolution Reduction (RR) plays an important role. Eight real-world datasets are used to evaluate the proposed method. For each dataset, approximations w.r.t. three typical error bounds, 5%, 1%, and 0.5%, are requested. Three BEPLAs are generated accordingly, which can be summed to form three approximations w.r.t. the three error bounds. For all datasets, the total data size of the three BEPLAs is almost the same as the size used to store just one version w.r.t. the smallest error bound, and significantly smaller than the size used to keep three independent versions. The experimental results show that the proposed method, referred to as PBEPLA-RR, can achieve very good compression ratios and provide multiple approximations w.r.t. different error bounds.
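The layered decomposition reads clearly as code. The sketch below substitutes a simple greedy piecewise-constant fit for Swing-RR (a stand-in, purely illustrative): each stage approximates the residual of the previous stages, and the sum of the first m layers satisfies the m-th error bound.

    import numpy as np

    def pwc_approx(x, bound):
        # Greedy piecewise-constant fit; pointwise error <= bound.
        out, i = np.empty(len(x)), 0
        while i < len(x):
            lo = hi = x[i]
            j = i + 1
            while j < len(x) and max(hi, x[j]) - min(lo, x[j]) <= 2 * bound:
                lo, hi = min(lo, x[j]), max(hi, x[j])
                j += 1
            out[i:j] = (lo + hi) / 2  # midpoint keeps |x - out| <= bound
            i = j
        return out

    def progressive(x, bounds):
        # Stage m fits the residual of stages 1..m-1 at bounds[m].
        layers, resid = [], np.asarray(x, dtype=float)
        for b in sorted(bounds, reverse=True):  # largest bound first
            layer = pwc_approx(resid, b)
            layers.append(layer)
            resid = resid - layer
        return layers

    x = np.cumsum(np.random.randn(500))
    layers = progressive(x, [0.05, 0.01, 0.005])
    # Partial sums meet successively tighter bounds:
    assert np.max(np.abs(x - (layers[0] + layers[1]))) <= 0.01 + 1e-9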
... Methods based on integral arithmetic coding are widely popular. Publications [16,17,30] demonstrated the promise of using integral arithmetic coding to increase the level of accessibility. These methods are used as separate elements in the JPEG and JPEG 2000 methods. ...
Article
Full-text available
The paper proposes a method of improved adaptive integral arithmetic coding. The method is intended for use in a technology for multi-level processing of video data based on the JPEG method. The technology is based on the detection of key information at several stages of video data processing. To reduce the output volume, the RLE algorithm and integral arithmetic coding are adapted to the new structure of the input data. In this way, the method of linearizing two-dimensional transformants by zig-zag scanning was further developed; it differs in performing vector inter-transformant zig-zag linearization that takes into account the selection of spectral components defined as complementary. A linearized transformant decomposition approach based on entry into control ranges was developed for the first time. Because different types of transformants are present in a group, the threshold is adapted according to the criterion of the total number of non-equilibrium complementary components. By taking into account the probability of occurrence of dictionary elements, integral arithmetic coding was improved (two-dictionary integral arithmetic coding): the current code components are determined from the decomposed working interval depending on the sizes of the dictionaries of significant elements and the numbers of repetitions. This makes it possible to additionally account for the statistical features of the components of the RLE-structured linearized transformants and to reduce the length of the arithmetic code. Finally, a transformant compression method was created for the first time based on the reduction of several types of redundancy in groups of transformants. A comparative experimental analysis with known methods indicated that the developed technology achieves a higher compression ratio with reduced processing time. This makes it possible to ensure the necessary level of accessibility and reliability as the original data volume grows.
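The zig-zag linearization plus RLE stage can be illustrated compactly. A sketch of JPEG-style zero-run coding on a linearized transformant (illustrative only, not the adapted two-dictionary variant proposed in the paper):

    def rle_zeros(seq):
        # Encode as (zero_run_length, value) pairs; a trailing run of
        # zeros is flagged with None (cf. JPEG's end-of-block marker).
        pairs, run = [], 0
        for v in seq:
            if v == 0:
                run += 1
            else:
                pairs.append((run, v))
                run = 0
        if run:
            pairs.append((run, None))
        return pairs

    print(rle_zeros([35, 0, 0, -3, 2, 0, 0, 0, 1, 0, 0]))
    # [(0, 35), (2, -3), (0, 2), (3, 1), (2, None)]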
... V-PCC utilizes video encoding technologies, including H.264 [4], H.265 [5] and others, to compress the point cloud data by transforming it into 2D video streams. This process creates three types of frames: occupancy maps, which indicate valid 3D projection points; geometry maps, which provide the depth information; and attribute maps, which contain the color information of the points as shown in Fig. 1. ...
Preprint
Video-based point cloud compression (V-PCC) converts the dynamic point cloud data into video sequences using traditional video codecs for efficient encoding. However, this lossy compression scheme introduces artifacts that degrade the color attributes of the data. This paper introduces a framework designed to enhance the color quality in the V-PCC compressed point clouds. We propose the lightweight de-compression Unet (LDC-Unet), a 2D neural network, to optimize the projection maps generated during V-PCC encoding. The optimized 2D maps will then be back-projected to the 3D space to enhance the corresponding point cloud attributes. Additionally, we introduce a transfer learning strategy and develop a customized natural image dataset for the initial training. The model was then fine-tuned using the projection maps of the compressed point clouds. The whole strategy effectively addresses the scarcity of point cloud training data. Our experiments, conducted on the public 8i voxelized full bodies long sequences (8iVSLF) dataset, demonstrate the effectiveness of our proposed method in improving the color quality.
... For example, subtle position shifts that are imperceptible to the human eye may be captured by these metrics. Compared to traditional video codecs like H.264 (Wiegand et al. 2003), which requires an average bitrate of approximately 347 kbps at 512 resolution, our approach achieves comparable performance while reducing the bitrate to approximately 11 kbps. This suggests that our discrete representation captures facial dynamics more efficiently than the continuous representation. ...
Preprint
Full-text available
We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512*512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation. Synthetic results can be viewed at https://x-lance.github.io/VQTalker.
... The high data rates of 4K video streaming have amplified the role of video compression, making it essential to reduce both the network and storage requirements to reasonable levels. Widespread video codecs such as Advanced Video Coding (AVC) / H.264 [1] and High-Efficiency Video Coding (HEVC) / H.265 [2] have been utilized over the past decades to efficiently compress Standard-Definition (SD) and High-Definition (HD) videos. However, the proliferation of UHD videos, linked with the prevalence of video communication applications that now dominate internet traffic, has triggered the need for new video compression standards. ...
Conference Paper
Full-text available
In this work, we conduct a subjective video quality assessment (VQA) experiment to evaluate the performance of modern video codecs when displayed on a large (30 m²) ultra-high-definition (UHD) video wall. Our comparative study involves four encoding standards: Advanced Video Coding (AVC) / H.264, High-Efficiency Video Coding (HEVC) / H.265, Versatile Video Coding (VVC) / H.266, and AOMedia Video 1 (AV1). Moreover, the study includes the evaluation, in terms of correlation performance, of a set of widely used full-reference objective VQA metrics against the subjective scores. Our results showcase the perceptual superiority of VVC over rival encoding standards, aligning with the objective quality assessment and relevant literature.
... Video Coding Standards: Video codecs are under constant evolution due to ever-increasing performance requirements and novel use cases. The current video coding standard, H.265/HEVC [36], replaced the previous one, H.264/AVC [37], due to requirements for higher coding efficiency, higher spatial resolution (4K/8K video), and greater color resolution and dynamic range. Extensions of HEVC include scalable (SHVC), multi-view (MV-HEVC), range (RExt) and 3D video coding (3D-HEVC) [38]. ...
Preprint
An exponential increase in mobile video delivery will continue with the demand for higher resolution, multi-view and large-scale multicast video services. Novel fifth generation (5G) 3GPP New Radio (NR) standard will bring a number of new opportunities for optimizing video delivery across both 5G core and radio access networks. One of the promising approaches for video quality adaptation, throughput enhancement and erasure protection is the use of packet-level random linear network coding (RLNC). In this review paper, we discuss the integration of RLNC into the 5G NR standard, building upon the ideas and opportunities identified in 4G LTE. We explicitly identify and discuss in detail novel 5G NR features that provide support for RLNC-based video delivery in 5G, thus pointing out to the promising avenues for future research.
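Packet-level RLNC is compact enough to sketch end to end. The toy below works over GF(2) (XOR combining; practical systems usually use GF(2^8), and all names here are illustrative): the sender transmits random linear combinations of k source packets, and any k linearly independent combinations suffice to decode by Gaussian elimination.

    import random

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def rlnc_encode(sources, n_coded, seed=0):
        # Each coded packet: (GF(2) coefficient vector, XOR of chosen sources).
        rng, k = random.Random(seed), len(sources)
        coded = []
        while len(coded) < n_coded:
            coeffs = [rng.randrange(2) for _ in range(k)]
            if not any(coeffs):
                continue  # the all-zero combination carries no information
            payload = bytes(len(sources[0]))
            for c, s in zip(coeffs, sources):
                if c:
                    payload = xor_bytes(payload, s)
            coded.append((coeffs, payload))
        return coded

    def rlnc_decode(coded, k):
        # Gaussian elimination over GF(2); returns None until k
        # linearly independent packets have been received.
        basis = [None] * k  # basis[j]: row whose leading 1 is column j
        for coeffs, payload in coded:
            coeffs = list(coeffs)
            for j in range(k):
                if not coeffs[j]:
                    continue
                if basis[j] is None:
                    basis[j] = (coeffs, payload)
                    break
                coeffs = [a ^ b for a, b in zip(coeffs, basis[j][0])]
                payload = xor_bytes(payload, basis[j][1])
        if any(row is None for row in basis):
            return None
        for j in range(k - 1, -1, -1):  # clear entries above each pivot
            for i in range(j):
                if basis[i][0][j]:
                    basis[i] = ([a ^ b for a, b in zip(basis[i][0], basis[j][0])],
                                xor_bytes(basis[i][1], basis[j][1]))
        return [payload for _, payload in basis]

    sources = [b"segment0", b"segment1", b"segment2"]
    print(rlnc_decode(rlnc_encode(sources, 8), 3))  # typically recovers all three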
... Video coding standards such as the state-of-the-art High Efficiency Video Coding (HEVC) [1] and the widely used H.264/AVC [2] support both lossy and lossless compression. In both lossy and lossless compression modes, prediction is performed in a block-based approach, and then the difference between the original block and the predicted block (the residual block) is further processed depending on the mode of compression and the input configurations. ...
Preprint
In pixel-by-pixel spatial prediction methods for lossless intra coding, the prediction is obtained as a weighted sum of neighbouring pixels. The prediction approach proposed in this paper uses a weighted sum of three neighbouring pixels according to a two-dimensional correlation model. The weights are obtained by a three-step optimization procedure. The first two stages are offline procedures where the prediction weights are computed from training sequences. The third stage is an online optimization procedure where the offline-obtained prediction weights are further fine-tuned and adapted to each encoded block during encoding using a rate-distortion optimized method, and the modification in this third stage is transmitted to the decoder as side information. Simulation results show average bit rate reductions of 12.02% and 3.28% over the default lossless intra coding in HEVC and the well-known Sample-based Angular Prediction (SAP) method, respectively.
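A minimal sketch of pixel-by-pixel weighted spatial prediction follows. The choice of left, top, and top-left neighbors and the example weights are assumptions for illustration, not the trained weights of the paper (which also fine-tunes them per block at encode time):

    import numpy as np

    def pixelwise_predict(img, w_left=0.45, w_top=0.45, w_tl=0.10):
        # Predict each pixel from three causal neighbors; out-of-frame
        # neighbors default to mid-gray (128), a common convention.
        img = img.astype(float)
        pred = np.zeros_like(img)
        h, w = img.shape
        for y in range(h):
            for x in range(w):
                left = img[y, x - 1] if x > 0 else 128.0
                top = img[y - 1, x] if y > 0 else 128.0
                tl = img[y - 1, x - 1] if x > 0 and y > 0 else 128.0
                pred[y, x] = w_left * left + w_top * top + w_tl * tl
        return pred

    # In lossless intra coding the reconstruction equals the original, so
    # predicting from original pixels matches the codec's behavior; the
    # residual img - pred is what gets entropy coded.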
... This will be used to define the traffic states at each time slot in Subsection B. The video data is encoded periodically using a Group of Pictures (GOP) structure as in [22] [23], which lasts a period of T time slots. The video frames within one GOP are encoded interdependently using motion estimation, while the frames belonging to different GOPs are encoded independently. ...
Preprint
In this paper, we formulate the collaborative multi-user wireless video transmission problem as a multi-user Markov decision process (MUMDP) by explicitly considering the users' heterogeneous video traffic characteristics, time-varying network conditions and the resulting dynamic coupling between the wireless users. These environment dynamics are often ignored in existing multi-user video transmission solutions. To comply with the decentralized nature of wireless networks, we propose to decompose the MUMDP into local MDPs using Lagrangian relaxation. Unlike in conventional multi-user video transmission solutions stemming from the network utility maximization framework, the proposed decomposition enables each wireless user to individually solve its own dynamic cross-layer optimization (i.e. the local MDP) and the network coordinator to update the Lagrangian multipliers (i.e. resource prices) based on not only current, but also future resource needs of all users, such that the long-term video quality of all users is maximized. However, solving the MUMDP requires statistical knowledge of the experienced environment dynamics, which is often unavailable before transmission time. To overcome this obstacle, we then propose a novel online learning algorithm, which allows the wireless users to update their policies in multiple states during one time slot. This is different from conventional learning solutions, which often update one state per time slot. The proposed learning algorithm can significantly improve the learning performance, thereby dramatically improving the video quality experienced by the wireless users over time. Our simulation results demonstrate the efficiency of the proposed MUMDP framework as compared to conventional multi-user video transmission solutions.
... Indeed, the DCT has found application in several image and video coding schemes [5,12], such as JPEG [36], MPEG-1 [39], MPEG-2 [22], H.261 [23], H.263 [24], and H.264 [32,46,51]. ...
Preprint
The discrete cosine transform (DCT) is the key step in many image and video coding standards. The 8-point DCT is an important special case, possessing several low-complexity approximations widely investigated. However, the 16-point DCT has energy compaction advantages. In this sense, this paper presents a new 16-point DCT approximation with null multiplicative complexity. The proposed transform matrix is orthogonal and contains only zeros and ones. The proposed transform outperforms the well-known Walsh-Hadamard transform and the current state-of-the-art 16-point approximation. A fast algorithm for the proposed transform is also introduced. This fast algorithm is experimentally validated using hardware implementations that are physically realized and verified on a 40 nm CMOS Xilinx Virtex-6 XC6VLX240T FPGA chip for a maximum clock rate of 342 MHz. Rapid prototypes on FPGA for an 8-bit input word size show a significant improvement in compressed image quality of up to 1-2 dB at the cost of only eight adders compared to the state-of-the-art 16-point DCT approximation algorithm in the literature [S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy. A novel transform for image compression. In Proceedings of the 53rd IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), 2010].
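Transform approximations like these are conventionally scored by their transform coding gain on a first-order autoregressive (AR(1)) source. A sketch comparing the exact 16-point DCT-II with the normalized Walsh-Hadamard transform mentioned in the abstract (the proposed 0/1 matrix itself is not reproduced here):

    import numpy as np

    def dct_matrix(n):
        # Orthonormal DCT-II basis.
        k, i = np.arange(n)[:, None], np.arange(n)[None, :]
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        m[0] /= np.sqrt(2.0)
        return m

    def walsh_hadamard(n):
        # Sylvester construction, normalized to orthonormal rows.
        h = np.array([[1.0]])
        while h.shape[0] < n:
            h = np.block([[h, h], [h, -h]])
        return h / np.sqrt(n)

    def coding_gain_db(T, rho=0.95):
        # AM/GM ratio of subband variances for an AR(1) source.
        n = T.shape[0]
        R = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        var = np.diag(T @ R @ T.T)
        return 10 * np.log10(var.mean() / np.exp(np.log(var).mean()))

    print(coding_gain_db(dct_matrix(16)))      # DCT: higher gain
    print(coding_gain_db(walsh_hadamard(16)))  # WHT: lower gain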
... The main difference is in the size of the blocks to which MVs are assigned or to which the DCT is applied. In standards up to MPEG-4 ASP, the minimum block size was 8 × 8, whereas H.264/AVC allowed block sizes down to 4 × 4 [83]. To ensure a fair comparison, in which all models accept the same input data, we chose to encode all videos in the two currently most widely used video formats: MPEG-4 ASP and H.264/AVC. ...
Preprint
Computational modeling of visual saliency has become an important research problem in recent years, with applications in video quality estimation, video compression, object tracking, retargeting, summarization, and so on. While most visual saliency models for dynamic scenes operate on raw video, several models have been developed for use with compressed-domain information such as motion vectors and transform coefficients. This paper presents a comparative study of eleven such models as well as two high-performing pixel-domain saliency models on two eye-tracking datasets using several comparison metrics. The results indicate that highly accurate saliency estimation is possible based only on a partially decoded video bitstream. The strategies that have shown success in compressed-domain saliency modeling are highlighted, and certain challenges are identified as potential avenues for further improvement.
... Moreover, the image quality should be good enough to provide as much information about the remote site for teleoperation as possible. In this paper, the H.264 algorithm [21] is used to compress the images before sending and decompress them after receiving. H.264 is a video compression scheme that supports video-conferencing applications. ...
Preprint
In this paper, we introduce a software and hardware structure for on-line mobile robotic systems. The hardware mainly consists of a Multi-Sensor Smart Robot connected to the Internet through 3G mobile network. The system employs a client-server software architecture in which the exchanged data between the client and the server is transmitted through different transport protocols. Autonomous mechanisms such as obstacle avoidance and safe-point achievement are implemented to ensure the robot safety. This architecture is put into operation on the real Internet and the preliminary result is promising. By adopting this structure, it will be very easy to construct an experimental platform for the research on diverse tele-operation topics such as remote control algorithms, interface designs, network protocols and applications etc.
... Motion estimation (ME) and motion compensation (MC) are the fundamental techniques of video coding to remove the temporal redundancy between video frames. Block matching-based ME and block-based MC have been integrated into the reference software of almost all existing video coding standards, including the currently widely adopted H.264/MPEG-4 AVC [1] and the state-of-the-art H.265/MPEG-H High Efficiency Video Coding (HEVC) [2]. The underlying model of block-based MC is the translational motion model, which is too simple to efficiently describe the complex motions in natural videos, such as rotation and zooming. ...
Preprint
In this paper, we study a simplified affine motion model based coding framework to overcome the limitation of the translational motion model while maintaining low computational complexity. The proposed framework has three key contributions. First, we propose to reduce the number of affine motion parameters from 6 to 4. The proposed four-parameter affine motion model not only handles most of the complex motions in natural videos but also saves the bits for two parameters. Second, to efficiently encode the affine motion parameters, we propose two motion prediction modes, i.e., advanced affine motion vector prediction combined with a gradient-based fast affine motion estimation algorithm, and affine model merge, where the latter attempts to reuse the affine motion parameters (instead of the motion vectors) of neighboring blocks. Third, we propose two fast affine motion compensation algorithms. One is one-step sub-pixel interpolation, which reduces the computations of each interpolation. The other is interpolation-precision-based adaptive block size motion compensation, which performs motion compensation at the block level rather than the pixel level to reduce the number of interpolations. Our proposed techniques have been implemented on top of the state-of-the-art High Efficiency Video Coding standard, and the experimental results show that the proposed techniques altogether achieve on average 11.1% and 19.3% bit savings for random-access and low-delay configurations, respectively, on typical video sequences that have rich rotation or zooming motions. Meanwhile, the computational complexity increases of both the encoder and decoder are within an acceptable range.
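The four-parameter model can be written down directly. In its common rotation-zoom-translation form (the paper's exact parameterization and control-point signaling may differ), the motion vector at position (x, y) is mv_x = a·x − b·y + c and mv_y = b·x + a·y + d, with (a, b) capturing zoom/rotation and (c, d) translation. A sketch that evaluates the field at sub-block centers, as block-level motion compensation would:

    import numpy as np

    def affine_mv_field(a, b, c, d, h, w, sub=4):
        # mv_x(x, y) = a*x - b*y + c ; mv_y(x, y) = b*x + a*y + d,
        # evaluated at the center of each sub x sub sub-block.
        ys, xs = np.meshgrid(np.arange(sub / 2, h, sub),
                             np.arange(sub / 2, w, sub), indexing="ij")
        return a * xs - b * ys + c, b * xs + a * ys + d

    # Example: a 1% zoom about the block origin plus a half-pel shift.
    mvx, mvy = affine_mv_field(a=0.01, b=0.0, c=0.5, d=0.0, h=16, w=16)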
... In many cases, much of the content of an image block can be described by a few main structures, permitting a simplified image model with far fewer parameters and thus reduced overhead. In particular, the directional model has become rather popular, e.g. in directional intra prediction modes [61] and directional transforms [62]-[68], including more sophisticated transforms such as bandelets [69] and anisotropic transforms [70]. The GFT framework can also be employed to design simplified adaptive transforms. ...
Preprint
Recent advent of graph signal processing (GSP) has spurred intensive studies of signals that live naturally on irregular data kernels described by graphs (e.g., social networks, wireless sensor networks). Though a digital image contains pixels that reside on a regularly sampled 2D grid, if one can design an appropriate underlying graph connecting pixels with weights that reflect the image structure, then one can interpret the image (or image patch) as a signal on a graph, and apply GSP tools for processing and analysis of the signal in graph spectral domain. In this article, we overview recent graph spectral techniques in GSP specifically for image / video processing. The topics covered include image compression, image restoration, image filtering and image segmentation.
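A small, self-contained example of the GFT machinery the article surveys: build the combinatorial Laplacian of a 4-connected pixel grid, take its eigenvectors as the transform basis, and project a patch onto it. (For a uniform grid this basis is closely related to the 2D DCT; the interesting cases arise when edge weights are adapted to image structure.)

    import numpy as np

    def grid_laplacian(h, w):
        # L = D - A for a 4-connected pixel grid with unit edge weights.
        n = h * w
        A = np.zeros((n, n))
        for y in range(h):
            for x in range(w):
                i = y * w + x
                if x + 1 < w:
                    A[i, i + 1] = A[i + 1, i] = 1.0
                if y + 1 < h:
                    A[i, i + w] = A[i + w, i] = 1.0
        return np.diag(A.sum(axis=1)) - A

    patch = np.random.rand(4, 4)
    eigvals, U = np.linalg.eigh(grid_laplacian(4, 4))
    coeffs = U.T @ patch.reshape(-1)    # graph spectral coefficients
    recon = (U @ coeffs).reshape(4, 4)  # exact, since U is orthonormal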
... A quantization parameter is used to determine the quantization level of transform coefficients in H.264/AVC. An increase of 1 unit in the quantization parameter means an increase of the quantization step size by approximately 12 percent, which in turn means a roughly 12 percent reduction in the video rate [40]. For instance, the variance of GoP size in Star Wars IV with QP 10 is almost three times larger than the variance of GoP size in Star Wars IV with QP 16, although the GoP size variation patterns are identical. ...
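The 12 percent figure follows from the standard's design: in H.264/AVC the quantization step size doubles for every increase of 6 in QP, so a single QP step scales the step size by 2^(1/6).

    # Qstep(QP + 6) = 2 * Qstep(QP)  =>  one QP step multiplies Qstep
    # by 2 ** (1 / 6) ~= 1.1225, i.e. the ~12% increase quoted above.
    print(2 ** (1 / 6))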
Preprint
Streaming video is becoming the predominant type of traffic over the Internet, with reports forecasting video content to account for 80% of all traffic by 2019. With significant investment in the Internet backbone, the main bottleneck remains at the edge servers (e.g., WiFi access points, small cells, etc.). In this work, we propose, and prove the optimality of, a multiuser resource allocation mechanism operating at the edge server that minimizes the probability of stalling of video streams due to buffer underflows. Our proposed policy utilizes the Media Presentation Description (MPD) files of clients, sent in compliance with the Dynamic Adaptive Streaming over HTTP (DASH) protocol, to be cognizant of the deadlines of each media file to be displayed by the clients. The policy then schedules the users in the order of their deadlines. After establishing the optimality of this policy for minimizing the stalling probability in a network with links associated with fixed loss rates, the utility of the algorithm is verified under realistic network conditions with detailed NS-3 simulations.
... This video was selected due to the availability of raw video data, allowing us to generate representations with high MMBRs. We encoded 9 representations with MMBRs distributed between 100 and 20000 kbps with exponentially increasing intervals: 101, 194, 377, 730, 1415, 2743, 5319, 10314, and 20000 kbps, using the H.264/MPEG-4 AVC [62] compression format. The chosen intervals correspond to a roughly linear increase of video quality in terms of Peak Signal-to-Noise Ratio (PSNR) [56]. ...
Preprint
Recently, HTTP-based adaptive streaming has become the de facto standard for video streaming over the Internet. It allows clients to dynamically adapt media characteristics to network conditions in order to ensure a high quality of experience, that is, minimize playback interruptions while maximizing video quality at a reasonable level of quality changes. In the case of live streaming, this task becomes particularly challenging due to latency constraints. The challenge further increases if a client uses a wireless network, where the throughput is subject to considerable fluctuations. Consequently, live streams often exhibit latencies of up to 30 seconds. In the present work, we introduce an adaptation algorithm for HTTP-based live streaming called LOLYPOP (Low-Latency Prediction-Based Adaptation) that is designed to operate with a transport latency of a few seconds. To reach this goal, LOLYPOP leverages TCP throughput predictions on multiple time scales, from 1 to 10 seconds, along with an estimate of the prediction error distribution. In addition to satisfying the latency constraint, the algorithm heuristically maximizes the quality of experience by maximizing the average video quality as a function of the number of skipped segments and quality transitions. In order to select an efficient prediction method, we studied the performance of several time series prediction methods in IEEE 802.11 wireless access networks. We evaluated LOLYPOP under a large set of experimental conditions, limiting the transport latency to 3 seconds, against a state-of-the-art adaptation algorithm from the literature called FESTIVE. We observed that the average video quality is up to a factor of 3 higher than with FESTIVE. We also observed that LOLYPOP is able to reach a broader region in the quality of experience space, and thus it is better adjustable to the user profile or service provider requirements.
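Prediction-based rate selection of this flavor can be sketched in a few lines (a simplification in the spirit of LOLYPOP, not its actual algorithm; names are hypothetical): the client picks the highest bitrate that fits under the predicted throughput minus a margin drawn from the prediction error distribution:

```python
def select_bitrate(bitrates_kbps, predicted_kbps, err_quantile_kbps):
    """Choose the highest representation whose bitrate stays below the
    predicted throughput minus a safety margin taken from the prediction
    error distribution; fall back to the lowest rate if none fits."""
    budget = predicted_kbps - err_quantile_kbps
    feasible = [b for b in bitrates_kbps if b <= budget]
    return max(feasible) if feasible else min(bitrates_kbps)

# Predicted 5 Mbit/s with a 1.2 Mbit/s margin at the chosen risk level.
print(select_bitrate([730, 1415, 2743, 5319], 5000, 1200))   # -> 2743
```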
... Meanwhile, next-generation image compression algorithms with good compression ratios, like FLIF [4] and JPEG-XL [5], have emerged in recent years. Traditional video coding standards such as H.264/AVC [6], H.265/HEVC [7], FFV1 [8], and VVC [9] are also candidates for volumetric image compression. Although these traditional codecs are still prevalent in real scenarios, they all rely on hand-crafted modules that cannot be optimized on massive training data, resulting in relatively poor compression ratios [10]. ...
Article
Full-text available
Recently, learning-based lossless compression methods for volumetric medical images have attracted much attention. They can achieve higher compression ratios than traditional methods, albeit at the cost of slower compression speed. Although using field programmable gate arrays (FPGAs) is feasible to mitigate this disadvantage, existing FPGA-based compression frameworks still need a CPU for co-processing. In this work, we propose a hardware end-to-end neural conditional entropy encoder (HENCE) for losslessly compressing 3D medical images with a balanced compression ratio and speed. To achieve this, we first introduce a context-based entropy model to reduce data redundancy within inter-slice and intra-slice features, using an efficient combination of auto-regressive and recurrent neural networks. Then, we design a hardware-friendly arithmetic coding module to collaborate with our learning-based entropy model. To obtain the cumulative distribution function of the discrete logistic distribution, we further introduce a high-precision sigmoid approximation algorithm based on the Newton-Raphson method. Finally, we design a dataflow mechanism for the entropy model and the coding module, achieving a fully pipelined compression system. Extensive experimental results show that our method outperforms traditional image/video codecs like FLIF, JPEG-XL, and HEVC on several volumetric medical image datasets. Our method also achieves faster encoding speed than existing learning-based medical image compression frameworks.
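The Newton-Raphson sigmoid mentioned above can be illustrated generically: the only division in 1/(1 + e^(-x)) is replaced by a multiply-only reciprocal iteration, which is attractive in hardware. The float sketch below shows the idea only; the paper's fixed-point formulation and exponential approximation are not reproduced:

```python
import math

def sigmoid_nr(x, iters=5):
    """Sigmoid via a Newton-Raphson reciprocal: y <- y * (2 - d*y)
    converges quadratically to 1/d, so no divider is needed.  The initial
    guess 2**-e (from d = m * 2**e, m in [0.5, 1)) is guaranteed to lie in
    the convergence region (0, 2/d)."""
    d = 1.0 + math.exp(-x)          # denominator; table-driven in hardware
    _, e = math.frexp(d)            # d = m * 2**e with m in [0.5, 1)
    y = 2.0 ** -e                   # safe initial guess
    for _ in range(iters):
        y = y * (2.0 - d * y)       # Newton-Raphson reciprocal step
    return y

print(sigmoid_nr(0.7), 1 / (1 + math.exp(-0.7)))   # should agree closely
```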
Article
Neural surveillance video compression methods have demonstrated significant improvements over traditional video compression techniques. In current surveillance video compression frameworks, the first frame in a Group of Pictures (GOP) is usually compressed fully as an I frame, and the subsequent P frames are compressed by referencing this I frame in the Low Delay P (LDP) encoding mode. However, this compression approach overlooks the utilization of background information, which limits its adaptability to different scenarios. In this paper, we propose a novel Adaptive Surveillance Video Compression framework based on a background hyperprior, dubbed ASVC. This background hyperprior acts as side information to assist coding in both the temporal and spatial domains. Our method mainly consists of two components. First, the background information of a GOP is extracted, modeled as a hyperprior, and compressed by existing methods. Then this hyperprior is used as side information to compress both I frames and P frames. ASVC effectively captures the temporal dependencies in the latent representations of surveillance videos by leveraging the background hyperprior for auxiliary video encoding. The experimental results demonstrate that applying ASVC to traditional and learning-based methods significantly improves performance.
Preprint
While learned image compression methods have achieved impressive results in either human visual perception or machine vision tasks, they are often specialized for only one domain. This drawback limits their versatility and generalizability across scenarios and also requires retraining to adapt to new applications, a process that adds significant complexity and cost in real-world scenarios. In this study, we introduce an innovative semantics DISentanglement and COmposition VERsatile codec (DISCOVER) that simultaneously enhances human-eye perception and machine vision tasks. The approach derives a set of labels per task through multimodal large models, to which grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks. Extensive experimental evaluations substantiate the robustness and effectiveness of DISCOVER, demonstrating superior performance in fulfilling the dual objectives of human and machine vision requirements.
Article
How to compress face video is a crucial problem for a series of online applications, such as video chat/conferencing, live broadcasting, and remote education. Compared to other natural videos, these face-centric videos, which contain abundant structural information, can be compactly represented and reconstructed at high quality via deep generative models, so that promising compression performance can be achieved. However, existing generative face video compression schemes suffer from an inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To address this drawback, we propose a 3D-keypoint-and-2D-motion based generative method for face video compression, namely FVC-3K2M, which ensures perceptual compensation and visual consistency between the motion description and face reconstruction. In particular, the temporal evolution of a face video can be characterized by separate 3D keypoints from global and local perspectives, entailing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is further proposed to internally convert 3D keypoints to 2D dense motion, enforcing the face video reconstruction to be perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance the adaptation to various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication at extremely limited bandwidths, e.g., 2 kbps. Compared to state-of-the-art video coding standards and the latest face video compression methods, extensive comparisons demonstrate that our proposed scheme achieves superior compression performance in terms of multiple quality evaluations.
Preprint
Two multiplierless pruned 8-point discrete cosine transform (DCT) approximations are presented. Both transforms have lower arithmetic complexity than state-of-the-art methods. The performance of the new methods was assessed in the image compression context. A JPEG-like simulation was performed, demonstrating the adequacy and competitiveness of the introduced methods. Digital VLSI implementation in CMOS technology was also considered. Both presented methods were realized on the Berkeley Emulation Engine (BEE3).
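To make the two ideas concrete, the sketch below combines a classical multiplierless approximation (the signed DCT, i.e., the entry-wise sign of the exact DCT matrix, used here purely as a stand-in, since this paper's transforms are not reproduced) with pruning, i.e., computing only the lowest-frequency outputs:

```python
import numpy as np

def pruned_sign_dct(x, keep=4):
    """Pruned multiplierless 8-point transform sketch: the rows are the
    entry-wise signs of the exact DCT kernel (all entries +/-1, so each
    output costs only 7 additions/subtractions), and pruning means the
    last 8-keep outputs are never computed at all."""
    k, n = np.arange(8), np.arange(8)
    C = np.cos(np.outer(k, 2 * n + 1) * np.pi / 16)   # exact 8-point DCT kernel
    T = np.sign(C)[:keep]                             # signed, pruned rows
    return T @ x

x = np.array([60, 62, 65, 70, 72, 71, 69, 68], dtype=float)
print(pruned_sign_dct(x, keep=4))                     # 4 coarse DCT-like outputs
```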
Preprint
We study domain-specific video streaming. Specifically, we target a streaming setting where the videos to be streamed from a server to a client are all in the same domain and they have to be compressed to a small size for low-latency transmission. Several popular video streaming services, such as the video game streaming services of GeForce Now and Twitch, fall in this category. While conventional video compression standards such as H.264 are commonly used for this task, we hypothesize that one can leverage the property that the videos are all in the same domain to achieve better video quality. Based on this hypothesis, we propose a novel video compression pipeline. Specifically, we first apply H.264 to compress domain-specific videos. We then train a novel binary autoencoder to encode the leftover domain-specific residual information frame-by-frame into binary representations. These binary representations are then compressed and sent to the client together with the H.264 stream. In our experiments, we show that our pipeline yields consistent gains over standard H.264 compression across several benchmark datasets while using the same channel bandwidth.
Preprint
Due to its remarkable energy compaction properties, the discrete cosine transform (DCT) is employed in a multitude of compression standards, such as JPEG and H.265/HEVC. Several low-complexity integer approximations of the DCT have been proposed for both 1-D and 2-D signal analysis. The increasing demand for low-complexity, energy-efficient methods requires algorithms with even lower computational costs. In this paper, new 8-point DCT approximations with very low arithmetic complexity are presented. The new transforms are proposed by pruning state-of-the-art DCT approximations. The proposed algorithms were assessed in terms of arithmetic complexity, energy retention capability, and image compression performance. In addition, a metric combining performance and computational complexity measures was proposed. Results showed good performance and extremely low computational complexity. The introduced algorithms were mapped into systolic-array digital architectures, physically realized as digital prototype circuits using FPGA technology, and mapped to 45 nm CMOS technology. All hardware-related metrics showed low resource consumption of the proposed pruned approximate transforms. The best proposed transform according to the introduced metric presents a reduction in power consumption of 21-25%.
Preprint
This paper introduces a new fast algorithm for the 8-point discrete cosine transform (DCT) based on the summation-by-parts formula. The proposed method converts the DCT matrix into an alternative transformation matrix that can be decomposed into sparse matrices of low multiplicative complexity. The method is capable of scaled and exact DCT computation, and its associated fast algorithm achieves the theoretical minimal multiplicative complexity for the 8-point DCT. Depending on the nature of the input signal, simplifications can be introduced and the overall complexity of the proposed algorithm can be further reduced. Several types of input signal are analyzed: arbitrary, null-mean, accumulated, and null-mean/accumulated signals. The proposed tool has potential application in harmonic detection, image enhancement, and feature extraction, where the input signal's DC level is discarded and/or the signal is required to be integrated.
Article
Video bit-rate control techniques are essential for efficiently transmitting videos over communication networks. These techniques have been pivotal in broadcasting and Internet streaming services, and they could significantly enhance defense capabilities if applied to defense and aerospace imaging systems. Therefore, this paper first reviews the history and standard technologies of video coding, then describes standardization trends in the latest video coding technologies. Finally, it outlines the features of the latest video bit-rate control techniques and discusses their applicability in the defense and aerospace fields.
Article
Full-text available
H.264 is the ITU-T's new, non-backward-compatible video compression Recommendation that significantly outperforms all previous video compression standards. It consists of a video coding layer (VCL), which performs all the classic signal processing tasks and generates bit strings containing coded macroblocks, and a network adaptation layer (NAL), which adapts those bit strings in a network-friendly way. The paper describes the use of H.264 coded video over best-effort IP networks, using RTP as the real-time transport protocol. After a description of the environment, the error-resilience tools of H.264 and the draft specification of the RTP payload format are introduced. Next, the performance of several possible VCL- and NAL-based error-resilience tools of H.264 is verified in simulations.
Article
Full-text available
This paper reviews recent advances in using B pictures in the context of the draft H.264/AVC video-compression standard. We focus on reference picture selection and linearly combined motion-compensated prediction signals. We show that bidirectional prediction only partially exploits the efficiency of combined prediction signals, whereas multihypothesis prediction allows a more general form of B pictures. The general concept of linearly combined prediction signals chosen from an arbitrary set of reference pictures improves the H.264/AVC test model TML-9, which is used in the following. We outline H.264/AVC macroblock prediction modes for B pictures, classify them into four groups, and compare their efficiency in terms of rate-distortion performance. When investigating multihypothesis prediction, we show that bidirectional prediction is a special case of this concept. Multihypothesis prediction also allows two combined forward prediction signals. Experimental results show that this case is also advantageous in terms of compression efficiency. The draft H.264/AVC video-compression standard offers improved entropy coding by context-based adaptive binary arithmetic coding. Simulations show that the gains from multihypothesis prediction and arithmetic coding are additive. B pictures establish an enhancement layer and are predicted from reference pictures that are provided by the base layer. The quality of the base layer influences the rate-distortion trade-off for B pictures. We demonstrate how the quality of the B pictures should be reduced to improve the overall rate-distortion performance of the scalable representation.
Article
Full-text available
In video coding standards, a compliant bit stream must be decoded by a hypothetical decoder that is conceptually connected to the output of an encoder and consists of a decoder buffer, a decoder, and a display unit. This virtual decoder is known as the hypothetical reference decoder (HRD) in H.263 and the video buffering verifier in MPEG. The encoder must create a bit stream so that the hypothetical decoder buffer does not overflow or underflow. These previous decoder models assume that a given bit stream will be transmitted through a channel of a known bit rate and will be decoded (after a given buffering delay) by a device of some given buffer size. Therefore, these models are quite rigid and do not address the requirements of many of today's important video applications such as broadcasting video live or streaming pre-encoded video on demand over network paths with various peak bit rates to devices with various buffer sizes. In this paper, we present a new HRD for H.264/AVC that is more general and flexible than those defined in prior standards and provides significant additional benefits.
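The buffer constraints described above amount to a leaky-bucket simulation; a minimal conformance-check sketch (fixed frame rate, instantaneous decoding, and illustrative names; the generalized HRD checks such constraints for several rate/buffer pairs at once):

```python
def hrd_check(frame_bits, rate_bps, fps, buffer_bits, initial_delay_s):
    """Leaky-bucket decoder buffer: bits arrive at rate_bps, each frame is
    drained instantaneously at its decode time, and the stream is
    conformant only if the buffer never underflows (a frame not fully
    received in time) or overflows."""
    fullness = rate_bps * initial_delay_s
    for i, bits in enumerate(frame_bits):
        if fullness > buffer_bits:
            return f"overflow before frame {i}"
        if fullness < bits:
            return f"underflow at frame {i}"
        fullness = fullness - bits + rate_bps / fps
    return "conformant"

print(hrd_check([120_000, 40_000, 35_000, 90_000],
                rate_bps=1_000_000, fps=25,
                buffer_bits=300_000, initial_delay_s=0.2))
```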
Article
Full-text available
This paper presents an overview of the transform and quantization designs in H.264. Unlike the popular 8×8 discrete cosine transform used in previous standards, the 4×4 transforms in H.264 can be computed exactly in integer arithmetic, thus avoiding inverse transform mismatch problems. The new transforms can also be computed without multiplications, just additions and shifts, in 16-bit arithmetic, thus minimizing computational complexity, especially for low-end processors. By using short tables, the new quantization formulas use multiplications but avoid divisions.
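The 4x4 core transform described in this paper has a direct matrix rendition; its small integer entries are what allow a multiplication-free implementation (the sketch below uses a plain matrix product for clarity rather than the add/shift butterflies, and omits the scaling that the standard folds into quantization):

```python
import numpy as np

# The 4x4 core transform matrix of H.264/AVC.  All entries are +/-1 or +/-2,
# so Cf @ X @ Cf.T needs only additions, subtractions, and one-bit shifts.
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core_transform(block4x4):
    """Exact integer 2-D transform of a 4x4 residual block (pre-scaling),
    avoiding the inverse-transform mismatch of floating-point DCTs."""
    return Cf @ block4x4 @ Cf.T

X = np.array([[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]])
print(forward_core_transform(X))
```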
Article
Full-text available
Context-based adaptive binary arithmetic coding (CABAC) as a normative part of the new ITU-T/ISO/IEC standard H.264/AVC for video compression is presented. By combining an adaptive binary arithmetic coding technique with context modeling, a high degree of adaptation and redundancy reduction is achieved. The CABAC framework also includes a novel low-complexity method for binary arithmetic coding and probability estimation that is well suited for efficient hardware and software implementations. CABAC significantly outperforms the baseline entropy coding method of H.264/AVC for the typical area of envisaged target applications. For a set of test sequences representing typical material used in broadcast applications and for a range of acceptable video quality of about 30 to 38 dB, average bit-rate savings of 9%-14% are achieved.
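The principle, interval subdivision driven by adaptive context probabilities, can be sketched generically (this is a textbook adaptive binary arithmetic coder, not the normative CABAC engine with its 64-state estimator and multiplication-free subdivision):

```python
import math

def encode_interval(bits, p1=0.5, adapt=0.05):
    """Adaptive binary arithmetic coding in its textbook form: the interval
    is narrowed according to the current estimate of P(bit=1), and the
    estimate adapts toward each observed bit.  Any number inside the final
    interval encodes the whole bit string, so the interval width determines
    the code length."""
    low, width = 0.0, 1.0
    for b in bits:
        zero_part = width * (1.0 - p1)       # sub-interval assigned to '0'
        if b:
            low, width = low + zero_part, width - zero_part
        else:
            width = zero_part
        p1 += adapt * (b - p1)               # context-model probability update
    return low, width

low, width = encode_interval([1, 1, 0, 1, 1, 1, 0, 1])
print(f"~{-math.log2(width):.1f} bits for 8 binary symbols")
```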
Article
Full-text available
Video transmission in wireless environments is a challenging task calling for high compression efficiency as well as a network-friendly design. Both have been major goals of the H.264/AVC standardization effort, addressing "conversational" (i.e., video telephony) and "nonconversational" (i.e., storage, broadcast, or streaming) applications. The video compression performance of the H.264/AVC video coding layer typically provides a significant improvement. The network-friendly design goal of H.264/AVC is addressed via the network abstraction layer, which has been developed to transport the coded video data over any existing and future networks, including wireless systems. The main objective of this paper is to provide an overview of the tools which are likely to be used in wireless environments and to discuss the most challenging application, wireless conversational services, in greater detail. Appropriate justifications for the application of different tools based on experimental results are presented.
Article
Full-text available
This paper reviews recent advances in using B pictures in the context of the draft H.26L video compression standard. We focus on reference picture selection and linearly combined motion-compensated prediction signals. We show that bi-directional prediction exploits partially the efficiency of combined prediction signals whereas multihypothesis prediction allows a more general form of B pictures. The general concept of linearly combined prediction signals chosen from an arbitrary set of reference pictures can further improve the H.26L test model TML-9 which is used in the following.
Book
Preface. Introduction. 1. State-Of-The-Art Video Transmission. 2. Rate-Constrained Coder Control. 3. Long-Term Memory Motion-Compensated Prediction. 4. Affine Multi-Frame Motion-Compensated Prediction. 5. Fast Motion Estimation for Multi-Frame Prediction. 6. Error Resilient Video Transmission. 7. Conclusions. References. Index.
Conference Paper
Multi-hypothesis motion-compensated prediction extends traditional motion-compensated prediction used in video coding schemes. Known algorithms for block-based multi-hypothesis motion-compensated prediction are, for example, overlapped block motion compensation (OBMC) and bidirectionally predicted frames (B-frames). This paper presents a generalization of these algorithms in a rate-distortion framework. All blocks which are available for prediction are called hypotheses. Further, we explicitly distinguish between the search space and the superposition of hypotheses. Hypotheses are selected from a search space, and their spatio-temporal positions are transmitted by means of spatio-temporal displacement codewords. Constant predictor coefficients are used to linearly combine the hypotheses of a multi-hypothesis. The presented design algorithm provides an estimation criterion for optimal multi-hypotheses, a rule for optimal displacement codes, and a condition for optimal predictor coefficients. Statistically dependent hypotheses of a multi-hypothesis are determined by an iterative algorithm. Experimental results show that increasing the number of hypotheses from 1 to 8 provides prediction gains of up to 3 dB in prediction error.
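The linear superposition at the heart of this formulation is simple to state in code; with two hypotheses and coefficients (1/2, 1/2) it reduces to classical B-frame averaging (a minimal numpy sketch with illustrative data):

```python
import numpy as np

def multihypothesis_prediction(hypotheses, coeffs):
    """Linear combination of motion-compensated hypotheses.  With two
    hypotheses and coefficients (0.5, 0.5) this degenerates to classical
    bidirectional (B-frame) prediction; more hypotheses and other constant
    coefficients give the generalized predictor studied above."""
    return sum(c * h for c, h in zip(coeffs, hypotheses))

h1 = np.array([[100., 102.], [101., 103.]])   # block from reference frame 1
h2 = np.array([[104., 103.], [102., 106.]])   # block from reference frame 2
print(multihypothesis_prediction([h1, h2], [0.5, 0.5]))
```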
Article
In order to reduce the bit rate of video signals, the standardized coding techniques apply motion-compensated prediction in combination with transform coding of the prediction error. By mathematical analysis, it is shown that aliasing components deteriorate the prediction efficiency. To compensate for the aliasing, two-dimensional (2-D) and three-dimensional (3-D) interpolation filters are developed. As a result, motion- and aliasing-compensated prediction with 1/4-pel displacement vector resolution and a separable 2-D Wiener interpolation filter provides a coding gain of up to 2 dB when compared to the 1/2-pel displacement vector resolution used in H.263 or MPEG-2. An additional coding gain of 1 dB can be obtained with 1/8-pel displacement vector resolution when compared to 1/4-pel resolution. As a consequence of the significantly improved coding efficiency, a Wiener interpolation filter with 1/4-pel displacement vector resolution is applied in H.264/AVC and in MPEG-4 (advanced simple profile).
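For reference, H.264/AVC's luma half-sample interpolation uses the 6-tap filter (1, -5, 20, 20, -5, 1)/32 that resulted from this Wiener design, with quarter-sample values obtained by averaging neighboring positions; a one-dimensional sketch:

```python
import numpy as np

def half_pel_interp(row):
    """H.264/AVC luma half-sample interpolation along one dimension: the
    6-tap filter (1, -5, 20, 20, -5, 1)/32 is applied between each pair of
    integer samples (with rounding and clipping to 8-bit range)."""
    taps = np.array([1, -5, 20, 20, -5, 1])
    out = []
    for i in range(2, len(row) - 3):          # positions with full 6-tap support
        h = int(np.dot(taps, row[i - 2:i + 4]))
        out.append(int(np.clip((h + 16) >> 5, 0, 255)))
    return out

row = np.array([10, 20, 40, 80, 120, 160, 180, 190], dtype=int)
print(half_pel_interp(row))   # half-sample values between integer positions
```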
Article
This paper describes the adaptive deblocking filter used in the H.264/MPEG-4 AVC video coding standard. The filter performs simple operations to detect and analyze artifacts on coded block boundaries and attenuates them by applying a selected filter.
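The detection step reduces to a handful of sample comparisons against QP-dependent thresholds; a sketch of the per-edge decision (the threshold values below are illustrative; the standard derives alpha and beta from tables indexed by QP):

```python
def should_filter(p1, p0, q0, q1, alpha, beta, bs):
    """H.264/AVC-style deblocking decision for one line of samples across a
    block edge (p-side / q-side): the edge is smoothed only when the
    boundary strength bs is nonzero and the local gradients are small
    enough to look like a coding artifact rather than a real image edge."""
    return (bs > 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)

# A small step across the edge with flat neighborhoods: likely an artifact.
print(should_filter(p1=80, p0=81, q0=88, q1=87, alpha=12, beta=4, bs=2))    # True
# A large step: likely real content, left untouched.
print(should_filter(p1=80, p0=81, q0=140, q1=141, alpha=12, beta=4, bs=2))  # False
```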
Article
This paper discusses two new frame types, SP-frames and SI-frames, defined in the emerging video coding standard, known as ITU-T Rec. H.264 or ISO/IEC MPEG-4/Part 10-AVC. The main feature of SP-frames is that identical SP-frames can be reconstructed even when different reference frames are used for their prediction. This property allows them to replace I-frames in applications such as splicing, random access, and error recovery/resilience. We also include a description of SI-frames, which are used in conjunction with SP-frames. Finally, simulation results illustrating the coding efficiency of SP-frames are provided. It is shown that SP-frames have significantly better coding efficiency than I-frames while providing similar functionalities.
Article
A unified approach to the coder control of video coding standards such as MPEG-2, H.263, MPEG-4, and the draft video coding standard H.264/AVC (advanced video coding) is presented. The performance of the various standards is compared by means of PSNR and subjective testing results. The results indicate that H.264/AVC compliant encoders typically achieve essentially the same reproduction quality as encoders that are compliant with the previous standards while typically requiring 60% or less of the bit rate.
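The coder control referred to here is Lagrangian rate-distortion optimization: each candidate mode is charged a cost J = D + lambda * R. A minimal sketch, using the lambda(QP) coupling commonly reported for H.264/AVC test models (mode names and costs below are made up for illustration):

```python
def best_mode(candidates, qp):
    """Lagrangian mode decision: pick the coding mode minimizing
    J = D + lambda * R.  The coupling lambda = 0.85 * 2**((QP - 12) / 3)
    is the widely cited H.264/AVC test-model choice from the
    rate-distortion optimization literature."""
    lam = 0.85 * 2 ** ((qp - 12) / 3)
    return min(candidates, key=lambda m: m["ssd"] + lam * m["bits"])

modes = [
    {"name": "intra16", "ssd": 9200,  "bits": 96},
    {"name": "inter",   "ssd": 4100,  "bits": 180},
    {"name": "skip",    "ssd": 15500, "bits": 1},
]
print(best_mode(modes, qp=28)["name"])   # -> "inter" at this QP
```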
Article
Long-term memory motion-compensated prediction extends the spatial displacement vector utilized in block-based hybrid video coding by a variable time delay, permitting the use of more frames than just the previously decoded one for motion-compensated prediction. The long-term memory covers several seconds of decoded frames at the encoder and decoder. The use of multiple frames for motion compensation in most cases provides significantly improved prediction gain. The variable time delay has to be transmitted as side information, requiring an additional bit rate that may be prohibitive when the size of the long-term memory becomes too large. Therefore, we control the bit rate of the motion information by employing rate-constrained motion estimation. Simulation results are obtained by integrating long-term memory prediction into an H.263 codec. Reconstruction PSNR improvements of up to 2 dB for the Foreman sequence and 1.5 dB for the Mother-Daughter sequence are demonstrated in comparison to the TMN-2.0 H.263 coder. These PSNR improvements correspond to bit-rate savings of up to 34% and 30%, respectively. Mathematical inequalities are used to speed up motion estimation while achieving full prediction gain.
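Rate-constrained multi-frame motion estimation can be sketched as a joint search over displacement and time delay, with the side-information cost entering a Lagrangian J = SAD + lambda * R (the rate model and data below are toy assumptions, not the paper's entropy codes):

```python
import numpy as np

def rc_multiframe_me(block, refs, lam, mv_range=2):
    """Rate-constrained long-term-memory motion search: every stored
    reference frame is searched, and the cost of coding the displacement
    plus the variable time delay enters the minimization as
    J = SAD + lambda * R.  R is approximated by the number of nonzero
    components, standing in for actual motion/delay code lengths."""
    h, w = block.shape
    best = (np.inf, None)
    for delay, ref in enumerate(refs):
        for dy in range(-mv_range, mv_range + 1):
            for dx in range(-mv_range, mv_range + 1):
                y0, x0 = mv_range + dy, mv_range + dx
                cand = ref[y0:y0 + h, x0:x0 + w]
                sad = np.abs(block - cand).sum()
                rate = (dy != 0) + (dx != 0) + (delay != 0)   # toy rate model
                if sad + lam * rate < best[0]:
                    best = (sad + lam * rate, (dx, dy, delay))
    return best

rng = np.random.default_rng(1)
refs = [rng.integers(0, 255, (12, 12)).astype(float) for _ in range(3)]
block = refs[2][2:6, 3:7].copy()   # block actually present in the oldest frame
print(rc_multiframe_me(block, refs, lam=20.0))   # finds (dx=1, dy=0, delay=2)
```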
Article
This article provides an overview of H.263, the new ITU-T Recommendation for low-bit-rate video communication. H.263 specifies a coded representation for compressing the moving picture component of audio-visual signals at low bit rates. The basic structure of the video source coding algorithm is taken from ITU-T Recommendation H.261 and is a hybrid of interpicture prediction to reduce temporal redundancy and transform coding of the prediction residual to reduce spatial redundancy. The source coder can operate on five standardized picture formats: sub-QCIF, QCIF, CIF, 4CIF, and 16CIF. The decoder has motion compensation capability with half-pixel precision, in contrast to H.261 which uses full-pixel precision and employs a loop filter. H.263 includes four negotiable coding options which provide improved coding efficiency: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and PB-frames