Conference Paper

A Hybrid Deep Animation Codec for Low-Bitrate Video Conferencing

... By exploiting a strong facial prior for compact encoding and combining a powerful generative model for high-quality decoding, GFVC can realize promising compression efficiency for face video compression. Various efforts have been made to improve representation compactness [2], [4], [33], reconstruction quality [5], [6], [34], modeling capacity [35], [36], complexity reduction [8], [37], latency optimization [38] and the diversity of applications [39]. Despite substantial advancements, there remains a lack of a systematic survey and thorough analysis in the field of GFVC research. ...

... HDAC [34] (ICIP 2022) proposes a layered coding scheme that helps GFVC improve long-term dependencies and alleviate background occlusions. ...

... These five basic GFVC algorithms are implemented as multi-resolution models [64] to support both 256×256 and 512×512 resolutions. For performance evaluation, we follow the general pipeline in Fig. 1 and the test procedure in [59], where the intra mode of the VTM 22. ... c) GFVC with Different Optimization Strategies: we choose three representative optimization strategies and their corresponding models, i.e., HDAC [34], RDAC [6] and MR-DAC [53]. These three algorithms use DAC [2] as the basic animation-based model, and aim to improve prediction quality and bit-rate coverage with layered coding, residual ...
Preprint
The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution: it characterizes complex facial dynamics as compact latent codes for bitstream compactness at the encoder side and leverages powerful deep generative models to reconstruct high-fidelity face signals from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm could enable high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potentials with a unified high-level syntax and develop a low-complexity GFVC system, both of which are expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.
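The GFVC pipeline summarized in this abstract sends one intra-coded reference frame plus a compact motion representation per frame, and the decoder synthesizes each frame generatively from that pair. Below is a minimal, hypothetical Python sketch of that flow; the functions `extract_keypoints`, `quantize`, and `animate` are illustrative stand-ins, not components of any published GFVC codec.

```python
import numpy as np

# Hypothetical stand-ins for the learned components of a GFVC-style codec.

def extract_keypoints(frame: np.ndarray, num_kp: int = 10) -> np.ndarray:
    """Compact motion representation: a few (x, y) keypoints per frame.
    A real GFVC encoder uses a learned detector; here we sample deterministic
    pseudo-keypoints purely to keep the sketch runnable."""
    h, w = frame.shape[:2]
    rng = np.random.default_rng(int(frame.mean() * 1000) % 2**32)
    return rng.uniform(low=[0, 0], high=[w, h], size=(num_kp, 2))

def quantize(kp: np.ndarray, step: float = 1.0) -> np.ndarray:
    """Uniform quantization of keypoint coordinates before entropy coding."""
    return np.round(kp / step).astype(np.int32)

def animate(reference: np.ndarray, kp_ref: np.ndarray, kp_tgt: np.ndarray) -> np.ndarray:
    """Generative warp of the reference toward the target keypoints.
    A real decoder runs a dense-motion + generator network; this stub just
    returns the reference unchanged."""
    return reference.copy()

# Encoder side: transmit one intra-coded reference plus per-frame keypoints.
reference = np.zeros((256, 256, 3), dtype=np.float32)            # intra-coded key frame
frames = [np.full((256, 256, 3), i, dtype=np.float32) for i in range(1, 4)]
kp_ref = extract_keypoints(reference)
bitstream = [quantize(extract_keypoints(f)) for f in frames]      # a few bytes per frame

# Decoder side: reconstruct each frame from the reference and decoded keypoints.
reconstructed = [animate(reference, kp_ref, kp.astype(np.float32)) for kp in bitstream]
print(len(bitstream), "frames coded as", bitstream[0].shape, "keypoint arrays")
```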
... DAC [1] proposes a deep learning approach for ultra-low bitrate video compression that reconstructs target frames by using a keyframe and target keypoints representing facial features. HDAC [2] uses non-key frames with a high compression ratio as auxiliary information to handle background motion and disocclusions that cannot be captured by keypoints. The limitations of keypoints are overcome by using animated and auxiliary frame features to reconstruct the target frame. ...
... Furthermore, by efficiently encoding the residuals between consecutive frames over time, temporal dependencies are eliminated, leading to improved compression efficiency. With advanced image animation techniques [4,5], these animation-based video compression methods [1-3] have demonstrated superior results. However, these unidirectional methods rely on a single keyframe, and as the target frame moves away from the keyframe, the loss of temporal correlation makes it difficult to capture large facial movements, leading to distortions in the face region of the generated frames. ...

... We demonstrated an average bitrate reduction of 24% compared to HDAC [2] and 55% compared to the latest animation-based codec RDAC [3]. Additionally, we achieved a 35% reduction compared to the low-delay configuration of the latest video coding standard, Versatile Video Coding (VVC) [6]. ...
Preprint
Existing deep facial animation coding techniques efficiently compress talking head videos by applying deep generative models. Instead of compressing the entire video sequence, these methods focus on compressing only the keyframe and the keypoints of non-keyframes (target frames). The target frames are then reconstructed by utilizing a single keyframe and the keypoints of the target frame. Although these unidirectional methods can reduce the bitrate, they rely on a single keyframe and often struggle to capture large head movements accurately, resulting in distortions in the facial region. In this paper, we propose a novel bidirectional learned animation codec that generates natural facial videos using past and future keyframes. First, in the Bidirectional Reference-Guided Auxiliary Stream Enhancement (BRG-ASE) process, we introduce a compact auxiliary stream for non-keyframes, which is enhanced by adaptively selecting one of two keyframes (past and future). This stream improves video quality with a slight increase in bitrate. Then, in the Bidirectional Reference-Guided Video Reconstruction (BRG-VRec) process, we animate the adaptively selected keyframe and reconstruct the target frame using both the animated keyframe and the auxiliary frame. Extensive experiments demonstrate a 55% bitrate reduction compared to the latest animation-based video codec, and a 35% bitrate reduction compared to the latest video coding standard, Versatile Video Coding (VVC), on a talking head video dataset. This showcases the efficiency of our approach in improving video quality while simultaneously decreasing the bitrate.
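One simple way to picture the adaptive past/future keyframe selection described in this abstract is to pick whichever keyframe lies closer to the target in keypoint space. The sketch below illustrates that idea only; `keypoint_distance` and `select_reference` are hypothetical helpers, not the authors' BRG-ASE implementation.

```python
import numpy as np

def keypoint_distance(kp_a: np.ndarray, kp_b: np.ndarray) -> float:
    """Mean Euclidean distance between two keypoint sets (a proxy for the motion gap)."""
    return float(np.mean(np.linalg.norm(kp_a - kp_b, axis=-1)))

def select_reference(kp_past: np.ndarray, kp_future: np.ndarray, kp_target: np.ndarray) -> str:
    """Pick whichever keyframe (past or future) lies closer to the target in
    keypoint space; the chosen keyframe would then be animated and fused with
    the auxiliary stream during reconstruction."""
    d_past = keypoint_distance(kp_past, kp_target)
    d_future = keypoint_distance(kp_future, kp_target)
    return "past" if d_past <= d_future else "future"

# Toy example: the target pose sits closer to the future keyframe.
kp_past = np.array([[0.10, 0.10], [0.20, 0.20]])
kp_future = np.array([[0.80, 0.80], [0.90, 0.90]])
kp_target = np.array([[0.70, 0.70], [0.85, 0.85]])
print(select_reference(kp_past, kp_future, kp_target))  # -> "future"
```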
... Recently, inspired by image animation applications, model-based video compression [16, 33-40] has become a new research direction. [33] proposed a real-time, mobile-device-compatible architecture based on FOMM, yet it struggles with large pose variations due to the fixed reference frame. ...

... DAC also optimizes the selection of keyframes using the objective metric PSNR. Building on DAC [36], [37] proposed the hybrid deep animation codec (H-DAC) [37], a layered hybrid coding scheme, to address the problem of rapid performance saturation as bandwidth increases by introducing a lower-bitrate version of the auxiliary stream. However, this approach introduces additional error and results in an increased bitrate. ...
Article
Full-text available
The motion model based video coding approach, which employs sparse sets of keypoints instead of dense optical flow, can efficiently compress videos at ultra-low bitrates. Such schemes obtain notable performance gains over traditional video codecs in face-centric scenarios, such as video conferencing. However, due to the high complexity of human poses, there is still a lack of research on motion model based human body video coding, especially in the case of large pose variations. In order to overcome this limitation, we present a thin-plate spline motion model based portrait video compression framework oriented to adaptive pose processing. Firstly, a more flexible thin-plate spline transformation, rather than a simple affine transformation, is adopted for motion estimation, since its nonlinear property allows representing more complex motions. Meanwhile, spatial constraints are incorporated into the keypoint detector to generate keypoints that are more consistent with the human poses, thus obtaining more accurate optical flow. In addition, a motion intensity evaluation module is designed at the encoder side to dynamically evaluate the inter-frame motion intensity. An Adaptive Reference Frame Selection algorithm is then devised at the decoder side to adaptively select the reconstruction scheme for different intensities of portrait motion. Finally, a multi-frame reconstruction module is introduced for large pose variations to improve the consistency of human pose and subjective quality. The experimental results demonstrate that, compared to the state-of-the-art video coding standard Versatile Video Coding and existing motion model based compression techniques, our proposed scheme copes better with large pose variation scenarios and delivers superior objective and subjective quality at similar bitrates with higher temporal consistency.
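The encoder-side motion intensity evaluation and decoder-side Adaptive Reference Frame Selection described above can be pictured as a simple threshold switch over a motion statistic. The following sketch is an assumed illustration (the keypoint-displacement statistic and the thresholds are made up for the example), not the paper's actual module.

```python
import numpy as np

def motion_intensity(kp_prev: np.ndarray, kp_curr: np.ndarray) -> float:
    """Encoder-side motion intensity: mean keypoint displacement between frames."""
    return float(np.mean(np.linalg.norm(kp_curr - kp_prev, axis=-1)))

def choose_reconstruction_scheme(intensity: float,
                                 low_thr: float = 0.05,
                                 high_thr: float = 0.20) -> str:
    """Decoder-side switch driven by the signalled motion intensity
    (the thresholds here are illustrative, not the paper's values)."""
    if intensity < low_thr:
        return "single-reference animation"    # small motion: one reference suffices
    if intensity < high_thr:
        return "adaptive reference selection"  # moderate motion: re-pick the reference
    return "multi-frame reconstruction"        # large pose variation: fuse several frames

kp_prev = np.array([[0.50, 0.50], [0.60, 0.40]])
kp_curr = np.array([[0.80, 0.55], [0.85, 0.45]])
print(choose_reconstruction_scheme(motion_intensity(kp_prev, kp_curr)))
```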
... This is particularly problematic in scenes with large pose variations. Previous methods have explored various strategies to address this, such as improving sparse motion representation vectors [3], multiple-reference coding with feature fusion [6], [7], side information enhancement [4], and residual coding [5]. Despite these efforts, the issue of prediction quality deterioration as target frames move further in time from the references remains unsolved. ...

... Initial methods such as our previously proposed DAC [1] and similar works [2], [3] used a single reference frame along with motion landmarks from subsequent frames to reconstruct the video sequence, achieving competitive coding performance at ultra-low bitrates. To overcome the loss of temporal coherence in the animation, in [4] we proposed a hybrid, layered coding scheme consisting of a low-quality conventional HEVC stream and an animation-based stream similar to DAC. A variant of this solution has been recently submitted to GFVC standardization in JVET-AH0114 [10]. ...

... Animation can also be interpreted as a spatial predictor, with residual coding added to mimic classical closed-loop video codecs [5]. These codecs [4], [5] can, in principle, reduce or eliminate temporal drift in animation. However, the extra transmission cost of the low-quality HEVC or residual bitstream limits their functionality at extremely low bitrates. ...
Preprint
Full-text available
Generative face video coding (GFVC) has been demonstrated as a potential approach to low-latency, low-bitrate video conferencing. GFVC frameworks achieve an extreme gain in coding efficiency with over 70% bitrate savings when compared to conventional codecs at bitrates below 10 kbps. In recent MPEG/JVET standardization efforts, all the information required to reconstruct video sequences using GFVC frameworks is adopted as part of the supplemental enhancement information (SEI) in existing compression pipelines. In light of this development, we aim to address a challenge that has been weakly addressed in prior GFVC frameworks, i.e., reconstruction drift as the distance between the reference and target frames increases. This challenge creates the need to update the reference buffer more frequently by transmitting more Intra-refresh frames, which are the most expensive element of the GFVC bitstream. To overcome this problem, we propose instead multiple reference animation as a robust approach to minimizing reconstruction drift, especially when used in a bi-directional prediction mode. Further, we propose a contrastive learning formulation for multi-reference animation. We observe that using a contrastive learning framework enhances the representation capabilities of the animation generator. The resulting framework, MRDAC (Multi-Reference Deep Animation Codec), can therefore be used to compress longer sequences with fewer reference frames or achieve a significant gain in reconstruction accuracy at comparable bitrates to previous frameworks. Quantitative and qualitative results show significant coding and reconstruction quality gains compared to previous GFVC methods, and more accurate animation quality in the presence of large pose and facial expression changes.
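A rough way to picture multi-reference animation is to animate every available reference toward the target keypoints and blend the predictions, giving more weight to references whose pose is closer to the target. The sketch below is only such an illustration; the weighting rule and helper names are assumptions, and the actual MRDAC model learns this fusion (with a contrastive objective) rather than using a hand-crafted rule.

```python
import numpy as np

def animate(reference: np.ndarray, kp_ref: np.ndarray, kp_target: np.ndarray) -> np.ndarray:
    """Stand-in for the animation generator: a real model warps the reference
    toward the target keypoints; here it returns the reference unchanged."""
    return reference

def fuse_multi_reference(references, ref_keypoints, kp_target):
    """Animate every reference toward the target and blend the predictions,
    weighting each by how close its keypoints are to the target's."""
    preds, weights = [], []
    for ref, kp_ref in zip(references, ref_keypoints):
        preds.append(animate(ref, kp_ref, kp_target))
        dist = float(np.mean(np.linalg.norm(kp_ref - kp_target, axis=-1)))
        weights.append(1.0 / (dist + 1e-6))
    weights = np.asarray(weights) / np.sum(weights)
    return sum(w * p for w, p in zip(weights, preds))

refs = [np.full((4, 4), 0.2), np.full((4, 4), 0.8)]
kps = [np.array([[0.1, 0.1]]), np.array([[0.9, 0.9]])]
kp_tgt = np.array([[0.8, 0.8]])
print(fuse_multi_reference(refs, kps, kp_tgt).mean())  # dominated by the closer (second) reference
```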
... Recent work on learning-based video coding for videoconferencing applications has shown that it is possible to compress videos of talking heads with extremely low bitrate, without significant losses in visual quality [1,2,3,4,5,6]. The basic tenet of these methods is that face motion can be represented through a compact set of sparse keypoints [7], which can be transmitted and used at the decoder side to animate a reference video frame. ...
... Image animation models have been applied to compress talking head videos at ultra-low bitrates in conferencing-type applications [1,2,3,4,5,6]. Different from other learning-based compression frameworks [8,9,10,11,12,13,14], the animation-based codecs in [3] and [4] propose architectures that use a variable number of motion keypoints to change the reconstruction quality within a small range of low bitrates. ...

... ability since it relies only on face animation. In our recent work [2], we proposed a hybrid coding architecture (HDAC) that uses a low-quality HEVC bitstream as side information to enhance the final result of the animation codec. While improving on previous methods, the use of this low-quality auxiliary stream limits in practice the ability to reconstruct high-frequency details. ...
Preprint
Full-text available
We address the problem of efficiently compressing video for conferencing-type applications. We build on recent approaches based on image animation, which can achieve good reconstruction quality at very low bitrates by representing face motions with a compact set of sparse keypoints. However, these methods encode video in a frame-by-frame fashion, i.e., each frame is reconstructed from a reference frame, which limits reconstruction quality when more bandwidth is available. Instead, we propose a predictive coding scheme which uses image animation as a predictor, and codes the residual with respect to the actual target frame. The residuals can in turn be coded in a predictive manner, thus efficiently removing temporal dependencies. Our experiments indicate a significant bitrate gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC, on a dataset of talking-head videos.
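The predictive scheme described above uses the animated frame as a spatial predictor and codes the residual, with residuals themselves coded predictively in a closed loop so that the encoder tracks exactly what the decoder reconstructs. The sketch below illustrates that loop under strong simplifications; `animate` and `code_residual` are placeholders, not the paper's networks.

```python
import numpy as np

def animate(reference: np.ndarray, kp_target) -> np.ndarray:
    """Stand-in for the animation-based spatial predictor: a real model would
    warp the reference toward the target keypoints; here it returns the reference."""
    return reference

def code_residual(residual: np.ndarray, step: float = 0.1) -> np.ndarray:
    """Lossy residual coding modelled as uniform quantization (a placeholder
    for a learned residual encoder/decoder)."""
    return np.round(residual / step) * step

reference = np.full((4, 4), 0.5)
targets = [np.full((4, 4), 0.5 + 0.07 * t) for t in range(1, 4)]

prev_decoded_residual = np.zeros_like(reference)
decoded_frames = []
for target in targets:
    prediction = animate(reference, kp_target=None)       # animation as spatial predictor
    residual = target - prediction                        # what the animation missed
    # Predict the current residual from the previously *decoded* one
    # (temporal residual prediction) and code only the difference.
    delta = code_residual(residual - prev_decoded_residual)
    decoded_residual = prev_decoded_residual + delta       # closed loop: matches decoder state
    decoded_frames.append(prediction + decoded_residual)
    prev_decoded_residual = decoded_residual

print([float(np.abs(f - t).max()) for f, t in zip(decoded_frames, targets)])
```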
... HEVC [16] and VVC [1], renowned for their high compression efficiency, have been widely adopted as conventional and advanced approaches for video compression. In recent years, learning-based codecs [2, 8-10, 17] have demonstrated superior performance compared to conventional approaches. In particular, they have achieved high-quality reconstruction under challenging low-bitrate conditions, which was difficult for traditional methods, thereby improving encoding efficiency. ...

... Our codec achieves a 33% bitrate reduction in FID compared to HEVC and a 22% reduction compared to VVC. Furthermore, it reduces the bitrate in most cases when compared to existing learning-based video codecs [2, 8-10, 17]. Fig. 2 provides the rate-distortion performance of the proposed codec compared to existing video codecs. ...
Preprint
Full-text available
Talking head video compression has advanced with neural rendering and keypoint-based methods, but challenges remain, especially at low bit rates, including handling large head movements, suboptimal lip synchronization, and distorted facial reconstructions. To address these problems, we propose a novel audio-visual driven video codec that integrates compact 3D motion features and audio signals. This approach robustly models significant head rotations and aligns lip movements with speech, improving both compression efficiency and reconstruction quality. Experiments on the CelebV-HQ dataset show that our method reduces bitrate by 22% compared to VVC and by 8.5% over a state-of-the-art learning-based codec. Furthermore, it provides superior lip-sync accuracy and visual fidelity at comparable bitrates, highlighting its effectiveness in bandwidth-constrained scenarios.
... This approach helps to achieve an efficient rate-distortion balance between image quality and compression rate. Furthermore, NNs are simple to execute, and these approaches take minimal training time when compared to conventional VC codec approaches [7-10]. Enhancing the performance of Video Compression (VC) has become a significant aim for industrial applications and academia over the years. ...

... BVI-HD is a high-definition video quality dataset, and it involves 32 reference and 384 distorted video sequences with subjective scores. BVI-HD is divided into 4 participant groups: the first group comprises observer trials containing original HEVC sequences compressed from the initial 16 references (1-16), the second group comprises observer trials containing HEVC sequences from the remaining 16 references (17-32), the third group contains distorted sequences produced by a combination-based codec from references (1-16), and the last group of subjects comprises observations of the combination-based sequences from references (17-32). For testing, three participants at a time were observed over 3 blocks of 32-length trials. ...
Article
Full-text available
Video Compression (VC) is a significant aspect of multimedia technology, in which the goal is to minimize the size of video data, while also preserving its perceptual quality, for effective transmission and storage. Traditional approaches such as transform coding, predictive coding, and entropy coding are some of the earliest approaches discovered in this area. VC is a challenging concept which plays a significant role in the effective transmission of data with low storage and minimum bandwidth requirements. However, limited processing power, storage, memory, lower compression rates and lower resolution are some factors that impact the functionality and performance of VC. This survey aims to encompass a comprehensive review of present DL approaches for VC, especially the application of advanced DL-based Neural Network (NN) algorithms that are developed for solving the aforementioned challenges of VC. The adaptability of DL algorithms is exploited to enhance the potential quality of compressed videos and to positively influence lossless video compression outcomes. The DL approaches include Deep Neural Network (DNN) methods such as Convolutional Neural Networks (CNN), Generative Adversarial Networks (GAN), Recurrent Neural Networks (RNN), Deep Recurrent Auto-Encoders (DRAE), etc. This survey examines the relationships, strengths, and problem statements of DL-based compression approaches for VC. Furthermore, this survey also deliberates on datasets, hardware specifications, comparative analysis, and research directions. This survey embeds DL-based computer vision approaches, with hardware accelerators like GPUs and FPGAs, to minimize the complexity of a model. This survey aims to overcome the limitations of VC, such as the varying effectiveness of specific encoder approaches, the challenges in utilizing hardware accelerators, low-resource devices, and difficulties in managing large-scale databases. Integrating DL-based approaches with existing standard codecs remains a significant challenge. Ensuring compatibility, interoperability, and standardization is important for widespread adoption and integration. Enhancing the interpretability and control of DL approaches permits better customization of compression settings, allowing users to balance bit rate and quality according to their specific requirements. To gather relevant studies, widespread VC datasets such as the Ultra-Video-Group dataset (UVG), the Video Trace Library (VTL), etc. are researched and utilized. The selection criteria for this study of VC techniques and deep learning (DL) approaches are chosen to focus on the integration of DL with codecs, which is a primary research area of interest. This integration provides valuable insights into advanced DL applications in overcoming challenges associated with VC. Frameworks such as TensorFlow, Keras, and PyTorch are utilized to classify the approaches according to their fundamental NN architectures.
... Compact temporal trajectory representation with a 4×4 matrix is introduced in CFTE [9] and CTTR [48] for talking face video compression. Following these efforts, HDAC [49] further incorporates a conventional codec as a base layer that is fused with the generative prediction, while RDAC [50] incorporates predictive coding in a generative coding framework. Besides, multi-reference [51], multi-view [52] and bi-directional prediction [53] schemes are also adopted to improve generation quality. ...
Preprint
Full-text available
In this paper, we propose a novel Multi-granularity Temporal Trajectory Factorization framework for generative human video compression, which holds great potential for bandwidth-constrained human-centric video communication. In particular, the proposed motion factorization strategy implicitly characterizes the high-dimensional visual signal as compact motion vectors for representation compactness and further transforms these vectors into a fine-grained field for motion expressibility. As such, the coded bit-stream can carry sufficient visual motion information at the lowest representation cost. Meanwhile, a resolution-expandable generative module is developed with enhanced background stability, such that the proposed framework can be optimized towards higher reconstruction robustness and more flexible resolution adaptation. Experimental results show that the proposed method outperforms the latest generative models and the state-of-the-art video coding standard Versatile Video Coding (VVC) on both talking-face videos and moving-body videos in terms of both objective and subjective quality. The project page can be found at https://github.com/xyzysz/Extreme-Human-Video-Compression-with-MTTF.
Article
Deep neural video compression codecs have shown great promise in recent years. However, there are still considerable challenges for ultra-low bitrate video coding. Inspired by recent attempts to apply diffusion models to image and video compression, we leverage diffusion models for ultra-low bitrate portrait video compression. In this paper, we propose a predictive portrait video compression method that leverages the temporal prediction capabilities of diffusion models. Specifically, we develop a temporal diffusion predictor based on a conditional latent diffusion model, with the predicted results serving as decoded frames. We symmetrically integrate a temporal diffusion predictor at the encoding and decoding sides, respectively. When the perceptual quality of the predicted results at the encoding end falls below a predefined threshold, a new frame sequence is employed for prediction, while the predictor at the decoding side directly generates predicted frames as reconstructions based on the evaluation results. This symmetry ensures that the prediction frames generated at the decoding end are consistent with those at the encoding end. We also design an adaptive coding strategy that incorporates frame quality assessment and adaptive keyframe control. To ensure consistent quality of subsequent predicted frames and achieve high perceptual reconstruction, this strategy dynamically evaluates the visual quality of the predicted results during encoding, retains the predicted frames that meet the quality threshold, and adaptively adjusts the length of the keyframe sequence based on motion complexity. The experimental results demonstrate that, compared with traditional video codecs and other popular methods, the proposed scheme provides superior compression performance at ultra-low bitrates while maintaining competitiveness in visual quality, achieving more than 24% bitrate savings compared with VVC in terms of perceptual distortion.
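The adaptive coding strategy described above can be summarized as an encoder-side loop: frames predicted by the diffusion model are kept as long as their assessed quality stays above a threshold, and a new keyframe sequence starts once it drops. The sketch below is a schematic illustration with stubbed-in prediction and quality functions, not the paper's actual predictor or metric.

```python
import numpy as np

def predict_next_frames(keyframe: np.ndarray, horizon: int = 4):
    """Stand-in for the temporal diffusion predictor: just repeats the keyframe
    (a real system would run a conditional latent diffusion model)."""
    return [keyframe for _ in range(horizon)]

def perceptual_quality(pred: np.ndarray, original: np.ndarray) -> float:
    """Placeholder quality score in (0, 1]; higher is better (a real system
    would use a perceptual metric)."""
    return float(1.0 / (1.0 + np.mean((pred - original) ** 2)))

def encode(frames, threshold: float = 0.9):
    """Encoder-side adaptive keyframe control: keep predicted frames while their
    assessed quality stays above the threshold; otherwise start a new keyframe."""
    coded, i = [], 0
    while i < len(frames):
        coded.append(("keyframe", i))                     # transmit this frame
        keyframe, i = frames[i], i + 1
        for pred in predict_next_frames(keyframe):
            if i >= len(frames) or perceptual_quality(pred, frames[i]) < threshold:
                break                                     # quality dropped: new keyframe
            coded.append(("predicted", i))                # no extra bits for this frame
            i += 1
    return coded

video = [np.full((8, 8), 0.3 * t) for t in range(6)]      # toy sequence with steady motion
print(encode(video))
```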
Article
Full-text available
In order to provide an immersive visual experience, modern displays require head mounting, high image resolution, low latency, as well as high refresh rate. This poses a challenging computational problem. On the other hand, the human visual system can consume only a tiny fraction of this video stream due to the drastic acuity loss in the peripheral vision. Foveated rendering and compression can save computations by reducing the image quality in the peripheral vision. However, this can cause noticeable artifacts in the periphery, or, if done conservatively, would provide only modest savings. In this work, we explore a novel foveated reconstruction method that employs the recent advances in generative adversarial neural networks. We reconstruct a plausible peripheral video from a small fraction of pixels provided every frame. The reconstruction is done by finding the closest matching video to this sparse input stream of pixels on the learned manifold of natural videos. Our method is more efficient than the state-of-the-art foveated rendering, while providing the visual experience with no noticeable quality degradation. We conducted a user study to validate our reconstruction method and compare it against existing foveated rendering and video compression techniques. Our method is fast enough to drive gaze-contingent head-mounted displays in real time on modern hardware. We plan to publish the trained network to establish a new quality bar for foveated rendering and compression as well as encourage follow-up research.
Conference Paper
Full-text available
Image compression standards rely on predictive coding, transform coding, quantization and entropy coding in order to achieve high compression performance. Very recently, deep generative models have been used to optimize or replace some of these operations, with very promising results. However, so far no systematic and independent study of the coding performance of these algorithms has been carried out. In this paper, for the first time, we conduct a subjective evaluation of two recent deep-learning-based image compression algorithms, comparing them to JPEG 2000 and to the recent BPG image codec based on HEVC Intra. We found that compression approaches based on deep auto-encoders can achieve coding performance higher than JPEG 2000, and sometimes as good as BPG. We also show experimentally that the PSNR metric is to be avoided when evaluating the visual quality of deep-learning-based methods, as their artifacts have different characteristics from those of DCT or wavelet-based codecs. In particular, images compressed at low bitrate appear more natural than JPEG 2000 coded pictures, according to a no-reference naturalness measure. Our study indicates that deep generative models are likely to bring huge innovation into the video coding arena in the coming years.
Article
Full-text available
Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
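For reference, the structural similarity index proposed in this work compares local luminance, contrast, and structure; in its widely used single-scale form, with local means μ, variances σ², covariance σ_xy, and stabilizing constants C1, C2, it reads:

```latex
\mathrm{SSIM}(x, y) =
\frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
     {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```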
Article
We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network - thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.
Article
We describe an end-to-end trainable model for image compression based on variational autoencoders. The model incorporates a hyperprior to effectively capture spatial dependencies in the latent representation. This hyperprior relates to side information, a concept universal to virtually all modern image codecs, but largely unexplored in image compression using artificial neural networks (ANNs). Unlike existing autoencoder compression methods, our model trains a complex prior jointly with the underlying autoencoder. We demonstrate that this model leads to state-of-the-art image compression when measuring visual quality using the popular MS-SSIM index, and yields rate-distortion performance surpassing published ANN-based methods when evaluated using a more traditional metric based on squared error (PSNR). Furthermore, we provide a qualitative comparison of models trained for different distortion metrics.
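A compact way to summarize the hyperprior formulation above: both the latents y and the hyper-latents z (the transmitted side information) contribute to the rate term of the training objective, which is traded off against distortion. With the noise-relaxed latents ỹ, z̃ commonly used during training, the loss is typically written as:

```latex
L \;=\; \underbrace{\mathbb{E}\!\left[-\log_2 p_{\tilde{y}\mid\tilde{z}}(\tilde{y}\mid\tilde{z}) \;-\; \log_2 p_{\tilde{z}}(\tilde{z})\right]}_{\text{rate (latents + side information)}}
\;+\; \lambda\,\underbrace{\mathbb{E}\!\left[d(x,\hat{x})\right]}_{\text{distortion}}
```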
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
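The adversarial process described in this abstract is the two-player minimax game over the value function V(D, G):

```latex
\min_G \max_D \; V(D, G) \;=\;
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
\;+\;
\mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```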
Article
We describe an image compression system, consisting of a nonlinear encoding transformation, a uniform quantizer, and a nonlinear decoding transformation. Like many deep neural network architectures, the transforms consist of layers of convolutional linear filters and nonlinear activation functions, but we use a joint nonlinearity that implements a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the system for rate-distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. The relaxed optimization problem resembles that of variational autoencoders, except that it must operate at any point along the rate-distortion curve, whereas the optimization of generative models aims only to minimize entropy of the data under the model. Across an independent database of test images, we find that the optimized coder exhibits significantly better rate-distortion performance than the standard JPEG and JPEG 2000 compression systems, as well as a dramatic improvement in visual quality of compressed images.
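The "continuous proxy" for quantization mentioned above is the well-known additive-uniform-noise relaxation: during training, the quantizer is replaced by uniform noise so that the rate-distortion objective becomes differentiable. With analysis transform g_a, synthesis transform g_s, and trade-off parameter λ, the objective takes roughly the form:

```latex
\tilde{y} = g_a(x;\phi) + \Delta y, \qquad \Delta y \sim \mathcal{U}\!\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)

L = \underbrace{\mathbb{E}\!\left[-\log_2 p_{\tilde{y}}(\tilde{y})\right]}_{\text{rate}}
  + \lambda\,\underbrace{\mathbb{E}\!\left[d\!\left(x,\, g_s(\tilde{y};\theta)\right)\right]}_{\text{distortion}}
```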
Article
We present a machine learning-based approach to lossy image compression which outperforms all existing codecs, while running in real-time. Our algorithm typically produces files 2.5 times smaller than JPEG and JPEG 2000, 2 times smaller than WebP, and 1.7 times smaller than BPG on datasets of generic images across all quality levels. At the same time, our codec is designed to be lightweight and deployable: for example, it can encode or decode the Kodak dataset in around 10ms per image on GPU. Our architecture is an autoencoder featuring pyramidal analysis, an adaptive coding module, and regularization of the expected codelength. We also supplement our approach with adversarial training specialized towards use in a compression setting: this enables us to produce visually pleasing reconstructions for very low bitrates.
Conference Paper
We describe an image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. The transforms are constructed in three successive stages of convolutional linear filters and nonlinear activation functions. Unlike most convolutional neural networks, the joint nonlinearity is chosen to implement a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the entire model for rate-distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. Under certain conditions, the relaxed loss function may be interpreted as the log likelihood of a generative model, as implemented by a variational autoencoder. Unlike these models, however, the compression model must operate at any given point along the rate-distortion curve, as specified by a trade-off parameter. Across an independent set of test images, we find that the optimized method generally exhibits better rate-distortion performance than the standard JPEG and JPEG 2000 compression methods. More importantly, we observe a dramatic improvement in visual quality for all images at all bit rates, which is supported by objective quality estimates using MS-SSIM.
Article
It is known that the Karhunen-Loève transform (KLT) of Gaussian first-order auto-regressive (AR(1)) processes results in sinusoidal basis functions. The same sinusoidal bases come out of the independent-component analysis (ICA) and actually correspond to processes with completely independent samples. In this paper, we relax the Gaussian hypothesis and study how orthogonal transforms decouple symmetric-alpha-stable (SαS) AR(1) processes. The Gaussian case is not sparse and corresponds to α = 2, while 0 < α < 2 yields processes with sparse linear-prediction error. In the presence of sparsity, we show that operator-like wavelet bases do outperform the sinusoidal ones. Also, we observe that, for processes with very sparse increments (0 < α ≤ 1), the operator-like wavelet basis is indistinguishable from the ICA solution obtained through numerical optimization. We consider two criteria for independence. The first is the Kullback-Leibler divergence between the joint probability density function (pdf) of the original signal and the product of the marginals in the transformed domain. The second is a divergence between the joint pdf of the original signal and the product of the marginals in the transformed domain, which is based on Stein's formula for the mean-square estimation error in additive Gaussian noise. Our framework then offers a unified view that encompasses the discrete cosine transform (known to be asymptotically optimal for α = 2) and Haar-like wavelets (for which we achieve optimality for 0 < α ≤ 1).
First order motion model for image animation
  • A Siarohin
  • S Lathuiliere
  • S Tulyakov
  • E Ricci
  • N Sebe
On the optimality of operator-like wavelets for sparse AR(1) processes
  • P Pad
  • M Unser
Everybody dance now
  • C Chan
  • S Ginosar
  • T Zhou
  • A A Efros