Article

Abstract

Restoring face images from distortions is important for face recognition applications, yet it is challenged by multi-scale issues that remain poorly solved. In this paper, we present a Sequential Gating Ensemble Network (SGEN) for the multi-scale face restoration problem. We first apply the principle of ensemble learning to the SGEN architecture design to reinforce the predictive performance of the network. SGEN aggregates multi-level base-encoders and base-decoders into the network, which allows it to contain multiple scales of receptive field. Instead of combining these base-en/decoders directly with non-sequential operations, SGEN treats base-en/decoders from different levels as sequential data. Specifically, SGEN learns to sequentially extract high-level information from the base-encoders in a bottom-up manner and to restore low-level information from the base-decoders in a top-down manner. In addition, we propose to realize this bottom-up and top-down information combination and selection with a Sequential Gating Unit (SGU). The SGU sequentially takes two inputs from different levels and decides the output based on one active input. Experimental results demonstrate that SGEN is more effective at multi-scale human face restoration, producing more image details and less noise than state-of-the-art image restoration models. With adversarial training, SGEN also produces results that are visually preferred over other models in subjective evaluation.
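The abstract does not give the SGU's exact formulation, but the behavior it describes (two inputs from different levels, output decided by one active input) suggests a learned gate. Below is a minimal PyTorch sketch under that assumption; the module name, gate parameterization, and gated-combination rule are illustrative guesses, not the paper's definition.

```python
import torch
import torch.nn as nn

class SequentialGatingUnit(nn.Module):
    """Hypothetical SGU sketch: fuses an 'active' input with a 'passive'
    input from another level through a gate computed from the active
    input, per the abstract's description. Not the paper's exact design."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # per-pixel gate values in [0, 1]
        )

    def forward(self, active, passive):
        g = self.gate(active)                    # decided by the active input
        return g * active + (1.0 - g) * passive  # gated combination

# Combine feature maps from two adjacent levels (N, C, H, W).
sgu = SequentialGatingUnit(channels=64)
fused = sgu(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```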

... Blind Face Super-Resolution (BFSR), which aims to restore high-resolution (HR) face images from low-resolution (LR) counterparts suffering from arbitrary unknown degradations such as noise, blurring, compression artifacts, and their hybrid forms, has great potential for practical applications in fields including surveillance, biometrics, and entertainment [1,2,3]. ...
Article
Blind Face Super-Resolution (BFSR) has recently gained widespread attention; it aims to super-resolve Low-Resolution (LR) face images with complex unknown degradation into High-Resolution (HR) face images. However, existing BFSR methods suffer from two major limitations. First, most are trained on synthetic data pairs produced by pre-defined degradation models, which leads to poor performance due to the mismatch with the unknown complex degradations of real-world scenarios. Second, some methods rely on hand-crafted face priors as constraints, such as facial landmarks and parsing maps, which require additional annotations and laborious hyperparameter tuning in real cases. To tackle these issues, we propose a simple and effective self-supervised cooperative learning framework via a conditional diffusion contraction method for BFSR, dubbed DifBFSR, which establishes the posterior distribution of HR images from degraded LR images with unknown degradation via a powerful diffusion model, without expensive supervised training or additional constraint design. Specifically, we first transform the degraded LR face image into a degradation-invariant intermediate HR face prediction with a simple Super-Resolution Module (SRM), which relies only on self-supervised optimization. To enhance the face prediction, we propose a Contraction Filter Module (CFM) that gradually contracts the restoration error via adaptive dynamic filtering, efficiently leveraging the rich natural face prior encapsulated in the pre-trained diffusion model through conditional posterior sampling. Finally, by combining the SRM, CFM, and diffusion model in a self-supervised cooperative learning framework, DifBFSR robustly handles unknown complex degradations while avoiding cumbersome training and parameter tuning. Extensive qualitative and quantitative experiments on synthetic and real-world datasets with complex degradations show that our method outperforms state-of-the-art BFSR methods.
Article
With the development of generative adversarial networks (GANs), recent face restoration (FR) methods often utilize pre-trained GAN models (i.e., StyleGAN2) as priors to generate rich details. However, these methods usually struggle to balance realness and fidelity across varying degradation levels. In this paper, we propose a novel DEgradation-Aware Restoration network with GAN prior, dubbed DEAR-GAN, for FR tasks; it explicitly learns degradation representations (DR) to adapt to various degradations. Specifically, an unsupervised degradation representation learning (UDRL) strategy is first developed to extract the DR of the input degraded images. Then, a degradation-aware feature interpolation (DAFI) module is proposed to dynamically fuse two types of informative features (i.e., features from the degraded images and features from the GAN prior network) with flexible adaptation to various degradations based on the DR. Extensive experiments show that DEAR-GAN outperforms state-of-the-art methods for face restoration under multiple degradations and for face super-resolution, and demonstrate the effectiveness of feature interpolation, which can be extended to face inpainting with excellent results.
Chapter
Most existing image restoration networks are designed in a disposable way and catastrophically forget previously learned distortions when trained on a new distortion removal task. To alleviate this problem, we pose the novel problem of lifelong image restoration for blended distortions. We first design a base fork-join model in which multiple pre-trained expert models, each specializing in an individual distortion removal task, work cooperatively and adaptively to handle blended distortions. When the input is degraded by a new distortion, inspired by adult neurogenesis in the human memory system, we develop a neural growing strategy in which the previously trained model incorporates a new expert branch and continually accumulates new knowledge without interfering with what it has already learned. Experimental results show that the proposed approach not only achieves state-of-the-art performance on blended distortion removal tasks in both PSNR and SSIM metrics, but also maintains old expertise while learning new restoration tasks.
Conference Paper
Full-text available
Interfering signals such as rain streaks, haze, and noise introduce various types of visibility degradation into the original clean signals. Traditional algorithms tackle the signal de-interference problem by way of signal removal, which often causes over-smoothing and unexpected artifacts. This paper instead attacks the problem from the entirely different perspective of signal decomposition, and introduces interaction and constraints between the two decomposed signals during the restoration procedure. Specifically, we propose an Asynchronous Interactive Generative Adversarial Network (AI-GAN), which progressively decomposes the degraded signal into original and interfering parts through a double-branch structure. Each branch employs an asynchronous synthesis strategy for its generator and interacts with the other by exchanging feed-forward signal values and sharing the corresponding feedback gradients, achieving an effect of mutual adversarial optimization. The proposed AI-GAN shows significant qualitative and quantitative improvements on general signal de-interference tasks such as deraining, dehazing, and denoising. Index Terms: signal de-interference, signal decomposition, asynchronous and interactive, double branch, GANs
Article
Full-text available
We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.
Article
Full-text available
Face image super-resolution has attracted much attention in recent years, and many algorithms have been proposed. Among them, sparse representation (SR)-based face image super-resolution approaches achieve competitive performance. However, these SR-based approaches only perform well when the input is noiseless or contains little noise. When the input is corrupted by large noise, the reconstruction weights (or coefficients) of the input low-resolution (LR) patches obtained by SR-based approaches become seriously unstable, leading to poor reconstruction results. To this end, in this paper, we propose a novel SR-based face image super-resolution approach that incorporates smooth priors to enforce that similar training patches have similar sparse coding coefficients. Specifically, we introduce a fused least absolute shrinkage and selection operator-based smooth constraint and a locality-based smooth constraint to the least squares representation-based patch representation in order to obtain stable reconstruction weights, especially when the noise level of the input LR image is high. Experiments are carried out on the benchmark FEI face database and the CMU+MIT face database. Visual and quantitative comparisons show that the proposed face image super-resolution method yields superior reconstruction results when the input LR face image is contaminated by strong noise.
Article
Full-text available
In this work, we introduce a novel interpretation of residual networks showing they are exponential ensembles. This observation is supported by a large-scale lesion study that demonstrates they behave just like ensembles at test time. Subsequently, we perform an analysis showing these ensembles mostly consist of networks that are each relatively shallow. For example, contrary to our expectations, most of the gradient in a residual network with 110 layers comes from an ensemble of very short networks, i.e., only 10-34 layers deep. This suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble. Ultimately, residual networks do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network - rather, they avoid the problem simply by ensembling many short networks together. This insight reveals that depth is still an open research question and invites the exploration of the related notion of multiplicity.
Article
Full-text available
Recently, position-patch based approaches have been proposed to replace the probabilistic graph-based or manifold learning-based models for face hallucination. In order to obtain the optimal weights of face hallucination, these approaches represent one image patch through other patches at the same position of training faces by employing least square estimation or sparse coding. However, they cannot provide unbiased approximations or satisfy rational priors, thus the obtained representation is not satisfactory. In this paper, we propose a simpler yet more effective scheme called Locality-constrained Representation (LcR). Compared with Least Square Representation (LSR) and Sparse Representation (SR), our scheme incorporates a locality constraint into the least square inversion problem to maintain locality and sparsity simultaneously. Our scheme is capable of capturing the non-linear manifold structure of image patch samples while exploiting the sparse property of the redundant data representation. Moreover, when the locality constraint is satisfied, face hallucination is robust to noise, a property that is desirable for video surveillance applications. A statistical analysis of the properties of LcR is given together with experimental results on some public face databases and surveillance images to show the superiority of our proposed scheme over state-of-the-art face hallucination approaches.
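As a concrete illustration of the locality constraint described above, here is a minimal NumPy sketch: the reconstruction weights solve a least-squares problem penalized by the distances between the input patch and each training patch, and the LR weights are then reused on the HR patches. The closed form below omits the paper's sum-to-one constraint, and the value of `tau` is an assumption.

```python
import numpy as np

def lcr_weights(x, Y, tau=0.04):
    """Locality-constrained representation, closed-form sketch.
    x   : (d,)   flattened input LR patch
    Y   : (d, n) training LR patches at the same position (columns)
    tau : locality strength (assumed value)
    Minimizes ||x - Y w||^2 + tau * ||diag(dist) w||^2, so weights on
    distant training patches are shrunk toward zero; the paper's
    sum-to-one constraint is omitted for brevity."""
    dist = np.linalg.norm(Y - x[:, None], axis=0)   # distance to each patch
    A = Y.T @ Y + tau * np.diag(dist ** 2)
    return np.linalg.solve(A, Y.T @ x)

# Hallucination step: reuse the LR weights on the HR training patches.
rng = np.random.default_rng(0)
Y_lr, Y_hr = rng.standard_normal((16, 50)), rng.standard_normal((64, 50))
w = lcr_weights(rng.standard_normal(16), Y_lr)
hr_patch = Y_hr @ w
```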
Article
Full-text available
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called Long Short-Term Memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
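For readers who want to see the gating mechanism in practice, the sketch below steps a modern LSTM cell over a toy sequence; in PyTorch's `nn.LSTMCell`, the cell state plays the role of the constant error carousel described above.

```python
import torch
import torch.nn as nn

# Step an LSTM cell over a toy sequence; the cell state `c` is the
# "constant error carousel" and the gates control access to it.
cell = nn.LSTMCell(input_size=10, hidden_size=20)
h = torch.zeros(1, 20)
c = torch.zeros(1, 20)
for x in torch.randn(5, 1, 10):   # 5 time steps, batch of 1
    h, c = cell(x, (h, c))        # multiplicative gates update h and c
```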
Article
Full-text available
In video surveillance, the faces of interest are often of small size. Image resolution is an important factor affecting face recognition by human and computer. In this paper, we propose a new face hallucination method using eigentransformation. Different from most of the proposed methods based on probabilistic models, this method views hallucination as a transformation between different image styles. We use Principal Component Analysis (PCA) to fit the input face image as a linear combination of the low-resolution face images in the training set. The high-resolution image is rendered by replacing the low-resolution training images with high-resolution ones, while retaining the same combination coefficients. Experiments show that the hallucinated face images are not only very helpful for recognition by humans, but also make the automatic recognition procedure easier, since they emphasize the face difference by adding more high-frequency details.
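A minimal sketch of the coefficient-reuse idea: fit the input LR face as a linear combination of LR training faces, then render the HR face with the same coefficients. The original method computes the combination in PCA space; the plain least-squares fit below is a simplification under that caveat.

```python
import numpy as np

def eigentransform(x_lr, L, H):
    """Coefficient-reuse sketch of eigentransformation hallucination.
    x_lr : (dl,)   flattened input LR face
    L    : (dl, n) LR training faces as columns
    H    : (dh, n) corresponding HR training faces
    Fits the input as a linear combination of the (mean-centered) LR
    training faces, then applies the same coefficients to the HR faces.
    The original method performs this fit in PCA space."""
    mu_l, mu_h = L.mean(axis=1), H.mean(axis=1)
    c, *_ = np.linalg.lstsq(L - mu_l[:, None], x_lr - mu_l, rcond=None)
    return (H - mu_h[:, None]) @ c + mu_h
```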
Conference Paper
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Conference Paper
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.
Article
Objective methods for assessing perceptual image quality have traditionally attempted to quantify the visibility of errors between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a Structural Similarity Index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
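For a quick way to compute the index, the scikit-image implementation can be used in place of the authors' MATLAB code; the toy comparison below is illustrative only.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Compare a reference image with a noisy copy; `data_range` must match
# the dynamic range of the inputs (here floats in [0, 1]).
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
noisy = np.clip(ref + 0.1 * rng.standard_normal(ref.shape), 0.0, 1.0)
print(structural_similarity(ref, noisy, data_range=1.0))  # 1.0 = identical
```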
Conference Paper
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
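A minimal PyTorch sketch of a feature-reconstruction loss of this kind: activations of a frozen, pretrained VGG16 are compared with mean-squared error. The layer cut (relu2_2, i.e. `features[:9]`) and the choice of loss network are common options, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Feature-reconstruction loss sketch: MSE between activations of a
    frozen, pretrained VGG16 truncated at relu2_2 (features[:9])."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:9]
        for p in vgg.parameters():
            p.requires_grad = False   # the loss network is never trained
        self.vgg = vgg.eval()

    def forward(self, output, target):
        return nn.functional.mse_loss(self.vgg(output), self.vgg(target))

# Usage: expects normalized 3-channel images of shape (N, 3, H, W).
loss = PerceptualLoss()(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))
```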
Conference Paper
We present an image-conditional image generation model. The model transfers an input domain to a target domain in semantic level, and generates the target image in pixel level. To generate realistic target images, we employ the real/fake-discriminator as in Generative Adversarial Nets [6], but also introduce a novel domain-discriminator to make the generated image relevant to the input image. We verify our model through a challenging task of generating a piece of clothing from an input image of a dressed person. We present a high quality clothing dataset containing the two domains, and succeed in demonstrating decent results.
Conference Paper
Conventional face super-resolution methods, also known as face hallucination, are limited to 2~4× scaling factors, where 4~16 additional pixels are estimated for each given pixel. Besides, they become very fragile when the input low-resolution image is so small that only little information is available in it. To address these shortcomings, we present a discriminative generative network that can ultra-resolve a very low resolution face image of size 16×16 pixels to its 8× larger version by reconstructing 64 pixels from a single pixel. We introduce a pixel-wise ℓ2 regularization term to the generative model and exploit the feedback of the discriminative network to make the upsampled face images more similar to real ones. In our framework, the discriminative network learns the essential constituent parts of faces and the generative network blends these parts in the most accurate fashion into the input image. Since only frontal, ordinarily aligned images are used in training, our method can directly ultra-resolve a wide range of very low-resolution images regardless of pose and facial expression variations. Our extensive experimental evaluations demonstrate that the presented ultra-resolution by discriminative generative networks (UR-DGN) achieves more appealing results than the state of the art.
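The combination of a pixel-wise ℓ2 term with discriminator feedback might look like the following sketch; the loss weight `alpha` is an assumed value, and the discriminator itself is left out.

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, d_fake_logits, alpha=1e-3):
    """Sketch of a UR-DGN-style objective: pixel-wise L2 regularization
    plus adversarial feedback from the discriminator. The weight `alpha`
    is an assumed value, not taken from the paper."""
    pixel_l2 = F.mse_loss(sr, hr)              # pixel-wise l2 term
    adv = F.binary_cross_entropy_with_logits(  # push D to call SR real
        d_fake_logits, torch.ones_like(d_fake_logits))
    return pixel_l2 + alpha * adv
```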
Article
Image denoising is a long-standing problem in computer vision and image processing, as well as a test bed for low-level image modeling algorithms. In this paper, we propose a very deep encoding-decoding framework for image denoising. Instead of using image priors, the proposed framework learns end-to-end fully convolutional mappings from noisy images to the clean ones. The network is composed of multiple layers of convolution and de-convolution operators. With the observation that deeper networks improve denoising performance, we propose to use significantly deeper networks than those employed previously for low-level image processing tasks such as denoising. We propose to symmetrically link convolutional and de-convolutional layers with skip-layer connections, with which the training converges much faster and attains a higher-quality local optimum. From the image processing point of view, those symmetric connections help preserve image details.
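A toy version of the symmetric skip-connection idea, with mirrored convolution/deconvolution layers; the depth and channel widths here are illustrative and far shallower than the "very deep" networks the paper uses.

```python
import torch
import torch.nn as nn

class REDSketch(nn.Module):
    """Toy symmetric conv/deconv denoiser with skip connections linking
    mirrored layers; real RED-style networks are much deeper."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(ch, ch, 3, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(ch, 1, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2) + e1   # skip from the mirrored encoder layer
        return self.dec1(d2)

denoised = REDSketch()(torch.randn(1, 1, 32, 32))
```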
Conference Paper
Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.
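As a small concrete example of the weighted-vote idea (using scikit-learn, not anything from the paper itself), three heterogeneous classifiers vote with unequal weights:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Three heterogeneous classifiers combined by a weighted (soft) vote.
X, y = make_classification(n_samples=200, random_state=0)
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("bag", BaggingClassifier(random_state=0)),
                ("ada", AdaBoostClassifier(random_state=0))],
    voting="soft",
    weights=[1, 1, 2],      # trust the boosted model most
).fit(X, y)
print(ensemble.score(X, y))
```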
Article
In this paper we introduce a generative parametric model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach (Goodfellow et al.). Samples drawn from our model are of significantly higher quality than alternate approaches. In a quantitative assessment by human evaluators, our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for samples drawn from a GAN baseline model. We also show samples from models trained on the higher resolution images of the LSUN scene dataset.
Conference Paper
Neural networks have become increasingly popular for the task of language modeling. Whereas feed-forward networks only exploit a fixed context length to predict the next word of a sequence, standard recurrent neural networks can, conceptually, take into account all of the predecessor words. On the other hand, it is well known that recurrent networks are difficult to train and are therefore unlikely to show the full potential of recurrent models. These problems are addressed by the Long Short-Term Memory neural network architecture. In this work, we analyze this type of network on an English and a large French language modeling task. Experiments show improvements of about 8% relative in perplexity over standard recurrent neural network LMs. In addition, we gain considerable improvements in WER on top of a state-of-the-art speech recognition system. Index Terms: language modeling, recurrent neural networks, LSTM neural networks
Article
Sparse representation-based face hallucination approaches proposed so far use a fixed ℓ1-norm penalty to capture the sparse nature of face images, and thus hardly adapt to the statistical variability of the underlying images. Additionally, they ignore the influence of spatial distances between the test image and the training basis images on the optimal reconstruction coefficients. Consequently, they cannot offer satisfactory performance in practical face hallucination applications. In this paper, we propose a weighted adaptive sparse regularization (WASR) method to promote accuracy, stability, and robustness in face hallucination reconstruction, in which a distance-inducing weighted ℓq-norm penalty is imposed on the solution. With the adjustment of the shrinkage parameter q, the weighted ℓq penalty function enables an elastic description ability in the sparse domain, leading to more conservative sparsity in ascending order of q. In particular, WASR with an optimal q > 1 can reasonably represent the less sparse nature of noisy images and thus remarkably boosts noise-robust performance in face hallucination. Various experimental results on standard face databases as well as real-world images show that our proposed method outperforms state-of-the-art methods in terms of both objective metrics and visual quality.
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit with an infinite number of copies that all share the same weights but have progressively more negative biases. The learning and inference rules for these "stepped sigmoid units" are unchanged. They can be approximated efficiently by noisy rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
Article
A novel face hallucination method is proposed in this paper for the reconstruction of a high-resolution face image from a low-resolution observation based on a set of high- and low-resolution training image pairs. Different from most of the established methods based on probabilistic or manifold learning models, the proposed method hallucinates the high-resolution image patch using the same position image patches of each training image. The optimal weights of the training image position-patches are estimated and the hallucinated patches are reconstructed using the same weights. The final high-resolution facial image is formed by integrating the hallucinated patches. The necessity of two-step framework or residue compensation and the differences between hallucination based on patch and global image are discussed. Experiments show that the proposed method without residue compensation generates higher-quality images and costs less computational time than some recent face image super-resolution (hallucination) techniques.
Article
High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
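A minimal sketch of such an autoencoder with a small central code layer; the 784-dimensional input (e.g., a flattened 28×28 image) and the layer sizes are illustrative, and the layer-wise pretraining the paper describes for weight initialization is omitted.

```python
import torch
import torch.nn as nn

# Deep autoencoder sketch with a small central code layer; input size
# 784 corresponds to a flattened 28x28 image, chosen for illustration.
autoencoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 32), nn.ReLU(),   # 32-d code replaces PCA projection
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)
x = torch.rand(8, 784)
loss = nn.functional.mse_loss(autoencoder(x), x)  # reconstruction error
```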
Learning a deep convolutional network for image super-resolution
  • C Dong
  • C C Loy
  • K He
  • X Tang
Dong, C.; Loy, C. C.; He, K.; and Tang, X. 2014. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, 184-199. Springer.
Generative adversarial nets
  • I Goodfellow
  • J Pouget-Abadie
  • M Mirza
  • B Xu
  • D Warde-Farley
  • S Ozair
  • A Courville
  • Y Bengio
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
Adam: A method for stochastic optimization
  • D Kingma
  • J Ba
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Network in network
  • M Lin
  • Q Chen
  • S Yan
Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
Deep learning face attributes in the wild
  • Z Liu
  • P Luo
  • X Wang
  • X Tang
Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
Rectifier nonlinearities improve neural network acoustic models
  • A L Maas
  • A Y Hannun
  • A Y Ng
Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30.
Ensemble based systems in decision making
  • R Polikar
Polikar, R. 2006. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6(3):21-45.
Unsupervised cross-domain image generation
  • Y Taigman
  • A Polyak
  • L Wolf
Taigman, Y.; Polyak, A.; and Wolf, L. 2016. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200.