Chapter

Automatic Beautification for Group-Photo Facial Expressions Using Novel Bayesian GANs: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I


Abstract

Benefiting from the power of generative adversarial networks (GANs) in recent years, many new image processing tasks related to image generation and synthesis have grown in popularity. One such application is the beautification of individual portrait photos based on facial expression detection and editing. Yet automatically beautifying group photos without tedious and fragile human intervention remains challenging. The difficulties arise from diverse facial expression evaluation, harmonious expression generation, and context-sensitive synthesis from single or multiple photos. To address these difficulties, we devise a two-stage deep network for automatic group-photo evaluation and beautification by seamlessly integrating a multi-label CNN with Bayesian-network-enhanced GANs. First, our multi-label CNN is designed to evaluate the quality of facial expressions. Second, our novel Bayesian GANs framework automatically generates photo-realistic, beautiful expressions. Third, to further enhance the naturalness of beautified group photos, we embed Poisson fusion in the final layer of the GANs to synthesize all the beautified individual expressions. We conducted extensive experiments on various kinds of single- and multi-frame group photos to validate our network design. All the experiments confirm that our method can uniformly accommodate diverse expression evaluation and generation/synthesis of group photos, and outperforms state-of-the-art methods in terms of effectiveness, versatility, and robustness.
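No reference implementation accompanies this abstract, but the first stage, a multi-label CNN that scores several expression-quality attributes per face with independent sigmoid outputs, can be sketched as follows. This is a minimal illustration, not the authors' network: the backbone, the three attribute labels, and the input resolution are all assumptions.

```python
import torch
import torch.nn as nn

class ExpressionQualityCNN(nn.Module):
    """Multi-label CNN: one sigmoid score per expression-quality attribute
    (e.g. smiling, eyes open, looking at camera). Hypothetical label set."""
    def __init__(self, num_labels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, num_labels)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))  # raw logits, one per label

model = ExpressionQualityCNN()
faces = torch.randn(8, 3, 128, 128)                    # a batch of cropped faces
quality_labels = torch.randint(0, 2, (8, 3)).float()   # 0/1 per attribute (toy data)
loss = nn.BCEWithLogitsLoss()(model(faces), quality_labels)
loss.backward()
```

Faces scored poorly by such a stage would then be handed to the generative stage; the Poisson fusion step mentioned above is sketched separately under the Poisson editing reference at the end of this list.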


... In addition, GANs have been conditioned on beauty scores [60] in order to generate realistic facial images using Progressive Growing of GANs (PGGAN) [61]. Similarly, Liu et al. [62] proposed a two-stage deep network for beautification, in which a multi-label CNN evaluates the quality of faces and a Bayesian GANs framework then automatically generates photo-realistic beautified faces. Example images before and after applying the aforementioned types of makeup are depicted in Figure 3. Relevant works investigating the impact of makeup on face recognition and proposing makeup detection methods are listed in Table 2. Dantcheva et al. [39] were the first to systematically study the effects of makeup on different face recognition systems. ...
Article
Full-text available
Facial beautification induced by plastic surgery, cosmetics, or retouching can substantially alter the appearance of face images. Such types of beautification can negatively affect the accuracy of face recognition systems. In this work, a conceptual categorisation of beautification is presented, relevant scenarios with respect to face recognition are discussed, and related publications are revisited. Additionally, technical considerations and trade-offs of the surveyed methods are summarized, along with open issues and challenges in the field. This survey aims to provide a comprehensive point of reference for biometric researchers and practitioners working in face recognition who seek to tackle the challenges caused by facial beautification.
Article
Full-text available
Facial expression recognition (FER) plays a crucial role in affective computing, enhancing human-computer interaction by enabling machines to understand and respond to human emotions. Despite advancements in deep learning, current FER systems often struggle with challenges such as occlusions, head pose variations, and motion blur in natural environments. These challenges highlight the need for more robust FER solutions. To address these issues, we propose the Attention-Enhanced Multi-Layer Transformer (AEMT) model, which integrates a dual-branch Convolutional Neural Network (CNN), an Attentional Selective Fusion (ASF) module, and a Multi-Layer Transformer Encoder (MTE) with transfer learning. The dual-branch CNN captures detailed texture and color information by processing RGB and Local Binary Pattern (LBP) features separately. The ASF module selectively enhances relevant features by applying global and local attention mechanisms to the extracted features. The MTE captures long-range dependencies and models the complex relationships between features, collectively improving feature representation and classification accuracy. Our model was evaluated on the RAF-DB and AffectNet datasets. Experimental results demonstrate that the AEMT model achieved an accuracy of 81.45% on RAF-DB and 71.23% on AffectNet, significantly outperforming existing state-of-the-art methods. These results indicate that our model effectively addresses the challenges of FER in natural environments, providing a more robust and accurate solution. The AEMT model significantly advances the field of FER by improving the robustness and accuracy of emotion recognition in complex real-world scenarios. This work not only enhances the capabilities of affective computing systems but also opens new avenues for future research in improving model efficiency and expanding multimodal data integration.
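The attentional fusion idea can be illustrated with a toy module that merges two feature branches through a learned channel-wise gate. This is only a rough stand-in for the ASF module: the gating design, branch shapes, and channel count below are assumptions, not the AEMT implementation.

```python
import torch
import torch.nn as nn

class SimpleAttentionalFusion(nn.Module):
    """Fuse two feature maps (e.g. an RGB branch and an LBP branch) with a
    learned gate: out = g * a + (1 - g) * b. A toy stand-in for ASF."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # global context per channel
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, a, b):
        g = self.gate(torch.cat([a, b], dim=1))    # per-channel weights in [0, 1]
        return g * a + (1 - g) * b

fuse = SimpleAttentionalFusion(64)
rgb_feat = torch.randn(4, 64, 28, 28)
lbp_feat = torch.randn(4, 64, 28, 28)
fused = fuse(rgb_feat, lbp_feat)                   # (4, 64, 28, 28)
```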
Article
Full-text available
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G: X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F: Y → X and introduce a cycle consistency loss to push F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
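The cycle consistency term translates almost directly into code. A minimal sketch follows, with tiny convolutional layers standing in for the real generators and the weight lam chosen arbitrarily:

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    """L_cyc = E[||F(G(x)) - x||_1] + E[||G(F(y)) - y||_1], weighted by lam.
    G maps domain X -> Y, F maps Y -> X; both are arbitrary nn.Modules."""
    l1 = nn.L1Loss()
    return lam * (l1(F(G(x)), x) + l1(G(F(y)), y))

# Toy example: identity-sized conv layers stand in for the real generators.
G = nn.Conv2d(3, 3, 3, padding=1)
F = nn.Conv2d(3, 3, 3, padding=1)
x = torch.randn(2, 3, 64, 64)   # images from domain X
y = torch.randn(2, 3, 64, 64)   # images from domain Y
loss = cycle_consistency_loss(G, F, x, y)
loss.backward()
```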
Article
Full-text available
We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.
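A hedged sketch of the combined generator objective, an adversarial term conditioned on the input plus a weighted L1 term, is shown below; the toy single-layer discriminator and the weight lam=100 are placeholders, not the paper's PatchGAN or settings:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(D, x, fake_y, real_y, lam=100.0):
    """cGAN generator objective: fool D on (x, G(x)) and stay close to the
    target in L1. x is the input image, fake_y = G(x), real_y the ground truth."""
    pred_fake = D(torch.cat([x, fake_y], dim=1))      # condition D on the input
    adv = bce(pred_fake, torch.ones_like(pred_fake))  # adversarial term
    return adv + lam * l1(fake_y, real_y)             # plus weighted L1 term

# Toy conditional discriminator: 6 input channels = input image + output image.
D = nn.Sequential(nn.Conv2d(6, 1, 4, stride=2, padding=1))
x = torch.randn(2, 3, 64, 64)
fake_y = torch.randn(2, 3, 64, 64, requires_grad=True)  # stands in for G(x)
real_y = torch.randn(2, 3, 64, 64)
loss = generator_loss(D, x, fake_y, real_y)
loss.backward()
```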
Conference Paper
Full-text available
Attribute recognition, particularly facial, extracts many labels for each image. While some multi-task vision problems can be decomposed into separate tasks and stages, e.g., training independent models for each task, for a growing set of problems joint optimization across all tasks has been shown to improve performance. We show that for deep convolutional neural network (DCNN) facial attribute extraction, multi-task optimization is better. Unfortunately, it can be difficult to apply joint optimization to DCNNs when training data is imbalanced, and re-balancing multi-label data directly is structurally infeasible, since adding/removing data to balance one label will change the sampling of the other labels. This paper addresses the multi-label imbalance problem by introducing a novel mixed objective optimization network (MOON) with a loss function that mixes multiple task objectives with domain adaptive re-weighting of propagated loss. Experiments demonstrate that not only does MOON advance the state of the art in facial attribute recognition, but it also outperforms independently trained DCNNs using the same data. When using facial attributes for the LFW face recognition task, we show that our balanced (domain adapted) network outperforms the unbalanced trained network.
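The re-weighting idea can be illustrated with a per-attribute weighted binary cross-entropy in which rare labels receive larger weights. The inverse-frequency weighting below is a common heuristic used only as a stand-in for MOON's domain-adaptive re-weighting; the attribute count and frequencies are made up:

```python
import torch
import torch.nn as nn

def reweighted_multilabel_loss(logits, targets, pos_freq):
    """Weighted BCE over attributes: rare positives (or negatives) get larger
    weights so each attribute contributes a balanced gradient.
    pos_freq[j] = fraction of training images where attribute j is positive."""
    pos_w = 1.0 / pos_freq.clamp(min=1e-6)        # up-weight rare positives
    neg_w = 1.0 / (1.0 - pos_freq).clamp(min=1e-6)
    w = torch.where(targets > 0.5, pos_w, neg_w)  # per-sample, per-attribute weight
    return nn.functional.binary_cross_entropy_with_logits(
        logits, targets, weight=w, reduction="mean")

logits = torch.randn(16, 40)                      # e.g. 40 facial attributes
targets = torch.randint(0, 2, (16, 40)).float()
pos_freq = torch.full((40,), 0.1)                 # assumed imbalance: 10% positive
loss = reweighted_multilabel_loss(logits, targets, pos_freq)
```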
Article
Full-text available
A training process for facial expression recognition is usually performed sequentially in three individual stages: feature learning, feature selection, and classifier construction. Extensive empirical studies are needed to search for an optimal combination of feature representation, feature set, and classifier to achieve good recognition performance. This paper presents a novel Boosted Deep Belief Network (BDBN) for performing the three training stages iteratively in a unified loopy framework. Through the proposed BDBN framework, a set of features, which is effective to characterize expression-related facial appearance/shape changes, can be learned and selected to form a boosted strong classifier in a statistical way. As learning continues, the strong classifier is improved iteratively and more importantly, the discriminative capabilities of selected features are strengthened as well according to their relative importance to the strong classifier via a joint fine-tune process in the BDBN framework. Extensive experiments on two public databases showed that the BDBN framework yielded dramatic improvements in facial expression analysis.
Article
Full-text available
We propose a convolutional neural network (CNN) architecture for facial expression recognition. The proposed architecture is independent of any hand-crafted feature extraction and performs better than earlier CNN-based approaches. We visualize the automatically extracted features learned by the network in order to provide a better understanding. The standard datasets, i.e. the Extended Cohn-Kanade (CKP) and MMI Facial Expression databases, are used for the quantitative evaluation. On the CKP set, the current state-of-the-art approach using CNNs achieves an accuracy of 99.2%; for the MMI dataset, the best accuracy for emotion recognition so far is 93.33%. The proposed architecture achieves 99.6% for CKP and 98.63% for MMI, thereby performing better than the state of the art using CNNs. Automatic facial expression recognition has a broad spectrum of applications, such as human-computer interaction and safety systems, because non-verbal cues are important forms of communication and play a pivotal role in interpersonal communication. The performance of the proposed architecture endorses the efficacy and reliable usage of the proposed work for real-world applications.
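As a generic illustration of such an approach (not the architecture evaluated in this paper), a small CNN classifying 48x48 grayscale face crops into the seven basic emotions could look like this:

```python
import torch
import torch.nn as nn

# Small CNN for 7-class expression recognition (anger, disgust, fear,
# happiness, sadness, surprise, neutral). Layer sizes are illustrative only.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48x48 -> 24x24
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24x24 -> 12x12
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12x12 -> 6x6
    nn.Flatten(),
    nn.Linear(128 * 6 * 6, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 7),
)

faces = torch.randn(8, 1, 48, 48)            # grayscale face crops (toy data)
labels = torch.randint(0, 7, (8,))
loss = nn.CrossEntropyLoss()(model(faces), labels)
loss.backward()
```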
Article
Full-text available
Predicting face attributes from web images is challenging due to background clutter and face variations. A novel deep learning framework is proposed for face attribute prediction in the wild. It cascades two CNNs (LNet and ANet) for face localization and attribute prediction, respectively. These nets are trained in a cascade manner with attribute labels, but pre-trained differently: LNet is pre-trained on massive general object categories, while ANet is pre-trained on massive face identities. This framework not only outperforms the state of the art by a large margin, but also reveals several valuable facts about learning face representations, as follows. (1) It shows how LNet and ANet can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned with attribute labels, their response maps over the entire image give a strong indication of the face's location. This fact enables training LNet for face localization with only attribute tags, without the face bounding boxes required by all detection works. With a novel fast feed-forward scheme, the cascade of LNet and ANet can localize faces and recognize attributes in images of arbitrary size in real time. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training, and that such concepts are significantly enriched after fine-tuning. Each attribute can be well explained by a sparse linear combination of these concepts, and by analyzing such combinations, the attributes show clear grouping patterns that can be interpreted semantically.
Article
Full-text available
We present an image editing program which allows artists to paint in the gradient domain with real-time feedback on megapixel-sized images. Along with a pedestrian, though powerful, gradient-painting brush and gradient-clone tool, we introduce an edge brush designed for edge selection and replay. These brushes, coupled with special blending modes, allow users to accomplish global lighting and contrast adjustments using only local image manipulations --- e.g. strengthening a given edge or removing a shadow boundary. Such operations would be tedious in a conventional intensity-based paint program and hard for users to get right in the gradient domain without real-time feedback. The core of our paint program is a simple-to-implement GPU multigrid method which allows integration of megapixel-sized full-color gradient fields at over 20 frames per second on modest hardware. By way of evaluation, we present example images produced with our program and characterize the iteration time and convergence rate of our integration method.
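At the core of gradient-domain painting is integrating an edited gradient field back into an image. A single-grid Jacobi sketch of that Poisson integration is shown below; it is slow and determined only up to an additive constant, whereas the paper's GPU multigrid solver converges interactively:

```python
import numpy as np

def integrate_gradients(gx, gy, iters=500):
    """Recover an image whose gradients approximate (gx, gy) by Jacobi-iterating
    the discrete Poisson equation lap(u) = div(g). Single-grid and slow; only
    a sketch of the integration step, not the paper's multigrid solver."""
    h, w = gx.shape
    # Divergence of the target gradient field (backward differences).
    div = np.zeros((h, w))
    div[:, 1:] += gx[:, 1:] - gx[:, :-1]
    div[1:, :] += gy[1:, :] - gy[:-1, :]
    u = np.zeros((h, w))
    for _ in range(iters):
        # Jacobi update: each pixel becomes the neighbour sum minus the
        # divergence, divided by four.
        padded = np.pad(u, 1, mode="edge")
        neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:])
        u = (neighbours - div) / 4.0
    return u

# Toy gradient field: the forward-difference gradients of a smooth ramp.
img = np.outer(np.linspace(0, 1, 64), np.ones(64))
gx = np.diff(img, axis=1, prepend=img[:, :1])
gy = np.diff(img, axis=0, prepend=img[:1, :])
recovered = integrate_gradients(gx, gy)
```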
Article
We present a novel approach for image completion that results in images that are both locally and globally consistent. With a fully convolutional neural network, we can complete images of arbitrary resolution by filling in missing regions of any shape. To train this image completion network to be consistent, we use global and local context discriminators that are trained to distinguish real images from completed ones. The global discriminator looks at the entire image to assess whether it is coherent as a whole, while the local discriminator looks only at a small area centered on the completed region to ensure the local consistency of the generated patches. The image completion network is then trained to fool both context discriminator networks, which requires it to generate images that are indistinguishable from real ones with regard to overall consistency as well as fine details. We show that our approach can be used to complete a wide variety of scenes. Furthermore, in contrast to patch-based approaches such as PatchMatch, our approach can generate fragments that do not appear elsewhere in the image, which allows us to naturally complete images of objects with familiar and highly specific structures, such as faces.
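The two-discriminator adversarial term can be sketched as follows; the tiny discriminators, image sizes, and fixed hole position are assumptions for illustration, not the paper's networks:

```python
import torch
import torch.nn as nn

def make_disc(in_size):
    """Tiny convolutional discriminator producing a single real/fake logit."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Flatten(),
        nn.Linear(64 * (in_size // 4) ** 2, 1),
    )

global_D = make_disc(128)   # sees the whole completed image
local_D = make_disc(64)     # sees only a crop around the filled-in hole
bce = nn.BCEWithLogitsLoss()

completed = torch.randn(2, 3, 128, 128, requires_grad=True)  # generator output
top, left = 32, 32                                            # hole position (assumed)
local_patch = completed[:, :, top:top + 64, left:left + 64]

# Generator-side adversarial loss: fool both discriminators at once.
adv_loss = (bce(global_D(completed), torch.ones(2, 1)) +
            bce(local_D(local_patch), torch.ones(2, 1)))
adv_loss.backward()
```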
Article
While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.
Conference Paper
We report our image-based static facial expression recognition method for the Emotion Recognition in the Wild Challenge (EmotiW) 2015. We focus on the sub-challenge of the SFEW 2.0 dataset, where one seeks to automatically classify a set of static images into 7 basic emotions. The proposed method contains a face detection module based on an ensemble of three state-of-the-art face detectors, followed by a classification module with an ensemble of multiple deep convolutional neural networks (CNNs). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, we present two schemes for learning the ensemble weights of the network responses: minimizing the log-likelihood loss and minimizing the hinge loss. Our proposed method achieves state-of-the-art results on the FER dataset. It also achieves 55.96% and 61.29% on the validation and test sets of SFEW 2.0, respectively, surpassing the challenge baselines of 35.96% and 39.13% by significant margins.
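The log-likelihood weighting scheme can be illustrated by learning convex combination weights over pre-computed member probabilities on held-out data. The sketch below is a generic stand-in, not the paper's exact procedure; the optimizer, step count, and toy data are assumptions:

```python
import torch

def learn_ensemble_weights(member_probs, labels, steps=200, lr=0.1):
    """Learn convex combination weights over K pre-computed model outputs by
    minimizing the negative log-likelihood on held-out data.
    member_probs: (K, N, C) class probabilities from K networks on N samples."""
    K = member_probs.shape[0]
    logits = torch.zeros(K, requires_grad=True)        # softmax -> convex weights
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits, dim=0)               # weights sum to 1
        mixed = torch.einsum("k,knc->nc", w, member_probs)
        nll = -torch.log(mixed[torch.arange(len(labels)), labels] + 1e-8).mean()
        opt.zero_grad()
        nll.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=0)

# Toy data: 3 models, 100 validation samples, 7 emotion classes.
probs = torch.softmax(torch.randn(3, 100, 7), dim=-1)
labels = torch.randint(0, 7, (100,))
weights = learn_ensemble_weights(probs, labels)
```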
Article
We propose the coupled generative adversarial network (CoGAN) framework for generating pairs of corresponding images in two different domains. It consists of a pair of generative adversarial networks, each responsible for generating images in one domain. We show that by enforcing a simple weight-sharing constraint, the CoGAN learns to generate pairs of corresponding images without any pairs of corresponding images from the two domains existing in the training set. In other words, the CoGAN learns a joint distribution of images in the two domains from images drawn separately from the marginal distributions of the individual domains. This is in contrast to existing multi-modal generative models, which require corresponding images for training. We apply the CoGAN to several pair image generation tasks; for each task, the CoGAN learns to generate convincing pairs of corresponding images. We further demonstrate applications of the CoGAN framework to domain adaptation and cross-domain image generation.
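The weight-sharing constraint is easy to sketch: the two generators share their early layers, which encode high-level semantics, and keep separate output layers. Layer sizes below are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CoupledGenerators(nn.Module):
    """Two generators that share their first (semantic) layers and keep
    separate final layers, so a single latent code z yields a corresponding
    pair of images, one per domain."""
    def __init__(self, z_dim=64, img_dim=28 * 28):
        super().__init__()
        self.shared = nn.Sequential(                 # tied weights: shared semantics
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.head_a = nn.Linear(256, img_dim)        # domain A rendering
        self.head_b = nn.Linear(256, img_dim)        # domain B rendering

    def forward(self, z):
        h = self.shared(z)
        return torch.tanh(self.head_a(h)), torch.tanh(self.head_b(h))

gen = CoupledGenerators()
z = torch.randn(16, 64)
img_a, img_b = gen(z)    # a corresponding pair, learned without paired training data
```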
Article
Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark locations in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that improves performance automatically without manual sample selection. Our method achieves superior accuracy over state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection and the AFLW benchmark for face alignment, while keeping real-time performance.
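The online hard sample mining strategy can be sketched as keeping only the hardest fraction of each mini-batch for back-propagation; the keep ratio and the toy face/non-face classification task below are assumptions rather than the paper's exact setting:

```python
import torch
import torch.nn as nn

def hard_sample_loss(logits, targets, keep_ratio=0.7):
    """Online hard sample mining: compute the per-sample loss, sort it, and
    back-propagate only through the hardest fraction of the mini-batch.
    keep_ratio is a tunable hyper-parameter (0.7 here is an assumption)."""
    per_sample = nn.functional.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(keep_ratio * per_sample.numel()))
    hard, _ = torch.topk(per_sample, k)      # largest losses = hardest samples
    return hard.mean()

logits = torch.randn(32, 2, requires_grad=True)   # toy face / non-face scores
targets = torch.randint(0, 2, (32,))
loss = hard_sample_loss(logits, targets)
loss.backward()
```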
Article
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Article
Using generic interpolation machinery based on solving Poisson equations, a variety of novel tools are introduced for seamless editing of image regions. The first set of tools permits the seamless importation of both opaque and transparent source image regions into a destination region. The second set is based on similar mathematical ideas and allows the user to modify the appearance of the image seamlessly, within a selected region. These changes can be arranged to affect the texture, the illumination, and the color of objects lying in the region, or to make tileable a rectangular selection.
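A minimal sketch of seamless importation for a single channel and a rectangular region follows: inside the region the result keeps the source's Laplacian, while its border is pinned to the destination image (a Dirichlet condition). This is a simplification of the paper's general guided interpolation, with toy data:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def poisson_clone_rect(source, target, top, left):
    """Seamlessly paste `source` into `target` at (top, left) by solving the
    discrete Poisson equation: inside the pasted region the result matches
    the source's Laplacian, while its border is pinned to the target."""
    h, w = source.shape
    pad = np.pad(source, 1, mode="edge")
    lap = (-4 * source + pad[:-2, 1:-1] + pad[2:, 1:-1]
           + pad[1:-1, :-2] + pad[1:-1, 2:])        # Laplacian of the source
    n = h * w
    A = lil_matrix((n, n))
    b = np.zeros(n)
    for i in range(h):
        for j in range(w):
            k = i * w + j
            A[k, k] = -4.0
            b[k] = lap[i, j]
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    A[k, ni * w + nj] = 1.0                # interior neighbour
                else:
                    b[k] -= target[top + ni, left + nj]    # Dirichlet boundary value
    f = spsolve(A.tocsr(), b).reshape(h, w)
    out = target.copy()
    out[top:top + h, left:left + w] = f
    return out

target = np.linspace(0, 1, 64 * 64).reshape(64, 64)   # toy destination image
source = np.random.rand(16, 16) * 0.2                 # toy patch to import
blended = poisson_clone_rect(source, target, top=20, left=20)
```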