Article

The Digital Emily Project: Achieving a Photorealistic Digital Actor

Authors:
  • Image Metrics

Abstract

The Digital Emily Project uses advanced face scanning, character rigging, performance capture, and compositing to achieve one of the world's first photorealistic digital facial performances. The project scanned the geometry and reflectance of actress Emily O'Brien's face in a light stage, capturing 33 poses spanning different emotions, gaze directions, and lip formations. These high-resolution scans, accurate to skin pores and fine wrinkles, became the basis for building a blendshape-based facial-animation rig whose expressions closely matched the scans. The blendshape rig drove displacement maps to add dynamic surface detail. A video-based facial animation system animated the face according to the performance in a reference video, and the digital face was tracked onto the video's motion and rendered under the same illumination. The result was a realistic 3D digital facial performance credited as one of the first to cross the "uncanny valley" between animated and fully human performances.
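To make the rig evaluation step concrete, here is a minimal sketch (not the project's actual code; names and array shapes are illustrative assumptions) of how a linear blendshape rig combines scanned expression offsets, and how the same weights can blend per-expression displacement maps for dynamic wrinkle detail:

```python
import numpy as np

def evaluate_blendshape_rig(neutral, deltas, weights):
    """Evaluate a linear blendshape rig.

    neutral : (V, 3) vertices of the neutral scan.
    deltas  : (K, V, 3) per-expression offsets (expression scan minus neutral).
    weights : (K,) animation weights, typically in [0, 1].
    Returns the deformed (V, 3) vertex positions.
    """
    return neutral + np.tensordot(weights, deltas, axes=1)

def blend_displacement_maps(neutral_disp, expr_disps, weights):
    """Blend per-expression displacement maps (H, W) with the same rig weights,
    so fine wrinkle detail appears only when the corresponding expressions are active."""
    return neutral_disp + np.tensordot(weights, expr_disps - neutral_disp, axes=1)

# Toy example: 2 expression shapes on a 4-vertex patch.
neutral = np.zeros((4, 3))
deltas = np.random.randn(2, 4, 3) * 0.01
weights = np.array([0.7, 0.2])            # e.g. 70% smile, 20% brow raise
posed = evaluate_blendshape_rig(neutral, deltas, weights)
print(posed.shape)  # (4, 3)
```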

... To avoid such artifacts, we follow the W initialization protocol introduced in [39], and on top we apply an L2 constraint to ensure that W's values do not deviate greatly from their feasible space. Shape Regularization: We follow the literature [10] in constraining the shape parameters weighted by their inverse eigenvalues: L_s = ||p_s||. (Figure 5: comparison of diffuse and specular albedo reconstruction and rendering of Digital Emily [2] against prior works, including AvatarMe++ [43], AlbedoMM [67], and Dib et al. 2021 [18].) ...
... Figure 5 compares our 3-image and 1-image diffuse and specular albedo reconstructions and renderings of Digital Emily [2] with prior works [18, 43, 67]. Both our single- and three-image reconstructions achieve results similar to the captured data, which required specialized hardware and hundreds of images. ...
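The shape regularizer quoted in these excerpts weights each 3DMM coefficient by the inverse of its PCA eigenvalue; the excerpt truncates the exact formula, so the squared form and names below are assumptions. A minimal sketch:

```python
import numpy as np

def shape_regularization(p_shape, eigenvalues):
    """L2 prior on 3DMM shape coefficients, weighted by inverse eigenvalues.

    p_shape     : (N,) PCA shape coefficients being optimized.
    eigenvalues : (N,) variances of the corresponding PCA shape components.
    Coefficients along low-variance directions are penalized more strongly,
    keeping the fit close to the statistical face prior.
    """
    return np.sum(p_shape ** 2 / eigenvalues)

# Example: the same coefficient costs little on a high-variance component
# and a lot on a low-variance one.
print(shape_regularization(np.array([2.0, 2.0]), np.array([10.0, 0.1])))  # 0.4 + 40.0 = 40.4
```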
... We find that a frontal and two side images produce high quality reconstructions that resemble facial capture. Fig. 5 shows a comparison of our 3-image reconstruction and our 1-image reconstruction, with a Light Stage captured Digital Emily [2], and prior work [18,43,67]. As can be seen, our method can successfully be used for fast shape and reflectance acquisition from multi-view sets. ...
Preprint
In this paper, we introduce FitMe, a facial reflectance model and a differentiable rendering optimization pipeline, that can be used to acquire high-fidelity renderable human avatars from single or multiple images. The model consists of a multi-modal style-based generator, that captures facial appearance in terms of diffuse and specular reflectance, and a PCA-based shape model. We employ a fast differentiable rendering process that can be used in an optimization pipeline, while also achieving photorealistic facial shading. Our optimization process accurately captures both the facial reflectance and shape in high-detail, by exploiting the expressivity of the style-based latent representation and of our shape model. FitMe achieves state-of-the-art reflectance acquisition and identity preservation on single "in-the-wild" facial images, while it produces impressive scan-like results, when given multiple unconstrained facial images pertaining to the same identity. In contrast with recent implicit avatar reconstructions, FitMe requires only one minute and produces relightable mesh and texture-based avatars, that can be used by end-user applications.
... To evaluate our reconstruction pipeline, we compare reconstructed reflectance maps and renderings acquired with AvatarMe++ against ground truth data captured in a similar manner to our dataset RealFaceDB, the Digital Emily Project [80], and the current state of the art. We use the Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) [81]. ...
... Fig. 10: Consistency of AvatarMe++ on varying conditions, from the Digital Emily Project [80]. We calculate on average 30.94 ...
... PSNR and 0.0007 MSE between our results. Compared to the ground truth from [80], we achieve on average 0.0083 MSE and 20.13 PSNR on albedo, and 0.011 MSE and 24.02 PSNR on normals. ...
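The MSE and PSNR figures quoted in these excerpts follow the standard definitions; a minimal sketch, assuming images normalized to [0, 1]:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images (same shape, values in [0, 1])."""
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(a, b)
    return float("inf") if err == 0 else 10.0 * np.log10(max_val ** 2 / err)

# An MSE of ~0.0083 corresponds to roughly 20.8 dB, in the same range as the
# 20.13 dB PSNR on albedo quoted above (the exact value depends on averaging).
print(10.0 * np.log10(1.0 / 0.0083))  # ~20.8
```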
Preprint
Over the last years, many face analysis tasks have accomplished astounding performance, with applications including face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce render-ready high-resolution 3D faces from "in-the-wild" images and this can be attributed to the: (a) scarcity of available data for training, and (b) lack of robust methodologies that can successfully be applied on very high-resolution data. In this work, we introduce the first method that is able to reconstruct photorealistic render-ready 3D facial geometry and BRDF from a single "in-the-wild" image. We capture a large dataset of facial shape and reflectance, which we have made public. We define a fast facial photorealistic differentiable rendering methodology with accurate facial skin diffuse and specular reflection, self-occlusion and subsurface scattering approximation. With this, we train a network that disentangles the facial diffuse and specular BRDF components from a shape and texture with baked illumination, reconstructed with a state-of-the-art 3DMM fitting method. Our method outperforms the existing arts by a significant margin and reconstructs high-resolution 3D faces from a single low-resolution image, that can be rendered in various applications, and bridge the uncanny valley.
... We also show an extensive qualitative comparison with related 3D reconstruction methods in Fig. 5 (most of which can only recover the texture), where similar observations can be made. Finally, we test our method on images from the Digital Emily [2] and show the results in Fig. 8 together with related works [18,42]. We yield similar results regardless of the lighting, thanks to our coupled texture/reflectance modeling that combines reflectance with randomly rendered textures during training. ...
... We show the cosine similarity distribution between ground truth and reconstruction. (Figure caption: reconstruction of Digital Emily [2] by our method as well as [18,42]; we show the diffuse and specular albedo for all methods (where available), plus the recovered texture for our method.) ...
Preprint
Full-text available
Following the remarkable success of diffusion models on image generation, recent works have also demonstrated their impressive ability to address a number of inverse problems in an unsupervised way, by properly constraining the sampling process based on a conditioning input. Motivated by this, in this paper, we present the first approach to use diffusion models as a prior for highly accurate 3D facial BRDF reconstruction from a single image. We start by leveraging a high-quality UV dataset of facial reflectance (diffuse and specular albedo and normals), which we render under varying illumination settings to simulate natural RGB textures and, then, train an unconditional diffusion model on concatenated pairs of rendered textures and reflectance components. At test time, we fit a 3D morphable model to the given image and unwrap the face in a partial UV texture. By sampling from the diffusion model, while retaining the observed texture part intact, the model inpaints not only the self-occluded areas but also the unknown reflectance components, in a single sequence of denoising steps. In contrast to existing methods, we directly acquire the observed texture from the input image, thus, resulting in more faithful and consistent reflectance estimation. Through a series of qualitative and quantitative comparisons, we demonstrate superior performance in both texture completion as well as reflectance reconstruction tasks.
... Recent years have witnessed the rise of human digitization [25,2,70,3,72]. This technology greatly impacts the entertainment, education, design, and engineering industries. ...
... There is a well-developed industry solution for this task. High-fidelity reconstruction of humans can be achieved either with full-body laser scans [77], dense synchronized multi-view cameras [91,90], or light stages [2]. However, these setups are expensive and tedious to deploy and involve a complex processing pipeline, preventing the technology's democratization. ...
Preprint
Full-text available
Efficiently digitizing high-fidelity animatable human avatars from videos is a challenging and active research topic. Recent volume rendering-based neural representations open a new way for human digitization with their friendly usability and photo-realistic reconstruction quality. However, they are inefficient, with long optimization times and slow inference speed; their implicit nature results in entangled geometry, materials, and dynamics of humans, which are hard to edit afterward. Such drawbacks prevent their direct applicability to downstream applications, especially the prominent rasterization-based graphic ones. We present EMA, a method that Efficiently learns Meshy neural fields to reconstruct animatable human Avatars. It jointly optimizes an explicit triangular canonical mesh, spatially varying material, and motion dynamics via inverse rendering in an end-to-end fashion. Each of the above components is derived from a separate neural field, relaxing the requirement for a template or rigging. The mesh representation is highly compatible with efficient rasterization-based renderers, thus our method only takes about an hour of training and can render in real time. Moreover, only minutes of optimization are enough for plausible reconstruction results. The disentanglement of meshes enables direct downstream applications. Extensive experiments illustrate very competitive performance and a significant speed boost over previous methods. We also showcase applications including novel pose synthesis, material editing, and relighting. The project page: https://xk-huang.github.io/ema/.
... The next step is to register the scans such that they share a common parameterization: each face should contain the same number of vertices and triangulation F, and each three-dimensional point should have ... Figure 2.1 - Example of facial mesh. From left to right: raw scan (from Savran et al. [2008]), template mesh (adapted from Alexander et al. [2010]), registered mesh. ...
... The framework was implemented in PyTorch v1.0.1 [Paszke et al., 2019], and the experiments were run using an NVIDIA GeForce GTX 1080 GPU. For the facial mesh template we used a cropped version of the publicly available Digital Emily [Alexander et al., 2010], which consisted of n = 10057 vertices (see Figure 2.1). ...
Thesis
Data-driven models of the 3D face are a promising direction for capturing the subtle complexities of the human face, and a central component of numerous applications thanks to their ability to simplify complex tasks. Most data-driven approaches to date were built from either a relatively limited number of samples or by synthetic data augmentation, mainly because of the difficulty in obtaining large-scale and accurate 3D scans of the face. Yet, there is a substantial amount of information that can be gathered when considering publicly available sources that have been captured over the last decade, whose combination can potentially bring forward more powerful models. This thesis proposes novel methods for building data-driven models of the 3D face geometry, and investigates whether improved performances can be obtained by learning from large and varied datasets of 3D facial scans. In order to make efficient use of a large number of training samples, we develop novel deep learning techniques designed to effectively handle three-dimensional face data. We focus on several aspects that influence the geometry of the face: its shape components including fine details, its motion components such as expression, and the interaction between these two subspaces. We develop in particular two approaches for building generative models that decouple the latent space according to natural sources of variation, e.g. identity and expression. The first approach considers a novel deep autoencoder architecture that allows learning a multilinear model without requiring the training data to be assembled as a complete tensor. We next propose a novel non-linear model based on adversarial training that further improves the decoupling capacity. This is enabled by a new 3D-2D architecture combining a 3D generator with a 2D discriminator, where both domains are bridged by a geometry mapping layer. As a necessary prerequisite for building data-driven models, we also address the problem of registering a large number of 3D facial scans in motion. We propose an approach that can efficiently and automatically handle a variety of sequences while making minimal assumptions on the input data. This is achieved by the use of a spatiotemporal model as well as a regression-based initialization, and we show that we can obtain accurate registrations in an efficient and scalable manner. Finally, we address the problem of recovering surface normals from natural images, with the goal of enriching existing coarse 3D reconstructions. We propose a method that can leverage all available image and normal data, whether paired or not, thanks to a new cross-modal learning architecture. Core to our approach is a novel module that we call deactivable skip connections, which allows transferring the local details from the image to the output surface without hurting the performance when autoencoding modalities, achieving state-of-the-art results for the task.
... The advantages brought by 3D facial analysis systems come at the price of a more complex imaging process, which can often limit their scope. 3D facial information is usually captured using stereo-vision systems [11,12,13], 3D laser scanners [14] (e.g. NextEngine and Cyberware), and RGB-D cameras (such as Kinect). ...
... From each image I_i in the training set, its reflected light B_i is estimated assuming a Lambertian illumination model as A_i Y^T in eq. (11). Therefore, each image can be recovered as a weighted combination of the columns of B_i: I_i = B_i l_i. ...
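A minimal sketch of the per-image Lambertian model described in this excerpt, with illustrative shapes and names (A_i is treated here as a per-pixel albedo and Y as spherical-harmonic basis values; details of the cited formulation may differ):

```python
import numpy as np

def reflected_light_matrix(albedo, sh_basis):
    """B_i = A_i Y^T: each column is the per-pixel albedo times one spherical-
    harmonic basis function evaluated at that pixel's surface normal.

    albedo   : (P,) per-pixel albedo A_i (the diagonal of the albedo matrix).
    sh_basis : (K, P) SH basis functions Y evaluated per pixel.
    Returns  : (P, K) matrix B_i.
    """
    return albedo[:, None] * sh_basis.T

def render_image(B_i, l_i):
    """I_i = B_i l_i: the image is a weighted combination of the columns of B_i,
    with the lighting coefficients l_i as weights."""
    return B_i @ l_i

# Toy example: 5 pixels, 4 lighting coefficients.
albedo = np.full(5, 0.6)
sh_basis = np.random.rand(4, 5)
B = reflected_light_matrix(albedo, sh_basis)
image = render_image(B, np.array([1.0, 0.2, -0.1, 0.05]))
print(image.shape)  # (5,)
```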
Preprint
Full-text available
Recently, a lot of attention has been focused on the incorporation of 3D data into face analysis and its applications. Despite providing a more accurate representation of the face, 3D face images are more complex to acquire than 2D pictures. As a consequence, great effort has been invested in developing systems that reconstruct 3D faces from an uncalibrated 2D image. However, the 3D-from-2D face reconstruction problem is ill-posed, thus prior knowledge is needed to restrict the solution space. In this work, we review 3D face reconstruction methods in the last decade, focusing on those that only use 2D pictures captured under uncontrolled conditions. We present a classification of the proposed methods based on the technique used to add prior knowledge, considering three main strategies, namely, statistical model fitting, photometry, and deep learning, and reviewing each of them separately. In addition, given the relevance of statistical 3D facial models as prior knowledge, we explain the construction procedure and provide a comprehensive list of the publicly available 3D facial models. After the exhaustive study of 3D-from-2D face reconstruction approaches, we observe that the deep learning strategy has been growing rapidly over the last few years, matching in extent the widespread statistical model fitting. Unlike the other two strategies, photometry-based methods have decreased in number, since the required strong assumptions cause the reconstructions to be of more limited quality than those resulting from model fitting and deep learning methods. The review also identifies current gaps and suggests avenues for future research.
... However, it comes at a high cost and also demands high computational power to process the captured data [Garrido et al. 2015;Thies et al. 2016]. Second, the level of detail that 3D face models can represent has to be improved beyond the uncanny valley [Alexander et al. 2010]. Despite the advances of computer graphics over decades [Blanz and Vetter 1999;Blanz et al. 2004;Suwajanakorn et al. 2015b;Averbuch-Elor et al. 2017], computer-generated face images still provide us with a sense of repulsion or distaste. ...
... We use the N_s = N_r = 128 most significant principal directions to span our face space. The expression basis used is a combination of the Digital Emily model [Alexander et al. 2010] and FaceWarehouse [Cao et al. 2014b] (see Thies et al. (2016) for details). We use PCA to compress the over-complete blendshapes (76 vectors) to a subspace of N_e = 64 dimensions. ...
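As an illustration of the compression step this excerpt describes (reducing an over-complete set of 76 blendshape vectors to a 64-dimensional PCA subspace), a minimal sketch with assumed array shapes:

```python
import numpy as np

def compress_blendshapes(deltas, n_components=64):
    """Compress an over-complete blendshape basis with PCA.

    deltas : (K, 3V) matrix of blendshape offset vectors (K = 76 in the excerpt).
    Returns (mean, basis) where basis has n_components rows spanning the
    expression subspace; any expression is then approximated as mean + coeffs @ basis.
    """
    mean = deltas.mean(axis=0)
    centered = deltas - mean
    # SVD of the centered data; the right singular vectors are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

# Toy example: 76 blendshapes over a 1000-vertex mesh, compressed to 64 components.
deltas = np.random.randn(76, 3 * 1000)
mean, basis = compress_blendshapes(deltas, 64)
print(basis.shape)  # (64, 3000)
```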
... To generate user-specific blendshapes for each neutral face, hand-crafted or 3D-scanned blendshape models are required [1,25]. Li et al. [26] generate facial blendshape rigs from sparse exemplars. ...
Preprint
Full-text available
With the booming of virtual reality (VR) technology, there is a growing need for customized 3D avatars. However, traditional methods for 3D avatar modeling are either time-consuming or fail to retain similarity to the person being modeled. We present a novel framework to generate animatable 3D cartoon faces from a single portrait image. We first transfer an input real-world portrait to a stylized cartoon image with a StyleGAN. Then we propose a two-stage reconstruction method to recover the 3D cartoon face with detailed texture, which first makes a coarse estimation based on template models, and then refines the model by non-rigid deformation under landmark supervision. Finally, we propose a semantic preserving face rigging method based on manually created templates and deformation transfer. Compared with prior arts, qualitative and quantitative results show that our method achieves better accuracy, aesthetics, and similarity criteria. Furthermore, we demonstrate the capability of real-time facial animation of our 3D model.
... Creating 3D human avatars from texts or images is a longstanding challenging task in both computer vision and computer graphics, which is key to a broad range of downstream applications including the digital human, film industry, and virtual reality. Previous approaches have relied on expensive and complex acquisition equipment to reconstruct high-fidelity avatar models [Alexander et al. 2010;Guo et al. 2017;Xiao et al. 2022]. However, these methods require multi-view images or depth maps that are unaffordable for consumer-level applications. ...
Preprint
We introduce AvatarBooth, a novel method for generating high-quality 3D avatars using text prompts or specific images. Unlike previous approaches that can only synthesize avatars based on simple text descriptions, our method enables the creation of personalized avatars from casually captured face or body images, while still supporting text-based model generation and editing. Our key contribution is the precise avatar generation control by using dual fine-tuned diffusion models separately for the human face and body. This enables us to capture intricate details of facial appearance, clothing, and accessories, resulting in highly realistic avatar generations. Furthermore, we introduce pose-consistent constraint to the optimization process to enhance the multi-view consistency of synthesized head images from the diffusion model and thus eliminate interference from uncontrolled human poses. In addition, we present a multi-resolution rendering strategy that facilitates coarse-to-fine supervision of 3D avatar generation, thereby enhancing the performance of the proposed system. The resulting avatar model can be further edited using additional text descriptions and driven by motion sequences. Experiments show that AvatarBooth outperforms previous text-to-3D methods in terms of rendering and geometric quality from either text prompts or specific images. Please check our project website at https://zeng-yifei.github.io/avatarbooth_page/.
... Unfortunately, the traditional pipeline for creating 3D human avatars involves tedious procedures, including scanning, meshing, rigging and many more. Furthermore, such a pipeline requires expert knowledge and sophisticated capture systems, limiting its access and increasing its cost [Alexander et al. 2010]. ...
Preprint
We present AvatarReX, a new method for learning NeRF-based full-body avatars from video data. The learnt avatar not only provides expressive control of the body, hands and the face together, but also supports real-time animation and rendering. To this end, we propose a compositional avatar representation, where the body, hands and the face are separately modeled in a way that the structural prior from parametric mesh templates is properly utilized without compromising representation flexibility. Furthermore, we disentangle the geometry and appearance for each part. With these technical designs, we propose a dedicated deferred rendering pipeline, which can be executed in real-time framerate to synthesize high-quality free-view images. The disentanglement of geometry and appearance also allows us to design a two-pass training strategy that combines volume rendering and surface rendering for network training. In this way, patch-level supervision can be applied to force the network to learn sharp appearance details on the basis of geometry estimation. Overall, our method enables automatic construction of expressive full-body avatars with real-time rendering capability, and can generate photo-realistic images with dynamic details for novel body motions and facial expressions.
... However, acquiring human rendering assets requires tremendous manual work, either by scanning systems [37,18,27,28] or artists [47]. The available datasets are either small in size [2] or do not contain relightable reflectance [51], such as the diffuse albedo, specular albedo, and normals. Recent works [29,30,36,10] have introduced methods that produce high-quality renderable assets from arbitrary facial images. ...
Preprint
Near infrared (NIR) to Visible (VIS) face matching is challenging due to the significant domain gaps as well as a lack of sufficient data for cross-modality model training. To overcome this problem, we propose a novel method for paired NIR-VIS facial image generation. Specifically, we reconstruct 3D face shape and reflectance from a large 2D facial dataset and introduce a novel method of transforming the VIS reflectance to NIR reflectance. We then use a physically-based renderer to generate a vast, high-resolution and photorealistic dataset consisting of various poses and identities in the NIR and VIS spectra. Moreover, to facilitate the identity feature learning, we propose an IDentity-based Maximum Mean Discrepancy (ID-MMD) loss, which not only reduces the modality gap between NIR and VIS images at the domain level but also encourages the network to focus on the identity features instead of facial details, such as poses and accessories. Extensive experiments conducted on four challenging NIR-VIS face recognition benchmarks demonstrate that the proposed method can achieve comparable performance with the state-of-the-art (SOTA) methods without requiring any existing NIR-VIS face recognition datasets. With slight fine-tuning on the target NIR-VIS face recognition datasets, our method can significantly surpass the SOTA performance. Code and pretrained models are released under the insightface repository (https://github.com/deepinsight/insightface/tree/master/recognition).
... In terms of avatar generation and expression driving, previous research [20-24] proposed techniques for capturing high-fidelity avatars, including the material properties of human faces. Alexander et al. [25] produced a high-fidelity digital actor in the Digital Emily project. MetaHuman Creator [26] provides a highly realistic avatar creation process and can create avatars of various genders, skin tones, ages, and appearances. ...
Article
Full-text available
In immersive virtual reality (VR) applications, the facial expressions and lip-syncing of an avatar can have a significant impact on a user’s experience. In this paper, we designed a VR “trust game” scene to evaluate the effects of four expression conditions (positive facial expressions, neutral facial expressions, negative facial expressions and no expressions and lip-syncing) on participants in an immersive VR scene. We measured the participants with both objective and subjective measures. The two objective behavioral measures were the level of investment in the “trust game” and the users’ eye-movement data, and the subjective measures included social presence, emotional awareness level, and user preferences. We found that the participants were generally less trusting of the avatars with negative expressions, while the avatars with positive expressions made the participants feel comfortable and thus increased their willingness to cooperate with the avatars. In conclusion, avatars with facial expressions, whether positive or negative, were more effective in influencing the participants’ trust levels and decision-making behaviors than those without facial expressions. These findings provide novel ideas and suggestions for improving the level of human–computer interaction in VR and enhancing user experience in VR scenes.
... Since then, face forgery technology has developed rapidly, especially in the past decade. Alexander et al. [16] scanned an actress's 33 facial expressions through sophisticated equipment to synthesize a digital version of her. Dale et al. [17] proposed a method for replacing facial expression in videos, which took into account the differences in identity, visual appearance, speech, and time of source and target videos. ...
Article
Full-text available
With the emergence of deep learning, generating forged images or videos has become much easier in recent years. Face forgery detection, as a way to detect forgery, is an important topic in digital media forensics. Despite previous works having made remarkable progress, the spatial relationships of each part of the face that has significant forgery clues are seldom explored. To overcome this shortcoming, a two-stream face forgery detection network that fuses Inception ResNet stream and capsule network stream (IR-Capsule) is proposed in this paper, which can learn both conventional facial features and hierarchical pose relationships and angle features between different parts of the face. Furthermore, part of the Inception ResNet V1 model pre-trained on the VGGFACE2 dataset is utilized as an initial feature extractor to reduce overfitting and training time, and a modified capsule loss is proposed for the IR-Capsule network. Experimental results on the challenging FaceForensics++ benchmark show that the proposed IR-Capsule improves accuracy by more than 3% compared with several state-of-the-art methods.
... This means there are opportunities for malicious attackers to deceive facial recognition systems to impersonate others by using CG images, and to create fake news to gain illegal profits, damage others' reputations, or maliciously create chaos. Projects such as the Digital Emily Project in 2010 [1], the Face2Face Project in 2016 [2], and the Synthesizing Obama Project in 2017 [3] prove that performing a spoofing attack has been greatly simplified, and such hard-to-distinguish CG images have become a focus of security concerns in the fields of news media and the judiciary. ...
Article
Full-text available
Computer-generated (CG) images have become indistinguishable from natural images due to powerful image rendering technology. Fake CG images have brought huge troubles to news media, judicial forensics, and other fields. How to detect CG images has become a key point in solving the problems mentioned above. The image classification method based on deep learning, due to its strong self-learning ability, can automatically determine the differences in image features between CG images and natural images and can be used to detect CG images. However, deep learning often requires a large amount of labeled data, and labeling is usually a tedious and complex task. This paper proposes an improved self-training strategy with fine-tuning teacher/student exchange (FTTSE) to solve the problem of missing labeled datasets. Our method is a strategy based on semisupervised learning: the teacher model is trained on labeled data and used to predict pseudo labels for the unlabeled data. The student model is obtained by continuous training on the mixed dataset composed of labeled and pseudo-labeled data. A teacher/student exchange strategy is designed for iterative training; i.e., the identities of the teacher model and the student model are exchanged at the beginning of each round of iteration. The new teacher model is then used to predict pseudo labels, and the new student model (exchanged from the teacher model in the previous round) is fine-tuned and retrained on the mixed dataset with the new pseudo labels. Furthermore, we introduced malicious image attacks to perturb the mixed dataset to improve the robustness of the student model. The experimental results show that the improved self-training model we proposed can stably maintain its image classification ability even when the testing images are maliciously attacked. After iterative training, the CG image detection accuracy of the final model increases by 5.18%. Robustness against 100% malicious attacks is also improved, with the final trained model achieving an accuracy 7.63% higher than the initial model. The self-training model with the FTTSE strategy proposed in this paper can effectively enhance the detection ability of an existing model and greatly improve its robustness through iterative training.
... To solve the correspondence problem in multiview stereo vision, these methods often place markers [3, 15-17] onto the human face, or use projectors [2,18,19] to project a specified pattern onto it. The well-known Digital Emily Project [3,20] utilized Light Stage 5 [21], a setup composed of 156 LED lights and several high-resolution digital cameras, to generate 37 high-resolution 3D facial models from 90 min of scanning data. In [18], the authors proposed a novel spacetime stereo algorithm to compute depth from video sequences using synchronized video cameras and SL projectors. ...
Article
The acquisition of 3D facial models is crucial in the gaming and film industries. In this study, we developed a facial acquisition system based on infrared structured light sensors to obtain facial-expression models with high fidelity and accuracy. First, we employed time-multiplexing structured light to obtain an accurate and dense point cloud. The template model was then warped to the captured facial expression. A model-tracking method based on optical flow was applied to track the registered 3D model displacement. Finally, the live 3D mesh was textured using high-resolution images captured by three color cameras. Experiments were conducted on human faces to demonstrate the performance of the proposed system and methods.
... This paper attacks the problem of recovering 3D shape and complete facial texture from a single 2D face image. Image-based 3D face reconstruction is a fundamental yet essential problem in computer vision, with broad applications to facial animation [1,2], pose-invariant face recognition [3-7], human-machine interaction [8], etc. Dramatic improvements have been made in image-based 3D face reconstruction in recent years [9-11]. ...
Article
Full-text available
Recent years have witnessed significant progress in image-based 3D face reconstruction using deep convolutional neural networks. However, current reconstruction methods often perform improperly in self-occluded regions and can lead to inaccurate correspondences between a 2D input image and a 3D face template, hindering use in real applications. To address these problems, we propose a deep shape reconstruction and texture completion network, SRTC-Net, which jointly reconstructs 3D facial geometry and completes texture with correspondences from a single input face image. In SRTC-Net, we leverage the geometric cues from completed 3D texture to reconstruct detailed structures of 3D shapes. The SRTC-Net pipeline has three stages. The first introduces a correspondence network to identify pixel-wise correspondence between the input 2D image and a 3D template model, and transfers the input 2D image to a U-V texture map. Then we complete the invisible and occluded areas in the U-V texture map using an inpainting network. To get the 3D facial geometries, we predict coarse shape (U-V position maps) from the segmented face from the correspondence network using a shape network, and then refine the 3D coarse shape by regressing the U-V displacement map from the completed U-V texture map in a pixel-to-pixel way. We examine our methods on 3D reconstruction tasks as well as face frontalization and pose invariant face recognition tasks, using both in-the-lab datasets (MICC, MultiPIE) and in-the-wild datasets (CFP). The qualitative and quantitative results demonstrate the effectiveness of our methods on inferring 3D facial geometry and complete texture; they outperform or are comparable to the state-of-the-art.
... Lately, 3D digital characters have sparked more and more interest in numerous domains, with recent research enabling higher fidelity and realism. While the modeling of actors was made available years ago [2], 3D realistic faces can now be generated from scratch [30], easing the population of digital worlds. With capture technology becoming available to the mass market, realistic digital doubles enable seamless telecommunication between individuals within virtual worlds [51]. ...
Preprint
In this paper, we present FaceTuneGAN, a new 3D face model representation decomposing and encoding separately facial identity and facial expression. We propose a first adaptation of image-to-image translation networks, which have been used successfully in the 2D domain, to 3D face geometry. Leveraging recently released large face scan databases, a neural network has been trained to decouple factors of variation with a better knowledge of the face, enabling facial expression transfer and neutralization of expressive faces. Specifically, we design an adversarial architecture adapting the base architecture of FUNIT and using SpiralNet++ for our convolutional and sampling operations. Using two publicly available datasets (FaceScape and CoMA), FaceTuneGAN has a better identity decomposition and face neutralization than state-of-the-art techniques. It also outperforms the classical deformation transfer approach by predicting blendshapes closer to ground-truth data, with fewer undesired artifacts caused by overly different facial morphologies between source and target.
... 3D face reconstruction is a fundamental yet essential task in computer vision, with broad applications in facial animation [23,41], poseinvariant face recognition [13,51], human-machine interaction [1], and others. Dramatic improvements [14,17,24] have been made on 3D face shape reconstruction in recent years. ...
Article
Recent years have witnessed deep learning based methods achieving significant progress in recovering 3D face shape from a single image. However, reconstructing realistic 3D facial texture from a single image is still a challenging task due to the unavailability of large-scale training datasets and the limited expressive ability of previous statistical texture models (e.g. 3DMM). In this paper, we introduce a novel deep architecture trained by self-supervision with a multi-view setup, to reconstruct 3D facial texture. Specifically, we first obtain an incomplete UV texture map from the input facial image, and then introduce a Texture Completion Network (TC-Net) to inpaint missing areas. To train TC-Net, we first collect 50,000 triplets of facial images from in-the-wild videos, each triplet consisting of a nearly frontal, a left-side, and a right-side facial image. With this dataset, we propose a novel multi-view consistency loss that ensures consistent photometric, face identity, 3DMM identity, and UV texture among multi-view facial images. This loss allows optimizing TC-Net in a self-supervised way without using a ground-truth texture map as supervision. In addition, multi-view input images are only required in training to provide self-supervision, and our method only needs a single input image at inference. Extensive experiments show that our method achieves state-of-the-art performance in both qualitative and quantitative comparisons.
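The multi-view consistency loss this abstract describes combines several pairwise terms over a triplet of views; a minimal sketch of such a weighted combination, with placeholder callables standing in for the paper's actual photometric, identity, 3DMM identity, and UV texture terms (weights and term definitions are assumptions):

```python
def multiview_consistency_loss(views, weights, photometric, identity, shape_id, uv_texture):
    """Sum pairwise consistency terms over a set of facial views.

    views   : list of per-view predictions (e.g. rendered image, embeddings, UV map).
    weights : dict of scalar weights for each term.
    photometric, identity, shape_id, uv_texture : callables returning a scalar
        discrepancy between two views (placeholders for the paper's actual terms).
    """
    total = 0.0
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            a, b = views[i], views[j]
            total += (weights["photo"] * photometric(a, b)
                      + weights["face_id"] * identity(a, b)
                      + weights["3dmm_id"] * shape_id(a, b)
                      + weights["uv"] * uv_texture(a, b))
    return total
```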
... Building expressive and animatable virtual humans is a well studied problem in the graphics community. The creation of so called digital doubles has roots in the special effects industry [Alexander et al. 2010], and in recent years has begun to see examples of real-time uses as well such as Siren of Epic Games and DigiDoug of Digital Domain. These models are typically built using sophisticated multi-view capture systems with elaborate scripts that span variations in pose and expression. ...
Preprint
Full-text available
We present a learning-based method for building driving-signal aware full-body avatars. Our model is a conditional variational autoencoder that can be animated with incomplete driving signals, such as human pose and facial keypoints, and produces a high-quality representation of human geometry and view-dependent appearance. The core intuition behind our method is that better drivability and generalization can be achieved by disentangling the driving signals and remaining generative factors, which are not available during animation. To this end, we explicitly account for information deficiency in the driving signal by introducing a latent space that exclusively captures the remaining information, thus enabling the imputation of the missing factors required during full-body animation, while remaining faithful to the driving signal. We also propose a learnable localized compression for the driving signal which promotes better generalization, and helps minimize the influence of global chance-correlations often found in real datasets. For a given driving signal, the resulting variational model produces a compact space of uncertainty for missing factors that allows for an imputation strategy best suited to a particular application. We demonstrate the efficacy of our approach on the challenging problem of full-body animation for virtual telepresence with driving signals acquired from minimal sensors placed in the environment and mounted on a VR-headset.
... Most light stages consist of room-scale, spherical arrays of brightly-flashing colored lights and cameras. They are used widely for movie special effects [5,2,1,21,15], volumetric media [25,6], presidential portraits [12], and to provide rich data for training computer vision relighting algorithms [18,24,19,9,13,20,10]. ...
Preprint
Full-text available
Every time you sit in front of a TV or monitor, your face is actively illuminated by time-varying patterns of light. This paper proposes to use this time-varying illumination for synthetic relighting of your face with any new illumination condition. In doing so, we take inspiration from the light stage work of Debevec et al., who first demonstrated the ability to relight people captured in a controlled lighting environment. Whereas existing light stages require expensive, room-scale spherical capture gantries and exist in only a few labs in the world, we demonstrate how to acquire useful data from a normal TV or desktop monitor. Instead of subjecting the user to uncomfortable rapidly flashing light patterns, we operate on images of the user watching a YouTube video or other standard content. We train a deep network on images plus monitor patterns of a given user and learn to predict images of that user under any target illumination (monitor pattern). Experimental evaluation shows that our method produces realistic relighting results. Video results are available at http://grail.cs.washington.edu/projects/Light_Stage_on_Every_Desk/.
... Generating a complete and high-quality 3D model from a single-view image is a fundamental problem in the computer vision community. Image-based 3D face estimation has a wide variety of applications in animation and virtual reality [1], face recognition [2], mask designs [3,4], human-machine interactions [5], and virtual driving [5]. This article studies the problem of 3D face reconstruction from a single noisy depth image. ...
Article
Full-text available
This paper addresses the 3D face reconstruction and semantic annotation from a single-view noisy depth image. A deep neural network-based coarse-to-fine framework is presented to take advantage of 3D morphable model (3DMM) regression and per-vertex geometry refinement. The low-dimensional subspace coefficients of the 3DMM initialize the global facial geometry, being prone to be over-smooth because of the low-pass characteristics of the shape subspace. The proposed geometry refinement subnetwork predicts per-vertex displacements to enrich local details, which is learned from unlabelled noisy depth images based on the registration-like loss. In order to guarantee the semantic correspondence between the resultant 3D face and the depth image, a semantic consistency constraint is introduced to adapt an annotation model learned from the synthetic data to real noisy depth images. The resultant depth annotations are required to be consistent with the label propagation from the coarse and refined parametric 3D faces. The proposed coarse-to-fine reconstruction scheme and the semantic consistency constraint are evaluated on the depth-based 3D face reconstruction and semantic annotation. The series of experiments demonstrate that the proposed approach achieves the performance improvements over compared methods regarding 3D face reconstruction and depth image annotation.
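As an illustration of the coarse-to-fine scheme this abstract describes (3DMM coefficients give the over-smooth global geometry, then per-vertex displacements add local detail), a minimal sketch with assumed conventions, including the assumption that displacements are applied along vertex normals:

```python
import numpy as np

def coarse_to_fine_reconstruction(mean_shape, shape_basis, coeffs, displacements, normals):
    """Coarse 3DMM regression refined by per-vertex displacements.

    mean_shape    : (3V,) mean face of the morphable model.
    shape_basis   : (3V, N) low-dimensional shape subspace.
    coeffs        : (N,) regressed subspace coefficients (the coarse stage).
    displacements : (V,) per-vertex scalar offsets predicted by the refinement stage.
    normals       : (V, 3) unit vertex normals of the coarse mesh.
    The coarse mesh is over-smooth by construction (low-pass subspace); the
    displacement field re-introduces local detail.
    """
    coarse = (mean_shape + shape_basis @ coeffs).reshape(-1, 3)   # (V, 3)
    return coarse + displacements[:, None] * normals              # refined (V, 3)

# Toy example: 500-vertex model with a 40-dimensional shape subspace.
V, N = 500, 40
refined = coarse_to_fine_reconstruction(
    np.zeros(3 * V), np.random.randn(3 * V, N) * 0.01,
    np.random.randn(N), np.random.randn(V) * 0.001,
    np.tile([0.0, 0.0, 1.0], (V, 1)))
print(refined.shape)  # (500, 3)
```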
... Creating a photo-realistic digital actor is a dream of many people working in computer graphics. One initial success is the Digital Emily Project [3], in which sophisticated devices were used to capture the appearance of an actress and her motions to synthesize a digital version of her. At that time, this ability was unavailable to attackers, so it was impossible to create a digital version of a victim. ...
... The uncanny valley has arguably been overcome in many applications involving static, lifelike imagery such as computer-generated humans (Alexander et al., 2010;Perry, 2014), but animated human-like objects (mainly robots) are arguably far less human-like. A variety of research has shown the importance of kinematics in human-robot interactions. ...
Article
Full-text available
Uncanny valley research has shown that human likeness is an important consideration when designing artificial agents. It has separately been shown that artificial agents exhibiting human-like kinematics can elicit positive perceptual responses. However the kinematic characteristics underlying that perception have not been elucidated. This paper proposes kinematic jerk amplitude as a candidate metric for kinematic human likeness, and aims to determine whether a perceptual optimum exists over a range of jerk values. We created minimum-jerk two-digit grasp kinematics in a prosthetic hand model, then added different amplitudes of temporally smooth noise to yield a variety of animations involving different total jerk levels, ranging from maximally smooth to highly jerky. Subjects indicated their perceptual affinity for these animations by simultaneously viewing two different animations side-by-side, first using a laptop, then separately within a virtual reality (VR) environment. Results suggest that (a) subjects generally preferred smoother kinematics, (b) subjects exhibited a small preference for rougher-than-minimum-jerk kinematics in the laptop experiment, and that (c) the preference for rougher-than-minimum-jerk kinematics was amplified in the VR experiment. These results suggest that non-maximally smooth kinematics may be perceptually optimal in robots and other artificial agents.
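As a concrete reference for the kinematics discussed in this abstract, a minimal sketch (with assumed conventions) of the classic minimum-jerk position profile and of a total-jerk smoothness metric for comparing smoother and rougher motions:

```python
import numpy as np

def minimum_jerk(x0, x1, t, duration):
    """Classic minimum-jerk position profile between x0 and x1 over `duration`."""
    tau = np.clip(t / duration, 0.0, 1.0)
    return x0 + (x1 - x0) * (10 * tau ** 3 - 15 * tau ** 4 + 6 * tau ** 5)

def total_jerk(positions, dt):
    """Sum of squared jerk (third time derivative), a common smoothness metric."""
    jerk = np.diff(positions, n=3, axis=0) / dt ** 3
    return float(np.sum(jerk ** 2) * dt)

# A minimum-jerk profile has much lower total jerk than the same motion with
# added temporally smooth noise (the manipulation described in the abstract).
t = np.linspace(0.0, 1.0, 200)
smooth = minimum_jerk(0.0, 1.0, t, 1.0)
noisy = smooth + 0.01 * np.sin(2 * np.pi * 8 * t)
dt = t[1] - t[0]
print(total_jerk(smooth, dt) < total_jerk(noisy, dt))  # True
```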
Chapter
Two core factors influence the perception of avatars. On one side are the developers who are concerned with building avatars and in some cases pushing the boundaries of realism. These developers are constrained by the resources available to them (equipment, time, and skills) in producing the optimal avatar. On the other side are the users who engage with avatars and who are directly affected by the choices the developers have made. Inside the interaction between these sides is the avatar itself, whose appearance, level of realism, and fundamental characteristics, like a perceived gender(sex), can influence the user's perception of that avatar. Despite the large amount of research dedicated to understanding the perception of avatars, many gaps in our understanding remain. In this work, we aim to contribute to further understanding one of these gaps by investigating the potential role of gender(sex) in the perception of avatar realism and uncanniness. We add to this discussion by presenting the results of an experiment where we evaluated realism and uncanniness perceptions by presenting a set of avatars to participants (n = 2065). These avatars are representative of those used in simulation and training contexts, from publicly available sources, and have varying levels of realism. Participants were asked to rank these avatars in terms of their realism and uncanniness perceptions to determine whether the gender(sex) of the participant influences these perceptions. Our findings show that the gender(sex) of a participant does affect the perception of an avatar's realism and uncanniness levels. Keywords: Uncanny valley, Uncanniness, Avatar, Human computer interaction, Gender(sex)
Article
We present AvatarReX, a new method for learning NeRF-based full-body avatars from video data. The learnt avatar not only provides expressive control of the body, hands and the face together, but also supports real-time animation and rendering. To this end, we propose a compositional avatar representation, where the body, hands and the face are separately modeled in a way that the structural prior from parametric mesh templates is properly utilized without compromising representation flexibility. Furthermore, we disentangle the geometry and appearance for each part. With these technical designs, we propose a dedicated deferred rendering pipeline, which can be executed at a real-time framerate to synthesize high-quality free-view images. The disentanglement of geometry and appearance also allows us to design a two-pass training strategy that combines volume rendering and surface rendering for network training. In this way, patch-level supervision can be applied to force the network to learn sharp appearance details on the basis of geometry estimation. Overall, our method enables automatic construction of expressive full-body avatars with real-time rendering capability, and can generate photo-realistic images with dynamic details for novel body motions and facial expressions.
Book
Full-text available
What kind of relationship do we have with artificial beings (avatars, puppets, robots, etc.)? What does it mean to mirror ourselves in them, to perform them or to play trial identity games with them? Actor & Avatar addresses these questions from artistic and scholarly angles. Contributions on the making of »technical others« and philosophical reflections on artificial alterity are flanked by neuroscientific studies on different ways of perceiving living persons and artificial counterparts. The contributors have achieved a successful artistic-scientific collaboration with extensive visual material.
Article
Full-text available
Background: The task of creating personalised photorealistic talking head models, i.e., systems that can synthesise plausible video sequences of speech expressions and mimics of a particular individual, is considered. Objectives: In this work, we present a system for creating talking head models from a handful of photographs (so-called few-shot learning) with limited training time. In fact, our system can generate a reasonable result based on a single photograph (one-shot learning), while adding a few more photographs increases the fidelity of personalization. Methods / Statistical Analysis: The talking heads created by our model are deep convolutional networks that synthesise video frames in a direct manner by a sequence of convolutional operations rather than by warping. Findings: We present a system with such a few-shot capability. It performs lengthy meta-learning on a large dataset of videos, and after that, it is able to frame the few- and one-shot learning of neural talking head models of previously unseen people as adversarial training problems with high-capacity generators and discriminators. Applications / Improvements: The system is able to initialise the parameters of both the generator and the discriminator in a person-specific way, so that training can be based on just a few images and done quickly, despite the need to tune tens of millions of parameters. We show that such an approach is able to learn highly realistic and personalised talking head models of new people and even portrait paintings.
Chapter
Full-text available
AI-synthesized face-swapping videos, commonly known as DeepFakes, are an emerging problem threatening the trustworthiness of online information. The need to develop and evaluate DeepFake detection algorithms calls for large-scale datasets. However, current DeepFake datasets suffer from low visual quality and do not resemble DeepFake videos circulated on the Internet. We present a new large-scale challenging DeepFake video dataset, Celeb-DF, which contains 5,639 high-quality DeepFake videos of celebrities generated using an improved synthesis process. We conduct a comprehensive evaluation of DeepFake detection methods and datasets to demonstrate the escalated level of challenges posed by Celeb-DF. Then we introduce Landmark Breaker, the first dedicated method to disrupt facial landmark extraction, and apply it to the obstruction of the generation of DeepFake videos. The experiments are conducted on three state-of-the-art facial landmark extractors using our Celeb-DF dataset.
Chapter
Full-text available
Recent years have witnessed exciting progress in automatic face swapping and editing. Many techniques have been proposed, facilitating the rapid development of creative content creation. The emergence and easy accessibility of such techniques, however, also cause potential unprecedented ethical and moral issues. To this end, academia and industry proposed several effective forgery detection methods. Nonetheless, challenges could still exist. (1) Current face manipulation advances can produce high-fidelity fake videos, rendering forgery detection challenging. (2) The generalization capability of most existing detection models is poor, particularly in real-world scenarios where the media sources and distortions are unknown. The primary difficulty in overcoming these challenges is the lack of amenable datasets for real-world face forgery detection. Most existing datasets are either of a small number, of low quality, or overly artificial. Meanwhile, the large distribution gap between training data and actual test videos also leads to weak generalization ability. In this chapter, we present our on-going effort of constructing DeeperForensics-1.0, a large-scale forgery detection dataset, to address the challenges above. We discuss approaches to ensure the quality and diversity of the dataset. Besides, we describe the observations we obtained from organizing DeeperForensics Challenge 2020, a real-world face forgery detection competition based on DeeperForensics-1.0. Specifically, we summarize the winning solutions and provide some discussions on potential research directions.
Chapter
Existing one-shot face reenactment methods either present obvious artifacts in large pose transformations, or cannot well-preserve the identity information in the source images, or fail to meet the requirements of real-time applications due to the intensive amount of computation involved. In this paper, we introduce Face2Faceρ, the first Real-time High-resolution and One-shot (RHO, ρ) face reenactment framework. To achieve this goal, we designed a new 3DMM-assisted warping-based face reenactment architecture which consists of two fast and efficient sub-networks, i.e., a u-shaped rendering network to reenact faces driven by head poses and facial motion fields, and a hierarchical coarse-to-fine motion network to predict facial motion fields guided by different scales of landmark images. Compared with existing state-of-the-art works, Face2Faceρ can produce results of equal or better visual quality, yet with significantly less time and memory overhead. We also demonstrate that Face2Faceρ can achieve real-time performance for face images of 1440×1440 resolution with a desktop GPU and 256×256 resolution with a mobile CPU.
Article
Creating photorealistic avatars of existing people currently requires extensive person-specific data capture, which is usually only accessible to the VFX industry and not the general public. Our work aims to address this drawback by relying only on a short mobile phone capture to obtain a drivable 3D head avatar that matches a person's likeness faithfully. In contrast to existing approaches, our architecture avoids the complex task of directly modeling the entire manifold of human appearance, aiming instead to generate an avatar model that can be specialized to novel identities using only small amounts of data. The model dispenses with low-dimensional latent spaces that are commonly employed for hallucinating novel identities, and instead, uses a conditional representation that can extract person-specific information at multiple scales from a high resolution registered neutral phone scan. We achieve high quality results through the use of a novel universal avatar prior that has been trained on high resolution multi-view video captures of facial performances of hundreds of human subjects. By fine-tuning the model using inverse rendering we achieve increased realism and personalize its range of motion. The output of our approach is not only a high-fidelity 3D head avatar that matches the person's facial shape and appearance, but one that can also be driven using a jointly discovered shared global expression space with disentangled controls for gaze direction. Via a series of experiments we demonstrate that our avatars are faithful representations of the subject's likeness. Compared to other state-of-the-art methods for lightweight avatar creation, our approach exhibits superior visual quality and animateability.
Article
In real-time dynamic reconstruction, geometry and motion are the major focuses while appearance is not fully explored, leading to the low-quality appearance of the reconstructed surfaces. In this paper, we propose a lightweight lighting model that considers spatially varying lighting conditions caused by self-occlusion. This model estimates per-vertex masks on top of a single Spherical Harmonic (SH) lighting to represent spatially varying lighting conditions without adding too much computation cost. The mask is estimated based on the local geometry of a vertex to model the self-occlusion effect, which is the major reason leading to the spatial variation of lighting. Furthermore, to use this model in dynamic reconstruction, we also improve the motion estimation quality by adding a real-time per-vertex displacement estimation step. Experiments demonstrate that both the reconstructed appearance and the motion are largely improved compared with the current state-of-the-art techniques.
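A minimal sketch, under assumed conventions, of the kind of lighting model this abstract describes: a single second-order spherical-harmonic (SH) lighting shared by the whole mesh, modulated by a per-vertex scalar mask that accounts for self-occlusion (SH normalization constants are omitted and the mask estimation itself is not shown):

```python
import numpy as np

def sh_basis(normals):
    """First 9 real SH basis monomials evaluated at unit normals, (N, 3) -> (N, 9).
    Constant normalization factors are omitted for brevity."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        np.ones_like(x),            # l = 0
        y, z, x,                    # l = 1
        x * y, y * z,               # l = 2
        3 * z ** 2 - 1,
        x * z, x ** 2 - y ** 2,
    ], axis=1)

def shade_vertices(albedo, normals, sh_coeffs, occlusion_mask):
    """Per-vertex shading: albedo * mask * (SH basis . global lighting coefficients).

    albedo         : (N,) or (N, 3) per-vertex albedo.
    normals        : (N, 3) unit normals.
    sh_coeffs      : (9,) or (9, 3) global SH lighting coefficients.
    occlusion_mask : (N,) per-vertex scalars in [0, 1] modelling self-occlusion.
    """
    irradiance = sh_basis(normals) @ sh_coeffs          # (N,) or (N, 3)
    if albedo.ndim == 2 and irradiance.ndim == 1:
        irradiance = irradiance[:, None]
    mask = occlusion_mask[:, None] if albedo.ndim == 2 else occlusion_mask
    return albedo * mask * irradiance

# Toy example: 100 vertices lit by a mostly-ambient gray SH light.
n = np.tile([0.0, 0.0, 1.0], (100, 1))
light = np.array([0.8, 0.0, 0.3, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0])
shaded = shade_vertices(np.full(100, 0.7), n, light, np.linspace(0.5, 1.0, 100))
print(shaded.shape)  # (100,)
```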
Article
We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating one source audio into one randomly chosen video output within a set of speech videos. Instead of learning a highly heterogeneous and nonlinear mapping from audio to the video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
Article
Over the last years, many face analysis tasks have achieved astounding performance, with applications including face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, no existing method can produce render-ready high-resolution 3D faces from "in-the-wild" images, which can be attributed to (a) the scarcity of available data for training, and (b) the lack of robust methodologies that can successfully be applied to very high-resolution data. In this work, we introduce the first method able to reconstruct photorealistic render-ready 3D facial geometry and BRDF from a single "in-the-wild" image. We capture a large dataset of facial shape and reflectance, which we have made public. We define a fast, photorealistic differentiable rendering methodology for facial skin with accurate diffuse and specular reflection, self-occlusion, and a subsurface scattering approximation. With this, we train a network that disentangles the facial diffuse and specular BRDF components from a shape and texture with baked illumination, reconstructed with a state-of-the-art 3DMM fitting method. Our method outperforms existing methods by a significant margin and reconstructs high-resolution 3D faces from even a single low-resolution image, which can be rendered in various applications and bridge the uncanny valley.
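A minimal sketch of why separating the BRDF components matters for rendering: diffuse and specular albedo maps enter the shading model as different terms. Lambert plus Blinn-Phong is used here as a simple stand-in for the paper's facial BRDF; all names and shapes are illustrative.

import numpy as np

def shade(diffuse_albedo, specular_albedo, normals, light_dir, view_dir,
          light_color, shininess=32.0):
    """diffuse_albedo, specular_albedo, normals: (H, W, 3) maps;
    light_dir, view_dir, light_color: (3,) vectors (directions are unit length)."""
    n_dot_l = np.clip((normals * light_dir).sum(-1, keepdims=True), 0.0, None)
    half = light_dir + view_dir
    half = half / np.linalg.norm(half)
    n_dot_h = np.clip((normals * half).sum(-1, keepdims=True), 0.0, None)
    diffuse = diffuse_albedo * n_dot_l                      # matte skin response
    specular = specular_albedo * (n_dot_h ** shininess) * n_dot_l  # oily sheen
    return (diffuse + specular) * light_color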
Article
We present a learning-based method for building driving-signal aware full-body avatars. Our model is a conditional variational autoencoder that can be animated with incomplete driving signals, such as human pose and facial keypoints, and produces a high-quality representation of human geometry and view-dependent appearance. The core intuition behind our method is that better drivability and generalization can be achieved by disentangling the driving signals and remaining generative factors, which are not available during animation. To this end, we explicitly account for information deficiency in the driving signal by introducing a latent space that exclusively captures the remaining information, thus enabling the imputation of the missing factors required during full-body animation, while remaining faithful to the driving signal. We also propose a learnable localized compression for the driving signal which promotes better generalization, and helps minimize the influence of global chance-correlations often found in real datasets. For a given driving signal, the resulting variational model produces a compact space of uncertainty for missing factors that allows for an imputation strategy best suited to a particular application. We demonstrate the efficacy of our approach on the challenging problem of full-body animation for virtual telepresence with driving signals acquired from minimal sensors placed in the environment and mounted on a VR-headset.
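A compact sketch of the conditioning idea (illustrative, not the paper's model): the decoder is conditioned on the driving signal, while a small latent encodes only the information the driving signal lacks, so missing factors can be imputed at animation time. Layer sizes and names are assumptions.

import torch
import torch.nn as nn

class DrivingAwareCVAE(nn.Module):
    def __init__(self, drive_dim=128, latent_dim=16, out_dim=1024, hidden=512):
        super().__init__()
        # Encoder sees the full target plus the driving signal and keeps only
        # the residual (non-driven) factors in a small latent.
        self.enc = nn.Sequential(nn.Linear(out_dim + drive_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim + drive_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
        self.latent_dim = latent_dim

    def forward(self, target, drive):
        mu, logvar = self.enc(torch.cat([target, drive], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(torch.cat([z, drive], -1))
        return recon, mu, logvar

    def animate(self, drive):
        # At animation time the missing factors are imputed, e.g. from the prior mean.
        z = torch.zeros(drive.shape[0], self.latent_dim)
        return self.dec(torch.cat([z, drive], -1))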
Article
We develop a high-resolution face texture generation system that uses artist-provided appearance controls as the conditions for a generative network. Artists can control various elements of the generated textures, such as skin, eye, lip, and hair color. This is made possible by reparameterizing our dataset to a shared UV mapping, allowing us to use image-to-image translation networks. Although our dataset is limited in size, only 126 samples in total, our system is still able to generate realistic face textures that strongly adhere to the input appearance attribute conditions thanks to our training augmentation methods. Once our system has generated the face texture, it is ready for use in a modern game production environment. Thanks to our novel super-resolution and material property recovery methods, our generated face textures are 4K resolution and have the associated material property maps required for ray-traced rendering.
Article
We present a differentiable ray-tracing based novel face reconstruction approach where scene attributes – 3D geometry, reflectance (diffuse, specular, and roughness), pose, camera parameters, and scene illumination – are estimated from unconstrained monocular images. The proposed method models scene illumination via a novel, parameterized virtual light stage which, in conjunction with differentiable ray tracing, introduces a coarse-to-fine optimization formulation for face reconstruction. Our method not only handles unconstrained illumination and self-shadowing conditions, but also estimates diffuse and specular albedos. To estimate the face attributes consistently and with practical semantics, a two-stage optimization strategy systematically uses a subset of parametric attributes, where subsequent attribute estimations factor in those previously estimated. For example, self-shadows estimated during the first stage are later prevented from being baked into the personalized diffuse and specular albedos in the second stage. We show the efficacy of our approach in several real-world scenarios, where face attributes can be estimated even under extreme illumination conditions. Ablation studies, analyses, and comparisons against several recent state-of-the-art methods show the improved accuracy and versatility of our approach. With consistent face attribute reconstruction, our method enables several style – illumination, albedo, self-shadow – editing and transfer applications, as discussed in the paper.
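A minimal sketch of the two-stage idea (not the authors' code): attributes solved in the first stage are frozen before the second stage so that, for example, estimated self-shadows are not baked into the albedo maps. Here `raytrace` is a hypothetical differentiable renderer and the parameter names are illustrative.

import torch

def two_stage_fit(raytrace, image, params,
                  stage1=("shape", "pose", "light"),
                  stage2=("diffuse", "specular", "roughness"),
                  steps=500, lr=5e-3):
    """params: dict mapping attribute name -> tensor with requires_grad=True."""
    for keys in (stage1, stage2):
        opt = torch.optim.Adam([params[k] for k in keys], lr=lr)
        for _ in range(steps):
            pred = raytrace(**params)            # differentiable ray tracer
            loss = (pred - image).abs().mean()   # photometric objective
            opt.zero_grad()
            loss.backward()
            opt.step()
        for k in keys:                           # freeze attributes solved in this stage
            params[k] = params[k].detach()
    return params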
Article
Recently, a lot of attention has been focused on the incorporation of 3D data into face analysis and its applications. Despite providing a more accurate representation of the face, 3D facial images are more complex to acquire than 2D pictures. As a consequence, great effort has been invested in developing systems that reconstruct 3D faces from an uncalibrated 2D image. However, the 3D-from-2D face reconstruction problem is ill-posed, thus prior knowledge is needed to restrict the solution space. In this work, we review 3D face reconstruction methods proposed in the last decade, focusing on those that only use 2D pictures captured under uncontrolled conditions. We present a classification of the proposed methods based on the technique used to add prior knowledge, considering three main strategies, namely, statistical model fitting, photometry, and deep learning, and reviewing each of them separately. In addition, given the relevance of statistical 3D facial models as prior knowledge, we explain the construction procedure and provide a list of the most popular publicly available 3D facial models. After this exhaustive study of 3D-from-2D face reconstruction approaches, we observe that the deep learning strategy has grown rapidly over the last few years, becoming the standard choice in place of the previously widespread statistical model fitting. Unlike the other two strategies, photometry-based methods have decreased in number due to the need for strong underlying assumptions that limit the quality of their reconstructions compared to statistical model fitting and deep learning methods. The review also identifies current challenges and suggests avenues for future research.
Article
In addition to 3D geometry, accurate representation of texture is important when digitizing real objects in virtual worlds. Based on a single consumer RGBD sensor, accurate texture representation for static objects can be realized by fusing multi-frame information; however, extending the process to dynamic objects, which typically have time-varying textures, is difficult. Thus, to address this problem, we propose a compact keyframe-based representation that decouples a dynamic texture into a basic static texture and a set of multiplicative changing maps. With this representation, the proposed method first aligns textures recorded from multiple keyframes with the reconstructed dynamic geometry of the object. Errors in the alignment and geometry are then compensated in an innovative iterative linear optimization framework. With the reconstructed texture, we then employ a scheme to synthesize the dynamic object from arbitrary viewpoints. By considering temporal and local pose similarities jointly, dynamic textures in all keyframes are fused to guarantee high-quality image generation. Experimental results demonstrate that the proposed method handles various dynamic objects, including faces, bodies, cloth, and toys. In addition, qualitative and quantitative comparisons demonstrate that the proposed method outperforms state-of-the-art solutions.
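A minimal sketch of the texture representation described above: the dynamic texture is factored into one static texture and per-keyframe multiplicative changing maps, blended with weights for a given time and viewpoint. How the blend weights are computed (temporal and pose similarity) is left abstract, and all shapes are illustrative.

import numpy as np

def reconstruct_texture(static_tex, changing_maps, weights):
    """static_tex    : (H, W, 3) basic texture shared by all keyframes
    changing_maps : (K, H, W, 3) per-keyframe multiplicative maps
    weights       : (K,) nonnegative blend weights summing to 1
    """
    blended = np.tensordot(weights, changing_maps, axes=1)   # (H, W, 3)
    return static_tex * blended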
Article
Modeling of the human face is a challenging yet important problem in computer graphics. Building accurate muscle models for physics-based simulation of the face is a problem that either requires a lot of manual effort or drastic over-parameterization of the muscles to achieve desirable results. In this work, we reduce the number of parameters required to build personalized muscle models by taking into account the blending of the fine muscles and passive tissue when we solve for the muscle activations. We begin by adapting an anatomical template model to a neutral scan of a subject. Then, we solve an inverse physics problem using several scans simultaneously to solve for both the muscle activations and the geometry matrix representing blending of the muscles. Finally, we demonstrate that this geometry matrix can be used on new, previously unseen scans to solve for only the muscle activations. This greatly reduces the number of parameters that must be solved for compared to previous works while requiring no additional manual effort in constructing the muscles.
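A minimal sketch of the final step described above: given a previously solved geometry (blending) matrix, recover muscle activations for a new scan by least squares. The matrix/vector names are illustrative, and the nonnegativity constraint on activations is an assumption added here for physical plausibility.

import numpy as np
from scipy.optimize import nnls

def solve_activations(G, d):
    """G: (3V, M) geometry matrix mapping M muscle activations to stacked
    vertex displacements; d: (3V,) displacements of the new scan from neutral.
    Solves min ||G a - d|| subject to a >= 0."""
    a, residual = nnls(G, d)
    return a, residual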
Article
Editing of portrait images is a very popular and important research topic with a large variety of applications. For ease of use, control should be provided via a semantically meaningful parameterization that is akin to computer animation controls. The vast majority of existing techniques do not provide such intuitive and fine-grained control, or only enable coarse editing of a single isolated control parameter. Very recently, high-quality semantically controlled editing has been demonstrated, however only on synthetically created StyleGAN images. We present the first approach for embedding real portrait images in the latent space of StyleGAN, which allows for intuitive editing of the head pose, facial expression, and scene illumination in the image. Semantic editing in parameter space is achieved based on StyleRig, a pretrained neural network that maps the control space of a 3D morphable face model to the latent space of the GAN. We design a novel hierarchical non-linear optimization problem to obtain the embedding. An identity preservation energy term allows spatially coherent edits while maintaining facial integrity. Our approach runs at interactive frame rates and thus allows the user to explore the space of possible edits. We evaluate our approach on a wide set of portrait photos, compare it to the current state of the art, and validate the effectiveness of its components in an ablation study.
Chapter
We introduce a novel data-driven approach for taking a single-view noisy depth image as input and inferring a detailed 3D face with per-pixel semantic labels. The critical point of our method is its ability to handle depth completion with varying extents of geometric detail, managing expressive 3D face estimation by exploiting a low-dimensional linear subspace and dense displacement-field-based non-rigid deformations. We devise a deep neural network-based coarse-to-fine 3D face reconstruction and semantic annotation framework to produce high-quality facial geometry while preserving large-scale contexts and semantics. We evaluate the semantic consistency constraint and the generative model for 3D face reconstruction and depth annotation in an extensive series of experiments. The results demonstrate that the proposed approach outperforms the compared methods not only in face reconstruction with high-quality geometric details, but also in semantic annotation performance for segmentation and landmark localization.
Chapter
We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person’s appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system.
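A minimal sketch of the two-layer decomposition described above: a pose-dependent coarse image plus a pose-independent texture warped into the frame. The two networks and the source of the warp field are hypothetical stand-ins.

import torch
import torch.nn.functional as F

def render_head(coarse_net, warp_net, texture, pose):
    """texture: (1, 3, Ht, Wt) pose-independent high-frequency texture."""
    coarse = coarse_net(pose)               # (1, 3, H, W) low-frequency layer
    grid = warp_net(pose)                   # (1, H, W, 2) sampling grid in [-1, 1]
    detail = F.grid_sample(texture, grid, align_corners=False)
    return coarse + detail                  # warped texture added to coarse image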
Article
Full-text available
We present a technique for capturing an actor's live-action performance in such a way that the lighting and reflectance of the actor can be designed and modified in postproduction. Our approach is to illuminate the subject with a sequence of time-multiplexed basis lighting conditions, and to record these conditions with a high-speed video camera so that many conditions are recorded in the span of the desired output frame interval. We investigate several lighting bases for representing the sphere of incident illumination using a set of discrete LED light sources, and we estimate and compensate for subject motion using optical flow and image warping based on a set of tracking frames inserted into the lighting basis. To composite the illuminated performance into a new background, we include a time-multiplexed matte within the basis. We also show that the acquired data enables time-varying surface normals, albedo, and ambient occlusion to be estimated, which can be used to transform the actor's reflectance to produce both subtle and stylistic effects.
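A minimal sketch of the relighting and compositing step (not the paper's pipeline): frames captured under the basis lighting conditions are combined linearly with target-lighting weights, and the time-multiplexed matte composites the result over a new background. Optical-flow motion compensation is omitted, and shapes are illustrative.

import numpy as np

def relight_and_composite(basis_frames, weights, matte, background):
    """basis_frames: (B, H, W, 3) flow-aligned frames, one per lighting basis
    weights     : (B,) lighting coefficients for the target illumination
    matte       : (H, W, 1) alpha matte extracted from the matte basis frame
    background  : (H, W, 3) new background plate
    """
    relit = np.tensordot(weights, basis_frames, axes=1)      # (H, W, 3)
    return matte * relit + (1.0 - matte) * background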
Article
Full-text available
"The Human Face Project" is a short film documenting an effort at Walt Disney Feature Animation to track and animate human facial performance, which was shown in the SIGGRAPH 2001 Electronic Theater. This short paper outlines the techniques developed in this project, and demonstrated in that film.The face tracking system we developed is exemplary of model-based computer vision, and exploits the detailed degrees of freedom of a geometric face model to confine the space of solutions. Optical flow and successive rerendering of the model are employed in an optimization loop to converge on model parameter estimates. The structure of the model permits very principled mapping of estimated expressions to different targets.Of critical importance in media applications is the handling of details beyond the resolution or degrees of freedom of the tracking model. We describe behavioral modeling expedients for realizing these details in a plausible way in resynthesis.
Conference Paper
Full-text available
We estimate surface normal maps of an object from either its diffuse or specular reflectance using four spherical gradient illumination patterns. In contrast to traditional photometric stereo, the spherical patterns allow normals to be estimated simultaneously from any number of viewpoints. We present two polarized lighting techniques that allow the diffuse and specular normal maps of an object to be measured independently. For scattering materials, we show that the specular normal maps yield the best record of detailed surface shape while the diffuse normals deviate from the true surface normal due to subsurface scattering, and that this effect is dependent on wavelength. We show several applications of this acquisition technique. First, we capture normal maps of a facial performance simultaneously from several viewing positions using time-multiplexed illumination. Second, we show that high-resolution normal maps based on the specular component can be used with structured light 3D scanning to quickly acquire high-resolution facial surface geometry using off-the-shelf digital still cameras. Finally, we present a real-time shading model that uses independently estimated normal maps for the specular and diffuse color channels to reproduce some of the perceptually important effects of subsurface scattering.
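A minimal sketch of normal recovery from gradient illumination (assuming a Lambertian response and omitting the exact calibration constants): the ratio of each gradient-lit image to the full-on image is an affine function of the corresponding normal component, so the normal direction can be recovered up to normalization.

import numpy as np

def gradient_normals(img_x, img_y, img_z, img_full, eps=1e-6):
    """Each input is an (H, W) luminance image under the x, y, z gradient
    patterns and the constant full-on pattern; returns (H, W, 3) unit normals."""
    ratios = np.stack([img_x, img_y, img_z], axis=-1) / (img_full[..., None] + eps)
    n = 2.0 * ratios - 1.0                      # map [0, 1] ratios to [-1, 1]
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)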
Conference Paper
Full-text available
We have created a system for capturing both the three-dimensional geometry and the color and shading information for human facial expressions. We use this data to reconstruct photorealistic, 3D animations of the captured expressions. The system uses a large set of sampling points on the face to accurately track the three-dimensional deformations of the face. Simultaneously with the tracking of the geometric data, we capture multiple high-resolution, registered video images of the face. These images are used to create a texture map sequence for a three-dimensional polygonal face model which can then be rendered on standard 3D graphics hardware. The resulting facial animation is surprisingly life-like and looks very much like the original live performance. Separating the capture of the geometry from the texture images eliminates much of the variance in the image data due to motion, which increases compression ratios. Although the primary emphasis of our work is not compression, we have investigated the use of a novel method to compress the geometric data based on principal components analysis. The texture sequence is compressed using an MPEG4 video codec. Animations reconstructed from 512x512 pixel textures look good at data rates as low as 240 Kbits per second.
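A minimal sketch of the principal-components compression mentioned above: per-frame vertex positions are stacked into a matrix, the top components are kept, and only the mean, the basis, and the per-frame coefficients are stored. The component count is an illustrative choice.

import numpy as np

def pca_compress(frames, k=32):
    """frames: (T, 3V) flattened vertex positions per captured frame."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Economy SVD: rows of Vt are principal directions in vertex space.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:k]                      # (k, 3V) retained components
    coeffs = centered @ basis.T         # (T, k) per-frame coefficients
    return mean, basis, coeffs

def pca_decompress(mean, basis, coeffs):
    return coeffs @ basis + mean        # (T, 3V) reconstructed geometry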
Conference Paper
Full-text available
We present a method to acquire the reflectance field of a human face and use these measurements to render the face under arbitrary changes in lighting and viewpoint. We first acquire images of the face from a small set of viewpoints under a dense sampling of incident illumination directions using a light stage. We then construct a reflectance function image for each observed image pixel from its values over the space of illumination directions. From the reflectance functions, we can directly generate images of the face from the original viewpoints in any form of sampled or computed illumination. To change the viewpoint, we use a model of skin reflectance to estimate the appearance of the reflectance functions for novel viewpoints. We demonstrate the technique with synthetic renderings of a person's face under novel illumination and viewpoints.
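A minimal sketch of image-based relighting with a reflectance field: each pixel's reflectance function (its appearance under every light-stage direction) is weighted by the novel illumination sampled in those same directions and summed. Array shapes are illustrative.

import numpy as np

def relight(reflectance_field, illumination):
    """reflectance_field: (L, H, W, 3) images, one per light-stage direction
    illumination      : (L, 3) RGB intensity of the novel lighting per direction
    """
    # Sum over lighting directions: out[h, w, c] = sum_l R[l, h, w, c] * I[l, c]
    return np.einsum('lhwc,lc->hwc', reflectance_field, illumination)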
Conference Paper
We present a method that uses measured scene radiance and global illumination in order to add new objects to light-based models with correct lighting. The method uses a high dynamic range image-based model of the scene, rather than synthetic light sources, to illuminate the new objects. To compute the illumination, the scene is considered as three components: the distant scene, the local scene, and the synthetic objects. The distant scene is assumed to be photometrically unaffected by the objects, obviating the need for reflectance model information. The local scene is endowed with estimated reflectance model information so that it can catch shadows and receive reflected light from the new objects. Renderings are created with a standard global illumination method by simulating the interaction of light amongst the three components. A differential rendering technique allows for good results to be obtained when only an estimate of the local scene reflectance properties is known. We apply the general method to the problem of rendering synthetic objects into real scenes. The light-based model is constructed from an approximate geometric model of the scene and by using a light probe to measure the incident illumination at the location of the synthetic objects. The global illumination solution is then composited into a photograph of the scene using the differential rendering technique. We conclude by discussing the relevance of the technique to recovering surface reflectance properties in uncontrolled lighting situations. Applications of the method include visual effects, interior design, and architectural visualization.
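A minimal sketch of the differential rendering step described above: the change that the synthetic objects induce in a rendering of the local scene is added back to the original photograph, so errors in the estimated local reflectance largely cancel. Array names are illustrative.

import numpy as np

def differential_composite(photo, render_with_objects, render_without_objects,
                           object_mask):
    """All image arguments are (H, W, 3); object_mask is (H, W, 1), 1 on pixels
    covered directly by the synthetic objects."""
    delta = render_with_objects - render_without_objects    # shadows, reflections
    composited = photo + delta                               # add lighting changes
    # Pixels of the objects themselves come straight from the rendering.
    return object_mask * render_with_objects + (1.0 - object_mask) * composited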
Conference Paper
We present a technique for creating an animatable image-based appearance model of a human face, able to capture appearance variation over changing facial expression, head pose, view direction, and lighting condition. Our capture process makes use of a specialized lighting apparatus designed to rapidly illuminate the subject sequentially from many different directions in just a few seconds. For each pose, the subject remains still while six video cameras capture their appearance under each of the directions of lighting. We repeat this process for approximately 60 different poses, capturing different expressions, visemes, head poses, and eye positions. The images for each of the poses and camera views are registered to each other semi-automatically with the help of fiducial markers. The result is a model which can be rendered realistically under any linear blend of the captured poses and under any desired lighting condition by warping, scaling, and blending data from the original images. Finally, we show how to drive the model with performance capture data, where the pose is not necessarily a linear combination of the original captured poses.
Article
Roboticists believe that people will have an unpleasant impression of a humanoid robot that has an almost, but not perfectly, realistic human appearance. This is called the uncanny valley, and is not limited to robots, but is also applicable to any type of human-like object, such as dolls, masks, facial caricatures, avatars in virtual reality, and characters in computer graphics movies. The present study investigated the uncanny valley by measuring observers' impressions of facial images whose degree of realism was manipulated by morphing between artificial and real human faces. Facial images yielded the most unpleasant impressions when they were highly realistic, supporting the hypothesis of the uncanny valley. However, the uncanny valley was confirmed only when morphed faces had abnormal features such as bizarre eyes. These results suggest that to have an almost perfectly realistic human appearance is a necessary but not a sufficient condition for the uncanny valley. The uncanny valley emerges only when there is also an abnormal feature.
Article
This paper introduces a simple model for subsurface light transport in translucent materials. The model enables efficient simulation of effects that BRDF models cannot capture, such as color bleeding within materials and diffusion of light across shadow boundaries. The technique is efficient even for anisotropic, highly scattering media that are expensive to simulate using existing methods. The model combines an exact solution for single scattering with a dipole point source diffusion approximation for multiple scattering. We also have designed a new, rapid image-based measurement technique for determining the optical properties of translucent materials. We validate the model by comparing predicted and measured values and show how the technique can be used to recover the optical properties of a variety of materials, including milk, marble, and skin. Finally, we describe sampling techniques that allow the model to be used within a conventional ray tracer.
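A minimal sketch of the multiple-scattering term described above: the dipole diffusion approximation to diffuse reflectance R_d(r) as a function of the distance r between the points of incidence and exit, given the absorption coefficient, the reduced scattering coefficient, and the relative index of refraction. The single-scattering term and the image-based measurement procedure are omitted.

import numpy as np

def dipole_rd(r, sigma_a, sigma_s_prime, eta=1.3):
    sigma_t_prime = sigma_a + sigma_s_prime          # reduced extinction
    alpha_prime = sigma_s_prime / sigma_t_prime      # reduced albedo
    sigma_tr = np.sqrt(3.0 * sigma_a * sigma_t_prime)
    # Diffuse Fresnel reflectance approximation and internal reflection term.
    f_dr = -1.440 / eta**2 + 0.710 / eta + 0.668 + 0.0636 * eta
    A = (1.0 + f_dr) / (1.0 - f_dr)
    z_r = 1.0 / sigma_t_prime                        # depth of the real source
    z_v = z_r * (1.0 + 4.0 * A / 3.0)                # depth of the mirrored virtual source
    d_r = np.sqrt(r**2 + z_r**2)
    d_v = np.sqrt(r**2 + z_v**2)
    return alpha_prime / (4.0 * np.pi) * (
        z_r * (sigma_tr * d_r + 1.0) * np.exp(-sigma_tr * d_r) / d_r**3 +
        z_v * (sigma_tr * d_v + 1.0) * np.exp(-sigma_tr * d_v) / d_v**3)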
The Digital Eye: Image Metrics Attempts to Leap the Uncanny Valley
  • Plantec