Article

The Digital Emily project: photoreal facial modeling and animation

Authors:
  • Image Metrics

Abstract

This course describes how high-resolution face scanning, advanced character rigging, and performance-driven facial animation were combined to create Digital Emily, a believably photorealistic digital actor. Actress Emily O'Brien was scanned in the USC ICT light stage in 35 different facial poses using a new high-resolution face-scanning process capable of capturing geometry and textures down to the level of skin pores and fine wrinkles. These scans were assembled into a rigged digital character, which could then be driven by Image Metrics video-based facial animation technology. The real Emily was captured speaking on a small set, and her movements were used to drive a complete digital face replacement of her character, including its diffuse, specular, and animated displacement maps. HDRI lighting reconstruction techniques were used to reproduce the lighting on her original performance. The most recent results show new real-time animation and rendering research for the Digital Emily character.
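The pipeline summarized above (expression scans assembled into a rig, then driven frame by frame by video-based performance capture) is, at its core, a blendshape evaluation problem. The sketch below illustrates the generic arithmetic in Python/NumPy: a neutral mesh plus a weighted sum of per-expression offsets, driven by per-frame weight vectors. All array names, sizes, and random data are placeholders for illustration, not the actual Digital Emily rig or solver.

```python
import numpy as np

# Hypothetical rig data: a neutral scan and K expression scans,
# each with V vertices in 3D (all in vertex correspondence).
V, K = 5000, 33                        # illustrative sizes only
neutral = np.random.rand(V, 3)         # placeholder for the neutral-pose mesh
expressions = np.random.rand(K, V, 3)  # placeholder for the expression scans

# Delta blendshapes: offsets of each expression scan from the neutral pose.
deltas = expressions - neutral[None, :, :]

def evaluate_rig(weights):
    """Linear blendshape evaluation: neutral + weighted sum of deltas.

    `weights` is a length-K vector, typically produced per frame by a
    performance-driven solver (e.g., video-based facial tracking).
    """
    weights = np.asarray(weights)
    return neutral + np.tensordot(weights, deltas, axes=1)

# Drive the rig with (placeholder) per-frame weight vectors.
frame_weights = np.random.rand(10, K)
animated_frames = [evaluate_rig(w) for w in frame_weights]
print(animated_frames[0].shape)  # (V, 3): deformed mesh for frame 0
```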


... The results obtained were highly correlated with the complexity of the capture equipment. Although complicated setups such as camera matrices [1,2] and light stages [3,4] can provide high fidelity and accurate results, they are exceptionally expensive and difficult to deploy. Furthermore, some authors have tried to utilize off-the-shelf products such as consumer cameras [5,6] and Microsoft Kinect [7] to obtain plausible outcomes. ...
... Multiview stereo vision: Capturing methods based on multiview stereo vision involve two or more cameras. To solve the correspondence problem in multiview stereo vision, these methods often place markers [3, 15-17] onto the human face, or use projectors [2, 18, 19] to project a specified pattern onto the human face. The well-known Digital Emily Project [3, 20] utilized the Light Stage 5 [21], which is a setup composed of 156 LED lights and several high-resolution digital cameras, to generate 37 high-resolution 3D facial models from 90 min of scanning data. ...
... To solve the correspondence problem in multiview stereo vision, these methods often place markers [3, 15-17] onto the human face, or use projectors [2, 18, 19] to project a specified pattern onto the human face. The well-known Digital Emily Project [3, 20] utilized the Light Stage 5 [21], which is a setup composed of 156 LED lights and several high-resolution digital cameras, to generate 37 high-resolution 3D facial models from 90 min of scanning data. In [18], the authors proposed a novel spacetime stereo algorithm to compute depth from video sequences using synchronized video cameras and SL projectors. ...
Article
The acquisition of 3D facial models is crucial in the gaming and film industries. In this study, we developed a facial acquisition system based on infrared structured light sensors to obtain facial-expression models with high fidelity and accuracy. First, we employed time-multiplexing structured light to obtain an accurate and dense point cloud. The template model was then warped to the captured facial expression. A model-tracking method based on optical flow was applied to track the registered 3D model displacement. Finally, the live 3D mesh was textured using high-resolution images captured by three color cameras. Experiments were conducted on human faces to demonstrate the performance of the proposed system and methods.
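The correspondence-and-triangulation step that the multiview-stereo snippets above refer to can be illustrated with standard linear (DLT) triangulation of one matched point seen by two calibrated cameras. This is a generic textbook sketch on made-up projection matrices, not the reconstruction algorithm of the system described in the abstract.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views via the linear DLT method.

    P1, P2 : 3x4 camera projection matrices.
    x1, x2 : matched pixel coordinates (u, v) in each image.
    Returns the 3D point in non-homogeneous coordinates.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Illustrative cameras: identity view and a view translated along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0, 1.0])
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate_dlt(P1, P2, x1, x2))  # ~[0.2, -0.1, 4.0]
```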
... Recovering detailed 3D properties of a person's face traditionally requires expensive capture setups with multiple cameras and lights (Alexander et al., 2009;Beeler et al., 2010). Additionally, manual interventions by artists are required for building controllable rigs suitable for editing (Lewis et al., 2014b). ...
... The linear PCA bases E_s ∈ ℝ^(3N×80) and E_e ∈ ℝ^(3N×64) encode the modes with the highest shape and expression variation, respectively. The expression basis is obtained by applying PCA to the combined set of blendshapes of Alexander et al. (2009) and Cao et al. (2013), which have been re-targeted to the face topology of Blanz and Vetter (1999) using deformation transfer (Sumner and Popović, 2004). The PCA basis covers more than 99% of the variance of the original blendshapes. ...
... The subspace of expression variations is spanned by the vectors {b_k^g} for k = m_s+1, ..., m_s+m_e. These vectors were created using PCA of a subset of blendshapes from the datasets of Alexander et al. (2009) and Cao et al. (2013). Note that these blendshapes have been transferred to the used topology using deformation transfer (Sumner and Popović, 2004). ...
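The linear model quoted above (a mean face plus shape and expression PCA bases) reduces to a few matrix-vector products. The sketch below spells that out; the basis matrices are random placeholders standing in for E_s and E_e, and only the arithmetic of the reconstruction is illustrated.

```python
import numpy as np

N = 2000                 # number of vertices (illustrative)
n_shape, n_expr = 80, 64 # basis sizes as in the quoted formulation

mean_face = np.zeros(3 * N)            # placeholder mean geometry
E_s = np.random.randn(3 * N, n_shape)  # placeholder shape basis
E_e = np.random.randn(3 * N, n_expr)   # placeholder expression basis

def reconstruct(alpha, delta):
    """Linear 3DMM geometry: mean + shape basis * alpha + expression basis * delta."""
    return mean_face + E_s @ alpha + E_e @ delta

# A face sampled with small random shape/expression coefficients.
alpha = 0.1 * np.random.randn(n_shape)
delta = 0.1 * np.random.randn(n_expr)
vertices = reconstruct(alpha, delta).reshape(N, 3)
print(vertices.shape)  # (N, 3)
```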
... A 3D morphable face model (3DMM) produces vector space representations that capture various facial attributes such as shape, expression and pose [6,4,8,15,16]. Although the previous 3DMM methods [6,4,8] have limitations in estimating face texture and lighting conditions accurately, recent methods [15,16] overcome these limitations. ...
... A 3D morphable face model (3DMM) produces vector space representations that capture various facial attributes such as shape, expression and pose [6,4,8,15,16]. Although the previous 3DMM methods [6,4,8] have limitations in estimating face texture and lighting conditions accurately, recent methods [15,16] overcome these limitations. We utilize the state-of-the-art 3DMM [16] to effectively capture the various facial attributes and supervise our model. ...
Preprint
Face swapping is a task that changes a facial identity of a given image to that of another person. In this work, we propose a novel face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). The face-swapping model should achieve two goals. First, it should be able to generate a high-quality image. We argue that a model which is proficient in generating a megapixel image can achieve this goal. However, generating a megapixel image is generally difficult without careful model design. Therefore, our model exploits pretrained StyleGAN in the manner of GAN-inversion to effectively generate a megapixel image. Second, it should be able to effectively transform the identity of a given image. Specifically, it should be able to actively transform ID attributes (e.g., face shape and eyes) of a given image into those of another person, while preserving ID-irrelevant attributes (e.g., pose and expression). To achieve this goal, we exploit 3DMM that can capture various facial attributes. Specifically, we explicitly supervise our model to generate a face-swapped image with the desirable attributes using 3DMM. We show that our model achieves state-of-the-art performance through extensive experiments. Furthermore, we propose a new operation called ID mixing, which creates a new identity by semantically mixing the identities of several people. It allows the user to customize the new identity.
... The richness of cues that the human face encodes (Jack et al., 2015) points to the indispensability of having photorealistic IVA faces. In recent years, solutions for 3D face digitization, such as 3D scanning (Alexander et al., 2009), have emerged and are becoming increasingly more affordable (Straub and Kerlin, 2014), as well as machine learning methods to make face digitization scalable and less cumbersome (Yamaguchi et al., 2018). The focus on realism of IVAs is not surprising as it ties into the desire to create artificial entities that simulate life and can evoke appropriate responses from humans (Stacey and Suchman, 2012). ...
... In four studies (three psychological experiments and one computational study), we investigated which specific features of the face affect whether an IVA is assessed as human-like. In Experiment 1, we investigated the extent to which the facial images of IVAs developed using 3D scanning (Alexander et al., 2009; Seymour et al., 2017) and other advanced techniques, such as deep neural networks (Nagano et al., 2018; Saito et al., 2017), are rated human-like. That is, we first aimed to ascertain whether the IVA faces were perceived on a par with photographs of actual humans, and whether the gender of IVAs played a role in the perception of human-likeness. ...
Article
Full-text available
Despite advancements in computer graphics and artificial intelligence, it remains unclear which aspects of intelligent virtual agents (IVAs) make them identifiable as human-like agents. In three experiments and a computational study, we investigated which specific facial features in static IVAs contribute to judging them human-like. In Experiment 1, participants were presented with facial images of state-of-the-art IVAs and humans and asked to rate these stimuli on human-likeness. The results showed that IVAs were judged less human-like compared to photographic images of humans, which led to the hypothesis that the discrepancy in human-likeness was driven by skin and eye reflectance. A follow-up computational analysis confirmed this hypothesis, showing that the faces of IVAs had smoother skin and a reduced number of corneal reflections than human faces. In Experiment 2, we validated these findings by systematically manipulating the appearance of skin and eyes in a set of human photographs, including both female and male faces as well as four different races. Participants indicated as quickly as possible whether the image depicted a real human face or not. The results showed that smoothening the skin and removing corneal reflections affected the perception of human-likeness when quick perceptual decisions needed to be made. Finally, in Experiment 3, we combined the images of IVA faces and those of humans, unaltered and altered, and asked participants to rate them on human-likeness. The results confirmed the causal role of both features for attributing human-likeness. Both skin and eye reflectance worked in tandem in driving judgements regarding the extent to which the face was perceived human-like in both IVAs and humans. These findings are of relevance to computer graphics artists and psychology researchers alike in drawing attention to those facial characteristics that increase realism in IVAs.
... the nose. Then, for p being the chin landmark in 3D space, we define a virtual plane that goes through the point p + ½v and has the normal v. We only keep the side of the plane that contains the facial landmarks. ...
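The cropping rule quoted above (keep only the half-space, bounded by the plane through p + ½v with normal v, that contains the facial landmarks) is a simple signed-distance test. The sketch below is a generic reimplementation of that idea on made-up data, not the authors' code.

```python
import numpy as np

def keep_facial_side(points, p, v, landmarks):
    """Keep points on the same side of the plane as the facial landmarks.

    The plane passes through p + 0.5 * v and has normal v, following the
    construction quoted above (p: chin landmark, v: a reference direction).
    """
    origin = p + 0.5 * v
    # The sign of the landmark centroid decides which half-space to keep.
    keep_sign = np.sign((landmarks.mean(axis=0) - origin) @ v)
    side = np.sign((points - origin) @ v)
    return points[side == keep_sign]

# Illustrative data: a random point cloud, a chin point, and an upward direction.
cloud = np.random.randn(10000, 3)
chin = np.array([0.0, -1.0, 0.0])
v = np.array([0.0, 1.0, 0.0])            # roughly chin-to-forehead
landmarks = np.random.rand(68, 3) * 0.2  # placeholder facial landmarks near origin
cropped = keep_facial_side(cloud, chin, v, landmarks)
print(len(cropped), "of", len(cloud), "points kept")
```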
... We combine the BFM [38] identity geometry and appearance models with the expression model used in Tewari et al. [52], and use this to fit to our test scans. The expression model is a combination of two blendshape models [1,9]. Since BFM also includes color, we jointly optimize to minimize the geometry as well as color alignment errors. ...
Preprint
We present the first deep implicit 3D morphable model (i3DMM) of full heads. Unlike earlier morphable face models it not only captures identity-specific geometry, texture, and expressions of the frontal face, but also models the entire head, including hair. We collect a new dataset consisting of 64 people with different expressions and hairstyles to train i3DMM. Our approach has the following favorable properties: (i) It is the first full head morphable model that includes hair. (ii) In contrast to mesh-based models it can be trained on merely rigidly aligned scans, without requiring difficult non-rigid registration. (iii) We design a novel architecture to decouple the shape model into an implicit reference shape and a deformation of this reference shape. With that, dense correspondences between shapes can be learned implicitly. (iv) This architecture allows us to semantically disentangle the geometry and color components, as color is learned in the reference space. Geometry is further disentangled as identity, expressions, and hairstyle, while color is disentangled as identity and hairstyle components. We show the merits of i3DMM using ablation studies, comparisons to state-of-the-art models, and applications such as semantic head editing and texture transfer. We will make our model publicly available.
... To render a realistic human face, high-quality geometry and reflectance data are essential. There exists specialized hardware like the Light Stage [Alexander et al. 2009] for high-fidelity 3D face capturing and reconstruction in the movie industry, but it is cumbersome to use for consumers. Research efforts have been dedicated to consumer-friendly solutions, trying to create 3D faces with consumer cameras, e.g., RGB-D data [Thies et al. 2015; Zollhöfer et al. 2011], multiview images [Ichim et al. 2015], or even a single image [Lattas et al. 2020; Yamaguchi et al. 2018]. ...
... Creating high-fidelity, realistic digital human characters commonly relies on specialized hardware [Alexander et al. 2009; Beeler et al. 2010; Debevec et al. 2000] and tedious artist labor like model editing and rigging [von der Pahlen et al. 2014]. Several recent works seek to create realistic 3D avatars with consumer devices like a smartphone, using domain-specific reconstruction approaches (i.e., with face shape/appearance priors) [Ichim et al. 2015; Lattas et al. 2020; Yamaguchi et al. 2018]. ...
Preprint
Full-text available
We present a fully automatic system that can produce high-fidelity, photo-realistic 3D digital human characters with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head, and can produce a high-quality reconstruction in less than 30 seconds. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state of the art. Specifically, given the input video, a two-stage frame selection algorithm is first employed to select a few high-quality frames for reconstruction. A novel, differentiable-renderer-based 3D Morphable Model (3DMM) fitting method is then applied to recover facial geometries from multiview RGB-D data, which takes advantage of extensive data generation and perturbation. Our 3DMM has much larger expressive capacities than conventional 3DMM, allowing us to recover more accurate facial geometry using merely linear bases. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and CNNs to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D characters with extremely realistic details. Code and the constructed 3DMM are publicly available.
... First, we compare to Face2Face [33] in Fig. 9 (a). Face2Face [33] transfers facial expressions inferred by a 3D morphable face model [64], [65], [66], retrieves mouth texture based on the facial expression, and renders the talking face. Our method produces competitive results with realistic texture and mouth movement, suggesting that our Style Translation Network learns accurate mouth movement from the input audio. ...
Preprint
Full-text available
Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. There are two unique challenges in designing a method for UGC: 1) the appearances of speakers are diverse and arbitrary, as the method needs to generalize across users; 2) the available video data of one speaker are very limited. In order to tackle the above challenges, we first introduce a new Style Translation Network to integrate the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module. It enables our model to quickly adapt to a new speaker. Then, we further develop a semi-parametric video renderer, which takes full advantage of the limited training data of the unseen speaker via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer, generating more continuous videos. Extensive experiments show that our method generates videos that accurately preserve various speaking styles, yet with a considerably lower amount of training data and training time in comparison to existing methods. In addition, our method achieves a faster testing speed than most recent methods.
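The cross-modal AdaIN module mentioned in the abstract builds on adaptive instance normalization, which re-normalizes content features to match the channel-wise statistics of a style code (here, a speaker's style). Below is a generic NumPy sketch of the AdaIN operation itself on placeholder feature maps; the cross-modal wiring of the Style Translation Network is not reproduced.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (C, H, W) feature maps.

    Content features are normalized per channel, then rescaled and shifted
    to take on the per-channel mean and standard deviation of the style features.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Placeholder feature maps: e.g. content from the source speech branch,
# style from the target speaker's reference frames.
content = np.random.randn(64, 32, 32)
style = np.random.randn(64, 32, 32) * 2.0 + 1.0
out = adain(content, style)
print(out.mean(axis=(1, 2))[:3], out.std(axis=(1, 2))[:3])  # ~style statistics
```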
... Both stages are computationally expensive, each taking several minutes per scan. For professional captures, both steps are augmented with manual clean-up to enhance the quality of the output meshes [3,67]. Such manual editing is infeasible for large-scale captures (≫ 10K scans). ...
Preprint
Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow, and commonly address the problem in two separate steps; multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.
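The "geometric loss commonly used for surface registration" mentioned above can be illustrated with a symmetric chamfer-style distance between predicted points and raw scan points. The brute-force sketch below on toy data conveys the idea only; it is not TEMPEH's actual loss or implementation.

```python
import numpy as np

def chamfer_distance(pred_pts, scan_pts):
    """Symmetric chamfer distance between two point sets (brute force).

    For each predicted point, the distance to its nearest scan point, and
    vice versa; the two averages are summed. Commonly used as a geometric
    registration loss between a predicted mesh and a raw scan.
    """
    d = np.linalg.norm(pred_pts[:, None, :] - scan_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Toy example: a slightly perturbed copy of a point set scores close to zero.
scan = np.random.rand(500, 3)
pred = scan + 0.001 * np.random.randn(500, 3)
print(chamfer_distance(pred, scan))
```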
... Codec Avatars. Traditional methods for photorealistic human face modeling [2,39] rely on accurate but complex 3D reconstruction processes, which are not suitable for real-time applications. To enable photorealistic telepresence, [30] uses a deep appearance model in a data-driven manner, which has been dubbed a Codec Avatar. ...
Preprint
Real-time and robust photorealistic avatars have long been desired for enabling immersive telepresence in AR/VR. However, there still exists one key bottleneck: the considerable computational expense needed to accurately infer facial expressions captured from headset-mounted cameras with a quality level that can match the realism of the avatar's human appearance. To this end, we propose a framework called Auto-CARD, which for the first time enables real-time and robust driving of Codec Avatars using only on-device computing resources. This is achieved by minimizing two sources of redundancy. First, we develop a dedicated neural architecture search technique called AVE-NAS for avatar encoding in AR/VR, which explicitly boosts both the searched architectures' robustness in the presence of extreme facial expressions and hardware friendliness on fast-evolving AR/VR headsets. Second, we leverage the temporal redundancy in consecutively captured images during continuous rendering and develop a mechanism dubbed LATEX to skip the computation of redundant frames. Specifically, we first identify an opportunity from the linearity of the latent space derived by the avatar decoder and then propose to perform adaptive latent extrapolation for redundant frames. For evaluation, we demonstrate the efficacy of our Auto-CARD framework in real-time Codec Avatar driving settings, where we achieve a 5.05x speed-up on Meta Quest 2 while maintaining a comparable or even better animation quality than state-of-the-art avatar encoder designs.
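The adaptive latent extrapolation idea described above exploits the approximate linearity of the decoder's latent space: for a frame judged redundant, the latent code is extrapolated from previous codes instead of re-running the encoder. The sketch below shows that scheduling logic with a placeholder encoder and a fixed skipping rule; both are illustrative assumptions, not Auto-CARD's actual mechanism.

```python
import numpy as np

def extrapolate_latent(z_prev2, z_prev, alpha=1.0):
    """Linearly extrapolate the next latent code from the two previous ones."""
    return z_prev + alpha * (z_prev - z_prev2)

def drive_avatar(frames, encode, skip_every=2):
    """Encode some frames, extrapolate the rest (placeholder skipping rule)."""
    latents = []
    for t, frame in enumerate(frames):
        if t >= 2 and t % skip_every == 0:
            z = extrapolate_latent(latents[-2], latents[-1])  # skip the encoder
        else:
            z = encode(frame)                                  # run the full encoder
        latents.append(z)
    return latents

# Toy demo: a latent trajectory that really is linear is recovered exactly.
true_traj = [np.array([t, 2.0 * t]) for t in range(8)]
encode = lambda frame: frame                 # identity encoder for the toy example
latents = drive_avatar(true_traj, encode)
print(np.allclose(latents, true_traj))       # True for a linear trajectory
```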
... However, creating these believable 3D human characters is not for anyone. Early attempts [Alexander et al. 2009;Debevec et al. 2000] generally require expensive apparatus and immense artistic expertise, and hence are limited to celebrities for feature film productions. It has been a long journey for the graphics community to democratize the accessible use of 3D facial assets to the mass crowd, equipped with powerful neural generative techniques, from variational autoencoders (VAEs) [Kingma and Welling 2014], generative adversarial networks (GANs) [Goodfellow et al. 2020] to the latest Diffusion Models [Ho et al. 2020]. ...
Preprint
Full-text available
Emerging Metaverse applications demand accessible, accurate, and easy-to-use tools for 3D digital human creations in order to depict different cultures and societies as if in the physical world. Recent large-scale vision-language advances pave the way for novices to conveniently customize 3D content. However, the generated CG-friendly assets still cannot represent the desired facial traits for human characteristics. In this paper, we present DreamFace, a progressive scheme to generate personalized 3D faces under text guidance. It enables layman users to naturally customize 3D facial assets that are compatible with CG pipelines, with desired shapes, textures, and fine-grained animation capabilities. From a text input describing the facial traits, we first introduce a coarse-to-fine scheme to generate the neutral facial geometry with a unified topology. We employ a selection strategy in the CLIP embedding space, and subsequently optimize both the detail displacements and normals using Score Distillation Sampling from a generic Latent Diffusion Model. Then, for neutral appearance generation, we introduce a dual-path mechanism, which combines the generic LDM with a novel texture LDM to ensure both the diversity and textural specification in the UV space. We also employ a two-stage optimization to perform SDS in both the latent and image spaces to provide compact priors for fine-grained synthesis. Our generated neutral assets naturally support blendshapes-based facial animations. We further improve the animation ability with personalized deformation characteristics by learning the universal expression prior using the cross-identity hypernetwork. Notably, DreamFace can generate realistic 3D facial assets with physically-based rendering quality and rich animation ability from video footage, even for fashion icons or exotic characters in cartoons and fiction movies.
... IVAs are embodied virtual characters that can interact with humans using verbal, para-verbal, and nonverbal behaviors (Lugrin, 2021). With advances in computer graphics (Alexander et al., 2009), the faces of IVAs have become photorealistic (Seymour et al., 2017). The growing prevalence of using IVAs as stimuli in research (Kätsyri et al., 2020) and the interest to use them in different use cases, such as e-commerce (Etemad-Sajadi, 2016), healthcare (Robinson et al., 2014), and education (Belpaeme et al., 2018) highlight the importance of investigating the similarities between processing IVA faces, henceforth, virtual faces and natural human faces. ...
Article
Full-text available
Virtual faces have been found to be rated less human-like and remembered worse than photographic images of humans. What it is in virtual faces that yields reduced memory has so far remained unclear. The current study investigated face memory in the context of virtual agent faces and human faces, real and manipulated, considering two factors of predicted influence, i.e., corneal reflections and skin contrast. Corneal reflections referred to the bright points in each eye that occur when the ambient light reflects from the surface of the cornea. Skin contrast referred to the degree to which skin surface is rough versus smooth. We conducted two memory experiments, one with high-quality virtual agent faces (Experiment 1) and the other with the photographs of human faces that were manipulated (Experiment 2). Experiment 1 showed better memory for virtual faces with increased corneal reflections and skin contrast (rougher rather than smoother skin). Experiment 2 replicated these findings, showing that removing the corneal reflections and smoothening the skin reduced memory recognition of manipulated faces, with a stronger effect exerted by the eyes than the skin. This study highlights specific features of the eyes and skin that can help explain memory discrepancies between real and virtual faces and in turn elucidates the factors that play a role in the cognitive processing of faces.
... Numerous methods have been proposed for generating a realistic avatar of a subject. Some of them [1, 2, 21, 24, 40, 48, 55, 60, 61, 72-74, 80] model a 3D avatar of a subject and then drive its animation using a video sequence or speech text. However, a dedicated device setup and heavy manual work are always needed for generating a realistic avatar and reconstructing the detailed appearance, subtle expressions, and gaze movement of a subject. ...
Preprint
The VirtualCube system is a 3D video conference system that attempts to overcome some limitations of conventional technologies. The key ingredient is VirtualCube, an abstract representation of a real-world cubicle instrumented with RGBD cameras for capturing the 3D geometry and texture of a user. We design VirtualCube so that the task of data capturing is standardized and significantly simplified, and everything can be built using off-the-shelf hardware. We use VirtualCubes as the basic building blocks of a virtual conferencing environment, and we provide each VirtualCube user with a surrounding display showing life-size videos of remote participants. To achieve real-time rendering of remote participants, we develop the V-Cube View algorithm, which uses multi-view stereo for more accurate depth estimation and Lumi-Net rendering for better rendering quality. The VirtualCube system correctly preserves the mutual eye gaze between participants, allowing them to establish eye contact and be aware of who is visually paying attention to them. The system also allows a participant to have side discussions with remote participants as if they were in the same room. Finally, the system sheds light on how to support the shared space of work items (e.g., documents and applications) and track the visual attention of participants to work items.
... This technique optimizes the facial deformations by distributing the geometry as muscle lines. There is no standard for defining the rig system interface; in our case, we have chosen an interface based on a 3D view with 2D handlers, similar to the solutions proposed by Alexander et al. [28] and Digital Tutors [29]. With these approximations, we have exhaustive control over the geometric deformation that performs the mesh simulation of facial muscles, as well as an intuitive tool to easily perform all the facial actions. ...
Article
Full-text available
Laughter and smiling are significant facial expressions used in human to human communication. We present a computational model for the generation of facial expressions associated with laughter and smiling in order to facilitate the synthesis of such facial expressions in virtual characters. In addition, a new method to reproduce these types of laughter is proposed and validated using databases of generic and specific facial smile expressions. In particular, a proprietary database of laugh and smile expressions is also presented. This database lists the different types of classified and generated laughs presented in this work. The generated expressions are validated through a user study with 71 subjects, which concluded that the virtual character expressions built using the presented model are perceptually acceptable in quality and facial expression fidelity. Finally, for generalization purposes, an additional analysis shows that the results are independent of the type of virtual character’s appearance.
... Amberg et al. [2008] combines a PCA model of a neutral face with a PCA space derived from the residual vectors of different expressions to the neutral pose. Blendshapes can either be handcrafted by animators [Alexander et al. 2009;], or be generated via statistical analysis from large facial expression datasets [Cao et al. 2014;Li et al. 2017;Vlasic et al. 2005]. The multilinear model [Cao et al. 2014;Vlasic et al. 2005] offers a way of capturing a joint space of expression and identity. ...
Article
The creation of high-fidelity computer-generated (CG) characters for films and games is tied with intensive manual labor, which involves the creation of comprehensive facial assets that are often captured using complex hardware. To simplify and accelerate this digitization process, we propose a framework for the automatic generation of high-quality dynamic facial models, including rigs which can be readily deployed for artists to polish. Our framework takes a single scan as input to generate a set of personalized blendshapes, dynamic textures, as well as secondary facial components (e.g., teeth and eyeballs). Based on a facial database with over 4,000 scans with pore-level details, varying expressions and identities, we adopt a self-supervised neural network to learn personalized blendshapes from a set of template expressions. We also model the joint distribution between identities and expressions, enabling the inference of a full set of personalized blendshapes with dynamic appearances from a single neutral input scan. Our generated personalized face rig assets are seamlessly compatible with professional production pipelines for facial animation and rendering. We demonstrate a highly robust and effective framework on a wide range of subjects, and showcase high-fidelity facial animations with automatically generated personalized dynamic textures.
... Video-driven performance-based facial animation most commonly involves tracking facial features in image sequences from either an RGB or an RGB-D camera. The motion of the features is then transferred to either a user-dependent rig [5,25,30,56] or a personalization of a generic face model [9,13,14]. To improve the robustness of the facial feature tracking, and thus the quality of the retargeting, it is common to first train a model offline to provide constraints on the facial feature tracking using either a user-specific blendshape model [10,33,54], or a statistical model in the form of multi-linear models [9]. ...
... A number of algorithms, including traditional 3D face models [3,4], data-driven neural networks [5,6,7,1,8,9,10,11,12], and their combinations [13], have been presented for creating photorealistic face animations. In 3D ...
Preprint
Full-text available
The face reenactment is a popular facial animation method where the person's identity is taken from the source image and the facial motion from the driving image. Recent works have demonstrated high quality results by combining the facial landmark based motion representations with the generative adversarial networks. These models perform best if the source and driving images depict the same person or if the facial structures are otherwise very similar. However, if the identity differs, the driving facial structures leak to the output distorting the reenactment result. We propose a novel Facial Attribute Controllable rEenactment GAN (FACEGAN), which transfers the facial motion from the driving face via the Action Unit (AU) representation. Unlike facial landmarks, the AUs are independent of the facial structure preventing the identity leak. Moreover, AUs provide a human interpretable way to control the reenactment. FACEGAN processes background and face regions separately for optimized output quality. The extensive quantitative and qualitative comparisons show a clear improvement over the state-of-the-art in a single source reenactment task. The results are best illustrated in the reenactment video provided in the supplementary material. The source code will be made available upon publication of the paper.
... Amberg et al. [2008] combines a PCA model of a neutral face with a PCA space derived from the residual vectors of different expressions to the neutral pose. Blendshapes can either be handcrafted by animators [Alexander et al. 2009;, or be generated via statistical analysis from large facial expression datasets [Cao et al. 2014;Li et al. 2017;Vlasic et al. 2005]. The multilinear model [Cao et al. 2014;Vlasic et al. 2005] offers a way of capturing a joint space of expression and identity. ...
Preprint
Full-text available
The creation of high-fidelity computer-generated (CG) characters used in film and gaming requires intensive manual labor and a comprehensive set of facial assets to be captured with complex hardware, resulting in high cost and long production cycles. In order to simplify and accelerate this digitization process, we propose a framework for the automatic generation of high-quality dynamic facial assets, including rigs which can be readily deployed for artists to polish. Our framework takes a single scan as input to generate a set of personalized blendshapes, dynamic and physically-based textures, as well as secondary facial components (e.g., teeth and eyeballs). Built upon a facial database consisting of pore-level details, with over 4,000 scans of varying expressions and identities, we adopt a self-supervised neural network to learn personalized blendshapes from a set of template expressions. We also model the joint distribution between identities and expressions, enabling the inference of the full set of personalized blendshapes with dynamic appearances from a single neutral input scan. Our generated personalized face rig assets are seamlessly compatible with cutting-edge industry pipelines for facial animation and rendering. We demonstrate that our framework is robust and effective by inferring on a wide range of novel subjects, and illustrate compelling rendering results while animating faces with generated customized physically-based dynamic textures.
... We use a multi-linear PCA model based on [3,1,9]. The first two dimensions represent facial identity, i.e., geometric shape and skin reflectance, and the third dimension controls the facial expression. ...
Preprint
We present Face2Face, a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.
... Unlike other biological information such as irises or fingerprints, which are relatively easy to verify for authenticity and hard to tamper with [10,11,36,45], human faces in videos are easier to manipulate. Led by DeepFakes [14], these technologies synthesize fake videos in the form of: 1) face swapping, which directly replaces the face of one individual in a target video with the one from a source video [19]; 2) face manipulation, which usually reconstructs a 3D morphable model of the target face and simulates facial movement by morphing [2,5,7]. Sometimes it is also realized by a performer acting out head movements or facial expressions to control the target puppet to do the same. ...
Article
Full-text available
Recent progress in artificial intelligence makes it easier to edit facial movements in videos or create face substitutions, bringing new challenges to anti-fake-face techniques. Although multimedia forensics provides many detection algorithms from a traditional point of view, it is increasingly hard to discriminate fake videos from real ones as they become more sophisticated and plausible with updated forgery technologies. In this paper, we introduce a motion-discrepancy-based method that can effectively differentiate AI-generated fake videos from real ones. The amplitude of face motions in videos is first magnified, and fake videos show more serious distortion or flicker than pristine videos. We pre-train a deep CNN on frames extracted from the training videos, and the output vectors of the frame sequences are used as input to an LSTM in a secondary training stage. Our approach is evaluated on FaceForensics++, a large fake-video dataset produced by various advanced generation technologies, and shows superior performance compared to existing pixel-based fake-video forensics approaches.
... Video-driven performance-based facial animation most commonly involves tracking facial features in image sequences from either an RGB or an RGB-D camera. The motion of the features is then transferred to either a user-dependent rig [5,25,30,56] or a personalization of a generic face model [9,13,14]. To improve the robustness of the facial feature tracking, and thus the quality of the retargeting, it is common to first train a model offline to provide constraints on the facial feature tracking using either a user-specific blendshape model [10,33,54], or a statistical model in the form of multi-linear models [9]. ...
Preprint
We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real-time on resource-limited hardware (e.g., a smartphone); it is user-agnostic, and it is not dependent on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.
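The modality dropout described above can be sketched as a batch-construction step: with some probability, the audio or the visual features of a training example are zeroed out, so the model sees audio-only, video-only, and audiovisual inputs. The probabilities and feature shapes below are placeholders, not the paper's settings.

```python
import numpy as np

def modality_dropout(audio, video, p_drop_audio=0.3, p_drop_video=0.3, rng=None):
    """Randomly zero out the audio or visual features of each training example.

    audio : (B, Da) audio features; video : (B, Dv) visual features.
    At most one modality is dropped per example, so the batch ends up
    containing audio-only, video-only, and audiovisual examples.
    """
    rng = np.random.default_rng() if rng is None else rng
    B = audio.shape[0]
    drop_audio = rng.random(B) < p_drop_audio
    drop_video = (rng.random(B) < p_drop_video) & ~drop_audio  # keep one modality
    audio = np.where(drop_audio[:, None], 0.0, audio)
    video = np.where(drop_video[:, None], 0.0, video)
    return audio, video

# Placeholder batch of paired audio and visual features.
audio = np.random.randn(16, 128)
video = np.random.randn(16, 256)
a_masked, v_masked = modality_dropout(audio, video)
print(int((a_masked == 0).all(axis=1).sum()), "audio-dropped examples in this batch")
```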
... Facial Capture Systems: Physical object scanning devices span a wide range of categories, from single RGB cameras [15,40], to active [4,18] and passive [5] light stereo capture setups, and depth sensors based on time-of-flight or stereo re-projection. Multi-view stereophotogrammetry (MVS) [5] is the most readily available method for 3D face capturing. ...
Preprint
Full-text available
Based on a combined data set of 4000 high resolution facial scans, we introduce a non-linear morphable face model, capable of producing multifarious face geometry of pore-level resolution, coupled with material attributes for use in physically-based rendering. We aim to maximize the variety of identities, while increasing the robustness of correspondence between unique components, including middle-frequency geometry, albedo maps, specular intensity maps and high-frequency displacement details. Our deep learning based generative model learns to correlate albedo and geometry, which ensures the anatomical correctness of the generated assets. We demonstrate potential use of our generative model for novel identity generation, model fitting, interpolation, animation, high fidelity data visualization, and low-to-high resolution data domain transferring. We hope the release of this generative model will encourage further cooperation between all graphics, vision, and data focused professionals, while demonstrating the cumulative value of every individual's complete biometric profile.
... Performance-based facial animation: Most methods to animate digital avatars are based on visual data. Alexander et al. [3], Wu et al. [61], and Laine et al. [35] build subject-specific face-rigs from high-resolution face scans and animate these rigs with video-based animation systems. ...
Preprint
Full-text available
Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.
Article
Emerging Metaverse applications demand accessible, accurate and easy-to-use tools for 3D digital human creations in order to depict different cultures and societies as if in the physical world. Recent large-scale vision-language advances pave the way for novices to conveniently customize 3D content. However, the generated CG-friendly assets still cannot represent the desired facial traits for human characteristics. In this paper, we present DreamFace, a progressive scheme to generate personalized 3D faces under text guidance. It enables layman users to naturally customize 3D facial assets that are compatible with CG pipelines, with desired shapes, textures and fine-grained animation capabilities. From a text input to describe the facial traits, we first introduce a coarse-to-fine scheme to generate the neutral facial geometry with a unified topology. We employ a selection strategy in the CLIP embedding space to generate coarse geometry, and subsequently optimize both the detailed displacements and normals using Score Distillation Sampling (SDS) from the generic Latent Diffusion Model (LDM). Then, for neutral appearance generation, we introduce a dual-path mechanism, which combines the generic LDM with a novel texture LDM to ensure both the diversity and textural specification in the UV space. We also employ a two-stage optimization to perform SDS in both the latent and image spaces to significantly provide compact priors for fine-grained synthesis. It also enables learning the mapping from the compact latent space into physically-based textures (diffuse albedo, specular intensity, normal maps, etc.). Our generated neutral assets naturally support blendshapes-based facial animations, thanks to the unified geometric topology. We further improve the animation ability with personalized deformation characteristics. To this end, we learn the universal expression prior in a latent space with neutral asset conditioning using the cross-identity hypernetwork, and we subsequently train a neural facial tracker from the video input space into the pre-trained expression space for personalized fine-grained animation. Extensive qualitative and quantitative experiments validate the effectiveness and generalizability of DreamFace. Notably, DreamFace can generate realistic 3D facial assets with physically-based rendering quality and rich animation ability from video footage, even for fashion icons or exotic characters in cartoons and fiction movies.
Preprint
Limited by the low-dimensional representational capacity of 3DMM, most 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles and dimples. Some attempt to solve the problem by introducing detail maps or non-linear operations; however, the results are still not vivid. To this end, in this paper we present a novel hierarchical representation network (HRN) to achieve accurate and detailed face reconstruction from a single image. Specifically, we implement the geometry disentanglement and introduce the hierarchical representation to fulfill detailed face modeling. Meanwhile, 3D priors of facial details are incorporated to enhance the accuracy and authenticity of the reconstruction results. We also propose a de-retouching module to achieve better decoupling of the geometry and appearance. It is noteworthy that our framework can be extended to a multi-view fashion by considering detail consistency of different views. Extensive experiments on two single-view and two multi-view FR benchmarks demonstrate that our method outperforms the existing methods in both reconstruction accuracy and visual effects. Finally, we introduce a high-quality 3D face dataset FaceHD-100 to boost the research of high-fidelity face reconstruction.
Article
We present a fully automatic system that can produce high-fidelity, photo-realistic three-dimensional (3D) digital human heads with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head and can produce a high-quality head reconstruction in less than 30 s. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state of the art. Specifically, given the input video, a two-stage frame selection procedure is first employed to select a few high-quality frames for reconstruction. Then a differentiable renderer-based 3D Morphable Model (3DMM) fitting algorithm is applied to recover facial geometries from multiview RGB-D data, which takes advantage of a powerful 3DMM basis constructed with extensive data generation and perturbation. Our 3DMM has much larger expressive capacities than conventional 3DMM, allowing us to recover more accurate facial geometry using merely linear bases. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and Convolutional Neural Networks (CNNs) to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D digital human faces with extremely realistic details. The main code and the newly constructed 3DMM basis are publicly available.
Article
The VirtualCube system is a 3D video conference system that attempts to overcome some limitations of conventional technologies. The key ingredient is VirtualCube, an abstract representation of a real-world cubicle instrumented with RGBD cameras for capturing the user's 3D geometry and texture. We design VirtualCube so that the task of data capturing is standardized and significantly simplified, and everything can be built using off-the-shelf hardware. We use VirtualCubes as the basic building blocks of a virtual conferencing environment, and we provide each VirtualCube user with a surrounding display showing life-size videos of remote participants. To achieve real-time rendering of remote participants, we develop the V-Cube View algorithm, which uses multi-view stereo for more accurate depth estimation and Lumi-Net rendering for better rendering quality. The VirtualCube system correctly preserves the mutual eye gaze between participants, allowing them to establish eye contact and be aware of who is visually paying attention to them. The system also allows a participant to have side discussions with remote participants as if they were in the same room. Finally, the system sheds light on how to support the shared space of work items (e.g., documents and applications) and track participants' visual attention to work items.
Article
As virtual reality (VR) devices become increasingly commonplace, asymmetric interactions between people with and without headsets are becoming more frequent. Existing video pass-through VR headsets solve one side of these asymmetric interactions by showing the user a live reconstruction of the outside world. This paper further advocates for reverse pass-through VR, wherein a three-dimensional view of the user's face and eyes is presented to any number of outside viewers in a perspective-correct manner using a light field display. Tying together research in social telepresence and copresence, autostereoscopic displays, and facial capture, reverse pass-through VR enables natural eye contact and other important non-verbal cues in a wider range of interaction scenarios, providing a path to potentially increase the utility and social acceptability of VR headsets in shared and public spaces.
Chapter
We present a light-weight multi-view capture system with different lighting conditions to generate a topology consistent facial geometry and high-resolution reflectance texture maps. Firstly, we construct the base mesh from multi-view images using the stereo reconstruction algorithms. Then we leverage the mesh deformation technique to register a template mesh to the reconstructed geometry for topology consistency. The facial and ear landmarks are also utilized to guide the deformation. We adopt the photometric stereo and BRDF fitting methods to recover the facial reflectance field. The specular normal which contains high-frequency information is finally utilized to refine the coarse geometry for sub-millimeter details. The captured topology consistent finer geometry and high-quality reflectance information can be used to produce a lifelike personalized digital avatar.
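The photometric-stereo step mentioned in the abstract recovers per-pixel surface normals from images taken under several known light directions by solving a small least-squares system per pixel under a Lambertian assumption. The sketch below is a generic textbook version on synthetic data, not the capture system's implementation.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Recover per-pixel normals and albedo from Lambertian images.

    images     : (K, H, W) intensities under K known directional lights.
    light_dirs : (K, 3) unit light directions.
    Solves I = L @ (albedo * n) per pixel in the least-squares sense.
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                            # (K, H*W)
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)

# Synthetic test: a known flat normal field rendered under 4 lights is recovered.
H, W = 8, 8
n_true = np.dstack([np.zeros((H, W)), np.zeros((H, W)), np.ones((H, W))])
L = np.array([[0, 0, 1], [0.5, 0, 0.87], [0, 0.5, 0.87], [-0.5, 0, 0.87]], float)
L /= np.linalg.norm(L, axis=1, keepdims=True)
imgs = np.clip(np.einsum('kc,hwc->khw', L, n_true), 0, None)
normals, albedo = photometric_stereo(imgs, L)
print(np.allclose(normals, n_true, atol=1e-6))
```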
Article
We present a method for building high-fidelity animatable 3D face models that can be posed and rendered with novel lighting environments in real-time. Our main insight is that relightable models trained to produce an image lit from a single light direction can generalize to natural illumination conditions but are computationally expensive to render. On the other hand, efficient, high-fidelity face models trained with point-light data do not generalize to novel lighting conditions. We leverage the strengths of each of these two approaches. We first train an expensive but generalizable model on point-light illuminations, and use it to generate a training set of high-quality synthetic face images under natural illumination conditions. We then train an efficient model on this augmented dataset, reducing the generalization ability requirements. As the efficacy of this approach hinges on the quality of the synthetic data we can generate, we present a study of lighting pattern combinations for dynamic captures and evaluate their suitability for learning generalizable relightable models. Towards achieving the best possible quality, we present a novel approach for generating dynamic relightable faces that exceeds state-of-the-art performance. Our method is capable of capturing subtle lighting effects and can even generate compelling near-field relighting despite being trained exclusively with far-field lighting data. Finally, we motivate the utility of our model by animating it with images captured from VR-headset mounted cameras, demonstrating the first system for face-driven interactions in VR that uses a photorealistic relightable face model.
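The key property exploited above (images lit one light at a time can be recombined to simulate natural illumination) follows from the linearity of light transport: an environment map is sampled into per-light RGB weights, and the relit image is the weighted sum of the one-light-at-a-time images. A minimal sketch on placeholder data is shown below; the weighting scheme is illustrative, not the paper's training-set generation procedure.

```python
import numpy as np

def relight(olat_images, env_weights):
    """Relight by linearly combining one-light-at-a-time (OLAT) images.

    olat_images : (K, H, W, 3) images, one per point light of the rig.
    env_weights : (K, 3)       RGB intensity of the environment sampled at
                               each light direction.
    Light transport is linear, so the relit image is a weighted sum.
    """
    return np.einsum('khwc,kc->hwc', olat_images, env_weights)

# Placeholder OLAT stack and an environment favoring a warm key light.
K, H, W = 32, 16, 16
olat = np.random.rand(K, H, W, 3)
env = np.zeros((K, 3))
env[3] = [1.0, 0.8, 0.6]         # single dominant warm light
env[10] = [0.2, 0.2, 0.3]        # dim cool fill light
image = relight(olat, env)
print(image.shape)               # (H, W, 3)
```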
Article
In addition to 3D geometry, accurate representation of texture is important when digitizing real objects in virtual worlds. Based on a single consumer RGBD sensor, accurate texture representation for static objects can be realized by fusing multi-frame information; however, extending the process to dynamic objects, which typically have time-varying textures, is difficult. Thus, to address this problem, we propose a compact keyframe-based representation that decouples a dynamic texture into a basic static texture and a set of multiplicative changing maps. With this representation, the proposed method first aligns textures recorded from multiple keyframes with the reconstructed dynamic geometry of the object. Errors in the alignment and geometry are then compensated in an innovative iterative linear optimization framework. With the reconstructed texture, we then employ a scheme to synthesize the dynamic object from arbitrary viewpoints. By considering temporal and local pose similarities jointly, dynamic textures in all keyframes are fused to guarantee high-quality image generation. Experimental results demonstrate that the proposed method handles various dynamic objects, including faces, bodies, cloth, and toys. In addition, qualitative and quantitative comparisons demonstrate that the proposed method outperforms state-of-the-art solutions.
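The keyframe representation described above factors each per-frame texture into a shared static base texture and a multiplicative change map. The sketch below shows one naive way to form such a decomposition (a per-pixel median as the base); it illustrates the representation only and is not the paper's optimization.

```python
import numpy as np

def decompose_dynamic_texture(frames, eps=1e-6):
    """Decompose per-frame textures into a static base and multiplicative maps.

    frames : (T, H, W, 3) per-keyframe textures.
    Returns (base, change_maps) such that frames[t] ~= base * change_maps[t].
    Here the base is simply the per-pixel median across keyframes.
    """
    base = np.median(frames, axis=0)
    change_maps = frames / (base + eps)
    return base, change_maps

# Toy textures: a shared pattern modulated by a per-frame brightness field.
T, H, W = 5, 64, 64
base_true = np.random.rand(H, W, 3)
mods = 0.5 + np.random.rand(T, H, W, 1)
frames = base_true[None] * mods
base, maps = decompose_dynamic_texture(frames)
print(np.abs(frames - base[None] * maps).max())  # ~0, up to the eps regularizer
```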
Article
We present a modular differentiable renderer design that yields performance superior to previous methods by leveraging existing, highly optimized hardware graphics pipelines. Our design supports all crucial operations in a modern graphics pipeline: rasterizing large numbers of triangles, attribute interpolation, filtered texture lookups, as well as user-programmable shading and geometry processing, all in high resolutions. Our modular primitives allow custom, high-performance graphics pipelines to be built directly within automatic differentiation frameworks such as PyTorch or TensorFlow. As a motivating application, we formulate facial performance capture as an inverse rendering problem and show that it can be solved efficiently using our tools. Our results indicate that this simple and straightforward approach achieves excellent geometric correspondence between rendered results and reference imagery.
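A minimal analysis-by-synthesis loop of the kind the abstract describes for facial performance capture, assuming a hypothetical differentiable render_fn built from such rasterization, interpolation, and texturing primitives (the name and signature are placeholders, not the paper's API):

    import torch

    def fit_geometry(render_fn, vertices_init, faces, target_image, steps=200, lr=1e-2):
        """Generic inverse-rendering loop: optimize vertex positions so a
        differentiable rasterization pipeline reproduces a reference photograph.

        render_fn : callable(vertices, faces) -> (h, w, 3) image, differentiable
                    (placeholder for a pipeline assembled from modular primitives)
        """
        vertices = vertices_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([vertices], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            rendered = render_fn(vertices, faces)
            loss = torch.nn.functional.mse_loss(rendered, target_image)
            loss.backward()                        # gradients flow through the renderer
            optimizer.step()
        return vertices.detach()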
Article
Interacting with people across large distances is important for remote work, interpersonal relationships, and entertainment. While such face-to-face interactions can be achieved using 2D video conferencing or, more recently, virtual reality (VR), telepresence systems currently distort the communication of eye contact and social gaze signals. Although methods have been proposed to redirect gaze in 2D teleconferencing situations to enable eye contact, 2D video conferencing lacks the 3D immersion of real life. To address these problems, we develop a system for face-to-face interaction in VR that focuses on reproducing photorealistic gaze and eye contact. To do this, we create a 3D virtual avatar model that can be animated by cameras mounted on a VR headset to accurately track and reproduce human gaze in VR. Our primary contributions in this work are a jointly-learnable 3D face and eyeball model that better represents gaze direction and upper facial expressions, a method for disentangling the gaze of the left and right eyes from each other and the rest of the face allowing the model to represent entirely unseen combinations of gaze and expression, and a gaze-aware model for precise animation from headset-mounted cameras. Our quantitative experiments show that our method results in higher reconstruction quality, and qualitative results show our method gives a greatly improved sense of presence for VR avatars.
Preprint
StyleGAN generates photorealistic portrait images of faces with eyes, teeth, hair and context (neck, shoulders, background), but lacks a rig-like control over semantic face parameters that are interpretable in 3D, such as face pose, expressions, and scene illumination. Three-dimensional morphable face models (3DMMs), on the other hand, offer control over the semantic parameters, but lack photorealism when rendered and only model the face interior, not other parts of a portrait image (hair, mouth interior, background). We present the first method to provide a face rig-like control over a pretrained and fixed StyleGAN via a 3DMM. A new rigging network, RigNet, is trained between the 3DMM's semantic parameters and StyleGAN's input. The network is trained in a self-supervised manner, without the need for manual annotations. At test time, our method generates portrait images with the photorealism of StyleGAN and provides explicit control over the 3D semantic parameters of the face.
Preprint
Monocular image-based 3D reconstruction of faces is a long-standing problem in computer vision. Since image data is a 2D projection of a 3D face, the resulting depth ambiguity makes the problem ill-posed. Most existing methods rely on data-driven priors that are built from limited 3D face scans. In contrast, we propose multi-frame video-based self-supervised training of a deep network that (i) learns a face identity model both in shape and appearance while (ii) jointly learning to reconstruct 3D faces. Our face model is learned using only corpora of in-the-wild video clips collected from the Internet. This virtually endless source of training data enables learning of a highly general 3D face model. In order to achieve this, we propose a novel multi-frame consistency loss that ensures consistent shape and appearance across multiple frames of a subject's face, thus minimizing depth ambiguity. At test time we can use an arbitrary number of frames, so that we can perform both monocular as well as multi-frame reconstruction.
Article
In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real world datasets feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation. This work is an extended version of [1], where we additionally present a stochastic vertex sampling technique for faster training of our networks, and moreover, we propose and evaluate analysis-by-synthesis and shape-from-shading refinement approaches to achieve a high-fidelity reconstruction.
Article
Full-text available
We present a framework that captures and synthesizes high-resolution facial geometry and performance. In order to capture highly detailed surface structures, a theory of fast normal recovery using spherical gradient illumination patterns is presented to estimate surface normal maps of an object from either its diffuse or specular reflectance, simultaneously from any viewpoint. We show that the normal map from specular reflectance yields the best record of detailed surface shape, which can be used for geometry enhancement. Moreover, the normal map from the diffuse reflectance is able to produce a good approximation of subsurface scattering. Based on the theory, two systems are developed to capture high-resolution facial geometry of a static face or a dynamic facial performance. The static face scanning system consists of a spherical illumination device, two single-lens reflex (SLR) cameras, and a video projector. The spherical illumination device is used to cast spherical gradient patterns onto the subject. The captured spherical gradient images are then turned into surface normals of the subject. The two cameras and one projector are used to build a structured-light-assisted two-view stereo system, which acquires a moderate-resolution geometry of the subject. We then use the acquired specular normal map to enhance the initial geometry based on an optimization process. To further analyze how facial geometry deforms during performance, we build another facial performance capture system, which is analogous to the previous face scanning system but employs two high-speed video cameras and a high-speed projector. The system is able to capture 30 facial geometry measurements per second. A novel method, named polynomial displacement maps, is presented to combine motion capture with real-time face scans, so that realistic facial deformation can be modeled and synthesized. Finally, we present a real-time relighting algorithm based on spherical wavelets for rendering realistic faces on modern GPU architectures.
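A simplified sketch of normal recovery from spherical gradient illumination, assuming one image per gradient pattern along x, y, z plus a constant full-on pattern. The ratio-based relation used here is exact for ideal specular reflection and only an approximation for the diffuse case; polarization-based separation of the two components is not shown.

    import numpy as np

    def normals_from_gradient_illumination(I_x, I_y, I_z, I_full, eps=1e-6):
        """Estimate a normal map from images lit by spherical gradient patterns
        along x, y, z plus a constant (full-on) spherical pattern.

        Uses the ratio n_i ~ 2 * I_i / I_full - 1 and renormalizes.
        """
        denom = np.maximum(I_full, eps)
        n = np.stack([2.0 * I_x / denom - 1.0,
                      2.0 * I_y / denom - 1.0,
                      2.0 * I_z / denom - 1.0], axis=-1)
        return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)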
Article
Full-text available
We present a process for estimating spatially-varying surface reflectance of a complex scene observed under natural illumination conditions. The process uses a laser-scanned model of the scene's geometry, a set of digital images viewing the scene's surfaces under a variety of natural illumination conditions, and a set of corresponding measurements of the scene's incident illumination in each photograph. The process then employs an iterative inverse global illumination technique to compute surface colors for the scene which, when rendered under the recorded illumination conditions, best reproduce the scene's appearance in the photographs. In our process we measure BRDFs of representative surfaces in the scene to better model the non-Lambertian surface reflectance. Our process uses a novel lighting measurement apparatus to record the full dynamic range of both sunlit and cloudy natural illumination conditions. We employ Monte-Carlo global illumination, multiresolution geometry, and a texture atlas system to perform inverse global illumination on the scene. The result is a lighting-independent model of the scene that can be re-illuminated under any form of lighting. We demonstrate the process on a real-world archaeological site, showing that the technique can produce novel illumination renderings consistent with real photographs as well as reflectance properties that are consistent with ground-truth reflectance measurements.
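A schematic of the iterative inverse-rendering idea, not the paper's actual optimizer: surface colors are repeatedly corrected by the ratio of photographed to rendered appearance under the recorded illumination. The render_fn placeholder stands in for a Monte-Carlo global illumination renderer.

    import numpy as np

    def inverse_global_illumination(albedo_init, render_fn, observed, iterations=10):
        """Iteratively refine per-texel surface colors so renderings under the
        captured lighting match the photographs.

        render_fn : callable(albedo) -> image predicted under the recorded lighting
                    (placeholder for a global-illumination renderer)
        observed  : the corresponding photograph, same shape as the rendering
        """
        albedo = albedo_init.copy()
        for _ in range(iterations):
            predicted = render_fn(albedo)
            ratio = observed / np.maximum(predicted, 1e-6)
            albedo = np.clip(albedo * ratio, 0.0, 1.0)   # multiplicative correction
        return albedo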
Article
Full-text available
We present a technique for capturing an actor's live-action performance in such a way that the lighting and reflectance of the actor can be designed and modified in postproduction. Our approach is to illuminate the subject with a sequence of time-multiplexed basis lighting conditions, and to record these conditions with a high-speed video camera so that many conditions are recorded in the span of the desired output frame interval. We investigate several lighting bases for representing the sphere of incident illumination using a set of discrete LED light sources, and we estimate and compensate for subject motion using optical flow and image warping based on a set of tracking frames inserted into the lighting basis. To composite the illuminated performance into a new background, we include a time-multiplexed matte within the basis. We also show that the acquired data enables time-varying surface normals, albedo, and ambient occlusion to be estimated, which can be used to transform the actor's reflectance to produce both subtle and stylistic effects.
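A rough sketch of the motion-compensation step: a basis-lit frame is warped toward the desired output instant using dense optical flow computed between two bracketing tracking frames. OpenCV's Farneback flow is used here purely as a stand-in for the paper's flow estimator, and the interpolation scheme is a simplification; the tracking frames are assumed to be 8-bit grayscale.

    import cv2
    import numpy as np

    def warp_to_output_time(basis_frame, tracking_prev, tracking_next, t):
        """Align one basis-lit frame to the output time by warping with dense
        optical flow between two tracking frames; t in [0, 1] scales the flow."""
        flow = cv2.calcOpticalFlowFarneback(tracking_prev, tracking_next, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = tracking_prev.shape[:2]
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (grid_x + t * flow[..., 0]).astype(np.float32)
        map_y = (grid_y + t * flow[..., 1]).astype(np.float32)
        return cv2.remap(basis_frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)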
Article
Full-text available
"The Human Face Project" is a short film documenting an effort at Walt Disney Feature Animation to track and animate human facial performance, which was shown in the SIGGRAPH 2001 Electronic Theater. This short paper outlines the techniques developed in this project, and demonstrated in that film.The face tracking system we developed is exemplary of model-based computer vision, and exploits the detailed degrees of freedom of a geometric face model to confine the space of solutions. Optical flow and successive rerendering of the model are employed in an optimization loop to converge on model parameter estimates. The structure of the model permits very principled mapping of estimated expressions to different targets.Of critical importance in media applications is the handling of details beyond the resolution or degrees of freedom of the tracking model. We describe behavioral modeling expedients for realizing these details in a plausible way in resynthesis.
Conference Paper
Full-text available
We have created a system for capturing both the three-dimensional geometry and color and shading information for human facial expressions. We use this data to reconstruct photorealistic, 3D animations of the captured expressions. The system uses a large set of sampling points on the face to accurately track the three-dimensional deformations of the face. Simultaneously with the tracking of the geometric data, we capture multiple high-resolution, registered video images of the face. These images are used to create a texture map sequence for a three-dimensional polygonal face model which can then be rendered on standard 3D graphics hardware. The resulting facial animation is surprisingly life-like and looks very much like the original live performance. Separating the capture of the geometry from the texture images eliminates much of the variance in the image data due to motion, which increases compression ratios. Although the primary emphasis of our work is not compression, we have investigated the use of a novel method to compress the geometric data based on principal components analysis. The texture sequence is compressed using an MPEG4 video codec. Animations reconstructed from 512x512 pixel textures look good at data rates as low as 240 Kbits per second.
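A small example of the PCA-based geometric compression mentioned above, assuming tracked vertex positions flattened to one row per frame (array shapes and component count are illustrative):

    import numpy as np

    def pca_compress(frames, n_components=20):
        """Compress a sequence of tracked meshes with principal components analysis.

        frames : (n_frames, n_vertices * 3) flattened vertex positions
        Returns the mean shape, the retained basis, and per-frame coefficients.
        """
        mean = frames.mean(axis=0)
        centered = frames - mean
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        basis = Vt[:n_components]                  # (k, n_vertices * 3) principal directions
        coeffs = centered @ basis.T                # (n_frames, k) projections onto the basis
        return mean, basis, coeffs

    def pca_decompress(mean, basis, coeffs):
        return mean + coeffs @ basis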
Conference Paper
Full-text available
We present a method to acquire the reflectance field of a human face and use these measurements to render the face under arbitrary changes in lighting and viewpoint. We first acquire images of the face from a small set of viewpoints under a dense sampling of incident illumination directions using a light stage. We then construct a reflectance function image for each observed image pixel from its values over the space of illumination directions. From the reflectance functions, we can directly generate images of the face from the original viewpoints in any form of sampled or computed illumination. To change the viewpoint, we use a model of skin reflectance to estimate the appearance of the reflectance functions for novel viewpoints. We demonstrate the technique with synthetic renderings of a person's face under novel illumination and viewpoints.
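The relighting step can be summarized as a weighted sum over the per-direction images of the reflectance field. The sketch below assumes the target environment has already been resampled at the captured light directions; the names and array shapes are ours.

    import numpy as np

    def relight(olat_images, light_solid_angles, environment_radiance):
        """Relight a face from its acquired reflectance field.

        olat_images          : (n_lights, h, w, 3) images, one per incident light direction
        light_solid_angles   : (n_lights,) solid angle associated with each light
        environment_radiance : (n_lights, 3) RGB radiance of the target environment
                               sampled at the same directions
        """
        weights = environment_radiance * light_solid_angles[:, None]   # (n_lights, 3)
        # Linear combination of single-light images gives the relit image.
        return np.einsum('lc,lhwc->hwc', weights, olat_images)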
Article
Full-text available
We present a novel automatic method for high-resolution, non-rigid dense 3D point tracking. High-quality dense point clouds of non-rigid geometry moving at video speeds are acquired using a phase-shifting structured light ranging technique. To use such data for the temporal study of subtle motions such as those seen in facial expressions, an efficient non-rigid 3D motion tracking algorithm is needed to establish inter-frame correspondences. The novelty of this paper is the development of an algorithmic framework for 3D tracking that unifies tracking of intensity and geometric features, using harmonic maps with added feature correspondence constraints. While previous uses of harmonic maps provided only global alignment, the proposed introduction of interior feature constraints allows non-rigid deformations to be tracked accurately as well. The harmonic map between two topological disks is a diffeomorphism with minimal stretching energy and bounded angle distortion. The map is stable, insensitive to resolution changes, and robust to noise. Due to the strong implicit and explicit smoothness constraints imposed by the algorithm and the high-resolution data, the resulting registration/deformation field is smooth, continuous, and gives dense one-to-one inter-frame correspondences. Our method is validated through a series of experiments demonstrating its accuracy and efficiency.
Article
Full-text available
We present a novel method for acquisition, modeling, compression, and synthesis of realistic facial deformations using polynomial displacement maps. Our method consists of an analysis phase where the relationship between motion capture markers and detailed facial geometry is inferred, and a synthesis phase where novel detailed animated facial geometry is driven solely by a sparse set of motion capture markers. For analysis, we record the actor wearing facial markers while performing a set of training expression clips. We capture real-time high-resolution facial deformations, including dynamic wrinkle and pore detail, using interleaved structured light 3D scanning and photometric stereo. Next, we compute displacements between a neutral mesh driven by the motion capture markers and the high-resolution captured expressions. These geometric displacements are stored in a polynomial displacement map which is parameterized according to the local deformations of the motion capture dots. For synthesis, we drive the polynomial displacement map with new motion capture data. This allows the recreation of large-scale muscle deformation, medium and fine wrinkles, and dynamic skin pore detail. Applications include the compression of existing performance data and the synthesis of new performances. Our technique is independent of the underlying geometry capture system and can be used to automatically generate high-frequency wrinkle and pore details on top of many existing facial animation systems.
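An illustrative sketch of fitting and evaluating a per-texel polynomial displacement model from training expressions, assuming local deformation parameters have already been derived from the motion-capture dots; the quadratic feature set and array layout are our simplifications.

    import numpy as np

    def _poly_features(params, degree=2):
        """Build [1, x_i, x_i * x_j (i <= j)] features for one parameter vector."""
        p = len(params)
        feats = [1.0] + [params[i] for i in range(p)]
        if degree >= 2:
            feats += [params[i] * params[j] for i in range(p) for j in range(i, p)]
        return np.asarray(feats)

    def fit_polynomial_displacement(deform_params, displacements, degree=2):
        """Fit, per texel, a polynomial mapping local deformation parameters to displacement.

        deform_params : (n_expressions, p) local deformation parameters per training expression
        displacements : (n_expressions, n_texels) measured displacement at every texel
        Returns coefficients of shape (n_terms, n_texels).
        """
        A = np.stack([_poly_features(x, degree) for x in deform_params])  # (n, n_terms)
        coeffs, *_ = np.linalg.lstsq(A, displacements, rcond=None)
        return coeffs

    def evaluate_polynomial_displacement(coeffs, params, degree=2):
        return _poly_features(params, degree) @ coeffs                    # (n_texels,)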
Article
Full-text available
We propose a flexible technique to easily calibrate a camera. It only requires the camera to observe a planar pattern shown at a few (at least two) different orientations. Either the camera or the planar pattern can be freely moved. The motion need not be known. Radial lens distortion is modeled. The proposed procedure consists of a closed-form solution, followed by a nonlinear refinement based on the maximum likelihood criterion. Both computer simulation and real data have been used to test the proposed technique and very good results have been obtained. Compared with classical techniques which use expensive equipment such as two or three orthogonal planes, the proposed technique is easy to use and flexible. It advances 3D computer vision one more step from laboratory environments to real world use.
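For reference, this flexible planar-pattern calibration is also available in OpenCV; a short usage sketch follows (the board size, square size, and image list are placeholders):

    import cv2
    import numpy as np

    def calibrate_from_chessboards(image_paths, pattern_size=(9, 6), square_size=1.0):
        """Estimate intrinsics and distortion from several views of a planar chessboard."""
        objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
        objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
        objp *= square_size
        object_points, image_points, image_size = [], [], None
        for path in image_paths:
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            found, corners = cv2.findChessboardCorners(gray, pattern_size)
            if not found:
                continue
            corners = cv2.cornerSubPix(
                gray, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
            object_points.append(objp)
            image_points.append(corners)
            image_size = gray.shape[::-1]
        # Closed-form solution followed by nonlinear (maximum likelihood) refinement.
        rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
            object_points, image_points, image_size, None, None)
        return K, dist, rms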
Article
We present a practical method for modeling layered facial reflectance consisting of specular reflectance, single scattering, and shallow and deep subsurface scattering. We estimate parameters of appropriate reflectance models for each of these layers from just 20 photographs recorded in a few seconds from a single viewpoint. We extract spatially-varying specular reflectance and single-scattering parameters from polarization-difference images under spherical and point source illumination. Next, we employ direct-indirect separation to decompose the remaining multiple scattering observed under cross-polarization into shallow and deep scattering components to model the light transport through multiple layers of skin. Finally, we match appropriate diffusion models to the extracted shallow and deep scattering components for different regions on the face. We validate our technique by comparing renderings of subjects to reference photographs recorded from novel viewpoints and under novel illumination conditions.
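The polarization-difference separation described above can be summarized in a few lines, assuming registered parallel- and cross-polarized images of the same view under polarized illumination:

    import numpy as np

    def separate_specular_diffuse(parallel_pol, cross_pol):
        """Split reflectance using polarization-difference imaging.

        The cross-polarized image contains only depolarized diffuse / subsurface
        reflection, while the parallel-polarized image additionally contains the
        polarization-preserving specular reflection.
        """
        specular = np.clip(parallel_pol - cross_pol, 0.0, None)
        diffuse = 2.0 * cross_pol   # accounts for both polarization states of the diffuse term
        return specular, diffuse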
Article
First description of the uncanny valley theory
Article
We present a method that uses measured scene radiance and global illumination in order to add new objects to light-based models with correct lighting. The method uses a high dynamic range image-based model of the scene, rather than synthetic light sources, to illuminate the new objects. To compute the illumination, the scene is considered as three components: the distant scene, the local scene, and the synthetic objects. The distant scene is assumed to be photometrically unaffected by the objects, obviating the need for reflectance model information. The local scene is endowed with estimated reflectance model information so that it can catch shadows and receive reflected light from the new objects. Renderings are created with a standard global illumination method by simulating the interaction of light amongst the three components. A differential rendering technique allows for good results to be obtained when only an estimate of the local scene reflectance properties is known. We apply the general method to the problem of rendering synthetic objects into real scenes. The light-based model is constructed from an approximate geometric model of the scene and by using a light probe to measure the incident illumination at the location of the synthetic objects. The global illumination solution is then composited into a photograph of the scene using the differential rendering technique. We conclude by discussing the relevance of the technique to recovering surface reflectance properties in uncontrolled lighting situations. Applications of the method include visual effects, interior design, and architectural visualization.
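The differential rendering composite can be written compactly; the sketch below assumes linear (HDR) images and a binary object mask, which are our simplifications:

    import numpy as np

    def differential_composite(background_photo, render_with_objects,
                               render_without_objects, object_mask):
        """Composite synthetic objects into a photograph with differential rendering.

        Outside the objects, only the *difference* between the two renderings
        (shadows, interreflections) is added to the photograph, so errors in the
        estimated local-scene reflectance largely cancel.
        """
        delta = render_with_objects - render_without_objects
        composite = background_photo + delta
        # Inside the object mask, take the rendered objects directly.
        composite = np.where(object_mask[..., None] > 0, render_with_objects, composite)
        return np.clip(composite, 0.0, None)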
Article
Since the Lucas-Kanade algorithm was proposed in 1981, image alignment has become one of the most widely used techniques in computer vision. Applications range from optical flow, tracking, and layered motion, to mosaic construction, medical image registration, and face coding. Numerous algorithms have been proposed and a variety of extensions have been made to the original formulation. We present an overview of image alignment, describing most of the algorithms in a consistent framework. We concentrate on the inverse compositional algorithm, an efficient algorithm that we recently proposed. We examine which of the extensions to the Lucas-Kanade algorithm can be used with the inverse compositional algorithm without any significant loss of efficiency, and which cannot. In this paper, the fourth and final part in the series, we cover the addition of priors on the parameters. We first consider the addition of priors on the warp parameters. We show that priors can be added with minimal extra cost to all of the algorithms in Parts 1-3. Next we consider the addition of priors on both the warp and appearance parameters. Image alignment with appearance variation was covered in Part 3. For each algorithm in Part 3, we describe whether priors can be placed on the appearance parameters or not, and if so what the cost is.
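As a concrete, minimal instance of the inverse compositional algorithm discussed above, the sketch below estimates a pure-translation warp; the template-side precomputation of the gradients and Hessian is what distinguishes it from the forwards-additive Lucas-Kanade formulation. Priors on the parameters, the focus of this part of the series, are not included.

    import numpy as np
    from scipy import ndimage

    def inverse_compositional_translation(image, template, p0=(0.0, 0.0), iterations=50):
        """Inverse compositional Lucas-Kanade for a pure translation warp.

        Estimates p = (dx, dy) such that image(x + p) ~= template(x).
        """
        gy, gx = np.gradient(template)                       # template gradients (precomputed)
        sd = np.stack([gx.ravel(), gy.ravel()], axis=1)      # steepest-descent images (N, 2)
        H_inv = np.linalg.inv(sd.T @ sd)                     # precomputed 2x2 Hessian inverse
        ys, xs = np.mgrid[0:template.shape[0], 0:template.shape[1]]
        p = np.array(p0, dtype=np.float64)
        for _ in range(iterations):
            warped = ndimage.map_coordinates(image, [ys + p[1], xs + p[0]], order=1)
            error = (warped - template).ravel()
            dp = H_inv @ (sd.T @ error)
            p -= dp                                          # compose with the inverted increment
            if np.linalg.norm(dp) < 1e-4:
                break
        return p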
Article
This paper introduces a simple model for subsurface light transport in translucent materials. The model enables efficient simulation of effects that BRDF models cannot capture, such as color bleeding within materials and diffusion of light across shadow boundaries. The technique is efficient even for anisotropic, highly scattering media that are expensive to simulate using existing methods. The model combines an exact solution for single scattering with a dipole point source diffusion approximation for multiple scattering. We also have designed a new, rapid image-based measurement technique for determining the optical properties of translucent materials. We validate the model by comparing predicted and measured values and show how the technique can be used to recover the optical properties of a variety of materials, including milk, marble, and skin. Finally, we describe sampling techniques that allow the model to be used within a conventional ray tracer.
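The multiple-scattering part of the model, the dipole point-source diffusion approximation, can be transcribed directly; the sketch below evaluates the diffuse reflectance R_d(r) from the absorption and reduced scattering coefficients (single scattering and the measurement procedure are not included).

    import numpy as np

    def dipole_diffuse_reflectance(r, sigma_a, sigma_s_prime, eta=1.3):
        """Dipole approximation to the diffuse reflectance R_d(r) of a translucent
        half-space, as a function of distance r between entry and exit points.

        sigma_a       : absorption coefficient
        sigma_s_prime : reduced scattering coefficient
        eta           : relative index of refraction
        """
        sigma_t_prime = sigma_a + sigma_s_prime
        alpha_prime = sigma_s_prime / sigma_t_prime
        sigma_tr = np.sqrt(3.0 * sigma_a * sigma_t_prime)    # effective transport coefficient
        F_dr = -1.440 / eta**2 + 0.710 / eta + 0.668 + 0.0636 * eta   # diffuse Fresnel term
        A = (1.0 + F_dr) / (1.0 - F_dr)
        z_r = 1.0 / sigma_t_prime                            # depth of the real source
        z_v = z_r * (1.0 + 4.0 * A / 3.0)                    # height of the virtual source
        d_r = np.sqrt(r**2 + z_r**2)
        d_v = np.sqrt(r**2 + z_v**2)
        return (alpha_prime / (4.0 * np.pi)) * (
            z_r * (1.0 + sigma_tr * d_r) * np.exp(-sigma_tr * d_r) / d_r**3 +
            z_v * (1.0 + sigma_tr * d_v) * np.exp(-sigma_tr * d_v) / d_v**3)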
Creating virtual performers: Disney's human face project. Millimeter magazine
  • E Wolff
The digital eye: Image metrics attempts to leap the uncanny valley. VFXWorld magazine
  • P Plantec
Volumetric cinematography: The world no longer flat
  • S Perlman
What's old is new again
  • B Robertson
Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
  • P Ekman
  • W Friesen