Sachin Kajareker's research while affiliated with Apple Inc. and other places

Publications (5)

Preprint
We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training...
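As a purely illustrative sketch of the two-pathway idea described above (not the paper's actual architecture), one audiovisual branch could regress speech-related controls while a visual-only branch regresses non-speech controls. All module names, dimensions, and control counts below are assumptions:

    import torch
    import torch.nn as nn

    class TwoBranchFaceDriver(nn.Module):
        """Hypothetical sketch: speech-related face controls are regressed
        from audio + visual features; non-speech controls from visual
        features alone. Dimensions are illustrative assumptions."""

        def __init__(self, audio_dim=40, visual_dim=128,
                     n_speech_ctrl=28, n_nonspeech_ctrl=23):
            super().__init__()
            # Audiovisual pathway for speech-related movements.
            self.speech_branch = nn.GRU(audio_dim + visual_dim, 256,
                                        batch_first=True)
            # Visual-only pathway for non-speech movements.
            self.nonspeech_branch = nn.GRU(visual_dim, 256, batch_first=True)
            self.speech_head = nn.Linear(256, n_speech_ctrl)
            self.nonspeech_head = nn.Linear(256, n_nonspeech_ctrl)

        def forward(self, audio_feats, visual_feats):
            # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
            av = torch.cat([audio_feats, visual_feats], dim=-1)
            speech_h, _ = self.speech_branch(av)
            nonspeech_h, _ = self.nonspeech_branch(visual_feats)
            # Concatenate both control groups into one per-frame rig vector.
            return torch.cat([self.speech_head(speech_h),
                              self.nonspeech_head(nonspeech_h)], dim=-1)

    model = TwoBranchFaceDriver()
    out = model(torch.randn(1, 50, 40), torch.randn(1, 50, 128))
    print(out.shape)  # torch.Size([1, 50, 51])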
Conference Paper
Full-text available
Speech-driven visual speech synthesis involves mapping acoustic speech features to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, the lack of synchronized audio, video, and depth data limits the ability to reliably train DNNs, especially for sp...
Preprint
Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to relia...
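Both entries above describe the same DNN mapping from acoustic features to lip animation controls. A minimal sketch of that kind of per-frame regressor, assuming MFCC-like acoustic features as input and a vector of face-rig controls as output (all dimensions are hypothetical, not taken from the paper):

    import torch
    import torch.nn as nn

    class Audio2LipControls(nn.Module):
        """Minimal sketch of a speech-driven visual speech synthesis net.

        Assumed dimensions: 39-dim acoustic features (e.g. MFCCs with
        deltas) per frame in, 51 animation controls for a face model out.
        """

        def __init__(self, feat_dim=39, hidden_dim=256, num_controls=51):
            super().__init__()
            # Recurrent encoder captures the temporal context of speech.
            self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
            # Per-frame regression head maps hidden states to rig controls.
            self.head = nn.Linear(2 * hidden_dim, num_controls)

        def forward(self, acoustic_feats):
            # acoustic_feats: (batch, frames, feat_dim)
            hidden, _ = self.rnn(acoustic_feats)
            return self.head(hidden)  # (batch, frames, num_controls)

    # Usage: map 100 frames of audio features to animation controls.
    model = Audio2LipControls()
    controls = model(torch.randn(2, 100, 39))
    print(controls.shape)  # torch.Size([2, 100, 51])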

Citations

... The applications of talking face generation can be broadly categorized into two groups, as depicted in Fig. 1. The first group involves generating talking faces based on text inputs, which can be used for video production or multimodal chatbots [2][3][4][5][6][7][8]. In most cases, this group also requires simultaneous generation of speech synchronized with talking faces. ...
... 3D Coefficient-based. Besides 2D facial coefficient models, 3D facial coefficients obtained via principal component analysis (PCA) are more commonly used in VSG [67,70,171,172,173,174,175]. Pham et al. [171,172,176] proposed CNN+RNN-based backbone architectures that map audio signals to the blendshape coefficients [177] of a 3D face. ...
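A rough sketch of the CNN+RNN audio-to-blendshape mapping this snippet describes: 1D convolutions over the acoustic feature sequence for local patterns, an RNN for longer-range temporal context, and a linear head predicting per-frame blendshape coefficients. This illustrates the general technique under assumed dimensions (40-dim features, 46 blendshapes); it is not a reproduction of Pham et al.'s model:

    import torch
    import torch.nn as nn

    class CNNRNNBlendshapeRegressor(nn.Module):
        """Sketch of a CNN+RNN backbone mapping audio features to
        blendshape coefficients of a 3D face. Dimensions are assumed."""

        def __init__(self, feat_dim=40, conv_dim=128, hidden_dim=256,
                     num_blendshapes=46):
            super().__init__()
            # Convolutions extract local spectro-temporal patterns.
            self.conv = nn.Sequential(
                nn.Conv1d(feat_dim, conv_dim, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(conv_dim, conv_dim, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            # RNN models longer-range temporal dependencies.
            self.rnn = nn.GRU(conv_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_blendshapes)

        def forward(self, feats):
            # feats: (batch, frames, feat_dim); Conv1d wants (batch, C, T).
            x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
            hidden, _ = self.rnn(x)
            # Sigmoid keeps coefficients in [0, 1], a common blendshape range.
            return torch.sigmoid(self.head(hidden))

    model = CNNRNNBlendshapeRegressor()
    weights = model(torch.randn(1, 200, 40))  # shape: (1, 200, 46)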