Face Parsing for Mobile AR Applications
Yongzhe Yan*
Université Clermont-Auvergne
Wisimage

Benjamin Bout
Université Clermont-Auvergne
Wisimage

Anthony Berthelier
Université Clermont-Auvergne
Wisimage

Xavier Naturel
Wisimage

Thierry Chateau
Université Clermont-Auvergne
ABSTRACT
Face parsing is a segmentation task over facial components that is important for many facial augmented reality (AR) applications. We present a demonstration of face parsing for mobile platforms such as iPhone and Android. We design an efficient fully convolutional neural network (CNN) in an hourglass form adapted to live face parsing. The CNN is implemented on the iPhone with the CoreML framework. To visualize the segmentation results, we superimpose a false-color mask so that the user gets an instant AR experience.
Keywords:
Face Parsing, Mobile AR, Semantic Segmentation,
Video Segmentation, Computer Vision
1 INTRODUCTION
The goal of face parsing is to classify every pixel of an image into a category of facial components. Detecting the different facial components is of great interest for many augmented reality (AR) applications such as facial image beautification and facial image editing. For example, given the lip area, we can apply a virtual lipstick effect by colorizing the region with appropriate colors. In this paper, we focus on designing an efficient face parsing method based on neural networks, since many of these applications target mobile platforms such as iOS and Android.
Deep convolutional networks have proven to be the leading approach for many computer vision tasks, especially semantic segmentation [1, 8]. Nonetheless, these methods are often either time-consuming or over-sized due to their excessive amounts of parameters and computation. Moreover, most semantic segmentation methods are designed for general complex scenes or street scenes, whereas face parsing is a rather different task for the following reasons:
• Face parsing is usually performed on a region of interest (ROI) given by a preliminary face detection, whereas general semantic segmentation is typically performed on the entire image.
• Sharp boundaries are required for facial AR applications in order to render convincing visual effects.
• Facial components exhibit more deformation variance but less position and size variance than objects in general semantic segmentation.
In this paper, we present a real-time AR face parsing demonstration on iPhone based on an efficient deep convolutional neural network. Unlike previous face parsing methods, we consider the adaptation of the network to video in order to provide a fluid and temporally consistent rendering. Users can visualize the segmentation results through a mask that indicates the different facial regions. A rendering result is shown in Figure 1. More AR applications can be built on top of our results.
*e-mail: yongzhe.yan@etu.uca.fr
e-mail: thierry.chateau@uca.fr
Figure 1: Visual results of our method on iPhone. Our method is robust to extreme expressions and poses.
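To make the mask-based colorization mentioned above concrete (e.g. the virtual lipstick effect), the following is a minimal Python/NumPy sketch of alpha-blending false colors over an image from a parsing label map. It is illustrative only, not the demo's actual rendering code; the class indices and the color palette are assumptions.

# Minimal sketch (not the demo's rendering code): blend false colors onto an
# image from a face-parsing label map. Class ids and palette are assumptions.
import numpy as np

PALETTE = {
    1: (255, 224, 189),  # skin (hypothetical class id / color)
    2: (180, 120, 90),   # eyebrows
    3: (60, 120, 255),   # eyes
    4: (255, 0, 80),     # lips
}

def overlay_parsing(image: np.ndarray, labels: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Alpha-blend a false-color mask over an RGB uint8 image of the same H x W."""
    out = image.astype(np.float32)
    for cls, color in PALETTE.items():
        region = labels == cls  # boolean mask for this facial component
        out[region] = (1 - alpha) * out[region] + alpha * np.array(color, np.float32)
    return out.clip(0, 255).astype(np.uint8)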
2 RELATED WORK
A commonly shared consensus in deep semantic segmentation is that there are two kinds of mainstream methods [3]: structures based on spatial pyramid pooling (SPP) modules and encoder-decoder structures [1, 8]. Encoder-decoder structures adopt progressive upsampling with skip connections to reconstruct object boundaries as sharply as possible. Skip connections play an important role in the network structure, allowing the CNN to transfer low-level detail information to the output layers.
To ensure the best user experience in AR applications, short inference time and low latency are required. Some researchers have proposed using optical flow to reuse the feature maps of past frames while the scene remains static. Another way to reduce inference time is to use network acceleration techniques such as quantization or pruning. Recent works such as MobileNet [9] and ShuffleNet proposed depth-wise separable convolutions to reduce the computational complexity of CNNs while keeping almost the same performance.
CNN-based methods [6] brought a significant advance to face parsing compared to exemplar-based methods [10]. Liu et al. [6] proposed to use a Conditional Random Field (CRF) to obtain sharp facial component contours. Sharing the same motivation, another work [5] proposed a spatially variant recurrent unit that enables regional information propagation.
3 MOBILE FACE PARSING DEMO DESCRIPTION
In this section, we present the design of our segmentation network, how we adapt it to video, and some implementation details.
3.1 Mobile Hourglass Network
We follow the design of the hourglass model used in human pose estimation [7]. Our network structure is presented in Figure 2. The hourglass model is a symmetric encoder-decoder fully convolutional network with a depth of 4, which means that the encoder downsamples the input 4 times and the decoder upsamples the feature map 4 times to reconstruct the output. Each yellow block represents a network block that lets the CNN learn the information flow at each stage and along the skip connections.

We drop several convolutional and max-pooling layers at the beginning of the network and increase the size of the output map to 256, the same size as the input image. This makes the boundaries of facial objects sharper, but it also increases the computational complexity. To accelerate inference, we replace all of the ResNet blocks in the hourglass network with MobileNetV2 [9] inverted bottleneck blocks. The expansion factor t is an important hyperparameter that indicates how many times the number of feature channels is expanded inside each block. We chose the number of features f as 32 and the expansion factor t as 6, which gives the best compromise between speed and segmentation quality.

Figure 2: Overview of our method for video face parsing. Blue channels: RGB facial image. Green channels: mask predictions; more transparency signifies earlier predictions in the time dimension. The output prediction of frame t-1 is reinjected into frame t for robustness. Purple block: convolutional layer + batch normalization layer. Yellow blocks: the hourglass model composed of MobileNet blocks. Red blocks: Dense blocks [4].
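As a rough PyTorch sketch of such a block (a plausible reconstruction for illustration, not our exact implementation), the inverted bottleneck expands the channels by the factor t with a 1x1 convolution, applies a 3x3 depth-wise convolution, and projects back with a 1x1 linear layer, with a residual connection when shapes match.

# Sketch of a MobileNetV2-style inverted bottleneck block (PyTorch); layer
# details are a plausible reconstruction, not the exact implementation.
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    def __init__(self, channels: int = 32, expansion: int = 6):
        super().__init__()
        hidden = channels * expansion  # expansion factor t widens the feature maps
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),  # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                    # 3x3 depth-wise conv
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),  # 1x1 linear projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection (same shape in and out)

# With f = 32 channels and t = 6, the hidden width is 192.
block = InvertedBottleneck(channels=32, expansion=6)
y = block(torch.randn(1, 32, 64, 64))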
3.2 Video Adaptation
Inspired by a Google AI blog post on mobile real-time video segmentation [2], we implement two strategies to adapt our model to video face parsing.

For inference, we feed the mask predicted for frame t-1 as an additional input to frame t in order to stabilize the segmentation. Due to the lack of a video dataset, we apply a randomly transformed ground-truth mask as a fake previous mask during training. According to our experiments, this eliminates the random segmentation noise that appears when the previous-frame mask is not used as input.

We also add four dense layers [4] with a growth rate of 8 at the end of the network for more robust rendering.
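A minimal sketch of this training-time augmentation is given below (PyTorch, with a recent torchvision assumed). The transform ranges and the mask encoding are illustrative assumptions, not the exact values we use.

# Sketch of the fake previous-mask augmentation used during training (parameter
# ranges are illustrative). The randomly warped ground-truth mask simulates the
# prediction of frame t-1 and is concatenated to the RGB input.
import random
import torch
import torchvision.transforms.functional as TF

def fake_previous_mask(mask: torch.Tensor) -> torch.Tensor:
    """mask: (C, H, W) one-hot or soft mask; returns a randomly perturbed copy."""
    angle = random.uniform(-10, 10)                         # small random rotation (degrees)
    dx, dy = random.randint(-8, 8), random.randint(-8, 8)   # small translation (pixels)
    scale = random.uniform(0.95, 1.05)
    return TF.affine(mask, angle=angle, translate=[dx, dy], scale=scale, shear=0.0)

def make_input(rgb: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """Concatenate the RGB frame with the simulated previous-frame mask."""
    return torch.cat([rgb, fake_previous_mask(gt_mask)], dim=0)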
4 EXPERIMENTS
We train our model on the Helen dataset [10], which contains 2330 images manually labeled with 11 classes: background, skin, hair, nose, left/right eyebrows, left/right eyes, upper/lower lips and inner mouth. The models are trained on all of the labels except hair, because the hair annotations are not precise. The images are cropped with a margin of 30%-70% of the bounding box size, computed from the facial landmark annotations.
We use RMSprop as the optimizer and a softmax cross-entropy loss. We apply an initial learning rate of 0.0005, decayed by a factor of 0.1 every 40 training epochs, for a total of 190 epochs. We use ONNX as an intermediate format to convert our PyTorch model into a CoreML model, which is optimized for iOS devices. We use the Vision framework for the face detection step that precedes face parsing.
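The optimization schedule and export path above can be sketched as follows in PyTorch. The tiny stand-in network and synthetic batch are placeholders so the snippet is self-contained, and the CoreML conversion is only indicated in a comment since its API depends on the installed coremltools version.

# Sketch of the training schedule and ONNX export (PyTorch); the stand-in
# `model` is a placeholder for the mobile hourglass of Section 3.1.
import torch
import torch.nn as nn

NUM_CLASSES = 10  # Helen labels excluding hair
model = nn.Sequential(nn.Conv2d(3, NUM_CLASSES, kernel_size=3, padding=1))  # placeholder network

optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)  # x0.1 every 40 epochs
criterion = nn.CrossEntropyLoss()  # softmax cross-entropy over facial-component classes

for epoch in range(190):
    # one synthetic batch per epoch for illustration; a real loop iterates a DataLoader
    images = torch.randn(4, 3, 256, 256)
    targets = torch.randint(0, NUM_CLASSES, (4, 256, 256))
    loss = criterion(model(images), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

# Export to ONNX; the .onnx file is then converted to CoreML (e.g. with coremltools).
model.eval()
torch.onnx.export(model, torch.randn(1, 3, 256, 256), "face_parsing.onnx", opset_version=11)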
We compare our method with several well-known segmentation networks [1, 8]. The results, measured by F1-score, are reported in Table 1. The number of parameters is also listed to indicate the model size, which is critical for mobile platforms. We measure the runtime of different MobileNet block settings by varying the number of channels f and the expansion factor t. We also provide a timing profile of the face detection, array transformation and colorization steps in Table 2.
Table 1: Quantitative evaluation on the Helen dataset

Model                      Overall F-score    Num. of parameters
SegNet [1]                 92.90              29.45M
Unet [8]                   93.72              13.40M
Mobile-Hourglass-f16-t3    93.00              0.09M
Mobile-Hourglass-f16-t6    93.08              0.12M
Mobile-Hourglass-f32-t3    93.18              0.17M
Mobile-Hourglass-f32-t6    93.55              0.27M
Table 2: Run-time profiling (in ms) on iPhone X

Model                            FD    MI     Colorization    Total
Unet [8]                         7     203    54              269
Mobile-Hourglass-f16-t3          8     77     18              106
Mobile-Hourglass-f16-t6          7     75     18              103
Mobile-Hourglass-f32-t3          7     76     18              104
Mobile-Hourglass-f32-t6          7     78     18              106
Dense Mobile-Hourglass-f32-t6    7     75     18              110

FD: Face Detection; MI: Model Inference
5 CONCLUSION
We present a real-time encoder-decoder video face parsing demonstration for mobile AR. Our method should draw interest because it is simple and easy for anyone to test. More importantly:

• Face parsing is crucial for many facial editing applications, for example virtual make-up, hair dyeing, skin analysis, face morphing and reenactment.
• Our method is not limited to facial AR applications; it is also of interest for other AR applications based on fine-grained segmentation.
REFERENCES
[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[2] V. Bazarevsky and A. Tkachenka. Mobile real-time video segmentation. https://ai.googleblog.com/2018/03/mobile-real-time-video-segmentation.html, 2018.
[3] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
[4] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017.
[5] S. Liu, J. Shi, J. Liang, and M.-H. Yang. Face parsing via recurrent propagation. arXiv preprint arXiv:1708.01936, 2017.
[6] S. Liu, J. Yang, C. Huang, and M.-H. Yang. Multi-objective convolutional learning for face labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3451–3459, 2015.
[7] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499. Springer, 2016.
[8] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[10] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang. Exemplar-based face parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3484–3491, 2013.