Deep Skin Detection on Low Resolution Grayscale Images
Marco Paracchini∗∗, Marco Marcon, Federica Villa, Stefano Tubaro
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy
∗∗Corresponding author: marcobrando.paracchini@polimi.it (Marco Paracchini)
ABSTRACT
In this work we present a facial skin detection method, based on a deep learning architecture, that is able to precisely associate a skin label to each pixel of a given image depicting a face. This is an important preliminary step in many applications, such as remote photoplethysmography (rPPG), in which the heart rate of a subject needs to be estimated by analyzing a video of his/her face. The proposed method can detect skin pixels even in low resolution grayscale face images (64×32 pixels). A dataset is also described and proposed in order to train the deep learning model. Given the small amount of data available, a transfer learning approach is adopted and validated in order to learn to solve the skin detection problem exploiting a colorization network. Qualitative and quantitative results are reported testing the method on different datasets and in the presence of general illumination, facial expressions and object occlusions, and the method works regardless of the gender, age and ethnicity of the subject.
1. Introduction
Skin detection is an important preliminary task in a wide range of image processing problems. In particular, this work is driven by the development of a remote PhotoPlethysmoGraphy (rPPG) application. This kind of application aims at solving the problem of estimating the heart rate of a subject given a video stream of his/her face. Typically a signal, representing the time variation of the light intensity reflected by the skin, which is caused by the transit of blood in the vessels underneath the skin, is extracted by averaging the color/intensity value of some selected pixels in each frame. This signal is then analyzed in order to estimate the heart rate of the subject and/or other biomedical measurements. Many rPPG applications (Rouast et al., 2017) estimate the face regions in which to extract the signal using a combination of classical face detection methods, such as Viola and Jones (2001), and fixed proportions in order to select specific parts of the face, e.g. typically the forehead. This procedure is not optimal since the skin in preselected parts of the face might not be visible due to occlusions by hair, wearable objects or other elements. Furthermore, skin segmentation based on a predefined template suffers from errors of the face detection phase
and/or from the intrinsic variance of face shapes. Moreover, due to the high variability of the subject pose, motion blur, age, ethnicity, hair, facial hair, wearable objects, etc., the first step of a rPPG application (i.e. selecting the face region in which to extract the signal) is not trivial, and errors in this step could heavily compromise the final heart rate estimation. The majority of rPPG applications (Rouast et al., 2017) utilize a standard RGB camera, based on CMOS or CCD technologies, in order to acquire the video stream. The goal of this work is to propose a skin detection algorithm able to work also when applied to images acquired using SPAD (i.e. Single-Photon Avalanche Diode) array cameras. This kind of camera is capable of detecting even a single photon (Bronzi et al., 2016a), has an extremely high frame rate (Bronzi et al., 2014) and has proved to be useful in a very large range of applications (Bronzi et al., 2016b), such as 3D optical ranging (LIDAR), Positron Emission Tomography (PET) and many others. In some rPPG works (Paracchini et al., 2019) SPAD cameras are used instead of traditional ones, since their high precision is useful for accurately measuring the skin intensity fluctuations produced by the blood flow. On the other hand, due to the complexity of the SPAD sensor, this kind of camera has a very small spatial resolution, 64×32 in Bronzi et al. (2014), and produces grayscale intensity images, since the low spatial resolution does not allow the use of Bayer filters. In this work we propose an automatic method, based on deep learning, with the aim of solving the task of detecting skin pixels in face images. Furthermore, the proposed method is designed to work with low resolution grayscale images such as the one obtained using a SPAD array camera (Bronzi et al., 2014).
The rest of the paper is organized as follows: in Sec. 2 a brief state of the art review on skin detection is reported, highlighting the peculiarity of the problem addressed in this work; in Sec. 3 the proposed method is described, while in Sec. 4 the training procedure that exploits transfer learning is illustrated; qualitative and quantitative results are shown in Sec. 5 and finally in Sec. 6 the contributions of this work are highlighted.
2. Related work
The skin detection problem is usually tackled using color information and exploiting the fact that skin-tone colors share some common properties defined in particular color spaces (Kawulok et al., 2014). After applying the optimal color space transformation it is possible to define rules to discriminate between skin pixels and other materials. Since this kind of method is based on color information, it obviously requires color (RGB) images to be applied to. As stated in Sec. 1, due to the choice of developing a method able to work with SPAD camera output (grayscale), this class of methods cannot be applied to this specific problem. Moreover, they have no way to discriminate between the face and other body parts, and this could be a problem in rPPG in which, due to the blood flow dynamics in the body, different body parts could carry different information (i.e. time-shifted signals). An extensive review of color based skin segmentation methods can be found in Kakumanu et al. (2007). Some skin detection methods able to work with grayscale images exist, e.g. Sarkar et al. (2017), but they achieve good results only when working with high resolution images since they learn local texture characteristics. Another problem, related to the one described in Sec. 1, is face parsing or face segmentation, which is the problem of analyzing an input image of a face and densely segmenting it into different regions corresponding to different face parts and the background (Zhou et al., 2017).
This is performed by labeling pixels in a dense fashion, i.e. a label is assigned to each pixel. In recent years, many deep learning methods have been proposed to solve this kind of problem, e.g. Liu et al. (2017), Nirkin et al. (2017) and Zhou et al. (2017), exploiting the promising results achieved by neural network based methods in semantic segmentation (Guo et al., 2018). Even though this problem is very similar to the one tackled in this paper (the latter can be viewed as a simplified segmentation problem with just two classes, i.e. skin and other), some differences exist in the definition of the two problems. In fact, in face parsing methods, wearable objects such as glasses and sunglasses, or facial hair, are not separated from the face region in which they are present, making this kind of method unsuitable for the skin detection problem. Moreover, methods such as the ones proposed in Zhou et al. (2017) and Liu et al. (2017) work on high resolution color images. To the best of our knowledge no other method specifically designed to solve the skin detection problem on low resolution grayscale images exists in the state of the art.
Table 1. Pretraining network architecture (Baldassarre et al., 2017). Blue layers are used for transfer learning.

Encoder                          Decoder
Layer                 Kernels    Layer      Kernels
Conv. (str. 2x2)      64x3x3     Conv.      128x3x3
Conv.                 128x3x3    Upsamp.    2x2
Conv. (str. 2x2)      128x3x3    Conv.      64x3x3
Conv.                 256x3x3    Conv.      64x3x3
Conv. (str. 2x2)      256x3x3    Upsamp.    2x2
Conv.                 512x3x3    Conv.      32x3x3
Conv.                 512x3x3    Conv.      2x3x3
Conv.                 256x3x3    Upsamp.    2x2
fusion
Conv.                 256x1x1
3. Proposed method
As described in Sec. 2, deep learning based methods represent the state of the art for segmentation problems and they usually require a massive amount of data. On the other hand, due to the uniqueness of the skin detection problem, the amount of data available is very limited. For this reason a transfer learning (Pan and Yang, 2010) procedure was adopted. In particular, a colorization network (Baldassarre et al., 2017) was adapted for the skin detection task. The main reason behind the choice of exploiting a colorization method as a starting point for the proposed network is the empirical observation that a method of this kind, applied to a grayscale image that depicts a face, correctly colorizes each skin pixel with the proper skin color. This means that the network must have learned a way to discriminate between skin pixels and pixels that depict other objects. Furthermore, both the skin detection problem described in Sec. 1 and the colorization problem share the same kind of input, i.e. a grayscale image. Moreover, collecting training data in order to train a colorization network is trivial and the problem can be seen as a self-supervised one. The driving idea is to propose slight changes to the colorization network in order to be able to transfer as much knowledge as possible from the colorization task to the skin segmentation one and then use a fine-tuning approach.
3.1. Network topology
The colorization network presented in Baldassarre et al. (2017), whose architecture is reported in Tab. 1, is based on a convolutional autoencoder with an auxiliary parallel branch. This additional branch, starting from the input image, exploits the first layers of a pretrained Inception-ResNet-v2 (Szegedy et al., 2016) in order to extract a vectorized representation of the image semantics. This vector is then merged with the encoded representation of the main branch before performing the decoding part. In particular this operation is performed to help the colorization method better understand the scene depicted in the input image, in order to colorize more precisely a large variety of objects and scenes. On the other hand, this auxiliary branch
Fig. 1. Proposed network topology. The green layers and the last one are trained from scratch, while for the blue ones the knowledge is transferred from a colorization network. The number under each layer indicates the dimension of its output (number of filters).
is totally unnecessary when the input images are a priori known to contain just a single human face. Although its role was crucial in the Baldassarre et al. (2017) approach, in the proposed network this additional branch was completely removed, providing a simpler and more suitable architecture. Another major difference between the proposed network topology and the one proposed in Baldassarre et al. (2017) resides in the output layer dimension. In particular, the original colorization network outputs a two-channel image corresponding to the a* and b* channels of the L*a*b* (Robertson, 1976) color representation of the image (L* being the luminance, i.e. the grayscale input image). On the other hand, the proposed network needs to output a single channel image (called mask in the rest of the paper) with each pixel value ŷ_ij ∈ [0, 1]. This is achieved by substituting the last activation function with a sigmoid function. In particular, for each pixel of the output mask, its value represents the probability, attributed by the network, that the input image contains a skin pixel at that particular location. As reported in Fig. 1, the encoding part of the network is composed of 8 convolutional layers, with 3x3 kernels and ReLU activation functions, and 3 max pooling layers that reduce the spatial dimension at the last encoding layer to 1/8 of the original input dimension. On the other hand, the decoding part is composed of 6 layers with 3x3 kernels and ReLU activation functions (except the last one, which uses a sigmoid function in order to output values in [0, 1]) coupled with upconvolutional layers to increase the spatial dimension back to that of the input. In Fig. 1 the layers colored in blue are the ones whose weights are transferred from the colorization network, while the other ones are trained from scratch. The central ones (with output depth 256 and 128) need to be trained with no prior information due to the removal of the colorization fusion layer. The next two have input and output shapes as in Baldassarre et al. (2017), so their weight values are propagated. Finally, the last ones, since they are introduced to solve the skin detection problem, are randomly initialized.
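To make the shapes involved concrete, the following is a minimal Keras sketch of an encoder-decoder of this form. It is only an approximation of Fig. 1 under stated assumptions, not the authors' implementation: the channel widths follow Table 1, downsampling is done here with strided convolutions, and the exact placement of the upsampling layers is assumed.

```python
from tensorflow.keras import layers, Model

def build_skin_net(input_shape=(128, 128, 1)):
    """Encoder-decoder sketch: grayscale input, per-pixel skin probability output."""
    x_in = layers.Input(shape=input_shape)

    # Encoder: 3x3 convolutions with ReLU; spatial size reduced to 1/8 of the input.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x_in)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)

    # Decoder: the 256- and 128-filter layers replace the removed fusion branch,
    # the two 64-filter layers reuse colorization weights, the remaining ones are new.
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    # Single-channel sigmoid output: the per-pixel skin probability mask.
    mask = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    return Model(x_in, mask, name="skin_detection_net")
```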
4. Training procedure
As stated in Sec. 3, the training procedure adopted to estimate the optimal network parameters is based on a transfer learning approach. This implies that, before training the proposed network, the colorization network described in Baldassarre et al. (2017) needs to be trained on an appropriately chosen dataset. The same loss function and optimization algorithm as the ones described in the original work were used, i.e. the mean square error, between the ground truth color image and the one reconstructed by the network, and Adam (Kingma and Ba, 2014) respectively; further information on the colorization network training is reported in Baldassarre et al. (2017). As a training set, on which to perform the colorization training step, a collection of color images depicting faces downloaded from the internet was added to the Labeled Faces in the Wild (LFW) dataset (Huang et al., 2007), reaching 20000 images, doubled using data augmentation (horizontal flipping). In order to perform the colorization network training step, the color images were used as the desired ground truth output while a grayscale representation of them was used as the network input. The model was trained exploiting the implementation described in the original paper (Baldassarre et al., 2017), which synergistically uses Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2015). After performing the training of the colorization method, the layers shared between the two networks (the ones colored in blue in Fig. 1) were frozen (i.e. set to not trainable) and their weight values set to the corresponding ones obtained from the colorization training step described above. The other ones were randomly initialized. The network was trained using Keras (Chollet et al., 2015), with TensorFlow (Abadi et al., 2015) as backend, with the Adam optimization algorithm (Kingma and Ba, 2014) and a learning rate of 0.0005. The loss function and the datasets used are described in Sec. 4.1 and Sec. 4.2 respectively. After a sufficient number of epochs (50), a fine-tuning step was finally performed in which all the layers were trained for an additional 100 epochs on the same training set and with the same training conditions.
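The two training stages can be outlined in Keras as follows. This is only an illustrative sketch: the assumption that the transferred ("blue") layers can be identified by name, the dataset objects train_ds and val_ds, and the default loss are our own placeholders; in the paper the loss is the asymmetric one of Sec. 4.1 (sketched there).

```python
from tensorflow.keras.optimizers import Adam

def train_two_stage(model, train_ds, val_ds, transferred_layer_names, loss_fn="mse"):
    """Stage 1: train only the new layers; Stage 2: fine-tune everything.
    loss_fn should be the asymmetric weighted MSE of Sec. 4.1;
    "mse" is only a runnable placeholder."""
    # Stage 1: freeze the layers inherited from the colorization network (50 epochs).
    for layer in model.layers:
        if layer.name in transferred_layer_names:
            layer.trainable = False
    model.compile(optimizer=Adam(learning_rate=5e-4), loss=loss_fn)
    model.fit(train_ds, validation_data=val_ds, epochs=50)

    # Stage 2: unfreeze all layers and fine-tune for 100 additional epochs.
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=Adam(learning_rate=5e-4), loss=loss_fn)
    model.fit(train_ds, validation_data=val_ds, epochs=100)
    return model
```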
4.1. Loss function
Regarding the loss function, the mean square error could be sufficient to train the network to perform the skin segmentation task. On the other hand, considering the main motivation that drives the building of this network (i.e. the rPPG application described in Sec. 1), false negative and false positive errors should not have the same weight in the loss function computation. In particular, in order to estimate the heart rate of a subject, it is not strictly necessary to consider all visible skin pixels whilst, on the other hand, labeling as skin a pixel depicting other tissues or materials could have an important negative impact on the final estimation. For this reason, given a predicted mask ŷ obtained by applying the proposed network to an input grayscale image x having a ground truth mask y with elements y_ij ∈ {0, 1}, we define the loss function as:

E(y, \hat{y}) = \sum_{ij} (y_{ij} - \hat{y}_{ij})^2 \, (\alpha \, y_{ij} + (1 - \alpha)(1 - y_{ij}))    (1)

where α ∈ [0, 1] is a parameter introduced in order to make E asymmetric. We choose a value of α smaller than 0.5, e.g. 0.4, in order to penalize false positive errors (i.e. ŷ_ij = 1 with y_ij = 0).
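A minimal TensorFlow sketch of Eq. (1), written as a Keras-compatible loss, could look as follows; the per-image reduction (a sum over pixels) follows the equation, while the function name and the handling of the batch dimension are our own conventions.

```python
import tensorflow as tf

def asymmetric_mse(alpha=0.4):
    """Weighted MSE of Eq. (1): alpha < 0.5 penalizes false positives more."""
    def loss(y_true, y_pred):
        # Weight alpha on ground-truth skin pixels, (1 - alpha) on non-skin pixels.
        weights = alpha * y_true + (1.0 - alpha) * (1.0 - y_true)
        # Sum over the spatial and channel dimensions of each mask in the batch.
        return tf.reduce_sum(tf.square(y_true - y_pred) * weights, axis=[1, 2, 3])
    return loss
```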
4.2. Datasets
To the best of our knowledge, there is no dataset available specifically created for the purpose of solving the facial skin segmentation problem. Some skin detection datasets exist, e.g. Tan et al. (2014), but they feature images with multiple people and annotations including other body parts, which makes them not usable for this particular problem. Moreover, the number of images in these datasets is extremely low, e.g. 78 images are present in Tan et al. (2014), and insufficient to train deep methods. For this reason, we chose to adapt two already existing datasets, i.e. MUCT (Milborrow et al., 2010) and Helen (Zhou et al., 2013), consisting of RGB face images annotated with landmark locations, in order to produce facial grayscale images associated with skin masks. In particular, both datasets provide diversity in lighting, pose, age and ethnicity of the subjects. Moreover, the images belonging to the MUCT dataset are acquired in a controlled environment whilst the Helen ones are captured in the wild. A more detailed description of the processing performed on the two datasets is given in the following sections (Sec. 4.2.1 and Sec. 4.2.2 for the MUCT and Helen datasets respectively).
4.2.1. MUCT dataset
As described in Milborrow et al. (2010), the MUCT dataset consists of 3755 images (each one with a resolution of 640x480 pixels) captured from 276 subjects. Each image depicts a single face with a homogeneous blue background and is associated with the pixel coordinates of 76 manually annotated facial landmarks. During the photo acquisition, in order to increase the dataset variety, five different camera views and three different lighting sets were used. The landmarks provided are relative to the lower face contour, eyes, eyebrows, nose and mouth. Starting from these landmark positions, for each image a mask is produced considering a filled polygon shape with corners given by the jaw/chin contour points and the upper contour of the eyebrows. The eye, eyebrow and mouth regions are then removed from the mask using the corresponding given contours. Unfortunately, as in the majority of facial landmark datasets, no upper face contour annotation (skin/hair contour) is provided in this dataset. In order to extend the obtained masks to the forehead region, a color similarity method has been used, exploiting the RGB channel information (the color information is indeed available in this preprocessing step for the creation of the dataset, but is not available in the network training step). In particular, a rectangular region above the eyebrows is considered; the pixels in that region are clustered into 3 sets using a K-means algorithm and, using the Euclidean distance in RGB space, the pixels belonging to hair or other occluding objects are rejected, as sketched below. This method, being automatic and based on color similarity, inevitably introduces some errors in the pixel labeling and produces worse results compared to manual annotation, which is unfortunately unavailable. Moreover, in this dataset binary information on the presence of glasses is provided, although their position inside the image is not available. In order to remove the glasses region from the mask, two rectangles of fixed size centered around the eyes are subtracted from the mask. Lastly, in the original dataset facial hair is not labeled and, in order to remove it from the mask, an approach similar to the one adopted for the forehead region is applied to the lower part of the face and only on male subjects (gender labels are available in the original dataset).
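The forehead-extension step can be sketched as follows. The clustering into 3 sets via K-means in RGB space follows the description above, while the rejection rule used here (keeping the cluster whose centroid is closest to a reference skin color sampled from the already labeled face region) and all names are our own assumptions, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def forehead_skin_mask(region_rgb, reference_skin_rgb):
    """Cluster a rectangular forehead region into 3 RGB clusters and keep the skin one."""
    h, w, _ = region_rgb.shape
    pixels = region_rgb.reshape(-1, 3).astype(np.float32)
    kmeans = KMeans(n_clusters=3, n_init=10).fit(pixels)
    # Euclidean distance in RGB space between each centroid and the reference skin color.
    reference = np.asarray(reference_skin_rgb, dtype=np.float32)
    dists = np.linalg.norm(kmeans.cluster_centers_ - reference, axis=1)
    skin_cluster = int(np.argmin(dists))
    # Pixels in the other clusters (hair, occluding objects) are rejected.
    return (kmeans.labels_ == skin_cluster).reshape(h, w)
```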
4.2.2. Helen dataset
The Helen dataset features 2330 high quality, real world photographs of a large variety of people. Each image, each with a different resolution (from less than 1 Mpixel up to 12 Mpixels), is densely annotated with landmark locations. Moreover, it has been used for face parsing works (Smith et al., 2013) in which an accurate face segmentation annotation for different parts of the face has been provided. The masks needed for the skin segmentation problem are simply built by combining different segmentation regions. Unfortunately, in the Helen dataset many images feature more than one visible face, while just one face is annotated in each image. Training a CNN on this data could compromise its performance due to inconsistent annotation. In order to avoid this problem, a simple state of the art face detector (Viola and Jones, 2001) is run on each image of the dataset. Since, even in the presence of multiple faces, the ground-truth annotation always refers to just one of them, new images were created by cropping the original images to regions centered around the annotated face, as sketched below. This step, being performed automatically, inevitably introduces some errors. Lastly, also in this dataset a facial hair annotation is unavailable, and the same method used for the MUCT dataset was implemented in order to remove beard regions from the masks.
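The cropping step can be sketched with OpenCV as follows; the Viola-Jones detector matches the description above, while the rule used to pick the annotated face (the detection closest to the landmark centroid) and the function names are illustrative assumptions.

```python
import cv2
import numpy as np

def crop_annotated_face(image_bgr, landmarks_xy):
    """Run a Viola-Jones face detector and crop around the annotated face."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return image_bgr  # fall back to the full image if nothing is detected
    center = np.asarray(landmarks_xy, dtype=np.float32).mean(axis=0)

    # Keep the detection whose center is closest to the annotated landmarks.
    def box_center(box):
        x, y, w, h = box
        return np.array([x + w / 2.0, y + h / 2.0])

    x, y, w, h = min(faces, key=lambda b: np.linalg.norm(box_center(b) - center))
    return image_bgr[y:y + h, x:x + w]
```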
Fig. 2. Examples of images in the proposed dataset with the skin mask superimposed, originally from the MUCT and the Helen datasets respectively.
4.2.3. Complete dataset
The complete dataset is built by merging the two datasets obtained as described in Sec. 4.2.1 and Sec. 4.2.2, resulting in roughly 6000 grayscale face images (converted from the original RGB images), each associated with a skin labeling mask. Two examples of images in the final dataset are reported in Fig. 2: on the left an image originally from the MUCT dataset and on the right one from the Helen dataset; the ground truth skin mask is superimposed in pink. Moreover, in order to better approximate the test conditions (images coming from low spatial resolution devices, such as SPAD cameras), the grayscale images were downsampled to 64x64 (adding black borders if necessary) and then upsampled to 128x128 using bicubic interpolation. The training/testing data split was obtained by selecting 100 images (50 for each original dataset, randomly selected from MUCT and selected in the same way as Zhou et al. (2013) for Helen) for building the testing set. In order to ensure fair skin detection results, all the images belonging to the test set were checked manually and the annotations were corrected if needed. Subsequently, a horizontally flipped version of each training image is added in order to perform data augmentation. Finally, a validation set is created by randomly selecting 10% of the training set.
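The low-resolution simulation described above can be sketched as follows; exactly how the black border is added before downsampling is an assumption, and the function names are illustrative rather than taken from the released code.

```python
import cv2
import numpy as np

def simulate_low_resolution(rgb_image):
    """Grayscale conversion, 64x64 downsampling with black padding, bicubic upsampling to 128x128."""
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)
    h, w = gray.shape
    scale = 64.0 / max(h, w)
    resized = cv2.resize(gray, (max(1, round(w * scale)), max(1, round(h * scale))),
                         interpolation=cv2.INTER_AREA)
    canvas = np.zeros((64, 64), dtype=resized.dtype)  # black border where needed
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return cv2.resize(canvas, (128, 128), interpolation=cv2.INTER_CUBIC)

def augment(image, mask):
    """Horizontal flip applied consistently to the image and its skin mask."""
    return np.fliplr(image), np.fliplr(mask)
```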
5. Results
The proposed method was trained as described in Sec. 4 us-
ing the training set described in Sec. 4.2.3. In this section some
results are reported highlighting the necessity for the transfer
learning approach and the accuracy of the obtained method, in
Sec. 5.1 and Sec. 5.2 respectively.
5.1. Training with transfer learning
The learning curves of the skin detection network, obtained following the training procedure described in Sec. 4, are reported in Fig. 3. In particular, red curves are related to the loss error calculated on the whole training set at each epoch, while blue ones are obtained on the validation set. As described in Sec. 4, following a transfer learning approach, in the first part of the training (first 50 epochs) the majority of the layers are kept frozen, as described in Sec. 3.1, preserving the weight values inherited from the colorization network trained in a preliminary step. This allows the network to
Fig. 3. Loss values during the training. Red lines represent loss values at each epoch on the training set, while blue ones are obtained on the validation one. Dashed lines are related to training directly on the skin detection problem with random initialization.
quickly adapt to the skin detection problem, as can be seen in the first part of the solid red and blue curves reported in Fig. 3. On the other hand, since the colorization and skin detection problems are related but different, an additional fine-tuning step is necessary in order to further specialize the network to solve the skin detection problem. The effect of the fine tuning is clearly visible in Fig. 3, in which both solid curves have a sharp decay after the dashed vertical gray line (fine tuning begin point). The importance of the transfer learning approach can also be observed in Fig. 3, in which the red and blue dashed lines represent respectively the training and the validation loss obtained without using the colorization network weights as the initialization. In this case the training almost immediately collapses to the trivial solution of producing masks with just zero values. Once the model reaches this point, the training is not able to converge to other more interesting solutions. The same trivial result is obtained in all the training runs executed, regardless of the random initialization and hyperparameter settings. As can be observed from Fig. 3, the two-step approach is able to drive the model training to a non-trivial solution, reaching a more interesting minimum of the loss function.
5.2. Skin detection accuracy
5.2.1. Quantitative Results
The proposed method was tested on the 100 image test set described in Sec. 4.2.3, resulting in a test loss value of 0.012 between the output masks (values ∈ [0, 1]) and the ground truth ones (values ∈ {0, 1}). ROC curves for the per-pixel skin classification task are reported in Fig. 4. In particular, the proposed method achieved the best results on the test images originally belonging to the MUCT dataset (green line), given the lower variability of the data, as described in Sec. 4.2.1. Considering the complete test set curve (red line), the best working point has a true positive rate (i.e. recall) of 89.8% with just a 3.0% false positive rate (i.e. fallout), thanks to the asymmetrical loss function
Table 2. Comparison between the proposed method and Nirkin et al. (2017) based on intersection over union (IOU) and F-score results obtained on the MUCT, Helen and complete test sets. The second line shows results obtained combining Nirkin et al. (2017) and ground-truth masks in order to exclude the eyes, eyebrows and mouth regions.

                               IOU                       F-score
Method                     MUCT  Helen  Complete     MUCT  Helen  Complete
Nirkin et al. (2017)        70    56     63           82    71     76
Nirkin et al. (2017) + GT    -    62      -            -    76      -
Proposed method             78    69     73           87    81     84
Fig. 4. Skin classification ROC curves obtained with the proposed method
on the complete test set (red), MUCT test subset (green) and Helen test
subset (blue).
defined in Sec. 4.1. As explained in Sec. 2, other methods for facial skin detection on grayscale low resolution images are rare or non-existing. However, a quantitative comparison between the proposed method and facial segmentation ones can be made. In particular, we selected the facial segmentation method proposed in Nirkin et al. (2017) since, like ours, it can work with occluded faces. Since the method in Nirkin et al. (2017) produces masks that contain the eyebrow, eye and mouth regions, we also tested the accuracy of this method combined with ground-truth information of these regions. In particular, we removed from the mask obtained from Nirkin et al. (2017) the ground-truth mask of the unwanted regions (thus assuming a perfect estimation of them). We compared the three methods (the one proposed by us and the one in Nirkin et al. (2017) with and without ground-truth information) adopting the Intersection Over Union (IOU) and F-score metrics. In particular, the F-score is defined as the harmonic mean between precision and recall. The results obtained are summarized in Tab. 2; as can be observed, even using the ground-truth information for the eyes, eyebrows and mouth regions, the proposed method produces more accurate results, achieving an IOU of 73% and an F-score of 84% on the complete test set.
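For reference, the two metrics can be computed for a pair of binary masks as in the straightforward NumPy sketch below; this is not the authors' evaluation script.

```python
import numpy as np

def iou_and_fscore(pred_mask, gt_mask):
    """IOU and F-score (harmonic mean of precision and recall) for binary masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return iou, fscore
```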
5.2.2. Qualitative Results
Some qualitative results with various images belonging to the test set are shown in Fig. 5 and Fig. 6, where the returned skin mask is superimposed on the input image using a pink color. Fig. 5 reports some results on images originally belonging to the Helen dataset, while Fig. 6 shows other results using images initially in the MUCT dataset. As can be observed, the proposed skin detection method is able to produce qualitatively good results even in the presence of non-frontal faces, in-plane rotations, different head shapes and sizes, expressions, hair occlusions, glasses and other wearable objects. The beard is not always properly rejected, especially if it has an intensity similar to the subject's skin. Moreover, in Fig. 7 some masks obtained with the proposed method are superimposed on input images acquired by the SPAD array camera. These results are particularly promising since these images belong to a very different dataset with respect to the one used for training, and are even acquired with a different technology. As can be seen in Fig. 7, the network is able to generalize and produce good quality results even on images acquired in different conditions compared to the training dataset, and even in the presence of different expressions, poses and heavy occlusions.
5.3. Real time performance
We evaluated the time performance of our method by executing it on the test set described in Sec. 4.2.3, achieving an execution time of 6.6 milliseconds for each image, corresponding to 152 fps. We obtained this result with a TensorFlow (Abadi et al., 2015) implementation of the network executed on an Nvidia Titan Xp GPU.
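A simple way to reproduce this kind of per-image timing with a Keras model is sketched below; this is not the authors' benchmark harness, and warm-up, batching and data-transfer details are deliberately simplified.

```python
import time

def mean_inference_ms(model, images):
    """Average per-image prediction time in milliseconds."""
    model.predict(images[:1], verbose=0)  # warm-up run (e.g. GPU initialization)
    start = time.perf_counter()
    model.predict(images, batch_size=1, verbose=0)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(images)
```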
5.4. Hidden layer output visualization
In Fig. 8 a visualization of the knowledge acquired by the network is reported. In particular, Fig. 8 visualizes the output of the decoder's second hidden layer when the network is run on an image acquired by the SPAD camera: the left picture in Fig. 7. As can be observed, after the training some filters specialized in detecting particular facial features relevant to the skin detection problem (e.g. the eyes, 4th and 6th columns of the 5th row), the background (6th column of the 1st and 2nd rows), the face contour (5th column of the 1st row) and finally the skin (6th column of the 3rd row, 1st column of the 5th row and 5th column of the 7th row).
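Intermediate activations like those in Fig. 8 can be extracted in Keras by building a sub-model that ends at the chosen hidden layer; the layer selection below is illustrative and not tied to the exact layer naming of the released model.

```python
import numpy as np
from tensorflow.keras import Model

def hidden_activations(model, gray_image, layer_name):
    """Return the feature maps of a hidden layer for a single grayscale image (H, W)."""
    sub_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
    batch = np.asarray(gray_image, dtype=np.float32)[None, ..., None]  # add batch/channel dims
    return sub_model.predict(batch, verbose=0)[0]
```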
Fig. 5. Some qualitative results on images in the test set belonging originally
to the Helen dataset.
6. Conclusions
In this paper, we presented a deep learning based method proposed in order to solve the facial skin detection problem on low-resolution grayscale images, motivated by a rPPG application (as described in Sec. 1). Analyzing the state of the art of similar problems, in Sec. 2, we showed the peculiarity of the addressed problem and how, to the best of our knowledge, the method described in this work is the first proposed specifically to solve it. Given the similarity between this problem and a semantic segmentation one, and the good accuracy achieved by neural network methods in this field, a deep learning based method was proposed. On the other hand, this kind of method needs a massive amount of data to be trained on. Since the facial skin detection problem is very specific, unfortunately only a limited amount of data is available for it. For this reason a transfer learning approach was adopted in the training phase. In particular, the proposed network architecture was chosen in order to have the majority of layers in common with the convolutional neural network proposed to solve the grayscale image colorization problem (Baldassarre et al., 2017). The similarities between the two problems are described in Sec. 3.1. As described in Sec. 4, the adopted transfer learning strategy was the following: firstly, the colorization method was trained on a large dataset of unlabeled face images; the proposed network was subsequently trained starting from the colorization network weights and minimizing an asymmetric loss function, described in Sec. 4.1,
Fig. 6. Some qualitative results on images in the test set belonging originally
to the MUCT dataset.
Fig. 7. Qualitative results on three face images acquired by the SPAD cam-
era.
on a novel dataset obtained using two freely available datasets (as described in Sec. 4.2). Lastly, in Sec. 5.1 this training procedure has been justified, showing that, without using it, it would be impossible to train the proposed network with the few data available. In addition, in Sec. 5.2 some quantitative results were reported, providing an accuracy evaluation for the proposed skin detection method and comparisons with a state of the art face segmentation method. Moreover, many network outputs were shown both for images acquired in conditions similar to the ones used to build the training set and for images completely independent from the training set, acquired with the SPAD camera. Both these results show how the proposed method is able to achieve good quantitative and qualitative results in the skin detection problem even in the presence of different poses, ages, expressions, ethnicities, wearable objects and other occlusions. The trained model, along with the image labels created by the authors of this work, is made available at the link https://github.com/marcobrando/Deep-Skin-Detection-on-Low-Resolution-Grayscale-Images.
Fig. 8. Visual representation of the activations of the second hidden layer
in the decoder stage when tested on a face image acquired by the SPAD
camera.
Acknowledgments
Funding for this work was provided by the H2020 European project DEIS (EU grant agreement No 732242).
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. URL: https://www.tensorflow.org/. Software available from tensorflow.org.
Baldassarre, F., Gonzalez-Morin, D., Rodes-Guirao, L., 2017. Deep koalarization: Image colorization using CNNs and Inception-ResNet-v2. ArXiv:1712.03400. URL: https://arxiv.org/abs/1712.03400.
Bronzi, D., Villa, F., Tisa, S., Tosi, A., Zappa, F., 2016a. SPAD figures of merit for photon-counting, photon-timing, and imaging applications: A review. IEEE Sensors Journal 16, 3–12. URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7283534, doi:10.1109/JSEN.2015.2483565.
Bronzi, D., Villa, F., Tisa, S., Tosi, A., Zappa, F., Durini, D., Weyers, S., Brock-
herde, W., 2014. 100 000 frames/s 64 x 32 single-photon detector array for
2-D imaging and 3-D ranging. IEEE Journal of Selected Topics in Quantum
Electronics 20, 354–363. doi:10.1109/JSTQE.2014.2341562.
Bronzi, D., Zou, Y., Villa, F., Tisa, S., Tosi, A., Zappa, F., 2016b. Automotive three-dimensional vision through a single-photon counting SPAD camera. IEEE Transactions on Intelligent Transportation Systems 17, 782–795. doi:10.1109/TITS.2015.2482601.
Chollet, F., et al., 2015. Keras. https://github.com/fchollet/keras.
Guo, Y., Liu, Y., Georgiou, T., Lew, M.S., 2018. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval 7, 87–93. URL: https://doi.org/10.1007/s13735-017-0141-z, doi:10.1007/s13735-017-0141-z.
Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E., 2007. Labeled Faces in
the Wild: A Database for Studying Face Recognition in Unconstrained En-
vironments. Technical Report 07-49. University of Massachusetts, Amherst.
Kakumanu, P., Makrogiannis, S., Bourbakis, N., 2007. A survey of skin-
color modeling and detection methods. Pattern Recogn. 40, 1106–
1122. URL: http://dx.doi.org/10.1016/j.patcog.2006.06.010,
doi:10.1016/j.patcog.2006.06.010.
Kawulok, M., Kawulok, J., Nalepa, J., 2014. Spatial-based skin detection
using discriminative skin-presence features. Pattern Recogn. Lett. 41,
3–13. URL: http://dx.doi.org/10.1016/j.patrec.2013.08.028,
doi:10.1016/j.patrec.2013.08.028.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. URL: http://dblp.uni-trier.de/db/journals/corr/corr1412.html#KingmaB14.
Liu, S., Shi, J., Liang, J., Yang, M.H., 2017. Face parsing via recurrent propa-
gation. CoRR abs/1708.01936.
Milborrow, S., Morkel, J., Nicolls, F., 2010. The MUCT Landmarked Face Database. Pattern Recognition Association of South Africa.
Nirkin, Y., Masi, I., Tran, A.T., Hassner, T., Medioni, G., 2017. On face seg-
mentation, face swapping, and face perception. arXiv:1704.06729.
Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Trans. on
Knowl. and Data Eng. 22, 1345–1359. URL: http://dx.doi.org/10.
1109/TKDE.2009.191, doi:10.1109/TKDE.2009.191.
Paracchini, M., Marchesi, L., Pasquinelli, K., Marcon, M., Fontana, G.,
Gabrielli, A., Villa, F., 2019. Remote photoplethysmography using spad
camera for automotive health monitoring application, in: 2019 AEIT In-
ternational Conference of Electrical and Electronic Technologies for Au-
tomotive (AEIT AUTOMOTIVE), pp. 1–6. doi:10.23919/EETA.2019.
8804516.
Robertson, A.R., 1976. The CIE 1976 color-difference formulae. Color Research & Application 2, 7–11. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/j.1520-6378.1977.tb00104.x, doi:10.1002/j.1520-6378.1977.tb00104.x.
Rouast, P.V., Adam, M.T.P., Chiong, R., Cornforth, D., Lux, E., 2017. Remote heart rate measurement using low-cost RGB face video: a technical literature review. Frontiers of Computer Science. URL: https://doi.org/10.1007/s11704-016-6243-6, doi:10.1007/s11704-016-6243-6.
Sarkar, A., Abbott, A.L., Doerzaph, Z., 2017. Universal skin detection without
color information, in: 2017 IEEE Winter Conference on Applications of
Computer Vision (WACV), pp. 20–28. doi:10.1109/WACV.2017.10.
Smith, B.M., Zhang, L., Brandt, J., Lin, Z., Yang, J., 2013. Exemplar-based
face parsing, in: CVPR, IEEE Computer Society. pp. 3484–3491.
Szegedy, C., Ioffe, S., Vanhoucke, V., 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261. URL: http://dblp.uni-trier.de/db/journals/corr/corr1602.html#SzegedyIV16.
Tan, W.R., Chan, C.S., Pratheepan, Y., Condell, J., 2014. A fusion approach for efficient human skin detection. CoRR abs/1410.3751.
Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade
of simple features, in: Proceedings of the 2001 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition. CVPR 2001, pp.
I–511–I–518 vol.1.
Zhou, F., Brandt, J., Lin, Z., 2013. Exemplar-based graph matching for robust
facial landmark localization, in: IEEE International Conference on Com-
puter Vision (ICCV).
Zhou, L., Liu, Z., He, X., 2017. Face parsing via a fully-convolutional contin-
uous CRF neural network. CoRR abs/1708.03736.