Face Attribute Detection with MobileNetV2 and
NasNet-Mobile
Frerk Saxen, Philipp Werner, Sebastian Handrich, Ehsan Othman, Laslo Dinges, Ayoub Al-Hamadi
Faculty of Electrical Engineering and Information Technology, Neuro-Information Technology
Otto von Guericke University
Magdeburg, Germany
Frerk.Saxen@ovgu.de
Abstract—In this paper, we propose two simple yet effective methods to estimate facial attributes in unconstrained images. We use a straightforward and fast face alignment technique for preprocessing and estimate the face attributes using MobileNetV2 and Nasnet-Mobile, two lightweight CNN (Convolutional Neural Network) architectures. Both architectures perform similarly well in terms of accuracy and speed. A comparison with state-of-the-art methods with respect to processing time and accuracy shows that our proposed approaches perform faster than the best state-of-the-art model and better than the fastest state-of-the-art model. Moreover, our approach is easy to use and capable of being deployed on mobile devices.
Index Terms—Mobile face attribute detection, MobileNetV2,
Nasnet-Mobile
I. INTRODUCTION
Estimating human face attributes is important for several applications (e.g. face retrieval, social media, or video surveillance [1]). The estimation, however, is difficult due to vast changes in the appearance and shape of different attributes, out-of-plane head rotations, and difficult lighting conditions in unconstrained settings. Many applications also require accurate and fast solutions that run on resource-constrained systems such as mobile devices, e.g. when images cannot be uploaded to the cloud for recognition due to privacy reasons or a lack of Internet connection.
In this work, we discuss previous work on facial attribute detection and evaluate two lightweight CNN architectures with respect to performance and speed, using a straightforward methodology that can be implemented on mobile devices.
A. Related Work
During the last decade, multiple approaches have been proposed for face attribute estimation. As in other domains, deep learning approaches have surpassed traditional methods with the emergence of huge datasets like CelebA [2].
LNets+ANet: Liu et al. [2] propose a combined face and attribute detection framework (similar to R-CNN [3]). Thus it does not require any preprocessing such as face and landmark detection. The pipeline takes a considerable amount of training time; however, the results were a remarkable improvement over the state of the art. With the introduction of Faster R-CNN [4] and YOLOv3 [5] (for object detection), many of these shortcomings have been eliminated.
[Fig. 1 diagram: Input → face and landmark detection → rotate and extend face boundary → Nasnet-Mobile or MobileNetV2 → attributes (e.g. Arched Eyebrows, Attractive, Black Hair, Heavy Makeup, Mouth Slightly Open, Narrow Eyes, No Beard, Wearing Lipstick, Young)]
Fig. 1: Proposed attribute estimation pipeline. The face and landmark detection as well as the Nasnet-Mobile and MobileNetV2 implementations are available online at [7], [8], [9] and [10], respectively. The trained models will be provided on request.
Nevertheless, since Liu et al. [2], nobody has tried to adopt these models for face attribute detection again.
MCNN-AUX: Hand and Chellappa [6] propose a multi-task CNN with an auxiliary network. They suggest a straightforward CNN architecture and manually group the attributes to train the multi-task CNN. The architecture has about 64 million parameters but only 3 convolutional layers and 2 fully connected layers. The training was done within only 2.5 hours. Thus, we suspect a very fast inference time on modern GPUs.
Mid-Level: Zhong et al. [11] propose to use mid-level representations of the pre-trained FaceNet NN.1 [12] architecture. They apply multi-scale spatial pooling on intermediate layers and classify attributes at each level using a Support Vector Machine. For each attribute, the best-performing layer is chosen. The authors report that “features from the intermediate layers demonstrate an obvious advantage [...] for attributes describing motions of the mouth area where the gap is almost 20%”. Although this approach is interesting, it is not trained end-to-end.
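As a rough illustration of this idea (not the exact setup of [11], which uses FaceNet NN.1 and multi-scale spatial pooling), the following sketch pools an intermediate layer of a publicly available backbone and trains one linear SVM per attribute; the backbone, tapped layer, and SVM parameters are placeholders of ours.

```python
import tensorflow as tf
from sklearn.svm import LinearSVC

# Sketch only: FaceNet NN.1 is not packaged in Keras, so a generic
# ImageNet backbone stands in; the tapped layer is an arbitrary choice.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), weights="imagenet", include_top=False)
mid_layer = backbone.layers[len(backbone.layers) // 2]   # some intermediate layer
mid_model = tf.keras.Model(backbone.input, mid_layer.output)

def mid_level_features(images):
    """images: float array of shape (N, 224, 224, 3), scaled to [-1, 1]."""
    fmap = mid_model.predict(images)          # (N, H, W, C) intermediate activations
    return fmap.mean(axis=(1, 2))             # spatial average pooling -> (N, C)

def train_attribute_svms(features, labels):
    """labels: (N, 40) binary attribute matrix; one linear SVM per attribute."""
    return [LinearSVC(C=1.0).fit(features, labels[:, a])
            for a in range(labels.shape[1])]
```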
AFFACT: Günther et al. [13] use a ResNet-50 architecture. To facilitate alignment-free attribute detection, they perform heavy augmentation during inference by applying 162 different transformations to the detected bounding boxes. Thus, the input tensor of their network has 162 channels. Although they report superior classification results, it should be noted that performing these transformations beforehand is quite costly.
DMTL: Wang et al. [14] and DMTL+: Han et al. [1]
propose, similar to Hand and Chellappa [6], a deep multi-task
learning approach. They use a slightly modified AlexNet [15]
Saxen et al., "Face Attribute Detection with MobileNetV2 and NasNet-Mobile," International Symposium on Image and Signal Processing and
Analysis (ISPA), 2019, DOI: 10.1109/ISPA.2019.8868585.
This is the accepted manuscript. The final, published version is available on IEEEXplore.
(C) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
and model both attribute correlation and attribute heterogeneity
in a single network. Instead of the 6 subnetworks by Hand and
Chellappa [6], they use 8 subnetworks: one holistic subnetwork and seven local nominal subnetworks.
SPLITFACE: Mahbub et al. [16] propose a CNN architecture that is designed for face attribute detection on partially occluded faces. They create severe occlusions synthetically on CelebA and train a CNN on local patches at facial key points to robustly detect face attributes despite heavy occlusions. However, since we do not address occlusion explicitly, we did not include SPLITFACE in our evaluation.
B. Dataset
The availability of comprehensive and well-designed databases is crucial for any classification problem. In this work, we use the CelebA dataset [2]. CelebA is a large-scale dataset of celebrity images with large pose variations and background clutter. In total, there are 202,599 face images of 10,177 identities. The dataset is split into training, validation, and test sets with 162,770, 19,867, and 19,962 images, respectively. For each image, 40 face attributes were annotated. These attributes range from general attributes like sex, age, and demographic information to specific and individual characteristics such as face shape, lip size, and nose size. All labels are binary, i.e. the corresponding facial attribute is either present or not. Table I shows the performance of different models, including a trivial classifier that always votes for the majority class. The trivial classifier therefore reveals the label distribution of the test set (given the majority class is known): e.g., 50% of the faces are smiling, and only 2.1% are bald. For the majority of the classes, the distribution is imbalanced. Not all given labels are correct, though. For some attributes, a clear and objective decision might be difficult. However, Rolnick et al. discovered that deep learning techniques are highly robust against such label noise [17].
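For concreteness, the trivial baseline can be computed directly from the label distribution: its accuracy for an attribute is simply the frequency of the majority class in the test labels. A minimal sketch (variable names are ours):

```python
import numpy as np

def trivial_accuracy(train_labels, test_labels):
    """train_labels, test_labels: binary arrays of shape (N, 40).
    Returns the per-attribute accuracy of always predicting the
    majority class observed on the training set."""
    majority = (train_labels.mean(axis=0) >= 0.5)   # majority class per attribute
    return (test_labels == majority).mean(axis=0)   # per-attribute accuracy

# e.g. for 'Bald' the majority class is 'not bald', giving ~97.9% accuracy (Table I)
```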
II. METHOD
In this work, we evaluate the performance of two recent architectures for face attribute estimation: Nasnet-Mobile [18] (see Sec. II-B) and MobileNetV2 [19] (see Sec. II-C). Compared to most competitors, our training procedure is straightforward (see Fig. 1): we do not modify the network architecture and we do not perform a sophisticated alignment. We discuss the different approaches and dataset specifics. Our proposed methods perform faster than the best state-of-the-art model and better than the fastest state-of-the-art model (see details in Sec. III). Sec. II-A explains in detail the training procedure that is shared by both architectures. In Sec. II-B and Sec. II-C, the training details are given for Nasnet-Mobile and MobileNetV2, respectively.
A. Training
As the first step in our recognition pipeline we detect the
face with a multi-scale ResNet model [7] and estimate facial
landmarks using an ensemble of regression trees [8]. We rotate
each image to align the eyes horizontally and crop the face
(centered between the eyes) with a square bounding box of 2
times the width of the face detection. We do that to capture
details like hair and necklace. Finally, we rescale the crop to
a resolution of 256 × 256 pixels.
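A minimal sketch of this preprocessing step (assuming OpenCV, eye positions from the landmark detector [8], and the detected face width; the helper name and interpolation details are ours, not the authors' code):

```python
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, face_width, out_size=256):
    """Rotate the image so the eyes are horizontal and crop a square of
    side 2 * face_width centered between the eyes, as described above."""
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))

    # In-plane rotation around the point between the eyes
    rot = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    rotated = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))

    # Square crop of side 2 * face_width (border handling kept naive here)
    half = int(face_width)
    x0, y0 = max(int(cx) - half, 0), max(int(cy) - half, 0)
    crop = rotated[y0:y0 + 2 * half, x0:x0 + 2 * half]

    return cv2.resize(crop, (out_size, out_size))
```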
To augment the training data we crop the image randomly
within a range of 95% to 100% of the respective axes,
independently for the image width and height. We randomly
flip the image horizontally and randomly change the saturation
and value within the HSV color space. After augmentation, the image is rescaled to 224 × 224 and converted to RGB to meet the desired input shape of the CNNs. The pixel values are also scaled to the required range between -1 and 1. We monitored
the validation set accuracy during training and applied the test
set only a single time when the training finished.
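The augmentation described above can be sketched as follows (a rough approximation, assuming BGR uint8 input from OpenCV; the saturation/value jitter ranges are our assumptions, since the paper does not specify them):

```python
import cv2
import numpy as np

def augment(face_256, rng=np.random):
    h, w = face_256.shape[:2]

    # Random crop between 95% and 100%, independently per axis
    new_h = int(h * rng.uniform(0.95, 1.0))
    new_w = int(w * rng.uniform(0.95, 1.0))
    y0 = rng.randint(0, h - new_h + 1)
    x0 = rng.randint(0, w - new_w + 1)
    img = face_256[y0:y0 + new_h, x0:x0 + new_w]

    # Random horizontal flip
    if rng.rand() < 0.5:
        img = cv2.flip(img, 1)

    # Random saturation and value jitter in HSV space (ranges assumed)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= rng.uniform(0.8, 1.2)
    hsv[..., 2] *= rng.uniform(0.8, 1.2)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    img = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)   # convert to RGB for the CNN

    # Resize to the network input and scale pixels to [-1, 1]
    img = cv2.resize(img, (224, 224)).astype(np.float32)
    return img / 127.5 - 1.0
```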
B. Nasnet-Mobile
Nasnet is a scalable CNN architecture (constructed by neural architecture search) that consists of basic building blocks (cells) that are optimized using reinforcement learning [18]. A cell consists of only a few operations (several separable convolutions and pooling) and is repeated multiple times according to the required capacity of the network. The mobile version (Nasnet-Mobile) consists of 12 cells with 5.3 million parameters and 564 million multiply-accumulates (MACs).
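The parameter count can be reproduced approximately with the Keras reference implementation (a sanity check only; the Keras build includes the ImageNet classification head and does not report MACs):

```python
import tensorflow as tf

nasnet = tf.keras.applications.NASNetMobile(weights=None)  # no checkpoint download needed
print(nasnet.count_params())   # roughly 5.3 million parameters, matching the figure above
```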
We perform transfer learning with the pre-trained model (pre-trained on ImageNet [20]) from [9] with the suggested training setup (dropout = 0.5, weight decay = 4e-5, batch norm decay = 0.9997, batch norm epsilon = 1e-3). We start with a learning rate of 0.05 (batch size = 64) and automatically reduce the learning rate down to 5e-6. Instead of the cosine learning rate decay [21] used by [9], we use an automatic learning rate scheduler [22] that estimates the slope of the loss and reduces the learning rate (by a factor of 0.1) when the loss has not improved over the last 5k training steps. This reduced the training time significantly without sacrificing performance. We trained Nasnet-Mobile for 32 hours on a single NVidia 1080 GTX.
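A simplified stand-in for this scheduling logic is sketched below; the real scheduler of [22] estimates the slope of the loss curve, which we replace here with a plain best-loss comparison for brevity (hyperparameters taken from the description above):

```python
class PlateauLRScheduler:
    """Drop the learning rate by a factor of 0.1 whenever the loss has not
    improved over the last `patience` steps, down to a floor of 5e-6."""

    def __init__(self, lr=0.05, factor=0.1, patience=5000, min_lr=5e-6):
        self.lr, self.factor = lr, factor
        self.patience, self.min_lr = patience, min_lr
        self.best_loss = float("inf")
        self.steps_since_best = 0

    def step(self, loss):
        if loss < self.best_loss:
            self.best_loss, self.steps_since_best = loss, 0
        else:
            self.steps_since_best += 1
            if self.steps_since_best >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.steps_since_best = 0
        return self.lr
```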
C. MobileNetV2
MobileNetV2 is a CNN architecture for mobile devices
proposed by Sandler et al. [19]. Its first version was also
designed for face attribute detection but trained and evaluated
on Googles inhouse dataset [23]. They introduce inverted
residuals and linear bottlenecks and achieve state-of-the-art
results balancing inference time and performance for common
benchmarks like ImageNet [20], COCO [24], and VOC [25].
Our version of MobileNetV2 has 3.47 million parameters and
300 million MACs.
We used the pre-trained model mobilenet_v2_1.0_224 from [10] with the suggested training setup (depth multiplier = 1.0). Just like for the Nasnet-Mobile training, we start with a learning rate of 0.05 (batch size = 64) and automatically reduce the learning rate down to 5e-6 (each step by a factor of 0.1). We also tried the pre-trained model with depth multiplier = 1.4 (mobilenet_v2_1.4_224), but it was overfitting. Training MobileNetV2 took 13 hours on our NVidia 1080 GTX.
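A minimal transfer-learning sketch in tf.keras is given below. Note that the paper uses the TF-Slim checkpoints from [9] and [10]; this is only an approximation of the setup, and the classification head (40 sigmoid outputs with binary cross-entropy), dropout rate, and optimizer settings are our assumptions.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    weights="imagenet", alpha=1.0)                 # alpha plays the role of the depth multiplier
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
x = tf.keras.layers.Dropout(0.5)(x)                # dropout rate assumed
outputs = tf.keras.layers.Dense(40, activation="sigmoid")(x)  # one output per attribute

model = tf.keras.Model(base.input, outputs)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.05, momentum=0.9),  # momentum assumed
    loss="binary_crossentropy",
    metrics=["binary_accuracy"])
```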
[Fig. 2 data (number of parameters): Nasnet-Mobile: 5.3M; MobileNetV2: 3.8M; AFFACT(TD): 25.6M; AFFACT(L): 25.6M; DMTL: 92M; DMTL+: 193M; x-axis: MACs from 10^8 to 10^10; y-axis: average performance from 91 to 93]
Fig. 2: Number of multiply-accumulates (MACs) needed to compute an inference on a single image vs. average test set accuracy on a log-linear scale. The area of each model's marker is proportional to its number of parameters. Note that we estimated the number of MACs and parameters for DMTL+ [1], DMTL [14], and AFFACT [13] based on their reported changes to well-known architectures.
III. EXPERIMENTS AND RESULTS
Table I shows a comparison of a trivial classifier, 5 state-of-the-art methods, and our proposed methods, MobileNetV2 and Nasnet-Mobile. The results of the cited methods are obtained from the respective publications. We report the test set accuracies for the CelebA dataset. The trivial classifier always votes for the majority class (obtained from the training set and applied to the test set). We sorted the methods by their average accuracy.
Although LNets+ANet [2] and Mid-Level [11] are outperformed by more recent methods, both are interesting solutions that might offer insights for further research. LNets+ANet showed that combined face and attribute detection (inspired by R-CNN) can provide excellent results. Future research might adopt recent advances in object detection (e.g. YOLOv3 [5]). Mid-Level [11] uses feature representations from intermediate layers and obtains remarkable results.
MCNN-AUX [6], DMTL [14], and DMTL+ [1] use a multi-task learning approach in which the network is split into several subnetworks. This idea seems to provide promising results. However, from our perspective, it is not clear why splitting the network into several subnetworks improves the performance: a single fully connected layer can learn any linear mapping that two separate parallel fully connected layers can learn. Future research might want to investigate this.
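The argument can be made explicit with a small block-diagonal construction: two parallel fully connected layers with weights $W_1, W_2$ acting on inputs $x_1, x_2$ are a special case of one larger fully connected layer,

\[
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
=
\begin{pmatrix} W_1 & 0 \\ 0 & W_2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix},
\]

which suggests that the benefit of splitting, if any, stems from regularization or optimization effects rather than from representational power.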
While AFFACT (L) [13] uses images aligned with automatically detected landmarks, AFFACT (TD) performs 162 transformations of detected bounding boxes without performing any alignment. This might improve the performance slightly (approx. 0.5%) but comes with a huge computational burden.
Our methods perform similarly to MCNN-AUX, AFFACT(L), and DMTL, but are outperformed by DMTL+ and AFFACT(TD). Nevertheless, Fig. 3 shows that our methods are faster than DMTL+ [1] and LNets+ANet [2] regarding the average inference time for a single image without alignment.
[Fig. 3 data: DMTL+ (Titan X GPU): 8 ms, 93%; LNets+ANet (Tesla K20): 49 ms, 87%; Nasnet-Mobile (1080 GTX): 2.5 ms, 91.6%; MobileNetV2 (1080 GTX): 1.8 ms, 91.5%]
Fig. 3: Mean accuracy vs. inference time for the proposed
and other state-of-the-art approaches that report inference time.
The results for DMTL+ and LNets+ANet are taken from Han
et al. [1].
The results for DMTL+ and LNets+ANet are taken from [1].
Note that each approach is evaluated on a different device, so this comparison is biased. However, the other papers did not provide the number of multiply-accumulates (MACs), which would allow for a better comparison. Thus, we estimate the number of MACs and parameters based on their reported changes to well-known architectures: DMTL+ and DMTL use a modified AlexNet, and AFFACT uses ResNet-50 (with one additional layer). In Fig. 2 we report the number of MACs vs. average test set accuracy on a log-linear scale. The area of each circle is proportional to the number of parameters of the respective model and thus also proportional to the required memory. This is particularly interesting for mobile applications with limited resources, because storing huge numbers of parameters in memory might not be possible. Compared to DMTL+, our MobileNetV2 model needs 2.9 times fewer MACs, requires 56 times less memory, and performs 1.1% worse in terms of accuracy. Although Nasnet-Mobile and MobileNetV2 perform very similarly, Nasnet-Mobile requires 1.5M more parameters and is about 40% slower. Even though we tried several regularization methods, Nasnet-Mobile was not able to utilize its higher capacity; usually, models with higher capacity just need more regularization to improve performance. Nevertheless, the test devices were not too different, and even if all methods had been tested on the same device, we do not believe that their order would change.
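For reference, single-image latency of the kind reported in Fig. 3 can be measured with a simple loop like the following (a rough sketch assuming a tf.keras model; the numbers depend heavily on hardware, batching, and framework overhead, which is exactly why the cross-device comparison above is biased):

```python
import time
import numpy as np

def mean_inference_time_ms(model, runs=200):
    x = np.random.rand(1, 224, 224, 3).astype(np.float32)
    model.predict(x)                              # warm-up (graph build, memory allocation)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(x)
    return (time.perf_counter() - start) / runs * 1e3   # milliseconds per image
```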
IV. CONCLUSION AND FUTURE WORK
We addressed the problem of estimating facial attributes from RGB images on mobile devices. Face attribute estimation is important for human-computer interaction (HCI) systems, as it provides relevant context information such as the age, sex, and ethnicity of the interacting user. We evaluated the face attribute detection performance of two mobile architectures: Nasnet-Mobile and MobileNetV2.
Attribute columns (part 1, in order): 5 o'Clock Shadow, Arched Eyebrows, Attractive, Bags Under Eyes, Bald, Bangs, Big Lips, Big Nose, Black Hair, Blond Hair, Blurry, Brown Hair, Bushy Eyebrows, Chubby, Double Chin, Eyeglasses, Goatee, Gray Hair, Heavy Makeup, High Cheekbones, Male
Trivial 90.0 71.6 49.6 79.7 97.9 84.4 67.3 78.8 72.8 86.7 94.9 82.0 87.0 94.7 95.4 93.5 95.4 96.8 59.5 51.8 61.4
LNets+ANet [2] 91 79 81 79 98 95 68 78 88 95 84 80 90 91 92 99 95 97 90 87 98
Mid-Level [11] 93.3 82.5 80.8 82.2 97.8 95.6 69.9 82.6 86.0 94.9 96.2 84.2 91.9 94.9 96.2 99.5 97.1 97.8 90.1 86.1 98.1
MCNN-AUX [6] 94.5 83.4 83.1 84.9 98.9 96.0 71.5 84.5 89.8 96.0 96.2 89.2 92.8 95.7 96.3 99.6 97.2 98.2 91.5 87.6 98.2
AFFACT (L) [13] 94.8 83.9 82.8 85.2 99.1 96.1 72.5 84.4 90.5 96.2 96.0 88.5 92.3 95.7 96.4 99.6 97.5 98.3 92.0 87.6 98.2
AFFACT (TD) [13] 94.4 85.5 81.4 84.2 99.0 95.5 84.0 83.0 91.6 95.7 96.1 85.7 92.8 95.7 96.8 99.5 96.7 98.1 91.7 88.3 98.7
DMTL [14] 94 86 83 85 99 96 79 85 91 96 96 88 92 96 97 99 97 98 92 88 98
DMTL+ [1] 95 86 85 85 99 99 96 85 91 96 96 88 92 96 97 99 99 98 92 88 98
MobileNetV2 94.9 84.2 82.7 85.6 99.0 96.2 72.2 84.6 89.9 96.0 96.3 88.8 92.8 95.8 96.5 99.6 97.5 98.3 91.8 87.7 98.4
Nasnet-Mobile 94.9 84.1 83.2 85.4 99.1 96.2 72.3 85.0 90.2 96.1 96.4 89.4 92.9 95.9 96.5 99.7 97.6 98.2 92.0 87.8 98.4
Attribute columns (part 2, in order): Mouth Slightly Open, Mustache, Narrow Eyes, No Beard, Oval Face, Pale Skin, Pointy Nose, Receding Hairline, Rosy Cheeks, Sideburns, Smiling, Straight Hair, Wavy Hair, Wearing Earrings, Wearing Hat, Wearing Lipstick, Wearing Necklace, Wearing Necktie, Young, Average
Trivial 50.5 96.1 85.1 85.4 70.4 95.8 71.4 91.5 92.8 95.4 50.0 79.0 63.6 79.3 95.8 47.8 86.2 93.0 75.7 79.9
LNets+ANet [2] 92 95 81 95 66 91 72 89 90 96 92 73 80 82 99 93 71 93 87 87
Mid-Level [11] 92.6 96.6 86.9 95.4 70.6 96.7 76.2 92.1 94.3 97.4 92.1 80.0 77.3 86.7 98.8 92.3 85.8 94.4 87.5 89.8
MCNN-AUX [6] 93.7 96.9 87.2 96.0 75.8 97.0 77.5 93.8 95.2 97.8 92.7 83.6 83.9 90.4 99.0 94.1 86.6 96.5 88.5 91.3
AFFACT (L) [13] 93.8 97.0 87.6 96.2 76.7 97.1 77.1 93.7 95.2 97.8 92.8 85.0 85.7 91.0 99.1 93.7 88.3 96.9 88.8 91.5
AFFACT (TD) [13] 93.9 96.4 93.8 96.0 76.8 96.8 77.6 94.9 95.2 97.3 92.9 85.5 87.9 92.0 98.9 92.7 90.3 96.8 88.7 92.0
DMTL [14] 94 97 90 96 78 97 78 94 96 98 93 85 87 91 99 93 89 97 90 92.1
DMTL+ [1] 94 97 90 97 78 97 78 94 96 98 94 85 87 91 99 93 89 97 90 92.6
MobileNetV2 94.1 97.1 87.8 96.5 76.0 96.8 77.4 93.6 95.1 97.9 93.1 84.6 85.0 90.8 99.1 93.9 87.4 96.8 88.4 91.5
Nasnet-Mobile 94.1 97.1 87.6 96.4 76.4 97.0 77.8 94.0 95.2 98.0 93.1 85.0 85.6 91.0 99.1 94.0 87.5 96.8 88.5 91.6
TABLE I: CelebA [2] test set accuracy for each individual attribute and the average accuracy across all attributes. Color indicates the rank of each method from red (worst) to green (best) for each attribute individually. Trivial refers to the trivial classifier that always votes for the majority class. The results of the state-of-the-art approaches are obtained from the respective publications. See Sec. III for a detailed discussion of the results.
Our experimental evaluation showed that architectures designed for resource-limited applications can perform almost as well as the current top-performing architectures. Our methods are fast, accurate, and easy to implement.
We also discussed the contributions of previous works. Especially the results of LNets+ANet [2] point to interesting directions for future research towards complete end-to-end training, e.g. with SSD [26] or YOLO [5]. Also, the works by Wang et al. [14] and Han et al. [1] have shown that multi-task learning can improve face attribute detection performance; they showed that including face recognition or other face-related tasks in the same network improves face attribute detection as well.
ACKNOWLEDGEMENT
This work has been funded by the Federal Ministry
of Education and Research (BMBF), projects 03ZZ0443G,
03ZZ0459C, and 03ZZ0470. The sole responsibility for the
content lies with the authors.
REFERENCES
[1] Hu Han, Anil K. Jain, Fang Wang, Shiguang Shan, and Xilin Chen,
“Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning
Approach,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, jun 2017.
[2] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, “Deep Learning Face Attributes in the Wild,” in 2015 IEEE International Conference on Computer Vision (ICCV), dec 2015, pp. 3730–3738, IEEE.
[3] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, Columbus, OH, USA, 2014.
[4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1137–1149, 2016.
[5] Joseph Redmon and Ali Farhadi, “YOLOv3: An Incremental Improve-
ment,” arXiv, 2018.
[6] Emily M. Hand and Rama Chellappa, “Attributes for Improved Attributes: A Multi-Task Network Utilizing Implicit and Explicit Relationships for Facial Attribute Classification,” in Thirty-First AAAI Conference on Artificial Intelligence, apr 2017, pp. 4068–4074.
[7] Davis E. King, “Easily Create High Quality Object Detectors with Deep
Learning,” 2016.
[8] Davis E. King, “Real-Time Face Pose Estimation,” 2014.
[9] Github Repository, “Slim Nasnet,” 2018.
[10] Github Repository, “Slim Mobilenet,” 2018.
[11] Yang Zhong, Josephine Sullivan, and Haibo Li, “Leveraging mid-level deep representations for predicting face attributes in the wild,” in IEEE International Conference on Image Processing (ICIP), sep 2016, pp. 3239–3243, IEEE.
[12] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “FaceNet: A Unified Embedding for Face Recognition and Clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), jun 2015, pp. 815–823, IEEE.
[13] Manuel Günther, Andras Rozsa, and Terrance E. Boult, “AFFACT: Alignment-free facial attribute classification technique,” in IEEE International Joint Conference on Biometrics (IJCB), oct 2017, pp. 90–99, IEEE.
[14] Fang Wang, Hu Han, Shiguang Shan, and Xilin Chen, “Deep Multi-
Task Learning for Joint Prediction of Heterogeneous Face Attributes,”
in 2017 12th IEEE International Conference on Automatic Face and
Gesture Recognition (FG 2017). may 2017, pp. 173–179, IEEE.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, 2012, pp. 1097–1105, Curran Associates Inc.
[16] Upal Mahbub, Sayantan Sarkar, and Rama Chellappa, “Segment-based Methods for Facial Attribute Detection from Partial Faces,” arXiv, 2018.
[17] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit, “Deep Learning is Robust to Massive Label Noise,” in International Conference on Learning Representations (ICLR), may 2018.
[18] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” arXiv, 2017.
[19] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
and Liang-Chieh Chen, “MobileNetV2: Inverted Residuals and Linear
Bottlenecks,” arXiv, jan 2018.
[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, dec 2015.
[21] Ilya Loshchilov and Frank Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts,” in 5th International Conference on Learning Representations, Palais des Congrès Neptune, Toulon, France, 2017.
[22] Davis E. King, “Automatic Learning Rate Scheduling That Really Works,” 2018.
[23] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam,
“MobileNets: Efficient Convolutional Neural Networks for Mobile Vi-
sion Applications,” arXiv, vol. 1704.04861, apr 2017.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision, 2014, pp. 740–755, Springer, Cham.
[25] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, jan 2015.
[26] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, “SSD: Single Shot MultiBox Detector,” in European Conference on Computer Vision (ECCV), 2016, pp. 21–37, Springer, Cham.