This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2021.3069022, IEEE Access
VOLUME XX, 2021 1
A Classification of Arab Ethnicity Based on
Face Image using Deep Learning Approach
Norah A. Al-Humaidan1, Master Prince2
1 Department of Computer Science, Qassim University, Mulaydha 51452, Saudi Arabia
2 Department of Computer Science, Qassim University, Mulaydha 51452, Saudi Arabia
Corresponding author: Norah A. Al-Humaidan (e-mail: noura.ah1493@gmail.com).
This work was carried out in partial fulfillment of the requirements for the Master's thesis in the MS in Computer Science program at Qassim University, K.S.A.
ABSTRACT The human face and facial features have attracted considerable attention from researchers and are
among the most popular research topics in recent years. Features and information extracted from a person are
known as soft biometrics. They have been used to improve recognition performance and to enhance search
engines for face images, with applications in fields such as law enforcement, surveillance video,
advertisement, and social media profiling. In reviewing relevant studies in the field, we noted that the Arab
world is rarely mentioned and that no Arab dataset is available. Our aim in this paper is therefore to create
an Arab dataset with proper labeling of Arab sub-ethnic groups, and then to classify these labels using deep
learning approaches. The Arab image dataset we created has three labels: Gulf Cooperation Council countries
(GCC), the Levant, and Egyptian. Two types of learning were used to solve the problem. The first is
supervised deep learning (classification): a pre-trained Convolutional Neural Network (CNN) model was used,
since CNN models have achieved state-of-the-art results in computer vision classification problems. The
second is unsupervised deep learning (deep clustering), used to explore the ability of such models to
classify ethnicities. To our knowledge, this is the first time deep clustering has been used for ethnicity
classification problems. Three deep clustering methods were chosen. The best result of training a pre-trained
CNN on the full Arab dataset and then evaluating it on a different dataset was 56.97%, and 52.12% when the
Arab dataset labels were balanced. The deep clustering methods, applied to different datasets, achieved an
ACC from 32% to 59%, with NMI and ARI ranging from zero to 0.2714 and 0.2543, respectively.
INDEX TERMS Arab, Convolutional neural network (CNN), Deep learning, Deep clustering, Ethnicity
I. INTRODUCTION
Research on human face ethnicity and gender recognition was initially carried out by psychologists from the
perspective of cognitive science [1].
In Computer Vision, the human face and facial features have attracted considerable attention from
researchers and are among the most popular topics in recent years [2], [3]. Features and information
extracted from a person, such as age, gender, and ethnicity, are known as soft biometrics. Other examples of
soft biometrics are hair color, eye color, height, width, scars, marks, and the shape of the nose and mouth.
Soft biometrics cannot establish a person's identity on their own. However, they have been used to improve
recognition performance when combined with hard biometrics [4]. K. Veropoulos et al. [5] used soft biometric
classification as a filtering step to limit the search space in databases for user identification systems.
N. Kumar et al. [6] used soft biometrics to enhance search engines for face images. In the Human-Computer
Interface (HCI) field, the computer can provide speech recognition or offer options to the user based on
soft biometrics [7], [8]. In surveillance videos, suspects can be located based on soft biometrics [8].
K. Niinuma et al. [9] used them for continuous user authentication. They can also be used in law
enforcement, advertisement, and social media profiling [3], [10].
As mentioned before, ethnicity is a soft biometric. It is defined as "the fact or state of belonging to a
social group that has a common national or cultural tradition" [11]. It is multifaceted and changes over time
based on cultural or geographical factors. Therefore, the term is used for people who share the same race,
language, nationality, religion, etc. [12], [13]. In Computer Vision, ethnicity classification from face
images has received a lot of attention and addresses two main topics. The first is basic races such as
Black, White, and Asian [3], [14], [15]. The second focuses on smaller ethnic groups, or sub-ethnic groups.
These groups can be people of the same nationality: for example, [16] proposed a Myanmar / non-Myanmar
classification method, [17] focused on East Asian countries (Vietnam, Burma, Thailand, China, Korea, Japan,
Indonesia, and Malaysia), while [14] classified Bangladeshi, Chinese, and Indian people. Sub-ethnic groups
can also be smaller groups within the same country; for example, [18] performed classification on eight
ethnic groups from China.
In this research, our focus is on Arab ethnicity. Arab refers to people who live in the Arab world, which
consists of 22 countries in North Africa and Western Asia that have Arabic as their official language [19].
Our contribution is to provide an Arab face dataset with three labels (Gulf Cooperation Council countries,
the Levant, and Egyptians) and to perform supervised classification on it. Lastly, we test several deep
clustering methods on our dataset and on a benchmark dataset. To the best of our knowledge, this is the
first time that deep clustering has been applied to the ethnicity classification problem.
The rest of the paper is organized as follows. Section II reviews the literature. Section III describes the
methodology, and Section IV discusses the experimental results. Finally, the work is concluded and further
developments are suggested.
II. RELATED WORK
Interest in CNNs has reached the field of ethnicity classification. In this section, we summarize some
studies that used CNNs for ethnicity classification.
Z. Heng et al. [14] hybridized the output of a pre-trained CNN classifier (VGG-16, trained on ImageNet) with
an image ranking engine and trained a Support Vector Machine (SVM) on the hybrid features. The approach was
evaluated on a new dataset containing faces of Bangladeshi, Chinese, and Indian people. With an accuracy of
95.2%, the result showed an improvement over Faster R-CNN and Wang's method.
N. Narang and T. Bourlai [20] investigated the problems of distance, night time, and uncontrolled conditions
in face images. They used NIR images taken at night from 30, 60, 90, and 120 meters and visible images taken
at a distance of 1.5 meters from the Long Distance WVU Database, as well as the Long Distance Heterogeneous
Face (LDHF) database. They used a CNN (VGG architecture) to classify the gender and ethnicity of Asians and
Caucasians under these conditions, and the result of 78.98% showed an improvement over previous results.
W. Wang et al. [21] proposed a deep CNN classification model consisting of three convolutional and pooling
layers followed by two fully-connected layers. The model was applied to several datasets in addition to two
self-collected datasets. Some of the datasets were used only for testing, to evaluate the classification of
images that do not come from the training dataset. Classification was done separately on White and Black;
on Chinese and Non-Chinese; and on Han, Uyghur, and Non-Chinese. The results were 100% vs 99.4%, 99.8% vs
99.9%, and 99.4% vs 99.5% vs 99.9%, respectively. Compared to previous works, they showed good improvements.
I. Anwar and N. U. Islam [15] used a CNN called VGG-face, pre-trained on a large face dataset of 2.6 million
images, to extract features; an SVM with a linear kernel was then used as the classifier. It was trained on
three classes (Asian, African-American, and Caucasian) from ten different datasets: Computer Vision Lab
(CVL), Chicago Face Database (CFD), FERET, Multi-racial mega resolution (MR2) face database, UT Dallas face
database, Psychological Image Collection at Stirling (PICS) Aberdeen, Japanese Female Facial Expression
(JAFFE), CAS-PEAL-R1, Montreal Set of Facial Displays of Emotion Database (MSFDE), and Chinese University of
Hong Kong face database (CUFC). The average classification accuracy over all databases is 98.28%, 99.66%,
and 99.05% for Asian, African-American, and Caucasian, respectively.
S. Masood et al. [3] also used a pre-trained VGGNet, a 16-layer architecture, in their proposed work. They
attempted to classify the Mongolian, Caucasian, and Negro ethnicities using an ANN and a CNN. The ANN was
applied after calculating geometric features, calculating the normalized forehead area, and extracting skin
color. The FERET database was used in the experiments. The CNN achieved superior results with 98.6%, while
the ANN, with 82.4%, was good compared to other works.
N. Srinivas et al. [17] presented a new dataset called WEAFD, which consists of constrained and
unconstrained images of people from East Asian countries. A CNN model with three convolutional layers and
several fully connected layers was applied to WEAFD to classify age, gender, and ethnicity. They presented
two networks: one using full face images and the second using the face divided into regions. Age and gender
results were better with the first network, while ethnicity was better with the second. However, the results
for ethnicity (24.06%, 33.33%) and age (38.04%, 36.43%) were low in both networks compared to gender
(88.02%, 84.70%). They explain that the low results could be due to the lack of training data for age and
ethnicity. The quality of the labels could be another reason, as they note that humans are more likely to
make mistakes when labeling age and ethnicity than when labeling gender.
In [22], the structure of Gudi's CNN is as follows: a convolutional layer, local contrast normalization, and
a max-pooling layer, followed by another two convolutional layers and then a fully connected layer. For
preprocessing, Global Contrast Normalization was applied to the VicarVision dataset. The classification
accuracy is 92.24%. The model performed well on the Caucasian and East Asian labels, which have a higher
number of images (thousands). However, it did not perform well on the remaining labels. The average
precision over all labels was 61.52%. In this case, average precision was a better indicator of how well
the model performed.
H. Chen et al. [23] used four different algorithms, k-Nearest Neighbor (kNN), SVM, Two-Layer Neural Network,
and CNN, to classify Korean, Japanese, and Chinese faces with and without identified gender. The CNN
architecture consists of two convolutional layers followed by one fully-connected layer and a dropout layer.
The dataset used for the experiments is self-collected. The CNN performed best, with 89.2% accuracy for 3
classes and 83.5% for 6 classes (which include gender). Despite this good accuracy, the CNN reached only
61.3% accuracy on other new images, which indicates that overfitting occurred.
Table 1 summarizes all the studies. The ethnicity of the Arab world has not yet been considered in the
literature. One reason could be the lack of a dataset. Hence, a dataset that contains images of people from
different parts of the Arab world, labeled accordingly, is needed. In this study, we aim to create a dataset
that contains Arab images with labels. However, people labeled by country could be hard to classify because
many countries are close to one another. Therefore, we decided to classify according to wider regions. The
first is the Gulf Cooperation Council countries (GCC): Bahrain, Kuwait, Oman, Qatar, Saudi Arabia, and the
United Arab Emirates. Another known region is Al-Sham (the Levant), which consists of four countries: Syria,
Palestine, Jordan, and Lebanon. The last label is Egypt, which has the largest population in the Arab world
with over 100 million inhabitants [24]. It has a larger population than both the GCC and the Levant, so we
decided to give it its own label. The map in Figure 1 illustrates the distribution of countries for each
label.
TABLE 1
SUMMARY OF ETHNICITY CLASSIFICATION STUDIES IN THE LITERATURE
| Author | Year | Databases | No. of images/subjects for each label used in experiments | Classifier | Accuracy |
|---|---|---|---|---|---|
| N. Narang and T. Bourlai [20] | 2016 | WVU Database; LDHF database | 103 subjects; 100 subjects (Asian and Caucasian) | CNN | 78.98% |
| W. Wang et al. [21] | 2016 | MORPH-II | 10,530 white, 10,530 black | CNN | 100% vs 99.4% |
| | | IDPhotos; SurvImages; CAS-PEAL; CASIA-WebFace; MORPH-II | 40,000 Chinese; 71,319 Chinese; 4,886 Chinese; 91,594 Non-Chinese; 38,817 Non-Chinese | CNN | 99.8 vs 99.9 |
| | | IDPhotos (Han); IDPhotos (Uyghur); CASIA-WebFace; MORPH-II | 100,000 Han; 100,000 Uyghur; 91,594 Non-Chinese; 38,817 Non-Chinese | CNN | 99.4% vs 99.5% vs 99.9% |
| N. Srinivas et al. [17] | 2017 | The new dataset called WEAFD | 8706 Chinese; 1047 Japanese; 960 Korean; 795 Filipino/Indonesian/Malaysian; 589 Vietnamese/Burmese/Thai (maximum No. of training 500/label, rest used for testing) | CNN: full image | 24.06% |
| | | | | CNN: divided into regions | 33.33% |
| Z. Heng et al. [14] | 2018 | Self-collected dataset | 1000 images/1000 subjects Bangladeshi; 1520 images/1042 subjects Chinese; 1078 images/1009 subjects Indian | Pre-trained CNN hybridized with image ranking engine and the hybrid features classified by SVM | 95.2% |
| I. Anwar and N. U. Islam [15] | 2017 | CVL, CFD, FERET, MR2, UT Dallas face database, PICS Aberdeen, JAFFE, CAS-PEAL-R1, MSFDE, and CUFC | 8291 Asian; 1035 Black; 4068 White | SVM (pre-trained CNN used for feature extraction) | Average accuracy over all 10 datasets: 97.72%; 100%; 99.01% |
| H. Chen et al. [23] | 2016 | Self-collected with 1380 images | 418 Chinese; 498 Japanese; 466 Korean | kNN | 57.5% |
| | | | | SVM | 62.1% |
| | | | | Two-Layer NN | 64.7% |
| | | | | CNN | 89.2%, and 61.3% when tested on a different dataset |
| A. Gudi [22] | 2016 | VicarVision dataset | 3266 Caucasian; 4556 East Asian; 161 African; 252 South Asian; 47 others | CNN | 92.24% (average precision 61.52%) |
| S. Masood et al. [3] | 2018 | FERET database | White 151; Asian 152; Black 147 | ANN | 82.4% |
| | | | | Pre-trained CNN | 98.6% |
FIGURE 1.
Map of Arab countries included in the Arab dataset and their corresponding labels
All related studies mentioned previously were performed under supervised learning. Labeling data for
supervised learning methods takes a lot of time and effort. Therefore, there is a need to develop
unsupervised learning methods to deal with unlabeled data. Clustering is one of the most popular
unsupervised methods. It means grouping data items that are more similar to each other to form a cluster
[25]. H. Chen et al. [23] attempted to apply unsupervised learning to ethnicity classification by using
k-means clustering. They reported neither the experimental details nor the results; they only mentioned
that the algorithm failed to cluster the labels due to background noise and similarity between labels.
To our knowledge, deep clustering has not been used to solve ethnicity classification problems before. Deep
clustering became an interesting field for researchers after Deep Embedded Clustering (DEC) [26] was
proposed by J. Xie et al. [27]. In this study, deep clustering methods are applied to the labeled datasets
to assess the effectiveness of the methods.
The methods used are DEC, the blueprint for many deep clustering methods, and Improved Deep Embedded
Clustering (IDEC) [28], which differs from DEC in that it keeps the decoder after the pre-training phase.
IDEC showed improved results over DEC [28]. The last method is Dynamic Autoencoder (DynAE) [29]. DynAE has
a structure somewhat similar to the other two; however, its main contribution is its dynamic loss function.
It achieved state-of-the-art results in image clustering on three benchmark datasets (MNIST-full,
MNIST-test, and USPS) according to [30].
III. METHODOLOGY
This work aims to classify sub-ethnic groups of Arabs. To accomplish this, an Arab dataset is introduced by
collecting images from the internet that belong to specific subjects. Pre-processing and data cleaning tasks
were then performed on the collected images. Once the dataset was ready, supervised and unsupervised models
were applied to solve the ethnicity classification problem. A CNN was chosen for supervised learning due to
its strong performance in solving ethnicity classification problems in [2], [14], [15], [21]. For
unsupervised learning, three deep clustering methods are used. To evaluate the models, our Arab dataset as
well as other datasets are used. Several metrics are used to report evaluation results. The methodology is
summarized in Figure 2.
FIGURE 2.
Methodology summary
A. DATASETS
Our dataset provides labeled images from the Arab world. We chose three labels, as explained in the related
work section. The process for creating the Arab dataset is as follows:
1. Collect names of subjects. Subjects are public figures with accessible background information, such as
actors, actresses, singers, football players, social media influencers, writers, announcers, journalists,
businessmen / businesswomen, ministers, producers, filmmakers, artists, etc.
2. Check their nationalities and, if available, their origins. They must belong to only one of the labels
and no others:
o Saudi, Kuwaiti, Qatari, Emirati, Omani, Bahraini are called: GCC
o Syrian, Palestinian, Jordanian, Lebanese are called: The Levant
o Egyptian
3. Download images from Google search using a modified Python script from [31]. We download 10 images for
each subject.
4. Use a face detector to detect faces in the images of each subject, as described in detail in the
preprocessing section.
5. Cleaning: remove images unrelated to the subject (images of objects or other people), duplicated images,
and images that the detector mistook for faces.
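The labeling rule in step 2 can be sketched as a small lookup. This is only an illustration: the names `NATIONALITY_TO_LABEL` and `assign_label` are hypothetical and do not come from the script in [31], and the choice to exclude a subject who has any nationality outside the list is our assumption.

```python
# Hypothetical mapping from nationality to the three dataset labels (step 2).
NATIONALITY_TO_LABEL = {
    "Saudi": "GCC", "Kuwaiti": "GCC", "Qatari": "GCC",
    "Emirati": "GCC", "Omani": "GCC", "Bahraini": "GCC",
    "Syrian": "Levant", "Palestinian": "Levant",
    "Jordanian": "Levant", "Lebanese": "Levant",
    "Egyptian": "Egyptian",
}


def assign_label(nationalities):
    """Return the single label of a subject, or None if the subject is excluded.

    A subject is kept only when all known nationalities map to one label.
    Assumption: a nationality outside the mapping also excludes the subject.
    """
    labels = set()
    for nationality in nationalities:
        if nationality not in NATIONALITY_TO_LABEL:
            return None  # assumption: unlisted nationality, subject excluded
        labels.add(NATIONALITY_TO_LABEL[nationality])
    # Exactly one label is required; zero or several labels exclude the subject.
    return labels.pop() if len(labels) == 1 else None
```

For example, a subject listed as both Saudi and Kuwaiti still maps to GCC, while a subject listed as Syrian and Egyptian is excluded.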
Table 2 shows the number of subjects and images for each label in the Arab dataset. A subject is a unique
person; that is, the dataset may have more than one image per subject. As shown in Figure 3, the dataset is
unbalanced: GCC has 70% of the subjects. In addition to the Arab dataset, a private Arab dataset was
collected from non-public figures and is used to evaluate the models. The private dataset consists of 88
Egyptian subjects, 104 GCC subjects, and 91 subjects from the Levant.
In addition to the Arab dataset, the following datasets are used in our experiments:
Racial Faces in-the-Wild (RFW) [32], [33], [34] was collected from MS-Celeb-1M and has four labels, each of
which contains 10K images of 3K subjects [32]. RFW is similar to the Arab dataset in terms of data source
(the internet). Moreover, it identifies subjects individually. It is therefore suitable to be combined with
the Arab dataset to classify four labels (Arab, Asian, Black, White) without concern about subjects
overlapping between the train and test sets. Figure 4 shows samples from the Arab dataset and RFW.
BUPT-Transferface has 50K images of African, Asian, and Indian subjects, and over 460K images of 10K White
subjects.
The FERET dataset [34], [35] is a well-known benchmark dataset of facial images used to report and compare
the results of different methods. It contains high-quality images of individual subjects with different
poses, expressions, and lighting. Furthermore, it provides gender, age, and ethnicity information about the
subjects.
Lastly, the UTK dataset [36] contains images collected from the internet and provides their age, gender,
and ethnicity. However, the dataset does not provide information about the individual identity of subjects.
Therefore, we cannot be sure whether overlap occurs between the train and test sets if it is used to train
a classification model.
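The concern about subject overlap can be made concrete with a split that is done per subject rather than per image, which is only possible when a dataset identifies its subjects. The sketch below is a minimal illustration under our own assumptions: the function name `split_by_subject`, the `(image_path, subject_id, label)` tuple layout, the per-label stratification, and the default held-out fraction are all hypothetical choices, not the authors' code.

```python
import random
from collections import defaultdict


def split_by_subject(samples, held_out_fraction=0.2, seed=0):
    """Split samples so that no subject appears on both sides.

    samples: list of (image_path, subject_id, label) tuples (assumed layout).
    Subjects are sampled per label so every label is represented (assumption).
    """
    subjects_per_label = defaultdict(set)
    for _, subject_id, label in samples:
        subjects_per_label[label].add(subject_id)

    rng = random.Random(seed)
    held_out_subjects = set()
    for subjects in subjects_per_label.values():
        ordered = sorted(subjects)  # sort first so the shuffle is reproducible
        rng.shuffle(ordered)
        n_held_out = max(1, round(len(ordered) * held_out_fraction))
        held_out_subjects.update(ordered[:n_held_out])

    train = [s for s in samples if s[1] not in held_out_subjects]
    held_out = [s for s in samples if s[1] in held_out_subjects]
    return train, held_out
```

Because the assignment is decided once per subject, all images of a given person land on the same side of the split.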
FIGURE 3.
The chart illustrates the dataset’s labels and percentage of gender for each label
TABLE 2
ARAB DATASET INFORMATION
| Label | No. of Images | No. of subjects per Gender (male, female) |
|---|---|---|
| GCC | 5598 | (718, 66) |
| Levant | 1665 | (141, 54) |
| Egyptians | 1555 | (100, 69) |
| Total | 8843 | (959, 189) |
FIGURE 4.
Samples of all datasets' labels (Arab: Egypt, GCC, Levant; UTK, RFW, BUPT-Transferface, and FERET: Asian, Black, White)
B. PREPROCESSING
Dlib's pre-trained face detector, which is based on a modification of the standard Histogram of Oriented
Gradients + Linear SVM method for object detection from [37], was used to detect faces in all images. Each
detected face is then cropped and resized to 224x224, the input size used by the pre-trained CNN. The steps
are illustrated in Figure 5. After that, the images are cleaned: duplicated images, results unrelated to the
subject (due to Google search errors or other faces appearing in the same image as the subject), and
detection errors are removed.
FIGURE 5.
An illustration of pre-processing steps
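The detect, crop, and resize steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `clamp_box` and `detect_and_crop` are hypothetical, reading and resizing with OpenCV is our assumption, and so is upsampling the image once before detection. The detector itself is Dlib's `get_frontal_face_detector`, which is the HOG + linear SVM detector the text refers to.

```python
def clamp_box(left, top, right, bottom, width, height):
    """Clip a detector rectangle to the image bounds.

    Detections near the border can extend outside the image, so the box is
    clipped before cropping.
    """
    return max(0, left), max(0, top), min(width, right), min(height, bottom)


def detect_and_crop(image_path, size=224):
    """Return a list of size x size face crops found in one image."""
    # Imported lazily so that clamp_box can be used without these packages.
    import cv2   # assumption: OpenCV is used for image I/O and resizing
    import dlib

    detector = dlib.get_frontal_face_detector()
    image = cv2.imread(image_path)
    if image is None:  # unreadable or missing file
        return []
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    height, width = rgb.shape[:2]

    faces = []
    for rect in detector(rgb, 1):  # 1 = upsample once (an assumption)
        left, top, right, bottom = clamp_box(
            rect.left(), rect.top(), rect.right(), rect.bottom(), width, height
        )
        if right > left and bottom > top:
            faces.append(cv2.resize(image[top:bottom, left:right], (size, size)))
    return faces
```

Images for which the returned list is empty, or which contain faces of other people, would then go through the manual cleaning step described above.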
C. DATA AUGMENTATION
Data augmentation (DA) is a way to reduce overfitting [38] by applying certain transformations to images
during the training process. Different sets of data augmentation methods are used in different experiments
to find one suitable for the Arab dataset. The methods used in our experiments are: flipping horizontally
and/or vertically (FL); multiplying all pixels by random values to make them brighter or darker (ML);
increasing or decreasing hue and saturation by random values (HS); rescaling; blurring; adjusting image
contrast; dropout (setting some pixels to zero); and converting to grayscale.
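Two of the listed methods, FL and ML, are simple enough to sketch in plain Python. The sketch below is illustrative only: it represents an image as a list of rows of `(r, g, b)` tuples, and the function names, the flip probability, and the brightness range are hypothetical values, not the settings used in the experiments.

```python
import random


def flip_horizontal(image):
    """FL: mirror each row. image is a list of rows of (r, g, b) tuples."""
    return [row[::-1] for row in image]


def multiply_brightness(image, factor):
    """ML: multiply every channel by factor and clip the result to 255."""
    return [
        [tuple(min(255, int(channel * factor)) for channel in pixel) for pixel in row]
        for row in image
    ]


def augment(image, rng, flip_prob=0.5, brightness_range=(0.8, 1.2)):
    """Apply a random flip and a random brightness change to one image.

    flip_prob and brightness_range are hypothetical values for illustration.
    """
    if rng.random() < flip_prob:
        image = flip_horizontal(image)
    return multiply_brightness(image, rng.uniform(*brightness_range))
```

Since the transformations are drawn at random each time an image is loaded, the network sees a slightly different version of the same face in every epoch.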
D. CLASSIFICATION MODEL
In this section, we present the classification models used to solve the ethnicity classification problem.
Two types of learning are used separately to solve the problem: supervised learning and unsupervised
learning.
1) SUPERVISED LEARNING
After pre-processing, a CNN model is trained on the Arab dataset. The convolutional layers in a CNN work as
feature extractors, so there is no need for a separate feature extraction step [39]. The model we use is a
pre-trained CNN. Pre-trained models are usually trained on millions of images and then fine-tuned on small
datasets (thousands of images in our case) [40], [41]. Pre-trained models have been shown to improve
results and to outperform CNNs trained from scratch [40], [41].
The model architecture used is the 50-layer ResNet (ResNet-50) created by K. He et al. [42]. They were
motivated by the degradation problem: as network depth increases, accuracy saturates and then degrades
rapidly. This degradation is not caused by overfitting, and the fact that adding more layers to a deep
model leads to higher training error was unexpected since, theoretically, the network was supposed to
perform better as it gets deeper [42].
FIGURE 6.
An illustration of ResNet-50 layers architecture
Shortcut connections between blocks differentiate ResNet from other models [42]. Two types of shortcuts are
used in ResNet-50: identity shortcuts are used when the input and output have the same dimensions, while
projection shortcuts are used to match dimensions [42]. Figure 6 shows more details of the ResNet-50
architecture. Downsampling is performed between blocks with a stride of 2 [42].
ResNet50 was trained by Q. Cao et al. [43] on the VGGface2 dataset, which contains 3.31 million images of
9131 subjects.
We will use the pre-trained ResNet50 model. However, the
last layer of the model is going to be replaced with a new fully-
connected layer that has an output of three classes. Then we
start training with the categorical cross-entropy loss function ($L_n$) for the $n$th training sample,
which is given by:
$$L_n = -\sum_{i=1}^{C} y_i \log(p_i) \tag{1}$$
where $y_i$ is the truth label (0 or 1) of class $i$, $p_i$ (0 to 1) is the probability of an object being
classified as a member of class $i$, and the sum runs over the $C$ classes.
The categorical cross-entropy loss function, sometimes called the softmax loss function, was used to train
the pre-trained ResNet50 model [43]. It is a common choice for multiclass classification problems [44].
Training aims to minimize the loss function to improve performance.
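Equation 1 can be checked numerically on a single sample. The sketch below is an illustration only: the function names and the example logits are hypothetical, and the small `eps` guard against `log(0)` is our addition rather than part of the equation.

```python
import math


def softmax(logits):
    """Turn raw network outputs into class probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, probs, eps=1e-12):
    """Loss of Equation 1 for one sample: minus the sum of y_i * log(p_i).

    y_true is a one-hot list; eps (our addition) avoids log(0).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))


# Hypothetical three-class example; the true class is the second one.
probs = softmax([1.0, 2.5, 0.3])
loss = categorical_cross_entropy([0, 1, 0], probs)
```

With a one-hot label only the term of the true class survives, so the loss reduces to the negative log of the probability assigned to that class and shrinks toward zero as that probability approaches 1.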
We tune the model with different hyperparameters. The total number of experiments is over 60. The
hyperparameters were:
• Learning rate (LR): a constant LR of 0.01, 0.001, or 0.0001, and an automatic LR that is adjusted at
certain epochs.
• Optimizer: SGD and Adam.
• Freezing layers (blocks): there are 4 blocks in ResNet-50, shown in Figure 6. Different blocks were
frozen in different experiments, and in some experiments the whole network was trainable. Freezing a layer
in a pre-trained CNN means that the layer does not learn during training and only uses what it learned
previously.
• Data augmentation: different sets of methods, or none at all in some experiments.
The number of epochs is 30 and the batch size is 64 for all experiments. We test all models on the private
dataset and present the top five accuracies in the results and discussion section.
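The automatic learning rate can be sketched as a step schedule. The defaults below (start at 0.01, a factor of 0.1, a change every 5 epochs or at epochs 5, 10, and 20) mirror the schedules listed in Table 3, but the function names are hypothetical, and reading "by a factor of 0.1" as multiplying the rate by 0.1, which lowers it, is our assumption.

```python
def step_lr(epoch, base_lr=0.01, factor=0.1, step=5):
    """LR multiplied by `factor` every `step` epochs (epochs counted from 0)."""
    return base_lr * factor ** (epoch // step)


def milestone_lr(epoch, base_lr=0.01, factor=0.1, milestones=(5, 10, 20)):
    """LR multiplied by `factor` once for every milestone already reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** reached


# Learning rate seen at each of the 30 training epochs under the first schedule.
schedule = [step_lr(epoch) for epoch in range(30)]
```

Under this reading, the first schedule passes through six different rates over the 30 epochs, while the milestone version changes the rate only three times.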
2) DEEP CLUSTERING MODELS (UNSUPERVISED
LEARNING)
The general idea of deep clustering consists of two stages: pre-training an autoencoder, which allows the
network to learn features that are used to initialize the cluster centers [45], and fine-tuning, where
clustering and feature learning are performed jointly [45]. As mentioned before, the clustering methods we
use are DEC [26], IDEC [28], and DynAE [29].
The first two methods were implemented with the convolutional network introduced in [45]. The difference
between DEC and IDEC is that DEC discards the decoder after pre-training and fine-tunes the encoder with
the clustering loss, while IDEC keeps the decoder. Figures 7 and 8 illustrate the architecture of each
method.
DynAE [29] overcomes the trade-off between clustering and reconstruction by using a dynamic loss function.
Figure 9 shows the general architecture of DynAE.
For all methods, the number of clusters is prior knowledge given before clustering starts.
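The clustering loss used in the fine-tuning stage is not spelled out in this paper. As we recall it from the DEC paper [26], it is the KL divergence between a soft cluster assignment based on a Student's t kernel and a sharpened target distribution; the sketch below follows that recollection and should be checked against [26]. The function names and the toy embeddings are hypothetical.

```python
import math


def soft_assign(embeddings, centers, alpha=1.0):
    """Soft assignment q[i][j] of embedded point i to cluster center j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centers:
            dist_sq = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([value / total for value in row])
    return q


def target_distribution(q):
    """Sharpened targets: square q and normalize by the soft cluster sizes."""
    cluster_size = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_size[j] for j in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """Clustering loss KL(P || Q), summed over all points and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0
    )


# Toy 2-D embeddings and two cluster centers (hypothetical values).
z = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
centers = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(z, centers)
p = target_distribution(q)
clustering_loss = kl_divergence(p, q)
```

In DEC only this loss drives fine-tuning, whereas IDEC, by keeping the decoder, adds the reconstruction loss to it.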
FIGURE 7.
DEC architecture [26]
FIGURE 8.
IDEC architecture [28]
FIGURE 9.
DynAE architecture [29]
E. EVALUATION METRICS
For the supervised (classification) model, we use the accuracy metric:
$$\text{accuracy} = \frac{M}{N} \tag{2}$$
In Equation 2, $M$ is the number of correctly classified samples and $N$ is the number of all samples. In
addition, we use metrics that are widely used to evaluate deep clustering methods [27]. The
first one is unsupervised clustering accuracy (ACC):
$$\mathrm{ACC} = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{l_i = m(c_i)\}}{n} \tag{3}$$
In Equation 3, $l_i$ is the ground-truth label, $c_i$ is the cluster assigned by the clustering algorithm,
$n$ is the number of samples, and $m$ ranges over all possible one-to-one mappings between clusters and
true labels. The metric takes the cluster assignments and the ground-truth labels and finds the best
matching between them, which can be computed by the Hungarian algorithm [46]. The second metric is
Normalized Mutual
Information (NMI):
<;K
-
)L 5
/
#9
-
DLF
/
I
M
N
O
-
D
/
PO
-
F
/Q (4)
In Equation 4, $I$ is the mutual information, $H$ is the
entropy, $Y$ is the ground-truth labeling, and $C$ is the clustering
result. Mutual information measures the mutual dependence
of two groupings, here the ground truth and the clustering result.
NMI is a normalized version of it, and permuting the cluster labels does not
affect its value [47]. An NMI of 0 means the two
are independent, and an NMI of 1 means the two are
identical.
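As a minimal sketch of Equations 3 and 4, the two metrics can be computed with the Python standard library alone. The function names and the toy labels below are illustrative assumptions, not the implementation used in the experiments. For brevity, the search over mappings $m$ enumerates all permutations, which is only practical for a handful of clusters; the Hungarian algorithm [46] finds the same optimum efficiently.

```python
from collections import Counter
from itertools import permutations
from math import log


def clustering_accuracy(y_true, c_pred):
    """ACC (Equation 3): best one-to-one mapping m from clusters to labels."""
    labels = sorted(set(y_true))
    clusters = sorted(set(c_pred))
    # Pad with dummy labels so every cluster can receive a distinct label.
    labels = labels + [None] * max(0, len(clusters) - len(labels))
    hits = Counter(zip(c_pred, y_true))
    best = 0
    for assigned in permutations(labels, len(clusters)):
        matched = sum(hits[(c, y)] for c, y in zip(clusters, assigned))
        best = max(best, matched)
    return best / len(y_true)


def entropy(values):
    """Shannon entropy H of a labeling, in nats."""
    n = len(values)
    return -sum((k / n) * log(k / n) for k in Counter(values).values())


def mutual_information(y_true, c_pred):
    """Mutual information I(Y, C) estimated from co-occurrence counts."""
    n = len(y_true)
    joint = Counter(zip(y_true, c_pred))
    n_y, n_c = Counter(y_true), Counter(c_pred)
    return sum((k / n) * log(k * n / (n_y[y] * n_c[c]))
               for (y, c), k in joint.items())


def nmi(y_true, c_pred):
    """NMI (Equation 4): I(Y, C) divided by the mean of H(Y) and H(C)."""
    mean_entropy = 0.5 * (entropy(y_true) + entropy(c_pred))
    # Guard against 0/0 when both partitions consist of a single group.
    return mutual_information(y_true, c_pred) / mean_entropy if mean_entropy else 0.0


# Toy data (hypothetical): same grouping as the ground truth, other cluster ids.
y = ["GCC", "GCC", "Levant", "Levant", "Egyptian", "Egyptian"]
c_same_grouping = [2, 2, 0, 0, 1, 1]
print(clustering_accuracy(y, c_same_grouping), nmi(y, c_same_grouping))

# Toy data (hypothetical): clusters that carry no information about the labels.
y2 = ["A", "A", "B", "B"]
c_unrelated = [0, 1, 0, 1]
print(clustering_accuracy(y2, c_unrelated), nmi(y2, c_unrelated))
```

In the second toy case ACC is 0.5 while NMI is 0, which illustrates why ACC alone can look acceptable even when the clusters are independent of the ground truth.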
The last metric is the Adjusted Rand Index (ARI), which is
the chance-corrected version of the Rand Index (RI). RI
focuses on pairwise agreement: for each possible pair of samples, it
evaluates how similarly the two clusterings treat them [48]. RI is
calculated by:
$$\mathrm{RI} = \frac{a + b}{a + b + c + d} \tag{5}$$
where $a$ and $b$ count the pairs on which the ground truth and the clustering
result agree, and $c$ and $d$ count the disagreements, in which a pair is
put together on one side but separated on the other
[48]. ARI is then calculated from Equation 5 by:
$$\mathrm{ARI} = \frac{\mathrm{RI} - E(\mathrm{RI})}{\max(\mathrm{RI}) - E(\mathrm{RI})} \tag{6}$$
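The pair counting behind Equations 5 and 6 can be sketched as follows. The source does not spell out how $E(\mathrm{RI})$ is obtained; this sketch assumes the commonly used permutation-model expectation computed from the contingency counts (the Hubert and Arabie form), under which $\max(\mathrm{RI}) = 1$. Function names and toy labels are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_counts(y_true, c_pred):
    """Count sample pairs by how the two partitions treat them.

    a: together in both, b: separated in both (agreements);
    c: together only in the ground truth, d: together only in the
    clustering (disagreements).
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_y = y_true[i] == y_true[j]
        same_c = c_pred[i] == c_pred[j]
        if same_y and same_c:
            a += 1
        elif not same_y and not same_c:
            b += 1
        elif same_y:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(y_true, c_pred):
    """RI (Equation 5): share of pairs on which both partitions agree."""
    a, b, c, d = pair_counts(y_true, c_pred)
    return (a + b) / (a + b + c + d)


def adjusted_rand_index(y_true, c_pred):
    """ARI (Equation 6) with the permutation-model expectation (assumption)."""
    n_pairs = comb(len(y_true), 2)
    together_both = sum(comb(k, 2) for k in Counter(zip(y_true, c_pred)).values())
    together_y = sum(comb(k, 2) for k in Counter(y_true).values())
    together_c = sum(comb(k, 2) for k in Counter(c_pred).values())
    expected = together_y * together_c / n_pairs
    maximum = 0.5 * (together_y + together_c)
    if maximum == expected:
        # Degenerate case: both partitions are trivial and identical.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy data (hypothetical): the second class is split over two clusters.
y = [0, 0, 1, 1]
c = [0, 0, 1, 2]
print(rand_index(y, c))           # 5 of 6 pairs agree
print(adjusted_rand_index(y, c))  # lower than RI after chance correction
```

Because the expected value is subtracted, ARI is close to zero for a clustering that is unrelated to the ground truth and can be slightly negative, as in the FERET results reported in Table 5.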
IV. RESULTS AND DISCUSSION
Experiments were run on Google Colab and on a g3s.xlarge
instance from Amazon Web Services (AWS) with the Deep
Learning AMI (Ubuntu 18.04) Version 28.1.
A. CLASSIFICATION RESULTS
Arab dataset subjects were divided into an 80% training set and a
20% validation set without subject overlap, i.e., images
of a subject that appears in the training set are never used for
evaluation. Sixty experiments were done on the Arab dataset using the
TABLE 3
HYPERPARAMETERS OF THE FIVE MODELS TRAINED ON THE ARAB DATASET AND EVALUATED ON A DIFFERENT DATASET (OUR PRIVATE DATASET) THAT ACHIEVED THE HIGHEST ACCURACY

| Model | LR | Optimizer | DA | Frozen blocks |
|---|---|---|---|---|
| 1 | Start at 0.01, increase exponentially every 5 epochs by a factor of 0.1 | SGD | FL, HS, rescaling | 1 |
| 2 | Start at 0.01, increase exponentially every 5 epochs by a factor of 0.1 | SGD | All methods mentioned | None |
| 3 | Start at 0.01, increase exponentially every 5 epochs by a factor of 0.1 | SGD | FL, HS, rescaling | 1, 2 |
| 4 | Start at 0.01, increase exponentially every 5 epochs by a factor of 0.1 | SGD | FL, HS, rescaling | None |
| 5 | Start from 0.01, exponentially increasing by a factor of 0.1 at epoch 5, epoch 10, and epoch 20 | SGD | None | 1, 2, 3 |
| 6 | Start at 0.01, increase exponentially every 5 epochs by a factor of 0.1 | SGD | FL, rescaling | None |
ResNet50 pre-trained model to tune hyperparameters; please
refer to the supervised learning section for more details about
the hyperparameters. After that, all models were evaluated on a
different dataset to determine the best model. Accuracy was
calculated according to Equation 2. Table 4 lists only the
top-5 accuracy results obtained by testing on the
private dataset; the last two are tied, so six models are listed. As we can see in
Table 3, all of these models used SGD as the optimizer. Five out of
six used data augmentation, but with different sets of methods. Three
had no frozen blocks, one had the first block frozen,
another had the first two blocks frozen, and the last had the first three
blocks frozen. Five out of six used the same learning rate schedule,
which starts from 0.01 and exponentially increases by a factor
of 0.1 every 5 epochs. The remaining model used a learning rate that
starts from 0.01, increasing by a factor of 0.1 at epoch 5, epoch
10, and epoch 20. The best model was optimized with SGD,
had its first block frozen, used FL, HS, and rescaling as DA, and
used an LR of 0.01 that increases exponentially by 0.1 every 5 epochs.
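The two learning rate schedules in Table 3 can be sketched as plain functions of the epoch index. The source describes the rate as increasing by a factor of 0.1; taken literally, multiplying by 0.1 lowers the rate tenfold at each step, so this sketch assumes a step decay. The function names and the zero-based epoch indexing are also assumptions.

```python
def step_lr(epoch, base_lr=0.01, factor=0.1, step=5):
    """Rate multiplied by `factor` once every `step` epochs (models 1-4, 6)."""
    return base_lr * factor ** (epoch // step)


def milestone_lr(epoch, base_lr=0.01, factor=0.1, milestones=(5, 10, 20)):
    """Rate multiplied by `factor` at each milestone epoch reached (model 5)."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** reached


# Zero-based epochs (assumption): both schedules agree until epoch 15,
# where the fixed-step schedule drops again and the milestone one waits.
for epoch in (0, 5, 10, 15, 20):
    print(epoch, step_lr(epoch), milestone_lr(epoch))
```

In a training framework, the same behavior would normally be obtained from a built-in scheduler callback rather than hand-written functions.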
Results of all models tested on the Arab dataset and the private
dataset are presented in Table 4. The accuracies when testing on
the Arab dataset were between 0.72 and 0.76. However, accuracy
drops to between 0.54 and 0.57 when testing on a different dataset. The best
accuracy was 0.5697 by model-1, closely followed by 0.5606
by model-2.
We now look deeper into the model-1 prediction results (exp1).
The confusion matrix of model-1 evaluated on a private Arab
dataset is shown in Figure 10. As we can see, 75% of GCC were
predicted correctly. The Levant and Egyptian labels have 43% and
48% of images predicted correctly, respectively. We also
noticed that over 30% of Levant and Egyptians were
predicted as GCC. We were concerned that GCC, which makes up
70% of the dataset, had caused the model to be biased toward
GCC.
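The percentages quoted from the confusion matrix are per-label rates. A minimal sketch of such a normalization is given below, assuming each row is divided by the number of images with that true label so that the diagonal holds the share of each label predicted correctly. The function name and the toy predictions are hypothetical and are not the data of Figure 10.

```python
from collections import Counter


def normalized_confusion_matrix(y_true, y_pred, labels):
    """Row-normalized confusion matrix: rows are true labels, columns predictions."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for t in labels:
        row_total = sum(counts[(t, p)] for p in labels)
        matrix.append([counts[(t, p)] / row_total if row_total else 0.0
                       for p in labels])
    return matrix


# Toy example (hypothetical): 4 GCC images and 2 Levant images.
labels = ["GCC", "Levant"]
y_true = ["GCC", "GCC", "GCC", "GCC", "Levant", "Levant"]
y_pred = ["GCC", "GCC", "GCC", "Levant", "GCC", "Levant"]
for label, row in zip(labels, normalized_confusion_matrix(y_true, y_pred, labels)):
    print(label, row)
```

With row normalization, a large class cannot hide a small one: overall accuracy in the toy example is 4/6, but the Levant row shows that only half of that label is recovered.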
TABLE 4
TOP 5 HIGHEST ACCURACIES ON A PRIVATE DATASET AND ARAB DATASET

| Model | Accuracy in Arab dataset | Accuracy in private dataset |
|---|---|---|
| 1 | 0.7406 | 0.5697 |
| 2 | 0.7267 | 0.5606 |
| 3 | 0.7278 | 0.5545 |
| 4 | 0.7395 | 0.5484 |
| 5 | 0.7621 | 0.5455 (equal to model 6) |
| 6 | 0.7417 | 0.5455 (equal to model 5) |
FIGURE 10.
Normalized confusion matrix of exp1, model-1 evaluated on
private dataset after training and validating in Arab dataset.
To address this concern, we ran another experiment (exp2) with
a modified Arab dataset (Arab balanced dataset). The
modified dataset has a similar number of subjects and images
for each label. The hyperparameters used in this experiment are
the same as for model-1. The accuracy on the Arab balanced
dataset was 0.5349. When evaluated on a private Arab dataset,
the accuracy was 0.5212. In the confusion matrix in Figure 11,
GCC again had the highest share of correct predictions at 65%,
10 percentage points lower than the model-1 result. Levant and Egyptian have
42% and 45%, respectively. This experiment shows that the
model can identify GCC better than the others even with a
similar number of subjects/images. 31% of Levant were
predicted as Egyptian, 9 percentage points more than in the model-1
results. In both exp1 and exp2, the models struggle in
classification, especially in classifying Levant and Egyptian.
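The source does not state how the balanced dataset was built. One plausible reading, shown here purely as an assumption, is to downsample the subjects of each label to the size of the smallest label. The function name, subject identifiers, and label sizes below are hypothetical.

```python
import random


def balance_subjects(subjects_by_label, seed=0):
    """Keep the same number of randomly chosen subjects for every label."""
    rng = random.Random(seed)
    target = min(len(subjects) for subjects in subjects_by_label.values())
    return {label: sorted(rng.sample(subjects, target))
            for label, subjects in subjects_by_label.items()}


# Hypothetical subject ids; the sizes only mimic a GCC-heavy imbalance.
subjects_by_label = {
    "GCC": [f"gcc_{i}" for i in range(14)],
    "Levant": [f"lev_{i}" for i in range(3)],
    "Egyptian": [f"egy_{i}" for i in range(3)],
}
balanced = balance_subjects(subjects_by_label)
print({label: len(subjects) for label, subjects in balanced.items()})
```

Sampling whole subjects rather than individual images keeps all images of a person together, which matters when the later split is also done by subject.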
FIGURE 11.
Normalized confusion matrix of exp2, model evaluated on
private dataset after training and validating in Arab balanced dataset.
The third experiment (exp3) had four labels: three labels
from the RFW dataset (Black, Asian, and White) and one Arab
label that combines all Arab dataset labels. The dataset is
divided by subject into an 80% training set and a 20% test set.
The hyperparameters were the same as for model-1. Testing on
the same dataset results in a high accuracy of 0.9663. We also
ran two tests on two different datasets (BUPT-TRANSFERFACE
and UTK), each combined with the Arab private
dataset as one label. The two tests achieved 0.9675 and 0.6995,
respectively. Figures 12 and 13 show the confusion matrices of
both tests. In both tests, 88% of the Arab label was predicted correctly
and 9% of Arabs were wrongly predicted as
White. As for the other labels, there was a wide gap between the
Black and White results in test-1 (BUPT-TRANSFERFACE
dataset) and test-2 (UTK dataset). Test-1 has almost all labels
predicted correctly, while in test-2, 30% of Black and 36% of
White were predicted as Arab.
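The subject-level 80/20 split used in these experiments can be sketched as below: subjects, not images, are shuffled and divided, so no person contributes images to both sides. The function name, the `(image_id, subject_id)` pair format, and the toy identifiers are assumptions for illustration.

```python
import random


def split_by_subject(image_subjects, train_fraction=0.8, seed=0):
    """Split (image_id, subject_id) pairs so that no subject spans both sets."""
    subjects = sorted({subject for _, subject in image_subjects})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_train = round(train_fraction * len(subjects))
    train_subjects = set(subjects[:n_train])
    train = [img for img, subject in image_subjects if subject in train_subjects]
    held_out = [img for img, subject in image_subjects if subject not in train_subjects]
    return train, held_out, train_subjects


# Hypothetical data: 10 subjects with 3 images each.
image_subjects = [(f"s{s}_img{k}.jpg", f"s{s}") for s in range(10) for k in range(3)]
train, held_out, train_subjects = split_by_subject(image_subjects)
print(len(train), len(held_out))
```

Splitting by image instead would let the model see other photos of the held-out people during training, which tends to inflate the measured accuracy.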
Through these experiments, we noticed that the model can
correctly identify up to 88% of Arabs when they are classified alongside
other ethnicities, even though around 30% of Black and White were mistaken
for Arab in test-2. However, the model does not give good
classification performance when classifying the Arab labels among themselves,
probably because the similarity between Arab classes is
higher.
FIGURE 12.
(test-1) Normalized confusion matrix of exp3, model evaluated
on BUPT-TRANSFERFACE + private Arab (one label called Arab) datasets
after training and validating in RFW + Arab (one label called Arab) datasets.
FIGURE 13.
(test-2) Normalized confusion matrix of exp3, model evaluated
on UTK + private Arab (one label called Arab) datasets after training and
validating in RFW + Arab (one label called Arab) datasets.
B. DEEP CLUSTERING RESULTS
Experiments were done using DEC, IDEC, and DynAE. The
image size is 60x60 for all datasets. For all three methods, the parameters
are the same as in the implementations of their respective
papers. Adam was the optimizer for DEC
and IDEC and for the pre-training phase of DynAE, while SGD was
used for the clustering phase. DEC and IDEC used a CNN,
while DynAE used a fully connected network. Three
metrics were used to evaluate the experiments. ACC (Equation 3)
measures how many individuals were clustered correctly. NMI
(Equation 4) focuses on the partitioning and distribution of ground
truth and clusters. ARI (Equation 6) counts all
pairs that are assigned to the same or different clusters in the
predicted and ground-truth partitions. We ran experiments with both
balanced and unbalanced datasets because, according to [49],
the cluster size could affect the results.
Table 5 shows ACC, NMI, and ARI for each method with
different datasets. All experiments have three labels except the
last two. One has four labels, combining
the RFW dataset (Black, White, Asian) with a single Arab
label from the Arab dataset. The last experiment has five
classes, with the Indian class from RFW added.
The best ACC was 0.5955 on FERET by DynAE. In Figure
14, most images are clustered as White. Looking at the
statistics of the FERET dataset (Asian: 952, Black: 257,
White: 2883), the White class makes up 70% of all images,
which affects the ACC.
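The effect of the FERET class imbalance on ACC can be worked out directly from the counts quoted above. The short sketch below computes the class shares and the ACC that Equation 3 would assign to a degenerate clustering that places every image in a single cluster; the variable names are illustrative.

```python
# FERET label counts as quoted in the text.
feret_counts = {"Asian": 952, "Black": 257, "White": 2883}

total = sum(feret_counts.values())  # 4092 images
shares = {label: count / total for label, count in feret_counts.items()}

# With one cluster, the best one-to-one mapping assigns it to the largest
# class, so ACC equals the largest class share.
single_cluster_acc = max(feret_counts.values()) / total

for label, share in shares.items():
    print(f"{label}: {share:.4f}")
print(f"ACC of a single-cluster assignment: {single_cluster_acc:.4f}")  # 0.7045
```

A single-cluster assignment has zero mutual information with the labels, so on a dataset this skewed ACC has to be read together with NMI and ARI.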
FIGURE 14.
Normalized confusion matrix of DynAE clusters on FERET
dataset.
The worst ACC was 0.3206 on RFW (4 labels) + Arab, also by
DynAE. Figure 15 shows that the share of correct predictions is low for all
labels, with Black being the highest at 44%,
while the rest ranged from 22% to 36%.
FIGURE 15.
Normalized confusion matrix of DynAE clusters on RFW (4
labels) + Arab dataset.
TABLE 5
ACC, NMI, AND ARI RESULTS OF DEEP CLUSTERING METHODS FOR EACH DATASET

| Dataset | DynAE ACC | DynAE NMI | DynAE ARI | IDEC ACC | IDEC NMI | IDEC ARI | DEC ACC | DEC NMI | DEC ARI |
|---|---|---|---|---|---|---|---|---|---|
| FERET dataset | 0.5955 | 0.0012 | -0.0008 | 0.4081 | 0.0001 | -0.0007 | 0.4147 | 0.0001 | -0.0004 |
| FERET balanced dataset | 0.4176 | 0.0186 | 0.0170 | 0.4034 | 0.0166 | 0.0138 | 0.4008 | 0.0161 | 0.0135 |
| Arab full dataset | 0.4328 | 0.0533 | 0.0577 | 0.4060 | 0.0750 | 0.0213 | 0.4502 | 0.0740 | 0.0565 |
| Arab balanced dataset | 0.3724 | 0.0063 | 0.0050 | 0.4050 | 0.0394 | 0.0421 | 0.4683 | 0.0561 | 0.0621 |
| RFW dataset | 0.5292 | 0.2071 | 0.1854 | 0.5364 | 0.1902 | 0.1938 | 0.5258 | 0.1897 | 0.1866 |
| RFW (3 labels) + Arab (1 label) | 0.4116 | 0.1311 | 0.0999 | 0.5187 | 0.2682 | 0.2459 | 0.5235 | 0.2714 | 0.2543 |
| RFW (4 labels) + Arab (1 label) | 0.3206 | 0.1143 | 0.0735 | 0.4232 | 0.2366 | 0.1818 | 0.4399 | 0.2451 | 0.2024 |
NMI and ARI consider the unmatched parts of clusters, the
distribution of images, and pairing [49]. Their best results
were 0.2714 and 0.2543, respectively, by DEC applied to RFW
(3 labels) + Arab, while there are several lows, most notably
in all experiments on the FERET dataset. DynAE with the FERET
dataset achieved an NMI of 0.0012 and an ARI of -0.0008. Figure
14 shows that images of each label were distributed
throughout the clusters by the same percentage, which means
there is no specific relation between cluster items.
The second example is Figure 16, which has an NMI of 0.0740 and
an ARI of 0.0565. Around half of GCC is in one cluster, while
Egyptian and Levant were similarly distributed.
Figure 17 has better results than the previous ones, with an
NMI of 0.1902 and an ARI of 0.1938. Black dominates one
cluster, while White and Asian are similarly distributed in the
other two clusters. However, the share of correctly clustered images here is
higher than in the previous one.
Figure 18 has the best results for NMI and ARI. Arab and
Black each dominate one cluster. Asian and
White are similarly distributed, with half of Asians being
clustered correctly.
FIGURE 16.
Normalized confusion matrix of DEC clusters on Arab
dataset.
FIGURE 17.
Normalized confusion matrix of IDEC clusters on RFW
dataset.
FIGURE 18.
Normalized confusion matrix of DEC clusters on a
combination of RFW dataset and Arab dataset. (4 labels)
Experiments on the FERET dataset and its balanced
version have similar results. Even though the ACC is
between 40% and 59%, NMI and ARI are between 0 and 0.02.
These results indicate that the partitions are random, that ground truth
and clusters are independent, and that the model is uncertain about
the clusters.
Results of the experiments on the Arab dataset and its balanced version
are also similar. ACC is between 37% and 47%. NMI
and ARI are slightly better here: NMI ranges from 0.03 to
0.07 and ARI from 0.01 to 0.08. Only one experiment,
DynAE on the Arab balanced dataset, performed worse than the
rest, with NMI and ARI near zero. The rest of the results
showed a small improvement. However, they are still low:
ground truth and clusters are nearly independent and not
similar.
Another experiment was done on the RFW dataset with
three labels (Black, White, Asian). The NMI and ARI results
are much better than in the previous experiments. The best NMI
was 0.2071 by DynAE, while the lowest was 0.1897 by DEC.
The best ARI was 0.1938 by IDEC, while the worst was 0.1854 by
DynAE. ACC was near 53% for all methods.
The last two experiments were done on RFW + Arab. One
has four labels (Black, White, and Asian from RFW, plus Arab),
while the other adds Indian. DynAE performed
the worst on all metrics in both experiments. In the first experiment, DEC and
IDEC both had an ACC of 52%, NMI
of 0.2714 and 0.2682, which are the best of all experiments, and
ARI of 0.2543 and 0.2459, respectively. In the last
experiment, DEC and IDEC had ACC of 44% and 42%, NMI
of 0.2451 and 0.2366, and ARI of 0.2024 and 0.1818, respectively.
The confusion matrices in Figures
18 and 19 are similar: 78% and 71% of Black were grouped together, respectively.
Almost half of Asians and White were grouped in one cluster
in both experiments, while Arabs were divided into two
clusters in Figure 18, one of which also contains around 20%
of Asians and White. In Figure 19, Arab and
Indian are spread over three clusters.
FIGURE 19.
Normalized confusion matrix of DEC clusters on a
combination of RFW dataset and Arab dataset. (5 labels)
Based on the discussion above, none of the deep clustering methods
considered in the experiments showed consistent
performance. Moreover, the
low NMI and ARI scores confirm low intra-cluster and high inter-cluster
similarity. It can therefore be said that facial features
are very similar across the boundaries of different ethnic
groups, and this can be one of the reasons for the poorer
performance of clustering. In support of this conclusion, it is
noticeable that all the models, supervised and
unsupervised, provide the best accuracy with the RFW dataset, and the
NMI and ARI scores are higher for this dataset as well.
A conclusion can be drawn from these experiments. Table
6 compares the accuracy of the supervised and
unsupervised learning models on two datasets, Arab and
RFW (3 labels) + Arab (1 label). Supervised learning
achieves better results, whereas the unsupervised methods that
were used could not yet match the performance level of the
supervised learning model.
TABLE 6
ACCURACY OF SUPERVISED LEARNING MODEL TESTED ON THE SAME DATASET, AND AVERAGE ACC OF ALL THREE UNSUPERVISED LEARNING METHODS

| | Arab dataset | RFW (3 labels) + Arab (1 label) |
|---|---|---|
| Supervised (accuracy) | 0.7406 | 0.9663 |
| Unsupervised (avg ACC) | 0.4296 | 0.4627 |
V. CONCLUSION AND FUTURE WORK
In this study, we investigate the ability of a CNN model to
classify sub-ethnic groups of Arabs. First, we create an Arab
dataset with three labels chosen according to countries'
distribution into regions. Then a pre-trained ResNet50 model
was used to classify the Arab dataset. Over 60 experiments
were done to fine-tune hyperparameters, as explained in more
detail in the supervised learning section. After that, the models
were evaluated on a different dataset, and the best accuracy
was 0.5697. Another experiment was done after
balancing the number of subjects in each class. The accuracy
after evaluation on a different dataset was 0.5212. In both
experiments, the model struggles to distinguish between
labels, which can be due to the strong similarity between them.
A third experiment was done to classify Arabs as a whole
and the other three ethnicities (Black, White, Asian) from the
RFW dataset. The model was evaluated two times with two
datasets (BUPT-TRANSFERFACE, UTK) each combined
with our private Arab dataset. The results were 0.9675 and
0.6995 respectively.
For the deep clustering experiments, ACC results were between
32% and 59%. However, NMI and ARI vary according to each
dataset and method. They were near zero in the FERET dataset
experiments. The best results came from the experiments on a combination
of three labels from RFW and one Arab label, with an
NMI of 0.2714 and an ARI of 0.2543 by DEC. In the future, we
would like to investigate more methods for ethnicity
classification.
This study has some limitations. First, our Arab dataset does
not cover all countries of the Arab world. The limited time and
lack of knowledge about public figures in other countries
made it hard to collect a sufficient number of subjects. Moreover,
the Arab dataset is unbalanced, with GCC having 2/3 of the subjects.
We recommend that future work increase the number of
subjects for the other labels and cover other countries if
possible. Regarding age, the Arab dataset does not include people
under 17, so we are not sure whether the same results apply to
them. Also, due to limited memory, we resized images to a small size (60x60) for
the deep clustering methods.
We are concerned about how this size could affect the quality
of the results.
Link of the dataset:
https://www.dropbox.com/sh/j4kjs9z9qnkewad/AABixLKWaME-3YiCfqKOdmlSa?dl=0
ACKNOWLEDGMENT
Portions of the research in this paper use the FERET database
of facial images collected under the FERET program,
sponsored by the DOD Counterdrug Technology
Development Program Office.
REFERENCES
[1] C. Yu, Y. Fang, and Y. Li, “Multi-Task Learning
for Face Ethnicity and Gender Recognition,” 2014,
pp. 136–144.
[2] H. Ding, D. Huang, Y. Wang, and L. Chen, “Facial
ethnicity classification based on boosted local
texture and shape descriptions,” in 2013 10th IEEE
International Conference and Workshops on
Automatic Face and Gesture Recognition (FG),
2013, pp. 1–6.
[3] S. Masood, S. Gupta, A. Wajid, S. Gupta, and M.
Ahmed, “Prediction of Human Ethnicity from
Facial Images Using Neural Networks,” in Data
Engineering and Intelligent Computing, 2018, pp.
217–226.
[4] A. K. Jain, S. C. Dass, and K. Nandakumar, “Soft
Biometric Traits for Personal Recognition
Systems,” in Biometric Authentication, 2004, pp.
731–738.
[5] K. Veropoulos, G. Bebis, and M. Webster,
“Investigating the Impact of Face Categorization on
Recognition Performance,” in Advances in Visual
Computing, 2005, pp. 207–218.
[6] N. Kumar, P. Belhumeur, and S. Nayar,
“FaceTracer: A Search Engine for Large
Collections of Images with Faces,” in Computer
Vision -- ECCV 2008, 2008, pp. 340–353.
[7] Y. Hu, Y. Fu, U. Tariq, and T. S. Huang,
“Subjective Experiments on Gender and Ethnicity
Recognition from Different Face Representations,”
in Advances in Multimedia Modeling, 2010, pp. 66–
75.
[8] A. Dantcheva, P. Elia, and A. Ross, “What Else
Does Your Biometric Data Reveal? A Survey on
Soft Biometrics,” IEEE Trans. Inf. Forensics
Secur., vol. 11, no. 3, pp. 441–467, 2016.
[9] K. Niinuma, U. Park, and A. K. Jain, “Soft
biometric traits for continuous user authentication,”
IEEE Trans. Inf. Forensics Secur., vol. 5, no. 4, pp.
771–780, 2010.
[10] Z. Heng, M. Dipu, and K. Yap, “Hybrid Supervised
Deep Learning for Ethnicity Classification using
Face Images,” in 2018 IEEE International
Symposium on Circuits and Systems (ISCAS), 2018,
pp. 1–5.
[11] E. J. Jewell and F. R. Abate, “The New Oxford
American Dictionary,” Oxford Univ. Press, 2001.
[12] D. J. Da Silva Santos, N. B. Palomares, D.
Normando, C. Cardoso, and A. Quintão, “Race
versus ethnicity: Differing for better application,”
Dent. Press J Orthod, vol. 15, no. 3, pp. 121–4,
2010.
[13] G. Muhammad, M. Hussain, F. Alenezy,
G. Bebis, A. M. Mirza, and H. Aboalsamh,
“Race Classification from Face
Images Using Local Descriptors,” Int. J.
Artif. Intell. Tools, vol. 21, no. 5, p. 1250019, 2012.
[14] Z. Heng, M. Dipu, and K. H. Yap, “Hybrid
Supervised Deep Learning for Ethnicity
Classification using Face Images,” 2018 IEEE Int.
Symp. Circuits Syst., pp. 1–5, 2018.
[15] I. Anwar and N. U. Islam, “Learned features are
better for ethnicity classification,” Cybern. Inf.
Technol., vol. 17, no. 3, pp. 152–164, 2017.
[16] H. H. K. Tin and M. M. Sein, “Race Identification
from Face Images,” proceeding Int. Conf. Adv.
Comput. Eng. (ACE 2011), pp. 1–4, 2011.
[17] N. Srinivas, H. Atwal, D. C. Rose, G. Mahalingam,
K. Ricanek, and D. S. Bolme, “Age, Gender, and
Fine-Grained Ethnicity Prediction Using
Convolutional Neural Networks for the East Asian
Face Dataset,” Proc. - 12th IEEE Int. Conf. Autom.
Face Gesture Recognition, FG 2017 - 1st Int. Work.
Adapt. Shot Learn. Gesture Underst. Prod.
ASL4GUP 2017, Biometrics Wild, Bwild 2017,
Heteroge, pp. 953–960, 2017.
[18] C. Wang, Q. Zhang, X. Duan, and J. Gan, “Multi-
ethnical Chinese facial characterization and
analysis,” Multimed. Tools Appl., vol. 77, no. 23,
pp. 30311–30329, 2018.
[19] “Arab Countries 2019.” [Online]. Available:
http://worldpopulationreview.com/countries/arab-
countries/. [Accessed: 08-May-2019].
[20] N. Narang and T. Bourlai, “Gender and ethnicity
classification using deep learning in heterogeneous
face recognition,” in 2016 International Conference
on Biometrics, ICB 2016, 2016, pp. 1–8.
[21] W. Wang, F. He, and Q. Zhao, “Facial Ethnicity
Classification with Deep Convolutional Neural
Networks,” in Biometric Recognition, Springer,
Cham, 2016, pp. 176–185.
[22] A. Gudi, “Recognizing Semantic Features in Faces
using Deep Learning,” 2016.
[23] H. Chen, Y. Deng, and S. Zhang, “Where am I
from?-East Asian Ethnicity Classification from
Facial Recognition,” 2016.
[24] “No Title.” [Online]. Available:
https://www.capmas.gov.eg/Pages/populationClock
.aspx.
[25] N. Grira, M. Crucianu, and N. Boujemaa,
“Unsupervised and Semi-supervised Clustering: A
Brief Survey,” A Rev. Mach. Learn. Tech.
Process. Multimed. Content, Rep. MUSCLE Eur.
Netw. Excell. (6th Framew. Program.), pp. 1–12,
2004.
[26] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised
deep embedding for clustering analysis,” 33rd Int.
Conf. Mach. Learn. ICML 2016, vol. 1, pp. 740–
749, 2016.
[27] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, and J.
Long, “A Survey of Clustering with Deep Learning:
From the Perspective of Network Architecture,”
IEEE Access, vol. 6, pp. 39501–39514, 2018.
[28] X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep
embedded clustering with local structure
preservation,” IJCAI Int. Jt. Conf. Artif. Intell., vol.
0, pp. 1753–1759, 2017.
[29] N. Mrabah, N. M. Khan, R. Ksantini, and Z.
Lachiri, “Deep clustering with a Dynamic
Autoencoder: From reconstruction towards
centroids construction,” Neural Networks, vol. 130,
pp. 206–228, 2020.
[30] “Deep Clustering with a Dynamic Autoencoder:
From Reconstruction towards Centroids
Construction.” [Online]. Available:
https://paperswithcode.com/paper/deep-clustering-
with-a-dynamic-autoencoder.
[31] “simple_image_download.” [Online]. Available:
https://github.com/RiddlerQ/simple_image_download.
[32] M. Wang, W. Deng, J. Hu, X. Tao, and Y. Huang,
“Racial Faces in-the-Wild: Reducing Racial Bias by
Information Maximization Adaptation Network,” in
Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2019, pp. 692–
702.
[33] M. Wang and W. Deng, “Mitigate Bias in Face
Recognition using Skewness-Aware Reinforcement
Learning,” 2019.
[34] P. J. Phillips, H. Wechsler, J. Huang, and P. J.
Rauss, “The FERET database and evaluation
procedure for face recognition,” Image Vis.
Comput., vol. 16, no. I 998, pp. 295–306, 1997.
[35] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss,
“The FERET Evaluation Methodology for Face-
Recognition Algorithms,” pp. 1–20,
1999.
[36] Susanqq, “UTKFace.” [Online]. Available:
https://susanqq.github.io/UTKFace/.
[37] N. Dalal et al., “Histograms of Oriented Gradients
for Human Detection,” IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit., pp. 886–893, 2010.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton,
“ImageNet classification with deep convolutional
neural networks,” in Advances in Neural
Information Processing Systems, 2012, vol. 2, pp.
1097–1105.
[39] W. Rawat and Z. Wang, “Deep convolutional
neural networks for image classification: A
comprehensive review,” Neural Comput., vol. 29,
no. 9, pp. 2352–2449, 2017.
[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik,
“Rich Feature Hierarchies for Accurate Object
Detection and Semantic Segmentation,” in 2014
IEEE Conference on Computer Vision and Pattern
Recognition, 2014, pp. 580–587.
[41] K. Chatfield, K. Simonyan, A. Vedaldi, and A.
Zisserman, “Return of the devil in the details:
Delving deep into convolutional nets,” BMVC 2014
- Proc. Br. Mach. Vis. Conf. 2014, 2014.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep
residual learning for image recognition,” Proc.
IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., vol. 2016–Decem, pp. 770–778, 2016.
[43] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A.
Zisserman, “VGGFace2: A dataset for recognising
faces across pose and age,” Proc. - 13th IEEE Int.
Conf. Autom. Face Gesture Recognition, FG 2018,
pp. 67–74, 2018.
[44] B. Barz and J. Denzler, “Deep Learning on Small
Datasets without Pre-Training using Cosine Loss,”
in 2020 IEEE Winter Conference on Applications of
Computer Vision (WACV), 2020, pp. 1360–1369.
[45] X. Guo, E. Zhu, X. Liu, and J. Yin, “Deep
Embedded Clustering with Data Augmentation,” in
Proceedings of The 10th Asian Conference on
Machine Learning, 2018, vol. 95, pp. 550–565.
[46] H. W. Kuhn, “The Hungarian method for the
assignment problem,” Nav. Res. Logist. Q., vol. 2,
no. 1a2, pp. 83–97, 1955.
[47] “Mutual Information based scores.” [Online].
Available: https://scikit-learn.org/stable/modules/clustering.html#mutual-info-score.
[48] V. Labatut, “Generalised measures for the
evaluation of community detection methods,” Int. J.
Soc. Netw. Min., vol. 2, no. 1, pp. 44–63, 2015.
[49] M. Rezaei and P. Franti, “Set matching measures
for external cluster validity,” IEEE Trans. Knowl.
Data Eng., vol. 28, no. 8, pp. 2173–2186, 2016.
Norah A. Al-Humaidan received the B.S degree in Computer Science
from Qassim University, Saudi Arabia, in 2016. She is currently a Master's
student in Computer Science Department, Qassim University.
Master Prince received the B.S degree in
computer science from Patna University, India, in
1996, the M.S degree in computer science from
Indira Gandhi National Open University, New
Delhi, India, in 2004, and the Ph.D. degree in
computer science from Pune University, India, in
2008. Since 2009, he has been working as an
Assistant Professor with the Department of
Computer Science, Qassim University, Saudi
Arabia. His research interests include computer
vision and machine learning. Dr. Prince received the Best Ph.D. Thesis
Dissertation of the Year 2009 Award of the Pune University, India.
... Some have blue eyes and red hair, while others are dark-skinned (Awad et al., 2021;Wang, 2022). This diversity reflects the complex history of the Arab world, where various ethnic groups, including Persians, Turks, and Africans, have migrated and intermixed with the Arab population over time (Al-Humaidan & Prince, 2021;Eid, 2007). Despite this diversity, traditional Arabic and Islamic references tend to consider Arab as a race, particularly for those who inhabit the Arabian Peninsula (Ibn Khaldun, 1958;Tabari, 1987). ...
Thesis
Full-text available
This research explored the impact of religious, cultural, and traditional beliefs on Arab Muslims' understanding of mental disorders and their treatment, with a particular emphasis on the role of Islamic theology. Employing an exploratory sequential mixed-methods design, the study first conducted qualitative semi-structured interviews with 12 Arab Muslim participants (6 men and 6 women) to examine their mental health perspectives. Thematic analysis of these interviews informed the development of a quantitative survey, which was administered to 169 Arab Muslim participants using Qualtrics. The quantitative data were analysed using SPSS 29. The integration of qualitative and quantitative findings revealed that Arab Muslim participants exhibited moderate to high levels of religiosity, which, along with their cultural and traditional beliefs influenced their mental health perceptions and treatment approaches. Notably, a discrepancy between participants' self-identification as religious and their actual religious practices suggests a cultural value placed on modesty. The preference for traditional healing practices and supernatural explanations for mental disorders indicates a strategic approach to navigating mental health stigma. Furthermore, education was identified as a crucial element in dispelling mental health misconceptions, with higher levels of education associated with a more accurate understanding of mental disorders and an increased likelihood of utilising formal mental health services. These insights highlight the challenges of integrating cultural, religious, and educational factors in shaping mental health perceptions and underscore the need for culturally and religiously sensitive mental health interventions and education. This study advocates for bridging the gap between traditional beliefs and formal mental health services to improve access and attitudes towards mental health care among Arab Muslims in Australia.
... ResNet50 Model Architecture[6] ...
Preprint
Full-text available
In nations such as Bangladesh, agriculture plays a vital role in providing livelihoods for a significant portion of the population. Identifying and classifying plant diseases early is critical to prevent their spread and minimize their impact on crop yield and quality. Various computer vision techniques can be used for such detection and classification. While CNNs have been dominant on such image classification tasks, vision transformers has become equally good in recent time also. In this paper we study the various computer vision techniques for Bangladeshi rice leaf disease detection. We use the Dhan-Shomadhan -- a Bangladeshi rice leaf disease dataset, to experiment with various CNN and ViT models. We also compared the performance of such deep neural network architecture with traditional machine learning architecture like Support Vector Machine(SVM). We leveraged transfer learning for better generalization with lower amount of training data. Among the models tested, ResNet50 exhibited the best performance over other CNN and transformer-based models making it the optimal choice for this task.
... Graph Convolutional Networks are effectively used to learn discriminant spatial features from the input face images and are proven to outperform for occluded face detection [7]. The authors in [8] use a deep convolutional neural network (DCNN) with a SoftMax classifier to perform the classification task on Arab ethnicities. A coupled attribute learning framework that leverages the relationships between facial attributes and face recognition is proposed in [9]. ...
Article
Security concerns are rampant in our world today. No location is immune to crime, whether it be public spaces, commercial areas, offices, or private property. While law enforcement agencies work to address crimes after they have taken place, there is no guarantee that they can prevent them from occurring. Communities and businesses often turn to security personnel and surveillance cameras to keep watch, but this only provides passive protection. Criminals may act while security is absent, and cameras only record events; they do not prevent them. Although a few systems with face recognition and alerting mechanisms exist, they are too expensive for the average person. In response to this challenge, this work proposes an economical, hassle-free, proactive home security system that can be used as a plug-and-play device with an already installed surveillance system employing CCTV cameras. This system features facial recognition technology, enabling it to identify authorized individuals by storing their images in a database, thus swiftly distinguishing between intruders and residents. In cases of unauthorized intrusions, the system promptly activates alerts and alarms. Additionally, it incorporates supplementary security measures utilizing sensors and actuators for heightened protection. Moreover, the system utilizes an IP camera, permitting remote monitoring via live streaming. An array of machine learning algorithms is assessed to ascertain the most effective method for facial recognition, integrated into a monitoring dashboard using RTSP streaming.
... Race classification has become significant in applications like surveillance [37] and market advertising [38]. Recently, various deep learning models have been used for race identification [37,39,40,41,42]. ...
Preprint
Technologies for recognizing facial attributes like race, gender, age, and emotion have several applications, such as surveillance, advertising content, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics based on images and analyzing facial expressions have several challenges due to the complexity of humans' facial attributes. Traditional approaches have employed CNNs and various other deep learning techniques, trained on extensive collections of labeled images. While these methods demonstrated effective performance, there remains potential for further enhancements. In this paper, we propose to utilize vision language models (VLMs) such as generative pre-trained transformer (GPT), GEMINI, large language and vision assistant (LLAVA), PaliGemma, and Microsoft Florence2 to recognize facial attributes such as race, gender, age, and emotion from images with human faces. Various datasets like FairFace, AffectNet, and UTKFace have been utilized to evaluate the solutions. The results show that VLMs are competitive if not superior to traditional techniques. Additionally, we propose "FaceScanPaliGemma"--a fine-tuned PaliGemma model--for race, gender, age, and emotion recognition. The results show an accuracy of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose "FaceScanGPT", which is a GPT-4o model to recognize the above attributes when several individuals are present in the image using a prompt engineered for a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT to detect the individual's attributes like hair cut, clothing color, postures, etc., using only a prompt to drive the detection and recognition tasks.
Article
This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. This method, with its shorter duration, is the most efficient and suitable for fast-object detection in IoT systems, which require energy-efficient applications compared to end-to-end methods. We have implemented this technique on three edge devices: AMD Alveo™ U50, Jetson Orin Nano, and Hailo-8™ AI Accelerator, and four models with artificial neural networks and transformer models. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently has high accuracy, low latency, and is highly energy-efficient. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that the proposed algorithm has improved the average accuracy gain by 28.314%, the average efficiency gain by 3.6 times, and the average latency reduction by 39.305% compared to the end-to-end method. Of all these classes, the faster objects are trains and airplanes. Experiments show that the accuracy percentage for trains and airplanes is lower than other categories. So, in tasks that require fast detection and accurate results, end-to-end methods can be a disaster because they cannot handle fast object detection. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm. It is well suited for applications in IoT systems, especially those that require fast-moving object detection and higher accuracy.
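As an illustration of the frame difference idea mentioned in this abstract, the following is a minimal NumPy sketch, not the cited paper's implementation. It assumes grayscale frames, a fixed intensity threshold, and a single moving region; the function names `frame_difference_mask` and `bounding_box` and the threshold value are hypothetical. In a pipeline of this kind, the returned box would presumably be cropped and passed to a classifier such as MobileNet.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose intensity changed by more than `threshold` (assumed value)."""
    # Cast to a signed type so the subtraction of uint8 frames cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

def bounding_box(mask):
    """Return (top, left, bottom, right) of the changed region, or None if nothing moved."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

# Synthetic example: a bright 2x2 object appears in an otherwise static scene.
prev_frame = np.zeros((8, 8), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[2:4, 5:7] = 200

box = bounding_box(frame_difference_mask(prev_frame, curr_frame))
print(box)  # (2, 5, 3, 6)
```

Only the changed region needs to be classified, which is one plausible reason such a method can be cheaper than running a full detector on every frame.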
Article
Our face plays a vital role in many human-to-human encounters and is closely linked to our identity. Significant promise exists for the automatic recognition of facial features, opening the door to hands-free alternatives and innovative uses in computer-human digital interactions. Deep learning techniques have led to a notable increase in interest in the field of face image analysis in recent years, especially in applications like biometrics, security, and surveillance. Due to feature overlaps and dataset under-representation, ethnicity classification in computer vision is still a difficult task, particularly for African populations. This study explores Nigerian ethnicity classification, focusing on the three major groups (Hausa, Igbo, and Yoruba), using a hybrid model that integrates MobileNetV2, Local Binary Patterns (LBP), and an Attention Mechanism. The hybrid model achieved an overall classification accuracy of 87%, significantly outperforming benchmarks, particularly in Igbo and Yoruba classifications. While the Yoruba group demonstrated the highest accuracy, overlaps between Hausa and Igbo highlight areas for refinement. This research advances the field by addressing dataset imbalances, incorporating innovative feature fusion, and improving the inclusivity of computer vision models. It has practical implications for identity verification, security, and demographic research while emphasizing the importance of culturally sensitive AI systems tailored to underrepresented populations. Future work includes expanding datasets, enhancing model architectures, and exploring interdisciplinary approaches to further refine ethnicity classification.
Article
Recently, researchers in the field of Machine Learning have paid a lot of attention to the human face. Soft biometric features taken from a facial image can be used to distinguish between racial classes. Soft biometric features include race, age, gender, and emotions. Research has employed different techniques (traditional and deep learning) in predicting the major racial classes (Asian, Hispanic, African, Caucasian) with outstanding cutting-edge performances. Recently, research has focused on identifying distinguishing characteristics in sub-racial (ethnic) groups. Racial profiling has been used in a variety of fields, including social media profiling, security surveillance, law enforcement, and targeted advertising. By reviewing relevant studies in the field, we noted that the Black race (African/African American) is considered a single racial entity, that the models developed do not have practical application in the Nigerian domain, and that most of the available datasets are racially imbalanced. As a result, the goal of this research is to create a unique dataset with accurate labels for Nigeria's three major ethnic groups, and then to use deep learning techniques to classify these labels. There are three labels in the image dataset: Hausa, Igbo, and Yoruba. For feature extraction and classification, a pre-trained Convolutional Neural Network (CNN) was used. The model was evaluated on the test set, and the Hausa ethnic group had the highest accuracy of 87.3%; lower accuracies were recorded for the Igbo and Yoruba subclasses, at 56.0% and 56.0%, respectively. The result could be attributed to migration and inter-ethnic marriages, which have blurred the boundaries between the ethnic groups. Key words: Racial Classification, Convolutional Neural Network, Computer Vision, Transfer Learning
Chapter
With science and technology taking quantum leaps every day, a lot of progress has been made in the field of deep networks and computer vision. Identifying various features from input face images to draw meaningful information and critical insights has garnered much interest. Using these features and images, AI models can provide real-time feedback and alerts for appropriate interventions in health. However, these results lack sufficient accuracy due to the convoluted network architecture and complexity of time regarding the weight suboptimal solution. This chapter aims to create a model that predicts age, gender, and ethnicity using the UTKFace Dataset. Post the data cleaning and label extraction, various neural network architectures were trained, and the performances of these models were evaluated to conduct a comparative analysis. The study demonstrated that ResNet-50 will facilitate the creation of a robust and efficient model for the purpose of gender prediction while EfficientNet B0 could be deployed to enhance the performance of age and ethnicity prediction. Combining such information with AI models can help to develop predictive algorithms that assess an individual’s risk of developing certain diseases or conditions based on the mentioned demographic factors.
Conference Paper
Deep Embedded Clustering (DEC) surpasses traditional clustering algorithms by jointly performing feature learning and cluster assignment. Although a lot of variants have emerged, they all ignore a crucial ingredient, data augmentation, which has been widely employed in supervised deep learning models to improve the generalization. To fill this gap, in this paper, we propose the framework of Deep Embedded Clustering with Data Augmentation (DEC-DA). Specifically, we first train an autoencoder with the augmented data to construct the initial feature space. Then we constrain the embedded features with a clustering loss to further learn clustering-oriented features. The clustering loss is composed of the target (pseudo label) and the actual output of the feature learning model, where the target is computed by using clean (non-augmented) data, and the output by augmented data. This is analogous to supervised training with data augmentation and expected to facilitate unsupervised clustering too. Finally, we instantiate five DEC-DA based algorithms. Extensive experiments validate that incorporating data augmentation can improve the clustering performance by a large margin. Our DEC-DA algorithms become the new state of the art on various datasets.
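The clustering loss described in this abstract, with the target computed from clean data and the output from augmented data, can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the Student's t soft assignment and the sharpened target distribution of the original DEC formulation, which the abstract does not spell out, and it uses fixed embeddings and cluster centres in place of a trained autoencoder. All function names are hypothetical.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Soft cluster assignment q_ij via a Student's t kernel (assumed, as in original DEC)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened pseudo-label p_ij: square q, normalise by cluster frequency, renormalise rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(z_clean, z_aug, centers):
    """KL(P || Q) with the target P from clean embeddings and the output Q from augmented ones."""
    p = target_distribution(soft_assign(z_clean, centers))
    q = soft_assign(z_aug, centers)
    return float(np.sum(p * np.log(p / q)))

# Toy embeddings: two well-separated groups, plus small noise standing in for augmentation.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
z_clean = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])
z_aug = z_clean + rng.normal(0, 0.1, z_clean.shape)

loss = clustering_loss(z_clean, z_aug, centers)
print(loss)
```

In an actual training loop the loss would be differentiated with respect to the encoder weights and the cluster centres, while the target is held fixed between updates; that part is omitted here.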
Article
Clustering is a fundamental problem in many data-driven application domains, and clustering performance highly depends on the quality of data representation. Hence, linear or non-linear feature transformations have been extensively used to learn a better data representation for clustering. In recent years, many works have focused on using deep neural networks to learn a clustering-friendly representation, resulting in a significant increase in clustering performance. In this paper, we give a systematic survey of clustering with deep learning from the perspective of architecture. Specifically, we first introduce the preliminary knowledge for better understanding of this field. Then, a taxonomy of clustering with deep learning is proposed and some representative methods are introduced. Finally, we propose some interesting future opportunities of clustering with deep learning and give some concluding remarks.
Article
Facial image based characterization and analysis of ethnicity, which is an important index of human demography, have become increasingly popular in the research areas of pattern recognition, computer vision, and machine learning. Many applications, such as face recognition and facial expression recognition, are affected by ethnicity information of individuals. In this study, we first create a human face database, which focuses on human ethnicity information and includes individuals from eight ethnic groups in China. This dataset can be used to conduct psychological experiments or evaluate the performance of computational algorithms. To evaluate the usefulness of this created dataset, some critical landmarks of these face images are detected and three types of features are extracted as ethnicity representations. Next, the ethnicity manifolds are learnt to demonstrate the discriminative power of the extracted features. Finally, ethnicity classifications with different popular classifiers are conducted on the constructed database, and the results indicate the effectiveness of the proposed features.
Article
In this paper, we introduce a new large-scale face dataset named VGGFace2. The dataset contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession (e.g. actors, athletes, politicians). The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise. We describe how the dataset was collected, in particular the automated and manual filtering stages to ensure a high accuracy for the images of each identity. To assess face recognition performance using the new dataset, we train ResNet-50 (with and without Squeeze-and-Excitation blocks) Convolutional Neural Networks on VGGFace2, on MS-Celeb-1M, and on their union, and show that training on VGGFace2 leads to improved recognition performance over pose and age. Finally, using the models trained on these datasets, we demonstrate state-of-the-art performance on the IJB-A and IJB-B face recognition benchmarks, exceeding the previous state-of-the-art by a large margin. Datasets and models are publicly available.
Article
In unsupervised learning, there is no apparent straightforward cost function that can capture the significant factors of variations and similarities. Since natural systems have smooth dynamics, an opportunity is lost if an unsupervised objective function remains static. The absence of concrete supervision suggests that smooth dynamics should be integrated during the training process. Compared to classical static cost functions, dynamic objective functions make better use of the gradual and uncertain knowledge acquired through pseudo-supervision. In this paper, we propose Dynamic Autoencoder (DynAE), a novel model for deep clustering that addresses a clustering-reconstruction trade-off, by gradually and smoothly eliminating the reconstruction objective function in favor of a construction one. Experimental evaluations on benchmark datasets show that our approach achieves state-of-the-art results compared to the most relevant deep clustering methods.