Distinguishing Between Natural and GAN-
Generated Face Images by Combining
Global and Local Features
CHEN Beijing1,2,3, TAN Weijin1,2, WANG Yiting4, and ZHAO Guoying5
(1. Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing
University of Information Science and Technology, Nanjing 210044, China)
(2. School of Computer, Nanjing University of Information Science and Technology, Nanjing 210044, China)
(3. Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET),
Nanjing University of Information Science and Technology, Nanjing 210044, China)
(4. Warwick Manufacturing Group, University of Warwick, Coventry CV4 7AL, UK)
(5. Center for Machine Vision and Signal Analysis, University of Oulu, Oulu 90014, Finland)
Abstract—With the development of face image syn-
thesis and generation technology based on generative ad-
versarial networks (GANs), it has become a research hot-
spot to determine whether a given face image is natural
or generated. However, the generalization capability of
the existing algorithms is still to be improved. Therefore,
this paper proposes a general algorithm. To do so, firstly,
the learning on important local areas, containing many
face key-points, is strengthened by combining the global
and local features. Secondly, metric learning based on the
ArcFace loss is applied to extract common and discrimin-
ative features. Finally, the extracted features are fed into
the classification module to detect GAN-generated faces.
The experiments are conducted on two publicly available
natural datasets (CelebA and FFHQ) and seven GAN-
generated datasets. Experimental results demonstrate that
the proposed algorithm achieves better generalization
performance than the state-of-the-art algorithms, with an
average detection accuracy over 0.99. Moreover, the pro-
posed algorithm is robust against additional attacks, such
as Gaussian blur and Gaussian noise addition.
Keywords—Generated image, Global feature, Loc-
al features, Generative adversarial network, Metric
learning.
I.Introduction
In recent years, deep learning-based generative
techniques, especially for generative adversarial net-
works (GANs), have been applied to the field of face
image synthesis and generation. The visual quality of
GAN-generated face images is getting closer to natural
face images by some strategies, such as depth structure,
adversarial training, and prior information fusion. Some
advanced GANs, such as boundary equilibrium GAN
(BEGAN)[1], progressive growing of GAN (PGGAN)[2],
and StyleGAN[3], have shown great success in gen-
erating high-resolution and photo-realistic face images.
Then, many tools have been released, such as
Face2Face, FaceSwap, and DeepFake. Consequently, it
has become more and more difficult to identify the gen-
erated face images from natural face images with naked
human eyes. Besides, it is well known that face images
have been widely used in identification[4,5] and authen-
tication services[6] in daily life, such as face pay-
ment, face retrieval, and face check-in. Therefore, it has
become a research hotspot to determine whether a giv-
en face image is natural or generated.
Until now, various kinds of algorithms have been
proposed for detecting GAN-generated face images.
These algorithms can be roughly divided into two cat-
egories: intrinsic attributes-based[7−11] and deep learn-
ing-based[12−17]. The intrinsic attribute-based al-
gorithms start from the perspective of conventional di-
gital image forensics to exploit the inconsistence of dif-
ferent types of attributes in face images. Yang et al.[7]
considered the geometric inconsistency based on facial
feature points. Li et al.[8] analyzed the disparities in dif-
ferent color spaces, and extracted the feature informa-
tion by the color components. Nataraj et al.[9] extrac-
ted the co-occurrence matrices on three color channels
in the pixel domain. McCloskey et al.[10] studied the
brightness and exposure between PGGAN-generated
and natural faces. Matern et al.[11] magnified several
visual artifacts in the global symmetry of organs and
the color of eyes. As can be seen, these intrinsic attributes-
based algorithms are based on hand-crafted features,
which may limit their generalization capability[18,19].
Besides, for a given face image to be authenticated,
which algorithm is suitable is unknown due to the absence
of prior information[18]. So, recently, the deep learning-
based algorithms[12−17] have been proposed. Basically,
these algorithms learn image features automatically and do not need any prior knowledge or
assumptions. More specifically, Marra et al.[12] used
transfer learning with Xception model[13]. Mo et al.[14]
transformed the face image into residuals by a high-pass
filter and then extracted features from the residual in-
put by convolutional neural network (CNN). Hsu et
al.[15] adopted contrastive loss to seek typical features.
Tariq et al.[16] presented an ensemble-based neural net-
work classifier with three sub-networks, i.e., Shal-
lowNet v1-v3 having different depths. However, the
generalization capabilities of these deep learning-based
algorithms still need to be improved. Besides, Wang et
al.[17] pointed out that the generalization capability was
a significant indicator in the field of image forensics.
This makes sense because face image generation models
are highly varied in training datasets, network architec-
tures, loss functions, and image pre-processing opera-
tions. Besides, to the best of our knowledge, almost all
the existing works about GAN-generated face detection
only consider the global features while ignoring the local
features. However, local features have played significant
roles in the field of face recognition[20,21] and face syn-
thesis[22]; people usually pay more attention to some
local areas when distinguishing between the natural and
generated faces.
The solution that we propose in this paper addresses
the above-mentioned drawbacks. To enhance the gener-
alization capability, the learning on important local
areas is strengthened by combining the global and local
features. Moreover, in the feature learning phase, the
metric learning based on ArcFace loss[23] is applied to
learn common features in the same type of faces and
discriminative features between natural and GAN-gen-
erated faces.
The rest of this paper is organized as follows. In
Section II, some related techniques are reviewed briefly.
In Section III, the proposed algorithm is presented. The
experimental results and analysis are illustrated in Sec-
tion IV. Finally, we draw conclusions in Section V.
II.Related Techniques
In this section, some related techniques are re-
called, such as squeeze-and-excitation block and metric
learning.
1.Squeeze-and-excitation block
In human vision, attention usually leads people to
focus on the local area of the whole scene[24]. It can fil-
ter out irrelevant information and enhance important
information. As a lightweight gating attention mechan-
ism, the squeeze-and-excitation (SE) block[25] models
channel-wise relationships to enhance the representa-
tional ability of the network. The main architecture of
SE block is presented in Fig.1.
Fig.1. Architecture of the squeeze-and-excitation block (GAP → two FC layers → channel-wise scaling)
As shown in Fig.1, the feature map F is manip-
ulated to aggregate global information by global aver-
age pooling (GAP) operation. Then, the global feature
FGAP is fed into the SE operation. The SE operation is
composed of two fully connected (FC) layers. After
that, the attention scaler S is computed as,
$S = \sigma(FC_2(\delta(FC_1(F_{GAP}))))$  (1)
where σ corresponds to the sigmoid activation; δ repres-
ents the ReLU activation. Finally, the output of the
block is obtained by scaling operation. The scaled fea-
ture map Fscale is computed as,
$F_{scale} = \mathrm{Multiply}(F, S)$  (2)
where Multiply refers to channel-wise multiplication.
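For illustration, a minimal sketch of the SE block in TensorFlow/Keras (the framework used in Section IV) is given below; the reduction ratio of the two FC layers is an assumption, since the paper does not specify it.

```python
# A minimal sketch of the SE block (Eq.(1) and Eq.(2)); reduction=16 is assumed.
import tensorflow as tf
from tensorflow.keras import layers

def se_block(feature_map, reduction=16):
    """Squeeze-and-excitation: GAP -> FC1 -> ReLU -> FC2 -> sigmoid -> scale."""
    channels = feature_map.shape[-1]
    # Squeeze: global average pooling aggregates spatial information per channel.
    s = layers.GlobalAveragePooling2D()(feature_map)           # F_GAP, shape (B, C)
    # Excitation: two FC layers with ReLU and sigmoid (Eq.(1)).
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)        # attention scaler S
    # Scale: channel-wise multiplication (Eq.(2)).
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([feature_map, s])                 # F_scale
```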
2.Metric learning
Classic metric learning usually considers the char-
acteristics of data and learns an effective metric meth-
od to measure the similarity of data. Deep metric learn-
ing, which combines the advantages of deep neural net-
works and end-to-end training, has been widely used in
the field of computer vision.
Deep metric learning based on loss function has de-
veloped rapidly in recent years. Chopra et al.[26] pro-
posed contrastive loss and then introduced metric learn-
ing based on the contrastive loss into convolutional
neural networks. The aim of contrastive loss is to make
the feature distances of samples within a class smaller
and the feature distances of samples between different
classes larger. Com-
pared with the contrastive loss, triplet loss[27] further
considers the relative relationship of intra-class and
inter-class pairs. However, it is difficult for the triplet loss
to exploit all training samples fully, and the choice of sample pairs
affects the final result. So, Deng et al.[23] proposed the
ArcFace loss based on cosine angle instead of the dis-
tance to measure similarity for metric learning and ap-
plied it to the field of face recognition, achieving an ex-
cellent performance. Therefore, it is also introduced in-
to the proposed algorithm. Its role is to make the features
have small intra-class and large inter-class distances.
Given the feature X and weight vector W, the Ar-
cFace loss is presented as follows:

$$L_{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}} \quad (3)$$
where N and n represent the batch size and the num-
ber of categories, respectively, θ denotes the angle
between W and X, s represents the scaling factor, m
represents the additive angular margin penalty, and yi
is the label value of the i-th sample.
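As an illustration of Eq.(3), a minimal sketch of an ArcFace loss layer in TensorFlow is given below; the scale s=64 and margin m=0.5 are the common defaults reported by Deng et al.[23] rather than values stated in this paper, and implementation details such as clipping are assumptions.

```python
# A minimal sketch of the ArcFace loss (Eq.(3)); s, m, and clipping are assumed.
import tensorflow as tf

class ArcFaceLoss(tf.keras.layers.Layer):
    def __init__(self, n_classes=2, s=64.0, m=0.5, **kwargs):
        super().__init__(**kwargs)
        self.n_classes, self.s, self.m = n_classes, s, m

    def build(self, input_shape):
        # Weight vector W of Eq.(3), one column per class.
        self.w = self.add_weight(name="w",
                                 shape=(input_shape[-1], self.n_classes),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, embeddings, labels):
        # cos(theta_j): cosine between normalized embeddings and class weights.
        x = tf.nn.l2_normalize(embeddings, axis=1)
        w = tf.nn.l2_normalize(self.w, axis=0)
        cos_theta = tf.matmul(x, w)
        theta = tf.acos(tf.clip_by_value(cos_theta, -1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        one_hot = tf.one_hot(labels, depth=self.n_classes)
        logits = self.s * tf.cos(theta + self.m * one_hot)
        return tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=one_hot, logits=logits))

# Usage sketch: loss_value = ArcFaceLoss()(embedding_batch, label_batch)
```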
III.Proposed Algorithm
Most existing GAN-generated face detection al-
gorithms only consider the global features. However, the
local features have played significant roles in the field of
face recognition[20,21] and face synthesis[22]. So, in order
to improve the generalization capability, we suggest
strengthening the learning on important local areas and
combining the global and local features. In this section,
the main architecture of the proposed algorithm is
presented first, then its three modules, i.e., global fea-
ture extraction module, local feature extraction module,
and classification module, are described one by one.
1.Main architecture
As shown in Fig.2, the overall architecture of the
proposed algorithm comprises two steps: feature learn-
ing and classification learning. Notice that the cubes in
Fig.2 are just some simple signs rather than detailed
structures. The detailed structures of three modules are
given in the following three subsections. In the feature
learning step, firstly, the features from the global and
local feature extraction modules are merged; secondly,
metric learning is used to further learn common fea-
tures in the same type of faces and discriminative fea-
tures between natural and GAN-generated faces. Met-
ric learning transforms the merged features into an em-
bedding feature with fixed dimensions (128 dimensions
in this paper) by a fully connected (FC) layer. There-
after, the ArcFace loss[23] is applied to supervise the
metric learning. By minimizing the ArcFace loss defined
in Eq.(3), face images with the same label will have
similar features after being extracted by the feature ex-
traction modules. In the classification learning step, the
features extracted from the global and local modules are
fed into the classification module to obtain the pre-
dicted results. Notice that the metric learning is not
considered in the testing phase because it needs labels
to supervise the feature distribution and there is no la-
bel when making decisions. The input face images are
processed by the feature extraction module and the
classification module to predict the result directly. The
details of three modules are presented in the following
three subsections.
Fig.2. Main architecture of the proposed algorithm (Step 1: feature learning supervised by the ArcFace loss; Step 2: classification learning supervised by the softmax loss; the cropped patches and the whole image I are fed into the local and global feature extraction modules, whose outputs are concatenated)
2.Global feature extraction module
SE-Residual block is a main component in the
global feature extraction module. It embeds the SE
block into residual structure[28]. The reason is that the
SE block can adaptively recalibrate channel-wise feature
responses according to their importance. The SE-Residual
block is presented in Fig.3. As shown in Fig.3, if the in-
put x and the output y have matched dimensions, we
use the structure of Fig.3(a), otherwise Fig.3(b) (match-
ing dimensions by 1×1 convolution). Each architecture
of SE-Residual block has two convolutional groups and
a SE block. A convolutional group includes convolution
(Conv), batch normalization (BN), and ReLU activa-
tion. In real application, the dimensions of the input
and output of each SE-Residual block are already
known when we design a network model, thus choosing
Fig.3(a) or Fig.3(b) is also known in the model design
phase.
Fig.3. Architecture of the SE-Residual blocks for the input and output with same/different dimensions: (a) same dimensions; (b) different dimensions (matched by a 1×1 convolution); each block contains two convolutional groups and an SE block
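A minimal sketch of the SE-Residual block is shown below, reusing the se_block() sketch from Section II.1; the placement of the SE block after the second convolutional group follows Fig.3, while padding and other layer settings are assumptions.

```python
# A minimal sketch of the SE-Residual block in Fig.3 (uses se_block from the
# earlier sketch); "same" padding and BN defaults are assumed.
from tensorflow.keras import layers

def conv_group(x, filters, kernel_size=3):
    """Convolutional group: convolution -> batch normalization -> ReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def se_residual_block(x, filters):
    """Two convolutional groups + SE block, with a shortcut connection."""
    shortcut = x
    y = conv_group(x, filters)
    y = conv_group(y, filters)
    y = se_block(y)
    # Fig.3(b): match dimensions with a 1x1 convolution when channels differ.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Add()([y, shortcut])
```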
The detailed architecture of the global feature ex-
traction module is shown in Fig.4. This module is com-
posed of four SE-Residual blocks, a convolutional
group, and four maxpooling layers. The SE-Residual
block can extract inherent feature information to im-
prove the effectiveness of features. The global feature
extraction module is a relatively shallow network due to
the small size of input face and the uncomplicated clas-
sification task. Regarding the network parameters, the
kernel size of the convolutional group is 7×7 with stride
1, while those of the rest of convolutional layers in four
SE-Residual blocks are 3×3 with stride 1; The number
of kernels in the convolutional group and four SE-Re-
sidual blocks is 32, 32, 64, 64, and 128, respectively;
The kernel sizes of all maxpooling layers are 2×2 with
stride 2.
Fig.4. Architecture of the global feature extraction module (convolutional group → maxpooling → [SE-Residual block → maxpooling] × 4)
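With the parameters listed above, the global branch could be assembled as in the following sketch; it follows the layer ordering shown in Fig.4 (a pooling layer after the convolutional group and after each SE-Residual block, which keeps its output resolution consistent with the local branch), and the 128×128×3 input size is taken from the dataset description in Section IV.1.

```python
# A minimal sketch of the global branch in Fig.4; kernel counts follow the text
# (32, then 32/64/64/128), the pooling layout follows the figure.
from tensorflow.keras import layers, Model

def build_global_module(input_shape=(128, 128, 3)):
    inputs = layers.Input(shape=input_shape)
    x = conv_group(inputs, 32, kernel_size=7)        # 7x7 convolutional group
    x = layers.MaxPooling2D(2, strides=2)(x)
    for filters in (32, 64, 64, 128):                # four SE-Residual blocks
        x = se_residual_block(x, filters)
        x = layers.MaxPooling2D(2, strides=2)(x)
    return Model(inputs, x, name="global_feature_extraction")
```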
3.Local feature extraction module
Face key-points are some important landmarks in
the face. So, they are utilized to determine the import-
ant local areas. We use the face landmark detection
code from Dlib C++ library[29] to collect 68 face key-
points. Then, we find that these key-points mainly dis-
tribute in four areas, i.e., left eye, right eye, nose, and
mouth. To obtain these four areas, a rectangle is used
to crop each area in each face image. The rectangle is
the smallest one containing all key-points near each area.
Besides, to cover each area completely, each rectangle is
extended by around 10 pixels. Hereafter, the four cropped
patches are normalized to 32×32.
Fig.5 shows some extracted face key-points and their
corresponding four normalized cropped areas.
Fig.5.Some extracted face key-points and their corres-
ponding four cropped areas
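A minimal sketch of this cropping step with the Dlib landmark detector[29] and OpenCV is given below; the 68-point index ranges used to group the landmarks into the four areas and the pre-trained predictor file name are assumptions, while the 10-pixel extension and the 32×32 normalization follow the text.

```python
# A minimal sketch of key-point based cropping (Section III.3); landmark
# groupings, model file name, and single-face assumption are ours.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed file

# Standard 68-point index ranges for the four areas (assumed grouping).
AREAS = {"left_eye": range(36, 42), "right_eye": range(42, 48),
         "nose": range(27, 36), "mouth": range(48, 68)}

def crop_key_point_areas(image, margin=10, size=32):
    """Return the four normalized crops (left eye, right eye, nose, mouth)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]                        # assumes one face per image
    shape = predictor(gray, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    crops = []
    for idx in AREAS.values():
        area = pts[list(idx)]
        x0, y0 = area.min(axis=0) - margin          # smallest enclosing rectangle,
        x1, y1 = area.max(axis=0) + margin          # extended by ~10 pixels
        x0, y0 = max(x0, 0), max(y0, 0)
        patch = image[y0:y1, x0:x1]
        crops.append(cv2.resize(patch, (size, size)))
    return crops
```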
The detailed architecture of the local feature ex-
traction module is shown in Fig.6. Compared with the
global module shown in Fig.4, one SE-Residual block and
one maxpooling layer are removed so that the outputs of
the local and global modules have the same resolution for
the subsequent concatenation operation. The four cropped
areas obtained from the extracted key-points are fed in-
to the residual attention network one by one to output
four groups of features, i.e., F1, F2, F3, and F4. Then,
these four groups of features are fused by an add opera-
tion to obtain the final feature of the local module. The
number of kernels in the first convolutional group and
three SE-Residual blocks is 32, 32, 64, and 128, respect-
ively. Other parameters of the remaining layers are the
same as those of the global module described in Section
III.2.
Fig.6. Architecture of the local feature extraction module (convolutional group → [SE-Residual block → maxpooling] × 3; the four patch features F1–F4 are fused by an add operation)
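A minimal sketch of the local branch follows: the four 32×32 crops are passed through one sub-network (weight sharing across the patches is an assumption; the text only states that they are fed in one by one) and the resulting feature maps F1–F4 are fused by element-wise addition, as in Fig.6.

```python
# A minimal sketch of the local branch in Fig.6 (uses conv_group and
# se_residual_block from the earlier sketches); weight sharing is assumed.
from tensorflow.keras import layers, Model

def build_local_module(patch_shape=(32, 32, 3)):
    # Shared sub-network applied to each of the four cropped areas.
    patch_in = layers.Input(shape=patch_shape)
    x = conv_group(patch_in, 32, kernel_size=7)
    for filters in (32, 64, 128):                 # three SE-Residual blocks
        x = se_residual_block(x, filters)
        x = layers.MaxPooling2D(2, strides=2)(x)
    shared = Model(patch_in, x, name="shared_patch_branch")

    # Left eye, right eye, nose, mouth patches -> F1..F4 -> add fusion.
    inputs = [layers.Input(shape=patch_shape) for _ in range(4)]
    fused = layers.Add()([shared(p) for p in inputs])
    return Model(inputs, fused, name="local_feature_extraction")
```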
4.Classification module
This module is composed of a convolutional group, a GAP
layer, and an FC layer. The kernel size of the convolutional
layer is 3×3 with stride 1. The number of neurons in the FC
layer is equal to 2 (two categories: natural and generated).
To effectively supervise the classification module,
the most widely used classification loss function soft-
max loss is considered,
$$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i+b_j}} \quad (4)$$
where N and n represent the batch size and number of
categories, respectively, xi represents the i-th sample of
the deep feature x, yi denotes the label of the i-th
sample, W and b are the weight vector and the bias
term, respectively. The goal is to minimize the loss Lsoftmax.
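A minimal sketch of the classification module and the softmax loss in Eq.(4) is given below; the number of kernels in its convolutional group is not stated in the paper and is assumed here.

```python
# A minimal sketch of the classification module (Section III.4), reusing
# conv_group from the SE-Residual sketch; 128 kernels are assumed.
import tensorflow as tf
from tensorflow.keras import layers

def classification_head(merged_features, filters=128):
    x = conv_group(merged_features, filters, kernel_size=3)   # assumed kernel count
    x = layers.GlobalAveragePooling2D()(x)                    # GAP layer
    return layers.Dense(2)(x)                                 # natural vs. generated

# Softmax loss of Eq.(4), used to supervise the classification learning step.
softmax_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```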
5.Pseudo-code
In order to better understand the proposed al-
gorithm, the pseudo-code is summarized as the follow-
ing Algorithm 1.
Algorithm 1  Pseudo-code of the proposed algorithm
Input: Image I
1: Crop the input image into four parts according to face key-points as {F1^P, F2^P, F3^P, F4^P} = Crop(I);
2: Extract local features as F^L = Local feature extraction module(F1^P, F2^P, F3^P, F4^P);
3: Extract global features as F^G = Global feature extraction module(I);
4: Merge global and local features as F^C = Concatenate(F^G, F^L);
5: if step == feature learning do
6:   Extract features of metric learning as F^M = fully connected FC(F^C);
7:   L_ArcFace = ArcFace_Loss(F^M, Label);
8:   Minimize L_ArcFace and update the network parameters by back propagation;
9: end if
10: if step == classification learning do
11:   Extract dimensionality-reduction features as F^Conv = Conv(F^C);
12:   Obtain the final feature as F^R = fully connected FC(F^Conv);
13:   Predicted result = Argmax(F^R);
14:   L_softmax = Softmax_Loss(Predicted result, Label);
15:   Minimize L_softmax and update the network parameters by back propagation;
16: end if
Output: Predicted result (real or GAN-generated)
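The following sketch wires the modules together in the way Algorithm 1 describes, using the builder functions from the earlier sketches; flattening the merged feature maps before the 128-dimensional FC embedding is an assumption, since the paper does not state how the feature maps are vectorized.

```python
# A sketch of the two-step model from Algorithm 1 (assumptions noted above).
from tensorflow.keras import layers, Model

global_net = build_global_module()                    # Section III.2 sketch
local_net = build_local_module()                      # Section III.3 sketch

image_in = layers.Input(shape=(128, 128, 3))          # I
patch_ins = [layers.Input(shape=(32, 32, 3)) for _ in range(4)]  # cropped areas

f_g = global_net(image_in)                            # F^G
f_l = local_net(patch_ins)                            # F^L
f_c = layers.Concatenate()([f_g, f_l])                # F^C

# Step 1 (feature learning): 128-d embedding supervised by the ArcFace loss.
embedding = layers.Dense(128)(layers.Flatten()(f_c))  # F^M

# Step 2 (classification learning): classification module with the softmax loss.
logits = classification_head(f_c)                     # Section III.4 sketch

feature_model = Model([image_in] + patch_ins, embedding)
classifier = Model([image_in] + patch_ins, logits)
```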
IV.Experimental Results and Analysis
1.Experimental datasets
In our experiments, we consider two large natural
face image datasets (CelebA[30], and FFHQ[3]) and sev-
en GAN models (four state-of-the-art GANs: WGAN-
GP[31], BEGAN[1], PGGAN[2], and StyleGAN[3]; three
relatively early GANs: WGAN[32], LSGAN[33], and
DCGAN[34]). Notice that: regarding CelebA and FFHQ
datasets, we crop the facial regions by removing the
background, and then resize the cropped regions into
the resolution of 128×128. The GAN models generate
face images based on the natural CelebA dataset. We
use the abbreviation of Real/Generated with GAN
model to denote the corresponding natural/generated
face image datasets, i.e., RCelebA, RFFHQ, GWGAN-GP,
GBEGAN, GPGGAN, GStyleGAN, GWGAN, GLSGAN, and
GDCGAN. The descriptions of all the datasets used are
listed in Table 1. Some face images from these datasets
are shown in Fig.7.
Table 1. Description of all the datasets
Datasets | Number of images | Image resolution
RCelebA | 202,599 | 218×178
RFFHQ | 20,000 | 1024×1024
GWGAN-GP / GBEGAN / GPGGAN / GStyleGAN | 200,000 | 128×128
GWGAN / GLSGAN / GDCGAN | 20,000 | 64×64
Fig.7.Some face images in the datasets considered. From left to right, the columns are real images in RCelebA and RFFHQ, fake im-
ages in GDCGAN, GLSGAN, GWGAN, GBEGAN, GWGAN-GP, GPGGAN, and GStyleGAN
2.Experimental setup
To train the proposed model, we set the learning
rate as 1.0E−3 and the total number of epochs as 20.
The Adam optimizer[35] is used for both the feature and
classification learning. The numbers of epochs of the first-
step and second-step learning are set to 5 and 15, re-
spectively. The batch size is 128 for both steps. The
parameter settings and initialization of all layers use
the defaults of the TensorFlow framework.
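This configuration could be expressed as in the following sketch; `classifier`, `train_ds`, and `val_ds` stand for the classification-step model and tf.data pipelines and are placeholders, since the paper does not publish its training script.

```python
# A sketch of the training configuration in Section IV.2; the dataset pipeline
# and model objects are placeholders.
import tensorflow as tf

LEARNING_RATE = 1.0e-3
BATCH_SIZE = 128
FEATURE_EPOCHS, CLASSIFY_EPOCHS = 5, 15          # step 1 / step 2 epochs

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Step 2 (classification learning) with the softmax loss; step 1 would be run
# analogously with the ArcFace loss before this call.
classifier.compile(optimizer=optimizer,
                   loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                   metrics=["accuracy"])
classifier.fit(train_ds.batch(BATCH_SIZE),
               validation_data=val_ds.batch(BATCH_SIZE),
               epochs=CLASSIFY_EPOCHS)
```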
The evaluation metric Accuracy is used to evalu-
ate the performance of the proposed algorithm and the
compared algorithms. It can be computed by

$$Accuracy = \frac{C_{number}}{T_{number}} \quad (5)$$
where Cnumber is the number of correctly classified face
images, Tnumber denotes the number of the total face im-
ages. All the deep learning-based algorithms have been
implemented with TensorFlow on a single 11 GB Ge-
Force GTX 1080 Ti, 3.20 GHz i7-6900K CPU, and
64GB RAM.
3.Experimental results and analysis
In order to evaluate the performance of the pro-
posed algorithm, five recent deep learning-based al-
gorithms and one conventional algorithm are con-
sidered as baselines for comparison. These algorithms
are proposed by Quan et al.[36], Mo et al.[14], Hsu et al.[15],
Marra et al.[12], Chang et al.[37], and Li et al.[38]. The
implementations of all these compared algorithms are
provided by the authors of the corresponding released
papers.
The first test verifies the effectiveness of detection
directly. So, the total 202,599 images from five datasets
(GWGAN-GP, GBEGAN, GPGGAN, GStyleGAN, and RCelebA)
are divided into training, validation, and testing sets
with the ratio 8:1:1 for each dataset. Experimental res-
ults are given in Table 2 in terms of the Accuracy
value. These results show that when the testing set and
the training set come from the same source, almost all
algorithms can basically achieve satisfactory results
with an average Accuracy over 0.99; in particular, the
proposed algorithm with dual branches achieves an
accuracy of 1.0. The reason is that the deep convolutional
network has a strong learning ability.
Table 2. Comparison of detection performance on testing dataset consisting of GWGAN-GP, GBEGAN, GPGGAN, GStyleGAN and RCelebA
Algorithms | Accuracy
Quan[36] | 0.9608
Mo[14] | 0.9999
Hsu[15] | 0.9858
Marra[12] | 0.9999
Chang[37] | 0.9993
Li[38] | 0.9224
Proposed with only global branch | 0.9998
Proposed with dual branches | 1.0000
The second test evaluates the generalization capab-
ility. So, we randomly choose three datasets from four
generated datasets (GWGAN-GP, GBEGAN, GPGGAN, and
GStyleGAN) and the natural dataset RCelebA for training
and validation, while the remaining generated dataset is
used for testing. Three datasets (GWGAN, GLSGAN, and
GDCGAN) generated by the relatively early GANs and a
natural dataset RFFHQ are also used to test the general-
ization capability. Each dataset has 20,000 face images.
Table 3 and Table 4 present the average Accuracy val-
ues for all algorithms. The results in the Table 3 and
Table 4 show that: a) the proposed algorithm with dual
branches achieves the best generalization performance
among eight compared algorithms with an average Ac-
curacy over 0.99 for all four cases considered because it
combines both of the global and local features and uses
the ArcFace loss; b) the proposed algorithm with dual
branches is superior to that with only global branch, es-
pecially for GWGAN-GP and GBEGAN. This proves that
the local features are effective.
The third test is to evaluate the robustness against
additional attacks. Three types of additional attacks are
considered, i.e., JPEG compression (JC), Gaussian blur
(GB), and Gaussian noise addition (GNA). The origin-
al testing dataset is the same as in the first test. It is at-
tacked by JC with six different quality factors, GB with
Table 3. Comparison of the generalization capability on GWGAN-GP, GBEGAN, and four other datasets in terms of the Accuracy value
Algorithms | Train on RCelebA, GBEGAN, GPGGAN, and GStyleGAN (tested on GWGAN-GP, GWGAN, GDCGAN, GLSGAN, RFFHQ) | Train on RCelebA, GWGAN-GP, GPGGAN, and GStyleGAN (tested on GBEGAN, GWGAN, GDCGAN, GLSGAN, RFFHQ)
Quan[36] | 0.8863, 0.8775, 0.9998, 0.9784, 0.8207 | 0.8807, 0.9407, 0.9999, 0.9804, 0.8310
Mo[14] | 0.9222, 0.9762, 0.9662, 0.9199, 0.9778 | 0.8958, 0.9539, 0.9810, 0.9899, 0.9169
Hsu[15] | 0.8297, 0.9793, 0.9791, 0.9703, 0.1479 | 0.9697, 0.9897, 1.0000, 0.9998, 0.6873
Marra[12] | 0.8765, 0.9437, 0.9111, 0.9449, 0.9752 | 0.8130, 0.8240, 0.8525, 0.9381, 0.9698
Chang[37] | 0.9168, 0.9259, 0.9837, 0.9397, 0.6448 | 0.9400, 0.8984, 0.9887, 0.9698, 0.7493
Li[38] | 0.5010, 0.5009, 0.4989, 0.5052, 0.5000 | 0.5017, 0.5002, 0.5112, 0.5060, 0.4991
Proposed with only global branch | 0.8965, 0.8951, 0.8428, 0.8997, 0.6713 | 0.7783, 0.9956, 0.9856, 0.9925, 0.9585
Proposed with dual branches | 0.9931, 0.9988, 0.9895, 0.9959, 0.9972 | 0.9941, 0.9999, 1.0000, 1.0000, 0.9998
two different filter sizes, and GNA with three different
standard deviations. Fig.8 presents the results of each
compared algorithm. The results in Fig.8 show that:
a) the Accuracy values of all eight algorithms decrease
with the increase of attack levels for different types of
attacks; b) the detection performance of all eight al-
gorithms is greatly affected by JPEG compression at-
tack; c) the proposed algorithm achieves the overall
best performance in robustness, especially for the GNA
attack. Regarding the GB attack, the proposed al-
gorithm is not better than the algorithm proposed by
Marra et al. though it is superior to other compared al-
gorithms. The main reason is that Marra's algorithm
considers only the global features, while our algorithm
focuses on the local features in addition to the global
features. It is well known that local features are more
easily smoothed by the GB attack than global fea-
tures. However, the main objective of our paper is to
improve the generalization capability. Moreover, as
shown in Table 3 and Table 4, the proposed algorithm
achieves an obviously better generalization than other
algorithms, including Marra's method.
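For reproducing the three attacks, a minimal sketch with OpenCV and NumPy is given below; the quality factors, filter sizes, and standard deviations are the values swept in Fig.8, while the assumption that the noise standard deviation refers to the [0, 1] intensity range is ours.

```python
# A minimal sketch of the three attacks (JC, GB, GNA); parameter conventions
# beyond those listed in Section IV.3 are assumptions.
import cv2
import numpy as np

def jpeg_compress(img, quality=70):
    """JPEG compression with a given quality factor."""
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def gaussian_blur(img, ksize=3):
    """Gaussian blur with a ksize x ksize filter."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def gaussian_noise(img, std=0.4):
    """Gaussian noise addition; std assumed on the [0, 1] intensity scale."""
    noisy = img.astype(np.float32) / 255.0 + np.random.normal(0.0, std, img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)
```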
Fig.8. Comparison of robustness against three types of attacks on testing dataset consisting of GWGAN-GP, GBEGAN, GPGGAN, GStyleGAN and RCelebA: (a) Gaussian blur (filter sizes 3×3 and 5×5); (b) JPEG compression (quality factors 95, 90, 80, 70, 60, 50); (c) Gaussian noise addition (standard deviations 0.4, 0.7, 1.0). Compared methods: Quan, Mo, Hsu, Marra, Chang, Li, the proposed algorithm with only the global branch, and the proposed algorithm with dual branches.
V.Conclusions
In order to improve the generalization capability of
the existing GAN-generated face image detection al-
gorithms, we have proposed a general solution by com-
bining both the global and local features and also using
the metric learning based on the ArcFace loss. Experi-
mental results have demonstrated that the proposed al-
gorithm achieves a satisfactory generalization capabil-
ity with an average accuracy value over 0.99 on all the
eight testing datasets and outperforms some existing al-
gorithms. The main reasons are as follows: a) the learn-
ing on important local areas is strengthened by combin-
ing the global and local features extracted by residual
attention network; b) the metric learning is applied to
obtain common features in the same type of faces and
discriminative features between natural and GAN-gen-
erated faces in the feature learning phase. Certainly, the
performance of the proposed algorithm in the robust-
ness against additional attacks, especially for JPEG
compression, still needs to be improved. This is one of
our future objectives. In addition, all the current works,
including our work, focus on GAN-generated face detec-
tion in plaintext. In the future, we will try to detect
encrypted fake faces for privacy protection.
Table 4. Comparison of the generalization capability on GPGGAN, GStyleGAN, and four other datasets in terms of the Accuracy value
Algorithms | Train on RCelebA, GWGAN-GP, GBEGAN, and GStyleGAN (tested on GPGGAN, GWGAN, GDCGAN, GLSGAN, RFFHQ) | Train on RCelebA, GWGAN-GP, GBEGAN, and GPGGAN (tested on GStyleGAN, GWGAN, GDCGAN, GLSGAN, RFFHQ)
Quan[36] | 0.7963, 0.9163, 0.9998, 0.9840, 0.7101 | 0.8404, 0.9299, 0.9996, 0.9997, 0.7852
Mo[14] | 0.8935, 0.9688, 1.0000, 1.0000, 0.9599 | 0.8719, 0.9999, 1.0000, 1.0000, 0.8845
Hsu[15] | 0.9148, 0.9987, 0.9999, 1.0000, 0.6812 | 0.9602, 0.9998, 0.9988, 0.9998, 0.8277
Marra[12] | 0.8541, 0.8945, 0.8847, 0.8934, 0.9933 | 0.8767, 0.9552, 0.9995, 1.0000, 0.9815
Chang[37] | 0.8551, 0.9586, 0.9899, 0.9999, 0.9367 | 0.8202, 0.9999, 0.9835, 0.9999, 0.8564
Li[38] | 0.7690, 0.8770, 0.9886, 0.9662, 0.6277 | 0.6010, 0.7150, 0.9243, 0.9204, 0.6867
Proposed with only global branch | 0.9998, 1.0000, 1.0000, 1.0000, 0.9977 | 0.9495, 1.0000, 1.0000, 1.0000, 0.9598
Proposed with dual branches | 0.9919, 1.0000, 1.0000, 1.0000, 0.9949 | 0.9910, 1.0000, 0.9999, 1.0000, 0.9954
References
[1] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," arXiv preprint, arXiv:1703.10717, 2017.
[2] T. Karras, T. Aila, S. Laine, et al., "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint, arXiv:1710.10196, 2017.
[3] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, pp.4401–4410, 2019.
[4] X. Xu, L. Zhang, B. Lang, et al., "Research on inception module incorporated Siamese convolutional neural networks to realize face recognition," Acta Electronica Sinica, vol.48, no.4, pp.643–647, 2020. (in Chinese)
[5] H. Li, Q. Li, and L. Zhou, "Dynamic facial expression recognition based on multi-visual and audio descriptors," Acta Electronica Sinica, vol.47, no.8, pp.1643–1653, 2019. (in Chinese)
[6] C. Gao, X. Li, F. Zhou, et al., "Face liveness detection based on the improved CNN with context and texture information," Chinese Journal of Electronics, vol.28, no.6, pp.1092–1098, 2019.
[7] X. Yang, Y. Li, H. Qi, et al., "Exposing GAN-synthesized faces using landmark locations," in Proc. of the ACM Workshop on Information Hiding and Multimedia Security, Paris, pp.113–118, 2019.
[8] H. Li, B. Li, S. Tan, et al., "Detection of deep network generated images using disparities in color components," arXiv preprint, arXiv:1808.07276, 2018.
[9] L. Nataraj, T. M. Mohammed, B. S. Manjunath, et al., "Detecting GAN generated fake images using co-occurrence matrices," Electronic Imaging, vol.532, no.5, pp.1–7, 2019.
[10] S. McCloskey and M. Albright, "Detecting GAN-generated imagery using saturation cues," in Proc. of 2019 IEEE International Conference on Image Processing, Taipei, pp.4584–4588, 2019.
[11] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in Proc. of 2019 IEEE Winter Applications of Computer Vision Workshops, Waikoloa, HI, pp.83–92, 2019.
[12] F. Marra, D. Gragnaniello, D. Cozzolino, et al., "Detection of GAN-generated fake images over social networks," in Proc. of 2018 IEEE Conference on Multimedia Information Processing and Retrieval, Miami, Florida, pp.384–389, 2018.
[13] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, pp.1251–1258, 2017.
[14] H. Mo, B. Chen, and W. Luo, "Fake faces identification via convolutional neural network," in Proc. of the ACM Workshop on Information Hiding and Multimedia Security, Innsbruck, pp.43–47, 2018.
[15] C. C. Hsu, Y. X. Zhuang, and C. Y. Lee, "Deep fake image detection based on pairwise learning," Applied Sciences, vol.10, no.1, article no.370, 2020.
[16] S. Tariq, S. Lee, H. Kim, et al., "Detecting both machine and human created fake face images in the wild," in Proc. of the 2nd International Workshop on Multimedia Privacy and Security, Toronto, pp.81–87, 2018.
[17] S. Y. Wang, O. Wang, R. Zhang, et al., "CNN-generated images are surprisingly easy to spot... for now," in Proc. of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, pp.8695–8704, 2020.
[18] B. Liu and C. M. Pun, "Locating splicing forgery by fully convolutional networks and conditional random field," Signal Processing: Image Communication, vol.66, pp.103–112, 2018.
[19] B. Chen, W. Tan, G. Coatrieux, et al., "A serial image copy-move forgery localization scheme with source/target distinguishment," IEEE Transactions on Multimedia, vol.23, pp.3506–3517, 2020.
[20] C. Ding and D. Tao, "Robust face recognition via multimodal deep face representation," IEEE Transactions on Multimedia, vol.17, no.11, pp.2049–2058, 2015.
[21] B. Chen, X. Ju, B. Xiao, et al., "Locally GAN-generated face detection based on an improved Xception," Information Sciences, vol.572, pp.16–28, 2021.
[22] R. Huang, S. Zhang, T. Li, et al., "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," in Proc. of the 2017 IEEE International Conference on Computer Vision, Honolulu, Hawaii, pp.2439–2448, 2017.
[23] J. Deng, J. Guo, N. Xue, et al., "ArcFace: Additive angular margin loss for deep face recognition," in Proc. of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, pp.4690–4699, 2019.
[24] H. Larochelle and G. E. Hinton, "Learning to combine foveal glimpses with a third-order Boltzmann machine," in Proc. of the 24th Annual Conference on Neural Information Processing Systems (Advances in Neural Information Processing Systems), Vancouver, pp.1243–1251, 2010.
[25] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, pp.7132–7141, 2018.
[26] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proc. of the 2005 IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, pp.539–546, 2005.
[27] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp.815–823, 2015.
[28] K. He, X. Zhang, S. Ren, et al., "Deep residual learning for image recognition," in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, pp.770–778, 2016.
[29] D. King, "Dlib C++ library," available at: http://dlib.net, 2018.
[30] Z. Liu, P. Luo, X. Wang, et al., "Deep learning face attributes in the wild," in Proc. of the 2015 IEEE International Conference on Computer Vision, Boston, MA, pp.3730–3738, 2015.
[31] I. Gulrajani, F. Ahmed, M. Arjovsky, et al., "Improved training of Wasserstein GANs," arXiv preprint, arXiv:1704.00028, 2017.
[32] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint, arXiv:1701.07875, 2017.
[33] X. Mao, Q. Li, H. Xie, et al., "Least squares generative adversarial networks," in Proc. of the 2017 IEEE International Conference on Computer Vision, Honolulu, HI, pp.2813–2821, 2017.
[34] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint, arXiv:1511.06434, 2015.
[35] I. Sutskever, J. Martens, G. Dahl, et al., "On the importance of initialization and momentum in deep learning," in Proc. of International Conference on Machine Learning, Atlanta, Georgia, pp.1139–1147, 2013.
[36] W. Quan, K. Wang, D. M. Yan, et al., "Distinguishing between natural and computer-generated images using convolutional neural networks," IEEE Transactions on Information Forensics and Security, vol.13, no.11, pp.2772–2787, 2018.
[37] X. Chang, J. Wu, T. Yang, et al., "DeepFake face image detection based on improved VGG convolutional neural network," in Proc. of the 39th Chinese Control Conference, Shenyang, pp.7252–7256, 2020.
[38] H. Li, B. Li, S. Tan, et al., "Identification of deep network generated images using disparities in color components," Signal Processing, vol.174, article no.107616, 2020.
CHEN Beijing (corresponding
author) received the Ph.D. degree in
computer science from Southeast Uni-
versity, Nanjing, China, in 2011. Now he
is a Professor in School of Computer,
Nanjing University of Information Sci-
ence and Technology, China. His re-
search interests include color image pro-
cessing, image forensics, image water-
marking, and pattern recognition. He serves as an Editorial Board
Member of the Journal of Mathematical Imaging and Vision.
(Email: nbutimage@126.com)
TAN Weijin received the M.S.
degree in computer science and techno-
logy from Nanjing University of Informa-
tion Science and Technology, Nanjing,
China, in 2011. His research interests in-
clude image forensics and image pro-
cessing.
WANG Yiting received the
B.S. degree in safety engineering from
Nanjing University of Information Sci-
ence and Technology, Nanjing, China, in
2019. Now she is pursuing the Ph.D. de-
gree in Warwick Manufacturing Group,
University of Warwick, UK. Her re-
search interests include machine learning
and image processing.
ZHAO Guoying received the
Ph.D. degree in computer science from
the Chinese Academy of Sciences,
Beijing, China, in 2005. She is currently a
Professor with the Center for Machine
Vision and Signal Analysis, University of
Oulu, Finland. She is a Fellow of the
IAPR. She has authored or coauthored
more than 240 papers in journals and
conferences. Her current research interests include image and
video descriptors, facial expression and micro-expression recogni-
tion, and person identification.