Distinguishing Between Natural and GAN-
Generated Face Images by Combining
Global and Local Features
CHEN Beijing1,2,3, TAN Weijin1,2, WANG Yiting4, and ZHAO Guoying5
(1. Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing
University of Information Science and Technology, Nanjing 210044, China)
(2. School of Computer, Nanjing University of Information Science and Technology, Nanjing 210044, China)
(3. Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET),
Nanjing University of Information Science and Technology, Nanjing 210044, China)
(4. Warwick Manufacturing Group, University of Warwick, Coventry CV4 7AL, UK)
(5. Center for Machine Vision and Signal Analysis, University of Oulu, Oulu 90014, Finland)
Abstract—With the development of face image syn-
thesis and generation technology based on generative ad-
versarial networks (GANs), it has become a research hot-
spot to determine whether a given face image is natural
or generated. However, the generalization capability of
the existing algorithms is still to be improved. Therefore,
this paper proposes a general algorithm. To do so, firstly,
the learning on important local areas, containing many
face key-points, is strengthened by combining the global
and local features. Secondly, metric learning based on the
ArcFace loss is applied to extract common and discrimin-
ative features. Finally, the extracted features are fed into
the classification module to detect GAN-generated faces.
The experiments are conducted on two publicly available
natural datasets (CelebA and FFHQ) and seven GAN-
generated datasets. Experimental results demonstrate that
the proposed algorithm achieves a better generalization
performance with an average detection accuracy over 0.99
than the state-of-the-art algorithms. Moreover, the pro-
posed algorithm is robust against additional attacks, such
as Gaussian blur, and Gaussian noise addition.
Keywords—Generated image, Global feature, Loc-
al features, Generative adversarial network, Metric
learning.
I.Introduction
In recent years, deep learning-based generative
techniques, especially for generative adversarial net-
works (GANs), have been applied to the field of face
image synthesis and generation. The visual quality of
GAN-generated face images is getting closer to natural
face images by some strategies, such as depth structure,
adversarial training, and prior information fusion. Some advanced GANs, such as boundary equilibrium GAN (BEGAN)[1], progressive growing of GAN (PGGAN)[2], and StyleGAN[3], have shown great success in generating high-resolution and photo-realistic face images.
Then, many tools have been released, such as
Face2Face, FaceSwap, and DeepFake. Consequently, it
has become more and more difficult to identify the gen-
erated face images from natural face images with naked
human eyes. Besides, it is well known that face images
have been widely used in identification[4,5] and authen-
tication services[6] in the daily life, such as face pay-
ment, face retrieval, and face check-in. Therefore, it has
become a research hotspot to determine whether a giv-
en face image is natural or generated.
Until now, various kinds of algorithms have been
proposed for detecting GAN-generated face images.
These algorithms can be roughly divided into two categories: intrinsic attributes-based[7]–[11] and deep learning-based[12]–[17]. The intrinsic attributes-based algorithms start from the perspective of conventional digital image forensics to exploit the inconsistency of different types of attributes in face images.
(Manuscript Received Nov. 6, 2020; Accepted July 5, 2021. This work was supported by the National Natural Science Foundation of China (62072251), the NUIST Students' Platform for Innovation and Entrepreneurship Training Program (202110300022Z), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) Fund. ©2022 Chinese Institute of Electronics. DOI: 10.1049/cje.2020.00.372)
ferent types of attributes in face images. Yang et al.[7]
considered the geometric inconsistency based on facial
feature points. Li et al.[8] analyzed the disparities in dif-
ferent color spaces, and extracted the feature informa-
tion by the color components. Nataraj et al.[9] extrac-
ted the co-occurrence matrices on three color channels
in the pixel domain. McCloskey et al.[10] studied the brightness and exposure differences between PGGAN-generated and natural faces. Matern et al.[11] magnified several visual artifacts in the global symmetry of organs and the color of the eyes. As discussed above, these intrinsic attributes-based algorithms rely on hand-crafted features, which may limit their generalization capability[18,19]. Besides, for a given face image to be authenticated, it is unknown which algorithm is suitable because of the absence of prior information[18]. So, recently, deep learning-based algorithms[12]–[17] have been proposed. Basically, these algorithms learn image features automatically and do not need any prior knowledge or
assumptions. More specifically, Marra et al.[12] used
transfer learning with Xception model[13]. Mo et al.[14]
transformed the face image into residuals by a high-pass
filter and then extracted features from the residual in-
put by convolutional neural network (CNN). Hsu et
al.[15] adopted contrastive loss to seek typical features.
Tariq et al.[16] presented an ensemble-based neural net-
work classifier with three sub-networks, i.e., Shal-
lowNet v1-v3 having different depths. However, the
generalization capabilities of these deep learning-based
algorithms still need to be improved. Besides, Wang et
al.[17] pointed out that the generalization capability was
a significant indicator in the field of image forensics.
This makes sense because face image generation models
are highly varied in training datasets, network architec-
tures, loss functions, and image pre-processing opera-
tions. Besides, to the best of our knowledge, almost all
the existing works about GAN-generated face detection
only consider the global features while ignoring the local
features. But, the local features have played significant
roles in the field of face recognition[20,21] and face syn-
thesis[22]; people usually pay more attention to some
local areas when distinguishing between the natural and
generated faces.
The solution that we propose in this paper solves
the above-mentioned drawbacks. To enhance the gener-
alization capability, the learning on important local
areas is strengthened by combining the global and local
features. Moreover, in the feature learning phase, the
metric learning based on ArcFace loss[23] is applied to
learn common features in the same type of faces and
discriminative features between natural and GAN-gen-
erated faces.
The rest of this paper is organized as follows. In
Section II, some related techniques are reviewed briefly.
In Section III, the proposed algorithm is presented. The
experimental results and analysis are illustrated in Sec-
tion IV. Finally, we draw conclusions in Section V.
II.Related Techniques
In this section, some related techniques are re-
called, such as squeeze-and-excitation block and metric
learning.
1. Squeeze-and-excitation block
In human vision, attention usually leads people to
focus on the local area of the whole scene[24]. It can fil-
ter out irrelevant information and enhance important
information. As a lightweight gating attention mechan-
ism, the squeeze-and-excitation (SE) block[25] models channel-wise relationships to enhance the representational ability of the network. The main architecture of the SE block is presented in Fig.1.
Fig.1. Architecture of the squeeze-and-excitation block (F: c×h×w; F_GAP, S: c×1×1; F_scale: c×h×w)
As shown in Fig.1, the feature map F is manipulated to aggregate global information by a global average pooling (GAP) operation. Then, the global feature $F_{\mathrm{GAP}}$ is fed into the SE operation, which is composed of two fully connected (FC) layers. After that, the attention scaler S is computed as
$S=\sigma(\mathrm{FC}_2(\delta(\mathrm{FC}_1(F_{\mathrm{GAP}}))))$  (1)
where $\sigma$ corresponds to the sigmoid activation and $\delta$ represents the ReLU activation. Finally, the output of the block is obtained by a scaling operation. The scaled feature map $F_{\mathrm{scale}}$ is computed as
$F_{\mathrm{scale}}=\mathrm{Multiply}(F, S)$  (2)
where Multiply refers to channel-wise multiplication.
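For concreteness, a minimal TensorFlow/Keras sketch of the SE block described by Eqs.(1) and (2) could look as follows; the bottleneck ratio `reduction` is an assumed hyper-parameter, as it is not specified above.

```python
from tensorflow.keras import layers

def se_block(feature_map, reduction=16):
    """Squeeze-and-excitation block of Fig.1: GAP -> FC -> ReLU -> FC -> sigmoid -> scale.
    The `reduction` ratio is an assumed value, not taken from the paper."""
    channels = feature_map.shape[-1]
    # Squeeze: aggregate global spatial information per channel (F_GAP)
    s = layers.GlobalAveragePooling2D()(feature_map)
    # Excitation: two FC layers with ReLU and sigmoid activations, Eq.(1)
    s = layers.Dense(channels // reduction, activation="relu")(s)   # FC_1 + delta
    s = layers.Dense(channels, activation="sigmoid")(s)             # FC_2 + sigma -> scaler S
    # Scale: channel-wise multiplication with the input feature map, Eq.(2)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([feature_map, s])                      # F_scale
```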
2. Metric learning
Classic metric learning usually considers the char-
acteristics of data and learns an effective metric meth-
od to measure the similarity of data. Deep metric learn-
ing, which combines the advantages of deep neural net-
works and end-to-end training, has been widely used in
the field of computer vision.
Deep metric learning based on loss function has de-
veloped rapidly in recent years. Chopra et al.[26] pro-
posed contrastive loss and then introduced metric learn-
ing based on the contrastive loss into convolutional
neural networks. The aim of contrastive loss is to make
the feature distance corresponding to samples within the same class closer, and to make the feature distance corresponding to samples between different classes farther. Compared with the contrastive loss, the triplet loss[27] further considers the relative relationship of intra-class and inter-class pairs. However, with the triplet loss it is difficult to train on all samples fully, and the choice of sample pairs affects the final result. So, Deng et al.[23] proposed the ArcFace loss, which uses the cosine angle instead of the distance to measure similarity for metric learning, and applied it to the field of face recognition, achieving excellent performance. Therefore, it is also introduced into the proposed algorithm. Its role is to make the features have small intra-class and large inter-class distances.
Given the feature X and weight vector W, the ArcFace loss is presented as follows:
$L_{\mathrm{ArcFace}}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i}+m))}}{e^{s(\cos(\theta_{y_i}+m))}+\sum_{j=1,j\neq y_i}^{n}e^{s\cos\theta_j}}$  (3)
where N and n represent the batch size and the number of categories, respectively, $\theta$ denotes the angle between W and X, s represents the scaling factor, m represents the additive angular margin penalty, and $y_i$ is the label value of the i-th sample.
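For reference, Eq.(3) could be implemented roughly as below in TensorFlow; the L2 normalization of features and weights and the default values s = 64 and m = 0.5 follow the original ArcFace work [23] and are assumptions rather than the exact settings used here.

```python
import tensorflow as tf

def arcface_loss(features, labels, class_weights, s=64.0, m=0.50):
    """ArcFace loss of Eq.(3).
    features: (N, d) embeddings; labels: (N,) integer class ids;
    class_weights: (d, n) weight vectors W. s and m are assumed defaults from [23]."""
    # Cosine of the angle theta between normalized features X and weights W
    x = tf.math.l2_normalize(features, axis=1)
    w = tf.math.l2_normalize(class_weights, axis=0)
    cos_theta = tf.clip_by_value(tf.matmul(x, w), -1.0 + 1e-7, 1.0 - 1e-7)
    theta = tf.acos(cos_theta)
    # Add the angular margin m only to the target-class angle theta_{y_i}
    one_hot = tf.one_hot(labels, depth=tf.shape(class_weights)[1])
    logits = s * tf.where(one_hot > 0, tf.cos(theta + m), cos_theta)
    # Cross-entropy over the scaled, margin-adjusted logits
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```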
III.Proposed Algorithm
Most existing GAN-generated face detection al-
gorithms only consider the global features. However, the
local features have played significant roles in the field of
face recognition[20,21] and face synthesis[22]. So, in order
to improve the generalization capability, we suggest
strengthening the learning on important local areas and
combining the global and local features. In this section,
the main architecture of the proposed algorithm is
presented first, then its three modules, i.e., global fea-
ture extraction module, local feature extraction module,
and classification module, are described one by one.
1. Main architecture
As shown in Fig.2, the overall architecture of the
proposed algorithm comprises two steps: feature learn-
ing and classification learning. Notice that the cubes in
Fig.2 are just some simple signs rather than detailed
structures. The detailed structures of three modules are
given in the following three subsections. In the feature
learning step, firstly, the features from the global and
local feature extraction modules are merged; secondly,
metric learning is used to further learn common fea-
tures in the same type of faces and discriminative fea-
tures between natural and GAN-generated faces. Met-
ric learning transforms the merged features into an em-
bedding feature with fixed dimensions (128 dimensions
in this paper) by a fully connected (FC) layer. There-
after, the ArcFace loss[23] is applied to supervise the
metric learning. By minimizing the ArcFace loss defined
in Eq.(3), face images with the same label will have
similar features after being extracted by the feature ex-
traction modules. In the classification learning step, the
features extracted from the global and local modules are
fed into the classification module to obtain the pre-
dicted results. Notice that the metric learning is not
considered in the testing phase because it needs labels
to supervise the feature distribution and no label is available when making decisions. The input face images are
processed by the feature extraction module and the
classification module to predict the result directly. The
details of three modules are presented in the following
three subsections.
Fig.2. Main architecture of the proposed algorithm: the outputs of the global and local feature extraction modules are concatenated; in Step 1 (feature learning), metric learning supervised by the ArcFace loss acts on the merged features, while in Step 2 (classification learning), the classification module is supervised by the softmax loss.
2. Global feature extraction module
SE-Residual block is a main component in the
global feature extraction module. It embeds the SE
block into residual structure[28]. The reason is that the
SE block can adaptively establish channel-wise feature
responses by potential importance. The SE-Residual
block is presented in Fig.3. As shown in Fig.3, if the in-
put x and the output y have matched dimensions, we
use the structure of Fig.3(a), otherwise Fig.3(b) (match-
ing dimensions by a 1×1 convolution). Each SE-Residual block has two convolutional groups and an SE block. A convolutional group includes convolution
(Conv), batch normalization (BN), and ReLU activa-
tion. In real application, the dimensions of the input
and output of each SE-Residual block are already
known when we design a network model, thus choosing
Fig.3(a) or Fig.3(b) is also known in the model design
phase.
Fig.3. Architecture of the SE-Residual blocks for the input and output with (a) same and (b) different dimensions
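As a hedged sketch (reusing the se_block function sketched in Section II.1), the SE-Residual block of Fig.3 might be written as follows; the use of "same" padding and the exact shortcut handling are assumptions.

```python
from tensorflow.keras import layers

def conv_group(x, filters, kernel_size=3):
    """Convolutional group: convolution (Conv), batch normalization (BN), and ReLU."""
    x = layers.Conv2D(filters, kernel_size, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def se_residual_block(x, filters):
    """SE-Residual block: two convolutional groups and an SE block on the main path,
    with a shortcut; a 1x1 convolution matches dimensions when they differ (Fig.3(b))."""
    shortcut = x
    y = conv_group(x, filters)
    y = conv_group(y, filters)
    y = se_block(y)                        # se_block from the earlier sketch
    if shortcut.shape[-1] != filters:      # Fig.3(b): different dimensions
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Add()([y, shortcut])
```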
The detailed architecture of the global feature ex-
traction module is shown in Fig.4. This module is com-
posed of four SE-Residual blocks, a convolutional
group, and four maxpooling layers. The SE-Residual
block can extract inherent feature information to im-
prove the effectiveness of features. The global feature
extraction module is a relatively shallow network due to
the small size of input face and the uncomplicated clas-
sification task. Regarding the network parameters, the kernel size of the convolutional group is 7×7 with stride 1, while those of the remaining convolutional layers in the four SE-Residual blocks are 3×3 with stride 1; the number of kernels in the convolutional group and the four SE-Residual blocks is 32, 32, 64, 64, and 128, respectively; the kernel sizes of all maxpooling layers are 2×2 with stride 2.
Fig.4. Architecture of the global feature extraction module
3. Local feature extraction module
Face key-points are some important landmarks in
the face. So, they are utilized to determine the import-
ant local areas. We use the face landmark detection
code from Dlib C++ library[29] to collect 68 face key-
points. Then, we find that these key-points are mainly distributed in four areas, i.e., the left eye, right eye, nose, and mouth. To obtain these four areas, a rectangle is used to crop each area in each face image. The rectangle is the smallest one containing all key-points near each area. Besides, to contain each area completely, each rectangle is extended by around 10 pixels. Thereafter, the four cropped patches are normalized to 32×32.
Fig.5 shows some extracted face key-points and their
corresponding four normalized cropped areas.
Fig.5.Some extracted face key-points and their corres-
ponding four cropped areas
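A sketch of this cropping step using the Dlib 68-point landmark detector [29] and OpenCV is given below; the landmark index ranges for the four areas follow the standard 68-point convention and, together with the model file path, are assumptions.

```python
import cv2
import dlib
import numpy as np

# Standard 68-point landmark index ranges (assumed grouping for the four areas)
REGIONS = {"left_eye": list(range(36, 42)), "right_eye": list(range(42, 48)),
           "nose": list(range(27, 36)), "mouth": list(range(48, 68))}

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model path

def crop_local_areas(image, margin=10, out_size=32):
    """Crop the four key-point areas (eyes, nose, mouth) with the smallest enclosing
    rectangle, extend it by `margin` pixels, and normalize each patch to out_size x out_size."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    face = detector(gray, 1)[0]                      # assume one detectable face per image
    landmarks = predictor(gray, face)
    pts = np.array([[landmarks.part(i).x, landmarks.part(i).y] for i in range(68)])
    patches = []
    for idx in REGIONS.values():
        x0, y0 = pts[idx].min(axis=0) - margin
        x1, y1 = pts[idx].max(axis=0) + margin
        patch = image[max(y0, 0):y1, max(x0, 0):x1]
        patches.append(cv2.resize(patch, (out_size, out_size)))
    return patches                                   # [left eye, right eye, nose, mouth]
```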
The detailed architecture of the local feature extraction module is shown in Fig.6. Compared with the global module shown in Fig.4, one SE-Residual block and one maxpooling layer are removed so that the following concatenation operation between the local module and the global module acts on outputs with the same resolution. The four cropped areas obtained from the extracted key-points are fed into the residual attention network one by one to output four groups of features, i.e., F1, F2, F3, and F4. Then, these four groups of features are fused by an add operation to obtain the final feature of the local module. The number of kernels in the first convolutional group and the three SE-Residual blocks is 32, 32, 64, and 128, respectively. The other parameters of the remaining layers are the same as those of the global module described in Section III.2.
Fig.6. Architecture of the local feature extraction module (the four cropped areas produce features F1–F4, which are fused by an add operation)
4. Classification module
This module is comprised of a convolutional group,
a GAP layer, and an FC layer. The kernel size of the
convolutional layer is 3×3 with stride 1. The number of
neurons in the FC layer is equal to 2 (two categories:
natural and generated).
To effectively supervise the classification module, the most widely used classification loss function, the softmax loss, is considered:
$L_{\mathrm{softmax}}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{n}e^{W_j^{T}x_i+b_j}}$  (4)
where N and n represent the batch size and the number of categories, respectively, $x_i$ represents the deep feature of the i-th sample, $y_i$ denotes the label of the i-th sample, and W and b are the weight vector and the bias term, respectively. The goal is to minimize the loss $L_{\mathrm{softmax}}$.
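A sketch of the classification module and the softmax supervision of Eq.(4) is shown below; the number of kernels in the 3×3 convolutional group is not stated above and is assumed here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def classification_head(merged_features):
    """Classification module: a 3x3 convolutional group, global average pooling,
    and a 2-neuron FC layer (natural vs. generated)."""
    x = conv_group(merged_features, 128, kernel_size=3)   # kernel count is an assumption
    x = layers.GlobalAveragePooling2D()(x)
    return layers.Dense(2)(x)                             # logits for the softmax loss, Eq.(4)

# Softmax (cross-entropy) loss over the two classes, as in Eq.(4)
softmax_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```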
5. Pseudo-code
In order to better understand the proposed al-
gorithm, the pseudo-code is summarized as the follow-
ing Algorithm 1.
Algorithm 1 Pseudo-code of the proposed algorithm
Input: Image I
1: Crop the input image into four parts according to face key-points as {F_P^1, F_P^2, F_P^3, F_P^4} = Crop(I);
2: Extract local features as F_L = Local feature extraction module(F_P^1, F_P^2, F_P^3, F_P^4);
3: Extract global features as F_G = Global feature extraction module(I);
4: Merge global features and local features as F_C = Concatenate(F_G, F_L);
5: if step == feature learning do
6:   Extract features for metric learning as F_M = fully connected FC(F_C);
7:   L_ArcFace = ArcFace_Loss(F_M, Label);
8:   Minimize L_ArcFace and update the network parameters by back propagation;
9: end if
10: if step == classification learning do
11:   Extract dimensionality-reduction features as F_Conv = Conv(F_C);
12:   Obtain the final feature as F_R = fully connected FC(F_Conv);
13:   Predicted result = Argmax(F_R);
14:   L_softmax = Softmax_Loss(Predicted result, Label);
15:   Minimize L_softmax and update the network parameters by back propagation;
16: end if
Output: Predicted result (real or GAN-generated)
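To illustrate how the two steps of Algorithm 1 fit together, a rough TensorFlow wiring is sketched below, building on the earlier sketches; the flattening before the 128-dimensional FC embedding and the model names are assumptions.

```python
from tensorflow.keras import layers, Model

# Assumed end-to-end wiring of the proposed model
image_in = layers.Input(shape=(128, 128, 3), name="face")
patch_ins = [layers.Input(shape=(32, 32, 3), name=n)
             for n in ("left_eye", "right_eye", "nose", "mouth")]

F_G = build_global_module()(image_in)                 # global features
F_L = build_local_module()(patch_ins)                 # local features
F_C = layers.Concatenate(axis=-1)([F_G, F_L])         # merged features

# Step 1 (feature learning): 128-d embedding F_M supervised by the ArcFace loss
F_M = layers.Dense(128)(layers.Flatten()(F_C))
feature_model = Model([image_in] + patch_ins, F_M)

# Step 2 (classification learning): classification head supervised by the softmax loss
logits = classification_head(F_C)
classifier = Model([image_in] + patch_ins, logits)
```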
IV.Experimental Results and Analysis
1. Experimental datasets
In our experiments, we consider two large natural
face image datasets (CelebA[30], and FFHQ[3]) and sev-
en GAN models (four state-of-the-art GANs: WGAN-
GP[31], BEGAN[1], PGGAN[2], and StyleGAN[3]; three
relatively early GANs: WGAN[32], LSGAN[33], and
DCGAN[34]). Notice that: regarding CelebA and FFHQ
datasets, we crop the facial regions by removing the
background, and then resize the cropped regions into
the resolution of 128×128. The GAN models generate
face images based on the natural CelebA dataset. We
use the abbreviation of Real/Generated with GAN
model to denote the corresponding natural/generated
face image datasets, i.e., RCelebA, RFFHQ, GWGAN-GP,
GBEGAN, GPGGAN, GStyleGAN, GWGAN, GLSGAN, and
GDCGAN. The descriptions of all the datasets we used are
listed in Table 1. Some face images from these datasets
are shown in Fig.7.
Table 1.Description of all the datasets
Datasets The number of images Image resolution
RCelebA 202,599 218×178
RFFHQ 20,000 1024×1024
GWGAN-GP/GBEGAN/
GPGGAN/GStyleGAN
200,000 128×128
GWGAN/GLSGAN/GDCGAN 20,000 64×64
Fig.7.Some face images in the datasets considered. From left to right, the columns are real images in RCelebA and RFFHQ, fake im-
ages in GDCGAN, GLSGAN, GWGAN, GBEGAN, GWGAN-GP, GPGGAN, and GStyleGAN
2. Experimental setup
To train the proposed model, we set the learning rate as 1.0E−3 and the total number of epochs as 20. The Adam optimizer[35] is used for both the feature and classification learning. The numbers of epochs for the first-step and second-step learning are set to 5 and 15, respectively. The batch size is 128 for both steps. The parameter settings and initialization of all layers use the defaults of the TensorFlow framework.
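The stated hyper-parameters could be collected as follows; this is only a restatement of the settings above, and the variable names are illustrative.

```python
import tensorflow as tf

# Hyper-parameters stated in this subsection
LEARNING_RATE = 1.0e-3
BATCH_SIZE = 128
FEATURE_LEARNING_EPOCHS = 5        # step 1: ArcFace-supervised feature learning
CLASSIFICATION_EPOCHS = 15         # step 2: softmax-supervised classification learning

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
```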
The evaluation metric Accuracy is used to evaluate the performance of the proposed algorithm and the compared algorithms. It can be computed by
$\mathrm{Accuracy}=\frac{C_{\mathrm{number}}}{T_{\mathrm{number}}}$  (5)
where $C_{\mathrm{number}}$ is the number of correctly classified face images and $T_{\mathrm{number}}$ denotes the total number of face images. All the deep learning-based algorithms have been implemented with TensorFlow on a single 11 GB GeForce GTX 1080 Ti GPU, a 3.20 GHz i7-6900K CPU, and 64 GB RAM.
3. Experimental results and analysis
In order to evaluate the performance of the pro-
posed algorithm, five recent deep learning-based al-
gorithms and one conventional algorithm are con-
sidered as baselines for comparison. These algorithms
are proposed by Quan et al.[36], Mo et al.[14], Hsu et al.[15],
Marra et al.[12], Chang et al.[37], and Li et al.[38]. The
implementations of all these compared algorithms are
provided by the authors of the corresponding released
papers.
The first test verifies the effectiveness of detection
directly. So, the total 202,599 images from five datasets
(GWGAN-GP, GBEGAN, GPGGAN, GStyleGAN, and RCelebA)
are divided into training, validation, and testing sets
with the ratio 8:1:1 for each dataset. Experimental res-
ults are given in Table 2 in terms of the Accuracy
value. These results show that when the testing set and
the training set come from the same source, almost all
algorithms can basically achieve satisfactory results
with an average Accuracy over 0.99, especially for the
proposed algorithm with dual branches, which achieves an accuracy of 1.0. The reason is that the deep convolutional
network has a strong learning ability.
Table 2.Comparison of detection performance on test-
ing dataset consisting of GWGAN-GP, GBEGAN,
GPGGAN, GStyleGAN and RCelebA
Algorithms Accuracy Algorithms Accuracy
Quan[36]0.9608 Chang[37]0.9993
Mo[14]0.9999 Li[38]0.9224
Hsu[15]0.9858 Proposed with only global branch 0.9998
Marra[12]0.9999 Proposed with dual branches 1.0000
The second test evaluates the generalization capab-
ility. So, we randomly choose three datasets from four
generated datasets (GWGAN-GP, GBEGAN, GPGGAN, and
GStyleGAN) and the natural dataset RCelebA for training
and validation, while the remaining generated dataset is used for testing. Three datasets (GWGAN, GLSGAN, and
GDCGAN) generated by the relatively early GANs and a
natural dataset RFFHQ are also used to test the general-
ization capability. Each dataset has 20,000 face images.
Table 3 and Table 4 present the average Accuracy val-
ues for all algorithms. The results in Table 3 and Table 4 show that: a) the proposed algorithm with dual branches achieves the best generalization performance among the eight compared algorithms, with an average Accuracy over 0.99 for all four cases considered, because it combines both the global and local features and uses the ArcFace loss; b) the proposed algorithm with dual branches is superior to that with only the global branch, especially for GWGAN-GP and GBEGAN. This proves that the local features are effective.
The third test is to evaluate the robustness against
additional attacks. Three types of additional attacks are
considered, i.e., JPEG compression (JC), Gaussian blur
(GB), and Gaussian noise addition (GNA). The origin-
al testing dataset is the same as in the first test. It is attacked by JC with six different quality factors, GB with

Table 3.Comparison of the generalization capability on GWGAN-GP, GBEGAN, and
four other datasets in terms of the Accuracy value
Algorithms Train on RCelebA, GBEGAN, GPGGAN, and GStyleGAN Train on RCelebA, GWGAN-GP, GPGGAN, and GStyleGAN
GWGAN-GP GWGAN GDCGAN GLSGAN RFFHQ GBEGAN GWGAN GDCGAN GLSGAN RFFHQ
Quan[36]0.8863 0.8775 0.9998 0.9784 0.8207 0.8807 0.9407 0.9999 0.9804 0.8310
Mo[14]0.9222 0.9762 0.9662 0.9199 0.9778 0.8958 0.9539 0.9810 0.9899 0.9169
Hsu[15]0.8297 0.9793 0.9791 0.9703 0.1479 0.9697 0.9897 1.0000 0.9998 0.6873
Marra[12]0.8765 0.9437 0.9111 0.9449 0.9752 0.8130 0.8240 0.8525 0.9381 0.9698
Chang[37]0.9168 0.9259 0.9837 0.9397 0.6448 0.9400 0.8984 0.9887 0.9698 0.7493
Li[38]0.5010 0.5009 0.4989 0.5052 0.5000 0.5017 0.5002 0.5112 0.5060 0.4991
Proposed with only global branch 0.8965 0.8951 0.8428 0.8997 0.6713 0.7783 0.9956 0.9856 0.9925 0.9585
Proposed with dual branches 0.9931 0.9988 0.9895 0.9959 0.9972 0.9941 0.9999 1.0000 1.0000 0.9998
two different filter sizes, and GNA with three different standard deviations. Fig.8 presents the results of each compared algorithm. The results in Fig.8 show that: a) the Accuracy values of all eight algorithms decrease with the increase of attack levels for different types of attacks; b) the detection performance of all eight algorithms is greatly affected by the JPEG compression attack; c) the proposed algorithm achieves the overall best performance in robustness, especially for the GNA attack. Regarding the GB attack, the proposed algorithm is not better than the algorithm proposed by Marra et al., though it is superior to the other compared algorithms. The main reason is that Marra's algorithm considers the global features, while our algorithm focuses on the local features on the basis of the global features. It is well known that local features are more easily smoothed by the GB attack than global features. However, the main objective of our paper is to improve the generalization capability. Moreover, as shown in Table 3 and Table 4, the proposed algorithm achieves an obviously better generalization than the other algorithms, including Marra's method.
Fig.8. Comparison of robustness against three types of attacks on the testing dataset consisting of GWGAN-GP, GBEGAN, GPGGAN, GStyleGAN, and RCelebA: (a) Gaussian blur; (b) JPEG compression; (c) Gaussian noise addition. The compared algorithms are Quan, Mo, Hsu, Marra, Chang, Li, the proposed algorithm with only the global branch, and the proposed algorithm with dual branches.
V.Conclusions
In order to improve the generalization capability of
the existing GAN-generated face image detection al-
gorithms, we have proposed a general solution by com-
bining both the global and local features and also using
the metric learning based on the ArcFace loss. Experi-
mental results have demonstrated that the proposed al-
gorithm achieves a satisfactory generalization capabil-
ity with an average accuracy value over 0.99 on all the
eight testing datasets and outperforms some existing al-
gorithms. The main reasons are as follows: a) the learn-
ing on important local areas is strengthened by combin-
ing the global and local features extracted by residual
attention network; b) the metric learning is applied to obtain common features in the same type of faces and discriminative features between natural and GAN-generated faces in the feature learning phase. Certainly, the
performance of the proposed algorithm in the robust-
ness against additional attacks, especially for JPEG
compression, still needs to be improved. This is one of
our future objectives. In addition, all the current works,
including our work, focus on GAN-generated face detec-
tion in plaintext. In the future, we will try to detect the
encrypted fake faces for privacy protection.

Table 4.Comparison of the generalization capability on GPGGAN, GStyleGAN, and
four other datasets in terms of the Accuracy value
Algorithms Train on RCelebA, GWGAN-GP, GBEGAN, and GStyleGAN Train on RCelebA, GWGAN-GP, GBEGAN, and GPGGAN
GPGGAN GWGAN GDCGAN GLSGAN RFFHQ GStyleGAN GWGAN GDCGAN GLSGAN RFFHQ
Quan[36]0.7963 0.9163 0.9998 0.9840 0.7101 0.8404 0.9299 0.9996 0.9997 0.7852
Mo[14]0.8935 0.9688 1.0000 1.0000 0.9599 0.8719 0.9999 1.0000 1.0000 0.8845
Hsu[15]0.9148 0.9987 0.9999 1.0000 0.6812 0.9602 0.9998 0.9988 0.9998 0.8277
Marra[12]0.8541 0.8945 0.8847 0.8934 0.9933 0.8767 0.9552 0.9995 1.0000 0.9815
Chang[37]0.8551 0.9586 0.9899 0.9999 0.9367 0.8202 0.9999 0.9835 0.9999 0.8564
Li[38]0.7690 0.8770 0.9886 0.9662 0.6277 0.6010 0.7150 0.9243 0.9204 0.6867
Proposed with only global branch 0.9998 1.0000 1.0000 1.0000 0.9977 0.9495 1.0000 1.0000 1.0000 0.9598
Proposed with dual branches 0.9919 1.0000 1.0000 1.0000 0.9949 0.9910 1.0000 0.9999 1.0000 0.9954
References
[1] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," arXiv preprint, arXiv:1703.10717, 2017.
[2] T. Karras, T. Aila, S. Laine, et al., "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint, arXiv:1710.10196, 2017.
[3] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, pp.4401–4410, 2019.
[4] X. Xu, L. Zhang, B. Lang, et al., "Research on inception module incorporated Siamese convolutional neural networks to realize face recognition," Acta Electronica Sinica, vol.48, no.4, pp.643–647, 2020. (in Chinese)
[5] H. Li, Q. Li, and L. Zhou, "Dynamic facial expression recognition based on multi-visual and audio descriptors," Acta Electronica Sinica, vol.47, no.8, pp.1643–1653, 2019. (in Chinese)
[6] C. Gao, X. Li, F. Zhou, et al., "Face liveness detection based on the improved CNN with context and texture information," Chinese Journal of Electronics, vol.28, no.6, pp.1092–1098, 2019.
[7] X. Yang, Y. Li, H. Qi, et al., "Exposing GAN-synthesized faces using landmark locations," in Proc. of the ACM Workshop on Information Hiding and Multimedia Security, Paris, pp.113–118, 2019.
[8] H. Li, B. Li, S. Tan, et al., "Detection of deep network generated images using disparities in color components," arXiv preprint, arXiv:1808.07276, 2018.
[9] L. Nataraj, T. M. Mohammed, B. S. Manjunath, et al., "Detecting GAN generated fake images using co-occurrence matrices," Electronic Imaging, vol.532, no.5, pp.1–7, 2019.
[10] S. McCloskey and M. Albright, "Detecting GAN-generated imagery using saturation cues," in Proc. of the 2019 IEEE International Conference on Image Processing, Taipei, pp.4584–4588, 2019.
[11] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in Proc. of the 2019 IEEE Winter Applications of Computer Vision Workshops, Waikoloa, HI, pp.83–92, 2019.
[12] F. Marra, D. Gragnaniello, D. Cozzolino, et al., "Detection of GAN-generated fake images over social networks," in Proc. of the 2018 IEEE Conference on Multimedia Information Processing and Retrieval, Miami, Florida, pp.384–389, 2018.
[13] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, pp.1251–1258, 2017.
[14] H. Mo, B. Chen, and W. Luo, "Fake faces identification via convolutional neural network," in Proc. of the ACM Workshop on Information Hiding and Multimedia Security, Innsbruck, pp.43–47, 2018.
[15] C. C. Hsu, Y. X. Zhuang, and C. Y. Lee, "Deep fake image detection based on pairwise learning," Applied Sciences, vol.10, no.1, article no.370, 2020.
[16] S. Tariq, S. Lee, H. Kim, et al., "Detecting both machine and human created fake face images in the wild," in Proc. of the 2nd International Workshop on Multimedia Privacy and Security, Toronto, pp.81–87, 2018.
[17] S. Y. Wang, O. Wang, R. Zhang, et al., "CNN-generated images are surprisingly easy to spot... for now," in Proc. of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, pp.8695–8704, 2020.
[18] B. Liu and C. M. Pun, "Locating splicing forgery by fully convolutional networks and conditional random field," Signal Processing: Image Communication, vol.66, pp.103–112, 2018.
[19] B. Chen, W. Tan, G. Coatrieux, et al., "A serial image copy-move forgery localization scheme with source/target distinguishment," IEEE Transactions on Multimedia, vol.23, pp.3506–3517, 2020.
[20] C. Ding and D. Tao, "Robust face recognition via multimodal deep face representation," IEEE Transactions on Multimedia, vol.17, no.11, pp.2049–2058, 2015.
[21] B. Chen, X. Ju, B. Xiao, et al., "Locally GAN-generated face detection based on an improved Xception," Information Sciences, vol.572, pp.16–28, 2021.
[22] R. Huang, S. Zhang, T. Li, et al., "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," in Proc. of the 2017 IEEE International Conference on Computer Vision, Honolulu, Hawaii, pp.2439–2448, 2017.
[23] J. Deng, J. Guo, N. Xue, et al., "ArcFace: Additive angular margin loss for deep face recognition," in Proc. of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, pp.4690–4699, 2019.
[24] H. Larochelle and G. E. Hinton, "Learning to combine foveal glimpses with a third-order Boltzmann machine," in Proc. of the 24th Annual Conference on Neural Information Processing Systems 2010: Advances in Neural Information Processing Systems, Vancouver, pp.1243–1251, 2010.
[25] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, pp.7132–7141, 2018.
[26] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in Proc. of the 2005 IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, pp.539–546, 2005.
[27] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp.815–823, 2015.
[28] K. He, X. Zhang, S. Ren, et al., "Deep residual learning for image recognition," in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, pp.770–778, 2016.
[29] D. King, "Dlib C++ library," available at: http://dlib.net, 2018.
[30] Z. Liu, P. Luo, X. Wang, et al., "Deep learning face attributes in the wild," in Proc. of the 2015 IEEE International Conference on Computer Vision, Boston, MA, pp.3730–3738, 2015.
[31] I. Gulrajani, F. Ahmed, M. Arjovsky, et al., "Improved training of Wasserstein GANs," arXiv preprint, arXiv:1704.00028, 2017.
[32] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint, arXiv:1701.07875, 2017.
[33] X. Mao, Q. Li, H. Xie, et al., "Least squares generative adversarial networks," in Proc. of the 2017 IEEE International Conference on Computer Vision, Honolulu, HI, pp.2813–2821, 2017.
[34] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint, arXiv:1511.06434, 2015.
[35] I. Sutskever, J. Martens, G. Dahl, et al., "On the importance of initialization and momentum in deep learning," in Proc. of the International Conference on Machine Learning, Atlanta, Georgia, pp.1139–1147, 2013.
[36] W. Quan, K. Wang, D. M. Yan, et al., "Distinguishing between natural and computer-generated images using convolutional neural networks," IEEE Transactions on Information Forensics and Security, vol.13, no.11, pp.2772–2787, 2018.
[37] X. Chang, J. Wu, T. Yang, et al., "DeepFake face image detection based on improved VGG convolutional neural network," in Proc. of the 39th Chinese Control Conference, Shenyang, pp.7252–7256, 2020.
[38] H. Li, B. Li, S. Tan, et al., "Identification of deep network generated images using disparities in color components," Signal Processing, vol.174, article no.107616, 2020.
CHENBeijing(corresponding
author) received the Ph.D. degree in
computer science from Southeast Uni-
versity, Nanjing, China, in 2011. Now he
is a Professor in School of Computer,
Nanjing University of Information Sci-
ence and Technology, China. His re-
search interests include color image pro-
cessing, image forensics, image water-
marking, and pattern recognition. He serves as an Editorial Board
Member of the Journal of Mathematical Imaging and Vision.
(Email: nbutimage@126.com)
TANWeijinreceived the M.S.
degree in computer science and techno-
logy from Nanjing University of Informa-
tion Science and Technology, Nanjing,
China, in 2011. His research interests in-
clude image forensics and image pro-
cessing.
WANGYitingreceived the
B.S. degree in safety engineering from
Nanjing University of Information Sci-
ence and Technology, Nanjing, China, in
2019. Now she is pursuing the Ph.D. de-
gree in Warwick Manufacturing Group,
University of Warwick, UK. Her re-
search interests include machine learning
and image processing.
ZHAOGuoyingreceived the
Ph.D. degree in computer science from
the Chinese Academy of Sciences,
Beijing, China, in 2005. She is currently a
Professor with the Center for Machine
Vision and Signal Analysis, University of
Oulu, Finland. She is a Fellow of the
IAPR. She has authored or coauthored
more than 240 papers in journals and
conferences. Her current research interests include image and
video descriptors, facial expression and micro-expression recogni-
tion, and person identification.