Deep Learning for Deepfakes Creation and Detection: A Survey
Thanh Thi Nguyen^a, Quoc Viet Hung Nguyen^b, Dung Tien Nguyen^a, Duc Thanh Nguyen^a, Thien Huynh-The^c, Saeid Nahavandi^d, Thanh Tam Nguyen^e, Quoc-Viet Pham^f, Cuong M. Nguyen^g
^a School of Information Technology, Deakin University, Victoria, Australia
^b School of Information and Communication Technology, Griffith University, Queensland, Australia
^c ICT Convergence Research Center, Kumoh National Institute of Technology, Gyeongbuk, Republic of Korea
^d Institute for Intelligent Systems Research and Innovation, Deakin University, Victoria, Australia
^e Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
^f Korean Southeast Center for the 4th Industrial Revolution Leader Education, Pusan National University, Busan, Republic of Korea
^g LAMIH UMR CNRS 8201, Universite Polytechnique Hauts-de-France, Valenciennes, France
Abstract
Deep learning has been successfully applied to solve various complex problems ranging from big data analytics to computer vision and human-level control. Advances in deep learning, however, have also been employed to create software that can threaten privacy, democracy and national security. One such deep learning-powered application that has recently emerged is deepfake. Deepfake algorithms can create fake images and videos that humans cannot distinguish from authentic ones. Technologies that can automatically detect and assess the integrity of digital visual media are therefore indispensable. This paper presents a survey of algorithms used to create deepfakes and, more importantly, methods proposed to detect deepfakes in the literature to date. We present extensive discussions on challenges, research trends and directions related to deepfake technologies. By reviewing the background of deepfakes and state-of-the-art deepfake detection methods, this study provides a comprehensive overview of deepfake techniques and facilitates the development of new and more robust methods to deal with the increasingly challenging deepfakes.
Keywords: deepfakes, face manipulation, artificial intelligence, deep learning, autoencoders, GAN, forensics, survey
Published in Computer Vision and Image Understanding, https://doi.org/10.1016/j.cviu.2022.103525
1. Introduction
In a narrow definition, deepfakes (stemming from
“deep learning” and “fake”) are created by techniques
that can superimpose face images of a target person onto
a video of a source person to make a video of the target
person doing or saying things the source person does.
This constitutes a category of deepfakes, namely face-
swap. In a broader definition, deepfakes are artificial
intelligence-synthesized content that can also fall into
two other categories, i.e., lip-sync and puppet-master.
Lip-sync deepfakes refer to videos that are modified to
make the mouth movements consistent with an audio
recording. Puppet-master deepfakes include videos of
a target person (puppet) who is animated following the
facial expressions, eye and head movements of another
person (master) sitting in front of a camera [1].
While some deepfakes can be created by traditional
visual effects or computer-graphics approaches, the re-
cent common underlying mechanism for deepfake cre-
ation is deep learning models such as autoencoders and
generative adversarial networks (GANs), which have
been applied widely in the computer vision domain [2–
8]. These models are used to examine facial expressions
and movements of a person and synthesize facial images
of another person making analogous expressions and
movements [9]. Deepfake methods normally require a
large amount of image and video data to train models
to create photo-realistic images and videos. As pub-
lic figures such as celebrities and politicians may have
a large number of videos and images available online,
they are initial targets of deepfakes. Deepfakes have been used to swap the faces of celebrities or politicians onto bodies in porn images and videos. The first deepfake video emerged in 2017, in which the face of a celebrity was swapped onto the body of a porn actor. World security is also threatened when deepfake methods are employed to create videos of world leaders with fake speeches for falsification purposes [10–12]. Deepfakes can therefore be abused to cause political or religious tensions between countries, to fool the public and affect results in election campaigns, or to create chaos in financial markets by creating fake news [13–15]. They can even be used to generate fake satellite images of the Earth containing objects that do not really exist in order to confuse military analysts, e.g., creating a fake bridge across a river although there is no such bridge in reality. This can mislead troops who have been guided to cross the bridge in a battle [16, 17].
While the democratization of creating realistic digital humans has positive implications, there are also positive uses of deepfakes, such as their applications in visual effects, digital avatars, Snapchat filters, creating voices for those who have lost theirs, or updating episodes of movies without reshooting them [18]. Deepfakes can
have creative or productive impacts in photography,
video games, virtual reality, movie productions, and
entertainment, e.g., realistic video dubbing of foreign
films, education through the reanimation of historical
figures, virtually trying on clothes while shopping, and
so on [19, 20]. However, the number of malicious uses
of deepfakes largely dominates that of the positive ones.
The development of advanced deep neural networks and the availability of large amounts of data have made forged images and videos almost indistinguishable to humans and even to sophisticated computer algorithms. The process of creating manipulated images and videos is also much simpler today, as it needs as little as an identity photo or a short video of a target individual. Less and less effort is required to produce stunningly convincing tampered footage. Recent advances can even create a deepfake from just a still image [21].
Deepfakes therefore can be a threat affecting not only
public figures but also ordinary people. For example, a
voice deepfake was used to scam a CEO out of $243,000
[22]. The recent release of software called DeepNude presents even more disturbing threats, as it can transform a person's image into non-consensual pornography [23]. Likewise, the Chinese app Zao went viral as less-skilled users can swap their faces onto the bodies of movie stars and insert themselves into well-known movies and TV clips [24]. These forms of falsification pose a huge threat to privacy and identity, and affect many aspects of human life.
Finding the truth in the digital domain has therefore become increasingly critical. The challenge is even greater when dealing with deepfakes, as they are predominantly used for malicious purposes and almost anyone can create deepfakes these days using existing deepfake tools. Thus far, numerous methods have been proposed to detect deepfakes [25–29]. Most of them are based on deep learning, and thus a battle between malicious and positive uses of deep learning methods has been arising.
To address the threat of face-swapping technology or deepfakes, the United States Defense Advanced Research Projects Agency (DARPA) initiated a research scheme in media forensics (named Media Forensics or MediFor) to accelerate the development of fake digital visual media detection methods [30]. Recently, Facebook Inc., teaming up with Microsoft Corp and the Partnership on AI coalition, launched the Deepfake Detection Challenge to catalyse more research and development in detecting and preventing deepfakes from being used to mislead viewers [31]. Data obtained from https://app.dimensions.ai at the end of 2021 show that the number of deepfake papers has increased significantly in recent years (Fig. 1). Although the obtained numbers of deepfake papers may be lower than the actual numbers, the research trend of this topic is clearly increasing.
Fig. 1. Number of papers related to deepfakes from 2016 to 2021, obtained from https://app.dimensions.ai at the end of 2021 with the search keyword “deepfake” applied to the full text of scholarly papers.
Several survey papers about creating and detecting deepfakes have been published [19, 20, 32].
For example, Mirsky and Lee [19] focused on reen-
actment approaches (i.e., to change a target’s expres-
sion, mouth, pose, gaze or body), and replacement
approaches (i.e., to replace a target’s face by swap
or transfer methods). Verdoliva [20] separated detec-
tion approaches into conventional methods (e.g., blind
methods without using any external data for train-
ing, one-class sensor-based and model-based methods,
and supervised methods with handcrafted features) and
deep learning-based approaches (e.g., CNN models).
Tolosana et al. [32] categorized both creation and detec-
tion methods based on the way deepfakes are created,
including entire face synthesis, identity swap, attribute
manipulation, and expression swap. On the other hand,
we carry out the survey with a different perspective and
taxonomy. We categorize the deepfake detection methods based on the data type, i.e., images or videos, as presented in Fig. 2. With fake image detection methods, we focus on the features that are used, i.e., whether they are handcrafted features or deep features. With fake video detection methods, two main subcategories are identified based on whether the method uses temporal features across frames or visual artifacts within a video frame. We also discuss extensively the challenges, research trends and directions on deepfake detection and multimedia forensics problems.
Fig. 2. Categories of reviewed papers relevant to deepfake detection methods, divided into two major groups, i.e., fake image detection and fake video detection.
2. Deepfake Creation
Deepfakes have become popular due to the quality of tampered videos and also the ease with which their applications can be used by a wide range of users with various computer skills, from professional to novice. These
applications are mostly developed based on deep learn-
ing techniques. Deep learning is well known for its ca-
pability of representing complex and high-dimensional
data. One variant of the deep networks with that ca-
pability is deep autoencoders, which have been widely
applied for dimensionality reduction and image com-
pression [33–35]. The first attempt at deepfake creation was FakeApp, developed by a Reddit user using an autoencoder-decoder pairing structure [36, 37]. In that
method, the autoencoder extracts latent features of face
images and the decoder is used to reconstruct the face
images. To swap faces between source and target images, two encoder-decoder pairs are needed, where each pair is trained on an image set, and the encoder's parameters are shared between the two net-
work pairs. In other words, two pairs have the same
encoder network. This strategy enables the common en-
coder to find and learn the similarity between two sets
of face images, which is relatively unchallenging because faces normally have similar features such as eye, nose and mouth positions. Fig. 3 shows a deepfake creation process where the feature set of face A is connected with decoder B to reconstruct face B from the original face A. This approach is applied in several works such as DeepFaceLab [38], DFaker [39], and DeepFake tf (TensorFlow-based deepfakes) [40].
Fig. 3. A deepfake creation model using two encoder-decoder pairs. The two networks use the same encoder but different decoders for the training process (top). An image of face A is encoded with the common encoder and decoded with decoder B to create a deepfake (bottom). The reconstructed image (bottom) is face B with the mouth shape of face A: face B originally has a mouth shaped like an upside-down heart, while the reconstructed face B has a mouth shaped like a conventional heart.
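To make this shared-encoder design concrete, the following is a minimal PyTorch sketch of the two encoder-decoder pairs; the layer sizes, the L1 reconstruction loss and the 64x64 input resolution are illustrative assumptions, not the exact configuration of any of the tools above.

```python
import torch
import torch.nn as nn

# Shared encoder: compresses a 64x64 RGB face into a 512-d latent code.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 512),
)

def make_decoder():
    # One decoder per identity; both decode from the same latent space.
    return nn.Sequential(
        nn.Linear(512, 64 * 16 * 16), nn.ReLU(),
        nn.Unflatten(1, (64, 16, 16)),
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
    )

decoder_a, decoder_b = make_decoder(), make_decoder()
recon = nn.L1Loss()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder_a.parameters()) + list(decoder_b.parameters()),
    lr=1e-4,
)

def train_step(faces_a, faces_b):
    # Each decoder reconstructs its own identity from the shared latent space,
    # which forces the encoder to learn identity-agnostic facial structure.
    loss = recon(decoder_a(encoder(faces_a)), faces_a) + \
           recon(decoder_b(encoder(faces_b)), faces_b)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def swap_a_to_b(face_a):
    # Face swap at inference time: encode face A, decode with decoder B.
    with torch.no_grad():
        return decoder_b(encoder(face_a))
```

Because the encoder never sees identity-specific decoders during inference, swapping is simply a matter of routing its output to the other decoder, as in the bottom half of Fig. 3.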
By adding adversarial loss and perceptual loss imple-
mented in VGGFace [57] to the encoder-decoder archi-
tecture, an improved version of deepfakes based on the
generative adversarial network [4], i.e., faceswap-GAN,
was proposed in [58]. The VGGFace perceptual loss is added to make eye movements more realistic and consistent with the input faces, and to help smooth out artifacts in the segmentation mask, leading to higher quality output videos. This model facilitates the creation of outputs with 64x64, 128x128, and 256x256 resolutions.
In addition, the multi-task convolutional neural network
(CNN) from the FaceNet implementation [59] is used
to make face detection more stable and face alignment
more reliable. The CycleGAN [60] is utilized for gen-
erative network implementation in this model.
A conventional GAN model comprises two neural networks: a generator and a discriminator, as depicted in Fig. 4. Given a dataset of real images x having a distribution p_data, the aim of the generator G is to produce images G(z) similar to the real images x, with z being noise signals having a distribution p_z. The aim of the discriminator D is to correctly classify images generated
Table 1: Summary of notable deepfake tools

- Faceswap (https://github.com/deepfakes/faceswap): uses two encoder-decoder pairs; the parameters of the encoder are shared between the pairs.
- Faceswap-GAN (https://github.com/shaoanlu/faceswap-GAN): adversarial loss and perceptual loss (VGGFace) are added to an autoencoder architecture.
- Few-Shot Face Translation (https://github.com/shaoanlu/fewshot-face-translation-GAN): uses a pre-trained face recognition model to extract latent embeddings for GAN processing; incorporates semantic priors obtained by modules from FUNIT [41] and SPADE [42].
- DeepFaceLab (https://github.com/iperov/DeepFaceLab): expands the Faceswap method with new models, e.g., H64, H128, LIAEF128, SAE [43]; supports multiple face extraction modes, e.g., S3FD, MTCNN, dlib, or manual [43].
- DFaker (https://github.com/dfaker/df): the DSSIM loss function [44] is used to reconstruct faces; implemented on the Keras library.
- DeepFake tf (https://github.com/StromWine/DeepFake_tf): similar to DFaker but implemented on TensorFlow.
- AvatarMe (https://github.com/lattas/AvatarMe): reconstructs 3D faces from arbitrary "in-the-wild" images; can reconstruct authentic 4K-by-6K-resolution 3D faces from a single low-resolution image [45].
- MarioNETte (https://hyperconnect.github.io/MarioNETte): a few-shot face reenactment framework that preserves the target identity; no additional fine-tuning phase is needed for identity adaptation [46].
- DiscoFaceGAN (https://github.com/microsoft/DiscoFaceGAN): generates face images of virtual people with independent latent variables for identity, expression, pose, and illumination; embeds 3D priors into adversarial learning [47].
- StyleRig (https://gvv.mpi-inf.mpg.de/projects/StyleRig): creates portrait images of faces with rig-like control over a pretrained and fixed StyleGAN via 3D morphable face models; self-supervised, without manual annotations [48].
- FaceShifter (https://lingzhili.com/FaceShifterPage): high-fidelity face swapping by exploiting and integrating the target attributes; can be applied to any new face pairs without requiring subject-specific training [49].
- FSGAN (https://github.com/YuvalNirkin/fsgan): a face swapping and reenactment model that can be applied to pairs of faces without requiring training on those faces; adjusts to both pose and expression variations [50].
- StyleGAN (https://github.com/NVlabs/stylegan): a new generator architecture for GANs based on the style transfer literature; leads to automatic, unsupervised separation of high-level attributes and enables intuitive, scale-specific control of image synthesis [51].
- Face2Face (https://justusthies.github.io/posts/face2face/): real-time facial reenactment of a monocular target video sequence, e.g., a YouTube video; animates the facial expressions of the target video by a source actor and re-renders the manipulated output video in a photo-realistic fashion [52].
- Neural Textures (https://github.com/SSRSGJYD/NeuralTexture): feature maps learned as part of the scene capture process and stored as maps on top of 3D mesh proxies; can coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates [53].
- Transformable Bottleneck Networks (https://github.com/kyleolsz/TB-Networks): a method for fine-grained 3D manipulation of image content; applies spatial transformations in CNN models using a transformable bottleneck framework [54].
- "Do as I Do" Motion Transfer (github.com/carolineec/EverybodyDanceNow): automatically transfers the motion from a source to a target person by learning a video-to-video translation; can create a motion-synchronized dancing video with multiple subjects [55].
- Neural Voice Puppetry (https://justusthies.github.io/posts/neural-voice-puppetry): a method for audio-driven facial video synthesis; synthesizes videos of a talking head from an audio sequence of another person using a 3D face representation [56].
Fig. 4. The GAN architecture consists of a generator and a discriminator, each of which can be implemented by a neural network. The entire system can be trained with backpropagation, allowing both networks to improve their capabilities.
by G and real images x. The discriminator D is trained to improve its classification capability, i.e., to maximize D(x), which represents the probability that x is a real image rather than a fake one generated by G. On the other hand, G is trained to minimize the probability that its outputs are classified by D as synthetic images, i.e., to minimize 1 − D(G(z)). This is a minimax game between the two players D and G that can be described by the following value function [4]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$
After sufficient training, both networks improve their capabilities: the generator G is able to produce images that are very similar to real images, while the discriminator D is highly capable of distinguishing fake images from real ones.
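As a concrete illustration of how Eq. (1) is optimized in practice, the following is a minimal PyTorch sketch of one alternating training step. The generator G and discriminator D are assumed to be predefined networks whose discriminator outputs lie in (0, 1), and the generator update uses the common non-saturating variant, i.e., maximizing log D(G(z)) rather than directly minimizing log(1 − D(G(z))).

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # reproduces the log terms of the value function in Eq. (1)

def gan_step(G, D, opt_g, opt_d, real, z_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()  # detach so only D is updated here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: non-saturating form, maximize log D(G(z)).
    z = torch.randn(batch, z_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```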
Table 1 presents a summary of popular deepfake tools
and their typical features. Among them, a prominent
method for face synthesis based on a GAN model,
namely StyleGAN, was introduced in [51]. StyleGAN
is motivated by style transfer [61] with a special gen-
erator network architecture that is able to create realis-
tic face images. In a traditional GAN model, e.g., the
progressive growing of GANs (PGGAN) [62], the noise signal (latent code) is fed to the input layer of a feed-forward network that represents the generator. In StyleGAN, two networks are constructed and linked together: a mapping network f and a synthesis network g. The latent code z ∈ Z is first converted to w ∈ W (where W is an intermediate latent space) through a non-linear function f : Z → W, which is characterized by a neural network (i.e., the mapping network) consisting of several fully connected layers. Using an affine transformation, the intermediate representation w is specialized to styles y = (y_s, y_b) that are fed to the adaptive instance normalization (AdaIN) operations, specified as:

$$\mathrm{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i} \qquad (2)$$
where each feature map x_i is normalized separately. The StyleGAN generator architecture allows controlling the image synthesis by modifying the styles at different scales. In addition, instead of using one random latent code during training, this method uses two latent codes to generate a given proportion of images. More specifically, two latent codes z_1 and z_2 are fed to the mapping network to create w_1 and w_2, respectively, which control the styles by applying w_1 before and w_2 after a crossover point. Fig. 5 demonstrates examples of images created by mixing two latent codes at three different scales, where each subset of styles controls separate meaningful high-level attributes of the image. In other words, the generator architecture of StyleGAN is able to learn separation of high-level attributes (e.g., pose and identity when trained on human faces) and enables intuitive, scale-specific control of the face synthesis.
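As an illustration, Eq. (2) can be implemented in a few lines. The sketch below assumes feature maps in NCHW layout and adds a small epsilon for numerical stability (an implementation detail, not part of the original formulation).

```python
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """Adaptive instance normalization as in Eq. (2).

    x:   feature maps of shape (batch, channels, height, width)
    y_s: per-channel style scale of shape (batch, channels)
    y_b: per-channel style bias of shape (batch, channels)
    """
    # Normalize each feature map x_i separately over its spatial dimensions.
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    x_norm = (x - mu) / (sigma + eps)
    # Apply the style-specific scale and bias.
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]
```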
Fig. 5. Examples of mixing styles using StyleGAN: the output im-
ages are generated by copying a specified subset of styles from source
B and taking the rest from source A. a) Copying coarse styles from
source B will generate images that have high-level aspects from
source B and all colors and finer facial features from source A; b)
if copying the styles of middle resolutions from B, the output images
will have smaller scale facial features from B and preserve the pose,
general face shape, and eyeglasses from A; c) if copying the fine styles
from source B, the generated images will have the color scheme and
microstructure of source B [51].
3. Deepfake Detection
Deepfake detection is normally deemed a binary classification problem where classifiers are used to distinguish between authentic videos and tampered ones. This kind of method requires a large database of real and fake videos to train classification models. The number of available fake videos is increasing, but it is still limited in terms of setting a benchmark for validating various detection methods. To address this issue,
Korshunov and Marcel [63] produced a notable deep-
fake dataset consisting of 620 videos based on the GAN
model using the open source code Faceswap-GAN [58].
Videos from the publicly available VidTIMIT database
[64] were used to generate low and high quality deep-
fake videos, which can effectively mimic the facial ex-
pressions, mouth movements, and eye blinking. These
videos were then used to test various deepfake detection
methods. Test results show that the popular face recog-
nition systems based on VGG [65] and Facenet [59, 66]
are unable to detect deepfakes effectively. Other meth-
ods such as lip-syncing approaches [67–69] and im-
age quality metrics with support vector machine (SVM)
[70] produce very high error rates when applied to detect deepfake videos from this newly produced dataset. This raises concerns about the critical need for future development of more robust methods that can distinguish deepfakes from genuine content.
This section presents a survey of deepfake detection
methods where we group them into two major cate-
gories: fake image detection methods and fake video
detection ones (Fig. 2). The latter is divided into two smaller groups: methods based on visual artifacts within a single video frame and those based on temporal features across frames. Whilst most of the methods based on temporal features use deep learning recurrent classification models, the methods using visual artifacts within video frames can be implemented with either deep or shallow classifiers.
3.1. Fake Image Detection
Deepfakes are increasingly detrimental to privacy, societal security and democracy [71]. Methods for detecting deepfakes were proposed as soon as this threat was introduced. Early attempts were based on handcrafted features obtained from artifacts and inconsistencies of the fake image synthesis process.
ods, e.g., [72, 73], have commonly applied deep learn-
ing to automatically extract salient and discriminative
features to detect deepfakes.
3.1.1. Handcrafted Features-based Methods
Most works on detection of GAN generated images
do not consider the generalization capability of the de-
tection models although the development of GAN is on-
going, and many new extensions of GAN are frequently
introduced. Xuan et al. [74] used an image preprocessing step, e.g., Gaussian blur and Gaussian noise, to remove low-level, high-frequency clues from GAN images. This increases the pixel-level statistical similarity between real and fake images and allows the forensic classifier to learn more intrinsic and meaningful features, giving it better generalization capability than previous image forensics methods [75, 76] or image steganalysis networks [77].
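A minimal sketch of such a preprocessing step, using OpenCV, might look as follows; the kernel size and noise level are illustrative assumptions, not the settings reported in [74].

```python
import cv2
import numpy as np

def preprocess(img, blur_ksize=5, noise_sigma=2.0):
    """Suppress low-level high-frequency GAN clues before training."""
    # Gaussian blur removes high-frequency synthesis artifacts.
    out = cv2.GaussianBlur(img, (blur_ksize, blur_ksize), 0)
    # Additive Gaussian noise further aligns the pixel-level statistics
    # of real and fake images, forcing the classifier to rely on
    # more intrinsic, transferable features.
    noise = np.random.normal(0.0, noise_sigma, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```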
Zhang et al. [78] used the bag of words method to extract a set of compact features and fed them into various classifiers such as SVM [79], random forest (RF) [80] and multi-layer perceptrons (MLP) [81] for discriminating swapped face images from genuine ones. Among deep learning-generated images, those synthesised by GAN models are probably the most difficult to detect, as they are realistic and of high quality owing to the GAN's capability to learn the distribution of complex input data and generate new outputs with a similar distribution.
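A minimal sketch of the shallow-classifier stage of such a pipeline, using scikit-learn, is shown below. The random arrays stand in for the actual bag-of-words feature vectors, whose extraction (keypoint description plus codebook quantization) is assumed to be done upstream as in [78].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# X: bag-of-words feature vectors (one row per face image);
# y: labels (0 = genuine, 1 = swapped face). Placeholders only.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 100)), rng.integers(0, 2, size=200)

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100),
    "MLP": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, "train accuracy:", clf.score(X, y))
```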
On the other hand, Agarwal and Varshney [82] cast
the GAN-based deepfake detection as a hypothesis test-
ing problem where a statistical framework was intro-
duced using the information-theoretic study of authen-
tication [83]. The minimum distance between distribu-
tions of legitimate images and images generated by a
particular GAN is defined, namely the oracle error. The
analytic results show that this distance increases when
the GAN is less accurate, and in this case, it is easier
to detect deepfakes. In the case of high-resolution image inputs, an extremely accurate GAN is required to generate fake images that are hard to detect by this method.
3.1.2. Deep Features-based Methods
Face swapping has a number of compelling applica-
tions in video compositing, transfiguration in portraits,
and especially in identity protection, as it can replace faces in photographs with ones from a collection of stock images. However, it is also one of the techniques that
cyber attackers employ to penetrate identification or au-
thentication systems to gain illegitimate access. The
use of deep learning such as CNN and GAN has made
swapped face images more challenging for forensics
models as it can preserve pose, facial expression and
lighting of the photographs [84].
Hsu et al. [85] introduced a two-phase deep learn-
ing method for detection of deepfake images. The first
phase is a feature extractor based on the common fake
feature network (CFFN) where the Siamese network ar-
chitecture presented in [86] is used. The CFFN en-
compasses several dense units with each unit including
different numbers of dense blocks [61] to improve the
representative capability for the input images. Discrim-
inative features between the fake and real images are
extracted through the CFFN learning process based on
the use of pairwise information, which is the label of
each pair of two input images. If the two images are
of the same type, i.e., fake-fake or real-real, the pair-
wise label is 1. In contrast, if they are of different types,
i.e., fake-real, the pairwise label is 0. The CFFN-based
discriminative features are then fed to a neural network
classifier to distinguish deceptive images from genuine.
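The pairwise-label idea can be sketched as a contrastive-style objective over pairs of extracted features. The snippet below illustrates the principle only, not the exact CFFN loss, and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(feat1, feat2, same_type, margin=1.0):
    """Contrastive objective over pairwise labels: same_type is 1 for
    fake-fake or real-real pairs and 0 for fake-real pairs, matching
    the labeling scheme of the CFFN [85]."""
    d = F.pairwise_distance(feat1, feat2)
    # Pull same-type pairs together; push mixed pairs at least `margin` apart.
    return torch.mean(same_type * d.pow(2) +
                      (1 - same_type) * F.relu(margin - d).pow(2))
```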
The proposed method is validated for both fake face
and fake general image detection. On the one hand,
the face dataset is obtained from CelebA [87], contain-
ing 10,177 identities and 202,599 aligned face images
of various poses and background clutter. Five GAN variants are used to generate fake images of size 64x64, including the deep convolutional GAN (DCGAN)
[88], Wasserstein GAN (WGAN) [89], WGAN with
gradient penalty (WGAN-GP) [90], least squares GAN
[91], and PGGAN [62]. A total of 385,198 training im-
ages and 10,000 test images of both real and fake ones
are obtained for validating the proposed method. On
the other hand, the general dataset is extracted from the
ILSVRC12 [92]. The large scale GAN training model
for high fidelity natural image synthesis (BIGGAN)
[93], self-attention GAN [94] and spectral normalization GAN [95] are used to generate fake images of size 128x128. The training set consists of 600,000
fake and real images whilst the test set includes 10,000
images of both types. Experimental results show the su-
perior performance of the proposed method against its
competing methods such as those introduced in [96–99].
Likewise, Guo et al. [100] proposed a CNN model,
namely SCnet, to detect deepfake images, which are
generated by the Glow-based facial forgery tool [101].
The fake images synthesized by the Glow model [101]
have the facial expression maliciously tampered with. These images are hyper-realistic with near-perfect visual quality, but they still contain subtle manipulation traces, which are exploited by SCnet. SCnet is
able to automatically learn high-level forensics features
of image data thanks to a hierarchical feature extraction
block, which is formed by stacking four convolutional
layers. Each layer learns a new set of feature maps from
the previous layer, with each convolutional operation defined by:

$$f_j^{(n)} = \sum_{i=1}^{I} f_i^{(n-1)} * \omega_{ij}^{(n)} + b_j^{(n)} \qquad (3)$$

where f_j^{(n)} is the jth feature map of the nth layer, I is the number of feature maps in the (n−1)th layer, ω_{ij}^{(n)} is the weight of the ith channel of the jth convolutional kernel in the nth layer, and b_j^{(n)} is the bias term of the jth convolutional kernel in the nth layer. The proposed approach is evaluated using a dataset consisting
of 321,378 face images, which are created by applying
the Glow model [101] to the CelebA face image dataset
[87]. Evaluation results show that the SCnet model ob-
tains higher accuracy and better generalization than the
Meso-4 model proposed in [102].
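As a sketch, such a four-layer hierarchical feature extraction block could be expressed as follows; the channel counts, kernel sizes and activations are illustrative assumptions, not the published SCnet configuration.

```python
import torch.nn as nn

# Hierarchical feature extraction block formed by stacking four
# convolutional layers, in the spirit of SCnet [100].
feature_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
)
# Each layer computes Eq. (3): every output feature map f_j^(n) is the sum
# of the previous layer's maps convolved with the kernel channels, plus a bias.
```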
Recently, Zhao et al. [104] proposed a method
for deepfake detection using self-consistency of lo-
cal source features, which are content-independent,
spatially-local information of images. These features
could come from either imaging pipelines, encoding
methods or image synthesis approaches. The hypothe-
sis is that a modified image would have different source
features at different locations, while an original im-
age will have the same source features across loca-
tions. These source features, represented in the form
of down-sampled feature maps, are extracted by a CNN
model using a special representation learning method
called pairwise self-consistency learning. This learn-
ing method aims to penalize pairs of feature vectors
that refer to locations from the same image for hav-
ing a low cosine similarity score. At the same time, it
also penalizes the pairs from different images for hav-
ing a high similarity score. The learned feature maps
are then fed to a classification method for deepfake de-
tection. This proposed approach is evaluated on seven
popular datasets, including FaceForensics++ [105],
DeepfakeDetection [106], Celeb-DF-v1 & Celeb-DF-
v2 [107], Deepfake Detection Challenge (DFDC) [108],
DFDC Preview [109], and DeeperForensics-1.0 [110].
Experimental results demonstrate that the proposed ap-
proach is superior to state-of-the-art methods. It may, however, be limited when dealing with fake images generated by methods that directly output whole images, whose source features are consistent across all positions within each image.
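The core idea of pairwise self-consistency learning can be sketched as a cosine-similarity objective over feature-map locations. The snippet below illustrates the principle only; the way pairs are formed here is a simplifying assumption rather than the exact sampling scheme of [104].

```python
import torch
import torch.nn.functional as F

def consistency_loss(fmap, same_image):
    """fmap: source-feature map of shape (batch, channels, h, w).
    same_image: binary matrix over all location pairs, 1 when the two
    locations come from the same image, 0 otherwise."""
    b, c, h, w = fmap.shape
    v = F.normalize(fmap.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    sim = v @ v.t()  # cosine similarity between every pair of locations
    # Penalize low similarity within an image and high similarity across images.
    return torch.mean(same_image * (1 - sim) + (1 - same_image) * F.relu(sim))
```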
3.2. Fake Video Detection
Most image detection methods cannot be used for
videos because of the strong degradation of the frame
data after video compression [102]. Furthermore,
videos have temporal characteristics that vary across sets of frames, making them challenging for methods designed to detect only still fake images.
Fig. 6. A two-step process for face manipulation detection: the preprocessing step detects, crops and aligns faces in a sequence of frames, and the second step distinguishes manipulated from authentic face images by combining a convolutional neural network (CNN) and a recurrent neural network (RNN) [103].
This subsection focuses on deepfake video detection
methods and categorizes them into two smaller groups:
methods that employ temporal features and those that
explore visual artifacts within frames.
3.2.1. Temporal Features across Video Frames
Based on the observation that temporal coherence is
not enforced effectively in the synthesis process of deep-
fakes, Sabir et al. [103] leveraged the use of spatio-
temporal features of video streams to detect deepfakes.
Video manipulation is carried out on a frame-by-frame
basis so that low level artifacts produced by face ma-
nipulations are believed to further manifest themselves
as temporal artifacts with inconsistencies across frames.
A recurrent convolutional model (RCN) was proposed
based on the integration of the convolutional network
DenseNet [61] and the gated recurrent unit cells [111] to
exploit temporal discrepancies across frames (see Fig.
6). The proposed method is tested on the FaceForen-
sics++ dataset, which includes 1,000 videos [105], and
shows promising results.
Likewise, Güera and Delp [112] highlighted that deepfake videos contain intra-frame inconsistencies and temporal inconsistencies between frames. They then proposed a temporal-aware pipeline that uses a CNN and long short-term memory (LSTM) to detect deepfake videos. The CNN is employed to extract frame-
level features, which are then fed into the LSTM to cre-
ate a temporal sequence descriptor. A fully-connected
network is finally used for classifying doctored videos
from real ones based on the sequence descriptor as il-
lustrated in Fig. 7. An accuracy of greater than 97%
was obtained using a dataset of 600 videos, includ-
ing 300 deepfake videos collected from multiple video-
hosting websites and 300 pristine videos randomly se-
lected from the Hollywood human actions dataset in
[113].
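A minimal sketch of such a CNN + LSTM pipeline (illustrated in Fig. 7 below) is given here; the ResNet-18 backbone, hidden size and two-class head are illustrative assumptions, not the exact configuration of [112].

```python
import torch
import torch.nn as nn
from torchvision import models

class DeepfakeRNN(nn.Module):
    """Per-frame CNN features feed an LSTM whose sequence descriptor
    is classified by a fully-connected head."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # 512-d frame-level features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)     # authentic vs deepfake

    def forward(self, clip):                 # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)       # final state = sequence descriptor
        return self.head(h_n[-1])            # logits per clip
```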
Fig. 7. A deepfake detection method using a convolutional neural network (CNN) and long short-term memory (LSTM) to extract temporal features of a given video sequence, which are represented via the sequence descriptor. The detection network, consisting of fully-connected layers, takes the sequence descriptor as input and calculates the probabilities of the frame sequence belonging to either the authentic or the deepfake class [112].
On the other hand, the use of a physiological signal, eye blinking, to detect deepfakes was proposed by Li et al. [114], based on the observation that a person in deepfakes blinks much less frequently than in untampered videos. A healthy adult human normally blinks somewhere between every 2 to 10 seconds, and each blink takes 0.1 to 0.4 seconds. Deepfake algorithms, however, often use face images available online for training, and these normally show people with open eyes, i.e., very few images published on the internet show people with closed eyes. Thus, without access to images of people blinking, deepfake algorithms cannot generate fake faces that blink normally. In other words, blinking rates in deepfakes are much lower than those in normal videos. To discriminate real from fake videos, Li et al. [114] crop the eye areas in the videos and feed them into long-term recurrent convolutional networks (LRCN) [115] for dynamic state prediction. The LRCN consists of a feature extractor based on a CNN, a sequence learning module based on long short-term memory (LSTM), and a state prediction module based on a fully connected layer that predicts the probability of the eyes being open or closed. Eye blinking shows strong temporal dependencies, and the LSTM thus helps to capture these temporal patterns effectively.
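The blinking-rate cue itself can be approximated without a learned model by a simple landmark-based eye-aspect-ratio heuristic. The sketch below is such a proxy, not the LRCN dynamic-state predictor of [114]; the landmark ordering follows the common 68-point annotation and the threshold is an illustrative assumption.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks around one eye. A small ratio of the
    vertical distances to the horizontal distance indicates a closed eye."""
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def blink_rate(ear_series, fps, thresh=0.2):
    """Blinks per second from a per-frame EAR series; a video is
    suspicious when this falls far below normal human blinking rates."""
    closed = ear_series < thresh
    # Count closed-eye onsets (open -> closed transitions).
    blinks = np.sum(~closed[:-1] & closed[1:])
    return blinks / (len(ear_series) / fps)
```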
Recently, Caldelli et al. [116] proposed the use of
optical flow to gauge the information along the tempo-
ral axis of a frame sequence for video deepfake detec-
tion. The optical flow is a vector field calculated on two
temporal-distinct frames of a video that can describe the
movement of objects in a scene. The optical flow fields
are expected to be different between synthetically cre-
ated frames and naturally generated ones [117]. Un-
natural movements of lips, eyes, or of the entire faces
inserted into deepfake videos would introduce distinc-
tive motion patterns when compared with pristine ones.
Based on this assumption, features consisting of optical
flow fields are fed into a CNN model for discriminating
between deepfakes and original videos. More specifi-
cally, the ResNet50 architecture [118] is implemented
as a CNN model for experiments. The results obtained
using the FaceForensics++ dataset [105] show that this
approach is comparable with state-of-the-art methods in
terms of classification accuracy. A combination of this
kind of feature with frame-based features was also tested, resulting in improved deepfake detection performance. This demonstrates the usefulness of
optical flow fields in capturing the inconsistencies on
the temporal axis of video frames for deepfake detec-
tion.
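A dense optical flow field of this kind can be computed with OpenCV's Farneback method. The sketch below shows the extraction step, whose two-channel output could then be fed to a CNN such as ResNet50; the parameter values are typical illustrative choices, not those of [116].

```python
import cv2

def flow_field(frame1, frame2):
    """Dense optical flow between two temporally adjacent frames."""
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        g1, g2, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
    )
    return flow  # shape (H, W, 2): per-pixel motion vectors
```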
3.2.2. Visual Artifacts within Video Frame
As can be noticed in the previous subsection, the
methods using temporal patterns across video frames
are mostly based on deep recurrent network models to
detect deepfake videos. This subsection investigates the
other approach that normally decomposes videos into
frames and explores visual artifacts within single frames
to obtain discriminant features. These features are then
distributed into either a deep or shallow classifier to dif-
ferentiate between fake and authentic videos. We thus
group methods in this subsection based on the types of
classifiers, i.e. either deep or shallow.
Deep classifiers. Deepfake videos are normally created
with limited resolutions, which require an affine face
warping approach (i.e., scaling, rotation and shearing)
to match the configuration of the original ones. Because
of the resolution inconsistency between the warped face
area and the surrounding context, this process leaves
artifacts that can be detected by CNN models such as
VGG16 [119], ResNet50, ResNet101 and ResNet152
[118]. A deep learning method to detect deepfakes
based on the artifacts observed during the face warp-
ing step of the deepfake generation algorithms was pro-
posed in [120]. The proposed method is evaluated on
two deepfake datasets, namely the UADFV and Deep-
fakeTIMIT. The UADFV dataset [121] contains 49 real
videos and 49 fake videos with 32,752 frames in to-
tal. The DeepfakeTIMIT dataset [69] includes a set of
low quality videos of size 64 x 64 and another set of high quality videos of size 128 x 128, with a total of 10,537 pristine images and 34,023 fabricated images extracted from 320 videos for each quality set. The performance of
the proposed method is compared with other prevalent
methods such as two deepfake detection MesoNet meth-
ods, i.e. Meso-4 and MesoInception-4 [102], HeadPose
[121], and the face tampering detection method two-
stream NN [122]. An advantage of the proposed method is that it does not need to generate deepfake videos as negative examples before training the detection models. Instead, the negative examples are generated dynamically by extracting the face region of the original image, aligning it into multiple scales, applying Gaussian blur to a randomly picked scaled image, and warping it back to the original image. This saves a large amount of time and computational resources compared to other methods, which require deepfakes to be generated in advance.
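This dynamic negative-example generation can be sketched in a few lines of OpenCV; the scale range and blur kernel follow the idea, not the exact settings, of [120] and are illustrative assumptions.

```python
import cv2
import numpy as np

def make_negative(face, scale=None):
    """Simulate deepfake warping artifacts without running a generator:
    blur the face at a random lower scale and warp it back."""
    h, w = face.shape[:2]
    scale = scale or np.random.uniform(0.25, 0.75)
    small = cv2.resize(face, (int(w * scale), int(h * scale)))
    small = cv2.GaussianBlur(small, (5, 5), 0)
    # Warping back leaves the resolution inconsistency a CNN can detect.
    return cv2.resize(small, (w, h))
```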
Nguyen et al. [123] proposed the use of capsule
networks for detecting manipulated images and videos.
The capsule network was initially introduced to address
limitations of CNNs when applied to inverse graphics
tasks, which aim to find physical processes used to pro-
duce images of the world [124]. The recent develop-
ment of capsule network based on dynamic routing al-
gorithm [125] demonstrates its ability to describe the hi-
erarchical pose relationships between object parts. This
development is employed as a component in a pipeline
for detecting fabricated images and videos as demon-
strated in Fig. 8. A dynamic routing algorithm is de-
ployed to route the outputs of the three capsules to the
output capsules through a number of iterations to sepa-
rate between fake and real images. The method is eval-
uated through four datasets covering a wide range of
forged image and video attacks. They include the well-
known Idiap Research Institute replay-attack dataset
[126], the deepfake face swapping dataset created by
Afchar et al. [102], the facial reenactment FaceForen-
sics dataset [127], produced by the Face2Face method
[52], and the fully computer-generated image dataset
generated by Rahmouni et al. [128]. The proposed
method yields the best performance compared to its
competing methods in all of these datasets. This shows
the potential of the capsule network in building a gen-
eral detection system that can work effectively for vari-
ous forged image and video attacks.
Fig. 8. The capsule network takes features obtained from the VGG-19 network [119] to distinguish fake images or videos from real ones (top). The preprocessing step detects the face region and scales it to a size of 128x128 before VGG-19 is used to extract latent features for the capsule network, which comprises three primary capsules and two output capsules, one for real and one for fake images (bottom). The statistical pooling constitutes an important part of the capsule network that deals with forgery detection [123].
Shallow classifiers. Deepfake detection methods mostly rely on the artifacts or inconsistency of intrinsic features between fake and real images or videos. Yang et al. [121] proposed a detection method by observing the differences between 3D head poses, comprising head orientation and position, which are estimated based on
68 facial landmarks of the central face region. The 3D head poses are examined because of a shortcoming in the deepfake face generation pipeline: the synthesized face region is spliced into the original image, which introduces head pose inconsistencies. The extracted features are fed into an SVM classifier to obtain the detection results. Experiments on two datasets show the strong performance of the proposed approach against its
competing methods. The first dataset, namely UADFV,
consists of 49 deep fake videos and their respective
real videos [121]. The second dataset comprises 241
real images and 252 deep fake images, which is a
subset of data used in the DARPA MediFor GAN
Image/Video Challenge [129]. Likewise, a method to
exploit artifacts of deepfakes and face manipulations
based on visual features of eyes, teeth and facial
contours was studied in [130]. The visual artifacts arise
from lacking global consistency, wrong or imprecise
estimation of the incident illumination, or imprecise
estimation of the underlying geometry. For deepfake detection, missing reflections and missing details in
the eye and teeth areas are exploited as well as texture
features extracted from the facial region based on facial
landmarks. Accordingly, the eye feature vector, teeth
feature vector and features extracted from the full-face
crop are used. After extracting the features, two
classifiers including logistic regression and small neural
network are employed to classify the deepfakes from
real videos. Experiments carried out on a video dataset
downloaded from YouTube show the best result of
0.851 in terms of the area under the receiver operating
characteristic curve. The proposed method, however, has the disadvantage of requiring images that meet certain prerequisites, such as open eyes or visible teeth.
The use of photo response non uniformity (PRNU)
analysis was proposed in [131] to detect deepfakes from
authentic ones. PRNU is a component of sensor pattern
noise, which is attributed to the manufacturing imper-
fection of silicon wafers and the inconsistent sensitivity
of pixels to light because of the variation of the physical
characteristics of the silicon wafers. The PRNU anal-
ysis is widely used in image forensics [132–136] and
advocated in [131] because the swapped face is
supposed to alter the local PRNU pattern in the facial
area of video frames. The videos are converted into
frames, which are cropped to the questioned facial re-
gion. The cropped frames are then separated sequen-
tially into eight groups where an average PRNU pattern
is computed for each group. Normalised cross correla-
tion scores are calculated for comparisons of PRNU pat-
terns among these groups. A test dataset was created,
consisting of 10 authentic videos and 16 manipulated
videos, where the fake videos were produced from the
genuine ones by the DeepFaceLab tool [38]. The anal-
ysis shows a significant statistical difference in terms
of mean normalised cross correlation scores between
deepfakes and the genuine. This analysis therefore sug-
gests that PRNU analysis has potential in deepfake detection, although a larger dataset would need to be tested.
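The group comparison reduces to computing normalised cross correlation between average PRNU patterns. A minimal sketch is shown below; the PRNU extraction itself (e.g., via denoising residuals) is assumed to be done upstream, as described in [131].

```python
import numpy as np

def ncc(p1, p2):
    """Normalised cross correlation between two mean-centred PRNU
    patterns; scores are compared across the eight frame groups."""
    a = p1 - p1.mean()
    b = p2 - p2.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```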
When seeing a video or image with suspicion, users
normally want to search for its origin. However, no feasible tool currently exists for this purpose. Hasan and Salah
[137] proposed the use of blockchain and smart con-
tracts to help users detect deepfake videos based on the
assumption that videos are only real when their sources
are traceable. Each video is associated with a smart con-
tract that links to its parent video and each parent video
has a link to its child in a hierarchical structure. Through
this chain, users can credibly trace back to the origi-
nal smart contract associated with pristine video even if
the video has been copied multiple times. An impor-
tant attribute of the smart contract is the unique hashes
of the interplanetary file system, which is used to store
video and its metadata in a decentralized and content-
addressable manner [138]. The smart contract’s key fea-
tures and functionalities are tested against several com-
mon security challenges such as distributed denial of
services, replay and man in the middle attacks to ensure
the solution meets security requirements. This ap-
proach is generic, and it can be extended to other types
of digital content, e.g., images, audio and manuscripts.
4. Discussions and Future Research Directions
With the support of deep learning, deepfakes can be created more easily than ever before. The spread of these fake
Table 2: Summary of prominent deepfake detection methods

- Eye blinking [114]. Technique: LRCN. Key features: uses LRCN to learn the temporal patterns of eye blinking; based on the observation that the blinking frequency in deepfakes is much lower than normal. Deals with: videos. Datasets: 49 interview and presentation videos, and their corresponding generated deepfakes.
- Intra-frame and temporal inconsistencies [112]. Technique: CNN and LSTM. Key features: a CNN is employed to extract frame-level features, which are passed to an LSTM to construct a sequence descriptor used for classification. Deals with: videos. Datasets: a collection of 600 videos obtained from multiple websites.
- Using face warping artifacts [120]. Technique: VGG16 [119], ResNet models [118]. Key features: artifacts are discovered using CNN models based on the resolution inconsistency between the warped face area and the surrounding context. Deals with: videos. Datasets: UADFV [121], containing 49 real videos and 49 fake videos with 32,752 frames in total; DeepfakeTIMIT [69].
- MesoNet [102]. Technique: CNN. Key features: two deep networks, Meso-4 and MesoInception-4, are introduced to examine deepfake videos at the mesoscopic analysis level; accuracies obtained on the deepfake and FaceForensics datasets are 98% and 95%, respectively. Deals with: videos. Datasets: two datasets, a deepfake one constituted from online videos and the FaceForensics one created by the Face2Face approach [52].
- Eye, teeth and facial texture [130]. Technique: logistic regression and neural network (NN). Key features: exploits facial texture differences, and missing reflections and details in the eye and teeth areas of deepfakes; logistic regression and an NN are used for classification. Deals with: videos. Datasets: a video dataset downloaded from YouTube.
- Spatio-temporal features with RCN [103]. Technique: RCN. Key features: temporal discrepancies across frames are explored using an RCN that integrates the DenseNet convolutional network [61] and gated recurrent unit cells [111]. Deals with: videos. Datasets: FaceForensics++, including 1,000 videos [105].
- Spatio-temporal features with LSTM [139]. Technique: convolutional bidirectional recurrent LSTM network. Key features: an XceptionNet CNN is used for facial feature extraction while audio embeddings are obtained by stacking multiple convolution modules; two loss functions, cross-entropy and Kullback-Leibler divergence, are used. Deals with: videos. Datasets: FaceForensics++ [105] and Celeb-DF (5,639 deepfake videos) [107], and the ASVSpoof 2019 Logical Access audio dataset [140].
- Analysis of PRNU [131]. Technique: PRNU. Key features: analysis of the noise patterns of the light-sensitive sensors of digital cameras due to their factory defects; explores the differences in PRNU patterns between authentic and deepfake videos, because face swapping is believed to alter the local PRNU patterns. Deals with: videos. Datasets: created by the authors, including 10 authentic and 16 deepfake videos using DeepFaceLab [38].
- Phoneme-viseme mismatches [141]. Technique: CNN. Key features: exploits the mismatches between the dynamics of the mouth shape, i.e., visemes, and a spoken phoneme; focuses on sounds associated with the M, B and P phonemes, as they require complete mouth closure while deepfakes often incorrectly synthesize it. Deals with: videos. Datasets: four in-the-wild lip-sync deepfakes from Instagram and YouTube (www.instagram.com/bill_posters_uk and youtu.be/VWMEDacz3L4); others created using the Audio-to-Video (A2V) [68] and Text-to-Video (T2V) [142] synthesis techniques.
- Using the attribution-based confidence (ABC) metric [143]. Technique: ResNet50 model [118], pre-trained on VGGFace2 [144]. Key features: the ABC metric [145] is used to detect deepfake videos without access to training data; ABC values obtained for original videos are greater than 0.94, while deepfakes yield low ABC values. Deals with: videos. Datasets: VidTIMIT and two other original datasets obtained from COHFACE [146] (https://www.idiap.ch/dataset/cohface) and from YouTube; the COHFACE and YouTube originals are used to generate two deepfake datasets via the commercial website https://deepfakesweb.com, and a further deepfake dataset is DeepfakeTIMIT [147].
- Using appearance and behaviour [148]. Technique: rules based on facial and behavioural features. Key features: temporal, behavioural biometrics based on facial expressions and head movements are learned using ResNet-101 [118], while a static facial biometric is obtained using VGG [65]. Deals with: videos. Datasets: the world leaders dataset [1], FaceForensics++ [105], the Google/Jigsaw deepfake detection dataset [106], DFDC [109] and Celeb-DF [107].
- FakeCatcher [149]. Technique: CNN. Key features: extracts biological signals in portrait videos and uses them as an implicit descriptor of authenticity, because they are not spatially and temporally well-preserved in deepfakes. Deals with: videos. Datasets: UADFV [121], FaceForensics [127], FaceForensics++ [105], Celeb-DF [107], and a new dataset of 142 videos, independent of the generative model, resolution, compression, content, and context.
- Emotion audio-visual affective cues [150]. Technique: Siamese network [86]. Key features: modality and emotion embedding vectors for the face and speech are extracted for deepfake detection. Deals with: videos. Datasets: DeepfakeTIMIT [147] and DFDC [109].
- Head poses [121]. Technique: SVM. Key features: features are extracted using 68 landmarks of the face region; an SVM classifies using the extracted features. Deals with: videos/images. Datasets: UADFV, consisting of 49 deepfake videos and their respective real videos; 241 real images and 252 deepfake images from the DARPA MediFor GAN Image/Video Challenge.
- Capsule-forensics [123]. Technique: capsule networks. Key features: latent features extracted by the VGG-19 network [119] are fed into the capsule network for classification; a dynamic routing algorithm [125] routes the outputs of three convolutional capsules to two output capsules, one for fake and one for real images, through a number of iterations. Deals with: videos/images. Datasets: four datasets, namely the Idiap Research Institute replay-attack dataset [126], the deepfake face swapping dataset of [102], the facial reenactment FaceForensics dataset [127], and the fully computer-generated image set of [128].
- Preprocessing combined with a deep network [74]. Technique: DCGAN, WGAN-GP and PGGAN. Key features: enhances the generalization ability of deep learning models to detect GAN-generated images; removes low-level features of fake images; forces deep networks to focus more on pixel-level similarity between fake and real images to improve generalization. Deals with: images. Datasets: real dataset CelebA-HQ [62], including high-quality face images of 1024x1024 resolution; fake datasets generated by DCGAN [88], WGAN-GP [90] and PGGAN [62].
- Analyzing convolutional traces [151]. Technique: KNN, SVM, and linear discriminant analysis (LDA). Key features: uses an expectation-maximization algorithm to extract local features pertaining to the convolutional generative process of GAN-based image deepfake generators. Deals with: images. Datasets: authentic images from CelebA and corresponding deepfakes created by five different GANs (group-wise deep whitening-and-coloring transformation GDWCT [152], StarGAN [153], AttGAN [154], StyleGAN [51], StyleGAN2 [155]).
- Bag of words and shallow classifiers [78]. Technique: SVM, RF, MLP. Key features: extracts discriminant features using the bag of words method and feeds these features into SVM, RF and MLP classifiers for binary classification: innocent vs fabricated. Deals with: images. Datasets: the well-known LFW face database [156], containing 13,223 images with a resolution of 250x250.
- Pairwise learning [85]. Technique: CNN concatenated to a CFFN. Key features: a two-phase procedure: feature extraction using the CFFN based on the Siamese network architecture [86], and classification using a CNN. Deals with: images. Datasets: face images, with real ones from CelebA [87] and fake ones generated by DCGAN [88], WGAN [89], WGAN-GP [90], least squares GAN [91], and PGGAN [62]; general images, with real ones from ILSVRC12 [92] and fake ones generated by BIGGAN [93], self-attention GAN [94] and spectral normalization GAN [95].
- Defenses against adversarial perturbations in deepfakes [157]. Technique: VGG [65] and ResNet [118]. Key features: introduces adversarial perturbations to enhance deepfakes and fool deepfake detectors; improves the accuracy of deepfake detectors using Lipschitz regularization and deep image prior techniques. Deals with: images. Datasets: 5,000 real images from CelebA [87] and 5,000 fake images created by the "Few-Shot Face Translation GAN" method [158].
- Face X-ray [159]. Technique: CNN. Key features: tries to locate the blending boundary between the target and original faces instead of capturing the synthesized artifacts of specific manipulations; can be trained without fake images. Deals with: images. Datasets: FaceForensics++ [105], DeepfakeDetection (DFD) [106], DFDC [109] and Celeb-DF [107].
- Using common artifacts of CNN-generated images [160]. Technique: ResNet-50 [118] pre-trained on ImageNet [92]. Key features: trains the classifier on a large number of fake images generated by a high-performing unconditional GAN model, i.e., PGGAN [62], and evaluates how well the classifier generalizes to other CNN-synthesized images. Deals with: images. Datasets: a new dataset of CNN-generated images, namely ForenSynths, consisting of synthesized images from 11 models such as StyleGAN [51], super-resolution methods [161] and FaceForensics++ [105].
- Using convolutional traces on GAN-based images [162]. Technique: KNN, SVM, and LDA. Key features: trains the expectation-maximization algorithm [163] to detect and extract discriminative features via a fingerprint that represents the convolutional traces left by GANs during image generation. Deals with: images. Datasets: a dataset of images generated by ten GAN models, including CycleGAN [164], StarGAN [153], AttGAN [154], GDWCT [152], StyleGAN [51], StyleGAN2 [155], PGGAN [62], FaceForensics++ [105], IMLE [165], and SPADE [42].
- Using deep features extracted by a CNN [100]. Technique: a new CNN model, namely SCnet. Key features: the CNN-based SCnet is able to automatically learn high-level forensics features of image data thanks to a hierarchical feature extraction block formed by stacking four convolutional layers. Deals with: images. Datasets: a dataset of 321,378 face images, created by applying the Glow model [101] to the CelebA face image dataset [87].
contents is also quicker thanks to the development of social media platforms [166]. Sometimes deepfakes do not need to be spread to a massive audience to cause detrimental effects. People who create deepfakes with malicious purposes only need to deliver them to target audiences as part of their sabotage strategy, without using social media. For example, this approach can be utilized by intelligence services trying to influence decisions made by important people such as politicians, leading to national and international security threats [167]. In response to this alarming problem, the research community has focused on developing deepfake detection algorithms, and numerous results have been reported. This paper has reviewed the state-of-the-art methods, and a summary of typical approaches is provided in Table 2. It is noticeable that a battle is growing between those who use advanced machine learning to create deepfakes and those who make an effort to detect them.
The quality of deepfakes has been increasing, and the
performance of detection methods needs to improve
accordingly. The inspiration is that what AI has broken
can also be fixed by AI [168]. Detection methods are
still at an early stage: various approaches have been
proposed and evaluated, but on fragmented datasets. One
way to improve their performance is to create a
continuously updated benchmark dataset of deepfakes
against which the ongoing development of detection
methods can be validated. This would also facilitate the
training of detection models, especially deep learning
models, which require large training sets [108].
Improving the performance of deepfake detection
methods is important, especially in cross-forgery and
cross-dataset scenarios. Most detection models are
designed and evaluated in same-forgery, in-dataset
experiments, which do not demonstrate their
generalization capability. Some previous studies have
addressed this issue,
e.g., in [104, 116, 160, 169, 170], but more work needs
to be done in this direction. A model trained on a
specific forgery needs to work against other, unknown
ones, because the potential deepfake types are normally
not known in real-world scenarios. Likewise, current
detection methods mostly focus on drawbacks of the
deepfake generation pipelines, i.e., finding weaknesses
of the competitors in order to attack them. This kind
of information and knowledge is not always available in
adversarial environments, where attackers commonly
attempt not to reveal their deepfake creation
technologies. Recent works on adversarial perturbation
attacks that fool DNN-based detectors make the deepfake
detection task even more difficult [157, 171-174]. These
are real challenges for detection method development,
and future studies need to focus on introducing more
robust, scalable and generalizable methods.
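To make this threat model concrete, the following is a minimal sketch of a gradient-sign (FGSM-style) perturbation of the kind studied in [157, 171-174]. The detector interface, the class ordering and the eps budget are illustrative assumptions, not the setup of any cited work.

```python
import torch
import torch.nn.functional as F

def fgsm_evade(detector, fake_image, eps=2.0 / 255):
    """Perturb a fake image so a binary deepfake detector scores it as real.

    detector:   any torch module mapping a (1, 3, H, W) batch to 2-class
                logits (index 0 = real, index 1 = fake) -- assumed interface.
    fake_image: float tensor in [0, 1] of shape (1, 3, H, W).
    eps:        L-infinity perturbation budget.
    """
    x = fake_image.clone().detach().requires_grad_(True)
    logits = detector(x)
    # This loss is low when the detector predicts class 0 ("real"),
    # so descending it pushes the prediction away from "fake".
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()
    # One signed-gradient step within the eps budget, clipped back
    # to the valid pixel range.
    x_adv = (x - eps * x.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```

Defenses such as the Lipschitz regularization explored in [157] aim to bound a detector's sensitivity to exactly this kind of small, visually imperceptible input change.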
Another research direction is to integrate detection
methods into distribution platforms such as social media
to increase their effectiveness in dealing with the
widespread impact of deepfakes. A screening or filtering
mechanism based on effective detection methods can be
implemented on these platforms to ease deepfake
detection [167]. Legal requirements can also be imposed
on the tech companies that own these platforms to remove
deepfakes quickly and reduce their impact. In addition,
watermarking tools can be integrated into the devices
people use to create digital contents, producing
immutable metadata that stores originality details, such
as the time and location at which multimedia contents
were captured, together with an attestation that they
have not been tampered with [167]. This integration is
difficult to implement, but a solution could be the
disruptive blockchain technology. Blockchains have been
used effectively in many areas, yet very few studies so
far have addressed the deepfake detection problem with
this technology. As a blockchain can create a chain of
unique, unchangeable blocks of metadata, it is a
promising tool for digital provenance. Applying
blockchain technologies to this problem has demonstrated
initial results [137], but this research direction is far from mature.
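As a toy illustration of this "chain of unique, unchangeable blocks of metadata" idea (a bare hash chain only; the smart-contract system in [137] is considerably more elaborate, and the field names below are hypothetical):

```python
import hashlib
import json
import time

def make_block(prev_hash, media_bytes, location):
    """Append-only provenance record: any later change to the media bytes
    or to an earlier block breaks every subsequent hash link."""
    record = {
        "prev_hash": prev_hash,                        # link to previous block
        "media_hash": hashlib.sha256(media_bytes).hexdigest(),
        "timestamp": time.time(),                      # capture time
        "location": location,                          # capture location
    }
    block_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record, block_hash

def verify_chain(blocks):
    """Recompute every hash link; returns False if anything was tampered with."""
    prev = "0" * 64  # genesis marker
    for record, stored_hash in blocks:
        if record["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if recomputed != stored_hash:
            return False
        prev = stored_hash
    return True
```

Under this scheme, a clip whose hash is absent from such a chain, or whose chain fails verification, carries no provenance and can be treated with suspicion.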
Using detection methods to spot deepfakes is crucial,
but understanding the real intent of those who publish
them is even more important. This requires the judgement
of users based on the social context in which a deepfake
is discovered, e.g., who distributed it and what they
said about it [175]. This is critical because deepfakes
are becoming more and more photorealistic, and detection
software can be expected to lag behind deepfake creation
technology. A study on the social context of deepfakes
to assist users in making such judgements is thus worth
performing.
Videos and photographs have been widely used as
evidence in police investigations and judicial cases.
They may be introduced as evidence in a court of law by
digital media forensics experts who have backgrounds in
computing or law enforcement and experience in
collecting, examining and analysing digital information.
Machine learning and AI technologies, however, can now
be used to modify these digital contents, so experts'
opinions may no longer suffice to authenticate such
evidence, because even experts are unable to discern
manipulated contents. This aspect needs to be taken into
account in today's courtrooms, where images and videos
are used as evidence to convict perpetrators, because of
the wide range of available digital manipulation methods
[176]. Digital media forensics results therefore must be
proved valid and reliable before they can be used in
court. This requires careful documentation of each step
of the forensics process and of how the results were
reached. Machine learning and AI algorithms can support
the determination of the authenticity of digital media
and have obtained accurate and reliable results, e.g.,
[177, 178], but most of these algorithms are
unexplainable. This creates a huge hurdle for the
application of AI to forensics problems: forensics
experts often lack expertise in computer algorithms,
while computer professionals cannot properly explain the
results either, because most of these algorithms are
black-box models [179]. This is all the more critical
because the most recent and most accurate models are
based on deep learning methods with vast numbers of
neural network parameters. Researchers have recently
attempted to create white-box, explainable detection
methods. An example is the approach proposed by Giudice
et al. [180], which uses discrete cosine transform
statistics to detect anomalous GAN-specific frequencies
and thereby differentiate real images from deepfakes. By
analysing particular frequency statistics, the method
can mathematically explain whether a piece of multimedia
content is a deepfake and why. More research must be
conducted in this area; explainable AI in computer
vision is therefore a research direction that needs to
be promoted in order to utilize the advances and
advantages of AI and machine
learning in digital media forensics.
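As a rough sketch of this frequency-statistics idea (not the exact estimator of [180], which models the distribution of DCT coefficients; the 8x8 block size, the variance summary and the 3-sigma threshold below are illustrative assumptions), one can compare block-DCT statistics of a test image against reference statistics gathered from authentic images:

```python
import numpy as np
from scipy.fftpack import dct

def block_dct_stats(gray_image, block=8):
    """Collect per-frequency variances of 8x8 block DCT coefficients.

    gray_image: 2-D float array (grayscale). Returns a (block, block) array
    whose entry (u, v) is the variance of DCT coefficient (u, v) across all
    blocks. GAN upsampling tends to leave anomalous energy at specific
    frequencies, which shows up as deviations in these statistics.
    """
    h, w = gray_image.shape
    h, w = h - h % block, w - w % block
    coeffs = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = gray_image[i:i + block, j:j + block]
            # 2-D type-II DCT: transform rows, then columns.
            d = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
            coeffs.append(d)
    return np.stack(coeffs).var(axis=0)

def anomaly_score(test_stats, real_mean, real_std):
    """Count frequencies deviating strongly (hypothetical 3-sigma threshold)
    from reference statistics computed over a corpus of authentic images."""
    z = np.abs(test_stats - real_mean) / (real_std + 1e-12)
    return int((z > 3).sum())
```

Because each frequency's contribution to the score can be inspected directly, a decision made this way can be explained to a non-specialist, which is precisely the property that black-box deep detectors lack.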
5. Conclusions
Deepfakes have begun to erode people's trust in media
contents, as seeing them is no longer commensurate with
believing in them. Deepfakes could cause distress and
negative effects to those targeted, heighten
disinformation and hate speech, and could even stimulate
political tension, inflame the public, or incite
violence and war. This is especially critical nowadays,
as the technologies for creating deepfakes are
increasingly accessible and social media platforms can
spread fake contents quickly. This survey provides a
timely overview of deepfake creation and detection
methods and presents a broad discussion of challenges,
potential trends and future directions in this area. The
study will therefore be valuable for the artificial
intelligence research community in developing effective
methods for tackling deepfakes.
Declaration of Competing Interest
The authors declare no conflict of interest.
References
[1] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki
Nagano, and Hao Li. Protecting world leaders against deep
fakes. In Computer Vision and Pattern Recognition Workshops,
volume 1, pages 38–45, 2019.
[2] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-
Antoine Manzagol. Extracting and composing robust features
with denoising autoencoders. In Proceedings of the 25th Inter-
national Conference on Machine learning, pages 1096–1103,
2008.
[3] Diederik P Kingma and Max Welling. Auto-encoding varia-
tional Bayes. arXiv preprint arXiv:1312.6114, 2013.
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. Advances in Neu-
ral Information Processing Systems, 27:2672–2680, 2014.
[5] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Good-
fellow, and Brendan Frey. Adversarial autoencoders. arXiv
preprint arXiv:1511.05644, 2015.
[6] Ayush Tewari, Michael Zollhoefer, Florian Bernard, Pablo
Garrido, Hyeongwoo Kim, Patrick Perez, and Christian
Theobalt. High-fidelity monocular face reconstruction based
on an unsupervised model-based face autoencoder. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 42
(2):357–370, 2018.
[7] Jiacheng Lin, Yang Li, and Guanci Yang. FPGAN: Face de-
identification method with generative adversarial networks for
social robots. Neural Networks, 133:132–147, 2021.
[8] Ming-Yu Liu, Xun Huang, Jiahui Yu, Ting-Chun Wang, and
Arun Mallya. Generative adversarial networks for image and
video synthesis: Algorithms and applications. Proceedings of
the IEEE, 109(5):839–862, 2021.
[9] Siwei Lyu. Detecting ’deepfake’ videos in the blink of an eye.
http://theconversation.com/detecting-deepfake-videos-in-the-blink-of-an-eye-101072, August 2018.
[10] Bloomberg. How faking videos became easy and why that’s
so scary. https://fortune.com/2018/09/11/deep-fakes-obama-video/, September 2018.
[11] Robert Chesney and Danielle Citron. Deepfakes and the new
disinformation war: The coming age of post-truth geopolitics.
Foreign Affairs, 98:147, 2019.
[12] T. Hwang. Deepfakes: A grounded threat assessment. Tech-
nical report, Centre for Security and Emerging Technologies,
Georgetown University, 2020.
[13] Xinyi Zhou and Reza Zafarani. A survey of fake news: Fun-
damental theories, detection methods, and opportunities. ACM
Computing Surveys (CSUR), 53(5):1–40, 2020.
[14] Rohit Kumar Kaliyar, Anurag Goswami, and Pratik Narang.
Deepfake: improving fake news detection using tensor
decomposition-based deep neural network. The Journal of Su-
percomputing, 77(2):1015–1037, 2021.
[15] Bin Guo, Yasan Ding, Lina Yao, Yunji Liang, and Zhiwen Yu.
The future of false information detection on social media: New
perspectives and trends. ACM Computing Surveys (CSUR), 53
(4):1–36, 2020.
[16] Patrick Tucker. The newest AI-enabled weapon: ‘deep-faking’
photos of the earth. https://www.defenseone.com/technology/2019/03/next-phase-ai-deep-faking-whole-world-and-china-ahead/155944/, March 2019.
[17] T Fish. Deep fakes: AI-manipulated media will be
‘weaponised’ to trick military. https://www.express.co.uk/news/science/1109783/deep-fakes-ai-artificial-intelligence-photos-video-weaponised-china, April 2019.
[18] B Marr. The best (and scariest) examples of AI-enabled deep-
fakes. https://www.forbes.com/sites/bernardmarr/2019/07/22/the-best-and-scariest-examples-of-ai-enabled-deepfakes/, July 2019.
[19] Yisroel Mirsky and Wenke Lee. The creation and detection of
deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1):
1–41, 2021.
[20] Luisa Verdoliva. Media forensics and deepfakes: an overview.
IEEE Journal of Selected Topics in Signal Processing, 14(5):
910–932, 2020.
[21] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Vic-
tor Lempitsky. Few-shot adversarial learning of realistic neural
talking head models. In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 9459–9468,
2019.
[22] J Damiani. A voice deepfake was used to scam a ceo
out of $243,000. https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-ceo-out-of-243000/, September 2019.
[23] S Samuel. A guy made a deepfake app to turn photos of women
into nudes. It didn’t go well. https://www.vox.com/2019/6/27/18761639/ai-deepfake-deepnude-app-nude-women-porn, June 2019.
[24] The Guardian. Chinese deepfake app Zao sparks privacy
row after going viral. https://www.theguardian.com/technology/2019/sep/02/chinese-face-swap-app-zao-triggers-privacy-fears-viral, September 2019.
[25] Siwei Lyu. Deepfake detection: Current challenges and next
steps. In IEEE International Conference on Multimedia &
Expo Workshops (ICMEW), pages 1–6. IEEE, 2020.
[26] Luca Guarnera, Oliver Giudice, Cristina Nastasi, and Sebas-
tiano Battiato. Preliminary forensics analysis of deepfake im-
ages. In AEIT International Annual Conference (AEIT), pages
1–6. IEEE, 2020.
[27] Mousa Tayseer Jafar, Mohammad Ababneh, Mohammad Al-
Zoube, and Ammar Elhassan. Forensics and analysis of deep-
fake videos. In The 11th International Conference on Infor-
mation and Communication Systems (ICICS), pages 053–058.
IEEE, 2020.
[28] Loc Trinh, Michael Tsang, Sirisha Rambhatla, and Yan Liu.
Interpretable and trustworthy deepfake detection via dynamic
prototypes. In Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision, pages 1973–1983,
2021.
[29] Mohammed Akram Younus and Taha Mohammed Hasan. Ef-
fective and fast deepfake detection method based on Haar
wavelet transform. In International Conference on Computer
Science and Software Engineering (CSASE), pages 186–190.
IEEE, 2020.
M Turek. Media Forensics (MediFor). https://www.darpa.mil/program/media-forensics, January 2019.
[31] M Schroepfer. Creating a data set and a challenge for deep-
fakes. https://ai.facebook.com/blog/deepfake-detection-challenge, September 2019.
[32] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez,
Aythami Morales, and Javier Ortega-Garcia. Deepfakes and
beyond: A survey of face manipulation and fake detection. In-
formation Fusion, 64:131–148, 2020.
[33] Abhijith Punnappurath and Michael S Brown. Learning raw
image reconstruction-aware deep image compressors. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 42
(4):1013–1019, 2019.
[34] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro
Katto. Energy compaction-based image compression using
convolutional autoencoder. IEEE Transactions on Multimedia,
22(4):860–873, 2019.
Jan Chorowski, Ron J Weiss, Samy Bengio, and Aäron van den Oord. Unsupervised speech representation learning using
WaveNet autoencoders. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 27(12):2041–2053, 2019.
Faceswap: Deepfakes software for all. https://github.com/deepfakes/faceswap.
FakeApp 2.2.0. https://www.malavida.com/en/soft/fakeapp/.
DeepFaceLab. https://github.com/iperov/DeepFaceLab.
[39] DFaker. https://github.com/dfaker/df.
DeepFake tf: Deepfake based on tensorflow. https://github.com/StromWine/DeepFake_tf.
[41] Ming-Yu Liu, Xun Huang, Arun Mallya, Tero Karras, Timo
Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsupervised
image-to-image translation. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 10551–
10560, 2019.
[42] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan
Zhu. Semantic image synthesis with spatially-adaptive nor-
malization. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 2337–2346,
2019.
DeepFaceLab: Explained and usage tutorial. https://mrdeepfakes.com/forums/thread-deepfacelab-explained-and-usage-tutorial.
DSSIM. https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/losses/dssim.py.
[45] Alexandros Lattas, Stylianos Moschoglou, Baris Gecer,
Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh,
and Stefanos Zafeiriou. AvatarMe: Realistically renderable
3D facial reconstruction “in-the-wild”. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 760–769, 2020.
[46] Sungjoo Ha, Martin Kersner, Beomsu Kim, Seokjun Seo, and
Dongyoung Kim. Marionette: Few-shot face reenactment pre-
serving identity of unseen targets. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 10893–
10900, 2020.
[47] Yu Deng, Jiaolong Yang, Dong Chen, Fang Wen, and Xin
Tong. Disentangled and controllable face image generation
via 3D imitative-contrastive learning. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 5154–5163, 2020.
[48] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian
Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer,
and Christian Theobalt. StyleRig: Rigging StyleGAN for 3D
control over portrait images. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 6142–6151, 2020.
[49] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang
Wen. FaceShifter: Towards high fidelity and occlusion aware
face swapping. arXiv preprint arXiv:1912.13457, 2019.
[50] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject
agnostic face swapping and reenactment. In Proceedings of
the IEEE/CVF International Conference on Computer Vision,
pages 7184–7193, 2019.
[51] Tero Karras, Samuli Laine, and Timo Aila. A style-based gen-
erator architecture for generative adversarial networks. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 4401–4410, 2019.
[52] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian
Theobalt, and Matthias Nießner. Face2Face: Real-time face
capture and reenactment of RGB videos. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 2387–2395, 2016.
Justus Thies, Michael Zollhöfer, and Matthias Nießner. De-
ferred neural rendering: Image synthesis using neural textures.
ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.
[54] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li,
and Linjie Luo. Transformable bottleneck networks. In Pro-
ceedings of the IEEE/CVF International Conference on Com-
puter Vision, pages 7648–7657, 2019.
[55] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A
Efros. Everybody dance now. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 5933–
5942, 2019.
[56] Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian
Theobalt, and Matthias Nießner. Neural voice puppetry:
Audio-driven facial reenactment. In European Conference on
Computer Vision, pages 716–731. Springer, 2020.
Keras-VGGFace: VGGFace implementation with Keras framework. https://github.com/rcmalli/keras-vggface.
Faceswap-GAN. https://github.com/shaoanlu/faceswap-GAN.
FaceNet. https://github.com/davidsandberg/facenet.
CycleGAN. https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.
[61] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil-
ian Q Weinberger. Densely connected convolutional networks.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4700–4708, 2017.
[62] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of GANs for improved quality, stability,
and variation. arXiv preprint arXiv:1710.10196, 2017.
Pavel Korshunov and Sébastien Marcel. Vulnerability assess-
ment and detection of deepfake videos. In 2019 International
Conference on Biometrics (ICB), pages 1–6. IEEE, 2019.
VidTIMIT database. http://conradsanderson.id.au/vidtimit/.
[65] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman.
Deep face recognition. In Proceedings of the British Machine
Vision Conference (BMVC), pages 41.1–41.12, 2015.
[66] Florian Schroff, Dmitry Kalenichenko, and James Philbin.
FaceNet: A unified embedding for face recognition and clus-
tering. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 815–823, 2015.
[67] Joon Son Chung, Andrew Senior, Oriol Vinyals, and An-
drew Zisserman. Lip reading sentences in the wild. In 2017
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 3444–3453. IEEE, 2017.
[68] Supasorn Suwajanakorn, Steven M Seitz, and Ira
Kemelmacher-Shlizerman. Synthesizing Obama: learn-
ing lip sync from audio. ACM Transactions on Graphics
(ToG), 36(4):1–13, 2017.
Pavel Korshunov and Sébastien Marcel. Speaker inconsistency
detection in tampered video. In The 26th European Signal
Processing Conference (EUSIPCO), pages 2375–2379. IEEE,
2018.
Javier Galbally and Sébastien Marcel. Face anti-spoofing
based on general image quality assessment. In The 22nd In-
ternational Conference on Pattern Recognition, pages 1173–
1178. IEEE, 2014.
[71] Robert Chesney and Danielle Keats Citron. Deep fakes: A
looming challenge for privacy, democracy, and national secu-
rity. California Law Review, 107, 2019.
[72] Oscar de Lima, Sean Franklin, Shreshtha Basu, Blake
Karwoski, and Annet George. Deepfake detection us-
ing spatiotemporal convolutional networks. arXiv preprint
arXiv:2006.14749, 2020.
[73] Irene Amerini and Roberto Caldelli. Exploiting prediction er-
ror inconsistencies through LSTM-based classifiers to detect
deepfake videos. In Proceedings of the 2020 ACM Workshop
on Information Hiding and Multimedia Security, pages 97–
102, 2020.
[74] Xinsheng Xuan, Bo Peng, Wei Wang, and Jing Dong. On
the generalization of GAN image forensics. In Chinese Con-
ference on Biometric Recognition, pages 134–141. Springer,
2019.
[75] Pengpeng Yang, Rongrong Ni, and Yao Zhao. Recapture image
forensics based on laplacian convolutional neural networks. In
International Workshop on Digital Watermarking, pages 119–
128. Springer, 2016.
[76] Belhassen Bayar and Matthew C Stamm. A deep learning ap-
proach to universal image manipulation detection using a new
convolutional layer. In Proceedings of the 4th ACM Workshop
on Information Hiding and Multimedia Security, pages 5–10,
2016.
[77] Yinlong Qian, Jing Dong, Wei Wang, and Tieniu Tan. Deep
learning for steganalysis via convolutional neural networks. In
Media Watermarking, Security, and Forensics, volume 9409,
page 94090J, 2015.
[78] Ying Zhang, Lilei Zheng, and Vrizlynn LL Thing. Automated
face swapping and its detection. In The 2nd International Con-
ference on Signal and Image Processing (ICSIP), pages 15–19.
IEEE, 2017.
[79] Xin Wang, Nicolas Thome, and Matthieu Cord. Gaze latent
support vector machine for image classification improved by
weakly supervised region selection. Pattern Recognition, 72:
59–71, 2017.
[80] Shuang Bai. Growing random forest on deep convolutional
neural networks for scene categorization. Expert Systems with
Applications, 71:279–287, 2017.
[81] Lilei Zheng, Stefan Duffner, Khalid Idrissi, Christophe Gar-
cia, and Atilla Baskurt. Siamese multi-layer perceptrons for
dimensionality reduction and face identification. Multimedia
Tools and Applications, 75(9):5055–5073, 2016.
[82] Sakshi Agarwal and Lav R Varshney. Limits of deepfake
detection: A robust estimation viewpoint. arXiv preprint
arXiv:1905.03493, 2019.
[83] Ueli M Maurer. Authentication theory and hypothesis testing.
IEEE Transactions on Information Theory, 46(4):1350–1356,
2000.
[84] Iryna Korshunova, Wenzhe Shi, Joni Dambre, and Lucas Theis.
Fast face-swap using convolutional neural networks. In Pro-
ceedings of the IEEE International Conference on Computer
Vision, pages 3677–3685, 2017.
[85] Chih-Chung Hsu, Yi-Xiu Zhuang, and Chia-Yen Lee. Deep
fake image detection based on pairwise learning. Applied Sci-
ences, 10(1):370, 2020.
[86] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a
similarity metric discriminatively, with application to face ver-
ification. In IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’05), volume 1, pages
539–546. IEEE, 2005.
[87] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep
learning face attributes in the wild. In Proceedings of the IEEE
International Conference on Computer Vision, pages 3730–
3738, 2015.
[88] Alec Radford, Luke Metz, and Soumith Chintala. Unsuper-
vised representation learning with deep convolutional genera-
tive adversarial networks. arXiv preprint arXiv:1511.06434,
2015.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasser-
stein generative adversarial networks. In International Confer-
ence on Machine Learning, pages 214–223. PMLR, 2017.
[90] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Du-
moulin, and Aaron Courville. Improved training of Wasser-
stein GANs. arXiv preprint arXiv:1704.00028, 2017.
[91] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen
Wang, and Stephen Paul Smolley. Least squares generative
adversarial networks. In Proceedings of the IEEE International
Conference on Computer Vision, pages 2794–2802, 2017.
[92] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, et al. ImageNet large scale
visual recognition challenge. International Journal of Com-
puter Vision, 115(3):211–252, 2015.
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large
scale GAN training for high fidelity natural image synthesis.
arXiv preprint arXiv:1809.11096, 2018.
[94] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augus-
tus Odena. Self-attention generative adversarial networks. In
International Conference on Machine Learning, pages 7354–
7363. PMLR, 2019.
[95] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and
Yuichi Yoshida. Spectral normalization for generative adver-
sarial networks. arXiv preprint arXiv:1802.05957, 2018.
[96] Hany Farid. Image forgery detection. IEEE Signal Processing
Magazine, 26(2):16–25, 2009.
[97] Huaxiao Mo, Bolin Chen, and Weiqi Luo. Fake faces iden-
tification via convolutional neural network. In Proceedings of
the 6th ACM Workshop on Information Hiding and Multimedia
Security, pages 43–47, 2018.
[98] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and
Luisa Verdoliva. Detection of GAN-generated fake images
over social networks. In 2018 IEEE Conference on Multimedia
Information Processing and Retrieval (MIPR), pages 384–389.
IEEE, 2018.
[99] Chih-Chung Hsu, Chia-Yen Lee, and Yi-Xiu Zhuang. Learning
to detect fake face images in the wild. In 2018 International
Symposium on Computer, Consumer and Control (IS3C), pages
388–391. IEEE, 2018.
[100] Zhiqing Guo, Lipin Hu, Ming Xia, and Gaobo Yang. Blind
detection of glow-based facial forgery. Multimedia Tools and
Applications, 80(5):7687–7710, 2021.
[101] Diederik P Kingma and Prafulla Dhariwal. Glow: genera-
tive flow with invertible 1×1 convolutions. In Proceedings
of the 32nd International Conference on Neural Information
Processing Systems, pages 10236–10245, 2018.
[102] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao
Echizen. MesoNet: a compact facial video forgery detection
network. In 2018 IEEE International Workshop on Information
Forensics and Security (WIFS), pages 1–7. IEEE, 2018.
[103] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAl-
mageed, Iacopo Masi, and Prem Natarajan. Recurrent con-
volutional strategies for face manipulation detection in videos.
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, 3(1):80–87, 2019.
[104] Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun
Xiong, and Wei Xia. Learning self-consistency for deepfake
detection. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 15023–15033, 2021.
Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian
Riess, Justus Thies, and Matthias Nießner. FaceForensics++:
Learning to detect manipulated facial images. In Proceedings
of the IEEE/CVF International Conference on Computer Vi-
sion, pages 1–11, 2019.
[106] Nick Dufour and Andrew Gully. Contributing data to deepfake
detection research. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html, September 2019.
[107] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei
Lyu. Celeb-DF: A large-scale challenging dataset for deep-
fake forensics. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 3207–3216,
2020.
[108] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu,
Russ Howes, Menglin Wang, and Cristian Canton Ferrer.
The deepfake detection challenge dataset. arXiv preprint
arXiv:2006.07397, 2020.
[109] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram,
and Cristian Canton Ferrer. The deepfake detection challenge
(DFDC) preview dataset. arXiv preprint arXiv:1910.08854,
2019.
[110] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and
Chen Change Loy. DeeperForensics-1.0: A large-scale dataset
for real-world face forgery detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 2889–2898, 2020.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre,
Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. arXiv
preprint arXiv:1406.1078, 2014.
David Güera and Edward J Delp. Deepfake video detection
using recurrent neural networks. In 15th IEEE International
Conference on Advanced Video and Signal based Surveillance
(AVSS), pages 1–6. IEEE, 2018.
[113] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Ben-
jamin Rozenfeld. Learning realistic human actions from
movies. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 1–8. IEEE, 2008.
[114] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi:
Exposing AI created fake videos by detecting eye blinking. In
2018 IEEE International Workshop on Information Forensics
and Security (WIFS), pages 1–7. IEEE, 2018.
[115] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama,
Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and
Trevor Darrell. Long-term recurrent convolutional networks
for visual recognition and description. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 2625–2634, 2015.
[116] Roberto Caldelli, Leonardo Galteri, Irene Amerini, and Al-
berto Del Bimbo. Optical flow based CNN for detection of
unlearnt deepfake manipulations. Pattern Recognition Letters,
146:31–37, 2021.
[117] Irene Amerini, Leonardo Galteri, Roberto Caldelli, and Al-
berto Del Bimbo. Deepfake video detection through opti-
cal flow based CNN. In Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision Workshops, pages
1205–1207, 2019.
[118] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778, 2016.
[119] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
[120] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detect-
ing face warping artifacts. arXiv preprint arXiv:1811.00656,
2018.
[121] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes us-
ing inconsistent head poses. In IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages
8261–8265. IEEE, 2019.
[122] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis.
Two-stream neural networks for tampered face detection. In
IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW), pages 1831–1839. IEEE, 2017.
[123] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-
forensics: Using capsule networks to detect forged images
and videos. In IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 2307–2311.
IEEE, 2019.
[124] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Trans-
forming auto-encoders. In International Conference on Artifi-
cial Neural Networks, pages 44–51. Springer, 2011.
[125] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic
routing between capsules. In Proceedings of the 31st Interna-
tional Conference on Neural Information Processing Systems,
pages 3859–3869, 2017.
Ivana Chingovska, André Anjos, and Sébastien Marcel. On
the effectiveness of local binary patterns in face anti-spoofing.
In Proceedings of the International Conference of Biometrics
Special Interest Group (BIOSIG), pages 1–7. IEEE, 2012.
Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian
Riess, Justus Thies, and Matthias Nießner. FaceForensics: A
large-scale video dataset for forgery detection in human faces.
arXiv preprint arXiv:1803.09179, 2018.
[128] Nicolas Rahmouni, Vincent Nozick, Junichi Yamagishi, and
Isao Echizen. Distinguishing computer graphics from natu-
ral images using convolution neural networks. In IEEE Work-
shop on Information Forensics and Security (WIFS), pages 1–
6. IEEE, 2017.
[129] Haiying Guan, Mark Kozak, Eric Robertson, Yooyoung Lee,
Amy N Yates, Andrew Delgado, Daniel Zhou, Timothee
Kheyrkhah, Jeff Smith, and Jonathan Fiscus. MFC datasets:
Large-scale benchmark datasets for media forensic challenge
evaluation. In IEEE Winter Applications of Computer Vision
Workshops (WACVW), pages 63–72. IEEE, 2019.
[130] Falko Matern, Christian Riess, and Marc Stamminger. Exploit-
ing visual artifacts to expose deepfakes and face manipulations.
In IEEE Winter Applications of Computer Vision Workshops
(WACVW), pages 83–92. IEEE, 2019.
[131] Marissa Koopman, Andrea Macarulla Rodriguez, and Zeno
Geradts. Detection of deepfake video manipulation. In The
20th Irish Machine Vision and Image Processing Conference
(IMVIP), pages 133–136, 2018.
[132] Kurt Rosenfeld and Husrev Taha Sencar. A study of the robust-
ness of PRNU-based camera identification. In Media Forensics
and Security, volume 7254, page 72540M, 2009.
[133] Chang-Tsun Li and Yue Li. Color-decoupled photo response
non-uniformity for digital image forensics. IEEE Transactions
on Circuits and Systems for Video Technology, 22(2):260–271,
2011.
[134] Xufeng Lin and Chang-Tsun Li. Large-scale image clustering
based on camera fingerprints. IEEE Transactions on Informa-
tion Forensics and Security, 12(4):793–808, 2016.
[135] Ulrich Scherhag, Luca Debiasi, Christian Rathgeb, Christoph
Busch, and Andreas Uhl. Detection of face morphing attacks
based on PRNU analysis. IEEE Transactions on Biometrics,
Behavior, and Identity Science, 1(4):302–317, 2019.
[136] Quoc-Tin Phan, Giulia Boato, and Francesco GB De Natale.
Accurate and scalable image clustering based on sparse repre-
sentation of camera fingerprint. IEEE Transactions on Infor-
mation Forensics and Security, 14(7):1902–1916, 2018.
[137] Haya R Hasan and Khaled Salah. Combating deepfake videos
using blockchain and smart contracts. IEEE Access, 7:41596–
41606, 2019.
[138] IPFS powers the distributed web. https://ipfs.io/.
[139] Akash Chintha, Bao Thai, Saniat Javid Sohrawardi, Kartavya
Bhatt, Andrea Hickerson, Matthew Wright, and Raymond
Ptucha. Recurrent convolutional structures for audio spoof and
video deepfake detection. IEEE Journal of Selected Topics in
Signal Processing, 14(5):1024–1037, 2020.
[140] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidul-
lah, Héctor Delgado, Andreas Nautsch, Junichi Yamag-
ishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee.
ASVspoof 2019: Future horizons in spoofed and fake audio
detection. arXiv preprint arXiv:1904.05441, 2019.
[141] Shruti Agarwal, Hany Farid, Ohad Fried, and Maneesh
Agrawala. Detecting deep-fake videos from phoneme-viseme
mismatches. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, pages
660–661, 2020.
Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkel-
stein, Eli Shechtman, Dan B Goldman, Kyle Genova, Zeyu
Jin, Christian Theobalt, and Maneesh Agrawala. Text-based
editing of talking-head video. ACM Transactions on Graphics
(TOG), 38(4):1–14, 2019.
[143] Steven Fernandes, Sunny Raj, Rickard Ewetz, Jodh Singh
Pannu, Sumit Kumar Jha, Eddy Ortiz, Iustina Vintila, and Mar-
garet Salter. Detecting deepfake videos using attribution-based
confidence metric. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition Workshops,
pages 308–309, 2020.
[144] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew
Zisserman. VGGFace2: A dataset for recognising faces across
pose and age. In The 13th IEEE International Conference on
Automatic Face & Gesture Recognition (FG 2018), pages 67–
74. IEEE, 2018.
[145] Susmit Jha, Sunny Raj, Steven Fernandes, Sumit K Jha,
Somesh Jha, Brian Jalaian, Gunjan Verma, and Ananthram
Swami. Attribution-based confidence metric for deep neural
networks. Advances in Neural Information Processing Sys-
tems, 32:11826–11837, 2019.
[146] Steven Fernandes, Sunny Raj, Eddy Ortiz, Iustina Vintila, Mar-
garet Salter, Gordana Urosevic, and Sumit Jha. Predicting heart
rate variations of deepfake videos using neural ODE. In Pro-
ceedings of the IEEE/CVF International Conference on Com-
puter Vision Workshops, pages 1721–1729, 2019.
Pavel Korshunov and Sébastien Marcel. Deepfakes: a new
threat to face recognition? assessment and detection. arXiv
preprint arXiv:1812.08685, 2018.
[148] Shruti Agarwal, Hany Farid, Tarek El-Gaaly, and Ser-Nam
Lim. Detecting deep-fake videos from appearance and behav-
ior. In IEEE International Workshop on Information Forensics
and Security (WIFS), pages 1–6. IEEE, 2020.
[149] Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. FakeCatcher:
Detection of synthetic portrait videos using biological signals.
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 2020.
[150] Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket
Bera, and Dinesh Manocha. Emotions don’t lie: A deep-
fake detection method using audio-visual affective cues. arXiv
preprint arXiv:2003.06711, 2020.
[151] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Deep-
fake detection by analyzing convolutional traces. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops, pages 666–667, 2020.
[152] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin,
and Jaegul Choo. Image-to-image translation via group-wise
deep whitening-and-coloring transformation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 10639–10647, 2019.
[153] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha,
Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative
adversarial networks for multi-domain image-to-image trans-
lation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 8789–8797, 2018.
[154] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan,
and Xilin Chen. AttGAN: Facial attribute editing by only
changing what you want. IEEE Transactions on Image Pro-
cessing, 28(11):5464–5478, 2019.
[155] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
Jaakko Lehtinen, and Timo Aila. Analyzing and improv-
ing the image quality of StyleGAN. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 8110–8119, 2020.
[156] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik
Learned-Miller. Labeled faces in the wild: A database
for studying face recognition in unconstrained environ-
ments. Technical Report 07-49, University of Massachusetts,
Amherst, October 2007.
[157] Apurva Gandhi and Shomik Jain. Adversarial perturbations
fool deepfake detectors. In IEEE International Joint Confer-
ence on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
Few-shot face translation GAN. https://github.com/shaoanlu/fewshot-face-translation-GAN.
[159] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen,
Fang Wen, and Baining Guo. Face X-ray for more general
face forgery detection. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
5001–5010, 2020.
[160] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew
Owens, and Alexei A Efros. CNN-generated images are
surprisingly easy to spot... for now. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition, pages 8695–8704, 2020.
[161] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and
Lei Zhang. Second-order attention network for single image
super-resolution. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 11065–
11074, 2019.
[162] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Fight-
ing deepfake by exposing the convolutional traces on images.
IEEE Access, 8:165085–165098, 2020.
[163] Todd K Moon. The expectation-maximization algorithm. IEEE
Signal Processing Magazine, 13(6):47–60, 1996.
[164] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros.
Unpaired image-to-image translation using cycle-consistent
adversarial networks. In Proceedings of the IEEE international
conference on computer vision, pages 2223–2232, 2017.
[165] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse image syn-
thesis from semantic layouts via conditional IMLE. In Proceed-
ings of the IEEE/CVF International Conference on Computer
Vision, pages 4220–4229, 2019.
[166] Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Li-
akata, and Rob Procter. Detection and resolution of rumours in
social media: A survey. ACM Computing Surveys (CSUR), 51
(2):1–36, 2018.
[167] R. Chesney and D. K. Citron. Disinformation on steroids:
The threat of deep fakes. https://www.cfr.org/report/deep-fake-disinformation-steroids, October 2018.
[168] Luciano Floridi. Artificial intelligence, deepfakes and a future
of ectypes. Philosophy & Technology, 31(3):317–321, 2018.
Davide Cozzolino, Justus Thies, Andreas Rössler, Christian
Riess, Matthias Nießner, and Luisa Verdoliva. ForensicTrans-
fer: Weakly-supervised domain adaptation for forgery detec-
tion. arXiv preprint arXiv:1812.02510, 2018.
[170] Francesco Marra, Cristiano Saltori, Giulia Boato, and Luisa
Verdoliva. Incremental learning for the detection and classifi-
cation of GAN-generated images. In 2019 IEEE International
Workshop on Information Forensics and Security (WIFS),
pages 1–6. IEEE, 2019.
[171] Shehzeen Hussain, Paarth Neekhara, Malhar Jere, Farinaz
Koushanfar, and Julian McAuley. Adversarial deepfakes: Eval-
uating vulnerability of deepfake detectors to adversarial exam-
ples. In Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision, pages 3348–3357, 2021.
[172] Nicholas Carlini and Hany Farid. Evading deepfake-image de-
tectors with white-and black-box attacks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition Workshops, pages 658–659, 2020.
[173] Chaofei Yang, Leah Ding, Yiran Chen, and Hai Li. Defending
against GAN-based deepfake attacks via transformation-aware
adversarial faces. In IEEE International Joint Conference on
Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
[174] Chin-Yuan Yeh, Hsi-Wen Chen, Shang-Lun Tsai, and Sheng-
De Wang. Disrupting image-translation-based deepfake al-
gorithms with adversarial attacks. In Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vi-
sion Workshops, pages 53–62, 2020.
M. Read. Can you spot a deepfake? Does it matter? http://nymag.com/intelligencer/2019/06/how-do-you-spot-a-deepfake-it-might-not-matter.html, June 2019.
[176] Marie-Helen Maras and Alex Alexandrou. Determining au-
thenticity of video evidence in the age of artificial intelligence
and in the wake of deepfake videos. The International Journal
of Evidence & Proof, 23(3):255–262, 2019.
[177] Lichao Su, Cuihua Li, Yuecong Lai, and Jianmei Yang. A fast
forgery detection algorithm based on exponential-Fourier mo-
ments for video region duplication. IEEE Transactions on Mul-
timedia, 20(4):825–840, 2017.
[178] Massimo Iuliani, Dasara Shullani, Marco Fontani, Saverio
Meucci, and Alessandro Piva. A video forensic framework
for the unsupervised analysis of MP4-like file container. IEEE
Transactions on Information Forensics and Security, 14(3):
635–645, 2018.
[179] Badhrinarayan Malolan, Ankit Parekh, and Faruk Kazi. Ex-
plainable deep-fake detection using visual interpretability
methods. In The 3rd International Conference on Information
and Computer Technologies (ICICT), pages 289–293. IEEE,
2020.
[180] Oliver Giudice, Luca Guarnera, and Sebastiano Battiato. Fight-
ing deepfakes by detecting GAN DCT anomalies. arXiv
preprint arXiv:2101.09781, 2021.