https://doi.org/10.1007/s11042-021-11539-y
1221: DEEP LEARNING FORIMAGE/VIDEO COMPRESSION
ANDVISUAL QUALITY ASSESSMENT
A dual-branch neural network for DeepFake video detection by detecting spatial and temporal inconsistencies
Liang Kuang1,2 · Yiting Wang3 · Tian Hang1 · Beijing Chen1,4 · Guoying Zhao5
Received: 18 May 2021 / Revised: 17 August 2021 / Accepted: 9 September 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021
Abstract
It has become a research hotspot to detect whether a video is natural or DeepFake. However, almost all the existing works focus on detecting the inconsistency in either the spatial or the temporal domain. In this paper, a dual-branch (spatial branch and temporal branch) neural network is proposed to detect the inconsistency in both the spatial and temporal domains for DeepFake video detection. The spatial branch aims at detecting spatial inconsistency with the effective EfficientNet model. The temporal branch focuses on temporal inconsistency detection with a new network model. The new temporal model takes optical flow as input, uses the EfficientNet to extract optical flow features, and utilizes the Bidirectional Long Short-Term Memory (Bi-LSTM) network to capture the temporal inconsistency of optical flow. Moreover, the optical flow frames are stacked before being input into the EfficientNet. Finally, the softmax scores of the two branches are combined with a binary-class linear SVM classifier. Experimental results on the compressed FaceForensics++ dataset and the Celeb-DF dataset show that: (a) the proposed dual-branch network model performs better than some recent spatial and temporal models on the Celeb-DF dataset and all four manipulation methods in the FaceForensics++ dataset, since the two branches complement each other; (b) the ablation experiments show that the use of optical flow inputs, Bi-LSTM and the dual branches greatly improves the detection performance.
Keywords DeepFake video detection · Optical flow · Convolutional neural network · Long short-term memory network
* Beijing Chen
nbutimage@126.com
1 School ofComputer, Nanjing University ofInformation Science andTechnology, Nanjing210044,
China
2 School ofIoT Engineering, Jiangsu Vocational College ofInformation Technology, Wuxi214153,
China
3 Warwick Manufacturing Group, University ofWarwick, CoventryCV47AL, UK
4 Jiangsu Collaborative Innovation Center ofAtmospheric Environment andEquipment Technology
(CICAEET), Nanjing University ofInformation Science andTechnology, Nanjing210044, China
5 Center forMachine Vision andSignal Analysis, University ofOulu, 90014Oulu, Finland
Published online: 12 July 2022
Multimedia Tools and Applications (2022) 81:42591–42606
1 Introduction
Digital images and videos have filled our lives and become an indispensable part of social networks. They contain a large amount of information and are easy to understand. However, the popularity of image editing software and technologies, and especially the development of deep learning, makes image tampering easier and easier [6, 33, 35]. Furthermore, the tampering can leave no obviously visible traces of modification [40]. In particular, a new AI-based fake video generation method known as DeepFake has attracted much attention recently [18]. It is a technique that can superimpose a face image of a target person onto a video of a source person and then create a video of the target person doing or saying things that the source person does. DeepFake videos can be abused to fool the public or even cause political or religious tensions between countries [7, 24]. It has been applied to create videos of some countries' leaders, such as US Presidents Obama and Trump, delivering fake speeches for falsification purposes. At the same time, DeepFake has also been used to swap celebrities' faces into pornographic videos for illegal profits [24]. Accordingly, there is an urgent need for reliable and effective methods to expose DeepFake videos.
Until now, DeepFake video detection methods have relied on either spatial or temporal inconsistencies. From the perspective of spatial inconsistency, each generated frame inevitably contains artifacts. Therefore, many recent works [1, 17, 20, 23, 25, 36] have effectively detected DeepFake videos via artifact detection. However, with the development of image forgery technology, such artifacts have become more and more challenging to capture [1]. In addition, the compression of videos during transmission makes detection more difficult because the image quality is significantly degraded [20]. Therefore, temporal (inter-frame) inconsistency detection methods have been proposed [2, 3, 5, 8, 11, 14, 19, 21, 22, 26]. According to the studies in [21], the temporal inconsistency is easier to capture and achieves higher detection accuracy than the spatial inconsistency. However, the detection performance still needs to be improved.
So, both the spatial and temporal inconsistencies are considered in this paper with a dual-
branch neural network. The major contributions of this paper are summarized as follows:
(a) A dual-branch (spatial branch and temporal branch) neural network is proposed to
detect the inconsistency in both spatial and temporal for DeepFake video detection.
(b) A new temporal network model adopted as the temporal branch is constructed to cap-
ture the temporal inconsistencies between frames. This model considers optical flow
as input, uses the EfficientNet [32] to extract optical flow features, and utilizes the Bidirectional Long Short-Term Memory (Bi-LSTM) network [28] to capture the temporal
inconsistency of optical flow. Moreover, the optical flow frames are stacked before
inputting into the EfficientNet.
The rest of the manuscript is organized as follows. Section 2 presents the related works on DeepFake video detection. Section 3 explains the proposed model. Experimental results and analysis are presented in Section 4. The findings are concluded in Section 5.
2 Related works
Research on DeepFake video detection has been primarily driven by advances in image classification technologies. Currently, almost all the existing works focus on detecting the inconsistency in either the spatial or the temporal domain. The spatial inconsistency of videos can be found within a frame. For example, Yang et al. [36] used head posture estimation to distinguish real faces from fake faces through face marker estimation and central region estimation. Matern et al. [23] utilized artifacts of eyes and teeth to expose DeepFakes. On the other hand, some works [1, 17, 20, 25] introduced deep learning to learn discriminative features or find manipulation traces within a frame. For instance, Afchar et al. and Rossler et al. proposed the well-known MesoNet model [1] and Xception model [25], respectively. Recently, Li et al. [20] presented an approach called face X-ray for DeepFake detection. They adopted a framework based on a convolutional neural network (CNN) to extract the face X-ray of an input image and then output the probability of the input image being real or blended. Moreover, Khalid et al. [17] proposed OC-FakeDect, which uses a one-class Variational Autoencoder (VAE) trained only on real face images and detects DeepFake images by treating them as anomalies.
The temporal inconsistency is also an important clue for DeepFake video detection. Agarwal et al. [2] reduced each 10-second video clip to a 190-dimensional facial expression feature vector, which was used to classify the video as real or fake with a Support Vector Machine (SVM). With the breakthrough development of deep learning, some deep learning-based works have been studied. Ciftci et al. [8] presented FakeCatcher, a fake portrait video detector based on biological signals. They generated video-based biological signal maps and employed a convolutional neural network (CNN) to detect synthetic content using the generated maps. Amerini et al. [3] proposed to use optical flow fields to exploit possible inter-frame dissimilarities; the optical flow frames are then used as inputs of a CNN-based model to detect fake faces. Some recent works [5, 14, 19, 26] utilized recurrent neural networks (RNNs) [38] combined with CNNs. For example, a long-term recurrent CNN (LRCN) [10] was adopted for eye-blinking detection in DeepFake videos [19]. A convolutional LSTM architecture was used to detect DeepFake videos in [14]. Furthermore, Sabir et al. [26] proposed to use a Bi-LSTM combined with DenseNet [15] to find inconsistent features across frames. Then, Chen et al. [5] improved the architecture in [26] by introducing a superpixel-wise binary classification unit, which is specifically designed to guide the backbone network to focus on the differences between a forged face and its surrounding regions. Recently, Li et al. [21] presented sharp multiple instance learning (S-MIL) for DeepFake video detection. They designed a new spatial-temporal instance to capture the inconsistency between faces, which helps to improve the accuracy of DeepFake detection. What's more, DeepFake video discriminators with a 3D convolutional network [34] were also introduced in [11, 22], since the 3D convolutional network is able to extract motion features encoded in adjacent frames of a video.
It can be seen from the above analysis that, on the one hand, the works based on spatial inconsistency are effective in detecting still fake face images but are not suitable for capturing the variation in a DeepFake video. On the other hand, the works using temporal inconsistency are more suitable for capturing the variation, but they pay less attention to subtle artifacts within a frame. Therefore, just like human beings, it is necessary to take both the spatial and temporal inconsistencies into consideration. The spatial anomaly can serve as an effective complementary clue to the temporal anomaly. With the abundant information from the two aspects, the model should be more robust and effective.
3 Proposed model
In this section, we describe the proposed dual-branch neural network model for Deep-
Fake video detection. Before the introduction of the proposed model, some pre-process-
ing technologies are described in Sect. 3.1.
3.1 Pre‑Processing
3.1.1 Face cropping
According to [26], face cropping is beneficial for classification. So, it is employed here
as well. All the faces in the videos are cropped frame by frame. Inspired by its success
in the face detection, the MTCNN [39] is used to locate the human face in the image. It
leverages a cascaded architecture that consists of three parts: Proposal Network (P-Net),
Refine Network (R-Net), and Output Network (O-Net). P-Net is a deep convolutional
network used to classify face and non-face in the frame by generating a candidate win-
dow, while the R-Net is a network to reject false candidates from the P-Net. Finally,
O-Net is adopted to output five facial landmarks: the left and right mouth corners, the
center of the nose, and the centers of the left and right eyes. After detecting the input
frames with MTCNN, we crop all the regions where faces are detected and resize them
to 224 × 224 pixels. Figure 1 shows an example face cropped from a frame.
Fig. 1 Extraction of the specified face from a frame
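To illustrate this cropping step, the sketch below uses the MTCNN implementation from the facenet-pytorch package to sample frames from a video and return 224 × 224 face crops. The package choice, the margin value and the frame-sampling scheme are assumptions for illustration and are not specified in the paper.

# Face-cropping sketch using facenet-pytorch's MTCNN (assumed implementation choice).
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

# MTCNN returns face crops already resized to image_size x image_size; margin is an assumption.
mtcnn = MTCNN(image_size=224, margin=20, post_process=False, keep_all=False)

def crop_faces(video_path, num_frames=32):
    """Sample num_frames frames from a video and return 224x224 face crops."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    faces = []
    for i in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ok, frame = cap.read()
        if not ok:
            break
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        face = mtcnn(rgb)          # tensor of shape (3, 224, 224), or None if no face is found
        if face is not None:
            faces.append(face)
        if len(faces) == num_frames:
            break
    cap.release()
    return faces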
3.1.2 Optical flow
Gibson [12] first introduced the optical flow method to extract foreground object movement information in videos. Specifically, the objective of the optical flow method is to find a disparity map u that minimizes an image-based error criterion together with a regularization force:

E = \int_{\Omega} \{ \lambda \Phi(I_0(x) - I_1(x + u(x))) + \varphi(u, \nabla u, \ldots) \} \, dx,    (1)

where I_0 and I_1 are the two frames, \varphi(u, \nabla u, \ldots) represents the regularization term inducing the shape prior, \Phi(I_0(x) - I_1(x + u(x))) is the image data fidelity term, and \lambda is the weight between the regularization force and the data fidelity.

Currently, there are four general types of optical flow algorithms [4]: frequency-based, phase-based, match-based and gradient-based. Compared to the other three types, the gradient-based algorithms are simple and easy to calculate. Moreover, the optical flow obtained by gradient-based algorithms can describe the motion trajectory more accurately. The TV-L1 algorithm is a commonly used gradient-based algorithm. It utilizes the L1 norm and can handle large displacements between frames. Specifically, in the TV-L1 algorithm the two functions in (1) are chosen as \Phi(x) = |x| and \varphi(\nabla u) = |\nabla u|, so that (1) becomes

E = \int_{\Omega} \{ \lambda |I_0(x) - I_1(x + u(x))| + |\nabla u| \} \, dx.    (2)

The detailed solution of (2) can be found in [37].
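A minimal sketch of computing TV-L1 optical flow between consecutive grayscale face crops with OpenCV is given below. It assumes the opencv-contrib-python package, where the factory function is exposed under cv2.optflow (older releases expose cv2.DualTVL1OpticalFlow_create at the top level); the averaging of the two displacement fields and the normalization to 0-255 are our reading of Sect. 3.2.2, not a verified reproduction of the authors' code.

# TV-L1 optical flow sketch (assumes opencv-contrib-python; the API location varies across OpenCV versions).
import cv2
import numpy as np

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def optical_flow_frames(gray_frames):
    """gray_frames: list of uint8 grayscale face crops of identical size.
    Returns one displacement image per consecutive frame pair, scaled to 0-255."""
    flows = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = tvl1.calc(prev, curr, None)             # (H, W, 2): horizontal and vertical displacements
        d = np.mean(np.abs(flow), axis=2)              # average the two displacement fields (one possible reading)
        d = cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
        flows.append(d)
    return flows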
3.2 Dual‑branch architecture
The proposed dual-branch neural network model consists of a spatial branch and a temporal branch. The spatial branch is dedicated to detecting artifacts in the RGB frames, while the temporal branch is used to detect the temporal inconsistency through a series of optical flow frames. The overall architecture is shown in Fig. 2. As shown in Fig. 2, each branch performs detection on its own. The softmax scores of the two branches are combined with a binary-class linear SVM classifier for the final classification.

Fig. 2 Structure of the dual-branch neural network
3.2.1 Spatial branch
The spatial branch takes a sequence of RGB frames of a video as input and outputs a sequence of the corresponding softmax scores. These scores are averaged as the final score of the video. Here, the EfficientNet model [32] is used to detect each frame because it transfers well and has achieved better performance in the DeepFake Detection Challenge (DFDC) [9] than some existing CNNs such as AlexNet [16], GoogleNet [31], and Xception [13]. The baseline EfficientNet-B0 network structure used in this paper is shown in Fig. 3. As we can see in Fig. 3, there are 16 MBConv layers, 2 conv layers, 1 pooling layer and 1 fully connected layer. The main building block is the mobile inverted bottleneck convolution MBConv [27], which can reduce the computational cost by a factor proportional to the number of channels. The MBConv architecture is shown in Fig. 4 as MBConv1 and MBConv6.
Fig. 3 Structure of the EfficientNet-B0 network (Conv3×3 ×1; MBConv1 k3×3 ×1; MBConv6 k3×3 ×2; MBConv6 k5×5 ×2; MBConv6 k3×3 ×3; MBConv6 k5×5 ×3; MBConv6 k5×5 ×4; MBConv6 k3×3 ×1; Conv1×1 & Pooling & FC ×1)
Fig. 4 MBConv architecture: (a) MBConv1, (b) MBConv6. Each block is built from Conv1×1 (BN), DWConv3×3/5×5 (BN, ReLU) and Conv1×1 (BN, ReLU) layers with shortcut connections
DWConv denotes depthwise convolution, k3 × 3/k5 × 5 denotes the kernel size, BN is batch normalization, and H × W × F denotes the tensor shape (height, width, depth). The network is transferred to our task by replacing the last fully connected layer with one having two outputs. Cross entropy is adopted as the loss function:

L_{CE} = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})],    (3)

where \hat{y} represents the final predicted probability, and y is set to 1 if the face image is manipulated and to 0 otherwise.
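As a sketch of this transfer step, the snippet below loads an ImageNet-pretrained EfficientNet-B0 through the timm package (an assumed implementation choice; the paper does not name one), replaces the classifier with a two-output layer trained with the cross entropy in (3), and averages the per-frame fake probabilities into a video-level score.

# Spatial-branch sketch: EfficientNet-B0 with a two-class head (timm is an assumed implementation choice).
import timm
import torch
import torch.nn as nn

spatial_model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=2)
criterion = nn.CrossEntropyLoss()    # two-class cross entropy, corresponding to Eq. (3)

def spatial_scores(face_batch):
    """face_batch: (N, 3, 224, 224) tensor of cropped faces from one video.
    Returns the per-frame fake probabilities and their average (the video-level score)."""
    with torch.no_grad():
        probs = torch.softmax(spatial_model(face_batch), dim=1)[:, 1]   # probability of the 'fake' class
    return probs, probs.mean()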
To better understand how our spatial branch works, Gradient-weighted Class Activation Mapping (Grad-CAM) [29] is employed to visualize the learned features in the spatial branch. Grad-CAM takes a single RGB frame as input and outputs a coarse localization map, which highlights the regions of the image that are important for the prediction made after the final layer. Some results are illustrated in Fig. 5. It can be clearly observed that the EfficientNet effectively focuses on the important regions that people pay attention to, such as the eyes, nose, mouth, etc. The softmax score for each frame is shown at the bottom of the figure. The number in the red box represents the probability that the RGB frame is considered fake, and the number in the green box represents the probability that the RGB frame is considered real.

Fig. 5 Heatmaps extracted with Grad-CAM from a frame sequence. The last row shows the softmax score of each frame
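The hook-based sketch below reproduces the Grad-CAM computation for a spatial-branch model; the conv_head layer name is specific to timm's EfficientNet-B0 and, like the choice of the 'fake' class as the target, is an assumption rather than a detail given in the paper.

# Grad-CAM sketch with forward/backward hooks ('conv_head' is the last conv layer in timm's EfficientNet-B0).
import timm
import torch
import torch.nn.functional as F

model = timm.create_model('efficientnet_b0', pretrained=True, num_classes=2).eval()
store = {}
model.conv_head.register_forward_hook(lambda m, i, o: store.update(feat=o))
model.conv_head.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

def grad_cam(face, target_class=1):
    """face: (1, 3, 224, 224) tensor of a cropped face. Returns a 224 x 224 heatmap in [0, 1]."""
    model.zero_grad()
    logits = model(face)
    logits[0, target_class].backward()                        # gradient of the 'fake' logit
    weights = store['grad'].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * store['feat']).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=face.shape[-2:], mode='bilinear', align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()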
3.2.2 Temporal branch
Unlike the spatial branch, the input of the temporal branch is a sequence of consecutive optical flow frames. The proposed temporal branch is based on the EfficientNet [32] and the Bi-LSTM network [28]. The EfficientNet is used to extract the optical flow features, and the Bi-LSTM network is used to capture the temporal inconsistency introduced by the face swapping process.

In the temporal branch, we adopt a modified EfficientNet-B0 in which the fully connected layer is removed so that it outputs the feature vector of each frame directly. Notice that the EfficientNet should be pre-trained with an optical flow dataset. The resulting representations, 1280-dimensional feature vectors, are used as the sequential input of the Bi-LSTM. Finally, the Bi-LSTM is followed by a 1024-unit fully connected layer with a dropout rate of 0.5. The cross entropy loss in (3) and the softmax activation function are also used in the temporal branch.
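The PyTorch sketch below mirrors this description: an EfficientNet-B0 backbone with its classifier removed yields a 1280-dimensional vector per stacked optical flow frame, followed by a Bi-LSTM and a 1024-unit fully connected layer with dropout 0.5. The LSTM hidden size, the last-time-step pooling and the replication of grayscale flow frames to three channels are assumptions not fixed by the paper.

# Temporal-branch sketch: EfficientNet-B0 features + Bi-LSTM (hidden size is an assumption).
import timm
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    def __init__(self, hidden_size=512, num_classes=2):
        super().__init__()
        # num_classes=0 makes timm return pooled 1280-d features instead of class logits
        self.backbone = timm.create_model('efficientnet_b0', pretrained=True, num_classes=0)
        self.bilstm = nn.LSTM(input_size=1280, hidden_size=hidden_size,
                              batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_size, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_classes))

    def forward(self, flow_seq):
        # flow_seq: (B, T, 3, 224, 224) stacked optical flow frames (grayscale replicated to 3 channels)
        b, t = flow_seq.shape[:2]
        feats = self.backbone(flow_seq.flatten(0, 1)).view(b, t, -1)   # (B, T, 1280)
        out, _ = self.bilstm(feats)                                    # (B, T, 2 * hidden_size)
        return self.head(out[:, -1])                                   # logits from the last time step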
Stacked optical flow frames  A dense optical flow can be seen as a horizontal displacement vector field dU_\tau and a vertical displacement vector field dV_\tau between a pair of consecutive frames \tau and \tau + 1. By averaging dU_\tau(x, y) and dV_\tau(x, y), we obtain the final displacement vector d_\tau(x, y) at the pixel (x, y).

The optical flow frames computed from the corresponding video frames are shown in Fig. 6. As we can see from the second row, the optical flow frames are single-channel grayscale images in which some areas highlight the moving parts of the frames. However, the motion representation may not always be obvious in optical flow frames due to the slight movement of the face. So, every L consecutive optical flow frames (e.g. L = 5) are stacked into one frame. The objective is to fuse the adjacent optical flow frames to represent motion information over time. Specifically, let w and h be the width and height of an optical flow frame; the input volume D_\tau \in R^{w \times h} of the EfficientNet for an arbitrary stacked optical flow frame \tau is constructed as

D_\tau(u, v) = \max\{d_\tau(u, v), d_{\tau+1}(u, v), \ldots, d_{\tau+L-1}(u, v)\},  u = 1, 2, \ldots, w,  v = 1, 2, \ldots, h.    (4)

The pixel values of the optical flow frame D_\tau(u, v) range from 0 to 255; the greater the value, the larger the motion. The optical flow stacking method in (4) is quite different from that in [30].

Figure 7 presents the stacked (L = 5) and non-stacked (L = 1) optical flow frames. It can be observed from Fig. 7 that the stacked optical flow frames contain far more motion information than the non-stacked ones.
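A direct NumPy implementation of the stacking rule in (4), assuming the optical flow frames are already available as 8-bit displacement images, is sketched below.

# Stacking rule of Eq. (4): pixel-wise maximum over L consecutive optical flow frames.
import numpy as np

def stack_optical_flow(flow_frames, L=5):
    """flow_frames: list of (h, w) uint8 optical flow frames d_tau.
    Returns one stacked frame D_tau for each valid starting index tau."""
    stacked = []
    for tau in range(len(flow_frames) - L + 1):
        window = np.stack(flow_frames[tau:tau + L], axis=0)   # (L, h, w)
        stacked.append(window.max(axis=0))                    # element-wise maximum over the window
    return stacked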
Fig. 6 Optical flow frames from the corresponding video frames. The first row is the consecutive video
frames, and the bottom is the corresponding optical flow frames
Fig. 7 Stacked and non-stacked optical flow frames. The first row is the non-stacked frames (L = 1), and the
bottom is the stacked frames (L = 5)
Bi-LSTM  The RNN is a neural network specialized for processing sequential inputs. It is well suited to modeling non-linear dynamics and temporal information. So, the Bi-LSTM, a special RNN, is adopted in our temporal branch to capture temporal inconsistency. The LSTM network is a typical type of RNN improved by adding a cell state to the hidden layer. The cell state is maintained by three unique structures: the forget gate, input gate, and output gate. Furthermore, as an improved version of the LSTM, the Bi-LSTM has two LSTMs stacked on top of each other: one RNN goes in the forward direction, while the other goes in the backward direction. Then, the outputs of the two RNNs are combined. Figure 8 shows the architecture of the Bi-LSTM.
3.2.3 Pseudo‑code
In order to make the proposed model clear and help the reader to implement it, Table 1 shows the pseudo-code of the proposed model.
4 Experimental results andanalysis
In this section, the proposed DeepFake video detection model is evaluated. We first intro-
duce the overall experimental setups and then present some experiments to prove the supe-
riority of our model. In addition, two ablation experiments are conducted to demonstrate the validity of both the temporal branch and the fusion of the two branches.
4.1 Experimental setups
4.1.1 Experimental datasets
The widely used datasets FaceForensics++ (FF++) and Celeb-DF are considered here for
evaluation. A brief description of each dataset is provided below.
FaceForensics++ (FF++) There are 1000 original videos and 4000 fake videos in the raw
sub-dataset without compression. The fake videos are generated by four different manipu-
lation methods: DeepFake, Face2Face, FaceSwap, and Neural Texture (NT). Each of them
has 1000 fake videos. Moreover, every video is compressed with two different compression
ratios, obtaining two compressed sub-datasets (C23 and C40). The higher the compres-
sion ratio, the more difficult the detection. In this paper, we only evaluate the sub-dataset
with the highest compression ratio (C40) because the accuracies of the raw sub-dataset and
C23 sub-dataset have been greater than 98% by the Xception model shown in [25]. For
the experiments, the C40 dataset is split into three sets (training, validation, and test sets).
Fig. 8 Bi-LSTM architecture
The three sets consist of 720, 140, and 140 videos, respectively. Moreover, 32 frames are sampled from each video and face cropping is conducted on them. Some examples of cropped faces are shown in Fig. 9.
Celeb-DF  It is a new DeepFake dataset generated by using a refined synthesis algorithm that can reduce visual artifacts effectively. It contains 590 real videos and 5639 synthesized videos. The videos cover subjects of different ages, races and genders. For the experiments, the 590 real videos are randomly cut into 1000 clips, and 1000 clips are likewise cut from the 5639 synthesized videos; each clip contains 32 frames. Then, the total 2000 clips are divided into three sets (training, validation, and test sets) with the same ratio as the FF++ dataset. Some examples of cropped faces are also provided in Fig. 9.
Notice that in the spatial branch all the 32 cropped faces of a video are taken as inputs separately, and the average softmax score of the 32 face frames is regarded as the score of the spatial branch. The temporal branch takes the 32 consecutive stacked optical flow frames as input.
Table 1 Pseudo-code of the proposed model

Algorithm 1: Pseudo-code of the proposed model
Input: Video V
1: Extract RGB frames from the video as IRGB = Ext(V);
2: Crop the frames to extract the specified faces as IC-RGB = Crop(IRGB);
3: Calculate optical flow frames as IC-OF = TV-L1(IC-RGB);
4: if step == Train do
5:   Calculate the softmax scores of each frame in the spatial branch as SS_score = Spatial_branch_model(IC-RGB);
6:   Calculate the loss of the spatial branch as LS_CE = CrossEntropy_Loss(SS_score, Label);
7:   Minimize LS_CE and update the network parameters by back propagation;
8:   Obtain the spatial branch model;
9:   Calculate the softmax scores of the consecutive optical flow frames in the temporal branch as ST_score = Temporal_branch_model(IC-OF);
10:  Calculate the loss of the temporal branch as LT_CE = CrossEntropy_Loss(ST_score, Label);
11:  Minimize LT_CE and update the network parameters by back propagation;
12:  Obtain the temporal branch model;
13:  Average the softmax scores in the spatial branch as SS_score_avg = Avg(SS_score);
14:  Fuse the decisions of the two branches as Sfused = SVM(SS_score_avg, ST_score);
15:  Calculate the fused loss of the dual branches as LSVM = Hinge(Sfused, Label);
16:  Minimize LSVM and update the SVM parameters;
17: end if
18: if step == Test do
19:  Calculate the softmax scores of each frame in the spatial branch with the trained spatial branch model as SS_score = Spatial_branch_model(IC-RGB);
20:  Obtain the average softmax score of all the frames in the spatial branch as SS_score_avg = Avg(SS_score);
21:  Calculate the softmax score of the temporal branch with the trained temporal branch model as ST_score = Temporal_branch_model(IC-OF);
22:  Predicted result Sfused = SVM(SS_score_avg, ST_score);
23: end if
Output: Predicted result (real or fake)
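To make the fusion step concrete, the sketch below trains a binary linear SVM on the two branch scores with scikit-learn; scikit-learn's LinearSVC is an assumed implementation choice, and the hinge loss corresponds to line 15 of Algorithm 1.

# Fusion sketch: binary linear SVM over the two branch scores (scikit-learn is an assumed choice).
import numpy as np
from sklearn.svm import LinearSVC

def train_fusion(spatial_scores, temporal_scores, labels):
    """spatial_scores, temporal_scores: per-video softmax scores from the two branches.
    labels: 1 for fake, 0 for real. Returns the fitted SVM."""
    X = np.column_stack([spatial_scores, temporal_scores])
    svm = LinearSVC(C=1.0, loss='hinge')   # hinge loss, as in line 15 of Algorithm 1
    svm.fit(X, labels)
    return svm

def predict_fusion(svm, spatial_score, temporal_score):
    return int(svm.predict([[spatial_score, temporal_score]])[0])   # 1: fake, 0: real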
4.1.2 Parameters
In the spatial branch, the batch size is set to 16. The ADAM optimizer is used with an initial learning rate of 1.0e-4, which decays to half of its value if the accuracy does not improve within 5 epochs. The maximum number of iterations is set to 200, and early stopping is also adopted.

In the temporal branch, the batch size is set to 4 and the ADAM optimizer is used with an initial learning rate of 1.0e-5. We set the time step to 32, the number of optical flow frames. The total number of iterations is set to 1000. Regarding the EfficientNet used in both branches, we consider EfficientNet-B0 and re-train it on the basis of the model pre-trained on ImageNet.
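These settings map onto standard PyTorch components; the sketch below shows one possible spatial-branch training loop, where train_one_epoch and evaluate are hypothetical helpers and the early-stopping patience is an assumption not given in the paper.

# Spatial-branch training-loop sketch; train_one_epoch and evaluate are hypothetical helpers.
import torch

def train_spatial_branch(model, train_one_epoch, evaluate, max_epochs=200, es_patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # 'max' mode halves the learning rate when validation accuracy has not improved for 5 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=5)
    best_acc, stall = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)      # hypothetical helper: one pass over the training set
        val_acc = evaluate(model)              # hypothetical helper: validation accuracy
        scheduler.step(val_acc)
        if val_acc > best_acc:
            best_acc, stall = val_acc, 0
        else:
            stall += 1
            if stall >= es_patience:           # early-stopping patience is an assumption
                break
    return model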
4.2 Evaluation theimprovements onthetemporal branch
The following metric, Accuracy, is used to measure the performance of DeepFake video detection:

Accuracy = (TP + TN) / (TN + FP + TP + FN) × 100%,    (5)

where TP denotes the number of correctly predicted fake video cases, FP is the number of normal cases that are misclassified as fake videos, TN represents the number of normal cases that are correctly classified, and FN denotes the number of fake video cases that are misclassified as normal.

Fig. 9 Examples of cropped faces in the FF++ dataset and Celeb-DF dataset. From top to bottom are generated by DeepFake, Face2Face, FaceSwap, Neural Texture and Celeb-DF, respectively
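For completeness, (5) can be computed from the video-level predictions as follows (scikit-learn assumed; labels 1 = fake, 0 = real).

# Accuracy of Eq. (5) from predicted and true video-level labels.
from sklearn.metrics import confusion_matrix

def accuracy(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (tp + tn) / (tn + fp + tp + fn) * 100.0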
Firstly, in order to verify the improvements on the temporal branch, an ablation study
is conducted by comparing the basic EfficientNet with three improved versions: Efficient-
Net + LSTM (L = 1), EfficientNet + Bi-LSTM (L = 1), and EfficientNet + Bi-LSTM (L = 5).
The ablation experimental results are shown in Table2. Notice that, for the EfficientNet,
the softmax scores of optical flow frames are averaged as the final video-level prediction.
It can be observed from the Table2 that: (a) the models combining with LSTM greatly
improves the performance of the model with the EfficientNet only because the LSTM can
effectively capture the temporal inconsistency in DeepFake videos; (b) the Bi-LSTM per-
forms better than the LSTM due to the bidirectional detection of Bi-LSTM; (c) using the
stacked optical flow frames can achieve the better performance than using the non-stacked
optical flow frames because the stacked frames carry more motion-dependent information
than the non-stacked ones.
Then, in order to test the performance of the proposed temporal branch, it is also compared with three existing temporal models, i.e., C3D [22], CNN-LSTM [26], and Sharp Multiple Instance Learning (S-MIL) [21]. The input of each compared model is a sequence of consecutive stacked optical flow frames. The comparative results are presented in Fig. 10. The results show that our model is much better than C3D and S-MIL due to the use of the EfficientNet and Bi-LSTM. The average accuracy of our model reaches 92% across the two datasets. Even for the most difficult NT sub-dataset, the accuracy is still 85%. The results also demonstrate that our model outperforms the CNN-LSTM based on DenseNet because of the effectiveness of the EfficientNet.
4.3 Comparison withsome existing models
Firstly, in order to verify the improvement brought by the dual branches, a second ablation experiment is conducted here. The proposed dual-branch model is compared with two models using each branch only. Table 3 shows the comparative results. It can be observed from Table 3 that: (a) the temporal branch achieves a better overall performance than the spatial branch because the temporal inconsistency is usually more obvious than the spatial inconsistency; (b) the proposed dual-branch model obtains accuracies of over 90% for all four manipulation methods and the Celeb-DF dataset, and is superior to the other two single-branch models due to the consideration of both the temporal and spatial inconsistencies.
Then, the proposed dual-branch model is compared with seven state-of-the-art models. The seven compared models contain three spatial ones (MesoNet [1], Xception
Table 2 Ablation experimental results in the temporal branch by the detection accuracy (%)

Models                           FF++                                      Celeb-DF
                                 DeepFake   Face2Face   FaceSwap   NT
EfficientNet                     70.83      75.00       72.50      66.67   71.50
EfficientNet + LSTM (L = 1)      88.21      86.07       83.57      75.71   82.14
EfficientNet + Bi-LSTM (L = 1)   90.00      86.43       89.29      76.43   84.29
EfficientNet + Bi-LSTM (L = 5)   96.43      90.00       93.21      85.00   96.43
[25], and OC-FakeDect [17]) and four temporal ones (Optical Flow Features [3], C3D [11], CNN-LSTM [26], and S-MIL [21]). Notice that here the inputs of the three temporal models C3D, CNN-LSTM, and S-MIL are RGB frames instead of optical flow frames. The comparative results are given in Table 4. The results for the OC-FakeDect and S-MIL models are taken from the corresponding literature [17] and [21], respectively. The results in Table 4 show that our proposed dual-branch model outperforms the others in most cases, achieving the overall best performance among the eight compared models. In addition, the temporal models are usually better than the spatial models, which is consistent with the results given in Table 3. Besides, the eight compared models all achieve better performance on the Celeb-DF dataset than on the FF++ dataset. This is because the FF++ sub-dataset we actually used is C40 with the highest compression ratio, which is more challenging than Celeb-DF. The satisfactory results of the proposed model are attributed to the following main reasons: (a) both the spatial and temporal inconsistencies are considered, and they complement each other; (b) the optical flow frames are used as the input of the temporal branch; (c) the effective EfficientNet and Bi-LSTM models are applied in the two branches.
Fig. 10 Accuracies (%) of different temporal models on the FF++ dataset generated by four different
manipulation methods and Celeb-DF dataset
Table 3 Ablation experimental results in dual-branch fusion by the detection accuracy (%)

Models                 FF++                                      Celeb-DF
                       DeepFake   Face2Face   FaceSwap   NT
Spatial branch         91.43      92.50       92.86      80.71   95.36
Temporal branch        96.43      90.00       93.21      85.00   96.67
Proposed dual-branch   98.21      95.00       93.57      90.71   98.57
5 Conclusions
In this paper, we propose a dual-branch network model to detect DeepFake videos by exploiting the inconsistency in both the spatial and temporal domains. The experimental results and analysis on the FF++ dataset and Celeb-DF dataset show that the spatial and temporal inconsistencies complement each other, so that the proposed model achieves better performance than some existing spatial and temporal models. Moreover, the combination of EfficientNet and Bi-LSTM can capture the temporal inconsistency more effectively. As for future work, since all the current works, including ours, focus on DeepFake video detection in plaintext, we will try to detect encrypted DeepFake videos for privacy protection.
Acknowledgements This work was supported by the National Natural Science Foundation of China under
Grant 62072251, Natural Science Research Project of Jiangsu Universities under Grant 20KJB520021,
Higher Vocational Education Teaching Fusion Production Integration Platform Construction Projects of
Jiangsu Province under Grant No. 2019(26), the PAPD fund.
References
1. Afchar D, Nozick V, Yamagishi J, etal. (2018) Mesonet: a compact facial video forgery detection net-
work. In: Proceedings of the 2018 IEEE international workshop on information forensics and security
(WIFS2018), pp 1–7
2. Agarwal S, Farid H, Gu Y, etal. (2019) Protecting world leaders against Deep Fakes. In: Proceedings
of the IEEE conference on computer vision and pattern recognition workshops, pp 38–45
3. Amerini I, Galteri L, Caldelli R, et al. (2019) Deepfake video detection through optical flow based
CNN. In: Proceedings of the 2019 IEEE/CVF international conference on computer vision workshops,
pp 1205–1207
4. Barron JL, Fleet DJ, Beauchemin SS etal (1992) Performance of optical flow techniques. Int J Comput
Vis 12:43–77
5. Chen P, Liu J, Liang T, etal. (2020) FSSPOTTER: spotting face-swapped video by spatial and tem-
poral clues. In: Proceedings of the 2020 IEEE international conference on multimedia and expo
(ICME2020), pp 1–6
Table 4 Comparison results with other state-of-the-art models by the detection accuracy (%)

Models                        FF++                                      Celeb-DF
                              DeepFake   Face2Face   FaceSwap   NT
Spatial
  MesoNet [1]                 84.64      67.50       74.29      67.86   85.71
  Xception [25]               92.14      90.71       92.50      80.71   93.57
  OC-FakeDect [17]            88.35      71.20       86.05      97.45   89.03
Temporal
  Optical flow features [3]   72.86      76.43       69.29      64.64   75.00
  C3D [11]                    90.71      83.21       91.79      75.71   93.21
  CNN-LSTM [26]               94.64      89.29       94.29      81.43   95.36
  S-MIL [21]                  97.14      91.07       96.07      86.79   98.84
Spatial + temporal
  Proposed dual-branch        98.21      95.00       93.57      90.71   98.93
6. Chen B, Tan W, Coatrieux G etal (2020) A serial image copy-move forgery localization scheme with
source/target distinguishment. IEEE Trans Multimedia. https://doi.org/10.1109/TMM.2020.3026868
7. Chen B, Ju X, Xiao B etal (2021) Locally GAN-generated face detection based on an improved Xcep-
tion. Inf Sci 572:16–28
8. Ciftci UA, Demir I, Yin L (2020) FakeCatcher: detection of synthetic portrait videos using biological
signals. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2020.3009287
9. DeepFake Detection Challenge (DFDC). https://ai.facebook.com/datasets/dfdc/
10. Donahue J, Hendricks LA, Guadarrama S, etal. (2015) Long-term recurrent convolutional networks
for visual recognition and description. In: Proceedings of the 2015 IEEE conference on computer
vision and pattern recognition (CVPR2015), pp 2625–2634
11. Ganiyusufoglu I, Ngô LM, Savov N, etal. (2020) Spatio-temporal features for generalized detection of
deepfake videos. https://arxiv.org/abs/2010.11844
12. Gibson JJ (1950) The perception of the visual world. Houghton Mifflin, Boston
13. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of
the 2017 IEEE conference on computer vision and pattern recognition (CVPR2017), pp 1800–1807
14. Guera D, Delp EJ (2018) Deepfake video detection using recurrent neural networks. In: Proceedings of
the 15th IEEE international conference on advanced video and signal based surveillance (AVSS2018),
pp 1–6
15. Huang G, Liu Z, Maaten LVD, etal. (2017) Densely connected convolutional networks. In: Pro-
ceedings of the 2017 IEEE conference on computer vision and pattern recognition (CVPR2017), pp
2261–2269
16. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural
networks. Neural Inf Process Syst 25:1097–1105
17. Khalid H, Woo SS (2020) OC-FakeDect: classifying deepfakes using one-class variational autoen-
coder. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition work-
shops, pp 656–657
18. Li Y, Lyu S (2018) Exposing deepfake videos by detecting face warping artifacts. In: Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 46–52
19. Li Y, Chang M, Lyu S (2018) Exposing ai created fake videos by detecting eye blinking. In: The 2018
IEEE international workshop on information forensics and security (WIFS2018), pp. 1–7
20. Li L, Bao J, Zhang T, etal. (2020) Face x-ray for more general face forgery detection. In: Proceed-
ings of the 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR2020), pp
5001–5010
21. Li X, Lang Y, Chen Y, etal. (2020) Sharp multiple instance learning for deepfake video detection. In:
Proceedings of the 28th ACM international conference on multimedia, pp 1864–1872
22. Lima OD, Franklin S, Basu S etal (2020) Deepfake detection using spatiotemporal convolutional net-
works.https:// arxiv. org/ abs/ 2006. 14749
23. Matern F, Riess C, Stamminger M (2019) Exploiting visual artifacts to expose deepfakes and face
manipulations. In: Proceedings of the 2019 IEEE winter applications of computer vision workshops
(WACVW2019), pp 83–92
24. Nguyen TT, Nguyen CM, Nguyen DT, etal (2019) Deep learning for deepfakes creation and detection.
https://arxiv.org/abs/1909.11573
25. Rossler A, Cozzolino D, Verdoliva L, etal. (2019) Faceforensics++: learning to detect manipulated
facial images. In: Proceedings of the 2017 IEEE/CVF international conference on computer vision, pp
1–11
26. Sabir E, Cheng J, Jaiswal A, et al (2019) Recurrent convolutional strategies for face manipulation
detection in videos. In: Proceedings of the 2018 IEEE/CVF international conference on computer
vision workshops, pp 80–87
27. Sandler M, Howard A, Zhu M, etal. (2018) MobileNetV2: inverted residuals and linear bottlenecks.
In: Proceedings of the 2018 IEEE/CVF conference on computer vision and pattern recognition
(CVPR2018), pp 4510–4520
28. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process
45(11):2673–2681
29. Selvaraju RR, Cogswell M, Das A (2017) Grad-cam: visual explanations from deep networks via
gradient-based localization. In: Proceedings of the 2017 IEEE international conference on computer
vision (CVPR2017), pp 618–626
30. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos.
Adv Neural Inf Process Syst 27:568–576
31. Szegedy C, Liu W, Jia Y, etal. (2015) Going deeper with convolutions. In: Proceedings of the 2015
IEEE conference on computer vision and pattern recognition (CVPR2015), pp 1–9
42605Multimedia Tools and Applications (2022) 81:42591–42606
1 3
32. Tan M, Le QV (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In:
Proceedings of 2019 international conference on machine learning, pp 6105–6114
33. Tolosana R, Vera-Rodriguez R, Fierrez J etal (2020) Deepfakes and beyond: a survey of face manipu-
lation and fake detection. Inf Fusion 64:131–148
34. Tran D, Wang H, Torresani L, et al. (2018) A closer look at spatiotemporal convolutions for action recog-
nition. In: Proceedings of the 2018 IEEE conference on computer vision and pattern recognition
(CVPR2018), pp 6450–6459
35. Verdoliva L (2020) Media forensics and deepfakes: an overview. IEEE J Select Topic Signal Process
14(5):910–932
36. Yang X, Li Y, Lyu S (2019) Exposing deep fakes using inconsistent head poses. In: Proceedings of
2019 IEEE international conference on acoustics, speech and signal processing (ICASSP2019), pp
8261–8265
37. Zach C, Pock T, Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In: Pro-
ceedings of the 29th DAGM conference on pattern recognition, pp 214–223
38. Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. https://arxiv.org/abs/1409.2329
39. Zhang K, Zhang Z, Li Z etal (2016) Joint face detection and alignment using multitask cascaded con-
volutional networks. IEEE Signal Process Lett 23(10):1499–1503
40. Zhang D, Chen X, Li F etal (2020) Seam-carved image tampering detection based on the cooccur-
rence of adjacent LBPs. Secur Commun Netw. https://doi.org/10.1155/2020/8830310
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.