DIP: Diffusion Learning of Inconsistency Pattern
for General DeepFake Detection
Fan Nie, Jiangqun Ni, Member, IEEE, Jian Zhang, Bin Zhang, Weizhe Zhang, Senior Member, IEEE,
Abstract—With the advancement of deepfake generation tech-
niques, the importance of deepfake detection in protecting
multimedia content integrity has become increasingly obvious.
Recently, temporal inconsistency clues have been explored to
improve the generalizability of deepfake video detection. Accord-
ing to our observation, the temporal artifacts of forged videos
in terms of motion information usually exhibit quite distinct
inconsistency patterns along the horizontal and vertical directions,
which could be leveraged to improve the generalizability of
detectors. In this paper, a transformer-based framework for
Diffusion Learning of Inconsistency Pattern (DIP) is proposed,
which exploits directional inconsistencies for deepfake video
detection. Specifically, DIP begins with a spatiotemporal encoder
to represent spatiotemporal information. A directional incon-
sistency decoder is adopted accordingly, where direction-aware
attention and inconsistency diffusion are incorporated to explore
potential inconsistency patterns and jointly learn the inherent
relationships. In addition, the SpatioTemporal Invariant Loss
(STI Loss) is introduced to contrast spatiotemporally augmented
sample pairs and prevent the model from overfitting nonessential
forgery artifacts. Extensive experiments on several public datasets
demonstrate that our method could effectively identify directional
forgery clues and achieve state-of-the-art performance.
Index Terms—Deepfake detection, Vision transformer, Graph
diffusion learning
I. INTRODUCTION
WITH the rapid development of AI generative models
and social networks, a large number of fake videos
generated by advanced face forgery methods such as Deepfake
Manuscript received April 15, 2024; revised July 8, 2024; accepted Septem-
ber 8, 2024. This work was supported in part by the National Natural Science
Foundation of China under Grants U23B2022 and U22A2030; in part by
Guangdong Major Project of Basic and Applied Basic Research under Grant
2023B0303000010; in part by the Major Key Project of PCL under Grant
PCL2023A05. The associate editor coordinating the review of this article and
approving it for publication was Dr. Richang Hong. (Corresponding author:
Jiangqun Ni.)
Fan Nie is with the School of Computer Science and Engineering, Sun
Yat-sen University, Guangzhou 510006, China, and also with the Department
of New Networks, Pengcheng Laboratory, Shenzhen 518066, China (e-mail:
nief6@mail2.sysu.edu.cn).
Jiangqun Ni is with the School of Cyber Science and Technology, Sun
Yat-sen University, Shenzhen 510275, China, and also with the Department
of Networks, Pengcheng Laboratory, Shenzhen 518066, China (e-mail: issjqni@mail.sysu.edu.cn).
Jian Zhang is with the School of Computer Science and Engi-
neering, Sun Yat-sen University, Guangzhou 510006, China (e-mail:
zhangj266@mail2.sysu.edu.cn).
Bin Zhang is with the Department of Networks, Pengcheng Laboratory,
Shenzhen 518066, China (e-mail: bin.zhang@pcl.ac.cn).
Weizhe Zhang is with the School of Cyberspace Science, Harbin Institute
of Technology, Harbin 150001, China, and also with the Department of
New Networks, Peng Cheng Laboratory, Shenzhen 518066, China (e-mail:
wzzhang@hit.edu.cn).
[Fig. 1 panels: (a) video clips (real and fake); (b) motion information extraction (horizontal/vertical); (c) temporal slice of a certain region; (d) motion evolution analysis (horizontal and vertical changes over timestep and location).]
Fig. 1. Illustration of the temporal inconsistencies. For a pair of real and fake videos, the motion information in terms of optical flow is extracted and visualized with the TVL1 algorithm [6]. Each optical flow frame is then sliced to obtain horizontal and vertical motion slices for real and fake videos. The comparison between real and fake videos for the average temporal motion evolution reveals the inconsistency along both the horizontal and vertical directions.
[1], Face2Face [2], Faceswap [3], HifaFace [4], and Neu-
ralTexture [5] could be publicly disseminated, which could
disrupt the order of cyberspace, and raise serious security con-
cerns. To ensure the authenticity and integrity of multimedia
data, it is of great significance to develop effective deepfake
detection methods.
Prior studies [7]–[9] formulate deepfake detection as a
vanilla classification task and achieve satisfactory performance
for the in-dataset scenario. However, their performance in
terms of generalizability usually drops significantly under
cross-dataset evaluation, where unseen facial manipulation
techniques, data distributions, and distortions are involved.
This prevents the detectors from being deployed in real-world
applications. Recently, the spatiotemporal inconsistencies aris-
ing from the facial manipulation have been explored [10]–[13]
to capture forgery clues, exhibiting promising generalizabil-
ity. Zhang et al. [11] utilized convolutional neural networks
(CNNs) and temporal dropout to extract discriminative clues.
Gu et al. [12] proposed a region-aware temporal filter deployed
in CNNs to capture local spatiotemporal inconsistencies. Simi-
larly, Guan et al. [13] incorporated the advanced Vision Trans-
former [14] and used a local sequence transformer to capture
long-term temporal inconsistencies. In general, both spatial
and temporal artifacts could be explored to expose deepfake
videos. However, detectors without elaborate spatiotemporal
modeling tend to capture more salient clues in either the spatial
or temporal domain [15] and do not take full advantage of
spatiotemporal representation, leading to less generalization
capability. Therefore, how to effectively represent the spatial
and temporal forgery cues as a whole remains an open issue
in the image forensic community.
To address the aforementioned problem, we extract and
visualize the motion information in terms of optical flow for
a pair of real and fake video clips with the TVL1 algorithm
[6], as shown in Fig. 1(b), where red and green correspond to
the displacements along the horizontal and vertical directions
and brighter colors indicate larger values. We then slice each
optical flow frame vertically at a fixed horizontal location,
i.e., within the red dotted box in Fig. 1(b), and concatenate
them in terms of horizontal and vertical displacements to
obtain both the horizontal and vertical motion slices for real
and fake videos respectively, as shown in Fig. 1(c). Fig. 1(d)
shows the comparison between real and fake videos for the
average temporal motion evolution within the solid red box
in Fig. 1(c) along the horizontal and vertical directions. It
is quite evident that the motion slices for real video (either
horizontal or vertical) evolve much more smoothly whereas
those for fake video exhibit more discontinuous patterns. It
is the direction of motion that allows us to characterize the
temporal inconsistency from a new perspective. In addition, a
close observation of Fig. 1 reveals that the temporal motion
for both real and fake videos features local similarity and
diffusion. Specifically, in an optical flow frame, regions that are spatially close to each other exhibit similar motion patterns (local similarity). The motion patterns in one region gradually affect its surroundings to some extent, known as diffusion, which has been largely ignored by prior studies.
Based on the above observations, we propose the Diffusion
Learning of Inconsistency Pattern (DIP) framework with a vi-
sion transformer. For the backbone of DIP, considering the dif-
ference in representation between spatial and temporal forgery
artifacts, the asymmetric spatiotemporal attention mechanism
is adopted to balance the interaction between spatial and
temporal artifacts. To effectively capture the forgery clues
along horizontal and vertical directions, directional embed-
dings are obtained through directional pooling operations. The
DIP structure allows us to take the temporal inconsistency and
the characteristics of temporal motion for both real and fake
videos into consideration and to represent the forgery videos
through the following modules:
•Inconsistency Diffusion Module (IDM). As shown in
Fig. 1(b), the temporal motions of both real and fake
videos exhibit diffusion effects along the horizontal and
vertical directions, and have their own diffusion patterns.
The term "diffusion" is used to illustrate the extent to
which the motion of one region affects its surroundings.
The IDM is introduced to learn the diffusion intensity
(distance) [16], [17] among neighboring regions along
the horizontal and vertical directions, where the IDM
regards directional region features as graph nodes and
incorporates graph-based diffusion to learn regional dif-
fusion intensities effectively. Equipped with this module,
the DIP could effectively capture the forgery artifacts in
the horizontal and vertical directions from the perspective
of inconsistency diffusion. On the other hand, the learned
diffusion patterns (distances) are exploited to optimize
the spatiotemporal features with the DIP backbone in a
weakly supervised manner.
•Directional Cross Attention (DiCA). With DiCA,
direction-aware multi-head cross attention is adopted to
learn general forgery features through directional inter-
action. By cross-attention between the horizontal and
vertical directions, DiCA could effectively characterize
the directional discrepancies between real and fake videos
and learn more discriminative and general forgery fea-
tures for deepfake detection. In addition, the diffusion
patterns obtained by IDM are incorporated in DiCA
as attention bias to provide a comprehensive view of
directional interactions.
To further improve the generalizability and robustness per-
formance of the proposed DIP, spatiotemporal data augmen-
tation (DA) is implemented to capture critical information
in facial forgeries for deepfake detection. Unlike the work
in [11], [15], both spatial and temporal DAs are adopted
by the well-devised triplet data structure, i.e., anchor, pos-
itive and negative, to suppress specific and trivial forgery
traces. For spatial DA, the specific spatial clues, e.g., high-
frequency artifacts, color mismatches, and noise patterns, are
effectively mitigated by applying Gaussian blur, color satu-
ration, Gaussian noise, etc., whereas for temporal DA, frame
dropping and repeating are employed to attenuate the temporal
motion artifacts. The proposed spatiotemporal DA facilitates the model in learning more general forgery representations and prevents it from getting trapped in local optima. On the other
hand, a SpatioTemporal Invariant Loss (STI Loss) is developed
to incorporate the spatiotemporally augmented samples in a
triplet, which pulls the anchor and positive samples closer
while pushing the negative samples from the anchor in the
feature space, thus improving the generalizability of the DIP.
The main contributions of the paper are summarized as
follows:
•A new deepfake video detection framework - DIP is
proposed, which exploits the spatiotemporal inconsis-
tency of deepfake video along the horizontal and vertical
directions for general deepfake detection.
•The Directional Cross Attention (DiCA) is devised to
model the temporal inconsistency of deepfake video along
horizontal and vertical directions, while the Inconsistency
Diffusion Module (IDM) is designed to characterize the
temporal artifacts in terms of diffusion patterns along
horizontal and vertical directions. These two modules
facilitate the model in learning more discriminative spa-
tiotemporal representations for deepfake detection.
•A new SpatioTemporal Invariant Loss (STI Loss) is
developed, which incorporates spatiotemporal data aug-
mentation with triplet structure to drive the model to learn
prominent representations for general and robust deepfake
detection.
•Experimental results demonstrate the effectiveness of the
proposed modules and the superior performance in terms
of generalizability and robustness.
II. RELATED WORK
A. DeepFake Detection
Owing to the diversity of practical applications and the unknown distortions introduced by the underlying channels, performance in terms of generalizability and robustness is the primary goal of deepfake detection. To this end, the following inconsistency clues have been explored for deepfake video detection, i.e., spatial and frequency clues, postprocessing clues, and temporal clues.
In [18], [19], spatial clues are captured by integrating local
artifacts with high-level semantic features. Equipped with DCT
and Wavelet transform, [20]–[23] exploit both spatial and
frequency inconsistency learning to enhance the generalization
performance. On the other hand, the artifacts that appear in
facial boundaries due to the blending operations of existing
deepfake techniques are also leveraged to significantly improve
the performance of deepfake detection [24]–[26].
Compared with image-based methods, video-based methods
[12], [27]–[30] take forgery clues in the temporal domain
into consideration, e.g., the visual and biological inconsistency
patterns such as abnormal facial movements [29], [30], in-
consistent eye blinking [31], non-synchronous lip movements
[28]. With the development of more advanced techniques, the generated facial images tend to be more visually convincing, which would inevitably render the above methods less effective.
Given all this, in this paper, the spatiotemporal inconsisten-
cies are characterized with directional temporal representations
(along the horizontal and vertical directions) via directional in-
teraction and diffusion learning for general deepfake detection.
B. Temporal Analysis
Many efforts have been made with the hope of modeling
temporal dependency, which is the essence of video-relevant
tasks. Initially, 3D CNNs are usually used for temporal model-
ing, although they are computationally intensive. Some meth-
ods [32], [33] incorporate 2D CNNs with different dimensions
to perform temporal modeling, which, however, fails to capture
the long-term dependency among video frames. Owing to their prominent long-term modeling capability, transformer-based methods [34], [35] have been developed for temporal modeling and achieve promising results.
Recently, the transformer-based architecture has also been
adopted to learn the spatiotemporal inconsistencies for deep-
fake video detection. In [27], [29], [36], stacked CNNs and
transformers are used to learn frame-level forgery represen-
tations and expose the temporal forgery clues by exploring
the multi-frame forgery representation. In [13], a hybrid model of 3D CNN and transformer is developed to learn the temporal inconsistency of deepfake videos. However, its performance on stationary videos is less prominent since the spatial clues are not fully taken into consideration.
For the proposed DIP, both the spatial and temporal in-
consistencies are taken into consideration by transforming the
input video clip into directional representations. The incon-
sistency between real and fake videos is then modeled from
two different perspectives, i.e., directional cross-attention and
diffusion, which contributes to better spatiotemporal inconsis-
tency learning.
III. METHOD
A. DIP Pipeline
As illustrated in Fig. 2, the proposed DIP framework mainly
consists of three components, i.e., a SpatioTemporal Encoder
(STE), a Joint Directional Inconsistency Decoder (DID), and
a Cross-Direction Classifier (MDC), whose functionalities are
as follows:
•The STE first extracts spatiotemporal features with a
unified transformer structure. The extracted features are
then split into two directional features via directional
pooling operations, which serve as the key assets for
pattern modeling and feature fusion.
•The DID is adopted to learn directional inconsistency
patterns simultaneously. Equipped with Directional Cross
Attention (DiCA) and Inconsistency Diffusion Module
(IDM), the DID is expected to learn a better cross-
direction inconsistency representation for deepfake detec-
tion.
•The well-learned directional inconsistency features are
then exploited by MDC to classify fake and real videos.
B. SpatioTemporal Encoder
Given the input video clip $V \in \mathbb{R}^{T \times M \times M \times 3}$ with $T$ RGB frames, where without loss of generality square frames of size $M \times M$ are assumed, each frame is decomposed into $L \times L$ non-overlapping patches of size $P \times P$, $L = M/P$, as shown in Fig. 2. The video clip $V$ is then transformed into the spatiotemporal token sequence $X \in \mathbb{R}^{T \times L \times L \times E}$ by linearly embedding each patch in $V$ into an $E$-dim token. Let $X(t, i, j)$ denote the $(i, j)$-th token in the $t$-th frame; we then obtain $\hat{X} \in \mathbb{R}^{T \times (L \cdot L + 1) \times E}$ by appending a classification token at the first position of each token frame, $\{X(t, 0, 0)\}$, $t = 1, \cdots, T$, and embedding the spatial relation $e_s \in \mathbb{R}^{(L \cdot L + 1) \times E}$ and the temporal relation $e_t \in \mathbb{R}^{T \times E}$ into each token $X(t, i, j)$, $t = 1, \cdots, T$, $i, j = 1, \cdots, L$.
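As an illustration, the following minimal PyTorch sketch (our own, not the authors' released code) shows the tokenization step described above: patch embedding, a per-frame classification token, and the spatial/temporal relation embeddings $e_s$ and $e_t$. The module name and the zero initialization of the embeddings are assumptions for brevity.

```python
import torch
import torch.nn as nn

class ClipTokenizer(nn.Module):
    def __init__(self, frame_size=224, patch_size=16, embed_dim=768, num_frames=16):
        super().__init__()
        L = frame_size // patch_size                       # L = M / P
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, 1, embed_dim))
        self.pos_spatial = nn.Parameter(torch.zeros(1, 1, L * L + 1, embed_dim))   # e_s
        self.pos_temporal = nn.Parameter(torch.zeros(1, num_frames, 1, embed_dim))  # e_t

    def forward(self, video):                              # video: (B, T, 3, M, M)
        B, T, C, H, W = video.shape
        x = self.proj(video.flatten(0, 1))                 # (B*T, E, L, L)
        x = x.flatten(2).transpose(1, 2)                   # (B*T, L*L, E)
        x = x.reshape(B, T, -1, x.shape[-1])               # (B, T, L*L, E)
        cls = self.cls_token.expand(B, T, 1, -1)           # one CLS token per frame
        x = torch.cat([cls, x], dim=2)                     # (B, T, L*L+1, E)
        return x + self.pos_spatial + self.pos_temporal    # broadcast adds e_s and e_t

tokens = ClipTokenizer()(torch.randn(2, 16, 3, 224, 224))
print(tokens.shape)                                        # torch.Size([2, 16, 197, 768])
```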
[Fig. 2 components: video clip → STE (stacked STE blocks with spatial and temporal attention) → pooling/reshape into horizontal and vertical sequences with classification tokens → DID (Di-CrossAtt and IDM, ×N) → MDC (FC) → real or fake.]
Fig. 2. Overview of the proposed DIP. STE extracts forgery spatiotemporal features with embedded sequences. The DID then models inconsistency patterns and fuses features, and MDC exploits the classification tokens for final prediction.
[Fig. 3 components: the token sequence is reshaped into spatial patches (T × L² × E) or temporal patches (L² × T × E); each branch applies multi-head attention, add & norm, and feed forward; spatial attention ×3, temporal attention ×1; the classification token is handled by temporal averaging and duplication.]
Fig. 3. Illustration of a unit STE block. Spatial attention is used to extract spatial dependency for each frame, and temporal attention is applied to characterize the temporal dependency at a specific location across multiple frames.
The token sequence $\hat{X}$ is then fed into the SpatioTemporal Encoder (STE) to obtain the discriminative spatiotemporal representation $Z \in \mathbb{R}^{T \times (L \cdot L + 1) \times E}$. The STE consists of three stacked asymmetric spatiotemporal attention modules, each implementing three successive layers of spatial attention over each token frame $\{\hat{X}(t, i, j)\}_{i,j}$ of size $L \times L$, followed by one layer of temporal attention over each temporal token sequence $\{\hat{X}(t, i, j)\}_t$ of size $T$ at spatial position $(i, j)$, as shown in Fig. 3. Note that the classification tokens are excluded when forwarding the token sequence into temporal attention. To better explore the fine-grained directional forgery clues, the spatiotemporal representation $Z$ is then transformed into horizontal and vertical sequences $Z_h$ and $Z_v$, each of size $L \times E$, via spatial and temporal pooling operations. For $Z_h$, we first perform spatial pooling by averaging or maximizing each row $\{\hat{X}(t, i, j)\}_j$ (the $i$-th row) of the $t$-th frame to obtain the token sequence $\{\hat{Z}(t, i)\}_i$ of size $L$, which is then averaged or maximized over frames for temporal pooling to obtain $Z_h$, as illustrated in Fig. 2. The vertical sequence $Z_v$ is derived similarly.
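A minimal sketch of the directional pooling described above is given below, assuming average pooling and batched tensors; dropping the classification token before pooling is our simplification.

```python
import torch

def directional_pooling(Z, L):
    # Z: (B, T, L*L+1, E); drop the per-frame classification token before pooling.
    B, T, _, E = Z.shape
    patches = Z[:, :, 1:, :].reshape(B, T, L, L, E)   # (B, T, rows, cols, E)
    Z_h = patches.mean(dim=3).mean(dim=1)             # average each row over columns, then frames -> (B, L, E)
    Z_v = patches.mean(dim=2).mean(dim=1)             # average each column over rows, then frames -> (B, L, E)
    return Z_h, Z_v

Z = torch.randn(2, 16, 14 * 14 + 1, 768)
Z_h, Z_v = directional_pooling(Z, L=14)
print(Z_h.shape, Z_v.shape)                           # torch.Size([2, 14, 768]) twice
```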
C. Joint Directional Inconsistency Decoder
With the directional spatiotemporal representation in terms of $Z_h$ and $Z_v$, we then proceed to develop the Joint Directional Inconsistency Decoder (DID) to obtain the directional inconsistency representation. The DID is used to integrate the directional features by incorporating the inconsistency patterns.
According to the temporal characteristics of real/fake videos,
as shown in Fig. 1, the inconsistency patterns are characterized
from two different perspectives, i.e., directional interaction
and diffusion, which are obtained by the proposed Directional
Cross Attention (DiCA) and Inconsistency Diffusion Module
(IDM). In addition to the attention scores calculated by DiCA,
the diffusion distances by IDM are also taken advantage
of to learn the inherent inconsistency relationships among
directional features.
1) Inconsistency Diffusion Module (IDM): As illustrated
in Fig. 1, there are quite distinct motion diffusion differences
between real and fake videos, which have not been explored
in prior arts. Modeling such diffusion patterns provides a
different view to characterize the crucial distinction between
real and fake videos. To this end, a graph-based Inconsistency
Diffusion Module (IDM) is adopted to calculate the diffu-
sion distances [37] among nodes on the graph structure via
multistep random walking. With the horizontal and vertical sequences $Z_h$ and $Z_v$, the IDM is used to characterize not only the diffusion patterns along the same direction ($Z_h$ or $Z_v$) but also those across directions (between $Z_h$ and $Z_v$).
We exclude the classification tokens of both sequences and construct a graph $G = (S, E)$ with a node set $S$ of size $2L$, comprising two types of nodes, i.e., $S_h = \{s_1^h, s_2^h, \cdots, s_L^h\}$ from the horizontal token sequence and $S_v = \{s_1^v, s_2^v, \cdots, s_L^v\}$ from the vertical token sequence, and an edge set $E$. Note that $f_i^h$ and $f_i^v$ ($i = 1, \cdots, L$) represent the embeddings of the nodes $s_i^h$ and $s_i^v$, respectively, and the edge set $E$ could be derived from the associated nodes and their neighborhoods.
Computation of the Transition Matrix. For the graph $G = (S, E)$, we assume that any two nodes in $G$ with closer spatiotemporal embeddings are more similar to each other in motion pattern, and the transition matrix among nodes is then used to illustrate the motion patterns of real and fake videos, where two similar nodes have greater transition probabilities.
Note that the transition matrix serves to measure two types of similarities among nodes in $G$, i.e., the similarity between nodes in the same direction ($Z_h$ or $Z_v$) and that across directions (between $Z_h$ and $Z_v$).
[Fig. 4 layout: the horizontal sequence $Z_h$ and vertical sequence $Z_v$ (each of length $L$) jointly form the motion similarity transition matrix $P$ via Eqs. (1)-(2).]
Fig. 4. Calculation of the motion similarity transition matrix with the horizontal and vertical token sequences $Z_h$ and $Z_v$. The transition matrix $P$ of size $2L \times 2L$ consists of four types of submatrices ($L \times L$), i.e., horizontal-horizontal transition ($P_{hh}$), horizontal-vertical transition ($P_{hv}$), vertical-horizontal transition ($P_{vh}$), and vertical-vertical transition ($P_{vv}$).
We then define the score matrix $W$ among nodes in $G$, comprising four submatrices, i.e.,
$$W = \begin{bmatrix} W_{hh} & W_{hv} \\ W_{vh} & W_{vv} \end{bmatrix}, \qquad (1)$$
where $W_{hh}$ and $W_{vv}$ are used to measure the similarity between two nodes in the same direction, while $W_{hv}$ and $W_{vh}$ represent the similarity between two nodes from different directions, each of size $L \times L$. For $W_{hh}$, let $i$ and $j \in N_h^h(i)$ be a node and its neighbor in $S_h$; we have
$$W_{hh}(i, j) = \begin{cases} \exp(-\mu \| f_i^h - f_j^h \|_2^2), & \text{for } j \in N_h^h(i), \\ 0, & \text{otherwise,} \end{cases} \qquad (2)$$
where $\mu$ is a learnable parameter initialized as 0.05, and the neighbor size $k_n = |N_h^h(i)|$ is a hyperparameter. For constructing the cross-direction submatrices, e.g., $W_{hv}$, we define the cross-direction neighbor set $N_v^h(i)$ of the node $i \in S_h$, which includes the nodes $s_j^v \in S_v$, $j \in \{i - \frac{k_n}{2}, \cdots, i + \frac{k_n}{2}\}$. $W_{vh}$ could be obtained similarly.
Note that, according to our arrangement of the score matrix $W$, for the graph $G$ with $2L$ nodes, node $s_i \in S_h$ for $i \in [1, \cdots, L]$, while $s_i \in S_v$ for $i \in [L+1, \cdots, 2L]$. We then have the probabilistic transition matrix $P$:
$$P = D^{-1} W, \qquad (3)$$
where $D$ is a diagonal normalization matrix with $D(i, i) = \sum_j W(i, j)$, and $P = [p(i, j)]$ denotes the probability of a stochastic walk from $s_i$ to $s_j$ in one step. As illustrated in Fig. 4, the transition matrix $P$ is symmetric and could be used to characterize spatiotemporal motion changes or motion diffusion patterns in both directions.
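The following sketch (ours) instantiates Eqs. (1)-(3) under the stated definitions; since the within-direction neighborhood $N_h^h(i)$ is specified only by its size $k_n$, an index-window neighborhood is assumed here for both the within- and cross-direction blocks.

```python
import torch

def transition_matrix(Z_h, Z_v, k_n=7, mu=0.05):
    # Z_h, Z_v: (L, E) directional token sequences (classification tokens excluded).
    L = Z_h.shape[0]
    F = torch.cat([Z_h, Z_v], dim=0)                        # node embeddings, (2L, E)
    dist2 = torch.cdist(F, F).pow(2)                        # squared pairwise distances
    W = torch.exp(-mu * dist2)                              # similarity scores, Eq. (2)

    # Keep only edges between index-wise neighbors (|i - j| <= k_n // 2), both within
    # a direction and across directions; all other entries are set to zero.
    idx = torch.arange(L)
    near = (idx[:, None] - idx[None, :]).abs() <= k_n // 2  # (L, L) neighborhood mask
    mask = near.repeat(2, 2)                                # same rule for all four blocks of Eq. (1)
    W = W * mask

    D_inv = torch.diag(1.0 / W.sum(dim=1))                  # D(i, i) = sum_j W(i, j)
    return D_inv @ W                                        # transition matrix P, Eq. (3)

P = transition_matrix(torch.randn(14, 768), torch.randn(14, 768))
print(P.shape, P.sum(dim=1)[:3])                            # (28, 28); rows sum to 1
```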
Motion Similarity Diffusion. With the transition matrix $P = [p(i, j)]$, we have $P^t = [p_t(i, j)] = \underbrace{P \cdot P \cdots P}_{t\ \text{times}}$, where $p_t(i, j)$ represents the transition probability of walking over $S$ from $s_i$ to $s_j$ in $t$ steps. Following [17], the motion similarity diffusion distance between $s_i$ and $s_j$ is defined as the sum of the squared differences between the probabilities of random walkers starting from $s_i$ and $s_j$ and ending up at the same node in graph $G$ in $t$ steps, i.e.,
$$D_t(i, j) = \sum_k \big(P^t(i, k) - P^t(j, k)\big)^2 \, \hat{w}(k), \qquad (4)$$
where $\hat{w}(k)$ is the reciprocal of the local density at $s_k$. The diffusion distance $D_t(i, j)$ is smaller if the nodes $s_i$ and $s_j$ are connected to intermediate nodes with similar transition probabilities, indicating that the two nodes have a similar motion diffusion pattern.
Computational Acceleration for Diffusion. In practice, the iterative computation of Eq. (4) is computationally intensive, and the diffusion process could be significantly accelerated by spectral decomposition of the transition matrix $P$. According to [16], $P$ has a set of $2L$ real eigenvalues $\{\lambda_r\}_{r=1}^{2L}$ with $\lambda_1 = 1 \geq \lambda_2 \geq \cdots \geq \lambda_{2L} \geq 0$ if at least one connected path exists between any two nodes in $S$, which is guaranteed by the above graph construction. The corresponding eigenvectors are denoted as $\Phi_1, \Phi_2, \cdots, \Phi_{2L}$. The diffusion distance $D_t(i, j)$ can then be rewritten as
$$D_t(i, j) = \sum_{r=1}^{2L} \lambda_r^{2t} \big(\Phi_r(i) - \Phi_r(j)\big)^2. \qquad (5)$$
Identical to the arrangement of the transition matrix $P$ shown in Fig. 4, $D_t$ also characterizes two types of diffusion distances for real and fake videos: (1) the diffusion distances in the same direction, i.e., $D_{hh}^t$ and $D_{vv}^t$, which serve as pseudo labels in the proposed DA Loss in a weakly supervised way, as described later in Section III-E2; and (2) the diffusion distances across directions, i.e., $D_{hv}^t$ and $D_{vh}^t$, which are incorporated into the cross-attention module to obtain discriminative spatiotemporal representations.
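A minimal sketch of the spectrally accelerated diffusion distance in Eq. (5) is given below, following standard diffusion-map computations; absorbing the density weighting $\hat{w}(k)$ of Eq. (4) into the normalization is our assumption rather than the authors' exact procedure.

```python
import torch

def diffusion_distances(W, t=20):
    # W: (N, N) non-negative score matrix from Eq. (1)-(2).
    d = W.sum(dim=1)
    D_inv_sqrt = torch.diag(d.rsqrt())
    A = D_inv_sqrt @ W @ D_inv_sqrt                 # symmetric conjugate of P = D^-1 W
    lam, psi = torch.linalg.eigh(A)                 # real eigenvalues / orthonormal eigenvectors
    phi = D_inv_sqrt @ psi                          # eigenvectors of P itself
    emb = phi * lam.pow(t)                          # diffusion-map embedding at step t
    return torch.cdist(emb, emb).pow(2)             # D_t(i, j) = sum_r lam_r^(2t)(Phi_r(i)-Phi_r(j))^2

W = torch.rand(28, 28)
W = (W + W.T) / 2                                   # a symmetric toy score matrix
Dt = diffusion_distances(W, t=20)
print(Dt.shape)                                     # torch.Size([28, 28])
```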
2) Directional Cross Attention (DiCA): In our implementation of DiCA with six cross-attention layers, the diffusion distances $D_{hv}^t$ and $D_{vh}^t$ are also taken into account and serve as biases to capture more crucial directional clues. Concretely, we first transform $D_{hv}^t$ and $D_{vh}^t$ into diffusion similarity matrices via a negative exponential function. For one DiCA block, given the directional sequences $Z_h$ and $Z_v$, the learnable parameters $W_q^h$, $W_q^v$ are used for query projection, $W_k^h$, $W_k^v$ for key projection, and $W_v^h$, $W_v^v$ for value projection in the horizontal and vertical directions. Then, with the Softmax function $\phi$, we have:
$$Z'_h = \phi\!\left( \frac{W_q^v Z_v \cdot (W_k^h Z_h)^\top}{\sqrt{E}} + \exp(-\tau_1 \cdot D_{vh}^t) \right) W_v^h Z_h, \qquad (6)$$
$$Z'_v = \phi\!\left( \frac{W_q^h Z_h \cdot (W_k^v Z_v)^\top}{\sqrt{E}} + \exp(-\tau_1 \cdot D_{hv}^t) \right) W_v^v Z_v, \qquad (7)$$
where $Z'_h$ and $Z'_v$ are the outputs of the cross-attention layer, and $\tau_1$ is a learnable scale parameter. The proposed pipeline for fusing directional inconsistency clues takes advantage of both directional interaction and cross-diffusion patterns, with the expectation of capturing discriminative inconsistency features.
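The following single-head sketch (ours; the paper uses multi-head attention with six layers) illustrates how Eqs. (6)-(7) add the diffusion similarity $\exp(-\tau_1 \cdot D^t)$ as an attention bias.

```python
import torch
import torch.nn as nn

class DiCABlock(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.q_h, self.k_h, self.v_h = (nn.Linear(dim, dim) for _ in range(3))
        self.q_v, self.k_v, self.v_v = (nn.Linear(dim, dim) for _ in range(3))
        self.tau1 = nn.Parameter(torch.tensor(1.0))   # learnable scale for the diffusion bias
        self.scale = dim ** -0.5                      # 1 / sqrt(E)

    def forward(self, Z_h, Z_v, Dt_hv, Dt_vh):
        # Z_h, Z_v: (B, L, E); Dt_hv, Dt_vh: (L, L) cross-direction diffusion distances.
        attn_h = torch.softmax(self.q_v(Z_v) @ self.k_h(Z_h).transpose(-2, -1) * self.scale
                               + torch.exp(-self.tau1 * Dt_vh), dim=-1)
        attn_v = torch.softmax(self.q_h(Z_h) @ self.k_v(Z_v).transpose(-2, -1) * self.scale
                               + torch.exp(-self.tau1 * Dt_hv), dim=-1)
        return attn_h @ self.v_h(Z_h), attn_v @ self.v_v(Z_v)   # Z'_h, Z'_v

block = DiCABlock()
Zp_h, Zp_v = block(torch.randn(2, 14, 768), torch.randn(2, 14, 768),
                   torch.rand(14, 14), torch.rand(14, 14))
print(Zp_h.shape, Zp_v.shape)
```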
D. Spatiotemporal Data Augmentation
Data Augmentation (DA) has long been regarded as an effective way to improve the performance of representation learning. Advanced generation algorithms keep emerging, which could produce visually convincing facial images with much fewer artifacts; therefore, deepfake detection schemes trained on existing datasets may only exhibit inferior generalization performance.
Inspired by the works in [15], [24], [26], we also take advantage of a data augmentation strategy to encourage the detector to capture more substantial artifacts. Specifically, we augment video samples in two ways, as sketched below. (1) Spatial DA. To allow the DIP to mine more crucial spatial forgery artifacts, we diminish the effects of common spatial artifacts, e.g., high-frequency artifacts, color mismatches, and noise patterns, by augmenting samples with Gaussian blur, color saturation, Gaussian noise, etc. (2) Temporal DA. Similarly, we mitigate the negative effects of unreliable temporal artifacts by augmenting samples with frame dropping and frame repeating.
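A minimal sketch of both augmentation families is given below; the torchvision operations, probabilities, and magnitudes are illustrative choices rather than the paper's exact settings, and one spatial operation is applied to the whole clip so that frames remain consistent.

```python
import random
import torch
import torchvision.transforms as T

spatial_aug = T.RandomChoice([
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.ColorJitter(saturation=0.5),
    T.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1)),  # Gaussian noise
])

def temporal_aug(clip, p=0.5):
    # clip: (T, C, H, W). Randomly drop a frame (replace it by its predecessor) or
    # repeat a frame (overwrite the following one), attenuating temporal motion cues.
    T_len = clip.shape[0]
    i = random.randrange(1, T_len - 1)
    out = clip.clone()
    if random.random() < p:
        out[i] = out[i - 1]            # frame dropping
    else:
        out[i + 1] = out[i]            # frame repeating
    return out

clip = torch.rand(16, 3, 224, 224)
v_pos = temporal_aug(spatial_aug(clip))
print(v_pos.shape)                      # torch.Size([16, 3, 224, 224])
```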
Furthermore, to prevent the optimization of our DIP from being trapped by hard samples, i.e., augmented samples, during the early stage of model training, we adopt a student-teacher optimization framework [38], as illustrated in Fig. 5. Specifically, after conducting spatiotemporal DA, we denote the processed inputs as $V_{anc}$ (which is not augmented), $V_{pos}$, and $V_{neg}$, where $V_{anc}$ and $V_{pos}$ are sampled from the same real video, and $V_{neg}$ is sampled from the corresponding forged video. Note that the backbone illustrated in Fig. 2 is utilized for both the student model $M_s$ and the teacher model $M_t$ in our implementation. $V_{anc}$ is fed forward to $M_s$, while $V_{pos}$ and $V_{neg}$ are encoded by $M_t$. Note that we apply different strategies to update the parameters $\theta_s$ of $M_s$ and the parameters $\theta_t$ of $M_t$, which are detailed in the following section.
E. Loss Functions
As illustrated in Fig. 5, the proposed DIP model follows a student-teacher framework, including spatiotemporal DA and three loss functions. Specifically, the parameters of $M_s$ are updated by gradient backpropagation on the loss functions, while the parameters of $M_t$ are a weighted summation of the student and teacher models through the exponential moving average strategy [38], i.e.,
$$\theta_t = \alpha \cdot \theta_t + (1 - \alpha) \cdot \theta_s, \qquad (8)$$
where $\alpha$ is the momentum for updating the parameters $\theta_t$ and is set as 0.99.
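A minimal sketch of the EMA update in Eq. (8) is given below; the helper name and the buffer synchronization are ours.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    # theta_t = alpha * theta_t + (1 - alpha) * theta_s; the teacher receives no gradients.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)                 # keep e.g. normalization statistics in sync

# Usage (hypothetical model names): after each optimizer step on the student,
#   ema_update(teacher_model, student_model, alpha=0.99)
```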
1) SpatioTemporal Invariant Loss (STI Loss): $L_{sti}$ is a loss function aligned with the proposed spatiotemporal data augmentation strategy to mine general forgery features from fine-grained triplets by increasing intra-class compactness and inter-class separability. The loss function is defined as:
$$L_{sti} = \max\!\big(d_{sti} + \cos(F_{pool}^{anc}, F_{pool}^{neg}) - \cos(F_{pool}^{anc}, F_{pool}^{pos}),\ 0\big), \qquad (9)$$
[Fig. 5 components: data augmentation produces the anchor, positive (with frame dropping/repeating), and negative samples; the student branch (STE, DID, MDC) processes the anchor, while the teacher branch (STE, DID, MDC; stop-gradient, EMA) processes the positive and negative samples; globally pooled features $F_{pool}^{anc}$ and $F_{pool}^{neg/pos}$ feed the STI Loss, alongside the CCE Loss and DA Loss.]
Fig. 5. Overview of the proposed optimization framework.
where $d_{sti}$ is the separability margin, set as 1.0, and $\cos(\cdot)$ denotes the cosine similarity used to measure the similarity of samples. $F_{pool}^{anc}$, $F_{pool}^{neg}$, and $F_{pool}^{pos}$ are the globally pooled and directionally concatenated features of the anchor, negative, and positive samples, respectively.
As shown in Eq. (9), $L_{sti}$ constrains the similarity between the anchor/positive pair to be greater than that between the anchor/negative pair by at least $d_{sti}$. Therefore, equipped with the spatiotemporal DA and $L_{sti}$, the proposed model tends to learn more substantial spatiotemporal forgery clues for deepfake detection.
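A minimal sketch of Eq. (9) is given below, treating $\cos(\cdot)$ as cosine similarity (consistent with the sign structure of the hinge); the feature dimension is illustrative.

```python
import torch
import torch.nn.functional as F

def sti_loss(f_anc, f_pos, f_neg, d_sti=1.0):
    # f_*: (B, D) globally pooled, directionally concatenated features of the triplet.
    sim_pos = F.cosine_similarity(f_anc, f_pos, dim=-1)
    sim_neg = F.cosine_similarity(f_anc, f_neg, dim=-1)
    # Hinge: anchor/positive similarity must exceed anchor/negative similarity by d_sti.
    return torch.clamp(d_sti + sim_neg - sim_pos, min=0).mean()

loss = sti_loss(torch.randn(4, 1536), torch.randn(4, 1536), torch.randn(4, 1536))
print(loss)
```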
2) Diffusion-Aware Loss (DA Loss): The prior knowledge of $Z_d$ ($d \in \{h, v\}$) is exploited to speed up the diffusion process in the IDM. Specifically, the DA loss $L_{da}$ computes the normalized cosine similarity and the diffusion distance among tokens in the same direction and encourages them to be close to each other. The DA loss can be written as:
$$L_{da} = \sum_{d \in \{h, v\}} \big\| \mathrm{norm}(\cos(Z_d, Z_d)) - \exp(-\tau_2 \cdot D_{dd}^t) \big\|_2^2, \qquad (10)$$
where $\mathrm{norm}(\cdot)$ is a function that normalizes the cosine similarities into $[0, 1]$ and $\tau_2$ is a learnable scale parameter. Note that when the normalized cosine similarity of $Z_d$ increases, the exponentially scaled $D_{dd}^t$ is expected to be larger (i.e., a smaller diffusion distance), and vice versa.
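A minimal sketch of Eq. (10) is given below; mapping the cosine similarity to $[0, 1]$ via $(\cos + 1)/2$ is one possible choice for $\mathrm{norm}(\cdot)$, and the same-direction diffusion distance is treated as a fixed (weakly supervised) target.

```python
import torch
import torch.nn.functional as F

def da_loss(Z_h, Z_v, Dt_hh, Dt_vv, tau2=1.0):
    loss = 0.0
    for Z, Dt in ((Z_h, Dt_hh), (Z_v, Dt_vv)):
        Zn = F.normalize(Z, dim=-1)                        # (L, E) unit-norm tokens
        cos = Zn @ Zn.T                                    # pairwise cosine similarity
        cos = (cos + 1.0) / 2.0                            # normalize [-1, 1] -> [0, 1]
        target = torch.exp(-tau2 * Dt).detach()            # pseudo label from D^t_dd
        loss = loss + ((cos - target) ** 2).sum()
    return loss

print(da_loss(torch.randn(14, 768), torch.randn(14, 768),
              torch.rand(14, 14), torch.rand(14, 14)))
```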
3) Collaborative Cross-Entropy Loss (CCE Loss): As shown in Fig. 2, in addition to the video-based prediction, both the horizontal and vertical representations could also exhibit complementary directional forgery clues. Therefore, the CCE Loss is developed, which allows the proposed DIP model to extract discriminative inconsistency features globally and directionally:
$$L_{cce} = l(y, \hat{y}) + \sum_{d \in \{h, v\}} \lambda_d \cdot l(y_d, \hat{y}), \qquad (11)$$
where $l(\cdot)$ represents the cross-entropy loss function, $y$ and $\hat{y}$ are the video-based prediction probability and the ground truth, respectively, $y_d$ ($d \in \{h, v\}$) stands for the directional prediction probability, and $\lambda_d$ is the weight of each directional loss.
As a result, the overall loss function for the proposed DIP framework could be written as the sum of all the above loss functions, i.e.,
$$L_{all} = L_{cce} + L_{sti} + L_{da}. \qquad (12)$$
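A minimal sketch of Eqs. (11)-(12) is given below; $\lambda_h = \lambda_v = 0.5$ is an illustrative weighting, as the values are not stated here.

```python
import torch
import torch.nn.functional as F

def cce_loss(y, y_h, y_v, labels, lam=(0.5, 0.5)):
    # y, y_h, y_v: (B, 2) video-level and directional logits; labels: (B,) in {0, 1}.
    loss = F.cross_entropy(y, labels)
    for w, y_d in zip(lam, (y_h, y_v)):
        loss = loss + w * F.cross_entropy(y_d, labels)      # Eq. (11)
    return loss

def total_loss(y, y_h, y_v, labels, l_sti, l_da):
    return cce_loss(y, y_h, y_v, labels) + l_sti + l_da     # Eq. (12)

logits = torch.randn(4, 2)
print(total_loss(logits, torch.randn(4, 2), torch.randn(4, 2),
                 torch.randint(0, 2, (4,)), torch.tensor(0.3), torch.tensor(0.1)))
```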
IV. EXPERIMENTS
A. Experimental Settings
1) Implementation Details: For each video, 8 clips, each with 16 frames, are sampled for training and testing. The open-source Dlib algorithm is employed to detect and crop facial regions, which are then resized to 224 × 224. The proposed method is implemented with the PyTorch library on four NVIDIA RTX 3090 GPUs. For model training, the AdamW optimizer [42] with an initial learning rate of $4 \times 10^{-6}$ is utilized, and the batch size is set to 16. The setting of each transformer block in DIP follows the ViT model [14]. For the hyperparameter settings, the neighborhood size $k_n$ of one node and the diffusion step $t$ in the IDM are set to 7 and 20, respectively. The optimal ratio of spatial to temporal attention layers in the STE block is discussed in the following ablation experiments.
2) Evaluation Metrics: Following the evaluation protocols
in [26], [28], the Area Under the Receiver Operating Char-
acteristic Curve (AUC) and Detection Accuracy (ACC) at the
video level are utilized as evaluation metrics. Specifically, for
comparison with image-level methods, the video-level results
of such methods are reported by averaging the detection scores
of all frames.
3) Dataset: FaceForensics++ (FF++) [8]: FF++ is a widely
used benchmark in deepfake detection. It includes 1,000 real
videos collected from YouTube and 4,000 fake videos gener-
ated by four types of face manipulation techniques: Deepfakes
(DF) [1], Face2Face (F2F) [2], FaceSwap (FS) [3], and Neural-
Textures (NT) [5]. In addition, videos in FF++ are compressed
at three quality levels: raw, high-quality (C23), and low-quality
(C40). To better simulate practical scenarios, the C23 and C40
versions are adopted to conduct our experiments.
Celeb-DF-v2 (CDF) [39]: CDF is a large-scale public
deepfake dataset that contains 590 real videos and 5,639 high-
quality face-swapped videos of celebrities.
WildDeepFake (WDF) [40]: WDF is a real-world dataset
with 3,805 real videos and 3,805 fake videos containing
diverse and complicated conditions of video collection and
manipulation, which is more challenging for deepfake detec-
tion.
Deepfake Detection Challenge [41], [42] generates more
challenging forged faces, including two versions: preview and
complete datasets termed DFDC-P and DFDC, which apply
some specifically designed augmentations to the target videos
to approximate the actual degradations.
DeeperForensics-1.0 (DFR) [43]: DFR applies more sophisticated forgery methods and well-controlled video sources to generate "natural" deepfake videos with various distortions.
B. Intra-dataset Evaluation
We first evaluate our proposed DIP with state-of-the-art
methods under the intra-dataset setting, in which all models are
trained and tested on FF++ with multiple qualities, i.e., C23
and C40. The results are shown in Table I. Our DIP achieves
a comparable and balanced detection performance compared
with the SOTA methods across different video qualities.
As forgery artifacts are greatly weakened by strong video
compression, it is more challenging to extract discriminative
artifact features from highly compressed videos. Despite this, our proposed DIP focuses on mining fine-grained spatiotemporal forgery artifacts and captures multidirectional forgery patterns, which are more robust to video compression, thus achieving comparable performance across various video qualities. For instance, compared with the recent image-based method MRL [44], our DIP improves the ACC from 93.82%
to 99.10% for high-quality data while achieving satisfactory
performance on challenging low-quality videos. The difference
in detection performance on MRL could be caused by quality-
sensitive artifacts, resulting in unbalanced performance on
different video qualities. In contrast, both the proposed DIP
and ISTVT [36] incorporate temporal forgery artifacts, which
are more invariant to video compression and could capture
discriminative clues even on low-quality videos.
TABLE I
INTRA-DATASET EVALUATION ON FF++. THE BEST AND SECOND-BEST RESULTS ARE SHOWN IN BOLD AND UNDERLINED TEXT, RESPECTIVELY.

Method          | Venue      | FF++ (C40) ACC | FF++ (C40) AUC | FF++ (C23) ACC | FF++ (C23) AUC
Face X-ray [24] | CVPR 2020  | -     | 61.60 | -     | 87.40
F3Net [45]      | ECCV 2020  | 90.43 | 93.30 | 97.52 | 98.10
FDFL [22]       | CVPR 2021  | 89.00 | 92.40 | 96.69 | 99.30
DCL [46]        | AAAI 2022  | -     | -     | 96.74 | 99.30
UiA-ViT [47]    | ECCV 2022  | -     | -     | -     | 99.33
Lisiam [48]     | TIFS 2022  | 91.29 | 94.65 | 97.57 | 99.52
CDIN [49]       | TCSVT 2023 | -     | 96.80 | -     | 98.50
MRL [44]        | TIFS 2023  | 91.81 | 96.18 | 93.82 | 98.27
ISTVT [36]      | TIFS 2023  | 96.15 | -     | 99.00 | -
LDFNet [50]     | TCSVT 2024 | 92.32 | 96.79 | 96.01 | 98.92
DIP (Ours)      | -          | 92.33 | 95.16 | 99.10 | 99.46
C. Generalizability Evaluation
In practical scenarios, the manipulation methods and data
source of the suspicious face video to be detected are usually
unknown, which requires the well-trained detector to capture
highly generalized forgery patterns and exhibit good general-
ization ability. To evaluate the generalizability of the proposed
method, we adopt two more practical settings: cross-dataset
and cross-manipulation evaluations.
1) Cross-dataset Evaluation: Table II shows the generalization results for cross-dataset evaluation in terms of AUC, in which all models are trained on FF++ (C23) for a fair comparison. It can be observed that our DIP exhibits superior performance in most cases, including the average result, compared with the prior methods. Specifically, our method achieves significant performance improvements of 4.13% and 5.70% in AUC on WDF and DFDC, respectively, compared with state-of-the-art methods, such as DCL [46], SBI [26],
TABLE II
GENERALIZABILITY EVALUATION IN TERMS OF AUC (%). ALL MODELS ARE TRAINED ON FF++ AND TESTED ON THE REMAINING FIVE DATASETS, WITH VIDEO-LEVEL METRICS. THE BEST AND SECOND-BEST RESULTS ARE BOLDED AND UNDERLINED, RESPECTIVELY.

Category    | Method            | Venue      | CDF   | DFDC-P | DFDC  | DFR   | WDF   | Average
Image-based | Face X-ray [24]   | CVPR 2020  | 79.50 | 74.20  | 65.50 | 86.80 | -     | 76.50
Image-based | F3Net [45]        | ECCV 2020  | 68.69 | 67.45  | -     | -     | -     | 68.07
Image-based | PCL [25]          | ICCV 2021  | 90.03 | 74.37  | 67.52 | -     | -     | 77.31
Image-based | DCL [46]          | AAAI 2022  | 88.24 | 77.57  | -     | 94.42 | 76.87 | 84.28
Image-based | RECCE [51]        | CVPR 2022  | 69.25 | 66.90  | -     | 93.28 | 76.99 | 74.67
Image-based | SBI [26]          | CVPR 2022  | 89.90 | 86.15  | 72.42 | 77.70 | 70.27 | 79.78
Image-based | Lisiam [48]       | TIFS 2022  | 78.21 | -      | -     | -     | -     | 78.21
Video-based | FTCN [27]         | ICCV 2021  | 86.90 | 71.00  | 74.00 | 98.80 | -     | 82.70
Video-based | LipForensics [28] | CVPR 2021  | 82.40 | -      | 73.50 | 97.60 | -     | 84.50
Video-based | HCIL [52]         | ECCV 2022  | 79.00 | -      | 69.21 | -     | -     | 74.11
Video-based | RATIL [12]        | IJCAI 2022 | 76.50 | -      | 69.06 | -     | -     | 72.78
Video-based | STDT [53]         | MM 2022    | 69.78 | -      | 66.99 | -     | -     | 68.39
Video-based | CDIN [49]         | TCSVT 2023 | 89.10 | -      | 78.40 | -     | -     | 83.75
Video-based | AltFreezing [15]  | CVPR 2023  | 89.50 | -      | 71.25 | 99.30 | -     | 86.68
Video-based | AdapGRnet [54]    | TMM 2023   | 71.50 | -      | -     | -     | -     | 71.50
Video-based | ISTVT [36]        | TIFS 2023  | 84.10 | -      | 74.20 | 98.60 | -     | 85.63
Video-based | DIP (Ours)        | -          | 88.36 | 87.98  | 79.90 | 98.02 | 81.12 | 87.08
RECCE [51], and ISTVT [36]. Previous image-based methods
commonly suffer from dramatic performance drops when eval-
uated on DFDC, WDF, and DFR, in which videos are distorted
with complicated perturbations and thus more challenging
to detectors, while the proposed DIP takes full advantage
of temporal artifact mining and achieves the best detection
performance, indicating temporal artifacts are significant and
non-negligible clues against highly distorted videos in the
cross-dataset evaluation.
Following the general spatiotemporal learning paradigm, FTCN [27] and ISTVT [36] explore temporal forgery artifacts for deepfake detection and achieve better performance on CDF and DFR compared with the image-level methods. However, they fail to consider the importance of multi-directional forgery pattern modeling, which results in poor generalization on challenging cases, e.g., WDF and DFDC. Benefiting from multidirectional forgery pattern modeling and temporal data augmentation, our DIP achieves better cross-dataset detection performance across all datasets.
2) Cross-manipulation Evaluation: Following previous
works [28], [55], we then proceed to evaluate the cross-
manipulation performance of the proposed DIP on FF++.
Specifically, all models are trained on three manipulation
methods and tested on the remaining one on FF++. As can be observed in Table III, when tested on FS, our proposed DIP outperforms the other counterparts with a nearly 3% AUC improvement. Compared with the state-of-the-art DIL [56], our
DIP consistently achieves better average cross-manipulation
performance. The generalization performance of DIL benefits
from temporal and local forgery artifact mining, especially on
NT. However, it gains unsatisfactory performance when ap-
plied to the other three cross-manipulation evaluations since it
fails to capture the global forgery artifacts. In contrast, our DIP
could effectively mine both global and local forgery artifacts
and capture more discriminative forgery patterns, resulting in
better generalizability in various cross-manipulation scenarios.
TABLE III
CROSS-MANIPULATION EVALUATION ON FOUR FORGERY METHODS IN TERMS OF AUC (%). ALL MENTIONED MODELS ARE TRAINED ON THREE MANIPULATION METHODS AND TESTED ON THE REMAINING ONE ON FF++. THE BEST AND SECOND-BEST RESULTS ARE BOLDED AND UNDERLINED, RESPECTIVELY.

Method      | Venue      | DF   | FS   | F2F  | NT   | Avg
Xcep. [57]  | ICCV 2017  | 93.9 | 51.2 | 86.8 | 79.7 | 77.9
Lips [28]   | CVPR 2021  | 93.0 | 56.7 | 98.8 | 98.3 | 86.7
FTCN [27]   | CVPR 2021  | 96.7 | 96.0 | 96.1 | 96.2 | 96.3
DIAnet [58] | IJCAI 2021 | 87.9 | 86.1 | 88.2 | 86.7 | 87.2
DIL [56]    | AAAI 2022  | 94.4 | 94.8 | 94.8 | 94.9 | 94.7
DIP (Ours)  | -          | 98.2 | 99.0 | 99.1 | 92.4 | 97.2
D. Robustness Evaluation
During collection and transmission, the video is usually
distorted with unknown intensities, which would potentially
weaken or destroy essential forgery artifacts. To evaluate the
robustness of the proposed method, following the previous
work [28], we perform robustness evaluations on our DIP
and other compared methods under multiple distortions with
different intensities, including color saturation modification,
color contrast modification, blockwise noise, Gaussian noise,
Gaussian blur, pixelation, and video compression.
As shown in Fig. 6, our proposed DIP is more robust than
the other involved methods under various distortion scenarios.
Since the spatial forgery artifacts are heavily corrupted under
spatial distortions, including Gaussian noise, pixelation, color
saturation, and contrast, while the interframe relationships, i.e.,
temporal artifacts, are relatively well preserved, the video-
based methods, i.e., LipForensics [28] and the proposed DIP,
are significantly more robust than the image-based methods,
e.g., Xception [8], Face X-ray [24], and PatchForensics [59],
which rely heavily on spatial forgery artifacts.
Video compression corrupts temporal forgery artifacts to a
great extent. In this scenario, our proposed DIP exhibits better
robustness compared with LipForensics. The reason is that
LipForensics fails to model the global temporal patterns with
[Fig. 6 panels: AUC (%) versus perturbation level (0-5) for Color Saturation, Color Contrast, Block-wise, Gaussian Noise, Gaussian Blur, Pixelation, Compression, and Average; compared methods: Xception, CNN-aug, CNN-GRU, PatchForensics, Face X-ray, LipForensics, and ours.]
Fig. 6. Robustness evaluation under unseen distortions of various levels in terms of AUC (%). Several models trained on clean FF++ are evaluated under several distortions with different levels. "Average" denotes the average performance against the same-level distortions.
the constraint of local forgery pattern modeling. In contrast,
our DIP considers both local and global forgery artifacts and
implements multidirectional forgery pattern modeling, thereby
achieving more robust detection performance.
E. Ablation Studies
We conduct extensive ablation experiments to analyze the
impacts of various components in the proposed DIP. All
models and their variants are trained on FF++ and evaluated on
intra-dataset and cross-dataset detection performance in terms
of AUC (%).
1) Study on the architectures of the STE unit: To determine the optimal setting in the STE, i.e., the ratio of the number of spatial attention (SA) modules to temporal attention (TA) modules in one unit, we conduct experiments under different settings, and the results are shown in Table IV. The DIP framework does not gain detection performance improvements by stacking more temporal attention modules, whereas adequate spatial forgery pattern learning with more spatial attention modules promotes spatiotemporal inconsistency pattern learning. Moreover, the setting of three SA modules better balances spatial and temporal inconsistency pattern learning, and is therefore chosen as the default setting of DIP.
2) Study on the effects of the STE: We further investigate
three paradigms of deepfake spatiotemporal learning, i.e.,
spatial-then-temporal learning (FTCN [27]), one-spatial-one-
temporal learning (ViViT [35]), and more-spatial-one-temporal
learning (Ours). The FTCN extracts frame-level forgery rep-
resentations in the first stage and then focuses on capturing
temporal forgery artifacts, which fails to mine the importance
of local forgery artifacts and results in unsatisfactory detection
performance. To better capture local spatiotemporal forgery
artifacts, ViViT fuses spatiotemporal forgery learning by fac-
torizing self-attention layers into spatial and temporal ones
in equal numbers. Furthermore, the proposed DIP designs
asymmetric spatiotemporal attention. It can be observed in
Table V that the asymmetric spatiotemporal attention design
in the proposed DIP could better learn spatiotemporal forgery
representations and achieve significant performance improve-
ments.
3) Study on different settings of $k_n$ and $t$: In the inconsistency diffusion module, $k_n$ in Eq. (2) controls the scale of a node's neighborhood in the graph $G$, while $t$ in Eq. (4) constrains the number of diffusion steps. Table VI presents the performance under various combinations of $k_n$ and $t$. Initially, we fix $k_n$ to 3 or 7 to examine the impact of different $t$ values. We observe that when $t$ is set to 20, the overall performance is more balanced. With respect to $k_n$, when fewer connected neighboring nodes are involved in the inconsistency transition graph, insufficient forgery relationships prevent this module from modeling more discriminative forgery diffusion patterns, thereby yielding unsatisfactory detection performance. However, a much larger $k_n$ would result in the loss of locality in the inconsistency relationships, leading to lower cross-dataset performance.
4) Study on the effects of DID: We then proceed to study
the impacts of DiCA, IDM, and DA Loss. The results are
shown in Table VII. When integrated with only DiCA or IDM,
the generalizability performance is improved, which indicates
that the directional interaction (DiCA) and diffusion (IDM)
are vital for inconsistency modeling. Moreover, the IDM
achieves better performance on the CDF, which indicates that
the CDF exhibits more discriminative inconsistency diffusion
patterns, rather than directional interaction patterns compared
with DFDC-P. When equipped with all the three components,
both the intra-dataset and cross-dataset performance are further
boosted, indicating that our proposed DIP could mine more
various forgery patterns and generalize well on cross-dataset
evaluations.
5) Study on the effects of Spatiotemporal DA: Spatiotempo-
ral DA methods designed with forgery-related prior knowledge
are essential to prevent the detector from overfitting nonintrin-
sic patterns. To measure the impacts of all the involved DA
methods, we train several variants of the proposed DIP, includ-
ing models without any one specific type of augmentation and
one model without any augmentations. Additionally, several
postprocessed FF++ datasets are involved in experiments to
simulate different video perturbations, i.e., FFhf (FF++ with
high-frequency filtering), FFnoise (FF++ with noise addition),
FFcolor (FF++ with color distortion), and FFta (FF++ with
random frame dropping and repeating).
Table VIII summarizes the detection performance of all
the settings in the FF++ dataset. It can be observed that the
trained detector without data augmentations is more vulnerable
to various distortions, especially high-frequency filtering and
temporal distortions. Moreover, benefiting from the proposed
data augmentation methods, our DIP further improves the
generalization performance against various natural distortions,
which indicates the effectiveness of the proposed spatiotem-
poral DA methods.
TABLE IV
STUDY ON THE ARCHITECTURES OF A UNIT BLOCK IN STE. NOTE THAT FOUR FORGERY METHODS IN FF++ (LQ) AND CDF ARE LEVERAGED TO EVALUATE DIFFERENT SETTINGS OF SA:TA RATIOS.

SA | TA | DF   | F2F  | FS   | NT   | CDF
1  | 3  | 96.1 | 89.6 | 93.2 | 89.4 | 79.7
1  | 1  | 98.2 | 92.9 | 95.4 | 93.2 | 83.1
2  | 1  | 97.6 | 92.4 | 96.5 | 89.8 | 85.6
3  | 1  | 98.9 | 95.7 | 96.3 | 95.5 | 88.3
4  | 1  | 99.3 | 94.5 | 94.2 | 92.4 | 87.5
6  | 1  | 98.5 | 93.8 | 93.7 | 92.5 | 86.4
TABLE V
STUDY ON THE EFFECTS OF STE. WE COMPARE PERFORMANCE WITH THREE PARADIGMS OF EXISTING FORGERY SPATIOTEMPORAL LEARNING ON INTRA-DATASET AND CROSS-DATASET EVALUATIONS. NOTE THAT THE BEST RESULTS ARE BOLDED.

Method     | Venue     | FF++ | CDF  | DFDC-P | Average
FTCN [27]  | CVPR 2021 | 99.7 | 86.9 | 74.0   | 86.9
ViViT [35] | ICCV 2021 | 97.9 | 80.5 | 74.2   | 84.2
DIP (Ours) | -         | 99.4 | 88.3 | 88.0   | 92.9
F. Visualization
In this subsection, we visualize directional saliency maps of
input videos to better show the effectiveness of the directional
patterns in our proposed DIP. Moreover, feature distributions
and forgery patterns generated from DiCA and IDM are
also leveraged to illustrate the superior forgery discrimination
ability of our method.
[Fig. 7 layout: input frames and horizontal/vertical heatmaps for DF, F2F, FS, and NT.]
Fig. 7. Visualization of horizontal and vertical forgery inconsistencies in the four facial manipulation algorithms. 'Vertical' and 'Horizontal' indicate the heatmaps in the vertical and horizontal directions, respectively.
[Fig. 8 layout: rows for real and fake inputs; columns show horizontal and vertical patterns for 'w/o Both', DiCA, and IDM.]
Fig. 8. Visualization of inconsistency patterns modeled by DiCA and IDM. 'w/o Both' represents the inconsistency clues without pattern modeling by DiCA and IDM.
Fig. 9. Visualization of feature distributions using t-SNE projection. Left: our
DIP. Right: DIP without DiCA and IDM. Note that all models are trained on
FF++.
Fig. 10. Examples of detection failure with the proposed method. Top: the face-swapping video (classified as real). Bottom: the real talking face video (classified as fake).
TABLE VI
STUDY ON DIFFERENT SETTINGS OF k_n AND t. CROSS AVG. INDICATES THE AVERAGE GENERALIZATION PERFORMANCE IN TERMS OF THE AUC METRIC UNDER CROSS-DATASET SCENARIOS (CDF, DFDC-P, AND WDF).

k_n | t  | FF++  | CDF   | DFDC-P | WDF   | Cross Avg.
3   | 10 | 99.33 | 84.11 | 86.25  | 79.81 | 83.39
3   | 20 | 99.37 | 85.07 | 86.10  | 80.01 | 83.73
7   | 20 | 99.46 | 88.36 | 87.98  | 81.12 | 85.82
7   | 25 | 99.51 | 87.98 | 87.52  | 81.01 | 85.50
7   | 40 | 99.42 | 88.45 | 88.23  | 80.24 | 85.64
9   | 25 | 99.64 | 88.12 | 86.58  | 79.89 | 84.86
15  | 25 | 99.82 | 87.40 | 86.38  | 80.76 | 84.85
TABLE VII
STUDY ON THE EFFECTS OF DID. WE DECOMPOSE DID INTO SEVERAL ELEMENTS, I.E., DICA, IDM, AND DA LOSS, AND EVALUATE THEIR CONTRIBUTIONS IN INTRA-DATASET (FF++) AND CROSS-DATASET (CDF AND DFDC-P) SETTINGS.

DiCA | IDM | DA Loss | FF++ | CDF  | DFDC-P
×    | ×   | ×       | 99.2 | 84.3 | 79.8
✓    | ×   | ×       | 99.6 | 85.1 | 84.6
×    | ✓   | ×       | 99.4 | 86.4 | 80.9
×    | ✓   | ✓       | 99.4 | 87.0 | 81.8
✓    | ✓   | ×       | 99.3 | 87.5 | 85.2
✓    | ✓   | ✓       | 99.6 | 88.3 | 88.6
1) Directional Saliency Map Visualization: We apply Grad-
CAM [60] on horizontal and vertical directions of manipulated
face clips to visualize where the proposed DIP pays attention
to the forged faces in different directions. As shown in Fig. 7,
spatiotemporal inconsistencies caused by deepfake methods
have direction-aware characteristics, and our model is capable
of discriminating directional forgery inconsistencies. In addi-
tion, the proposed DIP can learn the different characteristics of
each deepfake method. In NeuralTextures (NT), which reen-
acts regions related to the mouth, the corresponding saliency
maps show that forgery inconsistencies are located around
the mouth region. In FaceSwap (FS), our model focuses on
blending facial boundaries from different directions. These
results provide a reasonable and straightforward explanation
for the effectiveness of directional modeling in the DIP.
2) Inconsistency Pattern Visualization: The inconsistency patterns modeled by our DIP express regional embedding relationships from two different perspectives, i.e., DiCA and IDM. Here, we directly visualize these patterns to illustrate the effectiveness of the designed modules. As shown in Fig. 8, compared with 'w/o Both', both DiCA and IDM extract distinct patterns along the horizontal and vertical directions. The IDM extracts more discriminative patterns than DiCA does. Specifically, patterns modeled by
IDM demonstrate that inconsistency diffusion caused by deep-
fake algorithms is distributed evenly across regions, whereas
diffusion caused by motion is more regionally independent,
as DiCA models inconsistency patterns via a cross attention
mechanism, i.e., directional interaction. Compared with forged
videos, real videos have a more discrete pattern distribution.
Therefore, the differences in inconsistency patterns between
real and forged videos demonstrate the effectiveness of our
DIP and reveal the forgery inconsistency patterns caused by
TABLE VIII
STUDY ON THE EFFECT OF SPATIOTEMPORAL DA. VARIANTS OF THE PROPOSED DIP WITH DIFFERENT SETTINGS ARE EVALUATED IN TERMS OF AUC (%): 'HFREQ AUG.' (HIGH-FREQUENCY AUGMENTATION), 'NOISE AUG.' (NOISE AUGMENTATION), 'COLOR AUG.' (COLOR-DISTORTING AUGMENTATION), AND 'TEMPO AUG.' (TEMPORAL AUGMENTATION). NOTE THAT THE BEST RESULTS ARE BOLDED.

Variant        | FF++  | FFhf  | FFnoise | FFcolor | FFta
w/o Aug.       | 99.23 | 92.71 | 95.60   | 96.26   | 91.89
w/o HFreq Aug. | 99.16 | 93.66 | 97.13   | 98.01   | 97.57
w/o Noise Aug. | 99.37 | 97.02 | 97.12   | 98.51   | 98.06
w/o Color Aug. | 99.09 | 95.69 | 97.03   | 97.89   | 98.18
w/o Tempo Aug. | 99.42 | 96.10 | 96.93   | 97.18   | 93.57
Original       | 99.46 | 96.60 | 98.91   | 98.53   | 98.62
existing deepfake techniques.
3) T-SNE Feature Visualization: We employ the t-SNE
projection [61] to analyze the proposed DIP and its variant in
intra-dataset and cross-dataset scenarios. Fig. 9 shows different
distributions of feature embeddings on both FF++ and CDF.
For the intra-dataset evaluation (FF++), both DIP and DIP-
variant could effectively distinguish manipulated samples from
pristine ones and exhibit clear decision boundaries in the
feature distribution. However, without the proposed DiCA and
IDM, the DIP-variant fails to effectively discriminate between
pristine faces and manipulated faces in the cross-dataset sce-
nario (CDF). Particularly, some real faces in the CDF closely
cluster with the manipulated faces from FF++ in the fea-
ture distribution. Moreover, there exist certain regions where
features of all faces are clustered together, which negatively
affects the generalization performance. Conversely, our DIP clusters pristine faces more tightly and maintains high detection accuracy in the cross-dataset settings. Benefiting
from explicit forgery pattern modeling implemented by DiCA
and IDM, more universal forgery artifacts are captured and
taken into final predictions, thereby verifying the effectiveness
of both involved modules.
4) Limitation: While the proposed DIP achieves promising
detection performance, especially on cross-dataset evaluation,
there still exist a few failure instances with the proposed
method. Fig. 10 illustrates two video samples, where the face-
swapping video is mispredicted as real (top), and the real
talking face video (bottom) is misclassified as fake. This is
because either the inconspicuous facial movements (top) or the extreme shooting angle (bottom) prevents the video from exhibiting sufficient facial motion information. In other words, the proposed DIP can hardly capture discriminative forgery clues from such videos. This also motivates us to delve into more general and robust forgery representations for deepfake detection, which remains a topic of our future research.
V. CONCLUSION
In this paper, we propose a novel spatiotemporal in-
consistency learning method, i.e., Diffusion Learning of
Inconsistency Pattern (DIP), to exploit the directional spa-
tiotemporal inconsistency for deepfake video detection by
modeling inconsistency patterns from two perspectives, i.e.,
directional interaction and diffusion. These patterns are lever-
aged to learn more discriminative inconsistency clues jointly.
In addition, the STI Loss is developed to facilitate the learning
of invariant forgery information with well-devised spatiotem-
poral data augmentations, resulting in a generalized represen-
tation for deepfake detection. The experimental results show
that our method outperforms other state-of-the-art methods
on four commonly used benchmarks. Ablation studies also
demonstrate the effectiveness of the involved designs.
REFERENCES
[1] “Deepfakes faceswap,” Jan. 2024. [Online]. Available: https://github.
com/deepfakes/faceswap
[2] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Niessner,
“Face2Face: Real-Time Face Capture and Reenactment of RGB Videos,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016.
[3] K. Marek, “FaceSwap,” 2020. [Online]. Available: https://github.com/
MarekKowalski/FaceSwap
[4] T. Wang, Y. Zhang, Y. Fan, J. Wang, and Q. Chen, “High-Fidelity
GAN Inversion for Image Attribute Editing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11369–11378.
[5] J. Thies, M. Zollhöfer, and M. Nießner, "Deferred Neural Rendering: Image Synthesis Using Neural Textures," ACM Trans. Graph., vol. 38, no. 4, Jul. 2019.
[6] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, “An Improved
Algorithm for TV-L1 Optical Flow,” in Statistical and Geo. Approach.
to Vis. Moti. Analysis, 2009, pp. 23–45.
[7] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: a Com-
pact Facial Video Forgery Detection Network,” in IEEE Int. Workshop
Inf. Forensics Secur., 2018, pp. 1–7.
[8] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Niess-
ner, “FaceForensics++: Learning to Detect Manipulated Facial Images,”
in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1–11.
[9] I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, “Deepfake Video
Detection through Optical Flow Based CNN," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2019.
[10] Z. Gu et al., "Spatiotemporal Inconsistency Learning for DeepFake
Video Detection,” in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp.
3473–3481.
[11] D. Zhang, C. Li, F. Lin, D. Zeng, and S. Ge, “Detecting Deepfake
Videos with Temporal Dropout 3DCNN," in Proc. 30th Int. Joint Conf. Artif. Intell., 2021, pp. 1288–1294.
[12] Z. Gu, T. Yao, C. Yang, R. Yi, S. Ding, and L. Ma, “Region-aware
temporal inconsistency learning for deepfake video detection," in Proc. 31st Int. Joint Conf. Artif. Intell., vol. 1, 2022, pp. 920–926.
[13] J. Guan et al., "Delving into Sequential Patches for Deepfake
Detection,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp.
4517–4530.
[14] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. Int. Conf. Learn. Represent., 2021.
[15] Z. Wang, J. Bao, W. Zhou, W. Wang, and H. Li, “AltFreezing for
More General Video Face Forgery Detection,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., 2023, pp. 4129–4138.
[16] Z. Farbman, R. Fattal, and D. Lischinski, “Diffusion Maps for Edge-
Aware Image Editing,” ACM Trans. Graph., vol. 29, no. 6, Dec. 2010.
[17] J. Sun and Z. Xu, “Neural Diffusion Distance for Image Segmentation,”
in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[18] H. Zhao, W. Zhou, D. Chen, T. Wei, W. Zhang, and N. Yu, “Multi-
Attentional Deepfake Detection,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit., 2021, pp. 2185–2194.
[19] B. Zhang, S. Li, G. Feng, Z. Qian, and X. Zhang, “Patch diffusion: a
general module for face manipulation detection,” in Proc. AAAI Conf.
Artif. Intell., 2022, pp. 3243–3251.
[20] S. Chen, T. Yao, Y. Chen, S. Ding, J. Li, and R. Ji, “Local Relation
Learning for Face Forgery Detection," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 1081–1088.
[21] H. Liu et al., "Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 772–781.
[22] J. Li, H. Xie, J. Li, Z. Wang, and Y. Zhang, “Frequency-Aware
Discriminative Feature Learning Supervised by Single-Center Loss for
Face Forgery Detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit., 2021, pp. 6458–6467.
[23] Y. Yu, R. Ni, W. Li, and Y. Zhao, “Detection of AI-Manipulated
Fake Faces via Mining Generalized Features,” ACM Trans. Multimedia
Comput., Commun., and Appl., vol. 18, no. 4, pp. 1–23, 2022. [Online].
Available: https://dl.acm.org/doi/10.1145/3499026
[24] L. Li et al., "Face X-Ray for More General Face Forgery Detection,"
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp.
5001–5010.
[25] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia, “Learning
Self-Consistency for Deepfake Detection,” in Proc. IEEE/CVF Int. Conf.
Comput. Vis., 2021, pp. 15023–15033.
[26] K. Shiohara and T. Yamasaki, “Detecting Deepfakes With Self-Blended
Images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022,
pp. 18720–18729.
[27] Y. Zheng, J. Bao, D. Chen, M. Zeng, and F. Wen, “Exploring Temporal
Coherence for More General Video Face Forgery Detection,” in Proc.
IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 15044–15054.
[28] A. Haliassos, K. Vougioukas, S. Petridis, and M. Pantic, “Lips Don’t
Lie: A Generalisable and Robust Approach To Face Forgery Detection,”
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp.
5039–5049.
[29] S. A. Khan and H. Dai, “Video Transformer for Deepfake Detection
with Incremental Learning,” in Proc. 29th ACM Int. Conf. Multimedia,
2021, pp. 1821–1828.
[30] Z. Sun, Y. Han, Z. Hua, N. Ruan, and W. Jia, “Improving the Efficiency
and Robustness of Deepfakes Detection Through Precise Geometric
Features,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
2021, pp. 3609–3618.
[31] Y. Li, M.-C. Chang, and S. Lyu, “In Ictu Oculi: Exposing AI Created
Fake Videos by Detecting Eye Blinking,” in IEEE Int. Workshop Inf.
Forensics and Secur., 2018, pp. 1–7.
[32] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale Video Classification with Convolutional Neural
Networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014,
pp. 1725–1732.
[33] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks
for Action Recognition in Videos,” in Proc. Adv. Neural Inf. Process.
Syst., vol. 27, 2014.
[34] G. Bertasius, H. Wang, and L. Torresani, “Is Space-Time Attention All
You Need for Video Understanding?" in Proc. 38th Int. Conf. Mach. Learn., vol. 139, 2021, pp. 813–824.
[35] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, "ViViT: A Video Vision Transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 6836–6846.
[36] C. Zhao, C. Wang, G. Hu, H. Chen, C. Liu, and J. Tang, “ISTVT: Inter-
pretable Spatial-Temporal Video Transformer for Deepfake Detection,”
IEEE Trans. Inf. Forensics Security, vol. 18, pp. 1335–1348, 2023.
[37] R. R. Coifman and S. Lafon, "Diffusion maps," Appl. Comput. Harmon. Anal., vol. 21, no. 1, pp. 5–30, 2006.
[38] A. Tarvainen and H. Valpola, “Mean teachers are better role mod-
els: Weight-averaged consistency targets improve semi-supervised deep
learning results,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[39] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, “Celeb-DF: A Large-Scale
Challenging Dataset for DeepFake Forensics,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., 2020, pp. 3207–3216.
[40] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, “WildDeepfake: A
Challenging Real-World Dataset for Deepfake Detection,” in Proc. 28th
ACM Int. Conf. Multimedia, 2020, pp. 2382–2390.
[41] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, “The
Deepfake Detection Challenge (DFDC) Preview Dataset,” arXiv preprint
arXiv:1910.08854, Oct. 2019.
[42] B. Dolhansky et al., "The DeepFake Detection Challenge (DFDC) Dataset," arXiv preprint arXiv:2006.07397, 2020.
[43] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, “DeeperForensics-1.0:
A Large-Scale Dataset for Real-World Face Forgery Detection,” in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 2889–2898.
[44] Z. Yang, J. Liang, Y. Xu, X.-Y. Zhang, and R. He, “Masked Relation
Learning for DeepFake Detection,” IEEE Trans. Inf. Forensics Security,
vol. 18, pp. 1696–1708, 2023.
[45] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in
Frequency: Face Forgery Detection by Mining Frequency-Aware Clues,”
in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–103.
[46] K. Sun, T. Yao, S. Chen, S. Ding, J. Li, and R. Ji, “Dual Contrastive
Learning for General Face Forgery Detection," in Proc. AAAI Conf. Artif. Intell., 2022, pp. 2316–2324.
[47] W. Zhuang et al., "UIA-ViT: Unsupervised Inconsistency-Aware
Method Based on Vision Transformer for Face Forgery Detection,” in
Proc. Eur. Conf. Comput. Vis., 2022, pp. 391–407.
[48] J. Wang, Y. Sun, and J. Tang, “LiSiam: Localization Invariance Siamese
Network for Deepfake Detection,” IEEE Trans. Inf. Forensics Security,
vol. 17, pp. 2425–2436, 2022.
[49] H. Wang, Z. Liu, and S. Wang, “Exploiting complementary dynamic
incoherence for deepfake video detection,” IEEE Trans. Circuits Syst.
Video Technol., vol. 33, no. 8, pp. 4027–4040, 2023.
[50] Z. Guo, L. Wang, W. Yang, G. Yang, and K. Li, “Ldfnet: Lightweight
dynamic fusion network for face forgery detection by integrating local
artifacts and global texture information,” IEEE Trans. Circuits Syst.
Video Technol., vol. 34, no. 2, pp. 1255–1265, 2024.
[51] J. Cao, C. Ma, T. Yao, S. Chen, S. Ding, and X. Yang, “End-to-End
Reconstruction-Classification Learning for Face Forgery Detection,” in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4113–
4122.
[52] Z. Gu, T. Yao, Y. Chen, S. Ding, and L. Ma, “Hierarchical Contrastive
Inconsistency Learning for Deepfake Video Detection,” in Proc. Eur.
Conf. Comput. Vis., 2022, pp. 596–613.
[53] D. Zhang, F. Lin, Y. Hua, P. Wang, D. Zeng, and S. Ge, “Deepfake
Video Detection with Spatiotemporal Dropout Transformer,” in Proc.
30th ACM Int. Conf. Multimedia, 2022, pp. 5833–5841.
[54] Z. Guo, G. Yang, J. Chen, and X. Sun, “Exposing Deepfake Face
Forgeries With Guided Residuals,” IEEE Trans. Multimedia, vol. 25,
pp. 8458–8470, 2023.
[55] Y. Yu, X. Zhao, R. Ni, S. Yang, Y. Zhao, and A. C. Kot, “Augmented
Multi-Scale Spatiotemporal Inconsistency Magnifier for Generalized
DeepFake Detection,” IEEE Trans. Multimedia, vol. 25, pp. 8487–8498,
2023.
[56] Z. Gu, Y. Chen, T. Yao, S. Ding, J. Li, and L. Ma, “Delving into the
Local: Dynamic Inconsistency Learning for DeepFake Video Detection,”
in Proc. AAAI Conf. Artif. Intell., 2022, pp. 744–752.
[57] F. Chollet, “Xception: Deep Learning With Depthwise Separable Con-
volutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
pp. 1251–1258.
[58] Z. Hu, H. Xie, Y. Wang, J. Li, Z. Wang, and Y. Zhang, “Dynamic
Inconsistency-aware DeepFake Video Detection.” in Proc. 30th Int. Joint
Conf. Artif. Intell., 2021, pp. 736–742.
[59] L. Chai, D. Bau, S.-N. Lim, and P. Isola, “What Makes Fake Images
Detectable? Understanding Properties that Generalize,” in Proc. Eur.
Conf. Comput. Vis., 2020, pp. 103–120.
[60] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and
D. Batra, “Grad-CAM: Visual Explanations From Deep Networks via
Gradient-Based Localization,” in Proc. IEEE Int. Conf. Comput. Vis.,
2017, pp. 618–626.
[61] L. Van der Maaten and G. Hinton, "Visualizing Data Using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008.
Fan Nie received his B.S. degree and M.S. degree
in the School of Computer Science from Beijing
University of Posts and Telecommunications and
China Electric Power Research Institute, respec-
tively, in 2019 and 2022. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Sun Yat-sen University, and is also with Pengcheng Laboratory. His research interests include multimedia forensics and AIGC safety.
Jiangqun Ni (Member, IEEE) received the Ph.D.
degree in electronic engineering from The University
of Hong Kong in 1998. Then, he was a Post-Doctoral
Fellow for a joint program between Sun Yat-sen University, China, and Guangdong Institute of Telecommunication Research from 1998 to 2000. From 2001 to 2023, he was a Professor with the School of Data and Computer Science, Sun Yat-sen University. He
is currently a Professor with the School of Cyber
Science and Technology, Sun Yat-sen University,
Shenzhen, China, and also with the Department of
New Networks, Peng Cheng Laboratory, Shenzhen. His research interests
include data hiding, digital forensics, and image/video processing.
Jian Zhang received his B.S. degree in the School
of Electronics and Information Technology, Sun Yat-sen University, in 2019. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Sun Yat-sen University. His research interests include DeepFake detection and other multimedia forensics.
Bin Zhang received his Ph.D. degree in the De-
partment of Computer Science and Technology, Ts-
inghua University, China, in 2012. He worked as a postdoctoral researcher at Nanjing Telecommunication Technology Institute from 2014 to 2017. He is now a researcher in the Department of New Networks of Pengcheng Laboratory. He has published more than 50 papers in refereed international conferences and
journals. His current research interests focus on net-
work anomaly detection, Internet architecture and its
protocols, network traffic measurement, information
privacy security, etc.
Weizhe Zhang (Senior Member, IEEE) received the
B.S., M.E., and Ph.D. degrees in computer science and
technology in 1999, 2001, and 2006, respectively,
from the Harbin Institute of Technology. He is cur-
rently a professor at the School of Computer Science
and Technology, Harbin Institute of Technology,
China, and Vice Dean of the New Network De-
partment, Peng Cheng Laboratory, Shenzhen, China.
He has authored or coauthored more than 100 aca-
demic papers in journals, books, and conference
proceedings. His research interests primarily include computer networks, cyberspace security, high-performance computing, cloud computing, and embedded computing.