Research Article
Multiscale Features Integrated Model for Generalizable Deepfake Detection
Siqi Gu,1 Zihan Qin,1 Lizhe Xie,2,3 Zheng Wang,1 and Yining Hu1
1School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
2State Key Laboratory Cultivation Base of Research, Prevention and Treatment for Oral Diseases, Nanjing Medical University, Nanjing, China
3Jiangsu Province Engineering Research Center of Stomatological Translational Medicine, Nanjing Medical University, Gulou District, Nanjing, China
Correspondence should be addressed to Yining Hu; hyn.list@seu.edu.cn
Received 12 November 2024; Accepted 18 December 2024
Academic Editor: Beijing Chen
Copyright © 2025 Siqi Gu et al. International Journal of Intelligent Systems published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Within the domain of Artificial Intelligence Generated Content (AIGC), technological strides in image generation have been marked, resulting in the proliferation of deepfake images that pose substantial security threats. The current landscape of deepfake detection technologies is marred by limited generalization across diverse generative models and a subpar detection rate for images generated through diffusion processes. In response to these challenges, this paper introduces a novel detection model designed for high generalizability, leveraging multiscale frequency and spatial domain features. Our model harnesses an array of specialized filters to extract frequency-domain characteristics, which are then integrated with spatial-domain features captured by a Feature Pyramid Network (FPN). The integration of the Attentional Feature Fusion (AFF) mechanism within the feature fusion module allows for the optimal utilization of the extracted features, thereby enhancing detection capabilities. We curated an extensive dataset encompassing deepfake images from a variety of GANs and diffusion models for rigorous evaluation. The experimental findings reveal that our proposed model achieves superior accuracy and generalization compared to existing baseline models when confronted with deepfake images from multiple generative sources. Notably, in cross-model detection scenarios, our model outperforms the next best model by a significant margin of 29.1% for diffusion-generated images and 15.1% for GAN-generated images. This accomplishment presents a viable solution to the pressing issues of generalization and adaptability in the field of deepfake detection.
Keywords: Artificial Intelligence Generated Content; deepfake detection; diffusion model; generative adversarial networks
1. Introduction
As advancements in Artificial Intelligence Generated Content (AIGC) technologies continue to accelerate, a multitude of detection methods have been devised to authenticate synthetic media [1–3], tackling challenges such as the spread of misinformation, infringements on personal privacy, and instances of online financial fraud. Notably, deepfake images emerge as a focal point within this domain, owing to their propensity for rapid dissemination and their ubiquity on social media platforms. Therefore, this research underscores the importance of developing robust detection techniques specifically tailored to deepfake images.
Currently, the creation of deepfake images is primarily facilitated by two leading generative models: generative adversarial networks (GANs) [4] and diffusion models (DMs) [5]. Detection models for GAN-generated images typically enhance accuracy (ACC) by refining feature extraction, optimizing network architectures, and incorporating data augmentation strategies. Feature-based approaches primarily target the identification of spatial and frequency domain inconsistencies. For instance, McCloskey et al. [6] leveraged red-green bivariate histograms and abnormal pixel exposure ratios for detection, while Agarwal et al. [7] exploited high-frequency artifacts stemming from GANs' upsampling processes. Chen et al. [8] integrated both global and local image features, utilizing metric learning to enhance the overall detection performance of the model. In terms of network architectures, Convolutional Neural Networks (CNNs) remain a predominant choice for deepfake detection tasks due to their capacity to extract semantic, color, and texture information [9], as seen in Fu et al.'s [10] dual-channel CNN architecture, which is capable of concurrently processing both high- and low-frequency image components. Additionally, data augmentation plays a crucial role in boosting robustness and generalization. Wang et al. [11] demonstrated that combining preprocessing, postprocessing, and data augmentation allows a CNN trained on a single GAN dataset to generalize across multiple GAN models effectively.
For the detection of DM-generated images, a straightforward approach involves adapting existing GAN-based detectors. However, Corvi et al. [12] demonstrated that even advanced GAN detection algorithms exhibit suboptimal performance when tasked with identifying DM-generated images. In a manner similar to GAN-generated image detection, several studies leveraged spatial and frequency domain features for detecting DM images. For instance, Nguyen et al. [13] employed gradient-based features, while Bammey et al. [14] used cross-differential filters to highlight frequency artifacts. Additionally, Farid et al. [15, 16] highlighted inconsistencies in lighting, geometric structures, shadows, and reflections present in DM-generated images. Ojha et al. [17] proposed an alternative method by mapping images into CLIP's feature space and classifying them using cosine similarity with reference real or fake images. Wang et al. [18] introduced a method using the difference between reconstructed images and those under examination as features, achieving high ACC in the detection of diffusion-generated images.
Despite these developments, deepfake detection methodologies encounter numerous challenges. For GAN-generated images, existing detection models tend to perform inadequately when applied to multisource datasets due to their constrained feature extraction dimensions. While data augmentation and dataset expansion have been employed to enhance generalization [11], they often yield limited effectiveness due to their predictability. Furthermore, the robustness of GAN-based detectors is often compromised in the face of adversarial perturbations, as evidenced by the localized attacks conducted by Zhang et al. [19], which revealed vulnerabilities in a range of detection models. As for diffusion-generated images, effective detection techniques remain underexplored, and those available tend to be overly complex, lacking comprehensive image feature analysis, thus hindering their practical applicability.
In real-world scenarios, the generative model responsible for creating deepfake images is often unknown, emphasizing the necessity for detection models with strong cross-model generalization capabilities. However, traditional detection models struggle to handle such uncertainty, for they often fail to adequately leverage the rich and complex features inherent in deepfake images, or rely solely on single-dimensional feature learning, which limits their ability to capture the shared characteristics across different generative models. Ricker et al. [20] demonstrated that GAN detectors often fail to detect images generated by DMs, and retraining these detectors on diffusion-generated datasets leads to only marginal improvements. This lack of adaptability exposes a gap in current detection technologies, which struggle to handle the variety of deepfake generation techniques and their artifacts. To address the growing concerns surrounding synthetic media manipulation, there is an urgent need for robust and adaptable models capable of detecting deepfakes across diverse platforms and applications, regardless of the generative model or source datasets employed.
To address these limitations, this paper introduces a generalizable deepfake detection model that leverages both spatial and frequency domain features commonly exhibited by deepfake images. The proposed model utilizes a series of filters to extract multiscale frequency-domain features, and a Feature Pyramid Network (FPN) [21] to capture multiscale spatial-domain features. These features are subsequently fused and processed through a ResNet50 backbone for classification. To fully exploit the extracted features, multiple fusion strategies incorporating the Attentional Feature Fusion (AFF) [23] mechanism are devised, further enhancing the model's performance. The primary contributions of this work are as follows: (1) We develop a model-independent detection method that effectively identifies forged images from various generative models and datasets, addressing the black-box detection challenge where the generative model is unknown. (2) By extracting and fusing multiscale features from both spatial and frequency domains, we capture shared characteristics across different generative models, overcoming the limitations of single-feature modeling. (3) Experimental results demonstrate that our model significantly outperforms existing detectors on both GAN-generated and diffusion-generated images, as well as on images generated by advanced models, exhibiting superior generalization and detection ACC, thus validating the effectiveness of our approach.
2. The Proposed Method
2.1. Overall Structure. The overall structure of the proposed model is illustrated in Figure 1, comprising two primary components: a feature extraction module and a feature fusion module. In the feature extraction module, the images are initially processed through an image pyramid to create multiscale representations, which are subsequently passed through a series of filters to extract multiscale frequency-domain features. Concurrently, the original input images are fed into a FPN, facilitating the extraction of multiscale spatial-domain features. In the subsequent feature fusion module, the extracted features from both domains are effectively integrated through the AFF mechanism, ensuring an effective combination of multiscale spatial and frequency information. The fused features are then processed by a ResNet50 backbone network, which ultimately classifies the image as either real or fake. The detailed architecture and functionality of the model will be elaborated in the following sections.
Figure 1: Overall structure of the multiscale features integrated model for deepfake detection.
2.2. Frequency-Domain Feature Extraction. The generation of deepfake images involves sophisticated synthesis techniques, such as pixel-level manipulation, temporal and spatial consistency adjustments, and fine-tuning of facial features. These manipulations often introduce subtle discrepancies between deepfake and authentic images, which are most effectively observed in the frequency domain. However, these inconsistencies do not exhibit uniform patterns, necessitating the extraction of subtle variations and anomalous features across multiple scales. To address this, we employ a combination of Gaussian, Gabor, and wavelet filters, each selected for its specific advantages in extracting relevant frequency-domain features.
Gaussian filters are employed to capture low-frequency information, which is essential for identifying overall structural elements and large-scale patterns in images. As fundamental techniques for smoothing and noise reduction, they are particularly suitable for extracting the broad, global characteristics of deepfake images, which are often affected by low-frequency distortions. Specifically, a 3 × 3 Gaussian convolution kernel with a standard deviation of 1 and mean values of one is applied. The two-dimensional Gaussian function is defined as follows:
Gauss(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left( -\frac{(x - \bar{x})^2}{2\sigma_x^2} - \frac{(y - \bar{y})^2}{2\sigma_y^2} \right), \quad (1)
where Gauss(x, y) represents the Gaussian weight at point (x, y), σ_x and σ_y are the standard deviations, x̄ and ȳ are the mean values, and the normalization factor 1/(2πσ_xσ_y) ensures that the kernel sums to 1. To capture multiscale features, input images are downsampled by a factor of 2 using an image pyramid before Gaussian filtering, followed by linear interpolation and pixel-wise summation across scales. This procedure integrates multiscale filtered information while preserving critical edges and details.
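A minimal sketch of this multiscale Gaussian branch is given below, assuming OpenCV-style APIs; the 3 × 3 kernel, unit standard deviation, and factor-of-2 pyramid downsampling follow the text, while the function name and the number of pyramid levels are illustrative choices.

```python
import cv2
import numpy as np

def multiscale_gaussian_features(image: np.ndarray, levels: int = 3) -> np.ndarray:
    """Blur each pyramid level with a 3x3 Gaussian (sigma = 1), upsample back, and sum."""
    h, w = image.shape[:2]
    fused = np.zeros_like(image, dtype=np.float32)
    current = image.astype(np.float32)
    for _ in range(levels):
        blurred = cv2.GaussianBlur(current, ksize=(3, 3), sigmaX=1, sigmaY=1)
        # Linear interpolation back to the original resolution, then pixel-wise summation.
        fused += cv2.resize(blurred, (w, h), interpolation=cv2.INTER_LINEAR)
        current = cv2.pyrDown(current)  # downsample by a factor of 2 for the next pyramid level
    return fused
```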
Gabor filters, renowned for their ability to detect edges and analyze textures, are employed for their robust capability in directional and scale-selective analysis. These filters are particularly effective in isolating mid-to-high-frequency components and detecting local patterns, which are often indicative of subtle artifacts introduced by deepfake generation techniques. Their ability to capture directional and textural information makes them well-suited for analyzing textured regions in deepfake images. The two-dimensional Gabor function is defined as follows:
Gabor(x, y) = \exp\left( -\frac{x'^2 + c^2 y'^2}{2\sigma^2} \right) \cdot \cos\left( \frac{2\pi x'}{\lambda} + \psi \right), \quad (2)
where σ is the Gaussian standard deviation, c is the spatial aspect ratio, λ is the wavelength, and ψ is the phase offset. The rotated coordinates x′ and y′ are given in equations (3) and (4):
x' = x\cos\theta + y\sin\theta, \quad (3)
y' = -x\sin\theta + y\cos\theta, \quad (4)
where θ defines the filter's orientation. In this work, we employ 16 Gabor kernels with orientations ranging from 0° to 360°, each with a kernel size of 31 × 31, a standard deviation of 4, a spatial aspect ratio of 0.5, a wavelength of 10, and a phase offset of 0. These filters process multiscale images generated by the image pyramid, capturing texture and directional features from various orientations.
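Similarly, a minimal sketch of this Gabor filter bank is shown below, assuming OpenCV's getGaborKernel; the 16 orientations, 31 × 31 kernel, σ = 4, aspect ratio 0.5, λ = 10, and ψ = 0 follow the text, while the max-over-orientations aggregation is an assumption made here for illustration.

```python
import cv2
import numpy as np

def gabor_filter_bank(gray: np.ndarray, n_orientations: int = 16) -> np.ndarray:
    """Apply 16 oriented Gabor kernels and keep the strongest response per pixel."""
    responses = []
    for k in range(n_orientations):
        theta = 2.0 * np.pi * k / n_orientations  # orientations spanning 0 to 360 degrees
        kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=4, theta=theta,
                                    lambd=10, gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel))
    return np.max(np.stack(responses, axis=0), axis=0)
```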
Wavelet filters are particularly effective in balancing spatial and frequency domain information, enabling a comprehensive multiscale analysis of deepfake images. They allow the separation of high- and low-frequency components, facilitating the detection of both large-scale structural inconsistencies and localized anomalies. Wavelet decomposition provides a powerful tool for extracting multiresolution features, which can be crucial for identifying subtle artifacts that are distributed across different frequency bands.
To harness these capabilities, the Daubechies 4-tap (db4) wavelet is selected due to its balance between computational efficiency and its ability to capture detailed image features across scales. The wavelet decomposition and the combination of its high-frequency sub-bands are represented by equations (5) and (6), respectively:
\mathrm{DWT}(I) = (LL, LH, HL, HH), \quad (5)
\mathrm{high\_freq\_img} = \sum \left( |LH| + |HL| + |HH| \right), \quad (6)
where I represents the input image. LL denotes the low-frequency approximation coefficients capturing the overall structure and large-scale features of the image. LH, HL, and HH represent the high-frequency detail coefficients in the horizontal, vertical, and diagonal directions, respectively. The high-frequency images are combined to generate a complete high-frequency image using equation (6), which, along with the low-frequency approximation LL, is utilized for detection. Given the wavelet transform's inherent capability to effectively represent multiscale features, the original image is utilized directly as input, bypassing the need for pyramid preprocessing. The wavelet decomposition is conducted over two levels, producing multiscale wavelet-filtered representations that encapsulate both global and localized image characteristics. The db4 wavelet enables the comprehensive utilization of an image's detailed information and noise characteristics, thereby enabling robust analysis of both global and local anomalies in deepfake images.
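A minimal sketch of this two-level db4 decomposition is given below, assuming the PyWavelets package; resizing the coarser detail bands to the finest resolution before summation is an assumption made here so the maps can be combined pixel-wise.

```python
import cv2
import numpy as np
import pywt

def db4_wavelet_maps(gray: np.ndarray, levels: int = 2):
    """Return the low-frequency approximation LL and a combined high-frequency map."""
    coeffs = pywt.wavedec2(gray.astype(np.float32), wavelet="db4", level=levels)
    low_freq = coeffs[0]                 # LL approximation at the coarsest level
    finest_shape = coeffs[-1][0].shape   # (rows, cols) of the finest detail sub-bands
    high_freq = np.zeros(finest_shape, dtype=np.float32)
    for lh, hl, hh in coeffs[1:]:        # detail coefficients per level, coarse to fine
        band = np.abs(lh) + np.abs(hl) + np.abs(hh)
        high_freq += cv2.resize(band.astype(np.float32),
                                (finest_shape[1], finest_shape[0]),
                                interpolation=cv2.INTER_LINEAR)
    return low_freq, high_freq
```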
Collectively, the integration of Gaussian, Gabor, and wavelet filters facilitates the extraction of comprehensive frequency-domain features. Gaussian filters capture global structural patterns by focusing on low-frequency information. Gabor filters extract localized texture and edge information through their directional and scale-selective properties. Wavelet filters provide a multiscale analysis by decomposing the image into various frequency components. Together, these filters enhance the model's ability to identify subtle inconsistencies across scales, leading to more effective deepfake detection.
2.3. Spatial-Domain Feature Extraction. The previous section illustrated that the selected filters effectively capture multiscale frequency-domain features, including texture and edge information. However, relying exclusively on frequency-domain features imposes inherent limitations on the detection model's ability to fully encapsulate the semantic content of images. Generative models often introduce subtle artifacts in both global structures and localized details, necessitating the extraction of complementary spatial-domain features for a more comprehensive analysis. By synthesizing both spatial and frequency-domain characteristics, the model achieves a more holistic representation, significantly improving the detection performance, particularly in complex and heterogeneous scenarios.
In this research, we utilize a FPN [21] architecture, constructed on the foundation of a ResNet50 [22] backbone, to effectively extract multiscale spatial-domain features. During the bottom-up pathway of the FPN, corresponding to the forward propagation within ResNet50, four feature maps of distinct resolutions are produced. These feature maps progressively distill high-level semantic information through multiple convolutional layers, while simultaneously preserving essential multiscale structural information. In the top-down pathway, these feature maps are subjected to dimensionality reduction via 1 × 1 convolutional layers, upsampled using linear interpolation, and then merged layer by layer to integrate semantic information across various scales. The fused feature maps are subsequently refined through additional convolution layers to yield the final multiscale feature representation. The structure of the FPN is shown in Figure 2. By these means, the FPN can effectively extract and synthesize multiscale spatial-domain features, enabling the model to perceive both overarching semantic structures and intricate local details, thereby substantially improving detection ACC and generalization across diverse scenarios.
Figure 2: Structure of Feature Pyramid Network.
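As an illustration, the following is a minimal sketch of such a ResNet50-backed FPN built from torchvision components, returning 256-channel maps at the four scales shown in Figure 2; the specific layer taps and the absence of pretrained weights are assumptions, not the authors' released implementation.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

# Bottom-up pathway: tap the four ResNet50 stages (C2-C5).
backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"},
)
# Top-down pathway: 1x1 lateral convolutions, upsampling, and layer-by-layer merging.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

def spatial_features(images: torch.Tensor):
    """images: (B, 3, 256, 256) -> dict of 256-channel feature maps at four scales."""
    return fpn(backbone(images))
```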
2.4. Feature Fusion Module. In the preceding sections, we selected a variety of complementary frequency-domain and spatial-domain features to enhance the efficacy of deepfake detection. By integrating these features, a more holistic and robust image representation can be constructed. These features are processed through a Batch Normalization (BN) layer, followed by ReLU activation, and then passed through a Squeeze-and-Excitation (SE) attention layer before being input into a ResNet50 network for final classification.
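The following is a minimal sketch of this pre-classification pipeline (BN, ReLU, SE attention, then ResNet50), assuming a 1 × 1 convolution to adapt the fused feature maps to the three input channels that ResNet50 expects; the channel counts and squeeze ratio are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)

class FusionHead(nn.Module):
    """BN -> ReLU -> SE attention -> 1x1 channel adaptation -> ResNet50 classifier."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.pre = nn.Sequential(nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
                                 SEBlock(in_channels))
        self.adapt = nn.Conv2d(in_channels, 3, kernel_size=1)  # ResNet50 expects 3 channels
        self.classifier = resnet50(weights=None, num_classes=1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.adapt(self.pre(fused)))
```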
To further optimize the utilization of extracted features, we incorporate the AFF [23] mechanism within the feature fusion module. AFF is an adaptive fusion technique designed to dynamically combine input features in a scale-aware and content-adaptive manner. This mechanism is particularly effective in addressing the challenge of merging features with disparate scales, resolutions, or semantic meanings, which is a common issue when integrating spatial and frequency-domain features.
The AFF framework consists of two key components: feature aggregation and attention-based weighting. The feature aggregation process first combines the input features through element-wise addition to produce a unified representation. To address the limitations of raw aggregation, which fails to account for the varying importance of features across different scales and contexts, attention-based weighting is employed. The Multiscale Channel Attention Module (MS-CAM) calculates dynamic attention weights for each feature, facilitating the selective enhancement of more informative features while suppressing less relevant ones. This enables the model to prioritize features that are most indicative of subtle inconsistencies in deepfake images, thereby refining the fusion process and improving detection ACC. The structures of AFF and MS-CAM are depicted in Figures 3 and 4, respectively.
Figure 3: Structure of Attentional Feature Fusion Module.
Figure 4: Structure of Multiscale Channel Attention Module.
The specific process of the AFF strategy is shown in equation (7). Given two input features X and Y, the initial fusion is performed through element-wise addition, producing a preliminary combined feature X ⊕ Y. This is followed by the computation of a weight M(X ⊕ Y) via MS-CAM, which modulates the relative significance of X and Y. The final fused feature Z is derived by applying a weighted sum, where M(X ⊕ Y) is assigned to feature X and 1 − M(X ⊕ Y) to feature Y. Although this approach introduces a slight increase in computational cost, the additional FLOPs are minimal, making it a favorable trade-off in light of the significant improvements in detection performance.
Z = M(X \oplus Y) \cdot X + (1 - M(X \oplus Y)) \cdot Y. \quad (7)
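For reference, the following is a minimal PyTorch sketch of MS-CAM and the AFF fusion in equation (7), following the structures in Figures 3 and 4; the channel count and reduction ratio r used here are illustrative assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multiscale channel attention: global and local branches fused by a sigmoid gate."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        inter = channels // r
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter, kernel_size=1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter, kernel_size=1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.local_att(x) + self.global_att(x))

class AFF(nn.Module):
    """Z = M(X + Y) * X + (1 - M(X + Y)) * Y, as in equation (7)."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.mscam = MSCAM(channels, r)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        m = self.mscam(x + y)   # attention weight computed on the element-wise sum
        return m * x + (1 - m) * y
```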
3. Experiment
3.1. Datasets and Experiment Setting. In this study, we comprehensively evaluate the performance of the model by constructing a diverse dataset that includes both diffusion-generated and GAN-generated images, as well as more recent deepfake images from advanced models. The dataset is carefully curated to include images from a variety of generative models, ensuring a broad representation of different architectures and data sources. As outlined in Table 1, the training set includes images generated by ProGAN [26], a GAN-based model, and ADM [24], a DM. The testing set is designed to provide a comprehensive evaluation, including models like iDDPM [27], PNDM [28], and ADM [24] trained on datasets such as LSUN_bedroom, ImageNet, and CelebA, as well as more recent models like Midjourney [34] and DALLE2 [35]. Notably, there are no overlapping samples between the training and testing sets, which ensures an unbiased evaluation of the model's generalization capabilities and mitigates the risk of overfitting. These sources are selected to ensure both model diversity and real-world applicability, covering a wide range of generative techniques.
Table 1: Deepfake image dataset containing various types of generated images.
Data type | Model type | Generative model | Source dataset | Volume (real/fake)
Training | DM | ADM [24] | LSUN_B [25] | 20k/20k
Training | GAN | ProGAN [26] | ProGAN | 10k/10k
Testing | DM | iDDPM [27] | LSUN_B | 2k/2k
Testing | DM | PNDM [28] | LSUN_B | 2k/2k
Testing | DM | ADM [24] | LSUN_B | 2k/2k
Testing | DM | ADM [24] | ImageNet [29] | 2k/2k
Testing | DM | SDv2 [30] | CelebA | 2k/2k
Testing | GAN | bigGAN [31] | ImageNet | 2k/2k
Testing | GAN | cycleGAN [32] | cycleGAN | 2k/2k
Testing | GAN | starGAN [33] | CelebA | 2k/2k
Testing | Advanced models | Midjourney [34] | CelebA | 2k/2k
Testing | Advanced models | DALLE2 [35] | CelebA | 2k/2k
During the training phase, the dataset is split into
training and validation subsets in an 80–20 ratio to facilitate
cross-validation and ensure a fair evaluation of the model’s
performance. To enhance the model’s robustness and gen-
eralization, data augmentation techniques such as random
rotation (with a ±15-degree range) and random horizontal flipping are applied, and all images are uniformly center-cropped to 256 × 256 pixels.
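A minimal sketch of this augmentation and preprocessing pipeline, assuming torchvision transforms (the composition order is an illustrative choice):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # random rotation within ±15 degrees
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flipping
    transforms.CenterCrop(256),               # uniform 256 x 256 center crop
    transforms.ToTensor(),
])
```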
The model is implemented using the PyTorch framework, version 2.0.0. Its performance is evaluated using two metrics: ACC and F1-score. ACC measures overall detection accuracy, while the F1-score accounts for the balance between precision and recall, which is particularly crucial in scenarios with potential class imbalance. For the training process, binary cross-entropy is chosen as the loss function, as the task involves binary classification. The model is trained using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32 for 50 epochs. A learning rate decay strategy is applied, where the learning rate is reduced to 10% of its current value if validation metrics do not improve over 8 consecutive epochs. Training is conducted on an NVIDIA V100 GPU to leverage its computational capabilities. The environment setup includes CUDA 11.7 and cuDNN 8 to ensure efficient utilization of GPU resources.
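The following is a minimal sketch of this optimization setup (Adam with learning rate 1e-4, batch size 32, 50 epochs, binary cross-entropy, and a 10% learning-rate decay after 8 stagnant epochs); the model, the data loaders, and the evaluate helper are assumed to exist and are not part of the original description.

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss()                 # binary cross-entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=8)       # decay when the validation metric stalls

for epoch in range(50):
    model.train()
    for images, labels in train_loader:                  # batches of 32 images
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), labels.float())
        loss.backward()
        optimizer.step()
    val_acc = evaluate(model, val_loader)                 # assumed validation helper
    scheduler.step(val_acc)
```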
3.2. Ablation Experiment of Three Filters. In this experiment, we aim to evaluate the independent and combined contributions of three frequency-domain filters to the overall performance of our model. The models under investigation are as follows: (1) Origin: Directly feeding the original input images into the ResNet50 network without applying any frequency-domain filtering; (2) Gauss: Processing the images with a Gaussian filter before feeding them into the ResNet50 backbone; (3) Gabor: Processing the images with a Gabor filter before feeding them into the ResNet50 backbone; (4) Wave: Processing the images with a wavelet filter prior to feeding them into the ResNet50 backbone; (5) Gauss_Gabor: Concatenating the feature maps from the Gaussian and Gabor filters before feeding them into the ResNet50 backbone; (6) Gauss_Wave: Concatenating the feature maps from the Gaussian and wavelet filters before feeding them into the ResNet50 backbone; (7) Gabor_Wave: Concatenating the feature maps from the Gabor and wavelet filters before feeding them into the ResNet50 backbone; (8) Combined: Integrating the outputs of the Gaussian, Gabor, and wavelet filters by concatenating their features, followed by processing them through convolutional layers before feeding into the ResNet50 backbone.
For this study, we use the ADM and ProGAN datasets described in Table 1 for training, while the testing set comprises images from all datasets enumerated in Table 1. Specifically, 500 real images and 500 generated images are randomly selected from each of the eight testing sets, resulting in a comprehensive mixed testing set to evaluate model performance. The evaluation metrics include detection accuracy and F1-scores, which are presented in Table 2 for each filtering strategy.
Table 2: Accuracy and F1-scores of models using different filtering strategies.
Filtering strategy | ProGAN ACC | ProGAN F1 | ADM ACC | ADM F1
Origin | 0.501 | 0.003 | 0.509 | 0.040
Gauss | 0.752 | 0.741 | 0.806 | 0.836
Gabor | 0.717 | 0.712 | 0.782 | 0.761
Wave | 0.776 | 0.756 | 0.807 | 0.837
Gauss_Gabor | 0.762 | 0.754 | 0.823 | 0.839
Gauss_Wave | 0.789 | 0.778 | 0.838 | 0.854
Gabor_Wave | 0.782 | 0.763 | 0.831 | 0.847
Combined | 0.803 | 0.792 | 0.853 | 0.865
Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
The experimental results indicate that all three individual filtering strategies (Gauss, Gabor, Wave) outperform the Origin model, confirming the effectiveness of each filter. Furthermore, the two-filter combinations (Gauss_Gabor, Gauss_Wave, Gabor_Wave) using concatenation demonstrate varying improvements in detection performance, with the Combined model, which integrates all three filters by concatenating their feature maps, achieving the highest performance. This suggests that while each filter contributes positively, combining all three leverages their complementary strengths, with Gaussian filtering enhancing local smoothing, Gabor kernels capturing texture orientation, and wavelets highlighting multiscale features, leading to notable improvements in both accuracy and robustness. These findings highlight the value of integrating diverse frequency-domain features to enhance model performance in complex detection tasks.
3.3. Ablation Experiment of Feature Fusion Strategy. In this experiment, we evaluate the performance of three distinct feature fusion strategies. The strategies under investigation are as follows: (1) Concat: Concatenating all the extracted features; (2) AFF1: Applying the AFF strategy to fuse the multiscale frequency-domain features extracted from the three filters, followed by concatenation with the spatial-domain features derived from the FPN; (3) AFF2: Initially concatenating the frequency-domain features, then fusing them with the spatial-domain features using AFF. The detailed process of each fusion strategy is illustrated in Figure 5, and a sketch of the AFF2 variant is given below. The same training and testing sets used in Section 3.2 are employed in this experiment to maintain consistency. The model performance, in terms of detection ACC and F1-scores for each fusion strategy, is presented in Table 3.
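As referenced above, the following is a minimal sketch of the AFF2 strategy, reusing an AFF module such as the one sketched in Section 2.4; the 1 × 1 channel-alignment convolution is an illustrative assumption.

```python
import torch
import torch.nn as nn

def aff2_fuse(gauss_f: torch.Tensor, gabor_f: torch.Tensor, wave_f: torch.Tensor,
              spatial_f: torch.Tensor, aff: nn.Module, align: nn.Conv2d) -> torch.Tensor:
    """AFF2: concatenate the frequency-domain maps first, then fuse with spatial features via AFF."""
    freq = torch.cat([gauss_f, gabor_f, wave_f], dim=1)  # channel-wise concatenation
    freq = align(freq)                                   # 1x1 conv to match the spatial feature channels
    return aff(freq, spatial_f)                          # attentional fusion as in equation (7)
```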
The results demonstrate that the model employing the AFF2 fusion strategy achieves superior ACC and F1-score compared to the other models, which can be largely attributed to the complementary nature of multiscale frequency-domain features. The application of AFF solely to frequency-domain features (AFF1) yields limited improvement. In contrast, when AFF is employed to fuse both frequency- and spatial-domain features (AFF2), it facilitates a more effective integration of cross-scale and cross-modal information, significantly enhancing overall model performance. In addition, the Concat model shows slightly inferior performance in comparison with the Combined model in Section 3.2, likely due to its lack of effective mechanisms for integrating complementary relationships between features. These findings highlight the efficacy of the AFF mechanism in deepfake detection, as it maximizes the utility of diverse feature representations while reducing redundancy, thus improving both detection accuracy and generalization.
3.4. Ablation Experiment of FPN. Although the contribution of the FPN module has already been reflected in the results of the previous two ablation experiments, we present this dedicated comparison to ensure the completeness of our ablation study. The models under investigation are as follows: (1) Without FPN: the Combined model from Section 3.2; (2) With FPN: the AFF2 model from Section 3.3, which is also the proposed model.
The results, as summarized in Table 4, indicate that the model with FPN outperforms the model without FPN in both detection ACC and F1-score. Specifically, the incorporation of the FPN allows the model to more effectively capture spatial information at multiple scales, which complements the frequency-domain features and results in enhanced detection performance. These findings highlight the value of integrating spatial-domain features via the FPN, demonstrating that the addition of the FPN module significantly contributes to the model's capability to generalize across different deepfake sources, thereby improving both ACC and robustness.
3.5. Comparison Experiment of Detection ACC. In this experiment, we evaluate the performance of our proposed model in comparison to existing models recognized for their strong generalization capabilities and high detection ACC. Given the superior performance of the AFF2 fusion strategy observed in the ablation studies, we select the model employing this strategy for further comparison. To assess the model's effectiveness in detecting both GAN- and diffusion-generated images, we utilize two baseline models: WangCNN [11] and DIRE [18]. WangCNN functions as a general detector for GAN-generated images, demonstrating strong generalization across various GAN datasets using preprocessing, postprocessing, and data augmentation
techniques. The DIRE model, on the other hand, is specifically designed for detecting diffusion-generated images through the discrepancies between the input image and its reconstructed counterpart.
Using the ProGAN dataset for training, we compare the performance of our model with the baseline models across images generated by various generative models and datasets, as summarized in Table 5. The two variations of WangCNN, differentiated by their use of data augmentation techniques, are referred to as WangCNN0.1 and WangCNN0.5. Since the testing set includes images generated by ADM based on two different datasets, they are distinguished as ADM_lsun and ADM_imag in the table. Additionally, DM_mix, GAN_mix, and Total_mix refer to the mixed datasets consisting of diffusion-generated, GAN-generated, and both types of images, respectively. Specifically, DM_mix contains images from five DM testing sets, GAN_mix from three GAN testing sets, and Total_mix from all eight testing sets.
Figure 5: Diagram of three feature fusion strategies.
Table 3: Accuracy and F1-scores of models under different fusion strategies.
Training set | Metrics | Concat | AFF1 | AFF2 (proposed)
ProGAN | ACC | 0.799 | 0.8 | 0.815
ProGAN | F1 | 0.779 | 0.777 | 0.795
ADM | ACC | 0.842 | 0.871 | 0.881
ADM | F1 | 0.824 | 0.874 | 0.891
Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
Table 4: Accuracy and F1-scores of models with and without FPN.
Training set | Metrics | Without FPN | With FPN (proposed)
ProGAN | ACC | 0.803 | 0.815
ProGAN | F1 | 0.792 | 0.795
ADM | ACC | 0.853 | 0.881
ADM | F1 | 0.865 | 0.891
Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
Table 5: Comparison of detection accuracy of different models trained on the ProGAN dataset.
Testing set | WangCNN0.1 | WangCNN0.5 | DIRE | Proposed
ADM_lsun | 0.528 | 0.513 | 0.521 | 0.798
iDDPM | 0.575 | 0.536 | 0.522 | 0.762
PNDM | 0.61 | 0.559 | 0.543 | 0.738
ADM_imag | 0.573 | 0.556 | 0.679 | 0.87
SD | 0.416 | 0.54 | 0.8 | 0.837
DM_mix | 0.54 | 0.541 | 0.621 | 0.802
bigGAN | 0.645 | 0.599 | 0.662 | 0.7
cycleGAN | 0.817 | 0.833 | 0.726 | 0.773
starGAN | 0.828 | 0.793 | 0.858 | 0.97
GAN_mix | 0.763 | 0.742 | 0.749 | 0.836
Midjourney | 0.553 | 0.523 | 0.623 | 0.863
DALLE2 | 0.470 | 0.646 | 0.806 | 0.839
Total_mix | 0.624 | 0.616 | 0.669 | 0.815
Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
For each mixed dataset, 500 real images and 500 generated images are randomly selected from all constituent datasets.
The experimental results demonstrate that our model, trained on the ProGAN dataset, not only performs well on GAN-generated images but exhibits a particularly notable advantage in detecting diffusion-generated images. On the three mixed datasets, the detection ACC of our model for GAN-generated images (GAN_mix) is 9.6% higher than that of the second-best model, WangCNN0.1. Additionally, the most significant performance gain is observed for diffusion-generated images (DM_mix), where our model surpasses the second-best model, DIRE, by an impressive 29.1%. Overall, our model achieves a 21.8% higher detection ACC than DIRE on the combined testing set (Total_mix), further underscoring its robustness in handling both types of images, with a clear edge in diffusion-generated image detection.
When trained on the ADM dataset, the results in Table 6 reveal that our model consistently delivers superior detection performance for both diffusion- and GAN-generated images. Notably, the model excels in identifying cross-model GAN-generated images, achieving the highest detection ACC across all GAN testing sets, with a 15.1% improvement on the GAN_mix testing set over the second-best model, DIRE.
Furthermore, when tested on the Midjourney and DALLE2 datasets, our model achieves the highest detection ACC compared to other baselines, regardless of the training dataset used. This further reinforces its generalization capabilities across both traditional and newer generative models. The results of the two comparison experiments demonstrate that our model not only maintains high detection ACC for images generated by models of the same type, but also excels in detecting cross-model generated images, highlighting its strong generalization ability across diverse deepfake generation techniques.
Table 6: Comparison of detection accuracy of different models trained on the ADM dataset.
Testing set | WangCNN0.1 | WangCNN0.5 | DIRE | Proposed
ADM_lsun | 0.971 | 0.96 | 0.985 | 0.985
iDDPM | 0.973 | 0.966 | 0.976 | 0.979
PNDM | 0.971 | 0.966 | 0.985 | 0.982
ADM_imag | 0.655 | 0.616 | 0.733 | 0.77
SD | 0.581 | 0.432 | 0.805 | 0.92
DM_mix | 0.83 | 0.788 | 0.897 | 0.96
bigGAN | 0.604 | 0.562 | 0.61 | 0.715
cycleGAN | 0.508 | 0.507 | 0.492 | 0.711
starGAN | 0.688 | 0.576 | 0.92 | 0.965
GAN_mix | 0.6 | 0.548 | 0.674 | 0.776
Midjourney | 0.594 | 0.532 | 0.501 | 0.920
DALLE2 | 0.615 | 0.755 | 0.811 | 0.952
Total_mix | 0.744 | 0.698 | 0.813 | 0.881
Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
3.6. Analysis of Performance Variations. Although the model demonstrates strong performance across most datasets, notable variations in performance are observed across different testing sets. Using the ADM training dataset as a case study, we explore these discrepancies in greater detail.
As shown in Table 6, when tested on the images generated by ADM using the ImageNet dataset (ADM_imag dataset), the model's ACC notably declines to 0.77, compared to 0.985 on the ADM_lsun dataset. This suggests that the higher variability and naturalistic features of ImageNet images may present more challenges for the model, which was primarily trained on the LSUN_bedroom dataset, highlighting a potential gap in the model's ability to generalize to such varied deepfake characteristics.
Additionally, the model’s performance on bigGAN
(0.715) and cycleGAN (0.711) is lower compared to its
performance on starGAN images (0.965). is may be due to
the fact that starGAN images are potentially more similar to
the images in training set or to diusion-generated images in
general. e more consistent and structured features of
starGAN images could align more closely with the model’s
learned characteristics, which were optimized for detecting
patterns found in diusion-based deepfakes. As a result, this
similarity could facilitate better detection ACC compared to
the more irregular patterns seen in bigGAN and cycleGAN
images.
These performance variations highlight an area for improvement in the model's ability to generalize across images with substantial inherent differences. To enhance the model's robustness and broaden its applicability, future work should consider utilizing more diverse training datasets that ensure a more balanced representation of both GAN- and diffusion-generated images. Furthermore, exploring multimodel or hybrid training strategies could further augment the model's generalization capabilities across the diverse range of deepfake generation techniques.
3.7. Model Robustness Experiment Results and Analysis. To further evaluate the robustness of the proposed model under adversarial conditions, we conduct a set of experiments to assess the model's resilience to common distortions and manipulations, including Gaussian blur and JPEG compression. Specifically, Gaussian blur is applied using a kernel with a standard deviation of 0.5, while JPEG compression is tested with a quantization factor of 80.
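A minimal sketch of these two perturbations, assuming OpenCV; the blur kernel size is an illustrative choice, while the standard deviation of 0.5 and the quality factor of 80 follow the text.

```python
import cv2
import numpy as np

def gaussian_blur(image: np.ndarray, sigma: float = 0.5) -> np.ndarray:
    """Mild Gaussian blur with a standard deviation of 0.5."""
    return cv2.GaussianBlur(image, ksize=(3, 3), sigmaX=sigma, sigmaY=sigma)

def jpeg_compress(image: np.ndarray, quality: int = 80) -> np.ndarray:
    """Re-encode the image as JPEG at the given quality factor and decode it back."""
    ok, buf = cv2.imencode(".jpg", image, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```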
In this experiment, we use the ProGAN training set, with the same mixed testing set as in Section 3.2, to evaluate different fusion strategies under Gaussian blur and JPEG compression. The performance metrics are summarized in Table 7.
Table 7: Accuracy and F1-scores of models using different fusion strategies under Gaussian blur and JPEG compression, trained on the ProGAN dataset.
Perturbation | Concat ACC | Concat F1 | AFF1 ACC | AFF1 F1 | AFF2 ACC | AFF2 F1
None | 0.799 | 0.779 | 0.8 | 0.777 | 0.815 | 0.795
Blur | 0.785 | 0.776 | 0.786 | 0.747 | 0.798 | 0.77
Compression | 0.711 | 0.685 | 0.703 | 0.692 | 0.758 | 0.708
The results indicate that Gaussian blur has a relatively limited impact on detection ACC, with a gradual decline observed as the blur intensity increases. The model, particularly when employing the AFF mechanisms, demonstrates the capability to maintain high detection ACC even under moderate blurring, highlighting its robustness against this type of distortion. On the other hand, JPEG compression shows a more pronounced effect on model performance, especially as the compression level becomes more severe.
ese ndings suggest that JPEG compression may have
a considerable impact on frequency-domain features, po-
tentially contributing to the observed performance
Table 6: Comparison of detection accuracy of dierent models
trained on the ADM dataset.
WangCNN0.1 WangCNN0.5 DIRE Proposed
ADM_lsun 0.971 0.96 0.985 0.985
iDDPM 0.973 0.966 0.976 0.979
PNDM 0.971 0.966 0.985 0.982
ADM_imag 0.655 0.616 0.733 0.77
SD 0.581 0.432 0.805 0.92
DM_mix 0.83 0.788 0.897 0.96
bigGAN 0.604 0.562 0.61 0.715
cycleGAN 0.508 0.507 0.492 0.711
starGAN 0.688 0.576 0.92 0.965
GAN_mix 0.6 0.548 0.674 0.776
Midjourney 0.594 0.532 0.501 0.920
DALLE2 0.615 0.755 0.811 0.952
Total_mix 0.744 0.698 0.813 0.881
Note: e highest values are highlighted in bold to emphasize the eec-
tiveness of the adopted modules and the superiority of our proposed model.
International Journal of Intelligent Systems 9
degradation. However, even under these challenging con-
ditions, the model demonstrates relatively good robustness,
indicating its potential practical applicability in real-world
scenarios where such common manipulations are often
encountered. Additionally, the results of this experiment
highlight the contribution of the AFF mechanism to the
model’s robustness. By incorporating the advanced fusion
strategy AFF2, the model demonstrates improved resilience,
making it better equipped to handle adversarial conditions
and solidifying its potential as an eective solution for
deepfake detection in diverse environments.
4. Conclusion
In this study, we propose a deepfake detection model that demonstrates superior detection ACC and generalization capability, effectively addressing the limitations of existing methods. The experimental results confirm that our model surpasses baseline methods in both detection ACC and generalization across various GANs and DMs, showcasing its robustness in identifying deepfake images from diverse sources. A key highlight of our model is its ability to generalize across different generative models: even when trained on the relatively dated ADM dataset, the model performs strongly on images generated by the more recent DM Stable Diffusion v2 (SDv2) [30], as well as images generated by more advanced models like Midjourney [34] and DALLE2 [35], underscoring its robust generalization capability.
Such strong performance is largely attributable to the structure of our model. By leveraging multiscale frequency-domain features extracted by Gaussian, Gabor, and wavelet filters, and integrating them with multiscale spatial-domain features from a FPN, our model effectively captures complementary information across multiple scales and domains. The incorporation of the AFF mechanism further enhances the model's performance, as confirmed through ablation experiments.
In conclusion, the proposed model excels in detection ACC and generalization across diverse deepfake sources, positioning it as a robust solution for tackling the growing challenges in deepfake detection.
5. Discussion
While the proposed model demonstrates strong detection ACC and generalization capabilities across GAN- and diffusion-generated images, certain limitations remain. The model predominantly relies on frequency-domain features, with the FPN being utilized for spatial feature modeling. However, the spatial features captured by the FPN may not fully reflect the intrinsic spatial characteristics of the images. Future work will aim to incorporate more explicit spatial-domain features, such as those derived from advanced denoising algorithms, edge-detection models, or texture analysis techniques. Integrating these explicit spatial features is expected to result in a more comprehensive and balanced feature extraction process, optimizing the overall architecture and enhancing the model's ability to detect diverse image manipulations.
Additionally, despite the notable generalization capabilities of the current model, the rapid evolution of generative models, particularly the increasing prevalence of fine-tuned variants, raises concerns about its adaptability to newly emerging and previously unseen content. Addressing this challenge will require further investigation into adaptive learning strategies that can accommodate the fast-paced advancements in generative models. Techniques such as continual learning or incorporating a wider variety of generative models during training may help maintain the model's effectiveness. Furthermore, as AIGC technologies continue to evolve, ensuring the traceability of manipulated content is becoming increasingly critical for maintaining transparency and security. Future research will extend toward improving detection capabilities for unknown generative models, focusing not only on adaptability but also on AIGC traceability to ensure the authenticity and security of generated content across diverse sectors.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
This work was supported in part by the National Key Research and Development Program of China under Grant (2023YFC3010302), the National Natural Science Foundation of China (Grant No. 82101079), and the Key R&D Program of Jiangsu Province (BE2023836).
References
[1] K. J. Ma, Y. F. Feng, B. J. Chen, and G. Y. Zhao, "End-to-End Dual-Branch Network towards Synthetic Speech Detection," IEEE Signal Processing Letters 30 (2023): 359–363, https://doi.org/10.1109/LSP.2023.3262419.
[2] P. Yu, Z. Xia, J. Fei, and Y. Lu, A Survey on Deepfake Video Detection (2021).
[3] L. Lin, N. Gupta, Y. Zhang, et al., "Detecting Multimedia Generated by Large AI Models: A Survey" (2024).
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative Adversarial Nets," Advances in Neural Information Processing Systems 27 (2014): 2672–2680.
[5] L. Yang, Z. L. Zhang, Y. Song, et al., "Diffusion Models: A Comprehensive Survey of Methods and Applications," ACM Computing Surveys 56, no. 4 (2024): 1–39, https://doi.org/10.1145/3626235.
[6] S. McCloskey and M. Albright, "Detecting GAN-Generated Imagery Using Saturation Cues," in Proc. 26th IEEE International Conference on Image Processing (September 2019), 4584–4588, https://doi.org/10.1109/icip.2019.8803661.
[7] S. Agarwal, N. Girdhar, and H. Raghav, "A Novel Neural Model Based Framework for Detection of GAN Generated Fake Images," in Proc. 11th International Conference on Cloud Computing, Data Science and Engineering (Confluence) (January 2021), 46–51, https://doi.org/10.1109/Confluence51648.2021.9377150.
[8] B. J. Chen, W. J. Tan, Y. T. Wang, and G. Y. Zhao, "Distinguishing between Natural and GAN-Generated Face Images by Combining Global and Local Features," Chinese Journal of Electronics 31, no. 1 (2022): 59–67, https://doi.org/10.1049/cje.2020.00.372.
[9] Z. W. Li, F. Liu, W. J. Yang, S. H. Peng, and J. Zhou, "A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects," IEEE Transactions on Neural Networks and Learning Systems 33, no. 12 (2022): 6999–7019, https://doi.org/10.1109/tnnls.2021.3084827.
[10] Y. Fu, T. Sun, X. Jiang, K. Xu, and P. He, "Robust GAN-Face Detection Based on Dual-Channel CNN Network," in Proc. 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (October 2019), https://doi.org/10.1109/CISP-BMEI48845.2019.8965991.
[11] S. Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-Generated Images Are Surprisingly Easy to Spot for Now," in Proc. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (March 2020), 8692–8701, https://doi.org/10.1109/CVPR42600.2020.00872.
[12] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, "On the Detection of Synthetic Images Generated by Diffusion Models," in Proc. ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (June 2023), 1–5, https://doi.org/10.1109/ICASSP49357.2023.10095167.
[13] M. Q. Nguyen, K. D. Ho, H. M. Nguyen, C. M. Tu, M. T. Tran, and T. L. Do, "Unmasking the Artist: Discriminating Human-Drawn and AI-Generated Human Face Art through Facial Feature Analysis," in Proc. 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR) (October 2023), https://doi.org/10.1109/MAPR59823.2023.10289113.
[14] Q. Bammey, "Synthbuster: Towards Detection of Diffusion Model Generated Images," IEEE Open Journal of Signal Processing 5 (2023): 19, https://doi.org/10.1109/OJSP.2023.3337714.
[15] H. Farid, "Lighting (In)consistency of Paint by Text" (2022).
[16] H. Farid, Inconsistency of Paint by Text (2022).
[17] U. Ojha, Y. H. Li, and Y. J. Lee, "Towards Universal Fake Image Detectors that Generalize across Generative Models," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada (June 2023), 24480–24489, https://doi.org/10.1109/cvpr52729.2023.02345.
[18] Z. D. Wang, J. M. Bao, W. G. Zhou, et al., "DIRE for Diffusion-Generated Image Detection," in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), Paris (October 2023), 22388–22398, https://doi.org/10.1109/iccv51070.2023.02051.
[19] H. Zhang, B. Chen, J. Wang, and G. Zhao, "A Local Perturbation Generation Method for GAN-Generated Face Anti-forensics," IEEE Transactions on Circuits and Systems for Video Technology 33, no. 2 (2023): 661–676, https://doi.org/10.1109/TCSVT.2022.3207310.
[20] J. Ricker, S. Damm, T. Holz, and A. Fischer, "Towards the Detection of Diffusion Model Deepfakes" (2022).
[21] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection" (2016).
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (June 2016), https://doi.org/10.1109/cvpr.2016.90.
[23] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, "Attentional Feature Fusion" (2020).
[24] P. Dhariwal and A. Nichol, Diffusion Models Beat GANs on Image Synthesis (2021).
[25] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, "LSUN: Construction of a Large-Scale Image Dataset Using Deep Learning with Humans in the Loop" (2015).
[26] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive Growing of GANs for Improved Quality, Stability, and Variation" (2017).
[27] A. Nichol and P. Dhariwal, "Improved Denoising Diffusion Probabilistic Models," in Proc. International Conference on Machine Learning (ICML) (July 2021).
[28] L. Liu, Y. Ren, Z. Lin, and Z. Zhao, "Pseudo Numerical Methods for Diffusion Models on Manifolds" (2022), https://doi.org/10.48550/arXiv.2202.09778.
[29] J. Deng, W. Dong, R. Socher, L. J. Li, and F. F. Li, "ImageNet: A Large-Scale Hierarchical Image Database," in Proc. 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009) (June 2009).
[30] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (June 2022), 10674–10685, https://doi.org/10.1109/cvpr52688.2022.01042.
[31] A. Brock, J. Donahue, and K. Simonyan, "Large Scale GAN Training for High Fidelity Natural Image Synthesis" (2018).
[32] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks," in Proc. 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy (October 2017), 2242–2251, https://doi.org/10.1109/iccv.2017.244.
[33] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, "StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation," in Proc. 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT (June 2018), 8789–8797, https://doi.org/10.1109/cvpr.2018.00916.
[34] "Midjourney" (2023), https://www.midjourney.com/home.
[35] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, Hierarchical Text-Conditional Image Generation with CLIP Latents (2022).