Research Article
Multiscale Features Integrated Model for Generalizable Deepfake Detection
Siqi Gu,1 Zihan Qin,1 Lizhe Xie,2,3 Zheng Wang,1 and Yining Hu1
1School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
2State Key Laboratory Cultivation Base of Research, Prevention and Treatment for Oral Diseases, Nanjing Medical University, Nanjing, China
3Jiangsu Province Engineering Research Center of Stomatological Translational Medicine, Nanjing Medical University, Gulou District, Nanjing, China
Correspondence should be addressed to Yining Hu; hyn.list@seu.edu.cn
Received 12 November 2024; Accepted 18 December 2024
Academic Editor: Beijing Chen
Copyright © 2025 Siqi Gu et al. International Journal of Intelligent Systems published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Within the domain of Artificial Intelligence Generated Content (AIGC), technological strides in image generation have been marked, resulting in the proliferation of deepfake images that pose substantial security threats. The current landscape of deepfake detection technologies is marred by limited generalization across diverse generative models and a subpar detection rate for images generated through diffusion processes. In response to these challenges, this paper introduces a novel detection model designed for high generalizability, leveraging multiscale frequency and spatial domain features. Our model harnesses an array of specialized filters to extract frequency-domain characteristics, which are then integrated with spatial-domain features captured by a Feature Pyramid Network (FPN). The integration of the Attentional Feature Fusion (AFF) mechanism within the feature fusion module allows for the optimal utilization of the extracted features, thereby enhancing detection capabilities. We curated an extensive dataset encompassing deepfake images from a variety of GANs and diffusion models for rigorous evaluation. The experimental findings reveal that our proposed model achieves superior accuracy and generalization compared to existing baseline models when confronted with deepfake images from multiple generative sources. Notably, in cross-model detection scenarios, our model outperforms the next best model by a significant margin of 29.1% for diffusion-generated images and 15.1% for GAN-generated images. This accomplishment presents a viable solution to the pressing issues of generalization and adaptability in the field of deepfake detection.
Keywords: Artificial Intelligence Generated Content; deepfake detection; diffusion model; generative adversarial networks
1. Introduction
As advancements in Artificial Intelligence Generated Content (AIGC) technologies continue to accelerate, a multitude of detection methods have been devised to authenticate synthetic media [1–3], tackling challenges such as the spread of misinformation, infringements on personal privacy, and instances of online financial fraud. Notably, deepfake images emerge as a focal point within this domain, owing to their propensity for rapid dissemination and their ubiquity on social media platforms. Therefore, this research underscores the importance of developing robust detection techniques specifically tailored to deepfake images.
Currently, the creation of deepfake images is primarily facilitated by two leading generative models: generative adversarial networks (GANs) [4] and diffusion models (DMs) [5]. Detection models for GAN-generated images typically enhance accuracy (ACC) by refining feature extraction, optimizing network architectures, and incorporating data augmentation strategies. Feature-based approaches primarily target the identification of spatial and frequency domain inconsistencies. For instance, McCloskey et al. [6] leveraged
red-green bivariate histograms and abnormal pixel exposure ratios for detection, while Agarwal et al. [7] exploited high-frequency artifacts stemming from GANs' upsampling processes. Chen et al. [8] integrated both global and local image features, utilizing metric learning to enhance the overall detection performance of the model. In terms of network architectures, Convolutional Neural Networks (CNNs) remain a predominant choice for deepfake detection tasks due to their capacity to extract semantic, color, and texture information [9], as seen in Fu et al.'s [10] dual-channel CNN architecture, which is capable of concurrently processing both high- and low-frequency image components. Additionally, data augmentation plays a crucial role in boosting robustness and generalization. Wang et al. [11] demonstrated that combining preprocessing, postprocessing, and data augmentation allows a CNN trained on a single GAN dataset to generalize effectively across multiple GAN models.
For the detection of DM-generated images, a straightforward approach involves adapting existing GAN-based detectors. However, Corvi et al. [12] demonstrated that even advanced GAN detection algorithms exhibit suboptimal performance when tasked with identifying DM-generated images. In a manner similar to GAN-generated image detection, several studies leveraged spatial and frequency domain features for detecting DM images. For instance, Nguyen et al. [13] employed gradient-based features, while Bammey [14] used cross-differential filters to highlight frequency artifacts. Additionally, Farid [15, 16] highlighted inconsistencies in lighting, geometric structures, shadows, and reflections present in DM-generated images. Ojha et al. [17] proposed an alternative method by mapping images into CLIP's feature space and classifying them using cosine similarity with reference real or fake images. Wang et al. [18] introduced a method using the difference between reconstructed images and those under examination as features, achieving high ACC in the detection of diffusion-generated images.
Despite these developments, deepfake detection methodologies encounter numerous challenges. For GAN-generated images, existing detection models tend to perform inadequately when applied to multisource datasets due to their constrained feature extraction dimensions. While data augmentation and dataset expansion have been employed to enhance generalization [11], they often yield limited effectiveness due to their predictability. Furthermore, the robustness of GAN-based detectors is often compromised in the face of adversarial perturbations, as evidenced by the localized attacks conducted by Zhang et al. [19], which revealed vulnerabilities in a range of detection models. As for diffusion-generated images, effective detection techniques remain underexplored, and those available tend to be overly complex, lacking comprehensive image feature analysis, thus hindering their practical applicability.
In real-world scenarios, the generative model responsible for creating deepfake images is often unknown, emphasizing the necessity for detection models with strong cross-model generalization capabilities. However, traditional detection models struggle to handle such uncertainty, for they often fail to adequately leverage the rich and complex features inherent in deepfake images, or rely solely on single-dimensional feature learning, which limits their ability to capture the shared characteristics across different generative models. Ricker et al. [20] demonstrated that GAN detectors often fail to detect images generated by DMs, and retraining these detectors on diffusion-generated datasets leads to only marginal improvements. This lack of adaptability exposes a gap in current detection technologies, which struggle to handle the variety of deepfake generation techniques and their artifacts. To address the growing concerns surrounding synthetic media manipulation, there is an urgent need for robust and adaptable models capable of detecting deepfakes across diverse platforms and applications, regardless of the generative model or source datasets employed.
To address these limitations, this paper introduces a generalizable deepfake detection model that leverages both spatial and frequency domain features commonly exhibited by deepfake images. The proposed model utilizes a series of filters to extract multiscale frequency-domain features, and a Feature Pyramid Network (FPN) [21] to capture multiscale spatial-domain features. These features are subsequently fused and processed through a ResNet50 backbone for classification. To fully exploit the extracted features, multiple fusion strategies incorporating the Attentional Feature Fusion (AFF) [23] mechanism are devised, further enhancing the model's performance. The primary contributions of this work are as follows: (1) We develop a model-independent detection method that effectively identifies forged images from various generative models and datasets, addressing the black-box detection challenge where the generative model is unknown. (2) By extracting and fusing multiscale features from both spatial and frequency domains, we capture shared characteristics across different generative models, overcoming the limitations of single-feature modeling. (3) Experimental results demonstrate that our model significantly outperforms existing detectors on both GAN-generated and diffusion-generated images, as well as on images generated by advanced models, exhibiting superior generalization and detection ACC, thus validating the effectiveness of our approach.
2. The Proposed Method
2.1. Overall Structure. The overall structure of the proposed model is illustrated in Figure 1, comprising two primary components: a feature extraction module and a feature fusion module. In the feature extraction module, the images are initially processed through an image pyramid to create multiscale representations, which are subsequently passed through a series of filters to extract multiscale frequency-domain features. Concurrently, the original input images are fed into an FPN, facilitating the extraction of multiscale spatial-domain features. In the subsequent feature fusion module, the extracted features from both domains are effectively integrated through the AFF mechanism, ensuring an effective combination of multiscale spatial and frequency information. The fused features are then processed by a ResNet50 backbone network, which ultimately classifies the image as either real or fake. The detailed architecture and functionality of the model will be elaborated in the following sections.
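To make the data flow in Figure 1 concrete, the following is a minimal PyTorch-style sketch of the forward pass; the module names and the way the two streams are wired together are illustrative assumptions rather than the authors' released code.

```python
import torch.nn as nn

class MultiscaleDetector(nn.Module):
    """Skeleton of the pipeline in Figure 1 (illustrative sketch, not the official code)."""
    def __init__(self, freq_branch: nn.Module, spatial_fpn: nn.Module,
                 fusion: nn.Module, classifier: nn.Module):
        super().__init__()
        self.freq_branch = freq_branch   # Gaussian/Gabor/wavelet filtering over an image pyramid
        self.spatial_fpn = spatial_fpn   # FPN on a ResNet50 bottom-up pathway
        self.fusion = fusion             # AFF-based feature fusion module
        self.classifier = classifier     # ResNet50 backbone ending in a real/fake head

    def forward(self, x):
        freq_feats = self.freq_branch(x)       # multiscale frequency-domain features
        spatial_feats = self.spatial_fpn(x)    # multiscale spatial-domain features
        fused = self.fusion(freq_feats, spatial_feats)
        return self.classifier(fused)          # logit: real or fake
```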
2.2. Frequency-Domain Feature Extraction. The generation of deepfake images involves sophisticated synthesis techniques, such as pixel-level manipulation, temporal and spatial consistency adjustments, and fine-tuning of facial features. These manipulations often introduce subtle discrepancies between deepfake and authentic images, which are most effectively observed in the frequency domain. However, these inconsistencies do not exhibit uniform patterns, necessitating the extraction of subtle variations and anomalous features across multiple scales. To address this, we employ a combination of Gaussian, Gabor, and wavelet filters, each selected for its specific advantages in extracting relevant frequency-domain features.
Gaussian filters are employed to capture low-frequency information, which is essential for identifying overall structural elements and large-scale patterns in images. As fundamental techniques for smoothing and noise reduction, they are particularly suitable for extracting the broad, global characteristics of deepfake images, which are often affected by low-frequency distortions. Specifically, a 3 × 3 Gaussian convolution kernel with a standard deviation of 1 and mean values of one is applied. The two-dimensional Gaussian function is defined as follows:
Gauss(x, y) = (1 / (2π σ_x σ_y)) · exp( −[ (x − x̄)² / (2σ_x²) + (y − ȳ)² / (2σ_y²) ] ),   (1)
where Gauss(x, y) represents the Gaussian weight at point (x, y), σ_x and σ_y are the standard deviations, x̄ and ȳ are the mean values, and the normalization factor 1/(2π σ_x σ_y) ensures that the kernel weights sum to 1. To capture multiscale features, input images are downsampled by a factor of 2 using an image pyramid before Gaussian filtering, followed by linear interpolation and pixel-wise summation across scales. This procedure integrates multiscale filtered information while preserving critical edges and details.
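As a concrete illustration of this step, the sketch below applies the 3 × 3, σ = 1 Gaussian kernel at successive pyramid levels and sums the upsampled responses; it assumes OpenCV, and the helper name and the number of pyramid levels are illustrative choices not fixed by the paper.

```python
import cv2
import numpy as np

def gaussian_multiscale(image: np.ndarray, levels: int = 3) -> np.ndarray:
    """Apply the 3x3 (sigma=1) Gaussian kernel at several pyramid scales and
    merge the responses at the original resolution (illustrative sketch)."""
    h, w = image.shape[:2]
    merged = np.zeros_like(image, dtype=np.float32)
    current = image.copy()
    for _ in range(levels):
        # Smooth the current pyramid level with the 3x3, sigma=1 kernel.
        smoothed = cv2.GaussianBlur(current, ksize=(3, 3), sigmaX=1, sigmaY=1)
        # Bring the filtered level back to the input resolution (linear interpolation)
        # and accumulate it pixel-wise across scales.
        resized = cv2.resize(smoothed, (w, h), interpolation=cv2.INTER_LINEAR)
        merged += resized.astype(np.float32)
        # Downsample by a factor of 2 for the next pyramid level.
        current = cv2.pyrDown(current)
    return merged
```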
Gabor filters, renowned for their ability to detect edges and analyze textures, are employed for their robust capability in directional and scale-selective analysis. These filters are particularly effective in isolating mid-to-high-frequency components and detecting local patterns, which are often indicative of subtle artifacts introduced by deepfake generation techniques. Their ability to capture directional and textural information makes them well-suited for analyzing textured regions in deepfake images. The two-dimensional Gabor function is defined as follows:
Gabor(x, y) = exp( −(x′² + γ² y′²) / (2σ²) ) · cos( 2π x′ / λ + ψ ),   (2)
where σ is the Gaussian standard deviation, γ is the spatial aspect ratio, λ is the wavelength, and ψ is the phase offset. The rotated coordinates x′ and y′ are given in equations (3) and (4):
x′ = x cos θ + y sin θ,   (3)
y′ = −x sin θ + y cos θ,   (4)
where θ defines the filter's orientation. In this work, we employ 16 Gabor kernels with orientations ranging from 0° to 360°, each with a kernel size of 31 × 31, a standard deviation of 4, a spatial aspect ratio of 0.5, a wavelength of 10, and a phase offset of 0. These filters process the multiscale images generated by the image pyramid, capturing texture and directional features from various orientations.

Figure 1: Overall structure of the multiscale features integrated model for deepfake detection.
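A possible OpenCV construction of this 16-kernel Gabor bank is sketched below; how the per-orientation responses are aggregated (here, a pixel-wise maximum) is an assumption, since the paper does not state the aggregation rule.

```python
import cv2
import numpy as np

def build_gabor_bank(n_orientations: int = 16):
    """Build the 16 Gabor kernels described above: orientations over 0-360 degrees,
    31x31 kernels, sigma=4, gamma (aspect ratio)=0.5, lambda=10, psi=0."""
    kernels = []
    for k in range(n_orientations):
        theta = k * 2.0 * np.pi / n_orientations
        kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=4, theta=theta,
                                    lambd=10, gamma=0.5, psi=0)
        kernels.append(kernel)
    return kernels

def gabor_response(image: np.ndarray, kernels) -> np.ndarray:
    """Filter the image with each kernel and keep the strongest response per pixel
    (aggregation rule assumed, not specified in the paper)."""
    responses = [cv2.filter2D(image, ddepth=cv2.CV_32F, kernel=k) for k in kernels]
    return np.max(np.stack(responses, axis=0), axis=0)
```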
Wavelet filters are particularly effective in balancing spatial and frequency domain information, enabling a comprehensive multiscale analysis of deepfake images. They allow the separation of high- and low-frequency components, facilitating the detection of both large-scale structural inconsistencies and localized anomalies. Wavelet decomposition provides a powerful tool for extracting multiresolution features, which can be crucial for identifying subtle artifacts that are distributed across different frequency bands. To harness these capabilities, the Daubechies 4 (db4) wavelet is selected due to its balance between computational efficiency and its ability to capture detailed image features across scales. The wavelet decomposition and reconstruction processes are represented by equations (5) and (6), respectively:
DWT(I) ⟶ (LL, LH, HL, HH),   (5)
high_freq_img = |LH| + |HL| + |HH|,   (6)
where I represents the input image. LL denotes the low-frequency approximation coefficients capturing the overall structure and large-scale features of the image. LH, HL, and HH represent the high-frequency detail coefficients in the horizontal, vertical, and diagonal directions, respectively. The high-frequency components are combined into a complete high-frequency image using equation (6), which, along with the low-frequency approximation LL, is utilized for detection. Given the wavelet transform's inherent capability to effectively represent multiscale features, the original image is used directly as input, bypassing the need for pyramid preprocessing. The wavelet decomposition is conducted over two levels, producing multiscale wavelet-filtered representations that encapsulate both global and localized image characteristics. The db4 wavelet enables the comprehensive utilization of an image's detailed information and noise characteristics, thereby enabling robust analysis of both global and local anomalies in deepfake images.
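The two-level db4 analysis of equations (5) and (6) can be reproduced with PyWavelets roughly as follows; the function name is illustrative and the sketch assumes a single-channel (grayscale) image.

```python
import numpy as np
import pywt

def db4_frequency_maps(image: np.ndarray):
    """Two-level db4 decomposition of a grayscale image (illustrative sketch).
    Returns the coarse low-frequency approximation LL and, per level,
    the combined high-frequency map |LH| + |HL| + |HH| from equation (6)."""
    coeffs = pywt.wavedec2(image, wavelet='db4', level=2)
    ll = coeffs[0]                        # low-frequency approximation (coarsest level)
    high_freq_maps = []
    for lh, hl, hh in coeffs[1:]:         # detail coefficients, coarse to fine
        high_freq_maps.append(np.abs(lh) + np.abs(hl) + np.abs(hh))
    return ll, high_freq_maps
```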
Collectively, the integration of Gaussian, Gabor, and wavelet filters facilitates the extraction of comprehensive frequency-domain features. Gaussian filters capture global structural patterns by focusing on low-frequency information. Gabor filters extract localized texture and edge information through their directional and scale-selective properties. Wavelet filters provide a multiscale analysis by decomposing the image into various frequency components. Together, these filters enhance the model's ability to identify subtle inconsistencies across scales, leading to more effective deepfake detection.
2.3. Spatial-Domain Feature Extraction. The previous section illustrated that the selected filters effectively capture multiscale frequency-domain features, including texture and edge information. However, relying exclusively on frequency-domain features imposes inherent limitations on the detection model's ability to fully encapsulate the semantic content of images. Generative models often introduce subtle artifacts in both global structures and localized details, necessitating the extraction of complementary spatial-domain features for a more comprehensive analysis. By synthesizing both spatial- and frequency-domain characteristics, the model achieves a more holistic representation, significantly improving detection performance, particularly in complex and heterogeneous scenarios.
In this research, we utilize an FPN [21] architecture, constructed on a ResNet50 [22] backbone, to effectively extract multiscale spatial-domain features. During the bottom-up pathway of the FPN, corresponding to the forward propagation within ResNet50, four feature maps of distinct resolutions are produced. These feature maps progressively distill high-level semantic information through multiple convolutional layers, while simultaneously preserving essential multiscale structural information. In the top-down pathway, these feature maps are subjected to dimensionality reduction via 1 × 1 convolutional layers, upsampled using linear interpolation, and then merged layer by layer to integrate semantic information across various scales. The fused feature maps are subsequently refined through additional convolution layers to yield the final multiscale feature representation. The structure of the FPN is shown in Figure 2. By these means, the FPN can effectively extract and synthesize multiscale spatial-domain features, enabling the model to perceive both overarching semantic structures and intricate local details, thereby substantially improving detection ACC and generalization across diverse scenarios.
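A compact version of this spatial branch can be assembled from off-the-shelf torchvision components, as sketched below; the channel widths follow Figure 2, while the class name and the exact stage wiring are assumptions (the sketch assumes a recent torchvision release).

```python
from torch import nn
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class SpatialFPN(nn.Module):
    """FPN over a ResNet50 bottom-up pathway (illustrative sketch).
    Channel widths of the C2-C5 stages follow Figure 2; each output map has 256 channels."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # Lateral 1x1 convolutions and top-down merging are handled inside torchvision's FPN.
        self.fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

    def forward(self, x):
        feats = {}
        x = self.stem(x)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            feats[f"c{i + 2}"] = x              # C2 ... C5
        return self.fpn(feats)                  # dict of P2 ... P5, each with 256 channels
```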
2.4. Feature Fusion Module. In the preceding sections, we selected a variety of complementary frequency-domain and spatial-domain features to enhance the efficacy of deepfake detection. By integrating these features, a more holistic and robust image representation can be constructed. These features are processed through a Batch Normalization (BN) layer, followed by ReLU activation, and then passed through a Squeeze-and-Excitation (SE) attention layer before being input into a ResNet50 network for final classification.
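Below is a brief sketch of this BN + ReLU + SE stage in PyTorch; the SE block follows the standard squeeze-and-excitation formulation, and the reduction ratio of 16 is an assumption not stated in the paper.

```python
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation channel attention (standard formulation)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())  # excitation

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                              # channel-wise reweighting

def fusion_head(channels: int) -> nn.Sequential:
    """Pre-classification stage applied to the fused features: BN -> ReLU -> SE."""
    return nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True), SELayer(channels))
```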
To further optimize the utilization of the extracted features, we incorporate the AFF [23] mechanism within the feature fusion module. AFF is an adaptive fusion technique designed to dynamically combine input features in a scale-aware and content-adaptive manner. This mechanism is particularly effective in addressing the challenge of merging features with disparate scales, resolutions, or semantic meanings, which is a common issue when integrating spatial- and frequency-domain features.
The AFF framework consists of two key components: feature aggregation and attention-based weighting. The feature aggregation process first combines the input features through element-wise addition to produce a unified representation. To address the limitations of raw aggregation, which fails to account for the varying importance of features across different scales and contexts, attention-based
weighting is employed. The Multiscale Channel Attention Module (MS-CAM) calculates dynamic attention weights for each feature, facilitating the selective enhancement of more informative features while suppressing less relevant ones. This enables the model to prioritize features that are most indicative of subtle inconsistencies in deepfake images, thereby refining the fusion process and improving detection ACC. The structures of AFF and MS-CAM are depicted in Figures 3 and 4, respectively.
The specific process of the AFF strategy is shown in equation (7). Given two input features X and Y, the initial fusion is performed through element-wise addition, producing a preliminary combined feature X ⊕ Y. This is followed by the computation of a weight M(X ⊕ Y) via MS-CAM, which modulates the relative significance of X and Y. The final fused feature Z is derived by applying a weighted sum, where M(X ⊕ Y) is assigned to feature X and 1 − M(X ⊕ Y) to feature Y. Although this approach introduces a slight increase in computational cost, the additional FLOPs are minimal, making it a favorable trade-off in light of the significant improvements in detection performance.

Z = M(X ⊕ Y) ⊗ X + (1 − M(X ⊕ Y)) ⊗ Y.   (7)
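A PyTorch sketch of MS-CAM and the fusion rule in equation (7) is given below, following the public formulation of Dai et al. [23]; the channel-reduction ratio r = 4 is an assumption.

```python
import torch.nn as nn

class MSCAM(nn.Module):
    """Multiscale Channel Attention Module: a global (pooled) branch and a local
    point-wise branch, summed and passed through a sigmoid (after Dai et al. [23])."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        inter = channels // r
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter, kernel_size=1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1), nn.BatchNorm2d(channels))
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter, kernel_size=1), nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1), nn.BatchNorm2d(channels))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.local_att(x) + self.global_att(x))

class AFF(nn.Module):
    """Attentional Feature Fusion, equation (7): Z = M(X+Y)*X + (1 - M(X+Y))*Y."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.ms_cam = MSCAM(channels, r)

    def forward(self, x, y):
        weight = self.ms_cam(x + y)          # M(X ⊕ Y)
        return weight * x + (1 - weight) * y
```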
3. Experiment
3.1. Datasets and Experiment Setting. In this study, we comprehensively evaluate the performance of the model by constructing a diverse dataset that includes both diffusion-generated and GAN-generated images, as well as more recent deepfake images from advanced models. The dataset is carefully curated to include images from a variety of generative models, ensuring a broad representation of different architectures and data sources. As outlined in Table 1, the training set includes images generated by ProGAN [26], a GAN-based model, and ADM [24], a DM. The testing set is designed to provide a comprehensive evaluation, including models like iDDPM [27], PNDM [28], and ADM [24] trained on datasets such as LSUN_bedroom, ImageNet, and CelebA, as well as more recent models like Midjourney [34] and DALLE2 [35]. Notably, there are no overlapping samples between the training and testing sets, which ensures an unbiased evaluation of the model's generalization capabilities and mitigates the risk of overfitting. These sources are selected to ensure both model diversity and real-world applicability, covering a wide range of generative techniques.

Table 1: Deepfake image dataset containing various types of generated images.

Data type   Model type        Generative model   Source dataset   Volume (real/fake)
Training    DM                ADM [24]           LSUN_B [25]      20k/20k
Training    GAN               ProGAN [26]        ProGAN           10k/10k
Testing     DM                iDDPM [27]         LSUN_B           2k/2k
Testing     DM                PNDM [28]          LSUN_B           2k/2k
Testing     DM                ADM [24]           LSUN_B           2k/2k
Testing     DM                ADM [24]           ImageNet [29]    2k/2k
Testing     DM                SDv2 [30]          CelebA           2k/2k
Testing     GAN               bigGAN [31]        ImageNet         2k/2k
Testing     GAN               cycleGAN [32]      cycleGAN         2k/2k
Testing     GAN               starGAN [33]       CelebA           2k/2k
Testing     Advanced models   Midjourney [34]    CelebA           2k/2k
Testing     Advanced models   DALLE2 [35]        CelebA           2k/2k
Figure 2: Structure of the Feature Pyramid Network.

Figure 3: Structure of the Attentional Feature Fusion Module.

During the training phase, the dataset is split into training and validation subsets in an 80:20 ratio to facilitate cross-validation and ensure a fair evaluation of the model's performance. To enhance the model's robustness and generalization, data augmentation techniques such as random rotation (within a ±15° range) and random horizontal flipping are applied, and all images are uniformly center-cropped to 256 × 256 pixels.
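The augmentation and cropping pipeline described above could be expressed with torchvision transforms as follows; the normalization step is omitted because the paper does not specify one.

```python
from torchvision import transforms

# Training-time preprocessing sketch: random rotation within ±15 degrees,
# random horizontal flipping, and a uniform 256 x 256 center crop.
train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),      # rotations drawn from (-15, +15) degrees
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])
```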
The model is implemented using the PyTorch framework, version 2.0.0. Its performance is evaluated using two metrics: ACC and F1-score. ACC measures overall detection accuracy, while the F1-score accounts for the balance between precision and recall, which is particularly crucial in scenarios with potential class imbalance. For the training process, binary cross-entropy is chosen as the loss function, as the task involves binary classification. The model is trained using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32 for 50 epochs. A learning rate decay strategy is applied, where the learning rate is reduced to 10% of its current value if validation metrics do not improve over 8 consecutive epochs. Training is conducted on an NVIDIA V100 GPU to leverage its computational capabilities. The environment setup includes CUDA 11.7 and cuDNN 8 to ensure efficient utilization of GPU resources.
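The optimization settings in this section translate into the following PyTorch sketch; the use of BCEWithLogitsLoss (binary cross-entropy on logits) and the 'max' scheduler mode for a validation metric are reasonable assumptions where the paper leaves the details open.

```python
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

def configure_training(model: nn.Module):
    """Adam (lr = 1e-4), binary cross-entropy, and a decay to 10% of the current
    learning rate after 8 epochs without validation improvement (sketch)."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=8)
    return criterion, optimizer, scheduler

# Typical per-epoch usage (a validation metric drives the schedule):
#   criterion, optimizer, scheduler = configure_training(model)
#   ... train for one epoch with batch size 32 ...
#   scheduler.step(val_accuracy)
```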
3.2. Ablation Experiment of Three Filters. In this experiment, we aim to evaluate the independent and combined contributions of the three frequency-domain filters to the overall performance of our model. The models under investigation are as follows: (1) Origin: Directly feeding the original input images into the ResNet50 network without applying any frequency-domain filtering; (2) Gauss: Processing the images with a Gaussian filter before feeding them into the ResNet50 backbone; (3) Gabor: Processing the images with a Gabor filter before feeding them into the ResNet50 backbone; (4) Wave: Processing the images with a Wavelet filter prior to feeding them into the ResNet50 backbone; (5) Gauss_Gabor: Concatenating the feature maps from the Gaussian and Gabor filters before feeding them into the ResNet50 backbone; (6) Gauss_Wave: Concatenating the feature maps from the Gaussian and Wavelet filters before feeding them into the ResNet50 backbone; (7) Gabor_Wave: Concatenating the feature maps from the Gabor and Wavelet filters before feeding them into the ResNet50 backbone; (8) Combined: Integrating the outputs of the Gaussian, Gabor, and Wavelet filters by concatenating their features, followed by processing them through convolutional layers before feeding them into the ResNet50 backbone.

Figure 4: Structure of the Multiscale Channel Attention Module.
For this study, we use the ADM and ProGAN datasets described in Table 1 for training, while the testing set comprises images from all datasets enumerated in Table 1. Specifically, 500 real images and 500 generated images are randomly selected from each of the eight testing sets, resulting in a comprehensive mixed testing set for evaluating model performance. The evaluation metrics include detection accuracy and F1-scores, which are presented in Table 2 for each filtering strategy.

Table 2: Accuracy and F1-scores of models using different filtering strategies.

Filtering strategy   Trained on ProGAN (ACC / F1)   Trained on ADM (ACC / F1)
Origin               0.501 / 0.003                  0.509 / 0.040
Gauss                0.752 / 0.741                  0.806 / 0.836
Gabor                0.717 / 0.712                  0.782 / 0.761
Wave                 0.776 / 0.756                  0.807 / 0.837
Gauss_Gabor          0.762 / 0.754                  0.823 / 0.839
Gauss_Wave           0.789 / 0.778                  0.838 / 0.854
Gabor_Wave           0.782 / 0.763                  0.831 / 0.847
Combined             0.803 / 0.792                  0.853 / 0.865

Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.

The experimental results indicate that all three individual filtering strategies (Gauss, Gabor, Wave) outperform the Origin model, confirming the effectiveness of each filter. Furthermore, the results of the two-filter combinations (Gauss_Gabor, Gauss_Wave, Gabor_Wave) using concatenation demonstrate varying improvements in detection performance, with the Combined model, which integrates all three filters by concatenating their feature maps, achieving the highest performance. This suggests that while each filter contributes positively, the combination of all three leverages their complementary strengths, with Gaussian filtering enhancing local smoothing, Gabor kernels capturing texture orientation, and wavelets highlighting multiscale features, leading to notable improvements in both accuracy and robustness. These findings highlight the value of integrating diverse frequency-domain features to enhance model performance in complex detection tasks.
3.3. Ablation Experiment of Feature Fusion Strategy. In this experiment, we evaluate the performance of three distinct feature fusion strategies. The strategies under investigation are as follows: (1) Concat: Concatenating all the extracted features; (2) AFF1: Applying the AFF strategy to fuse the multiscale frequency-domain features extracted from the three filters, followed by concatenation with the spatial-domain features derived from the FPN; (3) AFF2: Initially concatenating the frequency-domain features, then fusing them with the spatial-domain features using AFF. The detailed process of each fusion strategy is illustrated in Figure 5. The same training and testing sets used in Section 3.2 are employed in this experiment to maintain consistency. The model performance, in terms of detection ACC and F1-scores for each fusion strategy, is presented in Table 3.
The results demonstrate that the model employing the AFF2 fusion strategy achieves superior ACC and F1-score compared to the other models, which can be largely attributed to the complementary nature of multiscale frequency-domain features. The application of AFF solely to frequency-domain features (AFF1) yields limited improvement. In contrast, when AFF is employed to fuse both frequency- and spatial-domain features (AFF2), it facilitates a more effective integration of cross-scale and cross-modal information, significantly enhancing overall model performance. Besides, the Concat model shows slightly inferior performance in comparison with the Combined model in Section 3.2, likely due to its lack of effective mechanisms for integrating complementary relationships between features. These findings highlight the efficacy of the AFF mechanism in deepfake detection, as it maximizes the utility of diverse feature representations while reducing redundancy, thus improving both detection accuracy and generalization.
3.4. Ablation Experiment of FPN. Although the contribution of the FPN module is already reflected in the results of the previous two ablation experiments, we present this dedicated comparison to ensure the completeness of our ablation study. The models under investigation are as follows: (1) Without FPN: the Combined model from Section 3.2; (2) With FPN: the AFF2 model from Section 3.3, which is also the proposed model.
The results, as summarized in Table 4, indicate that the model with FPN outperforms the model without FPN in both detection ACC and F1-score. Specifically, the incorporation of the FPN allows the model to more effectively capture spatial information at multiple scales, which complements the frequency-domain features and results in enhanced detection performance. These findings highlight the value of integrating spatial-domain features via the FPN, demonstrating that the addition of the FPN module significantly contributes to the model's capability to generalize across different deepfake sources, thereby improving both ACC and robustness.
3.5. Comparison Experiment of Detection ACC. In this experiment, we evaluate the performance of our proposed model in comparison to existing models recognized for their strong generalization capabilities and high detection ACC. Given the superior performance of the AFF2 fusion strategy observed in the ablation studies, we select the model employing this strategy for further comparison. To assess the model's effectiveness in detecting both GAN- and diffusion-generated images, we utilize two baseline models: WangCNN [11] and DIRE [18]. WangCNN functions as a general detector for GAN-generated images, demonstrating strong generalization across various GAN datasets using preprocessing, postprocessing, and data augmentation techniques. The DIRE model, on the other hand, is specifically designed for detecting diffusion-generated images through the discrepancies between the input image and its reconstructed counterpart.
Using the ProGAN dataset for training, we compare the performance of our model with the baseline models across images generated by various generative models and datasets, as summarized in Table 5. The two variations of WangCNN, differentiated by their use of data augmentation techniques, are referred to as WangCNN0.1 and WangCNN0.5. Since the testing set includes images generated by ADM based on two different datasets, they are distinguished as ADM_lsun and ADM_imag in the table. Additionally, DM_mix, GAN_mix, and Total_mix refer to the mixed datasets consisting of diffusion-generated, GAN-generated, and both types of images, respectively. Specifically, DM_mix contains images from the five DM testing sets, GAN_mix from the three GAN testing sets, and Total_mix from all eight testing sets.
Figure 5: Diagram of the three feature fusion strategies.
Table 3: Accuracy and F1-scores of models under different fusion strategies.

Training set   Metrics   Concat   AFF1    AFF2 (proposed)
ProGAN         ACC       0.799    0.8     0.815
ProGAN         F1        0.779    0.777   0.795
ADM            ACC       0.842    0.871   0.881
ADM            F1        0.824    0.874   0.891

Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
Table 4: Accuracy and F1-scores of models with and without FPN.

Training set   Metrics   Without FPN   With FPN (proposed)
ProGAN         ACC       0.803         0.815
ProGAN         F1        0.792         0.795
ADM            ACC       0.853         0.881
ADM            F1        0.865         0.891

Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
Table 5: Comparison of detection accuracy of different models trained on the ProGAN dataset.

Testing set   WangCNN0.1   WangCNN0.5   DIRE    Proposed
ADM_lsun      0.528        0.513        0.521   0.798
iDDPM         0.575        0.536        0.522   0.762
PNDM          0.61         0.559        0.543   0.738
ADM_imag      0.573        0.556        0.679   0.87
SD            0.416        0.54         0.8     0.837
DM_mix        0.54         0.541        0.621   0.802
bigGAN        0.645        0.599        0.662   0.7
cycleGAN      0.817        0.833        0.726   0.773
starGAN       0.828        0.793        0.858   0.97
GAN_mix       0.763        0.742        0.749   0.836
Midjourney    0.553        0.523        0.623   0.863
DALLE2        0.470        0.646        0.806   0.839
Total_mix     0.624        0.616        0.669   0.815

Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
For each mixed dataset, 500 real images and 500 generated images are randomly selected from all constituent datasets.
The experimental results demonstrate that our model, trained on the ProGAN dataset, not only performs well on GAN-generated images but also exhibits a particularly notable advantage in detecting diffusion-generated images. On the three mixed datasets, the detection ACC of our model for GAN-generated images (GAN_mix) is 9.6% higher than that of the second-best model, WangCNN0.1. Additionally, the most significant performance gain is observed for diffusion-generated images (DM_mix), where our model surpasses the second-best model, DIRE, by an impressive 29.1%. Overall, our model achieves a 21.8% higher detection ACC than DIRE on the combined testing set (Total_mix), further underscoring its robustness in handling both types of images, with a clear edge in diffusion-generated image detection.
When trained on the ADM dataset, the results in Table 6 reveal that our model consistently delivers superior detection performance for both diffusion- and GAN-generated images. Notably, the model excels in identifying cross-model GAN-generated images, achieving the highest detection ACC across all GAN testing sets, with a 15.1% improvement on the GAN_mix testing set over the second-best model, DIRE.
Furthermore, when tested on the Midjourney and DALLE2 datasets, our model achieves the highest detection ACC compared to the other baselines, regardless of the training dataset used. This further reinforces its generalization capabilities across both traditional and newer generative models. The results of the two comparison experiments demonstrate that our model not only maintains high detection ACC for images generated by models of the same type, but also excels in detecting cross-model generated images, highlighting its strong generalization ability across diverse deepfake generation techniques.

Table 6: Comparison of detection accuracy of different models trained on the ADM dataset.

Testing set   WangCNN0.1   WangCNN0.5   DIRE    Proposed
ADM_lsun      0.971        0.96         0.985   0.985
iDDPM         0.973        0.966        0.976   0.979
PNDM          0.971        0.966        0.985   0.982
ADM_imag      0.655        0.616        0.733   0.77
SD            0.581        0.432        0.805   0.92
DM_mix        0.83         0.788        0.897   0.96
bigGAN        0.604        0.562        0.61    0.715
cycleGAN      0.508        0.507        0.492   0.711
starGAN       0.688        0.576        0.92    0.965
GAN_mix       0.6          0.548        0.674   0.776
Midjourney    0.594        0.532        0.501   0.920
DALLE2        0.615        0.755        0.811   0.952
Total_mix     0.744        0.698        0.813   0.881

Note: The highest values are highlighted in bold to emphasize the effectiveness of the adopted modules and the superiority of our proposed model.
3.6. Analysis of Performance Variations. Although the model demonstrates strong performance across most datasets, notable variations are observed across different testing sets. Using the ADM training dataset as a case study, we explore these discrepancies in greater detail.
As shown in Table 6, when tested on the images generated by ADM using the ImageNet dataset (ADM_imag), the model's ACC notably declines to 0.77, compared to 0.985 on the ADM_lsun dataset. This suggests that the higher variability and naturalistic features of ImageNet images may present more challenges for the model, which was primarily trained on the LSUN_bedroom dataset, highlighting a potential gap in the model's ability to generalize to such varied deepfake characteristics.
Additionally, the model's performance on bigGAN (0.715) and cycleGAN (0.711) images is lower than its performance on starGAN images (0.965). This may be because starGAN images are potentially more similar to the images in the training set or to diffusion-generated images in general. The more consistent and structured features of starGAN images could align more closely with the model's learned characteristics, which were optimized for detecting patterns found in diffusion-based deepfakes. As a result, this similarity could facilitate better detection ACC compared to the more irregular patterns seen in bigGAN and cycleGAN images.
These performance variations highlight an area for improvement in the model's ability to generalize across images with substantial inherent differences. To enhance the model's robustness and broaden its applicability, future work should consider utilizing more diverse training datasets that ensure a more balanced representation of both GAN- and diffusion-generated images. Furthermore, exploring multimodel or hybrid training strategies could further augment the model's generalization capabilities across the diverse range of deepfake generation techniques.
3.7. Model Robustness Experiment Results and Analysis. To further evaluate the robustness of the proposed model under adversarial conditions, we conduct a set of experiments to assess the model's resilience to common distortions and manipulations, including Gaussian blur and JPEG compression. Specifically, Gaussian blur is applied using a kernel with a standard deviation of 0.5, while JPEG compression is tested with a quantization factor of 80. In this experiment, we use the ProGAN training set, with the same mixed testing set as in Section 3.2, to evaluate the different fusion strategies under Gaussian blur and JPEG compression. The performance metrics are summarized in Table 7.

Table 7: Accuracy and F1-scores of models using different fusion strategies under Gaussian blur and JPEG compression, trained on the ProGAN dataset.

Distortion     Concat (ACC / F1)   AFF1 (ACC / F1)   AFF2 (ACC / F1)
None           0.799 / 0.779       0.8 / 0.777       0.815 / 0.795
Blur           0.785 / 0.776       0.786 / 0.747     0.798 / 0.77
Compression    0.711 / 0.685       0.703 / 0.692     0.758 / 0.708

The results indicate that Gaussian blur has a relatively limited impact on detection ACC, with a gradual decline observed as the blur intensity increases. The model, particularly when employing the AFF mechanisms, demonstrates the capability to maintain high detection ACC even under moderate blurring, highlighting its robustness against this type of distortion. On the other hand, JPEG compression shows a more pronounced effect on model performance, especially as the compression level becomes more severe.
These findings suggest that JPEG compression may have a considerable impact on frequency-domain features, potentially contributing to the observed performance degradation. However, even under these challenging conditions, the model demonstrates relatively good robustness, indicating its potential practical applicability in real-world scenarios where such common manipulations are often encountered. Additionally, the results of this experiment highlight the contribution of the AFF mechanism to the model's robustness. By incorporating the advanced fusion strategy AFF2, the model demonstrates improved resilience, making it better equipped to handle adversarial conditions and solidifying its potential as an effective solution for deepfake detection in diverse environments.
4. Conclusion
In this study, we propose a deepfake detection model that demonstrates superior detection ACC and generalization capability, effectively addressing the limitations of existing methods. The experimental results confirm that our model surpasses baseline methods in both detection ACC and generalization across various GANs and DMs, showcasing its robustness in identifying deepfake images from diverse sources. A key highlight of our model is its ability to generalize across different generative models: even when trained on the relatively dated ADM dataset, the model performs strongly on images generated by the more recent DM Stable Diffusion v2 (SDv2) [30], as well as on images generated by more advanced models like Midjourney [34] and DALLE2 [35], underscoring its robust generalization capability.
Such strong performance is largely attributable to the structure of our model. By leveraging multiscale frequency-domain features extracted by Gaussian, Gabor, and wavelet filters, and integrating them with multiscale spatial-domain features from an FPN, our model effectively captures complementary information across multiple scales and domains. The incorporation of the AFF mechanism further enhances the model's performance, as confirmed through ablation experiments.
In conclusion, the proposed model excels in detection ACC and generalization across diverse deepfake sources, positioning it as a robust solution for tackling the growing challenges in deepfake detection.
5. Discussion
While the proposed model demonstrates strong detection ACC and generalization capabilities across GAN- and diffusion-generated images, certain limitations remain. The model predominantly relies on frequency-domain features, with the FPN being utilized for spatial feature modeling. However, the spatial features captured by the FPN may not fully reflect the intrinsic spatial characteristics of the images. Future work will aim to incorporate more explicit spatial-domain features, such as those derived from advanced denoising algorithms, edge-detection models, or texture analysis techniques. Integrating these explicit spatial features is expected to result in a more comprehensive and balanced feature extraction process, optimizing the overall architecture and enhancing the model's ability to detect diverse image manipulations.
Additionally, despite the notable generalization capabilities of the current model, the rapid evolution of generative models, particularly the increasing prevalence of fine-tuned variants, raises concerns about its adaptability to newly emerging and previously unseen content. Addressing this challenge will require further investigation into adaptive learning strategies that can accommodate the fast-paced advancements in generative models. Techniques such as continual learning or incorporating a wider variety of generative models during training may help maintain the model's effectiveness. Furthermore, as AIGC technologies continue to evolve, ensuring the traceability of manipulated content is becoming increasingly critical for maintaining transparency and security. Future research will extend toward improving detection capabilities for unknown generative models, focusing not only on adaptability but also on AIGC traceability to ensure the authenticity and security of generated content across diverse sectors.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
Funding
This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC3010302, the National Natural Science Foundation of China (Grant No. 82101079), and the Key R&D Program of Jiangsu Province (BE2023836).
References
[1] K. J. Ma, Y. F. Feng, B. J. Chen, and G. Y. Zhao, “End-to-End
Dual-Branch Network towards Synthetic Speech Detection,”
IEEE Signal Processing Letters 30 (2023): 359–363, https://
doi.org/10.1109/LSP.2023.3262419.
[2] P. Yu, Z. Xia, J. Fei, and Y. Lu, A Survey on Deepfake Video
Detection (2021).
[3] L. Lin, N. Gupta, Y. Zhang, et al., “Detecting Multimedia
Generated by Large AI Models: A Survey” (2024).
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative
Adversarial Nets,” Advances in Neural Information Processing
Systems 27 (2014): 2672–2680.
[5] L. Yang, Z. L. Zhang, Y. Song, et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications,” ACM Computing Surveys 56, no. 4 (2024): 1–39, https://doi.org/10.1145/3626235.
[6] S. McCloskey and M. Albright, “Detecting GAN-Generated
Imagery Using Saturation Cues,” in Proc 26th IEEE In-
ternational Conference on Image Processing (September 2019),
4584–4588, https://doi.org/10.1109/icip.2019.8803661.
[7] S. Agarwal, N. Girdhar, and H. Raghav, “A Novel Neural
Model Based Framework for Detection of GAN Generated
Fake Images,” in Proc 11th International Conference on Cloud
Computing, Data Science and Engineering (Conuence)
(January 2021), 46–51, https://doi.org/10.1109/
Conuence51648.2021.9377150.
[8] B. J. Chen, W. J. Tan, Y. T. Wang, and G. Y. Zhao, “Dis-
tinguishing between Natural and GAN-Generated Face Im-
ages by Combining Global and Local Features,” Chinese
Journal of Electronics 31, no. 1 (2022): 59–67, https://doi.org/
10.1049/cje.2020.00.372.
[9] Z. W. Li, F. Liu, W. J. Yang, S. H. Peng, and J. Zhou, “A Survey
of Convolutional Neural Networks: Analysis, Applications,
and Prospects,” IEEE Transactions on Neural Networks and
Learning Systems 33, no. 12 (2022): 6999–7019, https://
doi.org/10.1109/tnnls.2021.3084827.
[10] Y. Fu, T. Sun, X. Jiang, K. Xu, and P. He, “Robust GAN-Face
Detection Based on Dual-Channel CNN Network,” in Proc
2019 12th International Congress on Image and Signal Processing,
BioMedical Engineering and Informatics (CISP-BMEI) (October
2019), https://doi.org/10.1109/CISP-BMEI48845.2019.8965991.
[11] S. Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros,
“CNN-generated Images Are Surprisingly Easy to Spot for
Now,” in Proc 2020 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (March 2020), 8692–8701,
https://doi.org/10.1109/CVPR42600.2020.00872.
[12] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and
L. Verdoliva, “On the Detection of Synthetic Images Generated by Diffusion Models,” in Proc ICASSP 2023 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (June 2023), 1–5, https://doi.org/
10.1109/ICASSP49357.2023.10095167.
[13] M. Q. Nguyen, K. D. Ho, H. M. Nguyen, C. M. Tu, M. T. Tran,
and T. L. Do, “Unmasking the Artist: Discriminating Human-
Drawn and AI-Generated Human Face Art through Facial
Feature Analysis,” in Proc 2023 International Conference on
Multimedia Analysis and Pattern Recognition (MAPR) (Octo-
ber 2023), https://doi.org/10.1109/MAPR59823.2023.10289113.
[14] Q. Bammey, “Synthbuster: Towards Detection of Diffusion
Model Generated Images,” IEEE Open Journal of Signal Pro-
cessing 5 (2023): 1–9, https://doi.org/10.1109/OJSP.2023.3337714.
[15] H. Farid, “Lighting (In)consistency of Paint by Text” (2022).
[16] H. Farid, Inconsistency of Paint by Text (2022).
[17] U. Ojha, Y. H. Li, and Y. J. Lee, “Towards Universal Fake
Image Detectors that Generalize across Generative Models,”
in Proc IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) Vancouver, CANADA (June 2023),
24480–24489, https://doi.org/10.1109/cvpr52729.2023.02345.
[18] Z. D. Wang, J. M. Bao, W. G. Zhou, et al., “DIRE for
Diffusion-Generated Image Detection,” in Proc IEEE/CVF
International Conference on Computer Vision (ICCV) Paris
(October 2023), 22388–22398, https://doi.org/10.1109/
iccv51070.2023.02051.
[19] H. Zhang, B. Chen, J. Wang, and G. Zhao, “A Local Per-
turbation Generation Method for GAN-Generated Face Anti-
forensics,” IEEE Transactions on Circuits and Systems for
Video Technology 33, no. 2 (2023): 661–676, https://doi.org/
10.1109/TCSVT.2022.3207310.
[20] J. Ricker, S. Damm, T. Holz, and A. Fischer, “Towards the
Detection of Diffusion Model Deepfakes” (2022).
[21] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and
S. Belongie, “Feature Pyramid Networks for Object Detection”
(2016).
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning
for Image Recognition,” in Proc 2016 IEEE Conference on
Computer Vision and Pattern Recognition (June 2016), 1,
https://doi.org/10.1109/cvpr.2016.90.
[23] Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard,
“Attentional Feature Fusion” (2020).
[24] P. Dhariwal and A. Nichol, Diffusion Models Beat GANs on
Image Synthesis (2021).
[25] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, “LSUN:
Construction of a Large-Scale Image Dataset Using Deep
Learning with Humans in the Loop” (2015).
[26] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive
Growing of GANs for Improved Quality, Stability, and
Variation” (2017).
[27] A. Nichol and P. Dhariwal, “Improved Denoising Diffusion
Probabilistic Models,” in Proc International Conference on
Machine Learning (ICML) Electr Network (July 2021).
[28] L. Liu, Y. Ren, Z. Lin, and Z. Zhao, “Pseudo Numerical
Methods for Diusion Models on Manifolds” (2022), https://
doi.org/10.48550/arXiv.2202.09778.
[29] J. Deng, W. Dong, R. Socher, L. J. Li, and F. F. Li, “ImageNet:
a Large-Scale Hierarchical Image Database,” in Proc 2009
IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR 2009) (June 2009).
[30] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer,
“High-Resolution Image Synthesis with Latent Diffusion Models,” in Proc IEEE/CVF Conference
on Computer Vision and Pattern Recognition (June 2022),
10674–10685, https://doi.org/10.1109/cvpr52688.2022.01042.
[31] A. Brock, J. Donahue, and K. Simonyan, “Large Scale GAN
Training for High Fidelity Natural Image Synthesis” (2018).
[32] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-
To-Image Translation Using Cycle-Consistent Adversarial
Networks,” in Proc 16th IEEE International Conference on
Computer Vision (ICCV) Venice, ITALY (October 2017),
2242–2251, https://doi.org/10.1109/iccv.2017.244.
[33] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo,
“StarGAN: Unified Generative Adversarial Networks for
Multi-Domain Image-To-Image Translation,” in Proc 31st
IEEE/CVF Conference on Computer Vision and Pattern Rec-
ognition (CVPR) Salt Lake City, UT (June 2018), 8789–8797,
https://doi.org/10.1109/cvpr.2018.00916.
[34] “Midjourney,” (2023), https://www.midjourney.com/home.
[35] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen,
Hierarchical Text-Conditional Image Generation with CLIP
Latents (2022).