Research Article
LDSGAN: Unsupervised Image-to-Image Translation With
Long-Domain Search GAN for Generating High-Quality
Anime Images
Hao Wang,1 Chenbin Wang,1 Xin Cheng,1 Hao Wu,2 Jiawei Zhang,3 Jinwei Wang,4 Xiangyang Luo,2 and Bin Ma5,6,7

1School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, China
2State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, China
3School of Cyber Security and Information Law, Chongqing University of Posts and Telecommunications, Chongqing, China
4College of Cyber Science, Nankai University, Tianjin, China
5Shandong Provincial Key Laboratory of Computer Networks, Jinan, China
6Shandong Computer Science Center, Jinan, China
7School of Cyber Security, Qilu University of Technology, Jinan, China
Correspondence should be addressed to Jinwei Wang; wjwei_2004@163.com
Received 13 September 2024; Accepted 22 March 2025
Academic Editor: Stefano Cirillo
Copyright © 2025 Hao Wang et al. International Journal of Intelligent Systems published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Image-to-image (I2I) translation has emerged as a valuable tool for privacy protection in the digital age, offering effective ways to safeguard portrait rights in cyberspace. In addition, I2I translation is applied in real-world tasks such as image synthesis, super-resolution, virtual fitting, and virtual live streaming. Traditional I2I translation models demonstrate strong performance when handling similar datasets. However, when the domain distance between two datasets is large, translation quality may degrade significantly due to notable differences in image shape and edges. To address this issue, we propose Long-Domain Search GAN (LDSGAN), an unsupervised I2I translation network that employs a GAN structure as its backbone, incorporating a novel Real-Time Routing Search (RTRS) module and Sketch Loss. Specifically, RTRS aids in expanding the search space within the target domain, aligning feature projection with images closest to the optimization target. Additionally, Sketch Loss retains human visual similarity during long-domain distance translation. Experimental results indicate that LDSGAN surpasses existing I2I translation models in both image quality and semantic similarity between input and generated images, as reflected by its mean FID and LPIPS scores of 31.509 and 0.581, respectively.
Keywords: I2I translation; long-distance domain; unsupervised learning
1. Introduction
As cyberspace activities become increasingly frequent, cyberspace portraits raise many security issues. Therefore, it is crucial to protect the privacy of portraits [1, 2]. Image-to-image (I2I) translation has become a popular research topic in recent years, and it aims to learn image-projecting functions from source to target domains. I2I protects portrait privacy and prevents facial recognition technology in cyberspace from infringing on personal identity by converting personal portraits into anime or other portraits. With the development of Generative Adversarial Networks (GANs) [3] and diffusion models [4], the powerful fitting ability of these models has further advanced the application of I2I in real life. In addition to portrait protection, image coloring [5–7], makeup migration [8], fashion editing [9], style migration [10], and game character face generation [11] also involve I2I technology.
Existing I2I translation methods are mainly categorized as supervised [7–9, 12–14] and unsupervised. Supervised I2I methods use paired data to train models. However, it is tough to obtain paired data in realistic environments, so unsupervised I2I methods have received widespread attention. For example, CycleGAN [15] introduces two mirror-symmetric GANs to establish constraints between the source and target domains and preserves image information through skip connections for cross-domain image transformation. Similarly, Drit++ [16] proposed a multidomain unsupervised I2I model that uses an encoder to extract the content and feature information from an image and exploits it as input to the CycleGAN.
However, these models assume that the images in the source and target domains have high similarity in shape or texture. In an I2I task with a small inter-domain distance, shapes and textures can be considered domain-independent features and preserved by skip-connected structures. In an I2I task with a large inter-domain distance, shape and texture are no longer domain-independent features because of the notable differences between the source and target domains. When skip connections are used, the source domain image in long-distance domain translation (LDDT) cannot be transferred to the target domain correctly, and artifacts of the source domain may remain on the generated image (as shown in Figure 1(b)). The image generated by the proposed method is shown in Figure 1(c). Figures 1(d), 1(e), and 1(f) show the feature projections passed by the skip connection, which contain too much texture from the source domain, resulting in degraded quality of the generated image. In addition, I2I models based on encoder–decoder architectures [16–18] often struggle to accurately match the source image in the target domain due to severe information loss. Therefore, it is difficult for existing methods to achieve satisfactory performance on LDDT.
In contrast to the above approaches, UI2I-via-StyleGAN2 [19] divides the I2I task into short-distance domain translation (SDDT) and LDDT according to the inter-domain distance. The target domain model is trained by fine-tuning the source domain model to minimize the effect of model distance. Meanwhile, the layer-swap approach directly swaps the high-level convolutional feature projections from the source domain to the target domain, thereby preserving more features of the source domain. However, the color arrangement of the generated image still cannot match the input image completely, and there is still potential for further improvement of the image quality.
To solve the problem in LDDT, we propose an unsupervised I2I translation method called Long-Domain Search GAN (LDSGAN). It introduces a novel Real-Time Route Search (RTRS) module for generating high-quality images. The RTRS module searches for a representation of the color arrangement and texture of the source domain and projects it to a representation in the target domain, thus helping the model preserve domain-independent features and map the domain-related features to the nearest projection points in the target domain. In addition, we propose Sketch Loss, which helps the model maintain the similarity between the generated image and the source domain image without enforcing that the generated image be identical to the source domain image in terms of distance.
Thus, the contributions of this paper can be summarized as follows:
We propose LDSGAN, which is capable of achieving higher generation quality in LDDT while maintaining the similarity between the input image and the generated image.
We propose the RTRS module, which helps the model find the nearest neighbor projection points of the source domain image in the target domain, thus improving the upper bound of the network fit.
We propose Sketch Loss to improve the visual similarity between the generated image and the source domain image. This loss function does not limit the distance between the generated image and the source domain image.
The remaining sections of this paper are organized as follows. Section 2 outlines the application of GANs to I2I translation. Section 3 describes the proposed I2I method in detail. Section 4 presents the experimental setup, verifies the performance of the proposed method on different datasets, and discusses the experimental results. Finally, Section 5 concludes the paper and indicates future work.
2. Related Work
In recent years, with the development of GANs [3, 20], I2I translation research has achieved impressive synthetic image results. On the one hand, among supervised I2I translation methods, Pix2pix [13] utilizes the UNet [12] architecture, and experiments have shown that introducing skip connections in image translation tasks can significantly improve translation performance. In addition, supervised I2I methods can be customized to specific scenarios, such as grayscale image coloring, fashion editing, and makeup transfer. Supervised I2I methods mostly use a similarity-based strategy that retains most of the information in the source domain and edits the target region.
On the other hand, among the unsupervised methods, CycleGAN [15] proposed a method to learn the input-to-output image projection without paired samples. Drit++ [16] proposed a multidomain I2I translation model that achieves better image quality by using multiple encoders to encode the context and features of the image. To further enhance the quality of the generated images, researchers have explored the application of StyleGAN [21] to I2I translation, as it is capable of generating impressive, high-quality images. For example, UI2I-via-StyleGAN2 [19] introduces a method that encodes images end-to-end into the latent space of StyleGAN [21]. It defines model distances based on StyleGAN2 [22] and proposes a GAN embedding-based inversion approach to achieve higher-quality results in long-range inter-model I2I translation.
Although the above supervised networks have made significant progress in I2I translation, they require paired data. When using unpaired data, unsupervised networks may be unable to generate high-quality color images in LDDT. In addition, in the I2I task, many studies have assumed a high degree of consistency between the distributions of source and target domain images over the content space. These studies usually use texture-based metrics such as LPIPS [23] as loss functions or similarity evaluation measures but ignore the differences in image texture types in LDDT.
To overcome the above problems, we propose LDSGAN, which has a wider search space in the target domain and thus can map the source domain image to the target domain more efficiently. It achieves feature aggregation by constructing RTRS modules that search for representations of color arrangements and textures in the source domain. Without limiting the distance between the generated image and the source domain image, Sketch Loss is utilized to improve the visual similarity between them, which in turn enables LDSGAN to generate high-quality and more appealing color images during long-distance domain transformation.
3. The Proposed Method
The proposed unsupervised LDSGAN searches for images close to the source domain within the target domain space and improves the performance of long-domain translation. As shown in Figure 2, the main components of LDSGAN include the source distill network (SDNet), the long-distance transform network (LDTNet), and the target domain search network (TDSNet).
Firstly, in SDNet, the model reduces the dimension of the source domain image while preserving its details, projecting it onto a manifold closer to the target domain. Secondly, LDTNet discards textures that are present in the source domain but not in the target domain by separating features and weakening the domain-related features of the source domain, and then translates the image to the target domain. Finally, TDSNet uses the RTRS module to search for the nearest neighbor projection points of the source domain image in the target domain, which improves the upper fitting limit of the network (shown in Figure 3). With Sketch Loss, the network preserves the human-perceived visual consistency between the generated image and the source image, resulting in a high-quality generated image in the target domain that closely resembles the original.

Figure 1: Examples of I2I translation results and corresponding feature maps under skip connection. The generated image exhibits degraded quality due to the excessive source-domain texture introduced by the skip connections. (a) The input image. (b) The image generated using a skip connection during generation. (c) The image generated by the proposed method. (d–f) Feature projections passed by the skip connection.
3.1. SDNet. It is more workable to map a low-dimensional image to the target domain than to map a high-dimensional image directly. Lower-dimensional images have less detail, so the source domain image is closer to the target domain in a lower-dimensional space. However, reducing the image from high to low dimensions leads to information loss in the source domain. We first apply Rich Init to preserve the high-frequency features, ensuring their transfer to the target domain's high-frequency features. Therefore, SDNet can effectively preserve the source domain information while reducing the image dimension.
SDNet uses successive convolutional layers as its encoder architecture and does not contain a decoder. Inspired by the model in [24], we use a convolutional layer to enhance the input image. Then, the feature projection is downsampled $N$ times to bring it closer to the low-dimensional target domain. The value of $N$ depends on prior knowledge: usually, the greater the distance between domains, the larger the value of $N$. As shown in Figure 3, the output feature projection $V_{x_s} \in \mathbb{R}^{C \times (H/2^N) \times (W/2^N)}$ of SDNet is fed into LDTNet for further source-to-target domain transformation.
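To make the description above concrete, the following is a minimal PyTorch sketch of an SDNet-style encoder; the strided 3 × 3 convolutions, the doubling channel progression, and the default of N = 4 downsampling stages are illustrative assumptions, while the 64 Rich Init channels follow the Figure 2 caption.

```python
import torch
import torch.nn as nn

class SDNetSketch(nn.Module):
    """Encoder-only distillation network: a Rich Init convolution followed by N
    strided-convolution downsampling stages (no decoder)."""
    def __init__(self, in_channels=3, rich_channels=64, n_down=4):
        super().__init__()
        # Rich Init: a single convolution lifting the RGB input to 64 rich feature maps.
        self.rich_init = nn.Sequential(
            nn.Conv2d(in_channels, rich_channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        # Each stage halves the spatial size, producing the low-dimensional projection V_{x_s}.
        downs, ch = [], rich_channels
        for _ in range(n_down):
            downs += [nn.Conv2d(ch, ch * 2, kernel_size=3, stride=2, padding=1),
                      nn.LeakyReLU(0.2)]
            ch *= 2
        self.down = nn.Sequential(*downs)

    def forward(self, x):
        rich_maps = self.rich_init(x)   # kept for later Route Code extraction in RTRS
        v_xs = self.down(rich_maps)     # shape: (B, C, H / 2**N, W / 2**N)
        return v_xs, rich_maps

# Example: with N = 4, a 512x512 input yields a 32x32 low-dimensional projection.
enc = SDNetSketch(n_down=4)
v, rich = enc(torch.randn(1, 3, 512, 512))   # v: (1, 1024, 32, 32), rich: (1, 64, 512, 512)
```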
3.2. LDTNet. In LDDT, significant differences in texture and detailed shapes exist between the source and target domain images. For example, when a human face portrait is transformed into an anime-style portrait, there are substantial differences in the detailed shapes of the facial features. The low-dimensional feature maps of the source domain obtained by SDNet therefore cannot be used directly to generate images in the target domain. For this purpose, we introduce LDTNet, which transforms the source domain feature projection into a feature projection suitable for the target domain in a low-dimensional space.
LDTNet uses multiple residual blocks [25] to transform the low-dimensional feature maps from the source domain to the target domain. It ensures that the information content of the feature maps before and after conversion is the same, without changing their number or size. Let $x_s$ denote the source domain representation extracted by SDNet and $y_s$ denote the target domain image corresponding to the source domain. LDTNet tries to learn a projecting function $f$ such that $f(x_s) \approx \tilde{y}_s$, where $\tilde{y}_s$ is the nearest neighbor map of $f(x_s)$ in the target domain.
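A minimal sketch of such a residual translator is given below, assuming six residual blocks with LeakyReLU activations; the actual block count and internal layout of LDTNet are not specified at this level of detail.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block [25]: spatial size and channel count are left unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class LDTNetSketch(nn.Module):
    """Stack of residual blocks approximating the projecting function f, so that
    f(x_s) lands near the nearest-neighbor representation of x_s in the target domain."""
    def __init__(self, channels, n_blocks=6):
        super().__init__()
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])

    def forward(self, v_xs):
        # Input and output share the same shape: the translation happens entirely
        # in the low-dimensional feature space produced by SDNet.
        return self.blocks(v_xs)
```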
Figure 2: The architecture of LDSGAN. Generator G takes the source domain image x as input and generates the target domain image y. We do not use a cycle-consistency loss to constrain the generated images because we discard textures and information that are present in the source domain but not in the target domain. Rich Init is used to preserve the high-frequency features and its output channel count is set to 64, and RTRS extracts 512 Route Codes. The Route Code of the next layer is extracted from the ToRGB result of the previous layer and the rich maps.

3.3. TDSNet. To restore the feature projection $\tilde{y}_s$ to the high-dimensional space and to make its projected location close to the nearest neighbor projection point in the target domain, TDSNet consists of $N$ RTRS modules and $N$ Search Application Blocks (SABlocks). Each RTRS and its corresponding SABlock projects the low-dimensional feature maps to a high-dimensional feature space along appropriate routes, and these routes are applied to each upsampling layer. The rich maps contain the shape and texture information of the source domain, while the low-dimensional feature maps are a low-dimensional representation of the target domain and contain less shape and texture information. The combination of the two generates a Route Code, which is used to control the strength of the convolutional kernel so that the convolutional layer can find the best path to project the low-dimensional feature maps to the nearest projection point. Thus, textures that are not decoupled and are discarded in SDNet can be recovered by RTRS.
The structure of RTRS is shown in Figure 3. RTRS utilizes the feature maps of the TDSNet layers and the rich maps from SDNet to extract Route Codes. Driven by the objective function, these Route Codes can capture domain-dependent features of the source domain image. Instead of downsampling the feature mapping to a size of 1 × 1, we encode the Route Code using a PatchGAN-style discriminator to obtain features that retain spatial information. Downsampling the feature mapping to 1 × 1 focuses more on the confidence of a particular shape or texture, whereas the PatchGAN encoder focuses more on the spatial layout of colors. In principle, features should be extracted from all three RGB channels to include their full information. However, given the effectiveness of the PatchGAN discriminator, we extract the features from only one channel and use it to encode the full information of the color image. In addition, to extend the route search, we propose a Scale Block to increase the variance of the Route Code. The Scale Block can be represented as

$$C_i = F_{fc1_i}(x) \cdot F_{fc2_i}(x) + F_{fc3_i}(x), \qquad (1)$$

where $C_i$ is the final Route Code of the $i$-th RTRS, $F_{fc1_i}$, $F_{fc2_i}$, and $F_{fc3_i}$ are three fully connected layers of the $i$-th RTRS, and $x$ is obtained by flattening the output of the PatchGAN encoder.
The full RTRS structure can be represented as

$$C_i = \mathrm{RTRS}\left(x, y_{s_i}\right), \qquad (2)$$

where $x$ represents the rich maps from SDNet, $y_{s_i}$ represents the $i$-th feature maps output by SABlock, $i \in \{0, 1, 2, \ldots, N\}$, and $y_{s_0}$ is the output of LDTNet.
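The sketch below is one concrete reading of Eq. (1), assuming a 512-dimensional Route Code (the Figure 2 caption states that RTRS extracts 512 Route Codes) and an unspecified input size `in_features` for the flattened PatchGAN-encoder output; both are assumptions rather than published details.

```python
import torch
import torch.nn as nn

class ScaleBlock(nn.Module):
    """Scale Block of the i-th RTRS, Eq. (1): C_i = FC1(x) * FC2(x) + FC3(x).
    The multiplicative branch increases the variance of the Route Code,
    widening the route search."""
    def __init__(self, in_features, code_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(in_features, code_dim)
        self.fc2 = nn.Linear(in_features, code_dim)
        self.fc3 = nn.Linear(in_features, code_dim)

    def forward(self, x):
        # x is the output of the PatchGAN-style encoder applied to the concatenation
        # of the resized rich maps and y_{s_i}; it is flattened before the FC layers.
        x = torch.flatten(x, start_dim=1)
        return self.fc1(x) * self.fc2(x) + self.fc3(x)   # Route Code C_i, shape (B, 512)
```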
SABlock customizes the convolution kernel using the Route Code obtained by RTRS to generate the target image. The structure of SABlock is shown in Figure 3. Inspired by the tunable convolution kernel in StyleGAN2, SABlock is used to customize the image generation route. During the generation process, noise is no longer added, as it would lead to inaccurate optimization of the route search. To make the route search process focus more on the color of the RGB channels, the RGB mapping obtained through ToRGB is combined with the Route Code extracted by RTRS. In SABlock, using the same search code does not cause deviation from the translation route because the translation path of the feature mapping is short. Therefore, we use the same search code for all modulated convolutional kernels to reduce computational overhead.
Given the $i$-th inputs $y_{s_i}$ and $RGB_i$ of SABlock, the Search Apply Block can be represented as

$$\left(y_{s_{i+1}}, RGB_{i+1}\right) = \mathrm{SA}\left(y_{s_i}, RGB_i, C_i\right), \qquad (3)$$

where $y_{s_{i+1}}$ and $RGB_{i+1}$ are the outputs of the SABlock of this layer and also the inputs of the next layer, $i \in \{0, 1, 2, \ldots, N\}$. $RGB_0$ is obtained by converting $y_{s_0}$ through ToRGB.
After each convolutional layer of the generator, a Leaky
ReLU [26] with a slope of 0.2 is used as the activation
function. In addition, to avoid checkerboard artifacts, we use
a sampled convolutional kernel similar to StyleGAN2.
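For readers unfamiliar with the tunable kernel that SABlock builds on, the following condensed sketch shows a StyleGAN2-style modulated convolution driven by the Route Code; it is an illustration rather than the exact SABlock implementation, and the upsampling path, the ToRGB branch, and the resampled kernels used against checkerboard artifacts are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """StyleGAN2-style modulated convolution: the Route Code scales the kernel per
    input channel (modulation), the kernel is renormalized (demodulation), and no
    noise is injected, in line with the description above."""
    def __init__(self, in_ch, out_ch, kernel_size, code_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.affine = nn.Linear(code_dim, in_ch)   # maps Route Code C_i to per-channel scales
        self.padding = kernel_size // 2

    def forward(self, x, code):
        b, in_ch, h, w = x.shape
        scale = self.affine(code).view(b, 1, in_ch, 1, 1)
        weight = self.weight.unsqueeze(0) * scale                          # modulate
        demod = torch.rsqrt((weight ** 2).sum(dim=[2, 3, 4]) + 1e-8)
        weight = (weight * demod.view(b, -1, 1, 1, 1)).view(-1, in_ch,     # demodulate
                                                            *self.weight.shape[2:])
        # Grouped convolution applies a per-sample kernel across the batch.
        out = F.conv2d(x.reshape(1, -1, h, w), weight, padding=self.padding, groups=b)
        return F.leaky_relu(out.view(b, -1, h, w), 0.2)
```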
3.4. Loss Function. Due to the differences in image textures and the diversity of colors and strokes in LDDT, network training becomes challenging. To stabilize the training process, the network employs a multiobjective weighting strategy.
Figure 3: The structures of RTRS and SABlock. $RGB_i$ and $y_{s_i}$ are the outputs of the Search Apply Block in the previous layer; the rich maps ($x$) are bilinearly resized to the same size as $y_{s_i}$ and concatenated with it to extract the Route Code. The Search Apply Block uses the Route Code to find a suitable way to modulate the generation.

LDSGAN uses the loss function of WGAN-GP [27] as its adversarial loss:

$$L_{adv_D} = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\left\| \nabla_{\hat{x}} D(\hat{x}) \right\|_2 - 1\big)^2\Big], \qquad (4)$$

$$L_{adv_G} = \mathbb{E}_{x \sim P_r}\left[D(x)\right] - \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right], \qquad (5)$$

where $x$ denotes the input images, $\tilde{x}$ denotes the generated images, and $\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of the discriminator at random interpolated samples $\hat{x}$.
In addition, in the generator, a pixel-level $L_2$ loss is used to keep the regional hue of the generated image $G(x)$ close to that of the input image $x$:

$$L_2(x) = \left\| x - G(x) \right\|_2. \qquad (6)$$
Finally, because LPIPS is a metric that incorporates both texture and structure, while most of the textures in the target and source domains do not coincide, we employ Sketch Loss instead of LPIPS to maintain the image distribution in LDDT. The proposed Sketch Loss can be divided into two parts. The first is Soft-Sketch Loss ($L_{SSL}$), which drives the generated image to have the sketch complexity of the target domain. Soft-Sketch Loss can be expressed as

$$L_{SSL}(x) = \mathrm{softplus}\left(\left\| F(x) - F(G(x)) \right\|_1\right), \qquad (7)$$

where $F$ is the pretrained sketch extractor from [28].
The other loss is Hard-Sketch Loss ($L_{HSL}$), which aims to constrain the sketch of the generated image to have a complexity similar to that of the target domain image, and can be expressed as

$$L_{HSL}(x) = \left\| F(x) - F(G(x)) \right\|_2. \qquad (8)$$
Overall, the loss of the discriminator $L_D$ and the loss of the generator $L_G$ in the proposed network can be expressed as follows:

$$L_D = L_{adv_D}, \qquad (9)$$

$$L_G = \lambda_1 L_{adv_G} + \lambda_2 L_2 + \lambda_3 L_{SSL} + \lambda_4 L_{HSL}, \qquad (10)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyperparameters that balance the different losses. By default, we set $\lambda_1 = 1$, $\lambda_2 = 10$, $\lambda_3 = 10$, and $\lambda_4 = 1$.
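Putting Eqs. (6)–(10) together, the generator objective can be sketched as below; the mean-reduced norms, the dropped constant term $\mathbb{E}_x[D(x)]$ of Eq. (5), and the placeholder `sketch_extractor` (standing in for the pretrained sketch network $F$ of [28]) are implementation assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def generator_loss(x, g_x, d_fake, sketch_extractor,
                   lambdas=(1.0, 10.0, 10.0, 1.0)):
    """Weighted generator objective of Eq. (10):
    L_G = l1*L_advG + l2*L_2 + l3*L_SSL + l4*L_HSL.
    d_fake is the critic output on the generated image G(x)."""
    l1, l2, l3, l4 = lambdas
    adv = -d_fake.mean()                                   # -E[D(G(x))], WGAN-style term
    pixel_l2 = (x - g_x).pow(2).mean()                     # pixel-level L2 term, Eq. (6)
    f_x, f_gx = sketch_extractor(x), sketch_extractor(g_x)
    soft_sketch = F.softplus((f_x - f_gx).abs().mean())    # L_SSL, Eq. (7)
    hard_sketch = (f_x - f_gx).pow(2).mean()               # L_HSL, Eq. (8)
    return l1 * adv + l2 * pixel_l2 + l3 * soft_sketch + l4 * hard_sketch
```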
4. Experiments
4.1. Baseline and Dataset. To evaluate the proposed LDSGAN model, we selected several state-of-the-art works as baselines. CycleGAN [15] introduces Cycle Consistency Loss (CCL) to ensure that the identity of the input and output images remains consistent when trained with unpaired data. Drit++ [16] supports I2I conversion through high-quality multimodal translation. UI2I-via-StyleGAN2 [19] proposes a GAN embedding-based approach to obtain higher image quality and similarity in I2I tasks by fine-tuning the pretrained StyleGAN2.
This experiment is evaluated in two I2I translation scenarios with large differences between their domains. For the Yellow2Anime task, the Netflix face dataset and Danbooru2018 [29] are used. The Netflix face dataset contains 136,723 images of 512 × 512 size. For the Sketch2Anime task, we use the DCS dataset collected in [7]; each image is cropped to a square with the shorter side length and then scaled to 512 × 512.
4.2. Experimental Settings. The framework for the experiments is implemented in PyTorch [30]. All experiments are performed on an NVIDIA Tesla V100 GPU. The method uses the Adam optimizer with the learning rate set to 1e-4 and a total of 400k iterations. The momentum parameters of the generator and discriminator are set to $\beta_1 = 0$ and $\beta_2 = 0.99$.
The generator's output is the nearest neighbor estimate of the input source domain image in the target domain. To accelerate convergence, we use WGAN-GP [27] to initialize the generator and the discriminator and impose no gradient penalty on the generator. The gradient penalty of the discriminator is executed every 16 iterations, and the penalty strength is amplified 16 times, which reduces the network training time. In addition, in all convolutional and fully connected layers, we use Kaiming normal initialization [31] to improve convergence speed while avoiding gradient explosion during training. Similar to the approach of Nah et al. [32], we do not use any normalization layer in the network.
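The settings above can be wired up roughly as in the following sketch; the stand-in modules, the gradient-penalty weight of 10 (the WGAN-GP default), and the exact placement of the lazy penalty are assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn

# Stand-in modules; LDSGAN's generator and Wasserstein critic replace these.
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

# Adam with the settings of Section 4.2: lr = 1e-4, beta1 = 0, beta2 = 0.99.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.99))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.0, 0.99))

GP_INTERVAL = 16   # lazy regularization: compute the gradient penalty every 16 iterations...
GP_SCALE = 16.0    # ...and amplify it 16x so its average contribution is preserved.

def gradient_penalty(real, fake):
    """Standard WGAN-GP penalty on random interpolations between real and fake images."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake.detach()).requires_grad_(True)
    grad, = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)
    return ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def critic_step(step, real, fake, lambda_gp=10.0):
    loss = D(fake.detach()).mean() - D(real).mean()        # Eq. (4) without the penalty
    if step % GP_INTERVAL == 0:                            # lazy gradient penalty
        loss = loss + lambda_gp * GP_SCALE * gradient_penalty(real, fake)
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()
```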
4.3. Comparison Experiment. To evaluate the quality of the images generated by the proposed method and their similarity to the input images, we use the following metrics. Fréchet Inception Distance (FID) [33] computes the distance between the feature vectors of real and generated images to measure the quality of the output images. LPIPS [23] is a deep feature-based metric for evaluating structural and textural similarity between images. We use LPIPS to measure the structural similarity of images despite the textural differences between source and target domain images in LDDT.
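As one way to compute these two scores, the sketch below uses the third-party `lpips` and `torchmetrics` packages; neither package nor the exact evaluation pipeline is specified in the paper, so the snippet is illustrative only.

```python
import torch
import lpips                                                  # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance   # pip install torchmetrics

lpips_fn = lpips.LPIPS(net="alex")                  # deep-feature perceptual distance [23]
fid_metric = FrechetInceptionDistance(feature=2048)

def evaluate(inputs, outputs, references):
    """inputs/outputs: float tensors in [-1, 1] of shape (B, 3, H, W);
    references: real target-domain images in the same range."""
    # LPIPS between each input image and its translation (lower = more similar).
    lpips_score = lpips_fn(inputs, outputs).mean().item()
    # FID between generated and real target-domain images (the metric expects uint8).
    to_uint8 = lambda t: ((t + 1.0) * 127.5).clamp(0, 255).to(torch.uint8)
    fid_metric.update(to_uint8(references), real=True)
    fid_metric.update(to_uint8(outputs), real=False)
    return fid_metric.compute().item(), lpips_score
```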
To evaluate the performance of LDSGAN, we compared it with several other methods in the Yellow2Anime task, including CycleGAN [15], Drit++ [16], UI2I-via-StyleGAN2 (UI2I-StGAN2) [19], AttentionGAN [34], and E2GAN [35]. Specifically, we retrained these models on the same training dataset using the same settings. For CycleGAN [15] and Drit++ [16], the training period is set to 10 and the decay period is set to 5. Then, the performance of these models is evaluated by FID and LPIPS. The experimental results are shown in Table 1, where 2000 test images are randomly selected as inputs. Figure 4 shows some comparison examples.
Table 1: Comparative results of the proposed method and existing methods.

Method                   Yellow2Anime        Sketch2Anime
                         FID       LPIPS     FID       LPIPS
CycleGAN [15]            185.812   0.547     153.61    0.509
Drit++ [16]              60.156    0.605     59.995    0.583
UI2I-StGAN2 LS 1 [19]    46. 1     0.670     44. 81    0.631
UI2I-StGAN2 LS 3 [19]    71.361    0.644     63.056    0.627
UI2I-StGAN2 LS 5 [19]    102.405   0.618     81.551    0.594
AttentionGAN [34]        62.983    0.586     65.276    0.598
E2GAN [35]               73.03     0.603     72.615    0.618
LDSGAN (ours)            31.437    0.582     29.582    0.579
Note: Best results are in italics; suboptimal results are in bold.

Figure 4: Visualization of comparison experiments on Yellow2Anime. (a) Input images, (b) results of CycleGAN, (c) results of Drit++, (d) results of UI2I-via-StyleGAN2 with layer-swap 1, (e) results of UI2I-via-StyleGAN2 with layer-swap 3, (f) results of AttentionGAN, and (g) results of LDSGAN.

As shown in Table 1, the FID scores of the proposed method are better than those of the other methods. Although some methods perform better on LPIPS, they mainly rely on the
direct migration of textures from the input images, which lie outside the target domain. As can be seen in Figure 4, the images generated by CycleGAN and Drit++ contain too much texture from the input image, which is an advantage of skip connections in the SDDT task but a major drawback in the LDDT task. The results of UI2I-via-StyleGAN2 are poor when the number of swapped layers is 3 or 5. The main reason is that the resulting image is more prone to corruption when higher-level convolutional layers are swapped, as the swapped features need to be mapped through more convolutional layers that are not in the original model. Therefore, the FID score is worse when the number of swapped layers is 3 or 5. For the layer-swap 1 result, although the generated image is similar to the input image in terms of facial orientation and expression, it fails to accurately preserve the color layout of the input image and loses the decorations of the input image. In summary, LDSGAN obtains the best image quality and effectively maintains the similarity between the input and output images.
4.4. Ablation Study and Analysis
4.4.1. Ablation Study of RTRS. To analyze the role of RTRS in LDSGAN, we designed the w/o RTRS and one RTRS variants and compared their effectiveness with the full RTRS. w/o RTRS means that the model does not use RTRS and SABlock, so the model is equivalent to an autoencoder. One RTRS generates all the search codes at the first call to RTRS, whereas full RTRS is the complete method proposed above. The results of the experiments are shown in Table 2, and some examples from Sketch2Anime are illustrated in Figure 5.
As can be seen in Table 2, in the Yellow2Anime task the w/o RTRS model performs worst in both FID and LPIPS, the one RTRS model ranks in the middle, and the full RTRS model performs best.
In addition, the trend of LPIPS scores in the Sketch2Anime task is the same as in the Yellow2Anime task. The one RTRS model significantly improves the FID score by utilizing the StyleGAN2-style architecture of the network. In contrast, the full RTRS model further improves the similarity between the input and generated images by searching the layout. For the Sketch2Anime task, the FID scores are relatively stable across the three modes because the input image does not have a complex color layout.
4.4.2. Ablation Study of Sketch Loss. The effectiveness of using Sketch Loss to maintain image consistency is analyzed through ablation studies. First, we conduct comparative experiments by excluding both the Soft-Sketch Loss and the Hard-Sketch Loss. The comparison results are shown in Table 3, and their visualization is shown in Figure 5.
Table 2: The ablation results of RTRS.

             Yellow2Anime        Sketch2Anime
             FID       LPIPS     FID       LPIPS
w/o RTRS     47.662    0.664     34.5 1    0.609
One RTRS     34.621    0.604     46.156    0.628
Full RTRS    31.437    0.582     29.582    0.579
Note: The italic values indicate the best results and the bold values represent the second-best results.

Figure 5: Visualization of the ablation experiments on Sketch2Anime. (a) Input sketch images, (b) w/o RTRS, (c) w/o Soft-Sketch Loss and Hard-Sketch Loss, (d) Soft-Sketch Loss, (e) Hard-Sketch Loss, and (f) Soft-Sketch + Hard-Sketch Loss.

Table 3: The ablation results of Sketch Loss.

                         Yellow2Anime        Sketch2Anime
                         FID       LPIPS     FID       LPIPS
w/o L_SSL + L_HSL        36.931    0.631     36.028    0.594
L_SSL                    32.566    0.668     28.043    0.592
L_HSL                    33.676    0.649     34.752    0.609
L_SSL + L_HSL            31.437    0.582     29.582    0.579

Figure 6: Curve of the LPIPS score on the Yellow2Anime task, with full RTRS + Hard-Sketch Loss.

Table 3 shows that although using $L_{SSL}$ or $L_{HSL}$ alone yields better FID scores than w/o $L_{SSL} + L_{HSL}$, they perform worse on LPIPS. This is because the optimization process tries to push the generated image toward the target domain,
which is further away from the source domain. We noticed in our experiments that, as shown in Figure 6, the LPIPS score is low at the beginning of the optimization but increases in the later stages.
The main reason for this phenomenon is that using either $L_{SSL}$ or $L_{HSL}$ alone causes the network to focus on only some of the features of the source domain. The network may become overly focused on these specific features as the number of training iterations increases. Therefore, to overcome this shortcoming, we finally use the combination of $L_{SSL} + L_{HSL}$ to constrain the network training.
5. Conclusion
In this paper, we propose a novel LDSGAN method to translate images further toward the target domain and generate high-quality images while maintaining the similarity between the input and generated images. The proposed LDSGAN uses SDNet to distill the information of the input images while retaining their rich maps. Then LDTNet is adopted to further translate the feature maps to the target domain. Furthermore, the applied TDSNet restores the color layout of the input images by searching over the rich maps and the generated maps in the target domain. Finally, Sketch Loss is used to maintain the image identity between the input and generated images. Comprehensive comparisons demonstrate that the proposed LDSGAN generates more vivid images in LDDT and maintains the color and layout, which shows that the proposed method has obvious superiority over state-of-the-art work in LDDT. Since the proposed method suffers from the problem of dim detailed shapes, we aim to improve the coincidence of detailed shapes between input and generated images in future work.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflicts of interest.
Author Contributions
Hao Wang: conceptualization, methodology, software,
formal analysis, and writing original draft.
Chenbin Wang: conceptualization, supervision, formal analysis, and writing review and editing.
Xin Cheng: conceptualization, supervision, formal analysis, and writing review and editing.
Hao Wu: conceptualization, supervision, formal analysis, and writing review and editing.
Jiawei Zhang: methodology, validation, investigation,
writing review and editing, and supervision.
Jinwei Wang: project administration, methodology,
validation, investigation, writing review and editing, and
supervision.
Xiangyang Luo: funding acquisition, resources, and
methodology.
Bin Ma: data curation and visualization.
Funding
This work was supported by the National Key R and D Program of China (Grant No. 2021QY0700), National Natural Science Foundation of China (Grant Nos. 62472229, 62072250, U23A20305, U23B2022, 62371145, 62072480, 62172435, 62302249, 62272255, 62302248, and U20B2065), Zhongyuan Science and Technology Innovation Leading Talent Project of China (Grant No. 214200510019), Open Foundation of Henan Key Laboratory of Cyberspace Situation Awareness (Grant No. HNTS2022002), and Graduate Student Scientific Research Innovation Projects of Jiangsu Province (Grant No. KYCX24_1513).
References
[1] H. Zhang, B. Chen, J. Wang, and G. Zhao, “A Local Per-
turbation Generation Method for gan-generated Face Anti-
forensics,” IEEE Transactions on Circuits and Systems for
Video Technology 33, no. 2 (2023): 661–676, https://doi.org/
10.1109/tcsvt.2022.3207310.
[2] P. Yue, B. Chen, and Z. Fu, “Local Region Frequency Guided
Dynamic Inconsistency Network for Deepfake Video De-
tection,” Big Data Mining and Analytics 7, no. 3 (2024):
889–904, https://doi.org/10.26599/bdma.2024.9020030.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., “Generative
Adversarial Nets,” Advances in Neural Information Processing
Systems 27 (2014).
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
B. Ommer, “High-resolution Image Synthesis with Latent
Diusion Models,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (2022),
10684–10695.
[5] R. Zhang, J. Y. Zhu, P. Isola, et al., “Real-time User-Guided
Image Colorization with Learned Deep Priors,” ACM
Transactions on Graphics 36, no. 4 (2017): 1–11, https://
doi.org/10.1145/3072959.3073703.
[6] R. Zhang, P. Isola, and A. A. Efros, “Colorful Image Color-
ization,” in Computer Vision–ECCV 2016: 14th European
Conference (Amsterdam, the Netherlands: Springer, 2016),
649–666.
[7] Z. Dou, N. Wang, B. Li, Z. Wang, H. Li, and B. Liu, “Dual
Color Space Guided Sketch Colorization,” IEEE Transactions
on Image Processing 30 (2021): 7292–7304, https://doi.org/
10.1109/tip.2021.3104190.
[8] W. Jiang, S. Liu, C. Gao, et al., “Psgan: Pose and Expression
Robust Spatial-Aware gan for Customizable Makeup Trans-
fer,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2020), 5194–5202.
[9] H. Dong, X. Liang, Y. Zhang, et al., “Fashion Editing with
Adversarial Parsing Learning,” in Proceedings of the IEEE/
CVF Conference on Computer Vision and Pattern Recognition
(2020), 8120–8128.
[10] X. Huang and S. Belongie, “Arbitrary Style Transfer in Real-
Time with Adaptive Instance Normalization,” in Proceedings
of the IEEE International Conference on Computer Vision
(2017), 1501–1510.
[11] S. Kang, Y. Ok, H. Kim, and T. Hahn, “Image-to-image
Translation Method for Game-Character Face Generation,”
in 2020 IEEE Conference on Games (CoG) (IEEE, 2020),
628–631.
[12] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolu-
tional Networks for Biomedical Image Segmentation,” in
Medical Image Computing and Computer-Assisted Inter-
vention–MICCAI 2015: 18th International Conference
(Munich, Germany: Springer, 2015), 234–241.
[13] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image
Translation with Conditional Adversarial Networks,” in
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (2017), 1125–1134.
[14] E. Richardson, Y. Alaluf, O. Patashnik, et al., “Encoding in
Style: a Stylegan Encoder for Image-To-Image Translation,” in
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (2021), 2287–2296, https://doi.org/
10.1109/cvpr46437.2021.00232.
[15] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-
To-Image Translation Using Cycle-Consistent Adversarial
Networks,” in Proceedings of the IEEE International Confer-
ence on Computer Vision (2017), 2223–2232.
[16] H. Y. Lee, H. Y. Tseng, J. B. Huang, M. Singh, and M. H. Yang,
“Diverse Image-To-Image Translation via Disentangled
Representations,” in Proceedings of the European Conference
on Computer Vision (ECCV) (2018), 35–51.
[17] M. Y. Liu, T. Breuel, and J. Kautz, “Unsupervised Image-To-
Image Translation Networks,” Advances in Neural In-
formation Processing Systems 30 (2017).
[18] X. Huang, M. Y. Liu, S. Belongie, and J. Kautz, “Multimodal
Unsupervised Image-To-Image Translation,” in Proceedings of
the European Conference on Computer Vision (ECCV) (2018),
172–189.
[19] J. Huang, J. Liao, and S. Kwong, “Unsupervised Image-To-
Image Translation via Pre-trained Stylegan2 Network,” IEEE
Transactions on Multimedia 24 (2022): 1435–1448, https://
doi.org/10.1109/tmm.2021.3065230.
[20] M. Mirza and S. Osindero, “Conditional Generative Adver-
sarial Nets” (2014).
[21] T. Karras, S. Laine, and T. Aila, “A Style-Based Generator
Architecture for Generative Adversarial Networks,” in
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (2019), 4401–4410.
[22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and
T. Aila, “Analyzing and Improving the Image Quality of
Stylegan,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (2020), 8110–8119.
[23] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang,
“The Unreasonable Effectiveness of Deep Features as a Per-
ceptual Metric,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (2018), 586–595.
[24] Y. Rao and J. Ni, “A Deep Learning Approach to Detection of
Splicing and Copy-Move Forgeries in Images,” in 2016 IEEE
International Workshop on Information Forensics and Security
(WIFS) (IEEE, 2016), 1–6.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning
for Image Recognition,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (2016), 770–778.
[26] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., “Rectifier Non-
linearities Improve Neural Network Acoustic Models,” in
Proc. Icml, 30 (Atlanta, GA, 2013), 3.
[27] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and
A. C. Courville, “Improved Training of Wasserstein Gans,”
Advances in Neural Information Processing Systems 30 (2017).
[28] M. Li, Z. Lin, R. Mech, E. Yumer, and D. Ramanan, “Photo-
sketching: Inferring Contour Drawings from Images,” in 2019
IEEE Winter Conference on Applications of Computer Vision
(WACV) (IEEE, 2019), 1403–1412.
[29] G. Branwen and A. Gokaslan, “Danbooru2019: A Large-Scale
Crowdsourced and Tagged Anime Illustration Dataset,”
Danbooru2017 6 (2019).
[30] A. Paszke, S. Gross, S. Chintala, et al., “Automatic Differ-
entiation in Pytorch” (2017).
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into
Rectiers: Surpassing Human-Level Performance on Image-
net Classication,” in Proceedings of the IEEE International
Conference on Computer Vision (2015), 1026–1034.
[32] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep Multi-Scale
Convolutional Neural Network for Dynamic Scene Deblur-
ring,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (2017), 3883–3891.
[33] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and
S. Hochreiter, “Gans Trained by a Two Time-Scale Update
Rule Converge to a Local Nash Equilibrium,” Advances in
Neural Information Processing Systems 30 (2017).
[34] H. Tang, H. Liu, D. Xu, P. H. Torr, and N. Sebe, “Attentiongan:
Unpaired Image-To-Image Translation Using Attention-Guided
Generative Adversarial Networks,” IEEE Transactions on Neural
Networks and Learning Systems 34, no. 4 (2023): 1972–1987,
https://doi.org/10.1109/tnnls.2021.3105725.
[35] Y. Gong, Z. Zhan, Q. Jin, et al., “E2GAN: Efficient Training of
Efficient GANs for Image-To-Image Translation,” in Forty-first
International Conference on Machine Learning (2024).