Image-to-Image Translation with Multi-Path Consistency Regularization
Jianxin Lin1∗, Yingce Xia3∗, Yijun Wang2, Tao Qin3 and Zhibo Chen1†
1CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System,
University of Science and Technology of China
2University of Science and Technology of China
3Microsoft Research Asia
{linjx, wyjun}@mail.ustc.edu.cn, {yingce.xia, taoqin}@microsoft.com, chenzhibo@ustc.edu.cn
Abstract
Image translation across different domains has attracted much attention in both the machine learning and computer vision communities. Taking the translation from a source domain $\mathcal{D}_s$ to a target domain $\mathcal{D}_t$ as an example, existing algorithms mainly rely on two kinds of loss for training: one is the discrimination loss, which is used to differentiate images generated by the models from natural images; the other is the reconstruction loss, which measures the difference between an original image and its reconstruction through the $\mathcal{D}_s \to \mathcal{D}_t \to \mathcal{D}_s$ translation. In this work, we introduce a new kind of loss, the multi-path consistency loss, which evaluates the difference between the direct translation $\mathcal{D}_s \to \mathcal{D}_t$ and the indirect translation $\mathcal{D}_s \to \mathcal{D}_a \to \mathcal{D}_t$, with $\mathcal{D}_a$ as an auxiliary domain, to regularize training. For multi-domain translation (with at least three domains), which focuses on building translation models between any two domains, at each training iteration we randomly select three domains, set them respectively as the source, auxiliary and target domains, build the multi-path consistency loss and optimize the network. For two-domain translation, we introduce an additional auxiliary domain to construct the multi-path consistency loss. We conduct various experiments to demonstrate the effectiveness of the proposed method, including face-to-face translation, paint-to-photo translation, and de-raining/de-noising translation.
1 Introduction
Image-to-image translation aims at learning a mapping that transfers an image from a source domain to a target one while maintaining the main representations of the input image from the source domain. Many computer vision problems can be viewed as image-to-image translation tasks, including image stylization [Gatys et al., 2016], image restoration [Mao et al., 2016], image segmentation [Girshick, 2015], and so on.
∗The first two authors contributed equally to this work
†Corresponding author
Figure 1: Illustration of multi-path consistency regularization on the
translation between different hair colors. (a) Results of StarGAN.
(b) Our results.
Since a large amount of parallel data is costly to collect in practice, most recent works have focused on unsupervised image-to-image translation algorithms. Two kinds of algorithms are widely adopted for this problem. The first one is generative adversarial networks (briefly, GAN), which consist of an image generator used to produce images and an image discriminator used to verify whether an image is a fake one from a machine or a natural one. Ideally, the training reaches an equilibrium where the generator produces "real" images that the discriminator cannot distinguish from natural images [Goodfellow et al., 2014]. The other one is dual learning [He et al., 2016], which was first proposed for neural machine translation and then successfully adapted to image translation. In dual learning, a pair of dual tasks is involved, such as man-to-woman translation vs. woman-to-man translation, and the reconstruction losses of the two tasks are minimized during optimization. The combination of GAN algorithms and dual learning leads to many algorithms for two-domain image translation, such as CycleGAN [Zhu et al., 2017], DualGAN [Yi et al., 2017], DiscoGAN [Kim et al., 2017] and conditional DualGAN [Lin et al., 2018a], and for multi-domain image translation, such as StarGAN [Choi et al., 2018].
We observe that multiple domains are bridged by multi-path consistency. Take the pictures in Figure 1 as an example: we work on a three-domain image translation problem, which aims to change the hair color of an input image to a specified one. Ideally, the direct translation (i.e., one-hop translation) from brown hair to blond hair should be the same as the indirect translation (i.e., two-hop translation) from brown to black to blond hair. However, such an important property is ignored in the current literature. As shown in Figure 1(a), without multi-path consistency regularization, the direct and indirect translations are not consistent in terms of hair color. Besides, on the right of the face in the one-hop translation, there is noticeable horizontal noise. To keep the two generated images consistent, in this paper we propose a new loss, the multi-path consistency loss, which explicitly models the relationship among three domains. We require that the difference between the direct translation from the source to the target domain and the indirect translation from the source through the auxiliary to the target domain be minimized. For example, in Figure 1, the L1-norm loss between the two translated blond-hair pictures should be minimized. After applying this constraint, as shown in Figure 1(b), the direct and indirect translations are much more similar, and both contain less noise.
Multi-path consistency loss can be generally applied to image translation tasks. For multi-domain ($\ge 3$) translation, during each training iteration we randomly select three domains, apply the multi-path consistency loss to each translation task, and eventually obtain models that generate better images. For the two-domain image translation problem, we need to introduce a third auxiliary domain to help establish the multi-path consistency relation.
Our contributions can be summarized as follows: (1) We propose a new learning framework with a multi-path consistency loss that leverages the information among multiple domains. This loss regularizes the training of each task and leads to better performance, and we provide an efficient algorithm to optimize the framework. (2) We conduct rich experiments to verify the proposed method, specifically face-to-face translation, paint-to-photo translation, and de-raining/de-noising translation. Qualitatively, the models trained with the multi-path consistency loss generate clearer images with fewer block artifacts. Quantitatively, the classification error rates and PSNR of our models outperform the baselines on all tasks. We also conduct a user study on the multi-domain translation tasks, in which 59.14% and 89.85% of the votes prefer our proposed method on face-to-face and paint-to-photo translations, respectively.
2 Related works
In this section, we summarize the literature about GAN and
unsupervised image-to-image translation.
GAN. GAN [Goodfellow et al., 2014] was first proposed to generate images in an unsupervised manner. A GAN is made up of a generator and a discriminator: the generator maps random noise to an image, and the discriminator verifies whether the image is a natural one or a fake one. The training of GAN is formulated as a two-player minimax game. Various versions of GAN have been proposed to exploit its capability for different image generation tasks [Arjovsky et al., 2017; Huang et al., 2017; Lin et al., 2018b]. InfoGAN [Chen et al., 2016] learns to disentangle latent representations by maximizing the mutual information between a small subset of the latent variables and the observation. [Radford et al., 2015] presented deep convolutional generative adversarial networks (DCGANs) for high-quality image generation and unsupervised image classification, bridging convolutional neural networks and unsupervised image generation. SRGAN [Ledig et al., 2017] maps low-resolution images to high-resolution images. Isola et al. [Isola et al., 2017] proposed a general conditional GAN for image-to-image translation tasks, which can be used to solve label-to-street-scene and aerial-to-map translation problems.
Unsupervised image-to-image translation. Since it is usually hard to collect a large amount of parallel data for supervised image-to-image translation tasks, unsupervised learning based algorithms have been widely adopted. Based on adversarial training, Dumoulin et al. [Dumoulin et al., 2016] and Donahue et al. [Donahue et al., 2016] proposed algorithms to jointly learn bidirectional mappings between the latent space and the data space. Taigman et al. [Taigman et al., 2016] presented a domain transfer network (DTN) for unsupervised cross-domain image generation under the assumption that a constant latent space between the two domains exists; it generates images in the target domain's style while preserving their identity. Inspired by the idea of dual learning [He et al., 2016], DualGAN [Yi et al., 2017], DiscoGAN [Kim et al., 2017] and CycleGAN [Zhu et al., 2017] were proposed to tackle the unpaired image translation problem by jointly training two cross-domain translation models. Meanwhile, several works [Choi et al., 2018; Liu et al., 2018] have been proposed for multi-domain image-to-image translation with a single model.
3 Framework
In this section, we introduce our proposed framework built on the multi-path consistency loss. Suppose we have $N$ different image domains $\{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_N\}$ with $N \ge 2$, where a domain can be seen as a collection of images. Generally, the image translation task aims at learning the $N(N-1)$ mappings $f_{i,j}: \mathcal{D}_i \mapsto \mathcal{D}_j$, $i \ne j$; in some cases we are only interested in a subset of these $N(N-1)$ mappings. We first show how to build the translation models between $\mathcal{D}_i$ and $\mathcal{D}_j$ with $\mathcal{D}_k$ as an auxiliary domain, and then present the general framework for multi-domain image translation with the consistency loss. Note that in our framework, $i, j, k \in [N]$ and they are pairwise distinct.
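To make the setting concrete, a minimal sketch of how the $N(N-1)$ mappings could be organized is given below; the dictionary layout and the `nn.Identity` stand-in generators are purely illustrative assumptions, not the architecture used in this paper:

```python
import torch.nn as nn

N = 5  # example number of domains
# One translation network f_{i,j} for every ordered pair of distinct domains;
# nn.Identity is only a placeholder for a real encoder-decoder generator.
generators = {(i, j): nn.Identity() for i in range(N) for j in range(N) if i != j}
assert len(generators) == N * (N - 1)
```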
3.1 Translation between $\mathcal{D}_i$ and $\mathcal{D}_j$ with an auxiliary domain $\mathcal{D}_k$
To effectively obtain the two translation models $f_{i,j}$ and $f_{j,i}$ with an auxiliary domain $\mathcal{D}_k$, we need the following additional components in the system: (1) three discriminators $d_i$, $d_j$ and $d_k$, which classify whether an input image is a natural one or one generated by the machine; mathematically, $d_l: \mathcal{D}_l \mapsto [0,1]$ models the probability that the input image is a natural image of domain $\mathcal{D}_l$, $l \in \{i, j, k\}$; (2) four auxiliary mappings $f_{i,k}$, $f_{k,i}$, $f_{j,k}$ and $f_{k,j}$, which are all related to $\mathcal{D}_k$.
Figure 2: The standard and our proposed frameworks of image-to-image translation, where $\hat{x}^{(j)} = f_{i,j}(x^{(i)})$, $\hat{x}^{(i)} = f_{j,i}(x^{(j)})$, $\hat{x}^{(j,i)} = f_{j,i}(\hat{x}^{(j)})$, $\hat{x}^{(i,j)} = f_{i,j}(\hat{x}^{(i)})$, $\hat{x}^{(k,j)} = f_{k,j}(f_{i,k}(x^{(i)}))$, $\hat{x}^{(k,i)} = f_{k,i}(f_{j,k}(x^{(j)}))$.
Considering that deep learning algorithms usually work iteratively on mini-batches of data instead of the whole training dataset at once, in the remainder of this section we describe how the models are updated on the batches $\mathcal{B}_l \subseteq \mathcal{D}_l$, $l \in \{i, j, k\}$, where $\mathcal{B}_l$ is a mini-batch of data from $\mathcal{D}_l$.
The training loss consists of three parts:
(1) Dual learning loss between $\mathcal{D}_i$ and $\mathcal{D}_j$, which models the reconstruction duality between $f_{i,j}$ and $f_{j,i}$. Mathematically,
$$\ell^{i,j}_{d} = \frac{1}{|\mathcal{B}_i|}\sum_{x^{(i)}\in\mathcal{B}_i}\big\|x^{(i)} - f_{j,i}(f_{i,j}(x^{(i)}))\big\|_1 + \frac{1}{|\mathcal{B}_j|}\sum_{x^{(j)}\in\mathcal{B}_j}\big\|x^{(j)} - f_{i,j}(f_{j,i}(x^{(j)}))\big\|_1, \tag{1}$$
where $|\mathcal{B}_i|$ and $|\mathcal{B}_j|$ are the numbers of images in mini-batches $\mathcal{B}_i$ and $\mathcal{B}_j$.
(2) Multi-path consistency loss with an auxiliary domain $\mathcal{D}_k$, which regularizes the training by leveraging the information provided by the third domain. Mathematically,
$$\ell^{i,j|k}_{c} = \frac{1}{|\mathcal{B}_i|}\sum_{x^{(i)}\in\mathcal{B}_i}\big\|f_{i,j}(x^{(i)}) - f_{k,j}(f_{i,k}(x^{(i)}))\big\|_1 + \frac{1}{|\mathcal{B}_j|}\sum_{x^{(j)}\in\mathcal{B}_j}\big\|f_{j,i}(x^{(j)}) - f_{k,i}(f_{j,k}(x^{(j)}))\big\|_1. \tag{2}$$
(3) GAN loss, which enforces the generated images to be natural enough. Let $\hat{\mathcal{B}}_l$ denote the collection of all generated (fake) images in domain $\mathcal{D}_l$. When $l = k$, $\hat{\mathcal{B}}_l = \{f_{i,k}(x^{(i)}) \mid x^{(i)}\in\mathcal{B}_i\} \cup \{f_{j,k}(x^{(j)}) \mid x^{(j)}\in\mathcal{B}_j\}$. When $l \in \{i, j\}$, $\hat{\mathcal{B}}_l$ is the combination of one-hop and two-hop translations, defined as $\{f_{p,l}(x^{(p)}) \mid x^{(p)}\in\mathcal{B}_p\} \cup \{f_{k,l}(f_{p,k}(x^{(p)})) \mid x^{(p)}\in\mathcal{B}_p\}$ where $p = \{i,j\}\setminus\{l\}$. The GAN loss is defined as follows:
$$\ell^{i,j|k}_{\mathrm{GAN}} = \sum_{l\in\{i,j,k\}}\Big\{\frac{1}{|\mathcal{B}_l|}\sum_{x^{(l)}\in\mathcal{B}_l}\big[\log d_l(x^{(l)})\big] + \frac{1}{|\hat{\mathcal{B}}_l|}\sum_{\hat{x}^{(l)}\in\hat{\mathcal{B}}_l}\big[\log\big(1-d_l(\hat{x}^{(l)})\big)\big]\Big\}. \tag{3}$$
All $f_{\cdot,\cdot}$'s work together to minimize the GAN loss, while all $d_{\cdot}$'s try to enlarge it.
Given the aforementioned three kinds of loss, the overall loss can be defined as follows:
$$\ell^{i,j|k}_{\mathrm{total}}(\mathcal{B}_i,\mathcal{B}_j,\mathcal{B}_k) = \ell^{i,j}_{d} + \ell^{i,j|k}_{c} + \alpha\,\ell^{i,j|k}_{\mathrm{GAN}}, \tag{4}$$
where $\alpha$ is a hyper-parameter balancing the tradeoff between the GAN loss and the other losses. All six generators $f_{\cdot,\cdot}$ work on minimizing Eqn. (4), while the three discriminators $d_{\cdot}$ work on maximizing it.
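To make Eqns. (1)-(4) concrete, below is a minimal PyTorch-style sketch of one alternating update for the pair $(\mathcal{D}_i, \mathcal{D}_j)$ with auxiliary $\mathcal{D}_k$. The dictionary bookkeeping (`G[(p, q)]` for $f_{p,q}$, `D[p]` for $d_p$), the function names, the small `eps` for numerical stability, and the choice of updating the discriminators on detached fakes before the generators are our assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def dual_learning_loss(f_ij, f_ji, b_i, b_j):
    """Eqn. (1): L1 reconstruction error of the two round trips."""
    return F.l1_loss(f_ji(f_ij(b_i)), b_i) + F.l1_loss(f_ij(f_ji(b_j)), b_j)

def consistency_loss(G, b_i, b_j, i, j, k):
    """Eqn. (2): L1 distance between one-hop and two-hop translations."""
    loss_ij = F.l1_loss(G[(i, j)](b_i), G[(k, j)](G[(i, k)](b_i)))  # D_i->D_j vs. D_i->D_k->D_j
    loss_ji = F.l1_loss(G[(j, i)](b_j), G[(k, i)](G[(j, k)](b_j)))  # D_j->D_i vs. D_j->D_k->D_i
    return loss_ij + loss_ji

def gan_value(d_l, real, fake, eps=1e-8):
    """One summand of Eqn. (3): discriminators maximize it, generators minimize it."""
    return torch.log(d_l(real) + eps).mean() + torch.log(1.0 - d_l(fake) + eps).mean()

def train_step(G, D, opt_G, opt_D, b_i, b_j, b_k, i, j, k, alpha=0.1):
    """One alternating min-max update on Eqn. (4)."""
    # Fake pools per domain: one-hop plus two-hop translations (both one-hop maps into D_k).
    fake_i = torch.cat([G[(j, i)](b_j), G[(k, i)](G[(j, k)](b_j))])
    fake_j = torch.cat([G[(i, j)](b_i), G[(k, j)](G[(i, k)](b_i))])
    fake_k = torch.cat([G[(i, k)](b_i), G[(j, k)](b_j)])

    # Discriminators: maximize the GAN term, i.e. minimize its negation (fakes detached).
    d_loss = -alpha * (gan_value(D[i], b_i, fake_i.detach())
                       + gan_value(D[j], b_j, fake_j.detach())
                       + gan_value(D[k], b_k, fake_k.detach()))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generators: minimize dual + consistency + alpha * GAN losses (Eqn. (4)).
    gan_term = (gan_value(D[i], b_i, fake_i) + gan_value(D[j], b_j, fake_j)
                + gan_value(D[k], b_k, fake_k))
    g_loss = (dual_learning_loss(G[(i, j)], G[(j, i)], b_i, b_j)
              + consistency_loss(G, b_i, b_j, i, j, k)
              + alpha * gan_term)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```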
3.2 Multi-domain image translation
For an $N$-domain translation problem with $N \ge 3$, it is too costly to build the consistency loss for every triple of domains at each training iteration. Instead, we randomly select three domains $\mathcal{D}_i$, $\mathcal{D}_j$, $\mathcal{D}_k$ and build the consistency loss as follows:
$$\ell^{i,j,k}_{\mathrm{total}}(\mathcal{B}_i,\mathcal{B}_j,\mathcal{B}_k) = \ell^{i,j|k}_{\mathrm{total}}(\mathcal{B}_i,\mathcal{B}_j,\mathcal{B}_k) + \ell^{i,k|j}_{\mathrm{total}}(\mathcal{B}_i,\mathcal{B}_k,\mathcal{B}_j) + \ell^{j,k|i}_{\mathrm{total}}(\mathcal{B}_j,\mathcal{B}_k,\mathcal{B}_i), \tag{5}$$
where $\ell^{i,j|k}_{\mathrm{total}}$ is defined in Eqn. (4) and the other terms are defined analogously.
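A small sketch of this sampling procedure, reusing `train_step` from the previous sketch; `sample_batch` is a hypothetical data loader, and optimizing the three terms of Eqn. (5) with sequential updates rather than as one summed objective is a simplification:

```python
import random

def multidomain_iteration(domain_ids, sample_batch, G, D, opt_G, opt_D):
    """One iteration of Eqn. (5): pick three distinct domains at random and let
    each of them serve once as the auxiliary domain."""
    i, j, k = random.sample(domain_ids, 3)
    B = {d: sample_batch(d) for d in (i, j, k)}
    train_step(G, D, opt_G, opt_D, B[i], B[j], B[k], i, j, k)   # l^{i,j|k}_total
    train_step(G, D, opt_G, opt_D, B[i], B[k], B[j], i, k, j)   # l^{i,k|j}_total
    train_step(G, D, opt_G, opt_D, B[j], B[k], B[i], j, k, i)   # l^{j,k|i}_total
```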
Discussion
(1) When $N = 2$, we need to find a third domain as an auxiliary domain to help establish consistency. In this case, we can use Eqn. (4) as the training objective, without applying the consistency loss to the third domain. We work on the de-raining and de-noising tasks to verify this case (see Section 5).
(2) We can also build the consistency loss with longer paths, e.g., the translation $\mathcal{D}_1 \to \mathcal{D}_3 \to \cdots \to \mathcal{D}_N \to \mathcal{D}_2$ should be consistent with $\mathcal{D}_1 \to \mathcal{D}_2$. Considering the limitation of computation resources, we leave this study to future work.
3.3 Connection with StarGAN
For an $N$-domain translation, when $N$ is large, it is impractical to learn $N(N-1)$ separate mappings. StarGAN [Choi et al., 2018] is a recently proposed method that uses a single model with different target labels to achieve image translation. With StarGAN, the mapping from $\mathcal{D}_i$ to $\mathcal{D}_j$ is specified as $f(x^{(i)}, c_j)$, where $x^{(i)}\in\mathcal{D}_i$, $f$ is shared among all tasks and $c_j$ is a learnable vector used to identify $\mathcal{D}_j$. All the generators share the same parameters except for the target domain labels.
In terms of the discriminator, StarGAN uses only one network, which not only judges whether an image is natural or fake, but also serves as a classifier that identifies which domain the input belongs to. We also adopt this kind of discriminator when using StarGAN as the basic model architecture. In this case, let $d_{\mathrm{cls}}(l\,|\,x)$ denote the probability that the input $x$ is categorized as an image from domain $l$, $l\in[N]$. Following the notations in Section 3.1, the classification costs $\ell^{i,j|k}_{\mathrm{cls},r}$ of real images and $\ell^{i,j|k}_{\mathrm{cls},f}$ of fake images for StarGAN can be formulated as follows:
$$\ell^{i,j|k}_{\mathrm{cls},r} = \sum_{l\in\{i,j,k\}} -\frac{1}{|\mathcal{B}_l|}\sum_{x^{(l)}\in\mathcal{B}_l}\big[\log d_{\mathrm{cls}}(l\,|\,x^{(l)})\big], \qquad
\ell^{i,j|k}_{\mathrm{cls},f} = \sum_{l\in\{i,j,k\}} -\frac{1}{|\hat{\mathcal{B}}_l|}\sum_{\hat{x}^{(l)}\in\hat{\mathcal{B}}_l}\big[\log d_{\mathrm{cls}}(l\,|\,\hat{x}^{(l)})\big]. \tag{6}$$
When using StarGAN with the aforementioned classification errors, the image generators and discriminators no longer share a common objective function. Therefore, the loss function with multi-path consistency regularization, i.e., Eqn. (4), should be split and re-formulated as follows:
$$\ell^{i,j|k}_{\mathrm{total},G}(\mathcal{B}_i,\mathcal{B}_j,\mathcal{B}_k) = \ell^{i,j}_{d} + \ell^{i,j|k}_{c} + \alpha\,\ell^{i,j|k}_{\mathrm{GAN}} + \beta\,\ell^{i,j|k}_{\mathrm{cls},f}, \qquad
\ell^{i,j|k}_{\mathrm{total},D}(\mathcal{B}_i,\mathcal{B}_j,\mathcal{B}_k) = -\alpha\,\ell^{i,j|k}_{\mathrm{GAN}} + \beta\,\ell^{i,j|k}_{\mathrm{cls},r}, \tag{7}$$
where both $\alpha$ and $\beta$ are hyper-parameters. The generator and discriminator minimize $\ell^{i,j|k}_{\mathrm{total},G}$ and $\ell^{i,j|k}_{\mathrm{total},D}$, respectively. Eqn. (5) should be re-defined accordingly.
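For illustration, a hedged sketch of the classification terms in Eqn. (6) for such a discriminator-classifier; the `d_cls` interface returning unnormalized domain logits and the dictionary-based batches and labels are our assumptions, not the released StarGAN code:

```python
import torch.nn.functional as F

def classification_losses(d_cls, real_batches, fake_batches, labels):
    """Eqn. (6): cross-entropy of the domain classifier on real and fake images.
    `labels[l]` holds the (long) domain index l for every image of the batch."""
    cls_real = sum(F.cross_entropy(d_cls(real_batches[l]), labels[l]) for l in real_batches)
    cls_fake = sum(F.cross_entropy(d_cls(fake_batches[l]), labels[l]) for l in fake_batches)
    return cls_real, cls_fake

# Split objectives of Eqn. (7), with alpha and beta as in the text:
#   generator loss:      l_d + l_c + alpha * l_GAN + beta * cls_fake   (minimized by the generator)
#   discriminator loss: -alpha * l_GAN + beta * cls_real               (minimized by the discriminator)
```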
4 Experiments on multi-domain translation
For multi-domain translation, we carry out two groups of experiments to verify our proposed framework: face-to-face translation with different attributes and paint-to-photo translation with different art styles. We choose StarGAN [Choi et al., 2018], a state-of-the-art algorithm for multi-domain image translation, as our baseline.
4.1 Setting
Datasets. For multi-domain face-to-face translation, we use the CelebA dataset [Liu et al., 2015], which consists of 202,599 face images of celebrities. Following [Choi et al., 2018], we select seven attributes and build seven domains correspondingly. Among these attributes, three represent hair color (black hair, blond hair, brown hair), two represent gender (male and female), and the remaining two represent age (old and young). Note that these seven attributes are not disjoint; that is, a person can both have blond hair and be young.
For multi-domain paint-to-photo translation, we use the paintings and photographs collected by [Zhu et al., 2017], from which we construct five domains: Cezanne, Monet, Ukiyo-e, Vangogh and photographs.
Architecture. For multi-domain translation tasks, we choose StarGAN [Choi et al., 2018] as our basic structure. One reason is that an $N$-domain translation would otherwise require $N(N-1)$ independent models to translate between any two domains, whereas with StarGAN we only need one model. Another reason is that [Choi et al., 2018] report that on face-to-face translation, StarGAN achieves better performance for multi-domain translation than simply using multiple CycleGANs, since multiple tasks are handled within the same model and the common knowledge among different tasks can be shared.
Optimization. We use the Adam optimizer [Kingma and Ba, 2014] with learning rate 0.0001 for the first 10 epochs and linearly decay the learning rate every 10 epochs thereafter. All models are trained on one NVIDIA K40 GPU for one day. The $\alpha$ in Eqn. (4) and Eqn. (7) is set to 0.1, and the $\beta$ in Eqn. (7) is also set to 0.1.
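A rough sketch of this schedule with PyTorch's `LambdaLR`; the exact decay factor, the total number of epochs and the stand-in parameters below are assumptions, as the paper only states the initial learning rate and the 10-epoch decay interval:

```python
import torch

model_params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for the model parameters
optimizer = torch.optim.Adam(model_params, lr=1e-4)

def lr_factor(epoch, warmup=10, total=100):
    """Constant learning rate for the first `warmup` epochs, then a step-wise
    linear decay applied every 10 epochs (assumed to reach zero at `total`)."""
    if epoch < warmup:
        return 1.0
    steps = (epoch - warmup) // 10
    return max(0.0, 1.0 - 10.0 * steps / (total - warmup))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
# per epoch: run the training loop, then call scheduler.step()
```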
Evaluation. We conduct both qualitative and quantitative analyses of the experimental results. For qualitative analysis, we visualize the results of both the baseline and our algorithm and compare their differences. For quantitative analysis, following [Choi et al., 2018], we perform classification experiments on the generated images: we train classifiers on the image-translation training data using the same architecture as the discriminator, resulting in near-perfect accuracies, and then compute the classification error rates of the generated images with these classifiers. The Fréchet Inception Distance (FID) [Heusel et al., 2017], which measures the similarity between the set of generated images and the set of real images, is used to evaluate the quality of the translation results; the lower the FID, the better the translation results. We also carry out a user study on the generated images.
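For the classification-based metric, the error rate can be computed as in the following sketch, assuming `classifier` is the pre-trained attribute classifier returning logits (hypothetical names):

```python
import torch

@torch.no_grad()
def classification_error(classifier, generated_images, target_labels):
    """Fraction of translated images that the pre-trained classifier does not
    assign to the intended target domain."""
    pred = classifier(generated_images).argmax(dim=1)
    return (pred != target_labels).float().mean().item()
```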
Figure 3: Two groups of multi-domain face-to-face translation results. The rows starting with "Baseline" and "Ours" show the results of the baseline and our method, respectively.
Figure 4: Multi-domain paint-to-photo translation results.
4.2 Face-to-face translation
The results of face-to-face translation are shown in Figure 3.
In general, both the baseline and our proposed method can successfully transfer the images to the target domains, but our proposed method outperforms the baseline in several respects:
(1) Our proposed method preserves more information of the input images. Take the translations in the upper part as an example: to translate the input to the black-hair and blond-hair domains, the domain-specific feature [Lin et al., 2018a] to be changed is the hair color, while the other domain-independent features should be kept as much as possible. The baseline algorithm changes the beard of the input image, while our algorithm keeps this feature thanks to the multi-path consistency regularization. Since the consistency regularization requires the one-hop and two-hop translations to be sufficiently similar, and the two-hop translation path is independent of the one-hop path, errors and distortions in the translation results are more likely to be avoided in a probabilistic sense. As a result, our proposed method carries more information of the original image.
(2) After applying multi-path consistency regularization, our algorithm generates clearer images with less noise than the baseline, e.g., the black-hair and blond-hair images in the bottom part. One possible reason is that multi-path consistency pushes the models to generate consistent images; random noise would break the consistency, and our regularization reduces such effects.
For quantitative evaluation, the results are shown in Table 1 and Table 2. Our proposed method achieves lower classification error rates on all three attribute groups, which demonstrates that our generator produces images with more distinctive features of the target domain. For the FID score, our algorithm also achieves a 1.79 improvement, which suggests that the distribution of our translation results is more similar to that of real images.
             Hair Color   Gender      Age
Baseline        19.01%    11.60%   25.52%
Ours            17.08%    10.23%   24.39%
Improvement      1.93%     1.37%    1.13%
Table 1: Classification error rates of face-to-face translation.

Baseline    Ours   Improvement
   20.15   18.36          1.79
Table 2: FID scores of face-to-face translation.
4.3 Paint-to-photo translation
The results of multi-domain paint-to-photo translation are shown in Figure 4. Again, the models trained with the multi-path consistency loss outperform the baseline. For example, our model generates more domain-specific paintings than the baseline, as shown in the generated Vangogh paintings. We also observe that our model effectively reduces the block artifacts in the translation results, such as in the generated Cezanne, Monet and Vangogh paintings. Besides, our model tends to generate images with clearer edges and content: as shown in the upper-left corner of Figure 4, we generate images with distinct edges and content, while the baseline algorithm fails with unclear and messy generations.
Similar to face-to-face translation, we also report the classification errors and FID; the results are in Table 3 and Table 4. Our algorithm achieves significantly better results than the baseline, which demonstrates the effectiveness of our method.
Baseline     Ours   Improvement
  35.52%   30.17%         5.35%
Table 3: Classification error rates of paint-to-photo translation.
Figure 5: Unsupervised de-raining (first two rows) and de-noising (last two rows) results. From left to right, the columns represent the
rainy/noisy input, the original clean image, the results of StarGAN (St) and CycleGAN (Cy) without multi-path consistency loss, and the
corresponding results with our method (St+ours, Cy+ours) respectively.
              Cezanne    Monet   Ukiyo-e   Vangogh   Photograph
Baseline       219.43   199.77    163.46    226.77        79.33
Ours           210.82   170.52    154.29    216.78        64.10
Improvement      8.61    29.25      9.17      9.99        15.23
Table 4: FID scores of paint-to-photo translation.
4.4 User study
We carry out a user study to further evaluate our results. 20 users with diverse educational backgrounds are chosen as reviewers. We randomly select 35 groups of generated images for the face-to-face and paint-to-photo translations, where each group contains the translation results of both the baseline and our algorithm from the same input image to different categories. For each group, reviewers choose the better result without knowing which algorithm generated the images.
The statistics are shown in Table 5. Among the 700 votes for face-to-face translation, 59.14% go to our proposed algorithm; for paint-to-photo translation, 89.85% go to ours. The user study shows that we achieve better performance than the baseline, especially for paint-to-photo translation.
       face-to-face         paint-to-photo
  Baseline      Ours     Baseline      Ours
    40.86%    59.14%       10.14%    89.85%
Table 5: Statistics of user study.
5 Experiments on two-domain translation
In this section, we work on two-domain translations with an auxiliary domain. We choose two different tasks, unsupervised de-raining and unsupervised de-noising, which aim to remove the rain or noise from the input images.
5.1 Setting
Datasets. We use the rainy images and original clean images collected by [Fu et al., 2017; Yang et al., 2017]. For unsupervised translation, we randomly shuffle the rainy images and the clean ones to obtain an unaligned dataset. For the unaligned de-noising dataset, we add uniform noise to each clean image and then shuffle the images. For the de-raining and de-noising experiments, we choose the noisy-image domain and the rainy-image domain as the auxiliary domains, respectively.
Architecture. We first choose StarGAN as the basic network architecture; the model architecture is the same as that used in Section 4.1. In addition, to verify the generality of our framework, we also apply CycleGAN [Zhu et al., 2017] to this task. To combine CycleGAN with our framework, a total of six generators and three discriminators are implemented. We follow Section 3.1 to jointly optimize the de-raining and rain-adding networks for the de-raining task, with the consistency loss built upon the domain of random-noise images. A similar procedure is applied to the de-noising task.
Evaluation. For qualitative analysis, we again compare the images generated by the baseline and our algorithm. For quantitative analysis, in addition to the classification errors, we compute the Peak Signal-to-Noise Ratio (briefly, PSNR) between the generated images and the original clean images. The larger the PSNR, the better the restoration quality.
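For reference, the PSNR between a restored image and its clean original can be computed as in the sketch below (assuming 8-bit images with a peak value of 255):

```python
import numpy as np

def psnr(restored, reference, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means better restoration."""
    mse = np.mean((restored.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```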
5.2 Results
The unsupervised de-raining and de-noising results are shown in Figure 5. On both tasks, our proposed method improves both StarGAN and CycleGAN and generates cleaner images with fewer block artifacts, smoother colors and clearer facial expressions. We also find that on the de-raining and de-noising tasks, CycleGAN outperforms StarGAN and generates images with less rain and noise. One reason is that, unlike face-to-face translation, whose domain-independent features are centralized and easy to capture, natural scenes are usually diverse and complex, which a single StarGAN may not have enough capacity to model. In comparison, CycleGAN handles two-direction translation only, so it has enough capacity to model and de-rain/de-noise the images.
We report the classification error rates and PSNR (dB) of de-raining and de-noising in Table 6. The classification error rates of StarGAN and CycleGAN without the multi-path consistency loss are 2.91% and 1.93% respectively, while after applying the multi-path consistency loss the numbers are 1.70% and 1.65%, which shows the effectiveness of our method. In terms of PSNR, as shown in Table 6, our method achieves higher scores, which means that our models have better restoration ability. That is, our framework still works in the two-domain translation setting. We also plot the PSNR curves of the StarGAN-based models on the test set with respect to training steps; the results are shown in Figure 6. On both tasks, training with multi-path consistency regularization consistently achieves higher PSNR than the corresponding baseline, showing that our proposed method achieves not only higher PSNR values but also faster convergence.
Method      Classification Error   r→c (dB)   n→c (dB)
StarGAN                    2.91%      19.43      20.39
CycleGAN                   1.93%      20.87      21.99
Ours (St)                  1.70%      21.13      23.25
Ours (Cy)                  1.65%      21.21      23.28
Table 6: Classification error rates and PSNR (dB) of de-raining and de-noising translation results.
Figure 6: PSNR (dB) on the test set w.r.t. training steps for the StarGAN-based models: Baseline (r2c), Baseline (n2c), Ours (r2c), Ours (n2c).
6 Conclusion and future work
In this paper, we propose a new kind of loss, the multi-path consistency loss, which leverages the information of multiple domains to regularize training. We provide an effective way to optimize such a framework in multi-domain translation settings. Qualitative and quantitative results on multiple tasks demonstrate the effectiveness of our method. For future work, it is worth studying what happens when more than three paths are included. In addition, we will generalize multi-path consistency by using stochastic latent variables as the auxiliary domain.
7 Acknowledgement
This work was supported in part by NSFC under Grants 61571413 and 61632001.
References
[Arjovsky et al., 2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[Choi et al., 2018] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[Donahue et al., 2016] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[Dumoulin et al., 2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[Fu et al., 2017] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1715–1723, 2017.
[Gatys et al., 2016] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[Girshick, 2015] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[He et al., 2016] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
[Heusel et al., 2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
[Huang et al., 2017] Xun Huang, Yixuan Li, Omid Poursaeed, John E Hopcroft, and Serge J Belongie. Stacked generative adversarial networks. In CVPR, volume 2, page 3, 2017.
[Isola et al., 2017] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, July 2017.
[Kim et al., 2017] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1857–1865, 2017.
[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Ledig et al., 2017] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114, July 2017.
[Lin et al., 2018a] Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Conditional image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[Lin et al., 2018b] Jianxin Lin, Tiankuang Zhou, and Zhibo Chen. Multi-scale face restoration with sequential gating ensemble network. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 7122–7129, 2018.
[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[Liu et al., 2018] Alexander H Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-Chiang Frank Wang. A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems, pages 2590–2599, 2018.
[Mao et al., 2016] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems 29, pages 2802–2810, 2016.
[Radford et al., 2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[Taigman et al., 2016] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[Yang et al., 2017] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1357–1366, 2017.
[Yi et al., 2017] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision (ICCV), October 2017.
[Zhu et al., 2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), October 2017.