Conditional Image-to-Image Translation

Jianxin Lin¹, Yingce Xia¹, Tao Qin², Zhibo Chen¹, Tie-Yan Liu²
¹University of Science and Technology of China  ²Microsoft Research Asia
linjx@mail.ustc.edu.cn  yingce.xia@gmail.com  chenzhibo@ustc.edu.cn
{taoqin, tie-yan.liu}@microsoft.com
Abstract
Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs) and dual learning. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity in the sense that a fixed input image usually leads to an (almost) deterministic translation result. In this paper, we study a new problem, conditional image-to-image translation, which is to translate an image from the source domain to the target domain conditioned on a given image in the target domain. It requires that the generated image inherit some domain-specific features of the conditional image from the target domain. Therefore, changing the conditional image in the target domain leads to diverse translation results for a fixed input image from the source domain, and the conditional input image thus helps to control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one from domain A to domain B, and the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on translation between men's faces and women's faces, and on edges-to-shoes&bags translations. The results demonstrate the effectiveness of our proposed method.
1. Introduction
Image-to-image translation covers a large variety of computer vision problems, including image stylization [4], segmentation [13] and saliency detection [5]. It aims at learning a mapping that can convert an image from a source domain to a target domain, while preserving the main presentation of the input image. For example, in the aforementioned three tasks, an input image might be converted to a portrait similar to Van Gogh's style, a heat map split into different regions, or a pencil sketch, while the edges and outlines remain unchanged. Since it is usually hard to collect a large amount of parallel data for such tasks, unsupervised learning algorithms have been widely adopted. In particular, generative adversarial networks (GANs) [6] and dual learning [7, 21] have been extensively studied for image-to-image translation. [22, 9, 25] tackle image-to-image translation with these two techniques, where the GANs are used to ensure that the generated images belong to the target domain, and dual learning helps improve image quality by minimizing reconstruction loss.
An implicit assumption of image-to-image translation is that an image contains two kinds of features¹: domain-independent features, which are preserved during the translation (e.g., the edges of the face, eyes, nose and mouth when translating a man's face to a woman's face), and domain-specific features, which are changed during the translation (e.g., the color and style of the hair in face image translation). Image-to-image translation aims at transferring images from the source domain to the target domain by preserving domain-independent features while replacing domain-specific features.
While it is not difficult for existing image-to-image translation methods to convert an image from a source domain to a target domain, it is not easy for them to control or manipulate the fine-grained style of the generated image in the target domain. Consider the gender transformation problem studied in [9], which is to translate a man's photo to a woman's. Can we translate Hillary's photo to a man's photo with the hair style and color of Trump? DiscoGAN [9] can indeed output a woman's photo given a man's photo as input, but it cannot control the hair style or color of the output image. DualGAN [22, 25] cannot implement this kind of fine-grained control either. To fill this gap in image translation, we propose the concept of conditional image-to-image translation, which can specify domain-specific features in the target domain, carried by another input image from the target domain. An example of conditional image-to-image translation is shown in Figure 1, in which we want to convert Hillary's photo to a man's photo. As shown in the figure, with an additional man's photo as input, we can control the translated image (e.g., the hair color and style).

¹Note that the two kinds of features are relative concepts, and domain-specific features in one task might be domain-independent features in another, depending on which domains one focuses on in the task.

Figure 1. Conditional image-to-image translation. (a) Conditional women-to-men photo translation. (b) Conditional edges-to-handbags translation. The purple arrow represents the translation flow and the green arrow represents the conditional information flow.
1.1. Problem Setup
We first define some notations. Suppose there are two image domains $\mathcal{D}_A$ and $\mathcal{D}_B$. Following the implicit assumption, an image $x_A \in \mathcal{D}_A$ can be represented as $x_A = x_A^i \oplus x_A^s$, where $x_A^i$ denotes the domain-independent features, $x_A^s$ denotes the domain-specific features, and $\oplus$ is the operator that merges the two kinds of features into a complete image. Similarly, for an image $x_B \in \mathcal{D}_B$, we have $x_B = x_B^i \oplus x_B^s$. Take the images in Figure 1 as examples: (1) If the two domains are men's and women's photos, the domain-independent features are the individual facial organs like eyes and mouths, and the domain-specific features are beard and hair style. (2) If the two domains are real bags and the edges of bags, the domain-independent features are exactly the edges of the bags themselves, and the domain-specific features are the colors and textures.

The problem of conditional image-to-image translation from domain $\mathcal{D}_A$ to $\mathcal{D}_B$ is as follows: taking an image $x_A \in \mathcal{D}_A$ as input and an image $x_B \in \mathcal{D}_B$ as conditional input, output an image $x_{AB}$ in domain $\mathcal{D}_B$ that keeps the domain-independent features of $x_A$ and combines the domain-specific features carried in $x_B$, i.e.,

$$x_{AB} = G_{A \to B}(x_A, x_B) = x_A^i \oplus x_B^s, \quad (1)$$

where $G_{A \to B}$ denotes the translation function. Similarly, we have the reverse conditional translation

$$x_{BA} = G_{B \to A}(x_B, x_A) = x_B^i \oplus x_A^s. \quad (2)$$

For simplicity, we call $G_{A \to B}$ the forward translation and $G_{B \to A}$ the reverse translation. In this work we study how to learn these two translations.
1.2. Our Results
There are three main challenges in solving the conditional image translation problem. The first is how to extract the domain-independent and domain-specific features of a given image. The second is how to merge the features from two different domains into a natural image in the target domain. The third is that there is no parallel data from which to learn such mappings.

To tackle these challenges, we propose the conditional dual-GAN (briefly, cd-GAN), which can leverage the strengths of both GANs and dual learning. Under this framework, the mappings of the two directions, $G_{A \to B}$ and $G_{B \to A}$, are jointly learned. The model of cd-GAN follows an encoder-decoder framework: the encoder is used to extract the domain-independent and domain-specific features, and the decoder merges the two kinds of features to generate images. We choose GANs and dual learning for the following reasons: (1) The dual learning framework can help learn to extract and merge the domain-specific and domain-independent features by minimizing carefully designed reconstruction errors, including reconstruction errors of the whole image, of the domain-independent features, and of the domain-specific features. (2) GANs can ensure that the generated images well mimic natural images in the target domain. (3) Both dual learning [7, 22, 25] and GANs [6, 19, 1] work well under unsupervised settings.
We carry out experiments on different tasks, including face-to-face translation, edges-to-shoes translation, and edges-to-handbags translation. The results demonstrate that our network can effectively translate images with conditional information and is robust across various applications.

Our main contributions are twofold: (1) We define a new problem, conditional image-to-image translation, which is a more general framework than conventional image translation. (2) We propose the cd-GAN algorithm to solve the problem in an end-to-end way.

The remaining parts are organized as follows. We introduce related work in Section 2 and present the details of cd-GAN in Section 3, including the network architecture and the training algorithm. We then report experimental results in Section 4 and conclude in Section 5.
2. Related Work
Image generation has been widely explored in recent years. Models based on the variational autoencoder (VAE) [11] aim to improve the quality and efficiency of image generation by learning an inference network. GANs [6] were first proposed to generate images from random variables via a two-player minimax game. Researchers have exploited the capability of GANs for various image generation tasks. [1] proposed to synthesize images at multiple resolutions with a Laplacian pyramid of adversarial generators and discriminators, and can condition on class labels for controllable generation. [19] introduced a class of deep convolutional generative adversarial networks (DCGANs) for high-quality image generation and unsupervised image classification tasks.

Instead of learning to generate image samples from scratch (i.e., from random vectors), the basic idea of image-to-image translation is to learn a parametric translation function that transforms an input image in a source domain into an image in a target domain. [13] proposed a fully convolutional network (FCN) for image-to-segmentation translation. Pix2pix [8] extended the basic FCN framework to other image-to-image translation tasks, including label-to-street-scene and aerial-to-map. Meanwhile, pix2pix utilized adversarial training to ensure high-level domain similarity of the translation results.
The image-to-image models mentioned above require paired training data between the source and target domains. There is another line of work studying unpaired domain translation. Based on adversarial training, [3] and [2] proposed algorithms to jointly learn to map a latent space to the data space and to project the data space back to the latent space. [20] presented a domain transfer network (DTN) for unsupervised cross-domain image generation employing a compound loss function including a multiclass adversarial loss and an f-constancy component, which could generate convincing novel images of previously unseen entities and preserve their identity. [7] developed a dual learning mechanism that enables a neural machine translation system to automatically learn from unlabeled data through a dual learning game. Following the idea of dual learning, DualGAN [22], DiscoGAN [9] and CycleGAN [25] were proposed to tackle the unpaired image translation problem by training two cross-domain transfer GANs at the same time. [15] proposed to utilize dual learning for semantic image segmentation. [14] further proposed a conditional CycleGAN for face super-resolution by adding facial attributes obtained from human annotation. However, collecting a large amount of such human-annotated data can be hard and expensive.
In this work, we study a new setting of image-to-image translation, in which we hope to control the generated images in fine granularity with unpaired data. We call such a new problem conditional image-to-image translation.
3. Conditional Dual GAN
Figure 2 shows the overall architecture of the proposed
model, in which the left part is an encoder-decoder based
framework for image translation and the right part includes
additional components introduced to train the encoder and
decoder.
3.1. The Encoder-Decoder Framework
As shown in the figure, there are two encoders, $e_A$ and $e_B$, and two decoders, $g_A$ and $g_B$.

The encoders serve as feature extractors, which take an image as input and output the two kinds of features, domain-independent features and domain-specific features, through the corresponding modules in the encoders. In particular, given two images $x_A$ and $x_B$, we have

$$(x_A^i, x_A^s) = e_A(x_A); \quad (x_B^i, x_B^s) = e_B(x_B). \quad (3)$$

If we only look at the encoder, there is no difference between the two kinds of features. It is the remaining parts of the overall model and the training process that differentiate the two kinds of features. More details are discussed in Section 3.3.

The decoders serve as generators, which take as inputs the domain-independent features from the image in the source domain and the domain-specific features from the image in the target domain, and output a generated image in the target domain. That is,

$$x_{AB} = g_B(x_A^i, x_B^s); \quad x_{BA} = g_A(x_B^i, x_A^s). \quad (4)$$
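To make the interface of Eqns. (3)-(4) concrete, below is a minimal PyTorch sketch of one encoder/decoder pair for 64×64 inputs. The layer counts, filter sizes and variable names are illustrative assumptions, not the authors' exact architecture (which is described in Section 4.1 and the supplementary document).

```python
# Illustrative sketch only: an encoder that splits an image into
# (domain-independent, domain-specific) features and a decoder that merges them.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_ch=3, di_ch=64, ds_dim=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),    # 64 -> 32
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),      # 32 -> 16
        )
        self.di_branch = nn.Conv2d(128, di_ch, 4, 2, 1)          # spatial feature map
        self.ds_branch = nn.Sequential(                          # feature vector
            nn.Flatten(), nn.Linear(128 * 16 * 16, ds_dim))

    def forward(self, x):
        h = self.shared(x)
        return self.di_branch(h), self.ds_branch(h)              # (x^i, x^s)

class Decoder(nn.Module):
    def __init__(self, di_ch=64, ds_dim=128, out_ch=3):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(di_ch + ds_dim, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, x_di, x_ds):
        # Broadcast the domain-specific vector over the spatial grid, then merge.
        b, _, h, w = x_di.shape
        ds_map = x_ds.view(b, -1, 1, 1).expand(b, x_ds.size(1), h, w)
        return self.deconv(torch.cat([x_di, ds_map], dim=1))

e_A, e_B, g_A, g_B = Encoder(), Encoder(), Decoder(), Decoder()
x_A, x_B = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
xi_A, xs_A = e_A(x_A)                 # Eqn. (3)
xi_B, xs_B = e_B(x_B)
x_AB = g_B(xi_A, xs_B)                # Eqn. (4): forward translation
x_BA = g_A(xi_B, xs_A)                # Eqn. (4): reverse translation
```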
3.2. Training Algorithm
We leverage dual learning and GAN techniques to train the encoders and decoders. The optimization process is shown in the right part of Figure 2.
3.2.1 GAN loss
To ensure that the generated $x_{AB}$ and $x_{BA}$ are in the corresponding domains, we employ two discriminators, $d_A$ and $d_B$, to differentiate real images from synthetic ones. $d_A$ (or $d_B$) takes an image as input and outputs a probability indicating how likely the input is a natural image from domain $\mathcal{D}_A$ (or $\mathcal{D}_B$). The objective function is

$$\ell_{GAN} = \log(d_A(x_A)) + \log(1 - d_A(x_{BA})) + \log(d_B(x_B)) + \log(1 - d_B(x_{AB})). \quad (5)$$

The goal of the encoders and decoders $e_A$, $e_B$, $g_A$, $g_B$ is to generate images as similar as possible to natural images and fool the discriminators $d_A$ and $d_B$, i.e., they try to minimize $\ell_{GAN}$. The goal of $d_A$ and $d_B$ is to differentiate generated images from natural images, i.e., they try to maximize $\ell_{GAN}$.
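The sketch below shows how the adversarial objective of Eqn. (5) could be computed in PyTorch, assuming discriminators that output probabilities. Splitting the objective into a discriminator term and a non-saturating generator term is a common practical choice and not necessarily the authors' exact implementation.

```python
# Illustrative sketch only: GAN losses for the minimax game in Eqn. (5).
import torch
import torch.nn.functional as F

def gan_loss_discriminator(d_A, d_B, x_A, x_B, x_AB, x_BA):
    # Discriminators maximize log d(real) + log(1 - d(fake)); equivalently,
    # minimize binary cross-entropy with target 1 for real and 0 for fake.
    real = lambda p: F.binary_cross_entropy(p, torch.ones_like(p))
    fake = lambda p: F.binary_cross_entropy(p, torch.zeros_like(p))
    loss_A = real(d_A(x_A)) + fake(d_A(x_BA.detach()))
    loss_B = real(d_B(x_B)) + fake(d_B(x_AB.detach()))
    return loss_A + loss_B

def gan_loss_generator(d_A, d_B, x_AB, x_BA):
    # Encoders/decoders try to fool the discriminators, pushing d(fake) toward 1.
    real = lambda p: F.binary_cross_entropy(p, torch.ones_like(p))
    return real(d_A(x_BA)) + real(d_B(x_AB))
```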
3.2.2 Dual learning loss
The key idea of dual learning is to improve the performance
of a model by minimizing the reconstruction error.
To reconstruct the two images $\hat{x}_A$ and $\hat{x}_B$, as shown in Figure 2, we first extract the two kinds of features of the generated images:

$$(\hat{x}_A^i, \hat{x}_B^s) = e_B(x_{AB}); \quad (\hat{x}_B^i, \hat{x}_A^s) = e_A(x_{BA}), \quad (6)$$

and then reconstruct the images as follows:

$$\hat{x}_A = g_A(\hat{x}_A^i, x_A^s); \quad \hat{x}_B = g_B(\hat{x}_B^i, x_B^s). \quad (7)$$

Figure 2. Architecture of the proposed conditional dual GAN (cd-GAN).
We evaluate the reconstruction quality from three aspects: the image-level reconstruction error $\ell^{im}_{dual}$, the reconstruction error $\ell^{di}_{dual}$ of the domain-independent features, and the reconstruction error $\ell^{ds}_{dual}$ of the domain-specific features, defined as follows:

$$\ell^{im}_{dual}(x_A, x_B) = \|x_A - \hat{x}_A\|_2 + \|x_B - \hat{x}_B\|_2, \quad (8)$$

$$\ell^{di}_{dual}(x_A, x_B) = \|x_A^i - \hat{x}_A^i\|_2 + \|x_B^i - \hat{x}_B^i\|_2, \quad (9)$$

$$\ell^{ds}_{dual}(x_A, x_B) = \|x_A^s - \hat{x}_A^s\|_2 + \|x_B^s - \hat{x}_B^s\|_2. \quad (10)$$

Compared with existing dual learning approaches [22], which only consider the image-level reconstruction error, our method considers more aspects and is therefore expected to achieve better accuracy.
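A minimal sketch of the three reconstruction errors in Eqns. (8)-(10), assuming PyTorch tensors for the original and reconstructed images and features; the helper name and the per-sample L2 norm are illustrative assumptions.

```python
# Illustrative sketch only: dual-learning reconstruction losses of Eqns. (8)-(10).
import torch

def dual_losses(x_A, x_B, x_A_rec, x_B_rec,
                xi_A, xi_B, xi_A_rec, xi_B_rec,
                xs_A, xs_B, xs_A_rec, xs_B_rec):
    # Mean per-sample L2 distance between original and reconstructed tensors.
    l2 = lambda a, b: torch.norm((a - b).flatten(1), dim=1).mean()
    loss_im = l2(x_A, x_A_rec) + l2(x_B, x_B_rec)        # Eqn. (8): whole image
    loss_di = l2(xi_A, xi_A_rec) + l2(xi_B, xi_B_rec)    # Eqn. (9): domain-independent
    loss_ds = l2(xs_A, xs_A_rec) + l2(xs_B, xs_B_rec)    # Eqn. (10): domain-specific
    return loss_im, loss_di, loss_ds
```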
3.2.3 Overall training process
Since the discriminators only impact the GAN loss $\ell_{GAN}$, we use only this loss to compute the gradients and update $d_A$ and $d_B$. In contrast, the encoders and decoders impact all four losses (i.e., the GAN loss and the three reconstruction errors), so we use all four objectives to compute gradients and update their models. Note that since the four objectives are of different magnitudes, their gradients may vary a lot in magnitude. To smooth the training process, we normalize the gradients so that their magnitudes are comparable across the four losses. We summarize the training process in Algorithm 1.
Algorithm 1: cd-GAN training process

Require: Training images $\{x_{A,i}\}_{i=1}^m \subset \mathcal{D}_A$, $\{x_{B,j}\}_{j=1}^m \subset \mathcal{D}_B$, batch size $K$, optimizer $Opt(\cdot,\cdot)$.
1: Randomly initialize $e_A$, $e_B$, $g_A$, $g_B$, $d_A$ and $d_B$.
2: Randomly sample a minibatch of images and prepare the data pairs $S = \{(x_{A,k}, x_{B,k})\}_{k=1}^K$.
3: For each data pair $(x_{A,k}, x_{B,k}) \in S$, generate conditional translations by Eqns. (3)-(4), and reconstruct the images by Eqns. (6)-(7).
4: Update the discriminators as follows:
   $d_A \leftarrow Opt\big(d_A, (1/K)\nabla_{d_A}\sum_{k=1}^K \ell_{GAN}(x_{A,k}, x_{B,k})\big)$,
   $d_B \leftarrow Opt\big(d_B, (1/K)\nabla_{d_B}\sum_{k=1}^K \ell_{GAN}(x_{A,k}, x_{B,k})\big)$.
5: For each $\Theta \in \{e_A, e_B, g_A, g_B\}$, compute the gradients
   $\Delta_{GAN} = (1/K)\nabla_{\Theta}\sum_{k=1}^K \ell_{GAN}(x_{A,k}, x_{B,k})$,
   $\Delta_{im} = (1/K)\nabla_{\Theta}\sum_{k=1}^K \ell^{im}_{dual}(x_{A,k}, x_{B,k})$,
   $\Delta_{di} = (1/K)\nabla_{\Theta}\sum_{k=1}^K \ell^{di}_{dual}(x_{A,k}, x_{B,k})$,
   $\Delta_{ds} = (1/K)\nabla_{\Theta}\sum_{k=1}^K \ell^{ds}_{dual}(x_{A,k}, x_{B,k})$;
   normalize the four gradients so that their magnitudes are comparable, sum them to obtain $\Delta$, and update $\Theta \leftarrow Opt(\Theta, \Delta)$.
6: Repeat steps 2 to 5 until convergence.
In Algorithm 1, the choice of the optimizer $Opt(\cdot,\cdot)$ is quite flexible; its two inputs are the parameters to be optimized and the corresponding gradients. One can choose different optimizers (e.g., Adam [10] or Nesterov gradient descent [18]) for different tasks, depending on common practice for specific tasks and personal preference. Besides, $e_A$, $e_B$, $g_A$, $g_B$, $d_A$, $d_B$ may refer either to the models themselves or to their parameters, depending on the context.
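The paper states that the four gradients are normalized to comparable magnitudes but does not spell out the rescaling rule. The sketch below is one plausible reading, assuming PyTorch: each objective's gradient over the encoder/decoder parameters is rescaled to unit global norm before summation. The losses are assumed to be set up so that lower is better for the encoders and decoders (e.g., the non-saturating generator loss from the earlier sketch).

```python
# Illustrative sketch only: one encoder/decoder update with per-objective
# gradient normalization, as in step 5 of Algorithm 1 (rescaling rule assumed).
import torch

def update_encoders_decoders(params, losses, optimizer):
    """params: list of parameters of e_A, e_B, g_A, g_B.
    losses: [l_gan, l_im, l_di, l_ds] computed on the current minibatch."""
    per_loss_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True,
                                    allow_unused=True)
        grads = [g if g is not None else torch.zeros_like(p)
                 for g, p in zip(grads, params)]
        # Rescale this objective's gradient to unit global norm so the four
        # objectives contribute with comparable magnitudes.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        per_loss_grads.append([g / total_norm for g in grads])

    optimizer.zero_grad()
    for i, p in enumerate(params):
        p.grad = sum(grads[i] for grads in per_loss_grads)  # summed, normalized
    optimizer.step()
```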
3.3. Discussions
Our proposed framework can learn to separate the domain-independent features and the domain-specific features. In Figure 2, consider the path $x_A \to e_A \to x_A^i \to g_B \to x_{AB}$. Note that after training we ensure that $x_{AB}$ is an image in domain $\mathcal{D}_B$ and that the features $x_A^i$ are still preserved in $x_{AB}$. Thus, $x_A^i$ should inherit the features that are independent of domain $\mathcal{D}_A$. Given that $x_A^i$ is domain-independent, it is $x_B^s$ that carries information about domain $\mathcal{D}_B$; thus, $x_B^s$ constitutes domain-specific features. Similarly, we can see that $x_A^s$ is domain-specific and $x_B^i$ is domain-independent.
DualGAN [22], DiscoGAN [9] and CycleGAN [25] can be treated as simplified versions of our cd-GAN, obtained by removing the domain-specific features. For example, in CycleGAN, given an $x_A \in \mathcal{D}_A$, any $x_{AB} \in \mathcal{D}_B$ is a legal translation, no matter what $x_B \in \mathcal{D}_B$ is. In our work, we require that the generated images match the inputs from both domains, which is more difficult.
Furthermore, cd-GAN works for both symmetric translations and asymmetric translations. In symmetric translations, both directions of translation need conditional inputs (illustrated in Figure 1(a)). In asymmetric translations, only one direction of translation needs a conditional image as input (illustrated in Figure 1(b)). That is, the translation from bag to edge does not need another edge image as input; even given an additional edge image as the conditional input, it does not change or help to control the translation result.
For asymmetric translations, we only need to slightly modify the objectives for cd-GAN training. Suppose the translation direction $G_{B \to A}$ does not need conditional input. Then we do not need to reconstruct the domain-specific features $x_A^s$. Accordingly, we modify the error of the domain-specific features as follows, while the other three losses do not change:

$$\ell^{ds}_{dual}(x_A, x_B) = \|x_B^s - \hat{x}_B^s\|_2. \quad (11)$$
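Under the same assumed helpers as in the Section 3.2.2 sketch, the asymmetric variant simply drops the $x_A^s$ reconstruction term, as in Eqn. (11):

```python
# Illustrative sketch only: asymmetric dual losses, Eqn. (11) replacing Eqn. (10).
import torch

def dual_losses_asymmetric(x_A, x_B, x_A_rec, x_B_rec,
                           xi_A, xi_B, xi_A_rec, xi_B_rec,
                           xs_B, xs_B_rec):
    l2 = lambda a, b: torch.norm((a - b).flatten(1), dim=1).mean()
    loss_im = l2(x_A, x_A_rec) + l2(x_B, x_B_rec)        # Eqn. (8)
    loss_di = l2(xi_A, xi_A_rec) + l2(xi_B, xi_B_rec)    # Eqn. (9)
    loss_ds = l2(xs_B, xs_B_rec)                         # Eqn. (11): x_A^s term dropped
    return loss_im, loss_di, loss_ds
```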
4. Experiments
We conduct a set of experiments to test the proposed model. We first describe the experimental settings, and then report results for both symmetric and asymmetric translations. Finally, we study individual components and loss functions of the proposed model.
4.1. Settings
For all experiments, the networks take images of 64×64 resolution as inputs. The encoders $e_A$ and $e_B$ start with 3 convolutional layers, each followed by leaky rectified linear units (Leaky ReLU) [16]. Then the network is split into two branches: in one branch, a convolutional layer is attached to extract domain-independent features; in the other branch, two fully-connected layers are attached to extract domain-specific features. The decoder networks $g_A$ and $g_B$ contain 4 deconvolutional layers with ReLU units [17], except for the last layer, which uses the tanh activation function. The discriminators $d_A$ and $d_B$ consist of 4 convolutional layers and two fully-connected layers. Each layer is followed by Leaky ReLU units, except for the last layer, which uses the sigmoid activation function. Details (e.g., the number and size of filters, and the number of nodes in the fully-connected layers) can be found in the supplementary document.

We use Adam [10] as the optimization algorithm with learning rate 0.0002. Batch normalization is applied to all convolutional and deconvolutional layers except for the first and last ones. The minibatch size is fixed at 200 for all tasks.
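For concreteness, a discriminator consistent with this description might look as follows in PyTorch; the filter counts and hidden size are assumptions, since the exact values are deferred to the supplementary document.

```python
# Illustrative sketch only: a 64x64 discriminator with 4 convolutional layers,
# two fully-connected layers, Leaky ReLU activations and a final sigmoid.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch=3, base=64, hidden=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2),         # 64 -> 32
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2),      # 32 -> 16
            nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.LeakyReLU(0.2),  # 16 -> 8
            nn.Conv2d(base * 4, base * 8, 4, 2, 1), nn.LeakyReLU(0.2),  # 8 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(base * 8 * 4 * 4, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # probability of being a real image
        )

    def forward(self, x):
        return self.classifier(self.features(x))

d_A = Discriminator()
prob = d_A(torch.randn(1, 3, 64, 64))  # shape (1, 1), value in (0, 1)
```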
We implement three related baselines for comparison.

1. DualGAN [22, 9, 25]. DualGAN was originally proposed for unconditional image-to-image translation, which does not require conditional input. Similar to our cd-GAN, DualGAN trains two translation models jointly.

2. DualGAN-c. To enable DualGAN to utilize conditional input, we design a network called DualGAN-c. The main difference between DualGAN and DualGAN-c is that DualGAN-c produces the target outputs as in Eqns. (3)-(4), and reconstructs the inputs as $\hat{x}_A = g_A(e_B(x_{AB}))$ and $\hat{x}_B = g_B(e_A(x_{BA}))$.

3. GAN-c. To verify the effectiveness of dual learning, we remove the dual learning losses of cd-GAN during training and obtain GAN-c.
For symmetric translations, we carry out experiments on men-to-women face translation. We use the CelebA dataset [12], which consists of 84,434 men's images (denoted as domain $\mathcal{D}_A$) and 118,165 women's images (denoted as domain $\mathcal{D}_B$). We randomly choose 4,732 men's images and 6,379 women's images for testing, and use the rest for training. In this task, the domain-independent features are the facial organs (e.g., eyes, nose, mouth) and the domain-specific features are hair style, beard, and the use of lipstick. For asymmetric translations, we work on edges-to-shoes and edges-to-bags translations with the datasets used in [23] and [24], respectively. In these two tasks, the domain-independent features are the edges and the domain-specific features are the colors, textures, etc.

Figure 3. Conditional face-to-face translation. (a) Results of conditional men→women translation. (b) Results of conditional women→men translation.
4.2. Results
The translation results of face-to-face, edges-to-bags and edges-to-shoes translation are shown in Figures 3-5, respectively.

For men-to-women translations, from Figure 3(a), we make several observations. (1) DualGAN can indeed generate women's photos, but its results are purely based on the men's photos, since it does not take the conditional images as inputs. (2) Although it takes the conditional image as input, DualGAN-c fails to integrate the information (e.g., style) from the conditional input into its translation output. (3) For GAN-c, sometimes its translation result is not relevant to the original source-domain input, e.g., the 4th row of Figure 3(a). This is because in training it is only required to generate a target-domain image; its output is not required to be similar (in certain aspects) to the original input. (4) cd-GAN works best among all the models by preserving the domain-independent features from the source-domain input and combining the domain-specific features from the target-domain conditional input. Here are two examples. (1) In the 6th column of the 1st row, the woman wears red lipstick. (2) In the 6th column of the 5th row, the hair style of the generated image is the most similar to the conditional input.

We obtain similar observations for women-to-men translations, as shown in Figure 3(b), especially for the domain-specific features such as hair style and beard.

Figure 4. Results of conditional edges→handbags translation.

Figure 5. Results of conditional edges→shoes translation.

From Figures 4 and 5, we find that cd-GAN can well leverage the domain-specific information carried in the conditional inputs and control the generated target-domain images accordingly. DualGAN, DualGAN-c and GAN-c do not effectively utilize the conditional inputs.
One important characteristic of a conditional image-to-image translation model is that it can generate diverse target-domain images for a fixed source-domain image, as long as different target-domain images are provided as conditional inputs. To verify this ability of cd-GAN, we conduct two experiments: (1) for each woman's photo, we perform women-to-men translation with different men's photos as conditional inputs; (2) for each edge map of a bag, we perform edges-to-bags translation with different bags as conditional inputs. The results are shown in Figure 6. Figure 6(b) shows that cd-GAN can fill the edges with the colors and textures provided by the conditional inputs. Besides, cd-GAN also achieves reasonable results on most face translations: the domain-independent features like the woman's facial outline, orientation and expression are preserved, while the women-specific features like hair style and the use of lipstick are replaced with the men's. An example is the second row of Figure 6(a), where the pointed chin, serious expression and forward gaze are preserved in the generated images. The hair styles (bald vs. short hair) and the beard (no beard vs. short beard) follow the corresponding men's images. Similar translations can also be found for the other images.

Figure 6. Our cd-GAN model can produce diverse results with different conditional images. (a) Results of women→men translation with two different men's images as conditional inputs. (b) Results of edges→handbags translation with two different handbags as conditional inputs.
Note that there are several failure cases in face translation, such as the first column of Figure 6(a) and the last column of Figure 6(b). Most translated results demonstrate the effectiveness of our model. More examples can be found in our supplementary document.

Figure 7. Results produced by different connections and losses of cd-GANs.
4.3. Component Study
In this subsection, we study other possible design choices for the model architecture in Figure 2 and for the losses used in training. We compare cd-GAN with four other models:

cd-GAN-rec. The inputs are reconstructed as

$$\hat{x}_A = g_A(\hat{x}_A^i, \hat{x}_A^s); \quad \hat{x}_B = g_B(\hat{x}_B^i, \hat{x}_B^s) \quad (12)$$

instead of Eqn. (7). That is, the connection from $x_A^s$ to $g_A$ in the right box of Figure 2 is replaced by the connection from $\hat{x}_A^s$ to $g_A$, and the connection from $x_B^s$ to $g_B$ in the right box of Figure 2 is replaced by the connection from $\hat{x}_B^s$ to $g_B$.

cd-GAN-nof. Both the domain-specific and domain-independent feature reconstruction losses, i.e., Eqn. (10) and Eqn. (9), are removed from the dual learning losses.

cd-GAN-nos. The domain-specific feature reconstruction loss, i.e., Eqn. (10), is removed from the dual learning losses.

cd-GAN-noi. The domain-independent feature reconstruction loss, i.e., Eqn. (9), is removed from the dual learning losses.
The comparison experiments are conducted on the edges-to-handbags task. The results are shown in Figure 7. Our cd-GAN outperforms the other four candidate models with better color schemes. The failure of cd-GAN-rec demonstrates the necessity of the "skip connections" (i.e., the connections from $x_A^s$ to $g_A$ and from $x_B^s$ to $g_B$) for image reconstruction. Since the domain-specific feature-level and image-level reconstruction losses already implicitly constrain the domain-specific features to some extent, the results produced by cd-GAN-noi are the closest to those of cd-GAN among the four candidate models.
So far, we have shown the translation results of cd-GAN generated from the combination of domain-specific features and domain-independent features. One may be interested in what is really learned in the two kinds of features. Here we try to understand them by generating translation results using each kind of features separately:

We generate an image using the domain-specific features only,

$$x_{AB}^{A=0} = g_B(x_A^i = 0, x_B^s),$$

in which we set the domain-independent features to 0.

We generate an image using the domain-independent features only,

$$x_{AB}^{B=0} = g_B(x_A^i, x_B^s = 0),$$

in which we set the domain-specific features to 0.

Figure 8. Images generated using only domain-independent features or only domain-specific features.

The results are shown in Figure 8. As we can see, the image $x_{AB}^{A=0}$ has a style similar to $x_B$, which indicates that our cd-GAN can indeed extract domain-specific features. While $x_{AB}^{B=0}$ loses the conditional information of $x_B$, it still preserves the main shape of $x_A$, which demonstrates that cd-GAN indeed extracts domain-independent features.
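A minimal sketch of this inspection procedure, reusing the (assumed) encoder/decoder objects from the sketch in Section 3.1:

```python
# Illustrative sketch only: zero out one kind of feature before decoding and
# inspect what survives in the output. e_A, e_B, g_B, x_A, x_B are as in the
# sketch after Eqn. (4).
import torch

xi_A, xs_A = e_A(x_A)
xi_B, xs_B = e_B(x_B)

x_AB_no_di = g_B(torch.zeros_like(xi_A), xs_B)  # domain-specific features only
x_AB_no_ds = g_B(xi_A, torch.zeros_like(xs_B))  # domain-independent features only
```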
4.4. User Study
We have conducted a user study to compare the similarity of domain-specific features between the generated images and the conditional images. A total of 17 subjects (10 males, 7 females, age range 20-35) from different backgrounds were asked to compare 32 sets of images. We show each subject the source image, the conditional image, our result and the results from the other methods. Each subject then selects the generated image most similar to the conditional image. The result of the user study (Figure 9) shows that our model clearly outperforms the other methods.

Figure 9. The result of the user study.
5. Conclusions and Future Work
In this paper, we have studied the problem of conditional image-to-image translation, in which we translate an image from a source domain to a target domain conditioned on another target-domain image given as input. We have proposed a new model based on GANs and dual learning. The model can well leverage the conditional inputs to control and diversify the translation results. Experiments on two settings (symmetric translations and asymmetric translations) and three tasks (face-to-face, edges-to-shoes and edges-to-handbags translations) have demonstrated the effectiveness of the proposed model.

There are multiple aspects to explore for conditional image translation. First, we will apply the proposed model to more image translation tasks. Second, it is interesting to design better models for this translation problem. Third, the problem of conditional translation may be extended to other applications, such as conditional video translation and conditional text translation.
6. Acknowledgement
This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFC0801001, NSFC under Grants 61571413, 61632001 and 61390514, and Intel ICRI MNC.
References
[1] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486-1494, 2015.
[2] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[3] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[4] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414-2423, 2016.
[5] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915-1926, 2012.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[7] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820-828, 2016.
[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[9] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[12] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[14] Y. Lu, Y.-W. Tai, and C.-K. Tang. Conditional CycleGAN for attribute guided face image generation. arXiv preprint arXiv:1705.09966, 2017.
[15] P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[16] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
[17] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807-814, 2010.
[18] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372-376, 1983.
[19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[20] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[21] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T.-Y. Liu. Dual supervised learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3789-3798, Sydney, Australia, Aug 2017. PMLR.
[22] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[23] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192-199, 2014.
[24] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597-613. Springer, 2016.
[25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.