SPECIAL SECTION ON GIGAPIXEL PANORAMIC VIDEO WITH VIRTUAL REALITY
Received February 25, 2020, accepted March 5, 2020, date of publication March 9, 2020, date of current version March 18, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2979348
PiiGAN: Generative Adversarial Networks for
Pluralistic Image Inpainting
WEIWEI CAI , (Student Member, IEEE), AND ZHANGUO WEI
School of Logistics and Transportation, Central South University of Forestry and Technology, Changsha 410004, China
Corresponding author: Zhanguo Wei (t20110778@csuft.edu.cn)
This work was supported by the Hunan Key Laboratory of Intelligent Logistics Technology under Grant 2019TP1015.
ABSTRACT The latest deep-learning-based methods have achieved impressive results on the challenging
task of inpainting large missing areas in an image. However, these methods generally attempt to generate
a single ‘‘optimal’’ result, ignoring many other plausible ones. Given the inherent uncertainty of the
inpainting task, a single result can hardly be regarded as the desired restoration of the missing area.
In view of this weakness in the design of previous algorithms, we propose a novel deep generative model
equipped with a brand new style extractor that extracts the style feature (latent vector) from the ground
truth. The extracted style feature and the ground truth are then both fed into the generator. We also craft
a consistency loss that guides the generated image to approximate the ground truth. After training, our
generator learns the mapping between styles and multiple sets of vectors, so the proposed model can
generate a large number of results consistent with the context semantics of the image. We evaluated the
effectiveness of our model on three datasets, i.e., CelebA, PlantVillage, and MauFlex. Compared to
state-of-the-art inpainting methods, our model offers inpainting results with both better quality and higher
diversity. The code and model will be made available on https://github.com/vivitsai/PiiGAN.
INDEX TERMS Deep learning, generative adversarial networks, image inpainting, diversity inpainting.
I. INTRODUCTION
Image inpainting requires a computer to fill in the missing
area of an image according to the information found in the
image itself or in the region surrounding the missing area,
thus creating a plausible final inpainted image. However, when
the missing area of an image is too large, the uncertainty of the
inpainting results increases greatly. For example, when inpainting
a face image, the eyes may look in different directions, and
glasses may or may not be present. Although a single inpainting
result may seem reasonable, it is difficult to determine whether
this result meets our expectations, as it is the only option.
Driven by this observation, we aim to produce a variety of
plausible results for a single missing region, which we call
pluralistic image inpainting (as shown in Figure 1).
Early research [2] attempted to carry out image inpainting
using classical texture synthesis, that is, by sampling similar
pixel blocks from the undamaged area of the image to fill the
area to be completed. However, the premise of these methods
is that similar patches can be sampled
from the undamaged area. When the region to be inpainted
contains complex, nonrepetitive structures (such as faces), these
methods obviously cannot work, because they cannot capture
high-level semantics. The vigorous development of deep generative
models has driven recent related research [16], [39], which
encodes the image into a high-dimensional hidden space and
then decodes the features into a complete inpainted image. Unfortunately,
because the receptive field of the convolutional
neural network is too small to effectively capture or borrow
information from distant spatial locations, these CNN-based
approaches typically generate boundary shadows, distorted
results, and blurred textures inconsistent with the surrounding
region. Recently, some works [34], [36] used spatial attention
to recover the lost area using the surrounding image features
as reference. These methods ensure semantic consistency
between the generated content and the context information.
However, all of these existing methods try to inpaint a
unique ‘‘optimal’’ result and are unable to generate a variety
of valuable and plausible results.
In order to obtain multiple diverse results, many methods
based on the CVAE [26] have been proposed [14], [27],
but these methods are limited to specific domains, which require
FIGURE 1. Examples of the inpainting results of our method on a face, leaf, and rainforest image (the missing regions are shown in white). The left is the
masked input image, while the right is the diverse and plausible direct output of our trained model without any postprocessing.
targeted attributes and may result in unreasonable inpainted
images.
To achieve better diversity of inpainting results, we add
a new extractor to the generative adversarial network
(GAN) [19], which extracts the style feature of both the
ground truth images of the training set and the fake images
generated by the generator. The encoder in CVAE-GAN [14]
feeds the features extracted from the ground truth image
directly into the generator. When the conditioning label is
itself a masked image, there is usually only one sample in the
training set matching each label. Therefore, the generated
results have very limited variation.
We propose a novel deep generative model-based
approach. In each round of iterative training, the extractor
first extracts the style feature of a ground truth image from the
training set and inputs it to the generator together with the ground
truth. We use the consistency loss L1 to force the generated
image to be as close as possible to the ground truth of the training
set. At the same time, we generate random vectors and input
them, together with a masked image, to the generator to obtain a
fake image, and use the consistency loss L1 to make the style
feature extracted from this fake image as close as possible to the
input vectors. After the iterations, the generator can learn the
mapping between styles and multiple sets of input vectors. We also
minimize the Kullback-Leibler (KL) loss to reduce the gap
between the prior distribution and the posterior distribution
of the latent vectors extracted by the extractor.
We experimented on the open datasets CelebA [33],
PlantVillage, and MauFlex [1]. Both quantitative and qual-
itative tests show that our model can generate not only higher
quality results, but also a variety of plausible results. In addi-
tion, our model has practical application value in various
fields such as art restoration, real-time inpainting of large-
area missing images, and facial micro-reshaping.
The main contributions of this work are summarized as
follows:
• We propose PiiGAN, a novel generative adversarial
network for pluralistic image inpainting that not only
delivers higher quality results, but also produces a variety
of realistic and reasonable outputs.
• We design a new extractor to improve the GAN.
The extractor extracts the style vectors of the training
samples in each iteration, and a consistency loss is
introduced to guide the generator to learn a variety of styles
that match the semantics of the input image.
• We validate that our model can inpaint the same
missing regions with multiple results that are plausible
and consistent with the high-level semantics of the
image, and we evaluate the effectiveness of our model on
multiple datasets.
The rest of this paper is organized as follows. Section 2 pro-
vides related work on image inpainting. In addition, some
existing studies on conditional image generation are intro-
duced. In Section 3, we elaborate on the proposed model of
pluralistic image inpainting (PiiGAN). Section 4 provides an
evaluation. Finally, conclusions are given in Section 5.
II. RELATED WORK
A. IMAGE INPAINTING BY TRADITIONAL METHODS
Traditional diffusion-based methods use the edge information
of the area to be inpainted to determine the direction of
diffusion and propagate the known information toward the edge.
Examples include the variational method of Ballester et al. [5],
the histogram statistical method based on local features [6],
and the fast marching method based on level sets proposed by
Telea [7]. However, this kind of method can only inpaint small
missing areas. In contrast to diffusion-based techniques,
patch-based methods perform texture synthesis [6], [7]: they
sample similar patches from undamaged areas and paste them into
the missing areas. Bertalmio et al. [4] proposed a method that
simultaneously fills texture and structure in the area with missing
image information, and Duan et al. [9] proposed a method that
uses local patch statistics to complete the image. However, these
methods usually generate distorted structures and unreasonable
textures.
Xu and Sun [8] proposed a typical inpainting method
that investigates the spatial distribution of image patches.
This method can better distinguish structure and texture,
thus forcing the newly patched area to remain clear and
consistent with the surrounding texture. Ting et al. [10]
proposed a global region-filling algorithm based on Markov
random field energy minimization, which pays more attention
to the contextual plausibility of texture; however, its
computational complexity is high. Wu et al. [11] put forward a
fast approximate nearest-neighbor algorithm called PatchMatch,
which can be used for advanced image editing. Shao et al. [12]
put forward an algorithm based on the Poisson equation that
decomposes the image into texture and structure, which is
effective for large-area completion. However, these methods can
only capture low-level features, and their obvious limitation is
that they only extract texture and structure from the input
image. If no suitable texture can be found in the input image,
these methods have a very limited effect and do not generate
semantically reasonable results.
B. IMAGE INPAINTING BY DEEP GENERATIVE MODELS
Recently, using deep generative models to inpaint images
has yielded exciting results. In addition, Image inpainting
with generative adversarial networks (GAN) [19] has gained
significant attention. Early works [13], [15] trained CNNs for
image denoising and restoration. The deep generative model
named Context Encoder proposed by Pathak et al. [16] can be
used for semantic inpainting tasks. The CNN-based inpaint-
ing is extended to the large mask, and a context encoder
based on the generation adversarial network (GAN) is pro-
posed for inpainting the learned features [18]. The guide
loss is introduced to make the feature map generated in the
decoder as close as possible to the feature map of the ground
truth generated in the encoding process. Lizuka et al. [39]
improved the image completion effect by introducing local
and global discriminators as experience loss. The global dis-
criminator is used to check the whole image and to evaluate
its overall consistency, while the local discriminator is only
used to check a small area to ensure the local consistency
of the generated patch. Lizuka et al. [39] also proposed
the concept of dilated convolutions to the reception field.
However, this method needs a lot of computational resources.
For this reason, Sagong et al. [35] proposed a structure
(Pepsi) composed of a single shared coding network and a
parallel decoding network with rough and patching paths,
which can reduce the number of convolution operations.
Recently, some works [20], [23], [29] have proposed the use
of spatial attention [24], [25] to obtain high-frequency details.
Yu et al. [20] proposed a contextual attention layer, which fills
the missing pixels with similar patches from undamaged areas.
Isola et al. [22] tried to solve the image restoration problem
with a general image translation model. Using high-level
semantic feature learning, deep generative models can
generate semantically consistent results for the missing areas.
However, it is still very difficult to generate realistic results
from the remaining latent features alone. Other works [3], [11],
[38], [49] explore related applications.
C. CONDITIONAL IMAGE GENERATION
Building on the VAE [31] and GAN [19], conditional image
generation has been widely applied to tasks such as 3D modeling,
image translation, and style generation. Sohn et al. [26] used
stochastic inference to generate diverse but realistic outputs
with a deep conditional generative model based on Gaussian
latent variables. The conditional variational autoencoder proposed
by Walker et al. [27] can generate a variety of different
predictions of the future. Subsequently, the variational
autoencoder was combined with a generative adversarial network
to generate images of a specific class by changing the
fine-grained class label fed into the generative model. In [28],
different facial image restorations are achieved by specifying
particular attributes (such as male and smiling). However, this
approach is limited to specific domains and requires specific
attributes.
III. PROPOSED APPROACH
We built our pluralistic image inpainting network based
on the current state-of-the-art image inpainting model [20],
which has shown exciting results in terms of inpainting face,
leaf, and rainforest images. However, similar to other existing
methods [1], [20], [21], [36], [48], classic image completion
methods attempt to inpaint missing regions of the original
image in a deterministic manner, thus only producing a single
result. Instead, our goal was to generate multiple reasonable
results.
A. EXTRACTOR
Figure 3 shows the extractor network architecture of our
proposed method. It has four convolutional layers, one flatten
layer, and two parallel fully connected layers. Each
convolutional layer uses the ELU activation function. All the
convolutional layers use a stride of 2 × 2 pixels and 5 × 5
kernels to reduce the image resolution while increasing the
number of output filters. The two parallel fully connected layers
both output z_var, and their outputs are equal: one is dedicated
to the KL loss, and the other is input to the generator together
with the extracted latent vector Zr.
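As an illustration, the following is a minimal TensorFlow/Keras sketch of an extractor with this layout. The paper's implementation is in TensorFlow, but the layer widths, latent dimension, and the mean/log-variance interpretation of the two parallel fully connected heads are our assumptions here; Table 6 in Appendix B gives the reported configuration.

```python
import tensorflow as tf

def build_extractor(img_size=128, z_dim=256):
    """Sketch of the style extractor: four 5x5, stride-2 convolutions with ELU
    activations, a flatten layer, and two parallel fully connected heads.
    Filter counts and z_dim are illustrative assumptions."""
    inputs = tf.keras.Input(shape=(img_size, img_size, 3))
    x = inputs
    for filters in (64, 128, 256, 512):  # halve resolution, increase output filters
        x = tf.keras.layers.Conv2D(filters, kernel_size=5, strides=2,
                                   padding="same", activation="elu")(x)
    x = tf.keras.layers.Flatten()(x)
    z_mean = tf.keras.layers.Dense(z_dim, name="z_mean")(x)        # one head feeds the KL loss
    z_log_var = tf.keras.layers.Dense(z_dim, name="z_log_var")(x)  # the other parameterizes the sampled latent vector
    return tf.keras.Model(inputs, [z_mean, z_log_var], name="extractor")
```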
Let $I_{gt}$ be the ground truth image; the extractor and the style feature extracted from $I_{gt}$ are denoted by $E$ and $Z_r$, respectively. We use the ground truth image $I_{gt}$ as the input, and $Z_r$ is the latent vector extracted from $I_{gt}$ by the extractor:
$$Z_r^{(i)} = E(I_{gt}^{(i)}) \tag{1}$$
Let $I_{cf}$ be the fake image generated by the generator. The extractor and the style feature extracted from $I_{cf}$ are denoted by $E$ and $Z_f$, respectively. We use the fake image $I_{cf}$ as the input, and $Z_f$ is the latent vector extracted from $I_{cf}$ by the extractor:
$$Z_f^{(i)} = E(I_{cf}^{(i)}) \tag{2}$$
The extractor extracts the style feature of each training sample and outputs its mean and covariance, i.e., $\mu$ and $\sigma$. Similar to VAEs, the KL loss is used to narrow the gap between the prior $p_\theta(z)$ and the Gaussian distribution $q_\phi(z|I)$.
Let the prior over the latent vector $z$ be the centered isotropic multivariate Gaussian $p_\theta(z)=\mathcal{N}(z;0,\mathbf{I})$. Assume $p_\theta(I|z)$ is a multivariate Gaussian whose distribution parameters are computed from $z$ with the extractor network. We assume that the true posterior takes an approximate Gaussian form with approximately diagonal covariance:
$$\log q(z|I)=\log \mathcal{N}\big(z;\mu,\sigma^2\mathbf{I}\big) \tag{3}$$
Let $\sigma$ and $\mu$ denote the standard deviation and variational mean evaluated at datapoint $i$, and let $\mu_j$ and $\sigma_j$ denote the $j$-th elements of these vectors. Then, the KL divergence between the posterior distribution $q_\phi(z|I^{(i)})$ and $p_\theta(z)=\mathcal{N}(z;0,\mathbf{I})$ can be computed as
$$-D_{KL}\big(q_\phi(z)\,\|\,p_\theta(z)\big)=\int q_\theta(z)\big(\log p_\theta(z)-\log q_\theta(z)\big)\,dz \tag{4}$$
According to our assumptions, the prior $p_\theta(z)=\mathcal{N}(z;0,\mathbf{I})$ and the posterior approximation $q_\phi(z|I)$ are Gaussian. Thus we have
$$\int q_\theta(z)\log q_\theta(z)\,dz=\int \mathcal{N}\big(z;\mu_j,\sigma_j^2\big)\log \mathcal{N}\big(z;\mu_j,\sigma_j^2\big)\,dz=-\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^{J}\big(1+\log(\sigma_j^2)\big) \tag{5}$$
and
$$\int q_\theta(z)\log p(z)\,dz=\int \mathcal{N}\big(z;\mu_j,\sigma_j^2\big)\log \mathcal{N}(z;0,\mathbf{I})\,dz=-\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_{j=1}^{J}\big(\mu_j^2+\sigma_j^2\big) \tag{6}$$
Finally, we can obtain
$$-D_{KL}\big(q_\phi(z)\,\|\,p_\theta(z)\big)=\frac{1}{2}\sum_{j=1}^{J}\big(1+\log(\sigma_j^2)\big)-\frac{1}{2}\sum_{j=1}^{J}\big(\mu_j^2+\sigma_j^2\big)=\frac{1}{2}\sum_{j=1}^{J}\big(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2\big) \tag{7}$$
where the mean and standard deviation of the approximate posterior, $\mu$ and $\sigma$, are outputs of the extractor $E$, i.e., nonlinear functions of the generated sample $x^{(i)}$ and the variational parameters $\phi$. After this, the latent vector $z\sim q_\phi(z|x)$ is sampled using $g_\phi(x,\epsilon)=\mu+\sigma\odot\epsilon$, where $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ and $\odot$ is an element-wise product. The obtained latent vector $Z$ is fed into the generator together with the masked input image.
The outputs of the generator are processed by the extractor
E again to obtain the style feature, which is applied to another
masked input image.
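A minimal sketch of the sampling step and of Eq. (7), assuming the extractor returns a mean and a log-variance as in the sketch above (function names are ours):

```python
import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def kl_loss(z_mean, z_log_var):
    """Negative of Eq. (7): KL( N(mu, sigma^2 I) || N(0, I) ), summed over
    latent dimensions and averaged over the batch."""
    kl_per_dim = -0.5 * (1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    return tf.reduce_mean(tf.reduce_sum(kl_per_dim, axis=-1))
```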
B. PLURALISTIC IMAGE INPAINTING NETWORK: PiiGAN
Figure 2 shows the network architecture of our proposed
model. We add a novel network after the generator, named
the extractor, which is responsible for extracting the latent
vector $Z$. We concatenate an image whose missing regions are
filled with white pixels with generated random vectors as input,
and output the inpainted fake image $I_{cf}$. The fake image
generated by the generator is fed into the extractor to extract
its style feature. At the same time, a ground truth image is
randomly sampled from the training set and fed into the extractor
to obtain the style feature of the ground truth image, and the
ground truth image concatenated with the mask is input into
the generator to obtain the generated image $I_{cr}$. We want
the extracted style feature to be as close as possible to the
input vectors. It is also desirable that the generated image $I_{cr}$
be as close as possible to the ground truth image $I_{gt}$, so that
the parameters and weights of the generator network can be
continuously updated. Here, we propose using the L1 loss
function to minimize the sum of the absolute differences
between the target value and the estimated value, because the
minimum absolute deviation method is more robust than the
least squares method.
FIGURE 2. The architecture of our model. It consists of three modules: a generator G, an extractor E, and two discriminators D. (a) G takes in both the
image with white holes and the style feature as inputs and generates a fake image. The style feature is spatially replicated and concatenated with the
input image. (b) E is used to extract the style feature of the input image. (c) The global discriminator [39] identifies the entire image, while the local
discriminator [39] only discriminates the inpainting regions of the generator output.
FIGURE 3. The architecture of our extractor network.
1) CONSISTENCY LOSS
The perceptual loss [46] cannot directly optimize the convolutional layers or ensure consistency between the feature maps produced after the generator and the extractor. We therefore adjusted the form of the perceptual loss and propose a consistency loss to handle this problem. As shown in Figure 3, we use the extractor to extract a high-level style space from the ground truth image. Our model also auto-encodes the visible inpainting results deterministically, and the loss function needs to match this inpainting task. Therefore, the per-instance loss here is
$$L_c^{e,(i)}=\big\|I_{cr}^{(i)}-I_{gt}^{(i)}\big\|_1 \tag{8}$$
where $I_{cr}^{(i)}=G(Z_r^{(i)},f_m)$ and $I_{gt}^{(i)}$ are the completed and ground truth images, respectively. $G$ is the generator and $E$ is our extractor; $z_r$ is the latent vector extracted by the extractor, which we call the style feature, $z_r=E(I_{gt}^{(i)})$. For the separate generative path, the per-instance loss is
$$L_c^{g,(i)}=\big\|I_{cf}^{(i)}-I_{raw}^{(i)}\big\|_1 \tag{9}$$
where $I_{cf}^{(i)}=G(Z_f^{(i)},f_m)$ and $I_{raw}^{(i)}$ are the fake image completed by the generator and the input raw image, respectively.
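Both consistency terms reduce to a mean L1 distance; a sketch (the tensor arguments are assumed to be batched images or latent vectors of matching shape):

```python
import tensorflow as tf

def consistency_loss(a, b):
    """Mean absolute (L1) difference, used for Eq. (8) (I_cr vs. I_gt),
    Eq. (9) (I_cf vs. I_raw), and the z_f-vs-z term of the training procedure."""
    return tf.reduce_mean(tf.abs(a - b))
```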
2) ADVERSARIAL LOSS
To enhance the training process and to inpaint higher quality
images, Gulrajani et al. [47] proposed using gradient penalty
terms to improve the Wasserstein GAN [37].
$$L_{adv}^{G}=\mathbb{E}_{I_{raw}}[D(I_{raw})]-\mathbb{E}_{I_{raw},z}[D(G(I_{cm},z))]-\lambda\,\mathbb{E}_{\hat{I}}\Big[\big(\big\|\nabla_{\hat{I}}D(\hat{I})\big\|_2-1\big)^2\Big] \tag{10}$$
where $\hat{I}$ is sampled uniformly along a straight line between a pair of generated and input raw images. We used $\lambda=10$ for all experiments.
For the image completion task, we only attempt to inpaint the missing regions, so for the local discriminator we only apply the gradient penalty [47] to the pixels in the missing area. This can be achieved by multiplying the gradient by the input mask $m$ as follows:
$$L_{adv}^{L}=\mathbb{E}_{I_{raw}}[D(I_{raw})]-\mathbb{E}_{I_{raw},z}[D(G(I_{cm},z))]-\lambda\,\mathbb{E}_{\hat{I}}\Big[\big(\big\|\nabla_{\hat{I}}D(\hat{I})\odot(1-m)\big\|_2-1\big)^2\Big] \tag{11}$$
where the mask value is 0 for pixels in the missing regions and 1 for all other locations.
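A sketch of the masked gradient penalty in Eq. (11), following the WGAN-GP formulation of [47]; the mask convention (1 for known pixels, 0 for the hole) follows the text, while the discriminator interface and variable names are our assumptions:

```python
import tensorflow as tf

def masked_gradient_penalty(discriminator, real, fake, mask, lam=10.0):
    """Gradient penalty restricted to the missing region: the gradient of the
    discriminator output is multiplied by (1 - mask) before taking its norm."""
    batch = tf.shape(real)[0]
    alpha = tf.random.uniform([batch, 1, 1, 1], 0.0, 1.0)
    interp = alpha * real + (1.0 - alpha) * fake   # samples on the line between real and fake
    with tf.GradientTape() as tape:
        tape.watch(interp)
        scores = discriminator(interp, training=True)
    grads = tape.gradient(scores, interp) * (1.0 - mask)   # keep gradients inside the hole only
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return lam * tf.reduce_mean(tf.square(norm - 1.0))
```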
3) DISTRIBUTIVE REGULARIZATION
The KL divergence term serves to regularize the learned importance sampling function $q_\phi(z|I_{gt})$ toward a fixed latent prior $p(z_r)$. With both defined as Gaussians, we get
$$L_{KL}^{e,(i)}=-KL\big(q_\phi(z_r|I_{gt}^{(i)})\,\|\,\mathcal{N}(0,\sigma^2(n)\mathbf{I})\big) \tag{12}$$
For the fake image output by the generator, the learned importance sampling function $q_\phi(z|I_{cf})$ is likewise regularized toward a fixed latent prior $p(z_f)$, also a Gaussian:
$$L_{KL}^{g,(i)}=-KL\big(q_\phi(z_f|I_{cf}^{(i)})\,\|\,\mathcal{N}(0,\sigma^2(n)\mathbf{I})\big) \tag{13}$$
4) OBJECTIVE
Through the KL, consistency, and adversarial losses obtained above, the overall objective of our diversity inpainting network is defined as
$$L=\alpha_{KL}\big(L_{KL}^{e}+L_{KL}^{g}\big)+\alpha_{c}\big(L_{c}^{e}+L_{c}^{g}\big)+\alpha_{adv}\big(L_{adv}^{G}+L_{adv}^{L}\big) \tag{14}$$
where $\alpha_{KL}$, $\alpha_{c}$, and $\alpha_{adv}$ are the tradeoff parameters for the KL, consistency, and adversarial losses, respectively.
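For clarity, a sketch of Eq. (14) as a plain weighted sum, with the tradeoff weights that are reported later in Section IV-A (the function name and default values as code are illustrative):

```python
def total_loss(l_kl_e, l_kl_g, l_c_e, l_c_g, l_adv_g, l_adv_l,
               alpha_kl=10.0, alpha_c=0.9, alpha_adv=1.0):
    """Eq. (14): weighted sum of the KL, consistency, and adversarial terms."""
    return (alpha_kl * (l_kl_e + l_kl_g)
            + alpha_c * (l_c_e + l_c_g)
            + alpha_adv * (l_adv_g + l_adv_l))
```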
C. TRAINING
For training, given a ground truth image $I_{gt}$, we use our
proposed extractor to extract the style feature of the ground
truth and then concatenate the style feature with the masked
ground truth image. This is input to the generator G to obtain
a predicted output image $I_{cr}$, and the consistency loss L1
forces $I_{cr}$ to be as close as possible to $I_{gt}$, thereby updating
the parameters and weights of the generator. At the same time,
we sample an image $I_{raw}$ from the training data, generate a
mask and random vectors for $I_{raw}$, and concatenate them as
input to the generator G to obtain the predicted output image
$I_{cf}$. We then use our proposed extractor to extract the latent
vector $z_f$ of the generated image and force $z_f$ to be as close
to $z$ as possible, again updating the generator with the
consistency loss L1.
IV. EXPERIMENTS AND RESULTS
We evaluated our proposed model on three open datasets:
CelebA faces [33], PlantVillage, and MauFlex [1]. The
PlantVillage dataset is publicly available to researchers,
and we manually downloaded all training and test images
from the PlantVillage page (https://plantvillage.org).
MauFlex is an open dataset published by Morales et al. [1].
The three datasets contain 200k, 45k, and 25k images,
respectively. We randomly split each dataset into training
and test sets, with 15% of the data used for testing. Since
our method can inpaint countless results, we generated
100 images for each image with missing regions and selected
10 of them, each with different high-level semantic features.
We compared the results with current state-of-the-art methods,
both quantitatively and qualitatively.
Algorithm 1 Training Procedure of Our Proposed Model
1: while G has not converged do
2:   for i = 1 → n do
3:     Input ground truth images I_gt;
4:     Get style feature by extractor: Z_r ← E(I_gt);
5:     Concatenate inputs: Ĩ_rm ← Z_r ⊕ I_gt ⊕ m;
6:     Get predicted outputs: I_cr ← G(Z_r, I_gt) ⊙ (1 − M);
7:     Update the generator G with L1 loss (I_cr, I_gt);
8:     Meanwhile,
9:     Sample image I_raw from the training set;
10:    Generate white mask m for I_raw;
11:    Generate random vectors z for I_raw;
12:    Concatenate inputs: Ĩ_cm ← I_raw ⊕ m ⊕ z;
13:    Get predictions: I_cf ← G(I_raw, z);
14:    Get style feature by extractor: z_f ← E(I_cf);
15:    Update the generator G with L1 loss (z_f, z);
16:  end for
17: end while
(⊕ denotes channel-wise concatenation and ⊙ an element-wise product.)
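A compact sketch of the generator updates in Algorithm 1 follows. The discriminator updates, the adversarial terms, and the KL terms are omitted for brevity; the generator/extractor interfaces and all names are illustrative assumptions, not the released implementation.

```python
import tensorflow as tf

@tf.function
def generator_step(generator, extractor, optimizer, i_gt, i_raw, mask, z_dim=256):
    """One generator/extractor update following Algorithm 1.
    mask is 1 for known pixels and 0 inside the hole."""
    z = tf.random.normal([tf.shape(i_raw)[0], z_dim])            # random style vectors (line 11)
    with tf.GradientTape() as tape:
        # Reconstruction path (lines 3-7): style feature from the ground truth.
        z_mean, z_log_var = extractor(i_gt, training=True)
        z_r = z_mean + tf.exp(0.5 * z_log_var) * tf.random.normal(tf.shape(z_mean))
        i_cr = generator([i_gt * mask, z_r], training=True)
        loss_rec = tf.reduce_mean(tf.abs(i_cr - i_gt))           # L1(I_cr, I_gt)

        # Generative path (lines 9-15): random vector, then re-extract the style.
        i_cf = generator([i_raw * mask, z], training=True)
        z_f, _ = extractor(i_cf, training=True)
        loss_style = tf.reduce_mean(tf.abs(z_f - z))             # L1(z_f, z)

        loss = loss_rec + loss_style
    variables = generator.trainable_variables + extractor.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```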
Our method was compared to the following:
– CA Contextual Attention, proposed by Yu et al. [20]
– SH Shift-net, proposed by Yan et al. [17]
– GL Global and local, proposed by Iizuka et al. [39]
A. IMPLEMENTATION DETAILS
Our diversity-generation network was inspired by recent
works [20], [39], but with several significant modifications,
including the extractor. Our inpainting network, implemented
in TensorFlow [41], contains 47 million trainable parameters
and was trained on a single NVIDIA 1080 GPU (8 GB) with a
batch size of 12. We used TensorBoard to visualize various
parameters of the training process in real time. Training the
CelebA [33], PlantVillage, and MauFlex [1] models took
roughly 3 days, 2 days, and 1 day, respectively.
To fairly evaluate our method, we only conducted experiments
on the centered hole. We compared our method with GL [39],
CA [20], and SH [17] on images from the CelebA [33],
PlantVillage, and MauFlex [1] validation sets.
All masked images were resized to 128 × 128 for training and
testing. We used the Adam algorithm [42] to optimize our model
with a learning rate of 2 × 10^-3 and β1 = 0.5, β2 = 0.9.
The tradeoff parameters were set as
αKL = 10, αrec = 0.9, αadv = 1. For the nonlinearities in
the network, we used exponential linear units (ELUs) as the
activation function in place of the commonly used rectified
linear units (ReLUs). We found that ELUs speed up learning
by pushing the mean activation toward zero; moreover, because
they act as the identity for positive inputs, they help avoid the
vanishing-gradient problem. Compared with ReLUs, ELUs are
also more robust to input changes or noise.
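For reference, a sketch of these optimizer and activation settings in Keras (the learning-rate value is our reading of the figure above; the snippet is purely illustrative):

```python
import tensorflow as tf

# Adam with the betas reported above; ELU replaces ReLU throughout the network.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-3, beta_1=0.5, beta_2=0.9)
elu = tf.keras.layers.ELU()   # identity for x > 0, exp(x) - 1 for x <= 0
```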
FIGURE 4. Comparison of qualitative results with CA [20], SH [17] and GL [39] on the CelebA dataset.
TABLE 1. Results using the CelebA dataset with large missing regions,
comparing Global and Local (GL) [39], Shift-Net (SH) [17], Contextual
Attention (CA) [20], and our method. −Lower is better. +Higher is better.
B. QUANTITATIVE COMPARISONS
Quantitative measurement is difficult for the pluralistic
image inpainting task, as our goal is to generate diverse and
plausible results for an image with missing regions, and
comparisons should not be based solely on a single inpainting
result.
However, solely for the purpose of obtaining quantitative
indicators, we randomly selected from our set of results a
single sample that was close to the ground truth image and
offered the best balance of quantitative indicators for
comparison. The comparison was carried out on 10,000
CelebA [33] test images, with quantitative measures of mean
L1 loss, L2 loss, Peak Signal-to-Noise Ratio (PSNR), and
Structural SIMilarity (SSIM) [43]. We used a 64 × 64 mask in
the center of the image. Table 1 lists the results of the
evaluation with the centered mask. It is not difficult to see that
our method is superior to all other methods on these
quantitative measures.
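As an illustration of the metric computation, a sketch using TensorFlow's built-in PSNR and SSIM ops (images are assumed to be float tensors in [0, 1]; the function name is ours):

```python
import tensorflow as tf

def inpainting_metrics(pred, gt):
    """Mean L1, L2, PSNR, and SSIM between inpainted and ground truth images
    of shape [batch, 128, 128, 3] with values in [0, 1]."""
    return {
        "l1": tf.reduce_mean(tf.abs(pred - gt)),
        "l2": tf.reduce_mean(tf.square(pred - gt)),
        "psnr": tf.reduce_mean(tf.image.psnr(pred, gt, max_val=1.0)),
        "ssim": tf.reduce_mean(tf.image.ssim(pred, gt, max_val=1.0)),
    }
```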
C. QUALITATIVE COMPARISONS
We first evaluated our proposed method on the CelebA [33]
face dataset; Figure 4 shows the inpainting results with large
missing regions, highlighting the diversity of the output of our
model, especially in terms of high-level semantics. GL [39]
can produce more natural images by using local and global
discriminators to keep the images consistent. SH [17] improves
on the copy function, but its predictions are somewhat blurry
and miss details. In contrast, our method not only produces
clearer and more plausible images, but also provides completion
results with multiple attributes.
As shown in Figures 5 and 6, we also evaluated our
approach on the MauFlex [1] and PlantVillage datasets
to demonstrate the diversity of our output across different
datasets. Contextual Attention (CA) [20], while producing
reasonable completion results in many cases, can only produce
a single result, and in some cases a single solution is not
enough. Our model produces a variety of reasonable results.
Finally, Figure 4 shows the various facial attribute results
on the CelebA [33] dataset. We observed that existing
models, such as GL [39], SH [17], and CA [20], can only
generate a single facial attribute for each masked input. The
results of our method on these test data provide higher visual
quality and a variety of attributes, such as the gaze angle of
the eyes, whether or not glasses are worn, and the location of
disease on the leaf blade. This is clearly better for image
completion.
FIGURE 5. Comparison of qualitative results with Contextual Attention (CA) [20] on the MauFlex dataset.
FIGURE 6. Comparison of qualitative results with Contextual Attention (CA) [20] on the PlantVillage dataset.
FIGURE 7. Our method (top), StarGAN [32] (middle), and BicycleGAN [40] (bottom).
D. OTHER COMPARISONS
Compared with some of the existing methods (BicycleGAN [40]
and StarGAN [32]), we investigated the influence of using
our proposed extractor. We used the same common parameters to
train these three models. As shown in Figure 7, for
BicycleGAN [40], the output was not good and the generated result
TABLE 2. We measure diversity using average LPIPS [44] distance.
TABLE 3. Quantitative comparisons of realism.
was not natural. For StarGAN [32], although it can output a
variety of results, the method is limited to specific targeted
attributes for training, such as gender, age, and expressions (happy, angry, etc.).
1) DIVERSITY
In Table 2, we use the LPIPS metric proposed by [44] to
calculate diversity scores. For each approach, we calculated
the average distance between 10,000 pairs randomly generated
from 1,000 center-masked image samples. I_global and I_local
denote the full inpainting results and the mask-region
inpainting results, respectively. It is worth emphasizing that
although BicycleGAN [40] obtained relatively high diversity
scores, this may indicate that unreasonable images were
generated, resulting in worthless variations.
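A sketch of the diversity score using the reference PyTorch implementation of LPIPS [44] (the `lpips` pip package); using it alongside a TensorFlow pipeline, and the function name, are assumptions made purely for illustration:

```python
import itertools
import random
import torch
import lpips  # pip install lpips (reference implementation of [44])

def average_lpips(samples, max_pairs=10000):
    """Average LPIPS distance over randomly chosen pairs of inpainting results.
    `samples` is a list of image tensors of shape [3, H, W] scaled to [-1, 1]."""
    loss_fn = lpips.LPIPS(net="alex")
    pairs = list(itertools.combinations(range(len(samples)), 2))
    pairs = random.sample(pairs, min(max_pairs, len(pairs)))
    with torch.no_grad():
        dists = [loss_fn(samples[i].unsqueeze(0), samples[j].unsqueeze(0)).item()
                 for i, j in pairs]
    return sum(dists) / len(dists)
```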
2) REALISM
Table 3 shows the realism scores across methods. In [45], and
later in [22], human judgment was used to evaluate the visual
realism of model outputs. We likewise presented images
generated by our model to human judges in random order, for
one second each, asking them to identify the generated (fake)
images, and measured the resulting ‘‘spoofing’’ rate. The
pix2pix × noise model [22] achieved a higher realism score.
CVAE-GAN [14] helped to generate diversity, but because the
distribution of the learned latent space is unclear, the generated
samples were not reasonable. The BicycleGAN [40] model
suffered from mode collapse yet had a good realism score.
In contrast, our method applies the KL divergence loss to the
style feature extracted by the extractor, making the inpainting
results more realistic and producing the highest realism score.
V. CONCLUSION
In this paper, we proposed PiiGAN, a novel generative
adversarial network with a newly designed style extractor for
pluralistic image inpainting tasks. For a single input image
with missing regions, our model can generate numerous
diverse results with plausible content. Experiments on various
datasets have shown that our results are diverse and natural,
TABLE 4. Results using the PlantVillage dataset with large missing
regions, comparing GL [39], SH [17], CA [20], and our method. −Lower is
better. +Higher is better.
TABLE 5. Results using the MauFlex dataset with large missing regions,
comparing GL [39], SH [17], CA [20], and our method. −Lower is better.
+Higher is better.
especially for images with large missing areas. Our model can
also be applied in the fields of art restoration, facial micro-
shaping and image augmentation. In future work, we will
further study the image inpainting of large irregular missing
areas.
APPENDIXES
APPENDIX A
MORE COMPARISON RESULTS
More quantitative comparisons with CA [20], SH [17], and
GL [39] on the CelebA [33], PlantVillage, and MauFlex [1]
datasets were also conducted. Table 4 and Table 5 list the
evaluation results on the PlantVillage and MauFlex datasets,
respectively. It is obvious that our model is superior to current
state-of-the-art methods on multiple datasets.
APPENDIX B
NETWORK ARCHITECTURE
As a supplement to the content in Section III, in the following
we elaborate on the design of the proposed extractor. The specific
architectural design of our proposed extractor network is
shown in Table 6. We use the ELU activation function after
each convolutional layer. N is the number of output channels,
K is the kernel size, S is the stride size, and n is the batch size.
APPENDIX C
MORE DIVERSE EXAMPLES USING THE CelebA,
PlantVillage, AND MauFlex DATASETS
CelebA Figure 8 shows the results of the qualitative analysis
comparison of the models trained on the CelebA [33]
dataset. The direct output of our model shows a more valuable
FIGURE 8. Additional examples of our model tested on the CelebA dataset. The examples have different genders, skin tones, and eyes. Because a large
area of the image is missing, it is impossible to duplicate the content in the surrounding regions, so the Contextual Attention (CA) [20] method cannot
generate visually realistic results like ours. In addition, our diversity inpainting results have different gaze angles for the eyes and variation in whether
glasses are worn or not. It is important to emphasize that we did not apply any attribute labels when training our model.
diversity than the existing methods. The initial resolution of
the CelebA dataset images was 218 × 178. We first randomly
cropped the images to a size of 178 × 178, and then resized
the images to 128 × 128 for both training and evaluation.
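A sketch of this preprocessing with TensorFlow image ops (a random 178 × 178 crop of the 218 × 178 CelebA images followed by a resize to 128 × 128; scaling to [0, 1] is an assumption):

```python
import tensorflow as tf

def preprocess_celeba(image):
    """image: uint8 tensor of shape [218, 178, 3] (original CelebA resolution).
    Randomly crop to 178x178, resize to 128x128, and scale to [0, 1]."""
    image = tf.image.random_crop(image, size=[178, 178, 3])
    image = tf.image.resize(image, [128, 128])   # returns float32
    return image / 255.0
```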
FIGURE 9. Additional examples of our model tested on the PlantVillage dataset. Examples have blades of different kinds and colors. Since the existing CA
[20] method cannot find repeated leaf lesions around the missing area, it is difficult to generate a reasonable diseased leaf. Our model is capable of
generating a wide variety of leaves with different lesion locations. In addition, we did not apply any attribute labels when training our model.
PlantVillage Figure 9 shows the results of the qualitative
analysis comparison of the models trained on the PlantVillage
dataset. Our models also have more valuable diversity than
existing methods. The PlantVillage dataset is an open dataset
whose original image resolution is irregular. We resized the
images to 128 × 128 for training and evaluation.
FIGURE 10. Additional examples of our model tested on the MauFlex [1] dataset. The examples have different tree types. Since the existing CA [20]
method cannot find duplicate tree content around the missing area, it is difficult to generate reasonable tree images. Our model is capable of generating
a variety of trees in different locations. In addition, we did not apply any attribute labels when training our model.
TABLE 6. The architecture of our extractor network.
MauFlex Figure 10 shows the results of the qualitative
analysis comparison of the models trained on the MauFlex [1]
dataset. Our models also have more valuable diversity than
existing methods. The MauFlex dataset is an open dataset
published by Morales et al. [1] with an original image resolution
of 513 × 513. We resized the images to 128 × 128 for
training and evaluation.
REFERENCES
[1] G. Morales, G. Kemper, G. Sevillano, D. Arteaga, I. Ortega, and J. Telles,
‘‘Automatic segmentation of Mauritia flexuosa in unmanned aerial vehicle
(UAV) imagery using deep learning,’’ Forests, vol. 9, no. 12, p. 736, 2018.
[2] A. A. Efros and T. K. Leung, ‘‘Texture synthesis by non-parametric
sampling,’’ in Proc. 7th IEEE Int. Conf. Comput. Vis., vol. 2, Sep. 1999,
pp. 1033–1038.
[3] Z. Chen, H. Cai, Y. Zhang, C. Wu, M. Mu, Z. Li, and M. A. Sotelo,
‘‘A novel sparse representation model for pedestrian abnormal trajectory
understanding,’’ Expert Syst. Appl., vol. 138, Dec. 2019, Art. no. 112753.
[4] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher, ‘‘Simultaneous structure
and texture image inpainting,’’ IEEE Trans. Image Process., vol. 12, no. 8,
pp. 882–889, Aug. 2003.
[5] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, ‘‘Filling-
in by joint interpolation of vector fields and gray levels,’’ IEEE Trans.
Image Process., vol. 10, no. 8, pp. 1200–1211, Aug. 2001.
[6] A. Levin, A. Zomet, and Y. Weiss, ‘‘Learning how to inpaint from global
image statistics,’’ in Proc. 9th IEEE Int. Conf. Comput. Vis., Oct. 2003,
p. 305.
[7] A. Telea, ‘‘An image inpainting technique based on the fast marching
method,’’ J. Graph. Tools, vol. 9, no. 1, pp. 23–34, Jan. 2004.
[8] Z. Xu and J. Sun, ‘‘Image inpainting by patch propagation using patch
sparsity,’’ IEEE Trans. Image Process., vol. 19, no. 5, pp. 1153–1165,
May 2010.
[9] K. Duan, Y. Gong, and N. Hu, ‘‘Automatic image inpainting using local
patch statistics,’’ U.S. Patent 10 127 631, Nov. 13, 2018.
[10] H. Ting, S. Chen, J. Liu, and X. Tang, ‘‘Image inpainting by global
structure and texture propagation,’’ in Proc. 15th Int. Conf. Multimedia,
2007, pp. 517–520.
[11] B. Wu, T. Cheng, T. L. Yip, and Y. Wang, ‘‘Fuzzy logic based
dynamic decision-making system for intelligent navigation strategy within
inland traffic separation schemes,’’ Ocean Eng., vol. 197, Feb. 2020,
Art. no. 106909.
[12] X. Shao, Z. Liu, and H. Li, ‘‘An image inpainting approach based on the
Poisson equation,’’ in Proc. 2nd Int. Conf. Document Image Anal. Libraries
(DIAL), Apr. 2006, p. 5.
[13] J. Xie, L. Xu, and E. Chen, ‘‘Image denoising and inpainting with deep neu-
ral networks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 341–349.
[14] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘CVAE-GAN: Fine-grained
image generation through asymmetric training,’’ in Proc. IEEE Int. Conf.
Comput. Vis. (ICCV), Oct. 2017, pp. 2745–2754.
[15] L. Xu, J. S. Ren, C. Liu, and J. Jia, ‘‘Deep convolutional neural network
for image deconvolution,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014,
pp. 1790–1798.
[16] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, ‘‘Context
encoders: Feature learning by inpainting,’’ in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2536–2544.
[17] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan, ‘‘Shift-net: Image inpainting
via deep feature rearrangement,’’ in Proc. Eur. Conf. Comput. Vis., 2018,
pp. 1–17.
[18] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, ‘‘Image inpainting via
generative multi-column convolutional neural networks,’’ in Proc. Adv.
Neural Inf. Process. Syst., 2018, pp. 331–340.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in
Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[20] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, ‘‘Generative image
inpainting with contextual attention,’’ in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit., Jun. 2018, pp. 5505–5514.
[21] H. Liu, B. Jiang, Y. Xiao, and C. Yang, ‘‘Coherent semantic attention for
image inpainting,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
Oct. 2019, pp. 4170–4179.
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation
with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[23] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C.-C. J. Kuo,
‘‘Contextual-based image inpainting: Infer, match, and translate,’’ in Proc.
Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[24] M. Jaderberg, K. Simonyan, and A. Zisserman, ‘‘Spatial transformer net-
works,’’ in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[25] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, ‘‘View synthesis
by appearance flow,’’ in Proc. Eur. Conf. Comput. Vis., 2016, pp. 286–301.
[26] K. Sohn, H. Lee, and X. Yan, ‘‘Learning structured output representation
using deep conditional generative models,’’ in Proc. Adv. Neural Inf.
Process. Syst., 2015, pp. 3483–3491.
[27] J. Walker, C. Doersch, A. Gupta, and M. Hebert, ‘‘An uncertain future:
Forecasting from static images using variational autoencoders,’’ in Proc.
Eur. Conf. Comput. Vis., 2016, pp. 835–851.
[28] Z. Chen, S. Nie, T. Wu, and C. G. Healey, ‘‘High resolution face completion
with multiple controllable attributes via fully end-to-end progressive gener-
ative adversarial networks,’’ 2018, arXiv:1801.07632. [Online]. Available:
http://arxiv.org/abs/1801.07632
[29] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang, ‘‘Free-form image
inpainting with gated convolution,’’ in Proc. IEEE/CVF Int. Conf. Comput.
Vis. (ICCV), Oct. 2019, pp. 4471–4480.
[30] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, ‘‘What makes
Paris look like Paris?’’ Commun. ACM, vol. 58, no. 12, pp. 103–110,
Nov. 2015.
[31] D. P. Kingma and M. Welling, ‘‘Auto-encoding variational Bayes,’’ 2013,
arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, ‘‘StarGAN:
Unified generative adversarial networks for multi-domain Image-to-Image
translation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Jun. 2018, pp. 8789–8797.
[33] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, ‘‘Places: A 10
million image database for scene recognition,’’ IEEE Trans. Pattern Anal.
Mach. Intell., vol. 40, no. 6, pp. 1452–1464, Jun. 2018.
[34] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson,
and M. N. Do, ‘‘Semantic image inpainting with deep generative
models,’’ 2016, arXiv:1607.07539. [Online]. Available: http://arxiv.org/
abs/1607.07539
[35] M.-C. Sagong, Y.-G. Shin, S.-W. Kim, S. Park, and S.-J. Ko, ‘‘PEPSI:
Fast image inpainting with parallel decoding network,’’ in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 11360–11368.
[36] Y. Zeng, J. Fu, H. Chao, and B. Guo, ‘‘Learning pyramid-context encoder
network for high-quality image inpainting,’’ in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1486–1494.
[37] M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein GAN,’’ 2017,
arXiv:1701.07875. [Online]. Available: http://arxiv.org/abs/1701.07875
[38] Z. Huang, X. Xu, J. Ni, H. Zhu, and C. Wang, ‘‘Multimodal representation
learning for recommendation in Internet of Things,’’ IEEE Internet Things
J., vol. 6, no. 6, pp. 10675–10685, Dec. 2019.
[39] S. Iizuka, E. Simo-Serra, and H. Ishikawa, ‘‘Globally and locally consistent
image completion,’’ ACM Trans. Graph., vol. 36, no. 4, pp. 1–14, Jul. 2017.
[40] J. Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and
E. Shechtman, ‘‘Toward multimodal image-to-image translation,’’ in Proc.
Adv. Neural Inf. Process. Syst., 2017, pp. 465–476.
[41] M. Abadi et al., ‘‘TensorFlow: Large-scale machine learning on heteroge-
neous distributed systems,’’ 2016, arXiv:1603.04467. [Online]. Available:
http://arxiv.org/abs/1603.04467
[42] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic opti-
mization,’’ 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.
org/abs/1412.6980
[43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ‘‘Image quality
assessment: From error visibility to structural similarity,’’ IEEE Trans.
Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[44] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, ‘‘The unrea-
sonable effectiveness of deep features as a perceptual metric,’’ in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586–595.
[45] R. Y. Zhang, P. Isola, and A. A. Efros, ‘‘Colorful image colorization,’’ in
Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 649–666.
[46] J. Johnson, A. Alahi, and L. Fei-Fei, ‘‘Perceptual losses for real-time style
transfer and super-resolution,’’ in Proc. Eur. Conf. Comput. Vis., 2016,
pp. 694–711.
[47] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
‘‘Improved training of Wasserstein GANs,’’ in Proc. Adv. Neural Inf. Process.
Syst., 2017, pp. 5767–5777.
[48] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi, ‘‘EdgeCon-
nect: Generative image inpainting with adversarial edge learning,’’ 2019,
arXiv:1901.00212. [Online]. Available: http://arxiv.org/abs/1901.00212
[49] Z. Huang, X. Xu, H. Zhu, and M. Zhou, ‘‘An efficient group recommen-
dation model with multiattention-based neural networks,’’ IEEE Trans.
Neural Netw. Learn. Syst., to be published.
WEIWEI CAI (Student Member, IEEE) is currently
pursuing the master's degree with the
Central South University of Forestry and Technology,
Changsha, China. Prior to that, he worked
in the IT industry for more than ten years in the
roles of Enterprise Architect and Program
Manager. His research interests include machine
learning, deep learning, and computer vision.
ZHANGUO WEI received the Ph.D. degree from
the School of Technology, Beijing Forestry University,
Beijing, China. He has been working as
an Associate Professor with the Central South
University of Forestry and Technology, Changsha,
China, since 2011. His research interests include
information retrieval, data mining, big data, and
deep learning. Most of his work promotes the
development of logistics engineering through
applications of data mining and deep learning.