SPECIAL SECTION ON GIGAPIXEL PANORAMIC VIDEO WITH VIRTUAL REALITY
Received February 25, 2020, accepted March 5, 2020, date of publication March 9, 2020, date of current version March 18, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2979348
PiiGAN: Generative Adversarial Networks for
Pluralistic Image Inpainting
WEIWEI CAI, (Student Member, IEEE), AND ZHANGUO WEI
School of Logistics and Transportation, Central South University of Forestry and Technology, Changsha 410004, China
Corresponding author: Zhanguo Wei (t20110778@csuft.edu.cn)
This work was supported by the Hunan Key Laboratory of Intelligent Logistics Technology under Grant 2019TP1015.
ABSTRACT The latest methods based on deep learning have achieved amazing results regarding the
complex work of inpainting large missing areas in an image. But this type of method generally attempts
to generate one single ‘‘optimal’’ result, ignoring many other plausible results. Considering the uncertainty
of the inpainting task, one sole result can hardly be regarded as a desired regeneration of the missing area.
In view of this weakness, which is related to the design of the previous algorithms, we propose a novel
deep generative model equipped with a brand new style extractor which can extract the style feature (latent
vector) from the ground truth. Once obtained, the extracted style feature and the ground truth are both
input into the generator. We also craft a consistency loss that guides the generated image to approximate
the ground truth. After iterations, our generator is able to learn the mapping of styles corresponding to
multiple sets of vectors. The proposed model can generate a large number of results consistent with the
context semantics of the image. Moreover, we evaluated the effectiveness of our model on three datasets,
i.e., CelebA, PlantVillage, and MauFlex. Compared to state-of-the-art inpainting methods, this model is able
to offer desirable inpainting results with both better quality and higher diversity. The code and model will
be made available on https://github.com/vivitsai/PiiGAN.
INDEX TERMS Deep learning, generative adversarial networks, image inpainting, diversity inpainting.
I. INTRODUCTION
Image inpainting requires a computer to fill in the missing area of an image according to the information found in the image itself or the area around the missing region, thus creating a plausible final inpainted image. However, when the missing area of an image is too large, the uncertainty of the inpainting results increases greatly. For example, when inpainting a face image, the eyes may look in different directions, and glasses may or may not be present. Although a single inpainting result may seem reasonable, it is difficult to determine whether this result meets our expectation when it is the only option. Driven by this observation, we aim to produce a variety of plausible results for a single missing region, which we call pluralistic image inpainting (as shown in Figure 1).
Early research [2] attempted to carry out image inpainting using the classical texture synthesis method, that is, by sampling similar pixel blocks from the undamaged area of the image to fill the area to be completed. However, the premise of these methods is that similar patches can be sampled
from the undamaged area. When the area to be inpainted contains complex, non-repetitive structures (such as faces), these methods obviously cannot work, as they cannot capture high-level semantics. The vigorous development of deep generative models has promoted recent related research [16], [39], which encodes the image into a high-dimensional latent space and then decodes the features into a complete inpainted image. Unfortunately, because the receptive field of a convolutional neural network is too small to borrow information from distant spatial locations effectively, these CNN-based approaches typically generate boundary shadows, distorted results, and blurred textures inconsistent with the surrounding region.
region. Recently, some works [34], [36] used spatial attention
to recover the lost area using the surrounding image features
as reference. These methods ensure the semantic consistency
between the generated content and the context information.
However, these existing methods attempt to inpaint a unique ''optimal'' result and are unable to generate a variety of valuable and plausible results.
In order to obtain multiple diverse results, many methods based on the CVAE [26] have been produced [14], [27], but these methods are limited to specific fields, require targeted attributes, and may result in unreasonable images being generated.
FIGURE 1. Examples of the inpainting results of our method on a face, leaf, and rainforest image (the missing regions are shown in white). The left is the masked input image, while the right is the diverse and plausible direct output of our trained model without any postprocessing.
To achieve better diversity of inpainting results, we add a new extractor to the generative adversarial network (GAN) [19], which is used to extract the style feature of both the ground truth image from the training set and the fake image generated by the generator. In contrast, the encoder in CVAE-GAN [14] feeds the features extracted from the ground truth image directly into the generator. When the conditioning label is itself a masked image, there is usually only one training sample matching each label, so the generated results have very limited variation.
We propose a novel deep generative model-based approach. In each round of iterative training, the extractor first extracts the style feature of a ground truth image from the training set and inputs it to the generator together with the ground truth. We use an L1 consistency loss to force the generated image to be as close as possible to the ground truth. At the same time, we input random vectors and a masked image to the generator to obtain an output fake image, and use the L1 consistency loss to make the style feature extracted from it as close as possible to the input vectors. After iterating, the generator can learn the mapping of styles corresponding to multiple sets of input vectors. We also minimize a KL (Kullback-Leibler) loss to reduce the gap between the prior distribution and the posterior distribution of the latent vectors extracted by the extractor.
We experimented on the open datasets CelebA [33],
PlantVillage, and MauFlex [1]. Both quantitative and qual-
itative tests show that our model can generate not only higher
quality results, but also a variety of plausible results. In addi-
tion, our model has practical application value in various
fields such as art restoration, real-time inpainting of large-
area missing images, and facial micro-reshaping.
The main contributions of this work are summarized as follows:
• We propose PiiGAN, a novel generative adversarial network for pluralistic image inpainting that not only delivers higher quality results, but also produces a variety of realistic and reasonable outputs.
• We design a new extractor to improve the GAN. The extractor extracts the style vectors of the training samples in each iteration and introduces a consistency loss to guide the generator to learn a variety of styles that match the semantics of the input image.
• We validate that our model can inpaint the same missing region with multiple results that are plausible and consistent with the high-level semantics of the image, and we evaluate the effectiveness of our model on multiple datasets.
The rest of this paper is organized as follows. Section 2 pro-
vides related work on image inpainting. In addition, some
existing studies on conditional image generation are intro-
duced. In Section 3, we elaborate on the proposed model of
pluralistic image inpainting (PiiGAN). Section 4 provides an
evaluation. Finally, conclusions are given in Section 5.
II. RELATED WORK
A. IMAGE INPAINTING BY TRADITIONAL METHODS
Traditional diffusion-based methods use the edge information of the area to be inpainted to determine the direction of diffusion and propagate the known information toward the missing region. Examples include the variational method of Ballester et al. [5], a histogram-based statistical method using local features [6], and the fast marching method based on level sets proposed by Telea [7]. However, this kind of method can only inpaint small-scale missing areas. In contrast to diffusion-based technologies, patch-based methods can perform texture synthesis [6], [7]: they sample similar patches from undamaged areas and paste them into the missing areas. Bertalmio et al. [4] proposed a method that simultaneously fills in texture and structure in regions with missing image information, and Duan et al. [9] proposed a method that uses local patch statistics to complete the image.
However, these methods usually generate distorted structures
and unreasonable textures.
Xu and Sun [8] proposed a typical inpainting method
which involves investigating the spatial distribution of image
patches. This method can better distinguish the structure and
texture, thus forcing the new patched area to become clear
and consistent with the surrounding texture. Ting et al. [10]
proposed a global region filling algorithm based on Markov
random field energy minimization, which pays more attention
to the context rationality of texture. However, the computational complexity of this method is high. Wu et al. [11] put
forward a fast approximate nearest neighbor algorithm called
PatchMatch, which can be used for advanced image editing.
Shao et al. [12] put forward an algorithm based on the Poisson
equation to decompose the image into texture and structure,
which is effective in large-area completion. However, these
methods can only obtain low-level features, and the obvious
limitation is that they only extract texture and structure from
the input image. If no texture can be found in the input image,
these methods have a very limited effect and do not generate
semantically reasonable results.
B. IMAGE INPAINTING BY DEEP GENERATIVE MODELS
Recently, the use of deep generative models to inpaint images has yielded exciting results, and image inpainting with generative adversarial networks (GANs) [19] has gained significant attention. Early works [13], [15] trained CNNs for image denoising and restoration. The deep generative model named Context Encoder, proposed by Pathak et al. [16], can be used for semantic inpainting tasks. CNN-based inpainting was extended to large masks, and a context encoder based on a generative adversarial network (GAN) was proposed to inpaint the learned features [18]. A guide loss was introduced to make the feature map generated in the decoder as close as possible to the feature map of the ground truth generated in the encoding process. Iizuka et al. [39] improved the image completion effect by introducing local and global discriminators as adversarial losses. The global discriminator checks the whole image to evaluate its overall consistency, while the local discriminator only checks a small area to ensure the local consistency of the generated patch. Iizuka et al. [39] also applied dilated convolutions to enlarge the receptive field. However, this method needs a lot of computational resources. For this reason, Sagong et al. [35] proposed a structure (PEPSI) composed of a single shared encoding network and a parallel decoding network with coarse and inpainting paths, which reduces the number of convolution operations.
Recently, some works [20], [23], [29] have proposed the use
of spatial attention [24], [25] to obtain high-frequency details.
Yu et al. [20] proposed a contextual attention layer, which fills the missing pixels with similar patches from undamaged areas. Isola et al. [22] tried to solve the problem of image restoration using a general image translation model. Using advanced semantic feature learning, deep generative models can generate semantically consistent results for the missing areas. However, it is still very difficult to generate realistic results from the remaining latent features. Other work [3], [11],
[38], [49] also explores related applications.
C. CONDITIONAL IMAGE GENERATION
Building on the VAE [31] and GAN [19], conditional image generation has been widely applied to tasks such as 3D modeling, image translation, and style generation. Sohn et al. [26] used stochastic inference to generate diverse but realistic outputs based on a deep conditional generative model with Gaussian latent variables. The conditional variational autoencoder proposed by Walker et al. [27] can generate a variety of different predictions of the future. Later, the variational autoencoder was combined with a generative adversarial network to generate images of a specific class by feeding fine-grained class labels into the generative model. In [28], different facial image restorations are achieved by specifying particular attributes (such as male and smile). However, this method is limited to specific domains and requires specific attributes.
III. PROPOSED APPROACH
We built our pluralistic image inpainting network based
on the current state-of-the-art image inpainting model [20],
which has shown exciting results in terms of inpainting face,
leaf, and rainforest images. However, similar to other existing
methods [1], [20], [21], [36], [48], classic image completion
methods attempt to inpaint missing regions of the original
image in a deterministic manner, thus only producing a single
result. Instead, our goal was to generate multiple reasonable
results.
A. EXTRACTOR
Figure 3 shows the extractor network architecture of our proposed method. It has four convolutional layers, one flatten layer, and two parallel fully connected layers. Each convolutional layer uses the ELU activation function. All the convolutional layers use a stride of 2 × 2 pixels and 5 × 5 kernels to reduce the image resolution while increasing the number of output filters. The two fully connected layers in the extractor both output z_var; one of them is dedicated to the KL loss, and the other is input to the generator together with the extracted latent vector Z_r.
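To make this layout concrete, the following is a minimal PyTorch sketch of such an extractor: four stride-2, 5 × 5 convolutions with ELU activations, a flatten, and two parallel fully connected heads producing the mean and log-variance of the style feature. The channel widths and latent dimension are illustrative assumptions rather than the exact values of Table 6, and this is a sketch, not the authors' TensorFlow implementation.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Style-feature extractor: four stride-2 5x5 conv layers (ELU) followed by
    two parallel fully connected heads for the mean and log-variance."""
    def __init__(self, in_channels=3, latent_dim=8, image_size=128):
        super().__init__()
        chs = [32, 64, 128, 256]            # illustrative channel widths
        layers, prev = [], in_channels
        for c in chs:
            layers += [nn.Conv2d(prev, c, kernel_size=5, stride=2, padding=2), nn.ELU()]
            prev = c
        self.conv = nn.Sequential(*layers)
        feat = prev * (image_size // 16) ** 2   # four stride-2 layers halve H and W four times
        self.fc_mu = nn.Linear(feat, latent_dim)      # mean of the style feature
        self.fc_logvar = nn.Linear(feat, latent_dim)  # log-variance used by the KL term

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        return self.fc_mu(h), self.fc_logvar(h)
```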
Let $I_{gt}$ be the ground truth image; the extractor and the style feature extracted from $I_{gt}$ are denoted by $E$ and $Z_r$, respectively. We use the ground truth image $I_{gt}$ as the input, and $Z_r$ is the latent vector extracted from $I_{gt}$ by the extractor:
$$Z_r^{(i)} = E\bigl(I_{gt}^{(i)}\bigr) \tag{1}$$
Let $I_{cf}$ be the fake image generated by the generator; the extractor and the style feature extracted from $I_{cf}$ are denoted by $E$ and $Z_f$, respectively. We use the fake image $I_{cf}$ as the input, and $Z_f$ is the latent vector extracted from $I_{cf}$ by the extractor:
$$Z_f^{(i)} = E\bigl(I_{cf}^{(i)}\bigr) \tag{2}$$
The extractor extracts the style feature of each training sample and outputs its mean and covariance, i.e., $\mu$ and $\sigma$. Similar to VAEs, the KL loss is used to narrow the gap between the prior $p_\theta(z)$ and the Gaussian posterior $q_\phi(z|I)$.
Let the prior over the latent vector $z$ be the centered isotropic multivariate Gaussian $p_\theta(z) = \mathcal{N}(z; 0, I)$. Assume $p_\theta(I|z)$ is a multivariate Gaussian whose distribution parameters are computed from $z$ with the extractor network. We assume that the true posterior takes an approximately Gaussian form with approximately diagonal covariance:
$$\log q_\phi(z|I) = \log \mathcal{N}\bigl(z; \mu, \sigma^2 I\bigr) \tag{3}$$
Let $\sigma$ and $\mu$ denote the variational standard deviation and mean evaluated at datapoint $i$, and let $\mu_j$ and $\sigma_j$ denote the $j$-th elements of these vectors. Then, the KL divergence between the posterior distribution $q_\phi(z|I^{(i)})$ and $p_\theta(z) = \mathcal{N}(z; 0, I)$ can be computed as
DKL qφ(z)||pθ(z)=Zqθ(z)(log pθ(z)log qθ(z))dz
(4)
According to our assumptions, the prior pθ(z)=N(z;0,I)
and the posterior approximation qφ(z|I)are Gaussian. Thus
we have
Zqθ(z) log qθ(z)dz =ZNz;µj, σ 2
jlog Nz;µj, σ 2
jdz
= J
2log(2π)1
2
J
X
j=11+log(σ2
j)
(5)
and
Zqθ(z) log p(z)dz =ZNz;µj, σ 2
jlog N(z;0,I)dz
= J
2log(2π)1
2
J
X
j=1µ2
j+σ2
j
(6)
Finally, we can obtain
DKL qφ(z)||pθ(z)
= 1
2
J
X
j=1
log(2π)+1
2
J
X
j=11+log(σ2
j)
[1
2
J
X
j=1
log(2π)+1
2
J
X
j=1µ2
j+σ2
j]
=1
2
J
X
j=11+log(σ2
j)µ2
jσ2
j(7)
where the mean and standard deviation of the approximate posterior, $\mu$ and $\sigma$, are outputs of the extractor $E$, i.e., nonlinear functions of the generated sample $x^{(i)}$ and the variational parameters $\phi$. After this, the latent vector $z \sim q_\phi(z|x)$ is sampled using $g_\phi(x, \epsilon) = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ is an element-wise product. The obtained latent vector $z$ is fed into the generator together with the masked input image.
The outputs of the generator are processed by the extractor $E$ again to obtain the style feature, which is applied to another masked input image.
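The reparameterized sampling $g_\phi(x, \epsilon) = \mu + \sigma \odot \epsilon$ and the closed-form KL term of Eq. (7) can be written compactly; the snippet below is a generic VAE-style sketch in PyTorch, not the authors' code.

```python
import torch

def sample_latent(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)), cf. Eq. (7), summed over latent dims
    and averaged over the batch."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
```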
B. PLURALISTIC IMAGE INPAINTING NETWORK: PiiGAN
Figure 2 shows the network architecture of our proposed model. We add a novel network after the generator, which we name the extractor; it is responsible for extracting the latent vector $z$. We concatenate an image whose missing regions are filled with white pixels with randomly generated vectors as input, and the generator outputs the inpainted fake image ($I_{cf}$). We input the fake image generated by the generator into the extractor to extract its style feature. At the same time, a ground truth image is randomly sampled from the training set and passed through the extractor to obtain its style feature, and the ground truth image concatenated with the mask is input into the generator to obtain the generated image $I_{cr}$. We want the difference between the extracted style feature and the input vectors to be as small as possible. It is also desirable that the generated image $I_{cr}$ be as close as possible to the ground truth image $I_{gt}$, so that the parameters and weights of the generator network can be continuously updated. Here, we propose using the L1 loss function to minimize the sum of the absolute differences between the target value and the estimated value, because the minimum absolute deviation method is more robust than the least squares method.
FIGURE 2. The architecture of our model. It consists of three modules: a generator G, an extractor E, and two discriminators D. (a) G takes in both the
image with white holes and the style feature as inputs and generates a fake image. The style feature is spatially replicated and concatenated with the
input image. (b) E is used to extract the style feature of the input image. (c) The global discriminator [39] identifies the entire image, while the local
discriminator [39] only discriminates the inpainting regions of the generator output.
FIGURE 3. The architecture of our extractor network.
1) CONSISTENCY LOSS
Since the perceptual loss [46] cannot directly optimize the convolutional layers or ensure consistency between the feature maps produced by the generator and the extractor, we adjust the form of the perceptual loss and propose a consistency loss to handle this problem. As shown in Figure 3, we use the extractor to extract a high-level style space from the ground truth image. Our model also auto-encodes the visible inpainting results deterministically, and the loss function needs to suit this inpainting task. Therefore, the per-instance loss here is
$$\mathcal{L}_c^{e,(i)} = \bigl\| I_{cr}^{(i)} - I_{gt}^{(i)} \bigr\|_1 \tag{8}$$
where $I_{cr}^{(i)} = G(Z_r^{(i)}, \tilde{I}_m)$ and $I_{gt}^{(i)}$ are the completed and ground truth images, respectively, and $\tilde{I}_m$ denotes the masked input. $G$ is the generator and $E$ is our extractor; $z_r$ is the latent vector extracted by the extractor, which we call the style feature, $z_r = E(I_{gt}^{(i)})$. For the separate generative path, the per-instance loss is
$$\mathcal{L}_c^{g,(i)} = \bigl\| I_{cf}^{(i)} - I_{raw}^{(i)} \bigr\|_1 \tag{9}$$
where $I_{cf}^{(i)} = G(Z_f^{(i)}, \tilde{I}_m)$ and $I_{raw}^{(i)}$ are the fake image completed by the generator and the input raw image, respectively.
2) ADVERSARIAL LOSS
To enhance the training process and to inpaint higher quality
images, Gulrajani et al. [47] proposed using gradient penalty
terms to improve the Wasserstein GAN [37].
$$\mathcal{L}_{adv}^{G} = \mathbb{E}_{I_{raw}}\bigl[D(I_{raw})\bigr] - \mathbb{E}_{I_{raw}, z}\bigl[D(G(I_{cm}, z))\bigr] - \lambda\,\mathbb{E}_{\hat{I}}\Bigl[\bigl(\bigl\|\nabla_{\hat{I}} D(\hat{I})\bigr\|_2 - 1\bigr)^2\Bigr] \tag{10}$$
where $\hat{I}$ is sampled uniformly along a straight line between a pair of generated and input raw images. We used $\lambda = 10$ for all experiments.
For the image completion task, we only attempt to inpaint the missing regions, so for the local discriminator we only apply the gradient penalty [47] to the pixels in the missing area. This can be achieved by multiplying the gradient by the input mask $m$ as follows:
$$\mathcal{L}_{adv}^{L} = \mathbb{E}_{I_{raw}}\bigl[D(I_{raw})\bigr] - \mathbb{E}_{I_{raw}, z}\bigl[D(G(I_{cm}, z))\bigr] - \lambda\,\mathbb{E}_{\hat{I}}\Bigl[\bigl(\bigl\|\nabla_{\hat{I}} D(\hat{I}) \odot (1 - m)\bigr\|_2 - 1\bigr)^2\Bigr] \tag{11}$$
where the mask value is 0 for pixels in the missing region and 1 for all other locations.
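A hedged PyTorch sketch of the masked gradient penalty in Eq. (11) is given below: it interpolates between real and generated images and penalizes the discriminator gradient only inside the hole, following the masking convention above (mask value 0 in the missing region). This is an illustration, not the authors' implementation.

```python
import torch

def masked_gradient_penalty(D, real, fake, mask, lam=10.0):
    """WGAN-GP penalty restricted to the hole region, cf. Eq. (11)."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grads = grads * (1.0 - mask)                 # keep gradients inside the missing region only
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```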
3) DISTRIBUTIVE REGULARIZATION
The KL divergence term serves to adjust the learned importance sampling function $q_\phi(z|I_{gt})$ towards a fixed latent prior $p(z_r)$. With both defined as Gaussians, we get
$$\mathcal{L}_{KL}^{e,(i)} = -\mathrm{KL}\bigl(q_\phi(z_r \mid I_{gt}^{(i)}) \,\|\, \mathcal{N}(0, \sigma^2(n)\,I)\bigr) \tag{12}$$
For the fake image output by the generator, the learned importance sampling function $q_\phi(z|I_{cf})$ is likewise regularized towards a fixed Gaussian prior $p(z_f)$:
$$\mathcal{L}_{KL}^{g,(i)} = -\mathrm{KL}\bigl(q_\phi(z_f \mid I_{cf}^{(i)}) \,\|\, \mathcal{N}(0, \sigma^2(n)\,I)\bigr) \tag{13}$$
4) OBJECTIVE
Combining the KL, consistency, and adversarial losses obtained above, the overall objective of our diversity inpainting network is defined as
$$\mathcal{L} = \alpha_{KL}\bigl(\mathcal{L}_{KL}^{e} + \mathcal{L}_{KL}^{g}\bigr) + \alpha_{c}\bigl(\mathcal{L}_{c}^{e} + \mathcal{L}_{c}^{g}\bigr) + \alpha_{adv}\bigl(\mathcal{L}_{adv}^{G} + \mathcal{L}_{adv}^{L}\bigr) \tag{14}$$
where $\alpha_{KL}$, $\alpha_{c}$, and $\alpha_{adv}$ are the tradeoff parameters for the KL, consistency, and adversarial losses, respectively.
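As a small illustration, the weighted sum of Eq. (14) can be assembled as follows; the default weights are the values reported in Section IV-A, and the function is otherwise a placeholder sketch.

```python
def total_loss(l_kl_e, l_kl_g, l_c_e, l_c_g, l_adv_g, l_adv_l,
               a_kl=10.0, a_c=0.9, a_adv=1.0):
    """Overall objective of Eq. (14) with the tradeoff weights of Sec. IV-A."""
    return (a_kl * (l_kl_e + l_kl_g)
            + a_c * (l_c_e + l_c_g)
            + a_adv * (l_adv_g + l_adv_l))
```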
C. TRAINING
For training, given a ground truth image $I_{gt}$, we use our proposed extractor to extract the style feature of the ground truth and then concatenate the style feature with the masked ground truth image. This is input to the generator $G$ to obtain a predicted output image $I_{cr}$, and $I_{cr}$ is forced to be as close as possible to $I_{gt}$ through the L1 consistency loss, which updates the parameters and weights of the generator. At the same time, we sample an image $I_{raw}$ from the training data, generate a mask and random vectors for $I_{raw}$, and concatenate them as input to the generator $G$ to obtain the predicted output image $I_{cf}$. We then use our proposed extractor to extract the latent vector $z_f$ of the generated image and force $z_f$ to be as close to $z$ as possible, again updating the generator with the L1 consistency loss.
IV. EXPERIMENTS AND RESULTS
We evaluated our proposed model on three open datasets: CelebA faces [33], PlantVillage, and MauFlex [1]. The PlantVillage dataset is publicly available, and we manually downloaded all training and test set images from the PlantVillage page (https://plantvillage.org). MauFlex is also an open dataset, published by Morales et al. [1]. The three datasets contain roughly 200k, 45k, and 25k images, respectively. We randomly divided each dataset into training and test sets, with 15% of the data used for testing. Since our method can inpaint countless results, we generated 100 images for each image with missing regions and selected 10 of them, each with different high-level semantic features. We compared the results with current state-of-the-art methods both quantitatively and qualitatively.
Algorithm 1 Training Procedure of Our Proposed Model
1: while G has not converged do
2:   for i = 1 → n do
3:     Input ground truth images I_gt;
4:     Get style feature by extractor: Z_r ← E(I_gt);
5:     Concatenate inputs: Ĩ_rm ← Z_r ⊕ (I_gt ⊙ m);
6:     Get predicted outputs: I_cr ← G(Z_r, I_gt) ⊙ (1 − M);
7:     Update the generator G with the L1 loss (I_cr, I_gt);
8:     Meanwhile,
9:     Sample image I_raw from the training set;
10:    Generate white mask m for I_raw;
11:    Generate random vectors z for I_raw;
12:    Concatenate inputs: Ĩ_cm ← I_raw ⊕ m ⊕ z;
13:    Get predictions: I_cf ← G(I_raw, z);
14:    Get style feature by extractor: z_f ← E(I_cf);
15:    Update the generator G with the L1 loss (z_f, z);
16:  end for
17: end while
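The sketch below mirrors one iteration of Algorithm 1 in PyTorch-style code (discriminator updates omitted). It reuses the illustrative Extractor and sample_latent helpers sketched earlier and simplifies the concatenation and masking details, so it should be read as an approximation of the procedure rather than the released code.

```python
import torch
import torch.nn.functional as F

def train_step(G, E, opt_g, i_gt, i_raw, mask, latent_dim=8):
    """One generator update following Algorithm 1 (discriminators omitted)."""
    # Reconstruction path (lines 3-7): extract the style from the ground truth,
    # inpaint the masked ground truth, and pull the result towards I_gt.
    mu, logvar = E(i_gt)
    z_r = sample_latent(mu, logvar)          # helper from the earlier sketch
    i_cr = G(i_gt * mask, z_r)               # mask == 0 inside the hole
    loss_rec = F.l1_loss(i_cr, i_gt)

    # Generative path (lines 9-15): inpaint with a random style vector, then
    # re-extract the style from the fake image and pull it back towards z.
    z = torch.randn(i_raw.size(0), latent_dim, device=i_raw.device)
    i_cf = G(i_raw * mask, z)
    mu_f, _ = E(i_cf)
    loss_style = F.l1_loss(mu_f, z)

    opt_g.zero_grad()
    (loss_rec + loss_style).backward()
    opt_g.step()
    return loss_rec.item(), loss_style.item()
```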
Our method was compared to the following:
• CA: Contextual Attention, proposed by Yu et al. [20]
• SH: Shift-Net, proposed by Yan et al. [17]
• GL: Global and Local, proposed by Iizuka et al. [39]
A. IMPLEMENTATION DETAILS
Our diversity-generation network was inspired by recent works [20], [39], but with several significant modifications, including the extractor. Our inpainting network, implemented in TensorFlow [41], contains 47 million trainable parameters and was trained on a single NVIDIA 1080 GPU (8 GB) with a batch size of 12. We used TensorBoard to visualize the training process and monitor its parameters in real time. Training the CelebA [33], PlantVillage, and MauFlex [1] models took roughly 3 days, 2 days, and 1 day, respectively.
To evaluate our method fairly, we only conducted experiments with a centered hole. We compared our method with GL [39], CA [20], and SH [17] on images from the CelebA [33], PlantVillage, and MauFlex [1] validation sets. All masked images were resized to 128 × 128 for training and testing. We used the Adam optimizer [42] with a learning rate of 2 × 10^{-3}, β1 = 0.5, and β2 = 0.9. The tradeoff parameters were set as α_KL = 10, α_c = 0.9, and α_adv = 1. For the nonlinearities in the network, we used exponential linear units (ELUs) as the activation function in place of the commonly used rectified linear units (ReLUs). We found that ELUs speed up learning by pushing the mean activation towards zero, and they help avoid vanishing gradients because they act as the identity for positive inputs. Compared with ReLUs, ELUs are also more robust to input changes and noise.
FIGURE 4. Comparison of qualitative results with CA [20], SH [17] and GL [39] on the CelebA dataset.
TABLE 1. Results using the CelebA dataset with large missing regions, comparing Global and Local (GL) [39], Shift-Net (SH) [17], Contextual Attention (CA) [20], and our method. Lower is better. + Higher is better.
B. QUANTITATIVE COMPARISONS
Quantitative measurement is difficult for the pluralistic image inpainting task, as our aim is to generate diverse and plausible results for an image with missing regions, and comparisons should not be made based solely on a single inpainting result.
However, solely for the purpose of obtaining quantitative indicators, we randomly selected a single sample from our set of results that was close to the ground truth image and chose the best balance of quantitative indicators for comparison. The comparison was conducted on 10,000 CelebA [33] test images, with quantitative measures of mean L1 loss, L2 loss, Peak Signal-to-Noise Ratio (PSNR), and Structural SIMilarity (SSIM) [43]. We used a 64 × 64 mask in the center of the image. Table 1 lists the results of the evaluation with the centered mask. It is not difficult to see that our method is superior to all other methods in terms of these quantitative measures.
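For reference, the centered 64 × 64 mask and the reported metrics can be computed as in the sketch below, which uses scikit-image (the channel_axis argument assumes a recent version) and assumes images scaled to [0, 1]; this is not the evaluation script used for Table 1.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def center_mask(h=128, w=128, hole=64):
    """Binary mask with a hole x hole region of zeros in the image center."""
    m = np.ones((h, w), dtype=np.float32)
    top, left = (h - hole) // 2, (w - hole) // 2
    m[top:top + hole, left:left + hole] = 0.0
    return m

def inpainting_metrics(pred, gt):
    """Mean L1 / L2 error, PSNR, and SSIM for H x W x C images in [0, 1]."""
    l1 = np.mean(np.abs(pred - gt))
    l2 = np.mean((pred - gt) ** 2)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return l1, l2, psnr, ssim
```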
C. QUALITATIVE COMPARISONS
We first evaluated our proposed method on the CelebA [33] face dataset; Figure 4 shows the inpainting results with large missing regions, highlighting the diversity of the output of our model, especially in terms of high-level semantics. GL [39] can produce more natural images by using local and global discriminators to keep the image consistent. SH [17] improves on feature copying, but its predictions are somewhat blurry and missing details. In contrast, our method not only produces clearer and more plausible images, but also provides completion results with multiple attributes.
As shown in Figures 5 and 6, we also evaluated our approach on the MauFlex [1] and PlantVillage datasets to demonstrate the diversity of our output across different datasets. Contextual Attention (CA) [20], while producing reasonable completion results in many cases, can only produce a single result, and in some cases a single solution is not enough. Our model produces a variety of reasonable results.
Finally, Figure 4 shows the various facial attribute results on the CelebA [33] dataset. We observed that existing models, such as GL [39], SH [17], and CA [20], can only generate a single set of facial attributes for each masked input. The results of our method on these test data provide higher visual quality and a variety of attributes, such as the gaze angle of the eyes, whether or not glasses are worn, and the lesion location on the leaf. This is clearly better for image completion.
FIGURE 5. Comparison of qualitative results with Contextual Attention (CA) [20] on the MauFlex dataset.
FIGURE 6. Comparison of qualitative results with Contextual Attention (CA) [20] on the PlantVillage dataset.
FIGURE 7. Our method (top), StarGAN [32] (middle), and BicycleGAN [40] (bottom).
D. OTHER COMPARISONS
We investigated the influence of using our proposed extractor by comparing against some existing methods (BicycleGAN [40] and StarGAN [32]). We used common parameters to train all three models. As shown in Figure 7, the output of BicycleGAN [40] was not good and the generated results were not natural. For StarGAN [32], although it can output a variety of results, the method is limited to specific targeted attributes for training, such as gender, age, happy, angry, etc.
TABLE 2. We measure diversity using the average LPIPS [44] distance.
TABLE 3. Quantitative comparisons of realism.
1) DIVERSITY
In Table 2, we use the LPIPS metric proposed by [44] to calculate diversity scores. For each approach, we calculated the average distance between 10,000 pairs randomly generated from 1000 center-masked image samples. I_global and I_local denote the full inpainting results and the mask-region inpainting results, respectively. It is worth emphasizing that although BicycleGAN [40] obtained relatively high diversity scores, this may indicate that unreasonable images were generated, resulting in worthless variation.
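A sketch of this diversity measurement using the publicly available lpips package is shown below; it averages pairwise LPIPS distances over a set of inpainted samples, with the pair sampling simplified relative to the paper's 10,000-pair protocol.

```python
import itertools
import torch
import lpips  # pip install lpips

def average_lpips(samples, net='alex'):
    """Average pairwise LPIPS distance over a list of inpainted samples
    (each an N x 3 x H x W tensor scaled to [-1, 1]); higher means more diverse."""
    loss_fn = lpips.LPIPS(net=net)
    with torch.no_grad():
        dists = [loss_fn(a, b).mean().item()
                 for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)
```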
2) REALISM
Table 3 shows the realism scores across methods. In [45], and later in [22], human judgment was used to evaluate the visual realism of the output of such models. We likewise presented a variety of images generated by our model to human judges in random order, for one second each, asking them to judge whether each image was real or generated, and measured the ''spoofing'' rate. The pix2pix × noise model [22] achieved a higher realism score. CVAE-GAN [14] helps to generate diversity, but because the distribution of the learned latent space is unclear, the generated samples were not reasonable. The BicycleGAN [40] model suffered from mode collapse yet still obtained a good realism score. In contrast, our method adds a KL divergence loss on the style feature extracted by the extractor, making the inpainting results more realistic and producing the highest realism score.
V. CONCLUSION
In this paper, we proposed PiiGAN, a novel generative adversarial network with a newly designed style extractor for pluralistic image inpainting tasks. For a single input image with missing regions, our model can generate numerous diverse results with plausible content. Experiments on various datasets have shown that our results are diverse and natural, especially for images with large missing areas. Our model can also be applied in fields such as art restoration, facial micro-reshaping, and image augmentation. In future work, we will further study the inpainting of large irregular missing areas.
TABLE 4. Results using the PlantVillage dataset with large missing regions, comparing GL [39], SH [17], CA [20], and our method. Lower is better. + Higher is better.
TABLE 5. Results using the MauFlex dataset with large missing regions, comparing GL [39], SH [17], CA [20], and our method. Lower is better. + Higher is better.
APPENDIXES
APPENDIX A
MORE COMPARISONS RESULTS
More quantitative comparisons with CA [20], SH [17], and
GL [39] on the CelebA [33], PlantVillage, and MauFlex [1]
datasets were also conducted. Table 4 and Table 5 list the
evaluation results on the PlantVillage and MauFlex datasets,
respectively. It is obvious that our model is superior to current
state-of-the-art methods on multiple datasets.
APPENDIX B
NETWORK ARCHITECTURE
As a supplement to the content in Section III, in the following,
we elaborate on the design of the proposed extractor. The spe-
cific architectural design of our proposed extractor network is
shown in Table 6. We use the ELU activation function after each convolutional layer. N is the number of output channels, K is the kernel size, S is the stride size, and n is the batch size.
APPENDIX C
MORE DIVERSE EXAMPLES USING THE CelebA,
PlantVillage, AND MauFlex DATASETS
CelebA: Figure 8 shows the qualitative comparison of the models trained on the CelebA [33] dataset. The direct output of our model shows more valuable diversity than the existing methods. The initial resolution of the CelebA dataset images is 218 × 178. We first randomly cropped the images to a size of 178 × 178 and then resized them to 128 × 128 for both training and evaluation.
FIGURE 8. Additional examples of our model tested on the CelebA dataset. The examples have different genders, skin tones, and eyes. Because a large area of the image is missing, it is impossible to duplicate content from the surrounding regions, so the Contextual Attention (CA) [20] method cannot generate visually realistic results like ours. In addition, our diversity inpainting results show different gaze angles for the eyes and variation in whether glasses are worn or not. It is important to emphasize that we did not apply any attribute labels when training our model.
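A minimal torchvision sketch of the CelebA preprocessing described above (random 178 × 178 crop followed by a 128 × 128 resize) is shown below; any augmentation beyond the crop and resize is an assumption.

```python
from torchvision import transforms

# CelebA preprocessing: 218 x 178 image -> random 178 x 178 crop -> 128 x 128 resize.
celeba_transform = transforms.Compose([
    transforms.RandomCrop(178),
    transforms.Resize(128),
    transforms.ToTensor(),   # C x H x W tensor in [0, 1]
])
```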
FIGURE 9. Additional examples of our model tested on the PlantVillage dataset. Examples have blades of different kinds and colors. Since the existing CA
[20] method cannot find repeated leaf lesions around the missing area, it is difficult to generate a reasonable diseased leaf. Our model is capable of
generating a wide variety of leaves with different lesion locations. In addition, we did not apply any attribute labels when training our model.
PlantVillage: Figure 9 shows the qualitative comparison of the models trained on the PlantVillage dataset. Our model also shows more valuable diversity than existing methods. The PlantVillage dataset is an open dataset whose original image resolutions are irregular. We resized the images to 128 × 128 for training and evaluation.
FIGURE 10. Additional examples of our model tested on the MauFlex [1] dataset. The examples have different tree types. Since the existing CA [20] method cannot find duplicate tree content around the missing area, it is difficult to generate reasonable tree images. Our model is capable of generating a variety of trees in different locations. In addition, we did not apply any attribute labels when training our model.
TABLE 6. The architecture of our extractor network.
MauFlex: Figure 10 shows the qualitative comparison of the models trained on the MauFlex [1] dataset. Our model also shows more valuable diversity than existing methods. The MauFlex dataset is an open dataset published by Morales et al. [1] with an original image resolution of 513 × 513. We resized the images to 128 × 128 for training and evaluation.
REFERENCES
[1] G. Morales, G. Kemper, G. Sevillano, D. Arteaga, I. Ortega, and J. Telles,
‘‘Automatic segmentation of Mauritia flexuosa in unmanned aerial vehicle
(UAV) imagery using deep learning,’Forests, vol. 9, no. 12, p. 736, 2018.
[2] A. A. Efros and T. K. Leung, ‘‘Texture synthesis by non-parametric
sampling,’’ in Proc. 7th IEEE Int. Conf. Comput. Vis., vol. 2, Sep. 1999,
pp. 1033–1038.
[3] Z. Chen, H. Cai, Y. Zhang, C. Wu, M. Mu, Z. Li, and M. A. Sotelo,
‘‘A novel sparse representation model for pedestrian abnormal trajectory
understanding,’Expert Syst. Appl., vol. 138, Dec. 2019, Art. no. 112753.
[4] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher, ‘‘Simultaneous structure
and texture image inpainting,’IEEE Trans. Image Process., vol. 12, no. 8,
pp. 882–889, Aug. 2003.
[5] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, ‘‘Filling-
in by joint interpolation of vector fields and gray levels,’’ IEEE Trans.
Image Process., vol. 10, no. 8, pp. 1200–1211, Aug. 2001.
[6] A. Levin, A. Zomet, and Y. Weiss, ‘‘Learning how to inpaint from global
image statistics,’’ in Proc. 9th IEEE Int. Conf. Comput. Vis., Oct. 2003,
p. 305.
[7] A. Telea, ‘‘An image inpainting technique based on the fast marching
method,’J. Graph. Tools, vol. 9, no. 1, pp. 23–34, Jan. 2004.
[8] Z. Xu and J. Sun, ‘‘Image inpainting by patch propagation using patch
sparsity,’’ IEEE Trans. Image Process., vol. 19, no. 5, pp. 1153–1165,
May 2010.
[9] K. Duan, Y. Gong, and N. Hu, ‘‘Automatic image inpainting using local
patch statistics,’’ U.S. Patent 10 127 631, Nov. 13, 2018.
[10] H. Ting, S. Chen, J. Liu, and X. Tang, ‘‘Image inpainting by global
structure and texture propagation,’’ in Proc. 15th Int. Conf. Multimedia,
2007, pp. 517–520.
[11] B. Wu, T. Cheng, T. L. Yip, and Y. Wang, ‘‘Fuzzy logic based
dynamic decision-making system for intelligent navigation strategy within
inland traffic separation schemes,’Ocean Eng., vol. 197, Feb. 2020,
Art. no. 106909.
[12] X. Shao, Z. Liu, and H. Li, ‘‘An image inpainting approach based on the
Poisson equation,’’ in Proc. 2nd Int. Conf. Document Image Anal. Libraries
(DIAL), Apr. 2006, p. 5.
[13] J. Xie, L. Xu, and E. Chen, ‘‘Image denoising and inpainting with deep neu-
ral networks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 341–349.
[14] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘CVAE-GAN: Fine-grained
image generation through asymmetric training,’’ in Proc. IEEE Int. Conf.
Comput. Vis. (ICCV), Oct. 2017, pp. 2745–2754.
[15] L. Xu, J. S. Ren, C. Liu, and J. Jia, ‘‘Deep convolutional neural network
for image deconvolution,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014,
pp. 1790–1798.
[16] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, ‘‘Context
encoders: Feature learning by inpainting,’’ in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2536–2544.
[17] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan, ‘‘Shift-net: Image inpainting
via deep feature rearrangement,’’ in Proc. Eur. Conf. Comput. Vis., 2018,
pp. 1–17.
[18] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, ‘‘Image inpainting via
generative multi-column convolutional neural networks,’’ in Proc. Adv.
Neural Inf. Process. Syst., 2018, pp. 331–340.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in
Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[20] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, ‘‘Generative image
inpainting with contextual attention,’’ in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit., Jun. 2018, pp. 5505–5514.
[21] H. Liu, B. Jiang, Y. Xiao, and C. Yang, ‘‘Coherent semantic attention for
image inpainting,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
Oct. 2019, pp. 4170–4179.
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, ‘‘Image-to-image translation
with conditional adversarial networks,’’ in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jul. 2017, pp. 1125–1134.
[23] Y. Song, C. Yang, Z. Lin, X. Liu, Q. Huang, H. Li, and C.-C. J. Kuo,
‘‘Contextual-based image inpainting: Infer, match, and translate,’’ in Proc.
Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[24] M. Jaderberg, K. Simonyan, and A. Zisserman, ‘‘Spatial transformer net-
works,’’ in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[25] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, ‘‘View synthesis
by appearance flow,’’ in Proc. Eur. Conf. Comput. Vis., 2016, pp. 286–301.
[26] K. Sohn, H. Lee, and X. Yan, ‘‘Learning structured output representation
using deep conditional generative models,’’ in Proc. Adv. Neural Inf.
Process. Syst., 2015, pp. 3483–3491.
[27] J. Walker, C. Doersch, A. Gupta, and M. Hebert, ‘‘An uncertain future:
Forecasting from static images using variational autoencoders,’’ in Proc.
Eur. Conf. Comput. Vis., 2016, pp. 835–851.
[28] Z. Chen, S. Nie, T. Wu, and C. G. Healey, ''High resolution face completion
with multiple controllable attributes via fully end-to-end progressive gener-
ative adversarial networks,’’ 2018, arXiv:1801.07632. [Online]. Available:
http://arxiv.org/abs/1801.07632
[29] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang, ‘‘Free-form image
inpainting with gated convolution,’’ in Proc. IEEE/CVF Int. Conf. Comput.
Vis. (ICCV), Oct. 2019, pp. 4471–4480.
[30] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, ‘‘What makes
Paris look like Paris?’Commun. ACM, vol. 58, no. 12, pp. 103–110,
Nov. 2015.
[31] D. P. Kingma and M. Welling, ''Auto-encoding variational Bayes,'' 2013,
arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, ‘‘StarGAN:
Unified generative adversarial networks for multi-domain Image-to-Image
translation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Jun. 2018, pp. 8789–8797.
[33] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, ‘‘Places: A 10
million image database for scene recognition,’IEEE Trans. Pattern Anal.
Mach. Intell., vol. 40, no. 6, pp. 1452–1464, Jun. 2018.
[34] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson,
and M. N. Do, ‘‘Semantic image inpainting with deep generative
models,’’ 2016, arXiv:1607.07539. [Online]. Available: http://arxiv.org/
abs/1607.07539
[35] M.-C. Sagong, Y.-G. Shin, S.-W. Kim, S. Park, and S.-J. Ko, ‘‘PEPSI:
Fast image inpainting with parallel decoding network,’’ in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 11360–11368.
[36] Y. Zeng, J. Fu, H. Chao, and B. Guo, ‘‘Learning pyramid-context encoder
network for high-quality image inpainting,’’ in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1486–1494.
[37] M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein GAN,’’ 2017,
arXiv:1701.07875. [Online]. Available: http://arxiv.org/abs/1701.07875
[38] Z. Huang, X. Xu, J. Ni, H. Zhu, and C. Wang, ‘‘Multimodal representation
learning for recommendation in Internet of Things,’IEEE Internet Things
J., vol. 6, no. 6, pp. 10675–10685, Dec. 2019.
[39] S. Iizuka, E. Simo-Serra, and H. Ishikawa, ''Globally and locally consistent image completion,'' ACM Trans. Graph., vol. 36, no. 4, pp. 1–14, Jul. 2017.
[40] J. Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and
E. Shechtman, ‘‘Toward multimodal image-to-image translation,’’ in Proc.
Adv. Neural Inf. Process. Syst., 2017, pp. 465–476.
[41] M. Abadi et al., ‘‘TensorFlow: Large-scale machine learning on heteroge-
neous distributed systems,’’ 2016, arXiv:1603.04467. [Online]. Available:
http://arxiv.org/abs/1603.04467
[42] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic opti-
mization,’’ 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.
org/abs/1412.6980
[43] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ‘‘Image quality
assessment: From error visibility to structural similarity,’’ IEEE Trans.
Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[44] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, ‘‘The unrea-
sonable effectiveness of deep features as a perceptual metric,’’ in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586–595.
[45] R. Y. Zhang, P. Isola, and A. A. Efros, ‘‘Colorful image colorization,’’ in
Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 649–666.
[46] J. Johnson, A. Alahi, and L. Fei-Fei, ‘‘Perceptual losses for real-time style
transfer and super-resolution,’’ in Proc. Eur. Conf. Comput. Vis., 2016,
pp. 694–711.
[47] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
''Improved training of Wasserstein GANs,'' in Proc. Adv. Neural Inf. Process.
Syst., 2017, pp. 5767–5777.
[48] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi, ‘‘EdgeCon-
nect: Generative image inpainting with adversarial edge learning,’’ 2019,
arXiv:1901.00212. [Online]. Available: http://arxiv.org/abs/1901.00212
[49] Z. Huang, X. Xu, H. Zhu, and M. Zhou, ‘‘An efficient group recommen-
dation model with multiattention-based neural networks,’IEEE Trans.
Neural Netw. Learn. Syst., to be published.
WEIWEI CAI (Student Member, IEEE) is cur-
rently pursuing the master’s degree with the
Central South University of Forestry and Technol-
ogy, Changsha, China. Prior to that, he worked in the IT industry for more than ten years in the roles of Enterprise Architect and Program Manager. His research interests include machine
learning, deep learning, and computer vision.
ZHANGUO WEI received the Ph.D. degree from
the School of Technology, Beijing Forestry Uni-
versity, Beijing, China. He has been working as
an Associate Professor with the Central South
University of Forestry and Technology, Changsha,
China, since 2011. His research interests include
information retrieval, data mining, big data, and
deep learning. Most of his work promotes the development of logistics engineering through the applications of data mining and deep learning.
... Facial expressions are an effective nonverbal communication method for conveying emotional information among humans [114,115]. Additionally, they are reflective of various human emotions and thoughts [116][117][118]. Consequently, countries across the globe share a high level of agreement in identifying emotions through facial expressions [119]. ...
... Consequently, countries across the globe share a high level of agreement in identifying emotions through facial expressions [119]. Recent research has even demonstrated that sixteen facial expressions occur in similar contexts worldwide, showcasing the universality of these expressions [114]. Facial emotion perception is influenced by two contrasting theories: categorical and dimensional. ...
Article
Full-text available
Effective interactions between humans and robots are vital to achieving shared tasks in collaborative processes. Robots can utilize diverse communication channels to interact with humans, such as hearing, speech, sight, touch, and learning. Our focus, amidst the various means of interactions between humans and robots, is on three emerging frontiers that significantly impact the future directions of human–robot interaction (HRI): (i) human–robot collaboration inspired by human–human collaboration, (ii) brain-computer interfaces, and (iii) emotional intelligent perception. First, we explore advanced techniques for human–robot collaboration, covering a range of methods from compliance and performance-based approaches to synergistic and learning-based strategies, including learning from demonstration, active learning, and learning from complex tasks. Then, we examine innovative uses of brain-computer interfaces for enhancing HRI, with a focus on applications in rehabilitation, communication, brain state and emotion recognition. Finally, we investigate the emotional intelligence in robotics, focusing on translating human emotions to robots via facial expressions, body gestures, and eye-tracking for fluid, natural interactions. Recent developments in these emerging frontiers and their impact on HRI were detailed and discussed. We highlight contemporary trends and emerging advancements in the field. Ultimately, this paper underscores the necessity of a multimodal approach in developing systems capable of adaptive behavior and effective interaction between humans and robots, thus offering a thorough understanding of the diverse modalities essential for maximizing the potential of HRI.
... Subsequently, they utilized a CNN-based guided upsampling network for lowresolution upsampling, which supplemented texture details and effectively reduced computational complexity. Cai et al. introduced PiiGAN [22], which incorporates a novel style extractor. It leverages both masked images and style features extracted from ground truth images as inputs to guide the generator network to produce images that approximate the ground truth. ...
Article
Full-text available
The majority of existing face inpainting methods primarily focus on generating a single result that visually resembles the original image. The generation of diverse and plausible results has emerged as a new branch in image restoration, often referred to as “Pluralistic Image Completion”. However, most diversity methods simply use random latent vectors to generate multiple results, leading to uncontrollable outcomes. To overcome these limitations, we introduce a novel architecture known as the Reference-Guided Directional Diverse Face Inpainting Network. In this paper, instead of using a background image as reference, which is typically used in image restoration, we have used a face image, which can have many different characteristics from the original image, including but not limited to gender and age, to serve as a reference face style. Our network firstly infers the semantic information of the masked face, i.e., the face parsing map, based on the partial image and its mask, which subsequently guides and constrains directional diverse generator network. The network will learn the distribution of face images from different domains in a low-dimensional manifold space. To validate our method, we conducted extensive experiments on the CelebAMask-HQ dataset. Our method not only produces high-quality oriented diverse results but also complements the images with the style of the reference face image. Additionally, our diverse results maintain correct facial feature distribution and sizes, rather than being random. Our network has achieved SOTA results in face diverse inpainting when writing. Code will is available at https://github.com/nothingwithyou/RDFINet .
... Pluralistic image inpainting focuses on stochastic state completion methods based on generative models. Generative adversarial network (GAN)-based methods [56,57] generate multiple plausible completions by conditioning on a random vector, often employing coarse-to-fine approaches. Variational autoencoder (VAE)-based methods [58] replace deterministic latent code with a sampling mechanism to allow for multiple plausible predictions. ...
Article
Full-text available
Cognitive scientists believe that adaptable intelligent agents like humans perform spatial reasoning tasks by learned causal mental simulation. The problem of learning these simulations is called predictive world modeling. We present the first framework for a learning open-vocabulary predictive world model (OV-PWM) from sensor observations. The model is implemented through a hierarchical variational autoencoder (HVAE) capable of predicting diverse and accurate fully observed environments from accumulated partial observations. We show that the OV-PWM can model high-dimensional embedding maps of latent compositional embeddings representing sets of overlapping semantics inferable by sufficient similarity inference. The OV-PWM simplifies the prior two-stage closed-set PWM approach to the single-stage end-to-end learning method. CARLA simulator experiments show that the OV-PWM can learn compact latent representations and generate diverse and accurate worlds with fine details like road markings, achieving 69 mIoU over six query semantics on an urban evaluation sequence. We propose the OV-PWM as a versatile continual learning paradigm for providing spatio-semantic memory and learned internal simulation capabilities to future general-purpose mobile robots.
... Subsequently, many studies have been done for the diverse image generation tasks. For example, PiiGAN [3] used an additional style extractor. PUT [29] utilized a patch-based vector quantized variational auto-encoder [41] and an unquantized transformer. ...
Preprint
Image harmonization aims to adjust the foreground illumination in a composite image to make it harmonious. The existing harmonization methods can only produce one deterministic result for a composite image, ignoring that a composite image could have multiple plausible harmonization results due to multiple plausible reflectances. In this work, we first propose a reflectance-guided harmonization network, which can achieve better performance with the guidance of ground-truth foreground reflectance. Then, we also design a diverse reflectance generation network to predict multiple plausible foreground reflectances, leading to multiple plausible harmonization results. The extensive experiments on the benchmark datasets demonstrate the effectiveness of our method.
... (3) Context Encoder (Method B) [24]: Context Encoders are tools that inpaints the background context and eliminate extraneous objects from images. (4) Generative Adversarial Network (Method C) [25]: This method creates fake image samples, resembling the characterisc of training data by the generator. It generates sharp images and thereby emphasize fine details of image. ...
Article
Full-text available
The aim of this research is to enhance the quality of prenatal ultrasound images by addressing common artifacts such as missing or damaged areas, speckle noise, and other types of distortions that can impede accurate diagnosis. The proposed approach involves a novel preprocessing pipeline for prenatal 5th-month ultrasound scan images, which includes three main steps. First, Multiscale Self Attention convolutional neural network (CNN) is used for image inpainting and augmentation to fill missing or damaged areas and generate augmented images for training DL models. Second, Anisotropic Diffusion Filtering is used for speckle noise reduction, and the filter parameters are adapted to local noise characteristics using memory-based speckle statistics. Third, the CNN is trained to estimate local statistics of the speckle noise and adapt filtering parameters accordingly to capture local and global image features. The effectiveness of the proposed approach is evaluated on a prenatal 5th-month ultrasound scan dataset. The results demonstrate that the proposed preprocessing steps significantly improve the quality of ultrasound images and lead to better performance of DL models. The proposed preprocessing pipeline using Multiscale Self Attention CNN for image inpainting and augmentation, followed by Anisotropic Diffusion Filtering and memory-based speckle statistics for speckle noise reduction, can significantly enhance the quality of prenatal ultrasound images and enhance the accuracy of diagnostic models. The approach has potential for broader use in medical imaging applications.
Chapter
This chapter examines key methodologies and case studies in human–robot interaction (HRI), emphasizing three core areas that are transforming this field: advanced synergy model-based HRI techniques, brain–computer interfaces (BCIs), and emotional intelligence in robotics. We explore how synergy-based models and human-in-the-loop dynamics facilitate effective interactions and how BCIs contribute to cognition and stress identification. Additionally, we investigate how emotional intelligence can enhance robots’ ability to understand and react to human emotions. Collectively, these domains illustrate the evolving landscape of HRI, where multimodal approaches are essential for developing systems that are both adaptive and capable of complex interactions with humans. This study aims to deepen the understanding of these interactions and pave the way for future innovations in the field.
Article
Face inpainting, the technique of restoring missing or damaged regions in facial images, is pivotal for applications like face recognition in occluded scenarios and image analysis with poor-quality captures. This process not only needs to produce realistic visuals but also preserve individual identity characteristics. The aim of this paper is to inpaint a face given the periocular region (eyes-to-face) through a proposed new Generative Adversarial Network (GAN)-based model called Eyes-to-Face Network (E2F-Net). The proposed approach extracts identity and non-identity features from the periocular region using two dedicated encoders. The extracted features are then mapped to the latent space of a pre-trained StyleGAN generator to benefit from its state-of-the-art performance and its rich, diverse and expressive latent space without any additional training. We further improve the StyleGAN output by finding the optimal code in the latent space using a new optimization-based GAN inversion technique. Our E2F-Net requires minimal training, reducing the computational complexity as a secondary benefit. Through extensive experiments, we show that our method successfully reconstructs the whole face with high quality, surpassing current techniques, despite significantly less training and supervision effort. We have generated seven eyes-to-face datasets based on well-known public face datasets for training and verifying our proposed methods. The code and datasets are publicly available.
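Generic GAN inversion by latent optimization, the family of techniques E2F-Net builds on, can be sketched as a simple reconstruction-plus-perceptual objective minimized over a latent code. The sketch below is an assumption, not the authors' optimizer: `generator` and `perceptual` are hypothetical stand-ins for a pre-trained StyleGAN generator and a perceptual feature extractor, and the latent shape and loss weights are illustrative.

```python
# Hedged sketch of GAN inversion by latent optimization (assumed, not E2F-Net).
import torch

def invert(generator, perceptual, target, steps=500, lr=0.01, lam=0.8):
    w = torch.zeros(1, 512, requires_grad=True)       # latent code to optimize
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(w)                          # synthesized face
        loss = torch.nn.functional.l1_loss(recon, target) \
             + lam * torch.nn.functional.l1_loss(perceptual(recon), perceptual(target))
        loss.backward()
        opt.step()
    return w.detach()
```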
Conference Paper
Full-text available
Over the last few years, deep learning techniques have yielded significant improvements in image inpainting. However, many of these techniques fail to reconstruct reasonable structures, as they are commonly over-smoothed and/or blurry. This paper develops a new approach for image inpainting that does a better job of reproducing filled regions exhibiting fine details. We propose a two-stage adversarial model, EdgeConnect, that comprises an edge generator followed by an image completion network. The edge generator hallucinates edges of the missing region (both regular and irregular) of the image, and the image completion network fills in the missing regions using the hallucinated edges as a prior. We evaluate our model end-to-end over the publicly available datasets CelebA, Places2, and Paris StreetView, and show that it outperforms current state-of-the-art techniques quantitatively and qualitatively.
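The two-stage flow described above (edge hallucination, then completion conditioned on the edges) can be summarized in a short inference sketch. `edge_generator` and `completion_net` are hypothetical callables standing in for the trained networks; the exact input channel layout is an assumption.

```python
# Sketch of a two-stage edge-then-completion inpainting pipeline (assumed layout).
import torch

def inpaint_two_stage(edge_generator, completion_net, image, mask):
    """image: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W), 1 = missing pixels."""
    gray = image.mean(dim=1, keepdim=True)            # grayscale for the edge net
    masked_image = image * (1 - mask)
    masked_gray = gray * (1 - mask)
    # Stage 1: hallucinate edges inside the hole.
    edges = edge_generator(torch.cat([masked_gray, mask], dim=1))
    # Stage 2: fill the hole conditioned on the hallucinated edge map.
    filled = completion_net(torch.cat([masked_image, edges, mask], dim=1))
    # Keep known pixels from the input; use the prediction only in the hole.
    return image * (1 - mask) + filled * mask
```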
Article
This paper proposes a fuzzy logic-based intelligent decision-making approach for navigation strategy selection in an inland traffic separation scheme. The dynamic characteristics of the navigation process, including free navigation, ship following, and ship overtaking, are further analysed. The proposed model can be implemented in a decision support system for safe navigation or included in the process of autonomous navigation. The decision-making model is built from the perception-anticipation-inference-strategy perspective, and the dynamic features of ships (i.e., speed, distance, and traffic flow) are comprehensively considered in the modelling process. The results of both the overtaking and following scenarios illustrate that timing is significant for strategy selection and must account for the complexity of the situation and ship behaviours; moreover, the proposed approach can be used for intelligent strategy selection.
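A toy fuzzy-inference sketch helps make the perception-anticipation-inference-strategy idea concrete: fuzzified speed, distance, and traffic features are combined by a rule into an overtaking score. The membership functions, rule, units, and threshold below are all assumptions and are not taken from the paper.

```python
# Toy fuzzy decision sketch (all parameters assumed, not the paper's model).
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def overtake_score(relative_speed, gap_distance, traffic_density):
    speed_high = tri(relative_speed, 1.0, 4.0, 8.0)     # knots faster than target
    gap_large = tri(gap_distance, 0.3, 1.0, 2.0)        # nautical miles
    traffic_low = tri(traffic_density, -1.0, 0.0, 0.5)  # ships per square n.m.
    # Rule: IF speed is high AND gap is large AND traffic is low THEN overtake.
    return min(speed_high, gap_large, traffic_low)      # Mamdani-style AND

strategy = "overtake" if overtake_score(3.5, 0.9, 0.2) > 0.5 else "follow"
```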
Article
Group recommendation research has recently received much attention in the recommender system community. Currently, several deep-learning-based methods are used in group recommendation to learn the preferences of groups on items and predict the next ones in which groups may be interested. However, their recommendation effectiveness is disappointing. To address this challenge, this article proposes a novel model called the multiattention-based group recommendation model (MAGRM). It utilizes multiattention-based deep neural network structures to achieve accurate group recommendation. We train its two closely related modules: vector representation of group features and preference learning for groups on items. The former is proposed to learn to accurately represent each group's deep semantic features. It integrates four aspects of subfeatures: group co-occurrence, group description, and external and internal social features. In particular, we employ multiattention networks to learn to capture internal social features for groups. The latter employs a neural attention mechanism to depict preference interactions between each group and its members and then combines group and item features to accurately learn group preferences on items. Through extensive experiments on two real-world databases, we show that MAGRM remarkably outperforms the state-of-the-art methods in solving the group recommendation problem.
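The core mechanism, attention-weighted aggregation of member embeddings into a group representation that is then scored against an item, can be sketched as follows. Dimensions, module layout, and the scoring rule are assumptions for illustration, not MAGRM's actual architecture.

```python
# Minimal attention-based group aggregation and scoring sketch (assumed design).
import torch
import torch.nn as nn

class GroupAttentionScorer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, member_emb, item_emb):
        """member_emb: (M, D) embeddings of group members; item_emb: (D,)."""
        item_rep = item_emb.unsqueeze(0).expand_as(member_emb)              # (M, D)
        weights = torch.softmax(self.attn(torch.cat([member_emb, item_rep], dim=-1)), dim=0)
        group_emb = (weights * member_emb).sum(dim=0)                       # (D,)
        return torch.dot(group_emb, item_emb)                               # preference score

scorer = GroupAttentionScorer()
score = scorer(torch.randn(5, 64), torch.randn(64))
```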
Article
Recommender systems have recently drawn a lot of attention from the communities of information services and mobile applications. Many deep learning-based recommendation models have been proposed to learn feature representations from items. However, in the Internet of Things (IoT), items' description information is typically heterogeneous and multimodal, posing a challenge to the representation learning of recommendation models. To address this challenge and to improve recommendation effectiveness in IoT, a novel multimodal representation learning-based model (MRLM) was proposed. In MRLM, two closely related modules were trained simultaneously: global feature representation learning and multimodal feature representation learning. The former was designed to learn to accurately represent the global features of items and users through simultaneous training on three tasks: triplet metric learning, softmax classification, and microscopic verification. The latter was proposed to refine items' global features and to generate the final multimodal features by using items' multimodal description information. After MRLM converged, items' multimodal features and users' global features could be used to calculate users' preferences on items via cosine similarity. Through extensive experiments on two real-world datasets, MRLM remarkably improved the recommendation effectiveness in IoT.
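The final preference step mentioned above, ranking items for a user by cosine similarity between learned features, is straightforward; a small sketch is shown below. The shapes and function names are illustrative assumptions only.

```python
# Sketch of cosine-similarity preference ranking over learned features (assumed shapes).
import torch
import torch.nn.functional as F

def rank_items(user_feature, item_features, top_k=10):
    """user_feature: (D,) global user feature; item_features: (N, D) multimodal
    item features. Returns indices of the top-k items by cosine similarity."""
    sims = F.cosine_similarity(user_feature.unsqueeze(0), item_features, dim=-1)  # (N,)
    return torch.topk(sims, k=min(top_k, item_features.shape[0])).indices

top = rank_items(torch.randn(64), torch.randn(100, 64))
```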
Conference Paper
In this paper, we propose a generative multi-column network for image inpainting. This network synthesizes different image components in a parallel manner within one stage. To better characterize global structures, we design a confidence-driven reconstruction loss, while an implicit diversified MRF regularization is adopted to enhance local details. The multi-column network, combined with the reconstruction and MRF losses, propagates local and global information derived from context to the target inpainting regions. Extensive experiments on challenging street view, face, natural object and scene datasets show that our method produces visually compelling results even without the previously common post-processing.
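A confidence-weighted reconstruction loss in the spirit of the "confidence-driven" loss above can be sketched by down-weighting pixels that lie deep inside the hole, far from known context. The exponential decay rate and the distance-transform formulation are assumptions; this is not the paper's exact loss.

```python
# Hedged sketch of a confidence-weighted L1 reconstruction loss (assumed decay rule).
import numpy as np
from scipy.ndimage import distance_transform_edt

def confidence_weighted_l1(pred, target, mask, decay=0.1):
    """pred, target: (H, W, 3) images; mask: (H, W), 1 = missing pixel."""
    # Distance of every missing pixel to the nearest known pixel.
    dist = distance_transform_edt(mask)
    confidence = np.exp(-decay * dist)          # 1 at known pixels, decays inward
    weight = confidence * mask                  # only penalize inside the hole
    return float(np.sum(weight[..., None] * np.abs(pred - target)) / (np.sum(weight) + 1e-8))
```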
Article
Pedestrian abnormal trajectory understanding based on video surveillance systems can improve public safety. However, manually identifying abnormal pedestrian trajectories is usually a prohibitive workload. The objective of this study is to propose an automatic method for understanding abnormal pedestrian trajectories. An improved sparse representation model, namely the information entropy constrained trajectory representation method (IECTR), is developed for pedestrian trajectory classification. It aims to reduce the entropy of the trajectory representation and to obtain superior analysis results. In the proposed method, orthogonal matching pursuit (OMP) is embedded in the expectation maximization (EM) method to iteratively obtain the selection probabilities and the sparse coefficients. In addition, the lower-bound sparsity condition of Lp-minimization (0 < p < 1) is applied in the proposed method to guarantee salient solutions. In order to validate the performance and effectiveness of the proposed method, classification experiments are conducted using five pedestrian trajectory datasets. The results show that the identification accuracy of the proposed method is superior to that of the compared methods, including the naïve Bayes classifier (NBC), support vector machine (SVM), k-nearest neighbor (kNN), and typical sparse representation-based methods.
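The sparse-coding step that IECTR embeds inside EM is orthogonal matching pursuit; a minimal OMP sketch is given below. The entropy constraint and the selection probabilities from the paper are not modeled here, and the dictionary and descriptor shapes are assumptions.

```python
# Minimal orthogonal matching pursuit sketch (entropy constraint not modeled).
import numpy as np

def omp(D, y, sparsity):
    """D: (n, K) column-normalized dictionary; y: (n,) trajectory descriptor.
    Returns a K-dimensional coefficient vector with at most `sparsity` nonzeros."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(0)
    x = np.zeros(D.shape[1])
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit on the selected atoms, then update the residual.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x
```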