This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Digital Object Identifier 10.1109/ACCESS.2021.3112996
Neural Style Transfer: A Critical Review
Akhil Singh1, Vaibhav Jaiswal1, Gaurav Joshi1, Adith Sanjeeve1, Dr. Shilpa Gite2,3, Dr. Ketan
Kotecha4,5
1Computer Science and Information Technology Department, Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune
412115, India
2Associate Professor, Computer Science and Information Technology Department, Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune
412115, India
3Faculty, Symbiosis Centre of Applied AI (SCAAI), Symbiosis International (Deemed) University, Pune 412115, India
4Director, Symbiosis Institute of Technology, Symbiosis International (Deemed) University, Pune 412115, India
5Head, Symbiosis Centre of Applied AI (SCAAI), Symbiosis International (Deemed) University, Pune 412115, India
Corresponding authors: Shilpa Gite (e-mail: shilpa.gite@sitpune.edu.in) and Ketan Kotecha (e-mail: head@scaai.siu.edu.in)
ABSTRACT Neural Style Transfer (NST) is a class of software algorithms that transforms scenes or changes/edits the appearance of media with the help of a neural network. NST is used in image and video editing software to stylize images based on a general model, unlike traditional methods. This has made NST a trending topic in the entertainment industry, as professional editors and media producers can create media faster while the general public gains a recreational tool. In this paper, the current progress in Neural Style Transfer, with all related aspects such as still images and videos, is presented critically. The authors examined the different architectures used and compared their advantages and limitations. Multiple existing literature reviews focus on Neural Style Transfer for images or cover Generative Adversarial Networks (GANs) that generate video. To the authors' knowledge, this is the only review article that examines both image and video style transfer, with particular attention to mobile devices, where the potential for use is high. This article also reviews the challenges faced in applying video neural style transfer in real time on mobile devices and presents research gaps with future research directions. NST, a fascinating deep learning application, has considerable research and application potential in the coming years.
INDEX TERMS Neural Style Transfer, Video Style Transfer, Mobile, Convolutional Neural Networks,
Generative Adversarial Networks
I. INTRODUCTION
Since their conception, videos have been considered a popular multimedia tool for various functions such as education, entertainment, and communication. Videos have become more and more popular as the effort needed to make them keeps dropping, thanks to advancements in cameras and, more particularly, mobile cameras. Today, an average user captures videos with a mobile device rather than an expensive dedicated setup [1]. Entertainment producers, on the other hand, use dedicated hardware and editing tools to create picturesque scenes with the help of Computer-Generated Imagery (CGI) software such as [2] and [3].
Multiple resources, approaches, improvements, and implementations have appeared since the first Generative Adversarial Network was presented by Goodfellow et al. (2014). Today, NST is extremely popular and widely used to edit images and create a host of effects (e.g., the Prisma app) (Gatys et al. 2016) (Liu et al. 2017). Recent developments have extended NST to video style transfer (Ruder et al. 2017), (Huang et al. 2017). This has significant applications in entertainment, where directly transforming a scene or parts of it usually takes hours of manual work and supervision. It can also be used for recreational purposes, fused with Augmented Reality to create a virtual world modeled after the real one (Dudzik et al. 2020).
Generative Adversarial Networks (GANs) have been used to produce or synthesize data since their conception (Goodfellow et al., 2014). This makes GANs a natural candidate for generating images and videos given a set of inputs that control their structure and texture. This paper focuses on developments in Generative Adversarial Networks (GANs) and summarizes the advancements made to date (up to April 2021). It also describes the basic techniques currently used to transform videos and then moves on to the NST-based techniques. To understand these developments and enable comparisons, the reviewed papers are categorized into four parts, as shown in Fig. 1. Each part has different objectives and key takeaways, such as advantages, limitations, research gaps, and future scope.
The papers selected for review were found using Scopus and
Web of Science databases with the search terms such as
"video neural style transfer," "real-time video neural style
transfer," "generative adversarial networks," "video neural
style transfer on mobile devices," "video style transfer
improvement." Shortlisted papers with code
implementations publicly available (on GitHub or similar
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2021.3112996, IEEE Access
VOLUME XX, 2017 9
services) and based on the quality of the videos they generate
(as shown in their demonstrations/Readme).
There is currently no benchmark dataset for Neural Style Transfer. MS-COCO and Cityscapes are the two datasets most frequently utilized in the experiments within the papers reviewed. These datasets are primarily intended for object detection and recognition, but they may also be used as content photos for training various models; they contain roughly 330K and 25K images, respectively. The style datasets were either scraped from online sources such as Danbooru, Safebooru, and Videvo.net or created according to the requirements of the problem statement, and were in the range of 1k to 2k images.
While reviewing the literature, a few research gaps such as
platform-related, dataset-related, and architecture-related
deficiencies were identified. Hardware limitations are the
primary cause of platform-related gaps. In the absence of a
benchmark dataset and benchmark metrics, there exist data-
related research gaps. Lastly, architecture-related gaps
concern how model parameters must change based on dataset needs. These gaps are further discussed in detail in section
VIII.
As presented in Table 1, there are a total of three review papers available in the NST domain. Of those three, only [4] can be considered a comprehensive review. The novelty of our paper lies in reviewing the latest papers (up to August 2021) and covering all related facets of NST. The paper has four major sections. The first covers the basics of GANs, their types, and how they work; the second covers contemporary GAN architectures used with NST and how they work; the third covers improvements that can be made to GANs when applying NST, such as deep photo style transfer; and the fourth discusses how NST can be used together with GAN architectures in real time.
Highlights of this literature review are listed below:
Qualitative analysis of the latest GAN architectures and models, along with their advantages and limitations, is discussed.
A summary and in-depth analysis of neural style transfer for both images and video is given, with an emphasis on mobile devices.
The most relevant research papers on Neural Style Transfer were explicitly identified, focusing on real-time NST, which narrows down the research in video style transfer.
A detailed study of the challenges in applying video neural style transfer in real time on mobile devices, along with research gaps and future research directions, is also presented.
Fig. 1 shows the papers reviewed in this research study and
their categorization as per paper flow.
TABLE 1
A SHORT SUMMARY OF REVIEW PAPERS AND KEY CONTRIBUTIONS
Reference | Objectives and Topics | Paper Theme | Year
[4] | A comprehensive overview of the current progress and a taxonomy of current NST algorithms; several evaluation and comparison methods; a discussion of various applications of NST and open problems for future research. | Detailed review of NST papers up to March 2018 | 2019
[5] | A short survey of the major techniques for neural style transfer on images; briefly examines one way of extending neural style transfer to videos. | Short survey of Neural Style Transfer on images and videos | 2018
[6] | A short survey of the current progress of NST from two aspects: image-optimization-based and model-optimization-based methods; compares different types of NST algorithms and applications, with some proposals for future research. | Brief survey of model-optimization-based Neural Style Transfer methods | 2020
FIGURE 1. An overview and categorization of the papers studied in this review
II. GENERATIVE ADVERSARIAL NETWORKS
OVERVIEW
This first part deals with papers that define the basics of most GANs. These papers are essentially the backbone, as most other articles follow their pathway by improving upon or amending them. GANs generate data based on previously learned patterns and regularities, as the model discovers these patterns in the training data. Deep learning suits generative models because it can effectively recognize patterns in input data.
A. Generative Adversarial Networks
[7] explores a framework, novel at the time, for estimating generative models via an adversarial process in which two models are trained simultaneously: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training examples rather than from G. The training procedure for G is to maximize the probability of D making a mistake. This setup corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists in which G recovers the training data distribution and D equals 0.5 everywhere. When G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. The technique used is to train D to maximize the probability of assigning the correct label to both training examples and samples from G, while simultaneously training G to minimize log(1 - D(G(z))). The resulting value function is:
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$   (1)
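The following is a minimal, self-contained PyTorch sketch of this two-player objective on toy 2-D data; the small fully connected G and D and all hyperparameters are illustrative placeholders, not the setup used in [7].

```python
# Minimal GAN training loop implementing Eq. (1) on toy data (illustrative only).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))          # generator G(z)
D = nn.Sequential(nn.Linear(2, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))   # discriminator logits

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 2.0      # samples from p_data (a toy Gaussian)
    z = torch.randn(64, 16)                    # samples from p_z

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    fake = G(z).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: minimize log(1 - D(G(z))) (non-saturating form: maximize log D(G(z)))
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```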
B. Style-Based Generative Adversarial Networks
The Generator has received less attention and improvement than the Discriminator. To enhance the quality of the pictures produced by the Generator, [8] introduced a generator design motivated by the style transfer literature. In this model, the Generator is trained with the Progressive GAN setup of Karras et al. as a baseline. The details of the model are as follows:
1. Traditionally, the Generator is provided with a latent code through the input layer of the feed-forward network. In the new approach, this input is omitted altogether, and synthesis starts from a learned constant.
2. Given a latent code z in the latent input space Z, a non-linear mapping network f: Z → W first produces w ∈ W.
3. After the mapping, learned affine transformations specialize w into styles that operate after each convolution layer of the synthesis (generative) network G and control its normalization.
4. The normalization technique used is adaptive instance normalization (AdaIN); Equation (2) for the same is:
$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$   (2)
Here, to compute the AdaIN of a feature map $x_i$ given a style $y$, the distance between $x_i$ and its mean $\mu(x_i)$ is divided by the standard deviation $\sigma(x_i)$; the normalized map is then scaled by $y_{s,i}$ and shifted by the bias $y_{b,i}$. (A minimal code sketch of this operation is given after this list.)
5. The Generator is also given direct noise inputs, which allow it to generate stochastic detail. The noise inputs are single-channel images of uncorrelated noise, and a dedicated noise image is fed into each layer of the synthesis network.
6. Using learned per-feature scaling factors, the noise image is broadcast to all feature maps and then added to the output of the corresponding convolution layer.
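The following is a minimal PyTorch sketch of the AdaIN operation in (2); the style scale and bias tensors here stand in for the outputs of the learned affine transformations described above, and are not [8]'s exact implementation.

```python
# Minimal AdaIN, Eq. (2): normalize each feature map, then scale/shift with style values.
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """x: feature maps (N, C, H, W); y_s, y_b: per-channel style scale/bias, shape (N, C)."""
    mu = x.mean(dim=(2, 3), keepdim=True)              # per-channel mean of x
    sigma = x.std(dim=(2, 3), keepdim=True) + eps      # per-channel std of x
    x_norm = (x - mu) / sigma                          # normalize each feature map
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

x = torch.randn(4, 512, 8, 8)                          # synthesis-network activations
y_s, y_b = torch.rand(4, 512) + 0.5, torch.randn(4, 512)
out = adain(x, y_s, y_b)                               # styled, re-normalized feature maps
```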
The above changes in the Generator lead to the following observations and improvements:
20% improvement in FID over traditional
Generator.
The style-based design makes it possible to control image synthesis by modifying the styles at different scales. The mapping network and the affine transformations draw samples for each style from a learned distribution, and the synthesis network generates an image based on this collection of styles. The effect of each style is localized, meaning that changing a specific subset of the styles influences only certain aspects of the generated image.
Mixing regularization is used, in which a given percentage of images is generated during training using two random latent codes instead of one. Specifically, two codes z1, z2 are run through the mapping network, and the corresponding w1, w2 control the styles so that w1 applies before a randomly chosen crossover point and w2 after it. This approach prevents the network from assuming that adjacent styles are correlated.
After each convolution, the architecture adds per-pixel noise, so that the noise affects only the stochastic aspects of the image while leaving the overall composition and high-level aspects intact.
Global effects such as illumination are coherently controlled by the styles, whereas the noise is added to each pixel independently and is therefore ideally suited only for stochastic variation. If the network tried to control, for example, pose with the noise, the resulting spatially inconsistent decisions would be penalized by the Discriminator. In this way, without explicit guidance, the network learns to use the global and local channels appropriately. The perceptual path length is lowered by using the style-based Generator, as shown in Table 3.
It is shown that increasing the mapping network's
depth enhances both image quality and separability.
C. Deep Convolutional Generative Adversarial
Networks
Unsupervised learning using Convolutional Neural
networks (CNN) has seen less attention than supervised
learning and its adoption in computer vision applications. To
bridge the gap, [9] introduced Deep Convolutional GANs
(DCGANs). In this method, the discriminator learns representations useful for image classification tasks, and the generator exhibits vector arithmetic properties that allow more control over generated images.
Following guidelines are proposed to create stable
convolutional GANs:
1. Use strided convolution in Discriminator and
fractional-strided convolution in Generator, enabling
the model to tune upsampling and downsampling
itself.
2. Use batch normalization in both the generator and the discriminator; this helps prevent the generator from mode collapse.
3. Remove fully connected hidden layers for deeper architectures; connecting the highest convolutional features directly to the generator input and the discriminator output showed promising results. For the Discriminator, flatten the last convolutional layer and feed it into a single sigmoid output.
4. For Generator, use ReLU activation and Tanh (only in
the final layer).
5. In the Discriminator, make use of LeakyReLU in
layers.
These architectural changes result in stable training and models capable of handling higher resolutions.
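The following is a minimal PyTorch sketch of a generator that follows these guidelines (fractional-strided convolutions, batch normalization, ReLU with Tanh only in the final layer); the 64x64 output size and layer widths are illustrative assumptions, not the exact configuration of [9].

```python
# DCGAN-style generator per the listed guidelines; input z is expected as (N, z_dim, 1, 1).
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, channels=3):
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, channels, 4, 2, 1, bias=False), nn.Tanh(),  # Tanh only in the final layer
    )
```

A matching discriminator would mirror this with strided convolutions, LeakyReLU activations, and a single sigmoid output, following guidelines 1, 3, and 5.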
Testing on the CIFAR-10 and Street View House Numbers (SVHN) datasets confirmed the impressive performance of DCGANs. However, it still falls short of Exemplar CNNs [10]. Another point to improve is that, even with fewer feature maps in the Discriminator, the flattened feature vector is larger, which increases the computational load at higher resolutions.
D. Cycle Consistent Adversarial Networks
Unlike Deep Convolutional GANs, CycleGANs allow image translation on unpaired data. [11] achieves this using the concept of "cycle consistency," meaning that if two generators, "G" and "F," are trained to be inverses of each other, then, ideally, F(G(x)) ≈ x. [11] introduces a second generator that takes the outputs of the first one and tries to reproduce the original input image. By training two GANs whose generators perform inverses of each other, [11] decouples the style and structural aspects of the translation (one model handles the style transfer while the other enforces structure). A key takeaway is that, because of this structure, these models do not need paired data to train. The loss function is thus modified as shown in (3)-(6).
TABLE 2
FIDS IN FFHQ FOR NETWORKS TRAINED WITH MIXING REGULARIZATION ENABLED FOR DIFFERENT PERCENTAGES OF TRAINING EXAMPLES. (KARRAS ET. AL. 2019)
Mixing regularization | Number of latents during testing: 1 | 2 | 3 | 4
E 0% | 4.42 | 8.22 | 12.88 | 17.41
50% | 4.41 | 6.10 | 8.71 | 11.61
90% | 4.40 | 5.11 | 6.88 | 9.03
100% | 4.83 | 5.17 | 6.63 | 8.40
TABLE 3
PERCEPTUAL PATH LENGTHS AND SEPARABILITY SCORES IN FFHQ FOR DIFFERENT GENERATOR ARCHITECTURES (LOWER IS BETTER). (KARRAS ET. AL. 2019)
Method | Path length (full) | Path length (end) | Separability
Traditional generator | 412.0 | 415.3 | 10.78
Style-based generator | 446.2 | 376.6 | 3.61
+ Add noise inputs | 200.5 | 160.6 | 3.54
+ Mixing 50% | 231.5 | 182.1 | 3.51
+ Mixing 90% | 234.0 | 195.9 | 3.79
TABLE 4
THE EFFECT OF A MAPPING NETWORK IN FFHQ. (KARRAS ET. AL. 2019)
Method (mapping network depth) | FID | Path length (full) | Path length (end) | Separability
Traditional (0) | 5.25 | 412.0 | 415.3 | 10.78
Traditional (8) | 4.87 | 896.3 | 902.0 | 170.29
Traditional (8) | 4.87 | 324.5 | 212.2 | 6.52
Style-based (0) | 5.06 | 283.5 | 285.5 | 9.88
Style-based (1) | 4.60 | 219.9 | 209.4 | 6.81
Style-based (2) | 4.43 | 217.8 | 199.9 | 6.25
Style-based (8) | 4.40 | 234.0 | 195.9 | 3.79
$\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log(1 - D_Y(G(x)))]$   (3)
$\mathcal{L}_{\text{GAN}}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log(1 - D_X(F(y)))]$   (4)
$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x\rVert_1\big] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y\rVert_1\big]$   (5)
$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{\text{cyc}}(G, F)$   (6)
where X and Y are two image domains, G is a generator
transforming an image from domain X to Y. F is a generator
transforming an image from domain Y to X. DY is the
discriminator concerning G (identifies real/generated images
in Y domain). DX is the discriminator concerning F
(identifies real/generated images in X domain). G(x) is the
image generated by G on an input image x such that x ∈ X
and F(y) is the image generated by F on an input image y such
that y ∈ Y. Thus, Equations (3) and (4) compute the
Adversarial losses for the two GANs. In contrast, Equation
(5) computes the cyclical consistency loss by comparing input
images x and y to their reconstructed versions, F(G(x)) and G(F(y)), respectively. Equation (6) describes the total
loss of CycleGAN combining the adversarial and cyclical
losses. The transformations are diagrammed by [11] in Figure
2.
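The following is a minimal PyTorch sketch of the cycle-consistency term in (5); g_XY and g_YX are placeholder generator networks, not the exact models of [11].

```python
# Cycle-consistency loss, Eq. (5): both mapping directions should reconstruct their inputs.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(g_XY, g_YX, real_x, real_y):
    # forward cycle: x -> G(x) -> F(G(x)) should reconstruct x
    forward = F.l1_loss(g_YX(g_XY(real_x)), real_x)
    # backward cycle: y -> F(y) -> G(F(y)) should reconstruct y
    backward = F.l1_loss(g_XY(g_YX(real_y)), real_y)
    return forward + backward
```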
The generators use a ResNet-based architecture with a few encoder-decoder layers, while the discriminators use a PatchGAN architecture to focus on local structural details. The results show that CycleGANs perform exceptionally well on all test metrics, surpassed only by the Pix2Pix model, which requires paired data. The model's limitation is that it fails whenever the input image is sampled from a distribution different from the training distribution.
Observations
In summary, [7] defines a basic GAN with its objective function and training procedure. However, the basic GAN is unconditional (it cannot target precise outputs) and uncontrollable (it offers no control over the individual features used for generation). [8] builds upon this by modifying the Generator to allow control over disentangled features. [9] improves the architecture by introducing a deep convolutional neural network. [11] addresses unpaired image style transfer with the help of "cycle consistency."
FIGURE 3. Output of CycleGANs. Left-most images are inputs, middle
images are corresponding style transfer from the first Generator G. Right-
most images are the reconstruction of inputs by second generator F. (Zhu
et. al. 2020)
FIGURE 2. (a) CycleGAN model schematic showing the two generators and discriminators along with the image domains.
(b) The transforms used to compute forward cyclical consistency loss.
(c) The transforms used to compute backward cyclical consistency loss.
III. GENERATIVE ADVERSARIAL NETWORKS IN
NEURAL STYLE TRANSFER
This section discusses the architectures prevalent and in use for Neural Style Transfer (NST). These papers propose new architectures and employ new methods. NST first appeared in Gatys et al. 2015. The approach takes a content image and applies the textures of a given style image to it. NST then gained momentum as many works followed, increasing the quality of the generated images or generating them faster than Gatys et al. 2015. These efficiency and/or quality improvements paved the way for faster image editing (e.g., Adobe image stylization) and recreational use (e.g., the Prisma app).
A.
Conditional Adversarial Networks for style transfer
Conditional GANs (cGANs) introduce a general approach to image-to-image translation together with a loss function that allows such models to be trained, removing the need for hand-engineered loss or mapping functions. [12] aims to create a common framework
mapping functions. [12] aims to create a common framework
that predicts a particular set of pixels based on another given
set of pixels. Instead of treating the output space as
"unconditional" from the input image, cGANs use a
structured loss function, considering the structural differences
between input and generated images. Optimizing this loss
function allows the generated images to be structurally related
or "conditioned" as per the input image. The Generator has an
architecture based on U-Net, whereas the Discriminator has a
PatchGAN based architecture. The PatchGAN architecture is
shown to be useful as it penalizes local structural differences.
The effect of locality or "patch size" is also studied. The loss
function is given as:
$\mathcal{L}_{\text{cGAN}}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$   (7)
where G and D are the Generator and Discriminator networks, x is the input (conditioning) image, y is the target output image, and z is a random noise vector; G learns the mapping G: {x, z} → y. The Discriminator is additionally fed the input image x. In addition, an L1 distance term is added to make the generated images closer to the ground truth and to avoid blurred images:
$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_1\big]$   (8)
Thus, the final objective is given as:
$G^{*} = \arg\min_G \max_D \mathcal{L}_{\text{cGAN}}(G, D) + \lambda\, \mathcal{L}_{L1}(G)$   (9)
where G and D are the Generator and Discriminator networks, $\mathcal{L}_{\text{cGAN}}$ is the conditional loss given in (7), $\mathcal{L}_{L1}$ is the L1 loss of the generator given in (8), and $\lambda$ is a weight used to alter the importance of the L1 loss in the total objective.
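The following is a minimal PyTorch sketch of the generator-side objective in (9); the discriminator is assumed to receive the concatenated (input, output) pair, as described above, and λ = 100 is only an illustrative weight.

```python
# Generator objective of Eq. (9): conditional adversarial loss plus weighted L1 term.
import torch
import torch.nn.functional as F

def generator_objective(G, D, x, y, lam=100.0):
    fake = G(x)                                        # noise enters via dropout, not an explicit z
    pred_fake = D(torch.cat([x, fake], dim=1))         # discriminator also sees the input image
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    l1 = F.l1_loss(fake, y)                            # Eq. (8): pull outputs toward ground truth
    return adv + lam * l1
```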
Noise is provided in dropouts and not as inputs as the models
ignored the latter. In addition, the U-Net architecture
introduces skip connections, which allow low-level details to
be transferred easily between the input and output images.
Meanwhile, PatchGAN discriminators focus more on localized information. Another significant advantage is
that PatchGANs can work with a smaller subset of pixels at a
time, decreasing the number of parameters, computation, and
time required for discriminator predictions.
A small patch size is seen to produce useful spectral features (colorful images) but to lose spatial features (image structure). As the patch size grows, a balance of spatial and spectral features is reached, producing a crisp image. However, increasing the patch size beyond this "balance point" lowers the quality of the generated image. Another advantage of PatchGAN is that the Discriminator can be applied to arbitrarily large images. The cases where the model performs poorly are:
1. Sparse input images (images with shallow structural details)
2. Unusual inputs (inputs unlike the training data)
Figure 4. Encoder-decoder vs U-Net Architecture. (Isola et. al. 2018)
FIGURE 5. Introducing U-Net allows higher quality of generated
images. (Isola et. al. 2018)
B.
Image Style Transfer Using CNN
It is difficult to render an image's semantic content in a different style because we lack image representations that explicitly capture semantic information. To overcome the limitation of using only low-level image characteristics of the target image, [13] presents a neural algorithm of artistic style that can separate and recombine the content of one image and the style (texture) of another, generating new images from those styles. The image representations used here are derived from a Convolutional Neural Network optimized for object recognition, which explicitly provides high-level image information. Overall, the approach combines CNN-based parametric texture models with a method to invert their image representations. The method used is:
The standardized version of the 19-layer VGG
network includes 16 convolutional and five pooling
layers.
By scaling the weights, the network was normalized such that the mean activation of each convolutional filter over images and positions was equal to one.
Image synthesis was done using average pooling, as it was found to give slightly better results.
For content representation:
o One can perform gradient descent on a white-noise image to find another image that matches the feature responses of the content image at several layers.
Let $\vec{p}$ be the original (content) image and $\vec{x}$ the generated image, with $P^{l}$ and $F^{l}$ their respective feature representations in layer $l$. The squared-error loss between the two feature representations is defined as:
$\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2}\sum_{i,j}\big(F^{l}_{ij} - P^{l}_{ij}\big)^{2}$   (10)
The derivative of this loss with respect to the activations in layer $l$ equals
$\frac{\partial \mathcal{L}_{\text{content}}}{\partial F^{l}_{ij}} = \begin{cases} \big(F^{l} - P^{l}\big)_{ij} & \text{if } F^{l}_{ij} > 0 \\ 0 & \text{if } F^{l}_{ij} < 0 \end{cases}$   (11)
Style representation:
o To acquire a style representation of an input image, a feature space designed to capture texture information is used. It is built on top of the filter responses in each layer of the model and consists of the correlations between the different filter responses, where the expectation is taken over the spatial extent of the feature maps.
o Feature correlations are given by the Gram matrix $G^{l} \in \mathbb{R}^{N_l \times N_l}$, where $G^{l}_{ij}$ is the inner product between the vectorized feature maps $i$ and $j$ in layer $l$:
$G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}$   (12)
o The contribution of layer $l$ to the total style loss is:
$E_l = \frac{1}{4 N_l^{2} M_l^{2}} \sum_{i,j}\big(G^{l}_{ij} - A^{l}_{ij}\big)^{2}$   (13)
where $A^{l}$ and $G^{l}$ are the Gram matrices of the style image and the generated image in layer $l$.
o Total style loss:
$\mathcal{L}_{\text{style}}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l$   (14)
FIGURE 6. Style transfer Algorithm. (Gatys et. al. 2016)
o The derivative of $E_l$ with respect to the activations in layer $l$ can be computed analytically:
$\frac{\partial E_l}{\partial F^{l}_{ij}} = \begin{cases} \frac{1}{N_l^{2} M_l^{2}}\big((F^{l})^{T}(G^{l} - A^{l})\big)_{ji} & \text{if } F^{l}_{ij} > 0 \\ 0 & \text{if } F^{l}_{ij} < 0 \end{cases}$   (15)
Style transfer:
o The loss function jointly minimizes the distance of a white-noise image's feature representations from the content representation of the content image and from the style representation of the style image:
$\mathcal{L}_{\text{total}}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{\text{content}}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{\text{style}}(\vec{a}, \vec{x})$   (16)
The style image was always resized to the size of the content image before computing its feature representations, to keep the representations at comparable scales.
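The following is a minimal PyTorch sketch of the content loss (10), Gram matrix (12), and style loss (13)-(14); it operates on feature maps assumed to be already extracted from a VGG network (the extraction itself is not shown).

```python
# Content loss, Gram matrix, and style loss from Eqs. (10), (12)-(14).
import torch

def content_loss(F_l, P_l):
    # Eq. (10): squared-error between generated and content feature maps in layer l
    return 0.5 * ((F_l - P_l) ** 2).sum()

def gram_matrix(F_l):
    # Eq. (12): correlations between vectorized feature maps of one layer; F_l is (C, H, W)
    C, H, W = F_l.shape
    feats = F_l.view(C, H * W)
    return feats @ feats.t()

def style_loss(feats_x, feats_a, weights):
    # Eqs. (13)-(14): weighted sum of per-layer Gram-matrix discrepancies
    loss = 0.0
    for F_l, A_l, w_l in zip(feats_x, feats_a, weights):
        C, H, W = F_l.shape
        G, A = gram_matrix(F_l), gram_matrix(A_l)
        loss = loss + w_l * ((G - A) ** 2).sum() / (4.0 * C ** 2 * (H * W) ** 2)
    return loss
```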
The results of the suggested image style transfer are:
Content and style representations are well separable in the CNN, and both can be manipulated independently to produce new, perceptually meaningful images.
Because the synthesis uses many layers, it was shown which layers best capture the representations of content and style.
The image is cleaner when the content is matched at higher layers; initializing gradient descent with white noise allows an arbitrary number of new images to be generated.
The algorithm provides photo-realistic style transfer; an example can be seen in Fig. 7.
C.
Towards the Automatic Anime Characters Creation
The most common problem in generating faces is that some features become distorted and blurred. [16] addresses this problem in both the data and model aspects. [16] provides three contributions for generating anime faces:
1. A GAN model based on the DRAGAN architecture.
2. A suitable, clean anime facial dataset comprising high-quality images collected from Getchu (a Japanese game-selling website).
3. An approach to train GANs from untagged images.
Tags are assigned to the dataset using Illustration2Vec (a CNN-based tag estimation tool), which can detect and tag 512 different types of attributes. After tagging, 34 tags suitable for the task at hand are selected. In this way, any untagged dataset can be processed and prepared, which vastly expands the possible data collection sources.
The model architecture is based on DRAGAN, proposed by Kodali et al. [17]. DRAGAN has a lower computational cost than other GAN variants and is much faster to train. The generator architecture, shown in Fig. 8, is based on a modified version of SRResNet [18]. It consists of 16 residual blocks and three feature upscaling blocks. The discriminator architecture is depicted in Fig. 9. It has 11 residual blocks and a dense layer that acts as an attribute classifier. The proposed model was compared with a standard DRAGAN model using a DCGAN generator, based on Fréchet Inception Distance (FID) scores. As Table 5 shows, the proposed model has a lower average FID, identifying it as the better model.
Figure 10 shows samples generated from the model. The samples are clean and sharp and show good diversity. The only drawback of this model is that it cannot handle super-resolution: high-resolution images generated with this model contain undesirable artifacts, which make the results messy.
FIGURE 7. Photorealistic style transfer. (Gatys et. al. 2016)
TABLE 5
FID OF PROPOSED MODEL AND BASELINE MODEL. (JIN ET. AL. 2017)
Model | Average FID | MaxFID - MinFID
DCGAN Generator + DRAGAN | 5974.96 | 85.63
Proposed model | 4607.56 | 122.96
D.
CartoonGAN: Generative Adversarial Networks for
Photo Cartoonization
[19] proposes a solution for converting real-world scenery images into cartoon-style images. The unique characteristics of cartoon-style images, such as smooth shading and simplified textures, pose significant challenges to existing methods based on texture-based loss functions. To tackle this problem, [19] proposes CartoonGAN, a new GAN framework that can be trained on unpaired images. The CartoonGAN architecture is shown in Fig. 11. Generator G comprises one flat convolution block followed by two down-convolution blocks, which perform compression and encoding of the input image. The content and manifold part is made up of eight residual blocks. Finally, two up-convolution blocks and a convolution layer produce the cartoon-style output image. Discriminator D is made up of flat layers followed by two strided convolution blocks that reduce resolution and encode features. The final layers form a feature construction block with convolution layers to obtain the classification.
The overall loss has two parts, an adversarial loss and a content loss, and is described as
$\mathcal{L}(G, D) = \mathcal{L}_{\text{adv}}(G, D) + \omega\, \mathcal{L}_{\text{con}}(G, D)$   (17)
Here, ω is a weight that controls how much content from the input is retained.
FIGURE 10. Generated Samples. (Jin et. al. 2017)
FIGURE 8. Generator Architecture. (Jin et. al. 2017)
FIGURE 9. Discriminator Architecture. (Jin et. al. 2017)
The adversarial loss $\mathcal{L}_{\text{adv}}(G, D)$ is an edge-promoting loss defined as:
$\mathcal{L}_{\text{adv}}(G, D) = \mathbb{E}_{c_i \sim S_{\text{data}}(c)}[\log D(c_i)] + \mathbb{E}_{e_j \sim S_{\text{data}}(e)}[\log(1 - D(e_j))] + \mathbb{E}_{p_k \sim S_{\text{data}}(p)}[\log(1 - D(G(p_k)))]$   (18)
Here, the Generator G outputs a generated image $G(p_k)$ for each photo $p_k$ in the photo manifold. $e_j$ is a cartoon image whose clear edges have been smoothed away, and $c_i$ is the corresponding genuine cartoon image. $D(c_i)$, $1 - D(e_j)$, and $1 - D(G(p_k))$ are the probabilities of the discriminator D assigning the correct labels to the genuine cartoon image, the cartoon image without clear edges, and the generated image, respectively.
The content loss $\mathcal{L}_{\text{con}}(G, D)$ uses feature maps from a pre-trained VGG network and is defined as:
$\mathcal{L}_{\text{con}}(G, D) = \mathbb{E}_{p_i \sim S_{\text{data}}(p)}\big[\lVert \mathrm{VGG}_l(G(p_i)) - \mathrm{VGG}_l(p_i) \rVert_1\big]$   (19)
Here, $l$ refers to the feature maps of a specific VGG layer.
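The following is a minimal PyTorch sketch of the CartoonGAN objective (17)-(19); `vgg_l` is assumed to be a frozen feature extractor for the chosen VGG layer, D is assumed to output logits, and all names are illustrative rather than the authors' implementation.

```python
# CartoonGAN-style losses: edge-promoting adversarial loss (18), content loss (19), total (17).
import torch
import torch.nn.functional as F

def cartoongan_losses(G, D, vgg_l, photo, cartoon, cartoon_no_edge, omega=10.0):
    fake = G(photo)
    bce = F.binary_cross_entropy_with_logits
    real_logits = D(cartoon)                      # genuine cartoon images
    smooth_logits = D(cartoon_no_edge)            # cartoons with edges smoothed away
    fake_logits = D(fake.detach())                # generated images (detached for D step)
    # Eq. (18): discriminator-side edge-promoting adversarial loss
    d_loss = (bce(real_logits, torch.ones_like(real_logits))
              + bce(smooth_logits, torch.zeros_like(smooth_logits))
              + bce(fake_logits, torch.zeros_like(fake_logits)))
    # Eq. (19): semantic content loss on VGG feature maps (L1 distance)
    content = F.l1_loss(vgg_l(fake), vgg_l(photo))
    # Eq. (17): generator objective = adversarial term + omega * content term
    g_logits = D(fake)
    g_loss = bce(g_logits, torch.ones_like(g_logits)) + omega * content
    return d_loss, g_loss
```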
FIGURE 11. Proposed Generator Discriminator Architecture. (Chen et. al. 2018)
FIGURE 12. Generated output comparison of CartoonGAN, CycleGAN and NST. (Chen et. al. 2018)
Along with the model, an initialization phase is proposed to
improve the GAN model's convergence. In this phase, the
Generator is trained only with semantic content loss (19) and
can reconstruct only the input images' content. The training
data is unpaired, consisting of real-world and cartoon images
that are all resized to 256x256. There are 5402 real-world
training images. The cartoon images comprise Makoto Shinkai (4,573), Mamoru (4,212), Miyazaki Hayao (3,617), and Paprika (2,302) style images.
As Fig. 12 shows, outputs from CartoonGAN are compared
with NST [13], and CycleGAN [11] outputs trained on the
same dataset. The Figure demonstrates the inability of NST
and CycleGAN to handle the cartoon style well. NST, using only one style image, cannot capture the style thoroughly because local regions end up styled differently, which leads to inconsistent artifacts. Similarly, the results from CycleGAN also fail to understand and depict the cartoon style appropriately: without an identity loss it cannot preserve the input image content, and even with the identity loss the results are unsatisfactory. The results clearly show that CartoonGAN transforms real-world scenery images into cartoon style efficiently and with high quality, performing much better than the other top stylization methods.
E. Artsy-GAN: A style transfer system
[20] introduces a novel method for GAN-based style
transfer termed Artsy-GAN. The problem with current
approaches, such as using CycleGAN, is the slow training of
these models due to their complexity. Another disadvantage
is the source of randomness, which is limited to input images.
[20] proposes three ways to tackle these problems:
1. Using a perceptual loss instead of a reconstruction loss to improve training speed and quality.
2. Using chroma sub-sampling to process the images
improves inference/prediction speed and makes the
model compact by reducing size.
3. Improving the diversity in generated output by
appending noise to the Generator's input and pairing it
with the loss function would force it to develop a
variety of details for the same image.
Fig. 13 shows the model architecture of the Generator. The input is a 3-channel color image (RGB) with noise added to each channel. The Generator has three branches, each of which receives the same input but produces a different output image channel; these channels are converted back into RGB by a module at the end of the network. The discriminator architecture is the same as CycleGAN's, using 70x70 PatchGANs [13], [19], [20].
The objective function is made up of three types of losses and is defined as
$\mathcal{L}(G, D) = \mathcal{L}_{\text{GAN}} + \alpha\, \mathcal{L}_{\text{DIVERSITY}} + \beta\, \mathcal{L}_{\text{PERCEPTUAL}}$   (20)
where α and β control the significance of the losses.
Here, the loss functions are:
1. An adversarial loss $\mathcal{L}_{\text{GAN}}$ for matching the distributions of the two domains, defined as
$\mathcal{L}_{\text{GAN}}(G, D) = \mathbb{E}_{y \sim p_{\text{real}}(y)}[\log D(y)] + \mathbb{E}_{x \sim p_{\text{data}}(x),\, z}[\log(1 - D(G(x, z)))]$   (21)
where $p_{\text{real}}$ is the distribution of actual data in the target domain, z is a noise tensor, $G(x, z)$ is an image produced by the generator G, and D is the discriminator.
2. A diversity loss $\mathcal{L}_{\text{DIVERSITY}}$ (22) that improves diversity in the generated outputs by encouraging different outputs for different input noise. Here, N is the number of input noise vectors and, hence, of outputs.
3. A perceptual loss $\mathcal{L}_{\text{PERCEPTUAL}}$ to overcome the unconstrained problem by keeping the object and content in the output, described as:
$\mathcal{L}_{\text{PERCEPTUAL}} = \frac{1}{C_j H_j W_j}\left\lVert \phi_j(G(x, z)) - \phi_j(x) \right\rVert_2^2$   (23)
where $\phi_j(x)$ is the output of the j-th layer of the feature encoder network $\phi$ for image x. If the j-th layer is a convolutional layer, then $\phi_j(x)$ is a feature map of shape $C_j \times H_j \times W_j$. (A minimal code sketch of this loss is given below.)
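The following is a minimal PyTorch sketch of a perceptual loss of the form in (23); `phi_j` is a hypothetical stand-in for the feature encoder used in [20].

```python
# Perceptual loss as in Eq. (23): feature-space distance normalized by feature-map size.
import torch
import torch.nn.functional as F

def perceptual_loss(phi_j, generated, target):
    feat_g, feat_t = phi_j(generated), phi_j(target)   # (N, C_j, H_j, W_j) feature maps
    n, c, h, w = feat_g.shape
    # squared L2 distance between feature maps, normalized per element
    return F.mse_loss(feat_g, feat_t, reduction="sum") / (n * c * h * w)
```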
FIGURE 13. Architecture of Generator. (Liu et. al. 2018)
Comparison of Artsy-GAN is made with CycleGAN based
on FID, processing time, and diversity in generated images.
Table 6 shows that Artsy-GAN has lower FID scores across
all the styles for which both models are trained.
Table 7 shows that Artsy-GAN is 9.33% faster than CycleGAN at the lowest resolution tested (640x480) and up to 74.96% faster at the highest resolution (1960x1080).
As the resolution increases, the difference in processing
times increases, proving that Artsy-GAN is much faster and
well-suited for higher resolution images.
Fig. 14 shows that CycleGAN output images are very similar for the same input image, with shallow diversity even after adding noise to the input images, whereas Artsy-GAN output images vary significantly, confirming its high diversity. Overall, the proposed Artsy-GAN is a better, faster, and more diverse method for style transfer, and the results show it outperforming other state-of-the-art methods. The proposed perceptual loss can also be used for other stylings, such as oil paintings with vibrant textures.
F.
Depth Aware Style Transfer
After a style transfer has been rendered using a different picture style, the depth of the content picture is often not reproduced. Traditional remedies, such as additional regularization terms in the optimization of the loss function, are either computationally inefficient or require a separately trained style network. The AdaIN approach of Huang et al. (2017) enables effective arbitrary style transfer to the content image, but the depth map of the content image is not replicated. [22] proposed an extension to the AdaIN method that preserves the depth map by applying spatially variable stylization strength. The comparison is shown in Fig. 15.
TABLE 6
COMPARISON OF FID OF ARTSY-GAN AND CYCLEGAN. (LIU ET. AL. 2018)
Method | Van Gogh | Monet | Ukiyoe | Cezanne
CycleGAN | 180.37 | 125.82 | 206.05 | 162.83
Artsy-GAN | 168.03 | 125.65 | 198.32 | 160.36
TABLE 7
PROCESSING TIME COMPARISON (SECONDS) ON A TESLA M40 GPU. (LIU ET. AL. 2018)
Resolution | CycleGAN | Artsy-GAN (Ours) | Speed Up
640x480 | 0.037 +/- 0.008 | 0.034 +/- 0.003 | 9.33%
1024x768 | 0.081 +/- 0.031 | 0.036 +/- 0.004 | 55.60%
1280x720 | 0.106 +/- 0.032 | 0.037 +/- 0.007 | 64.85%
1280x960 | 0.125 +/- 0.054 | 0.040 +/- 0.007 | 67.91%
1280x1024 | 0.145 +/- 0.047 | 0.042 +/- 0.047 | 71.23%
1960x1080 | 0.176 +/- 0.093 | 0.044 +/- 0.008 | 74.96%
FIGURE 15. Comparison between proposed DA-AdaIn to AdaIn methods.
(Kitov et. al. 2020)
FIGURE 14. Results compared with Cycle-GAN. (Liu et. al. 2018)
The technique is depth-aware AdaIN (DA-AdaIN), which applies styling with varying strength: closer areas are stylized less, whereas distant regions, which represent the background, receive more stylization. Standard AdaIN applies the style evenly to the content image according to:
$t(c, s) = g\big(\mathrm{AdaIN}(f(c), f(s))\big)$   (24)
where c is the content image, s is the style image, f(·) is an encoder, g(·) is a decoder trained together with the encoder to produce appropriate stylizations, and AdaIN(x, y) is a variant of instance normalization.
The extension proposed is:
Apply the style with varying strength, depending on camera proximity, in different areas of the content image. Closer regions belong to the foreground and should be stylized less; remote areas are treated as background and stylized more. In standard AdaIN, a global stylization strength α ∈ [0, 1] can be used to interpolate:
$t(c, s, \alpha) = g\big((1 - \alpha)\, f(c) + \alpha\, \mathrm{AdaIN}(f(c), f(s))\big)$   (25)
since $f(c)$ is the unaltered content encoder representation, whereas $\mathrm{AdaIN}(f(c), f(s))$ is the fully stylized encoder representation. To manage spatially varying strength, the modified formula is used:
$t(c, s, P) = g\big((1 - P) \odot f(c) + P \odot \mathrm{AdaIN}(f(c), f(s))\big)$   (26)
where P is a stylization strength map and ⊙ denotes element-wise multiplication applied for each channel at each spatial position of the content encoder representation. The strength map is derived from an estimated proximity map of the content image (27). (A minimal sketch of the blend in (26) follows this list.)
The algorithm has two hyperparameters:
o β > 0 controls the contrast of the proximity map around its mean value.
o ε ∈ [0, 1] controls the minimal offset of the image regions from the camera.
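The following is a minimal PyTorch sketch of the spatially varying blend in (26); the encoder features, the AdaIN-stylized features, the decoder, and the proximity map are assumed to be computed elsewhere, and all names are illustrative.

```python
# Depth-aware blend of Eq. (26): per-pixel mix of content features and AdaIN-stylized features.
import torch
import torch.nn.functional as F

def depth_aware_stylize(decoder, f_c, f_cs, proximity):
    """f_c: encoder features f(c); f_cs: AdaIN(f(c), f(s)); proximity: (N, 1, H, W) in [0, 1]."""
    P = F.interpolate(proximity, size=f_c.shape[2:], mode="bilinear", align_corners=False)
    blended = (1.0 - P) * f_c + P * f_cs      # Eq. (26): element-wise, per-position blend
    return decoder(blended)
```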
Image results for different hyperparameter values are shown in Figs. 16 and 17.
G.
StyleBank: An Explicit Representation for Neural
Image Style Transfer
StyleBank comprises multiple convolutional filter banks, each of which explicitly represents one style, and is used for neural image style transfer. To transfer an image to a particular style, the corresponding filter bank is applied on top of the intermediate feature embedding produced by a single auto-encoder. The StyleBank and the auto-encoder are learned jointly, with the auto-encoder encoding no style information thanks to the flexibility provided by the explicit filter bank representation. Additionally, the method supports incremental learning: a new image style can be added by learning a new filter bank while keeping the auto-encoder unchanged. The explicit style representation and the flexible network design also make it possible to combine styles at the image and region levels.
To investigate an explicit representation for Style, [23]
revisit traditional texton (referred to as the essential element
of texture) mapping methods, in which mapping a texton to
the target location is equivalent to convolution between a
texton and a Delta function (indicating sampling positions)
in the image space.
In response, [23] offers StyleBank, a collection of different
convolution filter banks, each reflecting a distinct style. The
matching filter bank is convolved with the intermediate
feature embedding generated by a single autoencoder, which
decomposes the original picture into several feature maps to
convert a picture to a specific style.
In comparison to previously published neural style transfer
networks, the proposed neural style transfer network is novel
in the following ways:
This method offers an explicit representation of styles. After learning, the network can isolate styles from content.
This technique enables region-based style transfer
due to the explicit style representation. This is not
FIGURE 16. Result of style transfer based on depth contrast parameter
β, ε =0. (Kitov et. al. 2020)
FIGURE 17. Result of Style transfer based on proximity offset
parameter ε, β = 20. (Kitov et. al. 2020)
possible with existing neural style transfer
networks, but it is possible with classical texture
transfer.
This method enables concurrent training of many
styles with a single auto-encoder and progressive
learning of a new style without modifying the
auto-encoder.
[23] construct a feed-forward network based on a simple
image autoencoder (Figure 18), which converts the input
picture (i.e., the content image) to the feature space via the
encoder subnetwork.
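The following is a minimal PyTorch sketch of the StyleBank idea: a shared auto-encoder produces content features, and one convolutional filter bank per style is applied to those features before decoding; the layer sizes and module names are illustrative assumptions, not the architecture of [23].

```python
# StyleBank sketch: one learnable filter bank (set of conv kernels) per style, applied to
# the feature embedding produced by a shared auto-encoder (encoder/decoder not shown).
import torch
import torch.nn as nn

class StyleBank(nn.Module):
    def __init__(self, num_styles, channels=256, kernel_size=3):
        super().__init__()
        self.banks = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
            for _ in range(num_styles)
        ])

    def forward(self, features, style_id):
        # convolve the content features with the filter bank of the chosen style
        return self.banks[style_id](features)

# illustrative usage: stylized = decoder(StyleBank(num_styles=10)(encoder(image), style_id=3))
```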
H.
Two-Stage Color Ink Painting Style Transfer via CNN
[24] proposes an approach for transferring flower photographs into color ink paintings. Unlike common neural style transfer techniques, the paper presents a method that imitates the creation process of color ink painting, which can be viewed as two specific steps: edge marking and image colorization. Rather than using edge detection algorithms, the line drawing is obtained by exploiting the CNN-based neural style transfer technique, and for image colorization, a GAN-based neural style transfer strategy is used.
The framework comprises two components: a line extraction model and an image colorization model. The line extraction model converts flower photo content into a line drawing through the mapping x1 = f1(content). The image colorization model colorizes the line drawing x1 to give the output y through the mapping y = g(∙). In this approach, f1(content) is expected to behave as a mapping from which an approximate color representation can also be recovered, content ≈ f1^-1(x1). Thus, both the lines and an approximate color representation of the content image are captured through the mapping f1(content). With paired data, a conditional GAN trained in a supervised manner can synthesize the content image with a user-specified style. Hence, stylized photos from the generator fool the discriminator yet meet the requirements of color ink painting tone.
As mentioned before, there are two primary models. First, the line extraction model extracts most of the lines of the flowers and leaves in the content images. It is used to define loss functions by measuring the differences in content and style between features extracted from images. In the training stage, the flower image is processed by the image colorization model, and the output picture x1 is created accordingly. The line extraction model is kept fixed during training, and its output features are used in the image loss functions.
L=λcLc(F(xcontent), F(x1)) + λsLs(G(xs), G(x1)) (28)
In equation 28, Lc(∙) is the Euclidean distance between the content representations of the content image and the stylized image. Ls(∙) is the squared Frobenius norm of the difference between the Gram matrices of the style image and the stylized image. F and G are the feature transformation functions.
Second, the image colorization network is implemented with both a conditional GAN and a DualGAN in order to experiment and check which one gives the better output. When the Generator and Discriminator are conditioned on additional information, a stricter mapping is learned. As shown in Fig. 19, the line drawing is consumed by both the colorization model and the line extraction model. Because the line drawing can be observed, the Discriminator can see how the Generator transforms the input lines into a suitable photo. In this manner, the Discriminator tends to be more reliable at separating the generated photos from the real ones.
In DualGAN, an unsupervised learning system learns to translate images from domain X to domain Y and also learns to invert the task. In this case, as shown in Fig. 21, two image sets from two domains, namely the line drawing set (domain X) and the color ink painting set (domain Y), are fed into two groups of GANs. Generator GA first transforms a line drawing x1 from domain X into a stylized painting image y, which is then turned back into a reconstructed line image x1. Meanwhile, generator GB converts the color ink painting into a stylized line image and back into a reconstructed color ink painting. An L1 distance is used to measure the reconstruction error and is added to the GAN objective. Hence, the generators learn to produce images with perceptual realism.
FIGURE 18. The network model. (Chen et al. 2017)
FIGURE 19. Line extraction model architecture. (Zheng et al. 2018)
FIGURE 20. The organization model of image colorization model. (Zheng et al. 2018)
FIGURE 21. Model design of the complete network. (Zheng et al. 2018)
Observations
A summary of contributions is presented in Table 8. One peculiar limitation observed is that the models tend to fail at higher image resolutions.
IV. ADVANCEMENT PAPERS
This set of papers presents advancements to current
architectures. These advancements allow different types of
control to the Style Transfer by improving Color control,
Stability, Spatial Control, and other vital aspects which
enhance the quality of generated images.
A. Perceptual Factor Control in Neural Style Transfer
[25] presents an extension to the existing methods by
proposing spatial, color, and scale control over a generated
image's features. By breaking down the perceptual factors
into these features, more appealing images can be generated
that avoid common pitfalls. Finally, [25] shows a method to
incorporate this control into already existing processes. The
identification of perceptual factors is the key to producing
higher-quality images. Spatial control implies controlling
which region of the style image is applied to each region of
the content image. This helps as different regions have
different styling, and mapping them incorrectly can cause
visual artifacts. The first method to do this uses guidance-based Gram matrices, where each image is provided with spatial guidance channels indicating which region of the style should be applied where. This involves computing a spatially guided feature map for R regions and L layers as:
$\mathbf{F}_{r}^{l}[\cdot] = \mathbf{T}_{r}^{l} \circ \mathbf{F}^{l}[\cdot]$   (29)
where ◦ denotes element-wise multiplication and $\mathbf{T}_{r}^{l}$ is the guidance channel for region r propagated to layer l. The guided Gram matrix can then be defined as:
$\mathbf{G}_{r}^{l}[\cdot] = \mathbf{F}_{r}^{l}[\cdot]\,\big(\mathbf{F}_{r}^{l}[\cdot]\big)^{T}$   (30)
Furthermore, the contribution to the loss function is given as:
$E_{l} = \sum_{r=1}^{R} \frac{\lambda_r}{4 N_l^{2} M_l^{2}} \sum_{i,j}\Big(\mathbf{G}_{r}^{l}[\vec{x}]_{ij} - \mathbf{G}_{r}^{l}[\vec{x}_s]_{ij}\Big)^{2}$   (31)
where $N_l$ is the number of feature maps in layer l, $M_l$ their spatial size, and $\mathbf{G}_{r}^{l}[\vec{x}]$ and $\mathbf{G}_{r}^{l}[\vec{x}_s]$ are the guided Gram matrices generated as per Equations (29) and (30) for the generated image $\vec{x}$ and the input style image $\vec{x}_s$. $\lambda_r$ is the weighting factor that controls the stylization strength in the corresponding region r.
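The following is a minimal PyTorch sketch of the guided Gram matrix of (29)-(30); the guidance map is assumed to be given, already propagated to the layer's spatial resolution.

```python
# Guided Gram matrix, Eqs. (29)-(30): mask the feature maps per region, then compute the Gram.
import torch

def guided_gram(F_l, T_r):
    """F_l: layer features (C, H, W); T_r: spatial guidance map for region r, shape (H, W)."""
    guided = F_l * T_r.unsqueeze(0)        # Eq. (29): element-wise spatial masking
    C = guided.shape[0]
    flat = guided.view(C, -1)
    return flat @ flat.t()                 # Eq. (30): Gram matrix of the guided features
```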
An alternative approach stacks the guidance channels with the feature maps directly. This is more efficient than the previous approach but, as noted, comes at the cost of texture quality. The second factor addressed in [25] is color control, independent of geometric shapes or textures. Color control is beneficial in situations where preserving the image's colors is essential (e.g., photo-realistic style transfer). [25] presents two approaches to deal with this:
1. Luminance-only Transfer: Style Transfer is only
performed on the luminance channel. This is done by
extracting style and content Luminance channels and
producing output luminance channels that are then
combined with the original content colors to create
the generated image.
2. Color Histogram Matching: In this method, the style image's colors are transformed such that their mean and covariance match the content image's mean and covariance, using a linear transform (a minimal sketch follows this list).
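The following is a minimal NumPy sketch of the color histogram matching approach in item 2: a linear transform maps the style image's pixel mean and covariance onto the content image's statistics. The matrix-square-root construction below is one standard choice of such a transform, not necessarily the one used in [25].

```python
# Color histogram matching: recolor the style image so its mean/covariance match the content's.
import numpy as np

def match_color(style, content, eps=1e-5):
    """style, content: HxWx3 float arrays in [0, 1]; returns a recolored style image."""
    s = style.reshape(-1, 3).T                       # 3 x N pixel matrix of the style image
    c = content.reshape(-1, 3).T                     # 3 x M pixel matrix of the content image
    mu_s, mu_c = s.mean(1, keepdims=True), c.mean(1, keepdims=True)
    cov_s = np.cov(s) + eps * np.eye(3)
    cov_c = np.cov(c) + eps * np.eye(3)

    def sqrtm(M):                                    # symmetric matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag(np.sqrt(np.maximum(vals, 0))) @ vecs.T

    # A = cov_c^{1/2} cov_s^{-1/2}, so that A cov_s A^T = cov_c
    A = sqrtm(cov_c) @ np.linalg.inv(sqrtm(cov_s))
    out = A @ (s - mu_s) + mu_c                      # shift/rotate style colors onto content statistics
    return np.clip(out.T.reshape(style.shape), 0.0, 1.0)
```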
Each of them has its pros and cons. For instance, luminance-only transfer preserves the content colors, but this comes at the expense of losing the dependencies between luminance and color. Color histogram matching may maintain these dependencies, but it depends on the transform, which can be rather tricky to find. Scale control allows separate styles to be picked at different scales, where the style of an image at a given scale is the distribution of image texture over areas of that size. [25] proposes creating a new style image from two separate photos by combining a fine-scale and a coarse-scale picture. This is handy for style transfer on high-resolution images. Given a high-resolution content and style image, an output is first produced at a downsampled resolution; this output is then upsampled and used as the initialization for the original-resolution images. This technique requires fewer iterations for optimization and filters low-level noise
TABLE 8
A SHORT SUMMARY OF ARCHITECTURE-BASED PAPERS AND THEIR KEY CONTRIBUTIONS
Ref. No. | Paper | Contributions
[12] | Isola et al. 2018 | Differentiates and penalizes generated images based on output types.
[13] | Gatys et al. 2016 | Separates the style and content parts of the process (like [11]) and breaks down the total loss into two parts, each addressing a different issue.
[16] | Jin et al. 2017 | Provides a way to generate facial features with less distortion.
[19] | Chen et al. 2018 | Uses a novel initialization phase that helps in faster convergence, and loss functions pertinent to cartoonish image generation.
[20] | Liu et al. 2018 | Achieves faster convergence using perceptual loss, chroma subsampling and noise addition.
[22] | Kitov et al. 2020 | Addresses depth preservation by adding variable stylization strength.
[23] | Chen et al. 2017 | Allows simultaneous training of multiple styles using the concept of "Style Banks".
[24] | Zheng et al. 2018 | Breaks down stylization into two stages: marking the edges and actual colorization.
as well. The method can be iterated to generate very high-resolution images. As seen in Fig. 22, the method produces a high-resolution image comparable to one generated without it, but the coarse-to-fine ("CTF") model requires fewer iterations and shows less noise.
B. Stability improvements in Neural Style Transfer
The latest image style transfer methods can be grouped into two categories. The first is the optimization approach, which solves an optimization problem for each generated image; the results are outstanding but take considerable time to produce for each picture. The second is feed-forward approaches, which address this problem and are usable for real-time synthesis but tend to produce unstable results. [26] introduces a new method for stabilizing feed-forward style transfer methods for video stylization using a recurrent network trained with a temporal consistency loss. In this method, the network tries to minimize the sum of three losses. The combined loss is defined as
L = \lambda_c L_c + \lambda_s L_s + \lambda_t L_t        (32)
Here \lambda_c, \lambda_s, and \lambda_t are used to assign importance to each loss term.
The three losses are as follows:
1. Content loss L_c, defined as
L_c = \frac{1}{C_j H_j W_j} \big\| \phi_j(o_t) - \phi_j(c_t) \big\|_2^2        (33)
Here, \phi_j(x) is the j-th layer activation of the loss network, of shape C_j \times H_j \times W_j, for image x; o_t is the generated frame and c_t the content frame.
2. Style reconstruction loss L_s, defined as
L_s = \sum_j \big\| G\big(\phi_j(o_t)\big) - G\big(\phi_j(s)\big) \big\|_F^2        (34)
Here, G(\phi_j(\cdot)) is the Gram matrix of the layer-j activations and s is the style image.
3. Temporal consistency loss L_t, defined as
L_t = \frac{1}{H W} \big\| m \odot \big(o_t - \omega(o_{t-1})\big) \big\|_2^2        (35)
Here, m is 0 in regions of occlusion and at motion boundaries, \odot indicates element-wise multiplication, \omega(o_{t-1}) is the previous output warped to the current frame, and H, W are the height and width of the input frame.
The style and content losses encourage the high-level features of the output to match those of the content image and the style image, while the temporal consistency loss prevents drastic variations in the output between time steps. The content image and the previous output frame are fed as input to the network, and at each step the output of the network is passed as input to the next step.
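A minimal PyTorch sketch of the temporal term in Eq. (35) and the weighted combination in Eq. (32); the occlusion mask, the warped previous output, and the loss weights are assumed to be supplied by the surrounding training loop:

import torch

def temporal_consistency_loss(o_t, o_prev_warped, mask):
    # Eq. (35): penalize changes between the current output o_t and the previous
    # output warped into the current frame. `mask` is 0 at occlusions and motion
    # boundaries and 1 elsewhere. All tensors have shape (1, 3, H, W).
    _, _, H, W = o_t.shape
    diff = mask * (o_t - o_prev_warped)  # element-wise masking
    return diff.pow(2).sum() / (H * W)

def combined_loss(l_content, l_style, l_temporal,
                  lambda_c=1.0, lambda_s=10.0, lambda_t=100.0):
    # Eq. (32): weighted sum of the three terms (the weights here are placeholders).
    return lambda_c * l_content + lambda_s * l_style + lambda_t * l_temporal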
As Fig. 23 shows, it is a recurrent convolutional network
where each style transfer network is a deep Conv. Network
with two spatial downsampling blocks, followed by several
residual blocks. The final layers are nearest-neighbor
upsampling blocks.
FIGURE 22. (a) The content image. (b) Spatial control that differentiates between sky and ground textures. (c) Color control that tries to preserve the original colors of the content image. (d) Two styles used at fine and coarse scales to stylize the image. (Gatys et. al. 2017)
Fig. 24 shows the results for translation and blurring distortions of images. An image patch is taken and distorted, and SSIM is computed between the original and the distorted patch. Both are then stylized, and SSIM is computed between the stylized original and the stylized distorted patch. The proposed method is compared with the Real-Time baseline model on all styles. The results show that this method is significantly more robust to such distortions.
Table 9 shows the results of the speed comparison. This method matches the Real-Time baseline in terms of speed and is roughly three orders of magnitude faster than the Optim baseline [26].
Fig. 25 shows a pair-wise comparison of stylized frame outputs, with PSNR/SSIM values shown for each example pair. This method produces frames similar to those of the Optim baseline [26], and compared with the Real-Time baseline, the frames produced are better and temporally consistent for unstable styles such as Rain Princess and Metzinger.
There are two problems with this method:
1. Occasionally, one object can occlude others in the result, which is undesirable.
2. Shower-door artifacts appear in the generated image.
C. Preserving Color in Neural Artistic Style Transfer
Though there have been many papers on style transfer, they share a shortcoming: the algorithms transfer the colors of the style painting to the output, which can alter its appearance in undesirable ways. [27] describes simple linear methods for retaining colors after style transfer, extending the neural artistic style transfer algorithm.
FIGURE 23. System Overview. (Gupta et. al. 2017)
FIGURE 24. Image sharpness based on SSIM. (Gupta et. al. 2017)
TABLE 9
SPEED (SECONDS PER IMAGE) AND SPEEDUP OVER THE OPTIM BASELINE (GUPTA ET. AL. 2017)

Image size   | Real-Time Baseline | Optim Baseline | Ours  | Speedup
256 x 256    | 0.024              | 22.14          | 0.024 | 922x
512 x 512    | 0.044              | 59.64          | 0.044 | 1355x
1024 x 1024  | 0.141              | 199.6          | 0.141 | 1415x
FIGURE 25. Pair-wise stylized image SSIM comparison. (Gupta et. al. 2017)
One of the problems, as noted before, is that while the output of style transfer replicates the brushstrokes, geometric shapes, and painterly structures of the style picture, it also undesirably duplicates the style picture's color distribution.
Two different methods for preserving the colors of the content image are color histogram matching and luminance-only transfer.
1. Color histogram matching:
1. Let S be the style image and C the content image. The style image's colors are transformed to match the content image's colors, producing a new style image S' that replaces S as the input to the NST algorithm. One choice that has to be made is the color transfer procedure.
2. Each pixel p of the style image is transformed as
p_{S'} = A\, p_S + b        (36)
where A is a 3 x 3 matrix and b is a 3-dimensional vector.
3. The transformation is chosen such that the mean and covariance of the RGB values of the new style image S' match those of the content image C, i.e., \mu_{S'} = \mu_C and \Sigma_{S'} = \Sigma_C.
4. From equation (36) and these conditions, b = \mu_C - A\mu_S, and A must satisfy A\,\Sigma_S A^T = \Sigma_C.
5. There are many different solutions for A which satisfy these constraints:
1. The first variant uses Cholesky decompositions: A_{chol} = L_C L_S^{-1}, where L_X is the Cholesky factor of \Sigma_X = L_X L_X^T.
2. The second variant is the 3D color-matching formulation used in Image Analogies: A = \Sigma_C^{1/2}\,\Sigma_S^{-1/2}.
3. It is found that transferring the color histogram before style transfer gives better results than the alternative, in which neural style transfer is computed from the original inputs S and C and the output T is afterwards color-matched to C, creating a new output T'.
4. Performing the color transfer first also reduces the competition between reconstructing the content image and simultaneously matching the texture details of the style image.
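A minimal NumPy sketch of the Cholesky variant of the linear color transform in Eq. (36); images are assumed to be float RGB arrays with values in [0, 1]:

import numpy as np

def match_color_cholesky(style, content, eps=1e-5):
    # Transform the style image's colors so that their mean and covariance match
    # the content image's: p' = A p + b with A = L_C @ inv(L_S), where each
    # covariance is factored as Sigma = L L^T (Cholesky).
    s = style.reshape(-1, 3).astype(np.float64)
    c = content.reshape(-1, 3).astype(np.float64)
    mu_s, mu_c = s.mean(0), c.mean(0)
    cov_s = np.cov(s, rowvar=False) + eps * np.eye(3)
    cov_c = np.cov(c, rowvar=False) + eps * np.eye(3)
    L_s, L_c = np.linalg.cholesky(cov_s), np.linalg.cholesky(cov_c)
    A = L_c @ np.linalg.inv(L_s)
    b = mu_c - A @ mu_s
    s_new = s @ A.T + b                      # Eq. (36) applied to every pixel
    return np.clip(s_new, 0.0, 1.0).reshape(style.shape).astype(style.dtype)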
2. Luminance-only transfer:
Visual perception is much more sensitive to changes in luminance than to changes in color.
Luminance channels L_S and L_C are first extracted from the style and content images. The NST algorithm is applied to them, yielding an output luminance image L_T. Using the YIQ color space, the content image's color information (its I and Q channels) is merged with L_T to generate the resulting image.
Any significant mismatch between the luminance histograms of the style and content images should be balanced before the style is transferred; each luminance pixel of the style image is updated as
L_{S'} = \frac{\sigma_C}{\sigma_S}\,(L_S - \mu_S) + \mu_C        (38)
where \mu_S and \mu_C are the mean luminances and \sigma_S and \sigma_C the standard deviations of the style and content images.
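A minimal NumPy sketch of the luminance-only pipeline using the YIQ color space, including the statistics matching of Eq. (38); stylize_luminance is a placeholder for any single-channel NST routine:

import numpy as np

# RGB <-> YIQ conversion matrices (NTSC); rows act on RGB vectors.
RGB2YIQ = np.array([[0.299, 0.587, 0.114],
                    [0.596, -0.274, -0.322],
                    [0.211, -0.523, 0.312]])
YIQ2RGB = np.linalg.inv(RGB2YIQ)

def luminance_only_transfer(content, style, stylize_luminance):
    # content / style: float RGB arrays of shape (H, W, 3) in [0, 1].
    yiq_c = content @ RGB2YIQ.T
    yiq_s = style @ RGB2YIQ.T
    L_c, L_s = yiq_c[..., 0], yiq_s[..., 0]
    # Eq. (38): match the style luminance statistics to the content's.
    L_s = (L_c.std() / (L_s.std() + 1e-8)) * (L_s - L_s.mean()) + L_c.mean()
    # Style transfer on the luminance channel only (placeholder routine).
    L_t = stylize_luminance(L_c, L_s)
    # Recombine with the content's chrominance (I and Q) channels.
    yiq_out = np.stack([L_t, yiq_c[..., 1], yiq_c[..., 2]], axis=-1)
    return np.clip(yiq_out @ YIQ2RGB.T, 0.0, 1.0)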
FIGURE 26. On the output image (c), the undesirable style image
color overlay is evident. (Gatys et. al. 2016)
FIGURE 27. Results of the Cholesky and Image Analogies color transfers. (Gatys et. al. 2016)
D. Deep Photo Style Transfer
A deep-learning approach to photographic style transfer handles a large variety of image content while faithfully transferring the given style. One of the contributions is to eliminate painting-like effects by preventing spatial distortions and constraining the transfer operation to color space. Another critical contribution is a solution to the challenge posed by the difference in content between the input and reference pictures, which can end in undesirable transfers between unrelated content. The algorithm used here takes two pictures: an input picture, commonly a stock photo, and a stylized, retouched reference picture, the reference style picture. The proposed approach adds a photorealism regularization term to the objective function during optimization, constraining the reconstructed picture to be represented by locally affine color transformations of the input in order to prevent distortions.
Photorealism regularization: [28] describes how to regularize this optimization approach to maintain the structure of the original image and generate photo-realistic results. The idea is to express this constraint on the transformation applied to the input image rather than on the output image directly. The problem of characterizing the space of photo-realistic images remains unresolved; [28] did not need to solve it and instead exploited the fact that the input is already photo-realistic. The goal is to keep the image from losing this property during the transfer by including a term that penalizes image distortions. The solution is to find an image transform that is locally affine in color space, i.e., a function that maps the input RGB values onto their output counterparts for each output patch.
L_{total} = \sum_{l=1}^{L} \alpha_l L_c^l + \Gamma \sum_{l=1}^{L} \beta_l L_{s+}^l + \lambda L_m        (39)
L is the number of convolutional layers and l denotes the l-th convolutional layer of the network. The weight \Gamma controls the style loss, the weights \alpha_l and \beta_l are layer preference parameters, and the weight \lambda controls the photorealism regularization. L_c, L_{s+}, and L_m are the content loss, the augmented style loss, and the photorealism regularization term, respectively. Fig. 29 shows how users can control the transfer results simply by providing semantic masks. This use case enables artistic applications and makes it possible to handle extreme cases for which semantic labeling is not available, e.g., transferring a fireball onto a perfume bottle.
Augmented style loss with semantic segmentation: The style term is limited by the fact that the Gram matrix is computed over the whole picture. Because a Gram matrix determines its constituent vectors only up to an isometry, it implicitly encodes the exact distribution of neural responses, limiting its ability to adapt to changes in semantic context and causing "spillovers." The masks are added as extra channels to the input picture, and the neural style method is augmented by concatenating the segmentation channels and updating the style loss accordingly, as sketched below. [28] also found that the segmentation does not need to be pixel-accurate, because the regularization ultimately constrains the output.
Fig. 30 shows failure cases caused by mismatched segmentation; these can be fixed using manual segmentation.
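The following is a simplified sketch of this idea (one Gram matrix per semantic channel, computed on masked features); it illustrates the concept rather than reproducing the exact formulation of [28]:

import torch

def masked_gram(features, mask):
    # Gram matrix of `features` (1, C, H, W) restricted to `mask` (1, 1, H, W),
    # where the mask is the segmentation channel resized to this layer's resolution.
    masked = features * mask
    _, c, h, w = masked.shape
    flat = masked.view(c, h * w)
    return flat @ flat.t() / (mask.sum() * c + 1e-8)

def augmented_style_loss(gen_feats, style_feats, gen_masks, style_masks):
    # Sum the Gram-matrix differences over each semantic channel k.
    # gen_masks / style_masks: (1, K, H, W) segmentation channels for this layer.
    loss = 0.0
    for k in range(gen_masks.shape[1]):
        g_gen = masked_gram(gen_feats, gen_masks[:, k:k + 1])
        g_sty = masked_gram(style_feats, style_masks[:, k:k + 1])
        loss = loss + (g_gen - g_sty).pow(2).sum()
    return loss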
FIGURE 29. Manual segmentation enables diverse tasks, for example, transferring a fireball (b) to a perfume bottle (a) to create a fire-lit look (c), or swapping the texture between different apples (d, e). (Luan et al. 2017)
FIGURE 28. Working of luminance-based style transfer with color
histogram. (Gatys et. al. 2016)
FIGURE 30. Failure cases caused by mismatched segmentation. (Luan et al. 2017)
E. GauGAN: Semantic Image Synthesis with Spatially
Adaptive Normalization
Conditional image synthesis is the task of creating photo-realistic images conditioned on some input data. [29] addresses a particular conditional image synthesis task: converting a semantic segmentation mask into a photo-realistic image. This framework has a broad range of applications, for example, content generation and image editing.
The conventional architecture, built by stacking convolutional, normalization, and nonlinearity layers, is suboptimal because the normalization layers tend to "wash away" the information contained in the input semantic masks. To address the issue, the authors propose spatially adaptive normalization. This conditional normalization layer modulates the activations using the input semantic layout through a spatially adaptive, learned transformation and can effectively propagate the semantic information throughout the network.
SPADE generator. With SPADE, there is no compelling reason to feed the segmentation map to the Generator's first layer, since the learned modulation parameters already encode enough information about the label layout. The Generator's encoder, which is commonly used in recent designs, can therefore be discarded, resulting in a more lightweight network. Similarly to existing class-conditional generators, the new Generator can take a random vector as input, enabling a simple and natural way to perform multi-modal synthesis. Notably, the segmentation mask in the SPADE Generator is fed in through spatially adaptive modulation without prior normalization; only the activations from the previous layer are normalized. Hence, the SPADE Generator can better preserve semantic information: it retains the benefit of normalization without losing semantic information (a sketch of such a layer follows below).
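A minimal PyTorch sketch of a spatially adaptive normalization block in the spirit of [29]; the hidden width and kernel sizes are illustrative choices, not the paper's exact configuration:

import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, num_features, label_channels, hidden=128):
        super().__init__()
        # Parameter-free normalization of the incoming activations.
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # Small conv net mapping the segmentation map to gamma/beta maps.
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, kernel_size=3, padding=1)

    def forward(self, x, segmap):
        # Resize the (one-hot) segmentation map to the activation resolution.
        segmap = F.interpolate(segmap, size=x.shape[-2:], mode='nearest')
        h = self.shared(segmap)
        # Spatially varying, learned modulation of the normalized activations.
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)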
Multi-modal synthesis: Using a random vector as the Generator's input, the design provides a simple technique for multi-modal synthesis. Specifically, one can add an encoder that maps an image to a random vector, which the Generator then processes. The encoder and Generator form a variational autoencoder, in which the encoder attempts to capture the style of the image, while the Generator combines the encoded style and the segmentation-mask information by means of SPADE to reconstruct the original image. Moreover, the encoder serves as a style guidance network at test time to capture the style of target images.
In ablation experiments, [29] first considers two kinds of input to the Generator: random noise or downsampled segmentation maps. Second, the type of parameter-free normalization layer applied before the modulation parameters is varied. Next, the size of the convolutional kernel acting on the label map is varied, and a kernel size of 1x1 is found to hurt performance, likely because it prevents the use of the label's spatial context. Finally, the capacity of the generator network is adjusted by changing the number of convolutional filters.
F. Exploring Style Transfer
In recent times, NST algorithms have improved significantly on tasks such as semantically segmented stylization and replicating a content image in different styles. [30] presents several new extensions and improvements to the original neural style transfer, such as altering the original loss function to achieve multiple style transfer while preserving color, and semantically segmented style transfer. The baseline approach uses a pre-trained feed-forward network that performs a forward-pass "image transformation" on the input image, which makes real-time video applications feasible.
Method:
The baseline taken was fast neural style transfer, consisting of two components: an image transformation network f_W and a loss network \phi. The overall training objective is
W^{*} = \arg\min_{W} \; \mathbf{E}_{x,\{y_i\}} \Big[ \sum_i \lambda_i\, \ell_i\big(f_W(x), y_i\big) \Big]
where W are the weights of the transformation network, x is the image to be transformed, and y_i are the target (style and content) images.
Image transformation network: color images of shape 3 x 256 x 256 are used.
o Downsampling: done by convolutional layers with stride 2.
o Upsampling: done by convolutional layers with stride 1/2.
o This provides the computational benefit of operating in lower-dimensional spaces.
Perceptual losses:
o Feature reconstruction loss: encourages the output image to have feature representations similar to those the loss network \phi computes for the target.
o Style reconstruction loss: penalizes style differences such as colors and textures.
Firstly, the Gram matrix is defined:
G_j^{\phi}(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c}\, \phi_j(x)_{h,w,c'}
The style loss is the squared Frobenius norm of the difference between the Gram matrices of the generated and target images:
\ell_{style}^{\phi,j}(\hat{y}, y) = \big\| G_j^{\phi}(\hat{y}) - G_j^{\phi}(y) \big\|_F^2        (43)
Minimizing the style reconstruction loss generates an image that preserves the stylistic features, but not the spatial structure, of the target.
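A minimal PyTorch sketch of the Gram matrix and the style reconstruction loss of Eq. (43), operating on pre-extracted activations of any loss network \phi:

import torch

def gram_matrix(features):
    # features: (B, C, H, W) activations of one layer of the loss network.
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_reconstruction_loss(gen_features, style_features):
    # Eq. (43): squared Frobenius norm of the Gram-matrix difference.
    return (gram_matrix(gen_features) - gram_matrix(style_features)).pow(2).sum()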
Simple loss functions:
o Pixel loss: the normalized Euclidean distance between the output image \hat{y} and the target y:
\ell_{pixel}(\hat{y}, y) = \frac{\|\hat{y} - y\|_2^2}{C\,H\,W}        (44)
o Total variation regularization: used to maintain spatial smoothness.
Multiple style transfer: an extension of vanilla neural style transfer that allows multiple style images to be transferred to a single content image (a sketch follows below).
o It requires only a simple modification to the style loss function:
\ell_{style}^{multi}(\hat{y}, \{y_i\}) = \sum_i \sum_{j \in J_i} w_{i,j}\; \ell_{style}^{\phi,j}(\hat{y}, y_i)        (45)
o This allows the style layers J_i and weights w_{i,j} to be chosen flexibly and independently for each style image.
o It allows us to readily generate images that blend the styles of multiple images.
o The network is trained with the Adam optimizer.
o When forced to blend multiple styles, the model incurs a larger style loss than with a single style image.
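A minimal sketch of the multi-style extension of Eq. (45): the Gram-based style loss is summed over several style images, each with its own layers and weights (the feature dictionaries are assumed to be extracted beforehand):

import torch

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def multi_style_loss(gen_feats, style_feats_per_image, weights):
    # gen_feats: dict layer_name -> features of the generated image.
    # style_feats_per_image: list of dicts, one per style image y_i.
    # weights: list of dicts layer_name -> weight, chosen independently per style.
    loss = 0.0
    for feats_i, w_i in zip(style_feats_per_image, weights):
        for layer, sf in feats_i.items():
            loss = loss + w_i[layer] * (gram(gen_feats[layer]) - gram(sf)).pow(2).sum()
    return loss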
Color-preserving style transfer: uses luminance-only transfer, which works very well and requires only a simple transformation after a typical style transfer algorithm.
Semantically segmented style transfer: clusters parts of the input image that belong to the same object class. A mask of shape H x W is first generated for the input, indicating for each pixel location whether or not gradient descent should be applied there. All the above extensions make it feasible to target real-time video processing applications.
G. Automatic Image Colorization Using GANs
[31] discusses how GANs can automate the colorization of an image without changing the picture's structure, using a conditional GAN to achieve this. Rather than relying on fully connected networks, [31] uses layers of convolutions in the Generator, together with an encoder-decoder scheme with compression and expansion stages to reduce memory requirements during training.
The Generator takes in the greyscale image and downsamples it; the representation is compressed as it passes through, and these operations are repeated four times, resulting in a compact feature matrix. In the expansion phase, the representation is upsampled again. Batch normalization and leaky ReLU activations help the GAN train and perform better. The Discriminator receives the greyscale image together with the predicted color image.
FIGURE 31. Flow of segmented Style transfer. (Makow et. al. 2017)
The activation functions used to stabilize the last layers of the Generator and the Discriminator are the tanh and sigmoid activation functions, respectively. The Adam optimizer and strided convolutions are also used, improving training performance by exploiting the invariances of the convolution layers. Convergence failures were experienced on various occasions and were resolved by changing optimizers, increasing learning rates, changing kernel sizes, and introducing batch normalization.
Observations
This part focuses mainly on achieving better-quality images from GANs by improving color accuracy. [25], [27], and [30] propose spatial and color control in parallel, which allows the use of multiple styles and preserves the content image's colors, generating more photo-realistic images. By constraining the transform and adding a custom energy term, [28] provides a versatile model that handles a variety of input images. [29] introduces "spatially adaptive normalization," which assists in synthesizing photo-realistic images. A key feature provided by [25] is scale control, which allows us to mix coarse and fine attributes of two different styles; this helps with training on high-resolution images and is highly scalable in that regard. [26] is solely focused on video Neural Style Transfer and introduces temporal consistency between frames to allow dependencies between adjacent frames.
V. APPLICATION-BASED PAPERS
This section looks at the approaches, challenges, and
limitations in Neural Style Transfer for Videos on Mobile
phones. A few architectures are proposed based on their
performance on mobile devices.
A. Artistic Style Transfer for Videos
[33] presents the application of image style transfer to a
complete video. A few additions are made regarding
initializations and loss functions to suit the video input
allowing stable stylized videos even with a high degree of
motion. In addition, it processes each frame individually and
adds temporal constraints that penalize deviation among
point trajectories. [33] also, propose two more extensions:
Long-term motion estimates allow consistency over a
more considerable period in regions with occlusion.
A multi-pass algorithm is used to reduce the artifacts at the
image boundaries. The algorithm considers forward and
backward optical flow resulting in a better-quality video.
[33] propose using the previous frame to initialize the
optimizer for the current frame. This allows similar parts of
the frame to be rendered, whereas the changed parts are
rebuilt. However, the technique has flaws when used on
videos as moving objects are not initialized properly. To
address this, [27] consider the optical flow by warping the
previous output:
ω (46)
TABLE 10
REMARKS ON PART 3

Contribution | Ref. No. | Paper | Strength | Weakness
Spatial and Color Control | [25], [27], [30], [31] | Gatys et al. 2017, Gatys et al. 2016, Makow et al. 2017, Dhir et al. 2021 | Spatial control can reduce visual artifacts. Color control retains the content image's color. | Spatial control requires computing spatially guided Gram matrices, which can be computationally expensive. Color control comes at the expense of losing the dependencies between luminance and color.
Photorealism Regularization | [28] | Luan et al. 2017 | Eliminates painting-like effects, prevents spatial distortion, and limits the transfer operation to color space alone. Solves the problem of content differences between input and reference pictures, which might otherwise result in unwanted transfers between unrelated contents. | Some scene elements may be more (or less) represented in the input than in the reference image.
Spatially Adaptive Normalization | [29] | Park et al. 2019 | Spatially adaptive normalization preserves semantic information better than common normalization layers. | Removing any loss function from the objective results in degradation of the generated pictures.
Temporal Consistency | [26] | Gupta et al. 2017 | The network minimizes the sum of the content loss, the style reconstruction loss, and the temporal consistency loss, which makes it more robust at controlling distortions. | Occasionally one object can occlude others in the results, which is undesirable. Shower-door artifacts appear in the generated image.
where \omega_i^{i+1} warps the stylized frame x^{(i)} using the optical flow information derived from the content frames g^{(i)} and g^{(i+1)}. [33] use the DeepFlow and EpicFlow optical flow estimation algorithms to do so. The next addition is the use of temporal consistency losses to penalize inconsistencies between adjacent frames. To do so, the disoccluded regions are detected by comparing the forward and backward flows. The temporal loss then penalizes deviations between the generated image and those parts of the warped previous image where the optical flow is reliable. This is done with the help of per-pixel weights c that are set to zero at disocclusions and motion boundaries.
L_{temporal}(x, \omega, c) = \frac{1}{D} \sum_{k=1}^{D} c_k \,(x_k - \omega_k)^2        (47)
Thus, the short-term loss function is given as
L_{shortterm} = \alpha L_{content}\big(p^{(i)}, x^{(i)}\big) + \beta L_{style}\big(s, x^{(i)}\big) + \gamma L_{temporal}\big(x^{(i)}, \omega_{i-1}^{i}(x^{(i-1)}), c^{(i-1,i)}\big)        (48)
where p^{(i)} is the i-th content frame, s the style image, x^{(i)} the stylized output, and D the dimensionality of a frame. This is further extended to achieve longer-term consistency by incorporating the data from multiple previous frames rather than just one:
L_{longterm} = \alpha L_{content}\big(p^{(i)}, x^{(i)}\big) + \beta L_{style}\big(s, x^{(i)}\big) + \gamma \sum_{j \in J:\, i-j \geq 1} L_{temporal}\big(x^{(i)}, \omega_{i-j}^{i}(x^{(i-j)}), c_{long}^{(i-j,i)}\big)        (49)
The weights c_{long}^{(i-j,i)} are computed as follows:
c_{long}^{(i-j,i)} = \max\Big(c^{(i-j,i)} - \sum_{k \in J,\, k < j} c^{(i-k,i)},\; 0\Big)        (50)
This means looking back through past frames until a consistent correspondence is found. The advantage is that each pixel is associated with the nearest frame for which it is available, and since optical flow computed over temporally closer frames has a smaller error, this results in better videos. [33] handle the problem of strong motion using a multi-pass algorithm. The video is processed bi-directionally in multiple passes; by alternating the direction of optical flow, firmer consistency is achieved. Initially, every frame is processed independently from a random initialization. The frames are then blended with the warped non-disoccluded parts of neighboring frames, and the optimization algorithm is run for a few more iterations. The forward and backward passes are then alternated. The frame initializations for the forward and backward passes are given as:
FIGURE 32. A scene from the Sintel test dataset, the style image used, and the outputs obtained from various methods. The highlighted regions are the ones with prominent differences. The error images show the temporal inconsistency, which is prominent in the third approach. (Ruder et. al. 2016)
FIGURE 33. The ReCoNet architecture [35] for video Neural Style Transfer, presented by Gao et al. I_t, F_t and O_t are the input image, encoded feature map, and generated image at time t; a frame is made up of these three objects. The previous frame is compared with the current frame to compute the "temporal loss," which results in better dependencies between two consecutive frames. (Gao et al. 2018)
x'^{(i)(j)} = \begin{cases} x^{(i)(j-1)} & \text{if } i = 1,\\ \delta\, c^{(i-1,i)} \odot \omega_{i-1}^{i}\big(x^{(i-1)(j)}\big) + \big(\bar{\delta}\mathbf{1} + \delta\, \bar{c}^{(i-1,i)}\big) \odot x^{(i)(j-1)} & \text{otherwise,} \end{cases}        (51)
for the forward direction, and analogously for the backward direction using frame i+1 and the backward flow; here j indexes the pass, \delta is a blending factor, \bar{\delta} = 1 - \delta, and \bar{c} = \mathbf{1} - c.
The optical flow computation implementation takes roughly
3 minutes per frame at a resolution of 1024x436, which is
done with the help of parallel flow computation on the CPU.
At the same time, style transfer occurs on the GPU. The
short-term consistency results on the Sintel datasets are
presented in Table 11, where multiple approaches' errors are
compared across different videos.
The long-term consistency results are more qualitative. They
are thus presented in the form of supplementary videos.
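A minimal PyTorch sketch of the backward warping in Eq. (46) and the temporal loss of Eq. (47); the optical flow and the per-pixel weights c are assumed to be precomputed (e.g., with DeepFlow):

import torch
import torch.nn.functional as F

def warp(image, flow):
    # Backward-warp `image` (1, C, H, W) with `flow` (1, 2, H, W), flow in pixels
    # with channel 0 the x-displacement and channel 1 the y-displacement (assumed).
    _, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(image, grid.permute(0, 2, 3, 1), align_corners=True)

def temporal_loss(x_curr, x_prev, flow, c):
    # Eq. (47): mean squared difference between the current frame and the warped
    # previous frame, weighted by c (0 at disocclusions and motion boundaries).
    omega = warp(x_prev, flow)
    return (c * (x_curr - omega) ** 2).sum() / x_curr.numel()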
B. Real-time Neural Style Transfer for Video
[34] looks at the possibility of performing video style transfer with a feed-forward network. In contrast to directly applying an existing image style transfer technique to videos frame by frame, the proposed method uses a trained network to produce temporally consistent stylized outputs. Unlike previous video style transfer methods, which rely on per-frame optimization on the fly, the described technique runs in real time while still producing strong visual results.
The stylizing network accepts one frame as input and produces its stylized output. The loss network, pre-trained on the ImageNet classification task, extracts features of the stylized output frames and computes the losses used to train the stylizing network. During training, the stylizing network and loss network are connected, and the loss network's spatial and temporal losses are used to train the stylizing network. With sufficient training, the stylizing network, taking a single frame as input, has encoded the temporal coherence learned from a video dataset and can therefore produce temporally consistent stylized video frames.
TABLE 11
DIFFERENT METHODS TESTED ON MULTIPLE SEQUENCES WITH THEIR TEMPORAL CONSISTENCY ERRORS (RUDER ET. AL. 2016)

Method           | alley_2 | ambush_5 | ambush_6 | bandage_2 | market_6
DeepFlow         | 0.00061 | 0.0062   | 0.012    | 0.00084   | 0.0035
EpicFlow         | 0.00073 | 0.0068   | 0.014    | 0.00080   | 0.0032
Init prev warped | 0.0016  | 0.0063   | 0.012    | 0.0015    | 0.0049
Init prev        | 0.010   | 0.018    | 0.028    | 0.0041    | 0.014
Init random      | 0.019   | 0.027    | 0.037    | 0.018     | 0.023
FIGURE 34. An overview of the proposed model. It includes two components: a stylizing network and a loss network. Black, green, and red rectangles represent an input frame, an output frame, and the given style image, respectively. (Huang et al. 2017)
Stylizing network. The stylizing network transforms a single video frame into a stylized one. After three convolutional blocks, the resolution of the feature map is reduced to a quarter of the input. Five residual blocks then follow, leading to fast convergence. Compared with existing feed-forward networks for image style transfer, a significant advantage of this network is the use of fewer channels, which reduces the model size at only a small cost in stylization quality.
Loss network. To compute the spatial and temporal losses used to train the stylizing network, the high-level features of the input frame, the stylized frame, and the style image must be extracted. VGG-19 is employed in this work as the loss network, since it represents image content and style well. Two kinds of losses are used in the model: spatial loss and temporal loss.
C. Real-time Video-neural Style Transfer on mobile
devices
[35] presents a solution to two problems of video style
transfer:
1. The difficulty of usage by non-experts.
2. Hardware Limitations
They present an app that can perform neural style transfer to
videos at over 25FPS. They also discuss performance
concerning iOS-based devices where they test an iPhone 6s
and iPhone 11 Pro. Limitations for Android devices are also
discussed. The solution includes:
1. A real-time application of NST on mobile devices
2. Existing solutions to temporal coherence.
The traditional approach of applying a convolution-based image generator per frame causes "temporal inconsistency": unrelated frames that produce flicker artifacts. [21] tries to solve this problem; however, their model requires time-consuming computations.
[35] use Gao et al.'s lightweight feed-forward network. White bubbles can be seen in the generated images; these are caused by instance normalization and can be removed using filter response normalization, for which, however, no mobile implementations exist. Other issues include faded colors. The model is trained in two stages: first on the style and content losses together with a regularization term,
L_{stage1} = L_{content} + L_{style} + \lambda_r L_{reg}        (53)
and second on achieving temporal consistency:
L_{stage2} = L_{content} + L_{style} + \lambda_f L_{temp,f} + \lambda_o L_{temp,o}        (54)
L_{temp,f} and L_{temp,o} are the feature-based and output-based temporal losses presented in Gao's paper.
The main idea is to use the optical flow between adjacent frames during training only; at inference time the model does not need this information, which effectively makes it faster, since dense optical flow estimation is computationally expensive. On the other hand, introducing temporal coherence weakens the style transfer. Regarding Android versus iPhone implementations, Apple has had better support since 2018's A12 chip and the CoreML library, which effectively allow the use of dedicated NPUs. However, conversion from PyTorch to TensorFlow results in additional layers, causing a 30-40% FPS drop. Furthermore, many libraries are yet to provide full mobile GPU operation support; thus, due to the lack of standardization, Android implementations are rare. [35] also compare two iPhones (6s and 11 Pro) with different model sizes and resolutions and chart their FPS:
FIGURE 35. Performance achieved per configuration in terms of Frames Per Second (FPS) v/s Number of parameters in the model charted for two
mobile devices at two resolution levels. (Dudzik et. al. 2020)
Fig. 35 shows that the model described above can output around 13 FPS at 480p on an iPhone 11 Pro with half a million parameters. This indicates that video NST on mobile devices still needs many improvements. The coarse-to-fine stylization presented in [25] could probably be applied to increase the resolution of the generated images.
D. Multi-style Generative Network for Real-time Transfer
[36] argues that it is difficult to capture comprehensive styles with 1D style representations and that a better representation of image style requires a 2D method. A novel MSG-Net technique is presented that allows real-time control of the brush-stroke size: resizing the style image adjusts the relative brush size with respect to the input image.
The model is based on the following works:
Relation to pyramid matching: early methods performed texture synthesis using multi-scale image pyramids, where manipulating a white-noise image to match certain feature statistics could lead to realistic texture synthesis; this inspired the present approach.
o This method uses a similar feed-forward network, taking advantage of the benefits of deep learning networks without adding computational cost to the training process.
Relation to fusion layer: the proposed CoMatch layer takes both content and style as input, rather than treating style separately from content.
Content and style representation:
o The image texture or style can be represented as the distribution of features by means of the Gram matrix.
o The Gram matrix is orderless and describes the feature distributions.
CoMatch layer: explicitly matches second-order feature statistics to the given style.
o The target \hat{Y}^i is a feature map that retains the semantic information of the content image while matching the texture of the style image:
\hat{Y}^i = \arg\min_{Y^i} \big\| Y^i - F^i(x_c) \big\|_F^2 + \alpha \big\| \mathcal{G}(Y^i) - \mathcal{G}\big(F^i(x_s)\big) \big\|_F^2        (56)
where F^i(x_c) and F^i(x_s) are the i-th-scale features of the content and style images and \mathcal{G}(\cdot) denotes the Gram matrix.
o The parameter \alpha equalizes the contributions of the target's style and content; changing \alpha changes the weighting of the style term.
o The minimization above could be solved with an iterative technique (as sketched below), but this is not practicable in real time, nor is it easy to differentiate as a layer of the model.
o The target style feature map is therefore tuned using the following approximation: (57)
o The layer is differentiable and can be introduced into an existing generative network, learning directly from the loss function without additional supervision.
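A minimal sketch of the iterative minimization of the CoMatch objective in Eq. (56) mentioned above; MSG-Net replaces this loop with the closed-form approximation of Eq. (57), which is not reproduced here:

import torch

def gram(f):
    c, hw = f.shape
    return f @ f.t() / (c * hw)

def comatch_target(content_feat, style_feat, alpha=1.0, steps=200, lr=0.1):
    # Solve Eq. (56) for one scale by gradient descent on the target feature map Y.
    # content_feat, style_feat: (C, H*W) flattened feature maps.
    content_feat = content_feat.detach()
    G_style = gram(style_feat).detach()
    Y = content_feat.clone().requires_grad_(True)
    opt = torch.optim.Adam([Y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((Y - content_feat).pow(2).sum()
                + alpha * (gram(Y) - G_style).pow(2).sum())
        loss.backward()
        opt.step()
    return Y.detach()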
Multi-style Generative Network (MSG-Net): this method matches the feature statistics explicitly at run time.
o A Siamese network and the encoder of the transformation network share their weights; this branch captures the feature statistics of the style image and outputs its Gram matrices.
o The features of the style image are matched with the content image at multiple scales using CoMatch layers.
o Upsampled convolution: upsampling followed by a convolution layer is used in place of fractionally strided convolution; the computational complexity and number of parameters are four times those of fractionally strided convolution, but the network avoids the associated upsampling artifacts.
o Upsampled residual block: the original residual architecture is extended with an upsampling version of the block, as shown in Fig. 37.
o Brush-stroke size control: the network is trained with style images at different sizes so that it learns different brush-stroke sizes; users can choose the brush-stroke size after training.
o The use of ReLU layers and normalization improves the quality of the generated picture and makes it robust to changes in image contrast.
o The network is trained end-to-end by minimizing the overall loss.
The speed and size of models are crucial for mobile apps and cloud services; these are shown in Table 12. MSG-Net is faster because it uses its own learned, lightweight encoder in place of a pretrained VGG network.
Model scalability: it is noted that there is no loss in quality as the number of styles rises, while real-time performance is retained.
Fig. 37 shows the spatial control achievable with this model.
FIGURE 37. Extended Architecture. (Zhang et. al. 2017)
TABLE 12
COMPARISON BETWEEN DIFFERENT MODEL ARCHITECTURES BASED ON MODEL SIZE AND SPEED (IMAGES/SEC). (ZHANG ET. AL. 2017)

Model           | Model size | Speed (256) | Speed (512)
Gatys et al.    | N/A        | 0.07        | 0.02
Johnson et al.  | 6.7 MB     | 91.7        | 26.3
Dumoulin et al. | 6.8 MB     | 88.3        | 24.7
Chen et al.     | 574 MB     | 5.84        | 0.31
Huang et al.    | 28.1 MB    | 37.0        | 10.2
MSG-Net-100     | 9.6 MB     | 92.7        | 29.2
MSG-Net-1k      | 40.3 MB    | 47.2        | 14.3
FIGURE 36. An overview of MSG-Net. (Zhang et. al. 2017)
Observations
[33] extends single-image NST to video and adds optical flow estimation over multiple frames to improve temporal consistency, while [34] achieves real-time video style transfer with a feed-forward stylizing network. [36] introduces a CoMatch layer that matches second-order feature statistics with target styles. [35] focuses on implementation on mobile devices and compares the performance of video style transfer models of varying sizes and input resolutions on two devices. It is observed that achieving reasonable frame rates at high resolutions is difficult, given the limited GPU support available on mobile devices.
VI. NST Evaluation Metrics
Evaluating NST models can be challenging because of the variety of GAN models. However, accuracy, Fréchet Inception Distance (FID), Intersection-over-Union (IoU), runtime, perceptual path length, and warping error are the metrics most often utilized for the models constructed in the publications evaluated [38].
Accuracy was used to measure the relative depth of the predicted images. It was also used to evaluate predicted feature maps, where higher accuracy indicates more accurate feature maps.
The Fréchet Inception Distance (FID) approximates the real and fake feature distributions with two Gaussian distributions. The Fréchet distance (Wasserstein-2 distance) between the two Gaussians is then computed, and the result is used to judge the model's quality.
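A minimal NumPy/SciPy sketch of the FID computation described above, assuming the real and generated feature vectors (e.g., Inception activations) have already been extracted:

import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    # FID between two feature sets of shape (N, D):
    # ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^(1/2)).
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))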
A few papers use the Intersection-over-Union (IoU) metric to determine the accuracy of segmentation and detection in object classification and localization.
When interpolating between two random inputs, the perceptual path length quantifies the difference between consecutive images (in terms of VGG16 embeddings). It determines whether the image changes along the shortest perceptual path in the latent space from which fake images are generated.
The warping error is the difference between the warped previous frame and the real subsequent frame. It is a good metric for determining the smoothness of a video, since it is an efficient way to monitor video stability across many frames.
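A minimal PyTorch sketch of the warping error: the previous frame is warped into the current one with precomputed optical flow, and the masked mean absolute difference is reported (lower values indicate a smoother video):

import torch
import torch.nn.functional as F

def warping_error(curr, prev, flow, occlusion_mask):
    # curr, prev: (1, 3, H, W); flow: (1, 2, H, W) in pixels;
    # occlusion_mask: (1, 1, H, W), 0 where the flow is unreliable.
    _, _, H, W = curr.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0
    warped_prev = F.grid_sample(prev, grid.permute(0, 2, 3, 1), align_corners=True)
    diff = occlusion_mask * (curr - warped_prev).abs()
    return (diff.sum() / (occlusion_mask.sum() * curr.shape[1] + 1e-8)).item()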
VII. Possible Future Applications of NST:
Apart from various exciting image transformation use cases,
NST can be extended in a few more application areas such
as:
Movies: NST can change the scenes captured in
movies using representational objects instead of
green screens and tedious editing [39], [40].
Online Education: Using different style banks, the
same model can be used for other applications, such
as creating animated versions of real-life stories in
Education.
Gaming: It can also be used in Mixed Reality (MR)
games wherein the real world seen from the MR
headset will change based on the style used for the
game [40], [41].
Fashion industry: NST can find applications in the
fashion industry where designers and consumers
can use it to overlay items while designing or trying
them [42], [43].
Approaches like [44] provide a good starting point for real-time video style transfer and can be improved to work efficiently on mobiles. With user privacy in the headlines every day, federated learning can also provide safer, more private data access by localizing training to specific devices; a recent approach is [45]. Studies like [38] compare the evaluation metrics commonly used. Having more architectures that train on unpaired data is another interesting sub-domain to venture into. A good approach that performs style transfer on unpaired data is [46]; although [46] works for high-resolution unpaired images and not videos, it can be considered a good entry point for high-resolution video style transfer. [47] uses Vision Transformers for image style transfer. There can be many more such fascinating use cases for NST in the near future, based on the user requirements discussed in [48-55]. Most of the current work uses traditional neural networks with iterative training mechanisms to solve the video style transfer problem, which can lead to time-consuming training of the whole model. Non-iterative methods for training ANNs could also be implemented in place of typical iterative methods [56], [57], which would help improve the efficiency of the algorithm [58].
VIII. Research Gaps
The research gaps observed in this literature review are summed up in Fig. 38 and can be grouped into three basic categories: platform-related, dataset-related, and architecture-related.
Platform-related:
a. Native Mobile NST: Implementing real-time video
neural style transfer directly on mobiles. Most
applications implement style transfer on mobiles via a
Client-server approach. This is primarily due to mobiles
having relatively new software and low-power
hardware.
FIGURE 37. Spatial Control Result. (Zhang et. al. 2017)
b. Use of Federated Learning: Federated learning is
another gap observed while looking at Mobile NST. It is
a recent idea and has been used to overcome low power
device limitations.
Dataset Related:
a. Lack of benchmark datasets: As discussed previously,
multiple papers mix and match datasets by re-purposing
them from different domains. While this has the pros of
swapping and replacing datasets in training, the need for
a benchmark dataset can be seen for evaluation
purposes. A benchmark dataset could make testing,
evaluating, and understanding the model's performance
standardized. Another point observed is that some
articles create their datasets and apply different
transforms to data, which can distort the image's
structure, leading to the generation of artifacts.
b. Lack of a good benchmark metric: It is observed and
discussed above that many papers turn to Amazon M-
Turks (a service that offers manual labor) to inspect the
quality of the images generated. Photorealism is usually
inspected manually and thus could be a place to add a
metric. However, this can be difficult as photorealism is
subjective and might change depending on context. In
addition, as discussed previously, whereas there are
metrics such as Intersection over Union or Accuracy,
they rely on "comparing" two similar images. This can
be particularly challenging to use as one needs some
"ground truth" to compare to, and paired samples can be
tricky to obtain.
Model Architectures: It is seen that many of the models
cannot handle super-resolution very well. The
scalability of models in terms of the resolution of
generated images is thus another concern. Apart from
this, most data available or pieced together is usually
unpaired, meaning the content and style images do not
have the same structural composition.
IX. CONCLUSION and FUTURE SCOPE
NST, one of the most exhilarating AI applications adopted for artistic use of photos and videos, has started capturing the attention of GAN researchers in the last few years. This paper presented a comprehensive study of GANs and video NST, divided into four parts. First, the working of GANs was explained, along with recent developments in the different types of models used for NST, such as CartoonGAN and Artsy-GAN; unpaired images can be used for training GANs using CycleGANs, and adding "temporal losses" allows consistency between adjacently generated frames, as seen across multiple architectures. Then the improvement papers were discussed, explaining how spatial, color, and scale control can allow better image generation. Lastly, it was explained how NST can be applied on mobile devices in real time using GANs.
However, real-time NST on mobile devices at a reasonable frame rate is still relatively difficult to achieve. As time progresses, low-power devices and devices with a smaller footprint will handle large-scale computation better. The use of non-iterative methods for AI model training can make it more time-efficient and would work better in real-time scenarios.
FIGURE 38. Research Gaps
An exciting avenue to investigate is the use of NST in Augmented Reality. Non-iterative video NST is a good topic for future research, since it can considerably reduce the time required to process videos. Since NST has vast potential, research in this area is expected to grow exponentially in the coming years.
REFERENCES
[1] The Smartphone vs. The Camera Industry, Available
at https://photographylife.com/smartphone-vs-
camera-industry/amp . Accessed on 2021-04-20.
[2] Adobe Premiere Pro, Available at
https://www.adobe.com/in/products/premiere/movie-
and-film-editing.html . Accessed on 2021-04-20.
[3] DaVinci Resolve, Available at
https://www.blackmagicdesign.com/in/products/davi
nciresolve. Accessed on 2021-04-20.
[4] Jing, Y., Yang, Y., Feng, Z., Ye, J., Yu, Y., &
Song, M. (2019). Neural style transfer: A
review. IEEE transactions on visualization and
computer graphics, 26(11), 3365-3385.
[5] Li, H. A literature review of neural style transfer.
Princeton University Technical report, Princeton NJ,
085442019.
[6] Li, J., Wang, Q., Chen, H., An, J., & Li, S. (2020,
October). A Review on Neural Style Transfer.
In Journal of Physics: Conference Series (Vol. 1651,
No. 1, p. 012156). IOP Publishing.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D.
Warde-Farley, S. Ozair, A. Courville and Y. Bengio
(2014), "Generative Adversarial Nets," in
arXiv:1406.2661v1.
[8] T. Karras, S. Laine and T. Aila (2019), "A Style-Based
Generator Architecture for Generative Adversarial
Networks," in arXiv:1812.04948v3.
[9] A. Radford, L. Metz and S. Chintala (2016),
"Unsupervised Representation Learning with Deep
Convolutional Generative Adversarial Networks," in
arXiv:1511.06434v2.
[10] A. Dosovitskiy, P. Fischer, J. Springenberg, M.
Riedmiller, T. Brox. “Discriminative Unsupervised
Feature Learning with Exemplar Convolutional
Neural Networks,” Advances in Neural Information
Processing Systems 27 (NIPS 2014).
[11] J. Zhu, T. Park, P. Isola and A. Efros (2020),
"Unpaired Image-to-Image Translation using Cycle-
Consistent Adversarial Networks," in
arXiv:1703.10593v7.
[12] P. Isola, J. Zhu, T. Zhou, A. Efros (2018), "Image-
to-Image Translation with Conditional Adversarial
Networks," in arXiv:1611.07004v3.
[13] J. Zhu, T. Park, P. Isola and A. Efros (2020),
"Unpaired Image-to-Image Translation using Cycle-
Consistent Adversarial Networks," in
arXiv:1703.10593v7.
[14] P. Isola, J. Zhu, T. Zhou, A. Efros (2018), "Image-
to-Image Translation with Conditional Adversarial
Networks," in arXiv:1611.07004v3.
[15] L. Gatys, A. Ecker and M. Bethge (2016), "Image
Style Transfer Using Convolutional Neural
Networks," in IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
[16] Y. Jin, J. Zhang, M. Li, Y. Tian, H. Zhu and Z. Fang
(2017), "Towards the Automatic Anime Characters
Creation with Generative Adversarial Networks," in
arXiv:1708.05509v1.
[17] N. Kodali, J. Abernethy, J. Hays, Z. Kira. “How to
train your dragan”. arXiv preprint arXiv:1705.07215,
2017.
[18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A.
Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz,
Z. Wang, et al. “Photo-realistic single image super-
resolution using a generative adversarial network”.
arXiv preprint arXiv:1609.04802, 2016.
[19] Y. Chen, Y. Lai and Y. Liu (2018), "CartoonGAN:
Generative Adversarial Networks for Photo
Cartoonization,” IEEE/CVF Conference on Computer
Vision and Pattern.
[20] H. Liu, P. N. Michelini and D. Zhu (2018), "Artsy-
GAN: A style transfer system with improved quality,
diversity, and performance," in 24th International
Conference on Pattern Recognition (ICPR).
[21] C. Li and M. Wang. “Precomputed Real-Time
Texture Synthesis with Markovian Generative
Adversarial Network”. ECCV, 2016.
[22] V. Kitov, K. Kozlovtsev, and M. Mishustina
(2020), "Depth-Aware Arbitrary Style Transfer Using
Instance Normalization," in arXiv:1906.01123v2.
[23] D. Chen, L. Yuan, J. Liao, N. Yu and G. Hua
(2017), "StyleBank: An Explicit Representation for
Neural Image Style Transfer," in IEEE Conference on
Computer Vision and Pattern Recognition (CVPR)
2017.
[24] C. Zheng and Y. Zhang (2018), "Two-Stage Color
ink Painting Style Transfer via Convolution Neural
Network," in 15th International Symposium on
Pervasive Systems, Algorithms, and Networks (I-
SPAN).
[25] L. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann,
and E. Shechtman (2017), "Controlling Perceptual
Factors in Neural Style Transfer," in
arXiv:1611.07865v2.
[26] A. Gupta, J. Johnson, A. Alahi and L. Fei-Fei
(2017), "Characterizing and Improving Stability in
Neural Style Transfer," in arXiv:1705.02092v1.
[27] L. Gatys, M. Bethge, A. Hertzmann and E.
Shechtman (2016), "Preserving Color in Neural
Artistic Style Transfer," in arXiv:1606.05897v1.
[28] F. Luan, S. Paris, E. Shechtman and K. Bala (2017),
“Deep Photo Style Transfer,” in arXiv:1703.07511v3.
[29] T. Park, M. Liu, T. Wang and J. Zhu (2019),
"GauGAN: semantic image synthesis with spatially
adaptive normalization," in DOI: 1-1.
10.1145/3306305.3332370.
[30] N. Makow and P. Hernandez (2017), "Exploring
Style Transfer: Extensions to Neural Style Transfer."
[31] Dhir R., Ashok M., Gite S., Kotecha K. (2021)
Automatic Image Colorization Using GANs. In: Patel
K.K., Garg D., Patel A., Lingras P. (eds) Soft
Computing and its Engineering Applications.
icSoftComp 2020. Communications in Computer and
Information Science, vol 1374. Springer, Singapore.
https://doi.org/10.1007/978-981-16-0708-0_2
[32] Rashi Dhir, Meghna Ashok, Shilpa Gite. "An
Overview of Advances in Image Colorization Using
Computer Vision and Deep Learning Techniques"
(2020) in " Review of Computer Engineering
Research, Conscientia Beam, vol. 7(2), pages 86-95.
DOI: 10.18488/journal.76.2020.72.86.95
[33] M. Ruder, A. Dosovitskiy and T Brox (2016),
"Artistic style transfer for videos," in
arXiv:1604.08610v2.
[34] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma,
Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, Wei Liu.
"Real-Time Neural Style Transfer for Videos," 2017
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017
[35] W. Dudzik and D. Kosowski (2020), "Kunster --
AR Art Video Maker -- Real-time video neural style
transfer on mobile devices," in arXiv:2005.03415v1.
[36] H. Zhang and K. Dana (2017), "Multi-style
Generative Network for Real-time Transfer," in
arXiv:1703.06953v2.
[37] D. Chen, J. Liao, L. Yuan, N. Yu, G. Hua.
“Coherent Online Video Style Transfer.” ICCV 2017.
[38] Q. Xu, G. Huang, Y. Yuan, C. Guo, Yu Sun (2018),
“An empirical study on evaluation metrics of
generative adversarial networks,” in
arXiv:1806.07755v2.
[39] B. Joshi, K. Stewart, D. Shapiro. “Bringing
Impressionism to Life with Neural Style Transfer in
Come Swim.” arXiv preprint arXiv:1701.04928v1,
2017.
[40] D. Chen, L. Yuan, J. Liao, N. Yu, G. Hua.
“Stereoscopic Neural Style Transfer.” CVPR 2018.
[41] Z. Hao, Arun M., S. Belongie, Ming-Yu L. (2021),
“GANcraft: Unsupervised 3D Neural Rendering of
Minecraft Worlds”, in arXiv:2104.07659v1.
[42] W. Jiang1, Si Liu, C. Gao, J. Cao, R. He, J. Feng,
S. Yan (2019), “PSGAN: Pose and Expression Robust
Spatial-Aware GAN for Customizable Makeup
Transfer,” in arXiv:1909.06956v2.
[43] T. Nguyen, A. T. Tran, M. Hoai (2021), “Lipstick
ain’t enough: Beyond Color Matching for In-the-Wild
Makeup Transfer,” in arXiv:2104.01867v1.
[44] Ondřej Texler, David Futschik, Michal Kučera,
Ondřej Jamriška, Šárka Sochorová, Menglei Chai,
Sergey Tulyakov, Daniel Sýkora, “Interactive Video
Stylization Using Few-Shot Patch-Based Training” In
ACM Transactions on Graphics 39(4) (SIGGRAPH
2020).
[45] J. Song, J. Ye (2021), “Federated CycleGAN for
Privacy-Preserving Image-to-Image Translation” in
arXiv:2106.09246v1.
[46] Ang Li, Chunpeng Wu, Yiran Chen, and Bin Ni.
2020. MVStylizer: an efficient edge-assisted video
photo-realistic style transfer system for mobile
phones. In Proceedings of the Twenty-First
International Symposium on Theory, Algorithmic
Foundations, and Protocol Design for Mobile
Networks and Mobile Computing (Mobihoc '20).
Association for Computing Machinery, New York,
NY, USA, 31–40. DOI:
https://doi.org/10.1145/3397166.3409140
[47] A. Junginger, M. Hanselmann, T. Strauss, S.
Boblest, J. Buchner, H. Ulmer (2018), “Unpaired
High-Resolution and Scalable Style Transfer Using
Generative Adversarial Networks” in
arXiv:1810.05724v1
[48] Y. Deng, F. Tang, X. Pan, W. Dong, C. Ma,
Changsheng Xu (2021), “StyTr^2: Unbiased Image
Style Transfer with Transformers” in
arXiv:2105.14576v2.
[49] Leon A. Gatys, Alexander S. Ecker, Matthias
Bethge, Aaron Hertzmann, Eli Shechtman.
"Controlling Perceptual Factors in Neural Style
Transfer," 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017
[50] "Computer Vision – ECCV 2018 Workshops",
Springer Science and Business Media LLC, 2019.
[51] "Computer Vision – ECCV 2016", Springer
[52] Zhou, Gu, Gao, Wang. "An Improved Style
Transfer Algorithm Using Feedforward Neural
Network for Real-Time Image Conversion,"
Sustainability, 2019.
[53] Tero Karras, Samuli Laine, Timo Aila. "A
StyleBased Generator Architecture for Generative
Adversarial Networks," IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2020
[54] Chang Gao, Derun Gu, Fangjun Zhang, and Y. Yu
(2018),” ReCoNet: Real-time Coherent Video Style
Transfer Network,” in arXiv:1807.01197v2.
[55] D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B.
Tenenbaum, W. T. Freeman, A. Torralba.
“Visualizing and Understanding Generative
Adversarial Networks Extended Abstract.” arXiv
preprint arXiv: 1901.09887v1, 2019.
[56] Wang, Xizhao, and Weipeng Cao. "Non-iterative
approaches in training feed-forward neural networks and
their applications." (2018): 3473-3476.
[57] Cao, W., Hu, L., Gao, J., Wang, X., & Ming, Z.
(2020). A study on the relationship between the rank of
input data and the performance of random weight neural
network. Neural Computing and Applications, 1-12.
[58] Cao, W., Wang, X., Ming, Z., & Gao, J. (2018). A
review on neural networks with random weights.
Neurocomputing, 275, 278-287.