Fashioning with Networks: Neural Style Transfer to Design Clothes
Prutha Date
University Of Maryland
Baltimore County (UMBC),
Baltimore, MD,
Ashwinkumar Ganesan
University Of Maryland
Baltimore County (UMBC),
Baltimore, MD,
Tim Oates
University Of Maryland
Baltimore County (UMBC),
Baltimore, MD,
Convolutional Neural Networks have been highly successful in performing a host of computer vision tasks such as object recognition, object detection, image segmentation and texture synthesis. In 2015, Gatys et al. [7] showed how the style of a painter can be extracted from an image of a painting and applied to an ordinary photograph, thus recreating the photo in the style of the painter. The method has been successfully applied to a wide range of images and has since spawned multiple applications and mobile apps. In this paper, the neural style transfer algorithm is applied to fashion so as to synthesize new custom clothes. We construct an approach to personalize and generate new custom clothes based on a user’s preference and by learning the user’s fashion choices from a limited set of clothes from their closet. The approach is evaluated by analyzing the generated images of clothes and how well they align with the user’s fashion style.
CCS Concepts
Computing methodologies → Computer vision; Machine learning approaches; Neural networks;
Keywords
Convolutional Neural Networks, Personalization, Fashion, Neural Networks, Style Transfer, Texture Synthesis
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ML4Fashion ’17 August 14, 2017, Nova Scotia, Canada
2017 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
There have recently been impressive advances in computer vision tasks such as object recognition, detection and segmentation [16][21][3]. The revolution started with Krizhevsky et al. [16] substantially improving object recognition on the ImageNet challenge using convolutional neural networks (CNNs). This led to research and subsequent improvements in many tasks related to fashion, such as classifying clothes, predicting different kinds of attributes of a specific piece of clothing, and improving the retrieval of images [19][14][2][29][15]. Giants of e-commerce are expanding their investment in fashion. Recently, Amazon patented a system to manufacture clothes on demand [22]. It has also started shipping its virtual assistant Echo with an integrated camera that takes a picture of the user’s outfit and rates its style [9]. StitchFix aims to simplify the user’s online shopping experience. As the online fashion industry looks to improve the kind of clothes that are recommended to users, understanding the personal style preferences of users and recommending custom designs becomes an important task.
Personalization and recommendation models are a well-researched area that includes methods ranging from collaborative filtering [18] to content-based recommendation systems (e.g., probabilistic graph models, neural networks), as well as hybrid systems that combine both. Collaborative filtering [18] analyzes user behaviour and preferences and aligns users to predefined patterns so as to recommend a product. Content-based methods recommend a product based on the attributes or features that the user is searching for. A hybrid system (knowledge-based system [27]) incorporates both user preferences and product features to recommend an item to the user.
While the above techniques retrieve the product (or its image), we seek to synthesize new personalized merchandise. Texture synthesis tries to learn the underlying texture of an image in order to generate new samples with the same texture. The research in this space [6] is largely focused on parametric and non-parametric methods. Non-parametric methods resample specific pixels from the image or adopt specific patches from the original to generate the new image [5][28][17][4]. Parametric methods define a statistical model that represents the texture [13][10][20]. In 2015, Gatys et al. [6] designed a new parametric model for texture synthesis based on convolutional neural networks. They model the style of an image by extracting the feature maps generated when the image is fed through a pre-trained CNN, in this instance a 19-layer VGGNet. They successfully separate the style and content of an arbitrary image and demonstrate how one image can be stylized using the textures of another.
Although Convolutional Neural Networks provide state-of-the-art performance for multiple computer vision tasks, their complexity and opacity have been a substantial research question. Visualizing the features learned by the network has been addressed in multiple efforts. Zeiler et al. [30] use a deconvolution network to reconstruct the features learned in each layer of the CNN. Simonyan et al. [25] backpropagate the gradients generated for a class with respect to the input image to create an artificial image (the initial image is just random noise) that represents the class in the network. The separation of style and content in an image by Gatys et al. [6] shows the variant (content) and invariant (style) parts of the image.
Figure 1: (a) and (b) provide the shape and style respectively; (c) shows the final design.
arXiv:1707.09899v1 [cs.CV] 31 Jul 2017
Our contribution in this paper is a pipeline to learn the user’s unique fashion sense and generate new design patterns based on their preferences. Figure 1 shows a sample clothing item generated using neural style transfer. The first clothing item, given by the user, provides the shape for the new dress. The second is provided by the user from his/her closet so that their preference can be learned. The third is the final generated design for the user (the generated sample contains styles from multiple pieces of the user’s clothing).
The following sections discuss the related work, how neural style transfer works, our system architecture, the experiments conducted and their results.
Prior research on fashion data in the computer vision community has dealt with a whole range of challenges including clothes classification, predicting attributes of clothes and the retrieval of items [14][2][29][15]. Liu et al. [19] create a robust fashion dataset of about 800,000 images that contains annotations for various types of clothes, their attributes and the location of landmarks, as well as cross-domain pairs. They also design a CNN to predict attributes and landmarks. The architecture is based on a 16-layer VGGNet and adds convolutional and fully-connected layers to train a network to predict them. Isola et al. [11] perform image-to-image translation using a conditional adversarial network. They perform experiments to generate various fashion accessories when provided with a sketch of the item.
We use a 19-layer pre-trained VGGNet [25] that is trained on the ImageNet dataset [24]. The network consists of 16 convolutional layers and 3 fully-connected layers. It is trained to predict 1000 classes (from the ImageNet challenge). The network is known to be robust and the features it generates have been used to solve multiple downstream tasks. Gatys et al. use the pre-trained VGGNet to extract style and content.
Johnson et al. [12] create an image transformation network trained to transform an image with a given style. A feed-forward transformation network is trained to run in real time using perceptual loss functions that depend on high-level features from a pre-trained loss network, rather than a per-pixel loss function based on low-level pixel information. The trained network does not start transforming the image from white noise but generates the output directly, thus speeding up the process.
Gatys et al. [7, 8] describe the process of using image representations encoded by multiple layers of VGGNet to separate the content and style of images and recombine them to form new images. The idea of style extraction is based on the texture synthesis process that represents the texture as a Gram matrix of the feature maps generated from each convolutional layer. The style is extracted as a weighted set of Gram matrices across all convolutional layers of the pre-trained VGGNet when it processes an image. The content is obtained from feature maps extracted from the higher layers of the network when the image is processed. The content and style losses are computed as the mean squared error (MSE) between the feature maps and the Gram matrices, respectively, of the original image and a randomly generated image (initialized from white noise). Minimizing the loss transforms the white noise into a new artistic image.
We use the method described above to generate new fashion designs.
This section describes how the style and content are extracted from an image using neural style transfer [7]. We use the implementation given by [26], a pre-trained 19-layer VGGNet model (VGG-19) that takes a content image and a set of style images as input.
Consider an input image $x$ and a convolutional neural network $NN$. Every convolutional layer $l$ in the network has $N_l$ distinct filters. Upon completion of the convolution operation (and the activation function being applied), let the computed feature map have height $h$ and width $w$. The map flattened into a single vector has size $M_l = 1 \times (h \times w)$. Thus, the feature maps at every layer $l$ can be given as $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ik}$ represents the activation of the $i$-th filter at position $k$.
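As a minimal sketch of the notation above, the following Python snippet flattens a stack of N_l feature maps of size h x w into an N_l x M_l matrix with M_l = h * w. The function name `flatten_feature_maps` and the toy values are illustrative assumptions; plain lists are used only to keep the example dependency-free, whereas the pipeline itself operates on VGG-19 activations.

```python
def flatten_feature_maps(maps):
    """Turn N_l feature maps of shape h x w into an N_l x M_l matrix (M_l = h * w).

    Each map is flattened row by row, so position k = row * w + column.
    """
    return [[value for row in fmap for value in row] for fmap in maps]


# Two hypothetical 2 x 3 feature maps (N_l = 2, h = 2, w = 3).
maps = [
    [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]],
    [[0.0, 1.0, 0.0],
     [4.0, 0.0, 1.0]],
]

F = flatten_feature_maps(maps)
print(len(F), len(F[0]))  # 2 6
```

Here `F[i][k]` plays the role of the activation of the i-th filter at position k.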
3.1 Style Extraction
Figure 2: Overall System Architecture. $A_1 \ldots A_n$ are all the attributes in the dataset [19]; $A_1 \ldots A_k$ are the set of attributes given by the user. $L$ is the total loss between the Gram matrix of the (iteratively) modified UCO image and the Gram matrices from the user’s personal style store (for $A_1 \ldots A_k$). In the first phase the user provides the system access to his/her closet images, from which the user’s fashion preferences are learned. In phase two, the user gives his/her choices (attributes such as Striped Top or Chiffon) along with the desired outline of a piece of clothing to get the new custom design.
The Gram matrix at layer $l$ is given by $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is calculated by the dot product of the feature maps $i$ and $j$ for layer $l$:
$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk} \tag{1}$$
The dot product computes the similarity between feature maps. Thus the entries of the Gram matrix $G^l$ are large for features that are consistently active together across the maps, while features that are inconsistent between the maps contribute values close to 0.
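The Gram matrix computation, G_ij = sum over k of F_ik * F_jk, can be sketched in a few lines of Python. The helper name `gram_matrix` and the toy activations are assumptions made for illustration; the third filter is deliberately a scaled copy of the first, and the second is never active at the same positions as the first.

```python
def gram_matrix(F):
    """G[i][j] = sum_k F[i][k] * F[j][k] for an N_l x M_l matrix F."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


# Hypothetical flattened activations for N_l = 3 filters and M_l = 4 positions.
F = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],  # never active where filter 0 is active
    [2.0, 0.0, 4.0, 0.0],  # a scaled copy of filter 0
]

G = gram_matrix(F)
for row in G:
    print(row)
# [5.0, 0.0, 10.0]
# [0.0, 10.0, 0.0]
# [10.0, 0.0, 20.0]
```

Filters 0 and 2 fire at the same positions and get a large entry, while the pairs involving filter 1 are exactly 0 because their activations never overlap.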
Consider two images $x$ (the input image used to transfer the style) and $\hat{x}$ (a randomly generated image from white noise). Let their corresponding Gram matrices be $G^l$ and $\hat{G}^l$. The style loss function is then computed for every layer as the mean squared error (MSE) between $G^l$ and $\hat{G}^l$:
$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - \hat{G}^l_{ij} \right)^2 \tag{2}$$
$E_l$ is the style loss for layer $l$.
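A per-layer style loss of this kind can be sketched as follows. The normalization 1 / (4 * N_l^2 * M_l^2) is an assumption taken from the formulation of Gatys et al.; the function names and the toy inputs are likewise illustrative only.

```python
def gram_matrix(F):
    """G[i][j] = sum_k F[i][k] * F[j][k] for an N_l x M_l matrix F."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]


def layer_style_loss(F, F_hat):
    """Squared difference between the two Gram matrices, scaled by 1 / (4 N_l^2 M_l^2)."""
    n_l, m_l = len(F), len(F[0])
    G, G_hat = gram_matrix(F), gram_matrix(F_hat)
    squared_error = sum(
        (g - g_hat) ** 2
        for row, row_hat in zip(G, G_hat)
        for g, g_hat in zip(row, row_hat)
    )
    return squared_error / (4.0 * n_l ** 2 * m_l ** 2)


style = [[1.0, 0.0], [0.0, 1.0]]     # activations of the style image
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # same activations at swapped positions
blank = [[0.0, 0.0], [0.0, 0.0]]     # no activations at all

print(layer_style_loss(style, shuffled))  # 0.0
print(layer_style_loss(style, blank))     # 0.03125
```

The shuffled input has zero loss because the Gram matrix sums over positions and therefore ignores where in the image a feature occurs, which is why it captures texture rather than layout.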
3.2 Content Extraction
The feature maps from the higher layers in the model give a representation of the image that is more biased towards the content [6]. We use the feature representations of the conv4_2 layer to extract content. Given the feature representations in layer $l$ of the original image $x$ and the generated white noise image $\hat{x}$ as $F^l$ and $\hat{F}^l$ respectively, we define the content loss as the mean squared difference between the two:
$$L_{content}(x, \hat{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - \hat{F}^l_{ij} \right)^2 \tag{3}$$
The derivative of this loss with respect to the feature map at layer $l$ gives the gradient used to minimize the loss:
$$\frac{\partial L_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - \hat{F}^l \right)_{ij}, & \text{if } F^l_{ij} > 0 \\ 0, & \text{if } F^l_{ij} < 0 \end{cases} \tag{4}$$
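The content loss and its gradient can be checked numerically with a short sketch. It assumes the loss is half the sum of squared differences between the two feature maps and treats an activation of exactly 0 as inactive; the function names and values are hypothetical.

```python
def content_loss(F, F_hat):
    """0.5 * sum_ij (F_ij - F_hat_ij)^2."""
    return 0.5 * sum(
        (a - b) ** 2 for row, row_hat in zip(F, F_hat) for a, b in zip(row, row_hat)
    )


def content_grad(F, F_hat):
    """(F - F_hat)_ij where F_ij > 0, and 0 elsewhere."""
    return [
        [(a - b) if a > 0 else 0.0 for a, b in zip(row, row_hat)]
        for row, row_hat in zip(F, F_hat)
    ]


def numeric_grad(F, F_hat, i, j, eps=1e-6):
    """Central finite difference of the loss with respect to F[i][j]."""
    up = [row[:] for row in F]
    down = [row[:] for row in F]
    up[i][j] += eps
    down[i][j] -= eps
    return (content_loss(up, F_hat) - content_loss(down, F_hat)) / (2 * eps)


F = [[1.0, 0.0], [2.0, 3.0]]
F_hat = [[0.5, 1.0], [2.0, 1.0]]

loss = content_loss(F, F_hat)
grad = content_grad(F, F_hat)
print(loss)  # 2.625
print(grad)  # [[0.5, 0.0], [0.0, 2.0]]
```

For an active entry such as `F[1][1]`, `numeric_grad(F, F_hat, 1, 1)` agrees with `grad[1][1]` up to rounding, which confirms that the factor 1/2 in the loss cancels the 2 from differentiating the square.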
Figure 2 shows the entire pipeline to personalize and design custom clothes for the user. The architecture has four modules, namely preprocessing, personal style store creation, style transfer, and post-processing to generate the final design. The following sections discuss these modules in more detail.
To minimize the complexity of the problem, we consider images from the DeepFashion dataset [19] that have a white background. The images contain only clothing objects with no humans or other artifacts. They are only upper-body or full-body apparel pieces.
4.1 Preprocessing
All images are resized to 512 x 512. The image is resized not by expanding or contracting it, but by creating a temporary white background image of the above-mentioned size. The original image is then placed at the center of that temporary image. This brings the image to the expected size without deforming it. Also, the mask of the image is extracted and stored using the GrabCut utility [23]. This mask is used in the postprocessing step to get rid of patterns lying beyond the contours of the apparel. The attributes of the clothes are assumed to be provided; automatically labeling them is beyond the scope of this paper.
Figure 3: Evaluation model for predicting attribute labels on separate training and test generation images
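The padding step can be sketched as below for a single-channel image stored as a list of rows. The function name `center_on_canvas`, the grayscale representation, and the choice to raise an error for images larger than the canvas are assumptions for illustration; the text does not say how oversized images are handled. A small canvas is used in the example instead of 512 x 512 to keep the output readable.

```python
WHITE = 255


def center_on_canvas(image, size=512, fill=WHITE):
    """Place `image` at the center of a size x size canvas without scaling it."""
    h, w = len(image), len(image[0])
    if h > size or w > size:
        raise ValueError("image is larger than the canvas")
    top, left = (size - h) // 2, (size - w) // 2
    canvas = [[fill] * size for _ in range(size)]
    for r, row in enumerate(image):
        # Overwrite exactly w pixels, so every canvas row keeps its length.
        canvas[top + r][left:left + w] = row
    return canvas


garment = [[0, 0, 0],
           [0, 0, 0]]  # hypothetical 2 x 3 dark garment
canvas = center_on_canvas(garment, size=6)
for row in canvas:
    print(row)
```

Because the garment pixels are copied unchanged, the aspect ratio is preserved and only white border pixels are added.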
4.2 Creating a Personal Style Store
To learn the user’s fashion preferences, the user initially provides the set of clothes from his/her closet. The Gram matrices $G^l$ (Eq. 1) of all the clothes, together with their annotated attributes, are calculated. TensorFlow [1] allows us to get the partially computed functions $E_l$ in Eq. 2 (where the Gram matrices $G^l$ are computed first and $\hat{G}^l$ later). The style losses $E_l$ are thus stored in a dictionary with the associated attributes. A personal style store is constructed for each user.
4.3 Style Transfer
To perform style transfer, two inputs are necessary. As shown in Figure 2, the user inputs a list of attributes that he/she would like in their new garment. This list can contain attributes such as print and stripes, or fabrics such as chiffon. In the current system, style is learned only for the attribute types texture and fabric. The dress shape is not considered a representation of the style of that object. Apart from these attributes, the user also gives an image that contains the shape of the dress they desire. This is called the User Chosen Outline (UCO). Let the attributes of the dresses in the closet be $A_1 \ldots A_n$. The selected user attributes are $A_1 \ldots A_k$, where $k \ll n$. The set of style loss functions having the corresponding attributes is selected from the user’s personal style store. Although the styles extracted from the user’s closet as a whole represent his/her fashion sense, we pick the style functions of the chosen attributes because we assume the user’s mental model of the dress is likely to be similar to the styles extracted for those attributes. All selected functions are then combined to get a single representation of the user’s fashion choices.
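For a style image $x$ and the initialized image $\hat{x}$, the style loss can be given as,
$$L_s(\hat{x}, x) = \sum_l w_l E_l \tag{5}$$
where $L_s$ is the style loss for a single image and $w_l$ is the weight of layer $l$.
The combined loss is given by:
$$L_{style} = \frac{1}{S} \sum_{s=1}^{S} W_s L_s(\hat{x}, x_s) \tag{6}$$
Here, $L_{style}$ is the style loss computed over the $S$ selected functions.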
The number of images picked for every attribute depends on the distribution of that attribute across the entire list of images present. The higher the frequency of an attribute in the distribution, the stronger the bias towards that label, which suppresses the effect of the others. This makes certain image characteristics more pronounced in the final dress than others. Hence, the weight $W_s$ is used to offset the bias.
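The effect of the weight W_s can be illustrated with a small sketch. The text does not define W_s, so the inverse-frequency weighting below is purely an assumption, as is the form of the combined loss (a weighted sum of per-image style losses divided by the number S of selected style functions); all names and numbers are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed W_s: one over the number of selected images sharing the attribute."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def combined_style_loss(losses, weights):
    """(1 / S) * sum_s W_s * L_s over the S selected style functions."""
    if not losses or len(losses) != len(weights):
        raise ValueError("need one weight per style loss")
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)


# Three striped garments and one chiffon garment were selected from the store.
labels = ["striped", "striped", "striped", "chiffon"]
losses = [0.2, 0.4, 0.6, 0.8]  # hypothetical per-image style losses L_s

weights = inverse_frequency_weights(labels)
weighted = combined_style_loss(losses, weights)
unweighted = combined_style_loss(losses, [1.0] * len(losses))
print(round(weighted, 6), round(unweighted, 6))  # 0.3 0.5
```

With these weights each attribute contributes its mean loss once (0.4 for striped, 0.8 for chiffon), so the three striped garments no longer outvote the single chiffon one.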
The total loss is the sum of the content and style losses:
$$L_{total} = \alpha L_{content}(C, \hat{x}) + \beta L_{style}(S, \hat{x}) \tag{7}$$
Here, $\alpha$ and $\beta$ are the weights assigned to the content and style losses respectively, and $C$ is the User Chosen Outline (UCO). The objective is to minimize the content and style losses; an L-BFGS optimizer is used to minimize the total loss. The output image is then post-processed to get the final image.
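How the weights of the content and style terms steer the result can be seen in a deliberately tiny toy problem. The sketch below is not the paper's optimizer: it replaces the image with a single scalar, uses squared-error stand-ins for both losses, and substitutes plain gradient descent for L-BFGS. All names and values are assumptions.

```python
def total_loss(x_hat, content, style, alpha, beta):
    """Toy scalar analogue: alpha * L_content + beta * L_style."""
    return alpha * (x_hat - content) ** 2 + beta * (x_hat - style) ** 2


def minimize(content, style, alpha, beta, steps=200, lr=0.05):
    """Plain gradient descent on the toy total loss, starting from 0."""
    x_hat = 0.0
    for _ in range(steps):
        grad = 2 * alpha * (x_hat - content) + 2 * beta * (x_hat - style)
        x_hat -= lr * grad
    return x_hat


# The content target sits at 0 and the style target at 1.
balanced = minimize(content=0.0, style=1.0, alpha=1.0, beta=1.0)
style_heavy = minimize(content=0.0, style=1.0, alpha=1.0, beta=9.0)
content_heavy = minimize(content=0.0, style=1.0, alpha=9.0, beta=1.0)
print(round(balanced, 4), round(style_heavy, 4), round(content_heavy, 4))  # 0.5 0.9 0.1
```

In this toy setting the minimizer is (alpha * content + beta * style) / (alpha + beta), so raising the style weight pulls the result toward the style target and raising the content weight keeps it close to the outline.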
4.4 Postprocessing
The output image contains patches of patterns transferred across the entire image. We resize the image to its original dimensions and apply the extracted mask (of the UCO image) to white out the background and get the transformed clothing object as the final resultant dress.
Figure 4: Multiple styles reinforced in a content image
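Applying the mask amounts to keeping stylized pixels inside the garment contour and painting everything else white. A minimal sketch on a single-channel image follows; the function name `apply_mask`, the 0/1 mask encoding, and the pixel values are assumptions, and in the pipeline the mask would come from GrabCut on the UCO image.

```python
def apply_mask(image, mask, background=255):
    """Keep pixels where the mask is 1 and set all other pixels to the background value."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel if keep else background for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


stylized = [[10, 20, 30],
            [40, 50, 60]]  # hypothetical style-transfer output
mask = [[0, 1, 1],
        [0, 0, 1]]         # 1 marks pixels inside the garment

final = apply_mask(stylized, mask)
print(final)  # [[255, 20, 30], [255, 255, 60]]
```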
We present two approaches to evaluate the results of personalization using style transfer.
5.1 Predicting Attribute labels
Quantitative evaluation of personalization models is a challenging task. A standard approach is to create a survey on Mechanical Turk and ask users if the styles have been transferred properly and if the new dress designs are personalized given a wardrobe. But fashion presents a unique challenge as it is highly dependent on the user’s taste for different kinds of clothing. Instead, a different tack is taken.
Figure 3 shows how the evaluation is performed. We check if style is imparted on the given UCO image by verifying if a classifier is able to identify the style attributes present in it. An SVM is trained to learn the attributes of the clothes present in the user’s closet using the features generated from a 16-layer VGGNet (our system uses the 19-layer network for fashioning the clothes). The test dataset is created by generating random combinations of attributes (these combinations are likely not present in the training image closet). For these random combinations of attributes, the new dress images are generated. Once they are featurized by a pre-trained VGG-16, we check the SVM’s ability to predict the combinations of attributes.
are maintained separately. There are a total of 400 UCOs
and 100 images from the user’s wardrobe. There are two
kinds of tests considered in the experiment. In the first,
the test images are generated from a set of images separate
from the styles extracted from the training but with similar
attributes. In the second, the test images are generated from
the styles extracted from the training data itself. Figure
5 shows the F1-score for a varying number of test images
generated. The consistent performance above the baseline
suggests the style is likely transferred and the SVM is able
the classify based on features generated.
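For readers unfamiliar with the metric, the sketch below computes an F1-score for multi-label attribute predictions. The text does not state which averaging was used, so micro-averaging is an assumption here, and the attribute sets are invented for illustration rather than taken from the experiments.

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over per-image sets of attribute labels."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        tp += len(truth & predicted)   # attributes correctly predicted
        fp += len(predicted - truth)   # predicted but not present
        fn += len(truth - predicted)   # present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Attributes used to generate three test garments, and what a classifier predicted.
truth = [{"striped", "chiffon"}, {"knit"}, {"print", "knit"}]
predicted = [{"striped"}, {"knit", "print"}, {"print"}]

score = micro_f1(truth, predicted)
print(round(score, 4))  # 0.6667
```

Here 3 attributes are recovered, 1 is predicted wrongly and 2 are missed, giving a precision of 0.75 and a recall of 0.6.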
Our experiments with increasing the number of images used for gaining more styles showed a drop in the F1 score, suggesting that an increasing number of style functions degrades the quality of the result, making it difficult to identify patterns. Hence it is necessary to limit the number of style functions used to generate the new dress.
Figure 5: Bar chart showing F1-scores for the baseline and our model on actual test data, using separate training and test generation images and using the same images for training and test data generation
5.2 Qualitative evaluation
We analyze the quality of dress images by seeing how similar they are to the style images used in the personalization process. The quality of the generated image is impacted by a number of factors. The effect of various hyper-parameters is measured. Figure 4 shows an image of a sheer draped blouse changed to adopt the styles extracted from a couple of images. The result is a nice blend of patterns borrowed from the given style images.
A single style superimposed on the same content image, but using multiple distinct style images, produces interesting results. Figure 6 presents the style of four different knit garments over a tank top. Four different textures of the same fabric produce distinct results.
Figure 6: Styles extracted from multiple images for the same attribute “knit”
In this paper, we show an initial pipeline to generate new designs for clothes based on the preferences of the user. The results indicate that style transfer happens successfully and is personalized for the closet of a user. In the future, we would like to improve the performance of the pipeline, as it is time consuming to generate a new design. Also, we plan to experiment with better methods to personalize and generate designs with higher resolutions.
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo,
Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, S. Ghemawat, I. Goodfellow, A. Harp,
G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga,
S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and
X. Zheng. TensorFlow: Large-scale machine learning
on heterogeneous systems, 2015. Software available
[2] L. Bossard, M. Dantone, C. Leistner, C. Wengert,
T. Quack, and L. Van Gool. Apparel classification
with style. In Asian conference on computer vision,
pages 321–335. Springer, 2012.
[3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and
fully connected crfs. CoRR, abs/1606.00915, 2016.
[4] A. A. Efros and W. T. Freeman. Image quilting for
texture synthesis and transfer. In Proceedings of the
28th annual conference on Computer graphics and
interactive techniques, pages 341–346. ACM, 2001.
[5] A. A. Efros and T. K. Leung. Texture synthesis by
non-parametric sampling. In Computer Vision, 1999.
The Proceedings of the Seventh IEEE International
Conference on, volume 2, pages 1033–1038. IEEE, 1999.
[6] L. Gatys, A. S. Ecker, and M. Bethge. Texture
synthesis using convolutional neural networks. In
Advances in Neural Information Processing Systems,
pages 262–270, 2015.
[7] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural
algorithm of artistic style. arXiv preprint
arXiv:1508.06576, 2015.
[8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style
transfer using convolutional neural networks. In
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2414–2423, 2016.
[9] B. Heater. Amazon’s new echo look has a built-in
camera for style selfies.
amazons-new-echo-look-has-a-built-in- camera-for-style-selfies/.
Accessed: 2017-06-02.
[10] D. J. Heeger and J. R. Bergen. Pyramid-based texture
analysis/synthesis. In Proceedings of the 22nd annual
conference on Computer graphics and interactive
techniques, pages 229–238. ACM, 1995.
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros.
Image-to-image translation with conditional
adversarial networks. arxiv, 2016.
[12] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses
for real-time style transfer and super-resolution. In
European Conference on Computer Vision, 2016.
[13] B. Julesz. Visual pattern discrimination. IRE
transactions on Information Theory, 8(2):84–92, 1962.
[14] Y. Kalantidis, L. Kennedy, and L.-J. Li. Getting the
look: clothing recognition and segmentation for
automatic product suggestions in everyday photos. In
Proceedings of the 3rd ACM conference on
International conference on multimedia retrieval,
pages 105–112. ACM, 2013.
[15] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L.
Berg. Hipster wars: Discovering elements of fashion
styles. In European conference on computer vision,
pages 472–488. Springer, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Imagenet classification with deep convolutional neural
networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[17] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick.
Graphcut textures: image and video synthesis using
graph cuts. In ACM Transactions on Graphics (ToG),
volume 22, pages 277–286. ACM, 2003.
[18] G. Linden, B. Smith, and J. York. Amazon.com
recommendations: Item-to-item collaborative filtering.
IEEE Internet computing, 7(1):76–80, 2003.
[19] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang.
Deepfashion: Powering robust clothes recognition and
retrieval with rich annotations. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, pages 1096–1104, 2016.
[20] J. Portilla and E. P. Simoncelli. A parametric texture
model based on joint statistics of complex wavelet
coefficients. International journal of computer vision,
40(1):49–70, 2000.
[21] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster
R-CNN: towards real-time object detection with region
proposal networks. CoRR, abs/1506.01497, 2015.
[22] J. Del Rey. Amazon won a patent for an on-demand
clothing manufacturing warehouse.
Accessed: 2017-06-02.
[23] C. Rother, V. Kolmogorov, and A. Blake. Grabcut:
Interactive foreground extraction using iterated graph
cuts. In ACM transactions on graphics (TOG),
volume 23, pages 309–314. ACM, 2004.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause,
S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al. Imagenet large scale
visual recognition challenge. International Journal of
Computer Vision, 115(3):211–252, 2015.
[25] K. Simonyan and A. Zisserman. Very deep
convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556, 2014.
[26] C. Smith. neural-style-tf., 2016.
[27] S. Trewin. Knowledge-based recommender systems.
Encyclopedia of library and information science,
69(Supplement 32):180, 2000.
[28] L.-Y. Wei and M. Levoy. Fast texture synthesis using
tree-structured vector quantization. In Proceedings of
the 27th annual conference on Computer graphics and
interactive techniques, pages 479–488. ACM
Press/Addison-Wesley Publishing Co., 2000.
[29] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang.
Learning from massive noisy labeled data for image
classification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
2691–2699, 2015.
[30] M. D. Zeiler and R. Fergus. Visualizing and
understanding convolutional networks. CoRR,
abs/1311.2901, 2013.
... Highlevel fashion analysis encompasses fashion trend forecasts, fashion synthesis, and fashion suggestions. Convolutional Neural Networks have excelled in various computer vision applications, including object recognition, detection, picture segmentation, and texture generation [28]. In another work [87], the authors outlined various ways to upgrade fashion technology with deep learning in "Fashion Object Detection and Pixel-Wise Semantic Segmentation." ...
... The mean squared error (MSE) between the input and output pictures is used to calculate the style and content losses (initiated from white noise). The researchers in [28] employed a 19-layer pre-trained VGG-Net in "Fashioning with Networks: Neural Style Transfer to Design Clothes." In another work [109], authors used neural networks to process images from the DeepFashion dataset and the nearest neighbor-backed recommender to generate final recommendations based on a given input image to find the most similar one in a Personalized fashion recommender system with image-based neural networks. ...
... In another work [63], researchers suggested a two-step deep learning framework for recommending fashion garments based on the visual resemblance style of another image. The neural style transfer technique is used to fashion in [28] to synthesize new personalized outfits. They devised a method for creating new personalized outfits based on a user's preferences and learning the user's fashion preferences from a small group of items in their wardrobe. ...
Full-text available
The rising diversity, volume, and pace of fashion manufacturing pose a considerable challenge in the fashion industry, making it difficult for customers to pick which product to purchase. In addition, fashion is an inherently subjective, cultural notion and an ensemble of clothing items that maintains a coherent style. In most of the domains in which Recommender Systems are developed (e.g., movies, e-commerce, etc.), the similarity evaluation is considered for recommendation. Instead, in the Fashion domain, compatibility is a critical factor. In addition, raw visual features belonging to product representations that contribute to most of the algorithm’s performances in the Fashion domain are distinguishable from the metadata of the products in other domains. This literature review summarizes various Artificial Intelligence (AI) techniques that have lately been used in recommender systems for the fashion industry. AI enables higher-quality recommendations than earlier approaches. This has ushered in a new age for recommender systems, allowing for deeper insights into user-item relationships and representations and the discovery patterns in demographical, textual, virtual, and contextual data. This work seeks to give a deeper understanding of the fashion recommender system domain by performing a comprehensive literature study of research on this topic in the past 10 years, focusing on image-based fashion recommender systems taking AI improvements into account. The nuanced conceptions of this domain and their relevance have been developed to justify fashion domain-specific characteristics.
With the development of the convolutional neural network, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into style transfer results. To solve this problem, we propose a novel STyle TRansformer (STTR) network which breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first propose to use self-attention to encode content and style tokens such that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens that encourages fine-grained style transformations. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT), which are carried out with 50 human subjects with 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results (Code is available at
Full-text available
The rapid progress of computer vision, machine learning, and artificial intelligence combined with the current growing urge for online shopping systems opened an excellent opportunity for the fashion industry. As a result, many studies worldwide are dedicated to modern fashion-related applications such as virtual try-on and fashion synthesis. However, the accelerated evolution speed of the field makes it hard to track these many research branches in a structured framework. This paper presents an overview of the matter, categorizing 110 relevant articles into multiple sub-categories and varieties of these tasks. An easy-to-use yet informative tabular format is used for this purpose. Such hierarchical application-based multi-label classification of studies increases the visibility of current research, promotes the field, provides research directions, and facilitates access to related studies.
Full-text available
The fashion industry is on the verge of an unprecedented change. The implementation of machine learning, computer vision, and artificial intelligence (AI) in fashion applications is opening lots of new opportunities for this industry. This paper provides a comprehensive survey on this matter, categorizing more than 580 related articles into 22 well-defined fashion-related tasks. Such structured task-based multi-label classification of fashion research articles provides researchers with explicit research directions and facilitates their access to the related studies, improving the visibility of studies simultaneously. For each task, a time chart is provided to analyze the progress through the years. Furthermore, we provide a list of 86 public fashion datasets accompanied by a list of suggested applications and additional information for each.
The purpose of the article is to clarify how archetypal structures (in this case, archetypes of Greek myths) function in modern fashion design, and to analyze the peculiarities of fashion design methodology under the conditions of metamodernism. This purpose is pursued by developing a method of designing modern clothing that takes into account the principles and characteristics of metamodernism. Research methodology. The complex problem of analyzing how fashion design develops in the environment of metamodernism is addressed through interdisciplinary system analysis combining subject, historiographic, and morphological analyses. Scientific novelty. For the first time, it was shown that modern design trends, such as in-personal empathy and situational expressiveness, more careful consideration of aesthetic guidelines, psychological attitudes, and user expectations, are most harmoniously reflected in the principles of metamodernism. It was demonstrated that in the conceptual space of the metamodern, the principle of oscillation corresponds to a choice between two competing primary sources of image generation in fashion design. The images of the gods of Greek mythology were chosen as traditional elements of the design project, and prototypes of modern clothing models were chosen as competitive elements. Conclusions. As a result of the research, we conclude that in the practice of fashion design it is expedient to balance the project approach between the factors of the creative source, the personal qualities of the individual, and the archetypes manifested in it. A method was proposed for synthesizing elements of the images of Greek mythological heroes with elements that carry fashion trends, taking into account the tendencies of metamodernism. Experimental approbation of the research results was carried out by assessing the artistic and aesthetic qualities of a collection of modern women's clothing models based on motifs from Greek myths. Thus, the results of the collection design confirmed the feasibility of turning to archetypal forms to find innovative design solutions.
Fashion companies’ chances of surviving the current pandemic depend heavily on their analytics skills. Despite this urgency and the possibilities arising in the “data era,” analytics activities are still underestimated and scattered across different fashion supply chain functions. Therefore, this article positions itself at the important intersection of analytics and fashion supply chain management. It analyzes analytics applications across all relevant supply chain functions within the fashion industry. We conducted our literature review with a focus on different forms of data-driven decision making applied within fashion supply chain functions. We systematically compared the findings from a structured literature review and a content analysis of corporate annual reports and detailed state-of-the-art analytics examples. We highlight deviations in the analytics level: research papers focus strongly on advanced analytics methods, while most companies are struggling to establish descriptive analytics capabilities. Based on this, we derive and detail managerial and research implications. Having created a holistic overview, this article presents itself as a cornerstone for further analytics-focused research within the fashion industry. It also provides managers with insights into the current landscape of analytics applications and develops the vision of a future analytics-driven fashion supply chain.
The aim of the article. The study analyzes the role of personal image in social adaptation and the features of new forms of message generation for social communications in the digital and post-digital phases of the development of the information society. Research methodology. Elements of historiographical analysis, content analysis, and analysis of the functionality of digitalization processes, combined on a common system analysis platform, were used to characterize the role of a person's image in social adaptation. Results. It is shown that the digital phase of modeling a personal style and image is characterized by the appearance of a dual image of a person, real and virtual, which is consistent with the split of human living space into real and virtual. The “clip thinking” that arises among younger generations in the new realities of informatization is explained by a decrease in the time threshold of information perception. The post-digital phase of the development of the information society creates a new role for human image-making as a source of educational information in the process of managing the loyalty of artificial intelligence to humanity. The scientific novelty consists in determining the roles of a person's positive image and of the digital image-making process in the digital and post-digital phases of digitalization. During the research, the assumption was formulated and substantiated that, to achieve the loyalty of artificial intelligence to humankind, social attitudes and human goals need to be instilled in artificial intelligence at the start of the post-digital phase, according to a human scenario. The practical significance of the research. The results can be used to enhance the effectiveness of social communications in the digital environment and to achieve the loyalty of artificial intelligence to humankind by using the image of cult personalities of our time as a prototype.
Here we present a parametric model for dynamic textures. The model is based on spatiotemporal summary statistics computed from the feature representations of a Convolutional Neural Network (CNN) trained on object recognition. We demonstrate how the model can be used to synthesise new samples of dynamic textures and to predict motion in simple movies.
We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
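The top-1 and top-5 error rates quoted above can be read as the fraction of test images whose true label is not among the 1 or 5 highest-scoring classes. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `top_k_error` and the tiny score table are hypothetical and chosen only for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Hypothetical class scores for 3 examples over 4 classes (illustrative only).
scores = [
    [0.1, 0.6, 0.2, 0.1],    # true class 1: correct at top-1
    [0.4, 0.3, 0.2, 0.1],    # true class 1: wrong at top-1, correct at top-2
    [0.5, 0.3, 0.15, 0.05],  # true class 3: wrong at both top-1 and top-2
]
labels = [1, 1, 3]

top1 = top_k_error(scores, labels, k=1)  # two of three examples missed
top2 = top_k_error(scores, labels, k=2)  # one of three examples missed
```

Because the top-k set grows with k, the top-5 error can never exceed the top-1 error on the same predictions.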
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
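To make the contrast between a per-pixel loss and a perceptual loss concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method of the abstract: `toy_features` is a hypothetical stand-in (horizontal finite differences) for the high-level features that the abstract takes from a pretrained network, and the images are tiny hand-written grids.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(image):
    """Flatten a 2-D list of pixel values into a 1-D list."""
    return [v for row in image for v in row]


def toy_features(image):
    """Hypothetical stand-in for a frozen feature extractor: horizontal differences.

    Assumption for illustration only; the abstract uses features from a
    pretrained convolutional network instead.
    """
    return [row[j + 1] - row[j] for row in image for j in range(len(row) - 1)]


def pixel_loss(output, target):
    """Per-pixel loss: compares raw intensities."""
    return mse(flatten(output), flatten(target))


def perceptual_loss(output, target, features=toy_features):
    """Perceptual-style loss: compares feature representations, not raw pixels."""
    return mse(features(output), features(target))


target = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
# Same structure as the target, uniformly brightened by 0.5.
shifted = [[v + 0.5 for v in row] for row in target]

per_pixel = pixel_loss(shifted, target)        # 0.25: every pixel differs by 0.5
perceptual = perceptual_loss(shifted, target)  # 0.0: the edge structure is unchanged
```

The two losses disagree on this pair: the per-pixel loss penalizes the brightness shift, while the feature-space loss sees identical structure. Which differences a perceptual loss tolerates depends entirely on the feature extractor used.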
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
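The idea of atrous (dilated) convolution can be shown in one dimension: the filter taps are spaced `rate` samples apart, so the field of view grows while the number of weights stays fixed. The sketch below is a simplified illustration with a hypothetical function name `atrous_conv1d`; it is not the DeepLab implementation, which operates on 2-D feature maps inside a network.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D convolution (cross-correlation form) with taps spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # effective field of view of the filter
    return [
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # the same three weights are used for both rates

dense = atrous_conv1d(signal, kernel, rate=1)    # field of view 3 -> [-2, -2, -2, -2, -2]
dilated = atrous_conv1d(signal, kernel, rate=2)  # field of view 5 -> [-4, -4, -4]
```

With `rate=1` this reduces to an ordinary convolution. With `rate=2` each output compares samples four positions apart using the same three weights, which is the sense in which the field of view is enlarged without adding parameters.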