Fashioning with Networks: Neural Style Transfer to Design
Clothes
Prutha Date
University Of Maryland
Baltimore County (UMBC),
Baltimore, MD,
USA
dprutha1@umbc.edu
Ashwinkumar Ganesan
University Of Maryland
Baltimore County (UMBC),
Baltimore, MD,
USA
gashwin1@umbc.edu
Tim Oates
University Of Maryland
Baltimore County (UMBC),
Baltimore, MD,
USA
oates@cs.umbc.edu
ABSTRACT
Convolutional Neural Networks have been highly successful in performing a host of computer vision tasks such as object recognition, object detection, image segmentation and texture synthesis. In 2015, Gatys et al. [7] showed how the style of a painter can be extracted from an image of a painting and applied to an ordinary photograph, thus recreating the photo in the style of the painter. The method has since been applied to a wide range of images and has spawned multiple applications and mobile apps. In this paper, the neural style transfer algorithm is applied to fashion in order to synthesize new custom clothes. We construct an approach to personalize and generate new custom clothes based on a user's preferences, learning the user's fashion choices from a limited set of clothes in their closet. The approach is evaluated by analyzing the generated images of clothes and how well they align with the user's fashion style.
CCS Concepts
• Computing methodologies → Computer vision; Machine learning approaches; Neural networks;
Keywords
Convolutional Neural Networks, Personalization, Fashion,
Neural Networks, Style Transfer, Texture Synthesis
1. INTRODUCTION
There have recently been impressive advances in computer vision tasks such as object recognition, object detection, and segmentation [16][21][3]. The revolution started with Krizhevsky et al. [16] substantially improving object recognition on the ImageNet challenge using convolutional neural networks (CNNs). This led to research and subsequent improvements in many tasks related to fashion, such as classification of clothes, predicting different kinds of attributes of a specific
piece of clothing, and improving the retrieval of images [19][14][2][29][15]. E-commerce giants are expanding their investment in fashion. Amazon recently patented a system to manufacture clothes on demand [22], and has started shipping its virtual assistant Echo with an integrated camera that takes a picture of the user's outfit and rates its style [9]. StitchFix¹ aims to simplify the online shopping experience. As the online fashion industry looks to improve the clothes recommended to users, understanding users' personal style preferences and recommending custom designs becomes an important task.
Personalization and recommendation models are a well-researched area, with methods ranging from collaborative filtering [18] to content-based recommendation systems (e.g., probabilistic graphical models, neural networks), as well as hybrid systems that combine both. Collaborative filtering [18] analyzes user behaviour and preferences and aligns users to predefined patterns in order to recommend a product. Content-based methods recommend a product based on the attributes or features the user is searching for. A hybrid system (a knowledge-based system [27]) incorporates both user preferences and product features to recommend an item to the user.
While the above techniques retrieve a product (or its image), we seek to synthesize new personalized merchandise. Texture synthesis tries to learn the underlying texture of an image in order to generate new samples with the same texture. Research in this space [6] is largely divided into parametric and non-parametric methods. Non-parametric methods resample specific pixels from the image or copy specific patches from the original to generate the new image [5][28][17][4]. Parametric methods define a statistical model that represents the texture [13][10][20]. In 2015, Gatys et al. [6] designed a new parametric model for texture synthesis based on convolutional neural networks. They model the style of an image by extracting the feature maps generated when the image is fed through a pre-trained CNN, in this instance a 19-layer VGGNet. They successfully separate the style and content of an arbitrary image and demonstrate how one image can be stylized using the textures of another.
Although Convolutional Neural Networks provide state-of-the-art performance for multiple computer vision tasks, their complexity and opacity have been a substantial research question. Visualizing the features learned by the network has been addressed in multiple efforts. Zeiler et al. [30] use
¹ https://www.stitchfix.com/welcome/schedule
Figure 1: (a) and (b) provide the shape and style respectively; (c) the final design.
a deconvolution network to reconstruct the features learned in each layer of the CNN. Simonyan et al. [25] backpropagate the gradients generated for a class with respect to the input image to create an artificial image (initialized with random noise) that represents that class in the network. The separation of style and content in an image by Gatys et al. [6] exposes the variant (content) and invariant (style) parts of the image.
Our contribution in this paper is a pipeline that learns a user's unique fashion sense and generates new design patterns based on their preferences. Figure 1 shows a sample clothing item generated using neural style transfer. The first clothing item, given by the user, provides the shape for the new dress. The second is provided by the user from his/her closet so the system can learn their preferences. The third is the final generated design for the user (the generated sample contains styles from multiple pieces of the user's clothing).
The following sections discuss related work, how neural style transfer works, our system architecture, the experiments conducted, and their results.
2. RELATED WORK
Prior research on fashion data in the computer vision community has addressed a range of challenges, including clothes classification, predicting attributes of clothes, and the retrieval of items [14][2][29][15]. Liu et al. [19] create a robust fashion dataset of about 800,000 images that contains annotations for various types of clothes, their attributes, and the locations of landmarks, as well as cross-domain pairs. They also design a CNN to predict attributes and landmarks. The architecture is based on a 16-layer VGGNet and adds convolutional and fully-connected layers trained to predict them. Isola et al. [11] perform image-to-image translation using a conditional adversarial network, with experiments that generate various fashion accessories when provided with a sketch of the item.
We use a 19-layer VGGNet [25] pre-trained on the ImageNet dataset [24]. The network consists of 16 convolutional layers and 3 fully-connected layers and is trained to predict 1000 classes (from the ImageNet challenge). The network is known to be robust, and the features it generates have been used to solve multiple downstream tasks. Gatys et al. use the pre-trained VGGNet to extract style and content features.
Johnson et al. [12] create an image transformation network trained to transform an image with a given style. A feed-forward transformation network is trained to run in real time using perceptual loss functions that depend on high-level features from a pre-trained loss network, rather than a per-pixel loss function based on low-level pixel information. The trained network does not start transforming the image from white noise but generates the output directly, thus speeding up the process.
Gatys et al. [7, 8] describe the process of using image representations encoded by multiple layers of VGGNet to separate the content and style of images and recombine them to form new images. The idea of style extraction is based on a texture synthesis process that represents the texture as a Gram matrix of the feature maps generated at each convolutional layer. The style is extracted as a weighted set of Gram matrices across all convolutional layers of the pre-trained VGGNet when it processes an image. The content is obtained from the feature maps extracted from the higher layers of the network when the image is processed. The style and content losses are computed as the mean squared error (MSE) between the feature maps and Gram matrices of the original image and a randomly generated image (initialized from white noise). Minimizing the loss transforms the white noise into a new artistic image.
We use the method described above to generate new fashion designs.
3. PRELIMINARIES
This section describes how style and content are extracted from an image using neural style transfer [7]. We use the implementation given by [26], a pre-trained 19-layer VGGNet model (VGG-19) that takes a content image and a set of style images as input.
Consider an input image $x$ and a convolutional neural network $NN$. Every convolutional layer $l$ in the network has $N_l$ distinct filters. After the convolution operation (and the activation function) is applied, let the computed feature map have height $h$ and width $w$. Flattened into a single vector, each map has length $M_l = h \times w$. Thus, the feature maps at every layer $l$ can be written as $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ik}$ represents the activation of the $i$-th filter at position $k$.
3.1 Style Extraction
Figure 2: Overall system architecture. $A_1 \ldots A_n$ are all the attributes in the dataset [19]; $A_1 \ldots A_k$ is the set of attributes given by the user. $L$ is the total loss between the Gram matrix of the (iteratively) modified UCO image and the Gram matrices from the user's personal style store (for $A_1 \ldots A_k$). In the first phase, the user provides the system access to his/her closet images, from which the user's fashion preferences are learned. In the second phase, the user gives his/her choices (attributes such as Striped Top or Chiffon) along with the desired outline of the piece of clothing to get the new custom design.
The Gram matrix at layer $l$ is given by $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is calculated as the dot product of feature maps $i$ and $j$ at layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk} \quad (1)$$

The dot product computes the similarity between feature maps. Thus the Gram matrix $G^l$ retains image characteristics that are consistent across the feature maps, while inconsistent features go to 0.
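As an illustration (not the authors' implementation), the Gram matrix of Equation 1 can be computed directly from a layer's feature maps. Below is a minimal NumPy sketch, assuming the maps are given as an array of shape (N_l, h, w):

import numpy as np

def gram_matrix(feature_maps):
    # feature_maps: array of shape (N_l, h, w), one map per filter at layer l.
    n_filters = feature_maps.shape[0]
    F = feature_maps.reshape(n_filters, -1)   # flatten each map to length M_l = h * w
    return F @ F.T                            # G_ij = sum_k F_ik * F_jk  (Eq. 1)

For example, a layer with 64 filters over a 128 x 128 map yields a 64 x 64 Gram matrix, which discards the spatial arrangement and keeps only the co-activation statistics of the filters.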
Consider two images: $x$ (the input image whose style is to be transferred) and $\hat{x}$ (an image initialized from white noise). Let their corresponding Gram matrices be $G^l$ and $\hat{G}^l$. The style loss is then computed for every layer as the mean squared error (MSE) between $G^l$ and $\hat{G}^l$:

$$E_l = \frac{1}{4 M_l^2 N_l^2} \sum_{i,j} \left( G^l_{ij} - \hat{G}^l_{ij} \right)^2 \quad (2)$$

$E_l$ is the style loss.
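Continuing the sketch above, the per-layer style loss of Equation 2 compares the Gram matrices of the style image and the synthesized image. Here gram_matrix is the helper defined earlier and both feature-map arrays are assumed to come from the same VGG-19 layer:

def layer_style_loss(style_maps, generated_maps):
    # Both inputs have shape (N_l, h, w) for the same layer l.
    n_l = style_maps.shape[0]
    m_l = style_maps.shape[1] * style_maps.shape[2]
    G = gram_matrix(style_maps)
    G_hat = gram_matrix(generated_maps)
    return np.sum((G - G_hat) ** 2) / (4.0 * n_l ** 2 * m_l ** 2)   # Eq. 2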
3.2 Content Extraction
The feature maps from the higher layers of the model give a representation of the image that is more biased towards its content [6]. We use the feature representations of the conv4_2 layer to extract content. Given the feature representations in layer $l$ of the original image $x$ and the generated white-noise image $\hat{x}$ as $F^l$ and $\hat{F}^l$ respectively, we define the content loss as the mean squared difference between the two:

$$L_{content}(x, \hat{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - \hat{F}^l_{ij} \right)^2 \quad (3)$$

The derivative of this loss with respect to the feature map at layer $l$ gives the gradient used to minimize the loss:

$$\frac{\partial L_{content}}{\partial F^l_{ij}} =
\begin{cases}
\left( F^l - \hat{F}^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\
0 & \text{if } F^l_{ij} < 0
\end{cases} \quad (4)$$
4. SYSTEM ARCHITECTURE
Figure 2 shows the entire pipeline to personalize and design custom clothes for the user. The architecture has four modules: preprocessing, personal style store creation, style transfer, and post-processing to generate the final design. The following sections discuss these modules in more detail.
To minimize the complexity of the problem, we consider images from the DeepFashion dataset [19] that have a white background. The images contain only clothing objects, with no humans or other artifacts, and are limited to upper-body or full-body apparel.
4.1 Preprocessing
All images are resized to 512 x 512. The image is resized not by stretching or shrinking it, but by creating a temporary white background image of the above-mentioned size and placing the original image at its center. This brings the image to the expected size without deforming it. In addition, a mask of the image is extracted and stored using the GrabCut utility [23]. This mask is used in the postprocessing step to remove patterns lying beyond the contours of the apparel. The attributes of the clothes are assumed to be provided; automatically labeling them is beyond the scope of this paper.

Figure 3: Evaluation model for predicting attribute labels on separate training and test generation images.
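A sketch of the preprocessing step just described, assuming Pillow and OpenCV are available. The rectangle used to initialize GrabCut is our assumption, since the paper does not specify how the algorithm is seeded:

import numpy as np
import cv2
from PIL import Image

def pad_to_canvas(path, size=512):
    # Place the garment image at the center of a white size x size canvas
    # instead of stretching it, so its proportions are preserved.
    img = Image.open(path).convert("RGB")
    img.thumbnail((size, size))   # only shrinks images larger than the canvas (assumption)
    canvas = Image.new("RGB", (size, size), (255, 255, 255))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

def extract_mask(pil_img, iterations=5):
    # Rough foreground mask via GrabCut [23]; the initial rectangle is a guess
    # that assumes the garment sits away from the image border.
    bgr = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2BGR)
    mask = np.zeros(bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    rect = (10, 10, bgr.shape[1] - 20, bgr.shape[0] - 20)
    cv2.grabCut(bgr, mask, rect, bgd, fgd, iterations, cv2.GC_INIT_WITH_RECT)
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)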
4.2 Creating a Personal Style Store
To learn the user's fashion preferences, the user initially provides a set of clothes from his/her closet. The Gram matrices $G^l$ (Eq. 1) of all these clothes, along with their annotated attributes, are calculated. TensorFlow [1] allows us to store the partially computed functions $E_l$ in Eq. 2 (where the Gram matrices $G^l$ are computed first and $\hat{G}^l$ later). The style losses $E_l$ are thus stored in a dictionary keyed by the associated attributes. A personal style store is constructed for each user.
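One plausible way to organize such a store, assuming the per-layer Gram matrices for each closet image have already been computed (e.g., with the gram_matrix helper from Section 3). The exact data layout is our assumption; the paper only states that the losses are stored in a dictionary keyed by attribute:

from collections import defaultdict

def build_style_store(closet):
    # closet: iterable of (attributes, layer_grams) pairs, where attributes is a
    # list such as ["striped", "chiffon"] and layer_grams maps a VGG-19 layer
    # name to the precomputed Gram matrix of that closet image.
    store = defaultdict(list)
    for attributes, layer_grams in closet:
        for attr in attributes:
            store[attr].append(layer_grams)   # each attribute indexes that image's style
    return store

# At generation time, the styles for the chosen attributes A_1 ... A_k are
# retrieved with: selected = [g for a in chosen_attrs for g in store[a]]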
4.3 Style Transfer
To perform style transfer, two inputs are necessary. As shown in Figure 2, the user inputs a list of attributes that he/she would like in the new garment. These can be attributes such as print and stripes, or a fabric such as chiffon. In the current system, style is learned only for the attribute types texture and fabric; the dress shape is not considered part of the style of the object. Apart from these attributes, the user also gives an image that contains the shape of the dress they desire. This is called the User Chosen Outline (UCO). Let the attributes of the dresses in the closet be $A_1 \ldots A_n$. The selected user attributes are $A_1 \ldots A_k$, where $k \ll n$. The style loss functions having the corresponding attributes are selected from the user's personal style store. Although the styles extracted from the user's closet as a whole represent his/her fashion sense, we pick the style functions of the chosen attributes because we assume the user's mental model of the dress is likely to be similar to the styles extracted for those attributes. All selected functions are then combined to get a single representation of the user's fashion choices.
For a style image $x$ and the initialized image $\hat{x}$, the style loss can be given as

$$L_s(\hat{x}, x) = \sum_{l=0}^{L} w_l E_l \quad (5)$$

where $L_s$ is the style loss for a single image.
The combined loss is given by:

$$L_{style} = \frac{1}{S} \sum_{s=0}^{S} W_s L_s \quad (6)$$

Here, $L_{style}$ is the style loss computed over the $S$ selected style functions.
The number of images picked for each attribute depends on the distribution of that attribute across the entire set of images. The more frequent an attribute is, the more the result is biased towards that label, suppressing the effect of the others. This makes certain image characteristics more pronounced in the final dress than others. The weight $W_s$ is used to offset this bias.
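The paper does not give a formula for $W_s$; one simple scheme consistent with the description is to weight each selected style image inversely to the frequency of its attribute, sketched below as an assumption:

from collections import Counter

def inverse_frequency_weights(style_attrs):
    # style_attrs: the attribute label of each selected style image, e.g.
    # ["knit", "knit", "chiffon"]. Frequent attributes get smaller weights
    # so they do not drown out the rarer ones (assumed scheme).
    counts = Counter(style_attrs)
    raw = [1.0 / counts[a] for a in style_attrs]
    total = sum(raw)
    return [w / total for w in raw]           # normalized so the weights sum to 1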
The total loss is the weighted sum of the style and content losses:

$$L_{total} = \alpha L_{content}(C, x) + \beta L_{style}(S, x) \quad (7)$$

Here, $\alpha$ and $\beta$ are the weights assigned to the content and style losses respectively, and $C$ is the user chosen outline (UCO). An L-BFGS optimizer is used to minimize this total loss, and the output image is then post-processed to obtain the final design.
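A compact sketch of the optimization described by Equation 7, using SciPy's L-BFGS routine. The helpers content_fn and style_fn are assumptions standing in for the VGG-19 forward/backward passes, each returning a loss and its gradient with respect to the image:

import numpy as np
from scipy.optimize import minimize

ALPHA, BETA = 1.0, 1e3   # content and style weights (alpha, beta in Eq. 7); values are illustrative

def total_loss_and_grad(x_flat, shape, content_fn, style_fn):
    # Eq. 7: L_total = alpha * L_content + beta * L_style, gradients combined the same way.
    x = x_flat.reshape(shape)
    c_loss, c_grad = content_fn(x)
    s_loss, s_grad = style_fn(x)
    loss = ALPHA * c_loss + BETA * s_loss
    grad = ALPHA * c_grad + BETA * s_grad
    return loss, np.asarray(grad, dtype=np.float64).ravel()

def synthesize(uco_shape, content_fn, style_fn, iterations=500):
    # Start from white noise and let L-BFGS iteratively reshape it into the design.
    x0 = np.random.uniform(0, 255, uco_shape).ravel()
    result = minimize(total_loss_and_grad, x0,
                      args=(uco_shape, content_fn, style_fn),
                      jac=True, method="L-BFGS-B",
                      options={"maxiter": iterations})
    return result.x.reshape(uco_shape)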
4.4 Postprocessing
The output image contains patches of patterns transferred across the entire image. We resize the image to its original dimensions and apply the extracted mask (of the UCO image) to white out the background, yielding the transformed clothing object as the final resultant dress.

Figure 4: Multiple styles reinforced in a content image.
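A sketch of this masking step, assuming the binary mask produced during preprocessing (1 inside the garment outline, 0 outside) and the original UCO dimensions:

import numpy as np
import cv2

def apply_uco_mask(stylized, mask, original_hw):
    # Resize the stylized output back to the UCO's original size and white out
    # everything that falls outside the garment contour.
    h, w = original_hw
    out = cv2.resize(stylized, (w, h), interpolation=cv2.INTER_CUBIC)
    m = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)[..., None]
    white = np.full_like(out, 255)
    return np.where(m == 1, out, white)       # keep the garment, blank the background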
5. EXPERIMENTS & RESULTS
We present two approaches to evaluate the results of personalization using style transfer.
5.1 Predicting Attribute labels
Quantitative evaluation of personalization models is a challenging task. A standard approach is to run a survey on Mechanical Turk and ask users whether the styles have been transferred properly and whether the new dress designs are personalized given a wardrobe. But fashion presents a unique challenge, as it is highly dependent on the user's taste in different kinds of clothing. Instead, a different tack is applied.
Figure 3 shows how the evaluation is performed. We check whether style is imparted on the given UCO image by verifying that a classifier can identify the style attributes present in it. An SVM is trained to learn the attributes of the clothes in the user's closet using features generated by a 16-layer VGGNet (our system uses the 19-layer network for fashioning the clothes). The test dataset is created by generating random combinations of attributes (combinations that are likely not present in the training closet). For these random combinations of attributes, new dress images are generated. Once featurized by a pre-trained VGG-16, we check the SVM's ability to predict the combinations of attributes.
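A sketch of this evaluation with scikit-learn, assuming VGG-16 features have already been extracted for both the closet images and the generated dresses and that attributes are encoded as binary indicator matrices; a one-vs-rest linear SVM handles the multi-label prediction:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def evaluate_attribute_transfer(closet_feats, closet_labels, gen_feats, gen_labels):
    # closet_*: VGG-16 features and binary attribute matrix for the user's closet.
    # gen_*: the same for the generated dresses; gen_labels holds the attribute
    # combinations each dress was generated with.
    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(closet_feats, closet_labels)
    predictions = clf.predict(gen_feats)
    return f1_score(gen_labels, predictions, average="micro")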
The UCO images and the set of images used for styling are maintained separately. There are a total of 400 UCOs and 100 images from the user's wardrobe. Two kinds of tests are considered in the experiment. In the first, the test images are generated from a set of style images separate from those used in training but with similar attributes. In the second, the test images are generated from the styles extracted from the training data itself. Figure 5 shows the F1-score for a varying number of generated test images. The consistent performance above the baseline suggests the style is likely transferred and the SVM is able to classify based on the generated features.
In our experiments, increasing the number of images used to gain more styles showed a drop in the F1-score, suggesting that a larger number of style functions degrades the quality of the result and makes it difficult to identify patterns. Hence it is necessary to limit the number of style functions used to generate the new dress.
Figure 5: Bar chart of F1-scores for the baseline and our model on actual test data, using separate training and test generation images, and using the same images for training and test data generation.
5.2 Qualitative evaluation
We analyze the quality of the dress images by examining how similar they are to the style images used in the personalization process. The quality of the generated image is affected by a number of factors, and the effect of various hyper-parameters is measured. Figure 4 shows an image of a sheer draped blouse changed to adopt the styles extracted from a couple of images. The result is a pleasing blend of the patterns borrowed from the given style images.
A single style superimposed on the same content image, but using multiple distinct style images, produces interesting results. Figure 6 presents the styles of four different knit garments applied to a tank top. Four different textures of the same fabric produce distinct results.

Figure 6: Styles extracted from multiple images for the same attribute "knit".
6. CONCLUSIONS & FUTURE WORK
In this paper, we present an initial pipeline to generate new clothing designs based on the preferences of the user. The results indicate that style transfer happens successfully and is personalized to a user's closet. In the future we would like to improve the performance of the pipeline, as it is currently time consuming to generate a new design. We also plan to experiment with better methods to personalize and to generate designs at higher resolutions.
7. REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo,
Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean,
M. Devin, S. Ghemawat, I. Goodfellow, A. Harp,
G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga,
S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and
X. Zheng. TensorFlow: Large-scale machine learning
on heterogeneous systems, 2015. Software available
from tensorflow.org.
[2] L. Bossard, M. Dantone, C. Leistner, C. Wengert,
T. Quack, and L. Van Gool. Apparel classification
with style. In Asian conference on computer vision,
pages 321–335. Springer, 2012.
[3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and
fully connected crfs. CoRR, abs/1606.00915, 2016.
[4] A. A. Efros and W. T. Freeman. Image quilting for
texture synthesis and transfer. In Proceedings of the
28th annual conference on Computer graphics and
interactive techniques, pages 341–346. ACM, 2001.
[5] A. A. Efros and T. K. Leung. Texture synthesis by
non-parametric sampling. In Computer Vision, 1999.
The Proceedings of the Seventh IEEE International
Conference on, volume 2, pages 1033–1038. IEEE,
1999.
[6] L. Gatys, A. S. Ecker, and M. Bethge. Texture
synthesis using convolutional neural networks. In
Advances in Neural Information Processing Systems,
pages 262–270, 2015.
[7] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural
algorithm of artistic style. arXiv preprint
arXiv:1508.06576, 2015.
[8] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style
transfer using convolutional neural networks. In
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2414–2423,
2016.
[9] B. Heater. Amazon's new Echo Look has a built-in camera for style selfies.
https://techcrunch.com/2017/04/26/amazons-new-echo-look-has-a-built-in-camera-for-style-selfies/.
Accessed: 2017-06-02.
[10] D. J. Heeger and J. R. Bergen. Pyramid-based texture
analysis/synthesis. In Proceedings of the 22nd annual
conference on Computer graphics and interactive
techniques, pages 229–238. ACM, 1995.
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros.
Image-to-image translation with conditional
adversarial networks. arxiv, 2016.
[12] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses
for real-time style transfer and super-resolution. In
European Conference on Computer Vision, 2016.
[13] B. Julesz. Visual pattern discrimination. IRE
transactions on Information Theory, 8(2):84–92, 1962.
[14] Y. Kalantidis, L. Kennedy, and L.-J. Li. Getting the
look: clothing recognition and segmentation for
automatic product suggestions in everyday photos. In
Proceedings of the 3rd ACM conference on
International conference on multimedia retrieval,
pages 105–112. ACM, 2013.
[15] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L.
Berg. Hipster wars: Discovering elements of fashion
styles. In European conference on computer vision,
pages 472–488. Springer, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Imagenet classification with deep convolutional neural
networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[17] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick.
Graphcut textures: image and video synthesis using
graph cuts. In ACM Transactions on Graphics (ToG),
volume 22, pages 277–286. ACM, 2003.
[18] G. Linden, B. Smith, and J. York. Amazon. com
recommendations: Item-to-item collaborative filtering.
IEEE Internet computing, 7(1):76–80, 2003.
[19] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang.
Deepfashion: Powering robust clothes recognition and
retrieval with rich annotations. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, pages 1096–1104, 2016.
[20] J. Portilla and E. P. Simoncelli. A parametric texture
model based on joint statistics of complex wavelet
coefficients. International journal of computer vision,
40(1):49–70, 2000.
[21] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster
R-CNN: towards real-time object detection with region
proposal networks. CoRR, abs/1506.01497, 2015.
[22] J. D. Rey. Amazon won a patent for an on-demand
clothing manufacturing warehouse.
https://www.recode.net/2017/4/18/15338984/
amazon-on-demand-clothing-apparel-manufacturing-patent-warehouse-3d.
Accessed: 2017-06-02.
[23] C. Rother, V. Kolmogorov, and A. Blake. Grabcut:
Interactive foreground extraction using iterated graph
cuts. In ACM transactions on graphics (TOG),
volume 23, pages 309–314. ACM, 2004.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause,
S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al. Imagenet large scale
visual recognition challenge. International Journal of
Computer Vision, 115(3):211–252, 2015.
[25] K. Simonyan and A. Zisserman. Very deep
convolutional networks for large-scale image
recognition. CoRR, abs/1409.1556, 2014.
[26] C. Smith. neural-style-tf.
https://github.com/cysmith/neural-style-tf, 2016.
[27] S. Trewin. Knowledge-based recommender systems.
Encyclopedia of library and information science,
69(Supplement 32):180, 2000.
[28] L.-Y. Wei and M. Levoy. Fast texture synthesis using
tree-structured vector quantization. In Proceedings of
the 27th annual conference on Computer graphics and
interactive techniques, pages 479–488. ACM
Press/Addison-Wesley Publishing Co., 2000.
[29] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang.
Learning from massive noisy labeled data for image
classification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
2691–2699, 2015.
[30] M. D. Zeiler and R. Fergus. Visualizing and
understanding convolutional networks. CoRR,
abs/1311.2901, 2013.