Deep Learning for Retail Product Recognition: Challenges
Yuchen Wei, Son Tran, Shuxiang Xu, Byeong Kang, and Matthew Springer
Discipline of ICT, School of TED, University of Tasmania, Launceston, Tasmania, Australia
Correspondence should be addressed to Yuchen Wei; email@example.com
Received 15 July 2020; Revised 13 October 2020; Accepted 19 October 2020; Published 12 November 2020
Academic Editor: Massimo Panella
Copyright © 2020 Yuchen Wei et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Taking time to identify expected products and waiting at the checkout in a retail store are common scenes we all encounter in our
daily lives. The realization of automatic product recognition has great significance for both economic and social progress because it
is more reliable and time-saving than manual operation. Product recognition via images is a challenging task in the field of
computer vision. It has received increasing attention due to its great application prospects, such as automatic checkout, stock
tracking, planogram compliance, and assistance for visually impaired people. In recent years, deep learning has enjoyed a flourishing evolution
with tremendous achievements in image classification and object detection. This article aims to present a comprehensive literature
review of recent research on deep learning-based retail product recognition. More specifically, this paper reviews the key
challenges of deep learning for retail product recognition and discusses potential techniques that can be helpful for research on
the topic. Next, we provide the details of public datasets which could be used for deep learning. Finally, we summarize the current
progress and point out new perspectives for research in related fields.
1. Introduction and Background
The intention of product recognition is to facilitate the
management of retail products and improve consumers'
shopping experience. At present, barcode recognition is
the most widely used technology, not only in research but also
in industries where automatic identification of commodities is
needed. By scanning the barcode printed on each product
package, products can be managed easily. Normally, almost
every item on the market has its corresponding barcode.
However, because the printing position of the barcode varies,
it often takes time to manually find the barcode and help the
machine read it at the checkout counter. According to a survey
from Digimarc, 45% of customers complained that it was
sometimes inconvenient to use barcode scanning machines. RFID (radio
frequency identification) has been applied in business
fields with the growth of computer technology to enhance the
automation of product identification. This technology auto-
matically transmits data and information using radio fre-
quency signals. An RFID tag is placed on each product; each
tag has a specific number corresponding to a specific
product, and the product is identified through wireless
communication. Unlike the barcode, RFID tag data are
readable without the line-of-sight requirements of an optical
scanner. RFID nevertheless has shortcomings. Identifying
multiple products still has a high error rate because radio waves
can be blocked or interfere with each other. Also, RFID labels are
expensive and difficult to recycle, resulting in higher sales
costs and sustainability issues.
As retail is evolving at an accelerated rate, enterprises are
increasingly focusing on how to use artificial intelligence
technology to reshape the retail industry's ecology and in-
tegrate online and offline experiences. According to a study
from Juniper Research, global spending by retailers on AI
services will increase over 300%, from $3.6 billion in 2019 to
$12 billion in 2023. That is to say, innovative retail in the
future may be realized entirely by artificial intel-
ligence technology. Moreover, with the improvement of living
standards, supermarket staff and customers are confronted with
a vast number of retail products. In this scenario, a
massive amount of human labour and a large percentage of
Computational Intelligence and Neuroscience
Volume 2020, Article ID 8875910, 23 pages
the workload are required for recognising products so as to
conduct goods management. Furthermore, with the help
of various imaging devices, digital image resources of
products are growing rapidly every day. As such, for a
tremendous amount of image data, how to effectively
analyze and process them, and how to identify and
classify the products in supermarkets, have become key re-
search issues in the product recognition field. Product rec-
ognition refers to the use of technology, mainly based
on computer vision methods, that allows computers to replace
the process of manually identifying and classifying products.
Implementing automatic product recognition in grocery
stores through images has a significant impact on the retail
industry. Firstly, it benefits the planogram compliance of
products on the shelf. For instance, product detection can
identify which items are missing from the shelf and remind the
store staff to replenish them immediately. It has been ob-
served that when an optimized planogram is 100% matched,
sales increase by 7.8% and profit by 8.1%.
Secondly, image-based commodity identification can be
applied to automatic self-checkout systems to optimize the
user experience of checkout operations. Global self-checkout
(SCO) shipments increased steadily between 2014 and
2019, and growing numbers of SCOs have been installed to
reduce retailers' costs and enhance customer experience
[9, 10]. The research in [11, 12] demonstrates that customers'
waiting time for checkout operations has a negative influ-
ence on their shopping satisfaction, which is to say that
applying computer-vision-based product recognition in
SCOs benefits both retailers and customers. Thirdly, product
recognition technology can assist people who are visually
impaired to shop independently, which is conducive to their
social connectivity. Traditional shopping methods
usually require assistance from a sighted person because it
can be difficult for a person who is visually impaired to
identify products by the information on their packaging (e.g.,
price, brand, and due date), making purchase decisions difficult.
In general, retail product recognition can be described as a
challenging instance of image classification
[15, 16] and object detection [17–19]. During the last
decade, deep learning, especially in the domain of computer
vision, has achieved tremendous success and has become the
core solution for image classification and object detection. The
primary difference between deep learning and traditional
pattern recognition methods is that the former can directly
learn features from image data rather than using manually
designed features. Another reason for the strong ability of deep
learning is that deeper layers can extract more precise
features than traditional neural networks. These advan-
tages enable deep learning methods to bring new ideas to
important computer vision problems such as image
segmentation and keypoint detection. Recently, a few attempts
have been made in the retail industry, achieving state-of-
the-art results [20–22]. In the meanwhile, some automated
retail stores have emerged, such as Amazon Go (https://www.
amazon.com/b?ie=UTF8&node=16008589011) and Walmart's
Intelligent Retail Lab (https://www.intelligentretaillab.com/),
which indicates that there is growing industry interest in unmanned retail.
Deep learning-based retail product recognition has in-
creasingly attracted researchers, and plenty of work has been
done in this field. However, there appear to be very few
reviews or surveys that summarize existing achievements
and current progress. We collected over a hundred related
publications through Google Scholar, IEEE Xplore, and Web
of Science, as well as major conferences such as CVPR,
ICCV, IJCAI, NIPS, and AAAI. As a result, only two for-
mally published surveys [4, 23] came to light, both of which
studied the detection of products on shelves in retail stores. The
scenario of recognising products for self-checkout systems
has been neglected in those surveys, although it is also a complex
task that needs to be solved for the retail industry.
In one published article, the authors reviewed 24 papers
and proposed a classification of product recognition sys-
tems. Nevertheless, deep learning methods are not men-
tioned in that paper. In another related survey, the
authors presented a brief study on computer vision-
based product recognition in shelf images. However, that
survey does not focus on the field of deep learning: most of
the methods presented are based on hand-crafted features.
Therefore, with the rising popularity and potential appli-
cations of deep learning in retail product recognition, a new
comprehensive survey is needed for a better under-
standing of this research field.
In this paper, we present an extensive literature review of
current studies on deep learning-based retail product rec-
ognition. Our detailed survey presents challenges, tech-
niques, and open datasets for deep learning-based product
recognition. It offers meaningful insights into advances in
deep learning for retail product identification. It also serves
as a guideline for researchers and engineers who have just
started researching product recognition, so that they can
quickly identify the problems that need to be studied. In
summary, the contributions of this paper are threefold: (1) we
provide a comprehensive literature review of the implementation
of deep learning methods in product identification. (2) We
summarize current problem-solving techniques according to the
complexity of retail product recognition. (3) We discuss the
challenges and available resources and identify future
research directions.
The rest of this article is structured as follows:
Section 2 provides an overview of computer vision
methods for product recognition. Section 3 presents the
challenges in the field of detecting grocery products in retail
stores. Section 4 describes current techniques for solving these
complex problems. Section 5 describes the publicly available
datasets and analyzes their particular application scenarios.
Finally, Section 6 draws conclusions and provides di-
rections for future studies.
2. Computer Vision Methods in Retail
2.1. Classic Methods. With the rapid growth of computer vision,
researchers have been drawn to product recognition based on
this technology. Product recognition is realized by extracting
features from images of the product package. The composition of
the product image recognition system is shown in Figure 1. (1)
Image capture: collecting images from cameras and mobile
phones. (2) Image preprocessing: reducing noise and re-
moving redundant information to provide high-quality
images for subsequent operations; this mainly includes image
segmentation, transformation, and enhancement. (3) Fea-
ture extraction: analyzing and processing image data to
determine the invariant characteristics of the image. (4)
Feature classification: after an image feature is mapped
to a feature vector or space, a specific decision rule is
applied to classify the low-dimensional feature so that the
recognition result is accurate. (5) Output of recognition:
the pretrained classifier is employed to predict the category
of the retail product.
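The five-step pipeline above can be made concrete in code. The following Python example is a minimal sketch of the classic approach rather than any specific published system: it uses a colour histogram as a stand-in for the hand-crafted feature extractor and a nearest-neighbour rule as the classifier; all function and variable names are ours.

```python
import numpy as np

def preprocess(image):
    # Step 2: normalise pixel values to [0, 1] to reduce lighting variation.
    return image.astype(np.float64) / 255.0

def extract_features(image, bins=8):
    # Step 3: a simple descriptor -- a per-channel colour histogram.
    hist = [np.histogram(image[..., c], bins=bins, range=(0.0, 1.0))[0]
            for c in range(image.shape[-1])]
    feat = np.concatenate(hist).astype(np.float64)
    return feat / (feat.sum() + 1e-8)      # normalise to unit mass

def classify(feature, gallery_feats, gallery_labels):
    # Steps 4-5: nearest-neighbour decision rule over known products.
    dists = np.linalg.norm(gallery_feats - feature, axis=1)
    return gallery_labels[int(np.argmin(dists))]

# Toy usage: two "products" distinguished by their dominant colour.
rng = np.random.default_rng(0)
red_like  = (rng.random((32, 32, 3)) * [255, 60, 60]).astype(np.uint8)
blue_like = (rng.random((32, 32, 3)) * [60, 60, 255]).astype(np.uint8)
gallery = np.stack([extract_features(preprocess(red_like)),
                    extract_features(preprocess(blue_like))])
labels = ["red_product", "blue_product"]
query = (rng.random((32, 32, 3)) * [255, 50, 50]).astype(np.uint8)
print(classify(extract_features(preprocess(query)), gallery, labels))
```

A real system of this era would replace the histogram with SIFT or SURF descriptors and the nearest-neighbour rule with a trained classifier, but the structure of the pipeline is the same.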
The core of product recognition is whether accurate
features can be extracted or not. SIFT [24, 25] and SURF
[26, 27] are the best-known representatives of traditional feature
extraction technology. In 1999, Lowe proposed SIFT, which pays
greater attention to local information and builds an image
pyramid to solve the problem of multiscale features. SIFT
features have many advantages, such as ro-
tation invariance, translation invariance, and scale invariance,
and they were the most widely used hand-crafted features before
deep learning. In 2006, building on the foundation of SIFT,
researchers proposed SURF features to improve cal-
culation speed. SIFT has been used as a feature extractor for
product classification, and the SURF algorithm has
been applied to detect out-of-stock and misplaced
products on shelves. However, because the features extracted
by SIFT and SURF are hand-crafted, they cannot fully capture
all relevant information. Thus, researchers are in-
creasingly interested in deep learning with end-to-end
training to extract effective features.
2.2. Deep Learning. Deep learning is often regarded as a
subfield of machine learning. The central objective of deep
learning is to learn deep representations, i.e., to learn mul-
tilevel representations and abstractions from data.
The concept of deep learning (also known as deep
structured learning) was initially proposed by authoritative
scholars in the field of machine learning in 2006. Shortly
afterwards, Hinton and Salakhutdinov presented the
methods of unsupervised pretraining and fine-tuning to
mitigate the vanishing gradient problem. After that,
deep learning became a research hotspot. In 2007, a greedy
layer-wise training strategy was proposed to optimize the
initial weights of deep networks. ReLU (rectified linear
unit) was introduced in 2011 to preserve more information
across multiple layers, which helps restrain the vanishing
gradient problem. The dropout algorithm was
proposed in 2012 to prevent overfitting, and it helped im-
prove deep network performance.
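The two ingredients just mentioned are simple to state. Below is a short numpy sketch of ReLU and (inverted) dropout as they are commonly defined; it is an illustration of the concepts, not code from any work cited here.

```python
import numpy as np

def relu(x):
    # ReLU passes positive activations unchanged and zeroes the rest,
    # keeping gradients alive (slope 1 for x > 0) across many layers.
    return np.maximum(0.0, x)

def dropout(x, p=0.5, training=True, rng=None):
    # Inverted dropout: randomly zero a fraction p of units during
    # training and rescale the survivors by 1/(1-p), so the expected
    # activation is unchanged and no rescaling is needed at test time.
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                                   # negatives clipped to 0
print(dropout(relu(x), p=0.5,
              rng=np.random.default_rng(0)))     # random half zeroed, rest doubled
```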
In the field of computer vision, deep neural networks
have been exploited alongside improvements in computing
power from computer hardware, particularly thanks to the
use of GPUs in image processing. Nowadays,
the application of deep learning in retail product recognition
primarily covers the following two elements: (1) image
classification: a fundamental task in computer vision,
which seeks to divide different images into different cate-
gories; on some benchmarks, computers already classify
images more accurately than humans. (2) Object detection:
detecting objects with rectangular boxes while also cat-
egorising them. In the last few years, with the ongoing
growth of deep learning, many scientists and developers
have built and optimized deep learning frameworks to
help speed up training and inference, such as
Caffe, TensorFlow, MXNet, and PyTorch,
which are the most common frameworks and make the use
of deep learning methods much easier for practitioners.
2.2.1. Convolutional Neural Networks. The success of deep
learning in computer vision profits from convolutional
neural networks (CNNs), which were inspired by biological
research on the cat's visual cortex. LeCun et al. first
proposed employing convolutional neural networks to
classify images in the late 1980s. They conceived the LeNet
convolutional neural network model, which had seven layers.
After training on a dataset of 32 × 32 hand-
written characters, the model was successfully applied
to the digital identification of checks. Since 2010, CNN
structures and training techniques have been advancing
strongly, benefiting from the
ImageNet Large-Scale Visual Recognition Challenge. Also,
with the advance of computing power from GPUs, deep
learning has undoubtedly become a phenomenon. After
2010, a series of network structures such as AlexNet,
GoogLeNet, VGG, and ResNet were
devised for image classification, building on LeNet. Re-
cently, CNNs have become able to classify 3D objects using
an approach called the multiview CNN. The multiview CNN has
shown remarkable performance on image classification
tasks by feeding multiple views of an object to the networks. In
the age of big data, researchers can use large
datasets to train complex network structures that output
more accurate results. In conclusion, big data and deeper
networks are the two key elements of the success of deep
learning, and these two aspects accelerate each other.
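As a concrete illustration of the LeNet-style architecture described above, here is a PyTorch sketch of a seven-layer LeNet-5-like network for 32 × 32 single-channel inputs. The layer sizes follow the classic design, but this is our reconstruction for illustration, not the original implementation.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5-style CNN: conv -> pool -> conv -> pool -> 3 fully connected."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
logits = model(torch.zeros(4, 1, 32, 32))  # a batch of four blank images
print(logits.shape)                        # torch.Size([4, 10])
```

Modern networks such as AlexNet, VGG, and ResNet follow the same convolution-pooling-classifier pattern, only much deeper and with ReLU in place of Tanh.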
2.2.2. Deep Learning for Object Detection. CNNs have been
the major deep learning technique for object detection.
Therefore, all the deep learning models discussed in this
paper are based on CNNs. In order to detect various
objects, it is essential to extract regions for dif-
ferent objects before image classification. Before deep
learning, the common region extraction method was the
sliding-window algorithm, a tradi-
tional method that slides a window across the image and
classifies the contents of each window. The sliding-window
strategy is inefficient and requires a very large amount of
computation. After incorporating deep learning into this
field, object detection techniques can be classified into two
categories: the two-stage model (region proposal-based) and
the one-stage model (regression/classification-based). The
two-stage model requires a region proposal algorithm to find
possible locations of objects in an image. It takes advantage
of textures, edges, and colours from the image to ensure a
high recall rate while selecting fewer windows (thousands or even
hundreds). The R-CNN algorithm introduces an
unsupervised region proposal method, selective search,
combining the power of both exhaustive
search and segmentation. Although this method im-
proved computing speed, it still needs to run a CNN
computation for every region proposal. Fast R-CNN
was then developed to reduce the repeated CNN computation.
Ren et al. proposed a region proposal network (RPN)
using a deep network that shares features with the clas-
sification network. The shared features not only avoid the
time consumption caused by recalculation but also improve
accuracy. The Faster R-CNN algorithm, based on the
RPN, is presently the mainstream technique of object
identification, but it does not meet real-time speed
requirements. Compared with the two-stage method, the
one-stage method computes faster because it skips the
region proposal stage; objects' locations and cat-
egories are directly regressed from multiple positions of the
image. YOLO and SSD are the most representative
algorithms, greatly speeding up detection, though their
accuracy is inferior to that of two-stage methods.
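The sliding-window region extraction mentioned above is easy to make concrete. The following Python sketch (our illustration, with a stub standing in for the per-window classifier) enumerates windows over an image and shows why the approach is expensive: every window position must be classified separately, and the window count grows quickly with image size.

```python
import numpy as np

def sliding_windows(image, win=64, stride=32):
    """Yield (x, y, crop) for every window position over the image."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, image[y:y + win, x:x + win]

def stub_classifier(crop):
    # Placeholder for a per-window classifier: "object" if bright enough.
    return crop.mean() > 128

image = np.zeros((256, 256), dtype=np.uint8)
image[96:160, 96:160] = 255          # one bright square "object"

detections = [(x, y) for x, y, crop in sliding_windows(image)
              if stub_classifier(crop)]
total = sum(1 for _ in sliding_windows(image))
print(f"{total} windows evaluated, {len(detections)} fired")
```

Region-proposal methods such as selective search and the RPN replace this exhaustive enumeration with a much smaller set of likely object locations.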
2.2.3. Product Recognition Based on Deep Learning. Deep
learning has driven rapid progress in object detection
research. In this work, we treat product recognition as a
particular research issue related to object detection. At
present, computer vision has already achieved widespread use;
however, its application to product image recognition
is still imperfect. A typical pipeline of image-based product
recognition is shown in Figure 2, where the product images are
from the RPC dataset. In general, an object detector is used
to acquire a set of bounding boxes as region proposals.
Then, several single-product images are cropped from
the original image, which contains multiple products. Fi-
nally, each cropped image is fed into the classifier,
making product recognition an image classification problem.
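The detect-then-classify pipeline above can be sketched as follows. This Python example is a schematic illustration under our own assumptions: the detector and classifier are stubs standing in for trained models, and boxes use the (x1, y1, x2, y2) format.

```python
import numpy as np

def detect(image):
    # Stub detector: a real system would run a trained model (e.g. an
    # RPN-based detector) returning class-agnostic bounding boxes.
    return [(10, 10, 40, 40), (50, 20, 90, 60)]

def crop(image, box):
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def classify(crop_img):
    # Stub classifier: label each crop by its dominant colour channel.
    means = crop_img.reshape(-1, 3).mean(axis=0)
    return ["red_product", "green_product", "blue_product"][int(np.argmax(means))]

def recognize_products(image):
    """Detector -> per-box crops -> per-crop classifier, as in Figure 2."""
    return [classify(crop(image, box)) for box in detect(image)]

shelf = np.zeros((100, 100, 3), dtype=np.uint8)
shelf[10:40, 10:40, 0] = 200      # a reddish item
shelf[20:60, 50:90, 2] = 200      # a bluish item
print(recognize_products(shelf))  # ['red_product', 'blue_product']
```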
In the last few years, some large technology companies have
applied deep learning methods to recognising retail products
in order to set up unmanned stores. Amazon Go (https://www.
amazon.com/b?ie=UTF8&node=16008589011) was the first
unmanned retail store open to the general public, launched in
2018. There are dozens of CCTV cameras in an Amazon Go
store, and by using deep learning methods, the cameras are able
to detect customers' behaviour and identify the products
they are buying. Nevertheless, the image-based recognition
accuracy still leaves much to be desired. Hence, other
technologies, including Bluetooth and weight sensors, are also
employed to ensure that the retail products are identified
correctly. Shortly after the Amazon Go store, a new retail store
called the Intelligent Retail Lab (IRL) (https://www.
intelligentretaillab.com/) was opened by Walmart in 2019
to investigate the application of artificial intelligence in retail
services. In IRL, deep learning is used with cameras to
automatically detect out-of-stock products and alert staff
members when to restock. Furthermore, a number of intelli-
gent retail facilities, such as automatic vending machines and
self-serve scales, have emerged recently. A Chinese company,
DeepBlue Technology (https://en.deepblueai.com/), has de-
veloped automatic vending machines and self-checkout
counters based on deep learning algorithms, which can ac-
curately recognize commodities using cameras. Malong
Technologies (https://www.malong.com/en/home.html) is
another well-known business in China that aims to provide
deep learning solutions for the retail industry. The facilities
from Malong Technologies include AI Cabinets, which perform
automatic product recognition using computer vision
technology, and AI Fresh, which automatically identifies fresh
products on a self-serve scale. However, these
deep learning-based facilities are still in their early stages and
have not yet been widely deployed. More research and
practical testing are needed in this area.
Based on the above review of current studies, we suggest
that deep learning is an advanced and still-growing technique
for retail product recognition; however, more research is
needed in this area.
3. Challenges in Retail Product Recognition
As mentioned in the Introduction, the peculiarity of
retail product recognition makes it more complicated than
common object detection, since there are some specific
situations to consider. In this section, we summarize the
challenges of retail product recognition and classify
them into the four aspects shown in the following.
3.1. Large-Scale Classification. The number of distinct
products to be identified in a supermarket can be enormous,
approximately several thousand for a medium-sized gro-
cery store, which far exceeds the ordinary capability of object
detection models. Currently, YOLO [17, 51, 54], SSD,
Faster R-CNN, and Mask R-CNN are state-of-the-art object
detection methods, and they are evaluated on the
PASCAL VOC and MS COCO datasets. However,
PASCAL VOC contains only 20 classes of objects, and MS
COCO contains photos of 80 object categories. This is to say
that current object detectors cannot be applied
to retail product recognition directly, due to their limitations
with large-scale category sets. Figure 3 compares the results
on VOC 2012 (20 object categories) and COCO (80 object
categories) test sets with different algorithms, including
Faster R-CNN, SSD, and YOLOv2. We list only three object
detection approaches to demonstrate that the precision of all
detectors drops dramatically as the number of classes rises.
More comparative results can be found in the literature.
Figure 1: The flowchart of the product image recognition system.
Additionally, regarding the data distribution of the VOC
dataset, more than 70 percent of the images contain objects
belonging to a single category, and more than 50 percent involve
only one instance per image. On average, each picture contains
1.4 categories and 2.3 instances. The COCO dataset contains an
average of 3.5 categories and 7.7 in-
stances per image. In a practical grocery store scenario,
customers usually buy dozens of items from more than ten
categories. Therefore, the data above illustrate
that retail product recognition has its peculiarities
compared with common object detection. How to solve
this practical problem remains an open question.
3.2. Data Limitation. Deep learning-based approaches re-
quire a large amount of annotated data for training, which
raises a remarkable challenge in circumstances where only a
small number of examples are available. Table 1 lists
some open-source tools that can be used for image labelling.
These tools are divided into two categories: bounding
box and mask. The bounding box category includes tools
that can label an object with a bounding box, while tools in
the mask category are useful for image segmentation.
These image labelling tools require manual labour to label
every object in each image. Normally, there are at least tens
of thousands of training images in a general object detection
dataset, which indicates that creating a dataset with
enough training data for deep learning is time-consuming.
Furthermore, with regard to grocery product recog-
nition in retail scenarios, the majority of the training data
is acquired under ideal conditions instead of practical envi-
ronments. As the sample in Figure 4 shows, training
images are usually taken of the same single product
from several different angles on a rotating platform, while
testing images come from real conditions and contain
multiple products per image against a complex background.
Last but not least, the majority of researchers aim to perfect
datasets for common object detection, such as VOC 2012 and
COCO, which exacerbates the data limitation issue for product
recognition. Figure 5 illustrates that, compared with common
object datasets, retail product datasets have fewer images but
more classes. Therefore, it is necessary to provide a larger
dataset for training a deep learning model when we want that
model to be able to recognize objects from various categories.
Based on the above, we can conclude that data shortage is a
real challenge for retail product recognition.
3.3. Intraclass Variation. Intraclass classification, also
known as subcategory recognition, is a popular research
topic in both industry and academia, aiming at
distinguishing subordinate-level categories. Generally,
identifying intraclass objects is a very challenging task because
(1) objects from similar subordinate categories
often have only minor differences in certain areas of their
appearance, and sometimes the task is difficult even for
humans; (2) intraclass objects may present multiple ap-
pearance variations at different scales or from various
viewpoints; and (3) environmental factors, such as
lighting, backgrounds, and occlusions, may have a great
impact on the identification of intraclass objects. To
solve this challenging problem, fine-grained object classifi-
cation is required to identify subcategory object classes,
which involves finding the subtle differences among visually
similar subcategories.
Figure 2: A typical pipeline of image-based product recognition.
Figure 3: Comparative results on the VOC 2012 (20 classes) and COCO (80 classes) test sets for Faster R-CNN, SSD, and YOLOv2.
At present, fine-grained object classification is mainly applied to distinguish different
species of birds, dogs, flowers, or different
brands of cars. Moreover, compared with datasets for
common object classification, fine-grained image datasets
are more difficult to acquire because they require relevant
professional knowledge to complete the image annotations.
Due to the visual similarity in shape, colour,
text, and metric size between intraclass products, retail
products are very hard to identify. If it can be
difficult for customers to tell the difference between two
flavours of cookies of the same brand, we can expect it to
be equally difficult for computers to classify such intraclass
products. Figure 6(a) shows two products whose different
flavours produce only minute differences in colour and
text on the package; Figure 6(b) shows visually similar
products in different sizes. Additionally, to date there
are no specific fine-grained datasets for retail product
recognition. Fine-grained classification methods usually
require additional manually labelled information, and without
enough annotated data, it is more demanding to use deep
learning methods to identify similar products.
3.4. Flexibility. With the number of new products increasing
every day, grocery stores need to introduce new
items regularly to attract customers. Moreover, the ap-
pearance of existing products changes frequently over time.
For these reasons, a practical recognition system
should be flexible, requiring no or minimal retraining whenever
a new product or package is introduced. However, con-
volutional neural networks suffer from “catastrophic
forgetting”: they are unable to recognize previously
learned objects when adapted to a new task.
Figure 7 illustrates that, after training a detector on a
new class, banana, it may forget previously learned
objects. The top detector is trained on a dataset including
orange, so it can detect the orange in the image. Then, intro-
ducing the new class banana, we train the detector only
on banana images rather than on all classes jointly. The
resulting bottom detector can recognize the new class,
banana, in the image; nevertheless, it fails to localize the
orange because it has forgotten its original knowledge of
that class.
Currently, top-performing image classification and ob-
ject detection models have to be retrained completely when
a new category is introduced. This poses a key issue, as collecting
Table 1: Image labelling tools.
Categories    Tools                                                                Environment
Bounding box  labelImg (https://github.com/tzutalin/labelImg)                      Python
              bbox-label-tool (https://github.com/puzzledqs/BBox-Label-Tool)       Python
              LabelBoundingBox (https://github.com/hjptriplebee/LabelBoundingBox)  Python
              YOLOmark (https://github.com/AlexeyAB/Yolo_mark)                     Python
              CVAT (https://github.com/opencv/cvat)                                Python
              RectLabel (https://rectlabel.com/)                                   Mac OS
              VoTT (https://github.com/microsoft/VoTT)                             Java/Python
Mask          labelme (https://github.com/wkentaro/labelme)                        Python
              Labelbox (https://github.com/Labelbox/Labelbox)                      Java/Python
Figure 4: GroZi-120: samples of training images (a) and testing images (b).
Figure 5: Comparison between common object datasets and retail product datasets (x-axis: number of images).
new training data and retraining networks can be time-
consuming. Therefore, how to develop an object detector
with long-term memory is a problem worth studying.
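To make the retraining issue concrete, the numpy sketch below (our own illustration, not taken from any cited system) shows the usual starting point for adding a class without full retraining: the classifier's final linear layer is expanded with a new output row while the existing rows are kept, so old-class scores are initially unchanged. Note that this alone does not prevent forgetting; once the shared feature extractor is fine-tuned only on the new class, the old rows no longer match the drifted features.

```python
import numpy as np

def expand_head(W, b, num_new=1, rng=None):
    """Add output rows for new classes to a (classes x features) linear head,
    preserving the existing class weights exactly."""
    rng = rng or np.random.default_rng(0)
    W_new = rng.normal(scale=0.01, size=(num_new, W.shape[1]))
    b_new = np.zeros(num_new)
    return np.vstack([W, W_new]), np.concatenate([b, b_new])

rng = np.random.default_rng(42)
features = rng.normal(size=5)          # output of a shared feature extractor
W_old = rng.normal(size=(3, 5))        # head for 3 known classes
b_old = np.zeros(3)

W, b = expand_head(W_old, b_old, num_new=1)   # add a 4th class, e.g. "banana"
old_scores = W_old @ features + b_old
new_scores = W @ features + b

# The original classes keep identical scores right after expansion...
print(np.allclose(new_scores[:3], old_scores))   # True
# ...forgetting arises later, when training only on the new class shifts
# the shared features away from what the old rows expect.
```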
4. Current Techniques
Concerning the four challenges proposed in Section 3, we
review a considerable amount of literature and summarize
current techniques related to deep learning, aiming to
provide references with which readers can quickly gain
entrance to the field of deep learning-based product rec-
ognition. In this paper, we not only introduce ap-
proaches within the scope of deep learning but also present
related methods that can be combined with deep learning to
improve recognition performance. Figure 8 maps the
techniques to the proposed challenges.
4.1. CNN-Based Feature Descriptors. The key issue in image
classification lies in the extraction of image features; using
the extracted features, images can be categorized into
different classes. For the challenge of large-scale classifica-
tion described in Section 3, the traditional hand-crafted
feature extraction methods, e.g., SIFT [24, 25] and SURF
[26, 27], have largely been overtaken by the convolutional
neural network (CNN) due to their limitations in extracting
deep information from images. At the moment, the CNN is a
promising technique with a strong ability to create embeddings
for different classes of objects. Some researchers have attempted
to use CNNs for feature extraction [48, 67–69]. Table 2 shows
the related works with CNN-based feature descriptors for
retail product recognition.
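Several of the works below classify a product by running a K-NN similarity search over CNN embeddings. The numpy sketch below illustrates that retrieval step only; the embeddings are random stand-ins for the output of a CNN feature descriptor such as VGG-16 or ResNet, and all names are our own.

```python
import numpy as np

def knn_predict(query, gallery, labels, k=3):
    """Classify a query embedding by majority vote among its k nearest
    gallery embeddings under cosine similarity."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = g @ q                          # cosine similarity to each item
    top = np.argsort(-sims)[:k]           # indices of the k most similar
    votes = [labels[i] for i in top]
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(0)
# Stand-ins for CNN embeddings of reference product images: 2 products,
# 5 reference images each, 128-D; each product clusters around a prototype.
protos = rng.normal(size=(2, 128))
gallery = np.vstack([p + 0.1 * rng.normal(size=(5, 128)) for p in protos])
labels = ["cereal_a"] * 5 + ["cereal_b"] * 5

query = protos[1] + 0.1 * rng.normal(size=128)   # a new photo of product b
print(knn_predict(query, gallery, labels))
```

The appeal of this design is that adding a new product only requires adding its reference embeddings to the gallery, with no classifier retraining.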
In , Inception V3  has been used to implement
image classiﬁcation of eight diﬀerent kinds of products on
the shelves. The drawback is that the prediction accuracy on images from real stores only reaches 87.5%, which needs to be improved. Geng et al.  employed VGG-16 as
the feature descriptor to recognize the product instances,
Figure 6: Intraclass products with diﬀerent ﬂavours (a) (honey ﬂavour and chocolate ﬂavour) and size (b) (110 g and 190 g).
Figure 7: An example of introducing a new class to an existing retail product detector.
Figure 8: Techniques for challenges.
achieving recognition for 857 classes of food products. In
this work, VGG-16 is integrated with recurring features and
attention maps to improve the performance of grocery
product recognition in the real-world application scenario.
The authors also implemented their method with ResNet;
then, 102 grocery products from CAPG-GP (the dataset built
in this paper) were successfully classiﬁed with the mAP of
0.75. Another notable work using ResNet is from  that
introduces a scale-aware network for generating product
proposals in supermarket images. Although this method
does not aim to predict the product categories, it can ac-
curately perform the object proposal detection for the
products with diﬀerent scale ratios in one image, which is a
practical issue in supermarket scenarios. In , the authors
considered three diﬀerent popular CNN models, VGG-16
, ResNet , and Inception , in their approach and
performed the K-NN similarity search extensively with the
output of the three models. Their method was evaluated with
three grocery product datasets, and the largest one contained
938 classes of food items. AlexNet was exploited in  to
compute visual features of products, combining deep class
embedding into a CRF (conditional random ﬁeld) 
formulation, which enables classifying products with a huge
number of classes. The benchmark in this paper involved 24,024 images and 460,121 objects, and each object belonged to one of 972 different classes. The above methods are only applicable to small retail stores, as all of them recognize up to about 1,000 classes of products, whereas medium-sized and large-sized retail stores require the ability to classify far more categories of items.
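Several of the works above share the same recognition pattern: embed a query image with a CNN and run a k-NN similarity search against per-class reference embeddings. A minimal sketch of this idea, with random vectors standing in for real CNN features (all names and sizes here are illustrative, not taken from the cited papers):

```python
import numpy as np

def knn_search(query, reference, k=3):
    """Cosine-similarity k-NN over L2-normalised embeddings."""
    q = query / np.linalg.norm(query)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = r @ q                       # cosine similarity to every reference
    return np.argsort(-sims)[:k]       # indices of the k most similar items

rng = np.random.default_rng(0)
reference = rng.normal(size=(938, 128))   # e.g. one embedding per product class
query = reference[42] + 0.01 * rng.normal(size=128)  # noisy view of class 42
print(knn_search(query, reference, k=3)[0])  # → 42
```

In practice the reference matrix would hold embeddings computed once per catalogue image, so adding a product only appends a row rather than retraining a classifier.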
Recent works have tried to realize large-scale classiﬁ-
cation, e.g., Tonioni et al. and Karlinsky et al. [20, 21]
proposed approaches that can detect several thousand
product classes. In , the backbone network for its feature
descriptor is VGG, from which a global image embedding is
obtained by computing MAC (maximum activations of
convolutions) features . This research deals with the
products belonging to 3,288 diﬀerent classes of food
products. Finally, Tonioni et al. obtained state-of-the-art
results of precision and recall, as 57.07% PR and 36.02%
mAP, respectively. In the work of Karlinsky et al. , the
CNN feature descriptor is based on ﬁne-tuning a variant of
the VGG-F network , which deploys the ﬁrst 2–15 layers
of VGG-F trained on ImageNet  unchanged. As a result,
the authors presented a method to recognize each product
category out of a total of 3,235, with an mAP of 52.16%.
According to the data from these two papers, it is obvious
that the recognition accuracy, including precision and recall,
still has a considerable space for improvement to implement
this technique in the retail industry area.
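The MAC descriptor mentioned above is simple to state: take the maximum activation of each channel of a convolutional feature map, then L2-normalise the resulting vector. A small sketch, with a random array standing in for a real VGG activation map:

```python
import numpy as np

def mac_descriptor(feature_map):
    """Maximum Activations of Convolutions: one global max per channel,
    followed by L2 normalisation (feature_map shape: channels x H x W)."""
    v = feature_map.max(axis=(1, 2))
    return v / (np.linalg.norm(v) + 1e-12)

fmap = np.random.default_rng(1).random((512, 7, 7))  # e.g. a last-conv-layer map
desc = mac_descriptor(fmap)
print(desc.shape, round(float(np.linalg.norm(desc)), 3))  # (512,) 1.0
```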
Lately, YOLO9000 , one of the most popular object detectors, has been proposed to detect 9,000 object classes using a revised Darknet . Unfortunately, YOLO9000 has
been trained with millions of images, which is infeasible in
the case of training a product recognition model due to the
high annotation costs. However, the success of YOLO9000
illustrates the potential ability of the CNN to achieve large-scale classification (thousands of classes). The problem of how to produce more data for training is discussed in the next section.
4.2. Data Augmentation. It is common knowledge that deep
learning methods require a large number of training ex-
amples; nevertheless, acquiring large sets of training ex-
amples is often tricky and expensive . Data
augmentation is a common technique used in deep network
training to handle the shortage of training data . This technique uses a small number of images to generate new
synthetic images, aiming to artiﬁcially enlarge the small
datasets to reduce the overﬁtting [15, 88]. In this paper, we
divide the current mainstream approaches into two categories: common synthesis methods and generative models. The existing publications are listed in Table 3.
4.2.1. Common Synthesis Methods. The common methods
for image data augmentation generate new images through
translations, rotations, mirror reﬂections, scaling, and
adding random noise [15, 90, 91]. A signiﬁcant attempt can
be found in the work of Merler et al. ; synthetic samples
were created from images under ideal imaging conditions
(referred to as in vitro) by applying randomly generated transformations.
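Such operations can be sketched in a few lines; the pipeline below is purely illustrative (random mirror, 90-degree rotation, additive noise, and a block-erasing step as a crude occlusion simulation), not the exact procedure of any cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy augmentation pipeline for a square HxWx3 image in [0, 1]."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                       # horizontal mirror
    out = np.rot90(out, k=rng.integers(4))       # random 90-degree rotation
    out = out + rng.normal(0, 0.02, out.shape)   # additive Gaussian noise
    h, w = out.shape[:2]
    y, x = rng.integers(h // 2), rng.integers(w // 2)
    out[y:y + h // 4, x:x + w // 4] = 0.0        # erase a block (occlusion)
    return np.clip(out, 0.0, 1.0)

img = rng.random((64, 64, 3))                    # a dummy RGB product image
augmented = [augment(img) for _ in range(8)]     # 8 synthetic variants
print(len(augmented), augmented[0].shape)
```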
The occlusion of products is also a common phenomenon in real practice. In , the authors proposed a
virtual supermarket dataset to let models learn in the virtual
environment. In this dataset, the occlusion threshold is set to
0.9, which means the product occluded under the threshold
0.9 will not be labelled as the ground truth. UnrealCV 
was employed to extract the ground truth of object masks
from real-world images. Then, they manipulated the
extracted object masks on a background of shelves and
rendered 5,000 high-quality synthetic images. In this paper,
some other aspects such as realism, the randomness of
placement, products’ overlapping, object scales, lighting, and
materials were taken into account when constructing the
synthetic dataset. By using the virtual supermarket dataset,
they achieved identiﬁcation of items in the real-world
datasets without ﬁne-tuning. Recently, Yi et al.  tried to
simulate the situation of occlusion by overwriting a random
region in the original image either by a black block or a
random patch from another product. Then, they fine-tuned
their Faster R-CNN detection model with in vitro (in ideal
conditions) and in situ (in natural environments) data and
obtained a relatively high rate in mAP and recall. In situ is
divided into conveyor and shelf scenarios where the authors
obtained the mAP of 0.84 on the conveyor and 0.79 on the shelf, respectively. Some synthetic samples are shown in the first two rows of Figure 9. Regrettably, comparative experiments between the proposed algorithm and other state-of-the-art algorithms are absent from this paper.
Table 2: CNN-based feature descriptors and relevant approaches where these descriptors are employed.
Feature descriptors Approaches
Inception  [71, 72]
GoogLeNet  
AlexNet  [21, 53, 58, 67, 73]
VGG  [20, 21, 71, 74–76]
CaffeNet  [10, 67, 73, 77]
ResNet  [22, 68, 71, 74, 78–80]
The work in  synthesizes new images containing
multiple objects by combining and reorganizing atom object
masks. Ten thousand new images were assembled, which
contained one to ﬁfteen objects randomly. For each gen-
erated image, the lighting, the class of object instances, the
orientation, and the location in the image are randomly
sampled. The last row in Figure 9 shows example synthetic
images under three diﬀerent lightings. By adding the 10,000
generated images to the training set, the AP on the test set
has been improved to 79.9% and 72.5% for Mask R-CNN
 and FCIS , respectively. By contrast, the AP is only 49.5% and 45.6% without the generated images.
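The assemble-by-compositing idea can be sketched as follows; the crop sizes, the lighting range, and the plain backdrop are illustrative assumptions, not details taken from the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def compose(background, crops, n_objects=5):
    """Paste object crops at random locations on a shelf background,
    with a random per-paste brightness factor as a crude lighting change."""
    canvas = background.copy()
    bh, bw = canvas.shape[:2]
    for _ in range(n_objects):
        crop = crops[rng.integers(len(crops))]
        ch, cw = crop.shape[:2]
        y, x = rng.integers(bh - ch), rng.integers(bw - cw)
        lighting = rng.uniform(0.7, 1.3)
        canvas[y:y + ch, x:x + cw] = np.clip(crop * lighting, 0.0, 1.0)
        # a real pipeline would also blend masks, vary scale, handle overlap
    return canvas

background = np.full((128, 128, 3), 0.9)              # plain shelf backdrop
crops = [rng.random((24, 24, 3)) for _ in range(10)]  # cut-out product masks
synthetic = compose(background, crops)
print(synthetic.shape)
```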
To realize product recognition with a single example,
researchers in  generated large numbers of training
images using geometric and photometric transformations
based on a few available training examples. In order to
facilitate image augmentations for computer vision tasks,
Albumentations is presented in  as a publicly available tool that enables a variety of image transformation operations. Recent work in  has applied Albumentations with a
small training dataset and then trained the product detection
model with the augmented dataset. The outcomes show that the model can attain reasonable detection accuracy with only a small amount of training data.
However, the common methods for generating new images have limitations in simulating the varied conditions of the real world. Generative models can instead synthesize such conditions in a more principled way.
4.2.2. Generative Models. Nowadays, generative models, including the variational autoencoder (VAE)  and generative adversarial networks (GANs) , are gaining more and more attention due to their potential ability to synthesize in vitro images similar to those in realistic scenes for data augmentation . Normally, generative models enrich the training dataset in two ways. One is generating new images with an object that looks similar to the real data. The
synthetic images can directly increase the number of training
images for each category. Another approach is the image-to-
image translation, which is described as the issue of
translating the picture style from the source domain to the
target domain . For example, if the target domain is
deﬁned as a practical scene in a retail store, this image
transfer approach can improve training images to be more
realistic, such as diﬀerent lightings, views, and backgrounds.
In Table 4, we list some state-of-the-art models that are based
on the architectures of VAE and GAN for image generation
and translation, respectively. The works displayed in the table show that GAN-based models are powerful both for producing new images and for enabling image-to-image transfer. Unfortunately, VAE-based approaches have so far been unable to achieve image translation tasks. The detailed research and application status of image synthesis with VAE and GAN are introduced in the following.
VAE has not been applied as an image creator in the
domain of product recognition so far. The general framework of a VAE comprises an “encoder network” and a “decoder network.” After training the model, the “decoder network” can be used to generate realistic images. In this paper,
we present some successful cases of VAE in other classiﬁ-
cation and detection ﬁelds for reference. In , a novel
layered foreground-background generative model trained in
an end-to-end deep neural network using VAE is provided
for generating realistic samples from visual attributes. This model was evaluated with the Labeled Faces in the Wild (LFW) dataset  and the Caltech-UCSD Birds-200-2011 (CUB) dataset , which contain natural images of faces and birds, respectively. The authors trained an attribute regressor to compare
the diﬀerences between generated images and real data.
Finally, their model achieved 16.71 mean squared error
(MSE) and 0.9057 cosine similarity on the generated sam-
ples. Another noteworthy work is from , where the
authors used a conditional VAE to generate the samples
from the given attributes for addressing zero-shot learning
problems. They tested this method on four benchmark
datasets, AwA , CUB , SUN , and ImageNet
, and gained state-of-the-art results, particularly in a
more realistic generalized setting. These successful application examples show that the VAE is a promising technique for data augmentation. With the increasing attention on product recognition, the VAE is likely to be applied in this area soon.
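The encoder/decoder mechanics can be illustrated with the standard VAE reparameterisation trick and the closed-form Gaussian KL term; the shapes and the zero-valued encoder outputs below are placeholders, not values from any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """The VAE reparameterisation trick: sample z = mu + sigma * eps."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian encoder, per sample."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

mu, logvar = np.zeros((4, 16)), np.zeros((4, 16))   # dummy encoder outputs
z = reparameterize(mu, logvar)                       # latent codes to decode
print(z.shape, kl_to_standard_normal(mu, logvar))    # KL is 0 at N(0, I)
```

After training, new images are produced by sampling z from N(0, I) and passing it through the decoder alone.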
GAN, which was proposed in 2014, has achieved remarkable results in various research fields. The
framework of the GAN consists of two models: a generator
that produces fake images and a discriminator that estimates
the probability that a sample is a real image rather than a fake
one . As a result, compared with common synthetic
methods, the generator can be used to generate images that
look more realistic.
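This adversarial game can be written down directly. The sketch below evaluates the discriminator loss and the non-saturating generator loss on dummy discriminator logits (the logit values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gan_losses(d_logits_real, d_logits_fake):
    """Losses of the GAN game: the discriminator scores real images high
    and fakes low; the generator is rewarded when fakes are scored as real
    (the non-saturating variant of the generator objective)."""
    d_real, d_fake = sigmoid(d_logits_real), sigmoid(d_logits_fake)
    d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))
    g_loss = -np.mean(np.log(d_fake))
    return d_loss, g_loss

# Dummy logits: real images confidently accepted, fakes confidently rejected,
# so the discriminator loss is small and the generator loss is large.
d_loss, g_loss = gan_losses(np.array([3.0, 2.5]), np.array([-3.0, -2.5]))
print(round(float(d_loss), 3), round(float(g_loss), 3))
```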
With the advantage of generating realistic images,
scholars have demonstrated the great potential of using the
GAN and its variants [104–107] to produce images for enlarging the training set. For example, in , the authors built a
framework with structure-aware image-to-image translation
networks, which could generate large-scale trainable data.
After training with the synthetic dataset, the proposed de-
tector provided a signiﬁcant performance on night-time
vehicle detection. In another work , a novel deep semantic hashing method was presented, combined with a semisupervised GAN, to produce highly compelling data with intrinsic invariance and global coherence. This method
achieved state-of-the-art results with CIFAR-10  and
NUS-WIDE  datasets.
Table 3: Related works for data limitation in the field of retail product recognition.
Technique Categories Existing works
Data augmentation Common synthesis [22, 76, 80], [21, 79, 89]
Data augmentation Generative [7, 71, 78]
A new image density model
based on the PixelCNN architecture was established in ,
which could be used to generate images from diverse classes
by simply conditioning on a one-hot encoding of that class.
Zheng et al. employed the DCGAN to produce unlabeled
images in  and then applied these new images to train
the model for recognising fine-grained birds. This method attained an improvement of +0.6% over a powerful
baseline . In , CycleGAN was used to create
200,000 license plate images from 9,000 real pictures. Its
result demonstrated an increase of 7.5 percentage points of
recognition precision over a strong benchmark that was
trained only with real data. The evidence above indicates that
GANs are powerful tools for generating realistic images that
can be used for training deep neural networks. It is likely
that, in the near future, the experience of the above methods can be borrowed to improve product recognition.
Although GANs have shown compelling results in the
domains of general object classiﬁcation and detection, there
are very few works using GANs for product recognition. To
the best of our knowledge, there are only three papers
[7, 71, 78] attempting to exploit GANs to create new images
in the ﬁeld of product recognition. In the work of , the
authors proposed a large-scale checkout dataset containing
synthetic training images generated by CycleGAN .
Technically, they first synthesized images with object instances on a prepared background image. Then, CycleGAN
was employed to translate these images into the checkout
image domain. By training with the combination of trans-
lated images and original images, their product detector,
feature pyramid network (FPN) , attained 56.68% ac-
curacy and 96.57% mAP on average. Figure 10 illustrates the translation effects of CycleGAN. Based on the work of Wei
et al. , Li et al.  conducted further research through
selecting reliable checkout images with the proposed data
priming network (DPNet). Their method achieved 80.51%
checkout accuracy and 97.91% mAP. In , GAN was
deployed to produce realistic samples, as well as to play an
adversarial game against the encoder network. However, the
translated images in both [7, 78] only contain a simple
background of ﬂat colour. Considering the complex back-
grounds of the real checkout counter and the goods shelf,
how to generate retail product images in a more true-to-life
setting is worthy of research.
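The cycle-consistency constraint at the heart of CycleGAN reduces to an L1 reconstruction penalty: translating A→B→A should recover the input. In this toy sketch the two translators are hypothetical invertible functions rather than trained networks:

```python
import numpy as np

def cycle_consistency_loss(x, G_ab, G_ba):
    """CycleGAN-style L1 cycle loss over a batch of images."""
    return np.mean(np.abs(G_ba(G_ab(x)) - x))

x = np.random.default_rng(0).random((8, 32, 32, 3))  # a batch of images
G_ab = lambda v: v * 2.0 + 0.1        # hypothetical translator A -> B
G_ba = lambda v: (v - 0.1) / 2.0      # its (here exact) inverse B -> A
print(float(cycle_consistency_loss(x, G_ab, G_ba)))  # ~0: cycle recovers x
```

During real training this term is added to the two adversarial losses, discouraging the translators from discarding the product content while changing the image style.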
4.3. Fine-Grained Classiﬁcation. Fine-grained classiﬁcation
is a challenging problem in computer vision, which can
enable computers to recognize the objects of subclass cat-
egories [124, 125]. Recently, a number of researchers and
engineers have focused on the technique of fine-grained classification and have already applied it in a significant number of domains with remarkable achievements, e.g., animal
breeds or species [126–131], plant species [62, 131–133], and
artificial entities [129, 130, 134–136].
Figure 9: First two rows show examples of occlusion simulation in , and the third row demonstrates example images from  under three different lightings.
Table 4: Summary of models based on the structures of VAE and GAN for image synthesis.
Synthesis type VAE GANs
Image generation VAE , cVAE , Attribute2Image , Multistage VAE  GAN , CGAN , DCGAN , InfoGAN 
Image translation — 
Fine-grained retail product recognition is a more challenging task than general object recognition due to intraclass variance and interclass similarity. Considering the specific complications in product
recognition in terms of blur, lighting, deformation, orien-
tation, and the alignment of products in shelves, we sum-
marized the existing product ﬁne-grained classiﬁcation
methods into two categories, i.e., ﬁne feature representation
and context awareness.
4.3.1. Fine Feature Representation. Fine feature represen-
tation refers to extracting ﬁne features in a local part of the
image to ﬁnd the discriminative information between vi-
sually similar products. As a consequence, how to eﬀectively
detect foreground objects and ﬁnd important local infor-
mation has become a principal problem for ﬁne-grained
feature representation. According to the supervisory in-
formation for training the models, the ﬁne feature repre-
sentation methods can be divided into two categories:
“strongly supervised ﬁne feature representation” and
“weakly supervised ﬁne feature representation.”
(1) Fine Feature Representation from Strongly Supervised
Models. The strongly supervised methods require additional
manual labelling information such as a bounding box and
part annotation. As mentioned in Section 3, the practical
applicability of such methods has been largely limited by the
high acquisition cost of annotation information. The classical methods include part-based R-CNN  and pose-normalized CNN .
In , part-based R-CNN was established to identify fine-grained species of birds. This method uses R-CNN to extract features from whole objects (birds) and local areas (head, body, etc.). Then, for each region proposal, it
computes scores with features from an object and each of its
parts. Finally, by jointly considering the scores of these fine-grained features, this method achieves state-of-the-art results on the widely used fine-grained benchmark
Caltech-UCSD bird dataset .
Branson et al. presented pose-normalized CNN in ,
and the ﬁne-grained feature extraction process in this paper
is as follows: (1) the DPM algorithm is used to detect the
object location and its local areas. (2) The image is cropped
according to the bounding boxes, and features are extracted
from each cropped image. (3) Based on diﬀerent parts,
convolution features are extracted from multiple layers of
the CNN. (4) These features are imported into one-vs-all
linear SVMs (support vector machines)  to learn
weights. Eventually, the classiﬁcation accuracy of their
method reached 75.7% on the Caltech-UCSD bird dataset.
In the domain of retail product recognition, the work in
 can be considered as a solution for the ﬁne-grained
classification to some extent. The researchers designed an algorithm called DiffNet that can detect the differing products between a pair of similar images. They labelled only the differing products in each pair of images, so there is no need to annotate the unchanged objects. As a result, this algorithm achieved a relatively desirable detection accuracy of 95.56% mAP. DiffNet would probably
beneﬁt the progress of product recognition, particularly for
detecting the changes of the on-shelf products.
(2) Fine Feature Representation from Weakly Supervised
Models. The weakly supervised techniques avoid the use of
costly annotations such as bounding boxes and part in-
formation. Similar to the strongly supervised classiﬁcation
methods, the weakly supervised methods also require global
and local features for the fine-grained classification. Consequently, the principal task of a weakly supervised model is to detect the parts of the object and extract fine-grained features.
The two-level attention  algorithm is the first attempt to perform fine-grained image classification without relying on part annotation information. This method is based on a simple intuition: extracting the features from the
object level and then focusing on the most discriminative
parts that can be used for the fine-grained classification. The constellation  algorithm was proposed by Simon and
Rodner in 2015. It exploits the features from the convolution
neural network to generate some neural activation patterns
that can be used to extract features from parts of the object.
Another remarkable work is from , where the authors
proposed novel bilinear models that contain two CNNs, A and B. CNN A localizes the object and its parts, while CNN B extracts features from the resulting region proposals. These two networks coordinate with each other and obtain 84.1% accuracy on the
Caltech-UCSD bird dataset.
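The pooling step of such bilinear models can be sketched directly: accumulate the outer products of the two feature streams over spatial locations, then apply the signed square root and L2 normalisation; the channel counts below are arbitrary stand-ins:

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Bilinear pooling of two feature maps (channels x locations):
    sum the per-location outer products, then signed sqrt + L2 norm."""
    b = fa @ fb.T                              # (Ca, Cb) pooled outer products
    b = np.sign(b) * np.sqrt(np.abs(b))        # signed square root
    return (b / (np.linalg.norm(b) + 1e-12)).ravel()

rng = np.random.default_rng(0)
fa = rng.random((64, 49))   # stream A: 64 channels over a 7x7 grid
fb = rng.random((32, 49))   # stream B: 32 channels over the same grid
desc = bilinear_pool(fa, fb)
print(desc.shape)           # (2048,) fine-grained image descriptor
```

The resulting descriptor captures pairwise channel interactions, which is what makes it sensitive to the subtle local differences that separate visually similar products.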
Figure 10: Synthesized checkout images (left) and the corresponding images generated by CycleGAN (right) from .
Regarding the ﬁne-grained classiﬁcation of retail products,
some researchers have begun to take advantage of fine feature representation to identify subclass products. In , a
CNN was proposed for improving the ﬁne-grained classiﬁ-
cation performance, combined with scored short-lists of
possible classiﬁcations from a fast detection model. Speciﬁcally,
a variable containing the product of the scores from a fast detection model and the corresponding CNN confidences is used for ranking the final result. In the research of , Geng et al.
applied visual attention  to ﬁne-grained product classiﬁ-
cation tasks. Attention maps are employed to magnify the
inﬂuences of the features, consequently to guide the CNN
classiﬁer to focus on ﬁne discriminative details. Eventually, they
compared their method with state-of-the-art approaches and
obtained promising results. Based on the method of ,
George et al. performed ﬁne-grained classiﬁcation for products
on a shelf in . They extracted midlevel discriminative
patches on product packaging and then employed SVM
classiﬁers to diﬀerentiate visually similar product classes by
analyzing the extracted patches. Their work shows the superior
performance of using discriminative patches in the ﬁne-
grained product classiﬁcation. In the recent study from , a
self-attention module is proposed for capturing the most informative parts in images. The authors compared the activation
response of a position with the mean value of features to locate
the crucial parts of the fine-grained objects. The experimental
results in  show that the ﬁne-grained recognition per-
formance has been improved in cross-domain scenarios.
4.3.2. Context Awareness. Context is a statistical property of
the world which provides critical cues to help us detect
speciﬁc objects in retail stores , especially when the
appearance of an object may not be suﬃcient for accurate
categorization. Context information has been applied to
improve the performance for the domain of object detection
[145–147] due to its ability to provide useful information
about spatial and semantic relationships between objects.
With regard to the scenario in a supermarket, products
are generally placed on shelves according to certain ar-
rangement rules, e.g., intraclass products are more likely to
appear adjacent to each other on the same shelf. Conse-
quently, context can be considered as a reference for rec-
ognising similar products on shelves, jointly with deep
features. Currently, there are few works of the literature taking
contextual information into account with deep learning de-
tectors for product recognition. In , a novel technique to
learn deep contextual and visual features for the ﬁne-grained
classification of products on shelves is introduced. Technically, the authors proposed a CRF-based method  to learn the
class embedding from a CNN concerning its neighbour’s
visual features. In this paper, the product recognition problem
is addressed not only based on its visual appearance but also
on its relative locations. is method has been evaluated on a
dataset that contains product images from retail stores, and it
improves the recall to 87% with 91% precision. Another two
papers also obtained prominent results by considering the
context. However, they did not use a deep learning-based
feature descriptor. One is from , and it presents a context-
aware hybrid classiﬁcation system for ﬁne-grained product
recognition, which combines the relationships between the
products on the shelf with image features extracted by SIFT
methods. is method achieves an 11.4% improvement
compared with the context-free method. In , the authors
proposed a computer vision pipeline that detects missing or
misplaced items by using a novel graph-based consistency
check method. is method regards the product recognition
problem as a subgraph isomorphism between the item
packaging and the ideal locations.
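As a toy illustration of the context cue (not the CRF formulation of the cited work), a product's visual class scores can be blended with an assumed shelf co-occurrence table:

```python
import numpy as np

def context_rescore(visual_scores, neighbour_label, cooccur, weight=0.5):
    """Blend a product's visual class scores with how often each class
    sits next to an already-recognised neighbour on the shelf.
    cooccur[i, j] is an assumed co-occurrence probability of classes i, j."""
    context = cooccur[:, neighbour_label]
    scores = visual_scores + weight * context
    return int(np.argmax(scores))

visual_scores = np.array([0.40, 0.38, 0.05])   # classes 0 and 1 look alike
cooccur = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])          # intraclass adjacency is likely
print(context_rescore(visual_scores, neighbour_label=1, cooccur=cooccur))  # 1
```

Here the visual evidence alone slightly favours class 0, but the shelf context (a known class-1 neighbour) tips the decision to class 1, which mirrors the intuition described above.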
4.4. One-Shot Learning. One-shot learning is derived from
distance metric learning  with the purpose of learning
information about object categories from one or only a few
training samples/images . It is of great beneﬁt for
seamlessly handling new products/packages as the only
requirement is to introduce one or several images of the new
item into the reference database with no or minimal
retraining. The basic concept of how to classify objects with one-shot learning is shown in Figure 11. The points C1, C2, and C3 are the mean centres of the feature embeddings from three different categories, respectively. Based on the feature embedding of X, the distances between X and the three points (C1, C2, and C3) can be computed. Thus, X will be assigned to the class with the shortest distance. Additionally, one-shot learning is also a powerful
method to deal with the training data shortage, with the
possibility of learning much information about a category
from just one or a handful of images . Considering the
advantages of one-shot learning, a lot of literature has
combined one-shot learning with the CNN for a variety of
tasks including image classiﬁcation [150–153] and object
detection [154, 155].
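The nearest-centre rule of Figure 11 can be sketched in a few lines; the embeddings below are synthetic stand-ins for CNN features:

```python
import numpy as np

def nearest_prototype(x, prototypes):
    """Classify embedding x by its distance to the class mean centres
    (C1, C2, C3 in Figure 11): the closest prototype wins."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
# Three well-separated class centres in an 8-dimensional embedding space.
prototypes = np.stack([rng.normal(i * 5.0, 1.0, size=8) for i in range(3)])
x = prototypes[2] + 0.1 * rng.normal(size=8)    # a query near centre C3
print(nearest_prototype(x, prototypes))         # → 2
```

Introducing a new product then amounts to adding one more prototype row computed from a single reference image, with no retraining of the embedding network.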
In , a novel metric was proposed, including colour-
invariant features from intensity images with CNNs and
colour components from a colour checker chart. The metric
is then used by a one-shot metric learning approach to
realize person identiﬁcation. Vinyals et al. in  designed
a matching network, which employs metric learning based
on deep neural features. Their approach is tested on the
ImageNet dataset and is able to recognize new items when
introducing a few examples of a new item. Compared with
the Inception classiﬁer , it has increased the accuracy of
one-shot classiﬁcation on ImageNet from 87.6% to 93.2%. In
the domain of object detection, the work in  combines
distance metric learning with R-CNN and implements an-
imal detection with few training examples. Video object
segmentation was achieved in , where the authors
adapted the pretrained CNN to retrieve a particular object
instance, given a single annotated image, by ﬁne-tuning on a
segmentation example for the speciﬁc target object.
Two very recent papers have succeeded in addressing
the speciﬁc domain of retail products to take the experience
of one-shot learning combined with deep features from
CNNs. In , a framework integrating feature-based
matching and one-shot learning with a coarse-to-ﬁne
strategy is introduced. This framework is flexible, allowing new product classes to be added without
retraining existing classiﬁers. It has been evaluated on the
GroZi-3.2k , GP-20 , and GP181  datasets and
attained 73.93%, 65.55%, and 85.79% for mAP, respec-
tively. Another work from  proposes a pipeline which
pursues product recognition through a similarity search
between the deep features of reference and query images.
Their pipeline requires just one training image for each product class and seamlessly handles new product classes.
In this section, we provided a comprehensive literature
review to summarize the research status of the four tech-
niques, which are powerful tools to deal with the challenging
problems of product recognition. In the next section, we
introduce the public datasets and present a comparative
study on the performances of deep learning methods.
5. Dataset Resources
As mentioned earlier, deep learning always requires plenty of annotated images for training and testing, while it is
often labour-intensive to obtain labelled images in real
practice. In this section, we present public dataset resources,
assisting researchers in testing their methods and comparing
results based on the same dataset. According to the diﬀerent
application scenarios, we split the resources into two cate-
gories: on-shelf and checkout. Table 5 lists the detailed
information of several available datasets, including the
number of product categories, the number of instances in
each image, and the number of images in the training and
testing sets. The datasets are briefly introduced in the following subsections.
5.1. On-Shelf Datasets. On-shelf datasets are benchmarks for
testing methods proposed to recognize products on shelves,
which shall beneﬁt product management. Here, we present
six available datasets.
5.1.1. GroZi-120. e GroZi-120 dataset  consists of 120
product categories, with images representing the same
products under completely diﬀerent conditions, together
with their text annotations. The training set includes the 676
images in vitro: such images are captured under ideal
conditions. Each training image contains just one single instance, making this dataset well suited to one-shot learning. The test set has 4,973 frames annotated with ground truth in situ,
which are rack images obtained from natural environments
with a variety of illuminations, sizes, and poses. It also has 29
videos with a total duration of 30 minutes, including every
product presented in the training set. The in situ videos are
recorded using a VGA resolution MiniDV camcorder at
30 fps, and the in situ rack images are of low resolution.
Samples of in vitro images and in situ rack images are shown
in Figure 4.
5.1.2. GroZi-3.2k. GroZi-3.2k  is a dataset containing
supermarket products, which can be used in ﬁne-grained
recognition. This dataset includes 8,350 training images
collected from the web, belonging to 80 broad product
categories. Training images are taken in ideal conditions
with a white background, and most of them only contain one
single instance in each image. In contrast, the testing set consists of 680 images captured in 5 real-life retail stores using a mobile phone, with ground truth annotations. The
reason why this dataset is named as GroZi-3.2k is that all the
products in the test images are from the 27 training classes of
the “food” category under which 3,235 training images are
included. Examples of training and testing images are shown
in Figure 12.
5.1.3. Freiburg Grocery Dataset. The Freiburg Grocery dataset  consists of 4,947 images of 25 grocery classes. The training images were taken at stores, apartments,
and oﬃces in Germany using four diﬀerent phone cameras.
Each training image has been downscaled to a size of
256 ∗256 pixels, containing one or several instances of one
category. Furthermore, an additional set includes 74 images
collected in 37 cluttered scenes that can be used as a testing
set. Each testing image is recorded by a Kinect v2 
camera at 1920 ∗1080 pixels RGB, containing several
products belonging to multiple classes. Figure 13 shows some examples of training and testing images from this dataset.
5.1.4. Cigarette Dataset. The Cigarette dataset  comes
with product images and shelf images from 40 retail stores,
captured by four different types of cameras. The training set
consists of 3,600 product images belonging to 10 cigarette
classes. Each image in this set includes only one instance.
The testing set consists of 354 shelf images, which have
approximately 13,000 products in total. Each product in the
shelf image has been annotated with bounding boxes and
cigarette categories using the Image Clipper utility. Figure 14 shows the brand classes and an example shelf image.
5.1.5. Grocery Store Dataset. The Grocery Store dataset was developed to address natural image classification for assisting people who are visually impaired. This dataset consists of iconic images and natural images. The iconic images are downloaded from a grocery store website with
Figure 11: The prototype network of one-shot learning.
Computational Intelligence and Neuroscience 13
product information such as country of origin, weight, and nutrient values. In contrast, the natural images were collected from 18 different grocery stores, recorded by a 16-megapixel phone camera at different distances and angles. This set, containing 5,125 images from 81 fine-grained classes, has been split randomly into one training set and one test set to reduce data bias. The training and test sets contain 2,640 and 2,485 images, respectively, and each image contains one or several instances of one product class. Figure 15 illustrates examples of iconic and natural images.
5.1.6. GP181 Dataset. The GP181 dataset is a subset of the GroZi-3.2k dataset, with 183 and 73 images in the training and testing sets, respectively. Each training image includes a single instance of one product category. Images from the test set have been annotated with item-specific bounding boxes. This dataset can be found at http://vision.disi.unibo.it/index.php?
Here, we present a comparison of product recognition performance on GroZi-120, GroZi-3.2k, and its subset in Table 6. All the methods in the listed publications are based on deep learning. Performance is reported using recall, precision, and accuracy. Precision measures the percentage of correct predictions over the total number of predictions, while recall measures the percentage of correctly detected products over the total number of labelled products in the image. Their mathematical definitions are given below.
Table 5: Detailed information of several public datasets.

| Scenario | Dataset | #product categories | Training set (instances per image / #images) | Test set (instances per image / #images) |
|---|---|---|---|---|
| Shelf | GroZi-120 (http://grozi.calit2.net/grozi.html) | 120 | Multiple / 676 | Multiple / 4,973 |
| Shelf | GroZi-3.2k (https://sites.google.com/view/mariangeorge/datasets) | 27/80 | Single / 8,350 | Multiple / 3,235 |
| Shelf | Freiburg Grocery (https://github.com/PhilJd/freiburg_groceries_dataset) | 25 | Multiple (one class) / 4,947 | Multiple / 74 |
| Shelf | Cigarette (https://github.com/gulvarol/grocerydataset) | 10 | Single / 3,600 | Multiple / 354 |
| Shelf | Grocery Store (https://github.com/marcusklasson/GroceryStoreDataset) | 81 | Multiple (one class) / 2,640 | Multiple (one class) / 2,485 |
| Checkout | D2S (https://www.mvtec.com/company/research/datasets/mvtec-d2s/) | 60 | Single / 4,380 | Multiple / 16,620 |
| Checkout | RPC (https://rpc-dataset.github.io/) | 200/17 | Single / 53,739 | Multiple / 30,000 |
Figure 12: GroZi-3.2k: samples of training images (a) and testing images (b).
Figure 13: Freiburg Grocery: samples of training images (a) and testing images (b).
precision = TP/(TP + FP), recall = TP/(TP + FN), and accuracy = (TP + TN)/(TP + FN + FP + TN), where TP, TN, FP, and FN refer to true positive, true negative, false positive, and false negative, respectively.
5.2. Checkout Datasets. As mentioned in the Introduction, recognising products for self-checkout systems is also a complex task that needs to be solved, and solving it will benefit both retailers and customers. Since this is an emerging research area, the problem has not been extensively studied. There are two public datasets available for the checkout scenario.
5.2.1. D2S Dataset. The D2S dataset is the first benchmark to provide pixelwise annotations at the instance level, aiming to cover real-world applications such as automatic checkout, inventory, and warehouse systems. It contains a total of 21,000 high-resolution images of groceries and daily products, such as fruits, vegetables, cereal packets, pasta, and bottles, from 60 categories. The images are taken in 700 different scenes under three different lightings and three additional backgrounds. The training set includes 4,380 images captured from different views, and each image involves one product of a single class. There are 3,600 and 13,020 images in the validation and test sets, respectively. Furthermore, 10,000 images in the validation and test sets are artificially synthesized and contain one to fifteen objects randomly picked from the training set. Samples of training and test images are shown in Figure 16.
In that work, the authors evaluated the performance of several state-of-the-art deep learning-based methods on the D2S dataset, including Mask R-CNN, FCIS, Faster R-CNN, and RetinaNet. The results are summarized in Table 7. The evaluation metric is mean average precision (mAP). Specifically, mAP50 and
Figure 14: Cigarette dataset: samples of training images (a) and testing images (b).
Figure 15: Grocery Store dataset: samples of iconic images (a) and natural images (b).
Table 6: Recognition performance comparison of approaches based on deep learning on benchmark datasets.

| Publication | GroZi-120 precision (%) | GroZi-120 recall (%) | #categories | GroZi-3.2k precision (%) | GroZi-3.2k recall (%) | #categories |
|---|---|---|---|---|---|---|
| — | 45.20 | 52.70 | 120 | 73.10 | 73.60 | 20 |
| — | — | — | — | 73.50 | 82.68 | 181 |
| — | — | — | — | 90.47 | 90.26 | 181 |
| — | — | — | — | 85.30 (accuracy) | — | 181 |
| — | 49.05 | 29.37 | 120 | 65.83 | 45.52 | 857 |
| — | — | — | — | 92.19 | 87.89 | 181 |
| — | 49.80 | — | 120 | 52.16 | — | 27 |
mAP75 are calculated at intersection-over-union (IoU) thresholds of 0.50 and 0.75, respectively, over all product classes.
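To make the thresholding concrete, below is a minimal sketch of IoU between two axis-aligned boxes in (x1, y1, x2, y2) form; under mAP50, a detection counts as correct only when its IoU with a ground-truth box is at least 0.50. The function name is ours, not from the D2S paper:

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A half-overlapping pair: intersection 50, union 150, so IoU = 1/3,
# which would fail both the 0.50 and 0.75 thresholds.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```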
5.2.2. RPC Dataset. The RPC dataset was developed to support research on product recognition in real-world checkout scenarios. It consists of 83,739 images in total, including 53,739 single-product exemplar images for training and 30,000 checkout images for validation and testing. It has a hierarchical structure of 200 fine-grained product categories, which can be coarsely grouped into 17 metaclasses. Each training image is captured under controlled conditions with four cameras from different views. The checkout images are recorded at three clutter levels using a top-mounted camera and annotated with a bounding box and object category for each product. Figure 17 shows examples of training images and checkout images from the RPC dataset.
In [7], the feature pyramid network (FPN) is adopted as the detector for recognising items on the RPC dataset, and reasonable results were achieved. In addition, the authors of [7] also proposed an essential metric for the automatic checkout task, checkout accuracy (cAcc). First, CD_{i,k} is defined as the counting error for a particular category in a checkout image:

CD_{i,k} = |P_{i,k} − GT_{i,k}|,

where P_{i,k} and GT_{i,k} denote the predicted count and the ground-truth item number of the k-th class in the i-th image, respectively. Then, the total error over all K product classes in the i-th image is defined as

CD_i = Σ_{k=1}^{K} CD_{i,k}.

Given N images from the RPC dataset, cAcc measures the mean rate of completely correct predictions. Its mathematical definition is

cAcc = (1/N) Σ_{i=1}^{N} δ(CD_i),

where δ(CD_i) = 1 if CD_i = 0; otherwise, it equals 0. The value of cAcc ranges from 0 to 1.
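A small sketch of this metric in plain Python (our own illustration, not the authors' code): an image counts as correct only if every per-class count is predicted exactly.

```python
def checkout_accuracy(predicted_counts, true_counts):
    """predicted_counts and true_counts are lists of per-image dicts
    mapping class id -> item count. cAcc is the fraction of images
    whose counts are exactly right for every class."""
    correct = 0
    for pred, truth in zip(predicted_counts, true_counts):
        classes = set(pred) | set(truth)
        # CD_i: summed absolute counting error over all classes
        cd_i = sum(abs(pred.get(k, 0) - truth.get(k, 0)) for k in classes)
        correct += (cd_i == 0)
    return correct / len(true_counts)

# Image 1 is counted perfectly; image 2 misses one item of class 7
pred = [{3: 2, 7: 1}, {3: 1, 7: 0}]
truth = [{3: 2, 7: 1}, {3: 1, 7: 1}]
print(checkout_accuracy(pred, truth))  # 0.5
```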
Afterwards, based on the work of Wei et al. [7], the data priming network (DPNet) was developed in [78] to select reliable samples and promote the training process. Consequently, the performance of product recognition was significantly boosted with DPNet. The comparative results of [7, 78] are listed in Table 8, where mmAP is the mean value over all 10 IoU thresholds (i.e., ranging from 0.50 to 0.95 with a uniform step size of 0.05) over all product classes.
6. Research Directions and Conclusion
To the best of our knowledge, this paper is the ﬁrst com-
prehensive literature review on deep learning approaches for
retail product recognition. Based on the thorough investi-
gation into the research of retail product recognition with
deep learning, this section outlines several promising re-
search directions for the future. Finally, we present a con-
clusion for the whole article.
6.1. Research Directions
6.1.1. Generating Product Images with Deep Neural Networks. As noted in the introduction of dataset resources, the largest publicly available dataset contains only 200 product categories. Nevertheless, the number of different items to be recognized in a medium-sized grocery store can be several thousand, far exceeding the category counts of existing datasets. Considering that the appearances of existing products frequently change over time, it is impractical to build a hand-made dataset that includes the majority of daily products. Some works [7, 71, 78] have demonstrated the advantages of generative adversarial networks (GANs) for generating realistic-looking images. Moreover, significant work in  has bridged CNNs and GANs by proposing deep convolutional generative adversarial networks (DCGANs) that can create
Figure 16: D2S dataset: samples of training images (a) and testing images (b).
Table 7: Product detection benchmark results on the test set of the D2S dataset.

| Approach | mAP (%) | mAP50 (%) | mAP75 (%) |
|---|---|---|---|
| Mask R-CNN | 78.3 | 89.8 | 84.9 |
| FCIS | 68.3 | 88.5 | 80.9 |
| Faster R-CNN | 78.0 | 90.3 | 84.8 |
| RetinaNet | 80.1 | 89.6 | 84.5 |
high-quality generated images. In this case, it is feasible to generate images with deep neural networks to enlarge the training dataset for retail product recognition. Thus, developing image generators with deep neural networks to simulate real-world scenes is a promising direction for future research.
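Before reaching for a GAN, the simplest form of synthesis, used for the artificially composed checkout images in D2S and RPC, is to paste single-product exemplar crops onto a background canvas, which also yields free bounding-box labels. Below is a simplified sketch of that compositing idea (all names are ours; real pipelines additionally blend edges and randomize pose and lighting):

```python
import numpy as np

rng = np.random.default_rng(0)

def composite(background, crops, positions):
    """Paste product crops (H, W, 3 uint8 arrays) onto a copy of the
    background at the given (row, col) top-left positions, returning
    the synthesized image and (x1, y1, x2, y2) annotations."""
    canvas = background.copy()
    boxes = []
    for crop, (r, c) in zip(crops, positions):
        h, w = crop.shape[:2]
        canvas[r:r + h, c:c + w] = crop
        boxes.append((c, r, c + w, r + h))  # free bounding-box label
    return canvas, boxes

background = np.zeros((100, 100, 3), dtype=np.uint8)
crop = rng.integers(0, 255, size=(20, 30, 3), dtype=np.uint8)
image, boxes = composite(background, [crop], [(10, 40)])
print(boxes)  # [(40, 10, 70, 30)]
```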
6.1.2. Graph Neural Networks with Deep Learning for Planogram Compliance Check. Graph neural networks (GNNs)  are a powerful tool for non-Euclidean data and can represent the relationships between objects [161, 162]. Currently, GNNs have achieved great success in recommendation systems [163, 164], molecule identification , and paper citation analysis . For an image that contains multiple objects, each object can be considered a node, and GNNs have the ability to learn the spatial relationship between every pair of nodes. With regard to supermarket scenarios, products are generally placed on shelves according to certain arrangement rules. In this case, GNNs can be combined with deep learning to learn the positional relationships between different products and thereby assist in identifying missing or misplaced items for planogram compliance. In , the authors applied GNNs to consistency checks and achieved remarkable results. Specifically, there are two relationship representations: the "observed planogram," generated by the GNN, and the "reference planogram," the true representation. By comparing the observed and reference planograms, they obtain a consistency-check result that helps to correct false and missing detections.
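As a hedged illustration of the first step such a system needs, the sketch below builds a node/edge structure from detected shelf items: each detection becomes a node, and edges link horizontally adjacent products on the same shelf row. This is our own simplification; real planogram work uses richer spatial relations:

```python
def shelf_graph(detections, row_tolerance=10):
    """detections: list of (label, x_center, y_center).
    Returns edges (i, j) connecting each product to its right-hand
    neighbour on the same shelf row. A GNN would then propagate
    features over these edges to learn arrangement rules."""
    order = sorted(range(len(detections)),
                   key=lambda i: (detections[i][2] // row_tolerance,
                                  detections[i][1]))
    edges = []
    for a, b in zip(order, order[1:]):
        # Only connect detections whose vertical centers roughly match
        if abs(detections[a][2] - detections[b][2]) < row_tolerance:
            edges.append((a, b))
    return edges

# Two shelf rows: two cereals side by side on top, pasta alone below
dets = [("cereal", 10, 20), ("cereal", 50, 22), ("pasta", 10, 80)]
print(shelf_graph(dets))  # [(0, 1)]
```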
6.1.3. Cross-Domain Retail Product Recognition with Transfer Learning. In object detection algorithms, a significant assumption is that the training and test data are drawn from the same feature space and the same distribution ; i.e., most object detectors require retraining from random initialization with new data when the distribution changes. In the real world, many different retail stores and supermarkets sell diversified products, and the internal environments of different shops can vary considerably. A model trained on data from one specific shop cannot be directly applied to a newly built store, which gives rise to the concept of cross-domain recognition. Cross-domain recognition is usually based on transfer learning , which assists the target domain in learning by using knowledge transferred from other domains. Transfer learning is capable of solving new problems easily by applying previously obtained knowledge. For a new task, researchers normally use a pretrained detector either as an initialization or as a fixed feature extractor and then fine-tune the weights of some layers in the network to realize cross-domain detection. Ordinarily, the majority of approaches in established papers employ models pretrained on ImageNet to implement product recognition [20, 74]. However, how to make a model adaptable to various shops still needs attention.
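The "fixed feature extractor" recipe above can be sketched framework-agnostically: freeze the backbone, treat its outputs as plain vectors, and train only a small classifier head on data from the new store. In this sketch the backbone is a stand-in random projection, not a real pretrained network, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained backbone whose weights stay frozen;
# in practice this would be an ImageNet-pretrained CNN minus its head.
W_FROZEN = rng.standard_normal((8, 16))

def extract_features(images):
    return images @ W_FROZEN  # no gradient ever flows back here

def train_head(features, labels, classes, lr=0.1, steps=500):
    # Softmax-regression head trained from scratch on the new store's data
    w = np.zeros((features.shape[1], classes))
    onehot = np.eye(classes)[labels]
    for _ in range(steps):
        logits = features @ w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * features.T @ (p - onehot) / len(labels)
    return w

# Tiny synthetic "new store": 40 images, 2 product classes
x = rng.standard_normal((40, 8))
feats = extract_features(x)
y = (feats[:, 0] > feats[:, 1]).astype(int)  # toy ground truth
head = train_head(feats, y, classes=2)
acc = ((feats @ head).argmax(axis=1) == y).mean()
print(acc)  # the head adapts to the new domain without touching the backbone
```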
6.1.4. Joint Feature Learning from Text Information on Product Packaging. Intraclass product classification is a challenge because the products are visually similar. Faced with many intraclass items, we humans often distinguish similar products by reading the text on their packaging. Thus, the text information on product packaging can be considered a factor for classifying fine-grained products. Currently, joint feature learning (JFL) methods have shown their effectiveness in improving face recognition performance by stacking features extracted from different face regions . For this reason, the idea of JFL could be introduced to the field of retail product recognition, i.e., learning product image features and package text features jointly to enhance recognition performance. In , researchers attempted to automatically recognize the text on product packaging; unfortunately, the extracted text was used only to let users search for products.
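One simple late-fusion reading of this idea (our own sketch, not a method from the cited works): L2-normalize the image embedding and a text embedding of the packaging, then concatenate them so a downstream classifier can use both cues.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def joint_feature(image_embedding, text_embedding):
    """Concatenate normalized modality embeddings into one vector.
    Normalizing first keeps one modality from dominating the other;
    a learned projection could replace plain concatenation."""
    return np.concatenate([l2_normalize(image_embedding),
                           l2_normalize(text_embedding)])

img = np.array([3.0, 4.0])        # e.g. CNN features of the package photo
txt = np.array([0.0, 0.0, 2.0])   # e.g. embedding of OCR'd package text
fused = joint_feature(img, txt)
print(fused)  # [0.6 0.8 0.  0.  1. ]
```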
6.1.5. Incremental Learning with the CNN for Flexible Product Recognition. Deep learning methods often suffer from "catastrophic forgetting," especially for convolutional
Table 8: Comparative results on the RPC dataset.

| Publication | cAcc (%) | mAP50 (%) | mmAP (%) |
|---|---|---|---|
| [7] | 56.68 | 96.57 | 73.83 |
| [78] | 80.51 | 97.91 | 77.04 |
Figure 17: RPC dataset: samples of training images (a) and checkout images (b).
neural networks, i.e., they are incapable of recognising some previously learned objects when adjusted to a new task . Incremental learning is a powerful method that can deal with new data without retraining the whole model. Additionally, it enables deep neural networks to have a long-term memory. Shmelkov et al. and Guan et al. [65, 170] implemented incremental learning of object detection using two detection networks: one is an existing network that has already been trained, and the other is trained to detect new classes. In , the authors attempted to combine incremental learning with CNNs and compared various incremental training approaches for CNN-based architectures. Therefore, incremental learning will be helpful for making recognition systems flexible, requiring no or minimal retraining whenever a fresh item is launched.
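The two-network idea can be sketched as a distillation loss: the frozen old detector's outputs on the old classes become soft targets for the new network, so learning new classes does not erase old ones. This is a schematic sketch, not the exact loss of [65, 170]:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def incremental_loss(new_logits, old_logits, label, n_old, alpha=0.5):
    """new_logits covers old + new classes; old_logits come from the
    frozen, previously trained network (old classes only). The loss
    mixes normal cross-entropy on the true label with a distillation
    term that keeps old-class predictions close to the old model."""
    p = softmax(new_logits)
    ce = -np.log(p[label])                   # learn the new task
    old_p = softmax(old_logits)              # frozen net's soft targets
    new_p_old = softmax(new_logits[:n_old])  # new net on the old classes
    distill = -(old_p * np.log(new_p_old)).sum()
    return alpha * ce + (1 - alpha) * distill

old_logits = np.array([2.0, 0.5])       # frozen net, 2 old classes
new_logits = np.array([1.9, 0.4, 0.1])  # new net, 1 extra class
print(incremental_loss(new_logits, old_logits, label=2, n_old=2))
```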
6.1.6. The Regression-Based Object Detection Methods for Retail Product Recognition. Applying product recognition in industry requires real-time availability: consumers want to check out immediately, and retailers expect real-time feedback when something is missing from the shelves. Deep learning, however, is computationally expensive, and many deep learning algorithms require GPUs for image processing. As mentioned in Section 2, there are two categories of object detection methods: region proposal-based and regression-based . The regression-based methods reduce time expense by regressing objects' locations and categories directly from image pixels . Ordinarily, regression-based methods perform better on real-time detection tasks than methods based on region proposals. However, although the work in  detects general objects at high speed, it suffers from reduced accuracy. Therefore, how to improve detection accuracy with regression-based approaches for retail product recognition deserves further research.
6.2. Conclusion. This paper addresses the broad area of product recognition technologies. Product recognition will become increasingly important in a world where cost margins are becoming tighter and customers face increasing pressures on their available time. By summarising the literature in the field, we make research in this area more accessible to new researchers, allowing the field to progress. It is very important that this field addresses four challenging problems: (1) large-scale classification; (2) data limitations; (3) intraclass variation; and (4) flexibility. We have identified several areas for further research: (1) generating data with deep neural networks; (2) graph neural networks with deep learning; (3) cross-domain recognition with transfer learning; (4) joint feature learning from text information on packaging; (5) incremental learning with the CNN; and (6) regression-based object detection methods for retail product recognition.
In this article, we have presented an extensive review of recent research on deep learning-based retail product recognition, with more than one hundred references. We identify four challenging problems and discuss techniques that address them. We have also briefly described the publicly available datasets and listed their detailed information.
Overall, this paper provides a clear overview of the current research status in this field, and we hope it encourages new researchers to join and carry out extensive research in this area.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors' Contributions
Y. W., S. T. and S. X. contributed to conceptualization. Y. W.
contributed to writing and original draft preparation. S. T.,
S. X., B. K., and M. S. contributed to writing, reviewing, and
editing. S. X. and B. K. supervised the study. B. K. was
responsible for funding acquisition. All authors read and
agreed to the published version of the manuscript.
Acknowledgments
The first author Y. W. was sponsored by the China Scholarship Council (CSC).
References
 T. Sriram, K. V. Rao, S. Biswas, and B. Ahmed, “Applications
of barcode technology in automated storage and retrieval
systems,” in Proceedings of the 1996 22nd International
Conference on Industrial Electronics, Control, and Instru-
mentation, vol. 1, pp. 641–646, Taipei, Taiwan, 1996.
 H. Poll, “Digimarc survey: 88 percent of U.S. adults want their
retail checkout experience to be faster,” 2015, https://www.
 R. Want, “An introduction to RFID technology,” IEEE
Pervasive Computing, vol. 5, no. 1, pp. 25–33, 2006.
 B. Santra and D. P. Mukherjee, “A comprehensive survey on
computer vision based approaches for automatic identiﬁ-
cation of products in retail store,” Image and Vision Com-
puting, vol. 86, pp. 45–63, 2019.
 D. Grewal, A. L. Roggeveen, and J. Nordfält, “The future of retailing,” Journal of Retailing, vol. 93, no. 1, pp. 1–6, 2017.
 Hampshire, “AI spending by retailers to reach $12 billion
by 2023, driven by the promise of improved margins,”
April 2019, https://www.juniperresearch.com/press/
 X. S. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu, “RPC: a
large-scale retail product checkout dataset,” 2019, https://
 M. Shapiro, Executing the Best Planogram, Vol. 1, Profes-
sional Candy Buyer, Norwalk, CT, USA, 2009.
 F. D. Orel and A. Kara, “Supermarket self-checkout service
quality, customer satisfaction, and loyalty: empirical evi-
dence from an emerging market,” Journal of Retailing and
Consumer Services, vol. 21, pp. 118–129, 2014.
 B. F. Wu, W. J. Tseng, Y. S. Chen, S. J. Yao, and P. J. Chang,
“An intelligent self-checkout system for smart retail,” in
Proceedings of the 2016 International Conference on System
Science and Engineering (ICSSE), pp. 1–4, Puli, Taiwan, 2016.
 A. C. R. Van Riel, J. Semeijn, D. Ribbink, and Y. Bomert-
Peters, “Waiting for service at the checkout: negative
emotional responses, store image and overall satisfaction,”
Journal of Service Management, vol. 23, no. 2, pp. 144–169,
 F. Morimura and K. Nishioka, “Waiting in exit-stage op-
erations: expectation for self-checkout systems and overall
satisfaction,” Journal of Marketing Channels, vol. 23, no. 4,
pp. 241–254, 2016.
 M. George and C. Floerkemeier, “Recognizing products: a
per-exemplar multi-label image classiﬁcation approach,” in
Proceedings of the 2014 European Conference on Computer
Vision, pp. 440–455, Zurich, Switzerland, 2014.
 D. López-de-Ipiña, T. Lorido, and U. López, “Indoor navigation and product recognition for blind people assisted shopping,” in Proceedings of the 2011 International Workshop on Ambient Assisted Living, pp. 33–40, Torremolinos, Spain, 2011.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classiﬁcation with deep convolutional neural networks,”
Advances in Neural Information Processing Systems,
pp. 1097–1105, Springer, Berlin, Germany, 2012.
 C. Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD:
deconvolutional single shot detector,” 2017, https://arxiv.
 J. Redmon and A. Farhadi, “Yolov3: an incremental im-
provement,” 2018, https://arxiv.org/abs/1804.02767.
 R. Girshick, “Fast R-CNN,” in Proceedings of the 2015 IEEE
International Conference on Computer Vision, pp. 1440–
1448, Santiago, Chile, 2015.
 Q. Zhao, T. Sheng, Y. Wang et al., “M2Det: a single-shot
object detector based on multi-level feature pyramid net-
work,” 2018, https://arxiv.org/abs/1811.04533.
 A. Tonioni, E. Serro, and L. Di Stefano, “A deep learning
pipeline for product recognition in store shelves,” 2018,
 L. Karlinsky, J. Shtok, Y. Tzur, and A. Tzadok, “Fine-grained
recognition of thousands of object categories with single-
example training,” in Proceedings of the 2017 IEEE Confer-
ence on Computer Vision and Pattern Recognition,
pp. 4113–4122, Honolulu, HI, USA, 2017.
 S. Qiao, W. Shen, W. Qiu, C. Liu, and A. Yuille, “Scalenet:
guiding object proposal generation in supermarkets and
beyond,” in Proceedings of the 2017 IEEE International
Conference on Computer Vision, pp. 1791–1800, Venice,
 C. G. Melek, E. B. Sonmez, and S. Albayrak, “A survey of
product recognition in shelf images,” in Proceedings of the
2017 International Conference on Computer Science and
Engineering (UBMK), pp. 145–150, Antalya, Turkey, 2017.
 D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the 1999 7th International Conference on Computer Vision, Kerkyra, Greece, 1999.
 D. G. Lowe, “Distinctive image features from scale-invariant
keypoints,” International Journal of Computer Vision, vol. 60,
no. 2, pp. 91–110, 2004.
 H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: speeded up
robust features,” in Proceedings of the 2006 European Con-
ference on Computer Vision, pp. 404–417, Graz, Austria,
 H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up
robust features (SURF),” Computer Vision and Image Un-
derstanding, vol. 110, no. 3, pp. 346–359, 2008.
 R. Moorthy, S. Behera, S. Verma, S. Bhargave, and
P. Ramanathan, “Applying image processing for detecting
on-shelf availability and product positioning in retail stores,”
in Proceedings of the 3rd International Symposium on Women
in Computing and Informatics, Kochi, India, 2015.
 S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based
recommender system: a survey and new perspectives,” ACM
Computing Surveys (CSUR), vol. 52, p. 5, 2019.
 G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning
algorithm for deep belief nets,” Neural Computation, vol. 18,
no. 7, pp. 1527–1554, 2006.
 G. E. Hinton and R. R. Salakhutdinov, “Reducing the di-
mensionality of data with neural networks,” Science, vol. 313,
no. 5786, pp. 504–507, 2006.
 Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle,
“Greedy layer-wise training of deep networks,” Advances in
Neural Information Processing Systems, pp. 153–160, MIT
Press, Cambridge, MA, USA, 2007.
 V. Nair and G. E. Hinton, “Rectiﬁed linear units improve
restricted Boltzmann machines,” in Proceedings of the 27th
International Conference on Machine Learning (ICML-10),
Haifa, Israel, 2010.
 G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and
R. R. Salakhutdinov, “Improving neural networks by pre-
venting co-adaptation of feature detectors,” 2012, https://
 Y. Jia, E. Shelhamer, J. Donahue et al., “CAFFE: convolu-
tional architecture for fast feature embedding,” in Proceed-
ings of the 22nd ACM International Conference on
Multimedia, Glasgow, UK, 2014.
 M. Abadi, A. Agarwal, P. Barham et al., “TensorFlow: large-
scale machine learning on heterogeneous distributed sys-
tems,” 2016, https://arxiv.org/abs/1603.04467.
 T. Chen, M. Li, Y. Li et al., “MXNet: a ﬂexible and eﬃcient
machine learning library for heterogeneous distributed
systems,” 2015, https://arxiv.org/abs/1512.01274.
 A. Paszke, S. Gross, S. Chintala, and G. Chanan, “Pytorch:
tensors and dynamic neural networks in python with strong
GPU acceleration,” 2017, https://pytorch.org/.
 D. H. Hubel and T. N. Wiesel, “Receptive ﬁelds, binocular
interaction and functional architecture in the cat’s visual
cortex,” The Journal of Physiology, vol. 160, no. 1,
pp. 106–154, 1962.
 Y. LeCun, L. Bottou, Y. Bengio, P. Haﬀner, and others,
“Gradient-based learning applied to document recognition,”
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with con-
volutions,” in Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, Boston, MA, USA,
 K. Simonyan and A. Zisserman, “Very deep convolutional
networks for large-scale image recognition,” 2014, https://
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition, Las
Vegas, NV, USA, 2016.
 H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller,
“Multi-view convolutional neural networks for 3D shape
recognition,” in Proceedings of the 2015 IEEE
International Conference on Computer Vision, Santiago,
 Z. Gao, Y. Li, and S. Wan, “Exploring deep learning for view-
based 3D model retrieval,” ACM Transactions on Multimedia
Computing, Communications, and Applications, vol. 16, no. 1,
pp. 1–21, 2020.
 P. Viola and M. J. Jones, “Robust real-time face detection,”
International Journal of Computer Vision, vol. 57, no. 2,
pp. 137–154, 2004.
 Z. Q. Zhao, P. Zheng, S. t. Xu, and X. Wu, “Object detection
with deep learning: a review,” IEEE Transactions on Neural
Networks and Learning Systems, vol. 30, no. 11, pp. 3212–
 R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich
feature hierarchies for accurate object detection and se-
mantic segmentation,” in Proceedings of the 2014 IEEE
Conference on Computer Vision and Pattern Recognition,
Columbus, OH, USA, 2014.
 J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and
A. W. M. Smeulders, “Selective search for object recogni-
tion,” International Journal of Computer Vision, vol. 104,
no. 2, pp. 154–171, 2013.
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: to-
wards real-time object detection with region proposal net-
works,” Advances in Neural Information Processing Systems,
pp. 91–99, MIT Press, Cambridge, MA, USA, 2015.
 J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
look once: uniﬁed, real-time object detection,” in Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern
Recognition, Las Vegas, NV, USA, 2016.
 W. Liu, D. Anguelov, D. Erhan et al., “SSD: single shot
multibox detector,” in Proceedings of the 2016 European
Conference on Computer Vision, pp. 21–37, Amsterdam,
 E. Goldman and J. Goldberger, “Large-scale classiﬁcation of
structured objects using a CRF with deep class embedding,”
 J. Redmon and A. Farhadi, “YOLO9000: better, faster,
stronger,” in Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI,
 K. He, G. Gkioxari, P. Doll´ar, and R. Girshick, “Mask
R-CNN,” in Proceedings of the 2017 IEEE International
Conference on Computer Vision, Venice, Italy, 2017.
 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and
A. Zisserman, “The pascal visual object classes (VOC)
challenge,” International Journal of Computer Vision, vol. 88,
no. 2, pp. 303–338, 2010.
 T.-Y. Lin, M. Maire, S. Belongie et al., “Microsoft coco:
common objects in context,” in Proceedings of the 2014
European Conference on Computer Vision, pp. 740–755,
Zurich, Switzerland, 2014.
 A. Franco, D. Maltoni, and S. Papi, “Grocery product de-
tection and recognition,” Expert Systems with Applications,
vol. 81, pp. 163–176, 2017.
 B. Zhao, J. Feng, X. Wu, and S. Yan, “A survey on deep
learning-based ﬁne-grained object classiﬁcation and se-
mantic segmentation,” International Journal of Automation
and Computing, vol. 14, no. 2, pp. 119–135, 2017.
 C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie,
“e caltech-UCSD birds-200-2011 dataset,” Technical re-
port CNS-TR-2011-001, California Institute of Technology,
Pasadena, CA, USA, 2011.
 A. Khosla, N. Jayadevaprakash, B. Yao, and F. F. Li, “Novel
dataset for ﬁne-grained image categorization: Stanford
dogs,” in Proceedings of the 2011 CVPR Workshop on Fine-
Grained Visual Categorization (FGVC), Colorado Springs,
CO, USA, June, 2011.
 M. E. Nilsback and A. Zisserman, “Automated ﬂower
classiﬁcation over a large number of classes,” in Proceedings
of the 2008 6th Indian Conference on Computer Vision,
Graphics & Image Processing, IEEE, Bhubaneswar, India,
 J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3D object
representations for ﬁne-grained categorization,” in Pro-
ceedings of the 2013 IEEE International Conference on
Computer Vision Workshops, Sydney, Australia, 2013.
 I. Baz, E. Yoruk, and M. Cetin, “Context-aware hybrid
classiﬁcation system for ﬁne-grained retail product recog-
nition,” in Proceedings of the 2016 IEEE 12th Image, Video,
and Multidimensional Signal Processing Workshop (IVMSP),
Bordeaux, France, 2016.
 K. Shmelkov, C. Schmid, and K. Alahari, “Incremental
learning of object detectors without catastrophic forgetting,”
in Proceedings of the 2017 IEEE International Conference on
Computer Vision, Venice, Italy, 2017.
 L. Zheng, Y. Yang, and Q. Tian, “SIFT meets CNN: a decade
survey of instance retrieval,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 40, pp. 1224–1244,
 D. Farren, Classifying Food Items by Image Using Convolu-
tional Neural Networks, Stanford University, Stanford, CA,
 L. Liu, B. Zhou, Z. Zou, S. C. Yeh, and L. Zheng, “A smart
unstaﬀed retail shop based on artiﬁcial intelligence and IoT,”
in Proceedings of the 2018 IEEE 23rd International Workshop
on Computer Aided Modeling and Design of Communication
Links and Networks (CAMAD), Barcelona, Spain, 2018.
 L. Li, T.-T. Goh, and D. Jin, “How textual quality of online
reviews aﬀect classiﬁcation performance: a case of deep
learning sentiment analysis,” Neural Computing and Ap-
plications, vol. 32, no. 9, pp. 4387–4415, 2020.
 C. Szegedy, S. Ioﬀe, V. Vanhoucke, and A. A. Alemi, “In-
ception-v4, inception-resnet and the impact of residual
connections on learning,” in Proceedings of the 31st AAAI
Conference on Artiﬁcial Intelligence, San Francisco, CA, USA,
 A. Tonioni and L. Di Stefano, “Domain invariant hierarchical
embedding for grocery products recognition,” Computer
Vision and Image Understanding, vol. 182, 2019.
 T. Chong, I. Bustan, and M. Wee, Deep Learning Approach to
Planogram Compliance in Retail Stores, Stanford University,
Stanford, CA, USA, 2016.
 J. Li, X. Wang, and H. Su, “Supermarket commodity iden-
tiﬁcation using convolutional neural networks,” in Pro-
ceedings of the 2016 2nd International Conference on Cloud
Computing and Internet of ings (CCIOT), Dalian, China,
 W. Geng, F. Han, J. Lin et al., “Fine-grained grocery product
recognition by one-shot learning,” in Proceedings of the 2018
ACM Multimedia Conference on Multimedia Conference,
Seoul, Republic of Korea, 2018.
 A. De Biasio, “Retail shelf analytics through image processing and deep learning,” Master’s thesis, University of Padua, Padua, Italy.
 S. Varadarajan and M. M. Srivastava, “Weakly supervised
object localization on grocery shelves using simple FCN and
synthetic dataset,” 2018, https://arxiv.org/abs/1803.06813.
 P. Jund, N. Abdo, A. Eitel, and W. Burgard, “The freiburg
groceries dataset,” 2016, https://arxiv.org/abs/1611.05799.
 C. Li, D. Du, L. Zhang et al., “Data priming network for
automatic check-out,” 2019, https://arxiv.org/abs/1904.
 W. Yi, Y. Sun, T. Ding, and S. He, “Detecting retail products
in situ using CNN without human eﬀort labeling,” 2019,
 P. Follmann, T. Bottger, P. Hartinger, R. Konig, and
M. Ulrich, “MVTec D2S: densely segmented supermarket
dataset,” in Proceedings of the 2018 European Conference on
Computer Vision (ECCV), Munich, Germany, 2018.
 C. Szegedy, V. Vanhoucke, S. Ioﬀeme)"[?--]>, J. Shlens, and
Z. Wojna, “Rethinking the inception architecture for com-
puter vision,” in Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV,
 J. Laﬀerty, A. McCallum, and F. C. Pereira, Conditional
Random Fields: Probabilistic Models for Segmenting and
Labeling Sequence Data, University of Pennsylvania,
Philadelphia, PA, USA, 2001.
 G. Tolias, R. Sicre, and H. J´
egou, “Particular object retrieval
with integral max-pooling of CNN activations,” 2015,
 K. Chatﬁeld, K. Simonyan, A. Vedaldi, and A. Zisserman,
“Return of the devil in the details: delving deep into con-
volutional nets,” 2014, https://arxiv.org/abs/1405.3531.
 J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei,
“Imagenet: a large-scale hierarchical image database,” in
Proceedings of the 2009 IEEE Conference on Computer Vision
and Pattern Recognition, IEEE, Miami, FL, USA, 2009.
 J. Redmon, “Darknet: open source neural networks in C,”
 L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of
object categories,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 28, pp. 594–611, 2006.
 D. C. Cires¸an, U. Meier, J. Masci, L. M. Gambardella, and
J. Schmidhuber, “High-performance neural networks for
visual object classiﬁcation,” 2011, https://arxiv.org/abs/1102.
 M. Merler, C. Galleguillos, and S. Belongie, “Recognizing
groceries in situ using in vitro training data,” in Proceedings
of the 2007 IEEE Conference on Computer Vision and Pattern
Recognition, Minneapolis, MN, USA, 2007.
 P. Y. Simard, D. Steinkraus, J. C. Platt, and others, “Best
practices for convolutional neural networks applied to visual
document analysis,” in Proceedings of the 7th International
Conference on Document Analysis and Recognition, Edinburgh,
 D. Cires¸an, U. Meier, and J. Schmidhuber, “Multi-column
deep neural networks for image classiﬁcation,” 2012, https://
 W. Qiu and A. Yuille, “UnrealCV: connecting computer
vision to unreal engine,” in Proceedings of the European
Conference on Computer Vision, Springer, Amsterdam,
 Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional
instance-aware semantic segmentation,” in Proceedings of the
2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Honolulu, HI, USA, 2017.
 A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and
A. A. Kalinin, “Albumentations: fast and ﬂexible image
augmentations,” 2018, https://arxiv.org/abs/1809.06839.
 S. Varadarajan, S. Kant, and M. M. Srivastava, “Benchmark
for generic product detection: a strong baseline for dense
object detection,” 2019, https://arxiv.org/abs/1912.09476.
 D. P. Kingma and M. Welling, “Auto-encoding variational
Bayes,” 2013, https://arxiv.org/abs/1312.6114.
 I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative
adversarial nets,” Advances in Neural Information Processing
Systems, pp. 2672–2680, MIT Press, Cambridge, MA, USA,
 H. Huang, P. S. Yu, and C. Wang, “An introduction to image
synthesis with generative adversarial nets,” 2018, https://
 K. Sohn, H. Lee, and X. Yan, “Learning structured output
representation using deep conditional generative models,”
Advances in Neural Information Processing Systems,
pp. 3483–3491, MIT Press, Cambridge, MA, USA, 2015.
 M. Mirza and S. Osindero, “Conditional generative adver-
sarial nets,” 2014, https://arxiv.org/abs/1411.1784.
 X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2Image:
conditional image generation from visual attributes,” in
Proceedings of the 2016 European Conference on Computer
Vision, pp. 776–791, Amsterdam, Netherlands, 2016.
 A. Radford, L. Metz, and S. Chintala, “Unsupervised rep-
resentation learning with deep convolutional generative
adversarial networks,” 2015, https://arxiv.org/abs/1511.
 L. Cai, H. Gao, and S. Ji, “Multi-stage variational auto-en-
coders for coarse-to-ﬁne image generation,” in Proceedings of
the 2019 SIAM International Conference on Data Mining,
Calgary, Canada, 2019.
 X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever,
and P. Abbeel, “Infogan: interpretable representation
learning by information maximizing generative adversarial
nets,” Advances in Neural Information Processing Systems,
pp. 2172–2180, MIT Press, Cambridge, MA, USA, 2016.
 P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image
translation with conditional adversarial networks,” in Pro-
ceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 2017.
 J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-
to-image translation using cycle-consistent adversarial net-
works,” in Proceedings of the 2017 IEEE International
Conference on Computer Vision, Venice, Italy, 2017.
 Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: unsu-
pervised dual learning for image-to-image translation,” in
Proceedings of the 2017 IEEE International Conference on
Computer Vision, Venice, Italy, 2017.
 T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to
discover cross-domain relations with generative adversarial
networks,” in Proceedings of the 34th International Confer-
ence on Machine Learning, vol. 70, Sydney, Australia, 2017.
 Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo,
“Stargan: uniﬁed generative adversarial networks for multi-
domain image-to-image translation,” in Proceedings of the
2018 IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 2018.
 M. Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-
image translation networks,” Advances in Neural Informa-
tion Processing Systems, pp. 700–708, MIT Press, Cambridge,
MA, USA, 2017.
Computational Intelligence and Neuroscience 21
 G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller,
“Labeled faces in the wild: a database forstudying face rec-
ognition in unconstrained environments,” in Proceedings of
the 2008 Workshop on faces in “Real-Life” Images: Detection,
Alignment, and Recognition, Marseille, France, 2008.
 A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy, “A
generative model for zero shot learning using conditional
variational autoencoders,” in Proceedings of the 2018 IEEE
Conference on Computer Vision and Pattern Recognition
Workshops, Salt Lake City, UT, USA, 2018.
 C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-
based classiﬁcation for zero-shot visual object categoriza-
tion,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 36, pp. 453–465, 2013.
 G. Patterson and J. Hays, “Sun attribute database: discov-
ering, annotating, and recognizing scene attributes,” in
Proceedings of the 2012 IEEE Conference on Computer Vision
and Pattern Recognition, IEEE, Providence, RI, USA, 2012.
 S. W. Huang, C. T. Lin, S. P. Chen, Y. Y. Wu, P. H. Hsu, and
S. H. Lai, “AugGAN: cross domain adaptation with GAN-
based data augmentation,” in Proceedings of the 2018 Eu-
ropean Conference on Computer Vision (ECCV), Munich,
 Z. Qiu, Y. Pan, T. Yao, and T. Mei, “Deep semantic hashing
with generative adversarial networks,” in Proceedings of the
40th International ACM SIGIR Conference on Research and
Development in Information Retrieval, ACM, Tokyo, Japan,
 A. Krizhevsky, V. Nair, and G. Hinton, “e CIFAR-10
dataset,” 2014, http://www.cs.toronto.edu/kriz/cifar.html.
 T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. T. Zheng,
“NUS-WIDE: a real-world web image database from na-
tional university of Singapore,” in Proceedings of the 2009
ACM Conference on Image and Video Retrieval (CIVR’09),
New York, NY, USA, 2009.
 A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals,
A. Graves, and others, “Conditional image generation with
pixelcnn decoders,” Advances in Neural Information Pro-
cessing Systems, pp. 4790–4798, MIT Press, Cambridge, MA,
 Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples
generated by gan improve the person re-identiﬁcation
baseline in vitro,” in Proceedings of the 2017 IEEE Interna-
tional Conference on Computer Vision, Venice, Italy, 2017.
 X. Liu, J. Wang, S. Wen, E. Ding, and Y. Lin, “Localizing by
describing: attribute-guided attention localization for ﬁne-
grained recognition,” in Proceedings of the 31st AAAI Con-
ference on Artiﬁcial Intelligence, San Francisco, CA, USA,
 X. Wang, Z. Man, M. You, and C. Shen, “Adversarial
generation of training examples: applications to moving
vehicle license plate recognition,” 2017, https://arxiv.org/abs/
 T. Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and
S. Belongie, “Feature pyramid networks for object detection,”
in Proceedings of the 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI, USA,
 B. Yao, A. Khosla, and L. Fei-Fei, “Combining randomiza-
tion and discrimination for ﬁne-grained image categoriza-
tion,” in Proceedings of the CVPR 2011, IEEE, Providence, RI,
 J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowd-
sourcing for ﬁne-grained recognition,” in Proceedings of the
2013 IEEE Conference on Computer Vision and Pattern
Recognition, Portland, OR, USA, 2013.
 T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang,
“e application of two-level attention models in deep
convolutional neural network for ﬁne-grained image clas-
siﬁcation,” in Proceedings of the 2015 IEEE Conference on
Computer Vision and Pattern Recognition, Boston, MA, USA,
 N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-
based R-CNNs for ﬁne-grained category detection,” in
Proceedings of the 2014 European Conference on Computer
Vision, Springer, Zurich, Switzerland, 2014.
 S. Kong and C. Fowlkes, “Low-rank bilinear pooling for ﬁne-
grained classiﬁcation,” in Proceedings of the 2017 IEEE
Conference on Computer Vision and Pattern Recognition,
Honolulu, HI, USA, 2017.
 B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, “Diversiﬁed
visual attention networks for ﬁne-grained object classiﬁca-
tion,” IEEE Transactions on Multimedia, vol. 19, no. 6,
pp. 1245–1256, 2017.
 J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained
recognition without part annotations,” in Proceedings of the
2015 IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 2015.
 S. Reed, Z. Akata, H. Lee, and B. Schiele, “Learning deep
representations of ﬁne-grained visual descriptions,” in
Proceedings of the 2016 IEEE Conference on Computer Vision
and Pattern Recognition, Las Vegas, NV, USA, 2016.
 A. Angelova, S. Zhu, and Y. Lin, “Image segmentation for
large-scale subcategory ﬂower recognition,” in Proceedings of
the 2013 IEEE Workshop on Applications of Computer Vision
(WACV), IEEE, Tampa, FL, USA, 2013.
 Z. Ge, C. McCool, C. Sanderson, and P. Corke, Content Speciﬁc
Feature Learning for Fine-Grained Plant Classiﬁcation,
Queensland University of Technology, Brisbane, Australia,
 L. Yang, P. Luo, C. Loy, and X. Tang, “A large-scale car
dataset for ﬁne-grained categorization and veriﬁcation,” in
Proceedings of the 2015 IEEE Conference on Computer Vision
and Pattern Recognition, Boston, MA, USA, 2015.
 S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi,
“Fine-grained visual classiﬁcation of aircraft,” 2013, https://
 J. Krause, T. Gebru, J. Deng, L. J. Li, and L. Fei-Fei, “Learning
features and parts for ﬁne-grained recognition,” in Pro-
ceedings of the 2014 22nd International Conference on Pattern
Recognition, IEEE, Stockholm, Sweden, 2014.
 S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird
species categorization using pose normalized deep con-
volutional nets,” 2014, https://arxiv.org/abs/1406.2952.
 V. Vapnik, e Nature of Statistical Learning eory,
Springer Science & Business Media, Berlin, Germany, 2013.
 B. Hu, N. Zhou, Q. Zhou, X. Wang, and W. Liu, “DiﬀNet: a
learning to compare deep network for product recognition,”
IEEE Access, vol. 8, pp. 19336–19344, 2020.
 M. Simon and E. Rodner, “Neural activation constellations:
unsupervised part model discovery with convolutional
networks,” in Proceedings of the 2015 IEEE International
Conference on Computer Vision, Santiago, Chile, 2015.
 T. Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN
models for ﬁne-grained visual recognition,” in Proceedings of
the 2015 IEEE International Conference on Computer Vision,
Santiago, Chile, 2015.
22 Computational Intelligence and Neuroscience
 S. Singh, A. Gupta, and A. A. Efros, “Unsupervised discovery
of mid-level discriminative patches,” in Proceedings of the
2012 European Conference on Computer Vision, Springer,
Firenze, Italy, 2012.
 M. George, D. Mircic, G. Soros, C. Floerkemeier, and
F. Mattern, “Fine-grained product class recognition for assisted
shopping,” in Proceedings of the 2015 IEEE International
Conference on Computer Vision Workshops, Santiago, Chile,
 Y. Wang, R. Song, X. S. Wei, and L. Zhang, “An adversarial
domain adaptation network for cross-domain ﬁne-grained
recognition,” in Proceedings of the 2020 IEEE Winter Con-
ference on Applications of Computer Vision, Aspen, CO, USA,
 R. Mottaghi, X. Chen, X. Liu et al., “e role of context for
object detection and semantic segmentation in the wild,” in
Proceedings of the 2014 IEEE Conference on Computer Vision
and Pattern Recognition, Columbus, OH, USA, 2014.
 S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and
M. Hebert, “An empirical study of context in object de-
tection,” in Proceedings of the 2009 IEEE Conference on
Computer Vision and Pattern Recognition, IEEE, Miami, FL,
 A. Torralba, “Contextual priming for object detection,” In-
ternational Journal of Computer Vision, vol. 53, no. 2,
pp. 169–191, 2003.
 A. Tonioni and L. Di Stefano, “Product recognition in store
shelves as a sub-graph isomorphism problem,” in Proceed-
ings of the 2017 International Conference on Image Analysis
and Processing, Springer, Catania, Italy, 2017.
 E. P. Xing, M. I. Jordan, S. J. Russell, and A. Y. Ng, “Distance
metric learning with application to clustering with side-in-
formation,” Advances in Neural Information Processing
Systems, pp. 521–528, MIT Press, Cambridge, MA, USA,
 O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, and others,
“Matching networks for one shot learning,” Advances in
Neural Information Processing Systems, pp. 3630–3638, MIT
Press, Cambridge, MA, USA, 2016.
 R. Keshari, M. Vatsa, R. Singh, and A. Noore, “Learning
structure and strength of CNN ﬁlters for small sample size
training,” in Proceedings of the 2018 IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 2018.
 S. Bak and P. Carr, “One-shot metric learning for person re-
identiﬁcation,” in Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI,
 G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural
networks for one-shot image recognition,” in Proceedings of
the 2015 ICML Deep Learning Workshop, Lille, France, 2015.
 S. Caelles, K. K. Maninis, J. Pont-Tuset, L. Leal-Taix´
D. Cremers, and L. Van Gool, “One-shot video object seg-
mentation,” in Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI,
 E. Schwartz, L. Karlinsky, J. Shtok et al., “RepMet: repre-
sentative-based metric learning for classiﬁcation and one-
shot object detection,” 2018, https://arxiv.org/abs/1806.
 T. Wiedemeyer, “IAI Kinect2,” 2015, https://github.com/
 G. Varol and R. S. Kuzu, “Toward retail product recognition
on grocery shelves,” in Proceedings of the 6th International
Conference on Graphic and Image Processing (ICGIP 2014),
Beijing, China, 2015.
 M. Klasson, C. Zhang, and H. Kjellstr¨
om, “A hierarchical
grocery store image dataset with visual and semantic labels,”
in Proceedings of the 2019 IEEE Winter Conference on Ap-
plications of Computer Vision (WACV), IEEE, Waikoloa
Village, HI, USA, 2019.
 T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal
loss for dense object detection,” in Proceedings of the 2017
IEEE International Conference on Computer Vision (ICCV),
Venice, Italy, 2017.
 F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and
G. Monfardini, “e graph neural network model,” IEEE
Transactions on Neural Networks, vol. 20, pp. 61–80, 2008.
 Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A
comprehensive survey on graph neural networks,” 2019,
 P. W. Battaglia, J. B. Hamrick, V. Bapst et al., “Relational
inductive biases, deep learning, and graph networks,” 2018,
 R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton,
and J. Leskovec, “Graph convolutional neural networks for
web-scale recommender systems,” in Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, ACM, London, UK, 2018.
 W. Fan, Y. Ma, Q. Li et al., “Graph neural networks for social
recommendation,” in Proceedings of the 2019 World Wide
Web Conference, ACM, San Francisco, CA, USA, 2019.
 J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and
G. E. Dahl, “Neural message passing for quantum chemis-
try,” in Proceedings of the 34th International Conference on
Machine Learning, vol. 70, Sydney, Australia, 2017.
 T. N. Kipf and M. Welling, “Semi-supervised classiﬁcation
with graph convolutional networks,” 2016, https://arxiv.org/
 S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE
Transactions on Knowledge and Data Engineering, vol. 22,
pp. 1345–1359, 2009.
 L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of
Research on Machine Learning Applications and Trends:
Algorithms, Methods, and Techniques, pp. 242–264, IGI
Global, Philadelphia, PA, USA, 2010.
 J. Lu, V. E. Liong, G. Wang, and P. Moulin, “Joint feature
learning for face recognition,” IEEE Transactions on Infor-
mation Forensics and Security, vol. 10, no. 7, pp. 1371–1383,
 L. Guan, Y. Wu, J. Zhao, and C. Ye, “Learn to detect objects
incrementally,” in Proceedings of the 2018 IEEE Intelligent
Vehicles Symposium (IV), Changshu, China, 2018.
 V. Lomonaco and D. Maltoni, “Comparing incremental
learning strategies for convolutional neural networks,” in
Proceedings of the 2016 IAPR Workshop on Artiﬁcial Neural
Networks in Pattern Recognition, Springer, Ulm, Germany,
Computational Intelligence and Neuroscience 23