Conference Paper

Fine-Grained Grocery Product Recognition by One-Shot Learning


Abstract

Fine-grained grocery product recognition via camera is a challenging task: visually similar products with subtle differences must be identified from single-shot training examples. To address this issue, we present a novel hybrid classification approach that combines feature-based matching and one-shot deep learning with a coarse-to-fine strategy. The candidate regions of product instances are first detected and coarsely labeled by recurring features in product images, without any training. Then, attention maps are generated to guide the classifier to focus on fine discriminative details, magnifying the influence of the features inside the candidate regions of interest (ROI) and suppressing the interference of the features outside them, which effectively improves the accuracy of fine-grained grocery product recognition. Our framework also exhibits good adaptability, allowing an existing classifier to be refined for newly added product classes without retraining. As an additional contribution, we collect a new grocery product database with 102 classes from 2 stores. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods.
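To make the coarse-to-fine idea concrete, here is a minimal Python sketch (with hypothetical function and variable names, not the authors' released code) that pairs local-feature matching for a coarse candidate region with an attention mask that reweights CNN features:

```python
import cv2
import numpy as np

def coarse_candidate_roi(template_gray, scene_gray):
    """Coarse stage: match local features (BRISK here) between a single
    template image and the shelf image, and return a candidate ROI."""
    brisk = cv2.BRISK_create()
    kp_t, des_t = brisk.detectAndCompute(template_gray, None)
    kp_s, des_s = brisk.detectAndCompute(scene_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_t, des_s), key=lambda m: m.distance)
    pts = np.float32([kp_s[m.trainIdx].pt for m in matches[:30]])
    x, y, w, h = cv2.boundingRect(pts)          # coarse candidate region
    return x, y, w, h

def attention_weighted_features(feature_map, roi, scale, boost=2.0, damp=0.3):
    """Fine stage: magnify CNN features inside the ROI and suppress the
    rest before the final fine-grained classification layer. The boost
    and damping factors are illustrative assumptions."""
    att = np.full(feature_map.shape[:2], damp, dtype=np.float32)
    x, y, w, h = [int(v * scale) for v in roi]  # map ROI to feature grid
    att[y:y + h, x:x + w] = boost
    return feature_map * att[..., None]
```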


... RPD aims at finding recurrence in a more relaxed notion than symmetry. [Flattened excerpt of Tab. 1, which compares prior methods [14-35] by matching strategy (pairwise-object, pairwise-visual word, or cross-image), supervision, the number of input images required (from a single image up to many), and the output (objects, recurring patterns, or RP instances).] Tab. 1 shows a summary and comparison of different methods in terms of input/output and matching strategy. From Tab. 1, [5] and our approach are the only methods that discover class-agnostic recurring patterns from a single image in an unsupervised fashion. ...
... Thus we treat [5] as our baseline. However, for RP instance counting evaluation, we also compare with [34] on the same dataset. It should be pointed out that [34] requires a given localization of logo regions in the grocery product detection scenario (equivalent to giving an RP instance as input) before the RP discovery. ...
Preprint
We demonstrate the utility of recurring pattern discovery from a single image for spatial understanding of a 3D scene in terms of (1) vanishing point detection, (2) hypothesizing 3D translation symmetry and (3) counting the number of RP instances in the image. Furthermore, we illustrate the feasibility of leveraging RP discovery output to form a more precise, quantitative text description of the scene. Our quantitative evaluations on a new 1K+ Recurring Pattern (RP) benchmark with diverse variations show that visual perception of recurrence from one single view leads to scene understanding outcomes that are as good as or better than existing supervised methods and/or unsupervised methods that use millions of images.
... However, a target object could be incorrectly recognized when there are multiple objects with feature points similar to the ones of the target object [4]. Other than commercial AR SDKs, deep learning-based studies have been presented to recognize grocery products [5], [6]. However, they do not consider the mobile environment nor target mobile AR applications. ...
... A recent work [6] aims at accurately recognizing similar product images. While it achieves relatively high accuracy for a small number of products, the accuracy decreases when the number of products is high. ...
... These studies have a limitation in that they did not consider a large number of learning images. Geng et al. [6] proposed a method for recognizing products with similar labels based on deep learning. They used a one-shot learning method [25] to overcome the requirement of a large number of training images. ...
Article
Full-text available
We propose a system that supports real-time product recognition and tracking based on text detection for mobile augmented reality. To accurately distinguish products with visually similar packages, we develop a method that recognizes product names by utilizing the characteristics of texts printed on the product packages. It first filters out irrelevant products and effectively ranks candidate products through an inverted index search. We significantly reduce processing overhead by selectively performing product name recognition. In addition, we present an optical-flow-based method that enables efficient and responsive product tracking. Our evaluation shows that the proposed system achieves significantly better product recognition accuracy (80%) compared to alternative solutions, Vuforia (55.4%) and MobileNetV2 (69.6%). We also show that it achieves reasonable tracking accuracy and processing latency to support quality mobile AR experiences.
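The inverted-index filtering step described above can be sketched in a few lines; the names and the word-overlap scoring rule are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

def build_inverted_index(catalog):
    """catalog: {product_id: product name string}."""
    index = defaultdict(set)
    for pid, name in catalog.items():
        for word in name.lower().split():
            index[word].add(pid)
    return index

def rank_candidates(index, detected_words):
    """Score products by how many detected words they share."""
    scores = defaultdict(int)
    for word in detected_words:
        for pid in index.get(word.lower(), ()):
            scores[pid] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

catalog = {1: "Choco Crunch Cereal", 2: "Corn Crunch Cereal"}
index = build_inverted_index(catalog)
print(rank_candidates(index, ["crunch", "choco"]))  # product 1 ranks first
```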
... The first and second phases are coupled with the Karhunen-Loeve transform in the final phase to track the detected boxes in a video. Geng et al. (Geng et al. 2018) developed a product detection system to identify the locations of the products in the video frames by creating a saliency map for the shelf images. The saliency map is constructed using SURF key points and Attention and Information Maximization. ...
... Ray et al. (Ray et al. 2018) cannot differentiate non-identical objects. Geng et al. (Geng et al. 2018) face more partial detections. Franco et al. (Franco and Maltoni 2017) and Karlinsky et al. (Karlinsky et al. 2017) face a labeling problem (inaccurate boxes). ...
... Geng et al. (Geng et al. 2018) use the GroZi-120 dataset to assess the performance of the BRISK and SIFT techniques. The authors used VGG16 and attention maps for feature extraction and classification, respectively. ...
Article
Full-text available
Object detection and recognition are among the most important and challenging problems in computer vision. The remarkable advancements in deep learning techniques have significantly accelerated the momentum of object detection/recognition in recent years. Meanwhile, text detection/recognition is also a critical task in computer vision and has received more attention from many researchers due to its wide range of applications. This work focuses on detecting and recognizing multiple retail products stacked on the shelves and off the shelves in grocery stores by identifying the label texts. In this paper, we propose a new framework composed of three modules: (a) retail product detection, (b) product-text detection, and (c) product-text recognition. In the first module, on-the-shelf and off-the-shelf retail products are detected using the YOLOv5 object detection algorithm. In the second module, we improve the performance of the state-of-the-art text detection algorithm by replacing the backbone network with ResNet50 + FPN and by introducing a new post-processing technique, Width-Height based Bounding Box Reconstruction, to mitigate the problem of inaccurate text detection. In the final module, we use a state-of-the-art text recognition model to recognize the retail product's text information. The YOLOv5 algorithm accurately detects both on-the-shelf and off-the-shelf grocery products from video frames and static images. The experimental results show that the proposed post-processing approach improves the performance of the existing methods on both regular and irregular text. The robust text detection and text recognition methods greatly support our proposed framework in recognizing on-the-shelf retail products by extracting product information such as product name, brand name, price, and expiry date. The recognized text contexts around the retail products can be used as identifiers to distinguish the products.
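As a hedged sketch of the first module, the publicly documented YOLOv5 torch.hub API can be driven as below; the retail-specific weights and the input image path are assumptions:

```python
import torch

# A custom checkpoint trained on shelf images would replace 'yolov5s'.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
results = model('shelf_image.jpg')         # hypothetical input image
boxes = results.xyxy[0]                    # [x1, y1, x2, y2, conf, class]
for *xyxy, conf, cls in boxes.tolist():
    print(f"product box {xyxy}, confidence {conf:.2f}")
```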
... After a complete reading, the IEEE Xplore database yielded seven accepted studies: [45][46][47][48][49][50][51]. The ACM database had one accepted work: [52]. The Scopus database yielded two accepted articles: [53,54], plus the article [55], which was not found in any of the bibliographic search databases but was reached through the reference analysis of another Scopus article. ...
... Geng et al. [52] designed a framework that detects and recognizes grocery products using One-Shot learning [80], which requires only one training example per class, and is also capable of adding new products without retraining all classes. They ran tests with and without Attention Maps (ATmap) [81], which are employed to identify discriminative regions, obtaining good results with the BRISK [82] and SIFT algorithms. ...
... Analysis of the accepted works shows that, from 2016 onwards, the studies concentrated more on CNNs, with three basic types of approach: CNNs for both description and classification [49-51, 54, 55]; CNNs only as descriptors [50]; and CNNs only as classifiers [52]. In this and several other areas of computer vision, CNN-based approaches lead the state of the art. ...
Article
Full-text available
In recent years, Computer Vision and Machine Learning techniques have been extensively explored in the creation of assistive systems for the visually impaired. One of the most challenging tasks for visually impaired people is object recognition. In this paper, we conducted a systematic review to identify the current state of the art in designing these assistive systems. Due to the huge number of object categories, we focused on recognizing products, such as those found in grocery stores, pantries and refrigerators. We analyze the techniques used, noting the efficiency and economy of hardware resources such as processing, memory and battery. Thus, we verify whether they can be used in wearable systems and adapted to existing devices of the Internet of Things (IoT), enabling the proposition of efficient and accessible assistive product recognition systems.
... The drawback is that the prediction accuracy on images from real stores only reaches 87.5%, which needs to be improved. Geng et al. [74] employed VGG-16 as the feature descriptor to recognize the product instances, achieving recognition for 857 classes of food products. In this work, VGG-16 is integrated with recurring features and attention maps to improve the performance of grocery product recognition in the real-world application scenario. ...
... Specifically, a variable containing the product of the scores from a fast detection model and the corresponding CNN confidences is used for ranking the final result. In the research of [74], Geng et al. applied visual attention to fine-grained product classification tasks. Attention maps are employed to magnify the influence of the features and consequently guide the CNN classifier to focus on fine discriminative details. ...
Article
Full-text available
Taking time to identify expected products and waiting for the checkout in a retail store are common scenes we all encounter in our daily lives. The realization of automatic product recognition has great significance for both economic and social progress because it is more reliable than manual operation and time-saving. Product recognition via images is a challenging task in the field of computer vision. It receives increasing consideration due to its great application prospects, such as automatic checkout, stock tracking, planogram compliance, and visually impaired assistance. In recent years, deep learning has enjoyed a flourishing evolution with tremendous achievements in image classification and object detection. This article aims to present a comprehensive literature review of recent research on deep learning-based retail product recognition. More specifically, this paper reviews the key challenges of deep learning for retail product recognition and discusses potential techniques that can be helpful for research on the topic. Next, we provide the details of public datasets which could be used for deep learning. Finally, we summarize the current progress and point out new perspectives for research in related fields.
... Recent advances in object recognition have demonstrated how computer vision methods can enable capabilities such as product recognition [11] or gaze estimation [18,50]. All the aforementioned proposals have one or more of the following weaknesses: (i) insufficient precision or accuracy, (ii) challenging deployment, in the case of techniques that need installed devices to function (e.g., in each Amazon Go store hundreds of cameras are installed), (iii) threats to private visual boundaries, in the case of computer vision techniques, as shoppers do not want to feel observed [8,30] and might not want their faces to be recorded for post-analytic purposes, or (iv) the requirement of devices attached to the shopper (wearables, smartphones). ...
... The authors also propose novel methods to track object picking with BLE devices (i.e., locating a beacon on a sliding door to track the door's opening direction). The authors of [11] propose a product recognition system that can identify visually similar products on supermarket shelves. In contrast to these works, EyeShopper does not require customers to carry any device, and it is solely based on computer vision techniques applied to the frames recorded by cameras installed in retail shops. ...
... The current system can also be used by other similar works that want to apply transfer learning with their datasets. For example, EyeShopper can be merged with a product recognition system for more fine-grained retail analytics [11]. The collection and generation of realistic data to create a unique dataset for back-head detection and gaze estimation will be part of our future work and will open possibilities for other applications and studies. ...
Conference Paper
Full-text available
Recent advances in machine and deep learning allow for enhanced retail analytics by applying object detection techniques. However, existing approaches either require laborious installation processes to function or lack precision when customers turn their backs to the installed cameras. In this paper, we present EyeShopper, an innovative system that tracks the gaze of shoppers when they face away from the camera and provides insights about their behavior in physical stores. EyeShopper is readily deployable in existing surveillance systems and robust against low-resolution video inputs. At the same time, its accuracy is comparable to state-of-the-art gaze estimation frameworks that require high-resolution and continuous video inputs to function. Furthermore, EyeShopper is more robust than state-of-the-art gaze tracking techniques for back-head images. Extensive evaluation with different real video datasets and a synthetic dataset we produced shows that EyeShopper estimates the gaze of customers with high accuracy.
... Until now, scholars have built dedicated CV pipelines, either for food items (e.g. meals) [27,29,51,52] or for packaged products in a retail environment [53][54][55][56]. Similarly, the publication of image datasets, which are central to developing CV solutions, follows this separation of meals and packaged products. ...
... Specifically, we were interested in whether wearable smartglasses, which leverage CV and MR real-time interventions, support healthier food selections at VMs, i.e. locations where consumers may in fact (intend to) purchase unhealthier foods and beverages. While most research in this domain so far has focused on classifying food products via CV through pre-fabricated image datasets [53][54][55], or has only assessed visual interventions via augmented reality on handheld devices [89], our study contributes to the literature by leveraging MR headsets and interventions, thereby overcoming the drawbacks of current smartphone-mediated interventions. ...
... Compared to existing studies on detecting packaged products [53][54][55], our technical feasibility study is the first to collect, label, and apply real-world images for product detection. To the best of our knowledge, the collected dataset of 10'035 labeled product instances represents the largest labeled image dataset to include product identifiers (GTINs). ...
Article
Full-text available
With the emergence of the Internet of People (IoP) and its user-centric applications, novel solutions to the many issues facing today’s societies are to be expected. These problems include unhealthy diets, with obesity and diet-related diseases reaching epidemic proportions. We argue that the proliferation of mixed reality (MR) headsets as next generation primary interfaces provides promising alternatives to contemporary digital solutions in the context of diet tracking and interventions. Concretely, we propose the use of MR headset-mounted cameras for computer vision (CV) based detection of diet-related activities and the consequential display of visual real-time interventions to support healthy food choices. We provide an integrative framework and results from a technical feasibility as well as an impact study conducted in a vending machine (VM) setting. We conclude that current neural networks already enable accurate food item detection in real-world environments. Moreover, our user study suggests that real-time interventions significantly improve beverage (reduction of sugar and energy intake) as well as food choices (reduction of saturated fat). We discuss the results, learnings, and limitations and provide an overview of further technology- and intervention-related avenues of research required by developing an MR-based user support system for healthy food choices.
... In the detection of retail products using recurring patterns, Haar and Haar-like features are extensively used pattern-based features [12]. A fine-grained recognition of grocery products by integrating VGG-16 with recurring features and attention maps was proposed in [21]. Recurring features detect the candidate regions and give coarse labels to the products; afterwards, attention maps help the classifier to concentrate on the fine details in the candidate region of interest (ROI). ...
... For training, 200 images per product SKU are used. [Flattened excerpt of a table categorizing prior detectors (traditional vs. deep learning, two-stage vs. one-stage) by year and method: SURF [4, 16] (2015), SVM [18] (2015), SURF + color histogram [19] (2015), recurring pattern detection [22] (2016), CNN [38] (2016), VGG-F [36] (2017), SIFT [17] (2018), VGG-16 with recurring features and attention maps [21] (2018), GoogLeNet [37] (2018), CIFAR-10/CaffeNet [35] (2019), and [43] (2020).] Three one-stage detectors are trained by providing labeled data (YOLO V4, YOLO V5, and YOLOR). The SKUs of the dataset are categorized under twelve categories, which are listed in Table 2. The testing of the detectors is performed with the test set and performance is evaluated using different accuracy metrics. ...
Article
Full-text available
In retail management, the continuous monitoring of shelves to keep track of the availability of products and following a proper layout are the two important factors that boost sales and improve customers' level of satisfaction. Studies conducted earlier either performed shelf monitoring or verified planogram compliance. As both activities are important, to tackle this problem we present a deep learning and computer vision-based hybrid approach called Hyb-SMPC that deals with both. The Hyb-SMPC approach consists of two modules: the first module detects fine-grained retail products using a one-stage deep learning detector. For the detection part, a comparison of three deep learning-based detectors, You Only Look Once (YOLO V4), YOLO V5, and You Only Learn One Representation (YOLOR), is provided, and the one giving the best result is selected. The selected detector performs detection of different categories of SKUs and racks. The second module performs planogram compliance checking; for this purpose, the company-provided layout is first converted to JavaScript Object Notation (JSON) and then matched against the post-processed retail images. Compliance reports are generated at the end to indicate the level of compliance. The approach is tested in both quantitative and qualitative manners. The quantitative analysis demonstrates that the proposed approach achieves an accuracy of up to 99% on the provided retail dataset, whereas the qualitative evaluation indicates an increase in sales and customers' satisfaction level.
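The compliance-matching step of the second module might look like the following sketch; the JSON field names and the per-shelf matching rule are assumptions rather than the Hyb-SMPC schema:

```python
import json

planned = json.loads('{"shelf1": ["sku_a", "sku_b"], "shelf2": ["sku_c"]}')
detected = {"shelf1": ["sku_a", "sku_x"], "shelf2": ["sku_c"]}

report = {}
for shelf, layout in planned.items():
    found = detected.get(shelf, [])
    # Position-wise agreement between planned and detected SKUs.
    matches = sum(p == d for p, d in zip(layout, found))
    report[shelf] = matches / len(layout)   # per-shelf compliance ratio

print(report)  # e.g. {'shelf1': 0.5, 'shelf2': 1.0}
```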
... [12] introduces the GroZi-120 dataset and uses SIFT [9] for product identification as a baseline. [13] introduces the CAPG-GP dataset and uses a combination of Deep Learning [6] and SIFT [9]/BRISK [14] for product recognition. [15] compares visual bag-of-words and deep learning on the GroZi dataset for both detection on shelves and classification. ...
... In real-world use cases, this type of classification problem often comes up where one gets only pack shots for training and the trained classifier has to work on images from shops/retail outlets. The CAPG-GP dataset [13] has 102 retail products for fine-grained one-shot classification. All products have just one training image. ...
Preprint
Full-text available
Retail Product Image Classification is an important Computer Vision and Machine Learning problem for building real-world systems like self-checkout stores and automated retail execution evaluation. In this work, we present various tricks to increase the accuracy of Deep Learning models on different types of retail product image classification datasets. These tricks enable us to increase the accuracy of fine-tuned convnets for retail product image classification by a large margin. As the most prominent trick, we introduce a new neural network layer called the Local-Concepts-Accumulation (LCA) layer, which gives consistent gains across multiple datasets. Two other tricks we find to increase accuracy on retail product identification are using an Instagram-pretrained convnet and using Maximum Entropy as an auxiliary loss for classification.
... [Flattened excerpt of a table associating prior works (Tonioni et al. [25], Chong et al. [3], Farren et al. [6], Karlinsky et al. [13], Li et al. [16], Goldman et al. [8], Biasio et al. [5], Tonioni et al. [26], Jund et al. [12], Geng et al. [7], Melek et al. [20], Liu et al. [17]) with the CNN backbones they use: GoogLeNet, AlexNet, VGG, CaffeNet, and ResNet.] We improve the feature extraction with a dual attention mechanism and improve the binary cross-entropy loss function with a Euclidean penalty function. ...
Article
Full-text available
Conventional retail stores are undergoing digital transformation, and in a typical smart retail store, automatic recognition of retail products is essential for the customer experience in the checkout stage. In this paper, we propose an improved Siamese neural network to identify products by one-shot learning. First, a spatial-channel dual attention mechanism is proposed to improve the network architecture. Second, a binary cross-entropy loss function with a distance penalty is adopted to replace the conventional contrastive loss function. The proposed network can better model the details of the products. The experimental results are achieved on two publicly available databases. The results show that the proposed method outperforms the conventional methods, and it can solve the data insufficiency problem in the training stage. Smart retail stores can change SKUs (Stock Keeping Units) conveniently without collecting a large number of training samples.
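A minimal PyTorch sketch of such a loss, assuming a generic shared-weight encoder and leaving out the dual attention mechanism (the margin and weighting below are also assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseHead(nn.Module):
    def __init__(self, backbone, dim=128):
        super().__init__()
        self.backbone = backbone              # shared-weight encoder
        self.fc = nn.Linear(dim, 1)

    def forward(self, a, b):
        ea, eb = self.backbone(a), self.backbone(b)
        logit = self.fc(torch.abs(ea - eb)).squeeze(1)  # same/different
        dist = F.pairwise_distance(ea, eb)              # Euclidean term
        return logit, dist

def loss_fn(logit, dist, same, margin=1.0, lam=0.1):
    """BCE on the pair label plus a contrastive-style distance penalty."""
    bce = F.binary_cross_entropy_with_logits(logit, same.float())
    # Penalize large distances for same pairs, small ones for different.
    penalty = same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)
    return bce + lam * penalty.mean()
```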
... In the automatic retail product recognition application, the large supermarket retail product dataset RPC [55] was introduced, which provides data support for product detection and small sample learning. Geng et al. [7] introduced the CAPG-GP dataset and combined deep learning with SIFT [32] or BRISK features [29] for product identification. In recent years, various deep learning methods [24,25,36,43,45,46] represented by Convolutional Neural Networks have been developed. ...
Article
Full-text available
Due to the similar appearances among many retail products, it is a big challenge to identify a product with high accuracy and low computational cost in smart retail scenes. In this paper, we propose a lightweight retail product identification and localization method based on an improved convolutional neural network. First, we use group convolution and depthwise separable convolution to optimize the structure of the backbone network and reduce the amount of computation. Second, the multiscale structure is adjusted to optimal scales. We further use the k-means clustering algorithm to re-cluster six anchors of different sizes. Third, we introduce spatial pyramid pooling (SPP) to replace pooling by convolution, which effectively improves robustness against image distortions such as cropping and scaling. Finally, we use the mosaic data augmentation method to improve the robustness of the network. Experiments on the RPC dataset show that, compared with YOLOv5, the number of parameters is reduced to 1/6.4 and FLOPs to 1/9. Experiments on the DeepBlue Retail Dataset show that, compared with YOLOv5, the number of parameters is reduced to 1/7.8 and FLOPs to 1/9.3. Real-time evaluation on the same hardware shows that the FPS of the proposed model is 123 in the forward inference test, while the FPS of the YOLOv5 model under the same conditions is 58.
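The anchor re-clustering step can be sketched as below; note this simplification uses scikit-learn's Euclidean KMeans, whereas YOLO-style pipelines often cluster with an IoU-based distance:

```python
import numpy as np
from sklearn.cluster import KMeans

# boxes: N x 2 array of ground-truth (width, height), e.g. normalized
# to [0, 1]; random data stands in for real annotations here.
boxes = np.random.rand(1000, 2)
kmeans = KMeans(n_clusters=6, n_init=10).fit(boxes)
anchors = kmeans.cluster_centers_
# Sort anchors by area, smallest to largest, as detectors expect.
print(sorted(map(tuple, anchors), key=lambda wh: wh[0] * wh[1]))
```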
... So far, we have only applied pool-based sampling. Lastly, we aim to tackle the challenge of not only classifying a broad product category but also recognizing specific products, e.g., using one-shot learning [28] or a multistage process comprising product classification and ranking. ...
... In addition, they do not generalize as well as other methods. Regarding this technique, a fine-grained grocery product recognition method has been explored by Geng et al. [17], who addressed the OSL problem by presenting a hybrid classification approach that combined feature-based matching and one-shot deep learning with a coarse-to-fine strategy. The candidate regions of the product instances were first detected and coarsely labeled by recurring features in the product images without any training. ...
Article
Full-text available
Over the last few years, several techniques have been developed with the aim of implementing one-shot learning, a concept that allows classifying images with only a single image per training category. Conceptually, these methods seek to reproduce certain behavior that humans have. People are able to recognize a person they have only seen once, but they are probably not able to do the same with certain animals, such as a monkey. This is because our brains have been trained for years with images of people but not so much of animals. Among the one-shot learning techniques, some have used data generation, such as Generative Adversarial Networks (GANs). Other techniques have been based on the matching of descriptors traditionally used for object detection. Finally, one of the most prominent techniques involves using Siamese neural networks. Siamese networks are usually implemented with two convolutional nets that share their weights. They receive two images as input and can detect whether the images belong to the same category or not. In the field of grocery products, there has been a lot of research on the one-shot learning problem but not so much on the use of Siamese networks. In this paper, several classifiers are first evaluated to decide on a convolutional model to be used with the Siamese network and to improve the baseline results obtained on the dataset used. Then, two existing techniques are integrated within the Siamese model: a convolutional net and a Local Maximal Occurrence (LOMO) descriptor. The latter was initially used for the re-identification of people, although it has shown its effectiveness in improving the results of a traditional Siamese network with only convolutional sister networks. The whole network is trained on some categories and responds to different categories, showing its strong capacity to deal with the problem of having only one image per category.
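At inference time, Siamese models of this kind reduce one-shot classification to pairwise scoring against the single stored image per category. A hedged sketch, reusing the two-output model interface from the earlier Siamese sketch above:

```python
import torch

@torch.no_grad()
def one_shot_classify(model, query, templates):
    """templates: {class_name: single reference image tensor}."""
    best_cls, best_score = None, float('-inf')
    for cls, ref in templates.items():
        logit, _ = model(query.unsqueeze(0), ref.unsqueeze(0))
        score = torch.sigmoid(logit).item()   # same-pair probability
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls, best_score
```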
... This is for example the case of the outstanding work in [12], where automatic checkout was addressed using single product images taken in a controlled environment for training. Another example can be found in [13], where the authors attempted to use single-shot 'in vitro' training examples. The proposed approach combines feature-based matching and one-shot deep learning with a coarse-to-fine strategy. ...
... Both feature representations provide good accuracy on the tested dataset, but DNN-based features are found to be more effective in complex scenarios thanks to their high robustness and invariance. Geng et al. [9] proposed to integrate attention maps into a CNN-based end-to-end classification framework. The attention map is used to capture the location of the product region in the image. ...
Chapter
Grocery product detection and recognition is a very complex task because of the high variability in object appearances and the possibly very large number of products to be recognized. Here we present the results of our investigation into the classification of grocery products. We tested several CNN architectures trained in different modalities for product classification, and we propose a multi-task learning network to be used as a feature extractor. We evaluated the features extracted from the networks in both supervised and unsupervised classification scenarios. All the experiments have been conducted on publicly available datasets from the literature.
... Recognizing grocery items in their natural environments, such as grocery stores, shelves, and kitchens, has been addressed in numerous previous works. 2,[27][28][29][30][31][32][33][34][35][36][37] The addressed tasks range over hierarchical classification, object detection, segmentation, and three-dimensional (3D) model generation. Most of these works collect a dataset that resembles shopping or cooking scenarios, whereby the datasets vary in the degree of labeling, camera views, and the data domain difference between the training and test sets. ...
Article
Full-text available
An essential task for computer vision-based assistive technologies is to help visually impaired people to recognize objects in constrained environments, for instance, recognizing food items in grocery stores. In this paper, we introduce a novel dataset with natural images of groceries—fruits, vegetables, and packaged products—where all images have been taken inside grocery stores to resemble a shopping scenario. Additionally, we download iconic images and text descriptions for each item that can be utilized for better representation learning of groceries. We select a multi-view generative model, which can combine the different item information into lower-dimensional representations. The experiments show that utilizing the additional information yields higher accuracies on classifying grocery items than only using the natural images. We observe that iconic images help to construct representations separated by visual differences of the items, while text descriptions enable the model to distinguish between visually similar items by different ingredients.
... Grocery product recognition. There has been significant effort in recognizing products on shelves of retail stores such as grocery product recognition [13,14,15,16]. This problem is simpler than the general object instance recognition problem that we are aiming to solve due to the structured environment. ...
Preprint
Deep learning object detectors often return false positives with very high confidence. Although they optimize generic detection performance, such as mean average precision (mAP), they are not designed for reliability. For a reliable detection system, if a high confidence detection is made, we would want high certainty that the object has indeed been detected. To achieve this, we have developed a set of verification tests which a proposed detection must pass to be accepted. We develop a theoretical framework which proves that, under certain assumptions, our verification tests will not accept any false positives. Based on an approximation to this framework, we present a practical detection system that can verify, with high precision, whether each detection of a machine-learning based object detector is correct. We show that these tests can improve the overall accuracy of a base detector and that accepted examples are highly likely to be correct. This allows the detector to operate in a high precision regime and can thus be used for robotic perception systems as a reliable instance detection method.
... A fine-grained grocery product dataset was released by (Geng et al., 2018). It consists of 234 test images taken from 2 stores. ...
Preprint
Object detection in densely packed scenes is a new area where standard object detectors fail to train well. Dense object detectors like RetinaNet trained on large and dense datasets show great performance. We train a standard object detector on a small, normally packed dataset with data augmentation techniques. This dataset is 265 times smaller than the standard dataset, in terms of number of annotations. This low data baseline achieves satisfactory results (mAP=0.56) at standard IoU of 0.5. We also create a varied benchmark for generic SKU product detection by providing full annotations for multiple public datasets. It can be accessed at https://github.com/ParallelDots/generic-sku-detection-benchmark. We hope that this benchmark helps in building robust detectors that perform reliably across different settings in the wild.
... Early efforts relied on the combination of SIFT-like descriptors, such as [2], where dense SIFT is combined with LLC to retrieve the best candidate classes and then a GA-based algorithm is used to create the final list of products seen in an image. In [1], SIFT-like descriptors are combined with deep learning to generate attention maps from a combination of SIFT, BRISK and SURF features in order to achieve one-shot deep learning of products, utilising a single image per product. One-shot recognition of products is also proposed by Karlinsky et al. [5], where a probabilistic model is employed to generate bounding boxes and then deep fine-grain refinement based on the VGG-f network is applied to the coarse results. ...
Chapter
The proposed application builds on the latest advancements of computer vision with the aim of improving the autonomy of people with visual impairment at both a practical and an emotional level. More specifically, it is an assistive system that relies on visual information to recognise the objects and faces surrounding the user. The system is supported by a set of sensors for capturing the visual information and for transmitting the auditory messages to the user. In this paper, we present a computer vision application, e-vision, in the context of visiting the supermarket to buy groceries.
... Therefore, research on the identification of packaged products relies on relatively few, rather small and quite old datasets [9,17]. Existing studies on identifying packaged products via computer vision indicate promising potential [8,17,43,44], but they rely on such limited datasets and are conducted under resource-intensive lab conditions, and therefore do not prove the real-world applicability of computer vision-based product identification. Although standards on product identifiers (e.g. ...
Conference Paper
Full-text available
Identification of packaged products in retail environments still relies on barcodes, requiring active user input and limited to one product at a time. Computer vision (CV) has already enabled many applications, but has so far been under-discussed in the retail domain, albeit allowing for faster, hands-free, more natural human-object interaction (e.g. via mixed reality headsets). To assess the potential of current convolutional neural network (CNN) architectures to reliably identify packaged products within a retail environment, we created and open-sourced a dataset of 300 images of vending machines with 15k labeled instances of 90 products. We assessed the accuracies observed from transfer learning for image-based product classification (IC) and multi-product object detection (OD) on multiple CNN architectures, and the number of image instances required per product to achieve meaningful predictions. Results show that as few as six images are enough for 90% IC accuracy, but around 30 images are needed for 95% IC accuracy. For simultaneous OD, 42 instances per product are necessary, and far more than 100 instances are needed to produce robust results. Thus, this study demonstrates that even in realistic, fast-paced retail environments, image-based product identification provides an alternative to barcodes, especially for use cases that do not require perfect 100% accuracy.
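The transfer-learning protocol the study evaluates can be approximated as follows; the dataset path is a placeholder and the frozen-backbone choice is one of several plausible setups:

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
data = datasets.ImageFolder('vending_products/train', transform=tf)  # placeholder path
loader = torch.utils.data.DataLoader(data, batch_size=16, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False                 # keep pretrained features
model.fc = nn.Linear(model.fc.in_features, len(data.classes))

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for x, y in loader:                          # one pass as illustration
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```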
... MobileNet [34], ResNet [17], DenseNet [18]) or, more recently, generative adversarial networks (GANs) [3]. Several recent studies have demonstrated the feasibility of recognizing packaged products through computer vision on a large scale from labelled images [14,26,39,40]. ...
Conference Paper
Full-text available
The rise in diet-related non-communicable diseases suggests that consumers find it difficult to make healthy food-related purchases. This situation is most pertinent in fast-paced retail environments where customers are confronted with sugar-rich or savory food items. Counter-measures such as front-of-package labelling are not yet mandated in most regions, and barcode-scanning mobile applications are impractical when purchasing groceries. We thus applied a mixed reality (MR) wearable headset-mediated intervention (N = 61) at vending machines to explore the potential of passively activated, pervasive MR food labels in affecting beverage purchasing choices. Through a between-subject randomized controlled trial, we find significant, strong improvements in the nutritional quality of the selected products (energy: -34% kJ/100ml, sugar: -28% g/100ml). Our post-hoc analysis suggests that the intervention is especially effective in combination with existing food literacy. This study motivates further research on MR food labels due to the promising observed intervention effects.
Article
Full-text available
Recently, progress on image understanding and AIC (Automatic Image Captioning) has attracted many researchers to make use of AI (Artificial Intelligence) models to assist blind people. AIC integrates the principles of both computer vision and NLP (Natural Language Processing) to generate automatic language descriptions of the observed image. This work presents a new assistive technology based on deep learning which helps blind people distinguish food items in online grocery shopping. The proposed AIC model involves the following steps: data collection, non-captioned image selection, extraction of appearance and texture features, and generation of automatic image captions. Initially, the data is collected from two public sources and the selection of non-captioned images is done using ARO (Adaptive Rain Optimization). Next, the appearance feature is extracted using the SDM (Spatial Derivative and Multi-scale) approach, and WPLBP (Weighted Patch Local Binary Pattern) is used for the extraction of texture features. Finally, the captions are automatically generated using ECANN (Extended Convolutional Atom Neural Network). The ECANN model combines the CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) architectures in a caption-reuse system to select the most accurate caption. The loss in the ECANN architecture is minimized using the AAS (Adaptive Atom Search) optimization algorithm. The implementation tool used is Python, and the datasets used for the analysis are grocery datasets (Freiburg Groceries and the Grocery Store Dataset). The proposed ECANN model acquired 99.46% accuracy on the Grocery Store Dataset and 99.32% accuracy on the Freiburg Groceries dataset. Thus, the performance of the proposed ECANN model is compared with other existing models to verify the superiority of the proposed work.
Article
Image recognition based on deep learning methods has achieved remarkable results when fed abundant training data. Unfortunately, collecting a tremendous amount of annotated images is time-consuming and expensive, especially in grocery product recognition tasks. It is challenging to recognise grocery products accurately when the deep learning model is trained with insufficient data. This paper proposes multi-angle Generative Adversarial Networks (MAGAN), which can generate realistic training images from different angles for data augmentation. Mutual information is employed in the novel GAN to achieve the learning of angles in an unsupervised manner. This paper aims to create training images containing grocery products from different angles, thus improving grocery product recognition accuracy. We first enlarge a fruit dataset by using MAGAN and state-of-the-art GAN variants. Then, we compare the top-1 accuracy results from CNN classifiers trained with different data augmentation methods. Finally, our experiments demonstrate that MAGAN exceeds the existing GANs for grocery product recognition tasks, obtaining a significant increase in accuracy.
Article
Full-text available
Text is an important invention of humanity and has played a key role in human life since the dark ages. Text in an image is closely related to the scene or product and is widely used in vision-based applications. In this paper we address the problem of visual understanding with text. The main focus is combining textual cues and visual cues in a deep neural network. First, the text in the image is recognized and classified. Then we combine the attended word embedding and the visual feature vector, which are optimized by a CNN for fine-grained image classification. We carried out experiments on a soft drink dataset from Pakistan. The results show that the system achieves significant performance, which can be potentially beneficial for real-world applications, e.g., product search.
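A minimal sketch of fusing an attended word embedding with a CNN feature vector, where the dimensions and the fusion-by-concatenation choice are assumptions:

```python
import torch
import torch.nn as nn

class TextVisualFusion(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, n_classes=10):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 512), nn.ReLU(),
            nn.Linear(512, n_classes))

    def forward(self, vis_feat, word_emb):
        # vis_feat: CNN feature vector; word_emb: attended word embedding
        return self.classifier(torch.cat([vis_feat, word_emb], dim=1))

model = TextVisualFusion()
logits = model(torch.randn(4, 2048), torch.randn(4, 300))  # toy batch
```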
Preprint
Full-text available
Retail product image classification problems are often few-shot classification problems, given that retail product classes cannot have the type of variation across images that a cat, dog or tree could have. Previous works have shown different methods to finetune Convolutional Neural Networks to achieve better classification accuracy on such datasets. In this work, we try to address the problem statement: can we pretrain a Convolutional Neural Network backbone which yields good enough representations for retail product images, so that training a simple logistic regression on these representations gives us good classifiers? We use contrastive learning and pseudolabel-based noisy student training to learn representations that achieve accuracy on the order of finetuning the entire convnet backbone for retail product image classification.
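The evaluation protocol in question (frozen pretrained features plus a plain logistic regression) can be sketched as below, here with an ImageNet backbone standing in for the contrastively pretrained one:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()           # expose pooled features
backbone.eval()

@torch.no_grad()
def featurize(images):                       # images: N x 3 x 224 x 224
    return backbone(images).numpy()

# Random tensors stand in for a real retail product dataset here.
X_train = featurize(torch.randn(32, 3, 224, 224))
y_train = np.arange(32) % 4                  # 4 toy classes
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```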
Article
With the advancement of image processing and computer vision technology, content-based product search is applied in a wide variety of common tasks, such as online shopping, automatic checkout systems, and intelligent logistics. Given a product image as a query, existing product search systems mainly perform the retrieval process using predefined databases with fixed product categories. However, real-world applications often require inserting new categories or updating existing products in the product database. When using existing product search methods, the image feature extraction models must be retrained and database indexes must be rebuilt to accommodate the updated data, and these operations incur high costs for data annotation and training time. To this end, we propose a few-shot incremental product search framework with meta-learning, which requires very few annotated images and has a reasonable training time. In particular, our framework contains a multipooling-based product feature extractor that learns a discriminative representation for each product, and we also design a meta-learning-based feature adapter to guarantee the robustness of the few-shot features. Furthermore, when expanding new categories in batches during a product search, we reconstruct the few-shot features by using an incremental weight combiner to accommodate the incremental search task. Through extensive experiments, we demonstrate that the proposed framework achieves excellent performance for new products while still guaranteeing the high search accuracy of the base categories after gradually expanding new product categories without forgetting.
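In its simplest nearest-prototype form, adding new categories without retraining might look like the sketch below; this is a simplification, not the paper's meta-learned feature adapter or incremental weight combiner:

```python
import numpy as np

class PrototypeIndex:
    def __init__(self):
        self.protos, self.labels = [], []

    def add_class(self, label, few_shot_embeddings):
        # Mean of the few annotated support embeddings = class prototype.
        self.protos.append(np.mean(few_shot_embeddings, axis=0))
        self.labels.append(label)

    def search(self, query_embedding):
        d = [np.linalg.norm(query_embedding - p) for p in self.protos]
        return self.labels[int(np.argmin(d))]
```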
Chapter
Object detection in densely packed scenes is a new area where standard object detectors fail to train well [6]. Dense object detectors like RetinaNet [7] trained on large and dense datasets show great performance. We train a standard object detector on a small, normally packed dataset with data augmentation techniques. This dataset is 265 times smaller than the standard dataset, in terms of number of annotations. This low data baseline achieves satisfactory results (mAP = 0.56) at standard IoU of 0.5. We also create a varied benchmark for generic SKU product detection by providing full annotations for multiple public datasets. It can be accessed at this URL. We hope that this benchmark helps in building robust detectors that perform reliably across different settings in the wild.
Chapter
Retail Product Image Classification is an important Computer Vision and Machine Learning problem for building real-world systems like self-checkout stores and automated retail execution evaluation. In this work, we present various tricks to increase the accuracy of Deep Learning models on different types of retail product image classification datasets. These tricks enable us to increase the accuracy of fine-tuned convnets for retail product image classification by a large margin. As the most prominent trick, we introduce a new neural network layer called the Local-Concepts-Accumulation (LCA) layer, which gives consistent gains across multiple datasets. Two other tricks we find to increase accuracy on retail product identification are using an Instagram-pretrained convnet and using Maximum Entropy as an auxiliary loss for classification.
Article
Product recognition has a significant role because of its benefits for the compliant arrangement of stores, which further affects commercial contracts, customer satisfaction, and sales achievement. Automatic recognition systems have been proposed owing to the current high cost of manual inspection by clerks. Because collecting product images is difficult, these systems are commonly one-shot cases, in which the training data are actually template product images. However, despite the development of one-shot recognition, the systems rarely utilize the special characteristics of products on retail store shelves, and the frequent updating of templates is still challenging. Furthermore, we consider that product detection can be the basis of product recognition. In this article, instead of the present workflow, we propose a novel product detection system, named TemplateFree, which combines product segmentation and zero-shot learning. It detects products on retail store shelves from single store shelf images alone; that is, corresponding template product images are not necessary. TemplateFree exploits the characteristic that a store shelf can be segmented horizontally into layers and then vertically into products, so that each product can be detected according to the segmentation. Double zero-shot deep learning frameworks are used to improve the segmentation. In experiments, TemplateFree achieves better results than the existing method.
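The layer-then-product segmentation intuition can be illustrated with simple edge projections (the paper itself refines the segmentation with zero-shot deep networks, so this is only a rough stand-in):

```python
import cv2
import numpy as np

def split_by_projection(gray, axis, thresh=0.2):
    """Cut the image at low-energy rows (axis=1) or columns (axis=0)."""
    edges = cv2.Canny(gray, 50, 150)
    profile = edges.sum(axis=axis).astype(float)
    profile /= profile.max() + 1e-6
    gaps = np.where(profile < thresh)[0]     # likely shelf/product gaps
    return gaps

shelf = cv2.imread('shelf.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
row_gaps = split_by_projection(shelf, axis=1)   # horizontal shelf layers
col_gaps = split_by_projection(shelf, axis=0)   # vertical product cuts
```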
Conference Paper
Full-text available
The arrangement of products in store shelves is carefully planned to maximize sales and keep customers happy. Verifying compliance of real shelves to the ideal layout, however, is a costly task currently routinely performed by the store personnel. In this paper, we propose a computer vision pipeline to recognize products on shelves and verify compliance to the planned layout. We deploy local invariant features together with a novel formulation of the product recognition problem as a sub-graph isomorphism between the items appearing in the given image and the ideal layout. This allows for auto-localizing the given image within aisles of the store and improves recognition dramatically.
Article
Full-text available
The arrangement of products in store shelves is carefully planned to maximize sales and keep customers happy. However, verifying compliance of real shelves to the ideal layout is a costly task routinely performed by the store personnel. In this paper, we propose a computer vision pipeline to recognize products on shelves and verify compliance to the planned layout. We deploy local invariant features together with a novel formulation of the product recognition problem as a sub-graph isomorphism between the items appearing in the given image and the ideal layout. This allows for auto-localizing the given image within the aisle or store and improving recognition dramatically.
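The sub-graph isomorphism formulation lends itself to a compact sketch with networkx, where nodes carry product labels and compliance holds if the planned layout is found inside the detected-items graph (the graph construction here is a simplified assumption):

```python
import networkx as nx
from networkx.algorithms import isomorphism

planned = nx.Graph()                          # ideal planogram fragment
planned.add_node(0, sku="sku_a")
planned.add_node(1, sku="sku_b")
planned.add_edge(0, 1)                        # adjacent on the shelf

observed = nx.Graph()                         # items detected in the image
for i, sku in enumerate(["sku_a", "sku_b", "sku_c"]):
    observed.add_node(i, sku=sku)
observed.add_edges_from([(0, 1), (1, 2)])

# Match nodes by their product label, not just by graph structure.
node_match = isomorphism.categorical_node_match("sku", None)
gm = isomorphism.GraphMatcher(observed, planned, node_match=node_match)
print(gm.subgraph_is_isomorphic())            # True: layout is present
```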
Conference Paper
Full-text available
We propose a novel method to improve fine-grained bird species classification based on hierarchical subset learning. We first form a similarity tree where classes with strong visual correlations are grouped into subsets. An expert local classifier with strong discriminative power to distinguish visually similar classes is then learnt for each subset. On the challenging Caltech-UCSD Birds-200-2011 dataset we show that using the hierarchical approach with features derived from a deep convolutional neural network leads to the average accuracy improving from 64.5% to 72.7%, a relative improvement of 12.7%.
Article
Full-text available
Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations. In this paper, we propose a novel end-to-end Mask-CNN model without fully connected layers for fine-grained recognition. Based on the part annotations of fine-grained images, the proposed model consists of a fully convolutional network to both locate the discriminative parts (e.g., head and torso) and, more importantly, generate object/part masks for selecting useful and meaningful convolutional descriptors. After that, a four-stream Mask-CNN model is built for aggregating the selected object- and part-level descriptors simultaneously. The proposed Mask-CNN model has the smallest number of parameters, the lowest feature dimensionality and the highest recognition accuracy when compared with state-of-the-art fine-grained approaches.
Article
Full-text available
Global warming and its resulting environmental changes surely are ubiquitous subjects nowadays and undisputedly important research topics. One way of tracking such environmental changes is by means of phenology, which studies natural periodic events and their relationship to climate. Phenology is seen as the simplest and most reliable indicator of the effects of climate change on plants and animals. The search for phenological information and monitoring systems has stimulated many research centers worldwide to pursue the development of effective and innovative solutions in this direction. One fundamental requirement for phenological systems is achieving fine-grained recognition of plants. In this sense, the present work seeks to understand specific properties of each target plant species and to provide solutions for gathering specific knowledge of such plants for further levels of recognition and exploration in related tasks. In this work, we address some important questions such as: (i) how species from the same leaf functional group differ from each other; (ii) how different pattern classifiers might be combined to improve the effectiveness of target species identification; and (iii) whether it is possible to achieve good classification results with fewer classifiers for fine-grained plant species identification. In this sense, we perform different analyses considering RGB color information channels from a digital hemispherical lens camera at different hours of the day and for different plant species. A study of the correlation of classifiers associated with time series extracted from digital images is also performed. We adopt a successful selection and fusion framework to combine the most suitable classifiers and features, improving the plant identification decision-making task, as it is nearly impossible to develop just a single "silver bullet" image descriptor that would capture all subtle discriminatory features of plants within the same functional group. This adopted framework turns out to be an effective solution for the target task, achieving better results than well-known approaches in the literature.
Article
Full-text available
Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding the foreground object or object parts (where) to extract discriminative features (what). In this paper, we propose to apply visual attention to the fine-grained classification task using a deep neural network. Our pipeline integrates three types of attention: the bottom-up attention that proposes candidate patches, the object-level top-down attention that selects patches relevant to a certain object, and the part-level top-down attention that localizes discriminative parts. We combine these attentions to train domain-specific deep nets, then use them to improve both the what and the where aspects. Importantly, we avoid using expensive annotations like bounding box or part information anywhere in the end-to-end pipeline. The weak supervision constraint makes our work easier to generalize. We have verified the effectiveness of the method on subsets of the ILSVRC2012 dataset and the CUB200_2011 dataset. Our pipeline delivered significant improvements and achieved the best accuracy under the weakest supervision condition. The performance is competitive against other methods that rely on additional annotations.
Conference Paper
Full-text available
We propose a novel unsupervised method for discovering recurring patterns from a single view. A key contribution of our approach is the formulation and validation of a joint assignment optimization problem where multiple visual words and object instances of a potential recurring pattern are considered simultaneously. The optimization is achieved by a greedy randomized adaptive search procedure (GRASP) with moves specifically designed for fast convergence. We have quantified systematically the performance of our approach under stressed conditions of the input (missing features, geometric distortions). We demonstrate that our proposed algorithm outperforms state of the art methods for recurring pattern discovery on a diverse set of 400+ real world and synthesized test images.
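GRASP itself follows a generic two-phase loop (greedy randomized construction, then local search). A skeleton in Python, with the scoring and move functions for the recurring-pattern assignment left as placeholders:

```python
import random

def construct(elements, score, rcl_size=5):
    """Greedy randomized construction: repeatedly pick from a
    restricted candidate list (RCL) of the best-scoring extensions."""
    solution, remaining = [], list(elements)
    while remaining:
        remaining.sort(key=lambda e: score(solution + [e]), reverse=True)
        pick = random.choice(remaining[:rcl_size])   # RCL choice
        solution.append(pick)
        remaining.remove(pick)
    return solution

def grasp(elements, score, local_search, iters=50):
    """Repeat construction + local search; keep the best solution."""
    best = None
    for _ in range(iters):
        candidate = local_search(construct(elements, score))
        if best is None or score(candidate) > score(best):
            best = candidate
    return best
```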
Conference Paper
Full-text available
In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF’s strong performance.
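The integral-image trick that SURF builds on is easy to verify in a few lines: after one cumulative-sum pass, any box-filter response costs four lookups:

```python
import numpy as np

def integral_image(img):
    # Zero-padded so that box_sum indexing stays branch-free.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1)."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```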
Article
Full-text available
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors (software to compute the color descriptors from this paper is available from http://www.colordescriptors.com) in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a data set with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results further reveal that, for light intensity shifts, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the data set and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8 percent on the PASCAL VOC 2007 and by 7 percent on the Mediamill Challenge.
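For reference, the opponent color space underlying the recommended OpponentSIFT descriptor is a fixed linear transform of RGB. The sketch below applies the standard formulas (O1 and O2 are invariant to light intensity shifts, O3 carries the intensity); it computes only the color transform, not the full descriptor.

```python
import numpy as np

def to_opponent(rgb):
    """Convert an H x W x 3 float RGB image to the opponent color space:
    O1 = (R - G)/sqrt(2), O2 = (R + G - 2B)/sqrt(6), O3 = (R + G + B)/sqrt(3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    o1 = (r - g) / np.sqrt(2.0)
    o2 = (r + g - 2.0 * b) / np.sqrt(6.0)
    o3 = (r + g + b) / np.sqrt(3.0)
    return np.stack([o1, o2, o3], axis=-1)
```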
Conference Paper
Full-text available
The problem of using pictures of objects captured under ideal imaging conditions (here referred to as in vitro) to recognize objects in natural environments (in situ) is an emerging area of interest in computer vision and pattern recognition. Examples of tasks in this vein include assistive vision systems for the blind and object recognition for mobile robots; the proliferation of image databases on the web is bound to lead to more examples in the near future. Despite its importance, there is still a need for a freely available database to facilitate study of this kind of training/testing dichotomy. In this work one of our contributions is a new multimedia database of 120 grocery products, GroZi-120. For every product, two different recordings are available: in vitro images extracted from the web, and in situ images extracted from camcorder video collected inside a grocery store. As an additional contribution, we present the results of applying three commonly used object recognition/detection algorithms (color histogram matching, SIFT matching, and boosted Haar-like features) to the dataset. Finally, we analyze the successes and failures of these algorithms against product type and imaging conditions, both in terms of recognition rate and localization accuracy, in order to suggest ways forward for further research in this domain.
Conference Paper
Full-text available
Effective and efficient generation of keypoints from an image is a well-studied problem in the literature and forms the basis of numerous Computer Vision applications. Established leaders in the field are the SIFT and SURF algorithms, which exhibit great performance under a variety of image transformations, with SURF in particular considered the most computationally efficient amongst the high-performance methods to date. In this paper we propose BRISK, a novel method for keypoint detection, description and matching. A comprehensive evaluation on benchmark datasets reveals BRISK's adaptive, high-quality performance on par with state-of-the-art algorithms, albeit at a dramatically lower computational cost (an order of magnitude faster than SURF in some cases). The key to speed lies in the application of a novel scale-space FAST-based detector in combination with the assembly of a bit-string descriptor from intensity comparisons retrieved by dedicated sampling of each keypoint neighborhood.
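OpenCV ships a BRISK implementation, so the detector/descriptor pair described above can be tried in a few lines. A minimal sketch, assuming two grayscale images at hypothetical paths; the parameters are OpenCV's defaults, not necessarily the paper's settings.

```python
import cv2

# Hypothetical file names; any pair of images of the same object will do.
img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

brisk = cv2.BRISK_create()                      # default detector settings
kp1, des1 = brisk.detectAndCompute(img1, None)  # binary bit-string descriptors
kp2, des2 = brisk.detectAndCompute(img2, None)

# Binary descriptors are compared with the Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches")
```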
Conference Paper
Full-text available
Methods based on local, viewpoint invariant features have proven capable of recognizing objects in spite of viewpoint changes, occlusion and clutter. However, these approaches fail when these factors are too strong, due to the limited repeatability and discriminative power of the features. As additional shortcomings, the objects need to be rigid and only their approximate location is found. We present a novel Object Recognition approach which overcomes these limitations. An initial set of feature correspondences is first generated. The method anchors on it and then gradually explores the surrounding area, trying to construct more and more matching features, increasingly farther from the initial ones. The resulting process covers the object with matches, and simultaneously separates the correct matches from the wrong ones. Hence, recognition and segmentation are achieved at the same time. Only very few correct initial matches suffice for reliable recognition. The experimental results demonstrate the stronger power of the presented method in dealing with extensive clutter, dominant occlusion, large scale and viewpoint changes. Moreover non-rigid deformations are explicitly taken into account, and the approximative contours of the object are produced. The approach can extend any viewpoint invariant feature extractor.
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
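VOC's detection evaluation hinges on the intersection-over-union between a predicted box and a ground-truth box, with a detection counted correct above a 0.5 threshold. A small self-contained helper for that criterion (corner-format boxes; the naming is our own):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    PASCAL VOC counts a detection as correct when its IoU with a
    ground-truth box exceeds 0.5."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```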
Conference Paper
Full-text available
SIFT has been proven to be the most robust local invariant feature descriptor. SIFT is designed mainly for gray images. However, color provides valuable information in object description and matching tasks. Many objects can be misclassified if their color contents are ignored. This paper addresses this problem and proposes a novel colored local invariant feature descriptor. Instead of using the gray space to represent the input image, the proposed approach builds the SIFT descriptors in a color invariant space. The built Colored SIFT (CSIFT) is more robust than the conventional SIFT with respect to color and photometrical variations. The evaluation results support the potential of the proposed approach.
Conference Paper
Fine-grained object recognition is challenging due to large intra-class variation and inter-class ambiguity. A good algorithm should be able to: 1) discover discriminative local details and 2) align and aggregate these local discriminative patch-level features in an effective way to facilitate object-level classification. Towards this end, we propose a novel local feature discovery, discriminative alignment and aggregation framework, inspired by the recent success of deep recurrent attention models. First, we develop a novel attribute-guided attentive network to sequentially discover informative parts/regions by seeking a good registration between attentive regions and predefined object attributes. This can be considered a semantically guided salient-region discovery and alignment network, which may be more robust than a conventional attention model. Second, these discovered regions are actively and progressively fed into a recurrent neural network to yield the object-level representation. This can be considered a discriminant aggregation network: informative patch-level features are propagated and accumulated to the deeper nodes of the recurrent network for final classification. We extensively test our framework on two fine-grained image benchmarks and the results demonstrate the effectiveness of the proposed framework.
Conference Paper
Fine-Grained Visual Categorization (FGVC) has achieved significant progress recently. However, the number of fine-grained species can be huge and dynamically increasing in real scenarios, making it difficult to recognize unseen objects under the current FGVC framework. This raises an open issue: performing large-scale fine-grained identification without a complete training set. Aiming to conquer this issue, we propose a retrieval task named One-Shot Fine-Grained Instance Retrieval (OSFGIR). "One-Shot" denotes the ability to identify unseen objects through a fine-grained retrieval task assisted with an incomplete auxiliary training set. This paper first presents a detailed description of the OSFGIR task and our collected OSFGIR-378K dataset. Next, we propose the Convolutional and Normalization Networks (CN-Nets), learned on the auxiliary dataset, to generate a concise and discriminative representation. Finally, we present a coarse-to-fine retrieval framework consisting of three components: coarse retrieval, fine-grained retrieval, and query expansion. The framework progressively retrieves images with similar semantics and performs fine-grained identification. Experiments show our OSFGIR framework achieves significantly better accuracy and efficiency than existing FGVC and image retrieval methods, and thus could be a better solution for large-scale fine-grained object identification.
Conference Paper
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.
Article
Fine-grained recognition is one of the most difficult topics in visual recognition, aiming to distinguish confusing categories such as bird species within a genus. The information of parts and bounding boxes in fine-grained images is very important for improving performance. However, in real applications, part and/or bounding box annotations may not exist. This makes fine-grained recognition a challenging problem. In this paper, we propose a jointly trained Convolutional Neural Network (CNN) architecture to solve the fine-grained recognition problem without using part and bounding box information. In this framework, we first detect part candidates by calculating the gradients of feature maps of a trained CNN model w.r.t. the input image and then filter out unnecessary ones by fusing two saliency detection methods. Meanwhile, two groups of global object locations are obtained based on the saliency detection methods and a segmentation method. With the filtered part candidates and approximate object locations as inputs, we construct a CNN architecture with local parts and global discrimination (LG-CNN), which consists of two CNN streams with shared weights. The upper stream of LG-CNN is focused on the part information of the input image; the bottom stream is focused on the global input image. LG-CNN is jointly trained by the two streams' loss functions to guide the updating of the shared weights. Experiments on three popular fine-grained datasets validate the effectiveness of our proposed LG-CNN architecture. Applying LG-CNN to generic object recognition datasets also yields performance superior to a directly fine-tuned CNN architecture by a large margin.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
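A standard way to use these descriptors in practice is nearest-neighbor matching followed by Lowe's ratio test, which discards ambiguous matches before the Hough clustering and pose-verification stages. A sketch with OpenCV's SIFT (image paths are hypothetical; 0.75 is a commonly used ratio, close to the value around 0.8 suggested in the paper):

```python
import cv2

img1 = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each query descriptor take its two nearest neighbours and keep the
# match only if the best is clearly better than the second best.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} matches survive the ratio test")
```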
Article
Object detection and recognition are challenging computer vision tasks receiving great attention due to their large number of applications. This work focuses on the detection/recognition of products on supermarket shelves; this framework has a number of practical applications, such as providing additional product/price information to the user or guiding visually impaired customers during shopping. The automatic creation of planograms (i.e., the actual layout of products on shelves) is also useful for commercial analysis and the management of large stores.
Conference Paper
Large-scale product classification is an essential technique for better product understanding. It can provide support to online retailers from a number of aspects. This paper discusses CNN-based product classification in the presence of a class hierarchy. A SaCNN-MCR method is developed to address this task. It decomposes the classification into two stages. First, a spatial-attention-based CNN model that directly classifies a product to leaf classes is proposed. Compared with traditional CNNs, the proposed model focuses more on the product region rather than the whole image. Second, the output CNN scores together with class-hierarchy clues are jointly optimized by a multi-class regression (MCR) based refinement, which provides another kind of data fitting that further benefits the classification. Experiments on nearly one million real-world product images show that, based on the two innovations, SaCNN-MCR steadily improves classification performance over CNN models without these modules. Moreover, it is demonstrated that CNN features characterize product images much better than traditional features, outperforming the latter in classification by a large margin.
Article
The problem of fine-grained object recognition is very challenging due to the subtle visual differences between different object categories. In this paper, we propose a task-driven progressive part localization (TPPL) approach for fine-grained object recognition. Most existing methods follow a two-step approach that first detects salient object parts to suppress the interference from background scenes and then classifies objects based on features extracted from these regions. The part detector and object classifier are often independently designed and trained. In this paper, our major finding is that the part detector should be jointly designed and progressively refined with the object classifier so that the detected regions can provide the most distinctive features for final object recognition. Specifically, we develop a part-based SPP-net (Part-SPP) as our baseline part detector. We then establish a TPPL framework, which takes the predicted boxes of Part-SPP as an initial guess, and then examines new regions in the neighborhood using a particle swarm optimization approach, searching for more discriminative image regions to maximize the objective function and the recognition performance. This procedure is performed in an iterative manner to progressively improve the joint part detection and object classification performance. Experimental results on the Caltech-UCSD-200-2011 dataset demonstrate that our method outperforms state-of-the-art fine-grained categorization methods both in part localization and classification, even without requiring a bounding box during testing.
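The neighborhood search in TPPL uses particle swarm optimization; a generic maximization loop of that kind is sketched below. All hyperparameters and the objective callable are illustrative assumptions, not the paper's settings, and the paper's iterative coupling with the classifier is omitted.

```python
import numpy as np

def pso(objective, dim, n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5,
        bounds=(-1.0, 1.0), rng=None):
    """Minimal particle swarm optimization: each particle tracks its own
    best position (pbest) and is pulled toward both pbest and the swarm's
    global best (g). `objective` maps a position vector to a score to
    maximize; all constants here are generic textbook choices."""
    rng = np.random.default_rng(rng)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([objective(p) for p in x])
    g = pbest[pbest_val.argmax()].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([objective(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmax()].copy()
    return g, pbest_val.max()
```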
Conference Paper
We present a context-aware hybrid classification system for the problem of fine-grained product class recognition in computer vision. Recently, retail product recognition has become an interesting computer vision research topic. We focus on the classification of products on shelves in a store. This is a very challenging classification problem because many product classes are visually similar in terms of shape, color, texture, and metric size. In shelves, same or similar products are more likely to appear adjacent to each other and displayed in certain arrangements rather than at random. The arrangement of the products on the shelves has a spatial continuity both in brand and metric size. By using this context information, the co-occurrence of the products and the adjacency relations between the products can be statistically modeled. The proposed hybrid approach improves the accuracy of context-free image classifiers such as Support Vector Machines (SVMs), by combining them with a probabilistic graphical model such as Hidden Markov Models (HMMs) or Conditional Random Fields (CRFs). The fundamental goal of this paper is using contextual relationships in retail shelves to improve the classification accuracy by executing a context-aware approach.
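When the context model is an HMM over a shelf row, decoding the jointly most likely product sequence is a Viterbi pass over the per-position classifier scores and the learned transition statistics. A minimal NumPy sketch under that assumption (log-domain inputs; the array-shape conventions are ours, not the paper's):

```python
import numpy as np

def viterbi(emission_logp, transition_logp, prior_logp):
    """Most likely sequence of product classes along one shelf row.
    emission_logp:   (T, K) log-scores from a context-free classifier
    transition_logp: (K, K) log-probability of class k following class j
    prior_logp:      (K,)   log-prior over classes for the first position."""
    T, K = emission_logp.shape
    score = prior_logp + emission_logp[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition_logp   # (prev class, cur class)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emission_logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```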
Conference Paper
This paper considers the problem of fine-grained image recognition with a growing vocabulary. Since in many real world applications we often have to add a new object category or visual concept with just a few images to learn from, it is crucial to develop a method that is able to generalize the recognition model from existing classes to new classes. Deep convolutional neural networks are capable of constructing powerful image representations; however, these networks usually rely on a logistic loss function that cannot handle the incremental learning problem. In this paper, we present a new method that can efficiently learn a new class given only a limited number of training examples, which we evaluate on the problems of food and clothing recognition. To illustrate the performance of our proposed method on the task of recognizing different kinds of food, when using only 1.3% of training examples per category we achieved about 73% of the performance (as measured by F1-score) compared to when using all available training data.
Conference Paper
Large-scale instance-level image retrieval aims at retrieving specific instances of objects or scenes. Simultaneously retrieving multiple objects in a test image adds to the difficulty of the problem, especially if the objects are visually similar. This paper presents an efficient approach for per-exemplar multi-label image classification, which targets the recognition and localization of products in retail store images. We achieve runtime efficiency through the use of discriminative random forests, deformable dense pixel matching and genetic algorithm optimization. Cross-dataset recognition is performed, where our training images are taken in ideal conditions with only one single training image per product label, while the evaluation set is taken using a mobile phone in real-life scenarios in completely different conditions. In addition, we provide a large novel dataset and labeling tools for products image search, to motivate further research efforts on multi-label retail products image classification. The proposed approach achieves promising results in terms of both accuracy and runtime efficiency on 680 annotated images of our dataset, and 885 test images of GroZi-120 dataset. We make our dataset of 8350 different product images and the 680 test images from retail stores with complete annotations available to the wider community.
Conference Paper
We address the problem of large-scale fine-grained visual categorization, describing new methods we have used to produce an online field guide to 500 North American bird species. We focus on the challenges raised when such a system is asked to distinguish between highly similar species of birds. First, we introduce 'one-vs-most classifiers.' By eliminating highly similar species during training, these classifiers achieve more accurate and intuitive results than common one-vs-all classifiers. Second, we show how to estimate spatio-temporal class priors from observations that are sampled at irregular and biased locations. We show how these priors can be used to significantly improve performance. We then show state-of-the-art recognition performance on a new, large dataset that we make publicly available. These recognition methods are integrated into the online field guide, which is also publicly available.
Article
We develop a hierarchical Bayesian model that learns to learn categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.
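The core transfer step can be illustrated with the standard normal-normal conjugate update: a single example is shrunk toward the inferred super-category prior. This toy sketch assumes a known prior strength kappa_0 and fixed variances, which is a deliberate simplification of the paper's full hierarchical model.

```python
import numpy as np

def posterior_mean(x, prior_mean, prior_strength=5.0):
    """Normal-normal conjugate update with n = 1 observation: the new
    category mean is the single example pulled toward the super-category
    prior. prior_strength (kappa_0) acts like a pseudo-count and is an
    assumed hyperparameter, not a value from the paper."""
    n = 1
    return (prior_strength * prior_mean + n * x) / (prior_strength + n)

# One example of a novel category, shrunk toward its super-category prior:
new_mean = posterior_mean(np.array([2.0, 0.5]), prior_mean=np.zeros(2))
```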
Article
Fine-grained recognition concerns categorization at sub-ordinate levels, where the distinction between object classes is highly local. Compared to basic level recognition, fine-grained categorization can be more challenging as there are in general less data and fewer discriminative features. This necessitates the use of a stronger prior for feature selection. In this work, we include humans in the loop to help computers select discriminative features. We introduce a novel online game called “Bubbles” that reveals discriminative features humans use. The player’s goal is to identify the category of a heavily blurred image. During the game, the player can choose to reveal full details of circular regions (“bubbles”), with a certain penalty. With proper setup the game generates discriminative bubbles with assured quality. We next propose the “BubbleBank” representation that uses the human selected bubbles to improve machine recognition performance. Finally, we demonstrate how to extend BubbleBank to a view-invariant 3D representation. Experiments demonstrate that our approach yields large improvements over the previous state of the art on challenging benchmarks.
Article
Using image analytics to monitor the contents and status of retail store shelves is an emerging trend with increasing business importance. Detecting and identifying multiple objects on store shelves involves a number of technical challenges. The particular nature of product package design, the arrangement of products on shelves, and the requirement to operate in unconstrained environments are just a few of the issues that must be addressed. We explain how we addressed these challenges in a system for monitoring planogram compliance, developed as part of a project with Tesco, a large multinational retailer. The new system offers store personnel an instant view of shelf status and a list of action items for restocking shelves. The core of the system is based on its ability to achieve high rates of product recognition, despite the very small visual differences between some products. This paper covers how state-of-the-art methods for object detection behave when applied to this problem. We also describe the innovative aspects of our implementation for size-scale-invariant product recognition and fine-grained classification.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
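The configurations described here stack small 3x3 convolutions into repeated stages. Below is a PyTorch-style sketch of one such stage, with the 16-layer (VGG-16) stage widths noted in a comment; this is our own helper, not the authors' code.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """One VGG stage: a stack of 3x3 convolutions (stride 1, padding 1)
    with ReLU, followed by 2x2 max pooling. Repeating such stages is how
    the paper reaches 16-19 weight layers."""
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# The five convolutional stages of the 16-layer configuration:
# vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
# vgg_block(256, 512, 3), vgg_block(512, 512, 3), then three FC layers.
```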
Article
Learning fine-grained image similarity is a challenging task. It needs to capture between-class and within-class image differences. This paper proposes a deep ranking model that employs deep learning techniques to learn a similarity metric directly from images. It has higher learning capability than models based on hand-crafted features. A novel multiscale network structure has been developed to describe the images effectively. An efficient triplet sampling algorithm is proposed to learn the model with distributed asynchronized stochastic gradient. Extensive experiments show that the proposed algorithm outperforms models based on hand-crafted visual features and deep classification models.
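The ranking objective behind such triplet sampling is typically a hinge loss on pairwise embedding distances. A generic NumPy sketch of that loss follows; the exact margin and distance used in the paper may differ.

```python
import numpy as np

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss on squared Euclidean distances: push the anchor-positive
    distance below the anchor-negative distance by at least `margin`.
    Inputs are (batch, dim) embedding arrays; this is a generic
    formulation, not necessarily the paper's exact variant."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()
```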
Conference Paper
We propose a segmentation algorithm for the purposes of large-scale flower species recognition. Our approach is based on identifying potential object regions at the time of detection. We then apply a Laplacian-based segmentation, which is guided by these initially detected regions. More specifically, we show that 1) recognizing parts of the potential object helps the segmentation and makes it more robust to variabilities in both the background and the object appearances, 2) segmenting the object of interest at test time is beneficial for the subsequent recognition. Here we consider a large-scale dataset containing 578 flower species and 250,000 images. This dataset is developed by our team for the purposes of providing a flower recognition application for general use and is the largest in its scale and scope. We tested the proposed segmentation algorithm on the well-known 102 Oxford flowers benchmark [11] and on the new challenging large-scale 578 flower dataset, that we have collected. We observed about 4% improvements in the recognition performance on both datasets compared to the baseline. The algorithm also improves all other known results on the Oxford 102 flower benchmark dataset. Furthermore, our method is both simpler and faster than other related approaches, e.g. [3, 14], and can be potentially applicable to other subcategory recognition datasets.
Article
A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. The authors describe the application of RANSAC to the Location Determination Problem (LDP): given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing and analysis conditions. Implementation details and computational examples are also presented.
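The RANSAC loop itself is short: fit a model to a minimal random sample, count the consensus set, keep the best hypothesis. A toy line-fitting sketch, where the threshold and iteration count are illustrative and a real use would add the paper's probabilistic stopping criterion:

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_thresh=1.0, rng=None):
    """Fit a 2D line y = a*x + b by RANSAC. points is an (N, 2) array;
    each iteration fits to a minimal sample of 2 points and scores the
    model by its number of inliers under a residual threshold."""
    rng = np.random.default_rng(rng)
    best_model, best_inliers = None, 0
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue                    # skip degenerate vertical sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = int((residuals < inlier_thresh).sum())
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```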
Article
Learning visual models of object categories notoriously requires hundreds or thousands of training examples. We show that it is possible to learn much information about a category from just one, or a handful, of images. The key insight is that, rather than learning from scratch, one can take advantage of knowledge coming from previously learned categories, no matter how different these categories might be. We explore a Bayesian implementation of this idea. Object categories are represented by probabilistic models. Prior knowledge is represented as a probability density function on the parameters of these models. The posterior model for an object category is obtained by updating the prior in the light of one or more observations. We test a simple implementation of our algorithm on a database of 101 diverse object categories. We compare category models learned by an implementation of our Bayesian approach to models learned by Maximum Likelihood (ML) and Maximum A Posteriori (MAP) methods. We find that on a database of more than 100 categories, the Bayesian approach produces informative models when the number of training examples is too small for other methods to operate successfully.
Conference Paper
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s "gist" and Lowe’s SIFT descriptors.
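The representation is simply a concatenation of level-weighted visual-word histograms over successively finer grids. A compact NumPy sketch using the standard level weights (w_0 = 1/2^L and w_l = 1/2^(L-l+1) for l >= 1); the variable names and interface are ours.

```python
import numpy as np

def spatial_pyramid(codes, xy, n_words, levels=3):
    """Concatenated visual-word histograms over a pyramid of grids.
    codes: (N,) visual-word index of each local feature
    xy:    (N, 2) feature coordinates normalized to [0, 1)
    Level l uses a 2^l x 2^l grid; weights halve toward coarse levels."""
    L = levels - 1
    parts = []
    for l in range(levels):
        g = 2 ** l
        cell = np.minimum((xy * g).astype(int), g - 1)   # grid cell per feature
        cell_id = cell[:, 0] * g + cell[:, 1]
        hist = np.zeros(g * g * n_words)
        np.add.at(hist, cell_id * n_words + codes, 1.0)  # per-cell histograms
        w = 1.0 / 2 ** L if l == 0 else 1.0 / 2 ** (L - l + 1)
        parts.append(w * hist)
    return np.concatenate(parts)
```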
Article
This paper introduces a texture representation suitable for recognizing images of textured surfaces under a wide range of transformations, including viewpoint changes and nonrigid deformations. At the feature extraction stage, a sparse set of affine-invariant local patches is extracted from the image. This spatial selection process permits the computation of characteristic scale and neighborhood shape for every texture element. The proposed texture representation is evaluated in retrieval and classification tasks using the entire Brodatz database and a collection of photographs of textured surfaces taken from different viewpoints.
Article
The Hausdorff distance measures the extent to which each point of a model set lies near some point of an image set and vice versa. Thus, this distance can be used to determine the degree of resemblance between two objects that are superimposed on one another. Efficient algorithms for computing the Hausdorff distance between all possible relative positions of a binary image and a model are presented. The focus is primarily on the case in which the model is only allowed to translate with respect to the image. The techniques are extended to rigid motion. The Hausdorff distance computation differs from many other shape comparison methods in that no correspondence between the model and the image is derived. The method is quite tolerant of small position errors such as those that occur with edge detectors and other feature extraction methods. It is shown that the method extends naturally to the problem of comparing a portion of a model against an image.
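For point sets, the directed distance h(A, B) = max over a in A of the distance to its nearest b in B, and the symmetric Hausdorff distance is max(h(A, B), h(B, A)). SciPy also provides scipy.spatial.distance.directed_hausdorff; the brute-force NumPy sketch below is our own and uses O(|A||B|) memory, which is fine for small sets but not the efficient algorithms the paper develops.

```python
import numpy as np

def directed_hausdorff(A, B):
    """h(A, B): for every point of A, the distance to its nearest point
    in B; return the worst such distance. A and B are (n, d) arrays."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise
    return d.min(axis=1).max()

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```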