Learning CNN-based Features for Retrieval of Food Images

Gianluigi Ciocca, Paolo Napoletano, and Raimondo Schettini
DISCo (Dipartimento di Informatica, Sistemistica e Comunicazione),
Università degli Studi di Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy
Abstract. Recently, a large amount of work has been devoted to developing
Convolutional Neural Networks (CNNs) for supervised food recognition. These
CNNs are trained to classify a predefined set of food classes within a
specific food dataset. CNN-based features have been widely tested in many
image retrieval domains but, to a lesser extent, in the food domain. In this
paper, we investigate the use of CNN-based features for food retrieval by
taking advantage of existing food datasets. To this end, we have built
Food524DB, the largest publicly available food dataset, with 524 food classes
and 247,636 images, by merging food classes from existing state-of-the-art
datasets. We have then used this dataset to fine-tune a Residual Network,
ResNet-50, which has been shown to be very effective for image recognition.
The last fully connected layer is finally used as the feature vector for food
image indexing and retrieval. Experimental results are reported on the
UNICT-FD1200 dataset, which was specifically designed for food retrieval.
Keywords: Food retrieval · Food dataset · Food recognition · CNN-based features
1 Introduction
Recently, food recognition has received a considerable amount of attention due to
the importance of monitoring food consumption for a balanced and healthy
diet. To this end, computer vision techniques can help to build systems to
automatically recognize diverse foods and to estimate the food quantity. Many
works exist in the literature that exploit hand-crafted visual features for
food recognition and quantity estimation both for desktop as well as mobile
applications [1,3,17,27,28].
With the advent of practical techniques for training large convolutional
neural networks, hand-crafted features are being reconsidered in favor of learned
ones [30]. Features learned by deep convolutional neural networks (CNNs) have
been recognized to be more robust and expressive than hand-crafted ones. They
have been successfully used in different computer vision tasks such as object
detection, pattern recognition and image understanding. It is not surprising that
a number of studies have investigated the use of deep neural networks for food
recognition as well. Table 1 shows the most notable works on food recognition
using deep learning techniques, along with the datasets on which they have been
evaluated and their performance in terms of Top-1 and Top-5 classification accuracy.
© Springer International Publishing AG 2017
S. Battiato et al. (Eds.): ICIAP 2017 International Workshops, LNCS 10590, pp. 426–434, 2017.
Table 1. Performance of food recognition methods using deep learning techniques.
Reference                 Network          Dataset       Top-1 (%)  Top-5 (%)
Kawano et al. [22]        DeepFoodCam      UECFOOD-100   72.26      92.00
                                           UECFOOD-256   63.77      85.82
Yanai et al. [32]         DCNN-FOOD(ft)    UECFOOD-100   78.48      94.85
                                           UECFOOD-256   67.57      88.97
                                           Food-101      70.41      -
Liu et al. [23]           DeepFood         UECFOOD-100   76.30      94.60
                                           UECFOOD-256   54.70      81.50
                                           Food-101      77.40      93.70
Hassannejad et al. [15]   Inception V3     UECFOOD-100   81.45      97.27
                                           UECFOOD-256   76.17      92.58
                                           Food-101      88.28      96.88
Martinel et al. [25]      WISeR            UECFOOD-100   89.58      99.23
                                           UECFOOD-256   83.15      95.45
                                           Food-101      90.27      98.71
Chen et al. [6]           MultiTaskDCNN    UECFOOD-100   82.12      97.29
                                           VIREO         82.05      95.88
Training a Convolutional Neural Network requires a large dataset to build a
classification model. When such a dataset is not available, models pre-trained
on a different dataset are often fine-tuned using a small dataset specific to
the classification task at hand. Since the larger and more heterogeneous the
dataset is, the more powerful the models the network can learn, for the food
retrieval task we have decided to create a very large food dataset starting
from existing ones. We have analyzed the public datasets and merged some of
them depending on their availability and characteristics, thus creating the
largest food dataset available in the literature, with 524 food classes and
247,636 images. The smallest class contains 100 images, while the largest
contains about 1,700. We exploit this dataset for learning robust features for
food retrieval using a Residual Network. Our intuition is that, since this
dataset has more food classes than the ones used in previous works, the network
should be more powerful and generalize better, and thus the extracted features
should be more expressive.
Table 2. List of food datasets used in the literature. S: Single instance food images.
M: Multi-instance food images.
Name               Year  #Images    #Classes  Type  Reference
Food50             2009  5,000      50        S     [20]
PFID               2009  1,098 (a)  61 (a)    S     [7]
TADA               2009  50/256     -         S, M  [24]
Food85 (b)         2010  8,500      85        S     [18]
Chen               2012  5,000      50        S     [8]
UECFOOD-100        2012  9,060      100       S, M  [26]
Food-101           2014  101,000    101       S     [5]
UECFOOD-256 (c)    2014  31,395     256       S, M  [21]
UNICT-FD889        2014  3,583      889       S     [14]
Diabetes           2014  4,868      11        S     [2]
UMPCFood-101 (d)   2015  90,993     101       S     [31]
UNIMIB2015         2015  1,000 × 2  15        M     [9]
UNICT-FD1200 (e)   2016  4,754      1,200     S     [13]
UNIMIB2016         2016  1,027      73        M     [10]
VIREO              2016  110,241    172       S     [6]
Food524DB          2017  247,636    524       S     -
(a) Numbers refer to the baseline dataset.
(b) Includes Food50.
(c) Includes UECFOOD-100.
(d) Includes same classes of Food-101.
(e) Includes UNICT-FD889.
2 CNN-based Features for Food Retrieval
Domain adaptation, also known as transfer learning or fine tuning, is a machine
learning procedure designed to adapt a classification model trained on one set
of data to work on a different set of data. The importance and usefulness of
domain adaptation have been widely discussed in the food recognition
literature [4,11,12,22,23,25,32]. Taking inspiration from these works, in this
paper we fine-tuned a CNN architecture using a large, heterogeneous food
dataset, namely Food524DB. The rationale behind the creation of Food524DB is
that building a robust food recognition algorithm requires a large image dataset
of different food instances.
2.1 The Food524DB Food Dataset
Table 2 summarizes the characteristics of the food datasets that can be found in
the literature. For each dataset, we report its size, the number of food
classes and the type of images it contains: either single, i.e. each image depicts
a single food category, or multi, i.e. the images can contain multiple food classes.
We decided to consider only datasets that are publicly available, have many
food classes and, most importantly, represent each food category with at least
100 images. After analyzing the available datasets, we finally selected Food50,
Food-101, UECFOOD-256, and VIREO. Since UECFOOD-256 contains multi-food
instance images, we extracted each food region from these images using the
bounding boxes provided in the ground truth. The combined dataset is thus
composed of 247,636 images grouped in 579 food classes, making it the largest
and most comprehensive food dataset available for training food classifiers.
Some food classes are present in more than one of the four datasets. For
example, both UECFOOD-256 and Food-101 contain the “apple pie” category;
UECFOOD-256 contains the “beef noodle” category while the VIREO dataset
contains the “Beef noodles” category. In order to remove these redundancies,
we applied a category merging procedure based on the category names. After this
procedure, the number of food classes in our dataset, which we named Food524DB,
is reduced to 524, as reported in the last row of Table 2.
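The text states only that the merge is based on category names, so the following is a minimal sketch of one way such a name-based merge could work, not the authors' procedure. The normalization rules (lower-casing, stripping punctuation, a crude plural-stripping heuristic) and the function names `normalize` and `merge_categories` are assumptions introduced for illustration.

```python
import re
from collections import defaultdict

def normalize(name):
    """Map a raw class name to a merge key (assumed rules, not the paper's)."""
    key = re.sub(r"[^a-z0-9 ]+", " ", name.lower())  # ignore case and punctuation
    words = []
    for w in key.split():
        # Crude singularization: "noodles" -> "noodle", but keep "swiss".
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        words.append(w)
    return " ".join(words)

def merge_categories(datasets):
    """datasets: {dataset name: [class names]} -> {merge key: [(dataset, name)]}."""
    merged = defaultdict(list)
    for dataset, names in datasets.items():
        for name in names:
            merged[normalize(name)].append((dataset, name))
    return dict(merged)

# Toy input built from the examples mentioned in the text.
toy = {
    "UECFOOD-256": ["apple pie", "beef noodle", "miso soup"],
    "Food-101": ["Apple pie", "hamburger"],
    "VIREO": ["Beef noodles"],
}
merged = merge_categories(toy)
for key, members in sorted(merged.items()):
    print(key, members)
```

On the toy input the six raw classes collapse to four merged classes. A heuristic of this kind can both miss synonyms and merge distinct dishes, so in practice the resulting groups would likely need manual review.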
Fig. 1. Distribution of the cardinalities of the Food524DB food classes. Class names
are shown for one class in every ten.
The sizes of the 524 food classes are reported in Fig. 1. The smallest food
category contains 100 images; 241 classes contain between 100 and 199 images,
58 classes contain between 200 and 499 images, 113 contain between 500
and 999 images, and 112 contain more than 1,000 images. The five largest classes
are: “Miso Soup” with 1,728 images; “Rice” with 1,499 images; “Spaghetti alla
Bolognese” with 1,462 images; “Hamburger” with 1,333 images; and “Fried Rice”
with 1,269 images. Food524DB is publicly available at http://www.ivl.disco.
2.2 CNN-based Food Features
The CNN-based features proposed in this paper have been obtained by exploiting
a deep residual architecture. Residual architectures are based on the idea that
each layer of the network learns residual functions with reference to the layer
inputs instead of learning unreferenced functions. Such architectures have been
shown to be easier to optimize and to gain accuracy from considerably increased
depth [16].
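The residual idea can be illustrated with a toy sketch: a block outputs its input plus a learned residual F(x), so a block whose residual is zero behaves as the identity. The code below is a simplification on plain Python lists; the names `residual_block` and `zero_fn` are hypothetical, and the convolutions, normalization and non-linearities of a real ResNet block are omitted.

```python
def residual_block(x, residual_fn):
    """Return x + F(x), where residual_fn plays the role of the learned F."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]

def zero_fn(x):
    """A residual function that has learned nothing: F(x) = 0."""
    return [0.0] * len(x)

def half_fn(x):
    """An arbitrary non-trivial residual, used only for illustration."""
    return [0.5 * xi for xi in x]

identity_out = residual_block([1.0, -2.0, 3.0], zero_fn)
shifted_out = residual_block([1.0, -2.0, 3.0], half_fn)
print(identity_out, shifted_out)
```

With a zero residual the input passes through unchanged, which gives an intuition for why very deep stacks of such blocks remain trainable.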
Our network architecture is based on ResNet-50, which represents a good
trade-off between depth and performance. ResNet-50 proved to be very
effective on the ILSVRC 2015 (ImageNet Large Scale Visual Recognition
Challenge) validation set, with a top-1 recognition accuracy of about 80% [16].
We did not train the ResNet-50 from scratch on Food524DB because the number of
images for each class is not sufficient. As in previous work on this topic
[22,25], we started from a ResNet-50 pre-trained on the ILSVRC2012 image
classification dataset [29]. The Food524DB dataset has been split into 80%
training data and 20% test data. During the fine-tuning stage, each image has
been resized to 256 × 256 and a random 224 × 224 crop has been taken. We
augmented the data with horizontal flipping. During the test stage we considered
a single central 224 × 224 crop from the 256 × 256 resized image.
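The cropping scheme described above can be sketched as follows. This is an illustrative stand-in for the Caffe data layer, written on nested lists rather than real images; the helper names `crop_box` and `apply_crop` and the 0.5 flip probability are assumptions, since the text only says that horizontal flipping was used.

```python
import random

RESIZED, CROP = 256, 224

def crop_box(train, rng=random):
    """Return (left, top, right, bottom) of a 224x224 crop in a 256x256 image."""
    if train:
        left = rng.randint(0, RESIZED - CROP)  # inclusive bounds, 0..32
        top = rng.randint(0, RESIZED - CROP)
    else:
        left = top = (RESIZED - CROP) // 2     # central crop, offset 16
    return left, top, left + CROP, top + CROP

def apply_crop(image, box, flip=False):
    """image is a list of rows; crop it, then optionally mirror each row."""
    left, top, right, bottom = box
    rows = [row[left:right] for row in image[top:bottom]]
    return [row[::-1] for row in rows] if flip else rows

# Dummy 256x256 "image" whose pixels store their own (row, column) position.
image = [[(r, c) for c in range(RESIZED)] for r in range(RESIZED)]

test_crop = apply_crop(image, crop_box(train=False))
train_crop = apply_crop(image, crop_box(train=True), flip=random.random() < 0.5)
print(test_crop[0][0], len(train_crop), len(train_crop[0]))
```

The test-time crop is deterministic, so repeated evaluations of the same image yield the same input to the network, while training crops vary from epoch to epoch.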
The ResNet-50 has been trained via stochastic gradient descent with a
mini-batch of 16 images. We set the initial learning rate to 0.01, with a
learning rate update every 5K iterations. The network has been trained within the
Caffe [19] framework on a PC equipped with an NVIDIA Tesla K40 GPU. The
classification accuracy of the ResNet-50 fine-tuned on the Food524DB dataset
is 69.52% for the Top-1 and 89.61% for the Top-5.
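For readers unfamiliar with the Top-1 and Top-5 figures, the sketch below shows how a Top-k accuracy is commonly computed from per-class scores. The function name `top_k_accuracy` and the toy scores are illustrative assumptions and are unrelated to the actual network outputs.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return 100.0 * hits / len(labels)

# Toy example with three classes and four samples.
scores = [
    [0.1, 0.7, 0.2],  # true class 1, ranked first
    [0.5, 0.3, 0.2],  # true class 1, ranked second
    [0.6, 0.3, 0.1],  # true class 2, ranked third
    [0.2, 0.2, 0.6],  # true class 2, ranked first
]
labels = [1, 1, 2, 2]
top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
print(top1, top2)
```

Top-k accuracy can only grow with k, which is why the Top-5 value reported above is higher than the Top-1 value.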
In the following experiments, the ResNet-50 is used as a feature extractor.
The activations of the neurons in the fully connected layer are used as features
for the retrieval of food images. The resulting feature vectors have size 2,048.
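Once every image is described by a feature vector, retrieval amounts to ranking the database images by their similarity to the query. The sketch below assumes Euclidean distance, which is a choice made here for illustration because the text does not state the distance measure; the 3-dimensional vectors stand in for the 2,048-dimensional features, and `rank_database` is a hypothetical helper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_database(query, database):
    """Return database indices ordered from most to least similar to the query."""
    return sorted(range(len(database)), key=lambda i: euclidean(query, database[i]))

# Toy features; real ones would come from the fine-tuned network.
database = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
]
query = [1.0, 0.1, 0.0]
ranking = rank_database(query, database)
print(ranking)
```

An exhaustive scan like this is adequate for a database of about a thousand images; larger collections would usually call for an approximate nearest-neighbour index.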
3 Food Retrieval Experiments
We have evaluated the performance of our network on the UNICT-FD1200 dataset,
chosen because it was specifically designed for food retrieval.
The UNICT-FD1200 dataset is composed of 4,754 images of 1,200 distinct
dishes of different nationalities. We followed the evaluation procedure
described in the original paper [13]. Specifically, the dataset is divided into
a training set of 1,200 images and a test set with the remaining ones. The
three training/test splits provided by the authors of the dataset are considered.
The overall retrieval performance is measured as the average over the three splits.
The retrieval performance is measured using the P(n) quality metric and
the mean Average Precision (mAP). P(n) is based on the top-n criterion:
P(n) = Q_n/Q, where Q is the number of queries (test images) and Q_n is the
number of queries for which the correct image is among the first n retrieved
images [13]. For the retrieval task, the images in the training set are
considered as database images, while the images in the test set are the queries.
Moreover, for each query there is one correct image to be retrieved. We also
report the Top-1 recognition accuracy.
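Because each query has exactly one correct database image, both metrics can be computed from the rank at which that image is retrieved: P(n) is the fraction of queries with rank at most n, and the Average Precision of a query reduces to 1/rank. The sketch below illustrates this on made-up ranks; the function names and the example values are assumptions and do not reproduce the reported results.

```python
def p_at_n(ranks, n):
    """P(n) = Q_n / Q, where ranks[i] is the 1-based rank of the correct image."""
    return sum(r <= n for r in ranks) / len(ranks)

def mean_average_precision(ranks):
    """With a single relevant image per query, AP is the reciprocal of its rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Four hypothetical queries: two hits at rank 1, one at rank 2, one at rank 4.
ranks = [1, 1, 2, 4]
p1 = p_at_n(ranks, 1)
p2 = p_at_n(ranks, 2)
map_score = mean_average_precision(ranks)
print(p1, p2, map_score)
```

Here P(1) is 0.5, P(2) is 0.75 and the mAP is 0.6875. P(n) is non-decreasing in n and reaches 1 once n covers the worst rank.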
Table 3 shows the retrieval results obtained on the UNICT-FD1200 dataset.
We compare the features extracted with the fine-tuned network,
“Activations ResNet-50 (Food524DB)”, against those obtained with the
original network, “Activations ResNet-50 (ImageNet)”, and against the
hand-crafted features used in [13]. As can be seen, the fine-tuned network
outperforms all the other methods in the classification task as well as in the
retrieval task. As expected, the learned features greatly outperform the
hand-crafted ones. Fine-tuning the ResNet-50 improves the Top-1 accuracy by
about 3 percentage points and the mAP by 2.4 percentage points. Figure 2 shows
the P(n) curves of the methods in Table 3. The CNN-based features effectively
retrieve the relevant images in the first positions.
Table 3. Classification and retrieval results on the UNICT-FD1200 dataset.
Representation                            Top-1 (%)  mAP (%)
Bag of SIFT 12000 [13]                    21.81      29.14
Textons (MR8) - RGB - Global [13]         71.55      77.00
Textons (Schmidt) - Lab - Global [13]     87.44      90.06
Activations ResNet-50 (ImageNet)          91.84      94.15
Activations ResNet-50 (Food524DB)         94.96      96.56
Fig. 2. P(n) curves of the methods in Table 3.
4 Conclusions
In this paper we investigated the use of CNN-based features for food retrieval.
To accomplish this task we created the Food524DB dataset by merging food
classes from existing state-of-the-art datasets. To date, Food524DB is the
largest publicly available food dataset, with 524 food classes and 247,636
images. The proposed CNN-based features have been obtained from a Residual
Network (ResNet-50) fine-tuned on Food524DB. The evaluation has been carried
out on the UNICT-FD1200 dataset, a dataset specifically designed for food
retrieval, with 1,200 classes. The results demonstrate the effectiveness of the
proposed CNN-based features with respect to CNN-based features extracted from
the same network architecture trained on ImageNet images, and with respect to
the state-of-the-art features evaluated on the same dataset.
Acknowledgements. We gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Tesla K40 GPU used for this research.
References
1. Akpro Hippocrate, E.A., Suwa, H., Arakawa, Y., Yasumoto, K.: Food weight estimation using smartphone and cutlery. In: Proceedings of the First Workshop on IoT-enabled Healthcare and Wellness Technologies and Systems, IoT of Health 2016, pp. 9–14. ACM (2016)
2. Anthimopoulos, M.M., Gianola, L., Scarnato, L., Diem, P., Mougiakakou, S.G.: A food recognition system for diabetic patients based on an optimized bag-of-features model. IEEE J. Biomed. Health Inf. 18(4), 1261–1271 (2014)
3. Bettadapura, V., Thomaz, E., Parnami, A., Abowd, G., Essa, I.: Leveraging context to support automated food recognition in restaurants. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 580–587 (2015)
4. Bianco, S., Ciocca, G., Napoletano, P., Schettini, R., Margherita, R., Marini, G., Pantaleo, G.: Cooking action recognition with iVAT: an interactive video annotation tool. In: Petrosino, A. (ed.) ICIAP 2013. LNCS, vol. 8157, pp. 631–641. Springer, Heidelberg (2013)
5. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014)
6. Chen, J., Ngo, C.W.: Deep-based ingredient recognition for cooking recipe retrieval. In: Proceedings of the 2016 ACM on Multimedia Conference, pp. 32–41. ACM
7. Chen, M., Dhingra, K., Wu, W., Yang, L., Sukthankar, R., Yang, J.: PFID: pittsburgh fast-food image dataset. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 289–292. IEEE (2009)
8. Chen, M.Y., Yang, Y.H., Ho, C.J., Wang, S.H., Liu, S.M., Chang, E., Yeh, C.H., Ouhyoung, M.: Automatic chinese food identification and quantity estimation. In: SIGGRAPH Asia 2012 Technical Briefs, p. 29. ACM (2012)
9. Ciocca, G., Napoletano, P., Schettini, R.: Food recognition and leftover estimation for daily diet monitoring. In: Murino, V., Puppo, E., Sona, D., Cristani, M., Sansone, C. (eds.) ICIAP 2015. LNCS, vol. 9281, pp. 334–341. Springer, Cham (2015)
10. Ciocca, G., Napoletano, P., Schettini, R.: Food recognition: a new dataset, experiments and results. IEEE J. Biomed. Health Inf. 21(3), 588–598 (2017)
11. Cusano, C., Napoletano, P., Schettini, R.: Intensity and color descriptors for texture classification. In: IS&T/SPIE Electronic Imaging, p. 866113. International Society for Optics and Photonics (2013)
12. Cusano, C., Napoletano, P., Schettini, R.: Combining local binary patterns and local color contrast for texture classification under varying illumination. JOSA A 31(7), 1453–1461 (2014)
13. Farinella, G.M., Allegra, D., Moltisanti, M., Stanco, F., Battiato, S.: Retrieval and classification of food images. Comput. Biol. Med. 77, 23–39 (2016)
14. Farinella, G.M., Allegra, D., Stanco, F.: A benchmark dataset to study the representation of food images. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 584–599. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16199-0_41
15. Hassannejad, H., Matrella, G., Ciampolini, P., De Munari, I., Mordonini, M., Cagnoni, S.: Food image recognition using very deep convolutional networks. In: Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, MADiMa 2016, pp. 41–49. ACM (2016)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
17. He, Y., Xu, C., Khanna, N., Boushey, C., Delp, E.: Analysis of food images: features and classification. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 2744–2748 (2014)
18. Hoashi, H., Joutou, T., Yanai, K.: Image recognition of 85 food categories by feature fusion. In: IEEE International Symposium on Multimedia (ISM) 2010, pp. 296–301. IEEE (2010)
19. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
20. Joutou, T., Yanai, K.: A food image recognition system with multiple kernel learning. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 285–288. IEEE (2009)
21. Kawano, Y., Yanai, K.: Automatic expansion of a food image dataset leveraging existing categories with domain adaptation. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8927, pp. 3–17. Springer, Cham (2015)
22. Kawano, Y., Yanai, K.: Food image recognition with deep convolutional features. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp 2014 Adjunct, pp. 589–593 (2014)
23. Liu, C., Cao, Y., Luo, Y., Chen, G., Vokkarane, V., Ma, Y.: DeepFood: deep learning-based food image recognition for computer-aided dietary assessment. In: Chang, C.K., Chiari, L., Cao, Y., Jin, H., Mokhtari, M., Aloulou, H. (eds.) ICOST 2016. LNCS, vol. 9677, pp. 37–48. Springer, Cham (2016)
24. Mariappan, A., Bosch, M., Zhu, F., Boushey, C.J., Kerr, D.A., Ebert, D.S., Delp, E.J.: Personal dietary assessment using mobile devices, vol. 7246, pp. 72460Z-1–72460Z-12 (2009)
25. Martinel, N., Foresti, G.L., Micheloni, C.: Wide-slice residual networks for food recognition. arXiv preprint arXiv:1612.06543 (2016)
26. Matsuda, Y., Hoashi, H., Yanai, K.: Recognition of multiple-food images by detecting candidate regions. In: 2012 IEEE International Conference on Multimedia and Expo (ICME), pp. 25–30 (2012)
27. Nguyen, D.T., Zong, Z., Ogunbona, P.O., Probst, Y., Li, W.: Food image classification using local appearance and global structural information. Neurocomputing 140, 242–251 (2014)
28. Pouladzadeh, P., Kuhad, P., Peddi, S.V.B., Yassine, A., Shirmohammadi, S.: Food calorie measurement using deep learning neural network. In: IEEE International Instrumentation and Measurement Technology Conference, pp. 1–6 (2016)
29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252
30. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813 (2014)
31. Wang, X., Kumar, D., Thome, N., Cord, M., Precioso, F.: Recipe recognition with large multimodal food dataset. In: 2015 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–6. IEEE (2015)
32. Yanai, K., Kawano, Y.: Food image recognition using deep convolutional network with pre-training and fine-tuning. In: 2015 IEEE International Conference on Multimedia Expo Workshops (ICMEW), pp. 1–6 (2015)
... Azizah et al. [15] and Suistika et al. [16] studied CNN models for automatic classifcation of mangosteen and strawberry, respectively. Ciocca et al. [17] researched convolutional neural networks to classify food productions and studied CNN in food information retrieval. Hameed et al. [18] proposed that the classifcation of fresh produce, such as fruits and vegetables, has become a complex problem, and convolutional neural networks are considered a promising approach for its application. ...
Full-text available
Food quality detection is an important method for ensuring food safety. Efficient quality detection methods can improve the efficiency of food circulation and reduce storage and labor costs. Traditional methods use instrumentation, testing reagents, or manual labor. These methods take a long time to detect, are time-consuming and labor-intensive, and require professionals to operate. Fruit, as a high-value food that provides essential nutrition for human beings, is susceptible to spoilage during packaging, transportation, and sales, so the freshness and safety assurance of fruit are a hot and difficult area of current research. Therefore, for the detection of fruit freshness, this paper proposes an efficient and nondestructive way to detect fruit freshness by using the machine learning algorithm convolutional neural network (CNN). This paper shows that convolutional neural networks have good performance in identifying the freshness of fruits through extensive experimental results and discusses the overfitting of machine learning based on the experimental results.
Full-text available
Ripening is a very important process that contributes to cheese quality, as its characteristics are determined by the biochemical changes that occur during this period. Therefore, monitoring ripening time is a fundamental task to market a quality product in a timely manner. However, it is difficult to accurately determine the degree of cheese ripeness. Although some scientific methods have also been proposed in the literature, the conventional methods adopted in dairy industries are typically based on visual and weight control. This study proposes a novel approach aimed at automatically monitoring the cheese ripening based on the analysis of cheese images acquired by a photo camera. Both computer vision and machine learning techniques have been used to deal with this task. The study is based on a dataset of 195 images (specifically collected from an Italian dairy industry), which represent Pecorino cheese forms at four degrees of ripeness. All stages but the one labeled as “day 18”, which has 45 images, consist of 50 images. These images have been handled with image processing techniques and then classified according to the degree of ripening, i.e., 18, 22, 24, and 30 days. A 5-fold cross-validation strategy was used to empirically evaluate the performance of the models. During this phase, each training fold was augmented online. This strategy allowed to use 624 images for training, leaving 39 original images per fold for testing. Experimental results have demonstrated the validity of the approach, showing good performance for most of the trained models.
Full-text available
Leading a healthy lifestyle has become one of the most challenging goals in today's society due to our sedentary lifestyle and poor eating habits. As a result, national and international organisms have made numerous efforts to promote healthier food diets and physical activity habits. However, these recommendations are sometimes difficult to follow in our daily life and they are also based on a general population. As a consequence, a new area of research, personalised nutrition, has been conceived focusing on individual solutions through smart devices and Artificial Intelligence (AI) methods. This study presents the AI4Food-NutritionDB database, the first nutrition database that considers food images and a nutrition taxonomy based on recommendations by national and international organisms. In addition, four different categorisation levels are considered following nutrition experts: 6 nutritional levels, 19 main categories (e.g., "Meat"), 73 subcategories (e.g., "White Meat"), and 893 final food products (e.g., "Chicken"). The AI4Food-NutritionDB opens the doors to new food computing approaches in terms of food intake frequency, quality, and categorisation. Also, in addition to the database, we propose a standard experimental protocol and benchmark including three tasks based on the nutrition taxonomy (i.e., category, subcategory, and final product) to be used for the research community. Finally, we also release our Deep Learning models trained with the AI4Food-NutritionDB, which can be used as pre-trained models, achieving accurate recognition results with challenging food image databases.
Food recognition is an important task for a variety of applications, including managing health conditions and assisting visually impaired people. Several food recognition studies have focused on generic types of food or specific cuisines, however, food recognition with respect to Middle Eastern cuisines has remained unexplored. Therefore, in this paper we focus on developing a mobile friendly, Middle Eastern cuisine focused food recognition application for assisted living purposes. In order to enable a low-latency, high-accuracy food classification system, we opted to utilize the Mobilenet-v2 deep learning model. As some of the foods are more popular than the others, the number of samples per class in the used Middle Eastern food dataset is relatively imbalanced. To compensate for this problem, data augmentation methods are applied on the underrepresented classes. Experimental results show that using Mobilenet-v2 architecture for this task is beneficial in terms of both accuracy and the memory usage. With the model achieving 94% accuracy on 23 food classes, the developed mobile application has potential to serve the visually impaired in automatic food recognition via images.KeywordsFood recognitionAssistive technologyComputer vision
Dietary assessment can be crucial for the overall well-being of humans and at least in some instances for the prevention and management of chronic, life-threatening diseases. Recall and manual record keeping methods for food intake monitoring are available, but often inaccurate when applied for a long period of time. On the other hand, automatic record keeping approaches that adopt mobile cameras and computer vision methods seem to simplify the process and can improve current human-centric diet monitoring methods. Here we present an extended critical literature overview of image-based food recognition systems (IBFRS) combining a camera of the user's mobile device with computer vision methods and publicly available food datasets (PAFD). In brief, such systems consist of several phases, such as the segmentation of the food items on the plate, the classification of the food items in a specific food category, and the estimation phase of volume, calories or nutrients of each food item. 159 studies were screened in this systematic review of IBFRS. A detailed overview of the methods adopted in each of the 78 included studies of this systematic review of IBFRS is provided along with their performance on PAFD. Studies that included IBFRS without presenting their performance in at least one of the abovementioned phases were excluded. Among the included studies, 45 (58%) studies adopted deep learning methods and especially Convolutional Neural Networks (CNNs) in at least one phase of the IBFRS with input PAFD. Among the implemented techniques, CNNs outperform all other approaches on the PAFD with a large volume of data, since the richness of these datasets provides adequate training resources for such algorithms. We also present evidence for the benefits of application of IBFRS in professional dietetic practice. Furthermore, challenges related to the IBFRS presented here are also thoroughly discussed along with future directions.
Full-text available
Food allergies impose a signifcant health concern on the community. A small number of certain food items can cause an allergic reaction within the human body. The symptoms can range from mild hives or itchiness to life-threatening anaphylaxis. In most cases, such reactions can be prevented by simply being aware of the allergen-based food items and avoiding the consumption of the same. We are among the frst research attempts to train a deep learning–based object detection model to detect the presence of such food items within an image. We introduce our Allergen30 dataset, which hosts more than 6,000 annotated images of 30 commonly used food items that can trigger an adverse reaction. We report the comparison of multiple variants of the current state-of-art object detection methods, YOLOv5 and YOLOR. Furthermore, we qualitatively analyzed the performance of these methods by surveying the predictions made on the test dataset images.
Full-text available
Food diary applications represent a tantalizing market. Such applications, based on image food recognition, opened to new challenges for computer vision and pattern recognition algorithms. Recent works in the field are focusing either on hand-crafted representations or on learning these by exploiting deep neural networks. Despite the success of such a last family of works, these generally exploit off-the shelf deep architectures to classify food dishes. Thus, the architectures are not cast to the specific problem. We believe that better results can be obtained if the deep architecture is defined with respect to an analysis of the food composition. Following such an intuition, this work introduces a new deep scheme that is designed to handle the food structure. Specifically, inspired by the recent success of residual deep network, we exploit such a learning scheme and introduce a slice convolution block to capture the vertical food layers. Outputs of the deep residual blocks are combined with the sliced convolution to produce the classification score for specific food categories. To evaluate our proposed architecture we have conducted experimental results on three benchmark datasets. Results demonstrate that our solution shows better performance with respect to existing approaches (e.g., a top-1 accuracy of 90.27% on the Food-101 challenging dataset).
We propose a new dataset for the evaluation of food recognition algorithms that can be used in dietary monitoring applications. Each image depicts a real canteen tray with dishes and foods arranged in different ways. Each tray contains multiple instances of food classes. The dataset contains 1,027 canteen trays for a total of 3,616 food instances belonging to 73 food classes. The foods in the tray images have been manually segmented using carefully drawn polygonal boundaries. We have benchmarked the dataset by designing an automatic tray analysis pipeline that takes a tray image as input, finds the regions of interest, and predicts for each region the corresponding food class. We have experimented with three different classification strategies, also using several visual descriptors. We achieve about 79% food and tray recognition accuracy using Convolutional Neural Network-based features. The dataset and the benchmark framework are available to the research community.
Accurate methods to measure food and energy intake are crucial for the battle against obesity. Providing users/patients with convenient and intelligent solutions that help them measure their food intake and collect dietary information offers the most valuable insights toward long-term prevention and successful treatment programs. In this paper, we propose an assistive calorie measurement system to help patients and doctors succeed in their fight against diet-related health conditions. Our proposed system runs on smartphones, which allows the user to take a picture of the food and measure the amount of calorie intake automatically. In order to identify the food accurately in the system, we use deep convolutional neural networks to classify 10,000 high-resolution food images for system training. Our results show that the accuracy of our method for food recognition of single food portions is 99%. The analysis and implementation of the proposed system are also described in this paper.
In this era of the Internet of Things (IoT), the healthcare system is one of the fields that has received a lot of attention from researchers. Daily-life objects such as mobile phones, watches, or shoes are coupled with sensors to build health systems for monitoring and managing people's health. Recently, some methods have focused on using food photography and associated image-processing techniques to assess food nutrients in order to control calorie intake. However, one of the critical issues in such image-based dietary assessment tools is the accurate and consistent estimation of the sizes and weights of the food portion in the image. In this paper, we propose a system that uses eating tools (cutlery) such as a spoon, fork, or chopsticks to measure the weight of a food in a picture, in order to estimate the calorie content of that food, for diet assessment and obesity prevention. Our system requires the user to take only a single image from the top with the cutlery in the picture. Using several image processing techniques and the EXIF metadata of the image, the system automatically estimates the diameter and the height of the food container and derives the food volume. Then, given the food type, the system combines the information about the container diameter, height, and the food type to provide the weight of the food in the image. Our experiments show tenable results from the system, which achieved an average relative error rate of 6.87% for the weight estimation over the test food images.
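The weight-estimation result above is reported as an average relative error. As a minimal sketch of how such a metric is commonly computed (the function name and the sample weights below are hypothetical and are not taken from the paper, whose exact evaluation protocol is not given here):

```python
def mean_relative_error(estimated, actual):
    """Average of |estimate - truth| / truth over all samples.

    Assumes every ground-truth value is strictly positive.
    """
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for est, truth in zip(estimated, actual):
        total += abs(est - truth) / truth
    return total / len(actual)


# Hypothetical food weights in grams, for illustration only.
estimated_weights = [95.0, 210.0, 330.0]
actual_weights = [100.0, 200.0, 300.0]

error = mean_relative_error(estimated_weights, actual_weights)
print(f"average relative error: {error:.2%}")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figure quoted in the abstract.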
Worldwide, in 2014, more than 1.9 billion adults aged 18 years and older were overweight. Of these, over 600 million were obese. Accurately documenting dietary caloric intake is crucial to managing weight loss, but it also presents challenges because most of the current methods for dietary assessment must rely on memory to recall foods eaten. The ultimate goal of our research is to develop computer-aided technical solutions to enhance and improve the accuracy of current measurements of dietary intake. The system proposed in this paper aims to improve the accuracy of dietary assessment by analyzing the food images captured by mobile devices (e.g., smartphones). The key technical innovation in this paper lies in the deep learning-based food image recognition algorithms. Substantial research has demonstrated that digital imaging accurately estimates dietary intake in many environments and has many advantages over other methods. However, how to derive the food information (e.g., food type and portion size) from food images effectively and efficiently remains a challenging and open research problem. We propose a new Convolutional Neural Network (CNN)-based food image recognition algorithm to address this problem. We applied our proposed approach to two real-world food image data sets (UEC-256 and Food-101) and achieved impressive results. To the best of our knowledge, these results outperformed all other reported work using these two data sets. Our experiments have demonstrated that the proposed approach is a promising solution for addressing the food image recognition problem. Our future work includes further improving the performance of the algorithms and integrating our system into a real-world mobile and cloud computing-based system to enhance the accuracy of current measurements of dietary intake.
Retrieving recipes corresponding to given dish pictures facilitates the estimation of nutrition facts, which is crucial to various health-related applications. The current approaches mostly focus on recognition of food category based on global dish appearance without explicit analysis of ingredient composition. Such approaches are incapable of retrieving recipes with unknown food categories, a problem referred to as zero-shot retrieval. On the other hand, content-based retrieval without knowledge of food categories also struggles to attain satisfactory performance due to large visual variations in food appearance and ingredient composition. As the number of ingredients is far smaller than the number of food categories, understanding the ingredients underlying dishes is in principle more scalable than recognizing every food category and is thus suitable for zero-shot retrieval. Nevertheless, ingredient recognition is a task far harder than food categorization, and this seriously challenges the feasibility of relying on ingredients for retrieval. This paper proposes deep architectures for simultaneous learning of ingredient recognition and food categorization, by exploiting the mutual but also fuzzy relationship between them. The learnt deep features and semantic labels of ingredients are then innovatively applied for zero-shot retrieval of recipes. By experimenting on a large Chinese food dataset with images of highly complex dish appearance, this paper demonstrates the feasibility of ingredient recognition and sheds light on this zero-shot problem peculiar to cooking recipe retrieval.
We evaluated the effectiveness in classifying food images of a deep-learning approach based on the specifications of Google's image recognition architecture Inception. The architecture is a deep convolutional neural network (DCNN) having a depth of 54 layers. In this study, we fine-tuned this architecture for classifying food images from three well-known food image datasets: ETH Food-101, UEC FOOD 100, and UEC FOOD 256. On these datasets we achieved, respectively, 88.28%, 81.45%, and 76.17% as top-1 accuracy and 96.88%, 97.27%, and 92.58% as top-5 accuracy. To the best of our knowledge, these results significantly improve the best published results obtained on the same datasets, while requiring less computation power, since the number of parameters and the computational complexity are much smaller than the competitors'. Because of this, even if it is still rather large, the deep network based on this architecture appears to be at least closer to the requirements for mobile systems.
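Top-1 and top-5 accuracy, as reported above, count a prediction as correct when the true class is among the 1 or 5 highest-scoring classes. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name, the three-class scores, and the labels are hypothetical and chosen only for illustration:

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    labels: list of true class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical classifier outputs for three images over three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.2%}, top-2: {top2:.2%}")
```

With only three classes the example uses k=2 instead of k=5; the same function applies unchanged to a 101- or 256-class problem.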
Automatic food understanding from images is an interesting challenge with applications in different domains. In particular, food intake monitoring is becoming more and more important because of the key role that it plays in health and market economies. In this paper, we address the study of food image processing from the perspective of Computer Vision. As a first contribution, we present a survey of the studies in the context of food image processing, from the early attempts to the current state-of-the-art methods. Since retrieval and classification engines able to work on food images are required to build automatic systems for diet monitoring (e.g., to be embedded in wearable cameras), we focus our attention on the representation of the food images because it plays a fundamental role in the understanding engines. Food retrieval and classification is a challenging task since food is intrinsically deformable and presents high variability in appearance. To properly study the peculiarities of different image representations, we propose the UNICT-FD1200 dataset. It is composed of 4,754 food images of 1,200 distinct dishes acquired during real meals. Each food plate is acquired multiple times and the overall dataset presents both geometric and photometric variabilities. The images of the dataset have been manually labeled considering 8 categories: Appetizer, Main Course, Second Course, Single Course, Side Dish, Dessert, Breakfast, Fruit. We have performed tests employing different state-of-the-art representations to assess their performance on the UNICT-FD1200 dataset. Finally, we propose a new representation based on the perceptual concept of Anti-Textons, which is able to encode spatial information between Textons, outperforming other representations in the context of food retrieval and classification.