Instance-Aware Semantic Segmentation for Food
Calorie Estimation using Mask R-CNN
Reza Dea Yogaswara
Dept. of Electrical Engineering
Institut Teknologi Sepuluh
Nopember
Surabaya, Indonesia
reza.yoga@gmail.com
Eko Mulyanto Yuniarno
Dept. of Computer Engineering
Institut Teknologi Sepuluh
Nopember
Surabaya, Indonesia
ekomulyanto@ee.its.ac.id
Adhi Dharma Wibawa
Dept. of Computer Engineering
Institut Teknologi Sepuluh
Nopember
Surabaya, Indonesia
adhiosa@te.its.ac.id
Abstract— Knowing the calorie content of the food we consume can help maintain body health. Meeting basic calorie needs well has many positive effects on the body, including maintaining an ideal body weight and providing an adequate source of energy for physical activity. Conversely, people who do not pay attention to their calorie needs face various health problems, including obesity and the worsening of degenerative diseases such as diabetes or high blood pressure. Calculating the actual number of calories in food digitally requires the area, volume, and mass of the food as parameters. Previous studies in computer vision have assigned a constant number of calories based on food type rather than on actual food volume measurements. In this research, a system is developed with a computer vision approach that calculates the number of food calories automatically based on the food volume, using the deep learning Mask Region-based Convolutional Neural Network (R-CNN) algorithm. The segmentation technique uses instance-aware semantic segmentation, which identifies every pixel belonging to each object instance found in a food image. This work uses instance-aware data labeling, a segmentation detection concept that distinguishes individual instances within the same class, so that the model can recognize each food object separately even within a single class and the number of calories of each food object can be obtained precisely. The expected benefit of this research is to help a person obtain information about food calorie content according to the body's calorie needs, with a mean Average Precision (mAP) of 89.4% and a calorie calculation accuracy of 97.48%.
Keywords— Computer Vision, Deep Learning, Food Calories,
Instance Segmentation, Mask R-CNN
I. INTRODUCTION
Everyone has different caloric needs, influenced not only by age but also by several other factors such as gender, body weight, and physical activity. Meeting calorie needs well has many positive effects on the body, among them maintaining an ideal body weight and providing an adequate source of energy for activities. Conversely, people who do not pay attention to their calorie needs face various health problems. Therefore, it is important to get used to calculating the number of calories consumed through food every day.
Until now, there has been no tool that can measure food calories automatically with high precision based on the volume of the food object. To obtain the correct number of food calories, the food volume must be known first, and measuring it manually with a measuring instrument is difficult and impractical. Previous research in computer vision has used object classification or object detection only, obtaining a constant calorie content based on food type without measuring the actual food volume, so the actual calorie content is not obtained. In this research, a system with a computer vision approach is developed that calculates food calorie content based on the food volume using the Mask Region-based Convolutional Neural Network (R-CNN) method.
The image is obtained by photographing food directly with a smartphone camera; the image is then segmented according to the labeled food type or category (ground truth). The segmentation results are used to estimate the calorie content with reference to a food calorie table.
This research is expected to help a person measure the number of calories in the actual volume of the food to be consumed, according to the body's calorie needs, so that a healthy body and an ideal body weight can be maintained.
II. RELATED WORKS
Some food recognition techniques create features through a manual process based on domain knowledge, often called handcrafted feature engineering. This type of feature engineering has several drawbacks: it is complicated to apply to object recognition because it requires expert knowledge, it is very time-consuming, and it can be a tedious process. Currently, deep learning methods for pattern recognition and computer vision allow the manual feature engineering approach to be replaced by automated feature learning, which is fast and recognizes the patterns of a domain better and in a measurable way.
Other studies on the recognition of traditional Thai food have used transfer learning and a Convolutional Neural Network (CNN), with 40 food categories and 2,500 food images [1]. The authors used the Inception V3 model to obtain pre-trained weights that were combined with the model to be trained. That study only performed classification and recognition of food images, so the predicted caloric value was static or constant, determined by the predicted object class, and achieved an accuracy of 75.2%.
Research into food recognition using deep learning has also been done. In 2017, [2] used a food object detection approach based on the Faster Region-based Convolutional Neural Network (R-CNN). This technique is suited only to object detection, in which the computer locates the food objects in the image, and the study used multi-label food in one dish. The mean Average Precision (mAP) reported was 90.7%. However, the number of calories produced for each food object was constant, because there were no parameters the authors could use to measure the mass of each detected food object.
Research on the recognition of traditional Malaysian food has also been carried out [3]. The calorie calculation application used a two-layer neural network with a sigmoid activation function and a dataset of 250 food images. The authors combined a calorie constant with the classification results, so the caloric value obtained was still static or constant, with an overall correctness percentage of 80%.
In 2016, research was conducted on food recognition using morphological operations with watershed segmentation [4]. The authors used morphological dilation and erosion, which still require manual adjustment of the dilation and erosion thresholds to obtain correct segmentation of food images; manual edge detection was also used to obtain the correct segmentation results.
From the studies above, further research is needed on counting food calories from a food image using richer segmentation parameters, where the system can distinguish between detected object instances and solve the multi-label problem: a food image may contain more than one object, including more than one similar object, and the computer must be able to distinguish them.
III. METHODS
In this research, the authors develop a system with a computer vision approach that calculates food calorie content based on the food volume using the Mask Region-based Convolutional Neural Network (R-CNN) algorithm.
Mask R-CNN is a state-of-the-art method developed by Kaiming He [5], a computer vision researcher at Facebook AI Research (FAIR); the work received the Best Paper Award at the IEEE International Conference on Computer Vision (ICCV) in 2017. The method is used to solve instance segmentation problems [5] in computer vision and has approached human capability in visual perception tasks.
Instance-aware semantic segmentation, or instance segmentation, separates different object instances in an image or video. For example, if one food image contains two objects of white bread, the objects are segmented with two masks and two bounding boxes with different identities.
This technique identifies objects at the pixel level and associates each pixel with the physical instance of an object [6]. Compared to other computer vision tasks, it is the most challenging, because there are more steps in detecting and segmenting the objects: first, the system performs classification together with refinement of the object detection task; second, it recognizes the image pixel by pixel and determines the type of object contained in each pixel.
The Mask R-CNN successfully outperformed two
previous algorithms in instance segmentation tasks. The two
algorithms are the Multi-task Network Cascade (MNC) [7],
the winner of the Microsoft Common Objects in Context
(COCO) competition in 2015 and Fully Convolutional
Instance-aware Semantic Segmentation (FCIS) [8], the
winner of the 2016 Microsoft COCO competition.
This research uses a flow diagram as shown in Fig. 1
below:
Fig. 1. Research flow diagram
A. Determine the Object Classes
This research begins by determining the object classes whose calories per serving will be calculated. Image segmentation is obtained from dense pixel-wise prediction on food images according to the object classes defined in the ground truth. The segmentation is used to calculate the surface area of each predicted object class. The height of each object class is obtained from measurements stored as multiplier constants, which are applied to the area estimated from the segmentation results to obtain the object volume.
The density of each object class is also calculated so that the mass of each object class can be obtained, and the calorie calculation is then carried out based on the calorie reference table. The object classes used in this research were plate, which is used as a calibrator for the area of the food classes, white bread, braised spiced tempeh, fried tempeh, braised spiced tofu, and fried tofu.
B. Data Acquisition
Data acquisition is done by capturing digital food images with combinations of dishes from the defined object classes, together with measurements of the height, area, and mass of each object, which are used to validate the segmentation results. The stages of data acquisition in this research are:
a) Image data acquisition: Food image data are taken with various combinations of object classes. Each image is captured approximately perpendicular (± 90 degrees) to the surface of the food using an iPhone 6s smartphone camera with a twelve-megapixel resolution and dimensions of 3024px x 4032px. After the image is obtained, it is resized to 768px x 768px to reduce the computational resources needed in each training batch; a minimal resizing sketch is given after Fig. 3.
Fig. 2. Food image taken with the iPhone 6s smartphone camera and then resized to 768px x 768px
b) Measuring the height of food objects: This stage obtains the height of each food object, which is used as a multiplier constant for the area calculated from the segmentation results, so that the volume of each food object is obtained.
c) Measuring the area and mass of food objects: The actual area of a food object is used to validate the area estimated from the segmentation of the food objects, so that the accuracy of the segmentation results can be measured. The actual plate area is then used as a comparison constant to calculate the actual area of a food object from the segmentation results.
Fig. 3. Measurement of the mass of white bread and fried tempeh objects using a Camry EK3650 digital weighing scale
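As a concrete illustration of the preprocessing in step (a), the following minimal Python sketch downscales a captured photo to 768px x 768px; the use of Pillow and a bilinear filter is an assumption, not the authors' actual implementation.

```python
from PIL import Image

def load_and_resize(path, size=(768, 768)):
    # Downscale the 3024x4032 px smartphone photo to 768x768 px before
    # training (step (a)); the resampling filter is an assumption.
    img = Image.open(path).convert("RGB")
    return img.resize(size, Image.BILINEAR)
```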
TABLE I. RESULTS OF MEASUREMENT OF THE AREA, HEIGHT, MASS, AND DENSITY OF EACH OBJECT CLASS

| Food Name | Area (cm²) | Height (cm) | Mass (g) | Density (g/cm³) |
| Fried Tofu | 22.5 | 1.4 | 24 | 0.735 |
| Fried Tempeh | 16 | 1 | 14 | 0.872 |
| Braised Spiced Tofu | 18 | 1.5 | 33 | 1.076 |
| Braised Spiced Tempeh | 26.5 | 1.3 | 37 | 0.947 |
| White Bread | 133 | 1.3 | 38 | 0.237 |
Table I above shows the measured area, height, mass, and density of the food objects whose caloric content will be predicted from the segmented area obtained in the test results. The measured area and mass are used to validate the segmentation results, while the height and density are used as multipliers to obtain the volume and mass of the food objects.
C. Class Object Labeling
Object class labeling is used to obtain images with pixel labels that represent each object class.
Fig. 4. Labeling uses the pixel-dense labeling technique
After labeling the food image data, the segmentation is visualized to check how precise the segmentation masks in the ground truth are, in order to produce a segmentation model that can recognize the objects in each food dish.
Fig. 5. Observation of the segmentation or mask of food object classes on
ground truth
D. Splitting the Dataset
After object class labeling, the dataset is divided into three parts, namely a training set, a development set (hold-out cross-validation), and a test set, to obtain the best-performing segmentation model from the training results.
Fig. 6. Splitting dataset distribution into a training set, development or
validation set, and test set
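A minimal sketch of this split is given below; the 70/15/15 ratio, the random shuffling, and the function name are assumptions, since the paper does not report its exact proportions.

```python
import random

def split_dataset(image_ids, train_frac=0.70, val_frac=0.15, seed=42):
    # Shuffle the labeled image ids and cut them into training,
    # development (validation), and test subsets.
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```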
E. Segmentation Model Training and Validation
Mask R-CNN has an architecture and learning process flow as shown in Fig. 7.
Fig. 7. Mask R-CNN Architecture in the training and inference process
The Mask R-CNN architecture consists of two parts. The first part produces region proposals with a Region Proposal Network (RPN) to find where food objects may be located in the input image, using a ResNet 50 or ResNet 101 [9] backbone whose convolutional features are shared. The second part predicts the object class based on the ground truth, refines the bounding box for object localization, and then generates a pixel-level segmentation mask based on the region proposals generated in the first part.
The mask segmentation branch uses a Fully Convolutional Network (FCN) [10] applied to every Region of Interest (RoI), predicting the segmentation mask in a pixel-to-pixel manner. The FCN uses a per-pixel softmax loss function. Formally, the loss is calculated by summing the prediction losses over all pixels, which can be expressed in the following equation:
L_{mask} = -\sum_{i} y_{i} \log \hat{y}_{i}     (1)
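The following NumPy sketch illustrates Eq. (1), summing the per-pixel cross-entropy over an RoI mask; it is an illustrative reading of the equation, not the authors' training code.

```python
import numpy as np

def per_pixel_softmax_loss(y_true_onehot, y_pred_prob, eps=1e-7):
    # y_true_onehot: (H, W, C) one-hot ground-truth class per pixel
    # y_pred_prob:   (H, W, C) predicted softmax probabilities per pixel
    y_pred_prob = np.clip(y_pred_prob, eps, 1.0)           # avoid log(0)
    per_pixel = -np.sum(y_true_onehot * np.log(y_pred_prob), axis=-1)
    return float(per_pixel.sum())                          # sum over all pixels
```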
The implementation of training and segmentation model
validation consists of several stages as follows:
1) Prepare computers for training, validation, and testing
with specifications:
• Dual Intel(R) Xeon(R) CPU @ 2.30 GHz
• 13 GB of memory
• Tesla K80 GPU
2) Evaluation of accuracy, error (loss), and improvement
of the Mask R-CNN hyperparameter configuration.
Training is conducted several times to obtain the model with the lowest loss (error) value. When the training loss is too high, this can be caused by high bias, which can be addressed by adding layers to the Mask R-CNN or increasing the number of training epochs. When the validation loss is too high, this can be addressed by adding data to the validation set to minimize overfitting.
Formally, in the training process, Mask R-CNN defines a multi-task loss on each sampled RoI as:

L = L_{cls} + L_{box} + L_{mask}     (2)
The accuracy and error (loss) measurements obtained from training the segmentation model are shown in Table II below:
TABLE II. RESULTS OF MEASUREMENT OF ERROR (LOSS) AND VALIDATION IN TRAINING EXPERIMENTS FOR A SEGMENTATION MODEL

| | Training #1 | Training #2 | Training #3 |
| Epoch | 40 | 30 | 40 |
| Loss | 0.04132 | 0.06896 | 0.04447 |
| BBox loss | 0.003998 | 0.007919 | 0.003894 |
| Class loss | 0.008325 | 0.01043 | 0.004594 |
| Mask loss | 0.02632 | 0.04133 | 0.03183 |
| RPN bbox loss | 0.002266 | 0.008789 | 0.003773 |
| RPN class loss | 0.000402 | 0.0004826 | 0.0003732 |
The first training run used 40 epochs and obtained the lowest loss and validation loss values; the lowest loss obtained across the three training trials was 0.04132. The model from this first training run is therefore used for calorie prediction testing (inference).
Measurement of the loss function is needed in this
research because it serves as a way to measure the distance or
difference between the predicted output and the ground truth
label to produce a more accurate segmentation model.
F. Model Testing
The food segmentation model is tested after it has been obtained from the training results.
Fig. 8. Testing the detection of food object segmentation using a
segmentation model and confidence scores obtained.
In the second test, segmentation detection was evaluated by plotting a confusion matrix to determine the Intersection over Union (IoU) between the predictions and the ground truth for the segmentation results in Fig. 8.
Fig. 9. Definition of IoU for object detection tasks
IoU calculations are obtained through the equation below:
\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}     (3)
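For binary segmentation masks, Eq. (3) can be computed as in the short sketch below (an illustrative implementation, not the evaluation code used in this research).

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    # mask_a, mask_b: boolean arrays of the same shape (H, W)
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0
```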
[Annotation values shown in Fig. 11: Braised Spiced Tofu, area 8495 px² (21 cm²), height 1.5 cm, volume 31 cm³, density 1.07 g/cm³, mass 33 g, 49 cal; Fried Tofu, area 9877 px² (24 cm²), height 1.4 cm, volume 33 cm³, density 0.72 g/cm³, mass 25 g, 27 cal.]
Fig. 10. Plotting confusion matrix from the result of segmentation in Fig. 8.
Next, the Average Precision (AP) is measured for the predicted bounding boxes and the classes of the detected objects. It is obtained by averaging the precision over recall values between 0 and 1, using the detection evaluation metric of the Microsoft COCO dataset. In this study, the author used three evaluation metrics, as follows (a brief evaluation sketch follows the list):
1. AP, the percentage of AP averaged over IoU = .50:.05:.95 (main metric)
2. AP at IoU = .50 (the metric used by PASCAL VOC)
3. AP at IoU = .75 (strict metric)
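These metrics can be computed with the COCO evaluation toolkit, as in the hedged sketch below; the annotation and result file names are placeholders, not the files used in this research.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("food_val_annotations.json")            # ground-truth masks (placeholder name)
coco_dt = coco_gt.loadRes("mask_rcnn_results.json")    # predicted masks (placeholder name)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # reports AP@[.50:.95], AP50 and AP75, among others
```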
TABLE III. MEASUREMENT OF AVERAGE PRECISION (AP) OF THE SEGMENTATION MODEL

| Backbone | AP | AP50 | AP75 |
| ResNet 101 | 0.8941 | 0.9678 | 0.9638 |
G. Volume Calculation of the Detected Object
After obtaining the food object segmentation, the next step is to measure the area of each detected (segmented) food object and compare it with the area of the detected plate.
To obtain the object mass, the detected object area is first converted to an actual area by comparison with the actual plate area; the actual object area is then multiplied by the height of the food object to obtain the volume of the food object, and the object mass is obtained by multiplying the object volume by the object's density constant.
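The conversion just described can be summarized by the following sketch, which scales the segmented pixel area by the known plate area and then applies the measured height and density; the parameter names are illustrative, not taken from the paper's code.

```python
def object_mass_from_segmentation(obj_area_px, plate_area_px,
                                  plate_area_cm2, height_cm, density_g_cm3):
    # Actual object area, using the plate as the scale calibrator.
    obj_area_cm2 = obj_area_px / plate_area_px * plate_area_cm2
    volume_cm3 = obj_area_cm2 * height_cm        # area x measured height (Table I)
    mass_g = volume_cm3 * density_g_cm3          # volume x measured density (Table I)
    return obj_area_cm2, volume_cm3, mass_g
```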
Fig. 11. Detect the class of food objects and get the calorie content of each
object
After obtaining the mass of the food objects from the segmentation results, the estimated calories contained in the food are calculated based on the reference calorie table. The calorie reference table is obtained from a health promotion brochure of the Indonesian Ministry of Health on "Healthy Lifestyle", which covers the five defined object classes.
TABLE IV. CALORIE TABLE

| Food Name | Mass (g) | Calories | Unit |
| Fried Tofu | 100 | 111 | 1.5 |
| Fried Tempeh | 50 | 118 | 1.5 |
| Braised Spiced Tofu | 100 | 147 | 1.75 |
| Braised Spiced Tempeh | 50 | 157 | 2 |
| White Bread | 50 | 128 | 1.5 |

Source: Health Ministry of the Republic of Indonesia
The calorie content of each detected food object class is obtained by dividing the mass of the object by the reference mass in the calorie table and then multiplying by the number of calories in the calorie table, which can be expressed in the equation below:

\text{Calories} = \frac{\text{Object Mass}}{\text{Mass in Calorie Table}} \times \text{Calories in Calorie Table}     (4)
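A minimal sketch of Eq. (4) using the reference values of Table IV is shown below; the dictionary layout and function name are illustrative.

```python
# Reference (mass in g, calories) pairs transcribed from Table IV.
CALORIE_TABLE = {
    "Fried Tofu": (100, 111),
    "Fried Tempeh": (50, 118),
    "Braised Spiced Tofu": (100, 147),
    "Braised Spiced Tempeh": (50, 157),
    "White Bread": (50, 128),
}

def calories_from_mass(food_name, mass_g):
    # Eq. (4): scale the table calories by the ratio of the estimated
    # mass to the reference mass.
    ref_mass, ref_cal = CALORIE_TABLE[food_name]
    return mass_g / ref_mass * ref_cal

# e.g. a braised spiced tofu mass of 33.45 g gives about 49.2 cal,
# matching the value reported in Table V.
```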
H. Show Object Segmentation Results and Calorie Calculation
The segmentation model in this study is tested through a web application and an Android application, which make it easy to calculate food calorie content using the segmentation model obtained from the training results.
Fig. 12. Segmentation detection and food calorie calculation results on the web application and the Android application
IV. RESULTS AND DISCUSSION
From the segmentation testing results shown in Fig. 12, a comparison of the area, mass, and calories obtained from segmentation against the ground truth, together with the percentage calorie accuracy, is presented in Table V below.
The accuracy of calorie calculation in Table V is
calculated using the equation below:
\text{Error} = \frac{|\text{Calories}_{GT} - \text{Calories}_{Pred}|}{\text{Calories}_{GT}} \times 100     (5)

\text{Accuracy} = 100 - \text{Error}     (6)
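Eqs. (5) and (6) reduce to the short sketch below; as a check, the fried tempeh row of Table V (33.04 cal ground truth vs. 31.8 cal predicted) yields an accuracy of about 96.25%, matching the reported value.

```python
def calorie_accuracy(cal_gt, cal_pred):
    # Eq. (5): percentage error; Eq. (6): accuracy = 100 - error.
    error = abs(cal_gt - cal_pred) / cal_gt * 100.0
    return 100.0 - error
```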
TABLE V. RESULT TABLE

| Class | Area of GT* (cm²) | Mass of GT (g) | Calories of GT (cal) | Area of Pred.* (cm²) | Mass of Pred. (g) | Total Calories (cal) | Accuracy (%) |
| Fried Tofu | 22.5 | 24 | 26.64 | 24.07 | 24.77 | 27.5 | 92.98 |
| Fried Tempeh | 16 | 14 | 33.04 | 15.47 | 13.49 | 31.8 | 96.25 |
| Braised Spiced Tofu | 18 | 33 | 48.51 | 20.72 | 33.45 | 49.2 | 98.58 |
| Braised Spiced Tempeh | 26.5 | 37 | 116.18 | 30.0 | 36.92 | 115.9 | 99.76 |
| White Bread | 133 | 38 | 97.28 | 123.1 | 37.93 | 97.1 | 99.81 |

GT*: Ground Truth, Pred.*: Prediction
As seen in Table V, the accuracy of calorie calculation in
the fried tofu class was 92.98%, fried tempeh 96.25%, braised
spiced tofu 98.58%, braised spiced tempeh 99.76%, and
white bread 99.81%. From the result table above, the average
accuracy of calorie calculation obtained is 97.48%.
In this research, the actual area of the food was calculated to obtain the volume and mass of the food, so that the actual number of calories could be obtained. This technique outperforms several previous studies on food calorie calculation that did not base the calorie measurement on the actual volume and mass of the food objects and only performed image classification or object detection (localization).
V. CONCLUSION
In this research, we built a system with a computer vision approach that calculates food calorie content based on food volume and mass using the Mask Region-based Convolutional Neural Network (R-CNN). The authors conclude that Mask R-CNN can be applied to food calorie calculation because it computes a pixel-wise mask for every object in the image, using ResNet 101 as the backbone, an RPN, RoI classification and bounding-box regression, and a segmentation mask branch. The algorithm was able to distinguish separate instances of the same object class. The segmentation model can be used in web applications and Android mobile applications, making it easy for users to calculate the number of food calories automatically. For future work, another experiment needs to be performed to evaluate food types that have a convex or concave structure, which would be a new challenge in calculating food calories.
REFERENCES
[1] P. Temdee and S. Uttama, "Food recognition on smartphone using transfer learning of convolution neural network," Cape Town, 2017.
[2] T. Ege and K. Yanai, "Estimating Food Calories for Multiple-dish Food Photos," 2017.
[3] N. A. A. Nor Muhammad, C. P. Lee, K. M. Lim and S. F. Abdul Razak, "Malaysian Food Recognition and Calorie Counter Application," 2017.
[4] S. V. Chavan and S. S. Sambare, "Segmentation of Food Images using Morphological Operations with Watershed Segmentation Technique," International Journal of Computer Applications, vol. 151, no. 1, 2016.
[5] K. He, G. Gkioxari, P. Dollár and R. Girshick, "Mask R-CNN," arXiv e-prints, arXiv:1703.06870, 2017.
[6] M. Bai and R. Urtasun, "Deep Watershed Transform for Instance Segmentation," Honolulu, HI, USA, 2017.
[7] J. Dai, K. He and J. Sun, "Instance-aware Semantic Segmentation via Multi-task Network Cascades," arXiv e-prints, arXiv:1512.04412, 2015.
[8] Y. Li, H. Qi, J. Dai, X. Ji and Y. Wei, "Fully Convolutional Instance-aware Semantic Segmentation," arXiv e-prints, arXiv:1611.07709, 2016.
[9] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," arXiv e-prints, arXiv:1512.03385, 2015.
[10] J. Long, E. Shelhamer and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," arXiv e-prints, arXiv:1411.4038, 2014.