
All content in this area was uploaded by Reza Yogaswara on May 17, 2019


Comparison of Supervised Learning Image Classification Algorithms for Food and Non-Food Objects

Reza Dea Yogaswara

Dept. of Electrical Engineering

Institut Teknologi Sepuluh Nopember

Surabaya, Indonesia

reza.yoga@gmail.com

Adhi Dharma Wibawa

Dept. of Computer Engineering

Institut Teknologi Sepuluh Nopember

Surabaya, Indonesia

adhiosa@te.its.ac.id

Abstract—Object recognition is a method in computer vision for identifying and recognizing objects in an image or video. When humans see photos or watch videos, they can quickly recognize objects such as a car, a bus, a human, a cat, food, and other visual artifacts. How, though, can a computer do the same? Classification is the technique in object recognition that a computer can use to distinguish one object from another in an image or video. In this paper, the author tests several popular binary image classification algorithms and reports the performance metrics of each: Logistic Regression with Perceptron, Multi-Layer Perceptron (MLP), Deep Multi-Layer Perceptron, and Convolutional Neural Network (ConvNet). The author uses the Food-5K dataset to distinguish two classes of objects, namely food and non-food, and trains and tests how accurately the computer recognizes food and non-food objects, which is useful to anyone who needs to identify food objects with automatic recognition tools. This paper is expected to contribute to the computer-vision algorithms used to solve image classification problems, reaching, with optimal hyperparameters, a validation accuracy above 90%. The test results show that ConvNet achieves a testing accuracy above 90% with a loss function below 25%, indicating that ConvNet has a significant advantage over the generic artificial neural network on the image classification problem.

Keywords—object classification; deep learning; image recognition; machine learning; convolutional neural network

I. INTRODUCTION

Supervised learning is a machine learning technique in which we associate inputs with ground truth in a dataset. It aims to test a hypothesis or, in other words, to construct a compact model of the class-label distribution that can assign class labels to test data for which the predictive features are known but the class label is unknown (S. B. Kotsiantis, 2007). In supervised learning we find two types of problems that can be solved: regression and classification. In general, the process performed in supervised learning is shown in Fig. 1 below.


Fig. 1. Flow process of supervised learning

Image classification plays a role in many areas, such as automatic vehicle or pedestrian detection to compute vehicle and pedestrian density, autonomous self-driving cars, and others, which attracts much attention from academics and scientists toward making computers fast and accurate at recognizing objects. (Lee J.D., 1996) used optimal linear feature extraction based on Principal Component Analysis (PCA) and a Multi-Layer Perceptron (MLP) to recognize two-dimensional objects in industrial environments. (Huo B. et al., 2015) proposed an image classification technique using modified Support Vector Machines (SVM), with PCA-based feature extraction to capture the image characteristics and eliminate redundant information. (Zolnouri S.P. et al., 2015) compared image classification techniques using Multi-Layer Perceptron (MLP) and Radial Basis Function (RBF) neural networks with K-Nearest Neighbors (KNN) based on Euclidean distance and found that MLP achieves higher accuracy than KNN.

In this paper, we compare four popular image classification algorithms: Logistic Regression with Perceptron, Multi-Layer Perceptron (MLP), Deep Multi-Layer Perceptron, and Convolutional Neural Network (ConvNet). We divide this task into three parts: obtaining a vector representation of each training-data distribution, using those vectors to train the classifier models, and evaluating each classifier model's accuracy and cost function.

This research is expected to yield the classifier model with the highest accuracy and the lowest cost function under optimal hyperparameters. The author uses the Food-5K dataset, a dataset containing 2500 food images and 2500 non-food images, for the food and non-food classification task, divided into three parts: 3000 training images, 1000 validation images, and 1000 evaluation images.
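Food-5K is distributed already partitioned; the sketch below merely illustrates the 3000/1000/1000 split of 5000 items, using synthetic label-prefixed filenames (an assumption for illustration, not the real dataset layout):

```python
# Illustrative 3000/1000/1000 split of a 5000-item list, mirroring the paper's
# partition of Food-5K. Filenames here are synthetic stand-ins.
import random

def split_dataset(paths, n_train=3000, n_val=1000, n_eval=1000, seed=42):
    """Shuffle and split a list of image paths into train/validation/evaluation."""
    assert len(paths) == n_train + n_val + n_eval
    rng = random.Random(seed)
    shuffled = paths[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    evaluation = shuffled[n_train + n_val:]
    return train, val, evaluation

# 2500 food (label 1) and 2500 non-food (label 0) placeholder filenames
paths = [f"{label}_{i}.jpg" for label in (0, 1) for i in range(2500)]
train, val, evaluation = split_dataset(paths)
print(len(train), len(val), len(evaluation))  # 3000 1000 1000
```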

II. PROPOSED METHODS

A. Logistic Regression with Perceptron

Logistic regression, or the logit model, is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary), which some literature calls binary classification. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. This algorithm is one of the techniques machine learning adopted from the field of statistics.

This technique addresses the binary classification problem, that is, a problem with two class values. The statistician David Cox developed logistic regression in 1958; the model estimates the probability of a binary response based on one or more predictor variables, or independent variables (features). The logit model can also illustrate the nature of population growth in ecology, rising rapidly up to the environmental carrying capacity, represented by a logistic curve with values ranging between 0 and 1. The logit model is quite similar to linear regression, but the logit curve is constructed using the natural logarithm of the odds of the target variable, not the probability.

Logistic regression uses an equation as its representation, much like linear regression. The input value (x) is linearly combined with coefficient values (b0, b1) to predict the output value (y).

The difference from linear regression is that the modeled output values are binary (0 or 1) rather than numeric. Also, the predictors need not be normally distributed or have the same variance in each group. In logistic regression, the constant (b0) moves the curve up and down along the y-axis, and the slope (b1) defines the steepness of the curve.

The logit model can be represented by the following equation:

\hat{y} = \sigma(b_0 + b_1 x)    (1)

where \sigma is the sigmoid function:

\sigma(z) = \frac{1}{1 + e^{-z}}    (2)

In training the logit model, it is necessary to define a loss function over each training example that measures the cost to be paid for any predictive inaccuracy in the classification problem. The cost function used in the logit model is:

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]    (3)

That is, the cost as a function of the parameters w and b is -1/m, where m is the number of training examples, times the sum over examples of the i-th ground-truth value multiplied by the log of the i-th predicted value, plus one minus the i-th ground-truth value multiplied by the log of one minus the i-th predicted value.
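The logit prediction and binary cross-entropy cost described above can be sketched in a few lines of numpy (a minimal illustration with made-up values, not the author's actual code):

```python
# Minimal numpy sketch of the logit prediction and binary cross-entropy cost.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, b0, b1):
    """Logit model: probability that x belongs to the positive class."""
    return sigmoid(b0 + b1 * x)

def cost(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over the m training examples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    m = y_true.shape[0]
    return -(1.0 / m) * np.sum(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = predict(np.array([2.0, -2.0, 1.0, -1.0]), b0=0.0, b1=1.0)
print(round(float(cost(y_true, y_pred)), 4))  # ≈ 0.2201
```

Confident predictions that match the label contribute little to the cost; confident wrong predictions are penalized heavily.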

In this paper, the author divides the Food-5K dataset, containing 5000 images composed of 2500 food images and 2500 non-food images, into three parts, namely training, validation, and evaluation, with RGB images of 224x224 pixels.

For training we use a batch size of 32, and for testing/validation a batch size of 16, which also speeds up the training process and reduces the load on our computer's resources. We thus obtain input images with dimensions (32, 224, 224, 3) and ground truth with dimensions (32, 1), which we feed into the neural network to learn the right weights and biases for each class.

The author uses the sigmoid activation function at the final layer to obtain the final weights of the training process, and the Adaptive Moment Estimation (ADAM) optimizer to compute an adaptive learning rate for each parameter.


Fig. 2. Linear Model and Logistic Model Curve

B. Multi-Layer Perceptron Neural Network

An artificial neural network (ANN) is an information processing system with several performance characteristics similar to those of biological neural networks. McCulloch & Pitts first designed an ANN in 1943. ANNs have been developed as mathematical generalizations of neurobiology and human cognition, based on these assumptions:

1. Information processing occurs in many simple elements

called neurons.

2. The signal is passed between the neurons via a connection

link.

3. Each connection link has an associated weight, which, in a typical neural net, multiplies the transmitted signal.

4. Each neuron applies an activation function, which is usually non-linear, to its net input (the sum of its weighted input signals) to determine its output.

ANN has several main characteristics, such as the pattern

of relationship between neurons or so-called ANN architecture,

a method to determine the weight of each connection which is

called training or learning algorithm, and activation function.

ANN consists of many elements of processing units

commonly called neurons, units, cells, or nodes. Each neuron

is connected via a communication link and associated with

weights. The weights represent the information used to solve the problem, and one of the most widely used applications of neural networks is pattern classification.

Neurons have an internal state called activation or activity

level, where this state is a function of input received, and

usually, neurons send signals to some other neurons.

For example, consider a neuron Y that, as shown in Fig. 3 below, receives input from neurons X1, X2, and X3, whose activations are denoted x1, x2, and x3. The weights on the connections from X1, X2, and X3 to neuron Y are w1, w2, and w3. The net input to neuron Y is the sum of the weighted signals from X1, X2, and X3, which can be denoted by the equation:

y_{in} = x_1 w_1 + x_2 w_2 + x_3 w_3    (4)

Fig. 3. Shallow neural network with three inputs

In general, the linear computation of a neural network can be solved by the following equation:

y_{in} = \sum_{i=1}^{m} x_i w_i    (5)

that is, multiplying each of the m features by its weight (w_1, w_2, ..., w_m) and summing them all together.

Applying the activation function f to the net input y_{in} gives the output y of neuron Y:

y = f(y_{in})    (6)

where the activation function can be the logistic sigmoid:

f(x) = \frac{1}{1 + e^{-x}}    (7)

or the hyperbolic tangent (tanh):

f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (8)

or the Rectified Linear Unit (ReLU):

f(x) = \max(0, x)    (9)
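The net input and the three activation functions above can be written directly in numpy; a minimal sketch with illustrative values (not from the paper):

```python
# Numpy versions of the net input and the sigmoid, tanh, and ReLU activations.
import numpy as np

def net_input(x, w):
    """Weighted sum of inputs: y_in = sum_i x_i * w_i."""
    return np.dot(x, w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3
w = np.array([0.5, -0.25, 0.1])  # weights w1, w2, w3
y_in = net_input(x, w)           # 0.5 - 0.5 + 0.3 = 0.3
print(float(sigmoid(y_in)), float(tanh(y_in)), float(relu(y_in)))
```

The choice of activation shapes the output range: sigmoid maps to (0, 1), tanh to (-1, 1), and ReLU to [0, ∞).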

When an ANN has three or more layers, that is, an input layer, an output layer, and more than one hidden layer, it can be called a multilayer perceptron or deep neural network.


Fig. 4. Multilayer Perceptron or Deep Neural Network architecture

In this algorithm, the author implements a neural network, or multi-layer perceptron, to solve the binary classification problem using deeper layers and more perceptrons than the logistic regression algorithm. The author again uses the ADAM optimizer to compute an adaptive learning rate for each trained parameter.

C. Convolutional Neural Network (ConvNet)

Convolutional Neural Networks (CNN) or commonly

referred to as ConvNet is one of the special cases of the

Artificial Neural Network (ANN) which is currently

considered the best technique to solve object recognition and

digit detection problems (Andrew Y. Ng et al. 2010). In the

world of machine learning or neural networks specifically

today ConvNet is no longer a strange technique, previously

(Fukushima 1980) once designed an unsupervised artificial

network called Neocognitron, which was inspired by a

biological visual system, much earlier designed by Hubel and

(5)

(6)

(7)

(8)

(9)

(4)

Wiesel, the winner of Nobel Prize in the field of physiology in

1981. (LeCun et al., 1989) derived the model on some sides

and implemented it in the context of supervised learning to

solve handwritten digit recognition problems, and Yan LeCun

introduced the term "Convolutional Neural Networks" itself,

(LeCun et al., 1998), with an architectural model named

LeNet.

From the late 1990s to the mid-2000s, neural networks were "almost forgotten" because various algorithms (e.g., Support Vector Machines (SVM) and AdaBoost) could be executed faster with better performance at the time. ANNs regained the attention of scientists and academics when Deep Belief Networks (DBN) (Hinton et al., 2006) made a breakthrough by becoming the most accurate handwritten digit recognition model, which ultimately gave rise to the term deep learning.

A neural network is called "deep" if it has many hierarchical layers. In 2012 another breakthrough came in deep learning: ConvNets with a specific architecture, combined with tricks such as dropout regularization, the Rectified Linear Unit (ReLU) as the activation function, and data augmentation, achieved a breakthrough on large-scale image classification (ImageNet), which has 1000 object categories and roughly a million images, exceeding human performance. This model is known as AlexNet. What is surprising is that it was in a sense a return to the past: it used only one phase, supervised backpropagation, causing many people to doubt the benefit of the unsupervised pre-training done in the Deep Belief Network (DBN) model.

The Rectified Linear Unit (ReLU) activation function (Nair and Hinton, ICML 2010) is currently the most popular, due to its success on ImageNet (Krizhevsky et al., 2010) and its improvement of Deep Neural Networks for large-vocabulary speech recognition (LVCSR) (Dahl et al., 2013).

ConvNet is a feedforward network in which information flows in one direction only. ConvNet has several architectural variants; in general, it consists of convolutional and pooling (subsampling) layers grouped into modules, followed by one or more fully connected layers, as in a standard feedforward neural network. Modules are stacked on top of each other to form a deep model. Fig. 5 illustrates a typical ConvNet architecture for image classification of an object using softmax activation.


Fig. 5. Example of convolutional neural network architecture

The purpose of ConvNet's development is to improve the accuracy of image classification tasks while reducing computational cost. ConvNet has three foundational building blocks:

1) Convolutional Layer

The convolutional layer in a ConvNet can be viewed as a feature extractor, in which the ConvNet learns to recognize features in the input image. Each neuron in a feature map has a receptive field, connected to a neighborhood of neurons in the previous layer through a set of trainable weights, sometimes called filter banks (LeCun et al., 2015).

Inputs are convolved with the learned weights to compute new feature maps, and the convolution results are passed through a nonlinear activation function, which allows the network to extract nonlinear features from the image. One example use of the convolutional layer is finding the vertical or horizontal edge features of an input image (edge detection): the input is multiplied by a convolution matrix, which can be called a filter or feature detector. The shape of the feature map matrix is obtained by subtracting the filter size f from the input image's height n_H and width n_W and adding one:

(n_H - f + 1) \times (n_W - f + 1)    (10)

Also, in convolutional layers we find techniques such as padding and strides, which can be combined to obtain the best feature map.
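A toy "valid" convolution (stride 1, no padding) makes the feature-map size reduction concrete; the vertical edge-detection filter below is a standard illustration, not one from the paper:

```python
# "Valid" 2-D convolution (cross-correlation, as in most deep-learning
# libraries) with a vertical edge-detection filter. Output side = n - f + 1.
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f_h, j:j + f_w] * kernel)
    return out

# A 3x3 vertical edge detector (the paper's ConvNet also uses 3x3 kernels)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
image = np.zeros((6, 6))
image[:, :3] = 10.0  # bright left half, dark right half
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (6 - 3 + 1, 6 - 3 + 1) = (4, 4)
```

The feature map responds strongly (value 30) exactly where the filter straddles the brightness boundary, and is zero in flat regions.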

2) Pooling Layer

The pooling layer comes in two types, max pooling and average pooling, both performed to reduce the input spatially (reducing the number of parameters) through down-sampling operations (LeCun et al., 1989a, 1989b; LeCun et al., 1998; Ranzato et al., 2007). The method generally used is max pooling, which takes the most significant value in each region (Ranzato et al., 2007); however, other pooling methods, such as average pooling or L2-norm pooling, can be used. The pooling technique requires no learned parameters to compute its output. For an n x n input feature map, a pooling filter of size f, and stride s, the pooled feature map generally has size:

\left( \frac{n - f}{s} + 1 \right) \times \left( \frac{n - f}{s} + 1 \right)    (11)

and for one black-and-white or color (RGB) image input with n_C channels:

\left( \frac{n - f}{s} + 1 \right) \times \left( \frac{n - f}{s} + 1 \right) \times n_C    (12)
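A minimal max-pooling sketch of the down-sampling just described, on a hypothetical 4x4 feature map:

```python
# Max pooling: take the largest value in each f x f window with stride s.
# Output side length = (n - f) // s + 1, and no parameters are learned.
import numpy as np

def max_pool(x, f=2, s=2):
    """Down-sample a 2-D feature map by max pooling."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(x[i * s:i * s + f, j * s:j * s + f])
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]], dtype=float)
print(max_pool(fmap))  # window maxima: 6, 8, 3, 4
```

Each 2x2 window collapses to its maximum, halving both spatial dimensions while keeping the strongest activations.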

3) Fully Connected Layer

Several convolutional and pooling layers are usually stacked on top of each other to extract progressively more general feature representations as data moves through the network. The fully connected layers that follow interpret these feature representations and perform high-level reasoning (Hinton et al., 2012; Simonyan & Zisserman, 2014; Zeiler & Fergus, 2014).

There are several famous named Convolutional Neural Network architectures; the most common are LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, and ResNet (Residual Network).

In this algorithm, the author implements a CNN using a convolution of 100 filters with a 3x3 kernel size, input dimensions (224, 224, 3), and the Rectified Linear Unit (ReLU) activation function for the input layer. In the final layer, the author uses a sigmoid to obtain the final weight of each image according to its class in the training set.

III. TESTING

In this paper, the author examines four image classification algorithms: Logistic Regression with Perceptron, Multi-Layer Perceptron (MLP), Deep Multi-Layer Perceptron, and Convolutional Neural Network (ConvNet). Testing is performed on a computer with the following specifications:

• Dual-processor Intel(R) Xeon(R) CPU @ 2.30GHz.
• 13 GB memory.
• Tesla K80 GPU, pci bus id 0000:00:04.0, compute capability 3.7.

The author uses the Food-5K dataset, a data set of 2500 food pictures and 2500 non-food pictures, for the food/non-food classification task, divided into three parts: 3000 training images, 1000 validation images, and 1000 evaluation images. In prior work, the Food/Non-Food classification and food categorization tasks on this dataset were trained using the GoogLeNet model.


Fig. 6. Example of training data

A. Training and Testing (Validation) Algorithm

1) Logistic Regression with Perceptron

In testing this algorithm, the author uses the following network architecture:


Fig. 7. Logistic regression architecture

In Fig. 7, the author uses 150,529 training parameters for logistic regression, with a sigmoid activation function at the output.

In this algorithm, the author performs backpropagation optimization using the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 0.00001 and the "binary cross entropy" loss, which measures the difference between the training and test/validation probability distributions.

Fitting, or training, is done for 20 epochs with a batch size of 32.
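The training setup just described (Adam optimizer, binary cross-entropy loss) can be sketched end to end on synthetic data. This is an illustrative numpy reimplementation, not the author's code: the data are toy 2-D points rather than images, and the learning rate is raised from the paper's 0.00001 so the toy run converges in a few hundred full-batch steps:

```python
# Logistic regression trained with a hand-rolled Adam optimizer and
# binary cross-entropy, on synthetic linearly separable data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels

w = np.zeros(2); b = 0.0
m_w = np.zeros(2); v_w = np.zeros(2); m_b = v_b = 0.0
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8  # lr raised for the toy run

def bce(y_true, p, tiny=1e-12):
    p = np.clip(p, tiny, 1 - tiny)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

losses = []
for t in range(1, 501):                      # full-batch steps on the toy set
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output
    g_w = X.T @ (p - y) / len(y)             # BCE gradient w.r.t. weights
    g_b = np.mean(p - y)                     # BCE gradient w.r.t. bias
    # Adam moment updates with bias correction
    m_w = beta1 * m_w + (1 - beta1) * g_w; v_w = beta2 * v_w + (1 - beta2) * g_w**2
    m_b = beta1 * m_b + (1 - beta1) * g_b; v_b = beta2 * v_b + (1 - beta2) * g_b**2
    w -= lr * (m_w / (1 - beta1**t)) / (np.sqrt(v_w / (1 - beta2**t)) + eps)
    b -= lr * (m_b / (1 - beta1**t)) / (np.sqrt(v_b / (1 - beta2**t)) + eps)
    losses.append(bce(y, p))

print(losses[0] > losses[-1])  # True: the loss decreases over training
```

The same loop structure (forward pass, loss, gradient, Adam update) is what a framework's fit routine performs per batch.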


Fig. 8. Training result logistic regression

The fitting process yields an accuracy of 50% and a cost function of 1.95 on the test set distribution.


Fig. 9. Plot curve model accuracy & model loss logistic regression

From testing this algorithm, it is concluded that the accuracy is still too low and the cost function still too high, so we need to try another algorithm that can give a higher accuracy and a lower cost function.

2) Multi-Layer Perceptron (MLP)

In testing this algorithm, the author uses the following network architecture:


Fig. 10. Multi-Layer Perceptron architecture

In Fig. 10, the author uses 15,063,101 training parameters for the MLP, with a sigmoid activation function on two hidden layers and one output layer.

In this algorithm, backpropagation optimization again uses the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 0.00001 and the "binary cross entropy" loss function.


Fig. 11. Training result MLP

The fitting process yields an accuracy of 81% and a cost function of 50% on the test set distribution.


Fig. 12. Plot curve model accuracy & model loss MLP

From testing this algorithm, it is concluded that the accuracy has increased significantly and the cost function has decreased significantly; we still run the tests with the next two algorithms, which are expected to give a higher accuracy and a lower cost function.

3) Deep Multi-Layer Perceptron (Deep MLP)

In testing this algorithm, the author uses the following network architecture:


Fig. 13. Deep Multi-Layer Perceptron architecture

In Fig. 13, the author uses 30,126,921 training parameters for the deep MLP, with the Rectified Linear Unit (ReLU) activation function on three hidden layers and one output layer with sigmoid activation.

In this algorithm, backpropagation optimization again uses the Adaptive Moment Estimation (Adam) optimizer with the "binary cross entropy" loss function.


Fig. 14. Training result Deep MLP

The accuracy remains 81%, while the cost function on the test set distribution decreases to 42%.


Fig. 15. Plot curve model accuracy & model loss Deep MLP

From testing this algorithm, it is concluded that the accuracy is the same as the result obtained with the MLP algorithm, but the cost function has decreased slightly; the author will test one more classification algorithm, which is expected to give a higher accuracy and a lower cost function.

4) Convolutional Neural Network (CNN)

In testing this algorithm, the author uses the following network architecture:


Fig. 16. ConvNet architecture

In Fig. 16, the author uses 102,124,301 training parameters for the ConvNet, with Conv2D 3x3 filters and the ReLU activation function on three hidden layers and one output layer with sigmoid activation.

We inspect the sequential model's architecture and parameters by printing the model, obtaining the ConvNet classifier structure shown in Fig. 17.


Fig. 17. ConvNet sequential model

In this algorithm, we optimize backpropagation using the Root Mean Square Propagation (RMSProp) optimizer with the same loss function, "binary cross entropy."


Fig. 18. Training result ConvNet

We obtain a significantly increased accuracy of 94%, and the cost function on the test set distribution also shows a considerable decrease, to 24%.


Fig. 19. Plot curve model accuracy & model loss Deep Learning ConvNet

From testing this Deep ConvNet algorithm, it was

concluded that the accuracy rate had increased significantly

and the value of cost function also decreased significantly.

IV. RESULTS

We experimented to measure the performance of the four algorithms on the image classification problem, obtaining the test results shown in Table I below:

TABLE I. TRAINING AND TESTING SUMMARY

As shown in Table I above, ConvNet has the best accuracy of the four tested image classification algorithms. With ConvNet we obtain 94% accuracy and the smallest loss function, 24%, because ConvNet can understand the unique features of an object and associate them with the appropriate class label category. Each layer in the network takes data from the previous layer, transforms it, and forwards it on.

The network increases the complexity and detail of what it learns from layer to layer. It learns automatically from the data and requires no explicit feature extraction from the input.

We use a fully connected final layer with the sigmoid activation function in all four tested algorithms, since the sigmoid handles the binary classification problem correctly. We also use binary cross-entropy to compute the optimization score function for training and testing in these four algorithms.

V. CONCLUSION

In this paper, we present a comparison of four popular image classification algorithms for identifying food and non-food objects using the Food-5K dataset. The findings show how powerfully ConvNet can be used for object recognition in images compared to a generic artificial neural network; ConvNet has proven especially successful in the field of computer vision. This success is also the main reason deep learning with ConvNets has been such a hot topic in recent years, and it represents the state of the art.

REFERENCES

1. Lee J.D. (1996). Object Recognition Using a Neural Network with

Optimal Feature Extraction.

2. Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of

Classification Techniques.

3. Bingquan H. (2015). Research on Novel Image Classification Algorithm

based on Multi-Feature Extraction and Modified SVM Classifier.

4. Zolnouri S. P., Farokhi F., Andabili M. N. (2015). Improving Image

Classification Result using Neural Networks.

5. Andrew Y. Ng, Quoc V. Le, Jiquan Ngiam, Zhenghao Chen, Daniel

Chia, Pang Wei Koh (2010). Tiled Convolutional Neural Networks.

6. Fukushima K. (1980) Neocognitron: A Self-organizing Neural Network

Model for a Mechanism of Pattern Recognition Unaffected by Shift in

Position.

7. Hubel D.H., Wiesel T.N. (1961). Receptive Fields, Binocular Interaction

And Functional Architecture In The Cat's Visual Cortex.

8. LeCun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard

W., Jackel L.D. (1989) Backpropagation Applied to Handwritten Zip

Code Recognition.

9. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based

learning applied to document recognition. Proceedings of the IEEE,

86(11), 2278–2324.

10. Hinton G.E., Osindero S., Teh Y.W. (2006). A Fast Learning Algorithm

for Deep Belief Nets.

11. Nair V., Hinton G.E. (2010). Rectified Linear Units Improve Restricted

Boltzmann Machines.

12. Krizhevsky A. (2010). ImageNet Classification with Deep Convolutional Neural Networks.

13. Dahl G.E., Sainath T.N., Hinton G.E. (2013). Improving Deep Neural Networks For LVCSR Using Rectified Linear Units And Dropout.

14. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature,

521(7553), 436–444.

15. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E.,

Hubbard, W., & Jackel, L. D. (1989a). Handwritten digit recognition

with a back-propagation network. In D. S. Touretzky (Ed.), Advances in

neural information processing systems, 2 (pp. 396–404). Cambridge,

MA: MIT Press.

16. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E.,

Hubbard, W., & Jackel, L. D. (1989b). Backpropagation applied to

handwritten zip code recognition. Neural Computation, 1(4), 541–551.

17. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based

learning applied to document recognition. Proceedings of the IEEE,

86(11), 2278–2324.

18. Ranzato, M. A., Huang, F. J., Boureau, Y., & LeCun, Y. (2007).

Unsupervised learning of invariant feature hierarchies with applications

to object recognition. In Proceedings IEEE Conference on Computer

Vision and Pattern Recognition (pp. 1–8). Los Alamitos, CA: IEEE

Computer Society.

19. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., &

Salakhutdinov, R. R.(2012). Improving neural networks by preventing

co-adaptation of feature detectors. arXiv 1207.0580.

20. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional

networks for large-scale image recognition. arXiv 1409.1556.

21. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding

convolutional networks. In Proceedings of the European Conference on

Computer Vision (pp. 818–833). Berlin: Springer.

Algorithm             Training Parameters   Validation Accuracy (%)   Loss Function (%)
Logistic Regression   150,529               50                        195
MLP                   15,063,101            81                        50
Deep MLP              30,126,921            81                        42
ConvNet               102,124,301           94                        24