All content in this area was uploaded by Syeda Tanjila Atik on Jun 26, 2019

A Model for Identifying Historical Landmarks of Bangladesh from Image Content Using a Depth-wise Convolutional Neural Network

Afsana Ahsan Jeny1, Masum Shah Junayed1, Syeda Tanjila Atik1 and Sazzad Mahamd1

1 Daffodil International University, Dhanmondi, Dhaka-1207

1 { ahsan15-5278, junayed15-5008, syeda.cse, sazzad15-4980}@diu.edu.bd

Abstract. At present, tourism is considered to be one of the key factors shaping the development of a country’s economy. Most tourists tend to explore places that they find fascinating after seeing pictures of those places on the Internet. Anyone can learn about a famous place by simply typing its name into an Internet browser. A problem arises, however, when someone comes across the image of a beautiful landmark that is unidentified, as most web images carry no text caption. Most models proposed for image identification so far have a complex structure and high time complexity. In this paper, we propose a CNN model based on MobileNet and TensorFlow for detecting some historical landmarks of Bangladesh from their images. We examined 750 images from five different places. Compared with other state-of-the-art models, our model has a relatively simple structure and achieves a high average accuracy of 99.2%. This model can be further enhanced to facilitate image classification in other related areas.

Keywords: Convolutional Neural Networks, TensorFlow, MobileNet, Historical place detection, Image processing.

1 Introduction

Due to the remarkable development of the communication sector, visits to historical places have increased greatly. In many countries, especially Western countries, tourism has been highly influential for years. Besides the travel itself, visitors also get to know different people and different cultures while visiting a particular place.

Every day, many people from different parts of the world visit Bangladeshi historical places as tourists. Unfortunately, many people cannot recognize a historical place from an image of it when no caption is included.

There are many traditional historical places in Bangladesh, such as Buddhist Vihara, Mahasthangarh, Lalbagh, Puthia Temple Complex, Sonargaon, Shaat Gombuj Mosque, Kantajew Temple, Star Mosque (Tara Masjid), Tajhat Palace, Mainimati Ruins, Baitul Mukarram Mosque, Chittagong Commonwealth War Cemetery, Panam Nagar and so on. But most local people of the current generation, as well as visitors from overseas, do not have detailed knowledge about these places.

The Convolutional Neural Network (CNN) [8] is a suitable image detection method that has become well established in recent years. It uses local receptive fields, analogous to neurons in the brain, together with weight sharing, which reduces the number of parameters that have to be trained compared with a fully connected network. It tolerates a certain degree of rotation, translation, and distortion in the image. Such a network can avoid complex preprocessing of the image and take the original image directly as input.

In this paper, we propose a CNN model based on MobileNet and TensorFlow [11] for detecting some historical landmarks of Bangladesh from their images. The model is built on depth-wise separable filters and is trained on a dataset of five historical places: Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city and Shaat Gombuj Mosque.

The remainder of the paper is organized as follows. Existing technologies such as CNN [8] and TensorFlow [11] are discussed in Section 2. The MobileNet [10] architecture, which is based on depth-wise separable convolutions, is discussed in Section 3. Related work, including work on ImageNet, is described in Section 4. Data collection, model installation, and the training process are discussed in Section 5. The results of the experiment are discussed in Section 6. Finally, the conclusion and future work are presented in Section 7.

2 Background Study

TensorFlow [11] is the second-generation machine learning framework created by the Google Brain team and is used for designing, building, and training deep learning models. Anyone can use the TensorFlow [11] library for numerical computation. These computations are expressed as data flow graphs, in which the nodes describe the mathematical operations and the edges describe the data flowing between them. More recently, Google has provided a number of pre-trained models. MobileNet [10] is one of these pre-trained models, following Inception-v1 [2], Inception-v2 [3], and Inception-v3 [1]. Transfer learning is a learning process in which a model that has already been trained in one setting is reused to solve a new problem.
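The data flow graph idea can be illustrated with a toy evaluator. This is only a sketch in plain Python: the `Node` class and its `evaluate` method are hypothetical names invented for this illustration and are not part of the TensorFlow API.

```python
# Toy data flow graph: nodes are operations, edges carry values between them.
# Illustrative sketch only; it does not use or mirror the TensorFlow API.
import operator


class Node:
    """A graph node that applies `op` to the values arriving on its input edges."""

    def __init__(self, op, *inputs):
        self.op = op          # the mathematical operation held by this node
        self.inputs = inputs  # incoming edges: other nodes or constant values

    def evaluate(self):
        # Pull a value along every incoming edge, then apply the operation.
        values = [i.evaluate() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*values)


# Graph for (a * b) + c with a = 2, b = 3, c = 4.
product = Node(operator.mul, 2, 3)
total = Node(operator.add, product, 4)
result = total.evaluate()
print(result)
```

Building the graph first and evaluating it afterwards mirrors the separation between describing a computation and running it.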

Convolutional neural networks [8] are organized into layers such as dense (or fully connected) layers, convolutional layers, pooling layers, recurrent layers, and normalization layers. Different layers perform different transformations on their inputs, and some layers are better suited to certain tasks than others.

In this paper, we have used the depth-wise separable convolutions of MobileNet [10], which is a successful model for image classification. It uses fewer parameters than other models and gives high accuracy within a short training time. Compared with standard convolutions, depth-wise separable convolutions significantly reduce both the amount of computation and the number of parameters required for a layer of a given dimension.

3 MobileNet Architecture

In this section, we describe the core layers of the MobileNet [10] architecture, which is based on depth-wise separable filters. The MobileNet [10] model uses a form of factorized convolution that factorizes a standard convolution into a depth-wise convolution and a 1 x 1 convolution called a point-wise convolution. In MobileNet [10], the depth-wise convolution applies a single filter to each input channel. The point-wise convolution then applies a 1 x 1 convolution to combine the outputs of the depth-wise convolution. A standard convolution filters and combines its inputs in a single step, whereas the depth-wise separable convolution divides this into two layers: one layer for filtering and another layer for combining. Fig. 1 shows how a standard convolution 1(a) is factorized into a depth-wise convolution 1(b) and a 1 x 1 point-wise convolution 1(c).

Fig. 1. The standard convolutional filters in (a) are replaced by two layers: depth-wise convolution in (b) and point-wise convolution in (c) to create a depth-wise separable filter.
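The two-step factorization can be sketched on a tiny feature map in plain Python. This is a minimal illustration, not the MobileNet implementation: the function names, the nested-list layout (channels, rows, columns), the absence of padding and the stride of 1 are all assumptions made for this example.

```python
def depthwise_conv(feature_map, filters):
    """Apply one k x k filter to each input channel (no padding, stride 1)."""
    out = []
    for channel, kernel in zip(feature_map, filters):
        k = len(kernel)
        size = len(channel) - k + 1  # output width/height for a square input
        out.append([
            [
                sum(channel[r + i][c + j] * kernel[i][j]
                    for i in range(k) for j in range(k))
                for c in range(size)
            ]
            for r in range(size)
        ])
    return out


def pointwise_conv(feature_map, weights):
    """Mix channels with a 1 x 1 convolution; weights[n][m] maps input m to output n."""
    size = len(feature_map[0])
    return [
        [
            [
                sum(w * feature_map[m][r][c] for m, w in enumerate(row))
                for c in range(size)
            ]
            for r in range(size)
        ]
        for row in weights
    ]


def depthwise_separable_conv(feature_map, depthwise_filters, pointwise_weights):
    """Filtering step (depth-wise) followed by combining step (point-wise)."""
    filtered = depthwise_conv(feature_map, depthwise_filters)
    return pointwise_conv(filtered, pointwise_weights)


# Two 3 x 3 input channels, 2 x 2 depth-wise kernels, three output channels.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
dw = [
    [[1, 0], [0, 1]],  # channel 0: sums the main diagonal of each window
    [[1, 1], [1, 1]],  # channel 1: sums the whole window
]
pw = [[1, 0], [0, 1], [1, 1]]  # outputs: channel 0, channel 1, and their sum
y = depthwise_separable_conv(x, dw, pw)
print(y)
```

Note that the depth-wise step never mixes channels; only the 1 x 1 point-wise step does, which is what makes the factorization cheaper than a full convolution.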

A standard convolutional layer takes as input a $D_F \times D_F \times M$ feature map F and produces a $D_G \times D_G \times N$ feature map G. Here $D_F$ is the spatial width and height of a square input feature map and M is the number of input channels. Likewise, $D_G$ is the spatial width and height of a square output feature map and N is the number of output channels.

The standard convolutional layer is parameterized by a convolution kernel K of size $D_K \times D_K \times M \times N$, where $D_K$ is the spatial dimension of the kernel, assumed to be square, M is the number of input channels and N is the number of output channels, as defined earlier.

The computational cost of a standard convolution is

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

so the computational cost depends multiplicatively on the number of input channels M, the number of output channels N, the kernel size $D_K \times D_K$ and the feature map size $D_F \times D_F$.

Depth-wise convolution is extremely efficient relative to standard convolution. However, it only filters the input channels; it does not combine them to create new features. An additional layer that computes a linear combination of the outputs of the depth-wise convolution through a 1 x 1 convolution is therefore needed to generate these new features. The combination of depth-wise convolution and 1 x 1 point-wise convolution is called depth-wise separable convolution.

With $D_K$ the kernel size, $D_F$ the feature map size, M the number of input channels and N the number of output channels, the computational cost of a depth-wise separable convolution is

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$

which is the sum of the costs of the depth-wise and 1 x 1 point-wise convolutions. MobileNet [10] uses 3 x 3 depth-wise separable convolutions, which need between 8 and 9 times less computation than standard convolutions.
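The "8 to 9 times" figure can be checked numerically. The sketch below evaluates both cost expressions for one layer; the layer dimensions (3 x 3 kernel, 512 input and output channels, 14 x 14 feature map) are hypothetical values chosen for illustration, not measurements from our experiment. Dividing the separable cost by the standard cost gives 1/N + 1/D_K^2, so for a 3 x 3 kernel the saving approaches a factor of 9 as N grows.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution: D_K * D_K * M * N * D_F * D_F."""
    return d_k * d_k * m * n * d_f * d_f


def separable_conv_cost(d_k, m, n, d_f):
    """Depth-wise cost plus 1 x 1 point-wise cost."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 512 input and output channels, 14 x 14 feature map.
d_k, m, n, d_f = 3, 512, 512, 14
standard = standard_conv_cost(d_k, m, n, d_f)
separable = separable_conv_cost(d_k, m, n, d_f)
reduction = standard / separable
print(f"standard={standard}, separable={separable}, reduction={reduction:.2f}x")
```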

4 Related Work

This section reviews prior work, including CNN models trained on image datasets.

In 2014, the authors of [1] proposed an Inception-v3 [1] model with GoogLeNet for image classification. They explained these models in depth and used images from ImageNet for training, validation, and testing. However, the model used 23 million parameters, so its size became large and its run-time complexity was relatively high [1].

In 2014, Khairul Amin, Mirsa Hussain and Norsidah Ujang surveyed 35 visitors in their study of visitors’ identification of landmarks in the historic district of Banda Hilir, Melaka, Malaysia. The paper examines the identity and nature of Banda Hilir's landmarks as perceived by visitors. The study adopted a mental mapping strategy to gather the images regarded as landmarks and their associated characteristics. With the 35 visitors, maps of Banda Hilir's landmarks were also drawn. However, this was an entirely manual process [4].

In 2018, Malcolm J. Beynon, Calvin Jones, Max Munday and Neil Roche used fuzzy-set qualitative comparative analysis to analyse landmark historical sites in Wales. The paper shows how information on tourism activity at selected historic sites can be used to analyse the recipes associated with sites that support higher or lower levels of GVA. However, this paper is purely an analysis paper [6].

In 2012, Alex Krizhevsky et al. used a deep convolutional neural network [8] to classify the high-resolution images in ImageNet LSVRC-2010 [9]. The dataset contained about 1.2 million training images, 50,000 validation images and 150,000 testing images. To reduce over-fitting they used a regularization method called dropout. They achieved high accuracy with the CNN [8], but at the same time the model exhibited a higher time complexity [7].

In 2017, Łukasz Kaiser et al. proposed a new model, SliceNet, which is related to Xception and ByteNet and is based on depth-wise separable convolutions. SliceNet uses larger convolution windows instead of dilated filters, and it was evaluated for machine translation using BLEU. As a result, the model is more complex to implement [12].

Our experiment is based on the MobileNet [10] model, a depth-wise convolutional neural network [8], for historical place detection from images. To our knowledge, detection of these historical places has not been done with a CNN [8] model before. The depth-wise CNN [8] model uses fewer parameters, and in our experiment it worked well, achieving high accuracy within a short training time.

5 Methodology

In this section, we first present a system architecture that represents the procedure of our experiment; then the data collection procedure is discussed; then data preprocessing and the installation of the model are described; and lastly we describe how we trained our model.

Fig. 2 represents the system architecture, which describes the procedure of our experiment.

Fig. 2. Architecture of System.

5.1 Data collection

For detecting the historical places of Bangladesh, we have collected 750 images in five classes, with 150 images per class. The classes are Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, and Shaat Gombuj Mosque.

5.2 Data preprocessing

In this step, we have increased the dataset artificially, in such a way that the augmented images look like fresh data. First of all, we resized the images. Then we used five different augmentation methods to enlarge the dataset: Shearing, Translation, Flip, Rotate +30, and Rotate -30.

This gave us 4500 images for training, validation, and testing (the 750 original images plus five augmented copies of each). After augmentation, each class has 900 images, of which 180 are used for testing. As it is very difficult to show all the data, a sample of our dataset is provided in Fig. 3.

Fig. 3. The sample of our dataset.
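The five augmentations of Section 5.2 can be sketched on a toy greyscale grid in plain Python. This is an illustrative sketch under stated assumptions, not the tooling used in the experiment: the function names, the shear factor of 0.2, the one-pixel shift and the nearest-neighbour sampling are all hypothetical choices; only the ±30 degree rotations and the list of methods come from the text.

```python
import math


def warp(image, inverse_map, fill=0):
    """Build an output image by looking up, for each output pixel, its source pixel."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_x, src_y = inverse_map(x, y)
            sx, sy = round(src_x), round(src_y)  # nearest-neighbour sampling
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = image[sy][sx]
    return out


def flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def translate(image, dx, dy):
    """Shift the image by (dx, dy) pixels; uncovered pixels get the fill value."""
    return warp(image, lambda x, y: (x - dx, y - dy))


def shear(image, factor):
    """Shear horizontally: each row is shifted in proportion to its y coordinate."""
    return warp(image, lambda x, y: (x - factor * y, y))


def rotate(image, degrees):
    """Rotate about the image centre by the given angle."""
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))

    def inverse(x, y):
        # Rotate the output coordinate backwards to find the source pixel.
        dx, dy = x - cx, y - cy
        return cx + cos_a * dx + sin_a * dy, cy - sin_a * dx + cos_a * dy

    return warp(image, inverse)


def augment(image):
    """Return the original plus five augmented copies (six images in total)."""
    return [
        image,
        shear(image, 0.2),       # shear factor is an assumed value
        translate(image, 1, 1),  # shift is an assumed value
        flip(image),
        rotate(image, 30),
        rotate(image, -30),
    ]


sample = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4 x 4 "image"
dataset_size = 750 * len(augment(sample))
print(dataset_size)
```

Each original image yields six images, which is how 750 collected images become 4500, or 900 per class.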

5.3 Model Installation

This experiment is based on the MobileNet [10] model. In the backend, we have used TensorFlow [11], which is an open-source library with a Python interface. The hardware platform is a 2.2 GHz Intel i5 processor with 4 GB of memory, running a 64-bit operating system on an x64-based processor.

5.4 Train model

We have used the 1.0 224 version of MobileNet [10], which is the largest version. Here 1.0 is the width multiplier and 224 is the image resolution. Although a resolution of 224 is high and takes more processing time, it provides better classification accuracy.

First of all, we prepared our dataset as described above. Then we created a Docker [13] image to handle training. One of the many advantages of Docker [13] is that TensorFlow [11] comes preinstalled and configured for us in the image. Next, we defined the default training parameters as environment variables, so that they can be overridden at runtime as needed. We set our image size to 224 pixels. We then defined the output locations for the trained graph and for the labels text file that will be created by the training script.
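The pattern of default training parameters that can be overridden at runtime can be sketched as follows. The variable names (`IMAGE_SIZE`, `WIDTH_MULTIPLIER`, `TRAINING_STEPS`), the `load_training_config` helper and the architecture label format are hypothetical and chosen for illustration; the default values of 224, 1.0 and 5000 follow the settings stated in this section.

```python
import os

# Hypothetical variable names; the defaults follow the values stated in Section 5.4.
DEFAULTS = {
    "IMAGE_SIZE": "224",        # input resolution in pixels
    "WIDTH_MULTIPLIER": "1.0",  # MobileNet width multiplier
    "TRAINING_STEPS": "5000",   # number of training steps
}


def load_training_config(environ=os.environ):
    """Read training parameters, letting environment variables override the defaults."""
    values = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "image_size": int(values["IMAGE_SIZE"]),
        "width_multiplier": values["WIDTH_MULTIPLIER"],
        "training_steps": int(values["TRAINING_STEPS"]),
        # Label identifying the model variant, built from multiplier and resolution.
        "architecture": f"mobilenet_{values['WIDTH_MULTIPLIER']}_{values['IMAGE_SIZE']}",
    }


config = load_training_config({})                        # defaults only
quick = load_training_config({"TRAINING_STEPS": "500"})  # runtime override
print(config["architecture"], config["training_steps"], quick["training_steps"])
```

Keeping the defaults in one place and reading overrides from the environment lets the same container image be reused for short trial runs and for full training.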

Finally, we set the number of training steps to 5000 (the default is 4000), which still allows reasonably quick training without a huge training cycle. After training is complete, the output location is used to retrieve the trained model and labels from the Docker [13] container, and the input location is used to provide the Docker [13] container with our training images. Then, at last, we completed the test. Fig. 4 represents the MobileNet [10] architecture used with our dataset.

Fig. 4. The architecture of MobileNet.

Confusion Matrix:

A confusion matrix is a table that is often used to describe the performance of a classification model on test data whose actual values are known. For a binary (2-class) problem, the confusion matrix contains the numbers of false positives (FPs), false negatives (FNs), true positives (TPs), and true negatives (TNs). For multi-class problems with more than two classes, the confusion matrix is n × n (n > 2): it contains n rows, n columns and n × n entries in total. From this matrix, the numbers of FPs, FNs, TPs, and TNs cannot be read off directly. Instead, they are calculated for each class i as follows, where $c_{ij}$ denotes the entry in row i (actual class) and column j (predicted class):

$$TP_i = c_{ii}, \qquad FP_i = \sum_{j \neq i} c_{ji}$$

$$FN_i = \sum_{j \neq i} c_{ij}, \qquad TN_i = \sum_{j \neq i} \sum_{k \neq i} c_{jk}$$

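We have also built a multiclass Confusion Matrix [14] for MobileNet [10]. From the Confusion Matrix [14], we have then calculated Precision, Recall, Accuracy, and F1-Score using the following formulas.

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$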

Table 1 shows the multiclass Confusion Matrix [14] of the MobileNet [10] model.

Table 1. The multiclass Confusion Matrix

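| Historical Place | Ahsanmonzil | Lalbagh | Mahasthangarh | Panam city | Shaat Gombuj Mosque |
|---|---|---|---|---|---|
| Ahsanmonzil | 177 | 1 | 0 | 1 | 1 |
| Lalbagh | 3 | 176 | 0 | 0 | 1 |
| Mahasthangarh | 0 | 1 | 179 | 0 | 0 |
| Panam city | 2 | 1 | 0 | 176 | 1 |
| Shaat Gombuj Mosque | 2 | 1 | 0 | 2 | 175 |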
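The per-class metrics can be recomputed from Table 1 with a short script. This is a sketch that assumes rows are the actual classes and columns the predicted classes (each row sums to the 180 test images per class) and that uses the standard one-vs-rest definitions of precision, recall, accuracy and F1. Under these assumptions the macro-averaged per-class accuracy comes to about 99.2%, which matches the reported figure, while the fraction of test images classified correctly overall is 883/900, about 98.1%.

```python
CLASSES = ["Ahsanmonzil", "Lalbagh", "Mahasthangarh", "Panam city", "Shaat Gombuj Mosque"]

# Table 1, assuming rows are the actual classes and columns the predicted classes.
MATRIX = [
    [177, 1, 0, 1, 1],
    [3, 176, 0, 0, 1],
    [0, 1, 179, 0, 0],
    [2, 1, 0, 176, 1],
    [2, 1, 0, 2, 175],
]


def per_class_metrics(matrix, i):
    """One-vs-rest precision, recall, accuracy and F1 for class index i."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # rest of row i
    fp = sum(row[i] for row in matrix) - tp  # rest of column i
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


metrics = [per_class_metrics(MATRIX, i) for i in range(len(CLASSES))]
macro_accuracy = sum(m["accuracy"] for m in metrics) / len(metrics)
overall_accuracy = sum(MATRIX[i][i] for i in range(len(CLASSES))) / sum(map(sum, MATRIX))

for name, m in zip(CLASSES, metrics):
    print(f"{name}: " + ", ".join(f"{k}={v:.3f}" for k, v in m.items()))
print(f"macro-average accuracy = {macro_accuracy:.3f}")
print(f"overall accuracy = {overall_accuracy:.3f}")
```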

6 Result Exploration

We obtained a favorable accuracy of 99.2% from MobileNet [10]. Fig. 5 shows the accuracy (y-axis) as a function of training progress (x-axis), and Fig. 6 shows the cross-entropy (y-axis) as a function of training progress (x-axis). Here the orange line represents the training set, and the blue line represents the validation set. From these two figures we can say that our model has not overfitted. An overfitted model has extremely low training error but high testing error, whereas MobileNet [10] has also performed well in testing.

Fig. 5. Accuracy regarding training and validation.

Fig. 6. Cross entropy regarding training and validation.

The following Fig. 7 shows the precision, recall, accuracy, F1-Score and macro average

[15] graph of Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, and Shaat Gombuj

Mosque.

Fig. 7. Precision, Recall, Accuracy and F1-Score graph.

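Table 2 shows that if we had used the full-convolution MobileNet [10], we would have needed about 8.6 times as many multiply-adds and about 7 times as many parameters for an ImageNet accuracy only about one percentage point higher, and training would have been lengthy [10].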

Table 2. Depthwise Separable vs Full Convolution MobileNet

| Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|---|---|---|---|
| Conv MobileNet | 71.7% | 4866 | 29.3 |
| Depthwise MobileNet | 70.6% | 569 | 4.2 |

7 Conclusion and Future Work

In this paper, we have used MobileNet [10], a depth-wise convolutional neural network [8] model, for detecting some historical landmarks of Bangladesh from their images. This model is small compared with other CNN [8] models and very similar to Inception-V3 [1]. It is built on depth-wise separable filters and was trained on a dataset of five historical places of Bangladesh: Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city and Shaat Gombuj Mosque. Other models such as Inception-V3 [1], AlexNet [5], VGG 16 [17], ResNet [16] and DenseNet-190 [18] are very deep and large; using them would mean handling 23 million, 60 million, 138 million, 25 million and 40 million parameters respectively, which would make training a lengthy process. With depth-wise MobileNet [10] we have used only 4.2 million parameters, and after running the experiment for six hours our model has given an accuracy of 99.2%. Moreover, MobileNet [10] is a general model of the convolutional neural network [8], so a more generalized and effective model can be developed in future using different techniques to facilitate image classification in other related areas.

References

1. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich, “Going deeper with convolutions”, arXiv:1409.4842v1 [cs.CV] 17 Sep 2014.
2. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alexander A. Alemi, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv:1602.07261 [cs.CV].
3. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, “Rethinking the Inception Architecture for Computer Vision”, arXiv:1512.00567v1 [cs.CV] 2 Dec 2015.
4. Khairul Amin, Mirsa Hussain and Norsidah Ujang, “Visitors’ Identification of Landmarks in the Historic District of Banda Hilir, Melaka, Malaysia”, AMER International Conference on Quality of Life, AicQoL2014KotaKinabalu.
5. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, “Deep Learning with Tensorflow”, http://cvml.ist.ac.at/courses/DLWT_W17/
6. Malcolm J. Beynon, Calvin Jones, Max Munday and Neil Roche, “Investigating value added from heritage assets: An analysis of landmark historical sites in Wales”, International Journal of Tourism Research.
7. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, Advances in Neural Information Processing Systems 25 (NIPS 2012).
8. Karen Simonyan, Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, arXiv:1409.1556v3 [cs.CV] 18 Nov 2014.
9. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge”, arXiv:1409.0575v3 [cs.CV] 30 Jan 2015.
10. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv:1704.04861v1 [cs.CV] 17 Apr 2017.
11. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, “Large-Scale Machine Learning on Heterogeneous Distributed Systems”, (Preliminary White Paper, November 9, 2015), arXiv:1603.04467v2 [cs.DC] 16 Mar 2016.
12. Łukasz Kaiser, Aidan N. Gomez, François Chollet, “Depthwise Separable Convolutions for Neural Machine Translation”, arXiv:1706.03059v2 [cs.CL] 16 Jun 2017.
13. https://www.docker.com/ Docker Simplifies the Developer Experience.
14. The details of Confusion matrix, https://en.wikipedia.org/wiki/Confusion_matrix
15. https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin
16. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”, arXiv:1512.03385v1 [cs.CV] 10 Dec 2015.
17. Karen Simonyan, Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition”, arXiv:1409.1556v6 [cs.CV] 10 Apr 2015.
18. Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, “Densely Connected Convolutional Networks”, arXiv:1608.06993v5 [cs.CV] 28 Jan 2018.