A model for identifying Historical Landmarks of Bangladesh from image content using a Depth-wise Convolutional Neural Network

Afsana Ahsan Jeny1, Masum Shah Junayed1, Syeda Tanjila Atik1 and Sazzad Mahamd1

1 Daffodil International University, Dhanmondi, Dhaka-1207
1 {ahsan15-5278, junayed15-5008, syeda.cse, sazzad15-4980}@diu.edu.bd
Abstract. At present, tourism is considered one of the key factors shaping the development of a country's economy. Most tourists tend to explore places that they find fascinating after seeing pictures of those places on the Internet. Anyone can learn about a famous place by simply typing its name into a web browser. But a problem arises when one comes across the image of a beautiful landmark that is unlabeled, since web images often carry no text caption. Most models proposed for image identification so far exhibit complex structures and high time complexity. In this paper, we propose a CNN model based on MobileNet and TensorFlow for detecting some historical landmarks of Bangladesh from their images. We examined 750 images from five different places, and compared with other state-of-the-art models, our model has a relatively simpler structure and achieved a significantly higher average accuracy of 99.2%. This model can be further enhanced to facilitate image classification in other related areas.
Keywords: Convolutional Neural Networks, TensorFlow, MobileNet, Historical place detection, Image processing.
1 Introduction
Due to the remarkable development of the communication sector, visits to historical places have increased greatly. In many countries, especially Western countries, tourism has been a major influence on the economy for years. Beyond just traveling, one can also get to know different people and different cultures while visiting a particular place. Every day, many people from different parts of the world visit Bangladeshi historical places for tourism purposes. Unfortunately, many people cannot recognize a historical place from an image of it when no caption is included.
There are many traditional historical places in Bangladesh, such as Buddhist Vihara, Mahasthangarh, Lalbagh, Puthia Temple Complex, Sonargaon, Shaat Gombuj Mosque, Kantajew Temple, Star Mosque (Tara Masjid), Tajhat Palace, Mainimati Ruins, Baitul Mukarram Mosque, Chittagong Commonwealth War Cemetery, Panam Nagar and so on. But most local people of the current generation, as well as people from overseas countries, do not have detailed knowledge about these places.
The Convolutional Neural Network (CNN) [8] is a well-suited method for image detection which has become well established in recent years. It uses local receptive fields, analogous to brain neurons, sharing weights and linking information, which reduces the number of training parameters compared with fully connected networks. It tolerates image transformations up to a certain degree of rotation, translation, and distortion. Such a network avoids complex preprocessing of the image and can take the original image directly as input.
In this paper, we propose a CNN model based on MobileNet and TensorFlow [11] for detecting some historical landmarks of Bangladesh from their images. The model is built on depth-wise separable filters and trained on a dataset of five historical places: Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city and Shaat Gombuj Mosque.
The remainder of the paper is organized as follows: background on existing work such as CNN [8] and TensorFlow [11] is discussed in Section 2. The MobileNet [10] architecture, which is based on depth-wise separable convolutions, is described in Section 3. Related work is reviewed in Section 4. Data collection, model installation, and the training process are discussed in Section 5. The results of the experiment are discussed in Section 6. Finally, the conclusion and future work are presented in Section 7.
2 Background Study
TensorFlow [11] is the second-generation machine learning framework created by the Google Brain team and is used for designing, building, and training deep learning models. Anyone can use the TensorFlow [11] library for numerical computation. These computations are expressed as data-flow graphs, in which edges represent the data (tensors) and nodes represent the mathematical operations. Google has also released a number of pre-trained models; among these, MobileNet [10] is one of the pre-trained models alongside Inception-v1 [1], Inception-v3 [3], and Inception-v4 [2]. Transfer learning is a learning technique in which a model trained in one environment is reused to solve a new problem.
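To make the data-flow idea concrete, here is a minimal sketch of our own (not taken from the TensorFlow [11] paper), using the modern tf.function API: the traced operations are the graph's nodes and the tensors passed between them are its edges.

```python
import tensorflow as tf

@tf.function  # traces this Python function into a TensorFlow dataflow graph
def affine(x):
    # matmul and the addition are nodes of the graph; the tensors x, w, b
    # and the intermediate results are the edges flowing between them.
    w = tf.constant([[2.0, 0.0], [0.0, 3.0]])
    b = tf.constant([1.0, -1.0])
    return tf.linalg.matmul(x, w) + b

print(affine(tf.constant([[1.0, 2.0]])))  # tf.Tensor([[3. 5.]], shape=(1, 2))
```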
Convolutional neural networks [8] are organized into layers such as dense (or fully connected) layers, convolutional layers, pooling layers, recurrent layers, and normalization layers. Different layers perform different transformations on their inputs, and some layers are better suited to certain tasks than others.
In this paper, we use the depth-wise separable convolutions of MobileNet [10]. It is a successful model for image classification: it uses fewer parameters than other models and delivers high accuracy within a short training time. For a given accuracy, it significantly reduces both the number of multiply-add operations and the number of parameters required compared with previously available models.
3 MobileNet Architecture
In this section, we describe the core layers of the MobileNet [10] architecture, which is based on depth-wise separable filters. The MobileNet [10] model is a form of factorized convolution that splits a standard convolution into a depth-wise convolution and a 1 x 1 convolution called a point-wise convolution. In MobileNet [10], the depth-wise convolution applies a single filter to each input channel. The point-wise convolution then applies a 1 x 1 convolution to combine the outputs of the depth-wise convolution. The depth-wise separable convolution thus divides a standard convolution into two layers: one layer for filtering and another layer for combining. Fig. 1 shows how a standard convolution 1(a) is factorized into a depth-wise convolution 1(b) and a 1 x 1 point-wise convolution 1(c).
Fig. 1. The standard convolutional filters in (a) are replaced by two layers: depth-wise convolution in (b) and point-wise convolution in (c), to create a depth-wise separable filter.
A standard convolutional layer takes as input a $D_F \times D_F \times M$ feature map $F$ and produces a $D_G \times D_G \times N$ feature map $G$. Here $D_F$ is the spatial width and height of the square input feature map and $M$ is the number of input channels; likewise $D_G$ is the spatial width and height of the square output feature map and $N$ is the number of output channels.

The standard convolutional layer is parameterized by a convolution kernel $K$ of size $D_K \times D_K \times M \times N$, where $D_K$ is the spatial dimension of the kernel, assumed to be square, and $M$ and $N$ are the numbers of input and output channels as defined earlier.

The computational cost of a standard convolution (assuming stride one and padding, so the output has the same spatial size as the input) is:

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$

The computational cost thus depends multiplicatively on the number of input channels $M$, the number of output channels $N$, the kernel size $D_K \times D_K$ and the feature map size $D_F \times D_F$.
Depth-wise convolution is highly efficient relative to standard convolution. However, it only filters the input channels; it does not combine them to create new features. An additional layer that computes a linear combination of the depth-wise convolution outputs via a 1 x 1 convolution is therefore needed to generate these new features. The combination of a depth-wise convolution and a 1 x 1 point-wise convolution is called a depth-wise separable convolution.

The computational cost of a depth-wise separable convolution is:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$

which is the sum of the costs of the depth-wise and 1 x 1 point-wise convolutions. The reduction in computation relative to a standard convolution is therefore

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}.$$

MobileNet [10] uses 3 x 3 depth-wise separable convolutions, which require between 8 and 9 times less computation than standard convolutions.
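As an illustration of this factorization, the following minimal sketch, written against the Keras API of TensorFlow [11] (the backend we use), builds one depth-wise separable block. The layer configuration and the dimensions in the comments are illustrative assumptions, not the exact MobileNet [10] settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters):
    """One depth-wise separable convolution: per-channel 3x3 filtering
    followed by a 1x1 point-wise combination of the channels."""
    # Depth-wise stage: a single 3x3 filter per input channel (filtering).
    x = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Point-wise stage: 1x1 convolution mixes channels into N new features.
    x = layers.Conv2D(pointwise_filters, kernel_size=1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Illustrative cost check for D_F = 112, M = 32, N = 64, D_K = 3:
#   standard:  3*3*32*64*112*112                ~= 231.2 million mult-adds
#   separable: 3*3*32*112*112 + 32*64*112*112   ~=  29.3 million mult-adds
# i.e. roughly an 8x reduction, matching 1/N + 1/D_K^2 = 1/64 + 1/9.
inputs = tf.keras.Input(shape=(112, 112, 32))
outputs = depthwise_separable_block(inputs, pointwise_filters=64)
model = tf.keras.Model(inputs, outputs)
```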
4 Related Work
We now review some prior work on image classification using CNN models and image datasets.
In 2014, the authors proposed the Inception model with GoogLeNet [1] for image classification. They explained the model in depth and used images from ImageNet for training, validation, and testing. However, the model used 23 million parameters, so the model size became large and hence the runtime complexity of the model was also relatively high [1].
In 2014, Khairul Amin, Mirsa Hussain and Norsidah Ujang studied visitors' identification of landmarks in the historic district of Banda Hilir, Melaka, Malaysia, using 35 visitors. The paper examines the identity and character of Banda Hilir's landmarks as perceived in the visitors' images of the place. The study adopted a mental-mapping strategy to combine images regarded as landmarks with their associated characteristics, and the 35 visitors also drew maps of Banda Hilir's landmarks. However, it was an entirely manual process [4].
In 2018, Malcolm J. Beynon, Calvin Jones, Max Munday and Neil Roche used fuzzy-set qualitative comparative analysis to analyze landmark historical sites in Wales. The paper shows how information on tourism activity at historic sites can be used to identify configurational "recipes" associated with higher or lower levels of gross value added (GVA). However, this work is purely an analysis paper [6].
In 2012, Alex Krizhevsky et al. used a deep convolutional neural network [8] to classify the high-resolution images of the ImageNet LSVRC-2010 challenge [9]. There were about 1.2 million training images, 50,000 validation images, and 150,000 testing images in ImageNet. To reduce overfitting they used a regularization method called dropout and achieved high accuracy using the CNN [8], but at the same time the model exhibited high time complexity [7].
In 2017, Łukasz Kaiser et al. proposed a new model, SliceNet, which is related to Xception and ByteNet and is based on depth-wise separable convolutions, evaluated on machine translation with the BLEU metric. SliceNet used larger convolution windows instead of dilated filters, and as a result the model became more complex to implement [12].
Our experiment is based on the MobileNet [10] model, a depth-wise convolutional neural network [8], for historical place detection using images of historical sites. To our knowledge, detection of Bangladeshi historical places from images has not been done with a CNN [8] model before. The depth-wise CNN [8] model used fewer parameters, worked well in our experiment within a short training time, and achieved high accuracy.
5 Methodology
In this section, we first present the system architecture that represents the procedure of our experiment; we then discuss the data collection procedure, describe the data preprocessing and the installation of the model, and lastly explain how we trained our model.

Fig. 2 represents the system architecture, which describes the procedure of our experiment.
Fig. 2. Architecture of System.
5.1 Data collection
For detecting historical places of Bangladesh, we collected 750 images of five classes, with 150 images per class. The classes are Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, and Shaat Gombuj Mosque.
5.2 Data preprocessing
In this step, we enlarged the dataset artificially while keeping the augmented images close to natural data. First, we resized the images. Then we applied five different augmentation methods to increase the dataset: shearing, translation, flipping, rotation by +30 degrees, and rotation by -30 degrees.

This produced 4,500 images for training, validation, and testing. After augmentation, each class has 900 images, of which 180 are used for testing. As it is impractical to show all the data, a sample of our dataset is provided in Fig. 3.
Fig. 3. The sample of our dataset.
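As a rough sketch of this augmentation step, Keras's ImageDataGenerator can apply all five transformations on the fly; the parameter values and the directory layout below are illustrative assumptions, not our exact settings.

```python
import tensorflow as tf

# Hypothetical augmentation pipeline approximating the five methods
# (shear, translation, flip, rotate +/-30 degrees).
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    shear_range=0.2,          # shearing
    width_shift_range=0.1,    # translation (horizontal)
    height_shift_range=0.1,   # translation (vertical)
    horizontal_flip=True,     # flip
    rotation_range=30,        # random rotation within +/-30 degrees
    rescale=1.0 / 255,
)

train_flow = datagen.flow_from_directory(
    "dataset/train",          # hypothetical layout: one folder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```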
5.3 Model Installation
This experiment is based on the MobileNet [10] model. In the backend, we used TensorFlow [11], an open-source Python library. The hardware platform is a 2.2 GHz Intel Core i5 with 4 GB of memory, running a 64-bit operating system on an x64-based processor.
5.4 Train model
We used the MobileNet 224 1.0 version [10], which is the largest variant: 1.0 is the width multiplier and 224 is the input image resolution. Although 224 x 224 is a high resolution and takes more processing time, it provides better classification accuracy.
First, we prepared our dataset as described above. We then created a Docker [13] image to handle training, with TensorFlow [11] preinstalled and configured for us, one of the many advantages of Docker [13]. We defined the default training parameters as environment variables so that they can be overridden at runtime as needed, and set our image size to 224 pixels. Next, we defined the output locations of the trained graph and of the labels text file that the training script will create.

Finally, we set the number of training steps to 5,000 (the default is 4,000), which allows reasonably quick training without an excessively long training cycle. After training completes, the output volume is used to recover the trained model and labels from the Docker [13] container, while the input volume is used to provide the Docker [13] container with our training images. Then, at last, we ran the test. The following Fig. 4 represents the MobileNet [10] architecture used on our dataset.
Fig. 4. The architecture of MobileNet.
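For readers who prefer a plain script to the Docker [13] workflow, the following is a rough Keras equivalent of this retraining step: MobileNet (width multiplier 1.0, 224 x 224 input) pretrained on ImageNet with a new five-class softmax head. The optimizer and other hyper-parameters are our illustrative choices, not the exact training configuration used here.

```python
import tensorflow as tf

# MobileNet 1.0/224 pretrained on ImageNet, without its original classifier.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), alpha=1.0,   # width multiplier 1.0
    include_top=False, weights="imagenet", pooling="avg",
)
base.trainable = False  # freeze the depth-wise separable feature extractor

# New head: one softmax layer over the five landmark classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_flow, validation_data=..., epochs=...)  # train_flow: Section 5.2
```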
Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model on test data whose true values are known. For a binary (2-class) problem, the confusion matrix counts the numbers of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). For multiclass problems with n classes (n > 2), the confusion matrix is of size n x n: it contains n rows, n columns, and n x n entries in total. From this matrix, the numbers of FP, FN, TP, and TN cannot be read off directly. For a class $i$, with $c_{jk}$ denoting the entry in row $j$ (actual class) and column $k$ (predicted class), they are calculated as:

$$TP_i = c_{ii}, \qquad FP_i = \sum_{j \ne i} c_{ji}, \qquad FN_i = \sum_{j \ne i} c_{ij}, \qquad TN_i = \sum_{j,k} c_{jk} - TP_i - FP_i - FN_i$$
We also constructed a multiclass confusion matrix [14] for MobileNet [10]. From the confusion matrix [14], we then calculated Precision, Recall, Accuracy, and F1-score using the following formulas:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Table 1 shows the multiclass Confusion Matrix [14] of the MobileNet [10] model.
Table 1. The multiclass Confusion Matrix

Historical Place        Ahsanmonzil   Lalbagh   Mahasthangarh   Panam city   Shaat Gombuj Mosque
Ahsanmonzil                 177           1           0              1               1
Lalbagh                       3         176           0              0               1
Mahasthangarh                 0           1         179              0               0
Panam city                    2           1           0            176               1
Shaat Gombuj Mosque           2           1           0              2             175
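As a worked check, the per-class metrics defined above can be computed from Table 1 with a few lines of NumPy. We assume here that rows are the actual classes and columns the predicted ones.

```python
import numpy as np

labels = ["Ahsanmonzil", "Lalbagh", "Mahasthangarh",
          "Panam city", "Shaat Gombuj Mosque"]
cm = np.array([[177,   1,   0,   1,   1],   # confusion matrix from Table 1
               [  3, 176,   0,   0,   1],
               [  0,   1, 179,   0,   0],
               [  2,   1,   0, 176,   1],
               [  2,   1,   0,   2, 175]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp          # column totals minus the diagonal
fn = cm.sum(axis=1) - tp          # row totals minus the diagonal
tn = cm.sum() - (tp + fp + fn)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / cm.sum()
f1 = 2 * precision * recall / (precision + recall)

for row in zip(labels, precision, recall, accuracy, f1):
    print("%-20s P=%.3f R=%.3f Acc=%.3f F1=%.3f" % row)

# The macro-averaged per-class accuracy comes out to ~0.992,
# matching the 99.2% reported in Section 6.
print("macro-average accuracy: %.3f" % accuracy.mean())
```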
6 Result Exploration
We obtained a favorable accuracy of 99.2% from MobileNet [10]. The following Fig. 5 shows accuracy (y-axis) as a function of training progress (x-axis), and Fig. 6 shows cross-entropy (y-axis) as a function of training progress (x-axis). The orange line represents the training set, and the blue line represents the validation set. From these two figures we can see that our model has not overfitted: an overfitted model would have extremely low training error but high testing error, whereas MobileNet [10] also performed well in testing.
Fig. 5. Accuracy regarding training and validation.
Fig. 6. Cross entropy regarding training and validation.
The following Fig. 7 shows the precision, recall, accuracy, F1-score, and macro average [15] for Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, and Shaat Gombuj Mosque.
Fig. 7. Precision, Recall, Accuracy and F1-Score graph.
Table 2 shows that if we had used the full-convolution MobileNet [10], we would have needed far more parameters and multiply-adds, and training would have been lengthy [10].
Table 2. Depthwise Separable vs Full Convolution MobileNet

Model                  Accuracy on ImageNet   Million Mult-Adds   Million Parameters
Conv MobileNet               71.7%                  4866                29.3
Depthwise MobileNet          70.6%                   569                 4.2
7 Conclusion and Future Work
In this paper, we used MobileNet [10], a depth-wise convolutional neural network [8] model, to detect some historical landmarks of Bangladesh from their images. This model is small in size compared with other CNN [8] models and is very similar to Inception-v3 [1]. The model is built on depth-wise separable filters and trained on a dataset of five historical places of Bangladesh: Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city and Shaat Gombuj Mosque. Other models are very deep and large: using Inception-v3 [1], AlexNet [5], VGG-16 [17], ResNet [16] or DenseNet-190 [18], we would have had to handle 23 million, 60 million, 138 million, 25 million and 40 million parameters respectively, which would have been a lengthy process. With depth-wise MobileNet [10] we used only 4.2 million parameters, and after running the experiment for six hours our model achieved an accuracy of 99.2%. Moreover, since MobileNet [10] is derived from the convolutional neural network [8], a more generalized and effective model can be developed in the future using different and unique techniques to facilitate image classification in other related areas.
References
1. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, "Going deeper with convolutions", arXiv:1409.4842v1 [cs.CV], 17 Sep 2014.
2. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alexander A. Alemi, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning", arXiv:1602.07261 [cs.CV].
3. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, "Rethinking the Inception Architecture for Computer Vision", arXiv:1512.00567v1 [cs.CV], 2 Dec 2015.
4. Khairul Amin, Mirsa Hussain, Norsidah Ujang, "Visitors' Identification of Landmarks in the Historic District of Banda Hilir, Melaka, Malaysia", AMER International Conference on Quality of Life, AicQoL2014KotaKinabalu.
5. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, "Deep Learning with TensorFlow", http://cvml.ist.ac.at/courses/DLWT_W17/
6. Malcolm J. Beynon, Calvin Jones, Max Munday, Neil Roche, "Investigating value added from heritage assets: An analysis of landmark historical sites in Wales", International Journal of Tourism Research.
7. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems 25 (NIPS 2012).
8. Karen Simonyan, Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556v3 [cs.CV], 18 Nov 2014.
9. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge", arXiv:1409.0575v3 [cs.CV], 30 Jan 2015.
10. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv:1704.04861v1 [cs.CV], 17 Apr 2017.
11. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems", preliminary white paper (November 9, 2015), arXiv:1603.04467v2 [cs.DC], 16 Mar 2016.
12. Łukasz Kaiser, Aidan N. Gomez, François Chollet, "Depthwise Separable Convolutions for Neural Machine Translation", arXiv:1706.03059v2 [cs.CL], 16 Jun 2017.
13. Docker: Simplifies the Developer Experience, https://www.docker.com/
14. Confusion matrix, https://en.wikipedia.org/wiki/Confusion_matrix
15. Micro average vs macro average performance in a multiclass classification setting, https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin
16. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition", arXiv:1512.03385v1 [cs.CV], 10 Dec 2015.
17. Karen Simonyan, Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv:1409.1556v6 [cs.CV], 10 Apr 2015.
18. Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, "Densely Connected Convolutional Networks", arXiv:1608.06993v5 [cs.CV], 28 Jan 2018.