A model for identifying Historical Landmarks of
Bangladesh from image content using a Depth-wise
Convolutional Neural Network
Afsana Ahsan Jeny1, Masum Shah Junayed1, Syeda Tanjila Atik1 and Sazzad Ma-
1 Daffodil International University, Dhanmondi, Dhaka-1207
1 { ahsan15-5278, junayed15-5008, syeda.cse, sazzad15-4980}
Abstract. At present, tourism is considered one of the key factors shaping the development of a country's economy. Most tourists tend to explore places that they find fascinating after viewing pictures of those places on the Internet. Anyone can learn about a famous place by simply typing its name into a web browser. But a problem arises when one comes across the image of a beautiful landmark that is unlabeled, since web images often carry no text caption. Most models proposed for image identification so far exhibit complex structures and high time complexity. In this paper, we propose a CNN model based on MobileNet and TensorFlow for detecting some historical landmarks of Bangladesh from their images. We have examined 750 images from five different places, and compared with other state-of-the-art models, our model has a relatively simpler structure and has achieved a significantly higher average accuracy of 99.2%. This model can be further enhanced to facilitate image classification in other related areas.
Keywords: Convolutional Neural Networks, TensorFlow, MobileNet, Histori-
cal place detection, Image processing.
1 Introduction
Due to the remarkable development of the communication sector, visits to historical places have increased greatly. In many countries, especially Western countries, tourism has been a major economic influence for years. Instead of just travelling, one can also get to know different people and cultures while visiting a particular place. Every day, many people from different parts of the world visit Bangladeshi historical places for tourism purposes. But unfortunately, many people cannot recognize a historical place from an image of it when no caption is included. There are many traditional historical places in Bangladesh, such as Buddhist Vihara, Mahasthangarh, Lalbagh, Puthia Temple Complex, Sonargaon, Shaat Gombuj Mosque, Kantajew Temple, Star Mosque (Tara Masjid), Tajhat Palace, Mainimati Ruins, Baitul Mukarram Mosque, Chittagong Commonwealth War Cemetery, Panam Nagar and so on. But most local people of the current generation, as well as people from overseas, do not have detailed knowledge about these places.
The Convolutional Neural Network [8] is a well-suited image detection method that has become established in recent years. Like neurons in the brain, it uses local receptive fields, sharing weights and linking information, which reduces training requirements compared to fully connected networks. It is robust to image transformations involving a certain degree of rotation, translation, and distortion. Such a network can avoid complex preprocessing of the image and accept the original image directly as input.
In this paper, we have proposed a CNN model based on MobileNet and TensorFlow [11] for detecting some historical landmarks of Bangladesh from their images. The model is built on depth-wise separable filters and trained on a dataset of five historical places: Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city and Shaat Gombuj Mosque.
The remainder of the paper is organized as follows: existing work such as CNN [8] and TensorFlow [11] is discussed in Section II. The MobileNet [10] architecture, which is based on depth-wise separable convolutions, is discussed in Section III. Related work is described in Section IV. Data collection, model installation, and the training process are discussed in Section V. The results of the experiment are discussed in Section VI. Finally, future work and the conclusion are discussed in Section VII.
2 Background Study
TensorFlow [11] is the second-generation machine learning framework created by the Google Brain team, used for designing, building, and training deep learning models. Anyone can use the TensorFlow [11] library for numerical computation. Computations are expressed as data flow graphs, in which edges describe the data while nodes describe the mathematical operations. Google has also provided a number of pre-trained models. Among these, MobileNet [10] is one of the pre-trained models released after Inception-v1 [2], Inception-v2 [3], and Inception-v3 [1]. Transfer learning is a learning process in which knowledge a model has gained in one environment is reused to solve a new, related problem.
Convolutional neural networks [8] are organized into layers such as dense (or fully connected) layers, convolutional layers, pooling layers, recurrent layers, and normalization layers. Different layers perform different transformations on their inputs, and certain layers are better suited to certain tasks than others.
In this paper, we have used the depth-wise separable convolutions of MobileNet [10]. It is a successful model for image classification. It uses fewer parameters than other models and achieves high accuracy within a short training time. It both obtains better models than previously available for a given parameter count and significantly reduces the number of parameters required to perform at a given level.
3 MobileNet Architecture
In this section, we describe the core layer architecture of MobileNet [10], which is based on depth-wise separable filters. The MobileNet [10] model is a form of factorized convolution that splits a standard convolution into a depth-wise convolution and a 1 x 1 convolution called a point-wise convolution. In MobileNet [10], the depth-wise convolution applies a single filter to each input channel. To combine the outputs of the depth-wise convolution, the point-wise convolution applies a 1 x 1 convolution. The depth-wise separable convolution thus divides a standard convolution into two layers: one layer for filtering and another layer for combining. Fig. 1 shows how a standard convolution 1(a) is factorized into a depth-wise convolution 1(b) and a 1 x 1 point-wise convolution 1(c).
Fig. 1. The standard convolutional filters in (a) are substituted by two layers: depth-wise convolution in (b) and point-wise convolution in (c) to build a depth-wise separable filter.
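The factorization in Fig. 1 can be made concrete with a small NumPy sketch. This is our own minimal illustration (stride 1, no padding; function and variable names are ours, not the authors' code):

```python
import numpy as np

def depthwise_separable_conv(F, depth_kernels, point_kernels):
    """Depth-wise separable convolution (valid padding, stride 1).

    F             : input feature map, shape (DF, DF, M)
    depth_kernels : one DK x DK filter per input channel, shape (DK, DK, M)
    point_kernels : 1x1 filters combining channels, shape (M, N)
    Returns a feature map of shape (DF-DK+1, DF-DK+1, N).
    """
    DF, _, M = F.shape
    DK = depth_kernels.shape[0]
    out = DF - DK + 1
    # Depth-wise step: each channel is filtered independently.
    depth_out = np.zeros((out, out, M))
    for m in range(M):
        for i in range(out):
            for j in range(out):
                patch = F[i:i + DK, j:j + DK, m]
                depth_out[i, j, m] = np.sum(patch * depth_kernels[:, :, m])
    # Point-wise step: a 1x1 convolution mixes the M channels into N outputs.
    return depth_out @ point_kernels  # shape (out, out, N)

# Tiny example: an 8x8 input with 3 channels, 3x3 depth-wise filters,
# and 16 output channels.
F = np.ones((8, 8, 3))
depth_kernels = np.ones((3, 3, 3))
point_kernels = np.ones((3, 16))
out_fm = depthwise_separable_conv(F, depth_kernels, point_kernels)
```

With all-ones inputs and filters, each depth-wise output is 9 (a 3 x 3 sum) and each point-wise output combines the 3 channels into 27, which makes the two-stage filtering-then-combining structure easy to verify by hand.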
A standard convolutional layer takes as input a D_F x D_F x M feature map F and produces a D_F x D_F x N feature map G. Here D_F is the spatial width and height of the (square) input feature map and M is the number of input channels. Likewise, D_F is the spatial width and height of the (square) output feature map and N is the number of output channels.
The standard convolutional layer is parameterized by a convolution kernel K of size D_K x D_K x M x N, where D_K is the spatial dimension of the kernel (assumed to be square), and M and N are the numbers of input and output channels as defined earlier.
The computational cost of a standard convolution is: D_K . D_K . M . N . D_F . D_F
where the cost depends on the number of input channels M, the number of output channels N, the kernel size D_K x D_K and the feature map size D_F x D_F.
Depth-wise convolution is highly efficient relative to standard convolution. However, it only filters the input channels; it does not combine them to create new features. So an additional layer that computes a linear combination of the depth-wise convolution output via a 1 x 1 convolution is needed to generate these new features. The combination of a depth-wise convolution and a 1 x 1 point-wise convolution is called a depth-wise separable convolution.
The cost of a depth-wise separable convolution is: D_K . D_K . M . D_F . D_F + M . N . D_F . D_F
which is the sum of the depth-wise and 1 x 1 point-wise convolution costs. MobileNet [10] uses 3 x 3 depth-wise separable convolutions, which require between 8 and 9 times less computation than standard convolutions.
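The two cost formulas above can be checked numerically. In this sketch the layer sizes (3 x 3 kernels, a 14 x 14 feature map, 512 input and output channels) are our own illustrative choices, not from the paper; for such a layer the depth-wise separable form comes out roughly 8.8 times cheaper:

```python
def standard_cost(DK, M, N, DF):
    # D_K . D_K . M . N . D_F . D_F multiply-adds for a standard convolution
    return DK * DK * M * N * DF * DF

def separable_cost(DK, M, N, DF):
    # depth-wise term plus 1x1 point-wise term
    return DK * DK * M * DF * DF + M * N * DF * DF

std = standard_cost(3, 512, 512, 14)
sep = separable_cost(3, 512, 512, 14)
ratio = std / sep  # about 8.84 for this layer: between 8 and 9, as claimed
```

The ratio simplifies to 1 / (1/N + 1/D_K^2), so for 3 x 3 kernels it approaches 9 as the number of output channels grows.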
4 Related Work
In this section, we review some prior work using CNN models on image datasets.
In 2014, the authors of [1] proposed the Inception model with GoogLeNet for image classification. They explained the model in depth and used images from ImageNet for training, validation, and testing. However, the model used about 23 million parameters, so the model size became large and its run-time complexity was relatively high [1].
In 2014, Khairul Amin, Mirsa Hussain and Norsidah Ujang surveyed 35 visitors for their study on visitors' identification of landmarks in the historic district of Banda Hilir, Melaka, Malaysia. The paper inspects the identity and nature of Banda Hilir's landmarks through the image held by the observer. The study adopted a mental mapping strategy to combine images regarded as landmarks with their associated characteristics. With the 35 visitors, they also drew maps of Banda Hilir's landmarks. But this was a purely manual process [4].
In 2018, Malcolm J. Beynon, Calvin Jones, Max Munday and Neil Roche used fuzzy-set qualitative comparative analysis to analyse landmark historical sites in Wales. The paper shows how information on tourism activity at historic sites can be used to analyse causal recipes that define whether sites support relatively high or low levels of gross value added (GVA). However, this is purely an analysis paper [6].
In 2012, Alex Krizhevsky et al. used a deep convolutional neural network [8] to classify the high-resolution images of the ImageNet LSVRC-2010 contest [9]. There were about 1.2 million training images, 50,000 validation images and 150,000 testing images in ImageNet. To reduce overfitting they used a regularization method called dropout and achieved high accuracy with the CNN [8], but at the same time this model exhibited a high time complexity [7].
In 2017, Łukasz Kaiser et al. proposed a new model, SliceNet, which is related to Xception and ByteNet and is based on depth-wise separable convolutions. SliceNet uses larger convolution windows instead of dilated filters. As a result, the model has become more complex to implement [12].
Our experiment is based on the MobileNet [10] model, a depth-wise convolutional neural network [8], for historical place detection from images. To the best of our knowledge, detection of Bangladeshi historical places has not been done with a CNN [8] model before. The depth-wise CNN [8] model uses fewer parameters, worked well in our experiment within a short training time, and achieved a high accuracy.
5 Methodology
In this section, we first provide a system architecture that represents the procedure of our experiment; then the data collection procedure for our experiment is discussed; then data preprocessing and the installation of the model are described; and lastly we describe how we trained our model.
Fig. 2 represents the system architecture, which describes the procedure of our experiment.
Fig. 2. Architecture of System.
5.1 Data collection
For detecting the historical places of Bangladesh, we have collected 750 images of five classes, with 150 images per class. The classes are Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, and Shaat Gombuj Mosque.
5.2 Data preprocessing
In this step, we have enlarged the dataset artificially while keeping the augmented images looking like fresh data. First of all, we resized the images. Then we used five different augmentation methods to increase the dataset: Shearing, Translation, Flip, Rotate +30, and Rotate -30.
This gave us 4,500 images for training, validation, and testing. After augmentation, each class has 900 images, of which 180 are used for testing. As it is impractical to show all the data, a sample of our dataset is provided in Fig. 3.
Fig. 3. A sample of our dataset.
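Two of the augmentation steps above (flip and translation) can be sketched directly in NumPy; shearing and the ±30° rotations would normally be done with an image library such as Pillow, so they are only noted in comments. This is an illustrative sketch of the idea, not the authors' actual pipeline:

```python
import numpy as np

def augment(img):
    """Return flipped and translated copies of an (H, W, C) image array.

    Shearing and rotation by +/-30 degrees are omitted here; they are
    usually produced with an image library (e.g. Pillow's Image.rotate
    and Image.transform) rather than plain NumPy.
    """
    copies = []
    copies.append(np.fliplr(img))         # horizontal flip
    shifted = np.zeros_like(img)          # translate 10 px to the right,
    shifted[:, 10:, :] = img[:, :-10, :]  # zero-padding the vacated strip
    copies.append(shifted)
    return copies

# A synthetic 224 x 224 RGB image, matching the MobileNet input resolution.
img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
flipped, shifted = augment(img)
```

Applying several such transforms to each original image is what multiplies the 750 collected images into the 4,500-image augmented dataset.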
5.3 Model Installation
This experiment is based on the MobileNet [10] model. In the backend, we have used TensorFlow [11], an open-source Python library. The hardware platform is a 2.2 GHz Intel i5 with 4 GB of memory, running a 64-bit operating system on an x64-based processor.
5.4 Train model
We have used the 224 1.0 version of MobileNet [10], which is the largest version. Here 1.0 is the width multiplier and 224 is the image resolution. Although a 224-pixel image is relatively high resolution and takes more processing time, it provides better classification accuracy.
First of all, we prepared our dataset as described above. Then we started by creating a Docker [13] image to handle training; one of the many advantages of Docker [13] is that TensorFlow [11] comes preinstalled and configured for us. Then we defined the default training parameters as environment variables that can be overridden at runtime as needed, and set our image size to 224 pixels. Next, we defined the output locations of the labels text file and the trained graph that will be created by the training script.
Finally, we set the number of training steps to 5,000 (the default is 4,000), which allows reasonably quick training without an excessively long training cycle. After training is complete, an output volume is used to recover the trained model and labels from the Docker [13] container, and an input volume is used to provide the Docker [13] container with our training images. Then, at last, we completed the test. Fig. 4 represents the MobileNet [10] architecture used on our dataset.
Fig. 4. The architecture of MobileNet.
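The training setup described above might be summarised by a parameter set like the following. The flag names are modelled on TensorFlow's example `retrain.py` script and the file paths are hypothetical, so treat this as an assumed sketch rather than the authors' exact configuration:

```python
# Assumed parameter names, modelled on TensorFlow's example retrain.py
# script; the dataset path and output paths below are hypothetical.
TRAINING_PARAMS = {
    "architecture": "mobilenet_1.0_224",        # width multiplier 1.0, 224 px input
    "image_dir": "tf_files/historical_places",  # hypothetical training-image folder
    "how_many_training_steps": 5000,            # raised from the default of 4000
    "output_graph": "tf_files/retrained_graph.pb",
    "output_labels": "tf_files/retrained_labels.txt",
}
```

In the Docker-based workflow, values like these would be passed as environment variables or command-line flags and overridden at runtime as needed.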
Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model on test data for which the true values are known. For a binary (2-class) problem, the entries count the false positives (FPs), false negatives (FNs), true positives (TPs), and true negatives (TNs). For a multiclass problem with n > 2 classes, the confusion matrix is n x n: it contains n rows, n columns and n x n entries in total. From this matrix, the numbers of FPs, FNs, TPs, and TNs cannot be read off directly; instead, with X_ij denoting the entry in row i (actual class) and column j (predicted class), they are calculated per class i as:
TP_i = X_ii
FN_i = (sum of row i) - X_ii
FP_i = (sum of column i) - X_ii
TN_i = (total of all entries) - TP_i - FN_i - FP_i
We have also built a multiclass Confusion Matrix [14] for MobileNet [10]. From the Confusion Matrix [14], we then calculated Precision, Recall, Accuracy, and F1-Score using the following formulas:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1-Score = 2 . (Precision . Recall) / (Precision + Recall)
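The per-class counts and the four metrics can be computed together from a multiclass confusion matrix. This is a minimal NumPy sketch of our own (the function names, the small example matrix, and the rows-actual/columns-predicted convention are ours, not from the paper):

```python
import numpy as np

def per_class_counts(cm, i):
    """TP/FP/FN/TN for class i of an n x n confusion matrix.

    Assumes rows are actual classes and columns are predicted
    classes; the opposite convention simply swaps FP and FN.
    """
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp   # actual i, predicted as something else
    fp = cm[:, i].sum() - tp   # predicted i, actually something else
    tn = cm.sum() - tp - fn - fp
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 score from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# A small 3-class example (illustrative numbers, not our experimental data).
cm = np.array([[5, 1, 0],
               [0, 4, 2],
               [1, 0, 7]])
tp, fp, fn, tn = per_class_counts(cm, 0)
precision, recall, accuracy, f1 = metrics(tp, fp, fn, tn)
```

Repeating this for each of the five classes and averaging yields the macro-averaged scores reported later.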
Table 1 shows the multiclass Confusion Matrix [14] of the MobileNet [10] model.
Table 1. The multiclass Confusion Matrix
Classes (rows and columns): Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, Shaat Gombuj Mosque.
6 Result Exploration
We have obtained a favourable accuracy of 99.2% from MobileNet [10]. Fig. 5 shows the training progress (x-axis) against accuracy (y-axis), and Fig. 6 shows the training progress (x-axis) against cross-entropy (y-axis). Here the orange line represents the training set, and the blue line represents the validation set. From these two figures we can say that our model has not overfitted: an overfitted model would have extremely low training error but high testing error, whereas MobileNet [10] has performed well in testing.
Fig. 5. Accuracy regarding training and validation.
Fig. 6. Cross entropy regarding training and validation.
Fig. 7 shows the precision, recall, accuracy, F1-Score and macro average [15] graphs for Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city, and Shaat Gombuj Mosque.
Fig. 7. Precision, Recall, Accuracy and F1-Score graph.
Table 2 shows that if we had used the full-convolution version of MobileNet [10], we would have needed far more parameters and computation, making training lengthy [10].
Table 2. Depthwise Separable vs Full Convolution MobileNet
Model | ImageNet Accuracy | Million Mult-Adds | Million Parameters
Conv MobileNet | 71.7% | 4866 | 29.3
Depthwise MobileNet | 70.6% | 569 | 4.2
7 Conclusion and Future Work
In our paper, we have used MobileNet [10], a depth-wise convolutional neural network [8] model, for detecting some historical landmarks of Bangladesh from their images. This model is small in size compared with other CNN [8] models and very similar to Inception-V3 [1]. It is built on depth-wise separable filters and trained on a dataset of five historical places of Bangladesh: Ahsanmonzil, Lalbagh, Mahasthangarh, Panam city and Shaat Gombuj Mosque. Other models are very deep and large: Inception-V3 [1], AlexNet [5], VGG 16 [17], ResNet [16] and DenseNet-190 [18] require about 23 million, 60 million, 138 million, 25 million and 40 million parameters respectively, which would make training a long process. In depth-wise MobileNet [10] we have used only 4.2 million parameters, and after running the experiment for six hours our model achieved an accuracy of 99.2%. Moreover, MobileNet [10] is a generalization of the convolutional neural network [8], so a more generalized and effective model can be developed in future using different and unique techniques to facilitate image classification in other related areas.
References
1. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Angue-
lov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich, “Going deeper with con-
volutions”, arXiv:1409.4842v1 [cs.CV] 17 Sep 2014.
2. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alexander A. Alemi, “Inception-v4,
Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv: 1602.07261
3. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, “Rethinking the In-
ception Architecture for Computer Vision”, arXiv: 1512.00567v1 [cs.CV] 2 Dec 2015.
4. Khairul Amin, Mirsa Hussain and Norsidah Ujang, “Visitors’ Identification of Landmarks in the Historic District of Banda Hilir, Melaka, Malaysia”, AMER International Conference on Quality of Life, AicQoL2014KotaKinabalu.
5. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton,” Deep Learning with Tensor-
6. Malcolm J. Beynon, Calvin Jones, Max Munday and Neil Roche, “Investigating value added
from heritage assets: An analysis of landmark historical sites in Wales”, International Jour-
nal of Tourism Research.
7. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,”ImageNet Classification with Deep
Convolutional Neural Networks”, Advances in Neural Information Processing Systems 25
(NIPS 2012).
8. Karen Simonyan, Andrew Zisserman,” Very Deep Convolutional Networks for Large-Scale
Image Recognition”, arXiv:1409.1556v3 [cs.CV] 18 Nov 2014
9. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-
heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li
Fei-Fei,” ImageNet Large Scale Visual Recognition Challenge”, arXiv:1409.0575v3
[cs.CV] 30 Jan 2015
10. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv:1704.04861v1 [cs.CV] 17 Apr 2017.
11. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems” (Preliminary White Paper, November 9, 2015), arXiv:1603.04467v2 [cs.DC] 16 Mar 2016.
12. Łukasz Kaiser, Aidan N. Gomez, François Chollet,”Depthwise Separable Convolutions for
Neural Machine Translation”, arXiv:1706.03059v2 [cs.CL] 16 Jun 2017
13. Docker Simplifies the Developer Experience.
14. The details of Confusion matrix,
16. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition”, arXiv:1512.03385v1 [cs.CV] 10 Dec 2015.
17. Karen Simonyan & Andrew Zisserman,”Very deep convolutional networks for large-scale
image recognition”, arXiv:1409.1556v6 [cs.CV] 10 Apr 2015
18. Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, “Densely Connected Convolutional Networks”, arXiv:1608.06993v5 [cs.CV] 28 Jan 2018.
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge