RESEARCH ARTICLE
Copyright © 2019 American Scientific Publishers
All rights reserved
Printed in the United States of America
Journal of
Computational and Theoretical Nanoscience
Vol. 16, 4044–4052, 2019
Object Recognition Using Deep Learning
Rohini Goel1∗, Avinash Sharma2, and Rajiv Kapoor3
1Research Scholar, Department of Computer Science and Engineering, Maharishi Markandeshwar (Deemed to be University),
Mullana 133203, Ambala, India
2Department of Computer Science and Engineering, Maharishi Markandeshwar (Deemed to be University), Mullana 133203, Ambala, India
3Department of Electronics and Communication Engineering, Delhi Technical University, Delhi 110042, India
Deep learning approaches have attracted considerable attention from researchers in object recognition because they overcome the shortcomings of classical approaches that depend on handcrafted features. In the last few years, deep learning techniques have driven many developments in object recognition. This paper surveys some recent and efficient deep learning frameworks for object recognition and presents an up-to-date study of recently developed deep neural network based object recognition methods. The benchmark datasets used for performance evaluation are also discussed, and applications of object recognition to specific types of objects (such as faces, buildings and plants) are highlighted. We conclude with the merits and demerits of existing methods and the future scope in this area.
Keywords: Convolutional Neural Network (CNN), Faster R-CNN, Network on Convolution
Feature Map (NoC), Deep Expectation (DEX), Deep Residual Conv-Deconv
Network, A-ConvNet.
1. INTRODUCTION
Owing to the rapid growth of computer technology, computers can play an important role in completing the routine tasks of everyday life [1]. For a human, visual object classification and recognition are ordinary, effortless processes of the biological visual system, but they are not easy for a computer to imitate because images of objects of the same class vary widely under different viewing conditions. Object recognition is a crucial challenge in the field of computer vision. Because implementing object recognition on machines is an intricate task, effective and less complicated object recognition methods need to be designed [2]. Digital databases of visual information are growing day by day. To manage and analyze this huge mass of visual information, image analysis approaches are required that can automatically extract its semantic content. The objects in an image are among the most crucial cues for the object recognition task, and good image feature descriptions are the backbone of a good object recognition system [3].
In the previous decade, high resolution image classification has been studied extensively using handcrafted features from the spatial and spectral domains. The gray-level co-occurrence matrix (GLCM) [4]
∗Author to whom correspondence should be addressed.
is used as a texture-based descriptor to provide the spectral variation information required for efficient image classification. Benediktsson et al. [5] proposed extended morphological profiles to extract spatial features for high resolution urban image classification. In high resolution images, the Gabor filter [6] and the wavelet transform [7] were also used for spatial feature extraction. Because of the intra-class variation of the building database, handcrafted features are not an efficient solution. Therefore, handcrafted features are replaced by features extracted with the sparse coding scheme proposed by Cheriyadat [8]. The sparse-constrained support vector machine (SVM) is another feature learning model, presented by Tuia et al. [9]. Deep features [10] are more efficient and powerful than low-level features in scene classification, image classification and face recognition.
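To make the texture descriptor mentioned above concrete, the following is a minimal sketch of a gray-level co-occurrence matrix and one statistic (contrast) derived from it. The function names, the toy 4-level patch and the horizontal offset are illustrative assumptions, not details taken from the cited work [4].

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Count how often gray level i co-occurs with gray level j at offset (dy, dx)."""
    image = np.asarray(image)
    rows, cols = image.shape
    matrix = np.zeros((levels, levels), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r, c], image[r2, c2]] += 1
    return matrix

def contrast(matrix):
    """Contrast statistic: expected squared gray-level difference of a pixel pair."""
    p = matrix / matrix.sum()
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

# Hypothetical 4x4 patch quantized to 4 gray levels (illustration only).
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
g = glcm(patch, levels=4)
texture_contrast = contrast(g)
```

A smooth patch concentrates counts on the diagonal of the matrix and yields a low contrast value, whereas strongly textured patches spread counts away from the diagonal.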
The most widely used object proposal approaches are based on super-pixel grouping (e.g., MCG [11], CPMC [12] and Selective Search [13]) or on sliding windows (e.g., edge boxes [14], objectness in windows [15]). In addition, some object proposal methods are adopted as external modules independent of the detector (e.g., selective search in object detection). The R-CNN [16] method is used as an object detector to classify the proposal regions into object categories or background. The pioneering work by Viola and Jones utilizes Haar [17] features
4044 J. Comput. Theor. Nanosci. 2019, Vol. 16, No. 9 1546-1955/2019/16/4044/009 doi:10.1166/jctn.2019.8291
and boosted classifiers on sliding windows. HOG features [18] are combined with linear SVMs [19], a sliding window classifier and DPM [20] to generate deformable graphical models. In the OverFeat approach [21], every sliding window of the convolutional feature map is fed to a fully connected layer for efficient detection and classification. In the SPP based detection method [22], the features are pooled from the proposal regions on the convolutional feature map and fed to a fully connected layer for classification.
Typical supervised classification models include the decision tree [23], random forest [24] and support vector machine (SVM) [25]. A random forest constructs several decision trees during training and combines the predictions of all the trees for classification. SVM uses finite training samples to tackle high dimensional data. Random forest and SVM are shallow models; they have limited ability to handle nonlinear data compared to deep networks. For image classification, Chen et al. [26] proposed a stacked autoencoder to predict the hierarchical features of hyperspectral images in the spectral domain. A deep belief network (DBN) [27] represents spectral based features for hyperspectral data classification. Mou et al. [28] introduced a recurrent neural network (RNN) for the classification of hyperspectral images. The aforementioned methods, such as autoencoders, DBNs and RNNs, are 1-D deep architectures, and 1-D processing may cause the loss of the structural information of hyperspectral imagery. The CNN can automatically discover contextual 2-D spatial features for image classification. Various supervised CNN-based models are used for spectral-spatial classification of hyperspectral remote sensing images. Chen et al. [29] proposed a supervised l2-regularized 3-D CNN based feature extraction model for classification. Ghamisi et al. [30] proposed a self-improving CNN model. Zhao and Du [31] introduced a classification framework based on spectral-spatial features. In the transition from supervised to unsupervised CNNs, Romero et al. [32] presented an unsupervised convolutional network for spatial-spectral feature extraction that adopts sparse learning to estimate the network weights.
Various types of feature extractors based on shape, texture and venation were used in the past. Neto et al. [33] proposed shape based elliptic Fourier and discriminant analysis to discriminate different plant types. Other shape based approaches relied on invariant moments and centroid-radii models [34]. Du et al. [35] introduced a combination of geometrical and invariant moment features. Shape context and HOG [36] have been used as shape descriptors. The Fourier descriptor, shape defining features (SDF) [37], hand crafted shape (HCS) [38] and histogram of curvature over scale (HoCS) [39] are some important shape based feature descriptors. Texture features also make an important contribution to leaf identification. Cope et al. [40] proposed Gabor co-occurrence for texture classification. Rashad et al. [41] proposed learning vector quantization (LVQ) with a radial basis function (RBF) to recognize texture features. Tang et al. [42] proposed combining the gray level co-occurrence matrix (GLCM) and LBP to extract texture based features. The most important features for leaf identification are venation features. Larese et al. [43] proposed classifying legume classes based on leaf venation: features are extracted from the vein pattern using the unconstrained hit-or-miss transform (UHMT), and a CNN is then trained for recognition.
Age estimation can be treated as a regression or a classification problem. Support Vector Regression (SVR) [44] and Canonical Correlation Analysis (CCA) [45] are well-known regression techniques, whereas the classical Nearest Neighbor (NN) and support vector machines (SVMs) [46] are used as classification approaches. Chen et al. [47] proposed the CA-SVR technique. Hureta et al. [48] used image texture and local appearance descriptors, and Guo and Mu [49] utilized CCA and PLS for real age estimation. Ye et al. [50] proposed a multi-scale CNN, Wang et al. [51] deployed deep learned features (DLA), and Rotre et al. [52] combined a CNN with SVR for accurate real age estimation. For apparent age estimation, Liu et al. proposed a technique based on deep transfer learning and the GoogleNet framework. Zihu et al. [53] deployed GoogleNet with random forest and SVR. Yang et al. [54] used face and landmark detection and the VGG-16 framework for face alignment and modeling, respectively.
The standard framework of SAR-ATR has three phases: detection, discrimination and classification. The detection phase extracts candidate targets from the SAR image using CFAR detection [55]. The discrimination stage then removes false alarms and selects the features necessary to detect the target. The final phase, the classifier, assigns each input to one of two classes (target and clutter) [56]. The classification stage may follow any of three paradigms: template matching, model based methods and machine learning. The semi-automated image intelligence processing (SAIP) [57] system is a popular template based system, but its performance degrades under extended operating conditions (EOC) [58]. To overcome this issue, the model based moving and stationary target acquisition and recognition (MSTAR) [59] system was developed. With the evolution of trainable classifiers such as the artificial neural network (ANN) [60], SVM [61] and AdaBoost [62], machine learning paradigms have been adapted to SAR. Nowadays, deep convolutional networks (ConvNets) [63] deliver remarkable performance in object detection and recognition. The remainder of the paper is structured as follows. Sections 2 and 3 discuss two-step and single-step architecture object recognition techniques using deep learning (see Fig. 1). Section 4 illustrates the various object recognition applications and
[Figure: tree diagram. Deep Learning Based Object Recognition Techniques, divided into Two Step Architecture (R-CNN, SPPNet, Fast R-CNN, Faster R-CNN, Object-Based CNN) and One Step Architecture (DEX, Deep Residual Conv-Deconv Network, SSD, YOLO).]
Fig. 1. Overview of deep learning based object recognition techniques.
Section 5 describes the datasets used by different deep learning based object recognition techniques. Section 6 presents a comparative analysis, and we conclude in Section 7.
2. TWO-STEP ARCHITECTURE
In 2014, Ross Girshick [16] proposed R-CNN to improve the quality of candidate bounding boxes and to extract high level features using a deep architecture. R-CNN improved on the previous best result on the PASCAL VOC 2012 dataset. R-CNN has two stages. In the region proposal generation stage, about 2K region proposals are produced using the selective search method. Selective search quickly generates accurate bounding boxes of arbitrary sizes with a reduced search space, because it uses bottom-up grouping and saliency cues. In the deep feature extraction stage, a deep CNN extracts features from each cropped or warped region proposal. These final, robust 4096-dimensional features are obtained thanks to the high learning capacity and strong expressive power of the CNN structure. The region proposals are scored as positive or negative (background) regions with the help of category-specific linear SVMs. Bounding box regression adjusts these regions, and greedy non-maximum suppression (NMS) then filters them to produce the final bounding boxes for the retained object locations. Although R-CNN offers various improvements over traditional approaches, it still has some disadvantages. Because a fixed-size input image is required, the CNN must be recomputed for every region, which takes more time during testing. R-CNN training is a multi-stage pipeline. Training requires more storage space and time because the features of the different region proposals are extracted and stored on disk. The high redundancy of the region proposals makes this a time-consuming process.
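The greedy non-maximum suppression step mentioned above can be sketched as follows. This is a generic illustration of the technique, not the R-CNN authors' code; the (x1, y1, x2, y2) box layout, the 0.5 IoU threshold and the example boxes are assumptions chosen for the demonstration.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
```

In this example the second box is suppressed because its IoU with the top-scoring box is about 0.82, so only the first and third boxes survive.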
Because of the fixed-size input, R-CNN warps or crops each region proposal to the required size. Cropping can leave the object incomplete, and warping can distort it. These effects can weaken recognition accuracy. To rectify this problem, He et al. presented a new CNN architecture called SPPnet [64]. In the SPPnet architecture, unlike R-CNN, the feature maps of the 5th convolutional layer (conv5) are reused to project region proposals of arbitrary size to fixed-size feature vectors. This reuse is feasible because the feature maps encode both the strength of local responses and their spatial positions. The layer after the final convolutional layer is called the spatial pyramid pooling layer (SPP layer). If the conv5 layer has 256 feature maps, then with a 3-level pyramid each region proposal has a final feature vector of dimension $256 \times (1^2 + 2^2 + 4^2) = 5376$ after the SPP layer. SPPnet gives better results in region proposal estimation and also improves detection efficiency during testing, because the computation before the SPP layer is shared.
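The fixed-length property of spatial pyramid pooling can be checked with a small sketch. The code below max-pools a feature map into 1 × 1, 2 × 2 and 4 × 4 bins and concatenates the results; the way bin edges are computed and the random test inputs are assumptions made for illustration, not the exact scheme of SPPnet [64].

```python
import numpy as np

def spp_vector(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into n x n bins per level and concatenate."""
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges cover the whole map even when H or W is not divisible by n.
        row_edges = np.linspace(0, height, n + 1)
        col_edges = np.linspace(0, width, n + 1)
        for i in range(n):
            r0, r1 = int(np.floor(row_edges[i])), int(np.ceil(row_edges[i + 1]))
            for j in range(n):
                c0, c1 = int(np.floor(col_edges[j])), int(np.ceil(col_edges[j + 1]))
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
# Two hypothetical region feature maps with different spatial sizes.
small = spp_vector(rng.standard_normal((256, 13, 13)))
wide = spp_vector(rng.standard_normal((256, 7, 10)))
```

Both inputs produce a vector of length 256 × 21 = 5376, which is why the fully connected layers that follow can accept regions of arbitrary size.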
Although SPPnet shows remarkable improvements in accuracy and efficiency compared to R-CNN, it has some drawbacks. Like R-CNN, SPPnet consumes additional storage space because of its multi-stage pipeline. The fine-tuning algorithm [22] cannot update the convolutional layers before the SPP layer, so, unsurprisingly, accuracy drops in very deep networks. To avoid these problems, Girshick [65] introduced a novel CNN architecture known as Fast R-CNN. As in SPPnet, Fast R-CNN processes the complete image with convolutional layers to generate a feature map. The RoI pooling layer extracts a fixed-length feature vector from every region proposal. Every feature vector is passed through a sequence of fully connected (fc) layers that branch into two output layers. One layer produces the probabilities of C + 1 categories (C object classes plus background) and the other generates the bounding box position as four real-valued numbers. The Fast R-CNN pipeline is accelerated by sampling mini-batches hierarchically and by compressing the fc layers using truncated singular value decomposition (SVD).
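The fc-layer compression by truncated SVD can be illustrated with a short sketch. It factors one weight matrix into two thinner matrices; the layer sizes, the rank and the synthetic low-rank weight are hypothetical values chosen so that the approximation is exact, and the function name is an assumption.

```python
import numpy as np

def compress_fc(weight, rank):
    """Split an (out, in) weight matrix into two thinner layers via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    first = s[:rank, None] * vt[:rank]   # (rank, in): applied to the input first
    second = u[:, :rank]                 # (out, rank): applied to the result
    return first, second

rng = np.random.default_rng(0)
# Hypothetical weight of exact rank 8, so rank-8 truncation loses nothing.
weight = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
first, second = compress_fc(weight, rank=8)

x = rng.standard_normal(128)
full_output = weight @ x
compressed_output = second @ (first @ x)

params_before = weight.size                 # 64 * 128
params_after = first.size + second.size     # 8 * (64 + 128)
```

Here the parameter count drops from 8192 to 1536 while the output is unchanged. Real fc weights are only approximately low rank, so in practice the rank trades accuracy against speed.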
Object detection networks rely on region proposal algorithms [13] to hypothesize object locations. Fast R-CNN [65] and SPPnet reduce the running time of the detection network, which leaves the computation of region proposals as the bottleneck. Faster R-CNN introduces a region proposal network (RPN) that shares full-image convolutional features with the detection network [22]. The RPN is a fully convolutional network [66] that predicts object boundaries and objectness scores. Fast R-CNN is used as the detection network. The RPN and Fast R-CNN are then merged into a unified network by sharing convolutional features, in the manner of an 'attention' mechanism. Ultimately, the RPN tells the detection network where to look, and the detection network then detects objects in those regions. The
deep VGG-16 is used as the detection network. Region proposals are generated by sliding a small network over the feature map of the last convolutional layer. The small network takes an n × n spatial window of the input convolutional feature map as input. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF [67] and 512-d for VGG). This feature is fed into two sibling layers: a box regression layer (reg) and a box classification layer (cls). The two networks (region proposal and object detection) share common convolutional layers. In this method, the work is implemented on the ZF net (which has 5 sharable convolutional layers) and VGG-16 [68], which has 13 sharable convolutional layers.
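At each sliding-window position the reg and cls layers score a set of reference boxes, usually called anchors. The sketch below enumerates such boxes over a feature map. The three scales, three aspect ratios and stride of 16 are commonly quoted defaults for Faster R-CNN with VGG-16 and are stated here as assumptions, since this text does not list them; the function names are illustrative.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) reference boxes (x1, y1, x2, y2) centred on the origin."""
    anchors = []
    for scale in scales:
        area = float(scale) ** 2
        for ratio in ratios:              # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Place every reference box at the centre of every feature-map cell."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centres = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (centres + anchors[None]).reshape(-1, 4)

base = make_anchors()                                 # k = 9 boxes per position
all_anchors = shift_anchors(base, feat_h=2, feat_w=3)  # tiny hypothetical feature map
```

With k = 9 anchors per position, the cls layer would output 2k objectness scores and the reg layer 4k box coordinates for each sliding window.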
In fields such as disaster rescue and urban planning, the accuracy and speed of classification and interpretation [69] of high resolution images [70] play a critical role. Recognizing complex patterns becomes more challenging as image resolution gets finer. In object-based CNN, deep learning provides a powerful way to reduce the semantic gap of complex patterns. However, deep learning techniques alone do not capture the boundaries of the different objects. To overcome this problem, a combination of deep feature learning and an object-based classification strategy is proposed, which improves the accuracy of high-resolution image classification. The method involves two steps: extraction of deep features with a CNN, followed by object-based classification using those deep features.
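The text does not spell out how the two steps are combined. One plausible scheme, shown here purely as an assumption, is to let a segmentation supply object boundaries and to assign each segment the majority class of the per-pixel CNN predictions inside it. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def object_majority_vote(pixel_labels, segments):
    """Relabel every segment with the most frequent per-pixel class inside it."""
    pixel_labels = np.asarray(pixel_labels)
    segments = np.asarray(segments)
    result = np.empty_like(pixel_labels)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        votes = np.bincount(pixel_labels[mask])  # class labels must be non-negative ints
        result[mask] = votes.argmax()
    return result

# Hypothetical per-pixel CNN predictions (classes 0..2) and a 2-segment partition.
pixel_labels = [[0, 0, 1],
                [0, 1, 1],
                [2, 1, 1]]
segments = [[0, 0, 1],
            [0, 0, 1],
            [0, 1, 1]]
object_labels = object_majority_vote(pixel_labels, segments)
```

The vote removes isolated misclassified pixels inside a segment, so the final class map follows the object boundaries given by the segmentation.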
3. SINGLE-STEP ARCHITECTURE
Szegedy et al. [71] proposed a DNN based approach in which simple bounding box inference extracts the objects with the help of a binary mask. This method has difficulty extracting overlapping objects. Pinheiro et al. [72] formulated a CNN model with two branches: the first generates masks and the second predicts the likelihood that a given patch contains an object. Erhan et al. [73] presented the regression based MultiBox to produce region proposals, whereas Yoo et al. [74] proposed a classification technique for object detection using a CNN architecture named AttentionNet.
The Deep Expectation (DEX) approach estimates age without using facial landmarks and introduces the IMDB-WIKI database of face images with age and gender labels [75]. Real age and apparent age estimation are both handled by a CNN of the VGG-16 architecture pretrained on ImageNet. Age estimation is posed as a deep classification problem with softmax. The main components of the work are the deep learned model, robust face alignment and expected value calculation for age regression. First, the face is detected, aligned by rotation angle and cropped for the subsequent steps. This is a more robust method of face alignment than using facial landmarks because of its low failure rate (∼1%). In cases where no face is detected, the full image is treated as the face. Context around the face improves performance, so a margin of 40% of the height and width of the face is added on all sides. The aligned face is resized to 256 × 256 pixels before being fed to the deep CNN. The VGG-16 architecture is the CNN used to predict the age of a person. VGG-16 has 13 convolutional layers with 3 × 3 convolutional filters and 3 fully connected layers. The CNN is fine-tuned on the new IMDB-WIKI dataset. Age estimation becomes a regression problem if the last layer is replaced by a single output neuron. Because training a CNN for regression is relatively unstable, age prediction is instead treated as a classification problem in which the age value is discretized into age ranges. The number of age ranges depends on the size of the training set.
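The expected value calculation can be sketched in a few lines: the softmax output over age bins is treated as a probability distribution and the predicted age is its mean. The choice of 101 one-year bins covering ages 0 to 100 and the synthetic logits are assumptions made for illustration; the text only says that the number of age ranges depends on the training set.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of class scores."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def expected_age(logits, bin_ages):
    """Predicted age = sum over bins of probability times the age the bin represents."""
    return float(np.dot(softmax(logits), bin_ages))

bin_ages = np.arange(0, 101)       # assumed: one bin per year, ages 0..100
logits = np.full(101, -10.0)       # hypothetical network output
logits[30] = 2.0
logits[31] = 2.0
age = expected_age(logits, bin_ages)
uniform_age = expected_age(np.zeros(101), bin_ages)
```

When the network is torn between two neighbouring bins, the expectation lands between them (about 30.5 here), which a plain argmax over classes cannot express.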
The Deep Residual Conv-Deconv Network is a novel neural network architecture for unsupervised spectral-spatial feature learning from hyperspectral images. The proposed network, a fully Conv-Deconv network, is based on an encoder-decoder mechanism. The 3-D hyperspectral data is first transformed into a lower-dimensional space by the encoder (the convolutional subnetwork) and then expanded by the decoder (the deconvolutional subnetwork) to reproduce the original data. Residual learning and a new unpooling technique are used to refine the proposed network. The work shows that a few neurons in the first residual block are good at capturing visual patterns in the objects. The proposed network is used to extract unsupervised spectral-spatial features from remote sensing imagery. The convolutional subnetwork has a number of convolutional blocks, each made up of a stack of convolutional layers with 3 × 3 filters. The convolutional layers within a block have the same feature map size and the same number of filters. The number of feature map channels increases with deeper convolutional blocks. Each convolutional layer is equipped with the ReLU [76] activation function. After each convolutional block, a max-pooling layer spatially shrinks the feature maps. The deconvolutional subnetwork reconstructs the input data from the deep features. It consists of unpooling [77] and convolutional layers, and its convolutional block configuration is the same as that of the convolutional subnetwork. When the network is trained, learning slows down because the network converges to a high error value, so optimizing the two subnetworks is not easy. The other problem is the unpooling operation in the deconvolutional subnetwork. Simple unpooling ignores the position of the maximum value, which leads to the loss of edge information during decoding. To resolve these problems, the fully Conv-Deconv network is refined by adding residual learning and by unpooling with max-pooling indices to retain the location of the maximum value.
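Unpooling with max-pooling indices can be illustrated on a single 2-D feature map. This is a minimal NumPy sketch of the general technique, with an assumed 2 × 2 window and a made-up input; it is not the authors' implementation.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also records the flat index of each maximum."""
    h, w = x.shape
    out_h, out_w = h // size, w // size
    pooled = np.zeros((out_h, out_w), dtype=x.dtype)
    indices = np.zeros((out_h, out_w), dtype=np.int64)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = divmod(int(window.argmax()), size)
            pooled[i, j] = window[r, c]
            indices[i, j] = (i * size + r) * w + (j * size + c)
    return pooled, indices

def unpool(pooled, indices, shape):
    """Write each pooled value back to the position its maximum came from."""
    out = np.zeros(shape, dtype=pooled.dtype)
    out.flat[indices.ravel()] = pooled.ravel()
    return out

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 0],
              [7, 0, 0, 0]])
pooled, indices = max_pool_with_indices(x)
restored = unpool(pooled, indices, x.shape)
```

The restored map is zero everywhere except at the original positions of the maxima, so edges keep their location instead of being shifted to a fixed corner of each window.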
The YOLO framework, proposed by Redmon et al. [78], predicts confidences for multiple categories together with bounding boxes. In this framework, the whole image is first
divided into an N × N grid, and each grid cell predicts the objects whose centers fall in that cell. Each cell also predicts bounding boxes along with their confidence scores. The loss contributions of a grid cell are only computed if the cell contains an object. YOLO contains 24 conv layers and 2 fc layers. Some of the conv layers form groups of inception modules consisting of a 1 × 1 reduction layer followed by 3 × 3 conv layers. The network processes images at 45 FPS in real time, and the Fast YOLO version reaches 155 FPS, which is faster than other detectors. Because of the strong spatial constraints on bounding box predictions [79], YOLO struggles with groups of small objects. YOLO also uses coarse features because of multiple downsampling operations, and it has difficulty generalizing to objects in new configurations.
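The rule that a grid cell is responsible for objects centered in it can be written down directly. In the sketch below, the grid size of 7, the 448 × 448 input and the example box are assumptions used for illustration (the text only speaks of an N × N grid), and the function name is hypothetical.

```python
def responsible_cell(box, image_w, image_h, grid=7):
    """Return the (row, col) of the cell containing the box centre and the
    centre's offset inside that cell, each offset in [0, 1)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    gx, gy = cx / image_w * grid, cy / image_h * grid   # centre in grid units
    col = min(int(gx), grid - 1)   # clamp centres lying exactly on the far edge
    row = min(int(gy), grid - 1)
    return row, col, gx - col, gy - row

# Hypothetical ground-truth box in a 448 x 448 image (cells are 64 pixels wide).
row, col, x_off, y_off = responsible_cell((200, 100, 300, 200), 448, 448)
```

Because every object is tied to exactly one cell, several small objects whose centers fall in the same cell compete for the same predictions, which matches the weakness on groups of small objects noted above.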
To avoid these problems, Liu et al. proposed a novel approach, the single shot multibox detector (SSD) [80]. This approach is inspired by the anchors used in MultiBox [81], RPN [74] and multi-scale representation [82]. Unlike the fixed grid used in YOLO, SSD uses anchor boxes of different aspect ratios and scales to discretize the output space of bounding boxes. The network combines predictions from several feature maps with different resolutions to handle objects of different sizes. In the SSD architecture, extra feature layers are attached to the end of the VGG16 network to predict the offsets of default boxes with different scales and aspect ratios. The network is trained with a weighted sum of the confidence loss and the localization loss. NMS is applied to the multi-scale refined bounding boxes to obtain the final detections.
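How default boxes of different scales are spread across feature maps can be sketched as follows. The linear scale rule and the values 0.2 and 0.9 are the settings usually quoted for SSD and are given here as assumptions, since this text does not state them; six feature maps and three aspect ratios are likewise illustrative choices.

```python
import math

def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Scales (relative to image size) spaced linearly from s_min to s_max."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of each default box; every box has area scale ** 2."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]

scales = default_box_scales(num_maps=6)
shapes = default_box_shapes(scales[0])
```

High-resolution feature maps receive the small scales and low-resolution maps the large ones, which is how the network covers objects of different sizes in one pass.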
4. APPLICATIONS
4.1. Plant Identification
Plant identification is an area of computer vision that helps botanists identify unknown plant species rapidly and easily. Various studies have been carried out to increase the use of leaf data for predicting plant species. In this method, the useful features of the leaves are extracted by a convolutional neural network, and a deconvolutional network is used to gain insight into the extracted features. This analysis shows that different orders of venation [83] are better features than shape [33] information. A multilevel representation of leaf data (from lower level to higher level) has been observed that corresponds to the species class. This work is helpful for designing a hybrid feature extraction model that enhances the performance of the plant classification system.
4.2. Age Estimation
Age is one of the most important attributes of identity and social interaction. The prediction of age depends on various factors such as posture, facial wrinkles, vocabulary and information. Age estimation is used in the development of numerous applications, such as intelligent human-machine interfaces and safety and protection in areas like security, transport and medicine. Developments in artificial intelligence (AI) have increased the use of deep learning techniques for accurate age estimation. Deep learning approaches [75] are more effective and robust for age estimation than traditional approaches.
4.3. Target Classification for SAR Images
A SAR-ATR (synthetic aperture radar automatic target recognition) algorithm consists of a feature extractor and a trainable classifier. Hand-designed features are often used, and they affect the accuracy of the system. Deep convolutional networks have achieved state-of-the-art results in many computer vision and speech recognition tasks by automatically learning features from large amounts of data. However, using convolutional networks for SAR-ATR runs into a serious overfitting problem. To overcome this issue, a new all-convolutional network (A-ConvNet) is proposed. A-ConvNet consists only of sparsely connected layers, without any fully connected layers. A-ConvNet demonstrated better performance than a traditional ConvNet on the classification [84] of targets in the SAR image dataset.
4.4. Face Recognition
Face recognition is the identification of an individual from the face in an image or a database [85]. Because of the increasing volume of data, machine learning approaches such as deep neural networks are used for the face recognition problem. Deep learning approaches perform particularly well with large datasets. In particular, convolutional neural networks (CNNs) attain very high recognition rates for the face recognition problem.
5. DATASET
Some datasets cover the complex urban conditions of three very distinct cities: Beijing, Pavia and Vaihingen. The Beijing scene was captured by WorldView-2 in 2010. The ROSIS sensor captured the Pavia scenes during a flight over Pavia, Italy. The PASCAL VOC 2007 [86] dataset (consisting of 5k trainval images and 5k test images of 20 object classes), the PASCAL VOC 2012 [86] dataset and the MS COCO dataset [87] (containing 80 object classes, with 80k images in the training set, 40k images in the validation set and 20k images in the test set) are used to assess the performance of the Faster R-CNN approach. SegNet is benchmarked on the CamVid 11 road class segmentation dataset [88] and the SUN RGB-D [89] indoor scene database. The new MalayaKew leaf dataset and the well known Flavia [90] leaf dataset are used to analyze the performance of the plant identification approach.
In the DEX work, five distinct datasets for real and apparent age estimation are used. IMDB-WIKI is the new, largest dataset for age estimation. The MORPH and FG-NET datasets are used for real age estimation, whereas
Table I. Comparative analysis.
Researchers               Techniques used            Database                           Result (% accuracy)
R. Girshick et al. [16]   R-CNN                      PASCAL VOC, ILSVRC                 79.8
X. Bai et al. [22]        SPPnet                     PASCAL VOC 2007, ILSVRC            93.42
R. Girshick et al. [65]   Fast R-CNN                 PASCAL VOC                         89.3
S. Ren et al. [66]        Faster R-CNN               PASCAL VOC 2007, PASCAL VOC 2012   91.8
W. Zhao et al. [69]       Object based CNN           Beijing, Pavia, Vaihingen          99.04
Rasmus Rothe et al. [75]  DEX                        IMDB-WIKI                          96.6
L. Mou [76]               Deep residual conv-deconv  Pavia, Indian Pines                87.39
J. Redmon et al. [78]     YOLO                       PASCAL VOC 2012, COCO              90.6
W. Liu et al. [80]        SSD                        PASCAL VOC 2012, COCO              83.2
the LAP dataset is used for apparent age estimation. The hyperspectral experiments are carried out on two widely used hyperspectral datasets, Indian Pines and Pavia University. The Indian Pines dataset was gathered over the Indian Pines site in northwestern Indiana using the airborne visible/infrared imaging spectrometer (AVIRIS) sensor. The Pavia University dataset was acquired over the University of Pavia by the reflective optics system imaging spectrometer (ROSIS). The performance of hyperspectral image classification approaches is assessed using overall accuracy (OA), average accuracy (AA) and the Kappa coefficient. The A-ConvNet experiments are carried out on the moving and stationary target acquisition and recognition (MSTAR) [59] benchmark dataset under standard operating conditions (SOC) and extended operating conditions (EOC) [58].
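The three hyperspectral evaluation measures named above can all be computed from a confusion matrix. The sketch below uses the standard definitions, with rows as true classes and columns as predicted classes; the 2-class matrix is a made-up example, not a result from any of the surveyed papers.

```python
import numpy as np

def classification_scores(confusion):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    oa = np.trace(confusion) / total
    per_class = np.diag(confusion) / confusion.sum(axis=1)  # recall of each true class
    aa = per_class.mean()
    # Agreement expected by chance, from the row and column marginals.
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
confusion = [[50, 10],
             [5, 35]]
oa, aa, kappa = classification_scores(confusion)
```

OA weights every sample equally, AA weights every class equally, and kappa discounts the agreement that would occur by chance, so the three values can differ noticeably on imbalanced data.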
6. COMPARATIVE ANALYSIS
Some of the research work on object recognition using deep learning is summarized in Table I, which lists the techniques used by different researchers. The results reported by those researchers are very encouraging, but they were obtained on particular databases. The important question is how these methods will perform on other databases. Therefore, a comparative analysis of the techniques reported in the literature is desirable.
7. CONCLUSION
Among contemporary object recognition approaches, deep neural network based techniques achieve remarkable performance owing to their powerful learning capability. In this paper, recent developments in deep neural network based object recognition frameworks have been reviewed in detail. First, the two-step frameworks have been reviewed, which introduces the architectures used for object recognition. Then, one-step frameworks such as YOLO and SSD are reviewed. The benchmark datasets and different application areas of object recognition are also discussed. Finally, we conclude with the promising future scope to give a thorough perspective on object recognition. This paper provides useful insight and guidance for future progress in deep learning based object recognition. Based on the literature review, there is scope for future improvement. The object-based CNN method for high-resolution imagery classification uses no contextual information at a global level. In future, the main focus will be on contextual information to further improve performance, because information about the relationships between image objects affects classification efficiency.
In SegNet, an estimation model can be designed to calculate the uncertainty of predictions from the deep segmentation network.
The training dataset can be enlarged in future for the age estimation approach DEX. More robust landmark detectors can lead to better face alignment.
A possible direction for future work is to explore the capability of the Deep Residual Conv-Deconv Network for hyperspectral image classification using APs and extinction profiles, which extract spatial information in a robust and adaptive way.
References
1. Shokoufandeh, A., Keselman, Y., Demirci, M. F., Macrini, D. and
Dickinson, S.J., 2012. Many to many feature matching in object
recognition: A Review of three approaches. IET Computer Vision,
6(6), pp.500–513.
2. Lillywhite, K. and Archibald, J., 2013. A feature construc-
tion method for general object recognition. Pattern Recognition,
New York, NY, USA, Elsevier Science Inc. Vol. 46, pp.3300–
3314.
3. Martin, L., Tuysuzoglu, A., Karl, W.C. and Ishwar, P., 2015. Learning
based object identification and segmentation using dual energy CT
images for security. IEEE Transactions on Image Processing, 24(11),
pp.4069–4081.
4. Puissant, A., Hirsch, J. and Weber, C., 2005. The utility of texture
analysis to improve per-pixel classification for high to very high spatial
resolution imagery. International Journal of Remote Sensing, 26(4),
pp.733–745.
5. Benediktsson, J.A., Palmason, J.A. and Sveinsson, J.R., 2005.
Classification of hyperspectral data from urban areas based on
extended morphological profiles. IEEE Transactions on Geoscience
and Remote Sensing, 43(3), pp.480–491.
6. Bau, M.T.C., Sarkar, S. and Healey, G., 2010. Hyperspectral
region classification using a three-dimensional Gabor filterbank.
IEEE Transactions on Geoscience and Remote Sensing, 48(9),
pp.3457–346.
7. Huang, X., Zhang, L. and Li, P., 2008. A multiscale feature fusion
approach for classification of very high resolution satellite imagery
based on wavelet transform. International Journal of Remote Sensing,
29(20), pp.5923–5941.
8. Cheriyadat, A.M., 2014. Unsupervised feature learning for aerial
scene classification. IEEE Transactions on Geoscience and Remote
Sensing, 52(1), pp.439–451.
9. Volpi, D.M., Mura, M.D., Rakotomamonjy, A. and Flamary, R.,
2014. Automatic feature learning for spatio-spectral image
classification with sparse SVM. IEEE Transactions on Geoscience
and Remote Sensing, 52(10), pp.6062–6074.
10. Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning
algorithm for deep belief nets. Neural Computation, 18(7), pp.1527–1554.
11. Arbelaez, P., Pont-Tuset, J., Barron, J.T., Marques, F. and Malik, J.,
2014. Multiscale combinatorial grouping. Computer Vision and Pat-
tern Recognition (CVPR), pp.328–335.
12. Carreira, J. and Sminchisescu, C., 2012. CPMC: Automatic object
segmentation using constrained parametric min-cuts. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 34(7), pp.1312–1328.
13. Uijlings, J.R., van de Sande, K.E., Gevers, T. and Smeulders, A.W.,
2013. Selective search for object recognition. International Journal
of Computer Vision, 104(2), pp.154–171.
14. Zitnick, C.L. and Dollar, P., 2014. Edge Boxes: Locating Object
Proposals from Edges. European Conference on Computer Vision,
pp.391–405.
15. Alexe, B., Deselaers, T. and Ferrari, V., 2012. Measuring the object-
ness of image windows. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 34(11), pp.2189–2202.
16. Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich Fea-
ture Hierarchies for Accurate Object Detection and Semantic Seg-
mentation. IEEE Conference on Computer Vision and Pattern Recog-
nition, pp.580–587.
17. Viola, P. and Jones, M., 2001. Rapid Object Detection Using a
Boosted Cascade of Simple Features. Proceedings of IEEE Computer
Society Conference on Computer Vision and Pattern Recognition,
pp.I-511–I-518.
18. Dalal, N. and Triggs, B., 2005. Histograms of Oriented Gradients
for Human Detection. Proceeding of IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition, pp.886–893.
19. Uijlings, J.R., van de Sande, K.E., Gevers, T. and Smeulders, A.W.,
2013. Selective search for object recognition. International Journal
of Computer Vision, 104, pp.154–171.
20. Felzenszwalb, P.F., Girshick, R.B., McAllester, D. and Ramanan, D.,
2010. Object detection with discriminatively trained part-based mod-
els. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(9), pp.1627–1645.
21. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R. and
LeCun, Y., 2013. OverFeat: Integrated Recognition, Localization and
Detection Using Convolutional Networks. International Conference
on Learning Representations (ICLR 2014), p.16.
22. He, K., Zhang, X., Ren, S. and Sun, J., 2014. Spatial Pyramid
Pooling in Deep Convolutional Networks for Visual Recognition.
Proceeding in 13th European Conference on Computer Vision,
pp.346–361.
23. Delalieux, S., Somers, B., Haest, B., Spanhove, T., Borre, J.V. and
Mücher, C.A., 2012. Heathland conservation status mapping through
integration of hyperspectral mixture analysis and decision tree
classifiers. Remote Sensing of Environment, 126,
pp.222–231.
24. Ham, J., Chen, Y., Crawford, M.M. and Ghosh, J., 2005. Investiga-
tion of the random forest framework for classification of hyperspec-
tral data. IEEE Transactions on Geoscience and Remote Sensing,
43(3), pp.492–501.
25. Melgani, F. and Bruzzone, L., 2004. Classification of hyperspectral
remote sensing images with support vector machines. IEEE
Transactions on Geoscience and Remote Sensing, 42(8), pp.1778–1790.
26. Chen, Y., Lin, Z., Zhao, X., Wang, G. and Gu, Y., 2014. Deep
learning-based classification of hyperspectral data. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, 7(6),
pp.2094–2107.
27. Chen, Y., Zhao, X. and Jia, X., 2015. Spectral–spatial classification
of hyperspectral data based on deep belief network. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, 8(6),
pp.2381–2392.
28. Mou, L., Ghamisi, P. and Zhu, X.X., 2017. Deep recurrent neural
networks for hyperspectral image classification. IEEE Transactions
on Geoscience and Remote Sensing, 55(7), pp.3639–3655.
29. Chen, Y., Jiang, H., Li, C., Jia, X. and Ghamisi, P., 2016. Deep
feature extraction and classification of hyperspectral images based
on convolutional neural networks. IEEE Transactions on Geoscience
and Remote Sensing, 54(10), pp.6232–6251.
30. Ghamisi, P., Chen, Y. and Zhu, X.X., 2016. A self-improving convolution
neural network for the classification of hyperspectral data. IEEE
Geoscience and Remote Sensing Letters, 13(10), pp.1537–1541.
31. Zhao, W. and Du, S., 2016. Spectral–spatial feature extraction for
hyperspectral image classification: A dimension reduction and deep
learning approach. IEEE Transactions on Geoscience and Remote
Sensing, 54(8), pp.4544–4554.
32. Romero, A., Gatta, C. and Camps-Valls, G., 2016. Unsupervised
deep feature extraction for remote sensing image classification. IEEE
Transactions on Geoscience and Remote Sensing, 54(3), pp.1349–1362.
33. Neto, J.C., Meyer, G.E., Jones, D.D. and Samal, A.K., 2006. Plant
species identification using elliptic Fourier leaf shape analysis.
Computers and Electronics in Agriculture, 50(2), pp.121–134.
34. Chaki, J. and Parekh, R., 2011. Plant leaf recognition using shape
based features and neural network classifiers. International Journal
of Advanced Computer Science and Applications, 2(10).
35. Du, J.X., Wang, X.F. and Zhang, G.J., 2007. Leaf shape based plant
species recognition. Applied Mathematics and Computation, 185(2),
pp.883–893.
36. Mouine, S., Yahiaoui, I. and Verroust-Blondet, A., 2012. Advanced
Shape Context for Plant Species Identification Using Leaf Image
Retrieval. Proceedings of the 2nd ACM International Conference on
Multimedia Retrieval, p.49.
37. Aakif, A. and Khan, M.F., 2015. Automatic classification of
plants based on their leaves. Biosystems Engineering, 139,
pp.66–75.
38. Hall, D., McCool, C., Dayoub, F., Sunderhauf, N. and Upcroft, B.,
2015. Evaluation of Features for Leaf Classification in Challenging
Conditions. IEEE Winter Conference on Applications of Computer
Vision, pp.797–804.
39. Kumar, N., Belhumeur, P.N., Biswas, A., Kress, W.J., Lopez,
I.C. and Soares, J.V., 2012. Leafsnap: A computer vision system
for automatic plant species identification. ECCV, Springer,
pp.502–516.
40. Cope, J.S., Remagnino, P., Barman, S. and Wilkin, P., 2010. Plant
Texture Classification Using Gabor Co-Occurrences. International
Symposium on Visual Computing, Springer. pp.669–677.
41. Rashad, M., ElDesouky, B. and Khawasik, M.S., 2011. Plants images
classification based on textural features using combined classifier.
International Journal of Computer Science & Information
Technology, 3(4), pp.93–100.
42. Tang, Z., Su, Y., Er, M.J., Qi, F., Zhang, L. and Zhou, J., 2015.
A local binary pattern based texture descriptors for classification of
tea leaves. Neurocomputing, 168, pp.1011–1023.
43. Larese, M.G., Namías, R., Craviotto, R.M., Arango, M.R., Gallo,
C. and Granitto, P.M., 2014. Automatic classification of legumes
using leaf vein image features. Pattern Recognition, 47(1),
pp.158–168.
44. Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J. and
Vapnik, V., 1997. Support vector regression machines. Advances in
Neural Information Processing Systems, 9, pp.155–161.
45. Hardoon, D.R., Szedmak, S. and Shawe-Taylor, J., 2004. Canonical
correlation analysis: An overview with application to learning methods.
Neural Computation, 16(12), pp.2639–2664.
46. Cortes, C. and Vapnik, V., 1995. Support-vector networks. Machine
Learning, 20(3), pp.273–297.
47. Chen, K., Gong, S., Xiang, T., and Loy, C., 2013. Cumula-
tive Attribute Space for Age and Crowd Density Estimation.
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR).
48. Huerta, I., Fernández, C. and Prati, A., 2014. Facial Age Estimation
Through the Fusion of Texture and Local Appearance Descriptors.
IEEE European Conference on Computer Vision (ECCV).
49. Guo, G. and Mu, G., 2014. A framework for joint estimation of age,
gender and ethnicity on a large database. Image and Vision
Computing, 32(10), pp.761–770.
50. Yi, D., Lei, Z. and Li, S.Z., 2014. Age Estimation by Multi-
Scale Convolutional Network. Asian Conference on Computer Vision
(ACCV).
51. Wang, X., Guo, R. and Kambhamettu, C., 2015. Deeply-Learned
Feature for Age Estimation. IEEE Winter Conference on Applications
of Computer Vision (WACV).
52. Rothe, R., Timofte, R. and Van Gool, L., 2016. Some Like It Hot - Visual
Guidance for Preference Prediction. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
53. Zhu, Y., Li, Y., Mu, G. and Guo, G., 2015. A Study on Apparent
Age Estimation. IEEE International Conference on Computer Vision
(ICCV) Workshops.
54. Yang, X., Gao, B.B., Xing, C., Huo, Z.W., Wei, X.S., Zhou, Y.,
Wu, J. and Geng, X., 2015. Deep Label Distribution Learning for
Apparent Age Estimation. IEEE International Conference on Computer
Vision (ICCV) Workshops.
55. Cui, Y., Zhou, G., Yang, J. and Yamaguchi, Y., 2011. On the iterative
censoring for target detection in SAR image. IEEE Transactions on
Geoscience and Remote Sensing, 8(4), pp.641–645.
56. Park, J.I. and Kim, K.T., 2014. Modified polar mapping classifier for
SAR automatic target recognition. IEEE Transactions on Aerospace
and Electronic Systems, 50(2), pp.1092–1107.
57. Novak, L.M., Owirka, G.J., Brower, W.S. and Weaver, A.L., 1997.
The automatic target-recognition system in SAIP. The Lincoln
Laboratory Journal, 10(2), pp.187–201.
58. Ross, T.D., Bradley, J.J., Hudson, L.J. and O’Connor, M.P., 1999.
SAR ATR: So What’s the Problem? An MSTAR Perspective. Pro-
ceeding in 6th SPIE Conference Algorithms SAR Imagery, Vol. 3721,
pp.566–579.
59. Keydel, E.R., Lee, S.W. and Moore, J.T., 1996. MSTAR Extended
Operating Conditions: A Tutorial. Proceeding in 6th SPIE Confer-
ence Algorithms SAR Imagery, Vol. 2757, pp.228–242.
60. Hirose, A., ed., 2013. Complex-Valued Neural Networks: Advances
and Applications. Hoboken, NJ, USA. Wiley-IEEE Press.
61. Zhao, Q. and Principe, J.C., 2001. Support vector machines for SAR
automatic target recognition. IEEE Transactions on Aerospace and
Electronic Systems, 37(2), pp.643–654.
62. Sun, Y.J., Liu, Z.P., Todorovic, S. and Li, J., 2007. Adaptive
boosting for SAR automatic target recognition. IEEE Transactions on
Aerospace and Electronic Systems, 43(1), pp.112–125.
63. Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet
classification with deep convolutional neural networks. Proceedings
of Advances in Neural Information Processing Systems, pp.1097–
1105.
64. Bai, X., Wang, X., Latecki, L.J., Liu, W. and Tu, Z., 2010. Active
Skeleton for Non-Rigid Object Detection. IEEE 12th International
Conference on Computer Vision (ICCV).
65. Girshick, R., 2015. Fast R-CNN. IEEE International Conference on
Computer Vision, pp.1440–1448.
66. Ren, S., He, K., Girshick, R. and Sun, J., 2017. Faster R-CNN:
Towards real-time object detection with region proposal networks.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
39(6).
67. Zeiler, M.D. and Fergus, R., 2014. Visualizing and Understanding
Convolutional Neural Networks. Proceeding 13th European Confer-
ence on Computer Vision, pp.818–833.
68. Simonyan, K. and Zisserman, A., 2015. Very Deep Convolutional
Networks for Large-Scale Image Recognition. Proceeding in Inter-
national Conference Learning Representations.
69. Zhao, W., Du, S. and Emery, W.J., 2017. Object based convolutional
neural network for high resolution imagery classification.
IEEE Journal of Selected Topics in Applied Earth Observations and
Remote Sensing, 10(7).
70. Zhao, W. and Du, S., 2016. Spectral-spatial feature extraction for
hyperspectral image classification: A dimension reduction and deep
learning approach. IEEE Transactions on Geoscience and Remote
Sensing, 54(8), pp.4544–4554.
71. Szegedy, C., Toshev, A. and Erhan, D., 2013. Deep neural networks
for object detection. Advances in Neural Information Processing Systems
26 (NIPS 2013).
72. Pinheiro, P.O., Collobert, R. and Dollár, P., 2015. Learning to seg-
ment object candidates. Advances in Neural Information Processing
Systems 26.
73. Szegedy, C., Reed, S., Erhan, D., Anguelov, D. and Ioffe, S., 2014.
Scalable, High-Quality Object Detection. ArXiv:1412.1441.
74. Yoo, D., Park, S., Lee, J.-Y., Paek, A.S. and Kweon, I.S., 2015.
AttentionNet: Aggregating Weak Directions for Accurate Object
Detection. IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR).
75. Rothe, R., Timofte, R. and Van Gool, L., 2018. Deep expectation of
real and apparent age from a single image without facial landmarks.
International Journal of Computer Vision, 126, pp.144–157.
76. Mou, L., Ghamisi, P. and Zhu, X.X., 2018. Unsupervised spectral-spatial
feature learning via deep residual conv-deconv network for
hyperspectral image classification. IEEE Transactions on Geoscience
and Remote Sensing, 56(1).
77. Dosovitskiy, A., Springenberg, J.T. and Brox, T., 2015. Learning
to Generate Chairs, Tables and Cars with Convolutional Networks.
Proceeding in IEEE Conference on Computer Vision Pattern Recog-
nition (CVPR), pp.1538–1546.
78. Redmon, J. and Farhadi, A., 2016. YOLO9000: Better, Faster,
Stronger. ArXiv:1612.08242.
79. Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You
Only Look Once: Unified, Real-Time Object Detection. IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR).
80. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y.
and Berg, A.C., 2016. SSD: Single Shot MultiBox Detector. IEEE
European Conference on Computer Vision (ECCV).
81. Erhan, D., Szegedy, C., Toshev, A. and Anguelov, D., 2014. Scalable
Object Detection Using Deep Neural Networks. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
82. Bell, S., Zitnick, C.L., Bala, K. and Girshick, R., 2016. Inside-outside
net: Detecting Objects in Context with Skip Pooling and
Recurrent Neural Networks. IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
83. Charters, J., Wang, Z., Chi, Z., Tsoi, A.C. and Feng, D.D., 2014.
EAGLE: A Novel Descriptor for Identifying Plant Species Using
Leaf Lamina Vascular Features. ICME-Workshop, pp.1–6.
84. Dudgeon, D.E. and Lacoss, R.T., 1993. An overview of automatic
target recognition. The Lincoln Laboratory Journal, 6(1), pp.3–10.
85. Wang, W., Yang, J., Xiao, J., Li, S. and Zhou, D., 2015. Face
Recognition Based on Deep Learning. Springer International Publishing
Switzerland. pp.812–820.
86. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J. and
Zisserman, A., 2007. The PASCAL visual object classes challenge
results. International Journal of Computer Vision, 88(2),
pp.303–338.
87. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollar, P. and Zitnick, C.L., 2014. Microsoft COCO:
Common Objects in Context. Proceeding in European Conference
on Computer Vision, pp.740–755.
88. Song, S., Lichtenberg, S.P. and Xiao, J., 2015. SUN RGB-D:
A RGB-D Scene Understanding Benchmark Suite. Proceeding
in IEEE Conference on Computer Vision Pattern Recognition,
pp.567–576.
89. Zitnick, C.L. and Dollar, P., 2014. Edge Boxes: Locating Object
Proposals from Edges. Proceeding in 13th European Conference on
Computer Vision, pp.391–405.
90. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.-X., Chang, Y.-F. and
Xiang, Q.-L., 2007. A Leaf Recognition Algorithm for Plant Clas-
sification Using Probabilistic Neural Network. IEEE International
Symposium on Signal Processing and Information Technology,
pp.11–16.
Received: 20 April 2019. Accepted: 10 May 2019.