RESEARCH ARTICLE
Copyright © 2019 American Scientific Publishers. All rights reserved. Printed in the United States of America.
Journal of Computational and Theoretical Nanoscience, Vol. 16, 4044–4052, 2019. doi:10.1166/jctn.2019.8291
Object Recognition Using Deep Learning
Rohini Goel1, Avinash Sharma2, and Rajiv Kapoor3
1Research Scholar, Department of Computer Science and Engineering, Maharishi Markandeshwar (Deemed to be University),
Mullana 133203, Ambala, India
2Department of Computer Science and Engineering, Maharishi Markandeshwar (Deemed to be University), Mullana 133203, Ambala, India
3Department of Electronics and Communication Engineering, Delhi Technical University, Delhi 110042, India
Abstract
Deep learning approaches have drawn much attention from researchers in the area of object recognition because of their inherent strength in overcoming the shortcomings of classical approaches that depend on handcrafted features. In the last few years, deep learning techniques have brought many developments to object recognition. This paper presents some recent and efficient deep learning frameworks for object recognition. An up-to-date survey of recently developed deep neural network based object recognition methods is given. The various benchmark datasets that are used for performance evaluation are also discussed. Applications of object recognition approaches to specific types of objects (such as faces, buildings and plants) are also highlighted. We conclude with the merits and demerits of existing methods and the future scope in this area.
Keywords: Convolutional Neural Network (CNN), Faster R-CNN, Network on Convolution
Feature Map (NoC), Deep Expectation (DEX), Deep Residual Conv-Deconv
Network, A-ConvNet.
1. INTRODUCTION
Due to the rapid growth in computer technology, computers can play an important role in completing routine daily tasks [1]. For a human, visual object classification and recognition are natural, effortless processes of the biological visual system, but they are not easy for a computer to imitate because of the high variability among object images of the same class under different viewing conditions. Object recognition is a crucial challenge in the field of computer vision. As the implementation of object recognition on machines is an intricate task, robust yet less complicated object recognition systems need to be designed [2]. The digital database of visual information is growing day by day; to manage and analyze this huge mass of visual information, image analysis approaches are required that can automatically extract its semantic context. The objects in the images are among the most crucial contexts for the object recognition task. Good image feature descriptions are the backbone of a good object recognition system [3].
In the previous decade, comprehensive studies of high resolution image classification were carried out with handcrafted features from the spatial and spectral domains. The gray-level co-occurrence matrix (GLCM) [4] is used as a texture-based descriptor to provide the spectral variation information required for efficient image classification. Extended morphological profiles were proposed by Benediktsson et al. [5] to extract spatial features for high resolution urban image classification. For high resolution images, the Gabor filter [6] and wavelet transform [7] were also used for spatial feature extraction. Given the intra-class variation of building databases, handcrafted features are not an efficient solution. Therefore, handcrafted features were replaced by features extracted with the sparse coding scheme proposed by Cheriyadat [8]. The sparse-constrained support vector machine (SVM) is another feature learning model, presented by Tuia et al. [9]. Deep features [10] are more efficient and powerful than low level features in scene classification, image classification and face recognition.
The most widely used object proposal approaches are based either on super-pixel grouping (e.g., MCG [11], CPMC [12] and Selective Search [13]) or on sliding windows (e.g., edge boxes [14], objectness in windows [15]). Besides these, there are some object proposal methods that are taken up as detector-independent external modules (e.g., selective search object detection). The R-CNN [16] method is used as an object detector to separate the proposal regions into object categories or background. The pioneering work by Viola and Jones utilizes Haar features [17]
and boosted classifiers on sliding windows. HOG features [18] are combined with linear SVMs [19] as a sliding-window classifier, and DPM [20] generates deformable graphical models. In the OverFeat approach [21], every sliding window of the convolutional feature map is passed through a fully connected layer for efficient detection and classification. In the SPP based detection method [22], the features are merged from the proposed regions on the convolutional feature map and fed to a fully connected layer for classification.
Typical supervised classification models include the decision tree [23], random forest [24] and support vector machine (SVM) [25]. A random forest approach is based on the construction of several decision trees during training, and the integrated prediction of all the trees is used for classification. SVM uses finite training samples to tackle high dimensional data. Random forests and SVMs are shallow models; they have limited ability to handle nonlinear data compared to deep networks. For image classification, Chen et al. [26] proposed a stacked autoencoder to predict the hierarchical features of hyperspectral images in the spectral domain. A deep belief network (DBN) [27] represents spectral based features for hyperspectral data classification. Mou et al. [28] introduced recurrent neural networks for the classification of hyperspectral images. The aforementioned methods, such as autoencoders and DBNs, are 1-D deep architectures, and 1-D processing may cause the loss of structural information in hyperspectral imagery. The CNN has the capability to automatically discover contextual 2-D spatial features for image classification. There are various supervised CNN-based models used for spectral-spatial classification of hyperspectral remote sensing images. Chen et al. [29] proposed a supervised l2-regularized 3-D CNN based feature extraction model used for classification purposes. Ghamisi et al. [30] proposed a self-improving CNN model. Zhao and Du [31] introduced a classification framework based on spectral and spatial features. In the transition from supervised to unsupervised CNNs, Romero et al. [32] presented an unsupervised convolutional network for spatial-spectral feature extraction, adopting sparse learning to predict the network weights.
Various types of feature extractors were used in the past based on shape, texture and venation. Neto et al. [33] proposed shape based elliptic Fourier and discriminant analysis to discriminate different plant types. Other shape based approaches were built on invariant moments and centroid-radii models [34]. The combination of geometrical and invariant moment features was introduced by Du et al. [35] to extract features. Shape context and HOG [36] have been employed as shape descriptors. The Fourier descriptor, shape defining features (SDF) [37], hand crafted shape (HCS) [38] and histogram of curvature over scale (HoCS) [39] are some important shape based feature descriptors. Texture features also make an important contribution to leaf identification. Gabor co-occurrence in texture classification was proposed by Cope et al. [40]. Learning vector quantization (LVQ) with radial basis functions (RBF) was proposed by Rashad et al. [41] to recognize texture features. The merger of the gray level co-occurrence matrix (GLCM) and LBP was proposed by Tang et al. [42] to extract texture based features. Among the most important features for leaf identification are venation features. Legume classification based on leaf venation was proposed by Larese et al. [43]; features are extracted from the vein pattern using the unconstrained hit-or-miss transform (UHMT), and a CNN is then trained for recognition. Age estimation can be considered as a regression or a classification issue. Support Vector Regression (SVR) [44] and Canonical Correlation Analysis (CCA) [45] are well known regression techniques, whereas the classical Nearest Neighbor (NN) and support vector machines (SVMs) [46] are used as classification approaches. Chen et al. [47] proposed the CA-SVR technique. Huerta et al. [48] fused image texture and local appearance descriptors, and Guo and Mu [49] utilized CCA and PLS for real age estimation. Yi et al. [50] proposed a multi-scale CNN, Wang et al. [51] deployed deeply learned aging features (DLA), whereas Rothe et al. [52] combined a CNN with SVR for efficient real age estimation. For apparent age estimation, besides our work, Liu et al. proposed a technique based on deep transfer learning and the GoogleNet framework. Zhu et al. [53] deployed GoogleNet with random forests and SVR. Yang et al. [54] used face and landmark detection and the VGG-16 framework for face alignment and modeling, respectively.
The standard SAR-ATR framework has three phases: detection, discrimination and classification. The detection phase extracts target candidates from the SAR image using CFAR detection [55]. The discrimination stage then removes false alarms and selects the features necessary to detect the target. The last phase, the classifier, is used to classify each input as one of two classes (target or clutter) [56]. The classification stage may follow any of three prototypes: template matching, model based methods and machine learning. The semiautomated image intelligence processing (SAIP) [57] system is the popularly used template based system, but its performance degrades under extended operating conditions (EOC) [58]. To overcome this issue, the model based moving and stationary target acquisition and recognition (MSTAR) [59] system was developed, along with the evaluation of trainable classifiers such as the artificial neural network (ANN) [60], SVM [61] and AdaBoost [62]. The machine learning prototypes have been adapted to SAR. Nowadays, deep convolutional networks (ConvNets) [63] present remarkable performance in object detection and recognition. The remainder of this paper is structured as follows: In Sections 2 and 3, two-step and single-step object recognition architectures using deep learning are discussed (see Fig. 1). Section 4 illustrates the various object recognition applications, Section 5 describes the various datasets used in different deep learning based object recognition techniques, Section 6 shows a comparative analysis and we conclude in Section 7.
Fig. 1. Overview of deep learning based object recognition techniques. Two-step architectures: R-CNN, SPPnet, Fast R-CNN, Faster R-CNN and object-based CNN. One-step architectures: DEX, Deep Residual Conv-Deconv Network, YOLO and SSD.
2. TWO-STEP ARCHITECTURE
In 2014, Ross Girshick et al. [16] proposed R-CNN for quality enhancement of the candidate bounding boxes and extraction of high level features using a deep architecture. R-CNN showed better results than the previous best on the PASCAL VOC 2012 dataset. R-CNN has two stages. In the region proposal generation stage, about 2K region proposals are produced using the selective search method. Accurate bounding boxes of arbitrary sizes are generated very quickly with a reduced search space, owing to the fact that selective search uses bottom-up grouping and saliency cues. In the deep feature extraction stage, a deep CNN extracts features from the cropped or warped region proposals. These final and robust 4096-dimensional features are obtained due to the high learning capacity and strong expressive power of the CNN's structure. The region proposals are scored as positive or negative (background) regions with the help of category specific linear SVMs. Bounding box regression adjusts these regions, and greedy non-maximum suppression (NMS) then filters them to produce the final bounding boxes for the retained object locations. Although R-CNN has various improvements over traditional approaches, it still has some disadvantages. Due to the requirement of a fixed size input image, re-computation of the CNN takes more time in the testing period. R-CNN training is a multi-stage pipeline process. More storage space and time are required for training R-CNN because the features of different region proposals are extracted and stored on disk. Due to the high redundancy of region proposals, it is a time consuming process.
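As a concrete illustration of the greedy NMS step described above, the following is a minimal Python sketch, assuming boxes in (x1, y1, x2, y2) form; the 0.5 IoU threshold is an illustrative choice rather than a value fixed by R-CNN.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes:
    keep the highest-scoring box, drop boxes that overlap it beyond
    the IoU threshold, and repeat on the survivors."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the chosen box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * \
                (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]  # survivors for the next round
    return keep
```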
Due to its fixed size input, R-CNN warps or crops the region proposals to the required size. Either incomplete information results from the cropping or distortion occurs due to the warping operation; these effects can weaken the recognition accuracy. To rectify this problem, He et al. presented a new CNN architecture called SPPnet [64]. In the SPPnet architecture, unlike R-CNN, the feature map of the 5th convolutional layer (conv5) is reused to map the arbitrarily sized region proposals to fixed size feature vectors. The reusability of these feature maps is feasible because they encode local responses along with their spatial positions. The layer after the final convolutional layer is the spatial pyramid pooling layer (SPP layer). If the conv5 layer has 256 feature maps, then after the SPP layer each region proposal has a final feature vector of dimension 256 × (1² + 2² + 4²) = 5376 due to the 3-level pyramid. SPPnet shows better results in correct region proposal estimation and also enhances detection efficiency during the testing phase, due to the sharing of computation before the SPP layer.
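A minimal PyTorch sketch of this pooling, assuming a 256-channel feature map and the 3-level pyramid (1 × 1, 2 × 2, 4 × 4) from the text, is shown below; the function name and input sizes are our own, used only for demonstration.

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Pool an arbitrarily sized conv feature map into a fixed-length
    vector. With 256 channels and levels (1, 2, 4), the output has
    256 * (1 + 4 + 16) = 5376 dimensions, matching the text."""
    outputs = []
    for n in levels:
        # Adaptive max pooling yields an n x n grid whatever the input size.
        pooled = nn.AdaptiveMaxPool2d(n)(feature_map)
        outputs.append(pooled.flatten(start_dim=1))
    return torch.cat(outputs, dim=1)

# Two different spatial sizes map to the same 5376-d vector.
print(spatial_pyramid_pool(torch.randn(1, 256, 13, 17)).shape)  # [1, 5376]
print(spatial_pyramid_pool(torch.randn(1, 256, 24, 9)).shape)   # [1, 5376]
```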
Although SPPnet has shown remarkable improvements in accuracy and efficiency in comparison to R-CNN, it has some drawbacks. SPPnet incurs additional storage cost due to its multi-stage pipeline architecture, similar to R-CNN. Moreover, the fine tuning algorithm [22] is not able to update the convolutional layers before the SPP layer; due to this, an unsurprising drop in accuracy occurs in deep networks. To avoid these problems, Girshick [65] introduced a novel CNN architecture known as Fast R-CNN. Like SPPnet, in Fast R-CNN the complete image is processed by the conv layers to generate the feature map. The ROI pooling layer extracts a feature vector of fixed length from every region proposal. Every feature vector is passed through a number of fc layers to reach two output layers: one layer produces probabilities over the C+1 categories, and the other generates the bounding box position as four real-valued numbers. The pipeline of Fast R-CNN is sped up by sampling the mini-batches hierarchically and by compressing the fc layers using truncated singular value decomposition (SVD).
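The truncated-SVD compression of an fc layer can be illustrated with a small NumPy sketch; the layer size and rank below are arbitrary demonstration values, not those of the paper.

```python
import numpy as np

def compress_fc(W, k):
    """Approximate a fully connected weight matrix W (out x in) by a
    rank-k factorization, replacing one big layer with two thinner
    ones: k * (in + out) parameters instead of in * out."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = Vt[:k, :]           # first thin layer:  k x in
    W2 = U[:, :k] * S[:k]    # second thin layer: out x k (scaled by S)
    return W1, W2

W = np.random.randn(1024, 1024).astype(np.float32)
W1, W2 = compress_fc(W, 64)
x = np.random.randn(1024).astype(np.float32)
y_full = W @ x               # original fc layer
y_fast = W2 @ (W1 @ x)       # two cheaper matrix-vector products
```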
In Faster R-CNN, region proposal algorithms [13] are used to predict object locations for various object detection networks. Fast R-CNN [65] and SPPnet are used as detection networks with reduced running time, but the bottleneck is the computation of region proposals. The proposed RPN contributes full-image convolutional features to the detection network [22]. Object boundaries and objectness scores are predicted by this fully convolutional network [66], known as the RPN, while Fast R-CNN is used as the detection network. The Fast R-CNN and RPN are then merged into a unified network by sharing convolutional features, with an 'attention' mechanism: the RPN tells the detection network where to look, and the detection network then detects objects in that particular region. The
deep VGG-16 is used as the detection network. Region proposals are generated by sliding a small network over the feature map of the last shared convolutional layer. The small network takes an n × n spatial window of the input convolutional feature map and maps each sliding window to a lower dimensional feature (256-d for ZF [67] and 512-d for VGG). This feature is fed into two sibling layers: a box regression layer (reg) and a box classification layer (cls). Both networks (region proposal and object detection) share common convolutional layers. In this method, the work is implemented on the ZF net [67] (having 5 sharable convolutional layers) and VGG-16 [68], which has 13 sharable convolutional layers.
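A sketch of the anchor enumeration behind the RPN is given below; the stride, scales and aspect ratios follow the commonly cited Faster R-CNN configuration and should be read as assumptions rather than the only valid settings.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchor boxes
    (x1, y1, x2, y2) at every sliding-window position; ratios are
    interpreted as height/width."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)   # shape: (feat_h * feat_w * 9, 4)
```

Each anchor is then scored by the cls layer and refined by the reg layer, and the surviving proposals are handed to the detection network.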
In the fields of disaster rescue and urban planning, classification and interpretation [69] accuracy and speed play a critical role for high resolution images [70]. The recognition of complex patterns becomes challenging as the resolution of the images gets finer. In the object based CNN, deep learning provides a potent way to efficiently reduce the semantic gap of complex patterns. However, the boundaries of the different objects are not captured precisely by deep learning techniques alone. To overcome this problem, the merger of a deep feature learning method and an object based classification strategy is proposed. This proposed method improves the accuracy of high resolution image classification. The method involves two steps: extraction of deep features through a CNN and then object based classification with the help of the deep features, as sketched below.
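A minimal sketch of this two-step pipeline, assuming a recent torchvision and using VGG-16 as a stand-in feature extractor, might look as follows; the helper and variable names are hypothetical, and the paper's exact CNN configuration may differ.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Step 1: a fixed deep feature extractor (VGG-16 conv layers only).
backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def deep_feature(object_patch):
    """object_patch: a PIL image cropped around one segmented object."""
    x = preprocess(object_patch).unsqueeze(0)
    with torch.no_grad():
        fmap = backbone(x)          # (1, 512, 7, 7) conv feature map
    return fmap.flatten().numpy()   # fixed-length deep descriptor

# Step 2: object based classification on the deep descriptors
# (hypothetical names; any conventional classifier could be used).
# features = [deep_feature(patch) for patch in object_patches]
# classifier.fit(features, labels)
```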
3. SINGLE-STEP ARCHITECTURE
Szegedy et al. [71] proposed a DNN based approach in which simple bounding box inference extracts the objects with the help of binary masks. This method is not well suited to extracting overlapping objects. Pinheiro et al. [72] formulated a CNN model that has two branches: the first branch generates masks and the second branch predicts the likelihood that a given patch contains an object. Erhan et al. [73] presented a regression based multibox approach to produce region proposals, whereas Yoo et al. [74] proposed a classification technique for object detection using a CNN architecture named AttentionNet.
The Deep Expectation (DEX) approach for age estimation without facial landmarks, together with the IMDB-WIKI database of face images with age and gender labels, is introduced in Ref. [75]. Real and apparent age estimation are handled by a CNN with the VGG-16 architecture pretrained on ImageNet. The age estimation issue is treated as a deep classification problem with softmax. The main factors of the work are the deep learned model, robust face alignment and expected value calculation for age regression. First, the face is aligned by rotation and cropped for the subsequent steps. This is a more robust method for face alignment than using facial landmarks because its failure rate is low (about 1%). In cases where the face is not detected, the full image is considered as the face. Context around the face enhances the performance, so an extra 40% of the height and width of the face is appended on all sides. The aligned face is resized to 256 × 256 pixels before being fed to the deep CNN. The VGG-16 architecture is deployed as the CNN used to predict the age of a person. VGG-16 has 13 convolutional layers with 3 × 3 convolutional filters and 3 fully connected layers. The CNN is then fine tuned using the new IMDB-WIKI dataset. Age estimation becomes a regression problem if the last layer is replaced by a single output neuron; however, CNN training for regression is relatively unstable, so age prediction is instead treated as a classification problem. The age value is represented by age ranges, and the number of age ranges depends on the size of the training set.
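The expected-value readout that gives DEX its name can be sketched in a few lines: a softmax over the discrete age classes followed by the expectation over the bin centers. The 101 one-year bins below are an illustrative layout, not necessarily the paper's exact binning.

```python
import numpy as np

def expected_age(logits, age_bins):
    """Softmax over age classes, then the expected value over bins."""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return float(np.dot(p, age_bins))

logits = np.random.randn(101)                # network output for one face
print(expected_age(logits, np.arange(101)))  # continuous age estimate
```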
In the Deep Residual Conv-Deconv Network, a novel neural network architecture is proposed for unsupervised spectral-spatial feature learning from hyperspectral images. The proposed network, i.e., a fully Conv-Deconv network, is based on the encoder-decoder mechanism. The 3-D hyperspectral data is first transformed into a lower dimensional space by the encoder (convolution subnetwork), and the original data is then reconstructed by expansion through the decoder (deconvolution subnetwork). Residual learning and a new unpooling technique are used to refine the proposed network. This work shows that a few neurons in the initial residual block have good capabilities for predicting visual patterns in the objects. The unsupervised spectral-spatial features are extracted using the proposed network for remote sensing imagery. The convolution subnetwork has a number of convolutional blocks, where each block is made up of a stack of convolutional layers with filters of size 3 × 3. Within a block, the convolutional layers have the same feature map size and the same number of filters, and the number of channels of the feature maps grows in deeper convolutional blocks. Each convolutional layer is equipped with the ReLU [76] activation function. After each convolutional block, there is a max-pooling layer to spatially shrink the feature maps. The deconvolution subnetwork is used to reconstruct the input data from the deep features and comprises unpooling [77] and convolutional layers. The convolutional block configuration of the deconvolution subnetwork is the same as that of the convolution subnetwork. When the network is trained naively, learning slows down because the network converges to a high error value; in this situation the joint optimization of the two subnetworks is not easy. The other problem is the unpooling operation in the deconvolution subnetwork: simple unpooling discards the positions of the maximum values, which leads to loss of edge information at decoding time. To settle these problems, the fully Conv-Deconv network is refined by including residual learning and by unpooling with max-pooling indices to retain the locations of the maximum values.
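Unpooling with stored max-pooling indices is available directly in PyTorch, as the following minimal sketch (with arbitrary tensor sizes) shows: the pooled positions are remembered during encoding and restored during decoding, which is what preserves the edge information discussed above.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)      # an encoder feature map
pooled, indices = pool(x)           # (1, 64, 16, 16) plus argmax positions
restored = unpool(pooled, indices)  # zeros except at the max locations
print(restored.shape)               # torch.Size([1, 64, 32, 32])
```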
The YOLO framework, proposed by Redmon et al. [78], predicts the confidences for multiple categories and the bounding boxes. In this framework, the whole image is first divided into an N × N grid, and each grid cell predicts the objects centered in that grid cell, along with their bounding boxes and confidence scores. The contributions of grid cells are only counted for those that contain an object. YOLO contains 24 conv layers and 2 fc layers; some of the conv layers build up groups of inception-style modules with 1 × 1 reduction layers followed by 3 × 3 conv layers. The network can handle images at 45 FPS in real time, and the Fast YOLO version can handle 155 FPS, which is better than other detectors. Due to the strong spatial constraints on bounding box predictions [79], YOLO struggles with groups of small objects. YOLO also generates coarse features due to the multiple downsampling operations, so it does not easily generalize to objects in new configurations.
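The per-cell parameterization can be made concrete with a small sketch that decodes one grid-cell prediction into image coordinates, following the original YOLO convention (cell-relative center, image-relative width and height); the function and argument names are our own.

```python
def decode_yolo_cell(pred, row, col, n_grid, img_w, img_h):
    """Decode one grid-cell prediction (x, y, w, h, confidence):
    x, y are offsets within the cell in [0, 1]; w, h are fractions
    of the whole image. Returns (x1, y1, x2, y2) and the confidence."""
    x, y, w, h, conf = pred
    cx = (col + x) * img_w / n_grid   # box center in pixels
    cy = (row + y) * img_h / n_grid
    bw, bh = w * img_w, h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2), conf

box, conf = decode_yolo_cell((0.5, 0.5, 0.2, 0.3, 0.9),
                             row=3, col=4, n_grid=7,
                             img_w=448, img_h=448)
```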
To avoid these problems, a novel approach, the single shot multibox detector (SSD) [80], was proposed by Liu et al. This approach is motivated by the anchors used in MultiBox [81], RPN [74] and multiscale representations [82]. Unlike the fixed grid used in YOLO, anchor boxes of distinct aspect ratios and scales are used in SSD to discretize the output space of bounding boxes. Predictions from multiple feature maps with different resolutions are combined by the network to handle objects of different sizes. In the SSD architecture, some feature layers are attached at the end of the VGG-16 network to predict the offsets of default boxes of distinct scales and aspect ratios. The weighted sum of the confidence loss and the localization loss is used to train the network. NMS is applied to the multiscale refined bounding boxes to produce the final object detections.
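The default-box generation can be sketched as follows for a single square feature map; the scales and aspect ratios are illustrative assumptions rather than the exact SSD configuration.

```python
import numpy as np

def default_boxes(feat_size, scale, ratios=(1.0, 2.0, 0.5)):
    """Center-form default boxes (cx, cy, w, h) for one feature map,
    all in [0, 1] image coordinates; ratios are width/height."""
    boxes = []
    for i in range(feat_size):
        for j in range(feat_size):
            cx, cy = (j + 0.5) / feat_size, (i + 0.5) / feat_size
            for r in ratios:
                boxes.append([cx, cy, scale * np.sqrt(r), scale / np.sqrt(r)])
    return np.array(boxes)

# Coarser feature maps are paired with larger scales for bigger objects.
boxes_fine = default_boxes(38, scale=0.1)    # early layer, small objects
boxes_coarse = default_boxes(3, scale=0.8)   # late layer, large objects
```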
4. APPLICATIONS
4.1. Plant Identification
The plant identification system is an area of computer vision that helps botanists to identify unknown plant species rapidly and easily. Various studies have been carried out to increase the use of leaf data for the prediction of plant species. In this method, useful features of the leaves are extracted by a convolutional neural network, and the extracted features are visualized with a deconvolutional network. This method shows that the different orders of venation [83] are better cues than shape [33] information. A multilevel representation of leaf data (from lower level to higher level) has been observed according to species class. This work is helpful for designing hybrid feature extraction models that enhance the performance of plant classification systems.
4.2. Age Estimation
One of the most important attributes of identity and social interaction is age. The prediction of age depends upon various factors like posture, facial wrinkles and vocabulary. Age estimation is used in the development of numerous applications like intelligent human machine interfaces, and safety and protection in different areas like security, transport and medicine. Developments in the field of artificial intelligence (AI) have enhanced the use of deep learning techniques for accurate age estimation. Deep learning approaches [75] show effectiveness and robustness for age estimation in comparison to traditional age estimation approaches.
4.3. Target Classification for SAR Images
The SAR-ATR (synthetic aperture radar automatic target recognition) algorithm comprises a feature extractor and a trainable classifier. Hand designed features are often extracted and limit the accuracy of the system. Deep convolutional networks have achieved state of the art results in many computer vision and speech recognition tasks by automatically learning features from huge amounts of data. However, the use of convolutional networks for SAR-ATR encounters serious overfitting problems. To overcome this issue, a new all-convolutional network (A-ConvNet) is proposed. The A-ConvNet comprises sparsely connected layers, without using any fully connected layers. The A-ConvNet demonstrated superior performance over a traditional ConvNet in the classification [84] of targets in the SAR image dataset.
4.4. Face Recognition
The identification of an individual by face from an image or a database is known as face recognition [85]. Due to the increasing volume of datasets, machine learning approaches like deep neural networks are used for the face recognition problem. Deep learning approaches perform significantly better with large datasets; especially, convolutional neural networks (CNNs) attain tremendous recognition rates for the face recognition problem.
5. DATASETS
There are datasets with complex urban conditions from three very distinct cities, i.e., Beijing, Pavia and Vaihingen. The Beijing scene was captured by WorldView-2 in 2010. The ROSIS sensor captured the Pavia scenes during a flight over Pavia, Italy. The PASCAL VOC 2007 [86] dataset (consisting of 5k trainval images and 5k test images of 20 object classes), the PASCAL VOC 2012 [86] dataset and the MS COCO dataset [87] (containing 80 object classes with 80k training images, 40k validation images and 20k test images) are used to assess the performance of the Faster R-CNN approach. The benchmarking of SegNet is performed on the CamVid 11 road class segmentation dataset [88] and the SUN RGB-D indoor scene database [89]. The new MalayaKew leaf dataset and the well known Flavia [90] leaf dataset are used to analyze the performance of the plant identification approach.
In the DEX work, five distinct datasets for real and apparent age estimation are used. IMDB-WIKI is the largest dataset for age estimation. The MORPH and FG-NET datasets are used for real age estimation, whereas the LAP dataset is used for apparent age estimation.
Table I. Comparative analysis.

Researchers                Technique used             Database                           Result (% accuracy)
R. Girshick et al. [16]    R-CNN                      PASCAL VOC, ILSVRC                 79.8
X. Bai et al. [22]         SPPnet                     PASCAL VOC 2007, ILSVRC            93.42
R. Girshick et al. [65]    Fast R-CNN                 PASCAL VOC                         89.3
S. Ren et al. [66]         Faster R-CNN               PASCAL VOC 2007, PASCAL VOC 2012   91.8
W. Zhao et al. [69]        Object based CNN           Beijing, Pavia, Vaihingen          99.04
R. Rothe et al. [75]       DEX                        IMDB-WIKI                          96.6
L. Mou et al. [76]         Deep residual conv-deconv  Pavia, Indian Pines                87.39
J. Redmon et al. [78]      YOLO                       PASCAL VOC 2012, COCO              90.6
W. Liu et al. [80]         SSD                        PASCAL VOC 2012, COCO              83.2
The experiments are carried out using two widely used hyperspectral datasets, i.e., Indian Pines and Pavia University. The Indian Pines dataset was gathered over the Indian Pines site in northwestern Indiana using the airborne visible/infrared imaging spectrometer (AVIRIS) sensor. The Pavia University dataset was acquired over the University of Pavia by the reflective optics system imaging spectrometer (ROSIS). The performance of hyperspectral image classification approaches is assessed using the overall accuracy (OA), average accuracy (AA) and Kappa coefficient. The experiments with A-ConvNet are carried out on the moving and stationary target acquisition and recognition (MSTAR) [59] benchmark dataset under standard operating conditions (SOC) and extended operating conditions (EOC) [58].
6. COMPARATIVE ANALYSIS
Some of the research work in the field of object recognition using deep learning is summarized in Table I. The various techniques used by different researchers are mentioned in the table. The results reported by those researchers are very encouraging but were calculated on particular types of databases. The important question is how these methods will perform when used with other databases. Therefore, a comparative analysis between the techniques mentioned in the literature is desirable.
7. CONCLUSION
Among contemporary object recognition approaches, deep neural network based techniques achieve remarkable performance due to their powerful learning capability. In this paper, the recent developments in deep neural network based object recognition frameworks have been reviewed in detail. First, the two-step frameworks were reviewed to familiarize the reader with the architectures used for object recognition. Then, one-step frameworks such as YOLO and SSD were also reviewed. The various benchmark datasets and different application areas of object recognition were also discussed. Finally, we conclude with promising future directions to give an intensive perspective on object recognition. This paper provides worthwhile insight and guidance for future progress in the field of deep learning based object recognition. Based on the literature review, there is scope for future improvement. The object based CNN method for high resolution imagery classification has no contextual information at the global level; in the future, the main focus will be on contextual information to further improve performance, because information about the relationships between image objects affects classification efficiency.
In SegNet, an estimation model can be designed to calculate the uncertainty of predictions from the deep segmentation network.
The training dataset for the age estimation approach DEX can be enlarged in the future, and more robust landmark detectors can lead to better face alignment.
Another possible future work is to explore the capability of the Deep Residual Conv-Deconv Network for hyperspectral image classification using attribute profiles (APs) and extinction profiles that extract spatial information in a robust and adaptive way.
References
1. Shokoufandeh, A., Keselman, Y., Demirci, M. F., Macrini, D. and
Dickinson, S.J., 2012. Many to many feature matching in object
recognition: A Review of three approaches. IET Computer Vision,
6(6), pp.500–513.
2. Lillywhite, K. and Archibald, J., 2013. A feature construc-
tion method for general object recognition. Pattern Recognition,
New York, NY, USA, Elsevier Science Inc. Vol. 46, pp.3300–
3314.
3. Martin, L., Tuysuzojlu, A., Karl, W. C. and Ishwa, P., 2015. Learning
based object identification and segmentation using dual energy CT
images for security. IEEE Transaction on Image Processing,24(11),
pp.4069–4081.
4. Puissant, A., Hirsch, J. and Weber, C., 2005. The utility of texture
analysis to improve perpixel classification for high to very high spa-
tial resolution imagery. International Journal Remote Sensing,26(4),
pp.733–745.
5. Benediktsson, J.A., Palmason, J.A. and Sveinsson, J.R., 2005.
Classification of hyperspectral data from urban areas based on
extended morphological profiles. IEEE Transactions on Geoscience
and Remote Sensing,43(3), pp.480–491.
6. Bau, T.C., Sarkar, S. and Healey, G., 2010. Hyperspectral region classification using a three-dimensional Gabor filterbank. IEEE Transactions on Geoscience and Remote Sensing, 48(9), pp.3457–3464.
7. Huang, X., Zhang, L. and Li, P., 2008. A multiscale feature fusion
approach for classification of very high resolution satellite imagery
based on wavelet transform. International Journal Remote Sensing,
29(20), pp.5923–5941.
8. Cheriyadat, A.M., 2014. Unsupervised feature learning for aerial
scene classification. IEEE Transactions on Geoscience and Remote
Sensing,52(1), pp.439–451.
9. Volpi, D.M., Mura, M.D., Rakotomamonjy, A. and Flamary, R.,
2014. Automatic feature learning for spatio-spectral image
classification with sparse svm. IEEE Transactions on Geoscience
and Remote Sensing,52(10), pp.6062–6074.
10. Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning
algorithm for deep belief nets. Neural Computation,18(7), pp.1527–
1554.
11. Arbelaez, P., Pont-Tuset, J., Barron, J.T., Marques, F. and Malik, J.,
2014. Multiscale combinatorial grouping. Computer Vision and Pat-
tern Recognition (CVPR), pp.328–335.
12. Carreira, J. and Sminchisescu, C., 2012. CPMC: Automatic object
segmentation using constrained parametric min-cuts. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,34(7), pp.1312–
1328.
13. Uijlings, J.R., van de Sande, K.E., Gevers, T. and Smeulders, A.W.,
2013. Selective search for object recognition. International Journal
of Computer Vision,104(2), pp.154–171.
14. Zitnick, C.L. and Dollar, P., 2014. Edge Boxes: Locating Object
Proposals from Edges. European Conference on Computer Vision,
pp.391–405.
15. Alexe, B., Deselaers, T. and Ferrari, V., 2012. Measuring the object-
ness of image windows. IEEE Transactions on Pattern Analysis and
Machine Intelligence,34(11), pp.2189–2202.
16. Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich Fea-
ture Hierarchies for Accurate Object Detection and Semantic Seg-
mentation. IEEE Conference on Computer Vision and Pattern Recog-
nition, pp.580–587.
17. Viola, P. and Jones, M., 2001. Rapid Object Detection Using a Boosted Cascade of Simple Features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.I-511–I-518.
18. Dalal, N. and Triggs, B., 2005. Histograms of Oriented Gradients
for Human Detection. Proceeding of IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition, pp.886–893.
19. Uijlings, J.R., van de Sande, K.E.T., Gevers and Smeulders, A.W.,
2013. Selective search for object recognition. International Journal
of Computer Vision,104, pp.154–171.
20. Felzenszwalb, P.F., Girshick, R.B., McAllester, D. and Ramanan, D.,
2010. Object detection with discriminatively trained part-based mod-
els. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence,32(9), pp.1627–1645.
21. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R. and LeCun, Y., 2013. OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks. International Conference on Learning Representations (ICLR 2014), p.16.
22. He, K., Zhang, X., Ren, S. and Sun, J., 2014. Spatial Pyramid
Pooling in Deep Convolutional Networks for Visual Recognition.
Proceeding in 13th European Conference on Computer Vision,
pp.346–361.
23. Delalieux, S. Somers, B. Haest, B., Spanhove, T., Borre, J.V. and
Mücher, C.A., 2012. Heathland conservation status mapping through
integration of hyperspectral mixture analysis and decision tree clas-
sifiers. Journal Article in Remote Sensing of Environment,126,
pp.222–231.
24. Ham, J., Chen, Y., Crawford, M.M. and Ghosh, J., 2005. Investiga-
tion of the random forest framework for classification of hyperspec-
tral data. IEEE Transactions on Geoscience and Remote Sensing,
43(3), pp.492–501.
25. Melgani, F. and Bruzzone, L., 2004. Classification of hyperspectral remote sensing images with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 42(8), pp.1778–1790.
26. Chen, Y., Lin, Z., Zhao, X., Wang, G. and Gu, Y., 2014. Deep
learning-based classification of hyperspectral data. IEEE Journal
Selected Topics Applied Earth Observations Remote Sensing,7(6),
pp.2094–2107.
27. Chen, Y., Zhao, X. and Jia, X., 2015. Spectral–spatial classification
of hyperspectral data based on deep belief network. IEEE Journal
Selected Topics Applied Earth Observations Remote Sensing,8(6),
pp.2381–2392.
28. Mou, L., Ghamisi, P. and Zhu, X.X., 2017. Deep recurrent neu-
ral networks for hyperspectral image classification. IEEE Trans-
actions on Geoscience and Remote Sensing,55(7), pp.3639–
3655.
29. Chen, Y., Jiang, H., Li, C., Jia, X. and Ghamisi, P., 2016. Deep
feature extraction and classification of hyperspectral images based
on convolutional neural networks. IEEE Transactions on Geoscience
and Remote Sensing,54(10), pp.6232–6251.
30. Ghamisi, P., Chen, Y. and Zhu, X.X., 2016. A self-improving convolution neural network for the classification of hyperspectral data. IEEE Geoscience and Remote Sensing Letters, 13(10), pp.1537–1541.
31. Zhao, W. and Du, S., 2016. Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach. IEEE Transactions on Geoscience and Remote Sensing, 54(8), pp.4544–4554.
32. Romero, A., Gatta, C. and Camps-Valls, G., 2016. Unsupervised
deep feature extraction for remote sensing image classification. IEEE
Transactions on Geoscience and Remote Sensing,54(3), pp.1349–
1362.
33. Neto, J.C., Meyer, G.E., Jones, D.D. and Samal, A.K., 2006. Plant species identification using elliptic Fourier leaf shape analysis. Computers and Electronics in Agriculture, 50(2), pp.121–134.
34. Chaki, J. and Parekh, R., 2011. Plant leaf recognition using shape
based features and neural network classifiers. International Journal
of Advanced Computer Science and Applications,2(10).
35. Du, J.X., Wang, X.F. and Zhang, G.J., 2007. Leaf shape based plant
species recognition. Applied Mathematics and Computation,185(2),
pp.883–893.
36. Mouine, S., Yahiaoui, I. and Verroust-Blondet, A., 2012. Advanced
Shape Context for Plant Species Identification Using Leaf Image
Retrieval. Proceedings of the 2nd ACM International Conference on
Multimedia Retrieval, p.49.
37. Aakif, A. and Khan, M.F., 2015. Automatic classification of
plants based on their leaves. Biosystems Engineering,139,
pp.66–75.
38. Hall, D., McCool, C., Dayoub, F., Sunderhauf, N. and Upcroft, B.,
2015. Evaluation of Features for Leaf Classification in Challenging
Conditions. IEEE Winter Conference on Applications of Computer
Vision, pp.797–804.
39. Kumar, N. Belhumeur, P.N., Biswas, A. Kress, W.J., Lopez,
I.C. and Soares, J.V., 2012. Leafsnap: A computer vision sys-
tem for automatic plant species identification. ECCV Springer,
pp.502–516.
40. Cope, J.S., Remagnino, P., Barman, S. and Wilkin, P., 2010. Plant Texture Classification Using Gabor Co-Occurrences. International Symposium on Visual Computing, Springer, pp.669–677.
41. Rashad, M., ElDesouky, B. and Khawasik, M.S., 2011. Plants images
classification based on textural features using combined classifier.
International Journal of Computer Science & Information Technol-
ogy,3(4), pp.93–100.
42. Tang, Z., Su, Y., Er, M. J., Qi, Zhang, F. L. and Zhou, J., 2015.
A local binary pattern based texture descriptors for classification of
tea leaves. Neurocomputing,168, pp.1011–1023.
43. Larese, M.G., Namías, R., Craviotto, R.M., Arango, M.R., Gallo,
C. and Granitto, P.M., 2014. Automatic classification of legumes
using leaf vein image features. Pattern Recognition,47(1),
pp.158–168.
44. Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J. and
Vapnik, V., 1997. Support vector regression machines. Advances in
Neural Information Processing Systems,9, pp.155–161.
4050 J. Comput. Theor. Nanosci. 16, 4044–4052, 2019
RESEARCH ARTICLE
Goel et al. Object Recognition Using Deep Learning
45. Hardoon, D.R., Szedmak, S. and Shawe-Taylor, J., 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), pp.2639–2664.
46. Cortes, C. and Vapnik, V., 1995. Support-vector networks. Machine Learning, 20(3), pp.273–297.
47. Chen, K., Gong, S., Xiang, T. and Loy, C., 2013. Cumulative Attribute Space for Age and Crowd Density Estimation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
48. Huerta, I., Fernández, C. and Prati, A., 2014. Facial Age Estimation Through the Fusion of Texture and Local Appearance Descriptors. IEEE European Conference on Computer Vision (ECCV).
49. Guo, G. and Mu, G., 2014. A framework for joint estimation of age, gender and ethnicity on a large database. Image and Vision Computing, 32(10), pp.761–770.
50. Yi, D., Lei, Z. and Li, S.Z., 2014. Age Estimation by Multi-Scale Convolutional Network. Asian Conference on Computer Vision (ACCV).
51. Wang, X., Guo, R. and Kambhamettu, C., 2015. Deeply-Learned Feature for Age Estimation. IEEE Winter Conference on Applications of Computer Vision (WACV).
52. Rothe, R., Timofte, R. and Van Gool, L., 2016. Some Like It Hot-Visual Guidance for Preference Prediction. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
53. Zhu, Y., Li, Y., Mu, G. and Guo, G., 2015. A Study on Apparent Age Estimation. IEEE International Conference on Computer Vision (ICCV) Workshops.
54. Yang, X., Gao, B.B., Xing, C., Huo, Z.W., Wei, X.S., Zhou, Y., Wu, J. and Geng, X., 2015. Deep Label Distribution Learning for Apparent Age Estimation. IEEE International Conference on Computer Vision (ICCV) Workshops.
55. Cui, Y., Zhou, G., Yang, J. and Yamaguchi, Y., 2011. On the iterative
censoring for target detection in SAR image. IEEE Transactions on
Geoscience and Remote Sensing,8(4), pp.641–645.
56. Park, J.I. and Kim, K.T., 2014. Modified polar mapping classifier for
SAR automatic target recognition. IEEE Transaction on Aerospace
and Electronic Systems,50(2), pp.1092–1107.
57. Novak, L.M., Owirka, G.J., Brower, W.S. and Weaver, A.L., 1997.
The automatic target-recognition system in SAIP. The Lincoln Lab-
oratory Journal,10(2), pp.187–201.
58. Ross, T.D., Bradley, J.J., Hudson, L.J. and O’Connor, M.P., 1999.
SAR ATR: So What’s the Problem? An MSTAR Perspective. Pro-
ceeding in 6th SPIE Conference Algorithms SAR Imagery, Vol. 3721,
pp.566–579.
59. Keydel, E.R., Lee, S.W. and Moore, J.T., 1996. MSTAR Extended
Operating Conditions: A Tutorial. Proceeding in 6th SPIE Confer-
ence Algorithms SAR Imagery, Vol. 2757, pp.228–242.
60. Hirose, A., ed., 2013. Complex-Valued Neural Networks: Advances and Applications. Hoboken, NJ, USA. Wiley-IEEE Press.
61. Zhao, Q. and Principe, J.C., 2001. Support vector machines for SAR automatic target recognition. IEEE Transaction on Aerospace and Electronic Systems, 37(2), pp.643–654.
62. Sun, Y.J., Liu, Z.P., Todorovic, S. and Li, J., 2007. Adaptive boost-
ing for SAR automatic target recognition. IEEE Transaction on
Aerospace and Electronic Systems,43(1), pp.112–125.
63. Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet
classification with deep convolutional neural networks. Proceedings
of Advances in Neural Information Processing Systems, pp.1097–
1105.
64. Bai, X., Wang, X., Latecki, L.J., Liu, W. and Tu, Z., 2010. Active Skeleton for Non-Rigid Object Detection. IEEE 12th International Conference on Computer Vision (ICCV).
65. Girshick, R., 2015. Fast R-CNN. IEEE International Conference on
Computer Vision, pp.1440–1448.
66. Ren, S., He, K., Girshick, R. and Sun, J., 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6).
67. Zeiler, M.D. and Fergus, R., 2014. Visualizing and Understanding
Convolutional Neural Networks. Proceeding 13th European Confer-
ence on Computer Vision, pp.818–833.
68. Simonyan, K. and Zisserman, A., 2015. Very Deep Convolutional
Networks for Large-Scale Image Recognition. Proceeding in Inter-
national Conference Learning Representations.
69. Zhao, W., Du, S. and Emery, W.I., 2017. Object based convolutional neural network for high resolution imagery classification. IEEE Journal of Selected Topics in Applied Earth Observation and Remote Sensing, 10(7).
70. Zhao, W. and Du, S., 2016. Spectral-spatial feature extraction for
hyperspectral image classification: A dimension reduction and deep
learning approach. IEEE Transactions on Geoscience and Remote
Sensing,54(8), pp.4544–4554.
71. Szegedy, C., Toshev, A. and Erhan, D., 2013. Deep neural networks for object detection. Advances in Neural Information Processing Systems 26 (NIPS 2013).
72. Pinheiro, P.O., Collobert, R. and Dollár, P., 2015. Learning to seg-
ment object candidates. Advances in Neural Information Processing
Systems 26.
73. Szegedy, C., Reed, S., Erhan, D., Anguelov, D. and Ioffe, S., 2014. Scalable, High-Quality Object Detection. ArXiv:1412.1441.
74. Yoo, D., Park, S., Lee, J.-Y., Paek, A.S. and Kweon, I.S., 2015.
Attentionnet: Aggregating Weak Directions for Accurate Object
Detection. IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR).
75. Rothe, R., Timofte, R. and Van Gool, L., 2018. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126, pp.144–157.
76. Mou, L., Ghamisi, P. and Zhu, X.X., 2018. Unsupervised spectral-spatial feature learning via deep residual conv-deconv network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(1).
77. Dosovitskiy, A., Springenberg, J.T. and Brox, T., 2015. Learning
to Generate Chairs, Tables and Cars with Convolutional Networks.
Proceeding in IEEE Conference on Computer Vision Pattern Recog-
nition (CVPR), pp.1538–1546.
78. Redmon, J. and Farhadi, A., 2016. Yolo9000: Better, Faster,
Stronger. ArXiv:1612.08242.
79. Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You Only Look Once: Unified, Real-Time Object Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
80. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y.
and Berg, A.C., 2016. Ssd: Single Shot Multibox Detector. IEEE
European Conference on Computer Vision (ECCV).
81. Erhan, D., Szegedy, C., Toshev, A. and Anguelov, D., 2014. Scalable
Object Detection Using Deep Neural Networks. IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
82. Bell, S., Zitnick, C.L., Bala, K. and Girshick, R., 2016. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
83. Charters, J., Wang, Z., Chi, Z., Tsoi, A.C. and Feng, D.D., 2014.
EAGLE: A Novel Descriptor for Identifying Plant Species Using
Leaf Lamina Vascular Features. ICME-Workshop, pp.1–6.
84. Dudgeon, D.E. and Lacoss, R.T., 1993. An overview of automatic
target recognition. The Lincoln Laboratory Journal,6(1), pp.3–10.
85. Wang, W., Yang, J., Xiao, J., Li, S. and Zhou, D., 2015. Face Recognition Based on Deep Learning. Springer International Publishing Switzerland, pp.812–820.
86. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J. and
Zisserman, A., 2007. The PASCAL visual object classes chal-
lenge results. International Journal of Computer Vision,88(2),
pp.303–338.
87. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollar, P. and Zitnick, C.L., 2014. Microsoft COCO:
Common Objects in Context. Proceeding in European Conference
on Computer Vision, pp.740–755.
88. Song, S., Lichtenberg, S.P. and Xiao, J., 2015. SUN RGB-D:
A RGB-D Scene Understanding Benchmark Suite. Proceeding
in IEEE Conference on Computer Vision Pattern Recognition,
pp.567–576.
89. Zitnick, C.L. and Dollar, P., 2014. Edge Boxes: Locating Object
Proposals from Edges. Proceeding in 13th European Conference on
Computer Vision, pp.391–405.
90. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.-X., Chang, Y.-F. and
Xiang, Q.-L., 2007. A Leaf Recognition Algorithm for Plant Clas-
sification Using Probabilistic Neural Network. IEEE International
Symposium on Signal Processing and Information Technology,
pp.11–16.
Received: 20 April 2019. Accepted: 10 May 2019.
... Erhan et al. developed a multibox approach based on regression to provide region proposals while Yoo et al. recommended using AttentionNet, a CNN architecture, for object recognition using a classification strategy. [8] In the realm of age estimation, the Deep Expectation (DEX) technique was introduced. It does not rely on facial landmarks and instead uses the IMDB-WIKI database of face pictures with age and gender characteristics. ...
... The final item detection is achieved using non-maximum suppression (NMS) on multiscale enhanced bounding boxes. [8] The aforementioned techniques offer fresh approaches to age estimation, object detection, and hyperspectral picture analysis. While some methods make use of CNNs to address specific issues, others use state-of-the-art network topologies to improve feature learning or address issues with down sampling. ...
... Following that, the RPN was used to predict item placements in other object detection networks, leading to a quicker R-CNN that ran for shorter periods of time. [8] The semantic gap of complex patterns in the context of disaster response and urban planning utilizing high-resolution pictures can be effectively closed using deep learning. However, object boundaries are hard to capture using deep learning techniques. ...
Conference Paper
Full-text available
In order to detect objects in actual scenarios, this research study examines the practical use of deep learning techniques, with a focus on convolutional neural networks (CNNs). The importance of object categorization as the fundamental building block for object detection is emphasized, and several learning strategies are investigated in order to extract features from challenging multidimensional data. In order to get accurate and effective object detection findings, the study studies various deep learning architectures, including R-CNN, Fast R-CNN, Faster R-CNN, and YOLO. In addition, this paper demonstrates the diverse uses of deep learning-based object detection, including its influence on automated vehicles, medical technology, and industrial automation, where it is changing several sectors of the economy. The analysis also foresees the future developments of deep learning models, predicting their paradigm-shifting effects across a variety of areas and their capacity to successfully address practical difficulties.
... Currently, the emerging technology referred to as Artificial lntelligence (Al) has gained significant momentum. More specifically, deep learning techniques have proved to be efficient at fast object recognition (Goel et al., 2019). At the same time they are able to reduce human error, automate and improve classification processes and provide greater precision, compared with the methods or techniques mentioned above, as long as the system is provided with a large amount of data (training set), or at least big enough to apply the data augmentation technique and thus avoid overfitting 5 in training. ...
... The fragments and medulla of the alpaca, llama and mohair fibers have peculiar and distinctive characteristics (McGregor and Quispe, 2018). Therefore, the application of Al allows identifying them automatically based in thousands of equations (Goel et al, 2019). Thus, it was convenient to develop a model for each type of fiber. ...
Article
Medullated fibres, due to their higher resistance to bending and pressure, constitute a problem for the textile industry. Thus, having practical instruments to identify them is essential. Therefore, the aim of this research was to develop and validate a novel, swift, automatic system (referred to as S-FiberMed) for medullation and diameter assessment of animal fibres based on artificial intelligence. The medullation of 88 samples of alpaca, llama and mohair fibres (41, 43 and 4, respectively) was evaluated. Additionally, 269 samples of alpacas were considered for average fibre diameter (AFD) and the results were compared with the Portable Fiber Tester (PFT) and Optical Fibre Diameter Analyser (OFDA) methods (72 and 197 samples, respectively). The preparation of each sample to be analysed followed the procedure described in IWTO-8-2011. Version 5 of "You Only Look Once" and DenseNet models were used to recognise the type of medullation and diameter of the fibres, respectively. Within each image (n = 661 for alpaca), all fibres were labelled (as Non-Medullated, Fragmented Medulla, Uncontinuous Medulla, Continuous Medulla and Strongly Medullated) using the LabelImg tool. Data augmentation technique was applied to obtain 3 966 images. Such data set was divided into 3 576 and 390 images for training and test data, respectively. For mohair samples (n = 321), a similar process was carried out. The data to train the model used to infer the diameter contained 16 446 fibres labelled with his respective AFD. A complementary hardware composed of three subsystems (mechanical, electronic, and optical) was developed for evaluation purposes. T-test, Pearson and Concordance correlation, Bland-Altman plot and linear regression analyses were used to validate and compare the S-Fiber Med with other methods. Results indicate that there was no significant difference between medullation percentage obtained with the projection microscope and the S-Fiber Med. The Pearson and Concordance correlation analysis shows a strong, high and significant relationship (P-value < 0.001). The AFDs of alpaca and llama fibre samples obtained with the two methods are very similar, because no significant difference was found at the t-test (P-value > 0.172), and they have a strong, high and significant relationship between them, given the high Pearson correlation value (r ≥ 0.96 with P-value < 0.001), high Concordance coefficient and bias correction factor. Similar results were found when PFT and OFDA100 were compared with S-Fiber Med. As a conclusion, this new system provides precise, accurate measurements of medullation and AFD in an expeditious fashion (40 seconds/sample).
... The process entails identifying actions in an image and accurately classifying it. Computer vision has gained attention for its numerous applications, including self-driving cars [1][2][3][4][5][6], healthcare diagnosis [7][8][9][10][11][12], tracking traffic [13][14][15][16][17], detection of human activities [18][19][20], pose estimation [21][22][23][24][25][26][27][28][29], recognition of facial expressions [30][31][32][33][34][35][36], scene comprehension [37][38][39][40][41], and recording of events [42][43][44][45][46][47]. We have used vision-based sensors to recognize human actions. ...
Article
Smart home action recognition has recently become a major focus of research owing to the need to address the detection and recognition issues that arise in different sectors. A crucial component in utilizing the technology is human identification and localization, which finds uses in autonomous vehicles, surveillance, farming, and medicine. Here, we propose a new framework for smart home action recognition which builds upon several state-of-the-art techniques for better performance. The proposed approach starts by segmenting the input video into frames and then applying a Gaussian filter to denoise the segmented data. Background subtraction is then conducted using the GrabCut algorithm in order to segment the object of interest. Skeleton modeling and keypoint extraction are used to augment the object representation. The feature extraction phase includes angular features, 2.5-dimensional point clouds, full-body curves, kinetic energy, and full-body ridge patterns. Following this, a feature fusion step combines the extracted features into a fused feature set. Adam (Adaptive Moment Estimation) is then used to fine-tune the features, improving their discriminative ability. Finally, the data is partitioned and classified using a Multi-Layer Perceptron (MLP) classifier. Tested on the dataset, the proposed method yields an accuracy of 88%, indicating that it is useful for accurate object categorization. The feature extraction process and the MLP classifier together helped the model perform optimally.
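The final stage described above, feature fusion followed by an MLP, is straightforward to reproduce; a minimal scikit-learn sketch under assumed feature shapes follows (the random arrays stand in for the paper's angular, curve, and energy features).

```python
# Sketch of the fused-feature -> MLP classification stage; the random
# arrays below are placeholders for the extracted feature blocks.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

angular = np.random.rand(500, 32)   # stand-in for angular features
curves  = np.random.rand(500, 64)   # stand-in for full-body curves
energy  = np.random.rand(500, 16)   # stand-in for kinetic energy features
labels  = np.random.randint(0, 10, size=500)  # 10 hypothetical action classes

fused = np.hstack([angular, curves, energy])  # simple feature-level fusion
X_tr, X_te, y_tr, y_te = train_test_split(fused, labels, test_size=0.2, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(128, 64), solver="adam", max_iter=300)
clf.fit(X_tr, y_tr)                 # Adam also drives the MLP's optimization here
print("test accuracy:", clf.score(X_te, y_te))
```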
... For detection [1] and classification [2], the leading public research datasets are the German Traffic Sign Recognition Benchmark (GTSRB) [3] and the German Traffic Sign Detection Benchmark (GTSDB) [4]. ...
Article
Timely and accurate classification and interpretation of high-resolution images are very important for urban planning and disaster rescue. However, as spatial resolution gets finer, it is increasingly difficult to recognize complex patterns in high-resolution remote sensing images. Deep learning offers an efficient strategy to fill the gap between complex image patterns and their semantic labels. However, due to the hierarchical abstract nature of deep learning methods, it is difficult to capture the precise outline of different objects at the pixel level. To further reduce this problem, we propose an object-based deep learning method to accurately classify high-resolution imagery without intensive human involvement. In this study, high-resolution images were used to accurately classify three different urban scenes: Beijing (China), Pavia (Italy), and Vaihingen (Germany). The proposed method is built on a combination of a deep feature learning strategy and an object-based classification for the interpretation of high-resolution images. Specifically, high-level feature representations extracted through the convolutional neural network framework have been systematically investigated over five different layer configurations. Furthermore, to improve the classification accuracy, an object-based classification method has also been integrated with the deep learning strategy for more efficient image classification. Experimental results indicate that with the combination of deep learning and object-based classification, it is possible to discriminate different building types in the Beijing scene, such as commercial buildings and residential buildings, with classification accuracies above 90%.
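One way to read the object-based strategy is: segment the scene into object-like regions, pool deep features within each region, and classify regions rather than pixels. The sketch below illustrates that reading with superpixels standing in for objects; the backbone and sample image are arbitrary choices, not the paper's setup.

```python
# Sketch: per-object pooling of CNN features (superpixels approximate objects).
import torch
from torchvision import models, transforms
from skimage.data import astronaut
from skimage.segmentation import slic

image = astronaut()                         # placeholder RGB scene (512x512)
segments = slic(image, n_segments=50)       # object-like regions

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()  # drop pool+fc

x = transforms.ToTensor()(image).unsqueeze(0)
with torch.no_grad():
    fmap = backbone(x)[0]                   # (C, h, w) high-level feature map

# Resize the segment map to the feature grid, then average features per object
seg = torch.nn.functional.interpolate(
    torch.tensor(segments)[None, None].float(), size=fmap.shape[1:], mode="nearest"
)[0, 0].long()
object_feats = [fmap[:, seg == s].mean(dim=1) for s in seg.unique()]
# object_feats would feed any per-object classifier (SVM, MLP, ...)
```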
Article
Supervised approaches classify input data using a set of representative samples for each class, known as training samples. The collection of such samples is expensive and time demanding. Hence, unsupervised feature learning, which has quick access to arbitrary amounts of unlabeled data, is conceptually of high interest. In this paper, we propose a novel network architecture, a fully Conv-Deconv network, for unsupervised spectral-spatial feature learning of hyperspectral images, which can be trained in an end-to-end manner. Specifically, our network is based on the so-called encoder-decoder paradigm, i.e., the input 3-D hyperspectral patch is first transformed into a typically lower dimensional space via a convolutional subnetwork (encoder), and then expanded to reproduce the initial data by a deconvolutional subnetwork (decoder). However, during the experiments, we found that such a network is not easy to optimize. To address this problem, we refine the proposed network architecture by incorporating: 1) residual learning and 2) a new unpooling operation that can use memorized max-pooling indices. Moreover, to understand the "black box," we make an in-depth study of the learned feature maps in the experimental analysis. A very interesting discovery is that some specific "neurons" in the first residual block of the proposed network have good descriptive power for semantic visual patterns at the object level, which provides an opportunity to achieve "free" object detection. This paper, for the first time in the remote sensing community, proposes an end-to-end fully Conv-Deconv network for unsupervised spectral-spatial feature learning, and also introduces an in-depth investigation of the learned features. Experimental results on two widely used hyperspectral data sets, Indian Pines and Pavia University, demonstrate competitive performance of the proposed methodology compared with other studied approaches.
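The core mechanics, an encoder that memorizes max-pooling indices and a decoder that reuses them when unpooling, map directly onto standard PyTorch layers. A minimal sketch follows; the layer sizes are illustrative and the residual blocks of the actual network are omitted for brevity.

```python
# Sketch of the unsupervised Conv-Deconv idea: encode, pool (memorizing
# indices), unpool with those indices, decode, and train on reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDeconv(nn.Module):
    def __init__(self, bands=103):          # e.g., Pavia University has 103 bands
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(bands, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2, return_indices=True)  # memorize pooling indices
        self.unpool = nn.MaxUnpool2d(2)                   # reuse them when upsampling
        self.dec = nn.Conv2d(64, bands, 3, padding=1)

    def forward(self, x):
        h = self.enc(x)
        h, idx = self.pool(h)
        h = self.unpool(h, idx)
        return self.dec(h)

patch = torch.randn(1, 103, 8, 8)           # toy 3-D hyperspectral patch
model = ConvDeconv()
loss = F.mse_loss(model(patch), patch)      # end-to-end reconstruction objective
loss.backward()
```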
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
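The recipe is easy to restate in code: warp each region proposal to a fixed size, push it through a pretrained CNN, and score the resulting features with per-class classifiers. The sketch below shows only the feature-extraction step, with hypothetical proposal boxes and image file; proposal generation and SVM training are assumed done elsewhere.

```python
# Sketch of R-CNN-style per-region feature extraction with a pretrained CNN.
import torch
from PIL import Image
from torchvision import models, transforms

cnn = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
features_of = lambda t: cnn.avgpool(cnn.features(t)).flatten(1)  # 9216-D per region

prep = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
image = Image.open("scene.jpg")                        # placeholder image
proposals = [(10, 10, 120, 160), (40, 30, 200, 220)]   # hypothetical (l, t, r, b) boxes

with torch.no_grad():
    crops = torch.stack([prep(image.crop(box)) for box in proposals])
    region_features = features_of(crops)               # one vector per proposal
# region_features would then go to per-class SVMs and bounding-box regressors.
```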
Conference Paper
The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box's boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state of the art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at an overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.
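EdgeBoxes itself ships with OpenCV's contrib modules, so the scoring described above can be exercised directly. The sketch assumes opencv-contrib-python is installed and that model.yml.gz is the pretrained structured-edge model distributed with OpenCV; note that in some OpenCV versions getBoundingBoxes returns boxes only, without scores.

```python
# Sketch: ranked box proposals from edge maps via OpenCV's EdgeBoxes.
import cv2
import numpy as np

img = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
img = img.astype(np.float32) / 255.0

sed = cv2.ximgproc.createStructuredEdgeDetection("model.yml.gz")
edges = sed.detectEdges(img)               # sparse yet informative edge map
orient = sed.computeOrientation(edges)
edges = sed.edgesNms(edges, orient)        # thin the edges before scoring

eb = cv2.ximgproc.createEdgeBoxes()
eb.setMaxBoxes(1000)                       # the 1000-proposal operating point above
boxes, scores = eb.getBoundingBoxes(edges, orient)
print(len(boxes), "ranked proposals")
```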
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
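The architecture described here survives in torchvision as models.alexnet, which makes the five-convolutional-plus-three-fully-connected layout (and the dropout placement) easy to inspect; a brief sketch:

```python
# Sketch: inspecting the AlexNet layout and running a dummy forward pass.
import torch
from torchvision import models

alexnet = models.alexnet(weights=None)  # 5 conv layers + 3 fully connected layers
print(alexnet.features)                 # conv/max-pool stack
print(alexnet.classifier)               # dropout precedes the first two FC layers

logits = alexnet(torch.randn(1, 3, 224, 224))
print(logits.shape)                     # (1, 1000): input to the 1000-way softmax
```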
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
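A present-day torchvision model exposes the same RPN-plus-detector design, although with a ResNet-50 FPN backbone and COCO weights rather than the paper's VGG-16/PASCAL setup; a minimal inference sketch:

```python
# Sketch: Faster R-CNN inference; the RPN and detector share conv features.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)                # placeholder image tensor in [0, 1]
with torch.no_grad():
    out = model([image])[0]                    # RPN proposals -> final detections
print(out["boxes"].shape, out["scores"][:5])   # ranked boxes with confidence scores
```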
Article
In recent years, vector-based machine learning algorithms, such as random forests, support vector machines, and 1-D convolutional neural networks, have shown promising results in hyperspectral image classification. Such methodologies, nevertheless, can lead to information loss in representing hyperspectral pixels, which intrinsically have a sequence-based data structure. A recurrent neural network (RNN), an important branch of the deep learning family, is mainly designed to handle sequential data. Can a sequence-based RNN be an effective method for hyperspectral image classification? In this paper, we propose a novel RNN model that can effectively analyze hyperspectral pixels as sequential data and then determine information categories via network reasoning. As far as we know, this is the first time that an RNN framework has been proposed for hyperspectral image classification. Specifically, our RNN makes use of a newly proposed activation function, parametric rectified tanh (PRetanh), for hyperspectral sequential data analysis instead of the popular tanh or rectified linear unit. The proposed activation function makes it possible to use fairly high learning rates without the risk of divergence during the training procedure. Moreover, a modified gated recurrent unit, which uses PRetanh for hidden representation, is adopted to construct the recurrent layer in our network to efficiently process hyperspectral data and reduce the total number of parameters. Experimental results on three airborne hyperspectral images suggest that the proposed model achieves competitive performance. In addition, the proposed network architecture opens a new window for future research, showcasing the huge potential of deep recurrent networks for hyperspectral data analysis.
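Treating each pixel as a band-by-band sequence is simple to prototype. The sketch below feeds one band per time step to a GRU and applies a PRetanh-style activation; the activation is our reading (a PReLU-like learnable scale on the negative side of tanh), not the authors' exact formulation, and all sizes are illustrative.

```python
# Sketch: hyperspectral pixel as a sequence for an RNN, with a PRetanh-style
# activation (assumed form: tanh with a learnable negative-side scale).
import torch
import torch.nn as nn

class PRetanh(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(0.25))  # learnable negative-side scale

    def forward(self, x):
        t = torch.tanh(x)
        return torch.where(t >= 0, t, self.a * t)

class PixelRNN(nn.Module):
    def __init__(self, bands=103, hidden=64, classes=9):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.act = PRetanh()
        self.fc = nn.Linear(hidden, classes)

    def forward(self, pixels):                     # pixels: (batch, bands)
        _, h = self.gru(pixels.unsqueeze(-1))      # one band per time step
        return self.fc(self.act(h[-1]))

logits = PixelRNN()(torch.randn(4, 103))           # 4 pixels, 103 bands each
print(logits.shape)                                # (4, 9) class scores
```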
Article
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
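The same machine, a soft-margin SVM over a polynomial input transformation, is a one-liner today; a short sketch on a small OCR-style benchmark, in the spirit of the comparison described above:

```python
# Sketch: soft-margin SVM with a polynomial kernel on a digits benchmark.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)          # small built-in OCR-style data set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="poly", degree=3, C=1.0)    # polynomial mapping + soft margin (C)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```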