Appl Intell
DOI 10.1007/s10489-017-0916-1
Fusion of local and global features for effective image
extraction
Khawaja Tehseen Ahmed¹ (khawajatehseenahmed@gmail.com) · Aun Irtaza² (aun.irtaza@gmail.com) · Muhammad Amjad Iqbal¹ (amjad.iqbal@ucp.edu.pk)
¹ Faculty of IT, University of Central Punjab, Lahore, Pakistan
² Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan
© Springer Science+Business Media New York 2017
Abstract Image extraction methods rely on locating inter-
est points and describing feature vectors for these key
points. These interest points provide different levels of
invariance to the descriptors. The image signature can
be described well by the pixel regions that surround the
interest points at the local and global levels. This con-
tribution presents a feature descriptor that combines the
benefits of local interest point detection with the fea-
ture extraction strengths of a fine-tuned sliding window
in combination with texture pattern analysis. This process
is accomplished with an improved Moravec method using
the covariance matrix of the local directional derivatives.
These directional derivatives are compared with a scor-
ing factor to identify which features are corners, edges
or noise. Located interest point candidates are fetched for
the sliding window algorithm to extract robust features.
These locally-pointed global features are combined with
monotonic invariant uniform local binary patterns that are
extracted a priori as part of the proposed method. Extensive
experiments and comparisons are conducted on the
benchmark ImageNet, Caltech-101, Caltech-256 and Corel-
100 datasets and compared with sophisticated methods and
state-of-the-art descriptors. The proposed method outper-
forms the other methods with most of the descriptors and
many image categories.
Keywords Image extraction · Interest point detection · Image descriptor · Principal component coefficients · Sliding window · Support vector machine
1 Introduction
Several studies in computer vision address object recognition, texture classification, scene understanding, symmetry detection and related domains, all of which rely on detecting interest points, edges and corners for feature description. Images are described by their features
to extract useful hidden patterns to produce symbolized
signatures at different levels of abstraction and understand-
ing. Different levels of image processing metrics involve
different methods of image description and image synthe-
sis. Image description techniques include global, regional
and local metrics, and image synthesis uses texture analysis
methods.
Texture analysis methods are categorized as statistical,
structural, and spectral. Statistical methods based on gray
level statistical moments describe point pixel area proper-
ties, and histograms and scatter plots are used to represent
the values. Structural techniques use structural primitives,
such as parallel lines and regular patterns. Spectral meth-
ods work in the frequency domain to represent data. Local
and global descriptors [1–3] are primitive image descriptors
that work in the statistical, structural and spectral domains.
Local descriptors describe patches and portions within an
image, and global descriptors describe an entire image.
Color histograms [4], shape features [5] and textures [6] are
used for local feature extraction. However, local features
are unable to produce accurate results in different image
categories. Global descriptors [1] describe objects for recog-
nition and classification. Local and global features can be
employed together to represent images in a very powerful
way. There are several applications of this hybrid scheme
for feature extraction, such as the whole-object approach,
which uses local interest point detectors, digital correlation,
and scale space super-pixels [7]. Another approach is the
partial-object method, which is derived from gray level cor-
ner detection methods [8], image moments [9], and scale
space theory [10]. Interest point descriptors [2, 3] are an
extension of these approaches that quantify the light inten-
sity, local area gradients, local area statistical features, and
the histogram of the local gradient directions. Applications
of these extended descriptors have shown better perfor-
mance in object detection, face recognition, medical image
retrieval, and specialized tasks. However, these descrip-
tors typically involve intense computations and require
significant memory resources. For image retrieval, these
descriptors capture low level image attributes, such as color,
texture or spatial information, for optimal performance that
is particularly domain specific. Consequently, they result
in low performance when the same image descriptor is
tested on image categories with complex, overlapping and
background objects.
Detectors use maxima and minima points, such as gra-
dient peaks and corners; however, edges, ridges, and con-
tours are also considered as key points for better image
understanding. For these points, [11] presented an interest point taxonomy that includes intensity-based region
methods (IBRs), edge-based region methods (EBRs) and
shape-based regions (SBRs). Features were extracted by
pixel intensity based on the saturation value of the pixel in
[12]. Image retrieval is performed by applying this feature
model to image segmentation and histogram generation.
Image detectors extract features with a diverse invariance
to occlusion, rotation, illumination and scale. These fea-
ture detectors are employed in image classification, object
recognition, and semantic interpretation based on their spe-
cialty. The quality and effectiveness of interest point detec-
tion methods were evaluated against standard databases and
state-of-the-art methods in [13].
This contribution uses local features along with global
feature description by combining texture values to extract
images from multiple categories. Useful image patterns
are detected by finding edges and corners based on local
interest points. These key points are identified using pixel
intensities. Pixel intensity-based detectors are more pow-
erful interest point detectors than other methods [13]. A
fine-tuned sliding window algorithm is applied to the inter-
est points to extract the image signatures. The texture
analysis results are combined with the signatures to com-
prehensively reflect the image patterns. A novel dimension
reduction technique is used to calculate the limited covari-
ant coefficients. The proposed method provides remark-
able results on benchmarks, existing methods, renowned
databases and state-of-the-art descriptors.
The remainder of this paper is organized as follows.
Section 2 presents related work on feature extraction, and Section 3 explains the proposed methodology. The experimental results are provided in Section 4, and we summarize
our findings in Section 5.
2 Related work
A significant amount of research has been performed on
Content Based Image Retrieval (CBIR) by analyzing inter-
est points that are composed of corners, edges, contours,
maxima shapes, ridges or global features, visual contents
and semantic interpretation. These detectors [1], descrip-
tors [2] or extractors [3] can be characterized as invari-
ant or covariant, local or global. Local features are spe-
cific and context oriented. Current CBIR systems require image retrieval from ver-
satile image categories, images with complex overlapping
objects, cluttered images, and foreground and background
objects. Solutions are normally tested on a specific dataset
or selected categories, and the results are uncertain for other
benchmarks. A combination of global and local features that
uses the Haar discrete wavelet transform (HDWT) and gray
level co-occurrence matrix (GLCM) was presented [14], and
the results were computed for the Corel-100 dataset [15].
In another image retrieval scheme [16], LBPs are collected
and combined from each channel to describe color images.
Decoded LBPs are introduced to reduce the highly dimen-
sional patterns that are returned from multiple channels.
Experiments were performed on Corel-1k and other bench-
marks. The reported precision for the Corel-1k benchmark
was 0.749. A computationally practical approach for cap-
turing image properties from two multichannel images was
contributed by [17]. Tradeoffs were executed at the feature
and channel levels to avoid redundant image information,
and the mean average precision for the Corel-1k benchmark
was 0.709 [16]. For image retrieval, discrete cubic parti-
tioning of the image was performed in the HSV space [18].
The data were then hierarchically mapped using the hierar-
chical operator, and a similarity-based ranking scheme was
used for the resultant features. A Mean Average Precision
(mAP) of 0.797 was reported for the Corel-100 dataset. A
three stage method was proposed to identify similar images
by first finding the images by their color features [19]. To
improve the results, the images are matched by their texture
and shape features. This method accumulates global and
regional features for better accuracy. The reported preci-
sion for the Corel-1k benchmark is 0.766. In [20], images
were abstracted based on their statistical features. The Non-
subsampled Contourlet Transform (NSCT) was used to
compute the features of this Multi-scale Geometric Anal-
ysis (MGA). A graph-theoretic approach-based relevance
feedback system was also incorporated for retrieval perfor-
mance. A mAP of 0.553 was reported for this technique
for the Corel-1000 benchmark. A method was presented
to characterize an image as a generalized histogram quan-
tized by Gaussian Mixture Models (GMMs) [21]. This
method learns from training images using the Expectation-
Maximization (EM) algorithm, and the number of quantized
color bins is determined by the Bayesian Information Crite-
rion (BIC). The method gave a mAP of 0.801 for the Corel
image dataset. Color, texture and shape information was
incorporated using the Color Difference Histogram (CDH)
and Angular Radial Transform (ART) features [22]. The
mAP using min-max normalization on the Corel-1k bench-
mark was 0.783. Histograms of triangles were used to add
spatial information to the inverted index of a bag-of-features
by [23]. An image was divided into two and four triangles
that were evaluated separately. Experiments were performed
on the Corel-1000 dataset with an average precision of 0.82.
The color co-occurrence matrix (CCM) and the difference
between pixels of scan pattern (DBPSP) were used to extract
color and texture features [24]. To eliminate redundant fea-
tures, selective features were chosen by finding their high
dependency on the target class. This approach reported a
mean average precision of 0.757 for the Corel-1000 dataset.
A content-based image retrieval approach was presented in
[25] for biometric security based on color histogram, tex-
ture and moment invariants. Color histograms were used
for color features, a Gabor filter was used for the texture
features, and the moment invariants were used for shape
information. This approach reported improved results for
biometric security. A method for CBIR using Local Binary
Pattern (LBP), Hu-moments and radial Chebyshev moments
by focusing shapes and textures was presented in [26]. Ten
categories from the COIL dataset [27] were used for exper-
iments, and the method reported a 3 % higher accuracy than
previous results. A method to retrieve images using color
features by dividing images into non-overlapping blocks
and to determine the dominant color of each block using the
k-means algorithm was presented in [28]. A gray-level co-
occurrence matrix was used for texture feature extraction,
and Fourier descriptors were extracted from the segmented
images for the shape representation. The final feature vec-
tor was composed of these extracted features. The results of
experiments performed on the Corel-1000 dataset [15] were compared with the results of histogram-based methods. An 8 % improvement in precision was achieved with a 4.5
second retrieval time. A descriptor that adds spatial dis-
tribution information of the gray-level variation between
pixels in LBP for image retrieval was presented in [29].
This spatial texture descriptor constructs statistic histograms
of pattern pairs between the reference pixel and its neigh-
boring pixels. Spatial information combined with texture
features produced relatively effective results. A descrip-
tor based on shape and texture features was presented in
[30] by employing the Discrete Wavelet Transform (DWT)
and Edge Histogram Descriptor (EHD) features of MPEG-
7. The wavelet coefficients were calculated for the input
image, and the Edge Histogram Descriptor was then used
on the coefficients to determine the dominant edge orienta-
tions. This combination of DWT and EHD was tested on the
Corel-1000 dataset.
HOG [1], SIFT [2] and SURF [3] are interest point detec-
tors and image descriptors that are used in combination
with local and global descriptors for content-based image
retrieval. The time and the computational costs are barri-
ers to using these famous descriptors in CBIR systems with
complex and cluttered images of different sizes. However,
dimension reduction techniques are employed to overcome
the computation time constraint. A multilayer feed forward
neural network-based CBIR system that incorporates the
strength of SIFT for object detection was introduced by
[31]. SIFT object detection was used for CBIR by reduc-
ing the large number of key points generated by SIFT to
improve the retrieval performance [32]. Salient image parts
were extracted by a saliency-based region detection sys-
tem, and the final results were tested on VOC2005. A CBIR
system that integrates the YCbCr color histogram, edge histogram and shape descriptor as a global descriptor with SURF salient points as a local descriptor to enhance
the retrieval results was proposed by [33]. Experiments
were performed on the Corel-1000 and the Uncompressed
Color Image Database (UCID) databases. Velmurugan et al.
[34] combined SURF with color moments by calculating
the first and second order color moments for SURF key
points, and experiments were performed on the COIL-100
dataset. The Histograms of Oriented Gradients feature has
been used in pedestrian detection [35], face recognition [36]
and object detection [37]. For CBIR, Shujing et al. [38]
used HOG by transforming the sizes of images and calcu-
lated a feature vector of 3780 dimensions. Orthogonal lower
dimension data were achieved by applying PCA, and exper-
iments were performed on the Corel-1000 image dataset. A
CBIR called Local Tetra Patterns (LTrPs) was proposed by
Murala et al. [39] by calculating the first order derivatives in
the vertical and horizontal directions on reference pixels and
their neighbors. The performance on benchmark databases
was compared by combining this method with the Gabor
transform.
The technique presented in this paper focuses on: 1) find-
ing suitable key points to produce useful feature sets to
effectively classify images from multiple categories with
remarkable precision; 2) identifying foreground and back-
ground objects in complex images for better accuracy; and
3) introducing a new mechanism to search images by their
local and global features with low computational cost and
by storing and comparing compact image signatures for effi-
cient retrieval. In the proposed method, corners are detected
by pixel intensities to avoid unwanted key points. A fine-
tuned sliding window technique returns identifiable features
for robust classification, and a useful texture patterns analy-
sis supports the proposed method to provide more accurate
results.
3 Methodology
3.1 Intensity-based local interest point detection
The first corner detector was introduced by Moravec [40]. It
returns points with local maxima of the directional variance
measure and determines the average change in intensity by
moving a local preset detection window in different direc-
tions. This idea was also employed in [41] to investigate
the local statistics of the variations in the directional image
intensity using first order derivatives. This method results
in better subpixel precision and provides better localiza-
tion and corner detection. Our method uses the approach of
Moravec [40] by expanding the average intensity variance
and computing the Sobel derivatives and Gaussian window.
First, intensity-based local interest points are detected.
Local features provide identifiable and localized interest
points. An anchor point can be a point on a curve, the end
of a line and a corner. It can also be an identified point of
local intensity that has the maximum curvature of the points
on the curve. An auto-correlation matrix best describes the
local image features and structure. The following matrix
describes the gradient distribution in the local neighborhood
of an interest point:
$$M = \sigma_D^2\, g(\sigma_I) * \begin{bmatrix} I_x^2(X, \sigma_D) & I_x I_y(X, \sigma_D) \\ I_x I_y(X, \sigma_D) & I_y^2(X, \sigma_D) \end{bmatrix} \qquad (1)$$

with

$$I_x(X, \sigma_D) = \frac{\partial}{\partial x}\, g(\sigma_D) * I(x) \qquad (2)$$

$$g(\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (3)$$
Local image derivatives are computed with Gaussian kernels of scale $\sigma_D$ [7]. In the neighborhood of a point, the derivatives are averaged using a Gaussian window of scale $\sigma_I$. The eigenvalues of $M$ determine the principal signal changes in the two orthogonal directions in the neighborhood of the point. Therefore, corners are found where the image signal varies strongly in both directions, i.e., where both eigenvalues are large. Harris [7] proposed a less computationally expensive corner measure based on these two eigenvalues.
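To make this step concrete, the following Python sketch scores each pixel with a Harris-style measure computed from the covariance matrix of Sobel directional derivatives averaged under a Gaussian window, and keeps strong local maxima as interest point candidates. The function name, the sigma values, the constant k and the relative threshold are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
from scipy import ndimage

def detect_interest_points(gray, sigma_d=1.0, sigma_i=1.5, k=0.04, rel_thresh=0.01):
    """Score pixels with the covariance matrix of local directional derivatives
    and keep strong local maxima as interest point candidates (sketch only)."""
    gray = gray.astype(np.float64)
    smoothed = ndimage.gaussian_filter(gray, sigma_d)
    ix = ndimage.sobel(smoothed, axis=1)   # horizontal derivative (cf. Eq. 2)
    iy = ndimage.sobel(smoothed, axis=0)   # vertical derivative
    # Average the derivative products over a Gaussian window (cf. Eq. 1)
    ixx = ndimage.gaussian_filter(ix * ix, sigma_i)
    iyy = ndimage.gaussian_filter(iy * iy, sigma_i)
    ixy = ndimage.gaussian_filter(ix * iy, sigma_i)
    # Harris-style scoring factor: high for corners, negative for edges,
    # close to zero for flat or noisy regions
    score = (ixx * iyy - ixy ** 2) - k * (ixx + iyy) ** 2
    is_peak = score == ndimage.maximum_filter(score, size=3)
    keep = is_peak & (score > rel_thresh * score.max())
    return np.argwhere(keep)   # (row, col) coordinates of corner candidates
```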
3.2 Global feature detection using an optimized sliding
window
For image classification, the entire image is of interest for
global features. Global feature computations for an entire
image have large time and computational costs. Local fea-
tures describe image patches around interest points, while
global features describe an image as a single vector. With an
increasing number of local features, large numbers of fea-
ture vectors are generated, which are difficult to match and
store. To overcome these problems, we used local and global
features in an intuitive way to compute the global features
only for the detected local features of interest.
A sliding window slides a fixed size frame across an
image. The object’s size, location, positioning and scaling
are directly impacted by the block size, cell size within the
block, orientation angle and block overlap. The optimized
values for these parameters correctly classify an image. Our
technique tunes the sliding window technique against these
parameters for the datasets. This optimized sliding win-
dow extracts feature vectors for the detected intensity-based
interest points. For the global feature detection using the
local intensity points, discrete values of quantized visual features are represented in histograms; each cell is described by an 8-bin histogram of its pixel edge responses (Fig. 1).
The edge magnitude and orientation are computed using
a first order Sobel kernel filter. Histograms are constructed
using the following equation, where y denotes bins, and z
denotes the cells:
$$h(y, z) = \sum_{x \in z} \begin{cases} \|\nabla I(x)\| & \text{if } \Theta(x) = y \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\Theta(x)$ is the orientation of the edge, and $\|\nabla I(x)\|$ is the magnitude. A histogram of the gradient orientation
is computed for each cell. Histogram normalization is then
performed by accumulating the local histograms for each
block and applying them over all of the cells in the block.
A cell size of 4×4 is used to capture the small-scale spatial
information. Significant local pixel information is captured
by using a block size of 2×2. This helps to express local illu-
mination changes that are not lost when the small block size
is averaged. Parts of the overlapping adjacent blocks pro-
duce better contrast normalization. To produce non-massive
feature vectors, an overlap of half of the block size is used.
An increased number of orientation bins results in large
feature vectors, so our optimized technique uses 7–9 bins,
which produce a relatively small number of feature vectors
with respect to the number of blocks and the cell size.

Fig. 1 Proposed method showing the step-by-step feature extraction process for an input image

Better orientation results are not observed when the bins are spread evenly across the full ±180-degree range, so in our technique, all orientation values are placed between 0 and 180 degrees by folding negative angles into the positive bins.
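A minimal sketch of the per-cell orientation histogram of (4) for one window centered on a detected interest point is given below; the window size is an assumed value, and the 2 × 2 block grouping with overlapping normalization described above is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def window_descriptor(gray, center, win=16, cell=4, bins=8):
    """Gradient orientation histograms (cf. Eq. 4) for one window around an
    interest point. Block normalization is omitted; `win` is an assumption."""
    r, c = center
    patch = gray[max(r - win, 0):r + win, max(c - win, 0):c + win].astype(np.float64)
    gx = ndimage.sobel(patch, axis=1)
    gy = ndimage.sobel(patch, axis=0)
    mag = np.hypot(gx, gy)                        # ||∇I(x)||
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # fold angles into 0-180 degrees
    bin_idx = np.minimum((ang * bins / 180.0).astype(int), bins - 1)
    n_cy, n_cx = patch.shape[0] // cell, patch.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            rows = slice(cy * cell, (cy + 1) * cell)
            cols = slice(cx * cell, (cx + 1) * cell)
            # accumulate each pixel's gradient magnitude into its orientation bin
            np.add.at(hist[cy, cx], bin_idx[rows, cols].ravel(), mag[rows, cols].ravel())
    return hist.ravel()
```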
3.3 Efficient uniform local binary pattern texture
analysis
An operator that is invariant to monotonic transformations
of the gray scale is used for the texture analysis. It works
on a circularly symmetric set of eight neighboring pixels in a 3 × 3 window. The strength of the technique is that it uses a limited subset of pixels to achieve computational simplicity. The selected subset of pixels forms 'uniform' patterns, which results in rotational invariance; the few spatial transitions in uniform patterns make them more tolerant of rotation changes [42]. In this process, a local 3 × 3 neighborhood
with a gray level distribution of nine pixels is selected first.
In the circularly symmetric neighborhood of eight pixels,
the gray values of the diagonal pixels are computed by inter-
polation. To achieve gray scale invariance, the gray values
of these eight pixels are subtracted from the center pixel.
This can be calculated by:
$$T = p(v_0,\; v_1 - v_0,\; v_2 - v_0,\; v_3 - v_0,\; v_4 - v_0,\; v_5 - v_0,\; v_6 - v_0,\; v_7 - v_0,\; v_8 - v_0) \qquad (5)$$

where $T$ is the texture, $p$ is the pixel, and $v$ is the gray scale value.
The center pixel $v_0$ contains the overall luminance of the
image, which is not required for the local texture analy-
sis; therefore, it is discarded for the gray level distribution
calculation [43]. Invariance is easily achieved by recording only the sign of the gray scale difference for each differentiated pixel pair. The minor change to the LBP [43] can be
illustrated as [42]:
$$LBP_8 = \sum_{i=1}^{8} \mathrm{sign}(v_i - v_0)\, 2^{i-1} \qquad (6)$$
$LBP_8$ and the traditional LBP [44] have different indexed neighbors and interpolated diagonal values. Both differences form the basis of the rotational invariance of $LBP_8$. The circularly symmetric neighborhood set of eight pixels produces $2^8$ outputs. The gray pixel values are moved along the perimeter: a right rotation operation is performed on the pixel values so that the bit values contain the maximum number of zeros starting from the eighth bit. This process is formulated as [42]:
$$LBP_8^{ri36} = \min\{\mathrm{ROR}(LBP_8, i) \mid i = 0, 1, \ldots, 7\} \qquad (7)$$
As mentioned in [45], these 36 unique rotation invariant
local binary patterns carry micro-scale features; for exam-
ple, the least significant bit shows a bright spot, the most
significant bit shows a dark spot, and diagonals show edges.
Moreover, in some cases, suboptimal results are observed due to values rotated by 45°. A large number of spatial transitions occurs when the uniformity value is large, so considering this dependency, a value of 2 is used for uniformity. Thus, $LBP_8^{ri36}$ is reformed as:

$$LBP_8^{riu2} = \sum_{i=1}^{8} \mathrm{sign}(v_i - v_0) \quad \text{if } U(LBP_8) \le 2 \qquad (8)$$
Similarly, $LBP_{16}$ is calculated as:

$$LBP_{16} = \sum_{i=1}^{16} \mathrm{sign}(v_i - v_0)\, 2^{i-1} \qquad (9)$$
A distribution of 16 bits along the perimeter produces
65,536 output values, which contain 243 different patterns.
Similarly, defining a uniform rotation produces invariant patterns of the 16-bit version:

$$LBP_{16}^{riu2} = \sum_{i=1}^{16} \mathrm{sign}(v_i - v_0) \quad \text{if } U(LBP_{16}) \le 2 \qquad (10)$$
Mapping of $LBP_{16}$ to $LBP_{16}^{riu2}$ is performed using a lookup table.
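The following sketch computes the rotation-invariant uniform code of (6)–(8) for a single 3 × 3 neighborhood. It uses the raw diagonal pixels rather than interpolated values, and it groups all non-uniform patterns into one extra bin following the standard riu2 convention [42]; both simplifications are assumptions of this sketch.

```python
import numpy as np

def lbp8_riu2(window):
    """Rotation-invariant uniform LBP (cf. Eqs. 6-8) for one 3x3 neighborhood.
    `window` is a 3x3 array whose center pixel v0 sits at index [1, 1]."""
    v0 = float(window[1, 1])
    # the eight circular neighbors, traversed clockwise
    neighbors = np.array([window[0, 1], window[0, 2], window[1, 2], window[2, 2],
                          window[2, 1], window[2, 0], window[1, 0], window[0, 0]],
                         dtype=np.float64)
    signs = (neighbors >= v0).astype(int)           # sign(v_i - v_0)
    # U: number of 0/1 transitions around the circular pattern
    transitions = int(np.sum(signs != np.roll(signs, 1)))
    if transitions <= 2:                            # uniform pattern (Eq. 8)
        return int(signs.sum())                     # values 0 .. 8
    return 9                                        # all non-uniform patterns share one bin

# A texture histogram is then accumulated by applying lbp8_riu2 to every
# 3x3 neighborhood of the gray scale image.
```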
3.4 Generation of feature vectors
a) The sliding window extracts feature vectors for the
intensity-based detected local interest points. The interest
points are edges and corners, which are a limited set of
the image pixels. Feature extraction is performed on these
interest points. The global extraction depends on the num-
ber of bins, the number of overlapping blocks, and the
cell and block sizes. These generated values are very small
compared to the signatures generated by state-of-the-art
descriptors [2, 3], which reduces the computational time for large databases. These points are extracted once for each image, but classification and image retrieval take a long time if voluminous feature vectors must be used, and storing and accessing such feature vectors is also slow.
b) Feature vectors of similar strength are produced using texture analysis with a uniform local binary pattern, as described in Section 3.3.
Steps a) and b) represent the image with distinct fea-
tures. Their concatenation results in better classification, but
it is computationally and time intensive. Therefore, these
feature vectors must be reduced with minimum information
loss before concatenation for efficient processing.
3.5 Feature vector optimization algorithm
and calculating the principal components
The concatenated feature vectors contain useful informa-
tion about the images. The coefficients returned by principal component analysis are often numerous enough to require high computational power. This is even more crucial in large
database scenarios. The returned coefficients are directly
proportional to the number of observations. The descriptors
[1–3] return voluminous rows and columns, which provide
a base for the coefficient calculations. A smaller set of prin-
cipal components can be obtained by limiting the number
of observations. Row elimination, signature subsetting and similar approaches result in fewer observations but significant information loss from the original data, which causes worse prediction and classification results.
The algorithm in our technique is trained for differ-
ent image datasets, image dimensions, image resolutions,
pixel intensities and image formats. The algorithm pri-
marily checks all of the required and related information
for the dataset on which the experiments are performed.
After this examination, the program has an understanding
of the images. The algorithm takes the feature vectors as
inputs, and they are reshaped based on the optimal num-
ber of observations as inputs for the coefficient calculations
before computing the principal components. The number
of observations is optimized based on the image attributes
that were described previously (e.g., dimension, resolution).
The reshaped feature vectors are then input to the princi-
pal component calculation and result in fewer coefficients
with large variances. These limited principal components
identify the feature vectors comprehensively with very little
information loss. The novelty of the technique for limited
coefficients is that it provides nearly the same precision
as using the complete set of feature vectors without subset
optimization. The feature vectors produced by the texture
analysis are optimized using the feature vector optimization
algorithm on which the principal components are computed.
The texture information additionally carries monotonic and
rotational invariance characteristics, which can perform bet-
ter prediction along with the global features. To achieve
this, the texture features are concatenated with intensity-
based interest points extracted with the sliding window.
The coefficient-based reduced, optimized and concatenated
feature vectors contain texture and object recognition capa-
bilities for simple, overlapping and complex images. The
image retrieval time is also reduced due to the slim size of
the feature vectors. In cases of image retrieval from thou-
sands to millions of images, these compact signatures are
efficient as well as storage friendly.
The number of coefficients generated by PCA is propor-
tional to the dimensions of the input feature. The feature
vectors returned by different local and global methods vary
in size. These hybrid feature vectors produce a large number
of principal components. The number of PCs varies from
hundreds to thousands per image. Even after the reduction
process, classification of such a large number of feature
vectors is time intensive. For the datasets, our algorithm
reorders the feature vectors and generates between 80 and
120 observations depending on several characteristics, such
as the size and type of image attributes and the pixel inten-
sities. Based on the number of observations, each image
is represented by 80–120 coefficients. This image signa-
ture size is very small compared to the output produced by
descriptors [1, 2] or the standard PCA.
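One plausible reading of this reshaping step is sketched below: the fused descriptor is folded into a fixed number of observations before PCA, so that only on the order of 80–120 coefficients are returned per image. The padding strategy, the observation count and keeping a single component per observation are assumptions made for illustration; the paper tunes the number of observations to the image attributes.

```python
import numpy as np
from sklearn.decomposition import PCA

def compact_signature(global_features, texture_features, n_observations=100):
    """Sketch of the feature vector optimization step (assumed reading):
    reshape the fused descriptor into `n_observations` rows so PCA returns
    roughly one coefficient per observation, i.e. a compact signature."""
    fused = np.concatenate([np.ravel(global_features),
                            np.ravel(texture_features)]).astype(np.float64)
    width = int(np.ceil(fused.size / n_observations))
    padded = np.zeros(n_observations * width)
    padded[:fused.size] = fused                   # zero-pad so it folds evenly
    observations = padded.reshape(n_observations, width)
    # project each reshaped observation onto its first principal component,
    # yielding one high-variance coefficient per observation
    coeffs = PCA(n_components=1).fit_transform(observations).ravel()
    return coeffs                                 # compact ~100-value signature
```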
3.6 Image classification using Support Vector Machine
(SVM)
A Support Vector Machine is a discriminative classifier that
separates two classes of points by a hyperplane. It is a super-
vised learning model that analyzes data that are used for
classification and regression analysis tasks.
Let the input belong to one of two classes as [46]:
$$\{(x_i, y_i)\}_{i=1}^{N}, \quad y_i \in \{+1, -1\} \qquad (11)$$

where $x_i$ is the input set, and $y_i$ are the corresponding labels.
Hyperplanes are assigned the values of the weight vectors
‘w’ and bias ‘b’ as follows:
$$w^T \cdot x + b = 0 \qquad (12)$$
and hyperplanes with a maximum margin of $2/\|w\|$ are found such that the two classes can be separated from each other; i.e.,

$$w^T \cdot x_i + b \ge +1 \qquad (13)$$

$$w^T \cdot x_i + b \le -1 \qquad (14)$$

or

$$y_i(w^T \cdot x_i + b) \ge +1 \qquad (15)$$

The kernel version solution of the Wolfe dual problem is then found with the Lagrange multipliers $\alpha_i$:
$$Q(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (16)$$

where $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.
Based on the kernel function, the SVM classifier is given
by:
$$F(x) = \mathrm{Sgn}(f(x)) \qquad (17)$$

where $f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$ is the output hyperplane decision function of the SVM. High values of $f(x)$ represent high prediction confidence, and low values of $f(x)$ represent low prediction confidence.
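A minimal sketch of the training and prediction stages with a standard SVM implementation is shown below; the synthetic data, the 70/30 split and the RBF kernel choice are placeholders, since the paper does not fix a particular kernel here.

```python
import numpy as np
from sklearn.svm import SVC

# `signatures` stands for the compact image signatures produced above and
# `labels` for the +1/-1 category labels; the values are synthetic placeholders.
rng = np.random.default_rng(0)
signatures = rng.normal(size=(200, 100))           # 200 images, 100 coefficients each
labels = np.where(rng.random(200) > 0.5, 1, -1)    # positive / negative samples

clf = SVC(kernel="rbf")                            # kernel K(x_i, x_j) of Eq. (16), assumed RBF
clf.fit(signatures[:140], labels[:140])            # roughly 70 % used for training

scores = clf.decision_function(signatures[140:])   # f(x): signed confidence
predictions = np.sign(scores)                      # F(x) = Sgn(f(x)), Eq. (17)
```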
4 Experimentation
4.1 Datasets
Most of the datasets are tailored for custom tasks depending
on the nature of the project. Many contributions use domain-
based image types. Experiments performed on a dataset
are difficult to compare with those performed on another
dataset. The accuracy of the results is directly affected by
the image attributes, such as color, object location, quality,
size, overlapping, occlusion, and cluttering [47]. In our
case, widely-used datasets and their respective categories
are selected by considering the following characteristics:
• Diverse image categories
• General content-based image retrieval usage
• Categories contain many types of textures, foreground and background objects, colors and spatial features
• Images from different areas of life to test the descriptor's effectiveness
The selected subsets are representative of the respective
datasets and include diverse categories from different areas,
object orientations, shapes and textures, and global and
spatial information. Therefore, the results that are based
on the selected categories are representative of the entire
dataset. Experiments are performed on a variety of standard-
ized datasets, including ImageNet [48], Caltech-256 [49],
Caltech-101 [50] and Corel-1000 [15]. The sampling, object
categories, and image characteristics of each category are
described below.
4.1.1 Corel-1000 dataset
The Corel-1000 dataset is a benchmark that is widely used
in the literature for classification tasks [16–19,51]. The
Corel database includes many semantic groups, including
scene, nature, people, flowers, animal, and food. It consists
of 1000 images in 10 categories. Each semantic category
consists of 100 images with a resolution of 256×384 pixels
or 384×256 pixels. Our algorithm randomly selects 70 %
of the images from each category for training and 30 % for
testing. A total of 660 images from all of the categories is
used for training, and 330 images are used for testing.
4.1.2 ImageNet synset
The ImageNet synset [48] is a large-scale image database
that is used to index, retrieve, organize and annotate multi-
media data. It is organized based on the WordNet hierarchy.
Each meaningful concept in WordNet that can be described
by multiple words or word phrases is called a synset.
WordNet contains more than 100,000 synsets, which are
dominated by nouns (80,000+). The repository contains
an enormous collection of more than 14,197,120 images.
Experiments were performed on 15 synsets downloaded
from the ImageNet repository [48], including aerie, car,
cherry tomato, coffee cup, dish, dust bag, flag, flower, gas
fixture, golf ball, heat exchanger, monitor, mud ceram-
ics, spoon and train. These synsets were selected from the
semantic groups of plants, natural objects, artifacts, devices,
containers, ceramics, arms, and equipment. These synsets
were selected due to their versatility, textures, complexity
and object orientation features. The cherry tomato contains
small and medium foreground and background objects as
well as overlapping objects. The flag synset contains spe-
cific color-oriented objects; in other words, color and tex-
ture are both used to classify this category. The gas fixture
and aerie synsets are complex and cluttered object cate-
gories. Both contain spatial information due to their hanging
nature. The golf ball and cherry tomato synsets have similar
round object orientations with color differences. It is chal-
lenging to distinguish these synsets. The artifacts group con-
tains structural complexities with semantics associations;
therefore, classification of this synset requires careful anal-
ysis of local and global features. The equipment and devices
groups are sometimes semantically the same. However, the
object- and texture-based training leads to better classifica-
tion. These 15 synsets contain 13,554 images, from which
100 images were randomly selected from each synset for
the experiments; i.e., 1500 (100×15) images were used for
training and testing (Fig. 2).
A total of 1050 images were used for training, and 450
were used for testing, by randomly selecting them from each
category. The positive training samples included two-thirds
of the candidates, which were randomly selected from each
category. The negative training samples included one-third
of the total.
4.1.3 Caltech-256 dataset
This dataset is a challenging set of 256 object categories
that contain 30,607 images [49]. It is a successor to the
Caltech-101 dataset. Image classification in Caltech-256 is
more difficult than in Caltech-101 [50] because it has more
variations. We performed experiments on 15 diverse cate-
gories, including AK47, American flag, backpack, baseball
bat, baseball glove, bear, mug, binocular, calculator, car
tire, Cartman, CD, cockroach, desk globe and comet. The
semantic groups were selected carefully to represent many
areas of real life. These categories contain animals, flags,
guns, accessories, tires, insects, computer accessories, daily
used entities and images with complex and overlapping
objects. Some of the categories are important because of
their texture patterns, whereas others are important because
of their foreground and background objects. The desk globe,
car tire and CD are round objects. Their classifications are
based on their orientations and textures. The cockroach was
selected from the insect category. Recognizing an insect in
an image requires the technique to have object recognition
capabilities. The American flag contains specific color and
texture information that can be used to classify it. Cartman
and the binocular are normally in complex backgrounds and contain overlapping objects.

Fig. 2 a ImageNet Synsets with 15 image samples (one image from each category). b Corel-1000 dataset showing 15 sample images from 10 categories. c Caltech-256 dataset showing 15 sample images from 15 categories (one image per category). d Caltech-101 dataset showing 15 sample images from 15 categories (one image per category)

A total of 1050 images were
used for the experiments by selecting 70 images per cate-
gory. Our algorithm randomly selects 70 % of the images
from each category for training and 30 % of the images for
testing. A total of 735 images from all of the categories were
used for training, and 315 were used for testing. In the train-
ing phase, positive samples are chosen randomly from the
respective category. Of each category, 70 % is used for pos-
itive training, and the remaining 30 % are negative training
samples. The negative training samples are gathered ran-
domly from the rest of the categories by selecting an equal
proportion from each semantic group.
4.1.4 Caltech-101 dataset
Caltech-101 [50] is a benchmark that is widely used for
image categorization, recognition and classification. It con-
tains a total of 9146 images in 101 distinct categories.
Fifteen categories were selected for the classification, including
airplane, ferry, camera, brain, cougar face, grand piano, Dal-
matian, dollar bill, starfish, soccer ball, minaret, motorbikes,
revolver, sunflower and Windsor chair. These categories
were chosen due to their ability to contribute spatial infor-
mation, rounded objects, and objects with different shape,
texture and color information to test the effectiveness of
the proposed method. The brain and sunflower groups were
considered because of their textures. The dollar bill and
cougar face are categories with complex object structures
and orientations. The camera, revolver and Windsor chair
categories require specific object recognition capability. The
minaret and airplane share spatial and texture information
for classification. A total of 1050 images were used for
the experiments by selecting 70 images per category. Our
algorithm randomly selects 70 % of the images from each
category for training and 30 % of the images for testing. A
total of 735 images from all of the categories was used for
training, and 315 images were used for testing.
4.2 Results
4.2.1 Input process
In the first step, the color space is converted to gray
scale for efficient computation. The gray scale image is
then processed to detect the intensity-based local interest
points. Global features are extracted for these interest points
using the optimized sliding window. The extracted features
are concatenated with the texture features that are invari-
ant to monotonic and rotation changes. The feature vector
concatenation is followed by applying the proposed fea-
ture reshaping technique. Coefficients are generated for the
restructured observations. These data are passed to the sup-
port vector machine for classification. The support vector
machine is involved in two phases: training and testing.
During the training phase, the fused and reduced extracted
feature vectors are input to the support vector machine. The
positive training samples are randomly selected from the
respective categories, and the negative training samples are
collected from the other categories. Two times more positive
training samples are used than negative training samples.
Each training sample is labeled as belonging to one or the
other sample type. The supervised learning model of the
support vector machine learns new examples of one or the
other category, which makes it a non-probabilistic binary
linear classifier.
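For orientation, the sketch below assembles the whole input process from off-the-shelf components that approximate each stage (Harris corners for the intensity-based interest points, HOG-style windows for the global features, uniform LBP for the texture histogram); it is an approximation of the pipeline, not the authors' implementation, and all parameter values are assumptions.

```python
import numpy as np
from skimage import io, color
from skimage.feature import corner_harris, corner_peaks, hog, local_binary_pattern

def image_signature(path):
    """End-to-end sketch of the input process with illustrative settings."""
    img = io.imread(path)
    gray = color.rgb2gray(img) if img.ndim == 3 else img / 255.0
    # 1) intensity-based local interest points
    points = corner_peaks(corner_harris(gray), min_distance=5)
    # 2) sliding-window gradient features around each interest point
    windows = []
    for r, c in points[:50]:                       # cap the number of key points
        r0, c0 = max(r - 16, 0), max(c - 16, 0)
        patch = gray[r0:r0 + 32, c0:c0 + 32]
        if patch.shape == (32, 32):
            windows.append(hog(patch, orientations=8, pixels_per_cell=(4, 4),
                               cells_per_block=(2, 2)))
    global_feat = np.concatenate(windows) if windows else np.zeros(1)
    # 3) rotation-invariant uniform LBP texture histogram
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    texture_feat, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    # 4) fused vector; reshaping, PCA and SVM classification follow as above
    return np.concatenate([global_feat, texture_feat])
```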
4.2.2 Precision and recall evaluation
Precision is the positive predictive value, and recall is the sensitivity measure or true positive rate. Precision and recall are calculated for each
image category and also for small and large databases. The
precision and recall results are tested on different sets of
training and testing data.
$$\mathrm{precision} = \frac{N_A(q)}{N_R(q)} \qquad (18)$$

$$\mathrm{recall} = \frac{N_A(q)}{N_t} \qquad (19)$$

where $N_A(q)$ represents the relevant images that match the query image, $N_R(q)$ represents the images retrieved against the query image, and $N_t$ is the total number of relevant images available in the database.
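For a single query these measures reduce to a few lines; the helper below is a sketch with hypothetical argument names.

```python
def precision_recall(retrieved, relevant_set, total_relevant):
    """Precision and recall of Eqs. (18)-(19) for one query: `retrieved` is the
    ranked result list and `relevant_set` the ground-truth relevant images."""
    n_a = sum(1 for image in retrieved if image in relevant_set)  # N_A(q)
    precision = n_a / len(retrieved)                              # N_A(q) / N_R(q)
    recall = n_a / total_relevant                                 # N_A(q) / N_t
    return precision, recall

# e.g. a query returning 20 images with 18 relevant hits out of 100 relevant
# images in the database gives precision 0.90 and recall 0.18.
```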
4.3 Experimental results
4.3.1 Results of the Corel-1000 dataset with existing
methods
To determine the accuracy of the proposed technique, we
performed experiments on widely-used benchmarks. The
experimental results are compared with those from exist-
ing methods as well as with the state-of-the-art descriptors
SIFT, SURF, and HOG. The results are also compared with
those of Dubey et al. [16], Xiao et al. [17], Zhou et al. [18],
Shrivastava et al. [19], Kundu et al. [20], Zeng et al. [21],
Walia et al. [22], Ashraf et al. [23] and ElAlami et al. [24]
whose methods achieved remarkable performance. Their
standardized work has also been cited by current researchers
[52–55]. Figure 3 shows a graphical representation of the
results of the proposed method compared to those from
existing state-of-the-art methods. The results show that the
proposed method outperforms most of the other methods.
Fig. 3 a: Comparison of the average precisions obtained by the proposed method and other standard retrieval systems using the Corel-1000 dataset. b: Comparison of the average recalls obtained by the proposed method and other standard retrieval systems using the Corel-1000 dataset

Figure 3a shows the average precision rates in comparison with those of existing methods. The proposed method shows
remarkable performance in most of the image categories.
The average recall rates are shown in Fig. 3b. The results
show that the proposed method has better recall rates in
most of the categories and that the mean average recall is
higher than those of other methods.
Table 1 shows a comparison of the average precision of
the proposed method with those of the standard retrieval
systems. The proposed system provides better precision in
most of the semantic groups; it outperforms in the semantic
groups of Africa, beach, building, bus, elephant, moun-
tain and food. The proposed method extracts local texture
and global features, which provide better results. The exist-
ing methods provide better results in some categories; for
example, [19] gives better results for dinosaur and flower.
However, the proposed method provides better results in
Table 1 Comparison of the average precision obtained by the proposed method and other standard retrieval systems on the top 20 results
Class Proposed method Dubey [16] Xiao [17] Zhou [18] Shriv [19] Kundu [20] Zeng [21] Walia [22] Ashraf [23] ElAlami [24]
Africa 0.90 0.75 0.67 0.85 0.74 0.44 0.72 0.51 0.65 0.72
Beach 0.92 0.55 0.60 0.53 0.58 0.32 0.65 0.90 0.70 0.59
Building 0.88 0.67 0.56 0.72 0.62 0.52 0.70 0.58 0.75 0.58
Bus 0.98 0.95 0.96 0.85 0.80 0.60 0.89 0.78 0.95 0.89
Dinosaur 0.97 0.97 0.98 1.00 1.00 0.40 1.00 1.00 1.00 0.99
Elephant 0.85 0.63 0.53 0.68 0.75 0.80 0.70 0.84 0.80 0.70
Flower 0.93 0.93 0.93 0.94 0.92 0.57 0.94 1.00 0.95 0.92
Horse 0.86 0.89 0.82 0.99 0.89 0.75 0.91 1.00 0.90 0.85
Mountain 0.84 0.45 0.46 0.55 0.56 0.57 0.72 0.84 0.75 0.56
Food 0.92 0.70 0.58 0.86 0.80 0.56 0.78 0.38 0.75 0.77
Average 0.904 0.749 0.709 0.797 0.766 0.553 0.801 0.783 0.820 0.757
Bold entries show the ‘largest value’ for the respective rows
these and other categories. Similarly, [16] provided a good
precision rate in horse classification. The proposed method
also has good accuracy in this category. Overall, the pro-
posed method provides an increase in the mean average
precision of 0.084.
Table 2 shows the average recall rates obtained by the
proposed methods and standard retrieval systems. The pro-
posed method has remarkable recall rates in seven of the
ten categories. Better classification leads to improved recall
rates even in the complex semantic groups, such as Africa,
mountain and food. The dinosaur and elephant categories
are relatively easy to classify, and most of the existing meth-
ods provide better results in these categories. The proposed
method provides high recall rates in the dinosaur and bus
categories as well as in complex groups, such as flower and
beach.
Figure 4 shows the mean average precision and recall
rates for the proposed method and the existing methods.
Figure 4a shows that the proposed method has a higher
mean average precision rate than the existing methods, and
Fig. 4b shows that it has significantly better mean average
recall rates. The recall rate is improved by 0.017 over those of the existing methods [19].
4.3.2 ImageNet Synset results
Experiments were performed on ImageNet synsets to check
the robustness and versatility of the proposed method. The
results are shown for the top 20 images. In the testing phase,
feature vectors of an input image are extracted using the
proposed method. The support vector machine classifies the
input image based on the training data. Input images are
selected from each category to check the precision and recall
rates for each category, and the results are computed for 20
images. The classified images for each category yield the
precision and recall rates for that category. For this bench-
mark, the mean average precision is 0.735, and the mean average recall is 0.147 (Figs. 5 and 6).
Table 2 Comparison of the average recalls obtained by the proposed method and other standard retrieval systems on the top 20 results
Class Proposed method Dubey [16] Xiao [17] Zhou [18] Shriv [19] Kundu [20] Zeng [21] Walia [22] Ashraf [23] ElAlami [24]
Africa 0.18 0.08 0.07 0.17 0.15 0.09 0.14 0.10 0.13 0.14
Beach 0.18 0.06 0.06 0.11 0.12 0.06 0.13 0.18 0.14 0.12
Building 0.18 0.07 0.06 0.14 0.12 0.10 0.14 0.12 0.15 0.12
Bus 0.20 0.10 0.10 0.17 0.16 0.12 0.18 0.16 0.19 0.18
Dinosaur 0.19 0.10 0.10 0.20 0.20 0.08 0.20 0.20 0.20 0.20
Elephant 0.17 0.06 0.05 0.14 0.15 0.16 0.14 0.17 0.16 0.14
Flower 0.19 0.09 0.09 0.19 0.18 0.11 0.19 0.20 0.19 0.18
Horse 0.17 0.09 0.08 0.20 0.18 0.15 0.18 0.20 0.18 0.17
Mountain 0.17 0.05 0.05 0.11 0.11 0.11 0.14 0.17 0.15 0.11
Food 0.18 0.07 0.06 0.17 0.16 0.11 0.16 0.08 0.15 0.15
Average 0.181 0.075 0.071 0.159 0.153 0.111 0.160 0.157 0.164 0.151
Bold entries show the ‘largest value’ for the respective rows
Fig. 4 a Graphical representation of the mean average precisions on the Corel dataset. b Graphical representation of the mean average recalls on the Corel dataset
4.3.3 Caltech-256 dataset results
To check the effectiveness of the proposed method, the
results are compared with those from state-of-the-art meth-
ods. A total of 1050 images are randomly selected from
15 preselected image categories for training and testing. A
batch of 14 images is used to test each category. A total of
15 such batches are used to obtain the precision and recall
rates for each category. The results are shown for the top
14 images from the batch of 50 relevant images. The pro-
posed method outperforms the others in most of the image
categories. The results show a mean average precision of
0.865 and a mean average recall rate of 0.242. Caltech-
256 is considered a challenging dataset that contains com-
plex images. The proposed method provides exceptional
results for the AK47, baseball bat, desk globe, car tire
and CD image categories, which contain uncrowded back-
grounds and objects with clear boundaries. Sample images
from these categories are shown in Fig. 7a. However the
results of the proposed method are equally good for the
other categories, which include cluttered objects, overlap-
ping objects, and complex backgrounds as shown in Fig. 7b.
Fig. 5 Average precisions and recall rates for the ImageNet synset. The results are computed for the proposed method with 15 synsets
Fig. 6 Average precision and recall rates of the proposed method on 15 categories of the Caltech-256 dataset
4.3.4 Caltech-101 dataset results
The average precision and recall rates for 15 categories of
the Caltech-101 dataset are shown in Fig. 8. Images with
different foregrounds and backgrounds, object shapes, and
textures are selected for classification. The proposed tech-
nique provides better precision in all of the categories by
processing the local features with global values. The recall
rates for Caltech-101 are also promising. Most of the cate-
gories have high recall rates, while a few have average rates.
The Windsor chair and camera have average rates due to the
complex backgrounds and cluttered objects. The mean aver-
age precision obtained for this benchmark is 0.884, and the mean average recall is 0.248.
4.4 Comparative analysis against key point detectors
and descriptors
Feature detectors and descriptors are used in object detec-
tion and recognition. Detectors refer to the tool that extracts
the features from the image, such as corner, blob or edge
detectors. Extractors are used to read the features from the
interest points. HOG [1], SIFT [2], and SURF [3] are well-
known object detectors and descriptors that are widely used
in many applications. HOG was presented at the Confer-
ence on Computer Vision and Pattern Recognition (CVPR)
and is used for object detection [56], image classification
[56] and image retrieval [57] tasks. SIFT was presented in
the proceedings of the International Conference on Com-
puter Vision (ICCV) and is used for content-based image
retrieval [58,59] and object detection tasks [60]. SURF
was presented at the European Conference on Computer
Vision (ECCV) and is used for image retrieval [61] and
related tasks. These descriptors are compared to test the
effectiveness of the proposed method. For the experiments,
1050 images are randomly selected from 15 categories, and
each category contains 70 images. Our algorithm randomly
selects 2/3 of the images from each category for training
and 1/3 of the images for testing. A total of 735 images
from all of the categories was used for training, and 315
were used for testing. In the training phase, positive samples
are taken randomly from the respective category.

Fig. 7 a Sample images from the categories with exceptional results from Caltech-256. b Sample images with overlapping objects and complex backgrounds in Caltech-256

Fig. 8 Average precision rates of the proposed method on 15 categories of the Caltech-101 dataset

Positive samples make up 70 % of each category, and the negative
training samples (30 %) are selected from the rest of the
categories.
4.4.1 Computational load
Experiments are performed with HOG, SIFT, and SURF,
and the results are compared to those of the proposed
method. These descriptors, particularly SIFT, produced
results with very high computational times. Moreover,
redundant and massive feature vectors are produced, which
require large amounts of processing time and system
resources for computation and classification. The proposed
method performed the classification with very low time and
computation costs. The computational efficiency achieved
by processing a limited set of feature vectors from the pro-
posed reordering algorithm generated a compact input that
was used to obtain compact coefficients. The computational
load is an aggregate of the gray level conversion of the input
image, the feature extraction using the image descriptor, fea-
ture reduction and comparison with the dataset for classifi-
cation. The proposed method consumed a total computation
time of 0.70083 sec/image, which is 35.5 %, 71.22 % and
59.7 % less than HOG, SIFT, and SURF, respectively.
4.4.2 Precision rates
Descriptors are unable to perform equally well in all image
categories due to their limits of effectiveness. Descriptor
[4] is suitable for local features, but it is unable to provide
accurate results for global features. Similarly, the detector
with the best ability to predict texture patterns is unable
to accurately recognize objects. In addition, the descrip-
tors that are suitable for finding edges and corners are not
good candidates for texture analysis. Therefore, none of the
state-of-the-art descriptors are ideal candidates for feature
extraction in versatile image categories. However, the pro-
posed descriptor is able to find the textures, edges, corners,
and pixel intensities and recognize complex and overlapping
objects.
Figure 9a shows the results of the proposed method in
comparison with those of the state-of-the-art descriptors for
15 categories of the Caltech-101 dataset. Some of the detec-
tors show better results in some image categories because
they were designed for those categories. The descriptors
perform well in their areas of specialty.
Table 3 shows a comparison of the average precisions
of the proposed descriptor with those of the state-of-the-art
descriptors HOG, SIFT and SURF. Experiments are per-
formed with all of the descriptors to check the strength of the
proposed descriptor. The proposed method shows remark-
able performance in the sunflower, motorbike, starfish, ferry
and brain categories. The mean average precision obtained
by the proposed descriptor is 0.158 higher than that of the
HOG descriptor.
Table 4 compares the experimental results of the pro-
posed method with those of the state-of-the-art descriptors
using the Caltech-256 dataset. The proposed method has
better precision than the existing methods in 13 of the 15
categories. For the other two categories, the precision is
almost the same as that reported by SURF. The results of
the Corel-1000 collection are shown to check the effec-
tiveness of the proposed method compared to those of the
state-of-the-art descriptors.

Fig. 9 a Comparison of the average precisions obtained by the proposed method compared with those of the state-of-the-art descriptors on 15 categories of the Caltech-101 dataset. b Comparison of the average precisions obtained by the proposed method compared with those of the state-of-the-art descriptors on 10 categories of the Corel-1000 dataset

The proposed method provides better results for most of the image categories. The proposed descriptor has a 0.036 higher mean average precision than the HOG descriptor.
4.4.3 Recall rates
Figure 10 shows the recall rates for Caltech-101. The results
show that the state-of-the-art descriptors provide better per-
formance in some image categories and below average
Table 3 Comparison of the average precisions obtained by the proposed method compared with those from the state-of-the-art descriptors on 15 categories of the Caltech-101 dataset
Class Proposed method HOG [1] SURF [3] SIFT [2]
Airplanes 0.94 0.95 0.72 0.76
Ferry 0.88 0.66 0.66 0.75
Camera 0.84 0.85 0.72 0.77
Brain 0.91 0.79 0.70 0.71
Cougar face 0.87 0.83 0.71 0.72
Grand piano 0.92 0.90 0.73 0.76
Dalmatian 0.90 0.73 0.68 0.70
Dollar bill 0.86 0.35 0.40 0.76
Starfish 0.88 0.45 0.74 0.77
Soccer ball 0.87 0.82 0.71 0.74
Minaret 0.88 0.66 0.79 0.74
Motorbikes 0.90 0.78 0.72 0.73
Revolver 0.85 0.73 0.71 0.70
Sunflower 0.92 0.66 0.68 0.50
Windsor chair 0.84 0.74 0.73 0.51
Average 0.887 0.728 0.695 0.710
Bold entries show the ‘largest value’ for the respective rows
performance in others. However, the proposed method pro-
vides better recall rates for most of the categories in both
datasets. Hence, the proposed method provides better clas-
sification results for all of the image categories. Low recall
rates are observed for the dollar bill and sunflower cate-
gories using the HOG and SURF descriptors.
In these categories, complex image backgrounds with
overlapping objects are difficult to classify. However the
proposed method provided better recall rates in these cat-
egories. Hence, the proposed method intuitively combines
local and global features by selecting local features based
on the pixel intensity level and texture values and selecting
global features using the sliding window. The local features
Table 4 Comparison of the average precisions obtained by the proposed method compared with those from the state-of-the-art descriptors on the Corel-1000 dataset
Class Proposed method HOG [1] SURF [3] SIFT [2]
Africa 0.90 0.82 0.71 0.72
Beach 0.91 0.78 0.60 0.60
Building 0.87 0.90 0.79 0.80
Bus 0.95 0.85 0.81 0.80
Kangaroo 0.97 0.88 0.66 0.45
Elephant 0.85 0.89 0.80 0.79
Flower 0.93 0.85 0.75 0.75
Horse 0.85 0.91 0.76 0.77
Mountain 0.83 0.86 0.77 0.75
Food 0.98 0.94 0.72 0.71
Average 0.907 0.871 0.738 0.716
Bold entries show the ‘largest value’ for the respective rows
Fig. 10 Comparison of the
average recall rates obtained by
the proposed method compared
with those of the state-of-the-art
descriptors on 15 categories of
the Caltech-101 dataset
help in the texture and shape analysis, whereas the global
features are robust for object recognition. Combining local values with a global depiction reveals the hidden patterns of an image as well as the distinctive objects. The concate-
nation of the local and global features is performed after
computing the high variance coefficients. The proposed
reshaping algorithm limits the inputs for the component
analysis. Thus, the combined feature vectors are compact
and represent an image efficiently.
5 Conclusion
In this paper, we proposed a novel method for effective
and accurate feature vector extraction and image classi-
fication. The descriptor is able to perform classifications
with significant precision in diverse categories of the bench-
mark datasets ImageNet, Caltech-256, Caltech-101 and
Corel-1000. The descriptor accurately distinguishes cor-
ners, edges, and lines and performs texture analysis and
object recognition for complex and overlapping images.
The proposed method was compared with other sophisticated methods and achieved high precision in most of the image categories. The
proposed descriptor was also compared with the state-of-
the-art descriptors SIFT, SURF and HOG and outperformed
them in all of the datasets. The experimental results showed that the state-of-the-art descriptors perform well in some image categories, owing to their specialization in those areas, but provide poor results in other categories because of their limitations with respect to those image attributes. The
proposed method provides reliable and remarkable preci-
sion and recall rates in most of the image categories of the
benchmark datasets.
References
1. Dalal N, Triggs B (2005) Histograms of oriented gradients for
human detection. In: IEEE computer society conference on com-
puter vision and pattern recognition, 2005. CVPR 2005, vol 1.
IEEE, pp 886–893
2. Lowe DG (1999) Object recognition from local scale-invariant
features. In: The proceedings of the seventh IEEE international
conference on computer vision, 1999, vol 2. IEEE, pp 1150–
1157
3. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust
features (SURF). Comput Vis Image Underst 110(3):346–359
4. Liu G-H, Yang J-Y (2013) Content-based image retrieval using
color difference histogram. Pattern Recogn 46(1):188–198
5. Chaudhary MD, Upadhyay AB (2014) Integrating shape and edge
histogram descriptor with stationary wavelet transform for effec-
tive content based image retrieval. In: International conference
on circuit, power and computing technologies (ICCPCT), 2014.
IEEE, pp 1522–1527
6. Agrawal D, Jalal AS, Tripathi R (2013) Trademark image retrieval
by integrating shape with texture feature. In: International con-
ference on information systems and computer networks (ISCON),
2013. IEEE, pp 30–33
7. Harris C, Stephens M (1988) A combined corner and edge detec-
tor. In: Alvey vision conference, vol 15, p 50
8. Wang H, Brady M (1995) Real-time corner detection algorithm
for motion estimation. Image Vis Comput 13(9):695–703
9. Khotanzad A, Hong YH (1990) Invariant image recognition
by Zernike moments. IEEE Trans Pattern Anal Mach Intell
12(5):489–497
10. Rosten E, Drummond T (2006) Machine learning for high-speed
corner detection. In: Computer vision–ECCV 2006. Springer,
Berlin, pp 430–443
11. Tuytelaars T, Van Gool L (2004) Matching widely separated
views based on affine invariant regions. Int J Comput Vis 59(1):
61–85
12. Sural S, Qian G, Pramanik S (2002) Segmentation and histogram
generation using the HSV color space for image retrieval. In:
International conference on image processing, 2002. Proceedings.
2002, vol 2. IEEE, pp II–589
13. Mikolajczyk K, Schmid C (2005) A performance evaluation
of local descriptors. IEEE Trans Pattern Anal Mach Intell
27(10):1615–1630
14. Gupta E, Kushwah RS (2015) Combination of global and local
features using DWT with SVM for CBIR. In: 4th interna-
tional conference on reliability, infocom technologies and opti-
mization (ICRITO)(trends and future directions), 2015. IEEE,
pp 1–6
15. Li J, Wang JZ (2003) Automatic linguistic indexing of pictures by
a statistical modeling approach. IEEE Trans Pattern Anal Mach
Intell 25(9):1075–1088
16. Dubey SR, Singh SK, Singh RK (2016) Multichannel decoded
local binary patterns for content-based image retrieval. IEEE Trans
Image Process 25(9):4018–4032
17. Xiao Y, Wu J, Yuan J (2014) mCENTRIST: a multi-channel fea-
ture generation mechanism for scene categorization. IEEE Trans
Image Process 23(2):823–836
18. Zhou Y, Zeng F-Z, Zhao H-M, Murray P, Ren J (2016) Hierar-
chical visual perception and two-dimensional compressive sensing
for effective content-based color image retrieval. Cogn Comput
8(5):877–889
19. Shrivastava N, Tyagi V (2015) An efficient technique for retrieval
of color images in large databases. Comput Electr Eng 46:314–327
20. Kundu MK, Chowdhury M, Bulò SR (2015) A graph-based relevance feedback mechanism in content-based image retrieval. Knowl-Based Syst 73:254–264
21. Zeng S, Huang R, Wang H, Kang Z (2016) Image retrieval using
spatiograms of colors quantized by Gaussian mixture models.
Neurocomputing 171:673–684
22. Walia E, Pal A (2014) Fusion framework for effective color image
retrieval. J Vis Commun Image Represent 25(6):1335–1348
23. Ashraf R, Bashir K, Irtaza A, Mahmood MT (2015) Content based
image retrieval using embedded neural networks with bandletized
regions. Entropy 17(6):3552–3580
24. ElAlami ME (2014) A new matching strategy for content based
image retrieval system. Appl Soft Comput 14:407–418
25. Iqbal K, Odetayo MO, James A (2012) Content-based image
retrieval approach for biometric security using colour, texture and
shape features controlled by fuzzy heuristics. J Comput Syst Sci
78(4):1258–1277
26. Neelima N, Reddy ES (2015) An improved image retrieval system
using optimized FCM & multiple shape, texture features. In: 2015
IEEE international conference on computational intelligence and
computing research (ICCIC). IEEE, pp 1–7
27. Youssef SM (2012) ICTEDCT-CBIR: integrating curvelet trans-
form with enhanced dominant colors extraction and texture analy-
sis for efficient content-based image retrieval. Comput Electr Eng
38(5):1358–1376
28. Lande MV, Bhanodiya P, Jain P (2014) An effective content-based
image retrieval using color, texture and shape feature. In: Intel-
ligent computing, networking, and informatics. Springer, India,
pp 1163–1170
29. Xia Y, Wan S, Yue L (2014) A new texture direction feature
descriptor and its application in content-based image retrieval. In:
Proceedings of the 3rd international conference on multimedia
technology (ICMT 2013). Springer, Berlin, pp 143–151
30. Agarwal S, Verma AK, Singh P (2013) Content based image
retrieval using discrete wavelet transform and edge histogram
descriptor. In: International conference on information systems
and computer networks (ISCON), 2013. IEEE, pp 19–23
31. Jadhav P, Phalnikar R (2015) SIFT based efficient content based
image retrieval system using neural network. Artificial Intelligent
Systems and Machine Learning 7(8):234–238
32. Awad D, Courboulay V, Revel A (2012) Saliency filtering of
sift detectors: application to cbir. In: Advanced concepts for
intelligent vision systems. Springer, Berlin, pp 290–300
33. Saad MH, Saleh HI, Konber H, Ashour M (2013) CBIR system based on integration between SURF and global features
34. Velmurugan K, Baboo SS (2011) Content-based image retrieval
using SURF and colour moments. Global J Comp Sci Technol
10:11
35. Barbu T (2014) Pedestrian detection and tracking using temporal
differencing and HOG features. Comput Electr Eng 40(4):1072–
1079
36. Albiol A, Monzo D, Martin A, Sastre J, Albiol A (2008)
Face recognition using HOG–EBGM. Pattern Recogn Lett
29(10):1537–1543
37. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010)
Object detection with discriminatively trained part-based models.
IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645
38. Pan S, Sun S, Yang L, Duan F, Guan A (2015) Content retrieval
algorithm based on improved HOG. In: 3Rd international confer-
ence on applied computing and information technology/2nd inter-
national conference on computational science and intelligence
(ACIT-CSI), 2015. IEEE, pp 438–441
39. Murala S, Maheshwari RP, Balasubramanian R (2012) Local
tetra patterns: a new feature descriptor for content-based image
retrieval. IEEE Trans Image Process 21(5):2874–2886
40. Moravec HP (1979) Visual mapping by a robot rover. In: Pro-
ceedings of the 6th international joint conference on artificial
intelligence, vol 1. Morgan Kaufmann Publishers Inc, pp 598–600
41. Förstner W, Gülch E (1987) A fast operator for detection and precise location of distinct points, corners and centres of circular features. In: Proceedings of the ISPRS intercommission conference on fast processing of photogrammetric data, pp 281–305
42. Ojala T, Pietikäinen M, Mäenpää T (2000) Gray scale and rotation invariant texture classification with local binary patterns. In: Computer vision–ECCV 2000. Springer, Berlin, pp 404–420
43. Ojala T, Valkealahti K, Oja E, Pietikäinen M (2001) Texture discrimination with multidimensional distributions of signed gray-level differences. Pattern Recogn 34(3):727–739
44. Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of texture measures with classification based on featured distributions. Pattern Recogn 29(1):51–59
45. Pietikäinen M, Ojala T, Xu Z (2000) Rotation-invariant texture classification using feature distributions. Pattern Recogn 33:43–52
46. Stejić Z, Takama Y, Hirota K (2003) Genetic algorithm-based relevance feedback for image retrieval using local similarity patterns. Inf Process Manag 39(1):1–23
47. Oertel C, Colder B, Colombe J, High J, Ingram M, Sallee P (2008)
Current challenges in automating visual perception. In: Proceed-
ings of IEEE advanced imagery pattern recognition workshop
48. Stanford Vision Lab, http://image-net.org/, last accessed October 2016
49. Griffin G, Holub A, Perona P (2007) Caltech-256 object category
dataset
50. Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual
models from few training examples: an incremental Bayesian
approach tested on 101 object categories. IEEE. CVPR 2004
Workshop on Generative-Model Based Vision
51. Lai C-C, Chen Y-C (2011) A user-oriented image retrieval system
based on interactive genetic algorithm. IEEE Trans Instrum Meas
60:3318–3325
52. Ali N, Bajwa KB, Sablatnig R, Mehmood Z (2016) Image retrieval
by addition of spatial information based on histograms of triangu-
lar regions. Comput Electr Eng 54:539–550
53. Walia E, Pal A (2014) Fusion framework for effective color image
retrieval. J Vis Commun Image Represent 25(6):1335–1348
54. Dubey SR, Singh SK, Singh RK (2015) A multi-channel
based illumination compensation mechanism for brightness
invariant image retrieval. Multimedia Tools and Applications
74(24):11223–11253
55. Thepade S, Das R, Ghosh S (2015) Novel technique in block trun-
cation coding based feature extraction for content based image
identification. In: Transactions on computational science XXV.
Springer, Berlin, pp 55–76
56. Dalal N, Triggs B (2006) Object detection using histograms of
oriented gradients. In: Pascal VOC workshop, ECCV
57. Hu R, Collomosse J (2013) A performance evaluation of gradient
field hog descriptor for sketch based image retrieval. Comput Vis
Image Underst 117(7):790–806
58. Wangming X, Jin W, Xinhai L, Lei Z, Gang S (2008) Application
of image SIFT features to the context of CBIR. In: International
conference on computer science and software engineering, 2008,
vol 4. IEEE, pp 552–555
59. Xu P, Zhang L, Yang K, Yao H (2013) Nested-SIFT for efficient
image matching and retrieval. IEEE MultiMedia 20(3):34–46
60. Kim S, Yoon K-J, Kweon IS (2008) Object recognition using a
generalized robust invariant feature and Gestalt’s law of proximity
and similarity. Pattern Recogn 41(2):726–741
61. Lee Y-H, Kim Y (2015) Efficient image retrieval using advanced
SURF and DCD on mobile platform. Multimedia Tools and
Applications 74(7):2289–2299