Appl Intell
DOI 10.1007/s10489-017-0916-1
Fusion of local and global features for effective image
extraction
Khawaja Tehseen Ahmed1 · Aun Irtaza2 · Muhammad Amjad Iqbal1
© Springer Science+Business Media New York 2017
Abstract Image extraction methods rely on locating interest points and describing feature vectors for these key points. These interest points provide different levels of invariance to the descriptors. The image signature can be described well by the pixel regions that surround the interest points at the local and global levels. This contribution presents a feature descriptor that combines the benefits of local interest point detection with the feature extraction strengths of a fine-tuned sliding window in combination with texture pattern analysis. This process is accomplished with an improved Moravec method using the covariance matrix of the local directional derivatives. These directional derivatives are compared with a scoring factor to identify which features are corners, edges or noise. Located interest point candidates are fetched for the sliding window algorithm to extract robust features. These locally-pointed global features are combined with monotonic invariant uniform local binary patterns that are extracted a priori as part of the proposed method. Extensive experiments and comparisons are conducted on the benchmark ImageNet, Caltech-101, Caltech-256 and Corel-100 datasets and compared with sophisticated methods and state-of-the-art descriptors. The proposed method outperforms the other methods with most of the descriptors and many image categories.

Khawaja Tehseen Ahmed
khawajatehseenahmed@gmail.com

Aun Irtaza
aun.irtaza@gmail.com

Muhammad Amjad Iqbal
amjad.iqbal@ucp.edu.pk

1 Faculty of IT, University of Central Punjab, Lahore, Pakistan
2 Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan
Keywords Image extraction · Interest point detection · Image descriptor · Principal component coefficients · Sliding window · Support vector machine
1 Introduction
Several studies have contributed to computer vision and rely
on object recognition, texture classification, scene under-
standing, symmetry detection and related domains that are
based on detecting interest points, edges and corners for
feature description. Images are described by their features
to extract useful hidden patterns to produce symbolized
signatures at different levels of abstraction and understand-
ing. Different levels of image processing metrics involve
different methods of image description and image synthe-
sis. Image description techniques include global, regional
and local metrics, and image synthesis uses texture analysis
methods.
Texture analysis methods are categorized as statistical,
structural, and spectral. Statistical methods based on gray
level statistical moments describe point pixel area proper-
ties, and histograms and scatter plots are used to represent
the values. Structural techniques use structural primitives,
such as parallel lines and regular patterns. Spectral meth-
ods work in the frequency domain to represent data. Local
and global descriptors [13] are primitive image descriptors
that work in the statistical, structural and spectral domains.
Local descriptors describe patches and portions within an
image, and global descriptors describe an entire image.
Color histograms [4], shape features [5] and textures [6] are
used for local feature extraction. However, local features
are unable to produce accurate results in different image
categories. Global descriptors [1] describe objects for recog-
nition and classification. Local and global features can be
employed together to represent images in a very powerful
way. There are several applications of this hybrid scheme
for feature extraction, such as the whole-object approach,
which uses local interest point detectors, digital correlation,
and scale space super-pixels [7]. Another approach is the
partial-object method, which is derived from gray level cor-
ner detection methods [8], image moments [9], and scale
space theory [10]. Interest point descriptors [2, 3] are an
extension of these approaches that quantify the light inten-
sity, local area gradients, local area statistical features, and
the histogram of the local gradient directions. Applications
of these extended descriptors have shown better perfor-
mance in object detection, face recognition, medical image
retrieval, and specialized tasks. However, these descrip-
tors typically involve intense computations and require
significant memory resources. For image retrieval, these
descriptors capture low level image attributes, such as color,
texture or spatial information, for optimal performance that
is particularly domain specific. Consequently, they result
in low performance when the same image descriptor is
tested on image categories with complex, overlapping and
background objects.
Detectors use maxima and minima points, such as gra-
dient peaks and corners; however, edges, ridges, and con-
tours are also considered as key points for better image
understanding. For these points, [11] presented an inter-
est points taxonomy that includes intensity-based region
methods (IBRs), edge-based region methods (EBRs) and
shape-based regions (SBRs). Features were extracted by
pixel intensity based on the saturation value of the pixel in
[12]. Image retrieval is performed by applying this feature
model to image segmentation and histogram generation.
Image detectors extract features with a diverse invariance
to occlusion, rotation, illumination and scale. These fea-
ture detectors are employed in image classification, object
recognition, and semantic interpretation based on their spe-
cialty. The quality and effectiveness of interest point detec-
tion methods were evaluated against standard databases and
state-of-the-art methods in [13].
This contribution uses local features along with global
feature description by combining texture values to extract
images from multiple categories. Useful image patterns
are detected by finding edges and corners based on local
interest points. These key points are identified using pixel
intensities. Pixel intensity-based detectors are more pow-
erful interest point detectors than other methods [13]. A
fine-tuned sliding window algorithm is applied to the inter-
est points to extract the image signatures. The texture
analysis results are combined with the signatures to com-
prehensively reflect the image patterns. A novel dimension
reduction technique is used to calculate the limited covari-
ant coefficients. The proposed method provides remark-
able results on benchmarks, existing methods, renowned
databases and state-of-the-art descriptors.
The remainder of this paper is organized as follows.
Section 2 presents related work on feature extraction, and
Section 3 explains the proposed methodology. The experi-
mental results are provided in Section 4, and we summarize
our findings in Section 5.
2 Related work
A significant amount of research has been performed on
Content Based Image Retrieval (CBIR) by analyzing inter-
est points that are composed of corners, edges, contours,
maxima shapes, ridges or global features, visual contents
and semantic interpretation. These detectors [1], descrip-
tors [2] or extractors [3] can be characterized as invari-
ant or covariant, local or global. Local features are spe-
cific and context oriented. Current Content Based Image
Retrieval (CBIR) systems require image retrieval from ver-
satile image categories, images with complex overlapping
objects, cluttered images, and foreground and background
objects. Solutions are normally tested on a specific dataset
or selected categories, and the results are uncertain for other
benchmarks. A combination of global and local features that
uses the Haar discrete wavelet transform (HDWT) and gray
level co-occurrence matrix (GLCM) was presented [14], and
the results were computed for the Corel-100 dataset [15].
In another image retrieval scheme [16], LBPs are collected
and combined from each channel to describe color images.
Decoded LBPs are introduced to reduce the highly dimen-
sional patterns that are returned from multiple channels.
Experiments were performed on Corel-1k and other bench-
marks. The reported precision for the Corel-1k benchmark
was 0.749. A computationally practical approach for cap-
turing image properties from two multichannel images was
contributed by [17]. Tradeoffs were executed at the feature
and channel levels to avoid redundant image information,
and the mean average precision for the Corel-1k benchmark
was 0.709 [16]. For image retrieval, discrete cubic parti-
tioning of the image was performed in the HSV space [18].
The data were then hierarchically mapped using the hierar-
chical operator, and a similarity-based ranking scheme was
used for the resultant features. A Mean Average Precision
(mAP) of 0.797 was reported for the Corel-100 dataset. A
three stage method was proposed to identify similar images
by first finding the images by their color features [19]. To
improve the results, the images are matched by their texture
and shape features. This method accumulates global and
regional features for better accuracy. The reported preci-
sion for the Corel-1k benchmark is 0.766. In [20], images
were abstracted based on their statistical features. The Non-
subsampled Contourlet Transform (NSCT) was used to
compute the features of this Multi-scale Geometric Anal-
ysis (MGA). A graph-theoretic approach-based relevance
feedback system was also incorporated for retrieval perfor-
mance. A mAP of 0.553 was reported for this technique
for the Corel-1000 benchmark. A method was presented
to characterize an image as a generalized histogram quan-
tized by Gaussian Mixture Models (GMMs) [21]. This
method learns from training images using the Expectation-
Maximization (EM) algorithm, and the number of quantized
color bins is determined by the Bayesian Information Crite-
rion (BIC). The method gave a mAP of 0.801 for the Corel
image dataset. Color, texture and shape information was
incorporated using the Color Difference Histogram (CDH)
and Angular Radial Transform (ART) features [22]. The
mAP using min-max normalization on the Corel-1k bench-
mark was 0.783. Histograms of triangles were used to add
spatial information to the inverted index of a bag-of-features
by [23]. An image was divided into two and four triangles
that were evaluated separately. Experiments were performed
on the Corel-1000 dataset with an average precision of 0.82.
The color co-occurrence matrix (CCM) and the difference
between pixels of scan pattern (DBPSP) were used to extract
color and texture features [24]. To eliminate redundant fea-
tures, selective features were chosen by finding their high
dependency on the target class. This approach reported a
mean average precision of 0.757 for the Corel-1000 dataset.
A content-based image retrieval approach was presented in
[25] for biometric security based on color histogram, tex-
ture and moment invariants. Color histograms were used
for color features, a Gabor filter was used for the texture
features, and the moment invariants were used for shape
information. This approach reported improved results for
biometric security. A method for CBIR using Local Binary
Pattern (LBP), Hu-moments and radial Chebyshev moments
by focusing shapes and textures was presented in [26]. Ten
categories from the COIL dataset [27] were used for exper-
iments, and the method reported a 3 % higher accuracy than
previous results. A method to retrieve images using color
features by dividing images into non-overlapping blocks
and to determine the dominant color of each block using the
k-means algorithm was presented in [28]. A gray-level co-
occurrence matrix was used for texture feature extraction,
and Fourier descriptors were extracted from the segmented
images for the shape representation. The final feature vec-
tor was composed of these extracted features. The results of
experiments performed on the Corel-1000 dataset [15] were
compared with the results of histogram-based methods. An
8% improvement in precision was achieved with a 4.5
second retrieval time. A descriptor that adds spatial dis-
tribution information of the gray-level variation between
pixels in LBP for image retrieval was presented in [29].
This spatial texture descriptor constructs statistic histograms
of pattern pairs between the reference pixel and its neigh-
boring pixels. Spatial information combined with texture
features produced relatively effective results. A descrip-
tor based on shape and texture features was presented in
[30] by employing the Discrete Wavelet Transform (DWT)
and Edge Histogram Descriptor (EHD) features of MPEG-
7. The wavelet coefficients were calculated for the input
image, and the Edge Histogram Descriptor was then used
on the coefficients to determine the dominant edge orienta-
tions. This combination of DWT and EHD was tested on the
Corel-1000 dataset.
HOG [1], SIFT [2] and SURF [3] are interest point detec-
tors and image descriptors that are used in combination
with local and global descriptors for content-based image
retrieval. The time and the computational costs are barri-
ers to using these famous descriptors in CBIR systems with
complex and cluttered images of different sizes. However,
dimension reduction techniques are employed to overcome
the computation time constraint. A multilayer feed forward
neural network-based CBIR system that incorporates the
strength of SIFT for object detection was introduced by
[31]. SIFT object detection was used for CBIR by reduc-
ing the large number of key points generated by SIFT to
improve the retrieval performance [32]. Salient image parts
were extracted by a saliency-based region detection sys-
tem, and the final results were tested on VOC2005. A CBIR
system that integrates the YCbCr color histogram, edge his-
togram and shape descriptor as a global descriptor with surf
salient points using SURF as a local descriptor to enhance
the retrieval results was proposed by [33]. Experiments
were performed on the Corel-1000 and the Uncompressed
Color Image Database (UCID) databases. Velmurugan et al.
[34] combined SURF with color moments by calculating
the first and second order color moments for SURF key
points, and experiments were performed on the COIL-100
dataset. The Histograms of Oriented Gradients feature has
been used in pedestrian detection [35], face recognition [36]
and object detection [37]. For CBIR, Shujing et al. [38]
used HOG by transforming the sizes of images and calcu-
lated a feature vector of 3780 dimensions. Orthogonal lower
dimension data were achieved by applying PCA, and exper-
iments were performed on the Corel-1000 image dataset. A
CBIR called Local Tetra Patterns (LTrPs) was proposed by
Murala et al. [39] by calculating the first order derivatives in
the vertical and horizontal directions on reference pixels and
their neighbors. The performance on benchmark databases
was compared by combining this method with the Gabor
transform.
The technique presented in this paper focuses on: 1) find-
ing suitable key points to produce useful feature sets to
effectively classify images from multiple categories with
remarkable precision; 2) identifying foreground and back-
ground objects in complex images for better accuracy; and
3) introducing a new mechanism to search images by their
local and global features with low computational cost and
by storing and comparing compact image signatures for effi-
cient retrieval. In the proposed method, corners are detected
by pixel intensities to avoid unwanted key points. A fine-
tuned sliding window technique returns identifiable features
for robust classification, and a useful texture patterns analy-
sis supports the proposed method to provide more accurate
results.
3 Methodology
3.1 Intensity-based local interest point detection
The first corner detector was introduced by Moravec [40]. It
returns points with local maxima of the directional variance
measure and determines the average change in intensity by
moving a local preset detection window in different direc-
tions. This idea was also employed in [41] to investigate
the local statistics of the variations in the directional image
intensity using first order derivatives. This method results
in better subpixel precision and provides better localiza-
tion and corner detection. Our method uses the approach of
Moravec [40] by expanding the average intensity variance
and computing the Sobel derivatives and Gaussian window.
First, intensity-based local interest points are detected.
Local features provide identifiable and localized interest
points. An anchor point can be a point on a curve, the end
of a line and a corner. It can also be an identified point of
local intensity that has the maximum curvature of the points
on the curve. An auto-correlation matrix best describes the
local image features and structure. The following matrix
describes the gradient distribution in the local neighborhood
of an interest point:
M = σ_D² g(σ_I) ∗ [ I_x²(X, σ_D)   I_x I_y(X, σ_D) ;  I_x I_y(X, σ_D)   I_y²(X, σ_D) ]    (1)

with

I_x(X, σ_D) = ∂/∂x g(σ_D) ∗ I(X)    (2)

g(σ) = 1 / (2πσ²) · exp( −(x² + y²) / (2σ²) )    (3)

Local image derivatives are computed with Gaussian kernels of scale σ_D [7]. In the neighborhood of a point, the derivatives are averaged using a Gaussian window. The eigenvalues determine the principal signal changes in both orthogonal directions in the neighborhood of the point σ_I. Therefore, corners are found when the image signals vary or the eigenvalues are large. Harris [7] proposed a less computationally expensive metric that uses two eigenvalues.
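A minimal sketch of this kind of detector is given below: it builds the second-moment matrix of (1)–(3) from Sobel derivatives averaged with a Gaussian window and thresholds a Harris-style corner score. The function name, parameter values and the SciPy-based implementation are illustrative assumptions, not the authors' exact code.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def corner_candidates(gray, sigma_d=1.0, sigma_i=2.0, k=0.04, thresh=0.01):
    """Sketch of an intensity-based corner detector in the spirit of
    Moravec/Harris: Sobel derivatives, a Gaussian-averaged second-moment
    matrix, and a score separating corners from edges and flat regions."""
    g = gaussian_filter(gray.astype(float), sigma_d)    # derivative scale
    ix = sobel(g, axis=1)                               # horizontal derivative
    iy = sobel(g, axis=0)                               # vertical derivative
    # entries of the local auto-correlation (second-moment) matrix,
    # averaged over an integration window of scale sigma_i
    ixx = gaussian_filter(ix * ix, sigma_i)
    iyy = gaussian_filter(iy * iy, sigma_i)
    ixy = gaussian_filter(ix * iy, sigma_i)
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    score = det - k * trace ** 2                        # Harris-style response
    ys, xs = np.nonzero(score > thresh * score.max())   # candidate corners
    return np.column_stack([xs, ys])

# usage: points = corner_candidates(gray_image)
```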
3.2 Global feature detection using an optimized sliding
window
For image classification, the entire image is of interest for
global features. Global feature computations for an entire
image have large time and computational costs. Local fea-
tures describe image patches around interest points, while
global features describe an image as a single vector. With an
increasing number of local features, large numbers of fea-
ture vectors are generated, which are difficult to match and
store. To overcome these problems, we used local and global
features in an intuitive way to compute the global features
only for the detected local features of interest.
A sliding window slides a fixed size frame across an
image. The object’s size, location, positioning and scaling
are directly impacted by the block size, cell size within the
block, orientation angle and block overlap. The optimized
values for these parameters correctly classify an image. Our
technique tunes the sliding window technique against these
parameters for the datasets. This optimized sliding win-
dow extracts feature vectors for the detected intensity-based
interest points. For the global feature detection using the
local intensity points, discrete values of quantized visual
features are represented in histograms. Pixel edges of 8 bin
histograms are used as cells (Fig. 1).
The edge magnitude and orientation are computed using
a first order Sobel kernel filter. Histograms are constructed
using the following equation, where y denotes bins, and z
denotes the cells:
h(y, z) = Σ_{x∈z} ||∇I(x)||  if Θ(x) = y,  0 otherwise    (4)

where Θ(x) is the orientation of the edge, and ||∇I(x)|| is the magnitude. A histogram of the gradient orientation
is computed for each cell. Histogram normalization is then
performed by accumulating the local histograms for each
block and applying them over all of the cells in the block.
A cell size of 4×4 is used to capture the small-scale spatial
information. Significant local pixel information is captured
by using a block size of 2×2. This helps to express local illu-
mination changes that are not lost when the small block size
is averaged. Parts of the overlapping adjacent blocks pro-
duce better contrast normalization. To produce non-massive
feature vectors, an overlap of half of the block size is used.
An increased number of orientation bins results in large
feature vectors, so our optimized technique uses 7–9 bins,
which produce a relatively small number of feature vectors
with respect to the number of blocks and the cell size.

Fig. 1 Proposed method showing the step-by-step feature extraction process for an input image

Better orientation results are not observed when values are evenly spaced between ±180 degrees, so in our technique, values are placed between 0 and 180 degrees by placing negative values in the positive bins.
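The cell histogram of (4) and the block normalization described above can be sketched as follows. This is an assumed NumPy implementation (np.gradient in place of the Sobel kernel, 8 unsigned orientation bins, 4×4-pixel cells, 2×2-cell blocks with half-block overlap), not the paper's reference code.

```python
import numpy as np

def orientation_histograms(patch, n_bins=8, cell=4):
    """Sketch of the per-cell gradient orientation histogram of (4): each
    pixel votes its gradient magnitude into the bin of its unsigned
    (0-180 degree) gradient orientation."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0         # fold into [0, 180)
    h_cells, w_cells = patch.shape[0] // cell, patch.shape[1] // cell
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros((h_cells, w_cells, n_bins))
    for r in range(h_cells):
        for c in range(w_cells):
            m = mag[r*cell:(r+1)*cell, c*cell:(c+1)*cell]
            b = bins[r*cell:(r+1)*cell, c*cell:(c+1)*cell]
            hist[r, c] = np.bincount(b.ravel(), weights=m.ravel(),
                                     minlength=n_bins)
    return hist

def block_normalize(hist, block=2, eps=1e-6):
    """L2-normalize 2x2-cell blocks with a one-cell (half-block) overlap
    and concatenate them into the feature vector for one window position."""
    feats = []
    for r in range(hist.shape[0] - block + 1):
        for c in range(hist.shape[1] - block + 1):
            v = hist[r:r+block, c:c+block].ravel()
            feats.append(v / np.sqrt(np.sum(v**2) + eps))
    return np.concatenate(feats)

# usage: vec = block_normalize(orientation_histograms(window_around_point))
```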
3.3 Efficient uniform local binary pattern texture
analysis
An operator that is invariant to monotonic transformations
of the gray scale is used for the texture analysis. It works
on a circularly symmetric neighboring 3 ×3 set of eight
pixels. The strength of the technique is that it uses a lim-
ited subset of pixels to achieve computation simplicity. The
selected subset of pixels is ‘uniform’ in patterns, which
results in rotational invariance. The fewer spatial transitions
employed in uniform patterns are more tolerable to rotation
changes [42]. In this process, a local 3 ×3 neighborhood
with a gray level distribution of nine pixels is selected first.
In the circularly symmetric neighborhood of eight pixels,
the gray values of the diagonal pixels are computed by inter-
polation. To achieve gray scale invariance, the gray values
of these eight pixels are subtracted from the center pixel.
This can be calculated by:

T = p(v_0, v_1 − v_0, v_2 − v_0, v_3 − v_0, v_4 − v_0, v_5 − v_0, v_6 − v_0, v_7 − v_0, v_8 − v_0)    (5)

where T is the texture, p is the pixel, and v is the gray scale value.
The center pixel v_0 contains the overall luminance of the image, which is not required for the local texture analysis; therefore, it is discarded for the gray level distribution calculation [43]. Invariance is easily achieved if only the sign of the gray scale difference is noted for the pixel pairs to be differentiated. The minor change to the LBP [43] can be illustrated as [42]:

LBP_8 = Σ_{i=1}^{8} sign(v_i − v_0) 2^{i−1}    (6)
LBP_8 and the traditional LBP [44] have different indexed neighbors and interpolated diagonal values. Both differences form a base of the rotational invariance of LBP_8. The circularly symmetric neighborhood set of eight pixels produces 2^8 outputs. The gray pixel value is moved along the perimeter. A right rotation operation is performed on the pixel values so that the bit values have the maximum number of zeros starting from the eighth bit. This process is formulated as [42]:

LBP_8^{ri36} = min{ ROR(LBP_8, i) | i = 0, 1, ..., 7 }    (7)
As mentioned in [45], these 36 unique rotation invariant local binary patterns carry micro-scale features; for example, the least significant bit shows a bright spot, the most significant bit shows a dark spot, and diagonals show edges. Moreover, in some cases, suboptimal results are observed due to values rotated by 45°. A large number of spatial transitions occurs when the uniformity value is large, so considering this dependency, a value of 2 is used for uniformity. Thus, LBP_8^{ri36} is reformed as:

LBP_8^{riu2} = Σ_{i=1}^{8} sign(v_i − v_0)   if U(LBP_8) ≤ 2    (8)
Similarly, LBP_16 is calculated as:

LBP_16 = Σ_{i=1}^{16} sign(v_i − v_0) 2^{i−1}    (9)

A distribution of 16 bits along the perimeter produces 65,536 output values, which contain 243 different patterns. Similarly, defining a uniform rotation produces invariant patterns of the 16 bit version:

LBP_16^{riu2} = Σ_{i=1}^{16} sign(v_i − v_0)   if U(LBP_16) ≤ 2    (10)

Mapping of LBP_16 to LBP_16^{riu2} is performed using a lookup table.
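A compact sketch of the 8-neighbour riu2 operator is given below: uniform codes are mapped to the count of set bits and all non-uniform codes to a single extra bin. For brevity the diagonal neighbours are taken directly rather than interpolated, so this is an approximation of the operator described above rather than the exact implementation.

```python
import numpy as np

def lbp8_riu2(gray):
    """Sketch of the rotation-invariant uniform LBP (riu2) of (8) for the
    8-pixel, radius-1 neighbourhood."""
    g = gray.astype(float)
    c = g[1:-1, 1:-1]                                    # centre pixels
    # 8 circular neighbours (clockwise), one shifted view per direction
    nbrs = [g[0:-2, 1:-1], g[0:-2, 2:], g[1:-1, 2:], g[2:, 2:],
            g[2:, 1:-1], g[2:, 0:-2], g[1:-1, 0:-2], g[0:-2, 0:-2]]
    bits = np.stack([(n >= c).astype(int) for n in nbrs], axis=-1)
    # uniformity U: number of 0/1 transitions around the circle
    trans = np.abs(bits - np.roll(bits, 1, axis=-1)).sum(axis=-1)
    ones = bits.sum(axis=-1)
    codes = np.where(trans <= 2, ones, 9)                # non-uniform -> 9
    # 10-bin histogram (codes 0..8 for uniform patterns plus one extra bin)
    return np.bincount(codes.ravel(), minlength=10)

# usage: texture_hist = lbp8_riu2(gray_image)
```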
3.4 Generation of feature vectors
a) The sliding window extracts feature vectors for the
intensity-based detected local interest points. The interest
points are edges and corners, which are a limited set of
the image pixels. Feature extraction is performed on these
interest points. The global extraction depends on the num-
ber of bins, the number of overlapping blocks, and the
cell and block sizes. These generated values are very small
compared to the signatures generated by state-of-the-art
descriptors [2,3], which impacts the computational time for
large databases. These points are extracted once for each
image, but this process takes a long time if they need to be
used for classification and image retrieval purposes. It can
also take a long time to access these feature vectors if they
are stored and retrieved.
b) Similar feature vector strengths are produced using texture analysis with a uniform local binary pattern, as described in Section 3.3.
Steps a) and b) represent the image with distinct fea-
tures. Their concatenation results in better classification, but
it is computationally and time intensive. Therefore, these
feature vectors must be reduced with minimum information
loss before concatenation for efficient processing.
3.5 Feature vector optimization algorithm
and calculating the principal components
The concatenated feature vectors contain useful informa-
tion about the images. The coefficients returned by principal
component analysis are often large enough to utilize high
computational power. This is even more crucial in large
database scenarios. The returned coefficients are directly
proportional to the number of observations. The descriptors
[13] return voluminous rows and columns, which provide
a base for the coefficient calculations. A smaller set of prin-
cipal components can be obtained by limiting the number
of observations. By row elimination, the signature subset or
similar approaches result in fewer observations but signifi-
cant information loss of the original data, which cause worse
prediction and classification results.
The algorithm in our technique is trained for differ-
ent image datasets, image dimensions, image resolutions,
pixel intensities and image formats. The algorithm pri-
marily checks all of the required and related information
for the dataset on which the experiments are performed.
After this examination, the program has an understanding
of the images. The algorithm takes the feature vectors as
inputs, and they are reshaped based on the optimal num-
ber of observations as inputs for the coefficient calculations
before computing the principal components. The number
of observations is optimized based on the image attributes
that were described previously (e.g., dimension, resolution).
The reshaped feature vectors are then input to the princi-
pal component calculation and result in fewer coefficients
with large variances. These limited principal components
identify the feature vectors comprehensively with very little
information loss. The novelty of the technique for limited
coefficients is that it provides nearly the same precision
as using the complete set of feature vectors without subset
optimization. The feature vectors produced by the texture
analysis are optimized using the feature vector optimization
algorithm on which the principal components are computed.
The texture information additionally carries monotonic and
rotational invariance characteristics, which can perform bet-
ter prediction along with the global features. To achieve
this, the texture features are concatenated with intensity-
based interest points extracted with the sliding window.
The coefficient-based reduced, optimized and concatenated
feature vectors contain texture and object recognition capa-
bilities for simple, overlapping and complex images. The
image retrieval time is also reduced due to the slim size of
the feature vectors. In cases of image retrieval from thou-
sands to millions of images, these compact signatures are
efficient as well as storage friendly.
The number of coefficients generated by PCA is propor-
tional to the dimensions of the input feature. The feature
vectors returned by different local and global methods vary
in size. These hybrid feature vectors produce a large number
of principal components. The number of PCs varies from
hundreds to thousands per image. Even after the reduction
process, classification of such a large number of feature
vectors is time intensive. For the datasets, our algorithm
reorders the feature vectors and generates between 80 and
120 observations depending on several characteristics, such
as the size and type of image attributes and the pixel inten-
sities. Based on the number of observations, each image
is represented by 80–120 coefficients. This image signa-
ture size is very small compared to the output produced by
descriptors [1,2] or the standard PCA.
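One way to realize this reshaping idea is sketched below: the concatenated vector is folded into a fixed number of pseudo-observations and only the leading principal-component scores are kept as the signature. The choice of 100 observations, the zero padding and the use of a single principal direction are assumptions made for illustration; the paper tunes the number of observations (80–120) to the image attributes.

```python
import numpy as np

def compact_signature(feature_vec, n_obs=100, n_keep=None):
    """Sketch of the reduction idea of Section 3.5: reshape one long
    concatenated feature vector into n_obs pseudo-observations, run PCA,
    and keep only the leading, high-variance coefficients."""
    v = np.asarray(feature_vec, dtype=float).ravel()
    width = int(np.ceil(v.size / n_obs))
    padded = np.zeros(n_obs * width)
    padded[:v.size] = v
    X = padded.reshape(n_obs, width)                     # pseudo-observations
    Xc = X - X.mean(axis=0)                              # centre columns
    # PCA via SVD; project observations onto the first principal direction
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, 0] * S[0]                              # one score per row
    return scores[:n_keep] if n_keep else scores         # ~n_obs coefficients

# usage: signature = compact_signature(np.concatenate([global_vec, lbp_hist]))
```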
3.6 Image classification using Support Vector Machine
(SVM)
A Support Vector Machine is a discriminative classifier that
separates two classes of points by a hyperplane. It is a super-
vised learning model that analyzes data that are used for
classification and regression analysis tasks.
Let the input belong to one of two classes as [46]:
{(x_i, y_i)}_{i=1}^{N},  y_i ∈ {+1, −1}    (11)

where x_i is the input set, and y_i are the corresponding labels. Hyperplanes are assigned the values of the weight vector 'w' and bias 'b' as follows:

w^T · x + b = 0    (12)

and hyperplanes with a maximum margin of 2/||w|| are found such that the two classes can be separated from each other; i.e.,

w^T · x_i + b ≥ +1    (13)

w^T · x_i + b ≤ −1    (14)

or

y_i (w^T · x_i + b) ≥ +1    (15)

The kernel version solution of the Wolfe dual problem is then found with the Lagrange multipliers α_i:

Q(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j K(x_i, x_j)    (16)

where α_i ≥ 0 and Σ_{i=1}^{m} α_i y_i = 0.

Based on the kernel function, the SVM classifier is given by:

F(x) = sgn(f(x))    (17)

where f(x) = Σ_{i=1}^{l} α_i y_i K(x_i, x) + b is the output hyperplane decision function of the SVM. High values of f(x) represent high prediction confidence, and low values of f(x) represent low prediction confidence.
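As a toy illustration of this classification stage, the sketch below trains a kernel SVM on compact signatures and predicts category labels. Scikit-learn's SVC, the RBF kernel and the random placeholder data are assumptions for demonstration only; the paper trains a binary SVM per category with positive and negative samples.

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch, assuming `signatures` is an (n_images, n_coefficients)
# array produced by the steps above and `labels` holds image categories.
rng = np.random.default_rng(0)
signatures = rng.normal(size=(60, 100))                  # placeholder features
labels = np.repeat(np.arange(3), 20)                     # three toy categories

clf = SVC(kernel="rbf", C=10.0)                          # kernelised SVM, (16)-(17)
clf.fit(signatures[::2], labels[::2])                    # train on half the data
predicted = clf.predict(signatures[1::2])                # classify the rest
print((predicted == labels[1::2]).mean())                # toy accuracy
```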
4 Experimentation
4.1 Datasets
Most of the datasets are tailored for custom tasks depending
on the nature of the project. Many contributions use domain-
based image types. Experiments performed on a dataset
are difficult to compare with those performed on another
dataset. The accuracy of the results is directly affected by
the image attributes, such as color, object location, quality,
size, overlapping, occlusion, and cluttering [47]. In our
case, widely-used datasets and their respective categories
are selected by considering the following characteristics:
– Diverse image categories
– General content-based image retrieval usage
– Categories contain many types of textures, foreground and background objects, colors and spatial features
– Images from different areas of life to test the descriptor's effectiveness
The selected subsets are representative of the respective
datasets and include diverse categories from different areas,
object orientations, shapes and textures, and global and
spatial information. Therefore, the results that are based
on the selected categories are representative of the entire
dataset. Experiments are performed on a variety of standard-
ized datasets, including ImageNet [48], Caltech-256 [49],
Caltech-101 [50] and Corel-1000 [15]. The sampling, object
categories, and image characteristics of each category are
described below.
4.1.1 Corel-1000 dataset
The Corel-1000 dataset is a benchmark that is widely used
in the literature for classification tasks [16–19, 51]. The
Corel database includes many semantic groups, including
scene, nature, people, flowers, animal, and food. It consists
of 1000 images in 10 categories. Each semantic category
consists of 100 images with a resolution of 256×384 pixels
or 384×256 pixels. Our algorithm randomly selects 70 %
of the images from each category for training and 30 % for
testing. A total of 660 images from all of the categories is
used for training, and 330 images are used for testing.
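A per-category random split of this kind can be sketched as follows; the function name and the `images_by_category` mapping (category name to image paths) are assumptions used only to make the procedure concrete.

```python
import random

def split_per_category(images_by_category, train_frac=0.7, seed=0):
    """Sketch of a per-category random 70/30 train/test split."""
    rng = random.Random(seed)
    train, test = [], []
    for category, items in images_by_category.items():
        items = list(items)
        rng.shuffle(items)
        cut = int(round(train_frac * len(items)))
        train += [(path, category) for path in items[:cut]]
        test += [(path, category) for path in items[cut:]]
    return train, test
```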
4.1.2 ImageNet synset
The ImageNet synset [48] is a large-scale image database
that is used to index, retrieve, organize and annotate multi-
media data. It is organized based on the WordNet hierarchy.
Each meaningful concept in WordNet that can be described
by multiple words or word phrases is called a synset.
WordNet contains more than 100,000 synsets, which are
dominated by nouns (80,000+). The repository contains
an enormous collection of more than 14,197,120 images.
Experiments were performed on 15 synsets downloaded
from the ImageNet repository [48], including aerie, car,
cherry tomato, coffee cup, dish, dust bag, flag, flower, gas
fixture, golf ball, heat exchanger, monitor, mud ceram-
ics, spoon and train. These synsets were selected from the
semantic groups of plants, natural objects, artifacts, devices,
containers, ceramics, arms, and equipment. These synsets
were selected due to their versatility, textures, complexity
and object orientation features. The cherry tomato contains
small and medium foreground and background objects as
well as overlapping objects. The flag synset contains spe-
cific color-oriented objects; in other words, color and tex-
ture are both used to classify this category. The gas fixture
and aerie synsets are complex and cluttered object cate-
gories. Both contain spatial information due to their hanging
nature. The golf ball and cherry tomato synsets have similar
round object orientations with color differences. It is chal-
lenging to distinguish these synsets. The artifacts group con-
tains structural complexities with semantics associations;
therefore, classification of this synset requires careful anal-
ysis of local and global features. The equipment and devices
groups are sometimes semantically the same. However, the
object- and texture-based training leads to better classifica-
tion. These 15 synsets contain 13,554 images, from which
100 images were randomly selected from each synset for
the experiments; i.e., 1500 (100×15) images were used for
training and testing (Fig. 2).
A total of 1050 images were used for training, and 450
were used for testing, by randomly selecting them from each
category. The positive training samples included two-thirds
of the candidates, which were randomly selected from each
category. The negative training samples included one-third
of the total.
4.1.3 Caltech-256 dataset
This dataset is a challenging set of 256 object categories
that contain 30,607 images [49]. It is a successor to the
Caltech-101 dataset. Image classification in Caltech-256 is
more difficult than in Caltech-101 [50] because it has more
variations. We performed experiments on 15 diverse cate-
gories, including AK47, American flag, backpack, baseball
bat, baseball glove, bear, mug, binocular, calculator, car
tire, Cartman, CD, cockroach, desk globe and comet. The
semantic groups were selected carefully to represent many
areas of real life. These categories contain animals, flags,
guns, accessories, tires, insects, computer accessories, daily
used entities and images with complex and overlapping
objects. Some of the categories are important because of
their texture patterns, whereas others are important because
of their foreground and background objects. The desk globe,
car tire and CD are round objects. Their classifications are
based on their orientations and textures. The cockroach was
selected from the insect category. Recognizing an insect in
an image requires the technique to have object recognition
capabilities. The American flag contains specific color and
texture information that can be used to classify it. Cartman
and the binocular are normally in complex backgrounds and
Fig. 2 a ImageNet synsets with 15 image samples (one image from each category). b Corel-1000 dataset showing 15 sample images from 10 categories. c Caltech-256 dataset showing 15 sample images from 15 categories (one image per category). d Caltech-101 dataset showing 15 sample images from 15 categories (one image per category)
contain overlapping objects. A total of 1050 images were
used for the experiments by selecting 70 images per cate-
gory. Our algorithm randomly selects 70 % of the images
from each category for training and 30 % of the images for
testing. A total of 735 images from all of the categories were
used for training, and 315 were used for testing. In the train-
ing phase, positive samples are chosen randomly from the
respective category. Of each category, 70 % is used for pos-
itive training, and the remaining 30 % are negative training
samples. The negative training samples are gathered ran-
domly from the rest of the categories by selecting an equal
proportion from each semantic group.
4.1.4 Caltech-101 dataset
Caltech-101 [50] is a benchmark that is widely used for
image categorization, recognition and classification. It con-
tains a total of 9146 images in 101 distinct categories.
Fifteen categories were selected for the classification, including
airplane, ferry, camera, brain, cougar face, grand piano, Dal-
matian, dollar bill, starfish, soccer ball, minaret, motorbikes,
revolver, sunflower and Windsor chair. These categories
were chosen due to their ability to contribute spatial infor-
mation, rounded objects, and objects with different shape,
texture and color information to test the effectiveness of
the proposed method. The brain and sunflower groups were
considered because of their textures. The dollar bill and
cougar face are categories with complex object structures
and orientations. The camera, revolver and Windsor chair
categories require specific object recognition capability. The
minaret and airplane share spatial and texture information
for classification. A total of 1050 images were used for
the experiments by selecting 70 images per category. Our
algorithm randomly selects 70 % of the images from each
category for training and 30 % of the images for testing. A
total of 735 images from all of the categories was used for
training, and 315 images were used for testing.
4.2 Results
4.2.1 Input process
In the first step, the color space is converted to gray
scale for efficient computation. The gray scale image is
then processed to detect the intensity-based local interest
points. Global features are extracted for these interest points
using the optimized sliding window. The extracted features
are concatenated with the texture features that are invari-
ant to monotonic and rotation changes. The feature vector
concatenation is followed by applying the proposed fea-
ture reshaping technique. Coefficients are generated for the
restructured observations. These data are passed to the sup-
port vector machine for classification. The support vector
machine is involved in two phases: training and testing.
During the training phase, the fused and reduced extracted
feature vectors are input to the support vector machine. The
positive training samples are randomly selected from the
respective categories, and the negative training samples are
collected from the other categories. Two times more positive
training samples are used than negative training samples.
Each training sample is labeled as belonging to one or the
other sample type. The supervised learning model of the
support vector machine learns new examples of one or the
other category, which makes it a non-probabilistic binary
linear classifier.
4.2.2 Precision and recall evaluation
Precision is the specificity measure or positive predicted
value, and recall is the sensitivity measure or true positive
rate evaluation. Precision and recall are calculated on each
image category and also for small and large databases. The
precision and recall results are tested on different sets of
training and testing data.
precision = N_A(q) / N_R(q)    (18)

recall = N_A(q) / N_t    (19)

where N_A(q) represents the relevant images that match the query image, N_R(q) represents the images retrieved against the query image, and N_t is the total number of relevant images available in the database.
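These two measures can be computed per query as in the following sketch, where the argument names are illustrative; for the top-20 evaluation used below, `retrieved` would hold the 20 returned images.

```python
def precision_recall(retrieved, relevant):
    """Sketch of (18)-(19): `retrieved` is the ranked result list for one
    query and `relevant` is the set of all database images that share the
    query's category."""
    matches = sum(1 for img in retrieved if img in relevant)   # N_A(q)
    precision = matches / len(retrieved)                       # / N_R(q)
    recall = matches / len(relevant)                           # / N_t
    return precision, recall
```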
4.3 Experimental results
4.3.1 Results of the Corel-1000 dataset with existing
methods
To determine the accuracy of the proposed technique, we
performed experiments on widely-used benchmarks. The
experimental results are compared with those from exist-
ing methods as well as with the state-of-the-art descriptors
SIFT, SURF, and HOG. The results are also compared with
those of Dubey et al. [16], Xiao et al. [17], Zhou et al. [18],
Shrivastava et al. [19], Kundu et al. [20], Zeng et al. [21],
Walia et al. [22], Ashraf et al. [23] and ElAlami et al. [24]
whose methods achieved remarkable performance. Their
standardized work has also been cited by current researchers
[52–55]. Figure 3 shows a graphical representation of the
results of the proposed method compared to those from
existing state-of-the-art methods. The results show that the
proposed method outperforms most of the other methods.
Figure 3a shows the average precision rates in comparison
Fig. 3 a: Comparison of the average precisions obtained by the proposed method and other standard retrieval systems using the Corel-1000
dataset. b: Comparison of the average recalls obtained by the proposed method and other standard retrieval systems using the Corel-1000 dataset
with those of existing methods. The proposed method shows
remarkable performance in most of the image categories.
The average recall rates are shown in Fig. 3b. The results
show that the proposed method has better recall rates in
most of the categories and that the mean average recall is
higher than those of other methods.
Table 1 shows a comparison of the average precision of
the proposed method with those of the standard retrieval
systems. The proposed system provides better precision in
most of the semantic groups; it outperforms in the semantic
groups of Africa, beach, building, bus, elephant, moun-
tain and food. The proposed method extracts local texture
and global features, which provide better results. The exist-
ing methods provide better results in some categories; for
example, [19] gives better results for dinosaur and flower.
However, the proposed method provides better results in
Table 1 Comparison of the average precision obtained by the proposed method and other standard retrieval systems on the top 20 results

Class      Proposed method  Dubey [16]  Xiao [17]  Zhou [18]  Shriv. [19]  Kundu [20]  Zeng [21]  Walia [22]  Ashraf [23]  ElAlami [24]
Africa     0.90             0.75        0.67       0.85       0.74         0.44        0.72       0.51        0.65         0.72
Beach      0.92             0.55        0.60       0.53       0.58         0.32        0.65       0.90        0.70         0.59
Building   0.88             0.67        0.56       0.72       0.62         0.52        0.70       0.58        0.75         0.58
Bus        0.98             0.95        0.96       0.85       0.80         0.60        0.89       0.78        0.95         0.89
Dinosaur   0.97             0.97        0.98       1.00       1.00         0.40        1.00       1.00        1.00         0.99
Elephant   0.85             0.63        0.53       0.68       0.75         0.80        0.70       0.84        0.80         0.70
Flower     0.93             0.93        0.93       0.94       0.92         0.57        0.94       1.00        0.95         0.92
Horse      0.86             0.89        0.82       0.99       0.89         0.75        0.91       1.00        0.90         0.85
Mountain   0.84             0.45        0.46       0.55       0.56         0.57        0.72       0.84        0.75         0.56
Food       0.92             0.70        0.58       0.86       0.80         0.56        0.78       0.38        0.75         0.77
Average    0.904            0.749       0.709      0.797      0.766        0.553       0.801      0.783       0.820        0.757

Bold entries show the 'largest value' for the respective rows
these and other categories. Similarly, [16] provided a good
precision rate in horse classification. The proposed method
also has good accuracy in this category. Overall, the pro-
posed method provides an increase in the mean average
precision of 0.084 %.
Table 2 shows the average recall rates obtained by the
proposed methods and standard retrieval systems. The pro-
posed method has remarkable recall rates in seven of the
ten categories. Better classification leads to improved recall
rates even in the complex semantic groups, such as Africa,
mountain and food. The dinosaur and elephant categories
are relatively easy to classify, and most of the existing meth-
ods provide better results in these categories. The proposed
method provides high recall rates in the dinosaur and bus
categories as well as in complex groups, such as flower and
beach.
Figure 4 shows the mean average precision and recall
rates for the proposed method and the existing methods.
Figure 4a shows that the proposed method has a higher
mean average precision rate than the existing methods, and
Fig. 4b shows that it has significantly better mean average
recall rates. The recall rate is improved by 0.017 % over
those from the existing methods [19].
4.3.2 ImageNet Synset results
Experiments were performed on ImageNet synsets to check
the robustness and versatility of the proposed method. The
results are shown for the top 20 images. In the testing phase,
feature vectors of an input image are extracted using the
proposed method. The support vector machine classifies the
input image based on the training data. Input images are
selected from each category to check the precision and recall
rates for each category, and the results are computed for 20
images. The classified images for each category yield the
precision and recall rates for that category. For this bench-
mark, the mean average precision is 0.735 %, and the mean average recall is 0.147 % (Figs. 5 and 6).
Table 2 Comparison of the average recalls obtained by the proposed method and other standard retrieval systems on the top 20 results

Class      Proposed method  Dubey [16]  Xiao [17]  Zhou [18]  Shriv. [19]  Kundu [20]  Zeng [21]  Walia [22]  Ashraf [23]  ElAlami [24]
Africa     0.18             0.08        0.07       0.17       0.15         0.09        0.14       0.10        0.13         0.14
Beach      0.18             0.06        0.06       0.11       0.12         0.06        0.13       0.18        0.14         0.12
Building   0.18             0.07        0.06       0.14       0.12         0.10        0.14       0.12        0.15         0.12
Bus        0.20             0.10        0.10       0.17       0.16         0.12        0.18       0.16        0.19         0.18
Dinosaur   0.19             0.10        0.10       0.20       0.20         0.08        0.20       0.20        0.20         0.20
Elephant   0.17             0.06        0.05       0.14       0.15         0.16        0.14       0.17        0.16         0.14
Flower     0.19             0.09        0.09       0.19       0.18         0.11        0.19       0.20        0.19         0.18
Horse      0.17             0.09        0.08       0.20       0.18         0.15        0.18       0.20        0.18         0.17
Mountain   0.17             0.05        0.05       0.11       0.11         0.11        0.14       0.17        0.15         0.11
Food       0.18             0.07        0.06       0.17       0.16         0.11        0.16       0.08        0.15         0.15
Average    0.181            0.075       0.071      0.159      0.153        0.111       0.160      0.157       0.164        0.151

Bold entries show the 'largest value' for the respective rows
Fig. 4 a Graphical representation of the mean average precisions on the Corel dataset. b Graphical representation of the mean average recalls on the Corel dataset
4.3.3 Caltech-256 dataset results
To check the effectiveness of the proposed method, the
results are compared with those from state-of-the-art meth-
ods. A total of 1050 images are randomly selected from
15 preselected image categories for training and testing. A
batch of 14 images is used to test each category. A total of
15 such batches are used to obtain the precision and recall
rates for each category. The results are shown for the top
14 images from the batch of 50 relevant images. The pro-
posed method outperforms the others in most of the image
categories. The results show a mean average precision of
0.865 % and a mean average recall rate of 0.242 %. Caltech-
256 is considered a challenging dataset that contains com-
plex images. The proposed method provides exceptional
results for the AK47, baseball bat, desk globe, car tire
and CD image categories, which contain uncrowded back-
grounds and objects with clear boundaries. Sample images
from these categories are shown in Fig. 7a. However, the
results of the proposed method are equally good for the
other categories, which include cluttered objects, overlap-
ping objects, and complex backgrounds as shown in Fig. 7b.
Fig. 5 Average precisions and recall rates for the ImageNet synset. The results are computed for the proposed method with 15 synsets
Fig. 6 Average precision and recall rates of the proposed method on 15 categories of the Caltech-256 dataset
4.3.4 Caltech-101 dataset results
The average precision and recall rates for 15 categories of
the Caltech-101 dataset are shown in Fig. 8. Images with
different foregrounds and backgrounds, object shapes, and
textures are selected for classification. The proposed tech-
nique provides better precision in all of the categories by
processing the local features with global values. The recall
rates for Caltech-101 are also promising. Most of the cate-
gories have high recall rates, while a few have average rates.
The Windsor chair and camera have average rates due to the
complex backgrounds and cluttered objects. The mean aver-
age precision obtained for this benchmark is 0.884 %, and
the mean average recall is 0.248 %.
4.4 Comparative analysis against key point detectors
and descriptors
Feature detectors and descriptors are used in object detec-
tion and recognition. Detectors refer to the tool that extracts
the features from the image, such as corner, blob or edge
detectors. Extractors are used to read the features from the
interest points. HOG [1], SIFT [2], and SURF [3] are well-
known object detectors and descriptors that are widely used
in many applications. HOG was presented at the Confer-
ence on Computer Vision and Pattern Recognition (CVPR)
and is used for object detection [56], image classification
[56] and image retrieval [57] tasks. SIFT was presented in
the proceedings of the International Conference on Com-
puter Vision (ICCV) and is used for content-based image
retrieval [58,59] and object detection tasks [60]. SURF
was presented at the European Conference on Computer
Vision (ECCV) and is used for image retrieval [61] and
related tasks. These descriptors are compared to test the
effectiveness of the proposed method. For the experiments,
1050 images are randomly selected from 15 categories, and
each category contains 70 images. Our algorithm randomly
selects 2/3 of the images from each category for training
and 1/3 of the images for testing. A total of 735 images
from all of the categories was used for training, and 315
were used for testing. In the training phase, positive samples
are taken randomly from the respective category. Positive
Fig. 7 a Sample images from the categories with exceptional results from Caltech-256. b Sample images with overlapping objects and complex backgrounds in Caltech-256
Fig. 8 Average precision rates of the proposed method on 15 categories of the Caltech-101 dataset
samples make up 70 % of each category, and the negative
training samples (30 %) are selected from the rest of the
categories.
4.4.1 Computational load
Experiments are performed with HOG, SIFT, and SURF,
and the results are compared to those of the proposed
method. These descriptors, particularly SIFT, produced
results with very high computational times. Moreover,
redundant and massive feature vectors are produced, which
require large amounts of processing time and system
resources for computation and classification. The proposed
method performed the classification with very low time and
computation costs. The computational efficiency achieved
by processing a limited set of feature vectors from the pro-
posed reordering algorithm generated a compact input that
was used to obtain compact coefficients. The computational
load is an aggregate of the gray level conversion of the input
image, the feature extraction using the image descriptor, fea-
ture reduction and comparison with the dataset for classifi-
cation. The proposed method consumed a total computation
time of 0.70083 sec/image, which is 35.5 %, 71.22 % and
59.7 % less than HOG, SIFT, and SURF, respectively.
4.4.2 Precision rates
Descriptors are unable to perform equally well in all image
categories due to their limits of effectiveness. Descriptor
[4] is suitable for local features, but it is unable to provide
accurate results for global features. Similarly, the detector
with the best ability to predict texture patterns is unable
to accurately recognize objects. In addition, the descrip-
tors that are suitable for finding edges and corners are not
good candidates for texture analysis. Therefore, none of the
state-of-the-art descriptors are ideal candidates for feature
extraction in versatile image categories. However, the pro-
posed descriptor is able to find the textures, edges, corners,
and pixel intensities and recognize complex and overlapping
objects.
Figure 9a shows the results of the proposed method in
comparison with those of the state-of-the-art descriptors for
15 categories of the Caltech-101 dataset. Some of the detec-
tors show better results in some image categories because
they were designed for those categories. The descriptors
perform well in their areas of specialty.
Table 3 shows a comparison of the average precisions
of the proposed descriptor with those of the state-of-the-art
descriptors HOG, SIFT and SURF. Experiments are per-
formed with all of the descriptors to check the strength of the
proposed descriptor. The proposed method shows remark-
able performance in the sunflower, motorbike, starfish, ferry
and brain categories. The mean average precision obtained
by the proposed descriptor is 0.158 % higher than that of the
HOG descriptor.
Table 4 compares the experimental results of the pro-
posed method with those of the state-of-the-art descriptors
using the Caltech-256 dataset. The proposed method has
better precision than the existing methods in 13 of the 15
categories. For the other two categories, the precision is
almost the same as that reported by SURF. The results of
the Corel-1000 collection are shown to check the effec-
tiveness of the proposed method compared to those of the
state-of-the-art descriptors. The proposed method provides
Fig. 9 a Comparison of the average precisions obtained by the proposed method compared with those of the state-of-the-art descriptors on 15 categories of the Caltech-101 dataset. b Comparison of the average precisions obtained by the proposed method compared with those of the state-of-the-art descriptors on 10 categories of the Corel-1000 dataset
better results for most of the image categories. The proposed descriptor has a 0.036 % better mean average precision than the HOG descriptor.
4.4.3 Recall rates
Figure 10 shows the recall rates for Caltech-101. The results
show that the state-of-the-art descriptors provide better per-
formance in some image categories and below average
Table 3 Comparison of the average precisions obtained by the proposed method compared with those from the state-of-the-art descriptors on 15 categories of the Caltech-101 dataset

Class          Proposed method  HOG [1]  SURF [3]  SIFT [2]
Airplanes 0.94 0.95 0.72 0.76
Ferry 0.88 0.66 0.66 0.75
Camera 0.84 0.85 0.72 0.77
Brain 0.91 0.79 0.70 0.71
Cougar face 0.87 0.83 0.71 0.72
Grand piano 0.92 0.90 0.73 0.76
Dalmatian 0.90 0.73 0.68 0.70
Dollar bill 0.86 0.35 0.40 0.76
Starfish 0.88 0.45 0.74 0.77
Soccer ball 0.87 0.82 0.71 0.74
Minaret 0.88 0.66 0.79 0.74
Motorbikes 0.90 0.78 0.72 0.73
Revolver 0.85 0.73 0.71 0.70
Sunflower 0.92 0.66 0.68 0.50
Windsor chair 0.84 0.74 0.73 0.51
Average 0.887 0.728 0.695 0.710
Bold entries show the ‘largest value’ for the respective rows
performance in others. However, the proposed method pro-
vides better recall rates for most of the categories in both
datasets. Hence, the proposed method provides better clas-
sification results for all of the image categories. Low recall
rates are observed for the dollar bill and sunflower cate-
gories using the HOG and SURF descriptors.
In these categories, complex image backgrounds with
overlapping objects are difficult to classify. However, the
proposed method provided better recall rates in these cat-
egories. Hence, the proposed method intuitively combines
local and global features by selecting local features based
on the pixel intensity level and texture values and selecting
global features using the sliding window. The local features
Table 4 Comparison of the average precisions obtained by the proposed method compared with those from the state-of-the-art descriptors on the Corel-1000 dataset

Class      Proposed method  HOG [1]  SURF [3]  SIFT [2]
Africa 0.90 0.82 0.71 0.72
Beach 0.91 0.78 0.60 0.60
Building 0.87 0.90 0.79 0.80
Bus 0.95 0.85 0.81 0.80
Kangaroo 0.97 0.88 0.66 0.45
Elephant 0.85 0.89 0.80 0.79
Flower 0.93 0.85 0.75 0.75
Horse 0.85 0.91 0.76 0.77
Mountain 0.83 0.86 0.77 0.75
Food 0.98 0.94 0.72 0.71
Average 0.907 0.871 0.738 0.716
Bold entries show the ‘largest value’ for the respective rows
Fig. 10 Comparison of the average recall rates obtained by the proposed method compared with those of the state-of-the-art descriptors on 15 categories of the Caltech-101 dataset
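As an aside on how such per-category recall rates can be obtained: for each category, one counts how many of its relevant images are correctly retrieved (true positives) against those that are missed (false negatives). The snippet below is only an illustrative sketch with a handful of hypothetical labels; it is not the evaluation code used for Fig. 10.

```python
# Illustrative sketch: per-category precision and recall from ground-truth
# and predicted category labels (hypothetical data, not the paper's results).
from collections import defaultdict

ground_truth = ["ferry", "ferry", "brain", "brain", "starfish", "starfish"]
predicted    = ["ferry", "brain", "brain", "brain", "starfish", "ferry"]

tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
for truth, pred in zip(ground_truth, predicted):
    if truth == pred:
        tp[truth] += 1          # relevant image retrieved correctly
    else:
        fp[pred] += 1           # image wrongly assigned to the predicted class
        fn[truth] += 1          # relevant image of the true class was missed

for category in sorted(set(ground_truth)):
    p_den = tp[category] + fp[category]
    r_den = tp[category] + fn[category]
    precision = tp[category] / p_den if p_den else 0.0
    recall = tp[category] / r_den if r_den else 0.0
    print(f"{category:10s} precision={precision:.2f} recall={recall:.2f}")
```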
The local features support texture and shape analysis, whereas the global features are robust for object recognition. Combining the local values with the global depiction captures both the hidden patterns of an image and its distinctive objects. The concatenation of the local and global features is performed after computing the high-variance coefficients, and the proposed reshaping algorithm limits the inputs to the component analysis. Thus, the combined feature vectors are compact and represent an image efficiently.
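As a concrete but much simplified illustration of this fusion step, the sketch below reduces a local texture descriptor and a global sliding-window descriptor to their high-variance principal components, concatenates the reduced vectors, and trains a linear SVM on the compact joint representation. The feature dimensionalities, the choice of 64 retained components, and the scikit-learn classes used here are illustrative assumptions rather than the exact configuration of the proposed method.

```python
# Simplified sketch of local/global feature fusion: keep the high-variance
# coefficients of each descriptor via PCA, concatenate them, and train an SVM.
# Dimensionalities, component counts and classifier settings are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images = 500

# Placeholders: in practice these would be the texture-pattern histograms
# (local) and the sliding-window responses (global) extracted per image.
local_features = rng.random((n_images, 512))
global_features = rng.random((n_images, 1024))
labels = rng.integers(0, 15, size=n_images)      # e.g. 15 image categories

# Retain only the high-variance coefficients of each feature set.
local_reduced = PCA(n_components=64).fit_transform(local_features)
global_reduced = PCA(n_components=64).fit_transform(global_features)

# Concatenate into one compact feature vector per image and train the SVM.
fused = np.hstack([local_reduced, global_reduced])   # shape (500, 128)
classifier = LinearSVC(C=1.0, max_iter=5000).fit(fused, labels)
print("training accuracy:", classifier.score(fused, labels))
```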
5 Conclusion
In this paper, we proposed a novel method for effective
and accurate feature vector extraction and image classi-
fication. The descriptor is able to perform classifications
with significant precision in diverse categories of the bench-
mark datasets ImageNet, Caltech-256, Caltech-101 and
Corel-1000. The descriptor accurately distinguishes cor-
ners, edges, and lines and performs texture analysis and
object recognition for complex and overlapping images.
The proposed method was compared with other sophisticated methods and provided remarkable precision in most of the image categories, owing to its fusion of local and global features. The
proposed descriptor was also compared with the state-of-
the-art descriptors SIFT, SURF and HOG and outperformed
them in all of the datasets. The experimental results showed
that the state-of-the-art descriptors perform well in some
image categories due to their specialization in those areas
but are unable to provide good results in other categories
due to their limitations with those image attributes. The
proposed method provides reliable and remarkable preci-
sion and recall rates in most of the image categories of the
benchmark datasets.
References
1. Dalal N, Triggs B (2005) Histograms of oriented gradients for
human detection. In: IEEE computer society conference on com-
puter vision and pattern recognition, 2005. CVPR 2005, vol 1.
IEEE, pp 886–893
2. Lowe DG (1999) Object recognition from local scale-invariant
features. In: The proceedings of the seventh IEEE international
conference on computer vision, 1999, vol 2. IEEE, pp 1150–
1157
3. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust
features (SURF). Comput Vis Image Underst 110(3):346–359
4. Liu G-H, Yang J-Y (2013) Content-based image retrieval using
color difference histogram. Pattern Recogn 46(1):188–198
5. Chaudhary MD, Upadhyay AB (2014) Integrating shape and edge
histogram descriptor with stationary wavelet transform for effec-
tive content based image retrieval. In: International conference
on circuit, power and computing technologies (ICCPCT), 2014.
IEEE, pp 1522–1527
6. Agrawal D, Jalal AS, Tripathi R (2013) Trademark image retrieval
by integrating shape with texture feature. In: International con-
ference on information systems and computer networks (ISCON),
2013. IEEE, pp 30–33
7. Harris C, Stephens M (1988) A combined corner and edge detec-
tor. In: Alvey vision conference, vol 15, p 50
8. Wang H, Brady M (1995) Real-time corner detection algorithm
for motion estimation. Image Vis Comput 13(9):695–703
9. Khotanzad A, Hong YH (1990) Invariant image recognition
by Zernike moments. IEEE Trans Pattern Anal Mach Intell
12(5):489–497
10. Rosten E, Drummond T (2006) Machine learning for high-speed
corner detection. In: Computer vision–ECCV 2006. Springer,
Berlin, pp 430–443
11. Tuytelaars T, Van Gool L (2004) Matching widely separated
views based on affine invariant regions. Int J Comput Vis 59(1):
61–85
12. Sural S, Qian G, Pramanik S (2002) Segmentation and histogram
generation using the HSV color space for image retrieval. In:
International conference on image processing, 2002. Proceedings.
2002, vol 2. IEEE, pp II–589
13. Mikolajczyk K, Schmid C (2005) A performance evaluation
of local descriptors. IEEE Trans Pattern Anal Mach Intell
27(10):1615–1630
14. Gupta E, Kushwah RS (2015) Combination of global and local
features using DWT with SVM for CBIR. In: 4th interna-
tional conference on reliability, infocom technologies and opti-
mization (ICRITO)(trends and future directions), 2015. IEEE,
pp 1–6
15. Li J, Wang JZ (2003) Automatic linguistic indexing of pictures by
a statistical modeling approach. IEEE Trans Pattern Anal Mach
Intell 25(9):1075–1088
16. Dubey SR, Singh SK, Singh RK (2016) Multichannel decoded
local binary patterns for content-based image retrieval. IEEE Trans
Image Process 25(9):4018–4032
17. Xiao Y, Wu J, Yuan J (2014) mCENTRIST: a multi-channel fea-
ture generation mechanism for scene categorization. IEEE Trans
Image Process 23(2):823–836
18. Zhou Y, Zeng F-Z, Zhao H-M, Murray P, Ren J (2016) Hierar-
chical visual perception and two-dimensional compressive sensing
for effective content-based color image retrieval. Cogn Comput
8(5):877–889
19. Shrivastava N, Tyagi V (2015) An efficient technique for retrieval
of color images in large databases. Comput Electr Eng 46:314–327
20. Kundu MK, Chowdhury M, Bulò SR (2015) A graph-based relevance feedback mechanism in content-based image retrieval. Knowl-Based Syst 73:254–264
21. Zeng S, Huang R, Wang H, Kang Z (2016) Image retrieval using
spatiograms of colors quantized by Gaussian mixture models.
Neurocomputing 171:673–684
22. Walia E, Pal A (2014) Fusion framework for effective color image
retrieval. J Vis Commun Image Represent 25(6):1335–1348
23. Ashraf R, Bashir K, Irtaza A, Mahmood MT (2015) Content based
image retrieval using embedded neural networks with bandletized
regions. Entropy 17(6):3552–3580
24. ElAlami ME (2014) A new matching strategy for content based
image retrieval system. Appl Soft Comput 14:407–418
25. Iqbal K, Odetayo MO, James A (2012) Content-based image
retrieval approach for biometric security using colour, texture and
shape features controlled by fuzzy heuristics. J Comput Syst Sci
78(4):1258–1277
26. Neelima N, Reddy ES (2015) An improved image retrieval system
using optimized FCM & multiple shape, texture features. In: 2015
IEEE international conference on computational intelligence and
computing research (ICCIC). IEEE, pp 1–7
27. Youssef SM (2012) ICTEDCT-CBIR: integrating curvelet trans-
form with enhanced dominant colors extraction and texture analy-
sis for efficient content-based image retrieval. Comput Electr Eng
38(5):1358–1376
28. Lande MV, Bhanodiya P, Jain P (2014) An effective content-based
image retrieval using color, texture and shape feature. In: Intel-
ligent computing, networking, and informatics. Springer, India,
pp 1163–1170
29. Xia Y, Wan S, Yue L (2014) A new texture direction feature
descriptor and its application in content-based image retrieval. In:
Proceedings of the 3rd international conference on multimedia
technology (ICMT 2013). Springer, Berlin, pp 143–151
30. Agarwal S, Verma AK, Singh P (2013) Content based image
retrieval using discrete wavelet transform and edge histogram
descriptor. In: International conference on information systems
and computer networks (ISCON), 2013. IEEE, pp 19–23
31. Jadhav P, Phalnikar R (2015) SIFT based efficient content based
image retrieval system using neural network. Artificial Intelligent
Systems and Machine Learning 7(8):234–238
32. Awad D, Courboulay V, Revel A (2012) Saliency filtering of
sift detectors: application to cbir. In: Advanced concepts for
intelligent vision systems. Springer, Berlin, pp 290–300
33. Saad MH, Saleh HI, Konber H, Ashour M (2013) CBIR system
based on integration between surf and global features
34. Velmurugan K, Baboo SS (2011) Content-based image retrieval
using SURF and colour moments. Global J Comp Sci Technol
10:11
35. Barbu T (2014) Pedestrian detection and tracking using temporal
differencing and HOG features. Comput Electr Eng 40(4):1072–
1079
36. Albiol A, Monzo D, Martin A, Sastre J, Albiol A (2008)
Face recognition using HOG–EBGM. Pattern Recogn Lett
29(10):1537–1543
37. Felzenszwalb PF, Girshick RB, McAllester D, Ramanan D (2010)
Object detection with discriminatively trained part-based models.
IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645
38. Pan S, Sun S, Yang L, Duan F, Guan A (2015) Content retrieval
algorithm based on improved HOG. In: 3rd international confer-
ence on applied computing and information technology/2nd inter-
national conference on computational science and intelligence
(ACIT-CSI), 2015. IEEE, pp 438–441
39. Murala S, Maheshwari RP, Balasubramanian R (2012) Local
tetra patterns: a new feature descriptor for content-based image
retrieval. IEEE Trans Image Process 21(5):2874–2886
40. Moravec HP (1979) Visual mapping by a robot rover. In: Pro-
ceedings of the 6th international joint conference on artificial
intelligence, vol 1. Morgan Kaufmann Publishers Inc, pp 598–600
41. Förstner W, Gülch E (1987) A fast operator for detection and precise location of distinct points, corners and centres of circular features. In: Proceedings of the ISPRS intercommission conference on fast processing of photogrammetric data, pp 281–305
42. Ojala T, Pietikäinen M, Mäenpää T (2000) Gray scale and rotation invariant texture classification with local binary patterns. In: Computer vision-ECCV 2000. Springer, Berlin, pp 404–420
43. Ojala T, Valkealahti K, Oja E, Pietikäinen M (2001) Texture discrimination with multidimensional distributions of signed gray-level differences. Pattern Recogn 34(3):727–739
44. Ojala T, Pietikäinen M, Harwood D (1996) A comparative study of texture measures with classification based on featured distributions. Pattern Recogn 29(1):51–59
45. Pietikäinen M, Ojala T, Xu Z (2000) Rotation-invariant texture classification using feature distributions. Pattern Recogn 33:43–52
46. Stejić Z, Takama Y, Hirota K (2003) Genetic algorithm-based relevance feedback for image retrieval using local similarity patterns. Inf Process Manag 39(1):1–23
47. Oertel C, Colder B, Colombe J, High J, Ingram M, Sallee P (2008)
Current challenges in automating visual perception. In: Proceed-
ings of IEEE advanced imagery pattern recognition workshop
48. Stanford vision lab, http://image-net.org/. Last accessed October 2016
49. Griffin G, Holub A, Perona P (2007) Caltech-256 object category
dataset
50. Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual
models from few training examples: an incremental Bayesian
approach tested on 101 object categories. IEEE. CVPR 2004
Workshop on Generative-Model Based Vision
51. Lai C-C, Chen Y-C (2011) A user-oriented image retrieval system
based on interactive genetic algorithm. IEEE Trans Instrum Meas
60:3318–3325
52. Ali N, Bajwa KB, Sablatnig R, Mehmood Z (2016) Image retrieval
by addition of spatial information based on histograms of triangu-
lar regions. Comput Electr Eng 54:539–550
53. Walia E, Pal A (2014) Fusion framework for effective color image
retrieval. J Vis Commun Image Represent 25(6):1335–1348
54. Dubey SR, Singh SK, Singh RK (2015) A multi-channel
based illumination compensation mechanism for brightness
invariant image retrieval. Multimedia Tools and Applications
74(24):11223–11253
55. Thepade S, Das R, Ghosh S (2015) Novel technique in block trun-
cation coding based feature extraction for content based image
identification. In: Transactions on computational science XXV.
Springer, Berlin, pp 55–76
56. Dalal N, Triggs B (2006) Object detection using histograms of
oriented gradients. In: Pascal VOC workshop, ECCV
57. Hu R, Collomosse J (2013) A performance evaluation of gradient
field hog descriptor for sketch based image retrieval. Comput Vis
Image Underst 117(7):790–806
58. Wangming X, Jin W, Xinhai L, Lei Z, Gang S (2008) Application
of image SIFT features to the context of CBIR. In: International
conference on computer science and software engineering, 2008,
vol 4. IEEE, pp 552–555
59. Xu P, Zhang L, Yang K, Yao H (2013) Nested-SIFT for efficient
image matching and retrieval. IEEE MultiMedia 20(3):34–46
60. Kim S, Yoon K-J, Kweon IS (2008) Object recognition using a
generalized robust invariant feature and Gestalt’s law of proximity
and similarity. Pattern Recogn 41(2):726–741
61. Lee Y-H, Kim Y (2015) Efficient image retrieval using advanced
SURF and DCD on mobile platform. Multimedia Tools and
Applications 74(7):2289–2299