Recognition of 3D Objects in Arbitrary Pose Using a Fuzzy
Associative Database Algorithm
Aaron Mavrinac, Xiang Chen, and Ahmad Shawky
Abstract— Once the human vision system has seen a 3D
object from a few different viewpoints, depending on the nature
of the object, it can generally recognize that object from new
arbitrary viewpoints. This useful interpolative skill relies on
the highly complex pattern matching systems in the human
brain, but the general idea can be applied to a computer
vision recognition system using comparatively simple machine
learning techniques. An approach to the recognition of 3D
objects in arbitrary pose relative to the vision equipment
with only a limited training set of views is presented. This
approach involves computing a disparity map using stereo
cameras, extracting a set of features from the disparity map,
and classifying it via a fuzzy associative map to a trained object.
I. INTRODUCTION
HUMANS are generally able to recognize 2D shapes, regardless of changes in orientation, scale, or skew, after having seen the shape in one such configuration. This
shape recognition has a very wide range of applications,
and accordingly, much work has gone into automating it
with computers. The basic theory is that shapes can be
extracted from otherwise cluttered and cumbersome images,
from which some set of quantiﬁers efﬁciently describing
the shapes can be obtained and compared to known values
through some algorithm for classiﬁcation. The nature of
these quantiﬁers and the classiﬁcation algorithm are a subject
of much research; most use quantiﬁers invariant to the
aforementioned transformations (rotation, scale, skew, etc.) such as Fourier descriptors, moment invariants, and Hough transformations, and most use machine learning methods
such as fuzzy logic and neural networks for classiﬁcation.
Humans are also generally able to recognize 3D objects,
regardless of their orientation, after having seen a sufﬁcient
number of different views (depending, of course, on the
nature of the object itself). To generalize from the 2D case,
it is possible to automate this process in a similar manner
by obtaining quantiﬁers describing the 3D surface rather
than the 2D shape. Such quantiﬁers can be extracted from
range images, or in the case of stereo vision, disparity
maps. However, a single such image gives information only
from a certain perspective; this is commonly referred to
as 2.5D. To approach full 3D information, range images
must be taken from different perspectives around the object.
For classification to continue to work as generalized from the 2D case, the sets of quantifiers from each perspective must be combined to fully describe the object, and the classification algorithm must be designed to operate on this type of information.

The authors are with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, Ontario, Canada (email: {mavrin1,xchen,shawky}@uwindsor.ca). This research was funded in part by the Natural Sciences and Engineering Research Council of Canada (NSERC).
In this paper, we expand on previous work in object recognition using invariant values on 2D images [5], justifying the selection of proper invariant descriptors for 3D shapes based on disparity maps and modifying the classification scheme
to reﬂect the new object description. The result is a system
capable of recognizing a trained object based on a disparity
map taken by a stereo camera rig from any view, where
training requires only a few different such views.
II. PRIOR WORK
A. 3D Recognition
There are several cases where 2D moment invariants have
been used for recognition of 3D objects. In both [10] and [8], moment invariants are computed on a series of intensity
images of the object taken from a variety of positions around
it; it is demonstrated that with a sufﬁcient number of images
and proper handling of the multi-image input in an artificial
neural network scheme, 2D moments are applicable to 3D
recognition. However, these methods do not examine 3D
information about the object directly, and require a large
number of explicitly-ordered views to operate. In addition to
the cost of capturing these views, objects are not identiﬁed
from an arbitrary unknown pose.
Methods have also been proposed which operate on invariants of 3D range data. In [9], the concept of computing
characteristic vectors of multiple images is extended to range
images, allowing for object recognition in arbitrary pose
unaffected by illumination. In [6], local feature histograms,
invariant to translations and rotations as well as being robust
to partial occlusions, are computed directly on range images;
recognition is then performed using histogram matching or
probabilistic recognition.
There are a number of alternate possibilities which employ
other descriptors entirely. One example is [7], in which chromaticity distributions from a variety of images of the object
are used to identify the object; this method of recognition,
while pose-invariant, is adversely affected by variations in
illumination, though the work attempts to alleviate these
problems.
B. Neuro-Fuzzy Recognition
Neuro-fuzzy classifiers are used to solve a wide range of recognition problems [19]. In particular, a number of fuzzy LVQ schemes have been proposed for prototype-based classification and recognition. Methods such as those described in
[13], [14], and [15] employ a fuzzy neighbourhood function
on training data with speciﬁc classes, whereas others, such
as [16], attach fuzzy labels to the training data themselves.
Fuzzy associative memory models [12], [17] have been
employed to store rules for classiﬁcations based on fuzzy
LVQ, notably in [5], upon which the system we describe
here is based.
III. PRELIMINARY THEORY
A. Disparity Map
In order to quantify the 3D shape of an object in a manner
useful for recognition, some representation of the shape must
be generated by the sensor. A stereo vision system provides
data which can be analyzed in a variety of ways to obtain
3D information, but the crucial point, in this case, is for the
representation to lend itself to some analog of the 2D work
in [5]. Fortunately, a representation exists to which a similar
recognition scheme may be applied, and it is in fact relatively
easy to obtain.
For the purpose of this description and throughout this
work, the following convention is used for the world and
image coordinate systems: lowercase x and y represent
image coordinates with origin at the upper left corner of
the image and positive axes right and down respectively, and
uppercase X, Y , and Z represent world coordinates (which,
unless otherwise speciﬁed, are mutually orthogonal with Z
perpendicular to the rectiﬁed image planes and have their
origin at the optical center of the left camera). Figure 1
illustrates their relationship.
Fig. 1. Coordinate System Convention
We assume a stereo vision system capable of generating
rectiﬁed stereo images, wherein the epipolar lines are parallel
and horizontally aligned as if captured by parallel cameras.
In the general case, this requires internal and external (stereo)
calibration of the cameras. For a thorough geometrical treatment see [1], [20], and for some practical methods see [2], [3], [4].
Given a pixel of coordinates (x_1, y_1) in one image of an epipolar-rectified stereo pair, and a corresponding pixel (x_2, y_2) in the other (where y_1 = y_2), their disparity d is defined as x_2 − x_1 [20]. This can be used to triangulate the depth to the original 3D point in the environment (from the optical center of one camera) in the world coordinate system according to the following relation:

Z = bfλ / d   (1)

where b is the baseline (distance between the two optical centers), f is the focal length, and λ is a parameter relating the pixel width to real-world measurements.
A disparity map is a 2D matrix D containing the disparity of each pixel in one image with respect to the corresponding pixel, if any, in the other. Thus, if pixel (i, j) in the first image corresponds to pixel (k, j) in the second, D_ij = k − i. The disparity map essentially results in a range image
when its values are normalized and/or quantized to a range
of grayscale values which can be displayed and manipulated
as such. This provides an important visualization tool and
allows existing image invariant computation algorithms to
function unmodiﬁed on the data.
With a calibrated stereo vision system, the parameters b, f, and λ are known and Equation 1 may be used to calculate the actual depth (Z coordinate) of the real points represented by pixels in the disparity map. However, for the purposes of 3D object recognition this is not necessary. Instead, the invariant descriptors (see Section III-C) are computed from the disparity map, or more specifically, from its associated range image.
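As an illustration of Equation 1 and the normalization step, the following minimal Python sketch (function names and calibration values are ours, purely for illustration) triangulates depth from a disparity and quantizes a disparity map to grayscale:

```python
# Sketch of Eq. (1) and of normalizing a disparity map into a grayscale
# range image. Disparity maps are plain 2D lists here; the parameter names
# (b, f, lam) follow the text.

def depth_from_disparity(d, b, f, lam):
    """Z = b * f * lam / d, for disparity d (in pixels)."""
    if d == 0:
        return float('inf')  # zero disparity corresponds to a point at infinity
    return b * f * lam / d

def normalize_disparity_map(D, levels=256):
    """Quantize a disparity map to [0, levels-1] grayscale values."""
    values = [d for row in D for d in row]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero on a flat map
    return [[round((d - lo) * (levels - 1) / span) for d in row] for row in D]
```

For instance, a hypothetical 0.1 m baseline, 8 mm focal length, and λ = 10⁴ pixels/m give Z = 0.5 m for a 16-pixel disparity.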
B. Correspondence
In order to construct a disparity map for the ﬁrst image
in a stereo pair, it is necessary to establish correspondences
in the second image for each pixel in the first. Correlation-based methods such as the sum of squared differences (SSD) and normalized cross-correlation (NCC) criteria may be used for this purpose.
Correlation-based correspondence consists of maximizing, for each left-image pixel p_l, a similarity criterion c on the displacement d = [d_1, d_2]^T, selecting d + p_l as the corresponding right-image pixel:

c(d) = Σ_{k=−W}^{W} Σ_{l=−W}^{W} ψ(I_l(i+k, j+l), I_r(i+k−d_1, j+l−d_2))   (2)

In this case, since the images I_l and I_r are rectified and correspondences are therefore found on the same horizontal line, d_1 can be constrained to zero [21]. We use here the SSD criterion for ψ, that is, for two pixel values u and v, ψ(u, v) = −(u − v)².
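The search of Equation 2 with the SSD criterion can be sketched as follows (a minimal Python illustration on rectified images stored as 2D lists; the window half-width W and search range d_max are hypothetical parameters, and image-boundary handling is omitted for brevity):

```python
# Correlation-based correspondence on rectified images (Eq. 2 with d1 = 0):
# for a left pixel (i, j), find the horizontal displacement d2 maximizing
# the SSD criterion psi(u, v) = -(u - v)**2 over a (2W+1)x(2W+1) window.

def ssd_score(I_l, I_r, i, j, d2, W):
    """c(d) with d1 = 0: sum of psi over the correlation window."""
    score = 0
    for k in range(-W, W + 1):
        for l in range(-W, W + 1):
            u = I_l[i + k][j + l]
            v = I_r[i + k][j + l - d2]
            score += -(u - v) ** 2
    return score

def best_disparity(I_l, I_r, i, j, W=1, d_max=3):
    """Return the displacement d2 that maximizes the SSD criterion."""
    return max(range(0, d_max + 1),
               key=lambda d2: ssd_score(I_l, I_r, i, j, d2, W))
```

A real implementation would clamp the window and search range at the image borders and typically reject matches whose best score is still poor.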
C. Invariant Descriptors
We examined a variety of invariant descriptors calculated from 2D images, evaluating their usefulness in describing different range views of an object qualitatively and quantitatively. Three in particular were selected to work collectively to describe a set of range views.
1) Compactness: The first useful descriptor is the compactness, a Fourier descriptor which describes a distribution of intensity values in an enclosed region. When applied to a disparity map, it describes the disparity (range) distribution invariant to translation and rotation. The compactness of a greyscale image can be calculated as follows, adapted from [22]:

C = (Σ_{y=1}^{h} Σ_{x=1}^{w} f_boundary(x, y))² / (Σ_{y=1}^{h} Σ_{x=1}^{w} f(x, y))   (3)

where f(x, y) is the value of the image at pixel (x, y) and f_boundary(x, y) defines pixels on the perimeter of a region (object).
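A rough Python sketch of Equation 3 follows. The text does not spell out how perimeter pixels are detected, so treating a boundary pixel as a nonzero pixel with at least one zero (or out-of-image) 4-neighbour, and summing the image values over those pixels, are our assumptions:

```python
# Compactness (Eq. 3) of a segmented grayscale image stored as a 2D list,
# with zero marking background. Boundary pixels are nonzero pixels with a
# zero or out-of-image 4-neighbour (an assumed boundary definition).

def compactness(f):
    h, w = len(f), len(f[0])

    def is_boundary(y, x):
        if f[y][x] == 0:
            return False
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w) or f[ny][nx] == 0:
                return True
        return False

    boundary_sum = sum(f[y][x] for y in range(h) for x in range(w)
                       if is_boundary(y, x))
    total = sum(f[y][x] for y in range(h) for x in range(w))
    return boundary_sum ** 2 / total
```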
2) First Hu Moment: The second descriptor is the first of Hu's seven invariant moments [11], which are invariant to translation, rotation, and scale. Only the lowest-order moment is applied to the disparity maps, as it is robust against the inherent noise from imperfect correspondences and occlusions. It is calculated as follows:

I_1 = (M_20 − x̄M_10 + M_02 − ȳM_01) / M_00²   (4)

where x̄ and ȳ are the centroid coordinates M_10/M_00 and M_01/M_00, and:

M_ij = Σ_x Σ_y x^i y^j I(x, y)   (5)
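Equations 4 and 5 can be sketched directly in Python (a minimal illustration on a 2D intensity list; the function names are ours):

```python
# First Hu moment (Eq. 4) from raw image moments (Eq. 5) of a 2D intensity
# array. Rows index y and columns index x.

def raw_moment(I, p, q):
    """M_pq = sum over x, y of x**p * y**q * I(x, y)."""
    return sum((x ** p) * (y ** q) * I[y][x]
               for y in range(len(I)) for x in range(len(I[0])))

def first_hu_moment(I):
    m00 = raw_moment(I, 0, 0)
    m10, m01 = raw_moment(I, 1, 0), raw_moment(I, 0, 1)
    m20, m02 = raw_moment(I, 2, 0), raw_moment(I, 0, 2)
    xbar, ybar = m10 / m00, m01 / m00
    # Eq. (4): central second-order moments normalized by M00 squared
    return (m20 - xbar * m10 + m02 - ybar * m01) / m00 ** 2
```

Translating the object within the image leaves this value unchanged, which is the invariance the recognition scheme relies on.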
3) Histogram: The final descriptor is the histogram, a Fourier descriptor which describes the overall distribution of intensities in an image. When applied to a disparity map, it describes rather the range distribution. The histogram is not a scalar value like the previous two descriptors, but may be compared for two different images as follows [6]:

χ²(I_1, I_2) = Σ_{i=0}^{M} (h_1i − h_2i)² / (h_1i + h_2i)   (6)

where I_1 and I_2 are the images, h_1i and h_2i are the ith elements of the first and second histogram, respectively, and M is the final element in the histogram, which may be 255 in this case as the upper limit of the normalized range for a disparity map.
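Equation 6 is straightforward to sketch (a short Python illustration; skipping bins where both counts are zero, to avoid division by zero, is our assumption for that edge case):

```python
# Chi-squared comparison of two histograms (Eq. 6). Histograms are lists of
# bin counts of equal length.

def chi_squared(h1, h2):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def histogram(values, bins=256):
    """Bin integer grayscale values in [0, bins-1]."""
    h = [0] * bins
    for v in values:
        h[v] += 1
    return h
```

Identical histograms give χ² = 0, and the statistic grows as the two range distributions diverge.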
IV. FUZZY ASSOCIATIVE DATABASE ALGORITHM
Fuzzy set theory lends itself particularly well to the problem of recognition based on a set of imprecise descriptors with much variation and overlap. However, it is generally impractical to develop a rule set for classification directly, since it is not immediately obvious what each descriptor represents about the object and how they combine. In such cases, one may train and optimize the parameters of the fuzzy system using a neural network, in a configuration known as a neuro-fuzzy system [19].
We describe here a fuzzy associative database similar to that found in [5], adapted for invariant values of disparity maps and for multiple training images expected to differ as
a result of the viewpoint change. The basic approach is to
store a table of fuzzy sets associated with the corresponding
membership functions, where each class (type of object
to be recognized) has one fuzzy set for each invariant
value, which are constructed from fuzziﬁed invariant values
extracted from the disparity maps of the object from several
different viewpoints (the training set). Recognition can then
be accomplished by comparing input invariant values to the
fuzzy sets in each class and determining which matches best.
A. Original FAD Algorithm
The original fuzzy associative database algorithm, described fully in [5], is used for invariant recognition of multiple planar objects in 2D. It consists of a fuzzy database (FD) and a fuzzy search engine (FSE) which are trained using invariant values extracted from the binary images of 2D objects.
Fig. 2. FAD Network with 4 Invariant Values and 3 Classes
The key difference between this and the 3D recognition
problem is the use of multiple images (different views) for
training and recognition. In the 2D case, the invariant values
are sufﬁcient to characterize all possible planar views of the
object, and therefore result in relatively compact membership
functions of the corresponding fuzzy sets. In the 3D case,
multiple views are necessary to capture the full structure of
the object, and although the values are invariant to certain
planar transformations of the object from a given view,
across different views the resulting membership functions
may be quite different. This may result in large areas of
overlap among the input membership functions and these
must therefore be scaled relative to themselves and one
another to better describe the object characteristics.
It is possible and beneﬁcial to emphasize the more unique
and descriptive portions of the fuzzy sets before they are
used for training or recognition. The fuzzy associative database is modified for the multi-image case as described in the
following sections.
B. Supervised Training
During the supervised training stage, the invariant descriptors are computed from a disparity map of an object of known
class. These are ﬁrst fuzziﬁed into a fuzzy set with a Gaussian
membership function:
F(x, m, σ) = e^(−(x−m)² / 2σ²)   (7)

where x is the universe of discourse, m (the mean) is the input crisp value, and σ is the standard deviation of the Gaussian, which is determined by trial and error.
Data from multiple views are thus entered, and the fuzzy sets are joined via a union operator. This results in a joint fuzzy set in each invariant value describing the object in an unbiased fashion from multiple viewpoints. In other words, the fuzzy set describes the entire range of acceptable invariant data associated with the object class. The value σ is chosen so that this holds as closely as possible without any more overlap with other classes than is necessary.
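The fuzzification and union steps can be sketched as follows (a minimal Python illustration; representing the joint fuzzy set as a callable taking a point of the universe of discourse is our implementation choice, and σ remains the trial-and-error parameter from the text):

```python
# Training-side fuzzification (Eq. 7) and the union over views: each crisp
# invariant value from a view becomes a Gaussian membership function, and
# the per-view sets are joined with a max-union.

import math

def gaussian(x, m, sigma):
    """Eq. (7): Gaussian membership centred on the crisp value m."""
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2))

def union_membership(view_values, sigma):
    """Joint membership function: max over the Gaussians of each view."""
    return lambda x: max(gaussian(x, m, sigma) for m in view_values)
```

For example, training values 1.0 and 3.0 for one invariant yield a two-humped membership function that responds strongly near either value.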
The net result so far, assuming a good training set and
a good value of σ, is that the fuzzy system comprised of
the fuzzy sets for each invariant, for a given class, should
return a strong response to input invariants generated by a
disparity map of any viewpoint of an object of the correct
class. However, it is also highly likely at this point that
there is much overlap among the different classes for certain
invariants, and there is no practical way to directly account
for such ambiguities.
In order to correct for this, once the fuzzy sets (and the corresponding membership functions) have been constructed for all training examples, they are adaptively scaled, essentially competing for the ranges of each invariant which best describe their classes. To accomplish this, the crisp invariants from the training set are first clustered according to the following algorithm [18]:
1) Take the values of the network inputs as the initial values to form the weight vectors;
2) Determine the winner unit based on the minimum distance;
3) Update the weight vector of the winner as follows:

w_i(N + 1) = w_i(N) + α(ρ − w_i(N))   (8)

where N is the number of training epochs (iterations), ρ is the network input (a crisp invariant value in our case), and α is the learning rate (for example α = e^(−0.13q − 0.69), where q is the number of trainees in a specific class).
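The three steps above can be sketched as a simple winner-take-all loop (a Python illustration on scalar invariant values; initializing the weights from the first few inputs, the number of clusters, and the epoch count are our assumptions, and q is taken as the number of training values):

```python
# Winner-take-all clustering per Eq. (8): the nearest weight wins each input
# and moves toward it by the learning rate alpha.

import math

def lvq_cluster(inputs, n_clusters, n_epochs=50):
    weights = list(inputs[:n_clusters])      # step 1: initialize from inputs
    alpha = math.exp(-0.13 * len(inputs) - 0.69)
    for _ in range(n_epochs):
        for rho in inputs:
            # step 2: winner unit by minimum distance
            i = min(range(len(weights)), key=lambda k: abs(weights[k] - rho))
            # step 3: Eq. (8) update of the winner
            weights[i] += alpha * (rho - weights[i])
    return weights
```

On well-separated data the weights settle near the cluster centres, which are then used for the scaling step below.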
After the cluster centers are found, each fuzzy input is scaled by a measure of the distance from the crisp input data to the associated cluster center as shown below:

A_ij = A_ij · e^(−|w_i − ρ_ij| / |w_i + ρ_ij|)   (9)

where w_i is the location of the cluster center in the ith class, A_ij is the jth fuzzy input data of the ith class, and ρ_ij is the jth crisp input data in the ith class. As the distance between the cluster center w_i and the input ρ_ij increases, A_ij approaches zero, thus reducing the contribution of data that is far from the cluster center of the class.
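The scaling of Equation 9 reduces to a single attenuation factor per fuzzy input (a short Python sketch; w_i + ρ_ij is assumed nonzero, as the equation implies):

```python
# Adaptive scaling (Eq. 9): attenuate a membership function by the
# normalized distance between its crisp value and the class cluster centre.

import math

def scale_factor(w_i, rho_ij):
    """exp(-|w_i - rho_ij| / |w_i + rho_ij|); assumes w_i + rho_ij != 0."""
    return math.exp(-abs(w_i - rho_ij) / abs(w_i + rho_ij))

def scaled_membership(mu, w_i, rho_ij):
    """Return the scaled membership function A_ij * scale_factor."""
    s = scale_factor(w_i, rho_ij)
    return lambda x: s * mu(x)
```

A crisp value sitting exactly on the cluster centre keeps its full membership (factor 1), while outlying training values are progressively suppressed.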
Figures 3 and 4 show an example of scaling on a simple
fuzzy membership function.
Fig. 3. Fuzzy Membership Function Before Scaling
Fig. 4. Fuzzy Membership Function After Scaling
C. Recognition
Once the fuzzy associative database has been constructed,
recognition is a relatively simple process. The system takes
crisp invariant values computed from a disparity map of the
object to be recognized (in any allowable orientation).
The crisp invariants are compared exhaustively to the FAD fuzzy set for each class, returning the total of the responses from each fuzzy set. The inference method found to best quantify the similarity for individual invariant values is a simple crisp-value response, according to a standard inference equation:

μ_a = ∨[μ_j(x) ∧ I(x)]   (10)

where μ_j(x) ∧ I(x) represents the fuzzy intersection between the trained fuzzy set for invariant j and the fuzzified invariant from input image I, and the leading ∨ (union) indicates the fuzzy union over all invariant values. The class with the highest overall degree of membership μ_a is returned as the probable object class.
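The recognition step can be sketched as follows (a minimal Python illustration; representing each trained fuzzy set as a callable, evaluating the intersection at the crisp input as μ_j(x), and taking the ∨ of Equation 10 as a max over invariants are our simplifications):

```python
# Recognition per Eq. (10): evaluate each class's trained membership
# functions at the crisp invariant values, combine per-invariant responses
# with a max (fuzzy union), and return the best-matching class.

def classify(trained, crisp):
    """trained: {class_name: [mu_j callables]}; crisp: invariant values."""
    def response(mus):
        # mu_j(x) at the crisp input, united (max) over all invariants j
        return max(mu(x) for mu, x in zip(mus, crisp))
    return max(trained, key=lambda c: response(trained[c]))
```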
D. System Overview
The operation of the system is summarized in two flowchart diagrams. The first (Figure 5) shows the basic process of capturing images, creating the disparity map, and computing the invariant descriptors, mostly covered in section III. The second (Figure 6) shows the actual recognition network, including training, as described in subsections IV-B and IV-C.
Note in Figure 6 that the invariant fuzzy set scaling and clustering process takes place after all views have been captured by the vision system (with the fuzzified invariant membership functions stored unscaled), so that the resulting
database incorporates descriptive characteristics of the 3D
object from all of the views.
Fig. 5. Capture Process
V. EXPERIMENTAL RESULTS
A. Apparatus
Testing was conducted using a vision platform consisting of two high-resolution CCD cameras, mounted on a robotic
arm and calibrated for stereo triangulation. No particular
constraints were applied to camera or object positioning other
than generally placing the objects reasonably within the ﬁeld
of view of the system. The platform is shown in Figure 7.
B. Computing Invariant Values
In a practical system, conditions may not be ideal for
generating proper invariant descriptors without some prior
processing of the disparity maps. Since we want to recognize
objects from different viewpoints, it must also be assumed
that the objects might be found in different places in the ﬁeld
of view of the system, and with a background scene present
this has a serious effect on the resultant disparity maps and
invariant descriptors.
Fortunately, given a static background, it is a relatively
simple task to compare each pixel to a stored image of
the background itself and segment out everything but the
object. Many methods exist in the computer vision and
image processing literature, some more complex than others;
we have employed a simple thresholding technique, with experimentally tuned parameters t, F, and B, outlined below:
1) For each pixel p_ij and stored background pixel s_ij, if |p_ij − s_ij| > t, mark the pixel as foreground.
2) Mark as background all foreground pixels in regions with contiguous area less than F.
3) Mark as foreground all background pixels in regions with contiguous area less than B.

Fig. 6. Recognition Network
The descriptors we use for recognition are invariant to translation, among other things, so once background subtraction has been performed it is of no concern where in the
image the object lies, so long as it is fully within the image.
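The three segmentation steps can be sketched as follows (a pure-Python illustration; using 4-connectivity for the contiguous-area test is our assumption, as are the toy parameter values in the test below):

```python
# Three-step foreground segmentation: threshold against a stored background
# (step 1), then remove small foreground regions (step 2) and small
# background holes (step 3). Images are 2D lists; t, F, B are the
# experimentally tuned parameters from the text.

def segment(img, bg, t, F, B):
    h, w = len(img), len(img[0])
    mask = [[abs(img[y][x] - bg[y][x]) > t for x in range(w)] for y in range(h)]

    def prune(value, min_area):
        # flip 4-connected regions of `value` smaller than min_area
        seen = [[False] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                if mask[y][x] == value and not seen[y][x]:
                    stack, region = [(y, x)], []
                    seen[y][x] = True
                    while stack:
                        cy, cx = stack.pop()
                        region.append((cy, cx))
                        for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                            if 0 <= ny < h and 0 <= nx < w and \
                               mask[ny][nx] == value and not seen[ny][nx]:
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                    if len(region) < min_area:
                        for ry, rx in region:
                            mask[ry][rx] = not value

    prune(True, F)    # step 2: small foreground regions -> background
    prune(False, B)   # step 3: small background holes -> foreground
    return mask
```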
C. Results
The system was tested using the training set of Table I on
a set of 200 disparity maps taken from different viewpoints
of 3 different objects.
The recognition rates of the experiment using Gaussian fuzzification, three training views, and the simple crisp-value inference method are shown in Table II. Test A used no data scaling, whereas Test B employed the LVQ self-scaling method. A very high recognition rate was achieved in all
method. A very high recognition rate was achieved in all
three classes, despite noise in the generated disparity maps
and ambiguity in the shapes of the objects.
VI. CONCLUSIONS
After examining a variety of possible invariant descriptors for recognition of 3D objects based on disparity maps, we have found a particular combination of three to yield the best recognition results: compactness, the first Hu moment, and the histogram difference, as detailed in subsection III-C.
Fig. 7. Vision Platform
TABLE I
EXPERIMENT TRAINING SET
Class 1 Class 2 Class 3
The recognition method used a neural network to optimize fuzzy membership functions for the invariant descriptors against one another, which successfully mitigated misclassification introduced by ambiguities in the individual functions.
After training the recognition system with just three views of
an object, as described in section V, a very high recognition
rate was achieved on disparity maps generated from arbitrary
views.
The recognition could be made more robust by introducing
additional invariant descriptors to the same general concept.
One way to achieve this would be to improve the correlation
correspondence algorithm to yield a smoother and more
accurate range image; this could potentially allow the use of
higherorder moment invariants. Another possibility would
be to apply some form of normalization to the stereo images or the disparity maps so that additional descriptors not invariant to certain properties could be used. Finally, it may be possible to optimize recognition further by weighting the contributions of the individual invariant descriptor membership functions to the classification.
TABLE II
RECOGNITION RESULTS
Test Class 1 Class 2 Class 3
A 94.00% 93.81% 86.67%
B 98.00% 98.97% 100.00%
REFERENCES
[1] J. J. Koenderink and A. J. van Doorn, “Geometry of Binocular Vision
and a Model for Stereopsis,” Biological Cybernetics, vol. 21, pp. 29–
35, 1976.
[2] R. Y. Tsai, “An Efﬁcient and Accurate Camera Calibration Technique
for 3D Machine Vision,” Proc. IEEE Computer Society Conf. on
Computer Vision and Pattern Recognition, pp. 364–374, 1986.
[3] R. Y. Tsai, "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses," IEEE Journal of Robotics and Automation, vol. 3, no. 4, pp. 323–344, 1987.
[4] Z. Zhang, “A Flexible New Technique for Camera Calibration,” IEEE
Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 11,
pp. 1330–1334, 2000.
[5] S. Shahir, X. Chen, and M. Ahmadi, “Fuzzy Associative Database for
Multiple Planar Object Recognition,” Proc. Intl. Symp. on Circuits and
Systems, vol. 5, pp. 805–808, 2003.
[6] G. Hetzel, B. Leibe, P. Levi, and B. Schiele, “3D Object Recognition
for Range Images using Local Feature Histograms,” Proc. IEEE
Computer Society Conf. on Computer Vision and Pattern Recognition,
pp. 394–399, 2001.
[7] S. Lin and S. W. Lee, “Using Chromaticity Distributions and
Eigenspace Analysis for Pose, Illumination, and Specularity
Invariant Recognition of 3D Objects,” Proc. IEEE Computer Society
Conf. on Computer Vision and Pattern Recognition, pp. 426–431,
1997.
[8] N. Rui, J. Guangrong, Z. Wencang, and F. Chen, "3D Object Recognition from 2D Invariant View Sequence Under Translation, Rotation and Scale by Means of ANN Ensemble," Proc. IEEE Intl. Wkshp. on VLSI Design and Video Technology, pp. 292–295, 2005.
[9] R. J. Campbell and P. J. Flynn, "Eigenshapes for 3D Object Recognition in Range Data," Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 505–510, 1999.
[10] M. Y. Mashor, M. M. Osman, and M. R. Arshad, "3D Object Recognition Using 2D Moments and HMLP Network," Proc. Intl. Conf. on Computer Graphics, Imaging and Visualization, pp. 126–130, 2004.
[11] M. K. Hu, “Visual Pattern Recognition By Moment Invariants,” IRE
Transactions on Information Theory, vol. 8, no. 2, pp. 179–187, 1962.
[12] S.-G. Kong and B. Kosko, "Adaptive Fuzzy Systems for Backing Up a Truck-and-Trailer," IEEE Trans. on Neural Networks, vol. 3, no. 2, pp. 211–223, 1992.
[13] N. B. Karayiannis and P. I. Pai, "Fuzzy Algorithms for Learning Vector Quantization," IEEE Trans. on Neural Networks, vol. 7, pp. 1196–1211, 1996.
[14] B. Kusumoputro, H. Budiarto, and W. Jatmiko, "Fuzzy-Neuro LVQ and its Comparison with Fuzzy Algorithm LVQ in Artificial Odor Discrimination System," ISA Trans., vol. 41, no. 4, pp. 395–407, 2002.
[15] K. L. Wu and M. S. Yang, "A Fuzzy-Soft Learning Vector Quantization," Neurocomputing, vol. 55, no. 3, pp. 681–697, 2003.
[16] C. Thiel, B. Sonntag, and F. Schwenker, “Experiments with Supervised
Fuzzy LVQ,” Proc. 3rd IAPR Wkshp. on Artiﬁcial Neural Networks in
Pattern Recognition, pp. 125–132, 2008.
[17] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems
Approach to Machine Intelligence, Prentice Hall, 1992.
[18] T. Kohonen, Self-Organizing Maps, Springer, 1995.
[19] D. Nauck, F. Klawonn, and R. Kruse, Foundations of Neuro-Fuzzy Systems, Wiley, 1997.
[20] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, The MIT Press, 1993.
[21] E. Trucco and A. Verri, Introductory Techniques for 3D Computer
Vision, Prentice Hall, 1998.
[22] R. C. Gonzalez, Digital Image Processing, Prentice-Hall, 2002.