Semiautomatic Learning of 3D Objects from
Video Streams
Fabio Carrara, Fabrizio Falchi, and Claudio Gennaro
Via G. Moruzzi 1
56124 Pisa - Italy
Abstract. Object detection and recognition are classical problems in computer vision, but they are still challenging without a priori knowledge of the objects and with limited user interaction. In this work, a semiautomatic system for visual object learning from a video stream is presented. The system detects movable foreground objects relying on FAST interest points. Once a view of an object has been segmented, the system relies on ORB features to create its descriptor, store it, and compare it with the descriptors of previously seen views. To this end, a visual similarity function based on the geometric consistency of the local features is used. The system groups similar views of the same object into clusters, relying on the transitivity of similarity among them. Each cluster identifies a 3D object, and the system learns to autonomously recognize a particular view by assessing its cluster membership. When ambiguities arise, the user is asked to validate the membership assignments. Experiments have demonstrated the ability of the system to group together unlabeled views, reducing the labeling work of the user.
1 Introduction
In this work, a user-assisted clustering system for online visual object recognition is presented. Our approach enables a single smart camera to learn and recognize objects by exploiting change detection in the scene: given the evolution of the scene over time, the system incrementally builds knowledge that can be exploited for subsequent recognition of objects when they reappear in the scene. The user is queried when ambiguities cannot be automatically resolved.
Object detection is carried out by a local feature based background subtrac-
tion method [1] which distinguishes the foreground local features of the image
from the background ones and segments new objects in the scene relying on
FAST interest points. Each detected object, together with its extracted ORB
local features, is maintained in a local database forming the knowledge base for
object recognition. All the views of detected objects are incrementally organized
in clusters based on the similarity among them. A similarity function between two object views is defined relying on local feature matching and geometric constraints on their positions. The main goal of the system is to maintain gathered views in clusters where each cluster contains only views of the same 3D object, even if it has been observed under different poses or illuminations (see Figure 1). Clusters can be labeled at any time by the user, and object recognition is performed by assessing the membership of a view to a particular cluster.
Fig. 1: Visualization of the system goal: online clustering of detected objects (Object #1 - Object #4) as a recognition task.
The system has not been designed with a particular smart camera platform in mind, but it has been tested on the Raspberry Pi platform equipped with a Pi Camera module. Experiments have been made using the publicly available Stanford 3D Objects dataset [9] in order to evaluate the ability to build a knowledge base for object recognition.
The rest of the paper is organized as follows: Section 2 describes the main
features of some studied object recognition methods. Section 3 presents our
method, describing the similarity function we have defined between detected
objects. Section 4 describes the strategy used for similar object clustering. Sec-
tion 5 reports the experiments performed and the metrics used to evaluate our
method. Conclusive remarks are addressed at the end of this paper.
2 Related Work
Many solutions have been proposed to the problem of learning 3D object models for recognition tasks, starting from different 2D views of the object of interest.
Murase and Nayar [7] model each object as a manifold in the eigenspace
obtained compressing the training image set for that object. Given an unknown
input image, it is projected on the eigenspace and labeled relying on the manifold
it lies on. Moreover, the exact point of the projection gives a pose estimation of
the input object. However, only batch training is possible, and the training set must be composed of a large number of normalized images with different poses and illuminations and an uncluttered background.
More recent studies address object modelling relying on local features of
training images.
Weber et al. [10, 5] developed a method to learn object class models from unlabeled and unsegmented cluttered scenes. The authors combine appearance and
shape in the object model using the constellation model. In this model, objects
are represented as flexible constellations of rigid parts (features). Most robust
parts are automatically identified applying a clustering algorithm to parts de-
tected from the training set. An expectation-maximization algorithm is applied
to tune the parameters of a joint probability density function (pdf) on the shape
of the constellation and the output of part detectors. An enhanced version of this
method using Bayesian parameter estimation is proposed by Fei-Fei et al. [4] for
the purpose of scale-invariant object categorization capable of both batch and
incremental training. However, this approach performs poorly with few training
images and is more suitable to the modeling of classes of objects rather than
individual 3D objects. Moreover, it does not cope with multi-pose 3D object modeling.
The work in this paper follows the approach of Lowe [6], who addresses the
problem of view clustering for 3D object recognition using a training set of
images with uncluttered background. Each view is described by its SIFT local
features and adjacent views are clustered together relying on feature matching
and similarity transformation error. Our method instead relies on a different geometric consistency check, based on homographies, which is capable of relating views under different perspectives with low false-positive rates.
3 Object Extraction and Matching
A specialized local feature based background subtraction method [1] has been
implemented to segment stable foreground objects from a video stream relying
on their FAST keypoints. A 2-level background model is created and updated using temporal statistics on the positions of the keypoints. The first level is trained to segment keypoints into background and foreground, while the second level segments the foreground keypoints into moving or stationary ones. Stationary foreground keypoints are used to extract views of stable foreground objects from the video, removing the parts of the image containing keypoints coming from the cluttered background.
The system cyclically a) updates the background model until it is steady, b) waits for stable new objects to be detected in the scene, c) extracts the view of the detected object, d) compares it to already collected views, and e) organizes clusters of views.
Each extracted view o_i is described by a) K_i, the set of the positions of its local features (keypoints), and b) D_i, the set of their extracted ORB descriptors. ORB is a rotation-invariant version of the BRIEF binary descriptor, based on binary tests between pixels of the smoothed image patch [8]. It is suitable for real-time applications since it is faster than both SURF and SIFT while having similar matching performance and being even less affected by image noise [2].
3.1 Observations Matching
A similarity function S : (o_1, o_2) → [0, 1] is defined on a pair of object views (o_1, o_2), representing the quality of the visual match between them. The similarity value between two views is computed in the steps shown in Figure 2 and described below.
Fig. 2: Example of similarity computation between two different views (a) and (b) of the same 3D object. The homography relating the matching keypoints is shown in (d). View (a) is transformed using the found homography in (c) and all matching steps are reapplied: the second homography found (e) confirms the match. The computed similarity value is 0.48.
Feature Matching Let K_1, K_2 be the sets of keypoints of the compared views and D_1, D_2 their sets of corresponding descriptors.
A preliminary list L of descriptor matches is created by finding, for each descriptor in D_1, its nearest neighbor in D_2 using the brute-force method. Distances between descriptors are computed using the method suggested by the authors of the descriptor; in the case of ORB, the Hamming distance between the binary representations of the descriptors is used.
Matches in L are then filtered, keeping only the ones whose descriptor distance is below T_m. We chose T_m = 64 as suggested by Rublee et al. [8]; although this is not the most stringent value for filtering bad matches, we preferred a high recall of matches over high precision at this step.
If fewer than 4 matches are left in the list, the following steps cannot be applied and the similarity value is set to 0.
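The matching step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: ORB descriptors are assumed to be given as 256-bit integers, and all names are hypothetical.

```python
T_M = 64  # maximum Hamming distance for a valid match, as in Rublee et al. [8]

def hamming(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(d1 ^ d2).count("1")

def match_descriptors(D1, D2):
    """For each descriptor in D1, find its nearest neighbour in D2 by brute
    force, then keep only matches whose distance is below T_M.
    Returns a list of (index_in_D1, index_in_D2, distance) triples."""
    matches = []
    for i, d1 in enumerate(D1):
        j, dist = min(((j, hamming(d1, d2)) for j, d2 in enumerate(D2)),
                      key=lambda t: t[1])
        if dist < T_M:
            matches.append((i, j, dist))
    return matches
```

If fewer than 4 matches survive this filter, the geometric steps that follow are skipped and the similarity is set to 0.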
RANSAC Filtering of Matches Bad matches in L are filtered out by checking whether the matched points are geometrically consistent.
Two images of the same planar surface in space are related by a homography [3]. A homography is an invertible transformation, represented by a 3 × 3 real matrix, that maps the 2D coordinates of points in one image plane into the 2D coordinates in another plane:

        | h11 h12 h13 |
    H = | h21 h22 h23 |
        | h31 h32  1  |
Let K*_1, K*_2 be the sets of keypoints corresponding to the descriptors belonging to L. In order to find the homography that correctly relates most of the points in K*_1 and K*_2, RANSAC is applied [3]. RANSAC is a non-deterministic algorithm for estimating the parameters of a mathematical model from a set of observed data which contains outliers. The RANSAC algorithm iteratively executes the following steps:
1. take 4 matches (couples of points) at random from K*_1 and K*_2,
2. compute the homography H relating those points,
3. count the number of other matches that are correctly related by H (inliers).
After a certain number of iterations, the matrix H which gave the maximum number of inliers is returned.
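The sample-fit-count loop above can be illustrated with a minimal self-contained sketch. Since a full homography fit is lengthy, a simple 2D translation model stands in for step 2 here: the loop structure is the same, only the minimal sample size (1 match instead of 4) and the fitted model change.

```python
import random

def ransac_translation(src, dst, iters=200, tol=2.0, seed=0):
    """Illustrative RANSAC loop (a translation model stands in for the
    homography fit): repeatedly sample a candidate model from a random
    match, count inliers, and keep the best model.
    src, dst: lists of matched (x, y) points."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        k = rng.randrange(len(src))  # minimal sample: 1 match (4 for a homography)
        dx, dy = dst[k][0] - src[k][0], dst[k][1] - src[k][1]
        # count the matches correctly related by the candidate model
        inliers = [i for i, (p, q) in enumerate(zip(src, dst))
                   if abs(p[0] + dx - q[0]) <= tol and abs(p[1] + dy - q[1]) <= tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (dx, dy), inliers
    return best_model, best_inliers
```

With three matches related by the translation (10, 20) and one outlier, the loop recovers the translation and flags the first three matches as inliers.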
Using the homography found by the RANSAC algorithm (Figure 2d), we can further filter the matches in L, keeping only the inliers of the perspective transformation.
Quasi-degenerate and flipping homographies can be detected by analyzing the homography matrix. Three checks are done:
- flipping homographies can be discarded by checking if det(H) < 0;
- very skewed perspective homographies can be discarded if det(H) is too small or too big: given a parameter N, H is discarded if det(H) > N or det(H) < 1/N;
- homographies transforming the matching keypoints' bounding box into a concave polygon can be filtered out with a convexity check.
In those cases, it is very unlikely that the views under analysis are really related by this perspective transformation, therefore the system assumes there is no similarity between them and returns a similarity value of 0.
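The three checks can be sketched as follows. This is a hedged illustration: the 1/N lower bound is our reading of "det(H) too small", and `warped_bbox` is assumed to be the keypoints' bounding box already transformed by H.

```python
def det3(H):
    """Determinant of a 3x3 matrix given as nested lists."""
    a, b, c = H[0]; d, e, f = H[1]; g, h, i = H[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_convex(poly):
    """True if the polygon (list of (x, y) vertices in order) is convex:
    cross products of consecutive edges must all share one sign."""
    n = len(poly)
    signs = set()
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n], poly[(i + 2) % n]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross != 0:
            signs.add(cross > 0)
    return len(signs) <= 1

def accept_homography(H, warped_bbox, N=10.0):
    """Apply the three checks to a candidate homography H."""
    d = det3(H)
    if d < 0:                      # flipping homography
        return False
    if d > N or d < 1.0 / N:       # too skewed (assumed lower bound: 1/N)
        return False
    return is_convex(warped_bbox)  # warped bounding box must stay convex
```

The value of N is a tunable parameter; the example default of 10 is an assumption, not a value taken from the paper.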
Second Stage RANSAC Some views may pass the homography matrix check even if the perspective transform described by H is very unlikely to be observed. In order to filter out false-positive homography matrices, the image of the first view o_1 is transformed into ô_1 using the homography to be validated (Figure 2c), and the similarity computation steps are repeated considering the views ô_1 and o_2. Features are re-detected and re-extracted from ô_1 and matched with o_2, and a second RANSAC is executed to estimate a new homography Ĥ describing the perspective transformation between ô_1 and o_2. If the original views o_1 and o_2 were really different views of the same object, Ĥ should be very near to the identity transformation (Figure 2e); otherwise the similarity between o_1 and o_2 is set to 0.
Similarity Output After the system has found a good homography relating the views, the ratios r̂_1, r_2 between the number of inliers and the total number of detected features are computed for each view:

    r̂_1 = I / |K̂_1|,    r_2 = I / |K_2|

where I is the number of inliers of the homography estimated between views ô_1 and o_2, and |K̂_1| and |K_2| are respectively the number of detected keypoints in ô_1 and in o_2. The similarity value between the original views under analysis, S(o_1, o_2), is defined as the harmonic mean of r̂_1 and r_2 (Figure 3):

    S(o_1, o_2) = 2 r̂_1 r_2 / (r̂_1 + r_2)
         (a)    (b)    (c)    (d)
    A    1      0      0      0.63
    B    0      1      0.57   0
    C    0      0.62   1      0
    D    0.68   0      0      1

Fig. 3: Values of similarity among object views (a-d), reported in table (e).
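Given the inlier count and the two keypoint counts, the final similarity value reduces to a harmonic mean of the two inlier ratios; a minimal sketch (hypothetical function name):

```python
def similarity(num_inliers, n_kp_warped, n_kp_2):
    """S(o1, o2): harmonic mean of the inlier ratios r1 = I/|K1_warped|
    and r2 = I/|K2|. Returns a value in [0, 1]."""
    r1 = num_inliers / n_kp_warped
    r2 = num_inliers / n_kp_2
    if r1 + r2 == 0:
        return 0.0
    return 2 * r1 * r2 / (r1 + r2)
```

The harmonic mean penalizes asymmetric matches: both views must have a high fraction of their keypoints explained by the homography for the score to be high.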
4 Online Object Clustering
Every time a new view of an object is gathered from the video stream, the system a) assigns it to a cluster and b) maintains clusters of views that potentially represent the same 3D object (Figure 4).
Each cluster is identified by a label assigned to its views. The system puts a new view in a cluster relying on the similarity it has with other already clustered views, following an agglomerative clustering approach. The new view can bring information useful for cluster reorganization: for example, let c_1 and c_2 be two clusters of views representing the same 3D object viewed from two different poses. An intermediate view of the 3D object could suggest that the system merge c_1 and c_2 into a unique cluster (see Figure 5).
Given a new view ô, a list L_s of similar views is generated by scanning the local database. For each object view o_i the similarity value s_i = S(ô, o_i) is computed, and if it is above a similarity threshold T_s, o_i is inserted into L_s.
When trying to label ô, the following scenarios can occur:
Fig. 4: Example of two object view clusters: (a) table calendar, (b) poetry book.
Fig. 5: Example of cluster merging: the new view (c) is similar to both the side view cluster (a) and the frontal view cluster (b) and can lead to a cluster merge.
1. ô does not match any view; hence a new cluster is created and a new label is assigned to ô.
2. ô matches one or more views all belonging to the same cluster; hence the system assigns the corresponding cluster label to ô.
3. ô matches two or more views belonging to different clusters. Several actions may be taken by the system in this situation:
(a) the clusters containing the views similar to the new one are merged together into a unique, bigger cluster to which the new view will belong (Figure 5).
(b) the new view is inserted into only one of the candidate clusters.
(c) a new cluster is created containing only the new view.
Currently, the system does not decide automatically in the third scenario and asks the user which action should be taken. Interaction between multiple cameras and similarity values between views and clusters may be exploited to take the correct action automatically, but these options are not discussed in this paper and are left to future work.
If a new view is incorrectly put in a new cluster instead of being grouped with the other views representing the same object, the agglomerative clustering algorithm can eventually build a unique cluster, provided that intermediate views of the same object are later collected by the system.
5 Experiments
The presented system autonomously groups object views into clusters without knowing their labels, but it cannot recognize them before the user labels at least some of them; hence the system cannot be compared with traditionally trained classifiers. Instead, the ability of the system to build good, easy-to-label clusters is measured.
To do so, the publicly available 3D Objects dataset [9] has been used. This dataset is composed of images of 10 object categories. For each category, 9-10 objects are present, and for each object several images are reported in which the specific object is shown in different poses. Each image comes with a foreground mask which denotes exactly in which part of the image the object is located. Images are taken from 8 different angles using 3 different scales and 3 different heights for the camera, leading to around 5500 labeled images of 100 specific objects (see Figure 7).
Fig. 6: Object learning and recognition in a test video sequence: objects are added, moved and removed from the scene. The system segments objects from the background and incrementally creates clusters of similar object views. An object is recognized by assessing the membership of its current view to a pre-existent cluster.
Category     Object         Views
cellphone    cellphone 1    ...
             ...            ...
             cellphone 9    ...
mouse        mouse 1        ...
             ...            ...
toaster      toaster 1      ...
             ...            ...

Fig. 7: Excerpt from the Stanford "3D Objects" dataset: only some views of some objects of some classes are reported.
Let O = {(o_1, l_1), (o_2, l_2), ...} be the set of labeled views. The entire dataset O is randomly shuffled and split into a training set O_train (90%) and a test set O_test (10%): training views are presented to the system as if coming from the output of the foreground extraction stage. The system builds clusters of views while they are processed. Where supervised clustering is needed, the test code uses the ground-truth labels of the involved views to simulate user interaction, applying Algorithm 1.
Once the clusters are built, they must be labeled to produce a labeled training set. Since the user usually does not want to waste time cleaning clusters or labeling individual objects, the test code simulates a labeling technique based on majority voting: an entire cluster is labeled with the label of the most frequent object present in it.
The training set thus labeled is used for training a k-NN classifier. The cluster k-NN classifier finds the k most similar views (the ones with the highest value
Algorithm 1 Clustering algorithm simulating user interaction, used for evaluation tests
for all (o_i, l_i) ∈ O_train do
  find the set O_s of views similar to o_i: O_s = {o_j ∈ Database : S(o_i, o_j) > T_s}
  if |O_s| = 0 then
    put o_i in a new cluster
  else if |O_s| = 1 or (all views in O_s belong to the same cluster) then
    put o_i in the cluster of the similar views
  else  // simulate user interaction
    find the set C_s of all clusters to which the similar views belong
    for all c ∈ C_s do
      find the majority ground-truth label of c (the label appearing most often in the cluster)
    end for
    create a new cluster, merging together all clusters whose majority label equals l_i (the label of o_i)
    put o_i in the newly created cluster
  end if
end for
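Algorithm 1 can be transcribed into runnable Python. This is a sketch under stated assumptions: `sim(i, j)` is assumed to return S between the i-th and j-th views, the names are hypothetical, and the simulated user merges only the clusters whose majority ground-truth label equals l_i.

```python
from collections import Counter

def cluster_views(views, labels, sim, T_s):
    """Incrementally assign each view to a cluster, simulating the user
    with ground-truth labels. Returns one cluster id per view."""
    cluster = {}   # view index -> cluster id
    next_id = 0
    for i in range(len(views)):
        # views seen so far play the role of the local database
        similar = [j for j in range(i) if sim(i, j) > T_s]
        clusters = {cluster[j] for j in similar}
        if not similar:                      # scenario 1: new cluster
            cluster[i] = next_id
            next_id += 1
        elif len(clusters) == 1:             # scenario 2: a unique candidate cluster
            cluster[i] = clusters.pop()
        else:                                # scenario 3: simulated user interaction
            merged = next_id
            next_id += 1
            for c in clusters:
                members = [j for j, cid in cluster.items() if cid == c]
                majority = Counter(labels[j] for j in members).most_common(1)[0][0]
                if majority == labels[i]:    # merge clusters matching l_i
                    for j in members:
                        cluster[j] = merged
            cluster[i] = merged
    return [cluster[i] for i in range(len(views))]
```

As a toy usage, views can be 1D points with similarity 1 when they are close and 0 otherwise; an intermediate point then triggers the scenario-3 merge of its two neighbouring clusters.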
of similarity S) and assigns a score to each label of those views. The winning label is assigned to the processed test view. Another k-NN classifier is trained using the training set with the correct labels, and another labeling of the test set is generated in the same way.
Test set labelings are evaluated extracting precision, recall and F-score for each 3D object and then aggregating them using micro- and macro-averaging techniques, defined as follows:

  p_micro = (Σ_i TP_i) / (Σ_i (TP_i + FP_i))          p_macro = (1/n) Σ_i p_i
  r_micro = (Σ_i TP_i) / (Σ_i P_i)                    r_macro = (1/n) Σ_i r_i
  F_micro = 2 p_micro r_micro / (p_micro + r_micro)   F_macro = 2 p_macro r_macro / (p_macro + r_macro)

where the sums run over i = 1, ..., n, n is the number of object classes, TP_i the true positives, FP_i the false positives, P_i the total number of views, p_i the precision, r_i the recall and F_i the F-score of object i.
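The aggregation can be sketched directly from the definitions, assuming the per-class counts are given as (TP_i, FP_i, P_i) triples (an illustrative helper, not the paper's evaluation code):

```python
def micro_macro(stats):
    """stats: list of per-class (TP, FP, P) triples, where P is the total
    number of views of the class. Returns micro- and macro-averaged
    precision, recall and F-score."""
    n = len(stats)
    tp = sum(s[0] for s in stats)
    tp_fp = sum(s[0] + s[1] for s in stats)
    p_total = sum(s[2] for s in stats)
    p_micro = tp / tp_fp                                  # pooled precision
    r_micro = tp / p_total                                # pooled recall
    p_macro = sum(s[0] / (s[0] + s[1]) for s in stats) / n  # mean per-class precision
    r_macro = sum(s[0] / s[2] for s in stats) / n           # mean per-class recall
    f = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return (p_micro, r_micro, f(p_micro, r_micro),
            p_macro, r_macro, f(p_macro, r_macro))
```

Note how a small class with poor precision pulls the macro average down much more than the micro average, which is dominated by the larger class.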
Macro-averaged metrics tend to give the same weight to each class, while micro-averaged metrics take into account possible biases introduced by each class and give a more accurate global performance index. Another measured metric is the number of interactions the system must have with the user in order to label the training set: the ground-truth k-NN classifier needs the user to label each training view individually, which corresponds to a number of queries to the user equal to the number of views in the training set. The cluster k-NN classifier needs to interact with the user a) when a cluster merge cannot be resolved automatically during the online clustering and b) when a cluster has to be labeled.
6 Conclusions
In Figure 8, the performance of the two classifiers for various values of the similarity threshold T_s is reported. It can be seen that for T_s around 0.2, the cluster k-NN classifier has almost the same performance as the ground-truth k-NN classifier, while requiring only around half the interactions with the user.
Fig. 8: Comparison of the performance on the recognition task, solved by a k-NN classifier trained with the ground-truth training set (blue lines with circle markers) and by a k-NN classifier whose training set is obtained by cluster labeling (red lines with cross markers). Solid and dashed lines indicate micro- and macro-averaged metrics respectively. Panels show, when varying the similarity threshold: (a) precision, (b) recall, (c) F1-score, and (d) the number of queries to the user.
However, the performance degradation of the cluster k-NN classifier is due to the fact that we simulated a single user interaction after the training phase, which used the majority-voting paradigm to label all clusters at once. Since the system incrementally builds richer and richer clusters, this is not the best way to interact with the user when asking for labels: user interaction may be proactively requested only when big homogeneous clusters are involved, maximizing the amount of information collected. Moreover, techniques smarter than majority voting may be implemented to simulate a more precise user labeling session. In the performed tests, many singleton or small clusters are present at the end of the training phase, raising the number of queries to the user needed to label the entire training set.
References
[1] Carrara, F., Amato, G., Falchi, F., Gennaro, C.: Efficient foreground-background segmentation using local features for object detection. In: Proceedings of the International Conference on Distributed Smart Cameras, ICDSC '15, September 8-11, 2015, Seville, Spain (submitted for publication)
[2] De Beugher, S., Brône, G., Goedemé, T.: Automatic analysis of in-the-wild mobile eye-tracking experiments using object, face and person detection. In: Proceedings of the International Conference on Computer Vision Theory and Applications (VISIGRAPP 2014). vol. 1, pp. 625-633 (2014)
[3] Dubrofsky, E.: Homography estimation. Master's thesis, University of British Columbia (2009)
[4] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from
few training examples: An incremental bayesian approach tested on 101
object categories. Computer Vision and Image Understanding 106(1), 59–
70 (2007)
[5] Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsuper-
vised scale-invariant learning. In: Computer Vision and Pattern Recogni-
tion, 2003. Proceedings. 2003 IEEE Computer Society Conference on. vol. 2,
pp. II–264. IEEE (2003)
[6] Lowe, D.G.: Local feature view clustering for 3D object recognition. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. vol. 1, pp. I-682. IEEE (2001)
[7] Murase, H., Nayar, S.K.: Visual learning and recognition of 3-d objects from
appearance. International journal of computer vision 14(1), 5–24 (1995)
[8] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: Computer Vision (ICCV), 2011 IEEE International Conference on. pp. 2564-2571. IEEE (2011)
[9] Savarese, S., Li, F.F.: 3d generic object categorization, localization and pose
estimation. In: ICCV. pp. 1–8 (2007)
[10] Weber, M., Welling, M., Perona, P.: Unsupervised learning of models for
recognition. Springer (2000)