-
ABSTRACT: Unsupervised over-segmentation of an image into superpixels is a common preprocessing step for image parsing algorithms. Superpixels are used both as regions of support for feature vectors and as a starting point for the final segmentation. Recent algorithms that construct superpixels conforming to a regular grid (a superpixel lattice) have used greedy solutions. In this paper we show that we can construct a globally optimal solution in either the horizontal or vertical direction using a single graph cut. The solution takes into account both edges in the image and the coherence of the resulting superpixel regions. We show that our method outperforms existing algorithms for computing superpixel lattices. Additionally, we show that performance can be comparable to or better than that of other contemporary segmentation algorithms that are not constrained to produce a lattice.
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on; 07/2010
-
ABSTRACT: We consider the problem of parsing facial features from an image labeling perspective. We learn a per-pixel unary classifier, and a prior over expected label configurations, allowing us to estimate a dense labeling of facial images by part (e.g. hair, mouth, moustache, hat). This approach deals naturally with large variations in shape and appearance characteristic of unconstrained facial images, and also the problem of detecting classes that may be present or absent. We use an Adaboost-based unary classifier, and develop a family of priors based on 'epitomes', which are shown to be particularly effective in capturing the non-stationary aspects of face label distributions.
Image Processing (ICIP), 2009 16th IEEE International Conference on; 12/2009
-
ABSTRACT: Many previous studies have investigated gender classification in well-lit frontal images. In this paper we consider images where the pose, expression and lighting are relatively unconstrained. We localize faces using a standard sliding-window detector. We preprocess the facial region by convolving it with Gabor filters at four scales and four orientations. We sample these responses and concatenate them to form a feature vector. We develop a classifier based on an additive sum of non-linear functions of one-dimensional projections of the data. In particular, we investigate arc tangent functions and weighted sums of Gaussians. We describe a training method based on increasing the binomial log likelihood. We demonstrate our system on two databases and show that it performs well relative to the state of the art.
Image Processing (ICIP), 2009 16th IEEE International Conference on; 12/2009
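The preprocessing step above (Gabor filtering at four scales and four orientations, then sampling and concatenating the responses) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The cosine-Gabor formula, the wavelengths (4, 6, 8, 12 px), the sigma-to-wavelength ratio of 0.56, the kernel size, the use of response magnitudes and the sampling stride are all hypothetical choices. The function names are illustrative as well.

```python
import math


def gabor_kernel(wavelength, theta, sigma, size):
    """Real (cosine) Gabor kernel: a Gaussian envelope times a sinusoidal carrier."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def filter_response(image, kernel, cy, cx):
    """Response of one kernel centred at pixel (cy, cx), with zero padding.

    This is a correlation. It equals convolution here because the cosine
    Gabor kernel is unchanged by a 180-degree rotation.
    """
    half = len(kernel) // 2
    h, w = len(image), len(image[0])
    total = 0.0
    for ky, krow in enumerate(kernel):
        for kx, kval in enumerate(krow):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < h and 0 <= x < w:
                total += kval * image[y][x]
    return total


def gabor_features(image, wavelengths=(4, 6, 8, 12), n_orient=4, stride=8):
    """Sample response magnitudes of a scale x orientation bank on a grid and concatenate."""
    features = []
    for lam in wavelengths:                     # four scales (hypothetical values)
        for k in range(n_orient):               # orientations 0, 45, 90, 135 degrees
            kern = gabor_kernel(lam, math.pi * k / n_orient, 0.56 * lam, 2 * lam + 1)
            for cy in range(0, len(image), stride):
                for cx in range(0, len(image[0]), stride):
                    features.append(abs(filter_response(image, kern, cy, cx)))
    return features
```

For a 32 × 32 face crop with a stride of 8, the result is 4 scales × 4 orientations × 16 grid points = 256 values. The one-dimensional projections used by the classifier would then be taken from this feature vector.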
-
ABSTRACT: Advances in object detection have made it possible to collect large databases of certain objects. In this paper we exploit these datasets for within-object classification. For example, we classify gender in face images, pose in pedestrian images and phenotype in cell images. Previous work has mainly targeted these tasks individually, using object-specific representations. Here, we propose a general Bayesian framework for within-object classification. Images are represented as a regular grid of non-overlapping patches. In training, these patches are approximated by a predefined library. In inference, the choice of approximating patch determines the classification decision. We propose a Bayesian framework in which we marginalize over the patch frequency parameters to provide a posterior probability for the class. We test our algorithm on several challenging “real world” databases.
Computer Vision, 2009 IEEE 12th International Conference on; 11/2009
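Marginalizing over patch frequency parameters has a closed form if those frequencies are given a Dirichlet prior, the conjugate choice for categorical counts. The sketch below rests on three assumptions, none of which the abstract specifies:

- a symmetric Dirichlet concentration `alpha`;
- independent frequency vectors per class and per grid cell;
- images already reduced to one library index per cell.

The function names are hypothetical. The posterior predictive for a new index k is (alpha + n_k) / (K·alpha + N). The general Dirichlet-multinomial form is written out so that it also covers several test indices per cell.

```python
import math


def log_predictive(train_counts, test_counts, alpha):
    """log p(test indices | training counts), with a Dirichlet(alpha) prior integrated out."""
    a_post = [alpha + n for n in train_counts]
    total_post = sum(a_post)
    m = sum(test_counts)
    lp = math.lgamma(total_post) - math.lgamma(total_post + m)
    for a, k in zip(a_post, test_counts):
        lp += math.lgamma(a + k) - math.lgamma(a)
    return lp


def class_posterior(train, test_indices, library_size, alpha=1.0, prior=None):
    """Posterior over classes for one test image.

    train: {class_name: [image, ...]}, where each image is a list holding one
    library index per grid cell (same cell order for every image).
    """
    classes = sorted(train)
    prior = prior or {c: 1.0 / len(classes) for c in classes}
    log_scores = {}
    for c in classes:
        score = math.log(prior[c])
        for cell, test_index in enumerate(test_indices):
            counts = [0] * library_size
            for image in train[c]:
                counts[image[cell]] += 1
            test_counts = [0] * library_size
            test_counts[test_index] = 1
            score += log_predictive(counts, test_counts, alpha)
        log_scores[c] = score
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_scores.values())
    z = sum(math.exp(s - top) for s in log_scores.values())
    return {c: math.exp(s - top) / z for c, s in log_scores.items()}
```

Here is a toy example with a three-patch library and two grid cells. For `train = {"f": [[0, 1], [0, 1], [0, 2]], "m": [[2, 2], [2, 0], [1, 2]]}` and test indices `[0, 1]`, class `"f"` receives posterior probability 12/13.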
-
ABSTRACT: Unsupervised over-segmentation of an image into superpixels is a common preprocessing step for image parsing algorithms. Superpixels are used both as regions of support for feature vectors and as a starting point for the final segmentation. In this paper we investigate incorporating a priori information into superpixel segmentations. We learn a probabilistic model that describes the spatial density of the object boundaries in the image. We then describe an over-segmentation algorithm that partitions this density roughly equally between superpixels whilst still attempting to capture local object boundaries. We demonstrate this approach using road scenes, where objects in the center of the image tend to be more distant and smaller than those at the edge. We show that our algorithm successfully learns this foveated spatial distribution and can exploit this knowledge to improve the segmentation. Lastly, we introduce a new metric for evaluating vision labeling problems. We measure performance on a challenging real-world dataset and illustrate the limitations of conventional evaluation metrics.
Computer Vision, 2009 IEEE 12th International Conference on; 11/2009
-
ABSTRACT: Image parsing remains difficult due to the need to combine local and contextual information when labeling a scene. We approach this problem by using the epitome as a prior over label configurations. Several properties make it suited to this task. First, it allows a condensed patch-based representation. Second, efficient EM-based learning and inference algorithms can be used. Third, non-stationarity is easily incorporated. We consider three existing priors, and show how each can be extended using the epitome. The simplest prior assumes patches of labels are drawn independently from either a mixture model or an epitome. Next we investigate a 'conditional epitome' model, which substitutes an epitome for a conditional mixture model. Finally, we develop an 'epitome tree' model, which combines the epitome with a tree-structured belief network prior. Each model is combined with a per-pixel classifier to perform segmentation. In each case, the epitomized form of the prior provides superior segmentation performance, with the epitome tree performing best overall. We also apply the same models to denoising binary images, with similar results.
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on; 07/2009
-
ABSTRACT: Face recognition algorithms perform very unreliably when the pose of the probe face is different from the gallery face: typical feature vectors vary more with pose than with identity. We propose a generative model that creates a one-to-many mapping from an idealized "identity" space to the observed data space. In identity space, the representation for each individual does not vary with pose. We model the measured feature vector as being generated by a pose-contingent linear transformation of the identity variable in the presence of Gaussian noise. We term this model "tied" factor analysis. The choice of linear transformation (factors) depends on the pose, but the loadings are constant (tied) for a given individual. We use the EM algorithm to estimate the linear transformations and the noise parameters from training data. We propose a probabilistic distance metric that allows a full posterior over possible matches to be established. We introduce a novel feature extraction process and investigate recognition performance by using the FERET, XM2VTS, and PIE databases. Recognition performance compares favorably with contemporary approaches.
IEEE Transactions on Pattern Analysis and Machine Intelligence 07/2008; 30(6):970-984.
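The tied factor analysis model described above can be written out explicitly. The notation is assumed here, in the style of standard factor analysis, rather than taken from the paper. Index i is the individual and j the pose. F_j are the pose-dependent factors and h_i is the tied identity variable with a standard normal prior. m_j is a pose-dependent mean and Σ_j a diagonal noise covariance.

```latex
\mathbf{x}_{ij} = \mathbf{F}_j \mathbf{h}_i + \mathbf{m}_j + \boldsymbol{\epsilon}_{ij},
\qquad \mathbf{h}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\qquad \boldsymbol{\epsilon}_{ij} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_j)

% Same individual: a gallery image at pose 1 and a probe at pose 2 share h.
% Integrating h out gives a joint Gaussian with coupled off-diagonal blocks:
\begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mathbf{m}_1 \\ \mathbf{m}_2 \end{bmatrix},
\begin{bmatrix}
\mathbf{F}_1\mathbf{F}_1^{\top} + \boldsymbol{\Sigma}_1 & \mathbf{F}_1\mathbf{F}_2^{\top} \\
\mathbf{F}_2\mathbf{F}_1^{\top} & \mathbf{F}_2\mathbf{F}_2^{\top} + \boldsymbol{\Sigma}_2
\end{bmatrix}\right)
```

If the two images come from different individuals, the off-diagonal blocks are zero because the identity variables are independent. Comparing the likelihoods of the two hypotheses is one way to obtain the kind of match posterior the abstract mentions.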
-
ABSTRACT: Most face recognition algorithms use a "distance-based" approach: gallery and probe images are projected into a low dimensional feature space and decisions about matching are based on distance in this space. In this paper we use a very different representation, where each face is approximated by a regular grid of patches (a mosaicface). Each of these patches is chosen from a library. Faces are now represented as a list of indices to this library. Since there is no obvious way to measure distance between two such lists, we use a probabilistic approach in which the observed face data is explained by a generative model. There are two phases: (i) Learning - we estimate library contents and associated variability (noise), (ii) Recognition - we evaluate the probability that probe and gallery images were generated from the same library patches. Our method performs significantly better than contemporary approaches in the presence of large illumination changes. Variation in viewing conditions is handled by extending this model to learn equivalences between multiple patch appearances. We demonstrate that our method provides a major improvement on the lighting subset of the XM2VTS database compared to "distance-based" methods.
Applications of Computer Vision, 2008. WACV 2008. IEEE Workshop on; 02/2008
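The mosaicface representation described above approximates a face by a grid of patches and stores one library index per patch. The sketch below covers only that representation step. It picks the nearest library patch under squared error, which is a simplified stand-in for the paper's probabilistic selection with learned noise. The library contents and function names are hypothetical.

```python
def to_patches(image, p):
    """Split an image (list of rows) into non-overlapping p x p patches, row-major order."""
    h, w = len(image), len(image[0])
    patches = []
    for y in range(0, h - h % p, p):
        for x in range(0, w - w % p, p):
            patches.append([image[y + dy][x + dx] for dy in range(p) for dx in range(p)])
    return patches


def mosaic_indices(image, library, p):
    """Represent a face as a list of library indices, one per grid patch."""
    indices = []
    for patch in to_patches(image, p):
        # Squared error between this patch and every library entry.
        errors = [sum((a - b) ** 2 for a, b in zip(patch, entry)) for entry in library]
        indices.append(min(range(len(library)), key=errors.__getitem__))
    return indices
```

With a two-entry library of flat 2 × 2 patches (all 0 and all 1), a 4 × 4 image whose quadrants alternate between 0 and 1 becomes the index list `[0, 1, 1, 0]`. Recognition then compares such index lists probabilistically instead of measuring a distance between them.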
-
ABSTRACT: Our goal is to generate novel photo-realistic images of a given object class (e.g. faces, trees) using a model trained from example images. To achieve this, we treat training images as samples from a texture with spatially varying statistics and synthesize using a modification of the patch-based method of Efros and Freeman. Unfortunately this generates images that are locally consistent, but globally unrealistic. To resolve this we also learn a weak global model of all the image pixels. This creates images with correct global structure but unrealistic local texture. We demonstrate for the case of faces that combining global and local models allows generation of realistic image content.
Visual Media Production, 2007. IETCVMP. 4th European Conference on; 12/2007
-
ABSTRACT: Current face recognition algorithms require the tacit cooperation of users, who must position themselves in a small area of space and face the camera. Face recognition in uncontrolled conditions, such as in security camera footage, presents two extra challenges. First, it is difficult to capture good quality images of faces in this setting. Second, the pose of the face is relatively uncontrolled, which causes most face recognition algorithms to fail. In this paper, we present a series of solutions to address these problems. High quality face images are captured using a foveated wide field sensor, in which a narrow-field camera is directed towards faces using information from a static wide-field camera. Feature points corresponding to the eyes, nose, etc. are accurately localized and face shape is normalized. A novel algorithm is introduced to identify these (typically non-frontal) faces from a test gallery of frontal faces. Results are demonstrated to be superior to contemporary approaches.
Crime and Security, 2006. The Institution of Engineering and Technology Conference on; 07/2006
-
ABSTRACT: A major goal for face recognition is to identify faces where the pose of the probe is different from the stored face. Typical feature vectors vary more with pose than with identity, leading to very poor recognition performance. We propose a non-linear many-to-one mapping from a conventional feature space to a new space constructed so that each individual has a unique feature vector regardless of pose. Training data is used to implicitly parameterize the position of the multi-dimensional face manifold by pose. We introduce a co-ordinate transform, which depends on the position on the manifold. This transform is chosen so that different poses of the same face are mapped to the same feature vector. The same approach is applied to illumination changes. We investigate different methods for creating features that are invariant to both pose and illumination. We provide a metric to assess the discriminability of the resulting features. Our technique increases the discriminability of faces under unknown pose and lighting compared to contemporary methods.
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on; 07/2005
-
ABSTRACT: Reliable wide-field detection of human activity is an unsolved problem. The main difficulty is that low resolution and the unconstrained nature of realistic environments and human behaviour make form cues unreliable. Here we argue that reliability in far- or wide-field detection can still be achieved by probabilistic combination of multiple weak but complementary visual cues that do not depend on detailed form analysis. To demonstrate, we describe a real-time Bayesian algorithm for localizing human activity in relatively unconstrained scenes, using motion, background subtraction and skin colour cues. Fast sampling of scale space is achieved using integral images and a flexible norm that can handle sparse cues without loss of statistical power. We show that the probabilistic approach far outperforms a representative logical approach in which skin and background subtraction classifiers are combined conjunctively. Our method is currently used in a pre-attentive human activity sensor, generating saccadic targets for an attentive foveated vision system that reliably fixates faces over a 130° field of view, allowing high-resolution capture of facial images over a large dynamic scene.
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on; 07/2005
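Integral images make fast sampling of scale space possible: after one pass over the cue map, the sum of any axis-aligned window costs four lookups, whatever its size. Below is a minimal sketch of that data structure only. The paper's flexible norm and Bayesian cue combination are not reproduced, and the function names are illustrative.

```python
def integral_image(img):
    """Summed-area table with a zero border: ii[y][x] = sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, y0, x0, y1, x1):
    """Sum over rows y0..y1-1 and columns x0..x1-1 using four table lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

Applied to a binary cue map, such as a skin or background-subtraction mask, `box_sum(...) / area` gives the fraction of firing pixels in a candidate window. Windows of every scale and position can therefore be scored in constant time each.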
-
ABSTRACT: To realistically integrate 3D graphics into an unprepared environment, camera position must be estimated by tracking natural image features. We apply our technique to cases where feature positions in adjacent frames of an image sequence are related by a homography, or projective transformation. We describe this transformation's computation and demonstrate several applications. First, we use an augmented notice board to explain how a homography between two images of a planar scene completely determines the relative camera positions. Second, we show that the homography can also recover pure camera rotations, and we use this to develop an outdoor AR tracking system. Third, we use the system to measure head rotation and form a simple low-cost virtual reality (VR) tracking solution.
IEEE Computer Graphics and Applications 12/2002
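A homography can be computed from four point correspondences in general position by fixing h33 = 1. Each correspondence then gives two linear equations in the remaining eight entries. The sketch below solves that 8 × 8 system with Gaussian elimination in plain Python. It illustrates the textbook direct linear method and is not necessarily the computation the article uses. A practical tracker would typically use more correspondences, coordinate normalization and robust fitting. The function names are hypothetical.

```python
def solve(a, b):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (e.g. three collinear points)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def homography_from_4(src, dst):
    """3x3 homography H (with H[2][2] = 1) mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -x * u, -y * u])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -x * v, -y * v])
        b.append(v)
    h = solve(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(H, pt):
    """Map an image point through H, including the projective division."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

Mapping the four corners of a square through a known homography and passing the results to `homography_from_4` recovers that homography up to rounding error.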
-
ABSTRACT: We present a complete scalable system for 6 DOF camera tracking based on natural features. Crucially, the calculation is based only on pre-captured reference images and previous estimates of the camera pose and is hence suitable for online applications. We match natural features in the current frame to two spatially separated reference images. We overcome the wide baseline matching problem by matching to the previous frame and transferring point positions to the reference images. We then minimize deviations from the two-view and three-view constraints between the reference images and the current frame as a function of camera position parameters. We stabilize this calculation using a recursive form of temporal regularization that is similar in spirit to the Kalman filter. We can track camera pose over hundreds of frames and realistically integrate virtual objects with only slight jitter.
Mixed and Augmented Reality, 2002. ISMAR 2002. Proceedings. International Symposium on; 02/2002
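The abstract describes a recursive temporal regularization "similar in spirit to the Kalman filter" but does not give its form. As a stand-in, here is the textbook scalar Kalman filter with a random-walk motion model, applied to a single pose parameter. This is not the paper's regularizer, and `process_var` and `meas_var` are hypothetical tuning values. A 6 DOF version would use a state vector and matrix gains.

```python
def kalman_smooth(measurements, process_var, meas_var, x0=0.0, p0=1.0):
    """Scalar random-walk Kalman filter; returns the filtered estimate after each frame."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p += process_var              # predict: state unchanged, uncertainty grows
        k = p / (p + meas_var)        # Kalman gain: trust in the new measurement
        x += k * (z - x)              # correct towards the measurement
        p *= (1.0 - k)                # uncertainty shrinks after the update
        estimates.append(x)
    return estimates
```

Each estimate is a convex combination of the previous estimate and the new measurement, with the gain set by the two variances. Frame-to-frame jitter in the raw per-frame solution is therefore damped.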
-
ABSTRACT: Real-time labeling of geographical landmarks is a simple wearable computing application. Most current systems are based on calculation of optical flow between the current and previous frames to adjust the label position. Here we present two alternative algorithms based on geometrical image constraints. The first is based on epipolar geometry and provides a general description of the constraints on image flow between two static scenes. The second is based on the calculation of a homography relationship between the current frame and a stored representation of the scene. A homography can exactly describe the image motion when the scene is planar, or when the camera movement is a pure rotation, and provides a good approximation when these conditions are nearly met. We assess all three styles of algorithm across a number of criteria including robustness, speed and accuracy. We also consider issues of representation and storage for geographical labeling applications.
Wearable Computers, 2002. (ISWC 2002). Proceedings. Sixth International Symposium on; 02/2002
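The pure-rotation case follows from the pinhole model. For a camera at the origin with intrinsics K, the first view sees x1 ∝ K X. After a rotation R the second view sees x2 ∝ K R X = (K R K⁻¹) x1, so H = K R K⁻¹ regardless of scene depth. The sketch below checks this numerically. The zero-skew intrinsics (focal length 500 px, principal point (320, 240)) and the rotation about the vertical axis are hypothetical values.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def intrinsics(f, cx, cy):
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(f, cx, cy):
    # Closed-form inverse of the zero-skew intrinsic matrix above.
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]


def rot_y(angle):
    """Rotation about the camera's vertical (y) axis, e.g. a head turning left or right."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project(K, R, X):
    """Pinhole projection of 3-D point X for a camera at the origin rotated by R."""
    x = matvec(K, matvec(R, X))
    return (x[0] / x[2], x[1] / x[2])


def rotation_homography(K, K_inv, R):
    return matmul(matmul(K, R), K_inv)


def apply_h(H, pt):
    x = matvec(H, [pt[0], pt[1], 1.0])
    return (x[0] / x[2], x[1] / x[2])
```

Points at depths of 5, 10 and 50 units map from the first view to the second through the same H, which is why a stored panorama of a distant skyline can be registered by a homography alone.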
-
ABSTRACT: We present a complete system for live capture of 3D content and simultaneous presentation in augmented reality. The user sees the real world from his viewpoint, but modified so that the image of a remote collaborator is rendered into the scene. Fifteen cameras surround the collaborator, and the resulting video streams are used to construct a three-dimensional model of the subject using a shape-from-silhouette algorithm. Users view a two-dimensional fiducial marker using a video-see-through augmented reality interface. The geometric relationship between the marker and head-mounted camera is calculated, and the equivalent view of the subject is computed and drawn into the scene. Our system can generate 384 × 288 pixel images of the models at 25 fps, with a latency of < 100 ms. The result gives the strong impression that the subject is a real part of the 3D scene. We demonstrate applications of this system in 3D videoconferencing and entertainment.
International Symposium on Mixed and Augmented Reality ISMAR; 01/2002
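The abstract says the subject's 3-D model comes from a shape-from-silhouette algorithm but does not say which variant. One common form is voxel carving of the visual hull: a voxel is kept only if it projects inside the silhouette in every camera. The sketch below assumes a hypothetical interface in which each view is a projection function paired with a silhouette test. A real system would use calibrated perspective projections for its fifteen cameras.

```python
def visual_hull(grid_size, views):
    """Voxel-carved visual hull on a grid_size^3 grid of unit voxels.

    views: list of (project, inside) pairs. project maps a voxel centre
    (x, y, z) to image coordinates (u, v); inside(u, v) returns True when
    that image point lies in the camera's silhouette mask.
    """
    hull = set()
    for x in range(grid_size):
        for y in range(grid_size):
            for z in range(grid_size):
                centre = (x + 0.5, y + 0.5, z + 0.5)
                # Carve away any voxel that falls outside at least one silhouette.
                if all(inside(*project(centre)) for project, inside in views):
                    hull.add((x, y, z))
    return hull
```

With three orthographic views of a sphere, each silhouette is a disc, and the carved hull is the intersection of three cylinders. That hull always contains the sphere but is larger than it. This over-estimate is characteristic of silhouette-based reconstruction and shrinks as more cameras are added.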
-
ABSTRACT: Real-time labeling of geographical landmarks is a simple wearable computing application. Most current systems are based on calculation of optical flow between the current and previous frames to adjust the label position. Here we present two alternative algorithms based on geometrical image constraints. The first is based on epipolar geometry and provides a general description of the constraints on image flow between two static scenes. The second is based on the calculation of a homography relationship between the current frame and a stored representation of the scene. A homography can exactly describe the image motion when the scene is planar, or when the camera movement is a pure rotation, and provides a good approximation when these conditions are nearly met. We assess all three styles of algorithm across a number of criteria including robustness, speed and accuracy. We also consider issues of representation and storage for geographical labeling applications.
2012 16th International Symposium on Wearable Computers.