Yilei Xu

University of California, Riverside, Riverside, CA, United States

Publications (19) · 12.18 Total impact

  •
    ABSTRACT: Linear and multi-linear models (PCA, 3DMM, AAM/ASM, multilinear tensors) of object shape/appearance have been very popular in computer vision. In this paper, we analyze the validity of these heuristic models from the fundamental physical laws of object motion and image formation. We prove that, under suitable conditions, the image appearance space can be closely approximated to be multilinear, with the illumination and texture subspaces being trilinearly combined with the direct sum of the motion and deformation subspaces. This result provides a physics-based understanding of many of the successes and limitations of the linear and multi-linear approaches existing in the computer vision literature, and also identifies some of the conditions under which they are valid. It provides an analytical representation of the image space in terms of different physical factors that affect the image formation process. Numerical analysis of the accuracy of the physics-based models is performed, and tracking results on real data are presented.
    IEEE Transactions on Software Engineering 11/2010; · 2.59 Impact Factor
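    In a rough notation chosen here for illustration (not the paper's own), the structure described above can be written as a tensor contraction: the illumination and texture coefficients combine trilinearly with a vector from the direct sum of the motion and deformation subspaces,

        $\mathbf{I} \;\approx\; \mathcal{B} \times_1 \boldsymbol{\ell} \times_2 \mathbf{t} \times_3 (\mathbf{m} \oplus \mathbf{d}),$

    where $\mathcal{B}$ is a fixed basis tensor determined by the object's 3D structure, $\boldsymbol{\ell}$ holds illumination coefficients, $\mathbf{t}$ texture coefficients, and $\mathbf{m} \oplus \mathbf{d}$ the concatenated motion and deformation coefficients.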
  • Source
    Yilei Xu, A.K. Roy-Chowdhury
    ABSTRACT: In this paper, we show how to estimate, accurately and efficiently, the 3D motion of a rigid or non-rigid object, and time-varying lighting in a dynamic scene. This is achieved in an inverse compositional tracking framework with a novel warping function that involves a 2D → 3D → 2D transformation. The method is guaranteed to converge, is able to work with rigid and non-rigid objects, and estimates the lighting and motion from a video sequence. Experimental analysis on multiple face video sequences shows impressive speed-up over existing methods while retaining a high level of accuracy.
    Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on; 11/2008
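    A minimal sketch of an inverse compositional update loop, assuming a generic parametric warp; warp_fn and the Jacobian columns Jx, Jy are hypothetical placeholders for the paper's 2D → 3D → 2D warping function and its derivatives, and the additive parameter update is a first-order stand-in for the true inverse composition.

    import numpy as np

    def inverse_compositional_track(template, frame, warp_fn, Jx, Jy,
                                    params, n_iters=20, tol=1e-6):
        # Steepest-descent images and the Gauss-Newton Hessian are computed
        # once on the template -- the source of the method's efficiency.
        gy, gx = np.gradient(template.astype(float))
        sd = gx.ravel()[:, None] * Jx + gy.ravel()[:, None] * Jy  # (n_pixels, n_params)
        H_inv = np.linalg.inv(sd.T @ sd)

        for _ in range(n_iters):
            warped = warp_fn(frame, params)       # frame resampled into template coordinates
            error = (warped - template).ravel()
            delta = H_inv @ (sd.T @ error)
            params = params - delta               # approximate inverse composition
            if np.linalg.norm(delta) < tol:
                break
        return params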
  • Source
    Yilei Xu, A.K. Roy-Chowdhury
    ABSTRACT: While low-dimensional image representations have been very popular in computer vision, they suffer from two limitations: (i) they require collecting a large and varied training set to learn a low-dimensional set of basis functions, and (ii) they do not retain information about the 3D geometry of the object being imaged. In this paper, we show that it is possible to estimate low-dimensional manifolds that describe object appearance while retaining the geometrical information about the 3D structure of the object. By using a combination of analytically derived geometrical models and statistical learning methods, this can be achieved using a much smaller training set than most of the existing approaches. Specifically, we derive a quadrilinear manifold of object appearance that can represent the effects of illumination, pose, identity and deformation, and the basis functions of the tangent space to this manifold depend on the 3D surface normals of the objects. We show experimental results on constructing this manifold and how to efficiently track on it using an inverse compositional algorithm.
    Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on; 07/2008
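    A minimal numerical sketch of the quadrilinear manifold described above; the basis tensor (which in the paper is derived from the 3D surface normals) is replaced by a random placeholder, and all dimensions are arbitrary choices for the example.

    import numpy as np

    n_pix, n_illum, n_pose, n_id, n_def = 1024, 9, 6, 10, 8
    C = np.random.randn(n_illum, n_pose, n_id, n_def, n_pix)  # appearance basis tensor

    illum = np.random.randn(n_illum)    # illumination coefficients
    pose = np.random.randn(n_pose)      # pose / rigid-motion coefficients
    ident = np.random.randn(n_id)       # identity coefficients
    deform = np.random.randn(n_def)     # deformation coefficients

    # Quadrilinear combination: contract each physical factor against its own mode.
    image = np.einsum('ijklp,i,j,k,l->p', C, illum, pose, ident, deform)
    print(image.shape)  # (1024,)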
  • Source
    Yilei Xu, Amit Roy-Chowdhury
    ABSTRACT: In this paper, we show how to estimate, accurately and efficiently, the 3D motion of a rigid object and time-varying lighting in a dynamic scene. This is achieved in an inverse compositional tracking framework with a novel warping function that involves a 2D → 3D → 2D transformation. This also allows us to extend traditional two-frame inverse compositional tracking to a sequence of frames, leading to even higher computational savings. We prove the theoretical convergence of this method and show that it leads to significant reduction in computational burden. Experimental analysis on multiple video sequences shows impressive speed-up over existing methods while retaining a high level of accuracy.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 07/2008; 30(7):1300-7. · 4.80 Impact Factor
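    A short sketch of how a two-frame inverse compositional step can be carried across a whole sequence: the expensive template-side quantities are reused for every frame, so per-frame cost reduces to a warp, a residual, and a small linear solve. It reuses the hypothetical inverse_compositional_track routine sketched earlier in this list and illustrates only the idea, not the paper's exact algorithm.

    def track_sequence(template, frames, warp_fn, Jx, Jy, params0):
        # Track through a list of frames, warm-starting each frame from the
        # previous estimate; template-side precomputation happens inside the
        # two-frame routine and could be hoisted out entirely.
        params = params0
        trajectory = []
        for frame in frames:
            params = inverse_compositional_track(template, frame, warp_fn,
                                                 Jx, Jy, params)
            trajectory.append(params.copy())
        return trajectory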
  • Source
    EURASIP J. Adv. Sig. Proc. 01/2008; 2008.
  • Source
    Proceedings of the International Conference on Image Processing, ICIP 2008, October 12-15, 2008, San Diego, California, USA; 01/2008
  • Source
    2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA; 01/2008
  • Source
    ABSTRACT: Linear and multi-linear models of object shape/appearance (PCA, 3DMM, AAM/ASM, multilinear tensors) have been very popular in computer vision. In this paper, we analyze the validity of these models from the fundamental physical laws of object motion and image formation. We rigorously prove that the image appearance space can be closely approximated to be locally multilinear, with the illumination subspace being bilinearly combined with the direct sum of the motion, deformation and texture subspaces. This result allows us to understand theoretically many of the successes and limitations of the linear and multi-linear approaches existing in the computer vision literature, and also identifies some of the conditions under which they are valid. It provides an analytical representation of the image space in terms of different physical factors that affect the image formation process. Experimental analysis of the accuracy of the theoretical models is performed, as well as tracking on real data using the analytically derived basis functions of this space.
    1.1. Overview of the Theoretical Results
    • Starting from fundamental physics-based models governing rigid object motion, deformations, the interaction of light with the object, and perspective projection, we derive a description of the mathematical space in which an image lies. Specifically, we prove that the image space can be closely approximated to be locally multilinear, with the illumination subspace being bilinearly combined with the direct sum of the motion, deformation and texture subspaces.
    • This result allows us to justify theoretically the validity of many of the linear and multi-linear approaches existing in the computer vision literature, while also identifying some of the physical constraints under which they are valid. In fact, as explained in Section 3.2, we can now understand theoretically why some methods have worked well in some situations, and not so well in others.
    • While assuming local linearity may be intuitive, we provide, possibly for the first time, an analytical description of this image space in terms of different physical factors that affect the image formation process.
    • We show that, since we can analytically express the image space, we can estimate the motion, deformation and lighting parameters without needing a large number of training examples to first learn the characteristics of this space, and the estimates are not a function of the learning data. This analytical expression can be used in future with learning-based methods for more efficient image modeling.
    Relation to existing work: The theoretical analysis in this paper builds on some recent work that has described image appearance in terms of mathematical models derived from fundamental physical laws. In describing the effect of lighting on an object, researchers have obtained descriptions of the illumination space, e.g., the illumination cone (2) and basis illumination models (1, 10). A more recent result showed that rigid motion and lighting are related bilinearly (14) in the image appearance space. In this paper, we consider a much more general condition than any of the above: an imaged object undergoing a rigid motion (i.e., pose change) while deforming, with the illumination also changing randomly. The theoretical derivation is based on a few weak assumptions: a finite dimensional vector space representation ...
    2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA; 01/2008
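    In the same illustrative notation as before (symbols chosen here, not taken from the paper), the locally multilinear approximation stated above amounts to a bilinear combination of the illumination coefficients with a vector from the direct sum of the motion, deformation and texture subspaces,

        $\mathbf{I} \;\approx\; \sum_{i}\sum_{j} \ell_i \, v_j \, \mathbf{B}_{ij}, \qquad \mathbf{v} = \mathbf{m} \oplus \mathbf{d} \oplus \mathbf{t},$

    where the basis images $\mathbf{B}_{ij}$ are derived analytically from the object's 3D geometry rather than learned from examples.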
  • Source
    ABSTRACT: In this paper, we present a theory for combining the effects of motion, illumination, 3D structure, albedo, and camera parameters in a sequence of images obtained by a perspective camera. We show that the set of all Lambertian reflectance functions of a moving object, at any position, illuminated by arbitrarily distant light sources, lies "close" to a bilinear subspace consisting of nine illumination variables and six motion variables. This result implies that, given an arbitrary video sequence, it is possible to recover the 3D structure, motion, and illumination conditions simultaneously using the bilinear subspace formulation. The derivation builds upon existing work on linear subspace representations of reflectance by generalizing it to moving objects. Lighting can change slowly or suddenly, locally or globally, and can originate from a combination of point and extended sources. We experimentally compare the results of our theory with ground truth data and also provide results on real data by using video sequences of a 3D face and the entire human body with various combinations of motion and illumination directions. We also show results of our theory in estimating 3D motion and illumination model parameters from a video sequence.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 06/2007; 29(5):793-806. · 4.80 Impact Factor
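    A toy numerical sketch of the bilinear subspace described above: nine spherical-harmonic lighting coefficients combined bilinearly with six differential motion variables (plus a constant term for the current pose). The basis images, which the theory derives from the 3D structure and albedo, are random placeholders here.

    import numpy as np

    n_pix = 1024
    A = np.random.randn(9, 7, n_pix)    # basis images: 9 lighting modes x (1 + 6 motion) modes

    l = np.random.randn(9)              # spherical-harmonic lighting coefficients
    dm = 0.01 * np.random.randn(6)      # small rigid-motion increment (3 rotation + 3 translation)
    m = np.concatenate(([1.0], dm))     # constant term followed by the motion variables

    # Bilinear combination of lighting and motion coefficients.
    image = sum(l[i] * m[j] * A[i, j] for i in range(9) for j in range(7))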
  • Source
    ABSTRACT: In this paper, we propose a method to incrementally super-resolve 3D facial texture by integrating information frame by frame from a video captured under changing poses and illuminations. First, we recover illumination, 3D motion and shape parameters from our tracking algorithm. This information is then used to super-resolve the 3D texture using the iterative back-projection (IBP) method. Finally, the super-resolved texture is fed back to the tracking part to improve the estimation of illumination and motion parameters. This closed-loop process continues to refine the texture as new frames come in. We also propose a local-region based scheme to handle the non-rigidity of the human face. Experiments demonstrate that our framework not only incrementally super-resolves facial images, but also recovers detailed expression changes in high quality.
    Image Processing, 2007. ICIP 2007. IEEE International Conference on; 01/2007
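    A minimal sketch of an iterative back-projection refinement at a fixed 2x scale, assuming the tracker has already registered and lighting-normalized the low-resolution observation; block averaging stands in for the true imaging model and nearest-neighbour upsampling for its back-projection.

    import numpy as np

    def ibp_refine(high_res, low_res, n_iters=10, step=1.0):
        scale = high_res.shape[0] // low_res.shape[0]
        for _ in range(n_iters):
            # Simulate the low-resolution observation by block-averaging the
            # current high-resolution estimate.
            simulated = high_res.reshape(low_res.shape[0], scale,
                                         low_res.shape[1], scale).mean(axis=(1, 3))
            residual = low_res - simulated
            # Back-project the residual onto the high-resolution grid.
            high_res = high_res + step * np.kron(residual, np.ones((scale, scale)))
        return high_res

    refined = ibp_refine(np.zeros((64, 64)), np.random.rand(32, 32))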
  • Source
    Yilei Xu, A.K. Roy-Chowdhury
    ABSTRACT: Recreating the temporal illumination variations of natural scenes has great potential for realistic synthesis of video sequences. In this paper, we present a 3D (model-based) approach that achieves this goal. The approach requires a training sequence to learn the time-varying illumination models, which can then be used for synthesis in another sequence. The motion and illumination parameters in the training sequence are estimated alternately by projecting onto appropriate basis functions of a bilinear space defined in terms of the 3D surface normals of the objects. The motion is represented in terms of 3D translation and rotation of the object centroid in the camera frame, and the illumination is represented using a spherical harmonics linear basis. We show video synthesis results using the proposed approach.
    Image Processing, 2007. ICIP 2007. IEEE International Conference on; 01/2007
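    A small sketch of how synthesis under learned, time-varying lighting might look: per-pixel surface normals and albedo from the 3D model define spherical-harmonic basis images, and each frame is a linear combination of them weighted by that frame's lighting coefficients. Only the constant and linear harmonic terms are shown, with normalization constants omitted.

    import numpy as np

    def sh_basis_images(normals, albedo):
        # Constant and linear spherical-harmonic terms evaluated at the normals.
        nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
        return np.stack([albedo * np.ones_like(nx),
                         albedo * nx, albedo * ny, albedo * nz], axis=0)

    def synthesize(normals, albedo, lighting_over_time):
        B = sh_basis_images(normals, albedo)               # (4, H, W)
        return [np.tensordot(l, B, axes=1) for l in lighting_over_time]

    frames = synthesize(np.random.randn(64, 64, 3), np.random.rand(64, 64),
                        np.random.randn(10, 4))            # 10 frames of lighting coefficients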
  • Source
    ABSTRACT: The use of video sequences for face recognition has been relatively less studied than image-based approaches. In this paper, we present a framework for face recognition from video sequences that is robust to large changes in facial pose and lighting conditions. Our method is based on a recently obtained theoretical result that can integrate the effects of motion, lighting and shape in generating an image using a perspective camera. This result can be used to estimate the pose and illumination conditions for each frame of the probe sequence. Then, using a 3D face model, we synthesize images corresponding to the pose and illumination conditions estimated in the probe sequences. Similarity between the synthesized images and the probe video is computed by integrating over the entire sequence. The method can handle situations where the pose and lighting conditions in the training and testing data are completely disjoint.
    2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA; 01/2007
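    A schematic sketch of the recognition-by-synthesis scoring described above: for each gallery identity, frames are rendered at the pose and illumination estimated from the probe video, and frame-wise distances are accumulated over the whole sequence. The render callables are hypothetical wrappers around a 3D face model; the distance and the choice of the mean as the integration rule are illustrative.

    import numpy as np

    def sequence_score(probe_frames, pose_illum_estimates, render):
        dists = [np.linalg.norm(frame - render(pose, illum))
                 for frame, (pose, illum) in zip(probe_frames, pose_illum_estimates)]
        return np.mean(dists)                   # integrate similarity over the sequence

    def recognize(probe_frames, pose_illum_estimates, gallery_renderers):
        scores = {identity: sequence_score(probe_frames, pose_illum_estimates, r)
                  for identity, r in gallery_renderers.items()}
        return min(scores, key=scores.get)      # smallest accumulated distance wins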
  • Source
    Amit Roy-Chowdhury, Yilei Xu
    12/2006: pages 9-25;
  •
    ABSTRACT: In this paper we present a method for estimation of 3D motion of a rigid object from a video sequence, while simultaneously learning the parameters of an illumination model that describe the lighting conditions under which the video was captured. This is achieved by alternately estimating motion and illumination parameters in a recently proposed mathematical framework for integrating the effects of motion, illumination and structure. The motion is represented in terms of translation and rotation of the object centroid, and the illumination is represented using a spherical harmonics linear basis. The method does not assume any model for the variation of the illumination conditions - lighting can change slowly or drastically, locally or globally, and can be composed of combinations of point and extended sources. For multiple cameras viewing an object, we derive a new photometric constraint that relates the illumination parameters in two or more independent video sequences. This constraint allows verification of the illumination parameters obtained from multiple views and synthesis of new views under the same lighting conditions. We demonstrate the effectiveness of our algorithm in tracking under severe changes of lighting conditions.
    3rd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 2006), 14-16 June 2006, Chapel Hill, North Carolina, USA; 01/2006
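    A rough sketch of the alternating estimation described above: with the pose fixed, the spherical-harmonic lighting coefficients follow from a linear least-squares fit against the basis images; with the lighting fixed, a motion increment is solved from a linearized image-difference equation. The callbacks basis_images and motion_jacobian are hypothetical stand-ins for the generative model, and the additive pose update is a simplification.

    import numpy as np

    def alternate_motion_lighting(frame, pose, basis_images, motion_jacobian, n_iters=5):
        lighting = None
        for _ in range(n_iters):
            B = basis_images(pose)                        # (9, n_pixels) lighting basis at current pose
            lighting, *_ = np.linalg.lstsq(B.T, frame.ravel(), rcond=None)
            predicted = B.T @ lighting
            J = motion_jacobian(pose, lighting)           # (n_pixels, 6) motion derivatives
            dm, *_ = np.linalg.lstsq(J, frame.ravel() - predicted, rcond=None)
            pose = pose + dm                              # additive update for illustration
        return pose, lighting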
  • Source
    A.K. Roy-Chowdhury, Yilei Xu
    ABSTRACT: One of the persistent challenges in computer vision has been tracking objects under varying lighting conditions. In this paper we present a method for estimation of 3D motion of a rigid object from a monocular video sequence under arbitrary changes in the illumination conditions under which the video was captured. This is achieved by alternately estimating motion and illumination parameters using a generative model for integrating the effects of motion, illumination and structure within a unified mathematical framework. The motion is represented in terms of translation and rotation of the object centroid, and the illumination is represented using a spherical harmonics linear basis. The method does not assume any model for the variation of the illumination conditions - lighting can change slowly or drastically. For the multi-camera tracking scenario, we propose a new photometric constraint that is valid over the overlapping field of view between two cameras. This is similar in nature to the well-known epipolar constraint, except that it relates the photometric parameters, and can provide an additional constraint for illumination invariant multi-camera tracking. We demonstrate the effectiveness of our tracking algorithm on single and multi-camera video sequences under severe changes of lighting conditions.
    Computer Vision for Interactive and Intelligent Environment, 2005; 12/2005
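    A toy sketch of checking a photometric constraint between two overlapping views: lighting coefficients estimated independently in each camera should agree once mapped through a linear transform determined by the relative camera geometry. The 9x9 matrix M_12 is an assumed placeholder for that transform; the tolerance is arbitrary.

    import numpy as np

    def photometric_residual(lighting_cam1, lighting_cam2, M_12, tol=1e-2):
        # Residual of the cross-camera lighting consistency constraint.
        residual = lighting_cam2 - M_12 @ lighting_cam1
        return np.linalg.norm(residual), np.linalg.norm(residual) < tol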
  • Yilei Xu, A.K. Roy-Chowdhury
    ABSTRACT: It has been proved that the set of all Lambertian reflectance functions obtained with arbitrarily distant light sources lies close to a 9D linear subspace. We extend this result from still images to video sequences. We show that the set of all Lambertian reflectance functions of a moving object at any position, illuminated by arbitrarily distant light sources, lies close to a bilinear subspace consisting of nine illumination variables and six motion variables. This result implies that, when the position and 3D model of an object at one instant of time are known, the reflectance images at future time instances can be estimated using the bilinear subspace. This is based on the fact that, given the illumination direction, the image of a moving surface cannot change suddenly over a short time period. We apply our theory to synthesize video sequences of a 3D face with various combinations of motion and illumination directions.
    Image Processing, 2005. ICIP 2005. IEEE International Conference on; 10/2005
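    A compact sketch of the 9D harmonic basis underlying the result above: the nine basis images are low-order polynomials of the per-pixel surface normal scaled by albedo (normalization constants omitted), and any Lambertian image under distant lighting is approximately a linear combination of them; the bilinear extension adds six motion variables that move the normals over time.

    import numpy as np

    def harmonic_basis(normals, albedo):
        nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
        b = [np.ones_like(nx), nx, ny, nz,
             nx * ny, nx * nz, ny * nz,
             nx**2 - ny**2, 3 * nz**2 - 1]
        return np.stack([albedo * bi for bi in b], axis=0)   # (9, H, W)

    image = np.tensordot(np.random.randn(9),
                         harmonic_basis(np.random.randn(32, 32, 3), np.random.rand(32, 32)),
                         axes=1)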
  • Source
    ABSTRACT: Most work in computer vision has concentrated on studying the individual effects of motion and illumination on a 3D object. In this paper, we present a theory for combining the effects of motion, illumination, 3D structure, albedo, and camera parameters in a sequence of images obtained by a perspective camera. We show that the set of all Lambertian reflectance functions of a moving object, illuminated by arbitrarily distant light sources, lies "close" to a bilinear subspace consisting of nine illumination variables and six motion variables. This result implies that, given an arbitrary video sequence, it is possible to recover the 3D structure, motion and illumination conditions simultaneously using the bilinear subspace formulation. The derivation is based on the intuitive notion that, given an illumination direction, the images of a moving surface cannot change suddenly over a short time period. We experimentally compare the images obtained using our theory with ground truth data and show that the difference is small and acceptable. We also provide experimental results on real data by synthesizing video sequences of a 3D face with various combinations of motion and illumination directions.
    10th IEEE International Conference on Computer Vision (ICCV 2005), 17-20 October 2005, Beijing, China; 01/2005
  • Source
    ABSTRACT: A number of methods in tracking and recognition have successfully exploited low-dimensional representations of object appearance learned from a set of examples. In all these approaches, the construction of the underlying low-dimensional manifold relied upon obtaining different instances of the object's appearance and then using statistical data analysis tools to approximate the appearance space. This requires collecting a very large number of examples, and the accuracy of the method depends upon the examples that have been chosen. In this chapter, we show that it is possible to estimate low-dimensional manifolds that describe object appearance using a combination of analytically derived geometrical models and statistical data analysis. Specifically, we derive a quadrilinear space of object appearance that is able to represent the effects of illumination, motion, identity and shape. We then show how efficient tracking algorithms like inverse compositional estimation can be adapted to the geometry of this manifold. Our proposed method significantly reduces the amount of data that needs to be collected for learning the manifolds and makes the learned manifold less dependent upon the actual examples that were used. Based upon this novel manifold, we present a framework for face recognition from video sequences that is robust to large changes in facial pose and lighting conditions. The method can handle situations where the pose and lighting conditions in the training and testing data are completely disjoint. We show detailed performance analysis results and recognition scores on a large video dataset.
  • Source
    ABSTRACT: In many face recognition systems, the input is a video sequence consisting of one or more faces. It is necessary to track each face over this video sequence so as to extract the information that will be processed by the recognition system. Tracking is also necessary for 3D model-based recognition systems where the 3D model is estimated from the input video. Face tracking can be divided along different lines depending upon the method used, e.g., head tracking, feature tracking, image-based tracking, model-based tracking. The output of the face tracker can be the 2D position of the face in each image of the video (2D tracking), the 3D pose of the face (3D tracking), or the location of features on the face. Some trackers are able to output other parameters related to lighting or expression. The major challenges encountered by face tracking systems are robustness to pose changes, lighting variations, and facial deformations due to changes of expression, as well as occlusion of the face being tracked and clutter in the scene that makes it difficult to distinguish the face from other objects.