Article

Human motion perception through video model-based tracking


Abstract

As computer vision enables a robot to be aware of its human counterpart, such algorithms could help machines achieve human-like interaction. However, many video tracking algorithms cannot cope with some requirements of robot vision. The articulated tracking system we develop solves some of those issues. It relies on model-based algorithms, which we believe are better suited to robot vision than appearance-based ones: because they update all the relevant parameters of a model of the surrounding world, their results include knowledge of the relative positions of the camera and objects. Our system relies on 3D model silhouette matching and runs in real time. We increase the algorithm's robustness by introducing a pre-processing step based on image moments, which roughly estimates the body motion from one frame to the next so that the iterative refinement starts from a better initial position.
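The abstract gives no implementation details, but the moment-based pre-processing step can be sketched compactly. The following is a minimal illustration, assuming one binary silhouette image per frame; the function names and the centroid-shift heuristic are ours, not the paper's:

```python
import numpy as np

def centroid_from_moments(silhouette):
    """Centroid (x, y) of a binary silhouette via raw image moments."""
    ys, xs = np.nonzero(silhouette)
    m00 = len(xs)                           # zeroth moment: silhouette area
    if m00 == 0:
        return None
    return xs.sum() / m00, ys.sum() / m00   # m10/m00, m01/m00

def rough_motion(prev_silhouette, curr_silhouette):
    """Frame-to-frame translation estimate used to seed iterative refinement."""
    c0 = centroid_from_moments(prev_silhouette)
    c1 = centroid_from_moments(curr_silhouette)
    if c0 is None or c1 is None:
        return 0.0, 0.0
    return c1[0] - c0[0], c1[1] - c0[1]
```

Shifting the previous pose estimate by this rough translation before the silhouette-matching iterations begin gives them a better starting point, which is the role the abstract describes for the pre-processing step.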


References
Conference Paper
Full-text available
This paper introduces a method for tracking a user's hand in 3D and recognizing the hand's gestures in real time, without any invasive devices attached to the hand. Our method uses multiple cameras to determine the position and orientation of a user's hand moving freely in 3D space. In addition, the method identifies pre-determined gestures quickly and robustly using a neural network that has been trained beforehand. This paper also describes the results of a user study of our proposed method and several types of applications, including 3D object handling for a desktop system and a 3D walkthrough for a large immersive display system.
Conference Paper
Full-text available
This paper describes a system that uses a camera and a point light source to track a user's hand in three dimensions. Using depth cues obtained from projections of the hand and its shadow, the system computes the 3D position and orientation of two fingers (thumb and pointing finger). The system recognizes one dynamic and two static gestures. Recognition and pose estimation are user independent and robust. The system operates at 60 Hz and can be used as an intuitive input interface for applications that require multi-dimensional control, such as 3D fly-throughs, object manipulation, and computer games.
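One way to read the shadow-based depth cue is as a two-ray triangulation: the fingertip lies both on the camera's viewing ray and on the ray from the light source through the fingertip's shadow. The sketch below is our illustration under that assumption, with made-up camera, light, and shadow coordinates; the paper's actual geometry may differ:

```python
import numpy as np

def closest_point_between_rays(p1, d1, p2, d2):
    """Least-squares intersection of two 3D rays p + t*d: the midpoint of the
    shortest segment between them."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    # Solve [d1 -d2] [t1 t2]^T = p2 - p1 in the least-squares sense.
    A = np.stack([d1, -d2], axis=1)
    t, *_ = np.linalg.lstsq(A, p2 - p1, rcond=None)
    return 0.5 * ((p1 + t[0] * d1) + (p2 + t[1] * d2))

# Hypothetical setup: camera centre C, point light L, and the fingertip's
# shadow S observed on the ground plane (z = 0); all values are illustrative.
C = np.array([0.0, 0.0, 1.5])           # camera centre (assumed known)
L = np.array([0.0, -1.0, 2.0])          # point light position (assumed known)
S = np.array([0.1, 0.3, 0.0])           # shadow of the fingertip on the table
ray_cam = np.array([0.05, 0.4, -0.9])   # back-projected ray toward the fingertip
fingertip = closest_point_between_rays(C, ray_cam, L, S - L)
```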
Conference Paper
Full-text available
A new view-based approach to the representation and recognition of action is presented. The basis of the representation is a temporal template: a static vector-image where the vector value at each point is a function of the motion properties at the corresponding spatial location in an image sequence. Using 18 aerobics exercises as a test domain, we explore the representational power of a simple, two-component version of the templates: the first value is a binary value indicating the presence of motion, and the second value is a function of the recency of motion in a sequence. We then develop a recognition method which matches these temporal templates against stored instances of views of known actions. The method automatically performs temporal segmentation, is invariant to linear changes in speed, and runs in real time on a standard platform. We recently incorporated this technique into the KidsRoom: an interactive, narrative play-space for children.
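The two-component template lends itself to a very small sketch. Assuming a per-frame binary motion mask (e.g. from frame differencing), one update of the recency component, widely known as a motion history image, could look like this; the decay-by-one rule and the parameter tau are illustrative choices:

```python
import numpy as np

def update_templates(mhi, motion_mask, tau=30):
    """One update of a two-component temporal template.

    motion_mask -- binary image marking pixels where motion occurred this frame
    mhi         -- motion history image; the value at a pixel encodes recency
    """
    mhi = np.where(motion_mask, tau, np.maximum(mhi - 1, 0))
    mei = mhi > 0    # motion energy image: binary "motion occurred recently"
    return mhi, mei
```

Matching then reduces to comparing moment-based descriptors of these template images against stored views of known actions.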
Conference Paper
Full-text available
In this paper we first describe how we have constructed a 3D deformable Point Distribution Model of the human hand, capturing training data semi-automatically from volume images via a physically-based model. We then show how we have attempted to use this model to track an unmarked hand moving with 6 degrees of freedom (plus deformation) in real time using a single video camera. In the course of this, we show how to improve on a weighted least-squares pose parameter approximation at little computational cost. We note the successes and shortcomings of our system and discuss how it might be improved.
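A Point Distribution Model is, at its core, a linear shape basis obtained by PCA over aligned training shapes. The class below is a generic sketch of that idea, not the paper's implementation; the choice of five modes and the three-sigma clamp are conventional defaults we assume:

```python
import numpy as np

class PointDistributionModel:
    """Linear PDM: shape ~ mean + b @ modes, with modes from PCA."""

    def __init__(self, shapes, n_modes=5):
        # shapes: (n_examples, 3 * n_landmarks) flattened, pre-aligned data
        self.mean = shapes.mean(axis=0)
        _, s, vt = np.linalg.svd(shapes - self.mean, full_matrices=False)
        self.modes = vt[:n_modes]                        # deformation modes
        self.std = s[:n_modes] / np.sqrt(len(shapes))    # per-mode std dev

    def synthesize(self, b):
        # Clamp coefficients to +/- 3 std devs so shapes stay plausible.
        b = np.clip(b, -3 * self.std, 3 * self.std)
        return self.mean + b @ self.modes
```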
Article
Full-text available
As an object moves through the field of view of a camera, the images of the object may change dramatically. This is not simply due to the translation of the object across the image plane; complications arise due to the fact that the object undergoes changes in pose relative to the viewing camera, in illumination relative to light sources, and may even become partially or fully occluded. We develop an efficient general framework for object tracking, which addresses each of these complications. We first develop a computationally efficient method for handling the geometric distortions produced by changes in pose. We then combine geometry and illumination into an algorithm that tracks large image regions using no more computation than would be required to track with no accommodation for illumination changes. Finally, we augment these methods with techniques from robust statistics and treat occluded regions on the object as statistical outliers. Experimental results are given to demonstrate the effectiveness of our methods.
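The occlusion handling described here treats occluded pixels as statistical outliers. A common way to realize that, sketched below on the assumption that the tracker produces per-pixel residuals, is iteratively reweighted least squares with an M-estimator; the Huber function and MAD scale estimate are standard choices, not necessarily the paper's:

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """IRLS weights for a Huber M-estimator: outlying pixels (e.g. under
    occlusion) are down-weighted instead of corrupting the motion estimate."""
    scale = 1.4826 * np.median(np.abs(residuals)) + 1e-12   # robust sigma (MAD)
    r = np.abs(residuals) / scale
    return np.where(r <= k, 1.0, k / r)
```

Each tracking iteration re-solves its least-squares motion update with these weights, so occluded regions contribute little to the solution.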
Article
Full-text available
In this paper, we present a new multi-screen interactive environment. The system extracts a silhouette of the participant to drive the interaction, using a method that overcomes the inherent problems associated with traditional chroma-keying, background subtraction, and rear-light projection methods. We present an approach for generating a robust silhouette of the participant using specialized infrared lighting, without making the underlying technology apparent to those interacting within the system. The design also enables video projection screens to be placed in front of and behind the user without interfering with the silhouette extraction process. The framework itself is a portable system which can act as a reusable infrastructure for many interactive projects.
Article
To increase the reliability of existing human motion tracking algorithms, we propose a method for imposing limits on the underlying hierarchical joint structures in a way that is true to life. Unlike most existing approaches, we explicitly represent dependencies between the various degrees of freedom and derive these limits from actual experimental data. To this end, we use quaternions to represent individual 3 DOF joint rotations and Euler angles for 2 DOF rotations, which we have experimentally sampled using an optical motion capture system. Each set of valid positions is bounded by an implicit surface and we handle hierarchical dependencies by representing the space of valid configurations for a child joint as a function of the position of its parent joint. This representation provides us with a metric in the space of rotations that readily lets us determine whether a posture is valid or not. As a result, it becomes easy to incorporate these sophisticated constraints into a motion tracking algorithm, using standard constrained optimization techniques. We demonstrate this by showing that doing so dramatically improves performance of an existing system when attempting to track complex and ambiguous upper body motions from low quality stereo data.
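The abstract's key point, that a learned implicit surface over joint configurations can be dropped into standard constrained optimization, can be illustrated generically. In the sketch below the implicit surface is a toy stand-in (a unit ball of valid poses) and the image-fitting cost is a placeholder; only the overall pattern, inequality-constrained minimization via SciPy, reflects the approach described:

```python
import numpy as np
from scipy.optimize import minimize

def joint_limit_margin(pose):
    """Hypothetical stand-in for the learned implicit surface: positive inside
    the space of valid joint configurations, negative outside."""
    return 1.0 - np.sum(pose**2)        # toy example: a unit ball of valid poses

def tracking_cost(pose, observation):
    """Placeholder image-fitting term (e.g. a silhouette or stereo residual)."""
    return np.sum((pose - observation) ** 2)

observation = np.array([0.9, 0.9, 0.3])   # hypothetical per-frame measurement
result = minimize(tracking_cost, x0=np.zeros(3), args=(observation,),
                  constraints=[{"type": "ineq", "fun": joint_limit_margin}])
# result.x is the best pose that still satisfies the joint-limit constraint.
```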
Article
This paper describes the temporal spatio-velocity (TSV) transform for extracting pixel velocities from binary image sequences. The TSV transform is derived from the Hough transform over windowed spatio-temporal images. We present the methodology of the transform and its implementation in an iterative computational form. The intensity at each pixel of the TSV image measures the likelihood that a pixel occupies that position with that instantaneous velocity. Binarization of the TSV image extracts blobs based on similarity of velocity and position. The TSV transform provides an efficient way to remove noise by focusing on stable velocities, and constructs noise-free blobs. We apply the transform to tracking human figures in a sidewalk environment and extend its use to an interaction recognition system. The system performs background subtraction to separate the foreground image from the background, extracts standing human objects, and generates a one-dimensional binary image sequence. The TSV transform takes the one-dimensional image sequence and yields the TSV images. Thresholding of the TSV image generates the human blobs. We obtain the human trajectories by associating the segmented blobs over time using blob features. We analyze the motion-state transitions of human interactions, which we consider to be combinations of ten simple interaction units (SIUs). Our system recognizes these ten SIUs by analyzing the shape of the human trajectory. We illustrate the TSV transform and its application to real images for human segmentation, tracking, and interaction classification.
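As a rough illustration of the voting idea, consider a short window of one-dimensional binary images. Each active pixel votes for every (position, velocity) pair consistent with it, so stable velocities accumulate support while noise does not. The sketch below is our simplified, non-iterative rendering of that Hough-style accumulation, not the paper's iterative formulation:

```python
import numpy as np

def tsv_transform(frames, velocities):
    """Hough-style voting over a windowed 1-D spatio-temporal image.

    frames     -- (T, W) binary array: a short window of 1-D binary images
    velocities -- candidate pixel velocities (pixels/frame)
    Returns an accumulator acc[x0, v]: support for a pixel at position x0 in
    the last frame moving with velocity v.
    """
    frames = np.asarray(frames, dtype=float)
    T, W = frames.shape
    acc = np.zeros((W, len(velocities)))
    for vi, v in enumerate(velocities):
        for dt in range(T):                       # dt frames before the last
            x = np.arange(W) - int(round(v * dt)) # where each pixel was dt ago
            ok = (x >= 0) & (x < W)
            acc[ok, vi] += frames[T - 1 - dt, x[ok]]
    return acc / T    # normalize so values behave like a likelihood measure
```

Thresholding the accumulator then yields blobs of pixels that share both position and velocity, as the abstract describes.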
Conference Paper
The appeal of computer games may be enhanced by vision-based user inputs. The high-speed and low-cost requirements of near-term, mass-market game applications make system design challenging: the response time of the vision interface should be less than a video frame time, and the interface should cost less than $50 U.S. We meet these constraints with algorithms tailored to particular hardware. We have developed a special detector, called the artificial retina chip, which allows fast, on-chip image processing. We describe two algorithms, based on image moments and orientation histograms, which exploit the capabilities of the chip to provide interactive response to the player's hand or body positions at a 10 ms frame time and at low cost. We show several possible game interactions.
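Image moments are sketched above under the main abstract; the orientation-histogram idea can be illustrated just as compactly. Below is a generic gradient-orientation histogram, assuming a grayscale input image; the bin count and magnitude weighting are conventional choices rather than details from the paper:

```python
import numpy as np

def orientation_histogram(image, bins=36):
    """Histogram of gradient orientations, weighted by gradient magnitude:
    a compact, translation-insensitive descriptor of hand or body shape."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                      # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)            # normalize for lighting changes
```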
Conference Paper
This paper presents a novel method for hand tracking. It uses a 3D model built from quadrics which approximates the anatomy of a human hand. This approach allows the use of results from projective geometry that yield an elegant technique for generating the projection of the model as a set of conics, as well as providing an efficient ray-tracing algorithm to handle self-occlusion. Once the model is projected, an Unscented Kalman Filter is used to update its pose so as to minimise the geometric error between the model projection and the hand observed in the video sequence. Results from experiments with real data show the accuracy of the technique.
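The projective-geometry result alluded to here is that the silhouette of a quadric under a pinhole camera is a conic, obtainable in closed form from the dual quadric. A minimal sketch of that standard formulation (the function name is ours):

```python
import numpy as np

def project_quadric_outline(P, Q_dual):
    """Outline of a quadric under a pinhole camera.

    P      -- 3x4 camera projection matrix
    Q_dual -- 4x4 dual quadric (adjugate of the 4x4 quadric matrix Q)
    Returns the 3x3 conic matrix C whose outline points satisfy x^T C x = 0.
    """
    C_dual = P @ Q_dual @ P.T            # dual conic of the silhouette
    return np.linalg.inv(C_dual)         # point conic, defined up to scale
```

Projecting each quadric of the hand model this way yields the set of conics that is then compared against the image during pose updates.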
Article
Capturing the human hand motion from video involves the estimation of the rigid global hand pose as well as the nonrigid finger articulation. The complexity induced by the high degrees of freedom of the articulated hand challenges many visual tracking techniques. For example, the particle filtering technique is plagued by the demanding requirement of a huge number of particles and the phenomenon of particle degeneracy. This paper presents a novel approach to tracking the articulated hand in video by learning and integrating natural hand motion priors. To cope with the finger articulation, this paper proposes a powerful sequential Monte Carlo tracking algorithm based on importance sampling techniques, where the importance function is based on an initial manifold model of the articulation configuration space learned from motion-captured data. In addition, this paper presents a divide-and-conquer strategy that decouples the hand poses and finger articulations and integrates them in an iterative framework to reduce the complexity of the problem. Our experiments show that this approach is effective and efficient for tracking the articulated hand. This approach can be extended to track other articulated targets.
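The importance-sampling scheme can be shown in skeleton form. In the sketch below every model component (the proposal drawn from a learned motion prior, its density, the dynamics density, and the image likelihood) is passed in as a callable; all of them are assumptions standing in for the paper's learned models:

```python
import numpy as np

def smc_step(particles, weights, propose, likelihood,
             prior_pdf, proposal_pdf, obs):
    """One sequential Monte Carlo step with an importance function.

    propose      -- draws a new pose, e.g. from a motion prior learned offline
    likelihood   -- p(observation | pose), the image-matching term
    prior_pdf    -- dynamics density p(x_t | x_{t-1})
    proposal_pdf -- density of the importance function used by `propose`
    """
    new = np.array([propose(p) for p in particles])
    w = np.array([weights[i] * likelihood(new[i], obs)
                  * prior_pdf(new[i], particles[i])
                  / (proposal_pdf(new[i], particles[i]) + 1e-300)
                  for i in range(len(particles))])
    w /= w.sum()
    idx = np.random.choice(len(new), size=len(new), p=w)   # resample (SIR)
    return new[idx], np.full(len(new), 1.0 / len(new))
```

Drawing particles from a learned proposal rather than from the raw dynamics is what counters the particle degeneracy the abstract mentions.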
Conference Paper
Computer sensing of hand and limb motion is an important problem for applications in human-computer interaction (HCI), virtual reality, and athletic performance measurement. Commercially available sensors are invasive and require the user to wear gloves or targets. We have developed a noninvasive vision-based hand tracking system, called DigitEyes. Employing a kinematic hand model, the DigitEyes system has demonstrated tracking performance at speeds of up to 10 Hz, using line and point features extracted from grayscale images of unadorned, unmarked hands. We describe an application of our sensor to a 3D mouse user-interface problem.