Vision-based hand pose estimation through similarity search using the earth mover's distance

Abstract

Vision-based hand pose estimation presents unique challenges, particularly if high-fidelity reconstruction is desired. Searching large databases of synthetic pose candidates for items similar to the input offers an attractive means of attaining this goal. The earth mover's distance is a perceptually meaningful measure of dissimilarity that has shown great promise in content-based image retrieval. It is in general, however, a computationally expensive operation and must be used sparingly. The authors investigate a way of economising on its use while preserving much of its accuracy when applied naively in the context of searching for hand pose candidates in large synthetic databases. In particular, a two-tier search method is proposed which achieves similar accuracy with a speed increase of two orders of magnitude. The system performance is evaluated using real input and the results obtained using the different approaches are compared.
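The two-tier idea described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each database item is a one-dimensional histogram of equal total mass, in which case the earth mover's distance reduces to the L1 distance between cumulative sums. The cheap first-tier measure (`cheap_distance`), the function names and the shortlist size `k` are hypothetical and are not taken from the article.

```python
from itertools import accumulate


def emd_1d(p, q):
    """EMD between two equal-mass 1D histograms with unit bin spacing.

    In this special case the EMD equals the L1 distance between the
    cumulative sums of the two histograms.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def cheap_distance(p, q):
    """Hypothetical first-tier measure: bin-wise L1 distance."""
    return sum(abs(a - b) for a, b in zip(p, q))


def two_tier_search(query, database, k):
    """Return the index of the best match found by a two-tier search."""
    # Tier 1: rank the whole database with the cheap measure, keep k items.
    shortlist = sorted(
        range(len(database)),
        key=lambda i: cheap_distance(query, database[i]),
    )[:k]
    # Tier 2: apply the expensive EMD only to the shortlisted candidates.
    return min(shortlist, key=lambda i: emd_1d(query, database[i]))


if __name__ == "__main__":
    database = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.5, 0.5, 0.0, 0.0],
    ]
    query = [0.0, 0.9, 0.1, 0.0]
    print(two_tier_search(query, database, k=2))
```

The expensive measure is evaluated only `k` times per query instead of once per database item, which is where a saving of the kind reported in the abstract would come from; how close the result stays to a full EMD search depends on how well the cheap measure agrees with the EMD.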
... Publications stemming from this work include De Villiers et al. [26] (under review), which presents the tutor system, and De Villiers et al. [24] (published), which discusses the hand pose estimation system. ...
... Because the pose estimation system is to be used as a frontend for the tutor system, the use of a coloured glove is justified, as e-learning environments are more amenable to control than general settings. Note that the work detailed here was published in De Villiers et al. [24]. The presentation will follow this article, with extensions introduced after its publication being described in Chapters 7 and 8. ...
Thesis
A sign language tutoring system capable of generating detailed context-sensitive feedback to the user is presented in this dissertation. This stands in contrast with existing sign language tutor systems, which lack the capability of providing such feedback. A domain specific language is used to describe the constraints placed on the user’s movements during the course of a sign, allowing complex constraints to be built through the combination of simpler constraints. This same linguistic description is then used to evaluate the user’s movements, and to generate corrective natural language feedback. The feedback is dynamically tailored to the user’s attempt, and automatically targets that correction which would require the least effort on the part of the user. Furthermore, a procedure is introduced which allows feedback to take the form of a simple to-do list, despite the potential complexity of the logical constraints describing the sign. The system is demonstrated using real video sequences of South African Sign Language signs, exploring the different kinds of advice the system can produce, as well as the accuracy of the comments produced. To provide input for the tutor system, the user wears a pair of coloured gloves, and a video of their attempt is recorded. A vision-based hand pose estimation system is proposed which uses the Earth Mover’s Distance to obtain hand pose estimates from images of the user’s hands. A two-tier search strategy is employed, first obtaining nearest neighbours using a simple, but related, metric. It is demonstrated that the two-tier system’s accuracy approaches that of a global search using only the Earth Mover’s Distance, yet requires only a fraction of the time. The system is shown to outperform a closely related system on a set of 500 real images of gloved hands.
... They use the CRF model for gesture recognition with a combination of Fourier and Zernike moment features. Several feature extraction techniques are used to improve the recognition accuracy [20]. Discriminative 2D Zernike moments are used as features for the color dataset [18]. ...
Preprint
We propose a new technique for recognition of dumb person hand gestures in real-world environments. In this technique, the hand image containing the gesture is preprocessed, and the hand region is then segmented by converting the RGB color image to the L*a*b* color space. Only a few statistical features are used to classify the segmented image into different classes. An artificial neural network is trained sequentially using a one-against-all scheme. Once trained, the system can recognize each class in parallel. The results of the proposed technique are much better than those of existing techniques.
Article
We propose a new technique for the recognition of dumb person hand gestures in real-world environments. In this technique, the hand image containing the gesture is preprocessed, and the hand region is then segmented by converting the RGB color image to the L*a*b* color space. Only a few statistical features are used to classify the segmented image into different classes. An artificial neural network is trained sequentially using a one-against-all scheme. Once trained, the system can recognize each class in parallel. The results of the proposed technique are much better than those of existing techniques.
... The extracted Fourier and Zernike moment features are used by a CRF model to recognize the gesture. Several feature extraction methods for hand gesture recognition are reported in the literature [11][12][13][14][15][16][17][18]. Triesch et al. [12] propose a person-independent classification of hand gestures for 10 ASL alphabets against two uniform backgrounds (dark and light) and one complex background. ...
Article
This paper demonstrates the development of a vision-based static hand gesture recognition system using a web camera for real-time applications. The system is developed in three steps: preprocessing, feature extraction and classification. The preprocessing stage consists of illumination compensation, segmentation, filtering, hand region detection and image resizing. This work proposes a discrete wavelet transform (DWT) and Fisher ratio (F-ratio) based feature extraction technique to classify hand gestures in an uncontrolled environment. This method is not only robust towards distortion and gesture vocabulary, but also invariant to translation and rotation of hand gestures. A linear support vector machine (SVM) is used as a classifier to recognize the hand gestures. The performance of the proposed method is evaluated on two standard public datasets and one indigenously developed complex-background dataset. All three datasets are based on American Sign Language (ASL) hand alphabets. The experimental results are evaluated in terms of mean accuracy. Two possible real-time applications are demonstrated: one for the interpretation of ASL sign alphabets and another for image browsing.
Chapter
Maintaining natural image statistics is a crucial factor in restoration and generation of realistic looking images. When training CNNs, photorealism is usually attempted by adversarial training (GAN), that pushes the output images to lie on the manifold of natural images. GANs are very powerful, but not perfect. They are hard to train and the results still often suffer from artifacts. In this paper we propose a complementary approach, that could be applied with or without GAN, whose goal is to train a feed-forward CNN to maintain natural internal statistics. We look explicitly at the distribution of features in an image and train the network to generate images with natural feature distributions. Our approach reduces by orders of magnitude the number of images required for training and achieves state-of-the-art results on both single-image super-resolution, and high-resolution surface normal estimation. Project page: https://www.github.com/roimehrez/contextualLoss.
Article
3D Hand pose estimation has received an increasing amount of attention, especially since consumer depth cameras came onto the market in 2010. Although substantial progress has occurred recently, no overview has kept up with the latest developments. To bridge the gap, we provide a comprehensive survey, including depth cameras, hand pose estimation methods, and public benchmark datasets. First, a markerless approach is proposed to evaluate the tracking accuracy of depth cameras with the aid of a numerical control linear motion guide. Traditional approaches focus only on static characteristics. The evaluation of dynamic tracking capability has been long neglected. Second, we summarize the state-of-the-art methods and analyze the lines of research. Third, existing benchmark datasets and evaluation criteria are identified to provide further insight into the field of hand pose estimation. In addition, realistic challenges, recent trends, dataset creation and annotation, and open problems for future research directions are also discussed.
Article
First world support systems are not always available to, or affordable for, South Africans with assistive requirements. Our group focuses on persons with communication needs, such as persons with sight, hearing or autism barriers.
Article
Direct use of the hand as an input device is an attractive method for providing natural human–computer interaction (HCI). Currently, the only technology that satisfies the advanced requirements of hand-based input for HCI is glove-based sensing. This technology, however, has several drawbacks including that it hinders the ease and naturalness with which the user can interact with the computer-controlled environment, and it requires long calibration and setup procedures. Computer vision (CV) has the potential to provide more natural, non-contact solutions. As a result, there have been considerable research efforts to use the hand as an input device for HCI. In particular, two types of research directions have emerged. One is based on gesture classification and aims to extract high-level abstract information corresponding to motion patterns or postures of the hand. The second is based on pose estimation systems and aims to capture the real 3D motion of the hand. This paper presents a literature review on the latter research direction, which is a very challenging problem in the context of HCI.
Book
In the Information Society, information holds the master key to economic influence. Similarity Search: The Metric Space Approach focuses on efficient ways to locate user-relevant information in collections of objects, the similarity of which is quantified using a pairwise distance measure. This book is a direct response to recent advances in computing, communications and storage which have led to the current flood of digital libraries, data warehouses and the limitless heterogeneity of internet resources. It introduces the state of the art in developing index structures for searching complex data modeled as instances of a metric space. The book consists of two parts. Part 1 presents the metric search approach in a nutshell by defining the problem, describes the major theoretical principles, and provides an extensive survey of specific techniques for a large range of applications. Part 2 concentrates on approaches particularly designed for searching in very large collections of data. Similarity Search: The Metric Space Approach is designed for a professional audience, composed of academic researchers as well as practitioners in industry. This book is also suitable as introductory material for graduate-level students in computer science.
Article
We investigate the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval. The EMD is based on the minimal cost that must be paid to transform one distribution into the other, in a precise sense, and was first proposed for certain vision problems by Peleg, Werman, and Rom. For image retrieval, we combine this idea with a representation scheme for distributions that is based on vector quantization. This combination leads to an image comparison framework that often accounts for perceptual similarity better than other previously proposed methods. The EMD is based on a solution to the transportation problem from linear optimization, for which efficient algorithms are available, and also allows naturally for partial matching. It is more robust than histogram matching techniques, in that it can operate on variable-length representations of the distributions that avoid quantization and other binning problems typical of histograms. When used to compare distributions with the same overall mass, the EMD is a true metric. In this paper we focus on applications to color and texture, and we compare the retrieval performance of the EMD with that of other distances.
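For orientation, the transportation-problem view of the EMD mentioned above is commonly written as follows. The notation is an assumption introduced here and does not come from the abstract: $P$ and $Q$ are two distributions with cluster weights $w_{p_i}$ and $w_{q_j}$, $d_{ij}$ is the ground distance between cluster $i$ of $P$ and cluster $j$ of $Q$, and $f_{ij}$ is the amount of mass moved between them.

```latex
\begin{aligned}
\min_{f_{ij}} \quad & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{subject to} \quad & f_{ij} \ge 0, \\
& \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \qquad \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \\
& \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij} = \min\Bigl(\sum_{i=1}^{m} w_{p_i},\; \sum_{j=1}^{n} w_{q_j}\Bigr), \\[4pt]
\mathrm{EMD}(P, Q) \;=\; & \frac{\sum_{i,j} f^{*}_{ij}\, d_{ij}}{\sum_{i,j} f^{*}_{ij}}
\end{aligned}
```

Here $f^{*}$ denotes an optimal flow. The last constraint is what permits partial matching: when the total masses differ, only the smaller total has to be moved. When the masses are equal and the ground distance is itself a metric, the EMD is a true metric, as the abstract notes.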
Conference Paper
As a first step towards a perceptual user interface, a computer vision color tracking algorithm is developed and applied towards tracking human faces. Computer vision algorithms that are intended to form part of a perceptual user interface must be fast and efficient. They must be able to track in real time yet not absorb a major share of computational resources: other tasks must be able to run while the visual interface is being used. The new algorithm developed here is based on a robust non-parametric technique for climbing density gradients to find the mode (peak) of probability distributions called the mean shift algorithm. In our case, we want to find the mode of a color distribution within a video scene. Therefore, the mean shift algorithm is modified to deal with dynamically changing color probability distributions derived from video frame sequences. The modified algorithm is called the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT's tracking accuracy is compared against a Polhemus tracker. Tolerance to noise, distractors and performance is studied. CAMSHIFT is then used as a computer interface for controlling commercial computer games and for exploring immersive 3D graphic worlds.
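The mode-seeking step that the mean shift algorithm performs can be sketched in one dimension. This is a simplified illustration with a fixed, flat window over a static weight array. It is not CAMSHIFT itself, which additionally adapts the window as the distribution changes from frame to frame. The function name and parameters are hypothetical.

```python
def mean_shift_mode(weights, start, half_width, max_iter=50):
    """Climb to a local mode of a 1D non-negative weight array.

    A flat window of fixed half-width is repeatedly re-centred on the
    weighted centroid of the samples it covers until it stops moving.
    """
    centre = start
    for _ in range(max_iter):
        lo = max(0, centre - half_width)
        hi = min(len(weights), centre + half_width + 1)
        mass = sum(weights[lo:hi])
        if mass == 0:
            break  # nothing under the window, so no direction to move in
        centroid = sum(i * weights[i] for i in range(lo, hi)) / mass
        new_centre = int(centroid + 0.5)  # round to the nearest index
        if new_centre == centre:
            break  # converged: the window is centred on its own centroid
        centre = new_centre
    return centre


if __name__ == "__main__":
    # Toy "colour probability" profile with a single peak at index 5.
    profile = [0, 0, 1, 2, 5, 9, 5, 2, 1, 0, 0]
    print(mean_shift_mode(profile, start=2, half_width=2))
```

In a tracker, the same update would be applied to a two-dimensional probability image, with the window position from one frame used as the starting point for the next.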
Conference Paper
Estimation of 3D hand pose is useful in many gesture recognition applications, ranging from human-computer interaction to automated recognition of sign languages. In this paper, 3D hand pose estimation is treated as a database indexing problem. Given an input image of a hand, the most similar images in a large database of hand images are retrieved. The hand pose parameters of the retrieved images are used as estimates for the hand pose in the input image. Lipschitz embeddings of edge images into a Euclidean space are used to improve the efficiency of database retrieval. In order to achieve interactive retrieval times, similarity queries are initially performed in this Euclidean space. The paper describes ongoing work that focuses on how to best choose reference images, in order to improve retrieval accuracy.
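A minimal sketch of the embedding idea, under stated assumptions: each object is mapped to the vector of its distances to a few reference objects, and cheap vector comparisons then stand in for the original distance. The toy distance (L1 between 2D points, standing in for an expensive image distance), the reference choice and all function names are hypothetical. The paper's actual distance measure and reference selection are not given in the abstract.

```python
import math


def lipschitz_embed(x, references, dist):
    """Map x to the vector of its distances to the reference objects."""
    return [dist(x, r) for r in references]


def euclidean(u, v):
    """Cheap comparison used in the embedded space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l1(p, q):
    """Toy stand-in for an expensive distance between two objects."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    references = [(1, 0), (0, 2)]
    x, y = (0, 0), (3, 4)
    fx = lipschitz_embed(x, references, l1)
    fy = lipschitz_embed(y, references, l1)
    print(fx, fy, euclidean(fx, fy), l1(x, y))
```

When `dist` is a metric, the triangle inequality guarantees that each coordinate of the embedding differs by at most `dist(x, y)`, which is why nearby objects stay nearby in the embedded space. Database items can be embedded once offline, so a query costs only as many expensive distance evaluations as there are reference objects.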
Conference Paper
We present a model-based method for hand posture recognition in monocular image sequences that measures joint angles, viewing angle, and position in space. Visual markers in the form of a colored cotton glove are used to extract descriptive and stable 2D features. Searching a synthetically generated database of 2.6 million entries, each consisting of 3D hand posture parameters and the corresponding 2D features, yields several candidate postures per frame. This ambiguity is resolved by exploiting temporal continuity between successive frames. The method is robust to noise, can be used from any viewing angle, and places no constraints on the hand posture. Self-occlusion of any number of markers is handled. It requires no initialization and retrospectively corrects posture errors when accordant information becomes available. Besides a qualitative evaluation on real images, a quantitative performance measurement using a large amount of synthetic input data featuring various degrees of noise shows the effectiveness of the approach.