Article

Ultra-Range Gesture Recognition using a web-camera in Human–Robot Interaction

... The effectiveness of this interaction hinges on the accuracy and speed of the robot's visual processing, which is an area of ongoing research and development [34]. Recently, an ultra-range gesture recognition (URGR) technique for HCI has been proposed [35]. According to that research, existing gesture recognition methods are not feasible at distances exceeding 7 meters and are typically limited to indoor environments. ...
... Indoor environments often bring challenges related to lighting conditions. By using a super-resolution (SR) algorithm and a Generative Adversarial Network (GAN) to guide image restoration and reconstruct details, robots are able to recognize objects at distances of up to 25 meters [35]. This optimization gives robots a longer effective vision range and a better understanding of human instructions. ...
Article
Full-text available
This paper examines the application status and technological progress of service robots in home automation, with a particular focus on floor cleaning robots and companion robots for Alzheimer's patients. The evolution of service robots has been driven by advancements in artificial intelligence, sensor technology, and autonomous navigation, transforming them into indispensable tools in modern households. Floor cleaning robots, equipped with advanced sensors and efficient algorithms, have revolutionized routine household chores by autonomously maintaining clean floors. Companion robots, such as Pepper and those developed in projects like RAMCIP, have shown promise in assisting Alzheimer's patients, offering both practical and therapeutic support. The article also delves into the key technologies enabling these robots, including sensor systems, obstacle avoidance algorithms, and human-computer interaction. Despite the significant progress, challenges remain, particularly in optimizing obstacle avoidance, enhancing HCI, and addressing ethical concerns in caregiving roles. The article concludes by highlighting the future potential of service robots to further integrate into smart homes, offering personalized and context-aware solutions to meet the unique needs of individual households.
... For example, reference [35] proposes using wearable e-textile sensor panels for gesture recognition. Reference [36] employs a deep-learning framework with RGB cameras for long-distance gesture recognition. Wearable sensor-based methods [11] may suffer from discomfort. ...
Article
Full-text available
Gesture recognition serves as a foundation for Human-Computer Interaction (HCI). Although Radio Frequency Identification (RFID) is gaining popularity due to its advantages (non-invasive, low-cost, and lightweight), most existing research has only addressed the recognition of simple sign language gestures or body movements. There is still a significant gap in the recognition of fine-grained gestures. In this paper, we propose RF-AIRCGR as a fine-grained hand gesture recognition system for Chinese characters. It enables information input and querying through gestures in contactless scenarios, which is of great significance for both medical and educational applications. This system has three main advantages: First, by designing a tag matrix and dual-antenna layout, it fully captures fine-grained gesture data for handwritten Chinese characters. Second, it uses a variance-based sliding window method to segment continuous gesture actions. Lastly, the phase signals of Chinese characters are innovatively transformed into feature images using the Markov Transition Field. After a series of preprocessing steps, the improved C-AlexNet model is employed for deep training and experimentation. Experimental results show that RF-AIRCGR achieves average recognition accuracies of 97.85% for new users and 97.15% for new scenarios. The accuracy and robustness of the system in recognizing Chinese character gestures have been validated.
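The variance-based sliding-window segmentation step described above can be illustrated with a minimal sketch: windows whose signal variance exceeds a threshold are marked active and merged into gesture segments. The window length, threshold, and the toy phase stream below are illustrative assumptions, not the parameters used by RF-AIRCGR.

```python
import numpy as np

def segment_gestures(phase, win=50, var_thresh=0.01):
    """Split a 1-D RFID phase stream into candidate gesture segments.

    A window whose variance exceeds `var_thresh` is treated as "active";
    consecutive active windows are merged into one segment. Both the
    window length and threshold are illustrative placeholders.
    """
    active = np.array([
        np.var(phase[i:i + win]) > var_thresh
        for i in range(0, len(phase) - win, win)
    ])
    segments, start = [], None
    for k, flag in enumerate(active):
        if flag and start is None:
            start = k * win                      # segment begins
        elif not flag and start is not None:
            segments.append((start, k * win))    # segment ends
            start = None
    if start is not None:
        segments.append((start, len(phase)))
    return segments

# Example: a quiet stream with a burst of activity in the middle
rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 0.01, 500),
                         rng.normal(0, 0.5, 300),
                         rng.normal(0, 0.01, 500)])
print(segment_gestures(stream))
```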
... However, one of the key challenges in achieving effective long-range gesture recognition is the degradation of visual information due to factors such as reduced resolution, lighting variations, and occlusions [16], [17]. In a recent work by the authors, a gesture recognition model was proposed using a simple web camera with an effective ultra-range distance of up to at least 25 meters, addressing some of these challenges [18]. However, the approach considers only static gestures; dynamic gesture recognition at ultra-range distances has not yet been addressed. ...
Preprint
Full-text available
Dynamic hand gestures play a crucial role in conveying nonverbal information for Human-Robot Interaction (HRI), eliminating the need for complex interfaces. Current models for dynamic gesture recognition suffer from limitations in effective recognition range, restricting their application to close proximity scenarios. In this letter, we present a novel approach to recognizing dynamic gestures in an ultra-range distance of up to 28 meters, enabling natural, directive communication for guiding robots in both indoor and outdoor environments. Our proposed SlowFast-Transformer (SFT) model effectively integrates the SlowFast architecture with Transformer layers to efficiently process and classify gesture sequences captured at ultra-range distances, overcoming challenges of low resolution and environmental noise. We further introduce a distance-weighted loss function shown to enhance learning and improve model robustness at varying distances. Our model demonstrates significant performance improvement over state-of-the-art gesture recognition frameworks, achieving a recognition accuracy of 95.1% on a diverse dataset with challenging ultra-range gestures. This enables robots to react appropriately to human commands from a far distance, providing an essential enhancement in HRI, especially in scenarios requiring seamless and natural interaction.
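The abstract mentions a distance-weighted loss but does not give its form here; below is a minimal sketch of one plausible formulation in PyTorch, where the per-sample cross-entropy is scaled by a weight that grows with the capture distance. The weighting function and the `alpha` factor are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def distance_weighted_loss(logits, targets, distances, alpha=0.05):
    """Cross-entropy where samples recorded farther away weigh more.

    logits:    (N, C) class scores
    targets:   (N,)   ground-truth class indices
    distances: (N,)   capture distance in meters
    alpha:     assumed scaling factor (illustrative, not from the paper)
    """
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = 1.0 + alpha * distances          # farther -> larger weight
    return (weights * per_sample).mean()

# Toy usage
logits = torch.randn(4, 6, requires_grad=True)
targets = torch.tensor([0, 2, 5, 1])
distances = torch.tensor([5.0, 12.0, 20.0, 28.0])
loss = distance_weighted_loss(logits, targets, distances)
loss.backward()
```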
... There have been various studies exploring the use of gestures in Human-Robot Interaction (HRI) to control quadruped robots. These studies include efforts to improve hand gesture recognition accuracy [25], enable distance recognition [2], use multimodal sensors, including surface electromyography [23], and detect hand gestures from remote users without relying on robot sensors [14]. All of these studies focus primarily on hand shapes and aim to enhance recognition accuracy. ...
Preprint
Full-text available
In recent years, quadruped robots have attracted significant attention due to their practical advantages in maneuverability, particularly when navigating rough terrain and climbing stairs. As these robots become more integrated into various industries, including construction and healthcare, researchers have increasingly focused on developing intuitive interaction methods such as speech and gestures that do not require separate devices such as keyboards or joysticks. This paper aims at investigating a comfortable and efficient interaction method with quadruped robots that possess a familiar form factor. To this end, we conducted two preliminary studies to observe how individuals naturally interact with a quadruped robot in natural and controlled settings, followed by a prototype experiment to examine human preferences for body-based and hand-based gesture controls using a Unitree Go1 Pro quadruped robot. We assessed the user experience of 13 participants using the User Experience Questionnaire and measured the time taken to complete specific tasks. The findings of our preliminary results indicate that humans have a natural preference for communicating with robots through hand and body gestures rather than speech. In addition, participants reported higher satisfaction and completed tasks more quickly when using body gestures to interact with the robot. This contradicts the fact that most gesture-based control technologies for quadruped robots are hand-based. The video is available at https://youtu.be/rysv1p1zvp4.
... There are two main methods for recognizing hand signs: camera-based image recognition [2] and sensor-based methods. Camera-based systems have shown several limitations, including sensitivity to lighting and reflective objects, limited colour and surface resolution for skin recognition, occlusion by moving or hidden objects, and high computational requirements for video processing. ...
... Bridging this gap is crucial for seamless collaboration between humans and robots, especially in scenarios like surveillance, drone operations, and search and rescue missions. Ultra-range gesture recognition (URGR) aims to extend recognition capabilities to 25 meters [5], yet, like the majority of gesture recognition approaches, it considers only static images. However, some gestures, such as waving and beckoning, are inherently dynamic. ...
Preprint
Full-text available
This paper presents a novel approach for ultra-range gesture recognition, addressing Human-Robot Interaction (HRI) challenges over extended distances. By leveraging human gestures in video data, we propose the Temporal-Spatiotemporal Fusion Network (TSFN) model that surpasses the limitations of current methods, enabling robots to understand gestures from long distances. With applications in service robots, search and rescue operations, and drone-based interactions, our approach enhances HRI in expansive environments. Experimental validation demonstrates significant advancements in gesture recognition accuracy, particularly in prolonged gesture sequences.
... However, most gesture recognition approaches enable short-range interactions of up to a few meters [4], [5]. Nevertheless, the authors have recently proposed a static gesture recognition model using a web camera with an effective distance of up to at least 25 meters while coping with challenges of low resolution and occlusion [6]. ...
Preprint
Full-text available
Dynamic gestures enable the transfer of directive information to a robot. Moreover, the ability of a robot to recognize them from a long distance makes communication more effective and practical. However, current state-of-the-art models for dynamic gestures exhibit limitations in recognition distance, typically achieving effective performance only within a few meters. In this work, we propose a model for recognizing dynamic gestures from a long distance of up to 20 meters. The model integrates the SlowFast and Transformer architectures (SFT) to effectively process and classify complex gesture sequences captured in video frames. SFT demonstrates superior performance over existing models.
Article
Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located in ultra-range requires an exhaustive collection of a significant amount of labeled samples. The generation of synthetic training datasets is a recent solution to the lack of real-world data, but such datasets are often unable to properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework based on a Diffusion model to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model with directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models, showcasing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-based URGR model is also demonstrated in gesture-based directing of a ground robot.
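The DUR generator is described as receiving a desired distance and class, but the conditioning mechanism is not detailed here. The sketch below shows one common way such conditions are injected into a diffusion denoiser: a learned embedding of the discrete gesture class summed with an MLP embedding of the scalar distance. Module names, sizes, and the 25 m normalization are illustrative assumptions, not the DUR implementation.

```python
import torch
import torch.nn as nn

class ClassDistanceCondition(nn.Module):
    """Produce a conditioning vector from (gesture class, distance in meters).

    This mirrors the usual conditioning pattern of diffusion models
    (embed discrete and continuous conditions, then feed the sum to the
    denoiser alongside the timestep embedding); it is an illustrative
    sketch, not the DUR code.
    """
    def __init__(self, num_classes=6, dim=128):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, dim)
        self.dist_mlp = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, class_id, distance_m):
        d = distance_m.unsqueeze(-1) / 25.0          # normalize by max range
        return self.class_emb(class_id) + self.dist_mlp(d)

cond = ClassDistanceCondition()
vec = cond(torch.tensor([3]), torch.tensor([18.5]))  # -> (1, 128) condition
print(vec.shape)
```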
Article
Full-text available
Hand gesture, one of the essential ways for a human to convey information and express intuitive intention, has a significant degree of differentiation, substantial flexibility, and high robustness of information transmission to make hand gesture recognition (HGR) one of the research hotspots in the fields of human–human and human–computer or human–machine interactions. Noninvasive, on‐body sensors can monitor, track, and recognize hand gestures for various applications such as sign language recognition, rehabilitation, myoelectric control for prosthetic hands and human–machine interface (HMI), and many other applications. This article systematically reviews recent achievements from noninvasive upper‐limb sensing techniques for HGR, multimodal sensing fusion to gain additional user information, and wearable gesture recognition algorithms to obtain more reliable and robust performance. Research challenges, progress, and emerging opportunities for sensor‐based HGR systems are also analyzed to provide perspectives for future research and progress.
Article
Full-text available
Single-image super-resolution (SISR) is an important task in image processing, aiming to achieve enhanced image resolution. With the development of deep learning, SISR based on convolutional neural networks has also gained great progress, but as the network deepens and the task of SISR becomes more complex, SISR networks become difficult to train, which hinders SISR from achieving greater success. Therefore, to further promote SISR, many challenges have emerged in recent years. In this review, we briefly review the SISR challenges organized from 2017 to 2022 and focus on the in-depth classification of these challenges, the datasets employed, the evaluation methods used, and the powerful network architectures proposed or accepted by the winners. First, depending on the tasks of the challenges, the SISR challenges can be broadly classified into four categories: classic SISR, efficient SISR, perceptual extreme SISR, and real-world SISR. Second, we introduce the datasets commonly used in the challenges in recent years and describe their characteristics. Third, we present the image evaluation methods commonly used in SISR challenges in recent years. Fourth, we introduce the network architectures used by the winners, mainly to explore in depth where the advantages of their network architectures lie and to compare the results of previous years’ winners. Finally, we summarize the methods that have been widely used in SISR in recent years and suggest several possible promising directions for future SISR.
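Among the evaluation methods used in these challenges, PSNR remains the most common fidelity metric; a minimal reference implementation is shown below (SSIM and learned perceptual metrics are usually computed with library routines).

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) -
                   reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy usage with a slightly perturbed image
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = np.clip(img + np.random.normal(0, 5, img.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(img, noisy):.2f} dB")
```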
Article
Full-text available
Gesture recognition has found widespread applications in various fields, such as virtual reality, medical diagnosis, and robot interaction. The existing mainstream gesture-recognition methods are primarily divided into two categories: inertial-sensor-based and camera-vision-based methods. However, optical detection still has limitations such as reflection and occlusion. In this paper, we investigate static and dynamic gesture-recognition methods based on miniature inertial sensors. Hand-gesture data are obtained through a data glove and preprocessed using Butterworth low-pass filtering and normalization algorithms. Magnetometer correction is performed using ellipsoidal fitting methods. An auxiliary segmentation algorithm is employed to segment the gesture data, and a gesture dataset is constructed. For static gesture recognition, we focus on four machine learning algorithms, namely support vector machine (SVM), backpropagation neural network (BP), decision tree (DT), and random forest (RF). We evaluate the model prediction performance through cross-validation comparison. For dynamic gesture recognition, we investigate the recognition of 10 dynamic gestures using Hidden Markov Models (HMM) and attention-based bidirectional long short-term memory networks (Attention-BiLSTM). We analyze the differences in accuracy for complex dynamic gesture recognition with different feature datasets and compare them with the prediction results of the traditional long short-term memory (LSTM) model. Experimental results demonstrate that the random forest algorithm achieves the highest recognition accuracy and shortest recognition time for static gestures. Moreover, the addition of the attention mechanism significantly improves the recognition accuracy of the LSTM model for dynamic gestures, with a prediction accuracy of 98.3%, based on the original six-axis dataset.
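The preprocessing described above (Butterworth low-pass filtering followed by normalization) can be sketched as follows; the sampling rate, cutoff frequency, and filter order are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_imu(signal, fs=100.0, cutoff=10.0, order=4):
    """Zero-phase Butterworth low-pass filter plus min-max normalization.

    signal: (T,) or (T, channels) raw inertial data
    fs, cutoff, order: assumed sampling rate / filter settings
    """
    b, a = butter(order, cutoff / (0.5 * fs), btype="low")
    filtered = filtfilt(b, a, signal, axis=0)
    lo, hi = filtered.min(axis=0), filtered.max(axis=0)
    return (filtered - lo) / (hi - lo + 1e-8)

# Toy usage: 2 seconds of noisy 6-axis data at 100 Hz
raw = np.random.randn(200, 6).cumsum(axis=0)
clean = preprocess_imu(raw)
print(clean.shape, clean.min(), clean.max())
```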
Article
Full-text available
Recently, human–robot interaction technology has been considered as a key solution for smart factories. Surface electromyography signals obtained from hand gestures are often used to enable users to control robots through hand gestures. In this paper, we propose a dynamic hand-gesture-based industrial robot control system using the edge AI platform. The proposed system can perform both robot operating-system-based control and edge AI control through an embedded board without requiring an external personal computer. Systems on a mobile edge AI platform must be lightweight, robust, and fast. In the context of a smart factory, classifying a given hand gesture is important for ensuring correct operation. In this study, we collected electromyography signal data from hand gestures and used them to train a convolutional recurrent neural network. The trained classifier model achieved 96% accuracy for 10 gestures in real time. We also verified the universality of the classifier by testing it on 11 different participants.
Chapter
Full-text available
Close human-robot interaction (HRI), especially in industrial scenarios, has been vastly investigated for the advantages of combining human and robot skills. For an effective HRI, the validity of currently available human-machine communication media or tools should be questioned, and new communication modalities should be explored. This article proposes a modular architecture allowing human operators to interact with robots through different modalities. In particular, we implemented the architecture to handle gestural and touchscreen input, respectively, using a smartwatch and a tablet. Finally, we performed a comparative user experience study between these two modalities.
Article
Full-text available
This paper presents the development of a visual-perception system on a dual-arm mobile robot for human-robot interaction. This visual system integrates three subsystems. Hand gesture recognition is utilized to trigger human-robot interaction. Engagement and intention of the participants are detected and quantified through a cognitive system. Visual servoing uses YOLO to identify the object to be tracked and hybrid, model-based tracking to follow the object’s geometry. The proposed visual-perception system is implemented in the developed dual-arm mobile robot, and experiments are conducted to validate the proposed method’s effects on human-robot interaction applications.
Article
Full-text available
This paper aims to evaluate two machine learning techniques using low-frequency photoplethysmography (PPG) associated with motion sensors, accelerometers and gyroscopes, from wearable devices such as smartwatches for gesture recognition from wrist and finger motion. The applied method uses second-order gradient calculation on the PPG signal to identify motion artifacts and then creates segments that are used for gesture classification; the classification process uses a support vector machine and random forests trained on statistical features extracted from the PPG and motion sensor signals. Preliminary evaluations show that frequencies of 25 Hz are suitable for the recognition process, achieving an F1-score of 0.819 for seven gesture sets.
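The feature-and-classify stage described above can be sketched as statistical features per segment fed to a random forest; the specific feature set, segment shape, and synthetic data below are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def segment_features(segment):
    """Simple statistical features per channel of a (T, C) PPG/IMU segment."""
    return np.concatenate([
        segment.mean(axis=0),
        segment.std(axis=0),
        segment.min(axis=0),
        segment.max(axis=0),
    ])

# Toy dataset: 80 segments of 25 Hz data, 4 channels, 7 gesture classes
rng = np.random.default_rng(0)
segments = rng.normal(size=(80, 50, 4))
labels = rng.integers(0, 7, size=80)
X = np.stack([segment_features(s) for s in segments])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.score(X, labels))   # training accuracy on the toy data
```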
Article
Full-text available
Digitizing handwriting is mostly performed using either image-based methods, such as optical character recognition, or utilizing two or more devices, such as a special stylus and a smart pad. The high-cost nature of this approach necessitates a cheaper and standalone smart pen. Therefore, in this paper, a deep-learning-based compact smart digital pen that recognizes 36 alphanumeric characters was developed. Unlike common methods, which employ only inertial data, handwriting recognition is achieved from hand motion data captured using an inertial force sensor. The developed prototype smart pen comprises an ordinary ballpoint ink chamber, three force sensors, a six-channel inertial sensor, a microcomputer, and a plastic barrel structure. Handwritten data of the characters were recorded from six volunteers. After the data was properly trimmed and restructured, it was used to train four neural networks using deep-learning methods. These included Vision transformer (ViT), DNN (deep neural network), CNN (convolutional neural network), and LSTM (long short-term memory). The ViT network outperformed the others to achieve a validation accuracy of 99.05%. The trained model was further validated in real-time where it showed promising performance. These results will be used as a foundation to extend this investigation to include more characters and subjects.
Article
Full-text available
Graph convolutional neural networks (GCNNs) have been successfully applied to a wide range of problems, including low-dimensional Euclidean structural domains representing images, videos, and speech and high-dimensional non-Euclidean domains, such as social networks and chemical molecular structures. However, in computer vision, the existing GCNNs are not provided with positional information to distinguish between graphs of new structures; therefore, the performance of the image classification domain represented by arbitrary graphs is significantly poor. In this work, we introduce how to initialize the positional information through a random walk algorithm and continuously learn the additional position-embedded information of various graph structures represented over the superpixel images we choose for efficiency. We call this method the graph convolutional network with learnable positional embedding applied on images (IMGCN-LPE). We apply IMGCN-LPE to three graph convolutional models (the Chebyshev graph convolutional network, graph convolutional network, and graph attention network) to validate performance on various benchmark image datasets. As a result, although not as impressive as convolutional neural networks, the proposed method outperforms various other conventional convolutional methods and demonstrates its effectiveness among the same tasks in the field of GCNNs.
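One common way to initialize positional information with a random walk, in the spirit of the approach above, is to use each node's return probabilities over the first k walk steps as its initial positional feature. The sketch below applies this idea to a toy graph; the walk length is an assumption.

```python
import numpy as np

def random_walk_positional_encoding(adj, k=8):
    """Initial positional features from random-walk return probabilities.

    adj: (N, N) adjacency matrix of the (superpixel) graph
    Returns an (N, k) matrix whose i-th row holds the probability of a
    walk starting at node i being back at node i after 1..k steps.
    """
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1e-12)        # row-stochastic transition matrix
    pe, Pk = [], np.eye(len(adj))
    for _ in range(k):
        Pk = Pk @ P
        pe.append(np.diag(Pk))              # return probability per node
    return np.stack(pe, axis=1)

# Toy 5-node ring graph
A = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
print(random_walk_positional_encoding(A, k=4))
```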
Article
Full-text available
A graph can represent a complex organization of data in which dependencies exist between multiple entities or activities. Such complex structures create challenges for machine learning algorithms, particularly when combined with the high dimensionality of data in current applications. Graph convolutional networks were introduced to adopt concepts from deep convolutional networks (i.e. the convolutional operations/layers) that have shown good results. In this context, we propose two major enhancements to two of the existing graph convolutional network frameworks: (1) topological information enrichment through clustering coefficients; and (2) structural redesign of the network through the addition of dense layers. Furthermore, we propose minor enhancements using convex combinations of activation functions and hyper-parameter optimization. We present extensive results on four state-of-art benchmark datasets. We show that our approach achieves competitive results for three of the datasets and state-of-the-art results for the fourth dataset while having lower computational costs compared to competing methods.
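The first enhancement, topological enrichment through clustering coefficients, amounts to appending each node's local clustering coefficient to its feature vector before graph convolution; a minimal sketch with networkx is shown below (the toy graph and one-hot features are illustrative).

```python
import networkx as nx
import numpy as np

def enrich_with_clustering(graph, features):
    """Append each node's local clustering coefficient to its feature vector.

    graph:    a networkx graph with nodes 0..N-1
    features: (N, F) original node feature matrix
    """
    cc = nx.clustering(graph)                         # dict node -> coefficient
    cc_col = np.array([cc[n] for n in range(graph.number_of_nodes())])
    return np.hstack([features, cc_col[:, None]])     # (N, F + 1)

G = nx.karate_club_graph()
X = np.eye(G.number_of_nodes())                       # toy one-hot features
print(enrich_with_clustering(G, X).shape)             # (34, 35)
```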
Article
Full-text available
Single image super-resolution (SISR) is the task of inferring a high-resolution image from a single low-resolution image. Recent research on super-resolution has achieved great progress because of the development of deep convolutional neural networks in the field of computer vision. Existing super-resolution methods perform well in terms of mean square error (MSE), but most methods fail to reconstruct an image with sharp edges. To solve this problem, the mixed gradient error, which is composed of MSE and mean gradient error, is proposed and applied to a modified U-net network. The modified U-net removes all batch normalization layers and one of the convolution layers in each block. This operation reduces the parameter count and therefore accelerates the model. Compared with existing SISR algorithms, the proposed method has better performance and lower time consumption. The experiments demonstrate that the modified U-net with mixed gradient loss yields high-level results on three widely used datasets, SET14, BSD300 and Manga109, and outperforms other state-of-the-art methods on the text dataset, ICDAR2003. Code is available online.
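The mixed gradient error described above combines MSE with a mean gradient error over image finite differences; a minimal PyTorch sketch of such a loss is given below, with the mixing weight `lam` as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def mixed_gradient_loss(pred, target, lam=0.1):
    """MSE plus mean gradient error between predicted and target images.

    pred, target: (N, C, H, W) tensors; lam is an assumed mixing weight.
    The gradient term compares finite differences along both image axes,
    encouraging sharper edges than MSE alone.
    """
    mse = F.mse_loss(pred, target)
    dx_p, dx_t = pred[..., :, 1:] - pred[..., :, :-1], target[..., :, 1:] - target[..., :, :-1]
    dy_p, dy_t = pred[..., 1:, :] - pred[..., :-1, :], target[..., 1:, :] - target[..., :-1, :]
    grad_err = (dx_p - dx_t).abs().mean() + (dy_p - dy_t).abs().mean()
    return mse + lam * grad_err

pred = torch.rand(2, 3, 32, 32, requires_grad=True)
target = torch.rand(2, 3, 32, 32)
mixed_gradient_loss(pred, target).backward()
```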
Article
Full-text available
Wearable devices that monitor muscle activity based on surface electromyography could be of use in the development of hand gesture recognition applications. Such devices typically use machine-learning models, either locally or externally, for gesture classification. However, most devices with local processing cannot offer training and updating of the machine-learning model during use, resulting in suboptimal performance under practical conditions. Here we report a wearable surface electromyography biosensing system that is based on a screen-printed, conformal electrode array and has in-sensor adaptive learning capabilities. Our system implements a neuro-inspired hyperdimensional computing algorithm locally for real-time gesture classification, as well as model training and updating under variable conditions such as different arm positions and sensor replacement. The system can classify 13 hand gestures with 97.12% accuracy for two participants when training with a single trial per gesture. A high accuracy (92.87%) is preserved on expanding to 21 gestures, and accuracy is recovered by 9.5% by implementing model updates in response to varying conditions, without additional computation on an external device.
Article
Full-text available
Hand gestures are a form of nonverbal communication that can be used in several fields such as communication between deaf-mute people, robot control, human-computer interaction (HCI), home automation and medical applications. Research papers based on hand gestures have adopted many different techniques, including those based on instrumented sensor technology and computer vision. In other words, the hand sign can be classified under many headings, such as posture and gesture, as well as dynamic and static, or a hybrid of the two. This paper focuses on a review of the literature on hand gesture techniques and introduces their merits and limitations under different circumstances. In addition, it tabulates the performance of these methods, focusing on computer vision techniques that deal with the similarity and difference points, technique of hand segmentation used, classification algorithms and drawbacks, number and types of gestures, dataset used, detection range (distance) and type of camera used. This paper is a thorough general overview of hand gesture methods with a brief discussion of some possible applications.
Article
Full-text available
Whenever we address a specific object or refer to a certain spatial location, we use referential or deictic gestures, usually accompanied by some verbal description. Particularly, pointing gestures are necessary to dissolve ambiguities in a scene, and they are of crucial importance when verbal communication may fail due to environmental conditions or when two persons simply do not speak the same language. With the currently increasing advances of humanoid robots and their future integration in domestic domains, the development of gesture interfaces complementing human–robot interaction scenarios is of substantial interest. The implementation of an intuitive gesture scenario is still challenging because both the pointing intention and the corresponding object have to be correctly recognized in real time. The demand increases when considering pointing gestures in a cluttered environment, as is the case in households. Also, humans perform pointing in many different ways, and those variations have to be captured. Research in this field often proposes a set of geometrical computations which do not scale well with the number of gestures and objects and use specific markers or a predefined set of pointing directions. In this paper, we propose an unsupervised learning approach to model the distribution of pointing gestures using a growing-when-required (GWR) network. We introduce an interaction scenario with a humanoid robot and define the so-called ambiguity classes. Our implementation for hand and object detection is independent of any markers or skeleton models; thus, it can be easily reproduced. Our evaluation comparing a baseline computer vision approach with our GWR model shows that the pointing-object association is well learned even in cases of ambiguities resulting from close object proximity.
Article
Full-text available
Recently, automatic hand gesture recognition has gained increasing importance for two principal reasons: the growth of the deaf and hearing-impaired population, and the development of vision-based applications and touchless control on ubiquitous devices. As hand gesture recognition is at the core of sign language analysis, a robust hand gesture recognition system should consider both spatial and temporal features. Unfortunately, finding discriminative spatiotemporal descriptors for a hand gesture sequence is not a trivial task. In this study, we propose an efficient deep convolutional neural network approach for hand gesture recognition. The proposed approach employs transfer learning to overcome the scarcity of large labeled hand gesture datasets. We evaluated it using three gesture datasets from color videos: 40, 23, and 10 classes were used from these datasets. The approach obtained recognition rates of 98.12%, 100%, and 76.67% on the three datasets, respectively, for the signer-dependent mode. For the signer-independent mode, it obtained recognition rates of 84.38%, 34.9%, and 70% on the three datasets, respectively.
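The transfer-learning recipe described above, a CNN pretrained on a large image corpus with a new classification head for the gesture classes, can be sketched as follows; the ResNet-18 backbone, the 40-class head, and the frozen-features setup are illustrative choices rather than the exact configuration of the study (the weights argument assumes a recent torchvision version).

```python
import torch
import torch.nn as nn
from torchvision import models

# A pretrained backbone with a replaced head is the usual transfer-learning
# recipe for small gesture datasets; ResNet-18 here is an illustrative choice.
num_gestures = 40
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():        # freeze the pretrained feature extractor
    p.requires_grad = False
backbone.fc = nn.Linear(backbone.fc.in_features, num_gestures)  # new head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
frames = torch.rand(8, 3, 224, 224)    # toy batch of gesture frames
logits = backbone(frames)              # (8, num_gestures)
print(logits.shape)
```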
Conference Paper
Full-text available
Single image super-resolution (SISR) is a challenging ill-posed problem which aims to restore or infer a high-resolution image from a low-resolution one. Powerful deep learning-based techniques have achieved state-of-the-art performance in SISR; however, they can underperform when handling images with non-stationary degradations, such as for the application of projector resolution enhancement. In this paper, a new UNet architecture that is able to learn the relationship between a set of degraded low-resolution images and their corresponding original high-resolution images is proposed. We propose employing a degradation model on training images in a non-stationary way, allowing the construction of a robust UNet (RUNet) for image super-resolution (SR). Experimental results show that the proposed RUNet improves the visual quality of the obtained super-resolution images while maintaining a low reconstruction error.
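The idea of applying a degradation model to training images in a non-stationary way can be sketched by drawing a different blur strength, downscaling, and noise level for each image, so that the LR/HR training pairs cover a range of degradations. The parameter ranges and the x4 scale below are illustrative assumptions, not the RUNet degradation model.

```python
import cv2
import numpy as np

def degrade(hr, rng, scale=4):
    """Create a low-resolution training input from a high-resolution image.

    Blur strength and noise level are drawn at random per image, so the
    degradation varies across the training set; the specific ranges here
    are illustrative, not those of the RUNet paper.
    """
    sigma = rng.uniform(0.5, 3.0)
    blurred = cv2.GaussianBlur(hr, (0, 0), sigmaX=sigma)
    h, w = hr.shape[:2]
    lr = cv2.resize(blurred, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    noise = rng.normal(0, rng.uniform(0, 10), lr.shape)
    return np.clip(lr + noise, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
hr_image = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
lr_image = degrade(hr_image, rng)
print(lr_image.shape)   # (32, 32, 3) low-resolution training input
```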
Article
Full-text available
The state-of-the-art models for medical image segmentation are variants of U-Net and fully convolutional networks (FCN). Despite their success, these models have two limitations: (1) their optimal depth is a priori unknown, requiring extensive architecture search or inefficient ensemble of models of varying depths; and (2) their skip connections impose an unnecessarily restrictive fusion scheme, forcing aggregation only at the same-scale feature maps of the encoder and decoder sub-networks. To overcome these two limitations, we propose UNet++, a new neural architecture for semantic and instance segmentation, by (1) alleviating the unknown network depth with an efficient ensemble of U-Nets of varying depths, which partially share an encoder and co-learn simultaneously using deep supervision; (2) redesigning skip connections to aggregate features of varying semantic scales at the decoder sub-networks, leading to a highly flexible feature fusion scheme; and (3) devising a pruning scheme to accelerate the inference speed of UNet++. We have evaluated UNet++ using six different medical image segmentation datasets, covering multiple imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and electron microscopy (EM), and demonstrating that (1) UNet++ consistently outperforms the baseline models for the task of semantic segmentation across different datasets and backbone architectures; (2) UNet++ enhances segmentation quality of varying-size objects, an improvement over the fixed-depth U-Net; (3) Mask RCNN++ (Mask R-CNN with UNet++ design) outperforms the original Mask R-CNN for the task of instance segmentation; and (4) pruned UNet++ models achieve significant speedup while showing only modest performance degradation. Our implementation and pre-trained models are available at https://github.com/MrGiovanni/UNetPlusPlus.
Article
Full-text available
Hand gesture recognition methods play an important role in human-computer interaction. Among these methods are skeleton-based recognition techniques that seem to be promising. In the literature, several methods have been proposed to recognize hand gestures with skeletons. One problem with these methods is that they give little consideration to the connectivity between the joints of a skeleton, constructing only simple graphs for skeleton connectivity. Observing this, we built a new model of hand skeletons by adding three types of edges in the graph to finely describe the linkage action of joints. Then, an end-to-end deep neural network, the hand gesture graph convolutional network, is presented in which the convolution is conducted only on linked skeleton joints. Since the training dataset is relatively small, this work proposes expanding the coordinate dimensionality so as to let models learn more semantic features. Furthermore, relative coordinates are employed to help the hand gesture graph convolutional network learn feature representations independent of the random starting positions of actions. The proposed method is validated on two challenging datasets, and the experimental results show that it outperforms the state-of-the-art methods. Furthermore, it is relatively lightweight in practice for hand skeleton-based gesture recognition.
Article
Full-text available
This paper proposes a long-distance gesture recognition algorithm based on Kinect. First, we use Kinect to capture human skeleton and depth information, and to track and extract hand information. Because of the characteristics of the depth image, the experimental results are not affected by the background, lighting, skin color, or clothing. Then the initially obtained hand-shape data are denoised and smoothed, and the contour and skeleton of the hand shape are extracted. When the hand is at a long distance, the accuracy of Kinect is not sufficient to obtain detailed hand-shape information, so we combine the hand depth information with the color information to obtain the hand shape. Finally, we use the Hu moments of the hand-shape contour binary image and the hand skeleton binary image as features, and utilize an SVM to train and identify hand gestures. The experimental results show that the Hu moments of the hand skeleton binary image are more advantageous than those of the hand contour binary image, and the proposed long-distance hand recognition algorithm achieves recognition accuracy similar to that at close distance.
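The final stage described above, Hu moments of a binary hand-shape image fed to an SVM, can be sketched with OpenCV and scikit-learn; the toy shapes below merely stand in for segmented hand images.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def hu_features(binary_image):
    """Log-scaled Hu moments of a binary hand-shape (contour or skeleton) image."""
    hu = cv2.HuMoments(cv2.moments(binary_image)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)   # usual log scaling

# Toy binary shapes standing in for segmented hand images of two gestures
rng = np.random.default_rng(0)

def toy_shape(label):
    img = np.zeros((64, 64), dtype=np.uint8)
    if label == 0:
        cv2.circle(img, (32, 32), int(rng.integers(10, 20)), 255, -1)
    else:
        cv2.rectangle(img, (15, 10), (50, 55), 255, -1)
    return img

X = np.array([hu_features(toy_shape(i % 2)) for i in range(40)])
y = np.array([i % 2 for i in range(40)])
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```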
Conference Paper
Full-text available
For successful physical human-robot interaction, the capability of a robot to understand its environment is imperative. More importantly, the robot should extract from the human operator as much information as possible. A reliable 3D skeleton extraction is essential for a robot to predict the intentions of the operator while he/she moves toward the robot or performs a gesture with a specific meaning. For this purpose, we have integrated a time-of-flight depth camera with a state-of-the-art 2D skeleton extraction library namely Openpose, to obtain 3D skeletal joint coordinates reliably. We have also developed a robust and rotation invariant (in the coronal plane) hand gesture detector using a convolutional neural network. At run time (after having been trained) the detector does not require any pre-processing of the hand images. A complete pipeline for skeleton extraction and hand gesture recognition is developed and employed for real-time physical human-robot interaction, demonstrating the promising capability of the designed framework. This work establishes a firm basis and will be extended for the development of intelligent human intention detection in physical human-robot interaction scenarios, to efficiently recognize a variety of static as well as dynamic gestures.
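The central geometric step of the pipeline above, lifting 2D skeleton keypoints to 3D using the time-of-flight depth image and the camera intrinsics, can be sketched as follows; the intrinsic parameters and the flat toy depth map are illustrative placeholders.

```python
import numpy as np

def backproject_keypoints(keypoints_2d, depth_map, fx, fy, cx, cy):
    """Lift (u, v) pixel keypoints to 3D camera coordinates using depth.

    keypoints_2d: (J, 2) pixel coordinates from a 2D skeleton extractor
    depth_map:    (H, W) depth in meters from a time-of-flight camera
    fx, fy, cx, cy: pinhole intrinsics of the depth camera
    """
    points = []
    for u, v in keypoints_2d.astype(int):
        z = depth_map[v, u]                       # depth at the keypoint
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return np.array(points)                       # (J, 3)

# Toy example with assumed intrinsics for a 640x480 depth sensor
depth = np.full((480, 640), 2.5)                  # person ~2.5 m away
joints_2d = np.array([[320, 240], [300, 200], [350, 260]])
print(backproject_keypoints(joints_2d, depth, fx=525, fy=525, cx=319.5, cy=239.5))
```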
Article
Full-text available
Gesture recognition is an important part of human-robot interaction. In order to achieve fast and stable gesture recognition in real time without distance restrictions, this paper presents an improved threshold segmentation method. The improved method combines the depth information and color information of a target scene with the hand position by the spatial hierarchical scanning method; the ROI in the scene is thus extracted by the local neighbor method. In this way, the hand can be identified quickly and accurately in complex scenes and at different distances. Furthermore, the convex hull detection algorithm is used to identify the positioning of fingertips in the ROI, so that the fingertips can be identified and located accurately. The experimental results show that the hand position can be obtained quickly and accurately against a complex background using the improved method, real-time recognition is achieved over a distance range of 0.5 m to 2.0 m, and the fingertip detection rate reaches 98.5% on average. Moreover, the gesture recognition rate exceeds 96% using the convex hull detection algorithm. It can thus be concluded that the proposed method achieves good performance of hand detection and positioning at different distances.
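The convex-hull fingertip stage described above can be sketched with OpenCV's hull and convexity-defect routines; the binary hand mask is assumed to come from the depth/colour segmentation stage, and the defect-depth threshold is an illustrative value.

```python
import cv2
import numpy as np

def fingertip_candidates(hand_mask, depth_thresh=8000):
    """Locate fingertip candidates on a binary hand mask via convexity defects.

    hand_mask: single-channel uint8 image with hand pixels set to 255
    (assumed output of the depth/colour segmentation stage). `depth_thresh`
    filters out shallow defects and is an illustrative value (fixed-point,
    i.e. pixel distance * 256).
    """
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    contour = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull)
    tips = []
    if defects is not None:
        for start, end, _, depth in defects[:, 0]:
            if depth > depth_thresh:              # deep valleys separate fingers
                tips.append(tuple(contour[start][0]))
                tips.append(tuple(contour[end][0]))
    return tips

mask = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(mask, (60, 80), (140, 180), 255, -1)   # toy "palm"
cv2.rectangle(mask, (70, 20), (85, 80), 255, -1)     # toy "finger"
cv2.rectangle(mask, (115, 20), (130, 80), 255, -1)   # toy "finger"
print(fingertip_candidates(mask))
```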
Article
Hand gesture recognition has attracted huge interest in the areas of autonomous driving, human computer systems, gaming and many others. Skeleton based techniques along with graph convolutional networks (GCNs) are being popularly used in this field due to the easy estimation of joint coordinates and better representation capability of graphs. Simple hand skeleton graphs are unable to capture the finer details and complex spatial features of hand gestures. To address these challenges, this work proposes an “angle‐based hand gesture graph convolutional network” (AHG‐GCN). This model introduces two additional types of novel edges in the graph to connect the wrist with each fingertip and finger's base, explicitly capturing their relationship, which plays an important role in differentiating gestures. Besides, novel features for each skeleton joint are designed using the angles formed with fingertip/finger‐base joints and the distance among them to extract semantic correlation and tackle the overfitting problem. Thus, an enhanced set of 25 features for each joint is obtained using these novel techniques. The proposed model achieves 90% and 88% accuracy for 14 and 28 gesture configurations for the DHG 14/28 dataset and, 94.05% and 89.4% accuracy for 14 and 28 gesture configurations for the SHREC 2017 dataset, respectively.
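A minimal sketch of the kind of angle-based feature described above, the angle at the wrist formed by a finger-base joint and its fingertip together with the two distances, is given below; the joint coordinates and feature layout are illustrative and do not reproduce the full 25-feature set of AHG-GCN.

```python
import numpy as np

def wrist_angle_features(wrist, finger_base, fingertip):
    """Angle at the wrist between a finger-base joint and its fingertip,
    together with the two wrist-to-joint distances. Inputs are 3D coordinates.
    """
    v1, v2 = finger_base - wrist, fingertip - wrist
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    return np.array([angle, np.linalg.norm(v1), np.linalg.norm(v2)])

# Toy joint positions (meters) for the wrist and the index finger
wrist = np.array([0.0, 0.0, 0.0])
index_base = np.array([0.02, 0.09, 0.0])
index_tip = np.array([0.03, 0.17, 0.01])
print(wrist_angle_features(wrist, index_base, index_tip))
```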
Chapter
This manuscript entails the preprocessing of hand gesture images for subsequent gesture recognition and cognitive communication. From this detailed analysis, it is obvious that the preprocessing stage is important for hand gesture recognition and its enhancement. Key to achieving the system design goal is optimizing this stage's performance. Through the techniques described in this stage, the quality of the image for the subsequent stages will be assured. The techniques discussed in the literature are rarely successful when they seek to retain image quality and provide a low-noise result. At the segmentation stage, the image is filtered to include only the Region of Interest, which is a clear visual representation of the hand portion, separated from the rest of the information. These techniques are implemented on the images acquired from the dataset, and various pre-processing techniques are compared to identify which technique yields the best possible results for optimizing the preprocessing of hand sign images.
Article
Hand pose estimation is a fundamental task in many human–robot interaction-related applications. However, previous approaches suffer from unsatisfying hand landmark predictions in real-world scenes and high computation burden. This paper proposes a fast and accurate framework for hand pose estimation, dubbed as “FastHand”. Using a lightweight encoder–decoder network architecture, FastHand fulfills the requirements of practical applications running on embedded devices. The encoder consists of deep layers with a small number of parameters, while the decoder uses spatial location information to obtain more accurate results. The evaluation took place on two publicly available datasets demonstrating the improved performance of the proposed pipeline compared to other state-of-the-art approaches. FastHand offers high accuracy scores while reaching a speed of 25 frames per second on an NVIDIA Jetson TX2 graphics processing unit.
Article
Single image super-resolution (SISR), which aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) observation, has been an active research topic in the area of image processing in recent decades. Particularly, deep learning-based super-resolution (SR) approaches have drawn much attention and have greatly improved the reconstruction performance on synthetic data. However, recent studies show that simulation results on synthetic data usually overestimate the capacity to super-resolve real-world images. In this context, more and more researchers devote themselves to develop SR approaches for realistic images. This article aims to make a comprehensive review on real-world single image super-resolution (RSISR). More specifically, this review covers the critical publicly available datasets and assessment metrics for RSISR, and four major categories of RSISR methods, namely the degradation modeling-based RSISR, image pairs-based RSISR, domain translation-based RSISR, and self-learning-based RSISR. Comparisons are also made among representative RSISR methods on benchmark datasets, in terms of both reconstruction quality and computational efficiency. Besides, we discuss challenges and promising research topics on RSISR.
Article
Head-mounted device-based human-computer interaction often requires egocentric recognition of hand gestures and fingertip detection. In this paper, a unified approach to egocentric hand gesture recognition and fingertip detection is introduced. The proposed algorithm uses a single convolutional neural network to predict the probabilities of finger class and positions of fingertips in one forward propagation. Instead of directly regressing the positions of fingertips from the fully connected layer, the ensemble of fingertip positions is regressed from the fully convolutional network. Subsequently, the ensemble average is taken to regress the final position of the fingertips. Since the whole pipeline uses a single network, it is significantly fast in computation. Experimental results show that the proposed method outperforms the existing fingertip detection approaches, including the Direct Regression and the Heatmap-based frameworks. The effectiveness of the proposed method is also shown in an in-the-wild scenario as well as in a virtual reality use case.
Chapter
This chapter proposes the design and implementation details of a real-time hand gesture recognition (HGR) system using computational intelligence and super-resolution techniques. The system operates on low-resolution cameras, and the input images are upscaled using super-resolution because the upscaled images offer more efficient feature extraction than lower-resolution images. Thereafter, they are fed into a convolutional neural network-based model to examine and classify gestures translated to specific commands, forming a human-computer interaction (HCI) system.
Conference Paper
Hand gesture recognition is one of the most important aspects of human-machine interface (HMI) development, and it has a wide spectrum of applications including sign language recognition for deaf and mute people. Herein, force myography (FMG) signals are extracted using eight nanocomposite CNT/PDMS pressure sensors simultaneously. Data are collected from eight healthy volunteers for American Sign Language digits 0–9. Two sets of features are extracted: the first is composed of the mean, standard deviation, and RMS values of the raw FMG data for each of the 8 sensors individually. The second set is composed of the 2-norm of the raw FMG signal and three proportional features, where the FMG signals are studied with respect to the reference rest signal. Classification is performed using each of the seven individual features as well as the combination of features in each set. The combination of features in the second set gives better testing accuracies of 95% and 91.9% for k=2 and k=3 using a KNN classifier, respectively.
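The first feature set described above (per-sensor mean, standard deviation, and RMS) combined with a KNN classifier can be sketched as follows; the synthetic data only illustrates the pipeline and does not reflect the reported accuracies.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fmg_features(window):
    """Mean, standard deviation and RMS for each of the 8 FMG channels.

    window: (T, 8) raw force-myography samples for one gesture repetition.
    """
    return np.concatenate([
        window.mean(axis=0),
        window.std(axis=0),
        np.sqrt((window ** 2).mean(axis=0)),   # RMS per channel
    ])

# Toy data: 100 repetitions of ASL digits 0-9, 8 sensors, 200 samples each
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=100)
windows = rng.normal(loc=labels[:, None, None], scale=1.0, size=(100, 200, 8))
X = np.stack([fmg_features(w) for w in windows])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.score(X, labels))
```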
Article
Dynamic hand gesture recognition is a challenging problem in hand-based human-robot interaction (HRI), facing issues such as complex environments and dynamic perception. In the context of this problem, we learn from the principle of the data-glove-based hand gesture recognition method and propose a dynamic hand gesture recognition method based on 3D hand pose estimation. This method uses 3D hand pose estimation, data fusion and a deep neural network to improve the recognition accuracy of dynamic hand gestures. First, a 2D hand pose estimation method based on OpenPose is improved to obtain a fast 3D hand pose estimation method. Second, the weighted sum fusion method is utilized to combine the RGB, depth and 3D skeleton data of hand gestures. Finally, a 3DCNN + ConvLSTM framework is used to identify and classify the combined dynamic hand gesture data. In the experiment, the proposed method is verified on a developed dynamic hand gesture database for HRI and achieves 92.4% accuracy. Comparative experiment results verify the reliability and efficiency of the proposed method.
Article
Deep convolutional networks–based super-resolution is a fast-growing field with numerous practical applications. In this exposition, we extensively compare more than 30 state-of-the-art super-resolution Convolutional Neural Networks (CNNs) over three classical and three recently introduced challenging datasets to benchmark single image super-resolution. We introduce a taxonomy for deep learning–based super-resolution networks that groups existing methods into nine categories including linear, residual, multi-branch, recursive, progressive, attention-based, and adversarial designs. We also provide comparisons between the models in terms of network complexity, memory footprint, model input and output, learning details, the type of network losses, and important architectural differences (e.g., depth, skip-connections, filters). The extensive evaluation performed shows the consistent and rapid growth in the accuracy in the past few years along with a corresponding boost in model complexity and the availability of large-scale datasets. It is also observed that the pioneering methods identified as the benchmarks have been significantly outperformed by the current contenders. Despite the progress in recent years, we identify several shortcomings of existing techniques and provide future research directions towards the solution of these open problems. Datasets and codes for evaluation are publicly available at https://github.com/saeed-anwar/SRsurvey.
Article
Egocentric vision (a.k.a. first-person vision - FPV) applications have thrived over the past few years, thanks to the availability of affordable wearable cameras and large annotated datasets. The position of the wearable camera (usually mounted on the head) allows recording exactly what the camera wearers have in front of them, in particular hands and manipulated objects. This intrinsic advantage enables the study of the hands from multiple perspectives: localizing hands and their parts within the images; understanding what actions and activities the hands are involved in; and developing human-computer interfaces that rely on hand gestures. In this survey, we review the literature that focuses on the hands using egocentric vision, categorizing the existing approaches into: localization (where are the hands or parts of them?); interpretation (what are the hands doing?); and application (e.g., systems that used egocentric hand cues for solving a specific problem). Moreover, a list of the most prominent datasets with hand-based annotations is provided.
Article
Image Super-Resolution (SR) is an important class of image processing techniques to enhance the resolution of images and videos in computer vision. Recent years have witnessed remarkable progress of image super-resolution using deep learning techniques. In this survey, we aim to give a survey on recent advances of image super-resolution techniques using deep learning approaches in a systematic way. In general, we can roughly group the existing studies of SR techniques into three major categories: supervised SR, unsupervised SR, and domain-specific SR. In addition, we also cover some other important issues, such as publicly available benchmark datasets and performance evaluation metrics. Finally, we conclude this survey by highlighting several future directions and open issues which should be further addressed by the community in the future.
Article
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
Chapter
Gestures are robust behaviors that influence communication, learning and memory. Here evidence that gesture can support math learning and word learning in children and adults is reviewed. Although there is robust evidence revealing beneficial effects of gesture on learning across ages and across domains, it is not clear what mechanisms underlie these effects. Gestures may change learning via a variety of cognitive processes, including perceptual, attentional, memory, linguistic and conceptual processes. In order to delineate among the various potential mechanisms by which gesture supports learning, researchers studying gesture need to develop more specific predictions about exactly which gestures will support learning and precisely when these gestures will be helpful.