Ultra-Range Gesture Recognition using a Web-Camera in
Human-Robot Interaction
Eran Bamani, Eden Nissinman, Inbar Meir, Lisa Koenigsberg, Avishai Sintov∗
School of Mechanical Engineering, Tel-Aviv University, Haim Levanon St., Tel-Aviv, 6997801, Israel
Abstract
Hand gestures play a significant role in human interactions where non-verbal intentions,
thoughts and commands are conveyed. In Human-Robot Interaction (HRI), hand gestures of-
fer a similar and efficient medium for conveying clear and rapid directives to a robotic agent.
However, state-of-the-art vision-based methods for gesture recognition have been shown to be
effective only up to a user-camera distance of seven meters. Such a short distance range limits
practical HRI with, for example, service robots, search and rescue robots and drones. In this
work, we address the Ultra-Range Gesture Recognition (URGR) problem by aiming for a recog-
nition distance of up to 25 meters and in the context of HRI. We propose a novel deep-learning
framework for URGR using solely a simple RGB camera. First, a novel super-resolution model
termed HQ-Net is used to enhance the low-resolution image of the user. Then, we propose
a novel URGR classifier termed Graph Vision Transformer (GViT) which takes the enhanced
image as input. GViT combines the benefits of a Graph Convolutional Network (GCN) and a
modified Vision Transformer (ViT). Evaluation of the proposed framework over diverse test data
yields a high recognition rate of 98.1%. The framework has also exhibited superior performance
compared to human recognition in ultra-range distances. With the framework, we analyze and
demonstrate the performance of an autonomous quadruped robot directed by human gestures in
complex ultra-range indoor and outdoor environments.
Keywords: Human-Robot Interaction, Gesture Recognition, Ultra-Range
1. Introduction
Gestures are an imperative medium to complement verbal communication between humans
[1, 2, 3]. From simple hand gestures to complex body movements, gestures can convey rapid
and nuanced information that is essential for effective communication. Often, gestures entirely
replace speech when the delivered message is short and simple [4], e.g., thumbs-up for approval
or beckoning for an invitation to approach. A gesture is also the simplest and most immediate means of communication when the source and target participants are far from each other. In such a case, speech may not be clearly conveyed. Hence, if a line-of-sight exists between the participants, a gesture can transfer quicker and clearer messages or commands.
∗Corresponding Author.
Email address: sintov1@tauex.tau.ac.il (Avishai Sintov)
Preprint submitted to Engineering Applications of Artificial Intelligence January 16, 2024
Figure 1: A robot recognizing a directive gesture from a user located 25 meters away by solely using an RGB camera.
Upon recognizing, for instance, a beckoning gesture, the robot will move toward the user.
Figure 2: Illustration scheme of the proposed URGR framework. The user in the image is detected with YOLOv3
followed by cropping out the background. Since the user appears in low quality due to the large distance from the camera, the proposed super-resolution method HQ-Net enhances the quality of the cropped image. Then, a classification model
termed GViT outputs the predicted gesture.
Similar to humans, in Human-Robot Interaction (HRI), a user may wish to convey a direc-
tive message to a robot using the same medium. For instance, the user may wish to exhibit a
desired motion direction for the robot through a pointing gesture [5]. Therefore, efficient ges-
ture recognition in HRI is indispensable in order to acquire natural conveyance of commands
and information to robots. Since the leading approach for gesture recognition is visual percep-
tion [6, 7, 8, 9, 10], an important aspect is the distance range between the gesturing user and
the camera on an agent. Previous work has commonly considered short- and long-range gesture recognition, roughly limited to 1 meter and 4-7 meters, respectively [11]. However, we argue that the range of gesture recognition can be extended considerably further. Therefore, we define
the Ultra-Range Gesture Recognition (URGR) problem where effective recognition is required
up to a distance of 25 meters.
Short-range (or close-range) gesturing can generally be used when working in front of a
computer [12, 13], with an infotainment system within a car [14, 15] or in Virtual Reality (VR)
[16]. Close-range approaches for gesture recognition are, in most cases, not feasible for HRI
with mobile systems. On the other hand, long-range gesturing may include indoor service robots
such as cleaning robots, robotic servants and medical robots. Also, applications are considered in
smart homes, TV control and video games [17]. While numerous approaches have been proposed
for long-range gesture recognition with high success rates, these were only demonstrated indoors,
with a relatively structured background, and the majority were effective only up to 4-5 meters [18,
19]. Some of the approaches use an RGB-Depth (RGB-D) camera [6, 20, 21]. Depth cameras
are limited to short-range in indoor environments and work poorly outdoors. Furthermore, they
require an additional hardware setup and limit the generality of the method. With RGB cameras,
various gesture recognition packages are available open-source (e.g., SAM-SLR [22], MediaPipe
Gesture Recognizer and OpenPose [23]) and work quite well in short- and long-range. However, when tested in a preliminary study, these yielded poor results beyond 4 meters.
While long-range gesture recognition can offer a high recognition rate, it may not be practical
in actual HRI tasks due to a limited range. The farthest distance reported for visual-based gesture
recognition is 7 meters. One work reaching such a distance involved an attention-based model
able to recognize short- and long-range hand gestures [24]. Similarly, a model based on convolu-
tional and feature aggregation pyramid networks achieved effective feature extraction for gesture
detection and recognition [25]. However, these approaches were not demonstrated in ultra-range
or in outdoor environments. To the best of the authors’ knowledge, no work has addressed the
gesture recognition problem at distances farther than 7 meters.
In this paper, we address and explore the URGR problem and aim for an effective distance
of up to 25 meters solely based on an RGB camera. Hence, a user can effectively direct a robot
from a large distance (Figure 1). We propose a data-based framework which does not require
depth information but solely a simple web-camera. Hence, the approach is accessible and cost-
effective. The main challenge in URGR is the low resolution of the image after cropping out the background and focusing on the gesturing user. The resulting image is usually distorted, with compromised details of the hand due to the extended distance. To overcome the low-resolution
challenge, Super Resolution (SR) is a common method to effectively enhance the resolution of
an image and enable more precise object recognition [26]. Algorithms, such as ESRGAN [27]
and Real-ESRGAN [28], utilize a Generative Adversarial Network (GAN) in order to determine
whether a reconstructed image is realistic. Hence, the GAN guides the recovery of the image
toward reconstructing details. However, these algorithms may prioritize low-resolution informa-
tion at the expense of high-frequency information, leading to over-smoothing and distortion of
indistinguishable features (e.g., fingers). While these models may focus on static or prominent
elements of the image such as windows and bricks, they may not accurately capture intricate
details of the human body in various postures. Therefore, in this work, we propose a novel SR
model termed HQ-Net which is based on a scheme of filters, self-attention and convolutional
layers. HQ-Net is validated through a comparison to state-of-the-art in SR with a focus on the
URGR problem.
In our proposed URGR framework, visualized in Figure 2, an image taken by the camera is
pre-processed to focus on the user. Quality improvement then follows using the HQ-Net model.
Then, the improved image is fed into a classifier termed Graph-Vision Transformer (GViT).
The GViT model leverages the benefits of the Graph Convolutional Network (GCN) [29] and
Vision Transformers (ViT) [30]. ViT is modified to receive embedded information from the
graph convolutional layers of the GCN. The combination of GCN and ViT in GViT enables
it to capture local and global dependencies in the image. The GViT outputs the probability
distribution over all gesture classes. The model was integrated into a robotic system which
executes actions corresponding to recognized gestures. To summarize, the contributions of this
work are as follows:
• A novel SR model is proposed for significantly enhancing the quality of low-resolution
images. In addition to improving the success rate in URGR, it may be used
for traditional applications of SR such as image processing, security and medical imaging.
• During the training of the HQ-Net model, we offer an image degradation process specif-
ically designed for ultra-distance objects. The process is validated by comparing it to
other methods over URGR test cases.
• Unlike prior work, we propose the first model that can recognize gestures in ultra-range
up to a distance of 25 meters between the camera and the user. The model is able to work
in complex indoor and outdoor environments and is also evaluated in various edge cases
such as low lighting conditions and occlusions. Also, the model is shown to outperform
human recognition.
• The proposed approach is demonstrated and evaluated in an HRI scenario with a mobile
robot.
• Trained models and datasets are available open-source for the benefit of the community1.
The proposed approach may be used in the same applications as long-range gesturing described above, but with an extended and more practical range. Furthermore, URGR may be required for
directing search and rescue robots, drones and other service robots. Space exploration, police
work and entertainment can also gain from URGR. The work may also be adapted in future work
to various other object recognition tasks with an ultra-range capability.
2. Related Work
2.1. Gesture Recognition
Gesture recognition is a long-standing research field in computer vision [31]. The common
approaches are based on capturing hand poses or movements using either an RGB camera, an
RGB-D camera or some wearable device. RGB-D cameras are a popular tool in gesture recog-
nition since they add spatial information to the observed RGB image. Hence, they are able to
provide more accurate information regarding the exhibited gesture [32, 33, 34]. Despite enabling
easier access to gesture features, depth cameras work poorly outdoors and are usually limited to
short to long ranges in indoor environments. In addition, they require an additional hardware
setup and limit the generality of the method [35]. To address these limitations, researchers have
turned to recognizing human gestures using RGB cameras which have lower cost and are more
accessible. Early approaches handcrafted methods for extracting features of hand pose for further
recognition with traditional classifiers such as Support Vector Machine (SVM) [36] or k-Nearest
Neighbors (kNN) [37]. However, such approaches can usually handle short-range recognition
and may have difficulties in unstructured backgrounds. Hence, more recent work focused on
deep-learning approaches [13, 24, 25]. For instance, an approach combined Convolutional Neu-
ral Network (CNN) with Long Short-Term Memory (LSTM) for dynamic gesture recognition in
short-range with a single camera [38]. Another work utilized a single moving camera to cap-
ture low-resolution images [39]. While the approach is effective up to 5 meters, it could only
1To be available upon acceptance for publication. Images will be modified in order to protect the privacy of the
participants.
recognize arm gestures with no capability to observe finger details. A different approach used
a CNN model to capture spatiotemporal features for hand gestures over a sequence of images
[40]. While providing high recognition rates, all approaches are limited to short- or long-range
and cannot be applied to URGR.
Some approaches are based on a human pose estimation model often termed a skeleton model
[41, 42, 10, 13]. Existing packages such as OpenPose [23] and MediaPipe [43] estimate land-
marks of the human body including fingers. Some packages may use only an RGB camera [44]
while others also require depth information [18]. From estimated landmarks, one can infer the exhibited gesture. Using a skeleton model provides easy access to hand pose and gesturing
shapes. However, due to the limitations of these skeleton models, gesture recognition is only
available in short- and long-range.
Wearable technology is an alternative approach where some measurement device is attached
to the user. With a wearable device, distance is only limited by communication with the receiving agent [45]. A notable method is the use of inertial sensors on a smartwatch to track motions of the arm and hand [46]. However, inertial sensors alone have difficulty implicitly sensing the state of the hand and, thus, are usually integrated with other sensors [47, 48].
Electro-Myography (EMG) is another wearable approach where electrical currents in muscle
cells are measured and provide some state of the hand [49, 50]. In [51], a wearable band on the
forearm measured a 3-channel EMG signal during gestures. A model combining kNN and De-
cision Tree algorithms was trained to classify the signals into gestures. Similarly, EMG signals
were used to train a convolutional recurrent neural network to control a robotic arm through ges-
tures [52]. In a different approach, Force-Myography was used to sense muscle perturbations on
the arm using simple force sensors and to classify signals to corresponding gestures [53]. Other
approaches include optical sensors for measuring blood volume [54] and acoustic sensors on the wrist which measure tendon movements [55]. While they can provide a high recognition rate, the above wearable approaches may require dedicated and expensive hardware. Hence, they do not allow occasional users to interact with a robot. Also, some may not generalize to new users, and additional data collection would be required.
2.2. Super Resolution
As discussed in the previous section, SR aims to enhance low-resolution images and im-
prove distorted details. Various methods have been proposed for different applications such as
satellite and aerial imagery, object detection in complex scenarios, medical imaging, astronomi-
cal images, forensics, face detection and recognition in security and surveillance, number plates
reading and text analysis [56, 57]. One approach is the Robust U-Net (RUNet) [58] which is a
variant of the well-known U-Net for segmentation [59]. U-Net was initially used for image seg-
mentation for biomedical applications where the image is contracted and expanded in a U-shaped
path of convolutional layers. However, it was shown to be efficient for image quality improve-
ment [60]. A more advanced approach is the Enhanced Super-Resolution Generative Adversarial
Networks (ESRGAN) [27]. ESRGAN uses an adversarial process to generate realistic textures in
a low-resolution image. Blind Super-Resolution Generative Adversarial Networks (BSRGAN)
is an extended version of ESRGAN using different data and loss functions [61]. Similarly, Real-
ESRGAN uses a high-order degradation process to better simulate complex image corruptions
[28]. While efficient, it has been claimed that SR models are generally trained for specific tasks
and may not perform well for different ones [57].
3. Methods
3.1. The Ultra-Range Recognition Problem
The quality of images degrades with the increase of the camera-to-subject distance d. The degradation stems from a combination of factors including the dimensions of the image sensor S, the optical characteristics of the camera lens L, and the separation distance d itself. Upon capturing an image, the quantity of light incident on the image sensor diminishes as the distance d increases. This light attenuation results in a reduction of the signal-to-noise ratio (SNR) and a loss of fine details within the image. Moreover, the sensor dimension S significantly influences its light-capturing capacity, where smaller sensors tend to yield images of inferior quality.
Typically, when the object of interest is distant from the camera, zooming becomes necessary in order to perceive finer details, at the expense of image sharpness and clarity. This results
in reduced quality. Moreover, pronounced movements often present in long-distance interac-
tions lead to motion blur in captured images. The presence of motion blur is a significant obstacle to accurately extracting fine-detail information. To cope with these challenges, the development of robust algorithms, capable of effectively mitigating motion blur and capturing meaningful patterns, becomes imperative.
In this work, we focus on recognizing human gestures in the context of directing robots by, for instance, pointing and beckoning. Gesture recognition is required at a camera-to-gesture distance of up to 25 meters. In such a scenario, a blurry and difficult-to-distinguish hand must be recognized within an image containing many other details. The image quality and large variance
of gesture appearance at longer distances pose a complex challenge to the efficacy and precision
of a Deep Neural Network (DNN) model. Only limited features in the images are available to facilitate the learning process of the DNN. Furthermore, factors such as diminished visual acuity,
occlusions and perspective distortion contribute to ambiguity and diminished distinguishability
of gestures.
Formally, the problem addressed in this paper is as follows. Given an RGB image J_i in which a single user is observed at a distance d ≤ 25 meters, the user can exhibit one out of m gesture classes {O_1, ..., O_m}. Class O_1 is the null class where no gesture is performed and the user conducts any other task. A gesture can be performed with either the right or left arm. We search for a trained classifier that will recognize the gesture exhibited by the user in the image. In other words, a trained classifier will solve the following maximization problem

j^* = \arg\max_{j} P(O_j \mid J_i), \quad \forall j = 1, \ldots, m    (1)

where P(O_j | J_i) is the conditional probability for image J_i to be in class O_j. Problem (1) is solved such that class O_{j^*} corresponds to the true gesture exhibited in image J_i.
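As a minimal illustration (in Python, with hypothetical variable names not taken from the paper), a trained classifier that outputs the class probabilities reduces Problem (1) to an argmax over the predicted distribution:

```python
import torch

def recognize_gesture(probs: torch.Tensor) -> int:
    """Solve Eq. (1): return j* = argmax_j P(O_j | J_i).

    probs -- tensor of shape (m,) holding P(O_1|J_i), ..., P(O_m|J_i).
    """
    return int(torch.argmax(probs).item())

# Example with m = 6 classes (null, pointing, thumbs-up, thumbs-down, beckoning, stop).
probs = torch.tensor([0.02, 0.90, 0.03, 0.02, 0.02, 0.01])
print(recognize_gesture(probs))  # -> 1
```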
3.2. Data Collection
A dataset H is collected for training the recognition model using a simple RGB camera. A set of N images is collected by taking images of various users at different distances from the camera and in different environments, both indoor and outdoor. The users exhibit any of the pre-defined gestures of classes {O_1, ..., O_m}. In order to maintain a nearly uniform distribution, an equal amount of images is taken for each gesture class. In addition, a uniform distribution of samples is taken along the distance from the camera. For this purpose, a long measuring tape was used such that an equal amount of samples is taken for each distance in intervals of one meter in the range d ∈ [0, 25] meters. Each image J_i is labeled with the class index o_i ∈ {1, ..., m} and distance d_i ∈ [0, 25]. Consequently, the resulting training dataset is of the form H = {(J_i, o_i, d_i)}_{i=1}^N.

Figure 3: Image examples of (a) pointing and (b) stop gestures showing different widths of the user. Hence, pixels around the user are added in order to maintain a constant image proportion.
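For illustration only, a minimal sketch of how the labeled dataset H = {(J_i, o_i, d_i)} could be represented in code; the file paths and field names are hypothetical assumptions, not the authors' data format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GestureSample:
    image_path: str   # path to RGB image J_i
    gesture: int      # class index o_i in {1, ..., m}
    distance: float   # camera-to-user distance d_i in [0, 25] meters

# H = {(J_i, o_i, d_i)}_{i=1}^N as a simple list of labeled samples (hypothetical entries).
H: List[GestureSample] = [
    GestureSample("images/user03_outdoor_0142.jpg", gesture=2, distance=17.0),
    GestureSample("images/user07_indoor_0009.jpg", gesture=1, distance=3.0),
]
```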
3.3. Image Quality Improvement
In this part, we propose a novel model termed HQ-Net for solving the Single Image Super-
Resolution (SISR) problem where a trained model infers a high-resolution image from a low-
resolution one.
3.3.1. Pre-Processing
A subset of dataset H is used to train the HQ-Net model. Since HQ-Net aims to improve the quality of zoomed-in images, we require images with sufficient quality for ground-truth labeling. Hence, we define a subset S ⊂ H where only samples taken in the range d ∈ [2, 8] meters are used. The degradation becomes predominant when d > 8 meters, making it impossible to produce high-quality labels for training.
In order to emulate the appearance of ultra-range images after focusing on the user, each image J_i ∈ S is degraded by the following manipulations. First, we require images focused on the user while the gesture is visible. Using a You Only Look Once version 3 (YOLOv3) [62] bounding box, the user is detected within the image. However, the bounding box may not always bound the entire human body completely and can miss vital parts such as the gesturing arm. Hence, an extension of the bounding box is required while maintaining a constant image proportion. However, the width of the user may differ between gestures. For instance, as seen in Figure 3, the width of the user when pointing is much larger than when exhibiting a stop gesture. Hence, the pixel thickness added around the bounding box is b/a, where b is the diagonal length of the bounding box and parameter a is a pre-defined user-to-image ratio. Then, the image is cropped around the extended bounding box. Finally, the cropped product is resized to a resolution of 512 × 512 so as to have a unified size in the dataset.
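A minimal sketch of this crop-and-resize step, assuming a person bounding box is already available from an off-the-shelf detector such as YOLOv3; the function name and the default value of the ratio a are illustrative assumptions, not the authors' configuration:

```python
import math
import cv2

def crop_user(image, box, a: float = 8.0, out_size: int = 512):
    """Crop the detected user with a padded margin and resize to a unified size.

    image -- HxWx3 array (e.g., loaded with cv2.imread).
    box   -- (x1, y1, x2, y2) person bounding box from a detector such as YOLOv3.
    a     -- pre-defined user-to-image ratio (value here is illustrative only);
             the padding thickness is b/a where b is the box diagonal.
    """
    x1, y1, x2, y2 = map(int, box)
    b = math.hypot(x2 - x1, y2 - y1)          # diagonal length of the bounding box
    pad = int(round(b / a))                   # pixel thickness added around the box
    h, w = image.shape[:2]
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_CUBIC)
```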
Figure 4: Illustration of the HQ-Net model focusing on the user. A cropped image is the input to three pathways yielding a quality-improved image Î.

After focusing on the user in the image and cropping, the image J_i is deliberately degraded in order to emulate an image taken from a long distance. We have developed a degradation process comprising a series of transformations and the application of multiple filters. These filters collectively contribute to the creation of a nuanced shading effect around the object of interest and to detail corruption, thereby emulating the characteristics of a long-distance perspective. The formalized expression for acquiring a degraded image J̃_i is given by

\tilde{J}_i = F_{compress}(F_{sharpen}(F_{smooth}(J_i))).    (2)

Function F_{smooth} is a smoothing operation where a spatial filtering technique reduces high-frequency noise. Such an operation acts similarly to a median filter, making neighboring pixels closely alike and blurring some details as if they are seen from far away. Function F_{sharpen} is a sharpening operator which generates some noise in high-frequency regions. Through experimentation, we have found that the combination of smoothing and sharpening operations, while degrading the quality of details, also creates the shadowing effect in the high-contrast regions that surround the user. Finally, compression of the image with function F_{compress} leads to the reduction in fine details and fidelity. In conclusion, each image J_i ∈ H in the range d ∈ [2, 8] meters is passed through the above process to generate dataset S = {(J̃_i, J_i)}_{i=1}^M with M samples.
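The exact filters behind F_smooth, F_sharpen and F_compress are not fully specified beyond Eq. (2), so the following sketch uses plausible OpenCV stand-ins (Gaussian smoothing, unsharp masking and JPEG re-compression) with illustrative parameters:

```python
import cv2

def f_smooth(img, k: int = 7):
    # Spatial low-pass filtering that blurs fine details (median-filter-like effect).
    return cv2.GaussianBlur(img, (k, k), 0)

def f_sharpen(img):
    # Unsharp-mask style sharpening; re-introduces noise in high-frequency regions.
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=3)
    return cv2.addWeighted(img, 1.5, blurred, -0.5, 0)

def f_compress(img, quality: int = 20):
    # Lossy JPEG round-trip that discards fine details and fidelity.
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

def degrade(img):
    # Eq. (2): J~ = F_compress(F_sharpen(F_smooth(J)))
    return f_compress(f_sharpen(f_smooth(img)))
```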
3.3.2. Model
Upon capturing an image I in ultra-range, the HQ-Net model will optimize image quality in the region of the visible user. The HQ-Net is illustrated in Figure 4. In the first step, the user is localized within the image using YOLOv3 and cropped out to a sub-image Ī, simulating its proximity to the camera. With Ī, a three-path processing scheme is employed. In the initial pathway, the Canny edge detection algorithm is used to discern prominent edges within Ī. The edge image then passes through the HQ layer, described below, and a set of convolutional layers. In the second pathway, the cropped image Ī is passed through an HQ layer followed by a self-attention mechanism. The self-attention plays a significant role in understanding the spatial relationships and context within the image. It prioritizes relevant details, reduces noise and eliminates distractions, collectively leading to a significant improvement in the overall quality of the processed image. The third pathway has an auto-encoder framework with several convolutional layers and skip connections. The layers map image Ī to a latent space of size 2048. Outputs from the first two pathways are concatenated into the latent space for further expansion with an additional set of convolutional layers and skip connections. The output is the quality-improved image Î.
Figure 5: Architecture of the HQ layer used in the HQ-Net (Figure 4) for improving image quality of a user in ultra-range.
The HQ layer, seen in Figure 5, is composed of a series of convolutional layers, batch normalization and Scaled Exponential Linear Unit (SELU) activation functions. The HQ layer is structured by bifurcating the input. The first branch proceeds through designated convolutional layers, while the second branch undergoes a single convolutional layer followed by bicubic interpolation. The outputs of both branches are element-wise multiplied and fused with the final convolutional layer within the network. This novel network architecture of HQ-Net is tailored to scenarios where image fidelity is compromised by distant capture and subsequent crop operations. HQ-Net is trained using dataset S such that the input is the degraded image J̃_i with the corresponding output label J_i. Training involves the minimization of a pixel-to-pixel Mean Squared Error (MSE) loss function, detailed in the evaluation section.
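A minimal PyTorch sketch of the two-branch HQ layer as described above; the channel counts and kernel sizes are assumptions for illustration and do not reproduce the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HQLayer(nn.Module):
    """Sketch of the HQ layer: a conv stack and a (conv + bicubic) branch, multiplied and fused."""

    def __init__(self, in_ch: int = 3, mid_ch: int = 32):
        super().__init__()
        # Branch 1: designated convolutional layers with batch normalization and SELU.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.SELU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.SELU(),
        )
        # Branch 2: a single convolution, later followed by bicubic interpolation.
        self.branch2 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        # Final fusion convolution.
        self.fuse = nn.Conv2d(mid_ch, mid_ch, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b1 = self.branch1(x)
        b2 = F.interpolate(self.branch2(x), size=b1.shape[-2:],
                           mode="bicubic", align_corners=False)
        return self.fuse(b1 * b2)   # element-wise product fused by the final conv
```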
3.4. URGR Model
Gesture recognition encompasses a set of methodologies and algorithms designed to identify and interpret human gestures depicted within images. As discussed in Sections 1-2, state-of-the-art approaches for gesture recognition are designed to identify human gestures at close proximity of up to 7 meters. While image enhancement using HQ-Net is beneficial, it does not provide a complete solution for observing fine details. Hence, given a quality-improved image Î originally taken from a distance of up to 25 meters, our proposed Graph-Vision Transformer (GViT) will output a solution to (1).
As previously discussed, the greater the distance between the user and the camera, the more challenging it becomes for the model to accurately discern the executed gesture. This challenge is primarily attributed to the substantial presence of background noise in such scenarios. In order to mitigate the influence of background noise, user crop-out and quality improvement are applied. Hence, for training, dataset H is pre-processed such that each image J_i ∈ H is cropped (with YOLOv3) around the recognized user while maintaining the same image proportion as in Section 3.3.1. Then, the cropped image is passed through HQ-Net for quality improvement, yielding a modified training dataset Ĥ = {(Î_i, o_i, d_i)}_{i=1}^N where Î_i is the processed image.
The GViT model combines the power of Graph Convolutional Networks (GCN) [29] with the expressiveness of a modified Vision Transformer (ViT) [30]. Thus, it can effectively process and analyze structured data, such as images, with inherent graph-like relationships while
leveraging the self-attention mechanism of ViT in order to capture fine-grained visual features.
We briefly present the notion of these two models followed by a discussion on GViT.
3.4.1. Graph Convolutional Networks
GCN has emerged as a versatile and powerful tool for image processing, offering distinct advantages over the traditional Convolutional Neural Network (CNN). GCN is applied on undirected graphs in order to capture intricate relationships among nodes in the graph. Let G = (V, E) be an undirected graph where V and E are the sets of nodes and of edges connecting them, respectively. In GCN, G goes through a set of Graph Convolutional (GC) layers. In a GC layer, each node aggregates information from its neighboring nodes while considering both its own features and the features of its neighbors. Let h_x^{(k)} be the feature representation of node x ∈ V in layer k. The aggregation h_x^{(k+1)} of node x in the next layer is acquired through a weighted sum

h_x^{(k+1)} = \sigma\left( \sum_{y \in N(x)} W^{(k)} h_y^{(k)} \right)    (3)

where N(x) ⊂ V is the subset of nodes neighboring x, W^{(k)} is a learnable weight matrix of layer k, and σ is some activation function.
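A minimal PyTorch sketch of the aggregation in Eq. (3), assuming the graph is given as an edge list; it is illustrative only and not the implementation used in GViT (which, for instance, uses GLU rather than ReLU as the activation):

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """h_x^(k+1) = sigma( sum_{y in N(x)} W^(k) h_y^(k) )  -- Eq. (3)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # learnable weight matrix W^(k)
        self.act = nn.ReLU()                               # sigma; GViT uses GLU instead

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, in_dim) node features; edge_index: (2, num_edges) rows (src, dst).
        src, dst = edge_index
        messages = self.W(h)[src]                          # W^(k) h_y^(k) for every edge y -> x
        agg = torch.zeros(h.size(0), messages.size(1), device=h.device, dtype=h.dtype)
        agg.index_add_(0, dst, messages)                   # sum over the neighborhood N(x)
        return self.act(agg)
```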
In the field of image analysis, a graph can represent relationships between pixels or image
regions. Each pixel can be a node and edges can represent the spatial adjacency of pixels. Then,
the GCN can utilize such a structure to model intricate relationships among pixels and capture
spatial dependencies within the image. It has the ability to capture long-range dependencies and
relationships between non-adjacent pixels. Furthermore, GCN has fewer parameters compared
to a traditional CNN, making it more memory-efficient and easier to train.
3.4.2. Vision Transformer
Transformers have been a ground-breaking architecture for natural language processing [63].
In general, Transformers enable the learning of long-range dependencies in sequential data.
While an image is not considered sequential data, ViT is an extension of the Transformer for
learning dependencies and intricate spatial relationships between regions in an image. Hence,
the key steps in ViT begin with patch extraction, where input images are divided into non-
overlapping patches, each treated as a sequence of feature vectors. To introduce spatial infor-
mation, positional encoding is added to the patches.
Linear embeddings of the patches are considered as tokens and fed into the Transformer
encoder of the ViT. The encoder is composed of a stack of multi-head self-attention layers.
These layers capture long-range dependencies between the patches. The self-attention mecha-
nism evaluates the hierarchical importance of each patch with the assistance of Softmax func-
tions. The outputs of the Transformer encoder are passed to a classification head implemented
by a global average pooling and feed-forward neural network. This hierarchical approach is
considered highly beneficial for complex visual recognition.
3.4.3. Graph-Vision Transformer (GViT)
Figure 6: The proposed GViT model for URGR. The model reconstructs the image into a graph structure followed by GCN and ViT models.

GViT combines the benefits of GCN and ViT such that it can effectively model both local and global dependencies within the structured data. A processed image Î_i ∈ Ĥ is passed through a sequence of two GC layers and a ViT, as illustrated in Figure 6. As discussed above, the input to the standard ViT is a linear embedding of the patches. In GViT, however, we modify the ViT such that the GC layers are its embedding input. First, image Î_i is converted into a graph structure by iterating over all pixels. For each pixel, edges are established with its neighboring pixels, including diagonal ones. These edges construct the set E and define the connectivity pattern of graph G.
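A minimal sketch of this image-to-graph construction, where every pixel becomes a node and edges connect it to its 8 neighbors; note that for a 512 × 512 image the resulting graph is very large, so the loop below is for illustration only:

```python
import torch

def image_to_graph(img: torch.Tensor):
    """Build node features and an 8-neighborhood edge index from an image tensor.

    img -- (C, H, W) tensor; each pixel becomes a node with its C channel values as features.
    """
    c, h, w = img.shape
    nodes = img.permute(1, 2, 0).reshape(h * w, c)          # (H*W, C) node feature matrix
    edges = []
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:          # neighbor inside the image
                        edges.append((y * w + x, ny * w + nx))
    edge_index = torch.tensor(edges, dtype=torch.long).t()   # (2, num_edges)
    return nodes, edge_index
```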
Using the graph structure, GCN is able to propagate information with graph convolutions in
order to capture dependencies in the image. The graph is propagated through two GC layers defined in (3), with the activation function σ implemented by the Gated Linear Unit (GLU). GLU is a
gating mechanism following a convolutional operation in order to control the flow of information
[64]. The key advantages of the GLU activation function in image recognition tasks are capturing
complex patterns, effective feature learning and the reduction of overfitting. In addition to the
GLU, dropout is added between the GC layers in order to make the model more general and to
mitigate overfitting.
The output of the GCN is passed into a single convolutional layer in order to reduce the num-
ber of image channels and adapt to the required input shape of the ViT. Finally, the ViT compo-
nent takes the output of GCN effectively replacing the linear embedding part of the traditional
ViT. The ViT model extracts meaningful features from the graph representation using the self-
attention mechanism. Following the self-attention step, the output is further processed through
the classification head in order to acquire a probability distribution (P(O_1|Î_i), ..., P(O_m|Î_i)) over the gesture classes. Then, the recognized gesture is the solution to (1). GViT is trained using the cross-entropy loss function. In the next section, we demonstrate the effectiveness of the proposed GViT through experiments, showcasing its ability to achieve state-of-the-art results in recognizing human gestures at ranges of up to 25 meters.
4. Model Evaluation
In this section, we present the testing and analysis of the proposed framework. Without loss of generality, we chose to focus on six gesture classes seen in Figure 7. Five are directive gestures: pointing, thumbs-up, thumbs-down, beckoning and stop. The sixth class is the null one where
the user does not exhibit any gesture and can perform any other task. As discussed previously,
we analyze the ability of the proposed model, along with other known methods, to identify the
gestures in a range of up to 25 meters. For a basis of comparison, we first present results of
Figure 7: Five directive gesture classes considered in this paper, from left to right: pointing, thumbs-up, thumbs-down,
beckoning and stop. A sixth class is the null one where the user does not exhibit any gesture.
Table 1: Human recognition of gestures in various distance ranges
Participant Age Success rate (%) in distance range Total
# 2-7 (m) 8-12 (m) 13-18 (m) 19-22 (m) 23-25 (m) (%)
1 20-25 100 100 100 70 70 88
2 20-25 100 100 80 70 60 82
3 25-30 100 100 80 70 40 78
4 25-30 100 100 100 70 60 86
5 35-40 100 100 90 60 50 80
6 35-40 100 90 70 60 40 72
7 50-60 100 100 80 50 40 74
8 60-70 100 100 80 60 40 76
9 60-70 100 100 90 60 60 82
10 70+ 100 90 70 50 20 66
All 100 98 84 62 48 78.4
recognition by human participants directly observing the gestures from a long distance. Then, a
comparative evaluation of the proposed model is discussed. All experiments were carried out on
a Linux Ubuntu (18.04 LTS) machine with an Intel Xeon Gold 6230R CPUs (20 cores running
at 2.1GHz) and four NVIDIA GeForce RTX 2080TI GPUs (each with 11GB of RAM).
4.1. Human Gesture Recognition
Independent of the hardware and of any learning method, we wish to examine the ability of humans to recognize the gestures in ultra-range. Hence, an experiment was designed and conducted in
which a human demonstrator presented the gestures from various distances. Also, the six gestures
were presented in a uniform distribution and in multiple indoor and outdoor environments. Ten participants of different age ranges were asked to name the gesture that they recognized the demonstrator making. Participants requiring distance glasses were asked to wear them. The distance
range of [2,25] meters was partitioned into five ranges and, for each range and participant, ten
recognition trials were made. Table 1 summarizes the human recognition results. The results
show, as expected, that short- and long-range recognition is much more successful than ultra-range recognition across all participants. In particular, the probability for a human to recognize a gesture in the range of [19, 25] meters is approximately 0.5.
4.2. Datasets
As discussed in Section 3.2, dataset H is collected for training both the HQ-Net and GViT models. The dataset was collected using a simple web camera yielding images of size 480 × 640. In addition, 16 participants contributed data in various indoor and outdoor environments, with a uniform distribution of data. All participants used only their right arm to exhibit gestures. Nevertheless, we will further analyze the generalization to the recognition of left-arm gestures. The collection yielded approximately 58,000 samples per gesture class and a total of N = 347,483 labeled samples in H. Another set of 10,109 labeled images was taken as a test set in environments and with users not included in the training. As presented in Section 3.3.1, a subset S ⊂ H is taken and used for training the HQ-Net model. The subset includes samples in the range of [2, 8] meters and is of size M = 191,574. Here also, 6,319 independent images were taken as a test set for evaluating the HQ-Net model. After the training of HQ-Net using S, all N samples in H were pre-processed to Ĥ by focusing on the user and improving quality with HQ-Net.
4.3. Existing gesture recognition models
In preliminary studies, we have evaluated several known gesture recognition models that are
publicly available. These include open-source packages such as SAM-SLR [22], MediaPipe Ges-
ture Recognizer [43] and OpenHands by OpenPose [23]. The packages were tested by exhibiting the gestures they were trained on, but at ultra-range distances. All packages failed to provide any recognition at distances larger than 4-5 meters.
4.4. Evaluation of Image Quality Improvement
As discussed in Section 3.3, model HQ-Net is used to improve the quality of the observed
user in the image prior to the recognition of a gesture. The model is trained with dataset S
where a degraded image J̃_i is mapped to its original high-quality image J_i. We compare HQ-Net to
various state-of-the-art models including Autoencoder [65], U-Net [59], U-Net++ [66], Robust
U-Net [58], ESRGAN [27] and BSRGAN [61]. The baseline is a simple Autoencoder where the
image is encoded and reconstructed to the original variant. U-Net++ is an extension of the U-Net
architecture with additional skip-connections to improve accuracy. The other benchmark models
were discussed in Section 3.3.2.
Image quality improvement commonly uses the Mean Squared Error (MSE) loss function to
compare between the improved image Î_i and the ground truth one I_i. MSE measures the average of the squares of each pixel error and is given by

MSE = \frac{1}{n_x n_y} \sum_{j=1}^{n_x} \sum_{k=1}^{n_y} \left( i_{i,j,k} - \hat{i}_{i,j,k} \right)^2    (4)

where n_x and n_y are the number of rows and columns in the images, respectively, and i_{i,j,k} is the (j, k) component of image I_i. In such a way, MSE evaluates pixel dissimilarities between the output and ground truth images. All compared models were trained using the MSE loss function.
In addition, we evaluate the trained models with the Peak Signal-to-Noise Ratio (PSNR). PSNR is a quality assessment metric for image restoration given by

PSNR = 10 \log_{10} \frac{L^2}{MSE}    (5)
Table 2: Comparison of image quality improvement by various models
Model MSE Loss PSNR (dB)
Autoencoder 0.401 13.67
Unet 0.305 14.66
Unet++ 0.289 15.01
RUNet 0.198 21.77
ESRGAN 0.239 15.67
BSRGAN 0.216 22.31
HQ-Net 0.019 34.45
where L is the maximum pixel value [67]. Hence, PSNR is a measure of the maximum error in the image and is directly related to the MSE. The higher the PSNR value, the more closely the reconstructed image resembles the ground truth one.
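For reference, a minimal NumPy sketch of the two metrics in Eqs. (4)-(5), assuming 8-bit images so that L = 255:

```python
import numpy as np

def mse(gt: np.ndarray, rec: np.ndarray) -> float:
    # Eq. (4): mean of squared per-pixel errors between ground truth and reconstruction.
    return float(np.mean((gt.astype(np.float64) - rec.astype(np.float64)) ** 2))

def psnr(gt: np.ndarray, rec: np.ndarray, max_val: float = 255.0) -> float:
    # Eq. (5): PSNR = 10 log10(L^2 / MSE); higher means closer to the ground truth.
    err = mse(gt, rec)
    return float("inf") if err == 0 else 10.0 * np.log10(max_val ** 2 / err)
```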
The comparative results are presented in Table 2 by reporting both MSE losses and PSNR
values for all methods over the test set. The results clearly show the dominance of HQ-Net over
the existing models. Figure 8 shows an image captured from a distance of 9 meters and in-
tentionally degraded as described in Section 3.3.1 for demonstration. Then, the degraded image
is improved using HQ-Net and the existing methods. Visually, HQ-Net shows the best recon-
struction of the image. On the other hand, Figure 9 shows an example of an image taken from a
distance of 25 meters and directly improved. HQ-Net provides the best detail improvement and is therefore used for enhancing image quality before applying the gesture recognition model. Nevertheless, some of the existing methods seem visually sufficient for gesture recognition. Hence, we further analyze their use for classification in order to justify the use of HQ-Net.
4.5. Gesture Recognition Model
Once the image is cropped and improved using HQ-Net, it is passed through the GViT for classification of the gesture. We compare the proposed GViT model with various other models
including a standard CNN, DenseNet [68], EfficientNet [69], GoogLeNet [70], Wide Residual
Networks (WideResNet) [71] and Visual Geometry Group (VGG) [72]. Along with the pro-
posed combination of GCN and ViT, performance is evaluated for each separately. The standard
CNN was optimized to include five convolutional layers followed by max-pooling to down-
sample the features, five fully-connected layers with ReLU activation functions and a Sigmoid
function at the output. DenseNet and WideResNet are deep learning models that use densely
connected blocks and residual connections, respectively, to improve accuracy and reduce over-
fitting. EfficientNet is a CNN model which uniformly scales the depth, width and resolution
of the model in order to achieve a good balance between model complexity and performance.
Similarly, GoogLeNet is also a CNN that has a unique inception module with multiple filters in
parallel. VGG is a popular model for object recognition and segmentation which also uses a deep
convolutional architecture. For the standard models, we compare results when fine-tuning (FT) the model or newly training (NT) it from the ground up with the training data in H.
Table 3 summarizes the gesture recognition success rate over the independent test set. The
table reports the success rate for using the model in the following variations: raw images without pre-processing, where the images are directly passed through the classifier; images after cropping out the user using YOLOv3; and images with both crop-out and quality improvement using HQ-Net.

Figure 8: Demonstration of a degraded image taken from a distance of 9 meters and improved using HQ-Net and other methods. The original image is cropped using YOLOv3 and then degraded using the processing steps described in Section 3.3.1.

Figure 9: Quality improvement example of an image taken from a distance of 25 meters. The user is identified using YOLOv3 and cropped-out. Then, it is improved using HQ-Net and other methods.

Table 3: Gesture recognition success rate with various methods, with and without user crop-out and quality improvement using HQ-Net, and while fine-tuning (FT) existing models or training entirely anew (NT)
Models No pre-processing w/ crop-out w/ crop-out & HQ-Net
FT (%) NT (%) FT (%) NT (%) FT (%) NT (%)
CNN - 69.3 - 75.9 - 80.1
DenseNet-201 45.6 79.4 51.1 84.2 47.6 92.1
EfficientNet 37.9 73.3 43.4 76.1 35.1 87.5
GoogLeNet 45.1 70.8 48.8 75.8 38.9 84.1
WideResNet 49.3 78.1 53.0 80.9 37.8 89.8
VGG-16 55.5 71.5 58.8 74.1 49.9 86.1
GCN - 72.9 - 77.3 - 88.4
ViT 59.4 79.1 60.8 84.9 61.5 92.7
GViT - 86.9 - 92.6 - 98.1
First, fine-tuned models are shown to provide poor results as they were pre-trained to focus on
features irrelevant to our specified problem. Training the models from the ground up is much
more beneficial and focuses them on the required task. Next, when comparing results with and
without pre-processing, it is clear for all models that cropping-out the user from the image and
improving quality with HQ-Net significantly increase the success rate. Furthermore, GCN and
ViT individually yield roughly comparable accuracy to the other benchmarked models. On the
other hand, their combination in GViT provides superior accuracy. For GViT, cropping-out the
images improves success rate over all classification models by 6.5% on average. Adding also
quality improvement provides an additional 5.9% success rate increase. Overall, the proposed
GViT model with cropping and quality improvement provides a superior and high recognition
success rate over all models.
In order to justify the use of HQ-Net for image improvement over other methods evaluated in
the previous section, we next evaluate the success rate of GViT with these methods. Hence, Table
4 presents the success rate of GViT over the test data while using the quality improvement of the
models evaluated in Section 4.4. Here also, having the HQ-Net over the other methods is proven
to be significantly superior. The proposed HQ-Net model provides finer hand details and gestures
can be recognized more clearly in the image. Note also that GViT without any SR provides higher
accuracy than using any of the existing methods. This emphasizes the incompatibility of these
methods to the ultra-range problem and the merits of GViT . These results also validate the merits
of combining GViT with HQ-Net in achieving a high success rate in URGR. Hence, these will
be used in further evaluations.
In the next analysis, seen in Figure 10, the recognition success rate of GViT is evaluated with regard to the distance d from the camera. At short ranges, the success rate is approximately 99%, while in ultra-range it is slightly reduced to 96.6% at 25 meters. Figures 11 and 12 show the
confusion matrices over test data of distance ranges 15-20 meters and 21-25 meters, respectively.
These results showcase the high efficiency of the GViT model throughout the entire range of
work. Figure 13 exhibits various examples of gesture recognition along with model certainty in
different indoor and outdoor environments. Furthermore, Figure 14 shows snapshot examples of gesture recognition when deploying GViT in real-time. The presented results validate and showcase the high performance of GViT in gesture recognition in various ultra-range environments.

Table 4: Gesture recognition success rate using GViT while improving image quality with various models
Models Success rate (%)
AutoEncoder 80.3
Unet 82.6
Unet++ 84.2
ESRGAN 86.5
RUNet 87.1
BSRGAN 89.3
No SR 92.6
HQ-Net 98.1

Figure 10: Gesture recognition success rate of GViT with regard to the distance d from the camera.
4.6. Data requirements
An evaluation of the data requirements for training HQ-Net and GViT is given next. First, we evaluate the MSE loss and PSNR of HQ-Net with the increase in data, up to M = 191,574 samples. For each specific amount of data, the model was cross-validated over five batches taken randomly from the entire set S. Figure 15 presents the mean MSE loss and PSNR of HQ-Net over the test data with regard to the size of the training data. The results show constant improvement with the increase of training data until reaching saturation at over 180,000 images.
With the fully trained HQ-Net model, we next evaluate the GViT model for the required amount of data. Since the training of HQ-Net has already used M labeled images in the range of [2, 8] meters, these available images are also used to train GViT. Therefore, we begin the evaluation with the existing M images of up to 8 meters. Similar to HQ-Net, the GViT model is cross-validated, for each data size, over five batches taken randomly from H. Figure 16 shows the success rate of gesture recognition over the test data with regard to the amount of data, up to N = 347,483. The results show that training with data of up to 8 meters is not sufficient for successful recognition in ultra-range.
Figure 11: Classification confusion matrix for URGR with GViT over test data in the range 15-20 meters.
Figure 12: Classification confusion matrix for URGR with GViT over test data in the range 21-25 meters.
Figure 13: Examples of correct gesture recognition with GViT: (a) thumbs-up gesture from 4 meters distance having
model certainty of 98.8%; (b) beckoning gesture from 9 meters distance having model certainty of 98.6%; (c) null
gesture from 11 meters distance having model certainty of 98.2%; (d) thumbs-up gesture from 13 meters distance having
model certainty of 98.2%. (e) pointing gesture from 18 meters distance having model certainty of 97.5%; (f) thumbs-
down gesture from 20 meters distance having model certainty of 97.3%; (g) stop gesture from 23 meters distance having
model certainty of 96.8%; and (h) thumbs-up gesture from 25 meters distance having model certainty of 98.8%.
Figure 14: Examples of correct gesture recognition in real-time with GViT. Images are shown after focusing on the user:
(a) thumbs-down gesture from 20 meters distance having model certainty of 90.0%; (b) thumbs-down gesture from 25
meters distance having model certainty of 90.3%; (c) beckoning gesture from 25 meters distance having model certainty
of 88.3%; (d) thumbs-up gesture from 23 meters distance having model certainty of 84.2%. (e) thumbs-up gesture from
25 meters distance having model certainty of 87.6%; (f) pointing gesture from 17 meters distance having model certainty
of 88.9%; (g) pointing gesture from 23 meters distance having model certainty of 89.1%; and (h) stop gesture from 20
meters distance having model certainty of 81.0%.
Figure 15: Image quality improvement of the HQ-Net model evaluated with the MSE loss and PSNR over the test data, with regard to the amount of data used to train the model.
Table 5: Comparison of gesture recognition success rates with the various trained models on test data with only the left
hand
Models Success rate (%)
CNN 75.8
DenseNet 89.4
EfficientNet 86.2
GoogLeNet 80.7
WideResNet 88.4
VGG-16 79.3
GCN 81.1
ViT 88.9
GViT 97.0
Adding diverse images from all ranges is shown to improve accuracy and reach up to a 98.1% success rate with N labeled images.
4.7. Edge cases
The training data in H does not include edge-case scenarios which may limit performance. First, as described above, the training data was collected using only the right hand of the participants. Hence, we have collected an additional test set of 6,000 labeled images taken from four participants gesturing only with their left hand in various ultra-range environments. Using the models trained on dataset H and presented in Table 3, the new test set was evaluated. The success rates for these models, including GViT, are seen in Table 5. Here also, GViT dominates all other models with a high success rate of 97%. The results distinctly highlight the ability of the trained GViT to generalize to human gestures executed with the left hand.
Furthermore, we evaluate specific but interesting edge cases where accurate recognition may
be difficult. Designated test sets were collected for eight edge cases including: gloved hands;
Figure 16: Gesture recognition success rate of GViT over the test data with regard to the amount of data used to train the model.
Table 6: Gesture recognition success rate for several edge cases
Edge Cases Success rate (%)
Gloves 95.7
Out-of-frame 96.1
Occlusions 91.8
Sitting-down 95.9
Dual-arm lift 88.7
Poor lighting 89.5
Multiple participants 94.1
Interference 91.6
out-of-frame participant where only the arm is visible; the participant is partly occluded by an
object in the foreground; gesturing while the participant is seated; dual-arm lift while only one
arm is exhibiting a gesture; environment with poor lighting; multiple participants in the image
exhibiting the same gesture; and interference by another person in the foreground. Such edge
cases were not included in the training. A test set of 1,000 labeled images for each edge case
was collected in various indoor and outdoor environments. Table 6 presents the gesture recog-
nition success rates for these cases with GViT. The majority of the cases achieved high success
rates. Nevertheless, dual-arm lift and poor lighting yielded slightly lower results which could be
mitigated by additional data in such scenarios. Figure 17 exhibits various examples of gesture
recognition in different edge cases along with model certainty in different indoor and outdoor
environments. Overall, GViT is able to generalize and recognize gestures in unconventional and
challenging scenarios.
Figure 17: Examples of correct gesture recognition with GViT in several edge cases not included in the training data: (a)
stop gesture from 17 meters distance while wearing gloves, having model certainty of 94.3%; (b) pointing gesture from
13 meters distance while the participant is fully occluded, having model certainty of 95.8%; (c) stop gesture from 22
meters distance while the participant is partly occluded by a ladder, having model certainty of 90.1%; (d) thumbs-down
gesture from 14 meters distance from two participants, having model certainty of 93.9%. (e) pointing gesture from 23
meters distance while the participant is sitting down, having model certainty of 94.5%; (f) pointing gesture from 11
meters distance in a poor lighting environment, having model certainty of 88.2%; (g) beckoning gesture from 12 meters
distance while having another null participant interfering in the foreground, having model certainty of 91.1%; and (h)
null gesture from 19 meters distance while the participant is lifting both hands, having model certainty of 87.7%.
Figure 18: Robotic platform used in the experiment based on the Unitree Go1 quadruped robot equipped with a simple
RGB web-camera.
Table 7: Real-time gesture recognition success rate in different environments
Environment Distance range (m) Success rate (%)
Outdoor 15-28 98
Indoor 12-24 97
Courtyard 17-20 94
5. Robot Experiments
Following the training and evaluation of GViT, the framework is now demonstrated and
evaluated in ultra-range HRI. In the experiment, we wish to show real-time responses of a robot to
recognized gestures exhibited by a user positioned in ultra-range. Hence, we have set up a mobile
robotic platform based on the Unitree Go1 quadruped robot seen in Figure 18. A processing unit
with a simple RGB web-camera was mounted on the robot. The processing unit is based on an
Nvidia Jetson Orin Nano connected to the robot with the Unitree high-level Robot Operating
System (ROS) API. Videos of the experiments can be seen in the supplementary material. It
should be noted that communication latency has resulted in minor visible delays between the
gesture and response.
The processing unit runs the GViT model in real-time and, upon recognition of a gesture,
commands the robot to move in order to comply. For evaluation and demonstration, we set a
premeditated response to each of the six gestures in Figure 7: a pointing gesture commands the
robot to move sideways; thumbs-up will instruct the robot to pitch tilt; thumbs-down will make
the robot lie down; beckoning commands the robot to move toward the user; stop gesture will
make the robot halt any previous command; and upon null, the robot will not change its current
behavior (e.g., moving forward).
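A minimal sketch of this gesture-to-command mapping; the command strings and the send_command callback are hypothetical placeholders and do not correspond to the actual Unitree ROS API calls:

```python
# Hypothetical mapping from recognized gesture class to a robot directive; the actual
# experiments send motion commands through the Unitree high-level ROS API.
GESTURE_TO_COMMAND = {
    "pointing":    "move_sideways",
    "thumbs_up":   "pitch_tilt",
    "thumbs_down": "lie_down",
    "beckoning":   "move_toward_user",
    "stop":        "halt",
    "null":        "keep_current_behavior",
}

def dispatch(gesture: str, send_command) -> None:
    """Forward the directive associated with a recognized gesture to the robot."""
    send_command(GESTURE_TO_COMMAND.get(gesture, "keep_current_behavior"))
```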
We observe the above commands and responses in various distance ranges and in three envi-
ronments: outdoor, indoor and in a courtyard. Within each environment, 100 gesture trials were
exhibited. Table 7 reports the success rates of acquiring the correct response from the robot.
Figures 19-21 show snapshots of some gesture responses of the robot in the three environments.
The snapshots include the robot’s point-of-view (POV) and a closer view of the user for clear
verification. While the work focuses on a distance range of up to 25 meters, we have also included gestures at distances of up to 28 meters in the outdoor environment. The results show a high response rate to the
directives of the user. Hence, the GViT is validated in real-time operation.
6. Conclusions
In this work, a URGR framework has been introduced to facilitate a robot’s ability to interpret
human gestures in an ultra-range distance with an RGB camera and subsequently act accordingly.
A pivotal component of this framework is the innovative HQ-Net model, tailored to enhance the
quality of distant objects within images. This is a critical aspect for ensuring successful human gesture recognition. In addition to the low-resolution problem caused by zooming in on the user, distant objects may experience further image degradation. Hence, HQ-Net is intended to enhance
image details for further recognition. The second novel model, GViT, is a fusion of GCN and
ViT designed for the recognition of human hand gestures. Rigorous testing across a diverse
test dataset has validated the superiority of GViT over state-of-the-art methodologies. Notably,
GViT achieved a remarkable 98.1% success rate in recognizing human gestures at distances of
Figure 19: A user is directing a robot in ultra-range and in an outdoor environment: (top) Starting at 25 meters, the user
is exhibiting a beckoning gesture leading to the robot moving forward; (middle) the user is exhibiting a stop gesture
making the robot halt; and, (bottom) the user is pointing so to make the robot move sideways.
Figure 20: A user directing the robot from a distance of 21 meters in an indoor environment, through a door opening:
(top) the user exhibits a thumbs-up gesture, leading the robot to tilt; and (bottom) the user exhibits a beckoning gesture,
leading the robot to move forward.
Figure 21: A user directing the robot from a distance of 15 meters in a courtyard environment: the user exhibits a
thumbs-down gesture, leading the robot to lie down.
Furthermore, HQ-Net was shown to outperform alternative SR approaches. Our
findings have also revealed that our proposed URGR framework outperforms human perception.
The framework was deployed on a quadruped robot, and its response to six gestures at ultra-range
was evaluated. The robot demonstrated high response rates even at distances of up to 28 meters.
While our proposed framework is focused on gesture recognition, future work may adapt it
to other object recognition tasks in ultra-range. This may include surveillance, satellite imagery
and sports. Subsequent research could explore the recognition of human gestures in challenging
environmental conditions (such as bad weather or smoke) and over longer distances, with a
particular focus on distances of up to 40 meters. Additionally, investigations into the feasibility of
drones recognizing human gestures from a distance are a promising avenue for future exploration.
Furthermore, integration with verbal instructions may provide a complementary capability for
seamless and context-aware communication.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal rela-
tionships that could have appeared to influence the work reported in this paper.
Funding
This work was supported by the Israel Innovation Authority (grant No. 77857).
References
[1] R. M. Krauss, Y. Chen, P. Chawla, Nonverbal behavior and nonverbal communication: What do conversational
hand gestures tell us?, in: Advances in experimental social psychology, Vol. 28, Elsevier, 1996, pp. 389–450.
[2] P. Bernardis, M. Gentilucci, Speech and gesture share the same communication system, Neuropsychologia 44 (2)
(2006) 178–190.
[3] S. W. Cook, Enhancing learning with hand gestures: Potential mechanisms, in: Psychology of Learning and Moti-
vation, Vol. 69, Elsevier, 2018, pp. 107–133.
[4] S. Goldin-Meadow, The role of gesture in communication and thinking, Trends in cognitive sciences 3 (11) (1999)
419–429.
[5] E. Bamani, E. Nissinman, L. Koenigsberg, I. Meir, Y. Matalon, A. Sintov, Recognition and estimation of human
finger pointing with an RGB camera for robot directive, arXiv preprint arXiv:2307.02949 (2023).
[6] K. Nickel, R. Stiefelhagen, Visual recognition of pointing gestures for human–robot interaction, Image and vision
computing 25 (12) (2007) 1875–1884.
[7] J. P. Wachs, M. Kölsch, H. Stern, Y. Edan, Vision-based hand-gesture applications, Communications of the ACM
54 (2) (2011) 60–71.
[8] Z. Xia, Q. Lei, Y. Yang, H. Zhang, Y. He, W. Wang, M. Huang, Vision-based hand gesture recognition for human-
robot collaboration: a survey, in: International Conference on Control, Automation and Robotics (ICCAR), 2019,
pp. 198–205.
[9] W.-T. Weng, H.-P. Huang, Y.-L. Zhao, C.-Y. Lin, Development of a visual perception system on a dual-arm mobile
robot for human-robot interaction, Sensors 22 (23) (2022) 9545.
[10] Q. Gao, Y. Chen, Z. Ju, Y. Liang, Dynamic hand gesture recognition based on 3d hand pose estimation for hu-
man–robot interaction, IEEE Sensors Journal 22 (18) (2022) 17421–17430. doi:10.1109/JSEN.2021.3059685.
[11] D. Liu, L. Zhang, Y. Wu, Ld-congr: A large rgb-d video dataset for long-distance continuous gesture recogni-
tion, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3294–3302.
doi:10.1109/CVPR52688.2022.00330.
[12] M. Oudah, A. Al-Naji, J. Chahl, Hand gesture recognition based on computer vision: a review of techniques,
Journal of Imaging 6 (8) (2020) 73.
[13] S. An, X. Zhang, D. Wei, H. Zhu, J. Yang, K. A. Tsintotas, Fasthand: Fast monocular hand pose estimation on
embedded systems, Journal of Systems Architecture 122 (2022) 102361.
[14] J. L. Alba-Castro, E. González Agulla, F. Loira, Hand gestures to control infotainment equipment in cars, 2014.
doi:10.1109/IVS.2014.6856614.
[15] A. G. Buddhikot, P. Commerce, N. M. Kulkarni, A. Shaligram, Hand gesture interface based on skin detection
technique for automotive infotainment system, International Journal of Image, Graphics and Signal Processing 10
(2018) 10–24.
[16] M. Deller, A. Ebert, M. Bender, H. Hagen, Flexible gesture recognition for immersive virtual environments, in:
Tenth International Conference on Information Visualisation, 2006, pp. 563–568. doi:10.1109/IV.2006.55.
[17] L. Zulpukharkyzy Zholshiyeva, T. Kokenovna Zhukabayeva, S. Turaev, M. Aimambetovna Berdiyeva, D. Tokhta-
synovna Jambulova, Hand gesture recognition methods and applications: A literature survey, in: Interna-
tional Conference on Engineering & MIS, Association for Computing Machinery, New York, NY, USA, 2021.
doi:10.1145/3492547.3492578.
[18] O. Mazhar, S. Ramdani, B. Navarro, R. Passama, A. Cherubini, Towards real-time physical human-robot interaction
using skeleton information and hand gestures, in: IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), 2018, pp. 1–6. doi:10.1109/IROS.2018.8594385.
[19] J.-Y. Chang, A. Tejero-de Pablos, T. Harada, Improved optical flow for gesture-based human-robot in-
teraction, in: International Conference on Robotics and Automation (ICRA), 2019, pp. 7983–7989.
doi:10.1109/ICRA.2019.8793825.
[20] S. Iengo, S. Rossi, M. Staffa, A. Finzi, Continuous gesture recognition for flexible human-robot inter-
action, in: IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 4863–4868.
doi:10.1109/ICRA.2014.6907571.
[21] X. Nguyen, L. Brun, O. Lezoray, S. Bougleux, A neural network based on spd manifold learning for skeleton-based
hand gesture recognition, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019,
pp. 12028–12037.
[22] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, Y. Fu, Skeleton aware multi-modal sign language recognition, in:
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021.
[23] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, Y. A. Sheikh, Openpose: Realtime multi-person 2d pose estimation
using part affinity fields, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[24] L. Zhou, C. Du, Z. Sun, T. L. Lam, Y. Xu, Long-range hand gesture recognition via attention-based ssd network,
in: IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 1832–1838.
[25] H. Liang, L. Fei, S. Zhao, J. Wen, S. Teng, Y. Xu, Mask-guided multiscale feature aggregation network for hand
gesture recognition, Pattern Recognition 145 (2024) 109901. doi:https://doi.org/10.1016/j.patcog.2023.109901.
[26] Z. Wang, J. Chen, S. C. H. Hoi, Deep learning for image super-resolution: A survey, IEEE Transactions on Pattern
Analysis and Machine Intelligence 43 (10) (2021) 3365–3387. doi:10.1109/TPAMI.2020.2982166.
[27] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, C. C. Loy, Esrgan: Enhanced super-resolution generative
adversarial networks, in: L. Leal-Taixé, S. Roth (Eds.), Computer Vision – ECCV 2018 Workshops, Springer
International Publishing, 2019, pp. 63–79.
[28] X. Wang, L. Xie, C. Dong, Y. Shan, Real-ESRGAN: Training real-world blind super-resolution with pure synthetic
data, IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2021) 1905–1914.
[29] I. Ullah, M. Manzo, M. Shah, M. G. Madden, Graph convolutional networks: analysis, improvements and results,
Applied Intelligence 52 (2019) 9033 – 9044.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer,
G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv
preprint arXiv:2010.11929 (2020).
[31] L. Brethes, P. Menezes, F. Lerasle, J. Hayet, Face tracking and hand gesture recognition for human-robot in-
teraction, in: IEEE International Conference on Robotics and Automation, Vol. 2, 2004, pp. 1901–1906 Vol.2.
doi:10.1109/ROBOT.2004.1308101.
[32] X. Ma, J. Peng, Kinect sensor-based long-distance hand gesture recognition and fingertip detection with depth
information, Journal of Sensors (2018) 1–9.
[33] G. Zhu, L. Zhang, P. Shen, J. Song, Multimodal gesture recognition using 3-d convolution and convolutional lstm,
IEEE Access 5 (2017) 4517–4524.
[34] S. Nakamura, Y. Kawanishi, S. Nobuhara, K. Nishino, Deepoint: Pointing recognition and direction estimation
from a fixed view, arXiv preprint arXiv:2304.06977 (2023).
[35] D. Jirak, D. Biertimpel, M. Kerzel, S. Wermter, Solving visual object ambiguities when pointing: an unsupervised
learning approach, Neural Computing and Applications (2020) 1–23.
[36] D.-Y. Huang, W.-C. Hu, S.-H. Chang, Vision-based hand gesture recognition using pca+gabor filters and svm,
in: International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 2009, pp. 1–4.
doi:10.1109/IIH-MSP.2009.96.
[37] P. Ziaie, T. Müller, M. E. Foster, A. Knoll, A naïve Bayes classifier with distance weighting for hand-gesture
recognition, in: H. Sarbazi-Azad, B. Parhami, S.-G. Miremadi, S. Hessabi (Eds.), Advances in Computer Science
and Engineering, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 308–315.
[38] E. Tsironi, P. Barros, C. Weber, S. Wermter, An analysis of convolutional long short-term memory recurrent neu-
ral networks for gesture recognition, Neurocomputing 268 (2017) 76–86, advances in artificial neural networks,
machine learning and computational intelligence. doi:https://doi.org/10.1016/j.neucom.2016.12.088.
[39] D. Kim, J. Lee, H.-S. Yoon, J. Kim, J. Sohn, Vision-based arm gesture recognition for a long-range human–robot
interaction, The Journal of Supercomputing 65 (2013) 336–352.
[40] M. Al-Hammadi, G. Muhammad, W. Abdul, M. Alsulaiman, M. A. Bencherif, M. A. Mekhtiche,
Hand gesture recognition for sign language using 3DCNN, IEEE Access 8 (2020) 79491–79509.
doi:10.1109/ACCESS.2020.2990434.
[41] Y. Lai, C. Wang, Y. Li, S. S. Ge, D. Huang, 3d pointing gesture recognition for human-robot interaction, in: 2016
Chinese Control and Decision Conference (CCDC), IEEE, 2016, pp. 4959–4964.
[42] Y. Fu, L. Miao, Z. Li, Research on long-distance hand recognition based on depth information, in: Journal of
Physics: Conference Series, Vol. 1187, IOP Publishing, 2019, p. 042108.
[43] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee,
W.-T. Chang, W. Hua, M. Georg, M. Grundmann, Mediapipe: A framework for building perception pipelines
(2019). arXiv:1906.08172.
[44] S. Qiao, Y. Wang, J. Li, Real-time human gesture grading based on openpose, in: International Congress on Image
and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2017, pp. 1–6. doi:10.1109/CISP-
BMEI.2017.8301910.
[45] R. Tchantchane, H. Zhou, S. Zhang, G. Alici, A review of hand gesture recognition systems
based on noninvasive wearable sensors, Advanced Intelligent Systems 5 (10) (2023) 2300207.
doi:https://doi.org/10.1002/aisy.202300207.
[46] H. Wang, B. Ru, X. Miao, Q. Gao, M. Habib, L. Liu, S. Qiu, Mems devices-based hand gesture recognition via
wearable computing, Micromachines 14 (5) (2023). doi:10.3390/mi14050947.
[47] T. T. Alemayoh, M. Shintani, J. H. Lee, S. Okamoto, Deep-learning-based character recognition from handwriting
motion data captured using imu and force sensors, Sensors 22 (20) (2022). doi:10.3390/s22207840.
[48] A. Bongiovanni, A. De Luca, L. Gava, L. Grassi, M. Lagomarsino, M. Lapolla, A. Marino, P. Roncagliolo,
S. Macciò, A. Carfì, F. Mastrogiovanni, Gestural and touchscreen interaction for human-robot collaboration: A
comparative study, in: I. Petrovic, E. Menegatti, I. Marković (Eds.), Intelligent Autonomous Systems 17, Springer
Nature Switzerland, Cham, 2023, pp. 122–138.
[49] M. E. Benalcázar, C. Motoche, J. A. Zea, A. G. Jaramillo, C. E. Anchundia, P. Zambrano, M. Segura,
F. Benalcázar Palacios, M. Pérez, Real-time hand gesture recognition using the myo armband and mus-
cle activity detection, in: IEEE Second Ecuador Technical Chapters Meeting (ETCM), 2017, pp. 1–6.
doi:10.1109/ETCM.2017.8247458.
[50] A. Moin, A. Zhou, A. Rahimi, A. Menon, S. Benatti, G. Alexandrov, S. Tamakloe, J. Ting, N. Yamamoto, Y. Khan,
F. L. Burghardt, L. Benini, A. C. Arias, J. M. Rabaey, A wearable biosensing system with in-sensor adaptive
machine learning for hand gesture recognition, Nature Electronics 4 (2020) 54–63.
[51] K.-Y. Lian, C.-C. Chiu, Y.-J. Hong, W.-T. Sung, Wearable armband for real time hand gesture recogni-
tion, in: IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2017, pp. 2992–2995.
doi:10.1109/SMC.2017.8123083.
[52] E. Kim, J. Shin, Y. Kwon, B. Park, Emg-based dynamic hand gesture recognition using edge ai for human–robot
interaction, Electronics 12 (7) (2023) 1541. doi:10.3390/electronics12071541.
[53] M. Fora, B. Ben Atitallah, K. Lweesy, O. Kanoun, Hand gesture recognition based on force myography measure-
ments using knn classifier, in: International Multi-Conference on Systems, Signals & Devices (SSD), 2021, pp.
960–964. doi:10.1109/SSD52085.2021.9429514.
[54] M. N. Rylo, R. L. de Medeiros, V. F. de Lucena Jr, Gesture recognition of wrist motion based on wearables sensors,
Procedia Computer Science 210 (2022) 181–188, international Conference on Emerging Ubiquitous Systems and
Pervasive Networks (EUSPN). doi:https://doi.org/10.1016/j.procs.2022.10.135.
[55] N. Siddiqui, R. H. M. Chan, A wearable hand gesture recognition device based on acoustic measurements at wrist,
in: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2017, pp.
4443–4446. doi:10.1109/EMBC.2017.8037842.
[56] K. Nasrollahi, T. B. Moeslund, Super-resolution: a comprehensive survey, Machine Vision and Applications 25
(2014) 1423 – 1468.
[57] S. Anwar, S. Khan, N. Barnes, A deep journey into super-resolution: A survey, ACM Comput. Surv. 53 (3) (2020).
[58] X. Hu, M. A. Naiel, A. Wong, M. Lamm, P. Fieguth, Runet: A robust unet architecture for image super-resolution,
in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 505–507.
doi:10.1109/CVPRW.2019.00073.
[59] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medi-
cal Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[60] Z. Lu, Y. Chen, Single image super-resolution based on a modified u-net with mixed gradient loss, Signal, Image
and Video Processing 16 (2019) 1143 – 1151.
[61] K. Zhang, J. Liang, L. Van Gool, R. Timofte, Designing a practical degradation model for deep blind image super-
resolution, in: IEEE International Conference on Computer Vision, 2021, pp. 4791–4800.
[62] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[63] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all
you need, in: International Conference on Neural Information Processing Systems, 2017, p. 6000–6010.
[64] Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, in: Interna-
tional Conference on Machine Learning, Vol. 70, 2017, p. 933–941.
[65] K. Zeng, J. Yu, R. Wang, C. Li, D. Tao, Coupled deep autoencoder for single image super-resolution, IEEE Trans-
actions on Cybernetics 47 (1) (2017) 27–37. doi:10.1109/TCYB.2015.2501373.
[66] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, J. Liang, Unet++: Redesigning skip connections to exploit multiscale
features in image segmentation, IEEE transactions on medical imaging 39 (6) (2019) 1856–1867.
[67] H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, C. Zhu, Real-world single image super-resolution: A brief
review, Information Fusion 79 (2022) 124–145. doi:https://doi.org/10.1016/j.inffus.2021.09.005.
[68] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: IEEE
Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[69] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International confer-
ence on machine learning, 2019, pp. 6105–6114.
[70] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going
deeper with convolutions, in: IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[71] S. Zagoruyko, N. Komodakis, Wide residual networks, in: British Machine Vision Conference BMVC, 2016.
[72] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint
arXiv:1409.1556 (2014).