©2022 IAS Society. Personal use of this material is permitted. Permission
from IAS Society must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising or promo-
tional purposes, creating new collective works, for resale or redistribution to
servers or lists, or reuse of any copyrighted component of this work in other
works.
arXiv:2207.03783v1 [cs.RO] 8 Jul 2022
Gestural and Touchscreen Interaction for
Human-Robot Collaboration: a Comparative
Study
Antonino Bongiovanni*, Alessio De Luca, Luna Gava, Lucrezia Grassi,
Marta Lagomarsino, Marco Lapolla, Antonio Marino, Patrick Roncagliolo,
Simone Macciò, Alessandro Carfì, and Fulvio Mastrogiovanni
Department of Informatics, Bioengineering, Robotics, and Systems Engineering,
University of Genoa, Via Opera Pia 13, 16145 Genoa, Italy
alessandro.carfi@dibris.unige.it
Abstract. Close human-robot interaction (HRI), especially in indus-
trial scenarios, has been vastly investigated for the advantages of com-
bining human and robot skills. For an effective HRI, the validity of cur-
rently available human-machine communication media or tools should
be questioned, and new communication modalities should be explored.
This article proposes a modular architecture allowing human operators
to interact with robots through different modalities. In particular, we
implemented the architecture to handle gestural and touchscreen input,
respectively, using a smartwatch and a tablet. Finally, we performed a
comparative user experience study between these two modalities.
1 Introduction
Research on human-robot collaboration (HRC) leapt forward in industrial sce-
narios with the introduction of collaborative robots (cobots) [1] such as the Kuka
LBR iiwa, the Universal Robots UR5, or the ABB YuMi (to mention just a few).
Cobots are safe by design and make it possible for robots and humans to com-
bine their skills, improving overall productivity, efficiency and flexibility while,
possibly, reducing human stress and workload [2]. For a functional human-robot
collaboration, safety is not the only requirement, and cobots should provide more
intuitive interfaces moving away from classical ones such as teach pendants.
The most prominent human-robot interfaces rely on touchscreen technology [3,4],
which integrates human-to-robot communication (touch) and robot-to-human
communication (screen) into the same device. Touchscreen interfaces are classical
graphical user interfaces (GUIs) where the physical screen acts as a
gateway between the human and the robot. New technologies such as augmented
and virtual reality provide alternatives for robots to communicate with humans.
These interfaces need novel approaches for human interaction, and researchers
took inspiration from human-human communication exploring speech [5], ges-
tures [6–8] and their combination [9–11].
* The authors contributed equally.
Fig. 1. Two users interacting with a robot using, respectively, (a) a tablet (touchscreen interface) and (b) a smartwatch (gesture-based interface), highlighted by yellow circles.
Gestural- and speech-based interfaces are appealing since humans already
use these modalities to interact with other humans. However, their usage in an
intrinsically different scenario may lead to sub-optimal interaction results. For
these reasons, researchers performed experiments to compare new interaction
modalities with state-of-the-art ones, such as those based on touchscreens. Often,
the number of participants in these studies is small, and they rely on
Wizard of Oz (WoZ) experiments, i.e., experiments in which the subject believes
to be interacting with an autonomous system that is, in reality, operated by an experimenter [12].
Since previous studies highlighted gesture-based interfaces as one of the most promising
approaches, we aim to contribute to this research field by evaluating the user
experience and performance results of a gesture-based interface for HRC. We
do this by comparing gestural (Figure 1b) with touchscreen (Figure 1a) interac-
tion in a real-world HRC scenario. Touchscreen interaction has been chosen as
the term of comparison because it relies on mature and reliable technology.
For the experiment, we adopted a simple GUI, although it could be suboptimal in
the HRC scenario, to provide optimal conditions for the touchscreen interface,
since a GUI is part of its usual setup. Part of the contribution of this work is the
design and implementation of a software architecture capable of handling both
communication modalities.
The paper is structured as follows. Section 2 presents a brief review of the
literature on HRC interfaces. Section 3 describes the design principles adopted
for our software architecture. Section 4 describes, in detail, the architec-
ture implementation for the gestural and touchscreen interaction and the logic
managing the HRC. The setup and the experiment description are presented
in Section 5, while in Section 6 the experimental results are presented and dis-
cussed. Conclusions follow.
2 Background
While humans and robots collaborate, they share information at different levels
through explicit or implicit communication. Ideally, an interface for human-robot
collaboration should mediate the two-way communication handling both explicit
and implicit communication, but in this paper, we focus only on the explicit one.
In this context, the interface allows the human to send direct commands to the
robot, and the robot answers by providing appropriate feedback. Classical interfaces
use screens to give the user visual feedback, but novel technologies, such as
augmented reality (AR) and see-through displays [13–15], allow integrating the
interface directly in the working environment.
At the same time, researchers explored different modalities to acquire human
commands. Many studies focused on touchscreen interfaces for human-robot in-
teraction, given their widespread adoption in the consumer market. Mateo et al.
2014 [3] proposed a tablet-based user interface for industrial robot programming,
and a similar interface has been used to send high-level commands to robotic
platforms [4]. However, touchscreen interfaces are suboptimal in scenarios where
humans need free hands. Furthermore, the need for a screen to communicate
limits the adoption of technologies such as AR, which can only complement
the screen [9] and not fully substitute it. Therefore, developing new interfaces requires
alternative communication modalities such as speech and gestures. Despite the
extended research in speech analysis and recognition [16], the high noise level
in industrial environments can jeopardize the correct function of speech-based
interfaces. Gesture-based interfaces, instead, usually rely on upper-limb motions,
possibly altering the human workflow. However, with gesture-based interfaces,
users can send commands to the robot while handling tools, without holding
a tablet or reaching for a screen. Furthermore, interfaces can rely on gestures for
scenarios where other communication modalities fail, e.g., underwater HRC [17].
Because of their advantages, researchers have explored the usage of gesture-
based interfaces for industrial applications [6], search and rescue [11], and direct
control of mobile robots [7]. Neto et al. 2019 [6] proposed a gesture-based in-
terface using five inertial measurement units (IMUs), worn by the user, to send
commands to an industrial manipulator. At the same time, other studies pro-
posed interfaces using non-wearable devices, such as RGB [17] and RGB-D [7]
cameras, or alternative wearable devices such as Electromyography (EMG) sen-
sors [11,18]. Although wearable devices can impede human motions and rely on
batteries, their usage reduces the necessity for a structured environment, raises
fewer privacy concerns, and does not suffer from occlusions. IMUs are one of the most
used sensing solutions for wearables because of their compact data stream, reduced price and small size.

Fig. 2. A conceptual flow diagram of the proposed architecture.

Independently of the sensor type, a gesture-based
interface should process the collected data to identify motions with explicit com-
munication intent and ignore normal user operations. This problem is known
as gesture recognition, and for IMU data, it has been approached using Naive
Bayes Classifiers, Logistic Regression and Decision Trees [19], Convolutional
Neural Networks (CNNs) [20], Dynamic Time Warping (DTW) [21], Support
Vector Machines (SVMs) [22], and Long Short-Term Memory Neural Networks
(LSTM) [23].
The development of a new interface is not only a technological problem: the
human factor should be taken into consideration as well [24]. Therefore, an
experimental validation should evaluate both the system performance and the
user experience. For this reason, some studies proposed comparative analyses
between different interaction modalities to determine which is the most suited
given a target application. Taralle et al. 2015 [13] presented a comparative study
between a gesture-based and touchscreen interface to control a small unmanned
aerial vehicle (sUAV) while looking for a target in a video stream. This study
relied on WoZ experiments for the gesture controls, and the results, over 20 vol-
unteers, suggested a preference for the gestural interaction. At the same time,
lower cognitive load and user preference for gestural and touch inputs have been
found using WoZ experiments to compare touch, gesture, speech, and 3D pen
in an industrial scenario [25]. Although these studies provide promising results,
WoZ experiments overlook the effect of the system performance (e.g., accuracy,
responsiveness) on the user experience. When compared in a real-world envi-
ronment, reliable input modalities, such as touchscreens, outperform novel ones,
such as gestures [12]. These results should not discourage the study of new in-
teraction modalities but rather push toward better system performance to i)
obtain interfaces for a more flexible human-robot collaboration and ii) overcome
the need for a screen in classical interfaces.
Fig. 3. Many heterogeneous discrete signals are mapped to one of the four FSM com-
mand handlers. Each FSM state can install the handlers, or ignore them (like command
number 4 in this example). A secondary input scheme, i.e., events, uses valued signals
(integers) to directly select one of the options offered by the FSM state for user selec-
tion.
3 Design principles
As we have seen, an interface mediates the interaction between the human and
the robot, allowing the human to send commands and the robot to provide appropriate
feedback. Figure 2 presents a general architecture summarizing this concept.
The architecture consists of: a GUI displayed on a screen, an input layer, the
main logic, and a command layer that communicates directly with the robot.
Since our study focuses on human-to-robot communication, we kept the
robot-to-human communication simple, using only a screen. Since menus are widely adopted,
we designed a menu-based GUI listing all possible functionalities to minimize the
novelty effect. Furthermore, menus allow an easy adaptation to different exper-
imental scenarios.
The input layer collects the sensory information and processes it to rec-
ognize discrete human commands (i.e., arm gestures, keywords, or touchscreen
pressures). Each communication modality has its own dictionary $D$ containing
command descriptions and associated identifiers. Once the sensory information
matches the command description, the input layer returns as output the as-
sociated identifier. In some communication modalities, the input layer also gen-
erates additional information. For example, when the user presses a touchscreen,
the input layer returns the identifier (i.e., screen pressure) and the related 2D
position.
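To make the dictionary concept concrete, the following is a minimal sketch of what an entry of $D$ and the output of the input layer could look like; the names and fields are illustrative assumptions, not the actual implementation.

# Illustrative sketch of a command dictionary D for two modalities and of the
# output produced by the input layer (names and fields are assumptions).

GESTURE_DICTIONARY = {
    "wrist_up":   "G1",          # command description -> identifier
    "wrist_down": "G2",
}

TOUCH_DICTIONARY = {
    "screen_pressure": "PRESS",
}

# When the sensory data matches a command description, the input layer emits
# the associated identifier, possibly with extra information (e.g., the 2D
# position of a touchscreen pressure).
recognized_gesture = {"id": "G1"}
recognized_touch = {"id": "PRESS", "position": (0.42, 0.87)}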
The logic layer is represented using a finite state machine (FSM) whose states
describe different interaction stages. The FSM is composed of the set of states
$S = \{s_1, \ldots, s_i, \ldots, s_{|S|}\}$, and the transitions from $s_i \in S$ to other states are
described by the transition set $T_i$. In our scenario, we recognize two state cate-
gories, namely menu and action. Menu states describe the GUI (Figure 5 shows
the GUI representation of some menu states), and when they are active, the
interaction is limited to the menu navigation. Instead, action states implement
the system functionalities, e.g., human teaching of a motion or robot execution
of a task.
Transitions between FSM states are triggered either by the system (e.g., when
the robot ends a task execution) or by human commands. As we have seen, dif-
ferent communication modalities can carry different information. Therefore, to
make it possible to integrate various communication modalities, we designed two
different schemas that a human command can follow to interact with the FSM:
signals and events (see Figure 3). With signals, for each state $s_i$, the system
defines a one-to-one mapping between the dictionary commands and the transitions
defined in $T_i$. Command handlers, defined for each FSM state (see Figure
3), manage this mapping. With events, human commands are mapped directly
to FSM states: when the human performs the command associated
with $s_j$, independently of the current FSM state, $s_j$ is activated. This schema
allows long jumps in the interface without a sequential transition through all
the intermediate states. Usually, the number of states $|S|$ is higher than the
number of transitions of a single state $|T_i|$; therefore, using events requires bigger
dictionaries or communication modalities that, by carrying extra information, allow
associating more states with a single command. Signals are more appropriate for
communication modalities where a big dictionary implies high user effort to re-
call commands (e.g., keyboard strokes or gestures). Instead, events can leverage
communication modalities handling big dictionaries (e.g., vocal interaction) or
providing extra information (e.g., touchscreen interaction).
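As a concrete illustration of the two schemas, the following minimal Python sketch shows a finite state machine whose states install per-state command handlers for signals, while events jump directly to the associated state; the state and command names are hypothetical, and the actual open-source implementation is organized differently.

# Minimal sketch of the signal and event schemas (hypothetical names).

class State:
    def __init__(self, name, handlers=None):
        self.name = name
        # Installed command handlers: signal identifier -> target state name.
        # Signals not present here (inactive handlers) are ignored.
        self.handlers = handlers or {}

class InteractionFSM:
    def __init__(self, states, events, initial):
        self.states = {s.name: s for s in states}   # the set S
        self.events = events                        # event id -> state name
        self.current = initial

    def on_signal(self, signal_id):
        # Signal schema: one-to-one mapping between dictionary commands and
        # the transitions T_i defined by the current state.
        target = self.states[self.current].handlers.get(signal_id)
        if target is not None:
            self.current = target

    def on_event(self, event_id):
        # Event schema: the command activates the associated state directly,
        # independently of the current state (long jumps in the interface).
        target = self.events.get(event_id)
        if target is not None:
            self.current = target

# Usage example with two menu states.
fsm = InteractionFSM(
    states=[State("main_menu", {"G1": "playback_menu"}),
            State("playback_menu", {"G4": "main_menu"})],
    events={"touch_playback": "playback_menu", "touch_main": "main_menu"},
    initial="main_menu")
fsm.on_signal("G1")           # gesture: sequential navigation (signal)
fsm.on_event("touch_main")    # touchscreen: direct jump (event)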
4 Implementation
Following the design principles presented in the previous section, we have imple-
mented our architecture (available open-source1) to interact with the dual-arm
Baxter robot from Rethink Robotics [26] using gestures, sensed by a wrist-worn
IMU, and a touchscreen. The architecture uses the Robot Operating System (ROS)
framework [27] to manage inter-module communication. Besides benefits in code
reuse, this choice allows us to exploit the Baxter-related ROS APIs, which ex-
pose services to acquire sensory data, control robot actuators, and record robot
motions using Kinesthetic Teaching (KT) [28,29].
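As an illustration of how the Baxter ROS APIs can support KT-based recording, the sketch below samples the joint angles of one arm while the operator guides it and then replays them. The `baxter_interface.Limb` calls used here exist in the Baxter SDK, but the sampling rate, duration, and overall structure are our assumptions; the architecture's actual recording nodes may differ.

# Simplified sketch of trajectory recording and playback through the Baxter
# ROS APIs (assumed parameters; not the architecture's actual recorder node).
import rospy
import baxter_interface

rospy.init_node("kt_record_sketch")
baxter_interface.RobotEnable().enable()    # the robot must be enabled
limb = baxter_interface.Limb("left")
rate = rospy.Rate(50)                      # assumed 50 Hz sampling

# Recording: while the operator physically guides the arm (KT), sample the
# joint configuration at every cycle.
trajectory = []
t_end = rospy.Time.now() + rospy.Duration(10.0)   # assumed 10 s demonstration
while not rospy.is_shutdown() and rospy.Time.now() < t_end:
    trajectory.append(limb.joint_angles())        # dict: joint name -> angle
    rate.sleep()

# Playback: stream the recorded joint configurations back at the same rate.
for waypoint in trajectory:
    limb.set_joint_positions(waypoint)
    rate.sleep()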
4.1 Graphical User Interface
Our Graphical User Interface arranges the menu options vertically, and a red
selector is used to highlight the selected option (see Figure 5a). The FSM menu
states describing the GUI are published on a ROS topic and converted, by a
renderer, to a graphical representation. The GUI can be visualized on a screen
either locally or remotely through a browser. Since the architecture decouples
the GUI state representation (FSM) from its visual representation, it is possible to change the GUI appearance (e.g., textual user interfaces or GUIs with different designs) or to integrate other visualization devices (e.g., augmented reality headsets).

1 Web: https://github.com/TheEngineRoom-UniGe/gesture_based_interface
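The decoupling between the FSM state and its rendering can be pictured with the small sketch below, in which a menu state is published as a JSON string on a ROS topic for any renderer to consume; the topic name and message layout are assumptions made for illustration, not the actual message format.

# Sketch of publishing an FSM menu state for the GUI renderer (assumed topic
# name and JSON payload).
import json
import rospy
from std_msgs.msg import String

rospy.init_node("menu_state_publisher")
pub = rospy.Publisher("/gui/menu_state", String, queue_size=1, latch=True)

menu_state = {
    "state": "main_menu",
    "options": ["record", "playback", "sequential playback", "macro mode"],
    "selector": 0,   # index of the option highlighted by the red selector
}
rospy.sleep(1.0)                 # give subscribers time to connect
pub.publish(String(data=json.dumps(menu_state)))
rospy.spin()                     # keep the latched state available to renderers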
Fig. 4. The four gestures supported by our architecture: (G1) wrist up, (G2) wrist
down, (G3) spike clockwise, and (G4) spike counter-clockwise.
4.2 Input
The two interaction modalities considered in this study are touchscreen and ges-
tures. The two modalities are intrinsically different, and for the reasons previ-
ously presented, we use signals to handle gestures and events for the touchscreen.
While using gestures, users navigate the menus sequentially, performing multiple
gestures to reach and select an option. With the touchscreen, instead, the user
simply presses the corresponding virtual button in the GUI. In our exper-
iment, the touchscreen interface is used on an iPad Air through Safari, and we
developed a ROS node receiving the command events directly from the tablet.
For the gesture-based interface instead, we used SLOTH, a method taking
advantage of LSTM neural networks to recognize gestures online. This method
has been proposed for HRC scenarios and a ROS-compatible version, with a
pre-trained neural network, is available on GitHub2. We developed an Android
Wear 2 application to collect IMU data from an LG G Watch R smartwatch and
publish them to a ROS topic. Therefore, since the data acquisition pipeline
differs from the one used in the original work, the SLOTH performance should
be assessed again [30]. Every time a gesture is recognized, the SLOTH module
sends a signal to the main logic activating an FSM transition. In our FSM
implementation, the maximum number of possible transitions associated with
a state is four ($\forall i \in \{1, \ldots, |S|\}$, $|T_i| \le 4$); therefore, the architecture presents
only four command handlers (see Figure 3). Since SLOTH recognizes
six gestures ($|D| = 6$), we ignore two of them. The command handlers map
the others (see Figure 4) as follows: command handler #1 maps wrist up (G1),
command handler #2 maps wrist down (G2), command handler #3 maps spike
clockwise (G3), and command handler #4 maps spike counter-clockwise (G4).
Notice that the number of active command handlers in a state $s_i$ corresponds to the number of possible transitions $|T_i|$, and gestures mapped by inactive command handlers have no effect.

2 Web: https://github.com/ACarfi/SLOTH
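The routing from recognized gestures to command handlers can be summarized by the following sketch, where the gesture identifiers produced by the recognizer are forwarded as signals to the FSM; the topic names and message types are assumptions, since the actual nodes may use different interfaces.

# Sketch of the bridge between the gesture recognizer and the FSM command
# handlers (assumed topics and message types).
import rospy
from std_msgs.msg import Int32

# Recognized gesture -> command handler index (see Figure 3 and Figure 4).
GESTURE_TO_HANDLER = {1: 1,   # G1, wrist up
                      2: 2,   # G2, wrist down
                      3: 3,   # G3, spike clockwise
                      4: 4}   # G4, spike counter-clockwise

def on_gesture(msg):
    handler = GESTURE_TO_HANDLER.get(msg.data)
    if handler is not None:
        # Forward the signal; if the current state did not install this
        # handler, the FSM simply ignores it.
        signal_pub.publish(Int32(data=handler))

rospy.init_node("gesture_signal_bridge")
signal_pub = rospy.Publisher("/fsm/signal", Int32, queue_size=10)
rospy.Subscriber("/sloth/gesture", Int32, on_gesture)
rospy.spin()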
Fig. 5. The menus provided by the interface: (a) Main, (b) Record, (c) Playback, (d) Sequential playback, (e) Macro mode.
4.3 Logic
As previously described, we implemented the architecture logic using an FSM
that the user can control using signals or events. When using the gesture in-
terface, each gesture execution generates a signal triggering the corresponding
state transition. In this case, any change in the GUI (e.g., selector up,
selector down or option selection) corresponds to a state transition in the FSM.
Users can navigate menu states using three gestures: G1 to select options, G2
to move down the selector, and G3 to move up the selector (in menu states G4
has no effect).
Our architecture allows the user to control robot functionalities to record
and playback end-effector trajectories. In particular, the main menu offers four
options: record,playback,sequential playback, and macro mode (see Figure 5a).
Whenever one of these options is selected, the corresponding menu is opened.
The record menu (Figure 5b) displays a list of recorded robot tasks (i.e., end-
effector trajectories). The user can overwrite a task by selecting it or create
a new one using the corresponding option. By selecting a task from the list, the
recording process, handled by the record action state, starts. In this state, a
human operator can program a new task using KT (i.e., teaching a trajectory
by physically guiding the robot) and use G4 to terminate the recording going
back to the record menu.
The playback menu (Figure 5c) lists all the saved tasks, which can be deleted
by the human operator using the corresponding option. When a task is selected,
the FSM transits to the playback action state, and the robot reproduces the
associated trajectory. Here, two commands are active: G2 pauses or resumes
the task playback, and G4 terminates the playback and moves back to the
playback menu.
The sequential playback menu (Figure 5d) allows the user to combine saved
tasks in sequences to handle more complex behaviours. The user can remove
or substitute a task from the sequence by selecting it. Furthermore, the add
option leads to a selection sub-menu where the operator can append a task to
the sequence, and the run option makes the FSM transit to the playback action
state that reproduces the whole sequence.
Finally, the macro mode menu, see Figure 5e, allows the user to associate
one task to each of the three gestures G1, G2 and G3 (the task-gesture
correspondence is ordered from top to bottom). The user can customize the mapping by
selecting a specific slot and choosing a task from the list of available ones. When
run is selected, the FSM transits to the macro action state, and the robot starts
waiting for gestures. Every time the user performs one of the three mapped
gestures, the robot executes the associated task. In this state, performing G4
terminates the interaction and moves back to the macro mode menu.
The logic does not vary when using the touchscreen interface. Instead of per-
forming gestures, the user presses the touchscreen in a 2D position corresponding
to a menu option; the system then sends an event to the FSM, directly activating the
corresponding state.
5 Experiment
Now that we have presented our architecture for multimodal human-robot in-
teraction, we put forward our hypotheses, inspired by previous research. [Hypothesis
H1] The gesture-based interface will receive a positive user evaluation, and
[H1.1] the novelty effect will contribute significantly to it. However, [H2] the
touchscreen interface will grant an overall better user experience, [H2.1] since it
relies on a more stable technology and volunteers are accustomed to this kind of
interface.
5.1 Experimental Setup
The experimental setup, whose conceptual sketch is represented in Figure 6,
includes: (i) a dual-arm Baxter manipulator, appropriately configured, (ii) a
wooden cube with a 3.7 cm edge, to be manipulated by the robot, (iii) the tablet or
the smartwatch that human operators use, and (iv) a table in front of the robot, on
which two positions, namely A and B, are defined. Position A is where the cube
is located at the beginning of the experiment. The distance between A and B is
60 cm. While performing the experiments, the human volunteer stands in front
of the robot at a distance of approximately 1.5 m.
The architecture contains three pre-recorded tasks, namely Move: A → B,
action 1, reproducing a robot arm-waving to greet the user, and action 2,
performing a handover of the cube between Baxter's two arms. These have
been included to simulate the conditions of using an interface that has already
been set up.
Fig. 6. Experimental scenario: a human holds the tablet or wears the smartwatch, while
Baxter is behind a table where the pick-and-place operations on a wooden cube are performed.
5.2 Description of the Experiment
We performed a within-subject experiment divided into two phases, i.e., a trial
using the tablet (touchscreen-based) and one wearing the smartwatch (gesture-
based). The 25 participants, aged between 18 and 40 years, were divided
into two groups. Both groups completed the experiment once with
each device: the first group started with the smartwatch, and the second
with the tablet.
The experimental protocol consists of a series of transportation tasks of the
wooden block from position A to position B and vice versa. The success in
bringing the cube back from B to A and the overall quality of the robot motions
are not evaluation criteria considered in this experiment. Instead, we keep track
of the time each operator needs to complete the experiment. Before the
experiment, all participants familiarized themselves with the gestural control
and the KT for approximately 1.5–2 minutes. At the end, each participant
filled in the User Experience Questionnaire (UEQ) [31] and a custom-made
questionnaire asking about their familiarity with the devices and whether they
consider it useful to continue developing this project.
The experiment consists of four ordered tasks, and a soft limit of 5 minutes is
imposed to complete them. The temporal constraint limits the experiment du-
ration to prevent fatigue and distractions that may affect the final result.
When a volunteer completes a
task, an experimenter evaluates if the remaining time is sufficient for the next
one. If the time is not enough, the experimenter stops the volunteer, and the
experiment ends. The experimenter does not interrupt the volunteer task execu-
tion; therefore, some experiments can exceed the soft limit. The description of
the ordered tasks follows:
Output \ Target   G1           G2           G3           G4           N.G.        Precision
G1                33 (14.6%)   0 (0.0%)     0 (0.0%)     0 (0.0%)     0 (0.0%)    100%
G2                0 (0.0%)     41 (18.2%)   0 (0.0%)     0 (0.0%)     0 (0.0%)    100%
G3                0 (0.0%)     0 (0.0%)     51 (22.6%)   0 (0.0%)     2 (0.9%)    96.2%
G4                0 (0.0%)     0 (0.0%)     0 (0.0%)     38 (16.9%)   3 (1.3%)    92.6%
N.G.              22 (9.8%)    14 (6.2%)    4 (1.8%)     17 (7.5%)    0 (0.0%)    94.4%
Recall            60.0%        74.5%        92.7%        69.0%        0.0%        72.3% (accuracy)

Fig. 7. Confusion matrix for the online tests (rows: output class, columns: target class; each cell reports the number of samples and its percentage of the total). The bottom row reports the recall measures, the rightmost column reports the precision measures, and the bottom-right cell reports the overall accuracy.
1. Starting from the main menu, the participant must select the playback option and reproduce the pre-recorded action Move: A → B, which commands the robot to move the cube from position A to position B.
2. Each participant must select the record option from the main menu to record a new task, i.e., Move: B → A, to bring the cube back from B to A. Using KT, enabled by grasping the cuff on Baxter's wrist, the volunteer physically guides the robot through the task. Once the recording is finished, the wooden block is back at position A.
3. The third task consists in selecting the macro option from the main menu and associating G1 and G2 (see Figure 4), or two buttons while using the touchscreen, respectively to the pre-recorded Move: A → B and the new Move: B → A. After selecting play (see Figure 5e), the participant has to activate the two actions, using the corresponding inputs, to move the wooden block from A to B and vice versa.
4. The last task consists in selecting the sequence option from the main menu to create a sequence of actions and then selecting play to start the reproduction. The sequence should include the pre-recorded tasks action 1 and action 2.
Notice that to preserve a fair comparison, the experiment design does not
include factors that could disadvantage the touchscreen interface, e.g., the need
to wear work gloves or hold a tool.
6 Results
This section presents the SLOTH performance in our architecture and the
results of the comparative study. We divide the study results between the eval-
uation of questionnaire answers and the analysis of the time needed to complete
the experiment.
6.1 Assessment of SLOTH Performance
To assess the performance of SLOTH and decide whether to retrain the model, we asked
eleven volunteers to perform five repetitions of each of the four gestures considered
in our experiment (see Figure 4). We conducted the assessment using our Android
infrastructure streaming IMU data at 10 Hz, as in the original work, directly to a
workstation running SLOTH. The results are presented in the confusion matrix,
Figure 7. In general, the overall classification performance seems adequate for
the study. In particular, the high precision indicates a reliable system
that does not misinterpret gestures, preserving the correct association between
gestures and commands: e.g., when the user performs G1, the system does not
confuse it with G2, G3 or G4. However, the low recall levels suggest that user
gestures may go undetected: e.g., when the user performs G1, it is detected only
60% of the time. We expect this to negatively affect the user experience since
the user will have to perform multiple gesture repetitions for a single command.
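For reference, the precision and recall values reported in Figure 7 follow the standard definitions computed from the matrix counts; for G1, for instance,

$$\mathrm{precision}(\mathrm{G1}) = \frac{33}{33+0} = 100\%, \qquad \mathrm{recall}(\mathrm{G1}) = \frac{33}{33+22} = 60\%,$$

where 22 is the number of G1 executions that were not detected (classified as no gesture).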
6.2 Analysis of Questionnaires
The UEQ [31], the questionnaire used during the experiments, uses 26 ques-
tions to assess different properties related to user experience. The questions are
grouped into six user experience scales: i) attractiveness, i.e., the overall impres-
sion about the interface; ii) perspicuity, i.e., the easiness of learning how to use
the interface; iii) efficiency, i.e., the utility of the interface in completing the
task; iv) dependability, i.e., the perceived security and control over the interface;
v) stimulation, i.e., the engagement generated by the interface; and vi) novelty,
i.e., the perceived novelty and aroused interest. Five of these scales can then be
grouped into pragmatic qualities, describing the task-related aspects of perspicuity,
efficiency and dependability, and hedonic qualities, including non-task-related as-
pects, such as stimulation and novelty. Since all the volunteers were from Italy,
we used the Italian version of the questionnaire. We analysed the questionnaire
answers using the Microsoft Excel sheet provided together with the question-
naire. The tool takes the volunteers' answers and calculates the scale values.
As a first step, we checked the value of the Cronbach's alpha coefficient, which
provides a measure of internal consistency, i.e., how closely related a set of items
are as a group. The rule of thumb is that the alpha coefficient should be greater
than or equal to 0.7 to ensure scale reliability. Questionnaires related to the
touchscreen interface have a low alpha value only for dependability (0.35). Instead, for
the gesture-based interface, we found low Cronbach’s alpha for efficiency (0.55)
Fig. 8. Comparison between gesture-based (in blue) and touchscreen-based (in green)
experiments, related to usability and user experience aspects.
and dependability (0.62). These findings can suggest a common misunderstand-
ing of dependability items, but most probably are also influenced by the limited
number of participants.
Figure 8 presents the comparative results between the touchscreen interface
(tablet, in green) and the gesture-based one (smartwatch, in blue). The bar graph
represents the mean for each considered user experience scale. According to the
UEQ guidelines, mean values between -0.8 and 0.8 represent a neutral evaluation
of the corresponding scale, values greater than 0.8 represent a positive evaluation,
and values less than -0.8 represent a negative evaluation. Instead, error bars
represent the 95% confidence intervals of the scale means, i.e., the interval in which 95% of
the scale means would fall if the experiment were repeated an infinite number of
times. When comparing the values, if the two confidence intervals do not overlap,
the difference is significant. Observing Figure 8, a relevant difference is evident
for pragmatic qualities (i.e., perspicuity, efficiency and dependability) where the
tablet achieves higher evaluations. Instead, for hedonic qualities (i.e., stimulation
and novelty), volunteers evaluated the gesture-based interface more positively.
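For reference, the error bars in Figure 8 correspond to the usual normal-approximation confidence interval for a scale mean, sketched below under the assumption that the UEQ analysis tool uses the standard formula,

$$\bar{x} \pm 1.96\,\frac{s}{\sqrt{N}},$$

where $\bar{x}$ and $s$ are the mean and standard deviation of the scale scores over the $N$ participants.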
Finally, from the custom-made questionnaire, we found out that 64% of
the participants were familiar with tablets, while only 12% with smartwatches.
Moreover, while tablet usage assumes a touchscreen interaction, smartwatches
are not typically used for gestural interaction. This difference, together with the
non-optimal performance of the gesture recognition system, probably explains
the disadvantage of gestural interaction in the pragmatic scales and its advantage
in the hedonic ones.
6.3 Time Analysis
Out of the 25 participants, only 5 completed the four tasks in the given time
using the smartwatch, while 23 completed all the tasks using the tablet. This re-
sult is consistent with the participants' perception of the overall interface efficiency,
as measured by the questionnaire. Figure 9 reports, with box plots, the statis-
tics of the time spent to perform each of the four tasks with the tablet (in green)
Fig. 9. Box plots of the time that each participant needed to complete the four exper-
imental tasks. Data for the touchscreen-based interaction are in green, data for
the gesture-based one in blue; outliers are reported in red.
and the smartwatch (in blue). For both interfaces, we highlighted outliers in red.
We point out that the box plots represent the data of only 23 volunteers, since
the per-task breakdown for the first two volunteers was not recorded.
Furthermore, data for the sequence task using the gesture-based inter-
face were available only for five participants since the others did not complete
this task.
The measured times do not capture only the user interaction but also in-
clude robot motions. This factor partially explains the higher variance for both
touchscreen and smartwatch in the record and macro tasks. In fact, in these
tasks, the time taken by the robot motion is different for each participant, since
the Move: B → A task is recorded individually by each of them. Instead, in the
playback and the sequence tasks, the used robot motions, pre-recorded by the
experimenters, are the same for everyone, and therefore the variance is compar-
atively lower.
The playback task requires only a few user inputs (i.e., selecting the playback
option and then the motion to execute), and therefore the temporal difference
between the two input modalities is limited. Instead, when the task becomes
more complex (i.e., more user inputs are required to reach the end), the differ-
ence between the two input modalities increases due to the intrinsic difference
between signals and events. This observation finds evidence in the timing results
for the record and macro tasks but not for the equally complex sequence task.
This result can be explained by observing that only the five top-performing subjects
reached this task using the smartwatch, which suggests that the influence of
volunteer skills is relevant. Therefore, more extensive user training could smooth
the differences between gesture- and touchscreen-based interfaces.
7 Conclusions
In this work, we introduced a few design principles for architectures supporting
multiple human-robot interaction modalities. We presented the implementation
of these design principles into an architecture mediating the human-robot in-
teraction using both a smartwatch and a tablet. We evaluated the differences
between the two communication modalities with a comparative experiment in a
human-robot collaboration scenario.
The results of the experiments show a positive evaluation for both interaction
modalities [H1 and H2]. The smartwatch and tablet interfaces received better
evaluations, respectively, on the hedonic [H1.1] and pragmatic [H2.1] scales.
These results can be justified by the non-optimal performance of the gesture
recognition system and the volunteers’ familiarity with touchscreen technologies.
We also observed the user proficiency effect in the task completion time, where
the most skilled volunteers obtained similar results with the two input modalities.
These are not conclusive results because of the experimental limitations: the
small number of subjects, the simple collaborative scenario, and the lack of extensive sta-
tistical analysis. However, they encourage more in-depth analyses and studies of
alternative interfaces in the HRC context. Future studies in this field should aim
to improve the technical performance of gesture-based technologies, consider
more complex experimental setups and investigate new interaction modalities.
Acknowledgements
The authors would like to thank the teachers and students of the vocational
education and training schools Centro Oratorio Votivo, Casa di Carità, Arti
e Mestieri di Ovada for their contribution to the experiments. This work has
been partially supported by the European Union Erasmus+ Programme via
the European Master on Advanced Robotics Plus (EMARO+) program (grant
agreement n. 2014-2616).
References
1. Colgate, J.E., Wannasuphoprasit, W., Peshkin, M.A.: Cobots: Robots for collaboration
with human operators. In: Proceedings of the ASME International Mechanical
Engineering Congress and Exposition (IMECE), Atlanta, Georgia, USA (November
1996)
2. Tsarouchi, P., Makris, S., Chryssolouris, G.: Human-robot interaction review and
challenges on task planning and programming. International Journal of Computer
Integrated Manufacturing 29(8) (2016) 916–931
3. Mateo, C., Brunete, A., Gambao, E., Hernando, M.: Hammer: An Android based
application for end-user industrial robot programming. In: Proceedings of the 10th
IEEE/ASME International Conference on Mechatronic and Embedded Systems
and Applications (MESA), Senigallia, Italy (September 2014)
4. Birkenkampf, P., Leidner, D., Borst, C.: A knowledge-driven shared autonomy
human-robot interface for tablet computers. In: Proceedings of the 14th IEEE-
RAS International Conference on Humanoid Robots (Humanoids), Madrid, Spain
(November 2014)
5. Poirier, S., Routhier, F., Campeau-Lecours, A.: Voice control interface prototype
for assistive robots for people living with upper limb disabilities. In: Proceedings
of the 16th IEEE International Conference on Rehabilitation Robotics (ICORR),
Toronto, Canada (June 2019)
6. Neto, P., Simão, M., Mendes, N., Safeea, M.: Gesture-based human-robot interac-
tion for human assistance in manufacturing. The International Journal of Advanced
Manufacturing Technology 101(1) (2019) 119–135
7. Cicirelli, G., Attolico, C., Guaragnella, C., D’Orazio, T.: A kinect-based gesture
recognition approach for a natural human robot interface. International Journal
of Advanced Robotic Systems 12(3) (2015) 22
8. Carfì, A., Mastrogiovanni, F.: Gesture-based human-machine interaction: Taxon-
omy, problem definition, and analysis. IEEE Transactions on Cybernetics (2022)
9. Papanastasiou, S., Kousi, N., Karagiannis, P., Gkournelos, C., Papavasileiou, A.,
Dimoulas, K., Baris, K., Koukas, S., Michalos, G., Makris, S.: Towards seamless
human robot collaboration: integrating multimodal interaction. The International
Journal of Advanced Manufacturing Technology 105(9) (2019) 3881–3897
10. Wu, L., Alqasemi, R., Dubey, R.: Development of smartphone-based human-robot
interfaces for individuals with disabilities. IEEE Robotics and Automation Letters
5(4) (2020) 5835–5841
11. Gromov, B., Gambardella, L.M., Di Caro, G.A.: Wearable multi-modal interface
for human multi-robot interaction. In: Proceedings of the 14th IEEE International
Symposium on Safety, Security, and Rescue Robotics (SSRR), Lausanne, Switzer-
land (October 2016)
12. Pohlt, C., Haubner, F., Lang, J., Rochholz, S., Schlegl, T., Wachsmuth, S.: Effects
on user experience during human-robot collaboration in industrial scenarios. In:
Proceedings of the IEEE International Conference on Systems, Man, and Cyber-
netics (SMC), Miyazaki, Japan (February 2018)
13. Taralle, F., Paljic, A., Manitsaris, S., Grenier, J., Guettier, C.: Is symbolic gestural
interaction better for the visual attention? Procedia Manufacturing 3 (2015) 1060–
1065
14. Ziegler, J., Heinze, S., Urbas, L.: The potential of smartwatches to support mobile
industrial maintenance tasks. In: Proceedings of the 20th IEEE Conference on
Emerging Technologies & Factory Automation (ETFA), Luxembourg (September
2015)
15. Aromaa, S., Aaltonen, I., Kaasinen, E., Elo, J., Parkkinen, I.: Use of wearable and
augmented reality technologies in industrial maintenance work. In: Proceedings of
the 20th International Academic Mindtrek Conference, Tampere, Finland (October
2016)
16. Novoa, J., Wuth, J., Escudero, J.P., Fredes, J., Mahu, R., Yoma, N.B.: DNN-
HMM based automatic speech recognition for HRI scenarios. In: Proceedings of
the ACM/IEEE International Conference on Human-Robot Interaction, Chicago,
Illinois, USA (February 2018)
17. Islam, M.J., Ho, M., Sattar, J.: Understanding human motion and gestures for
underwater human-robot collaboration. Journal of Field Robotics 36(5) (2019)
851–873
18. Ahsan, M.R., Ibrahimy, M.I., Khalifa, O.O.: Electromyography (EMG) signal based
hand gesture recognition using artificial neural network (ANN). In: Proceedings
of the 4th IEEE International Conference on Mechatronics (ICOM), Kuala Lumpur,
Malaysia (May 2011)
19. Xu, C., Pathak, P.H., Mohapatra, P.: Finger-writing with smartwatch: A case for
finger and hand gesture recognition using smartwatch. In: Proceedings of the 16th
International Workshop on Mobile Computing Systems and Applications (HotMo-
bile), Santa Fe, New Mexico, USA (February 2015)
20. Kwon, M.C., Park, G., Choi, S.: Smartwatch user interface implementation using
CNN-based gesture pattern recognition. Sensors 18(9) (2018) 2997
21. Hsu, Y.L., Chu, C.L., Tsai, Y.J., Wang, J.S.: An inertial pen with dynamic time
warping recognizer for handwriting and gesture recognition. IEEE Sensors Journal
15(1) (2015) 154–163
22. Wen, H., Ramos Rojas, J., Dey, A.K.: Serendipity: Finger gesture recognition
using an off-the-shelf smartwatch. In: Proceedings of the ACM Annual Conference
on Human Factors in Computing Systems (CHI), San Jose, California, USA (May
2016)
23. Carfì, A., Motolese, C., Bruno, B., Mastrogiovanni, F.: Online human gesture
recognition using recurrent neural networks and wearable sensors. In: Proceed-
ings of the 27th IEEE International Symposium on Robot and Human Interactive
Communication (RO-MAN), Nanjing, China (August 2018)
24. Wachs, J.P., Kölsch, M., Stern, H., Edan, Y.: Vision-based hand-gesture applica-
tions. Communications of the ACM 54(2) (2011) 60–71
25. Profanter, S., Perzylo, A., Somani, N., Rickert, M., Knoll, A.: Analysis and se-
mantic modeling of modality preferences in industrial human-robot interaction. In:
Proceedings of the 28th IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), Hamburg, Germany (September 2015)
26. Guizzo, E., Ackerman, E.: How Rethink Robotics built its new Baxter robot
worker. IEEE Spectrum 7 (2012)
27. Quigley, M., Conley, K., Gerkey, B., Faust, J., Foote, T., Leibs, J., Wheeler, R.,
Ng, A.Y.: ROS: an open-source robot operating system. In: Workshop on Open
Source Software (ICRA), Kobe, Japan (May 2009)
28. Kormushev, P., Calinon, S., Caldwell, D.G.: Imitation learning of positional and
force skills demonstrated via kinesthetic teaching and haptic input. Advanced
Robotics 25(5) (2011) 581–603
29. Carfì, A., Villalobos, J., Coronado, E., Bruno, B., Mastrogiovanni, F.: Can human-
inspired learning behaviour facilitate human–robot interaction? International Jour-
nal of Social Robotics (2019) 1–14
30. Stisen, A., Blunck, H., Bhattacharya, S., Prentow, T.S., Kjærgaard, M.B., Dey,
A., Sonne, T., Jensen, M.M.: Smart devices are different: Assessing and mitigating
mobile sensing heterogeneities for activity recognition. In: Proceedings of the 13th
ACM Conference on Embedded Networked Sensor Systems (SenSys), Seoul, South
Korea (November 2015)
31. Hinderks, A., Schrepp, M., Mayo, F.J.D., Escalona, M.J., Thomaschewski, J.: De-
veloping a UX KPI based on the user experience questionnaire. Computer Stan-
dards & Interfaces 65 (2019) 38–44