Article

Interaction With Gaze, Gesture, and Speech in a Flexibly Configurable Augmented Reality System

Abstract

Multimodal interaction has become a recent research focus since it offers a better user experience in augmented reality (AR) systems. However, most existing works only combine two modalities at a time, e.g., gesture and speech. Multimodal interactive systems that integrate gaze cues have rarely been investigated. In this article, we propose a multimodal interactive system that integrates gaze, gesture, and speech in a flexibly configurable AR system. Our lightweight head-mounted device supports accurate gaze tracking, hand gesture recognition, and speech recognition simultaneously. The system can be easily configured into various modality combinations, which enables us to investigate the effects of different interaction techniques. We evaluate the efficiency of these modalities using two tasks: a lamp brightness adjustment task and a cube manipulation task. We also collect subjective feedback on using such systems. The experimental results demonstrate that the Gaze+Gesture+Speech modality is superior in terms of efficiency, while the Gesture+Speech modality is preferred by users. Our system opens the pathway toward a multimodal interactive AR system that enables flexible configuration.
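To make the "flexibly configurable" idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation; all class and handler names are invented for illustration) of a dispatcher that gates incoming interaction events by the currently enabled modality combination:

```python
# Hypothetical sketch of a flexibly configurable modality dispatcher.
from dataclasses import dataclass, field


@dataclass
class InteractionEvent:
    modality: str        # "gaze", "gesture", or "speech"
    payload: dict
    timestamp: float


@dataclass
class ModalityConfig:
    enabled: set = field(default_factory=lambda: {"gaze", "gesture", "speech"})


class MultimodalDispatcher:
    def __init__(self, config: ModalityConfig):
        self.config = config
        self.handlers = {}

    def register(self, modality: str, handler):
        self.handlers[modality] = handler

    def dispatch(self, event: InteractionEvent):
        # Ignore events from modalities disabled in the current configuration.
        if event.modality not in self.config.enabled:
            return None
        return self.handlers[event.modality](event)


# Example: a Gesture+Speech configuration simply drops gaze events.
dispatcher = MultimodalDispatcher(ModalityConfig(enabled={"gesture", "speech"}))
dispatcher.register("speech", lambda e: f"speech command: {e.payload.get('text')}")
dispatcher.register("gesture", lambda e: f"gesture: {e.payload.get('name')}")
print(dispatcher.dispatch(InteractionEvent("speech", {"text": "Select"}, 0.0)))
```

Switching to a Gaze+Gesture+Speech configuration would then only require changing the enabled set, which mirrors how the paper compares modality combinations.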


... First, Wang et al. [62] propose a multimodal interaction system integrating gaze, gesture, and speech for tasks such as light brightness adjustment and cube manipulation. Users fixate on the target and confirm with a verbal "Select" command. ...
... Furthermore, G.B. Lee et al. [63] use mobile smartphones as a display platform for AR effects. PCs (personal computers) have more computing power and provide a relatively standard platform, so many researchers use a PC as the AR platform [58,62,67,68,71]. Using a PC makes it easier to perform system development, debugging, testing, and AR display on one device. ...
... Most studies have not identified specific application areas [51,58,59,62,64,65,71]; the methods they propose are not explicitly limited to a specific field. ...
Article
Full-text available
In the ever-evolving landscape of Augmented Reality (AR), gesture and speech interaction technologies have emerged as pivotal components, reshaping experiences across diverse domains, from art to healthcare and education. Existing reviews may talk extensively about various types of interactions in augmented reality, but this paper fills a gap in this targeted area by discussing research that adopted both gesture and speech interactions. This paper employs the PRISMA methodology to curate and analyze a selection of cutting-edge research articles, offering a systematic and comprehensive review of 16 AR-based gesture, speech, and multimodal interaction technologies published between 2019 and 2023. Among them, “gesture + speech” accounted for 75%, while “gesture + speech + gaze” and “gesture + speech + head movement” models accounted for 12.5%. Highlighting the primary findings and contributions of this review, this article uncovers the prevailing trends in interaction technology implementation within AR environments. The review explores not only the methodologies but also the practical applications across a spectrum of AR scenarios. This comprehensive overview serves to contextualize the significance of these interaction technologies in enhancing user experiences and opens up new avenues for future research and development. Furthermore, this article underscores the real-world implications of these findings, shedding light on the potential for broader integration of gesture and speech in AR applications. As we look ahead, this paper provides insights into potential areas for further exploration in this dynamic field. By delving into the past, illuminating the present, and paving the way for the future, this review underscores the transformative power of gesture and speech interaction technologies in the realm of Augmented Reality.
... The accelerating development of Augmented Reality (AR) technologies has created space for investigating even more immersive and intuitive user interfaces, especially through the adoption of multimodal interaction systems. Multimodal interaction systems merge gesture, voice, and eye-tracking modalities to improve the user experience by allowing more natural, seamless, and context-sensitive interactions [6][7][8]. Multimodal interfaces are being widely applied across a range of fields, such as gaming, education, healthcare, and automotive systems, because of their ability to support such interactions. Results from experiments indicated enhanced accuracy and naturalness during interaction. ...
... The study promotes integrated HCI models. [6] Chen et al. (2024): Developed an augmented reality (AR) system to improve multi-modal perception and interaction in complex decision-making environments. The system combines state-of-the-art visualization with user-friendly interaction to enhance data understanding and decision-making support. ...
... These cases identify how gaze tracking, gesture, speech, and AR technologies are used across industries to promote interactivity and user experience. For instance, gaze in conjunction with gestures and speech in AR systems enables smooth, hands-free operation in automotive and flexible workspace settings [6] [18]. Multi-modal AR systems augment decision-making through visual overlays and gaze commands [7], whereas gaze interaction in VR museums facilitates intuitive navigation and storytelling [10]. ...
Article
Full-text available
This paper examines the incorporation of multimodal interfaces into augmented reality (AR) systems, which takes advantage of the synergistic integration of gesture recognition, voice input, and eye-tracking technologies to support user interaction. Driven by the emergence of immersive technologies, AR has penetrated many fields including gaming, healthcare, education, and industrial training. Multimodal input enables more intuitive, responsive, and accessible experiences by coordinating digital interactions with natural human behavior. Eye-tracking enhances attention-aware interfaces, gestures enable spatial commands, and speech input allows hands-free control, all enhancing real-time decision-making and interaction. The article surveys existing research and system deployments, presenting state-of-the-art applications and interaction metaphors. Challenges are also highlighted, including sensor calibration, latency, environmental robustness, and user fatigue. In addition, privacy issues and data security risks for these technologies are discussed. Based on a critical examination of current systems and experimental results, the paper suggests design principles for constructing resilient, user-focused multimodal AR environments. Future directions are highlighted in terms of AI-driven adaptive interfaces, emotion detection, and neurocognitive feedback loops. The results highlight the revolutionary power of AR fueled by synchronized multimodal interaction systems, enabling digital environments to be more fluid, efficient, and human-oriented.
... Object manipulation, e.g., selection and translation, is one of the most common tasks in AR [59,52]. Previous studies have explored hand-eye coordination in object selection with heavy occlusions [4]. ...
... In Selection Phase, the user gazes at the target object and fixates on it for 1.5 seconds to select it. The dwell time was established based on two key factors: the one-second duration suggested by Wang et al. [52] and the accuracy reduction resulting from the Ins-Eye condition. If the gaze intersects with multiple objects, the nearest object to the user will be selected. ...
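As an illustration of the dwell-based selection described in this excerpt (the 1.5-second dwell and nearest-object rule come from the text above; the class and update loop are a hypothetical sketch, not the cited implementation):

```python
# Hypothetical dwell-time gaze selection: an object is selected once the gaze
# ray has stayed on it for DWELL_TIME seconds; with multiple hits, the object
# nearest to the user wins.
DWELL_TIME = 1.5  # seconds, as in the excerpt above


class DwellSelector:
    def __init__(self, dwell_time=DWELL_TIME):
        self.dwell_time = dwell_time
        self.current = None
        self.start = None

    def update(self, hit_objects, now):
        """hit_objects: list of (object_id, distance_to_user) the gaze ray intersects."""
        if not hit_objects:
            self.current, self.start = None, None
            return None
        target = min(hit_objects, key=lambda h: h[1])[0]   # nearest object
        if target != self.current:
            self.current, self.start = target, now         # gaze moved: restart timer
            return None
        if now - self.start >= self.dwell_time:
            return target                                   # dwell completed: select
        return None


selector = DwellSelector()
assert selector.update([("cube", 2.0)], now=0.0) is None
assert selector.update([("cube", 2.0)], now=1.6) == "cube"
```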
... Gaze-Speech Interaction (GS). Gaze-speech interaction has been explored in object selection and translation scenarios [52,34], including scenarios with an occluding object [9]. We define gaze-speech interaction as private, which may arouse different opinions because users need to speak in public. ...
... However, the study focused on Wizard of Oz elicitation rather than passing multimodal inputs to any computing-based fusing system. Wang et al. [40] presented a multimodal interaction system in which multiple threads of interaction commands are integrated and transferred to an augmented reality interface. It is not detailed how the modalities were fused in real time. ...
... The multimodal actions are related in time and space to convey meaningful operations, which require a multimodal fusion technique to combine multiple input actions into a single action data stream and interact with an interface event. Increasing the number of input actions would increase the degree of complexity of the interaction system [40]. In this work, a real-time fuser is proposed to combine the input actions of hand gestures, head gaze, and voice commands. ...
... Task completion time for each combination of modalities was recorded in our study. As discussed in the previous study [40], this is an important parameter for evaluating the effectiveness and efficiency of a multimodal interaction system. A shorter task completion time indicates that the interaction modality is more efficient. ...
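A minimal sketch of the kind of time-window fusion these excerpts describe, in which buffered gaze and gesture events are attached to an incoming voice command to form a single action record (the window length, channel names, and data layout are assumptions for illustration, not the cited framework's actual design):

```python
from collections import deque

# Hypothetical time-window fuser: buffers recent events from each channel and
# emits a single action when a voice command arrives, attaching the most recent
# gaze target and hand gesture observed within FUSION_WINDOW seconds.
FUSION_WINDOW = 1.0  # illustrative value, not from the cited paper


class RealTimeFuser:
    def __init__(self, window=FUSION_WINDOW):
        self.window = window
        self.buffers = {"gaze": deque(), "gesture": deque()}

    def push(self, channel, value, timestamp):
        buf = self.buffers[channel]
        buf.append((timestamp, value))
        while buf and timestamp - buf[0][0] > self.window:
            buf.popleft()

    def on_voice(self, command, timestamp):
        def latest(channel):
            buf = self.buffers[channel]
            return buf[-1][1] if buf and timestamp - buf[-1][0] <= self.window else None
        # Fuse the three channels into one action record.
        return {"command": command,
                "gaze_target": latest("gaze"),
                "gesture": latest("gesture"),
                "t": timestamp}


fuser = RealTimeFuser()
fuser.push("gaze", "photo_42", timestamp=10.2)
fuser.push("gesture", "point_left", timestamp=10.5)
print(fuser.on_voice("move", timestamp=10.8))
```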
Article
Full-text available
The values of VR and multimodal interaction technologies offer creative, virtual alternatives for manipulating a large data set in a virtual environment. This work presents the design, implementation, and evaluation of a real-time multimodal interaction framework that enables users to navigate, select, and move data elements. The novel multimodal fusion method recognizes freehand gestures, voice commands, and a head-gaze pointer in real time and fuses them into meaningful actions for interacting with the virtual environment. We worked with imagery analysts who were defense and security experts on designing and testing the interface and interaction modalities. The evaluation of the framework was conducted with a case study of photo management tasks based on a real-world scenario. Users are able to select photos in a large virtual interface and move them to bins on the left and right sides of the main view. The evaluation focuses on performance, task completion time, and users’ experience among several different combinations of input modalities. The evaluation results show it is important to make multiple interaction modalities available to users, and interaction design implications are drawn from the evaluation.
... The fusion process enables complementary modalities to work together, mitigating the limitations of individual channels and enhancing overall system robustness. For example, a speech command such as "Turn on that light" can be clarified by a simultaneous pointing gesture, ensuring accurate intent recognition [76][77][78]. ...
... When combined with other modalities, such as speech or gesture, gaze-based interaction enables highly contextual and precise interface control. For instance, a user could look at an object on a screen and issue a voice command to interact with that object, making the interaction more intuitive and reducing the need for explicit selection through touch or mouse inputs [77,[147][148][149]. ...
Article
Full-text available
Multimodal interaction is a transformative human-computer interaction (HCI) approach that allows users to interact with systems through various communication channels such as speech, gesture, touch, and gaze. With advancements in sensor technology and machine learning (ML), multimodal systems are becoming increasingly important in various applications, including virtual assistants, intelligent environments, healthcare, and accessibility technologies. This survey concisely overviews recent advancements in multimodal interaction, interfaces, and communication. It delves into integrating different input and output modalities, focusing on critical technologies and essential considerations in multimodal fusion, including temporal synchronization and decision-level integration. Furthermore, the survey explores the challenges of developing context-aware, adaptive systems that provide seamless and intuitive user experiences. Lastly, by examining current methodologies and trends, this study underscores the potential of multimodal systems and sheds light on future research directions.
... Studies delve into the effectiveness of gesture inputs, highlighting their integration with the physical environment as a means to navigate virtual elements. Research findings emphasize the intuitive nature of gesturing in AR to ensure seamless data visualization and manipulation [10,11]. Conducting a literature review on gesturing in mixed reality environments involves systematically exploring and analyzing existing scholarly works, research papers, and publications related to the use of gestures within VR and AR settings. ...
... However, choosing a multimodal method creates a higher mental demand than any of the individual gestures alone. Additionally, speech would be the most useful for descriptive tasks, while gestures would be helpful for spatial tasks [10,11]. Although there are various other gesture inputs for VR/AR such as eye gestures, body gestures, and hand gestures, it is the combination of inputs that remains the most efficient and intuitive method of input. ...
Conference Paper
Full-text available
The intuitive interaction capabilities of augmented reality make it ideal for solving complex 3D problems that require complex spatial representations, which is key for astrodynamics and space mission planning. By implementing common and complex orbital mechanics algorithms in augmented reality, a hands-on method for designing orbit solutions and spacecraft missions is created. This effort explores the aforementioned implementation with the Microsoft Hololens 2 as well as its applications in industry and academia. Furthermore, a human-centered design process and study are utilized to ensure the tool is user-friendly while maintaining accuracy and applicability to higher-fidelity problems.
... The application of Virtual Reality (VR) technology in educational simulations and virtual laboratories has been increasingly recognized for its potential to enhance learning experiences [1,2]. However, the prevalent reliance on traditional input devices like keyboards, mice, or even touchscreens in VR environments limits the naturalness and intuitiveness of user interactions [3,4]. Existing improvements in this area have primarily focused on replacing controllers with gesture recognition [5,6,7,8,9]. ...
... Within China, Wang Huajian et al. [21] designed a simulation teaching system tailored for PLC experimental instruction, Wang Sijie et al. [22] developed a VR live streaming multiuser online teaching system based on panoramic videos, Zhu Yan [23] designed applications for big data and cloud computing's virtual reality experiment platform, and Guo Jian et al. [24] embarked on a study of a lathe teaching training system using VR. Most of the above VR teaching systems are implemented on devices such as desktop or tablet computers, and this dependency limits the naturalness and intuitiveness of user interaction [3,4]. Systems based on VR headsets, on the other hand, can deliver a more authentic virtual reality experience, hence they are gaining increasing attention from researchers. ...
Article
Full-text available
In virtual teaching scenarios, head-mounted display (HMD) interactions often employ traditional controller and UI interactions, which are not very conducive to teaching scenarios that require hand training. Existing improvements in this area have primarily focused on replacing controllers with gesture recognition. However, the exclusive use of gesture recognition may have limitations in certain scenarios, such as complex operations or multitasking environments. This study designed and tested an interaction method that combines simple gestures with voice assistance, aiming to offer a more intuitive user experience and enrich related research. A speech classification model was developed that can be activated via a fist-clenching gesture and is capable of recognising specific Chinese voice commands to initiate various UI interfaces, further controlled by pointing gestures. Virtual scenarios were constructed using Unity, with hand tracking achieved through the HTC OpenXR SDK. Within Unity, hand rendering and gesture recognition were facilitated, and interaction with the UI was made possible using the Unity XR Interaction Toolkit. The interaction method was detailed and exemplified using a teacher training simulation system, including sample code provision. Following this, an empirical test involving 20 participants was conducted, comparing the gesture-plus-voice operation to the traditional controller operation, both quantitatively and qualitatively. The data suggests that while there is no significant difference in task completion time between the two methods, the combined gesture and voice method received positive feedback in terms of user experience, indicating a promising direction for such interactive methods. Future work could involve adding more gestures and expanding the model training dataset to realize additional interactive functions, meeting diverse virtual teaching needs.
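A rough sketch of the interaction pattern described in the abstract, where a fist-clench gesture arms voice recognition and a recognized command opens a UI that pointing gestures then control (the states, command strings, and panel names below are placeholders, not the system's actual vocabulary or code):

```python
from enum import Enum, auto


# Hypothetical state machine for the gesture-plus-voice pattern: a fist-clench
# arms the speech classifier, a recognised command opens a UI panel, and
# pointing gestures then operate that panel.
class Mode(Enum):
    IDLE = auto()
    LISTENING = auto()
    PANEL_OPEN = auto()


class GestureVoiceUI:
    COMMANDS = {"open menu": "main_menu", "open settings": "settings_panel"}

    def __init__(self):
        self.mode = Mode.IDLE
        self.panel = None

    def on_gesture(self, gesture):
        if gesture == "fist" and self.mode == Mode.IDLE:
            self.mode = Mode.LISTENING          # arm the speech classifier
        elif gesture == "point" and self.mode == Mode.PANEL_OPEN:
            return f"interact with {self.panel}"
        return None

    def on_speech(self, text):
        if self.mode == Mode.LISTENING and text in self.COMMANDS:
            self.panel = self.COMMANDS[text]
            self.mode = Mode.PANEL_OPEN         # command recognised: show panel
            return f"open {self.panel}"
        return None


ui = GestureVoiceUI()
ui.on_gesture("fist")
print(ui.on_speech("open menu"))   # -> open main_menu
print(ui.on_gesture("point"))      # -> interact with main_menu
```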
... The first category consists of multimodal techniques, which use several alternative input modalities, such as gaze or speech, to avoid the need for precise mid-air gestures. In a study evaluating the performance of multiple interaction modalities in target selection and object manipulation tasks, Wang et al. [44] found that combining gaze, speech, and gesture resulted in a significantly lower physical workload than gesture-only interactions. While the present unimodal approach focuses exclusively on hand-based interaction, we anticipate that our contributions and methodology will provide valuable foundations and insights for future multimodal interaction research. ...
Preprint
Full-text available
Mid-air gestures serve as a common interaction modality across Extended Reality (XR) applications, enhancing engagement and ownership through intuitive body movements. However, prolonged arm movements induce shoulder fatigue, known as "Gorilla Arm Syndrome", degrading user experience and reducing interaction duration. Although existing ergonomic techniques derived from Fitts' law (such as reducing target distance, increasing target width, and modifying control-display gain) provide some fatigue mitigation, their implementation in XR applications remains challenging due to the complex balance between user engagement and physical exertion. We present AlphaPIG, a meta-technique designed to Prolong Interactive Gestures by leveraging real-time fatigue predictions. AlphaPIG assists designers in extending and improving XR interactions by enabling automated fatigue-based interventions. Through adjustment of intervention timing and intensity decay rate, designers can explore and control the trade-off between fatigue reduction and potential effects such as decreased body ownership. We validated AlphaPIG's effectiveness through a study (N=22) implementing the widely-used Go-Go technique. Results demonstrated that AlphaPIG significantly reduces shoulder fatigue compared to non-adaptive Go-Go, while maintaining comparable perceived body ownership and agency. Based on these findings, we discuss positive and negative perceptions of the intervention. By integrating real-time fatigue prediction with adaptive intervention mechanisms, AlphaPIG constitutes a critical first step towards creating fatigue-aware applications in XR.
... This innovative approach effectively addressed the tiredness issue previously identified in chapters "Field Survey" and "Fundamental framework in Web XR," demonstrating its potential to enhance user comfort and engagement. The findings from this study are consistent with existing research reports that highlight the preference of users for the Gesture+Speech modality [131,132]. The combination creates a more natural and intuitive UX, allowing users to interact with the system in a way that feels familiar and comfortable. ...
Thesis
Full-text available
Extended Reality (XR) technology has demonstrated tremendous potential in diverse domains such as education, commerce, and medicine. Web services constitute an integral and indispensable aspect of modern society, providing various avenues for social communication. Given the increasing adoption of XR devices, web services must be made available on these platforms to cater to the evolving needs of users. However, the current state of Web XR still presents certain challenges, such as hardware and software limitations, leading to inadequate support for web services on XR devices. Thus, this study aims to investigate utilizing XR technology to enhance user experiences in web services. I evaluated the user experience of Web 2D user interfaces (UIs) on XR platforms with 80 participants using a pre-test and post-test design. Results showed inferior performance on XR platforms compared to the desktop group, providing a reference for UI researchers to improve Web UIs on XR platforms. This work aimed to offer a reliable and critical point of reference for UI researchers in their efforts to enhance Web UIs on XR platforms. This study examined potential strategies to enhance the user experience of Web XR UIs. The research developed a Web XR UI principle featuring XR-specific characteristics and componentization design in static websites. A spatial user interface (SUI) was proposed to provide users with an immersive, exploratory, and readable experience. The experimental outcomes indicated that the principle could improve the user experience with SUIs. While the fundamental UI principle provides a smooth SUI experience, it primarily adheres to 2D interactions in order to minimize the cost of reconstructing UIs for web services. However, this approach neglects many spatial and interactive features. Therefore, my study needed to address how to design an efficient framework that accommodates a greater number of interactive elements, content, and data. This study focused on the area of online reading services. Unlike existing XR reading samples that simply advertise 2D UI or media in reality, this study designed an adaptive UI framework encompassing three aspects: 3D layout, navigation, and data visualization. The framework aimed to enhance the efficiency and immersion of online reading by providing users with a seamless spatial reading experience. To precisely measure user feedback, I conducted a comprehensive user experiment that included self-report questionnaires, semi-structured interviews, and listening module data. The experiment provided evidence that XR interactions can effectively enhance website layout. The research demonstrated that the framework has the potential to increase participant enthusiasm and make a valuable contribution toward improving the user experience in the Web XR UI field.
... Integrating speech, gesture, and gaze inputs [46,55] enhances medical application control and healthcare team collaboration. Studies demonstrate these multi-modal interfaces significantly outperform single-mode systems [95], with clinicians achieving high recognition rates through combined visual and auditory notifications [43]. To optimize clinical usability, research recommends bottom-center or wrist-mounted displays during frequent AR-HMD interactions [64]. ...
Preprint
Full-text available
How might healthcare workers (HCWs) leverage augmented reality head-mounted displays (AR-HMDs) to enhance teamwork? Although AR-HMDs have shown immense promise in supporting teamwork in healthcare settings, design for Emergency Department (ER) teams has received little attention. The ER presents unique challenges, including procedural recall, medical errors, and communication gaps. To address this gap, we engaged in a participatory design study with healthcare workers to gain a deep understanding of the potential for AR-HMDs to facilitate teamwork during ER procedures. Our results reveal that AR-HMDs can be used as an information-sharing and information-retrieval system to bridge knowledge gaps, and concerns about integrating AR-HMDs in ER workflows. We contribute design recommendations for seven role-based AR-HMD application scenarios involving HCWs with various expertise, working across multiple medical tasks. We hope our research inspires designers to embark on the development of new AR-HMD applications for high-stakes, team environments.
... Thus, the timely and accurate recognition of these needs is of paramount importance. In this context, speech recognition technologies emerge as critical tools, offering a natural and efficient means of facilitating human-machine interaction [1]. Speech recognition can generally be categorized into two main approaches: automatic speech recognition (ASR) [2]-[8] using audio signals and silent speech recognition (SSR) [9]-[12] using non-acoustic signals. ...
Preprint
The global aging population faces considerable challenges, particularly in communication, due to the prevalence of hearing and speech impairments. To address these, we introduce the AVE speech dataset, a comprehensive multi-modal benchmark for speech recognition tasks. The dataset includes a 100-sentence Mandarin Chinese corpus with audio signals, lip-region video recordings, and six-channel electromyography (EMG) data, collected from 100 participants. Each subject read the entire corpus ten times, with each sentence averaging approximately two seconds in duration, resulting in over 55 hours of multi-modal speech data per modality. Experiments demonstrate that combining these modalities significantly improves recognition performance, particularly in cross-subject and high-noise environments. To our knowledge, this is the first publicly available sentence-level dataset integrating these three modalities for large-scale Mandarin speech recognition. We expect this dataset to drive advancements in both acoustic and non-acoustic speech recognition research, enhancing cross-modal learning and human-machine interaction.
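A quick arithmetic check of the corpus size quoted in the abstract (100 participants, a 100-sentence corpus read ten times, at roughly two seconds per sentence):

```python
# Quick check of the figure "over 55 hours of multi-modal speech data per modality".
participants, sentences, repetitions, secs_per_sentence = 100, 100, 10, 2.0
total_hours = participants * sentences * repetitions * secs_per_sentence / 3600
print(f"{total_hours:.1f} hours per modality")  # ~55.6 hours, matching the abstract
```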
... Each method offers certain advantages depending on the interaction context and desired level of immersion. Speech-based interaction complements and is often integrated into other techniques in multimodal interaction to enable intuitive interaction experiences in XR [21]. ...
Preprint
Full-text available
Recent developments in computer graphics, machine learning, and sensor technologies enable numerous opportunities for extended reality (XR) setups for everyday life, from skills training to entertainment. With large corporations offering consumer-grade head-mounted displays (HMDs) in an affordable way, it is likely that XR will become pervasive, and HMDs will develop as personal devices like smartphones and tablets. However, having intelligent spaces and naturalistic interactions in XR is as important as technological advances so that users grow their engagement in virtual and augmented spaces. To this end, large language model (LLM)--powered non-player characters (NPCs) with speech-to-text (STT) and text-to-speech (TTS) models bring significant advantages over conventional or pre-scripted NPCs for facilitating more natural conversational user interfaces (CUIs) in XR. In this paper, we provide the community with an open-source, customizable, extensible, and privacy-aware Unity package, CUIfy, that facilitates speech-based NPC-user interaction with various LLMs, STT, and TTS models. Our package also supports multiple LLM-powered NPCs per environment and minimizes the latency between different computational models through streaming to achieve usable interactions between users and NPCs. We publish our source code in the following repository: https://gitlab.lrz.de/hctl/cuify
... In terms of interaction design, many new studies aim to make XR interaction more adaptive [50,[53][54][55][56]. A promising approach is to predict the user's visual intent and dynamically adapt the virtual content within XR environments [41]. ...
Article
Full-text available
With eye tracking finding widespread utility in augmented reality and virtual reality headsets, eye gaze has the potential to recognize users' visual tasks and adaptively adjust virtual content displays, thereby enhancing the intelligence of these headsets. However, current studies on visual task recognition often focus on scene-specific tasks, like copying tasks for office environments, which lack applicability to new scenarios, e.g., museums. In this paper, we propose four scene-agnostic task types for facilitating task type recognition across a broader range of scenarios. We present a new dataset that includes eye and head movement data recorded from 20 participants while they engaged in four task types across 15 360-degree VR videos. Using this dataset, we propose an egocentric gaze-aware task type recognition method, TRCLP, which achieves promising results. Additionally, we illustrate the practical applications of task type recognition with three examples. Our work offers valuable insights for content developers in designing task-aware intelligent applications. Our dataset and source code are available at https://www.youtube.com/watch?v=HQxJMDK0lHE&t=11s
... Multimodality can promote immersion and presence, as well as improve human performance in specific tasks in a variety of environments [32]. Wang et al. [33] introduced a multimodal interaction system in AR, which integrates gaze, gesture, and speech in a flexible configuration. Furthermore, Hashim et al. [34] developed a framework for a mobile AR learning system that incorporates emotional, image-based, and speech input. ...
... These results encourage future developments of the library, which may include tools to automatically segment, classify, and quantify sequences of hand movements, as proposed in de Souza Baptista et al (2017). Incorporating other input modalities, as in Wang et al (2021) and Aguiar and Bo (2017), is also an alternative, as is integrating other rehabilitation technologies (Cardoso et al 2022), particularly due to the potential therapeutic benefits. ...
Preprint
Full-text available
Virtual reality rehabilitation (VR) can complement traditional rehabilitation therapy to increase motivation and thus the quality of exercise, while transferring effectively to real-world tasks. Hand tracking can significantly improve usability and immersion. A VR hand-tracking-based interaction system has been developed that supports various hand interactions, including grasping, pinching, and gesturing to complete virtual exercises. Interactions support modifiable difficulty and requirements, metrics, and visual error augmentation (EA). Audio-visual signifiers and feedback to guide interactions are also applied. A proof-of-concept rehabilitation game was tested with non-expert (N=7) and expert occupational therapy (N=5) participants. Participants completed tasks including grabbing objects, pinching, pulling handles, tracing paths, and making hand gestures. The system's usability was the primary outcome of the study. The results have shown this system is highly usable, with a modified System Usability Scale (SUS) of 88.93 (non-expert) and 74.5 (expert). A preliminary performance analysis was also conducted, indicating a limited number of unsuccessful hand-object interactions, although the accuracy in some interactions was impacted by tracking occlusion.
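For context on the usability scores reported above, here is the standard 10-item System Usability Scale computation (the paper uses a modified SUS, whose adjustments are not described here; the example responses are invented):

```python
# Standard SUS scoring: responses are 1-5 Likert values in questionnaire order.
def sus_score(responses):
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)  # odd items positive, even negative
    return total * 2.5  # scales 0-40 raw points to the 0-100 SUS range


print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # illustrative responses -> 90.0
```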
... Since gaze can be efficiently combined with gesture or voice input to solve specific challenges in AR/VR applications [37,46], our survey further collected interaction proposals for potential multimodal controls of the suggested tasks, combining gaze input with mid-air gestures, voice control, or other modalities that participants could imagine using in combination with gaze. Using the same approach, we deduced multimodal gaze-supported interaction concepts with mid-air gestures and voice commands. ...
... Estimating gaze from a single low-cost RGB sensor is an important research topic in computer vision, where eye or facial images are typically used as inputs to estimate the real gaze direction and locate gaze points. Gaze estimation has important applications in fields such as human-computer interaction [1] [2], education [3], and medical diagnosis [4] [5]. ...
Preprint
Full-text available
Recent work has demonstrated that the Transformer model is effective for computer vision tasks. However, the global self-attention mechanism utilized in Transformer models does not adequately consider the local structure and details of images, which may result in the loss of information and local details, causing decreased estimation accuracy in gaze estimation tasks when compared to convolution or sequential stacking methods. To address this issue, we propose a parallel CNNs-Transformer aggregation network (CTA-Net) for gaze estimation, which fully leverages the advantages of the Transformer model in modeling global context and of the convolutional neural network (CNN) model in retaining local details. Specifically, Transformer and ResNet are deployed to extract facial and eye information, respectively. Additionally, an attention cross fusion (ACFusion) block is embedded in the CNN branch, which decomposes features in space and channels to supplement lost features, suppress noise, and help extract eye features more effectively. Finally, a dual-feature aggregation (DFA) module is proposed to effectively fuse the output features of both branches with the help of a feature selection mechanism and a residual structure. Experimental results on the MPIIGaze and Gaze360 datasets demonstrate that our CTA-Net achieves state-of-the-art results.
... Research related to geovisualization and AR has focused on various topics, such as the investigation of different visualization methods (Langner et al., 2021;Reipschlager et al., 2021), the development of interaction methods that involve gestures (Butscher et al., 2018;Hurter et al., 2019;Kister et al., 2015;Newbury et al., 2021;Piumsomboon et al., 2013) and gaze (Bâce et al., 2016;Blattgerste et al., 2018;Jae-Young Jing et al., 2021;Krajancich et al., 2020;Lee et al., 2011;Park et al., 2008;Pfeuffer et al., 2021;Piening et al., 2021;Wang et al., 2021) or the exploitation of different visualization mediums such as mobile devices (Chatzopoulos et al., 2017;Chun & Höllerer, 2013;Nincarean et al., 2013;Paucher & Turk, 2010), tablets (Dey et al., 2012;Hubenschmid et al., 2021;Lee et al., 2015), and head-mounted displays (Bambusek et al., 2019;El Jamiy & Marsh, 2019;Guarese & Maciel, 2019;Hincapié-Ramos et al., 2015;Itoh et al., 2022;Kapp et al., 2021;Liu et al., 2016;Xu et al., 2019). ...
Article
Augmented reality (AR) is a rapidly advancing technology that enhances users’ perception of the real world by overlaying virtual elements. The geospatial community has been gradually focusing on AR because of its ability to create immersive spatial experiences and facilitate spatial learning. However, designing effective AR interfaces poses several challenges, including managing information overload, providing intuitive user interaction, and optimizing system performance. The management of the Level of Detail (LoD) is a crucial part of AR-enhanced cartographic representations as it can greatly impact the quantity, accuracy, and usefulness of the information being conveyed and enhance the readability and usability of an application. In this paper, we present a systematic review of published research on the management of LoD for AR cartographic representations based on various dimensions that focus on the types of data that are visualized, the techniques used, and the user actions that trigger LoD change. A corpus of fifteen scientific papers involving different LoD management techniques within AR environments has been analyzed. The limited number of papers implies that this kind of application is in its infancy. The review provides a synthesis of existing knowledge and identifies challenges for future research in this exciting and dynamic field.
... Change of interaction mode: For the AR navigation design of hand-held devices, designers can mainly adopt voice interaction to avoid occupying the user's hands [42], which can also solve the problem of insufficient perception of gesture positions for people with low mental cutting ability. ...
Article
Full-text available
In the context of the rapid development of navigation technology and the deepening of users’ diversified needs, mobile AR (Augmented Reality) navigation, as an emerging public service, should focus on human-computer interaction and user experience. To extract the factors influencing efficient mobile AR navigation service, we constructed an experimental method for the usability of mobile AR navigation and users’ emotional experience based on behavior-emotion analysis. In this study, user types were divided according to differences in Mental Cutting Ability and Gender. We explored the effects of Interaction Mode, Mental Cutting Ability, and Gender on the usability of mobile AR navigation and users’ PAD (Pleasure-Arousal-Dominance) three-dimensional emotion through the objective performance and subjective scoring of users when completing AR navigation tasks. The results showed that Interaction Mode and Mental Cutting Ability had significant effects on the usability of mobile AR navigation and users’ emotional experience; Ease of Learning and Ease of Use among the usability indicators, as well as the Arousal dimension of three-dimensional emotion, were significantly affected by Gender. Based on the experimental results, we investigated the mechanism of effects between the various factors, extracted the behavioral and emotional trends of different types of users, broadened the research scope of mobile AR navigation-related fields, and finally summarized design strategies from the perspective of human-robot-environment.
... Despite the importance of freehand interaction in AR, other means include voice-based, gaze-based, location-based, or even tactile interaction, which have proven to be well-suited to specific situations. Combined interaction techniques were proposed in [21] to assist the user. For these types of interaction, the same need applies, i.e., developers and creators require higher-level tools that will allow them to configure the desired behaviors and functionality within a few simple steps. ...
Article
Full-text available
Contemporary software applications have shifted focus from 2D representations to 3D. Augmented and Virtual Reality (AR/VR) are two technologies that have captured the industry’s interest as they show great potential in many areas. This paper proposes a system that allows developers to create applications in AR and VR with a simple visual process, while also enabling all the powerful features provided by the Unity 3D game engine. The current system comprises two tools, one for the interaction and one for the behavioral configuration of 3D objects within the environment. Participants from different disciplines with a software-engineering background were asked to participate in the evaluation of the system. They were called to complete two tasks using their mobile phones and then answer a usability questionnaire to reflect on their experience using the system. The results (a) showed that the system is easy to use but still lacks some features, (b) provided insights on what educators seek from digital tools to assist them in the classroom, and (c) indicated that educators often request a more whimsical UI as they want to use the system together with the learners.
... The study showed that the association between speech expressions and gesture strokes can improve recognizer efficiency in HMD AR. Most existing multimodal interactions in HMD AR applications combine only two modalities at a time, e.g., gesture and speech; a multimodal interactive method integrating gesture, gaze, and speech represents a more reliable paradigm, and it is argued that the "gesture, gaze, and speech" combination can achieve superior performance in terms of efficiency in HMD AR assistance [214]. ...
Article
Head-mounted display (HMD) augmented reality (AR) has attracted more and more attention in manufacturing activities, as it enables operators to access visual guidance directly in front of their view while freeing both hands. Nevertheless, HMD AR has not been adopted in manufacturing fields as widely as expected since the release of Google Glass in 2012, and thus it is important to understand the related issues arising from actual deployments of HMD AR on the shop floor. To the best of the authors’ knowledge, there have not been comprehensive discussions on HMD AR in manufacturing from a holistic perspective. This article aims to provide an extensive map of the distribution of HMD AR in various manufacturing activities and a systematic overview of the underlying technical perspectives associated with their actual industrial applications between 2010 and 2022, involving AR visualization, tracking and registration, context awareness, human-machine interaction, as well as ergonomics and usability, which are significant for actual AR deployments for human-centric manufacturing in Industry 5.0. It is also worth mentioning that this work presents a historical overview of the current research on the development of HMD AR, as well as a summary of the existing methods and open problems for HMD AR in manufacturing. It is helpful to understand the current technical situation of HMD AR while providing insights for deploying industrial AR applications and performing academic research in the future.
... However, eye gaze interaction is not limited to pointing tasks. Recently, many research studies have investigated the importance of hands-free interaction (Park et al., 2021; Wang, Wang, Yu, & Lu, 2021). Since using XR technology in real-world scenarios needs more consideration, different types of interactions were developed to make the interaction with augmented and virtual content more intuitive and comfortable. ...
Article
Significant advancements of eye-tracking technology in extended reality (XR) head-mounted displays have increased the interest in gaze-based interactions. The benefits of gaze interaction proved that it could be a suitable alternative for hand-based interactions when users face situations where they must maintain their position due to mobility impairment. This study aims to assess the user experience of the gaze-based interaction, compared to hand-based interaction, in two movement conditions of static and dynamic. Twenty-four participants took part in this study, and their experience was evaluated in terms of perceived workload, usability, and performance. The results show that gaze-based interactions significantly outperform the hand-based interaction in terms of perceived workload and usability in case of limited mobility. Also, the user performance is significantly higher in gaze-based modes under situational impairment. The findings of this study can be used for designing XR interfaces considering the situation in which the task is performed.
Article
As augmented reality (AR) headsets become increasingly integrated into professional and social settings, a critical challenge emerges: how can users effectively manage and interact with the frequent notifications they receive? With adults receiving nearly 200 notifications daily on their smartphones, which serve as primary computing devices for many, translating this interaction to AR systems is paramount. Unlike traditional devices, AR systems augment the physical world, requiring interaction techniques that blend seamlessly with real-world behaviors. This study explores the complexities of multimodal interaction with notifications in AR. We investigated user preferences, usability, workload, and performance during a virtual cooking task, where participants managed customer orders while interacting with notifications. Various interaction techniques were tested: Point and Pinch, Gaze and Pinch, Point and Voice, Gaze and Voice, and Touch. Our findings reveal significant impacts on workload, performance, and usability based on the interaction method used. We identify key issues in multimodal interaction and offer guidance for optimizing these techniques in AR environments.
Article
Augmented Reality (AR) has been applied to facilitate human-robot collaboration (HRC) in manufacturing. It enhances real-time communication and interaction between humans and robots as a new paradigm of interface. This research conducts an experimental study to systematically evaluate and compare various input modality designs based on hand gestures, eye gaze, head movements, and voice in industrial robot programming. These modalities allow users to perform common robot planning tasks from a distance through an AR headset, including pointing, tracing, 1D rotation, 3D rotation, and switch state. Statistical analyses of both objective and subjective measures collected from the experiment reveal the relative effectiveness of each modality design in assisting individual tasks in terms of positional deviation, operational efficiency, and usability. A verification test on programming a robot to complete a pick-and-place procedure not only demonstrates the practicality of these modality designs but also confirms their cross-comparison results. Significant findings from the experimental study provide design guidelines for AR input modalities that assist in planning robot motions.
Article
Gaze is a crucial element in human-computer interaction and plays an increasingly vital role in promoting the adoption of head-mounted devices (HMDs). Existing gaze tracking methods for HMDs either demand user calibration or face challenges in balancing accuracy and speed, compromising the overall user experience. In this paper, we introduce a novel strategy for real-time, calibration-free gaze tracking using joint head-eye cues on HMDs. Initially, we create a multimodal gaze tracking dataset named HE-Gaze, encompassing synchronized eye images and 6DoF head movement data, addressing a gap in the current data landscape. Statistical analyses unveil the correlation between head movements and gaze positions. Building on these insights, we introduce the hierarchical head-eye coordinated gaze tracking model (HHE-Tracker), which incorporates two lightweight branches to encode input eye images and head sequences efficiently. It combines encoded head velocity and posture features with eye features across various scales to infer gaze position. HHE-Tracker was implemented on a commercial HMD, and its performance was assessed in unconstrained scenarios. The results demonstrate the HHE-Tracker's capability to accurately estimate gaze positions in real-time. In comparison to the state-of-the-art gaze tracking algorithm, HHE-Tracker exhibits commendable accuracy (3.47°) and a 40-fold speedup (81 FPS on a Snapdragon 845 SoC).
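As a purely illustrative sketch of the two-branch head-eye idea (not the actual HHE-Tracker architecture, which fuses head velocity and posture features with eye features at multiple scales), a minimal PyTorch model might encode the eye image with a small CNN and the head-motion sequence with a GRU before regressing a 2D gaze position:

```python
import torch
import torch.nn as nn


# Minimal sketch assuming a 1-channel eye image and a short 6DoF head-motion
# sequence as inputs; illustrative only, not the published model.
class HeadEyeGazeNet(nn.Module):
    def __init__(self, head_seq_len=10, head_dim=6):
        super().__init__()
        self.eye_branch = nn.Sequential(               # lightweight eye-image encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head_branch = nn.GRU(head_dim, 32, batch_first=True)  # head-motion encoder
        self.regressor = nn.Sequential(nn.Linear(32 + 32, 64), nn.ReLU(),
                                       nn.Linear(64, 2))           # 2D gaze position

    def forward(self, eye_img, head_seq):
        eye_feat = self.eye_branch(eye_img)
        _, h_n = self.head_branch(head_seq)
        head_feat = h_n[-1]
        return self.regressor(torch.cat([eye_feat, head_feat], dim=1))


model = HeadEyeGazeNet()
gaze = model(torch.randn(4, 1, 36, 60), torch.randn(4, 10, 6))
print(gaze.shape)  # torch.Size([4, 2])
```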
Article
Recent work has demonstrated that the Transformer model is effective for computer vision tasks. However, the global self-attention mechanism utilized in Transformer models does not adequately consider the local structure and details of images, which may result in the loss of information and local details, causing decreased estimation accuracy in gaze estimation tasks when compared to convolution or sequential stacking methods. To address this issue, we propose a parallel CNNs-Transformer aggregation network (CTA-Net) for gaze estimation, which fully leverages the advantages of the Transformer model in modeling global context and of the convolutional neural network (CNN) model in retaining local details. Specifically, Transformer and ResNet are deployed to extract facial and eye information, respectively. Additionally, an attention cross fusion (ACFusion) block is embedded in the CNN branch, which decomposes features in space and channels to supplement lost features, suppress noise, and help extract eye features more effectively. Finally, a dual-feature aggregation (DFA) module is proposed to effectively fuse the output features of both branches with the help of a feature selection mechanism and a residual structure. Experimental results on the MPIIGaze and Gaze360 datasets demonstrate that our CTA-Net achieves state-of-the-art results.
Article
While speech interaction finds widespread utility within the Extended Reality (XR) domain, conventional vocal speech keyword spotting systems continue to grapple with formidable challenges, including suboptimal performance in noisy environments, impracticality in situations requiring silence, and susceptibility to inadvertent activations when others speak nearby. These challenges, however, can potentially be surmounted through the cost-effective fusion of voice and lip movement information. Consequently, we propose a novel vocal-echoic dual-modal keyword spotting system designed for XR headsets. We devise two different modal fusion approaches and conduct experiments to test the system's performance across diverse scenarios. The results show that our dual-modal system not only consistently outperforms its single-modal counterparts, demonstrating higher precision in both typical and noisy environments, but also excels in accurately identifying silent utterances. Furthermore, we have successfully applied the system in real-time demonstrations, achieving promising results.
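The abstract does not spell out the fusion approaches; as a generic illustration of score-level (late) fusion between an acoustic branch and a lip-movement/echo branch, with a placeholder noise-dependent weighting:

```python
import numpy as np


# Illustrative late fusion of two keyword-spotting branches; the weighting
# scheme is a placeholder, not the paper's actual fusion method.
def late_fusion(acoustic_probs, echo_probs, noise_level):
    # Down-weight the acoustic branch as the estimated noise level rises.
    w_acoustic = float(np.clip(1.0 - noise_level, 0.1, 0.9))
    fused = w_acoustic * acoustic_probs + (1.0 - w_acoustic) * echo_probs
    return int(np.argmax(fused)), fused


keywords = ["select", "move", "delete"]
acoustic = np.array([0.2, 0.5, 0.3])   # degraded by background noise
echo = np.array([0.7, 0.2, 0.1])       # robust to acoustic noise
idx, fused = late_fusion(acoustic, echo, noise_level=0.8)
print(keywords[idx], fused.round(2))   # -> select [0.6  0.26 0.14]
```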
Article
Cars, mobile phones, and smart home devices already provide automatic speech recognition (ASR) by default. However, human machine interfaces (HMI) in industrial settings, as opposed to consumer settings, operate under different conditions and thus present different design challenges. Voice control, arguably the most natural form of communication, has the potential to shorten complex command sequences and menu structures in order to directly execute a final command. Therefore, this contribution explored how differing HMI scenarios could be optimized by either replacing or complementing existing touch control interactions with voice control. Typical commands from CNC milling machines and industrial robots were categorized by their complexity, quantified by menu level and the necessary number of interactions. The collected interaction data showed that voice control can already provide a time efficiency advantage at either one additional menu level or three touchscreen interactions. For complex commands, such as those needing five menu levels and seven interactions on the touchscreen, the time efficiency advantage of voice control can reach up to 67 %. Furthermore, the study shows the possibility of reducing machine operator training times when using voice control, indicated by significantly lower interaction times for the participants' first repetition.

Note to Practitioners — Several publications investigate the ergonomics, usability, and cognitive load of classic mouse and keyboard control, button control, touch control, gesture control, gaze control, or voice control in specific interaction scenarios. All publications state that these factors need to be considered for the development of modern human machine interfaces (HMI). Due to the complexity of these factors, it is difficult to develop general guidelines for building efficient HMIs independent of the machine or process. A lack of efficiency guidelines potentially hampers the development of new HMIs, which are currently necessary to address new challenges in the digital production hall, such as increasingly complex machines, processes that become more individual, and multiple machine operation. In order to inform HMI development, voice and touch control alternatives were empirically measured. Based on the collected data, complexity time equivalents for each menu level and number of interactions were calculated. These time equivalents provide the opportunity for machine and programmable logic controller (PLC) manufacturers to evaluate their production processes and the related interaction processes regarding the potential efficiency benefits of voice control as a complement or substitute for the conventional HMI system. Using this model, the efficiency advantage of voice control can be estimated without implementing and testing voice control on a real production machine. Thus, the potential benefit of implementing voice control can be assessed directly, avoiding expensive test runs.
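The time-equivalent idea can be illustrated with a toy model in which touch time grows with menu depth and interaction count while a voice command takes roughly constant time; all constants below are invented placeholders, not the values measured in the study, so the printed percentage is not the paper's 67 % figure:

```python
# Hypothetical time-equivalent model for comparing touch and voice control.
T_PER_MENU_LEVEL = 1.2      # s per menu level (assumed)
T_PER_TOUCH = 0.8           # s per touchscreen interaction (assumed)
T_VOICE_COMMAND = 3.0       # s per spoken command (assumed)


def touch_time(menu_levels, interactions):
    return menu_levels * T_PER_MENU_LEVEL + interactions * T_PER_TOUCH


def voice_advantage(menu_levels, interactions):
    t_touch = touch_time(menu_levels, interactions)
    return (t_touch - T_VOICE_COMMAND) / t_touch  # fraction of time saved


print(f"{voice_advantage(5, 7):.0%} saved for a deep, interaction-heavy command")
```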
Chapter
The evaluation of an interaction system is an important step in guiding the iterative optimization of the interaction product. However, with the development of emotional, naturalistic, and multi-channel HCI, some problems are emerging, such as the mere superposition of interaction modes and the imbalance between efficiency and cost. In this study, we propose a model called the User-Device-Interaction Model (UDI). The goal is to establish an interaction evaluation system that takes into account the cost of interaction and quantifies the evaluation metrics. The interaction system was decomposed into measurable indicators across seven dimensions, including user-perceived usability and user fatigue perception. Then, the Analytic Hierarchy Process (AHP) was used to calculate the weights of the indicators at each level of evaluation, and the validity of the evaluation system was demonstrated through empirical research. We believe that this result can provide guidance and suggestions for the optimal design and evaluation of various types of interaction systems. Keywords: Human-computer interaction, Interaction evaluation, Analytic Hierarchy Process, Interaction usability, System evaluation
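For reference, the standard AHP weight calculation mentioned in the abstract can be sketched with the row geometric mean method; the 3×3 pairwise comparison matrix below is a made-up example rather than the UDI model's actual data:

```python
import numpy as np


# Generic AHP weights: row geometric mean of a pairwise comparison matrix,
# normalised to sum to 1.
def ahp_weights(pairwise):
    pairwise = np.asarray(pairwise, dtype=float)
    geo_means = np.prod(pairwise, axis=1) ** (1.0 / pairwise.shape[1])
    return geo_means / geo_means.sum()


# e.g. criterion A is 3x as important as B and 5x as important as C.
matrix = [[1, 3, 5],
          [1 / 3, 1, 2],
          [1 / 5, 1 / 2, 1]]
print(ahp_weights(matrix).round(3))  # weights for A, B, C
```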
Article
Full-text available
This paper proposes a feature missing from the Windows voice assistant Cortana that the voice assistants of the Android and iPhone operating systems already provide. The Android operating system has Google Assistant and iOS has Siri, both of which can adjust the system volume and brightness following the user's voice commands, while Cortana cannot adjust the system volume and brightness.
Article
Full-text available
This paper verifies the feasibility of robust speech recognition based on deep learning in sports game review. A robust speech recognition model is built based on the generative adversarial network (GAN) algorithm within a deep learning framework. A loss function, an optimization function, and a noise-reduction front-end are introduced into the model to optimize speech feature extraction through denoising, ensuring that accurate speech review data can be derived even in game scenes with noisy environments. Finally, experiments are conducted to verify four aspects of the model algorithm by comparing the speech features MFCC, FBANK, and WAVE. The experimental results show that the speech recognition model trained by the GSDNet model algorithm can reach 89% accuracy, a 56.24% reduction in auxiliary speech recognition word error rate, 92.61% accuracy of speech feature extraction, about a 62.19% reduction in training sample data volume, and a 94.75% improvement in speech recognition performance in the speech recognition task under a noisy environment. This shows that robust speech recognition based on deep learning can be applied to sports game reviews, can provide accurate voice review information from noisy sports game scenes, and also broadens the application area of deep learning models.
Article
Full-text available
Technology developments have expanded the diversity of interaction modalities that can be used by an agent (either a human or a machine) to interact with a computer system. This expansion has created the need for more natural and user-friendly interfaces in order to achieve effective user experience and usability. More than one modality can be provided to an agent for interaction with a system to accomplish this goal, which is referred to as a multimodal interaction (MI) system. The Internet of Things (IoT) and augmented reality (AR) are popular technologies that allow interaction systems to combine the real-world context of the agent and immersive AR content. However, although MI systems have been extensively studied, only a few studies have reviewed MI systems that use IoT and AR. Therefore, this paper presents an in-depth review of studies that proposed various MI systems utilizing IoT and AR. A total of 23 studies were identified and analyzed through a rigorous systematic literature review protocol. The results of our analysis of MI system architectures, the relationship between system components, input/output interaction modalities, and open research challenges are presented and discussed to summarize the findings and identify future research and development avenues for researchers and MI developers.
Conference Paper
Full-text available
Augmented reality (AR) technologies have the potential to provide individuals with unique training and visualizations, but the effectiveness of these applications may be influenced by users' perceptions of the distance to AR objects. Perceived distances to AR objects may be biased if these objects do not appear to make contact with the ground plane. The current work compared distance judgments of AR targets presented on the ground versus off the ground when no additional AR depth cues, such as shadows, were available to denote ground contact. We predicted that without additional information for height off the ground, observers would perceive the off-ground objects as placed on the ground, but at farther distances. Furthermore, this bias should be exaggerated when targets were viewed with one eye rather than two. In our experiment, participants judged the absolute egocentric distance to various cubes presented on or off the ground with an action-based measure, blind walking. We found that observers walked farther for off-ground AR objects and that this effect was exaggerated when participants viewed off-ground objects with monocular vision compared to binocular vision. However, we also found that the restriction of binocular cues influenced participants' distance judgments for on-ground AR objects. Our results suggest that distances to off-ground AR objects are perceived differently than on-ground AR objects and that the elimination of binocular cues further influences how users perceive these distances.
Conference Paper
Full-text available
Mid-air hand gesture interaction has long been proposed as a ‘natural’ input method for Augmented Reality (AR) applications, yet has been little explored for intensive applications like multiscale navigation. In multiscale navigation, such as digital map navigation, pan and zoom are the predominant interactions. A position-based input mapping (e.g. grabbing metaphor) is intuitive for such interactions, but is prone to arm fatigue. This work focuses on improving digital map navigation in AR with mid-air hand gestures, using a horizontal intangible map display. First, we conducted a user study to explore the effects of handedness (unimanual and bimanual) and input mapping (position-based and rate-based). From these findings we designed DiveZoom and TerraceZoom, two novel hybrid techniques that smoothly transition between position- and rate-based mappings. A second user study evaluated these designs. Our result indicates that the introduced input-mapping transitions can reduce perceived arm fatigue with limited impact on performance.
Article
Full-text available
More than 80% of the sensory information our brains receive comes from the eyes. Eye fatigue and associated eye diseases have become increasingly severe as digital devices have proliferated over the last decade. Visual behaviors are controlled by different muscle groups in the human visual system. People could rest and protect their eyes in time if they knew when and how fatigued their eyes were; however, they usually have no sensation when these muscles suffer from fatigue, so subjective assessments of eye fatigue are inaccurate. Objective assessments are more reliable. Some previous objective eye fatigue assessment methods depended on complex and expensive equipment, such as EEG, which users find obtrusive. Other methods depended on an eye tracker but did not provide a widely accepted definition of eye fatigue. Moreover, most existing methods can only tell whether fatigue occurs but cannot provide the fatigue level. In this paper, we provide a novel definition of eye fatigue based on seven optometry metrics. An unobtrusive eye tracker is used for the assessment. Two real-time eye fatigue assessment models are proposed, based on eye movement data and eye blink data, respectively. Both of our models can provide users with an accurate eye fatigue level.
Conference Paper
Full-text available
Head and eye movement can be leveraged to improve the user's interaction repertoire for wearable displays. Head movements are deliberate and accurate, and provide the current state-of-the-art pointing technique. Eye gaze can potentially be faster and more ergonomic, but suffers from low accuracy due to calibration errors and drift of wearable eye-tracking sensors. This work investigates precise, multimodal selection techniques using head motion and eye gaze. A comparison of speed and pointing accuracy reveals the relative merits of each method, including the achievable target size for robust selection. We demonstrate and discuss example applications for augmented reality, including compact menus with deep structure, and a proof-of-concept method for on-line correction of calibration drift.
Conference Paper
Full-text available
Augmented reality (AR) applications can leverage the full space of an environment to create immersive experiences. However, most empirical studies of interaction in AR focus on interactions with objects close to the user, generally within arm's reach. As objects move farther away, the efficacy and usability of different interaction modalities may change. This work explores AR interactions at a distance, measuring how applications may support fluid, efficient, and intuitive interactive experiences in room-scale augmented reality. We conducted an empirical study (N = 20) to measure trade-offs between three interaction modalities (multimodal voice, embodied freehand gesture, and handheld devices) for selecting, rotating, and translating objects at distances ranging from 8 to 16 feet (2.4 m to 4.9 m). Though participants performed comparably with embodied freehand gestures and handheld remotes, they perceived embodied gestures as significantly more efficient and usable than device-mediated interactions. Our findings offer considerations for designing efficient and intuitive interactions in room-scale AR applications.
Conference Paper
Full-text available
Virtual reality affords experimentation with human abilities beyond what's possible in the real world, toward novel senses of interaction. In many interactions, the eyes naturally point at objects of interest while the hands skilfully manipulate in 3D space. We explore a particular combination for virtual reality, the Gaze + Pinch interaction technique. It integrates eye gaze to select targets, and indirect freehand gestures to manipulate them. This keeps the gesture use intuitive like direct physical manipulation, but the gesture's effect can be applied to any object the user looks at --- whether located near or far. In this paper, we describe novel interaction concepts and an experimental system prototype that bring together interaction technique variants, menu interfaces, and applications into one unified virtual experience. Proof-of-concept application examples were developed and informally tested, such as 3D manipulation, scene navigation, and image zooming, illustrating a range of advanced interaction capabilities on targets at any distance, without relying on extra controller devices.
Conference Paper
Full-text available
We describe ubiGaze, a novel wearable ubiquitous method to augment any real-world object with invisible messages through gaze gestures that lock the message into the object. This enables a context and location dependent messaging service, which users can utilize discreetly and effortlessly. Further, gaze gestures can be used as an authentication method, even when the augmented object is publicly known. We developed a prototype using two wearable devices: a Pupil eye tracker equipped with a scene camera and a Sony Smartwatch 3. The eye tracker follows the users' gaze, the scene camera captures distinct features from the selected real-world object, and the smartwatch provides both input and output modalities for selecting and displaying messages. We describe the concept, design, and implementation of our real-world system. Finally, we discuss research implications and address future work.
Article
Full-text available
Square-based fiducial markers are one of the most popular approaches for camera pose estimation due to their fast detection and robustness. In order to maximize their error correction capabilities, it is required to use an inner binary codification with a large inter-marker distance. This paper proposes two Mixed Integer Linear Programming (MILP) approaches to generate configurable square-based fiducial marker dictionaries maximizing their inter-marker distance. The first approach guarantees the optimal solution; however, it can only be applied to relatively small dictionaries and numbers of bits, since the computing times are too long for many situations. The second approach is an alternative formulation to obtain suboptimal dictionaries within restricted time, achieving results that still significantly surpass current state-of-the-art methods.
Conference Paper
Full-text available
In order for natural interaction in Augmented Reality (AR) to become widely adopted, the techniques used need to be shown to support precise interaction, and the gestures used proven to be easy to understand and perform. Recent research has explored free-hand gesture interaction with AR interfaces, but there have been few formal evaluations conducted with such systems. In this paper we introduce and evaluate two natural interaction techniques: the free-hand gesture based Grasp-Shell, which provides direct physical manipulation of virtual content; and the multi-modal Gesture-Speech, which combines speech and gesture for indirect natural interaction. These techniques support object selection, 6 degree of freedom movement, uniform scaling, as well as physics-based interaction such as pushing and flinging. We conducted a study evaluating and comparing Grasp-Shell and Gesture-Speech for fundamental manipulation tasks. The results show that Grasp-Shell outperforms Gesture-Speech in both efficiency and user preference for translation and rotation tasks, while Gesture-Speech is better for uniform scaling. They could be good complementary interaction methods in a physics-enabled AR environment, as this combination potentially provides both control and interactivity in one interface. We conclude by discussing implications and future directions of this research.
Conference Paper
Full-text available
Recently there has been an increase in research towards using hand gestures for interaction in the field of Augmented Reality (AR). These works have primarily focused on researcher designed gestures, while little is known about user preference and behavior for gestures in AR. In this paper, we present our guessability study for hand gestures in AR in which 800 gestures were elicited for 40 selected tasks from 20 participants. Using the agreement found among gestures, a user-defined gesture set was created to guide designers to achieve consistent user-centered gestures in AR. Wobbrock’s surface taxonomy has been extended to cover dimensionalities in AR and with it, characteristics of collected gestures have been derived. Common motifs which arose from the empirical findings were applied to obtain a better understanding of users’ thought and behavior. This work aims to lead to consistent user-centered designed gestures in AR.
Article
Full-text available
We describe a user study comparing a low cost VR system using a Head-Mounted-Display (HMD) to a desktop and another setup where the image is projected on a screen. Eighteen participants played the same game in the three platforms. Results show that users generally did not like the setup using a screen and the best performances were obtained with the desktop configuration. This result could be due to the fact that most users were gamers used to the interaction through keyboard/mouse. Still, we noticed that user performance in the HMD setup was not dramatically worse and that users do not collide as often with walls.
Conference Paper
Full-text available
We present a 3D eye model fitting algorithm for use in gaze estimation, that operates on pupil ellipse geometry alone. It works with no user-calibration and does not require calibrated lighting features such as glints. Our algorithm is based on fitting a consistent pupil motion model to a set of eye images. We describe a non-iterative method of initialising this model from detected pupil ellipses, and two methods of iteratively optimising the parameters of the model to best fit the original eye images. We also present a novel eye image dataset, based on a rendered simulation, which gives a perfect ground truth for gaze and pupil shape. We evaluate our approach using this dataset, measuring both the angular gaze error (in degrees) and the pupil reprojection error (in pixels), and discuss the limitations of a user-calibration–free approach.
Conference Paper
Full-text available
Recently there has been an increase in research of hand gestures for interaction in the area of Augmented Reality (AR). However, this research has focused on developer designed gestures, and little is known about user preference and behavior for gestures in AR. In this paper, we present the results of a guessability study focused on hand gestures in AR. A total of 800 gestures have been elicited for 40 selected tasks from 20 participants. Using the agreement found among gestures, a user-defined gesture set was created to guide designers to achieve consistent user-centered gestures in AR.
Article
Full-text available
This paper presents a fiducial marker system specially appropriated for camera pose estimation in applications such as augmented reality and robot localization. Three main contributions are presented. First, we propose an algorithm for generating configurable marker dictionaries (in size and number of bits) following a criterion to maximize the inter-marker distance and the number of bit transitions. In the process, we derive the maximum theoretical inter-marker distance that dictionaries of square binary markers can have. Second, a method for automatically detecting the markers and correcting possible errors is proposed. Third, a solution to the occlusion problem in augmented reality applications is shown. To that aim, multiple markers are combined with an occlusion mask calculated by color segmentation. The experiments conducted show that our proposal obtains dictionaries with higher inter-marker distances and lower false negative rates than state-of-the-art systems, and provides an effective solution to the occlusion problem.
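The "inter-marker distance" that such dictionary generators maximize is, in essence, the Hamming distance between markers taken over all 90-degree rotations. The sketch below illustrates that computation under the standard square-binary-marker convention; it is not the authors' implementation, and the marker sizes and helper names are illustrative.

```python
# Minimal sketch of the inter-marker distance that square fiducial dictionary
# generators try to maximize: the Hamming distance between two binary markers,
# minimized over the four 90-degree rotations (so detection stays rotation-aware).
# Marker sizes and names are illustrative; the actual generator differs in detail.
import numpy as np
from itertools import combinations

def marker_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Min Hamming distance between marker a and the four rotations of marker b."""
    return min(int(np.sum(a != np.rot90(b, k))) for k in range(4))

def dictionary_distance(markers) -> int:
    """Smallest pairwise (rotation-aware) distance over the whole dictionary."""
    return min(marker_distance(a, b) for a, b in combinations(markers, 2))

# Example: three random 6x6 bit markers (a real generator would search/optimize).
rng = np.random.default_rng(0)
markers = [rng.integers(0, 2, size=(6, 6)) for _ in range(3)]
print("inter-marker distance:", dictionary_distance(markers))
```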
Conference Paper
Full-text available
Mid-air interactions are prone to fatigue and lead to a feeling of heaviness in the upper limbs, a condition casually termed the gorilla-arm effect. Designers have often associated limitations of their mid-air interactions with arm fatigue, but do not possess a quantitative method to assess and therefore mitigate it. In this paper we propose a novel metric, Consumed Endurance (CE), derived from the biomechanical structure of the upper arm and aimed at characterizing the gorilla-arm effect. We present a method to capture CE in a non-intrusive manner using an off-the-shelf camera-based skeleton tracking system, and demonstrate that CE correlates strongly with the Borg CR10 scale of perceived exertion. We show how designers can use CE as a complementary metric for evaluating existing and designing novel mid-air interactions, including tasks with repetitive input such as mid-air text-entry. Finally, we propose a series of guidelines for the design of fatigue-efficient mid-air interfaces. More Information: http://hci.cs.umanitoba.ca/projects-and-research/details/ce
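As a rough illustration of how an endurance-based fatigue metric of this kind can be computed, the sketch below estimates endurance time from the relative shoulder torque using a Rohmert-style formula and expresses interaction time as a fraction of it. The constants and the exact formulation in the Consumed Endurance paper may differ; treat every number and name here as an assumption.

```python
# Rough sketch of a Consumed-Endurance-style arm-fatigue metric: estimate the
# endurance time from the average shoulder torque (Rohmert-style formula) and
# report interaction time as a fraction of it. Constants and the exact model in
# the original paper may differ; this is illustrative only.
def endurance_time(avg_torque: float, max_torque: float) -> float:
    """Endurance time in seconds for a sustained relative torque (Rohmert-style)."""
    r = avg_torque / max_torque          # fraction of maximum voluntary torque
    if r <= 0.15:
        return float("inf")              # below ~15% MVC, effort is treated as sustainable
    return 1236.5 / ((r * 100.0 - 15.0) ** 0.618) - 72.5

def consumed_endurance(interaction_time_s: float, avg_torque: float, max_torque: float) -> float:
    """Consumed endurance as a percentage of the estimated endurance time."""
    return 100.0 * interaction_time_s / endurance_time(avg_torque, max_torque)

# Example: 90 s of mid-air input at 40% of the maximum shoulder torque.
print(round(consumed_endurance(90.0, 0.4, 1.0), 1), "% of endurance consumed")
```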
Article
Full-text available
The growing interest in multimodal interface design is inspired in large part by the goals of supporting more transparent, flexible, efficient, and powerfully expressive means of human-computer interaction than in the past. Multimodal interfaces are expected to support a wider range of diverse applications, be usable by a broader spectrum of the average population, and function more reliably under realistic and challenging usage conditions. In this article, we summarize the emerging architectural approaches for interpreting speech and pen-based gestural input in a robust manner, including early and late fusion approaches, and the new hybrid symbolic-statistical approach. We also describe a diverse collection of state-of-the-art multimodal systems that process users' spoken and gestural input. These applications range from map-based and virtual reality systems for engaging in simulations and training, to field medic systems for mobile use in noisy environments, to web-based transactions and standard text-editing applications that will reshape daily computing and have a significant commercial impact. To realize successful multimodal systems of the future, many key research challenges remain to be addressed. Among these challenges are the development of cognitive theories to guide multimodal system design, and the development of effective natural language processing, dialogue processing, and error-handling techniques. In addition, new multimodal systems will be needed that can function more robustly and adaptively, and with support for collaborative multiperson use. Before this new class of systems can proliferate, toolkits also will be needed to promote software development for both simulated and functioning systems.
Article
Full-text available
We propose the use of virtual environments to simulate augmented reality (AR) systems for the purposes of experimentation and usability evaluation. This method allows complete control in the AR environment, providing many advantages over testing with true AR systems. We also discuss some of the limitations to the simulation approach. We have demonstrated the use of such a simulation in a proof of concept experiment controlling the levels of registration error in the AR scenario. In this experiment, we used the simulation method to investigate the effects of registration error on task performance for a generic task involving precise motor control for AR object manipulation. Isolating jitter and latency errors, we provide empirical evidence of the relationship between accurate registration and task performance.
Article
Full-text available
We present a virtual flexible pointer that allows a user in a 3D environment to point more easily to fully or partially obscured objects, and to indicate objects to other users more clearly. The flexible pointer can also reduce the need for disambiguation and can make it possible for the user to point to more objects than currently possible with existing egocentric techniques.
Article
Full-text available
The relationship between gaze and speech is explored for the simple task of moving an object from one location to another on a computer screen. The subject moves a designated object from a group of objects to a new location on the screen by stating, "Move it there." Gaze and speech data are captured to determine if we can robustly predict the selected object and destination position. We have found that the source fixation closest to the desired object begins, with high probability, before the beginning of the word "Move". An analysis of all fixations before and after speech onset time shows that the fixation that best identifies the object to be moved occurs, on average, 630 milliseconds before speech onset with a range of 150 to 1200 milliseconds for individual subjects. The variance in these times for individuals is relatively small although the variance across subjects is large. Selecting a fixation closest to the onset of the word "Move" as the designator of the object to be moved gives a system accuracy close to 95% for all subjects. Thus, although significant differences exist between subjects, we believe that the speech and gaze integration patterns can be modeled reliably for individual users and therefore be used to improve the performance of multimodal systems.
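A minimal sketch of the selection rule this finding suggests is shown below: designate the object whose fixation onset is closest in time to the onset of the word "Move". The data structures and example values are illustrative, not the authors' implementation.

```python
# Minimal sketch of the gaze-speech integration rule suggested above: designate
# the object whose fixation onset lies closest in time to the onset of the
# spoken word "Move". Data structures and example values are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fixation:
    start_ms: float               # fixation onset time in milliseconds
    target_id: Optional[str]      # object the fixation landed on, None if background

def object_for_move_command(fixations, move_onset_ms: float) -> Optional[str]:
    """Return the id of the object whose fixation onset is nearest to speech onset."""
    on_objects = [f for f in fixations if f.target_id is not None]
    if not on_objects:
        return None
    best = min(on_objects, key=lambda f: abs(f.start_ms - move_onset_ms))
    return best.target_id

# Example: the fixation starting 630 ms before "Move" onset designates the cube.
fixes = [Fixation(1200.0, "lamp"), Fixation(2370.0, "cube"), Fixation(3400.0, None)]
print(object_for_move_command(fixes, move_onset_ms=3000.0))  # -> "cube"
```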
Article
Recent years have witnessed a tremendous increase in first-person videos captured by wearable devices. Such videos record information from different perspectives than the traditional third-person view, and thus show a wide range of potential usages. However, techniques for analyzing videos from different views can be fundamentally different, let alone co-analyzing both views to explore the shared information. In this paper, we take the challenge of cross-view video co-analysis and deliver a novel learning-based method. At the core of our method is the notion of "joint attention", indicating the shared attention regions that link the corresponding views, and eventually guide the shared representation learning across views. To this end, we propose a multi-branch deep network, which extracts cross-view joint attention and shared representation from static frames with spatial constraints, in a self-supervised and simultaneous manner. In addition, by incorporating the temporal transition model of the joint attention, we obtain spatial-temporal joint attention that can robustly capture the essential information extending through time. Our method outperforms the state-of-the-art on the standard cross-view video matching tasks on public datasets. Furthermore, we demonstrate how the learnt joint information can benefit various applications through a set of qualitative and quantitative experiments.
Article
This paper investigates the effect of using augmented reality (AR) annotations and two different gaze visualizations, head pointer (HP) and eye gaze (EG), in an AR system for remote collaboration on physical tasks. First, we developed a spatial AR remote collaboration platform that supports sharing the remote expert’s HP or EG cues. Then the prototype system was evaluated with a user study comparing three conditions for sharing non-verbal cues: (1) a cursor pointer (CP), (2) HP and (3) EG with respect to task performance, workload assessment and user experience. We found that there was a clear difference between these three conditions in the performance time but no significant difference between the HP and EG conditions. When considering the perceived collaboration quality, the HP/EG interface was statistically significantly higher than the CP interface, but there was no significant difference for workload assessment between these three conditions. We used low-cost head tracking for the HP cue and found that this served as an effective referential pointer. This implies that in some circumstances, HP could be a good proxy for EG in remote collaboration. Head pointing is more accessible and cheaper to use than more expensive eye-tracking hardware and paves the way for multi-modal interaction based on HP and gesture in AR remote collaboration.
Article
Eye gaze estimation is increasingly demanded by recent intelligent systems to facilitate a range of interactive applications. Unfortunately, learning the highly complicated regression from a single eye image to the gaze direction is not trivial. Thus, the problem is yet to be solved efficiently. Inspired by the two-eye asymmetry, as two eyes of the same person may appear uneven, we propose the face-based asymmetric regression-evaluation network (FARE-Net) to optimize the gaze estimation results by considering the difference between left and right eyes. The proposed method includes one face-based asymmetric regression network (FAR-Net) and one evaluation network (E-Net). The FAR-Net predicts 3D gaze directions for both eyes and is trained with the asymmetric mechanism, which asymmetrically weights and sums the loss generated by the two eyes' gaze directions. With the asymmetric mechanism, the FAR-Net utilizes the eye that achieves higher performance to optimize the network. The E-Net learns the reliabilities of the two eyes to balance the learning of the asymmetric mechanism and the symmetric mechanism. Our FARE-Net achieves leading performance on the MPIIGaze, EyeDiap and RT-Gene datasets. Additionally, we investigate the effectiveness of FARE-Net by analyzing the distribution of errors and an ablation study.
Article
Desktop action recognition from first-person view (egocentric) video is an important task due to its omnipresence in our daily life, and the ideal first-person viewing perspective for observing hand-object interactions. However, no previous research effort has been dedicated to benchmarking this task. In this paper, we first release a dataset of daily desktop actions recorded with a wearable camera and publish it as a benchmark for desktop action recognition. Regular desktop activities of six participants were recorded in egocentric video with a wide-angle head-mounted camera. In particular, we focus on five common desktop actions in which hands are involved. We provide original video data, action annotations at frame-level, and hand masks at pixel-level. We also propose a feature representation for the characterization of different desktop actions based on the spatial and temporal information of hands. In experiments, we illustrate the statistical information about the dataset, and evaluate the action recognition performance of different features as a baseline. The proposed method achieves promising performance for five action classes.
Article
Researchers have shown that immersive Virtual Reality (VR) can serve as an unusually powerful pain control technique. However, research assessing the reported symptoms and negative effects of VR systems indicates that it is important to ascertain whether these symptoms arise from the use of particular VR display devices, particularly for users who are deemed "at risk," such as chronic pain patients. Moreover, these patients have specific and often complex needs and requirements, and because basic issues such as 'comfort' may trigger anxiety or panic attacks, it is important to examine basic questions of the feasibility of using VR displays. Therefore, this repeated-measures experiment was conducted with two VR displays: the Oculus Rift's head-mounted display (HMD) and Firsthand Technologies' immersive desktop display, DeepStream3D. The characteristics of these immersive displays differ: one is worn, enabling patients to move their heads, while the other is peered into, allowing less head movement. To assess the severity of physical discomforts, 20 chronic pain patients tried both displays while watching a VR pain management demo in clinical settings. Results indicated that participants experienced higher levels of Simulator Sickness using the Oculus Rift HMD. However, results also indicated other preferences between the two VR displays among patients, including physical comfort levels and a sense of immersion. Few studies have been conducted that compare the usability of specific VR devices with chronic pain patients using a therapeutic virtual environment in pain clinics. Thus, the results may help clinicians and researchers to choose the most appropriate VR displays for chronic pain patients and guide VR designers to enhance the usability of VR displays for long-term pain management interventions.
Conference Paper
Eye tracking is becoming more and more affordable, and thus gaze has the potential to become a viable input modality for human-computer interaction. We present the GazeEverywhere solution that can replace the mouse with gaze control by adding a transparent layer on top of the system GUI. It comprises three parts: i) the SPOCK interaction method that is based on smooth pursuit eye movements and does not suffer from the Midas touch problem; ii) an online recalibration algorithm that continuously improves gaze-tracking accuracy using the SPOCK target projections as reference points; and iii) an optional hardware setup utilizing head-up display technology to project superimposed dynamic stimuli onto the PC screen where a software modification of the system is not feasible. In validation experiments, we show that GazeEverywhere's throughput according to ISO 9241-9 was improved over dwell time based interaction methods and nearly reached trackpad level. Online recalibration reduced interaction target ('button') size by about 25%. Finally, a case study showed that users were able to browse the internet and successfully run Wikirace using gaze only, without any plug-ins or other modifications.
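The smooth-pursuit matching idea behind SPOCK-like techniques can be sketched as correlating the recent gaze trajectory with each moving target's trajectory and selecting the best match above a threshold. The window length, the 0.8 threshold, and the function names below are illustrative assumptions, not details of GazeEverywhere.

```python
# Minimal sketch of smooth-pursuit-based selection: correlate the recent gaze
# trajectory with each moving target's trajectory and select the best-matching
# target if the correlation is high enough. Window and threshold are illustrative.
import numpy as np

def pursuit_select(gaze_xy: np.ndarray, targets_xy: dict, threshold: float = 0.8):
    """gaze_xy: (T,2) gaze samples; targets_xy: name -> (T,2) target samples."""
    best_name, best_score = None, -1.0
    for name, traj in targets_xy.items():
        # Correlate x and y components separately, then average the two.
        rx = np.corrcoef(gaze_xy[:, 0], traj[:, 0])[0, 1]
        ry = np.corrcoef(gaze_xy[:, 1], traj[:, 1])[0, 1]
        score = (rx + ry) / 2.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Example: gaze noisily follows a circular target rather than a linear one.
t = np.linspace(0, 2 * np.pi, 120)
circle = np.c_[np.cos(t), np.sin(t)]
line = np.c_[t, 0.1 * t]
gaze = circle + np.random.default_rng(1).normal(0, 0.05, circle.shape)
print(pursuit_select(gaze, {"circle": circle, "line": line}))  # -> "circle"
```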
Conference Paper
We describe a hybrid brain computer interface that integrates gaze information from an eye tracker with brain activity information measured by electroencephalography (EEG). Users explicitly control the end effector of a robot arm to move in one of four directions using motor imagery to perform a pick and place task. Measurements of the natural eye gaze behavior of subjects are used to infer the instantaneous intent of the users based on the past gaze trajectory. This information is integrated probabilistically with the output of the EEG classifier and contextual information about the environment using Bayesian inference. Our experiments demonstrate that subjects can achieve 100% task completion within three minutes and that the integration of EEG and gaze information significantly improves performance over either cue in isolation.
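The probabilistic integration described above can be sketched, under simplifying assumptions, as Bayes' rule over the four movement directions: the EEG classifier supplies the likelihood and the recent gaze trajectory supplies the prior. The numbers and the way the gaze prior is formed below are illustrative, not the authors' model.

```python
# Minimal sketch of probabilistic EEG + gaze fusion: combine the motor-imagery
# classifier's class probabilities with a gaze-derived prior over movement
# directions via Bayes' rule. All numbers here are illustrative.
import numpy as np

DIRECTIONS = ["left", "right", "up", "down"]

def fuse(eeg_probs: np.ndarray, gaze_prior: np.ndarray) -> np.ndarray:
    """Posterior over directions, proportional to likelihood times prior."""
    posterior = eeg_probs * gaze_prior
    return posterior / posterior.sum()

# Example: EEG is ambiguous between left/right, but gaze has drifted rightwards.
eeg = np.array([0.35, 0.35, 0.15, 0.15])    # classifier output
gaze = np.array([0.10, 0.60, 0.15, 0.15])   # prior inferred from gaze trajectory
post = fuse(eeg, gaze)
print(dict(zip(DIRECTIONS, np.round(post, 2))))  # "right" dominates
```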
Conference Paper
We describe the primary ways researchers can determine the size of a sample of research participants, present the benefits and drawbacks of each of those methods, and focus on improving one method that could be useful to the CHI community: local standards. To determine local standards for sample size within the CHI community, we conducted an analysis of all manuscripts published at CHI 2014. We find that sample size for manuscripts published at CHI ranges from 1 to 916,000 and the most common sample size is 12. We also find that sample size differs based on factors such as study setting and type of methodology employed. The outcome of this paper is an overview of the various ways sample size may be determined and an analysis of local standards for sample size within the CHI community. These contributions may be useful to researchers planning studies and reviewers evaluating the validity of results.
Conference Paper
Humans rely on eye gaze and hand manipulations extensively in their everyday activities. Most often, users gaze at an object to perceive it and then use their hands to manipulate it. We propose applying a multimodal, gaze plus free-space gesture approach to enable rapid, precise and expressive touch-free interactions. We show the input methods are highly complementary, mitigating issues of imprecision and limited expressivity in gaze-alone systems, and issues of targeting speed in gesture-alone systems. We extend an existing interaction taxonomy that naturally divides the gaze+gesture interaction space, which we then populate with a series of example interaction techniques to illustrate the character and utility of each method. We contextualize these interaction techniques in three example scenarios. In our user study, we pit our approach against five contemporary approaches; results show that gaze+gesture can outperform systems using gaze or gesture alone, and in general, approach the performance of "gold standard" input systems, such as the mouse and trackpad.
Conference Paper
" Head-mounted eye tracking has significant potential for gaze-based applications such as life logging, mental health monitoring, or the quantified self. A neglected challenge for the long-term recordings required by these applications is that drift in the initial person-specific eye tracker calibration, for example caused by physical activity, can severely impact gaze estimation accuracy and thus system performance and user experience. We first analyse calibration drift on a new dataset of natural gaze data recorded using synchronised video-based and Electrooculography-based eye trackers of 20 users performing everyday activities in a mobile setting. Based on this analysis we present a method to automatically self-calibrate head-mounted eye trackers based on a computational model of bottom-up visual saliency. Through evaluations on the dataset we show that our method 1) is effective in reducing calibration drift in calibrated eye trackers and 2) given sufficient data, can achieve gaze estimation accuracy competitive with that of a calibrated eye tracker, without any manual calibration.
Article
We describe a hybrid brain computer interface that integrates information from a four-class motor imagery based EEG classifier with information about gaze trajectories from an eye tracker. The novel aspect of this system is that no explicit gaze behavior is required of the user. Rather, the natural gaze behavior of the user is integrated probabilistically to smooth the noisy classification results from the motor imagery based EEG. The goal is to provide a more natural interaction with the BCI system than if gaze were used as an explicit command signal, as is commonly done. Our results on a 2D cursor control task show that integration of gaze information significantly improves task completion accuracy and reduces task completion time. In particular, our system achieves over 80% target completion accuracy on a cursor control task requiring guidance to one of 12 targets.
Article
Many object pointing and selecting techniques for large screens have been proposed in the literature. There is a lack of quantitative evidence suggesting proper pointing postures for interacting with stereoscopic targets in immersive virtual environments. The objective of this study was to explore users' performances and experiences of using different postures while interacting with 3D targets remotely in an immersive stereoscopic environment. Two postures, hand-directed and gaze-directed pointing methods, were compared in order to investigate the postural influences. Two stereo parallaxes, negative and positive parallaxes, were compared for exploring how target depth variances would impact users' performances and experiences. Fifteen participants were recruited to perform two interactive tasks, tapping and tracking tasks, to simulate interaction behaviors in the stereoscopic environment. Hand-directed pointing is suggested for both tapping and tracking tasks due to its significantly better overall performance, less muscle fatigue, and better usability. However, a gaze-directed posture is probably a better alternative than hand-directed pointing for tasks with high accuracy requirements in home-in phases. Additionally, it is easier for users to interact with targets with negative parallax than with targets with positive parallax. Based on the findings of this research, future applications involving different pointing techniques should consider both pointing performances and postural effects as a result of pointing task precision requirements and potential postural fatigue.
Conference Paper
This paper presents an experiment that was conducted to investigate gaze combined with voice commands. There has been very little research about the design of voice commands for this kind of input. It is not yet known whether users prefer longer sentences, as in natural dialogues, or short commands. In the experiment, three different voice commands were compared during a simple task in which participants had to drag & drop, rotate, and resize objects. It turned out that the shortness of a voice command, in terms of the number of words, is more important than it being absolutely natural. Participants preferred the voice command with the fewest words and the fewest syllables. For the voice commands which had the same number of syllables, the users also preferred the one with the fewest words, even though there were no big differences in time and errors.
Article
People naturally interact with the world multimodally, through both parallel and sequential use of multiple perceptual modalities. Multimodal human–computer interaction has sought for decades to endow computers with similar capabilities, in order to provide more natural, powerful, and compelling interactive experiences. With the rapid advance in non-desktop computing generated by powerful mobile devices and affordable sensors in recent years, multimodal research that leverages speech, touch, vision, and gesture is on the rise. This paper provides a brief and personal review of some of the key aspects and issues in multimodal interaction, touching on the history, opportunities, and challenges of the area, especially in the area of multimodal integration. We review the question of early vs. late integration and find inspiration in recent evidence in biological sensory integration. Finally, we list challenges that lie ahead for research in multimodal human–computer interaction.
Article
In this paper, we describe a user study evaluating the usability of an augmented reality (AR) multimodal interface (MMI). We have developed an AR MMI that combines free-hand gesture and speech input in a natural way using a multimodal fusion architecture. We describe the system architecture and present a study exploring the usability of the AR MMI compared with speech-only and 3D-hand-gesture-only interaction conditions. The interface was used in an AR application for selecting 3D virtual objects and changing their shape and color. For each interface condition, we measured task completion time, the number of user and system errors, and user satisfaction. We found that the MMI was more usable than the gesture-only interface condition, and users felt that the MMI was more satisfying to use than the speech-only interface condition; however, it was neither more effective nor more efficient than the speech-only interface. We discuss the implications of this research for designing AR MMIs and outline directions for future work. The findings could also be used to help develop MMIs for a wider range of AR applications, for example, in AR navigation tasks, mobile AR interfaces, or AR game applications.
Article
Holmqvist, K., Nyström, N., Andersson, R., Dewhurst, R., Jarodzka, H., & Van de Weijer, J. (Eds.) (2011). Eye tracking: a comprehensive guide to methods and measures, Oxford, UK: Oxford University Press.
Conference Paper
NASA-TLX is a multi-dimensional scale designed to obtain workload estimates from one or more operators while they are performing a task or immediately afterwards. The years of research that preceded subscale selection and the weighted averaging approach resulted in a tool that has proven to be reasonably easy to use and reliably sensitive to experimentally important manipulations over the past 20 years. Its use has spread far beyond its original application (aviation), focus (crew complement), and language (English). This survey of 550 studies in which NASA-TLX was used or reviewed was undertaken to provide a resource for a new generation of users. The goal was to summarize the environments in which it has been applied, the types of activities the raters performed, other variables that were measured that did (or did not) covary, methodological issues, and lessons learned.
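For reference, the weighted NASA-TLX score combines six subscale ratings (0 to 100) using weights obtained from the 15 pairwise "which contributed more to workload" comparisons; each subscale's weight is the number of times it was chosen, so the weights sum to 15. A minimal sketch with illustrative numbers:

```python
# Minimal sketch of the weighted NASA-TLX score: six subscale ratings (0-100)
# averaged with weights from the 15 pairwise comparisons (each weight is the
# number of times that subscale was chosen, so weights sum to 15).
SUBSCALES = ["mental", "physical", "temporal", "performance", "effort", "frustration"]

def weighted_tlx(ratings: dict, tally: dict) -> float:
    assert sum(tally.values()) == 15, "pairwise tallies must sum to 15"
    return sum(ratings[s] * tally[s] for s in SUBSCALES) / 15.0

# Example ratings and pairwise tallies (illustrative only).
ratings = {"mental": 70, "physical": 30, "temporal": 55,
           "performance": 40, "effort": 65, "frustration": 50}
tally = {"mental": 5, "physical": 1, "temporal": 2,
         "performance": 3, "effort": 3, "frustration": 1}
print(round(weighted_tlx(ratings, tally), 1))  # -> 57.0
```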
Article
Since humans direct their visual attention by means of eye movements, a device which monitors eye movements should be a natural "pick" device for selecting objects visually present on a monitor. The results from an experimental investigation of an eye tracker as a computer input device are presented. Three different methods were used to select the object looked at: a button press, prolonged fixation or "dwell", and an on-screen select button. The results show that an eye tracker can be used as a fast selection device provided that the target size is not too small. If the targets are small, speed declines and errors increase rapidly.
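A minimal sketch of the "prolonged fixation" (dwell) selection method compared in that study: a target is selected once gaze has remained on it continuously for a dwell threshold. The sample format and the 600 ms threshold are illustrative assumptions.

```python
# Minimal sketch of dwell-based gaze selection: a target is selected once gaze
# has stayed on it continuously for a dwell threshold. Sample format and the
# 600 ms threshold are illustrative choices.
def dwell_select(samples, dwell_ms: float = 600.0):
    """samples: iterable of (timestamp_ms, target_id or None), in time order."""
    current, since = None, None
    for t, target in samples:
        if target != current:
            current, since = target, t           # gaze moved to a new target
        elif target is not None and t - since >= dwell_ms:
            return target                        # dwelled long enough -> select
    return None

# Example: gaze rests on the "ok" button for 650 ms, so it gets selected.
samples = [(0, None), (50, "ok"), (100, "ok"), (400, "ok"), (700, "ok")]
print(dwell_select(samples))  # -> "ok"
```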
Article
Virtual reality (VR) systems are used in a variety of applications within industry, education, public and domestic settings. Research assessing reported symptoms and side effects of using VR systems indicates that these factors combine to influence user experiences of virtual reality induced symptoms and effects (VRISE). Three experiments were conducted to assess prevalence and severity of sickness symptoms experienced in each of four VR display conditions; head mounted display (HMD), desktop, projection screen and reality theatre, with controlled examination of two additional aspects of viewing (active vs. passive viewing and light vs. dark conditions). Results indicate 60–70% participants experience an increase in symptoms pre–post exposure for HMD, projection screen and reality theatre viewing and found higher reported symptoms in HMD compared with desktop viewing (nausea symptoms) and in HMD compared with reality theatre viewing (nausea, oculomotor and disorientation symptoms). No effect of lighting condition was found. Higher levels of symptoms were reported in passive viewing compared to active control over movement in the VE. However, the most notable finding was that of high inter- and intra-participant variability. As this supports other findings of individual susceptibility to VRISE, recommendations are offered concerning design and use of VR systems in order to minimise VRISE.
Article
The System Usability Scale (SUS) is an inexpensive, yet effective tool for assessing the usability of a product, including Web sites, cell phones, interactive voice response systems, TV applications, and more. It provides an easy-to-understand score from 0 (negative) to 100 (positive). While a 100-point scale is intuitive in many respects and allows for relative judgments, information describing how the numeric score translates into an absolute judgment of usability is not known. To help answer that question, a seven-point adjective-anchored Likert scale was added as an eleventh question to nearly 1,000 SUS surveys. Results show that the Likert scale scores correlate extremely well with the SUS scores (r=0.822). The addition of the adjective rating scale to the SUS may help practitioners interpret individual SUS scores and aid in explaining the results to non-human factors professionals.
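For reference, the standard SUS scoring rule maps ten 1-to-5 responses to a 0-to-100 score: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. A minimal sketch with illustrative responses:

```python
# Minimal sketch of standard SUS scoring: ten items on a 1-5 scale; odd items
# contribute (response - 1), even items (5 - response); the sum times 2.5
# gives the 0-100 score. Example responses are illustrative.
def sus_score(responses):
    """responses: list of ten 1-5 ratings in questionnaire order."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # i = 0, 2, ... are items 1, 3, ...
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```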
Conference Paper
This paper contributes to the nascent body of literature on pointing performance in Virtual Environments (VEs), comparing gaze- and hand-based pointing. Contrary to previous findings, preliminary results indicate that gaze-based pointing is slower than hand-based pointing for distant objects.