Article

Abstract

Virtual reality (VR) is rapidly growing, with the potential to change the way we create and consume content. In VR, users integrate multimodal sensory information they receive, to create a unified perception of the virtual world. In this survey, we review the body of work addressing multimodality in VR, and its role and benefits in user experience, together with different applications that leverage multimodality in many disciplines. These works thus encompass several fields of research, and demonstrate that multimodality plays a fundamental role in VR; enhancing the experience, improving overall performance, and yielding unprecedented abilities in skill and knowledge transfer.

... Similarly, inhibitory interactions can also occur when dealing with multimodal information. Crossmodal perception is defined as those multimodal effects that involve interactions between two or more different sensory modalities [35]. These interactions can be facilitatory (for example, decreasing reaction time in a search task when two modalities are presented synchronously) or inhibitory, depending on how the cerebral cortex is activated to process the perceived sensory signals. ...
... Multimodality stands as a prominent investigation subject within the VR field, becoming a multidisciplinary topic of interest for neuroscientists, graphics practitioners and content creators alike [35]. Within the context of our research, the integration of multimodal techniques can have a great impact on users' virtual experience and sense of immersion by properly integrating information from different sensory modalities in a realistic way. ...
... Within the context of our research, the integration of multimodal techniques can have a great impact on users' virtual experience and sense of immersion by properly integrating information from different sensory modalities in a realistic way. We refer the reader to the work of Martin et al. [35] for a comprehensive study of how multimodality is being used to improve VR experiences. ...
Article
Full-text available
Virtual reality (VR) has the potential to become a revolutionary technology with a significant impact on our daily lives. The immersive experience provided by VR equipment, where the user’s body and senses are used to interact with the surrounding content, accompanied by the feeling of presence, elicits a realistic behavioral response. In this work, we leverage the full control of audiovisual cues provided by VR to study an audiovisual suppression effect (ASE) where auditory stimuli degrade visual performance. In particular, we explore if barely audible sounds (in the range of the limits of hearing frequencies) generated following a specific spatiotemporal setup can still trigger the ASE while participants are experiencing high cognitive loads. A first study is carried out to find out how sound volume and frequency can impact this suppression effect, while the second study includes higher cognitive load scenarios closer to real applications. Our results show that the ASE is robust to variations in frequency, volume and cognitive load, achieving a reduction of visual perception with the proposed hardly audible sounds. Using such auditory cues means that this effect could be used in real applications, from entertainment to VR techniques like redirected walking.
... To reach a high level of immersion with regard to extensiveness, multiple sensory modalities must be used in accordance with the desired environment. The types of sensory modalities that can be integrated into XR systems can be divided into two groups: external stimuli (visual (V), auditory (A), haptic (H), olfactory (O), gustatory (G) and thermal (T)), and internal stimuli (vestibular (B) and proprioceptive (P)) [57]. These sensory modalities can be used to enhance the VE by stimulating specific senses, so that each sensory modality can be used on its own to create unimodal experiences, but can also be combined with others to create multisensory stimulation experiences. ...
... These sensory modalities can be used to enhance the VE by stimulating specific senses, so that each sensory modality can be used on its own to create unimodal experiences, but can also be combined with others to create multisensory stimulation experiences. Using multiple sensory modalities that are adapted and integrated into the VE, the system is more likely to impact SoP [12,13,48,57-70], as well as user performance [57,58,61,69,71]. However, the more senses are present in the VE, the higher the probability of sensory mismatch, which can induce SS [6] (see more in Section 2.7), as well as BIPs (see more in Section 2.6.3). ...
Thesis
Full-text available
Flight simulators are a central training method for pilots, and with advances in human-computer interaction, new cutting-edge technology introduces a new type of simulator using extended reality (XR). XR is an umbrella term for many representative forms of reality, where physical reality (PR) and virtual reality (VR) are the endpoints of this spectrum, and any reality in between can be seen as mixed reality (MR). The purpose of this thesis was to investigate the applicability of XR to flight simulators and how the different realities compare with each other in terms of usability, immersion, presence, and simulator sickness. To answer these questions, an MR and a VR version were implemented in Unity using the Varjo XR-3 head-mounted display, based on the Framework for Immersive Virtual Environments (FIVE). To evaluate these aspects, a user study (N = 11) was conducted, focusing on quantitative and qualitative experimental research methods. Interaction with physical interfaces is a core procedure for pilots; thus, three reaction tests were conducted in which participants had to press, within a given time, a randomly chosen button lit green in a 3 x 3 Latin square layout, to measure the efficiency of interaction in both versions. Reaction tests were conducted at different complexities: simple (no flight), moderate (easy flight), and advanced (difficult flight). Participants experienced the MR and VR versions, and completed complementary questionnaires on immersion, presence, and simulator sickness while remaining in the simulation. The user study showed that usability in MR is considerably higher, and that MR is more immersive than VR when interaction is incorporated. However, excluding the interaction aspects, VR was more immersive. Overall, this work demonstrates how to achieve high levels of immersion and a strong sense of presence simultaneously, while keeping simulator sickness minimal in a relatively realistic experience.
... Task and time management. Information widgets offer a novel way to present essential office data like time and progress, optimizing workflow [90]. The existing research has demonstrated that incorporating multiple sensory modalities (such as visual, auditory, and haptic) in the design of user interfaces enhanced information presentation, reception, comprehension, and acquisition efficiency [17,26,69]. By presenting information across multiple sensory channels, users can receive more comprehensive and intuitive feedback [85]. ...
... By presenting information across multiple sensory channels, users can receive more comprehensive and intuitive feedback [85]. Relevant research has demonstrated that the integration of modalities has the potential to enhance users' sense of immersion and optimize their interactive experience [69]. Current multimodal HCI research lacks comparative quantitative studies on the impact of information widgets across modalities, focusing primarily on specialized domains like cockpits and control panels while neglecting information perception in mentally demanding everyday tasks, particularly those involving time and task management, such as countdown timers. ...
Conference Paper
Full-text available
This article examined how different time and task management information widgets affect time perception across modalities. In mentally demanding office environments, effective countdown representations are crucial for enhancing temporal awareness and productivity. We developed TickSens, a set of information widgets with different modalities, and conducted a within-subjects experiment with 30 participants to evaluate five time perception modes: visual, auditory, and haptic, as well as the blank and timer modes. Our assessment focused on technology acceptance, cognitive performance and emotional responses. Results indicated that compared to the blank and the timer modes, the use of modalities significantly improved cognitive performance and positive emotional responses, and was better received by participants. The visual mode had the best task performance, while the auditory feedback was effective in boosting focus and the haptic mode significantly enhanced user acceptance. The study revealed varied user preferences that inform the integration of these widgets into office settings.
... In terms of sensory affordances within the VR realm, advancements in visual and auditory affordances are significant, while other sensory modes, such as the haptic, olfactory, and gustatory ones, lag behind [71]. Auditory feedback, commonly delivered through speakers or headphones, is standard in most consumer VR devices as a built-in feature [72]. While several works have demonstrated the benefits of olfactory and haptic (thermal and tactile) inputs for enhancing user experience in VR [51,73-75], these sensory modalities have not been widely adopted and integrated into mainstream VR applications. ...
... Incorporating olfactory elements into multisensory VR environments is a promising yet underexplored area for enhancing user experience [49,76,77]. While visual and auditory stimuli dominate VR design [71,72], smell stands out for its strong connections to memory, emotion, and spatial awareness [78,79]. Studies show that integrating smells into VR enhances realism, immersion, and presence [45,51,80,81]. ...
... In terms of presence and user preference, multisensory systems have been proven superior to traditional virtual systems based on visual feedback [1]. Virtual reality (VR) is widely used in simulations and visualization applications to make the overall user experience more immersive and engaging [2]. Research findings demonstrated that VR provided a more intuitive visualization experience within a synthetic space to help users understand the complexity of the data dynamics, including both geographical and urban environmental data [3,4,5]. ...
... Research findings demonstrated that VR provided a more intuitive visualization experience within a synthetic space to help users understand the complexity of the data dynamics, including both geographical and urban environmental data [3,4,5]. In addition to a wider field of view to match the real-life experience, VR provided sophisticated sound spatialization with binaural audio, creating a 3D sound effect [2]. Most saliently, Hunt et al. [6] emphasized interaction with sonification to promote user engagement with the virtual environment (VE) and achieve a fluent interaction style. ...
Conference Paper
Full-text available
This research investigated audio-visual analytics of geoscientific data in a virtual reality (VR)-enhanced implementation, where users interacted with the dataset using a VR controller and a haptic device. Each interface allowed users to explore rock minerals in unimodal and multimodal virtual environments (VE). In the unimodal version, color variations demonstrated differences in minerals. As users navigated the data using different interfaces, visualization options could be switched between the original geographical topology and its color-coded version, signifying underlying minerals. During the multimodal navigation of the dataset, in addition to the visual feedback, an auditory display was provided by playing musical tones in different timbres. In total, ten underlying minerals in the sample were explored; among them, anorthite was represented by a nylon guitar, a grand piano was used for albite, and so on. Initial findings showed that users preferred the audio-visual exploration of geoscientific data over the visual-only version. Virtual touch enhanced the user experience while interacting with the data.
... Providing multimodal input can affect the realism and immersion of the experience [1]. A challenge for multimodal VR setups, however, is the integration of multiple devices and the synchronization of multisensory inputs [2]. ...
... Some devices have been proposed that would expand a visual VR within a head mounted display (HMD) by including tactile sensory input. Contacting devices like controllers or wearables such as gloves or wristbands that provide vibratory or electrical feedback can be implemented in such VR setups [2,3]. ...
Article
Full-text available
Objective. To create highly immersive experiences in virtual reality (VR) it is important to not only include the visual sense but also to involve multimodal sensory input. To achieve optimal results, the temporal and spatial synchronization of these multimodal inputs is critical. It is therefore necessary to find methods to objectively evaluate the synchronization of VR experiences with a continuous tracking of the user. Approach. In this study a passive touch experience was incorporated in a visual-tactile VR setup using VR glasses and tactile sensations in mid-air. Inconsistencies of multimodal perception were intentionally integrated into a discrimination task. The participants’ electroencephalogram (EEG) was recorded to obtain neural correlates of visual-tactile mismatch situations. Main results. The results showed significant differences in the event-related potentials (ERP) between match and mismatch situations. A biphasic ERP configuration consisting of a positivity at 120 ms and a later negativity at 370 ms was observed following a visual-tactile mismatch. Significance. This late negativity could be related to the N400 that is associated with semantic incongruency. These results provide a promising approach towards the objective evaluation of visual-tactile synchronization in virtual experiences.
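To make the analysis pipeline sketched in this abstract concrete, the following minimal example epochs a continuous single-channel EEG trace around match and mismatch events and averages them into ERPs; the sampling rate, epoch window, event markers, and data are illustrative assumptions, not the authors' recording setup.

```python
import numpy as np

def erp(eeg, onsets, fs=1000, tmin=-0.2, tmax=0.6):
    """Average event-locked epochs from a single-channel EEG trace.

    eeg    : 1-D array of continuous EEG samples (volts)
    onsets : sample indices of stimulus onsets
    fs     : sampling rate in Hz
    """
    pre, post = int(-tmin * fs), int(tmax * fs)
    epochs = np.stack([eeg[o - pre:o + post] for o in onsets
                       if o - pre >= 0 and o + post <= len(eeg)])
    # Baseline-correct each epoch using the pre-stimulus interval.
    epochs -= epochs[:, :pre].mean(axis=1, keepdims=True)
    return epochs.mean(axis=0)  # the ERP

fs = 1000
rng = np.random.default_rng(0)
eeg = rng.normal(0, 1e-6, 120 * fs)             # two minutes of toy data
match_onsets = rng.integers(fs, 119 * fs, 40)    # hypothetical event markers
mismatch_onsets = rng.integers(fs, 119 * fs, 40)

erp_match = erp(eeg, match_onsets)
erp_mismatch = erp(eeg, mismatch_onsets)
diff = erp_mismatch - erp_match                  # mismatch-minus-match difference wave

# Inspect the difference wave at the latencies reported above (120 ms and 370 ms).
for ms in (120, 370):
    idx = int((0.2 + ms / 1000) * fs)            # offset by the 200 ms baseline
    print(f"difference at {ms} ms: {diff[idx]:.2e} V")
```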
... The utilization of Virtual Reality (VR) is rapidly growing across diverse domains, carrying the inherent potential to enhance immersion and engagement with content and tasks [35]. The areas where VR is utilized are no longer confined to entertainment, which emphasizes the significance of enhancing users' task performance in contexts such as education, rehabilitation, and training [18]. ...
... Nonetheless, studies evaluating the effect of attentional cues on task performance under different levels of cognitive load have been rare. VR is characterized by the simultaneous occurrence of multiple events in multiple sensory modalities, including visual, auditory, and tactile stimulation [35]. However, the cognitive load on the visual modality is particularly heavy due to the stereoscopic views of immersive VR head-mounted displays (HMDs) [50]. ...
Article
As the utilization of VR is expanding across diverse fields, research on devising attentional cues that could optimize users' task performance in VR has become crucial. Since the cognitive load imposed by the context and the individual's cognitive capacity are representative factors that are known to determine task performance, we aimed to examine how the effects of multisensory attentional cues on task performance are modulated by the two factors. For this purpose, we designed a new experimental paradigm in which participants engaged in dual (N-back, visual search) tasks under different levels of cognitive load while an attentional cue (visual, tactile, or visuotactile) was presented to facilitate search performance. The results showed that multi-sensory attentional cues are generally more effective than uni-sensory cues in enhancing task performance, but the benefit of multi-sensory cues changes according to the level of cognitive load and the individual's cognitive capacity; the amount of benefit increases as the cognitive load is higher and the cognitive capacity is lower. The findings of this study provide practical implications for designing attentional cues to enhance VR task performance, considering both the complexity of the VR context and users' internal characteristics.
... When concepts are presented in multiple modalities (i.e., multimodality), the communication conveys more detailed and comprehensive information (Di Mitri et al., 2018). Immersive technologies can reproduce multimodal sensory information, including visual, auditory, and haptic inputs (Martin et al., 2022), and can recognize user inputs from different modalities (Cohen et al., 1999). This approach creates a sense of heightened presence in the learning environment (Martin et al., 2022). ...
... Immersive technologies can reproduce multimodal sensory information, including visual, auditory, and haptic inputs (Martin et al., 2022), and can recognize user inputs from different modalities (Cohen et al., 1999). This approach creates a sense of heightened presence in the learning environment (Martin et al., 2022). Additionally, ILS incorporate game elements into learning tasks to increase engagement, fun, and excitement (Plass et al., 2015). ...
Article
Full-text available
Developing immersive learning systems is challenging due to their multidisciplinary nature, involving game design, pedagogical modelling, computer science, and the application domain. The diversity of technologies, practices, and interventions makes it hard to explore solutions systematically. A new methodology called Multimodal Immersive Learning Systems Design Methodology (MILSDeM) is introduced to address these challenges. It includes a unified taxonomy, key performance indicators, and an iterative development process to foster innovation and creativity while enabling reusability and organisational learning. This article further reports on applying design-based research to design and develop MILSDeM. It also discusses the application of MILSDeM through its implementation in a real-life project conducted by the research team, which included four initiatives and eight prototypes. Moreover, the article introduces a unified taxonomy and reports on the qualitative analysis conducted to assess its components by experts from different domains.
... In previous CVR studies, researchers developed new viewing modalities in the following four directions: guidance cues, intervened rotation, perspective shifting, and avatar assistance [92]. Since attention is a crucial factor in CVR, effectively managing it is essential for creating immersive and engaging experiences. ...
Preprint
Cinematic Virtual Reality (CVR) is a narrative-driven VR experience that uses head-mounted displays with a 360-degree field of view. Previous research has explored different viewing modalities to enhance viewers' CVR experience. This study conducted a systematic review and meta-analysis focusing on how different viewing modalities, including intervened rotation, avatar assistance, guidance cues, and perspective shifting, influence the CVR experience. The study screened 3444 papers (published between 01/01/2013 and 17/06/2023) and selected 45 for systematic review, 13 of which were also included in the meta-analysis. We conducted separate random-effects meta-analyses and applied Robust Variance Estimation to examine CVR viewing modalities and user experience outcomes. Evidence from experiments was synthesized as standardized mean differences (SMDs) in user experience between the control group ("Swivel-Chair" CVR) and the experimental groups. To our surprise, we found inconsistencies in the effect sizes across different studies, even for the same viewing modalities. Moreover, in these studies, terms such as "presence," "immersion," and "narrative engagement" were often used interchangeably. Their irregular use of questionnaires, overreliance on self-developed questionnaires, and incomplete data reporting may have led to insufficiently rigorous evaluations of CVR experiences. This study contributes to Human-Computer Interaction (HCI) research by identifying gaps in CVR research and emphasizing the need for standardization of terminologies and methodologies to enhance the reliability and comparability of future CVR research.
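For readers unfamiliar with the synthesis described above, the following sketch shows the standard computations behind a random-effects pooling of standardized mean differences (Hedges' g with a DerSimonian-Laird estimate); the per-study summary statistics are invented for illustration, and the Robust Variance Estimation step is omitted.

```python
import numpy as np

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with small-sample (Hedges) correction."""
    sp = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)           # small-sample correction factor
    g = j * d
    var_g = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
    return g, var_g

def random_effects(gs, vs):
    """DerSimonian-Laird random-effects pooled estimate."""
    gs, vs = np.asarray(gs), np.asarray(vs)
    w = 1 / vs                                     # fixed-effect weights
    q = np.sum(w * (gs - np.sum(w * gs) / w.sum())**2)
    c = w.sum() - np.sum(w**2) / w.sum()
    tau2 = max(0.0, (q - (len(gs) - 1)) / c)       # between-study variance
    w_star = 1 / (vs + tau2)
    pooled = np.sum(w_star * gs) / w_star.sum()
    se = np.sqrt(1 / w_star.sum())
    return pooled, se, tau2

# Toy per-study summaries: (mean, sd, n) for experimental vs "Swivel-Chair" control.
studies = [((4.1, 1.0, 24), (3.6, 1.1, 24)),
           ((5.0, 0.9, 30), (4.2, 1.2, 30)),
           ((3.8, 1.3, 18), (3.9, 1.2, 18))]
gs, vs = zip(*(hedges_g(*e, *c) for e, c in studies))
pooled, se, tau2 = random_effects(gs, vs)
print(f"pooled SMD = {pooled:.2f} +/- {1.96 * se:.2f} (tau^2 = {tau2:.3f})")
```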
... In a similar fashion, Passmore et al. reported that stereoscopy did not have an effect on the efficiency of haptic feedback in a path-following task [114]. Furthermore, previous work reported a negative relationship between immersive display environments and haptics, as multimodality in virtual reality can sometimes lead to sensory overload, hampering task performance to the point where users might express a preference for simpler environments [95]. ...
Article
Full-text available
Haptic feedback reportedly enhances human interaction with 3D data, particularly improving the retention of mental representations of digital objects in immersive settings. However, the effectiveness of visuohaptic integration in promoting object retention across different display environments remains underexplored. Our study extends previous research on the retention effects of haptics from virtual reality to a projected surface display to assess whether earlier findings generalize to 2D environments. Participants performed a delayed match-to-sample task incorporating visual, haptic, and visuohaptic sensory feedback within a projected surface display environment. We compared error rates and response times across these sensory modalities and display environments. Our results reveal that visuohaptic integration significantly enhances object retention on projected surfaces, benefiting task performance across display environments. Our findings suggest that haptics can improve object retention without requiring fully immersive setups, offering insights for the design of interactive systems that assist professionals who rely on precise mental representations of digital objects.
... Multimodal elements have been found to support meaning negotiation, and studies emphasize their importance in meaning negotiation and second language acquisition processes (Canals, 2021;Chen & Sevilla-Pavón, 2023;Wigham & Satar, 2021). Despite its potential, research on multimodality within HISVR, especially in the context of intercultural communication, is limited (Chen & Sevilla-Pavón, 2023;Karimi et al., 2023;Martin et al., 2022). This research gap underscores the need to explore multimodal elements in HISVR environments, specifically regarding intercultural interactions. ...
... Some scholars have combined narration, voice commands, and haptic feedback to enhance user awareness of environmental issues, including air and noise pollution [33]. VR's ability to integrate multimodal sensory information facilitates immersive experiences that exceed the engagement levels of traditional media, which often depend on singular modalities [44]. ...
Preprint
Full-text available
Timely and adequate risk communication before natural hazards can reduce losses from extreme weather events and provide more resilient disaster preparedness. However, existing natural hazard risk communications have been abstract, ineffective, not immersive, and sometimes counterproductive. The implementation of virtual reality (VR) for natural hazard risk communication presents a promising alternative to the existing risk communication system by offering immersive and engaging experiences. However, it is still unknown how different modalities in VR could affect individuals' mitigation behaviors related to incoming natural hazards. In addition, it is not clear how repetitive risk communication through different modalities in the VR system leads to risk habituation. To fill this knowledge gap, we developed a VR system with a tornado risk communication scenario and conducted a mixed-design human subject experiment (N = 24). We comprehensively investigated our research questions using both quantitative and qualitative results.
... Early attempts at multimodal integration were relatively rudimentary, often involving basic sensor fusion techniques that combined data from different sensors linearly or sequentially. However, these early systems laid the groundwork for more sophisticated approaches, demonstrating that integrating multiple modalities could significantly enhance the perception capabilities of autonomous systems (Martin, Malpica, Gutierrez, Masia, & Serrano, 2022;Mohd, Nguyen, & Javaid, 2022). The evolution of sensor technologies and data processing techniques over the past few decades has been instrumental in advancing multimodal perception systems. ...
Article
Full-text available
Autonomous systems are increasingly deployed in various domains, including transportation, robotics, and industrial automation. However, their ability to accurately perceive and understand their environment remains a significant challenge, particularly when relying on a single modality such as vision or sound. This review paper comprehensively examines multimodal perception systems, emphasizing the integration of visual, auditory, and tactile data to enhance environmental understanding and state estimation. The paper traces the evolution of multimodal perception, reviews the key modalities and data fusion techniques, and identifies the current challenges these systems face, such as environmental uncertainty, sensor limitations, and computational complexity. Furthermore, it proposes enhancement strategies, including adopting advanced sensor technologies, improved data fusion methodologies, and adaptive learning systems. The paper concludes by exploring future directions, highlighting emerging trends, and identifying research gaps that must be addressed to advance the field. The findings underscore the importance of robust multimodal perception for the success of autonomous systems in dynamic and uncertain environments. Keywords: Multimodal Perception, Autonomous Systems, Sensor Fusion, Data Integration, Adaptive Learning.
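As a concrete, simplified illustration of the fusion strategies such reviews discuss, the toy sketch below contrasts early fusion (concatenating modality features before a single model) with late fusion (combining per-modality scores); the feature sizes, weights, and stand-in classifier are assumptions, not a method from the reviewed literature.

```python
import numpy as np

rng = np.random.default_rng(1)
vision = rng.normal(size=(8, 128))   # toy per-frame visual features
audio = rng.normal(size=(8, 32))     # toy audio features
touch = rng.normal(size=(8, 16))     # toy tactile features

# Early fusion: concatenate low-level features, then apply one downstream model.
early = np.concatenate([vision, audio, touch], axis=1)        # shape (8, 176)

# Late fusion: run one model per modality, then combine their output scores
# (a random linear classifier with softmax stands in for each per-modality model).
def toy_classifier(x, n_classes=3, seed=0):
    w = np.random.default_rng(seed).normal(size=(x.shape[1], n_classes))
    logits = x @ w
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

scores = [toy_classifier(m, seed=s) for s, m in enumerate((vision, audio, touch))]
weights = np.array([0.5, 0.3, 0.2])                           # e.g. trust vision most
late = sum(w * s for w, s in zip(weights, scores))
print(early.shape, late.argmax(axis=1))
```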
... Language models have evolved substantially, transitioning from basic statistical methods to complex neural network architectures driving modern LLMs. This evolution reflects a relentless pursuit of models that capture the intricacies of human language more accurately, expanding the possibilities of machine understanding and generation [23][24][25][26][27][28][29][30][31][32][33]. However, this rapid progress has also raised ethical and safety concerns, prompting a reevaluation of development practices and use cases [34][35][36][37][38][39][40]. ...
Article
Full-text available
RETRACTED: This article has been retracted at the request of our research integrity team. The retraction notice can be found here https://doi.org/10.4108/airo.7168
... Subsequently, the extracted features are fused at different levels (including early, middle, or late fusion [58,35,36]) to address specific tasks, such as classification, recognition, description, and segmentation [25,26,12,1,10]. Heterogeneous information from multisource sensors, however, is susceptible to confounding effects caused by extraneous variables or multiple distributions [30,7,5,11,57,32,6]. Confounding variables pertain to external factors that introduce bias (either positive or negative) in the relationship between the variables being studied [42,50]. ...
Article
Full-text available
Recent years have witnessed a surge of interest in integrating high-dimensional data captured by multisource sensors, driven by the impressive success of neural networks in integrating multimodal data. However, the integration of heterogeneous multimodal data poses a significant challenge, as confounding effects and dependencies among such heterogeneous data sources introduce unwanted variability and bias, leading to suboptimal performance of multimodal models. Therefore, it becomes crucial to normalize the low- or high-level features extracted from data modalities before their fusion takes place. This paper introduces RegBN, a novel approach for multimodal Batch Normalization with REGularization. RegBN uses the Frobenius norm as a regularizer term to address the side effects of confounders and underlying dependencies among different data sources. The proposed method generalizes well across multiple modalities and eliminates the need for learnable parameters, simplifying training and inference. We validate the effectiveness of RegBN on eight databases from five research areas, encompassing diverse modalities such as language, audio, image, video, depth, tabular, and 3D MRI. The proposed method demonstrates broad applicability across different architectures such as multilayer perceptrons, convolutional neural networks, and vision transformers, enabling effective normalization of both low- and high-level features in multimodal neural networks. RegBN is available at https://mogvision.github.io/RegBN.
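The abstract does not spell out RegBN's algorithm, so the following is only a generic illustration of the underlying goal of suppressing linear dependencies between two modalities' features before fusion, via a least-squares residualization whose objective is a Frobenius norm; it is not the authors' method, and all shapes and data are made up.

```python
import numpy as np

def residualize(f_a, f_b, eps=1e-6):
    """Remove the part of f_a that is linearly predictable from f_b.

    f_a, f_b : (batch, d_a) and (batch, d_b) feature matrices from two modalities.
    Returns f_a minus its least-squares projection onto f_b (the projection W
    minimizes the Frobenius norm ||f_a - f_b @ W||_F, ridge-stabilized).
    """
    gram = f_b.T @ f_b + eps * np.eye(f_b.shape[1])
    w = np.linalg.solve(gram, f_b.T @ f_a)
    return f_a - f_b @ w

rng = np.random.default_rng(0)
audio = rng.normal(size=(64, 16))
# Make the image features partly depend on the audio features (a confounded pair).
image = audio @ rng.normal(size=(16, 32)) + 0.1 * rng.normal(size=(64, 32))
clean = residualize(image, audio)
# The cross-modal correlation drops sharply after residualization.
print(np.linalg.norm(image.T @ audio), np.linalg.norm(clean.T @ audio))
```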
... Multimodal alignment. Multimodality enhances understanding and decision-making by integrating information from multiple sensory modalities [81,82]. Among these approaches, ALBEF [77] aligns visual and language representations, using momentum distillation to improve multimodal embeddings. ...
Preprint
Full-text available
Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matched pairs but traditional methods like pseudo-labeling are difficult to apply in domains with no label information. To address these problems, we transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP. Specifically, we design a novel semantic density distribution loss to explore implicit semantic alignment information from unpaired multimodal data by constraining the latent representation distribution with fine granularity, thus eliminating the need for numerous strictly matched pairs. Meanwhile, we introduce multi-kernel maximum mean discrepancy as well as self-supervised contrastive loss to pull separate modality distributions closer and enhance the stability of the representation distribution. In addition, the contrastive loss used in CLIP is employed on the supervised matched data to prevent negative optimization. Extensive experiments conducted on a range of tasks in various fields, including protein, remote sensing, and the general vision-language field, demonstrate the effectiveness of our proposed Gentle-CLIP.
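As a self-contained sketch of the multi-kernel maximum mean discrepancy term mentioned in the abstract, the snippet below computes an MMD² between two unpaired embedding batches with a sum of RBF kernels; the bandwidths, batch sizes, and loss weighting are illustrative assumptions rather than Gentle-CLIP's actual configuration.

```python
import torch

def multi_kernel_mmd(x, y, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Multi-kernel MMD^2 between two embedding batches.

    x, y   : (n, d) and (m, d) tensors (e.g. image and text embeddings).
    sigmas : RBF bandwidths; summing several kernels gives the multi-kernel variant.
    """
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return sum(torch.exp(-d2 / (2 * s**2)) for s in sigmas)
    n, m = x.size(0), y.size(0)
    kxx = (k(x, x).sum() - n * len(sigmas)) / (n * (n - 1))   # drop diagonal terms
    kyy = (k(y, y).sum() - m * len(sigmas)) / (m * (m - 1))
    kxy = k(x, y).mean()
    return kxx + kyy - 2 * kxy

img = torch.randn(128, 256)
txt = torch.randn(128, 256) + 0.5          # unpaired batch from another modality
loss = multi_kernel_mmd(img, txt)          # minimizing it pulls the two distributions closer
print(float(loss))
```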
... The user task is to utilize gaze to control navigation in the VR system and select content of interest. The design architecture of the VR system follows mainstream system design, which consists of explicit calibration, interactive familiarization and a lobby [36]. In the explicit calibration phase, nine point targets are displayed sequentially on the screen using a brief animation to attract and hold attention. ...
Article
Full-text available
Gaze estimation has long been recognised as having potential as the basis for human-computer interaction (HCI) systems, but usability and robustness of performance remain challenging. This work focuses on systems in which there is a live video stream showing enough of the subject's face to track eye movements and some means to infer gaze location from detected eye features. Currently, systems generally require some form of calibration or set-up procedure at the start of each user session. Here we explore some simple strategies for enabling gaze-based HCI to operate immediately and robustly without any explicit set-up tasks. We explore different choices of coordinate origin for combining extracted features from multiple subjects, and the replacement of subject-specific calibration by system initiation based on prior models. Results show that referencing all extracted features to local coordinate origins determined by the subject's start position enables robust immediate operation. Combining this approach with an adaptive gaze estimation model using an interactive user interface enables continuous operation with 75th percentile gaze errors of 0.7° and maximum gaze errors of 1.7° during prospective testing. These constitute state-of-the-art results and have the potential to enable a new generation of reliable gaze-based HCI systems.
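A toy sketch of the local-coordinate-origin idea reported above: each subject's extracted eye features are re-referenced to an origin estimated from their start position before being pooled into a single prior model. The array shapes, frame counts, and feature semantics are hypothetical, not the study's actual pipeline.

```python
import numpy as np

def to_local(features, n_init=30):
    """Re-reference extracted eye features to a per-subject local origin.

    features : (frames, 2) array of, e.g., pupil-centre coordinates.
    n_init   : number of initial frames (subject start position) used as the origin.
    """
    origin = features[:n_init].mean(axis=0)
    return features - origin

# Toy data: two subjects whose raw feature coordinates live in different ranges.
rng = np.random.default_rng(0)
subj_a = rng.normal(loc=(120.0, 80.0), scale=2.0, size=(500, 2))
subj_b = rng.normal(loc=(310.0, 40.0), scale=2.0, size=(500, 2))

# After local referencing, both subjects can feed a single prior gaze model
# without any explicit per-session calibration.
pooled = np.vstack([to_local(subj_a), to_local(subj_b)])
print(pooled.mean(axis=0))   # roughly centred around the shared local origin
```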
... VR has a lot of potential as an empirical research tool for simulating real-world environments in controlled settings. Despite the indisputable potential of multisensory VR to provide more immersive experiences leading to a greater sense of presence and realism (Martin et al., 2022), studies exploiting multisensory aspects of VR systems are still in their infancy (Lyu et al., 2023). Although visual rendering and audio in VR have witnessed substantial progress, the development of other sensory simulations pertaining to olfaction, tactile feedback, thermoception, and taste lags behind (Melo et al., 2020). ...
Conference Paper
Full-text available
Although the potential of VR for studying human-environment interactions is well-established, the scope has mostly been limited to audiovisual sensory feedback, where the other senses were conceived as secondary. This paper develops a novel method for evaluating human responses to multisensory environmental stimuli, including visual, auditory, and olfactory, using VR. An innovative approach is adopted to implement digital smell technologies for architecture. The proposed method provides dynamic interactions with the sensory environments and, in return, collects an extensive range of data streams regarding user experiences. The effectiveness of this method was tested through a proof-of-concept study assessing user responses to a private office involving multisensory modalities. The proposed methodological approach can be further applied to the research and practice of multisensory environmental stimuli.
... For example, in the field of intelligent healthcare, if the patient's physiological state can be more intuitively and vividly displayed in three dimensions, it can help doctors better understand the data and make better diagnoses [13]. In the field of virtual reality (VR), 3D display of user actions and emotions can provide a more immersive virtual experience for users [14]. Therefore, the development of 3D visualization display platforms for wearable devices has great application prospects in intelligent healthcare and other fields involving virtual reality. ...
Article
Full-text available
Intelligent wearable systems have been widely used in health monitoring, motion tracking, and engineering safety. However, the single function of current wearable systems cannot satisfy the requirements of complex scenarios, and the wearable systems cannot establish a relationship with the virtual 3D visualization platform. To address these issues, this paper proposes a novel intelligent wearable system with motion and emotion recognition. Multiple sensors are integrated into the system to collect motion and emotion information. In order to achieve accurate classification and recognition of multiple sensor information, we propose a novel human action recognition network called the three-branch spatial-temporal feature extraction network (TB-SFENet), which can obtain more robust features and achieve an accuracy of 97.04% on the UCI-HAR dataset and 92.68% on the UniMiB SHAR dataset. To establish the relationship between the real entity and virtual space, we use digital twin (DT) technology to establish the 3D display DT platform. The platform enables real-time information interaction, such as activity, emotion, location, and monitoring information. Additionally, we establish the TGAM electroencephalogram emotion classification (TEEC) dataset, which contains 120,000 pieces of data, for the proposed system. Experimental results indicate that the proposed system realizes virtual reality information interaction between the personal digital human and actual person based on the intelligent wearable system, which has great potential for applications in intelligent healthcare, virtual reality, and other fields.
... Relatedly, if the auditory event is perceived from outside the current field of view, attention may be guided by auditory information [31], which, in turn, may have an effect on scene exploration behavior. For complementary information on investigations of audiovisual effects and binding in AR/VR see, e.g., [31,37,55]. Previous research investigating this effect was conducted with rather simplistic signals such as sound bursts, light sources, or visual patterns. As outlined, research has, up to now, focused either on visual coherence or on matching speech and appearance in virtual humans, leaving aside audiovisual spatial coherence, i.e., matching (room acoustics) with visuals and movement. ...
Conference Paper
The appearance of virtual humans (avatars and agents) has been widely explored in immersive environments. However, virtual humans’ movements and associated sounds in real-world interactions, particularly in Augmented Reality (AR), are yet to be explored. In this paper, we investigate the influence of three distinct movement patterns (circle, side-to-side, and standing), two rendering styles (realistic and cartoon), and two types of audio (spatial audio and non-spatial audio) on emotional responses, social presence, appearance and behavior plausibility, audiovisual coherence, and auditory plausibility. To enable that, we conducted a study (N=36) where participants observed an agent reciting a short fictional story. Our results indicate an effect of the rendering style and the type of movement on the subjective perception of the agents behaving in an AR environment. Participants reported higher levels of excitement when they observed the realistic agent moving in a circle compared to the cartoon agent or the other two movement patterns. Moreover, we found an influence of agent’s movement pattern on social presence and higher appearance and behavior plausibility for the realistic rendering style. Regarding audiovisual spatial coherence, we found an influence of rendering style and type of audio only for the cartoon agent. Additionally, the spatial audio was perceived as more plausible than non-spatial audio. Our findings suggest that aligning realistic rendering styles with realistic auditory experiences may not be necessary for 1-1 listening experiences with moving sources. However, movement patterns of agents influence excitement and social presence in passive unidirectional communication scenarios.
... Optimizations concerning VR rendering are then presented to show current and future techniques aimed at fusing the understanding of how the eye works and efficient rendering. An in-depth review of the applicability of eye-tracking across prior surveys can be summarized as follows. Visualization of eye-tracking data (Blascheck et al., 2017): study of different techniques for the presentation of data generated with eye-tracking technologies. Gaze estimation (Kar and Corcoran, 2017): overview of systems and methods for estimating the gaze vector. Experimentation (Clay et al., 2019): exhaustive exploration of methods and tools which can be applied in the implementation of VR experiments using eye-tracking. Perception theory, display engineering and tracking technologies (Koulieris et al., 2019): investigation of the theory of perception and vision as well as the latest advancements in tracking technologies and display engineering. Application and usability (Li et al., 2021): literature review from 2000 to 2019 classifying results according to eye-tracking methods and application measures, and assessing the usability of eye-tracking data in VR. Eye-tracking measures and emergent patterns (Rappa et al., 2019): describes the ways in which eye-tracking devices have been used in VR and MRI environments and identifies emerging patterns in the results to inform recommendations for future research. Trends (Li and Barmaki, 2019): latest research progress in the ACM Symposium on Eye Tracking Research and Applications 2019 and recent representative works. Multimodalities in VR (Martin et al., 2022): multimodality in VR and its role and benefits in the user experience, along with different applications that leverage multimodality across many disciplines. Rendering optimizations (Matthews et al., 2020): brief overview of VR and eye-tracking history, as well as the main fields of rendering optimization: traditional and perception-based approaches. Rendering optimizations (Mohanto et al., 2022): overview of rendering work for optimizing the performance of foveated rendering, from geometrical approaches to adaptive solutions, including raster, ray, and hardware-oriented solutions. Eye-tracking in VR (our survey). While other surveys have examined eye-tracking systems, they typically focus on specific applications such as rendering optimizations or eye-tracking data analysis. ...
Article
Full-text available
Virtual reality (VR) has evolved substantially beyond its initial remit of gaming and entertainment, catalyzed by advancements such as improved screen resolutions and more accessible devices. Among various interaction techniques introduced to VR, eye-tracking stands out as a pivotal development. It not only augments immersion but offers a nuanced insight into user behavior and attention. This precision in capturing gaze direction has made eye-tracking instrumental for applications far beyond mere interaction, influencing areas like medical diagnostics, neuroscientific research, educational interventions, and architectural design, to name a few. Though eye-tracking’s integration into VR has been acknowledged in prior reviews, its true depth, spanning the intricacies of its deployment to its broader ramifications across diverse sectors, has been sparsely explored. This survey undertakes that endeavor, offering a comprehensive overview of eye-tracking’s state of the art within the VR landscape. We delve into its technological nuances, its pivotal role in modern VR applications, and its transformative impact on domains ranging from medicine and neuroscience to marketing and education. Through this exploration, we aim to present a cohesive understanding of the current capabilities, challenges, and future potential of eye-tracking in VR, underscoring its significance and the novelty of our contribution.
... The application scenarios, however, span several academic fields. The study primarily focuses on various multimodal applications in three fields where virtual reality has had a significant impact: entertainment, education and training, and the medical field [8]. ...
Article
Multimodal interaction refers to the combination of smart speakers and displays. It gives users the option to engage with various input and output modalities. When interacting with other individuals, humans use more nonverbal cues than verbal cues. They communicate with each other using a variety of modalities, including gestures, eye contact, and facial expressions. This type of communication is known as multimodal interaction. A specific type of multimodal interaction called human-computer interaction (HCI) makes it easier for people to communicate with machines. Several studies employing the aforementioned modalities have found that machines can quickly interact with a person based on the feelings or actions the person discloses. The research presented here provides an in-depth overview of multimodal interaction, HCI, the difficulties and advancements encountered in this field, and its prospects for future technological improvement.
... Future studies should, therefore, focus on the multisensory experience of coimmersion. Indeed, in their review article, Martin et al. [44] found that several industries using immersive technologies are already relying on other sensory modalities in their IVE to elicit higher levels of realism and immersion. It is possible that additional social sensory inputs, such as voice or sound of breathing, as well as affordance of touch in IVE, could elicit higher levels of immersive copresence. ...
Article
Full-text available
Sharing experiences with others is an important part of everyday life. Immersive virtual reality (IVR) promises to simulate these experiences. However, whether IVR elicits a similar level of social presence as measured in the real world is unclear. It is also uncertain whether AI-driven virtual humans (agents) can elicit a similar level of meaningful social copresence as people-driven virtual-humans (avatars). The current study demonstrates that both virtual human types can elicit a cognitive impact on a social partner. The current experiment tested participants’ cognitive performance changes in the presence of virtual social partners by measuring the social facilitation effect (SFE). The SFE-related performance change can occur through either vigilance-based mechanisms related to other people’s copresence (known as the mere presence effect (MPE)) or reputation management mechanisms related to other people’s monitoring (the audience effect (AE)). In this study, we hypothesised AE and MPE as distinct mechanisms of eliciting SFE. Firstly, we predicted that, if head-mounted IVR can simulate sufficient copresence, any social companion’s visual presence would elicit SFE through MPE. The results demonstrated that companion presence decreased participants’ performance irrespective of whether AI or human-driven. Secondly, we predicted that monitoring by a human-driven, but not an AI-driven, companion would elicit SFE through AE. The results demonstrated that monitoring by a human-driven companion affected participant performance more than AI-driven, worsening performance marginally in accuracy and significantly in reaction times. We discuss how the current results explain the findings in prior SFE in virtual-world literature and map out future considerations for social-IVR testing, such as participants’ virtual self-presence and affordances of physical and IVR testing environments.
Article
Full-text available
By providing cutting-edge therapeutic interventions, improving accessibility, and creating immersive healing experiences, generative artificial intelligence (AI) and virtual reality (VR) are transforming mental health care. This study investigates how these new technologies affect mental health, looking at how they might enhance wellbeing while resolving ethical issues and inherent difficulties. By offering individualized interventions, cognitive behavioral therapy (CBT), and real-time emotional support, generative AI-powered chatbots and virtual assistants lower obstacles to mental health care. By establishing safe, immersive settings that promote gradual desensitization, virtual reality exposure therapy has shown promise in the treatment of phobias, anxiety disorders, and post-traumatic stress disorder (PTSD). Notwithstanding these benefits, issues with algorithmic bias, data privacy, and an excessive dependence on technology pose serious problems. To guarantee patient safety, the ethical ramifications of AI-generated mental health advice (specifically, its accuracy and dependability) need thorough confirmation. Guidelines for ethical use are necessary because VR's immersive nature can also result in dissociation, addiction, or unexpected psychological impacts. To ensure fair access and efficacy, technologists, psychologists, and legislators must work together to create standardized guidelines for integrating AI and VR into clinical practice. The implications of AI-driven mental health interventions for marginalized groups, who frequently face inequities in access to conventional care, are also examined in this research. Generative AI models' versatility makes it possible to create therapeutic applications that are inclusive of all languages and cultures, filling in gaps in mental health care around the globe. To solve issues with AI bias, false information, and responsibility in automated mental health solutions, legislative and ethical frameworks must change. The use of AI and VR in self-guided therapy, crisis intervention, and preventive care is growing as these technologies continue to transform the field of mental health. Research on these technologies' long-term effects on social connections, human emotional intelligence, and psychological resilience is still lacking, despite the fact that they have the potential to improve patient participation and lessen the workload for mental health practitioners. To optimize the advantages of generative AI and VR for mental health, this study emphasizes the need to strike a balance between technology innovation and human-centric ethical issues.
Article
Object selection and manipulation are the foundation of VR interactions. With the rapid development of VR technology and the field of virtual object selection and manipulation, the literature demands a structured understanding of the core research challenges and a critical reflection of the current practices. To provide such understanding and reflections, we systematically reviewed 106 papers. We identified classic and emerging topics, categorized existing solutions, and evaluated how success was measured in these publications. Based on our analysis, we discuss future research directions and propose a framework for developing and determining appropriate solutions for different application scenarios.
Article
Virtual/augmented reality (VR/AR) devices offer both immersive imagery and sound. With those wide-field cues, we can simultaneously acquire and process visual and auditory signals to quickly identify objects, make decisions, and take action. While vision often takes precedence in perception, our visual sensitivity degrades in the periphery. In contrast, auditory sensitivity can exhibit an opposite trend due to the elevated interaural time difference. What occurs when these senses are simultaneously integrated, as is common in VR applications such as 360° video watching and immersive gaming? We present a computational and probabilistic model to predict VR users' reaction latency to visual-auditory multisensory targets. To this aim, we first conducted a psychophysical experiment in VR to measure the reaction latency by tracking the onset of eye movements. Experiments with numerical metrics and user studies with naturalistic scenarios showcase the model's accuracy and generalizability. Lastly, we discuss the potential applications, such as measuring the sufficiency of target appearance duration in immersive video playback, and suggesting the optimal spatial layouts for AR interface design.
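The paper's own probabilistic model is not reproduced in the abstract; as a loose illustration of how eccentricity-dependent unisensory latencies could combine, the sketch below runs a textbook race model (the faster modality triggers the response on each trial) with made-up log-normal latency parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def visual_latency(ecc_deg, n):
    """Toy visual RT in ms: slower in the periphery (log-normal draw)."""
    mu = np.log(220 + 2.0 * ecc_deg)
    return rng.lognormal(mu, 0.15, n)

def auditory_latency(ecc_deg, n):
    """Toy auditory RT in ms: mildly faster at larger azimuths (larger ITD cue)."""
    mu = np.log(260 - 0.5 * ecc_deg)
    return rng.lognormal(mu, 0.15, n)

def predicted_rt(ecc_deg, n=100_000):
    """Race model: whichever modality finishes first triggers the response."""
    v = visual_latency(ecc_deg, n)
    a = auditory_latency(ecc_deg, n)
    return np.minimum(v, a).mean()

for ecc in (0, 15, 30, 45):
    print(f"eccentricity {ecc:2d} deg -> mean predicted RT {predicted_rt(ecc):.0f} ms")
```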
Preprint
Full-text available
Mismatches between perceived and veridical physiological signals during false feedback (FFB) can bias emotional judgements. Paradigms using auditory FFB suggest perceived changes in heart rate (HR) increase ratings of emotional intensity irrespective of feedback type (increased or decreased HR), implicating right anterior insula as a mismatch comparator between exteroceptive and interoceptive information. However, few paradigms have examined effects of somatosensory FFB. Participants rated the emotional intensity of randomized facial expressions while they received 20 second blocks of pulsatile somatosensory stimulation at rates higher than HR, lower than HR, equivalent to HR, or no stimulation during a functional magnetic resonance neuroimaging scan. FFB exerted a bidirectional effect on reported intensity ratings of the emotional faces, increasing over the course of each 20 second stimulation block. Neuroimaging showed FFB engaging regions indicative of affective touch processing, embodiment, and reflex suppression. Contrasting higher vs lower HR FFB revealed engagement of right insula and centres supporting socio-emotional processing. Results indicate that exposure to pulsatile somatosensory stimulation can influence emotional judgements through its progressive embodiment as a perceived interoceptive arousal state, biasing how affective salience is ascribed to external stimuli. Results are consistent with multimodal integration of priors and prediction-error signalling in shaping perceptual judgments.
Article
Recent advances in edge computing (EC) have pushed cloud-based data caching services to edge, however, such emerging edge storage comes with numerous challenging and unique security issues. One of them is the problem of edge data integrity verification (EDIV) which coordinates multiple participants (e.g., data owners and edge nodes) to inspect whether data cached on edge is authentic. To date, various solutions have been proposed to address the EDIV problem, while there is no systematic review. Thus, we offer a comprehensive survey for the first time, aiming to show current research status, open problems, and potentially promising insights for readers to further investigate this under-explored field. Specifically, we begin by stating the significance of the EDIV problem, the integrity verification difference between data cached on cloud and edge, and three typical system models with corresponding inspection processes. To thoroughly assess prior research efforts, we synthesize a universal criteria framework that an effective verification approach should satisfy. On top of it, a schematic development timeline is developed to reveal the research advance on EDIV in a sequential manner, followed by a detailed review of the existing EDIV solutions. Finally, we highlight intriguing research challenges and possible directions for future work, along with a discussion on how forthcoming technology, e.g., machine learning and context-aware security, can augment security in EC. Given our findings, some major observations are: there is a noticeable trend to equip EDIV solutions with various functions and diversify study scenarios; completing EDIV within two types of participants (i.e., data owner and edge nodes) is garnering escalating interest among researchers; although the majority of existing methods rely on cryptography, emerging technology is being explored to handle the EDIV problem.
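As a minimal illustration of the cryptographic building block many EDIV schemes build on, the sketch below has the data owner publish a keyed digest of the data pushed to an edge cache, and a verifier recompute it over the cached replica; the key handling and digest choice are placeholder assumptions, not a scheme from the survey.

```python
import hmac
import hashlib

OWNER_KEY = b"hypothetical-owner-secret"     # placeholder, not a real key-management scheme

def owner_digest(data: bytes) -> str:
    """Digest the owner publishes alongside the data pushed to edge caches."""
    return hmac.new(OWNER_KEY, data, hashlib.sha256).hexdigest()

def verify_edge_copy(cached: bytes, published_digest: str) -> bool:
    """Edge data integrity check: recompute the digest over the cached replica."""
    return hmac.compare_digest(owner_digest(cached), published_digest)

original = b"video-segment-0042"
tag = owner_digest(original)
print(verify_edge_copy(original, tag))                 # True  -> replica is authentic
print(verify_edge_copy(b"tampered-segment", tag))      # False -> integrity violation
```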
Article
Full-text available
This study investigates the effects of multimodal cues on visual field guidance in 360° virtual reality (VR). Although this technology provides highly immersive visual experiences through spontaneous viewing, this capability can disrupt the quality of experience and cause users to miss important objects or scenes. Multimodal cueing using non-visual stimuli to guide the users' heading, or their visual field, has the potential to preserve the spontaneous viewing experience without interfering with the original content. In this study, we present a visual field guidance method that imparts auditory and haptic stimulation using an artificial electrostatic force that can induce a subtle "fluffy" sensation on the skin. We conducted a visual search experiment in VR, wherein the participants attempted to find visual target stimuli both with and without multimodal cues, to investigate the behavioral characteristics produced by the guidance method. The results showed that the cues aided the participants in locating the target stimuli. However, the performance with simultaneous auditory and electrostatic cues was situated between those obtained when each cue was presented individually (medial effect), and no improvement was observed even when multiple cue stimuli pointed to the same target. In addition, a simulation analysis showed that this intermediate performance can be explained by the integrated perception model; that is, it is caused by an imbalanced perceptual uncertainty in each sensory cue for orienting to the correct view direction. The simulation analysis also showed that an improved performance (synergy effect) can be observed depending on the balance of the uncertainty, suggesting that the relative amount of uncertainty for each cue determines the performance. These results suggest that electrostatic force can be used to guide 360° viewing in VR, and that the performance of visual field guidance can be improved by introducing multimodal cues whose uncertainty is modulated to be less than or comparable to that of other cues. Our findings on the conditions that modulate multimodal cueing effects contribute to maximizing the quality of spontaneous 360° viewing experiences with multimodal guidance.
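The integrated perception model mentioned above is commonly formalized as reliability-weighted (maximum-likelihood) cue combination, where each cue contributes in inverse proportion to its variance. The sketch below illustrates that general idea with hypothetical numbers; it is not the authors' simulation code.

```python
# Minimal sketch of reliability-weighted (maximum-likelihood) cue integration.
import numpy as np

def integrate_cues(estimates, sigmas):
    """Combine direction estimates (degrees) weighted by the inverse variance of each cue."""
    estimates = np.asarray(estimates, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    weights = 1.0 / np.square(sigmas)
    weights /= weights.sum()
    fused = np.dot(weights, estimates)
    fused_sigma = np.sqrt(1.0 / np.sum(1.0 / np.square(sigmas)))
    return fused, fused_sigma

# Hypothetical example: an auditory cue points 20 degrees off target with high
# uncertainty; an electrostatic cue points 5 degrees off with moderate uncertainty.
# The fused estimate lies between them, and its uncertainty is never larger than
# that of the most reliable single cue.
direction, sigma = integrate_cues(estimates=[20.0, 5.0], sigmas=[15.0, 8.0])
print(f"Fused direction estimate: {direction:.1f} deg (sigma = {sigma:.1f} deg)")
```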
Article
Navigating the interaction landscape of Virtual Reality (VR) and Augmented Reality (AR) presents significant complexities due to the plethora of available input hardware and interaction modalities, compounded by spatially diverse visual interfaces. Such complexities elevate the likelihood of user errors, necessitating frequent backtracking. To address this, we introduce ViRgilites, a virtual guidance framework that delivers multi-level feedforward information covering the available interaction techniques as well as future possibilities for interacting with virtual objects, anticipating the interaction effects and how they fit with the user's overall goal. ViRgilites is engineered to facilitate task execution, empowering users to make informed decisions about action methodologies and alternative courses of action. This paper presents the architecture and functionality of ViRgilites and demonstrates its efficacy through evaluation with a formative user study.
Article
Designing new input devices and associated interaction techniques is a key way to increase the bandwidth between users and interactive applications. In the field of Human-Computer Interaction, industrial research and development services and university research laboratories have, since the invention of the mouse and the graphical user interface, proposed multiple contributions, including the integration of multiple input devices into multimodal interaction techniques. These contributions are most of the time presented as prototypes or demonstrators providing evidence of the bandwidth increase through user studies. Such contributions, however, do not provide any support to software developers for integrating them into real-life systems. When done, this integration is performed in a craft manner, outside required software engineering good practice. This paper proposes a systematic process for integrating novel input devices and associated interaction techniques to better support users' work. It exploits recent architectural models for interactive systems and formal model-based approaches supporting verification and validation when required. The paper focuses on frame-based input devices, which support gesture-based interaction and movement recognition, but also addresses their multimodal use. This engineering approach is demonstrated on an interactive application in the area of rehabilitation in healthcare, where the dependability of interactions and applications is as critical as their usability.
Article
This paper presents an evaluation of a potential new interaction mode in virtual reality (VR) to determine whether it provides any positive impact in terms of how users interact with content. We evaluated the user experience for 3D object manipulation across three modes of interaction. Interactions using controllers and gestures are used as baselines from which to gauge the potential value of the new mode of interaction, in which a single controller and gestures are combined. This paper reports on a user study that captures quantitative and qualitative data related to a variety of object manipulation tasks in a Virtual Environment (VE). We investigated the impact of this new interaction mode with 40 participants across a number of interaction tasks, with the quantitative evaluation indicating that, in general, the mixed mode of interaction resulted in task completion times consistently faster than gesture-based interaction and, in some cases, faster than with the use of controllers alone. A qualitative evaluation of the user experience indicated potential application areas for the new mode of interaction.
Article
This thought-provoking conceptual research pioneers the conceptualization of sense of place (SOP) in tourism within a metaverse paradigm, where the convergence of real and digital realms compels us to reframe our understanding of tourism destinations. Built upon three major perspectives, corresponding paradigm shifts have been proposed: (1) SOP as an individual's cognition of a tourism destination: from multimodal–socio-psychological to embodied–augmented; (2) SOP as the interconnection between an individual and a destination: from a person-to-place bond to a multiple person–place unity; and (3) SOP as modalities that communicate meanings of a destination: from narratology (stories) to dramaturgy (plays). This study aims to catalyze further research that re-examines established assumptions and conceptualizations of tourism-related constructs, given the ever-evolving technological landscape.
Article
Few empirical studies have explored the effect of using VR technology in educational institutions in Jordan. Therefore, this study aimed to evaluate the effect of using the VR Gravity Sketch tool on Architecture and Design university students' perceptions of usability, suitability, satisfaction, and self-efficacy in Jordan. A pre–post-test control group was used and 161 students (81 in the intervention group and 80 in the control group) from Al-Zaytoonah University of Jordan enrolled in the Free Drawing course were recruited. Findings showed a significant difference between the intervention and control groups post-VR simulation for usability, suitability, satisfaction, and self-efficacy among students in the intervention group, and there was a significant difference in the aforementioned variables between both groups. Thus, this study suggested that VR simulation was effective in training architecture and design students. Therefore, it is necessary to integrate VR into the educational curriculum as a learning strategy for students.
Article
Software practitioners use various methods in Requirements Engineering (RE) to elicit, analyze, and specify the requirements of enterprise products. The methods impact the final product characteristics and influence product delivery. Ad-hoc usage of the methods by software practitioners can lead to inconsistency and ambiguity in the product. With the notable rise in enterprise products, games, etc. across various domains, Virtual Reality (VR) has become an essential technology for the future. The methods adopted for requirements engineering when developing VR products require a detailed study. This paper presents a mapping study on requirements engineering methods prescribed and used for developing VR applications, including requirements elicitation, requirements analysis, and requirements specification. Our study provides insights into the use of such methods in the VR community and suggests using specific requirements engineering methods in various fields of interest. We also discuss future directions in requirements engineering for VR products.
Conference Paper
Full-text available
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First, we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning (monocular depth estimation, surface normal estimation, and visual navigation), with results comparable to or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
Chapter
Full-text available
Mid-air haptic technologies convey haptic sensations without any direct contact between the user and the interface. A popular example of this technology is focused ultrasound. It works by modulating the phase of an array of ultrasound emitters so as to generate focused points of oscillating high pressure, which in turn elicit haptic sensations on the user's skin. Whilst using focused ultrasound to convey haptic sensations is becoming increasingly popular in Virtual Reality (VR), few studies have been conducted into understanding how to render virtual object properties. In this paper, we evaluate the capability of focused ultrasound arrays to simulate varying stiffness sensations in VR. We carry out a user study enrolling 20 participants, showing that focused ultrasound haptics can well provide the sensation of interacting with objects of different stiffnesses. Finally, we propose four representative VR use cases to show the potential of rendering stiffness sensations using this mid-air haptic technology.
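For readers unfamiliar with how an ultrasound array is focused, the following sketch computes per-emitter phase offsets so that all wavefronts arrive in phase at a chosen focal point, which is the basic principle behind mid-air ultrasonic haptics. The array geometry, carrier frequency, and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative phased-array focusing sketch (parameters are hypothetical).
import numpy as np

SPEED_OF_SOUND = 346.0   # m/s in air at roughly 25 degrees Celsius
FREQUENCY = 40_000.0     # Hz, a typical carrier for ultrasonic haptic arrays

def focus_phases(emitter_positions, focal_point):
    """Per-emitter phase offsets (radians) that make the waves arrive in phase at focal_point."""
    distances = np.linalg.norm(emitter_positions - focal_point, axis=1)
    wavelength = SPEED_OF_SOUND / FREQUENCY
    # Advance each emitter's phase to cancel its propagation delay to the focus.
    return (-2.0 * np.pi * distances / wavelength) % (2.0 * np.pi)

# Hypothetical 16 x 16 grid of emitters spaced 10 mm apart, focused 20 cm above its centre.
xs, ys = np.meshgrid(np.arange(16) * 0.01, np.arange(16) * 0.01)
emitters = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)])
phases = focus_phases(emitters, focal_point=np.array([0.075, 0.075, 0.2]))
print(phases.shape)  # (256,): one phase per emitter
```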
Article
Full-text available
We report an auditory effect of visual performance degradation in a virtual reality (VR) setting, where the viewing conditions are significantly different from previous studies. With the presentation of temporally congruent but spatially incongruent sound, we can degrade visual performance significantly at detection and recognition levels. We further show that this effect is robust to different types and locations of both auditory and visual stimuli. We also analyze participants' behavior with an eye tracker to study the underlying cause of the degradation effect. We find that the performance degradation occurs even in the absence of saccades towards the sound source, during normal gaze behavior. This suggests that this effect is not caused by oculomotor phenomena, but rather by neural interactions or attentional shifts.
Conference Paper
Full-text available
Ambisonics, which constructs a sound distribution over the full viewing sphere, improves the immersive experience in omnidirectional video (ODV) by enabling observers to perceive the sound directions. Thus, human attention can be guided by audio and visual stimuli simultaneously. Numerous datasets have been proposed to investigate human visual attention by collecting eye fixations of observers navigating ODV with head-mounted displays (HMD). However, there is no such dataset analyzing the impact of audio information. In this paper, we establish a new audiovisual attention dataset for ODV with mute, mono, and ambisonics. The user behavior, including visual attention corresponding to sound source locations, viewing navigation congruence between observers, and fixation distributions in these three audio modalities, is studied based on video and audio content. From our statistical analysis, we preliminarily found that, compared to only perceiving visual cues, perceiving visual cues with salient object sound (i.e., human voice, siren of an ambulance) could draw more visual attention to the objects making sound and guide viewing behaviour when such objects are not in the current field of view. The more in-depth interactive effects between audio and visual cues in mute, mono, and ambisonics still require further comprehensive study. The dataset and testbed developed in this initial work will be publicly available with the paper to foster future research on audiovisual attention for ODV.
Conference Paper
Full-text available
We present a novel physics-based concatenative sound synthesis (CSS) methodology for congruent interactions across physical, graphical, aural, and haptic modalities in Virtual Environments. Navigation in aural and haptic corpora of annotated audio units is driven by user interactions with highly realistic photogrammetry-based models in a game engine, where automated and interactive positional, physics, and graphics data are supported. From a technical perspective, the current contribution expands existing CSS frameworks by avoiding mapping or mining the annotation data to real-time performance attributes, while guaranteeing degrees of novelty and variation for the same gesture.
Article
Full-text available
Sound design has been a fundamental component of audiovisual storytelling in linear media. However, with recent technological developments and the shift towards non-linear and immersive media, things are rapidly changing. More sensory information is available and, at the same time, the user is gaining agency over the narrative, being offered the possibility of navigating or making other decisions. These new characteristics of immersive environments bring new challenges to storytelling in interactive narratives and require new strategies and techniques for audiovisual narrative progression. Can technology offer an immersive environment where the user has the sensation of agency, of choice, where her actions are not mediated by evident controls but subliminally induced in a way that ensures a narrative is being followed? Can sound be a subliminal element that induces attentional focus on the elements most relevant to the narrative, driving storytelling and biasing search in an immersive non-linear audiovisual environment? Herein, we present a literature review that has been guided by this prospect. With these questions in view, we present our exploration process in finding possible answers and potential solution paths. We point out that consistency, in terms of coherence across sensory modalities and emotional matching, may be a critical aspect. Finally, we consider that this review may open up new paths for experimental studies that could, in the future, provide new strategies in the practice of sound design in the context of non-linear media.
Article
Full-text available
Group pressure can often result in people carrying out harmful actions towards others that they would not normally carry out by themselves. However, few studies have manipulated factors that might overcome this. Here male participants (n = 60) were in a virtual reality (VR) scenario of sexual harassment (SH) of a lone woman by a group of males in a bar. Participants were either only embodied as one of the males (Group, n = 20), or also as the woman (Woman, n = 20). A control group (n = 20) only experienced the empty bar, not the SH. One week later they were the Teacher in a VR version of Milgram’s Obedience experiment where they were encouraged to give shocks to a female Learner by a group of 3 virtual males. Those who had been in the Woman condition gave about half the number of shocks of those in the Group condition, with the controls between these two. We explain the results through embodiment promoting identification with the woman or the group, and delegitimization of the group for those in the Woman condition. The experiment raised important ethical issues, showing that a VR study with positive ethical intentions can sometimes produce unexpected and non-beneficent results.
Conference Paper
Full-text available
[Figure 1: Top view of the virtual environment, with call-outs for four of the eight locations on the map. For each trial, the user first received a textual description of a location and then visited one of eight places on the island; sensory feedback (vision, audio, wind, floor vibration, and smell) was mapped to scene features such as trains, orange trees, helicopters, and a waterfall, and varied depending on the study condition.] Supporting perceptual-cognitive tasks is an important part of our daily lives. We use rich, multi-sensory feedback through sight, sound, touch, smell, and taste to better support the perceptual-cognitive tasks we perform, such as sports, cooking, and searching for a location, and to increase our confidence in performing those tasks in daily life. As in real life, the demand for perceptual-cognitive tasks exists in serious VR simulations such as surgical or safety training systems. However, in contrast to real life, VR simulations are typically limited to visual and auditory cues, sometimes adding simple tactile feedback. This can make it difficult to make confident decisions in VR. In this paper, we investigate the effects of multi-sensory stimuli, namely visuals, audio, two types of tactile feedback (floor vibration and wind), and smell, on confidence levels in a location-matching task which requires a combination of perceptual and cognitive work inside a virtual environment. We also measured the level of presence when participants visited virtual places with different combinations of sensory feedback. Our results show that our multi-sensory VR system was superior to a typical VR system (vision and audio) in terms of the sense of presence and user preference. However, the subjective confidence levels were higher in the typical VR system.
Preprint
Full-text available
Human sensory processing is sensitive to the proximity of stimuli to the body. It is therefore plausible that these perceptual mechanisms also modulate the detectability of content in VR, depending on its location. We evaluate this in a user study and further explore the impact of the user's representation during interaction. We also analyze how embodiment and motor performance are influenced by these factors. In a dual-task paradigm, participants executed a motor task, either through virtual hands, virtual controllers, or a keyboard. Simultaneously, they detected visual stimuli appearing in different locations. We found that while actively performing a motor task in the virtual environment, performance in detecting additional visual stimuli is higher when presented near the user's body. This effect is independent of how the user is represented and only occurs when the user is also engaged in a secondary task. We further found improved motor performance and increased embodiment when interacting through virtual tools and hands in VR, compared to interacting with a keyboard. This study contributes to better understanding the detectability of visual content in VR, depending on its location, as well as the impact of different user representations on information processing, embodiment, and motor performance.
Conference Paper
Full-text available
Through technological advancements, more and more companies consider virtual reality (VR) for training their workforce, in particular for situations that occur rarely, are dangerous, expensive, or very difficult to recreate in the real world. This creates the need to understand the potential and limitations of VR training and to establish best practices. In pursuit of this, we have developed a VR training simulation for a use case at Grundfos, in which apprentices learn a sequential maintenance task. We evaluated this simulation in a user study with 36 participants, comparing it to two traditional forms of training (Pairwise Training and Video Training). This case study describes the developed virtual training scenario and discusses design considerations for such VR simulations. The results of our evaluation support that, while VR Training is effective in teaching the procedure for a maintenance task, traditional approaches with hands-on experience still lead to a significantly better outcome.
Article
Full-text available
Proprioceptive development relies on a variety of sensory inputs, among which vision is hugely dominant. Focusing on the developmental trajectory underpinning the integration of vision and proprioception, the present research explores how this integration is involved in interactions with Immersive Virtual Reality (IVR) by examining how proprioceptive accuracy is affected by Age, Perception, and Environment. Individuals from 4 to 43 years old completed a self-turning task which asked them to manually return to a previous location with different sensory modalities available in both IVR and reality. Results were interpreted from an exploratory perspective using Bayesian model comparison analysis, which allows the phenomena to be described using probabilistic statements rather than simplified reject/not-reject decisions. The most plausible model showed that 4–8-year-old children can generally be expected to make more proprioceptive errors than older children and adults. Across age groups, proprioceptive accuracy is higher when vision is available, and is disrupted in the visual environment provided by the IVR headset. We can conclude that proprioceptive accuracy mostly develops during the first eight years of life and that it relies largely on vision. Moreover, our findings indicate that this proprioceptive accuracy can be disrupted by the use of an IVR headset.
Article
Full-text available
Tangible objects are used in Virtual Reality (VR) and Augmented Reality (AR) to enhance haptic information on the general shape of virtual objects. However, they are often passive or unable to simulate rich varying mechanical properties. This paper studies the effect of combining simple passive tangible objects and wearable haptics for improving the display of varying stiffness, friction, and shape sensations in these environments. By providing timely cutaneous stimuli through a wearable finger device, we can make an object feel softer or more slippery than it really is, and we can also create the illusion of encountering virtual bumps and holes. We evaluate the proposed approach carrying out three experiments with human subjects. Results confirm that we can increase the compliance of a tangible object by varying the pressure applied through a wearable device. We are also able to simulate the presence of bumps and holes by providing timely pressure and skin stretch sensations. Finally, we show the potential of our techniques in an immersive medical palpation use case in VR. These results pave the way for novel and promising haptic interactions in VR, better exploiting the multiple ways of providing simple, unobtrusive, and inexpensive haptic displays.
Article
Full-text available
The merger of game-based approaches and Virtual Reality (VR) environments to enhance learning and training methodologies has a very promising future, reinforced by the widespread market availability of affordable software and hardware tools for VR environments. Rather than passive observers, users engage in those learning environments as active participants, permitting the development of exploration-based learning paradigms. There are separate reviews of VR technologies and serious games for educational and training purposes with a focus on only one knowledge area. In contrast, this review covers 135 proposals for serious games in immersive VR environments that combine VR and serious games and that offer end-user validation. First, an analysis of the forum, nationality, and date of publication of the articles is conducted. Then, the application domains, the target audience, the design of the game and its technological implementation, the performance evaluation procedure, and the results are analyzed. The aim here is to identify the factual standards of the proposed solutions and the differences between training and learning applications. Finally, the study lays the basis for future research lines that will develop serious games in immersive VR environments, providing recommendations for the improvement of these tools and their successful application for the enhancement of both learning and training tasks.
Conference Paper
Full-text available
The Unlimited Corridor is a virtual reality system that enables users to walk in an ostensibly straight direction around a virtual corridor within a small tracked space. Unlike other redirected walking systems, the Unlimited Corridor allows users to keep walking around without interruptions or resetting phases. This is made possible by combining a redirected walking technique with visuo-haptic interaction and a path planning algorithm. The Unlimited Corridor produces passive haptic feedback using semi-circular handrails; that is, when users grip a straight handrail in the virtual environment, they simultaneously grip a corresponding curved handrail in the physical world. These stimuli enable visuo-haptic interaction, with the user perceiving the gripped handrail as straight, and this sensation enhances the effects of redirected walking. Furthermore, we developed an algorithm that dynamically modifies the amount of distortion to allow a user to walk ostensibly straight and turn at intersections in any direction. We evaluated the Unlimited Corridor using a virtual space of approximately 400 m² in a physical space of approximately 60 m². According to a user study, the median value of the straightness sensation of walking when users grip the handrails (5.13) was significantly larger than that of the sensation felt without gripping the handrails (3.38).
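The following sketch illustrates the curvature redirection idea underlying systems like the Unlimited Corridor: a user who perceives walking straight is physically steered along a circle by injecting a small rotation each step. It is a simplified, hypothetical illustration with placeholder parameters, not the system's path planning algorithm.

```python
# Hedged sketch of curvature-based redirected walking (not the Unlimited Corridor code).
import math

PHYSICAL_RADIUS = 4.0  # metres; the real circular path the user ends up walking

def redirect_step(phys_x, phys_y, phys_heading, step_length):
    """Advance the user's physical pose one step while they perceive walking straight."""
    # Rotation injected this step so the physical path bends onto the circle.
    delta_heading = step_length / PHYSICAL_RADIUS
    phys_heading += delta_heading
    phys_x += step_length * math.cos(phys_heading)
    phys_y += step_length * math.sin(phys_heading)
    # The virtual camera is counter-rotated by -delta_heading each step, so the
    # corridor still appears straight to the user (counter-rotation not shown here).
    return phys_x, phys_y, phys_heading

x, y, heading = 0.0, 0.0, 0.0
for _ in range(50):                     # 50 steps of 0.5 m, i.e. 25 m of "straight" walking
    x, y, heading = redirect_step(x, y, heading, 0.5)
print(f"physical displacement after 25 m of perceived straight walking: ({x:.1f}, {y:.1f}) m")
```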
Conference Paper
Full-text available
Immersive co-located theatre aims to bring the social aspects of traditional cinematic and theatrical experience into Virtual Reality (VR). Within these VR environments, participants can see and hear each other, while their virtual seating location corresponds to their actual position in the physical space. These elements create a realistic sense of presence and communication, which enables an audience to create a cognitive impression of a shared virtual space. This article presents a theoretical framework behind the design principles, challenges and factors involved in the sound production of co-located VR cinematic productions, followed by a case-study discussion examining the implementation of an example system for a 6-minute cinematic experience for 30 simultaneous users. A hybrid reproduction system is proposed for the delivery of an effective sound design for shared cinematic VR.
Article
Understanding and modeling the dynamics of human gaze behavior in 360° environments is crucial for creating, improving, and developing emerging virtual reality applications. However, recruiting human observers and acquiring enough data to analyze their behavior when exploring virtual environments requires complex hardware and software setups, and can be time-consuming. Being able to generate virtual observers can help overcome this limitation, and thus stands as an open problem in this medium. Particularly, generative adversarial approaches could alleviate this challenge by generating a large number of scanpaths that reproduce human behavior when observing new scenes, essentially mimicking virtual observers. However, existing methods for scanpath generation do not adequately predict realistic scanpaths for 360° images. We present ScanGAN360, a new generative adversarial approach to address this problem. We propose a novel loss function based on dynamic time warping and tailor our network to the specifics of 360° images. The quality of our generated scanpaths outperforms competing approaches by a large margin, and is almost on par with the human baseline. ScanGAN360 allows fast simulation of large numbers of virtual observers, whose behavior mimics real users, enabling a better understanding of gaze behavior, facilitating experimentation, and aiding novel applications in virtual reality and beyond.
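Since the loss in ScanGAN360 is built around dynamic time warping, the sketch below shows the classic DTW distance between two scanpaths for illustration. The training loss in the paper relies on a differentiable formulation, which this plain dynamic-programming version does not attempt to reproduce; the example scanpaths are made up.

```python
# Illustrative DTW distance between two scanpaths (sequences of 2D gaze points).
import numpy as np

def dtw_distance(path_a, path_b):
    """Classic O(n*m) dynamic time warping between (n, 2) and (m, 2) point sequences."""
    path_a, path_b = np.asarray(path_a, dtype=float), np.asarray(path_b, dtype=float)
    n, m = len(path_a), len(path_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(path_a[i - 1] - path_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two made-up scanpaths in normalized image coordinates.
scanpath_1 = [[0.1, 0.5], [0.3, 0.5], [0.6, 0.4], [0.8, 0.4]]
scanpath_2 = [[0.1, 0.5], [0.4, 0.45], [0.8, 0.4]]
print(f"DTW distance: {dtw_distance(scanpath_1, scanpath_2):.3f}")
```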
Article
Virtual reality (VR) is a powerful medium for 360° storytelling, yet content creators are still in the process of developing cinematographic rules for effectively communicating stories in VR. Traditional cinematography has relied for over a century on well-established editing techniques, and among the most recurrent resources are cinematic cuts that allow content creators to seamlessly transition between scenes. One fundamental assumption of these techniques is that the content creator can control the camera; however, this assumption breaks in VR: users are free to explore the 360° environment around them. Recent works have studied the effectiveness of different cuts in 360° content, but the effect of directional sound cues while experiencing these cuts has been less explored. In this work, we provide the first systematic analysis of the influence of directional sound cues on users' behavior across 360° movie cuts, providing insights that can have an impact on deriving conventions for VR storytelling.
Article
Virtual Reality (VR) systems increase immersion by reproducing users' movements in the real world. However, several works have shown that this real-to-virtual mapping does not need to be precise in order to convey a realistic experience. Being able to alter this mapping has many potential applications, since achieving an accurate real-to-virtual mapping is not always possible due to limitations in the capture or display hardware, or in the physical space available. In this work, we measure detection thresholds for lateral translation gains of virtual camera motion in response to the corresponding head motion under natural viewing, and in the absence of locomotion, so that virtual camera movement can be either compressed or expanded while these manipulations remain undetected. Finally, we propose three applications for our method, addressing three key problems in VR: improving 6-DoF viewing for captured 360° footage, overcoming physical constraints, and reducing simulator sickness. We have further validated our thresholds and evaluated our applications by means of additional user studies confirming that our manipulations remain imperceptible, and showing that (i) compressing virtual camera motion reduces visible artifacts in 6-DoF, hence improving perceived quality, (ii) virtual expansion allows for completion of virtual tasks within a reduced physical space, and (iii) simulator sickness may be alleviated in simple scenarios when our compression method is applied.
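As a minimal sketch of the manipulation whose detectability is measured above, the code below scales only the lateral component of a per-frame head translation by a gain factor. The gain value and axis convention are placeholders, not the thresholds reported in the paper.

```python
# Hedged sketch of a lateral translation gain applied to per-frame head motion.
import numpy as np

def apply_translation_gain(head_delta, gain, lateral_axis=np.array([1.0, 0.0, 0.0])):
    """Scale only the lateral component of a head-position change by `gain`."""
    lateral = np.dot(head_delta, lateral_axis) * lateral_axis
    rest = head_delta - lateral
    return rest + gain * lateral

# Compress lateral camera motion to 80% of the physical head motion (placeholder gain).
head_delta = np.array([0.05, 0.01, 0.00])   # metres moved this frame; x is the lateral axis
virtual_delta = apply_translation_gain(head_delta, gain=0.8)
print(virtual_delta)  # [0.04 0.01 0.  ]
```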
Article
The majority of virtual reality (VR) applications rely on audiovisual stimuli and do not exploit the addition of other sensory cues that could increase the potential of VR. This systematic review surveys the existing literature on multisensory VR and the impact of haptic, olfactory, and taste cues over audiovisual VR. The goal is to identify the extent to which multisensory stimuli affect the VR experience, which stimuli are used in multisensory VR, the type of VR setups used, and the application fields covered. An analysis of the 105 studies that met the eligibility criteria revealed that 84.8% of the studies show a positive impact of multisensory VR experiences. Haptics is the most commonly used stimulus in multisensory VR systems (86.6%). Non-immersive and immersive VR setups are preferred over semi-immersive setups. Regarding the application fields, a considerable part was adopted by health professionals and science and engineering professionals. We further conclude that smell and taste are still underexplored, and they can bring significant value to VR applications. More research is recommended on how to synthesize and deliver these stimuli, which still require complex and costly apparatus be integrated into the VR experience in a controlled and straightforward manner.
Chapter
We provide an overview of the concerns, current practice, and limitations for capturing, reconstructing, and representing the real world visually within virtual reality. Given that our goals are to capture, transmit, and depict complex real-world phenomena to humans, these challenges cover the opto-electro-mechanical, computational, informational, and perceptual fields. Practically producing a system for real-world VR capture requires navigating a complex design space and pushing the state of the art in each of these areas. As such, we outline several promising directions for future work to improve the quality and flexibility of real-world VR capture systems.
Article
We conduct novel analyses of users' gaze behaviors in dynamic virtual scenes and, based on our analyses, we present a novel CNN-based model called DGaze for gaze prediction in HMD-based applications. We first collect 43 users' eye tracking data in 5 dynamic scenes under free-viewing conditions. Next, we perform statistical analysis of our data and observe that dynamic object positions, head rotation velocities, and salient regions are correlated with users' gaze positions. Based on our analysis, we present a CNN-based model (DGaze) that combines object position sequence, head velocity sequence, and saliency features to predict users' gaze positions. Our model can be applied to predict not only real-time gaze positions but also gaze positions in the near future, and can achieve better performance than the prior method. In terms of real-time prediction, DGaze achieves a 22.0% improvement over the prior method in dynamic scenes and obtains an improvement of 9.5% in static scenes, based on using the angular distance as the evaluation metric. We also propose a variant of our model called DGaze_ET that can be used to predict future gaze positions with higher precision by combining accurate past gaze data gathered using an eye tracker. We further analyze our CNN architecture and verify the effectiveness of each component in our model. We apply DGaze to gaze-contingent rendering and a game, and also present the evaluation results from a user study.
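To make the input/output structure concrete, here is a deliberately tiny PyTorch sketch that, like DGaze, consumes a window of dynamic-object positions and head velocities together with a saliency descriptor and regresses a 2D gaze position. The architecture, layer sizes, and feature dimensions are hypothetical and much simpler than the published model.

```python
# Hypothetical sketch, not the DGaze architecture.
import torch
import torch.nn as nn

class TinyGazeNet(nn.Module):
    def __init__(self, window=30, obj_dim=2, head_dim=3, sal_dim=8):
        super().__init__()
        in_channels = obj_dim + head_dim          # per-timestep dynamic features
        self.temporal = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # collapse the time axis
        )
        self.head = nn.Sequential(
            nn.Linear(16 + sal_dim, 32), nn.ReLU(), nn.Linear(32, 2),
        )

    def forward(self, dynamic_seq, saliency_feat):
        # dynamic_seq: (batch, obj_dim + head_dim, window); saliency_feat: (batch, sal_dim)
        temporal_feat = self.temporal(dynamic_seq).squeeze(-1)
        return self.head(torch.cat([temporal_feat, saliency_feat], dim=1))

model = TinyGazeNet()
gaze = model(torch.randn(4, 5, 30), torch.randn(4, 8))
print(gaze.shape)  # torch.Size([4, 2]): one 2D gaze prediction per sample
```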
Article
Audio information has been bypassed by most current visual attention prediction studies. However, sound can influence visual attention, and such influence has been widely investigated and proven by many psychological studies. In this paper, we propose a novel multi-modal saliency (MMS) model for videos containing scenes with high audio-visual correspondence. In such scenes, humans tend to be attracted by the sound sources, and it is also possible to localize the sound sources via cross-modal analysis. Specifically, we first detect the spatial and temporal saliency maps from the visual modality by using a novel free energy principle. Then we propose to detect the audio saliency map from both audio and visual modalities by localizing the moving-sounding objects using cross-modal kernel canonical correlation analysis, which is the first of its kind in the literature. Finally, we propose a new two-stage adaptive audiovisual saliency fusion method to integrate the spatial, temporal, and audio saliency maps into our audio-visual saliency map. The proposed MMS model captures the influence of audio, which is not considered in the latest deep learning based saliency models. To take advantage of both deep saliency modeling and audio-visual saliency modeling, we propose to combine deep saliency models and the MMS model via late fusion, and we find that an average performance gain of 5% is obtained. Experimental results on audio-visual attention databases show that the introduced models incorporating audio cues have significant superiority over state-of-the-art image and video saliency models which utilize a single visual modality.
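As a generic illustration of the final fusion step, the sketch below normalises spatial, temporal, and audio saliency maps and combines them with fixed weights. The MMS model's actual two-stage adaptive fusion is more sophisticated; the weights and random maps here are placeholders.

```python
# Generic weighted fusion of spatial, temporal, and audio saliency maps.
import numpy as np

def normalise(saliency):
    """Rescale a saliency map to the [0, 1] range."""
    s = saliency - saliency.min()
    return s / (s.max() + 1e-8)

def fuse_saliency(spatial, temporal, audio, weights=(0.4, 0.3, 0.3)):
    """Combine the three normalised maps with fixed (placeholder) weights."""
    maps = [normalise(m) for m in (spatial, temporal, audio)]
    fused = sum(w * m for w, m in zip(weights, maps))
    return normalise(fused)

# Placeholder maps standing in for the outputs of the three saliency detectors.
h, w = 90, 160
rng = np.random.default_rng(1)
fused = fuse_saliency(rng.random((h, w)), rng.random((h, w)), rng.random((h, w)))
print(fused.shape, float(fused.min()), float(fused.max()))
```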
Article
Nowadays, 360° video/image has become increasingly popular and drawn great attention. The spherical viewing range of 360° video/image accounts for huge amounts of data, which poses challenges to 360° video/image processing, such as the bottlenecks of storage and transmission. Accordingly, recent years have witnessed the explosive emergence of works on 360° video/image processing. In this paper, we review the state-of-the-art works on 360° video/image processing from the aspects of perception, assessment, and compression. First, this paper reviews both datasets and visual attention modelling approaches for 360° video/image. Second, we survey the related works on both subjective and objective visual quality assessment (VQA) of 360° video/image. Third, we overview the compression approaches for 360° video/image, which either utilize the spherical characteristics or visual attention models. Finally, we summarize this review paper and outline future research trends in 360° video/image processing.
Conference Paper
Force feedback is said to be the next frontier in virtual reality (VR). Recently, with consumers pushing forward with untethered VR, researchers have turned away from solutions based on bulky hardware (e.g., exoskeletons and robotic arms) and started exploring smaller portable or wearable devices. However, when it comes to rendering inertial forces, such as when moving a heavy object around or when interacting with objects with unique mass properties, current ungrounded force feedback devices are unable to provide the quick weight-shifting sensations that can realistically simulate weight changes over 2D surfaces. In this paper we introduce Aero-plane, a force-feedback handheld controller based on two miniature jet propellers that can render shifting weights of up to 14 N within 0.3 seconds. Through two user studies we: (1) characterize the users' ability to perceive and correctly recognize different motion paths on a virtual plane while using our device; and (2) test the level of realism and immersion of the controller when used in two VR applications (a rolling ball on a plane, and using kitchen tools of different shapes and sizes). Lastly, we present a set of applications that further explore different usage cases and alternative form factors for our device.
Article
Virtual reality provides an immersive environment but can induce cybersickness due to the discrepancy between visual and vestibular cues. To avoid this problem, the movement of the virtual camera needs to match the motion of the user in the real world. Unfortunately, this is usually difficult due to the mismatch between the size of the virtual environments and the space available to the users in the physical domain. The resulting constraints on the camera movement significantly hamper the adoption of virtual reality headsets in many scenarios and make the design of the virtual environments very challenging. In this work, we study how the characteristics of the virtual camera movement (e.g., translational acceleration and rotational velocity) and the composition of the virtual environment (e.g., scene depth) contribute to perceived discomfort. Based on the results from our user experiments, we devise a computational model for predicting the magnitude of the discomfort for a given scene and camera trajectory. We further apply our model to a new path planning method which optimizes the input motion trajectory to reduce perceptual sickness. We evaluate the effectiveness of our method in improving perceptual comfort in a series of user studies targeting different applications. The results indicate that our method can reduce the perceived discomfort while maintaining the fidelity of the original navigation, and performs better than simpler alternatives.
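To convey the flavour of such a discomfort model, the sketch below accumulates a per-frame cost from translational acceleration, rotational velocity, and inverse scene depth over a camera trajectory. The functional form and weights are invented for illustration and are not the fitted model from the paper.

```python
# Hedged sketch of scoring a camera trajectory for predicted discomfort.
import numpy as np

def discomfort_score(lin_accel, rot_vel, inv_scene_depth, w_a=1.0, w_r=2.0, w_d=0.5):
    """Accumulate a simple per-frame discomfort cost over a trajectory (placeholder weights)."""
    lin_accel = np.abs(np.asarray(lin_accel))
    rot_vel = np.abs(np.asarray(rot_vel))
    inv_scene_depth = np.asarray(inv_scene_depth)
    # Rotation near close-by geometry is penalised more via the depth term.
    per_frame = w_a * lin_accel + w_r * rot_vel + w_d * inv_scene_depth * rot_vel
    return per_frame.sum()

# A smooth 60-frame segment vs. one with a fast turn close to nearby geometry.
smooth = discomfort_score([0.1] * 60, [0.01] * 60, [0.2] * 60)
sharp = discomfort_score([0.1] * 60, [0.20] * 60, [0.8] * 60)
print(f"smooth: {smooth:.1f}  sharp: {sharp:.1f}")
```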
Conference Paper
Sound and light signal propagation have similar physical properties. This provides inspiration for creating an audio-visual echolocation system, where light is mapped to the sound signal, visually representing the auralization of the virtual environment (VE). Some mammals navigate using echolocation; however, humans are less successful with this. To the authors' knowledge, sound propagation and its visualization have not yet been implemented in a perceptually pleasant way and used for navigation purposes in the VE. Therefore, the core novelty of this research is navigation with visualized echolocation signals using a cognitive mental mapping activity in the VE.
Article
We propose a robust 2D meshing algorithm, TriWild, to generate curved triangles reproducing smooth feature curves, leading to coarse meshes designed to match the simulation requirements necessary by applications and avoiding the geometrical errors introduced by linear meshes. The robustness and effectiveness of our technique are demonstrated by batch processing an SVG collection of 20k images, and by comparing our results against state of the art linear and curvilinear meshing algorithms. We demonstrate for our algorithm the practical utility of computing diffusion curves, fluid simulations, elastic deformations, and shape inflation on complex 2D geometries.
Article
Multiple factors can affect presence in virtual environments, such as the number of human senses engaged in a given experience or the extent to which the virtual experience is credible. The purpose of the present work is to study how the inclusion of credible multisensory stimuli affects the sense of presence, namely, through the use of wind, passive haptics, vibration, and scent. Our sample consisted of 37 participants (27 men and 10 women) whose ages ranged from 17 to 44 years old and were mostly students. The participants were divided randomly into 3 groups: Control Scenario (visual and auditory - N = 12), Passive Haptic Scenario (visual, auditory, and passive haptic - N = 13) and Multisensory Scenario (visual, auditory, wind, passive haptic, vibration, and scent - N = 12). The results indicated a significant increase in the involvement subscale when all multisensory stimuli were delivered. We found a trend where the use of passive haptics by itself has a positive impact on presence, which should be the subject of further work.