Article

Goal-Oriented Semantic Communications for Avatar-Centric Augmented Reality


Abstract

With the emergence of the metaverse and its use in representing humans and intelligent entities in social and related augmented reality (AR) applications, the current bit-oriented network faces challenges in supporting real-time updates for the vast amount of associated information, which hinders further development. Thus, a critical revolution in sixth generation (6G) networks is envisioned through the joint exploitation of information context and its importance to the goal, leading to a communication paradigm shift towards the semantic and effectiveness levels. However, current research has not yet proposed any explicit and systematic communication framework for AR applications that incorporates these two levels. To fill this research gap, this paper presents a goal-oriented semantic communication framework for augmented reality (GSAR) to enhance communication efficiency and effectiveness in 6G. Specifically, we first analyse the traditional wireless AR point cloud communication framework and then summarize our proposed semantic information along with the end-to-end wireless communication. We then detail the design blocks of the GSAR framework, covering both the semantic and effectiveness levels. Finally, numerous experiments have been conducted to demonstrate that, compared to the traditional point cloud communication framework, our proposed GSAR significantly reduces wireless AR application transmission latency by 95.6%, while improving communication effectiveness in geometry and color aspects by up to 82.4% and 20.4%, respectively.


... In the literature, there have been extensive prior works investigating the semantic transmission of 2D content (see, e.g., [65]-[67]). However, there are only a few studies exploiting 3D content transmission via semantic communications [19], [68]. The work [19] proposed to use keypoints, 2D images, and text as the semantics to transmit the 3D content by using the mesh representation. ...
... The work [19] proposed to use keypoints, 2D images, and text as the semantics to transmit the 3D content by using the mesh representation. The work [68] explored the 3D point cloud wireless transmission by using the keypoints of human avatars as the semantic features. Despite these advancements, 3D content semantic communication with radiance fields, especially NeRF and 3D-GS, has not been studied yet. ...
... The effectiveness of this paradigm has been demonstrated by prior works that exploit conventional 3D representations as the semantic knowledge base in human avatar transmission. For example, the work [68] proposed a point cloud semantic communication framework in which an avatar model storing skeleton and appearance information is exploited as the knowledge base to help semantic information extraction and avatar pose recovery, but this work did not consider radiance field models. Moreover, our framework can be applied to the transmission of complex scenes. ...
Article
Full-text available
The efficient representation, transmission, and reconstruction of three-dimensional (3D) content are becoming increasingly important for sixth-generation (6G) networks that aim to merge virtual and physical worlds to offer immersive communication experiences. Neural radiance field (NeRF) and 3D Gaussian splatting (3D-GS) have recently emerged as two promising 3D representation techniques based on radiance field rendering, which are able to provide photorealistic rendering results for complex scenes. Therefore, embracing NeRF and 3D-GS in 6G networks is envisioned to be a prominent solution to support emerging 3D applications with enhanced quality of experience. This paper provides a comprehensive overview of the integration of NeRF and 3D-GS in 6G. First, we review the basics of radiance field rendering techniques and highlight their applications and implementation challenges over wireless networks. Next, we consider the over-the-air training of NeRF and 3D-GS models over wireless networks by presenting various learning techniques. We particularly focus on federated learning design over a hierarchical device-edge-cloud architecture, which is suitable for exploiting distributed data and computing resources over 6G networks to train large models representing large-scale scenes. Then, we consider the over-the-air rendering of NeRF and 3D-GS models at the wireless network edge. We present three practical rendering architectures, namely local, remote, and co-rendering, and provide model compression approaches to facilitate the transmission of radiance field models for rendering. We also present rendering acceleration approaches and joint computation and communication designs to enhance rendering efficiency. In a case study, we propose a new semantic communication enabled 3D content transmission design, in which the radiance field models are exploited as the semantic knowledge base to reduce the communication overhead for distributed inference. In addition, we discuss the utilization of radiance field rendering in wireless applications like radio mapping and radio imaging, in which radiance field models are used to effectively represent complex radio environments to facilitate wireless network designs. It is our hope that this paper provides new insights into the integration of radiance field rendering in future 6G networks supporting 3D content.
... Goal-oriented communications encompass two main categories: human-centric and machine-centric. Human-centric applications, such as extended reality (XR) [2] and augmented reality (AR) [3], focus on preserving semantic meaning for accurate human comprehension. Machine-centric applications, including the industrial Internet of Things (IIoT) [4] and autonomous driving [5], prioritize transmitting information that directly optimizes system performance. ...
Preprint
Full-text available
Goal-oriented communications prioritize application-driven objectives over data accuracy, enabling intelligent next-generation wireless systems. Efficient scheduling in multi-device, multi-channel systems poses significant challenges due to high-dimensional state and action spaces. We address these challenges by deriving key structural properties of the optimal solution to the goal-oriented scheduling problem, incorporating Age of Information (AoI) and channel states. Specifically, we establish the monotonicity of the optimal state value function (a measure of long-term system performance) w.r.t. channel states and prove its asymptotic convexity w.r.t. AoI states. Additionally, we derive the monotonicity of the optimal policy w.r.t. channel states, advancing the theoretical framework for optimal scheduling. Leveraging these insights, we propose the structure-guided unified dual on-off policy DRL (SUDO-DRL), a hybrid algorithm that combines the stability of on-policy training with the sample efficiency of off-policy methods. Through a novel structural property evaluation framework, SUDO-DRL enables effective and scalable training, addressing the complexities of large-scale systems. Numerical results show SUDO-DRL improves system performance by up to 45% and reduces convergence time by 40% compared to state-of-the-art methods. It also effectively handles scheduling in much larger systems, where off-policy DRL fails and on-policy benchmarks exhibit significant performance loss, demonstrating its scalability and efficacy in goal-oriented communications.
... The authors utilised a joint function of AoI and value of information (VoI) to identify the most important control and command data, with the GSC goal of minimising the trajectory MSE. Another GSC framework, extending from [23], was proposed for point cloud-based avatar reconstruction in the Metaverse [25], where only the critical nodes of the avatar skeleton graph are transmitted to minimise bandwidth usage. The GSC framework has thus been developed for various scenarios, but its application to robot arm reconstruction in the DT has not been studied yet, where communication efficiency needs further improvement and both the effectiveness-level metrics for the reconstruction task and the semantic-level information remain unknown. ...
Preprint
As one of the most promising technologies in industry, the Digital Twin (DT) facilitates real-time monitoring and predictive analysis for real-world systems by precisely reconstructing virtual replicas of physical entities. However, this reconstruction faces unprecedented challenges due to the ever-increasing communication overhead, especially for digital robot arm reconstruction. To this end, we propose a novel goal-oriented semantic communication (GSC) framework to extract the GSC information for the robot arm reconstruction task in the DT, with the aim of minimising the communication load under strict and relaxed reconstruction error constraints. Unlike the traditional reconstruction framework that periodically transmits a reconstruction message for real-time DT reconstruction, our framework implements a feature selection (FS) algorithm to extract the semantic information from the reconstruction message, and a deep reinforcement learning-based temporal selection algorithm to selectively transmit the semantic information over time. We validate our proposed GSC framework through both Pybullet simulations and lab experiments based on the Franka Research 3 robot arm. For a range of distinct robotic tasks, simulation results show that our framework can reduce the communication load by at least 59.5% under strict reconstruction error constraints and 80% under relaxed reconstruction error constraints, compared with the traditional communication framework. Experimental results further confirm the effectiveness of our framework, with the communication load reduced by 53% in the strict constraint case and 74% in the relaxed constraint case. The demo is available at: https://youtu.be/2OdeHKxcgnk.
... Recently, semantic communications [6], [7] have emerged as a novel communication paradigm, which allows the transmitter to extract and transmit semantic information instead of bit-level raw data, thus significantly reducing the communication overhead while preserving the quality of experience. In the literature, there have been extensive prior works investigating the semantic transmission of 2D content (see, e.g., [8]-[10]), but only some initial works studying the semantic transmission of 3D content such as point clouds [11] and 360° videos [12]. How to efficiently deliver 3D content based on advanced NeRF-based representation has not been studied yet. ...
Preprint
This paper investigates the transmission of three-dimensional (3D) human face content for immersive communication over a rate-constrained transmitter-receiver link. We propose a new framework named NeRF-SeCom, which leverages neural radiance fields (NeRF) and semantic communications to improve the quality of 3D visualizations while minimizing the communication overhead. In the NeRF-SeCom framework, we first train a NeRF face model based on the NeRFBlendShape method, which is pre-shared between the transmitter and receiver as the semantic knowledge base to facilitate real-time transmission. Next, with the knowledge base, the transmitter extracts and sends only the essential semantic features for the receiver to reconstruct the 3D face in real time. To optimize the transmission efficiency, we classify the expression features into static and dynamic types. Over each video chunk, static features are transmitted once for all frames, whereas dynamic features are transmitted over a portion of frames to adhere to rate constraints. Additionally, we propose a feature prediction mechanism, which allows the receiver to predict the dynamic features for frames that are not transmitted. Experiments show that our proposed NeRF-SeCom framework significantly outperforms benchmark methods in delivering high-quality 3D visualizations of human faces.
Article
The ultra-reliable and low-latency communication (URLLC) service of the fifth-generation (5G) mobile communication network struggles to support safe robot operation. The sixth-generation (6G) mobile communication network is now proposed to provide hyper-reliable and low-latency communication to enable safer control for robots. However, current 5G/6G research has mainly focused on improving communication performance, while the robotics community has mostly assumed communication to be ideal. To jointly consider communication and robotic control with a focus on the specific robotic task, we propose goal-oriented semantic communication in robotic control (GSRC) to exploit the context of data and its importance in achieving the task at both the transmitter and receiver. At the transmitter, we propose a deep reinforcement learning algorithm to generate optimal control and command (C&C) data and a proactive repetition scheme (DeepPro) to increase the successful transmission probability. At the receiver, we design the value of information (VoI) and age of information (AoI) based queue ordering mechanism (VA-QOM) to rank the queue based on the semantic information extracted from AoI and VoI. The simulation results validate that our proposed GSRC framework achieves a 91.5% improvement in the mean square error compared to the traditional unmanned aerial vehicle control framework.
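To make the VA-QOM idea concrete, the sketch below ranks queued packets by a combined AoI/VoI score. The Packet fields, the linear scoring rule, and the weights are illustrative assumptions, not the paper's exact mechanism.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    payload: bytes
    generated_at: float  # timestamp when the C&C data was generated
    voi: float           # value of information, e.g. estimated task-error reduction

def rank_queue(queue, now, w_aoi=0.5, w_voi=0.5):
    """Order packets for transmission by a combined AoI/VoI score."""
    def score(p: Packet) -> float:
        aoi = now - p.generated_at           # age of information of this packet
        return w_aoi * aoi + w_voi * p.voi   # higher score -> transmit earlier
    return sorted(queue, key=score, reverse=True)
```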
Article
Full-text available
With the arrival of emerging devices, including Extended Reality (XR) devices and Unmanned Aerial Vehicles (UAVs), the traditional communication framework is approaching Shannon's physical capacity limit and cannot guarantee the massive amount of transmission within latency requirements. By jointly exploiting the context of data and its importance to the task and goal of communication, an emerging communication paradigm shift to the semantic level and effectiveness level is envisioned to be a key revolution in Sixth Generation (6G) networks. However, an explicit and systematic communication framework incorporating both the semantic level and the effectiveness level has not been proposed yet. In this article, we propose a generic goal-oriented semantic communication framework for various tasks with diverse data types, which incorporates both semantic-level information and effectiveness-aware performance metrics. We first analyze the unique characteristics of all data types and summarise the semantic information, along with corresponding extraction methods. We then propose a detailed goal-oriented semantic communication framework for different time-critical and non-critical tasks, presenting the goal-oriented semantic information, extraction methods, recovery methods, and effectiveness-aware performance metrics. Last but not least, we present a goal-oriented semantic communication framework tailored for the UAV control task to validate the effectiveness of the proposed framework.
Article
Full-text available
In the metaverse era, point cloud video (PCV) streaming on mobile XR devices is pivotal. While most current methods focus on PCV compression inherited from traditional 3-DoF video services, emerging AI techniques extract vital semantic information, producing content resembling the original. However, these techniques are early-stage and computationally intensive. To enhance the inference efficacy of AI-based approaches, accommodate dynamic environments, and facilitate applicability to metaverse XR devices, we present ISCom, an interest-aware semantic communication scheme for lightweight PCV streaming. ISCom features a region-of-interest (ROI) selection module, a lightweight encoder-decoder training module, and a learning-based scheduler to achieve real-time PCV decoding and rendering on resource-constrained devices. ISCom's dual-stage ROI selection significantly reduces data volume according to real-time interest. The lightweight PCV encoder-decoder training is tailored to resource-constrained devices and adapts to the heterogeneous computing capabilities of devices. Furthermore, we provide a deep reinforcement learning (DRL)-based scheduler to adaptively select the optimal encoder-decoder model for each device, considering the dynamic network environments and device computing capabilities. Our extensive experiments demonstrate that ISCom outperforms baselines on mobile devices, achieving a minimum rendering frame rate improvement of 10 FPS and up to 22 FPS. Furthermore, our method significantly reduces memory usage by 41.7% compared to the state-of-the-art AITransfer method. These results highlight the effectiveness of ISCom in enabling lightweight PCV streaming and its potential to improve immersive experiences for emerging metaverse applications.
Article
Full-text available
When people are in emotional distress, they often seek support. Virtual humans may provide unique and complementary benefits to human support provision, given that virtual humans are readily available and help sharers overcome socio-evaluative threats, thereby increasing willingness to disclose. Here, we examined whether talking to a virtual human elicits socio-emotional benefits, and whether this is moderated by the type of support provided. To examine the scope of the potential effect, we compared two key types of support (emotional and cognitive), across two emotions (anger and worry). Participants (N = 115) shared two personal emotional experiences with a virtual human, who provided either emotional or cognitive support via the Wizard-of-Oz method (i.e., a human-operated avatar). Results showed that participants felt better after talking to the virtual human, as evidenced by reduced intensity of the target emotion and generally improved affect. The emotional improvement was similar for emotional and cognitive support. Cognitive support was also experienced as equally effective as emotional support, and led to similar levels of experienced closeness and desire to interact with the virtual human again. These findings suggest that talking to a virtual human can be a valuable form of support at times of distress.
Article
Full-text available
Deep Learning advances have made it possible to recover full 3D meshes of human models from individual images. However, extension of this notion to videos for recovering temporally coherent poses is still under-explored. A major challenge in this direction is the lack of appropriately annotated video data for learning the desired computational models. Existing human pose datasets only provide 2D or 3D skeleton joint annotations and, moreover, are recorded in constrained environments. We first contribute a technique to synthesize monocular action videos with rich 3D annotations that are suitable for learning computational models for full mesh 3D human pose recovery. Compared to existing methods, which simply 'texture-map' clothes onto the 3D human pose models, our approach incorporates physics-based realistic cloth deformations with human body movements. The generated videos cover a large variety of human actions, poses, and visual appearances, while the annotations record accurate human pose dynamics and human body surface information. Our second major contribution is an end-to-end trainable Recurrent Neural Network for full pose mesh recovery from monocular video. Using the proposed video data and a Long Short-Term Memory recurrent structure, our network explicitly learns to model the temporal coherence in videos and imposes geometric consistency over the recovered meshes. We establish the effectiveness of the proposed model with quantitative and qualitative analysis using the proposed and benchmark datasets.
Article
Full-text available
A 3D Telepresence system allows users to interact with each other in a virtual, mixed, or augmented reality (VR, MR, AR) environment, creating a shared space for collaboration and communication. There are two main methods for representing users within these 3D environments. Users can be represented either as point cloud reconstruction-based avatars that resemble a physical user or as virtual character-based avatars controlled by tracking the users' body motion. This work compares both techniques to identify the differences between user representations and their fit in the reconstructed environments regarding the perceived presence, uncanny valley factors, and behavior impression. Our study uses an asymmetric VR/AR teleconsultation system that allows a remote user to join a local scene using VR. The local user observes the remote user with an AR head-mounted display, leading to facial occlusions in the 3D reconstruction. Participants perform a warm-up interaction task followed by a goal-directed collaborative puzzle task, pursuing a common goal. The local user was represented either as a point cloud reconstruction or as a virtual character-based avatar, in which case the point cloud reconstruction of the local user was masked. Our results show that the point cloud reconstruction-based avatar was superior to the virtual character avatar regarding perceived co-presence, social presence, behavioral impression, and humanness. Further, we found that the task type partly affected the perception. The point cloud reconstruction-based approach led to higher usability ratings, while objective performance measures showed no significant difference. We conclude that despite partly missing facial information, the point cloud-based reconstruction resulted in better conveyance of the user behavior and a more coherent fit into the simulation context.
Article
Full-text available
Wireless connectivity has traditionally been regarded as an opaque data pipe carrying messages, whose context-dependent meaning and effectiveness have been ignored. Nevertheless, in emerging cyber-physical and autonomous networked systems, acquiring, processing, and sending excessive amounts of distributed real-time data, which ends up being stale or useless to the end user, will cause communication bottlenecks, increased latency, and safety issues. We envision a communication paradigm shift, which makes the semantics of information (i.e., the significance and usefulness of messages) the foundation of the communication process. This entails a goal-oriented unification of information generation, transmission, and reconstruction, by taking into account process dynamics, signal sparsity, data correlation, and semantic information attributes. We apply this structurally new, synergetic approach to a communication scenario where the destination is tasked with real-time source reconstruction for the purpose of remote actuation. Capitalizing on semantics-empowered sampling and communication policies, we show significant reduction in both reconstruction error and cost of actuation error, as well as in the number of uninformative samples generated.
Conference Paper
Full-text available
3D point clouds constitute an emerging multimedia content, now used in a wide range of applications. The main drawback of this representation is the size of the data since typical point clouds may contain millions of points, usually associated with both geometry and color information. Consequently, a significant amount of work has been devoted to the efficient compression of this representation. Lossy compression leads to a degradation of the data and thus impacts the visual quality of the displayed content. In that context, predicting perceived visual quality computationally is essential for the optimization and evaluation of compression algorithms. In this paper, we introduce PCQM, a full-reference objective metric for visual quality assessment of 3D point clouds. The metric is an optimally-weighted linear combination of geometry-based and color-based features. We evaluate its performance on an open subjective dataset of colored point clouds compressed by several algorithms; the proposed quality assessment approach outperforms all previous metrics in terms of correlation with mean opinion scores.
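The metric's structure can be stated compactly; the feature set and normalization below are generic placeholders for PCQM's geometry- and color-based features and its optimized weights:

$$\mathrm{PCQM}(P_{\mathrm{ref}}, P_{\mathrm{dist}}) = \sum_{i=1}^{K} w_i\, f_i(P_{\mathrm{ref}}, P_{\mathrm{dist}}),$$

where each $f_i$ is a geometry-based (e.g., curvature-derived) or color-based (e.g., lightness-derived) feature computed between the reference and distorted clouds, and the weights $w_i$ are chosen to maximize correlation with mean opinion scores.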
Article
Full-text available
Cellular-connected wireless connectivity provides new opportunities for virtual reality (VR) to offer seamless user experience from anywhere at anytime. To realize this vision, the quality-of-service (QoS) for wireless VR needs to be carefully defined to reflect human perception requirements. In this article, we first identify the primary drivers of VR systems in terms of applications and use cases. We then map the human perception requirements to corresponding QoS requirements for four phases of VR technology development. To shed light on how to provide short/long-range mobility for VR services, we further list four main use cases for cellular-connected wireless VR and identify their unique research challenges along with their corresponding enabling technologies and solutions in 5G systems and beyond. Last but not least, we present a case study to demonstrate the effectiveness of our proposed solution and the unique QoS performance requirements of VR transmission compared to that of traditional video service in cellular networks.
Conference Paper
Full-text available
Augmented Reality (AR) and Virtual Reality (VR) multimodal systems are the latest trend within the field of multimedia. As they emulate the senses by means of omni-directional visuals, 360° sound, motion tracking and touch simulation, they are able to create a strong feeling of presence and interaction with the virtual environment. These experiences can be applied to virtual training (Industry 4.0), tele-surgery (healthcare) or remote learning (education). However, given the strong time and task sensitiveness of these applications, it is of great importance to sustain the end-user quality, i.e. the Quality-of-Experience (QoE), at all times. Lack of synchronization and quality degradation need to be reduced to a minimum to avoid feelings of cybersickness or loss of immersiveness and concentration. This means that there is a need to shift the quality management from system-centered performance metrics towards a more human, QoE-centered approach. However, this requires novel techniques in the three areas of the QoE-management loop (monitoring, modelling and control). This position paper identifies open areas of research to fully enable human-centric management of immersive multimedia. To this extent, four main dimensions are put forward: (1) Task and well-being driven subjective assessment; (2) Real-time QoE modelling; (3) Accurate viewport prediction; (4) Machine Learning (ML)-based quality optimization and content recreation. This paper discusses the state-of-the-art and provides possible solutions to tackle the open challenges.
Article
Full-text available
Since their inception in 2008, polar codes have been shown to offer near-capacity error correction performance across a wide range of block lengths and coding rates. Owing to this, polar codes have been selected to provide channel coding in the control channels of Third Generation Partnership Project (3GPP) New Radio (NR). The operation of the 3GPP NR polar codes is specified in the 3GPP standard TS 38.212, together with schemes for code block segmentation, Cyclic Redundancy Check (CRC) attachment, CRC scrambling, CRC interleaving, frozen and parity check bit insertion, sub-block interleaving, bit selection, channel interleaving and code block concatenation. The configuration of these components is different for the uplink, broadcast and downlink control channels. However, the lack of visualisations and diagrammatic explanations in TS 38.212 limits the accessibility of the standard to new readers. This motivates the aims of the paper, which provides detailed tutorials on the operation and motivation of the components of the 3GPP NR polar codes, as well as surveys of the 3GPP discussions that led to their specification. Furthermore, we comprehensively characterize the error correction and error detection performance of the 3GPP NR polar codes in the uplink, broadcast and downlink control channels.
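As background to the component chain described above, the encoding core that TS 38.212 builds on is multiplication by the n-fold Kronecker power of the 2x2 kernel F = [[1, 0], [1, 1]]. The butterfly sketch below is a minimal illustration (it omits CRC attachment, interleaving, rate matching, and sub-block handling, and the frozen-bit placement is hypothetical).

```python
def polar_transform(u):
    """Compute x = u * F^{(kron)n} over GF(2), where F = [[1, 0], [1, 1]].

    len(u) must be a power of two; u is a list of 0/1 ints.
    """
    x = list(u)
    step = 1
    while step < len(x):
        for i in range(0, len(x), 2 * step):
            for j in range(i, i + step):
                x[j] ^= x[j + step]  # GF(2) butterfly: upper branch absorbs lower
        step *= 2
    return x

# Toy N = 8 example: frozen positions carry 0, message bits sit on the
# (hypothetically) most reliable positions 3, 5, 6, 7.
u = [0, 0, 0, 1, 0, 1, 1, 1]
print(polar_transform(u))  # encoded codeword of length 8
```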
Article
Full-text available
Web search engines use various ranking methods to determine the order of web pages displayed on the Search Engine Result Page (SERP). PageRank is one of the most popular and widely used ranking methods. The PageRank of a web page can be defined as the average fraction of time a random web surfer spends on that page. PageRank is the stationary distribution of a stochastic process whose states are the web pages of the Web graph. This process is obtained by combining the hyperlink matrix of the web graph with a trivial uniform process, a combination needed to make the process primitive so that the stationary distribution is well defined. The combination depends on the value of the damping factor α∈[0,1] used in the computation of PageRank. The damping factor states how much of the time the random web surfer follows the hyperlink structure rather than teleporting. The value of α is largely empirical, and currently α = 0.85 is used, as suggested by Brin and Page. With α = 0.8, the random web surfer follows the hyperlink structure 80% of the time and teleports to new web pages at random 20% of the time. Today, web surfers tire early on the web because relevant information is often unavailable, and they readily teleport to new web pages rather than following the hyperlink structure, so values of the damping factor other than 0.85 deserve consideration. In this paper, we give an experimental analysis of PageRank computation for different values of the damping factor. We observe that for α = 0.7 the PageRank method takes fewer iterations to converge than for α = 0.85, and that for these values of α the top 25 web pages returned in the SERP are almost the same, with only some exchanging positions. The experimental results show that α = 0.7 takes approximately 25-30% fewer iterations than α = 0.85 to obtain nearly identical web pages in the top 25 results, which is relevant for personalized web search, selective crawling, and intra-web search engines.
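A minimal power-iteration sketch (on a hypothetical link matrix) makes the damping-factor trade-off concrete: the per-iteration contraction is governed by α, which is why α = 0.7 converges in fewer iterations than α = 0.85.

```python
import numpy as np

def pagerank(adj, alpha=0.7, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with damping factor alpha.

    adj[i, j] = 1 if page i links to page j; dangling pages are treated
    as linking uniformly to all pages.
    Returns the rank vector and the number of iterations used.
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic hyperlink matrix; dangling rows become uniform.
    H = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for it in range(1, max_iter + 1):
        r_next = alpha * (r @ H) + (1 - alpha) / n  # follow links vs. teleport
        if np.abs(r_next - r).sum() < tol:
            return r_next, it
        r = r_next
    return r, max_iter
```

Running the same graph with alpha=0.7 and alpha=0.85 reproduces the kind of iteration-count comparison reported above.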
Article
Full-text available
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
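PointNet++'s nested partitioning starts by selecting well-spread centroids (the paper uses farthest point sampling) and grouping neighbors around them. The minimal sketch below covers only the sampling step; grouping and the per-group PointNet are omitted.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k centroids that maximize mutual distance (points: N x 3)."""
    chosen = [0]                                   # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                 # farthest from all chosen so far
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]
```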
Article
Full-text available
Quality of Experience (QoE) ties together user perception, experience and expectations with application and network performance, the latter typically expressed by Quality of Service (QoS) parameters. Quantitative relationships between QoE and QoS are required in order to build effective QoE control mechanisms onto measurable QoS parameters. On this background, this paper proposes a generic formula in which QoE and QoS parameters are connected through an exponential relationship, called the IQX hypothesis. The formula relates changes of QoE with respect to QoS to the current level of QoE, is simple to fit, and its limit behaviours are straightforward to interpret. The paper validates the IQX hypothesis for streaming services, where QoE in terms of Mean Opinion Scores (MOS) is expressed as a function of loss and reordering ratio, the latter caused by jitter. For web surfing as the second application area, fits provided by the IQX hypothesis are shown to outperform previously published logarithmic functions. We conclude that the IQX hypothesis is a strong candidate to be taken into account when deriving relationships between QoE and QoS parameters.
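In its usual form, the IQX hypothesis posits that the sensitivity of QoE to a QoS impairment x (e.g., loss ratio) is proportional to the current QoE level above its floor, which integrates to an exponential:

$$\frac{\partial\,\mathrm{QoE}}{\partial x} = -\beta\,\big(\mathrm{QoE} - \gamma\big) \;\;\Longrightarrow\;\; \mathrm{QoE}(x) = \alpha\, e^{-\beta x} + \gamma,$$

with α, β, γ fitted per service. This is precisely why the formula "relates changes of QoE with respect to QoS to the current level of QoE".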
Article
In recent years, the concept of the Metaverse has attracted considerable attention. This paper provides a comprehensive overview of the Metaverse. First, the development status of the Metaverse is presented. We summarize the policies of various countries, companies, and organizations relevant to the Metaverse, as well as statistics on the number of Metaverse-related publications. Characteristics of the Metaverse are identified: 1) multi-technology convergence; 2) sociality; 3) hyper-spatio-temporality. For the multi-technology convergence of the Metaverse, we divide the technological framework of the Metaverse into five dimensions. For the sociality of the Metaverse, we focus on the Metaverse as a virtual social world. Regarding the characteristic of hyper-spatio-temporality, we introduce the Metaverse as an open, immersive, and interactive 3D virtual world which can break through the constraints of time and space in the real world. The challenges of the Metaverse are also discussed.
Article
eXtended reality (XR) technologies such as virtual reality and 360° stereoscopic streaming enable the concept of the Metaverse, an immersive virtual space for collaboration and interaction. To ensure a high-fidelity display of immersive media, the bandwidth, latency and network traffic patterns will need to be considered to ensure a user's Quality of Experience (QoE). In this article, examples and calculations are explored to demonstrate the requirements of the above-mentioned parameters. Additionally, future methods such as network-awareness using reinforcement learning (RL) and XR content awareness using spatial or temporal differences in the frames could be explored from a multimedia communications perspective.
Article
Video conferencing has become a popular mode of meeting despite consuming considerable communication resources. Conventional video compression causes resolution reduction under a limited bandwidth. Semantic video conferencing (SVC) maintains a high resolution by transmitting a few keypoints to represent the motions, because the background is almost static and the speakers do not change often. However, the study of the influence of transmission errors on keypoints is limited. In this paper, an SVC network based on keypoint transmission is established, which dramatically reduces transmission resources while only losing detailed expressions. Transmission errors in SVC only lead to a changed expression, whereas those in conventional methods directly destroy pixels. However, conventional error detectors, such as the cyclic redundancy check, cannot reflect the degree of expression changes. To overcome this issue, an incremental redundancy hybrid automatic repeat request framework for varying channels (SVC-HARQ) incorporating a novel semantic error detector is developed. SVC-HARQ has flexibility in bit consumption and achieves a good performance. In addition, SVC-CSI is designed for channel state information (CSI) feedback to allocate the keypoint transmission and enhance the performance dramatically. Simulations show that the proposed wireless semantic communication system can remarkably improve transmission efficiency.
Article
In this paper, we introduce the Age of Incorrect Information (AoII) as an enabler for semantics-empowered communication, a newly advocated communication paradigm centered around data’s role and its usefulness to the communication’s goal. First, we shed light on how the traditional communication paradigm, with its role-blind approach to data, is vulnerable to performance bottlenecks. Next, we highlight the shortcomings of several proposed performance measures destined to deal with the traditional communication paradigm’s limitations, namely the Age of Information (AoI) and the error-based metrics. We also show how the AoII addresses these shortcomings and captures more meaningfully the purpose of data. Afterward, we consider the problem of minimizing the average AoII in a transmitter-receiver pair scenario. We prove that the optimal transmission strategy is a randomized threshold policy, and we propose an algorithm that finds the optimal parameters. Furthermore, we provide a theoretical comparison between the AoII framework and the standard error-based metrics counterpart. Interestingly, we show that the AoII-optimal policy is also error-optimal for the adopted information source model. Concurrently, the converse is not necessarily true. Finally, we implement our policy in various applications, and we showcase its performance advantages compared to both the error-optimal and the AoI-optimal policies.
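AoII is commonly formalized as the product of a time penalty and an error penalty; a standard instantiation (with f and g chosen here for illustration) is

$$\mathrm{AoII}(t) = f\big(t - V(t)\big)\, g\big(X(t), \hat{X}(t)\big), \qquad V(t) = \sup\big\{\tau \le t : g\big(X(\tau), \hat{X}(\tau)\big) = 0\big\},$$

where $X(t)$ is the source state, $\hat{X}(t)$ the receiver's estimate, $f$ an increasing time penalty (e.g., $f(\Delta)=\Delta$), and $g$ a dissatisfaction function (e.g., the indicator $\mathbb{1}\{X \neq \hat{X}\}$). The age thus grows only while the estimate is wrong and resets once it becomes correct, which is what distinguishes AoII from plain AoI.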
Article
Metaverse, as an evolving paradigm of the next-generation Internet, aims to build a fully immersive, hyper spatiotemporal, and self-sustaining virtual shared space for humans to play, work, and socialize. Driven by recent advances in emerging technologies such as extended reality, artificial intelligence, and blockchain, metaverse is stepping from science fiction to an upcoming reality. However, severe privacy invasions and security breaches (inherited from underlying technologies or emerged in the new digital ecology) of metaverse can impede its wide deployment. At the same time, a series of fundamental challenges (e.g., scalability and interoperability) can arise in metaverse security provisioning owing to the intrinsic characteristics of metaverse, such as immersive realism, hyper spatiotemporality, sustainability, and heterogeneity. In this paper, we present a comprehensive survey of the fundamentals, security, and privacy of metaverse. Specifically, we first investigate a novel distributed metaverse architecture and its key characteristics with ternary-world interactions. Then, we discuss the security and privacy threats, present the critical challenges of metaverse systems, and review the state-of-the-art countermeasures. Finally, we draw open research directions for building future metaverse systems.
Article
Due to the increased popularity of augmented and virtual reality experiences, the interest in capturing high-resolution real-world point clouds has never been higher. Loss of details and irregularities in point cloud geometry can occur during the capturing, processing, and compression pipeline. It is essential to address these challenges by being able to upsample a low Level-of-Detail (LoD) point cloud into a high LoD point cloud. Current upsampling methods suffer from several weaknesses in handling point cloud upsampling, especially for dense real-world photo-realistic point clouds. In this paper, we present a novel geometry upsampling technique, PU-Dense, which can process a diverse set of point clouds including synthetic mesh-based point clouds, real-world high-resolution point clouds, real-world indoor LiDAR scanned objects, as well as outdoor dynamically acquired LiDAR-based point clouds. PU-Dense employs a 3D multiscale architecture using sparse convolutional networks that hierarchically reconstruct an upsampled point cloud geometry via progressive rescaling and multiscale feature extraction. The framework employs a UNet-type architecture that downscales the point cloud to a bottleneck and then upscales it to a higher level-of-detail (LoD) point cloud. PU-Dense introduces a novel Feature Extraction Unit that incorporates multiscale spatial learning by employing filters at multiple sampling rates and receptive fields. The architecture is memory efficient and is driven by a binary voxel occupancy classification loss that allows it to process high-resolution dense point clouds with millions of points during inference time. Qualitative and quantitative experimental results show that our method significantly outperforms the state-of-the-art approaches by a large margin while having much lower inference time complexity. We further test our method on high-resolution photo-realistic datasets. In addition, our method can handle noisy data well. We further show that our approach is memory efficient compared to the state-of-the-art methods.
Article
For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other. Recently, however, the community has realized that progress towards robust intelligent systems such as self-driving cars requires a concerted effort across the different fields. This motivated us to develop KITTI-360, successor of the popular KITTI dataset. KITTI-360 is a suburban driving dataset which comprises richer input modalities, comprehensive semantic instance annotations and accurate localization to facilitate research at the intersection of vision, graphics and robotics. For efficient annotation, we created a tool to label 3D scenes with bounding primitives and developed a model that transfers this information into the 2D image domain, resulting in over 150k images and 1B 3D points with coherent semantic instance annotations across 2D and 3D. Moreover, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics on the same dataset, e.g., semantic scene understanding, novel view synthesis and semantic SLAM. KITTI-360 will enable progress at the intersection of these research areas and thus contribute towards solving one of today's grand challenges: the development of fully autonomous self-driving systems.
Article
Semantic communications have shown their great potential to improve transmission reliability, especially in the low signal-to-noise regime. However, resource allocation for semantic communications still remains unexplored, which is a critical issue in guaranteeing semantic transmission reliability and communication efficiency. To fill this gap, we investigate the spectral efficiency in the semantic domain and rethink the semantic-aware resource allocation issue. Specifically, taking text semantic communication as an example, the semantic spectral efficiency (S-SE) is defined for the first time and is used to optimize resource allocation in terms of channel assignment and the number of transmitted semantic symbols. Additionally, for a fair comparison of semantic and conventional communication systems, a transform method is developed to convert the conventional bit-based spectral efficiency to the S-SE. Simulation results demonstrate the validity and feasibility of the proposed resource allocation method, as well as the superiority of semantic communications in terms of the S-SE.
Article
We consider the delay-domain sparse channel estimation and data detection/decoding problems in a massive multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) wireless communication system with low-resolution analog-to-digital converters (ADCs). The non-linear distortion due to coarse quantization leads to severe performance degradation in conventional OFDM receivers, which necessitates novel receiver techniques. First, we derive Bayesian Cramér-Rao lower bounds (CRLB) on the mean squared error (MSE) in recovering jointly compressible vectors from quantized noisy underdetermined measurements. Second, we formulate the pilot-assisted channel estimation as a multiple measurement vector (MMV) sparse recovery problem, and develop a variational Bayes (VB) algorithm to infer the posterior distribution of the channel. We benchmark the MSE performance of our algorithm with that of the CRLB, and numerically show that the VB algorithm meets the CRLB. Third, we present a soft symbol decoding algorithm that infers the posterior distributions of the data symbols given the quantized observations. We utilize the posterior statistics of the detected data symbols as virtual pilots, and propose an iterative soft symbol decoding and data-aided channel estimation procedure. Finally, we present a variant of the iterative algorithm that utilizes the output bit log-likelihood ratios of the channel decoder to adapt the data prior to further improve the performance. We provide interesting insights into the impact of the various system parameters on the MSE and bit error rate of the proposed algorithms, and benchmark them against the state-of-the-art.
Article
Accurately predicting point cloud quality plays an important role in human vision tasks. This paper presents an effective and robust objective point cloud quality assessment model called elastic potential energy similarity (EPES). Motivated by knowledge of point cloud distortion, EPES first expresses a point cloud as a collection of spatially scattered points. A set of origins is then deployed, and each scattered point is assumed to be connected to its nearest origin by a spring. Imposing external forces can move the scattered points to specific locations such that the resulting point clouds exhibit desired characteristics. At the same time, this process stores elastic potential energies in the springs. Therefore, by comparing the elastic potential energies stored in the springs of the reference and distorted point clouds, we are able to quantify the influence of distortion on point cloud quality. The proposed quality assessment model is evaluated on three fairly large databases, SJTU-PCQA, CPCQA, and LSPCQA. Experimental results show that EPES is superior to state-of-the-art metrics. Ablation studies demonstrate that EPES is robust to variations in the model parameter settings.
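A minimal formalization of the spring analogy, assuming Hookean springs of stiffness k between each point p and its nearest origin o(p); the paper's actual feature construction and similarity computation may differ:

$$E(P) = \sum_{p \in P} \tfrac{1}{2}\, k\, \lVert p - o(p) \rVert^{2}, \qquad \mathrm{EPES}(P_{\mathrm{ref}}, P_{\mathrm{dist}}) = \mathrm{sim}\big(E(P_{\mathrm{ref}}),\, E(P_{\mathrm{dist}})\big),$$

where sim(·,·) denotes a similarity measure between the elastic potential energies (or their per-origin distributions) of the reference and distorted clouds.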
Article
The availability of high-bandwidth, low-latency communication in 5G mobile networks enables remote rendered real-time virtual reality (VR) applications. Remote rendering of VR graphics in a cloud removes the need for a local personal computer for graphics rendering and augments the weak graphics processing unit capacity of stand-alone VR headsets. However, to prevent the added network latency of remote rendering from ruining the user experience, rendering a locally navigable viewport that is larger than the field of view of the HMD is necessary. The size of the viewport required depends on latency: longer latency requires rendering a larger viewport and streaming more content. In this article, we aim to utilize multi-access edge computing to assist the backend cloud in such remote rendered interactive VR. Given the dependency between latency and the amount and quality of the content streamed, our objective is to jointly optimize the tradeoff between average video quality and delivery latency. Formulating the problem as mixed integer nonlinear programming, we leverage the interpolation between the client's field of view frame size and overall latency to convert the problem to an integer nonlinear programming model, and then design efficient online algorithms to solve it. The results of our simulations, supplemented by real-world user data, reveal that, enabling a desired balance between video quality and latency, our algorithm achieves improvements of on average about 22% and 12% in video delivery latency and 8% in video quality compared to the order-of-arrival, threshold-based, and random-location strategies, respectively.
Article
Current avatar representations used in immersive VR applications lack features that may be important for supporting natural behaviors and effective communication among individuals. This study investigates the impact of the visual and nonverbal cues afforded by three different types of avatar representations in the context of several cooperative tasks. The avatar types we compared are No_Avatar (HMD and controllers only), Scanned_Avatar (wearing an HMD), and Real_Avatar (video-see-through). The subjective and objective measures we used to assess the quality of interpersonal communication include surveys of social presence, interpersonal trust, communication satisfaction, and attention to behavioral cues, plus two behavioral measures: duration of mutual gaze and number of unique words spoken. We found that participants reported higher levels of trustworthiness in the Real_Avatar condition compared to the Scanned_Avatar and No_Avatar conditions. They also reported a greater level of attentional focus on facial expressions compared to the No_Avatar condition and, for some tasks, spent more extended time attempting to engage in mutual gaze behavior compared to the Scanned_Avatar and No_Avatar conditions. In both the Real_Avatar and Scanned_Avatar conditions, participants reported higher levels of co-presence compared with the No_Avatar condition. In the Scanned_Avatar condition, compared with the Real_Avatar and No_Avatar conditions, participants reported higher levels of attention to body posture. Overall, our exit survey revealed that a majority of participants (66.67%) reported a preference for the Real_Avatar, compared with 25.00% for the Scanned_Avatar and 8.33% for the No_Avatar. These findings provide novel insight into how a user's experience in a social VR scenario is affected by the type of avatar representation provided.
Article
Extended reality (XR)-aided teleoperation has shown its potential in improving operating efficiency in mission-critical, information-rich, and complex scenarios. Multi-sensory XR devices introduce several new types of traffic with unique quality of service (QoS) requirements, which are usually defined by three measures: human perception, corresponding sensors, and presentation devices. To fulfill these requirements, cellular-supported wireless connectivity can be a promising solution that can largely benefit robot-to-XR and XR-to-robot links. In this article, we present industrial and piloting use cases and identify the service bottleneck of each case. We then cover the QoS of robot-XR and XR-robot links by summarizing the sensors' parameters and processing procedures. To realize these use cases, we introduce potential solutions for each case with cellular connections. Finally, we build testbeds to investigate the effectiveness of supporting our proposed links using current wireless topologies.
Chapter
This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has been shown to be a simple and effective approach in the image domain. Such a mixup is, however, not directly transferable to point clouds, as there is no one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation between point clouds as a shortest-path linear interpolation. To that end, we introduce PointMixup, an interpolation method that generates new examples through an optimal assignment of the path function between two point clouds. We prove that PointMixup finds the shortest path between two point clouds and that the interpolation is assignment invariant and linear. With this definition of interpolation, PointMixup makes it possible to introduce strong interpolation-based regularizers such as mixup and manifold mixup to the point cloud domain. Experimentally, we show the potential of PointMixup for point cloud classification, especially when examples are scarce, as well as increased robustness to noise and geometric transformations of points. The code for PointMixup and the experimental details are publicly available at: https://github.com/yunlu-chen/PointMixup/.
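A minimal sketch of assignment-based interpolation under the stated definition: solve an optimal one-to-one assignment between the two clouds, then interpolate matched pairs linearly. The squared-Euclidean cost and the Hungarian solver are illustrative; the paper's optimal-assignment formulation and solver may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_mixup(p1, p2, lam):
    """Interpolate two point clouds (each N x 3) along an optimal assignment.

    lam = 0 returns p1; lam = 1 returns a permutation of p2.
    """
    cost = ((p1[:, None, :] - p2[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    rows, cols = linear_sum_assignment(cost)                 # optimal matching
    return (1.0 - lam) * p1[rows] + lam * p2[cols]           # linear interpolation
```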
Conference Paper
Recently, stakeholders in the area of multimedia representation and transmission have been looking at plenoptic technologies to improve the immersive experience. Among these technologies, point clouds denote a volumetric information representation format with important applications in the entertainment, automotive and geographical mapping industries. There is some consensus that state-of-the-art solutions for efficient storage and communication of point clouds are far from satisfactory. This paper describes a study on point cloud quality evaluation, conducted in the context of JPEG Pleno to help define the test conditions of future compression proposals. A heterogeneous set of static point clouds in terms of number of points, geometric structure and represented scenarios were selected and compressed using octree-pruning and a projection-based method, with three different levels of degradation. The models comprised both geometric and color information and were displayed using point sizes large enough to ensure observation of watertight surfaces. The stimuli under assessment were presented to the observers on 2D displays as animations, after defining suitable camera paths to enable visualization of the models in their entirety and realistic consumption. The experiments were carried out in three different laboratories, and the subjective scores were used in a series of correlation studies to benchmark objective quality metrics and assess inter-laboratory consistency.
Article
An increasing number of virtual reality applications now use full-body avatars to represent the user in virtual environments. To fully control these virtual avatars, movement-tracking technology is required. However, most full-body tracking solutions are expensive and often cumbersome and time-consuming to set up and use. Affordable depth cameras, on the other hand, are easy to set up, but most lack the ability to fully track a user's body and fingers and have only limited accuracy. In this paper, we present a solution for combining multiple depth cameras to allow accurate full-body movement tracking, including accurate hand and finger tracking. This provides users with the possibility of using natural gestures to interact in the virtual environment. In particular, we improve on previous work in the following five aspects. We have (1) extended the calibration procedure to eliminate the tracking offsets between the RGB and depth cameras, (2) optimized facing-direction detection to improve the stability of data fusion, (3) implemented two new weighting methods for the depth data fusion of multiple cameras, (4) added the ability to also fuse joint-rotation data, and (5) integrated a short-range depth camera for finger tracking. We evaluated the system empirically and show that our new methods improved on previous work in terms of tracking accuracy and particularly reduced the coupled hand-lifting phenomenon.
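A minimal sketch of how per-camera joint estimates could be fused by weighted averaging; the confidence terms (tracking confidence scaled by facing direction) are illustrative stand-ins for the paper's two weighting methods.

```python
import numpy as np

def fuse_joint(estimates, confidences):
    """Fuse one joint's 3D position from multiple calibrated depth cameras.

    estimates:   list of 3-vectors, one per camera, in a common world frame
    confidences: per-camera weights, e.g. tracking confidence scaled by how
                 directly the user faces that camera (illustrative choice)
    """
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                                   # normalize weights
    return (w[:, None] * np.asarray(estimates)).sum(axis=0)
```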