Kailun Yang

Kailun Yang
Hunan University · School of Robotics

Doctor of Engineering

About

202
Publications
77,776
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,947
Citations

Publications

Publications (202)
Article
Full-text available
Introduction of RGB-D sensors is a revolutionary force that offers a portable, versatile and cost-effective solution of navigational assistance for the visually impaired. RGB-D sensors on the market such as Microsoft Kinect, Asus Xtion and Intel RealSense are mature products, but all have a minimum detecting distance of about 800 mm. This results i...
Article
Full-text available
Pixel-wise semantic segmentation is capable of unifying most of driving scene perception tasks, and has enabled striking progress in the context of navigation assistance, where an entire surrounding sensing is vital. However, current mainstream semantic segmenters are predominantly benchmarked against datasets featuring narrow Field of View (FoV),...
Article
Full-text available
Semantic segmentation, unifying most navigational perception tasks at the pixel level has catalyzed striking progress in the field of autonomous transportation. Modern Convolution Neural Networks (CNNs) are able to perform semantic segmentation both efficiently and accurately, particularly owing to their exploitation of wide context information. Ho...
Thesis
Full-text available
Semantic segmentation plays a pivotal role in various computer vision applications, such as autonomous vehicles, robotic systems, and augmented reality. Accurate scene understanding is essential for these technologies to operate safely and effectively. In this context, the fusion of data from diverse sensors, such as LiDAR and cameras, has emerged...
Article
Full-text available
A semantic map of the road scene, covering fundamental road elements, is an essential ingredient in autonomous driving systems. It provides important perception foundations for positioning and planning when rendered in the Bird's-Eye-View (BEV). Currently, the prior knowledge of hypothetical depth can guide the learning of translating front perspec...
Preprint
Full-text available
Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most of the existing works are based on skeleton data while using a multi-modality setup. These works overlooked the differences in performance among modalities, which led to the propagation of erroneous knowledge between modalities while onl...
Preprint
Full-text available
To integrate action recognition methods into autonomous robotic systems, it is crucial to consider adverse situations involving target occlusions. Such a scenario, despite its practical relevance, is rarely addressed in existing self-supervised skeleton-based action recognition methods. To empower robots with the capacity to address occlusion, we p...
Preprint
Full-text available
Recently, transformer-based methods have shown exceptional performance in monocular 3D object detection, which can predict 3D attributes from a single 2D image. These methods typically use visual and depth representations to generate query points on objects, whose quality plays a decisive role in the detection accuracy. However, current unsupervise...
Preprint
Full-text available
Key-point-based scene understanding is fundamental for autonomous driving applications. At the same time, optical flow plays an important role in many vision tasks. However, due to the implicit bias of equal attention on all points, classic data-driven optical flow estimation methods yield less satisfactory performance on key points, limiting their...
Preprint
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object Segmentation (R-VOS) are two highly-related tasks, which both aim to segment specific objects from video sequences according to user-provided expression prompts. However, due to the challenges in modeling representations for different modalities, contemporary methods struggle...
Preprint
Full-text available
The recently rising markup-to-image generation poses greater challenges as compared to natural image generation, due to its low tolerance for errors as well as the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignm...
Preprint
Light field cameras can provide rich angular and spatial information to enhance image semantic segmentation for scene understanding in the field of autonomous driving. However, the extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resource of intelligent ve...
Preprint
Full-text available
The mobile robot relies on SLAM (Simultaneous Localization and Mapping) to provide autonomous navigation and task execution in complex and unknown environments. However, it is hard to develop a dedicated algorithm for mobile robots due to dynamic and challenging situations, such as poor lighting conditions and motion blur. To tackle this issue, we...
Preprint
Full-text available
Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization inf...
Preprint
Event cameras are capable of responding to log-brightness changes in microseconds. Its characteristic of producing responses only to the changing region is particularly suitable for optical flow estimation. In contrast to the super low-latency response speed of event cameras, existing datasets collected via event cameras, however, only provide limi...
Thesis
Full-text available
Providing stable navigation for the visually impaired in various unknown scenarios is a significant challenge. Simultaneous Localization And Mapping (SLAM) technology is essential for navigation with precise and reliable real-time map and position. It helps visually impaired people reach their destinations smoothly. Several SLAM algorithms are curr...
Preprint
Full-text available
High-quality panoramic images with a Field of View (FoV) of 360-degree are essential for contemporary panoramic computer vision tasks. However, conventional imaging systems come with sophisticated lens designs and heavy optical components. This disqualifies their usage in many mobile and wearable applications where thin and portable, minimalist ima...
Preprint
In this paper, we propose LF-PGVIO, a Visual-Inertial-Odometry (VIO) framework for large Field-of-View (FoV) cameras with a negative plane using points and geodesic segments. Notoriously, when the FoV of a panoramic camera reaches the negative half-plane, the image cannot be unfolded into a single pinhole image. Moreover, if a traditional straight-...
Preprint
The integration of diverse visual prompts like clicks, scribbles, and boxes in interactive image segmentation could significantly facilitate user interaction as well as improve interaction efficiency. Most existing studies focus on a single type of visual prompt by simply concatenating prompts and images as input for segmentation prediction, which...
Preprint
Domain adaptive semantic segmentation enables robust pixel-wise understanding in real-world driving scenes. Source-free domain adaptation, as a more practical technique, addresses the concerns of data privacy and storage limitations in typical unsupervised domain adaptation methods. It utilizes a well-trained source model and unlabeled target data...
Preprint
Domain adaptation is essential for activity recognition, as common spatiotemporal architectures risk overfitting due to increased parameters arising from the temporal dimension. Unsupervised domain adaptation methods have been extensively studied, yet, they require large-scale unlabeled data from the target domain. In this work, we address few-shot...
Preprint
Full-text available
Transformer-based methods have demonstrated superior performance for monocular 3D object detection recently, which predicts 3D attributes from a single 2D image. Most existing transformer-based methods leverage visual and depth representations to explore valuable query points on objects, and the quality of the learned queries has a great impact on...
Preprint
Interactive Image Segmentation (IIS) has emerged as a promising technique for decreasing annotation time. Substantial progress has been made in pre- and post-processing for IIS, but the critical issue of interaction ambiguity notably hindering segmentation quality, has been under-researched. To address this, we introduce AdaptiveClick -- a clicks-a...
Preprint
Full-text available
A semantic map of the road scene, covering fundamental road elements, is an essential ingredient in autonomous driving systems. It provides important perception foundations for positioning and planning when rendered in the Bird's-Eye-View (BEV). Currently, the prior knowledge of hypothetical depth can guide the learning of translating front perspec...
Article
Full-text available
Optical flow estimation is a basic task in self-driving and robotics systems, which enables to temporally interpret traffic scenes. Autonomous vehicles clearly benefit from the ultra-wide Field of View (FoV) offered by $360^\circ$ panoramic sensors. However, due to the unique imaging process of panoramic cameras, models designed for pinhole image...
Article
Full-text available
In this work, we introduce panoramic panoptic segmentation, as the most holistic scene understanding, both in terms of Field of View (FoV) and image-level understanding for standard camera-based input. A complete surrounding understanding provides a maximum of information to a mobile agent. This is essential information for any intelligent vehicle...
Preprint
Full-text available
This paper raises the new task of Fisheye Semantic Completion (FSC), where dense texture, structure, and semantics of a fisheye image are inferred even beyond the sensor field-of-view (FoV). Fisheye cameras have larger FoV than ordinary pinhole cameras, yet its unique special imaging model naturally leads to a blind area at the edge of the image pl...
Preprint
Full-text available
Visual place recognition has received increasing attention in recent years as a key technology in autonomous driving and robotics. The current mainstream approaches use either the perspective view retrieval perspective view (P2P) paradigm or the equirectangular image retrieval equirectangular image (E2E) paradigm. However, a natural and practical i...
Preprint
Full-text available
Seeing only a tiny part of the whole is not knowing the full circumstance. Bird's-eye-view (BEV) perception, a process of obtaining allocentric maps from egocentric views, is restricted when using a narrow Field of View (FoV) alone. In this work, mapping from 360{\deg} panoramas to BEV semantics, the 360BEV task, is established for the first time t...
Preprint
Full-text available
Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, we provide this dataset in four severe weather condi...
Chapter
Full-text available
Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline, fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In co...
Preprint
Full-text available
In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE) aiming at identifying currently activated muscular regions of humans performing a specific activity. Video-based AMGE is an important yet overlooked problem. To this intent, we provide the MuscleMap136 featuring >15K video clips with 136 different activiti...
Preprint
Full-text available
Wearable robotics can improve the lives of People with Visual Impairments (PVI) by providing additional sensory information. Blind people typically recognize objects through haptic perception. However, knowing materials before touching is under-explored in the field of assistive technology. To fill this gap, in this work, a wearable robotic system,...
Chapter
Full-text available
Failure to timely diagnose and effectively treat depression leads to over 280 million people suffering from this psychological disorder worldwide. The information cues of depression can be harvested from diverse heterogeneous resources, e.g., audio, visual, and textual data, raising demand for new effective multi-modal fusion approaches for automat...
Article
Full-text available
Occlusions are universal disruptions constantly present in the real world. Especially for sparse representations, such as human skeletons, a few occluded points might destroy the geometrical and temporal continuity critically affecting the results. Yet, the research of data-scarce recognition from skeleton sequences, such as one-shot action recogni...
Article
Full-text available
Panoramic Annular Lens (PAL) composed of few lenses has great potential in panoramic surrounding sensing tasks for mobile and wearable devices because of its tiny size and large Field of View (FoV). However, the image quality of tiny-volume PAL confines to optical limit due to the lack of lenses for aberration correction. In this paper, we propose...
Article
Full-text available
Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from the supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due...
Article
Full-text available
Transformer-based methods have demonstrated superior performance for monocular 3D object detection recently, which aims at predicting 3D attributes from a single 2D image. Most existing transformer-based methods leverage both visual and depth representations to explore valuable query points on objects, and the quality of the learned query points ha...
Article
Full-text available
Simultaneous Localization And Mapping (SLAM) has become a crucial aspect in the fields of autonomous driving and robotics. One crucial component of visual SLAM is the Field-of-View (FoV) of the camera, as a larger FoV allows for a wider range of surrounding elements and features to be perceived. However, when the FoV of the camera reaches the negat...
Article
Full-text available
Driver observation models are rarely deployed under perfect conditions. In practice, illumination, camera placement and type differ from the ones present during training and unforeseen behaviours may occur at any time. While observing the human behind the steering wheel leads to more intuitive human-vehicle-interaction and safer driving, it require...
Preprint
Full-text available
Semantic scene understanding with Minimalist Optical Systems (MOS) in mobile and wearable applications remains a challenge due to the corrupted imaging quality induced by optical aberrations. However, previous works only focus on improving the subjective imaging quality through computational optics, i.e. Computational Imaging (CI) technique, ignori...
Preprint
Full-text available
Limited by hardware cost and system size, camera's Field-of-View (FoV) is not always satisfactory. However, from a spatio-temporal perspective, information beyond the camera's physical FoV is off-the-shelf and can actually be obtained "for free" from the past. In this paper, we propose a novel task termed Beyond-FoV Estimation, aiming to exploit pa...
Chapter
Full-text available
Traditional frame-based cameras inevitably suffer from motion blur due to long exposure times. As a kind of bio-inspired camera, the event camera records the intensity changes in an asynchronous way with high temporal resolution, providing valid image degradation information within the exposure time. In this paper, we rethink the event-based image...
Article
Full-text available
With the rapid development of high-speed communication and artificial intelligence technologies, human perception of real-world scenes is no longer limited to the use of small Field of View (FoV) and low-dimensional scene detection devices. Panoramic imaging emerges as the next generation of innovative intelligent instruments for environmental perc...
Article
Full-text available
Transparent objects, such as glass walls and doors, constitute architectural obstacles hindering the mobility of people with low vision or blindness. For instance, the open space behind glass doors is inaccessible, unless it is correctly perceived and interacted with. However, traditional assistive technologies rarely cover the segmentation of thes...
Preprint
Full-text available
Loop closure is an important component of Simultaneous Localization and Mapping (SLAM) systems. Large Field-of-View (FoV) cameras have received extensive attention in the SLAM field as they can exploit more surrounding features on the panoramic image. In large-FoV VIO, for incorporating the informative cues located on the negative plane of the pano...
Article
Full-text available
At the heart of all automated driving systems is the ability to sense the surroundings, e.g., through semantic segmentation of LiDAR sequences, which experienced a remarkable progress due to the release of large datasets such as SemanticKITTI and nuScenes-LidarSeg. While most previous works focus on sparse segmentation of the LiDAR input, dens...
Preprint
In this paper, we address panoramic semantic segmentation, which provides a full-view and dense-pixel understanding of surroundings in a holistic way. Panoramic segmentation is under-explored due to two critical challenges: (1) image distortions and object deformations on panoramas; (2) lack of annotations for training panoramic segmenters. To tack...
Preprint
Full-text available
Optical flow estimation is a basic task in self-driving and robotics systems, which enables to temporally interpret traffic scenes. Autonomous vehicles clearly benefit from the ultrawide Field of View (FoV) offered by 360◦ panoramic sensors. However, due to the unique imaging process of panoramic cameras, models designed for pinhole images do not d...
Preprint
Full-text available
Humans have an innate ability to sense their surroundings, as they can extract the spatial representation from the egocentric perception and form an allocentric semantic map via spatial transformation and memory updating. However, endowing mobile agents with such a spatial sensing ability is still a challenge, due to two difficulties: (1) the previ...
Preprint
Full-text available
Failure to timely diagnose and effectively treat depression leads to over 280 million people suffering from this psychological disorder worldwide. The information cues of depression can be harvested from diverse heterogeneous resources, e.g., audio, visual, and textual data, raising demand for new effective multi-modal fusion approaches for its aut...
Chapter
Full-text available
Despite a number of advances in the accessibility of STEM education, there is a lack of advanced tool support for authors and educators seeking to make corresponding documents accessible. We propose an interactive labeling method that combines an AI with user input to create accessible chemical structural formulas and incrementally improve the mode...
Chapter
Full-text available
Exploring an unfamiliar indoor environment and avoiding obstacles is challenging for visually impaired people. Currently, several approaches achieve the avoidance of static obstacles based on the mapping of indoor scenes. To solve the issue of distinguishing dynamic obstacles, we propose an assistive system with an RGB-D sensor to detect dynamic in...
Preprint
Full-text available
In this work, we introduce panoramic panoptic segmentation, as the most holistic scene understanding, both in terms of Field of View (FoV) and image-level understanding for standard camera-based input. A complete surrounding understanding provides a maximum of information to a mobile agent, which is essential for any intelligent vehicle in order to...
Preprint
Full-text available
Panoramic Annular Lens (PAL), composed of few lenses, has great potential in panoramic surrounding sensing tasks for mobile and wearable devices because of its tiny size and large Field of View (FoV). However, the image quality of tiny-volume PAL confines to optical limit due to the lack of lenses for aberration correction. In this paper, we propos...
Preprint
Full-text available
Human Pose Estimation (HPE) based on RGB images has experienced a rapid development benefiting from deep learning. However, event-based HPE has not been fully studied, which remains great potential for applications in extreme scenes and efficiency-critical conditions. In this paper, we are the first to estimate 2D human pose directly from 3D event...
Conference Paper
Full-text available
Optical flow estimation is an essential task in self-driving systems, which helps autonomous vehicles perceive temporal continuity information of surrounding scenes. The calculation of all-pair correlation plays an important role in many existing state-of-the-art optical flow estimation methods. However, the reliance on local knowledge often limits...
Preprint
Full-text available
With the rapid development of high-speed communication and artificial intelligence technologies, human perception of real-world scenes is no longer limited to the use of small Field of View (FoV) and low-dimensional scene detection devices. Panoramic imaging emerges as the next generation of innovative intelligent instruments for environmental perc...
Thesis
Full-text available
Exploring an unfamiliar indoor environment and avoiding obstacles is challenging for vi- sually impaired people. Currently, several assistive applications for visual impaired people combined with Simultaneous Localization And Mapping (SLAM) technology and seman- tic segmentation technologies achieve localization of users, obstacle detection, destin...
Preprint
Full-text available
Structured Visual Content (SVC) such as graphs, flow charts, or the like are used by authors to illustrate various concepts. While such depictions allow the average reader to better understand the contents, images containing SVCs are typically not machine-readable. This, in turn, not only hinders automated knowledge aggregation, but also the percep...
Preprint
Full-text available
Driver observation models are rarely deployed under perfect conditions. In practice, illumination, camera placement and type differ from the ones present during training and unforeseen behaviours may occur at any time. While observing the human behind the steering wheel leads to more intuitive human-vehicle-interaction and safer driving, it require...
Preprint
Full-text available
Exploring an unfamiliar indoor environment and avoiding obstacles is challenging for visually impaired people. Currently, several approaches achieve the avoidance of static obstacles based on the mapping of indoor scenes. To solve the issue of distinguishing dynamic obstacles, we propose an assistive system with an RGB-D sensor to detect dynamic in...
Preprint
Full-text available
Autonomous vehicles utilize urban scene segmentation to understand the real world like a human and react accordingly. Semantic segmentation of normal scenes has experienced a remarkable rise in accuracy on conventional benchmarks. However, a significant portion of real-life accidents features abnormal scenes, such as those with object deformations,...
Preprint
Full-text available
Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline, fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In co...
Preprint
Full-text available
The performance of semantic segmentation of RGB images can be advanced by exploiting informative features from supplementary modalities. In this work, we propose CMX, a vision-transformer-based cross-modal fusion framework for RGB-X semantic segmentation. To generalize to different sensing modalities encompassing various uncertainties, we consider...