Haibin Yan’s research while affiliated with Beijing University of Posts and Telecommunications and other places


Publications (52)


Kinship verification via Frequency Feature Decoupling and Fusion
  • Article

April 2025 · 1 Read · Pattern Recognition Letters

Shuofeng Sun · Yaohan Yang · Haibin Yan
[Table captions: ablation of the optimization cost terms (full setting, with cost terms removed one at a time) and of the bi-level optimization; real-world results, with each task tested 10 times.]
MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
  • Preprint
  • File available

March 2025 · 22 Reads

Zhenyu Wu · Yuheng Zhou · Xiuwei Xu · [...] · Haibin Yan

Mobile manipulation is a fundamental challenge for robotics in assisting humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models from fixed-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved by the mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability, and we design motion planning objectives for the mobile base and the robot arm that maximize the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enlarge the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. In this way, MoManipVLA can adjust the position of the robot base in a zero-shot manner, making the waypoints predicted by the fixed-base VLA models feasible. Extensive experimental results on OVMM and in the real world demonstrate that MoManipVLA achieves a 4.2% higher success rate than state-of-the-art mobile manipulation methods, and only requires 50 training cost for real-world deployment thanks to the strong generalization ability of the pre-trained VLA models.
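The bi-level optimization described in this abstract can be illustrated with a toy sketch. All names, the grid of base candidates, and the reach-penalty cost below are our own illustrative assumptions, not the authors' implementation: the upper level searches over base placements, and the lower level scores how feasibly the arm could reach the VLA-predicted waypoints from each placement.

```python
def lower_level_cost(base_xy, waypoints, arm_reach=1.0):
    """Lower level: score end-effector feasibility from a fixed base.
    Penalize the squared distance by which each waypoint exceeds a
    (hypothetical) arm reach limit."""
    cost = 0.0
    for wx, wy in waypoints:
        d = ((wx - base_xy[0]) ** 2 + (wy - base_xy[1]) ** 2) ** 0.5
        cost += max(0.0, d - arm_reach) ** 2
    return cost

def upper_level_search(waypoints, base_candidates):
    """Upper level: pick the base placement minimizing the lower-level cost."""
    return min(base_candidates, key=lambda b: lower_level_cost(b, waypoints))

# End-effector waypoints a fixed-base VLA model might predict (made up).
waypoints = [(2.0, 0.0), (2.5, 0.5)]
# Coarse grid of candidate base positions for the upper level to search.
candidates = [(0.5 * i, 0.5 * j) for i in range(8) for j in range(4)]
best_base = upper_level_search(waypoints, candidates)
```

Once the base is moved near the waypoints, the reach penalty drops to zero, which mirrors how repositioning the base makes waypoints predicted by a fixed-base model feasible.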




Embodied Instruction Following in Unknown Environments

June 2024 · 31 Reads

Enabling embodied agents to complete complex human instructions given in natural language is crucial for autonomous systems in household services. Conventional methods can only accomplish human instructions in known environments where all interactive objects are provided to the embodied agent, and directly deploying existing approaches in unknown environments usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework consisting of a high-level task planner and a low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to represent the known visual clues, so that the goals of task planning and scene exploration are aligned with the human instruction. For the task planner, we generate feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. Experimental results demonstrate that our method achieves a 45.09% success rate on 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.
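The feasibility idea in this abstract can be sketched minimally (the function, action names, and object names are hypothetical, not the authors' API): plan steps whose target object has already been observed are kept, while steps mentioning unseen objects are turned into exploration subgoals instead of infeasible manipulations.

```python
def make_feasible_plan(instruction_steps, known_objects):
    """Keep manipulation steps whose object is already observed; replace
    steps that mention unseen objects with an exploration subgoal."""
    plan = []
    for action, obj in instruction_steps:
        if obj in known_objects:
            plan.append((action, obj))       # object on the semantic map
        else:
            plan.append(("explore_for", obj))  # must explore before acting
    return plan

steps = [("pick_up", "mug"), ("open", "fridge"), ("pick_up", "bread")]
known = {"mug", "fridge"}  # objects the agent has already observed
plan = make_feasible_plan(steps, known)
```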



Hardness-Aware Scene Synthesis for Semi-Supervised 3D Object Detection

May 2024 · 3 Reads

3D object detection aims to recover the 3D information of objects of interest and serves as a fundamental task of autonomous driving perception. Its performance greatly depends on the scale of labeled training data, yet it is costly to obtain high-quality annotations for point cloud data. While conventional methods focus on generating pseudo-labels for unlabeled samples as supplements for training, the structural nature of 3D point cloud data facilitates the composition of objects and backgrounds to synthesize realistic scenes. Motivated by this, we propose a hardness-aware scene synthesis (HASS) method to generate adaptive synthetic scenes and improve the generalization of detection models. We obtain pseudo-labels for unlabeled objects and generate diverse scenes with different compositions of objects and backgrounds. As scene synthesis is sensitive to the quality of pseudo-labels, we further propose a hardness-aware strategy to reduce the effect of low-quality pseudo-labels, and we maintain a dynamic pseudo-database to ensure the diversity and quality of synthetic scenes. Extensive experimental results on the widely used KITTI and Waymo datasets demonstrate the superiority of the proposed HASS method, which outperforms existing semi-supervised learning methods on 3D object detection. Code: https://github.com/wzzheng/HASS.


Hardness-Aware Scene Synthesis for Semi-Supervised 3D Object Detection

January 2024 · 4 Reads · 4 Citations

IEEE Transactions on Multimedia

3D object detection aims to recover the 3D information of objects of interest and serves as a fundamental task of autonomous driving perception. Its performance greatly depends on the scale of labeled training data, yet it is costly to obtain high-quality annotations for point cloud data. This motivates the use of semi-supervised learning, which can additionally exploit unlabeled data to further boost performance. While 2D semi-supervised learning methods focus on generating pseudo-labels for existing unlabeled samples as supplements for training, the structural nature of 3D point cloud data facilitates the composition of objects and backgrounds to synthesize realistic scenes. Motivated by this, we propose a hardness-aware scene synthesis (HASS) method to generate adaptive synthetic scenes and improve the generalization of detection models. We obtain pseudo-labels for unlabeled objects and generate diverse scenes with different compositions of objects and backgrounds. As scene synthesis is sensitive to the quality of pseudo-labels, we further propose a hardness-aware strategy to reduce the effect of low-quality pseudo-labels. In addition, we maintain a dynamic pseudo-database to ensure the diversity and quality of synthetic scenes. Extensive experimental results on the widely used KITTI and Waymo datasets demonstrate the superiority of the proposed HASS method, which outperforms existing semi-supervised learning methods on 3D object detection. We also conduct a series of experiments to analyze the effectiveness of our method, including pseudo-label quality analysis, the effect of different filtering and thresholding strategies, and ablations of each component.
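The pseudo-label filtering and the dynamic pseudo-database described here can be sketched as follows. The confidence schedule, the entry format, and the size cap are illustrative assumptions, not the paper's actual design: low-confidence pseudo-labels are filtered more aggressively as training hardness grows, and the database keeps only the best-scoring entries.

```python
def update_pseudo_database(db, new_entries, max_size=4):
    """Merge new pseudo-labelled objects into the database, keeping only
    the highest-confidence entries when it overflows (a stand-in for the
    dynamic quality/diversity maintenance described in the abstract)."""
    merged = sorted(db + new_entries, key=lambda e: e["score"], reverse=True)
    return merged[:max_size]

def hardness_filter(entries, hardness):
    """Raise the confidence threshold as training hardness grows, so fewer
    low-quality pseudo-labels enter scene synthesis (hypothetical schedule)."""
    threshold = 0.5 + 0.4 * hardness
    return [e for e in entries if e["score"] >= threshold]

db = update_pseudo_database(
    [{"obj": "car", "score": 0.9}, {"obj": "cyclist", "score": 0.55}],
    [{"obj": "pedestrian", "score": 0.8}, {"obj": "car", "score": 0.6}],
    max_size=3,
)
kept = hardness_filter(db, hardness=0.5)  # threshold becomes 0.7
```

Objects that survive the filter would then be composed with sampled backgrounds to form synthetic training scenes.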


Toward Integrity and Detail With Ensemble Learning for Salient Object Detection in Optical Remote-Sensing Images

January 2024 · 2 Reads · 4 Citations

IEEE Transactions on Geoscience and Remote Sensing

Optical remote sensing image salient object detection (ORSI-SOD) poses significant challenges due to complicated object variations and interfering surroundings. Although existing methods have achieved impressive performance, they encounter difficulties in balancing deep and shallow features, leading to limitations in preserving object integrity and edge detail. To address this, we propose the Integrated and Detailed Ensemble Learning (IDEL) framework, which incorporates hierarchical branches with deep supervision. Following a divide-and-conquer strategy, each branch captures information at a specific granularity, while a fusion module combines all outputs to generate the final saliency maps. To ensure the effectiveness of ensemble learning, IDEL is designed to satisfy two necessary conditions: the weak learner property and branch independence. First, we utilize Transformer blocks with a global receptive field and purify intermediate features with the Deep Supervision Module (DSM) to enhance the performance of each branch. Second, we disentangle the branches through hardness-aware weights and hierarchical supervision labels, allowing them to learn distinct features. Qualitative visualizations demonstrate the effectiveness of each module, and extensive experimental results on three popular ORSI datasets confirm the superiority of IDEL over other state-of-the-art (SOTA) counterparts.
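The fusion of branch outputs can be sketched as a simple weighted average of per-branch saliency maps. This is an assumption for illustration: the weights here merely stand in for the hardness-aware weighting the abstract describes, and the maps are tiny nested lists rather than real images.

```python
def fuse_branches(branch_maps, weights):
    """Fuse per-branch saliency maps (nested lists of floats) by a
    normalized weighted average."""
    assert branch_maps and len(branch_maps) == len(weights)
    total = sum(weights)
    h, w = len(branch_maps[0]), len(branch_maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for saliency, weight in zip(branch_maps, weights):
        for i in range(h):
            for j in range(w):
                fused[i][j] += weight * saliency[i][j] / total
    return fused

# Two toy 1x2 "saliency maps" from two branches with unequal weights.
coarse = [[1.0, 0.0]]   # branch capturing coarse-granularity structure
fine = [[0.0, 1.0]]     # branch capturing fine detail
fused = fuse_branches([coarse, fine], weights=[3.0, 1.0])
```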


Dense Hybrid Proposal Modulation for Lane Detection

November 2023 · 19 Reads · 9 Citations

IEEE Transactions on Circuits and Systems for Video Technology

In this paper, we present a dense hybrid proposal modulation (DHPM) method for lane detection. Most existing methods perform sparse supervision on a subset of high-scoring proposals, while the remaining proposals receive no effective shape or location guidance, resulting in poor overall quality. To address this, we densely modulate all proposals to generate topologically and spatially high-quality lane predictions with discriminative representations. Specifically, we first ensure that lane proposals are physically meaningful by applying single-lane shape and location constraints. Benefiting from the proposed proposal-to-label matching algorithm, we assign each proposal a target ground-truth lane to efficiently learn from spatial layout priors. To enhance generalization and model inter-proposal relations, we diversify the shape differences of proposals matching the same ground-truth lane. In addition to the shape and location constraints, we design a quality-aware classification loss to adaptively supervise each positive proposal so that the discriminative power can be further boosted. Our DHPM achieves very competitive performance on four popular benchmark datasets. Moreover, we consistently outperform the baseline model on most metrics without introducing new parameters or reducing inference speed. The code of our method is available at https://github.com/wuyuej/DHPM.
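The proposal-to-label matching step can be illustrated with a toy nearest-lane assignment. Lanes are reduced here to lists of sampled x-offsets, and the distance metric and names are our own assumptions; the point is that every proposal, not just the high-scoring ones, receives a ground-truth target.

```python
def match_proposals(proposals, gt_lanes):
    """Assign each proposal (a list of sampled x-positions) to the index
    of the nearest ground-truth lane, so that all proposals receive
    shape/location supervision targets."""
    def dist(p, g):
        # Mean pointwise x-offset between correspondingly sampled points.
        return sum(abs(a - b) for a, b in zip(p, g)) / len(p)
    return [min(range(len(gt_lanes)), key=lambda k: dist(p, gt_lanes[k]))
            for p in proposals]

proposals = [[0.1, 0.2, 0.3], [0.9, 1.0, 1.1]]  # two toy lane proposals
gt_lanes = [[0.0, 0.2, 0.4], [1.0, 1.0, 1.0]]   # two ground-truth lanes
matches = match_proposals(proposals, gt_lanes)
```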


Citations (32)


... Different from prior work, which generates object attributes and then applies a database retrieval process to obtain a textured object mesh, heavily restricting the generation capacity, Graph-to-3D [27] proposes a GCN-based VAE architecture that learns to directly generate object 3D meshes from the scene graph as input within an end-to-end framework. To better handle biased generation results due to object category imbalance in the training set, FairScene [150] exploits unbiased object interactions with a causal reasoning framework, achieving fair scene synthesis by calibrating the long-tailed category distribution. ...

Reference:

Computer-Aided Layout Generation for Building Design: A Review
FairScene: Learning unbiased object interactions for indoor scene synthesis
  • Citing Article
  • December 2024

Pattern Recognition

... It has reached 90.7% in OA and 90.0%. Meanwhile, DMR is 4.8% higher than pointMLP (Ma et al. 2022) in OA after voting, which also achieves a level comparable to SOTA methods (Qi et al. 2017b;Wang et al. 2019;Li et al. 2018;Qiu, Anwar, and Barnes 2021;Ran, Liu, and Wang 2022;Ma et al. 2022;Zheng et al. 2023;Qian et al. 2022;Lin et al. 2023;Park et al. 2023;Yao et al. 2023;Thomas et al. 2024;Chen et al. 2023;Sun et al. 2024). It is worth noting that DMR far surpasses HyCoRe to reach 3.0%, which also further verifies that our method is theoretically correct. ...

X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition
  • Citing Conference Paper
  • June 2024

... Zhou et al. [21] enhanced multi-scale deep features through an edge extraction module and incorporated fine edge details into the saliency maps using a hybrid loss that includes edge-aware constraints. Liu et al. [46] utilized Transformer blocks with global perception areas to extract features and adopted a divide-and-conquer method to extract and integrate relevant features from various branches. Yan et al. [47] ingeniously integrated Transformers and CNNs into the encoder, using an adaptive semantic matching mechanism to model both global and local relationships. ...

Toward Integrity and Detail With Ensemble Learning for Salient Object Detection in Optical Remote-Sensing Images
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... This transform is critical for enabling vehicles to understand their environment, detect objects, and navigate safely. Previous works (Huang et al., 2021; Reading et al., 2021; Zhou & Krähenbühl, 2022; Zeng et al., 2024) have achieved remarkable 3D perception ability by utilizing Bird's Eye View (BEV) representations to process the 2D-3D lift. Recently, many vision-based 3D occupancy prediction methods (Huang et al., 2023; Wei et al., 2023a;b) further improved the understanding of dynamic and cluttered driving scenes, pushing the boundaries of the research domain. ...

Hardness-Aware Scene Synthesis for Semi-Supervised 3D Object Detection
  • Citing Article
  • January 2024

IEEE Transactions on Multimedia

... Lane detection is crucial for autonomous driving [3]- [6], as it allows the identification of lane boundaries and facilitates the determination of secure driving routes [4], [7], [8]. The RGB images captured by vehicle-mounted cameras provide multiple forward lane lines and other useful driving information [9], [10]. ...

Dense Hybrid Proposal Modulation for Lane Detection
  • Citing Article
  • November 2023

IEEE Transactions on Circuits and Systems for Video Technology

... Language model grounding for embodied tasks: An embodied agent not only requires active exploration [33], manipulation [34], and scene perception [35,36] as well as embodied task planning ability. Embodied task planning aims to generate executable action steps in the given environments, where action plans are generated from grounded LLMs by receiving information from the surrounding environments [37,38,39] or prompt engineering [40]. ...

Smart Explorer: Recognizing Objects in Dense Clutter via Interactive Exploration
  • Citing Conference Paper
  • October 2022

... Alongside these, deep learning techniques have demonstrated significant effectiveness. Recent studies, like [29], have utilized dual CNNs with shared parameters to extract multi-scale deep features, providing a comprehensive contextual representation of facial images. Integrating multiple or multi-level features has become a prevalent strategy in kinship verification. ...

Multi-scale Deep Relational Reasoning for Facial Kinship Verification
  • Citing Article
  • July 2020

Pattern Recognition

... The classification is made through threshold comparison or a basic machine learning classifier (e.g., Support Vector Machines (SVM), K-Nearest Neighbors (KNN)) [22, 26–31]. Moreover, thanks to the contribution of deep learning in the field of computer vision, many works have shifted toward implementing deep learning-based methods, as they not only encapsulate the classification and extraction process but also capture more accurate and hidden features, so-called deep features [2, 9, 13, 32–38] (see Fig. 2). Furthermore, the development of ensemble learning methods has demonstrated an effective way to improve performance by merging different models. ...

Discriminative sampling via deep reinforcement learning for kinship verification
  • Citing Article
  • June 2020

Pattern Recognition Letters

... Eventually, NRML was applied to the features of these two networks. Concerning recent methods, an architecture based on the attention mechanism was designed in [35] to extract local information from face components. OR2Net [36] employed hard negative samples beyond the pre-defined protocol of configurations to utilize the information of all negative samples by reweighing them. ...

Learning Part-Aware Attention Networks for Kinship Verification
  • Citing Article
  • August 2019

Pattern Recognition Letters