April 2025 · 1 Read · Pattern Recognition Letters
April 2025 · 1 Read · Pattern Recognition Letters
March 2025 · 22 Reads
Mobile manipulation is a fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models for fixed-base manipulation to mobile manipulation, so that the mobile manipulation policy achieves high generalization across tasks and environments. Specifically, we utilize pre-trained VLA models to generate end-effector waypoints with high generalization ability, and we design motion planning objectives for the mobile base and the robot arm that maximize the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enlarge the manipulator's policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. In this way, MoManipVLA can adjust the position of the robot base in a zero-shot manner, making the waypoints predicted by the fixed-base VLA models feasible. Extensive experimental results on OVMM and in the real world demonstrate that MoManipVLA achieves a 4.2% higher success rate than state-of-the-art mobile manipulation methods, and only requires 50 training cost for real-world deployment due to the strong generalization ability of the pre-trained VLA models.
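The bi-level scheme described in this abstract can be illustrated with a toy one-dimensional sketch. Everything here is an illustrative assumption (the reach model, the candidate base grid, the cost terms), not the MoManipVLA implementation: the upper level searches over base positions, and the lower level scores how feasibly the VLA-predicted waypoints can be tracked from each base.

```python
# Toy sketch of bi-level trajectory optimization: the upper level picks a
# base position, the lower level evaluates end-effector waypoint feasibility.
# All names and the 1-D geometry are illustrative assumptions.

REACH = 1.0  # assumed arm reach from the base (workspace radius)

def lower_level_cost(base_x, waypoints):
    """Cost of tracking the VLA waypoints from a fixed base.

    Out-of-reach waypoints get infinite cost, mirroring the physical
    feasibility objective described in the abstract.
    """
    cost = 0.0
    for w in waypoints:
        d = abs(w - base_x)
        if d > REACH:
            return float("inf")  # waypoint unreachable from this base pose
        cost += d  # prefer waypoints near the workspace center
    return cost

def bilevel_plan(waypoints, candidate_bases):
    """Upper level: pick the base position with minimal lower-level cost."""
    best = min(candidate_bases, key=lambda b: lower_level_cost(b, waypoints))
    return best, lower_level_cost(best, waypoints)

# Base candidates on a coarse grid; waypoints come from a (stubbed) VLA model.
base, cost = bilevel_plan([0.8, 1.2, 1.5], [i * 0.25 for i in range(9)])
```

With these numbers, the search rejects bases that leave any waypoint out of reach and then minimizes the summed tracking distance, which is the essence of letting base placement expand the manipulator's feasible policy space.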
January 2025 · 1 Read
December 2024 · 1 Read · 2 Citations · Pattern Recognition
June 2024 · 31 Reads
Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in known environments where all interactive objects are provided to the embodied agent, and directly deploying existing approaches in unknown environments usually generates infeasible plans that manipulate non-existing objects. In contrast, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework consisting of a high-level task planner and a low-level exploration controller, both built on multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to represent the known visual clues, so that the goals of task planning and scene exploration are aligned with the human instruction. The task planner generates feasible step-by-step plans for accomplishing the human goal according to the task completion process and the known visual clues. The exploration controller predicts the optimal navigation or object interaction policy based on the generated step-wise plans and the known visual clues. Experimental results demonstrate that our method achieves a 45.09% success rate on 204 complex human instructions, such as making breakfast and tidying rooms, in large house-level scenes.
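The explore-then-plan loop in this abstract can be sketched with simple stubs. The planner, controller, and environment below are stand-in assumptions (not the paper's multimodal-LLM components); the point is the control flow: the planner only emits steps for objects that have already been observed, and the controller explores when no feasible step remains.

```python
# Minimal sketch of a hierarchical embodied-instruction-following loop.
# high_level_planner / exploration_controller are illustrative stubs.

def high_level_planner(instruction, known_objects):
    """Emit only steps whose target object has been observed, so the plan
    never manipulates non-existing objects."""
    needed = {"make coffee": ["mug", "coffee machine"]}[instruction]
    return [("interact", o) for o in needed if o in known_objects]

def exploration_controller(known_objects, scene):
    """Reveal one unseen object per call (stands in for semantic-map-guided
    exploration)."""
    for o in scene:
        if o not in known_objects:
            return o
    return None

def run_agent(instruction, scene, max_steps=20):
    known, done = set(), []
    for _ in range(max_steps):
        plan = high_level_planner(instruction, known)
        pending = [s for s in plan if s not in done]
        if pending:
            done.append(pending[0])  # low level: interact with a known object
        else:
            nxt = exploration_controller(known, scene)
            if nxt is None:
                break  # scene fully explored and plan exhausted
            known.add(nxt)  # low level: navigate and observe a new object
    return done

steps = run_agent("make coffee", ["table", "mug", "coffee machine"])
```

Running this interleaves exploration and interaction: the agent discovers objects one by one and executes an interaction step as soon as the planner can ground it in the observed scene.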
June 2024 · 5 Reads · 13 Citations
May 2024 · 3 Reads
3D object detection aims to recover the 3D information of objects of interest and serves as a fundamental task of autonomous driving perception. Its performance greatly depends on the scale of labeled training data, yet it is costly to obtain high-quality annotations for point cloud data. While conventional methods focus on generating pseudo-labels for unlabeled samples as supplements for training, the structural nature of 3D point cloud data facilitates composing objects and backgrounds to synthesize realistic scenes. Motivated by this, we propose a hardness-aware scene synthesis (HASS) method to generate adaptive synthetic scenes that improve the generalization of detection models. We obtain pseudo-labels for unlabeled objects and generate diverse scenes with different compositions of objects and backgrounds. As scene synthesis is sensitive to the quality of pseudo-labels, we further propose a hardness-aware strategy to reduce the effect of low-quality pseudo-labels and maintain a dynamic pseudo-database to ensure the diversity and quality of synthetic scenes. Extensive experimental results on the widely used KITTI and Waymo datasets demonstrate the superiority of the proposed HASS method, which outperforms existing semi-supervised learning methods on 3D object detection. Code: https://github.com/wzzheng/HASS.
January 2024 · 4 Reads · 4 Citations · IEEE Transactions on Multimedia
3D object detection aims to recover the 3D information of objects of interest and serves as a fundamental task of autonomous driving perception. Its performance greatly depends on the scale of labeled training data, yet it is costly to obtain high-quality annotations for point cloud data. This motivates the use of semi-supervised learning, which can additionally exploit unlabeled data to further boost performance. While 2D semi-supervised learning methods focus on generating pseudo-labels for existing unlabeled samples as supplements for training, the structural nature of 3D point cloud data facilitates composing objects and backgrounds to synthesize realistic scenes. Motivated by this, we propose a hardness-aware scene synthesis (HASS) method to generate adaptive synthetic scenes that improve the generalization of detection models. We obtain pseudo-labels for unlabeled objects and generate diverse scenes with different compositions of objects and backgrounds. As scene synthesis is sensitive to the quality of pseudo-labels, we further propose a hardness-aware strategy to reduce the effect of low-quality pseudo-labels. In addition, we maintain a dynamic pseudo-database to ensure the diversity and quality of synthetic scenes. Extensive experimental results on the widely used KITTI and Waymo datasets demonstrate the superiority of the proposed HASS method, which outperforms existing semi-supervised learning methods on 3D object detection. We also conduct a series of experiments to analyze the effectiveness of our method, including pseudo-label quality analysis, the effect of different filtering and thresholding strategies, and ablations of each component.
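The pseudo-database and filtering described above can be illustrated with a small sketch. The confidence score, threshold, capacity, and scene-composition step are all assumptions for illustration, not the published HASS implementation, which operates on point cloud objects and learned hardness measures.

```python
# Illustrative sketch of hardness-aware pseudo-label filtering plus a
# dynamic pseudo-database used for scene synthesis. All thresholds and
# data structures are assumptions.
import random

CONF_THRESHOLD = 0.7  # assumed confidence cutoff for accepting pseudo-labels

def update_pseudo_database(database, pseudo_labels, capacity=4):
    """Keep only confident pseudo-labels; evict the lowest-confidence
    entries once the database exceeds its capacity."""
    for label in pseudo_labels:
        if label["conf"] >= CONF_THRESHOLD:
            database.append(label)
    database.sort(key=lambda d: d["conf"], reverse=True)
    del database[capacity:]  # dynamic database: bounded, quality-sorted
    return database

def synthesize_scene(background, database, n_objects=2, seed=0):
    """Compose a synthetic scene by pasting sampled pseudo-labeled objects
    onto an unlabeled background."""
    rng = random.Random(seed)
    objects = rng.sample(database, min(n_objects, len(database)))
    return {"background": background, "objects": objects}

db = update_pseudo_database([], [
    {"cls": "car", "conf": 0.9},
    {"cls": "car", "conf": 0.4},      # rejected as a low-quality pseudo-label
    {"cls": "cyclist", "conf": 0.8},
])
scene = synthesize_scene("frame_0001", db)
```

The low-confidence label never enters the database, so downstream synthetic scenes are composed only from pseudo-labels above the quality bar, which is the intuition behind reducing the effect of low-quality pseudo-labels.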
January 2024 · 2 Reads · 4 Citations · IEEE Transactions on Geoscience and Remote Sensing
Optical remote sensing image salient object detection (ORSI-SOD) poses significant challenges due to complicated object variations and interfering surroundings. Although existing methods have achieved impressive performance, they encounter difficulties in balancing deep and shallow features, leading to limitations in preserving object integrity and edge detail. To address this, we propose the Integrated and Detailed Ensemble Learning (IDEL) framework, which incorporates hierarchical branches with deep supervision. Following a divide-and-conquer strategy, each branch captures information at a specific granularity, while the fusion module combines all outputs to generate the final saliency maps. To ensure the effectiveness of ensemble learning, IDEL is designed to satisfy two necessary conditions: the weak learner property and branch independence. First, we utilize Transformer blocks with a global receptive field and purify intermediate features with the Deep Supervision Module (DSM) to enhance the performance of each branch. Second, we disentangle the branches through hardness-aware weights and hierarchical supervision labels, allowing them to learn distinct features. Qualitative visualizations demonstrate the effectiveness of each module, and extensive experimental results on three popular ORSI datasets confirm the superiority of IDEL over other state-of-the-art (SOTA) counterparts.
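The branch-fusion idea can be shown with a toy example. The per-branch saliency maps and the fixed weighted average below are illustrative assumptions; IDEL's fusion module and hardness-aware weights are learned, not hand-set.

```python
# Toy sketch of fusing hierarchical branch outputs into one saliency map.
# Each map is a flat list of saliency values in [0, 1]; the fusion here is
# a simple weighted average, standing in for the learned fusion module.

def fuse_branches(branch_maps, weights=None):
    """Pixel-wise weighted average of per-branch saliency maps."""
    n = len(branch_maps)
    if weights is None:
        weights = [1.0 / n] * n  # uniform fusion as the simplest baseline
    fused = [0.0] * len(branch_maps[0])
    for w, m in zip(weights, branch_maps):
        for i, v in enumerate(m):
            fused[i] += w * v
    return fused

coarse = [1.0, 1.0, 0.0, 0.0]   # integrity-oriented branch (object mass)
fine   = [1.0, 0.8, 0.2, 0.0]   # detail-oriented branch (edge precision)
fused = fuse_branches([coarse, fine])
```

The fused map blends the coarse branch's object integrity with the fine branch's edge detail, which is the balance between deep and shallow features the framework targets.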
November 2023 · 19 Reads · 9 Citations · IEEE Transactions on Circuits and Systems for Video Technology
In this paper, we present a dense hybrid proposal modulation (DHPM) method for lane detection. Most existing methods perform sparse supervision on a subset of high-scoring proposals, while the remaining proposals fail to obtain effective shape and location guidance, resulting in poor overall quality. To address this, we densely modulate all proposals to generate topologically and spatially high-quality lane predictions with discriminative representations. Specifically, we first ensure that lane proposals are physically meaningful by applying single-lane shape and location constraints. Benefiting from the proposed proposal-to-label matching algorithm, we assign each proposal a target ground-truth lane so that it efficiently learns from spatial layout priors. To enhance generalization and model inter-proposal relations, we diversify the shape differences among proposals matched to the same ground-truth lane. In addition to the shape and location constraints, we design a quality-aware classification loss to adaptively supervise each positive proposal so that discriminative power can be further boosted. Our DHPM achieves highly competitive performance on four popular benchmark datasets. Moreover, we consistently outperform the baseline model on most metrics without introducing new parameters or reducing inference speed. The code for our method is available at https://github.com/wuyuej/DHPM.
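The dense proposal-to-label matching can be sketched in a few lines. The 1-D x-offset representation of lanes and the mean-absolute-distance metric are assumptions for illustration; the paper matches full lane geometries. The key property shown is that every proposal, not just high-scoring ones, receives a ground-truth assignment and thus shape/location supervision.

```python
# Hedged sketch of dense proposal-to-label matching for lane detection:
# each proposal (a list of lateral offsets at fixed rows) is assigned the
# nearest ground-truth lane. The distance metric is an assumption.

def match_proposals(proposals, gt_lanes):
    """Return, for each proposal, the index of its closest ground-truth lane."""
    def dist(p, g):
        # mean absolute lateral offset between two lane polylines
        return sum(abs(a - b) for a, b in zip(p, g)) / len(p)
    return [min(range(len(gt_lanes)), key=lambda j: dist(p, gt_lanes[j]))
            for p in proposals]

gts = [[0.0, 0.1, 0.2], [1.0, 1.0, 1.0]]            # two ground-truth lanes
props = [[0.1, 0.1, 0.3], [0.9, 1.1, 1.0], [0.5, 0.6, 0.7]]
assignments = match_proposals(props, gts)
```

Even the poorly located third proposal gets a target lane, so it still receives shape and location gradients instead of being left unsupervised, which is the motivation for dense rather than sparse modulation.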
... Different from prior work, which generates object attributes and then applies a database retrieval process to obtain textured object meshes (heavily restricting generation capacity), Graph-to-3D [27] proposes a GCN-based VAE architecture that learns to directly generate object 3D meshes from the input scene graph in an end-to-end framework. To better handle the biased generation results caused by object category imbalance in the training set, FairScene [150] exploits unbiased object interactions with a causal reasoning framework, achieving fair scene synthesis by calibrating the long-tailed category distribution. ...
December 2024 · Pattern Recognition
... It has reached 90.7% in OA and 90.0%. Meanwhile, DMR is 4.8% higher than PointMLP (Ma et al. 2022) in OA after voting, which also reaches a level comparable to SOTA methods (Qi et al. 2017b; Wang et al. 2019; Li et al. 2018; Qiu, Anwar, and Barnes 2021; Ran, Liu, and Wang 2022; Ma et al. 2022; Zheng et al. 2023; Qian et al. 2022; Lin et al. 2023; Park et al. 2023; Yao et al. 2023; Thomas et al. 2024; Chen et al. 2023; Sun et al. 2024). It is worth noting that DMR surpasses HyCoRe by as much as 3.0%, which further verifies that our method is theoretically sound. ...
June 2024
... Zhou et al. [21] enhanced multi-scale deep features through an edge extraction module and incorporated fine edge details into the saliency maps using a hybrid loss that includes edge-aware constraints. Liu et al. [46] utilized Transformer blocks with global perception areas to extract features and adopted a divide-and-conquer method to extract and integrate relevant features from various branches. Yan et al. [47] ingeniously integrated Transformers and CNNs into the encoder, using an adaptive semantic matching mechanism to model both global and local relationships. ...
January 2024 · IEEE Transactions on Geoscience and Remote Sensing
... This transform is critical for enabling vehicles to understand their environment, detect objects, and navigate safely. Previous works (Huang et al., 2021; Reading et al., 2021; Zhou & Krähenbühl, 2022; Zeng et al., 2024) have achieved remarkable 3D perception ability by utilizing Bird's Eye View (BEV) representations to process the 2D-3D lift. Recently, many vision-based 3D occupancy prediction methods (Huang et al., 2023; Wei et al., 2023) further improved the understanding of dynamic and cluttered driving scenes, pushing the boundaries of the research domain. ...
January 2024 · IEEE Transactions on Multimedia
... Lane detection is crucial for autonomous driving [3]- [6], as it allows the identification of lane boundaries and facilitates the determination of secure driving routes [4], [7], [8]. The RGB images captured by vehicle-mounted cameras provide multiple forward lane lines and other useful driving information [9], [10]. ...
November 2023 · IEEE Transactions on Circuits and Systems for Video Technology
... Language model grounding for embodied tasks: An embodied agent requires not only active exploration [33], manipulation [34], and scene perception [35,36], but also embodied task planning ability. Embodied task planning aims to generate executable action steps in the given environments, where action plans are produced by grounded LLMs that receive information from the surrounding environments [37,38,39] or through prompt engineering [40]. ...
October 2022
... Alongside these, deep learning techniques have demonstrated significant effectiveness. Recent studies, like [29], have utilized dual CNNs with shared parameters to extract multi-scale deep features, providing a comprehensive contextual representation of facial images. Integrating multiple or multi-level features has become a prevalent strategy in kinship verification. ...
July 2020 · Pattern Recognition
... Another method, known as the KinMix method, was put forward by Song and Yan (2020) to generate positive samples that have a kin relationship in the feature space. The assumption made was that linearly combined kinship features yield an indistinguishable clustering. ...
July 2020
... The classification is made through threshold comparison or a basic machine learning classifier (i.e., Support Vector Machines (SVM), K-Nearest Neighbors (KNN)) [22, 26-31]. Moreover, thanks to the contribution of deep learning to the field of computer vision, many works have shifted toward implementing deep learning-based methods, as they not only encapsulate the classification and extraction process but also capture more accurate and hidden features, so-called deep features [2, 9, 13, 32-38] (see Fig. 2). Furthermore, the development of ensemble learning methods has demonstrated an effective way to improve performance by merging different models. ...
June 2020 · Pattern Recognition Letters
... Eventually, NRML was applied to the features of these two networks. Among recent methods, an architecture based on the attention mechanism was designed in [35] to extract local information from face components. OR2Net [36] employed hard negative samples beyond the pre-defined protocol of configurations, utilizing the information of all negative samples by reweighing them. ...
August 2019 · Pattern Recognition Letters