Chao Ma’s research while affiliated with Anhui Agricultural University and other places


Publications (183)


Active Learning from Scene Embeddings for End-to-End Autonomous Driving
  • Preprint

March 2025

Wenhao Jiang · Duo Li · Menghan Hu · [...] · Zhipeng Zhang

In the field of autonomous driving, end-to-end deep learning models show great potential by learning driving decisions directly from sensor data. However, training these models requires large amounts of labeled data, which are time-consuming and expensive to obtain. Since real-world driving data exhibit a long-tailed distribution in which simple scenarios make up the majority, we are motivated to identify the most challenging scenarios within the data and to improve model performance efficiently by training on the selected data of highest value. Prior research has selected valuable data using empirically designed strategies; however, manually designed methods generalize poorly to new data distributions. Observing that the BEV (Bird's Eye View) features in end-to-end models contain all the information required to represent a scenario, we propose SEAD, an active learning framework that relies on these vectorized scene-level features. The framework selects initial data based on driving-environment information and incremental data based on BEV features. Experiments show that only 30% of the nuScenes training data is needed to achieve performance close to that obtained with the full dataset. The source code will be released.
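The abstract does not spell out the selection rule, but the incremental step it describes amounts to picking scenes whose BEV embeddings best cover the long tail. A minimal sketch of one such embedding-based heuristic (greedy farthest-point selection over pooled scene features; all names are illustrative and not from the SEAD release):

```python
import numpy as np

def select_scenes(bev_embeddings, budget):
    """Greedy diversity-based selection over scene-level BEV embeddings.

    A generic farthest-point heuristic: repeatedly pick the scene whose
    embedding is farthest from everything selected so far, so the labeled
    pool covers rare (long-tail) scenarios instead of redundant easy ones.
    """
    feats = bev_embeddings / np.linalg.norm(bev_embeddings, axis=1, keepdims=True)
    selected = [0]                                 # seed with an arbitrary scene
    min_dist = np.linalg.norm(feats - feats[0], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(min_dist))             # farthest from the current pool
        selected.append(idx)
        dist = np.linalg.norm(feats - feats[idx], axis=1)
        min_dist = np.minimum(min_dist, dist)      # distance to nearest selected
    return selected

# Example: pick 30% of 1,000 scenes from 256-d pooled BEV features.
rng = np.random.default_rng(0)
ids = select_scenes(rng.normal(size=(1000, 256)), budget=300)
```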


VRM: Knowledge Distillation via Virtual Relation Matching

February 2025


Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as their instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several key issues in relation-based methods, including their susceptibility to overfitting and spurious responses. Specifically, we exploit virtual views and relations as a new kind of knowledge, transferring newly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations. As a result, the student has access to richer guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity graphs by dynamically detaching redundant and unreliable edges. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate the superior performance of the proposed virtual relation matching (VRM) method over a range of models, architectures, and set-ups. For instance, VRM is the first to reach 74.0% accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and it improves DeiT-T by 14.44% on CIFAR-100 with a ResNet56 teacher. Thorough analyses are also conducted to gauge the soundness, properties, and complexity of our designs. Code and models will be released.
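For intuition, a minimal sketch of generic relation matching over batch-level affinity graphs follows; the virtual-view construction and dynamic edge pruning that define VRM are omitted, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def affinity_graph(feats):
    """Batch-level affinity graph: cosine similarity between all sample pairs."""
    z = F.normalize(feats, dim=1)
    return z @ z.t()                      # (B, B) inter-sample correlations

def relation_matching_loss(student_feats, teacher_feats):
    """Train the student's affinity graph to match the teacher's."""
    with torch.no_grad():
        g_teacher = affinity_graph(teacher_feats)
    g_student = affinity_graph(student_feats)
    return F.mse_loss(g_student, g_teacher)

# Penultimate-layer features for one batch; the feature dimensions may differ
# because the matched quantity is the (B, B) graph, not the features themselves.
loss = relation_matching_loss(torch.randn(8, 128), torch.randn(8, 256))
```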


SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer
  • Preprint
  • File available

February 2025


Recent years have seen increasing use of gigapixel-level image and video capture systems, along with benchmarks built on high-resolution wide (HRW) shots. Unlike the close-up shots in the MS COCO dataset, however, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, which make existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the object-detection gap between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it jointly explores global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm that precisely localizes objects from noisy windows, and from a simple yet effective multi-scale strategy that improves accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (by up to 5.8%) and speed (by up to 3x) over state-of-the-art approaches.
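To illustrate the sparsity idea (not the authors' implementation), a lightweight head can score fixed-size windows of a feature map and keep only the top-k for fine-grained attention, so computation tracks object density rather than image area. A minimal sketch with hypothetical module names:

```python
import torch
import torch.nn as nn

class SparseWindowSelector(nn.Module):
    """Score windows of a feature map and keep the top-k likely to hold objects."""

    def __init__(self, dim, window=16, keep=32):
        super().__init__()
        self.window, self.keep = window, keep
        self.score = nn.Linear(dim, 1)                 # lightweight objectness head

    def forward(self, fmap):
        # fmap: (B, C, H, W) with H and W divisible by the window size.
        B, C, H, W = fmap.shape
        w = self.window
        # Partition into non-overlapping windows: (B, N, w*w, C).
        wins = (fmap.unfold(2, w, w).unfold(3, w, w)   # (B, C, H/w, W/w, w, w)
                    .permute(0, 2, 3, 4, 5, 1)
                    .reshape(B, -1, w * w, C))
        scores = self.score(wins.mean(dim=2)).squeeze(-1)  # (B, N) window scores
        topk = scores.topk(self.keep, dim=1).indices       # sparse window ids
        idx = topk[..., None, None].expand(-1, -1, w * w, C)
        return wins.gather(1, idx), topk                   # attentive tokens only

sel = SparseWindowSelector(dim=256)
tokens, ids = sel(torch.randn(1, 256, 256, 256))           # keeps 32 of 256 windows
```

Downstream attention then runs only on the selected tokens, which is where the reported speedups on sparse HRW scenes would come from.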


Figure captions:
  • Barnacle and mussel adhesive protein dual-bionic design and SM/PVA/SPA adhesive preparations. (a) A typical adhesion system of natural and translated barnacles and mussels. (b) Schematic diagram of SM/PVA/SPA adhesive molecular structures.
  • Preparation and characterization of the SPA. (a) The hierarchical structure of the SF and the preparation of the SPA solution. (b) Optical image of silk fibers after degumming. (c) Macroscopic photographs of the SF and SPA aqueous solutions. (d) XPS analysis, (e) FTIR spectra, and (f) XRD patterns of SF and SPA.
  • Bonding properties and toughness of the adhesives. (a) Schematic illustration of the bonding strength test. (b) Cold-pressing shear strength and (c) curves of the adhesives. (d) Dry and (e) wet shear strength of the adhesives. (f) Force-distance curves and (g) toughness of different adhesives during dry shear strength tests. (h) Wood failure rates in plywood specimens.
  • Characterization of the adhesives. (a) FTIR and (b) XRD patterns of the adhesives. (c-e) C1s XPS spectra of the adhesives. (f) TGA analysis of the adhesives. (g) SEM images of fracture morphology for the cured adhesives.
  • Physicochemical and adhesion characterization of the SM/PVA/SPAB adhesive. (a) Schematic diagram of the borate cross-linked network. (b) Adhesion to different materials. (c) XPS spectra, (d) rheological properties, and (e) TGA analysis of SM/PVA/SPAB adhesives.

Barnacle and mussel-inspired of functional silk fibroin to prepare biomass adhesive with ultra-high cold-pressing adhesion

January 2025


Soy protein adhesives have shown attractive potential in wood-based composite applications as replacements for formaldehyde-based synthetic resins. However, poor cold-pressing bonding performance, mildew resistance, and flame retardancy are major bottlenecks preventing their widespread industrial use. Herein, a novel dual-bionic soybean meal (SM) adhesive with excellent bonding performance was developed, inspired by the strong wet-adhesion systems of barnacles and mussels. Polydopamine (PDA) was adopted to coat silk fibroin (SF) via facile in situ polymerization, forming a functional SF/PDA complex (SPA) that acts as a hydrogen-bond donor in the adhesive system. Benefiting from abundant phenolic hydroxyl groups, multiple hydrogen-bond cross-linked networks, and the removal of interfacial water, the adhesive achieved strong cold-pressing adhesion (771.0 kPa) and toughness (1.54 MJ/m³), increases of 267.1% and 208.0%, respectively, over the plain SM adhesive. The introduction of borate further induces a dynamic cross-linked network along with supramolecular interactions in the adhesive system, giving the adhesive strong adhesion to a range of hydrophilic and hydrophobic substrates. Additionally, borate and phenolic hydroxyl groups synergistically improved the mildew resistance and flame retardancy of the adhesives, which were stored for 20 days without mildew formation and showed combustion-suppression behavior during combustion tests. The design of this bionic system offers a novel approach for developing high-performance wood adhesives.
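As a quick consistency check on the reported figures, and assuming the percentage increases are measured relative to the unmodified SM adhesive, the implied baseline values are:

```latex
\sigma_{\mathrm{SM}} \approx \frac{771.0\ \mathrm{kPa}}{1 + 2.671} \approx 210\ \mathrm{kPa},
\qquad
T_{\mathrm{SM}} \approx \frac{1.54\ \mathrm{MJ/m^3}}{1 + 2.080} \approx 0.50\ \mathrm{MJ/m^3}
```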


HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos

January 2025


Despite advances in 3D hand pose estimation, current methods predominantly focus on single-image 3D hand reconstruction in the camera frame, overlooking the world-space motion of the hands. This limitation prohibits their direct use in egocentric video settings, where hands and camera are continuously in motion. In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. We decouple the task into reconstructing the hand motion in camera space and estimating the camera trajectory in the world coordinate system. To achieve precise camera trajectory estimation, we propose an adaptive egocentric SLAM framework that addresses the shortcomings of traditional SLAM methods, providing robust performance under challenging camera dynamics. To ensure robust hand motion trajectories even when the hands move out of the view frustum, we devise a novel motion-infiller network that effectively completes the missing frames of the sequence. Through extensive quantitative and qualitative evaluations, we demonstrate that HaWoR achieves state-of-the-art performance on both hand motion reconstruction and world-frame camera trajectory estimation across different egocentric benchmark datasets. Code and models are available at https://hawor-project.github.io/ .
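At each frame, the decoupling described above reduces to composing the camera-frame hand pose with the estimated camera-to-world transform. A minimal sketch of that composition (the adaptive SLAM module and motion infiller are the paper's contributions and are not shown; names are illustrative):

```python
import numpy as np

def hands_to_world(joints_cam, R_wc, t_wc):
    """Lift camera-frame hand joints into world coordinates.

    joints_cam: (J, 3) hand joints in the camera frame
    R_wc, t_wc: camera-to-world rotation (3, 3) and translation (3,)
    """
    return joints_cam @ R_wc.T + t_wc

# Per-frame composition: hand motion in camera space (from the hand
# reconstruction branch) + camera trajectory in world space (from SLAM).
joints_cam = np.random.rand(21, 3)       # e.g. 21 MANO-style joints
R_wc = np.eye(3)                         # identity rotation for the demo
t_wc = np.array([0.5, 0.0, 1.2])         # camera position in the world
joints_world = hands_to_world(joints_cam, R_wc, t_wc)
```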


Cross-View Consistency Regularisation for Knowledge Distillation

December 2024


Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a lightweight and efficient one. In recent years, logit-based KD methods have been quickly catching up in performance with their feature-based counterparts. However, previous research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely overconfident teachers and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations to standard logit-based distillation frameworks to combat these issues. We also perform confidence-based soft-label mining to improve the quality of the distilling signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diversity of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to ongoing logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins.
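A minimal sketch of the two regularisation terms described above, assuming two augmented views per image and a confidence threshold for mining teacher soft labels (the temperature, threshold, and weighting are illustrative, not the released CRLD configuration):

```python
import torch
import torch.nn.functional as F

def consistency_kd_loss(s_v1, s_v2, t_v1, t_v2, tau=4.0, conf=0.5):
    """Within-view + cross-view KL distillation with confidence-based mining."""
    def kd(student, teacher):
        p_t = F.softmax(teacher.detach() / tau, dim=1)
        # Soft-label mining: keep only samples where the teacher is confident.
        mask = (p_t.max(dim=1).values > conf).float()
        kl = F.kl_div(F.log_softmax(student / tau, dim=1), p_t,
                      reduction="none").sum(dim=1)
        return (kl * mask).sum() / mask.sum().clamp(min=1.0) * tau ** 2

    within = kd(s_v1, t_v1) + kd(s_v2, t_v2)   # same-view teacher targets
    cross = kd(s_v1, t_v2) + kd(s_v2, t_v1)    # cross-view teacher targets
    return within + cross

# Student/teacher logits for two views of a 16-image batch, 100 classes.
loss = consistency_kd_loss(torch.randn(16, 100), torch.randn(16, 100),
                           torch.randn(16, 100), torch.randn(16, 100))
```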


Figure captions:
  • Fig. 2: Overview of the proposed OccScene. The framework trains the perception model and the generative diffusion UNet concurrently, generating images or videos together with their corresponding semantic occupancy grids at inference. Within OccScene, a Mamba-based Dual Alignment (MDA) module sequentially aligns the semantic occupancy and the diffusion latent with camera-trajectory awareness.
  • Fig. 4: Visualization of heat maps from the proposed MDA module, extracted from the last diffusion sampling step. The module effectively highlights the aligned contextual information from temporal frames and semantic occupancy.
  • Fig. 5: (a) Performance evaluation with different learning strategies. (b) Learning curves of different learning strategies. For the 'Independent Learning' setting, the perception model is detached and semantic occupancy is predicted offline.
  • Fig. 7: Quantitative comparison of cross-view generation consistency with existing methods; OccScene generates more consistent and reasonable results across different perspectives.
  • Fig. 10: Comparison of different architecture designs for the MDA module. GRU-based encoding offers better computational efficiency than attention-based encoding through its iterative encoding process but is susceptible to cumulative errors; Mamba-based encoding achieves superior running speed and generation quality with a linear-complexity operator and efficient long-term modeling in a single pass.
OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation

December 2024


Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, with generation acting merely as a data augmenter that produces synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual-learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect. OccScene generates new, consistent, and realistic 3D scenes from text prompts alone, guided by semantic occupancy in a joint-training diffusion framework. To align the occupancy with the diffusion latent, a Mamba-based Dual Alignment module is introduced to incorporate fine-grained semantics and geometry as perception priors. Within OccScene, the perception module can be effectively improved with customized and diverse generated scenes, while the perception priors in return enhance the generation performance for mutual benefit. Extensive experiments show that OccScene achieves realistic 3D scene generation in broad indoor and outdoor scenarios, while concurrently boosting perception models to substantial performance improvements in the 3D perception task of semantic occupancy prediction.
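Schematically, the mutual learning amounts to conditioning the generator on the perception output and updating both models in a single backward pass, rather than treating generation as offline augmentation. A toy sketch under that reading (the stand-in modules and MSE losses are purely illustrative, not the paper's diffusion or occupancy objectives):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPerception(nn.Module):
    """Stand-in for the occupancy network: scene features -> occupancy logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(64, 16)

    def forward(self, x):
        return self.net(x)

class ToyGenerator(nn.Module):
    """Stand-in for the diffusion UNet, conditioned on the occupancy prior."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(64 + 16, 64)

    def forward(self, x, occ):
        return self.net(torch.cat([x, occ], dim=1))

perc, gen = ToyPerception(), ToyGenerator()
opt = torch.optim.Adam(list(perc.parameters()) + list(gen.parameters()), lr=1e-3)

x = torch.randn(8, 64)                          # shared scene features
occ_gt, img_gt = torch.randn(8, 16), torch.randn(8, 64)

opt.zero_grad()
occ = perc(x)                                   # perception forward pass
loss = F.mse_loss(occ, occ_gt) \
     + F.mse_loss(gen(x, occ), img_gt)          # generation conditioned on occ
loss.backward()                                 # one pass updates both models
opt.step()
```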



PhyCAGE: Physically Plausible Compositional 3D Asset Generation from a Single Image

November 2024


We present PhyCAGE, the first approach for physically plausible compositional 3D asset generation from a single image. Given an input image, we first generate consistent multi-view images for components of the assets. These images are then fitted with 3D Gaussian Splatting representations. To ensure that the Gaussians representing objects are physically compatible with each other, we introduce a Physical Simulation-Enhanced Score Distillation Sampling (PSE-SDS) technique to further optimize the positions of the Gaussians. It is achieved by setting the gradient of the SDS loss as the initial velocity of the physical simulation, allowing the simulator to act as a physics-guided optimizer that progressively corrects the Gaussians' positions to a physically compatible state. Experimental results demonstrate that the proposed method can generate physically plausible compositional 3D assets given a single image.
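The PSE-SDS coupling can be sketched in a few lines: the negative SDS gradient on the Gaussian positions is injected as the initial velocity of a short particle simulation, which then integrates the positions under physical forces. A toy version, with a simple floor-contact penalty standing in for the real simulator (everything here is illustrative):

```python
import torch

def pse_sds_step(positions, sds_grad, sim_steps=10, dt=0.01, damping=0.9):
    """Use the SDS gradient as the initial velocity of a short simulation.

    positions: (N, 3) Gaussian centers; sds_grad: (N, 3) gradient of the SDS
    loss w.r.t. those centers. The simulator acts as a physics-guided
    optimizer: it starts moving the Gaussians along the appearance objective,
    then contact forces push them toward a physically compatible state.
    """
    pos, vel = positions.clone(), -sds_grad.clone()   # descend the SDS objective
    for _ in range(sim_steps):
        # Toy physics: a floor at z = 0 pushes penetrating particles out.
        penetration = (-pos[:, 2]).clamp(min=0.0)
        force = torch.zeros_like(pos)
        force[:, 2] = 10.0 * penetration              # spring-like contact response
        vel = damping * (vel + dt * force)            # damped explicit integration
        pos = pos + dt * vel
    return pos

pos = torch.randn(1024, 3)        # Gaussian Splatting centers
grad = torch.randn(1024, 3)       # gradient from an SDS loss
pos_new = pse_sds_step(pos, grad)
```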



Citations (54)


... Additionally, boron exhibits low toxicity, flame retardancy, antimicrobial activity, and anti-mold properties. Inspired by the borate chemistry in plant cell walls, researchers have developed a range of advanced materials [33]. Building upon this inspiration, the incorporation of environmentally friendly borates in synergy with POSS-polyphenol core-shell structures is expected to simultaneously enhance the adhesive strength and toughness ...

Reference:

Developing Eco-Friendly, High-Performance Soy Protein Plywood Adhesive via Core–Shell Hybridization and Borate Chemistry
Barnacle and mussel-inspired of functional silk fibroin to prepare biomass adhesive with ultra-high cold-pressing adhesion

... To better adapt VLMs for downstream tasks, various text prompt-based fine-tuning methods [3], [4], [6], [7], [52] have been proposed, which can enhance VLM performance on specific tasks. In more complex scenarios, learnable prompts can be inserted into intermediate layers [8] to incorporate more sophisticated general knowledge. ...

Unleash the Power of Vision-Language Models by Visual Attention Prompt and Multi-modal Interaction
  • Citing Article
  • January 2024

IEEE Transactions on Multimedia

... LiDAR provides accurate 3D positional information, enabling accurate occupancy prediction, while cameras offer rich semantic details for object classification. Previous studies [1]- [3] have focused on integrating these two modalities by using 3D voxel representations generated from LiDAR data. To fuse camera features with 3D LiDAR features, these approaches predict depth from images, transform them from 2D to 3D views to create voxel-based image features, and then align these features with the LiDAR data in the voxel space. ...

OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
  • Citing Chapter
  • November 2024

... Existing occupancy prediction approaches [3,9,11,23,28,35,44] primarily rely on supervised learning, which typically requires dense 3D annotations obtained through labor-intensive manual labeling of dynamic driving scenes spanning up to 80 meters per frame. To mitigate this cost, recent studies have resorted to self-supervised alternatives [2,7,8,10,13,31,42,45]. These methods leverage 2D predictions from vision foundation models to train a 3D occupancy network, enforcing image reprojection consistency through volume rendering [10,31,42] or differentiable rasterization [7]. ...

VEON: Vocabulary-Enhanced Occupancy Prediction
  • Citing Chapter
  • October 2024

... Backbone: The backbone of Elastic-DETR, inspired by advances in SparseViT and SparseFormer (Li et al. 2024a), implements a sparsity strategy that selects windows of varying importance to economize on feature-extraction computation. Notably, while the computational load is reduced during the feature-extraction phase, the final feature map retains the same shape as those produced by denser models like Swin Transformer, ensuring compatibility with subsequent processing stages. ...

SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer
  • Citing Conference Paper
  • October 2024

... However, during thermal processing and use, PVC is prone to degradation when exposed to heat, light, and other environmental factors. This degradation can significantly reduce the material's mechanical properties and service life [6,7]. Therefore, understanding the thermal degradation behavior of PVC and developing effective stabilizers are crucial for enhancing its durability and safety. ...

Preparation of bamboo charcoal-reinforced polyamide 6 composites modified with diverse additives: Synergy and interface improvement
  • Citing Article
  • October 2024

Industrial Crops and Products

... This benchmark has recently gained a lot of attention. Previous works on gigapixel-level detection focus on achieving lower latency through patch selection or arrangement [5,10,23,24,34]. However, they are unable to solve the unique challenges faced in HRW shots. ...

SaccadeMOT: Enhancing Object Detection and Tracking in Gigapixel Images via Scale-Aware Density Estimation
  • Citing Chapter
  • October 2024

... Adversarial Training. Adversarial attacks (Tang et al. 2022; Zhang et al. 2021b, 2020) refer to the creation of maliciously crafted inputs, called adversarial examples, that are designed to deceive machine learning models (Jia et al. 2020; Miyato, Dai, and Goodfellow 2016). A common method for crafting adversarial examples for VLMs (Madry et al. 2017) involves seeking a perturbation r for an input natural image x with label y that maximizes the dissimilarity, typically cosine dissimilarity. ...

Robust Deep Object Tracking against Adversarial Attacks

International Journal of Computer Vision