Guobao Xiao’s research while affiliated with Tongji University and other places

Publications (105)


COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection
Article · December 2024 · 10 Reads

Science China Information Sciences

Xiaoqin Zhang · Zhenni Yu · Li Zhao · [...] · Guobao Xiao

We rethink the segment anything model (SAM) and propose a novel multiprompt network called COMPrompter for camouflaged object detection (COD). SAM has zero-shot generalization ability beyond other models and can provide an ideal framework for COD. Our network aims to extend the single-prompt strategy in SAM to a multiprompt strategy. To achieve this, we propose an edge gradient extraction module, which generates a mask containing gradient information regarding the boundaries of camouflaged objects. This gradient mask is then used as a novel boundary prompt, enhancing the segmentation process. Thereafter, we design a box-boundary mutual guidance module, which fosters more precise and comprehensive feature extraction via mutual guidance between a boundary prompt and a box prompt. This collaboration enhances the model’s ability to accurately detect camouflaged objects. Moreover, we employ the discrete wavelet transform to extract high-frequency features from image embeddings. The high-frequency features serve as a supplementary component to the multiprompt system. Finally, our COMPrompter guides the network to achieve enhanced segmentation results, thereby advancing the development of SAM in terms of COD. Experimental results across COD benchmarks demonstrate that COMPrompter achieves cutting-edge performance, surpassing the current leading model by an average positive metric of 2.2% on COD10K. In polyp segmentation, a specific application of COD, the experimental results show that our model is also superior to top-tier methods. The code will be made available at https://github.com/guobaoxiao/COMPrompter.
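As an illustration of the discrete wavelet transform step mentioned in the abstract above, the following is a minimal sketch of a one-level 2D Haar transform that separates a small 2D array into a low-frequency band and three high-frequency bands. The choice of the Haar wavelet, the function names, and the plain-list input are assumptions made for this example; the abstract does not state which wavelet or implementation COMPrompter uses.

```python
def haar_dwt2(x):
    """One-level 2D orthonormal Haar transform of a 2D list.

    Assumes an even height and width. Returns (LL, LH, HL, HH),
    each half the input size in both dimensions.
    """
    h, w = len(x), len(x[0])
    LL, LH, HL, HH = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]          # top-left, top-right
            c, d = x[i + 1][j], x[i + 1][j + 1]  # bottom-left, bottom-right
            r, s = i // 2, j // 2
            LL[r][s] = (a + b + c + d) / 2.0     # local average (low frequency)
            LH[r][s] = (a - b + c - d) / 2.0     # left-minus-right difference
            HL[r][s] = (a + b - c - d) / 2.0     # top-minus-bottom difference
            HH[r][s] = (a - b - c + d) / 2.0     # diagonal difference
    return LL, LH, HL, HH


def high_frequency_bands(x):
    """Keep only the three detail (high-frequency) sub-bands."""
    _, LH, HL, HH = haar_dwt2(x)
    return LH, HL, HH
```

A flat region produces zero detail coefficients, while a left/right intensity change inside a 2x2 block shows up in the `LH` band, which is why such bands are a natural carrier for boundary cues.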


PTH-Net: Dynamic Facial Expression Recognition Without Face Detection and Alignment

November 2024 · 6 Reads · 1 Citation

IEEE Transactions on Image Processing

Pyramid Temporal Hierarchy Network (PTH-Net) is a new paradigm for dynamic facial expression recognition, applied directly to raw videos, without face detection and alignment. Unlike the traditional paradigm, which focuses only on facial areas and often overlooks valuable information like body movements, PTH-Net preserves more critical information. It does this by distinguishing between backgrounds and human bodies at the feature level, offering greater flexibility as an end-to-end network. Specifically, PTH-Net utilizes a pre-trained backbone to extract multiple general features of video understanding at various temporal frequencies, forming a temporal feature pyramid. It then further expands this temporal hierarchy through differentiated parameter sharing and downsampling, ultimately refining emotional information under the supervision of expression temporal-frequency invariance. Additionally, PTH-Net features an efficient Scalable Semantic Distinction layer that enhances feature discrimination, helping to better distinguish target expressions from non-target ones in the video. Finally, extensive experiments demonstrate that PTH-Net performs excellently on eight challenging benchmarks, with lower computational costs than previous methods. The source code is available at https://github.com/lm495455/PTH-Net.


COMPrompter: reconceptualized segment anything model with multiprompt network for camouflaged object detection

November 2024 · 5 Reads


Second-Order Proximity Guided Sampling Consensus for Robust Model Fitting

November 2024 · 13 Reads

IEEE Transactions on Circuits and Systems for Video Technology

Robust model fitting plays a critical role in artificial intelligence and computer vision, and its performance depends primarily on the sampling algorithm used. However, existing sampling algorithms become less effective when initial correspondences between two images are corrupted by a large number of outliers, especially in the presence of multi-structure data. In this paper, we propose a novel sampling algorithm (called SPGSC) for robust model fitting, where minimal subsets are sampled with the guidance of the second-order proximity measure, which involves global geometric relationships instead of local consistency relationships. Specifically, we first propose a second-order proximity measure to facilitate graph construction, which helps detect a potential inlier from input data as the first datum (i.e., the seed datum) of a minimal subset. After that, we propose a second-order proximity based initial minimal subset generation strategy, which is able to choose a certain number of minimal subsets by the seed data for efficiently producing significant model hypotheses. Furthermore, to achieve better fitting performance, we propose a maximum spanning tree based refinement (MSTR) strategy, which is used to refine the previously sampled minimal subsets and improve the effectiveness and efficiency of the sampling process. Experimental results on three vision tasks (i.e., two-view based motion segmentation, affine matrix based segmentation, and 3D motion segmentation) show the superiority of the proposed SPGSC in comparison with other state-of-the-art algorithms.
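To make the idea of proximity-guided sampling in the abstract above more concrete, here is a rough sketch. It assumes that second-order proximity can be approximated by the cosine similarity between the rows of a first-order affinity matrix (two points are close if they relate to the other points in a similar way), that the seed datum is the point with the largest total proximity, and that the remaining members of a minimal subset are drawn with probability proportional to their proximity to the seed. These definitions and all function names are illustrative assumptions, not the measure or strategy defined in the SPGSC paper.

```python
import math
import random


def second_order_proximity(W):
    """Cosine similarity between rows of a first-order affinity matrix W (assumed definition)."""
    n = len(W)
    norms = [math.sqrt(sum(v * v for v in row)) for row in W]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(W[i], W[j]))
                S[i][j] = dot / (norms[i] * norms[j])
    return S


def sample_minimal_subset(S, size, rng):
    """Pick a seed with the largest total proximity, then draw the rest guided by S[seed]."""
    n = len(S)
    seed = max(range(n), key=lambda i: sum(S[i]))
    subset = [seed]
    candidates = [j for j in range(n) if j != seed]
    while len(subset) < size and candidates:
        weights = [S[seed][j] for j in candidates]
        if sum(weights) > 0:
            pick = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            pick = rng.choice(candidates)  # fall back to uniform sampling
        subset.append(pick)
        candidates.remove(pick)
    return subset
```

With an affinity matrix in which three points agree with each other and a fourth is unrelated to all of them, the unrelated point has zero proximity to the seed and is not drawn while better candidates remain.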



DHM-Net: Deep Hypergraph Modeling for Robust Feature Matching

October 2024 · 7 Reads · 1 Citation

IEEE Transactions on Image Processing

We present a novel deep hypergraph modeling architecture (called DHM-Net) for feature matching in this paper. Our network focuses on learning reliable correspondences between two sets of initial feature points by establishing a dynamic hypergraph structure that models group-wise relationships and assigns weights to each node. Compared to existing feature matching methods that only consider pair-wise relationships via a simple graph, our dynamic hypergraph is capable of modeling nonlinear higher-order group-wise relationships among correspondences in an interaction capturing and attention representation learning fashion. Specifically, we propose a novel Deep Hypergraph Modeling block, which initializes an overall hypergraph by utilizing neighbor information, and then adopts node-to-hyperedge and hyperedge-to-node strategies to propagate interaction information among correspondences while assigning weights based on hypergraph attention. In addition, we propose a Differentiation Correspondence-Aware Attention mechanism to optimize the hypergraph for promoting representation learning. The proposed mechanism is able to effectively locate the exact position of the object of importance via the correspondence-aware encoding and a simple feature gating mechanism to distinguish candidates of inliers. In short, we learn such a dynamic hypergraph format that embeds deep group-wise interactions to explicitly infer categories of correspondences. To demonstrate the effectiveness of DHM-Net, we perform extensive experiments on both real-world outdoor and indoor datasets. Particularly, experimental results show that DHM-Net surpasses the state-of-the-art method by a sizable margin. Our approach obtains an 11.65% improvement under an error threshold of 5° for the relative pose estimation task on the YFCC100M dataset. Code will be released at https://github.com/CSX777/DHM-Net.
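The node-to-hyperedge and hyperedge-to-node propagation described in the abstract above can be sketched in a few lines. This example uses plain averaging in both directions; the attention-based weighting of DHM-Net is deliberately left out, and the function names and the representation of hyperedges as lists of node indices are assumptions made for illustration only.

```python
def node_to_hyperedge(X, hyperedges):
    """Aggregate node features into one feature vector per hyperedge (mean over members)."""
    dim = len(X[0])
    return [
        [sum(X[v][k] for v in edge) / len(edge) for k in range(dim)]
        for edge in hyperedges
    ]


def hyperedge_to_node(E, hyperedges, n_nodes):
    """Send hyperedge features back to nodes (mean over the hyperedges containing each node)."""
    dim = len(E[0])
    out = [[0.0] * dim for _ in range(n_nodes)]
    degree = [0] * n_nodes
    for feat, edge in zip(E, hyperedges):
        for v in edge:
            degree[v] += 1
            for k in range(dim):
                out[v][k] += feat[k]
    # Nodes that belong to no hyperedge keep a zero vector.
    return [[val / d if d else 0.0 for val in row] for row, d in zip(out, degree)]
```

For example, with three nodes carrying the scalar features 0, 2 and 4 and hyperedges `[[0, 1], [1, 2]]`, one round of propagation gives the middle node the average of both group means, so information from nodes 0 and 2 reaches it through the groups it shares with them.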


T-Net++: Effective Permutation-Equivariance Network for Two-View Correspondence Pruning

August 2024 · 3 Reads · 2 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

We propose a conceptually novel, flexible, and effective framework (named T-Net++) for the task of two-view correspondence pruning. T-Net++ comprises two unique structures: the "-" structure and the "|" structure. The "-" structure utilizes an iterative learning strategy to process correspondences, while the "|" structure integrates all feature information of the "-" structure and produces inlier weights. Moreover, within the "|" structure, we design a new Local-Global Attention Fusion module to fully exploit valuable information obtained from concatenating features through channel-wise and spatial-wise relationships. Furthermore, we develop a Channel-Spatial Squeeze-and-Excitation module, a modified network backbone that enhances the representation ability of important channels and correspondences through the squeeze-and-excitation operation. T-Net++ not only preserves the permutation-equivariance manner for correspondence pruning, but also gathers rich contextual information, thereby enhancing the effectiveness of the network. Experimental results demonstrate that T-Net++ outperforms other state-of-the-art correspondence pruning methods on various benchmarks and excels in two extended tasks. Our code will be available at https://github.com/guobaoxiao/T-Net.
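For readers unfamiliar with the squeeze-and-excitation operation that the abstract above builds on, the following is a minimal channel-wise sketch. It follows the generic recipe (average each channel, pass the result through two small linear layers with ReLU and sigmoid, then rescale the channels); biases are omitted, the weights `W1` and `W2` are hypothetical inputs, and the spatial branch of the Channel-Spatial module in T-Net++ is not modeled here.

```python
import math


def squeeze_excite(features, W1, W2):
    """Channel-wise squeeze-and-excitation.

    features: C channels, each a list of N values (one per correspondence).
    W1: (C/r) x C reduction weights; W2: C x (C/r) expansion weights.
    Returns the rescaled features and the per-channel gates.
    """
    # Squeeze: one summary value per channel.
    z = [sum(channel) / len(channel) for channel in features]
    # Excitation: linear -> ReLU -> linear -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in W1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
        for row in W2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, features)]
    return scaled, gates
```

Because the gate of a channel depends only on per-channel averages over all correspondences, reordering the correspondences leaves the gates unchanged, which is consistent with the permutation-equivariance the abstract emphasizes.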


Exploring Deeper! Segment Anything Model with Depth Perception for Camouflaged Object Detection

July 2024 · 5 Reads

This paper introduces a new Segment Anything Model with Depth Perception (DSAM) for Camouflaged Object Detection (COD). DSAM exploits the zero-shot capability of SAM to realize precise segmentation in the RGB-D domain. It consists of the Prompt-Deeper Module and the Finer Module. The Prompt-Deeper Module utilizes knowledge distillation and the Bias Correction Module to achieve the interaction between RGB features and depth features, especially using depth features to correct erroneous parts in RGB features. Then, the interacted features are combined with the box prompt in SAM to create a prompt with depth perception. The Finer Module explores the possibility of accurately segmenting highly camouflaged targets from a depth perspective. It uncovers depth cues in areas missed by SAM through mask reversion, self-filtering, and self-attention operations, compensating for its defects in the COD domain. DSAM represents the first step towards the SAM-based RGB-D COD model. It maximizes the utilization of depth features while synergizing with RGB features to achieve multimodal complementarity, thereby overcoming the segmentation limitations of SAM and improving its accuracy in COD. Experimental results on COD benchmarks demonstrate that DSAM achieves excellent segmentation performance and reaches the state-of-the-art (SOTA) on COD benchmarks with less consumption of training resources. The code will be available at https://github.com/guobaoxiao/DSAM.


MSGA-Net: Progressive Feature Matching via Multi-layer Sparse Graph Attention

July 2024 · 5 Reads · 3 Citations

IEEE Transactions on Circuits and Systems for Video Technology

Feature matching is an essential computer vision task that requires the establishment of high-quality correspondences between two images. Constructing sparse dynamic graphs and extracting contextual information by searching for neighbors in feature space is a prevalent strategy in numerous previous works. Nonetheless, these works often neglect the potential connections between dynamic graphs from different layers, leading to underutilization of available information. To tackle this issue, we introduce a Sparse Dynamic Graph Interaction block for feature matching. This innovation facilitates the implicit establishment of dependencies by enabling interaction and aggregation among dynamic graphs across various layers. In addition, we design a novel Multiple Sparse Transformer to enhance the capture of the global context from the sparse graph. This block selectively mines significant global contextual information along spatial and channel dimensions, respectively. Ultimately, we present the Multi-layer Sparse Graph Attention Network (MSGA-Net), a framework designed to predict probabilities of correspondences as inliers and to recover camera poses. Experimental results demonstrate that our proposed MSGA-Net surpasses state-of-the-art methods on challenging indoor and outdoor datasets. Code will be available at https://github.com/gongzhepeng/MSGA-Net .
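The strategy of building a sparse graph by searching for neighbors in feature space, which the abstract above takes as its starting point, can be sketched as a brute-force k-nearest-neighbor search. The function name, the squared Euclidean distance, and the tie-breaking by index are assumptions for this example; the graph is "dynamic" in the sense that it would be rebuilt from the features of each layer, and the cross-layer interaction proposed in MSGA-Net is not shown.

```python
def knn_graph(features, k):
    """Return, for each feature vector, the indices of its k nearest neighbors."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbors = []
    for i, f in enumerate(features):
        # Sort the other points by distance, breaking ties by index for determinism.
        others = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: (sqdist(f, features[j]), j),
        )
        neighbors.append(others[:k])
    return neighbors
```

Calling `knn_graph` again on the updated features of a later layer yields a different neighbor list, which is what makes relating the graphs of different layers a meaningful question.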


Figure 5: The examples of reconstruction results for masked correspondences. The left column represents the original correspondences, the middle column means the remaining correspondences, and the right column denotes the reconstruction results for masked correspondences.
Figure 6: Partial typical visualization results of the correspondence pruning on YFCC100M. The correspondence is drawn in green if it represents the inlier and red for the outlier.
Ablation study for our CorrFormer encoder.
Ablation study for our CorrMAE. We conduct experiments using two masking types with different masking ratios (%), and report pre-train loss (x100) on MegaDepth [14] as well as fine-tune camera pose estimation on YFCC100M [32].
Evaluation on Aachen Day-Night benchmarks [29, 45] for the visual localization evaluation.

CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder
Preprint · File available · June 2024 · 59 Reads

Pre-training has emerged as a simple yet powerful methodology for representation learning across various domains. However, due to the expensive training cost and limited data, pre-training has not yet been extensively studied in correspondence pruning. To tackle these challenges, we propose a pre-training method to acquire a generic inlier-consistent representation by reconstructing masked correspondences, providing a strong initial representation for downstream tasks. Toward this objective, a modicum of true correspondences naturally serves as input, thus significantly reducing pre-training overhead. In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored for the pre-training of correspondence pruning. CorrMAE involves two main phases, i.e., correspondence learning and matching point reconstruction, guiding the reconstruction of masked correspondences through learning visible correspondence consistency. Herein, we employ a dual-branch structure with an ingenious positional encoding to reconstruct unordered and irregular correspondences. Also, a bi-level designed encoder is proposed for correspondence learning, which offers enhanced consistency learning capability and transferability. Extensive experiments have shown that the model pre-trained with our CorrMAE outperforms prior work on multiple challenging benchmarks. Meanwhile, our CorrMAE is primarily a task-driven pre-training method, and can achieve notable improvements for downstream tasks by pre-training on the targeted dataset. We hope this work can provide a starting point for correspondence pruning pre-training.
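The masking step behind the pre-training idea in the abstract above can be illustrated with a short sketch. It assumes each correspondence is a 4-tuple (x1, y1, x2, y2), that masking is uniformly random at a given ratio, and that reconstruction quality is measured by a mean squared error over all coordinates. The function names and these choices are illustrative assumptions; the abstract does not specify CorrMAE's masking types or loss in this detail.

```python
import random


def split_masked(correspondences, mask_ratio, rng):
    """Randomly hide a fraction of the correspondences; return (visible, masked)."""
    n = len(correspondences)
    n_mask = int(round(n * mask_ratio))
    masked_idx = set(rng.sample(range(n), n_mask))
    visible = [c for i, c in enumerate(correspondences) if i not in masked_idx]
    masked = [c for i, c in enumerate(correspondences) if i in masked_idx]
    return visible, masked


def reconstruction_loss(predicted, target):
    """Mean squared error between predicted and true masked correspondences."""
    total = sum(
        (p - t) ** 2
        for pred, true in zip(predicted, target)
        for p, t in zip(pred, true)
    )
    return total / (len(target) * len(target[0]))
```

In a pre-training loop, an encoder would see only `visible`, a decoder would predict the hidden entries, and `reconstruction_loss` would compare those predictions with `masked`.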

Citations (62)


... It is worth investigating how to design adapters with both modalities in mind: should we use fine-tuned networks with the same architecture or different ones? Inspired by the teacher and student networks in DSAM [58], we believe that different modal inputs should be extracted with features in different ways in order to perform complementary functions. To this end, we design two adapters to handle RGB-D inputs. ...

Reference:

Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
Exploring Deeper! Segment Anything Model with Depth Perception for Camouflaged Object Detection
  • Citing Conference Paper
  • October 2024

... Our DFEC includes detailed facial movements described using natural language. As researchers have shifted their focus from static images to dynamic video content, Dynamic Facial Expression Recognition (DFER) [7,9,16,35,40,86] has attracted attention from experts in psychology, computer science, linguistics, neuroscience, and related disciplines due to its wide-ranging applications. The goal of DFER [24,50,71,92] is to classify video sequences into one of several fundamental emotional categories, including neutral, happiness, sadness, surprise, fear, disgust, and anger. ...

Dual-STI: Dual-path Spatial-Temporal Interaction Learning for Dynamic Facial Expression Recognition
  • Citing Article
  • June 2024

Information Sciences

... To highlight the advantages of our approach, we conducted a detailed comparison using the original jaw cyst dataset against several prominent models, including U-Net [5], NPD-Net [31], TRFE-Net [32], TransUNet [33], TranSEFusionNet [34], SEANet [35], MG-Net [36], FBSNet [37], CPGNet [38], CMUNeXt [39], CMU-Net [40], CCBANet [41], and MFI-Net [42]. The qualitative results are summarized in Table 2. From the table we find that U-Net achieved F1-score of 87.38%, Mcc of 87.72%, and Jaccard index of 77.86%, which is the worst performance among all models. ...

A novel non-pretrained deep supervision network for polyp segmentation
  • Citing Article
  • May 2024

Pattern Recognition

... Frequency domain analysis, a cornerstone in image signal processing, has been instrumental across a diverse array of fields such as image classification [31], [32], texture extraction [33], image fusion [34] and super-resolution [35], [36]. To cope with the face live detection problem, Stuchi et al. [31] proposed a feature extraction method based on Fourier analysis. ...

FAFusion: Learning for Infrared and Visible Image Fusion via Frequency Awareness
  • Citing Article
  • January 2024

IEEE Transactions on Instrumentation and Measurement

... SuperPoint is a self-supervised machine learning operator that seems to be a convenient and adjustable solution. Alternatively, correspondence can be established with an AI scoring network [59] or pruning performed with global texture [60]. ...

BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... Advancements in feature matching techniques, such as SSL-Net's approach to sparse semantic learning [15] and PGFNet's strategy for preference-guided filtering [16], have been instrumental in enhancing the accuracy of feature correspondence in complex scenarios. These methods demonstrate the importance of accurately identifying and filtering key features, a concept that can significantly augment traditional methods like ICP in point cloud densification processes, especially in dynamic scenes. ...

SSL-Net: Sparse semantic learning for identifying reliable correspondences
  • Citing Article
  • October 2023

Pattern Recognition

... These methods estimate model parameters from data points generated by multiple models. Many prevailing approaches perform one-shot model fitting, which can be classified into clustering-based methods [44]–[46] and RANSAC-based methods [47]–[50]. The first category clusters input points based on the preference of the sampled hypotheses. ...

Density-Guided Incremental Dominant Instance Exploration for Two-View Geometric Model Fitting
  • Citing Article
  • October 2023

IEEE Transactions on Image Processing

... compares the outcomes of our proposed method on the DFEW dataset to the results of the following DFER methods: References [22, 24, 59, 60, 64, 65, 67–70, 72–74]. It can be observed from the scores in ...

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild
  • Citing Article
  • January 2023

IEEE Transactions on Circuits and Systems for Video Technology

... MSA-Net (Zheng et al. 2022) and PGFNet (Liu et al. 2023) also propose some jointly spatial-channel attention blocks to capture the global context of correspondences. After that, there are some researches based on the graph neural network (Zhao et al. 2021; Dai et al. 2022; Liao et al. 2023). CLNet (Zhao et al. 2021) introduces a neighborhood aggregation manner and the pruning strategy to refine coarse correspondences. ...

SGA-Net: A Sparse Graph Attention Network for Two-View Correspondence Learning
  • Citing Article
  • December 2023

IEEE Transactions on Circuits and Systems for Video Technology

... Methods such as TENet [40], LF Tracy [36], and LF Transnet [25] employ attention mechanisms to fuse features from the focal stack with various depths, implicitly integrating depth information with contextual cues. FESNet [2] introduces a joint learning framework, utilizing skip connections to link encoder information to the decoder, thereby implicitly addressing the loss of angular information during feature transmission. LF-SODNet [60] proposes the use of focal stacks as depth cues, emphasizing the exploration of spatial information from all-in-focus images. ...

Fusion-Embedding Siamese Network for Light Field Salient Object Detection
  • Citing Article
  • January 2023

IEEE Transactions on Multimedia