Xian Sun’s research while affiliated with Aerospace Information Research Institute, Chinese Academy of Sciences and other places

Publications (444)


A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning
  • Preprint

April 2025 · 3 Reads

Mengyu Wang · Hanbo Bi · Yingchao Feng · [...] · Xian Sun

Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervision loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on six typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.
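
To make the pre-training objective concrete, here is a minimal sketch of the two losses the abstract describes, assuming per-pixel scattering coefficients and a scalar power contribution per scattering basis; all names, shapes, and the L1 form are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def pretraining_losses(pred_coeffs, yamaguchi_coeffs, basis_powers, input_power):
    """
    pred_coeffs:      (B, K, H, W) coefficients from the scattering query decoder
    yamaguchi_coeffs: (B, K, H, W) reference coefficients from Yamaguchi decomposition
    basis_powers:     (K,) assumed scalar power contribution of each scattering basis
    input_power:      (B, H, W) power of the input complex-valued SAR image
    """
    # Polarimetric decomposition loss: align predicted coefficients
    # with the Yamaguchi coefficients.
    decomp_loss = F.l1_loss(pred_coeffs, yamaguchi_coeffs)

    # Power self-supervision loss: reconstruct power as a weighted combination
    # of scattering bases and coefficients, then compare with the input power.
    recon_power = torch.einsum("bkhw,k->bhw", pred_coeffs, basis_powers)
    power_loss = F.l1_loss(recon_power, input_power)

    return decomp_loss + power_loss
```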


SemStereo: Semantic-Constrained Stereo Matching Network for Remote Sensing

April 2025 · 1 Read

Proceedings of the AAAI Conference on Artificial Intelligence

Semantic segmentation and 3D reconstruction are two fundamental tasks in remote sensing, typically treated as separate or loosely coupled tasks. Despite attempts to integrate them into a unified network, the constraints between the two heterogeneous tasks are not explicitly modeled, since the pioneering studies either utilize a loosely coupled parallel structure or engage in only implicit interactions, failing to capture the inherent connections. In this work, we explore the connections between the two tasks and propose a new network that imposes semantic constraints on the stereo matching task, both implicitly and explicitly. Implicitly, we transform the traditional parallel structure into a new cascade structure termed the Semantic-Guided Cascade structure, where deep features enriched with semantic information are utilized to compute the initial disparity maps, enhancing semantic guidance. Explicitly, we propose a Semantic Selective Refinement (SSR) module and a Left-Right Semantic Consistency (LRSC) module. The SSR refines the initial disparity map under the guidance of the semantic map. The LRSC ensures semantic consistency between the two views by reducing the semantic divergence after transforming the semantic map from one view to the other using the disparity map. Experiments on the US3D and WHU datasets demonstrate that our method achieves state-of-the-art performance for both semantic segmentation and stereo matching.
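
The LRSC idea can be illustrated with a short sketch: the right-view semantic map is warped into the left view using the predicted disparity, and the remaining semantic divergence is penalized. The tensor layout and the KL-divergence penalty below are assumptions for illustration, not the published module.

```python
import torch
import torch.nn.functional as F

def lr_semantic_consistency(sem_left, sem_right, disparity):
    """
    sem_left, sem_right: (B, C, H, W) per-view semantic logits
    disparity:           (B, 1, H, W) left-view disparity in pixels
    """
    B, _, H, W = disparity.shape
    # Build a sampling grid: each left pixel (x, y) looks up (x - d, y) on the right.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.to(disparity) - disparity[:, 0]          # shift columns by disparity
    ys = ys.to(disparity).expand(B, H, W)
    grid = torch.stack([2 * xs / (W - 1) - 1,        # normalize coords to [-1, 1]
                        2 * ys / (H - 1) - 1], dim=-1)
    warped_right = F.grid_sample(sem_right, grid, align_corners=True)

    # Penalize semantic divergence between the left map and the warped right map.
    return F.kl_div(F.log_softmax(warped_right, dim=1),
                    F.softmax(sem_left, dim=1), reduction="batchmean")
```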


HyperMixer: Specializable Hypergraph Channel Mixing for Long-term Multivariate Time Series Forecasting

April 2025 · 1 Read

Proceedings of the AAAI Conference on Artificial Intelligence

Long-term Multivariate Time Series (LMTS) forecasting aims to predict extended future trends based on channel-interrelated historical data. Because channel correlations are elusive, most existing methods compromise by treating channels as independent or by tentatively modeling pairwise channel interactions, making it challenging to capture both the higher-order interactions and the time variation in channel correlations. In this paper, we propose HyperMixer, a novel specializable hypergraph channel mixing plugin that introduces versatile hypergraph structures to capture group channel interactions and time-varying patterns for long-term multivariate time series forecasting. Specifically, to encode the higher-order channel interactions, we structure multiple channels into a hypergraph, achieving a two-phase message-passing mechanism: channel-to-group and group-to-channel. Moreover, functionally specializable hypergraph structures are presented to boost the capability of the hypergraph to capture time-varying patterns across periods, further refining the modeling of channel correlations. Extensive experimental results on seven available benchmark datasets demonstrate the effectiveness and generalization of our plugin in LMTS forecasting. The visual analysis further illustrates that HyperMixer with specializable hypergraphs tailors channel interactions to specific periods.
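
A minimal sketch of the two-phase channel-to-group and group-to-channel message passing, assuming a learnable soft incidence matrix between channels and hyperedges (groups); the names, normalizations, and residual form are illustrative assumptions.

```python
import torch

def hypergraph_mixing(x, incidence):
    """
    x:         (B, C, D) per-channel embeddings of a multivariate series
    incidence: (C, G) soft channel-to-hyperedge memberships
    """
    # Phase 1 (channel -> group): aggregate channels into hyperedge messages.
    weights = incidence.softmax(dim=0)                    # normalize over channels
    group_msg = torch.einsum("cg,bcd->bgd", weights, x)   # (B, G, D)

    # Phase 2 (group -> channel): scatter group messages back to channels.
    back = torch.einsum("cg,bgd->bcd", incidence.softmax(dim=1), group_msg)
    return x + back                                       # residual channel mixing
```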


RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

April 2025 · 1 Read

The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
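
A hedged sketch of the hierarchical expert layout the abstract describes (modal-specialized, collaborative, and shared experts). The soft router and the additive combination of the three branches are assumptions for illustration, not the published routing rule.

```python
import torch
import torch.nn as nn

def make_ffn(dim):
    # A plain feed-forward expert; the 4x expansion is a common default, assumed here.
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class HierarchicalMoELayer(nn.Module):
    def __init__(self, dim, n_modalities, n_collab):
        super().__init__()
        self.modal_experts = nn.ModuleList([make_ffn(dim) for _ in range(n_modalities)])
        self.collab_experts = nn.ModuleList([make_ffn(dim) for _ in range(n_collab)])
        self.shared_expert = make_ffn(dim)
        self.router = nn.Linear(dim, n_collab)  # gates tokens over collaborative experts

    def forward(self, x, modality):
        """x: (B, T, D) tokens; modality: integer id of the input modality."""
        # Modal-specialized expert: selected by the input's modality.
        out = self.modal_experts[modality](x)
        # Collaborative experts: softly mixed by a learned router
        # (a sparse top-k gate would be used at RingMoE's scale).
        gates = self.router(x).softmax(dim=-1)                         # (B, T, K)
        collab = torch.stack([e(x) for e in self.collab_experts], -1)  # (B, T, D, K)
        out = out + torch.einsum("btdk,btk->btd", collab, gates)
        # Shared expert: applied to every token regardless of modality.
        return out + self.shared_expert(x)
```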


SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

March 2025 · 1 Read

Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception such as occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at https://github.com/chenchen235/SA-Occ.
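
The dynamic-decoupling idea can be sketched as a learned per-cell mask that down-weights stale satellite features wherever the street view suggests dynamic content; the mask predictor and gating scheme below are assumptions for illustration, not the published module.

```python
import torch
import torch.nn as nn

class DynamicDecouplingFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Predicts, per BEV cell, how likely it is to contain dynamic content
        # that the (older) satellite image cannot be trusted to describe.
        self.dynamic_mask = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, street_bev, sat_bev):
        """street_bev, sat_bev: (B, D, H, W) bird's-eye-view feature maps."""
        m = self.dynamic_mask(street_bev)    # (B, 1, H, W), 1 = dynamic region
        sat_bev = (1 - m) * sat_bev          # suppress stale satellite cues there
        return self.fuse(torch.cat([street_bev, sat_bev], dim=1))
```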




Fig. 1. Illustration of CSC. C_t represents labels provided in the current step; C_0 stands for background. Panels (a)/(d), (b)/(e), and (c)/(f) show three instances of semantic confusion.
Fig. 7. Visualization of feature responses after the category perception enhancement module. (a) Input image. (b) Fine-tuning. (c) ILOD. (d) Faster ILOD. (e) MMA. (f) Our DMPM.
Fig. 8. Visualization of detection results on the DIOR dataset under the most challenging setting (15-1). (a) FT. (b) ILOD. (c) Faster ILOD. (d) MMA. (e) Our DMPM.
A Class-Incremental Object Detection Method for Remote Sensing Images Based on Dynamic Multiprototype Matching
  • Article
  • Full-text available

January 2025 · 3 Reads

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

In object detection tasks, incremental learning enables a previously trained model to better adapt to new tasks using either a small amount of old data or none at all. In the incremental training of complex remote sensing scenes, newly arrived data include annotations only for the new classes. These new classes may exhibit spatial overlap and shape similarity with old classes, or may have been labeled as background in earlier tasks, leading to a unique challenge called class semantic confusion. To address this issue, this article dynamically generates multiple representative prototypes for the different categories to refine object matching. To improve matching accuracy, prototype contrastive learning is employed to expand the distance between dissimilar prototypes and reduce the distance between similar ones. Meanwhile, a category perception enhancement module is proposed to enhance the awareness of old categories and mitigate catastrophic forgetting. Comprehensive experimental results demonstrate that our proposed method outperforms the current state-of-the-art class-incremental object detection methods in most experimental settings on the DIOR and FAIR1M datasets.
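
A minimal sketch of the prototype contrastive objective described above, pulling same-category prototypes together and pushing different-category prototypes apart; the InfoNCE form and temperature are assumptions, not the article's exact loss.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(prototypes, labels, tau=0.1):
    """
    prototypes: (N, D) multiple prototypes per category
    labels:     (N,) category id of each prototype
    """
    z = F.normalize(prototypes, dim=1)
    sim = z @ z.t() / tau                                # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # (N, N) same-category mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=prototypes.device)
    pos = same & ~eye                                    # positives: same class, not self

    # InfoNCE-style: each prototype should score its same-category prototypes
    # above all different-category ones.
    denom = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - denom
    return -log_prob[pos].mean()
```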

Hypergraph-Guided Multimodal Prototype for Remote Sensing Scene Understanding

January 2025 · 3 Reads

IEEE Transactions on Geoscience and Remote Sensing

Notable achievements have been made in entity-level perception tasks (e.g., object detection) in remote sensing (RS) image interpretation. But for RS images with rich content, entity-level perception alone cannot capture the interaction patterns between entities. Recognizing the relationships between entities is the key to deeply understanding RS scenes. In this article, we propose a hypergraph-guided multimodal prototype network (HMPNet), which performs relation recognition by matching relation representations with multimodal predicate prototypes. To overcome the imbalance of modal information in the matching process, a multimodal calibration strategy is devised that takes into account the image subprototype and text subprototype, making prediction results more reliable. Meanwhile, to align image and text subprototypes and explore relevant semantic patterns, a multimodal hypergraph is constructed to efficiently capture the associations between heterogeneous prototypes. Experimental results show that our model reaches state-of-the-art (SOTA) performance on the RS scene graph generation (SGG) task.
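
A rough sketch of matching relation representations against multimodal predicate prototypes, with a single calibration weight balancing image and text subprototypes; the convex combination stands in for the article's calibration strategy and is an assumption.

```python
import torch
import torch.nn.functional as F

def match_predicates(relation, img_protos, txt_protos, alpha=0.5):
    """
    relation:               (B, D) relation representations
    img_protos, txt_protos: (P, D) image/text subprototypes for P predicates
    alpha:                  calibration weight balancing the two modalities
    """
    # Calibrate the two modality subprototypes into one predicate prototype.
    protos = alpha * F.normalize(img_protos, dim=1) + \
             (1 - alpha) * F.normalize(txt_protos, dim=1)
    # Cosine matching of each relation against every predicate prototype.
    logits = F.normalize(relation, dim=1) @ F.normalize(protos, dim=1).t()
    return logits.argmax(dim=1)   # predicted predicate per relation
```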


SoPerModel: Leveraging Social Perception for Multi-Agent Trajectory Prediction

January 2025

IEEE Transactions on Geoscience and Remote Sensing

Trajectory prediction is an essential task within various automation systems. Recent studies have highlighted that the social interactions among multiple agents are crucial for accurate predictions, relying on empirically derived, human-imposed constraints to model these interactions. However, from a sociological perspective, agents’ interactions exhibit significant inherent randomness. Dependence on a priori knowledge may lead to biased estimations of data distributions across different scenarios, failing to account for this randomness. Consequently, such methodologies often do not comprehensively capture the full spectrum of social influences, thus limiting the models’ predictive efficacy. To address these issues, we propose a novel multi-agent trajectory prediction framework, SoPerModel, which incorporates a Freeform Social Evolution Module (FSEM) and a Local Perception Attention mechanism (LPA). The FSEM enables SoPerModel to naturally capture representative social interactions among agents without relying on additional human-derived priors. Through LPA, the model integrates both local and global social interaction information and leverages them to enhance trajectory prediction performance. Our framework is empirically evaluated on real-world trajectory prediction datasets, and the results demonstrate that our approach achieves highly competitive performance compared with state-of-the-art models.
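
A short sketch in the spirit of the local perception attention described above, restricting attention to agents within a spatial radius; the hard distance cutoff and dot-product scoring are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def local_attention(feats, positions, radius=5.0):
    """
    feats:     (N, D) per-agent features
    positions: (N, 2) current agent positions
    """
    dist = torch.cdist(positions, positions)                   # (N, N) pairwise distances
    scores = feats @ feats.t() / feats.shape[1] ** 0.5         # scaled dot-product scores
    scores = scores.masked_fill(dist > radius, float("-inf"))  # keep only local neighbors
    return F.softmax(scores, dim=1) @ feats                    # locally aggregated features
```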


Citations (36)


... Kang et al. [27] associated scattering points and adaptively aggregated layer features to highlight salient targets. Meng et al. [28] proposed STC-Net, which models targets as topology structures, enabling the reconstruction of aircraft features while suppressing background interference. ...

Reference:

SFG-Net: A Scattering Feature Guidance Network for Oriented Aircraft Detection in SAR Images
STC-Net: Scattering Topology Cue-based Network for Aircraft Detection in SAR Images
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... data diversity. In particular, new SAR detection datasets [28,29] have emerged with 100,000 images. However, our previous research [30,31] on SAR foundation models revealed that collecting public datasets yields fewer than 200,000 available target samples due to severe sample imbalance (the samples come mainly from ship detection datasets). ...

FAIR-CSAR: A Benchmark Dataset for Fine-grained Object Detection and Recognition based on Single Look Complex SAR Images
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... It designs a module named Q-Former as the adapter to bridge the modality gap and combines contrastive learning, instance matching, and mask reconstruction together as the SSL objectives. RingMoGPT [168] follows the design of BLIP-2 and makes it capable of object detection and change captioning for RS images. It proposes a location- and instruction-aware Q-Former and pretrains the model using the same learning objectives as BLIP-2. ...

RingMoGPT: A Unified Remote Sensing Foundation Model for Vision, Language, and Grounded Tasks
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... Dataset | No. of image-text pairs | Caption granularity | Caption generation | Image data | Geographical coverage
UCM-Captions (Qu et al., 2016) | 10 500 | Coarse-grained | Manually annotated | RGB, UCMerced (Yang and Newsam, 2010) | Regional
Sydney-Captions (Qu et al., 2016) | 3065 | Coarse-grained | Manually annotated | RGB, Sydney (Zhang et al., 2014) | Regional
RSICD | 54 605 | Coarse-grained | Manually annotated | RGB, Google Earth, Baidu Map | Regional
NWPU-Captions (Cheng et al., 2022) | 157 500 | Coarse-grained | Manually annotated | RGB, NWPU-RESISC45 (Cheng et al., 2017) | Regional
RSICap (Hu et al., 2023) | 2585 | Fine-grained | Manually annotated | RGB, DOTA (Xia et al., 2018) | Regional
RS5M | 5 000 000 | Coarse-grained | Model-generated, multiple datasets | RGB, multiple datasets | Global
SkyScript | 2 600 000 | Coarse-grained | OpenStreetMap | RGB and multispectral, multiple sensors | Global
FIT-RS (Luo et al., 2024) | 1 800 851 | Fine-grained | STAR and ChatGPT | RGB, STAR (Li et al., 2024b) | Global
RemoteCLIP (Liu et al., 2024) | 828 725 | Coarse-grained | Rule-based | RGB, multiple datasets | Global
ChatEarthNet | 173 488 | Fine-grained | WorldCover and ChatGPT | RGB and multispectral, Sentinel-2 | Global ...

STAR: A First-Ever Dataset and a Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery
  • Citing Article
  • November 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Some recent studies developed PEFT methods specific to geospatial foundation models [27,42] and started to explore the challenge of self-supervised fine-tuning [29,41]. Others, by contrast, developed supervised fine-tuning methods that address cross-domain adaptation by incorporating domain inductive bias for specific cases, e.g., multi-spectral [50], thermal [62], or RGB-Depth [24] imagery. ...

TEA: A Training-Efficient Adapting Framework for Tuning Foundation Models in Remote Sensing
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... In remote sensing scenarios, the prototype learning approach is widely used in few-shot learning methods [56], [57], [58], [59] because of its ability to extract stable features for each class. Besides, PLNet-PR [60] utilized a prototype learning network with proposal relation to enhance the receptive field of small targets for object detection. ...

AgMTR: Agent Mining Transformer for Few-Shot Segmentation in Remote Sensing

International Journal of Computer Vision

... Mainstream learning-based MVS frameworks divide the 3D reconstruction task into feature extraction, 3D cost volume construction, 3D cost volume regularization, depth regression, and depth map prediction sub-tasks. Multiple structures are often embedded in these MVS frameworks to improve reconstruction performance, such as the feature pyramid network [1]-[3], the cascade structure [4]-[6], and the attention mechanism [7]. ...

SDL-MVS: View Space and Depth Deformable Learning Paradigm for Multiview Stereo Reconstruction in Remote Sensing
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... Ref. [34] presented a method to employ foundation models for few-shot object detection, showcasing their adaptability with minimal data. In the domain of FSS, Ref. [35] stands out with its auto-prompt network designed for cross-domain few-shot semantic segmentation. This method can generate prompts autonomously, allowing the model to adapt to new domains with limited labeled data. ...

Prompt-and-Transfer: Dynamic Class-Aware Enhancement for Few-Shot Segmentation
  • Citing Article
  • September 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Knowledge distillation (KD) has emerged as a crucial technique in model compression and transfer learning. Cross-modal knowledge distillation (CMKD), an extension of KD, focuses on transferring knowledge between different modalities [16]. Many studies have explored cross-modal distillation for audio-visual speech data, leveraging the inherent synergy and correspondence between audio signals and lip movements. ...

Multimodal Cross-Lingual Summarization for Videos: A Revisit in Knowledge Distillation Induced Triple-Stage Training Method
  • Citing Article
  • August 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... In the future, we will explore PORE for automated data synthesis and probe its scalability and practicality in the massive-scale pre-training stage. We will also attempt other reasoning formats, such as change reasoning (Lu et al., 2024). ...

Relation-Aware Multi-Pass Comparison Deconfounded Network for Change Captioning
  • Citing Article
  • December 2024

IEEE Transactions on Circuits and Systems for Video Technology