Yu Meng’s research while affiliated with Aerospace Information Research Institute, Chinese Academy of Sciences and other places


Publications (33)


LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
  • Preprint

March 2025 · 1 Read

Weizhi Chen · Jingbo Chen · [...] · Yu Meng
This study addresses two technical bottlenecks in remote sensing vision-language foundation models (VLFMs): handling long text, and the "hallucination" caused by insufficient information in short text. We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) by integrating multi-source remote sensing data and adopting a large-language-model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs and provides both short and long texts for the first time, solving the semantic granularity limitations of existing datasets; (2) we design the LRSCLIP architecture based on Long-CLIP's KPS module, which extends CLIP's text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP improves over the current best model, GeoRSCLIP, by 0.17%, 0.67%, and 0.92% in Text-to-Image R@1, Image-to-Text R@1, and mR on RSITMD, respectively, and by 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy = 75.75%) and the semantic localization task (Rmi = 0.7653), LRSCLIP achieves state-of-the-art performance. These results validate LRSCLIP's dual advantages of fine-grained semantic understanding and global feature matching. This work provides a new benchmark model and data support for remote sensing multimodal learning. The code has been open-sourced and is available at https://github.com/MitsuiChen14/LRSCLIP.
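
The abstract describes a dual-text loss weighting mechanism but gives no formula. The following is a minimal sketch, assuming a standard CLIP-style symmetric InfoNCE loss computed separately against short-text and long-text embeddings and combined with scalar weights; the function names and the weights w_short/w_long are illustrative assumptions, not the authors' code.

    # Hypothetical sketch of a dual-text weighted contrastive loss (PyTorch).
    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        # Symmetric InfoNCE over a batch of matched image/text pairs.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    def dual_text_loss(image_emb, short_emb, long_emb, w_short=0.5, w_long=0.5):
        # Weighted sum of the short-text and long-text alignment losses;
        # this weighting stands in for whatever scheme LRSCLIP actually uses.
        return (w_short * clip_loss(image_emb, short_emb)
                + w_long * clip_loss(image_emb, long_emb))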


SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion
  • Preprint
  • File available

February 2025 · 14 Reads

Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis, which leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we design three specialized modules: an identity preservation module, an audio guidance module, and an editing control module. Our design effectively balances the different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate representations. Extensive experiments demonstrate that SayAnything generates highly realistic videos with improved lip-teeth coherence, enabling unseen characters to say anything, while generalizing effectively to animated characters.
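
The abstract says the three condition signals are balanced in the latent space but does not specify how. As a rough illustration only, the sketch below fuses the three condition streams with learned gates before they would be injected into a denoiser; the class name, gate design, and feature shapes are assumptions, not the SayAnything implementation.

    # Illustrative condition fusion with learned per-stream gates (PyTorch).
    import torch
    import torch.nn as nn

    class ConditionFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.gates = nn.Parameter(torch.ones(3))  # one gate per condition stream
            self.proj = nn.Linear(dim, dim)

        def forward(self, identity_feat, audio_feat, region_feat):
            # Keep the gates on a simplex so the streams stay balanced.
            g = torch.softmax(self.gates, dim=0)
            fused = g[0] * identity_feat + g[1] * audio_feat + g[2] * region_feat
            # The fused signal would condition the video denoiser,
            # e.g. through cross-attention.
            return self.proj(fused)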



Overall architecture of the inception-enhanced temporal attention encoder (IncepTAE). Abbreviations, colors, and symbols: green block: data tensor; red text: tensor shape; blue block: neural-network module; ⟶: direction of data flow; ⊕: element-wise addition; ⊗: matrix multiplication; X: input time series; D: dates of the time series; E: embedding of the time series; O: output of the self-attention mechanism; Õ: final output of the hybrid temporal feature; @1, @k: kernel sizes of the convolution and pooling layers.
Comparison of the computation of keys for multi-head attention in (a) LTAE and (b) IncepTAE. IncepTAE requires less computation than LTAE because it implements multiple heads with group convolution rather than a full linear layer.
Confusion proportions of (a) IncepTAE and (b) GLTAE for each class in TimeSen2Crop, plotted in different colors. Note that the proportion axis starts at 60%. IncepTAE yields better results in most classes, especially Legumes and Winter Caraway.
Confusion proportions of (a) IncepTAE and (b) GLTAE for each class in Ghana, plotted in different colors. Note that the proportion axis starts at 80%. Both models tend to misclassify Non-Intercrop as Intercrop, but IncepTAE misclassifies smaller proportions of most classes as Intercrop than GLTAE, yielding better performance.
Performance in OA, mIoU, κ, and mF1 of IncepTAE with different kernel sizes. The best results are achieved with kernel_size=5 on (a) TimeSen2Crop and kernel_size=9 on (b) Ghana, both slightly better than kernel_size=3. In practice, kernel_size=3 is enough to extract enhanced global and local information with less computation.


Satellite Image Time-Series Classification with Inception-Enhanced Temporal Attention Encoder

December 2024 · 31 Reads

In this study, we propose the one-branch IncepTAE network to extract local and global hybrid temporal attention simultaneously and congruously for fine-grained satellite image time series (SITS) classification. Transformers and the temporal self-attention mechanism have been the focus of SITS classification research in recent years, yet their effectiveness seems to diminish in fine-grained classification among similar categories, for example, different crop types. Most existing methods focus on only one type of temporal attention, either global or local, but both are required to achieve fine-grained classification. Even works that adopt a two-branch architecture to extract hybrid attention usually lack congruity between the different types of temporal attention, which hinders the expected discriminating ability. Compared with existing methods, IncepTAE exhibits several methodological novelties. First, we insert average/maximum pooling layers into the calculation of multi-head attention to extract hybrid temporal attention. Second, IncepTAE adopts a one-branch architecture, which reinforces the interaction and congruity of the different temporal information. Third, IncepTAE is more lightweight owing to its use of group convolutions. IncepTAE achieves 95.65% and 97.84% overall accuracy on two challenging datasets, TimeSen2Crop and Ghana. Comparisons with existing state-of-the-art methods demonstrate that IncepTAE achieves superior classification performance and faster inference speed, which is conducive to the large-area application of SITS classification.
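
As a minimal sketch of the key idea, assuming the pooling branches are inserted where the attention keys are computed (the figure captions above indicate keys come from group convolutions): local context from a grouped temporal convolution is combined with average/max pooling branches, Inception-style. The layer names and the 1x1 mixing convolution are assumptions for illustration, not the published architecture.

    # Sketch: Inception-style key computation for temporal attention (PyTorch).
    # Requires dim to be divisible by the number of heads.
    import torch
    import torch.nn as nn

    class InceptionKeys(nn.Module):
        def __init__(self, dim, heads, k=3):
            super().__init__()
            # Group convolution (one group per head) is cheaper than a full
            # linear layer, which is where the lightweight claim comes from.
            self.conv = nn.Conv1d(dim, dim, k, padding=k // 2, groups=heads)
            self.avg = nn.AvgPool1d(k, stride=1, padding=k // 2)
            self.max = nn.MaxPool1d(k, stride=1, padding=k // 2)
            self.mix = nn.Conv1d(3 * dim, dim, kernel_size=1)  # fuse the branches

        def forward(self, x):  # x: (batch, dim, time)
            hybrid = torch.cat([self.conv(x), self.avg(x), self.max(x)], dim=1)
            return self.mix(hybrid)  # keys: (batch, dim, time)

With k=3 the temporal length is preserved, consistent with the observation above that kernel_size=3 is usually sufficient.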



Extracting polygonal footprints in off-nadir images with Segment Anything Model

August 2024 · 8 Reads

Building Footprint Extraction (BFE) in off-nadir aerial images often relies on roof segmentation and roof-to-footprint offset prediction, after which the roof is dragged to the footprint via the offset. However, the results of this multi-stage inference are not usable in data production because of the low quality of the predicted masks. To solve this problem, we propose OBMv2, which supports both end-to-end and promptable polygonal footprint prediction. Unlike OBM, OBMv2 uses a newly proposed Self Offset Attention (SOFA) mechanism to bridge the performance gap between bungalows and skyscrapers, realizing true end-to-end footprint polygon prediction without postprocessing such as Non-Maximum Suppression (NMS) or Distance NMS (DNMS). To fully use the information contained in roof masks, building masks, and offsets, we propose a Multi-level Information SyStem (MISS) for footprint prediction, with which OBMv2 can predict footprints even from insufficient predictions. Additionally, to squeeze more information from the same model, we draw inspiration from Retrieval-Augmented Generation (RAG) in Natural Language Processing and pose the "RAG in BFE" problem. To verify the effectiveness of the proposed method, experiments were conducted on the open datasets BONAI and OmniCity-view3, and a generalization test was conducted on the Huizhou test set. The code will be available at https://github.com/likaiucas/OBM.
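
For readers unfamiliar with offset-based BFE, the core geometric step the abstract refers to is simple: translate the roof polygon by the predicted roof-to-footprint offset. The sketch below shows only this step, with assumed array shapes; it is not OBMv2 code.

    # Dragging a roof polygon to its footprint via a predicted offset (NumPy).
    import numpy as np

    def roof_to_footprint(roof_polygon, offset):
        # roof_polygon: (N, 2) vertex array in pixel coordinates.
        # offset: (2,) predicted roof-to-footprint displacement (dx, dy).
        return np.asarray(roof_polygon) + np.asarray(offset)[None, :]

    roof = np.array([[10.0, 12.0], [40.0, 12.0], [40.0, 30.0], [10.0, 30.0]])
    footprint = roof_to_footprint(roof, [-6.0, 9.0])  # hypothetical offset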


Global heterogeneous graph convolutional network: from coarse to refined land cover and land use segmentation

May 2024 · 111 Reads

The abundant details embedded in very-high-resolution remote sensing images establish a solid foundation for comprehending the land surface. Simultaneously, as spatial resolution advances, there is a corresponding escalation in the required granularity of land cover and land use (LCLU) categories: the coarse classes identified necessitate further refinement into more detailed categories. For instance, the 'built-up' class can be subdivided into specific categories such as squares, stadiums, and airports. These refined LCLU classifications are better equipped to support diverse domains. Nonetheless, most studies simply adopt methods initially designed for coarse LCLU when addressing the challenging refined LCLU segmentation. Few studies have considered the inherent relationships between coarse and refined LCLU, overlooking the potential of the numerous recently released LCLU products. To better leverage this prior knowledge, we propose the Global Heterogeneous Graph Convolutional Network (GHGCN). The GHGCN introduces a heterogeneous graph that excels at establishing relationships between coarse and refined LCLU and can extract long-distance dependencies more effectively than convolutional neural networks. Furthermore, the model runs end-to-end, eliminating the need for presegmentation and accelerating training. GHGCN exhibits competitive performance compared with state-of-the-art models, indicating its effective exploitation of coarse LCLU data, especially for categories with limited samples. The source code is released at: https://github.com/Liuzhizhiooo/GHGCN.
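
The paper's graph construction is not reproduced here; the sketch below illustrates only the general flavor of heterogeneous message passing between coarse-LCLU nodes and refined-LCLU nodes, with an assumed soft-assignment matrix linking the two node types. It is an illustration of the idea, not the GHGCN layer.

    # Sketch: one heterogeneous graph-convolution step between coarse and
    # refined LCLU node features (PyTorch).
    import torch
    import torch.nn as nn

    class HeteroLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.c2r = nn.Linear(dim, dim)  # coarse -> refined messages
            self.r2c = nn.Linear(dim, dim)  # refined -> coarse messages

        def forward(self, h_coarse, h_refined, assign):
            # assign: (n_refined, n_coarse) soft membership of refined nodes
            # in coarse classes, e.g. derived from an existing LCLU product.
            to_refined = assign @ self.c2r(h_coarse)
            to_coarse = assign.t() @ self.r2c(h_refined)
            return (torch.relu(h_coarse + to_coarse),
                    torch.relu(h_refined + to_refined))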



Prompt-Driven Building Footprint Extraction in Aerial Images With Offset-Building Model

January 2024 · 13 Reads · 1 Citation

IEEE Transactions on Geoscience and Remote Sensing

More accurate extraction of invisible building footprints from very-high-resolution (VHR) aerial images relies on roof segmentation and roof-to-footprint offset extraction. Existing methods based on instance segmentation generalize poorly when extended to large-scale data production and fail to achieve low-cost human interaction. The promptable paradigm of recent vision foundation models inspires us to design a promptable framework for roof and offset extraction, transforming end-to-end algorithms into promptable methods. Within this framework, we propose a novel Offset-Building Model (OBM). Based on prompt prediction, we first identify a common pattern in offset prediction and tailor Distance-NMS (DNMS) algorithms for offset optimization. To rigorously evaluate the algorithm's capabilities, we introduce a prompt-based evaluation method under which our model reduces offset errors by 16.6% and improves roof Intersection over Union (IoU) by 10.8% compared with other models. Leveraging the common patterns in offset prediction, the DNMS algorithms enable models to further reduce offset vector loss by 6.5%. To further validate generalization, we tested the models on a newly proposed Huizhou test set with over 7,000 manually annotated instance samples. Our algorithms and dataset will be available at https://github.com/likaiucas/OBM.
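
The paper defines DNMS precisely; the sketch below is only a generic distance-based suppression meant to convey the flavor: among candidate offset vectors with confidence scores, keep the highest-scoring candidate and discard near-duplicates within a distance threshold. The function and threshold are assumptions, not the published DNMS.

    # Generic distance-based non-maximum suppression over offset vectors (NumPy).
    import numpy as np

    def distance_nms(offsets, scores, thresh=5.0):
        # offsets: (N, 2) candidate offset vectors; scores: (N,) confidences.
        offsets = np.asarray(offsets, dtype=float)
        order = np.argsort(-np.asarray(scores))
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            # Drop remaining candidates whose vectors are close to the kept one.
            dist = np.linalg.norm(offsets[order[1:]] - offsets[i], axis=1)
            order = order[1:][dist > thresh]
        return keep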


Fig. 5. The structure of the joint optimization process.
Fig. 6. The architecture of the temporal convolutional network.
Fig. 12. Evaluation of clustering metrics against different latent feature dimensions on the Cerrado dataset, delineating the optimal embedding size for maximal clustering accuracy.
Fig. 14. Execution time comparison between different methods across two datasets: Imperial and Reunion.
Deep Temporal Joint Clustering for Satellite Image Time Series Analysis

January 2024 · 24 Reads

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

With the advancement of remote sensing satellite technology, the acquisition of Satellite Image Time Series (SITS) data has increased significantly, providing new opportunities and challenges for land cover analysis. Traditional unsupervised clustering methods often struggle with the complexity of these data because of limited scalability and generalization. In response, this paper proposes a new unsupervised learning approach, Deep Temporal Joint Clustering (DTJC), designed for efficient pixel-wise clustering of SITS data. DTJC jointly optimizes the reconstruction of temporal information and a clustering objective, which preserves the temporal dynamics of the original data while creating a feature space conducive to clustering. Experimental results show that DTJC achieves the best clustering performance across four publicly available multi-spectral SITS datasets: TimeSen2Crop, Cerrado Biome, Reunion Island, and Imperial. Compared with traditional K-means and projection algorithms, DTJC significantly improves clustering accuracy, especially in environments with complex geographical distributions. Because DTJC requires no labeled data, it greatly enhances the efficiency of SITS analysis, making it a powerful tool for automated land cover classification and environmental monitoring.
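
The abstract describes jointly optimizing temporal reconstruction and a clustering objective but gives no formula. Below is a minimal sketch of such a joint objective, assuming a DEC-style soft-assignment clustering term alongside an autoencoder reconstruction term; the exact DTJC losses may differ.

    # Sketch: joint reconstruction + clustering loss (PyTorch).
    import torch
    import torch.nn.functional as F

    def soft_assign(z, centroids, alpha=1.0):
        # Student-t similarity between embeddings z (B, D) and centroids (K, D).
        d2 = torch.cdist(z, centroids) ** 2
        q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
        return q / q.sum(dim=1, keepdim=True)

    def joint_loss(x, x_recon, z, centroids, gamma=0.1):
        q = soft_assign(z, centroids)
        p = q ** 2 / q.sum(dim=0)                     # sharpened target distribution
        p = (p / p.sum(dim=1, keepdim=True)).detach()
        recon = F.mse_loss(x_recon, x)                # keeps the temporal dynamics
        cluster = F.kl_div(q.log(), p, reduction="batchmean")
        return recon + gamma * cluster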


Citations (19)


... In response to these limitations of traditional segmentation models, the rise of vision foundation models that have undergone task-agnostic training on huge amounts of data [17,18] offers potential solutions. Without re-training, these models can quickly adapt to a new downstream task in a different setting [19,20] through user prompts, without task-specific contextual information. Consequently, we turn to the Segment Anything Model (SAM) [17], one of the first promptable vision foundation models, for our purposes. ...

Reference:

GeoSAM: Fine-tuning SAM with Multi-Modal Prompts for Mobility Infrastructure Segmentation
SAMPolyBuild: Adapting the Segment Anything Model for polygonal building extraction
  • Citing Article
  • December 2024

ISPRS Journal of Photogrammetry and Remote Sensing

... However, such RoI-based methods can hardly be applied in real data production because of unstable RPN and Non-Maximum Suppression (NMS) algorithms (Viola and Jones, 2001). To solve this problem, OBM (Li et al., 2024a) was proposed, borrowing the structure of the Segment Anything Model (SAM) (Kirillov et al., 2023). OBM is a promptable model built entirely from Transformers, and its prompt-level offsets are predicted by the Reference Offset Augmentation Module (ROAM) using a concept of offset queries. ...

Prompt-Driven Building Footprint Extraction in Aerial Images With Offset-Building Model
  • Citing Article
  • January 2024

IEEE Transactions on Geoscience and Remote Sensing

... In addition to the network structure, many attempts have been made to improve the semantic segmentation of remote sensing images. These include attention mechanisms to enhance feature representation [10][11][12][13][14][15][16], generative adversarial networks to enable cross-domain semantic segmentation [17][18][19], a self-updating CNN model [20] and a progressive edge guidance network [21] to incorporate geographic knowledge into CNN models, a semantic category balance-aware anti-interference network named SCBANet [22] to handle the category imbalance issue, a high-order semantic decoupling network (HSDN) to disentangle features [23], and an uncertainty-aware network (UANet) [24] to facilitate level-by-level feature refinement. ...

Learning to Adapt Adversarial Perturbation Consistency for Domain Adaptive Semantic Segmentation of Remote Sensing Images

... Users often need to detect changes only in specific categories, e.g., building change detection for urban expansion analysis [21,7,40], cropland change detection for agricultural protection [34,53], landslide change detection for disaster monitoring [75], etc. Typical semantic change detection methods follow a triple-decoder architecture, i.e., one difference branch for binary change detection and two semantic branches for bi-temporal class discrimination [67,79,14,15]. In this paper, we simplify multi-class semantic change detection to single-class semantic change detection, which achieves the same objective while avoiding a complex model structure and allows the extraction of changes in any class using only off-the-shelf single-temporal models. ...

TChange: A Hybrid Transformer-CNN Change Detection Network

... Remote sensing binary change detection is an active research field with a wide range of applications in Earth observation, aiming to identify regions of change within bi-temporal images (Coppin et al. 2004; Deng et al. 2022; Gueguen and Hamid 2016; Ji, Wei, and Lu 2019; Wang et al. 2023). To further elaborate change information, semantic change detection (SCD) identifies these changes and classifies the land cover/land use (LCLU) transition types (Lv et al. 2022; Peng et al. 2021). ...

Feature Guided Multitask Change Detection Network

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

... For instance, Martha et al. (2016) highlighted the capability of change detection-based methods with multispectral images to identify landslides in vegetated regions. Similarly, Huang et al. (2020) demonstrated the effectiveness of using the normalized difference vegetation index (NDVI) and built-up area presence index features to detect landslides in vegetated areas with multispectral images. Landslide detection using multispectral imagery can be further categorized into bi-temporal and multi-temporal approaches. ...

Landslide Monitoring Using Change Detection in Multitemporal Optical Imagery

IEEE Geoscience and Remote Sensing Letters

... It assigns a categorical label to every pixel in an image, enabling accurate identification and classification of terrain features such as buildings, roads, vegetation, and aquatic environments. This ability has revolutionized fields like urban planning, disaster response, agricultural observation, environmental preservation, and land-use classification [1,10]. ...

Aerial Image Semantic Segmentation Using Spatial and Channel Attention
  • Citing Conference Paper
  • July 2019

... Compared with available statistics for 2023, the area of installed RPVs falls within the statistical range for the household and distributed PV installation area (1.16–19.04 km²), underscoring the rationality of our findings. ...

Photovoltaic power station identification using refined encoder–decoder network with channel attention and chained residual dilated convolutions
  • Citing Article
  • January 2020

Journal of Applied Remote Sensing

... Cheng et al. [17] suggested using a structured neural network to recognize roads from satellite images. However, extracting roads with a structured neural network requires large datasets and large images to obtain exact results. ...

Recognizing Road From Satellite Images by Structured Neural Network
  • Citing Article
  • May 2019

Neurocomputing

... A fuzzy rule-based composition of anisotropic textural measures, derived from the satellite's panchromatic (PAN) image with the gray-level co-occurrence matrix (GLCM), is used to calculate the built-up presence index (PanTex). The contrast textural measure was selected for its excellent ability to discriminate between built-up and non-built-up areas [19]. PanTex is calculated in three key steps. ...
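
For readers who want to experiment with the contrast measure mentioned above, here is a small sketch using scikit-image (graycomatrix/graycoprops; spelled greycomatrix before scikit-image 0.19). The quantization level, displacement set, and the choice of taking the minimum over directions are assumptions for illustration, not the exact PanTex procedure.

    # GLCM contrast over several directions for a PAN image patch (scikit-image).
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def glcm_contrast(window, levels=64):
        # window: 2-D uint8 patch of the panchromatic image.
        q = (window.astype(np.float64) / 256.0 * levels).astype(np.uint8)
        glcm = graycomatrix(q, distances=[1],
                            angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                            levels=levels, symmetric=True, normed=True)
        # Taking the minimum over displacement directions is one common way
        # to build a rotation-invariant, anisotropy-aware measure.
        return float(graycoprops(glcm, "contrast").min())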

Urban new construction land parcel detection with normalized difference vegetation index and PanTex information

Journal of Applied Remote Sensing