Guanglai Gao’s research while affiliated with Inner Mongolia University and other places

Publications (176)


SSAN: A Symbol Spatial-Aware Network for Handwritten Mathematical Expression Recognition
  • Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

Haoran Zhang · Xiangdong Su · Xingxiang Zhou · Guanglai Gao
The great challenge of handwritten mathematical expression recognition (HMER) is the complex structure of expressions, which is directly related to the spatial positions of symbols. Existing HMER methods typically employ attention mechanisms in the decoder to implicitly perceive symbol positions, or employ symbol counting and tree-based strategies to model symbol spatial relations. However, these methods still cannot effectively capture the structural information of formulas, thus negatively impacting symbol decoding. To address this problem and enhance HMER performance, this paper proposes a novel auxiliary task: predicting the symbol spatial distribution map of a handwritten expression image. On this basis, the paper designs a symbol spatial-aware network (SSAN) for this task, which is jointly optimized with the HMER model. Specifically, since symbol spatial positions are similar between handwritten mathematical expression images and their corresponding printed templates, the symbol spatial distribution map is obtained by first generating printed templates from the LaTeX ground truth of handwritten formula images and then replacing the connected components of the printed templates with 2D Gaussian distribution maps of the same size. Meanwhile, because the symbol spatial positions of handwritten and printed formula images are only loosely aligned, and similar symbols are easily misclassified, SSAN further incorporates a coarse-to-fine alignment strategy and an attention-guided symbol masking strategy to tackle these issues. Extensive experiments demonstrate that SSAN significantly improves the recognition performance of HMER models, and the proposed auxiliary task is more effective in enhancing HMER performance than existing auxiliary tasks.
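The target-map construction the abstract describes — replace each connected component of a binarized printed template with a same-size 2D Gaussian — can be sketched as follows. The 4-connectivity, the function names, and the choice of sigma (a quarter of the bounding-box extent) are illustrative assumptions, not details given in the abstract.

```python
from collections import deque
import numpy as np

def component_boxes(img):
    """4-connected components of a binary image, returned as
    (y0, y1, x0, x1) bounding boxes (exclusive upper bounds)."""
    h, w = img.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y, x] and not seen[y, x]:
                q = deque([(y, x)])
                seen[y, x] = True
                y0 = y1 = y
                x0 = x1 = x
                while q:
                    cy, cx = q.popleft()
                    y0, y1 = min(y0, cy), max(y1, cy)
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append((y0, y1 + 1, x0, x1 + 1))
    return boxes

def symbol_distribution_map(img):
    """Replace every connected component (one printed symbol) with a 2D
    Gaussian filling its bounding box; overlaps keep the maximum value."""
    out = np.zeros(img.shape, dtype=float)
    for y0, y1, x0, x1 in component_boxes(img):
        hh, ww = y1 - y0, x1 - x0
        yy, xx = np.mgrid[0:hh, 0:ww]
        cy, cx = (hh - 1) / 2, (ww - 1) / 2
        sy, sx = max(hh / 4, 0.5), max(ww / 4, 0.5)  # illustrative sigma
        g = np.exp(-0.5 * (((yy - cy) / sy) ** 2 + ((xx - cx) / sx) ** 2))
        out[y0:y1, x0:x1] = np.maximum(out[y0:y1, x0:x1], g)
    return out
```

The resulting dense map can then serve as the regression target for the auxiliary prediction head, alongside the usual HMER decoding loss.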

Unifying Dual-Space Embedding for Entity Alignment via Contrastive Learning

December 2024

Entity alignment (EA) aims to match identical entities across different knowledge graphs (KGs). Graph neural network-based entity alignment methods have achieved promising results in Euclidean space. However, KGs often contain complex structures, both local and hierarchical, which makes it challenging to represent them efficiently within a single space. In this paper, we propose a novel method, UniEA, which unifies dual-space embedding to preserve the intrinsic structure of KGs. Specifically, we learn graph structure embeddings in Euclidean and hyperbolic spaces simultaneously, maximizing the consistency between the embeddings in the two spaces. Moreover, we employ contrastive learning to mitigate the misalignment caused by similar entities, where embeddings of similar neighboring entities within a KG become too close in distance. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in structure-based EA. Our code is available at https://github.com/wonderCS1213/UniEA.
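The dual-space consistency idea can be illustrated with a small sketch: each entity has a Euclidean embedding and a hyperbolic one on the Poincaré ball, and an InfoNCE-style contrastive loss pulls the two views of the same entity together. The function names, the curvature handling, and the exact loss form are assumptions for illustration; the paper's formulation may differ.

```python
import numpy as np

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c):
    lifts a Euclidean embedding into hyperbolic space."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True).clip(min=1e-9)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_dist(u, v, c=1.0, eps=1e-9):
    """Geodesic distance between points inside the Poincare ball."""
    sq = lambda x: np.sum(x * x, axis=-1)
    num = 2 * c * sq(u - v)
    den = np.clip(1 - c * sq(u), eps, None) * np.clip(1 - c * sq(v), eps, None)
    return np.arccosh(1 + np.clip(num / den, 0, None)) / np.sqrt(c)

def infonce(sim, tau=0.1):
    """InfoNCE-style contrastive loss; row i's positive is column i."""
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    return float(-np.log(np.clip(np.diag(p), 1e-12, None)).mean())

def consistency_loss(euc, hyp, tau=0.1):
    """Treat the hyperbolic image of an entity's Euclidean embedding and
    its hyperbolic embedding as a positive pair; all other rows are negatives."""
    lifted = expmap0(euc)                                       # (n, d) in the ball
    sim = -poincare_dist(lifted[:, None, :], hyp[None, :, :])   # (n, n)
    return infonce(sim, tau)
```

Minimizing this loss alongside the alignment objective would push the two space-specific views of each entity toward agreement, which is the "maximize consistency" step the abstract describes.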


Distance-Adaptive Quaternion Knowledge Graph Embedding with Bidirectional Rotation

December 2024

A quaternion has one real part and three imaginary parts, providing a more expressive hypercomplex space for learning knowledge graph embeddings. Existing quaternion embedding models measure the plausibility of a triplet through either semantic matching or geometric-distance scoring functions. However, semantic matching appears to diminish the separability of entities, while distance scoring weakens their semantics. To address this issue, we propose a novel quaternion knowledge graph embedding model that combines semantic matching with the geometric distance between entities to better measure the plausibility of triplets. Specifically, in the quaternion space, we perform a right rotation on the head entity and a reverse rotation on the tail entity to learn rich semantic features. We then use distance-adaptive translations to learn the geometric distance between entities. Furthermore, we provide mathematical proofs demonstrating that our model can handle complex logical relationships. Extensive experimental results and analyses show that our model significantly outperforms previous models on well-known knowledge graph completion benchmarks. Our code is available at https://github.com/llqy123/DaBR.
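The scoring idea — right-rotate the head by the relation, reverse-rotate the tail, then combine a semantic-matching term with a translated distance — can be sketched in plain quaternion algebra. The way the two terms are combined, and the `trans`/`alpha` translation parameters, are illustrative stand-ins for the paper's distance-adaptive translation, not its exact formulation.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions stored as (..., 4) arrays (a, b, c, d)."""
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,
        a1*b2 + b1*a2 + c1*d2 - d1*c2,
        a1*c2 - b1*d2 + c1*a2 + d1*b2,
        a1*d2 + b1*c2 - c1*b2 + d1*a2,
    ], axis=-1)

def qconj(q):
    """Quaternion conjugate: negate the three imaginary parts."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def qnorm(q):
    """Normalize to a unit quaternion so multiplication acts as a rotation."""
    return q / np.linalg.norm(q, axis=-1, keepdims=True).clip(min=1e-9)

def score(h, r, t, trans, alpha=1.0):
    """Toy DaBR-style plausibility score for a triplet (h, r, t):
    semantic matching of the two rotated entities minus a translated
    distance term (illustrative combination)."""
    r = qnorm(r)
    h_rot = qmul(h, r)            # right rotation of the head entity
    t_rot = qmul(t, qconj(r))     # reverse rotation of the tail entity
    semantic = np.sum(h_rot * t_rot, axis=-1)            # matching term
    distance = np.linalg.norm(h_rot + trans - t_rot, axis=-1)
    return semantic - alpha * distance
```

Because the relation is normalized to a unit quaternion, both rotations preserve entity norms, so the distance term measures geometry rather than scale.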



Citations (48)


... But these efforts focus only on particular medical segmentation tasks. Additionally, unified frameworks have been designed [41,49,61] to address discrepancies across different medical datasets and tasks. However, the scarcity of medical data has increasingly become a significant barrier to the continued advancement of unified segmentation. ...

Reference:

MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs
FSAM: Fine-tuning SAM encoder and decoder for Medical Image Segmentation
  • Citing Conference Paper
  • December 2024

... Unlike text-to-speech task [7,8,9,10,11,12,13], the traditional CSS task mainly focuses on CD context modeling, as shown in Fig. 1(a), which can be summarized into three groups: 1) Multi-scale context modeling, [1] introduces a GRU-based coarse-grained context encoder that extracts semantic information from sentence-level historical dialogues. [14] further considers simultaneously learning coarse-grained and fine-grained contextual dependencies from text. [15] infers speaking styles in dialogues using a multi-scale relational graph convolutional network. ...

FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis
  • Citing Conference Paper
  • November 2024

... Data Imputation: Data imputation is another strategy for handling missing modalities. A noise-robust multimodal sentiment recognition model [24] utilizes a Variational Au-toEncoder (VAE) to reconstruct robust multimodal joint representations from noisy data, addressing real-world scenarios with incomplete information. The UniMF framework [10] employs the Multimodal Generation Mask (MGM) to handle unaligned sequences. ...

Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data Scenarios
  • Citing Conference Paper
  • October 2024

... Fashion product creation poses greater challenges than general image generation due to higher demands for quality and diversity (we complement advances in parameter-efficient tuning and multi-modal adaptation [10,21,23,41-45,48]). Our system requires advanced reasoning and multimodal understanding, supported by ChatGPT-4 for reasoning tasks. ...

Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples
  • Citing Conference Paper
  • October 2024

... 1 https://github.com/lee-jhwn/icassp25-fesde-phoneme [12], pre-trained speech models [13], or the WaveNet [14] architecture [15]. Recently, a framework for direct reconstruction of listened speech waveforms has been proposed as described in [16], where no intermediate acoustic feature step is required. ...

Cross-Attention-Guided Wavenet for Mel Spectrogram Reconstruction in The ICASSP 2024 Auditory EEG Challenge
  • Citing Conference Paper
  • April 2024

... Unlike text-to-speech task [7,8,9,10,11,12,13], the traditional CSS task mainly focuses on CD context modeling, as shown in Fig. 1(a), which can be summarized into three groups: 1) Multi-scale context modeling, [1] introduces a GRU-based coarse-grained context encoder that extracts semantic information from sentence-level historical dialogues. [14] further considers simultaneously learning coarse-grained and fine-grained contextual dependencies from text. ...

Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering
  • Citing Article
  • January 2024

IEEE/ACM Transactions on Audio Speech and Language Processing

... As for speech deepfake detection, existing multi-view methods only combine the latent feature of waveform and spectrogram views from separate audio models (Yang et al. 2024) or learn the dual-channel information just from the waveform view (Liu, Zhang, and Gao 2024). They cannot jointly learn speech representation from the waveform and spectrogram views. ...

Multi-space channel representation learning for mono-to-binaural conversion based audio deepfake detection
  • Citing Article
  • January 2024

Information Fusion

... Unlike text-to-speech task [7,8,9,10,11,12,13], the traditional CSS task mainly focuses on CD context modeling, as shown in Fig. 1(a), which can be summarized into three groups: 1) Multi-scale context modeling, [1] introduces a GRU-based coarse-grained context encoder that extracts semantic information from sentence-level historical dialogues. [14] further considers simultaneously learning coarse-grained and fine-grained contextual dependencies from text. ...

Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training
  • Citing Article
  • January 2024

IEEE/ACM Transactions on Audio Speech and Language Processing