Xin Guo’s research while affiliated with Zhengzhou University and other places


Publications (34)


Latent space improved masked reconstruction model for human skeleton-based action recognition
  • Article
  • Full-text available

February 2025 · 4 Reads

Frontiers in Neurorobotics

Enqing Chen · Xueting Wang · Xin Guo · [...] · Dong Li

Human skeleton-based action recognition is an important task in computer vision. In recent years, the masked autoencoder (MAE) has been used in various fields due to its powerful self-supervised learning ability and has achieved good results in masked data reconstruction tasks. However, in visual classification tasks such as action recognition, the limited ability of the encoder to learn features in the autoencoder structure results in poor classification performance. We propose to enhance the encoder's feature extraction ability in classification tasks by leveraging the latent space of the variational autoencoder (VAE), and then replace it with the latent space of the vector quantized variational autoencoder (VQVAE). The resulting models are called SkeletonMVAE and SkeletonMVQVAE, respectively. In SkeletonMVAE, we constrain the latent variables to represent features in the form of distributions. In SkeletonMVQVAE, we discretize the latent variables. Both help the encoder learn deeper data structures and more discriminative and generalized feature representations. Experimental results on the NTU-60 and NTU-120 datasets demonstrate that the proposed method effectively improves the encoder's classification accuracy and its generalization ability when few labeled data are available. SkeletonMVAE exhibits stronger classification ability, while SkeletonMVQVAE generalizes better when fewer labeled data are available.
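The discretization of latent variables mentioned in this abstract can be illustrated with a nearest-codebook lookup, the core step of vector quantization in general. This is a minimal sketch under stated assumptions, not the authors' SkeletonMVQVAE: the function name `quantize`, the toy codebook, and the toy latents are all hypothetical, and a real VQ-VAE would learn the codebook during training.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (Euclidean distance).

    latents:  (N, D) array of continuous encoder outputs
    codebook: (K, D) array of code vectors (fixed here; learned in a real VQ-VAE)
    Returns the chosen indices, shape (N,), and the quantized vectors, shape (N, D).
    """
    # Squared distance between every latent and every code vector: shape (N, K)
    diff = latents[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    indices = np.argmin(dist, axis=1)
    return indices, codebook[indices]

# Hypothetical toy data: three codes and three latents in 2-D
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])
indices, quantized = quantize(latents, codebook)
```

Each continuous latent is replaced by one of a finite set of code vectors, so the representation passed downstream is discrete.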



The structure of global and local feature fusion sequence-aware vision transformer (GLF-ViT). Industrial condition data collected by sensors is preprocessed and segmented into m×n matrices, then linearly projected by sampling points before being fed into the encoder for feature extraction and fusion, enabling fault classification.
Tennessee Eastman test problem [35].
Data slicing process. Slicing the n-dimensional data using a length of m sampling points with a step size of L.
Training loss and validation accuracy curves.
t-SNE visualization of test set data before inputting into GLF-ViT.


Sequence-Aware Vision Transformer with Feature Fusion for Fault Diagnosis in Complex Industrial Processes

February 2025 · 18 Reads

Industrial fault diagnosis faces unique challenges with high-dimensional data, long time-series, and complex couplings, which are characterized by significant information entropy and intricate information dependencies inherent in datasets. Traditional image processing methods are effective for local feature extraction but often miss global temporal patterns, crucial for accurate diagnosis. While deep learning models like Vision Transformer (ViT) capture broader temporal features, they struggle with varying fault causes and time dependencies inherent in industrial data, where adding encoder layers may even hinder performance. This paper proposes a novel global and local feature fusion sequence-aware ViT (GLF-ViT), modifying feature embedding to retain sampling point correlations and preserve more local information. By fusing global features from the classification token with local features from the encoder, the algorithm significantly enhances complex fault diagnosis. Experimental analyses on data segment length, network depth, feature fusion and attention head receptive field validate the approach, demonstrating that a shallower encoder network is better suited for high-dimensional time-series fault diagnosis in complex industrial processes compared to deeper networks. The proposed method outperforms state-of-the-art algorithms on the Tennessee Eastman (TE) dataset and demonstrates excellent performance when further validated on a power transmission fault dataset.
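The data slicing step described in this paper's figure captions (windows of m sampling points taken with a step size of L over n-dimensional data) can be sketched as a sliding window. This is an illustrative sketch only: the function name `slice_windows`, the parameter name `step` (standing in for L), and the choice to drop a trailing remainder shorter than m are assumptions, not details confirmed by the source.

```python
import numpy as np

def slice_windows(data, m, step):
    """Cut a (T, n) multivariate series into windows of m sampling points.

    Consecutive windows start `step` samples apart. A trailing remainder
    shorter than m is dropped (an assumption made for this sketch).
    Returns an array of shape (num_windows, m, n).
    """
    T, n = data.shape
    if m > T:
        # Not enough samples for even one window
        return np.empty((0, m, n), dtype=data.dtype)
    starts = range(0, T - m + 1, step)
    return np.stack([data[s:s + m] for s in starts])

# Hypothetical toy series: T = 10 sampling points, n = 2 variables
series = np.arange(20).reshape(10, 2)
windows = slice_windows(series, m=4, step=3)
```

With a step smaller than m the windows overlap, which yields more training segments from the same recording at the cost of correlated samples.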


A Decoupled Few-Shot Defect Detection Approach via Vector Quantization Feature Aggregation

January 2025 · 5 Reads · 1 Citation

IEEE Transactions on Instrumentation and Measurement

In recent years, few-shot detection has become a popular research direction in industrial defect detection; it aims to detect defects accurately using a limited number of labeled samples. The dual-branch architecture, which uses class center features to aggregate query features, is one of the most commonly used approaches to few-shot detection due to its simplicity and effectiveness. Previous few-shot detection algorithms based on the dual-branch architecture typically compute the class center features by directly using the features of labeled samples or by simply averaging those features. However, class center features obtained this way depend on the sample distribution of the novel classes, which makes such methods less robust. Additionally, bounding-box regression and object classification are two coupled problems that are difficult to optimize simultaneously. Therefore, this paper proposes a decoupled few-shot detection algorithm to address these challenges. More specifically, we propose a Vector Quantization Feature Aggregation (VQFA) method, which uses vector quantization to map the features of support images into discrete vector sets, where each vector comes from a learnable codebook. This operation yields a category representation that is both robust and accurate, enabling enhanced interactions between support and query features. In addition, we propose a Decoupled Few-Shot (DeFS) module that decouples the classification and localization tasks so that the two tasks can attend to different visual regions and achieve better detection results. Experimental results on the public datasets NEU-DET and GC10-DET demonstrate that our method significantly outperforms other methods in various few-shot scenarios, which demonstrates its effectiveness.


A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

January 2025 · 17 Reads

IEEE Transactions on Multimedia

Semantic segmentation is a fundamental task in multimedia processing that can be used for analyzing, understanding, and editing the contents of images and videos, among other applications. To accelerate the analysis of multimedia data, existing segmentation research tends to extract semantic information by progressively reducing the spatial resolution of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets (Cityscapes, Bdd100K, and ADE20K), demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.


Fig. 1. The structure of our proposed approach.
Fig. 2. Illustration of Feature Refinement Module.
Comparison on Bdd100K val set. The mIoU and GFLOPs are calculated using single-scale inference. The image size used to calculate GFLOPs is 1280 × 720.
A feature refinement module for light-weight semantic segmentation network

December 2024 · 24 Reads

Low computational complexity and high segmentation accuracy are both essential to real-world semantic segmentation tasks. However, to speed up model inference, most existing approaches design light-weight networks with a very limited number of parameters, which considerably degrades accuracy because the networks' representation ability decreases. To solve this problem, this paper proposes a novel semantic segmentation method that improves the light-weight network's capacity to obtain semantic information. Specifically, a feature refinement module (FRM) is proposed to extract semantics from the multi-stage feature maps generated by the backbone and to capture non-local contextual information using a transformer block. Experimental results on the Cityscapes and Bdd100K datasets demonstrate that the proposed method achieves a promising trade-off between accuracy and computational cost, especially on the Cityscapes test set, where 80.4% mIoU is achieved with only 214.82 GFLOPs.


Fig. 3. Comparison of the average pooling strategy and attention mechanism. (a) Average pooling strategy captures the contexts by taking the average of all pixels within the pooling region, overlooking the fact that different pixels may make unequal contributions. The pooling region for global average pooling encompasses the entire feature map. (b) Attention mechanism adaptively captures global contexts by calculating the response at a position through a weighted aggregation of features from all positions.
Fig. 8. Qualitative results. Visual results of baseline and the proposed method on Cityscapes val set. The image from left to right is: input image, ground truth, prediction results from the baseline, prediction results from the proposed method. The improved areas are indicated by red dashed boxes.
Fig. 9. Visualization examples of the feature map G 1 . Left: the bilinear upsampling method (Bilinear); Right: the designed Semantic Refinement Module (SRM).
A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

December 2024 · 51 Reads

Semantic segmentation is a fundamental task in multimedia processing that can be used for analyzing, understanding, and editing the contents of images and videos, among other applications. To accelerate the analysis of multimedia data, existing segmentation research tends to extract semantic information by progressively reducing the spatial resolution of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high-level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high-resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets (Cityscapes, Bdd100K, and ADE20K), demonstrating superior performance compared to state-of-the-art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.
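The contrast drawn in the Fig. 3 caption for this paper, between average pooling (every pixel contributes equally) and attention (each position's response is a weighted aggregation over all positions), can be sketched as follows. This is a simplified illustration, not the paper's Contextual Refinement Module: the function names are hypothetical, the attention weights use plain scaled dot-product similarity with no learned projections, and the feature map is flattened to P positions by C channels.

```python
import numpy as np

def global_average_context(features):
    """Uniform context: every position contributes equally. features: (P, C)."""
    return features.mean(axis=0)

def attention_context(features):
    """Per-position context as a softmax-weighted sum over all positions.

    Weights come from scaled dot-product similarity of the features with
    themselves (no learned projections; a simplification for illustration).
    Returns the context, shape (P, C), and the weights, shape (P, P).
    """
    scores = features @ features.T / np.sqrt(features.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features, weights

# Hypothetical toy feature map: P = 3 positions, C = 2 channels
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pooled = global_average_context(features)
context, weights = attention_context(features)
```

Global average pooling returns one vector shared by all positions, while the attention variant returns a separate context vector per position with non-uniform weights.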


Extended multi-stream temporal-attention module for skeleton-based human action recognition (HAR)

November 2024 · 49 Reads

Graph convolutional networks (GCNs) are an effective skeleton-based human action recognition (HAR) technique. GCNs enable the specification of CNNs to a non-Euclidean frame that is more flexible. The previous GCN-based models still have a lot of issues: (I) The graph structure is the same for all model layers and input data.


Two-Stage Channel Estimation in mmWave MIMO Systems with RIS Blockage

October 2024 · 7 Reads · 1 Citation

IEEE Wireless Communications Letters

Accurate channel state information is crucial for directional beamforming in reconfigurable intelligent surface (RIS) assisted millimeter wave (mmWave) multiple-input multiple-output (MIMO) systems. In realistic scenarios, the components of RIS are susceptible to coverage by small particles, leading to blockage of certain elements, which increases the difficulty of channel estimation. In such cases, we propose a two-stage channel training scheme based on Kronecker decomposition to achieve joint RIS blockage and channel estimation. Specifically, we utilize the regularization parameter to recover the sparse blockage coefficients effectively. Additionally, we introduce a denoising algorithm to accelerate the convergence rate of sparse recovery and enhance estimation accuracy. Simulation results show that our proposed method outperforms the existing algorithms regarding RIS blockage and channel estimation accuracy.



Citations (16)


... Ethereum is an open-source blockchain-based platform that establishes a distributed peer-to-peer network for the secure execution and verification of intelligent contract code [8,9]. ...

Reference:

The Impact of Blockchain Technology for Evaluation of Electronic Medical Certification
Extended Multi-stream Temporal-attention Module for Skeleton-based Human Action Recognition (HAR)
  • Citing Article
  • October 2024

Computers in Human Behavior

... Both architectures face common challenges, particularly regarding channel estimation with large-scale surfaces and hardware constraints in millimeter wave communications and terahertz environments. Recent works on RIS-assisted systems, such as [9,10], have proposed parametric and two-stage estimation techniques to address these issues. The neural-network-based approach we propose for LISs could also inspire future research on RIS-related receiver strategies. ...

Two-Stage Channel Estimation in mmWave MIMO Systems with RIS Blockage
  • Citing Article
  • October 2024

IEEE Wireless Communications Letters

... Both architectures face common challenges, particularly regarding channel estimation with large-scale surfaces and hardware constraints in millimeter wave communications and terahertz environments. Recent works on RIS-assisted systems, such as [9,10], have proposed parametric and two-stage estimation techniques to address these issues. The neural-network-based approach we propose for LISs could also inspire future research on RIS-related receiver strategies. ...

Parametric channel estimation for RIS-assisted mmWave MIMO-OFDM systems with low pilot overhead
  • Citing Article
  • July 2024

Signal Processing

... Accuracy (5-way-5-shot): SNAIL [96] 68.9%; TPN [97] 69.4%; BaseTransformers [98] 73.4%; EGNN [99] 76.4%; Shot-Free [100] 77.6%; Meta-Transfer [101] 75.5%; Dense [102] 79.0%; MetaOptnet [103] 78.6%; Constellation [104] 80.0%; P-transfer [105] 80.1%; DeepEMD [106] 82.4%; BaseTransformers [98] 82.4%; MGGN [107] 83.3%; DMC-CNN (2-view) [51] 84.1% ...

Edge-labeling based modified gated graph network for few-shot learning
  • Citing Article
  • June 2024

Pattern Recognition

... Acknowledging the imbalance between the spatial and channel dimensions, this paper aggregates semantic maps from all four stages of the backbone before extracting context information. In comparison to our previous conference paper [27], this paper proposes a brand-new semantic refinement module to leverage the contribution of neighbors' offsets to offset map learning. Moreover, while the feature refinement module in the conference paper only explores the spatial dependencies between pixels, this paper proposes to capture global context information across both spatial and channel dimensions and builds a new contextual refinement module. ...

A Feature Refinement Module for Light-Weight Semantic Segmentation Network
  • Citing Conference Paper
  • October 2023

... Currently, there are some studies in this field. Xu et al. (2023) proposed a multichannel and multi-scale separable dilated convolution neural network with attention mechanism, achieving an accuracy of 98.4% on their self-built dataset. He et al. (2023) proposed an end-to-end cross-modal enhancement network that extracts multimodal information to grade tobacco leaves, achieving a final grading accuracy of 80.15%. ...

Multi-channel and multi-scale separable dilated convolutional neural network with attention mechanism for flue-cured tobacco classification

Neural Computing and Applications

... On the other hand, research has been developed that generates complex graphs with information that represents vehicle paths or sensor networks, such as the work of Ma et al. (2021) [14]. Li et al. (2023) [15] investigated a massive MIMO uplink system, where a transmitter with two antennas has to upload data in real time to a BS with a more significant number of antennas. Techniques and algorithms are required for high-speed and accurate graph processing in their different forms. ...

Noncoherent Space-Time Coding for Correlated Massive MIMO Channel with Riemannian Distance

Digital Signal Processing

... Previous studies [13][14][15] have confirmed that second-order information about the skeleton plays a complementary role in action recognition. The second-order information about the skeleton, which could also be called the skeletal modality, included joint coordinates, bone vector, joint coordinate motion, and bone vector motion. ...

Extended Multi-Stream Adaptive Graph Convolutional Networks (EMS-AAGCN) for Skeleton-Based Human Action Recognition

... In [22], the authors designed low-complexity multi-level constellations based on Kullback-Leibler divergence for noncoherent SIMO systems, which enhanced error performance. Finally, in [23], an optimal constellation to improve symbol detection reliability for short data packet transmission over noncoherent massive SIMO Rayleigh fading channels was designed. ...

Constellation Design for Noncoherent Massive SIMO Systems in URLLC Applications

IEEE Transactions on Communications

... These works mainly focused on unitary space-time code designs since unitary constellations are optimal when the signal-to-noise ratio (SNR) is high or the number of coherent time slots is large. With the advent of massive MIMO technology, some works have initially reconsidered the noncoherent constellation design criteria [12,13,14,15,16,17,18]. In particular, the favorable propagation condition of the large number of antennas is widely used in signal optimization criteria and constellation design [12,13,14,15,16,17,18]. ...

Cooperative PSK constellation design and power allocation for massive MIMO uplink communications
  • Citing Article
  • January 2021

Digital Signal Processing