Xudong Guo’s research while affiliated with Northwest A & F University and other places


Publications (19)


MetalPrognosis: A Biological Language Model-Based Approach for Disease-Associated Mutations in Metal-Binding Site Prediction
Article · September 2024 · 12 Reads · IEEE/ACM Transactions on Computational Biology and Bioinformatics

Protein-metal ion interactions play a central role in the onset of numerous diseases. When amino acid changes lead to missense mutations in metal-binding sites, the disrupted interaction with metal ions can compromise protein function, potentially causing severe human ailments. Identifying these disease-associated mutation sites within metal-binding regions is paramount for understanding protein function and fostering innovative drug development. While some computational methods aim to tackle this challenge, they often fall short in accuracy, commonly due to manual feature extraction and the absence of structural data. We introduce MetalPrognosis, an innovative, alignment-free solution that predicts disease-associated mutations within metal-binding sites of metalloproteins with heightened precision. Rather than relying on manual feature extraction, MetalPrognosis employs sliding window sequences as input, extracting deep semantic insights from pre-trained protein language models. These insights are then incorporated into a convolutional neural network, facilitating the derivation of intricate features. Comparative evaluations show MetalPrognosis outperforms leading methodologies like MCCNN and M-Ionic across various metalloprotein test sets. Furthermore, an ablation study reiterates the effectiveness of our model architecture. To facilitate public use, we have made the datasets, source code, and trained models for MetalPrognosis available online at http://metalprognosis.unimelb-biotools.cloud.edu.au/.
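The abstract says that sliding window sequences are the model input but gives no window size or padding scheme. The following is a minimal sketch of that step only, assuming a hypothetical half-width of 7 residues and an "X" padding character at the termini; the function name `sliding_window` and both defaults are illustrative and are not taken from MetalPrognosis.

```python
def sliding_window(sequence: str, site: int, half_width: int = 7, pad: str = "X") -> str:
    """Return the residues centred on a 0-based site, padded at the termini.

    half_width and pad are assumptions for illustration; the abstract does not
    state the values used by MetalPrognosis.
    """
    if not 0 <= site < len(sequence):
        raise IndexError("site is outside the sequence")
    left = max(0, site - half_width)
    right = min(len(sequence), site + half_width + 1)
    window = sequence[left:right]
    # Pad both ends so every window has length 2 * half_width + 1.
    left_pad = pad * (half_width - (site - left))
    right_pad = pad * (half_width - (right - 1 - site))
    return left_pad + window + right_pad


if __name__ == "__main__":
    # Window around the first residue of a short toy sequence.
    print(sliding_window("MKTAYIAKQR", 0, half_width=2))
```

In a pipeline of the kind the abstract describes, each fixed-length window would then be embedded by a pre-trained protein language model and passed to a convolutional network; those stages are not sketched here.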


Figure 1. Network architecture of scDFN. (a) The data preprocessing module. (b) ATIF.
Figure 5. Visualization of the results of the Goolam marker genes. (a) A violin plot and (b) a bubble chart; the abscissa represents marker genes, and the ordinate represents cell clusters. The color depth of the violin plot reflects the degree of expression of the marker genes, while the shape of the violin plot shows the distribution of the expression of the marker genes in different cell types. The distribution of expression of the marker genes was different in different types of cells. The marker genes for each cluster were prominent.
A summary of deep learning-based single-cell clustering methods
scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks
Article · Full-text available · September 2024 · 59 Reads · 1 Citation · Briefings in Bioinformatics

Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by better Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. Through these advancements, scDFN set a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics.
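The two metrics named in the abstract, NMI and ARI, both compare a predicted clustering against reference cell-type labels. As a rough illustration of what they measure, here is a standard-library sketch; it is not the scDFN evaluation code, and the choice of arithmetic-mean normalization for NMI is an assumption, since the abstract does not say which variant was used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are a single cluster or all singletons
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(true_labels, pred_labels))
    # Mutual information between the two labelings, in nats.
    mutual_info = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint_counts.items()
    )
    entropy_true = -sum((c / n) * log(c / n) for c in true_counts.values())
    entropy_pred = -sum((c / n) * log(c / n) for c in pred_counts.values())
    mean_entropy = (entropy_true + entropy_pred) / 2
    if mean_entropy == 0:
        return 1.0  # both labelings put every cell in one cluster
    return mutual_info / mean_entropy


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    relabelled = [1, 1, 0, 0]  # same partition, different cluster names
    print(adjusted_rand_index(reference, relabelled))
    print(normalized_mutual_information(reference, relabelled))
```

Both scores depend only on how cells are grouped, not on the cluster names, which is why they suit clustering output whose labels are arbitrary.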


Advancing mRNA subcellular localization prediction with graph neural network and RNA structure

August 2024 · 35 Reads · 2 Citations · Bioinformatics

Motivation: The asymmetrical distribution of expressed mRNAs tightly controls the precise synthesis of proteins within human cells. This non-uniform distribution, a cornerstone of developmental biology, plays a pivotal role in numerous cellular processes. To advance our comprehension of gene regulatory networks, it is essential to develop computational tools for accurately identifying the subcellular localizations of mRNAs. However, considering multi-localization phenomena remains limited in existing approaches, with none considering the influence of RNA’s secondary structure. Results: In this study, we propose Allocator, a multi-view parallel deep learning framework that seamlessly integrates the RNA sequence-level and structure-level information, enhancing the prediction of mRNA multi-localization. The Allocator models equip four efficient feature extractors, each designed to handle different inputs. Two are tailored for sequence-based inputs, incorporating multilayer perceptron and multi-head self-attention mechanisms. The other two are specialized in processing structure-based inputs, employing graph neural networks. Benchmarking results underscore Allocator’s superiority over state-of-the-art methods, showcasing its strength in revealing intricate localization associations. Availability and Implementation: The webserver of Allocator is available at http://Allocator.unimelb-biotools.cloud.edu.au; the source code and datasets are available on GitHub (https://github.com/lifuyi774/Allocator) and Zenodo (https://doi.org/10.5281/zenodo.13235798). Supplementary Information: Available at Bioinformatics online.


Figure 2. (a and b) A comparison of the average value of NMI and ARI on the 32 real scRNA-seq datasets among six clustering methods. (c-f) A comparison of the mean value of NMI on various data platforms among six clustering methods. (g and h) Specific numerical values for each method on each dataset.
scDFN: Enhancing single-cell RNA-seq Clustering with Deep Fusion Networks

May 2024 · 114 Reads

Single-cell RNA sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. Although clustering plays a key role in the subsequent analysis of single-cell transcriptomics, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. In this study, we introduced a novel deep learning-based algorithm for single-cell clustering, designated scFCN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scFCN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scFCN, as determined by better Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) metrics. Additionally, scFCN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. 
Through these advancements, scFCN set a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics. The source code of scFCN is publicly available at https://github.com/11051911/scDFN.


Fig. 1. The overview of MERITS, including the collection of primary sequence and tertiary structure data for PE/PPE family proteins, sequence-structure-function annotation analysis of PE/PPE family proteins and database access.
Fig. 2. Two different methods for accessing data in MERITS. (A) An example showing how to use the advanced search function of MERITS. Here, three different search schemas were combined: Organism (Mycobacterium tuberculosis), Signal Peptides (YES), Transmembrane Helices (0); (B) An example illustrating how to explore the protein data through the Browse webpage (Mycobacterium tuberculosis and ID WP_116268447.1 are selected); (C) Typical search result webpage using the ID WP_116268447.1 as an example. The entry webpage consists of ten significant categories of information, comprising summary information and 3D structural visualization, physicochemical properties and hydrophobicity, structural features, histidine phosphorylation sites, subcellular location, GO annotations, human homology, antigenicity, toxicity, allergenicity, linear B cell epitope, MHC-I and MHC-II binding.
Fig. 3. Statistical analysis results of PE/PPE proteins. (A) Distribution of all PE/PPE proteins according to their protein sequence lengths; (B) Frequency distributions of the 20 amino acids in all accumulated PE/PPE proteins; (C) Distribution of the 20 most frequent organisms; (D) Distribution of the 5 most frequent definitions.
Results of tertiary structure data collection
MERITS: a web-based integrated mycobacterial PE/PPE protein database

March 2024 · 72 Reads · 1 Citation · Bioinformatics Advances

Motivation: PE/PPE proteins, highly abundant in the Mycobacterium genome, play a vital role in virulence and immune modulation. Understanding their functions is key to comprehending the internal mechanisms of Mycobacterium. However, a lack of dedicated resources has limited research into PE/PPE proteins. Results: Addressing this gap, we introduce MERITS, a comprehensive 3D structure database specifically designed for PE/PPE proteins. MERITS hosts 22,353 non-redundant PE/PPE proteins, encompassing details like physicochemical properties, subcellular localisation, post-translational modification sites, protein functions, and measures of antigenicity, toxicity, and allergenicity. MERITS also includes data on their secondary and tertiary structure, along with other relevant biological information. MERITS is designed to be user-friendly, offering interactive search and data browsing features to aid researchers in exploring the potential functions of PE/PPE proteins. MERITS is expected to become a crucial resource in the field, aiding in developing new diagnostics and vaccines by elucidating the sequence-structure-functional relationships of PE/PPE proteins. Availability and implementation: MERITS is freely accessible at http://merits.unimelb-biotools.cloud.edu.au/.


Fig. 1. The overview of MERITS, including the collection of primary sequence and tertiary structure data for PE/PPE family proteins, sequence-structure-function annotation analysis of PE/PPE family proteins and database access.
MERITS: a web-based integrated Mycobacterial PE/PPE protein database

December 2023 · 76 Reads

Motivation: PE/PPE proteins, highly abundant in the Mycobacterium genome, play a vital role in virulence and immune modulation. Understanding their functions is key to comprehending the internal mechanisms of Mycobacterium. However, a lack of dedicated resources has limited research into PE/PPE proteins. Results: Addressing this gap, we introduce MERITS, a comprehensive 3D structure database specifically designed for PE/PPE proteins. MERITS hosts 22,353 non-redundant PE/PPE proteins, encompassing details like physicochemical properties, subcellular localisation, post-translational modification sites, protein functions, and measures of antigenicity, toxicity, and allergenicity. MERITS also includes data on their secondary and tertiary structure, along with other relevant biological information. MERITS is designed to be user-friendly, offering interactive search and data browsing features to aid researchers in exploring the potential functions of PE/PPE proteins. MERITS is expected to become a crucial resource in the field, aiding in developing new diagnostics and vaccines by elucidating the sequence-structure-functional relationships of PE/PPE proteins. Availability and implementation: MERITS is freely accessible at http://merits.unimelb-biotools.cloud.edu.au/.


The distribution for six subcellular localizations in training, validation, and testing set
Performance comparison on the test set using different neural network types
Allocator is a graph neural network-based framework for mRNA subcellular localization prediction

December 2023 · 63 Reads · 1 Citation

Motivation: The asymmetrical distribution of expressed mRNAs tightly controls the precise synthesis of proteins within human cells. This non-uniform distribution, a cornerstone of developmental biology, plays a pivotal role in numerous cellular processes. To advance our comprehension of gene regulatory networks, it is essential to develop computational tools for accurately identifying the subcellular localizations of mRNAs. However, considering multi-localization phenomena remains limited in existing approaches, with none considering the influence of RNA’s secondary structure. Results: In this study, we propose Allocator, a multi-view parallel deep learning framework that seamlessly integrates the RNA sequence-level and structure-level information, enhancing the prediction of mRNA multi-localization. The Allocator models equip four efficient feature extractors, each designed to handle different inputs. Two are tailored for sequence-based inputs, incorporating multilayer perceptron and multi-head self-attention mechanisms. The other two are specialized in processing structure-based inputs, employing graph neural networks. Benchmarking results underscore Allocator’s superiority over state-of-the-art methods, showcasing its strength in revealing intricate localization associations. Availability: The webserver of Allocator is available at http://Allocator.unimelb-biotools.cloud.edu.au; the source code and datasets are available at https://github.com/lifuyi774/Allocator.


MetalPrognosis: a Biological Language Model-based Approach for Disease-Associated Mutations in Metal-Binding Site prediction

November 2023 · 125 Reads · 1 Citation

Protein-metal ion interactions play a central role in the onset of numerous diseases. When amino acid changes lead to missense mutations in metal-binding sites, the disrupted interaction with metal ions can compromise protein function, potentially causing severe human ailments. Identifying these disease-associated mutation sites within metal-binding regions is paramount for understanding protein function and fostering innovative drug development. While some computational methods aim to tackle this challenge, they often fall short in accuracy, commonly due to manual feature extraction and the absence of structural data. We introduce MetalPrognosis, an innovative, alignment-free solution that predicts disease-associated mutations within metal-binding sites of metalloproteins with heightened precision. Rather than relying on manual feature extraction, MetalPrognosis employs sliding window sequences as input, extracting deep semantic insights from pre-trained protein language models. These insights are then incorporated into a convolutional neural network, facilitating the derivation of intricate features. Comparative evaluations show MetalPrognosis outperforms leading methodologies like MCCNN and PolyPhen-2 across various metalloprotein test sets. Furthermore, an ablation study reiterates the effectiveness of our model architecture. To facilitate public use, we have made the datasets, source code, and trained models for MetalPrognosis available online at http://metalprognosis.unimelb-biotools.cloud.edu.au/.


ProsperousPlus: a one-stop and comprehensive platform for accurate protease-specific substrate cleavage prediction and machine-learning model construction

September 2023 · 183 Reads · 7 Citations · Briefings in Bioinformatics

Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualised, developed, tested, and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at http://prosperousplus.unimelb-biotools.cloud.edu.au/.


iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities

July 2023 · 236 Reads · 29 Citations · Briefings in Bioinformatics

Antimicrobial peptides (AMPs) are short peptides that play crucial roles in diverse biological processes and have various functional activities against target organisms. Due to the abuse of chemical antibiotics and microbial pathogens’ increasing resistance to antibiotics, AMPs have the potential to be alternatives to antibiotics. As such, the identification of AMPs has become a widely discussed topic. A variety of computational approaches have been developed to identify AMPs based on machine learning algorithms. However, most of them are not capable of predicting the functional activities of AMPs, and those predictors that can specify activities only focus on a few of them. In this study, we first surveyed ten predictors that can identify AMPs and their functional activities in terms of the features they employed and the algorithms they utilized. Then, we constructed comprehensive AMP datasets and proposed a new deep learning-based framework, iAMPCN (identification of AMPs based on CNNs), to identify AMPs and their related 22 functional activities. Our experiments demonstrate that iAMPCN significantly improved the prediction performance of AMPs and their corresponding functional activities based on four types of sequence features. Benchmarking experiments on the independent test datasets showed that iAMPCN outperformed a number of state-of-the-art approaches for predicting AMPs and their functional activities. Further, we analyzed the amino acid preferences of different AMP activities and evaluated the model on datasets of varying sequence redundancy thresholds. To facilitate the community-wide identification of AMPs and their corresponding functional types, we have made the source codes of iAMPCN publicly available at https://github.com/joy50706/iAMPCN/tree/master. We anticipate that iAMPCN can be explored as a valuable tool for identifying potential AMPs with specific functional activities for further experimental validation.


Citations (13)


... In the realm of scRNA-seq data analysis, deep neural networks have been usefully explored. Prior studies [6][7][8][9][10][11][12][13] have employed autoencoders, a type of neural network, to effectively learn condensed representations of scRNA-seq data in a lower-dimensional space. Representative examples include scDeepCluster, DESC, and scDCC [6,7,10], which are specifically designed to provide clustering assignments for scRNA-seq data. ...

Reference:

Graph Contrastive Learning as a Versatile Foundation for Advanced scRNA-seq Data Analysis
scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks

Briefings in Bioinformatics

... Moreover, Clarion [1] employed an ensemble learning strategy based on XGBoost, enhancing the accuracy and robustness of multi-label mRNA localization predictions by considering label correlations and utilizing advanced feature selection methods. Furthermore, Allocator introduced the use of graph neural networks (GNNs) to incorporate RNA secondary structure information into the prediction model [6]. ...

Advancing mRNA subcellular localization prediction with graph neural network and RNA structure

Bioinformatics

... Besides, we set the number of epochs to 200 and batch size to 16 to train the EDIFIER model. We set up validation sets and dropout to prevent the model from overfitting [45], [47], [48], [49], [50]. The EDIFIER model was trained on a Linux server with a 32-core CPU, 128 GB RAM, and two NVIDIA GeForce RTX 3090 Ti GPUs, each with 24 GB of dedicated memory. ...

Allocator is a graph neural network-based framework for mRNA subcellular localization prediction

... This becomes a limitation in scenarios where protein sequences lack extensive homologous families. However, deep learning can auto-extract features without labor-intensive effort and improve model generalization [22], [23], [24], [25]. ...

ProsperousPlus: a one-stop and comprehensive platform for accurate protease-specific substrate cleavage prediction and machine-learning model construction

Briefings in Bioinformatics

... • Drug design and molecular docking, by providing the required values for the binding affinities between the proteins and drug molecules based on amino acid properties (these affinities can be critical to designing molecules that can effectively bind or inhibit specific proteins) [24][25][26][27]. • Protein function annotation, by comparing the amino acid properties with those of known proteins, facilitating classification based on their physicochemical characteristics [28][29][30][31]. • Sequence alignment and homology modeling, by incorporating amino acid substitution matrices into alignment algorithms that reflect the physicochemical differences between amino acids (this can improve the accuracy of sequence homology models that compare proteins according to their functional or structural similarity) [32][33][34][35]. ...

iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities

Briefings in Bioinformatics

... This becomes a limitation in scenarios where protein sequences lack extensive homologous families. However, deep learning can auto-extract features without labor-intensive effort and improve model generalization [22], [23], [24], [25]. ...

Digerati – A multipath parallel hybrid deep learning framework for the identification of mycobacterial PE/PPE proteins
  • Citing Article
  • June 2023

Computers in Biology and Medicine

... This becomes a limitation in scenarios where protein sequences lack extensive homologous families. However, deep learning can auto-extract features without labor-intensive effort and improve model generalization [22], [23], [24], [25]. ...

TIMER is a Siamese neural network-based framework for identifying both general and species-specific bacterial promoters
  • Citing Article
  • May 2023

Briefings in Bioinformatics

... iMRVP [43] was developed for subtyping, visualization, and denoising of RNA-modification-related sequence motifs. Furthermore, a number of approaches have been developed for the positional or functional mining and prediction of RNA modifications [44-46], ranging from basic RNA methylation site prediction approaches such as iRNA-Methyl [47] and PPUS [48] to more recent advancements for species or condition-specific methylation site prediction, such as adaptive-m6A [49], CLSM6A [50], and AdaptRM [51] for m6A site prediction, NmRF [52] for predicting 2′-O-methylation, ATTIC [53] for A-to-I editing events, and RMDGCN [54] for disease association prediction. The processed epitranscriptome data, along with various annotations and predicted associations, have been assembled and made available by RNA-modification-related databases, such as MODOMICS [55], RMBase [56], PRMD [57], RMDisease [58], M6AREG [59], and RM2Target. ...

ATTIC is an integrated approach for predicting A-to-I RNA editing sites in three species

Briefings in Bioinformatics

... As an ensemble learning framework, the 2 base models of EnsemPPIS (TransformerPPIS and GatCNNPPIS) were separately trained using the same training procedure. To optimize EnsemPPIS, we selected the optimal combinations of base models [98]. After the completion of model training, the 2 saved models were loaded for individual prediction of PPI sites. ...

COPPER: an ensemble deep-learning approach for identifying exclusive virus-derived small interfering RNAs in plants
  • Citing Article
  • November 2022

Briefings in Functional Genomics

... Feature extraction aims to generate a set of features that can accurately represent the sequence data and facilitate subsequent machine learning analysis. In this study, we used seven methods to characterise the biological features of human genes: Nucleic acid composition (NAC) [35], K-mer [36][37][38], reverse complement K-mer (RCKmer) [39], the composition of K-spaced nucleic acid pairs (CKSNAP) [40], Z-curve [41], pseudo electron-ion interaction pseudopotentials (PseEIIP) [42], and multivariate mutual information (MMI) [43]. Nucleic acid composition, K-mer, RCKmer, CKSNAP, and Z-curve describe NAC. ...

Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations
  • Citing Article
  • November 2022

Briefings in Bioinformatics