Xudong Guo’s research while affiliated with Northwest A & F University and other places


Publications (19)


Digerati – A multipath parallel hybrid deep learning framework for the identification of mycobacterial PE/PPE proteins
  • Article

June 2023 · 46 Reads · 12 Citations

Computers in Biology and Medicine

Xudong Guo · [...]

The genome of Mycobacterium tuberculosis contains a relatively high percentage (10%) of genes that are poorly characterised because of their highly repetitive nature and high GC content. Some of these genes encode proteins of the PE/PPE family, which are thought to be involved in host-pathogen interactions, virulence, and disease pathogenicity. Members of this family are genetically divergent and challenging to both identify and classify using conventional computational tools. Thus, advanced in silico methods are needed to efficiently identify proteins of this family for subsequent functional annotation. In this study, we developed the first deep learning-based approach, termed Digerati, for the rapid and accurate identification of PE and PPE family proteins. Digerati was built upon a multipath parallel hybrid deep learning framework, which combines multi-layer convolutional neural networks with bidirectional long short-term memory networks and a self-attention module to effectively learn the higher-order feature representations of PE/PPE proteins. Empirical studies demonstrated that Digerati achieved a significantly better performance (~18-20%) than alignment-based approaches, including BLASTP, PHMMER, and HHsuite, in both prediction accuracy and speed. Digerati is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE/PPE family members. The webserver and source codes of Digerati are publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/Digerati/.


TIMER is a Siamese neural network-based framework for identifying both general and species-specific bacterial promoters

May 2023 · 89 Reads · 5 Citations

Briefings in Bioinformatics

Background: Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for bacteria to synthesize gene-encoded products, grow, and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed specifically for a particular species. To date, only a few predictors are available for identifying general bacterial promoters, and their predictive performance is limited. Results: In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/.
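
As a rough illustration of the Siamese idea named in the abstract above, the sketch below passes two one-hot encoded DNA sequences through the same shared-weight embedding function and scores the pair with a contrastive loss. The embedding (a fixed linear projection), the toy weights, the margin value, and all function names are assumptions made for illustration only; TIMER's actual architecture, attention layers, and loss function are not described on this page.

```python
import math

ALPHABET = "ACGT"

def one_hot(seq):
    """Flatten a DNA sequence into a one-hot vector (4 values per base)."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    return vec

def embed(vec, weights):
    """Shared encoder: the SAME weights are applied to both inputs of a pair."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(distance, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs past the margin."""
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Hypothetical toy weights: 2 output dimensions for length-4 sequences (16 inputs).
WEIGHTS = [
    [0.1 * ((i % 4) + 1) for i in range(16)],
    [0.1 * ((i % 3) + 1) for i in range(16)],
]

def pair_loss(seq_a, seq_b, same_class, margin=1.0):
    """Embed both sequences with the shared weights and score the pair."""
    za = embed(one_hot(seq_a), WEIGHTS)
    zb = embed(one_hot(seq_b), WEIGHTS)
    return contrastive_loss(euclidean(za, zb), same_class, margin)
```

In a trained model the weights would be learned by minimising this loss over many promoter and non-promoter pairs; here they are fixed so the example stays self-contained.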


Figure 1. The overall framework of ATTIC.
Figure 2. The analysis of RNA sequences, encoding methods and ML algorithms. Panels (A), (B) and (C) present the sequence logos of RNAs for H. sapiens, M. musculus and D. melanogaster, respectively, with positive samples up and negative ones down. Panels (D), (E) and (F) demonstrate 10-fold CV test results of the ET classifier trained with different encoding schemes on H_3000, M_831 and D_125 datasets, respectively. Panels (G), (H) and (I) show 10-fold CV test results of six base classifiers on H_3000, M_831 and D_125 datasets, respectively.
Figure 3. Performance comparison of different combinations of classifiers and encodings. Panels (A)-(C) show the performance comparison results of five ensemble strategies for the three species; panels (D)-(F) provide the corresponding comparisons of diverse encoding combinations.
Figure 4. The performance evaluation results. Panels (A), (B) and (C) provide the comparison results of two different feature selection strategies in terms of ACC, MCC, Recall, Precision, F1 and AUC for H. sapiens, M. musculus and D. melanogaster. (D) Comparison with existing tools on the independent test datasets.
Figure 5. Top 20 features of ATTIC ranked by the SHAP algorithm for (A) H. sapiens, (B) M. musculus and (C) D. melanogaster. Each row represents a feature, and each column represents a sample. The colour represents the range of SHAP values, where the redder colour indicates that the SHAP value is greater than zero, and the bluer colour indicates that the SHAP value is below zero. Besides, the SHAP value greater than 0 indicates that the prediction tends to be a positive sample (i.e. A-to-I RNA editing site). In contrast, a SHAP value of less than 0 indicates that the prediction is likely to be a negative sample (i.e. non-A-to-I RNA editing site).


ATTIC is an integrated approach for predicting A-to-I RNA editing sites in three species
  • Article
  • Full-text available

April 2023 · 155 Reads · 17 Citations

Briefings in Bioinformatics

A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e., transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species including H. sapiens, M. musculus, and D. melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a webserver for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilised as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterise their roles in post-transcriptional regulation.


Predicting Pseudouridine Sites with Porpoise

February 2023 · 43 Reads

Methods in molecular biology (Clifton, N.J.)

Pseudouridine is a ubiquitous RNA modification and plays a crucial role in many biological processes. However, it remains a challenging task to identify pseudouridine sites using expensive and time-consuming experimental research. To this end, we present Porpoise, a computational approach to identify pseudouridine sites from RNA sequence data. Porpoise builds on a stacking ensemble learning framework with several informative features and achieves competitive performance compared with state-of-the-art approaches. This protocol elaborates on step-by-step use and execution of the local stand-alone version and the webserver of Porpoise. In addition, we also provide a general machine learning framework that can help identify the optimal stacking ensemble learning model using different combinations of features. This general machine learning framework can help users build their own pseudouridine predictors using their in-house datasets.


COPPER: an ensemble deep-learning approach for identifying exclusive virus-derived small interfering RNAs in plants

November 2022 · 73 Reads · 4 Citations

Briefings in Functional Genomics

Antiviral defense is one of the significant roles of RNAi in plants. It has been reported that the host RNAi machinery can target viral RNAs for destruction because virus-derived small interfering RNAs (vsiRNAs) are found in infected host cells. Therefore, the recognition of plant vsiRNAs is the key to understanding the functional mechanisms of vsiRNAs and developing antiviral plants. In this work, we introduce a deep learning-based stacking ensemble approach, named COPPER, for plant vsiRNA prediction. COPPER used word2vec and fastText to generate sequence features and a hybrid deep learning framework, including a convolutional neural network, multiscale residual network, and bidirectional long short-term memory network with a self-attention mechanism to enable precise predictions of plant vsiRNAs. Extensive benchmarking experiments with different sequence homology thresholds and ablation studies illustrated the comparative predictive performance of COPPER. In addition, the performance comparison with PVsiRNAPred conducted on an independent test dataset showed that COPPER significantly improved the predictive performance for plant vsiRNAs compared with other state-of-the-art methods. The datasets and source codes are publicly available at https://github.com/yuanyuanbu/COPPER.


Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations

November 2022 · 117 Reads · 22 Citations

Briefings in Bioinformatics

Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can achieve accuracies of 81.47%, 91.29%, 79.77%, 92.10%, 89.15%, 83.74%, 80.74%, 79.23%, and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus, and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.
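
Because the abstract above reports one accuracy per location, a multi-label prediction can be scored one label at a time. The following is a minimal sketch of that per-label scoring, assuming each mRNA carries a set of true locations and a set of predicted locations. The toy data and function names are hypothetical, and the code does not reproduce Clarion's weighted series method, which is not detailed on this page.

```python
LOCATIONS = ["chromatin", "cytoplasm", "cytosol", "exosome", "membrane",
             "nucleolus", "nucleoplasm", "nucleus", "ribosome"]

def per_label_accuracy(y_true, y_pred, labels=LOCATIONS):
    """Accuracy of each label treated as its own binary problem.

    y_true and y_pred are lists of sets; each set holds the locations
    assigned to one mRNA.
    """
    n = len(y_true)
    scores = {}
    for label in labels:
        # A sample counts as correct when presence/absence of the label agrees.
        correct = sum((label in t) == (label in p) for t, p in zip(y_true, y_pred))
        scores[label] = correct / n
    return scores

# Hypothetical toy data: three mRNAs, each with one or more locations.
truth = [{"nucleus", "cytosol"}, {"ribosome"}, {"nucleus"}]
preds = [{"nucleus"}, {"ribosome"}, {"nucleus", "exosome"}]
scores = per_label_accuracy(truth, preds)

# Unweighted mean of the nine per-location accuracies quoted in the abstract.
REPORTED = [81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23, 84.74]
macro_accuracy = sum(REPORTED) / len(REPORTED)
```

The unweighted mean of the nine quoted accuracies works out to about 84.69%; it is simple arithmetic on the figures above, not a number reported by the authors.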


Computational analysis and prediction of PE_PGRS proteins using machine learning

January 2022 · 220 Reads · 23 Citations

Computational and Structural Biotechnology Journal

Approximately 10% of the Mycobacterium tuberculosis genome comprises two families of genes that are poorly characterised because of their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Because of their high genetic variability and the complexity of their analysis, these genes are typically excluded from further research in genomic studies. There are currently limited online resources and homology-based computational tools that can identify and analyse PE_PGRS proteins. In addition, these tools are computationally intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable for facilitating the functional elucidation of the PE_PGRS family proteins. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than alignment-based approaches, BLASTP and PHMMER, in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.


Positive-unlabeled learning in bioinformatics and computational biology: A brief review

October 2021 · 277 Reads · 54 Citations

Briefings in Bioinformatics

Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications that address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
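
To make the PU setting described above concrete, the sketch below shows one classic strategy, the Elkan and Noto (2008) correction. It assumes labeled positives are selected completely at random from all positives. A classifier is first trained to separate labeled from unlabeled samples, its mean score on held-out labeled positives estimates the label frequency c, and dividing any score by c gives an estimate of the probability of being truly positive. The scores and function names here are hypothetical, and this single technique is not meant to represent the range of methods covered by the review.

```python
def estimate_label_frequency(scores_on_labeled_positives):
    """Estimate c = p(labeled | positive) as the mean classifier score
    on held-out samples that are known (labeled) positives."""
    return sum(scores_on_labeled_positives) / len(scores_on_labeled_positives)

def positive_probability(score, c):
    """Convert p(labeled | x) into p(positive | x) = score / c, capped at 1."""
    return min(1.0, score / c)

# Hypothetical scores from any probabilistic classifier trained to
# separate labeled samples (s = 1) from unlabeled samples (s = 0).
held_out_positive_scores = [0.5, 0.25, 0.75]
c = estimate_label_frequency(held_out_positive_scores)

# Rescale the scores of unlabeled samples to estimate their true class.
unlabeled_scores = [0.05, 0.25, 0.5]
corrected = [positive_probability(s, c) for s in unlabeled_scores]
```

With these toy numbers c is 0.5, so every raw score is doubled before being capped at 1.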


Porpoise: a new approach for accurate prediction of RNA pseudouridine sites

June 2021 · 222 Reads · 53 Citations

Briefings in Bioinformatics

Pseudouridine is a ubiquitous RNA modification type present in eukaryotes and prokaryotes, which plays a vital role in various biological processes. Almost all kinds of RNAs are subject to this modification. However, it remains a great challenge to identify pseudouridine sites via experimental approaches, which are expensive and time-consuming. Therefore, computational approaches that can be used to perform accurate in silico identification of pseudouridine sites from a large amount of RNA sequence data are highly desirable and can aid in the functional elucidation of this critical modification. Here, we propose a new computational approach, termed Porpoise, to accurately identify pseudouridine sites from RNA sequence data. Porpoise builds upon a comprehensive evaluation of 18 frequently used feature encoding schemes based on the selection of four types of features, including binary features, pseudo k-tuple composition (PseKNC), nucleotide chemical property (NCP), and position-specific trinucleotide propensity based on single-strand (PSTNPss). The selected features are fed into the stacked ensemble learning framework to enable the construction of an effective stacked model. Both cross-validation tests on the benchmark dataset and independent tests show that Porpoise achieves better predictive performance than several state-of-the-art approaches. The application of model interpretation tools demonstrates the importance of PSTNPs for the performance of the trained models. This new method is anticipated to facilitate community-wide efforts to identify putative pseudouridine sites and formulate novel testable biological hypotheses.
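
Two of the feature types named above, binary features and nucleotide chemical property (NCP), are simple per-nucleotide lookups. The sketch below shows how an RNA window could be turned into such a feature vector. The one-hot column order, the NCP triples as commonly tabulated in the literature, and the function names are assumptions for illustration; this is not Porpoise's own code, and PseKNC and PSTNPss are not covered.

```python
# One-hot ("binary") code; the column order A, C, G, U is an arbitrary choice.
BINARY = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}

# NCP triple: (ring structure: purine = 1, functional group: amino = 1,
# hydrogen bonding: weak = 1), as commonly tabulated.
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}

def encode(seq, table):
    """Concatenate the per-nucleotide codes from `table` along the sequence.

    DNA-style T is read as U; any other unknown symbol raises KeyError.
    """
    features = []
    for base in seq.upper().replace("T", "U"):
        features.extend(table[base])
    return features

def encode_site(seq):
    """Binary features followed by NCP features for one sequence window."""
    return encode(seq, BINARY) + encode(seq, NCP)
```

A window of length n yields 4n binary values plus 3n NCP values, so the vector length grows linearly with the window size chosen around the candidate site.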


Citations (13)


... In the realm of scRNA-seq data analysis, deep neural networks have been usefully explored. Prior studies [6][7][8][9][10][11][12][13] have employed autoencoders, a type of neural network, to effectively learn condensed representations of scRNA-seq data in a lower-dimensional space. Representative examples include scDeepCluster, DESC, and scDCC [6,7,10], which are specifically designed to provide clustering assignments with scRNA-seq data. ...

Reference:

Graph Contrastive Learning as a Versatile Foundation for Advanced scRNA-seq Data Analysis
scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks

Briefings in Bioinformatics

... Although these techniques have made some progress in elucidating RNA localization, their high costs and time consumption have prompted researchers to seek more efficient computational methods. In recent years, increasing interest has been in developing machine learning-based "in silico" approaches to effectively address these challenges [18][19][20][21][22][23][24][25][26][27][28][29][30]. ...

Advancing mRNA subcellular localization prediction with graph neural network and RNA structure

Bioinformatics

... Besides, we set the number of epochs to 200 and batch size to 16 to train the EDIFIER model. We set up validation sets and dropout to prevent the model from overfitting [45], [47], [48], [49], [50]. The EDIFIER model was trained on a Linux server with a 32-core CPU, 128 GB RAM, and two NVIDIA GeForce RTX 3090 Ti GPUs, each with 24 GB of dedicated memory. ...

Allocator is a graph neural network-based framework for mRNA subcellular localization prediction

... Hence, this identified novel N-terminal sequence is from proteolysis. We found 27 novel N-termini of SEPs might be produced by proteolysis, according to predictions made by an online tool ProsperousPlus (53). For example, the identified N-terminal peptide IFYNNPKLETAQMFMNR of IP_2307589 may come from the protease cleavage by ADAMTS4. ...

ProsperousPlus: a one-stop and comprehensive platform for accurate protease-specific substrate cleavage prediction and machine-learning model construction

Briefings in Bioinformatics

... Multiple studies have shown that AMPs can be discovered with the aid of machine learning techniques. Machine learning can also predict their antimicrobial efficacy, absorption, distribution, metabolism, excretion and toxicity; for example, iAMPCN is an open-source CNN-based algorithm that can identify new AMPs and predict their function and properties [66][67][68][69][70][71][72][73][74][75][76][77][78][79][80]. Antifungal peptides (AFPs) are a type of AMP with antifungal properties. ...

iAMPCN: a deep-learning approach for identifying antimicrobial peptides and their functional activities

Briefings in Bioinformatics

... This becomes a limitation in scenarios where protein sequences lack extensive homologous families. However, deep learning can auto-extract features without labor-intensive effort and improve model generalization [22], [23], [24], [25]. ...

Digerati – A multipath parallel hybrid deep learning framework for the identification of mycobacterial PE/PPE proteins
  • Citing Article
  • June 2023

Computers in Biology and Medicine

... This becomes a limitation in scenarios where protein sequences lack extensive homologous families. However, deep learning can auto-extract features without labor-intensive effort and improve model generalization [22], [23], [24], [25]. ...

TIMER is a Siamese neural network-based framework for identifying both general and species-specific bacterial promoters
  • Citing Article
  • May 2023

Briefings in Bioinformatics

... Techniques such as recurrent neural networks (RNNs) and Transformers, originally developed for natural language processing, are now applied to healthcare research, facilitating the analysis of RNA sequences [6]. These techniques, adapted from natural language processing, enable more precise interpretations of RNA sequences and their functional impacts, facilitating efforts to detect modifications efficiently [1,[6][7][8][9][10][11][12]. ...

ATTIC is an integrated approach for predicting A-to-I RNA editing sites in three species

Briefings in Bioinformatics

... As an ensemble learning framework, the 2 base models of EnsemPPIS (TransformerPPIS and GatCNNPPIS) were separately trained using the same training procedure. To optimize EnsemPPIS, we selected the optimal combinations of base models [98]. After the completion of model training, the 2 saved models were loaded for individual prediction of PPI sites. ...

COPPER: an ensemble deep-learning approach for identifying exclusive virus-derived small interfering RNAs in plants
  • Citing Article
  • November 2022

Briefings in Functional Genomics

... Although these techniques have made some progress in elucidating RNA localization, their high costs and time consumption have prompted researchers to seek more efficient computational methods. In recent years, increasing interest has been in developing machine learning-based "in silico" approaches to effectively address these challenges [18][19][20][21][22][23][24][25][26][27][28][29][30]. ...

Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations
  • Citing Article
  • November 2022

Briefings in Bioinformatics