Figure - available from: Frontiers in Genetics
Flowchart of our method: (A) Obtain the association matrix A and calculate the Gaussian interaction profile kernel similarity of lncRNAs and EFs, respectively. (B) Calculate the chemical structure similarity matrix E. (C) Obtain the lncRNA similarity information SL and construct the EF similarity matrix SE. (D) Integrate the three subnets A, SL, and SE to construct a global heterogeneous network. (E) Construct the adjacency matrix G and obtain the diffusion feature. (F) Calculate the HeteSim score. (G) Combine the diffusion feature and the HeteSim score. (H) Train the Gradient Boosting Decision Tree (GBDT) classifier.
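Step (A) of the caption names the Gaussian interaction profile (GIP) kernel. The following is a minimal sketch of how such a similarity is commonly computed from a binary association matrix, not the authors' implementation. The function name `gip_kernel`, the bandwidth parameter `gamma_prime` (set to 1 here, a frequent default), and the toy matrix are all assumptions made for illustration.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of an association matrix."""
    n = len(profiles)
    # The bandwidth is normalised by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim

# Hypothetical association matrix A: 3 lncRNAs (rows) x 4 EFs (columns).
A = [[1, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 1]]

lnc_sim = gip_kernel(A)                             # lncRNA-lncRNA similarity
ef_sim = gip_kernel([list(c) for c in zip(*A)])     # EF-EF similarity (columns of A)
```

Rows that share more associations end up closer to 1, and identical profiles score exactly 1.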

Source publication
Article
Interactions between genetic factors and environmental factors (EFs) play an important role in many diseases. The long non-coding RNA (lncRNA) is an important non-coding RNA that regulates life processes. The ability to predict the associations between lncRNAs and EFs is of importa...

Citations

... In each GBDT iteration a weak classifier is constructed, and each classifier is trained on the gradient of the classifier from the previous iteration. The weighted weak classifiers are added together to create the final classifier [9], [16]. ...
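The boosting loop described in the excerpt above can be sketched as follows. This is a simplified illustration, assuming a single numeric feature, decision stumps as weak learners, and logistic loss; leaf values are plain residual means rather than the Newton-step values that production libraries typically use. All names (`fit_gbdt`, `fit_stump`, `predict_proba`) and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Find the single-threshold split that best fits the residuals (squared error)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return thr, lm, rm

def fit_gbdt(xs, ys, n_rounds=50, lr=0.5):
    """Each round fits a stump to the negative gradient (y - p) of the current model."""
    pos = sum(ys) / len(ys)
    f0 = math.log(pos / (1.0 - pos))          # initial score: log-odds of the positive class
    scores = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        thr, lm, rm = fit_stump(xs, residuals)
        stumps.append((thr, lm, rm))
        scores = [s + lr * (lm if x <= thr else rm) for x, s in zip(xs, scores)]
    return f0, stumps, lr

def predict_proba(model, x):
    """Final classifier: the initial score plus the weighted sum of all stumps."""
    f0, stumps, lr = model
    score = f0 + sum(lr * (lm if x <= thr else rm) for thr, lm, rm in stumps)
    return sigmoid(score)

# Hypothetical 1-D training set with one noisy pair of labels.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_gbdt(xs, ys)
```

Calling `predict_proba(model, x)` returns the estimated probability of the positive class for a new value `x`.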
... The first major improvement focused on the efficiency of algorithms utilizing the HeteSim metric. HeteSim-based similarity scoring on heterogeneous information networks has been successfully applied to multiple biomedical research problems [15][16][17][18][19][20]; therefore, the implementation of a faster HeteSim scoring algorithm will have the potential for significant benefit to the biomedical research community. The main investigative line for algorithm improvements involves approximation algorithms using randomness. ...
Article
Literature-based discovery (LBD) summarizes information and generates insight from large text corpora. The SemNet framework utilizes a large heterogeneous information network or “knowledge graph” of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j to improve knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is comprehensive open-source software that provides a significantly faster, more effective, and more user-friendly means of automated biomedical LBD. An example case is performed to rank relationships between Alzheimer’s disease and metabolic co-morbidities.
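For readers unfamiliar with the HeteSim metric mentioned above, here is a minimal sketch of the exact (non-randomized) score for a metapath that splits evenly into a left and a right half. It is not SemNet 2.0's randomized algorithm. It assumes the usual formulation: walk from the source along the left half, walk from the target backwards along the right half, and take the cosine of the two reachable-probability vectors at the middle node type. The helper names and the toy matrix are hypothetical.

```python
import math

def row_normalize(m):
    """Turn each row of an adjacency matrix into transition probabilities."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def reach_prob(adjs):
    """Reachable-probability matrix along a chain of adjacency matrices."""
    prob = row_normalize(adjs[0])
    for adj in adjs[1:]:
        prob = matmul(prob, row_normalize(adj))
    return prob

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def hetesim(left_adjs, right_adjs):
    """left_adjs: source type -> middle type; right_adjs: middle type -> target type."""
    from_source = reach_prob(left_adjs)
    # Walk the right half backwards: reverse the order and transpose each matrix.
    from_target = reach_prob([transpose(a) for a in reversed(right_adjs)])
    return [[cosine(s, t) for t in from_target] for s in from_source]

# Hypothetical lncRNA x EF association matrix; metapath lncRNA -> EF -> lncRNA.
A = [[1, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 1]]
scores = hetesim([A], [transpose(A)])
```

Odd-length metapaths need an extra edge-decomposition step that this sketch leaves out.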
... The potential roles of oral squamous cell carcinoma (OSCC)-related mRNA and lncRNA were revealed by Li et al. (2021) through protein interaction network and co-expression network analysis. The model GBDTL2E was proposed by Wang et al. (2020) to predict the association between lncRNA and environmental factors. With the deepening of research, work on the prediction of lncRNA-disease association is mainly divided into the following categories: 1) Machine learning methods, whose main idea is to prioritize candidate lncRNAs by training on known and unknown lncRNA-disease associations. ...
Article
In recent years, the long noncoding RNA (lncRNA) has been shown to be involved in many disease processes. The prediction of the lncRNA–disease association is helpful to clarify the mechanism of disease occurrence and bring some new methods of disease prevention and treatment. The current methods for predicting the potential lncRNA–disease association seldom consider the heterogeneous networks with complex node paths, and these methods have the problem of unbalanced positive and negative samples. To solve this problem, a method based on the Gradient Boosting Decision Tree (GBDT) and logistic regression (LR) to predict the lncRNA–disease association (GBDTLRL2D) is proposed in this paper. MetaGraph2Vec is used for feature learning, and negative sample sets are selected by using K-means clustering. The innovation of the GBDTLRL2D is that the clustering algorithm is used to select a representative negative sample set, and the use of MetaGraph2Vec can better retain the semantic and structural features in heterogeneous networks. The average area under the receiver operating characteristic curve (AUC) values of GBDTLRL2D obtained on the three datasets are 0.98, 0.98, and 0.96 in 10-fold cross-validation.
... Because the similarity relationships between each node and its neighbours have an important influence on the prediction result, the RWR algorithm is well suited to calculating the relationships between nodes and their neighbours. RWR combines the similarity [31] between neighbouring nodes by random walk and adjusts the degree of integration of the combined neighbouring nodes by edge weights. The calculation method [19] of RWR is defined as: ...
Article
Background The existing studies show that circRNAs can be used as biomarkers of diseases and play a prominent role in the treatment and diagnosis of diseases. However, the relationships between the vast majority of circRNAs and diseases are still unclear, and more experiments are needed to study the mechanism of circRNAs. Nowadays, some scholars use the attributes of circRNAs and diseases to study and predict their associations. Nonetheless, most of the existing experimental methods use less information about the attributes of circRNAs, which has a certain impact on the accuracy of the final prediction results. On the other hand, some scholars also apply experimental methods to predict the associations between circRNAs and diseases, but such methods are usually expensive and time-consuming. Given these shortcomings, follow-up research is needed to propose a more efficient computational method to predict the associations between circRNAs and diseases.
Results In this study, a novel algorithm is proposed, which is based on a Graph Convolutional Network (GCN) constructed with Random Walk with Restart (RWR) and Principal Component Analysis (PCA), to predict the associations between circRNAs and diseases (CRPGCN). In the construction of CRPGCN, the RWR algorithm is used to improve the similarity associations of the computed nodes with their neighbours. After that, the PCA method is used for dimensionality reduction and feature extraction, which makes the connection between circRNAs with higher similarity and diseases closer. Finally, the GCN algorithm is used to learn the features between circRNAs and diseases and calculate the final similarity scores; the learning data are constructed from the adjacency matrix, similarity matrix and feature matrix as a heterogeneous adjacency matrix and a heterogeneous feature matrix.
Conclusions After 2-fold cross-validation, 5-fold cross-validation and 10-fold cross-validation, the area under the ROC curve of the CRPGCN is 0.9490, 0.9720 and 0.9722, respectively. The CRPGCN method is effective in predicting the associations between circRNAs and diseases.
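As background for the RWR step named in this abstract, the sketch below shows the common textbook form of random walk with restart, p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalised adjacency matrix. It is not necessarily the exact formulation used in CRPGCN. The function name, the restart probability of 0.3, and the toy graph are assumptions chosen for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until the change is below tol."""
    n = len(adj)
    # Column-normalise so that column j holds the transition probabilities out of node j.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [0.0] * n
    p0[seed] = 1.0                      # the walker starts at, and restarts from, the seed
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1.0 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical 4-node path graph 0 - 1 - 2 - 3 with unit edge weights.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
p = random_walk_with_restart(adj, seed=0)
```

The stationary vector `p` ranks nodes by proximity to the seed; on this path graph the values fall off with distance from node 0.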
Article
The circRNAs and miRNAs play an important role in the development of human diseases, and they can be widely used as biomarkers of diseases for disease diagnosis. In particular, circRNAs can act as sponge adsorbers for miRNAs and act together in certain diseases. However, the associations between the vast majority of circRNAs and diseases and between miRNAs and diseases remain unclear. Computational-based approaches are urgently needed to discover the unknown interactions between circRNAs and miRNAs. In this paper, we propose a novel deep learning algorithm based on Node2vec and Graph ATtention network (GAT), Conditional Random Field (CRF) layer and Inductive Matrix Completion (IMC) to predict circRNAs and miRNAs interactions (NGCICM). We construct a GAT-based encoder for deep feature learning by fusing the talking-heads attention mechanism and the CRF layer. The IMC-based decoder is also constructed to obtain interaction scores. The Area Under the receiver operating characteristic Curve (AUC) of the NGCICM method is 0.9697, 0.9932 and 0.9980, and the Area Under the Precision-Recall curve (AUPR) is 0.9671, 0.9935 and 0.9981, respectively, using 2-fold, 5-fold and 10-fold Cross-Validation (CV) as the benchmark. The experimental results confirm the effectiveness of the NGCICM algorithm in predicting the interactions between circRNAs and miRNAs.
Article
A growing number of studies have confirmed the important role of microRNAs (miRNAs) in human diseases, and the aberrant expression of miRNAs affects the onset and progression of human diseases. The discovery of disease-associated miRNAs as new biomarkers promotes the progress of disease pathology and clinical medicine. However, only a small proportion of miRNA-disease correlations have been validated by biological experiments, and identifying miRNA-disease associations through biological experiments is both expensive and inefficient. Therefore, it is important to develop efficient and highly accurate computational methods to predict miRNA-disease associations. A miRNA-disease association prediction algorithm based on Graph Convolutional neural Networks and Principal Component Analysis (GCNPCA) is proposed in this paper. Specifically, the deep topological structure information is extracted from the heterogeneous network composed of miRNA and disease nodes by a Graph Convolutional neural Network (GCN) with an additional attention mechanism. The internal attribute information of the nodes is obtained by Principal Component Analysis (PCA). Then, the topological structure information and the node attribute information are combined to construct comprehensive feature descriptors. Finally, a Random Forest (RF) is used to train and classify these feature descriptors. In the five-fold cross-validation experiment, the AUC and AUPR for the GCNPCA algorithm are 0.983 and 0.988, respectively.
Article
Identifying genes related to Parkinson’s disease (PD) is an active research topic in biomedical analysis, which plays a critical role in diagnosis and treatment. Recently, many studies have proposed different techniques for predicting disease-related genes. However, few of these techniques are designed or developed for PD gene prediction. Most of these PD techniques are developed to identify only protein genes and discard long noncoding RNA (lncRNA) genes, which play an essential role in biological processes and the transformation and development of diseases. This paper proposes a novel prediction system to identify protein and lncRNA genes related to PD that can aid in an early diagnosis. First, we preprocessed the genes into DNA FASTA sequences from the University of California Santa Cruz (UCSC) genome browser and removed the redundancies. Second, we extracted some significant features of DNA FASTA sequences using the PyFeat method with AdaBoost as feature selection. These selected features achieved promising results compared with features extracted by some state-of-the-art feature extraction techniques. Finally, the features were fed to the gradient-boosted decision tree (GBDT) to diagnose different tested cases. Seven performance metrics were used to evaluate the performance of the proposed system. The proposed system achieved an average accuracy of 78.6%, an area under the curve (AUC) of 84.5%, an area under the precision-recall curve (AUPR) of 85.3%, an F1-score of 78.3%, a Matthews correlation coefficient (MCC) of 0.575, a sensitivity (SEN) of 77.1%, and a specificity (SPC) of 80.2%. The experiments demonstrate promising results compared with other systems. The predicted top-rank protein and lncRNA genes are verified based on a literature review.