Article

iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding

Abstract

An enhancer is a short (50–1500 bp) region of DNA that plays an important role in gene expression and the production of RNA and proteins. Genetic variation in enhancers has been linked to many human diseases, such as cancer and inflammatory bowel disease. Because of the importance of enhancers in genomics, enhancer classification has become a popular research area in computational biology. A few computational tools have been developed to address this problem, but their performance still leaves room for improvement. In this study, we represent enhancers with word embeddings that include sub-word information of their biological words; these embeddings serve as features for a support vector machine classifier. We present iEnhancer-5Step, a web server containing two-layer classifiers to identify enhancers and their strength. It attains independent test accuracies of 79% and 63.5% in the two layers, respectively. On the same dataset, the proposed method outperforms current predictors. Moreover, this study provides a basis for further research on applying natural language processing techniques to biological sequences. iEnhancer-5Step is freely accessible via http://biologydeep.com/fastenc/.
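The idea of "sub-word information of biological words" can be illustrated with a minimal sketch. It assumes that a biological word is a k-mer and that sub-words are fastText-style character n-grams taken after wrapping the word in boundary markers; the function name and the n-gram range are illustrative choices, not settings reported in the abstract.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Return fastText-style character n-grams of a DNA 'word'.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    n_min and n_max are illustrative defaults (assumptions).
    """
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# Sub-words of the 4-mer "ACGT" for n = 3 only
print(char_ngrams("ACGT", 3, 3))
```

In a fastText-style model, the vector of a word is built from the vectors of such n-grams, and a sequence-level feature vector (for example, an average over its words) could then be passed to a support vector machine.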


... • NEPERS formulates the prediction of enhancers and their strength as a binary classification problem and solves it using a cascade deep forest algorithm.
• It takes advantage of multi-view features, such as position-specific trinucleotide propensity based on single-stranded (PSTNPss) characteristics, position-specific trinucleotide propensity based on double-stranded (PSTNPdss) characteristics, the composition of k-spaced nucleic acid pairs (CKSNAP), and nucleotide chemical properties (NCP), to incorporate biological sequences into nominal descriptors. ...
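One of the descriptors named above, the composition of k-spaced nucleic acid pairs (CKSNAP), can be sketched as follows. This is a minimal illustration of the commonly used definition (frequency of each ordered nucleotide pair separated by exactly k positions); the function name and the handling of non-ACGT characters are assumptions, and the snippet does not state which gap values NEPERS uses.

```python
from itertools import product


def cksnap(seq, k):
    """Frequencies of the 16 ordered nucleotide pairs separated by k bases.

    For gap k, position i is paired with position i + k + 1, so k = 0
    gives ordinary dinucleotide composition. Pairs containing characters
    other than A, C, G, T are skipped (an illustrative choice).
    """
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
    if total <= 0:
        return [0.0] * len(pairs)
    for i in range(total):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in pairs]


features = cksnap("ACGT", 1)  # pairs with one base in between: AG and CT
```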
... In previous studies, many techniques have been used to successfully discover enhancers and their strengths, which are based on different machine learning (ML), ensemble, and DL approaches. These methods include iEnhancer-2L [4], EnhancerPred [5], Enhancer-TNC [6], iEnhancer-5Steps [7], Enhancer-PCWM [8], iEnhancer-RF [9], and iEnhancer-MFGBDT [10], which are classification methods based on conventional machine learning approaches. iEnhancer-EL [11], iEnhancer-XG [12], and iEnhancer-EBLSTM [13] are classification methods based on ensemble-based approaches. ...
... In this model, the SVM classifier, along with the TNC approach, was predicted to be the best for classifying enhancers. In 2019, Le et al. developed a classifier named iEnhancer-5Steps [7] using the concept of pseudo-amino acid composition with 5CV. The scikit-learn package was used to perform SVM on the dataset. ...
Article
Full-text available
Enhancers are short DNA segments (50–1500 bp) that effectively activate gene transcription when transcription factors (TFs) are present. There is a correlation between the genetic differences in enhancers and numerous human disorders including cancer and inflammatory bowel disease. In computational biology, the accurate categorization of enhancers can yield important information for drug discovery and development. High-throughput experimental approaches are thought to be vital tools for researching enhancers’ key characteristics; however, because these techniques require a lot of labor and time, it might be difficult for researchers to forecast enhancers and their powers. Therefore, computational techniques are considered an alternate strategy for handling this issue. Based on the types of algorithms that have been used to construct predictors, the current methodologies can be divided into three primary categories: ensemble-based methods, deep learning-based approaches, and traditional ML-based techniques. In this study, we developed a novel two-layer deep forest-based predictor for accurate enhancer and strength prediction, namely, NEPERS. Enhancers and non-enhancers are divided at the first level by NEPERS, whereas strong and weak enhancers are divided at the second level. To evaluate the effectiveness of feature fusion, block-wise deep forest and other algorithms were combined with multi-view features such as PSTNPss, PSTNPdss, CKSNAP, and NCP via 10-fold cross-validation and independent testing. Our proposed technique performs better than competing models across all parameters, with an ACC of 0.876, Sen of 0.864, Spe of 0.888, MCC of 0.753, and AUC of 0.940 for layer 1 and an ACC of 0.959, Sen of 0.960, Spe of 0.958, MCC of 0.918, and AUC of 0.990 for layer 2, respectively, for the benchmark dataset. 
Similarly, for the independent test, the ACC, Sen, Spe, MCC, and AUC were 0.863, 0.865, 0.860, 0.725, and 0.948 for layer 1 and 0.890, 0.940, 0.840, 0.784, and 0.951 for layer 2, respectively. This study provides conclusive insights for the accurate and effective detection and characterization of enhancers and their strengths.
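The two-level decision described in this abstract (enhancer versus non-enhancer first, then strong versus weak) can be sketched as a simple cascade. The two classifiers below are toy stand-ins based on GC content, with thresholds that carry no biological meaning; they are placeholders for trained models and are not the NEPERS classifiers.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def two_layer_predict(seq, is_enhancer, is_strong):
    """Apply layer 1, and only for predicted enhancers apply layer 2."""
    if not is_enhancer(seq):
        return "non-enhancer"
    return "strong enhancer" if is_strong(seq) else "weak enhancer"


# Hypothetical placeholder classifiers (assumptions for illustration only)
def toy_layer1(seq):
    return gc_content(seq) >= 0.4


def toy_layer2(seq):
    return gc_content(seq) >= 0.6


label = two_layer_predict("GCGCAT", toy_layer1, toy_layer2)
```

Because layer 2 only sees sequences that layer 1 accepts, errors in the first layer propagate to the second, which is one reason the two layers are usually evaluated separately.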
... Therefore, the choice of k is a very crucial task in achieving better predictive performance. Considering the work of Asim et al. [2] and Le et al. [1], this paper utilizes the stride size of to generate overlapping higher order residues (5-mers). ...
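Generating overlapping 5-mers with a sliding window can be sketched as below. The stride value is missing from the snippet, so it is left as a parameter here; the default of 1 is only an illustrative assumption, as is the function name.

```python
def kmers(seq, k=5, stride=1):
    """Slide a window of length k over seq, moving `stride` bases each step.

    Any stride smaller than k yields overlapping k-mers. The default
    stride of 1 is an assumption, not the value used in the cited work.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


words = kmers("ACGTACG", k=5, stride=1)
```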
... Batch normalization has achieved great success in multifarious areas of deep learning [4,21,60] as it makes sure that input to output mapping of the deep neural network does not over-specialize only a particular block of input distribution which results in faster training, better convergence, and generalizability [27]. Mathematically, providing the d-dimensional feature space $x = \{x^{(1)}, \ldots, x^{(d)}\}$, batch normalization operation can be expressed as follows: ...
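The expression itself is cut off in this snippet. For orientation, the standard formulation of batch normalization normalizes each feature dimension $k = 1, \ldots, d$ independently using mini-batch statistics and then applies a learned scale and shift; the cited paper's notation may differ, and $\epsilon$, $\gamma^{(k)}$, and $\beta^{(k)}$ are the usual symbols rather than ones taken from the snippet.

```latex
\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}\left[x^{(k)}\right]}
                     {\sqrt{\mathrm{Var}\left[x^{(k)}\right] + \epsilon}},
\qquad
y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}
```

Here the expectation and variance are computed over the mini-batch, $\epsilon$ is a small constant for numerical stability, and $\gamma^{(k)}$ and $\beta^{(k)}$ are trainable parameters.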
... The performance of deep learning models is largely influenced by the values of various hyperparameters such as k-mers, residue embedding dimensions, embedding and standard dropout, learning rate, weight decay, batch size, etc. From the training set, we use 10% of the sequences as the validation set to find the optimal values of the most influential hyperparameters for lncRNA-miRNA and lncRNA-protein interaction prediction tasks using grid search [41,59]. To ensure reproducibility of the results, Table 1 reports the initial value range for different hyperparameters defined by following the literature [1,2] and the optimal hyperparameter values found through the grid search for the proposed BoT-Net approach for lncRNA-miRNA and lncRNA-protein interaction prediction tasks. ...
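The procedure described here, holding out 10% of the training sequences and searching a grid of hyperparameter values, can be sketched generically. The helper names, the seed, and the scoring callback are assumptions for illustration; in practice the callback would train a model on the training portion and return its validation score.

```python
import itertools
import random


def split_validation(samples, fraction=0.1, seed=0):
    """Shuffle samples and hold out `fraction` of them for validation."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


def grid_search(grid, score):
    """Evaluate every combination in `grid` and return the best one.

    grid:  dict mapping hyperparameter name -> list of candidate values
    score: callable taking a dict of hyperparameters and returning a
           validation score where higher is better (hypothetical callback)
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score
```

Exhaustive search grows multiplicatively with the number of candidate values per hyperparameter, which is why the value ranges are usually narrowed from the literature first, as the snippet describes.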
Article
Full-text available
Background and objective Interactions of long non-coding ribonucleic acids (lncRNAs) with micro-ribonucleic acids (miRNAs) play an essential role in gene regulation, cellular metabolic, and pathological processes. Existing purely sequence based computational approaches lack robustness and efficiency mainly due to the high length variability of lncRNA sequences. Hence, the prime focus of the current study is to find optimal length trade-offs between highly flexible length lncRNA sequences. Method The paper at hand performs in-depth exploration of diverse copy padding, sequence truncation approaches, and presents a novel idea of utilizing only subregions of lncRNA sequences to generate fixed-length lncRNA sequences. Furthermore, it presents a novel bag of tricks-based deep learning approach “Bot-Net” which leverages a single layer long-short-term memory network regularized through DropConnect to capture higher order residue dependencies, pooling to retain most salient features, normalization to prevent exploding and vanishing gradient issues, learning rate decay, and dropout to regularize precise neural network for lncRNA–miRNA interaction prediction. Results BoT-Net outperforms the state-of-the-art lncRNA–miRNA interaction prediction approach by 2%, 8%, and 4% in terms of accuracy, specificity, and matthews correlation coefficient. Furthermore, a case study analysis indicates that BoT-Net also outperforms state-of-the-art lncRNA–protein interaction predictor on a benchmark dataset by accuracy of 10%, sensitivity of 19%, specificity of 6%, precision of 14%, and matthews correlation coefficient of 26%. Conclusion In the benchmark lncRNA–miRNA interaction prediction dataset, the length of the lncRNA sequence varies from 213 residues to 22,743 residues and in the benchmark lncRNA–protein interaction prediction dataset, lncRNA sequences vary from 15 residues to 1504 residues. 
For such highly flexible length sequences, fixed length generation using copy padding introduces a significant level of bias, which makes a large number of lncRNA sequences nearly identical to each other and eventually derails classifier generalizability. Empirical evaluation reveals that the first 50 residues of long lncRNA sequences alone contain a highly informative distribution for lncRNA–miRNA interaction prediction, a crucial finding exploited by the proposed BoT-Net approach to optimize the lncRNA fixed length generation process. Availability BoT-Net web server can be accessed at https://sds_genetic_analysis.opendfki.de/lncmiRNA/.
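The fixed-length generation strategies contrasted in this abstract can be sketched as follows. The sketch assumes that copy padding means repeating a short sequence until the target length is reached, and that the subregion strategy keeps only the starting region; both readings, the function names, and the default length of 50 are assumptions based on the abstract's wording rather than the BoT-Net implementation.

```python
def copy_pad(seq, length):
    """Repeat seq until it is at least `length` long, then cut to `length`."""
    if not seq:
        raise ValueError("cannot pad an empty sequence")
    repeats = -(-length // len(seq))  # ceiling division
    return (seq * repeats)[:length]


def truncate_start(seq, length):
    """Keep only the first `length` residues (the starting subregion)."""
    return seq[:length]


def fixed_length(seq, length=50):
    """Return a sequence of exactly `length` residues."""
    if len(seq) >= length:
        return truncate_start(seq, length)
    return copy_pad(seq, length)
```

With a small target length, most long sequences are simply truncated and very few need padding, which is consistent with the abstract's concern that heavy copy padding makes sequences look alike.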
... For fair comparison with the state-of-the-art methods, we used the same benchmark dataset as those in iEnhancer-2L [40], iEnhancer-PsedeKNC [41], EnhancerPred [42], EnhancerPred2.0 [43], Enhancer-Tri-N [44], iEnhaner-2L-Hybrid [45], iEnhancer-EL [46], iEnhancer-5Step [47], DeployEnhancer [48], ES-ARCNN [49], iEnhancer-ECNN [50], EnhancerP-2L [51], iEnhancer-CNN [52], iEnhancer-XG [53], Enhancer-DRRNN [54], Enhancer-BERT [55], iEnhancer-KL [56], iEnhancer-RF [57], spEnhancer [58], iEnhancer-EBLSTM [59], iEnhancer-GAN [60], piEnPred [61], iEnhancer-RD [62], and iEnhancer-MFGBDT [63]. The dataset was initially collected by Liu et al. [40] from chromatin state information of nine cell lines (H1ES, K562, GM12878, HepG2, HUVEC, HSMM, NHLF, NHEK and HME) which was annotated by ChromHMM [69,70]. ...
... First Stage
iEnhancer-2L [40]       0.7100  0.7500  0.7300  0.4604  0.8062
EnhancerPred [42]       0.7350  0.7450  0.7400  0.4800  0.8013
iEnhancer-EL [46]       0.7100  0.7850  0.7475  0.4964  0.8173
iEnhancer-5Step [47]    0.8200  0.7600  0.7900  0.5800  -
DeployEnhancer [48]     0.7550  0.7600  0.7550  0.5100  0.7704
iEnhancer-ECNN [50]     0.7520  0.7850  0.7690  0.5370  0.8320
EnhancerP-2L [51]       0 ...
... Second Stage
iEnhancer-2L [40]       0.4700  0.7400  0.6050  0.2181  0.6678
EnhancerPred [42]       0.4500  0.6500  0.5500  0.1020  0.5790
iEnhancer-EL [46]       0.5400  0.6800  0.6100  0.2222  0.6801
iEnhancer-5Step [47]    0.7400  0.5300  0.6350  0.2800  -
DeployEnhancer [48]     0 ...
... the number of correctly predicted positive samples to the total number of positive ones, while SP is the ratio of the number of correctly predicted negative samples to the total number of negative ones. Sometimes, the two indices would not maintain synchronization, which was difficult to determine as good or bad. ...
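The definitions of sensitivity (SN) and specificity (SP) given above translate directly into code. The sketch below also computes accuracy and the Matthews correlation coefficient from the same confusion-matrix counts. The example counts are hypothetical: they assume an independent test set of 200 positive and 200 negative samples, which the snippet does not state, and they are chosen so that SN and SP match the first-stage iEnhancer-5Step row if the columns are read as SN, SP, ACC, MCC, AUC.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Compute SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # correctly predicted positives
    sp = tn / (tn + fp)                      # correctly predicted negatives
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 200 positives and 200 negatives (an assumption)
m = confusion_metrics(tp=164, fn=36, tn=152, fp=48)
# SN = 0.82, SP = 0.76, ACC = 0.79, MCC is approximately 0.58
```

A single threshold-free summary such as MCC is useful precisely because SN and SP can move in opposite directions, as the snippet notes.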
Article
Full-text available
Enhancers are short DNA segments that play a key role in biological processes, such as accelerating transcription of target genes. Since the enhancer resides anywhere in a genome sequence, it is difficult to precisely identify enhancers. We presented a bi-directional long-short term memory (Bi-LSTM) and attention-based deep learning method (Enhancer-LSTMAtt) for enhancer recognition. Enhancer-LSTMAtt is an end-to-end deep learning model that consists mainly of deep residual neural network, Bi-LSTM, and feed-forward attention. We extensively compared the Enhancer-LSTMAtt with 19 state-of-the-art methods by 5-fold cross validation, 10-fold cross validation and independent test. Enhancer-LSTMAtt achieved competitive performances, especially in the independent test. We realized Enhancer-LSTMAtt into a user-friendly web application. Enhancer-LSTMAtt is applicable not only to recognizing enhancers, but also to distinguishing strong enhancer from weak enhancers. Enhancer-LSTMAtt is believed to become a promising tool for identifying enhancers.
... In contrast to iEnhancer-2L, this approach combined three feature encodings to generate hybrid features. To complement the conventional features used in previous methods, Le et al. used word embeddings as inputs and trained SVM algorithms for the development of the iEnhancer-5step in 2019 [22]. Khan et al. developed a prediction tool piEnPred [23] in 2021. ...
... The dataset mentioned in this article is derived from the dataset utilized by Liu in his research [20], as well as by Basith in his study. Liu's dataset is utilized by other predictors as well such as [22,27]. The enhancer sequences were divided into 200 bp fragments and filtered by CD-HIT [35]. ...
Article
Full-text available
An enhancer is a specific DNA sequence typically located within a gene at upstream or downstream position and serves as a pivotal element in the regulation of eukaryotic gene transcription. Therefore, the recognition of enhancers is highly significant for comprehending gene expression regulatory systems. While some useful predictive models have been proposed, there are still deficiencies in these models. To address current limitations, we propose a model, DNABERT2-Enhancer, based on transformer architecture and deep learning, designed for the recognition of enhancers (classified as either enhancer or non-enhancer) and the identification of their activity (strong or weak enhancers). More specifically, DNABERT2-Enhancer is composed of a BERT model for extracting features and a CNN model for enhancer classification. Parameters of the BERT model are initialized from a pre-trained DNABERT-2 language model. The enhancer recognition task is then fine-tuned through transfer learning to convert the original sequence into feature vectors. Subsequently, the CNN network is employed to learn the feature vector generated by BERT and produce the prediction results. In comparison with existing predictors utilizing the identical dataset, our approach demonstrates superior performance. This suggests that the model will be a useful instrument for academic research on enhancer recognition.
... Four different methods were developed in 2019, namely iEnhancer-5step [134], iEnhancer-ECNN [63], DeployEnhancerModel [135], and CHilEnPred [97]. Except for CHilEnPred, every other method was developed using the Liu dataset2. ...
... To complement the conventional features used in previous methods, Le et al. [134] used word embeddings as inputs and trained SVM algorithms for the development of the iEnhancer-5step. In training, this method demonstrated ACC values of 0.823 and 0.681 for Layers 1 and 2, respectively, and the corresponding performances on the independent dataset were 0.790 and 0.635, respectively. ...
Article
Full-text available
Enhancers are non‐coding DNA elements that play a crucial role in enhancing the transcription rate of a specific gene in the genome. Experiments for identifying enhancers can be restricted by their conditions and involve complicated, time‐consuming, laborious, and costly steps. To overcome these challenges, computational platforms have been developed to complement experimental methods that enable high‐throughput identification of enhancers. Over the last few years, the development of various enhancer computational tools has resulted in significant progress in predicting putative enhancers. Thus, researchers are now able to use a variety of strategies to enhance and advance enhancer study. In this review, an overview of machine learning (ML)‐based prediction methods for enhancer identification and related databases has been provided. The existing enhancer‐prediction methods have also been reviewed regarding their algorithms, feature selection processes, validation techniques, and software utility. In addition, the advantages and drawbacks of these ML approaches and guidelines for developing bioinformatic tools have been highlighted for a more efficient enhancer prediction. This review will serve as a useful resource for experimentalists in selecting the appropriate ML tool for their study, and for bioinformaticians in developing more accurate and advanced ML‐based predictors.
... This study evaluated the proposed model against state-of-the-art classification models, such as EnhancerPred [21], iEnhancer-RF [42], iEnhancer-PsedeKNC [43], DeployEnhance [44], iEnhancer-EL [16], iEnhancer-RD [45], Enhancer-LSTMAtt [7], iEnhancer-XG [46], iEnhancer-2L [3], iEnhancer-5Step [47], iEnhancerDSNet [48], and iEnhancer-CNN [49]. The proposed model was compared to all of these prediction methodologies in order to make a more accurate comparison. ...
Article
Full-text available
Enhancers are sequences with short motifs that exhibit high positional variability and free scattering properties. Identification of these noncoding DNA fragments and their strength are extremely important because they play a key role in controlling gene regulation on a cellular basis. The identification of enhancers is more complex than that of other factors in the genome because they are freely scattered, and their location varies widely. In recent years, bioinformatics tools have enabled significant improvement in identifying this biological difficulty. Cell line-specific screening is not possible using these existing computational methods based solely on DNA sequences. DNA segment chromatin accessibility may provide useful information about its potential function in regulation, thereby identifying regulatory elements based on its chromatin accessibility. In chromatin, the entanglement structure allows positions far apart in the sequence to encounter each other, regardless of their proximity to the gene to be acted upon. Thus, identifying enhancers and assessing their strength is difficult and time-consuming. The goal of our work was to overcome these limitations by presenting a convolutional neural network (CNN) with attention-gated recurrent units (AttGRU) based on Deep Learning. It used a CNN and one-hot coding to build models, primarily to identify enhancers and secondarily to classify their strength. To test the performance of the proposed model, parallels were drawn between enhancer-CNNAttGRU and existing state-of-the-art methods to enable comparisons. The proposed model performed the best for predicting stage one and stage two enhancer sequences, as well as their strengths, in a cross-species analysis, achieving best accuracy values of 87.39% and 84.46%, respectively. Overall, the results showed that the proposed model provided comparable results to state-of-the-art models, highlighting its usefulness.
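The one-hot coding mentioned in this abstract as the CNN input can be sketched as below. The channel order (A, C, G, T) and the all-zero vector for ambiguous bases such as N are common conventions assumed here; the abstract does not specify them.

```python
# Channel order A, C, G, T is an assumed convention
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-dimensional one-hot vectors.

    Characters outside A, C, G, T map to an all-zero vector
    (an illustrative choice for ambiguous bases).
    """
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


matrix = one_hot("acgn")  # 4 positions x 4 channels
```

A sequence of length L becomes an L x 4 matrix, over which one-dimensional convolutional filters can scan for short motifs.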
... 24 FastText employs two types of embedding techniques for creating word-level embeddings: continuous bag of words (CBOW) and skip-gram. In this study, we utilized the skip-gram method for n-gram extraction of peptide sequences, as shown in Figure 2. 25 Let us consider a peptide sequence "A" that consists of "L" amino acids. ...
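The skip-gram idea referred to here, predicting surrounding tokens from a center token, can be illustrated by generating the (center, context) training pairs. This is a generic sketch, not the FastText implementation: the tokens are assumed to be overlapping amino-acid 3-grams, and the window size and function names are illustrative.

```python
def peptide_ngrams(peptide, n=3):
    """Split a peptide into overlapping amino-acid n-grams (assumed tokens)."""
    return [peptide[i:i + n] for i in range(len(peptide) - n + 1)]


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = peptide_ngrams("MKVLA")          # ["MKV", "KVL", "VLA"]
pairs = skipgram_pairs(tokens, window=1)
```

CBOW uses the same windows in the opposite direction, predicting the center token from its combined context.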
Article
Full-text available
Neuropeptides (NPs) are critical signaling molecules that are essential in numerous physiological processes and possess significant therapeutic potential. Computational prediction of NPs has emerged as a promising alternative to traditional experimental methods, often labor-intensive, time-consuming, and expensive. Recent advancements in computational peptide models provide a cost-effective approach to identifying NPs, characterized by high selectivity toward target cells and minimal side effects. In this study, we propose a novel deep capsule neural network-based computational model, namely pNPs-CapsNet, to predict NPs and non-NPs accurately. Input samples are numerically encoded using pretrained protein language models, including ESM, ProtBERT-BFD, and ProtT5, to extract attention mechanism-based contextual and semantic features. A differential evolution-based weighted feature integration method is utilized to construct a multiview vector. Additionally, a two-tier feature selection strategy, comprising MRMD and SHAP analysis, is developed to identify and select optimal features. Finally, the novel capsule neural network (CapsNet) is trained using the selected optimal feature set. The proposed pNPs-CapsNet model achieved a remarkable predictive accuracy of 98.10% and an AUC of 0.98. To validate the generalization capability of the pNPs-CapsNet model, independent samples reported an accuracy of 95.21% and an AUC of 0.96. The pNPs-CapsNet model outperforms existing state-of-the-art models, demonstrating 4% and 2.5% improved predictive accuracy for training and independent data sets, respectively. The demonstrated efficacy and consistency of pNPs-CapsNet underline its potential as a valuable and robust tool for advancing drug discovery.
... This perspective has led to the development of models like ProtVec and GeneVec for protein and gene sequences, respectively 13 . Additionally, the adaptation of the FastText model for DNA sequences, including enhancers and promoters, has shown promising results 1,14 . Building on this idea, the application of pre-trained language models like BERT in bioinformatics classification tasks has become increasingly popular 15 . ...
Article
Full-text available
Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT.
... This perspective has led to the development of models like ProtVec and GeneVec for protein and gene sequences, respectively 16 . Additionally, the adaptation of the FastText model for DNA sequences, including enhancers and promoters, has shown promising results 1,17 . Building on this idea, the application of pre-trained language models like BERT in bioinformatics classification tasks has become increasingly popular 18 . ...
Preprint
Full-text available
Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT.
... Pooling aids in reducing the spatial dimensions of the data while retaining the most important properties. This results in a more compact representation, allowing the network to concentrate on the most significant features of the material as shown in Fig. 2. The retrieved characteristics are then sent into fully connected layers, which are standard neural network layers [31][32][33] . ...
Article
Full-text available
Proteins, nucleic acids, and lipids all interact with intrinsically disordered protein regions. Lipid-binding regions are involved in a variety of biological processes as well as a number of human illnesses. The expanding body of experimental evidence for these interactions and the dearth of techniques to predict them from the protein sequence motivate this work. Although large-scale laboratory techniques are considered essential for studying binding residues, they are time-consuming and costly, making it challenging for researchers to identify lipid-binding residues. As a result, computational techniques are being explored as an alternative strategy to overcome this difficulty. To predict disordered lipid-binding residues (DLBRs), we propose the iDLB-Pred predictor, which uses a benchmark dataset and feature extraction techniques to identify relevant patterns and information. Various classification techniques, including deep learning methods such as Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Multilayer Perceptrons (MLPs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), were employed for model training. The proposed model, iDLB-Pred, was rigorously validated using metrics such as accuracy, sensitivity, specificity, and Matthews correlation coefficient. The results demonstrate the predictor’s exceptional performance, achieving accuracy rates of 81% on an independent dataset and 86% in 10-fold cross-validation.
... State-of-the-Art Models on the Independent Test Dataset for Two Enhancer Classification ... their prediction performances on independent datasets that were not involved in the model training process, to assess their generalization capabilities on future samples, as shown in Tables 1 and 2. Then, as shown in Table 2, we follow the same parameters used for enhancer category classification to train DeepEnhancerPPO and predict on the independent dataset ...
Model                   SN      SP      ACC     MCC     AUC
iEnhancer-2L [15]       0.7100  0.7500  0.7300  0.4604  0.8062
EnhancerPred [16]       0.7350  0.7450  0.7400  0.4800  0.8013
iEnhancer-EL [38]       0.5400  0.6800  0.6100  0.2222  0.6801
iEnhancer-5Step [39]    0.7400  0.5300  0.6350  0.2800  -
DeployEnhancer [40]     0.8315  0.4561  0.6849  0.3120  0.6714
iEnhancer-ECNN [18]     0.7910  0.7480  0.6780  0.3680  0.7480
EnhancerP-2L [41]       0.6829  0.7922  0.7250  0.4624  -
iEnhancer-CNN [42]      0.6525  0.7610  0.7500  0.3232  -
iEnhancer-XG [43]       0.7000  0.5700  0.6350  0.2720  -
Enhancer-DRRNN [44]     0.8580  0.8400  0.8490  0.6990  -
Enhancer-BERT [22]      -       -       -       -       -
iEnhancer-RF [45]       0.9300  0.7700  0.8500  0.7091  0.9700
spEnhancer [46]         0.9100  0.3300  0.6200  0.3703  0.6253
iEnhancer-EBLSTM [47]   0.8120  0.5360  0.6580  0.3240  0.6880
iEnhancer-GAN [48]      0.9610  0.5370  0.7490  0.5050  -
piEnhPred [49]          0.7000  0.7500  0.7250  0.4506  -
iEnhancer-RD [48]       0 ...
... In conclusion, the feature maps provide clear insights into the significant, task-relevant features, predominantly from the ResNet module rather than the Transformer. This underscores the model's interpretability and guides further refinement and application of the classifier. ...
Preprint
Full-text available
Enhancers are short genomic segments located in non-coding regions in a genome that help to increase the expressions of the target genes. Despite their significance in transcription regulation, effective methods for classifying enhancer categories and regulatory strengths remain limited. To address the issue, we propose a novel end-to-end deep learning architecture named DeepEnhancerPPO. The model integrates ResNet and Transformer modules to extract local, hierarchical, and long-range contextual features. Following feature fusion, we employ the proximal policy optimization (PPO), a reinforcement learning technique, to reduce the dimensionality of the fused features, retaining the most relevant ones for downstream classification. We evaluate the performance of DeepEnhancerPPO from multiple perspectives, including ablation analysis, independent tests, and interpretability of classification results. Each of these modules contributes positively to the model's performance, with ResNet and PPO being the top contributors. Overall, DeepEnhancerPPO exhibits superb performance on independent datasets compared to other models, outperforming the second-best model by 6.7% in accuracy for enhancer category classification. The model also ranks within the top five classifiers out of 25 in enhancer strength classification without the need to re-optimize the hyperparameters, indicating that the DeepEnhancerPPO framework is highly robust for enhancer classification. Additionally, the inclusion of PPO enhances the interpretability of the classification results. The source code is openly accessible at https://github.com/Mxc666/DeepEnhancerPPO.git.
... All enhancerBD evaluation results are full marks (Table 1).
[31]                    0.7350  0.7450  0.7400  0.4800  0.8013
iEnhancer-EL [32]       0.7100  0.78500 0.7475  0.4964  0.8173
iEnhancer-5Step [33]    0.8200  0.7600  0.7900  0.5800  -
DeployEnhancer [34]     0.7550  0.7600  0.7550  0.5100  0.7704
iEnhancer-ECNN [35]     0.7520  0.7850  0.7690  0.5370  0.8320
EnhancerP-2L [36]       0.7810  0.8105  0.7950  0.5907  -
iEnhancer-CNN [37]      0.7825  0.7900  0.7750  0.5850  -
iEnhancer-XG [38]       0.7400  0.7750  0.7575  0.5150  -
Enhancer-DRRNN [39]     0.7330  0.8010  0.7670  0.5350  0.8370
Enhancer-BERT [40]      0.8000  0.7120  0.7560  0.5140  -
iEnhancer-RF [41]       0.7850  0.8100  0.7975  0.5952  0.8600
spEnhancer [42]         0 ...
Preprint
Deciphering the non-coding language of DNA is one of the fundamental questions in genomic research. Previous bioinformatics methods often struggled to capture this complexity, especially in cases of limited data availability. Enhancers are short DNA segments that play a crucial role in biological processes, such as enhancing the transcription of target genes. Because they can be located at any position within the genome sequence, accurately identifying enhancers can be challenging. We present a deep learning method (enhancerBD) for enhancer recognition. We extensively compared enhancerBD with 18 previous state-of-the-art methods in an independent test. enhancerBD achieved competitive performance. All detection results on the validation set achieved remarkable scores for each metric. It is a solid state-of-the-art enhancer recognition software. In this paper, I extended the combined BERT and DenseNet121 models by sequentially adding a GlobalAveragePooling2D layer, a Dropout layer, and a ReLU activation function. This modification aims to enhance the convergence of the model's loss function and improve its ability to predict sequence features. The improved model is applicable not only to enhancer identification but also to distinguishing enhancer strength. Moreover, it holds the potential for recognizing sequence features such as lncRNA, microRNA, insulator, and silencer.
... We used the following five evaluation metrics: SN(sensitivity), SP(specificity), ACC (accuracy), MCC (Matthews correlation coefficient) to measure the performance [65,66]. Their formulas were expressed as: ...
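The formulas themselves are cut off in this excerpt. As a minimal sketch, the conventional confusion-matrix definitions of SN, SP, ACC and MCC can be computed as below. The function name `binary_metrics` and the example counts are illustrative assumptions, not values taken from the cited paper.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard definitions of SN, SP, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; return 0.0 then.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for a balanced test set of 200 positives and 200 negatives.
m = binary_metrics(tp=164, fn=36, tn=152, fp=48)
print({name: round(value, 4) for name, value in m.items()})
```

MCC uses all four cells of the confusion matrix, so it is a useful complement to accuracy when a classifier trades sensitivity against specificity.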
Article
Full-text available
Human leukocyte antigen (HLA) is closely involved in regulating the human immune system. Despite great advances in detecting classical HLA Class I binders, there are few methods or toolkits for recognizing non-classical HLA Class I binders. To fill this gap, we have developed a deep learning-based tool called DeepHLAPred. DeepHLAPred uses electron-ion interaction pseudo potential, integer numerical mapping and accumulated amino acid frequency as the initial representation of non-classical HLA binder sequences. The deep learning module is used to further refine high-level representations. It comprises two parallel convolutional neural networks, each followed by a maximum pooling layer, a dropout layer, and a bi-directional long short-term memory network. The experimental results showed that DeepHLAPred reached state-of-the-art performance on the cross-validation test and the independent test. The extensive test demonstrated the rationality of DeepHLAPred. We further analyzed the sequence pattern of non-classical HLA class I binders by information entropy. The information entropy of non-classical HLA binder sequences implied a sequence pattern to a certain extent. In addition, we have developed a user-friendly webserver for convenient use, which is available at http://www.biolscience.cn/DeepHLApred/. The tool and the analysis are helpful for detecting non-classical HLA Class I binders. The source code and data are available at https://github.com/tangxingyu0/DeepHLApred. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-023-09796-2.
... SVM can also attain greater generalization ability in small sample classification assignments. It is also widely utilized in many other domains, including handwritten character recognition, text classification, image classification, and recognition [51][52][53][54][55]. The use of the Decision Tree model aids in the early detection of cancer [56,57], diagnosing cardiac arrhythmias [58,59], forecasting stroke outcomes [60][61][62], and assisting with chronic disease management [63,64]. ...
Article
Full-text available
Consolidated efforts have been made to enhance the treatment and diagnosis of heart disease due to its detrimental effects on society. As technology and medical diagnostics become more synergistic, data mining and storing medical information can improve patient management opportunities. Therefore, it is crucial to examine the interdependence of the risk factors in patients' medical histories and comprehend their respective contributions to the prognosis of heart disease. This research aims to analyze the numerous components in patient data for accurate heart disease prediction. The most significant attributes for heart disease prediction have been determined using the Correlation-based Feature Subset Selection Technique with Best First Search. It has been found that the most significant factors for diagnosing heart disease are age, gender, smoking, obesity, diet, physical activity, stress, chest pain type, previous chest pain, blood pressure diastolic, diabetes, troponin, ECG, and target. Distinct artificial intelligence techniques (logistic regression, Naïve Bayes, K-nearest neighbor (K-NN), support vector machine (SVM), decision tree, random forest, and multilayer perceptron (MLP)) are applied and compared for two types of heart disease datasets (all features and selected features). Random forest using selected features has achieved the highest accuracy rate (90%) compared to employing all of the input features and other artificial intelligence techniques. The proposed approach could be utilized as an assistant framework to predict heart disease at an early stage.
... SVM can also attain greater generalization ability in small sample classification assignments. It is also widely utilized in many other domains, including handwritten character recognition, text classification, image classification, and recognition [40][41][42][43][44]. The supervised learning algorithm KNN is primarily employed for classification tasks. ...
Article
Full-text available
The liver is one of the most vital organs of the human body. Even when partially injured, it functions normally. Therefore, detecting liver diseases at the early stages is challenging. Early detection of liver problems can improve patient survival rates. This research examines several Artificial Intelligence techniques, including the Bagged Tree, Support Vector Machine, K-Nearest Neighbor, and Fine Tree classifiers, to predict the presence of liver disease in a patient at an early stage. This study compares those models and selects the best technique to detect liver disease at an early stage. The classification performance is measured using the confusion matrix, True Positive Rate (TPR), False Positive Rate (FPR), ROC curve, and accuracy. The results show that the Bagged Tree classifier achieves the highest classification accuracy (81.30%), which is very promising compared to the other algorithms. The proposed system also performs sensitivity analysis on the dataset to investigate the impact of each attribute on the model’s performance. It has been demonstrated that the Alanine Aminotransferase (sgpt) attribute has the most significant impact on the prediction of liver disease. The proposed method could be used as an assistant framework for liver disease detection at an early stage.
...
Method | Features | Classifier | Year
Conventional ML-based methods
iEnhancer-2L [23] | PseKNC | SVM | 2016
EnhancerPred [24] | BPB, NC, PseKNC | SVM | 2016
iEnhancer-5Step [25] | FastText | SVM | 2019
EnhancerP-2L [26] | Nucleotide Composition, Statistical Moment | RF | 2022
piEnPred [27] | Kmer, CKSNAP, DCC, PseDNC, PseTNC | SVM | 2021
iEnhancer-MFGBDT [28] | Kmer, Revckmer, NMBACC, MACC, SOMA | GBDT | 2021
Ensemble learning-based methods
iEnhancer-EL [29] | Kmer, Subsequence, PseKNC | SVM | 2018
iEnhancer-XG [30] | K-Spectrum, Mismatch, PseDNC, PSSM, Subsequence | XGBoost | 2021
DeployEnhancer [31] | One-hot, Dinucleotide Physicochemical properties | DRNN | 2019
DL-based methods
iEnhancer-ECNN [32] | One-hot, Kmer | CNN | 2019
iEnhancer-CNN [33] | Word2vec | CNN | 2020
iEnhancer-GAN [34] | Word2vec | CNN | 2021
iEnhancer-EBLSTM [35] | Kmer | Bi-LSTM | 2021
iEnhancer-RD [36] | Kmer, PseKNC, KPCV | DNN | 2021
spEnhancer [37] | SeqPose | Bi-LSTM | 2021
chromatin epigenetic markers [21]. Most of the above studies have mainly focused on distinguishing enhancers from other regulatory elements. ...
Article
Enhancers, a class of distal cis-regulatory elements located in the non-coding region of DNA, play a key role in gene regulation. It is difficult to identify enhancers from DNA sequence data because enhancers are freely distributed in the non-coding region, have no specific sequence features, and lie far from their target promoters. Therefore, this study presents a stacking ensemble learning method to accurately identify enhancers and classify them into strong and weak enhancers. Firstly, we obtain the fusion feature matrix by fusing the four features of Kmer, PseDNC, PCPseDNC and Z-Curve9. Secondly, five K-Nearest Neighbor (KNN) models with different parameters are trained as the base models, and the Logistic Regression algorithm is utilized as the meta-model. Thirdly, the stacking ensemble learning strategy is utilized to construct a two-layer model based on the base models and meta-model to train the preprocessed feature sets. The proposed method, named iEnhancer-SKNN, is a two-layer prediction model, in which the first layer predicts whether the given DNA sequences are enhancers or non-enhancers, and the second layer distinguishes whether the predicted enhancers are strong or weak enhancers. The performance of iEnhancer-SKNN is evaluated on the independent testing dataset and the results show that the proposed method has better performance in predicting enhancers and their strength. In enhancer identification, iEnhancer-SKNN achieves an accuracy of 81.75%, an improvement of 1.35% to 8.75% compared with other predictors, and in enhancer classification, iEnhancer-SKNN achieves an accuracy of 80.50%, an improvement of 5.5% to 25.5% compared with other predictors. Moreover, we identify key transcription factor binding site motifs in the enhancer regions and further explore the biological functions of the enhancers and these key motifs.
Source code and data can be downloaded from https://github.com/HaoWuLab-Bioinformatics/iEnhancer-SKNN.
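The two-layer decision flow described in this abstract (enhancer vs. non-enhancer, then strong vs. weak) can be sketched as a simple cascade. This is a minimal illustration, not the iEnhancer-SKNN implementation: the function names are hypothetical, and the GC-content rules below are placeholder stand-ins for trained classifiers with no biological meaning.

```python
from typing import Callable

# A layer is any callable that maps a DNA sequence to a yes/no decision.
Layer = Callable[[str], bool]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def two_layer_predict(seq: str, is_enhancer: Layer, is_strong: Layer) -> str:
    """Layer 1 filters non-enhancers; layer 2 only sees predicted enhancers."""
    if not is_enhancer(seq):
        return "non-enhancer"
    return "strong enhancer" if is_strong(seq) else "weak enhancer"


# Placeholder rules standing in for trained models (assumptions for illustration only).
def toy_layer1(seq: str) -> bool:
    return gc_fraction(seq) >= 0.4


def toy_layer2(seq: str) -> bool:
    return gc_fraction(seq) >= 0.6


for s in ["ATATATAT", "ACGTACGA", "GCGCGCGC"]:
    print(s, "->", two_layer_predict(s, toy_layer1, toy_layer2))
```

Because the second layer only receives sequences accepted by the first, errors in layer 1 propagate: a true enhancer rejected at layer 1 can never be assigned a strength.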
... 2) Physicochemical-based methods: such algorithms are implemented using various physicochemical features that encode enhancer subsequences, including iEnhancer-2L [14], EnhancerPred [15], iEnhancer-EL [16], iEnhancer-RF [17], iEnhancer-XG [18], iEnhancer-ECNN [19], CSI-ANN [20] and Enhancer-IF [21], where iEnhancer-ECNN [19] and CSI-ANN [20] utilize deep learning techniques to learn the implicit information in the features, and the other methods use traditional machine learning classifiers to accomplish the identification task. 3) Contextual-based methods: iEnhancer-EBLSTM [22], iEnhancer-5Step [23] and BERT-2DCNNs [24] consider the contextual information in enhancer sequences, and use different natural language processing technologies to form the embedding matrix of enhancer sequences. However, most of these computational models use only a single feature type to characterize enhancer sequences, making it difficult to describe the distribution of nucleotides and the relationships between nucleotides and their contexts, leaving ample room for improving performance. ...
Article
Full-text available
Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene’s transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning–based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results.
Besides, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.
... Word embedding techniques have achieved great success in natural language processing (NLP) applications. Recently, word embedding techniques have been widely used in the bioinformatics community to tackle the limitation that the kmer-based features of different sequences may be very similar despite their orders being reversed (30)(31)(32). In this study, we divide the sequence into 'words' to keep the order information of the sequence. ...
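As a minimal sketch of the idea in this excerpt, a DNA sequence can be split into overlapping k-mer 'words' whose order is preserved. The helper name `kmer_words` and the default values of `k` and `stride` are assumptions for illustration, not the cited study's settings. The example also shows two different sequences, one the reverse of the other, that share the same unordered 2-mer counts but differ as ordered word lists.

```python
from collections import Counter


def kmer_words(seq: str, k: int = 3, stride: int = 1) -> list:
    """Split a sequence into k-mer 'words', keeping their left-to-right order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


print(kmer_words("ACGTAC", k=3))

# "ACAA" is "AACA" reversed: same bag of 2-mers, different word order.
a = kmer_words("AACA", k=2)
b = kmer_words("ACAA", k=2)
print(Counter(a) == Counter(b))  # unordered k-mer counts cannot tell them apart
print(a == b)                    # the ordered word lists can
```

An embedding model trained on such ordered word lists sees context that a plain k-mer frequency vector discards.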
Article
Full-text available
Promoters are consensus DNA sequences located near the transcription start sites and they play an important role in transcription initiation. Due to their importance in biological processes, the identification of promoters is significantly important for characterizing the expression of the genes. Numerous computational methods have been proposed to predict promoters. However, it is difficult for these methods to achieve satisfactory performance in multiple species. In this study, we propose a novel weighted average ensemble learning model, termed iPro-WAEL, for identifying promoters in multiple species, including Human, Mouse, E.coli, Arabidopsis, B.amyloliquefaciens, B.subtilis and R.capsulatus. Extensive benchmarking experiments illustrate that iPro-WAEL has optimal performance and is superior to the current methods in promoter prediction. The experimental results also demonstrate a satisfactory prediction ability of iPro-WAEL on cross-cell lines, promoters annotated by other methods and distinguishing between promoters and enhancers. Moreover, we identify the most important transcription factor binding site (TFBS) motif in promoter regions to facilitate the study of identifying important motifs in the promoter regions. The source code of iPro-WAEL is freely available at https://github.com/HaoWuLab-Bioinformatics/iPro-WAEL.
... Therefore, the former has more than the latter (Lu BF 2004). Among the above six inflectional arrangements, there are 1089 (91.66%) in which the object is close to the verb, while there are only 99 (8.3%) in which the object is not close to the verb [4][5][6][7]. ...
Article
Full-text available
Object-verb (OV) inflections are an important grammatical device in Chinese with denotative utility. The decorativeness of OV inflections shows different levels in Chinese: OV nouns denote things in reality and are the most denotative; OV independent structures can denote both denotation and trait; OV structures that are definite are not self-sufficient and denote a certain trait. The verb category of V is worn out in denotative OV structures, and OV phrases must repair the wear and tear of V’s category and enhance V’s declarativity when forming small sentences. In this paper, we propose an attention-based approach to Chinese stance representation based on modern Chinese OV order types. Firstly, we use bidirectional long short-term memory (LSTM) neural networks and convolutional neural networks (CNNs) to obtain text representation vectors and local convolutional features, respectively. Then, we use attention mechanisms to add influence weight information to the local convolutional features, and finally fuse the two features for classification. Experiments on the relevant corpus show that the method achieves better stance representation results, and the addition of attention mechanisms can effectively improve the accuracy of stance representation.
... Of the 1484 enhancer sequences, 742 are strong enhancers and 742 are weak enhancers for the second layer classification. Furthermore, the independent dataset used by iEnhancer-5Step [29] was utilized to evaluate the effectiveness and performance of the proposed model. The independent dataset included 400 DNA sequences, of which 200 are enhancers (100 strong and 100 weak) and 200 are non-enhancers. ...
Article
Full-text available
Enhancers regulate gene expression by playing a crucial role in the synthesis of RNAs and proteins. They do not directly encode proteins or RNA molecules. In order to control gene expression, it is important to predict enhancers and their potency. Given their distance from the target gene, lack of common motifs, and tissue/cell specificity, enhancer regions are thought to be difficult to predict in DNA sequences. Recently, a number of bioinformatics tools were created to distinguish enhancers from other regulatory components and to pinpoint their advantages. However, because the quality of their prediction methods needs to be improved, their practical application value must also be improved. Based on nucleotide composition and statistical moment-based features, the current study suggests a novel method for identifying enhancers and non-enhancers and evaluating their strength. The proposed study outperformed state-of-the-art techniques using fivefold and tenfold cross-validation in terms of accuracy. The current study achieves accuracies of 86.5% and 72.3% in enhancer site prediction and strength prediction, respectively. The results of the suggested methodology point to the potential for more efficient and successful outcomes when statistical moment-based features are used. The current study's source code is available to the research community at https://github.com/csbioinfopk/enpred.
... Taking the effectiveness of grid search for automated parameter search [106] into account, we use grid search to determine the optimal values of diverse hyperparameters related to sequence encoding and the generalizability of machine learning classifiers. Inspired by the studies of Le et al. [107] and Asim et al. [108], experimentation is performed by varying the residues parameter k from 2 to 5. For residue-encoding specific parameters, the initial K-gap range is defined as 2 to 5, following state-of-the-art sequence representation learning toolkits such as iLearnPlus [49]. Turning to machine learning classifiers, tree-based classifiers are evaluated using both the gini and entropy criteria with the number of estimators varied from 20 to 200, the discriminative classifier "SVM" is evaluated using linear, polynomial, and radial basis kernels, and the smoothing of the generative classifier Naive Bayes ranges between 1e-1 and 1e-9. ...
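A grid search of the kind described here enumerates every combination of candidate values and keeps the best-scoring one. The sketch below builds only the SVM part of such a grid (k from 2 to 5, three kernels). The kernel strings, the helper `best_config`, and the scoring function are assumptions for illustration; in practice the score would come from cross-validated model performance.

```python
from itertools import product

k_values = range(2, 6)                   # residue parameter k from 2 to 5
svm_kernels = ["linear", "poly", "rbf"]  # linear, polynomial, radial basis (assumed labels)

# Every (k, kernel) pair becomes one candidate configuration.
svm_grid = [{"k": k, "kernel": kernel} for k, kernel in product(k_values, svm_kernels)]
print(len(svm_grid), "configurations")


def best_config(grid, score_fn):
    """Return the configuration with the highest score (first one wins on ties)."""
    return max(grid, key=score_fn)


# Hypothetical scoring function: a stand-in for cross-validated accuracy.
def toy_score(cfg):
    return -abs(cfg["k"] - 3)


print(best_config(svm_grid, toy_score))
```

The grid grows multiplicatively with each added hyperparameter, which is why the ranges in such studies are usually kept small.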
Article
Full-text available
Circular ribonucleic acids (circRNAs) are novel non-coding RNAs that emanate from alternative splicing of precursor mRNA in reversed order across exons. Despite the abundant presence of circRNAs in human genes and their involvement in diverse physiological processes, the functionality of most circRNAs remains a mystery. Like other non-coding RNAs, sub-cellular localization knowledge of circRNAs has the aptitude to demystify the influence of circRNAs on protein synthesis, degradation, destination, their association with different diseases, and potential for drug development. To date, wet experimental approaches are being used to detect sub-cellular locations of circular RNAs. These approaches help to elucidate the role of circRNAs as protein scaffolds, RNA-binding protein (RBP) sponges, micro-RNA (miRNA) sponges, parental gene expression modifiers, alternative splicing regulators, and transcription regulators. To complement wet-lab experiments, considering the progress made by machine learning approaches for the determination of sub-cellular localization of other non-coding RNAs, the paper in hand develops a computational framework, Circ-LocNet, to precisely detect circRNA sub-cellular localization. Circ-LocNet performs comprehensive extrinsic evaluation of 7 residue frequency-based, residue order and frequency-based, and physio-chemical property-based sequence descriptors using the five most widely used machine learning classifiers. Further, it explores the performance impact of K-order sequence descriptor fusion where it ensembles similar as well dissimilar genres of statistical representation learning approaches to reap the combined benefits. Considering the diversity of statistical representation learning schemes, it assesses the performance of second-order, third-order, and going all the way up to seventh-order sequence descriptor fusion. 
A comprehensive empirical evaluation of Circ-LocNet over a newly developed benchmark dataset using different settings reveals that standalone residue frequency-based sequence descriptors and tree-based classifiers are more suitable to predict sub-cellular localization of circular RNAs. Further, K-order heterogeneous sequence descriptors fusion in combination with tree-based classifiers most accurately predict sub-cellular localization of circular RNAs. We anticipate this study will act as a rich baseline and push the development of robust computational methodologies for the accurate sub-cellular localization determination of novel circRNAs.
... In particular, many TC-related publications are contributed by researchers from the fields of Biochemistry and Biotechnology, which traditionally seem not directly or closely relevant to TC techniques. Such a phenomenon may be explained by the wide and increasing use of natural language processing (NLP) techniques in biological sequences processing (Badal et al., 2018;Buchan & Jones, 2020;Islam et al., 2018;Le et al., 2019). To be specific, many biological sequences that play fundamental roles in life, such as Deoxyribonucleic Acid (DNA) chains and Protein sequences are formed by small molecules with intricate structures and complex grammars, similar to how texts are formed by words or n-grams (Huang & Yu, 2016;Islam et al., 2018;Srivastava & Baptista, 2016). ...
Article
Full-text available
Text Classification (TC) is the process of assigning several different categories to a set of texts. This study aims to evaluate the state of the arts of TC studies. Firstly, TC-related publications indexed in Web of Science were selected as data. In total, 3,121 TC-related publications were published in 760 journals between 2000 and 2020. Then, the bibliographic information was mined to identify the publication trends, important contributors, publication venues, and involved disciplines. Besides, a thematic analysis was performed to extract topics with increasing/decreasing popularity. The findings showed that TC has become a fast-growing interdisciplinary area, and that emerging research powers such as China are playing increasingly important roles in TC research. Moreover, the thematic analysis showed increased interest in topics concerning advanced classification algorithms, performance evaluation methods, and the practical applications of TC. This study will help researchers recognize the recent trends in the area.
... iEnhancer-EL [7] adopted three feature extraction methods, namely, k-mers, subsequence profile, and PseKNC, and utilized SVM as an individual classifier for ensemble learning prediction. The Enhancer-5step [8] applied the word-embedded representation to biological sequences, specifically by using the FastText tool to extract the 100-dimensional features and then using the supervised method SVM for predictive classification. Tan et al. [9] took six types of dinucleotide physical and chemical properties as input characteristics and employed a deep recursive neural network-based classifier integration model, which achieved good results. ...
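FastText represents each word by its character n-grams in addition to the word itself, wrapping the word in `<` and `>` boundary markers first. The sketch below shows only that subword extraction step for a DNA 'word'; it does not train embeddings. The function name is hypothetical, and the parameter names `minn` and `maxn` and their default of 3 are chosen for illustration, not taken from the cited work.

```python
def subword_ngrams(word: str, minn: int = 3, maxn: int = 3) -> list:
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)          # each n-gram length in turn
        for i in range(len(wrapped) - n + 1)    # slide a window over the wrapped word
    ]


print(subword_ngrams("ACGT"))
print(subword_ngrams("ACG", minn=2, maxn=3))
```

In FastText, a word's vector is built from the vectors of these n-grams together with the whole word, so two words that share subwords also share part of their representation.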
Article
Full-text available
Enhancers are a class of noncoding DNA elements located near structural genes. In recent years, their identification and classification have been the focus of research in the field of bioinformatics. However, due to their high free scattering and position variability, although the performance of the prediction model has been continuously improved, there is still a lot of room for progress. In this paper, density-based spatial clustering of applications with noise (DBSCAN) was used to screen the physicochemical properties of dinucleotides to extract dinucleotide-based auto-cross covariance (DACC) features; then, the features are reduced by feature selection Python toolkit MRMD 2.0. The reduced features are input into the random forest to identify enhancers. The enhancer classification model was built by word2vec and attention-based Bi-LSTM. Finally, the accuracies of our enhancer identification and classification models were 77.25% and 73.50%, respectively, and the Matthews’ correlation coefficients (MCCs) were 0.5470 and 0.4881, respectively, which were better than the performance of most predictors.
Article
Full-text available
Enhancing early selection through genomic estimated breeding values is pivotal for reducing generation intervals and accelerating breeding programs. Recently, deep learning (DL) approaches have gained prominence in genomic prediction (GP). Here, we introduce a novel DL framework for GP based on Elastic Net feature selection and bidirectional encoder representations from transformer's embedding and multi-head attention pooling (EBMGP). EBMGP applies Elastic Net for the selection of features, thereby diminishing the computational burden and bolstering the predictive accuracy. In EBMGP, SNPs are treated as “words,” and groups of adjacent SNPs with similar LD levels are considered “sentences.” By applying bidirectional encoder representations from transformers embeddings, this method models SNPs in a manner analogous to human language, capturing complex genetic interactions at both the “word” and “sentence” scales. This flexible representation seamlessly integrates into any DL network and demonstrates a marked improvement in predictive performance for EBMGP and SoyDNGP compared to the widely used one-hot representation. We propose multi-head attention pooling, which can adaptively assign weights to features while learning features from multiple subspaces through multi-heads for a high level of semantic understanding. In a comprehensive comparative analysis across four diverse plant and animal datasets, EBMGP outperformed competing models in 13 out of 16 tasks, achieving accuracy gains ranging from 0.74 to 9.55% over the second-best model. These results underscore EBMGP’s robustness in genomic prediction and highlight its potential for deep learning applications in life sciences.
Article
Full-text available
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and understanding the factors that can modify it. This analysis helps in early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for comprehensive literature that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI-paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks.
To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents word embeddings and language models applications across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of 39 word embeddings and 67 language models based predictive pipeline performance values as well as top performing traditional sequence encoding-based predictors and their performances across 44 DNA sequence analysis tasks.
Article
Full-text available
Enhancers are short genomic segments located in non-coding regions of the genome that play a critical role in regulating the expression of target genes. Despite their importance in transcriptional regulation, effective methods for classifying enhancer categories and regulatory strengths remain limited. To address this challenge, we propose a novel end-to-end deep learning architecture named DeepEnhancerPPO. The model integrates ResNet and Transformer modules to extract local, hierarchical, and long-range contextual features. Following feature fusion, we employ Proximal Policy Optimization (PPO), a reinforcement learning technique, to reduce the dimensionality of the fused features, retaining the most relevant features for downstream classification tasks. We evaluate the performance of DeepEnhancerPPO from multiple perspectives, including ablation analysis, independent tests, assessment of PPO’s contribution to performance enhancement, and interpretability of the classification results. Each module positively contributes to the overall performance, with ResNet and PPO being the most significant contributors. Overall, DeepEnhancerPPO demonstrates superior performance on independent datasets compared to other models, outperforming the second-best model by 6.7% in accuracy for enhancer category classification. The model consistently ranks among the top five classifiers out of 25 for enhancer strength classification without requiring re-optimization of the hyperparameters and ranks as the second-best when the hyperparameters are refined. This indicates that the DeepEnhancerPPO framework is highly robust for enhancer classification. Additionally, the incorporation of PPO enhances the interpretability of the classification results.
Article
N4-methylcytosine (4mC) is a chemical modification that occurs on one of the four nucleotide bases in DNA and plays a vital role in DNA expression, repair, and replication. It also actively participates in the regulation of cell differentiation and gene expression. Consequently, it is important to understand the role of 4mC in epigenetic regulation in order to reveal the complexities of gene expression and its associated cellular operations. However, the inherent resource requirements and time constraints of the experimental procedure present challenges to the cellular culture process. While data-driven methodologies present promising solutions to mitigate the demand for extensive experimental efforts, their performance relies on the suitability and existence of high-quality data. This study presents a multi-model framework that integrates a convolutional neural network (CNN) with distributed k-mer and embedding feature extraction techniques to enhance the identification of 4mC sites in DNA sequences. The integration of k-mers ensures the effective representation of local sequence patterns, while the use of embedding enables a more holistic encoding by considering the broader context and semantics of the sequence data. Following this initial step, the distributed representation of the DNA sequence is passed to the CNN, where a set of adaptable filters convolves across the sequence to detect important local patterns. The proposed integrated multi-model framework was applied to six publicly available datasets and evaluated against the cutting-edge 4mCPred, 4mCCNN, iDNA4mC, Meta-4mCpred, DeepTorrent, 4mCPred-SVM, and DMKL-HFIS methods. The evaluation was based on accuracy, specificity, sensitivity, and Matthews Correlation Coefficient. The results demonstrated that the proposed multi-model framework outperformed the state-of-the-art methods, as well as one-hot encoding and the hybrid of one-hot and TNC features, in accurately identifying 4mC sites.
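The four evaluation measures named in this abstract recur in most of the entries listed here. As a reminder of how they relate, the following is a minimal sketch using the standard textbook definitions for a binary classifier; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from any of the cited papers.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, and MCC from confusion counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives, and false negatives (hypothetical inputs).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity (recall): fraction of real positives that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of real negatives that were rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

MCC is often reported alongside accuracy because it uses all four confusion counts and stays informative when the two classes are imbalanced.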
Article
Enhancers are short functional regions (50–1500bp) in the genome that play an important role in activating gene transcription in the presence of transcription factors. Many human diseases, such as cancer and inflammatory bowel disease, are correlated with genetic variations in enhancers. The precise recognition of enhancers provides useful insights for understanding the pathogenesis of human diseases and their treatments. High-throughput experiments are considered essential tools for characterizing enhancers; however, these methods are laborious, costly and time-consuming. Computational methods are considered alternative solutions for accurate and rapid identification of enhancers. Over the past years, numerous computational predictors have been devised for predicting enhancers and their strengths. A comprehensive review and thorough assessment are indispensable to systematically compare the performance of sequence-based bioinformatics tools for enhancers. Given the increasing interest in this domain, we conducted a large-scale analysis and assessment of the state-of-the-art enhancer predictors to evaluate their scalability and generalization power. Additionally, we classified the existing approaches into three main groups: conventional machine learning, ensemble and deep learning-based approaches. Furthermore, the study has focused on exploring the factors that are crucial for developing precise and reliable predictors, such as designing trusted benchmark/independent datasets, feature representation schemes, feature selection methods, classification strategies, evaluation metrics and webservers. Finally, the insights from this review are expected to provide important guidelines to the research community and pharmaceutical companies in general and high-throughput tools for the detection and characterization of enhancers in particular.
Book
This book offers in-depth insight into the world of computational chemistry and the applications of SMILES notation. The author guides readers through a comprehensive journey, from a basic understanding of SMILES notation to techniques for decoding it. In addition, the book describes how machine learning, in particular the Extreme Learning Machine (ELM) and Particle Swarm Optimization (PSO) approaches, can be used to predict the active function of chemical compounds based on SMILES notation. Readers will find details on feature extraction, feature computation, and how normalization affects prediction performance. The book also specifically introduces the main PSO parameters, namely the Modified SAIW inertia weight and the Modified SBAC acceleration coefficients, which strongly influence PSO and improve its performance. The book also covers various case studies and performance tests that compare the PSO-ELM approach with other machine learning algorithms. As such, it is a very useful guide for researchers and scientists interested in exploring the potential of SMILES notation and machine learning in computational chemistry.
Article
Enhancers play an important role in the process of gene expression regulation. In a DNA sequence, the abundance or absence of enhancers and irregularities in the strength of enhancers affect the gene expression process, which leads to the initiation and propagation of diverse types of genetic diseases such as hemophilia, bladder cancer, diabetes and congenital disorders. Enhancer identification and strength prediction through experimental approaches is expensive, time-consuming and error-prone. To accelerate research on enhancer identification and strength prediction, around 19 computational frameworks have been proposed. These frameworks use machine learning and deep learning methods that take raw DNA sequences and predict the presence and strength of enhancers. However, these frameworks still lack performance and are not useful in real-time analysis. This paper presents a novel deep learning framework that uses language modeling strategies for transforming DNA sequences into a statistical feature space. It applies transfer learning by training a language model in an unsupervised fashion to predict a group of nucleotides, also known as a k-mer, based on the context of existing k-mers in a sequence. At the classification stage, it presents a novel classifier that reaps the benefits of two different architectures: a convolutional neural network and an attention mechanism. The proposed framework is evaluated on the enhancer identification benchmark dataset, where it outperforms the existing best-performing framework by 5% and 9% in terms of accuracy and MCC, respectively. Similarly, when evaluated on the enhancer strength prediction benchmark dataset, it outperforms the existing best-performing framework by 4% and 7% in terms of accuracy and MCC, respectively.
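To make the idea of predicting a k-mer from its neighbouring k-mers concrete, here is a minimal sketch of the data preparation step only. It assumes overlapping k-mers with a stride of 1 and a symmetric context window; the function names `kmer_tokens` and `context_target_pairs`, and all parameter values, are hypothetical and are not the cited framework's implementation.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers ("words"); overlapping when stride < k."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def context_target_pairs(tokens, window=2):
    """Build (context k-mers, target k-mer) training pairs.

    For each position, the target is the k-mer at that position and the
    context is up to `window` k-mers on each side of it.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTACGT", k=3)
    for context, target in context_target_pairs(tokens, window=1):
        print(context, "->", target)
```

A language model trained on such pairs never needs enhancer labels, which is what allows the unsupervised pre-training described in the abstract to be followed by a supervised classification stage.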
Article
Background As parts of the cis‐regulatory mechanism of the human genome, interactions between distal enhancers and proximal promoters play a crucial role. Enhancers, promoters, and enhancer‐promoter interactions (EPIs) can be detected using many sequencing technologies and computation models. However, a systematic review that summarizes these EPI identification methods and that can help researchers apply and optimize them is still needed. Results In this review, we first emphasize the role of EPIs in regulating gene expression and describe a generic framework for predicting enhancer‐promoter interaction. Next, we review prediction methods for enhancers, promoters, loops, and enhancer‐promoter interactions using different data features that have emerged since 2010, and we summarize the websites available for obtaining enhancers, promoters, and enhancer‐promoter interaction datasets. Finally, we review the application of the methods for identifying EPIs in diseases such as cancer. Conclusions The advance of computer technology has allowed traditional machine learning, and deep learning methods to be used to predict enhancer, promoter, and EPIs from genetic, genomic, and epigenomic features. In the past decade, models based on deep learning, especially transfer learning, have been proposed for directly predicting enhancer‐promoter interactions from DNA sequences, and these models can reduce the parameter training time required of bioinformatics researchers. We believe this review can provide detailed research frameworks for researchers who are beginning to study enhancers, promoters, and their interactions.
Article
Inflammation is a biologically resistant response to harmful stimuli, such as infection, damaged cells, toxic chemicals, or tissue injuries. Its purpose is to eradicate pathogenic micro-organisms or irritants and facilitate tissue repair. Prolonged inflammation can result in chronic inflammatory diseases. However, wet-laboratory-based treatments are costly and time-consuming and may have adverse side effects on normal cells. In the past decade, peptide therapeutics have gained significant attention due to their high specificity in targeting affected cells without affecting healthy cells. Motivated by the significance of peptide-based therapies, we developed a highly discriminative prediction model called AIPs-SnTCN to predict anti-inflammatory peptides accurately. The peptide samples are encoded using word embedding techniques such as skip-gram and the attention-based bidirectional encoder representations from transformers (BERT). The conjoint triad feature (CTF) also collects structure-based cluster profile features. The fused vector of word embedding and sequential features is formed to compensate for the limitations of single encoding methods. Support vector machine-based recursive feature elimination (SVM-RFE) is applied to choose the ranking-based optimal feature space. The optimized feature space is used to train an improved self-normalized temporal convolutional network (SnTCN). The AIPs-SnTCN model achieved a predictive accuracy of 95.86% and an AUC of 0.97 on the training samples. On the alternate training data set, our model obtained an accuracy of 92.04% and an AUC of 0.96. The proposed AIPs-SnTCN model outperformed existing models with an ∼19% higher accuracy and an ∼14% higher AUC value. The reliability and efficacy of our AIPs-SnTCN model make it a valuable tool for scientists, and it may play a beneficial role in pharmaceutical design and academic research.
Article
Promoters are DNA fragments located near the transcription initiation site. They can be divided into strong and weak promoter types according to transcriptional activation and expression level. Identifying promoters and their strengths in DNA sequences is essential for understanding gene expression regulation. Therefore, it is crucial to further improve the predictive quality of predictors to meet real-world application requirements. Here, we constructed the latest training dataset based on the RegulonDB website, where all the promoters in this dataset have been experimentally validated and their sequence similarity is less than 85%. We used one-hot encoding and nucleotide chemical property and density (NCPD) to represent DNA sequence samples. Additionally, we proposed an ensemble deep learning framework containing a multi-head attention module, a long short-term memory module, and a convolutional neural network module. The results showed that iPSI(2L)-EDL outperformed other existing methods for both promoter prediction and identification of strong and weak promoter types. The AUC and MCC of iPSI(2L)-EDL in identifying promoters were improved by 2.23% and 2.96%, respectively, compared to those of PseDNC-DL on independent testing data, while the AUC and MCC of iPSI(2L)-EDL were increased by 3.74% and 5.86%, respectively, in predicting promoter strength type. The results of ablation experiments indicate that the CNN plays a crucial role in recognizing promoters, and that the importance of different input positions and long-range dependency relationships among features are helpful for recognizing promoters. Furthermore, to make it easier for most experimental scientists to get the results they need, a user-friendly web server has been established and can be accessed at http://47.94.248.117/IPSW(2L)-EDL.
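The two sequence representations mentioned in this abstract can be illustrated with a short sketch. It assumes the commonly used NCP convention (ring structure, functional group, hydrogen bonding) and defines density at position i as the fraction of the first i bases equal to the base at position i; both choices, the feature ordering, and the name `encode_sequence` are assumptions made here, since the abstract does not spell them out.

```python
# One-hot code per nucleotide (column order A, C, G, T is an assumption).
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

# Nucleotide chemical properties, assumed convention:
# (purine ring, amino group, weak hydrogen bonding).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_sequence(sequence):
    """Return one 8-value row per base: one-hot (4) + NCP (3) + density (1)."""
    counts = {}
    rows = []
    for position, base in enumerate(sequence.upper(), start=1):
        counts[base] = counts.get(base, 0) + 1
        # Density: how often this base has occurred so far, divided by position.
        density = counts[base] / position
        rows.append(ONE_HOT[base] + NCP[base] + (density,))
    return rows


if __name__ == "__main__":
    for row in encode_sequence("AACG"):
        print(row)
```

Ambiguous symbols such as `N` are not handled in this sketch and would raise a `KeyError`; a real pipeline would need an explicit policy for them.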
Article
Background Due to the dynamic nature of enhancers, identifying enhancers and their strength are major bioinformatics challenges. With the development of deep learning, several models have facilitated enhancers detection in recent years. However, existing studies either neglect different length motifs information or treat the features at all spatial locations equally. How to effectively use multi-scale motifs information while ignoring irrelevant information is a question worthy of serious consideration. In this paper, we propose an accurate and stable predictor iEnhancer-DCSA, mainly composed of dual-scale fusion and spatial attention, automatically extracting features of different length motifs and selectively focusing on the important features. Results Our experimental results demonstrate that iEnhancer-DCSA is remarkably superior to existing state-of-the-art methods on the test dataset. Especially, the accuracy and MCC of enhancer identification are improved by 3.45% and 9.41%, respectively. Meanwhile, the accuracy and MCC of enhancer classification are improved by 7.65% and 18.1%, respectively. Furthermore, we conduct ablation studies to demonstrate the effectiveness of dual-scale fusion and spatial attention. Conclusions iEnhancer-DCSA will be a valuable computational tool in identifying and classifying enhancers, especially for those not included in the training dataset.
Article
Motivation: Enhancers are vital cis-regulatory elements that regulate gene expression. eRNAs, a type of lncRNAs, are transcribed from enhancer regions in the genome. The tissue-specific expression of eRNAs is crucial in the regulation of gene expression and cancer development. Methods that identify eRNAs based solely on genomic sequence data have high error rates because they do not account for tissue specificity. Specific histone modifications associated with eRNAs offer valuable information for their identification. However, identification of eRNAs using histone modification data requires the use of both RNA-seq and histone modification data. Unfortunately, many public datasets contain only one of these components, which impedes the accurate identification of eRNAs. Results: We introduce DeepITEH, a deep learning framework that leverages RNA-seq data and histone modification data from multiple samples of the same tissue to enhance the accuracy of identifying eRNAs. Specifically, DeepITEH initially categorizes eRNAs into two classes, namely, regularly expressed eRNAs (RE) and accidental eRNAs (AE), using histone modification data from multiple samples of the same tissue. Thereafter, it integrates both sequence and histone modification features to identify eRNAs in specific tissues. To evaluate DeepITEH's performance, we compared it with four existing state-of-the-art enhancer prediction methods, SeqPose, iEnhancer-RD, LSTMAtt, and FRL, on four normal tissues and four cancer tissues. Remarkably, seven of these tissues demonstrated a substantially improved specific eRNA prediction performance with DeepITEH, as compared to other methods. Our findings suggest that DeepITEH can effectively predict potential eRNAs on the human genome, providing insights for studying the eRNA function in cancer. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Background Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally predicted protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements) but currently, we lack benchmarks similar to protein folding competition. Results Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm. A simple convolution neural network is also included in a repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . Conclusions Deep learning techniques revolutionized many biological fields but mainly thanks to the carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, implementation of the simple neural network and a training framework that can be used as a starting point for future research. 
The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.
Article
Motivation Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences. Results In this paper, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different scale k-mers related to their positions via a multi-head attention mechanism. We first evaluate the performance of different scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. As a case study, we discover 30 enhancer motifs via a 3-mer-based model, of which 12 are verified by STREME and JASPAR, demonstrating that our model has the potential to unveil the biological mechanism of enhancers. Availability The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM Supplementary information Supplementary data are available at Bioinformatics Advances online.
Article
Recent work on language models has resulted in state-of-the-art performance on various language tasks. Among these, Bidirectional Encoder Representations from Transformers (BERT) has focused on contextualizing word embeddings to extract the context and semantics of words. On the other hand, post-transcriptional 2'-O-methylation (Nm) RNA modification is important in various cellular tasks and is related to a number of diseases. The existing high-throughput experimental techniques are time-consuming and costly for detecting these modifications and exploring the associated functional processes. Here, to understand the associated biological processes more deeply and quickly, we propose an efficient method, Bert2Ome, to infer 2'-O-methylation RNA modification sites from RNA sequences. Bert2Ome combines a BERT-based model with convolutional neural networks (CNN) to infer the relationship between the modification sites and RNA sequence content. Unlike the methods proposed so far, Bert2Ome treats each given RNA sequence as a text and focuses on improving the modification prediction performance by integrating the pretrained deep learning-based language model BERT. Additionally, our transformer-based approach could infer modification sites across multiple species. According to 5-fold cross-validation, human and mouse accuracies were 99.15% and 94.35%, respectively. Similarly, ROC AUC scores were 0.99 and 0.94 for the same species. Detailed results show that Bert2Ome reduces the time consumed in biological experiments and outperforms the existing approaches across different datasets and species over multiple metrics. Additionally, deep learning approaches such as 2D CNNs are more promising in learning BERT attributes than more conventional machine learning methods. Our code and datasets can be found at https://github.com/seferlab/bert2ome .
Article
Protein-protein interaction (PPI) prediction is essential to understand the functions of proteins in various biological processes and their roles in the development, progression, and treatment of different diseases. To perform economical large-scale PPI analysis, several artificial intelligence-based approaches have been proposed. However, these approaches have limited predictive performance due to the use of ineffective statistical representation learning methods and predictors that lack the ability to extract comprehensive discriminative features. This paper generates statistical representations of protein sequences by applying transfer learning in an unsupervised manner using the FastText embedding generation approach. Furthermore, it presents the “ADH-PPI” classifier, which reaps the benefits of three different neural layers: Long Short-Term Memory, Convolutional, and Self-Attention layers. On benchmark datasets from two different species, the proposed ADH-PPI predictor outperforms existing approaches by 4% in overall accuracy and 6% in Matthews correlation coefficient. In addition, it achieves an overall accuracy increment of 7% on four independent test sets. Availability: ADH-PPI web server is publicly available at https://sds_genetic_analysis.opendfki.de/PPI/
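FastText differs from plain word2vec in that each word is also represented by its character n-grams, which is the "sub-word information" several of the works listed here rely on. The sketch below shows only that decomposition step, assuming the usual convention of wrapping each word in `<` and `>` boundary markers; the function name `subword_ngrams`, the n-gram bounds, and the example word are illustrative assumptions rather than settings reported in the cited paper.

```python
def subword_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word after adding boundary markers.

    In a FastText-style model, the vector of a word is built from the
    vectors of these n-grams (together with the whole word itself).
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


if __name__ == "__main__":
    # A hypothetical 5-residue protein "word".
    print(subword_ngrams("MKVLA", min_n=3, max_n=4))
```

Because n-grams are shared between words, a sequence fragment that never appeared during training can still receive an embedding assembled from n-grams that did.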
Article
Cancer is a serious health concern worldwide; it occurs when cellular modifications cause the irregular growth and division of human cells. Several traditional approaches, such as therapies and wet laboratory-based methods, have been applied to treat cancer cells. However, these methods are considered less effective due to their high cost and diverse side effects. According to recent advancements, peptide-based therapies have attracted the attention of scientists because of their high selectivity. Peptide therapy can efficiently treat the targeted cells without affecting the normal cells. Due to the rapid increase in peptide sequences, building an accurate prediction model has become a challenging task. Given the significance of anticancer peptides (ACPs) in cancer treatment, an intelligent and reliable prediction model is indispensable. In this paper, a FastText-based word embedding strategy has been employed to represent each peptide sample via a skip-gram model. After extracting the peptide embedding descriptors, a deep neural network (DNN) model was applied to accurately discriminate the ACPs. The optimized parameters of the DNN achieved an accuracy of 96.94%, 93.41%, and 94.02% using training, alternate, and independent samples, respectively. It was observed that our proposed cACP-DeepGram model reported ~10% higher prediction accuracy than existing predictors. It is suggested that the cACP-DeepGram model will be a reliable tool for scientists and might play a valuable role in academic research and drug discovery. The source code and the datasets are publicly available at https://github.com/shahidakbarcs/cACP-DeepGram.
Article
Deep exploration of histone occupancy and covalent post-translational modifications (e.g., acetylation, methylation) is essential to decode gene expression regulation, chromosome packaging, DNA damage, and transcriptional activation. Existing computational approaches are unable to precisely predict histone occupancy and modifications mainly due to the use of sub-optimal statistical representation of histone sequences. For the establishment of an improved histone occupancy and modification landscape for multiple histone markers, the paper in hand presents an end-to-end computational multi-paradigm framework “Histone-Net”. To learn local and global residue context aware sequence representation, Histone-Net generates unsupervised higher order residue embeddings (DNA2Vec) and presents a different application of language modelling, where it encapsulates histone occupancy and modification information while generating higher order residue embeddings (SuperDNA2Vec) in a supervised manner. We perform an intrinsic and extrinsic evaluation of both presented distributed representation learning schemes. A comprehensive empirical evaluation of Histone-Net over ten benchmark histone markers data sets for three different histone sequence analysis tasks indicates that SuperDNA2Vec sequence representation and softmax classifier-based approach outperforms state-of-the-art approach by an average accuracy of 7%. To eliminate the overhead of training separate binary classifiers for all ten histone markers, Histone-Net is evaluated in multi-label classification paradigm, where it produces decent performance for simultaneous prediction of histone occupancy, acetylation, and methylation.
Article
Thermophilic proteins (TPPs) are critical for basic research and in the food industry due to their ability to maintain a thermodynamically stable fold at extremely high temperatures. Thus, the expeditious identification of novel TPPs through computational models from protein sequences is very desirable. Over the last few decades, a number of computational methods, especially machine learning (ML)-based methods, for in silico prediction of TPPs have been developed. Therefore, it is desirable to revisit these methods and summarize their advantages and disadvantages in order to further develop new computational approaches to achieve more accurate and improved prediction of TPPs. With this goal in mind, we comprehensively investigate a large collection of fourteen state-of-the-art TPP predictors in terms of their dataset size, feature encoding schemes, feature selection strategies, ML algorithms, evaluation strategies and web server/software usability. To the best of our knowledge, this article represents the first comprehensive review on the development of ML-based methods for in silico prediction of TPPs. Among these TPP predictors, they can be classified into two groups according to the interpretability of ML algorithms employed (i.e., computational black-box methods and computational white-box methods). In order to perform the comparative analysis, we conducted a comparative study on several currently available TPP predictors based on two benchmark datasets. Finally, we provide future perspectives for the design and development of new computational models for TPP prediction. We hope that this comprehensive review will facilitate researchers in selecting an appropriate TPP predictor that is the most suitable one to deal with their purposes and provide useful perspectives for the development of more effective and accurate TPP predictors.
Article
Enhancers are the primary cis-elements of transcriptional regulation and play a vital role in gene expression at different stages of plant growth and development. Having high locational variation and free scattering in non-encoding genomes, identification of enhancers is a crucial, but challenging work in understanding the biological mechanism of model plants. Recently, applications of neural network models are gaining increasing popularity in predicting the function of genomic elements. Although several computational models have shown great advantages to tackle this challenge, a further study of the identification of rice enhancers from DNA sequences is still lacking. We present RicENN, a novel deep learning framework capable of accurately identifying enhancers of rice, integrating convolution neural networks (CNNs), bi-directional recurrent neural networks (RNNs), and attention mechanisms. A combined-feature representation method was designed to extract the sequence features from original DNA sequences using six types of autocorrelation encodings. Moreover, we verified that the integrated model achieves the best performance by an ablation study. Finally, our deep learning framework realized a reliable prediction of the rice enhancers. The results show RicENN outperforms available alternative approaches in rice species, achieving the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) of 0.960 and 0.960 on cross-validation, and 0.879 and 0.877 during independent tests, respectively. This study develops a hybrid model to combine the merits of different neural network architectures, which shows the potential ability to apply deep learning in bioinformatic sequences and contributes to the acceleration of functional genomic studies of rice. RicENN and its code are freely accessible at http://bioinfor.aielab.cc/RicENN/ .
Article
Motivation: DNA replication is a key step to maintain the continuity of genetic information between parental generation and offspring. The initiation site of DNA replication, also called origin of replication (ORI), plays an extremely important role in the basic biochemical process. Thus, rapidly and effectively identifying the location of ORI in genome will provide key clues for genome analysis. Although biochemical experiments could provide detailed information for ORI, it requires high experimental cost and long experimental period. As good complements to experimental techniques, computational methods could overcome these disadvantages. Results: Thus, in this study, we developed a predictor called iORI-PseKNC2.0 to identify ORIs in the Saccharomyces cerevisiae (S. cerevisiae) genome based on sequence information. The pseudo k-tuple nucleotide composition (PseKNC) including 90 physicochemical properties was proposed to formulate ORI and non-ORI samples. In order to improve the accuracy, a two-step feature selection was proposed to exclude redundant and noise information. As a result, the overall success rate of 88.53% was achieved in the 5-fold cross-validation test by using support vector machine (SVM). Availability: Based on the proposed model, a user-friendly webserver was established and can be freely accessed at http://lin-group.cn/server/iORI-PseKNC2.0. The webserver will provide more convenience to most of wet-experimental scholars.
Article
Motivation: Antibiotic resistance constitutes a major public health crisis, and finding new sources of antimicrobial drugs is crucial to solving it. Bacteriocins, which are bacterially produced antimicrobial peptide products, are candidates for broadening the available choices of antimicrobials. However, the discovery of new bacteriocins by genomic mining is hampered by their sequences' low complexity and high variance, which frustrates sequence similarity-based searches. Results: Here we use word embeddings of protein sequences to represent bacteriocins, and apply a word embedding method that accounts for amino acid order in protein sequences, to predict novel bacteriocins from protein sequences without using sequence similarity. Our method predicts, with a high probability, six yet unknown putative bacteriocins in Lactobacillus. More generally, the representation of sequences with word embeddings that preserve sequence order information can be applied to peptide and protein classification problems for which sequence similarity cannot be used. Availability: Data and source code for this project are freely available at: https://github.com/nafizh/Bacteriocin_paper. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Schizophrenia (SCZ) is a devastating genetic mental disorder. Identification of the SCZ risk genes in brains is helpful to understand this disease. Thus, we first used the minimum Redundancy-Maximum Relevance (mRMR) approach to integrate the genome-wide sequence analysis results on SCZ and the expression quantitative trait locus (eQTL) data from ten brain tissues to identify the genes related to SCZ. Second, we adopted the variance inflation factor regression algorithm to identify their interacting genes in brains. Third, using multiple analysis methods, we explored and validated their roles. By means of the aforementioned procedures, we have found that (1) the cerebellum may play a crucial role in the pathogenesis of SCZ and (2) ITIH4 may be utilized as a clinical biomarker for the diagnosis of SCZ. These interesting findings may stimulate novel strategy for developing new drugs against SCZ. It has not escaped our notice that the approach reported here is of use for studying many other genome diseases as well.
Article
Full-text available
Motivation: Kinase-regulated phosphorylation is a ubiquitous type of post-translational modification (PTM) in both eukaryotic and prokaryotic cells. Phosphorylation plays fundamental roles in many signalling pathways and biological processes, such as protein degradation and protein-protein interactions. Experimental studies have revealed that signalling defects caused by aberrant phosphorylation are highly associated with a variety of human diseases, especially cancers. In light of this, a number of computational methods aiming to accurately predict protein kinase family-specific or kinase-specific phosphorylation sites have been established, thereby facilitating phosphoproteomic data analysis. Results: In this work, we present Quokka, a novel bioinformatics tool that allows users to rapidly and accurately identify human kinase family-regulated phosphorylation sites. Quokka was developed by using a variety of sequence scoring functions combined with an optimized logistic regression algorithm. We evaluated Quokka based on well-prepared up-to-date benchmark and independent test datasets, curated from the Phospho.ELM and UniProt databases, respectively. The independent test demonstrates that Quokka improves the prediction performance compared with state-of-the-art computational tools for phosphorylation prediction. In summary, our tool provides users with high-quality predicted human phosphorylation sites for hypothesis generation and biological validation. Availability: The Quokka webserver and datasets are freely available at http://quokka.erc.monash.edu/.
Article
Full-text available
Background: Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important to deciphering gene regulation, cell differentiation and disease mechanisms. Currently, it is a challenging task to distinguish true interactions from other nearby non-interacting ones since the power of traditional experimental methods is limited due to low resolution or low throughput. Results: We propose a novel computational framework EP2vec to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method in natural language processing. Then, we train a classifier to predict EPIs using the learned representations in a supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841 to 0.933 on different datasets, which outperforms existing methods. We prove the robustness of sequence embedding features by carrying out sensitivity analysis. Besides, we identify motifs that represent cell line-specific information through analysis of the learned sequence embedding features by adopting an attention mechanism. Last, we show that even superior performance with F1 scores of 0.889 to 0.940 can be achieved by combining sequence embedding features and experimental features. Conclusions: EP2vec sheds light on feature extraction for DNA sequences of arbitrary lengths and provides a powerful approach for EPI identification.
Article
Full-text available
Inflammatory bowel disease (IBD) is a chronic intestinal disorder, with two main types: Crohn's disease (CD) and ulcerative colitis (UC), whose molecular pathology is not well understood. The majority of IBD-associated SNPs are located in non-coding regions and are hard to characterize since regulatory regions in IBD are not known. Here we profile transcription start sites (TSSs) and enhancers in the descending colon of 94 IBD patients and controls. IBD-upregulated promoters and enhancers are highly enriched for IBD-associated SNPs and are bound by the same transcription factors. IBD-specific TSSs are associated to genes with roles in both inflammatory cascades and gut epithelia while TSSs distinguishing UC and CD are associated to gut epithelia functions. We find that as few as 35 TSSs can distinguish active CD, UC, and controls with 85% accuracy in an independent cohort. Our data constitute a foundation for understanding the molecular pathology, gene regulation, and genetics of IBD.
Article
Full-text available
Motivation: Efflux proteins play a key role in pumping xenobiotics out of cells. The prediction of efflux family proteins involved in the transport of compounds is crucial for understanding family structures, functions and energy dependencies. Many methods have been proposed to classify efflux pump transporters without considering the pump specificity of efflux protein families. In other words, efflux proteins protect cells from extrusion of foreign chemicals. Moreover, almost all efflux protein families have the same structure based on the analysis of significant motifs. The motif sequences consisting of the same number of residues will have high degrees of residue similarity and thus will affect the classification process. Consequently, it is challenging but vital to recognize the structures and determine energy dependencies of efflux protein families. In order to efficiently identify efflux protein families while considering pump specificity, we developed a 2D convolutional neural network (2D CNN) model called DeepEfflux. DeepEfflux tries to capture the motifs of sequences around hidden target residues to use as hidden features of families. In addition, the 2D CNN model uses a position-specific scoring matrix (PSSM) as an input. Three different datasets, one for each family of efflux protein, were fed into DeepEfflux, and then a 5-fold cross-validation approach was used to evaluate the training performance. Results: The model evaluation results show that DeepEfflux outperforms traditional machine learning algorithms. Furthermore, the accuracies of 96.02%, 94.89% and 90.34% for classes A, B and C, respectively, in the independent test results show that our model can perform well and can be used as a reliable tool for identifying families of efflux proteins in transporters. Availability and implementation: The online version of deepefflux is available at http://deepefflux.irit.fr. The source code of deepefflux is available both on the deepefflux website and at http://140.138.155.216/deepefflux/. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. Availability and implementation: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . Contact: habibima@informatik.hu-berlin.de.
Article
Full-text available
Motivation: A large number of distal enhancers and proximal promoters form enhancer-promoter interactions to regulate target genes in the human genome. Although recent high-throughput genome-wide mapping approaches have allowed us to more comprehensively recognize potential enhancer-promoter interactions, it is still largely unknown whether sequence-based features alone are sufficient to predict such interactions. Results: Here, we develop a new computational method (named PEP) to predict enhancer-promoter interactions based on sequence-based features only, when the locations of putative enhancers and promoters in a particular cell type are given. The two modules in PEP (PEP-Motif and PEP-Word) use different but complementary feature extraction strategies to exploit sequence-based information. The results across six different cell types demonstrate that our method is effective in predicting enhancer-promoter interactions as compared to the state-of-the-art methods that use functional genomic signals. Our work demonstrates that sequence-based features alone can reliably predict enhancer-promoter interactions genome-wide, which could potentially facilitate the discovery of important sequence determinants for long-range gene regulation. Availability and implementation: The source code of PEP is available at: https://github.com/ma-compbio/PEP . Contact: jianma@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
Motivation: The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified molecular input line entry system (SMILES)-based method to represent ligands and a novel method to compute similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word-embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in protein clustering task using TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, Basic local alignment tool and ProtVec, and two compound fingerprint-based protein representation methods are compared. Results: We showed that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence or structure-based representation of proteins and this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation. Availability and implementation: https://github.com/hkmztrk/SMILESVecProteinRepresentation. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
The molecular structure of macromolecules in living cells is ambiguous unless we classify them in a scientific manner. Signal peptides are of vital importance in determining the behavior of newly formed proteins towards their destined path in cellular and extracellular location in both eukaryotes and prokaryotes. In the present research work, a novel method is offered to foreknow the behavior of signal peptides and determine their cleavage site. The proposed model employs neural networks using isolated sets of prokaryote and eukaryote primary sequences. Protein sequences are classified as secretory or non-secretory in order to investigate secretory proteins and their signal peptides. In comparison with the previous prediction tools, the proposed algorithm is more rigorous, well-organized, significantly appropriate and highly accurate for the examination of signal peptides even in extensive collection of protein sequences.
Article
Full-text available
Large-scale sequencing studies discovered substantial genetic variants occurring in enhancers which regulate genes via long range chromatin interactions. Importantly, such variants could affect enhancer regulation by changing transcription factor bindings or enhancer hijacking, and in turn, make an essential contribution to disease progression. To facilitate better usage of published data and exploring enhancer deregulation in various human diseases, we created DiseaseEnhancer (http://biocc.hrbmu.edu.cn/DiseaseEnhancer/), a manually curated database for disease-associated enhancers. As of July 2017, DiseaseEnhancer includes 847 disease-associated enhancers in 143 human diseases. Database features include basic enhancer information (i.e. genomic location and target genes); disease types; associated variants on the enhancer and their mediated phenotypes (i.e. gain/loss of enhancer and the alterations of transcription factor bindings). We also include a feature on our website to export any query results into a file and download the full database. DiseaseEnhancer provides a promising avenue for researchers to facilitate the understanding of enhancer deregulation in disease pathogenesis, and identify new biomarkers for disease diagnosis and therapy.
Article
Full-text available
Background: Studies have shown that enhancers are significant regulatory elements that play crucial roles in gene expression regulation. Since enhancers are unrelated to the orientation and distance to their target genes, it is challenging for researchers to accurately predict distal enhancers. In the past years, with the development of high-throughput ChIP-seq technologies, several computational techniques have emerged to predict enhancers using epigenetic or genomic features. Nevertheless, the inconsistency of computational models across different cell lines and the unsatisfactory prediction performance call for further research in this area. Results: Here, we propose a new Deep Belief Network (DBN) based computational method for enhancer prediction, which is called EnhancerDBN. This method combines diverse features, composed of DNA sequence compositional features, DNA methylation and histone modifications. Our computational results indicate that 1) EnhancerDBN outperforms 13 existing methods in prediction, and 2) GC content and DNA methylation can serve as relevant features for enhancer prediction. Conclusion: Deep learning is effective in boosting the performance of enhancer prediction.
Article
Full-text available
In recent years, deep learning has become a modern machine learning technique used in a variety of fields with state-of-the-art performance. Therefore, using deep learning to enhance performance is also an important direction for the bioinformatics field. In this study, we use deep learning via convolutional neural networks and position-specific scoring matrices to identify electron transport proteins, which carry out an important molecular function among transmembrane proteins. Our deep learning method yields a precise model for identifying electron transport proteins, achieving a sensitivity of 80.3%, specificity of 94.4%, accuracy of 92.3%, and MCC of 0.71 on an independent dataset. The proposed technique can serve as a powerful tool for identifying electron transport proteins and can help biologists understand the function of the electron transport proteins. Moreover, this study provides a basis for further research that can enrich the field of applying deep learning in bioinformatics. © 2017 Wiley Periodicals, Inc.
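The four measures quoted in this abstract (sensitivity, specificity, accuracy and MCC) are all derived from the confusion matrix of a binary classifier. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not the counts behind the reported figures.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=80, fn=20, tn=90, fp=10)
```

On imbalanced test sets, accuracy can look high while MCC stays moderate, which is one reason abstracts of this kind tend to report all four values together.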
Article
Full-text available
Pse-in-One 2.0 is a package of web-servers evolved from Pse-in-One (Liu, B., Liu, F., Wang, X., Chen, J. Fang, L. & Chou, K.C. Nucleic Acids Research, 2015, 43:W65-W71). In order to make it more flexible and comprehensive as suggested by many users, the updated package has incorporated 23 new pseudo component modes as well as a series of new feature analysis approaches. It is available at http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/. Moreover, to maximize the convenience of users, provided is also the stand-alone version called “Pse-in-One-Analysis”, by which users can significantly speed up the analysis of massive sequences.
Article
Full-text available
Involved in important cellular or gene functions and implicated in many kinds of cancers, piRNAs, or piwi-interacting RNAs, are small non-coding RNAs of around 19-33 nucleotides in length. Given a small non-coding RNA molecule, can we predict whether it is a piRNA according to its sequence information alone? Furthermore, there are two types of piRNA: one has the function of instructing target mRNA deadenylation, and the other does not. Can we discriminate one from the other? With the avalanche of RNA sequences emerging in the postgenomic age, it is urgent to address the two problems for both basic research and drug development. Unfortunately, to the best of our knowledge, no computational method so far can deal with the second problem, let alone the two problems together. Here, by incorporating the physicochemical properties of nucleotides into the pseudo K-tuple nucleotide composition (PseKNC), we proposed a powerful predictor called 2L-piRNA. It is a two-layer ensemble classifier, in which the 1st layer identifies whether a query RNA molecule is a piRNA or non-piRNA, and the 2nd layer identifies whether a piRNA has the function of instructing target mRNA deadenylation or not. Rigorous cross validations have indicated that the success rates achieved by the proposed predictor are quite high. For the convenience of most biologists and drug development scientists, the web-server for 2L-piRNA has been established at http://bioinformatics.hitsz.edu.cn/2L-piRNA/, by which users can easily get their desired results without the need to go through the mathematical details.
Article
Full-text available
Recommended by the World Health Organization (WHO), drug compounds have been classified into 14 main ATC (Anatomical Therapeutic Chemical) classes according to their therapeutic and chemical characteristics. Given an uncharacterized compound, can we develop a computational method to quickly identify which ATC class or classes it belongs to? The information thus obtained will help adjust our focus and selection in a timely manner, significantly speeding up the drug development process. But this problem is by no means an easy one since some drug compounds may belong to two or more ATC classes. To address this problem, using the DO (Drug Ontology) approach based on the ChEBI (Chemical Entities of Biological Interest) database, we developed a predictor called iATC-mDO. Subsequently, hybridizing it with an existing drug ATC classifier, we constructed a predictor called iATC-mHyb. It has been demonstrated by rigorous cross-validation and from five different measuring angles that iATC-mHyb is remarkably superior to the best existing predictor in identifying the ATC classes for drug compounds. For the convenience of most experimental scientists, a user-friendly web-server for iATC-mHyb has been established at http://www.jci-bioinfo.cn/iATC-mHyb, by which users can easily get their desired results without the need to go through the complicated mathematical equations involved.
Article
Full-text available
There are many different types of RNA modifications, which are essential for numerous biological processes. Knowledge about the occurrence sites of RNA modifications in its sequence is a key for in-depth understanding their biological functions and mechanism. Unfortunately, it is both time-consuming and laborious to determine these sites purely by experiments alone. Although some computational methods were developed in this regard, they each could only be used to deal with some type of modification individually. To our best knowledge, so far no method whatsoever has been developed that can identify the occurrence sites for several different types of RNA modifications with one seamless package or platform. To address such a challenge, a novel platform called “iRNA-PseColl” has been developed. It was formed by incorporating both the individual and collective features of the sequence elements into the general pseudo K-tuple nucleotide composition (PseKNC) of RNA via the chemicophysical properties and density distribution of its constituent nucleotides. Rigorous cross-validations have indicated that the anticipated success rates achieved by the proposed platform are quite high. To maximize the convenience for most experimental biologists, the platform’s web-server has been provided at http://lin.uestc.edu.cn/server/iRNA-PseColl along with a step-by-step user guide, by which users can easily get their desired results without the need to go through the mathematical details involved in this paper.
Article
S-Palmitoylation is a uniquely reversible and biologically important post-translational modification as it plays an essential role in a variety of cellular processes including signal transduction, protein-membrane interactions, neuronal development, lipid raft targeting, subcellular localization and apoptosis. Due to its association with the neuronal development, it plays a pivotal role in a variety of neurodegenerative diseases, mainly Alzheimer's, Schizophrenia and Huntington's disease. It is also essential for developmental life cycles and pathogenesis of Toxoplasma gondii and Plasmodium falciparum, known to cause toxoplasmosis and malaria, respectively. This depicts the strong biological significance of S-Palmitoylation, thus, the timely and accurate identification of S-palmitoylation sites is crucial. Herein, we propose a predictor for S-Palmitoylation sites in proteins namely SPalmitoylC-PseAAC by integrating the Chou's Pseudo Amino Acid Composition (PseAAC) and relative/absolute position-based features. Self-consistency testing and 10-fold cross-validation are performed to evaluate the performance of SPalmitoylC-PseAAC, using accuracy metrics. For self-consistency testing, 99.79% Acc, 99.77% Sp, 99.80% Sn and 1.00 MCC was observed, whereas, for 10-fold cross validation 97.22% Acc, 98.85% Sp, 95.80% Sn and 0.94 MCC was observed. Thus the proposed predictor can help in predicting the palmitoylation sites in an efficient and accurate way. The SPalmitoylC-PseAAC is available at (biopred.org/palm).
Article
Objective: Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called "pLoc-mVirus" was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as "multiplex proteins", may simultaneously occur in, or move between, two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. Methods: Using the general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called "pLoc_bal-mVirus" for predicting the subcellular localization of multi-label virus proteins. Results: Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose. Conclusion: Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_bal-mVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and for in-depth understanding of the biological processes in a cell.
Article
The structure of a protein gains additional stability against various detrimental effects through the presence of disulfide bonds. The formation of correct disulfide bonds between cysteine residues ensures proper in vivo and in vitro folding of the protein. Many cysteine residues can be present in the polypeptide chain of a protein; however, not all cysteine residues are involved in the formation of a disulfide bond, and therefore accurate prediction of these bonds is crucial for identifying biophysical characteristics of a protein. In the present study, a novel method is proposed for the accurate prediction of intramolecular disulfide bonds using statistical moments and PseAAC. The pSSbond-PseAAC uses PseAAC along with position- and composition-relative features to calculate statistical moments. Statistical moments are important as they are very sensitive to the position of data in sequences, and for the prediction of intramolecular disulfide bonds the moments are combined to train neural networks. The overall accuracy of the pSSbond-PseAAC is 98.97%, with a sensitivity of 98.92%, a specificity of 98.99% and an MCC of 0.98, and it outperforms various previously reported studies.
Article
Animal toxin proteins are disulfide-rich small peptides that are detected in venomous species. They are used as pharmacological tools and therapeutic agents in medicine because of the high specificity of their targets. The successful analysis and prediction of toxin proteins may be of great significance for pharmacological and therapeutic research on toxins. In this study, significant differences were found between the toxins and the non-toxins in amino acid compositions and several important biological properties. The random forest was first proposed to predict the animal toxin proteins by selecting 400 pseudo amino acid compositions and the dipeptide compositions of reduced amino acid alphabet as the input parameters. Based on the dipeptide composition of a reduced amino acid alphabet with 13 reduced amino acids, the best overall accuracy of 85.71% was obtained. These results indicated that our algorithm was an efficient tool for animal toxin prediction.
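A dipeptide composition over a reduced amino acid alphabet can be sketched as follows. The abstract does not say how the 13 reduced amino acids are grouped, so the six-group mapping below is purely an assumption chosen for illustration, as are the names `GROUPS` and `reduced_dipeptide_composition`.

```python
from itertools import product

# Hypothetical grouping of the 20 amino acids into 6 classes.
# This is NOT the 13-letter alphabet used in the study, which is not given here.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # glycine and proline
}
TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def reduced_dipeptide_composition(seq):
    """Return normalized frequencies of adjacent group pairs.

    Residues outside the 20 standard amino acids are dropped before pairing,
    so their neighbours become adjacent in this simplified sketch.
    """
    reduced = [TO_GROUP[aa] for aa in seq.upper() if aa in TO_GROUP]
    labels = sorted(GROUPS)
    pairs = [a + b for a, b in product(labels, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for a, b in zip(reduced, reduced[1:]):
        counts[a + b] += 1
    total = max(len(reduced) - 1, 1)
    return {pair: counts[pair] / total for pair in pairs}


# Toy peptide: A, K, D, G map to groups a, k, d, g.
comp = reduced_dipeptide_composition("AKDG")
```

With n groups the feature vector has n squared entries, so a 13-letter alphabet would give 169 dipeptide features compared with 400 for the full 20-letter alphabet.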
Article
Identifying the location of proteins in a cell plays an important role in understanding their functions, such as drug design, therapeutic target discovery and biological research. However, the traditional subcellular localization experiments are time-consuming, laborious and small scale. With the development of next-generation sequencing technology, the number of proteins has grown exponentially, which lays the foundation of the computational method for identifying protein subcellular localization. Although many methods for predicting subcellular localization of proteins have been proposed, most of them are limited to single-location. In this paper, we propose a multi-kernel SVM to predict subcellular localization of both multi-location and single-location proteins. First, we make use of the evolutionary information extracted from position specific scoring matrix (PSSM) and physicochemical properties of proteins, by Chou's general PseAAC and other efficient functions. Then, we propose a multi-kernel support vector machine (SVM) model to identify multi-label protein subcellular localization. As a result, our method has a good performance on predicting subcellular localization of proteins. It achieves an average precision of 0.7065 and 0.6889 on two human datasets, respectively. All results are higher than those achieved by other existing methods. Therefore, we provide an efficient system via a novel perspective to study the protein subcellular localization.
Article
Protein S-sulfenylation is an essential post-translational modification (PTM) that provides critical information to understand molecular mechanisms of cell signaling transduction, stress response and regulation of cellular functions. Recent advancements in computational methods have contributed towards the detection of protein S-sulfenylation sites. However, the performance of identifying protein S-sulfenylation sites can be influenced by a class imbalance of training datasets while the application of various computational methods. In this study, we designed a Fu-SulfPred model using stratified structure of three kinds of decision trees in order to identify possible protein S-sulfenylation sites by means of reconstructing training datasets and sample rescaling technology. Experimental results showed that the correlation coefficient values of Fu-SulfPred model were found to be 0.5437, 0.3736 and 0.6809 on three independent test datasets, respectively, all of which outperformed the Matthews coefficient values of S-SulfPred model. Fu-SulfPred model provides a promising scheme for the identification of protein S-sulfenylation sites and other post-translational modifications.
Article
Protein structural class could provide important clues for understanding protein fold, evolution and function. However, it is still a challenging problem to accurately predict protein structural classes for low-similarity sequences. This paper was devoted to develop a powerful method to predict protein structural classes for low-similarity sequences. On the basis of a very objective and strict benchmark dataset, we firstly extracted optimal tripeptide compositions (OTC) which was picked out by using feature selection technique to formulate protein samples. And an overall accuracy of 91.1% was achieved in jackknife cross-validation. Subsequently, we investigated the accuracies of three popular features: position-specific scoring matrix (PSSM), predicted secondary structure information (PSSI) and the average chemical shift (ACS) for comparison. Finally, to further improve the prediction performance, we examined all combinations of the four kinds of features and achieved the maximum accuracy of 96.7% in jackknife cross-validation by combining OTC with ACS, demonstrating that the model is efficient and powerful. Our study will provide an important guide to extract valuable information from protein sequences.
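Jackknife (leave-one-out) cross-validation, as used in this abstract, holds out each sample once, trains on the rest, and reports the fraction of held-out samples predicted correctly. The sketch below uses a 1-nearest-neighbor rule as a stand-in classifier and made-up two-dimensional features; both are assumptions for illustration and are unrelated to the OTC or ACS features of the study.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Predict the label of the closest training point (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def jackknife_accuracy(xs, ys, predict):
    """Leave each sample out once and score the prediction made without it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy data: two well-separated classes.
xs = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
ys = ["a", "a", "b", "b"]
acc = jackknife_accuracy(xs, ys, nearest_neighbor_predict)
```

Because every sample is tested exactly once against a model that never saw it, the jackknife result is deterministic for a given dataset, unlike randomly partitioned k-fold splits.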
Article
Investigation into the network of protein–protein interactions (PPIs) will provide valuable insights into the inner workings of cells. Accordingly, it is crucially important to develop an automated method or high-throughput tool that can efficiently predict the PPIs. In this study, a new predictor, called “iPPI-PseAAC(CGR)” was developed by incorporating the information of “chaos game representation” into the PseAAC (Pseudo Amino Acid Composition). The advantage by doing so is that some key sequence-order or sequence-pattern information can be more effectively incorporated during the treatment of the protein pair samples. The operation engine used in this predictor is the random forests algorithm. It has been observed via the cross-validations on the widely used benchmark datasets that the success rates achieved by the proposed predictor are remarkably higher than those by its existing counterparts. For the convenience of the most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/iPPI-PseAAC(CGR), by which users can easily get their desired results without the need to go through the detailed mathematics.
Article
Motivation: Transcription termination is an important regulatory step of gene expression. If there is no terminator in a gene, transcription cannot stop, which results in abnormal gene expression. Detecting such terminators can determine the operon structure in bacterial organisms and improve genome annotation. Thus, accurate identification of transcriptional terminators is essential and extremely important in the research of transcription regulations. Results: In this study, we developed a new predictor called "iTerm-PseKNC" based on Support Vector Machine (SVM) to identify transcription terminators. The binomial distribution approach was used to pick out the optimal feature subset derived from the pseudo K-tuple nucleotide composition (PseKNC). The five-fold cross-validation test results showed that our proposed method achieved an accuracy of 95%. To further evaluate the generalization ability of "iTerm-PseKNC", the model was examined on independent datasets of experimentally confirmed Rho-independent terminators from the Escherichia coli (E. coli) and Bacillus subtilis (B. subtilis) genomes. As a result, all the terminators in E. coli and 87.5% of the terminators in B. subtilis were correctly identified, suggesting that the proposed model could become a powerful tool for bacterial terminator recognition. Availability: For the convenience of most wet-experimental researchers, the web-server for "iTerm-PseKNC" was established at http://lin-group.cn/server/iTerm-PseKNC/, by which users can easily obtain their desired result without the need to go through the detailed mathematical equations involved.
Article
As a prevalent post-transcriptional modification, N6-methyladenosine (m6A) plays key roles in a series of biological processes. Although experimental technologies have been developed and applied to identify m6A sites, they are still cost-ineffective for transcriptome-wide detections of m6A. As good complements to the experimental techniques, some computational methods have been proposed to identify m6A sites. However, their performance remains unsatisfactory. In this study, a Euclidean distance based method was first proposed to construct a high quality benchmark dataset. By encoding the RNA sequences using pseudo nucleotide composition, a new predictor called iRNA(m6A)-PseDNC was developed to identify m6A sites in the Saccharomyces cerevisiae genome. It has been demonstrated by the 10-fold cross validation tests that the performance of iRNA(m6A)-PseDNC is superior to the existing methods. Meanwhile, for the convenience of most experimental scientists, its web-server has been established at http://lin-group.cn/server/iRNA(m6A)-PseDNC.php, by which users can easily get their desired results without the need to go through the detailed mathematics. It is anticipated that iRNA(m6A)-PseDNC will become a useful high throughput tool for identifying m6A sites in the S. cerevisiae genome.
Article
Antibiotics of β-lactam class account for nearly half of the global antibiotic use. The β-lactamase enzyme is a major element of the bacterial arsenals to escape the lethal effect of β-lactam antibiotics. Different variants of β-lactamases have evolved to counter the different types of β-lactam antibiotics. Extensive research has been done to isolate and characterize different variants of β-lactamases. Unfortunately, identification and classification of the β-lactamase enzyme are purely based on experiments, which is both time- and resource-consuming. Thus, there is a need for fast and accurate computational methods to identify and classify new β-lactamase enzymes from the avalanche of sequence data generated in the post-genomic era. Based on these considerations, we have developed a support vector machine based three-tier prediction system, BlaPred, to predict and classify (as per Ambler classification) β-lactamases solely from their protein sequences. The input features used were amino acid composition, classic and amphiphilic pseudo amino acid compositions. The results show that the classic pseudo amino acid composition-based models performed better than the other models. Following a leave-one-out cross-validation procedure, the accuracy to discriminate β-lactamases from non-β-lactamases was 93.57% (tier-I); the accuracies for prediction were 93.27% for class A, 95.52% for class B, 96.86% for class C and 97.31% for class D (tier-II); and at tier-III the accuracies for prediction were 84.78%, 95.65% and 89.13% for subclasses B1, B2 and B3, respectively. The comparative results on an independent dataset suggest that our method works efficiently to distinguish β-lactamases from non-β-lactamases, with an overall accuracy of 93.09%, and is further able to classify β-lactamase sequences into their respective Ambler classes and subclasses with accuracy higher than 92% and 87%, respectively. Comparative performance of BlaPred on an independent benchmark dataset also shows a significant improvement over other existing methods. Finally, BlaPred is available as a webserver, as well as standalone software, which can be accessed at http://proteininformatics.org/mkumar/blapred.
Article
Membrane proteins are a vital type of proteins that serve as channels, receptors and energy transducers in a cell. They perform various important functions, which are mainly associated with their types. They are also attractive targets of drug discovery for various diseases. So predicting membrane protein types is a crucial and challenging research area in bioinformatics and proteomics. Because of the vast number of uncharacterized protein sequences in databases, customary biophysical techniques are extremely tedious, costly and vulnerable to mistakes. Consequently, it is very attractive to build a robust, reliable and efficient technique to predict membrane protein types. In this work, a novel feature set, Exchange Group Based Protein Sequence Representation (EGBPSR), is proposed for classification of membrane proteins with two new feature extraction strategies known as Exchange Group Local Pattern (EGLP) and Amino acid Interval Pattern (AIP). Imbalanced and large datasets are often handled well by decision tree classifiers. Since an imbalanced dataset is used, the performance of various decision tree classifiers such as Decision Tree (DT) and Classification and Regression Tree (CART), and of ensemble methods such as Adaboost, Random Under Sampling (RUS) boost, Rotation forest and Random forest, is analyzed. The overall accuracy achieved in predicting membrane protein types is 96.45%.
Article
Motivation: Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. They have important functions in cell development and metabolism, such as genetic markers, genome rearrangements, chromatin modifications, cell cycle regulation, transcription and translation. Their functions are generally closely related to their localization in the cell. Therefore, knowledge about their subcellular locations can provide very useful clues or preliminary insight into their biological functions. Although biochemical experiments could determine the localization of lncRNAs in a cell, they are both time-consuming and expensive. Therefore, it is highly desirable to develop bioinformatics tools for fast and effective identification of their subcellular locations. Results: We developed a sequence-based bioinformatics tool called "iLoc-lncRNA" to predict the subcellular locations of LncRNAs by incorporating the 8-tuple nucleotide features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous jackknife tests have shown that the overall accuracy achieved by the new predictor on a stringent benchmark dataset is 86.72%, which is over 20% higher than that by the existing state-of-the-art predictor evaluated on the same tests. Availability: A user-friendly webserver has been established at http://lin-group.cn/server/iLoc-LncRNA, by which users can easily obtain their desired results. Supplementary information: Supplementary data are available at Bioinformatics online.
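As a rough illustration of the k-tuple nucleotide features that several of these abstracts build on, the sketch below computes plain k-tuple (k-mer) frequencies for a nucleotide sequence. It is not PseKNC itself: the pseudo components and the binomial-distribution feature selection are omitted, and the function name `ktuple_composition` is an assumption made for this example.

```python
from collections import Counter
from itertools import product

def ktuple_composition(seq, k, alphabet="ACGT"):
    """Return the normalized frequency of every k-tuple over `alphabet`,
    ordered lexicographically by the alphabet (AA..., AC..., ...)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made entirely of alphabet letters contribute.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

dinucleotide_vector = ktuple_composition("ACGTAC", 2)
```

For DNA and k = 8, the same call yields a vector of 4^8 = 65,536 entries, which is why a feature selection step is usually needed before training.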
Article
Deep learning has been increasingly used to solve a number of problems with state-of-the-art performance in a wide variety of fields. In biology, deep learning can be applied to reduce feature extraction time and achieve high levels of performance. In our present work, we apply deep learning via two-dimensional convolutional neural networks and position-specific scoring matrices to classify Rab protein molecules, which are main regulators in membrane trafficking for transferring proteins and other macromolecules throughout the cell. The functional loss of specific Rab molecular functions has been implicated in a variety of human diseases, e.g., choroideremia, intellectual disabilities, cancer. Therefore, creating a precise model for classifying Rabs is crucial in helping biologists understand the molecular functions of Rabs and design drug targets according to such specific human disease information. We constructed a robust deep neural network for classifying Rabs that achieved an accuracy of 99%, 99.5%, 96.3%, and 97.6% for each of four specific molecular functions. Our approach demonstrates superior performance to traditional artificial neural networks. Therefore, from our proposed study, we provide both an effective tool for classifying Rab proteins and a basis for further research that can improve the performance of biological modeling using deep neural networks.
Article
Motivation: Identification of enhancers and their strength is important because they play a critical role in controlling gene expression. Although some bioinformatics tools were developed, they are limited in discriminating enhancers from non-enhancers only. Recently, a two-layer predictor called "iEnhancer-2L" was developed that can be used to predict the enhancer's strength as well. However, its prediction quality needs further improvement to enhance the practical application value. Results: A new predictor called "iEnhancer-EL" was proposed that contains two layer predictors: the first one (for identifying enhancers) is formed by fusing an array of six key individual classifiers, and the second one (for their strength) formed by fusing an array of ten key individual classifiers. All these key classifiers were selected from 171 elementary classifiers formed by SVM (Support Vector Machine) based on kmer, subsequence profile, and PseKNC (Pseudo K-tuple Nucleotide Composition), respectively. Rigorous cross-validations have indicated that the proposed predictor is remarkably superior to the existing state-of-the-art one in this area. Availability and implementation: A web server for the iEnhancer-EL has been established at http://bioinformatics.hitsz.edu.cn/iEnhancer-EL/, by which users can easily get their desired results without the need to go through the mathematical details. Contact: bliu@hit.edu.cn, dshuang@tongji.edu.cn or kcchou@gordonlifescience.org. Supplementary information: Supplementary data are available at Bioinformatics online.
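The abstract does not state how the individual classifiers are fused. The sketch below assumes simple majority voting, one common way to combine binary classifiers, purely for illustration; `majority_vote`, `fuse`, the `tie_label` parameter and the toy classifiers are all hypothetical.

```python
from collections import Counter

def majority_vote(predictions, tie_label=0):
    """Return the most frequent label; fall back to `tie_label` on a tie
    (ties are possible with an even number of classifiers, e.g. six)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]

def fuse(classifiers, sample, tie_label=0):
    """Apply each classifier (a callable returning a label) and vote."""
    return majority_vote([clf(sample) for clf in classifiers], tie_label)

# Toy classifiers standing in for trained models (assumption, not real predictors).
toy_classifiers = [
    lambda s: int(s.count("G") >= 2),
    lambda s: int(s.count("C") >= 2),
    lambda s: int(len(s) > 5),
]
label = fuse(toy_classifiers, "GGCATT")
```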
Article
N6-methyladenine (6mA) is one kind of post-replication modification (PTM or PTRM) occurring in a wide range of DNA sequences. Accurate identification of its sites will be very helpful for revealing the biological functions of 6mA, but it is time-consuming and expensive to determine them by experiments alone. Unfortunately, so far, no bioinformatics tool is available to do so. To fill in such an empty area, we have proposed a novel predictor called iDNA6mA-PseKNC that is established by incorporating nucleotide physicochemical properties into Pseudo K-tuple Nucleotide Composition (PseKNC). It has been observed via rigorous cross-validations that the predictor's sensitivity (Sn), specificity (Sp), accuracy (Acc), and stability (MCC) are 93%, 100%, 96%, and 0.93, respectively. For the convenience of most experimental scientists, a user-friendly web server for iDNA6mA-PseKNC has been established at http://lin-group.cn/server/iDNA6mA-PseKNC, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.
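The four metrics named in this abstract (Sn, Sp, Acc, MCC) have standard definitions in terms of confusion-matrix counts. A minimal sketch follows; the counts used in the usage line are made up for illustration and are not taken from any of the papers listed here.

```python
import math

def _ratio(numerator, denominator):
    # Guard against empty classes; returning 0.0 here is a convention.
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy and the Matthews
    correlation coefficient from confusion-matrix counts."""
    sn = _ratio(tp, tp + fn)                   # sensitivity (recall)
    sp = _ratio(tn, tn + fp)                   # specificity
    acc = _ratio(tp + tn, tp + tn + fp + fn)   # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the positive and negative classes differ greatly in size.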
Article
Determining the catalytic residues in an enzyme is critical to our understanding the relationship between protein sequence, structure, function, and enhancing our ability to design novel enzymes and their inhibitors. Although many enzymes have been sequenced, and their primary and tertiary structures determined, experimental methods for enzyme functional characterization lag behind. Because experimental methods used for identifying catalytic residues are resource- and labor-intensive, computational approaches have considerable value and are highly desirable for their ability to complement experimental studies in identifying catalytic residues and helping to bridge the sequence-structure-function gap. In this study, we describe a new computational method called PREvaIL for predicting enzyme catalytic residues. This method was developed by leveraging a comprehensive set of informative features extracted from multiple levels, including sequence, structure, and residue-contact network, in a random forest machine-learning framework. Extensive benchmarking experiments on eight different datasets based on 10-fold cross-validation and independent tests, as well as side-by-side performance comparisons with seven modern sequence- and structure-based methods, showed that PREvaIL achieved competitive predictive performance, with an area under the receiver operating characteristic curve and area under the precision-recall curve ranging from 0.896–0.973 and from 0.294–0.523, respectively. We demonstrated that this method was able to capture useful signals arising from different levels, leveraging such differential but useful types of features and allowing us to significantly improve the performance of catalytic residue prediction. We believe that this new method can be utilized as a valuable tool for both understanding the complex sequence-structure-function relationships of proteins and facilitating the characterization of novel enzymes lacking functional annotations.
Article
Background: Protein S-Sulfenylation, the reversible oxidative modification of cysteine thiol groups to cysteine S-Sulfenic acids, is a post-translational modification (PTM) that plays a critical role in regulating protein function and signal transduction. The identification of specific protein S-sulfenylation sites is crucial to understand the underlying molecular mechanisms. Objective: We sought to develop a computational method that can effectively predict S-sulfenylation sites by using optimally extracted properties. Method: We propose DBN-Sulf, which uses a Deep Belief Network (DBN) with Restricted Boltzmann Machines (RBMs) to reduce the feature dimensions from a combination of heterogeneous information, including amino acid related features, evolutionary features, and structure-based features. Then a support vector machine (SVM) based predictor is built with the optimal features. Results: We evaluate the DBN-Sulf classifier using a training dataset including 1007 positive sites and 7837 negative sites with 5-fold cross validation, and get an AUC score of 0.80, an ACC of 0.85 and a MCC of 0.53, which are significantly better than that of the existing methods. We further validate our method on the independent test set and obtain promising results. Conclusion: The superior performance over existing S-sulfenylation site prediction approaches indicates the importance of the deep belief network-based feature extracting procedure.
Article
Motivation: Cells are deemed the basic unit of life. However, many important functions of cells as well as their growth and reproduction are performed via the protein molecules located at their different organelles or locations. Facing explosive growth of protein sequences, we are challenged to develop fast and effective methods to annotate their subcellular localization. However, this is by no means an easy task. Particularly, mounting evidence has indicated that proteins have a multi-label feature, meaning that they may simultaneously exist at, or move between, two or more different subcellular location sites. Unfortunately, most of the existing computational methods can only be used to deal with the single-label proteins. Although the 'iLoc-Animal' predictor developed recently is quite powerful and can be used to deal with the animal proteins with multiple locations as well, its prediction quality needs to be improved, particularly in enhancing the absolute true rate and reducing the absolute false rate. Results: Here we propose a new predictor called 'pLoc-mAnimal', which is superior to iLoc-Animal as shown by the compelling facts. When tested by the most rigorous cross-validation on the same high-quality benchmark dataset, the absolute true success rate achieved by the new predictor is 37% higher and the absolute false rate is four times lower in comparison with the state-of-the-art predictor. Availability and implementation: To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mAnimal/, by which users can easily get their desired results without the need to go through the complicated mathematics involved. Contact: xxiao@gordonlifescience.org or kcchou@gordonlifescience.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Information of the proteins' subcellular localization is crucially important for revealing their biological functions in a cell, the basic unit of life. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop computational tools for timely identifying their subcellular locations based on the sequence information alone. The current study is focused on the Gram-negative bacterial proteins. Although considerable efforts have been made in protein subcellular prediction, the problem is far from being solved yet. This is because mounting evidences have indicated that many Gram-negative bacterial proteins exist in two or more location sites. Unfortunately, most existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions important for both basic research and drug design. In this study, by using the multi-label theory, we developed a new predictor called "pLoc-mGneg" for predicting the subcellular localization of Gram-negative bacterial proteins with both single and multiple locations. Rigorous cross-validation on a high quality benchmark dataset indicated that the proposed predictor is remarkably superior to "iLoc-Gneg", the state-of-the-art predictor for the same purpose. For the convenience of most experimental scientists, a user-friendly web-server for the novel predictor has been established at http://www.jci-bioinfo.cn/pLoc-mGneg/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Article
Evolution of cis-regulatory elements (such as enhancers) often plays an important role in the production of diverse morphology. However, a mechanistic understanding is often limited by the absence of methods to study enhancers in species outside of established model systems. Here, we sought to establish methods to identify and test enhancer activity in the red flour beetle, Tribolium castaneum. To identify possible enhancer regions, we first obtained genome-wide chromatin profiles from various tissues and stages of Tribolium via FAIRE (Formaldehyde Assisted Isolation of Regulatory Elements)-sequencing. Comparison of these profiles revealed a distinct set of open chromatin regions in each tissue and stage. Second, we established the first reporter assay system that works in both Drosophila and Tribolium, using nubbin in the wing and hunchback in the embryo as case studies. Together, these advances will be useful to study the evolution of cis-regulatory elements and morphological diversity in Tribolium and other insects.
Article
Motivation: A promoter is a short region of DNA responsible for initiating transcription of a particular gene in the genome. Promoters have various types with different functions. Owing to their importance in biological processes, it is highly desired to develop computational tools for timely identifying promoters and their types. Such a challenge has become particularly critical and urgent in facing the avalanche of DNA sequences discovered in the postgenomic age. Although some prediction methods were developed, they can only be used to discriminate a specific type of promoters from non-promoters. None of them has the ability to identify the types of promoters. This is due to the fact that different types of promoters may share quite similar consensus sequence patterns, and that promoters of the same type may have considerably different consensus sequences. Results: To overcome such difficulty, using the multi-window-based PseKNC (pseudo K-tuple nucleotide composition) approach to incorporate the short-, middle-, and long-range sequence information, we have developed a two-layer seamless predictor named "iPromoter-2L". The 1st layer serves to identify a query DNA sequence as a promoter or non-promoter, and the 2nd layer to predict which of the following six types the identified promoter belongs to: σ24, σ28, σ32, σ38, σ54, and σ70. Availability: For the convenience of most experimental scientists, a user-friendly and publicly accessible web-server for the powerful new predictor has been established at http://bioinformatics.hitsz.edu.cn/iPromoter-2L/. It is anticipated that iPromoter-2L will become a very useful high throughput tool for genome analysis. Contact: bliu@hit.edu.cn or dshuang@tongji.edu.cn or kcchou@gordonlifescience.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Article
This paper proposes a simple and efficient approach for text classification and representation learning. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
Article
Protein signal sequences play a central role in the targeting and translocation of nearly all secreted proteins and many integral membrane proteins in both prokaryotes and eukaryotes. The knowledge of signal sequences has become a crucial tool for pharmaceutical scientists who genetically modify bacteria, plants, and animals to produce effective drugs. However, to effectively use such a tool, the first important thing is to find a fast and effective method to identify the “zipcode” entity; this is also evoked by both the huge amount of unprocessed data available and the industrial need to find more effective vehicles for the production of proteins in recombinant systems. In view of this, a sequence-encoded algorithm was developed to identify the signal sequences and predict their cleavage sites. The rate of correct prediction for 1,939 secretory proteins and 1,440 nonsecretory proteins by self-consistency test is 90.14% and that by jackknife test is 90.13%. The encouraging results indicate that the signal sequences share some common features although they lack similarity in sequence, length, and even composition and that they are predictable to a considerably accurate extent. Proteins 2001;42:136–139. © 2000 Wiley-Liss, Inc.
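The jackknife test referred to here, and in several other abstracts on this page, is leave-one-out validation: each sample is held out once, a model is built from the remaining samples, and the held-out sample is predicted. The sketch below shows the procedure with a 1-nearest-neighbor rule standing in for the actual predictor; that rule and all function names are assumptions for illustration, not the algorithm of any paper listed here.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Predict the label of `query` as the label of its closest training
    sample (squared Euclidean distance). Stand-in classifier only."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]

def jackknife_accuracy(xs, ys, predict=nearest_neighbor_predict):
    """Leave-one-out accuracy: every sample is tested exactly once
    against a model built from all the other samples."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # all samples except the i-th
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Toy one-dimensional data, for illustration only.
accuracy = jackknife_accuracy([[0.0], [0.1], [0.2], [1.0]], [0, 0, 0, 1])
```

A self-consistency test, by contrast, evaluates the model on the same samples it was trained on, so it tends to give a more optimistic estimate than the jackknife.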
Article
Many efforts have been made in predicting the subcellular localization of eukaryotic proteins, but most of the existing methods have the following two limitations: (1) their coverage scope is less than ten locations and hence many organelles in an eukaryotic cell cannot be covered, and (2) they can only be used to deal with single-label systems in which each of the constituent proteins has one and only one location. Actually, proteins with multiple locations are particularly interesting since they may have some exceptional functions very important for in-depth understanding the biological process in a cell and for selecting drug target as well. Although several predictors (such as "Euk-mPLoc", "Euk-PLoc 2.0" and "iLoc-Euk") can cover up to 22 different location sites, and they also have the function to treat multi-labeled proteins, further efforts are needed to improve their prediction quality, particularly in enhancing the absolute true rate and in reducing the absolute false rate. Here we propose a new predictor called "pLoc-mEuk" by extracting the key GO (Gene Ontology) information into the general PseAAC (Pseudo Amino Acid Composition). Rigorous cross-validations on a high-quality and stringent benchmark dataset have indicated that the proposed pLoc-mEuk predictor is remarkably superior to iLoc-Euk, the best of the aforementioned three predictors. To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mEuk/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Article
Knowledge of subcellular locations of proteins is crucially important for in-depth understanding their functions in a cell. With the explosive growth of protein sequences generated in the postgenomic age, it is highly demanded to develop computational tools for timely annotating their subcellular locations based on the sequence information alone. The current study is focused on virus proteins. Although considerable efforts have been made in this regard, the problem is far from being solved yet. Most existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions. This kind of multiplex proteins is particularly important for both basic research and drug design. Using the multi-label theory, we present a new predictor called "pLoc-mVirus" by extracting the optimal GO (Gene Ontology) information into the general PseAAC (Pseudo Amino Acid Composition). Rigorous cross-validation on a same stringent benchmark dataset indicated that the proposed pLoc-mVirus predictor is remarkably superior to iLoc-Virus, the state-of-the-art method in predicting virus protein subcellular localization. To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mVirus/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Article
One of the fundamental goals in cellular biochemistry is to identify the functions of proteins in the context of compartments that organize them in the cellular environment. To realize this, it is indispensable to develop an automated method for fast and accurate identification of the subcellular locations of uncharacterized proteins. The current study is focused on plant protein subcellular location prediction based on the sequence information alone. Although considerable efforts have been made in this regard, the problem is far from being solved yet. Most of the existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions. This kind of multiplex protein is particularly important for both basic research and drug design. Using the multi-label theory, we present a new predictor called “pLoc-mPlant” by extracting the optimal GO (Gene Ontology) information into the Chou's general PseAAC (Pseudo Amino Acid Composition). Rigorous cross-validation on the same stringent benchmark dataset indicated that the proposed pLoc-mPlant predictor is remarkably superior to iLoc-Plant, the state-of-the-art method for predicting plant protein subcellular localization. To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mPlant/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Article
The eternal or ultimate goal of medicinal chemistry is to find the most effective ways to treat various diseases and extend human life as long as possible. A human being is a biological entity. To realize such an ultimate goal, inputs or breakthroughs from advances in biological science are no doubt most important, and may even drive medicinal science into a revolution. In this review article, we address this from several different angles. Copyright© Bentham Science Publishers.