Article

Prediction of therapeutic peptides by incorporating q-Wiener index into Chou’s general PseAAC

Authors:
  • Shandong University, Weihai, China

Abstract

As therapeutic peptides have gained attention in disease therapy in recent years, many biologists have spent considerable time and labor verifying functional peptides from large numbers of peptide sequences. To reduce this workload and make the identification of functional proteins more efficient, we propose a sequence-based model, q-FP (functional peptide prediction based on the q-Wiener index), capable of recognizing potentially functional proteins. We extract three types of features by combining graphical representation with statistical indices based on the q-Wiener index and the physicochemical properties of amino acids. Using 5-fold cross-validation, our support-vector-machine-based model achieves accuracies of 96.71%, 92.52%, 98.40%, and 91.40% for anticancer, virulent, and allergenic proteins datasets, respectively.
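The abstract does not spell out how the q-Wiener index is computed. As a minimal sketch, assuming one form used in the graph-theory literature (the sum of the q-analogs [d(u, v)]_q = 1 + q + ... + q^(d-1) over all vertex pairs, which reduces to the classical Wiener index at q = 1), it could be evaluated as follows. The function names and the toy path graph are illustrative assumptions, not the graph construction used by q-FP.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def q_number(k, q):
    """q-analog [k]_q = 1 + q + ... + q**(k-1); equals k when q == 1."""
    return sum(q ** i for i in range(k))


def q_wiener(adj, q=1.0):
    """Sum of [d(u, v)]_q over all unordered vertex pairs of a connected graph."""
    all_dist = {u: bfs_distances(adj, u) for u in adj}
    return sum(q_number(all_dist[u][v], q) for u, v in combinations(adj, 2))


# Toy example: the path graph 0-1-2-3 (hypothetical, not a peptide graph).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(q_wiener(path, q=1.0))  # classical Wiener index of the path
print(q_wiener(path, q=0.5))  # q-weighted variant
```

With q below 1, long distances contribute less than in the classical index, which is one reason a family of q values can yield several distinct descriptors from the same graph.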


... Cancer is another leading cause of mortality worldwide, and significant attention has been devoted to peptide-based therapies [16]. Cancer treatment is one of the most important areas in which peptide-based therapeutic methods have grown quickly [6], [7], [17], [18]. The peptide-based cancer treatments developed in recent years range from naturally occurring and synthetic peptides to peptide-conjugate drugs and small-molecule peptides [7], [19]. ...
... $x_0 = (0, 0)$ and $x_n = \tfrac{1}{2}(x_{n-1} + \nu_n)$ (17), where $\nu_n$ represents the vertex related to the $n$th base. ...
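Reading Eq. (17) as the usual chaos-game midpoint rule (each new point lies halfway between the previous point and the vertex of the current base), the iteration can be sketched as below. The corner assignment for A, C, G and T is an assumption made for illustration; the excerpt does not state which vertex belongs to which base.

```python
# Assumed corner assignment for the four bases (illustrative only).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game_points(sequence, start=(0.0, 0.0)):
    """Return x_1..x_n, where x_n is the midpoint of x_{n-1} and the base's vertex."""
    points = []
    x, y = start
    for base in sequence.upper():
        vx, vy = CORNERS[base]
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points


print(chaos_game_points("GT"))
```

Starting from x_0 = (0, 0), the sequence "GT" first moves halfway toward the G corner and then halfway toward the T corner.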
Article
Peptides, short chains of amino acids, have shown great potential for the investigation and development of novel medications for treatment or therapy. Wet-lab discovery of potential therapeutic peptides, and the drug development that follows, is a hard and time-consuming process. Computational prediction using machine learning (ML) methods can expedite and facilitate the discovery of candidates with therapeutic effects. ML approaches have been applied extensively to proteins, DNA, and RNA to discover hidden features and functional activities, and have recently been used for the functional discovery of peptides for various therapeutics. In this paper, a systematic literature review (SLR) is presented to identify the data sources, ML classifiers, and encoding schemes used in state-of-the-art computational models that predict therapeutic peptides. To conduct the SLR, forty-one research articles were selected carefully based on well-defined selection criteria. To the best of our knowledge, no other SLR provides a comprehensive review of this domain. In this article, we propose a taxonomy based on the identified feature encodings, which may offer relational understanding to researchers. Similarly, a framework model for the computational prediction of therapeutic peptides is introduced to characterize the best practices and levels involved in developing peptide prediction models. Lastly, common issues and challenges are discussed to point researchers toward encouraging future directions in the computational prediction of therapeutic peptides.
... The peptide sequence descriptors include amino acid composition as well as Chou's pseudo amino acid composition for incorporation of the sequence-order information [35]. Given the success of PseAAC in sequence-based prediction [36–38], it is an imperative addition to the standard composition feature vectors. The peptide structure descriptors have been formulated with molecular weight, peptide shape (R, α, β), positive charge (q+), negative charge (q−) and volume. ...
Article
The recent rise in COVID-19 infections has put human life under threat, especially for those suffering from comorbidities. Most studies in the last few months have raised concerns that hypertensive patients face a greater risk of fatality from COVID-19. Furthermore, a recent WHO report estimated that a total of 1.13 billion people are at risk of hypertension, of whom two-thirds live in low and middle income countries. The gradual escalation of the hypertension problem and the sudden rise of COVID-19 cases have placed an increasingly higher number of human lives at risk in low and middle income countries. To lower the risk of hypertension, most physicians recommend drugs that contain angiotensin-converting enzyme (ACE) inhibitors. However, prolonged use of such drugs is not recommended due to metabolic risks and the increase in the expression of ACE-II, which could facilitate COVID-19 infection. In contrast, the intake of optimal macronutrients is one possible alternative for controlling hypertension naturally. In the present study, a nontrivial feature selection and machine learning algorithm is adopted to predict food-derived antihypertensive peptides. The central idea of the paper is to reduce the computational cost while retaining the performance of the support vector machine (SVM) by estimating the dominant pattern in the feature space through feature filtering. The proposed feature filtering algorithm reports a trade-off in performance by reducing the chance of Type I error, which is desirable when recommending a dietary food to patients suffering from hypertension. The maximum achievable accuracies of the best-performing SVM models with feature selection are 86.17% and 85.61%, respectively.
... Despite this, published research has revealed nothing about the enzyme that catalyzes histone lysine succinylation [24–26]. In reality, whether or not this reaction is enzymatic is unknown [17,27]. ...
Article
Lysine succinylation is a post-translational modification (PTM) in which a succinyl group (-CO-CH2-CH2-CO2H) is added to a lysine residue of a protein, reversing the lysine's positive charge to a negative charge and leading to significant changes in protein structure and function. It occurs on a wide range of proteins and plays an important role in various cellular and biological processes in both eukaryotes and prokaryotes. Beyond experimentally identified succinylation sites, many studies have developed sequence-based prediction using machine learning approaches, because it promises to be extremely time-saving, accurate, robust, and cost-effective. Despite these benefits of computational prediction of lysine succinylation sites for different species, a number of issues need to be addressed in the design and development of succinylation site predictors. In spite of the fact that many studies have used different statistical and machine learning tools, only a few have examined these bioinformatics issues in depth. Therefore, in this comprehensive comparative review, an attempt is made to present the latest advances in prediction models, datasets, and online resources, as well as the obstacles and limits, to provide a useful guideline for developing more suitable and effective succinylation site prediction tools.
... Most biological networks are depicted as directed graphs whose edges express critical interactions, flows and effective directionality [146, 221–223]. While considerable quantitative methodologies have been employed for undirected graph networks, i.e., treewidth [220] and cycle rank [224], as well as topological indices [225], there are additional graph complexity indices such as the distance-based Wiener index [226–230], graph entropy measurements [231] or the Szeged index [232] that can also be computed for the more biologically relevant directed graphs. Measures for analyzing directed graphs include DAG (directed acyclic graph)-width [233,234], directed treewidth [235] and girth [236], with the latter two (treewidth and directed treewidth) being based on the game theory applied to special graph decompositions. ...
Article
GPCRs arguably represent the most effective current therapeutic targets for a plethora of diseases. GPCRs also possess a pivotal role in the regulation of the physiological balance between healthy and pathological conditions; thus, their importance in systems biology should not be underestimated. The molecular diversity of GPCR signaling systems is likely to be closely associated with disease-associated changes in organismal tissue complexity and compartmentalization, thus enabling a nuanced GPCR-based capacity to interdict multiple disease pathomechanisms at a systemic level. GPCRs have long been considered controllers of communication between tissues and cells. This communication involves the ligand-mediated control of cell surface receptors that then direct their stimuli to impact cell physiology. Given the tremendous success of GPCRs as therapeutic targets, considerable focus has been placed on the ability of these therapeutics to modulate diseases by acting at cell surface receptors. In the past decade, however, attention has focused upon how stable multiprotein GPCR superstructures, termed receptorsomes, both at the cell surface membrane and in the intracellular domain dictate and condition long-term GPCR activities associated with the regulation of protein expression patterns, cellular stress responses and DNA integrity management. The ability of these receptorsomes (often in the absence of typical cell surface ligands) to control complex cellular activities implicates them as key controllers of the functional balance between health and disease. A greater understanding of this function of GPCRs is likely to significantly augment our ability to further employ these proteins in a multitude of diseases.
... These characteristics are essential for understanding the function and structure of peptides [57,58]. To capture the structural and PC properties of CPPs, we employed the reduced sequence and index-vectors (RSIV) feature representation method proposed by Xu et al. [59]. The RSIV descriptor encodes five types of feature vectors based on PC properties, such as polarity, charge, acidity, DHP, secondary structure and hydrophobicity [60]. ...
Article
Cell-penetrating peptides (CPPs) are a special kind of peptide capable of carrying a variety of bioactive molecules, such as genetic materials, short interfering RNA and nanoparticles, into cells. In recent years, CPPs have attracted substantial interest from researchers analyzing their biological mechanisms for safe drug delivery agents and therapeutic applications. Identifying CPPs through traditional methods is extremely slow, expensive and laborious, particularly given the large volume of unannotated peptide sequences accumulating in the World Bank repository. To date, numerous computational methods have been developed; however, the available machine-learning tools cannot distinguish both CPPs and their uptake efficiency. This study aims to develop a two-layer deep learning framework, named DeepCPPred, for identifying CPPs in the first phase and their uptake efficiency in the second phase. The predictor first uses four types of descriptors that cover evolutionary, energy estimation, reduced sequence and amino-acid contact information. The extracted features are then optimized through the elastic net algorithm and fed into a cascade deep forest to build the final CPP model. The proposed method achieved 99.45% overall accuracy on the benchmark dataset in the first layer and 95.43% accuracy in the second layer using a 5-fold cross-validation test. Thus, the proposed bioinformatics tool surpassed all existing state-of-the-art sequence-based CPP approaches.
... Undirected graph networks can be assessed for complexity using treewidth (Gruber, 2012), cycle rank (Eggan, 1963) as well as topological indices (Emmert-Streib and Dehmer, 2011). Some of the classical graph complexity indices such as the distance-based Wiener index (Balasubramanian, 1994; Bonchev, 2001; Dehmer et al., 2019; Gao et al., 2017; Xu et al., 2017), the graph entropy measure based on vertex orbits developed by Mowshowitz (Mowshowitz, 1968) or the Szeged index (Klavžar et al., 1996) can also be computed for more functionally relevant directed graphs. Measures for analyzing directed graphs include DAG (directed acyclic graph)-width (Ben-Naoum and Godin, 2016; Kaufman et al., 2009), directed treewidth (Johnson et al., 2001) and girth (Bermond et al., 2013). ...
Chapter
Systems pharmacology is a recently developed scientific field concerning the appreciation of novel therapeutic networks that enables biomedical scientists to understand the actions of medicinal agents in a multidimensional mechanistic manner. A thorough appreciation of systems pharmacology requires the synergistic integration of multiple disciplines, including receptor biology, network theory, high-dimensionality data acquisition and advanced informatics deconvolution. Appreciating pharmacological signaling pathways at a systemic network level holds the promise of improving the efficiency of therapeutic development. This advancement is associated with the ability of systems pharmacology to generate a highly nuanced and quantitative appreciation of simultaneous medicinal signaling across multiple physiological domains. Implicit in this process is the potential benefit that multi-level systems medication, as opposed to agents with a limited therapeutic scope, can have upon disease networks. In this article we outline the benefits of this convergence of data and biology for therapeutic discovery, refinement and precision targeting.
... PseAAC [33–35] mainly uses sequence information and physicochemical properties for feature extraction. PseAAC is now widely used in proteomics. ...
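To make the idea concrete, here is a minimal sketch of a type-1 pseudo amino acid composition: 20 normalized residue frequencies plus λ sequence-order correlation factors, each being the mean squared difference of a physicochemical property between residues k positions apart. It is an illustration under stated assumptions, not the feature set of any paper listed here: a single property (Kyte-Doolittle hydropathy) stands in for the several properties normally combined, and the weight 0.05 and λ = 3 are assumed defaults.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as a stand-in property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(prop):
    """Rescale property values to zero mean and unit standard deviation."""
    mu, sigma = mean(prop.values()), pstdev(prop.values())
    return {aa: (value - mu) / sigma for aa, value in prop.items()}


def pseaac(seq, lam=3, weight=0.05, prop=HYDROPATHY):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    h = standardize(prop)
    # Conventional amino acid composition (sums to 1 for standard residues).
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # Sequence-order correlation factors theta_1..theta_lam.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(len(seq) - k)]
        thetas.append(sum(diffs) / len(diffs))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pseaac("ACDEFGHIKLMNPQRSTVWY", lam=3)
print(len(vector), round(sum(vector), 6))
```

The first 20 components reflect composition alone, while the last λ components carry the sequence-order information that plain composition discards.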
Article
Multi-label proteins occur in two or more subcellular locations and play a vital role in cell development and metabolism. Prediction and analysis of multi-label subcellular localization (SCL) can offer a new perspective on drug target identification and new drug design. However, predicting multi-label protein SCL using biological experiments is expensive and labor-intensive. Therefore, predicting large-scale SCL with machine learning methods has become a popular study topic in bioinformatics. In this study, a novel multi-label learning method for protein SCL prediction, called DMLDA-LocLIFT, is proposed. Firstly, the dipeptide composition (DC), encoding based on grouped weight (EBGW), pseudo amino acid composition (PseAAC), gene ontology (GO) and pseudo-position specific scoring matrix (PsePSSM) are employed to encode subcellular protein sequences. Then, direct multi-label linear discriminant analysis (DMLDA) is used to remove noise from the fused feature vector. Lastly, the optimal feature vectors are input into the multi-label learning with Label-specIfic FeaTures (LIFT) classifier for prediction. Leave-one-out cross validation (LOOCV) shows that the overall actual accuracy on the Gram-negative bacteria, Gram-positive bacteria, plant, virus and human datasets is 98.6%, 99.6%, 97.9%, 94.7% and 96.1%, respectively, which is clearly better than other state-of-the-art prediction methods. The proposed model can effectively predict the SCL of multi-label proteins and provide references for experimental identification of SCL. The source codes and datasets are available at https://github.com/QUST-AIBBDRC/DMLDA-LocLIFT/.
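Of the encodings named above, dipeptide composition is the simplest to illustrate: it counts each of the 400 ordered pairs of adjacent residues and divides by the number of pairs in the sequence. The sketch below is a generic version; the function name and the choice to skip pairs containing non-standard residues are assumptions, not details taken from DMLDA-LocLIFT.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional vector of adjacent-pair frequencies."""
    total = len(seq) - 1
    if total < 1:
        raise ValueError("sequence needs at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


features = dipeptide_composition("ACAC")
print(len(features), features[DIPEPTIDES.index("AC")])
```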
... In recent years, protein sequence-based methods (Yu et al., 2017) have become the most widely applied technique for predicting PPIs due to the availability of protein sequence data. Liu et al. (2012) design a sequence analysis method to represent protein sequences based on hypergeometric series using the q-Wiener index (Xu et al., 2017). X. Li et al. employ a global encoding approach (GE) to describe global information of amino acid sequences (Li et al., 2009). ...
Article
The task of predicting protein–protein interactions (PPIs) has been essential in the context of understanding biological processes. This paper proposes a novel computational model namely FCTP-WSRC to predict PPIs effectively. Initially, combinations of the F-vector, composition (C) and transition (T) are used to map each protein sequence onto numeric feature vectors. Afterwards, an effective feature extraction method PCA (principal component analysis) is employed to reconstruct the most discriminative feature subspaces, which is subsequently used as input in weighted sparse representation based classification (WSRC) for prediction. The FCTP-WSRC model achieves accuracies of 96.67%, 99.82%, and 98.09% for H. pylori, Human and Yeast datasets respectively. Furthermore, the FCTP-WSRC model performs well when predicting three significant PPIs networks: the single-core network (CD9), the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway), and the cross-connection network (Wnt-related Network). Consequently, the promising results show that the proposed method can be a powerful tool for PPIs prediction with excellent performance and less time.
... To avoid completely losing the sequence-pattern feature for proteins, the pseudo amino acid composition (Chou 2001) or PseAAC (Chou 2005) was proposed. Ever since then, it has been widely used in nearly all the areas of computational proteomics (see, e.g., Ding and Zhang 2008; Fang et al. 2008; Jiang et al. 2008a, b; Li and Li 2008; Lin 2008; Lin et al. 2008, 2009, 2013a; Nanni and Lumini 2008; Zhang et al. 2008a, b, c, 2014a, b, c, 2015b, c, 2018b, c, 2019; Jia et al. 2014; Kong et al. 2014; Zuo et al. 2014; Ali and Hayat 2015; Fan et al. 2015; Huang and Yuan 2015; Khan et al. 2015, 2017; Kumar et al. 2015; Mandal et al. 2015; Sanchez et al. 2015; Jiao and Du 2016; Kabir and Hayat 2016; Tahir and Hayat 2016; Tang et al. 2016; Zou and Xiao 2016a, b; Huo et al. 2017; Ju and He 2017a, b; Rahimi et al. 2017; Tripathi and Pandey 2017; Yu et al. 2017a, b; Ahmad and Hayat 2018; Akbar and Hayat 2018; Al Maruf and Shatabda 2018; Arif et al. 2018; Contreras-Torres 2018; Cui et al. 2018; Fu et al. 2018; Javed and Hayat 2018; Mei and Zhao 2018a, b; Mousavizadegan et al. 2018; Qiu et al. 2018; Zhang and Kong 2018, 2019; Zhang and Liang 2018; Ahmad and Hayat 2019; Al Maruf and Shatabda 2019; Javed and Hayat 2019; Nosrati et al. 2019; Pan et al. 2019; Tahir et al. 2019a, b; Tian et al. 2019; Hayat and Iqbal 2014; Ahmad et al. 2015; Dehzangi et al. 2015; Sharma et al. 2015; Zhang 2015; Ahmad et al. 2016; Behbahani et al. 2016; Fan et al. 2016; Ju et al. 2016; Tiwari 2016; Xu et al. 2016, 2017; Jiao and Du 2017; Liang and Zhang 2017; Meher et al. 2017; Qiu et al. 2017a; Ghauri et al. 2018; Ju and Wang 2018; Krishnan 2018; Liang and Zhang 2018; Mei et al. 2018; Rahman et al. 2018; Sabooh et al. 2018; Sankari and Manimegalai 2018; Srivastava et al. 2018; Zhang and Duan 2018; Adilina et al. 2019; Behbahani et al. 2019; Nazari et al. 2019; Shen et al. 2019; Wang et al. 2019; Xiao et al. 2019a), and even leading to an unprecedented revolution in medicinal chemistry. ...
Article
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector while still retaining considerable sequence-order information or its special pattern. To deal with this challenging problem, the ideas of “pseudo amino acid components” and “pseudo K-tuple nucleotide composition” have been proposed. These ideas and their approaches have further stimulated the birth of the “distorted key theory” and the “wenxiang diagram”, substantially strengthened the power to treat multi-label systems, and led to the establishment of the famous “5-steps rule”. All these developments follow a natural logic and are very useful not only for theoretical scientists but also for experimental scientists conducting genetics/genomics analysis and drug development. This review paper also presents their future perspectives; i.e., their impacts will become even more significant and profound.
... Nevertheless, the published studies have provided little information regarding the enzyme which catalyzes histone lysine succinylation [17–19]. In fact, it is unclear whether this reaction is enzymatic or not [8,9,20]. In addition to histones, the succinylated proteins were found in the cytoplasm, nucleus, and mitochondria [7, 21–24], indicating that lysine succinylation controls a variety of biological functions [14,18,25,26]. ...
Article
Lysine succinylation is a form of posttranslational modification of proteins that plays an essential functional role in every aspect of cell metabolism in both prokaryotes and eukaryotes. Aside from experimental identification of succinylation sites, there has been an intense effort geared towards the development of sequence-based prediction through machine learning, due to its promise of being highly accurate, robust and cost-effective. In spite of these advantages, several problems need attention in the design and development of succinylation site predictors. Notwithstanding the many studies that employ machine learning approaches, few articles have examined this bioinformatics field in a systematic manner. Thus, we review the advancements in current state-of-the-art prediction models, datasets, and online resources, and illustrate the challenges and limitations, to present a useful guideline for developing powerful succinylation site prediction tools.
... To deal with this problem, the PseAAC (Pseudo Amino Acid Composition) was introduced [53,54]. Ever since the concept of PseAAC was introduced, it has swiftly penetrated into nearly all the areas of computational proteomics (see, e.g., [5, 8, 36, 38, 40, 55–71] and a long list of references cited in two review papers [29,72]). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, its idea and approach have been extended to deal with DNA/RNA sequences [13, 23, 30–32, 39, 73] in computational genomics/genetics via PseKNC (Pseudo K-tuple Nucleotide Composition) [74–77]. ...
Article
Lysine crotonylation (Kcr) is an evolution-conserved histone posttranslational modification (PTM), occurring in both human somatic and mouse male germ cell genomes. It is important for male germ cell differentiation. Information of Kcr sites in proteins is very useful for both basic research and drug development. But it is time-consuming and expensive to determine them by experiments alone. Here, we report a novel predictor called iKcr-PseEns that is established by incorporating five tiers of amino acid pairwise couplings into the general pseudo amino acid composition. It has been observed via rigorous cross-validations that the new predictor's sensitivity (Sn), specificity (Sp), accuracy (Acc), and stability (MCC) are 90.53%, 95.27%, 94.49%, and 0.826, respectively. For the convenience of most experimental scientists, a user-friendly web-server for iKcr-PseEns has been established at http://www.jci-bioinfo.cn/iKcr-PseEns, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.
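The four measures quoted above follow directly from the confusion matrix of a binary classifier. The sketch below uses the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient; the counts in the example are made up for illustration and are not taken from iKcr-PseEns. It assumes both classes are present, so the Sn and Sp denominators are non-zero.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity (recall on positives)
    sp = tn / (tn + fp)                       # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)     # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
print(binary_metrics(tp=90, fp=5, tn=95, fn=10))
```

MCC is often reported alongside accuracy because it stays informative when the positive and negative classes are unbalanced.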
Article
Background The identification of tumor T cell antigens (TTCAs) is crucial for providing insights into their functional mechanisms and utilizing their potential in anticancer vaccine development. In this context, TTCAs are highly promising. Meanwhile, experimental technologies for discovering and characterizing new TTCAs are expensive and time-consuming. Although many machine learning (ML)-based models have been proposed for identifying new TTCAs, there is still a need to develop a robust model that can achieve higher accuracy and precision. Results In this study, we propose a new stacking ensemble learning-based framework, termed StackTTCA, for accurate and large-scale identification of TTCAs. Firstly, we constructed 156 different baseline models by using 12 different feature encoding schemes and 13 popular ML algorithms. Secondly, these baseline models were trained and employed to create a new probabilistic feature vector. Finally, the optimal probabilistic feature vector was determined based on the feature selection strategy and then used for the construction of our stacked model. Comparative benchmarking experiments indicated that StackTTCA clearly outperformed several ML classifiers and the existing methods in terms of the independent test, with an accuracy of 0.932 and Matthew's correlation coefficient of 0.866. Conclusions In summary, the proposed stacking ensemble learning-based framework of StackTTCA could help to precisely and rapidly identify true TTCAs for follow-up experimental verification. In addition, we developed an online web server (http://2pmlab.camt.cmu.ac.th/StackTTCA) to maximize user convenience for high-throughput screening of novel TTCAs.
Article
Protein–protein interactions (PPIs) are of great importance for understanding genetic mechanisms, delineating disease pathogenesis, and guiding drug design. With the increase of PPI data and the development of machine learning technologies, prediction and identification of PPIs have become a research hotspot in proteomics. In this study, we propose a new prediction pipeline for PPIs based on gradient tree boosting (GTB). First, the initial feature vector is extracted by fusing pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Second, to remove redundancy and noise, we employ L1-regularized logistic regression (L1-RLR) to select an optimal feature subset. Finally, the GTB-PPI model is constructed. Five-fold cross-validation showed that GTB-PPI achieved accuracies of 95.15% and 90.47% on the Saccharomyces cerevisiae and Helicobacter pylori datasets, respectively. In addition, GTB-PPI could be applied to predict the independent test datasets for Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the one-core PPI network for CD9, and the crossover PPI network for the Wnt-related signaling pathways. The results show that GTB-PPI can significantly improve the accuracy of PPI prediction. The code and datasets of GTB-PPI can be downloaded from https://github.com/QUST-AIBBDRC/GTB-PPI/.
Article
DNA‐binding proteins perform an indispensable function in the maintenance and processing of genetic information, but traditional experimental methods identify them inefficiently because of their huge quantities. By contrast, machine learning methods, as an emerging technique, demonstrate satisfactory speed and accuracy when used to study these molecules. This work focuses on extracting four different features from primary and secondary sequence features: Reduced sequence and index‐vectors (RS), Pseudo‐amino acid components (PseAACS), Position‐specific scoring matrix‐Auto Cross Covariance Transform (PSSM‐ACCT), and Position‐specific scoring matrix‐Discrete Wavelet Transform (PSSM‐DWT). Using the LASSO dimension reduction method, we experiment on combinations of feature submodels to obtain the optimized number of top‐ranked features. These features are then input into the Ensemble subspace discriminant, Ensemble bagged tree and KNN classifiers to predict the DNA‐binding proteins. Three different datasets, PDB594, PDB1075, and PDB186, are adopted to evaluate the performance of the proposed approach. The PDB1075 and PDB594 datasets are used for five‐fold cross‐validation, and PDB186 is used for the independent experiment. In the five‐fold cross‐validation, the PDB1075 and PDB594 datasets reach accuracies of 86.98% and 88.9% with the Ensemble subspace discriminant, respectively. The accuracy of the independent experiment with multi‐classifier voting is 83.33%, which suggests that the methodology proposed in this work is capable of predicting DNA‐binding proteins effectively.
Preprint
Background: Multi-label proteins occur in two or more subcellular locations, which play a vital part in cell development and metabolism. Prediction and analysis of multi-label subcellular localization (SCL) can present new angle with drug target identification and new drug design. However, the prediction of multi-label protein SCL using biological experiments is expensive and labor-intensive. Therefore, predicting large-scale SCL with machine learning methods has turned into a hot study topic in bioinformatics. Methods: In this study, a novel multi-label learning means for protein SCL prediction, called DMLDA-LocLIFT, is proposed. Firstly, the dipeptide composition, encoding based on grouped weight, pseudo amino acid composition, gene ontology and pseudo position specific scoring matrix are employed to encode subcellular protein sequences. Then, direct multi-label linear discriminant analysis (DMLDA) is used to reduce the dimension of the fused feature vector. Lastly, the optimal feature vectors are input into the multi-label learning with Label-specIfic FeaTures (LIFT) classifier to predict the location of multi-label proteins. Results: The jackknife test showed that the overall actual accuracy on Gram-negative bacteria, Gram-positive bacteria, and plant datasets are 98.60%, 99.60%, and 97.90% respectively, which are obviously better than other state-of-the-art prediction methods. Conclusion: The proposed model can effectively predict SCL of multi-label proteins and provide references for experimental identification of SCL. The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/DMLDA-LocLIFT/.
Article
* Background In the search for therapeutic peptides for disease treatments, many efforts have been made to identify various functional peptides from large numbers of peptide sequence databases. In this paper, we propose an effective computational model that uses deep learning and word2vec to predict therapeutic peptides (PTPD). * Results Representation vectors of all k-mers were obtained through word2vec based on k-mer co-existence information. The original peptide sequences were then divided into k-mers using the windowing method. The peptide sequences were mapped to the input layer by the embedding vector obtained by word2vec. Three types of filters in the convolutional layers, as well as dropout and max-pooling operations, were applied to construct feature maps. These feature maps were concatenated into a fully connected dense layer, and rectified linear units (ReLU) and dropout operations were included to avoid over-fitting of PTPD. The classification probabilities were generated by a sigmoid function. PTPD was then validated using two datasets: an independent anticancer peptide dataset and a virulent protein dataset, on which it achieved accuracies of 96% and 94%, respectively. * Conclusions PTPD identified novel therapeutic peptides efficiently, and it is suitable for application as a useful tool in therapeutic peptide design.
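The windowing step described above, which turns a peptide into the k-mer "words" that word2vec embeds, can be sketched in a few lines. The window size k = 3 and a stride of 1 are assumptions chosen for illustration; the abstract does not state the values used by PTPD.

```python
def kmers(seq, k=3, step=1):
    """Split a sequence into k-mers using a sliding window of width k."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


print(kmers("ACDEF", k=3))  # overlapping windows with stride 1
```

Each resulting k-mer would then be looked up in the embedding table to build the input matrix of the network; sequences shorter than k yield no k-mers and would need padding or filtering.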
Article
In this minireview paper it has been elucidated that the proposal of pseudo amino acid components represents a very important milestone for the disciplines of proteome and genome. This has been concluded by observing and analyzing the developments in the following six different sub-disciplines: (1) proteome analysis; (2) genome analysis; (3) protein structural classification; (4) protein subcellular location prediction; (5) post-translational modification (PTM) site prediction; (6) stimulating the birth of the renowned and very powerful 5-steps rule.
Article
Full-text available
Identification of the sites of post-translational modifications (PTMs) in protein, RNA, and DNA sequences is currently a very hot topic. This is because the information thus obtained is very useful for an in-depth understanding of biological processes at the cellular level and for developing effective drugs against major diseases, including cancers. Although this can be done by means of various experimental techniques, it is both time-consuming and costly to determine the PTM sites purely based on experiments. With the avalanche of biological sequences generated in the post-genomic age, it is highly desirable to develop bioinformatics tools for rapidly and effectively identifying the PTM sites. In the last few years, many efforts have been made in this regard, and considerable progress has been achieved. This review is focused on those prediction methods that have the following two features. (1) They have been developed by strictly observing the 5-steps rule so that they each have a user-friendly web-server for the majority of experimental scientists to easily get their desired data without the need to go through the detailed mathematics involved. (2) Their cornerstones have been based on Pseudo Amino Acid Composition (PseAAC) or Pseudo K-tuple Nucleotide Composition (PseKNC), and hence the prediction quality is generally higher than most of the other PTM prediction methods.
Article
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely based on experiments. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review is focused on those methods that have the capacity to deal with multi-label proteins that may simultaneously exist in two or more subcellular location sites. Protein molecules with this kind of characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. This review also focuses on those methods that have user-friendly web-servers established so that the majority of experimental scientists can use them to get the desired results without the need to go through the detailed mathematics involved.
Article
Background: Peptide-Fc fusion drugs, also known as peptibodies, are a category of biological therapeutics in which the Fc region of an antibody is genetically fused to a peptide of interest. However, developing such drugs is laborious and expensive, and rational design is urgently needed. Methods: We summarized the key steps in peptide-Fc fusion technology and stressed the main computational resources, tools, and methods that had been used in the rational design of peptide-Fc fusion drugs. We also raised open questions about the computer-aided molecular design of peptide-Fc. Results: The design of a peptibody consists of four steps. First, identify peptide leads from native ligands, biopanning, and computational design or prediction. Second, select the proper Fc region from different classes or subclasses of immunoglobulin. Third, fuse the peptide leads and Fc together properly. Finally, evaluate the immunogenicity of the constructs. At each step, there are quite a few useful resources and computational tools. Conclusion: Reviewing the molecular design of peptibodies will certainly help make the transition from peptide leads to drugs on the market quicker and cheaper.
Article
Full-text available
Pse-in-One 2.0 is a package of web-servers evolved from Pse-in-One (Liu, B., Liu, F., Wang, X., Chen, J. Fang, L. & Chou, K.C. Nucleic Acids Research, 2015, 43:W65-W71). In order to make it more flexible and comprehensive as suggested by many users, the updated package has incorporated 23 new pseudo component modes as well as a series of new feature analysis approaches. It is available at http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/. Moreover, to maximize the convenience of users, provided is also the stand-alone version called “Pse-in-One-Analysis”, by which users can significantly speed up the analysis of massive sequences.
Article
Full-text available
Toxicity evaluation is an extremely important process during drug development. It is usually initiated by experiments on animals, which are time-consuming and costly. To speed up this process, a quantitative structure-activity relationship (QSAR) study was performed to develop a computational model for correlating the structures of 581 aromatic compounds with their aquatic toxicity to Tetrahymena pyriformis. A set of 68 molecular descriptors derived solely from the structures of the aromatic compounds was calculated based on Gaussian 03, HyperChem 7.5, and TSAR V3.3. A comprehensive feature selection method, the minimum Redundancy Maximum Relevance (mRMR)-genetic algorithm (GA)-support vector regression (SVR) method, was applied to select the best descriptor subset in QSAR analysis. The SVR method was employed to model the toxicity potency from a training set of 500 compounds. Five-fold cross-validation was used to optimize the parameters of the SVR model. The new SVR model was tested on an independent dataset of 81 compounds. Both high internal consistency and high external predictive rates were obtained, indicating that the SVR model is a promising tool for fast toxicity detection.
Article
Full-text available
Involved in important cellular or gene functions and implicated in many kinds of cancers, piRNAs, or piwi-interacting RNAs, are small non-coding RNAs of around 19-33 nucleotides in length. Given a small non-coding RNA molecule, can we predict whether it is a piRNA according to its sequence information alone? Furthermore, there are two types of piRNA: one has the function of instructing target mRNA deadenylation, and the other does not. Can we discriminate one from the other? With the avalanche of RNA sequences emerging in the postgenomic age, it is urgent to address the two problems for both basic research and drug development. Unfortunately, to the best of our knowledge, no computational method has so far been available for the second problem, let alone for the two problems together. Here, by incorporating the physicochemical properties of nucleotides into the pseudo K-tuple nucleotide composition (PseKNC), we proposed a powerful predictor called 2L-piRNA. It is a two-layer ensemble classifier, in which the first layer identifies whether a query RNA molecule is a piRNA or a non-piRNA, and the second layer identifies whether a piRNA has the function of instructing target mRNA deadenylation. Rigorous cross validations have indicated that the success rates achieved by the proposed predictor are quite high. For the convenience of most biologists and drug development scientists, the web-server for 2L-piRNA has been established at http://bioinformatics.hitsz.edu.cn/2L-piRNA/, by which users can easily get their desired results without the need to go through the mathematical details.
Article
Full-text available
Antimicrobial peptides (AMPs) are important components of the innate immune system that have been found to be effective against disease-causing pathogens. Identification of AMPs through wet-lab experiments is expensive. Therefore, development of efficient computational tools is essential to identify the best candidate AMPs prior to in vitro experimentation. In this study, we made an attempt to develop a support vector machine (SVM) based computational approach for prediction of AMPs with improved accuracy. Initially, compositional, physico-chemical and structural features of the peptides were generated and subsequently used as input to the SVM for prediction of AMPs. The proposed approach achieved higher accuracy than several existing approaches when compared on a benchmark dataset. Based on the proposed approach, an online prediction server, iAMPpred, has also been developed to help the scientific community in predicting AMPs; it is freely accessible at http://cabgrid.res.in:8080/amppred/. The proposed approach is believed to supplement the tools and techniques that have been developed in the past for prediction of AMPs.
Article
Full-text available
Motivation: Coexisting in a DNA system, meiosis and recombination are two indispensable aspects of cell reproduction and growth. With the avalanche of genome sequences emerging in the post-genomic age, it is an urgent challenge to acquire information on DNA recombination spots because it can provide timely and very useful insights into the mechanism of meiotic recombination and the process of genome evolution. Results: To address this challenge, we have developed a predictor, called iRSpot-EL, by fusing different modes of PseKNC (pseudo K-tuple nucleotide composition) and a mode of DACC (dinucleotide-based auto-cross covariance) into an ensemble classifier based on a clustering approach. Five-fold cross-validation tests on a widely used benchmark dataset have indicated that the new predictor remarkably outperforms its existing counterparts. In particular, unlike its counterparts, the new predictor can be easily used to conduct genome-wide analysis, and the results obtained are quite consistent with the experimental map. Availability: For the convenience of most experimental scientists, a user-friendly web-server for iRSpot-EL has been established at http://bioinformatics.hitsz.edu.cn/iRSpot-EL/, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. Contact: bliu@gordonlifescience.org or bliu@insun.hit.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
In this paper, we propose the graph energy of 20 amino acids and the 2D graphical representation of protein sequences based on six physicochemical properties of 20 amino acids and the relationship between them. Moreover, we could get a specific vector from the graphical curve of a protein sequence, and use this vector to calculate the distance between two sequences. This approach avoids considering the differences in length of protein sequences. Finally, we research the similarities/dissimilarities of ND5 and 36PDs using our method and get better results compared with ClustalX2.
Article
Full-text available
Cancer remains a major killer worldwide. Traditional methods of cancer treatment are expensive and have some deleterious side effects on normal cells. Fortunately, the discovery of anticancer peptides (ACPs) has paved a new way for cancer treatment. With the explosive growth of peptide sequences generated in the post genomic age, it is highly desired to develop computational methods for rapidly and effectively identifying ACPs, so as to speed up their application in treating cancer. Here we report a sequence-based predictor called iACP developed by the approach of optimizing the g-gap dipeptide components. It was demonstrated by rigorous cross-validations that the new predictor remarkably outperformed the existing predictors for the same purpose in both overall accuracy and stability. For the convenience of most experimental scientists, a publicly accessible web-server for iACP has been established at http://lin.uestc.edu.cn/server/iACP, by which users can easily obtain their desired results.
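The g-gap dipeptide feature mentioned in this abstract can be illustrated with a short sketch. It assumes the usual definition, in which a g-gap dipeptide is a pair of residues separated by g intervening positions and its frequency is its count divided by L - g - 1 for a peptide of length L; the function name and the example peptide are hypothetical, and the optimization step that iACP applies on top of these features is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence, g=0):
    """Return a 400-entry dict of frequencies for residue pairs separated by g positions.

    g = 0 gives the ordinary (adjacent) dipeptide composition.
    """
    seq = sequence.upper()
    windows = len(seq) - g - 1  # number of (i, i + g + 1) position pairs
    counts = {"".join(pair): 0 for pair in product(AMINO_ACIDS, repeat=2)}
    if windows <= 0:
        return {pair: 0.0 for pair in counts}
    for i in range(windows):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    return {pair: n / windows for pair, n in counts.items()}


# Hypothetical peptide "AKAK": with g = 1 the pairs are A_A and K_K.
features = g_gap_dipeptide_composition("AKAK", g=1)
```

Each choice of g yields a fixed 400-dimensional vector regardless of peptide length, which is what makes the representation usable with standard classifiers.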
Article
Full-text available
Motivation: Enhancers are short regulatory DNA elements. They can be bound by proteins (activators) to activate transcription of a gene, and hence play a critical role in promoting gene transcription in eukaryotes. With the avalanche of DNA sequences generated in the post-genomic age, it is a challenging task to develop computational methods for timely identifying enhancers from extremely complicated DNA sequences. Although some efforts have been made in this regard, they were limited to identifying only whether a query DNA element is an enhancer or not. According to the distinct levels of biological activities and regulatory effects on target genes, however, enhancers should be further classified into strong and weak ones. Results: In view of this, a two-layer predictor called "iEnhancer-2L" was proposed by formulating DNA elements with the "pseudo k-tuple nucleotide composition", into which six DNA local parameters were incorporated. To the best of our knowledge, it is the first computational predictor ever established for identifying not only enhancers but also their strength. Rigorous cross-validation tests have indicated that iEnhancer-2L holds very high potential to become a useful tool for genome analysis. Availability: For the convenience of most experimental scientists, a web server for the two-layer predictor was established at http://bioinformatics.hitsz.edu.cn/iEnhancer-2L/, by which users can easily get their desired results without the need to go through the mathematical details. Contact: bliu@gordonlifescience.org or xlan@stanford.edu.
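The two-layer arrangement described here, in which a second classifier is consulted only for sequences that pass the first, can be sketched as a simple cascade. The two layer functions below are toy stand-ins invented for illustration (a GC-content threshold and a motif check); they are not the trained models of the published predictor and carry no biological meaning.

```python
from typing import Callable


def two_layer_predict(
    sequence: str,
    layer1_is_enhancer: Callable[[str], bool],
    layer2_is_strong: Callable[[str], bool],
) -> str:
    """Run layer 1; only sequences it accepts are passed on to layer 2."""
    if not layer1_is_enhancer(sequence):
        return "non-enhancer"
    return "strong enhancer" if layer2_is_strong(sequence) else "weak enhancer"


# Hypothetical placeholder rules, used only so the cascade can be executed.
def toy_layer1(seq: str) -> bool:
    return seq.count("G") + seq.count("C") >= len(seq) / 2


def toy_layer2(seq: str) -> bool:
    return "GCGC" in seq


labels = [two_layer_predict(s, toy_layer1, toy_layer2)
          for s in ("ATATAT", "GCGCAT", "GCATGC")]
```

In a real system each callable would wrap a trained classifier operating on a feature vector, and the second layer would be trained only on positive examples of the first.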
Article
Full-text available
With the avalanche of DNA/RNA sequences generated in the post-genomic age, it is urgent to develop automated methods for analyzing the relationship between the sequences and their functions. Encouraged by the great successes of PseAAC (pseudo amino acid composition) in computational proteomics, and inspired by its idea and concept, the PseKNC (pseudo K-tuple nucleotide composition) approaches have been proposed and applied to analyze various uncharacterized DNA/RNA sequences, in order to gain an in-depth understanding of their action mechanisms and processes. Compared with the classical sequence-based methods, the recently developed PseKNC approaches have the following advantages: (1) they can convert DNA/RNA sequences of different lengths into fixed-dimension digital vectors that can be directly handled by all the existing machine-learning algorithms or operation engines; (2) they can contain the desired features and properties according to the selection or definition of users; (3) they can cover considerable sequence pattern information, both local and global. This minireview is focused on the concept of PseKNC (an extension of PseAAC), as well as its development and applications.
Article
Full-text available
With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems in computational biology is how to effectively formulate the sequence of a biological sample (such as DNA, RNA or protein) with a discrete model or a vector that can effectively reflect its sequence pattern information or capture its key features concerned. Although several web servers and stand-alone tools were developed to address this problem, all these tools, however, can only handle one type of samples. Furthermore, the number of their built-in properties is limited, and hence it is often difficult for users to formulate the biological sequences according to their desired features or properties. In this article, with a much larger number of built-in properties, we are to propose a much more flexible web server called Pse-in-One (http://bioinformatics.hitsz.edu.cn/Pse-in-One/), which can, through its 28 different modes, generate nearly all the possible feature vectors for DNA, RNA and protein sequences. Particularly, it can also generate those feature vectors with the properties defined by users themselves. These feature vectors can be easily combined with machine-learning algorithms to develop computational predictors and analysis methods for various tasks in bioinformatics and system biology. It is anticipated that the Pse-in-One web server will become a very useful tool in computational proteomics, genomics, as well as biological sequence analysis. Moreover, to maximize users' convenience, its stand-alone version can also be downloaded from http://bioinformatics.hitsz.edu.cn/Pse-in-One/download/, and directly run on Windows, Linux, Unix and Mac OS. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Article
Full-text available
Facing the explosive growth of biological sequence data, such as those of protein/peptide and DNA/RNA, generated in the post-genomic age, many bioinformatical and mathematical approaches as well as physicochemical concepts have been introduced to derive useful information from these biological sequences in a timely manner, in order to stimulate the development of medical science and drug design. Meanwhile, because of the rapid penetration of these disciplines, medicinal chemistry is currently undergoing an unprecedented revolution. In this minireview, we summarize the progress by focusing on the following six aspects. (1) Use the pseudo amino acid composition or PseAAC to predict various attributes of protein/peptide sequences that are useful for drug development. (2) Use pseudo oligonucleotide composition or PseKNC to do the same for DNA/RNA sequences. (3) Introduce the multi-label approach to study those systems where the constituent elements bear multiple characters and functions. (4) Utilize the graphical rules and "wenxiang" diagrams to analyze complicated biomedical systems. (5) Recent developments in identifying the interactions of drugs with their various types of target proteins in cellular networking. (6) Distorted key theory and its application in developing peptide drugs.
Article
Full-text available
The wenxiang diagram was proposed to represent α-helices in a 2D (two dimensional) space (Chou, K.. It has the capacity to provide more information in a 2D plane about each of the constituent amino acid residues in an α-helix, and is particularly useful for studying and analyzing amphiphilic helices. To meet the increasing requests for getting the program of generating wenxiang diagrams, a user-friendly web-server called "Wenxiang" has been established. It is accessible to the public at the web-site http://www.jci-bioinfo.cn/wenxiang2 or http://icpr.jci.edu.cn/bioinfo/wenxiang2. Furthermore, for the convenience of users, here we provide a step-by-step guide for how to use the Wenxiang web-server to generate the desired wenxiang diagrams.
Article
Full-text available
Cancer is one of the most common diseases, which causes more mortality worldwide. Despite the presence of several therapies against cancer, peptides as therapeutic agents are gaining importance. Experimental studies report that peptides containing apoptotic domain exhibit anticancer activity. Hence in this study, we propose a computational method using support vector machine and protein relatedness measure feature vector, in which provision was made to assess the query protein for the presence of any apoptotic domains or not and then to scan/predict the anti-cancer peptides in the protein. Different datasets, including newly developed positive and negative dataset, AntiCP dataset, and balanced randomly generated peptides were used to validate the proposed method. The validation results on independent dataset suggested (sensitivity = 0.95; specificity = 0.97; MCC = 0.92; and Accuracy = 0.96) that the proposed method outperformed the existing method in predicting anti-cancer peptides. The user friendly webserver includes three different modes (i) Protein scan with apoptotic domain prediction; (ii) Multiple peptide mode; and (iii) Peptide mutation mode for prediction and design of anti-cancer peptides. The server was developed using PERL CGI and freely accessible at http://acpp.bicpu.edu.in/predict.php. The established tool will be useful in investigating and designing potent anti-cancer peptides from the query protein effectively.
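The four validation measures quoted in this abstract follow from a binary confusion matrix. The sketch below uses the standard definitions; the counts (100 positives and 100 negatives) are hypothetical and were chosen only because they give values that round to the reported figures. The actual test-set sizes are not stated in the abstract.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mcc": mcc}


# Hypothetical counts: 95 of 100 positives and 97 of 100 negatives classified correctly.
metrics = binary_metrics(tp=95, fn=5, tn=97, fp=3)
```

MCC is often reported alongside accuracy because it stays informative when the positive and negative classes are unbalanced.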
Article
Full-text available
Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position specific scoring matrix of the proteins (PSSM), those derived from the amino-acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by the sum rule. Some stand-alone descriptors work well on some datasets but not on others. Through fusion, the different descriptors provide a performance that works well across all tested datasets, in some cases performing better than the state-of-the-art.
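The sum rule used to combine the per-descriptor classifiers can be shown in a few lines. This is a generic sketch of sum-rule fusion: it assumes each classifier emits one score per class on a comparable scale (for example, normalized to sum to 1), and the three score vectors are invented for illustration. How the authors normalize their SVM outputs is not described in the abstract.

```python
def sum_rule(score_lists):
    """Add the per-class scores of several classifiers and pick the top class.

    score_lists: one list of class scores per classifier, all of equal length.
    Returns (fused_scores, index_of_predicted_class).
    """
    n_classes = len(score_lists[0])
    if any(len(scores) != n_classes for scores in score_lists):
        raise ValueError("all classifiers must score the same classes")
    fused = [sum(scores[c] for scores in score_lists) for c in range(n_classes)]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return fused, predicted


# Hypothetical scores from three descriptor-specific classifiers on a two-class problem.
fused_scores, predicted_class = sum_rule([[0.60, 0.40], [0.30, 0.70], [0.45, 0.55]])
```

Here the first classifier alone would choose class 0, but the fused scores favor class 1, which is the kind of correction that motivates combining descriptors.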
Article
Full-text available
We have developed a method, NTXpred, for predicting neurotoxins and classifying them based on their function and origin. The dataset used in this study consists of 582 non-redundant, experimentally annotated neurotoxins obtained from Swiss-Prot. A number of modules have been developed for predicting neurotoxins using residue composition, based on a feed-forward neural network (FNN), a recurrent neural network (RNN), and a support vector machine (SVM), which achieved maximum accuracies of 84.19%, 92.75%, and 97.72%, respectively. In addition, SVM modules have been developed for classifying neurotoxins based on their source (e.g., eubacteria, cnidarians, molluscs, arthropods and chordates) using amino acid composition and dipeptide composition, which achieved maximum overall accuracies of 78.94% and 88.07%, respectively. The overall accuracy increased to 92.10% when the evolutionary information obtained from PSI-BLAST was combined with the SVM module for source classification. We have also developed SVM modules for classifying neurotoxins based on function using amino acid and dipeptide composition, which achieved overall accuracies of 83.11% and 91.10%, respectively. The overall accuracy of function classification improved to 95.11% when PSI-BLAST output was combined with the SVM module. All the modules developed in this study were evaluated using the five-fold cross-validation technique. NTXpred is available at www.imtech.res.in/raghava/ntxpred/ and at the mirror site http://bioinformatics.uams.edu/mirror/ntxpred.
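Five-fold cross-validation, the evaluation protocol named in this abstract, partitions the data into five folds and uses each fold once as the test set. The sketch below shows only the index bookkeeping, using the 582-sample dataset size from the abstract; the shuffling seed and function name are assumptions, and model training is left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds after one shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(582, k=5))
fold_sizes = [len(test) for _, test in splits]
```

A classifier would be fitted on each `train` list and scored on the matching `test` list, with the five scores averaged to give the reported accuracy.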
Article
Full-text available
The use of therapeutic peptides in cancer therapy has been receiving considerable attention in recent years. The present study describes the development of computational models for predicting and discovering novel anticancer peptides. Preliminary analysis revealed that Cys, Gly, Ile, Lys, and Trp are dominant at various positions in anticancer peptides. Support vector machine models were developed using amino acid composition and binary profiles as input features on a main dataset that contains experimentally validated anticancer peptides and random peptides derived from the SwissProt database. In addition, models were developed on an alternate dataset that contains antimicrobial peptides instead of random peptides. The binary-profile-based model achieved a maximum accuracy of 91.44% with an MCC of 0.83. We have developed a webserver, which would be helpful in: (i) predicting minimum mutations required for improving anticancer potency; (ii) virtual screening of peptides for discovering novel anticancer peptides; and (iii) scanning natural proteins for identification of anticancer peptides (http://crdd.osdd.net/raghava/anticp/).
Article
Full-text available
Many molecular biosystems and biomedical systems belong to the multi-label systems in which each of their constituent molecules possesses one or more than one function or feature, and hence needs one or more than one label to indicate its attribute(s). With the avalanche of biological sequences generated in the post genomic age, it is highly desirable to develop computational methods to timely and reliably identify their various kinds of attributes. Compared with the single-label systems, the multi-label systems are much more complicated and difficult to deal with. The current mini review focuses on the recent progresses in this area from both conceptual aspects and detailed mathematical formulations.
Article
Full-text available
Background: Predicting protein subnuclear localization is a challenging problem. Some previous works based on non-sequence information including Gene Ontology annotations and kernel fusion have respective limitations. The aim of this work is twofold: one is to propose a novel individual feature extraction method; another is to develop an ensemble method to improve prediction performance using comprehensive information represented in the form of high dimensional feature vector obtained by 11 feature extraction methods. Methodology/principal findings: A novel two-stage multiclass support vector machine is proposed to predict protein subnuclear localizations. It only considers those feature extraction methods based on amino acid classifications and physicochemical properties. In order to speed up our system, an automatic search method for the kernel parameter is used. The prediction performance of our method is evaluated on four datasets: Lei dataset, multi-localization dataset, SNL9 dataset and a new independent dataset. The overall accuracy of prediction for 6 localizations on Lei dataset is 75.2% and that for 9 localizations on SNL9 dataset is 72.1% in the leave-one-out cross validation, 71.7% for the multi-localization dataset and 69.8% for the new independent dataset, respectively. Comparisons with those existing methods show that our method performs better for both single-localization and multi-localization proteins and achieves more balanced sensitivities and specificities on large-size and small-size subcellular localizations. The overall accuracy improvements are 4.0% and 4.7% for single-localization proteins and 6.5% for multi-localization proteins. The reliability and stability of our classification model are further confirmed by permutation analysis. Conclusions: It can be concluded that our method is effective and valuable for predicting protein subnuclear localizations. A web server has been designed to implement the proposed method. 
It is freely available at http://bioinformatics.awowshop.com/snlpred_page.php.
Article
Full-text available
Meiotic recombination is an important biological process. As a main driving force of evolution, recombination provides natural new combinations of genetic variations. Rather than randomly occurring across a genome, meiotic recombination takes place in some genomic regions (the so-called 'hotspots') with higher frequencies, and in the other regions (the so-called 'coldspots') with lower frequencies. Therefore, the information of the hotspots and coldspots would provide useful insights for in-depth studying of the mechanism of recombination and the genome evolution process as well. So far, the recombination regions have been mainly determined by experiments, which are both expensive and time-consuming. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying the recombination regions. In this study, a predictor, called 'iRSpot-PseDNC', was developed for identifying the recombination hotspots and coldspots. In the new predictor, the samples of DNA sequences are formulated by a novel feature vector, the so-called 'pseudo dinucleotide composition' (PseDNC), into which six local DNA structural properties, i.e. three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise), are incorporated. It was observed by the rigorous jackknife test that the overall success rate achieved by iRSpot-PseDNC was >82% in identifying recombination spots in Saccharomyces cerevisiae, indicating the new predictor is promising or at least may become a complementary tool to the existing methods in this area. Although the benchmark data set used to train and test the current method was from S. cerevisiae, the basic approaches can also be extended to deal with all the other genomes. Particularly, it has not escaped our notice that the PseDNC approach can be also used to study many other DNA-related problems. 
As a user-friendly web-server, iRSpot-PseDNC is freely accessible at http://lin.uestc.edu.cn/server/iRSpot-PseDNC.
Article
Full-text available
A century ago, Gertrude Stein told us that a rose is a rose is a rose, but today, modern genomics is telling us that a pathogen is not a pathogen. Advances in pathogenomics have greatly elucidated the genetic and molecular underpinnings of bacterial virulence and pathogenesis. This new knowledge of the evolution of pathogenic bacteria, and of the ways by which they acquire and maintain virulence, has increasingly indicated that not all bacterial pathogens are created equal. Evolutionarily speaking, there seem to be at least two broad categories of pathogenic bacteria: obligate pathogens that have evolved over time to become irreversibly specialized parasites and “Jekyll-and-Hyde pathogens,” still closely related to free-living bacteria, that have been rapidly but reversibly made pathogenic by mobile genetic elements. This distinction between full-scale genetic re-wiring and subtle genetic fine-tuning represents a fundamental contrast that may shed light on the past, present, and future evolution of pathogenic bacteria. More importantly, we might be able to use this knowledge of the paradigms of pathogenesis to develop novel strategies for combating some of today's most significant bacterial pathogens.
Article
Full-text available
We propose q-analogs of the Wiener index, motivated by the theory of hypergeometric series. The basic properties of these q-Wiener indices are established, as well as their relations with the Hosoya polynomial. Some possible chemical interpretations and applications of the q-Wiener indices are considered.
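As a concrete illustration, the sketch below computes one common form of the q-Wiener index, W_1(G, q) = Σ [d(u, v)]_q over unordered vertex pairs, where [k]_q = 1 + q + ... + q^(k-1) is the q-analog of the integer k. The abstract does not spell out the definitions, so this form, the function names, and the example graph are assumptions. Setting q = 1 recovers the ordinary Wiener index, and the distance counts computed on the way are the coefficients of the Hosoya polynomial.

```python
from collections import deque


def distance_counts(adjacency):
    """Return {k: number of unordered vertex pairs at distance k} via BFS from each vertex."""
    counts = {}
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                counts[d] = counts.get(d, 0) + 1
    return {k: c // 2 for k, c in counts.items()}  # each pair was counted twice


def q_number(k, q):
    """q-analog of the integer k: 1 + q + ... + q**(k - 1)."""
    return sum(q ** i for i in range(k))


def q_wiener(adjacency, q):
    """Sum of [d(u, v)]_q over all unordered pairs of distinct, connected vertices."""
    return sum(c * q_number(k, q) for k, c in distance_counts(adjacency).items())


# Path graph on four vertices: 0 - 1 - 2 - 3.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
wiener = q_wiener(path4, 1)      # ordinary Wiener index
q_half = q_wiener(path4, 0.5)
```

For the path on four vertices there are three pairs at distance 1, two at distance 2, and one at distance 3, so the ordinary Wiener index is 10, while q = 0.5 down-weights the longer distances.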
Article
Full-text available
With the avalanche of protein sequences generated in the post-genomic age, it is highly desired to develop automated methods for efficiently identifying various attributes of uncharacterized proteins. This is one of the most important tasks facing us today in bioinformatics, and the information thus obtained will have important impacts on the development of proteomics and system biology. To realize that, one of the keys is to find an effective model to represent the sample of a protein. The most straightforward model in this regard is its entire amino acid sequence; however, the entire sequence model would fail to work when the query protein did not have significant homology to proteins of known characteristics. Thus, various non-sequential models or discrete models were proposed. The simplest discrete model is the amino acid (AA) composition. Using it to represent a protein, however, all the sequence-order information would be completely lost. To cope with such a dilemma, the concept of pseudo amino acid (PseAA) composition was introduced. Its essence is to keep using a discrete model to represent a protein yet without completely losing its sequence-order information. Therefore, in a broad sense, the PseAA composition of a protein is actually a set of discrete numbers that is derived from its amino acid sequence and that is different from the classical AA composition and able to harbour some sort of sequence order or pattern information. Ever since the first PseAA composition was formulated to predict protein subcellular localization and membrane protein types, it has stimulated many different modes of PseAA composition for studying various kinds of problems in proteins and proteins-related systems. In this review, we shall give a brief and systematic introduction of various modes of PseAA composition and their applications. Meanwhile, the challenges for finding the optimal PseAA composition are also briefly discussed.
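The idea of keeping a discrete vector while retaining some sequence-order information can be made concrete with a sketch of the classic (type 1) PseAA composition: 20 amino acid frequencies followed by λ sequence-order correlation factors, all normalized together. The simplifications here are assumptions made for brevity: a single property (the Kyte-Doolittle hydropathy scale, swapped in for the several physicochemical properties used in the original formulation), λ = 2, and weight w = 0.05.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as the single illustrative property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(prop):
    """Rescale a property to zero mean and unit standard deviation over the 20 residues."""
    values = [prop[aa] for aa in AMINO_ACIDS]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (prop[aa] - mean) / sd for aa in AMINO_ACIDS}


def pseaac(sequence, lam=2, weight=0.05, properties=(KYTE_DOOLITTLE,)):
    """Return a (20 + lam)-dimensional PseAA composition vector."""
    seq = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    props = [standardize(p) for p in properties]
    freqs = [seq.count(aa) / n for aa in AMINO_ACIDS]
    thetas = []
    for gap in range(1, lam + 1):
        # Mean squared property difference between residues that are `gap` apart.
        total = sum(
            sum((p[seq[i]] - p[seq[i + gap]]) ** 2 for p in props) / len(props)
            for i in range(n - gap)
        )
        thetas.append(total / (n - gap))
    denominator = sum(freqs) + weight * sum(thetas)
    return [f / denominator for f in freqs] + [weight * t / denominator for t in thetas]


# Hypothetical peptide used only to show the output shape.
vector = pseaac("FLPIIAKLLSGLL", lam=2)
```

The first 20 components reflect composition and the last λ components carry the order information; two peptides with identical composition but different residue order will generally differ in those last components.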
Article
Full-text available
We studied 10 protein-coding mitochondrial genes from 19 mammalian species to evaluate the effects of 10 amino acid properties on the evolution of the genetic code, the amino acid composition of proteins, and the pattern of nonsynonymous substitutions. The 10 amino acid properties studied are the chemical composition of the side chain, two polarity measures, hydropathy, isoelectric point, volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation. The genetic code appears to have evolved toward minimizing polarity and hydropathy but not the other seven properties. This can be explained by our finding that the presumably primitive amino acids differed much only in polarity and hydropathy, but little in the other properties. Only the chemical composition (C) and isoelectric point (IE) appear to have affected the amino acid composition of the proteins studied, that is, these proteins tend to have more amino acids with typical C and IE values, so that nonsynonymous mutations tend to result in small differences in C and IE. All properties, except for hydroxythiolation, affect the rate of nonsynonymous substitution, with the observed amino acid changes having only small differences in these properties, relative to the spectrum of all possible nonsynonymous mutations.
Article
Full-text available
Anuran tissues, and especially skin, are a rich source of bioactive peptides and their precursors. We here present a manually curated database of antimicrobial and other defense peptides with a total of 2571 entries, most of them in the precursor form with demarcated signal peptide (SP), acidic proregion(s) and bioactive moiety(s) corresponding to 1923 non-identical bioactive sequences. Search functions on the corresponding web server facilitate the extraction of six distinct SP classes. The more conserved of these can be used for searching cDNA and UniProtKB databases for potential bioactive peptides, for creating PROSITE search patterns, and for phylogenetic analysis. Availability: DADP is accessible at http://split4.pmfst.hr/dadp/ Contact: juretic@pmfst.hr Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Full-text available
It is important to develop a reliable system for predicting bacterial virulent proteins for finding novel drug/vaccine and for understanding virulence mechanisms in pathogens. In this work we have proposed a bacterial virulent protein prediction method based on an ensemble of classifiers where the features are extracted directly from the amino acid sequence of a given protein. It is well known in the literature that the features extracted from the evolutionary information of a given protein are better than the features extracted from the amino acid sequence. Our method tries to fill the gap between the amino acid sequence based approaches and the evolutionary information based approaches. An extensive evaluation according to a blind testing protocol, where the parameters of the system are calculated using the training set and the system is validated in three different independent datasets, has demonstrated the validity of the proposed method.
Article
Full-text available
Identifying group-specific characteristics in metabolic networks can provide better insight into evolutionary developments. Here, we present an approach to classify the three domains of life using topological information about the underlying metabolic networks. These networks have been shown to share domain-independent structural similarities, which pose a special challenge for our endeavour. We quantify specific structural information by using topological network descriptors to classify this set of metabolic networks. Such measures quantify the structural complexity of the underlying networks. In this study, we use such measures to capture domain-specific structural features of the metabolic networks to classify the data set. So far, it has been a challenging undertaking to examine what kind of structural complexity such measures do detect. In this paper, we apply two groups of topological network descriptors to metabolic networks and evaluate their classification performance. Moreover, we combine the two groups to perform a feature selection to estimate the structural features with the highest classification ability in order to optimize the classification performance. By combining the two groups, we can identify seven topological network descriptors that show a group-specific characteristic by ANOVA. A multivariate analysis using feature selection and supervised machine learning leads to a reasonable classification performance with a weighted F-score of 83.7% and an accuracy of 83.9%. We further demonstrate that our approach outperforms alternative methods. Also, our results reveal that entropy-based descriptors show the highest classification ability for this set of networks. Our results show that these particular topological network descriptors are able to capture domain-specific structural characteristics for classifying metabolic networks between the three domains of life.
Article
Knowledge of subcellular locations of proteins is crucially important for in-depth understanding their functions in a cell. With the explosive growth of protein sequences generated in the postgenomic age, it is highly demanded to develop computational tools for timely annotating their subcellular locations based on the sequence information alone. The current study is focused on virus proteins. Although considerable efforts have been made in this regard, the problem is far from being solved yet. Most existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions. This kind of multiplex proteins is particularly important for both basic research and drug design. Using the multi-label theory, we present a new predictor called "pLoc-mVirus" by extracting the optimal GO (Gene Ontology) information into the general PseAAC (Pseudo Amino Acid Composition). Rigorous cross-validation on a same stringent benchmark dataset indicated that the proposed pLoc-mVirus predictor is remarkably superior to iLoc-Virus, the state-of-the-art method in predicting virus protein subcellular localization. To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mVirus/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Article
Motivation: Given a compound, can we predict which anatomical therapeutic chemical (ATC) class/classes it belongs to? It is a challenging problem since the information thus obtained can be used to deduce its possible active ingredients, as well as its therapeutic, pharmacological and chemical properties. And hence the pace of drug development could be substantially expedited. But this problem is by no means an easy one. Particularly, some drugs or compounds may belong to two or more ATC classes.
Article
Background: Occurring at Lys residues, the PGK (lysine phosphoglycerylation) is a special kind of post-translational modification (PTM). It may invert the charge potential of the modified residue and change the protein structures and functions, causing various diseases in liver, brain, and kidney. Objective: From the angles of both basic research and drug development, we are facing a critical and challenging problem: for an uncharacterized protein sequence containing many Lys residues, which ones can be of phosphoglycerylation, and which ones cannot? Method: To address this problem, we have developed a predictor called iPGK-PseAAC by incorporating into the general PseAAC (pseudo amino acid composition) four different tiers of amino acid pairwise coupling information, where tiers 1, 2, 3, and 4 refer to the amino acid pairwise couplings between all the 1st, 2nd, 3rd, and 4th most contiguous residues along a protein segment, respectively. Results: Rigorous cross-validations indicated that the proposed predictor remarkably outperformed its existing counterparts. Conclusion: The proposed predictor iPGK-PseAAC will become a very useful bioinformatics tool for medicinal chemistry. For the convenience of most experimental scientists, a user-friendly web-server for iPGK-PseAAC has been established at http://app.aporc.org/iPGK-PseAAC/, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved.
Article
Purpose: Occurring at the cysteine residue in the C-terminal of a protein, prenylation is a special kind of post-translational modification (PTM), which may play a key role for statin in altering immune function. Therefore, knowledge of the prenylation sites in proteins is important for drug development as well as for in-depth understanding the biological process concerned. Given a query protein whose C-terminal contains some cysteine residues, which one can be of prenylation or none of them can be prenylated? Methods: To address this problem, we have developed a new predictor, called "iPreny-PseAAC", by incorporating two tiers of sequence pair coupling effects into the general form of PseAAC (pseudo amino acid composition). Results: It has been observed by four different cross-validation approaches that all the important indexes in reflecting its prediction quality are quite high and fully consistent to each other. Conclusion: It is anticipated that the iPreny-PseAAC predictor holds very high potential to become a useful high throughput tool in identifying protein C-terminal cysteine prenylation sites and the other relevant areas. To maximize the convenience for most experimental biologists, the web-server for the new predictor has been established at http://app.aporc.org/iPreny-PseAAC/, by which users can easily get their desired results without needing to go through the mathematical details involved in this paper.
Article
The eternal or ultimate goal of medicinal chemistry is to find the most effective ways to treat various diseases and extend human life as long as possible. A human being is a biological entity. To realize such an ultimate goal, inputs and breakthroughs from advances in biological science are no doubt the most important, and may even drive medicinal science into a revolution. In this review article, we address this from several different angles.
Article
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Article
This study investigates an efficient and accurate computational method for predicting mycobacterial membrane proteins. Mycobacterium is a pathogenic bacterium which is the causative agent of tuberculosis (TB) and leprosy. The existing feature encoding algorithms for protein sequence representation such as composition and translation, and split amino acid composition cannot suitably express the mycobacterium membrane proteins and their types due to bias among different types. Therefore, in this study a novel un-biased dipeptide composition (Unb-DPC) method is proposed. The proposed encoding scheme has two advantages. First, it avoids the bias among the different mycobacterium membrane proteins and their types. Second, the method is fast and preserves protein sequence structure information. The experimental results yield an SVM-based classification accuracy of 97.1% for membrane protein types and 95.0% for discriminating mycobacterium membrane and non-membrane proteins by using the jackknife cross-validation test. The results show that the proposed model achieved significant predictive performance compared to the existing algorithms and may lead to a powerful tool for developing anti-mycobacterium drugs.
Article
Motivation: Post-translational modification, abbreviated as PTM, refers to the change of the amino acid side chains of a protein after its biosynthesis. Owing to its significance for in-depth understanding of various biological processes and developing effective drugs, prediction of PTM sites in proteins has currently become a hot topic in bioinformatics. Although many computational methods were established to identify various single-label PTM types and their occurrence sites in proteins, no method has ever been developed for multi-label PTM types. As one of the most frequently observed PTMs, the K-PTM, namely the modification occurring at lysine (K), can be usually accommodated with many different types, such as "acetylation", "crotonylation", "methylation", and "succinylation". Now we are facing an interesting challenge: Given an uncharacterized protein sequence containing many K residues, which ones can accommodate two or more types of PTM, which ones only one, and which ones none? Results: To address this problem, a multi-label predictor called iPTM-mLys has been developed. It represents the first multi-label PTM predictor ever established. The novel predictor is featured by incorporating the sequence-coupled effects into the general PseAAC, and by fusing an array of basic random forest classifiers into an ensemble system. Rigorous cross-validations via a set of multi-label metrics indicate that the first multi-label PTM predictor is very promising and encouraging. Availability: For the convenience of most experimental scientists, a user-friendly web-server for iPTM-mLys has been established at http://www.jci-bioinfo.cn/iPTM-mLys, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. Contact: wqiu@gordonlifescience.org Supplementary information: Supplementary data are available at Bioinformatics online.
Article
The q-Wiener index of graphs is a generalization of the classical Wiener index. This paper determines the q-Wiener index of graphs constructed by graph operations such as Join, Symmetric difference, Composition, Disjunction. Some known results are direct consequence of these observations.
Article
Tyrosine sulfation is a post-translational modification widely distributed in eukaryotic proteins. The prerequisite to reveal its biological role which is largely unknown is identifying more protein sulfotyrosine sites. However, previous computational methods only achieved limited accuracy. In this paper, we propose a novel tool named SulfoTyrP with four designed strategies to predict protein sulfotyrosine sites. Weight parameters in support vector machine (SVM) are optimized for the first time to solve the problem of unbalanced datasets and this approach is proved to perform better than the widely used under-sampling approach for our datasets. Moreover, bi-profile Bayes and composition moment vector (CMV) are used to obtain rationally designed features to highlight the contribution of acidic and hydrophobic amino acids. Using SulfoTyrP, we get a sensitivity of 80.65%, an accuracy of 94.51%, Matthew's Correlation Coefficient (MCC) of 0.779 in jackknife cross-validation evaluations, an average sensitivity of 77.78% and an average ACC of 93.89% in three independent tests. Compared with other published tools, SulfoTyrP can get higher sensitivity and accuracy. We not only propose a high accuracy method to predict protein sulfotyrosine site, but also provide.
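The sensitivity, accuracy, and Matthew's Correlation Coefficient quoted above are all functions of the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the cited study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no samples")
    accuracy = (tp + tn) / total
    # Sensitivity (recall): fraction of true positives recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC is conventionally reported as 0 when a marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "mcc": mcc}


if __name__ == "__main__":
    # Hypothetical unbalanced test set: 13 positives, 87 negatives.
    print(binary_metrics(tp=8, fp=2, tn=85, fn=5))
```

With unbalanced data such as this hypothetical example, accuracy stays high even though a sizeable share of positives is missed, which is why sensitivity and MCC are usually reported alongside it.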
Chapter
During the years 1947-1948 Harry Wiener published a series of five papers that introduced into chemistry two novel graph-theoretical invariants. These invariants were specifically designed to characterize alkane molecules and he termed them the polarity number and the path number. The latter number is nowadays more commonly referred to as the Wiener topological index. Our focus here will be primarily on this index and its remarkable historical development over the past half century. We first outline its origins and then discuss its extensive applications and elaboration down to the present time with especial focus on the first thirty years. Our chapter serves to document the fact that Wiener's seminal work has spawned much creative research activity within the broad domain of chemistry. The Wiener index was the first of the current plethora of topological indices that now number in the hundreds. The prolific production of such indices over the years can be ascribed to the fecundity of Wiener's ideas in the stimulation of new scientific endeavors. In this chapter it is our intention not only to chronicle but also to celebrate the rich legacy of Wiener's pioneering contributions to chemistry.
Article
The Wiener index W is the sum of distances between all pairs of vertices of a connected graph. Recently Zhang et al. [MATCH Commun. Math. Comput. Chem. 67 (2012) 347] considered the q-analog of W, motivated by the theory of hypergeometric series. We obtain explicit formulas for the q-Wiener index of cluster and corona of graphs, of which thorny and bridge graphs are special cases. Using these formulas, the q-Wiener indices of several classes of chemical graphs are computed.
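As a rough illustration of the definition above, the following Python sketch computes the Wiener index of an unweighted connected graph by breadth-first search. The q-variant shown replaces each distance k by the q-number [k]_q = 1 + q + ... + q^(k-1), which reduces to k at q = 1; treating this sum as "the" q-Wiener index is an assumption of the sketch, and the cited papers should be consulted for their exact definitions. All function names are illustrative.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first shortest-path lengths from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def pair_distances(adj):
    """Distances for every unordered vertex pair of a connected graph."""
    nodes = sorted(adj)
    out = []
    for i, u in enumerate(nodes):
        dist = distances_from(adj, u)
        for v in nodes[i + 1:]:
            if v not in dist:
                raise ValueError("graph must be connected")
            out.append(dist[v])
    return out


def wiener_index(adj):
    """Sum of distances over all unordered vertex pairs."""
    return sum(pair_distances(adj))


def q_number(k, q):
    """[k]_q = 1 + q + ... + q**(k - 1); equals k when q == 1."""
    return sum(q ** i for i in range(k))


def q_wiener_index(adj, q):
    """Assumed q-analog: sum of [d(u, v)]_q over all vertex pairs."""
    return sum(q_number(d, q) for d in pair_distances(adj))


if __name__ == "__main__":
    path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # path on 4 vertices
    print(wiener_index(path4))         # 10
    print(q_wiener_index(path4, 0.5))  # 7.75
```

For the four-vertex path the pair distances are 1, 1, 1, 2, 2, 3, so W = 10, and setting q = 1 in the q-variant recovers the same value.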
Article
Cancer is a major cause of death worldwide. Traditional cytotoxic therapies, such as radiation and chemotherapy, are expensive and cause severe side effects. Currently, the design of anticancer peptides is a more effective approach to cancer treatment, so there is a need to develop a computational method for predicting anticancer peptides. In the present study, two methods have been developed to predict these peptides using the support vector machine (SVM) as a powerful machine learning algorithm. Classifiers have been applied based on the concept of Chou's pseudo-amino acid composition (PseAAC) and a local alignment kernel. Since a number of HIV-1 proteins have a cytotoxic effect, we predicted the anticancer effect of HIV-1 p24 protein with these methods. After the prediction, the mutagenicity of 2 anticancer peptides and 2 non-anticancer peptides was investigated by the Ames test. Our results show that the accuracy and the specificity of the local alignment kernel based method are 89.7% and 92.68%, respectively. The accuracy and specificity of the PseAAC-based method are 83.82% and 85.36%, respectively. By computational analysis, out of 22 peptides of p24 protein, 4 peptides are anticancer and 18 are non-anticancer. The Ames test results show that the anticancer peptides (ARP788.8 and ARP788.21) are not mutagenic. Therefore the results demonstrate that the described computational methods are useful to identify potential anticancer peptides, which are worthy of further experimental validation, and that 2 peptides (ARP788.8 and ARP788.21) of HIV-1 p24 protein can be used as new anticancer candidates without mutagenicity.
Article
We present a new classification technique to recognize and predict reservoirs from seismic data using support vector machine (SVM) pattern recognition. As the method is data-driven it is especially suitable for use with non-linear multiattributes. The method has good generalization ability for cases where the populations are small. In this paper, we describe the method, point out the difference between SVM and neural network approaches, and apply the method to a 3D seismic dataset for the "YD" oilfield. First, we train the SVM using 3D seismic multiattributes at known well locations with well test results. The resulting SVM structure is used to make predictions away from the wells. It is demonstrated that the method is less subject to overtraining difficulties than are neural networks and can be used to distinguish oil and gas reservoirs.
Article
Because of the importance of proteins in inducing allergenic reactions, the ability of predicting their potential allergenicity has become an important issue. Bioinformatics presents valuable tools for analyzing allergens and these complementary approaches can help traditional techniques to study allergens. This work proposes a computational method for predicting the allergenic proteins. The prediction was performed using pseudo-amino acid composition (PseAAC) and Support Vector Machines (SVMs). The predictor efficiency was evaluated by fivefold cross-validation. The overall prediction accuracies and Matthew's correlation coefficient (MCC) obtained by this method were 91.19% and 0.82, respectively. Furthermore, the minimum Redundancy and Maximum Relevance (mRMR) feature selection method was utilized for measuring the effect and power of each feature. Interestingly, in our study all six characters (hydrophobicity, hydrophilicity, side chain mass, pK1, pK2 and pI) are present among the 10 higher ranked features obtained from the mRMR feature selection method.
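The fivefold cross-validation mentioned above partitions the dataset into five disjoint folds and uses each fold once as the test set. A minimal standard-library sketch of the index bookkeeping follows; the function name, the shuffling seed, and the round-robin fold assignment are assumptions of this sketch rather than details from the cited work.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if k < 2 or n_samples < k:
        raise ValueError("need k >= 2 and at least k samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # reproducible shuffle
    # Round-robin assignment keeps fold sizes within one of each other.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(23, k=5):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging a metric over the five folds scores every sample once with a model that never saw it during training.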
Article
Visual methods illustrate how DNA sequences are read along a single DNA strand from the 5′ end to the 3′ end and they provide the hopes of gaining an understanding of the underlying genomic language. By handling genomic sequence residues as elements of a discrete-time signal, digital signal processing techniques can be employed for the analysis of genomic information. Using these representations and applying frequency domain transformations, it is shown that structures, or seemingly nonrandom behavior, may be readily identified in nucleotide sequences. We review the basic method of DNA walks and we show how these representations can be used to extract useful knowledge from the genomic data; namely long-range correlation information, sequence periodicities, and other sequence characteristics. Further information is elucidated through wavelet transform analysis. This work finally relates a measure of sequence complexity to these visual findings and offers conclusions regarding quantifying DNA sequence behavior or structure.
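The basic DNA walk mentioned above treats the sequence as a discrete-time signal and accumulates one step per base. The sketch below assumes a common purine/pyrimidine rule (a step up for C or T, a step down for A or G); the abstract does not state which mapping the authors use, so the rule and the function name are illustrative.

```python
from itertools import accumulate

# Assumed step rule: pyrimidines (C, T) step up, purines (A, G) step down.
STEP = {"C": 1, "T": 1, "A": -1, "G": -1}


def dna_walk(seq):
    """Return the cumulative walk position after each base, read 5' to 3'."""
    try:
        steps = [STEP[base] for base in seq.upper()]
    except KeyError as exc:
        raise ValueError(f"unsupported base: {exc.args[0]!r}") from None
    return list(accumulate(steps))


if __name__ == "__main__":
    print(dna_walk("CCAG"))  # [1, 2, 1, 0]
```

The resulting integer series is what frequency-domain or wavelet analysis would then operate on.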
Article
In this paper, we propose a new protein map which incorporates various properties of amino acids. As a powerful tool for protein classification, this new protein map both considers phylogenetic factors arising from amino acid mutations and provides computational efficiency for the huge amount of data. The ten amino acid physico-chemical properties (the chemical composition of the side chain, two polarity measures, hydropathy, isoelectric point, volume, aromaticity, aliphaticity, hydrogenation, and hydroxythiolation) are utilized according to their relative importance. Moreover, during the calculation of genetic distances between pairs of proteins, this approach does not require any alignment of sequences. Therefore, the proposed model is easier and quicker in handling protein sequences than multiple alignment methods, and gives protein classification greater evolutionary significance at the amino acid sequence level.
Article
The wenxiang diagram is a new two-dimensional representation that characterizes the disposition of hydrophobic and hydrophilic residues in α-helices. In this research, the hydrophobic and hydrophilic residues of two leucine zipper coiled-coil (LZCC) structural proteins, cGKIα(1-59) and MBS(CT35), are positioned on the wenxiang diagrams according to the heptad repeat pattern (abcdefg)(n), respectively. Their wenxiang diagrams clearly demonstrate that the residues with the same repeat letters are laid on the same side of the spiral diagrams, where most hydrophobic residues are positioned at a and d, and most hydrophilic residues are localized on b, c, e, f and g polar position regions. The wenxiang diagrams of a dimeric LZCC can be represented by the combination of two monomeric wenxiang diagrams, and the wenxiang diagrams of the two LZCC (tetramer) complex structures can also be assembled by using two pairs of their wenxiang diagrams. Furthermore, by comparing the wenxiang diagrams of cGKIα(1-59) and MBS(CT35), the interaction between cGKIα(1-59) and MBS(CT35) is suggested to be weaker. By analyzing the wenxiang diagram of the cGKIα(1-59)·MBS(CT42) complex structure, the residues of cGKIα(1-59) most affected by the interaction with MBS(CT42) are proposed to be at positions d, a, e and g of the LZCC structure. These findings are consistent with our previous NMR results. Incorporating NMR spectroscopy, the wenxiang diagrams of LZCC structures may provide novel insights into the interaction mechanisms between dimeric, trimeric, and tetrameric coiled-coil structures.
Article
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. The unbalanced situation, which has critically limited our ability to timely utilize the newly discovered proteins for basic research and drug development, has called for developing computational methods or high-throughput automated tools for fast and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. Actually, during the last two decades or so, many methods in this regard have been established in hope to bridge such a gap. In the course of developing these methods, the following things were often needed to consider: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we are to discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Article
One major problem with the existing algorithm for the prediction of protein structural classes is low accuracies for proteins from α/β and α+β classes. In this study, three novel features were rationally designed to model the differences between proteins from these two classes. In combination with other rational designed features, an 11-dimensional vector prediction method was proposed. By means of this method, the overall prediction accuracy based on 25PDB dataset was 1.5% higher than the previous best-performing method, MODAS. Furthermore, the prediction accuracy for proteins from α+β class based on 25PDB dataset was 5% higher than the previous best-performing method, SCPRED. The prediction accuracies obtained with the D675 and FC699 datasets were also improved.