Article

GPCR-CA: A Cellular Automaton Image Approach for Predicting G-Protein-Coupled Receptor Functional Classes


Abstract

Given an uncharacterized protein sequence, how can we identify whether it is a G-protein-coupled receptor (GPCR) or not? If it is, which functional family class does it belong to? It is important to address these questions because GPCRs are among the most frequent targets of therapeutic drugs and the information thus obtained is very useful for "comparative and evolutionary pharmacology," a technique often used for drug development. Here, we present a web-server predictor called "GPCR-CA," where "CA" stands for "Cellular Automaton" (Wolfram, S. Nature 1984, 311, 419), meaning that the CA images have been utilized to reveal the pattern features hidden in piles of long and complicated protein sequences. Meanwhile, the gray-level co-occurrence matrix factors extracted from the CA images are used to represent the samples of proteins through their pseudo amino acid composition (Chou, K.C. Proteins 2001, 43, 246). GPCR-CA is a two-layer predictor: the first-layer prediction engine identifies a query protein as GPCR or non-GPCR; if it is a GPCR, the process continues automatically with the second-layer prediction engine, which identifies its type among the following six functional classes: (a) rhodopsin-like, (b) secretin-like, (c) metabotropic/glutamate/pheromone, (d) fungal pheromone, (e) cAMP receptor, and (f) frizzled/smoothened family. The overall success rates of the predictor for the first and second layers are over 91% and 83%, respectively, obtained through rigorous jackknife cross-validation tests on a newly constructed stringent benchmark dataset in which no protein has ≥40% pairwise sequence identity to any other in the same subset. GPCR-CA is freely accessible at http://218.65.61.89:8080/bioinfo/GPCR-CA, where one can get the desired two-layer results for a query protein sequence within about 20 seconds.


... Gao and Wang [11] introduced a nearest neighbor method to discriminate GPCRs from non-GPCRs and subsequently classify GPCRs at four levels on the basis of amino acid composition (AAC) and dipeptide composition of proteins in 2006. In 2008, Xiao, Wang and Chou [12] employed "cellular automaton" to reveal the pattern features hidden in piles of long and complicated protein sequences and predicted GPCRs and subfamilies with CD classifier. In 2010, Peng, Yang and Chen [13] proposed a new method called "PCA-GPCR" to predict GPCRs. ...
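The amino acid composition (AAC) and dipeptide composition mentioned in this excerpt can be sketched as follows. This is a minimal illustration, not the implementation of any cited method; the function names and the choice to count only the 20 standard residues are assumptions made here.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = len(sequence) - 1
    return [pairs[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]
```

Non-standard residue codes still count toward the denominators in this sketch, so the vectors sum to less than 1 when such codes are present.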
... In this research, a 10-fold cross-validation approach is chosen to test our hybrid method. Moreover, the performance of predictor is frequently measured by accuracy (ACC) and Matthew's correlation coefficient (MCC) value [12]. ...
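For reference, accuracy (ACC) and the Matthews correlation coefficient (MCC) of a binary predictor can be computed from the confusion-matrix counts as sketched below. The function name and the convention of returning 0 for MCC when the denominator vanishes are assumptions of this sketch.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Compute ACC and MCC from true/false positive/negative counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the GPCR and non-GPCR classes differ greatly in size.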
... In order to explain the superiority of our fractal methods, we implement our fractal algorithms on the same dataset (D365, 365 GPCRs, six subfamilies) in GPCR-CA [12] and PCA-GPCR [13] via jackknife test. We list detailed comparisons of our method in Table 3. ...
Article
Full-text available
G-protein-coupled receptors (GPCRs) are seven membrane-spanning proteins and regulate many important physiological processes, such as vision, neurotransmission, immune response and so on. GPCRs-related pathways are the targets of a large number of marketed drugs. Therefore, the design of a reliable computational model for predicting GPCRs from amino acid sequence has long been a significant biomedical problem. Chaos game representation (CGR) reveals the fractal patterns hidden in protein sequences, and then fractal dimension (FD) is an important feature of these highly irregular geometries with concise mathematical expression. Here, in order to extract important features from GPCR protein sequences, CGR algorithm, fractal dimension and amino acid composition (AAC) are employed to formulate the numerical features of protein samples. Four groups of features are considered, and each group is evaluated by support vector machine (SVM) and 10-fold cross-validation test. To test the performance of the present method, a new non-redundant dataset was built based on latest GPCRDB database. Comparing the results of numerical experiments, the group of combined features with AAC and FD gets the best result, the accuracy is 99.22% and Matthew's correlation coefficient (MCC) is 0.9845 for identifying GPCRs from non-GPCRs. Moreover, if it is classified as a GPCR, it will be further put into the second level, which will classify a GPCR into one of the five main subfamilies. At this level, the group of combined features with AAC and FD also gets best accuracy 85.73%. Finally, the proposed predictor is also compared with existing methods and shows better performances.
... All GPCR candidates predicted by the pipeline were analyzed by another alignment-free method, GPCR-CA (http://218.65.61.89:8080/bioinfo/GPCR-CA), whose algorithm is based on the cellular automaton (CA) (Xiao et al., 2009). GPCR-CA is a bilayer predictor: the first layer identifies a query protein as GPCR or non-GPCR; if it is a GPCR, then a second layer classifies it into an A-F classification system, based on sequence homology and functional similarity (Attwood and Findlay, 1994). ...
... Therefore, predicting coupling specificity of orphan GPCRs to G-protein subfamilies is essential to find potential drug targets through heterologous expression studies (Wess, 1998). However, GPCRs with low sequence similarity may couple to members of the same subfamily of G-proteins, while members of the same GPCR subfamilies often couple to members of distinct G-protein subfamilies (Wong, 2003). As promiscuous GPCRs are found to be coupled with more than one G-protein subfamily, it is evident that coupling is a multidimensional function rather than one-by-one function (Hermans, 2003;Sgourakis et al., 2005). ...
Article
Full-text available
The infective juveniles (IJs) of entomopathogenic nematode (EPN) Heterorhabditis bacteriophora find and infect their host insects in heterogeneous soil ecosystems by sensing a universal host cue (CO 2 ) or insect/plant-derived odorants, which bind to various sensory receptors, including G protein-coupled receptors (GPCRs). Nematode chemosensory GPCRs (NemChRs) bind to a diverse set of ligands, including odor molecules. However, there is a lack of information on the NemChRs in EPNs. Here we identified 21 GPCRs in the H. bacteriophora genome sequence in a triphasic manner, combining various transmembrane detectors and GPCR predictors based on different algorithms, and considering inherent properties of GPCRs. The pipeline was validated by reciprocal BLAST, InterProscan, GPCR-CA, and NCBI CDD search. Functional classification of predicted GPCRs using Pfam revealed the presence of four NemChRs. Additionally, GPCRs were classified into various families based on the reciprocal BLAST approach into a frizzled type, a secretin type, and 19 rhodopsin types of GPCRs. Gi/o is the most abundant kind of G-protein, having a coupling specificity to all the fetched GPCRs. As the 21 GPCRs identified are expected to play a crucial role in the host-seeking behavior, these might be targeted to develop novel insect-pest management strategies by tweaking EPN IJ behavior, or to design novel anthelminthic drugs. Our new and stringent GPCR detection pipeline may also be used to identify GPCRs from the genome sequence of other organisms.
... To address the representation of a given protein, especially for a protein with long sequence, Xiao [8] had proposed a novel method by using complexity measure factor of cellular automata image. The representation contains the lost information of order effects for that the images are derived from the amino acid sequence through the space-time evolution of cellular automata. ...
... The representation contains the lost information of order effects for that the images are derived from the amino acid sequence through the space-time evolution of cellular automata. See [8] to refer the detail process. ...
... This line of research depends on amino acid coding language proposed in [15] to act as the initial configuration of the elementary CA. This model is used to predict protein subcellular location [16], the G-protein-coupled receptor functional classes [17], and protein structural classes [18] [19]. ...
... Notice that some approaches combine two methods. In [16] [18], [17] and [19], the binary representation is followed by the pseudo-amino acid composition (PseAA). In those papers, CA is applied to the binary representation, then the CA image parameters are extracted by methods such as the geometric moments of Hu [49] and the GLCM texture features [50]. ...
Article
Full-text available
Abstract—The literature of building computational and mathematical models of proteins is rich and diverse, since its practical applications are of a vital importance in the development of many fields. Modeling proteins is not a straightforward process and in some modeling strategies, it requires to combine concepts from different fields including physics, chemistry, thermodynamics, and computer science. The focus here will be on models that are based on the concept of cellular automata and equivalent systems. Cellular automata are discrete computational models that are capable of universal computation, in other words, they are capable of doing any computation that a normal computer can do. What is special about cellular automata is its ability to produce complex and chaotic global behavior from local interactions. The paper discusses the effort done so far by the researchers community in this direction and proposes a computational model of protein folding that is based on 3D cellular automata. Unlike common models, the proposed model maintains the basic properties of cellular automata and keeps a realistic view of proteins operations. As in any cellular automata model, the dimension, neighborhood, boundary, and rules were specified. In addition, a discussion is given to clarify why these parameters are in place and what possible alternatives can be used in the protein folding context.
... Applying a large value of d to the image would produce GLCM features that do not capture all the information in the image. Values of d varying from 1 to 64 have been investigated, and the classification accuracies with values 1, 2, 4, and 8 were found to be basically the same in classifying cloud images (Yazdi and Gheysari, 2008; Xiao et al., 2009). There are many investigations for the best classification accuracy in different applications. In this research, d values of 1–4 are investigated to measure the classification accuracy in identifying the bin level. Orientation θ: The orientation is the third parameter for GLCM computation, though considered less important than the others (Smith et al., 1995). ...
... In this research, d values of 1–4 are investigated to measure the classification accuracy in identifying the bin level. Orientation θ: The orientation is the third parameter for GLCM computation, though considered less important than the others (Smith et al., 1995). Several studies have investigated the effect of θ on texture feature classification (Rahman et al., 2007; Tsai and Chiu, 2008; Xiao et al., 2009; Gelzinis et al., 2007). In this study, we set the values of θ to 0°, 45°, 90°, and 135° for use in all the cases to produce greater statistical and higher classification accuracy. ...
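The roles of the distance d and the orientation in a gray-level co-occurrence matrix (GLCM) can be illustrated with a small pure-Python sketch. The function names, the non-symmetric counting, the row/column offset convention for each angle, and the choice of contrast as the example texture feature are all assumptions of this sketch rather than details taken from the cited studies.

```python
# (row step, column step) for a unit distance at each orientation, with row 0 at the top.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, d=1, angle=0, levels=2):
    """Normalized, non-symmetric co-occurrence matrix of an integer-valued image."""
    dr, dc = OFFSETS[angle]
    dr, dc = dr * d, dc * d
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Count the pair only if the neighbor at offset (dr, dc) is inside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        return counts
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Contrast texture feature: sum of (i - j)^2 * p[i][j]."""
    n = len(p)
    return sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n))
```

As d grows, fewer pixel pairs fall inside the image, which is one way to see why large d values capture less of the image's information.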
Article
This paper presents a CBIR system to investigate the use of image retrieval with an extracted texture from the image of a bin to detect the bin level. Various similarity distances like Euclidean, Bhattacharyya, Chi-squared, Cosine, and EMD are used with the CBIR system for calculating and comparing the distance between a query image and the images in a database to obtain the highest performance. In this study, the performance metrics is based on two quantitative evaluation criteria. The first one is the average retrieval rate based on the precision-recall graph and the second is the use of F1 measure which is the weighted harmonic mean of precision and recall. In case of feature extraction, texture is used as an image feature for bin level detection system. Various experiments are conducted with different features extraction techniques like Gabor wavelet filter, gray level co-occurrence matrix (GLCM), and gray level aura matrix (GLAM) to identify the level of the bin and its surrounding area. Intensive tests are conducted among 250 bin images to assess the accuracy of the proposed feature extraction techniques. The average retrieval rate is used to evaluate the performance of the retrieval system. The result shows that, the EMD distance achieved high accuracy and provides better performance than the other distances.
... It's important to select the appropriate database because the use of a wide and understandable database is required to evaluate and compare the performances of the proposed classifiers. There are various GPCRs databases [78] and different datasets were built using them, like GDS [29], GDFL [50], D167 [16] and D365 [76]. Among these datasets, we selected GDS because it is one of the most widely used data sets in the GPCRs identification field, this will allow us to compare our results with the other works using the same dataset. ...
... The second is the application of the mean/variance normalization to reduce the range of numerical values obtained after using PseAAC. We have chosen PseAAC since it remains the most widely used for protein representation [35,47,50,52,76,77] but we notice that there are other feature extraction methods, for example, Davies et al. [29] and Secker et al. [59] used the "z-values" to represent a protein sequence. ...
Article
Immunological computation is one of the largest recent bio-inspired approaches of artificial intelligence. Artificial immune systems (AIS) are inspired by the processes of the biological immune systems like the learning and memory characteristics which are used for solving complex problems. During the last two decades, AIS have been applied in various fields such as optimization, network security and data mining. In this article, we focus on the application of AIS to data mining in bioinformatics, more specifically, the classification task. For this purpose, we suggest three immune models based on clonal selection theory for the identification of G-protein coupled receptors (GPCRs) to predict their function. Our three classifiers are the artificial immune recognition system (AIRS), the clonal selection algorithm (CLONALG) and the clonal selection classification algorithm (CSCA). The GPCRs represent one of the largest and most important families of multifunctional proteins and are a significant target for bioactive and drug discovery programs. It is estimated that more than half of the drugs on the market currently target GPCRs. However, although thousands of GPCRs sequences are known, many of them remain orphans, have unknown function. Our experiments show that the three immunological classifiers have provided interesting results, however, AIRS obtained the best ones. Therefore, it is, for us, the most suitable immune model for the GPCRs identification problem.
... Different types of PseAAC are employed to predict protein structural class [32], bacterial secreted proteins [33], cyclins [34], risk type of human papillomaviruses [35], enzyme subfamily classes [24,36,37], G-protein coupled receptor classes [38][39][40], cell wall lytic enzymes [41], subcellular localization of apoptosis proteins [42,43], lipase types [44], subcellular localization of mycobacterial proteins [45], cofactors of oxidoreductases [46], DNA-binding proteins [47], quaternary structural attributes [48], proteases and their types [49], GABAA receptors [50] and Glutathione S-transferases [51][52][53]. ...
Article
Full-text available
Phospholipases, as important lipolytic enzymes, have diverse industrial applications. Regarding the stability of extremophilic archaea's proteins in harsh conditions, analyses of unusual features of their proteins are significantly important for their utilization. This research was conducted to study archaeal phospholipases' properties in silico and to develop a pioneering method for distinguishing these enzymes from other archaeal enzymes via machine learning algorithms and Chou's pseudo-amino acid composition concept. The non-redundant sequences of archaeal phospholipases were collected. The BioSeq-Analysis server was used with Support Vector Machine (SVM), Random Forests (RF), Covariance Discrimination (CD), and Optimized Evidence-Theoretic K-nearest Neighbor (OET-KNN) as powerful machine learning algorithms. Also, different Chou's pseudo-amino acid composition modes were performed and then 5-fold cross-validation was applied to the sequences. Based on our results, the OET-KNN predictor, with 96% accuracy, yields the best performance in SC-PseAAC mode by 5-fold cross-validation. This predictor also achieved very high values of specificity (95%), sensitivity (96%), Matthews's correlation coefficient (0.92), and accuracy (96%). The present investigation yielded a robust anticipatory model for archaeal phospholipase prediction utilizing the tenets of PseAAC and the OET-KNN machine learning algorithm.
... Fifty different runs were performed for docking experiment and was terminated after the evaluation energy of 25 k and other docking parameters were as per Zhou et al. [38] Where a translational step of 0.2 Å, quaternion and torsion steps of 5 were applied during the search. [39,40] Docking analysis enabled the calculation of binding energy and inhibition constant where the binding energy help determine the interaction between both the ligands. [41] The data obtained from the runs (50) enabled the comparison and interaction of hydrophobic and H-bond (hydrophobic) interactions between the ligand and receptor of the docked complex using Chimera 1.14 tool. ...
Article
The oxidation activity of multicopper-oxidases overlaps with different substrates of laccases and bilirubin oxidases, thus in the present study an integrated approach of bioinformatics using homology modeling, docking, and experimental validation was used to confirm the type of multicopper-oxidase in Myrothecium verrucaria ITCC-8447. The result of peptide sequence of M. verrucaria ITCC-8447 enabled to predict the 3D structure of multicopper-oxidase. It was overlapped with the structure of laccase and root mean square deviation (RMSD) was 1.53 Å for 533 and 171 residues. The low binding energy with azino-bis (3-ethylbenzothiazoline-6-sulfonic acid) (ABTS) (−5.64) as compared to bilirubin (−4.39) suggested that M. verrucaria ITCC-8447 have laccase-like activity. The experimental analysis confirmed high activity with laccase specific substrates, phenol (18.3 U/L), ampyrone (172.4 U/L), and ampyrone phenol coupling (50 U/L) as compared to bilirubin oxidase substrate bilirubin (16.6 U/L). In addition, lowest binding energy with ABTS (−5.64), syringaldazine SYZ (−4.83), guaiacol GCL (−4.42), and 2,6-dimethoxyphenol DMP (−4.41) confirmed the presence of laccase. Further, complete remediation of two hazardous model pollutants, i.e., phenol and resorcinol (1.5 mM), after 12 h of incubation and low binding energy of −4.32 and −4.85, respectively, confirmed its removal by laccase. The results confirmed the presence of laccase in M. verrucaria ITCC-8447 and its effective bioremediation potential.
... The van der Waals and electrostatic terms were calculated using the Autodock parameter with distance-dependent dielectric functions. The docking experiment consisted of 50 different runs which were terminated after energy evaluation of 25 k and other docking parameters mentioned in the study by Zhou et al. (2014) were used where a translational step of 0.2 Å, and quaternion and torsion steps of 5 were applied during the search (Xiao et al. 2009; Lin and Lapointe 2013). From the docking analysis, binding energy and inhibition constant were inferred and the interactions of both the ligands were compared to determine the binding energy (Tiwari et al. 2019). ...
Article
In the present study, specificity of laccase from Stropharia sp. ITCC-8422 against various substrates, i.e. 2,2-azino-bis (3-ethylbenzothiazoline-6-sulfonic acid) (ABTS), 2,6-dimethoxyphenol (DMP), guaiacol (GCL) and syringaldazine (SYZ) was determined. It exhibited maximum affinity against ABTS, followed by DMP and negligible activity for GCL and SYZ. As the concentration of substrate increased from 0.5 to 1.5 mM (ABTS) and 1 to 5 mM (DMP), the activity increased from 301.1 to 567.8 U/L and 254.4 to 436.2 U/L. Further, quadrupole time-of-flight liquid chromatography mass spectrometry (QTOF-LCMS) analysis of the extracellular proteome of Stropharia sp. ITCC-8422 identified eighty-four (84) extracellular proteins. The peptide sequence for the enzyme of interest exhibited sequence similarity with laccase-5 of Trametes pubescens. Using high molecular mass sequence of laccase-5, the protein structure of laccase was modelled and binding energy of laccase with four substrates, i.e. ABTS (− 5.65), DMP (− 4.65), GCL (− 4.66) and SYZ (− 5.5) was determined using autodock tool. The experimental and in silico analyses revealed maximum activity of laccase and lowest binding energy with ABTS. Besides, laccase was purified and it exhibited 2.1-fold purification with purification yield of 20.4% and had stability of 70% at pH 5–9 and 30–40 ℃. In addition, the bioremediation potential of laccase was explored by in silico analysis, where the binding energy of laccase with alizarin cyanine green was − 6.37 and both in silico work and experimental work were in agreement.
... Drug target refers to the role of drug binding sites in the body, including gene loci [1], receptors [2,3], enzymes, ion channels, nucleic acids and other biological macromolecules [4]. Studies have shown that druggable proteins are closely related to the immune system, cardiovascular, hypertension and other diseases [5]. ...
Article
Discovering and accurately locating drug targets is of great significance for the research and development of new drugs. As a different approach to traditional drug development, the machine learning algorithm is used to predict the drug target by mining the data. Because of its advantages of short time and low cost, it has received more and more attention in recent years. In this paper, we propose a novel method for predicting druggable proteins. Firstly, the features of the protein sequence are extracted by combining Chou's pseudo amino acid composition (PseAAC), dipeptide composition (DPC) and reduced sequence (RS), getting the 591 dimension of drug target dataset. Then, the feature information of druggable proteins dataset is selected by genetic algorithm (GA). Finally, we use Bagging ensemble learning to improve SVM classifier to get the final prediction model. The predictive accuracy rate reaches 93.78% by using 5-fold cross-validation and compared with other state-of-the-art predictive methods. The results indicate that the method proposed in this paper has a high reference value for the prediction of potential drug targets, which will successfully play a key role in the drug research and development. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/GA-Bagging-SVM.
... This profile is unique for each residue where the presence of the particular residue is denoted by '1' and the absence by '0' (Fig. 5). This approach has been used earlier in many studies 21,[37][38][39]. In this study, we generated binary profile for first 5, 10 and 15 residues from N terminus as well as from C-terminus. ...
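The binary profile described in this excerpt is a one-hot encoding of each residue. A minimal sketch is given below; the function name, the alphabetical residue ordering, and the handling of peptides shorter than n (the vector is simply shorter) are assumptions made for illustration.

```python
# The 20 standard amino acids, in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_profile(peptide, n=5, terminus="N"):
    """One-hot encode the first n residues (N terminus) or last n residues (C terminus)."""
    segment = peptide[:n] if terminus == "N" else peptide[-n:]
    profile = []
    for residue in segment:
        # 20 entries per residue: 1 at the matching amino acid, 0 elsewhere.
        profile.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return profile
```

With n = 5, 10, or 15 this yields vectors of length 100, 200, or 300 for peptides that are at least n residues long.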
Article
Full-text available
Insect neuropeptides and their associated receptors have been one of the potential targets for pest control. The present study describes in silico models developed using natural and modified insect neuropeptides for predicting and designing new neuropeptides. Amino acid composition analysis revealed the preference of residues C, D, E, F, G, N, S, and Y in insect neuropeptides. The positional residue preference analysis shows that in natural neuropeptides residues like A, N, F, D, P, S, and I are preferred at the N terminus and residues like L, R, P, F, N, and G are preferred at the C terminus. Prediction models were developed using input features like amino acid and dipeptide composition and binary profiles, and implementing different machine learning techniques. The dipeptide composition based SVM model performed best among all the models. In the case of NeuroPIpred_DS1, the model achieved 86.50% accuracy and 0.73 MCC on the training dataset and 83.71% accuracy and 0.67 MCC on the validation dataset, whereas in the case of NeuroPIpred_DS2, the model achieved 97.47% accuracy and 0.95 MCC on the training dataset and 97.93% accuracy and 0.96 MCC on the validation dataset. In order to assist researchers, we created a standalone and user-friendly web server, NeuroPIpred, available at https://webs.iiitd.edu.in/raghava/neuropipred.
... Other docking parameters were used as described by Zhou et al. (2016). During the search, a translational step of 0.2 Å, and quaternion and torsion steps of 5 were applied (Lin and Lapointe 2013;Xiao et al. 2009). The results were binding free energy, electrostatic energy, inhibition constant and final intermolecular energy, which include van-der Waals, H bond, and distortion energy. ...
Article
Studies on phytochemicals as anti-aflatoxigenic agents have gained importance including quercetin. Thus, to understand the molecular mechanism behind inhibition of aflatoxin biosynthesis by quercetin, interaction study with polyketide synthase A (PksA) of Aspergillus flavus was undertaken. The 3D structure of seven domains of PksA was modeled using SWISS-MODEL server and docking studies were performed by Autodock tools-1.5.6. Docking energies of both the ligands (quercetin and hexanoic acid) were compared with each of the domains of PksA enzyme. Binding energy for quercetin was lesser that ranged from − 7.1 to − 5.25 kcal/mol in comparison to hexanoic acid (− 4.74 to − 3.54 kcal/mol). LigPlot analysis showed the formation of 12 H bonds in case of quercetin and 8 H bonds in hexanoic acid. During an interaction with acyltransferase domain, both ligands showed H bond formation at Arg63 position. Also, in product template domain, quercetin creates four H bonds in comparison to one in hexanoic acid. Our quantitative RT-PCR analysis of genes from aflatoxin biosynthesis showed downregulation of pksA, aflD, aflR, aflP and aflS at 24 h time point in comparison to 7 h in quercetin-treated A. flavus. Overall results revealed that quercetin exhibited the highest level of binding potential (more number of H bonds) with PksA domain in comparison to hexanoic acid; thus, quercetin possibly inhibits via competitively binding to the domains of polyketide synthase, a key enzyme of aflatoxin biosynthetic pathway. Further, we propose that key enzymes from aflatoxin biosynthetic pathway in aflatoxin-producing Aspergilli could be explored further using other phytochemicals as inhibitors.
... GPCRs are difficult to crystallize and even the current state-of-the-art NMR techniques have limitations of cost and time for resolving membrane proteins. The growing numbers of GPCR sequences warrants the need for some computational methods for predicting the families and subfamilies of GPCRs based on their primary sequence [196,197]. This information can contribute towards the "comparative and evolutionary pharmacology" term used in drug designing pathways. ...
Article
The key down player remains for drug industries is the selection of biological target currently attracting huge investments to validate them in patient subsets in order to study potential drugs. GPCRs are one such druggable target wherein non-olfactory GPCRs are encoded by 50% of the human genome-encoded which are therapeutically unexploited. The concept of polypharmacology has become an essential part of therapeutics for complex diseases like schizophrenia, cancer, etc., and considering their promiscuity and selectivity, GPCRs could emerge as novel targets. The availability of 44 crystal structures of unique receptors and 205 ligand-receptor complexes now presents a strong foundation for structure-based drug discovery and design. Further, 34% of all the drugs approved by the US Food and Drug Administration (FDA) act at 108 unique GPCRs, which indicates the benefits of considering the correct target in drug designing. The important GPCR families currently in clinical trials can offer huge understanding towards the drug designing prospects including "off-target" effects, reducing economical resource and time. This review will concentrate on the established and in-trial GPCR families in clinical trials. The druggability of GPCR protein families and critical roles played by them in complex diseases are explained for considering them as targets for novel drug discovery ventures.
... They should also be expressed in a manner that allows for specific targeting. There are several groups of drug targets, such as proteases, kinases, transporter proteins, G protein-coupled receptors (GPCRs), ion channels and nuclear receptors [8]. Among these groups, GPCRs (23%) and enzymes (50%) are the most important classes of drug targets [9]. ...
Article
Full-text available
Background: From a therapeutic viewpoint, understanding how drugs bind and regulate the functions of their target proteins to protect against disease is crucial. The identification of drug targets plays a significant role in drug discovery and studying the mechanisms of diseases. Therefore the development of methods to identify drug targets has become a popular issue. Methods: We systematically review the recent work on identifying drug targets from the view of data and method. We compiled several databases that collect data more comprehensively and introduced several commonly used databases. Then divided the methods into two categories: biological experiments and machine learning, each of which is subdivided into different subclasses and described in detail. Results: Machine learning algorithms are the majority of new methods. Generally, an optimal set of features is chosen to predict successful new drug targets with similar properties. The most widely used features include sequence properties, network topological features, structural properties, and subcellular locations. Since various machine learning methods exist, improving their performance requires combining a better subset of features and choosing the appropriate model for the various datasets involved. Conclusion: The application of experimental and computational methods in protein drug target identification has become increasingly popular in recent years. Current biological and computational methods still have many limitations due to unbalanced and incomplete datasets or imperfect feature selection methods.
... This Grey-PSSM model can be seen in [9,10]. References [12,13] described a protein as a cellular automaton image and constructed its grey level co-occurrence matrix (GLCM) to express the protein. In this study, the four GLCM- ...
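To make the grey level co-occurrence matrix mentioned in the excerpt above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names `glcm` and `texture_features`, the default horizontal offset, and the three descriptors computed (contrast, energy, homogeneity) are choices made for this example. They are common texture descriptors and are not necessarily the GLCM factors used in the cited studies.

```python
from typing import Dict, List


def glcm(image: List[List[int]], levels: int, dr: int = 0, dc: int = 1) -> List[List[float]]:
    """Normalized co-occurrence matrix for pixel pairs separated by (dr, dc).

    `image` holds integer grey levels in the range 0..levels-1.
    The offset (0, 1) pairs each pixel with its right-hand neighbour
    (an assumption made for this sketch).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("no pixel pairs exist for this offset")
    return [[v / total for v in row] for row in counts]


def texture_features(p: List[List[float]]) -> Dict[str, float]:
    """A few common texture descriptors derived from a normalized GLCM."""
    contrast = energy = homogeneity = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            contrast += pij * (i - j) ** 2
            energy += pij ** 2
            homogeneity += pij / (1 + (i - j) ** 2)
    return {"contrast": contrast, "energy": energy, "homogeneity": homogeneity}


if __name__ == "__main__":
    # A tiny binary "image", such as a cellular automaton image would provide.
    demo = [[0, 1, 1],
            [1, 0, 1],
            [0, 0, 1]]
    print(texture_features(glcm(demo, levels=2)))
```

For a binary cellular automaton image, `levels=2` is sufficient; the resulting descriptors could then serve as components of a feature vector for a protein.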
Article
Full-text available
Protein remote homology detection is one of the most basic and core problems of protein structure and function research. The purpose of protein remote homology detection is to detect the remote evolutionary relationship between proteins by computational methods. At present, there are many methods for remote homology detection of proteins, but some methods are not very effective. In bioinformatics, it is urgent to further improve the performance of protein remote homology detection. In this study, we propose a new model called PHom-GRA to detect protein remote homology, which integrates various ranking methods using grey relational analysis. Our experiment constructs a benchmark dataset with lower homology, in which any two proteins have not more than 40% identity or homology. We achieve an ROC1 score of 0.7372 and an ROC50 score of 0.7968 in the jackknife test.
... Elementary Rule 84 was used for predicting protein subcellular location [26], classifying proteins based on their structure class [27][28][29], predicting G-protein-coupled receptor functional classes [30], and predicting transmembrane regions in proteins [31]. ...
Article
Full-text available
The design of a protein folding approximation algorithm is not straightforward even when a simplified model is used. The folding problem is a combinatorial problem, where approximation and heuristic algorithms are usually used to find near optimal folds of proteins primary structures. Approximation algorithms provide guarantees on the distance to the optimal solution. The folding approximation approach proposed here depends on two-dimensional cellular automata to fold proteins presented in a well-studied simplified model called the hydrophobic–hydrophilic model. Cellular automata are discrete computational models that rely on local rules to produce some overall global behavior. One-third and one-fourth approximation algorithms choose a subset of the hydrophobic amino acids to form H–H contacts. Those algorithms start with finding a point to fold the protein sequence into two sides where one side ignores H’s at even positions and the other side ignores H’s at odd positions. In addition, blocks or groups of amino acids fold the same way according to a predefined normal form. We intend to improve approximation algorithms by considering all hydrophobic amino acids and folding based on the local neighborhood instead of using normal forms. The CA does not assume a fixed folding point. The proposed approach guarantees one half approximation minus the H–H endpoints. This lower bound guaranteed applies to short sequences only. This is proved as the core and the folds of the protein will have two identical sides for all short sequences.
... One of the well-known and well-studied CAs is the elementary ones; they were added to the pseudo-amino acid (PseAA) representation and used in classifying proteins based on their structural classes [48,51]. In addition to the classification, this model is used in finding the proteins subcellular location [49], predicting the G-protein-coupled receptor functional classes [50], and to predict transmembrane regions [12]. The CA deployed is the elementary CA of rule 84. ...
Article
Full-text available
It is self-evident that the coarse-grained view of transcription and protein translation is a result of certain computations. Although there is no single definition of the term “computation,” protein translation can be implemented over mathematical models of computers. Protein folding, however, is a combinatorial problem; it is still unknown whether a fast, accurate, and optimal folding algorithm exists. The discovery of near-optimal folds depends on approximation algorithms and heuristic searches. The hydrophobic–hydrophilic (HP) model is a simplified representation of some of the realities of protein structure. Despite the simplified representation, the folding problem in the HP model was proven to be NP-complete. We use simple and local rules to model translation and folding of proteins. Local rules imply that at a certain level of abstraction an entity can move from a state to another based on its state and information collected from its neighborhood. Also, the rules are simple in a sense that they do not require complicated computation. We use one-dimensional cellular automata to describe translation of mRNA into protein. Cellular automata are discrete models of computation that use local interactions to produce a global behavior of some sort. We will also discuss how local rules can improve approximation algorithms of protein folding and give an example of a CA that accept a certain family of strings to achieve half H–H contacts.
... We also created the binary profile for the N5C5, N10C10, and N15C15 residues of peptides by combining N- and C-terminus residues. The binary profile has been used heavily in a number of studies for predicting functional properties of peptides (Xiao et al., 2009; Gautam et al., 2013; Chaudhary et al., 2016). ...
Article
Full-text available
This paper describes in silico models developed using a wide range of peptide features for predicting antifungal peptides (AFPs). Our analyses indicate that certain types of residue (e.g., C, G, H, K, R, Y) are more abundant in AFPs. The positional residue preference analysis reveals the prominence of particular types of residues (e.g., R, V, K) at the N-terminus and of certain types of residues (e.g., C, H) at the C-terminus. In this study, models have been developed for predicting AFPs using a wide range of peptide features (such as residue composition, binary profile, and terminal residues). The support vector machine based model developed using compositional features of peptides achieved a maximum accuracy of 88.78% on the training dataset and 83.33% on the independent or validation dataset. Our model developed using binary patterns of terminal residues of peptides achieved a maximum accuracy of 84.88% on the training dataset and 84.64% on the validation dataset. We benchmark the models developed in this study and existing methods on a dataset containing compositionally similar antifungal and non-AFPs. It was observed that the binary-based model developed in this study performs better than any other model or method. To assist the scientific community, we developed a mobile app, a standalone version, and a user-friendly web server ‘Antifp’ (http://webs.iiitd.edu.in/raghava/antifp).
... Two outputs are produced: clusters (representatives and the redundant records) and a non-redundant collection (only the representatives). DisProt Protein 50% [82]; GPCRDB Protein 40% [92], 90% [43]; PDB-minus Protein 40% [61]; Phylogenetic Receptor 40% [43]; PupDB Protein 98% [86]; SEG Nucleotide 40% [75]; Swiss-Prot Protein 40% [11,26,44], 50% [85], 60% [32,39,50,69,85], 70% [85], 75% [51], 80% [48,51,85], 90% [50,51], 96% [49]; UBIDATA Protein 40%, 50% … 80% [87]; UniProtKB Protein 40% [83], 50% [83,84], 75% [83], 90% [83,84], 95% [77], 100% [84]. Dataset: the source of the full or sampled records used in the studies; Type: record type; Threshold: the chosen threshold value when clustering the database. ...
Article
Full-text available
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
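The threshold-based deduplication described above can be sketched as a greedy incremental clustering. The Python below is a simplified illustration, not the CD-HIT or UCLUST implementation: the function names are hypothetical, and `difflib.SequenceMatcher.ratio()` is used as a stand-in similarity score, whereas real tools rely on alignment-based sequence identity and word-filter heuristics.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering, longest sequences first.

    Each sequence joins the first representative whose similarity reaches
    `threshold`; otherwise it becomes the representative of a new cluster.
    Returns a mapping: representative -> members (representative included).
    """
    clusters: Dict[str, List[str]] = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(rep, s) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    records = ["ACDEFGHIK", "ACDEFGHIR", "WWWWYYYY"]
    clusters = greedy_cluster(records, threshold=0.8)
    non_redundant = list(clusters)  # only the representatives
    print(clusters)
    print(non_redundant)
```

The two outputs mirror the ones usually reported by such tools: the full clusters and the non-redundant set of representatives. Because every new sequence is compared only against representatives, members of a cluster are not guaranteed to be similar to each other above the threshold, which is one source of the residual redundancy assessed in the abstract.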
... A three dimensional CA model was proposed in [14], which is a theoretical model that presents the use of heuristic rules of biochemistry and thermodynamics in a 3D cubic space. The PseAAC protein composition given in [15] combined with elementary rule 84 were proposed to classify proteins based on their structure classes [16] [17] and to predict transmembrane regions [18] and G-protein-coupled receptor functional classes [19]. Simply, the model converts the protein to a binary representation using the digital coding proposed in [20] [21]. ...
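The pipeline outlined in this excerpt (encode the protein as bits, then evolve the bit string with elementary rule 84) can be sketched in Python as follows. Two parts are assumptions made for this illustration and are labeled as such in the code: the 5-bit code per amino acid is a placeholder (the binary index of the residue in an alphabetical list), not the digital coding published in the cited works, and periodic boundary conditions are assumed. The rule lookup itself follows the standard Wolfram numbering of elementary cellular automata.

```python
from typing import List

# Placeholder alphabet order; the published digital coding differs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(seq: str) -> List[int]:
    """Map each residue to 5 bits (hypothetical coding for illustration).

    Raises ValueError for characters outside the 20 standard residues.
    """
    bits: List[int] = []
    for aa in seq:
        idx = AMINO_ACIDS.index(aa)
        bits.extend(int(b) for b in format(idx, "05b"))
    return bits


def step(cells: List[int], rule: int = 84) -> List[int]:
    """One update of an elementary CA with periodic boundaries (assumed).

    In Wolfram numbering, bit k of `rule` is the new state for the
    neighbourhood (left, centre, right) whose binary value is k.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        k = (left << 2) | (centre << 1) | right
        out.append((rule >> k) & 1)
    return out


def ca_image(seq: str, steps: int, rule: int = 84) -> List[List[int]]:
    """Stack the initial bit row and `steps` successive rows into an image."""
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows


if __name__ == "__main__":
    for row in ca_image("MKTAYIAK", steps=8):
        print("".join("#" if bit else "." for bit in row))
```

Each row of the result is one time step, so the image has `steps + 1` rows and five columns per residue; texture descriptors can then be computed from it.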
Article
Full-text available
Cellular automata are discrete computational models that rely on local rules. The main focus of this paper is to build a model of proteins based on simple and local rules of a cellular automaton. Research in this direction depends mainly on combining cellular automata with other paradigms. Many schemes in the literature rely on different evolutionary algorithms to support the use of cellular automata, and some depend on combining protein parameters with parameters extracted from a cellular automaton image. The aim here is to keep the simplicity of cellular automata as much as possible. It is not yet known whether a set of local rules that can solve the protein folding problem exists. So far, research depends on some sort of searching or a global view of the sequence in order to find a reasonable conformation. This paper discusses what simple rules can be like. The proposed cellular automaton rules and states depend on a well-known simple exact model and the basic principles governing protein folding. In the proposed cellular automaton, the cell state can be a hydrophobic amino acid, a polar amino acid, an empty cell, or a control cell. The argument for local rules is supported by graphical examples of applying the proposed rules.
... Wang et al. first collected drug pharmacological and therapeutic effects, drug chemical structures, and protein genomic information to characterize the DTIs and then proposed a kernel-based method to predict DTIs by integrating multiple types of data [14]. Other methods developed machine learning methods focusing on HIV protease cleavage site prediction [15], identification of GPCR (G protein-coupled receptors) type [16], protein subcellular location prediction [17,18], membrane protein type prediction [19], and a series of relevant webserver predictors as summarized in a review [20]. ...
Article
Full-text available
Background Drug-target interaction is key in drug discovery, especially in the design of new lead compound. However, the work to find a new lead compound for a specific target is complicated and hard, and it always leads to many mistakes. Therefore computational techniques are commonly adopted in drug design, which can save time and costs to a significant extent. Results To address the issue, a new prediction system is proposed in this work to identify drug-target interaction. First, drug-target pairs are encoded with a fragment technique and the software “PaDEL-Descriptor.” The fragment technique is for encoding target proteins, which divides each protein sequence into several fragments in order and encodes each fragment with several physiochemical properties of amino acids. The software “PaDEL-Descriptor” creates encoding vectors for drug molecules. Second, the dataset of drug-target pairs is resampled and several overlapped subsets are obtained, which are then input into kNN (k-Nearest Neighbor) classifier to build an ensemble system. Conclusion Experimental results on the drug-target dataset showed that our method performs better and runs faster than the state-of-the-art predictors.
... The most commonly used machine learning methods have been widely applied to investigate drug-target interaction problem. Some focused on HIV protease cleavage site prediction [15], identification of GPCR (G protein-coupled receptors) type [16], protein sub-cellular location prediction [17,18], membrane protein type prediction [19], and a series of relevant web-server predictors as summarized in a recent review [20]. ...
Article
Drug-target interaction is key in drug discovery. Since the determination of drug-target interactions is costly and time-consuming by in vitro experiments, computational method is a complement to determine the interactions. To address the issue, a random projection ensemble approach is proposed. First, drug-compounds are encoded with feature descriptors by software “PaDEL-Descriptor”. Second, target proteins are encoded with physiochemical properties of amino acids, where the 34 relatively independent physiochemical properties are extracted from 544 properties in AAindex1 database. Random projection on the vector of drug-target pair with different dimensions can project the original space onto a reduced one and thus yield a transformed vector with a fixed dimension. Several random projections build an ensemble REPTree system. Experimental results show that our method significantly outperforms and runs faster than other state-of-the-art drug-target predictors, on the commonly used drug-target benchmark sets.
... It has been shown that the proteins with similar structural attributes should have some similar textures in their cellular automata images (Wolfram 2002). Hence, the existing image processing tools and texture-related features can be used to extract hidden information and study protein-related problems (Xiao et al. 2005a, 2006, 2008, 2009). As shown in Fig. 2, proteins with different structural class have different images in terms of texture pattern. ...
Article
Full-text available
Nowadays, having knowledge about cellular attributes of proteins has an important role in pharmacy, medical science and molecular biology. These attributes are closely correlated with the function and three-dimensional structure of proteins. Knowledge of protein structural class is used by various methods for better understanding the protein functionality and folding patterns. Computational methods and intelligence systems can have an important role in performing structural classification of proteins. Most of protein sequences are saved in databanks as characters and strings and a numerical representation is essential for applying machine learning methods. In this work, a binary representation of protein sequences is introduced based on reduced amino acids alphabets according to surrounding hydrophobicity index. Many important features which are hidden in these long binary sequences can be clearly displayed through their cellular automata images. The extracted features from these images are used to build a classification model by support vector machine. Comparing to previous studies on the several benchmark datasets, the promising classification rates obtained by tenfold cross-validation imply that the current approach can help in revealing some inherent features deeply hidden in protein sequences and improve the quality of predicting protein structural class.
... The NR superfamily has been classified into seven families: NR0 (knirps or DAX like) [4,5]; NR1 (thyroid hormone like), NR2 (HNF4-like), NR3 (estrogen like), NR4 (nerve growth factor IB-like), NR5 (fushi tarazu-F1 like), and NR6 (germ cell nuclear factor like). Since they are involved in almost all aspects of human physiology and are implicated in many major diseases such as cancer, diabetes and osteoporosis, nuclear receptors have become major drug targets [6,7], along with G protein-coupled receptors (GPCRs) [8][9][10][11][12][13][14][15][16][17], ion channels [18][19][20], and kinase proteins [21][22][23][24]. Identification of drug-target interactions is one of the most important steps for the new medicine development [25,26]. ...
... Moreover, graphical approaches have been utilized to deal with complicated network systems [167,168] and identify the hub proteins from complicated network systems [169]. Recently, the "cellular automaton image" [170,171] has also been applied to study hepatitis B viral infections [172], HBV virus gene missense mutation [173], and visual analysis of SARS-CoV [174], as well as representing complicated biological sequences [175] and helping to identify their attributes [19,176]. ...
Article
Full-text available
Facing the explosive growth of biological sequence data, such as those of protein/peptide and DNA/RNA, generated in the post-genomic age, many bioinformatical and mathematical approaches as well as physicochemical concepts have been introduced to derive useful information from these biological sequences in a timely manner, in order to stimulate the development of medical science and drug design. Meanwhile, because of the rapid penetration from these disciplines, medicinal chemistry is currently undergoing an unprecedented revolution. In this minireview, we summarize the progress by focusing on the following six aspects. (1) Use the pseudo amino acid composition or PseAAC to predict various attributes of protein/peptide sequences that are useful for drug development. (2) Use pseudo oligonucleotide composition or PseKNC to do the same for DNA/RNA sequences. (3) Introduce the multi-label approach to study those systems where the constituent elements bear multiple characters and functions. (4) Utilize the graphical rules and "wenxiang" diagrams to analyze complicated biomedical systems. (5) Recent development in identifying the interactions of drugs with their various types of target proteins in cellular networking. (6) Distorted key theory and its application in developing peptide drugs.
Article
Full-text available
Accurate identification of drug-targets in human body has great significance for designing novel drugs. Compared with traditional experimental methods, prediction of drug-targets via machine learning algorithms has enhanced the attention of many researchers due to fast and accurate prediction. In this study, we propose a machine learning-based method, namely XGB-DrugPred for accurate prediction of druggable proteins. The features from primary protein sequences are extracted by group dipeptide composition, reduced amino acid alphabet, and novel encoder pseudo amino acid composition segmentation. To select the best feature set, eXtreme Gradient Boosting-recursive feature elimination is implemented. The best feature set is provided to eXtreme Gradient Boosting (XGB), Random Forest, and Extremely Randomized Tree classifiers for model training and prediction. The performance of these classifiers is evaluated by tenfold cross-validation. The empirical results show that XGB-based predictor achieves the best results compared with other classifiers and existing methods in the literature.
Article
With the avalanche of biological sequences discovered in the postgenomic era, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, yet still keep its considerable sequence-order information or special pattern. To deal with this problem, the idea of “pseudo amino acid components” or “PseAAC” was proposed in 2001. In this paper, the author has recalled the proposal of “pseudo amino acid components” and its significant and substantial impacts on proteome and genome analyses as well as developing novel and effective drugs, particularly peptide drugs.
Article
Objective One of the most challenging and also the most difficult problems is how to formulate a biological sequence with a vector but considerably keep its sequence order information. Methods To address such a problem, the approach of pseudo amino acid components or PseAAC have been developed. Results and Conclusion It has become increasingly clear via the 20-year recollection that the aforementioned proposal has been indeed very powerful and awesome.
Article
Solution NMR spectroscopy plays important roles in understanding protein structures, dynamics and protein-protein/ligand interactions. In a target-based drug discovery project, NMR can serve an important function in the steps of hit identification and lead optimization. Fluorine is a valuable probe for evaluating protein conformational changes and protein-ligand interactions. Accumulated studies demonstrate that 19F-NMR will be playing important roles in fragment-based drug discovery (FBDD) and probing protein-ligand interactions. This review summarizes the application of 19F-NMR in understanding protein-ligand interactions and drug discovery. Several examples are included to show the roles of 19F-NMR in confirming identified hits/leads in the drug discovery process. In addition to identifying hits from fluorine-containing compound libraries, 19F-NMR will play an important role in drug discovery by providing a fast and robust way in novel hit identification. This technique can be used for ranking compounds with different binding affinities and particularly useful for screening competitive compounds when a reference ligand is available.
Conference Paper
In this article, a novel method of protein similarity analysis is proposed where amino acid sequences of proteins are employed to generate cellular automata images. The pairwise similarities of these images are measured from their horizontal detail images derived from wavelet decomposition by symlet wavelets. The distance metric to measure the similarity is chosen as the normalized Laplacian pyramid. From the pairwise similarity measures, a phylogenetic tree is generated and the tree is compared with two recent methods of similarity analysis. The proposed method has been proved effective to render a computationally faster and alignment-free method of protein similarity analysis.
Article
Drug abuse (DA) or drug addiction is a complicated brain disorder which is commonly considered to involve neurobiological impairments caused by both genetic factors and environmental effects. Among DA-related targets, G protein-coupled receptors (GPCRs) play an important role in DA therapy. However, only 52 GPCRs have been published with crystal structures in the recent two decades. In an effort to overcome the limitation of crystal structure availability and the conformational diversity of GPCRs, we built homology models and performed conformational searches by molecular dynamics (MD) simulation. To accelerate and facilitate drug abuse research, we constructed a DA-related GPCR-specific chemogenomics knowledgebase (KB) (DAKB-GPCRs), implemented with our established and novel chemogenomics tools as well as algorithms for data analysis and visualization. Our established TargetHunter and HTDocking, as well as our novel tools, which include target classification and Spider Plot, are compiled into the platform. Our DAKB-GPCRs provides the following results for a query compound: (1) blood-brain barrier (BBB) plot via our BBB predictor, (2) docking scores via our HTDocking, (3) similarity score via our TargetHunter, (4) target classification via machine learning methods that utilize both docking scores and similarity score, and (5) drug-target interaction network via our Spider Plot.
Article
Full-text available
Structural variability in the RLR/viral RNA (vRNA) complex, the two different symmetry groups of the MAVS filament, and the different cytosolic RNAs (cRNA) competing for RLR binding are discussed from the biophysical standpoint of detecting single viral RNAs with sufficient sensitivity. A different view, examining RLR/vRNA versus RLR/cRNA, focuses on physiological detection. The importance of the recently discovered mediator (zyxin) for the interaction between RLR/vRNA and silent MAVS is brought into the macroscopic picture. The two different groups of MAVS filaments and their ramifications are discussed. Lingering questions in this important signaling pathway are highlighted.
Article
Background B-cell epitope prediction is an essential tool for a variety of immunological studies. For identifying such epitopes, several computational predictors have been proposed in the past 10 years. Objective In this review, we summarized the representative computational approaches developed for the identification of linear B-cell epitopes. Methods: We mainly discuss the datasets, feature extraction methods and classification methods used in the previous work. Results The performance of the existing methods was not very satisfying, and so more effective approaches should be proposed by considering the structural information of proteins. Conclusion We consider existing challenges and future perspectives for developing reliable methods for predicting linear B-cell epitopes.
Preprint
Full-text available
Cells need high-sensitivity detection of non-self molecules in order to fight against pathogens. These cellular sensors are thus of significant importance to medicinal purposes, especially for treating novel emerging pathogens. RIG-I-like receptors (RLRs) are intracellular sensors for viral RNAs (vRNAs). Their active forms activate mitochondrial antiviral signaling protein (MAVS) and trigger downstream immune responses against viral infection. Functional and structural studies of the RLR-MAVS signal pathway have revealed significant supramolecular variability in the past few years, which revealed different aspects of the functional signaling pathway. Here I will discuss the molecular events of RLR-MAVS pathway from the angle of detecting single copy or a very low copy number of vRNAs in the presence of non-specific competition from cytosolic RNAs, and review key structural variability in the RLR / vRNA complexes, the MAVS helical polymers, and the adapter-mediated interactions between the active RLR / vRNA complex and the inactive MAVS in triggering the initiation of the MAVS filaments. These structural variations may not be exclusive to each other, but instead may reflect the adaptation of the signaling pathways to different conditions or reach different levels of sensitivity in its response to exogenous vRNAs.
Chapter
CA model for protein is covered in this chapter. Design of CA rule for amino acid backbone is reported followed by the design of PCAM (Protein Modelling CA Machine). A protein chain with n amino acids is represented by 8n cell PCAM, each amino acid is represented with 8 CA cells. Digitally encoded amino acid side chain with 8-bit string is used as the initial state of 8 CA cells used for its backbone. For a protein chain, provision of 64 PCAMs is provided for modelling its interaction with different biomolecules. Protein interaction modelling is reported for two applications—(i) predicting binding contact residues for protein–protein interaction, and (ii) mutational study. Predicted results derived out of PCAM model are validated against the wet lab experimental results reported in databases and publications. In addition to validation of the results, the case study identifies which specific PCAM (out of the available 64) is appropriate for modelling the interaction. The binding affinity of two Monoclonal Antibodies (MAbs) on wild and mutated version of PD-L1 (implicated cancer immunotherapy) are reported confirming the CA rule appropriate for the mutational study of PD-L1 for a specific MAb.
Article
Full-text available
The analysis of a large number of human and mouse genes codifying for a populated cluster of transmembrane proteins revealed that some of the genes significantly vary in their primary nucleotide sequence inter-species and also intra-species. In spite of that divergence and of the fact that all these genes share a common parental function we asked the question of whether at DNA level they have some kind of common compositional structure, not evident from the analysis of their primary nucleotide sequence. To reveal the existence of gene clusters not based on primary sequence relationships we have analyzed 13574 human and 14047 mouse genes by the composon-clustering methodology. The data presented show that most of the genes from each one of the samples are distributed in 18 clusters sharing the common compositional features between the particular human and mouse clusters. It was observed, in addition, that between particular human and mouse clusters having similar composon-profiles large variations in gene population were detected as an indication that a significant amount of orthologs between both species differs in compositional features. A gene cluster containing exclusively genes codifying for transmembrane proteins, an important fraction of which belongs to the Rhodopsin G-protein coupled receptor superfamily, was also detected. This indicates that even though some of them display low sequence similarity, all of them, in both species, participate with similar compositional features in terms of composons. We conclude that in this family of transmembrane proteins in general and in the Rhodopsin G-protein coupled receptor in particular, the composon-clustering reveals the existence of a type of common compositional structure underlying the primary nucleotide sequence closely correlated to function.
Article
Background and objective: The G-protein coupled receptors are the largest superfamilies of membrane proteins and important targets for the drug design. G-protein coupled receptors are responsible for many physiochemical processes such as smell, taste, vision, neurotransmission, metabolism, cellular growth and immune response. So it is necessary to design a robust and efficient approach for the prediction of G-protein coupled receptors and their subfamilies. Methods: In this paper, the protein samples are represented by amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, sequence order descriptors and pseudo amino acid composition with total 1497 number of sequence derived features. To address the issue of efficient classification of G-protein coupled receptors and their subfamilies, we propose to use a weighted k-nearest neighbor classifier with UNION of best 50 features, selected by Fisher score based feature selection, ReliefF, fast correlation based filter, minimum redundancy maximum relevancy, and support vector machine based recursive elimination feature selection methods to exploit the advantages of these feature selection methods. Results: The proposed method achieved an overall accuracy of 99.9%, 98.3%, 95.4%, MCC values of 1.00, 0.98, 0.95, ROC area values of 1.00, 0.998, 0.996 and precision of 99.9%, 98.3% and 95.5% using 10-fold cross-validation to predict the G-protein coupled receptors and non-G-protein coupled receptors, subfamilies of G-protein coupled receptors, and subfamilies of class A G-protein coupled receptors, respectively. Conclusions: The high accuracies, MCC, ROC area values, and precision values indicate that the proposed method is better for the prediction of G-protein coupled receptors families and their subfamilies.
Chapter
X-ray crystallography is a powerful tool for determining protein 3D structure. However, it is time-consuming and expensive, and not all proteins can be successfully crystallized, particularly membrane proteins. Although NMR spectroscopy is indeed a very powerful tool for determining the 3D structures of membrane proteins, it is also time-consuming and costly. To the best of the author’s knowledge, there is little structural data available on the AGAAAAGA palindrome in the hydrophobic region (113–120) of prion proteins due to the noncrystalline and insoluble nature of the amyloid fibril, although many experimental studies have shown that this region has amyloid fibril forming properties and plays an important role in prion diseases. In view of this, the present study is devoted to addressing this problem through computational approaches such as global energy optimization, simulated annealing, and structural bioinformatics. The optimal atomic-resolution structures of prion AGAAAAGA amyloid fibrils reported in this chapter are of value to the scientific community in its drive to find treatments for prion diseases.
Article
G protein-coupled receptors (GPCRs) are integral membrane proteins with seven transmembrane helices. Belonging to the largest family of cell surface receptors, GPCRs are among the most frequent targets of therapeutic drugs. Unfortunately, since they are difficult to crystallize and most of them will not dissolve in normal solvents, the number of GPCRs with a determined three-dimensional structure is so far very limited. This situation has challenged us to develop automated methods by which one can predict the family and sub-family classes of GPCRs based on the information of their primary sequences alone, so as to facilitate classifying drugs, a technique called "evolutionary pharmacology" that is often used in the pharmaceutical industry for drug development. In the past eight years, various computational methods have been proposed. This review is devoted to summarizing their development. The future challenges in this area are also briefly addressed.
Article
Meiotic recombination is vital for maintaining sequence diversity in the human genome. Meiosis and recombination are considered essential phases of cell division. In meiosis, the genome is divided into equal parts for sexual reproduction, whereas in recombination, diverse genomes are combined to form new combinations of genetic variations. Recombination does not occur randomly across the genome; it is concentrated in specific regions called recombination "hotspots" and is rare in "coldspots". Owing to the rapid growth of sequences in data banks, it is impractical to characterize them through conventional methods. Given the significance of recombination spots, it is indispensable to develop an accurate, fast, robust, and high-throughput automated computational model. In this model, the numerical descriptors are extracted using two sequence representation schemes, namely dinucleotide composition and trinucleotide composition. The performance of seven classification algorithms was investigated. Finally, the predicted outcomes of the individual classifiers are fused into an ensemble classification, formed through majority voting and a genetic algorithm (GA). The performance of the GA-based ensemble model is quite promising compared to the individual classifiers and the majority-voting-based ensemble model. iRSpot-GAEnsC achieved 84.46% accuracy. The empirical results revealed that the performance of iRSpot-GAEnsC is not only higher than that of the examined algorithms but also better than that of existing methods in the literature. It is anticipated that the proposed model might be helpful for the research community, academia and drug discovery.
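As a rough illustration of the two ingredients named in this abstract, the sketch below computes a normalized k-mer composition (k = 2 for dinucleotides, k = 3 for trinucleotides) and fuses class labels by simple majority voting. The function names and the tie-breaking rule are assumptions made for the example; the GA-weighted fusion used by iRSpot-GAEnsC is not reproduced.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k):
    """Normalized frequencies of all 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols other than A, C, G, T are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def majority_vote(predictions):
    """Fuse labels from several classifiers; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


features = kmer_composition("ACGTACGT", 2)      # 16-dimensional vector
label = majority_vote(["hot", "cold", "hot"])   # three hypothetical classifiers
```

A dinucleotide vector has 16 components and a trinucleotide vector has 64, so the two schemes trade descriptor size against the amount of local sequence-order information captured.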
Article
Peptide-based antiviral therapeutics have gradually paved their way into mainstream drug discovery research. Experimental determination of peptides' antiviral activity, as expressed by their IC50 values, involves a lot of effort. Therefore, we have developed "AVP-IC50 Pred", a regression-based algorithm to predict antiviral activity in terms of IC50 values (μM). A total of 759 non-redundant peptides from AVPdb and HIPdb were divided into a training/test set of 683 peptides (T(683)) and a validation set of 76 independent peptides (V(76)) for evaluation. We utilized important peptide sequence features such as amino acid compositions, binary profiles of N8-C8 residues, physicochemical properties and their hybrids. Four different machine learning techniques (MLTs), namely Support Vector Machine (SVM), Random Forest (RF), Instance Based Classifier (IBk) and K-Star (K*), were employed. During 10-fold cross validation, we achieved maximum Pearson correlation coefficients (PCCs) of 0.63, 0.60, 0.51, 0.50 respectively for the above MLTs using the best combination of feature sets. All the predictive models also performed well on the independent validation dataset and achieved maximum PCCs of 0.74, 0.73, 0.72, 0.67, 0.52, 0.50 respectively on the best combination of feature sets. The AVP-IC50 Pred web server is anticipated to assist researchers working on antiviral therapeutics by enabling them to computationally screen many compounds and focus experimental validation on the most promising set of peptides, thus reducing cost and time. The server is available at http://crdd.osdd.net/servers/ic50avp. © 2015 Wiley Periodicals, Inc.
Article
The composition and sequence order of amino acid residues are the two most important characteristics for describing a protein sequence. Graphical representations facilitate visualization of biological sequences and produce biologically useful numerical descriptors. In this paper, we propose a novel cylindrical representation that places the 20 amino acid residue types on a circle and sequence positions along the z-axis. This representation allows the composition and sequence order of amino acids to be visualized at the same time. Ten numerical descriptors and one weighted numerical descriptor were developed to quantitatively describe intrinsic properties of protein sequences based on the cylindrical model. Their application to similarity/dissimilarity analysis of nine ND5 proteins indicated that these numerical descriptors are more effective than several classical numerical matrices. Thus, the cylindrical representation obtained here provides a useful new tool for visualizing and characterizing protein sequences. An online server is available at http://biophy.dzu.edu.cn:8080/CNumD/input.jsp.
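The geometry of such a cylindrical representation can be sketched as follows. The ordering of residues around the circle (alphabetical one-letter codes), the unit radius, and the use of the 1-based sequence position as the z-coordinate are all assumptions made for illustration; the abstract does not state the paper's actual conventions, and the ten numerical descriptors are not reproduced.

```python
import math

# Assumed ordering of the 20 residue types around the circle.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cylindrical_points(sequence, radius=1.0):
    """Map each residue to (x, y, z): angle set by residue type, z by position."""
    step = 2 * math.pi / len(AMINO_ACIDS)
    points = []
    for z, residue in enumerate(sequence.upper(), start=1):
        # str.index raises ValueError for symbols outside the 20 standard residues.
        angle = AMINO_ACIDS.index(residue) * step
        points.append((radius * math.cos(angle), radius * math.sin(angle), float(z)))
    return points


curve = cylindrical_points("MKTAYIAK")  # one 3D point per residue
```

Because the angle encodes the residue type and the height encodes the position, two sequences with the same composition but a different residue order trace different curves on the cylinder.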
Article
Graphical representation of DNA sequences is a key component in studying biological problems. To gain new insights into DNA sequences, this paper combines numerical encodings of single bases, base pairs and base triplets with the number of occurrences of each base, and on that basis puts forward a novel 4D graphical representation of DNA sequences. The mapping between an arbitrary DNA sequence and its 4D graphical representation is one-to-one, which avoids non-unique representations and overlapping lines. The method can reflect the biological information of a DNA sequence more comprehensively and effectively, without loss of information. Based on the 4D graphical representation, we used its geometric center as the characteristic value for DNA sequence analysis, which preserves the original features of the data, and then computed the Euclidean distances and included angles between the vectors' terminal points for similarity analysis of the first exon of the beta-globin gene among 11 species. Finally, we built a hierarchical cluster diagram of the 11 species to make the relationships between species easier to observe. The results were in accord with biological taxonomy, which supports the rationality and effectiveness of the novel 4D graphical representation.
Article
In recent decades, the Exp-function method has been used for solving fractional differential equations. In this paper, we obtain exact solutions of the fractional generalized reaction Duffing model and the nonlinear fractional diffusion–reaction equation. The fractional derivatives are described in the modified Riemann–Liouville sense. The fractional complex transform has been suggested to convert fractional-order differential equations with modified Riemann–Liouville derivatives into integer-order differential equations, and the reduced equations can be solved by symbolic computation.
Article
G-protein-coupled receptors (GPCRs) play an important role in physiological processes and are the targets of more than 50% of marketed drugs. In this research, we use a hybrid approach of predicted secondary structural features (PSSF) and approximate entropy (ApEn) as the feature selection method for predicting G-protein-coupled receptors in low homology. A low-homology dataset is used to validate the objectivity of the proposed method. A classification model based on the fuzzy K-nearest neighbor classifier is applied to the classification of membrane protein data. To enhance the prediction accuracy, we propose an ensemble classifier as the prediction engine. Compared with the previous best-performing method, the success rate is encouraging. The reliable results also demonstrate that the proposed method could contribute to the characterization of various proteomes and be further utilized in neuroscience.
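For readers unfamiliar with approximate entropy, the following is a minimal sketch of the standard ApEn(m, r) statistic for a numeric series, with self-matches counted. How the cited work turns a protein sequence into such a series, and which values of m and r it uses, are not stated in the abstract, so the defaults below are assumptions.

```python
import math


def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) of a numeric series; requires len(series) > m + 1."""
    n = len(series)

    def phi(width):
        windows = [series[i:i + width] for i in range(n - width + 1)]
        total = 0.0
        for w in windows:
            # Count windows within tolerance r (Chebyshev distance),
            # including the window itself, so the count is never zero.
            matches = sum(
                1 for v in windows
                if max(abs(a - b) for a, b in zip(w, v)) <= r
            )
            total += math.log(matches / len(windows))
        return total / len(windows)

    return phi(m) - phi(m + 1)


flat = approximate_entropy([1.0] * 20)          # perfectly regular series
periodic = approximate_entropy([0.0, 1.0] * 10)  # strictly alternating series
```

Low values indicate a regular, predictable series; an irregular series of the same length generally yields a larger value.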
Chapter
A comparison of two sequences may uncover multiple regions of local similarity. While the significance of each local alignment may be evaluated independently, sometimes a combined assessment is appropriate. This paper discusses a variety of statistical and algorithmic issues that such an assessment presents.
Article
Membrane proteins are encoded by 20–35% of genes but represent <1% of known protein structures to date. Thus, improved methods for membrane-protein structure determination are of critical importance. Residual dipolar couplings (RDCs), commonly measured for biological macromolecules weakly aligned by liquid-crystalline media, are important global angular restraints for NMR structure determination. For α-helical membrane proteins >15 kDa in size, Nuclear-Overhauser effect-derived distance restraints are difficult to obtain, and RDCs could serve as the main reliable source of NMR structural information. In many of these cases, RDCs would enable full structure determination that otherwise would be impossible. However, none of the existing liquid-crystalline media used to align water-soluble proteins are compatible with the detergents required to solubilize membrane proteins. We report the design and construction of a detergent-resistant liquid crystal of 0.8-μm-long DNA-nanotubes that can be used to induce weak alignment of membrane proteins. The nanotubes are heterodimers of 0.4-μm-long six-helix bundles each self-assembled from a 7.3-kb scaffold strand and >170 short oligonucleotide staple strands. We show that the DNA-nanotube liquid crystal enables the accurate measurement of backbone NH and CαHα RDCs for the detergent-reconstituted ζ-ζ transmembrane domain of the T cell receptor. The measured RDCs validate the high-resolution structure of this transmembrane dimer. We anticipate that this medium will extend the advantages of weak alignment to NMR structure determination of a broad range of detergent-solubilized membrane proteins. • dipolar couplings • nanotechnology • liquid crystal • scaffolded origami
Article
Two Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) hyperspectral images selected from the Los Angeles area, one representing an urban and the other a rural landscape, were used to examine spatial complexity across the entire spectrum of the remote sensing data. Using the ICAMS (Image Characterization And Modeling System) software, we computed the fractal dimension values via the isarithm and triangular prism methods for all 224 bands in the two AVIRIS scenes. The resultant fractal dimensions reflect changes in image complexity across the spectral range of the hyperspectral images. Both the isarithm and triangular prism methods detect unusually high D values in the spectral bands that fall within the atmospheric absorption and scattering zones, where signal-to-noise ratios are low. Fractal dimensions for the urban area were higher than those for the rural landscape, and the differences between the resulting D values are more distinct in the visible bands. The triangular prism method is sensitive to a few random speckles in the images, leading to a lower dimensionality. In contrast, the isarithm method ignores the speckles and focuses on the major variation dominating the surface, thus resulting in a higher dimension. The results show that the fractal curves plotted for the entire bandwidth range of the hyperspectral images could be used to distinguish landscape types as well as to screen noisy bands.
Article
Although the yeast Saccharomyces cerevisiae is the best-characterized single-celled eukaryote, the vast majority of protein-protein interactions of its integral membrane proteins have not been characterized experimentally. Here, based on the kernel method of Greedy Kernel Principal Component Analysis plus Linear Discriminant Analysis, we identify 300 protein-protein interactions involving 189 membrane proteins and obtain a highly connected protein-protein interaction network. Furthermore, we study the global topological features of the integral membrane protein network of Saccharomyces cerevisiae. These results give a comprehensive description of the protein-protein interactions of integral membrane proteins and reveal the global topology and robustness of the interactome network at a system level. This work represents an important step towards a comprehensive understanding of yeast protein interactions.
Article
The location of a protein in a cell is closely correlated with its biological function. Based on the concept that the protein subcellular location is mainly determined by its amino acid and pseudo amino acid composition (PseAA), a new algorithm of increment of diversity combined with support vector machine is proposed to predict the protein subcellular location. The subcellular locations of plant and non-plant proteins are investigated by our method. The overall prediction accuracies in jackknife test are 88.3% for the eukaryotic plant proteins and 92.4% for the eukaryotic non-plant proteins, respectively. In order to estimate the effect of the sequence identity on predictive result, the proteins with sequence identity ≤40% are selected. The overall success rates of prediction are 86.2% and 92.3% for plant and non-plant proteins in jackknife test, respectively.
Article
The membrane protein type is an important feature in characterizing the overall topological folding type of a protein or its domains therein. Many investigators have devoted effort to the prediction of membrane protein type. Here, we propose a new approach, the bootstrap aggregating method or bagging learner, to address this problem based on the protein amino acid composition. As a demonstration, the benchmark dataset constructed by K.C. Chou and D.W. Elrod was used to test the new method. The overall success rate thus obtained by jackknife cross-validation was over 84%, indicating that the bagging learner as presented in this paper holds quite high potential for predicting the attributes of proteins, or at least can play a complementary role to many existing algorithms in this area. It is anticipated that the prediction quality can be further enhanced if the pseudo amino acid composition can be effectively incorporated into the current predictor. An online membrane protein type prediction web server developed in our lab is available at http://chemdata.shu.edu.cn/protein/protein.jsp.
Article
A protein is usually classified into one of the following five structural classes: alpha, beta, alpha + beta, alpha/beta, and zeta (irregular). The structural class of a protein is correlated with its amino acid composition. However, given the amino acid composition of a protein, how may one predict its structural class? Various efforts have been made in addressing this problem. This review addresses the progress in this field, with the focus on the state of the art, which is featured by a novel prediction algorithm and a recently developed database. The novel algorithm is characterized by a covariance matrix that takes into account the coupling effect among different amino acid components of a protein. The new database was established based on the requirement that the classes should have (1) as many nonhomologous structures as possible, (2) good quality structure, and (3) typical or distinguishable features for each of the structural classes concerned. The very high success rate for both the training-set proteins and the testing-set proteins, which has been further validated by a simulated analysis and a jackknife analysis, indicates that it is possible to predict the structural class of a protein according to its amino acid composition if an ideal and complete database can be established. It also suggests that the overall fold of a protein is basically determined by its amino acid composition.
Article
Given the amino acid composition of a protein, how may one predict its folding type? Although a number of methods have been proposed for this problem, none of them has taken into account the correlative effect among different amino acids, and hence the accuracy of prediction could not be improved to the extent that it should have been. In view of this, a new method has been developed in which the similarity between two protein molecules is based on the scale of Mahalanobis distance rather than on the ordinary intuitive geometric distances, such as Minkowski's distance and Euclidean distance. By introducing the Mahalanobis distance, the correlative effect among different amino acids can be automatically incorporated. Predictions have been performed for 131 real proteins consisting of alpha, beta, alpha+beta, and alpha/beta proteins. The results indicate that the rates of correct prediction for both alpha and beta proteins are 100%, and those for alpha+beta and alpha/beta are 88.9 and 89.7%, respectively, with an average accuracy of 94.7%. Predictions have also been performed for 10,000 simulated proteins generated by Monte Carlo sampling for each of the above four folding types, yielding an average accuracy of 95.9%. The accuracy thus obtained for the simulated proteins can avoid the bias due to the limited number of testing proteins selected arbitrarily by different investigators and hence can be regarded as an objective accuracy. It is anticipated that a method with such a high objective accuracy should become a reliable tool in predicting the protein folding type and a useful tool for improving the prediction of secondary structure as well.
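The contrast between Mahalanobis and Euclidean distance drawn in this abstract can be made concrete with a small sketch. It computes the squared Mahalanobis distance from a query vector to the mean of one class, using the sample covariance of that class; the helper names are hypothetical. One caveat is worth stating as an assumption of the sketch: amino acid composition vectors sum to one, so their full 20-dimensional covariance matrix is singular, and the caller is assumed to have dropped one component (or otherwise regularized) before calling it.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis_sq(x, samples):
    """Squared Mahalanobis distance from x to the mean of `samples`."""
    n, d = len(samples), len(x)
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    # Sample covariance matrix (n - 1 in the denominator).
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    diff = [x[j] - mean[j] for j in range(d)]
    # diff^T * cov^{-1} * diff, computed without forming the inverse.
    return sum(di * yi for di, yi in zip(diff, solve(cov, diff)))


class_samples = [[0, 0], [2, 0], [0, 1], [2, 1]]
d2 = mahalanobis_sq([3, 0.5], class_samples)
```

A query would then be assigned to the class with the smallest distance; unlike the Euclidean distance, the result rescales each direction by the spread and correlation observed within the class.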
Article
The GPCRDB is a G protein-coupled receptor (GPCR) database system aimed at the collection and dissemination of GPCR related data. It holds sequences, mutant data and ligand binding constants as primary (experimental) data. Computationally derived data such as multiple sequence alignments, three dimensional models, phylogenetic trees and two dimensional visualization tools are added to enhance the database's usefulness. The GPCRDB is an EU sponsored project aimed at building a generic molecular class specific database capable of dealing with highly heterogeneous data. GPCRs were chosen as test molecules because of their enormous importance for medical sciences and due to the availability of so much highly heterogeneous data. The GPCRDB is available via the WWW at http://www.gpcr.org/7tm
Article
The function of a protein is closely correlated with its subcellular location. With the rapid increase in new protein sequences entering into data banks, we are confronted with a challenge: is it possible to utilize a bioinformatic approach to help expedite the determination of protein subcellular locations? To explore this problem, proteins were classified, according to their subcellular locations, into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole. Based on the classification scheme that has covered almost all the organelles and subcellular compartments in an animal or plant cell, a covariant discriminant algorithm was proposed to predict the subcellular location of a query protein according to its amino acid composition. Results obtained through self-consistency, jackknife and independent dataset tests indicated that the rates of correct prediction by the current algorithm are significantly higher than those by the existing methods. It is anticipated that the classification scheme and concept and also the prediction algorithm can expedite the functionality determination of new proteins, which can also be of use in the prioritization of genes and proteins identified by genomic efforts as potential molecular targets for drug design.
Article
Motivation: It is well known that the regulatory regions of genomes are highly repetitive. They are rich in direct, symmetric and complemented repeats, and there is no doubt about the functional significance of these repeats. Among known measures of complexity, the Ziv-Lempel complexity measure reflects most adequately repeats occurring in the text. But this measure does not take into account isomorphic repeats. By isomorphic repeats we mean fragments that are identical (or symmetric) modulo some permutation of the alphabet letters. Results: In this paper, two complexity measures of symbolic sequences are proposed that generalize the Ziv-Lempel complexity measure by taking into account any isomorphic repeats in the text (rather than just direct repeats as in Ziv-Lempel). The first of them, the complexity vector, is designed for small alphabets such as the alphabet of nucleotides. The second is based on a search for the longest isomorphic fragment in the history of sequence synthesis and can be used for alphabets of arbitrary cardinality. These measures have been used for recognition of structural regularities in DNA sequences. Some interesting structures related to the regulatory region of the human growth hormone are reported.
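As background for the measure being generalized here, the sketch below counts the components of the classic Lempel-Ziv (1976) exhaustive parsing, which accounts for direct repeats only. It is not the isomorphic-repeat generalization proposed in the abstract, and the function name is chosen for illustration.

```python
def lempel_ziv_complexity(s):
    """Number of components in the Lempel-Ziv (1976) parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the component while it can still be copied from the text
        # that precedes its last symbol (overlapping copies are allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


c = lempel_ziv_complexity("0001101001000101")
# Parsing: 0 | 001 | 10 | 100 | 1000 | 101, i.e. six components.
```

Highly repetitive text is parsed into few long components, so a low count signals the kind of repeat-rich structure the abstract associates with regulatory regions.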
Article
The actions of many hormones and neurotransmitters are mediated through stimulation of G protein-coupled receptors. A primary mechanism by which these receptors exert effects inside the cell is by association with heterotrimeric G proteins, which can activate a wide variety of cellular enzymes and ion channels. G protein-coupled receptors can also interact with a number of cytoplasmic scaffold proteins, which can link the receptors to various signaling intermediates and intracellular effectors. The multicomponent nature of G protein-coupled receptor signaling pathways makes them ideally suited for regulation by scaffold proteins. This review focuses on several specific examples of G protein-coupled receptor-associated scaffolds and the roles they may play in organizing receptor-initiated signaling pathways in the cardiovascular system and other tissues.
Article
We describe recent research into using the visual primitive of texture to analyze and manage large collections of remotely sensed image and video data. Texture is regarded as the spatial dependence of pixel intensity. It is characterized by the amount of dependence at different scales and orientations, as measured with frequency-selective filters. A homogeneous texture descriptor based on the filter outputs is shown to enable (1) content-based image retrieval in large collections of satellite imagery, (2) semantic labeling and layout retrieval in an aerial video management system, and (3) statistical object modeling in geographic digital libraries.
Article
During the last two decades, the number of sequence-known proteins has increased rapidly. In contrast, the corresponding increase in structure-known proteins has been much slower. This imbalance has critically limited our ability to understand the molecular mechanisms of proteins and to conduct structure-based drug design by making timely use of the updated information on newly found sequences. Therefore, it is highly desirable to develop an automated method for rapidly deriving the 3D (3-dimensional) structure of a protein from its sequence. Under these circumstances, structural bioinformatics has emerged naturally in response to this need. In this review, three main strategies developed in structural bioinformatics, i.e., the pure energetic approach, the heuristic approach, and the homology modeling approach, as well as their underlying principles, are briefly introduced. Meanwhile, a series of demonstrations are presented to show how structural bioinformatics has been applied to derive, in a timely manner, the 3D structures of some functionally important proteins, helping to understand their action mechanisms and stimulating the course of drug discovery. The limitations of these approaches and the future challenges of structural bioinformatics are also briefly addressed.
Article
Motivation: With protein sequences entering into databanks at an explosive pace, the early determination of the family or subfamily class for a newly found enzyme molecule becomes important because this is directly related to the detailed information about which specific target it acts on, as well as to its catalytic process and biological function. Unfortunately, it is both time-consuming and costly to do so by experiments alone. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone without including any sequence-order information. Therefore, it is worthy of further investigation. Results: To incorporate the sequence-order effects into the predictor, the 'amphiphilic pseudo amino acid composition' is introduced to represent the statistical sample of a protein. The novel representation contains 20 + 2λ discrete numbers: the first 20 numbers are the components of the conventional amino acid composition; the next 2λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. Based on such a concept and formulation scheme, a new predictor is developed. It is shown by the self-consistency test, jackknife test and independent dataset tests that the success rates obtained by the new predictor are all significantly higher than those by the previous predictors. The significant enhancement in success rates also implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role in its structure and function.
Article
Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Because the functions of these proteins are closely correlated with their subcellular localizations, it is vitally important to develop an automated method as a high-throughput tool for the timely identification of their subcellular location. Based on the concept of the pseudo amino acid composition by which a considerable amount of sequence-order effects can be incorporated into a set of discrete numbers (Chou, K. C., Proteins: Structure, Function, and Genetics, 2001, 43: 246-255), the complexity measure approach is introduced. The advantage of incorporating the complexity measure factor as one of the pseudo amino acid components for a protein is that it can reflect its overall sequence-order feature more effectively than the conventional correlation factors. With such a formulation frame to represent the samples of protein sequences, the covariant-discriminant predictor (Chou, K. C. and Elrod, D. W., Protein Engineering, 1999, 12: 107-118) was adopted to conduct prediction. High success rates were obtained by both the jackknife cross-validation test and independent dataset test, suggesting that introduction of the concept of the complexity measure into prediction of protein subcellular location is quite promising, and might also hold great potential as a useful vehicle for other areas of molecular biology.
Article
Here we report a systematic approach for predicting subcellular localization (cytoplasm, mitochondrial, nuclear, and plasma membrane) of human proteins. First, support vector machine (SVM)-based modules for predicting subcellular localization using traditional amino acid and dipeptide (i + 1) composition achieved overall accuracy of 76.6 and 77.8%, respectively. PSI-BLAST, when carried out using a similarity-based search against a nonredundant data base of experimentally annotated proteins, yielded 73.3% accuracy. To gain further insight, a hybrid module (hybrid1) was developed based on amino acid composition, dipeptide composition, and similarity information and attained better accuracy of 84.9%. In addition, SVM modules based on a different higher order dipeptide i.e. i + 2, i + 3, and i + 4 were also constructed for the prediction of subcellular localization of human proteins, and overall accuracy of 79.7, 77.5, and 77.1% was accomplished, respectively. Furthermore, another SVM module hybrid2 was developed using traditional dipeptide (i + 1) and higher order dipeptide (i + 2, i + 3, and i + 4) compositions, which gave an overall accuracy of 81.3%. We also developed SVM module hybrid3 based on amino acid composition, traditional and higher order dipeptide compositions, and PSI-BLAST output and achieved an overall accuracy of 84.4%. A Web server HSLPred (www.imtech.res.in/raghava/hslpred/ or bioinformatics.uams.edu/raghava/hslpred/) has been designed to predict subcellular localization of human proteins using the above approaches.
Article
The avalanche of newly found protein sequences in the post-genomic era has motivated and challenged us to develop an automated method that can rapidly and accurately predict the localization of an uncharacterized protein in cells, because the knowledge thus obtained can greatly speed up the process of finding its biological functions. However, it is very difficult to establish such a predictor by acquiring the key statistical information buried in a pile of extremely complicated and highly variable sequences. In this paper, based on the concept of the pseudo amino acid composition (Chou, K. C. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255), the cellular automata image approach is introduced to cope with this problem. Many important features, which are originally hidden in the long amino acid sequences, can be clearly displayed through their cellular automata images. One of the remarkable merits of doing so is that many image recognition tools can be straightforwardly applied to this prediction task. High success rates were observed in the self-consistency, jackknife, and independent dataset tests.
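A cellular automata image of the kind described here can be sketched as follows: the sequence is turned into a row of bits, and an elementary (one-dimensional, two-state, nearest-neighbor) cellular automaton is iterated to stack further rows into a binary image. The 5-bit residue coding (the residue's index in an alphabetical list), the periodic boundary, and the choice of rule number and number of steps are all assumptions of this sketch, since the abstract does not specify them.

```python
# Assumed residue order; each residue is coded by the 5-bit binary form of its index.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(sequence):
    """Concatenate a hypothetical 5-bit code for every residue."""
    bits = []
    for residue in sequence.upper():
        code = AMINO_ACIDS.index(residue)  # ValueError for non-standard symbols
        bits.extend((code >> shift) & 1 for shift in range(4, -1, -1))
    return bits


def ca_image(sequence, rule, steps):
    """Rows of an elementary cellular automaton started from the encoded sequence.

    `rule` is a Wolfram rule number (0-255); the boundary is periodic.
    """
    row = encode(sequence)
    n = len(row)
    image = [row]
    for _ in range(steps):
        # The neighborhood (left, center, right) selects one bit of the rule number.
        row = [(rule >> (4 * row[(i - 1) % n] + 2 * row[i] + row[(i + 1) % n])) & 1
               for i in range(n)]
        image.append(row)
    return image


image = ca_image("MKTAYIAK", rule=90, steps=20)  # rule 90 chosen only as an example
```

The resulting 0/1 matrix can be treated as a grayscale image, which is what allows texture or other image-recognition features to be extracted from a sequence.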
Article
Contraction and relaxation of heart muscle cells is regulated by cycling of calcium between cytoplasm and sarcoplasmic reticulum. Human phospholamban (PLN), expressed in the sarcoplasmic reticulum membrane as a 30-kDa homopentamer, controls cellular calcium levels by a mechanism that depends on its phosphorylation. Since PLN was discovered ≈30 years ago, extensive studies have aimed to explain how it influences calcium pumps and to determine whether it acts as an ion channel. We have determined by solution NMR methods the atomic resolution structure of an unphosphorylated PLN pentamer in dodecylphosphocholine micelles. The unusual bellflower-like assembly is held together by leucine/isoleucine zipper motifs along the membrane-spanning helices. The structure reveals a channel-forming architecture that could allow passage of small ions. The central pore gradually widens toward the cytoplasmic end as the transmembrane helices twist around each other and bend outward. The dynamic N-terminal amphipathic helices point away from the membrane, perhaps facilitating recognition and inhibition of the calcium pump. • leucine/isoleucine zipper • membrane channel • NMR • dipolar couplings
Book
1. The Foundations for a New Kind of Science 2. The Crucial Experiment 3. The World of Simple Programs 4. Systems Based on Numbers 5. Two Dimensions and Beyond 6. Starting from Randomness 7. Mechanisms in Programs and Nature 8. Implications for Everyday Systems 9. Fundamental Physics 10. Processes of Perception and Analysis 11. The Notion of Computation 12. The Principle of Computational Equivalence
Article
The development of prediction methods based on statistical theory generally consists of two parts: one is focused on the exploration of new algorithms, and the other on the improvement of a training database. The current study is devoted to improving the prediction of protein structural classes from both of the two aspects. To explore a new algorithm, a method has been developed that makes allowance for taking into account the coupling effect among different amino acid components of a protein by a covariance matrix. To improve the training database, the selection of proteins is carried out so that they have (1) as many non-homologous structures as possible, and (2) a good quality of structure. Thus, 129 representative proteins are selected. They are classified into 30 α, 30 β, 30 α + β, 30 α/β, and 9 ζ (irregular) proteins according to a new criterion that better reflects the feature of the structural classes concerned. The average accuracy of prediction by the current method for the 4 × 30 regular proteins is 99.2%, and that for 64 independent testing proteins not included in the training database is 95.3%. To further validate its efficiency, a jackknife analysis has been performed for the current method as well as the previous ones, and the results are also much in favor of the current method. To complete the mathematical basis, a theorem is presented and proved in Appendix A that is instructive for understanding the novel method at a deeper level. © 1995 Wiley-Liss, Inc.
Article
It is now common practice to retrieve, by key words, highly specialized selections of sequences from general-purpose databases such as EMBL, GenBank, etc. The sequences included in a selection are often interconnected, which means that there are duplications, embeddings, intersections, homology, common structural elements. Knowledge of these interconnections is necessary for further processing of the sequences. We propose a rapid (single scan) method for identification of such interconnections by means of complexity analysis that generalizes the Lempel–Ziv approach. Analysis of a selection of 5'-flanking regions of vertebrate growth hormone genes from EMBL is presented as an example.
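As a rough illustration of the kind of complexity measure this abstract builds on, the sketch below counts the phrases in a classic Lempel–Ziv (1976-style) parse of a single sequence. It is not the generalized, multi-sequence method the authors propose; the function name `lz_complexity` is hypothetical and chosen only for this example.

```python
def lz_complexity(s: str) -> int:
    """Count phrases in a Lempel-Ziv (1976-style) parse of s.

    Each phrase is the shortest substring starting at the current
    position that has not already occurred in the text seen so far
    (overlap with the phrase itself is allowed, except its last symbol).
    """
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate phrase while it can still be copied from history.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


if __name__ == "__main__":
    # A repetitive sequence parses into fewer phrases than a varied one.
    print(lz_complexity("ACACACACACAC"), lz_complexity("ACGTTGCAAGTC"))
```

A low phrase count relative to length suggests internal repetition, which is the intuition behind using such a measure to flag duplicated or embedded sequences.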
Article
The basic module of signal transduction that involves G-protein-coupled receptors is usually portrayed as comprising a receptor, a heterotrimeric G protein and an effector. It is now well established that regulated interactions between receptors and arrestins, and between G proteins and regulators of G-protein signalling alter the effectiveness and kinetics of information transfer. However, more recent studies have begun to identify a host of other proteins that interact selectively with individual receptors at both the intracellular and extracellular face of the membrane. Although the functional relevance of many of these interactions is only beginning to be understood, current information indicates that these interactions might determine receptor properties, such as cellular compartmentalization or signal selection, and can promote protein scaffolding into complexes that integrate function.
Article
Identification of nuclear protein localization is significant because it can provide in-depth insight into genome regulation and the functional annotation of novel proteins. A multiclass SVM classifier with various input features was employed for nuclear protein compartment identification. The input features include factor solution scores and evolutionary information (position specific scoring matrix (PSSM) score) in addition to conventional dipeptide composition and pseudo amino acid composition. All the SVM classifiers with different sets of input features performed better than the previously available prediction classifiers. The jackknife success rate thus obtained on the benchmark dataset constructed by Shen and Chou [Shen, H.B., Chou, K.C., 2005, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochem. Biophys. Res. Commun. 337, 752–756] is 71.23%, indicating that the novel pseudo amino acid composition approach with PSSM and SVM classifier is very promising and may at least play a complementary role to the existing methods.
Article
It is crucial to develop powerful tools to predict apoptosis protein locations, given the rapidly increasing gap between the number of known structural proteins and the number of known sequences in protein databanks. In this study, based on the concept of pseudo amino acid (PseAA) composition originally introduced by Chou, a novel approximate entropy (ApEn) based PseAA composition is proposed to represent apoptosis protein sequences. An ensemble classifier, whose basic classifier is the FKNN (fuzzy K-nearest neighbor) one, is introduced as the prediction engine. Each basic classifier is trained in different dimensions of the PseAA composition of protein sequences. The immune genetic algorithm (IGA) is used to search for the optimal weight factors in generating the PseAA composition, because these weight factors are crucial. The results obtained by the jackknife test are quite encouraging, indicating that the proposed method might become a potentially useful tool for protein function, or at least can play a complementary role to the existing methods in the relevant areas.
Article
Protein sequences contain surprisingly many local regions of low compositional complexity. These include different types of residue clusters, some of which contain homopolymers, short period repeats or aperiodic mosaics of a few residue types. Several different formal definitions of local complexity and probability are presented here and are compared for their utility in algorithms for localization of such regions in amino acid sequences and sequence databases. The definitions are:—(1) those derived from enumeration a priori by a treatment analogous to statistical mechanics, (2) a log likelihood definition of complexity analogous to informational entropy, (3) multinomial probabilities of observed compositions, (4) an approximation resembling the χ2 statistic and (5) a modification of the coefficient of divergence. These measures, together with a method based on similarity scores of self-aligned sequences at different offsets, are shown to be broadly similar for first-pass, approximate localization of low-complexity regions in protein sequences, but they give significantly different results when applied in optimal segmentation algorithms. These comparisons underpin the choice of robust optimization heuristics in an algorithm, SEG, designed to segment amino acid sequences fully automatically into subsequences of contrasting complexity. After the abundant low-complexity segments have been partitioned from the Swissprot database, the remaining high-complexity sequence set is adequately approximated by a first-order random model.
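One of the measures listed above is analogous to informational entropy. The sketch below scans a sequence with a sliding window and reports windows whose Shannon entropy falls at or below a cutoff. It is a simplified first-pass illustration only, not the SEG algorithm itself; the function names are hypothetical, and the window width and threshold are illustrative defaults chosen for this example.

```python
from collections import Counter
from math import log2


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())


def low_complexity_windows(seq: str, width: int = 12, threshold: float = 2.2):
    """Start indices of windows with entropy at or below the threshold.

    width and threshold are illustrative assumptions, not tuned values.
    """
    return [
        i
        for i in range(len(seq) - width + 1)
        if window_entropy(seq[i:i + width]) <= threshold
    ]


if __name__ == "__main__":
    # The poly-Q stretch in the middle is reported as low complexity.
    print(low_complexity_windows("MKTAYIAKQQQQQQQQQQQQSFVKSHFSRQ"))
```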
Article
Natural systems from snowflakes to mollusc shells show a great diversity of complex patterns. The origins of such complexity can be investigated through mathematical models termed 'cellular automata'. Cellular automata consist of many identical components, each simple, but together capable of complex behaviour. They are analysed both as discrete dynamical systems, and as information-processing systems. Here some of their universal features are discussed, and some general principles are suggested.
Article
A novel approach was developed for predicting the structural classes of proteins based on their sequences. It was assumed that proteins belonging to the same structural class must bear some sort of similar texture on the images generated by the cellular automaton evolving rule [Wolfram, S., 1984. Cellular automata as models of complexity. Nature 311, 419-424]. Based on this, two geometric invariant moment factors derived from the image functions were used as the pseudo amino acid components [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, vol. 44, 60) 43, 246-255] to formulate the protein samples for statistical prediction. The success rates thus obtained on a previously constructed benchmark dataset are quite promising, implying that the cellular automaton image can help to reveal some inherent and subtle features deeply hidden in a pile of long and complicated amino acid sequences.
Article
Although most proteins have a single subcellular location, some may simultaneously exist at two or more different subcellular locations. Multiplex proteins as such are particularly interesting because they may have some special functions. To deal with this kind of complicated situation, a novel predictor called “Euk-mPLoc” was developed that can be used to predict the subcellular locations of eukaryotic proteins among the 22 sites as shown in the accompanying figure. The predictor is accessible to the public as a free server at http://202.120.37.186/bioinf/euk-multi.
Article
The successful prediction of protein subcellular localization directly from protein primary sequence is useful for protein function prediction and drug discovery. In this paper, by using the concept of pseudo amino acid composition (PseAAC), the mycobacterial proteins are studied and predicted by support vector machine (SVM) and increment of diversity combined with modified Mahalanobis Discriminant (IDQD). The results of jackknife cross-validation for 450 non-redundant proteins show that the overall prediction success rates of SVM and IDQD are 82.2% and 79.1%, respectively. Compared with other existing methods, SVM combined with PseAAC displays higher accuracies.
Article
Two classes of receptors transduce neurotransmitter signals: ionotropic receptors and heptahelical metabotropic receptors. Whereas the ionotropic receptors are structurally associated with a membrane channel, a mediating mechanism is necessary to functionally link metabotropic receptors with their respective effectors. According to the accepted paradigm, the first step in the metabotropic transduction process requires the activation of heterotrimeric G-proteins. An increasing number of observations, however, point to a novel mechanism through which neurotransmitters can initiate biochemical signals and modulate neuronal excitability. According to this mechanism metabotropic receptors induce responses by activating transduction systems that do not involve G-proteins.
Article
How to incorporate the sequence order effect is a key and logical step for improving the prediction quality of protein subcellular location, but meanwhile it is a very difficult problem as well. This is because the number of possible sequence order patterns in proteins is extremely large, which has posed a formidable barrier to construct an effective training data set for statistical treatment based on the current knowledge. That is why most of the existing prediction algorithms are operated based on the amino-acid composition alone. In this paper, based on the physicochemical distance between amino acids, a set of sequence-order-coupling numbers was introduced to reflect the sequence order effect, or in a rigorous term, the quasi-sequence-order effect. Furthermore, the covariant discriminant algorithm by Chou and Elrod (Protein Eng. 12, 107-118, 1999) developed recently was augmented to allow the prediction performed by using the input of both the sequence-order-coupling numbers and amino-acid composition. A remarkable improvement was observed in the prediction quality using the augmented covariant discriminant algorithm. The approach described here represents one promising step forward in the efforts of incorporating sequence order effect in protein subcellular location prediction. It is anticipated that the current approach may also have a series of impacts on the prediction of other protein features by statistical approaches.
Article
The cellular attributes of a protein, such as which compartment of a cell it belongs to and how it is associated with the lipid bilayer of an organelle, are closely correlated with its biological functions. The success of human genome project and the rapid increase in the number of protein sequences entering into data bank have stimulated a challenging frontier: How to develop a fast and accurate method to predict the cellular attributes of a protein based on its amino acid sequence? The existing algorithms for predicting these attributes were all based on the amino acid composition in which no sequence order effect was taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns for protein sequences is extremely large, which has posed a formidable difficulty for realizing this goal. To deal with such a difficulty, the pseudo-amino acid composition is introduced. It is a combination of a set of discrete sequence correlation factors and the 20 components of the conventional amino acid composition. A remarkable improvement in prediction quality has been observed by using the pseudo-amino acid composition. The success rates of prediction thus obtained are so far the highest for the same classification schemes and same data sets. It has not escaped from our notice that the concept of pseudo-amino acid composition as well as its mathematical framework and biochemical implication may also have a notable impact on improving the prediction quality of other protein features.
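The construction described above, 20 composition components plus a set of sequence correlation factors, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale, swapped in here as a stand-in for the properties used in the original work), and the function name `pse_aac`, the number of tiers `lam`, and the weight `w` are illustrative choices for this example.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here only as a stand-in property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale property values to zero mean and unit standard deviation."""
    mu, sd = mean(scale.values()), pstdev(scale.values())
    return {aa: (v - mu) / sd for aa, v in scale.items()}


def pse_aac(seq, lam=3, w=0.05):
    """Return a (20 + lam)-component pseudo amino acid composition vector."""
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    n = len(residues)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    h = standardize(HYDROPATHY)
    # Conventional amino acid composition (20 frequencies).
    freqs = [residues.count(aa) / n for aa in AMINO_ACIDS]
    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart along the chain.
    thetas = [
        mean((h[residues[i + k]] - h[residues[i]]) ** 2 for i in range(n - k))
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


if __name__ == "__main__":
    vec = pse_aac("MKTAYIAKQRQISFVKSHFSRQ", lam=3)
    print(len(vec), round(sum(vec), 6))
```

The first 20 components carry composition, while the trailing `lam` components carry a limited amount of sequence-order information; the weight `w` controls the balance between the two.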
Article
The more proteins diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) The subcellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.
Article
G-protein-coupled receptors play a key role in cellular signaling networks that regulate various physiological processes, such as vision, smell, taste, neurotransmission, secretion, inflammatory and immune responses, cellular metabolism, and cellular growth. These proteins are very important for understanding human physiology and disease. Many efforts in pharmaceutical research have been aimed at understanding their structure and function. Unfortunately, because they are difficult to crystallize and most of them will not dissolve in normal solvents, so far very few G-protein-coupled receptor structures have been determined. In contrast, more than 1000 G-protein-coupled receptor sequences are known, and many more are expected to become known soon. In view of this extremely unbalanced state, it would be very useful to develop a fast sequence-based method to identify their different types. This would no doubt have practical value for both basic research and drug discovery because the function or binding specificity of a G-protein-coupled receptor is determined by the particular type it belongs to. To realize this, a statistical analysis has been performed for 566 G-protein-coupled receptors classified into seven different types. The results indicate that the types of G-protein-coupled receptors are predictable to a considerably accurate extent if a good training data set can be established for such a goal.
Article
The function of a protein is closely correlated with its subcellular location. With the success of human genome project and the rapid increase in the number of newly found protein sequences entering into data banks, it is highly desirable to develop an automated method for predicting the subcellular location of proteins. The establishment of such a predictor will no doubt expedite the functionality determination of newly found proteins and the process of prioritizing genes and proteins identified by genomics efforts as potential molecular targets for drug design. Based on the concept of pseudo amino acid composition originally proposed by K. C. Chou (Proteins: Struct. Funct. Genet. 43: 246–255, 2001), the digital signal processing approach has been introduced to partially incorporate the sequence order effect. One of the remarkable merits by doing so is that many existing tools in mathematics and engineering can be straightforwardly used in predicting protein subcellular location. The results thus obtained are quite encouraging. It is anticipated that the digital signal processing may serve as a useful vehicle for many other protein science areas as well.
Article
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, that associate through noncovalent interactions and, occasionally, disulfide bonds. With the number of protein sequences entering into data banks rapidly increasing, we are confronted with a challenge: how to develop an automated method to identify the quaternary attribute for a new polypeptide chain (i.e., whether it is formed just as a monomer, or as a dimer, trimer, or any other oligomer). This is important, because the functions of proteins are closely related to their quaternary attribute. For example, some critical ligands only bind to dimers but not to monomers; some marvelous allosteric transitions only occur in tetramers but not other oligomers; and some ion channels are formed by tetramers, whereas others are formed by pentamers. To explore this problem, we adopted the pseudo amino acid composition originally proposed for improving the prediction of protein subcellular location (Chou, Proteins, 2001; 43:246-255). The advantage of using the pseudo amino acid composition to represent a protein is that it has paved a way that can take into account a considerable amount of sequence-order effects to significantly improve prediction quality. Results obtained by resubstitution, jack-knife, and independent data set tests, have indicated that the current approach might be quite promising in dealing with such an extremely complicated and difficult problem.
Article
GABA is the main inhibitory neurotransmitter in the mammalian central nervous system. When GABA binds to the ubiquitous GABA-A receptors on neurons, chloride channels are activated leading to a rapid increase in chloride conductance that depresses excitatory depolarization. The GABA-A receptors are targets for many clinically important drugs, such as the benzodiazepines, general anaesthetics, and barbiturates. All of these drugs enhance the chloride current activated by GABA. Of the GABA-A receptor family, the subtype 2 is critical for the treatment of anxiety spectrum disorders. To avoid unwanted side effects, it is necessary to find highly selective drugs that interact only with subtype 2 but not with the related receptors such as subtypes 1, 3, and 5. To realize such a goal, it is important to have not only the 3D (dimensional) structure of subtype 2 but also the 3D structures of subtypes 1, 3, and 5. In this study, the 3D structures of all the four subtypes of GABA-A receptors have been derived. The computer-modeled heteropentameric structures bear the following features: (1) each of the five subunits in the pentamer has an intrachain disulfide bond, a hallmark of ligand-gated pentameric channels; (2) those residues which are sensitive to the binding of the benzodiazepine site ligands are grouped around the alpha1,2,3,5/gamma2 interfaces; and (3) those residues which are sensitive to the binding of GABA molecules are grouped around the alpha1,2,3,5/beta2 interfaces. All these findings are fully consistent with experimental observations. Meanwhile, for those sensitive or key residues, a close look at their subtle difference among the four subtypes has been provided through a highlighted superposition picture. 
In addition to providing the atomic coordinates, the predicted structures have further clarified some ambiguities that could not be uniquely determined by the existing experimental data, such as the directionality of the subunit arrangement in the heteropentamers. The 3D models may provide a reasonable structural frame or footing for designing highly selective drugs. The present models might also be useful in understanding the basic mechanism of operation of the GABA-A receptors, stimulating novel strategies for developing more specific drugs and better treatments.
Article
Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Since the functions of these proteins are closely correlated with their subcellular localizations, many efforts have been made to develop a variety of methods for predicting protein subcellular location. In this study, based on the strategy by hybridizing the functional domain composition and the pseudo-amino acid composition (Cai and Chou [2003]: Biochem. Biophys. Res. Commun. 305:407-411), the Intimate Sorting Algorithm (ISort predictor) was developed for predicting the protein subcellular location. As a showcase, the same plant and non-plant protein datasets as investigated by the previous investigators were used for demonstration. The overall success rate by the jackknife test for the plant protein dataset was 85.4%, and that for the non-plant protein dataset 91.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross validation test procedure, further confirming that such a hybrid approach may become a very useful high-throughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology.
Article
Based on the crystal structure of acetylcholine-binding protein, the three-dimensional structures of the extracellular domain, or the ligand-binding domains, of the monomer, homodimer, and homopentamer of the alpha7 nicotinic acetylcholine receptor were derived. The interface between two subunits, where the ligand-binding site is located, was investigated. Furthermore, an explicit definition of the ligand-binding pocket was illustrated that might provide useful clues for conducting various mutagenesis studies for finding drugs against schizophrenia and Alzheimer's disease.
Article
Given the sequence of a protein, how can we predict whether it is an enzyme or a non-enzyme? If it is, which enzyme family class does it belong to? Because these questions are closely relevant to the biological function of a protein and its acting object, their importance is self-evident. Particularly with the explosion of protein sequences entering into data banks and the relatively much slower progress in using biochemical experiments to determine their functions, it is highly desirable to develop an automated method that can be used to give fast answers to these questions. By hybridizing the gene ontology and pseudo-amino-acid composition, we have introduced a new method called the GO-PseAA predictor, which operates in a hybridization space. To avoid redundancy and bias, demonstrations were performed on a data set in which none of the proteins in an individual class has ≥40% sequence identity to any other. The overall success rate thus obtained by the jackknife cross-validation test in identifying enzyme and non-enzyme was 93%, and that in identifying the enzyme family was 94% for the following six main Enzyme Commission (EC) classes: (1) oxidoreductase, (2) transferase, (3) hydrolase, (4) lyase, (5) isomerase, and (6) ligase. The corresponding rates by the independent data set test were 98% and 97%, respectively.
Article
Many lines of evidence indicate that increased flux of glucose through the pathway, in which glutamine:fructose-6-phosphate amidotransferase (GFPT or GFAT) is a key catalyst while uridine-5'-diphosphate-N-acetylglucosamine (UDP-GlcNAc) functions as an energy sensor, can lead to the insulin resistance that is characteristic of type 2 diabetes. In view of this, GFAT and its interaction mechanism with UDP-GlcNAc may become a novel therapeutic target for the treatment of type 2 diabetes. To stimulate the structure-based drug design, the three-dimensional structures of human GFAT1 monomer and dimer have been developed. It has been found by docking UDP-GlcNAc to the dimer (the smallest unit for catalyzing the substrate) that UDP-GlcNAc is bound to the interface of the dimer by 12 hydrogen bonds. On the basis of the docking results, a binding pocket of human GFAT1 dimer for UDP-GlcNAc is defined. All of these findings can serve as a reference or footing in developing new therapeutic strategies for the treatment of type 2 diabetes.
Article
A novel approach to visualize biological sequences is developed based on cellular automata (Wolfram, S. Nature 1984, 311, 419-424), a set of discrete dynamical systems in which space and time are discrete. By transforming the symbolic sequence codes into the digital codes, and using some optimal space-time evolvement rules of cellular automata, a biological sequence can be represented by a unique image, the so-called cellular automata image. Many important features, which are originally hidden in a long and complicated biological sequence, can be clearly revealed through its cellular automata image. With biological sequences entering into databanks rapidly increasing in the post-genomic era, it is anticipated that the cellular automata image will become a very useful vehicle for investigation into their key features, identification of their function, as well as revelation of their "fingerprint". It is anticipated that by using the concept of the pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43, 246-255), the cellular automata image approach can also be used to improve the quality of predicting protein attributes, such as structural class and subcellular location.
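The two steps named in this abstract, converting symbols to digital codes and then evolving them under a cellular automaton rule, can be sketched as below. Everything specific here is an assumption made for illustration: the 5-bit code simply uses each residue's index in an alphabetical list (the original work defines its own code table), the rule number is a free parameter whose default of 90 is arbitrary and not claimed to be the rule the authors selected, and periodic boundaries are assumed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(seq: str) -> list[int]:
    """Map each residue to 5 bits (hypothetical code: alphabetical index)."""
    bits = []
    for aa in seq.upper():
        idx = AMINO_ACIDS.index(aa)  # raises ValueError on unknown symbols
        bits.extend((idx >> shift) & 1 for shift in range(4, -1, -1))
    return bits


def step(row: list[int], rule: int) -> list[int]:
    """Apply one update of an elementary (Wolfram-numbered) CA rule.

    Each cell's new state is the bit of `rule` selected by its
    (left, self, right) neighborhood; boundaries wrap around.
    """
    n = len(row)
    return [
        (rule >> (row[(i - 1) % n] * 4 + row[i] * 2 + row[(i + 1) % n])) & 1
        for i in range(n)
    ]


def ca_image(seq: str, rule: int = 90, steps: int = 20) -> list[list[int]]:
    """Stack the initial row and its successive updates into a 2D image."""
    rows = [encode(seq)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows


if __name__ == "__main__":
    for row in ca_image("MKTAYIAK", rule=90, steps=8):
        print("".join("#" if b else "." for b in row))
```

The resulting binary grid can be treated as an image, so texture descriptors or other image features can then be computed from it.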
Article
Hepatitis B viruses (HBVs) show instantaneous and high-ratio mutations when they are replicated, some sorts of which significantly affect the efficiency of virus replication through enhancing or depressing the viral replication, while others have no influence at all. The mechanism of gene expression is closely correlated with its gene sequence. With the rapid increase in the number of newly found sequences entering into data banks, it is highly desirable to develop an automated method for simulating the gene regulating function. The establishment of such a predictor will no doubt expedite the process of prioritizing genes and proteins identified by genomics efforts as potential molecular targets for drug design. Based on the power of cellular automata (CA) in treating complex systems with simple rules, a novel method to present HBV gene image has been introduced. The results show that the images thus obtained can very efficiently simulate the effects of the gene missense mutation on the virus replication. It is anticipated that CA may also serve as a useful vehicle for many other studies on complicated biological systems.
Article
Knowledge of membrane protein type often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences emerging during the post-genomic era, it is highly desirable to develop an automated method that can serve as a high throughput tool in identifying the types of newly found membrane proteins according to their primary sequences, so as to timely make the relevant annotations on them for the reference usage in both basic research and drug discovery. Based on the concept of pseudo-amino acid composition [K.C. Chou, Proteins: Struct. Funct. Genet. 43 (2001) 246-255; Erratum: Proteins: Struct. Funct. Genet. 44 (2001) 60] that has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, a novel predictor, the so-called "optimized evidence-theoretic K-nearest neighbor" or "OET-KNN" classifier, was proposed. It was demonstrated via the self-consistency test, jackknife test, and independent dataset test that the new predictor, compared with many previous ones, yielded higher success rates in most cases. The new predictor can also be used to improve the prediction quality for, among many other protein attributes, structural class, subcellular localization, enzyme family class, and G-protein coupled receptor type. The OET-KNN classifier will be available as a web-server at http://www.pami.sjtu.edu.cn/kcchou.
Article
Being the largest family of cell surface receptors, G-protein-coupled receptors (GPCRs) are among the most frequent targets of therapeutic drugs. The functions of many of GPCRs are unknown, and it is both time-consuming and expensive to determine their ligands and signaling pathways. This forces us to face a critical challenge: how to develop an automated method for classifying the family of GPCRs so as to help us in classifying drugs and expedite the process of drug discovery. Owing to their highly divergent nature, it is difficult to predict the classification of GPCRs by means of conventional sequence alignment approaches. To cope with such a situation, the CD (Covariant Discriminant) predictor was introduced to predict the families of GPCRs. The overall success rate thus obtained by jack-knife test for 1238 GPCRs classified into three main families, i.e., class A-"rhodopsin like", class B-"secretin like", and class C-"metabotrophic/glutamate/pheromone", was over 97%. The high success rate suggests that the CD predictor holds very high potential to become a useful tool for understanding the actions of drugs that target GPCRs and designing new medications with fewer side effects and greater efficacy.
Article
Cell membranes are vitally important to living cells. Although the infrastructure of a biological membrane is provided by the lipid bilayer, membrane proteins perform most of the specific functions. Knowledge of membrane protein types often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences generated in the post-genomic era, there is strong demand for a high-throughput tool that identifies the type of newly found membrane proteins from their primary sequences, so that they can be annotated promptly for reference in both basic research and drug discovery. To realize this, the key is to establish a powerful identifier that can catch the characteristic sequence patterns of different membrane protein types. This is not easy, however, because these patterns are buried in a pile of long and complicated sequences. In this paper, based on the concept of the pseudo-amino acid composition [K.C. Chou, PROTEINS: Struct., Funct., Genet. 43 (2001) 246-255], the low-frequency Fourier spectrum analysis is introduced. The merits of doing so are that the sequence pattern information can be more effectively incorporated into a set of discrete components, and that all the existing prediction algorithms can be straightforwardly used on such a formulation for protein samples. High success rates were observed by the re-substitution test, jackknife test, and independent dataset test, indicating that the low-frequency Fourier spectrum approach may become a very useful tool for membrane protein type prediction. The novel approach also holds a high potential for predicting many other attributes of proteins.
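As a rough illustration of how a low-frequency Fourier spectrum can turn a sequence into a fixed number of discrete components, the sketch below computes the amplitudes of the first few discrete Fourier transform frequencies of a numeric series. It assumes the protein has already been encoded as one number per residue (for example, a hydrophobicity value); that encoding, the function name `low_freq_spectrum`, and the normalization by series length are illustrative choices, not details given in the abstract.

```python
import cmath
import math


def low_freq_spectrum(values, n_components):
    """Amplitudes of DFT frequencies 1..n_components, normalized by length."""
    n = len(values)
    if n_components >= n:
        raise ValueError("n_components must be smaller than the series length")
    amplitudes = []
    for k in range(1, n_components + 1):
        # k-th DFT coefficient of the numeric series.
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(values)
        )
        amplitudes.append(abs(coeff) / n)
    return amplitudes


# Sanity check on a synthetic series: a pure cosine with 2 cycles over 16 points
# should put all of its low-frequency energy in the second component.
signal = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
print([round(a, 6) for a in low_freq_spectrum(signal, 4)])
```

Because only the first few frequencies are kept, sequences of different lengths map to vectors of the same size, which is what allows standard classifiers to be applied afterwards.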
Article
Cell membranes are vitally important to the life of a cell. Although the basic structure of a biological membrane is provided by the lipid bilayer, membrane proteins perform most of the specific functions. Membrane proteins are putatively classified into five different types. Identification of their types is currently an important topic in bioinformatics and proteomics. In this paper, based on the concept of representing protein samples in terms of their pseudo-amino acid composition, the fuzzy K-nearest neighbors (KNN) algorithm has been introduced to predict membrane protein types, and high success rates were observed. It is anticipated that the current approach, which is based on a branch of fuzzy mathematics and represents a new strategy, may play an important complementary role to the existing methods in this area. The novel approach may also have a notable impact on the prediction of other attributes, such as protein structural class, protein subcellular localization, and enzyme family class, among many others.
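The fuzzy KNN idea can be sketched in a few lines. The version below follows the commonly used inverse-distance weighting with a fuzzifier `m`, starting from crisp training labels. The function name, the defaults `k=3` and `m=2.0`, and the toy two-dimensional data are assumptions for illustration; the abstract does not state which variant or parameters were used.

```python
import math


def fuzzy_knn(train, query, k=3, m=2.0, eps=1e-12):
    """Return {label: membership} for `query` from its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs with crisp labels.
    """
    # Distances to every training sample, keeping the k closest.
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]

    # Each neighbor votes with weight 1 / d^(2/(m-1)); eps guards against d = 0.
    weights = {}
    total = 0.0
    for dist, label in neighbors:
        w = 1.0 / (dist ** (2.0 / (m - 1.0)) + eps)
        weights[label] = weights.get(label, 0.0) + w
        total += w

    # Normalize so the memberships over the neighbor classes sum to 1.
    return {label: w / total for label, w in weights.items()}


# Toy data: two well-separated classes in a hypothetical 2-D feature space.
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((5.0, 5.0), "b"), ((6.0, 5.0), "b")]
memberships = fuzzy_knn(train, (0.2, 0.1), k=3)
print(max(memberships, key=memberships.get))
```

Unlike a plain majority vote, the output is a graded membership per class, so a query lying between two clusters yields a visibly uncertain result instead of a hard label.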
Article
G protein-coupled receptors (GPCRs) form a large superfamily of membrane proteins that play an essential role in modulating many vital physiological events, such as cell communication, neurotransmission, sensory perception, and chemotaxis. Understanding the three-dimensional (3D) structures of these receptors and their binding interactions with G proteins will help in the design of drugs for the treatment of GPCR-related diseases. By means of structural bioinformatics, 3D structures of the human alpha-13 subunit of guanine nucleotide-binding protein (G alpha 13) and the human thromboxane A2 (TXA2) receptor were developed. The former plays an important role in the control of cell growth and may serve as a prototypical G protein; the latter is a target for nitric oxide-mediated desensitization and may serve as a prototypical GPCR. On the basis of the 3D models, their coupling interactions were investigated via docking studies. It was found that the two proteins are coupled with each other mainly through the interaction between the minigene of G alpha 13 and the third intracellular loop of the TXA2 receptor, consistent with existing deductions in the literature. However, closer inspection also showed that some residues of the TXA2 receptor that are far from the loop region in sequence but spatially close to it are also involved in forming hydrogen bonds with the minigene of G alpha 13. These findings may provide useful information for conducting mutagenesis and for revealing the molecular mechanism by which the human TXA2 receptor interacts with G alpha 13 to activate intracellular signaling. The findings may also provide useful insights for stimulating new therapeutic approaches by manipulating the interaction of the receptor with the relevant G proteins.