Article

Alignment‐Free Classification of G‐Protein‐Coupled Receptors Using Self‐Organizing Maps.

Abstract

ChemInform is a weekly Abstracting Service, delivering concise information at a glance that was extracted from about 200 leading journals.

... Various alignment-free descriptors and kernels have been successfully used with superior performance, especially for identifying remotely similar protein families. Examples include: the mismatch string kernel used by Leslie et al. (2003) [15]; dipeptide composition used with SVMs by Bhasin and Raghava (2005) [5]; n-gram frequencies used with decision tree and naive Bayes classifiers by Cheng et al. (2005) [7]; and self-organizing maps used by Otaki et al. (2006) [22]. In Opiyo and Moriyama (2007) [20], we tested partial least squares regression (PLS) methods using physico-chemical properties of amino acids for identifying Cyt-b561 proteins from short expressed sequence tags (ESTs) derived from the A. thaliana genome. ...
Article
Cytochrome b561 (Cyt-b561) proteins are important for plant growth, development, and prevention of damage to plants. Because of their high sequence divergence, thorough mining of Cyt-b561 proteins from plant genomes is not easy. Currently only one Cyt-b561 gene has been found in the maize genome and none in the soybean genome. However, 22 have been identified in the Arabidopsis thaliana genome. We tested alignment-free protein classifiers based on partial least squares (PLS) and support vector machines to identify Cyt-b561. These classifiers performed better than profile hidden Markov models and PSI-BLAST. Using these classifiers, we identified new Cyt-b561-related proteins from four plant genomes.
... Besides the general tools for genome annotation, there are also classifiers that target specific membrane protein families and their division into classes. For example, to classify members of the GPCR family, several computational methods have been used, namely phylogenetic analysis (an A-F GPCR classification system [36]; with a hidden Markov model-based search, GRAFS [37]; see Fig. 1), self-organizing maps [38], neighbor-joining [39], the unweighted pair group method with arithmetic mean [40], and multidimensional scaling [41]. A useful hierarchical integration of various alignment-based and alignment-free classification methods was implemented in the 7TMRmine web server for discovering 7TMRs (seven transmembrane region-containing receptors) [42]. ...
Chapter
Membrane proteins are still the “Wild West” of structural biology. Although more and more membrane protein structures are being determined, their function is still difficult to investigate because they are fully functional only in membranous environments. Several specific methodologies have been developed to investigate various aspects of their cellular life, but they remain challenging for computational methods. In this chapter we summarize the efforts made to elucidate the structural and dynamical properties of different types of membrane proteins, emphasizing those computational methods that were designed and employed specifically to study membrane proteins, including their interactions in complex membranous systems.
... This is, therefore, an ideal protein family for us to use to analyze classifier performance at various degrees of similarity. GPCRs have also been used in previous classifier developments [e.g., 8–10,13–17,19,33]. As shown in Table 1, entries in GPCRDB are derived from the Swiss-Prot Protein Knowledgebase [34], a curated protein database providing high-quality annotations, as well as its computer-annotated supplement, TrEMBL. ...
Article
Computational methods of predicting protein functions rely on detecting similarities among proteins. However, sufficient sequence information is not always available for some protein families. For example, proteins of interest may be new members of a divergent protein family. The performance of protein classification methods could vary in such challenging situations. Using the G-protein-coupled receptor superfamily as an example, we investigated the performance of several protein classifiers. Alignment-free classifiers based on support vector machines using simple amino acid compositions were effective in remote-similarity detection even from short fragmented sequences. Although it is computationally expensive, a support vector machine classifier using local pairwise alignment scores showed very good balanced performance. More commonly used profile hidden Markov models were generally highly specific and well suited to classifying well-established protein family members. It is suggested that different types of protein classifiers should be applied to gain the optimal mining power.
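The "simple amino acid compositions" mentioned in this abstract can be illustrated with a minimal sketch. It assumes the 20 standard residues and skips any other character; the function names `aa_composition` and `dipeptide_composition` are hypothetical and are not taken from the cited work. The dipeptide variant is included only as a common extension of the same idea.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)

def aa_composition(seq):
    """Fraction of each standard amino acid in seq, as a length-20 list."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Fraction of each ordered residue pair in seq, as a length-400 list."""
    seq = seq.upper()
    # Overlapping pairs; pairs containing a non-standard character are skipped.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    keys = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    if not valid:
        return [0.0] * len(keys)
    return [counts[k] / len(valid) for k in keys]
```

Because the vectors have a fixed length regardless of sequence length, they can be computed for short fragments without any alignment and passed to a classifier such as a support vector machine.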
Article
Chemoinformatics has evolved from the marriage of two branches of science, namely chemistry and information technology. In this paper, a neural network trained by an evolutionary algorithm is used as a classifier for Chemoinformatics data sets. The results of the evolutionary neural network classifier are promising.
Article
The G-protein coupled receptors (GPCRs) form a large protein family in the human genome that has been widely studied and classified into classes and phylogenetic subfamilies. However, there still exist orphan GPCRs that are not classified in any of the known subfamilies, and new bioinformatics approaches are still needed to address this issue. One of the interesting features of GPCRs is that a large proportion of these proteins are encoded by intronless genes. In this work, we study Rhodopsin-like GPCR proteins encoded by this kind of gene. After a manual validation of their gene structure, we studied some of their properties, including the number of exons, chromosomal location and protein length. The same trend was found for intronless GPCRs as for total GPCRs, particularly the uneven chromosomal distribution, with a large number (one third) of GPCRs on chromosomes 1 and 11. The proportion of intronless GPCRs among all Rhodopsin-like GPCRs was estimated at about 26%, which is significantly less than previously reported. Significant differences in protein length were found between subfamilies. We then used composition properties of DNA and protein sequences to classify intronless Rhodopsin-like GPCRs. Principal component analysis was used to identify key variables, and then a discriminant analysis was used to compute discriminant functions that best separate the phylogenetic subfamilies. We found that the most important features for separating the groups are the proportion of aromatic amino acids in the protein sequence and the contrast between (A+T) and (G+C) in the coding sequence. These functions are finally used to classify fourteen putative or unclassified GPCRs.
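The two composition features highlighted in this abstract could be computed roughly as follows. This is only a sketch under stated assumptions: the abstract does not say which residues count as aromatic (F, W and Y are assumed here) or how the (A+T) versus (G+C) contrast is defined (a normalized difference is assumed), and both function names are hypothetical.

```python
def aromatic_fraction(protein, aromatic="FWY"):
    """Proportion of aromatic residues in a protein sequence.

    The default residue set (F, W, Y) is an assumption, not the paper's definition.
    """
    protein = protein.upper()
    if not protein:
        return 0.0
    return sum(ch in aromatic for ch in protein) / len(protein)

def at_gc_contrast(cds):
    """Assumed contrast: ((A+T) - (G+C)) / (A+T+G+C) over a coding sequence.

    Ranges from -1.0 (only G/C) to +1.0 (only A/T); other characters are ignored.
    """
    cds = cds.upper()
    at = sum(ch in "AT" for ch in cds)
    gc = sum(ch in "GC" for ch in cds)
    if at + gc == 0:
        return 0.0
    return (at - gc) / (at + gc)
```

Features of this kind would form the input columns for the principal component and discriminant analyses described above; those steps are not reproduced here.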
Article
Proteins are classified mainly on the basis of alignments of amino acid sequences. Drug discovery processes based on pharmacologically important proteins such as G-protein-coupled receptors (GPCRs) may be facilitated if more information is extracted directly from the primary sequences. Here, we investigate an alignment-free approach to protein classification using self-organizing maps (SOMs), a kind of artificial neural network, which needs only primary sequences of proteins and determines their relative locations in a two-dimensional lattice of neurons through an adaptive process. We first showed that a set of 1397 aligned samples of Class A GPCRs can be classified by our SOM program into 15 conventional categories with 99.2% accuracy. Similarly, a nonaligned raw sequence data set of 4116 samples was categorized into 15 conventional families with 97.8% accuracy in a cross-validation test. Orphan GPCRs were also classified appropriately using the result of the SOM learning. A supposedly diverse family of olfactory receptors formed the most distinctive cluster in the map, whereas amine and peptide families exhibited diffuse distributions. A feature of this kind in the map can be interpreted to reflect hierarchical family composition. Interestingly, some orphan receptors that were categorized as olfactory were somatosensory chemoreceptors. These results suggest the applicability and potential of the SOM program to classification prediction and knowledge discovery from protein sequences.
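To make the idea of a self-organizing map concrete, here is a minimal, generic SOM sketch in plain Python. It is not the authors' SOM program: the abstract does not describe how sequences are encoded, so the input is assumed to be fixed-length numeric vectors (for example, composition vectors), and the grid size, linear decay schedule, Gaussian neighbourhood and majority-vote labelling are all illustrative assumptions. Every function name is hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def best_matching_unit(weights, v):
    """Grid position whose weight vector is closest (squared Euclidean) to v."""
    return min(weights,
               key=lambda pos: sum((w - x) ** 2 for w, x in zip(weights[pos], v)))

def train_som(vectors, rows=4, cols=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for v in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighbourhood radius shrinks
            br, bc = best_matching_unit(weights, v)
            for (r, c), w in weights.items():
                # Gaussian neighbourhood measured on the 2-D grid, not in input space.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
            step += 1
    return weights

def label_neurons(weights, vectors, labels):
    """Give each neuron the majority label of the training samples mapped to it."""
    votes = defaultdict(Counter)
    for v, lab in zip(vectors, labels):
        votes[best_matching_unit(weights, v)][lab] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in votes.items()}

def classify(weights, neuron_labels, v):
    """Label of the neuron that v maps to, or None if that neuron is unlabeled."""
    return neuron_labels.get(best_matching_unit(weights, v))
```

In this sketch, an unlabeled sequence such as an orphan receptor would be encoded the same way as the training data and assigned the label of the neuron it maps to. Where the samples of a family land on the grid gives the kind of cluster picture the abstract describes.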