DomSVR: Domain boundary prediction with support vector regression from sequence information alone

Department of Systems and Computer Science, Howard University, 2400 Sixth Street, NW, Washington, DC 20059, USA.
Amino Acids (Impact Factor: 3.29). 02/2010; 39(3):713-26. DOI: 10.1007/s00726-010-0506-6
Source: PubMed


Protein domains are structural and fundamental functional units of proteins. The information of protein domain boundaries is helpful in understanding the evolution, structures and functions of proteins, and also plays an important role in protein classification. In this paper, we propose a support vector regression-based method to address the problem of protein domain boundary identification based on novel input profiles extracted from AAindex database. As a result, our method achieves an average sensitivity of approximately 36.5% and an average specificity of approximately 81% for multi-domain protein chains, which is overall better than the performance of published approaches to identify domain boundary. As our method used sequence information alone, our method is simpler and faster.

  • Source
    • "Improving the accuracy in predicting multidomain boundaries is a challenging task because the accuracy usually is considerably less than 40% [35]. We assessed the performances of DomHR on single-domain, two-domain, and multidomain sequences in the S1508 set (Table 3), and the results showed that our approach was a balanced predictor with an excellent ability to predict single-domain, two-domain and multidomain boundaries. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The precise prediction of protein domains, which are the structural, functional and evolutionary units of proteins, has been a research focus in recent years. Although many methods have been presented for predicting protein domains and boundaries, the accuracy of predictions could be improved. In this study we present a novel approach, DomHR, which is an accurate predictor of protein domain boundaries based on a creative hinge region strategy. A hinge region was defined as a segment of amino acids that covers part of a domain region and a boundary region. We developed a strategy to construct profiles of domain-hinge-boundary (DHB) features generated by sequence-domain/hinge/boundary alignment against a database of known domain structures. The DHB features had three elements: normalized domain, hinge, and boundary probabilities. The DHB features were used as input to identify domain boundaries in a sequence. DomHR used a nonredundant dataset as the training set, the DHB and predicted shape string as features, and a conditional random field as the classification algorithm. In predicted hinge regions, a residue was determined to be a domain or a boundary according to a decision threshold. After decision thresholds were optimized, DomHR was evaluated by cross-validation, large-scale prediction, independent test and CASP (Critical Assessment of Techniques for Protein Structure Prediction) tests. All results confirmed that DomHR outperformed other well-established, publicly available domain boundary predictors for prediction accuracy. The DomHR is available at
    PLoS ONE 04/2013; 8(4):e60559. DOI:10.1371/journal.pone.0060559 · 3.23 Impact Factor
  • Source
    • "Protein sequence with closely related homologues can reveal conserved regions which are functionally important (Elhefnawi et al., 2010). Nowadays, it is not only important to detect a protein domain accurately from large numbers of protein sequences with unknown structure, but it is also essential to detect protein domain boundaries of the protein sequence (Chen et al., 2010). "

    Recurrent Neural Networks and Soft Computing, 03/2012; , ISBN: 978-953-51-0409-4
  • Source
    • "Also Wilcoxon rank sum test (Hollander and Wolfe 1999) is conducted to judge the statistical significance and statbility of clusters found by the proposed method. In bioinformatics research on protein sequences, the AAindex1 database has been used in wide range applications, e.g., prediction of post-translational modification (PTM) sites of proteins (Plewczynski et al. 2008; Basu and Plewczynski 2010), protein subcellular localization (Huanga et al. 2007; Tantoso and Li 2008; Liao et al. 2010; Laurila and Vihinen 2010), immunogenicity of MHC class I binding peptides (Tung and Ho 2007; Tian et al. 2009), protein SUMO modification site (Liu et al. 2007; Lu et al. 2010), coordinated substitutions in multiple alignments of protein sequences (Afonnikov and Kolchanov 2004), HIV protease cleavage site prediction (Ogul 2009; Nanni and Lumini 2009), and many more (Jiang et al. 2009; Liang et al. 2009; Soga et al. 2010; Chen et al. 2010; Pugalenthi et al. 2010). "
    [Show abstract] [Hide abstract]
    ABSTRACT: In this article, we categorize presently available experimental and theoretical knowledge of various physicochemical and biochemical features of amino acids, as collected in the AAindex database of known 544 amino acid (AA) indices. Previously reported 402 indices were categorized into six groups using hierarchical clustering technique and 142 were left unclustered. However, due to the increasing diversity of the database these indices are overlapping, therefore crisp clustering method may not provide optimal results. Moreover, in various large-scale bioinformatics analyses of whole proteomes, the proper selection of amino acid indices representing their biological significance is crucial for efficient and error-prone encoding of the short functional sequence motifs. In most cases, researchers perform exhaustive manual selection of the most informative indices. These two facts motivated us to analyse the widely used AA indices. The main goal of this article is twofold. First, we present a novel method of partitioning the bioinformatics data using consensus fuzzy clustering, where the recently proposed fuzzy clustering techniques are exploited. Second, we prepare three high quality subsets of all available indices. Superiority of the consensus fuzzy clustering method is demonstrated quantitatively, visually and statistically by comparing it with the previously proposed hierarchical clustered results. The processed AAindex1 database, supplementary material and the software are available at .
    Amino Acids 10/2011; 43(2):583-94. DOI:10.1007/s00726-011-1106-9 · 3.29 Impact Factor
Show more