Article

Prediction of protein domain with mRMR feature selection and analysis.

Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China.
PLoS ONE (impact factor: 4.09). 01/2012; 7(6):e39308. DOI:10.1371/journal.pone.0039308
Source: PubMed

ABSTRACT Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate protein structure prediction and speed up functional annotation. Although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features that describe physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, about 28-40% higher than the rates achieved by the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, which is consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may at the very least complement the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, the current approach could also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
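
The abstract names the pipeline (mRMR ranking, then IFS over the ranked list with a classifier scoring each subset) without spelling out the steps. The following is a minimal sketch of that idea, not the authors' implementation. It assumes discrete features, uses the difference form of mRMR (relevance minus mean redundancy), and swaps in a leave-one-out nearest-neighbour vote for the random forest so that it runs on the standard library alone. The toy data and every function name are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_rank(columns, labels):
    """Greedy mRMR: order feature indices by relevance minus mean redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    remaining = list(range(len(columns)))
    ranked = []

    def score(j):
        if not ranked:
            return relevance[j]
        redundancy = sum(mutual_information(columns[j], columns[k])
                         for k in ranked) / len(ranked)
        return relevance[j] - redundancy

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def loo_nearest_vote_accuracy(columns, labels, subset):
    """Leave-one-out accuracy of a nearest-neighbour vote (Hamming distance).

    Stand-in for the random forest named in the abstract (an assumption
    made to keep the sketch dependency-free).
    """
    n = len(labels)
    hits = 0
    for i in range(n):
        dists = {j: sum(columns[f][i] != columns[f][j] for f in subset)
                 for j in range(n) if j != i}
        d_min = min(dists.values())
        votes = Counter(labels[j] for j, d in dists.items() if d == d_min)
        hits += votes.most_common(1)[0][0] == labels[i]
    return hits / n


def incremental_feature_selection(ranked, evaluate):
    """Score the top-1, top-2, ... subsets; keep the first best-scoring one."""
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        s = evaluate(subset)
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, best_score


# Hypothetical toy data: 10 samples, 3 classes, 4 binary features.
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # feature 0: class 0 vs. the rest
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # feature 1: exact copy of feature 0
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # feature 2: class 2 vs. the rest
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # feature 3: nearly uninformative
]

ranked = mrmr_rank(columns, labels)
best_subset, best_score = incremental_feature_selection(
    ranked, lambda s: loo_nearest_vote_accuracy(columns, labels, s))
print(ranked, best_subset, best_score)
```

On this toy data the redundancy penalty pushes the duplicated feature behind the complementary one, and IFS stops growing the subset once adding features no longer improves the score. A real application would replace the stand-in evaluator with a cross-validated random forest and would need a discretisation or density-based estimate of mutual information for continuous features.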

  • Article: Automated prediction of CASP-5 structures using the Robetta server.
    ABSTRACT: Robetta is a fully automated protein structure prediction server that uses the Rosetta fragment-insertion method. It combines template-based and de novo structure prediction methods in an attempt to produce high-quality models that cover every residue of a submitted sequence. The first step in the procedure is the automatic detection of the locations of domains and selection of the appropriate modeling protocol for each domain. For domains matched to a homolog with an experimentally characterized structure by PSI-BLAST or Pcons2, Robetta uses a new alignment method, called K*Sync, to align the query sequence onto the parent structure. It then models the variable regions by allowing them to explore conformational space with fragments in a fashion similar to the de novo protocol, but in the context of the template. When no structural homolog is available, domains are modeled with the Rosetta de novo protocol, which allows the full length of the domain to explore conformational space via fragment insertion, producing a large decoy ensemble from which the final models are selected. The Robetta server produced quite reasonable predictions for targets in the recent CASP-5 and CAFASP-3 experiments, some of which were at the level of the best human predictions.
    Proteins Structure Function and Bioinformatics 02/2003; 53 Suppl 6:524-33. · 3.39 Impact Factor
  • Article: Protein domain prediction.
    ABSTRACT: Domains are considered to be the building blocks of protein structures. A protein can contain a single domain or multiple domains, each one typically associated with a specific function. The combination of domains determines the function of the protein, its subcellular localization and the interactions it is involved in. Determining the domain structure of a protein is important for multiple reasons, including protein function analysis and structure prediction. This chapter reviews the different approaches for domain prediction and discusses lessons learned from the application of these methods.
    Methods in molecular biology (Clifton, N.J.) 02/2008; 426:117-43.
  • Article: Partitioning protein structures into domains: why is it so difficult?
    ABSTRACT: This analysis takes an in-depth look into the difficulties encountered by automatic methods for domain decomposition from three-dimensional structure. The analysis involves a multi-faceted set of criteria including the integrity of secondary structure elements, the tendency toward fragmentation of domains, domain boundary consistency and topology. The strength of the analysis comes from the use of a new comprehensive benchmark dataset, which is based on consensus among experts (CATH, SCOP and the authors of the 3D structures) and covers 30 distinct architectures and 211 distinct topologies as defined by CATH. Furthermore, over 66% of the structures are multi-domain proteins, with each domain combination occurring once per dataset. The performance of four automatic domain assignment methods, DomainParser, NCBI, PDP and PUU, is carefully analyzed using this broad spectrum of topology combinations and knowledge of the rules and assumptions built into each algorithm. We conclude that it is practically impossible for an automatic method to achieve the level of performance of human experts. However, we propose specific improvements to automatic methods as well as broadening the concept of a structural domain. Such work is a prerequisite for establishing improved approaches to domain recognition. (The benchmark dataset is available from http://pdomains.sdsc.edu).
    Journal of Molecular Biology 09/2006; 361(3):562-90. · 4.00 Impact Factor


Keywords

annotating protein domains
biochemical properties
codon diversity
domain prediction
existing domain prediction methods
experimental investigations
experimental observations
HIV protease cleavage sites
incremental feature selection
maximum relevance minimum redundancy
protein domains
protein sequences
random forest
residual disorder
secondary structure
sequence conservation
solvent accessibility
structure prediction
study protein signal peptides
useful insights