Article

A robust method to detect structural and functional remote homologues.

School of Computer Science and Engineering, Hebrew University, Jerusalem, Israel.
Proteins Structure Function and Bioinformatics (impact factor: 3.39). 12/2004; 57(3):531-8. DOI:10.1002/prot.20235 pp.531-8
Source: PubMed

ABSTRACT With currently available sequence data, it is feasible to conduct extensive comparisons among large sets of protein sequences. It is still a much more challenging task to partition the protein space into structurally and functionally related families solely based on sequence comparisons. The ProtoNet system automatically generates a treelike classification of the whole protein space. It stands to reason that this classification reflects evolutionary relationships, both close and remote. In this article, we examine this hypothesis. We present a semiautomatic procedure that singles out certain inner nodes in the ProtoNet tree that should ideally correspond to structurally and functionally defined protein families. We compare the performance of this method against several expert systems. Some of the competing methods incorporate additional extraneous information on protein structure or on enzymatic activities. The ProtoNet-based method performs at least as well as any of the methods with which it was compared. This article illustrates the ProtoNet-based method on several evolutionarily diverse families. Using this new method, an evolutionary divergence scheme can be proposed for a large number of structural and functional related superfamilies.

0 0
 · 
0 Bookmarks
 · 
35 Views
  • Source
    Article: EVEREST: automatic identification and classification of protein domains in all protein sequences.
    [show abstract] [hide abstract]
    ABSTRACT: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Processing the Swiss-Prot section of the UniProt Knoledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at 1, provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
    BMC Bioinformatics 02/2006; 7:277. · 2.75 Impact Factor
  • Source
    Article: Functional grouping based on signatures in protein termini.
    [show abstract] [hide abstract]
    ABSTRACT: The two ends of each protein are known as the amino (N-) and carboxyl (C-) termini. Short signatures in a protein's termini often carry vital cellular function. No systematic research has been conducted to address the importance of short signatures (3 to 10 amino acids) in protein termini at the proteomic level. Specifically, it is unknown whether such signatures are evolutionarily conserved, and if so, whether this conservation confers shared biological functions. Current signature detection methods fail to detect such short signatures due to inadequate statistical scores. The findings presented in this study strongly support the notion that functional significance of protein sets may be captured by short signatures at their termini. A positional search method was applied to over one million proteins from the UniProt database. The result is a collection of about a thousand significant signature groups (SIGs) that include previously identified as well as many novel signatures in protein termini. These SIGs represent protein sets with minimal or no overall sequence similarity excepting the similarity at their termini. The most significant SIGs are assigned by their strong correspondence to functional annotations derived from external databases such as Gene Ontology. Each of the SIGs is associated with the statistical significance of its functional association. These SIGs provide a valuable source for testing previously overlooked signatures in protein termini and allow for the investigation of the role played by such signatures throughout evolution. The SIGs archive and advanced search options are available at http://www.proteus.cs.huji.ac.il.
    Proteins Structure Function and Bioinformatics 07/2006; 63(4):996-1004. · 3.39 Impact Factor
  • Source
    Article: ProtoNet 4.0: a hierarchical classification of one million protein sequences.
    [show abstract] [hide abstract]
    ABSTRACT: ProtoNet is an automatic hierarchical classification of the protein sequence space. In 2004, the ProtoNet (version 4.0) presents the analysis of over one million proteins merged from SwissProt and TrEMBL databases. In addition to rich visualization and analysis tools to navigate the clustering hierarchy, we incorporated several improvements that allow a simplified view of the scaffold of the proteins. An unsupervised, biologically valid method that was developed resulted in a condensation of the ProtoNet hierarchy to only 12% of the clusters. A large portion of these clusters was automatically assigned high confidence biological names according to their correspondence with functional annotations. ProtoNet is available at: http://www.protonet.cs.huji.ac.il.
    Nucleic Acids Research 02/2005; 33(Database issue):D216-8. · 8.03 Impact Factor

Full-text (2 Sources)

View
3 Downloads
Available from
26 Nov 2012

Keywords

additional extraneous information
 
certain inner nodes
 
competing methods
 
conduct extensive comparisons
 
enzymatic activities
 
evolutionarily diverse families
 
evolutionary divergence scheme
 
expert systems
 
families
 
large sets
 
protein families
 
protein sequences
 
protein space
 
protein structure
 
ProtoNet system
 
ProtoNet-based method
 
semiautomatic procedure
 
structurally
 
superfamilies
 
whole protein space