Article
A robust method to detect structural and functional remote homologues.
School of Computer Science and Engineering, Hebrew University, Jerusalem, Israel.
Proteins Structure Function and Bioinformatics (impact factor: 3.39). 12/2004; 57(3):531-8.
DOI:10.1002/prot.20235
Source: PubMed
Cited In (4)
Article: EVEREST: automatic identification and classification of protein domains in all protein sequences.
ABSTRACT: Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data.
The EVEREST library of domain families, accessible for browsing and download at 1, provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
BMC Bioinformatics 02/2006; 7:277. · 2.75 Impact Factor
Article: Functional grouping based on signatures in protein termini.
ABSTRACT: The two ends of each protein are known as the amino (N-) and carboxyl (C-) termini. Short signatures in a protein's termini often carry vital cellular function. No systematic research has been conducted to address the importance of short signatures (3 to 10 amino acids) in protein termini at the proteomic level. Specifically, it is unknown whether such signatures are evolutionarily conserved, and if so, whether this conservation confers shared biological functions. Current signature detection methods fail to detect such short signatures due to inadequate statistical scores. The findings presented in this study strongly support the notion that functional significance of protein sets may be captured by short signatures at their termini. A positional search method was applied to over one million proteins from the UniProt database. The result is a collection of about a thousand significant signature groups (SIGs) that include previously identified as well as many novel signatures in protein termini. These SIGs represent protein sets with minimal or no overall sequence similarity except for the similarity at their termini. The most significant SIGs are assigned by their strong correspondence to functional annotations derived from external databases such as Gene Ontology. Each of the SIGs is associated with the statistical significance of its functional association. These SIGs provide a valuable source for testing previously overlooked signatures in protein termini and allow for the investigation of the role played by such signatures throughout evolution. The SIGs archive and advanced search options are available at http://www.proteus.cs.huji.ac.il.
Proteins Structure Function and Bioinformatics 07/2006; 63(4):996-1004. · 3.39 Impact Factor
Article: ProtoNet 4.0: a hierarchical classification of one million protein sequences.
ABSTRACT: ProtoNet is an automatic hierarchical classification of the protein sequence space. In 2004, ProtoNet (version 4.0) presents the analysis of over one million proteins merged from the SwissProt and TrEMBL databases. In addition to rich visualization and analysis tools to navigate the clustering hierarchy, we incorporated several improvements that allow a simplified view of the scaffold of the proteins. An unsupervised, biologically valid method was developed that condenses the ProtoNet hierarchy to only 12% of the clusters. A large portion of these clusters was automatically assigned high-confidence biological names according to their correspondence with functional annotations. ProtoNet is available at: http://www.protonet.cs.huji.ac.il.
Nucleic Acids Research 02/2005; 33(Database issue):D216-8. · 8.03 Impact Factor
Keywords
additional extraneous information
certain inner nodes
competing methods
conduct extensive comparisons
enzymatic activities
evolutionarily diverse families
evolutionary divergence scheme
expert systems
families
large sets
protein families
protein sequences
protein space
protein structure
ProtoNet system
ProtoNet-based method
semiautomatic procedure
structurally
superfamilies
whole protein space