Article

Exhaustive Enumeration of Protein Domain Families


Abstract

Domains are considered the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.
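
The boundary criterion quoted in the abstract ("within 10% of domain size") can be illustrated with a small check. This is only a sketch under assumptions: the abstract does not spell out how the tolerance is measured, so the function names, the use of the reference domain length, and the example numbers below are all hypothetical and not taken from the ADDA paper.

```python
def boundary_within_tolerance(predicted, reference, domain_length, tolerance=0.10):
    """Return True if a predicted boundary lies within
    tolerance * domain_length residues of the reference boundary.

    Assumption: the tolerance is measured against the length of the
    reference (e.g. SCOP) domain; the abstract does not state this.
    """
    return abs(predicted - reference) <= tolerance * domain_length


def fraction_within_tolerance(boundaries, tolerance=0.10):
    """boundaries: iterable of (predicted, reference, domain_length) tuples."""
    boundaries = list(boundaries)
    if not boundaries:
        raise ValueError("no boundaries to evaluate")
    hits = sum(
        boundary_within_tolerance(p, r, n, tolerance) for p, r, n in boundaries
    )
    return hits / len(boundaries)


# Made-up example: two of the three predicted boundaries fall inside the window.
example = [(105, 100, 120), (150, 100, 120), (48, 50, 80)]
fraction = fraction_within_tolerance(example)
```

With these made-up inputs the first and third boundaries fall inside the 10% window and the second does not, so `fraction` is 2/3.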


... Protein families originating from this kind of information constitute the Pfam-A family database. During Pfam's history, several automatic tools (such as Domainer [41] or ADDA [7]) have been used to build a complement to the main database, named Pfam-B. This second database contained automatically generated families which were gradually integrated into the Pfam-A collection using manual curation. ...
... In 1998 the ProDom [45] database was released, based on the MKDOM program [46], applied to about 20,000 proteins. The growth of protein sequence databases, together with more sophisticated bioinformatics methods and computational resources, has led over the years to the development of algorithms such as ADDA [7], EVEREST [8], and MCL [47], among others. The majority of these algorithms cluster full sequences rather than protein regions; ADDA and EVEREST are two of the few algorithms focused on protein regions, namely tackling the problem of defining protein family boundaries. ...
... ADDA [7] is an algorithm first published in 2003 which searches families starting from an all-to-all set of BLAST alignments. As a first step, ADDA performs a sophisticated domain decomposition on each protein sequence, which aims to identify correctly the boundaries of each possible family appearing in the protein, especially for complicated architectures or in cases of sequence fragments. ...
Thesis
Full-text available
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50, a protein sequence database including approximately 23M sequences. Here I present a pipeline, named DPCfam, developed to handle millions of sequences and data volumes of the order of 5 terabytes. First, I present a proof of concept on small datasets; then, I present the results obtained from running the optimized pipeline on the UniRef50 protein sequence database. DPCfam finds about 45,000 protein clusters in UniRef50. This automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
... "Complexity in biology has evolved through modification and recombination of existing building blocks instead of invention from scratch. In the protein world these building blocks have been termed domains and the identification and characterization of new domains and domain families is a major goal of protein science" [Heger and Holm, 2003]. In a more structure-oriented view, protein domains are usually defined either as recurrent evolutionary units, as independent folding units or as globular, more or less independent parts of a protein, and further definitions can be found in the literature (see chapter 2). ...
... Domain assignment in proteins is an important subtask of structure prediction, as domains are usually considered the basic units for protein folding, evolution, and function [Heger and Holm, 2003, Vogel et al., 2005], and thus the decomposition of proteins into domains can help in areas such as functional classification, homology-based structure prediction, and structural genomics [Liu and Rost, 2003]. Since 2004, the CASP and CAFASP experiments have included a domain prediction subcategory into their evaluations, which confirms the importance of this task. ...
... The algorithm described in this chapter, SSEP-Domain, predicts protein domains using the amino acid sequence of the target on the basis of alignments to known SCOP domains. Other recent approaches for domain recognition from sequence are also often alignment-based, such as ADDA [Heger and Holm, 2003], the Dompred-DomSSEA approach [Marsden et al., 2002], and DOMAINATION [George and Heringa, 2002a]. DO-PRO [von Öhsen, 2005] uses stochastic models on the alignment-based output of the Arby structure prediction server [von Öhsen et al., 2004]. ...
... Many current approaches perform the detection of sequence domains by detecting only putative regions and as a part of domain family identification, i.e., protein clustering based on domains rather than as a separate question [3][4][5]. To the best of our knowledge there is no published research concentrating only on detecting conserved regions in proteins. ...
... While detection of structural domains as opposed to functional domains has been an area of active research featured as a part of CASP experiments [7][8][9], the detection of sequence level conserved regions has been mostly overshadowed by efforts to cluster proteins based on their domain families [3][4][5]. In MKDOM2 [5], the shortest sequence (without repetitions) is marked as a domain and then PSI-BLAST [10] is used to detect regions of high similarity with that domain. ...
... In ADDA [3], an all vs. all BLAST [15] is performed on the entire data set. Based on the results of BLAST, a tree of putative domain regions is generated for each sequence. ...
Article
Full-text available
Background Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. Methods In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. Results We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. 
Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences.
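
The idea of using exact matching k-mers to flag conserved regions can be sketched as follows. This is a simplified illustration and not the NADDA implementation: it omits the machine-learning step and the MapReduce parallelisation described in the abstract, and the function names, the value of k, the support threshold and the toy sequences are assumptions made for the example.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """Count, for each k-mer, how many distinct sequences contain it."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def kmer_support_profile(seq, counts, k):
    """For each k-mer start position in seq, the number of sequences
    (including seq itself) that contain the k-mer starting there."""
    return [counts[seq[i:i + k]] for i in range(len(seq) - k + 1)]


def conserved_positions(profile, min_support):
    """Start positions whose k-mer occurs in at least min_support sequences."""
    return [i for i, support in enumerate(profile) if support >= min_support]


# Toy sequences sharing the segment "TAYIA" (hypothetical data).
sequences = ["MKTAYIAKQR", "GGTAYIAKLL", "PPPPTAYIAW"]
counts = kmer_sequence_counts(sequences, k=5)
profile = kmer_support_profile(sequences[0], counts, k=5)  # [1, 1, 3, 2, 1, 1]
hits = conserved_positions(profile, min_support=2)         # [2, 3]
```

Positions with high support cluster around the shared segment; a real method would still need to turn such a profile into region boundaries, which is where the abstract's learning step comes in.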
... An alternative approach to manually curated family classification is performing automatic, sequence-based classification of protein regions. Automated family classification has a long history in protein bioinformatics and over the years has led to the development of algorithms such as ADDA [13], COG [14], EVEREST [15], CD-HIT [16], linclust [17], UCLUST [18] and MCL [19], among others. Most of these methods aim to find conserved family architectures (i.e., full-length sequence homologs). ...
... To our knowledge ADDA and EVEREST are the only ones that were specifically developed to identify individual families. EVEREST uses Pfam information to infer the general notion of "protein family" via a supervised learning step [15] while the ADDA clustering algorithm uses elaborate models to extract information from the sequence space and define domain boundaries [13]. The published implementations of these two algorithms have not been maintained in the last years and are thus obsolete with respect to current operating systems. ...
... For these reasons, databases that attempt to classify protein families and domains use extensively either manual annotation or structural knowledge (often both). Nonetheless, unsupervised, automatic domain classification from sequence [13] [15] [19] is extremely relevant both to identify conserved regions that can later be manually refined and annotated to create novel families and for complementing manual classification in differential domain analysis of large datasets with a high degree of sequence novelty (such as for example sequences from environmental genomics [32] [33]). Here, we have presented a new unsupervised procedure for automatic protein domain classification based on Density Peak Clustering. ...
Article
Full-text available
Background The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the most well known protein family database, built in many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time consuming and it may suffer from a bias generated from the hand-curation itself, which is often guided by the available experimental evidence. Results We introduce a procedure that aims to identify automatically putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically-generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results. 
Conclusions The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
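
For readers unfamiliar with Density Peak Clustering, the generic heuristic can be sketched on a plain distance matrix. This is a minimal illustration of the general technique, not the procedure used in the work above: how distances are derived from local pairwise alignments, and how the cutoff and the number of clusters are chosen, are not stated here, so `d_c`, `n_clusters` and the toy data are assumptions.

```python
def density_peak_clustering(dist, d_c, n_clusters):
    """Generic density-peak heuristic on a symmetric distance matrix.

    dist       : list of lists, dist[i][j] = distance between points i and j
    d_c        : cutoff distance used to estimate local density (assumed)
    n_clusters : number of cluster centers to pick, at least 1 (assumed)
    Returns a list of cluster labels, one per point.
    """
    n = len(dist)
    # Local density: number of other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from highest to lowest density; index breaks ties.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n    # distance to the nearest point of higher density
    parent = [-1] * n    # that nearest higher-density point
    for rank, i in enumerate(order):
        if rank == 0:
            # Convention: the densest point gets its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i] = dist[i][j]
        parent[i] = j
    # Centers are the points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:n_clusters]
    labels = [-1] * n
    for label, i in enumerate(centers):
        labels[i] = label
    # Every other point inherits the label of its higher-density parent.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Toy one-dimensional data with two well-separated groups (hypothetical).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3]
dist = [[abs(a - b) for b in points] for a in points]
labels = density_peak_clustering(dist, d_c=0.25, n_clusters=2)
```

On this toy input the first three points end up in one cluster and the last four in the other. In practice the centers are often chosen by inspecting a plot of density against delta rather than by fixing their number in advance.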
... MCL [12]). Exceptions were ADDA [13] and EVEREST [14]. ADDA was used by Pfam to automatically generate the Pfam-B database until 2015 but subsequently dismissed because of its high computational costs. ...
... However, most automatic methods have focused on clustering of full-length protein sequences. The few methods that, to the best of our knowledge, have attempted to automatically classify evolutionary modules in proteins (most typically representing structural domains) (ADDA [13], EVEREST [14]), have faced scalability and/or maintenance problems. ...
Article
Full-text available
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 80% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
... As a consequence, it can be used effectively to complement manual annotations. Automated family classification has a long history in protein bioinformatics and over the years has led to the development of algorithms such as ADDA [14], COG [35], EVEREST [31], and MCL [11], among others. Until 2015, the ADDA clustering algorithm was used to produce Pfam-B, which was an automatically-built companion to the manually curated Pfam main family collection. ...
... For these reasons, protein family databases that attempt to classify protein domains (Pfam [32], InterPro [28], CDD [24], SCOP [29], ECOD [7] to name but a few) use extensively either manual annotation or structural knowledge (often both). Nonetheless, unsupervised, automatic domain classification from sequence [14] [31] [11] is extremely relevant both to identify conserved regions that can later be manually refined and annotated to create novel families and for complementing manual classification in differential domain analysis of large datasets with a high degree of sequence novelty (such as for example sequences from environmental genomics [34] [18]). Here, we have presented a new unsupervised method based on Density Peak Clustering, which we named DPCfam, for the purpose of automatic protein domain classification. ...
Preprint
Full-text available
As the UniProt database approaches the 200 million entry mark, the vast majority of proteins it contains lack any experimental validation of their functions. In this context, the identification of homologous relationships between proteins remains the single most widely applicable tool for generating functional and structural hypotheses in silico. Although many databases exist that classify proteins and protein domains into homologous families, large sections of the sequence space remain unassigned. We introduce DPCfam, a new unsupervised procedure that uses sequence alignments and Density Peak Clustering to automatically classify homologous protein regions. Here, we present a proof-of-principle experiment based on the analysis of two clans from the Pfam protein family database. Our tests indicate that DPCfam automatically generated clusters are generally evolutionarily accurate, corresponding to one or more Pfam families, and that they cover a significant fraction of known homologs. Overall, DPCfam shows potential both for assisting manual annotation efforts (domain discovery, detection of classification inconsistencies, improvement of family coverage and boosting of clan membership) and as a stand-alone tool for unsupervised classification of sparsely annotated protein datasets such as those from environmental metagenomics studies (domain discovery, analysis of domain diversity). The algorithm implementation used in this paper is available at https://gitlab.com/ETRu/dpcfam (requires Python 3 and a C++ compiler, and runs on Linux systems); data are available at https://zenodo.org/record/3934399
... However, as noted above, they require a great deal of human labor and expertise and cannot discover new domain families. DOMO [8] and the more recent ADDA [9] are two fully automatic systems that define domains and classify them. We will use ADDA as a yardstick by which we test our achievements here. ...
... Of those clusters 21829 intersect with Pfam reference families and 6372 intersect with SCOP reference families. The second protagonist system is ADDA [9]. This recently published domain identification and clustering algorithm has significantly improved on all previously known methods. ...
Article
Motivation: Proteins are comprised of one or several domains. Such domains can be classified into families according to their biological function. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational tools for large-scale determination of protein domains and their boundaries. The present paper addresses the challenge of developing computational tools to identify protein domains and to classify them into their families. The eventual goal of our research is to automatically identify and classify correctly all protein domains. Results: Our method, called EVEREST, combines methodologies from the fields of finite metric spaces, machine learning and statistical modeling and achieves state of the art results. Our process begins by constructing a database of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments, choosing the best clusters using machine learning techniques, and creating a statistical model for each of these clusters. This procedure is then iterated: The aforementioned statistical models are used to scan all protein sequences, to recreate a segment database and to cluster them again. Performance tests show that EVEREST recovers 63% of Pfam families and 40% of SCOP families with high accuracy, and suggests new families with about 40% fidelity. EVEREST domains are frequently a combination of domains as defined by Pfam or SCOP and frequently subdomains of such domains. The paper is concluded with a discussion of research avenues to improve these results. Availability: A database of statistical models (HMMER HMMs), one per domain family is available for download at http://www.cs.huji.ac.il/ elonp/everest.
... [21][22][23] Through evolution, mutations in the coding region of a gene are likely to have a different biological function, especially if the mutations occur in the protein domain, since they are generally considered as the basic units of protein folding, evolution, and function. [24] Ramu et al. [25] highlighted some possible deleterious mutations in domesticated cassava using whole genomic screening experiments of wild ancestors and cultivars. Like banana, cassava cultivars are clonally propagated and this genomic screening study suggests that many deleterious mutations have not been crossed out. ...
... Advanced whole genomic screening experiments enable the identification and interpretation of mutations at the genome level. [24,25] Although we do not have access (yet) to whole-genome sequencing data from triploid banana cultivars, we show that proteomics is an easily accessible complementary alternative to detect the different allele specific SNPs/SAAPs. ...
Article
Full-text available
Proteomics has been applied with great potential to elucidate molecular mechanisms in plants. This is especially valid in the case of non-model crops whose genome has not been sequenced yet or is not well annotated. Plantains are a kind of cooking banana that is economically very important in Africa, India and Latin America. The aim of this work was to characterize the fruit proteome of common dessert bananas and plantains and to identify proteins that are only encoded by the plantain genome. We present the first plantain fruit proteome. All data are available via ProteomeXchange with identifier PXD005589. Using our in-house workflow, we found 37 alleles to be unique for plantain covered by 59 peptides. Although we do not have access (yet) to whole-genome sequencing data from triploid banana cultivars, we show that proteomics is an easily accessible complementary alternative to detect different allele specific SNPs/SAAPs. These unique alleles might contribute towards the differences in the metabolism between dessert bananas and plantains. This dataset will stimulate further analysis by the scientific community, boost plantain research and facilitate plantain breeding.
... It has long been recognized that sequence evolution is not tree-like, in particular because of domain shuffling (Enright et al. 1999;Marcotte et al. 1999;Portugaly et al. 2006). It has also long been recognized that this non-tree-like evolution results in a network of sequence relationships (Sonnhammer and Kahn 1994;Park et al. 1997;Enright and Ouzounis 2000;Heger and Holm 2003;Ingolfsson and Yona 2008;Song et al. 2008). However, for an almost equally long period of time, it has been assumed that the right way to process this network was to carve it into homologous parts by clustering (Tatusov et al. 1997;Enright and Ouzounis 2000;Yona et al. 2000). ...
... Databases such as homologene (http://www.ncbi.nlm.nih.gov/homologene, last accessed December 10, 2013) and COG (http://www.ncbi.nlm.nih.gov/COG/, last accessed December 10, 2013) only contain genes that are allowed to be in one family. Although we do not deny that database entries of such sequences are likely or certain to be homologs, sole focus on those kinds of evolving entities (entries that trace their heredity to a single common ancestor) and the heuristic of requiring homologs to manifest near- or full-length significant sequence similarity has clearly resulted in biases and information loss, as has been demonstrated (Sonnhammer and Kahn 1994;Park et al. 1997;Enright and Ouzounis 2000;Heger and Holm 2003;Ingolfsson and Yona 2008;Song et al. 2008). Even if we had a universally agreed definition of the gene (Epp 1997), it remains much more complicated to decide what might be a gene family. ...
Article
Full-text available
Defining homologous genes is important in many evolutionary studies but raises obvious issues. Some of these issues are conceptual, and stem from our assumptions of how a gene evolves, others are practical, and depend on the algorithmic decisions implemented in existing software. Therefore, in order to make progress in the study of homology, both ontological and epistemological questions must be considered. In particular, defining homologous genes cannot solely be addressed under the classic assumptions of strong tree-thinking, according to which genes evolve in a strictly tree-like fashion of vertical descent and divergence and the problems of homology detection are primarily methodological. Gene homology could also be considered under a different perspective where genes evolve as 'public goods', subjected to various introgressive processes. In this latter case, defining homologous genes becomes a matter of designing models suited to the actual complexity of the data and how such complexity arises, rather than trying to fit genetic data to some a priori tree-like evolutionary model, a practice that inevitably results in the loss of much information. Here we show how important aspects of the problems raised by homology detection methods can be overcome when even more fundamental roots of these problems are addressed by analysing 'public goods thinking' evolutionary processes through which genes have frequently originated. This kind of thinking acknowledges distinct types of homologs, characterised by distinct patterns, in phylogenetic and non phylogenetic unrooted or multi-rooted networks. In addition, we define "family resemblances" to include genes that are related through intermediate relatives, thereby placing notions of homology in the broader context of evolutionary relationships. 
We conclude by presenting some pay-offs of adopting such a pluralistic account of homology and family relationship, that expands the scope of evolutionary analyses beyond the traditional, yet relatively narrow focus allowed by a strong tree-thinking view on gene evolution.
... Pfam-B was part of the Pfam database until release 26. Actually, Pfam contained two types of domain families until this release: the high quality and manually curated Pfam-A families (classical Pfam domains, which are usually and until here in this article just called Pfam), and Pfam-B families, which were automatically generated by the ADDA algorithm (Heger and Holm, 2003). ...
Preprint
Full-text available
Motivation Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is then to integrate additional contextual information to the procedure. Results Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed an increase of 16% of the number of significant BLAST hits and an increase of 28% of the proteome area that can be covered with a domain. Our method identified 2473 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties show that they are close to that of standard Pfam domains. Availability Software implementing the proposed approach and the Supplementary Data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence
... Pfam-B entries were generated to supplement the Pfam database for the sequences where there are no Pfam-A associations (Finn et al., 2010). Pfam-B was generated automatically using the ADDA database (Heger and Holm, 2003). Table 13 shows the information about the conserved region and reference domain pairings (regarding significant pairwise alignments) using only Pfam-A (the previous analysis) and both Pfam-A and Pfam-B databases. As observed from the table, ...
Thesis
Full-text available
In this study, computational methods are developed for the automatic identification of functional/evolutionary relationships between biomolecular sequences in large and diverse datasets. Different approaches were considered during the development and optimization of the methods. The first approach focused on the expression of gene and protein sequences in high dimensional vector spaces via non-linear embedding. This allowed statistical learning algorithms to be applied on the resulting embeddings in order to cluster and/or classify the sequences. The second approach revised the pairwise similarities between sequences following multiple sequence alignment in order to eliminate the unreliable connections due to remote homology and/or poor alignment. This is achieved by thresholding the pairwise connectivity map over two parameters: the inferred evolutionary distances and the number of gapless positions in each pairwise alignment. The resulting connectivity map was disjoint and consisted of clusters of similar proteins. The third and final approach sought to associate the amino acid sequences with each other over highly conserved/shared sequence segments, as shared sequence segments imply conserved functional or structural attributes. An automated method was developed to identify these segments in large and diverse collections of amino acid sequences, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. The method produces a table of associations between the input sequences and the identified conserved regions that can reveal both new members to the known protein families and entirely new lines. The methods were applied to a dataset composed of 17793 human protein sequences in order to obtain a global functional relation map. On this map, functional and evolutionary properties of human proteins could be found based on their relationships to the ones bearing functional annotations. 
The results revealed that conserved regions corresponded strongly to annotated structural domains. This suggests the method can also be useful in identifying novel domains on protein sequences.
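
The second approach in the abstract above (thresholding a pairwise connectivity map over two parameters and reading off disjoint clusters) can be sketched as connected components of the filtered graph. The thresholds, the edge tuple layout and the example values are assumptions made for illustration; the thesis abstract does not give them.

```python
def threshold_clusters(n, edges, max_distance, min_gapless):
    """Cluster n sequences by keeping only reliable pairwise connections.

    edges : iterable of (i, j, distance, gapless) tuples, where distance is
            an inferred evolutionary distance and gapless is the number of
            gapless positions in the pairwise alignment (layout assumed).
    An edge is kept when distance <= max_distance and gapless >= min_gapless.
    Returns the connected components of the remaining graph.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, distance, gapless in edges:
        if distance <= max_distance and gapless >= min_gapless:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


# Hypothetical connectivity map for five sequences.
edges = [
    (0, 1, 0.3, 120),
    (1, 2, 0.4, 90),
    (2, 3, 1.8, 200),  # dropped: evolutionary distance too large
    (3, 4, 0.2, 15),   # dropped: too few gapless positions
]
clusters = threshold_clusters(5, edges, max_distance=1.0, min_gapless=50)
# clusters == [[0, 1, 2], [3], [4]]
```

Union-find keeps the sketch short; any graph library's connected-components routine would serve equally well.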
... • PICASSO score, implemented in [HH03]: ...
Thesis
To assign structural and functional annotations to the ever increasing amount of sequenced proteins, the main approach relies on sequence-based homology search methods based on significant alignments of query sequences to annotated proteins or protein families. While powerful, existing approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, in this thesis we propose to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. This novel application of Potts models raised further requirements for their construction, and we identified several key points towards building more comparable Potts models, towards an ideal of canonicity. Due to non-local dependencies, the problem of aligning Potts models is NP-hard. Here, we introduced a method based on an Integer Linear Programming formulation of the problem which can be optimally solved in tractable time. Our first results suggest that taking pairwise couplings into account can improve the alignment of remote homologs and could thus improve remote homology detection.
... ADDA: the Automatic Domain Decomposition Algorithm starts in the same manner as the COG database by obtaining alignment scores for all pairwise protein comparisons [64,63]. Then, the topology of the resulting protein alignment network (i.e., where edges are Blast hits and vertices are protein sequences) is analyzed and based on patterns of scoring walking between proteins and neighbors, 8 domains are defined. ...
... Pfam-A entries are derived from the underlying sequence database, which is built from the most recent release of UniProtKB at a given time-point. Each Pfam-A family consists of a curated seed alignment containing a small set of representative members of the family, profile hidden Markov models (profile HMMs) built from the seed alignment, and an automatically generated full alignment, which contains all detectable protein sequences belonging to the family, as defined by profile HMM searches of primary sequence databases. Pfam-B families are un-annotated and of lower quality as they are generated automatically from the non-redundant clusters of the latest ADDA release (Heger and Holm 2003). We only used Pfam-A in this study because of the high quality due to the manually curated families. ...
Article
Cancer is now common worldwide, and its global burden continues to grow substantially every year. Previous research on infections and cancer reported that about 17.8% of cancers worldwide, over 1.9 million cases, are related to viral infections. At least six oncoviruses (cancer-causing viruses) are known so far: hepatitis B virus, hepatitis C virus, Epstein–Barr virus (EBV or HHV-4), human papillomavirus, human T lymphotropic virus type 1, and Kaposi’s sarcoma-associated herpesvirus (KSHV or HHV-8), but their pathogenic mechanisms are far from being completely understood. In this study, assuming that finding human proteins significantly similar to viral oncoproteins leads to a categorization of the cancer-related pathways that are currently not clearly known, we analyzed different types of virus-caused cancers based on their similarity in order to clarify the unknown cancer mechanisms. As a result, we obtained several potential tumor pathways that may be significant and essential in the oncogenic process, which will be helpful for further study of cancer mechanisms and the development of new drug targets.
... After long building its Pfam-B models from the ProDom classification (cf. section 1.4.2.d), Pfam now uses the automatic ADDA classification (Heger and Holm, 2003). Pfam-A families consist of an expert-curated alignment ma- (Sonnhammer et al., 1998; Bateman et al., 2000, 2002, 2004; Finn et al., 2006, 2008, 2010) (Pfam-B HMMs are not counted). ...
Article
Hidden Markov Models (HMMs), for example those from the Pfam database, are popular tools for protein domain annotation. However, they are not well suited for studying highly divergent proteins. This is notably the case with Plasmodium falciparum (the main causal agent of human malaria), where Pfam HMMs identify few distinct domain families and cover less than 50% of its proteins. This thesis aims at providing new methods to enhance domain detection in divergent proteins. The first axis of this work is an approach to domain identification based on domain co-occurrence. Several studies have shown that a majority of domains appear in proteins with a small set of other favourite domains. Our method exploits this tendency to detect domains that escape the classical procedure because of their divergence. Detected domains come with a false discovery rate (FDR) estimate computed with a shuffling procedure. In P. falciparum proteins, this approach allows us to identify, with an FDR below 20%, 585 new domains (including 159 families that were previously unseen in this organism), which account for 16% of the known domains. The second axis of my research involves the development of statistical and evolutionary methods of HMM correction to improve the annotation of divergent organisms. Two kinds of approaches are proposed. On the one hand, the sequences previously identified in the target organism and its close relatives are integrated into the learning alignments. An obvious limitation of this solution is that only new occurrences of previously known families in the taxon can be discovered. On the other hand, we evade this limitation by adjusting HMM parameters by simulating the evolution of the learning sequences. To this end, classical techniques from bioinformatics and statistical learning were used.
Alternative libraries offer a complementary set of predictions totaling 663 new domains (including 504 previously unseen families), corresponding to an improvement of 18% in addition to the previous results.
... In an attempt to approximate the number of modular building blocks from which the proteome is composed, Heger et al. used an automatic algorithm for domain decomposition. 29 These results, validated against the manually curated PFAM and SCOP databases, can be filtered to limit exclusively to eukaryotic modules for which structural data exists and that exhibit mobility (i.e. found in alternate protein family contexts with variable domain architecture), reducing the number to 327 domain families (in rough agreement with estimates obtained by alternate means 30 and on the order of magnitude ... [Figure caption fragment: staining of cells transfected with the RASA1 expression construct is shown in "A" (red), emission from the GFP tag shown in "B" (green) and an overlay of the two images shown in "C."] ...
Article
Antibodies are indispensable tools in biochemical research and play an expanding role as therapeutics. While hybridoma technology is the dominant method for antibody production, phage display is an emerging technology. Here, we developed and employed a high-throughput pipeline that enables selection of antibodies against hundreds of antigens in parallel. Binding selections using a phage-displayed synthetic antigen-binding fragment (Fab) library against 110 human SH3 domains yielded hundreds of Fabs targeting 58 antigens. Affinity assays demonstrated that representative Fabs bind tightly and specifically to their targets. Furthermore, we developed an efficient affinity maturation strategy adaptable to high-throughput, which increased affinity dramatically but did not compromise specificity. Finally, we tested Fabs in common cell biology applications and confirmed recognition of the full-length antigen in immunoprecipitation, immunoblotting and immunofluorescence assays. In summary, we have established a rapid and robust high-throughput methodology that can be applied to generate highly functional and renewable antibodies targeting protein domains on a proteome-wide scale. © 2015 The Protein Society.
... These problems could be solved, or considerably attenuated, given the chance to select correctly folded and active protein domains. Due to their smaller size and conserved structural folding 19 , protein domains can be independently expressed while preserving their individual functions 20 . Thus, screening a library that faithfully represents most or all of the functional domains encoded by a genome (domainome), could provide a simple method to annotate gene products, including those encoding RBPs. ...
Article
We describe here a platform for high-throughput protein expression and interaction analysis aimed at identifying the RNA-interacting domainome. This approach combines the selection of a phage library displaying "filtered" open reading frames with next-generation DNA sequencing. The method was validated using an RNA bait corresponding to the AU-rich element of α-prothymosin, an RNA motif that promotes mRNA stability and translation through its interaction with the RNA-binding protein ELAVL1. With this strategy, we not only confirmed known RNA-binding proteins that specifically interact with the target RNA (such as ELAVL1/HuR and RBM38) but also identified proteins not previously known to be ARE-binding (R3HDM2 and RALY). We propose this technology as a novel approach for studying the RNA-binding proteome.
... In order to extend our search methods to allow the identification of potential soluble domains on a larger scale, we applied our strategy based on the construction of an ORFeome library from the B. pseudomallei genome. By fragmenting a whole (intronless) genome into DNA fragments of 200-800 bp (D'Angelo et al., 2011; Heger & Holm, 2003) it is possible to create a library of fragments coding for potential domains (or parts thereof). DNA fragments encoding well folded protein domains, fused upstream of β-lactamase, allow the reporter enzyme to fold correctly and allow bacteria to survive the selective pressure posed by the antibiotic. ...
Article
The 1.8 Å resolution crystal structure of a conserved domain of the potential Burkholderia pseudomallei antigen and trimeric autotransporter BPSL2063 is presented as a structural vaccinology target for melioidosis vaccine development. Since BPSL2063 (1090 amino acids) hosts only one conserved domain, and the expression/purification of the full-length protein proved to be problematic, a domain-filtering library was generated using β-lactamase as a reporter gene to select further BPSL2063 domains. As a result, two domains (D1 and D2) were identified and produced in soluble form in Escherichia coli. Furthermore, as a general tool, a genomic open reading frame-filtering library from the B. pseudomallei genome was also constructed to facilitate the selection of domain boundaries from the entire ORFeome. Such an approach allowed the selection of three potential protein antigens that were also produced in soluble form. The results imply the further development of ORF-filtering methods as a tool in protein-based research to improve the selection and production of soluble proteins or domains for downstream applications such as X-ray crystallography.
... The Pfam database actually consists of two databases, Pfam-A and Pfam-B. Pfam-A is the manually verified database, whereas Pfam-B is generated fully automatically using the ADDA algorithm [HH03], in a manner similar to the ProDom database. By default, a mention of Pfam families generally refers to Pfam-A only, since Pfam-B families are rarely annotated and of lower quality. ...
Thesis
Identifying the different parts of a biological sequence (a nucleic acid sequence or an amino acid sequence) is a first step toward understanding the biology of the organism it comes from. Given a set of biological sequences from an organism, this thesis addresses the discovery of "domains", i.e. relatively long subsequences (several tens of nucleotides or amino acids) that are found in a large number of sequences. The thesis is organized along two axes, corresponding to domain discovery in protein sequences and in nucleic acid sequences. In each axis, the methods developed are applied to Plasmodium falciparum, the pathogen responsible for malaria in humans, for which classical bioinformatics methods struggle to produce satisfactory annotations. The first axis concerns domain discovery in protein sequences. A common approach to identifying the domains of a protein is to run pairwise sequence comparisons with local alignment tools such as BLAST. However, these approaches sometimes lack sensitivity, particularly for species that are phylogenetically distant from the classical reference organisms. Here we propose an approach to increase the sensitivity of pairwise sequence comparisons. This new approach exploits the fact that protein domains tend to occur with a limited number of other domains on the same protein. In Plasmodium falciparum, this method enables the discovery of 2,240 new domains for which, in most cases, no similar model exists in domain databases. The second axis concerns domain discovery in regulatory sequences (DNA sequences).
Several studies have shown a strong link between the nucleotide composition of particular regions (notably promoter sequences) and gene expression. Here we propose a new approach to discover these regions automatically, which we call regulatory domains. More precisely, our approach is based on a strategy of iterative exploration of nucleotide compositions, from the simplest (dinucleotides) to the most complex (k-mers), together with a supervised segmentation strategy to discover the compositions and regions of interest. Using the domains identified in this way, we show that gene expression in Plasmodium falciparum can be predicted with surprising accuracy. Applied to various other eukaryotic species, this approach gives very different results depending on the species (between 40 and 70% correlation), which suggests a regulatory mechanism that is probably shared by all eukaryotic species but whose importance varies from one species to another.
... Domains are considered as the basic units of protein folding, evolution, and function. 39 Decomposing each protein into modular domains is a basic prerequisite for the accurate functional classification of biological molecules. The function of a protein is determined by its structure, which is mostly embodied in its domain architecture. ...
Article
In the process of host–pathogen interactions, bacterial pathogens employ special genes, e.g., virulence factors (VFs), to interact with the host and cause damage or disease. A number of VFs have been identified in bacterial pathogens that confer upon them the ability to cause various types of damage or disease. However, it has been shown that some of the identified VFs are also encoded in the genomes of nonpathogenic bacteria, and this finding gives rise to considerable controversy about the definition of a virulence factor. Here, 1988 virulence factors of 51 sequenced pathogenic bacterial genomes from the virulence factor database (VFDB) were collected, and an orthologous comparison to a non-pathogenic bacteria protein database was conducted using the reciprocal-best-BLAST-hits approach. Six hundred and twenty pathogen-specific VFs and 1368 common VFs (present in both pathogens and nonpathogens) were identified, which account for 31.19% and 68.81% of the total VFs, respectively. The distribution of pathogen-specific VFs and common VFs in pathogenicity islands (PAIs) was systematically investigated, and pathogen-specific VFs were more likely to be located in PAIs than common VFs. The functions of the two classes of VFs were also analyzed and compared in depth. Our results indicated that most but not all T3SS proteins are pathogen-specific. T3SS effector proteins tended to be distributed in pathogen-specific VFs, whereas T3SS translocation proteins, apparatus proteins, and chaperones were inclined to be distributed in common VFs. We also observed that exotoxins were located in both pathogen-specific and common VFs. In addition, the architecture of the two classes of VFs was compared, and the results indicated that common VFs had a higher domain number and lower domain coverage value, revealing that common VFs tend to be more complex and less compact proteins.
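The reciprocal-best-BLAST-hits approach mentioned in the abstract above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the BLAST results have already been parsed into (query, subject, bit score) tuples, and the function names and toy identifiers are hypothetical rather than taken from the study.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples, e.g. parsed
    from tabular BLAST output (the parsing step is not shown here).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's."""
    fwd = best_hits(forward)  # e.g. virulence factors vs. nonpathogen proteins
    rev = best_hits(reverse)  # the same search in the opposite direction
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


if __name__ == "__main__":
    forward = [("vf1", "np1", 250.0), ("vf1", "np2", 90.0), ("vf2", "np2", 120.0)]
    reverse = [("np1", "vf1", 245.0), ("np2", "vf1", 95.0), ("np2", "vf2", 60.0)]
    print(reciprocal_best_hits(forward, reverse))
    # prints [('vf1', 'np1')]
```

In the toy data, vf2's best hit is np2, but np2's best hit is vf1, so that pair is not reciprocal and vf2 would count as pathogen-specific under this criterion. The reported split is also arithmetically consistent: 620/1988 is about 31.19% and 1368/1988 is about 68.81%.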
... Different methods have different sensitivities on the category of protein targets. As shown in Figure 3, the predictions by the (Heger and Holm, 2003), which are not directly associated with template structures in the PDB library. However, the two template-based methods, ThreaDom and FIEFDom, have an obvious difference between 'Easy' and 'Medium/Hard' proteins because of the different availability of the template hits in the two category of proteins. ...
Article
Protein domains are subunits that can fold and evolve independently. Identification of domain boundary locations is often the first step in protein folding and function annotations. Most of the current methods deduce domain boundaries by sequence-based analysis, which has low accuracy. There is no efficient method for predicting discontinuous domains that consist of segments from separated sequence regions. As template-based methods are most efficient for protein 3D structure modeling, combining multiple threading alignment information should increase the accuracy and reliability of computational domain predictions. We developed a new protein domain predictor, ThreaDom, which deduces domain boundary locations based on multiple threading alignments. The core of the method development is the derivation of a domain conservation score that combines information from template domain structures and terminal and internal alignment gaps. Tested on 630 non-redundant sequences, without using homologous templates, ThreaDom generates correct single- and multi-domain classifications in 81% of cases, where 78% have the domain linker assigned within ±20 residues. In a second test on 486 proteins with discontinuous domains, ThreaDom achieves an average precision of 84% and recall of 65% in domain boundary prediction. Finally, ThreaDom was examined on 56 targets from CASP8 and had a domain overlap rate of 73, 87 and 85% with the target for Free Modeling, Hard multiple-domain and discontinuous domain proteins, respectively, which are significantly higher than most domain predictors in CASP8. Similar results were achieved on the targets from the most recent CASP9 and CASP10 experiments. Availability: http://zhanglab.ccmb.med.umich.edu/ThreaDom/. Contact: zhng@umich.edu. Supplementary data are available at Bioinformatics online.
... We included in the CSG the domain element because decomposing each protein into modular domains is a basic prerequisite for accurate functional classification of biological molecules [33]. The domain is represented as an entity that forms proteins. ...
Article
Background: Understanding the genome, with all of its components and intrinsic relationships, is a great challenge. Conceptual modeling techniques have been used as a means to face this challenge. The heterogeneity and idiosyncrasy of genomic use cases mean that conceptual modeling techniques are used to generate conceptual schemes that focus on overly specific scenarios (i.e., they are species-specific conceptual schemes). Our research group developed two different conceptual schemes. The first one is the Conceptual Schema of the Human Genome, which is intended to improve Precision Medicine and genetic diagnosis. The second one is the Conceptual Schema of the Citrus Genome, which is intended to identify the genetic cause of relevant phenotypes in the agri-food field. Methods: Our two conceptual schemes have been ontologically compared to identify their similarities and differences. Based on this comparison, several changes have been performed in the Conceptual Schema of the Human Genome in order to obtain the first version of a species-independent Conceptual Schema of the Genome. Identifying the different genome information items used in each genomic case study has been essential in achieving our goal. The changes needed to provide an expanded, more generic version of the Conceptual Schema of the Human Genome are analyzed and discussed. Results: This work presents a new CS called the Conceptual Schema of the Genome that is ready to be adapted to any specific working genome-based context (i.e., species-independent). Conclusion: The generated Conceptual Schema of the Genome works as a global, generic element from which conceptual views can be created in order to work with any specific species. This first working version can be used in the human use case, in the citrus use case, and, potentially, in more use cases of other species.
... Only a fraction of random genomic DNA fragments will capture the correct strand and frame of a gene, however, and so we required an additional selection for in-frame fragments. We first generated fragments by transposon-mediated tagmentation 23,24 and selected fragments of roughly 500 base pairs in order to capture whole protein domains, which have a typical size of ~100 amino acids 25 (Fig. 2a and Extended Data Fig. 2a). We then captured these fragments by gap repair into a vector that depended on in-frame translation through the intervening sequence in order to express a downstream selectable marker (Fig. 2b). ...
Preprint
Numerous proteins regulate gene expression by modulating mRNA translation and decay. In order to uncover the full scope of these post-transcriptional regulators, we conducted an unbiased survey that quantifies regulatory activity across the budding yeast proteome and delineates the protein domains responsible for these effects. Our approach couples a tethered function assay with quantitative single-cell fluorescence measurements to analyze ~50,000 protein fragments and determine their effects on a tethered mRNA. We characterize hundreds of strong regulators, which are enriched for canonical and unconventional mRNA-binding proteins. Regulatory activity typically maps outside the RNA-binding domains themselves, highlighting a modular architecture that separates mRNA targeting from post-transcriptional regulation. Activity often aligns with intrinsically disordered regions that can interact with other proteins, even in core mRNA translation and degradation factors. Our results thus reveal networks of interacting proteins that control mRNA fate and illuminate the molecular basis for post-transcriptional gene regulation.
... During screenings of January 21–31, 2011, the domains of the COG2342 family found members of the families GHL3–GHL5, GHL7, GHL9–GHL11, GHL13, and GHL14 after 5, 8, 7, 2, 4, 7, 8, 5, and 2 iterations, respectively (unpublished data). During screenings of July 21, 2010, the domains of the COG3868 (GH114) family found members of the families GHL3– GHL13 after 7, 19, 8, 19, 6, 3, 7, 7, 20, 6, and 7 iterations, respectively [15]. The domains of the families COG2342 and COG3868 found each other during the first iteration. ...
Article
The domains of 15 recently discovered families of the hypothetical glycoside hydrolases GHL1-GHL15 were used for iterative screening of the protein database. The evolutionary relationships between these families were revealed, as well as their relationship with the previously known families of protein domains: GH5, GH13, GH13-33, GH17, GH18, GH20, GH27, GH29, GH31, GH35, GH36A, GH36B, GH36C, GH36D, GH36E, GH36F, GH36G, GH36H, GH36J, GH36K, GH39, GH42, GH53, GH66, GH97, GH101, GH107, GH112, GH114, COG1082, COG1306, COG1649, COG2342, DUF3111, and PF00962. The unclassified homologues were grouped in 35 new families of the hypothetical glycoside hydrolases: GHL16-GHL50. The position of the families GHL1-GHL15 in the hierarchical classification of glycoside hydrolases and their homologues is discussed. Several new superfamilies of protein domains are proposed.
... Pfam-B was part of the Pfam database until release 26. Actually, Pfam contained two types of domain families until this release: the high quality and manually curated Pfam-A families ("classical" Pfam domains, which are usually and until here in this article just called "Pfam"), and Pfam-B families, which were automatically generated by the ADDA algorithm [31] on the basis of all parts of Uniprot sequences not already covered by a Pfam-A occurrence. ProDom is a protein domain family database constructed automatically by clustering homologous segments with the MKDOM2 procedure based on recursive PSI-BLAST searches on the Uniprot database. ...
Article
Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed a 14% increase in the number of significant BLAST hits and a 25% increase in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to standard Pfam domains. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence
... A profile HMM is queried against a sequence database called pfamseq, which is derived from the UniProt Knowledgebase (UniProtKB) [61] Reference Proteomes to find other members in the same domain family. Pfam consists of two types of subsets: high-quality Pfam-A families that are generated by manually checking seed alignments and HMMs, and less reliable Pfam-B families, which are produced automatically by applying the ADDA algorithm [62]. ...
Article
Protein domains are the basic units of proteins that can fold, function, and evolve independently. Knowledge of protein domains is critical for protein classification, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Thus, over the past two decades, a number of protein domain identification approaches have been developed, and a variety of protein domain databases have also been constructed. This review divides protein domain prediction methods into two categories, namely sequence-based and structure-based. These methods are introduced in detail, and their advantages and limitations are compared. Furthermore, this review also provides a comprehensive overview of popular online protein domain sequence and structure databases. Finally, we discuss potential improvements of these prediction methods.
... These methods are more similar to our proposed approach. Examples of these methods are EVEREST [22], ADDA [23], DOMO [24], and pClust [25] and its derivatives. However all of these methods depend on pairwise sequence alignment, either on the entire set of input sequences or on some subsets of the input that are selected using various filtering approaches. ...
Article
Background: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment. Results: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm. Conclusions: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
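To make the idea of min-wise hashing in the abstract above more concrete, here is a generic MinHash sketch in Python. It is not coreClust's implementation: the shingle width `w`, the number of hash functions, the salted BLAKE2b hash family and all function names are assumptions made for illustration, and the (w,c)-sketch described in the abstract may be constructed differently.

```python
import hashlib


def shingles(seq, w):
    """Return the set of all length-w substrings (shingles) of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def _hash(item, seed):
    # One member of a salted hash family; BLAKE2b is an illustrative choice.
    digest = hashlib.blake2b(
        item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, w=3, num_hashes=64):
    """Keep, for each hash function, the minimum hash over all shingles."""
    items = shingles(seq, w)
    if not items:
        raise ValueError("sequence is shorter than the shingle width")
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    a = minhash_sketch("MKTAYIAKQRQISFVKSHFSRQ")
    b = minhash_sketch("MKTAYIAKQRQISFVKSHFSRA")  # differs in the last residue
    print(estimated_jaccard(a, b))
```

Comparing fixed-length sketches instead of aligning sequences is what makes such an approach alignment-free, and because each sequence is sketched independently the work parallelizes naturally.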
... Besides the local structural similarity, discussed in the main text, we looked for global similarity. Using the Dali server (Heger and Holm, 2003), we observed significant structural similarity (Z-score 11.3) between our query protein and Pab0955 from Pyrococcus abyssi (Figure S3). This protein, a member of a recently characterized GTPase family, also functions as a homodimer (Gras et al., 2007). ...
Article
The results, which are summarized in Table S1, suggest that most of the improvements can be attributed to the new geometric representation of the protein in PF2, and that incorporation of the Bayesian version of Rate4Site was secondary. As shown in the table, using the improved PatchFinder the average patch size decreased by 34%, while the number of cases in which PatchFinder found at least half of the SITE residues decreased by only 6%. This suggests that the new patches are more focused, more accurate, and tend to be closer to the proteins' functional sites. Example of Improvements in the New Version of PatchFinder: A simple Euclidean distance, as used in PF1, was not accurate enough for reliable representation of the protein's surface; in some cases where residues were interpreted to be continuous over the surface, visual inspection showed that the contacts between them were mediated only through buried atoms. PDB ID 1han represents the crystal structure of a biphenyl-2,3-diol 1,2-dioxygenase from Burkholderia xenovorans (Han et al., 1995). Three SITE residues related to the Fe ion cofactor-binding site are documented in the PDB file. Analysis of the protein using PF1 revealed a patch of 22 residues, of which only two were SITE residues. Visual inspection of the patch showed it to be discontinuous on the protein surface: residue His195, which is located at the bottom of the cavity of the active site, was defined by PF1 to be a neighbor of Arg193, even though these two residues are on different sides of the protein (Figure S1). Because the two residues were defined as neighbors, PatchFinder erroneously extended the patch to include conserved residues from different sites in the protein. We found that incorporation of the Delaunay triangulation into PF2 reduced that problem in this and other cases. Analysis of the 1han structure using PF2 showed that the new ML patch comprised only ten amino acids including His195, and that three of them are SITE residues.
Moreover, despite the close proximity of His195 and Arg193, the latter residue was (correctly) excluded from the patch.
... Currently, studies about protein domain boundary prediction are divided into two categories: those that use template-based methods [11,12,13,14,15,16] and those that use ab-initio methods [17,18,19,20,21,22]. Template-based methods generally use alignment against a known domain database (such as CATH [11] or SCOP [23]) to predict domain boundaries. For example, the sequence or secondary structures of a target sequence might be aligned against the sequence or the secondary structures in a domain classification database [24,25]. ...
Article
The precise prediction of protein domains, which are the structural, functional and evolutionary units of proteins, has been a research focus in recent years. Although many methods have been presented for predicting protein domains and boundaries, the accuracy of predictions could be improved. In this study we present a novel approach, DomHR, which is an accurate predictor of protein domain boundaries based on a creative hinge region strategy. A hinge region was defined as a segment of amino acids that covers part of a domain region and a boundary region. We developed a strategy to construct profiles of domain-hinge-boundary (DHB) features generated by sequence-domain/hinge/boundary alignment against a database of known domain structures. The DHB features had three elements: normalized domain, hinge, and boundary probabilities. The DHB features were used as input to identify domain boundaries in a sequence. DomHR used a nonredundant dataset as the training set, the DHB and predicted shape string as features, and a conditional random field as the classification algorithm. In predicted hinge regions, a residue was determined to be a domain or a boundary according to a decision threshold. After decision thresholds were optimized, DomHR was evaluated by cross-validation, large-scale prediction, independent test and CASP (Critical Assessment of Techniques for Protein Structure Prediction) tests. All results confirmed that DomHR outperformed other well-established, publicly available domain boundary predictors for prediction accuracy. The DomHR is available at http://cal.tongji.edu.cn/domain/.
... The sequences and annotations are from FlyBase, and the version corresponds to that of the identified modules (Wu et al. 2012). The module identification by Wu et al. was from a protein comparison by BlastP (Altschul et al. 1997), alignment extension by LALIGN (Huang and Miller 1991), module boundary detection by the ADDA algorithm (Heger and Holm 2003), and module family clustering by OrthoMCL (Enright et al. 2002). ...
Article
How have genes evolved within a well-known genome phylogeny? Many protein-coding genes should have evolved as a whole at the gene level, and some should have evolved partly through fragments at the subgene level. To comprehensively explore such complex homologous relationships and better understand gene family evolution, here, with de novo-identified modules, the subgene units which could consecutively cover proteins within a set of closely related species, we applied a new phylogeny-based approach that considers evolutionary models with partial homology to classify all protein-coding genes in nine Drosophila genomes. Compared with two other popular methods for gene family construction, our approach improved practical gene family classifications with a more reasonable view of homology and provided a much more complete landscape of gene family evolution at the gene and subgene levels. In the case study, we found that most expanded gene families might have evolved mainly through module rearrangements rather than gene duplications and mainly generated single-module genes through partial gene duplication, suggesting that there might be pervasive subgene rearrangement in the evolution of protein-coding gene families. The use of a phylogeny-based approach with partial homology to classify and analyze protein-coding gene families may provide us with a more comprehensive landscape depicting how genes evolve within a well-known genome phylogeny.
... The scaling of free energy barriers with fluctuation length sets an upper limit on the domain size in proteins. The size distribution of protein domains found in biology is peaked at around 100 amino acids and near zero by 300 amino acids [35,36]. This corresponds to proteins of maximum size $R_g \sim 2.0$ nm, with relaxation time of ∼1 min. ...
Article
We investigate the universal scaling of protein fluctuation dynamics with a site-specific diffusive model of protein motion, which predicts an initial subdiffusive regime in the configurational relaxation. The long-time dynamics of proteins is controlled by an activated regime. We argue that the hierarchical free energy barriers set the time scales of biological processes and establish an upper limit to the size of single protein domains. We find it compelling that the scaling behavior for the protein dynamics is in close agreement with the Kardar-Parisi-Zhang scaling exponents.
Article
The mean size of the most compact native states of globular proteins, independent of folding type, follows the scaling law of collapsed polymers $R_g \sim n^{1/3}$, relating the radius of gyration $R_g$ to the number of protein residues, $n$. Until now, this behaviour has only been observed within a small subset of unrelated single-domain proteins with $n < 300$. Here, we employ the SCOP database of protein folds to study systematically the scaling behaviour of well-defined families of domains that share structural and functional characteristics. In the particular case of helical proteins, we identify the folding types that can be associated with scaling laws corresponding to compact behaviour (e.g., the cytochrome-C monodomains) and noncompact behaviour (e.g., the immunoglobulin/albumin-binding and spectrin-repeat domains). Our results quantify the size variations within some folding families, as well as reveal that some distinct folds represent structures with equivalent compactness.
Article
Protein domains are autonomous folding units and are fundamental structural and functional units of proteins. Knowledge of protein domain boundaries helps in classifying proteins and in understanding their evolution, structure and function. In this paper, we propose a support vector regression based method to locate residues at protein domain boundaries with a combination of evolutionary information including sequence profiles, predicted secondary structures, predicted relative solvent accessibility, and profiles from HSSP items. Our proposed model achieved an average sensitivity of ~37% and an average specificity of ~77% on domain boundary identification on our dataset of multi-domain proteins and showed better predictive performance than previous domain identification models.
Article
The native states of the most compact globular proteins have been described as being in the so-called “collapsed-polymer regime,” characterized by the scaling law $R_g \sim n^{\nu}$, where $R_g$ is the radius of gyration, $n$ is the number of residues, and $\nu \approx 1/3$. However, the diversity of folds and the plasticity of native states suggest that this law may not be universal. In this work, we study the scaling regimes of: (i) one to four-domain protein chains, and (ii) their constituent domains, in terms of the four major folding classes. In the case of complete chains, we show that size scaling is influenced by the number of domains. For the set of domains belonging to the all-α, all-β, α/β, and α + β folding classes, we find that size-scaling exponents vary between $0.3 \le \nu \le 0.4$. Interestingly, even domains in the same folding class show scaling regimes that are sensitive to domain provenance, i.e., the number of domains present in the original intact chain. We demonstrate that the level of compactness, as measured by monomer density, decreases when domains originate from increasingly complex proteins.
Article
Billions of people and animals are infected with parasitic worms (helminths). Many of these worms cause diseases that have a major socioeconomic impact worldwide, and are challenging to control because existing treatment methods are often inadequate. There is, therefore, a need to work toward developing new intervention methods, built on a sound understanding of parasitic worms at molecular level, the relationships that they have with their animal hosts, and/or the diseases that they cause. Decoding the genomes and transcriptomes of these parasites brings us a step closer to this goal. The key focus of this article is to critically review and discuss bioinformatic tools used for the assembly and annotation of these genomes and transcriptomes, as well as various post-genomic analyses of transcription profiles, biological pathways, synteny, phylogeny, biogeography and the prediction and prioritisation of drug target candidates. Bioinformatic pipelines implemented and established recently provide practical and efficient tools for the assembly and annotation of genomes of parasitic worms, and will be applicable to a wide range of other parasites and eukaryotic organisms. Future research will need to assess the utility of long-read sequence data sets for enhanced genomic assemblies, and develop improved algorithms for gene prediction and post-genomic analyses, to enable comprehensive systems biology explorations of parasitic organisms.
Article
We studied the size scaling behaviour in an ensemble of 8,614 non-redundant protein domains belonging to the all-α, all-β, α/β, and α + β folding classes. We find that the most compact structural domains can be characterized by an effective exponent $\nu_{\mathrm{eff}} = 0.39 \pm 0.01$, which is larger than the value for “collapsed polymers,” i.e., $\nu = 1/3$. We also show that the global $\nu_{\mathrm{eff}}$ exponent is an average of the scaling regimes for short and long compact chains, where the values change from $\nu_{\mathrm{eff}} \approx 0.37$ to $\nu_{\mathrm{eff}} \approx 0.45$ at a chain length of ca. 269. A transition from short-chain to long-chain scaling behaviour is found in all major folding classes, over a window of chain lengths between 216 and 269 residues. In addition, variations in the scaling exponent with respect to folding class indicate that the smallest domains in the all-β and α/β families appear to be more compact structures than the smallest all-α and α + β domains.
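A scaling exponent of this kind is commonly estimated as the slope of a straight-line fit of log R_g against log n. The sketch below illustrates that general idea only; it is not the procedure used in the study, and the function name, the synthetic data, and the prefactor of 2.0 are assumptions chosen for illustration.

```python
import math

def fit_scaling_exponent(lengths, radii):
    """Least-squares slope and prefactor of log(R_g) versus log(n)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(r) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    nu = sxy / sxx                                   # slope = scaling exponent
    prefactor = math.exp(mean_y - nu * mean_x)       # intercept back-transformed
    return nu, prefactor

# Synthetic, noise-free data generated from R_g = 2.0 * n**0.39 (not real measurements).
lengths = [50, 100, 150, 200, 250]
radii = [2.0 * n ** 0.39 for n in lengths]
nu, prefactor = fit_scaling_exponent(lengths, radii)
```

Fitting separate chain-length windows with the same function would be one way to look for a change of regime such as the one described above.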
Article
The assignment of protein domains from three-dimensional structure is critically important in understanding protein evolution and function, yet little quality assurance has been performed. Here, the differences in the assignment of structural domains are evaluated using six common assignment methods. Three human expert methods (AUTHORS (authors' annotation), CATH and SCOP) and three fully automated methods (DALI, DomainParser and PDP) are investigated by analysis of individual methods against the authors' assignment as well as analysis based on the consensus among groups of methods (only expert, only automatic, combined). The results demonstrate that caution is recommended in using current domain assignments, and indicate where additional work is needed. Specifically, the major factors responsible for conflicting domain assignments between methods, both expert and automatic, are: (1) the definition of very small domains; (2) splitting secondary structures between domains; (3) the size and number of discontinuous domains; (4) closely packed or convoluted domain–domain interfaces; (5) structures with large and complex architectures; and (6) the level of significance placed upon structural, functional and evolutionary concepts in considering structural domain definitions. A web-based resource that focuses on the results of benchmarking and the analysis of domain assignments is available at http://pdomains.sdsc.edu
Chapter
Over the years, a multitude of fully automated procedures for protein sequence clustering have been derived. Most methods cluster a sequence space graph that represents similarity relationships detected by all versus all sequence comparison. This chapter gives a historical overview of some of these methods and then goes on to discuss one of them, automatic domain delineation algorithm (ADDA), in detail. The advantages of using ADDA are discussed along with the improvements that need to be made, for example, to distinguish cysteine-rich domain families from otherwise similar cysteine-free protein families. Finally, the challenges that this field still faces, such as the need for more powerful computational resources and better sensitivity in detecting remote homologues, together with new directions for research are reviewed. There is a need for a systematic evolutionary classification of all protein sequences, with quality control being a key issue.
Article
PSI Protein Classifier is a new program that summarizes the results of consecutive and independent PSI-BLAST iterations. The technical capabilities of the program are explained. Two examples of the PSI Protein Classifier application are given. Iterative screening of the protein database revealed a feasible evolutionary relationship among the GH5, GH13, GH27, GH31, GH36, GH66, GH101, and GH114 families of glycoside hydrolases. Family GH31 is divided into 38 subfamilies on the basis of statistically significant sequence similarity analysis (E-value analysis).
Conference Paper
Identifying conserved regions in protein sequences is a fundamental operation that is recurrent in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, compute protein clusters, annotate sequences with function, and compute evolutionary relationships among protein sequences. Current approaches to clustering and annotating protein sequences based on conserved regions depend either on prior knowledge of domains or on computing pairwise sequence similarity, which is not feasible for very large collections of protein sequences. In this paper we present a new clustering method, afClust, that uses the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions. Our method also lends itself to parallelization under the MapReduce paradigm. Our experimental results are promising. For bacterial protein domains from the SMART database, we detected up to 85% of the targeted domain regions. Our parallel implementation processed 700,000 sequences in approximately one minute. We provide scalability experiments for a smaller data set.
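The core idea of using exact short-subsequence (k-mer) matches as a cheap similarity signal can be sketched as follows. This is an illustration of k-mer counting in general, not the afClust algorithm itself; the function names and the choice of k = 3 are assumptions.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_kmers(a, b, k=3):
    """Number of k-mer occurrences common to both sequences (multiset intersection)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())

# Two hypothetical peptides that differ only at their first and last residues.
n_shared = shared_kmers("MKVLAAGI", "QKVLAAGW", k=3)
```

Because each sequence can be reduced to its k-mer counts independently, this step fits naturally into a map phase, with the pairing of sequences that share k-mers handled in a reduce phase.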
Chapter
Advances in DNA sequencing technologies have led to an increasing amount of protein sequence data being generated. Only a small fraction of this protein sequence data will have experimental annotation associated with them. Here, we describe a protocol for in silico homology-based annotation of large protein datasets that makes extensive use of manually curated collections of protein families. We focus on annotations provided by the Pfam database and suggest ways to identify family outliers and family variations. This protocol may be useful to people who are new to protein data analysis, or who are unfamiliar with the current computational tools that are available.
Chapter
The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships. In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.
Article
We develop a hierarchical pipeline, ThreaDomEx, for both continuous domain (CD) and discontinuous domain (DCD) structure predictions. Starting from a query sequence, ThreaDomEx first threads it through the PDB to identify multiple structure templates, where a profile of domain conservation score (DC-score) is derived for domain-segment assignment. To further detect DCDs that consist of separated segments along the sequence, a boundary-clustering algorithm is used to refine the DCD-linker locations. If the templates do not contain DCDs, a domain-segment assembly process, guided by symmetry comparison, is applied for further DCD detection. ThreaDomEx was tested on a set of 1111 proteins and achieved a normalized domain overlap score of 89.3% compared to experimental data, which is significantly higher than other state-of-the-art methods. It also recalls 26.7% of DCDs with 72.7% precision on the proteins for which threading failed to detect any DCDs. The server provides facilities for users to interactively refine the domain models by adjusting the DC-score threshold, deleting and adding domain linkers, and assembling domain segments, which are particularly helpful for the hard targets for which current methods have a low accuracy while human-expert knowledge and experimental insights can be used for refining models. The ThreaDomEx server is available at http://zhanglab.ccmb.med.umich.edu/ThreaDomEx.
Article
Folding reporters are proteins with easily identifiable phenotypes, such as antibiotic resistance, whose folding and function are compromised when fused to poorly folding proteins or random open reading frames. We have developed a strategy where, by using TEM-1 β-lactamase (the enzyme conferring ampicillin resistance) on a genomic scale, we can select collections of correctly folded protein domains from the coding portion of the DNA of any intronless genome. The protein fragments obtained by this approach, the so-called "domainome", will be well expressed and soluble, making them suitable for structural/functional studies. By cloning and displaying the "domainome" directly in a phage display system, we have shown that it is possible to select specific protein domains with the desired binding properties (e.g., to other proteins or to antibodies), thus providing essential experimental information for gene annotation or antigen identification. The identification of the most enriched clones in a selected polyclonal population can be achieved by using novel next-generation sequencing technologies (NGS). For these reasons, we introduce deep sequencing analysis of the library itself and the selection outputs to provide complete information on diversity, abundance and precise mapping of each of the selected fragments. The protocols presented here show the key steps for library construction, characterization, and validation.
Thesis
There are currently over 138,000 known macromolecular structures deposited in the wwPDB (Worldwide Protein Data Bank) database. While all the macromolecular structure files contain information about a particular structure, the collection of these files also allows combining the macromolecular structures to obtain statistical information about macromolecules in general. This fact has been the basis for many structural biology methods including the molecular replacement method used in X-ray crystallography or homologous structure restraints in the refinement methods. With the success of methods based on prior information, it is feasible that novel methods could be developed and current methods improved using further prior information; more specifically, by using the structure density-map shape similarity instead of sequence or model similarity. Therefore, this project introduces a mathematical framework for computing three different measures of macromolecular three-dimensional shape similarity and demonstrates how these descriptors can be applied in symmetry detection and protein-domain clustering. The ability to detect cyclic (C), dihedral (D), tetrahedral (T), octahedral (O) and icosahedral (I) symmetry groups as well as computing all associated symmetry elements has direct applications in map averaging and reducing the storage requirements by storing only the asymmetric information. Moreover, by having the capacity to find structures with similar shape, it was possible to reduce the size of the BALBES protein domain database by more than 18.7% and thus achieve proportional speed-up in the searching parts of its applications. Finally, the development of the method described in this project has many possible applications throughout structural biology. 
The method could, for example, facilitate matching and fitting of protein domains into the density maps produced by the electron-microscopy techniques, or it could allow for molecular-replacement candidate search using shape instead of sequence similarity. To allow for the development of any further applications, software for applying the methods described here is also presented and released for the community.
Article
Sweet sorghum is a C4 crop characterized by fast growth and high yields. It is a good source for food, feed, fiber, and fuel. On saline land, sweet sorghum can not only survive, but increase its sugar content. Therefore, it is regarded as a potential source for identifying salt-related genes. Here, we review the physiological and biochemical responses of sweet sorghum to salt stress, such as photosynthesis, sucrose synthesis, hormonal regulation, and ion homeostasis, as well as their potential salt-resistance mechanisms. The major advantages of salt-tolerant sweet sorghum include: 1) improving the Na+ exclusion ability to maintain ion homeostasis in roots under salt-stress conditions, which ensures a relatively low Na+ concentration in shoots; 2) maintaining a high sugar content in shoots under salt-stress conditions, by protecting the structures of photosystems, enhancing photosynthetic performance and sucrose synthetase activity, as well as inhibiting sucrose degradation. Studying the regulatory mechanisms of such genes will provide opportunities for increasing the salt tolerance of sweet sorghum by breeding and genetic engineering.
Chapter
Understanding the genome, with all of its components and intrinsic relationships, is a great challenge. Conceptual modeling techniques have been used as a means to face this challenge, leading to the generation of conceptual schemes whose intent is to provide a precise ontological characterization of the components involved in biological processes. However, the heterogeneity and idiosyncrasy of genomic use cases mean that, although the genome and its internal processes remain the same among eukaryote species, conceptual modeling techniques are used to generate conceptual schemes that focus on particular scenarios (i.e., they are species-specific conceptual schemes). We claim that instead of having different, species-specific conceptual schemes, it is feasible to provide a holistic conceptual schema valid to work with every eukaryote species by generating conceptual views that are inferred from that global conceptual schema. We report our preliminary work towards the possibility of generating such a conceptual schema by ontologically comparing two existing, species-specific conceptual schemes. Those changes that are necessary to provide an expanded conceptual schema that is suitable for both use cases are identified and discussed.
Article
When multiple substitutions affect a trait in opposing ways, they are often assumed to be compensatory, not only with respect to the trait, but also with respect to fitness. This type of compensatory evolution has been suggested to underlie the evolution of protein structures and interactions, RNA secondary structures, and gene regulatory modules and networks. The possibility for compensatory evolution results from epistasis. Yet if epistasis is widespread, then it is also possible that the opposing substitutions are individually adaptive. I term this possibility an adaptive reversal. Although possible for arbitrary phenotype-fitness mappings, it has not yet been investigated whether such epistasis is prevalent in a biologically realistic setting. I investigate a particular regulatory circuit, the type I coherent feed-forward loop, which is ubiquitous in natural systems and is accurately described by a simple mathematical model. I show that such reversals are common during adaptive evolution, can result solely from the topology of the fitness landscape, and can occur even when adaptation follows a modest environmental change and the network was well adapted to the original environment. The possibility of adaptive reversals warrants a systems perspective when interpreting substitution patterns in gene regulatory networks.
Article
The MOLSCRIPT program produces plots of protein structures using several different kinds of representations. Schematic drawings, simple wire models, ball-and-stick models, CPK models and text labels can be mixed freely. The schematic drawings are shaded to improve the illusion of three dimensionality. A number of parameters affecting various aspects of the objects drawn can be changed by the user. The output from the program is in PostScript format.
Article
The ProDom database contains protein domain families generated from the SWISS-PROT database by automated sequence comparisons. The current version was built with a new improved procedure based on recursive PSI-BLAST homology searches. ProDom can be searched on the World Wide Web to study domain arrangements within either known families or new proteins, with the help of a user-friendly graphical interface (http://www.toulouse.inra.fr/prodom.html). Recent improvements to the ProDom server include: ProDom queries under the SRS Sequence Retrieval System; links to the PredictProtein server; phylogenetic trees and condensed multiple alignments for a better representation of large domain families, with zooming in and out capabilities. In addition, a similar server was set up to display the outcome of whole genome domain analysis as applied to 17 completed microbial genomes (http://www.toulouse.inra.fr/prodomCG.html).
Article
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches. Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi Contact: liwz@sdsc.edu or adam@burnham-inst.org
Article
To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation. These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy. A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/holm/nrdb90. Contact: holm@embl-ebi.ac.uk
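The combination of a cheap peptide-composition filter with a more expensive identity check can be sketched as a greedy representative selection. This is a minimal illustration under stated assumptions, not the nrdb90.pl implementation: the 0.5 shared-pentapeptide cutoff, the ungapped identity measure standing in for a real alignment, and all function names are hypothetical.

```python
def kmer_set(seq, k=5):
    """Set of distinct k-mers (pentapeptides by default) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def passes_filter(a, b, k=5, min_shared=0.5):
    """Cheap pre-check: do a and b share enough k-mers to justify a comparison?"""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    smaller = min(len(ka), len(kb))
    if smaller == 0:
        return False
    return len(ka & kb) / smaller >= min_shared   # cutoff is an assumption

def ungapped_identity(a, b):
    """Placeholder for alignment identity: matches over the shorter length, no gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def representatives(seqs, threshold=0.9):
    """Keep a sequence only if no longer representative exceeds the identity threshold."""
    reps = []
    for s in sorted(seqs, key=len, reverse=True):
        redundant = any(
            passes_filter(s, r) and ungapped_identity(s, r) > threshold for r in reps
        )
        if not redundant:
            reps.append(s)
    return reps

# Hypothetical input: the second sequence is a fragment of the first.
reps = representatives(["MKVLAAGIVGLSTEQ", "MKVLAAGIVGLSTE", "QWERTYPSDFGHKLC"])
```

The filter runs before the identity check, so unrelated pairs are rejected without any position-by-position comparison.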
Article
Motivation: Sensitive detection and masking of low-complexity regions in protein sequences. Filtered sequences can be used in sequence comparison without the risk of matching compositionally biased regions. The main advantage of the method over similar approaches is the selective masking of single residue types without affecting other, possibly important, regions. Results: A novel algorithm for low-complexity region detection and selective masking. The algorithm is based on multiple-pass Smith-Waterman comparison of the query sequence against twenty homopolymers with infinite gap penalties. The output of the algorithm is both the masked query sequence for further analysis, e.g. database searches, as well as the regions of low complexity. The detection of low-complexity regions is highly specific for single residue types. It is shown that this approach is sufficient for masking database query sequences without generating false positives. The algorithm is benchmarked against widely available algorithms using the 210 genes of Plasmodium falciparum chromosome 2, a dataset known to contain a large number of low-complexity regions. Availability: CAST (version 1.0) executable binaries are available to academic users free of charge under license. Web site entry point, server and additional material: http://www.ebi.ac.uk/research/cgg/services/cast/
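With gaps forbidden, a local alignment of a query against a homopolymer reduces to finding the highest-scoring contiguous segment of the query when each position is scored against that one residue type. The sketch below shows a single pass of that idea for one residue; it is not the CAST implementation. The +1/-1 scoring (in place of a substitution matrix), the threshold of 4, and the function names are assumptions.

```python
def best_homopolymer_segment(query, residue, match=1, mismatch=-1):
    """Best ungapped local segment of `query` scored against a run of `residue`."""
    best_score, best_span = 0, None
    score, start = 0, 0
    for i, aa in enumerate(query):
        if score <= 0:                 # a non-positive prefix never helps: restart here
            score, start = 0, i
        score += match if aa == residue else mismatch
        if score > best_score:
            best_score, best_span = score, (start, i + 1)
    return best_score, best_span

def mask_residue(query, residue, threshold=4):
    """Replace only `residue` with X inside the best segment, if it scores high enough."""
    score, span = best_homopolymer_segment(query, residue)
    if span is None or score < threshold:
        return query
    s, e = span
    masked = "".join("X" if aa == residue else aa for aa in query[s:e])
    return query[:s] + masked + query[e:]

# Hypothetical query with a glutamine-rich stretch interrupted by one alanine.
masked = mask_residue("MKQQQQAQQQLV", "Q")
```

Only the biased residue type is replaced, so the interrupting alanine survives, which mirrors the selective masking described above. Repeating the pass for each of the twenty residue types would approximate the multiple-pass scheme.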
Article
Profile analysis is a method for detecting distantly related proteins by sequence comparison. The basis for comparison is not only the customary Dayhoff mutational-distance matrix but also the results of structural studies and information implicit in the alignments of the sequences of families of similar proteins. This information is expressed in a position-specific scoring table (profile), which is created from a group of sequences previously aligned by structural or sequence similarity. The similarity of any other sequence (target) to the group of aligned sequences (probe) can be tested by comparing the target to the profile using dynamic programming algorithms. The profile method differs in two major respects from methods of sequence comparison in common use: (i) Any number of known sequences can be used to construct the profile, allowing more information to be used in the testing of the target than is possible with pairwise alignment methods. (ii) The profile includes the penalties for insertion or deletion at each position, which allow one to include the probe secondary structure in the testing scheme. Tests with globin and immunoglobulin sequences show that profile analysis can distinguish all members of these families from all other sequences in a database containing 3800 protein sequences.
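A position-specific scoring table can be illustrated with a much simpler construction than the one described above. The sketch below builds log-odds scores from column frequencies with a pseudocount and a uniform background, and scores a target without gaps. It deliberately omits the Dayhoff-matrix weighting, the position-specific gap penalties, and the dynamic programming of the original method; all names and parameter values are assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """One dict of log-odds scores per alignment column (gap-free alignment assumed)."""
    background = 1.0 / len(ALPHABET)              # uniform background is an assumption
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile

def best_ungapped_score(profile, target):
    """Highest score of the profile slid along the target with no insertions or deletions."""
    width = len(profile)
    scores = [
        sum(profile[j][target[i + j]] for j in range(width))
        for i in range(len(target) - width + 1)
    ]
    return max(scores) if scores else None

# Hypothetical three-sequence probe alignment and two targets.
profile = build_profile(["ACD", "ACD", "ACE"])
hit = best_ungapped_score(profile, "GGACDGG")
miss = best_ungapped_score(profile, "GGGGGGG")
```

Any number of probe sequences can contribute to the table, which is the first of the two advantages over pairwise comparison noted in the abstract.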
Article
We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
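For orientation, the posterior-mean estimate usually associated with Dirichlet mixture priors can be written as below. The notation is introduced here and is not taken from this listing: $\vec{n}$ is the vector of observed amino acid counts in a column, $\vec{\alpha}_j$ is the parameter vector of mixture component $j$, $q_j$ is its mixture coefficient, and $|\cdot|$ denotes the sum of a vector's entries.

```latex
\hat{p}_i \;=\; \sum_{j=1}^{l} P(\vec{\alpha}_j \mid \vec{n})\,
\frac{n_i + \alpha_{j,i}}{|\vec{n}| + |\vec{\alpha}_j|},
\qquad
P(\vec{\alpha}_j \mid \vec{n}) \;=\;
\frac{q_j\, P(\vec{n} \mid \vec{\alpha}_j, |\vec{n}|)}
     {\sum_{k=1}^{l} q_k\, P(\vec{n} \mid \vec{\alpha}_k, |\vec{n}|)}
```

Each component acts as a set of pseudocounts added to the observed counts, and the components are weighted by how well they explain the column.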
Article
Full-text available
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
Article
Full-text available
Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
Article
Full-text available
This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequently used in molecular bioinformatics. A formulation is given of the problem of the automatic discovery of such patterns from a set of sequences, and an analysis is presented of the ways in which an assessment can be made of the significance of the discovered patterns. It is shown that the problem is related to problems studied in the field of machine learning. The major part of this paper comprises a review of a number of existing methods developed to solve the problem and how these relate to each other, focusing on the algorithms underlying the approaches. A comparison is given of the algorithms, and examples are given of patterns that have been discovered using the different methods.
Article
Full-text available
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recent developments of the database include format and content enhancements, cross-references to additional databases, new documentation files and improvements to TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDSs) in the EMBL Nucleotide Sequence Database, except the CDSs already included in SWISS-PROT. We also describe the Human Proteomics Initiative (HPI), a major project to annotate all known human sequences according to the quality standards of SWISS-PROT. SWISS-PROT is available at: http://www.expasy.ch/sprot/ and http://www.ebi.ac.uk/swissprot/
Article
Full-text available
Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences with many identical and nearly identical sequences, and (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to its size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size, which resulted in six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching. All the RSDB files generated and the full analysis results are available through the internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB
Article
Full-text available
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.
Article
Full-text available
Motivation: Evolutionary classification leads to an economical description of protein sequence data because attributes of function and structure are inherited in protein families. This paper presents Picasso, a procedure for deriving a minimal set of protein family profiles that cover all known protein sequences. Results: Picasso starts from highly overlapping sequence neighbourhoods revealed by all-on-all pairwise Blast alignment. Overlaps are reduced by merging sequences or parts of sequences into multiple alignments. For maximum unification, the multiple alignments must reach into the twilight zone of sequence similarity. Sensitive and selective profile-profile comparison allows unification down to about 15% pairwise sequence identity. Families unified through a short conserved sequence motif are associated with multiple full-length alignments describing different subfamilies. Domains that are mobile modules are identified based on their association with different sets of neighbours. The result is 10000 unified domain families (excluding singletons) representing functionally related proteins and recovering classical prolific domain types in high numbers. The classification is useful, for example, in developing strategies for efficient database searching and for selecting targets to complete the map of all 3-D structures.
Article
Full-text available
Structural genomics has the goal of obtaining useful, three-dimensional models of all proteins by a combination of experimental structure determination and comparative model building. We evaluate different strategies for optimizing information return on effort. The strategy that maximizes structural coverage requires about seven times fewer structure determinations compared with the strategy in which targets are selected at random. With a choice of reasonable model quality and the goal of 90% coverage, we extrapolate the estimate of the total effort of structural genomics. It would take approximately 16,000 carefully selected structure determinations to construct useful atomic models for the vast majority of all proteins. In practice, unless there is global coordination of target selection, the total effort will likely increase by a factor of three. The task can be accomplished within a decade provided that selection of targets is highly coordinated and significant funding is available.
Article
Full-text available
Several technical, social, and biological networks were recently found to demonstrate scale-free and small-world behavior instead of random graph characteristics. In this work, the topology of protein domain networks generated with data from the ProDom, Pfam, and Prosite domain databases was studied. It was found that these networks exhibited small-world and scale-free topologies with a high degree of local clustering accompanied by a few long-distance connections. Moreover, these observations apply not only to the complete databases, but also to the domain distributions in proteomes of different organisms. The extent of connectivity among domains reflects the evolutionary complexity of the organisms considered.
Article
Full-text available
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.
Article
Full-text available
The Protein Data Bank (PDB; http://www.pdb.org/) is the single worldwide archive of structural data of biological macromolecules. This paper describes the progress that has been made in validating all data in the PDB archive and in releasing a uniform archive for the community. We have now produced a collection of mmCIF data files for the PDB archive (ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/). A utility application that converts the mmCIF data files to the PDB format (called CIFTr) has also been released to provide support for existing software.
Article
Full-text available
SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users’ documents. A SMART mirror has been created at http://smart.ox.ac.uk.
Article
Full-text available
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.
Article
Full-text available
The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases).
Article
Full-text available
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
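The Markov cluster (MCL) algorithm that this abstract relies on alternates two operations on a column-stochastic matrix: expansion (matrix squaring) and inflation (element-wise powering followed by column renormalization). The following minimal sketch, in pure Python, assumes an unweighted adjacency matrix, a fixed iteration count and an inflation value of 2; TRIBE-MCL itself starts from precomputed sequence similarity information, which is not modelled here. The function names and the union-find read-out of clusters are choices made for the example.

```python
def normalize_columns(m):
    """Scale each column of the square matrix m in place so it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def mcl_clusters(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster a graph given as a square 0/1 (or weighted) adjacency matrix."""
    n = len(adjacency)
    # Add a self-loop weight to every node, then make the matrix column-stochastic.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one step of random-walk spreading (m = m * m).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        m = [[m[i][j] ** inflation for j in range(n)] for i in range(n)]
        normalize_columns(m)

    # Read out clusters: join node j with every row i that still carries flow.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())
```

Higher inflation values generally give finer clusters. A production implementation would use sparse matrices, pruning and a convergence test instead of a fixed iteration count.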
Article
Full-text available
Protein sequence alignments are more reliable the shorter the evolutionary distance. Here, we align distantly related proteins using many closely spaced intermediate sequences as stepping stones. Such transitive alignments can be generated between any two proteins in a connected set, whether they are direct or indirect sequence neighbors in the underlying library of pairwise alignments. We have implemented a greedy algorithm, MaxFlow, using a novel consistency score to estimate the relative likelihood of alternative paths of transitive alignment. In contrast to traditional profile models of amino acid preferences, MaxFlow models the probability that two positions are structurally equivalent and retains high information content across large distances in sequence space. Thus, MaxFlow is able to identify sparse and narrow active-site sequence signatures which are embedded in high-entropy sequence segments in the structure based multiple alignment of large diverse enzyme superfamilies. In a challenging benchmark based on the urease superfamily, MaxFlow yields better reliability and double coverage compared to available sequence alignment software. This promises to increase information returns from functional and structural genomics, where reliable sequence alignment is a bottleneck to transferring the functional or structural characterization of model proteins to entire protein superfamilies.
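The idea of a transitive alignment through an intermediate sequence can be sketched as the composition of two residue-to-residue maps. The example below is only an illustration under that assumption: it represents a pairwise alignment as a dictionary of aligned positions and chains A to B with B to C. It does not implement MaxFlow's consistency score or its greedy choice between alternative paths, and the function name is hypothetical.

```python
def compose_alignments(a_to_b, b_to_c):
    """Compose two pairwise alignments given as {position: position} dicts.

    A position of A is aligned to C only if its partner in B is itself
    aligned to some position in C; everything else is dropped.
    """
    return {
        a_pos: b_to_c[b_pos]
        for a_pos, b_pos in a_to_b.items()
        if b_pos in b_to_c
    }


# Toy alignments: positions are 0-based residue indices.
a_to_b = {0: 1, 1: 2, 2: 3}
b_to_c = {1: 0, 2: 1, 4: 5}
a_to_c = compose_alignments(a_to_b, b_to_c)
```

Each composition step can only lose aligned positions, which is one reason why choosing between alternative paths of intermediates matters.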
Article
Many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats, and (iii) distant repeats are validated by iterative profile alignment. The method identifies short composition biased as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature but most were novel. This illustrates how in times when curated databases grapple with ever increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information. Proteins 2000;41:224–237. © 2000 Wiley-Liss, Inc.
Article
We believe that punctuational change dominates the history of life: evolution is concentrated in very rapid events of speciation (geologically instantaneous, even if tolerably continuous in ecological time). Most species, during their geological history, either do not change in any appreciable way, or else they fluctuate mildly in morphology, with no apparent direction. Phyletic gradualism is very rare and too slow, in any case, to produce the major events of evolution. Evolutionary trends are not the product of slow, directional transformation within lineages; they represent the differential success of certain species within a clade—speciation may be random with respect to the direction of a trend (Wright's rule). As an a priori bias, phyletic gradualism has precluded any fair assessment of evolutionary tempos and modes. It could not be refuted by empirical catalogues constructed in its light because it excluded contrary information as the artificial result of an imperfect fossil record. With the model of punctuated equilibria, an unbiased distribution of evolutionary tempos can be established by treating stasis as data and by recording the pattern of change for all species in an assemblage. This distribution of tempos can lead to strong inferences about modes. If, as we predict, the punctuational tempo is prevalent, then speciation—not phyletic evolution—must be the dominant mode of evolution. We argue that virtually none of the examples brought forward to refute our model can stand as support for phyletic gradualism; many are so weak and ambiguous that they only reflect the persistent bias for gradualism still deeply embedded in paleontological thought. Of the few stronger cases, we concentrate on Gingerich's data for Hyopsodus and argue that it provides an excellent example of species selection under our model. We then review the data of several studies that have supported our model since we published it five years ago. 
The record of human evolution seems to provide a particularly good example: no gradualism has been detected within any hominid taxon, and many are long-ranging; the trend to larger brains arises from differential success of essentially static taxa. The data of molecular genetics support our assumption that large genetic changes often accompany the process of speciation. Phyletic gradualism was an a priori assertion from the start—it was never “seen” in the rocks; it expressed the cultural and political biases of 19th century liberalism. Huxley advised Darwin to eschew it as an “unnecessary difficulty.” We think that it has now become an empirical fallacy. A punctuational view of change may have wide validity at all levels of evolutionary processes. At the very least, it deserves consideration as an alternate way of interpreting the history of life.
Article
The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.
Article
The MOLSCRIPT program produces plots of protein structures using several different kinds of representations. Schematic drawings, simple wire models, ball-and-stick models, CPK models and text labels can be mixed freely. The schematic drawings are shaded to improve the illusion of three dimensionality. A number of parameters affecting various aspects of the objects drawn can be changed by the user. The output from the program is in PostScript format.
Article
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http://scop.mrc-lmb.cam.ac.uk/scop/. scop: an old English poet or minstrel (Oxford English Dictionary); скоп: pile, accumulation (Russian Dictionary).
Article
The crystal structure of the sugar-binding and dimerization domain of the Escherichia coli gene regulatory protein, AraC, has been determined in complex with the competitive inhibitor D-fucose at pH 5.5 to a resolution of 1.6 Å. An in-depth analysis shows that the structural basis for AraC carbohydrate specificity arises from the precise arrangement of hydrogen bond-forming protein side-chains around the bound sugar molecule. Van der Waals interactions also contribute to the epimeric and anomeric selectivity of the protein. The methyl group of D-fucose is accommodated by small side-chain movements in the sugar-binding site that result in a slight distortion in the positioning of the amino-terminal arm. A comparison of this structure with the 1.5 Å structure of AraC complexed with L-arabinose at neutral pH surprisingly revealed very small structural changes between the two complexes. Based on solution data, we suspect that the low pH used to crystallize the fucose complex affected the structure, and speculate about the nature of the changes between pH 5.5 and neutral pH and their implications for gene regulation by AraC. A comparison with the structurally unrelated E. coli periplasmic sugar-binding proteins reveals that conserved features of carbohydrate recognition are present, despite a complete lack of structural similarity between the two classes of proteins, suggesting convergent evolution of carbohydrate binding.
Article
Decomposing each protein into modular domains is a basic prerequisite to classify accurately structural units in biological molecules. Boundaries between domains are indicated by two similar amino acid sequence segments located within the same protein (repeats) or within homologous proteins at notably different distances from their respective N- or C-termini. We have developed an automated method that combines such positional constraints derived from various detected pairwise sequence similarities to delineate the modular organization of proteins. The procedure has been applied to a non-redundant data set of 26 990 proteins whose sequences were taken from the PIR and SWISS-PROT databanks and shared <60% sequence identity amongst pairs. The resultant clustering, delineation and multiple alignment of 24 380 sequence fragments yielded a new database of 4364 domain families. Comparison of the domain collection with that of PRODOM indicates a clear improvement in the number and size of domain families, domain boundaries and multiple sequence alignments. The accuracy and sensitivity of the method are illustrated by results obtained for ankyrin-like repeats and EGF-like modules. The resulting database, called DOMO, is available through the database search routine SRS at Infobiogen (http://www.infobiogen.fr/srs5/), EBI (http://srs.ebi.ac.uk:5000/) and EMBL (http://www.embl-heidelberg.de/srs5/) World Wide Web sites.
Article
... Stephen Jay Gould and Niles Eldredge Abstract.-We believe that punctuational change dominates the history of life: evolution is concentrated in very rapid events of speciation (geologically instantaneous, even if tolerably continuous in ecological time). ... Stephen Jay Gould . ...
Article
Distinct structural regions have been found in several globular proteins composed of single polypeptide chains. The existence of such regions and the continuity of peptide chain within them, coupled with kinetic arguments, suggests that the early stages of three-dimensional structure formation (nucleation) occur independently in separate parts of these molecules. A nucleus can grow rapidly by adding peptide chain segments that are close to the nucleus in aminoacid sequence. Such a process would generate three-dimensional (native) protein structures that contain separate regions of continuous peptide chain. Possible means of testing this hypothesis are discussed.
Article
Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall, 1995. [WF74] R.A. Wagner and M.J. Fischer. The String to String Correction Problem. Journal of the ACM, 21(1):168–173, 1974. [WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 10(35):83–91, 1992. [KOS+00] S. Kurtz, E. Ohlebusch, J. Stoye, C. Schleiermacher, and R. Giegerich. Computation and Visualization of Degenerate Repeats in Complete Genomes. In ...
Article
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http://scop.mrc-lmb.cam.ac.uk/scop/.
Article
With a rapidly growing pool of known tertiary structures, the importance of protein structure comparison parallels that of sequence alignment. We have developed a novel algorithm (DALI) for optimal pairwise alignment of protein structures. The three-dimensional co-ordinates of each protein are used to calculate residue-residue (Cα-Cα) distance matrices. The distance matrices are first decomposed into elementary contact patterns, e.g. hexapeptide-hexapeptide submatrices. Then, similar contact patterns in the two matrices are paired and combined into larger consistent sets of pairs. A Monte Carlo procedure is used to optimize a similarity score defined in terms of equivalent intramolecular distances. Several alignments are optimized in parallel, leading to simultaneous detection of the best, second-best and so on solutions. The method allows sequence gaps of any length, reversal of chain direction and free topological connectivity of aligned segments. Sequential connectivity can be imposed as an option. The method is fully automatic and identifies structural resemblances and common structural cores accurately and sensitively, even in the presence of geometrical distortions. An all-against-all alignment of over 200 representative protein structures results in an objective classification of known three-dimensional folds in agreement with visual classifications. Unexpected topological similarities of biological interest have been detected, e.g. between the bacterial toxin colicin A and globins, and between the eukaryotic POU-specific DNA-binding domain and the bacterial lambda repressor.
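The first two steps named in this abstract, computing a residue-residue distance matrix and cutting it into fixed-size submatrices, can be sketched as follows. This is a minimal illustration assuming Python 3.8+ (for `math.dist`) and one Cα coordinate per residue as an (x, y, z) tuple; the function names are hypothetical, and the pairing of contact patterns and the Monte Carlo optimization are not shown.

```python
import math


def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise distances between C-alpha atoms."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]


def contact_pattern(dist, i, j, size=6):
    """Extract the size x size submatrix relating segment i.. to segment j..

    With size=6 this corresponds to a hexapeptide-hexapeptide submatrix;
    the caller must keep both segments within the chain length.
    """
    return [row[j:j + size] for row in dist[i:i + size]]


# Toy coordinates chosen so that the distances are whole numbers.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
d = distance_matrix(coords)
sub = contact_pattern(d, 0, 1, size=2)   # rows 0-1 against columns 1-2
```

Because the matrices hold only intramolecular distances, they are unchanged by rotating or translating the structure, which is why two proteins can be compared without first superimposing them.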