A galaxy of folds
Vikram Alva, Michael Remmert, Andreas Biegert, Andrei N. Lupas, and Johannes Söding
Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Tübingen 72076, Germany
Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-University Munich, Munich 81377, Germany
Received 28 September 2009; Accepted 4 November 2009
DOI: 10.1002/pro.297
Published online 20 November 2009
Abstract: Many protein classification systems capture homologous relationships by grouping
domains into families and superfamilies on the basis of sequence similarity. Superfamilies with
similar 3D structures are further grouped into folds. In the absence of discernable sequence
similarity, these structural similarities were long thought to have originated independently, by
convergent evolution. However, the growth of databases and advances in sequence comparison
methods have led to the discovery of many distant evolutionary relationships that transcend the
boundaries of superfamilies and folds. To investigate the contributions of convergent versus
divergent evolution in the origin of protein folds, we clustered representative domains of known
structure by their sequence similarity, treating them as point masses in a virtual 2D space which
attract or repel each other depending on their pairwise sequence similarities. As expected, families
in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked
with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some
links connect superfamilies with different folds. They arise from modular peptide fragments of
between 20 and 40 residues that co-occur in the connected folds in disparate structural contexts.
These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in
the RNA world and from which the first folded proteins arose by amplification and recombination.
Our galaxy of folds summarizes, in a single image, most known and many yet undescribed
homologous relationships between protein superfamilies, providing new insights into the evolution
of protein domains.
Keywords: protein evolution; fold space; fold map; remote homology; ancient peptide modules
Protein sequence space is essentially infinite. Even just considering the median protein chain length of about 300 residues, the number of possible sequences is 20³⁰⁰ (about 10³⁹⁰), which vastly exceeds the estimated number of particles in the known universe (about 10⁸⁰). Life could not have explored more than a minuscule proportion of this astronomical space. Indeed, the total complement of the world’s proteome is probably only about a trillion (10¹²) proteins, counted as the number of species multiplied by the number of protein-coding genes in each. As insignificant as this number may seem by comparison to the available sequence space, it is still a substantial overestimate of the actual protein diversity found in nature. In fact, most proteins resemble other proteins in sequence and structure because they are built by amplification, recombination, and divergence from a basic set of autonomously folding units, termed domains. Around 10⁴ domain families have been recognized by sequence comparison, and this number is unlikely to grow very much. These families reflect the descent of modern proteins from a limited number of ancestral forms, most of which were already established at the time of the last common ancestor, 3.5 billion years ago.
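As a quick check of the scale involved, the size of sequence space for a chain of about 300 residues can be computed directly. The following Python sketch is an illustration only: the variable names are ours, and the only inputs are the 20 standard amino acids and the chain length of about 300 residues quoted above.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids
CHAIN_LENGTH = 300   # median protein chain length quoted in the text

# log10(20**300) = 300 * log10(20), the order of magnitude of sequence space.
log10_sequences = CHAIN_LENGTH * math.log10(ALPHABET_SIZE)
print(f"20^{CHAIN_LENGTH} is about 10^{log10_sequences:.0f}")

# Python integers have arbitrary precision, so the digit count can be checked exactly.
digits = len(str(ALPHABET_SIZE ** CHAIN_LENGTH))
print(f"20^{CHAIN_LENGTH} has {digits} decimal digits")
```

Even a proteome of a trillion distinct proteins would therefore sample a vanishingly small fraction of this space.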
Diversity is even more restricted at the structural level. Only 10³ folds are populated in nature, and these frequently show recurrent local arrangements of secondary structures (supersecondary structures), such that the diversity at the subdomain level is even further reduced. However, although the sequence similarity of domains reflects homologous descent, structural similarity may often be analogous because only a limited number of folded conformations are available to the polypeptide chain, owing to biophysical constraints. Indeed, domain families unrelated in sequence may show considerable structural similarity. This duality between homologous and analogous contributions to the properties of modern proteins is captured by structural classification systems, such as SCOP (Structural Classification of Proteins) and CATH (Class-Architecture-Topology-Homologous superfamily), by combining homologous criteria at lower hierarchical levels with analogous criteria at upper levels. This mode of classification differs from classification by natural descent, as used for organisms, because life is monophyletic, being descended from a common ancestor, whereas proteins are polyphyletic, having evolved from a set of distinct ancestral forms. Therefore, in the absence of detectable sequence similarity, the resemblance between two proteins is reasonably assumed to be analogous.
*Correspondence to: Johannes Söding, Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-University Munich, Feodor-Lynen-Strasse 25, Munich 81377, Germany. E-mail: or Andrei N. Lupas, Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstrasse 35, Tübingen 72076, Germany. E-mail:
PROTEIN SCIENCE 2010 VOL 19:124–130. Published by Wiley-Blackwell. © 2009 The Protein Society
How extensive is this polyphyly of proteins? Did each of the 10⁴ or so protein families arise independently, the structural similarities between them being convergent? It would not appear to be so. In recent years, the dramatic expansion of molecular databases and the development of a new generation of highly sensitive sequence comparison methods have revealed a growing number of distant evolutionary relationships, which transcend the previous boundaries between homology and analogy. For instance, most families of the TIM (βα)₈-barrel fold are now thought to have arisen from a common ancestor. Occasionally, even the boundaries between folds have been broken either due to the discovery of homologous fold change or due to the detection of conserved supersecondary structures, which may represent remnants of an ancient peptide-RNA world. These findings suggest that proteins might not be as polyphyletic as hitherto assumed.
To evaluate the extent to which such distant relationships transcend current structural classification, we clustered a representative set of protein domains, encompassing all known folds, on the basis of sequence comparisons alone. The resulting map shows that many protein families from different superfamilies or even folds may have a homologous origin.
Results and Discussion
Hitherto, studies have mapped fold space by structural criteria, mainly with a focus on principles for automatically classifying proteins. Some studies also considered structure-based function inference and the fold usage of organisms. All these studies used structural similarity to connect different folds, a property that primarily reflects analogy. Because of these convergent local similarities, structural maps show proteins in a continuum, obscuring discrete evolutionary relationships. Indeed, recent results suggest that events such as circular permutations, strand invasions, or 3D domain swaps may have substantially altered the folds of homologous proteins, often leading to the variant form resembling an unrelated fold. In such cases, structural convergence can give the impression of continuity in an evolutionarily discontinuous landscape. We therefore decided to revisit the mapping of protein fold space using only homologous criteria, that is, sequence similarity.
To gather domains representative of known fold types, we chose the Structural Classification of Proteins (SCOP) database. SCOP classifies proteins hierarchically by grouping related domains into families, related families into superfamilies, structurally similar superfamilies into folds, and folds into secondary structure classes. Thus, the first two levels of the classification capture homologous relationships, whereas the last two capture analogous ones. For the purpose of this study, we filtered SCOP to a maximum of 20% sequence identity. At this level, all superfamilies and nearly all families are still represented, but most relationships considered homologous by SCOP have been removed. We made pairwise comparisons of profile hidden Markov models (HMMs) for these domains and clustered them by a force-directed procedure, using the statistical significance of the pairwise comparisons to assign attractive and repulsive forces to each profile pair in a two-dimensional map (see Materials and Methods). In the resulting cluster map, domains are represented by colored dots, while the brightness of the connecting lines indicates the degree of sequence similarity. The dots in the map were colored based on their SCOP classification to produce class (Fig. 1), fold (Fig. 2), and superfamily (Fig. 3) maps, respectively. Interactive versions of these maps can be navigated at http://
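The SCOP hierarchy described above is encoded in dotted identifiers of the form class.fold.superfamily.family (for example c.1.25). The Python sketch below shows one way to read off the deepest level two such identifiers share, and whether SCOP would treat that shared level as homologous (family, superfamily) or analogous (fold, class). The function names are hypothetical and are not part of any SCOP software.

```python
LEVELS = ("class", "fold", "superfamily", "family")

def deepest_shared_level(sccs_a, sccs_b):
    """Return the most specific SCOP level that two identifiers share, or None."""
    shared = None
    for level, part_a, part_b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if part_a != part_b:
            break
        shared = level
    return shared

def scop_relationship(sccs_a, sccs_b):
    """Classify a pair as SCOP would: homologous, analogous, or unrelated."""
    level = deepest_shared_level(sccs_a, sccs_b)
    if level in ("superfamily", "family"):
        return "homologous"
    if level in ("class", "fold"):
        return "analogous"
    return "unrelated"

# Identifiers taken from the text: two TIM-barrel superfamilies share only the fold,
# and the histone fold shares nothing with the P-loop NTPase family c.37.1.20.
print(deepest_shared_level("c.1.25", "c.1.7"))
print(scop_relationship("a.22", "c.37.1.20"))
```

Links in the map between domains that this scheme labels analogous or unrelated are the ones of interest here, since they point to homology that the classification does not record.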
Although the clustering was done only based on sequence information, we observe that proteins of the same structural class generally converge to the same regions in the map. The structural classes recognized by SCOP are: folds consisting primarily of α-helices (all-α), folds formed mainly of β-strands (all-β), folds in which helices and strands alternate regularly (α/β), and folds with irregular mixtures of helices and strands (α+β). SCOP also recognizes a small proteins class, which comprises proteins rich in cysteine, a multidomain class, and a membrane protein class, but these do not constitute classes in an architectural sense. The class map (Fig. 1) shows five large regions corresponding to the four primary classes, namely all-α (blue), all-β (cyan), α/β (red), and α+β (yellow), and to the small proteins class (green). We attribute their convergence to general similarities in amino acid composition, that is, to an analogous property. We find support for this notion in the fact that a map generated after correction for amino acid bias showed a considerably decreased grouping of the structural classes. This is consistent with previous observations that the amino acid composition reflects the structural class of a protein. Because of the force-directed clustering procedure, folds find their equilibrium position in the map not only by attraction to similar folds but also by repulsion of different ones. Clusters of similar folds can thus develop considerable repulsive forces, frequently clearing the areas around them and repelling dissimilar folds to distant parts of the map. For this reason, while the all-α and α/β classes are next to each other, the all-α and all-β classes occupy diagonally opposite locations. Of the primary classes, the α+β class shows the least convergence and overlaps most with the other classes, suggesting that it could be considered a catch-all class. This has already been pointed out by Orengo et al., who do not consider α+β a true structural class. Of the last two classes, membrane proteins cluster with the soluble proteins of the same secondary structure (helical membrane proteins with the all-α class and outer membrane proteins with the all-β class), and multidomain proteins are scattered all over the map, as their constituent domains belong to different classes.
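The interplay of attraction and repulsion described above can be illustrated with a toy force-directed layout. The Python sketch below is not the CLANS implementation: the force laws (a spring-like attraction along linked pairs and an inverse-square repulsion between all pairs), the parameter values, and the function names are all assumptions chosen for illustration.

```python
import math
import random

def force_layout(n_nodes, attraction, iterations=500, repulsion=0.01,
                 step=0.05, max_move=0.05, seed=0):
    """Toy 2D force-directed layout.

    attraction maps a pair (i, j) with i < j to a spring weight; every pair
    of nodes also repels with magnitude repulsion / distance**2.
    """
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    for _ in range(iterations):
        force = [[0.0, 0.0] for _ in range(n_nodes)]
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                # Positive values push i away from j, negative values pull it closer.
                magnitude = repulsion / dist**2 - attraction.get((i, j), 0.0) * dist
                fx, fy = magnitude * dx / dist, magnitude * dy / dist
                force[i][0] += fx
                force[i][1] += fy
                force[j][0] -= fx
                force[j][1] -= fy
        for i in range(n_nodes):
            fx, fy = force[i]
            norm = math.hypot(fx, fy)
            if norm > 0.0:
                move = min(step * norm, max_move)  # cap the displacement per iteration
                pos[i][0] += fx / norm * move
                pos[i][1] += fy / norm * move
    return pos

def mean_distance(pos, pairs):
    """Average Euclidean distance over the given index pairs."""
    return sum(math.dist(pos[i], pos[j]) for i, j in pairs) / len(pairs)

# Two hypothetical groups of three mutually similar domains, no links between groups.
within = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
between = [(i, j) for i in range(3) for j in range(3, 6)]
positions = force_layout(6, {pair: 1.0 for pair in within})
print(mean_distance(positions, within), mean_distance(positions, between))
```

In this toy setting, linked nodes settle at a short equilibrium distance while the two unlinked groups keep pushing each other apart, which is the qualitative behavior invoked to explain why dissimilar classes end up in distant parts of the map.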
Although unrelated domains from the same class are very loosely connected in general, the many tighter clusters are formed from groups of domains with statistically significant pairwise similarities that are indicative of homology. We chose 60 visually prominent clusters for further analysis. As expected, most of these contain domains of the same superfamily, but 18 contain domains from different superfamilies. Out of these, seven comprise superfamilies of the same fold and 11 superfamilies of different folds.
Figure 1. Galaxy of folds colored by classes. Domains from the same class come to lie in similar regions of the galaxy. Domains in SCOP20 were clustered in CLANS based on their all-against-all pairwise similarities as measured by HHsearch P-values. Dots represent domains. Line coloring reflects HHsearch P-values; the brighter a line, the lower the P-value. Domains are colored according to their SCOP class: all-α (blue), all-β (cyan), α/β (red), α+β (yellow), small proteins (green), multi-domain proteins (orange), and membrane proteins (magenta).
One large cluster contains the various superfamilies of the aforementioned TIM (βα)₈-barrels (yellow cluster at the bottom in Fig. 2). In our map, all superfamilies of this fold (SCOP c.1.1-c.1.33), except for monomethylamine methyltransferase (c.1.25) and NAD(P)-linked oxidoreductase (c.1.7), cluster into three groups, which are tightly linked to each other, in agreement with their proposed homology. Other examples of such folds with tightly connected superfamilies include the α/α toroid fold (a.102, salmon cluster near the left edge in Fig. 2) and the β-trefoil fold (b.42). In both of these cases, the tight linkage is consistent with a homologous origin for the superfamilies within the fold.
Although our results indicate that folds may not be as polyphyletic as assumed by SCOP, we do see instances of analogous folds. The most striking example is the ferredoxin-like fold (d.58), which has by far the largest number of superfamilies in SCOP. These superfamilies are distributed all over the map, indicating that they converged upon the same fold independently. Other examples are the ferritin-like folds (a.25) and the immunoglobulin-like β-sandwich folds (b.1). We also see instances of superfamilies of the same fold that show a mixture of homologous and analogous connections. Examples include the ribonuclease H-like motif fold (c.55), the double-stranded β-helix fold (b.82), and the SH3-like barrel fold (b.34).
Of the 11 clusters comprising domains belonging to different folds, connections in two clusters rely on global similarities between domains. One cluster contains β-propellers, which are toroidal folds with between four and 10 repeats of a four-stranded β-meander. In SCOP, they are classified into five different folds (b.66-b.70), each with multiple superfamilies. We recently proposed a common origin for all β-propellers, and we find that they indeed cluster together, except for apyrase (b.67.3) and sema domain (b.69.12), which contain large insertions. The second cluster comprises transmembrane β-barrels, which are classified into seven superfamilies within two folds (f.4 and d.24.1.4) in SCOP. Their homologous origin has been discussed recently.
Figure 2. Galaxy of folds colored by folds. Some clusters connect domains of different folds, pointing to common, homologous fragments of similar sequence and structure. These might represent descendants of a set of ancient peptide modules, from which the first protein domains have been assembled.
In the remaining nine clusters, the connections between domains clearly result from the presence of sequence- and structure-similar subdomain-sized fragments. For example, one large cluster contains a variety of topologically distinct DNA-binding domains with a common helix-turn-helix motif (large, mainly red cluster in the middle of Fig. 2), whose homologous origin has been discussed previously. These domains are classified into 16 superfamilies contained within 10 folds. Another large cluster contains the Rossmann folds (large cluster comprising dots with various colors at the bottom in Fig. 2), which possess a common dinucleotide-binding βαβ-element. Their evolutionary relationship has also been proposed previously. A further cluster comprises the eukaryotic (Type-I) and the prokaryotic (Type-II) KH-domains, which are topologically distinct but homologous. The similarity between these folds is limited to a βααβ motif. These clusters lend support to a theory on the origin of folded proteins, which proposes that these structure- and sequence-similar fragments seen in disparate molecular contexts represent remnants of an ancient peptide-RNA world, thus suggesting that today’s domains have arisen by fusion, amplification, and divergence from a simpler set of peptide modules.
We do not observe some clusters that we expected from reported instances of remote homology. One reason is that many of these involve domains with few homologs of known structure. For clustering, these would have to rely entirely on the strength of their pairwise connection rather than benefiting from the stronger attractive field generated by a compact group of homologous domain families. Another reason is that some domains that are clearly recognizable at the structural level do not appear as independent entries in SCOP. We had previously proposed that the histone fold (a.22) might have arisen from the C-domain of AAA+ ATPases (c.37.1.20) through a 3D domain swap. In the present map, although domains belonging to these two folds show clear pairwise connections, they do not form a tight cluster. This is because C-domains are not characterized as a separate fold in SCOP but are classified with other P-loop NTPases based on the preceding ATPase domain; they therefore cluster tightly with these. We also anticipate that some instances of distant homology remain unrecognized in our map if they involve domains with few homologs in current databases, as it is not possible to build a reasonable profile HMM in these cases. With the progress of sequencing projects, this problem should wane. A few links with significant HHsearch P-values are false positives, which connect clearly unrelated domains, such as the link between many TIM barrel proteins and the guanine deaminase (d2ooda1, turquoise cluster at the bottom in Fig. 2). According to a systematic analysis of the highest-scoring false positives, the chief cause of these false links is corrupted alignments that are used to build the profile HMMs. In this case, sequences from TIM barrels have crept into the alignment of d2ooda1 during the iterative search.
Figure 3. Galaxy of folds colored by superfamilies. Many tight clusters contain various superfamilies of the same fold, indicating that folds with multiple independent origins are the exception rather than the rule.
Materials and Methods
We used the SCOP database, version 1.75, filtered to a maximum of 20% sequence identity (SCOP20). HHsearch was used for all-against-all comparison of the 7002 domains in SCOP20. HHsearch is a sensitive method for remote homology detection that is based on the pairwise comparison of profile hidden Markov models (HMMs). Profile HMMs can be viewed as sequence profiles containing position-specific gap penalties. They can be constructed from multiple sequence alignments of homologs. For this purpose, alignments were built for each of the SCOP20 domains using the script (with default parameters) from the HHsearch 1.6.0 package. This script uses CS-BLAST, a sequence context-specific extension of PSI-BLAST, for iterative sequence searching. It also contains heuristics to reduce the inclusion of nonhomologous sequence segments at the ends of PSI-BLAST sequence matches, the leading cause of high-scoring false positive matches in PSI-BLAST. Profile HMMs were calculated from the alignments using hhmake and compared with HHsearch, both from the HHsearch 1.6.0 package. We switched off secondary structure scoring and the compositional bias correction (options -ssm 0 -sc 0) and used default settings otherwise. We clustered the SCOP20 domains by their pairwise HHsearch P-values in CLANS, an implementation of the Fruchterman-Reingold clustering algorithm that scales negative log-P-values into attractive forces in a force field. Clustering was done to equilibrium in 2D at a P-value cutoff of 1.0e-01 using default settings. The obtained cluster map is available through a web-based tool (HHcluster). Users can select between maps colored by class, fold, superfamily, or family. HHcluster is integrated into the MPI Bioinformatics Toolkit (http://toolkit., allowing the interactive analysis of the map. Proteins can be identified in the map by a mouse-over function or through text searches; neighbors with significant connections and their corresponding HMM-HMM comparison results can be viewed by clicking on the query domain; and the structures of both query and matched protein domain can be viewed with the aligned substructures structurally superposed and highlighted.
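The step from pairwise P-values to attractive forces can be sketched as follows. This Python example is a minimal illustration under stated assumptions: it takes the negative base-10 logarithm of each P-value at or below the cutoff of 1.0e-01 and rescales by the strongest link. The exact scaling used inside CLANS is not given here, and the function names and domain labels are hypothetical.

```python
import math

def attraction_weights(pvalues, cutoff=1e-1):
    """Map pairwise P-values to attraction weights.

    Pairs with a P-value above the cutoff receive no attractive link;
    the remaining pairs get the weight -log10(P).
    """
    weights = {}
    for pair, p in pvalues.items():
        if 0.0 < p <= cutoff:
            weights[pair] = -math.log10(p)
    return weights

def normalize(weights):
    """Rescale weights so that the strongest link has weight 1."""
    top = max(weights.values(), default=0.0)
    if top <= 0.0:
        return {}
    return {pair: w / top for pair, w in weights.items()}

# Hypothetical P-values for three domains (labels are placeholders, not SCOP entries).
pvalues = {("d1", "d2"): 1e-20, ("d1", "d3"): 0.5, ("d2", "d3"): 1e-2}
weights = attraction_weights(pvalues)
print(weights)
print(normalize(weights))
```

With a logarithmic scale of this kind, a link with a P-value of 1e-20 pulls ten times harder than one with a P-value of 1e-2, while pairs above the cutoff contribute no attraction at all.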
Conclusions
We have produced a two-dimensional map of protein fold space using sequence criteria alone to evaluate the abundance of distant evolutionary relationships among protein domains currently classified into analogous categories. Our map offers a global view of evolutionary relationships in fold space and shows instances of homologous connections that transcend both superfamily and fold levels. Many of the relationships observed in the map have been discussed individually before, confirming the validity of these findings. Our results suggest that proteins may not have had as many independent origins as hitherto assumed.
References
1. Brocchieri L, Karlin S (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res 33:
2. Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA (2006) Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res 34:1066–1080.
3. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, et al. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res 37:D211–D215.
4. Orengo CA, Thornton JM (2005) Protein families and their evolution-a structural perspective. Annu Rev Biochem 74:867–900.
5. Rao ST, Rossmann MG (1973) Comparison of super-secondary structures in proteins. J Mol Biol 76:241–256.
6. Salem GM, Hutchinson EG, Orengo CA, Thornton JM (1999) Correlation of observed fold frequency with the occurrence of local structural motifs. J Mol Biol 287:
7. Cheng H, Kim BH, Grishin NV (2008) MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res 36:D211–217.
8. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540.
9. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (1997) CATH – a hierarchic classification of protein domain structures. Structure 5:1093–1108.
10. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
11. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763.
12. Sadreyev R, Grishin N (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326:317–336.
13. Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960.
14. Xie L, Bourne PE (2008) Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments. Proc Natl Acad Sci USA 105:5441–5446.
15. Copley RR, Bork P (2000) Homology among (betaalpha)(8) barrels: implications for the evolution of metabolic pathways. J Mol Biol 303:627–641.
16. Nagano N, Orengo CA, Thornton JM (2002) One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol 321:741–765.
17. Soding J, Remmert M, Biegert A (2006) HHrep: de novo protein repeat detection and the origin of TIM barrels. Nucleic Acids Res 34:W137–W142.
18. Grishin NV (2001) Fold change in evolution of protein structures. J Struct Biol 134:167–185.
19. Andreeva A, Murzin AG (2006) Evolution of protein fold in the presence of functional constraints. Curr Opin Struct Biol 16:399–408.
20. Andreeva A, Prlic A, Hubbard TJ, Murzin AG (2007) SISYPHUS – structural alignments for proteins with non-trivial relationships. Nucleic Acids Res 35:D253–D259.
21. Alva V, Koretke KK, Coles M, Lupas AN (2008) Cradle-loop barrels and the concept of metafolds in protein classification by natural descent. Curr Opin Struct Biol
22. Fetrow JS, Godzik A (1998) Function driven protein evolution. A possible proto-protein for the RNA-binding proteins. Pac Symp Biocomput 3:485–496.
23. Lupas AN, Ponting CP, Russell RB (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134:191–203.
24. Soding J, Lupas AN (2003) More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25:837–846.
25. Holm L, Sander C (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol 233:123–138.
26. Orengo CA, Flores TP, Taylor WR, Thornton JM (1993) Identification and classification of protein fold families. Protein Eng 6:485–500.
27. Hou J, Sims GE, Zhang C, Kim SH (2003) A global representation of the protein fold space. Proc Natl Acad Sci USA 100:2386–2390.
28. Hou J, Jun SR, Zhang C, Kim SH (2005) Global mapping of the protein structure space and application in structure-based inference of protein function. Proc Natl Acad Sci USA 102:3651–3656.
29. Friedberg I, Godzik A (2005) Connecting the protein structure universe by using sparse recurring fragments. Structure 13:1213–1224.
30. Kolodny R, Petrey D, Honig B (2006) Protein structure comparison: implications for the nature of ‘fold space’, and structure and function prediction. Curr Opin Struct Biol 16:393–398.
31. Taylor WR (2007) Evolutionary transitions in protein fold space. Curr Opin Struct Biol 17:354–361.
32. Cuff A, Redfern OC, Greene L, Sillitoe I, Lewis T, Dibley M, Reid A, Pearl F, Dallman T, Todd A, Garrat R, Thornton J, Orengo C (2009) The CATH hierarchy revisited-structural divergence in domain superfamilies and the continuity of fold space. Structure 17:1051–1062.
33. Pascual-Garcia A, Abia D, Ortiz AR, Bastolla U (2009) Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Comput Biol
34. Lupas AN, Koretke KK. Evolution of protein folds. In: Peitsch MC, Schwede T, Eds. (2008) Computational structural biology: methods and applications. Hackensack, N.J., London: World Scientific, pp 131–152.
35. Sadreyev RI, Kim BH, Grishin NV (2009) Discrete-continuous duality of protein structure space. Curr Opin Struct Biol 19:321–328.
36. Chou KC, Zhang CT (1995) Prediction of protein structural classes. Crit Rev Biochem Mol Biol 30:275–349.
37. Ponting CP, Russell RB (2000) Identification of distant homologues of fibroblast growth factors suggests a common ancestor for all beta-trefoil proteins. J Mol Biol
38. Liang PH, Ko TP, Wang AH (2002) Structure, mechanism and function of prenyltransferases. Eur J Biochem 269:3339–3354.
39. Chaudhuri I, Soding J, Lupas AN (2008) Evolution of the beta-propeller fold. Proteins 71:795–803.
40. Arnold T, Poynor M, Nussberger S, Lupas AN, Linke D (2007) Gene duplication of the eight-stranded beta-barrel OmpX produces a functional pore: a scenario for the evolution of transmembrane beta-barrels. J Mol Biol
41. Remmert M, Linke D, Lupas AN, Soding J (2009) HHomp – prediction and classification of outer membrane proteins. Nucleic Acids Res 37:W446–W451.
42. Brennan RG (1993) The winged-helix DNA-binding motif: another helix-turn-helix takeoff. Cell 74:
43. Grishin NV (2001) KH domain: one motif, two folds. Nucleic Acids Res 29:638–643.
44. Alva V, Ammelburg M, Soding J, Lupas AN (2007) On the origin of the histone fold. BMC Struct Biol 7:17.
45. Biegert A, Soding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA 106:3770–3775.
46. Frickey T, Lupas A (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics 20:3702–3704.
47. Biegert A, Mayer C, Remmert M, Soding J, Lupas AN (2006) The MPI Bioinformatics Toolkit for protein sequence analysis. Nucleic Acids Res 34:W335–W339.
Full-text available
As sequence and structure comparison algorithms gain sensitivity, the intrinsic interconnectedness of the protein universe has become increasingly apparent. Despite this general trend, β-trefoils have emerged as an uncommon counterexample: They are an isolated protein lineage for which few, if any, sequence or structure associations to other lineages have been identified. If β-trefoils are, in fact, remote islands in sequence-structure space, it implies that the oligomerizing peptide that founded the β-trefoil lineage itself arose de novo . To better understand β-trefoil evolution, and to probe the limits of fragment sharing across the protein universe, we identified both ‘β-trefoil bridging themes’ (evolutionarily-related sequence segments) and ‘β-trefoil-like motifs’ (structure motifs with a hallmark feature of the β-trefoil architecture) in multiple, ostensibly unrelated, protein lineages. The success of the present approach stems, in part, from considering β-trefoil sequence segments or structure motifs rather than the β-trefoil architecture as a whole, as has been done previously. The newly uncovered inter-lineage connections presented here suggest a novel hypothesis about the origins of the β-trefoil fold itself–namely, that it is a derived fold formed by ‘budding’ from an Immunoglobulin-like β-sandwich protein. These results demonstrate how the evolution of a folded domain from a peptide need not be a signature of antiquity and underpin an emerging truth: few protein lineages escape nature’s sewing table.
Introduction: While the origin and evolution of proteins remain mysterious, advances in evolutionary genomics and systems biology are facilitating the historical exploration of the structure, function and organization of proteins and proteomes. Molecular chronologies are series of time events describing the history of biological systems and subsystems and the rise of biological innovations. Together with time-varying networks, these chronologies provide a window into the past. Areas covered: Here, we review molecular chronologies and networks built with modern methods of phylogeny reconstruction. We discuss how chronologies of structural domain families uncover the explosive emergence of metabolism, the late rise of translation, the co-evolution of ribosomal proteins and rRNA, and the late development of the ribosomal exit tunnel; events that coincided with a tendency to shorten folding time. Evolving networks described the early emergence of domains and a late 'big bang' of domain combinations. Expert opinion: Two processes, folding and recruitment appear central to the evolutionary progression. The former increases protein persistence. The later fosters diversity. Chronologically, protein evolution mirrors folding by combining supersecondary structures into domains, developing translation machinery to facilitate folding speed and stability, and enhancing structural complexity by establishing long-distance interactions in novel structural and architectural designs.
Motivation: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications for protein evolution, the understanding of diseases, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution. Results: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, the HUP and the TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes for the first time two sampling strategies, based on latent space interpolation and latent space arithmetic, to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space.
Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.
Domains are the structural, functional and evolutionary units of proteins. They combine to form multidomain proteins. The evolutionary history of this molecular combinatorics has been studied with phylogenomic methods. Here, we construct networks of domain organization and explore their evolution. A time series of networks revealed two ancient waves of structural novelty arising from ancient ‘p-loop’ and ‘winged helix’ domains and a massive ‘big bang’ of domain organization. The evolutionary recruitment of domains was highly modular, hierarchical and ongoing. Domain rearrangements elicited non-random and scale-free network structure. Comparative analyses of preferential attachment, randomness and modularity showed yin-and-yang complementary transition and biphasic patterns along the structural chronology. Remarkably, the evolving networks highlighted a central evolutionary role of cofactor-supporting structures of non-ribosomal peptide synthesis pathways, likely crucial to the early development of the genetic code. Some highly modular domains featured dual response regulation in two-component signal transduction systems with DNA-binding activity linked to transcriptional regulation of responses to environmental change. Interestingly, hub domains across the evolving networks shared the historical role of DNA binding and editing, an ancient protein function in molecular evolution. Our investigation unfolds historical source-sink patterns of evolutionary recruitment that further our understanding of protein architectures and functions.
A comparison of protein backbones makes clear that not more than approximately 1400 different folds exist, each specifying the three‐dimensional topology of a protein domain. Large proteins are composed of specific domain combinations and many domains can accommodate different functions. These findings confirm that the reuse of domains is key for the evolution of multi‐domain proteins. If reuse was also the driving force for domain evolution, ancestral fragments of sub‐domain size exist that are shared between domains possessing significantly different topologies. For the fully automated detection of putatively ancestral motifs, we developed the algorithm Fragstatt that compares proteins pairwise to identify fragments, that is, instantiations of the same motif. To reach maximal sensitivity, Fragstatt compares sequences by means of cascaded alignments of profile Hidden Markov Models. If the fragment sequences are sufficiently similar, the program determines and scores the structural concordance of the fragments. By analyzing a comprehensive set of proteins from the CATH database, Fragstatt identified 12 532 partially overlapping and structurally similar motifs that clustered to 134 unique motifs. The dissemination of these motifs is limited: We found only two domain topologies that contain two different motifs and generally, these motifs occur in not more than 18% of the CATH topologies. Interestingly, motifs are enriched in topologies that are considered ancestral. Thus, our findings suggest that the reuse of sub‐domain sized fragments was relevant in early phases of protein evolution and became less important later on.
Genome sequencing projects reveal the sequences of all the proteins encoded in a genome. As the first step, homology detection is employed to obtain clues to the structure and function of these proteins. However, high evolutionary divergence between homologous proteins challenges our ability to detect distant relationships. In the past, an approach involving multiple Position Specific Scoring Matrices (PSSMs) was found to be more effective than traditional single PSSMs. Cascaded search is another successful approach, in which the hits of a search are themselves used as queries to detect more homologues. We propose a protocol, ‘Master Blaster’, which combines the principles adopted in these two approaches to enhance our ability to detect remote homologues even further. Assessment of the approach was performed using known relationships available in the SCOP70 database, and the results were compared against those of PSI-BLAST and HHblits, a hidden Markov model-based method. Compared to PSI-BLAST, Master Blaster resulted in a 10% improvement in the detection of cross-superfamily connections, nearly 35% improvement in cross-family connections and more than 80% improvement in intra-family connections. From the results it was observed that HHblits is more sensitive in detecting remote homologues than Master Blaster. However, there are true hits from 46 folds for which Master Blaster reported homologues that are not reported by HHblits even with optimal parameters, indicating that combining multiple methods based on different approaches can be more effective for detecting remote homologues. Master Blaster stand-alone code is available for download in the supplementary archive.
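The cascaded-search idea mentioned in this abstract (hits of one search become the queries of the next) can be illustrated with a minimal sketch. This is not the Master Blaster protocol: the `identity` similarity measure, the `search` function, the thresholds and the toy sequences are all hypothetical stand-ins for a real PSSM-based search such as PSI-BLAST.

```python
from collections import deque

def identity(a, b):
    """Toy similarity (assumption): fraction of identical positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def search(query, database, threshold):
    """Hypothetical stand-in for one database search; returns names of hits."""
    return {name for name, seq in database.items()
            if identity(query, seq) >= threshold}

def cascaded_search(query, database, threshold=0.6, max_rounds=3):
    """Re-query the database with every new hit, for up to max_rounds rounds."""
    found = search(query, database, threshold)
    frontier = deque((name, 1) for name in sorted(found))
    while frontier:
        name, round_no = frontier.popleft()
        if round_no >= max_rounds:
            continue  # do not expand hits found in the last allowed round
        new_hits = search(database[name], database, threshold) - found
        for hit in sorted(new_hits):
            found.add(hit)
            frontier.append((hit, round_no + 1))
    return found

# Toy data: C is too diverged from the query to be found directly,
# but it is similar to the intermediate sequence B.
query = "MKVLAGHWTE"
toy_db = {
    "B": "MKVLAGHQSD",
    "C": "MRILSGHQSD",
    "D": "PPPPPPPPPP",
}
direct_hits = search(query, toy_db, 0.6)
cascaded_hits = cascaded_search(query, toy_db, 0.6)
```

In this toy example the direct search finds only B, while the cascade also reaches C through B. The same transitivity that increases sensitivity can also propagate false positives, which is why real protocols use strict significance thresholds at each round.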
In the regime of domain classifications, the protein universe unveils a discrete set of folds connected by hierarchical relationships. Instead, at sub-domain-size resolution and because of physical constraints not necessarily requiring evolution to shape polypeptide chains, networks of protein motifs depict a continuous view that lies beyond the extent of hierarchical classification schemes. A number of studies, however, suggest that universal sub-sequences could be the descendants of peptides that emerged in an ancient pre-biotic world. Should this be the case, evolutionary signals retained by structurally conserved motifs, along with hierarchical features of ancient domains, could sew relationships among folds that diverged beyond the point where homology is discernible. In view of this, the paper provides a rationale in which a network with hierarchical and continuous levels of the protein space, together with sequence profiles that probe the extent of sequence similarity and contacting residues that capture the transition from the pre-biotic to the domain world, is used to explore relationships between ancient folds. Statistics of the detected signals are reported. As a result, an example of an emergent sub-network that makes sense from an evolutionary perspective, in which conserved signals retrieved from the assessed protein space have been co-opted, is discussed.
The vast majority of theoretically possible polypeptide chains do not fold, let alone confer function. Hence, protein evolution from preexisting building blocks has clear potential advantages over ab initio emergence from random sequences. In support of this view, sequence similarities between different proteins are generally indicative of common ancestry, and we collectively refer to such homologous sequences as ‘themes’. At the domain level, sequence homology is routinely detected. However, short themes which are segments, or fragments of intact domains, are particularly interesting because they may provide hints about the emergence of domains, as opposed to divergence of preexisting domains, or their mixing-and-matching to form multi-domain proteins. Here we identified 525 representative short themes, comprising 20-to-80 residues, that are unexpectedly shared between domains considered to have emerged independently. Among these ‘bridging themes’ are ones shared between the most ancient domains, e.g., Rossmann, P-loop NTPase, TIM-barrel, Flavodoxin, and Ferredoxin-like. We elaborate on several particularly interesting cases, where the bridging themes mediate ligand binding. Ligand binding may have contributed to the stability and the plasticity of these building blocks, and to their ability to invade preexisting domains or serve as starting points for completely new domains.
Evolutionary processes that formed the current protein universe left their traces, among them homologous segments that recur, or are ‘reused,’ in multiple proteins. These reused segments, called ‘themes,’ can be found at various scales, the best known of which is the domain. Yet, recent studies have begun to focus on the evolutionary insights that can be derived from sub-domain-scale themes, which are candidates for traces of more ancient events. Characterizing these may provide clues to the emergence of domains. Particularly interesting are themes that are reused across dissimilar contexts, that is, where the rest of the protein domain differs. We survey computational studies identifying reused themes within different contexts at the sub-domain level.
Proteins are chief actors in life that perform a myriad of exquisite functions. This diversity has been enabled through the evolution and diversification of protein folds. Analysis of sequences and structures strongly suggests that numerous protein pieces have been reused as building blocks and propagated to many modern folds. This information can be traced to understand how the protein world has diversified. In this review, we discuss the latest advances in the analysis of protein evolutionary units, and we use as a model system one of the most abundant and versatile topologies, the TIM-barrel fold, to highlight the existing common principles that interconnect protein evolution, structure, folding, function, and design.
Ore mineral and host lithologies have been sampled with 89 oriented samples from 14 sites in the Naica District, northern Mexico. Magnetic parameters permit characterisation of the samples: saturation magnetization, density, low- and high-temperature magnetic susceptibility, remanence intensity, Koenigsberger ratio, Curie temperature and hysteresis parameters. Rock magnetic properties are controlled by variations in titanomagnetite content and hydrothermal alteration. Post-mineralization hydrothermal alteration seems to be the major event that affected the minerals and magnetic properties. Curie temperatures are characteristic of titanomagnetites or titanomaghemites. Hysteresis parameters indicate that most samples have pseudo-single domain (PSD) magnetic grains. Alternating field (AF) demagnetization and isothermal remanence (IRM) acquisition both indicate that natural and laboratory remanences are carried by MD-PSD spinels in the host rocks. The trend of NRM intensity vs susceptibility suggests that the carrier of remanent and induced magnetization is the same in all cases (spinels). The Koenigsberger ratio ranges from 0.05 to 34.04, indicating the presence of MD and PSD magnetic grains. Constraints on the geometry of the intrusive source body developed in the model of the magnetic anomaly are obtained by quantifying the relative contributions of induced and remanent magnetization components.
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic, and statistical refinements permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is described for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities.
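To make the notion of a position-specific score matrix concrete, the following minimal sketch derives log-odds scores from a small ungapped alignment. It is an illustration only and not the PSI-BLAST procedure: the uniform background frequencies, the flat pseudocount, the function names `build_pssm` and `score`, and the toy alignment are all assumptions, and PSI-BLAST's sequence weighting and data-dependent pseudocounts are omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumptions: uniform background frequencies, flat pseudocounts,
    and gap characters are simply ignored.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score(pssm, seq):
    """Score an ungapped sequence that has the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))

# Hypothetical toy alignment of four five-residue segments.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
pssm = build_pssm(alignment)
family_like = score(pssm, "MKVLA")
unrelated = score(pssm, "WWWWW")
```

A sequence that matches the conserved columns scores higher than an unrelated one, which is the property an iterated search exploits: each round's significant hits refine the matrix used in the next round.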
This paper explores the structural continuum in CATH and the extent to which superfamilies adopt distinct folds. Although most superfamilies are structurally conserved, in some of the most highly populated superfamilies (4% of all superfamilies) there is considerable structural divergence. While relatives share a similar fold in the evolutionarily conserved core, diverse elaborations to this core can result in significant differences in the global structures. Applying similar protocols to examine the extent to which structural overlaps occur between different fold groups, it appears this effect is confined to just a few architectures and is largely due to small, recurring super-secondary motifs (e.g., αβ-motifs, α-hairpins). Although 24% of superfamilies overlap with superfamilies having different folds, only 14% of nonredundant structures in CATH are involved in overlaps. Nevertheless, the existence of these overlaps suggests that, in some regions of structure space, the fold universe should be seen as more continuous.
Outer membrane proteins (OMPs) are the transmembrane proteins found in the outer membranes of Gram-negative bacteria, mitochondria and plastids. Most prediction methods have focused on analogous features, such as alternating hydrophobicity patterns. Here, we start from the observation that almost all β-barrel OMPs are related by common ancestry. We identify proteins as OMPs by detecting their homologous relationships to known OMPs using sequence similarity. Given an input sequence, HHomp builds a profile hidden Markov model (HMM) and compares it with an OMP database by pairwise HMM comparison, integrating OMP predictions by PROFtmb. A crucial ingredient is the OMP database, which contains profile HMMs for over 20 000 putative OMP sequences. These were collected with the exhaustive, transitive homology detection method HHsenser, starting from 23 representative OMPs in the PDB database. In a benchmark on TransportDB, HHomp detects 63.5% of the true positives before including the first false positive. This is 70% more than PROFtmb, four times more than BOMP and 10 times more than TMB-Hunt. In Escherichia coli, HHomp identifies 57 out of 59 known OMPs and correctly assigns them to their functional subgroups. HHomp can be accessed at
Structural classifications of proteins assume the existence of the fold, which is an intrinsic equivalence class of protein domains. Here, we test under which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, requiring that similarity of A with B and B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find that such violations present a well-defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchic classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications.
As a domain set, we have selected a consensus set of 2,890 domains decomposed very similarly in SCOP and CATH. As an alignment algorithm, we used a global version of MAMMOTH developed in our group, which is both rapid and accurate. As a similarity measure, we used the size-normalized contact overlap, and as a clustering algorithm, we used average linkage. The resulting automatic classification at the cross-over point was more consistent than expert ones with respect to the structure similarity measure, with 86% of the clusters corresponding to subsets of either SCOP or CATH superfamilies and fewer than 5% containing domains in distinct folds according to both SCOP and CATH. Almost 15% of SCOP superfamilies and 10% of CATH superfamilies were split, consistent with the notion of fold change in protein evolution. These results were qualitatively robust for all choices that we tested, although we did not try to use alignment algorithms developed by other groups. Folds defined in SCOP and CATH would be completely joined in the regime of large transitivity violations where clustering is more arbitrary. Consistently, the agreement between SCOP and CATH at fold level was lower than their agreement with the automatic classification obtained using as a clustering algorithm, respectively, average linkage (for SCOP) or single linkage (for CATH). The networks representing significant evolutionary and structural relationships between clusters beyond the cross-over point may allow us to perform evolutionary, structural, or functional analyses beyond the limits of classification schemes. These networks and the underlying clusters are available at
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on the World Wide Web (WWW) with an entry point to URL scop: an old English poet or minstrel (Oxford English Dictionary); ckon: pile, accumulation (Russian Dictionary).
This paper presents and discusses evidence suggesting how the diversity of domain folds in existence today might have evolved from peptide ancestors. We apply a structure similarity detection method to detect instances where localized regions of different protein folds contain highly similar sequences and structures. Results of performing an all-on-all comparison of known structures are described and compared with other recently published findings. The numerous instances of local sequence and structure similarities within different protein folds, together with evidence from proteins containing sequence and structure repeats, argue in favor of the evolution of modern single polypeptide domains from ancient short peptide ancestors (antecedent domain segments (ADSs)). In this model, ancient protein structures were formed by self-assembling aggregates of short polypeptides. Subsequently, and perhaps concomitantly with the evolution of higher fidelity DNA replication and repair systems, single polypeptide domains arose from the fusion of ADS genes. Thus modern protein domains may have a polyphyletic origin.
Recently, the nature of protein structure space has been widely discussed in the literature. The traditional discrete view of the protein universe as a set of separate folds has been criticized in the light of growing evidence that almost any arrangement of secondary structures is possible and the whole protein space can be traversed through a path of similar structures. Here we argue that the discrete and continuous descriptions are not mutually exclusive, but complementary: the space is largely discrete in an evolutionary sense, but continuous geometrically when purely structural similarities are quantified. Evolutionary connections are mainly confined to separate structural prototypes corresponding to folds as islands of structural stability, with few remaining traceable links between the islands. However, for a geometric similarity measure, it is usually possible to find a reasonable cutoff that yields paths connecting any two structures through intermediates.