To explore protein space from a global perspective, we consider 9,710 SCOP (Structural Classification of Proteins) domains with up to 70% sequence identity and present all similarities among them as networks: In the "domain network," nodes represent domains, and edges connect domains that share "motifs," i.e., significantly sized segments of similar sequence and structure. We explore the dependence of the network on the thresholds that define the evolutionary relatedness of the domains. At excessively strict thresholds the network falls apart completely; for very lax thresholds, there are network paths between virtually all domains. Interestingly, at intermediate thresholds the network constitutes two regions that can be described as "continuous" versus "discrete." The continuous region comprises a large connected component, dominated by domains with alternating alpha and beta elements, and the discrete region includes the rest of the domains in isolated islands, each generally corresponding to a fold. We also construct the "motif network," in which nodes represent recurring motifs, and edges connect motifs that appear in the same domain. This network also features a large and highly connected component of motifs that originate from domains with alternating alpha/beta elements (and some all-alpha domains), and smaller isolated islands. Indeed, the motif network suggests that nature reuses such motifs extensively. The networks suggest evolutionary paths between domains and give hints about protein evolution and the underlying biophysics. They provide natural means of organizing protein space, and could be useful for the development of strategies for protein search and design.
Global view of the protein universe
Sergey Nepomnyachiy, Nir Ben-Tal, and Rachel Kolodny
Department of Computer Science and Engineering, Polytechnic Institute of New York University, Brooklyn, NY 11201; Department of Biochemistry and Molecular Biology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel; and Department of Computer Science, University of Haifa, Mount Carmel 31905, Israel
Edited by Barry Honig, Howard Hughes Medical Institute, Columbia University, New York, NY, and approved July 2, 2014 (received for review February 24, 2014)
protein cooccurrence networks
protein similarity networks
How are proteins related to each other? Which physicochemical considerations affect protein evolution and how? A global view of the protein universe may shed light on these fundamental questions. It could also suggest new strategies for protein search and design (1–3). However, forming a global picture of the protein universe is difficult because we have to piece it together from the many local glimpses that our empirical data and computational tools provide. In other words, a global picture needs to portray the relationships among all proteins, yet we only have evidence of such relationships among several proteins, based on the similarity between their sequences, structures, and functions. The considerable size of the Protein Data Bank (4) also complicates this task.
In particular, an intensely debated question is whether protein space is "discrete" or "continuous" (2, 3, 5–10). These terms are loosely defined. Discrete implies that the global picture consists of separate, island-like, structural entities. In the hierarchical classification of protein domains, Structural Classification of Proteins (SCOP) (11), these entities are termed "folds," and in the CATH database (12) they are called "topologies." Alternatively, "continuous" implies that the space between these entities is generally populated by cross-fold similarities (e.g., refs. 2, 5, 6, 9, 13–15). If such similarities are abundant, then one must account for them when organizing and searching proteins (5, 8, 16). In support of the abundance of such similarities is the remarkable success of structure prediction methods that piece together predictions of protein fragments or larger protein segments (e.g., ref. 17).
There are different approaches to forming a global view of the protein universe (18). The most significant efforts are the ones embodied in the hierarchical classifications CATH and SCOP. However, a hierarchy implicitly assumes that there are isolated regions in protein space. An alternative approach is to study the protein universe via maps, where domains are represented by points in two or three dimensions, placed so that the distances between them depend on the dissimilarity between their corresponding domains (e.g., refs. 19–21). By coloring the points according to domain characteristics, one can visually identify global properties of the protein universe (19, 20). However, a map representation in low-dimensional Euclidean space implicitly suggests that similarity among domains is transitive (i.e., that similarity within the pairs A–B and B–C implies that A and C are similar too); we know that this is often not the case (6). Finally, a third approach to studying protein space is via similarity and cooccurrence networks. In similarity networks, nodes typically represent protein domains and edges connect similar domains. Several successful studies of protein space capitalize on such networks (22, 23). Cooccurrence networks of protein domains, in which nodes represent domains and edges connect cooccurring domains, were also studied to better understand protein evolution (24–26).
Here, we study the global nature of the protein universe using domain and motif networks (Fig. 1). To construct these networks, we identify evolutionary relationships among a representative set of SCOP domains; we relate two domains if they share a significantly sized part (denoted "motif") with similar structure and sequence. Our analysis reveals that protein space is both discrete and continuous: SCOP domains of the all-alpha, all-beta, and alpha+beta classes, in which alpha and beta elements do not mix, mostly populate the discrete parts, whereas alpha/beta domains, with alternating alpha and beta segments, mostly populate the continuous ones. We also find that recurring motifs are very abundant; the motifs from the all-alpha and alpha/beta domains are the more abundant, and the more gregarious ones.
Significance
To globally explore protein space, we use networks to present similarities among a representative set of all known domains. In the "domain network" edges connect domains that share "motifs," i.e., significantly sized segments of similar sequence and structure, and in the "motif network" edges connect recurring motifs that appear in the same domain. The networks offer a way to organize protein space, and examine how the organization changes upon changing the definition of "evolutionary relatedness" among domains. For example, we use them to highlight and characterize the uniqueness of a class of domains called alpha/beta, in which the alpha and beta elements alternate. The networks can also suggest evolutionary paths between domains, and be used for protein search and design.
Author contributions: N.B.-T. and R.K. designed research; S.N. and R.K. performed research; N.B.-T. and R.K. analyzed data; and N.B.-T. and R.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
To whom correspondence may be addressed. Email: or
This article contains supporting information online at 1073/pnas.1403395111/-/DCSupplemental.
PNAS | August 12, 2014 | vol. 111 | no. 32
We align all-versus-all in a set of 70% sequence nonredundant SCOP v.1.72 domains (11) using the structural aligner SSM (27). For each pair of aligned domains, we calculate the length of the aligned region, the percent sequence similarity of aligned residues (using the BLOSUM62 substitution matrix), and the root-mean-square deviation (rmsd) of these residues. Then, we define cutoffs for these values and use them to filter the alignments. From the filtered alignments, we construct the domain network (Fig. 1B) and the motif network (Fig. 1C). In the domain network, nodes are the SCOP domains in the dataset, and edges connect pairs of domains that share a similar motif. In the motif network, nodes are motifs, and the edges connect pairs of motifs that cooccur in a domain. We consider length thresholds of 55 and 75 residues, percent similarity of aligned residues thresholds of 30%, 40%, and 50%, and rmsd thresholds of 2, 2.5, and 3 Å. We explore how well threshold combinations reproduce SCOP segregation into folds, i.e., optimally including all domains from the same fold in a connected component, while excluding from it domains of other folds.
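The filtering and network construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `Alignment` record, its field names, the toy domain identifiers and values, and the use of inclusive comparisons for the cutoffs are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One pairwise structural alignment (field names are illustrative)."""
    domain_a: str
    domain_b: str
    length: int            # number of aligned residues
    pct_similarity: float  # percent similar aligned residues
    rmsd: float            # rmsd of the aligned residues, in angstroms


def passes(aln, min_length, min_similarity, max_rmsd):
    # Assumption: cutoffs are inclusive; the text does not fix this detail.
    return (aln.length >= min_length
            and aln.pct_similarity >= min_similarity
            and aln.rmsd <= max_rmsd)


def domain_network(domains, alignments, min_length, min_similarity, max_rmsd):
    """Adjacency sets: one node per domain, one edge per surviving alignment."""
    adjacency = {d: set() for d in domains}
    for aln in alignments:
        if passes(aln, min_length, min_similarity, max_rmsd):
            adjacency[aln.domain_a].add(aln.domain_b)
            adjacency[aln.domain_b].add(aln.domain_a)
    return adjacency


def connected_components(adjacency):
    """Connected components via depth-first search, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy data: four made-up domains and three made-up alignments.
domains = ["A", "B", "C", "D"]
alignments = [
    Alignment("A", "B", 80, 35.0, 2.1),
    Alignment("B", "C", 60, 45.0, 1.8),
    Alignment("C", "D", 90, 20.0, 2.0),
]
strict = connected_components(domain_network(domains, alignments, 75, 30.0, 2.5))
lax = connected_components(domain_network(domains, alignments, 50, 30.0, 3.0))
print([sorted(c) for c in strict])
print([sorted(c) for c in lax])
```

In the toy data, relaxing the length cutoff admits the B–C alignment and merges two components. This is the same qualitative effect the thresholds have on the real network.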
Protein Space Includes Continuous and Discrete Regions. The connectivity of the domain network varies depending on the thresholds used to define the evolutionary relationships (Fig. 2 and SI Appendix, Figs. S1–S4). If we consider the relatively lax thresholds of 50 residues, 30% sequence similarity, and 3-Å rmsd, then the resulting domain network is virtually a single connected component (including 9,385 or 97% of the domains). For more stringent thresholds, which we consider to represent evolutionary relationships more faithfully, the network reveals both continuous and discrete regions of protein space (Fig. 2 and SI Appendix, Figs. S2 and S3). At even more stringent length and similarity thresholds the network falls apart completely (e.g., SI Appendix, Fig. S4). SI Appendix, Fig. S5 shows the stacked histograms of the sizes of the connected components of representative networks. Indeed, using longer length, higher percent similarity, or lower rmsd thresholds results in a more disconnected network, and places more domains in smaller components. Importantly, in all these cases, we see a single exceptionally large connected component.
SI Appendix, Fig. S6 shows the percent of domain pairs with the same SCOP fold that are in the same connected component in a domain network (x axis), versus the percent of pairs that have a different SCOP fold and that are not connected (y axis). We consider all pairs among the all-alpha, all-beta, alpha/beta, and alpha+beta domains (SI Appendix, Fig. S6A), and all pairs among the 61% of domains that are not alpha/beta (SI Appendix, Fig. S6B). Notice that when considering the region of protein space that does not include the alpha/beta domains (SI Appendix, Fig. S6B), the domain network captures the notion of fold far better, and fairly well overall. As expected, lax thresholds generate a network with larger connected components, and consequently the percent of domain pairs with the same fold that are connected is greater (higher values along the x axis), but there are also more domain pairs of different folds that are (inappropriately) connected (lower values along the y axis). The thresholds that generate domain networks that overall best agree with SCOP fold assignments are either (i) alignments longer than 75 residues, with percent similarity greater than 30%, and rmsd smaller than 2.5 Å, or (ii) alignments longer than 55 residues, percent similarity greater than 30%, and rmsd smaller than 2 Å.
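The two axes of this comparison can be computed directly from the connected components and the fold labels. The sketch below is one plausible reading of the measure, not the authors' implementation: the function name `fold_agreement`, the toy domains, and the fold labels `f1` and `f2` are hypothetical. It enumerates all pairs explicitly, which is fine for illustration but quadratic in the number of domains.

```python
from itertools import combinations


def fold_agreement(components, fold_of):
    """Return (x, y) as percentages.

    x: same-fold domain pairs that share a connected component.
    y: different-fold domain pairs that lie in different components.
    """
    component_of = {d: i for i, comp in enumerate(components) for d in comp}
    same_total = same_connected = diff_total = diff_separated = 0
    for a, b in combinations(sorted(component_of), 2):
        connected = component_of[a] == component_of[b]
        if fold_of[a] == fold_of[b]:
            same_total += 1
            same_connected += connected
        else:
            diff_total += 1
            diff_separated += not connected
    x = 100.0 * same_connected / same_total if same_total else float("nan")
    y = 100.0 * diff_separated / diff_total if diff_total else float("nan")
    return x, y


# Toy example: two components and two made-up folds.
components = [{"A", "B", "C"}, {"D"}]
fold_of = {"A": "f1", "B": "f1", "C": "f2", "D": "f2"}
x, y = fold_agreement(components, fold_of)
print(x, y)
```

A threshold combination that reproduced the fold assignments perfectly would score 100 on both axes. Here C is wrongly attached to the `f1` component and separated from D, so both values drop.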
SI Appendix, Fig. S7 shows the same analysis per SCOP class. We see that in the all-beta class, and to a lesser extent in the alpha+beta class, our optimal thresholds can generally identify SCOP folds and place domains of the same fold in the same connected component, while still being disconnected from the domains that are not in that fold (high values along the x and y axes). In the alpha/beta class, and to a lesser extent in the all-alpha class, if we want to successfully connect domains that are in the same fold (i.e., achieve high values along the x axis), we inevitably connect to domains that are not in the same fold (low values along the y axis). We fail to find threshold combinations that are successful along both axes.
Fig. 1. Constructing the domain and motif networks. (A) The aligned protein segments, marked in colors, are the motifs. (B) In the domain network, edges connect domains that share similar motifs (e.g., domains d1wjga_ and d1vlua_, which share the cyan motif). (C) In the motif network, edges connect cooccurring motifs (e.g., the orange and cyan motifs cooccur in the d1vlua_ domain).
The Continuous Region of the Protein Universe Contains the Alpha/Beta Domains; the All-Alpha, All-Beta, and Alpha+Beta Domains Are in the More Discrete Region. Fig. 2 shows the domain network that best reproduces SCOP's classification into folds. It is balanced in that it connects a significant number of domains to each other even though it was obtained using conservative thresholds to represent evolutionary relationships. Like the networks obtained using all other (reasonable) thresholds, the network features both discrete and continuous regions. For the most part, SCOP domains of the all-alpha, all-beta, and alpha+beta classes, in which alpha and beta elements do not mix, populate the discrete parts (Fig. 2); roughly speaking, in this region the connected components correspond to SCOP folds (SI Appendix, Fig. S8). In contrast, the alpha/beta domains, with alternating alpha and beta elements, populate the large connected component. This continuous region includes domains from most alpha/beta folds, including the TIM barrels, NAD(P) binding Rossmanns, FAD/NAD(P) binding, and many more (SI Appendix, Fig. S8); the domains of each fold are often found in very close vicinity to each other in the main connected component. It is known that individual folds (e.g., TIM barrels) have undergone circular permutations and splicing within their respective folds (28). Our analysis indicates that the evolutionary relationships extend beyond the individual fold, covering, in essence, the entire alpha/beta SCOP class.
If we consider the similarity of sequence and structure as indicative of evolutionary relationships among proteins, Fig. 2 can be interpreted as a collection of evolutionary paths in protein space. For example, Fig. 3 shows a path passing between domains from the FAD/NAD(P) binding, TIM barrel, Rossmann NAD(P) binding, nucleotide binding, and Flavodoxin folds. In this network, there is a path between 77% of the alpha/beta domain pairs, whereas paths between a pair of domains from the all-alpha, all-beta, and alpha+beta classes are found only in 10%, 6%, and 8% of the cases, respectively. This is not an artifact of the different number of domains in the four SCOP classes: we see similar numbers when we randomly sample 1,000 domains from each SCOP class. The large number of paths within the alpha/beta SCOP class suggests that it is particularly easy to add and delete motifs among them without impeding structural stability. It could be that evolution took advantage of this property to design new proteins with novel functions.
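The fraction of domain pairs joined by a path follows from component membership alone, because two domains are joined by a path exactly when they lie in the same connected component. The sketch below assumes a precomputed mapping `component_of` from each domain to its component in the whole network; the function names, toy labels, and the seeded subsampling helper are hypothetical illustrations, not the authors' procedure.

```python
import random
from collections import Counter
from math import comb


def fraction_connected_pairs(component_of, members):
    """Fraction of unordered pairs of `members` lying in the same component."""
    members = list(members)
    total = comb(len(members), 2)
    if total == 0:
        return 0.0
    counts = Counter(component_of[d] for d in members)
    return sum(comb(n, 2) for n in counts.values()) / total


def sampled_fraction(component_of, members, sample_size, seed=0):
    """Same measure on a random subsample, to control for class size."""
    members = list(members)
    rng = random.Random(seed)
    chosen = rng.sample(members, min(sample_size, len(members)))
    return fraction_connected_pairs(component_of, chosen)


# Toy example: component labels for seven made-up domains in two classes.
component_of = {"a1": 0, "a2": 0, "a3": 0, "a4": 1, "b1": 2, "b2": 3, "b3": 3}
class_a = ["a1", "a2", "a3", "a4"]
class_b = ["b1", "b2", "b3"]
print(fraction_connected_pairs(component_of, class_a))
print(fraction_connected_pairs(component_of, class_b))
```

Counting pairs per component avoids enumerating every pair, which matters for a dataset of several thousand domains.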
Fig. 4 shows the network of 8,219 recurring motifs, obtained using the same parameters used for the domain network of Fig. 2. Of these, 994 are nonsingletons. The number of different domains represented in a motif is, by definition, greater than 2, and almost always less than 50 (983, or 99%, of the motifs that are nonsingletons). In 82% of these cases (810 motifs) all of the domains in the motif have the same SCOP class; only in 31% of these cases (311 motifs) do they have the same SCOP fold. Recurring motifs are very common. We see that the (green) alpha/beta motifs are the most abundant: the percent (number) of nonsingleton motifs that are all-alpha, all-beta, alpha+beta, and alpha/beta is 22%, 4%, 4%, and 55% (223, 43, 44, and 547), respectively; 28% (279) of the nonsingleton motifs have an equal part of two classes. The weak connection between motifs taken from domains of the alpha/beta and all-alpha classes is mediated by superimpositions of small domains on excessively large domains. Had these larger domains been divided into smaller domains, the vast majority of motifs from the all-alpha domains would separate from the main connected component.
Fig. 2. Global view of protein space via the domain network. The nodes represent the set of 70% sequence nonredundant SCOP domains, colored by their SCOP class (see color legend); edges connect domains that share a motif. Here, two domains are connected if we found a similarity of at least 75 residues, with at least 25% sequence similarity, and at most 2.5 Å rmsd. We see that there are two regions: one is very connected, or continuous, and populated mostly by (green) alpha/beta domains in which the alpha and beta elements alternate; the other is discrete, composed of many disconnected components, and populated by the all-alpha, all-beta, and alpha+beta domains. Only components with more than 10 domains are shown.
The Domain Network Reveals the Continuous–Discrete Nature of Protein Space. The question of whether protein space is continuous or discrete has been extensively debated (2, 3, 5–10), and is interesting both fundamentally and for its implications on how to organize and search protein databases (5, 16). The domain network allows us to describe "continuous" and "discrete" more concretely based on the sizes and number of connected components. We find that protein space has both discrete and continuous regions, in agreement with Sadreyev et al. (7), and that the distinction largely depends on the domains' SCOP class: continuity is most prevalent among the alpha/beta domains whereas the region of the all-alpha, all-beta, and alpha+beta domains is mostly discrete. Skolnick et al. attributed the continuity to physical properties of proteins and to backbone hydrogen bonds in particular (15). That alpha/beta domains are more interconnected than other SCOP classes suggests that the domains in this class share unique physicochemical qualities that are yet to be discovered.
Edges in the domain network are determined using specific thresholds. More lax thresholds imply more edges and hence a more connected network; in the extreme case all of protein space is a single connected component. Stricter thresholds imply fewer edges and hence a less connected network. Also, using a more sensitive method to identify similarity among domains will reveal a more connected network. Indeed, the method and the thresholds for inferring the relationships among domain pairs should fit the question at hand. We consider "local" relationships that represent domains closer and further apart in evolution and combine them into a "global" view of protein space to study its properties.
To connect domain pairs that are likely evolutionarily related, we verified that the domains share similar structure and sequence over a significant number of residues. Skolnick et al. (15) showed that when relating domain pairs based solely on the similarity of their structures (and a minimal TM-score threshold of 0.4), protein space is essentially a single connected component. Our work deals with what happens when we raise the metaphorical "bar" for relating two domains, and enforce that the domain pairs are likely evolutionarily related (using a range of thresholds). Indeed, even in this stricter setting, if the thresholds are sufficiently lax (namely, at least 50 residues with more than 25% sequence identity and rmsd less than 3 Å) virtually all of protein space is connected, suggesting that protein space is evolutionarily (not only structurally) connected. However, if we consider stricter thresholds, and specifically ones which were calibrated to best capture the connectivity of SCOP folds, then protein space disintegrates, and this disintegration is generally in the region of non-alpha/beta domains.
One could argue that all of fold space is discrete, and that each SCOP class merely requires different thresholds to disintegrate. Our data show that this is not the case. To learn this, we focused on each of the four SCOP classes, and searched for optimal thresholds resulting in networks that capture SCOP fold connectivity. Recall that a successful network simultaneously keeps same-fold domains connected, and disconnects them from domains in different folds. The success stems directly from the properties of the class of domains: If a class has a more discrete nature, that is, if its intrafold similarities are greater than its interfold similarities, then we can find appropriate thresholds. If, on the other hand, it has a more continuous nature, then by using increasingly strict thresholds to relate domain pairs, the domain network will disintegrate, but it will do so altogether, and lose the property that same-fold domains remain in the same connected component. Indeed, we see that the SCOP classes vary in how well the best thresholds capture their fold connectivity: the all-beta domains have the most discrete nature, followed by the alpha+beta domains, the all-alpha domains, and finally the alpha/beta domains that have the most continuous nature (SI Appendix, Fig. S7).
We construct the dataset of likely evolutionary relationships using two steps: (i) searching for candidate domain pairs, and then (ii) verifying that their corresponding subparts satisfy predefined length, sequence similarity, and structure similarity criteria. For the first step, we used the structural aligner SSM (27). However, structural aligners vary in the relationships that they identify: some are more sensitive than others (29, 30). Here, we chose SSM because it was shown to be particularly sensitive (30). The search procedure can be augmented using additional structural aligners [e.g., Matt (29), STRUCTAL (31), or TM-align (15)]. Hopefully, these can identify additional candidate evolutionary relationships, which we can subsequently subject to the similarity filters in step (ii).
Fig. 3. "Walking" in the domain network. A putative evolutionary path, to demonstrate the relationships between connected domains. The path, taken from the major connected component, passes through eight domains from five different SCOP folds of the alpha/beta class. The aligned motifs are marked in orange or cyan; residues shared by the motifs in both directions along the path are in magenta. The number of residues, rmsd, and percent sequence similarity (using BLOSUM62) of the aligned motifs are indicated.
The Motif Network Reveals the Ubiquitous Reuse of Motifs in Nature. Previous studies of cooccurrence networks use domains as the unit element (24, 26, 32, 33). In those networks, nodes represent domains, and edges connect cooccurring domains. Our motif network is similar, except that its nodes represent motifs, which are smaller than domains. The distributions of the number of neighbors in the domain cooccurrence network and in our motif network are similar (24). Also, the alpha/beta motifs and domains tend to have more partners (or a higher rank) in their respective networks (24).
Importantly, we derive the unit element (or nodes) in the cooccurrence network from the data rather than relying on predefined (e.g., SCOP/CATH) domains. Domains were used because they are considered the basic unit of protein evolution (34). It is assumed that there is only a limited set of them, and domains from this set are combined to form the set of proteins in the proteome using genetic mechanisms (24). For example, genetic recombination can cause loss or duplication of parts of genes, entire genes, or even longer chromosomal regions; mobile genetic elements (DNA transposons and retrotransposons) can lead to duplication or deletions (26). It may be, however, as suggested by Lupas et al., that the basic unit is actually smaller than a domain (35). Our tools offer a way to further investigate this idea and demonstrate the abundance of mix-and-join events. In this respect it is noteworthy that whereas domains are considered to be autonomous structural units, which are stable on their own, it may well be that the motifs are not, and that despite their ability to "hop between domains" they are stable only within the context of the intact domain. Note that we have used the same thresholds in the motif and domain networks. These thresholds are not necessarily the best ones to highlight all significant similarity at the subdomain level. Future in-depth study is required to better understand the properties of motifs and their networks with more lax thresholds. Regardless of the actual evolutionary scenario underlying the motif network, the network lends itself naturally to protein engineering efforts by suggesting which substructures can replace one another while maintaining protein foldability. Just as evolution has recycled such motifs, so could protein engineers, enriching the topologies of engineered proteins and their likelihood of performing new functions.
Alpha/Beta Domains Are Unique. Previous work showed that the alpha/beta domains are older (19), more stable (36), more frequently involved in domain fusion events (32), and are associated with high functional diversity (20). Our analysis shows two additional unique features of these domains: they lie in a tightly connected region of protein space and their motifs mix-and-join with a wider range of motifs. The tendency of the alpha/beta domains to easily mix-and-join could explain their functional diversity. Two alternative explanations for these properties of the alpha/beta domains and motifs are (i) they existed in ancient evolutionary history, and were mixed from these entities (35, 37), or (ii) their biophysical properties give them a selection advantage. Our observations do not help in determining which of the two explanations is more likely, and this remains a significant challenge.
We provide tools for navigation in the domain and motif
networks by integrating Cytoscape (38) and PyMOL (39). To
visualize our networks, download the Cytoscape files describing
them at
The networks could be used to theorize about protein evolution, suggest evolutionary pathways between domains, and hence perhaps suggest strategies for protein design.
Dataset. Our dataset consists of 9,710 domains that are 70% sequence
nonredundant from the SCOP database. We filtered away domains whose
structures were not accurately determined [a Summary PDB ASTRAL Check
Index score (40) lower than 0.2]. We aligned all-versus-all domains using the
structural alignment method SSM (27). We parsed the alignments, measured
their length (i.e., number of aligned residues), and calculated the percent of
identical residues, and the percent of similar residues (using the BLOSUM62
matrix). From these data, we constructed and visualized the domain and the
motif networks using Cytoscape (38).
The Domain Network. The nodes in the domain network represent the
domains in the dataset (Fig. 1A); a single edge connects two nodes if we
found a significant alignment of sufficiently many residues, sufficiently low
rmsd, and sufficiently high percent sequence similarity (Fig. 1B). We con-
sidered different thresholds of alignment length (55 and 75 residues), rmsd
(2, 2.5, and 3 Å), and percent sequence similarity (30%, 40%, and 50%).
Fig. 4. Global view of protein space via the motif network. The nodes represent the set of 8,219 identified motifs, colored by the SCOP class of the majority of their domains (see color legend; white represents cases where no SCOP class is the majority); edges connect motifs that cooccur in a domain. The motif network was constructed using the set of alignments that are longer than 75 residues, with more than 25% sequence similarity, and less than 2.5 Å rmsd (Methods). We see that the alpha/beta (and the all-alpha) motifs are more common, more gregarious, and form the largest connected component.
The Motif Network. The motif graphs offer an alternative representation of the same alignment data (Fig. 1C). The first step is to identify the nodes of the motif graph. An alignment, A, matches a set of residues in protein P1 with a set of residues in protein P2; here, we denote these subsections P1A and P2A. As evidenced by the alignment itself, P1A and P2A are two names of a similar subsection. This subsection can have additional names: consider another alignment, B, which matches subsections P1B and P3B. If the residues in subsection P1B are actually the same ones as those in P1A, then P1B and P3B are also names of this subsection. Thus, we need to identify the different names (in the example given here: P1A, P2A, P1B, P3B) that describe similar subsections. To do this, we constructed an auxiliary graph, in which the nodes are the raw subsections extracted directly from the set of significant alignments (two per alignment); in the example described the nodes in the auxiliary graph will include the nodes P1A, P2A, P1B, P3B. In the auxiliary graph we connect pairs of subsections associated with each alignment (one edge per alignment); in the example these will be the edges between P1A and P2A, and between P1B and P3B. In the auxiliary graph we also connect (almost) similar subsections of the same domain; in the example given above this is an edge between P1A and P1B. For this, we used a threshold of 90% overlap (e.g., we connected the motifs that represent residues 1–100 and residues 2–101 of the same domain). Each connected component in the auxiliary graph is a node in the motif network. In other words, each node in the motif graph is a set of recurring subsections.
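The auxiliary-graph construction can be sketched with a disjoint-set (union-find) structure, which yields the connected components without building the graph explicitly. This is a minimal sketch under stated assumptions, not the authors' code. A subsection is represented as a `(domain, residues)` pair, so identical subsections from different alignments collapse into one key. Overlap is taken as the shared residues divided by the larger subsection's size, which is one possible reading of "90% overlap". The names `DisjointSet`, `span`, `overlap_fraction`, and `build_motifs` are hypothetical.

```python
class DisjointSet:
    """Minimal union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def span(domain, first, last):
    """Subsection of `domain` covering residues first..last (inclusive)."""
    return (domain, frozenset(range(first, last + 1)))


def overlap_fraction(res_a, res_b):
    # Assumed definition: shared residues relative to the larger subsection.
    return len(res_a & res_b) / max(len(res_a), len(res_b))


def build_motifs(alignments, min_overlap=0.9):
    """Group aligned subsections into motifs (sets of subsections)."""
    sets = DisjointSet()
    by_domain = {}
    for sub1, sub2 in alignments:
        sets.union(sub1, sub2)                    # one edge per alignment
        for sub in (sub1, sub2):
            by_domain.setdefault(sub[0], set()).add(sub)
    for subs in by_domain.values():               # edges within a domain
        subs = list(subs)
        for i in range(len(subs)):
            for j in range(i + 1, len(subs)):
                if overlap_fraction(subs[i][1], subs[j][1]) >= min_overlap:
                    sets.union(subs[i], subs[j])
    motifs = {}
    for subs in by_domain.values():
        for sub in subs:
            motifs.setdefault(sets.find(sub), set()).add(sub)
    return list(motifs.values())


# Toy alignments mirroring the P1/P2/P3 example, plus an unrelated pair.
toy_alignments = [
    (span("P1", 1, 100), span("P2", 5, 104)),    # alignment A
    (span("P1", 2, 101), span("P3", 10, 109)),   # alignment B
    (span("P4", 1, 60), span("P5", 1, 60)),      # alignment C
]
motifs = build_motifs(toy_alignments)
print(sorted(len(m) for m in motifs))
```

Residues 1–100 and 2–101 of P1 overlap by 99%, so alignments A and B are merged into one motif with four subsections, while alignment C forms a separate motif.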
To generate a clearer motif network, we added a few more steps. First,
even when using the 90% overlap threshold, we may suffer from a "dragging"
effect, where we start with one subsection and then, via a series of
intermediate subsections that are 90% similar to each other, reach another
subsection of vastly different size. To circumvent this problem, we
greedily split motifs in which the ratio between the longest and shortest
subsection is greater than 1.5. Also, we remove motifs that we identify as
supermotifs of other motifs in the dataset: if motif1 includes subsection PA
and motif2 includes subsection PB, and all residues in subsection PB are also
in subsection PA, then we consider motif1 a supermotif of motif2, and remove
it. The edges in the motif network connect motif pairs for which there are
subsections of the same domain in both motifs.
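The three post-processing steps can be sketched as follows. The text does not specify the exact greedy splitting procedure, so the version here (sort subsections by length and start a new group whenever a subsection exceeds 1.5 times the shortest member of the current group) is one possible interpretation. The function names and the (domain, first residue, last residue) representation are likewise hypothetical.

```python
from itertools import combinations

def length(sub):
    _, start, end = sub
    return end - start + 1

def split_by_length(motif, max_ratio=1.5):
    """Greedy split (assumed procedure): within each resulting group the
    longest/shortest length ratio is at most max_ratio."""
    groups, current = [], []
    for sub in sorted(motif, key=length):
        if current and length(sub) > max_ratio * length(current[0]):
            groups.append(set(current))
            current = []
        current.append(sub)
    if current:
        groups.append(set(current))
    return groups

def contains(big, small):
    """True if every residue of `small` lies in `big` (same domain)."""
    return big[0] == small[0] and big[1] <= small[1] and small[2] <= big[2]

def drop_supermotifs(motifs):
    """Remove every motif that has a subsection containing a subsection
    of some other motif."""
    kept = []
    for i, m1 in enumerate(motifs):
        is_super = any(
            contains(pa, pb)
            for j, m2 in enumerate(motifs) if i != j
            for pa in m1 for pb in m2
        )
        if not is_super:
            kept.append(m1)
    return kept

def motif_edges(motifs):
    """Connect motifs (by index) that have subsections in a common domain."""
    domains = [{sub[0] for sub in m} for m in motifs]
    return {(i, j) for i, j in combinations(range(len(motifs)), 2)
            if domains[i] & domains[j]}

if __name__ == "__main__":
    # Hypothetical motif whose subsection lengths are 40, 42, 96 and 100.
    dragged = {("d1", 1, 40), ("d2", 5, 46), ("d3", 1, 100), ("d4", 3, 98)}
    print(split_by_length(dragged))
```

Applied literally, the supermotif rule removes both motifs when each contains a subsection of the other in different domains; whether the original pipeline handles that case differently is not stated.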
Data Visualization. We added an interface for viewing structural information
using PyMOL (39). In the domain network we visualize the domains that
correspond to the nodes, as well as the domain superimpositions that cor-
respond to the edges; the aligned residues are highlighted. In the motif
network an edge is a domain that includes both motifs at its end nodes: we
show the two motifs in cyan and in orange, with the overlapping residues
in magenta; if there is more than one possible domain, the user needs to
choose the one to visualize. For the nodes in the motif network, we visualize
two domains with these motifs superimposed on one another.
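The coloring scheme for a motif-network edge can be illustrated by generating a short PyMOL command script as plain text. This is a sketch under stated assumptions, not the interface described above: the file name, object name, and residue ranges are hypothetical, each motif is assumed to be a contiguous range of positive residue numbers, and only the standard PyMOL `load` and `color` commands with `resi` range selections are used.

```python
def edge_script(pdb_path, obj, motif_a, motif_b):
    """Return PyMOL commands that color motif_a cyan, motif_b orange,
    and their shared residues magenta. Motifs are (first, last) ranges."""
    (a0, a1), (b0, b1) = motif_a, motif_b
    commands = [
        f"load {pdb_path}, {obj}",
        f"color white, {obj}",
        f"color cyan, {obj} and resi {a0}-{a1}",
        f"color orange, {obj} and resi {b0}-{b1}",
    ]
    lo, hi = max(a0, b0), min(a1, b1)
    if lo <= hi:
        # Issued last so that it overrides the two motif colors.
        commands.append(f"color magenta, {obj} and resi {lo}-{hi}")
    return "\n".join(commands)

if __name__ == "__main__":
    # Hypothetical domain file and motif ranges.
    print(edge_script("domain.pdb", "dom", (10, 60), (40, 90)))
```

The resulting text could be saved to a `.pml` file and run in PyMOL; the magenta command is placed last because later `color` commands take precedence over earlier ones for the same atoms.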
ACKNOWLEDGMENTS. We thank Yonatan Bilu, Sarel Fleishman, and Dan
Tawfik for insightful discussions, Varda Wexler for graphics consulting, and
the anonymous reviewers for helpful comments. N.B.-T. acknowledges the
financial support of Grant 1775/12 of the Israeli Centers of Research
Excellence Program of the Planning and Budgeting Committee and the
Israel Science Foundation.
1. Levitt M, Chothia C (1976) Structural patterns in globular proteins. Nature 261(5561):
2. Kolodny R, Pereyaslavets L, Samson AO, Levitt M (2013) On the universe of protein folds. Annu Rev Biophys 42:559–582.
3. Taylor WR (2007) Evolutionary transitions in protein fold space. Curr Opin Struct Biol
4. Berman HM, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28(1):235–242.
5. Kolodny R, Petrey D, Honig B (2006) Protein structure comparison: Implications for the nature of fold space, and structure and function prediction. Curr Opin Struct Biol 16(3):393–398.
6. Pascual-García A, Abia D, Ortiz ÁR, Bastolla U (2009) Cross-over between discrete and continuous protein structure space: Insights into automatic classification and networks of protein structures. PLOS Comput Biol 5(3):e1000331.
7. Sadreyev RI, Kim B-H, Grishin NV (2009) Discrete-continuous duality of protein structure space. Curr Opin Struct Biol 19(3):321–328.
8. Sadowski MI, Taylor WR (2010) On the evolutionary origins of Fold Space Continuity: A study of topological convergence and divergence in mixed alpha-beta domains. J Struct Biol 172(3):244–252.
9. Harrison A, Pearl F, Mott R, Thornton J, Orengo C (2002) Quantifying the similarities within fold space. J Mol Biol 323(5):909–926.
10. Valas RE, Yang S, Bourne PE (2009) Nothing about protein structure classification makes sense except in the light of evolution. Curr Opin Struct Biol 19(3):329–334.
11. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol
12. Orengo CA, et al. (1997) CATH – a hierarchic classification of protein domain structures. Structure 5(8):1093–1108.
13. Andreeva A, Prlić A, Hubbard TJP, Murzin AG (2007) SISYPHUS – structural alignments for proteins with non-trivial relationships. Nucleic Acids Res 35(Database issue, suppl 1):
14. Shindyalov IN, Bourne PE (2000) An alternative view of protein fold space. Proteins
15. Skolnick J, Arakaki AK, Lee SY, Brylinski M (2009) The continuity of protein structure space is an intrinsic property of proteins. Proc Natl Acad Sci USA 106(37):15690–15695.
16. Petrey D, Honig B (2009) Is protein classification necessary? Toward alternative approaches to function annotation. Curr Opin Struct Biol 19(3):363–368.
17. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science
18. Ben-Tal N, Kolodny R (2014) Representation of the protein universe using classi-
fications, maps, and networks. Isr J Chem, in press.
19. Choi IG, Kim SH (2006) Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci USA 103(38):14056–14061.
20. Osadchy M, Kolodny R (2011) Maps of protein structure space reveal a fundamental relationship between protein structure and function. Proc Natl Acad Sci USA 108(30):
21. Holm L, Sander C (1996) Mapping the protein universe. Science 273(5275):595–603.
22. Dokholyan NV, Shakhnovich B, Shakhnovich EI (2002) Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci USA 99(22):
23. Alva V, Remmert M, Biegert A, Lupas AN, Söding J (2010) A galaxy of folds. Protein Sci
24. Apic G, Gough J, Teichmann SA (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310(2):311–325.
25. Wuchty S (2001) Scale-free behavior in protein domain networks. Mol Biol Evol 18(9):
26. Forslund K, Sonnhammer EL (2012) Evolution of Protein Domain Architectures. Evolutionary Genomics (Springer, Berlin), pp 187–216.
27. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 60(Pt 12 Pt 1):2256–2268.
28. Söding J, Lupas AN (2003) More than the sum of their parts: On the evolution of proteins from peptides. BioEssays 25(9):837–846.
29. Daniels NM, Kumar A, Cowen LJ, Menke M (2012) Touring protein space with Matt. IEEE/ACM Trans Comput Biol Bioinformatics 9(1):286–293.
30. Kolodny R, Koehl P, Levitt M (2005) Comprehensive evaluation of protein structure alignment methods: Scoring by geometric measures. J Mol Biol 346(4):1173–1188.
31. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3(3):141–148.
32. Hua S, Guo T, Gough J, Sun Z (2002) Proteins with class α/β fold have high-level participation in fusion events. J Mol Biol 320(4):713–719.
33. Basu MK, Carmel L, Rogozin IB, Koonin EV (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Res 18(3):449–461.
34. Chothia C, Gough J, Vogel C, Teichmann SA (2003) Evolution of the protein repertoire. Science 300(5626):1701–1703.
35. Lupas AN, Ponting CP, Russell RB (2001) On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134(2-3):191–203.
36. Minary P, Levitt M (2008) Probing protein fold space with a simplified model. J Mol Biol 375(4):920–933.
37. Ponting CP, Russell RR (2002) The natural history of protein domains. Annu Rev Biophys Biomol Struct 31(1):45–71.
38. Saito R, et al. (2012) A travel guide to Cytoscape plugins. Nat Methods 9(11):
39. Schrodinger, LLC (2010) The PyMOL Molecular Graphics System, Version 1.3r1.
Available at
40. Brenner SE, Koehl P, Levitt M (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 28(1):254–256.
... A likely scenario is that they evolved by duplication and fusion of short polypeptides with at least marginal stability, and weak biological functionality, sufficient for their preference over random alternatives. By mining protein databases [6][7][8][9], one can computationally search for traces of the evolutionary events that shaped the current protein universe, such as mutations, duplications, and recombinations of short protein segments (e.g., [4,[10][11][12][13][14][15]). Convergence is also a scenario that may result in sequence similarity. ...
... Indeed, sequence similarity among segments shorter than domains has also been described [4,12,13,16,[37][38][39][40]. In fact, we have observed that the number of statistically significant similar segments increases with the decrease in their length (number of amino acids) [4]. ...
... Organizing the representative examples as an overview network manifests the regions of the protein universe that they traverse. We find that the 'alpha+beta' (22 instances) are the most common, followed in descending order by 'all alpha' X-groups (20 instances), 'all-beta' (13), 'others' (11), and finally the 'alpha/beta' (6) and 'mixed alpha+beta and alpha/beta' X-groups (1). There are connections among X-groups of the same architecture (or A-class) [66], but also many that cross class boundaries and involving all class combinations. ...
The vast majority of theoretically possible polypeptide chains do not fold, let alone confer function. Hence, protein evolution from preexisting building blocks has clear potential advantages over ab initio emergence from random sequences. In support of this view, sequence similarities between different proteins is generally indicative of common ancestry, and we collectively refer to such homologous sequences as ‘themes’. At the domain level, sequence homology is routinely detected. However, short themes which are segments, or fragments of intact domains, are particularly interesting because they may provide hints about the emergence of domains, as opposed to divergence of preexisting domains, or their mixing-and-matching to form multi-domain proteins. Here we identified 525 representative short themes, comprising 20-to-80 residues, that are unexpectedly shared between domains considered to have emerged independently. Among these ‘bridging themes’ are ones shared between the most ancient domains, e.g., Rossmann, P-loop NTPase, TIM-barrel, Flavodoxin, and Ferredoxin-like. We elaborate on several particularly interesting cases, where the bridging themes mediate ligand binding. Ligand binding may have contributed to the stability and the plasticity of these building blocks, and to their ability to invade preexisting domains or serve as starting points for completely new domains.
... Such classification is powerful, foremost because it can serve as a framework for understanding how proteins within a family change over time, but also because it naturally lends itself to naming and thus aids in communication (as is true for taxonomy in general). Nevertheless, this notion of separability is, ultimately, an approximation: Regardless of whether structure [1,2], sequence [3][4][5][6], or both structure and sequence [7,8] are considered, similar segments between proteins that lack global sequence identity are detectable and common. Taken together, these studies emphasize the interconnected, 'patchwork' nature of protein evolution [9] and reveal that proteins might be best understood as being comprised of multiple segments, each with its own independent evolutionary history or structural properties. ...
... One protein lineage of particular interest as a model system of patchwork evolution is the β-trefoil (ECOD X-group 6), which is characterized by a common ancestor [12] and a pseudo-three-fold axis of rotational symmetry. Although originally proposed to be related to EGF (X-group 389) and ecotin (X-group 521) by gene duplication and fusion [13], proteome-wide sequence analyses [3,6,7] and an experimental fragmentation study [14] have found no support for this hypothesis; instead, the ancestral state of the β-trefoil was most likely a trimerizing peptide that recapitulated the β-trefoil fold. Given that β-trefoils seem to comprise a rare island in an otherwise highly-connected sequence-structure landscape, the origins of the precursor β-trefoil peptide are unclear and potentially the result of a de novo emergence event. ...
... We, and others [4,8], have previously analyzed patterns of global sequence fragment sharing across the protein universe, both with [7] and without [6] structural constraints. In all cases, connections to the β-trefoil were either marginal or absent. ...
As sequence and structure comparison algorithms gain sensitivity, the intrinsic interconnectedness of the protein universe has become increasingly apparent. Despite this general trend, β-trefoils have emerged as an uncommon counterexample: They are an isolated protein lineage for which few, if any, sequence or structure associations to other lineages have been identified. If β-trefoils are, in fact, remote islands in sequence-structure space, it implies that the oligomerizing peptide that founded the β-trefoil lineage itself arose de novo . To better understand β-trefoil evolution, and to probe the limits of fragment sharing across the protein universe, we identified both ‘β-trefoil bridging themes’ (evolutionarily-related sequence segments) and ‘β-trefoil-like motifs’ (structure motifs with a hallmark feature of the β-trefoil architecture) in multiple, ostensibly unrelated, protein lineages. The success of the present approach stems, in part, from considering β-trefoil sequence segments or structure motifs rather than the β-trefoil architecture as a whole, as has been done previously. The newly uncovered inter-lineage connections presented here suggest a novel hypothesis about the origins of the β-trefoil fold itself–namely, that it is a derived fold formed by ‘budding’ from an Immunoglobulin-like β-sandwich protein. These results demonstrate how the evolution of a folded domain from a peptide need not be a signature of antiquity and underpin an emerging truth: few protein lineages escape nature’s sewing table.
... In general, protein comparisons can be distinguished according to whether they search for 'global', 'local', or 'glocal' similarities. Yet, confusingly, comparison studies also use these terms in a different meaning, to describe the extent of the space studied -such that a 'global' refers to a comparison across all proteins (e.g. the global view of protein space [15,31]), whereas a 'local' refers to comparison within a smaller set (e.g. a set of homologous proteins [32]). Here, however, we use these terms in their other common usage, namely, as modifiers that describe the extent of similarity within compared objects (e.g. protein domains). ...
... Notably, however, some of these structural similarities are accompanied by significant cross-fold local (i.e. sub-domain-level) sequence similarities, and these suggest homology: for example, those found by Alva et al. [18], Nepomnyachiy et al. [15], and Ferruz et al. [44 ]. ...
... Together, these observations imply that if we were to model protein similarities as edges in a graph (or network) connecting nodes that represent domains, the hierarchical classifications would form graphs with many disconnected sub-graphs (e.g. one per fold); adding the cross-fold local similarities would connect these subgraphs [45]. Several studies have tried to explore these connections, shifting the curation of evolutionary relationships among domains to include local similarities, for example, Nepomnyachiy et al. [15], SISYPHUS [46], and most recently SCOP2 [47,48 ]. Nevertheless, evolutionary studies continue to rely primarily on global domain-level similarities, because the analysis of such similarities is simpler. ...
Evolutionary processes that formed the current protein universe left their traces, among them homologous segments that recur, or are ‘reused,’ in multiple proteins. These reused segments, called ‘themes,’ can be found at various scales, the best known of which is the domain. Yet, recent studies have begun to focus on the evolutionary insights that can be derived from sub-domain-scale themes, which are candidates for traces of more ancient events. Characterizing these may provide clues to the emergence of domains. Particularly interesting are themes that are reused across dissimilar contexts, that is, where the rest of the protein domain differs. We survey computational studies identifying reused themes within different contexts at the sub-domain level.
... As an example, Lupas and coworkers have constructed a vocabulary of ancient peptides that have led to modern folded proteins, borrowing from strategies to how linguists have compared modern languages to reconstruct ancient vocabularies [88]. Following from this, Kolodny and coworkers have presented a global view of the protein universe based on protein networks that connect domains that share fragments [89]. More recently, they have introduced the concept of 'themes' as recurrent fragments of short protein segments of at least 35 amino acids that are unexpectedly found in domains of independent evolutionary origin [75]. ...
... This was achieved by first segmenting all homologous structures within the protein family of interest (e.g., all GH10 xylanases) into modular parts, then performing an 'idealization' procedure that forces ideal bonds and relaxes the associated torsion angles, computing backbone conformational databases based on this information as well as enforcing position-specific sequence constraints based on the results of multiple sequence alignment, performing a precomputation step involving the design and ranking of thousands of unique backbones, assembling the resulting backbones, and, finally, performing stability optimization using the Protein Repair One Stop Shop (PROSS) [91] to generate a seamless structure. It is perhaps unsurprising that such a strategy would lead to highly efficient catalysts, as this appears to be the strategy also taken by nature itself in designing new proteins [74,75,89,92]. However, it opens a new (and highly promising) door for evolutionary-based computational protein design. ...
Recent years have seen an explosion of interest in understanding the physicochemical parameters that shape enzyme evolution, as well as substantial advances in computational enzyme design. This review discusses three areas where evolutionary information can be used as part of the design process: (i) using ancestral sequence reconstruction (ASR) to generate new starting points for enzyme design efforts; (ii) learning from how nature uses conformational dynamics in enzyme evolution to mimic this process in silico; and (iii) modular design of enzymes from smaller fragments, again mimicking the process by which nature appears to create new protein folds. Using showcase examples, we highlight the importance of incorporating evolutionary information to continue to push forward the boundaries of enzyme design studies.
... Another is that mutations cause plastic deformation that comparison methods should also take into account [1]. On the other hand, proteins reuse the same types of folds, but with the growing number of known protein structures, the global view is changing from discrete folds into considering larger parts of fold space as a continuum [3]. ...
... Hence, the 2 -move is preferred. The last self-intersection involves too many residues to remove, is denoted essential, and the length of the dotted end-contraction avoiding it is calculated The first aligned residue pairs between Chain 0 and Chain 1 are (3, 1), (4, 2) and (6,3). We treat weakly aligned pairs, indicated with ". " by TM-align, as aligned pairs indicated by ":" . ...
Background In computational structural biology, structure comparison is fundamental for our understanding of proteins. Structure comparison is, e.g., algorithmically the starting point for computational studies of structural evolution and it guides our efforts to predict protein structures from their amino acid sequences. Most methods for structural alignment of protein structures optimize the distances between aligned and superimposed residue pairs, i.e., the distances traveled by the aligned and superimposed residues during linear interpolation. Considering such a linear interpolation, these methods do not differentiate if there is room for the interpolation, if it causes steric clashes, or more severely, if it changes the topology of the compared protein backbone curves. Results To distinguish such cases, we analyze the linear interpolation between two aligned and superimposed backbones. We quantify the amount of steric clashes and find all self-intersections in a linear backbone interpolation. To determine if the self-intersections alter the protein’s backbone curve significantly or not, we present a path-finding algorithm that checks if there exists a self-avoiding path in a neighborhood of the linear interpolation. A new path is constructed by altering the linear interpolation using a novel interpretation of Reidemeister moves from knot theory working on three-dimensional curves rather than on knot diagrams. Either the algorithm finds a self-avoiding path or it returns a smallest set of essential self-intersections. Each of these indicates a significant difference between the folds of the aligned protein structures. As expected, we find at least one essential self-intersection separating most unknotted structures from a knotted structure, and we find even larger motions in proteins connected by obstruction free linear interpolations. 
We also find examples of homologous proteins that are differently threaded, and we find many distinct folds connected by longer but simple deformations. TM-align is one of the most restrictive alignment programs. With standard parameters, it only aligns residues superimposed within 5 Ångström distance. We find 42165 topological obstructions between aligned parts in 142068 TM-alignments. Thus, this restrictive alignment procedure still allows topological dissimilarity of the aligned parts. Conclusions Based on the data we conclude that our program ProteinAlignmentObstruction provides significant additional information to alignment scores based solely on distances between aligned and superimposed residue pairs.
... Several publications tried to reduce the large dimensionality of protein sequences into a few discernible dimensions for their analysis. Most representation methods consist of (i) hierarchical classifications of protein structures such as the ECOD and CATH databases (26,27), (ii) Cartesian representations (28), and (iii) similarity networks (29,30). We recently represented the structural space in a network that showed proteins as nodes, linked when they have a homologous and structurally similar fragment in common (31) and made the results available in the Fuzzle database (32). ...
Protein design aims to build new proteins from scratch, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in the field of natural language processing (NLP) has enabled the implementation of ever-growing language models capable of understanding and generating text with human-like capabilities. Given the many similarities between human languages and protein sequences, the use of NLP models offers itself for predictive tasks in protein research. Motivated by the evident success of generative Transformer-based language models such as the GPT-x series, we developed ProtGPT2, a language model trained on protein space that generates de novo protein sequences that follow the principles of natural ones. In particular, the generated proteins display amino acid propensities which resemble natural proteins. Disorder and secondary structure prediction indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 is sampling unexplored regions of protein space. AlphaFold prediction of ProtGPT2-sequences yielded well-folded non-idealized structures with embodiments as well as large loops and revealed new topologies not captured in current structure databases. ProtGPT2 has learned to speak the protein language. It has the potential to generate de novo proteins in a high throughput fashion in a matter of seconds. The model is easy-to-use and freely available.
... The size of the Protein Data Bank (PDB) has been growing steadily over the past decades (Burley et al., 2019), encouraging the development of computational tools enabling classification (Cheng et al., 2014;Andreeva et al., 2020;Dawson et al., 2017) and annotation (Dana et al., 2019) of protein structures it encompasses. The wealth of the collected data has opened up many research opportunities ranging from studies focused on proteins of a particular function or evolutionary position to comprehensive surveys aiming at capturing the general aspects of the "protein universe" (Nepomnyachiy et al., 2014;Alva et al., 2010). However, these innovations came at the cost of the dispersion of data sources, whose integration began to require expert knowledge and the tedious process of mapping structures to the corresponding sequence and functional information. ...
Motivation The wealth of protein structures collected in the Protein Data Bank enabled large-scale studies of their function and evolution. Such studies, however, require the generation of customized data sets combining the structural data with miscellaneous accessory resources providing functional, taxonomic, and other annotations. Unfortunately, the functionality of currently available tools for the creation of such data sets is limited and their usage frequently requires laborious surveying of various data sources and resolving inconsistencies between their versions. Results To address this problem, we developed localpdb, a versatile Python library for the management of protein structures and their annotations. The library features a flexible plugin system enabling seamless unification of the structural data with diverse auxiliary resources, full version control, and powerful functionality of creating highly customized data sets. The localpdb can be used in a wide range of bioinformatic tasks, in particular those involving large-scale protein structural analyses and machine learning. Availability localpdb is freely available at Documentation along with the usage examples can be accessed at
... Thus, in a regime of peptide fragments and because of cross-fold similarities that transcend the boundaries of superfamilies and folds (Alva et al. 2010;Friedberg and Godzik 2005;Pascual-García et al. 2009), the protein space is regarded as continuous and structure similarity should be described as a network, rather than a tree. Meanwhile, the possibility of generating a reliable hierarchical classification of protein domains has been questioned (Petrey and Honig 2009;Sadreyev et al. 2009;Skolnick et al. 2009) and the extent between continuity and discreteness of the protein space has urged further exploration (Pascual-García et al. 2009;Nepomnyachiy et al. 2014;Xu and Zhang 2016). ...
In the regime of domain classifications, the protein universe unveils a discrete set of folds connected by hierarchical relationships. Instead, at sub-domain-size resolution and because of physical constraints not necessarily requiring evolution to shape polypeptide chains, networks of protein motifs depict a continuous view that lies beyond the extent of hierarchical classification schemes. A number of studies, however, suggest that universal sub-sequences could be the descendants of peptides emerged in an ancient pre-biotic world. Should this be the case, evolutionary signals retained by structurally conserved motifs, along with hierarchical features of ancient domains, could sew relationships among folds that diverged beyond the point where homology is discernable. In view of the aforementioned, this paper provides a rationale where a network with hierarchical and continuous levels of the protein space, together with sequence profiles that probe the extent of sequence similarity and contacting residues that capture the transition from pre-biotic to domain world, has been used to explore relationships between ancient folds. Statistics of detected signals have been reported. As a result, an example of an emergent sub-network that makes sense from an evolutionary perspective, where conserved signals retrieved from the assessed protein space have been co-opted, has been discussed.
... Once the query protein structure is converted to an embedding vector, it can be easily used for many structure-based applications (Osadchy and Kolodny, 2011;Nepomnyachiy et al., 2014;Liu et al., 2018;Alam et al., 2019). Here, one application to classify SCOP superfamilies (Fox et al., 2014;Chandonia et al., 2019) is demonstrated. ...
Motivation General-purpose protein structure embedding can be used for many important protein biology tasks, such as protein design, drug design and binding affinity prediction. Recent researches have shown that attention-based encoder layers are more suitable to learn high-level features. Based on this key observation, we propose a two-level general-purpose protein structure embedding neural network, called ContactLib-ATT. On local embedding level, a biologically more meaningful contact context is introduced. On global embedding level, attention-based encoder layers are employed for better global representation learning. Results Our general-purpose protein structure embedding framework is trained and tested on the SCOP40 2.07 dataset. As a result, ContactLib-ATT achieves a SCOP superfamily classification accuracy of 82.4% (i.e., 6.7% higher than state-of-the-art method). On the same dataset, ContactLib-ATT is used to simulate a structure-based search engine for remote homologous proteins, and our top-10 candidate list contains at least one remote homolog with a probability of 91.9%. Contact and
Introduction: While the origin and evolution of proteins remain mysterious, advances in evolutionary genomics and systems biology are facilitating the historical exploration of the structure, function and organization of proteins and proteomes. Molecular chronologies are series of time events describing the history of biological systems and subsystems and the rise of biological innovations. Together with time-varying networks, these chronologies provide a window into the past. Areas covered: Here, we review molecular chronologies and networks built with modern methods of phylogeny reconstruction. We discuss how chronologies of structural domain families uncover the explosive emergence of metabolism, the late rise of translation, the co-evolution of ribosomal proteins and rRNA, and the late development of the ribosomal exit tunnel; events that coincided with a tendency to shorten folding time. Evolving networks described the early emergence of domains and a late 'big bang' of domain combinations. Expert opinion: Two processes, folding and recruitment appear central to the evolutionary progression. The former increases protein persistence. The later fosters diversity. Chronologically, protein evolution mirrors folding by combining supersecondary structures into domains, developing translation machinery to facilitate folding speed and stability, and enhancing structural complexity by establishing long-distance interactions in novel structural and architectural designs.
In the fifty years since the first atomic structure of a protein was revealed, tens of thousands of additional structures have been solved. Like all objects in biology, proteins structures show common patterns that seem to define family relationships. Classification of proteins structures, which started in the 1970s with about a dozen structures, has continued with increasing enthusiasm, leading to two main fold classifications, SCOP and CATH, as well as many additional databases. Classification is complicated by deciding what constitutes a domain, the fundamental unit of structure. Also difficult is deciding when two given structures are similar. Like all of biology, fold classification is beset by exceptions to all rules. Thus, the perspectives of protein fold space that the fold classifications offer differ from each other. In spite of these ambiguities, fold classifications are useful for prediction of structure and function. Studying the characteristics of fold space can shed light on protein evolution and the physical laws that govern protein behavior. Expected final online publication date for the Annual Review of Biophysics Volume 42 is May 06, 2013. Please see for revised estimates.
Cytoscape is open-source software for integration, visualization and analysis of biological networks. It can be extended through Cytoscape plugins, enabling a broad community of scientists to contribute useful features. This growth has occurred organically through the independent efforts of diverse authors, yielding a powerful but heterogeneous set of tools. We present a travel guide to the world of plugins, covering the 152 publicly available plugins for Cytoscape 2.5-2.8. We also describe ongoing efforts to distribute, organize and maintain the quality of the collection.
Comparing and subsequently classifying protein structure information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50–150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discrete and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well known folds indicative of the Russian doll effect—the continuity of protein fold space. In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed.
Substructure classification, alignments, and tools to analyze them are available at Proteins 2000;38:247–260. © 2000 Wiley-Liss, Inc.
The Protein Data Bank (PDB) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
A meaningful and coherent global picture of the protein universe is needed to better understand protein evolution and the underlying biophysics. We survey the studies that tackled this fundamental challenge, providing a glimpse of the protein space. A global picture represents all known local relationships among proteins, and needs to do so in a comprehensive and accurate manner. Three types of global representations can be used: classifications, maps, and networks. In these, the local relationships are derived, based on the similarity of the proteins' sequences, structures, or functions (or a combination of these). Alternatively, the local relationships can be co-occurrences of elements in the protein universe. The representations can be based on different objects: full polypeptide chains, fragments, such as structural domains, or even smaller motifs. Different protein qualities were revealed in each study; many point out the uniqueness of domains of the alpha/beta SCOP (structural classification of proteins) class.
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL scop: an old English poet or minstrel (Oxford English Dictionary); ckon: pile, accumulation (Russian Dictionary).
There is a limited repertoire of domain families that are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products, and at the level of genes, duplication, recombination, fusion and fission are the processes that produce new genes. We attempt to gain an overview of these processes by studying the evolutionary units in proteins, domains, in the protein sequences of 40 genomes. The domain and superfamily definitions in the Structural Classification of Proteins Database are used, so that we can view all pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 783 out of the 859 superfamilies in SCOP in these genomes, and the 783 families occur in 1307 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour; 209 families do not make combinations with other families. This type of pattern can be described as a scale-free network. We also study the N to C-terminal orientation of domain pairs and domain repeats. The phylogenetic distribution of domain combinations is surveyed, to establish the extent of common and kingdom-specific combinations. Of the kingdom-specific combinations, significantly more combinations consist of families present in all three kingdoms than of families present in one or two kingdoms. Hence, we are led to conclude that recombination between common families, as compared to the invention of new families and recombination among these, has also been a major contribution to the evolution of kingdom-specific and species-specific functions in organisms in all three kingdoms. Finally, we compare the set of the domain combinations in the genomes to those in the RCSB Protein Data Bank, and discuss the implications for structural genomics.
This paper presents and discusses evidence suggesting how the diversity of domain folds in existence today might have evolved from peptide ancestors. We apply a structure similarity detection method to detect instances where localized regions of different protein folds contain highly similar sequences and structures. Results of performing an all-on-all comparison of known structures are described and compared with other recently published findings. The numerous instances of local sequence and structure similarities within different protein folds, together with evidence from proteins containing sequence and structure repeats, argue in favor of the evolution of modern single polypeptide domains from ancient short peptide ancestors (antecedent domain segments (ADSs)). In this model, ancient protein structures were formed by self-assembling aggregates of short polypeptides. Subsequently, and perhaps concomitantly with the evolution of higher fidelity DNA replication and repair systems, single polypeptide domains arose from the fusion of ADSs genes. Thus modern protein domains may have a polyphyletic origin.