Comparing Genomes in terms of Protein Structure: Surveys of a Finite Parts List
Mark Gerstein *
Department of Molecular Biophysics & Biochemistry
266 Whitney Avenue, Yale University
PO Box 208114, New Haven, CT 06520
(203) 432-6105, FAX (203) 432-5175
* Corresponding author.
Keywords: Databank Census, Protein Fold, Bioinformatics
Running Title: Comparing Genomes in terms of Protein Folds
Manuscript is 43 Pages in Length (including this one)
Graphics of Figures follow at end in sequence.
Submitted to: FEMS Microbiology Reviews
Report Documentation Page (OMB No. 0704-0188)
Dates covered: 00-00-1998 to 00-00-1998
Title and subtitle: Comparing Genomes in terms of Protein Structure: Surveys of a Finite Parts List
Performing organization: Yale University, Department of Molecular Biophysics & Biochemistry, 266 Whitney Avenue, New Haven, CT 06520
Distribution/availability statement: Approved for public release; distribution unlimited
Abstract
We give an overview of the emerging field of structural genomics, describing
how genomes can be compared in terms of protein structure. As the number of genes in a
genome and the total number of protein folds are both quite limited, these comparisons
take the form of surveys of a finite parts list, similar in some respects to demographic censuses.
Fold surveys have many similarities with other whole-genome characterizations, e.g.
analyses of motifs or pathways. However, structure has a number of aspects that make it
particularly suitable for comparing genomes, namely the way it allows for the precise
definition of a basic protein module and the fact that it has a better defined relationship to
sequence similarity than does protein function. An essential requirement for a structure
survey is a library of folds, which groups the known structures into “fold families.” This
library can be built up automatically using a structure-comparison program, and we
describe how important objective statistical measures are for assessing similarities
within the library and between the library and genome sequences. After building the
library, one can use it to count the number of folds in genomes, expressing the results in
the form of Venn diagrams and "top-10" statistics for shared and common folds.
Depending on the counting methodology employed, these statistics can reflect different
aspects of the genome, such as the amount of internal duplication or gene expression.
Previous analyses have shown that the common folds shared between very different
microorganisms - i.e. in different kingdoms - have a remarkably similar structure, being
composed of repeated strand-helix-strand super-secondary structure units. A major
difficulty with this sort of “fold-counting” is that only a small subset of the structures in a
complete genome is currently known and this subset is prone to sampling bias. One way
of overcoming biases is through structure prediction, which can be applied uniformly and
comprehensively to a whole genome. Various investigators have, in fact, already applied
many of the existing techniques for predicting secondary structure and transmembrane
(TM) helices to the recently sequenced genomes. The results have been consistent:
Microbial genomes have similar fractions of strands and helices even though they have
significantly different amino-acid composition. The fraction of membrane proteins with a
given number of TM-helices falls off rapidly with more TM elements, approximately
according to a Zipf Law. This latter finding indicates that there is no preference for the
highly studied 7-TM proteins in microbial genomes. Continuously updated tables and
further information pertinent to this review are available over the web at
The Sequencing of Complete Genomes Highlights the Finiteness of Molecular Biology
In the last three years a number of microbial genomes have been completely
sequenced, generating tremendous interest, popular as well as scientific [1-3]. In
particular, in 1995 the first genome of a free-living organism, the bacterium H. influenzae,
was sequenced by Venter and colleagues, and two years later another landmark was
reached with the publication of the yeast genome, a significantly more complex genome
of a eukaryote [4, 5].
One of the most important points highlighted by having a complete genome sequence is
the essential finiteness of molecular biology. That is, the complete sequence, while
complex, describes all the parts necessary for microbial life.
A Structural Census, the Connection between Genomes and Structures
Simultaneous with all the progress being made in genomics, there is a tremendous
investment being made in structural biology. This is yielding great returns in the form of
an exponentially increasing number of protein structures. All these structures fall into a
very limited number of folding patterns, currently about 350 [6-10]. It is believed,
furthermore, that we will eventually find that all naturally occurring protein structures are
composed of a very small number of folds, estimated to be ~1000.
The objective of this work is to discuss various means of understanding this finite
universe of genes in terms of an even more limited repertoire of protein folds. This is the
subject of the new field of structural genomics [12, 13]. One can achieve some form of
understanding by performing large-scale surveys, looking at the occurrence of protein
structures and various protein structural features in the genomes of different organisms.
We use the term “structural censuses” to describe these surveys, emphasizing the intent to
provide a comprehensive accounting.
To do such a structural census properly, one needs to cluster together 3D structures
into a library of folds and then to match up genome sequences to structures in this library.
One also needs a way to characterize the sequences without structural homologues in
rough structural terms. This is usually done via various prediction techniques, such as
those for secondary structure or transmembrane helices. Then one does “fold counting,”
enumerating how often a fold or structural feature occurs in a given genome or organism.
These specific aspects of a structure census will be discussed at length. But before doing
so it is worthwhile to provide some perspective on the general questions addressed and
how this work relates to other types of genomic analysis.
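The final counting step can be illustrated with a minimal sketch. It assumes that every gene has already been matched to at most one fold in the library; the genome names, fold labels, and the `fold_assignments` table are hypothetical placeholders rather than data from any actual survey.

```python
from collections import Counter

# Hypothetical assignments: genome -> fold label of each matched gene.
fold_assignments = {
    "genome_A": ["TIM-barrel", "Rossmann", "Rossmann", "ferredoxin-like"],
    "genome_B": ["Rossmann", "TIM-barrel", "TIM-barrel", "zinc-finger"],
}

def fold_counts(assignments):
    """Count how often each fold occurs in each genome."""
    return {genome: Counter(folds) for genome, folds in assignments.items()}

def shared_folds(counts):
    """Folds present in every genome (the central region of a Venn diagram)."""
    return set.intersection(*(set(c) for c in counts.values()))

counts = fold_counts(fold_assignments)
common = shared_folds(counts)
top_in_A = counts["genome_A"].most_common(1)  # the "top-1" fold of genome_A
```

Counting matched genes, as done here, reflects internal duplication; weighting each gene by an expression level instead would be a different counting methodology.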
The Overall Question: At What Structural Resolution Do Organisms Differ?
One interesting question addressed by a census of structures is to what degree
certain folds occur only in certain branches of the “evolutionary tree.” To put it in
somewhat extreme terms, can one explain the obvious morphological differences
between two microorganisms (e.g. between yeast and E. coli) in terms of their having
different protein folds? Alternatively, it may be that most folds occur in every organism
in the same way that the genetic code and many basic biochemical pathways (such as
glycolysis) are almost universally shared. Currently, it is only possible to answer this
question anecdotally, in terms of individual structures. One can find evidence for either
viewpoint. On one hand, the immunoglobulin fold, which is usually closely associated
with the eukaryotes (e.g. in the vertebrate immune system), has been found in bacteria,
where it carries out a very different function. On the other hand, the small DNA-
binding fold known as the zinc finger so far appears to be confined to eukaryotes.
This question can be rephrased as, "At what structural resolution do organisms
differ?" Structurally, microorganisms appear different on the micron scale, as they have
different internal cell structures, but on the scale of single Ångstroms they appear nearly
the same, containing similar proportions of C, H, O, N, P, and S atoms (Fig. 1). At what
structural resolution can one start seeing differences? It is probably not at the level of
secondary structure (~10 Å) since all organisms are composed of essentially similar
proportions of alpha helices and beta sheets (see below). Is it at the level of protein super-
secondary structure (e.g. four-helix bundles or beta-alpha-beta units) or at the level of
whole domain folds? Or perhaps it is at a higher level, involving the large-scale
organization and regulation of essentially identical protein parts.
This question is especially interesting when one considers the diverse physical
environments inhabited by these organisms -- from high temperature and pressure for
Methanococcus, to normal temperature and pressure for yeast, to high acid for
A Structural Census as a Particular Type of “Occurrence Analysis” in Genomics
Analyzing the occurrence or frequency of folds in genomes is a particular
example of a general type of comparative genomics we dub “occurrence analysis.” This
involves comparing how often a particular entity (e.g. a sequence motif) occurs in various
genomes, and seeing what fraction of a collection of entities occurs in one genome as
compared to another. Several different types of occurrence analysis have been previously
performed, studying genomes at many different levels.
Starting from the most basic units, genomes have been compared in terms of the
relative frequencies of short oligonucleotide and oligopeptide “words” [16-19].
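As a rough illustration of this most basic kind of occurrence analysis, the sketch below tabulates the relative frequencies of overlapping k-letter words in a sequence. The function name and the toy sequence are our own illustrative choices, not taken from the cited studies.

```python
from collections import Counter

def word_frequencies(sequence, k):
    """Relative frequency of each overlapping k-letter word in a sequence."""
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(words)
    return {word: n / total for word, n in Counter(words).items()}

# Toy nucleotide sequence; the same function applies to protein sequences.
freqs = word_frequencies("ACGTACGA", 2)
```

Two genomes can then be compared by contrasting the frequency assigned to the same word in each.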
On the level of individual genes or proteins, the degree of gene duplication in a
number of genomes has been ascertained [20-25]. Other works have investigated the
occurrence of conserved families in several different genomes [26-30]. This can be
performed on a large scale in a highly automated fashion [31-36]. The recent growth of
databases makes such automatic and objective systems highly desirable. In particular,
with the data of many complete genomes now available, the often arbitrary functional
assignment of homologous genes can be replaced with a system of orthologs and paralogs
(genes with a common ancestor, separated by speciation and presumably performing the
same function, versus genes generated by duplication within the same organism). A semi-
automatic approach was recently developed that compared several genomes and derived
clusters of orthologous groups (COGs). The approach is straightforward: if one
knows all the potential candidates in a genome for a certain protein function, one can pick
the best one based on the best match to a protein of known function. If the best matches
occur consistently among the same group of proteins from several distantly related
genomes, the proteins are classified as COGs.
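The idea of consistent best matches can be sketched as follows. This is a deliberately simplified illustration for exactly three genomes, not the published semi-automatic procedure; the `best_hit` table, gene names, and genome labels are hypothetical.

```python
from itertools import product

# Hypothetical best-hit table: for each gene, its best match in each other genome.
best_hit = {
    "a1": {"B": "b1", "C": "c1"},
    "b1": {"A": "a1", "C": "c1"},
    "c1": {"A": "a1", "B": "b1"},
    "a2": {"B": "b1", "C": "c2"},
    "b2": {"A": "a2", "C": "c2"},
    "c2": {"A": "a2", "B": "b2"},
}
genes_by_genome = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
genome_of = {g: name for name, genes in genes_by_genome.items() for g in genes}

def mutual_best(x, y):
    """True if x and y are each other's best match across their two genomes."""
    return (best_hit[x].get(genome_of[y]) == y
            and best_hit[y].get(genome_of[x]) == x)

def consistent_groups(genes_by_genome):
    """Gene triples (one gene per genome) whose best matches are all mutual."""
    genomes = sorted(genes_by_genome)
    groups = []
    for a, b, c in product(*(genes_by_genome[g] for g in genomes)):
        if mutual_best(a, b) and mutual_best(b, c) and mutual_best(a, c):
            groups.append((a, b, c))
    return groups

groups = consistent_groups(genes_by_genome)
```

In this toy table only the first triple is fully consistent; the second fails because `a2` and `b2` are not each other's best match.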
An important application of single-gene occurrence analysis is “differential
genomics.” When two closely related genome sequences are compared, the difference,
i.e. those genes that are present only in one of them, may give a clue to the unique nature
of the microbe in question. For example, a comparison between E. coli and H. influenzae
revealed 116 genes that are present only in the latter. Differential genomics may
have useful applications for attacking microbe-related diseases [38, 39], e.g. finding
genes unique to pathogenic organisms can help in developing antibiotics against them.
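In its simplest form, differential genomics reduces to a set difference. The sketch below assumes that genes have already been grouped into families comparable across genomes; the family labels are hypothetical.

```python
def genes_unique_to(target, reference):
    """Gene families present in `target` but absent from `reference`."""
    return sorted(set(target) - set(reference))

# Hypothetical family labels, for illustration only.
pathogen = {"famA", "famB", "famC", "famD"}
relative = {"famA", "famC", "famE"}
unique = genes_unique_to(pathogen, relative)
```

In practice the hard part is the grouping itself, since deciding whether two genes belong to the same family depends on the similarity threshold used.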
Occurrence analysis can also be carried out on the level of whole metabolic
pathways and systems [40-42]. This work has yielded many interesting conclusions in
terms of the pathways that are modified or absent in certain organisms. For instance,
many of the respiratory enzymes in E. coli are missing in H. influenzae, and the
metabolism in the latter seems to be biased to a relatively nitrogen-rich and anaerobic
environment [4, 43, 44].
Why Analysis of Structure is Particularly Advantageous for Genome Comparison
The analysis of structure is expected to be particularly advantageous for genome
comparison for two reasons.
Structural Modules are Precisely Defined and Relatively Few in Number
First, structure allows one to define a protein module (or shared part) in both a
more precise and more general sense.
It is possible (and quite productive) to define modules purely in terms of
conserved “blocks” in sequence alignments or small, but distinctive, “motifs” shared by
many related proteins [45-58]. However, functioning protein modules fundamentally
consist of units of 3D structure. In fact, it is usually believed that these structural units
form physically interacting "folding domains," and attempts have been made to see how
well they correspond to exon boundaries and other linear sequence features [59-61]. This
is often not a simple relationship as many structural modules are discontinuous in terms
of sequence -- as when a polypeptide chain starts in one domain, goes through a hinge
region into a second domain, and then returns to the first domain. Nevertheless, relating
modules defined on the sequence level to structure enables them to be better
characterized. This is especially true for groups of aligned structures, which allow the
definition of a conserved structural core [62, 63].
Also, one expects analysis of structure to reveal more about distant evolutionary
relationships than sequence comparison, since structure is more conserved than sequence
or function [64, 65]. In other words, it is at the level of protein structure where the
biologist sees the fewest “parts” and the greatest amount of redundancy and reuse.
Similarity in Sequence is More Closely Related to Similarity in Structure than in Function
A second reason that structural analysis is useful for genome comparisons is that the
relationship between sequence similarity and structural similarity is much better defined
than the corresponding relationship between sequence and function.
It is generally accepted that proteins with similar sequences usually have similar
structures. A decade ago Lesk & Chothia systematically investigated the relationship
between divergence in sequence and that in structure [64, 66]. Using the limited amount
of data available at the time (32 pairs of homologous structures among 25 proteins), they
found that the extent of the structural changes is directly related to the extent of the
sequence changes. As shown in figure 2, we have repeated the calculations here using a
much larger data set. (Details of the calculations are described in the legend.) Expressing
sequence similarity in terms of the more modern statistical terminology (i.e. P-value
instead of percentage identity), we find very similar results to the original work of Lesk
& Chothia. There are, of course, exceptions where similarity in sequence does not imply
similarity in structure. These usually occur for small proteins, e.g. an artificially designed
sequence of a four-helix bundle could be made more than 50% identical to a
predominantly beta-sheet protein [67, 68].
The relationship between sequence similarity and functional similarity is much less
clear. In part, this is because it is much more difficult to precisely specify a function
than a sequence or a structure. Moreover, even in cases where the functional
identification is well specified, there are several examples where highly similar sequences
have completely different functions - i.e. same fold but different function. A well-known
example is the structural protein eye-lens crystallin and the metabolic enzyme glutathione
S-transferase, which have sequence and structural similarity but differ in function.
An extreme example is provided by the enzymes lactate dehydrogenase and malate
dehydrogenase. In protein engineering experiments, Wilks et al. managed to convert one
into the other by changing only a single amino acid.
The opposite situation can also be observed, namely when the same function is
performed by several proteins unrelated in structure and sequence - i.e. same function but
different fold. A good example is chloroperoxidase, which has an alpha/beta fold in the
prokaryote Pseudomonas but has an all-alpha fold in fungi [72, 73]. There are many
more examples of this type of convergent evolution in enzymes.
Elements of a Structural Census: Construction of a Fold Library
Thus far, we have described how comparing genomes in terms of structures is a
particular form of “occurrence analysis” and how structure provides a particularly
advantageous subject for comparison. Now we outline what goes into a structure census,
its methodological "elements," and discuss some conclusions from recent work. An
essential element in a survey of known structures is the construction of a library of folds.
This is expected to be an essential data structure in molecular biology, organizing the
collection of gene families like the columns in the chemical periodic table.
Pairwise Structural Comparison and Alignment: Automatic vs Manual
To build a fold library, one must have a way of comparing and aligning protein
structures (see figure 3). One approach is to do this manually, the approach taken for the
scop classification of protein structures. At the other extreme, there are a number of
algorithms for automatically comparing structures and clustering them into fold families
[76-89]. Finally, there is a hybrid approach, based on both automatic and manual
comparison [10, 90].
Completely automatic methods have the advantage of speed and objectivity.
However, the fold classifications produced by a computer are not always as
understandable or reliable as those produced by humans. Furthermore, although manual
classification is slow, if it is done correctly, it only has to be done once.
Various Automatic Methods for Structural Comparison
To get a perspective on the automatic methods, it is useful to compare structural
alignment with the much more thoroughly studied methods for sequence alignment [91,
92]. Both methods produce an alignment, which can be described as an ordered set of
equivalent pairs (i,j) associating residue i in protein A with residue j in protein B. Both
methods allow gaps in these alignments which correspond to non-sequential i (or j)
values in consecutive pairs, i.e. one has pairs like (10,20) and (11,22). And both
methods reach an alignment by optimizing a function that scores well for good matches
and badly for gaps. The major difference between the methods is that the optimization
used for sequence alignment is globally convergent whereas that used for structural
alignment is not. This is the case for sequence alignment because the optimum match for
one part of a sequence is not affected by the match for any other part. Structural
alignment fails to converge globally because the possible matches for different segments
are tightly coupled, as they are part of the same rigid 3D structure.
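The global convergence of sequence alignment can be made concrete with a minimal dynamic-programming sketch in the style of the standard global alignment recurrence. The scoring values (match, mismatch, gap) are arbitrary illustrative choices, and the returned pairs use 0-based residue indices.

```python
def align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns the optimal score and the ordered equivalent pairs (i, j).
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the corner to recover the equivalent pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs

best_score, pairs = align("GATTACA", "GATACA")
```

Each cell depends only on its three neighbours, which is why the optimum for one part of the sequence is unaffected by the rest. A gap shows up as a jump in one index between consecutive pairs.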
This lack of convergence has led to a large number of different approaches to
structural alignment, the methods differing in how they attack the problem. No current
algorithm works all of the time (i.e. for all the pathological cases). The methods also
differ in the function they optimize (the equivalent of the amino acid substitution matrix
used in sequence alignment) and how they treat gaps. Some of the methods effectively
compare the respective distance matrices of each structure, trying to minimize the
difference in intramolecular distances for selected aligned substructures [80, 83, 93]. Other
approaches, in contrast, directly try to minimize the inter-atomic distances between two
structures, using repeated application of dynamic programming [77, 89, 90, 94, 95]. This
allows structures to be aligned in a similar fashion to normal sequence alignment. A
similar approach is taken in minimizing the "soap-bubble area" between two structures.
Other methods involve other techniques, such as geometric hashing or lattice fitting
[79, 85, 86].
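The distance-matrix idea can be sketched in a few lines. The sketch assumes the equivalent atoms have already been chosen, so it only scores a given set of aligned positions and does not search for them; the function names and coordinates are illustrative.

```python
import math

def distance_matrix(coords):
    """All pairwise distances within one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def distance_matrix_difference(coords_a, coords_b):
    """Mean absolute difference between the internal distance matrices
    of two equally sized sets of aligned atoms."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two aligned sets of at least two atoms each")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    index_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(abs(da[i][j] - db[i][j]) for i, j in index_pairs) / len(index_pairs)

# A translated copy has identical internal distances, so the difference is zero.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
structure_b = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in structure_a]
difference = distance_matrix_difference(structure_a, structure_b)
```

Because only internal distances are compared, the score is independent of how the two structures are positioned and oriented, so no superposition is needed.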
Fusing a Multiple Alignment into a Structural Template
The classification of the entire databank using a variety of the automatic and
manual procedures outlined above has recently been undertaken by a number of groups
[7, 83, 97-101], resulting in the scop, FSSP, LPFC, CATH, and HOMALDB databases.
These databases group the known structures into ~350 fold families, some of which are
quite large (e.g. currently the PDB contains over 166 antibody structures). Because of the
great numbers of structures and of families, it is worthwhile to summarize the common
features within a family, whilst separating out the variable ones. That is, one wants to
know which regions are conserved and which are highly variable, and to fuse all the
conserved regions into a single “core structure” template (figure 3). A number of
approaches have been developed to tackle this problem through determining a mean and
variance for an ensemble of multiply aligned structures and then picking the low variance
atoms as “core” [8, 62, 102, 103].
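A minimal sketch of the mean-and-variance idea follows. It assumes the structures in the family have already been multiply aligned and superposed in a common frame, so that position k refers to equivalent atoms in every structure; the variance threshold and coordinates are hypothetical.

```python
def core_positions(ensemble, max_variance):
    """Indices of aligned positions whose spatial variance across the
    ensemble is at most `max_variance` (in squared distance units)."""
    n_structures = len(ensemble)
    core = []
    for k in range(len(ensemble[0])):
        points = [structure[k] for structure in ensemble]
        mean = [sum(p[d] for p in points) / n_structures for d in range(3)]
        variance = sum(
            sum((p[d] - mean[d]) ** 2 for d in range(3)) for p in points
        ) / n_structures
        if variance <= max_variance:
            core.append(k)
    return core

# Two superposed toy structures: the third position deviates strongly.
s1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
s2 = [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0), (2.0, 3.0, 0.0)]
core = core_positions([s1, s2], max_variance=0.5)
```

The result depends on the superposition used to bring the structures into a common frame, which is why some methods iterate between refitting on the current core and recomputing the variances.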
Searching the Genome with Structural Templates
Clustering the Structure Databank into Sequence Templates
Once a library of folds has been constructed, one wants to build sequence
templates based on it and then use these to search the genome. A necessary
methodological preliminary is clustering the known structures into a number of
(sequence) representative domains, using a variety of single or multiple linkage
approaches [6, 67, 104-106]. Currently, the PDB can be clustered into ~1200 representative
domains. Then using structure comparison, one finds that these representatives are
distributed amongst 338 folds, giving about three sequence families per fold. That
the number of folds is so much smaller than the number of sequence families highlights
the fact that many of the evolutionary similarities between highly diverged organisms
may only be apparent in terms of structure. Folds can, in turn, be ranked by the
number of different families of non-homologous sequences they are associated with.
Folds uniting many distinct sequence families have been dubbed superfolds. These
may represent intrinsically stable and favorable structural arrangements, as suggested by
a variety of analyses [108-110].
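Single-linkage clustering of domains into sequence families can be sketched with a union-find structure. The sketch assumes the pairs of significantly similar domains have already been determined by pairwise sequence comparison; the domain identifiers are hypothetical.

```python
def single_linkage_clusters(items, similar_pairs):
    """Group items so that any two linked by a chain of similar pairs
    fall in the same cluster."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())

domains = ["d1", "d2", "d3", "d4", "d5"]
similar = [("d1", "d2"), ("d2", "d3")]  # d1 and d3 are linked only through d2
families = single_linkage_clusters(domains, similar)
```

One representative per cluster then gives the non-redundant set. Multiple-linkage schemes are stricter, requiring a domain to be similar to several members before it joins a cluster.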
At this point one has ~350 3D-structural alignments, each of which “connects” a
number of non-homologous sequences. These can be used as “seeds” to build up large
sequence alignments from the major databases using standard pairwise searching tools -
e.g. the popular BLAST and FASTA programs on the SwissProt and GenBank databases
[111-115]. A number of recently developed methods of transitive sequence matching
(through a third intermediate sequence) are expected to improve the sensitivity of these
pairwise searches somewhat [116-119].
As many of these alignments contain quite a few sequences, it can be advantageous to
fuse them into a consensus pattern or template, just as is done with structures (Fig.
3). For this, a variety of probabilistic approaches can be used. One of the most popular
representations is the Hidden Markov Model (HMM) [120-125]. This is a generalization
of the sequence profile, and like a profile it gives an explicit probability for each of the 20
amino acids to occur at each position in the model. The HMM goes beyond a
profile in associating with each position an explicit probability for introducing a gap
(either for insertion or deletion).
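The difference between a plain profile and a model with position-specific gap probabilities can be illustrated with a simplified sketch. It is not a full HMM (there are no state transitions and no training procedure); it merely tabulates, for each column of a toy alignment, the residue frequencies and the fraction of gaps. The alignment and function name are hypothetical.

```python
from collections import Counter

def column_profile(alignment):
    """Per-column residue frequencies and gap fraction for a multiple
    alignment given as equal-length strings, with '-' marking a gap."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        column = [row[col] for row in alignment]
        residues = Counter(c for c in column if c != "-")
        n_residues = sum(residues.values())
        freqs = {aa: n / n_residues for aa, n in residues.items()}
        profile.append({"residues": freqs,
                        "gap": column.count("-") / len(column)})
    return profile

profile = column_profile(["ACD", "A-D", "GCD"])
```

A real profile or HMM would also add pseudocounts, so that residues unseen in a column do not receive a probability of exactly zero.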
Microbial Genome Sequences
Once formed, sequence templates can be compared directly against the genomes.
This can take place in a variety of ways. The most straightforward is to just compare each
sequence in the template against the genome using the standard pairwise comparison
programs (e.g. FASTA, BLAST, or straight Smith-Waterman [111-113, 127]).
Alternatively, one can use profile or HMM searching programs for those sequences that are
part of an explicit pattern. However, in doing this one has to consider some important
issues related to bias (see below).
At the time of this writing, 13 microbial genome sequences are
available (Table 1). These already provide a most diverse comparison -- representing
microbes from the three kingdoms of life (Eukarya, Eubacteria, Archaea), from different
environments (room temperature and pressure to high temperature and pressure, and
neutral pH to highly acidic), with a wide range of genome sizes (0.6 to 13 Mb), and with
a variety of modes of life (from parasite to autotroph).
One point worth mentioning is that the genome data is constantly changing and is
contingent on the current “state of the art” in gene finding. The data used in any analysis
reflects a particular snapshot of this ongoing process. For instance, the current E. coli
data file is version M52, containing 4290 ORFs. This is a more recent version and
contains a different number of ORFs than the one referred to in the official publication
(M49, containing 4288 ORFs). For yeast there is some uncertainty regarding
whether all of the ORFs in the web site file are really genes. In particular, 5888 of the
6218 ORFs are definitely believed to be genes, but there is some question about the
remaining 330. Furthermore, quite a number of yeast sequences (initially)
annotated to be ORFs are, in fact, transposons, which should properly be segregated from
the rest of the proteome.
Similarity in Both Sequence and Structure is Best Described
Similarities are best expressed statistically in terms of a P-value
The preceding section was concerned with comparison, both for structure and
sequence. To do this right, one needs to be able to assess the significance of a given
comparison score – i.e. what does a score of 392 mean? This is often quite subtle and, in
a sense, relates to the fundamental problem of what constitutes similarity in biology.
Moreover, it is a most important issue with respect to large-scale genome surveys, which
involve hundreds of thousands of comparisons. It is essential to have a rapid and
automatic method to assess the significance of a given comparison score (i.e. to set a
threshold), as it is neither possible nor desirable to do this by hand.
The best way to assess significance is to see how a particular similarity score compares
in a statistical sense to all the others. A major development in the past few years has been
the implementation of probabilistic scoring schemes for doing just this [131-137]. These
give the significance of a match in terms of a P-value rather than an absolute, “raw” score
(such as percent identity or RMS). A P-value is the chance that one would get a given
similarity score (or better) from a random alignment. That is, P(s > S) = .01 means that a
randomly generated score s would be greater than the threshold score S (e.g. 392) 1% of the
time. The P-value gives the rank of a score relative to all the other possible scores. It places
scores from very different programs in a common framework and provides an obvious
way to set a significance cutoff (i.e. at P < 0.0001 or 0.01%).
P-values are closely related to another quantity called the e-value, which is the
number of false positives expected with a given score threshold in a whole databank
comparison. Thus, the e-value is just the databank size multiplied by the P-value.
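A minimal Python sketch of these two quantities follows. The function names, the list of random-alignment scores, and the databank size of 100,000 are illustrative assumptions, not values taken from any particular program.

```python
from typing import Sequence


def empirical_p_value(score: float, random_scores: Sequence[float]) -> float:
    """Fraction of random-alignment scores s with s > score, i.e. P(s > S)."""
    hits = sum(1 for s in random_scores if s > score)
    return hits / len(random_scores)


def e_value(p_value: float, databank_size: int) -> float:
    """Expected number of false positives in a whole-databank comparison."""
    return databank_size * p_value


# Hypothetical scores obtained from random (true-negative) alignments.
random_scores = [12.0, 35.0, 18.0, 50.0, 27.0, 41.0, 9.0, 22.0, 30.0, 16.0]

p = empirical_p_value(40.0, random_scores)  # two of the ten scores exceed 40
print(p, e_value(p, databank_size=100_000))
```

In practice the random-score sample would have to be far larger than ten values to resolve P-values as small as 0.0001, which is one reason analytic or fitted distributions are used instead of direct counting.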
Determination of P-values involves determining the score distribution for true
negatives, i.e. for random alignments. This can be done in a number of ways: simulating
random alignments, analytically deriving the score distribution for a random alignment, or
doing an all-vs-all comparison of the databank and curve-fitting to the observed score distribution.
Statistics for Sequence Similarity
For sequences, P-values were first used in the BLAST family of sequence searching
programs, where they are derived from an analytic model for the chance of an arbitrary
ungapped alignment [131, 135]. P-values have subsequently been implemented in other
programs such as FASTA and gapped BLAST using a somewhat different formalism
[116, 136-138]. In all the formalisms, P-values for sequence comparison are derived from
an extreme value distribution. That is, sequence comparison scores are observed to follow
a distribution like exp(-S-exp(-S)), which has a much longer "tail" than the rapidly falling
off normal distribution exp(-S^2). Such a distribution arises naturally from repeatedly
considering the maximum of a number of independent, random variables. This is in
contrast to the normal distribution, which arises from repeatedly considering sums of
independent, random variables.
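The difference in tail behaviour can be checked numerically. The sketch below uses the standardized form of the extreme value density, exp(-S-exp(-S)), whose upper-tail probability is 1 - exp(-exp(-S)), and compares it with the upper tail of a standard normal distribution. The function names are chosen here for illustration, and the two distributions are not rescaled to a common variance, so the comparison is only qualitative.

```python
import math


def evd_density(s: float) -> float:
    """Standardized extreme value density exp(-s - exp(-s))."""
    return math.exp(-s - math.exp(-s))


def evd_p_value(s: float) -> float:
    """P(score > s) for the standardized EVD: 1 - exp(-exp(-s))."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-math.exp(-s))


def normal_p_value(s: float) -> float:
    """P(score > s) for a standard normal distribution."""
    return 0.5 * math.erfc(s / math.sqrt(2.0))


for s in (2.0, 4.0, 6.0):
    print(s, evd_p_value(s), normal_p_value(s))
```

At each of the three scores the extreme value tail is larger than the normal tail, and the gap widens as the score grows.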
In general, P-values give similar results to more conventional scores, such as
percent identity, but they have been shown to be better calibrated and more sensitive for
marginal similarities, taking into account compositional biases of the databank and the
query sequence [94, 132, 133]. In particular, Brenner et al. tested the applicability of
probabilistic scores to the detection of structural relationships [67, 139, 140]. They found
that the FASTA e-value closely tracked the error rate against a test set of known
structural relationships. That is, with regard to the number of false positives, expectation
Statistics for Structural Similarity
Some of the current methods for structural alignment have associated with them
probabilistic scoring schemes. In particular, one method computes a P-value for an
alignment based on measuring how many secondary structure elements are aligned, as
compared to the chance of aligning this many elements randomly (VAST). Another
method expresses the significance of an alignment in terms of the number of standard
deviations it scores above the mean alignment score in an all-vs-all comparison (i.e., a Z-
score) [8, 83].
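The Z-score idea can be written down in a few lines of Python. The score list below is hypothetical and stands in for the scores of an all-vs-all comparison; the function name is an assumption of this sketch and does not correspond to any specific alignment program.

```python
import statistics


def z_score(score, all_scores):
    """Number of standard deviations by which score exceeds the mean."""
    mean = statistics.mean(all_scores)
    sd = statistics.pstdev(all_scores)  # population standard deviation
    return (score - mean) / sd


# Hypothetical alignment scores from an all-vs-all comparison.
all_vs_all = [10.0, 12.0, 14.0, 16.0, 18.0]
print(z_score(22.0, all_vs_all))
```

Note that a Z-score is only directly interpretable as a probability if the scores are normally distributed, which is not the case for scores that follow an extreme value distribution.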
We have recently developed a simple empirical approach for calculating the
significance of a structural alignment score based on doing an all-vs-all comparison of
the databank and then curve fitting to the observed score distribution for the true
negatives [90, 94]. We can apply our approach consistently to both sequences and
structures. For sequences, we compared our fit-based P-values with the differently
derived statistical scores from commonly used programs such as BLAST and FASTA and
found substantial agreement. For structure alignment, we follow a parallel route to derive
an expression for the P-value of a given alignment in terms of a structural alignment score.
We find that scores from structure alignment follow a similar extreme-value
distribution to those in sequence comparison, allowing one to adopt a uniform statistical
formalism for both comparison techniques. (As dynamic programming applied to either
sequence or structure alignment essentially finds a maximum score over many possible
alignments, it is quite reasonable that this should be the case. However, this is not
trivially obvious, as the dynamic programming score does not result from considering the
maximum of truly independent variables.)
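One simple way to turn an observed distribution of true-negative scores into P-values is sketched below. It is an assumption-laden stand-in rather than the fitting procedure described above: it fits an extreme value distribution by the method of moments (scale from the standard deviation, location from the mean), and the score list and function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_evd(scores):
    """Method-of-moments estimate of the EVD location (mu) and scale (beta)."""
    beta = statistics.pstdev(scores) * math.sqrt(6.0) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def evd_p_value(score, mu, beta):
    """P(s > score) under the fitted extreme value distribution."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))  # equals 1 - exp(-exp(-z))


# Hypothetical true-negative scores from an all-vs-all comparison.
true_negative_scores = [18.2, 22.5, 19.9, 31.0, 24.3, 20.7, 27.8, 21.4, 23.6, 26.1]
mu, beta = fit_evd(true_negative_scores)
for s in (25.0, 40.0, 60.0):
    print(s, evd_p_value(s, mu, beta))
```

Because the same two-parameter form is used regardless of where the scores come from, the same code could be applied to sequence or structure comparison scores, which is the practical benefit of a uniform formalism.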
A nice aspect of structural alignment is that one can visualize exactly what is meant
by a strong similarity in comparison to a marginal one. Examples are shown in Figure 4,
which depicts a strong similarity (for two globins), a weaker one (for two
immunoglobulins), and a very marginal one.
Overall “Inventory” Statistics in a Census Calculation
Distribution of Folds Amongst Genomes (Venn Diagrams)
After setting a uniform comparison threshold and running the fold library against
the genomes, it is possible to see how the known folds are distributed amongst different
genomes, or partial genomes. There are a number of web sites that compile this data
automatically – e.g. PENDANT and GeneQuiz [33, 141]. However, few detailed analyses
have been published, mostly because only recently have enough complete genomes
become available for this sort of comparative analysis.
A recent work illustrates what is initially possible. This analysis focussed on
three of the first genomes to be sequenced, the first ones from each of the major
kingdoms: i.e., H. influenzae (a eubacterium), M. jannaschii (an archaeon), and
S. cerevisiae (yeast, a eukaryote).
As shown in Figure 5, the analysis can be conceptualized in terms of a Venn diagram,
similar to those used for studying the occurrence of motifs and sequence families [143,
144]. About half of the known folds (148) are contained in at least one of the three
genomes, and 45 folds are shared amongst all three genomes. These shared folds
presumably represent an ancient set of molecular parts.
It is possible to classify each fold as all-alpha, all-beta, alpha/beta, alpha+beta, or
“other” using the original definitions of Levitt & Chothia and then to see how the folds
corresponding to each structural class are distributed among the genomes [145, 146].
Overall, the genomes contain a disproportionate number of mixed folds (α/β and
α+β, 83/148), and the shared folds are even more enriched in α/β super-secondary
structures, with 38 of 45 having a mixed architecture.
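The bookkeeping behind such a Venn diagram amounts to set operations on the folds matched in each genome. The sketch below uses placeholder fold identifiers (F1 to F6) and invented assignments, not the actual survey results.

```python
from itertools import combinations

# Toy fold assignments; the identifiers and memberships are placeholders.
folds_in_genome = {
    "H. influenzae": {"F1", "F2", "F3", "F4"},
    "M. jannaschii": {"F1", "F2", "F3", "F5"},
    "S. cerevisiae": {"F1", "F2", "F3", "F4", "F6"},
}

# Folds found in at least one genome, and folds shared by all of them.
in_any = set.union(*folds_in_genome.values())
in_all = set.intersection(*folds_in_genome.values())

# Folds shared by each pair of genomes.
pair_shared = {
    (a, b): folds_in_genome[a] & folds_in_genome[b]
    for a, b in combinations(folds_in_genome, 2)
}

# Folds unique to a single genome.
unique = {
    name: folds - set.union(*(f for n, f in folds_in_genome.items() if n != name))
    for name, folds in folds_in_genome.items()
}

print(len(in_any), sorted(in_all))
print({name: sorted(f) for name, f in unique.items()})
```

The same dictionary of sets extends to any number of genomes, even though a drawn Venn diagram does not.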
A related analysis looked at the occurrence of folds in different groups of organisms
(e.g. plants vs. animals). This did not involve complete genomes but rather
partitioning the sequence databank into a number of distinct phylogenetic sets. Such an
analysis suffers from various biases (as discussed below), but it is nevertheless
suggestive, showing that more closely related organisms had a greater number of folds in common.
It is expected that many more analyses such as these will be undertaken in the
future as more genomes are sequenced and structures determined. It is difficult to
express the shared folds amongst more than three genomes in terms of a Venn diagram,
so other representations become useful, such as cluster trees.
Frequency that Folds Occur in a Genome (“Top-10 lists”)
Another simple statistic to look at is how often a particular known fold occurs in a
genome, i.e. the fold frequency. In the previous work comparing three genomes, these
frequencies were expressed in terms of “top-10” lists for the most common folds in a
genome. As was the case for the folds overall, most of the common folds have an
Combining the frequent fold analysis with the Venn diagram, one can determine
the common folds that are shared by all genomes. As shown in figure 6, ordered in terms
of their frequency of occurrence, the top-five common and shared folds when comparing
yeast, Haemophilus influenzae, and Methanococcus jannaschii are the P-loop containing
NTP hydrolase fold, the Rossmann fold, the TIM-barrel fold, the flavodoxin fold, and the
Thiamin-binding fold. Each of these folds is associated with basic metabolism (as
opposed to other functions such as transcription or regulation). They are all classic α/β
proteins and share a remarkably similar super-secondary structure architecture, with a
central sheet of parallel strands with helices packed onto at least one face of this sheet.
Moreover, the topology of the central sheet is very similar in all the proteins. Almost all
of the connections are right-handed links between adjacent parallel strands through an
intervening helix packed onto the central sheet.
These top-10 lists rank folds by how often they occur in the genome, tending to
emphasize highly duplicated genes. Folds can also be ranked by a number of other
criteria. For instance, they can be ranked by the number of non-homologous sequence
families they are associated with, i.e. their superfold ranking. This number is not always
correlated with how often the fold occurs in microbial genomes, but it is the case that
superfolds are among the most common folds found in genomes. Folds can also be
ranked in terms of expression level, essentially a ranking by mRNA occurrence in the
cell. This has already been done in non-structural terms for all the genes in yeast [150-
152]. In table 2, we see how this expression level ranking maps onto folds. Using data
from DeRisi et al., the table shows the most highly expressed folds in yeast grown
in two different conditions (high sugar and low sugar, aerobic vs. anaerobic conditions).
The ranking of folds is clearly different from that purely based on duplication.
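The contrast between ranking by duplication and ranking by expression can be illustrated with a short Python sketch. The ORF names, fold labels, and expression levels are invented for the example and do not come from the yeast data discussed above.

```python
from collections import Counter

# Hypothetical (ORF, fold) matches from running a fold library against a genome.
matches = [
    ("orf1", "fold_A"), ("orf2", "fold_A"), ("orf3", "fold_B"),
    ("orf4", "fold_A"), ("orf5", "fold_C"), ("orf6", "fold_B"),
]

# Hypothetical expression level for each ORF (e.g. mRNA copies per cell).
expression = {"orf1": 1.0, "orf2": 0.5, "orf3": 12.0,
              "orf4": 0.5, "orf5": 3.0, "orf6": 8.0}

# Ranking by duplication: how many ORFs match each fold.
by_duplication = Counter(fold for _, fold in matches)

# Ranking by expression: total expression of the ORFs matching each fold.
by_expression = Counter()
for orf, fold in matches:
    by_expression[fold] += expression[orf]

print(by_duplication.most_common(10))  # a "top-10" list (only 3 folds here)
print(by_expression.most_common(10))
```

In this toy case fold_A is the most duplicated fold but the least expressed, which mirrors the kind of reordering described in the text.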
The Problem of Sampling Bias Affects the Statistics
General Issue of Bias in the Databanks
One of the most important issues in doing a large-scale survey is avoiding biases.
Because of the preferences of investigators, some types of sequences or structures are
over-represented and others are under-represented in the databanks. For instance, in
GenBank there is an over-representation of globins from humans relative to flies.
Moreover, a particular fold may be found in the human but not in the fly simply because
not all the fly sequences are currently known. Focussing only on organisms for which
complete genomes are known eliminates this obvious form of bias. However, there is
another bias that is not overcome by knowledge of complete genomes. The selection of
proteins in the PDB is also biased by the preferences of individual investigators and by
the physical constraints on what will crystallize (or can be studied by NMR
spectroscopy). For instance, the PDB currently contains about 5500 entries (5493
identifiers and 10781 domains). This total includes 222 structures for T4 lysozyme, but
only a single structure for the “equally important” tyrosine kinase and topoisomerase-II
Structures in the PDB are also biased towards certain commonly studied organisms.
Thus, a much larger percentage of folds is known for the bacterium Haemophilus in
comparison to the archaeon Methanococcus, even though both have roughly the same
number of genes.
Another issue related to the state of the structure databank is that the absolute counts
found in a given genome survey are contingent on the evolving contents of the databank.
Thus, over time as more structures are added to the databank, one should expect such
statistics as the most common folds and number of shared folds to change somewhat.
The Multi-domain Nature of Proteins Creates Counting Problems
A second type of bias has to do with the fact that protein structure is fundamentally
arranged around the level of folding domains whereas statistics for genomes are often
calculated and best understood in terms of the number of genes (Fig. 7). For instance,
when one talks about how prevalent the kinase and Rossmann folds are in the yeast and
E. coli genomes, one is implicitly comparing the number of matches that known kinase
and Rossmann fold structures have in the ~6200 yeast ORFs relative to the ~4300 E. coli
ORFs. However, it is possible for a single gene to contain a number of kinase fold
domains or to simultaneously contain both a kinase and Rossmann fold. Thus, the total
number of domains in a genome is probably a better standard for these comparisons.
Unfortunately, one does not know this number. But one does know that the number of
domains is not related simply to the number of genes. For instance, on average a protein
is about 50% larger in yeast than in E. coli (466 vs. 317 residues), meaning that, given
the larger number of yeast ORFs, there are probably
twice as many possible domains in yeast as in E. coli.
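The arithmetic behind this estimate can be spelled out as follows. The sketch assumes, purely for illustration, that the number of possible domains scales with the total number of residues in a genome (ORF count times average protein length); the ORF counts are the rounded figures quoted above.

```python
# Approximate ORF counts and average protein lengths quoted in the text.
yeast_orfs, yeast_mean_length = 6200, 466
ecoli_orfs, ecoli_mean_length = 4300, 317

# Average yeast protein relative to the average E. coli protein.
length_ratio = yeast_mean_length / ecoli_mean_length

# Total residues in each proteome, used here as a crude proxy for the
# number of possible domains (an assumption of this sketch).
yeast_residues = yeast_orfs * yeast_mean_length
ecoli_residues = ecoli_orfs * ecoli_mean_length
domain_ratio = yeast_residues / ecoli_residues

print(round(length_ratio, 2), round(domain_ratio, 2))
```

The length ratio alone is about 1.5; it is the combination with the larger number of ORFs that brings the estimate to roughly a factor of two.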
Another problem emanating from the multi-domain nature of proteins is highlighted
in Figure 7. When clustering genes based on their sequence similarities, simple single-
linkage clustering can give potentially misleading results. As has been pointed out
before, it may group together two multi-domain proteins (AB and bc) containing the two
unrelated domain folds (A and c) based on their having similarity only through a common
domain (B and b) [42, 50].
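The chaining effect can be reproduced with a toy example. The sketch below implements single-linkage clustering with a small union-find structure and links two proteins whenever they share a domain family; the protein names and domain compositions are hypothetical.

```python
from itertools import combinations


def single_linkage(items, links):
    """Group items into clusters, merging any two joined by a link."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Hypothetical proteins described by the domain families they contain.
domains = {
    "protein_A": {"A"},
    "protein_AB": {"A", "B"},
    "protein_BC": {"B", "C"},
    "protein_C": {"C"},
}

# Two proteins are linked if they share at least one domain family.
links = [(p, q) for p, q in combinations(domains, 2) if domains[p] & domains[q]]
clusters = single_linkage(domains, links)

print(clusters)
print(domains["protein_A"] & domains["protein_C"])  # empty: nothing in common
```

All four proteins end up in one cluster even though protein_A and protein_C share no domain, because the two multi-domain proteins bridge them through the common B domain.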
Subtle Biases in Comparison Techniques
A final, rather subtle form of bias results from the type of sequence comparison
method used. Different pairwise comparison methods (e.g. Smith-Waterman vs. FASTA)
and different thresholds will give rise to different absolute numbers of fold counts, but
the relative values between different folds will usually remain comparable. However, as
discussed above, there are other, potentially more sensitive, methods of comparing
sequences to structures – e.g. profiles, HMMs, motif analysis, and threading [55, 125,
153-155]. These latter methods find more homologues for certain folds, particularly those
for which multiple alignments are available. However, the sensitivity improvement is not
consistent for all folds. This is not advantageous for a large-scale survey where uniform
sampling and treatment of the data is more important than sensitivity. One is more
concerned with accurate relative numbers than with absolute values. Cobbling together a
survey through a disparate collection of tools and patterns creates the problem of devising
consistent scores and thresholds. This problem is particularly acute in the case of
manually derived sequence patterns and motifs, since an expert on a particular fold or
motif would expect his pattern to find relatively more homologues than a pattern not so
expertly constructed. The simple approach of just using pairwise comparison, applying
the same objective procedure to each fold, circumvents these problems somewhat.
Furthermore, it has an added advantage in that it can be performed automatically without
manual intervention and, consequently, can easily be scaled up to deal with large data sets.
Various weighting, sampling and clustering schemes attempt to correct for both
obvious and more subtle biases [156-160]. Potentially, even methods developed to
correct for biases in governmental censuses may be of use [161, 162]. However, in a
large-scale structure survey nothing can really make up for essential folds that are