Evolutionary trace for prediction and redesign of protein functional sites.
ABSTRACT The evolutionary trace (ET) is the single most validated approach to identify protein functional determinants and to target mutational analysis, protein engineering and drug design to the most relevant sites of a protein. It applies to the entire proteome; its predictions come with a reliability score; and its results typically reach significance in most protein families with 20 or more sequence homologs. In order to identify functional hot spots, ET scans a multiple sequence alignment for residue variations that correlate with major evolutionary divergences. In case studies this enables the selective separation, recoding, or mimicry of functional sites and, on a large scale, this enables specific function predictions based on motifs built from select ET-identified residues. ET is therefore an accurate, scalable and efficient method to identify the molecular determinants of protein function and to direct their rational perturbation for therapeutic purposes. Public ET servers are located at: http://mammoth.bcm.tmc.edu/.
Evolutionary Trace for Prediction and Redesign
of Protein Functional Sites
Angela Wilkins, Serkan Erdin, Rhonald Lua,
and Olivier Lichtarge
Key words: Evolutionary trace, Protein design, Protein engineering, Function annotation,
Phylogenomics, Protein–protein interaction
of Evolutionary Trace:
The evolutionary trace (ET) is a phylogenomic method to identify
important amino acids in protein sequences. The approach con-
ceptually mimics experimental mutational scanning: Whereas in
the laboratory a sequence residue is deemed important when its
mutation changes the response of an assay, ET infers that a residue
is important when its variations during evolution correlate with
major divergences (1, 2). Thus, ET aims to measure the impact of
a residue not by its conservation or through its co-variations, but
rather by its associated evolutionary changes and the functional
perturbations and adaptation that they presumably represent.
The ET approach to measure the correlation between residue
and phylogenetic variations is still under refinement. But the basic
Riccardo Baron (ed.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 819,
DOI 10.1007/978-1-61779-465-0_3, © Springer Science+Business Media, LLC 2012
hypothesis is that residues that vary among widely divergent
branches of evolution are more likely to have a larger functional
impact than other residues that vary even among closely related
species (see Fig. 1). Taking initially an absolute view of variation
patterns (1), the ET rank $r_i$ of sequence residue i in a query
protein is

$$ r_i = 1 + \sum_{n=1}^{N-1} \delta_n \qquad (1) $$

where the summation is over the phylogenetic tree nodes (total of
N − 1 branches); N is the number of homologs in the multiple
sequence alignment. The value of $\delta_n$ is equal to 0 if residue
position i is invariant within the sequences making up node n, while
$\delta_n$ is equal to 1 otherwise. The exact magnitude of $r_i$ is
less important than its relative percentile rank compared to all
residues in the protein: those with smaller percentile ranks are
considered more important. In practice, Eq. (1) ranks best the
sequence positions that vary among the most evolutionarily divergent
branches and that are also invariant within small branches of closely
related species.
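The integer-value rank described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ivet_ranks` and the encoding of the tree as a list of leaf-index sets (one per node) are hypothetical and are not taken from the ET code.

```python
from typing import List, Sequence


def ivet_ranks(alignment: Sequence[str], nodes: Sequence[Sequence[int]]) -> List[int]:
    """Integer-value ET rank for every column of an alignment.

    alignment: aligned sequences of equal length, one string per homolog.
    nodes: for each of the N - 1 tree nodes, the indices of the sequences
           below that node (an assumed encoding of the phylogenetic tree).
    """
    n_columns = len(alignment[0])
    ranks = []
    for col in range(n_columns):
        rank = 1
        for members in nodes:
            residues = {alignment[s][col] for s in members}
            # delta_n is 0 when the column is invariant within node n, else 1
            rank += 0 if len(residues) == 1 else 1
        ranks.append(rank)
    return ranks


# Four sequences with the tree ((0,1),(2,3)): root, left branch, right branch.
toy_alignment = ["AAA", "AAC", "ACA", "ACC"]
toy_nodes = [[0, 1, 2, 3], [0, 1], [2, 3]]
print(ivet_ranks(toy_alignment, toy_nodes))  # [1, 2, 4]
```

In the toy example the first column is invariant (rank 1), the second varies only between the two main branches (rank 2), and the third varies within both small branches (rank 4), which mirrors the ordering the text describes.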
Following this scheme, top-ranked ET residues (or ET residues
for short, usually defined as those residues ranked in the top 30th
percentile) can be singled out in a sequence or structure. As
expected, completely invariant residues are the most important
and highly variable ones tend to be least so. However, top-ranked
residues can be surprisingly variable as long as these variations
are between rather than within large branches. Conversely, some
they do exhibit are within small evolutionary branches. The phylo-
Fig. 1. The Evolutionary Trace method. The proteins making up the multiple sequence
alignment are divided into groups based on the phylogenetic tree. Each group has a
representative sequence with the invariant residues. The ET method extracts the relative
evolutionary importance of the residues; in this example the top-ranked residues are
marked 1, 2 and 3. These residues are then mapped onto the protein structure in order
to visualize the functional site.
are more or less important. Moreover, the use of the tree also
naturally takes into account the bias due to overrepresentation of
some branches, a difficult aspect for conservation or co-variation
In practice, ET residues have remarkable structural and functional properties:
They cluster together spatially in the protein structure (3)
sites for catalysis or ligand binding (4)
Internal clusters of ET residues presumably form the folding
core of the protein, and, in some cases, play a critical role in
allosteric regulation and specificity (5)
Mutations directed to ET residues will alter function in a
variety of ways (6–8)
Mimicry of ET residues leads to peptides with functional
And in silico mimicry of top-ranked ET residues identifies
functional similarity (10, 11)
For example, this early version of ET detected functional resi-
dues and directed mutational studies into the molecular basis of
G protein signaling (12–14). One hundred mutations of the
Galpha-protein confirmed prior ET predictions of binding sites
to the G beta gamma subunits and to the G protein-coupled recep-
tor (15). Likewise, ET clusters of evolutionary important residues
in the regulators of G protein signaling (RGS) were subsequently
confirmed—one at an RGS-Galpha binding interface and another
that mediates cGMP phosphodiesterase (PDE) interactions
transfer of function between RGS7 and RGS9 by mutationally
that ET could identify a protein’s binding sites and its key residues.
1.2. ET Refinements:
Hybrid and Clustering
A number of refinements were added to the basic ET algorithm to
increase its robustness. One issue addressed was the fact that (1)
leads to ET ranks that are over-sensitive to errors, gaps, insertions,
Each of these may break the perfect patterns that ET searches for,
namely, variations between branches but invariance within them.
First, the Shannon Entropy (16) was introduced to measure
invariance within the individual branches. This led to a hybrid
entropy-phylogenetic method (17) called the real-value ET
(rvET) because it produces absolute ranks that are not whole
integers. By contrast, the original ET method of Eq. (1) yields
integer ranks and is now referred to as integer-value ET (ivET).
To be clear, the Shannon Entropy, $s_i$, for a given residue
position i is:

$$ s_i = -\sum_{a} f_{ia} \ln f_{ia} \qquad (2) $$

where $f_{ia}$ is the frequency that an amino acid type, a, appears in the
column containing residue position i. This Shannon Entropy is
first calculated for the entire alignment, and then for every
subsequent node defined by the phylogenetic tree. Finally, the
rank $r_i$ of residue i is:

$$ r_i = 1 + \sum_{n=1}^{N-1} \frac{1}{n} \sum_{g=1}^{n} \left\{ -\sum_{a} f_{ia}^{g} \ln f_{ia}^{g} \right\} \qquad (3) $$

where $f_{ia}^{g}$ is the frequency of the amino acid of type a within the
sub-alignment of group g. The number of possible nodes in the
evolutionary tree is (N − 1) where N is the number of sequences
in the alignment. The nodes in the phylogenetic tree are numbered
in the order of increasing distance from the root. A key
achievement of rvET (hereafter simply ET) is that it requires little
manual curation, and thus lends itself to large-scale automation
and allows for web server application.
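As an illustration of the entropy-based rank, the following Python sketch adds, for each tree level n, the average Shannon entropy of the n groups into which the alignment is divided at that level. The names `column_entropy` and `rvet_rank`, and the encoding of the tree as a list of levels (level n holding n groups of sequence indices), are assumptions made for this example and not the interface of the ET code. Gap characters are simply counted as one more symbol.

```python
import math
from collections import Counter
from typing import Sequence


def column_entropy(residues: Sequence[str]) -> float:
    """Shannon entropy (natural log) of the symbols in one alignment column."""
    counts = Counter(residues)
    total = len(residues)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def rvet_rank(column: str, levels: Sequence[Sequence[Sequence[int]]]) -> float:
    """Real-value ET rank of one column.

    column: the residues of every sequence at this alignment position.
    levels: levels[n - 1] lists the n groups (sequence indices) obtained
            by cutting the tree at node n (an assumed tree encoding).
    """
    rank = 1.0
    for n, groups in enumerate(levels, start=1):
        if len(groups) != n:
            raise ValueError(f"level {n} must contain exactly {n} groups")
        level_entropy = sum(column_entropy([column[s] for s in g]) for g in groups)
        rank += level_entropy / n
    return rank


# Four sequences, tree ((0,1),(2,3)) cut progressively into 1, 2 and 3 groups.
toy_levels = [[[0, 1, 2, 3]], [[0, 1], [2, 3]], [[0], [1], [2, 3]]]
for col in ("AAAA", "AACC", "ACAC"):
    print(col, round(rvet_rank(col, toy_levels), 3))
```

A column that varies only between the two main branches ("AACC") scores lower, and is therefore ranked as more important, than one that varies within the small branches ("ACAC").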
A second important improvement quantified the notion of ET
residue clusters (1, 2). Studies on numerous proteins showed that
ET clusters were common and statistically significant (3), then
that they significantly overlapped functional sites (4), and finally,
that the extent of clustering was predictively correlated with the
extent of overlap (18). In other words, the clustering z-score is a
measure of ET quality such that it can be maximized in order to
optimize functional site predictions (19–21).
To derive the clustering z-score, the structure provides an
adjacency matrix between residues: A matrix element $A_{ij}$ is equal
to one if residues i and j are adjacent in the structure,
and equal to zero otherwise. If a residue meets a given ET threshold
of importance, the parameter $S_i = 1$. If that residue i does not
meet this importance cut-off, then $S_i = 0$. With these definitions,
the cluster weight at a particular importance threshold is

$$ w = \sum_{i<j} S_i S_j A_{ij} (j - i), \qquad (4) $$

where (j − i) is a weighting function that favors residues that are
near in structure but far in sequence. Finally, the clustering z-score
is determined, as usual:

$$ z = \frac{w - \langle w \rangle}{\sigma}. \qquad (5) $$

The average, $\langle w \rangle$, and standard deviation, $\sigma$, in the ensemble
of random residue choices are found through repeated sampling
or analytically (18).
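The clustering z-score can be approximated by repeated sampling, as in the minimal Python sketch below. It assumes a precomputed 0/1 adjacency matrix and estimates the mean and standard deviation of the cluster weight from random residue selections of the same size; the function names, the sample count and the fixed random seed are illustrative choices, not those of the ET software.

```python
import random
import statistics
from typing import Sequence, Set


def cluster_weight(selected: Set[int], adjacency: Sequence[Sequence[int]]) -> int:
    """Sum of (j - i) over selected residue pairs i < j that are adjacent."""
    chosen = sorted(selected)
    weight = 0
    for pos, i in enumerate(chosen):
        for j in chosen[pos + 1:]:
            weight += adjacency[i][j] * (j - i)
    return weight


def clustering_z_score(selected: Set[int], adjacency: Sequence[Sequence[int]],
                       n_samples: int = 1000, seed: int = 0) -> float:
    """z-score of the observed weight against random selections of equal size."""
    rng = random.Random(seed)
    n_residues = len(adjacency)
    observed = cluster_weight(selected, adjacency)
    samples = [
        cluster_weight(set(rng.sample(range(n_residues), len(selected))), adjacency)
        for _ in range(n_samples)
    ]
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    # With no variation among random selections the z-score is undefined; report 0.
    return 0.0 if spread == 0 else (observed - mean) / spread


# Toy structure of four residues with contacts 0-3 and 1-2.
toy_adjacency = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
print(cluster_weight({0, 3}, toy_adjacency))  # 3
print(clustering_z_score({0, 3}, toy_adjacency, n_samples=200, seed=1) > 0)
```

Sampling is the simpler of the two routes mentioned in the text; an analytical mean and standard deviation would avoid the random-number dependence.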
These improvements were experimentally tested in different
proteins through a number of protein engineering studies
that included: rewiring functional specificity (22), separating
functions (6), designing of peptide inhibitors and redesigning allo-
steric specificity (5) (see Notes 1–4).
1.3. ET Optimization
and Future Directions
A third generation of improvements originates from the fact that
the clustering among top-ranked residues can be treated as a
measure of ET quality. The greater the clustering z-scores the
better the “fitness” among the selection of sequences making up
the alignment, the phylogenetic tree and the 3D structure of the
protein. This held true when extended for selecting structures
among a set of decoy models of protein folds where the structures
closer to native (18) were more likely to be chosen. This idea was
also extended in order to select the most relevant sequences for ET
analysis. Specifically, a Metropolis Monte Carlo algorithm was
tested in 50 diverse proteins to choose sequences that maximized
the clustering z-scores. The greater these z-scores, the better the
clusters predicted functional sites (19). Another, structure-free
quality measure, Rank Information, can likewise identify problematic
“misfit” sequences during analysis (23). More recently,
multiple ET quality measures were formally defined, such that
maximizing their value optimizes the prediction of functional
sites and annotations (21). Together these studies further confirm
a quantitative relationship among evolutionary pressure (the ET
rank), the protein fold and functional site locations; and they point
to a common feature of ET quality: the rank distribution that best
reflects evolutionary history and functional pressures appears to
maximize “rank continuity,” namely the similarity of ET ranks
1.4. Large Scale
ET was also validated on a large scale in the context of protein
function prediction. This application is motivated by Structural
Genomics (SG) which solves many protein structures that cannot
be annotated by homology-based annotation transfer (24).
Since typically a few residues are essential for binding or catalytic
activities it may be possible instead to rely on local structural
similarities (25): different structures may perform similar bio-
chemical function if they share a common spatial organization of
experimentally verified functional motifs (26) or, lacking those,
key functional residues as defined by ET.
A series of technical studies developed these ideas into an
Evolutionary Trace Annotation (ETA) pipeline to predict the
function of novel protein structures. ET rankings proved useful
to define small structure-function motifs called 3D-templates (27),
to identify meaningful geometric and evolutionary matches of
these templates to other protein structures based on reciprocity
(10), and voting plurality (28) in order to infer function in enzymes
for example, its positive predictive value was 93% (10) in 1218 SG
Enzyme Commission classification, EC numbers). ETA matches
further create a network of local structural and evolutionary simila-
diffusion algorithm can then transfer annotations globally over the
entire network. Every combination of protein and function receives
a confidence score, and the highest one defines the functional
prediction. This competitive annotation diffusion strategy yields
predictions at the most detailed (fourth) EC level. For example,
false positives fell fourfold, at 97% sensitivity, against a recent
method (29). On a large-scale SG set, accuracy rose 6% and false
positives fell twofold at 65% coverage, compared to ETA.
In practice, ETA predictions are being validated experimentally
(30). For example, ETA suggested carboxylesterase activity
(EC 3.1.1.1) for a bacterial protein of unknown function (UniProt
accession Q99WQ5, gene name SAV0321, PDB 3h04 chain A)
found in a vancomycin-resistant strain of the bacterium
Staphylococcus aureus (31). The ETA annotation was based on template
matches to three other carboxylesterases with only 10% to 13%
sequence identity to the query. In vitro biochemical assays then
to the positive control.
This work is notable for two reasons. First, it improves function
discovery in proteins of known structure by formulating reliable
hypotheses for efficient experimental validation. This supports the
knowledge. Second, since ET ranks, the 3D templates and matches
scale test of ET identification of key functional residues.
2.1. Functional Site
by Evolutionary Trace
1. To ensure that only the most relevant proteins are analyzed, a
custom database of sequences removes from NCBI’s non-
redundant protein sequence database any sequence with
“synthetic construct,” “artificial,” “fragment” and “partial”
in the sequence header.
2. To identify homologs to the protein being traced, a BLAST
(Basic Local Alignment Search Tool) (32) search is done
on the custom database. Typically, the default number of
homologs is limited to 500 sequences and the maximum
E-value threshold is set to 0.05 (see Note 5).
3. Sequences with less than half the length of the query protein
are eliminated, as are those with greater than 98% or less than
28% sequence identity (see Note 6).
4. A ClustalW alignment is generated (www.clustal.org) with
default parameters set at gap open penalty (10) and gap extension
penalty (0.05); for the ET web servers, see Note 7. The
current ET code accepts MSF format.
5. The alignment is rescanned for sequences that are too short.
After these are removed, the remaining sequences are then
6. To generate an evolutionary tree, a pairwise sequence similarity
matrix is constructed and the UPGMA method is applied. Any
phylogenetic tree that represents the family of proteins can be
used as input into the ET code.
7. Integer or rvET ranks are computed as described above:
sub-alignments that correspond to nodes in the evolutionary
tree are formed and Eq. (1), or Eqs. (2) and (3), are applied (see Note 8).
8. If a structure is provided: structural clusters of highly ranked
residues in the query structure are identified and their statisti-
cal significance is measured as described in Subheading 3.2.
These clusters indicate likely functional hot spots and provide
a suitable hypothesis to direct mutational studies in order to
identify functional regions and determinants and drug target
9. Direct visualization of ET results can be obtained via two
programs: the ET Viewer and the PyETV application (33).
ET servers and viewers are available at http:/ /mammoth.bcm.
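The sequence-selection criteria in steps 1 and 3 can be expressed as a small filter. The Python sketch below is illustrative only: the function names, the treatment of "-" as the gap character, and the definition of percent identity over columns where both sequences carry a residue are assumptions, since the text does not specify how identity is computed.

```python
# Header terms that exclude a sequence from the custom database (step 1).
EXCLUDED_HEADER_TERMS = ("synthetic construct", "artificial", "fragment", "partial")


def percent_identity(query_aln: str, hit_aln: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(q, h) for q, h in zip(query_aln, hit_aln) if q != "-" and h != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for q, h in pairs if q == h)
    return 100.0 * matches / len(pairs)


def keep_homolog(header: str, query_aln: str, hit_aln: str,
                 min_identity: float = 28.0, max_identity: float = 98.0) -> bool:
    """Apply the header, length and identity filters to one aligned homolog."""
    lowered = header.lower()
    if any(term in lowered for term in EXCLUDED_HEADER_TERMS):
        return False
    query_length = len(query_aln.replace("-", ""))
    hit_length = len(hit_aln.replace("-", ""))
    if hit_length < 0.5 * query_length:  # shorter than half the query
        return False
    identity = percent_identity(query_aln, hit_aln)
    return min_identity <= identity <= max_identity


query = "ACDEFGHIKL"
print(keep_homolog("kinase [Homo sapiens]", query, "ACDEFGHIKV"))  # True
print(keep_homolog("kinase, partial", query, "ACDEFGHIKV"))        # False
print(keep_homolog("kinase [Mus musculus]", query, "ACDEFGHIKL"))  # False
```

The thresholds are exposed as parameters because, as Note 6 points out, they are often adjusted case by case.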
2.2. Protein Function
by Evolutionary Trace
1. rvET is applied to a query protein structure of unknown
function to rank the evolutionary importance of its residues.
2. The first cluster with ten evolutionarily important surface
residues is identified. A residue is defined to be on the surface
if its solvent accessibility is at least 2 Å² as calculated by
3. The six most evolutionarily important residues in that cluster
define the query template. Their alpha carbon coordinates
define the template geometry. If ties arise between candidate
residues, those closest to a point halfway between the center
of mass of the growing template are chosen.
4. The template is allowed to vary in keeping with the side chain
variations found in multiple sequence alignment used by ET,
provided an amino acid appears at least twice.
5. The templates are matched to target proteins of known struc-
ture and function (the current target set is 2008PDB90 (24)).
Functions are described by the Enzyme Commission (EC)
numbers (35) or Gene Ontology (GO) molecular terms (36).
Geometric matches are obtained hierarchically, employing a
distance cutoff of 2.5 Å (28). Finally, a root-mean-square
distance (RMSD) is calculated.
6. It is important to filter nonspecific geometric matches. First,
only those with RMSD below 2 Å are considered for further
analysis. Second, a support vector machine (SVM) chooses
matches that are both geometrically and evolutionarily signif-
icant (it combines RMSD and evolutionary similarity between
the template and the matched sites in the target structures).
Third, these steps are repeated by reversing the role of the
query and of the target structure in order to assess reciprocity:
reciprocal ETA matches between two protein structures are
much less likely to be due to chance. Fourth, all-against-all
matches make it possible to tally how often a query matches different
proteins with the same function. A plurality rule is then
applied to transfer to the query the one function annotation
that is matched the most often. In the case of a tie, no
prediction is suggested.
7. For GO annotations, ETA takes into account all known GO
terms and their parent terms for each match. ETA votes at
each GO depth in such a way that the most voted or tied terms
are considered to be predictions. Voting continues until a GO
term has no more child terms. Once a term or terms are
considered to be predictions, their child terms are also sug-
gested as predictions. In the voting procedure, self-matches
8. An ETA server is available at http://mammoth.bcm.tmc.edu/
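The RMSD filter, the reciprocity requirement and the plurality rule of step 6 can be sketched as follows. This Python example is a simplification under stated assumptions: the `Match` record and the `predict_function` name are hypothetical, and the support vector machine that weighs geometric and evolutionary similarity is omitted entirely.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class Match(NamedTuple):
    function: str     # annotation of the matched protein, e.g. an EC number
    rmsd: float       # template-to-site RMSD in angstroms
    reciprocal: bool  # True if the match also holds with roles reversed


def predict_function(matches: Sequence[Match], max_rmsd: float = 2.0,
                     require_reciprocal: bool = True) -> Optional[str]:
    """Return the plurality function among retained matches, or None on a tie."""
    votes = Counter(
        m.function
        for m in matches
        if m.rmsd < max_rmsd and (m.reciprocal or not require_reciprocal)
    )
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no prediction is suggested
    return ranked[0][0]


toy_matches = [
    Match("3.1.1.1", 1.2, True),
    Match("3.1.1.1", 0.9, True),
    Match("2.7.1.1", 1.5, True),
    Match("2.7.1.1", 2.4, True),  # dropped: RMSD is not below 2 angstroms
]
print(predict_function(toy_matches))  # 3.1.1.1
```

Setting `require_reciprocal=False` would also count non-reciprocal matches, which the text describes as less reliable.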
3.1. ET Servers
A summary of ET tools is reported in Table 1. There are a number
of servers that provide ET results:
1. The first server (http://mammoth.bcm.tmc.edu/ETserver.html)
requires the users to enter a PDB ID (e.g., 2phy).
The web output includes links that launch ETV and PyMOL
with which to view a structural mapping of every trace. This
output also packages zipped versions of all the files used or
generated by ET.
2. The Evolutionary Trace Report Maker is a second server (37),
which produces a fully automated ET report in a PDF document
(http://mammoth.bcm.tmc.edu/report_maker). It pools data from
several sources, and adds to that background inference on
functional sites and residues obtained from rvET. It requires
either a Protein Data Bank (PDB) identifier or a UniProt
accession number for a sequence. Report Maker utilizes HSSP
alignments when available.
3. The “ET Wizard” server is accessible directly through the
evolutionary trace viewer (ETV), launched separately in the
“Utils” menu, and useful for generating user-controlled traces
Table 1. Available ET tools
Name/URL | Type | Purpose | Input | Output
Evolutionary Trace Results
Web server Functional site
PDB ID ET analyses files
Evolutionary Trace Report
maker http:/ /mammoth.
Web server Functional site
PDB ID or
PDF report, ET
Evolutionary Trace Viewer
(ETV) http:/ /mammoth.
PyMOL ETV http:/ /
ET rank data,
Annotation (ETA) server
PDB ID EC and GO
A Tool to Run ET
and View Results
The ETV (38) (http://mammoth.bcm.tmc.edu/traceview) is a
one-stop environment to run, visualize and interpret ET predic-
tions of functional sites in protein structures. It is implemented in
Java and runs across different operating systems utilizing Java Web
Start Technology for self-installation.
1. A key ETV feature is an interactive molecular graphics display
file. This file is selected in the “File” menu command: “Open
ETV Results.” It produces a colored structural map of the ET
rank of every protein residue. Evolutionary and functional hot
of top-ranked residues, and the statistical z-score of these
clusters is shown. The threshold of percentile rank to color
top-ranked residues can be adjusted by moving a slider (hori-
zontal scrollbar) prominently shown on top of the graphics
to display at once a heatmap of evolutionary importance.
2. A second feature of ETV is that the evolutionary tree used to
compute the ET rank of every residue can be viewed: select
“ET Tree” under the “View” menu.
3. Critically, an ET Wizard is integrated into ETV (under the
“Utils” menu) to let users launch customized ET analyses.
The ET Wizard accepts either a PDB ID, or a PDB formatted
file provided directly by the user as input. Users may then also
choose to provide their own custom alignments or set of input
sequences. Alternately, they can allow the ET Wizard to build
its own alignments (see Note 9).
4. A database of pre-generated ET analysis results for all unique
chains in the PDB is maintained and regularly updated.
3.3. PyMOL ETV:
ET Viewer for Protein
Chains and Complexes
Because protein–protein interactions are an emerging target for design and
therapeutics, an alternative system was developed to trace multi-
protein interfaces. This PyETV (for PyMOL Evolutionary Trace
Viewer) (33) provides a high graphics quality interface to map
evolutionary forces and identify functional sites in complexes.
1. The PyETV is a plug-in that builds on the popular and exten-
sible PyMOL molecular graphics package (39). Information
for its installation, and instructional videos, are available
PyETVHelp/pyInstructions.html. PyETV is also integrated
into the web server http://mammoth.bcm.tmc.edu/
ETserver.html through web links to PyMOL scripts.
2. PyMOL (39) (www.pymol.org) is a versatile molecular
graphics package developed by Bill DeLano to view, select,
label, and perturb any number of structures or substructures
(such as groups of atoms or residues) in many ways (e.g.,
cartoon, surface, stereo etc.). Moreover, it is easily extended
with plug-ins—scripts that can add to PyMOL’s user interface
and can overlay complementary information to a protein
structure, such as electrostatics maps.
3. Through the PyETV plug-in, any number of user-generated
and pre-generated ET analysis results can be mapped to any
number of structures and displayed in PyMOL. In particular,
predicted biological assemblies from PISA (40) and ET analysis
for each component in the assembly can be loaded directly
through PyETV using the “Assembly” tab. As with ETV,
PyETV provides a colored structural map of the importance
of each residue in a protein.
in Protein Structures
Using 3D Templates
1. ETA analysis starts with the PDB code of the protein structure
of unknown function, including a 1-digit chain identifier.
Click “Submit.” An ET analysis then provides information
on the evolutionary importance of each residue. If this ET
analysis is cached, the server goes to step 2. If not, it launches
automatically a new trace with default parameters. One may
gain control over this process by uploading a custom ET
analysis that was run before through the ET Wizard: click
“Browse” to locate such an ET file and “Upload” to submit it
to the ETA server (http://mammoth.bcm.tmc.edu/ETA).
a cluster of evolutionarily important residues on the surface of
the protein, picking the six most important ones. It renders an
clicking on the image to download a PyMOL session file. The
3. The server next identifies possible amino acid types for each
template residue based on the multiple sequence alignment
used by ET. Each unique combination is listed, along with the
number of times it occurs in the alignment. Combinations
may be turned on or off using their check boxes. Custom
amino acid labels can also be added. Click “Find Matches” to
begin the template search.
4. The results page contains GO and EC predictions based on
reciprocal matches (highly reliable) and non-reciprocal
matches (less reliable). The GO terms and EC numbers are
hyperlinked to web pages containing more information about
that GO term or EC number.
1. Rewiring functional specificity: Top-ranked residues were
exchanged to rewire transcriptional specificity in evolutionary
divergent helix-loop-helix proneural transcription factors
from the frog and the fly, and vice versa (22).
2. Separating functions: Alanine mutations of ET-predicted
functional residues confirmed predictions of new functional
sites and led to selective loss of function in the Ku70/80
heterodimer. One site was found to be responsible for
telomere maintenance and another site, that was structurally
diametrically opposite and facing the centromere, was respon-
sible for end-joining of double-strand DNA break repair (6).
to mimic ET-predicted sites composed mostly of solvent
exposed helices. The top-ranked residues were left intact
while the lesser-ranked amino acids were chosen to favor
helix formation. These peptides disrupted in vitro binding
among nuclear receptors (41) and, in another case, G protein-
coupled receptor phosphorylation by G protein receptor
4. Redesigning allosteric specificity: ET residues in the transmembrane
domain of Class A GPCRs (42) were targeted for mutations.
Some selectively uncoupled beta-arrestin-mediated
signaling from G protein-mediated signaling (43). Others
rewired a dopamine receptor to become serotonin responsive
not by altering ligand binding specificity, but rather by altering
the response of the allosteric pathway to either ligand (5).
5. ET analysis can be done for any reasonable set of sequences.
Typically 15–20 sequences are needed but this depends on the
validity and diversity of the set. When structural information is
known, HSSP alignments can also be an option.
6. The parameters for filtering sequences were optimized for
better functional site prediction. They are often adjusted on
a case-by-case basis; for example, when studying an entire
family, it is important to ignore cut-offs like sequence identity.
7. For cases where homologues are close, the quicktree option in
ClustalW dramatically decreases computational time.
8. In sequence analysis, gaps are treated as a 21st amino acid.
This is simply a computational device and has no biological relevance.
9. In the ET Wizard tool, the user can control the number of
sequences to be included in the alignment, after a BLAST
search, and the thresholds for acceptable sequence identity
and sequence length.
Institute of Health through NIH-GM079656, NIH-GM066099,
T90 DA022885, R90 DA023418, NLM 5T15LM07093, and of
the National Science Foundation through NSF CCF-0905536.
1. Lichtarge, O., Bourne, H.R. & Cohen, F.E.
An evolutionary trace method defines binding
surfaces common to protein families. J Mol
Biol 257, 342–358 (1996).
2. Lichtarge, O., Yamamoto, K.R. & Cohen, F.
E. Identification of functional surfaces of the
zinc binding domains of intracellular recep-
tors. J Mol Biol 274, 325–337 (1997).
3. Madabushi, S. et al. Structural clusters of evo-
lutionary trace residues are statistically signifi-
cant and common in proteins. J Mol Biol 316,
4. Yao, H. et al. A Sensitive, Accurate, and Scal-
able Method to Identify Functional Sites in
5. Rodriguez, G.J., Yao, R., Lichtarge, O. &
Wensel, T.G. Evolution-guided discovery
and recoding of allosteric pathway specificity
determinants in psychoactive bioamine recep-
tors. Proc Natl Acad Sci U S A 107,
Bertuch, A.A. Distinct faces of the Ku
heterodimer mediate DNA repair and telo-
meric functions. Nat Struct Mol Biol 14,
7. Rajagopalan, L., Pereira, F.A., Lichtarge, O.
& Brownell, W.E. Identification of function-
ally important residues/domains in mem-
approach coupled with systematic mutational
analysis. Methods Mol Biol 493, 287–297
8. Kobayashi, H., Ogawa, K., Yao, R., Lichtarge,
O. & Bouvier, M. Functional rescue of beta-
adrenoceptor dimerization and trafficking by
9. Baameur, F. et al. Role for the regulator of
G-protein signaling homology domain of G
protein-coupled receptor kinases 5 and 6 in
beta 2-adrenergic receptor and rhodopsin
J Mol Biol 326,
10. Ward, R.M. et al. De-orphaning the structural
proteome through reciprocal comparison of
evolutionarily important structural features.
PLoS ONE 3, e2136 (2008).
O. Evolutionary trace annotation of protein
function in the structural proteome. J Mol Biol
12. Onrust, R. et al. Receptor and betagamma
binding sites in the alpha subunit of the retinal
G protein transducin. Science 275, 381–384
O. A regulator of G protein signaling interac-
tion surface linked to effector specificity. Proc
Natl Acad Sci U S A 97, 1483–1488 (2000).
14. Sowa, M.E. et al. Prediction and confirmation
of a site critical for effector regulation of RGS
domain activity. Nat Struct Biol 8, 234–237
15. Lichtarge, O., Bourne, H.R. & Cohen, F.E.
Evolutionarily conserved Galphabetagamma
binding surfaces support a model of the G
protein-receptor complex. Proc Natl Acad
Sci U S A 93, 7507–7511 (1996).
16. Shenkin, P.S., Erman, B. & Mastrandrea, L.D.
Information-theoretical entropy as a measure
of sequence variability. Proteins 11, 297–313
17. Mihalek, I., Res, I. & Lichtarge, O. A family
of evolution-entropy hybrid methods for
ranking protein residues by importance.
J Mol Biol 336, 1265–1282 (2004).
18. Mihalek, I., Res, I., Yao, H. & Lichtarge, O.
Combining inference from evolution and
geometric probability in protein structure
evaluation. J Mol Biol 331, 263–279 (2003).
19. Mihalek, I., Res, I. & Lichtarge, O. Evolu-
tionary and structural feedback on selection of
sequences for comparative analysis of pro-
teins. Proteins 63, 87–99 (2006).
20. Mihalek, I., Res, I. & Lichtarge, O. A
structure and evolution-guided Monte Carlo
sequence selection strategy for multiple
Bioinformatics 22, 149–156 (2006).
21. Wilkins, A.D., Lua, R., Erdin, S., Ward, R.M.
& Lichtarge, O. Sequence and structure con-
tinuity of evolutionary importance improves
protein functional site discovery and annota-
tion. Protein Sci 19, 1296–1311.
22. Quan, X.J. et al. Evolution of neural precursor
selection: functional divergence of proneural
23. Yao, H., Mihalek, I. & Lichtarge, O. Rank
information: a structure-independent mea-
sure of evolutionary
improves identification of protein functional
sites. Proteins 65, 111–123 (2006).
24. Berman, H.M. et al. The Protein Data Bank.
Nucleic Acids Res 28, 235–242 (2000).
25. Polacco, B.J. & Babbitt, P.C. Automated dis-
covery of 3D motifs for protein function anno-
tation. Bioinformatics 22, 723–730 (2006).
26. Porter, C.T., Bartlett, G.J. & Thornton, J.M.
The Catalytic Site Atlas: a resource of catalytic
sites and residues identified in enzymes using
27. Kristensen, D.M. et al. Recurrent use of evo-
lutionary importance for functional annota-
tion of proteins based on local structural
similarity. Protein Sci 15, 1530–1536 (2006).
28. Kristensen, D.M. et al. Prediction of enzyme
function based on 3D templates of evolution-
arily important amino acids. BMC Bioinfor-
matics 9, 17 (2008).
29. Redfern, O.C., Dessailly, B.H., Dallman, T.J.,
Sillitoe, I. & Orengo, C.A. FLORA: a novel
method to predict protein function from
structure in diverse superfamilies. PLoS Com-
put Biol 5, e1000485 (2009).
30. Venner, E., Lisewski, A.M., Erdin, S., Ward,
R.W., Amin, S. & Lichtarge, O. Accurate
protein structure annotation through com-
petitive diffusion of enzymatic functions over
a network of local evolutionary similarities.
PLoS One 12, e14286 (2010).
31. Gill, S.R. et al. Insights on evolution of viru-
lence and resistance from the complete
genome analysis of an early methicillin-
resistant Staphylococcus aureus strain and a
Staphylococcus epidermidis strain. J Bacteriol
187, 2426–2438 (2005).
32. Altschul, S.F., Gish, W., Miller, W., Myers,
E.W. & Lipman, D.J. Basic local alignment
search tool. J Mol Biol 215, 403–410 (1990).
33. Lua, R.C. & Lichtarge, O. PyETV: a PyMOL
evolutionary trace viewer to analyze func-
tional site predictions in protein complexes.
Bioinformatics 26, 2981–2982.
34. Kabsch, W. & Sander, C. Dictionary of
protein secondary structure: pattern recogni-
tion of hydrogen-bonded and geometrical
features. Biopolymers 22, 2577–2637 (1983).
35. International Union of Biochemistry and
Molecular Biology. Nomenclature Commit-
tee. & Webb, E.C. Enzyme nomenclature
1992: recommendations of the Nomencla-
ture Committee of the International Union of
Biochemistry and Molecular Biology on the
nomenclature and classification of enzymes.
(Academic Press, San Diego; 1992).
36. Ashburner, M. et al. Gene ontology: tool for
the unification of biology. The Gene Ontol-
ogy Consortium. Nat Genet 25, 25–29
37. Mihalek, I., Res, I. & Lichtarge, O. Evolu-
tionary trace report_maker: a new type of ser-
vice for comparative analysis of proteins.
Bioinformatics 22, 1656–1657 (2006).
38. Morgan, D.H., Kristensen, D.M., Mittelman,
D. & Lichtarge, O. ET viewer: an application
for predicting and visualizing functional sites
in protein structures. Bioinformatics 22,
Graphics System, San Carlos, CA, DeLano
40. Krissinel, E. & Henrick, K. Inference of mac-
romolecular assemblies from crystalline state.
J Mol Biol 372, 774–797 (2007).
41. Gu, P. et al. Evolutionary
peptides identify a novel asymmetric interac-
tion that mediates oligomerization in nuclear
receptors. J Biol Chem 280, 31818–31829
42. Madabushi, S. et al. Evolutionary trace of G
protein-coupled receptors reveals clusters of
residues that determine global and class-spe-
cific functions. J Biol Chem 279, 8126–8132
43. Shenoy, S.K. et al. beta-arrestin-dependent, G
protein-independent ERK1/2 activation by
the beta2 adrenergic receptor. J Biol Chem
281, 1261–1273 (2006).