Evolutionary Analysis of Amino Acid Repeats across the Genomes of 12 Drosophila Species
Melanie A. Huntley and Andrew G. Clark
Department of Molecular Biology and Genetics, Cornell University
Repeated motifs of amino acids within proteins are an abundant feature of eukaryotic sequences and may catalyze the
rapid production of genetic and even phenotypic variation among organisms. The completion of the genome sequencing
projects of 12 distinct Drosophila species provides a unique dataset to study these intriguing sequence features on
a phylogeny with a variety of timescales. We show that there is a higher percentage of proteins containing repeats within
the Drosophila genus than most other eukaryotes, including non-Drosophila insects, which makes this collection of
species particularly useful for the study of protein repeats. We also find that proteins containing repeats are
overrepresented in functional categories involving developmental processes, signaling, and gene regulation. Using the set
of 1-to-1 ortholog alignments for the 12 Drosophila species, we test the ability of repeats to act as reliable phylogenetic
signals and find that they resolve the generally accepted phylogeny despite the noise caused by their accelerated rate of
evolution. We also determine that in general the position of repeats within a protein sequence is non-random, with
repeats more often being absent from the middle regions of sequences. Finally, we find evidence to suggest that the
presence of repeats is associated with an increase in evolutionary rate across the entire sequence in which they are
embedded. With additional evidence to suggest a corresponding elevation in positive selection, we propose that some
repeats may be inducing compensatory substitutions in their surrounding sequence.
Evolutionary innovations are often a result of internal
modification. In fact, simple sequence repeats are now be-
ing recognized as influential sequences with the ability to
produce rapid variation, and act as ‘evolutionary tuning
knobs’ (Kashi and King 2006). However, this increase in
mutability can come at a devastating price; numerous diseases
involving neurodegeneration are associated with homopolymer
repeat expansions within protein sequences
(Gatchel and Zoghbi 2005). Yet despite the propensity
of many homopeptides to form insoluble toxic aggregates
within the cell, repetitive sequence is the most commonly
shared feature among eukaryotic proteins (Golding 1999;
Huntley and Golding 2000).
The rapid expansion and contraction of simple se-
quence repeats is generally thought to be facilitated through
replicative slippage (Levinson and Gutman 1987). How-
ever, we have previously found evidence that selection also
plays a role in the evolution and maintenance of some ho-
mopolymers within proteins (Huntley and Golding 2006).
quences, and the observation that selection has acted to preserve
certain repeats by preventing further slippage suggest
that repeats themselves may have inherent functional attrib-
utes. Studies aimed at elucidating these attributes using
structural data have demonstrated that consistent structures
for repeats are largely absent (Wootton 1994; Saqi 1995;
Huntley and Golding 2002). This result has led to the
suggestion that repeats do not form stable globular struc-
tures, and are in fact structurally disordered (Dunker
et al. 2002).
Proteins containing disordered regions are generally
involved in molecular recognition and signaling (Alba
et al. 1999; Dunker et al. 2002). This function might arise
from the intrinsic flexibility and mobility of disordered re-
gions as this could allow for increased association and dis-
sociation rates along with binding promiscuity.
In this study we utilize the newly available genome se-
quences from 12 Drosophila species in order to further in-
vestigate the properties of protein repeats. We determine the
relative abundance and patterns of repeats within each of the
Drosophila proteomes and the specific functional categories
positions of repeats within protein sequences, as this may
provide more insight into their function and evolution.
We also test the ability of repeats to act as phylogenetic
signals. Finally, we investigate the effect of repeats on the
evolutionary rate of their surrounding sequence.
Materials and methods
The protein sequences based on the Comparative
Analysis Freeze 1 (CAF1) dataset for the 12 Drosophila
species (D. melanogaster, D. simulans, D. sechellia, D.
erecta, D. yakuba, D. ananassae, D. pseudoobscura, D.
persimilis, D. willistoni, D. virilis, D. mojavensis, and
D. grimshawi) were retrieved from NCBI.
To investigate the patterns of repetitive sequence
within the proteins of the 12 Drosophila species, we searched
for homopolymers at least five tandem, identical amino acids in length and
implemented the program SEG (Wootton and Federhen 1993)
to detect regions of low complexity within the protein se-
quences. The SEG algorithm is often used to filter out sim-
ple sequence within proteins before performing homology
and similarity based searches like BLAST (Altschul et al.
1997), while a separate program is used to filter simple se-
quence from nucleotide sequences. In our implementation
of the SEG algorithm we used a window length of 15, and
a complexity cut-off K2(1) of 1.9 instead of the default
values (12 and 2.2, respectively). The parameter K2(1) is
an initial cut-off complexity value such that when SEG
initially calculates the complexity of a subsequence, it must
not exceed the cut-off complexity value. These values were
previously shown to preferentially detect the longer and
more repetitive sequence regions common among eukaryotic
proteins (Huntley and Golding 2002).
Key words: protein repeats, simple sequence, homopeptides.
Mol. Biol. Evol. 24(12):2598–2609. 2007
Advance Access publication June 29, 2007
© The Author 2007. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
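As an illustration of the windowed complexity test described for SEG, the following Python sketch flags windows of 15 residues whose compositional complexity does not exceed 1.9. It assumes that K2 is the Shannon entropy (in bits) of the residue composition of the window, and it covers only this initial trigger step; the extension and refinement stages of SEG are not reproduced. The function names are illustrative and are not part of SEG.

```python
import math
from collections import Counter


def window_complexity(window: str) -> float:
    """Shannon entropy, in bits, of the residue composition of one window.

    Assumption: this entropy stands in for the K2 complexity measure.
    """
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(protein: str, window: int = 15, cutoff: float = 1.9):
    """Return the 0-based start of every window whose complexity <= cutoff.

    Only the trigger step is sketched here; the real SEG program goes on
    to extend and refine the flagged regions.
    """
    return [
        i
        for i in range(len(protein) - window + 1)
        if window_complexity(protein[i:i + window]) <= cutoff
    ]
```

A protein shorter than the window yields an empty list, and a pure homopolymer window has a complexity of zero, so it is always flagged.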
Repeat Enrichment and Functional Associations
For each protein, we recorded the number of homopolymers
and low complexity sequences detected, along with
their lengths, relative position within the protein, and gene
ontology (GO) associations (from Drosophila 12 Genomes
Consortium 2007). For homopolymers, we also recorded
the amino acid comprising the tract.
The CAF1 nucleotide sequences for the 12 Drosophila
species were scanned for triplet repeat tracts. All tracts of
five triplet repeats or more were recorded, and subsequently
analyzed to demonstrate differences in the triplet repeat
composition of coding and non-coding sequences.
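A scan of this kind can be sketched with a regular expression. The Python example below records every tract of at least five tandem copies of the same triplet; it is a minimal sketch rather than the procedure actually used, and the function name and output layout are assumptions. Note that, as written, a run of a single nucleotide (for example 15 consecutive A residues) also counts as a triplet tract.

```python
import re


def find_triplet_tracts(dna: str, min_units: int = 5):
    """Return (start, unit, n_units) for each tandem triplet tract.

    start is 0-based; unit is the repeated triplet; n_units is the
    number of tandem copies. Only A, C, G and T are considered.
    """
    # One captured triplet followed by at least (min_units - 1) copies.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    tracts = []
    for match in pattern.finditer(dna.upper()):
        unit = match.group(1)
        n_units = len(match.group(0)) // 3
        tracts.append((match.start(), unit, n_units))
    return tracts
```

Tallying the `unit` field separately for coding and non-coding sequence would give the kind of compositional comparison described above.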
To test the hypothesis that repeats are randomly dis-
persed throughout the length of a protein sequence, we di-
vided each protein into three segments of equal length: an
N-terminal segment, a mid segment, and a C-terminal seg-
ment. Each detected repeat was assigned to one of the three
segments, based upon where the midpoint of the repeat was
located. For instance, a repeat whose midpoint fell within
the N-terminal third of the full protein was considered to
have the majority of the repeat present in that segment
of the protein. Any repeat whose midpoint fell on the
boundary between two segments was randomly assigned
to one of them. This gave us an observed distribution of repeat
positions within the protein sequences.
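The assignment rule can be written compactly. The following Python sketch assumes 0-based repeat start coordinates and uses integer arithmetic so that midpoints falling exactly on a segment boundary are detected reliably; the function and label names are illustrative only.

```python
import random

SEGMENTS = ("N-terminal", "mid", "C-terminal")


def assign_segment(protein_len: int, repeat_start: int, repeat_len: int,
                   rng=random) -> str:
    """Assign a repeat to a third of the protein by its midpoint.

    The midpoint is repeat_start + repeat_len / 2. To avoid floating
    point comparisons, twice the midpoint is compared against the
    boundaries L/3 and 2L/3 after multiplying through by 3.
    """
    twice_mid = 2 * repeat_start + repeat_len
    scaled = 3 * twice_mid            # equals 6 * midpoint
    first = 2 * protein_len           # equals 6 * (L / 3)
    second = 4 * protein_len          # equals 6 * (2L / 3)
    if scaled == first:               # on the N-terminal/mid boundary
        return rng.choice(SEGMENTS[:2])
    if scaled == second:              # on the mid/C-terminal boundary
        return rng.choice(SEGMENTS[1:])
    if scaled < first:
        return SEGMENTS[0]
    if scaled < second:
        return SEGMENTS[1]
    return SEGMENTS[2]
```

Counting the returned labels over all detected repeats gives the observed distribution.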
To perform a goodness of fit χ² test, we needed an expected
distribution for the positions of repeats. However,
since the length of each individual repeat and of the protein
within which the repeat is embedded influences that expec-
tation, we had to generate the expected position of each re-
peat independently. Therefore, for each detected repeat,
assuming random dispersal, we calculated the probability
of the midpoint falling within each segment. If the length
of the entire protein was L and the length of the repeat was l,
then there would be L–l possible positions within the pro-
tein where the midpoint could fall. The mid segment con-
tains L/3 of those possibilities, while the N-terminal and
C-terminal each contain L/3 – l/2 possible midpoint positions.
Therefore the probability that the repeat midpoint
falls within the mid segment is (L/3)/(L – l), while the probability
for each terminal segment is (L/3 – l/2)/(L – l). A random number
was then generated and the expected position of the repeat
was assigned based on these probabilities.
To determine whether the trends we observed within
the Drosophila species were unique to this clade, we also
performed the above analysis on a range of other species,
including two additional insects (Anopheles gambiae and
Apis mellifera), five other eukaryotes (the yeast Saccharomyces
cerevisiae, the worm Caenorhabditis elegans, the
plant Arabidopsis thaliana, the fish Danio rerio, and the
mammal Homo sapiens), two archaebacteria (Methanococ-
cus jannaschii and Pyrococcus horikoshii) and two eubac-
teria (the gram negative Escherichia coli and the gram
positive Bacillus subtilis).
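The expected distribution used in this analysis can be sketched as follows. For a protein of length L and a repeat of length l, the mid segment holds L/3 of the L – l possible midpoint positions and each terminal segment holds L/3 – l/2 of them. The Python sketch below computes these probabilities, draws one expected segment per repeat, and evaluates the goodness of fit statistic. The function names, the use of `random.Random.choices` for the random draw, and the restriction to repeats shorter than two thirds of the protein are assumptions made for illustration.

```python
import random


def segment_probabilities(L: int, l: int):
    """Return P(N-terminal), P(mid), P(C-terminal) for a repeat midpoint."""
    if not 0 < l <= 2 * L / 3:
        raise ValueError("sketch assumes 0 < l <= 2L/3")
    denom = L - l
    terminal = (L / 3 - l / 2) / denom
    mid = (L / 3) / denom
    return terminal, mid, terminal


def expected_counts(repeats, rng):
    """Draw one expected segment per (L, l) pair and tally the draws."""
    counts = [0, 0, 0]
    for L, l in repeats:
        weights = segment_probabilities(L, l)
        segment = rng.choices(range(3), weights=weights)[0]
        counts[segment] += 1
    return counts


def chi_square(observed, expected):
    """Goodness of fit statistic; every expected count must be non-zero."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

With three segments the statistic would be compared against a χ² distribution with two degrees of freedom. Because the mid segment always receives the largest probability under random dispersal, a deficit of repeats in the middle third shows up as a large contribution from that cell.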
Repeats as Phylogenetic Signals
In order to evaluate trends within each Drosophila species,
without confounding the results with proteins unique
to a given lineage, we also analyzed the 1-to-1 T-Coffee
ortholog alignments (obtained from http://rana.lbl.gov/drosophila/wiki/).
In this way, we also had a collection
of proteins for which there was a single ortholog in each
of the 12 Drosophila genomes.
Due to the unique mechanism producing most protein
repeats (replicative slippage and expansion, rather than typ-
ical point mutations), we were curious whether repeats
alone could perform well as phylogenetic signals. Using
the 1-to-1 ortholog alignments of the 12 Drosophila species
we began to test this hypothesis by taking each ortholog
alignment, and scanning for homopolymer repeats at least
five residues in length. If such a homopolymer was found
in any of the species, we attempted to expand the repeat
boundaries in both directions by examining the sequence
in the other species at those particular positions in the alignment.
If any of the sequences contained at least two tandem
amino acid residues identical to the residues in the homo-
polymer, starting at the boundary position and extending
beyond the homopolymer repeat, the boundary was then
extended (see figure 1).
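One possible reading of this boundary-extension step is sketched below in Python. The alignment is assumed to be a list of equal-length aligned strings with "-" for gaps, and a boundary is assumed to move outward to cover any run of at least two copies of the homopolymer residue that starts immediately beyond it in any species. Both the function names and the exact extension rule are interpretations for illustration, not the original implementation.

```python
import re


def find_homopolymer(seq: str, min_len: int = 5):
    """Return (start, end, residue) of the first run >= min_len, or None.

    Coordinates are 0-based and end-exclusive; gap characters never match.
    """
    match = re.search(r"([A-Z])\1{%d,}" % (min_len - 1), seq)
    if match is None:
        return None
    return match.start(), match.end(), match.group(1)


def extend_block(alignment, start: int, end: int, residue: str,
                 min_run: int = 2):
    """Grow the column range [start, end) using runs seen in any row."""
    width = len(alignment[0])
    changed = True
    while changed:
        changed = False
        for row in alignment:
            # Walk left from the boundary over copies of the residue.
            i = start
            while i > 0 and row[i - 1] == residue:
                i -= 1
            if start - i >= min_run:
                start, changed = i, True
            # Walk right from the boundary over copies of the residue.
            j = end
            while j < width and row[j] == residue:
                j += 1
            if j - end >= min_run:
                end, changed = j, True
    return start, end
```

The loop repeats until no row extends either boundary, so an extension contributed by one species can expose a further extension in another.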
After isolating the repeat alignment blocks within the
ortholog alignments we then constructed a distance matrix
for each repeat alignment using two separate methods. In
the first method, termed ‘exact,’ we determined the length
of the longest homopolymer tract (of the predominant
amino acid) for each species in the repeat alignment block.
The distance between any given species pair was then calculated
as the magnitude of the difference between the
lengths of their longest homopolymers. In the second,
‘fuzzy’ method, we counted for each species in a repeat
alignment block the number of times the predominant
amino acid was found in that stretch of sequence, regardless
of whether it was part of a tandem homopolymer tract. The
distance between species was calculated as the magnitude
of the difference between their respective sums of predom-
inant amino acid residues. We included the ‘fuzzy’ analysis
because often a longer repeat tract becomes subsequently
interrupted by amino acid substitutions which could signif-
icantly shorten the observed length of the longest homopol-
ymer tract within the sequence. This could then effectively
bias the distances to be greater than the single amino acid
substitution would warrant. The ‘fuzzy’ analysis should
alleviate this concern.
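The two distance measures can be illustrated with a short Python sketch. It assumes each repeat alignment block is held as a mapping from species name to aligned sequence, that gap characters ("-") are removed before runs are measured, and that the predominant residue is supplied by the caller; all names are hypothetical.

```python
import re


def longest_run(seq: str, residue: str) -> int:
    """Length of the longest tandem run of residue, ignoring gaps."""
    ungapped = seq.replace("-", "")
    runs = re.findall(re.escape(residue) + "+", ungapped)
    return max((len(run) for run in runs), default=0)


def exact_distance(a: str, b: str, residue: str) -> int:
    """'Exact' method: difference in longest homopolymer tract length."""
    return abs(longest_run(a, residue) - longest_run(b, residue))


def fuzzy_distance(a: str, b: str, residue: str) -> int:
    """'Fuzzy' method: difference in total count of the residue."""
    return abs(a.count(residue) - b.count(residue))


def distance_matrix(block: dict, residue: str, dist) -> dict:
    """Pairwise distances for one block, keyed by (species, species)."""
    species = sorted(block)
    return {
        (s, t): dist(block[s], block[t], residue)
        for s in species
        for t in species
    }
```

For a tract of seven glutamines in one species and the same tract interrupted by a single histidine in another ("QQQQQQQ" versus "QQQHQQQ"), the exact distance is 4 while the fuzzy distance is 1, which shows how one substitution can inflate the exact measure.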
The ‘exact’ and ‘fuzzy’ distance matrices were then
programs in the PHYLIP package (Felsenstein 1989) in or-
der to produce individual trees for each repeat alignment
block, and then finally a consensus tree. These consensus
Rates of Evolution
It has been well established now that repeats them-
selves tend to evolve more rapidly than the remaining pep-
tide sequence in which they are embedded (Huntley and
Golding 2000; Romov et al. 2006). However, our finding
in this study that the sequence surrounding a repeat evolves
faster and with an increased signal for positive selection
than sequence containing no repeats is intriguing. Our hy-
pothesis that this increase in evolutionary rate might result
from compensatory substitutions in the flanking sequence
to accommodate the rapid length perturbations in the repeat
sequence is supported by a preliminary data set indicating
that repeats stabilized by selection to prevent further expansion
have flanking sequence with lower evolutionary
rates than repeats that have ongoing slippage. However,
an equally supported hypothesis to explain these results
is that proteins undergoing rapid evolution may benefit
from the acquisition of repeat domains which can then rap-
idly expand and contract until stabilization is preferred. In
this way, repeats could act as evolutionary “tuning knobs”
(Kashi and King 2006) and be selected for on the basis of
the increase in variability afforded by their unique mecha-
nism of mutation.
It is still curious, however, that some repeats appear
advantageous or neutral, while others are incredibly dele-
terious. This apparent discordance can be somewhat recon-
ciled by the findings from an experiment by Brignull et al.
(2006), who used C. elegans mutants to demonstrate that
the threshold for pathogenic length in poly-Q type diseases
could be manipulated by perturbing the function of various
housekeeping proteins. By using mutants with extended
lifespans they demonstrated that the onset for poly-Q path-
ogenesis can be further delayed, in agreement with obser-
vations that in general the age of onset is related to
homopolymer tract length and lifespan of the organism.
They then reasoned that a cellular buffering system exists
to prevent proteotoxicity until a certain age when the buffering
system begins to fail. They found additionally that
they could induce transition from soluble protein to aggre-
gate states in homopolymer lengths just under the patho-
genic threshold by disrupting genes involved in the
seems apparent that repeats themselves only become problematic
to the cell when other housekeeping networks begin
to fail. Otherwise repeat expansions may induce rapid com-
pensatory mutations presumably to stabilize the protein
structure, preserving function, or by virtue of their propen-
sity to rapidly expand and contract, repeats may enable the
exploration of novel protein conformations and functions.
Supplementary Material
Supplementary material figures S1 and S2 are avail-
able at Molecular Biology and Evolution online (http://
Acknowledgments
The authors wish to thank Hadi Quesneville for pro-
viding microsatellite predictions within the genomic se-
quences, Dara Torgerson for collecting the mammalian
multiz alignments, Amanda Laracuente for assistance with
the gene ontology associations and Tim Sackton for provid-
ing the PAML data. We also thank David King and two
anonymous reviewers for their insightful comments on this
manuscript. This work was supported by a Natural Sciences
Literature Cited
Alba MM, Guigo R. 2004. Comparative analysis of amino acid
repeats in rodents and humans. Genome Res. 14:549–554.
Alba MM, Santibanez-Koref MF, Hancock JM. 1999. Amino
acid reiterations in yeast are overrepresented in particular
classes of proteins and show evidence of a slippage-like
mutational process. J Mol Evol. 49:789–797.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-
BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25:3389–3402.
Benson G. 1999. Tandem repeats finder: a program to analyze
DNA sequences. Nucleic Acids Res. 27:573–580.
Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors). 2004.
Aligning multiple genomic sequences with the threaded
blockset aligner. Genome Res. 14:708–715.
Brignull HR, Morley JF, Garcia SM, Morimoto RI. 2006.
Modeling polyglutamine pathogenesis in C. elegans. Methods
DePristo MA, Zilversmit MM, Hartl DL. 2006. On the abun-
dance, amino acid composition, and evolutionary dynamics
of low-complexity regions in proteins. Gene. 378:19–30.
Drosophila 12 Genomes Consortium. 2007. Evolution of genes
and genomes on the Drosophila phylogeny. Nature. doi:
Dunker AK, Brown CJ, Lawson
Obradovic Z. 2002. Intrinsic disorder and protein function.
Felsenstein J. 1989. PHYLIP- Phylogeny Inference Package
(Version 3.2). Cladistics. 5:164–166.
Fujimori S, Washio T, Higo K, et al. (11 co-authors). 2003. A
novel feature of microsatellites in plants: a distribution
gradient along the direction of transcription. FEBS Lett.
Gatchel JR, Zoghbi HY. 2005. Diseases of unstable repeat
expansion: mechanisms and common principles. Nat Rev
Golding GB. 1999. Simple sequence is abundant in eukaryotic
proteins. Protein Sci. 8:1358–1361.
Huntley M, Golding GB. 2000. Evolution of simple sequence in
proteins. J Mol Evol. 51:131–140.
Huntley MA, Golding GB. 2002. Simple sequences are rare in
the Protein Data Bank. Proteins. 48:134–140.
Huntley MA, Golding GB. 2004. Neurological proteins are
not enriched for repetitive
Huntley MA, Golding GB. 2006. Selection and slippage creating
serine homopolymers. Mol Biol Evol. 23:2017–2025.
Karlin S, Burge C. 1996. Trinucleotide repeats and long
homopeptides in genes and proteins associated with nervous
system disease and development. Proc Natl Acad Sci USA.
Karolchik D, Baertsch R, Diekhans M, et al. (13 co-authors).
2003. The UCSC Genome Browser Database. Nucleic Acids
Kashi Y, King DG. 2006. Simple sequence repeats as advanta-
geous mutators in evolution. Trends Genet. 22:253–259.
Kolpakov R, Bana G, Kucherov G. 2003. Mreps: efficient and
flexible detection of tandem repeats in DNA. Nucleic Acids
Levinson G, Gutman GA. 1987. Slipped-strand mispairing:
a major mechanism for DNA sequence evolution. Mol Biol
Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D. 1999. A
census of protein repeats. J Mol Biol. 293:151–160.
Pizzi E, Frontali C. 2001. Low-complexity regions in Plasmo-
dium falciparum proteins. Genome Res. 11:218–229.
Romov PA, Li F, Lipke PN, Epstein SL, Qiu WG. 2006. Compara-
tive genomics reveals long, evolutionarily conserved, low-
complexity islands in yeast proteins. J Mol Evol. 63:415–425.
Saqi M. 1995. An analysis of structural instances of low
complexity sequence segments. Protein Eng. 8:1069–1073.
Siwach P, Pophaly SD, Ganesh S. 2006. Genomic and
evolutionary insights into genes encoding proteins with single
amino acid repeats. Mol Biol Evol. 23:1357–1369.
Storey JD, Tibshirani R. 2003. Statistical significance for genome-
wide studies. Proc Natl Acad Sci USA. 100:9440–9445.
Wootton J. 1994. Sequences with ‘unusual’ amino acid
compositions. Curr Opin Struct Biol. 4:413–421.
Wootton J, Federhen S. 1993. Statistics of local complexity in
amino acid sequences and sequence databases. Comput Chem.
Yang Z. 1997. PAML: a program package for phylogenetic
analysis by maximum likelihood. Comput Appl Biosci. 13:
Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J, Tang K. 2006.
Distributional gradient of amino acid repeats in plant proteins.
David Erwin, Associate Editor
Accepted June 18, 2007