A formal test of the theory of universal common ancestry.
ABSTRACT Universal common ancestry (UCA) is a central pillar of modern evolutionary theory. As first suggested by Darwin, the theory of UCA posits that all extant terrestrial organisms share a common genetic heritage, each being the genealogical descendant of a single species from the distant past. The classic evidence for UCA, although massive, is largely restricted to 'local' common ancestry (for example, of specific phyla rather than the entirety of life) and has yet to fully integrate the recent advances from modern phylogenetics and probability theory. Although UCA is widely assumed, it has rarely been subjected to formal quantitative testing, and this has led to critical commentary emphasizing the intrinsic technical difficulties in empirically evaluating a theory of such broad scope. Furthermore, several researchers have proposed that early life was characterized by rampant horizontal gene transfer, leading some to question the monophyly of life. Here I provide the first, to my knowledge, formal, fundamental test of UCA, without assuming that sequence similarity implies genetic kinship. I test UCA by applying model selection theory to molecular phylogenies, focusing on a set of ubiquitously conserved proteins that are proposed to be orthologous. Among a wide range of biological models involving the independent ancestry of major taxonomic groups, the model selection tests are found to overwhelmingly support UCA irrespective of the presence of horizontal gene transfer and symbiotic fusion events. These results provide powerful statistical evidence corroborating the monophyly of all known life.
A formal test of the theory of universal common ancestry
Douglas L. Theobald1
Universal common ancestry (UCA) is a central pillar of modern
evolutionary theory1. As first suggested by Darwin2, the theory of
UCA posits that all extant terrestrial organisms share a common
genetic heritage, each being the genealogical descendant of a single
species from the distant past3–6. The classic evidence for UCA,
although massive, is largely restricted to ‘local’ common ancestry—
for example, of specific phyla rather than the entirety of life—and
has yet to fully integrate the recent advances from modern phylogenetics
and probability theory. Although UCA is widely assumed, it
has rarely been subjected to formal quantitative testing7–10, and this
has led to critical commentary emphasizing the intrinsic technical
difficulties in empirically evaluating a theory of such broad
scope1,5,8,9,11–15. Furthermore, several researchers have proposed that
early life was characterized by rampant horizontal gene transfer,
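leading some to question the monophyly of life11,14,15. Here I provide
the first, to my knowledge, formal, fundamental test of UCA, without
assuming that sequence similarity implies genetic kinship. I test UCA
by applying model selection theory5,16,17 to molecular phylogenies,
focusing on a set of ubiquitously conserved proteins that are proposed
to be orthologous. Among a wide range of biological models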
involving the independent ancestry of major taxonomic groups,
the model selection tests are found to overwhelmingly support
UCA irrespective of the presence of horizontal gene transfer and
symbiotic fusion events. These results provide powerful statistical
evidence corroborating the monophyly of all known life.
‘‘all the organic beings which have ever lived on this earth have
descended from some one primordial form’’2. This theory of
UCA—the proposition that all extant life is genetically related—is
perhaps the most fundamental premise of modern evolutionary
theory, providing a unifying foundation for all life sciences. UCA is
now supported by a wealth of evidence from many independent
sources18, including: (1) the agreement between phylogeny and bio-
geography; (2) the correspondence between phylogeny and the
acteristics; (5) the marked similarities of biological structures with
different functions (that is, homologies); and (6) the congruence of
morphological and molecular phylogenies9,10. Although the consilience
of these classic arguments provides strong evidence for the common
ancestry of higher taxa such as the chordates or metazoans, none
are all genetically related. However, the ‘universal’ in universal common
ancestry is primarily supported by two further lines of evidence:
various key commonalities at the molecular level6 (including fundamental
biological polymers, nucleic acid genetic material, L-amino
acids, and core metabolism) and the near universality of the genetic
code4,7. Notably, these two traditional arguments for UCA are largely
qualitative, and typical presentations of the evidence do not assess
quantitative measures of support for competing hypotheses, such as
the probability of evolution from multiple, independent ancestors.
The inference from biological similarities to evolutionary homology
is a feature shared by several of the lines of evidence for common
lance, often gauged by an E value from a BLAST search, indicates
by chance20. A Karlin–Altschul E value is a Fisherian null-hypothesis
significance test in which the null hypothesis is that two random
sequences have been aligned20. Therefore, an E value in principle
cannot provide evidence for or against the hypothesis that two
sequences share a common ancestor. (In fact, an E value cannot even
provide evidence for the random null hypothesis.21) Sequence simi-
is a hypothesis proposed to explain the similarity22. Statistically sig-
ancestry, such as convergent evolution due to selection, structural
constraints on sequence identity, mutation bias, chance, or artefact
manufacture19. For these reasons, a sceptic who rejects the common
proteins have similar sequences and are ‘homologous’ in the original
pre-Darwinian sense of the term (homology here being similarity of
structure due to ‘‘fidelity to archetype’’)23. Consequently, it would be
support from sequence data for common-ancestry versus competing
Here I report tests of the theory of UCA using model selection
theory, without assuming that sequence similarity indicates a genealogical
relationship. By accounting for the trade-off between data prediction
and simplicity, model selection theory provides methods for
choosing among several competing scientific models, two opposing
number of free parameters. On the other hand, simple hypotheses
(those with as few ad hoc parameters as possible) are preferred.
Model selection methods weigh these two factors statistically to find
the hypothesis that is both the most accurate and the most precise.
Because model selection tests directly quantify the evidence for and
against competing models, these tests overcome many of the well-
known logical problems with Fisherian null-hypothesis significance
tests (such as BLAST-style E values)16,21. To quantify the evidence
widely used model selection criteria from all major statistical schools:
and the log Bayes factor (LBF)16,17.
Using these model selection criteria, I specifically asked whether
the three domains of life (Eukarya, Bacteria and Archaea) are best
1Department of Biochemistry, Brandeis University, Waltham, Massachusetts 01778, USA.
Vol 465|13 May 2010|doi:10.1038/nature09014
Macmillan Publishers Limited. All rights reserved
described by a unified, common genetic relationship (that is, UCA)
or by multiple groups of genetically unrelated taxa that arose inde-
pendently and in parallel. As one example, a simplified model was
considered for the hypothesis that Archaea and Eukarya share a
common ancestor but do not share a common ancestor with
Bacteria. This model (indicated by ‘AE+B’ in Fig. 1 and Table 1)
comprises two independent trees—one containing Archaea and
Eukarya and another containing only Bacteria. In these models the
primary assumptions are: (1) that sequences change over time by a
gradual, time-reversible Markovian process of residue substitution,
described by a 20 × 20 instantaneous rate matrix defined by certain
amino acid equilibrium frequencies and a symmetric matrix of
amino acid exchangeabilities; (2) that new genetically related genes
are generated by duplication during bifurcating speciation or gene
duplication events; and (3) that residue substitutions are uncorre-
lated along different lineages and at different sites. The model selec-
tion tests evaluate how well these assumptions explain the given data
set when various subsets of taxa and proteins are postulated to share
ancestry, without any recourse to measures of sequence similarity.
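Assumption (1) can be made concrete with a small sketch. The function below builds a rate matrix with off-diagonal entries Q[i][j] = S[i][j] × π[j], where S is a symmetric exchangeability matrix and π the equilibrium frequencies, and then checks time reversibility (π[i] × Q[i][j] = π[j] × Q[j][i]). This is the usual general time-reversible construction, offered as an illustration rather than as the code used in the study; the function names and the randomly generated S and π are hypothetical stand-ins for empirical matrices such as WAG or RtREV.

```python
import random


def build_rate_matrix(exchangeabilities, frequencies):
    """Return Q with Q[i][j] = S[i][j] * pi[j] (i != j) and rows summing to zero."""
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q


def is_time_reversible(q, frequencies, tol=1e-12):
    """Check detailed balance: pi[i] * Q[i][j] == pi[j] * Q[j][i] for all i, j."""
    n = len(frequencies)
    return all(
        abs(frequencies[i] * q[i][j] - frequencies[j] * q[j][i]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Hypothetical inputs: random frequencies and a random symmetric S (20 amino acids).
rng = random.Random(0)
N_STATES = 20
raw = [rng.random() + 0.01 for _ in range(N_STATES)]
pi = [x / sum(raw) for x in raw]
S = [[0.0] * N_STATES for _ in range(N_STATES)]
for i in range(N_STATES):
    for j in range(i + 1, N_STATES):
        S[i][j] = S[j][i] = rng.random()

Q = build_rate_matrix(S, pi)
print(is_time_reversible(Q, pi))
```

Because S is symmetric, reversibility holds for any choice of π, which is why a model of this kind needs no assumption about where the root of a tree lies.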
origins of life1–6. If life began multiple times, UCA requires a ‘bottle-
origins have survived exclusively until the present (and the rest have
become extinct), or, multiple populations with independent, separate
origins convergently gained the ability to exchange essential genetic
here are compatible with multiple origins in both the above schemes,
and therefore the tests reported here are designed to discriminate
specifically between UCA and multiple ancestry, rather than between
that the last universal common ancestor was a single organism24,25, in
accord with the traditional evolutionary view that common ancestors
mon ancestor may have comprised a population of organisms with
different genotypes that lived in different places at different times25.
The data set consists of a subset of the protein alignment data from
ref. 27, containing 23 universally conserved proteins for 12 taxa from
horizontally transferred early in evolution27. The conserved proteins
be orthologues. The first class of models I considered (presented in
a given set of taxa to evolve by the same tree, and hence these models
do not account for possible horizontal gene transfer (HGT) or sym-
biotic fusion events during the evolution of the three domains of life.
representing universal common ancestry of all taxa in the three
domains of life and shown in Fig. 1a, can be considered to represent
the classic three-domain ‘tree of life’ model of evolution28.
Among the class I models, all criteria select the UCA tree by an
extremely large margin (score differences ranging from 6,569 to
have evolutionary histories complicated by HGT. For all model selec-
viewed as very strong empirical evidence for the hypothesis with the
are also highly statistically significant (the estimated variance for each
score is approximately 2–3). According to a standard objective
Bayesian interpretation of the model selection criteria, the scores are
the log odds of the hypotheses16,17. Therefore, UCA is at least 10^2,860
times more probable than the closest competing hypothesis. Notably,
UCA is the most accurate and the most parsimonious hypothesis.
better fit to the data (as seen from its higher likelihood), and it is also
the least complex (as judged by the number of parameters).
The extraordinary strength of these results in the face of suspected
to the extent of HGT. To test this possibility, the analysis was
independent evolutionary history. I refer to this set of models, which
rejects a single tree metaphor for genealogically related taxa, as ‘class
of genealogically related taxa, each of the 23 universally conserved
proteins is allowed to evolve on its own separate phylogeny, in which
both branch lengths and tree topology are free parameters. For
example, the multiple-ancestry model [AE+B]II comprises two clusters
of protein trees, one cluster (AE) in which Archaea and Eukarya
share a common ancestor but are genetically unrelated to another
cluster (B) consisting only of Bacteria. Class II models are highly
reticulate, phylogenetic networks that can represent very complex
evolutionary mechanisms, including unrestricted HGT, symbiotic
fusion events and independent ancestry of various taxa. Overall,
the model selection tests show that the class II models are greatly
preferred to the class I models. For instance, the class II UCA hypo-
thesis ([ABE]II) versus the class I UCA hypothesis (ABE) gives a
highly significant LLR of 3,557, a ΔAIC of 2,633 and an LBF of
2,875. The optimal class II models represent an upper limit to the
degree of HGT, as many of the apparent reticulations are probably
due to incomplete lineage sorting, hidden paralogy, recombination,
or inaccuracies in the evolutionary models. Nonetheless, as with the
class I non-HGT hypotheses, all model selection criteria unequivo-
cally support a single common genetic ancestry for all taxa. Also
similar to the class I models, the class II UCA model has the greatest
explanatory power and is the most parsimonious.
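The gap between the LLR and the ΔAIC quoted above can be read as a parameter count. Assuming the AIC convention stated in the Methods summary (the AIC equals the log likelihood minus the number of parameters, K), ΔAIC = LLR − (K_II − K_I), so subtracting the two scores gives the extra free parameters carried by the class II model. The helper name below is hypothetical; the two scores are the ones reported in the text.

```python
def implied_parameter_gap(llr, delta_aic):
    """Extra parameters implied by AIC = L - K: (K_2 - K_1) = LLR - delta_AIC."""
    return llr - delta_aic


# Class II UCA versus class I UCA, scores as quoted in the text.
extra_parameters = implied_parameter_gap(llr=3557, delta_aic=2633)
print(extra_parameters)  # 924
```

On this reading the class II model is charged for 924 additional parameters and still comes out 2,633 units ahead, which is what "greatly preferred" means in AIC terms.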
Figure 1 | Selected class I evolutionary hypotheses, excluding HGT. a, The
model ABE, representing UCA of all taxa in the three domains of life. b, A
competing multiple-ancestry model, AE+B, representing common ancestry
of Archaea and Eukarya, but an independent ancestry for Bacteria. Trees
shown are actual maximum likelihood estimates, with branch lengths
proportional to the number of sequence substitutions.
Table 1 | Class I hypotheses of single versus multiple ancestries
ΔAIC LBF ML evolutionary model
(AE) R-IGF; (B) R-GF
(AB) W-IGF; (E) R-GF
(BE) R-IGF; (A) W-IGF
(E) R-GF; (B) R-GF; (A) W-IGF
(ABE−M) W-IF; (M) R-GF
(ABE−H) R-IGF; (H) empirical
Shown are the model selection scores for class I hypotheses of single ancestry versus multiple
ancestries, excluding HGT events. A, Archaea; B, Bacteria; E, Eukarya; H, Homo sapiens;
M, Metazoa; ABE−M, ABE without Metazoa; ABE−H, ABE without H. sapiens. AE+B denotes a
hypothesis of two independent ancestries, one tree for A and E together, and another separate
tree for B. K denotes the total number of parameters in the model. All criteria are given as
differences from ABE, so that larger values indicate less support for that model relative to ABE.
LLR and ΔAIC scores correspond to the maximum likelihood (ML) estimates. For the ML
evolutionary model, the first letter refers to the rate matrix: R, RtREV; W, WAG. The following
letters denote models with additional parameters: I, invariant positions; G, gamma rate
variation; F, empirical amino acid frequencies. The raw log likelihood for ABE is −126,299, and
the marginal log likelihood is −126,713.
Several hypotheses have been proposed to explain the origin of
eukaryotes and the early evolution of life by endosymbiotic fusion of
an early archaeon and bacterium29. A key commonality of these
hypotheses is the rejection of a single, bifurcating tree as a proper
model for the ancestry of Eukarya. For instance, in these biological
hypotheses certain eukaryotic genes are derived from Archaea
whereas others are derived from Bacteria. The class II models freely
allow eukaryotic genes to be either archaeal-derived or bacterial-
derived, as the data dictate, and hence class II hypotheses can model
several endosymbiotic ‘rings’ and HGT events. Because specific
endosymbiotic fusion schemes can be represented by constrained
For nested hypotheses, the constrained versions necessarily have
equal or lower likelihoods than the unconstrained versions. As a
result, strict bounds can be placed on the LLR and ΔAIC scores
for the constrained class II network models that represent specific
endosymbiotic fusion or HGT hypotheses (see Methods and
Supplementary Information). In all cases, these bounds show that
multiple-ancestry versions of the constrained class II models are
overwhelmingly rejected by the tests (model selection scores of
for all specific HGT and endosymbiotic fusion models. In terms of a
fusion hypothesis for the origin of Eukarya, the data conclusively
support a UCA model in which Eukarya share an ancestor with
Bacteria and another independently with Archaea, and in which
Bacteria and Archaea are also genetically related independently of
Eukarya (see Table 3).
The proteins in this data set were postulated to be orthologous on
the basis of significant sequence similarity27. Because the proteins are
universally conserved, all of the taxa have their own specific versions
of each of the proteins. It would be of interest to know how the tests
respond to the inclusion of proteins that are not universally con-
served, as omitting independently evolved proteins could perhaps
bias the results towards common ancestry. Nevertheless, the inclu-
sion of bona fide independently evolved genes has no effect on the
likelihoods of the winning class II models, except in certain cases to
strengthen the conclusion of common ancestry (for a formal proof,
see the Supplementary Information). Many proteins probably do
exist that have independent origins. For instance, in the Metazoa
certain protein domains have probably evolved de novo that are not
found in either Bacteria or Archaea30. However, the independent
evolution of unique Metazoan proteins, by itself, is not evidence
for or against UCA. The probability that the Metazoa would evolve
a new protein domain is the same whether or not the Metazoa are
related to Bacteria and Archaea. Therefore, omitting proteins with
independent origins from the data set does not affect support for the
ing independently evolved proteins is expected to increase support
for common ancestry for the subsets of taxa that share them (in this
example, to increase support for common ancestry of the Metazoa).
As is common in phylogenetic practice, most gaps and poorly
aligned regions were removed from the original data set used in this
analysis27, leaving only those sites that were thought to be homologous
with high confidence. To explore the effect of these omitted sites,
the model selection tests were performed on a similar data set, with
the same proteins and species, in which all gaps were kept in the final
alignment (see Supplementary Methods and Supplementary Tables
analyses greatly increases the support for UCA in all cases (for
instance, with the ABE versus AE+B test, the class I ΔAIC is 10,323
and the class II ΔAIC is 11,072).
What property of the sequence data supports common ancestry so
decisively? When two related taxa are separated into two trees, the
strong correlations that exist between the sequences are no longer
modelled, which results in a large decrease in the likelihood. Con-
sequently, when comparing a common-ancestry model to a multiple-
in our ability to accurately predict the sequence of a genealogically
related protein relative to an unrelated protein. The sequence correla-
tions between a given clade of taxa and the rest of the tree would be
randomly shuffled. In such a case, these model-based selection tests
should prefer the multiple-ancestry model. In fact, in actual tests with
randomly shuffled data, the optimal estimate of the unified tree (for
both maximum likelihood and Bayesian analyses) contains an extre-
all cases tried, with a wide variety of evolutionary models (from the
models (LLR on the order of a thousand), even with the large internal
branches. Hence, the large test scores in favour of UCA models reflect
the immense power of a tree structure, coupled with a gradual
Markovian mechanism of residue substitution, to accurately and pre-
cisely explain the particular patterns of sequence correlations found
among genealogically related biological macromolecules.
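The shuffling control described above can be sketched in a few lines. The version below assumes one plausible form of the control: a single random permutation of alignment columns is applied to every sequence in one clade, so that correlations inside the clade survive while correlations between that clade and the remaining taxa are destroyed. The exact procedure used in the study is not spelled out here, and the function name, taxon labels and toy sequences are all hypothetical.

```python
import random


def shuffle_clade_columns(alignment, clade, seed=None):
    """Apply one random column permutation to every sequence in `clade`.

    `alignment` maps taxon name -> aligned sequence (all the same length).
    Sequences outside `clade` are returned unchanged.
    """
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    rng.shuffle(order)
    shuffled = {}
    for taxon, seq in alignment.items():
        if taxon in clade:
            shuffled[taxon] = "".join(seq[i] for i in order)
        else:
            shuffled[taxon] = seq
    return shuffled


# Hypothetical toy alignment; real input would be the 23-protein data set.
toy_alignment = {
    "bacterium_1": "MKVLAAGIVG",
    "bacterium_2": "MKILAAGLVG",
    "archaeon_1": "MRVLSAGIAG",
    "archaeon_2": "MRVLSSGIAG",
}
result = shuffle_clade_columns(toy_alignment, {"bacterium_1", "bacterium_2"}, seed=1)
for taxon, seq in result.items():
    print(taxon, seq)
```

Each shuffled sequence keeps its amino acid composition, so any drop in likelihood for the unified tree reflects lost site-by-site correlation rather than a change in residue frequencies.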
Table 2 | Class II hypotheses of single versus multiple ancestries
Shown are model selection scores for class II hypotheses of single ancestry versus multiple
ancestries, allowing for unlimited HGT and/or endosymbiotic fusion events. Abbreviations are
as in the Table 1 legend. All criteria are listed as differences from [ABE]II. All scores shown are
raw log likelihood for [ABE]II is −122,742, and the marginal log likelihood is −123,838.
Table 3 | Class I and class II hypotheses for selected subsets
AB versus A+B
BE versus B+E
AE versus A+E
Shown are model selection scores for class I and II hypotheses for selected subsets of the taxa.
Single ancestry hypotheses are listed left, multiple-ancestry hypotheses right. Terms are as in
Figure 2 | Selected class II evolutionary hypotheses, including HGT. a, The
reticulated model [ABE]II, representing UCA. b, A competing network
model of multiple ancestry, [AE+B]II, representing common ancestry of
as phylogenetic networks (reticulate trees). The phylogenetic networks are
phylogenies using the evolutionary model parameters shown for ABE and
AE+B in Table 1.
previously described data set comprising 23 ubiquitous proteins27. Archaea:
Methanococcus jannaschii, Archaeoglobus fulgidus, Pyrococcus furiosus and
Thermoplasma acidophilum; Eukarya: Drosophila melanogaster, Homo sapiens,
Caenorhabditis elegans and Saccharomyces cerevisiae; Bacteria: Escherichia coli,
Bacillus subtilis, Mycobacterium tuberculosis and Porphyromonas gingivalis.
Optimal models were determined using both maximum likelihood and
Bayesian phylogenetic methods. For a hypothesis involving several independent
trees, such as model AE+B, each tree in the model was allowed to have its own
independent evolutionary model parameters (such as amino acid substitution
matrix, shape parameter for the gamma rate distribution, fraction of invariant
sites, and empirical amino acid background frequencies), if it improved the
a Bayesian analysis the total marginal likelihood is the product of marginal likelihoods
from each independent tree. The AIC was calculated as AIC = L − K,
Note that this differs from some common versions of the AIC by a factor of −2,
with the other test scores. No assumptions were made about the positions of the
roots of the trees, as all inferred trees are unrooted. For the class II models
involving HGT, each protein was given its own branch length and topology
parameters; all other parameters were identical to the analogous class I model.
The class II models thus implicitly assume that HGT involves the exchange of
entire protein-coding genes. All phylogenetic input files are available by request.
Full Methods and any associated references are available in the online version of
the paper at www.nature.com/nature.
Received 28 August 2009; accepted 17 March 2010.
1. Sober, E. Evidence and Evolution Ch. 4 (Cambridge University Press, 2008).
2. Darwin, C. On the Origin of Species by Means of Natural Selection, or, The
Preservation of Favoured Races in the Struggle for Life Ch. 14 (J. Murray, 1859).
3. Raup, D. M. & Valentine, J. W. Multiple origins of life. Proc. Natl Acad. Sci. USA 80,
4. Crick, F. H. C. The origin of the genetic code. J. Mol. Biol. 38, 367–379 (1968).
5. Sober, E. & Steel, M. Testing the hypothesis of common ancestry. J. Theor. Biol.
218, 395–408 (2002).
6. Dobzhansky, T. Nothing in biology makes sense except in the light of evolution.
Am. Biol. Teach. 35, 125–129 (1973).
7. Hinegardner, R. T. & Engelberg, J. Rationale for a universal genetic code. Science
142, 1083–1085 (1963).
8. Penny, D., Hendy, M. D. & Poole, A. M. Testing fundamental evolutionary
hypotheses. J. Theor. Biol. 223, 377–385 (2003).
9. Penny, D., Foulds, L. R. & Hendy, M. D. Testing the theory of evolution by
comparing phylogenetic trees constructed from five different protein sequences.
Nature 297, 197–200 (1982).
H. J.) 97–166 (Academic Press, 1965).
11. Doolittle, W. F. The nature of the universal ancestor and the evolution of the
proteome. Curr. Opin. Struct. Biol. 10, 355–358 (2000).
12. How true is the theory of evolution? Nature 290 (Editorial), 75–76 (1981).
13. Popper, K. R. Unended Quest: An Intellectual Autobiography revised edn (Fontana,
14. Syvanen, M. On the occurrence of horizontal gene transfer among an arbitrarily
chosen group of 26 genes. J. Mol. Evol. 54, 258–266 (2002).
15. Woese, C. R. On the evolution of cells. Proc. Natl Acad. Sci. USA 99, 8742–8747
16. Burnham, K. P. & Anderson, D. R. Model Selection and Inference: A Practical
Information-Theoretic Approach (Springer, 1998).
17. Kass, R. E. & Raftery, A. E. Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
18. Futuyma, D. J. Evolutionary Biology 3rd edn (Sinauer Associates, 1998).
19. Murzin, A. G. How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol.
8, 380–387 (1998).
20. Karlin, S. & Altschul, S. Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc. Natl Acad.
Sci. USA 87, 2264–2268 (1990).
(Multivariate Applications) (Lawrence Erlbaum, 1997).
22. Reeck, G. et al. ‘‘Homology’’ in proteins and nucleic acids: a terminology muddle
and a way out of it. Cell 50, 667 (1987).
23. Mindell, D. & Meyer, A. Homology evolving. Trends Ecol. Evol. 16, 434–440
24. Crick, F. H. C. in Progress in Nucleic Acid Research (eds Davidson, J. N. & Cohn, W.
E.) 163–217 (Academic Press, 1963).
Trans. R. Soc. Lond. B 364, 2221–2228 (2009).
27. Brown, J. R., Douady, C. J., Italia, M. J., Marshall, W. E. & Stanhope, M. J. Universal
trees based on large combined protein sequence data sets. Nature Genet. 28,
28. Woese, C. & Fox, G. Phylogenetic structure of the prokaryotic domain: the
primary kingdoms. Proc. Natl Acad. Sci. USA 74, 5088–5090 (1977).
29, 74–84 (2007).
30. Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. Evolution of the protein
repertoire. Science 300, 1701–1703 (2003).
Supplementary Information is linked to the online version of the paper at
Acknowledgements I thank J. Felsenstein, P. Garrity, N. Matzke, C. Miller,
C. Theobald and J. Wilkins for critical commentary.
Author Information Reprints and permissions information is available at
www.nature.com/reprints. The author declares no competing financial interests.
Correspondence and requests for materials should be addressed to D.L.T.
Data sets. The original data set comprises 6,591 aligned amino acids from 23
ubiquitous proteins27: alanyl-tRNA synthetase, aspartyl-tRNA synthetase,
glutamyl-tRNA synthetase, histidyl-tRNA synthetase, isoleucyl-tRNA synthe-
tase, leucyl-tRNA synthetase, methionyl-tRNA synthetase, phenylalanyl-tRNA
synthetase b subunit, threonyl-tRNA synthetase, valyl-tRNA synthetase, ini-
tiation factor 2, elongation factor G, elongation factor Tu, ribosomal protein
L2, ribosomal protein S5, ribosomal protein S8, ribosomal protein S11, amino-
peptidase P, DNA-directed RNA polymerase b chain, DNA topoisomerase I,
DNA polymerase III c subunit, signal recognition particle protein and rRNA
dimethylase. The original data set was constructed by removing poorly aligned
regions and most gapped columns from the CLUSTALW alignment27. I con-
structed a similar data set, using the same proteins from the same taxa, which
retained the entire protein sequences. The proteins in this data set were inde-
pendently aligned with ProbCons31. The resulting complete unmodified align-
ment comprised 25,411 columns, including gaps.
Likelihood phylogenetics. For the LLR and AIC tests, more than 1,800 competing
biological models were fit to this data using the method of maximum likelihood
and the program ProtTest 1.4 (ref. 32) (defaults) supplemented by
independent runs with PhyML 2.4.5 (ref. 33). ProtTest calculates the maximum
likelihood for 72 evolutionary models for each tree in each model: B, B-F, B-G,
B-GF, B-I, B-IF, B-IG, B-IGF, C, C-F, C-G, C-GF, C-I, C-IF, C-IG, C-IGF, D,
D-F, D-G, D-GF, D-I, D-IF, D-IG, D-IGF, J, J-F, J-G, J-GF, J-I, J-IF, J-IG, J-IGF,
MM, MM-F, MM-G, MM-GF, MM-I, MM-IF, MM-IG, MM-IGF, MR, MR-F,
MR-G, MR-GF, MR-I, MR-IF, MR-IG, MR-IGF, R, R-F, R-G, R-GF, R-I, R-IF,
W-I, W-IF, W-IG, and W-IGF, where the substitution matrices are coded
MR = MtREV, R = RtREV, V = VT, and W = WAG. The following letters
denote models with further parameters: I = invariant positions, G = gamma
distributed rate variation, F = empirical amino acid frequencies. For the class
II HGT models, 23 different protein trees were calculated for each cluster of taxa
proposed to be genealogically related. For example, the model [AE+B]II com-
another 23 trees for Bacteria. The total log likelihood for a particular class II
model is the sum of the log likelihoods for all the protein trees in the model.
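Because independent trees contribute additively on the log scale, scoring a multi-tree model reduces to two sums. The sketch below adds log likelihoods and parameter counts over the trees of a model and then applies the AIC convention given in the Methods summary (log likelihood minus number of parameters). The function names and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def combine_independent_trees(trees):
    """Sum log likelihoods and parameter counts over independent trees.

    Each tree is a (log_likelihood, n_parameters) pair.
    """
    total_log_l = sum(log_l for log_l, _ in trees)
    total_k = sum(k for _, k in trees)
    return total_log_l, total_k


def aic(log_l, k):
    """AIC on the log-likelihood scale used in the text: AIC = L - K."""
    return log_l - k


# Hypothetical scores: one unified tree versus two independent trees.
single_tree = [(-1000.0, 42)]
two_trees = [(-700.0, 34), (-420.0, 25)]

delta_aic = aic(*combine_independent_trees(single_tree)) - aic(
    *combine_independent_trees(two_trees)
)
print(delta_aic)  # 137.0
```

With this sign convention a positive difference favours the first model, here the single tree.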
Bayesian phylogenetics. All Bayesian analyses were calculated with the parallel
version of MrBayes 3.1.2 (ref. 34) and used mixed-rate matrices and gamma-
distributed rate variation across sites (16 categories). A uniform (0.0, 200.0)
prior was assumed for the shape parameter of the gamma distribution, an
unconstrained exponential prior (mean = 0.1) was assumed for the branch
lengths, and a uniform prior was assumed for all topologies. Two independent
Markov chain Monte Carlo (MCMC) analyses were performed (each with one
cold and three heated chains), with all other parameters set to defaults.
deviation of split frequencies of less than 0.01 (generally never more than
10,000,000 generations). After convergence, the first half of the chain was dis-
carded as ‘burn in’. For the class II HGT models, the data were partitioned by
protein, and all parameters (topology, branch lengths, state frequencies, amino
acid substitution model and gamma shape) were unlinked across partitions.
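The post-processing of an MCMC run can be illustrated briefly. The sketch below discards the first half of a chain as burn-in and then computes the log of the harmonic mean of the sampled likelihoods, the estimator the Methods name for the marginal likelihoods. It is a generic, numerically stable version of that estimator, not a reproduction of MrBayes output; the function names and the sample values are hypothetical.

```python
import math


def discard_burn_in(samples, fraction=0.5):
    """Drop the first `fraction` of the chain (default: the first half)."""
    start = int(len(samples) * fraction)
    return samples[start:]


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of the likelihoods.

    log HM = log N - log(sum(exp(-ll_i))), evaluated with the log-sum-exp
    trick so that very negative log likelihoods do not overflow.
    """
    negated = [-ll for ll in log_likelihoods]
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(negated)) - log_sum


# Hypothetical chain of sampled log likelihoods.
chain = [-130000.0, -127500.0, -126720.0, -126715.0, -126712.0, -126714.0]
kept = discard_burn_in(chain)
print(log_harmonic_mean(kept))
```

A log Bayes factor is then the difference between two such estimates, one per model.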
Phylogenetic networks. Phylogenetic networks were computed and displayed
with SplitsTree 4.10 (ref. 35), using the equal angle, consensus network algorithm
(threshold = 0, to show all reticulations). The phylogenetic networks
shown in Fig. 2 are derived from the maximum likelihood estimates of the 23
individual protein phylogenies using the evolutionary model parameters shown
in Table 1.
Model selection test scores. LLR values were calculated directly from the like-
was used as previously described36, which involves estimating the variance of a
centred log likelihood using the per site likelihoods as output by PhyML. The
number of parameters K was calculated as follows: one parameter per branch
length for all trees in the model, where the number of branch lengths per tree is
the number of invariant sites was estimated; one parameter per tree if the
gamma-distribution shape parameter was estimated; 19 parameters per tree if
the empirical amino acid frequencies were estimated. Marginal likelihoods for
the Bayes factors were calculated with MrBayes34 using the harmonic-mean
estimator17. The LBF was calculated as the difference in the marginal-log like-
lihoods for each model.
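The parameter-counting rules above translate directly into a small function. The sketch assumes that each tree is unrooted and fully bifurcating, so that a tree with n taxa has 2n − 3 branch lengths (a standard result, taken here as the branch-length rule), and then adds one parameter for invariant sites, one for the gamma shape and 19 for empirical amino acid frequencies when those are estimated. Function and argument names are hypothetical.

```python
def branch_count(n_taxa):
    """Branches in an unrooted, fully bifurcating tree with n_taxa leaves."""
    if n_taxa < 2:
        raise ValueError("a tree needs at least two taxa")
    return 2 * n_taxa - 3


def count_parameters(n_taxa, invariant=False, gamma=False, empirical_freqs=False):
    """Free parameters K for one tree under the counting rules in the text."""
    k = branch_count(n_taxa)
    if invariant:
        k += 1  # proportion of invariant sites
    if gamma:
        k += 1  # gamma-distribution shape parameter
    if empirical_freqs:
        k += 19  # 20 amino acid frequencies constrained to sum to 1
    return k


# One 12-taxon tree with I, G and F estimated.
print(count_parameters(12, invariant=True, gamma=True, empirical_freqs=True))  # 42

# Two independent trees, e.g. 8 taxa with IGF plus 4 taxa with GF.
print(
    count_parameters(8, invariant=True, gamma=True, empirical_freqs=True)
    + count_parameters(4, gamma=True, empirical_freqs=True)
)  # 59
```

Splitting taxa into independent trees removes branches but duplicates the per-tree model parameters, which is why a multiple-ancestry model can end up with a larger K than the single tree.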
Bounds on model selection scores. Consider three hypotheses: H_A, H_B and H_C.
If H_B is a partially constrained hypothesis nested within H_C, then the following
inequalities necessarily hold:
where LLR_{A−B} = L_A − L_B, ΔAIC_{A−B} = AIC_A − AIC_B, and L_X is the log likelihood
for hypothesis H_X. These inequalities follow directly from the definitions
of the model-selection scores and the fact that the likelihood for a nested, con-
strained hypothesis is always less than or equal to the likelihood of the uncon-
strained hypothesis16. Derivations and discussion are provided in the
Supplementary Materials. The inequalities are especially useful for the purposes
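One way to see how such bounds arise is to work them through from the definitions. The derivation below is a sketch under two assumptions: the AIC convention from the Methods summary (log likelihood minus number of parameters), and the nesting property that the constrained hypothesis H_B cannot have a higher log likelihood than H_C. The symbol K_X, for the number of parameters of hypothesis H_X, is introduced here for illustration, and the form of the bounds may differ from the one printed in the paper.

```latex
\begin{align*}
L_B \le L_C
  &\;\Longrightarrow\;
  \mathrm{LLR}_{A-B} = L_A - L_B \;\ge\; L_A - L_C = \mathrm{LLR}_{A-C}, \\
\Delta\mathrm{AIC}_{A-B}
  &= (L_A - K_A) - (L_B - K_B) \\
  &\ge (L_A - K_A) - (L_C - K_B) \\
  &= \Delta\mathrm{AIC}_{A-C} - (K_C - K_B).
\end{align*}
```

In words, the scores computed against the unconstrained model H_C give lower bounds on the scores against any constrained version H_B, with the AIC bound loosened by the number of parameters that the constraint removes.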
31. Do, C. B., Mahabhashyam, M. S., Brudno, M. & Batzoglou, S. ProbCons:
probabilistic consistency-based multiple sequence alignment. Genome Res. 15,
32. Abascal, F., Zardoya, R. & Posada, D. ProtTest: selection of best-fit models of
protein evolution. Bioinformatics 21, 2104–2105 (2005).
phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003).
34. Altekar, G. Parallel metropolis coupled Markov chain Monte Carlo for Bayesian
phylogenetic inference. Bioinformatics 20, 407–415 (2004).
35. Huson, D. & Bryant, D. Application of phylogenetic networks in evolutionary
studies. Mol. Biol. Evol. 23, 254–267 (2006).
36. Vuong, Q. H. Likelihood ratio tests for model selection and non-nested
hypotheses. Econometrica 57, 307–333 (1989).