A Consensus Tree Approach for Reconstructing

Human Evolutionary History and Detecting

Population Substructure

Ming-Chi Tsai¹, Guy Blelloch², R. Ravi³, and Russell Schwartz⁴

¹ Joint CMU-Pitt Computational Biology Program,

Carnegie Mellon University and University of Pittsburgh, Pittsburgh, PA 15213, USA

² Department of Computer Science

³ Tepper School of Business

⁴ Department of Biological Science,

Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract. The random accumulation of variations in the human genome

over time implicitly encodes a history of how human populations have

arisen, dispersed, and intermixed since we emerged as a species.

Reconstructing that history is a challenging computational and statistical

problem but has important applications both to basic research and to the

discovery of genotype-phenotype correlations. In this study, we present

a novel approach to inferring human evolutionary history from genetic

variation data. Our approach uses the idea of consensus trees, a technique

generally used to reconcile species trees from divergent gene trees,

adapting it to the problem of finding the robust relationships within a set

of intraspecies phylogenies derived from local regions of the genome. We

assess the quality of the method on two large-scale genetic variation data

sets: the HapMap Phase II and the Human Genome Diversity Project.

Qualitative comparison to a consensus model of the evolution of modern

human population groups shows that our inferences closely match

our best current understanding of human evolutionary history. A further

comparison with results of a leading method for the simpler problem of

population substructure assignment verifies that our method provides

comparable accuracy in identifying meaningful population subgroups in

addition to inferring the relationships among them.

1 Introduction

The advent of high-throughput genotyping methods and their application in

large-scale genetic variation studies have made it possible to determine in

unprecedented detail how the modern diversity of the human species arose from our

common ancestors. In addition to its importance as a basic research problem,

this topic has great practical relevance to the discovery of genetic risk factors

of disease due to the confounding effect of unrecognized substructure on genetic

association tests [22]. Past work on human ancestry inference has essentially

treated it as two distinct problems: identifying meaningful population groups

M. Borodovsky et al. (Eds.): ISBRA 2010, LNBI 6053, pp. 167–178, 2010.

© Springer-Verlag Berlin Heidelberg 2010

and inferring evolutionary trees among them. Population groups may be assumed

in advance based on common conceptions of ethnic groupings, although

the field increasingly depends on computational analysis to make such inferences

automatically. Probably the best-known system for this problem is

STRUCTURE [16], which uses a Markov Chain Monte Carlo (MCMC) clustering

method to group sequences into subpopulations characterized by similar allele

frequencies across variation sites. A variety of other computational and statistical

methods have been developed to perform population substructure inference

or similar analyses, including EIGENSOFT [15], Spectrum [19], and SABER

[21]. A separate literature has arisen on the inference of relationships between

populations, typically based on phylogenetic reconstruction of limited sets of

genetic markers — such as classic restriction fragment length polymorphisms

[14], mtDNA genotypes [9,2], short tandem repeats [9,23], and Y chromosome

polymorphism [5] — supplemented by extensive manual analysis informed by

population genetics theory. There has thus far been little cross-talk between the

two problems of inferring population substructure and inferring phylogenetics of

subgroups, despite the fact that both problems depend on similar data sources

and in principle can help inform the decisions of one another.

We propose a novel approach for reconstructing a species history that is intended

to unify these two inference problems. The method is conceptually based

on the idea of consensus trees [13], which represent inferences as to the robust

features of a family of trees. The approach takes advantage of the fact that the

availability of large-scale variation data sets, combined with new algorithms for

fast phylogeny inference on these data sets [20], has made it possible to infer

likely phylogenies on millions of small regions spanning the human genome. The

intuition behind our method is that each such phylogeny will represent a distorted

version of the global evolutionary history and population structure of the

species, with many trees supporting the major splits or subdivisions between

population groups while few support any particular splits independent of those

groups. By detecting precisely the robust features of these trees, we can assemble

a model of the true evolutionary history and population structure that can be

made resistant to overfitting and to noise in the SNP data or tree inferences.

In the remainder of this paper, we describe and evaluate our approach. We first

present in more detail our mathematical model of the consensus tree problem and

a set of algorithms for finding consensus trees from families of local phylogenies.

We next evaluate our method on the HapMap Phase II [7] and Human Genome

Diversity Project [8] datasets. Finally, we consider some of the implications of

the results and future prospects of the consensus tree approach for evolutionary

history and substructure inference.

2 Methods

2.1 Consensus Tree Model

We assume we are given a set of m taxa, S, representing the paired haplotypes

from each individual in a population sample. If we let T be the set of all possible

labeled trees connecting the s ∈ S, where each node of any t ∈ T may be labeled

by any subset of zero or more s ∈ S without repetition, then our input will

consist of some set of n trees D = {T1,...,Tn} ⊆ T. Our desired output will also

be some labeled tree TM ∈ T, intended to represent a consensus of T1,...,Tn.

Our objective function for choosing TM is based on the task of finding a

consensus tree [13] from a set of phylogenies each describing inferred ancestry

of a small region of a genome. Our problem is, however, fairly different from

standard uses of consensus tree algorithms in that our phylogenies are derived

from many variant markers, each only minimally informative, within a single

species. Standard consensus tree approaches, such as majority consensus [11] or

Adam consensus [1], would not be expected to be effective in this situation as it is

likely there is no single subdivision of a population that is consistently preserved

across more than a small fraction of the local intraspecies trees and that many

similar but incompatible subdivisions are supported by different subsets of the

trees. We therefore require an alternative representation of the consensus tree

problem designed to be robust to large numbers of trees and high levels of noise

and uncertainty in data.

For this purpose, we chose a model of the problem based on the principle

of minimum description length (MDL) [4], a standard technique for avoiding

overfitting when making inferences from noisy data sets. An MDL method seeks

to minimize the amount of information needed to encode the model and to encode

the data set given knowledge of the model. Suppose we have some function

L : T → R that computes a description length, L(Ti), for any tree Ti. We will

assume the existence of another function, which for notational convenience we

will also call L, L : T × T → R, which computes a description length, L(Ti|Tj),

of a tree Ti given that we have reference to a model Tj. Then, given a set of

observed trees, D = {T1,T2,...,Tn} for Ti ∈ T, our objective function is

\[
L(T_M, T_1, \ldots, T_n) = \arg\min_{T_M \in \mathcal{T}} \left( L(T_M) + \sum_{i=1}^{n} L(T_i \mid T_M) + f(T_M) \right)
\]

The first term computes the description length of the model (consensus) tree

TM. The sum computes the cost of explaining the set of observed (input) trees

D. The function f(TM) = |TM| log2 m defines an additional penalty on model

edges used to set a minimum confidence level on edge predictions.
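To make the three terms of the objective concrete, the following minimal sketch combines them for a single candidate model tree. The function names `edge_penalty` and `total_description_length` and the numeric inputs are illustrative assumptions; the sketch takes the tree description lengths as already-computed numbers and is not the authors' implementation.

```python
import math

def edge_penalty(num_model_edges, num_taxa):
    """Penalty f(TM) = |TM| * log2(m), in bits."""
    return num_model_edges * math.log2(num_taxa)

def total_description_length(model_cost, conditional_costs, num_model_edges, num_taxa):
    """L(TM) + sum_i L(Ti|TM) + f(TM) for one candidate model tree.

    model_cost        -- L(TM), assumed to be computed elsewhere
    conditional_costs -- iterable of L(Ti|TM), one per observed tree
    """
    return (model_cost
            + sum(conditional_costs)
            + edge_penalty(num_model_edges, num_taxa))

# Hypothetical values: a model with 4 edges over 8 taxa and two observed trees.
print(total_description_length(10.0, [3.0, 2.0], num_model_edges=4, num_taxa=8))
```

A search over candidate model trees would evaluate this quantity for each candidate and keep the one with the smallest total.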

We next need to specify how we compute the description length of a tree.

For this purpose, we use the fact that a phylogeny can be encoded as a set of

bipartitions (or splits) of the taxa with which it is labeled, each specifying the

set of taxa lying on either side of a single edge of the tree. We represent the observed

trees and candidate consensus trees as sets of bipartitions for the purpose

of calculating description lengths. Once we have identified a set of bipartitions

representing the desired consensus tree, we then apply a tree reconstruction algorithm

to convert those bipartitions into a tree. A bipartition b can in turn

be represented as a string of bits by arbitrarily assigning elements in one part

of the bipartition the label “0” and the other part the label “1”. Fig. 1a shows

an example of a hypothetical tree, its description as a set of bipartitions, and

Fig. 1. (a) A maximum parsimony (MP) tree consisting of 11 labeled individuals or

haplotypes. (b) The set of bipartitions induced by edges (ea, eb, ec, ed) in the tree. (c)

0-1 bit sequence representation for each bipartition.

representations of the bipartitions as bit strings. Such a bit representation allows

us to compute the encoding length of a bipartition b as the entropy of its

corresponding bit string. If we define p0 to be the fraction of bits of b that are

zero and p1 to be the fraction that are one, then:

\[
L(b) = m \left( -p_0 \log_2 p_0 - p_1 \log_2 p_1 \right)
\]
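As a minimal sketch of this encoding, a bipartition can be converted to a bit string and scored by its entropy as shown here. The names `split_to_bits` and `split_length` and the toy taxa labels are illustrative assumptions, and the sketch adopts the usual convention that 0 log2 0 = 0.

```python
import math

def split_to_bits(taxa, side_one):
    """Encode a bipartition as a 0/1 list: 1 if the taxon lies in side_one, else 0.

    Which side receives the label 1 is arbitrary; the entropy is the same either way.
    """
    members = set(side_one)
    return [1 if t in members else 0 for t in taxa]

def split_length(bits):
    """Description length L(b) = m * (-p0 log2 p0 - p1 log2 p1), in bits."""
    m = len(bits)
    p1 = sum(bits) / m
    entropy = 0.0
    for p in (1.0 - p1, p1):
        if p > 0:  # treat 0 * log2(0) as 0
            entropy -= p * math.log2(p)
    return m * entropy

# Hypothetical example: four haplotypes split evenly by one tree edge.
taxa = ["h1", "h2", "h3", "h4"]
bits = split_to_bits(taxa, ["h1", "h2"])
print(bits, split_length(bits))
```

Under this measure an even split is the most expensive to encode, while a split that separates a single taxon from all others is comparatively cheap.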

Similarly, we can encode the representation of one bipartition b1 given another b2 using the concept of conditional entropy. If we let p00 be the fraction of bits for which both bipartitions have value “0,” p01 be the fraction for which the first bipartition has value “0” and the second “1,” and so forth, then:

\[
L(b_1 \mid b_2) = m \left[ \sum_{s,t \in \{0,1\}} -p_{st} \log_2 p_{st} + \sum_{u \in \{0,1\}} (p_{0u} + p_{1u}) \log_2 (p_{0u} + p_{1u}) \right]
\]

where the first sum is the joint entropy of b1 and b2 and the second sum subtracts the entropy of b2.

We can use these definitions to specify the minimum encoding cost of a tree L(Ti) or of one tree given another L(Ti|TM). We first convert the tree into a set of bipartitions b1,...,bk. We can then observe that each bipartition bi can be encoded either on its own, with cost equal to its own entropy L(bi), or by reference to some other bipartition bj with cost L(bi|bj). In addition, we must add a cost for specifying whether each bi is explained by reference to another bipartition and, if so, which one. The total minimum encoding costs, L(TM) and L(Ti|TM), can then be computed by summing the minimum encoding cost for each bipartition in the tree. Specifically, let bt,i and bs,M be elements from the bipartition set Bi of Ti and BM of TM, respectively. We can then compute L(TM) and L(Ti|TM) by optimizing for the following objectives over possible reference bipartitions, if any, for each bipartition in each tree:

\[
L(T_M) = \arg\min_{b_s \in B_M \cup \{\emptyset\}} \sum_{s=1}^{|B_M|} \left[ L(b_{s,M} \mid b_s) + \log_2 (|B_M| + 1) \right]
\]

\[
L(T_i \mid T_M) = \arg\min_{b_t \in B_M \cup B_i \cup \{\emptyset\}} \sum_{t=1}^{|B_i|} \left[ L(b_{t,i} \mid b_t) + \log_2 (|B_M| + |B_i| + 1) \right]
\]

Here a reference of ∅ denotes encoding a bipartition on its own, so that L(b|∅) = L(b).
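The conditional cost L(b1|b2) can be sketched directly from its definition as joint entropy minus the entropy of b2. The function name `conditional_split_length` is an illustrative assumption, and bit strings are represented here as equal-length Python lists of 0s and 1s.

```python
import math
from collections import Counter

def conditional_split_length(b1, b2):
    """L(b1|b2) = m * (H(b1, b2) - H(b2)), in bits.

    b1, b2 -- equal-length sequences of 0/1 labels over the same taxa order.
    """
    m = len(b1)
    joint = Counter(zip(b1, b2))   # counts behind the fractions p_st
    marginal = Counter(b2)         # counts behind p_0u + p_1u
    h_joint = -sum((c / m) * math.log2(c / m) for c in joint.values())
    h_b2 = -sum((c / m) * math.log2(c / m) for c in marginal.values())
    return m * (h_joint - h_b2)

# A bipartition is free to encode given itself or given its complement,
# and costs its full entropy given an unrelated bipartition.
print(conditional_split_length([0, 0, 1, 1], [0, 0, 1, 1]))
print(conditional_split_length([0, 0, 1, 1], [1, 1, 0, 0]))
print(conditional_split_length([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because only counts of label pairs enter the calculation, the arbitrary choice of which side of a bipartition is labeled “0” does not affect the cost.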

Fig. 2. Illustration of the DMST construction for determining model description

length. (a) Hypothetical model tree TM (red) and observed tree Ti (blue). (b) Graph of

possible reference relationships for explaining Ti (blue nodes) by reference to TM (red

nodes). (c) A possible resolution of the graph of (b). (d) Graph of possible reference

relationships for explaining TM by itself.

2.2 Algorithms

Encoding Algorithm. We pose the problem of computing L(TM) and L(Ti|TM) as a weighted directed minimum spanning tree (DMST) problem, illustrated in Fig. 2. We construct a graph G = (V,E) in which each node represents either a bipartition or a single “empty” root node r explained below. Each directed edge (bj,bi) represents a possible reference relationship by which bj explains bi. If a bipartition bi is to be encoded from another bipartition bj, the weight of the edge eji is given by wji = L(bi|bj) + log2|V|, where the term log2|V| represents the bits we need to specify the reference bipartition (including no bipartition) from which bi might be chosen. This term introduces a penalty to avoid overfitting. We add an additional edge directly from the empty node to each node bj to be encoded, whose weight is the cost of encoding that bipartition with reference to no other bipartition, wempty,j = L(bj) + log2|V|.

To compute L(TM), the bipartitions BM of TM and the single root node collectively specify the complete node set of the directed graph. One edge is then created from every node of BM ∪ {r} to every node of BM. To compute L(Ti|TM), the node set will include the bipartitions Bi of Ti, the bipartitions BM of TM, and the root node r. The edge set will consist of two parts. Part one consists of one edge from each node of Bi ∪ BM ∪ {r} to each node of Bi, with weights corresponding to the cost of possible encodings of Bi. Part two will consist of a zero-cost edge from r to each node in BM, representing the fact that the presumed cost of the model tree has already been computed. Fig. 2 illustrates the construction for a hypothetical model tree TM and observed tree Ti (Fig. 2(a)), showing the graph of possible reference relationships (Fig. 2(b)), a possible solution corresponding to a specific explanation of Ti in terms of TM (Fig. 2(c)), and the graph of possible reference relationships for TM by itself (Fig. 2(d)).
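The construction for L(Ti|TM) can be sketched as an edge list, as below. This is an illustration under stated assumptions rather than the authors' code: the function name `reference_graph`, the node numbering (0 for the root r, then model bipartitions, then observed bipartitions), and the representation of bipartitions as 0/1 lists are all choices made for the example. Self-loops are omitted.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy (bits per symbol) of a sequence of hashable symbols."""
    m = len(symbols)
    return -sum((c / m) * math.log2(c / m) for c in Counter(symbols).values())

def length(b):
    """L(b): cost of encoding bipartition b on its own."""
    return len(b) * entropy_bits(b)

def cond_length(b1, b2):
    """L(b1|b2): joint entropy minus the entropy of b2, scaled by m."""
    return len(b1) * (entropy_bits(list(zip(b1, b2))) - entropy_bits(b2))

def reference_graph(model_splits, observed_splits):
    """Return (num_nodes, root, edges) with edges as (source, target, weight)."""
    splits = list(model_splits) + list(observed_splits)
    n_model = len(model_splits)
    num_nodes = 1 + len(splits)          # |V| = |BM| + |Bi| + 1
    ref_bits = math.log2(num_nodes)      # cost of naming the reference
    edges = []
    # Part two: zero-cost edges from r to every model bipartition.
    for j in range(n_model):
        edges.append((0, 1 + j, 0.0))
    # Part one: edges from r and from every other bipartition into each
    # observed bipartition.
    for i in range(n_model, len(splits)):
        edges.append((0, 1 + i, length(splits[i]) + ref_bits))
        for j in range(len(splits)):
            if j != i:
                w = cond_length(splits[i], splits[j]) + ref_bits
                edges.append((1 + j, 1 + i, w))
    return num_nodes, 0, edges

# Hypothetical input: one model bipartition and two observed bipartitions.
num_nodes, root, edges = reference_graph([[0, 0, 1, 1]],
                                         [[0, 0, 1, 1], [0, 1, 1, 1]])
print(num_nodes, root, len(edges))
```

The graph for L(TM) alone follows the same pattern with an empty observed set replaced by the model bipartitions themselves as the nodes to be encoded.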

For both constructions, the minimum encoding length is found by solving for

the DMST with the algorithm of Chu and Liu [3] and summing the weights of

the edges. This cost is computed for a candidate model tree TM and for each

observed tree Ti to give the total cost [L(TM,T1,...,Tn)].
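For readers unfamiliar with the DMST step, the following is a compact pure-Python sketch of the standard Chu-Liu/Edmonds contraction procedure that returns only the total weight of the minimum spanning arborescence. The function name `min_arborescence_cost` and the toy graph are illustrative assumptions; this is a textbook-style version, not the authors' implementation, and it runs in O(|V||E|) time.

```python
def min_arborescence_cost(num_nodes, root, edges):
    """Total weight of a minimum spanning arborescence rooted at `root`.

    edges -- list of (source, target, weight) with nodes numbered 0..num_nodes-1.
    Returns None if some node cannot be reached from the root.
    """
    INF = float("inf")
    total = 0.0
    while True:
        # Step 1: pick the cheapest incoming edge for every node.
        in_w = [INF] * num_nodes
        pre = [-1] * num_nodes
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        for v in range(num_nodes):
            if v != root and in_w[v] == INF:
                return None
        in_w[root] = 0.0
        # Step 2: add those weights and look for cycles among the chosen edges.
        comp = [-1] * num_nodes
        seen = [-1] * num_nodes
        count = 0
        for v in range(num_nodes):
            total += in_w[v]
            x = v
            while seen[x] != v and comp[x] == -1 and x != root:
                seen[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:   # x lies on a new cycle
                y = pre[x]
                while y != x:
                    comp[y] = count
                    y = pre[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total                      # no cycles: done
        # Step 3: contract each cycle into one node and reweight edges.
        for v in range(num_nodes):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root = comp[root]
        num_nodes = count

# Toy graph: the cheapest incoming edges of nodes 1 and 2 form a cycle,
# which the contraction step has to break.
toy_edges = [(0, 1, 5.0), (0, 2, 3.0), (2, 1, 1.0), (1, 2, 1.0)]
print(min_arborescence_cost(3, 0, toy_edges))
```

Applied to a reference graph with the empty node r as root, the returned weight corresponds to the minimum encoding length described above.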