A Consensus Tree Approach for Reconstructing Human Evolutionary History and Detecting Population Substructure

Joint Carnegie Mellon University/University of Pittsburgh PhD Program in Computational Biology and Lane Center for Computational Biology, 4400 Fifth Avenue, Pittsburgh, PA 15213, USA.
IEEE/ACM Transactions on Computational Biology and Bioinformatics (Impact Factor: 1.54). 07/2011; 8(4):918-28. DOI: 10.1109/TCBB.2011.23
Source: PubMed

ABSTRACT The random accumulation of variations in the human genome over time implicitly encodes a history of how human populations have arisen, dispersed, and intermixed since we emerged as a species. Reconstructing that history is a challenging computational and statistical problem but has important applications both to basic research and to the discovery of genotype-phenotype correlations. We present a novel approach to inferring human evolutionary history from genetic variation data. We use the idea of consensus trees, a technique generally used to reconcile species trees from divergent gene trees, adapting it to the problem of finding robust relationships within a set of intraspecies phylogenies derived from local regions of the genome. Validation on both simulated and real data shows the method to be effective in recapitulating the known true structure of the data, closely matching our best current understanding of human evolutionary history. Additional comparison with results of leading methods for the problem of population substructure assignment verifies that our method provides comparable accuracy in identifying meaningful population subgroups, in addition to inferring relationships among them. The consensus tree approach thus provides a promising new model for the robust inference of substructure and ancestry from large-scale genetic variation data.
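
As a rough illustration of the general consensus-tree idea mentioned in the abstract, the sketch below computes majority-rule consensus clades from a small set of rooted trees. It is not the authors' method, which works on intraspecies phylogenies derived from local genomic regions. The nested-tuple tree encoding, the function names, and the three example trees over population labels are all assumptions made up for illustration.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested tuples.

    A leaf is any non-tuple value (here a population label); an internal
    node is a tuple of subtrees. Hypothetical encoding, for illustration only.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found


def majority_clades(trees, threshold=0.5):
    """Keep the clades that occur in more than `threshold` of the input trees."""
    counts = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        found.discard(all_leaves)  # the root clade is trivial, so skip it
        counts.update(found)
    return {c for c, n in counts.items() if n / len(trees) > threshold}


# Three made-up local trees over four population labels (illustrative only).
t1 = (("CEU", ("CHB", "JPT")), "YRI")
t2 = (("CEU", ("CHB", "JPT")), "YRI")
t3 = ((("CEU", "CHB"), "JPT"), "YRI")

consensus = majority_clades([t1, t2, t3])
for clade in sorted(consensus, key=len):
    print(sorted(clade))
```

Groupings supported by most of the local trees survive, while a grouping seen in only one tree is dropped, which is the sense in which a consensus favors robust relationships over region-specific noise.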

  • ABSTRACT: Phylogenetics, or the inference of evolutionary trees, is one of the oldest and most intensively studied topics in computational biology. Yet it remains a vibrant area of research, in part because advances in our ability to gather data for phylogenetic inference continue to create novel and more challenging variants of the phylogeny problem. In this talk, I will discuss a particular challenge underlying some important phylogenetic problems in the genomic era: reconstructing evolutionary histories from samples of heterogeneous populations, each of which may contain contributions from multiple evolutionary stages or pathways.
    Bioinformatics Research and Applications - 7th International Symposium, ISBRA 2011, Changsha, China, May 27-29, 2011. Proceedings; 01/2011
  • ABSTRACT: Detecting and quantifying the timing and the genetic contributions of parental populations to a hybrid population is an important but challenging problem in reconstructing evolutionary histories from genetic variation data. With the advent of high throughput genotyping technologies, new methods suitable for large-scale data are especially needed. Furthermore, existing methods typically assume the assignment of individuals into subpopulations is known, when that itself is a difficult problem often unresolved for real data. Here we propose a novel method that combines prior work for inferring non-reticulate population structures with an MCMC scheme for sampling over admixture scenarios to both identify population assignments and learn divergence times and admixture proportions for those populations using genome-scale admixed genetic variation data. We validated our method using coalescent simulations and a collection of real bovine and human variation data. On simulated sequences, our methods show better accuracy and faster runtime than leading competitive methods in estimating admixture fractions and divergence times. Analysis on the real data further shows our methods to be effective at matching our best current knowledge about the relevant populations.
    IEEE/ACM Transactions on Computational Biology and Bioinformatics (Impact Factor: 1.54). 08/2013; 10(5). DOI: 10.1109/TCBB.2013.98

