ABSTRACT: The magnitude of genetic diversity within human populations varies in a way that reflects the sequence of migrations by which people spread throughout the world. Beyond its use in human evolutionary genetics, worldwide variation in genetic diversity sometimes can interact with social processes to produce differences among populations in their relationship to modern societal problems. We review the consequences of genetic diversity differences in the settings of familial identification in forensic genetic testing, match probabilities in bone marrow transplantation, and representation in genome-wide association studies of disease. In each of these three cases, the contribution of genetic diversity to social differences follows from population-genetic principles. For a fourth setting that is not similarly grounded, we reanalyze with expanded genetic data a report that genetic diversity differences influence global patterns of human economic development, finding no support for the claim. The four examples describe a limit to the importance of genetic diversity for explaining societal differences while illustrating a distinction that certain biologically based scenarios do require consideration of genetic diversity for solving problems to which populations have been differentially predisposed by the unique history of human migrations.
ABSTRACT: Coalescent histories provide lists of species tree branches on which gene
tree coalescences can take place, and their enumerative properties assist in
understanding the computational complexity of calculations central in the study
of gene trees and species trees. Here, we solve an enumerative problem left
open by Rosenberg (IEEE/ACM Transactions on Computational Biology and
Bioinformatics 10: 1253-1262, 2013) concerning the number of coalescent
histories for gene trees and species trees with a matching labeled topology
that belongs to a generic caterpillar-like family. By bringing a generating
function approach to the study of coalescent histories, we prove that for any
caterpillar-like family with seed tree $t$, the sequence $(h_n)_{n\geq 0}$
describing the number of matching coalescent histories of the $n$th tree of the
family grows asymptotically as a constant multiple of the Catalan numbers.
Thus, $h_n \sim \beta_t c_n$, where the asymptotic constant $\beta_t > 0$
depends on the shape of the seed tree $t$. The result extends a claim
demonstrated only for seed trees with at most 8 taxa to arbitrary seed trees,
expanding the set of cases for which detailed enumerative properties of
coalescent histories can be determined. We introduce a procedure that computes
from $t$ the constant $\beta_t$ as well as the algebraic expression for the
generating function of the sequence $(h_n)_{n\geq 0}$.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 03/2015; DOI:10.1109/TCBB.2015.2485217 · 1.44 Impact Factor
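To make the asymptotic statement $h_n \sim \beta_t c_n$ in the abstract above concrete, the following minimal Python sketch computes the Catalan numbers $c_n$ and checks numerically that the ratio of successive terms approaches 4, the exponential growth rate that any sequence of the form $\beta_t c_n$ would share. The name `catalan` is illustrative only, and no seed-tree constant $\beta_t$ is computed here, because the abstract does not describe the procedure for obtaining it.

```python
from math import comb

def catalan(n: int) -> int:
    """Return the n-th Catalan number c_n = C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# First few Catalan numbers.
first_terms = [catalan(n) for n in range(6)]

# c_{n+1} / c_n = 2(2n + 1) / (n + 2), which tends to 4 as n grows.
ratio_at_1000 = catalan(1001) / catalan(1000)
```

Because the constant $\beta_t$ cancels in the ratio $h_{n+1}/h_n$, the same limit of 4 would be expected for any caterpillar-like family covered by the result.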
ABSTRACT: Coalescent histories are combinatorial structures that describe for a given
gene tree and species tree the possible lists of branches of the species tree
on which the gene tree coalescences take place. Properties of the number of
coalescent histories for gene trees and species trees affect a variety of
probabilistic calculations in mathematical phylogenetics. Exact and asymptotic
evaluations of the number of coalescent histories, however, are known only in a
limited number of cases. Here we introduce a particular family of species
trees, the \emph{lodgepole} species trees $(\lambda_n)_{n\geq 0}$, in which
tree $\lambda_n$ has $m=2n+1$ taxa. We determine the number of coalescent
histories for the lodgepole species trees, in the case that the gene tree
matches the species tree, showing that this number grows with $m!!$ in the
number of taxa $m$. This computation demonstrates the existence of tree
families in which the growth in the number of coalescent histories is faster
than exponential. Further, it provides a substantial improvement on the lower
bound for the ratio of the largest number of matching coalescent histories to
the smallest number of matching coalescent histories for trees with $m$ taxa,
increasing a previous bound of $(\sqrt{\pi} / 32)[(5m-12)/(4m-6)] m \sqrt{m}$
to $[ \sqrt{m-1}/(4 \sqrt{e}) ]^{m}$. We discuss the implications of our
enumerative results for phylogenetic computations.
Journal of computational biology: a journal of computational molecular cell biology 03/2015; DOI:10.1089/cmb.2015.0015 · 1.74 Impact Factor
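As a rough numerical illustration of the growth rates named in the abstract above, the sketch below computes the double factorial $m!!$ and evaluates the two bound expressions exactly as they are quoted there. The function names are hypothetical, and the code only evaluates the stated formulas; it does not count coalescent histories.

```python
import math

def double_factorial(m: int) -> int:
    """Return m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def growth_ratios(max_m: int) -> list:
    """Ratios m!! / (m - 2)!! for odd m. Each ratio equals m, so the ratios
    are unbounded, which is what faster-than-exponential growth means."""
    return [double_factorial(m) / double_factorial(m - 2)
            for m in range(3, max_m + 1, 2)]

def previous_bound(m: int) -> float:
    """(sqrt(pi) / 32) * [(5m - 12) / (4m - 6)] * m * sqrt(m), as quoted."""
    return (math.sqrt(math.pi) / 32) * ((5 * m - 12) / (4 * m - 6)) * m * math.sqrt(m)

def improved_bound(m: int) -> float:
    """[sqrt(m - 1) / (4 * sqrt(e))]^m, as quoted."""
    return (math.sqrt(m - 1) / (4 * math.sqrt(math.e))) ** m
```

Evaluating both expressions shows that the improvement is asymptotic: the base $\sqrt{m-1}/(4\sqrt{e})$ exceeds 1 only once $m - 1 > 16e$, so `improved_bound` overtakes `previous_bound` for larger $m$ (for example $m = 101$) but not for small $m$ (for example $m = 21$).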
ABSTRACT: The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modeling assumptions, compares results across different pre-determined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the post-processing of results of model-based population structure analyses. For analyzing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp, and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology. This article is protected by copyright. All rights reserved.
Studies in History and Philosophy of Science Part C Studies in History and Philosophy of Biological and Biomedical Sciences 02/2015; 52. DOI:10.1016/j.shpsc.2014.12.005
ABSTRACT: Worldwide patterns of genetic variation are driven by human demographic history. Here, we test whether this demographic history has left similar signatures on phonemes (sound units that distinguish meaning between words in languages) to those it has left on genes. We analyze, jointly and in parallel, phoneme inventories from 2,082 worldwide languages and microsatellite polymorphisms from 246 worldwide populations. On a global scale, both genetic distance and phonemic distance between populations are significantly correlated with geographic distance. Geographically close language pairs share significantly more phonemes than distant language pairs, whether or not the languages are closely related. The regional geographic axes of greatest phonemic differentiation correspond to axes of genetic differentiation, suggesting that there is a relationship between human dispersal and linguistic variation. However, the geographic distribution of phoneme inventory sizes does not follow the predictions of a serial founder effect during human expansion out of Africa. Furthermore, although geographically isolated populations lose genetic diversity via genetic drift, phonemes are not subject to drift in the same way: within a given geographic radius, languages that are relatively isolated exhibit more variance in number of phonemes than languages with many neighbors. This finding suggests that relatively isolated languages are more susceptible to phonemic change than languages with many neighbors. Within a language family, phoneme evolution along genetic, geographic, or cognate-based linguistic trees predicts similar ancestral phoneme states to those predicted from ancient sources. More genetic sampling could further elucidate the relative roles of vertical and horizontal transmission in phoneme evolution.
Proceedings of the National Academy of Sciences 02/2015; 112(5):1265-1272. DOI:10.1073/pnas.1424033112 · 9.67 Impact Factor
ABSTRACT: $F_{ST}$ is one of the most frequently-used indices of genetic differentiation among groups. Though $F_{ST}$ takes values between 0 and 1, authors going back to Wright have noted that under many circumstances, $F_{ST}$ is constrained to be less than 1. Recently, we showed that at a genetic locus with an unspecified number of alleles, $F_{ST}$ for two subpopulations is strictly bounded from above by functions of both the frequency of the most frequent allele ($M$) and the homozygosity of the total population ($H_T$). In the two-subpopulation case, $F_{ST}$ can equal one only when the frequency of the most frequent allele and the total homozygosity are $1/2$. Here, we extend this work by deriving strict bounds on $F_{ST}$ for two subpopulations when the number of alleles at the locus is specified to be $I$. We show that restricting to $I$ alleles produces the same upper bound on $F_{ST}$ over much of the allowable domain for $M$ and $H_T$, and we derive more restrictive bounds in the windows $M \in [1/I, 1/(I-1))$ and $H_T \in [1/I, I/(I^2-1))$. These results extend our understanding of the behavior of $F_{ST}$ in relation to other population-genetic statistics.
Theoretical Population Biology 11/2014; 97. DOI:10.1016/j.tpb.2014.08.001 · 1.70 Impact Factor
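The quantities named in the abstract above can be illustrated with a short Python sketch. It assumes two equally weighted subpopulations with known allele frequencies and assumes the homozygosity-based definition $F_{ST} = (H_S - H_T)/(1 - H_T)$, where $H_S$ is the mean within-subpopulation homozygosity; the abstract does not spell out the definition, so both choices and the function name `fst_two_subpops` are assumptions made for illustration.

```python
def fst_two_subpops(p1, p2):
    """Return (F_ST, M, H_T) for two equally weighted subpopulations.

    p1 and p2 are allele-frequency vectors over the same alleles.
    Assumed definition: F_ST = (H_S - H_T) / (1 - H_T), with homozygosities.
    """
    if len(p1) != len(p2):
        raise ValueError("frequency vectors must list the same alleles")
    # Allele frequencies in the total population (equal weights assumed).
    mean = [(a + b) / 2 for a, b in zip(p1, p2)]
    # Mean within-subpopulation homozygosity H_S.
    h_s = (sum(a * a for a in p1) + sum(b * b for b in p2)) / 2
    # Total homozygosity H_T and frequency M of the most frequent allele.
    h_t = sum(x * x for x in mean)
    m = max(mean)
    if h_t == 1:
        raise ValueError("F_ST is undefined at a monomorphic locus")
    return (h_s - h_t) / (1 - h_t), m, h_t

# Two subpopulations fixed for different alleles: M = H_T = 1/2.
fixed_case = fst_two_subpops([1.0, 0.0], [0.0, 1.0])

# A three-allele case in which M = 1/2 but H_T < 1/2.
three_allele_case = fst_two_subpops([1.0, 0.0, 0.0], [0.0, 0.5, 0.5])
```

In the first case the function returns $F_{ST} = 1$ with $M = H_T = 1/2$, the situation the abstract identifies as the only one in which $F_{ST}$ can equal one for two subpopulations. In the second case $F_{ST}$ falls below 1 even though the subpopulations share no alleles.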
ABSTRACT: Approximate Bayesian computation (ABC) methods perform inference on model-specific parameters of mechanistically motivated parametric models when evaluating likelihoods is difficult. Central to the success of ABC methods, which have been used frequently in biology, is computationally inexpensive simulation of data sets from the parametric model of interest. However, when simulating data sets from a model is so computationally expensive that the posterior distribution of parameters cannot be adequately sampled by ABC, inference is not straightforward. We present “approximate approximate Bayesian computation” (AABC), a class of computationally fast inference methods that extends ABC to models in which simulating data is expensive. In AABC, we first simulate a number of data sets small enough to be computationally feasible to simulate from the parametric model. Conditional on these data sets, we use a statistical model that approximates the correct parametric model and enables efficient simulation of a large number of data sets. We show that under mild assumptions, the posterior distribution obtained by AABC converges to the posterior distribution obtained by ABC, as the number of data sets simulated from the parametric model and the sample size of the observed data set increase. We demonstrate the performance of AABC on a population-genetic model of natural selection, as well as on a model of the admixture history of hybrid populations. This latter example illustrates how, in population genetics, AABC is of particular utility in scenarios that rely on conceptually straightforward but potentially slow forward-in-time simulations.
Theoretical Population Biology 09/2014; 99. DOI:10.1016/j.tpb.2014.09.002 · 1.70 Impact Factor
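For readers unfamiliar with the baseline that AABC extends, the following is a minimal sketch of plain ABC rejection sampling, not of AABC itself. It uses a toy problem chosen purely for illustration (inferring a binomial success probability under a uniform prior); the function name, parameters, and tolerance are all assumptions and do not come from the paper.

```python
import random

def abc_rejection(observed, n_trials, n_draws, tolerance, seed=0):
    """Plain ABC rejection sampler for a binomial success probability.

    Draw p from a Uniform(0, 1) prior, simulate a data set from the model,
    and keep p whenever the simulated count is within `tolerance` of the
    observed count.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # draw a parameter value from the prior
        simulated = sum(rng.random() < p for _ in range(n_trials))
        if abs(simulated - observed) <= tolerance:
            accepted.append(p)
    return accepted

# Toy data: 15 successes in 50 trials.
posterior_sample = abc_rejection(observed=15, n_trials=50,
                                 n_draws=20000, tolerance=1)
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

Every accepted draw costs one full simulation from the model, which is the expense the abstract describes; in AABC most of those simulations would instead come from a cheaper approximating statistical model.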
ABSTRACT: Sex-biased admixture has been observed in a wide variety of admixed populations. Genetic variation in sex chromosomes and functions of quantities computed from sex chromosomes and autosomes have often been examined in order to infer patterns of sex-biased admixture, typically using statistical approaches that do not mechanistically model the complexity of a sex-specific history of admixture. Here, expanding on a model of Verdu & Rosenberg (2011) that did not include sex specificity, we develop a model that mechanistically examines sex-specific admixture histories. Under the model, multiple source populations contribute to an admixed population, potentially with their male and female contributions varying over time. In an admixed population descended from two source groups, we derive the moments of the distribution of the autosomal admixture fraction from a specific source population as a function of sex-specific introgression parameters and time. Considering admixture processes that are constant in time, we demonstrate that, surprisingly, although the mean autosomal admixture fraction from a specific source population does not reveal a sex bias in the admixture history, the variance of autosomal admixture is informative about sex bias. Specifically, the long-term variance decreases as the sex bias from a contributing source population increases. This result can be viewed as analogous to the reduction in effective population size for populations with an unequal number of breeding males and females. Our approach suggests that it may be possible to use the effect of sex-biased admixture on autosomal DNA to assist with methods for inference of the history of complex sex-biased admixture processes.
ABSTRACT: We derive formulas for mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, under probability distributions that satisfy the exchangeability property. We then apply the formulas to study mean deep coalescence cost under two commonly used exchangeable models: the uniform and Yule models. We find that mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, tends to be larger for unbalanced trees than for balanced trees. These results provide a better understanding of the deep coalescence cost, as well as allow for the development of new species tree inference criteria.
ABSTRACT: The initial contact of European populations with indigenous populations of the Americas produced diverse admixture processes across North, Central, and South America. Recent studies have examined the genetic structure of indigenous populations of Latin America and the Caribbean and their admixed descendants, reporting on the genomic impact of the history of admixture with colonizing populations of European and African ancestry. However, relatively little genomic research has been conducted on admixture in indigenous North American populations. In this study, we analyze genomic data at 475,109 single-nucleotide polymorphisms sampled in indigenous peoples of the Pacific Northwest in British Columbia and Southeast Alaska, populations with a well-documented history of contact with European and Asian traders, fishermen, and contract laborers. We find that the indigenous populations of the Pacific Northwest have higher gene diversity than Latin American indigenous populations. Among the Pacific Northwest populations, interior groups provide more evidence for East Asian admixture, whereas coastal groups have higher levels of European admixture. In contrast with many Latin American indigenous populations, the variance of admixture is high in each of the Pacific Northwest indigenous populations, as expected for recent and ongoing admixture processes. The results reveal some similarities but notable differences between admixture patterns in the Pacific Northwest and those in Latin America, contributing to a more detailed understanding of the genomic consequences of European colonization events throughout the Americas.
ABSTRACT: Analysis of probability distributions conditional on species trees has
demonstrated the existence of anomalous ranked gene trees (ARGTs), ranked gene
trees that are more probable than the ranked gene tree that accords with the
ranked species tree. Here, to improve the characterization of ARGTs, we study
enumerative and probabilistic properties of two classes of ranked labeled
species trees, focusing on the presence or avoidance of certain subtree
patterns associated with the production of ARGTs. We provide exact enumerations
and asymptotic estimates for cardinalities of these sets of trees, showing that
as the number of species increases without bound, the fraction of all ranked
labeled species trees that are ARGT-producing approaches 1. This result extends
beyond earlier existence results to provide a probabilistic claim about the
frequency of ARGTs.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 07/2014; 11(6). DOI:10.1109/TCBB.2014.2343977 · 1.44 Impact Factor
ABSTRACT: Background/aims:
Culturally driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation.
Methods:
Using genotypes at 645 microsatellites, we compare inbreeding coefficients from 5,043 individuals representing 237 populations worldwide to demographic consanguinity frequency estimates available for 26 populations as well as to other quantities that can illuminate population-genetic influences on inbreeding coefficients.
Results:
We observe higher inbreeding coefficient estimates in populations and geographic regions with known high levels of consanguinity or genetic isolation and in populations with an increased effect of genetic drift and decreased genetic diversity with increasing distance from Africa. For the small number of populations with specific consanguinity estimates, we find a correlation between inbreeding coefficients and consanguinity frequency (r = 0.349, p = 0.040).
Conclusions:
The results emphasize the importance of both consanguinity and population-genetic factors in influencing variation in inbreeding coefficients, and they provide insight into factors useful for assessing the effect of consanguinity on genomic patterns in different populations.
Human Heredity 07/2014; 77(1-4):37-48. DOI:10.1159/000362878 · 1.47 Impact Factor
ABSTRACT: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.
Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ~47 kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies.
When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.
ABSTRACT: Under the coalescent model, the random number $n_t$ of lineages ancestral to a sample is nearly deterministic as a function of time when $n_t$ is moderate to large in value, and it is well approximated by its expectation $E[n_t]$. In turn, this expectation is well approximated by simple deterministic functions that are easy to compute. Such deterministic functions have been applied to estimate allele age, effective population size, and genetic diversity, and they have been used to study properties of models of infectious disease dynamics. Although a number of simple approximations of $E[n_t]$ have been derived and applied to problems of population-genetic inference, the theoretical accuracy of the formulas and the inferences obtained using these approximations is not known, and the range of problems to which they can be applied is not well understood. Here, we demonstrate general procedures by which the approximation $n_t \approx E[n_t]$ can be used to reduce the computational complexity of coalescent formulas, and we show that the resulting approximations converge to their true values under simple assumptions. Such approximations provide alternatives to exact formulas that are computationally intractable or numerically unstable when the number of sampled lineages is moderate or large. We also extend an existing class of approximations of $E[n_t]$ to the case of multiple populations of time-varying size with migration among them. Our results facilitate the use of the deterministic approximation $n_t \approx E[n_t]$ for deriving functionally simple, computationally efficient, and numerically stable approximations of coalescent formulas under complicated demographic scenarios.
Theoretical Population Biology 01/2014; 93. DOI:10.1016/j.tpb.2013.12.007 · 1.70 Impact Factor
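The idea of replacing the random lineage count with a deterministic function can be sketched in Python as follows. The closed form used here is the solution of the differential equation $dn/dt = -n(n-1)/2$ for a single constant-size population, with time measured in coalescent units in which each pair of lineages coalesces at rate 1. That specific formula is an assumption chosen for illustration, since the abstract does not state which functions the paper analyzes, and the function names are hypothetical. A small Monte Carlo simulation of the standard coalescent is included for comparison.

```python
import math
import random

def lineages_deterministic(n0, t):
    """Solution of dn/dt = -n(n - 1)/2 with n(0) = n0 (coalescent time units)."""
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2))

def lineages_simulated(n0, t, replicates=4000, seed=1):
    """Monte Carlo estimate of the mean number of lineages remaining at time t."""
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        k, elapsed = n0, 0.0
        while k > 1:
            # With k lineages, the waiting time to the next coalescence
            # is exponential with rate k(k - 1)/2.
            elapsed += rng.expovariate(k * (k - 1) / 2)
            if elapsed > t:
                break
            k -= 1
        total += k
    return total / replicates

approximation = lineages_deterministic(50, 0.1)
simulation = lineages_simulated(50, 0.1)
```

The deterministic function costs a single exponential evaluation regardless of the sample size, whereas the simulation cost grows with both the number of lineages and the number of replicates, which hints at why such approximations are attractive when the number of sampled lineages is large.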