Publications (136) · 950.85 Total impact
 Theoretical Population Biology 05/2015; DOI:10.1016/j.tpb.2015.05.001 · 1.53 Impact Factor
ABSTRACT: Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, examining frequencies of multiple haplotypes in relation to each other, include H12, a statistic designed to identify both hard and soft selective sweeps, and H2/H1, a statistic that, conditional on high H12 values, seeks to distinguish between hard and soft sweeps. A challenge in the use of H2/H1 is that its range depends on the associated value of H12, so that equal H2/H1 values might provide different levels of support for a soft sweep model at different values of H12. Here, we enhance the H12 and H2/H1 haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H2/H1 as a function of H12, thereby generating a statistic that normalizes H2/H1 to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization. Copyright © 2015. Published by Elsevier Inc.
Theoretical Population Biology 04/2015; 102. DOI:10.1016/j.tpb.2015.04.001 · 1.53 Impact Factor
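As an illustration of the statistics named in the preceding abstract, the following minimal Python sketch computes H1, H12, and H2/H1 from a list of haplotype frequencies. It assumes the definitions of Garud et al. (2015): H1 is the sum of squared haplotype frequencies, H12 pools the two most frequent haplotypes into one class, and H2 is H1 without the contribution of the most frequent haplotype. The function name and the example frequencies are hypothetical, and the normalization of H2/H1 derived in the paper is not reproduced here.

```python
def haplotype_homozygosity(freqs):
    """Return (H1, H12, H2/H1) for a list of haplotype frequencies summing to 1.

    Assumed definitions (Garud et al. 2015):
      H1  = sum_i p_i^2
      H12 = (p_1 + p_2)^2 + sum_{i>2} p_i^2 = H1 + 2 * p_1 * p_2
      H2  = H1 - p_1^2
    where p_1 >= p_2 >= ... are the frequencies in decreasing order.
    """
    p = sorted(freqs, reverse=True)
    if not p or any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("frequencies must be non-negative and sum to 1")
    p1 = p[0]
    p2 = p[1] if len(p) > 1 else 0.0
    h1 = sum(x * x for x in p)
    h12 = h1 + 2.0 * p1 * p2
    h2 = h1 - p1 * p1
    return h1, h12, h2 / h1


# Hypothetical sample: two common haplotypes and one rarer haplotype.
h1, h12, ratio = haplotype_homozygosity([0.4, 0.4, 0.2])
print(h1, h12, ratio)
```

A single dominant haplotype gives H2/H1 near 0, whereas several haplotypes at comparable frequencies push H2/H1 upward; the attainable range of H2/H1 depends on H12, as the abstract notes.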
ABSTRACT: Coalescent histories provide lists of species tree branches on which gene tree coalescences can take place, and their enumerative properties assist in understanding the computational complexity of calculations central in the study of gene trees and species trees. Here, we solve an enumerative problem left open by Rosenberg (IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 1253-1262, 2013) concerning the number of coalescent histories for gene trees and species trees with a matching labeled topology that belongs to a generic caterpillar-like family. By bringing a generating function approach to the study of coalescent histories, we prove that for any caterpillar-like family with seed tree $t$, the sequence $(h_n)_{n\geq 0}$ describing the number of matching coalescent histories of the $n$th tree of the family grows asymptotically as a constant multiple of the Catalan numbers. Thus, $h_n \sim \beta_t c_n$, where the asymptotic constant $\beta_t > 0$ depends on the shape of the seed tree $t$. The result extends a claim demonstrated only for seed trees with at most 8 taxa to arbitrary seed trees, expanding the set of cases for which detailed enumerative properties of coalescent histories can be determined. We introduce a procedure that computes from $t$ the constant $\beta_t$ as well as the algebraic expression for the generating function of the sequence $(h_n)_{n\geq 0}$.
ABSTRACT: Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the \emph{lodgepole} species trees $(\lambda_n)_{n\geq 0}$, in which tree $\lambda_n$ has $m=2n+1$ taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with $m!!$ in the number of taxa $m$. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with $m$ taxa, increasing a previous bound of $(\sqrt{\pi} / 32)[(5m-12)/(4m-6)] m \sqrt{m}$ to $[ \sqrt{m-1}/(4 \sqrt{e}) ]^{m}$. We discuss the implications of our enumerative results for phylogenetic computations.
Journal of computational biology: a journal of computational molecular cell biology 03/2015; DOI:10.1089/cmb.2015.0015 · 1.67 Impact Factor
ABSTRACT: The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modeling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analyzing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp, and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology. This article is protected by copyright. All rights reserved.
Molecular Ecology Resources 02/2015; DOI:10.1111/1755-0998.12387 · 5.63 Impact Factor
ABSTRACT: Researchers in many fields have considered the meaning of two results about genetic variation for concepts of "race." First, at most genetic loci, apportionments of human genetic diversity find that worldwide populations are genetically similar. Second, when multiple genetic loci are examined, it is possible to distinguish people with ancestry from different geographical regions. These two results raise an important question about human phenotypic diversity: To what extent do populations typically differ on phenotypes determined by multiple genetic loci? It might be expected that such phenotypes follow the pattern of similarity observed at individual loci. Alternatively, because they have a multilocus genetic architecture, they might follow the pattern of greater differentiation suggested by multilocus ancestry inference. To address the question, we extend a well-known classification model of Edwards (2003) by adding a selectively neutral quantitative trait. Using the extended model, we show, in line with previous work in quantitative genetics, that regardless of how many genetic loci influence the trait, one neutral trait is approximately as informative about ancestry as a single genetic locus. The results support the relevance of single-locus genetic-diversity partitioning for predictions about phenotypic diversity. Copyright © 2015 Elsevier Ltd. All rights reserved.
Studies in History and Philosophy of Science Part C Studies in History and Philosophy of Biological and Biomedical Sciences 02/2015; DOI:10.1016/j.shpsc.2014.12.005
ABSTRACT: Worldwide patterns of genetic variation are driven by human demographic history. Here, we test whether this demographic history has left similar signatures on phonemes (sound units that distinguish meaning between words in languages) to those it has left on genes. We analyze, jointly and in parallel, phoneme inventories from 2,082 worldwide languages and microsatellite polymorphisms from 246 worldwide populations. On a global scale, both genetic distance and phonemic distance between populations are significantly correlated with geographic distance. Geographically close language pairs share significantly more phonemes than distant language pairs, whether or not the languages are closely related. The regional geographic axes of greatest phonemic differentiation correspond to axes of genetic differentiation, suggesting that there is a relationship between human dispersal and linguistic variation. However, the geographic distribution of phoneme inventory sizes does not follow the predictions of a serial founder effect during human expansion out of Africa. Furthermore, although geographically isolated populations lose genetic diversity via genetic drift, phonemes are not subject to drift in the same way: within a given geographic radius, languages that are relatively isolated exhibit more variance in number of phonemes than languages with many neighbors. This finding suggests that relatively isolated languages are more susceptible to phonemic change than languages with many neighbors. Within a language family, phoneme evolution along genetic, geographic, or cognate-based linguistic trees predicts similar ancestral phoneme states to those predicted from ancient sources. More genetic sampling could further elucidate the relative roles of vertical and horizontal transmission in phoneme evolution.
Proceedings of the National Academy of Sciences 02/2015; 112(5):1265-1272. DOI:10.1073/pnas.1424033112 · 9.81 Impact Factor
Theoretical Population Biology 01/2015; 102. DOI:10.1016/j.tpb.2015.01.002 · 1.53 Impact Factor
ABSTRACT: FST is one of the most frequently used indices of genetic differentiation among groups. Though FST takes values between 0 and 1, authors going back to Wright have noted that under many circumstances, FST is constrained to be less than 1. Recently, we showed that at a genetic locus with an unspecified number of alleles, FST for two subpopulations is strictly bounded from above by functions of both the frequency of the most frequent allele (M) and the homozygosity of the total population (HT). In the two-subpopulation case, FST can equal one only when the frequency of the most frequent allele and the total homozygosity are 1/2. Here, we extend this work by deriving strict bounds on FST for two subpopulations when the number of alleles at the locus is specified to be I. We show that restricting to I alleles produces the same upper bound on FST over much of the allowable domain for M and HT, and we derive more restrictive bounds in the windows M ∈ [1/I, 1/(I−1)) and HT ∈ [1/I, I/(I²−1)). These results extend our understanding of the behavior of FST in relation to other population-genetic statistics.
Theoretical Population Biology 11/2014; 97. DOI:10.1016/j.tpb.2014.08.001 · 1.53 Impact Factor
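To make the quantities in the preceding abstract concrete, the sketch below computes FST, M, and HT at a single locus for two subpopulations. It assumes equal subpopulation weights and the homozygosity-based form FST = (HS − HT) / (1 − HT), where HS is the mean within-subpopulation homozygosity; the function name and example frequencies are hypothetical, and the bounds derived in the paper are not implemented.

```python
def fst_two_subpops(p1, p2):
    """Return (FST, M, HT) for allele-frequency vectors p1 and p2 at one locus.

    Assumptions: two equally weighted subpopulations; each vector lists the
    frequencies of the same alleles in the same order and sums to 1.
    """
    if len(p1) != len(p2):
        raise ValueError("frequency vectors must have the same length")
    for p in (p1, p2):
        if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
            raise ValueError("frequencies must be non-negative and sum to 1")
    # Allele frequencies in the total (pooled) population.
    mean = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    # Mean within-subpopulation homozygosity.
    hs = (sum(a * a for a in p1) + sum(b * b for b in p2)) / 2.0
    # Homozygosity of the total population.
    ht = sum(m * m for m in mean)
    if ht >= 1.0:
        raise ValueError("FST is undefined at a monomorphic locus")
    fst = (hs - ht) / (1.0 - ht)
    return fst, max(mean), ht


# Two subpopulations fixed for different alleles: the case in which FST reaches 1.
print(fst_two_subpops([1.0, 0.0], [0.0, 1.0]))
# A weakly differentiated locus.
print(fst_two_subpops([0.8, 0.2], [0.6, 0.4]))
```

In the first call M and HT both equal 1/2, matching the condition the abstract gives for FST = 1 in the two-subpopulation case.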
Article: AABC: Approximate approximate Bayesian computation for inference in population-genetic models
ABSTRACT: Approximate Bayesian computation (ABC) methods perform inference on model-specific parameters of mechanistically motivated parametric models when evaluating likelihoods is difficult. Central to the success of ABC methods, which have been used frequently in biology, is computationally inexpensive simulation of data sets from the parametric model of interest. However, when simulating data sets from a model is so computationally expensive that the posterior distribution of parameters cannot be adequately sampled by ABC, inference is not straightforward. We present “approximate approximate Bayesian computation” (AABC), a class of computationally fast inference methods that extends ABC to models in which simulating data is expensive. In AABC, we first simulate a number of data sets small enough to be computationally feasible to simulate from the parametric model. Conditional on these data sets, we use a statistical model that approximates the correct parametric model and enables efficient simulation of a large number of data sets. We show that under mild assumptions, the posterior distribution obtained by AABC converges to the posterior distribution obtained by ABC, as the number of data sets simulated from the parametric model and the sample size of the observed data set increase. We demonstrate the performance of AABC on a population-genetic model of natural selection, as well as on a model of the admixture history of hybrid populations. This latter example illustrates how, in population genetics, AABC is of particular utility in scenarios that rely on conceptually straightforward but potentially slow forward-in-time simulations.
Theoretical Population Biology 09/2014; 99. DOI:10.1016/j.tpb.2014.09.002 · 1.53 Impact Factor
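For readers unfamiliar with the baseline that AABC extends, the following is a minimal sketch of standard rejection ABC, not of AABC itself. The toy model (a binomial count of one allele among 100 sampled gene copies, with a uniform prior on the allele frequency), the tolerance, and all function names are assumptions chosen only for illustration.

```python
import random


def abc_rejection(observed, simulate, prior_sample, distance, tolerance, n_draws, rng):
    """Standard rejection ABC: keep prior draws whose simulated data lie
    within `tolerance` of the observed data under `distance`."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)            # draw a parameter from the prior
        simulated = simulate(theta, rng)     # simulate a data set under theta
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)           # approximate posterior sample
    return accepted


def simulate_count(theta, rng, n=100):
    """Toy simulator (assumed): copies of an allele among n sampled gene copies."""
    return sum(rng.random() < theta for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is reproducible
posterior = abc_rejection(
    observed=30,
    simulate=simulate_count,
    prior_sample=lambda r: r.random(),       # uniform prior on [0, 1)
    distance=lambda a, b: abs(a - b),
    tolerance=2,
    n_draws=5000,
    rng=rng,
)
print(len(posterior), sum(posterior) / len(posterior))
```

Every accepted draw costs one call to the simulator, which is why this scheme becomes impractical when simulation is slow, the situation the abstract describes as motivating AABC.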
ABSTRACT: Sex-biased admixture has been observed in a wide variety of admixed populations. Genetic variation in sex chromosomes and functions of quantities computed from sex chromosomes and autosomes have often been examined in order to infer patterns of sex-biased admixture, typically using statistical approaches that do not mechanistically model the complexity of a sex-specific history of admixture. Here, expanding on a model of Verdu & Rosenberg (2011) that did not include sex specificity, we develop a model that mechanistically examines sex-specific admixture histories. Under the model, multiple source populations contribute to an admixed population, potentially with their male and female contributions varying over time. In an admixed population descended from two source groups, we derive the moments of the distribution of the autosomal admixture fraction from a specific source population as a function of sex-specific introgression parameters and time. Considering admixture processes that are constant in time, we demonstrate that, surprisingly, although the mean autosomal admixture fraction from a specific source population does not reveal a sex bias in the admixture history, the variance of autosomal admixture is informative about sex bias. Specifically, the long-term variance decreases as the sex bias from a contributing source population increases. This result can be viewed as analogous to the reduction in effective population size for populations with an unequal number of breeding males and females. Our approach suggests that it may be possible to use the effect of sex-biased admixture on autosomal DNA to assist with methods for inference of the history of complex sex-biased admixture processes.
Genetics 09/2014; 198(3). DOI:10.1534/genetics.114.166793 · 4.87 Impact Factor
ABSTRACT: We derive formulas for mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, under probability distributions that satisfy the exchangeability property. We then apply the formulas to study mean deep coalescence cost under two commonly used exchangeable models, the uniform and Yule models. We find that mean deep coalescence cost, for either a fixed species tree or a fixed gene tree, tends to be larger for unbalanced trees than for balanced trees. These results provide a better understanding of the deep coalescence cost, as well as allow for the development of new species tree inference criteria.
Discrete Applied Mathematics 09/2014; 174. DOI:10.1016/j.dam.2014.02.010 · 0.68 Impact Factor
Article: Patterns of Admixture and Population Structure in Native Populations of Northwest North America.
ABSTRACT: The initial contact of European populations with indigenous populations of the Americas produced diverse admixture processes across North, Central, and South America. Recent studies have examined the genetic structure of indigenous populations of Latin America and the Caribbean and their admixed descendants, reporting on the genomic impact of the history of admixture with colonizing populations of European and African ancestry. However, relatively little genomic research has been conducted on admixture in indigenous North American populations. In this study, we analyze genomic data at 475,109 single-nucleotide polymorphisms sampled in indigenous peoples of the Pacific Northwest in British Columbia and Southeast Alaska, populations with a well-documented history of contact with European and Asian traders, fishermen, and contract laborers. We find that the indigenous populations of the Pacific Northwest have higher gene diversity than Latin American indigenous populations. Among the Pacific Northwest populations, interior groups provide more evidence for East Asian admixture, whereas coastal groups have higher levels of European admixture. In contrast with many Latin American indigenous populations, the variance of admixture is high in each of the Pacific Northwest indigenous populations, as expected for recent and ongoing admixture processes. The results reveal some similarities but notable differences between admixture patterns in the Pacific Northwest and those in Latin America, contributing to a more detailed understanding of the genomic consequences of European colonization events throughout the Americas.
PLoS Genetics 08/2014; 10(8):e1004530. DOI:10.1371/journal.pgen.1004530 · 8.17 Impact Factor
ABSTRACT: Analysis of probability distributions conditional on species trees has demonstrated the existence of anomalous ranked gene trees (ARGTs), ranked gene trees that are more probable than the ranked gene tree that accords with the ranked species tree. Here, to improve the characterization of ARGTs, we study enumerative and probabilistic properties of two classes of ranked labeled species trees, focusing on the presence or avoidance of certain subtree patterns associated with the production of ARGTs. We provide exact enumerations and asymptotic estimates for cardinalities of these sets of trees, showing that as the number of species increases without bound, the fraction of all ranked labeled species trees that are ARGT-producing approaches 1. This result extends beyond earlier existence results to provide a probabilistic claim about the frequency of ARGTs.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 07/2014; DOI:10.1109/TCBB.2014.2343977 · 1.54 Impact Factor
ABSTRACT: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size. Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species-tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ~47 kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional nontopological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies. When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.
BMC Evolutionary Biology 03/2014; 14(1):67. DOI:10.1186/1471-2148-14-67 · 3.41 Impact Factor
ABSTRACT: Under the coalescent model, the random number $n_t$ of lineages ancestral to a sample is nearly deterministic as a function of time when $n_t$ is moderate to large in value, and it is well approximated by its expectation $E[n_t]$. In turn, this expectation is well approximated by simple deterministic functions that are easy to compute. Such deterministic functions have been applied to estimate allele age, effective population size, and genetic diversity, and they have been used to study properties of models of infectious disease dynamics. Although a number of simple approximations of $E[n_t]$ have been derived and applied to problems of population-genetic inference, the theoretical accuracy of the formulas and the inferences obtained using these approximations is not known, and the range of problems to which they can be applied is not well understood. Here, we demonstrate general procedures by which the approximation $n_t \approx E[n_t]$ can be used to reduce the computational complexity of coalescent formulas, and we show that the resulting approximations converge to their true values under simple assumptions. Such approximations provide alternatives to exact formulas that are computationally intractable or numerically unstable when the number of sampled lineages is moderate or large. We also extend an existing class of approximations of $E[n_t]$ to the case of multiple populations of time-varying size with migration among them. Our results facilitate the use of the deterministic approximation $n_t \approx E[n_t]$ for deriving functionally simple, computationally efficient, and numerically stable approximations of coalescent formulas under complicated demographic scenarios.
Theoretical Population Biology 01/2014; DOI:10.1016/j.tpb.2013.12.007 · 1.53 Impact Factor
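The kind of "simple deterministic function" mentioned in the preceding abstract can be illustrated with one widely used approximation for a single population of constant size. It comes from solving the differential equation dn/dt = −n(n−1)/2, with time measured in coalescent units in which each pair of lineages coalesces at rate 1. Whether this exact form is among those analyzed in the paper is an assumption, and the function name is hypothetical.

```python
import math


def expected_lineages_approx(n0, t):
    """Deterministic approximation to E[n_t], starting from n0 sampled lineages.

    Solves dn/dt = -n(n-1)/2 with n(0) = n0, which gives
        n(t) = n0 / (n0 - (n0 - 1) * exp(-t / 2)).
    Assumes a single population of constant size and coalescent time units.
    """
    if n0 < 1 or t < 0:
        raise ValueError("need n0 >= 1 and t >= 0")
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2.0))


# The approximate number of ancestral lineages declines from n0 toward 1.
for t in (0.0, 0.01, 0.1, 1.0, 10.0):
    print(t, expected_lineages_approx(100, t))
```

The closed form involves a single exponential, which avoids the alternating-sign sums that make exact expressions for the distribution of the number of lineages numerically delicate when the sample is large.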
ABSTRACT: Culturally driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation.
Human Heredity 01/2014; 77(1-4):37-48. DOI:10.1159/000362878 · 1.64 Impact Factor
Article: Core elements of a TPB paper
Theoretical Population Biology 12/2013; 92. DOI:10.1016/j.tpb.2013.11.003 · 1.53 Impact Factor
Human Biology 12/2013; 85(6):954-954. DOI:10.3378/027.085.0608 · 1.52 Impact Factor
ABSTRACT: The Samaritans are a group of some 750 indigenous Middle Eastern people, about half of whom live in Holon, a suburb of Tel Aviv, and the other half near Nablus. The Samaritan population is believed to have numbered more than a million in late Roman times but less than 150 in 1917. The ancestry of the Samaritans has been subject to controversy from late Biblical times to the present. In this study, liquid chromatography/electrospray ionization/quadrupole ion trap mass spectrometry was used to allelotype 13 Y-chromosomal and 15 autosomal microsatellites in a sample of 12 Samaritans chosen to have as low a level of relationship as possible, and 461 Jews and non-Jews. Estimation of genetic distances between the Samaritans and seven Jewish and three non-Jewish populations from Israel, as well as populations from Africa, Pakistan, Turkey, and Europe, revealed that the Samaritans were closely related to Cohanim. This result supports the position of the Samaritans that they are descendants from the tribes of Israel dating to before the Assyrian exile in 722-720 BCE. In concordance with previously published single-nucleotide polymorphism haplotypes, each Samaritan family, with the exception of the Samaritan Cohen lineage, was observed to carry a distinctive Y-chromosome short tandem repeat haplotype that was not more than one mutation removed from the six-marker Cohen modal haplotype.
Human Biology 12/2013; 85(6):825-858. DOI:10.3378/027.085.0601 · 1.52 Impact Factor
Publication Stats
12k  Citations  
950.85  Total Impact Points  
Top Journals
 Theoretical Population Biology (10)
 Theoretical Population Biology (8)
 Genetics (8)
 Genetics (7)
 PLoS Genetics (7)
Institutions

1999–2015

Stanford University
 Department of Biology
Palo Alto, California, United States


2008–2012

Uppsala University
Uppsala, Uppsala, Sweden


2011

Statens Serum Institut
København, Capital Region, Denmark


2005–2011

University of Michigan
 • Life Sciences Institute
 • Department of Human Genetics
Ann Arbor, Michigan, United States 
University of Texas Health Science Center at Houston
 Human Genetics Center
Houston, TX, United States


2009

University of California, Davis
 Department of Anthropology
Davis, CA, United States


2007

Concordia University–Ann Arbor
Ann Arbor, Michigan, United States


2002–2005

University of Southern California
 Division of Molecular and Computational Biology
Los Angeles, California, United States
University of California, Los Angeles
Los Angeles, California, United States


2001

Hebrew University of Jerusalem
 Department of Genetics
Yerushalayim, Jerusalem, Israel
