ABSTRACT: Next-generation sequencing and advances in genomic enrichment technologies have enabled the discovery of the full spectrum of variants from common to rare alleles in the human population. The application of such technologies can be limited by the amount of DNA available. Whole genome amplification (WGA) can overcome such limitations. Here we investigate the applicability of WGA by comparing SNP and INDEL variant calls from a single genomic/WGA sample pair in two separate capture experiments: a 50 Mbp whole exome capture and a custom capture array of a 4 Mbp region on chr12.
Our results comparing variant calls derived from genomic and WGA DNA show that the majority of variant SNP and INDEL calls are common to both callsets, at both the site and genotype level, and suggest that allele bias plays a minimal role when using WGA DNA in re-sequencing studies.
Although the results of this study are based on a limited sample size, they suggest that using WGA DNA allows the discovery of the vast majority of variants and achieves high concordance metrics when compared with genomic DNA calls.
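The site-level and genotype-level comparison described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the `concordance` function, the dictionary layout keyed by (chromosome, position, ref, alt), and the example calls are all assumptions made for this sketch.

```python
def concordance(genomic, wga):
    """Compare two callsets given as {(chrom, pos, ref, alt): genotype}.

    Returns (site_concordance, genotype_concordance):
      - site concordance: shared sites / union of sites
      - genotype concordance: identical genotypes / shared sites
    """
    shared = genomic.keys() & wga.keys()
    union = genomic.keys() | wga.keys()
    site = len(shared) / len(union) if union else 0.0
    same_gt = sum(1 for key in shared if genomic[key] == wga[key])
    genotype = same_gt / len(shared) if shared else 0.0
    return site, genotype


# Hypothetical calls for one genomic/WGA sample pair.
genomic_calls = {
    ("chr12", 1000, "A", "G"): "0/1",
    ("chr12", 2000, "C", "T"): "1/1",
    ("chr12", 3000, "G", "GA"): "0/1",
    ("chr12", 4000, "T", "C"): "0/1",
}
wga_calls = {
    ("chr12", 1000, "A", "G"): "0/1",
    ("chr12", 2000, "C", "T"): "0/1",   # same site, different genotype
    ("chr12", 3000, "G", "GA"): "0/1",
    ("chr12", 5000, "A", "C"): "0/1",   # site private to the WGA callset
}

site_conc, gt_conc = concordance(genomic_calls, wga_calls)
print(f"site concordance: {site_conc:.2f}, genotype concordance: {gt_conc:.2f}")
```

Defining site concordance over the union of sites is one of several possible conventions; using the genomic callset alone as the denominator would instead measure how many genomic variants WGA recovers.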
ABSTRACT: Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently under way, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. In keeping with the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
ABSTRACT: As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
ABSTRACT: High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to exceed that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
Proceedings of the National Academy of Sciences 07/2011; 108(29):11983-8. · 9.81 Impact Factor
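The site frequency spectrum on which the analysis above is built can be tabulated from per-site derived allele counts. The sketch below is illustrative only and does not reproduce the paper's demographic inference or its jackknife projection; the function names and the example counts are assumptions.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry i-1 holds the number of segregating sites at which
    the derived allele is seen on exactly i of n_chromosomes chromosomes."""
    # Sites with count 0 or n are not segregating in the sample and are skipped.
    tally = Counter(c for c in derived_counts if 0 < c < n_chromosomes)
    return [tally.get(i, 0) for i in range(1, n_chromosomes)]


def fold(sfs):
    """Folded SFS (minor allele counts), for data without a known ancestral state."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        # Classes i and n - i are merged; the middle class (i == j) is counted once.
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded


# Hypothetical derived allele counts at nine sites in a sample of 6 chromosomes.
counts = [1, 1, 1, 2, 5, 3, 1, 0, 6]
sfs = site_frequency_spectrum(counts, 6)
folded_sfs = fold(sfs)
print(sfs)         # singletons first, then doubletons, and so on
print(folded_sfs)
```

In this toy sample most segregating sites are singletons, the same qualitative pattern the abstract reports: the majority of variable sites are rare.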
ABSTRACT: Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.
ABSTRACT: Analysis of polymorphism and divergence in the non-coding portion of the human genome yields crucial information about factors driving the evolution of gene regulation. Candidate cis-regulatory regions spanning more than 15,000 genes in 15 African Americans and 20 European Americans were re-sequenced and aligned to the chimpanzee genome in order to identify potentially functional polymorphism and to characterize and quantify departures from neutral evolution. Distortions of the site frequency spectra suggest a general pattern of selective constraint on conserved non-coding sites in the flanking regions of genes (CNCs). Moreover, there is an excess of fixed differences that cannot be explained by a Gamma model of deleterious fitness effects, suggesting the presence of positive selection on CNCs. Extensions of the McDonald-Kreitman test identified candidate cis-regulatory regions with high probabilities of positive and negative selection near many known human genes, the biological characteristics of which exhibit genome-wide trends that differ from patterns observed in protein-coding regions. Notably, there is a higher probability of positive selection in candidate cis-regulatory regions near genes expressed in the fetal brain, suggesting that a larger portion of adaptive regulatory changes has occurred in genes expressed during brain development. Overall we find that natural selection has played an important role in the evolution of candidate cis-regulatory regions throughout hominid evolution.
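For orientation, the classic 2×2 McDonald-Kreitman comparison that the abstract's extensions build on can be sketched as below. This is the textbook form, not the authors' extended method: the function name, the "selected" versus "neutral" site labels, and the counts are assumptions chosen for illustration.

```python
def mk_summary(d_sel, d_neu, p_sel, p_neu):
    """Classic McDonald-Kreitman summary statistics.

    d_sel, d_neu: fixed differences (divergence) at putatively selected
                  and putatively neutral sites
    p_sel, p_neu: polymorphisms at the same two site classes

    Returns (neutrality_index, alpha), where
      neutrality_index = (p_sel / p_neu) / (d_sel / d_neu)
      alpha = 1 - neutrality_index, an estimate of the fraction of fixed
              differences at selected sites driven by positive selection.
    """
    if min(d_sel, d_neu, p_neu) <= 0:
        raise ValueError("d_sel, d_neu and p_neu must be positive")
    neutrality_index = (p_sel / p_neu) / (d_sel / d_neu)
    return neutrality_index, 1.0 - neutrality_index


# Illustrative counts only: an excess of fixed differences at selected sites
# relative to polymorphism gives NI < 1 and alpha > 0.
ni, alpha = mk_summary(d_sel=7, d_neu=17, p_sel=2, p_neu=42)
print(f"NI = {ni:.3f}, alpha = {alpha:.3f}")
```

A neutrality index above 1 points the other way, toward an excess of polymorphism at selected sites, which is usually read as weak negative selection.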
ABSTRACT: Past demographic changes can produce distortions in patterns of genetic variation that can mimic the appearance of natural selection unless the demographic effects are explicitly removed. Here we fit a detailed model of human demography that incorporates divergence, migration, admixture, and changes in population size to directly sequenced data from 13,400 protein-coding genes from 20 European-American and 19 African-American individuals. Based on this demographic model, we use several new and established statistical methods for identifying genes with extreme patterns of polymorphism likely to be caused by Darwinian selection, providing the first genome-wide analysis of allele frequency distributions in humans based on directly sequenced data. The tests are based on observations of excesses of high-frequency derived alleles, excesses of low-frequency derived alleles, and excesses of differences in allele frequencies between populations. We detect numerous new genes with strong evidence of selection, including a number of genes related to psychiatric and other diseases. We also show that microRNA-controlled genes evolve under extremely high constraints and are more likely to undergo negative selection than other genes. Furthermore, we show that genes involved in muscle development have been subject to positive selection during recent human history. In accordance with previous studies, we find evidence for negative selection against mutations in genes associated with Mendelian disease and positive selection acting on genes associated with several complex diseases.
Genome Research 04/2009; 19(5):838-49. DOI:10.1101/gr.088336.108 · 13.85 Impact Factor
ABSTRACT: Characterizing patterns of genetic variation within and among human populations is important for understanding human evolutionary history and for careful design of medical genetic studies. Here, we analyze patterns of variation across 443,434 single nucleotide polymorphisms (SNPs) genotyped in 3845 individuals from four continental regions. This unique resource allows us to illuminate patterns of diversity in previously under-studied populations at the genome-wide scale including Latin America, South Asia, and Southern Europe. Key insights afforded by our analysis include quantifying the degree of admixture in a large collection of individuals from Guadalajara, Mexico; identifying language and geography as key determinants of population structure within India; and elucidating a north-south gradient in haplotype diversity within Europe. We also present a novel method for identifying long-range tracts of homozygosity indicative of recent common ancestry. Application of our approach suggests great variation within and among populations in the extent of homozygosity, suggesting both demographic history (such as population bottlenecks) and recent ancestry events (such as consanguinity) play an important role in patterning variation in large modern human populations.
Genome Research 03/2009; 19(5):795-803. DOI:10.1101/gr.088898.108 · 13.85 Impact Factor
ABSTRACT: Understanding the genetic structure of human populations is of fundamental interest to medical, forensic and anthropological sciences. Advances in high-throughput genotyping technology have markedly improved our understanding of global patterns of human genetic variation and suggest the potential to use large samples to uncover variation among closely spaced populations. Here we characterize genetic variation in a sample of 3,000 European individuals genotyped at over half a million variable DNA sites in the human genome. Despite low average levels of genetic differentiation among Europeans, we find a close correspondence between genetic and geographic distances; indeed, a geographical map of Europe arises naturally as an efficient two-dimensional summary of genetic variation in Europeans. The results emphasize that when mapping the genetic basis of a disease phenotype, spurious associations can arise if genetic structure is not properly accounted for. In addition, the results are relevant to the prospects of genetic ancestry testing; an individual's DNA can be used to infer their geographic origin with surprising accuracy, often to within a few hundred kilometres.
ABSTRACT: Technological and scientific advances, stemming in large part from the Human Genome and HapMap projects, have made large-scale, genome-wide investigations feasible and cost effective. These advances have the potential to dramatically impact drug discovery and development by identifying genetic factors that contribute to variation in disease risk as well as drug pharmacokinetics, treatment efficacy, and adverse drug reactions. In spite of the technological advancements, successful application in biomedical research would be limited without access to suitable sample collections. To facilitate exploratory genetics research, we have assembled a DNA resource from a large number of subjects participating in multiple studies throughout the world. This growing resource was initially genotyped with a commercially available genome-wide 500,000 single-nucleotide polymorphism panel. This project includes nearly 6,000 subjects of African-American, East Asian, South Asian, Mexican, and European origin. Seven informative axes of variation identified via principal-component analysis (PCA) of these data confirm the overall integrity of the data and highlight important features of the genetic structure of diverse populations. The potential value of such extensively genotyped collections is illustrated by selection of genetically matched population controls in a genome-wide analysis of abacavir-associated hypersensitivity reaction. We find that matching based on country of origin, identity-by-state distance, and multidimensional PCA perform similarly well in controlling the type I error rate. The genotype and demographic data from this reference sample are freely available through the NCBI database of Genotypes and Phenotypes (dbGaP).
The American Journal of Human Genetics 09/2008; 83(3):347-58. DOI:10.1016/j.ajhg.2008.08.005 · 10.99 Impact Factor
ABSTRACT: What evolutionary forces shape genes that contribute to the risk of human disease? Do similar selective pressures act on alleles that underlie simple versus complex disorders [1-3]? Answers to these questions will shed light onto the origin of human disorders and help to predict the population frequencies of alleles that contribute to disease risk, with important implications for the efficient design of mapping studies [5-7]. As a first step toward addressing these questions, we created a hand-curated version of the Mendelian Inheritance in Man database (OMIM). We then examined selective pressures on Mendelian-disease genes, genes that contribute to complex-disease risk, and genes known to be essential in mouse by analyzing patterns of human polymorphism and of divergence between human and rhesus macaque. We found that Mendelian-disease genes appear to be under widespread purifying selection, especially when the disease mutations are dominant (rather than recessive). In contrast, the class of genes that influence complex-disease risk shows little sign of evolutionary conservation, possibly because this category includes targets of both purifying and positive selection.
Current Biology 07/2008; 18(12):883-9. DOI:10.1016/j.cub.2008.04.074 · 9.92 Impact Factor
ABSTRACT: Quantifying the distribution of fitness effects among newly arising mutations in the human genome is key to resolving important debates in medical and evolutionary genetics. Here, we present a method for inferring this distribution using Single Nucleotide Polymorphism (SNP) data from a population with non-stationary demographic history (such as that of modern humans). Application of our method to 47,576 coding SNPs found by direct resequencing of 11,404 protein-coding genes in 35 individuals (20 European Americans and 15 African Americans) allows us to assess the relative contribution of demographic and selective effects to patterning amino acid variation in the human genome. We find evidence of an ancient population expansion in the sample with African ancestry and a relatively recent bottleneck in the sample with European ancestry. After accounting for these demographic effects, we find strong evidence for great variability in the selective effects of new amino acid-replacing mutations. In both populations, the patterns of variation are consistent with a leptokurtic distribution of selection coefficients (e.g., gamma or log-normal) peaked near neutrality. Specifically, we predict 27-29% of amino acid-changing (nonsynonymous) mutations are neutral or nearly neutral (|s|<0.01%), 30-42% are moderately deleterious (0.01%<|s|<1%), and nearly all the remainder are highly deleterious or lethal (|s|>1%). Our results are consistent with 10-20% of amino acid differences between humans and chimpanzees having been fixed by positive selection with the remainder of differences being neutral or nearly neutral. Our analysis also predicts that many of the alleles identified via whole-genome association mapping may be selectively neutral or (formerly) positively selected, implying that deleterious genetic variation affecting disease phenotype may be missed by this widely used approach for mapping genes underlying complex traits.
ABSTRACT: Quantifying the number of deleterious mutations per diploid human genome is of crucial concern to both evolutionary and medical geneticists. Here we combine genome-wide polymorphism data from PCR-based exon resequencing, comparative genomic data across mammalian species, and protein structure predictions to estimate the number of functionally consequential single-nucleotide polymorphisms (SNPs) carried by each of 15 African American (AA) and 20 European American (EA) individuals. We find that AAs show significantly higher levels of nucleotide heterozygosity than do EAs for all categories of functional SNPs considered, including synonymous, non-synonymous, predicted 'benign', predicted 'possibly damaging' and predicted 'probably damaging' SNPs. This result is wholly consistent with previous work showing higher overall levels of nucleotide variation in African populations than in Europeans. EA individuals, in contrast, have significantly more genotypes homozygous for the derived allele at synonymous and non-synonymous SNPs and for the damaging allele at 'probably damaging' SNPs than AAs do. For SNPs segregating only in one population or the other, the proportion of non-synonymous SNPs is significantly higher in the EA sample (55.4%) than in the AA sample (47.0%; P < 2.3 × 10^-37). We observe a similar proportional excess of SNPs that are inferred to be 'probably damaging' (15.9% in EA; 12.1% in AA; P < 3.3 × 10^-11). Using extensive simulations, we show that this excess proportion of segregating damaging alleles in Europeans is probably a consequence of a bottleneck that Europeans experienced at about the time of the migration out of Africa.
ABSTRACT: To understand the demographic history of rhesus macaques (Macaca mulatta) and document the extent of linkage disequilibrium (LD) in the genome, we partially resequenced five Encyclopedia of DNA Elements regions in 9 Chinese and 38 captive-born Indian rhesus macaques. Population genetic analyses of the 1467 single-nucleotide polymorphisms discovered suggest that the two populations separated about 162,000 years ago, with the Chinese population tripling in size since then and the Indian population eventually shrinking by a factor of four. Using coalescent simulations, we confirmed that these inferred demographic events explain a much faster decay of LD in Chinese (r² ≈ 0.15 at 10 kilobases) versus Indian (r² ≈ 0.52 at 10 kilobases) macaque populations.
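The LD statistic quoted above, the squared correlation r² between alleles at two sites, can be computed from phased haplotypes as in this minimal sketch. It assumes biallelic sites coded 0/1 and phased data; the function name and the example haplotypes are hypothetical and unrelated to the macaque dataset.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic sites from phased haplotypes.

    haplotypes: list of (allele_at_site_A, allele_at_site_B), alleles coded 0/1.
    r^2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)), with D = pAB - pA * pB.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


# Eight hypothetical haplotypes: the two sites are correlated but not perfectly.
example = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (0, 0)]
r2 = ld_r_squared(example)
print(f"r^2 = {r2:.2f}")
```

Averaging this quantity over pairs of sites binned by physical distance gives an LD decay curve of the kind the abstract summarizes at 10 kilobases.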
ABSTRACT: The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.
ABSTRACT: Different classes of haplotype block algorithms exist, and the ideal way to assess their performance would be to comprehensively re-sequence a large genomic region in a large population. Such data sets are expensive to collect. As an alternative, we performed coalescent simulations to generate haplotypes with a high marker density and compared block partitioning results from diversity based, LD based, and information theoretic algorithms under different values of SNP density and allele frequency.
We simulated 1000 haplotypes using the standard coalescent for three world populations (European, African American, and East Asian) and applied three classes of block partitioning algorithms: diversity based, LD based, and information theoretic. We assessed algorithm differences in number, size, and coverage of blocks inferred under different conditions of SNP density, allele frequency, and sample size. Each algorithm inferred blocks differing in number, size, and coverage under different density and allele frequency conditions. Different partitions had few if any matching block boundaries. However, they still overlapped, and a high percentage of the total chromosomal region was common to all methods. This percentage was generally higher with a higher density of SNPs and when rarer markers were included.
A gold standard definition of a haplotype block is difficult to achieve, but collecting haplotypes covered with a high density of SNPs, partitioning them with a variety of block algorithms, and identifying regions common to all methods may be the best way to identify genomic regions that harbor SNP variants that cause disease.
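Identifying the regions common to all block partitions, as suggested above, amounts to intersecting interval lists. The sketch below assumes each method reports its blocks as sorted, non-overlapping half-open (start, end) intervals; the function names, the three example partitions, and the region length of 100 are hypothetical and do not come from the study.

```python
def intersect_two(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def common_regions(partitions):
    """Regions covered by a block in every partition."""
    common = partitions[0]
    for blocks in partitions[1:]:
        common = intersect_two(common, blocks)
    return common


# Hypothetical block calls from three classes of algorithm on a region of length 100.
diversity_blocks = [(0, 40), (50, 100)]
ld_blocks = [(10, 60), (70, 95)]
info_blocks = [(0, 30), (35, 90)]

shared = common_regions([diversity_blocks, ld_blocks, info_blocks])
shared_fraction = sum(end - start for start, end in shared) / 100
print(shared, shared_fraction)
```

None of the three partitions here share a block boundary, yet more than half of the region falls inside a block under every method, which mirrors the qualitative finding in the abstract.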