ABSTRACT: Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas: 70% of the European ancestry in today's African Americans dates back to European gene flow happening only 7-8 generations ago.
The American Journal of Human Genetics 10/2012; 91(4):660-71. · 11.20 Impact Factor
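The heterozygosity figures quoted in the abstract above can be checked with a little arithmetic. The sketch below is only an illustration: the function names and the example counts are assumptions, not values or code from the study. It converts a count of heterozygous sites into sites per kilobase and computes the fold range between the two reported extremes.

```python
from typing import Sequence


def het_per_kb(het_sites: int, callable_bases: int) -> float:
    """Heterozygous sites per kilobase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return 1000.0 * het_sites / callable_bases


def fold_range(values: Sequence[float]) -> float:
    """Ratio of the largest to the smallest value."""
    return max(values) / min(values)


# Hypothetical genome: 2.7 million heterozygous sites in 2.7 Gb of callable sequence.
print(het_per_kb(2_700_000, 2_700_000_000))

# The two extremes reported in the abstract: 1.0 and 0.57 sites per kilobase.
print(round(fold_range([1.0, 0.57]), 2))
```

The ratio 1.0 / 0.57 is about 1.75, which matches the 1.75-fold range stated in the abstract.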
ABSTRACT: The evolutionary forces responsible for intron loss are unresolved. Whereas research has focused on protein-coding genes, here we analyze noncoding small nucleolar RNA (snoRNA) genes in which introns, rather than exons, are typically the functional elements. Within the yeast lineage exemplified by the human pathogen Candida albicans, we find, through deep RNA sequencing and genome-wide annotation of splice junctions, extreme compaction and loss of associated exons, but retention of snoRNAs within introns. In the Saccharomyces yeast lineage, however, we find it is the introns that have been lost through widespread degeneration of splicing signals. This intron loss, perhaps facilitated by innovations in snoRNA processing, is distinct from that observed in protein-coding genes with respect to both mechanism and evolutionary timing.
ABSTRACT: The differentiation of cells into distinct cell types, each of which is heritable for many generations, underlies many biological phenomena. White and opaque cells of the fungal pathogen Candida albicans are two such heritable cell types, each thought to be adapted to unique niches within their human host. To systematically investigate their differences, we performed strand-specific, massively-parallel sequencing of RNA from C. albicans white and opaque cells. With these data we first annotated the C. albicans transcriptome, finding hundreds of novel differentially-expressed transcripts. Using the new annotation, we compared differences in transcript abundance between the two cell types with the genomic regions bound by a master regulator of the white-opaque switch (Wor1). We found that the revised transcriptional landscape considerably alters our understanding of the circuit governing differentiation. In particular, we can now resolve the poor concordance between binding of a master regulator and the differential expression of adjacent genes, a discrepancy observed in several other studies of cell differentiation. More than one third of the Wor1-bound differentially-expressed transcripts were previously unannotated, which explains the formerly puzzling presence of Wor1 at these positions along the genome. Many of these newly identified Wor1-regulated genes are non-coding and transcribed antisense to coding transcripts. We also find that 5' and 3' UTRs of mRNAs in the circuit are unusually long and that 5' UTRs often differ in length between cell types, suggesting UTRs encode important regulatory information and that use of alternative promoters is widespread. Further analysis revealed that the revised Wor1 circuit bears several striking similarities to the Oct4 circuit that specifies the pluripotency of mammalian embryonic stem cells. Additional characteristics shared with the Oct4 circuit suggest a set of general hallmarks characteristic of heritable differentiation states in eukaryotes.
ABSTRACT: Background/Aims: There is a growing interest regarding the effect of differential misclassification on power and type I error rate in genome-wide association studies. We present an extension of a previously published test statistic: the likelihood ratio test allowing for errors (LRT(AE)). This test uses double-sample information on a subset of individuals to increase power for genetic association in the presence of nondifferential misclassification. Methods: We extend the original LRT(AE) by allowing for differential genotype misclassification between case and control populations. We label this new statistic as LRT(DAME). We test the performance of this statistic with data simulated under differential misclassification specifications and two different types of genetic models: null and power. For simulations using the null model, we specify that there is no difference between case and control genotype frequencies before the introduction of errors. For simulations under power, we consider three modes of inheritance: dominant, multiplicative, and recessive. Results: We show that the LRT(DAME), with p values computed using permutation, maintains a correct type I error rate under the null model after the introduction of differential genotyping errors. Also, we find that as little as 10 to 15% of double-sampled genotype data is needed to achieve this effect. Aside from a few situations (particularly recessive mode of inheritance simulations) the LRT(DAME) version that calculates p values through permutation requires 15 to 20% double sampling to maintain an 80% power for a 0.05 significance level and approximately 20% double sampling for a 0.01 significance level.
Human Heredity 07/2010; 70(2):102-108. · 1.57 Impact Factor
ABSTRACT: Due to growing throughput and shrinking cost, massively parallel sequencing is rapidly becoming an attractive alternative to microarrays for the genome-wide study of gene expression and copy number alterations in primary tumors. The sequencing of transcripts (RNA-Seq) should offer several advantages over microarray-based methods, including the ability to detect somatic mutations and accurately measure allele-specific expression. To investigate these advantages we have applied a novel, strand-specific RNA-Seq method to tumors and matched normal tissue from three patients with oral squamous cell carcinomas. Additionally, to better understand the genomic determinants of the gene expression changes observed, we have sequenced the tumor and normal genomes of one of these patients. We demonstrate here that our RNA-Seq method accurately measures allelic imbalance and that measurement on the genome-wide scale yields novel insights into cancer etiology. As expected, the set of genes differentially expressed in the tumors is enriched for cell adhesion and differentiation functions, but, unexpectedly, the set of allelically imbalanced genes is also enriched for these same cancer-related functions. By comparing the transcriptomic perturbations observed in one patient to his underlying normal and tumor genomes, we find that allelic imbalance in the tumor is associated with copy number mutations and that copy number mutations are, in turn, strongly associated with changes in transcript abundance. These results support a model in which allele-specific deletions and duplications drive allele-specific changes in gene expression in the developing tumor.
PLoS ONE 01/2010; 5(2):e9317. · 3.73 Impact Factor
ABSTRACT: We introduce a simple, broadly applicable method for obtaining estimates of nucleotide diversity from genomic shotgun sequencing data. The method takes into account the special nature of these data: random sampling of genomic segments from one or more individuals and a relatively high error rate for individual reads. Applying this method to data from the Celera human genome sequencing and SNP discovery project, we obtain estimates of nucleotide diversity in windows spanning the human genome and show that the diversity to divergence ratio is reduced in regions of low recombination. Furthermore, we show that the elevated diversity in telomeric regions is mainly due to elevated mutation rates and not due to decreased levels of background selection. However, we find indications that telomeres as well as centromeres experience greater impact from natural selection than intrachromosomal regions. Finally, we identify a number of genomic regions with increased or reduced diversity compared with the local level of human-chimpanzee divergence and the local recombination rate.
Genome Research 08/2008; 18(7):1020-9. · 14.40 Impact Factor
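For readers unfamiliar with the quantity estimated in the abstract above, the following is a minimal sketch of the textbook per-site nucleotide diversity (mean pairwise differences) computed within one window. It is not the shotgun-specific estimator the abstract describes: it assumes error-free allele counts at each variable site, and the function name and input layout are illustrative assumptions.

```python
from typing import Iterable, Tuple


def nucleotide_diversity(sites: Iterable[Tuple[int, int]], window_length: int) -> float:
    """Mean pairwise differences per base in a window.

    sites: one (alt_count, n_chromosomes) pair per variable site in the window.
    window_length: number of bases in the window, including invariant ones.
    """
    if window_length <= 0:
        raise ValueError("window_length must be positive")
    total = 0.0
    for alt, n in sites:
        if n < 2:
            continue  # a single chromosome carries no pairwise information
        # Probability that two chromosomes drawn without replacement differ here.
        total += 2.0 * alt * (n - alt) / (n * (n - 1))
    return total / window_length


# Hypothetical 2-base window with one site where 2 of 4 chromosomes carry the alternate allele.
print(nucleotide_diversity([(2, 4)], window_length=2))
```

A real shotgun analysis would also have to model uneven read depth and per-read error, which is the point of the method in the abstract.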
ABSTRACT: When a selective sweep occurs in the chromosomal region around a target gene in two populations that have recently separated, it produces three dramatic genomic consequences: 1) decreased multi-locus heterozygosity in the region; 2) elevated or diminished genetic divergence (F(ST)) of multiple polymorphic variants adjacent to the selected locus between the divergent populations, due to the alternative fixation of alleles; and 3) a consequent regional increase in the variance of F(ST) (S(2)F(ST)) for the same clustered variants, due to the increased alternative fixation of alleles in the loci surrounding the selection target. In the first part of our study, to search for potential targets of directional selection, we developed and validated a resampling-based computational approach; we then scanned an array of 31 different-sized moving windows of SNP variants (5-65 SNPs) across the human genome in a set of European and African American population samples with 183,997 SNP loci after correcting for the recombination rate variation. The analysis revealed 180 regions of recent selection with very strong evidence in either population or both. In the second part of our study, we compared the newly discovered putative regions to those sites previously postulated in the literature, using methods based on inspecting patterns of linkage disequilibrium, population divergence and other methodologies. The newly found regions were cross-validated with those found in nine other studies that have searched for selection signals. Our study was replicated especially well in those regions confirmed by three or more studies. These validated regions were independently verified, using a combination of different methods and different databases in other studies, and should include fewer false positives. The main strength of our analysis method compared to others is that it does not require dense genotyping and therefore can be used with data from population-based genome SNP scans from smaller studies of humans or other species.
PLoS ONE 02/2008; 3(3):e1712. · 3.73 Impact Factor
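The window statistics named in the abstract above (F(ST) and its variance across clustered SNPs) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a basic heterozygosity-based F(ST) for two equally weighted populations and a plain moving window, and it omits the resampling step and the recombination-rate correction the abstract mentions. All names are hypothetical.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def fst_two_pops(p1: float, p2: float) -> float:
    """F_ST for one biallelic SNP from allele frequencies in two populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)                       # pooled heterozygosity
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0                                               # monomorphic in both populations
    return (h_total - h_within) / h_total


def window_fst_stats(fst: Sequence[float], size: int) -> List[Tuple[float, float]]:
    """(mean, variance) of F_ST in each moving window of `size` consecutive SNPs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [
        (mean(fst[i:i + size]), pvariance(fst[i:i + size]))
        for i in range(len(fst) - size + 1)
    ]


# Hypothetical allele frequencies (population 1, population 2) for three adjacent SNPs.
freqs = [(1.0, 0.0), (0.3, 0.3), (0.0, 1.0)]
per_snp = [fst_two_pops(a, b) for a, b in freqs]
print(window_fst_stats(per_snp, size=2))
```

Windows with unusually high variance relative to a resampled null would be the candidates in a scan of this kind; building that null is outside this sketch.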
ABSTRACT: In isolated populations, 'background' linkage disequilibrium (LD) has been shown to extend over large genetic distances. This and their reduced environmental and genetic heterogeneity have stimulated interest in their potential for association mapping. We compared LD unit map distances with pair-wise measurements of LD in a dense single nucleotide polymorphism (SNP) set.
We genotyped 771 SNPs in an 8 Mb segment of chromosome 22 on 101 individuals from the isolated village of Talana, Sardinia, and compared with outbred European populations.
Heterozygosity was remarkably similar in both populations. In contrast, the extent of LD observed was quite different. The decay of LD with distance is slower in the isolate. The differences in LD map lengths suggest that useful LD extends up to three times farther in the Sardinian population; smaller differences are seen with pairwise LD metrics. While LD map length slightly decreases with average relatedness, cryptic relatedness does not explain the decrease in LD map length. Haplotypes, block boundaries, and patterns of LD are similar in both populations, suggesting a shared distribution of recombination hotspots.
About 15% fewer haplotype tagging SNPs need to be genotyped in the isolate, and possibly 70% fewer if selecting SNPs evenly spaced on the metric LD map.
Human Heredity 02/2008; 65(1):9-22. · 1.57 Impact Factor
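One of the pairwise LD metrics that the abstract above contrasts with LD map distances is r². The sketch below computes the standard r² for two biallelic SNPs from phased haplotypes. It does not construct an LD unit map, and the input encoding (0/1 alleles per haplotype) is an assumption made for illustration.

```python
from typing import Sequence, Tuple


def r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Pairwise LD (r^2) between two biallelic SNPs.

    haplotypes: one (allele_at_snp_a, allele_at_snp_b) pair per phased
    chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                        # disequilibrium coefficient D
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return 0.0                              # at least one SNP is monomorphic
    return d * d / denom


# Hypothetical samples: complete LD, then no LD.
print(r_squared([(0, 0), (1, 1), (0, 0), (1, 1)]))
print(r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

Plotting r² against physical distance for many SNP pairs gives the LD decay curve that the abstract reports as slower in the isolate.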
ABSTRACT: We have completed a second-generation linkage map that incorporates sequence-based positional information. This new map, the Rutgers Map v.2, includes 28,121 polymorphic markers with physical positions corroborated by recombination-based data. Sex-averaged and sex-specific linkage map distances, along with confidence intervals, have been estimated for all map intervals. In addition, a regression-based smoothed map is provided that facilitates interpolation of positions of unmapped markers on this map. With nearly twice as many markers as our first-generation map, the Rutgers Map continues to be a unique and comprehensive resource for obtaining genetic map information for large sets of polymorphic markers.
Genome Research 01/2008; 17(12):1783-6. · 14.40 Impact Factor
ABSTRACT: The design of genetic association studies using single-nucleotide polymorphisms (SNPs) requires the selection of subsets of the variants providing high statistical power at a reasonable cost. SNPs must be selected to maximize the probability that a causative mutation is in linkage disequilibrium (LD) with at least one marker genotyped in the study. The HapMap project performed a genome-wide survey of genetic variation with about a million SNPs typed in four populations, providing a rich resource to inform the design of association studies. A number of strategies have been proposed for the selection of SNPs based on observed LD, including construction of metric LD maps and the selection of haplotype tagging SNPs. Power calculations are important at the study design stage to ensure successful results. Integrating these methods and annotations can be challenging: the algorithms required to implement these methods are complex to deploy, and all the necessary data and annotations are deposited in disparate databases. Here, we present the SNPbrowser Software, a freely available tool to assist in the LD-based selection of markers for association studies. This stand-alone application provides fast query capabilities and swift visualization of SNPs, gene annotations, power, haplotype blocks, and LD map coordinates. Wizards implement several common SNP selection workflows including the selection of optimal subsets of SNPs (e.g. tagging SNPs). Selected SNPs are screened for their conversion potential to either TaqMan SNP Genotyping Assays or the SNPlex Genotyping System, two commercially available genotyping platforms, expediting the set-up of genetic studies with an increased probability of success.
Pacific Symposium on Biocomputing 02/2006
ABSTRACT: We developed the SNPlex Genotyping System to address the need for accurate genotyping data, high sample throughput, study design flexibility, and cost efficiency. The system uses oligonucleotide ligation/polymerase chain reaction and capillary electrophoresis to analyze bi-allelic single nucleotide polymorphism genotypes. It is well suited for single nucleotide polymorphism genotyping efforts in which throughput and cost efficiency are essential. The SNPlex Genotyping System offers a high degree of flexibility and scalability, allowing the selection of custom-defined sets of SNPs for medium- to high-throughput genotyping projects. It is therefore suitable for a broad range of study designs. In this article we describe the principle and applications of the SNPlex Genotyping System, as well as a set of single nucleotide polymorphism selection tools and validated assay resources that accelerate the assay design process. We developed the control pool, an oligonucleotide ligation probe set for training and quality-control purposes, which interrogates 48 SNPs simultaneously. We present performance data from this control pool obtained by testing genomic DNA samples from 44 individuals. In addition, we present data from a study that analyzed 521 SNPs in 92 individuals. Combined, both studies show the SNPlex Genotyping System to have a 99.32% overall call rate, 99.95% precision, and 99.84% concordance with genotypes analyzed by TaqMan probe-based assays. The SNPlex Genotyping System is an efficient and reliable tool for a broad range of genotyping applications, supported by applications for study design, data analysis, and data management.
Journal of Biomolecular Techniques: JBT 01/2006; 16(4):398-406.
ABSTRACT: The extent and patterns of linkage disequilibrium (LD) determine the feasibility of association studies to map genes that underlie complex traits. Here we present a comparison of the patterns of LD across four major human populations (African-American, Caucasian, Chinese, and Japanese) with a high-resolution single-nucleotide polymorphism (SNP) map covering almost the entire length of chromosomes 6, 21, and 22. We constructed metric LD maps formulated such that the units measure the extent of useful LD for association mapping. LD reaches almost twice as far in chromosome 6 as in chromosomes 21 or 22, in agreement with their differences in recombination rates. By all measures used, out-of-Africa populations showed over a third more LD than African-Americans, highlighting the role of the population's demography in shaping the patterns of LD. Despite those differences, the long-range contour of the LD maps is remarkably similar across the four populations, presumably reflecting common localization of recombination hot spots. Our results have practical implications for the rational design and selection of SNPs for disease association studies.
Genome Research 05/2005; 15(4):454-62. · 14.40 Impact Factor
ABSTRACT: Power and sample size calculations are critical parts of any research design for genetic association. We present a method that utilizes haplotype frequency information and average marker-marker linkage disequilibrium on SNPs typed in and around all genes on a chromosome. The test statistic used is the classic likelihood ratio test applied to haplotypes in case/control populations. Haplotype frequencies are computed through specification of genetic model parameters. Power is determined by computation of the test's non-centrality parameter. Power per gene is computed as a weighted average of the power assuming each haplotype is associated with the trait. We apply our method to genotype data from dense SNP maps across three entire chromosomes (6, 21, and 22) for three different human populations (African-American, Caucasian, Chinese), three different models of disease (additive, dominant, and multiplicative) and two trait allele frequencies (rare, common). We perform a regression analysis using these factors, average marker-marker disequilibrium, and the haplotype diversity across the gene region to determine which factors most significantly affect average power for a gene in our data. Also, as a 'proof of principle' calculation, we perform power and sample size calculations for all genes within 100 kb of the PSORS1 locus (chromosome 6) for a previously published association study of psoriasis. Results of our regression analysis indicate that four highly significant factors that determine average power to detect association are: disease model, average marker-marker disequilibrium, haplotype diversity, and the trait allele frequency. These findings may have important implications for the design of well-powered candidate gene association studies. Our power and sample size calculations for the PSORS1 gene appear consistent with published findings, namely that there is substantial power (>0.99) for most genes within 100 kb of the PSORS1 locus at the 0.01 significance level.
Human Heredity 02/2005; 60(1):43-60. · 1.57 Impact Factor
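The abstract above states that power is obtained from the test's non-centrality parameter. A minimal sketch of that relationship follows, restricted to a 1-degree-of-freedom chi-square test so that it needs only the Python standard library. The haplotype likelihood ratio test in the abstract generally has more degrees of freedom, so this is a simplification and not the authors' procedure; the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def power_1df(ncp: float, alpha: float = 0.05) -> float:
    """Power of a 1-df chi-square test with non-centrality parameter `ncp`.

    A 1-df non-central chi-square variable is the square of a normal variable
    with mean sqrt(ncp) and unit variance, so power is a two-tailed normal
    probability.
    """
    if ncp < 0.0:
        raise ValueError("ncp must be non-negative")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    std = NormalDist()
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)     # critical value on the z scale
    shift = sqrt(ncp)
    return (1.0 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)


# With no effect (ncp = 0) the power equals the significance level.
print(power_1df(0.0, alpha=0.05))
# A non-centrality parameter near 7.85 gives roughly 80% power at alpha = 0.05.
print(power_1df(7.849, alpha=0.05))
```

For tests with more than one degree of freedom the same idea applies, but the tail probability comes from the non-central chi-square distribution with the appropriate degrees of freedom.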
ABSTRACT: One of the key issues facing researchers who want to map genes for complex traits is appropriate methodology for statistical power calculations. Most classic methods assume that parameters for the genetic model are known, which is rarely the case for complex traits. Furthermore, few if any methods use empirical data from genes of interest. We present a statistically valid method for performing such power calculations using empirical data and apply it to a candidate gene example for schizophrenia. We also document several advantages of our method, most notably the computation speed with which our power calculations may be performed.
Clinical Neuroscience Research 01/2005; 5(1):31-35. · 0.80 Impact Factor
ABSTRACT: It is widely hoped that the study of sequence variation in the human genome will provide a means of elucidating the genetic component of complex diseases and variable drug responses. A major stumbling block to the successful design and execution of genome-wide disease association studies using single-nucleotide polymorphisms (SNPs) and linkage disequilibrium is the enormous number of SNPs in the human genome. This results in unacceptably high costs for exhaustive genotyping and presents a challenging problem of statistical inference. Here, we present a new method for optimally selecting minimum informative subsets of SNPs, also known as "tagging" SNPs, that is efficient for genome-wide selection. We contrast this method to published methods including haplotype block tagging, that is, grouping SNPs into segments of low haplotype diversity and typing a subset of the SNPs that can discriminate all common haplotypes within the blocks. Because our method does not rely on a predefined haplotype block structure and makes use of the weaker correlations that occur across neighboring blocks, it can be effectively applied across chromosomal regions with both high and low local linkage disequilibrium. We show that the number of tagging SNPs selected is substantially smaller than previously reported using block-based approaches and that selecting tagging SNPs optimally can result in a two- to threefold savings over selecting random SNPs.
Genome Research 09/2004; 14(8):1633-40. · 14.40 Impact Factor
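To make the idea of block-free tagging in the abstract above concrete, here is a sketch of a common greedy heuristic: repeatedly pick the SNP that captures the most not-yet-covered SNPs at a pairwise r² threshold. This is a swapped-in illustration, not the optimal selection method the abstract presents; the threshold of 0.8, the function name, and the example matrix are all assumptions.

```python
from typing import List, Sequence


def greedy_tag_snps(r2: Sequence[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Greedy tag selection on a symmetric pairwise r^2 matrix.

    Returns indices of tag SNPs such that every SNP is either a tag or has
    r^2 >= threshold with at least one tag.
    """
    n = len(r2)
    # covers[i] = the SNPs that SNP i would capture if chosen as a tag.
    covers = [
        {j for j in range(n) if j == i or r2[i][j] >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    tags: List[int] = []
    while uncovered:
        # Choose the SNP capturing the most uncovered SNPs; ties go to the lowest index.
        best = max(range(n), key=lambda i: (len(covers[i] & uncovered), -i))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Hypothetical matrix: SNPs 0-2 are in strong mutual LD, SNP 3 stands alone.
example = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.90, 0.05],
    [0.85, 0.90, 1.00, 0.20],
    [0.10, 0.05, 0.20, 1.00],
]
print(greedy_tag_snps(example))
```

Because the candidates are not confined to predefined blocks, a tag can capture correlated SNPs wherever they lie in the matrix, which reflects the block-free property the abstract emphasizes. A greedy pass is not guaranteed to find the smallest possible tag set.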
ABSTRACT: Admixture mapping (also known as "mapping by admixture linkage disequilibrium," or MALD) provides a way of localizing genes that cause disease, in admixed ethnic groups such as African Americans, with approximately 100 times fewer markers than are required for whole-genome haplotype scans. However, it has not been possible to perform powerful scans with admixture mapping because the method requires a dense map of validated markers known to have large frequency differences between Europeans and Africans. To create such a map, we screened through databases containing approximately 450000 single-nucleotide polymorphisms (SNPs) for which frequencies had been estimated in African and European population samples. We experimentally confirmed the frequencies of the most promising SNPs in a multiethnic panel of unrelated samples and identified 3011 as a MALD map (1.2 cM average spacing). We estimate that this map is approximately 70% informative in differentiating African versus European origins of chromosomal segments. This map provides a practical and powerful tool, which is freely available without restriction, for screening for disease genes in African American patient cohorts. The map is especially appropriate for those diseases that differ in incidence between the parental African and European populations.
The American Journal of Human Genetics 06/2004; 74(5):1001-13. · 11.20 Impact Factor
ABSTRACT: We examine the current effort to develop a haplotype map of the human genome and suggest an alternative approach which represents linkage disequilibrium patterns in the form of a metric LD map. LD maps have some of the useful properties of genetic linkage maps but have a much higher resolution which is optimal for SNP-based association mapping of common diseases. The studies that have been undertaken to date suggest that LD and recombination maps show some close similarities because of abundant, narrow, recombination hot spots. These hot spots are co-localised in all populations but, unlike linkage maps, LD maps differ in scale for different populations because of differences in population history. The prospects for developing optimized panels of SNPs and the use of linkage disequilibrium maps in disease gene localisation are assessed in the light of recent evidence.
Human Heredity 02/2004; 58(1):2-9. · 1.57 Impact Factor