ABSTRACT: Long genomic segments that are nearly identical between a pair of individuals
and are inherited from a recent common ancestor without recombination are
called identical-by-descent (IBD) segments. IBD sharing has numerous
applications in genetics, from demographic inference to phasing, imputation,
pedigree reconstruction, and disease mapping. Here, we provide a theoretical
analysis of IBD sharing under Markovian approximations of the coalescent with
recombination. We describe a general framework for the IBD process along the
chromosome under the Markovian models (SMC/SMC'), as well as introduce and
justify a new model, which we term the renewal approximation, under which
lengths of successive segments are independent. Then, considering the
infinite-chromosome limit of the IBD process, we recover previous results (for
SMC) and derive new results (for SMC') for the average fraction of the
chromosome found in long shared segments and the average number of such
segments. A number of new results for tree heights in SMC' are proved as
lemmas. We then use renewal theory to derive an expression (in Laplace space)
for the distribution of the number of shared segments and demonstrate
implications for demographic inference. We also use renewal theory to compute
the distribution of the fraction of the chromosome shared. While the expression
is again in Laplace space, we could invert the first two moments and compare a
number of approximations. Finally, we generalize all results to populations
with variable historical effective size.
ABSTRACT: Schizophrenia and bipolar disorder are major psychiatric disorders with high heritability and overlapping genetic variance. Here we perform a genome-wide association study in an ethnically homogeneous cohort of 904 schizophrenia cases and 1,640 controls drawn from the Ashkenazi Jewish population. We identify a novel genome-wide significant risk locus at chromosome 4q26, demonstrating the potential advantages of this founder population for gene discovery. The top single-nucleotide polymorphism (SNP; rs11098403) demonstrates consistent effects across 11 replication and extension cohorts, totalling 23,191 samples across multiple ethnicities, regardless of diagnosis (schizophrenia or bipolar disorder), resulting in Pmeta=9.49 × 10⁻¹² (odds ratio (OR)=1.13, 95% confidence interval (CI): 1.08-1.17) across both disorders and Pmeta=2.67 × 10⁻⁸ (OR=1.15, 95% CI: 1.08-1.21) for schizophrenia alone. In addition, this intergenic SNP significantly predicts postmortem cerebellar gene expression of NDST3, which encodes an enzyme critical to heparan sulphate metabolism. Heparan sulphate binding is critical to neurite outgrowth, axon formation and synaptic processes thought to be aberrant in these disorders.
ABSTRACT: In recent years many genetic variants (eSNPs) have been reported as associated with expression of transcripts in trans. However, the causal variants and regulatory mechanisms through which they act remain mostly unknown. In this paper we follow two kinds of usual suspects: SNPs that alter coding regions or transcription factors, identifiable by sequencing data with transcriptional profiles in the same cohort. We show these interpretable genomic regions are enriched for eSNP association signals, thereby naturally defining source-target gene pairs. We map these pairs onto a protein-protein interaction (PPI) network and study their topological properties.
For exonic eSNP sources, we report source-target proximity and high target degree within the PPI network. These pairs are more likely to be co-expressed and the eSNPs tend to have a cis effect, modulating the expression of the source gene. In contrast, transcription factor source-target pairs are not observed to have such properties, but instead a transcription factor source tends to assemble into units of defined functional roles along with its gene targets, and to share with them the same functional cluster of the PPI network.
Our results suggest two modes of trans regulation: transcription factor variation frequently acts via a modular regulation mechanism, with multiple targets that share a function with the transcription factor source. Notwithstanding, exon variation often acts by a local cis effect, delineating shorter paths of interacting proteins across functional clusters of the PPI network.
ABSTRACT: Pairs of individuals from a study cohort will often share long-range haplotypes identical-by-descent. Such haplotypes are transmitted from common ancestors that lived tens to hundreds of generations in the past, and they can now be efficiently detected in high-resolution genomic datasets, providing a novel source of information in several domains of genetic analysis. Recently, haplotype sharing distributions were studied in the context of demographic inference, and they were used to reconstruct recent demographic events in several populations. We here extend the framework to handle demographic models that contain multiple demes interacting through migration. We extensively test our formulation in several demographic scenarios, compare our approach with methods based on ancestry deconvolution and use this method to analyze Masai samples from the HapMap 3 dataset.
DoRIS, a Java implementation of the proposed method, and its source code are freely available at http://www.cs.columbia.edu/~pier/doris.
ABSTRACT: Congenital heart disease (CHD) is the most frequent birth defect, affecting 0.8% of live births. Many cases occur sporadically and impair reproductive fitness, suggesting a role for de novo mutations. Here we compare the incidence of de novo mutations in 362 severe CHD cases and 264 controls by analysing exome sequencing of parent-offspring trios. CHD cases show a significant excess of protein-altering de novo mutations in genes expressed in the developing heart, with an odds ratio of 7.5 for damaging (premature termination, frameshift, splice site) mutations. Similar odds ratios are seen across the main classes of severe CHD. We find a marked excess of de novo mutations in genes involved in the production, removal or reading of histone 3 lysine 4 (H3K4) methylation, or ubiquitination of H2BK120, which is required for H3K4 methylation. There are also two de novo mutations in SMAD2, which regulates H3K27 methylation in the embryonic left-right organizer. The combination of both activating (H3K4 methylation) and inactivating (H3K27 methylation) chromatin marks characterizes 'poised' promoters and enhancers, which regulate expression of key developmental genes. These findings implicate de novo point mutations in several hundreds of genes that collectively contribute to approximately 10% of severe CHD.
ABSTRACT: The Ashkenazi Jewish population has a several-fold higher prevalence of Crohn's disease (CD) compared with non-Jewish European ancestry populations and has a unique genetic history. Haplotype association is critical to CD etiology in this population, most notably at NOD2, in which three causal, uncommon and conditionally independent NOD2 variants reside on a shared background haplotype. We present an analysis of extended haplotypes that showed significantly greater association to CD in the Ashkenazi Jewish population compared with a non-Jewish population (145 haplotypes and no haplotypes with P-value <10⁻³, respectively). Two haplotype regions, one each on chromosomes 16 and 21, conferred increased disease risk within established CD loci. We performed exome sequencing of 55 Ashkenazi Jewish individuals and follow-up genotyping focused on variants in these two regions. We observed Ashkenazi Jewish-specific nominal association at R755C in TRPM2 on chromosome 21. Within the chromosome 16 region, R642S of HEATR3 and rs9922362 of BRD7 showed genome-wide significance. Expression studies of HEATR3 demonstrated a positive role in NOD2-mediated NF-κB signaling. The BRD7 signal showed conditional dependence with only the downstream rare CD-causal variants in NOD2, but not with the background haplotype; this elaborates NOD2 as a key illustration of synthetic association.
Genes and Immunity advance online publication, 25 April 2013; doi:10.1038/gene.2013.19.
ABSTRACT: Germline determinants of gene expression in tumors are infrequently studied due to the complexity of transcript regulation caused by somatically acquired alterations. We performed expression quantitative trait locus (eQTL)-based analyses using the multi-level information provided in The Cancer Genome Atlas (TCGA). Of the factors we measured, cis-acting eQTLs accounted for 1.2% of the total variation of tumor gene expression, while somatic copy-number alteration and CpG methylation accounted for 7.3% and 3.3%, respectively. eQTL analyses of 15 previously reported breast cancer risk loci resulted in the discovery of three variants that are significantly associated with transcript levels (false discovery rate [FDR] < 0.1). Our trans-based analysis identified three additional risk loci that act through ESR1, MYC, and KLF4. These findings provide a more comprehensive picture of gene expression determinants in breast cancer as well as insights into the underlying biology of breast cancer risk loci.
ABSTRACT: Human genetics recently transitioned from GWAS to studies based on NGS data. For GWAS, small effects dictated large sample sizes, typically made possible through meta-analysis by exchanging summary statistics across consortia. NGS studies groupwise-test for association of multiple potentially-causal alleles along each gene. They are subject to similar power constraints and therefore likely to resort to meta-analysis as well. The problem arises when considering privacy of the genetic information during the data-exchange process. Many scoring schemes for NGS association rely on the frequency of each variant, thus requiring the exchange of the identity of the sequenced variant. Such variants are often rare, so exchanging them can reveal the identity of their carriers and jeopardize privacy. We have thus developed MetaSeq, a protocol for meta-analysis of genome-wide sequencing data by multiple collaborating parties, scoring association for rare variants pooled per gene across all parties. We tackle the challenge of tallying frequency counts of rare, sequenced alleles, for meta-analysis of sequencing data without disclosing the allele identity and counts, thereby protecting sample identity. This apparently paradoxical exchange of information is achieved through cryptographic means. The key idea is that parties encrypt the identity of genes and variants. When they transfer information about frequency counts in cases and controls, the exchanged data does not convey the identity of a mutation and therefore does not expose carrier identity. The exchange relies on a third party, trusted to follow the protocol although not trusted to learn about the raw data. We show applicability of this method to publicly available exome-sequencing data from multiple studies, simulating phenotypic information for powerful meta-analysis. The MetaSeq software is publicly available as open source.
Pacific Symposium on Biocomputing 01/2013;
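The blinding idea in the MetaSeq abstract above (parties hide gene and variant identities, and a third party tallies case and control counts per gene) can be illustrated with a minimal Python sketch. This is not the MetaSeq protocol: it assumes the collaborating parties already share a secret key that the third party does not hold, it uses a keyed hash (HMAC-SHA256) in place of whatever encryption the authors use, and it leaves the counts themselves visible to the third party. The function names `blind`, `party_report`, and `third_party_tally` are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def blind(shared_key: bytes, identifier: str) -> str:
    """Return a keyed hash of a gene or variant identifier.

    Assumption: only the collaborating parties know shared_key, so the
    token is opaque to the third party that does the tallying.
    """
    return hmac.new(shared_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def party_report(shared_key: bytes, counts: dict) -> list:
    """Turn one party's {(gene, variant): (case_carriers, control_carriers)}
    table into rows that carry blinded identifiers plus the counts."""
    rows = []
    for (gene, variant), (cases, controls) in counts.items():
        gene_token = blind(shared_key, gene)
        variant_token = blind(shared_key, gene + ":" + variant)
        rows.append((gene_token, variant_token, cases, controls))
    return rows


def third_party_tally(reports: list) -> dict:
    """Pool carrier counts per blinded gene across all parties.

    The tallying party sees only tokens, never gene or variant names.
    """
    totals = defaultdict(lambda: [0, 0])
    for report in reports:
        for gene_token, _variant_token, cases, controls in report:
            totals[gene_token][0] += cases
            totals[gene_token][1] += controls
    return {token: tuple(pair) for token, pair in totals.items()}
```

Each party would call `party_report` locally and send only the resulting rows; the parties can map the pooled totals back to gene names afterwards because they can recompute the tokens with the shared key.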
ABSTRACT: Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of populations that have experienced recent genetic drift. Detection of these IBD segments has recently become feasible, enabling a wide range of applications from phasing and imputation to demographic inference. Here, we study the distribution of IBD sharing in the Wright-Fisher model. Specifically, using coalescent theory, we calculate the variance of the total sharing between random pairs of individuals. We then investigate the cohort-averaged sharing: the average total sharing between one individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing is distributed approximately normally and, surprisingly, the variance of this distribution does not vanish even for large cohorts, implying the existence of "hyper-sharing" individuals. The presence of such individuals has consequences for the design of sequencing studies, since, if they are selected for whole-genome sequencing, a larger fraction of the cohort can be subsequently imputed. We calculate the expected gain in power of imputation by IBD, and subsequently, in power to detect an association, when individuals are either randomly selected or are specifically chosen to be the hyper-sharing individuals. Using our framework, we also compute the variance of an estimator of the population size that is based on the mean IBD sharing and the variance in the sharing between inbred siblings. Finally, we study IBD sharing in an admixture pulse model, and show that in the Ashkenazi Jewish population the admixture fraction is correlated with the cohort-averaged sharing.
ABSTRACT: Data-driven studies of identity by descent (IBD) were recently enabled by high-resolution genomic data from large cohorts and scalable algorithms for IBD detection. Yet, haplotype sharing currently represents an underutilized source of information for population-genetics research. We present analytical results on the relationship between haplotype sharing across purportedly unrelated individuals and a population's demographic history. We express the distribution of IBD sharing across pairs of individuals for segments of arbitrary length as a function of the population's demography, and we derive an inference procedure to reconstruct such demographic history. The accuracy of the proposed reconstruction methodology was extensively tested on simulated data. We applied this methodology to two densely typed data sets: 500 Ashkenazi Jewish (AJ) individuals and 56 Kenyan Maasai (MKK) individuals (HapMap 3 data set). Reconstructing the demographic history of the AJ cohort, we recovered two subsequent population expansions, separated by a severe founder event, consistent with previous analysis of lower-throughput genetic data and historical accounts of AJ history. In the MKK cohort, high levels of cryptic relatedness were detected. The spectrum of IBD sharing is consistent with a demographic model in which several small-sized demes intermix through high migration rates and result in enrichment of shared long-range haplotypes. This scenario of historically structured demographies might explain the unexpected abundance of runs of homozygosity within several populations.
The American Journal of Human Genetics 10/2012; · 11.20 Impact Factor
ABSTRACT: Obesity and diabetes are particularly high in indigenous populations exposed to a Western diet and lifestyle. We describe the prevalence of obesity, diabetes, hyperglycemia, dyslipidemia, and hypertension in one such population, the Micronesian island of Kosrae. Longitudinal screenings for metabolic traits were conducted on adult Kosraens ≥ 20 years of age in 1994 and again in 2001. Data were obtained on 3,106 Kosraens, comprising ~80% of the adult population. Diabetes was diagnosed using World Health Organization guidelines. Prevalences of obesity, hyperglycemia, dyslipidemia, and hypertension were assessed. The overall age-adjusted prevalence of diabetes increased from 14% to 21%. The most significant changes observed in the population were increases in obesity and hyperglycemia, especially among young Kosraens and women. Obesity age-adjusted prevalence increased from 45% to 62%, and hyperglycemia age-adjusted prevalence increased from 19% to 44%. Of note, Kosraens as a group had unusually low high density lipoprotein (HDL) levels with 80% classified as low HDL by NCEP-ATPIII criteria, despite lacking the usually accompanying increase in triglycerides. Comparison to reports from other populations shows that Kosrae experiences one of the highest rates of obesity, hyperglycemia, and low HDL globally while maintaining relatively healthy levels of triglycerides. Our study shows a dramatic increase in obesity and hyperglycemia in Kosrae in just 7 years and forebodes significantly increased health risks for this part of the world.
ABSTRACT: BACKGROUND AND AIMS: Drug-induced liver injury (DILI) is a serious adverse drug event that is suspected to have a heritable component. We carried out a genome-wide association study of 783 individuals of European ancestry who experienced DILI due to more than 200 implicated drugs. METHODS: DILI patients from the US-based Drug-Induced Liver Injury Network (n=401) and three international registries (n=382) were genotyped with the Illumina 1Mduo BeadChip and compared with population controls (n=3001). Potential associations were tested in 307 independent Drug-Induced Liver Injury Network cases. RESULTS: After accounting for known major histocompatibility complex risk alleles for flucloxacillin-DILI and amoxicillin/clavulanate-DILI, there were no genome-wide significant associations, including in the major histocompatibility complex region. Stratification of DILI cases according to clinical phenotypes (injury type, latency, age of onset) also did not show significant associations. An analysis of hepatocellular DILI (n=285) restricted to 193 single-nucleotide polymorphisms previously associated with autoimmune disease showed a trend association for rs7574865, in the vicinity of signal transducer and activator of transcription 4 (STAT4) (P=4.5×10). This association was replicated in an independent cohort of 168 hepatocellular DILI cases (P=0.011 and 1.5×10 for combined cohorts). No significant associations were found with stratification by other clinical or demographic variables. CONCLUSION: Although not significant at the genome-wide level, the association between hepatocellular DILI and STAT4 is consistent with the emerging role of the immune system in DILI. However, the lack of genome-wide association study findings supports the idea that strong genetic determinants of DILI may be largely drug-specific or may reflect rare genetic variations, which were not assessed in our study.
Pharmacogenetics and Genomics 09/2012; 22(11):784-795. · 3.61 Impact Factor
ABSTRACT: North African Jews constitute the second largest Jewish Diaspora group. However, their relatedness to each other; to European,
Middle Eastern, and other Jewish Diaspora groups; and to their former North African non-Jewish neighbors has not been well
defined. Here, genome-wide analysis of five North African Jewish groups (Moroccan, Algerian, Tunisian, Djerban, and Libyan)
and comparison with other Jewish and non-Jewish groups demonstrated distinctive North African Jewish population clusters with
proximity to other Jewish populations and variable degrees of Middle Eastern, European, and North African admixture. Two major
subgroups were identified by principal component, neighbor joining tree, and identity-by-descent analysis—Moroccan/Algerian
and Djerban/Libyan—that varied in their degree of European admixture. These populations showed a high degree of endogamy and
were part of a larger Ashkenazi and Sephardic Jewish group. By principal component analysis, these North African groups were
orthogonal to contemporary populations from North and South Morocco, Western Sahara, Tunisia, Libya, and Egypt. Thus, this
study is compatible with the history of North African Jews—founding during Classical Antiquity with proselytism of local populations,
followed by genetic isolation with the rise of Christianity and then Islam, and admixture following the emigration of Sephardic
Jews during the Inquisition.
Proceedings of the National Academy of Sciences 08/2012; 109(34):13865-13870. · 9.74 Impact Factor
ABSTRACT: Long-range gene-gene interactions are biologically compelling models for disease genetics and can provide insights on relevant mechanisms and pathways. Despite considerable effort, rigorous interaction mapping in humans has remained prohibitively difficult due to computational and statistical limitations. We introduce a novel algorithmic approach to find long-range interactions in common diseases using a standard two-locus test that contrasts the linkage disequilibrium between SNPs in cases and controls. Our ultrafast method overcomes the computational burden of a genome × genome scan by using a novel randomization technique that requires 10× to 100× fewer tests than a brute-force approach. By sampling small groups of cases and highlighting combinations of alleles carried by all individuals in the group, this algorithm drastically trims the universe of combinations while simultaneously guaranteeing that all statistically significant pairs are reported. Our implementation can comprehensively scan large data sets (2K cases, 3K controls, 500K SNPs) to find all candidate pairwise interactions (LD-contrast p < 10⁻¹²) in a few hours, a task that typically took days or weeks to complete by methods running on equivalent desktop computers. We applied our method to the Wellcome Trust bipolar disorder data and found a significant interaction between SNPs located within genes encoding two calcium channel subunits: RYR2 on chr1q43 and CACNA2D4 on chr12p13 (LD-contrast test, p = 4.6 × 10⁻¹⁴). We replicated this pattern of interchromosomal LD between the genes in a separate bipolar data set from the GAIN project, demonstrating an example of gene-gene interaction that plays a role in the largely uncharted genetic landscape of bipolar disorder.
ABSTRACT: Cataloging the association of transcripts to genetic variants in recent years holds the promise for functional dissection of regulatory structure of human transcription. Here, we present a novel approach, which aims at elucidating the joint relationships between transcripts and single-nucleotide polymorphisms (SNPs). This entails detection and analysis of modules of transcripts, each weakly associated to a single genetic variant, together exposing a high-confidence association signal between the module and this 'main' SNP. To explore how transcripts in a module are related to causative loci for that module, we represent such dependencies by a graphical model. We applied our method to the existing data on genetics of gene expression in the liver. The modules are significantly more numerous, larger, and denser than those found in permuted data. Quantification of the confidence in a module as a likelihood score allows us to detect transcripts that do not reach genome-wide significance level. Topological analysis of each module yields novel insights regarding the flow of causality between the main SNP and transcripts. We observe similar annotations of modules from two sources of information: the enrichment of a module in gene subsets and locus annotation of the genetic variants. This and further phenotypic analysis provide a validation for our methodology.
Nucleic Acids Research 03/2012; 40(13):e98. · 8.28 Impact Factor
ABSTRACT: Crohn's disease (CD) is a complex disorder resulting from the interaction of intestinal microbiota with the host immune system in genetically susceptible individuals. The largest meta-analysis of genome-wide association to date identified 71 CD-susceptibility loci in individuals of European ancestry. An important epidemiological feature of CD is that it is 2-4 times more prevalent among individuals of Ashkenazi Jewish (AJ) descent compared to non-Jewish Europeans (NJ). To explore genetic variation associated with CD in AJs, we conducted a genome-wide association study (GWAS) by combining raw genotype data across 10 AJ cohorts consisting of 907 cases and 2,345 controls in the discovery stage, followed up by a replication study in 971 cases and 2,124 controls. We confirmed genome-wide significant associations of 9 known CD loci in AJs and replicated 3 additional loci with strong signal (p<5×10⁻⁶). Novel signals detected among AJs were mapped to chromosomes 5q21.1 (rs7705924, combined p = 2×10⁻⁸; combined odds ratio OR = 1.48), 2p15 (rs6545946, p = 7×10⁻⁹; OR = 1.16), 8q21.11 (rs12677663, p = 2×10⁻⁸; OR = 1.15), 10q26.3 (rs10734105, p = 3×10⁻⁸; OR = 1.27), and 11q12.1 (rs11229030, p = 8×10⁻⁹; OR = 1.15), implicating biologically plausible candidate genes, including RPL7, CPAMD8, PRG2, and PRG3. In all, the 16 replicated and newly discovered loci, in addition to the three coding NOD2 variants, accounted for 11.2% of the total genetic variance for CD risk in the AJ population. This study demonstrates the complementary value of genetic studies in the Ashkenazim.
ABSTRACT: Homologous long segments along the genomes of close or remote relatives that are identical by descent (IBD) from a common ancestor provide clues for recent events in human genetics. We set out to extensively map such IBD segments in large cohorts and investigate their distribution within and across different populations. We report analysis of several data sets, demonstrating that IBD is more common than expected by naïve models of population genetics. We show that the frequency of IBD pairs is population dependent and can be used to cluster individuals into populations, detect a homogeneous subpopulation within a larger cohort, and infer bottleneck events in such a subpopulation. Specifically, we show that Ashkenazi Jewish individuals are all connected through transitive remote family ties evident by sharing of 50 cM IBD to a publicly available data set of fewer than 400 individuals. We further expose regions where long-range haplotypes are shared significantly more often than elsewhere in the genome, observed across multiple populations, and enriched for common long structural variation. These are inconsistent with recent relatedness and suggest ancient common ancestry, with limited recombination between haplotypes.
Molecular Biology and Evolution 02/2012; 29(2):473-86. · 10.35 Impact Factor
ABSTRACT: Relatively small, reproductively isolated populations with reduced genetic diversity may have advantages for genomewide association mapping in disease genetics. The Ashkenazi Jewish population represents a unique population for study based on its recent (< 1,000 years) history of a limited number of founders, population bottlenecks and tradition of marriage within the community. We genotyped more than 1,300 Ashkenazi Jewish healthy volunteers from the Hebrew University Genetic Resource with the Illumina HumanOmni1-Quad platform. Comparison of the genotyping data with that of neighboring European and Asian populations enabled the Ashkenazi Jewish-specific component of the variance to be characterized with respect to disease-relevant alleles and pathways.
Using clustering, principal components, and pairwise genetic distance as converging approaches, we identified an Ashkenazi Jewish-specific genetic signature that differentiated these subjects from both European and Middle Eastern samples. Most notably, gene ontology analysis of the Ashkenazi Jewish genetic signature revealed an enrichment of genes functioning in transepithelial chloride transport, such as CFTR, and in equilibrioception, potentially shedding light on cystic fibrosis, Usher syndrome and other diseases over-represented in the Ashkenazi Jewish population. Results also impact risk profiles for autoimmune and metabolic disorders in this population. Finally, residual intra-Ashkenazi population structure was minimal, primarily determined by class 1 MHC alleles, and not related to host country of origin.
The Ashkenazi Jewish population is of potential utility in disease-mapping studies due to its relative homogeneity and distinct genomic signature. Results suggest that Ashkenazi-associated disease genes may be components of population-specific genomic differences in key functional pathways.
ABSTRACT: The detection of genetic segments that are identical by descent (IBD) in genome-wide association studies has proven successful in pinpointing genetic relatedness between reportedly unrelated individuals and leveraging such regions to shortlist candidate genes. These techniques depend on high-density genotyping arrays, and their effectiveness in diverse sequence data is largely unknown. Due to decreasing costs and increasing effectiveness of high throughput techniques for whole-exome sequencing, an influx of exome sequencing data has become available. Studies using exomes and IBD-detection methods within known pedigrees have shown that IBD can be useful in finding hidden genetic candidates where known relatives are available. We set out to examine the viability of using IBD-detection in whole exome sequencing data in population-wide studies. In doing so, we extend GERMLINE, an IBD-detection method, to exome sequencing data; it finds small slices of matching alleles between pairs of individuals and extends them into full IBD segments. This algorithm allows for efficient population-wide detection in dense data. We apply this algorithm to a cohort of Crohn's Disease cases where whole-exome and GWAS array data are available. We confirm that GWAS-based detected segments are highly accurate and predictive of underlying shared variation. Where segments inferred from GWAS are expected to be of high accuracy, we compare exome-based detection accuracy of multiple detection strategies. We find detection accuracy to be prohibitively low in all assessments, both in terms of segment sensitivity and specificity. Even after isolating relatively long segments beyond 10 cM, exome-based detection continued to offer poor specificity/sensitivity tradeoffs.
We hypothesize that the variable coverage and platform biases of exome capture account for this decreased accuracy and look toward whole genome sequencing data as a higher quality source for detecting population-wide IBD.
PLoS ONE 01/2012; 7(10):e47618. · 3.73 Impact Factor
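The seed-and-extend idea described in the abstract above (find small slices of matching alleles between pairs of individuals, then extend them into longer segments) can be sketched in Python as follows. This is an illustrative simplification, not the GERMLINE implementation: it assumes phased haplotypes encoded as equal-length strings, hashes non-overlapping windows to find exact matches, and extends a seed only by merging adjacent exactly matching windows, with no tolerance for mismatches. The function names `seed_matches` and `extend_segments` and the parameters `window` and `min_sites` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes: dict, window: int) -> dict:
    """Find, for every pair of samples, the windows where their alleles match exactly.

    haplotypes maps a sample name to a string of alleles (assumed phased and
    of equal length). Samples are bucketed by the content of each
    non-overlapping window, so only samples sharing a bucket are paired.
    """
    n_sites = len(next(iter(haplotypes.values())))
    seeds = defaultdict(set)  # (sample_a, sample_b) -> set of window indices
    for w in range(n_sites // window):
        start = w * window
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                seeds[(a, b)].add(w)
    return seeds


def extend_segments(seeds: dict, window: int, min_sites: int) -> list:
    """Merge runs of consecutive matching windows into segments.

    Returns sorted (sample_a, sample_b, start_site, end_site) tuples, with
    end_site exclusive, keeping only segments spanning at least min_sites.
    """
    segments = []
    for (a, b), windows in seeds.items():
        ordered = sorted(windows)
        run_start = prev = ordered[0]
        # The trailing None forces the final run to be closed.
        for w in ordered[1:] + [None]:
            if w is not None and w == prev + 1:
                prev = w
                continue
            start, end = run_start * window, (prev + 1) * window
            if end - start >= min_sites:
                segments.append((a, b, start, end))
            if w is not None:
                run_start = prev = w
    return sorted(segments)
```

Bucketing by window content avoids comparing every pair of samples at every site, which is the property that makes this style of detection practical for whole cohorts; a real detector would additionally work in genetic distance (cM) and allow for genotyping or phasing errors during extension.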