[Show abstract][Hide abstract] ABSTRACT: Background. Establishing health-related causal relationships is a central pursuit in biomedical research. Yet, the interdependent non-linearity of biological systems renders causal dynamics laborious and at times impractical to disentangle. This pursuit is further impeded by the dearth of time series that are sufficiently long to observe and understand recurrent patterns of flux. However, as data generation costs plummet and technologies like wearable devices democratize data collection, we anticipate a coming surge in the availability of biomedically-relevant time series data. Given the life-saving potential of these burgeoning resources, it is critical to invest in the development of open source software tools that are capable of drawing meaningful insight from vast amounts of time series data.
Results. Here we present CauseMap, the first open source implementation of convergent cross mapping (CCM), a method for establishing causality from long time series data (≳25 observations). Compared to existing time series methods, CCM has the advantage of being model-free and robust to unmeasured confounding that could otherwise induce spurious associations. CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable. These reconstructions can be thought of as shadows of the true causal system. If reconstructed shadows can predict points from opposing time series, we can infer that the corresponding variables are providing views of the same causal system, and so are causally related. Unlike traditional metrics, this test can establish the directionality of causation, even in the presence of feedback loops. Furthermore, since CCM can extract causal relationships from time series of, e.g., a single individual, it may be a valuable tool for personalized medicine. We implement CCM in Julia, a high-performance programming language designed for facile technical computing. Our software package, CauseMap, is platform-independent and freely available as an official Julia package.
Conclusions. CauseMap is an efficient implementation of a state-of-the-art algorithm for detecting causality from time series data. We believe this tool will be a valuable resource for biomedical research and personalized medicine.
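The cross-mapping idea described in this abstract can be illustrated with a minimal sketch. The code below is not CauseMap (which is written in Julia) and does not reproduce its API; it is a plain-Python illustration that assumes the commonly described CCM recipe: delay-embed one series, find the E + 1 nearest neighbours of each embedded point, and use exponentially weighted neighbours to estimate the other series. The function names (`delay_embed`, `cross_map_skill`, `logistic_map`) and the default embedding dimension are assumptions made for this example.

```python
import math


def delay_embed(series, E, tau=1):
    """Return delay vectors (s[t], s[t - tau], ..., s[t - (E - 1) * tau])."""
    start = (E - 1) * tau
    return [tuple(series[t - k * tau] for k in range(E))
            for t in range(start, len(series))]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cross_map_skill(source, target, E=2, tau=1):
    """Estimate `target` from the shadow manifold of `source`.

    High skill is read as evidence that `target` influences `source`,
    because the driver leaves a signature in the driven variable.
    """
    manifold = delay_embed(source, E, tau)
    aligned = list(target[(E - 1) * tau:])  # target values at the embedded times
    estimates = []
    for i, point in enumerate(manifold):
        # E + 1 nearest neighbours on the manifold, excluding the point itself.
        nearest = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(manifold) if j != i
        )[:E + 1]
        d1 = max(nearest[0][0], 1e-12)  # guard against a zero distance
        weights = [math.exp(-d / d1) for d, _ in nearest]
        total = sum(weights)
        estimates.append(
            sum(w * aligned[j] for w, (_, j) in zip(weights, nearest)) / total
        )
    return pearson(estimates, aligned)


def logistic_map(n, r=3.8, x0=0.4):
    """Chaotic logistic map, used here only as toy data."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


if __name__ == "__main__":
    # Toy system in which x drives y (coupling strength 0.1 is arbitrary).
    x, y = [0.4], [0.2]
    for _ in range(299):
        x_now, y_now = x[-1], y[-1]
        x.append(x_now * (3.8 - 3.8 * x_now))
        y.append(y_now * (3.5 - 3.5 * y_now - 0.1 * x_now))
    print("y's manifold predicting x:", cross_map_skill(y, x))
    print("x's manifold predicting y:", cross_map_skill(x, y))
```

A full analysis would also check convergence, that is, whether the skill increases as more of the time series is used; this sketch computes the skill at a single library length only.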
[Show abstract][Hide abstract] ABSTRACT: Demographic events and natural selection alter patterns of genetic variation within populations and may play a substantial role in shaping the genetic architecture of complex phenotypes and disease. However, the joint impact of these basic evolutionary forces is often ignored in the assessment of statistical tests of association. Here, we provide a simulation-based framework for generating DNA sequences that incorporates selection and demography with flexible models for simulating phenotypic variation (sfs_coder). This tool also allows the user to perform locus-specific simulations by automatically querying annotated genomic functional elements and genetic maps. We demonstrate the effects of evolutionary forces on patterns of genetic variation by simulating recently inferred models of human selection and demography. We use these simulations to show that the demographic model and locus-specific features, such as the proportion of sites under selection, may have practical implications for estimating the statistical power of sequencing-based rare variant association tests. In particular, for some phenotype models, there may be higher power to detect rare variant associations in African populations compared to non-Africans, but power is considerably reduced in regions of the genome with rampant negative selection. Furthermore, we show that existing methods for simulating large samples based on resampling from a small set of observed haplotypes fail to recapitulate the distribution of rare variants in the presence of rapid population growth (as has been observed in several human populations).
[Show abstract][Hide abstract] ABSTRACT: Genetic simulation programs are used to model data under specified assumptions to facilitate the understanding and study of complex genetic systems. Standardized data sets generated using genetic simulation are essential for the development and application of novel analytical tools in genetic epidemiology studies. With continuing advances in high-throughput genomic technologies and generation and analysis of larger, more complex data sets, there is a need for updating current approaches in genetic simulation modeling. To provide a forum to address current and emerging challenges in this area, the National Cancer Institute (NCI) sponsored a workshop, entitled “Genetic Simulation Tools for Post-Genome Wide Association Studies of Complex Diseases” at the National Institutes of Health (NIH) in Bethesda, Maryland on March 11–12, 2014. The goals of the workshop were to (1) identify opportunities, challenges, and resource needs for the development and application of genetic simulation models; (2) improve the integration of tools for modeling and analysis of simulated data; and (3) foster collaborations to facilitate development and applications of genetic simulation. During the course of the meeting, the group identified challenges and opportunities for the science of simulation, software and methods development, and collaboration. This paper summarizes key discussions at the meeting, and highlights important challenges and opportunities to advance the field of genetic simulation.
[Show abstract][Hide abstract] ABSTRACT: Background: IgE is a key mediator of allergic inflammation, and its levels are frequently increased in patients with allergic disorders.
We sought to identify genetic variants associated with IgE levels in Latinos.
We performed a genome-wide association study and admixture mapping of total IgE levels in 3334 Latinos from the Genes-environments & Admixture in Latino Americans (GALA II) study. Replication was evaluated in 454 Latinos, 1564 European Americans, and 3187 African Americans from independent studies.
We confirmed associations of 6 genes identified by means of previous genome-wide association studies and identified a novel genome-wide significant association of a polymorphism in the zinc finger protein 365 gene (ZNF365) with total IgE levels (rs200076616, P = 2.3 × 10^-8). We next identified 4 admixture mapping peaks (6p21.32-p22.1, 13p22-31, 14q23.2, and 22q13.1) at which local African, European, and/or Native American ancestry was significantly associated with IgE levels. The most significant peak was 6p21.32-p22.1, where Native American ancestry was associated with lower IgE levels (P = 4.95 × 10^-8). All but 22q13.1 were replicated in an independent sample of Latinos, and 2 of the peaks were replicated in African Americans (6p21.32-p22.1 and 14q23.2). Fine mapping of 6p21.32-p22.1 identified 6 genome-wide significant single nucleotide polymorphisms in Latinos, 2 of which replicated in European Americans. Another single nucleotide polymorphism was peak-wide significant within 14q23.2 in African Americans (rs1741099, P = 3.7 × 10^-6) and replicated in non–African American samples (P = .011).
We confirmed genetic associations at 6 genes and identified novel associations within ZNF365, HLA-DQA1, and 14q23.2. Our results highlight the importance of studying diverse multiethnic populations to uncover novel loci associated with total IgE levels.
The Journal of allergy and clinical immunology 12/2014; DOI:10.1016/j.jaci.2014.10.033 · 11.25 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Assessing the statistical significance of an observed 2x2 contingency table
can easily be accomplished using Fisher's exact test (FET). However, if the
cell entries are continuous or represent values inferred from a continuous
parametric model, then FET cannot be applied. Such tables arise frequently in
areas of biostatistical research including population genetics and evolutionary
genomics, where cell entries are estimated by computational methods and result
in cell entries drawn from the non-negative real line R+. Simply rounding cell
entries to conform to the assumptions of FET is an ill-suited approach that we
show creates problems related to both type-I and type-II errors. Pearson's
chi^2 test for independence, while technically applicable, is not often
effective for these tables, as the test has several limiting assumptions that
make application of this method inadvisable in many common instances
(particularly with small cell entries). Here we develop a novel method for
tables with continuous entries, which we term continuous Fisher's Exact Test
(cFET). Through simulations, we show that cFET has a close-to-uniform
distribution of p-values under the null hypothesis of independence, and more
power when applied to tables where the null hypothesis is false (compared to
FET applied to rounded cell entries). We apply cFET to an example from
comparative genomics to confirm an overall increased evolutionary rate among
primates compared to rodents, and identify several genes that show particularly
elevated evolutionary rates in primates. Some of these genes exhibit signatures
of continued positive selection along the human lineage since our divergence
with chimpanzee 5-7 million years ago, as well as ongoing selection in modern
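
For context on why continuous cell entries are a problem, the sketch below implements the standard two-sided Fisher's exact test for a 2x2 table of integer counts. It is not the cFET method of the abstract, whose details are not given here; the function name `fisher_exact_two_sided` and the small tolerance used when comparing probabilities are choices made for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    The margins are held fixed, so the top-left cell follows a
    hypergeometric distribution under the null of independence.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Probability that the top-left cell equals k given the margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

Because the test enumerates tables through binomial coefficients, it is only defined for integer counts; passing a non-integer entry such as 2.5 makes `math.comb` raise a `TypeError`, which is the practical form of the limitation the abstract describes.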
[Show abstract][Hide abstract] ABSTRACT: Haplotype-based scans to detect natural selection are useful to identify recent or ongoing positive selection in genomes.
As both real and simulated genomic data sets grow larger, spanning thousands of samples and millions of markers, there is
a need for a fast and efficient implementation of these scans for general use. Here, we present selscan, an efficient multithreaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score
(iHS), and Cross-population EHH (XPEHH). selscan accepts phased genotypes in multiple formats, including TPED, and performs extremely well on both simulated and real data
and is over an order of magnitude faster than existing implementations. It calculates iHS on chromosome 22 (22,147
loci) across 204 CEU haplotypes in 353 s on one thread (33 s on 16 threads) and calculates XPEHH for the same data relative
to 210 YRI haplotypes in 578 s on one thread (52 s on 16 threads). Source code and binaries (Windows, OSX, and Linux) are
available at https://github.com/szpiech/selscan.
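
As an illustration of the statistic underlying these scans, the sketch below computes Extended Haplotype Homozygosity under its usual definition: among haplotypes carrying a given core allele, the probability that two randomly chosen haplotypes are identical over the interval from the core site to a second site. This is not selscan's implementation or interface; the function name `ehh`, the string encoding of haplotypes, and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, end):
    """EHH between site index `core` and site index `end` (inclusive).

    `haplotypes` is a list of equal-length strings, one character per site.
    Only haplotypes carrying `allele` at the core site are considered.
    """
    carriers = [h for h in haplotypes if h[core] == allele]
    if len(carriers) < 2:
        return 0.0  # homozygosity is undefined for fewer than two carriers
    lo, hi = min(core, end), max(core, end)
    # Group carriers by their sequence over the interval.
    counts = Counter(h[lo:hi + 1] for h in carriers)
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(len(carriers), 2)


if __name__ == "__main__":
    haps = ["0110", "0111", "0110", "1100"]
    for site in range(1, 4):
        print(site, ehh(haps, core=1, allele="1", end=site))
```

EHH starts at 1 at the core site and decays with distance as recombination and mutation break up the shared haplotype; iHS and XPEHH are built by integrating this decay curve, a step this sketch does not cover.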
[Show abstract][Hide abstract] ABSTRACT: Evolutionary forces shape patterns of genetic diversity within populations and contribute to phenotypic variation. In particular, recurrent positive selection has attracted significant interest in both theoretical and empirical studies. However, most existing theoretical models of recurrent positive selection cannot easily incorporate realistic confounding effects such as interference between selected sites, arbitrary selection schemes, and complicated demographic processes. It is possible to quantify the effects of arbitrarily complex evolutionary models by performing forward population genetic simulations, but forward simulations can be computationally prohibitive for large population sizes (>10^5). A common approach for overcoming these computational limitations is rescaling of the most computationally expensive parameters, especially population size. Here, we show that ad hoc approaches to parameter rescaling under the recurrent hitchhiking model do not always provide sufficiently accurate dynamics, potentially skewing patterns of diversity in simulated DNA sequences. We derive an extension of the recurrent hitchhiking model that is appropriate for strong selection in small population sizes, and use it to develop a method for parameter rescaling that provides the best possible computational performance for a given error tolerance. We perform a detailed theoretical analysis of the robustness of rescaling across the parameter space. Finally, we apply our rescaling algorithms to parameters that were previously inferred for Drosophila, and discuss practical considerations such as interference between selected sites.
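The conventional rescaling that this abstract refers to can be sketched as follows. The code assumes the widely used rule of shrinking the population size by a factor while multiplying per-generation rates by the same factor, so that the population-scaled products (N·s, N·μ, N·r) stay constant. It is not the rescaling method derived in the paper; the class `SimParams`, the function `rescale`, and the parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    pop_size: int              # N, number of diploid individuals
    selection_coeff: float     # s, per-generation selection coefficient
    mutation_rate: float       # mu, per site per generation
    recombination_rate: float  # r, per site per generation


def rescale(params, factor):
    """Shrink N by `factor` and scale s, mu, r up to keep N*s, N*mu, N*r fixed."""
    return SimParams(
        pop_size=round(params.pop_size / factor),
        selection_coeff=params.selection_coeff * factor,
        mutation_rate=params.mutation_rate * factor,
        recombination_rate=params.recombination_rate * factor,
    )


if __name__ == "__main__":
    original = SimParams(100_000, 1e-4, 1e-8, 1e-8)
    print(rescale(original, 10))
```

As the abstract notes, this kind of rescaling is not always accurate: a large factor pushes the rescaled selection coefficient into a strong-selection regime in a small population, where the dynamics can differ from those of the original parameters.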
[Show abstract][Hide abstract] ABSTRACT: Proteins are not monolithic entities; rather, they can contain multiple domains that mediate distinct interactions, and their functionality can be regulated through post-translational modifications at multiple distinct sites. Traditionally, network biology has ignored such properties of proteins and has instead examined either the physical interactions of whole proteins or the consequences of removing entire genes. In this Review, we discuss experimental and computational methods to increase the resolution of protein-protein, genetic and drug-gene interaction studies to the domain and residue levels. Such work will be crucial for using interaction networks to connect sequence and structural information, and to understand the biological consequences of disease-associated mutations, which will hopefully lead to more effective therapeutic strategies.
[Show abstract][Hide abstract] ABSTRACT: Ortholog detection (OD) is a critical step for comparative genomic analysis
of protein-coding sequences. There is a range of methods available for OD.
However, relative performance varies by application, stymying attempts to
identify a single best method. In this paper, we present a novel tool, MOSAIC,
which is capable of integrating the entire swath of OD methods. We analyze the
results of applying MOSAIC over four methodologically diverse OD methods.
Relative to component and competing methods, we demonstrate large gains in the
number of detected orthologs while simultaneously maintaining or improving
functional-, phylogenetic-, and sequence identity-based measures of ortholog
[Show abstract][Hide abstract] ABSTRACT: The primary rescue medication to treat acute asthma exacerbation is the short-acting β2-adrenergic receptor agonist; however, there is variation in how well a patient responds to treatment. Although these differences might be due to environmental factors, there is mounting evidence for a genetic contribution to variability in bronchodilator response (BDR).
To identify genetic variation associated with bronchodilator drug response in Latino children with asthma.
We performed a genome-wide association study (GWAS) for BDR in 1782 Latino children with asthma using standard linear regression, adjusting for genetic ancestry and ethnicity, and performed replication studies in an additional 531 Latinos. We also performed admixture mapping across the genome by testing for an association between local European, African, and Native American ancestry and BDR, adjusting for genomic ancestry and ethnicity.
We identified 7 genetic variants associated with BDR at a genome-wide significant threshold (P < 5 × 10^-8), all of which had frequencies of less than 5%. Furthermore, we observed an excess of small P values driven by rare variants (frequency, <5%) and by variants in the proximity of solute carrier (SLC) genes. Admixture mapping identified 5 significant peaks; fine mapping within these peaks identified 2 rare variants in SLC22A15 as being associated with increased BDR in Mexicans. Quantitative PCR and immunohistochemistry identified SLC22A15 as being expressed in the lung and bronchial epithelial cells.
Our results suggest that rare variation contributes to individual differences in response to albuterol in Latinos, notably in SLC genes that include membrane transport proteins involved in the transport of endogenous metabolites and xenobiotics. Resequencing in larger, multiethnic population samples and additional functional studies are required to further understand the role of rare variation in BDR.
The Journal of allergy and clinical immunology 08/2013; 133(2). DOI:10.1016/j.jaci.2013.06.043 · 11.25 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Regions of the genome that are under evolutionary constraint across multiple species have previously been used to identify functional sequences in the human genome. Furthermore, it is known that there is an inverse relationship between evolutionary constraint and the allele frequency of a mutation segregating in human populations, implying a direct relationship between interspecies divergence and fitness in humans. Here we utilise this relationship to test differences in the accumulation of putatively deleterious mutations both between populations and on the individual level.
Using whole genome and exome sequencing data from Phase 1 of the 1000 Genomes Project for 1,092 individuals from 14 worldwide populations, we show that minor allele frequency (MAF) varies as a function of constraint around both coding regions and non-coding sites genome-wide, implying that negative, rather than positive, selection primarily drives the distribution of alleles among individuals via background selection. We find a strong relationship between effective population size and the depth of depression in MAF around the most conserved genes, suggesting that populations with smaller effective size are carrying more deleterious mutations, which also translates into higher genetic load when considering the number of putatively deleterious alleles segregating within each population. Finally, given the extreme richness of the data, we are now able to classify individual genomes by the accumulation of mutations at functional sites using high coverage 1000 Genomes data. Using this approach, we detect differences between 'healthy' individuals within populations in the distributions of putatively deleterious rare alleles they carry.
These findings demonstrate the extent of background selection in the human genome and highlight the role of population history in shaping patterns of diversity between human individuals. Furthermore, we provide a framework for the utility of personal genomic data for the study of genetic fitness and diseases.
[Show abstract][Hide abstract] ABSTRACT: Progressive HIV infection is characterized by dysregulation of the intestinal immune barrier, translocation of immunostimulatory microbial products, and chronic systemic inflammation that is thought to drive progression of disease to AIDS. Elements of this pathologic process persist despite viral suppression during highly active antiretroviral therapy (HAART), and drivers of these phenomena remain poorly understood. Disrupted intestinal immunity can precipitate dysbiosis that induces chronic inflammation in the mucosa and periphery of mice. However, putative microbial drivers of HIV-associated immunopathology versus recovery have not been identified in humans. Using high-resolution bacterial community profiling, we identified a dysbiotic mucosal-adherent community enriched in Proteobacteria and depleted of Bacteroidia members that was associated with markers of mucosal immune disruption, T cell activation, and chronic inflammation in HIV-infected subjects. Furthermore, this dysbiosis was evident among HIV-infected subjects undergoing HAART, and the extent of dysbiosis correlated with activity of the kynurenine pathway of tryptophan catabolism and plasma concentrations of the inflammatory cytokine interleukin-6 (IL-6), two established markers of disease progression. Gut-resident bacteria with capacity to catabolize tryptophan through the kynurenine pathway were found to be enriched in HIV-infected subjects, strongly correlated with kynurenine levels in HIV-infected subjects, and capable of kynurenine production in vitro. These observations demonstrate a link between mucosal-adherent colonic bacteria and immunopathogenesis during progressive HIV infection that is apparent even in the setting of viral suppression during HAART. This link suggests that gut-resident microbial populations may influence intestinal homeostasis during HIV disease.
Science translational medicine 07/2013; 5(193):193ra91. DOI:10.1126/scitranslmed.3006438 · 14.41 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
[Show abstract][Hide abstract] ABSTRACT: Polymorphisms in more than 100 genes have been associated with asthma susceptibility, yet much of the heritability remains to be explained. Asthma disproportionately affects different racial and ethnic groups in the United States, suggesting that admixture mapping is a useful strategy to identify novel asthma-associated loci.
We sought to identify novel asthma-associated loci in Latino populations using case-control admixture mapping.
We performed genome-wide admixture mapping by comparing levels of local Native American, European, and African ancestry between children with asthma and nonasthmatic control subjects in Puerto Rican and Mexican populations. Within candidate peaks, we performed allelic tests of association, controlling for differences in local ancestry.
Between the 2 populations, we identified a total of 62 admixture mapping peaks at a P value of less than 10^-3 that were significantly enriched for previously identified asthma-associated genes (P = .0051). One of the peaks was statistically significant based on 100 permutations in the Mexican sample (6q15); however, it was not significant in Puerto Rican subjects. Another peak was identified at nominal significance in both populations (8q12); however, the association was observed with different ancestries.
Case-control admixture mapping is a promising strategy for identifying novel asthma-associated loci in Latino populations and implicates genetic variation at 6q15 and 8q12 regions with asthma susceptibility. This approach might be useful for identifying regions that contribute to both shared and population-specific differences in asthma susceptibility.
The Journal of allergy and clinical immunology 04/2012; 130(1):76-82.e12. DOI:10.1016/j.jaci.2012.02.040 · 11.25 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.
[Show abstract][Hide abstract] ABSTRACT: Common variation in over 100 genes has been implicated in the risk of developing asthma, but the contribution of rare variants to asthma susceptibility remains largely unexplored. We selected nine genes that showed the strongest signatures of weak purifying selection from among 53 candidate asthma-associated genes, and we sequenced the coding exons and flanking noncoding regions in 450 asthmatic cases and 515 nonasthmatic controls. We observed an overall excess of p values <0.05 (p = 0.02), and rare variants in four genes (AGT, DPP10, IKBKAP, and IL12RB1) contributed to asthma susceptibility among African Americans. Rare variants in IL12RB1 were also associated with asthma susceptibility among European Americans, despite the fact that the majority of rare variants in IL12RB1 were specific to either one of the populations. The combined evidence of association with rare noncoding variants in IL12RB1 remained significant (p = 3.7 × 10^-4) after correcting for multiple testing. Overall, the contribution of rare variants to asthma susceptibility was predominantly due to noncoding variants in sequences flanking the exons, although nonsynonymous rare variants in DPP10 and in IL12RB1 were associated with asthma in African Americans and European Americans, respectively. This study provides evidence that rare variants contribute to asthma susceptibility. Additional studies are required for testing whether prioritizing genes for resequencing on the basis of signatures of purifying selection is an efficient means of identifying novel rare variants that contribute to complex disease.
The American Journal of Human Genetics 02/2012; 90(2):273-81. DOI:10.1016/j.ajhg.2012.01.008 · 10.99 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Objectives: Identifying drivers of complex traits from the noisy signals of genetic variation obtained from high-throughput genome sequencing technologies is a central challenge faced by human geneticists today. We hypothesize that the variants involved in complex diseases are likely to exhibit non-neutral evolutionary signatures. Uncovering the evolutionary history of all variants is therefore of intrinsic interest for complex disease research. However, doing so necessitates the simultaneous elucidation of the targets of natural selection and population-specific demographic history. Methods: Here we characterize the action of natural selection operating across complex disease categories, and use population genetic simulations to evaluate the expected patterns of genetic variation in large samples. We focus on populations that have experienced historical bottlenecks followed by explosive growth (consistent with many human populations), and describe the differences between evolutionarily deleterious mutations and those that are neutral. Results: Genes associated with several complex disease categories exhibit stronger signatures of purifying selection than non-disease genes. In addition, loci identified through genome-wide association studies of complex traits also exhibit signatures consistent with being in regions recurrently targeted by purifying selection. Through simulations, we show that population bottlenecks and rapid growth enable deleterious rare variants to persist at low frequencies just as long as neutral variants, but low-frequency and common variants tend to be much younger than neutral variants. This has resulted in a large proportion of modern-day rare alleles that have a deleterious effect on function and that potentially contribute to disease susceptibility. 
Conclusions: The key question for sequencing-based association studies of complex traits is how to distinguish between deleterious and benign genetic variation. We used population genetic simulations to uncover patterns of genetic variation that distinguish these two categories, especially derived allele age, thereby providing inroads into novel methods for characterizing rare genetic variation driving complex diseases.
Human Heredity 01/2012; 74(3-4):118-28. DOI:10.1159/000346826 · 1.64 Impact Factor