Article

Finding Disease Variants in Mendelian Disorders By Using Sequence Data: Methods and Applications


Abstract

Many sequencing studies are now underway to identify the genetic causes for both Mendelian and complex traits. Via exome sequencing, genes harboring variants implicated in several Mendelian traits have already been identified. The underlying methodology in these studies is a multistep algorithm that filters variants identified in a small number of affected individuals according to whether they are novel (not yet seen in public resources such as dbSNP), whether they are shared among affected individuals, and other external functional information on the variants. Although intuitive, these filter-based methods are nonoptimal and do not provide any measure of statistical uncertainty. We describe here a formal statistical approach that has several distinct advantages: (1) it provides fast computation of approximate p values for individual genes, (2) it adjusts for the background variation in each gene, (3) it allows for incorporation of functional or linkage-based information, and (4) it accommodates designs based on both affected relative pairs and unrelated affected individuals. We show via simulations that the proposed approach can be used in conjunction with the existing filter-based methods to achieve a substantially better ranking of a gene relevant for disease than currently used filter-based approaches; this is especially so in the presence of disease locus heterogeneity. We revisit recent studies on three Mendelian diseases and show that the proposed approach results in the implicated gene being ranked first in all studies, and approximate p values of 10⁻⁶ for the Miller syndrome gene, 1.0 × 10⁻⁴ for the Freeman-Sheldon syndrome gene, and 3.5 × 10⁻⁵ for the Kabuki syndrome gene.
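To make the contrast with pure filtering concrete, the sketch below shows a toy gene-level test in this spirit: it adjusts for each gene's background variation and yields an approximate p value. It assumes a simple Poisson model for the count of qualifying variants; this is a minimal illustration, not the paper's actual statistic.

```python
# Toy gene-level test in the spirit described above: adjust for a gene's
# background variation and compute an approximate p value. Assumes a
# simple Poisson model for the count of qualifying (novel, functional)
# variants -- an illustration, NOT the paper's actual statistic.
from scipy.stats import poisson

def gene_pvalue(observed_count, gene_bg_rate, n_affected):
    """gene_bg_rate: expected qualifying variants per affected individual
    for this gene (e.g., reflecting gene length and population variation)."""
    expected = gene_bg_rate * n_affected
    return poisson.sf(observed_count - 1, expected)  # P(X >= observed)

# Example: 4 affected individuals each carry a qualifying variant in a gene
# with a background rate of 0.01 qualifying variants per individual.
print(gene_pvalue(4, 0.01, 4))  # ~1e-7, so the gene ranks near the top
```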




... Advancements in sequencing technology have allowed examination of rare coding variants associated with disease. In contrast to large studies in complex disease, a "filtering"-based approach focusing on variant segregation with phenotype, predicted variant functionality, and novelty has been used to identify causal variants/genes for Mendelian diseases (Bamshad et al., 2011; Chong et al., 2015; Ionita-Laza et al., 2011; Ng et al., 2010). Some complex diseases are also known to have Mendelian or near-Mendelian variants, such as alpha-1 antitrypsin deficiency in chronic obstructive pulmonary disease (COPD; Dahl, Nordestgaard, Lange, Vestbo, & Tybjaerg-Hansen, 2001), BRCA1 and BRCA2 for breast and ovarian cancer (Aida et al., 1998; Miki et al., 1994; Szabo & King, 1995), and TARDBP for amyotrophic lateral sclerosis (ALS; Daoud et al., 2009). ...
... Collapsing the information from multiple variants within a gene may be more likely to capture key genes (Dering, Hemmelmann, Pugh, & Ziegler, 2011;Price et al., 2010;Sun, Sung, Tintle, & Ziegler, 2011). This is one of the advantages of the methods described in Ionita-Laza et al. (2011), in which the authors designed a gene-based test for segregation events in pairs of affected relatives. It considers background variation in the gene and degrees of relatedness between the affected relatives. ...
... Due to the nature of our approach, we were limited in our ability to compare it with other methods in simulations. The most applicable approaches mentioned previously (Bureau et al., 2014; Ionita-Laza et al., 2011; Koboldt et al., 2014) do not perform gene-based tests or do not allow a flexible pedigree structure. One additional method that we were unable to directly test against in simulations was pVAAST (Hu et al., 2014). ...
Article
Whole-exome sequencing using family data has identified rare coding variants in Mendelian diseases or complex diseases with Mendelian subtypes, using filters based on variant novelty, functionality, and segregation with the phenotype within families. However, formal statistical approaches are limited. We propose a gene-based segregation test (GESE) that quantifies the uncertainty of the filtering approach. It is constructed using the probability of segregation events under the null hypothesis of Mendelian transmission. This test takes into account different degrees of relatedness in families, the number of functional rare variants in the gene, and their minor allele frequencies in the corresponding population. In addition, a weighted version of this test allows incorporating additional subject phenotypes to improve statistical power. We show via simulations that the GESE and weighted GESE tests maintain appropriate type I error rate, and have greater power than several commonly used region-based methods. We apply our method to whole-exome sequencing data from 49 extended pedigrees with severe, early-onset chronic obstructive pulmonary disease (COPD) in the Boston Early-Onset COPD study (BEOCOPD) and identify several promising candidate genes. Our proposed methods show great potential for identifying rare coding variants of large effect and high penetrance for family-based sequencing data. The proposed tests are implemented in an R package that is available on CRAN (https://cran.r-project.org/web/packages/GESE/).
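The segregation probabilities underlying such a test can be illustrated with a toy calculation (a simplification, not the GESE implementation): under null Mendelian transmission, a heterozygous founder allele survives each meiosis with probability 1/2.

```python
# Hedged sketch of the kind of null quantity a gene-based segregation
# test builds on (a simplification, not the GESE package's code): under
# Mendelian transmission, a single heterozygous founder allele passes
# through each meiosis with probability 1/2.

def null_segregation_prob(n_meioses):
    """P(every affected relative inherits a single founder allele) when
    the transmission tree connecting the founder to all affected members
    contains `n_meioses` distinct meioses (shared segments counted once)."""
    return 0.5 ** n_meioses

# Affected sib pair: 2 meioses (carrier parent -> each sib) -> 0.25.
# Sib pair plus an affected first cousin: 5 distinct meioses -> ~0.031.
print(null_segregation_prob(2), null_segregation_prob(5))
```

The smaller this null probability, the stronger the evidence provided by observing full segregation in the family.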
... The problem of inadequate validation has been documented in population studies showing that hundreds of variants reported in the human gene mutation database as disease alleles and lacking adequate biochemical validation can be found in apparently healthy individuals (15,26,30). Although statistical and bioinformatics tools will improve, the most convincing evidence of associations between gene variants and disease will be based on molecular and biochemical validation in human cell lines or experimental organisms such as mice and zebrafish (26,31). Validation requires addressing three major issues: How does the nucleotide alteration affect the protein biochemically? ...
... The common assumption for a disease-causing variant is that it will have a severe deleterious effect on gene function (26, 31, 51, 58, 64-67). Largely owing to the wide use of exome sequencing, private variants with big effects have been located mainly in protein-coding regions of the genome. ...
... For Mendelian disorders, specific gene identification is achieved by each mutation being necessary and sufficient, within the bounds of penetrance and expressivity considerations, for the disease to occur. Generally the mutations are deleterious to protein function, are rare, and occur with an inheritance pattern consistent with the disease phenotype (22,31,51). The mutation reveals the relevant biological pathway and explains the phenotype. ...
Article
Genomic DNA sequencing technologies have been one of the great advances of the 21st century, having decreased in cost by seven orders of magnitude and opened up new fields of investigation throughout research and clinical medicine. Genomics coupled with biochemical investigation has allowed the molecular definition of a growing number of new genetic diseases that reveal new concepts of immune regulation. Also, defining the genetic pathogenesis of these diseases has led to improved diagnosis, prognosis, genetic counseling, and, most importantly, new therapies. We highlight the investigational journey from patient phenotype to treatment using the newly defined XMEN disease, caused by the genetic loss of the MAGT1 magnesium transporter, as an example. This disease illustrates how genomics yields new fundamental immunoregulatory insights as well as how research genomics is integrated into clinical immunology. At the end, we discuss two other recently described diseases, PASLI (PI3K dysregulation) and CHAI/LATAIE (CTLA-4 deficiency), as additional examples of the journey from unknown immunological diseases to new precision medicine treatments using genomics.
... In particular, Merlin and MORGAN were integrated into FamPipe to calculate the IBD statistics or linkage LOD scores to identify linkage regions. For identifying variants responsible for Mendelian disorders, three methods were implemented in the disease model identification (DMI) module in FamPipe: the segregation scores [8], which can be used for identifying family-specific mutations at disease variants; the weighted-sum statistic [24], which is ideal for identifying mutations in multiple disease variants within a gene; and the filtering rules for compound heterozygosity [33]. For complex disease studies, family-based association tests can be performed in the linkage regions or across the whole genome. ...
... For the second strategy, the weighted-sum statistic [24] and its p-value are calculated for each gene. The method has been shown to be powerful for identifying genes responsible for Mendelian diseases such as the Miller Syndrome, Freeman-Sheldon Syndrome, and Kabuki Syndrome using simulated sequencing data in a few affected individuals. ...
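Of the three DMI strategies listed above, the compound-heterozygosity rule is the easiest to illustrate. The sketch below is a generic version with a hypothetical data layout, not FamPipe's actual interface.

```python
# Minimal illustration of a compound-heterozygosity filtering rule of the
# kind mentioned above (hypothetical data layout, not FamPipe's actual
# interface). Genotypes are allele counts (0/1/2) per variant in one gene;
# a variant pair is flagged when the child is heterozygous at both sites
# and each parent transmitted exactly one of them.

def compound_het_candidates(child, mother, father):
    """child/mother/father: dict variant_id -> allele count in one gene."""
    pairs = []
    hets = [v for v, g in child.items() if g == 1]
    for i, v1 in enumerate(hets):
        for v2 in hets[i + 1:]:
            mom_only_v1 = mother.get(v1, 0) > 0 and father.get(v1, 0) == 0
            dad_only_v2 = father.get(v2, 0) > 0 and mother.get(v2, 0) == 0
            mom_only_v2 = mother.get(v2, 0) > 0 and father.get(v2, 0) == 0
            dad_only_v1 = father.get(v1, 0) > 0 and mother.get(v1, 0) == 0
            if (mom_only_v1 and dad_only_v2) or (mom_only_v2 and dad_only_v1):
                pairs.append((v1, v2))
    return pairs

# Example: two heterozygous sites, one inherited from each parent.
print(compound_het_candidates(child={"rs1": 1, "rs2": 1},
                              mother={"rs1": 1, "rs2": 0},
                              father={"rs1": 0, "rs2": 1}))  # [('rs1', 'rs2')]
```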
Article
Full-text available
In disease studies, family-based designs have become an attractive approach to analyzing next-generation sequencing (NGS) data for the identification of rare mutations enriched in families. Substantial research effort has been devoted to developing pipelines for automating sequence alignment, variant calling, and annotation. However, fewer pipelines have been designed specifically for disease studies. Most of the current analysis pipelines for family-based disease studies using NGS data focus on a specific function, such as identifying variants with Mendelian inheritance or identifying shared chromosomal regions among affected family members. Consequently, some other useful family-based analysis tools, such as imputation, linkage, and association tools, have yet to be integrated and automated. We developed FamPipe, a comprehensive analysis pipeline, which includes several family-specific analysis modules, including the identification of shared chromosomal regions among affected family members, prioritizing variants assuming a disease model, imputation of untyped variants, and linkage and association tests. We used simulation studies to compare properties of some modules implemented in FamPipe, and based on the results, we provided suggestions for the selection of modules to achieve an optimal analysis strategy. The pipeline is under the GNU GPL License and can be downloaded for free at http://fampipe.sourceforge.net.
... For Mendelian disorders, this may be a recessive, dominant, or compound heterozygous mutation model, or a combination of these models to explain the inherited traits. 86 Moreover, pathways discovered in previous studies may guide discovery of new mutations in the known pathways, yet this is limited by prior knowledge of the disease. It should be noted that new pathways can be established by linking newly identified mutations to known diseases. ...
... In other cases, statistical tests are often necessary to discover mutations or target genes that contribute to a disease. 86 This is particularly important for addressing the effects of multiple rare variants that cause functional damage in a combinatorial manner. Previous studies identified rare predisposing variants that are significantly associated with complex traits such as colorectal adenomas, 95 high-density lipoprotein cholesterol, 96 and schizophrenia. ...
Article
Full-text available
The advent of next-generation sequencing technologies has greatly promoted advances in the study of human diseases at the genomic, transcriptomic, and epigenetic levels. Exome sequencing, where the coding region of the genome is captured and sequenced at a deep level, has proven to be a cost-effective method to detect disease-causing variants and discover gene targets. In this review, we outline the general framework of whole exome sequence data analysis. We focus on established bioinformatics tools and applications that support five analytical steps: raw data quality assessment, preprocessing, alignment, post-processing, and variant analysis (detection, annotation, and prioritization). We evaluate the performance of open-source alignment programs and variant calling tools using simulated and benchmark datasets, and highlight the challenges posed by the lack of concordance among variant detection tools. Based on these results, we recommend adopting multiple tools and resources to reduce false positives and increase the sensitivity of variant calling. In addition, we briefly discuss the current status and solutions for big data management, analysis, and summarization in the field of bioinformatics.
... Recent technological advancements have greatly enhanced throughput and turnaround time such that it is possible to envision whole-genome sequencing as a test that can be performed at the point of care and used for real-time monitoring of microbial outbreaks (Green and Guyer 2011). It is no surprise that the diseases that benefited first from this revolutionary technology are the ones whose genetics we understand best, i.e., Mendelian disorders (Bamshad et al. 2011; Ionita-Laza et al. 2011). Although the notion of "single gene diseases" is controversial since the causal mutation never operates in isolation but is rather influenced by genetic and environmental modifiers, it is still true that the very large effect size of Mendelian mutations makes them highly relevant clinically, and medically actionable. ...
... However, this still leaves a formidable number of variants whose candidacy will have to be carefully considered. Therefore, additional filters are needed to improve the throughput of this approach (Gilissen et al. 2012;Ionita-Laza et al. 2011). Obviously, when autozygosity mapping highlights a single critical locus, considering only variants within that locus offers a helpful lead (Alkuraya 2012). ...
Article
Full-text available
Autozygosity, or the inheritance of two copies of an ancestral allele, has the potential to not only reveal phenotypes caused by biallelic mutations in autosomal recessive genes, but to also facilitate the mapping of such mutations by flagging the surrounding haplotypes as tractable runs of homozygosity (ROH), a process known as autozygosity mapping. Since SNPs replaced microsatellites as markers for the purpose of genomewide identification of ROH, autozygosity mapping of Mendelian genes has witnessed a significant acceleration. Historically, successful mapping required a favorable family structure that permits the identification of an autozygous interval amenable to candidate gene selection and confirmation by Sanger sequencing. This requirement presented a major bottleneck that hindered the utilization of simplex cases and many multiplex families with autosomal recessive phenotypes. However, the advent of next-generation sequencing that enables massively parallel sequencing of DNA has largely bypassed this bottleneck and thus ushered in an era of unprecedented pace of Mendelian disease gene discovery. The ability to identify a single causal mutation among a massive number of variants that are uncovered by next-generation sequencing can be challenging, but applying autozygosity as a filter can greatly enhance the enrichment process and its throughput. This review will discuss the power of combining the best of both techniques in the mapping of recessive disease genes and offer some tips to troubleshoot potential limitations.
... These annotated variants can then further be interpreted by filtering based on either quality, functional, or genetic criteria. These filtered annotated data are then interpreted by either collapsing the data into the desired functional units, often genes, or performing formal statistical association analysis methods [19]. Several software tools have already been developed to tackle the aforementioned analysis steps. ...
... In addition to the widely used collapsing method of disease-gene discovery, where variants across affected samples are collapsed at the gene level and counted, the user can also perform formal statistical analysis using the methodology proposed by Ionita-Laza for case/control studies [19]. This class of statistical methods has the added benefit of computing an approximate P-value for individual genes and of taking into account inherent background variation in genes due to factors such as gene size or hypermutable regions. ...
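A minimal sketch of that collapsing step follows (a generic illustration; the variant names are hypothetical, with DHODH standing in as the Miller syndrome gene mentioned earlier).

```python
# Sketch of the widely used collapsing step described above (a generic
# illustration, not any specific tool's code): filtered variants are
# collapsed to the gene level, and genes are ranked by how many affected
# samples carry at least one qualifying variant.
from collections import defaultdict

def collapse_by_gene(sample_variants):
    """sample_variants: dict sample_id -> set of (gene, variant) tuples
    that passed the novelty/functionality filters."""
    carriers = defaultdict(set)
    for sample, variants in sample_variants.items():
        for gene, _variant in variants:
            carriers[gene].add(sample)
    # Rank genes by the number of affected carriers, descending.
    return sorted(carriers.items(), key=lambda kv: -len(kv[1]))

cases = {  # hypothetical filtered variants in three affected individuals
    "case1": {("DHODH", "c.403C>T"), ("TTN", "c.2T>C")},
    "case2": {("DHODH", "c.605G>A")},
    "case3": {("DHODH", "c.454G>A"), ("TTN", "c.10A>G")},
}
print(collapse_by_gene(cases))  # DHODH, carried by all 3 cases, ranks first
```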
Article
Full-text available
The increasing size and complexity of exome/genome sequencing data require new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation, and failure to consider existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data is accessible to all users at http://www.annotate-it.org.
... Many other rare-variant tests that do not correct for covariates also exist. 4,9,17-20 In addition to those rare-variant tests that cannot directly adjust for confounders, there are other rare-variant tests that offer only limited mechanisms to correct for such variables. One such example is the variable-allele-frequency threshold test, 3 which proposes adjusting for covariates by the replacement of disease-outcome variables in the test statistic with residuals from a regression analysis of disease outcome on covariates under a linear model. ...
... Such preservation of case-control numbers in replicate data sets is particularly valuable for exome-sequencing studies of Mendelian traits, studies which often possess only a handful of cases for analysis. 19 Finally, we note that we could apply the parametric bootstrap in such a manner that we only accept data sets that preserve the original number of cases and controls, but such a procedure will be much less computationally efficient than biased urn sampling. In our simulations, we observed that this approach required ~25× more computation time than biased urn sampling across different sample sizes. ...
Article
Many case-control tests of rare variation are implemented in statistical frameworks that make correction for confounders like population stratification difficult. Simple permutation of disease status is unacceptable for resolving this issue because the replicate data sets do not have the same confounding as the original data set. These limitations make it difficult to apply rare-variant tests to samples in which confounding most likely exists, e.g., samples collected from admixed populations. To enable the use of such rare-variant methods in structured samples, as well as to facilitate permutation tests for any situation in which case-control tests require adjustment for confounding covariates, we propose to establish the significance of a rare-variant test via a modified permutation procedure. Our procedure uses Fisher's noncentral hypergeometric distribution to generate permuted data sets with the same structure present in the actual data set such that inference is valid in the presence of confounding factors. We use simulated sequence data based on coalescent models to show that our permutation strategy corrects for confounding due to population stratification that, if ignored, would otherwise inflate the size of a rare-variant test. We further illustrate the approach by using sequence data from the Dallas Heart Study of energy metabolism traits. Researchers can implement our permutation approach by using the R package BiasedUrn.
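The paper implements this procedure in the R package BiasedUrn. Below is a rough Python analogue using scipy's Fisher noncentral hypergeometric distribution (available in scipy >= 1.6), simplified to a single binary stratum as the confounder; it is a sketch of the idea, not the authors' implementation.

```python
# Rough analogue of the biased-urn permutation idea: case labels are
# permuted so that every replicate preserves both the total case count and
# the stratum-specific odds of being a case (here, one binary stratum).
import numpy as np
from scipy.stats import nchypergeom_fisher

rng = np.random.default_rng(42)

def permute_status(stratum, status, odds, n_reps=1000):
    """stratum, status: 0/1 arrays per subject. `odds` is the odds ratio
    of being a case in stratum 1 vs stratum 0 (estimated or prespecified)."""
    idx0, idx1 = np.where(stratum == 0)[0], np.where(stratum == 1)[0]
    M, n, N = len(stratum), len(idx1), int(status.sum())
    for _ in range(n_reps):
        # Number of cases falling in stratum 1 under the biased urn.
        k1 = nchypergeom_fisher.rvs(M, n, N, odds, random_state=rng)
        new = np.zeros(M, dtype=int)
        new[rng.choice(idx1, size=k1, replace=False)] = 1
        new[rng.choice(idx0, size=N - k1, replace=False)] = 1
        yield new

# Example: 100 subjects; stratum-1 members are more likely to be cases.
stratum = rng.integers(0, 2, size=100)
status = rng.binomial(1, np.where(stratum == 1, 0.4, 0.25)).astype(float)
for perm in permute_status(stratum, status, odds=2.0, n_reps=3):
    print(perm.sum(), perm[stratum == 1].sum())  # case total fixed per replicate
```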
... This can also phase rare SNVs in the population and produce chromosome-wide phasing but requires additional sequencing of the parents, which is not always available 17 . More importantly, this strategy fails to phase de novo variants, which are often causative in diseases such as intellectual disability 18 and Mendelian diseases 19 . Lastly, per-read phasing leverages only the linking information of variants that are shared in the same read. ...
Preprint
Full-text available
The assignment of variants across haplotypes, phasing, is crucial for predicting the consequences, interaction, and inheritance of mutations and is a key step in improving our understanding of phenotype and disease. However, phasing is limited by read length and stretches of homozygosity along the genome. To overcome this limitation, we designed MethPhaser, the first method that utilizes methylation signals from Oxford Nanopore Technologies to extend SNV-based phasing. Across control samples, we extend the phase length N50 by almost 3-fold while minimally increasing the phasing error by ~0.02%. Nevertheless, methylation signals have limitations, such as random signals on sex chromosomes or tissue purity. To assess the latter, we also applied MethPhaser on blood samples from 4 patients, still showing improvements over SNV-only phasing. MethPhaser further improves phasing across HLA and multiple other medically relevant genes, improving our understanding of how mutations interact across multiple phenotypes. MethPhaser is available at https://github.com/treangenlab/methphaser.
... For example, more than 70% of rare diseases are thought to have a genetic cause, and recent efforts have identified the causal variants for thousands of Mendelian diseases. [1-3] However, causal variants have not been identified for approximately half (~3,000) of known rare genetic diseases, [4-6] and sequencing often fails to lead to actionable insights, even after expert clinical evaluation through programs such as the NIH's Undiagnosed Diseases Network (UDN). [7-9] Many computational methods have been developed for interpreting variants observed in clinical sequencing. ...
Article
Full-text available
Whole exome sequencing (WES) in the clinic has identified several rare monogenic developmental and epileptic encephalopathies (DEE) caused by ion channel variants. However, WES often fails to provide actionable insight for rare diseases, like DEEs, due to the challenges of interpreting variants of unknown significance (VUS). Here, we describe a “personalized structural biology” (PSB) approach that leverages recent innovations in the analysis of protein 3D structures to address this challenge. We illustrate this approach in an Undiagnosed Diseases Network (UDN) individual with DEE symptoms and a de novo VUS in KCNC2 (p.V469L), the Kv3.2 voltage-gated potassium channel. A nearby KCNC2 variant (p.V471L) was recently suggested to cause DEE-like phenotypes. Computational structural modeling suggests that both affect protein function. However, despite their proximity, the p.V469L variant is likely to sterically block the channel pore, while the p.V471L variant is likely to stabilize the open state. Biochemical and electrophysiological analyses demonstrate heterogeneous loss-of-function and gain-of-function effects, as well as differential response to 4-aminopyridine (4-AP) treatment. Molecular dynamics simulations illustrate that the pore of the p.V469L variant is more constricted, increasing the energetic barrier for K⁺ permeation, whereas the p.V471L variant stabilizes the open conformation. Our results implicate variants in KCNC2 as causative for DEE and guide the interpretation of a UDN individual. They further delineate the molecular basis for the heterogeneous clinical phenotypes resulting from two proximal pathogenic variants. This demonstrates how the PSB approach can provide an analytical framework for individualized hypothesis-driven interpretation of protein-coding VUS.
... Possible effects of the determined variants are evaluated in silico with online prediction tools such as MutationTaster, PolyPhen-2, SIFT, and CADD (Table 2). Additionally, establishing the inheritance pattern of the disease from family history or previous studies is another useful step for the analyses [50]. As previously mentioned, using the restricted region from linkage analysis as a filter provides significant advantages for determining candidate variants. ...
Article
Full-text available
Background: In the context of medical genetics, gene hunting is the process of identifying and functionally characterizing genes or genetic variations that contribute to disease phenotypes. In this review, we summarize the gene hunting process in its historical aspects, from Darwin to the present. For this purpose, different approaches and recent developments will be detailed. Summary: Linkage analysis and association studies are the most common methods in use for explaining the genetic background of hereditary diseases and disorders. Although linkage analysis is a relatively old approach, it is still a powerful method to detect disease-causing rare variants using family-based data, particularly for consanguineous marriages. As is known, consanguineous marriage or endogamy poses a social problem in developing countries; however, this same condition also provides a unique opportunity for scientists to identify and characterize pathogenic variants. The rapid advancements in sequencing technologies and their implementation in parallel with linkage analyses now allow us to identify candidate variants related to diseases in a relatively short time. Furthermore, we can now go one step further and functionally characterize the causative variant through in vitro and in vivo studies and unveil variant-phenotype relationships on a molecular level more robustly. Key Messages: Herein, we suggest that the combined use of linkage and exome analysis is a powerful and precise tool to diagnose clinically rare and recessively inherited conditions.
... We demonstrate this by using simple aggregation methods such as the maximum score and the mean score. Additionally, we show that more involved nonparametric and parametric statistical methods offer similar performance with the added benefit of providing estimates of significance, which could in future studies easily be used in broader statistical frameworks such as those used in weighted association or familial studies [27,28]. We do remark that these methods assume independence of the individual prioritizations, an assumption which is likely not true due to the usage of phenotype-aspecific data sources in eXtasy. ...
Preprint
Full-text available
The identification of disease-causing genes in Mendelian disorders has been facilitated by the detection of rare disease-causing variation through exome sequencing experiments. These studies rely on population databases to filter a majority of the putatively neutral variation in the genome and additional filtering steps using either cohorts of diseased individuals or familial information to narrow down the list of candidate variants. Recently, new computational methods have been proposed to prioritize variants by scoring them not only based on their potential impact on protein function but also on their relevance given the available information on the disease under study. Usually these diseases comprise several phenotypic presentations, which are separately prioritized and then aggregated into a global score. In this study we compare several simple (e.g. maximum and mean score) and more complex aggregation methods (e.g. order statistics, parametric modeling) in order to obtain the best possible prioritization performance. We show that all methods perform reasonably well (median rank below 20 out of more than 8000 variants) and that the selection of an optimal aggregation method depends strongly on the fraction of uninformative phenotypes. Finally, we propose guidelines as to how to select an appropriate aggregation method based on knowledge of the phenotype under study.
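The simple aggregation strategies compared in this preprint are easy to state directly; the sketch below illustrates why the maximum is robust to uninformative phenotypes while the mean dilutes the signal.

```python
# Simple sketch of the score-aggregation strategies compared above
# (illustrative only): each candidate variant receives one prioritization
# score per phenotypic presentation, and the per-phenotype scores are
# aggregated into a single ranking value.
import statistics

def aggregate(scores_per_phenotype, method="max"):
    """scores_per_phenotype: list of per-phenotype scores for one variant."""
    if method == "max":
        return max(scores_per_phenotype)
    if method == "mean":
        return statistics.mean(scores_per_phenotype)
    raise ValueError(f"unknown method: {method}")

# A variant highly relevant to one of three phenotypes: the max is robust
# to the two uninformative phenotypes, while the mean dilutes the signal.
print(aggregate([0.9, 0.1, 0.2], "max"))   # 0.9
print(aggregate([0.9, 0.1, 0.2], "mean"))  # 0.4
```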
... In families, Mendelian transmission results in family members sharing the same alleles, and thus affected relatives have a greater chance of carrying the same disease-causing single-nucleotide polymorphisms (SNPs) than unrelated subjects. For instance, the probability of sibling pairs sharing rare alleles can be calculated (Ionita-Laza et al., 2011). Therefore, family-based analyses have been generally recognized as an important strategy for rare variant association studies. ...
Article
Full-text available
Family-based designs have been shown to be powerful in detecting the significant rare variants associated with human diseases. However, very few significant results have been found owing to relatively small sample sizes and the fact that statistical analyses often suffer from high false-negative error rates. These limitations can be avoided by combining results from multiple studies via meta-analysis. However, statistical methods for meta-analysis with rare variants are limited for family-based samples. In this report, we propose a tool for the meta-analysis of family-based rare variant associations, metaFARVAT. metaFARVAT is based on a quasi-likelihood score for each variant. These scores are combined to generate burden test, variable-threshold test, sequence kernel association test (SKAT), and optimal SKAT statistics. The proposed method tests homogeneous and heterogeneous effects of variants among different studies and can be applied to both quantitative and dichotomous phenotypes. Simulation results demonstrated the robustness and efficiency of the proposed method in different scenarios. By applying metaFARVAT to data from a family-based study and a case-control study, we identified a few promising candidate genes, including DLEC1, which is associated with chronic obstructive pulmonary disease.
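As a generic illustration of combining gene-level evidence across studies, the sketch below uses Stouffer's weighted-z method, a standard meta-analysis technique; metaFARVAT's actual quasi-likelihood machinery is more involved than this.

```python
# Generic sketch of meta-analyzing gene-level association evidence across
# studies via Stouffer's weighted-z method -- a standard technique, not
# metaFARVAT's exact quasi-likelihood statistics.
import math
from scipy.stats import norm

def stouffer_meta(z_scores, weights):
    """Combine per-study z-scores (e.g., from gene-based burden tests),
    weighting each study, typically by sqrt(sample size)."""
    num = sum(w * z for w, z in zip(weights, z_scores))
    denom = math.sqrt(sum(w * w for w in weights))
    z = num / denom
    return z, norm.sf(z)  # one-sided p-value

# Example: a family-based study (n=200) and a case-control study (n=800)
# each show suggestive evidence for the same gene.
z, p = stouffer_meta([2.1, 2.4], [math.sqrt(200), math.sqrt(800)])
print(f"combined z = {z:.2f}, p = {p:.2e}")  # z ~= 3.09, p ~= 1e-3
```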
... Sequencing distant relatives is an established approach to identify causal variants for Mendelian disorders (e.g., Bamshad et al., 2011; Ionita-Laza et al., 2011; Ng et al., 2010). Typically external databases are combined with variant filtering strategies to identify causal variants under the assumption of complete penetrance and the absence of phenocopies. ...
Article
We previously demonstrated how sharing of rare variants (RVs) in distant affected relatives can be used to identify variants causing a complex and heterogeneous disease. This approach tested whether single RVs were shared by all sequenced affected family members. However, as with other study designs, joint analysis of several RVs (e.g., within genes) is sometimes required to obtain sufficient statistical power. Further, phenocopies can lead to false negatives for some causal RVs if complete sharing among affected is required. Here, we extend our methodology (Rare Variant Sharing, RVS) to address these issues. Specifically, we introduce gene‐based analyses, a partial sharing test based on RV sharing probabilities for subsets of affected relatives and a haplotype‐based RV definition. RVS also has the desirable feature of not requiring external estimates of variant frequency or control samples, provides functionality to assess and address violations of key assumptions, and is available as open source software for genome‐wide analysis. Simulations including phenocopies, based on the families of an oral cleft study, revealed the partial and complete sharing versions of RVS achieved similar statistical power compared with alternative methods (RareIBD and the Gene‐Based Segregation Test), and had superior power compared with the pedigree Variant Annotation, Analysis, and Search Tool (pVAAST) linkage statistic. In studies of multiplex cleft families, analysis of rare single nucleotide variants in the exome of 151 affected relatives from 54 families revealed no significant excess sharing in any one gene, but highlighted different patterns of sharing revealed by the complete and partial sharing tests.
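The core quantity behind such sharing tests can be computed exactly for small pedigrees. The sketch below enumerates meiosis outcomes under the idealized assumption that the variant enters the pedigree through a single founder; it is a simplified model, not the RVS package.

```python
# Toy computation of a rare-variant sharing probability: assuming a
# variant enters the pedigree through exactly one founder, enumerate all
# meiosis outcomes and compute P(all affected carry it | at least one does).
from itertools import product

def sharing_prob(tree, affected):
    """tree: dict child -> parent describing transmission paths from the
    founder 'F'; affected: list of affected ids. Each edge is one meiosis."""
    edges = list(tree.items())
    p_all = p_any = 0.0
    for outcome in product([0, 1], repeat=len(edges)):  # 1 = transmitted
        passed = dict(zip(edges, outcome))
        def carries(person):
            while person != "F":
                if not passed[(person, tree[person])]:
                    return False
                person = tree[person]
            return True
        prob = 0.5 ** len(edges)
        carried = [carries(a) for a in affected]
        if any(carried):
            p_any += prob
            if all(carried):
                p_all += prob
    return p_all / p_any

# Affected first cousins: founder grandparent F -> parents A, B -> cousins.
tree = {"A": "F", "B": "F", "c1": "A", "c2": "B"}
print(sharing_prob(tree, ["c1", "c2"]))  # 1/7 ~= 0.143
```

Small sharing probabilities under the null are what make complete sharing among distant affected relatives informative.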
... A second category of tools comprises those that have been developed to rank genes and variants in rare disease studies on the basis of different probabilistic frameworks that analyze the background variation in genes, as well as the nature and frequency of variants in affected individuals [64-66]. These tools are especially useful for cohort studies with multiple affected families or individuals. ...
Article
Exomiser is an application that prioritizes genes and variants in next-generation sequencing (NGS) projects for novel disease-gene discovery or differential diagnostics of Mendelian disease. Exomiser comprises a suite of algorithms for prioritizing exome sequences using random-walk analysis of protein interaction networks, clinical relevance and cross-species phenotype comparisons, as well as a wide range of other computational filters for variant frequency, predicted pathogenicity and pedigree analysis. In this protocol, we provide a detailed explanation of how to install Exomiser and use it to prioritize exome sequences in a number of scenarios. Exomiser requires ∼3 GB of RAM and roughly 15-90 s of computing time on a standard desktop computer to analyze a variant call format (VCF) file. Exomiser is freely available for academic use from http://www.sanger.ac.uk/science/tools/exomiser.
... In current sequencing of patients in autosomal recessive (AR) families, candidate disease variants are generally prioritized based on well-known filtering steps 1,2 . Homozygosity mapping is also often applied to identify long runs of homozygosity 3 , which may be interpreted as harboring segments of DNA identical by descent (IBD), but length alone is known to be a poor statistic for this purpose 4 . ...
Article
Full-text available
A major challenge in current exome sequencing in autosomal recessive (AR) families is the lack of an effective method to prioritize single-nucleotide variants (SNVs). AR families are generally too small for linkage analysis, and length of homozygous regions is unreliable for identification of causative variants. Various common filtering steps usually result in a list of candidate variants that cannot be narrowed down further or ranked. To prioritize shortlisted SNVs we consider each homozygous candidate variant together with a set of SNVs flanking it. We compare the resulting array of genotypes between an affected family member and a number of control individuals and argue that, in a family, differences between family member and controls should be larger for a pathogenic variant and SNVs flanking it than for a random variant. We assess differences between arrays in two individuals by the Hamming distance and develop a suitable test statistic, which is expected to be large for a causative variant and flanking SNVs. We prioritize candidate variants based on this statistic and applied our approach to six patients with known pathogenic variants and found these to be in the top 2 to 10 percentiles of ranks.
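A minimal sketch of the distance idea follows (illustrative only; the variant names are hypothetical and the authors' actual test statistic is more refined than a plain mean).

```python
# Sketch of the Hamming-distance idea described above: for each homozygous
# candidate variant, compare the patient's genotype array (candidate plus
# flanking SNVs) against control arrays; causal variants should show larger
# patient-vs-control differences than random ones.
import numpy as np

def mean_hamming(patient_array, control_arrays):
    """patient_array: genotypes (0/1/2) at the candidate + flanking SNVs.
    control_arrays: one row per control over the same positions."""
    diffs = np.asarray(control_arrays) != np.asarray(patient_array)
    return diffs.sum(axis=1).mean()

# Rank candidates by mean distance to controls, largest first.
candidates = {
    "chr1:g.123A>G": mean_hamming([2, 2, 1, 2, 0],
                                  [[0, 1, 1, 0, 0], [1, 0, 1, 0, 1]]),
    "chr7:g.456C>T": mean_hamming([1, 0, 1, 0, 1],
                                  [[1, 0, 1, 0, 1], [1, 0, 1, 1, 1]]),
}
print(sorted(candidates.items(), key=lambda kv: -kv[1]))  # chr1 variant first
```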
... The current literature provides empirical and theoretical evidence that association signals that have been identified by GWAS are caused by rare variants in the proximity of the GWAS SNPs [1-7]. Various rare variant association approaches have already been developed [8]. So far, the methodology is centered around the technical aspects of handling rare variant (RV) data in statistical association tests, i.e., efficient ways of collapsing/combining the data of multiple loci, different effect directions, etc. [9-15]. ...
... VAAST employs a number of filter steps followed by a likelihood ratio test that incorporates both amino acid substitution frequencies and allele frequencies to prioritize candidate genes on the basis of SNVs present in those genes in cases and controls (Yandell et al. 2011). If several families are available for analysis, rare variant burden tests have been applied with weighting of the variants by characteristics, including predicted pathogenicity or de novo status (Ionita-Laza et al. 2011). Additional filter criteria resulting from linkage analysis (Smith et al. 2011), pedigree analysis (Sincan et al. 2012), and inference of identical-by-descent regions (Rödelsperger et al. 2011) may be helpful in certain cases. ...
Article
Full-text available
Numerous new disease-gene associations have been identified by whole-exome sequencing studies in the last few years. However, many cases remain unsolved due to the sheer number of candidate variants remaining after common filtering strategies such as removing low quality and common variants and those deemed unlikely to be pathogenic. The observation that each of our genomes contains about 100 genuine loss-of-function variants makes identification of the causative mutation problematic when using these strategies alone. We propose using the wealth of genotype to phenotype data that already exists from model organism studies to assess the potential impact of these exome variants. Here, we introduce PHenotypic Interpretation of Variants in Exomes (PHIVE), an algorithm that integrates the calculation of phenotype similarity between human diseases and genetically modified mouse models with evaluation of the variants according to allele frequency, pathogenicity, and mode of inheritance approaches in our Exomiser tool. Large-scale validation of PHIVE analysis using 100,000 exomes containing known mutations demonstrated a substantial improvement (up to 54.1-fold) over purely variant-based (frequency and pathogenicity) methods with the correct gene recalled as the top hit in up to 83% of samples, corresponding to an area under the ROC curve of >95%. We conclude that incorporation of phenotype data can play a vital role in translational bioinformatics and propose that exome sequencing projects should systematically capture clinical phenotypes to take advantage of the strategy presented here.
... Despite these limitations, the exomic approach has not only helped identify and properly diagnose certain diseases, such as an atypical case of Wolfram syndrome and Freeman-Sheldon syndrome [143-145], but also holds promise for finding both rare and common variants among patients with known clinical conditions like MSA. While diseases with classic Mendelian forms of inheritance serve as ideal candidates for exome sequencing, it is important to realize that complex disorders, provided that sample sizes are sufficient, are amenable as well, such as in the identification of TREM2 variants as risk factors for ...
Article
Full-text available
Classically defined phenotypically by a triad of cerebellar ataxia, parkinsonism, and autonomic dysfunction in conjunction with pyramidal signs, multiple system atrophy (MSA) is a rare and progressive neurodegenerative disease affecting an estimated 3-4 per 100,000 individuals among adults 50-99 years of age. With a pathological hallmark of alpha-synuclein-immunoreactive glial cytoplasmic inclusions (GCIs; Papp-Lantos inclusions), MSA patients exhibit marked neurodegenerative changes in the striatonigral and/or olivopontocerebellar structures of the brain. As a member of the alpha-synucleinopathy family, which is defined by its well-demarcated alpha-synuclein-immunoreactive inclusions and aggregation, MSA's clinical presentation exhibits several overlapping features with other members including Parkinson's disease (PD) and dementia with Lewy bodies (DLB). Given the extensive fund of knowledge regarding the genetic etiology of PD revealed within the past several years, a genetic investigation of MSA is warranted. While a current genome-wide association study is underway for MSA to further clarify the role of associated genetic loci and single-nucleotide polymorphisms, several cases have presented solid preliminary evidence of a genetic etiology. Naturally, genes and variants manifesting known associations with PD (and other phenotypically similar neurodegenerative disorders), including SNCA and MAPT, have been comprehensively investigated in MSA patient cohorts. More recently variants in COQ2 have been linked to MSA in the Japanese population although this finding awaits replication. Nonetheless, significant positive associations with subsequent independent replication studies have been scarce. With very limited information regarding genetic mutations or alterations in gene dosage as a cause of MSA, the search for novel risk genes, which may be in the form of common variants or rare variants, is the logical nexus for MSA research. We believe that the application of next generation genetic methods to MSA will provide valuable insight into the underlying causes of this disease, and will be central to the identification of etiologic-based therapies.
... Score test criterion. A simpler alternative to the BF calculations for distinguishing causal from noncausal variants was described by Ionita-Laza et al. [14]. Because score tests are computed under the null hypothesis, they do not require specification of an alternative hypothesis distribution of minor allele frequencies (MAFs) and relative risks (RRs) for causal alleles. ...
Article
Full-text available
The cost of next-generation sequencing is now approaching that of the first generation of genome-wide single-nucleotide genotyping panels, but this is still out of reach for large-scale epidemiologic studies with tens of thousands of subjects. Furthermore, the anticipated yield of millions of rare variants poses serious challenges for distinguishing causal from noncausal variants for disease. We explore the merits of using family-based designs for sequencing substudies to identify novel variants and prioritize them for their likelihood of causality. While the sharing of variants within families means that family-based designs may be less efficient for discovery than sequencing of a comparable number of unrelated individuals, the ability to exploit cosegregation of variants with disease within families helps distinguish causal from noncausal ones. We introduce a score test criterion for prioritizing discovered variants in terms of their likelihood of being functional. We compare the relative statistical efficiency of 2-stage versus 1-stage family-based designs by application to the Genetic Analysis Workshop 18 simulated sequence data.
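As a toy stand-in for such a cosegregation criterion (a TDT-flavored simplification, not the authors' score test), one can standardize the observed number of transmissions of a rare allele to affected offspring against its null expectation of 1/2 per informative meiosis.

```python
# Toy cosegregation criterion in the spirit of the score-test idea above
# (a TDT-flavored simplification, NOT the authors' exact statistic): under
# the null, a heterozygous parent transmits the rare allele to an affected
# offspring with probability 1/2, so an excess of transmissions suggests
# the variant cosegregates with disease.
import math
from scipy.stats import norm

def cosegregation_z(transmitted, informative_meioses):
    """Score-type statistic: observed minus expected transmissions of the
    rare allele to affected offspring, standardized under the null."""
    n = informative_meioses
    return (transmitted - n / 2) / math.sqrt(n / 4)

# Example: the rare allele was transmitted in 9 of 10 informative meioses.
z = cosegregation_z(9, 10)
print(z, norm.sf(z))  # z ~= 2.53, one-sided p ~= 0.006
```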
... Both empirical and theoretical studies suggest that rare genetic variants are an important contributor to disease risk [6,7,8,9]. Over the past few years several statistical tests have been proposed to test for association with rare variants in a small genetic region, such as a gene [10,11,12,13,14,15,16,17]. The proposed association tests are based on the idea of grouping together variants in the gene, and testing for association at the gene rather than variant level. ...
Article
Full-text available
Pinpointing the small number of causal variants among the abundant naturally occurring genetic variation is a difficult challenge, but a crucial one for understanding precise molecular mechanisms of disease and follow-up functional studies. We propose and investigate two complementary statistical approaches for identification of rare causal variants in sequencing studies: a backward elimination procedure based on groupwise association tests, and a hierarchical approach that can integrate sequencing data with diverse functional and evolutionary conservation annotations for individual variants. Using simulations, we show that incorporation of multiple bioinformatic predictors of deleteriousness, such as PolyPhen-2, SIFT and GERP++ scores, can improve the power to discover truly causal variants. As proof of principle, we apply the proposed methods to VPS13B, a gene mutated in the rare neurodevelopmental disorder called Cohen syndrome, and recently reported with recessive variants in autism. We identify a small set of promising candidates for causal variants, including two loss-of-function variants and a rare, homozygous probably-damaging variant that could contribute to autism risk.
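A minimal sketch of folding such deleteriousness predictors into per-variant weights for a groupwise test follows. The normalization conventions here are illustrative assumptions: PolyPhen-2 in [0, 1] (higher = damaging), SIFT in [0, 1] (lower = damaging), GERP++ roughly in [-12, 6] (higher = conserved).

```python
# Illustrative sketch of integrating bioinformatic deleteriousness
# predictions as per-variant weights in a groupwise test, as motivated
# above. The scaling conventions are assumptions, not the paper's model.

def variant_weight(polyphen, sift, gerp):
    """Average the predictors on a common 0-1 'damaging' scale."""
    gerp_scaled = (gerp + 12) / 18  # map roughly [-12, 6] onto [0, 1]
    return (polyphen + (1 - sift) + gerp_scaled) / 3

def weighted_burden(genotypes, weights):
    """Per-individual weighted burden over the variants in one gene."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

weights = [variant_weight(0.98, 0.01, 5.2),   # likely damaging variant
           variant_weight(0.10, 0.60, -2.0)]  # likely benign variant
genotypes = [[1, 0], [0, 1], [1, 1]]  # allele counts, one row per subject
print(weights, weighted_burden(genotypes, weights))
```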
... [2-4]), the transition to regular clinical access to exome analysis is challenging. The data output from exome sequencing is immense and computationally complex, and finding relevant sequence variations amongst the hundreds of thousands of variants in each individual remains an ongoing challenge [5-7]. Various software packages have been developed for visualization and interpretation of sequence variation data to address this challenge, but to date no comprehensive usability studies have been reported to identify and investigate user interface features required for efficient clinical work involving exome analysis. ...
Article
Full-text available
Objectives: New DNA sequencing technologies have revolutionized the search for genetic disruptions. Targeted sequencing of all protein coding regions of the genome, called exome analysis, is actively used in research-oriented genetics clinics, with the transition to exomes as a standard procedure underway. This transition is challenging; identification of potentially causal mutation(s) amongst ∼10⁶ variants requires specialized computation in combination with expert assessment. This study analyzes the usability of user interfaces for clinical exome analysis software. There are two study objectives: (1) To ascertain the key features of successful user interfaces for clinical exome analysis software based on the perspective of expert clinical geneticists, (2) To assess user-system interactions in order to reveal strengths and weaknesses of existing software, inform future design, and accelerate the clinical uptake of exome analysis. Methods: Surveys, interviews, and cognitive task analysis were performed for the assessment of two next-generation exome sequence analysis software packages. The subjects included ten clinical geneticists who interacted with the software packages using the "think aloud" method. Subjects' interactions with the software were recorded in their clinical office within an urban research and teaching hospital. All major user interface events (from the user interactions with the packages) were time-stamped and annotated with coding categories to identify usability issues in order to characterize desired features and deficiencies in the user experience. Results: We detected 193 usability issues, the majority of which concern interface layout and navigation, and the resolution of reports. Our study highlights gaps in specific software features typical within exome analysis. The clinicians perform best when the flow of the system is structured into well-defined yet customizable layers for incorporation within the clinical workflow. The results highlight opportunities to dramatically accelerate clinician analysis and interpretation of patient genomic data. Conclusion: We present the first application of usability methods to evaluate software interfaces in the context of exome analysis. Our results highlight how the study of user responses can lead to identification of usability issues and challenges and reveal software reengineering opportunities for improving clinical next-generation sequencing analysis. While the evaluation focused on two distinctive software tools, the results are general and should inform active and future software development for genome analysis software. As large-scale genome analysis becomes increasingly common in healthcare, it is critical that efficient and effective software interfaces are provided to accelerate clinical adoption of the technology. Implications for improved design of such applications are discussed.
... In the last few years, the introduction of next-generation sequencing (NGS) has revolutionized clinical genetics, making whole-exome sequencing (WES) a rapid way to elucidate the genetic basis of clinically and genetically heterogeneous Mendelian disorders (Bamshad et al. 2011; Ionita-Laza et al. 2011; Rabbani et al. 2012). At present, it is expected that NGS, in combination with network analysis (Minguez et al. 2009) and other advanced bioinformatics tools that allow prioritization of candidate genes based on their functional relationships (Ideker and Sharan 2008), will play an increasingly important role in the diagnosis of complex and oligogenic disorders. ...
Article
Full-text available
Bardet-Biedl syndrome (BBS) is a model ciliopathy characterized by a wide range of clinical variability. The heterogeneity of this condition is reflected in the number of underlying gene defects and the epistatic interactions between the proteins encoded. BBS is generally inherited as an autosomal recessive trait. However, in some families, mutations across different loci interact to modulate the expressivity of the phenotype. In order to investigate the magnitude of epistasis in one BBS family with remarkable intrafamilial phenotypic variability, we designed an exome sequencing-based approach using the SOLID 5500xl platform. This strategy allowed the reliable detection of the primary causal mutations in our family, consisting of two novel compound heterozygous mutations in the McKusick-Kaufman syndrome (MKKS) gene (p.D90G and p.V396F). Additionally, exome sequencing enabled the detection of one novel heterozygous NPHP4 variant which is predicted to activate a cryptic acceptor splice site and is only present in the most severely affected patient. Here, we provide an exome sequencing analysis of a BBS family and show the potential utility of this tool, in combination with network analysis, to detect disease-causing mutations and second-site modifiers. Our data demonstrate how next-generation sequencing (NGS) can facilitate the dissection of epistatic phenomena, and shed light on the genetic basis of phenotypic variability.
... To date, exome sequencing studies of rare inherited disorders have generally used a winnowing strategy in which variants are progressively filtered for removal of those deemed unlikely to cause disease. 1 Current statistical frameworks 16,17 and analytic tools such as KGGSeq, 18 VAAST, 19 and VAR-MD 20 filter and prioritize variants on the basis of segregation, predicted pathogenicity, dbSNP information, and genotype quality. Despite some successes, a strict filtering approach carries certain risks. ...
Article
Exome sequencing in families affected by rare genetic disorders has the potential to rapidly identify new disease genes (genes in which mutations cause disease), but the identification of a single causal mutation among thousands of variants remains a significant challenge. We developed a scoring algorithm to prioritize potential causal variants within a family according to segregation with the phenotype, population frequency, predicted effect, and gene expression in the tissue(s) of interest. To narrow the search space in families with multiple affected individuals, we also developed two complementary approaches to exome-based mapping of autosomal-dominant disorders. One approach identifies segments of maximum identity by descent among affected individuals; the other nominates regions on the basis of shared rare variants and the absence of homozygous differences between affected individuals. We showcase our methods by using exome sequence data from families affected by autosomal-dominant retinitis pigmentosa (adRP), a rare disorder characterized by night blindness and progressive vision loss. We performed exome capture and sequencing on 91 samples representing 24 families affected by probable adRP but lacking common disease-causing mutations. Eight of 24 families (33%) were revealed to harbor high-scoring, most likely pathogenic (by clinical assessment) mutations affecting known RP genes. Analysis of the remaining 17 families identified candidate variants in a number of interesting genes, some of which have withstood further segregation testing in extended pedigrees. To empower the search for Mendelian-disease genes in family-based sequencing studies, we implemented them in a cross-platform-compatible software package, MendelScan, which is freely available to the research community.
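Schematically, such a prioritization score might combine the four components as below; the component scales and weights are invented for illustration and are not MendelScan's actual values.

```python
# Schematic composite variant-prioritization score of the kind described
# above. Component scales and weights are invented for illustration and
# are NOT MendelScan's actual values.

def priority_score(segregation, frequency, effect, expression,
                   weights=(0.4, 0.2, 0.25, 0.15)):
    """Each component is pre-scaled to [0, 1]:
    segregation -- fraction of affected members carrying the variant;
    frequency   -- 1 for absent from population databases, ~0 for common;
    effect      -- predicted severity (e.g., 1.0 truncating, 0.5 missense);
    expression  -- relative expression of the gene in the tissue of interest."""
    parts = (segregation, frequency, effect, expression)
    return sum(w * p for w, p in zip(weights, parts))

# A fully segregating, database-absent, truncating variant in a gene
# expressed in the retina scores near the top of the candidate list.
print(priority_score(1.0, 1.0, 1.0, 0.8))  # 0.97
print(priority_score(0.5, 0.2, 0.5, 0.3))  # 0.41, a weak candidate
```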
... 5,6 They also allow testing of parent-of-origin effects. 7 Many tests have been proposed for population-based designs, 8-22 and among them two main classes of tests can be distinguished: the Burden test 12 and the variance-component test. 19 Comparatively, for family-based designs there has been relatively little development. ...
Article
Recent advances in high-throughput sequencing technologies make it increasingly more efficient to sequence large cohorts for many complex traits. We discuss here a class of sequence-based association tests for family-based designs that corresponds naturally to previously proposed population-based tests, including the classical Burden and variance-component tests. This framework allows for a direct comparison between the powers of sequence-based association tests with family- vs population-based designs. We show that for dichotomous traits using family-based controls results in similar power levels as the population-based design (although at an increased sequencing cost for the family-based design), while for continuous traits (in random samples, no ascertainment) the population-based design can be substantially more powerful. A possible disadvantage of population-based designs is that they can lead to increased false-positive rates in the presence of population stratification, while the family-based designs are robust to population stratification. We show also an application to a small exome-sequencing family-based study on autism spectrum disorders. The tests are implemented in publicly available software.
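The two test classes named in this abstract can be contrasted schematically in a simple case-control setting (illustrative textbook-style definitions, not the paper's family-based statistics).

```python
# Schematic contrast of the two test classes named above: a burden test
# collapses genotypes into one score per subject, while a variance-
# component (SKAT-like) statistic sums squared per-variant score
# contributions, so variants with opposite effect directions cannot cancel.
import numpy as np

def burden_stat(G, y):
    """G: subjects x variants genotype matrix; y: 0/1 phenotype array."""
    burden = G.sum(axis=1)               # collapsed score per subject
    return np.dot(y - y.mean(), burden)  # score-type numerator

def skat_like_stat(G, y):
    per_variant = G.T @ (y - y.mean())   # score contribution per variant
    return np.sum(per_variant ** 2)      # squared, so signs can't cancel

rng = np.random.default_rng(1)
G = rng.binomial(2, 0.02, size=(200, 10))          # 10 rare variants
y = rng.binomial(1, 0.5, size=200).astype(float)   # case-control labels
print(burden_stat(G, y), skat_like_stat(G, y))
```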
... Massively parallel sequencing (MPS) technologies have transformed the field of genomic studies (Ansorge, 2009;Mardis, 2008;Metzker, 2010). High-coverage sequencing, which is able to detect essentially every variant present in the sequenced individuals (Coventry et al., 2010;Nelson et al., 2012), has been successful for the identification of genetic variants causing Mendelian diseases (Bamshad et al., 2011;Ionita-Laza et al., 2011;Ng et al., 2010), but it remains cost prohibitive for large samples required for the study of complex traits. Recent studies (Flannick et al., 2012;Li et al., 2011;Pasaniuc et al., 2012) have proposed low-coverage sequencing as a powerful alternative. ...
Article
Recent advances in sequencing technologies have revolutionized genetic studies. Although high-coverage sequencing can uncover most variants present in the sequenced sample, low-coverage sequencing is appealing for its cost effectiveness. Here, we present AbCD (arbitrary coverage design) to aid the design of sequencing-based studies. AbCD is a user-friendly interface providing pre-estimated effective sample sizes, specific to each minor allele frequency category, for designs with arbitrary coverage (0.5–30×) and sample size (20–10 000), and for four major ethnic groups (Europeans, Africans, Asians and African Americans). In addition, we also present two software tools: ShotGun and DesignPlanner, which were used to generate the estimates behind AbCD. ShotGun is a flexible short-read simulator for arbitrary user-specified read length and average depth, allowing cycle-specific sequencing error rates and realistic read depth distributions. DesignPlanner is a full pipeline that uses ShotGun to generate sequence data and performs initial SNP discovery, uses our previously presented linkage disequilibrium-aware method to call genotypes, and, finally, provides minor allele frequency-specific effective sample sizes. ShotGun plus DesignPlanner can accommodate effective sample size estimates for any combination of high-depth and low-depth data (for example, whole-genome low-depth plus exonic high-depth) or combination of sequence and genotype data [for example, whole-exome sequencing plus genotyping from existing Genomewide Association Study (GWAS)]. Availability and implementation: AbCD, including its downloadable terminal interface and web-based interface, and the associated tools ShotGun and DesignPlanner, including documentation, examples and executables, are available at http://www.unc.edu/∼yunmli/AbCD.html. Contact: yunli@med.unc.edu
... The number of individuals of each ancestry needed in a reference panel to reliably determine the likelihood that a particular variant is unique to a patient with an idiopathic condition is an open question, but our analyses, as well as other recent studies (Pelak et al., 2010), suggest that there are diminishing returns in adding more and more genomes to a reference panel in order to cut down on the number of variants falsely inferred as novel, possibly after as few as 8-15 individual genomes. In addition, better methods for predicting the functional consequences of variants of unknown significance are needed, as are methods for leveraging such predictions in more sophisticated pathogenic variant identification strategies (Ionita-Laza et al., 2011;Rope et al., 2011;Torkamani et al., 2011;Yandell et al., 2011). ...
Article
Full-text available
There have been a number of recent successes in the use of whole genome sequencing and sophisticated bioinformatics techniques to identify pathogenic DNA sequence variants responsible for individual idiopathic congenital conditions. However, the success of this identification process is heavily influenced by the ancestry or genetic background of a patient with an idiopathic condition. This is so because potential pathogenic variants in a patient’s genome must be contrasted with variants in a reference set of genomes made up of other individuals’ genomes of the same ancestry as the patient. We explored the effect of ignoring the ancestries of both an individual patient and the individuals used to construct reference genomes. We pursued this exploration in two major steps. We first considered variation in the per-genome number and rates of likely functional derived (i.e., non-ancestral, based on the chimp genome) single nucleotide variants and small indels in 52 individual whole human genomes sampled from 10 different global populations. We took advantage of a suite of computational and bioinformatics techniques to predict the functional effect of over 24 million genomic variants, both coding and non-coding, across these genomes. We found that the typical human genome harbors ∼5.5–6.1 million total derived variants, of which ∼12,000 are likely to have a functional effect (∼5000 coding and ∼7000 non-coding). We also found that the rates of functional genotypes per the total number of genotypes in individual whole genomes differ dramatically between human populations. We then created tables showing how the use of comparator or reference genome panels comprised of genomes from individuals that do not have the same ancestral background as a patient can negatively impact pathogenic variant identification. Our results have important implications for clinical sequencing initiatives.
... However, monogenic disorders resulting from even smaller changes in dosage-sensitive genes often fall below the detection threshold of classical molecular cytogenetic techniques. Massively parallel exome sequencing, combined with statistical filtering methodologies (55) to exclude benign or unrelated variants, is a powerful technique to identify very small causative mutations in a single gene, as shown for Miller syndrome (56), Charcot-Marie-Tooth disease (21) and Kabuki syndrome (57). With the falling costs of whole genome and exome sequencing, it is not unrealistic to assume that the coming years will see a huge increase in the diagnosis and discovery of rare diseases and their causal genes in individual patients and lead to improvements in personalized medical care. ...
Article
Full-text available
Patients with developmental disorders often harbour sub-microscopic deletions or duplications that lead to a disruption of normal gene expression or perturbation in the copy number of dosage-sensitive genes. Clinical interpretation for such patients in isolation is hindered by the rarity and novelty of such disorders. The DECIPHER project (https://decipher.sanger.ac.uk) was established in 2004 as an accessible online repository of genomic and associated phenotypic data with the primary goal of aiding the clinical interpretation of rare copy-number variants (CNVs). DECIPHER integrates information from a variety of bioinformatics resources and uses visualization tools to identify potential disease genes within a CNV. A two-tier access system permits clinicians and clinical scientists to maintain confidential linked anonymous records of phenotypes and CNVs for their patients that, with informed consent, can subsequently be shared with the wider clinical genetics and research communities. Advances in next-generation sequencing technologies are making it practical and affordable to sequence the whole exome/genome of patients who display features suggestive of a genetic disorder. This approach enables the identification of smaller intragenic mutations including single-nucleotide variants that are not accessible even with high-resolution genomic array analysis. This article briefly summarizes the current status and achievements of the DECIPHER project and looks ahead to the opportunities and challenges of jointly analysing structural and sequence variation in the human genome.
... As an alternative to multistep filtering, studies have also employed more formal statistical approaches. One of these methods provides fast computation of approximate P values for individual genes, adjusts for the background variation in each gene, allows for incorporation of functional or linkage-based information, and accommodates designs based on both affected relative pairs and unrelated affected individuals (17). A different unified framework for variant discovery that does not involve a formal statistical approach but consists of three steps: (1) data processing, (2) variant discovery, and (3) integration with known variants and other information, such as pedigrees and population structure to recalibrate variant quality, has also been developed by another group (18). ...
Article
Common genetic risk variants identified by Genome-Wide Association (GWA) studies over the past decade have explained a small portion of disease heritability in complex diseases. It is becoming apparent that each gene/locus is heterogeneous and that multiple rare independent risk alleles across the population contribute to disease risk. Next generation sequencing technologies have reached the maturity and low cost necessary to perform whole genome, whole exome, and targeted region sequencing to identify all rare risk alleles across a population, a task that is not possible to achieve by genotyping. Designing whole genome, whole exome, and targeted sequencing projects to identify disease variants for complex lung diseases requires balancing issues related to the four main steps of such projects: library preparation, sequencing, sequence data analysis, and statistical analysis. Although data analysis approaches are still evolving, a number of published studies have successfully identified rare variants associated with complex disease. Despite many challenges that lie ahead in applying these technologies to lung disease, rare variants are likely to be a critical piece of the puzzle that needs to be solved to understand the genetic basis of complex lung disease and to use this information to develop better therapies.
... However, such methods are not applicable to very rare variants or those only observed once or twice in a sample. Methods to deal with such rare variants have previously been discussed,3,4 and may consist simply of a comparison between the combined counts of all rare variants observed in cases and controls.5 This approach, however, is limited: it can be difficult to classify a variant as “rare”, and common variants cannot be utilized because their much larger allele counts tend to swamp the signal from rare variants. ...
Article
Full-text available
Previously described methods for the combined analysis of common and rare variants have disadvantages such as requiring an arbitrary classification of variants or permutation testing to assess statistical significance. Here we propose a novel method which implements a weighting scheme based on allele frequencies observed in both cases and controls. Because the test is unbiased, scores can be analyzed with a standard t-test. To test its validity we applied it to data for common, rare, and very rare variants simulated under the null hypothesis. To test its power we applied it to simulated data in which association was present, including data using the observed allele frequencies of common and rare variants in NOD2 previously reported in cases of Crohn's disease and controls. The method produced results that conformed well to those expected under the null hypothesis. It demonstrated more power to detect association when rare and common variants were analyzed jointly, the power further increasing when rare variants were assigned higher weights. 20,000 analyses of a gene containing 62 variants could be performed in 80 minutes on a laptop. This approach shows promise for the analysis of data currently emerging from genome wide sequencing studies.
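To make the idea concrete, here is a hedged sketch of this style of test: weight each variant by a function of its allele frequency pooled over cases and controls, score each individual, and compare the groups with a standard t-test. The specific weight function below is an assumption for illustration, not necessarily the one used in the paper.

```python
# Sketch: frequency-weighted per-individual scores compared by a t-test.
# Pooled (cases + controls) frequencies keep the weights phenotype-blind,
# which is what makes a standard t-test on the scores reasonable.
import numpy as np
from scipy import stats

def weighted_score_test(G, y):
    q = G.mean(axis=0) / 2                  # pooled allele frequencies
    w = 1.0 / np.sqrt(q * (1 - q) + 1e-12)  # up-weight rarer variants
    s = G @ w                               # per-individual score
    return stats.ttest_ind(s[y == 1], s[y == 0]).pvalue
```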
... Over the past 2 years, several variant annotation pipelines have been developed by many different groups [26, 71-78]. In Table 2, we have summarized some current software tools that are capable of annotating genetic variants from high-throughput sequencing data. ...
Article
Full-text available
The pace of exome and genome sequencing is accelerating, with the identification of many new disease-causing mutations in research settings, and it is likely that whole exome or genome sequencing could have a major impact in the clinical arena in the relatively near future. However, the human genomics community is currently facing several challenges, including phenotyping, sample collection, sequencing strategies, bioinformatics analysis, biological validation of variant function, clinical interpretation and validity of variant data, and delivery of genomic information to various constituents. Here we review these challenges and summarize the bottlenecks for the clinical application of exome and genome sequencing, and we discuss ways for moving the field forward. In particular, we stress the need for clinical-grade sample collection, high-quality sequencing data acquisition, digitalized phenotyping, rigorous generation of variant calls, and comprehensive functional annotation of variants. Additionally, we suggest that a 'networking of science' model that encourages much more collaboration and online sharing of medical history, genomic data and biological knowledge, including among research participants and consumers/patients, will help establish causation and penetrance for disease causal variants and genes. As we enter this new era of genomic medicine, we envision that consumer-driven and consumer-oriented efforts will take center stage, thus allowing insights from the human genome project to translate directly back into individualized medicine.
Article
The characterization of gene-environment interactions (GEIs) can provide detailed insights into the biological mechanisms underlying complex diseases. Despite recent interest in GEIs for rare variants, published GEI tests are underpowered when an extremely small proportion of the rare variants in a gene or a region is causal. By extending the aggregated Cauchy association test (ACAT), we propose three GEI tests to address this issue: a Cauchy combination GEI test with fixed main effects (CCGEI-F), a Cauchy combination GEI test with random main effects (CCGEI-R), and an omnibus Cauchy combination GEI test (CCGEI-O). ACAT was applied to combine p values of single-variant GEI analyses to obtain CCGEI-F and CCGEI-R, and p values of multiple GEI tests were combined in CCGEI-O. In numerical simulations, for small numbers of causal variants, CCGEI-F, CCGEI-R and CCGEI-O provided approximately 5% higher power than the existing GEI tests INT-FIX and INT-RAN, and only slightly higher power than the existing GEI test TOW-GE. For large numbers of causal variants, although CCGEI-F and CCGEI-R exhibited comparable or slightly lower power than the competing tests, the results were still satisfactory. Among all simulation conditions evaluated, CCGEI-O provided significantly higher power than competing GEI tests. We further applied our GEI tests in genome-wide analyses of systolic blood pressure or diastolic blood pressure to detect gene-body mass index (BMI) interactions, using whole-exome sequencing data from UK Biobank. At a suggestive significance level of 1.0 × 10⁻⁴, KCNC4, GAR1, FAM120AOS and NT5C3B showed interactions with BMI by our GEI tests.
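The Cauchy combination these tests build on is compact enough to sketch directly; equal weights in this toy version are an illustrative assumption.

```python
# Sketch of the aggregated Cauchy combination: map each p value to a
# Cauchy variate, average, and convert the average back to a p value.
import numpy as np

def cauchy_combine(pvals):
    p = np.clip(np.asarray(pvals, float), 1e-15, 1 - 1e-15)
    t = np.mean(np.tan((0.5 - p) * np.pi))  # mean of Cauchy transforms
    return 0.5 - np.arctan(t) / np.pi       # survival of a standard Cauchy

print(cauchy_combine([0.01, 0.30, 0.76]))
```

The attraction of this combination rule is that the resulting p value remains accurate in the tail even when the component tests are correlated, which is what lets single-variant GEI p values be aggregated without estimating their correlation.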
Preprint
Full-text available
Background: Next-generation whole exome sequencing (WES) is ubiquitous as an early step in the diagnosis of rare diseases and the interpretation of variants of unknown significance (VUS). Developmental and epileptic encephalopathies (DEE) are a group of rare devastating epilepsies, many of which have unknown causes. Increasing WES in the clinic has identified several rare monogenic DEEs caused by ion channel variants. However, WES often fails to provide actionable insight, due to the challenges of proposing functional hypotheses for candidate variants. Here, we describe a “personalized structural biology” (PSB) approach that addresses this challenge by leveraging recent innovations in the determination and analysis of protein 3D structures. Results: We illustrate the power of the PSB approach in an individual from the Undiagnosed Diseases Network (UDN) with DEE symptoms who has a novel de novo VUS in KCNC2 (p.V469L), the gene that encodes the Kv3.2 voltage-gated potassium channel. A nearby KCNC2 variant (p.V471L) was recently suggested to cause DEE-like phenotypes. We find that both variants are located in the conserved hinge region of the S6 helix and are likely to affect protein function. However, despite their proximity, computational structural modeling suggests that the V469L variant is likely to sterically block the channel pore, while the V471L variant is likely to stabilize the open state. Biochemical and electrophysiological analyses demonstrate heterogeneous loss-of-function and gain-of-function effects, respectively, as well as differential inhibition in response to 4-aminopyridine (4-AP) treatment. Using computational structural modeling and molecular dynamics simulations, we illustrate that the pore of the V469L variant is more constricted, increasing the energetic barrier for K⁺ permeation, whereas the V471L variant stabilizes the open conformation. Conclusions: Our results implicate KCNC2 as a causative gene for DEE and guided the interpretation of a UDN case. They further delineate the molecular basis for the heterogeneous clinical phenotypes resulting from two proximal pathogenic variants. This demonstrates how the PSB approach can provide an analytical framework for individualized hypothesis-driven interpretation of protein-coding VUS suspected to contribute to disease.
Article
Rare diseases affect millions of people worldwide, and discovering their genetic causes is challenging. More than half of the individuals analyzed by the Undiagnosed Diseases Network (UDN) remain undiagnosed. The central hypothesis of this work is that many of these rare genetic disorders are caused by multiple variants in more than one gene. However, given the large number of variants in each individual genome, experimentally evaluating combinations of variants for potential to cause disease is currently infeasible. To address this challenge, we developed the digenic predictor (DiGePred), a random forest classifier for identifying candidate digenic disease gene pairs by features derived from biological networks, genomics, evolutionary history, and functional annotations. We trained the DiGePred classifier by using DIDA, the largest available database of known digenic-disease-causing gene pairs, and several sets of non-digenic gene pairs, including variant pairs derived from unaffected relatives of UDN individuals. DiGePred achieved high precision and recall in cross-validation and on a held-out test set (PR area under the curve > 77%), and we further demonstrate its utility by using digenic pairs from the recent literature. In contrast to other approaches, DiGePred also appropriately controls the number of false positives when applied in realistic clinical settings. Finally, to enable the rapid screening of variant gene pairs for digenic disease potential, we freely provide the predictions of DiGePred on all human gene pairs. Our work enables the discovery of genetic causes for rare non-monogenic diseases by providing a means to rapidly evaluate variant gene pairs for the potential to cause digenic disease.
Preprint
Full-text available
We previously demonstrated how sharing of rare variants (RVs) in distant affected relatives can be used to identify variants causing a complex and heterogeneous disease. This approach tested whether single RVs were shared by all sequenced affected family members. However, as with other study designs, joint analysis of several RVs (e.g. within genes) is sometimes required to obtain sufficient statistical power. Further, phenocopies can lead to false negatives for some causal RVs if complete sharing among affecteds is required. Here we extend our methodology (Rare Variant Sharing, RVS) to address these issues. Specifically, we introduce gene-based analyses, refine RV definition based on haplotypes, and introduce a partial sharing test based on RV sharing probabilities for subsets of affected family members. RVS also has the desirable features of not requiring external estimates of variant frequency or control samples, provides functionality to assess and address violations of key assumptions, and is available as open source software for genome-wide analysis. Simulations including phenocopies, based on the families of an oral cleft study, revealed the partial and complete sharing versions of RVS achieved similar statistical power compared to alternative methods (RareIBD and the Gene-Based Segregation Test), and had superior power compared to the pedigree Variant Annotation, Analysis and Search Tool (pVAAST) linkage statistic. In studies of multiplex cleft families, analysis of rare single nucleotide variants in the exome of 151 affected relatives from 54 families revealed no significant excess sharing in any one gene, but highlighted different patterns of sharing revealed by the complete and partial sharing tests.
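For intuition, a toy calculation in the spirit of the sharing probabilities used here (not the RVS implementation itself): if a rare variant enters a pedigree on a single founder haplotype, each of two affected siblings inherits it independently with probability 1/2, so

$$P(\text{both carry it} \mid \text{seen in at least one sibling}) = \frac{1/4}{3/4} = \frac{1}{3},$$

and the partial sharing test aggregates probabilities of this kind over subsets of the affected relatives rather than requiring that all of them carry the variant.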
Chapter
The genomic characterization of millions of individuals is immensely useful for medical research. With the decreasing cost of genome sequencing, this is now more feasible than ever. Having a larger number of studied individuals is almost assured to boost statistical power for future discoveries; therefore, genomes from increasingly more individuals will be sequenced going forward. Understanding genomic variation is essential to understanding the mechanisms behind genetic abnormalities, so sharing genomic data is important in the clinical setting, in basic research, and for reproducibility. In this chapter, I will go over the importance of data sharing for health and biomedical studies. I will then discuss the importance of data sharing in curing diseases, especially from a statistical point of view. I will conclude the chapter with the contributions of projects with broad genomic data-sharing policies to the biomedical sciences and the patient's perspective on genomic data sharing.
Article
Advanced technology in whole-genome sequencing has offered the opportunity to comprehensively investigate the genetic contribution, particularly rare variants, to complex traits. Several region-based tests have been developed to jointly model the marginal effect of rare variants, but methods to detect gene-environment (GE) interactions are underdeveloped. Identifying the modification effects of environmental factors on genetic risk poses a considerable challenge. To tackle this challenge, we develop a method to detect GE interactions for rare variants using generalized linear mixed effect model. The proposed method can accommodate either binary or continuous traits in related or unrelated samples. Under this model, genetic main effects, GE interactions, and sample relatedness are modeled as random effects. We adopt a kernel-based method to leverage the joint information across rare variants and implement variance component score tests to reduce the computational burden. Our simulation studies of continuous and binary traits show that the proposed method maintains correct type I error rates and appropriate power under various scenarios, such as genotype main effects and GE interaction effects in opposite directions and varying the proportion of causal variants in the model. We apply our method in the Framingham Heart Study to test GE interaction of smoking on body mass index or overweight status and replicate the Cholinergic Receptor Nicotinic Beta 4 gene association reported in previous large consortium meta-analysis of single nucleotide polymorphism-smoking interaction. Our proposed set-based GE test is computationally efficient and is applicable to both binary and continuous phenotypes, while appropriately accounting for familial or cryptic relatedness.
Preprint
Advanced technology in whole-genome sequencing has offered the opportunity to comprehensively investigate the genetic contribution, particularly rare variants, to complex traits. Many rare-variant analysis methods have been developed to jointly model the marginal effect, but methods to detect gene-environment (GE) interactions are underdeveloped. Identifying the modification effects of environmental factors on genetic risk poses a considerable challenge. To tackle this challenge, we develop a unified method to detect GE interactions of a set of rare variants using a generalized linear mixed effect model. The proposed method can accommodate both binary and continuous traits in related or unrelated samples. Under this model, genetic main effects, sample relatedness and GE interactions are modeled as random effects. We adopt a kernel-based method to leverage the joint information across rare variants and implement variance component score tests to reduce the computational burden. Our simulation study shows that the proposed method maintains correct type I error rates and high power under various scenarios, such as opposite directions of genotype main effects and GE interaction effects and varying proportions of causal variants in the model, for both continuous and binary traits. We illustrate our method by testing gene-based interaction with smoking on body mass index or overweight status in the Framingham Heart Study, replicating the CHRNB4 gene association reported in a previous large consortium meta-analysis of single nucleotide polymorphism (SNP)-smoking interaction. Our proposed set-based GE test is computationally efficient and is applicable to both binary and continuous phenotypes, while appropriately accounting for familial or cryptic relatedness.
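A generic variance-component score test of the kind both versions of this work build on can be sketched briefly; the linear kernel, the intercept-only null, and the Satterthwaite-style chi-square approximation below are illustrative assumptions (a GE-interaction version would build the kernel from genotype-by-environment columns and fit the main effects under the null).

```python
# Sketch of a kernel (SKAT-style) variance-component score test.
# G: matrix of the columns being tested (individuals x features);
# y: continuous phenotype. Null model here is intercept-only.
import numpy as np
from scipy import stats

def vc_score_test(G, y):
    r = y - y.mean()                      # residuals under the null
    K = G @ G.T                           # linear kernel on the feature set
    q = r @ K @ r                         # variance-component score statistic
    n = len(y)
    P = np.eye(n) - 1.0 / n               # projection removing the intercept
    A = P @ K @ P * r.var()
    mu, var = np.trace(A), 2.0 * np.trace(A @ A)
    scale, df = var / (2.0 * mu), 2.0 * mu ** 2 / var
    return stats.chi2.sf(q / scale, df)   # moment-matched p value
```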
Chapter
The field of DNA sequencing experienced a transformational shift beginning in 2005 with the introduction of the first high-throughput, massively parallel DNA sequencing platform that ushered in the era of “next-generation sequencing.” Initially, next-generation sequencing (NGS) platforms generated millions of bases per instrument run which steadily progressed to the now routine output of billions of bases. These unprecedented data volumes have driven a renaissance in bioinformatics research and development resulting in a proliferation of open-source and commercial software algorithms to support the computational processing, analysis, and interpretation of NGS results. These efforts have facilitated a broad dissemination of NGS into every facet of biomedical research and into a growing list of clinical diagnostic applications from targeted multigene panels to whole-genome sequencing.
Chapter
The field of DNA sequencing began a transformational shift in 2005 with the introduction of the first high throughput, massively parallel DNA sequencing platform that ushered in the era of “next generation sequencing.” Initially, next generation sequencing (NGS) platforms generated millions of bases per instrument run which steadily progressed to the now routine outputs in billions of bases. These unprecedented data volumes have driven a renaissance in bioinformatics research and development resulting in a proliferation of open-source and commercial algorithms and software to support the computational processing, analysis, and interpretation of NGS results. These efforts have facilitated a broad dissemination of NGS into every facet of biomedical research and more recently into various clinical diagnostic applications from multi-gene panels to exome sequencing and whole-genome sequencing (WGS). Every laboratory adopting NGS has undergone two learning curves, one regarding the implementation of new chemistries and instrumentation and the second being the acquisition of the knowledge and skill sets necessary for the analysis of NGS data. The latter curve has proven to be a significant bottleneck for most laboratories. In this chapter, basic concepts and principles of bioinformatics required for the analysis of NGS data are presented. We discuss the spectrum of NGS data generation, processing and alignment, variant calling and interpretation. The Illumina and Ion Torrent sequencing technologies are emphasized due to their current dominant roles in the NGS landscape. We include bioinformatics considerations and approaches for clinical diagnostic applications. A subsection is devoted to computational approaches for the identification of candidate genes from exome sequencing and WGS studies.
Article
Full-text available
Rare-variant association testing usually requires some method of aggregation. The next important step is to pinpoint individual rare causal variants among a large number of variants within a genetic region. Recently, Ionita-Laza et al. proposed a backward elimination (BE) procedure that can identify individual causal variants among the many variants in a gene. The BE procedure removes a variant if excluding this variant can lead to a smaller P-value for the BURDEN test (referred to as "BE-BURDEN") or the SKAT test (referred to as "BE-SKAT"). Here we use the adaptive combination of P-values (ADA) method to pinpoint causal variants. Unlike most gene-based association tests, the ADA statistic is built upon per-site P-values of individual variants. It is straightforward to select important variants given the optimal P-value truncation threshold found by ADA. We performed comprehensive simulations to compare ADA with BE-SKAT and BE-BURDEN. Ranking these three approaches according to positive predictive values (PPVs), the percentage of truly causal variants among the total selected variants, we found ADA > BE-SKAT > BE-BURDEN across all simulation scenarios. We therefore recommend using ADA to pinpoint plausible rare causal variants in a gene.
Article
Rationale: Genomic regions identified by genome-wide association studies explain only a small fraction of heritability for chronic obstructive pulmonary disease (COPD). Alpha-1 antitrypsin deficiency shows that rare coding variants of large effect also influence COPD susceptibility. We hypothesized that exome sequencing in families identified through a proband with severe, early-onset COPD would identify additional rare genetic determinants of large effect. Objective: To identify rare genetic determinants of severe COPD. Methods: We applied filtering approaches to identify potential causal variants for COPD in whole exomes from 347 subjects in 49 extended pedigrees from the Boston Early-Onset COPD Study. We assessed the power of this approach under different levels of genetic heterogeneity using simulations. We tested genes identified in these families using gene-based association tests in exomes of 204 cases with severe COPD and 195 resistant smokers from the COPDGene study. In addition, we examined previously described loci associated with COPD using these datasets. Results: We identified 69 genes with predicted deleterious non-synonymous, stop, or splice variants that segregated with severe COPD in at least two pedigrees. Four genes (DNAH8, ALCAM, RARS and GBF1) also demonstrated an increase in rare non-synonymous, stop and/or splice mutations in cases compared to resistant smokers from the COPDGene study; however, these results were not statistically significant. We demonstrate the limitations of the power of this approach under genetic heterogeneity through simulation. Conclusions: Rare deleterious coding variants may increase risk for COPD, but multiple genes likely contribute to COPD susceptibility.
Article
Full-text available
This work reviews the most relevant present-day processing methods used to improve the accuracy of multimodal nonlinear images in the detection of epithelial cancer and the supporting stroma. Special emphasis has been placed on methods of nonlinear optical (NLO) microscopy image processing such as: second harmonic to autofluorescence ageing index of dermis (SAAID), tumor-associated collagen signatures (TACS), fast Fourier transform (FFT) analysis, and gray level co-occurrence matrix (GLCM)-based methods. These strategies are presented as a set of potentially valuable diagnostic tools for early cancer detection. It may be proposed that the combination of NLO microscopy and the informatics-based image analysis approaches described in this review (all carried out on free software) may represent a powerful tool to investigate collagen organization and remodeling of the extracellular matrix in carcinogenesis processes.
Article
Over the past few years, association analysis has become the primary tool for finding genes that underlie complex traits. Both population-based and family-based designs are commonly used designs in genetic association studies. Recent technological advances in exome and whole genome sequencing afford the next generation of sequence-based association studies. We review here recent developments in statistical methodology and remaining challenges related to sequence-based association studies with both population-based and family-based designs.
Article
Full-text available
Massively parallel sequencing greatly facilitates the discovery of novel disease genes causing Mendelian and oligogenic disorders. However, many mutations are present in any individual genome, and identifying which ones are disease causing remains a largely open problem. We introduce eXtasy, an approach to prioritize nonsynonymous single-nucleotide variants (nSNVs) that substantially improves prediction of disease-causing variants in exome sequencing data by integrating variant impact prediction, haploinsufficiency prediction and phenotype-specific gene prioritization.
Article
Genetic variation explains some of the observed heterogeneity in patients' risk for developing the acute respiratory distress syndrome (ARDS). Although the lack of extant family pedigrees for ARDS precludes an estimate of heritability of the syndrome, ARDS may function as a pattern of response to injury or infection, traits that exhibit strong heritability. A total of 34 genes have now been reported to influence ARDS susceptibility, the majority of which arose as candidate genes based on the current pathophysiological understanding of ARDS, with particular focus on inflammation and endothelial or epithelial injury. In addition, novel candidate genes have emerged from agnostic genetic approaches, including genome-wide association studies, orthologous gene expression profiling across animal models of lung injury, and human peripheral blood gene expression data. The genetic risk for ARDS seems to vary both by ancestry and by the subtype of ARDS, suggesting that both factors may be valid considerations in clinical trial design.
Article
Full-text available
Motivation: For the analysis of rare variants in sequence data, numerous approaches have been suggested. Fixed and flexible threshold approaches collapse the rare variant information of a genomic region into a test statistic with reduced dimensionality. Alternatively, the rare variant information can be combined in statistical frameworks that are based on suitable regression models, machine learning, etc. Although the existing approaches provide powerful tests that can incorporate information on allele frequencies and prior biological knowledge, differences in the spatial clustering of rare variants between cases and controls cannot be incorporated. Based on the assumption that deleterious variants and protective variants cluster or occur in different parts of the genomic region of interest, we propose a testing strategy for rare variants that builds on spatial cluster methodology and that guides the identification of the biologically relevant segments of the region. Our approach does not require any assumption about the directions of the genetic effects. Results: In simulation studies, we assess the power of the clustering approach and compare it with existing methodology. Our simulation results suggest that the clustering approach for rare variants is well powered, even in situations that are ideal for standard methods. The efficiency of our spatial clustering approach is not affected by the presence of rare variants that have opposite effect size directions. An application to a sequencing study for non-syndromic cleft lip with or without cleft palate (NSCL/P) demonstrates its practical relevance. The proposed testing strategy is applied to a genomic region on chromosome 15q13.3 that was implicated in NSCL/P etiology in a previous genome-wide association study, and its results are compared with standard approaches. Availability: Source code and documentation for the implementation in R will be provided online. Currently, the R implementation supports only genotype data; we are working on an extension for VCF files. Contact: heide.fier@googlemail.com
Article
Context: Advances in sequencing technology with the commercialization of next-generation sequencing (NGS) has substantially increased the feasibility of sequencing human genomes and exomes. Next-generation sequencing has been successfully applied to the discovery of disease-causing genes in rare, inherited disorders. By necessity, the advent of NGS has fostered the concurrent development of bioinformatics approaches to expeditiously analyze the large data sets generated. Next-generation sequencing has been used for important discoveries in the research setting and is now being implemented into the clinical diagnostic arena. Objective: To review the current literature on technical and bioinformatics approaches for exome and genome sequencing and highlight examples of successful disease gene discovery in inherited disorders. To discuss the challenges for implementing NGS in the clinical research and diagnostic arenas. Data sources: Literature review and authors' experience. Conclusions: Next-generation sequencing approaches are powerful and require an investment in infrastructure and personnel expertise for effective use; however, the potential for improvement of patient care through faster and more accurate molecular diagnoses is high.
Article
The main result of this paper is a lower bound for the essential spectrum of Schrödinger operators −Δ+V on Riemannian manifolds. In particular, we obtain conditions on V which imply the discreteness of the spectrum, or equivalently, the compactness of the resolvent.
Article
This review examines the application of next-generation sequencing (NGS) technologies in the identification of the causation of nonsyndromic genetic cardiomyopathies. NGS sequencing of the entire genetic coding sequence (the exome) has successfully identified five novel genes and causative variants for cardiomyopathies without previously known cause within the last 12 months. Continual rapidly decreasing costs of NGS will shortly allow cost-effective sequencing of the entire genomes of affected individuals and their relatives to include noncoding and regulatory variant discovery and epigenetic profiling. Despite this rapid technological progress with sequencing, analysis of these large data sets remains challenging, particularly for assigning causality to novel rare variants identified in DNA samples from patients with cardiomyopathy. NGS technologies are rapidly moving to identify novel rare variants in patients with cardiomyopathy, but assigning pathogenicity to these novel variants remains challenging.
Article
Full-text available
Author Summary: Developments in sequencing technology now enable us to assay all genetic variation, much of which is extremely rare. We propose to test the distribution of rare variants we observe in cases versus controls. To do so, we present a novel application of the C-alpha statistic to test these rare variants. C-alpha aims to determine whether the set of variants observed in cases and controls is a mixture, such that some of the variants confer risk or protection or are phenotypically neutral. Risk variants are expected to be more common in cases; protective variants more common in controls. C-alpha is sensitive to this imbalance, regardless of its origin—risk, protective, or both—but is ideally suited for a mixture of protective and risk variants. Variation in APOB nicely illustrates a mixture, in that certain rare variants increase triglyceride levels while others decrease it. The hallmark feature of C-alpha is that it uses the distribution of variation observed in cases and controls to detect the presence of a mixture, thus implicating genes or pathways as risk factors for disease.
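For orientation, the core of the C-alpha statistic as commonly written (our notation, so treat the details as an assumption): with $m$ variants, variant $i$ observed $n_i$ times in total and $y_i$ of those in cases, and $p_0$ the null proportion of observations expected in cases,

$$T = \sum_{i=1}^{m}\Big[(y_i - n_i p_0)^2 - n_i p_0 (1 - p_0)\Big],$$

standardized by its null variance to give a one-sided test. Overdispersion of the $y_i$, whether driven by risk variants, protective variants, or a mixture of both, inflates $T$, which is exactly the sensitivity the summary describes.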
Article
Full-text available
Rapid advances in sequencing technologies set the stage for the large-scale medical sequencing efforts to be performed in the near future, with the goal of assessing the importance of rare variants in complex diseases. The discovery of new disease susceptibility genes requires powerful statistical methods for rare variant analysis. The low frequency and the expected large number of such variants pose great difficulties for the analysis of these data. We propose here a robust and powerful testing strategy to study the role rare variants may play in affecting susceptibility to complex traits. The strategy is based on assessing whether rare variants in a genetic region collectively occur at significantly higher frequencies in cases compared with controls (or vice versa). A main feature of the proposed methodology is that, although it is an overall test assessing a possibly large number of rare variants simultaneously, the disease variants can be both protective and risk variants, with moderate decreases in statistical power when both types of variants are present. Using simulations, we show that this approach can be powerful under complex and general disease models, as well as in larger genetic regions where the proportion of disease susceptibility variants may be small. Comparisons with previously published tests on simulated data show that the proposed approach can have better power than the existing methods. An application to a recently published study on Type-1 Diabetes finds rare variants in gene IFIH1 to be protective against Type-1 Diabetes.
Article
Full-text available
Sequencing technologies are becoming cheap enough to apply to large numbers of study participants and promise to provide new insights into human phenotypes by bringing to light rare and previously unknown genetic variants. We develop a new framework for the analysis of sequence data that incorporates all of the major features of previously proposed approaches, including those focused on allele counts and allele burden, but is both more general and more powerful. We harness population genetic theory to provide prior information on effect sizes and to create a pooling strategy for information from rare variants. Our method, EMMPAT (Evolutionary Mixed Model for Pooled Association Testing), generates a single test per gene (substantially reducing multiple testing concerns), facilitates graphical summaries, and improves the interpretation of results by allowing calculation of attributable variance. Simulations show that, relative to previously used approaches, our method increases the power to detect genes that affect phenotype when natural selection has kept alleles with large effect sizes rare. We demonstrate our approach on a population-based re-sequencing study of association between serum triglycerides and variation in ANGPTL4.
Article
Full-text available
There is solid evidence that rare variants contribute to complex disease etiology. Next-generation sequencing technologies make it possible to uncover rare variants within candidate genes, exomes, and genomes. Working in a novel framework, the kernel-based adaptive cluster (KBAC) was developed to perform powerful gene/locus based rare variant association testing. The KBAC combines variant classification and association testing in a coherent framework. Covariates can also be incorporated in the analysis to control for potential confounders including age, sex, and population substructure. To evaluate the power of KBAC: 1) variant data were simulated using rigorous population genetic models for both Europeans and Africans, with parameters estimated from sequence data, and 2) phenotypes were generated using models motivated by complex diseases including breast cancer and Hirschsprung's disease. It is demonstrated that the KBAC has superior power compared to other rare variant analysis methods, such as the combined multivariate and collapsing method and the weighted sum statistic. In the presence of variant misclassification and gene interaction, association testing using KBAC is particularly advantageous. The KBAC method was also applied to test for associations, using sequence data from the Dallas Heart Study, between energy metabolism traits and rare variants in the ANGPTL3, ANGPTL4, ANGPTL5 and ANGPTL6 genes. A number of novel associations were identified, including associations of high density lipoprotein and very low density lipoprotein with ANGPTL4. The KBAC method is implemented in a user-friendly R package.
Article
Full-text available
Genome wide association (GWA) studies, which test for association between common genetic markers and a disease phenotype, have shown varying degrees of success. While many factors could potentially confound GWA studies, we focus on the possibility that multiple, rare variants (RVs) may act in concert to influence disease etiology. Here, we describe an algorithm for RV analysis, RareCover. The algorithm combines a disparate collection of RVs with low effect and modest penetrance. Further, it does not require the rare variants be adjacent in location. Extensive simulations over a range of assumed penetrance and population attributable risk (PAR) values illustrate the power of our approach over other published methods, including the collapsing and weighted-collapsing strategies. To showcase the method, we apply RareCover to re-sequencing data from a cohort of 289 individuals at the extremes of Body Mass Index distribution (NCT00263042). Individual samples were re-sequenced at two genes, FAAH and MGLL, known to be involved in endocannabinoid metabolism (187Kbp for 148 obese and 150 controls). The RareCover analysis identifies exactly one significantly associated region in each gene, each about 5 Kbp in the upstream regulatory regions. The data suggests that the RVs help disrupt the expression of the two genes, leading to lowered metabolism of the corresponding cannabinoids. Overall, our results point to the power of including RVs in measuring genetic associations.
Article
Full-text available
We demonstrate the successful application of exome sequencing to discover a gene for an autosomal dominant disorder, Kabuki syndrome (OMIM %147920). We subjected the exomes of ten unrelated probands to massively parallel sequencing. After filtering against existing SNP databases, there was no compelling candidate gene containing previously unknown variants in all affected individuals. Less stringent filtering criteria allowed for the presence of modest genetic heterogeneity or missing data but also identified multiple candidate genes. However, genotypic and phenotypic stratification highlighted MLL2, which encodes a Trithorax-group histone methyltransferase: seven probands had newly identified nonsense or frameshift mutations in this gene. Follow-up Sanger sequencing detected MLL2 mutations in two of the three remaining individuals with Kabuki syndrome (cases) and in 26 of 43 additional cases. In families where parental DNA was available, the mutation was confirmed to be de novo (n = 12) or transmitted (n = 2) in concordance with phenotype. Our results strongly suggest that mutations in MLL2 are a major cause of Kabuki syndrome.
Article
Full-text available
Deep sequencing will soon generate comprehensive sequence information in large disease samples. Although the power to detect association with an individual rare variant is limited, pooling variants by gene or pathway into a composite test provides an alternative strategy for identifying susceptibility genes. We describe a statistical method for detecting association of multiple rare variants in protein-coding genes with a quantitative or dichotomous trait. The approach is based on the regression of phenotypic values on individuals' genotype scores subject to a variable allele-frequency threshold, incorporating computational predictions of the functional effects of missense variants. Statistical significance is assessed by permutation testing with variable thresholds. We used a rigorous population-genetics simulation framework to evaluate the power of the method, and we applied the method to empirical sequencing data from three disease studies.
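A schematic version of this variable-threshold scan, under our own simplifying assumptions (a correlation-based z-score in place of the paper's regression score, and no functional-prediction weights), looks like this:

```python
# Sketch of a variable-threshold (VT) rare-variant test: scan allele-
# frequency thresholds, collapse variants below each one, keep the best
# standardized association score, and calibrate it by permutation.
import numpy as np

def vt_statistic(G, y, freqs):
    best = 0.0
    for t in np.unique(freqs):
        x = G[:, freqs <= t].sum(axis=1)   # burden under threshold t
        if x.std() == 0:
            continue
        z = abs(np.corrcoef(x, y)[0, 1]) * np.sqrt(len(y))
        best = max(best, z)
    return best

def vt_pvalue(G, y, freqs, n_perm=1000, seed=1):
    rng = np.random.default_rng(seed)
    obs = vt_statistic(G, y, freqs)
    null = [vt_statistic(G, rng.permutation(y), freqs)
            for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in null)) / (n_perm + 1)
```

Maximizing over thresholds is what the permutation step must account for; permuting phenotypes and re-running the whole scan keeps the maximization honest.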
Article
Full-text available
Since associations between complex diseases and common variants are typically weak, and approaches to genotyping rare variants (e.g. by next-generation resequencing) multiply, there is an urgent demand to develop powerful association tests that are able to detect disease associations with both common and rare variants. In this article we present such a test. It is based on data-adaptive modifications to a so-called Sum test originally proposed for common variants, which aims to strike a balance between utilizing information on multiple markers in linkage disequilibrium and reducing the cost of large degrees of freedom or of multiple testing adjustment. When applied to multiple common or rare variants in a candidate region, the proposed test is easy to use with 1 degree of freedom and without the need for multiple testing adjustment. We show that the proposed test has high power across a wide range of scenarios with either common or rare variants, or both. In particular, in some situations the proposed test performs better than several commonly used methods.
Article
Full-text available
We demonstrate the first successful application of exome sequencing to discover the gene for a rare mendelian disorder of unknown cause, Miller syndrome (MIM %263750). For four affected individuals in three independent kindreds, we captured and sequenced coding regions to a mean coverage of 40× and sufficient depth to call variants at approximately 97% of each targeted exome. Filtering against public SNP databases and eight HapMap exomes for genes with two previously unknown variants in each of the four individuals identified a single candidate gene, DHODH, which encodes a key enzyme in the pyrimidine de novo biosynthesis pathway. Sanger sequencing confirmed the presence of DHODH mutations in three additional families with Miller syndrome. Exome sequencing of a small number of unrelated affected individuals is a powerful, efficient strategy for identifying the genes underlying rare mendelian disorders and will likely transform the genetic analysis of monogenic traits.
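The intersection filter described here is simple to express in code; the per-individual dictionary layout below is an assumed toy representation, not the study's pipeline.

```python
# Toy version of the Miller-syndrome filter: keep genes with at least two
# previously unknown (post SNP-database filtering) variants in every
# affected individual, then intersect across individuals.
def candidate_genes(novel_counts, min_variants=2):
    """novel_counts: list of {gene: novel_variant_count} dicts, one per
    affected individual (counts taken after filtering known SNPs)."""
    per_person = [
        {g for g, n in person.items() if n >= min_variants}
        for person in novel_counts
    ]
    return set.intersection(*per_person)  # genes surviving in everyone

# e.g. candidate_genes([{"DHODH": 2, "GENE1": 1}, {"DHODH": 2}]) == {"DHODH"}
```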
Article
Full-text available
The effect of genetic mutation on phenotype is of significant interest in genetics. The type of genetic mutation that causes a single amino acid substitution (AAS) in a protein sequence is called a non-synonymous single nucleotide polymorphism (nsSNP). An nsSNP could potentially affect the function of the protein, subsequently altering the carrier's phenotype. This protocol describes the use of the 'Sorting Intolerant From Tolerant' (SIFT) algorithm in predicting whether an AAS affects protein function. To assess the effect of a substitution, SIFT assumes that important positions in a protein sequence have been conserved throughout evolution and therefore substitutions at these positions may affect protein function. Thus, by using sequence homology, SIFT predicts the effects of all possible substitutions at each position in the protein sequence. The protocol typically takes 5-20 min, depending on the input. SIFT is available as an online tool (http://sift.jcvi.org).
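Downstream of the protocol, SIFT scores are typically consumed with a simple cutoff: scores run from 0 to 1, and substitutions are conventionally called deleterious below 0.05. The record format in this sketch is our assumption, not SIFT's native output.

```python
# Flag substitutions whose SIFT score falls below the conventional
# deleterious cutoff of 0.05 (lower score = less tolerated).
DAMAGING_CUTOFF = 0.05

def flag_damaging(scored_variants):
    """scored_variants: iterable of (variant_label, sift_score) pairs."""
    return [v for v, s in scored_variants if s < DAMAGING_CUTOFF]

print(flag_damaging([("DHODH:p.G202A", 0.01), ("GENE1:p.V21I", 0.42)]))
```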
Article
Full-text available
Resequencing is an emerging tool for the identification of rare disease-associated mutations. Rare mutations are difficult to tag with SNP genotyping, as genotyping studies are designed to detect common variants. However, studies have shown that genetic heterogeneity is a probable scenario for common diseases, in which multiple rare mutations together explain a large proportion of the genetic basis for the disease. Thus, we propose a weighted-sum method to jointly analyse a group of mutations in order to test for groupwise association with disease status. For example, such a group of mutations may result from resequencing a gene. We compare the proposed weighted-sum method to alternative methods and show that it is powerful for identifying disease-associated genes, on both simulated and ENCODE data. Using the weighted-sum method, a resequencing study can identify a disease-associated gene with an overall population attributable risk (PAR) of 2%, even when each individual mutation has a much lower PAR, using 1,000 to 7,000 affected and unaffected individuals, depending on the underlying genetic model. This study thus demonstrates that resequencing studies can identify important genetic associations, provided that specialised analysis methods, such as the weighted-sum method, are used.
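As commonly presented (our notation; treat the exact constants as an assumption), the weighted-sum score for individual $i$ uses control-anchored frequency estimates: with $m_j^U$ minor alleles among $n_j^U$ genotyped controls at variant $j$, and $n_j$ individuals genotyped in total,

$$\hat q_j = \frac{m_j^U + 1}{2 n_j^U + 2}, \qquad w_j = \sqrt{n_j\,\hat q_j\,(1 - \hat q_j)}, \qquad \gamma_i = \sum_j \frac{I_{ij}}{w_j},$$

where $I_{ij}$ counts the minor alleles individual $i$ carries at variant $j$; the sum of the cases' ranks of $\gamma$ is then compared with its permutation distribution, so rare variants contribute more per allele than common ones.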
Article
Full-text available
The level of DNA sequence variation is reduced in regions of the Drosophila melanogaster genome where the rate of crossing over per physical distance is also reduced. This observation has been interpreted as support for the simple model of genetic hitchhiking, in which directional selection on rare variants, e.g., newly arising advantageous mutants, sweeps linked neutral alleles to fixation, thus eliminating polymorphisms near the selected site. However, the frequency spectra of segregating sites of several loci from some populations exhibiting reduced levels of nucleotide diversity and reduced numbers of segregating sites did not appear different from what would be expected under a neutral equilibrium model. Specifically, a skew toward an excess of rare sites was not observed in these samples, as measured by Tajima's D. Because this skew was predicted by a simple hitchhiking model, yet it had never been expressed quantitatively and compared directly to DNA polymorphism data, this paper investigates the hitchhiking effect on the site frequency spectrum, as measured by Tajima's D and several other statistics, using a computer simulation model based on the coalescent process and recurrent hitchhiking events. The results presented here demonstrate that under the simple hitchhiking model (1) the expected value of Tajima's D is large and negative (indicating a skew toward rare variants), (2) that Tajima's test has reasonable power to detect a skew in the frequency spectrum for parameters comparable to those from actual data sets, and (3) that the Tajima's Ds observed in several data sets are very unlikely to have been the result of simple hitchhiking. Consequently, the simple hitchhiking model is not a sufficient explanation for the DNA polymorphism at those loci exhibiting a decreased number of segregating sites yet not exhibiting a skew in the frequency spectrum.
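For readers less familiar with the statistic: Tajima's D contrasts two estimators of the population mutation parameter, the mean pairwise diversity $\hat\theta_\pi$ and Watterson's estimator built from the number of segregating sites $S$ in $n$ sequences,

$$\hat\theta_W = \frac{S}{a_1}, \quad a_1 = \sum_{i=1}^{n-1}\frac{1}{i}, \qquad D = \frac{\hat\theta_\pi - \hat\theta_W}{\sqrt{\widehat{\operatorname{Var}}\big(\hat\theta_\pi - \hat\theta_W\big)}},$$

so an excess of rare variants, the skew a hitchhiking sweep is predicted to leave, depresses $\hat\theta_\pi$ relative to $\hat\theta_W$ and drives $D$ negative.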
Article
Full-text available
A class of statistical tests based on molecular polymorphism data is studied to determine size and power properties. The class includes Tajima's D statistic as well as the D* and F* tests proposed by Fu and Li. A new method of constructing critical values for these tests is described. Simulations indicate that Tajima's test is generally most powerful against the alternative hypotheses of selective sweep, population bottleneck, and population subdivision, among tests within this class. However, even Tajima's test can detect a selective sweep or bottleneck only if it has occurred within a specific interval of time in the recent past or population subdivision only when it has persisted for a very long time. For greatest power against the particular alternatives studied here, it is better to sequence more alleles than more sites.
Article
Full-text available
Positive selection can be inferred from its effect on linked neutral variation. In the restrictive case when there is no recombination, all linked variation is removed. If recombination is present but rare, both deterministic and stochastic models of positive selection show that linked variation hitchhikes to either low or high frequencies. While the frequency distribution of variation can be influenced by a number of evolutionary processes, an excess of derived variants at high frequency is a unique pattern produced by hitchhiking (derived refers to the nonancestral state as determined from an outgroup). We adopt a statistic, H, to measure an excess of high compared to intermediate frequency variants. Only a few high-frequency variants are needed to detect hitchhiking since not many are expected under neutrality. This is of particular utility in regions of low recombination where there is not much variation and in regions of normal or high recombination, where the hitchhiking effect can be limited to a small (<1 kb) region. Application of the H test to published surveys of Drosophila variation reveals an excess of high frequency variants that are likely to have been influenced by positive selection.
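In its unnormalized form the $H$ statistic contrasts pairwise diversity with an estimator that up-weights high-frequency derived variants: with $S_i$ the number of derived variants at frequency $i/n$ in a sample of $n$ sequences,

$$\hat\theta_\pi = \sum_{i=1}^{n-1}\frac{2 S_i\, i (n-i)}{n(n-1)}, \qquad \hat\theta_H = \sum_{i=1}^{n-1}\frac{2 S_i\, i^2}{n(n-1)}, \qquad H = \hat\theta_\pi - \hat\theta_H,$$

so an excess of high-frequency derived variants, the unique footprint of hitchhiking described above, makes $H$ markedly negative.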
Article
Full-text available
The frequencies of low-activity alleles of glucose-6-phosphate dehydrogenase in humans are highly correlated with the prevalence of malaria. These “deficiency” alleles are thought to provide reduced risk from infection by the Plasmodium parasite and are maintained at high frequency despite the hemopathologies that they cause. Haplotype analysis of “A−” and “Med” mutations at this locus indicates that they have evolved independently and have increased in frequency at a rate that is too rapid to be explained by random genetic drift. Statistical modeling indicates that the A− allele arose within the past 3840 to 11,760 years and the Med allele arose within the past 1600 to 6640 years. These results support the hypothesis that malaria has had a major impact on humans only since the introduction of agriculture within the past 10,000 years and provide a striking example of the signature of selection on the human genome.
Article
Full-text available
During their dispersal from Africa, our ancestors were exposed to new environments and diseases. Those who were better adapted to local conditions passed on their genes, including those conferring these benefits, with greater frequency. This process of natural selection left signatures in our genome that can be used to identify genes that might underlie variation in disease resistance or drug metabolism. These signatures are, however, confounded by population history and by variation in local recombination rates. Although this complexity makes finding adaptive polymorphisms a challenge, recent discoveries are instructing us how and where to look for the signatures of selection.
Article
Full-text available
Milk from domestic cows has been a valuable food source for over 8,000 years, especially in lactose-tolerant human societies that exploit dairy breeds. We studied geographic patterns of variation in genes encoding the six most important milk proteins in 70 native European cattle breeds. We found substantial geographic coincidence between high diversity in cattle milk genes, locations of the European Neolithic cattle farming sites (>5,000 years ago) and present-day lactose tolerance in Europeans. This suggests a gene-culture coevolution between cattle and humans.
Article
Full-text available
Even though human and chimpanzee gene sequences are nearly 99% identical, sequence comparisons can nevertheless be highly informative in identifying biologically important changes that have occurred since our ancestral lineages diverged. We analyzed alignments of 7645 chimpanzee gene sequences to their human and mouse orthologs. These three-species sequence alignments allowed us to identify genes undergoing natural selection along the human and chimp lineage by fitting models that include parameters specifying rates of synonymous and nonsynonymous nucleotide substitution. This evolutionary approach revealed an informative set of genes with significantly different patterns of substitution on the human lineage compared with the chimpanzee and mouse lineages. Partitions of genes into inferred biological classes identified accelerated evolution in several functional classes, including olfaction and nuclear transport. In addition to suggesting adaptive physiological differences between chimps and humans, human-accelerated genes are significantly more likely to underlie major known Mendelian disorders.
Article
The gene Microcephalin (MCPH1) regulates brain size and has evolved under strong positive selection in the human evolutionary lineage. We show that one genetic variant of Microcephalin in modern humans, which arose approximately 37,000 years ago, increased in frequency too rapidly to be compatible with neutral drift. This indicates that it has spread under strong positive selection, although the exact nature of the selection is unknown. The finding that an important brain gene has continued to evolve adaptively in anatomically modern humans suggests the ongoing evolutionary plasticity of the human brain. It also makes Microcephalin an attractive candidate locus for studying the genetics of human variation in brain-related phenotypes.
Article
A large fraction of eukaryotic genomes consists of DNA that is not translated into protein sequence, and little is known about its functional significance. Here I show that several classes of non-coding DNA in Drosophila are evolving considerably slower than synonymous sites, and yet show an excess of between-species divergence relative to polymorphism when compared with synonymous sites. The former is a hallmark of selective constraint, but the latter is a signature of adaptive evolution, resembling general patterns of protein evolution in Drosophila. I estimate that about 40-70% of nucleotides in intergenic regions, untranslated portions of mature mRNAs (UTRs) and most intronic DNA are evolutionarily constrained relative to synonymous sites. However, I also use an extension to the McDonald-Kreitman test to show that a substantial fraction of the nucleotide divergence in these regions was driven to fixation by positive selection (about 20% for most intronic and intergenic DNA, and 60% for UTRs). On the basis of these observations, I suggest that a large fraction of the non-translated genome is functionally important and subject to both purifying selection and adaptive evolution. These results imply that, although positive selection is clearly an important facet of protein evolution, adaptive changes to non-coding DNA might have been considerably more common in the evolution of D. melanogaster.
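A minimal sketch of the McDonald–Kreitman logic underlying this kind of estimate: with Dn/Ds fixed differences and Pn/Ps polymorphisms (here "n" standing in for the non-coding test class and "s" for synonymous sites), the fraction of divergence attributable to positive selection can be estimated as α = 1 − (Ds·Pn)/(Dn·Ps). The counts below are hypothetical, and the paper's extension adds more structure than this.

```python
# A minimal sketch of the MK-style alpha estimate; counts are
# hypothetical ("n" = the non-coding test class, "s" = synonymous).
from scipy.stats import fisher_exact

def mk_alpha(dn, ds, pn, ps):
    """Fraction of between-species divergence in the test class
    estimated to have been driven to fixation by positive selection."""
    return 1.0 - (ds * pn) / (dn * ps)

dn, ds, pn, ps = 80, 100, 40, 90
print("alpha =", mk_alpha(dn, ds, pn, ps))

# Departure from neutrality: 2x2 test of fixed vs. polymorphic counts.
_, p = fisher_exact([[dn, pn], [ds, ps]])
print("p =", p)
```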
Article
We offer an approximation to central confidence intervals for directly standardized rates, where we assume that the rates are distributed as a weighted sum of independent Poisson random variables. Like a recent method proposed by Dobson, Kuulasmaa, Eberle and Scherer, our method gives exact intervals whenever the standard population is proportional to the study population. In cases where the two populations differ non-proportionally, we show through simulation that our method is conservative while other methods (the Dobson et al. method and the approximate bootstrap confidence method) can be liberal.
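A sketch of the gamma-interval construction described above, written from memory of the method and therefore worth checking against the paper before serious use; the stratum counts and weights are hypothetical.

```python
# Gamma confidence interval for a directly standardized rate,
# as I recall the construction; inputs are hypothetical.
from scipy.stats import gamma

def std_rate_ci(events, person_years, std_weights, alpha=0.05):
    """events[i], person_years[i]: counts and exposure in stratum i;
    std_weights[i]: standard-population weights summing to 1."""
    w = [s / n for s, n in zip(std_weights, person_years)]
    y = sum(wi * d for wi, d in zip(w, events))       # standardized rate
    v = sum(wi ** 2 * d for wi, d in zip(w, events))  # variance estimate
    wm = max(w)                                       # largest weight
    lower = gamma.ppf(alpha / 2, a=y ** 2 / v, scale=v / y)
    upper = gamma.ppf(1 - alpha / 2,
                      a=(y + wm) ** 2 / (v + wm ** 2),
                      scale=(v + wm ** 2) / (y + wm))
    return y, (lower, upper)

print(std_rate_ci([10, 3], [20000.0, 15000.0], [0.6, 0.4]))
```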
Article
Cystic fibrosis (CF) is a multisystem autosomal recessive disorder caused by mutations of the cystic fibrosis transmembrane regulator (CFTR), a protein that regulates cyclic-AMP-mediated chloride conductance at the apical membrane of secretory epithelia(1). Mutations in the CFTR gene are common in many populations. In North America, 4-5% of the general population are heterozygous for a CFTR mutation(2). Although there are over 400 known CFTR mutations, a single mutation, a deletion of the phenylalanine at position 508 (Delta F508) in exon 10, accounts for about 70% of all CF chromosomes worldwide(3). The reasons for the high frequency of the Delta F508 CFTR allele - the selective advantage associated with CF heterozygosity - are unknown(1). Many physiological abnormalities have been observed in CF heterozygotes(4-6), although the clinical significance of these observations is unknown. Preliminary unpublished data and anecdotal information from CF families suggesting that, remarkably, the Delta F508 allele might protect heterozygotes against bronchial asthma prompted us to investigate this possibility further. Here we present evidence that the Delta F508 CF allele protects against asthma in childhood and early adult life.
Article
The recent progress in sequencing technologies makes possible large-scale medical sequencing efforts to assess the importance of rare variants in complex diseases. The results of such efforts depend heavily on the use of efficient study designs and analytical methods. We introduce here a unified framework for association testing of rare variants in family-based designs or designs based on unselected affected individuals. This framework allows us to quantify the enrichment in rare disease variants in families containing multiple affected individuals and to investigate the optimal design of studies aiming to identify rare disease variants in complex traits. We show that for many complex diseases with small values for the overall sibling recurrence risk ratio, such as Alzheimer's disease and most cancers, sequencing affected individuals with a positive family history of the disease can be extremely advantageous for identifying rare disease variants. In contrast, for complex diseases with large values of the sibling recurrence risk ratio, sequencing unselected affected individuals may be preferable.
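The enrichment argument can be made concrete with a toy model: Bayes' rule gives the carrier probability for an unselected affected individual, and a small Monte Carlo (a minimal sketch, not the paper's framework) shows the additional enrichment from requiring an affected sibling. All parameters are hypothetical.

```python
# Toy check of rare-variant enrichment in family-based ascertainment.
import random

f, r, base = 0.01, 5.0, 0.05   # carrier freq., relative risk, baseline risk

# Unselected affected individual, by Bayes' rule.
print("P(carrier | affected) =", f * r / (f * r + 1 - f))

def sib_pair_enrichment(n_families=1_000_000):
    hits = carriers = 0
    for _ in range(n_families):
        # At most one heterozygous carrier parent (the variant is rare);
        # each sib then inherits the variant independently w.p. 1/2.
        carrier_parent = random.random() < 2 * f
        sibs = [carrier_parent and random.random() < 0.5 for _ in range(2)]
        affected = [random.random() < base * (r if c else 1.0) for c in sibs]
        if all(affected):
            hits += 1
            carriers += sibs[0]
    return carriers / hits

print("P(carrier | affected, affected sib) =", sib_pair_enrichment())
```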
Article
Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
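A minimal sketch of a SKAT-style variance-component score test for a continuous trait without covariates: Satterthwaite moment matching stands in for the exact (Davies) method SKAT itself uses, flat weights replace its beta-density weights, and the data are simulated under the null.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, m = 500, 20
G = rng.binomial(2, 0.02, size=(n, m)).astype(float)  # rare genotypes
y = rng.normal(size=n)                                 # null phenotype

w = np.ones(m)                  # SKAT uses beta-density weights instead
r = y - y.mean()                # residuals under the null model
sigma2 = r @ r / (n - 1)
K = (G * w) @ G.T               # weighted linear kernel
Q = r @ K @ r                   # score statistic

# Null distribution: Q ~ sigma2 * sum_j lam_j chi2_1, lam_j the
# eigenvalues of P0 K P0, with P0 the centering projection.
P0 = np.eye(n) - np.ones((n, n)) / n
lam = np.linalg.eigvalsh(P0 @ K @ P0) * sigma2
mu, var = lam.sum(), 2 * (lam ** 2).sum()
df = 2 * mu ** 2 / var          # Satterthwaite: Q ~ (var/2mu) * chi2_df
p = chi2.sf(Q * 2 * mu / var, df)
print(f"Q = {Q:.1f}, p = {p:.3f}")
```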
Article
Genomic research has two quite distinct faces. On the one hand, it produces large, curated, reference data sets through numerous networks of investigators for community use—although this aspect has great and widespread utility, it does not inspire per se. On the other hand, it allows an unbiased genome-wide view that is exciting precisely because it habitually uncovers biology that we were hopelessly ignorant about. Consequently, I am sanguine that the search for Mendelian disease genes by exomic and genomic sequencing will produce more than a long and comprehensive list of genes and associated disease mutations. Importantly, we are likely to hear new and surprising biological stories.
Article
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genomes pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
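The map/reduce decomposition the GATK is built around can be pictured with a toy coverage calculator; the sketch below mimics the programming pattern only and is not GATK's actual API.

```python
# Toy map/reduce coverage calculator: per-read "map" emits
# (position, 1) pairs, "reduce" sums them into depth of coverage.
from collections import Counter
from itertools import chain

reads = [(100, 5), (102, 5), (104, 3)]   # (start, length), hypothetical

def map_read(read):
    start, length = read
    return ((pos, 1) for pos in range(start, start + length))

def reduce_counts(pairs):
    depth = Counter()
    for pos, c in pairs:
        depth[pos] += c
    return depth

coverage = reduce_counts(chain.from_iterable(map_read(r) for r in reads))
print(sorted(coverage.items()))
```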
Article
Shakespeare wrote 31,534 different words, of which 14,376 appear only once, 4,343 twice, etc. The question considered is how many words he knew but did not use. A parametric empirical Bayes model due to Fisher and a nonparametric model due to Good & Toulmin are examined. The latter theory is augmented using linear programming methods. We conclude that the models are equivalent to supposing that Shakespeare knew at least 35,000 more words.
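The nonparametric side of this analysis rests on the Good–Toulmin series Δ(t) = Σr (−1)^(r+1) t^r n_r, the expected number of new types if the corpus grew by a factor 1 + t. Only n1 and n2 are quoted above, so the later counts in the sketch below are placeholders.

```python
def good_toulmin(n, t=1.0):
    """n[r-1] = number of word types observed exactly r times;
    returns the expected number of new types if the corpus grew
    by a factor (1 + t). For t > 1 the raw series diverges and
    needs the smoothing the paper supplies."""
    return sum((-1) ** (r + 1) * t ** r * nr for r, nr in enumerate(n, 1))

# n1 = 14,376 and n2 = 4,343 are quoted above; the later counts are
# hypothetical placeholders for illustration only.
counts = [14376, 4343, 2292, 1463, 1043]
print(good_toulmin(counts, t=1.0))
```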
Article
Genome-wide association studies suggest that common genetic variants explain only a modest fraction of heritable risk for common diseases, raising the question of whether rare variants account for a significant fraction of unexplained heritability. Although DNA sequencing costs have fallen markedly, they remain far from what is necessary for rare and novel variants to be routinely identified at a genome-wide scale in large cohorts. We have therefore sought to develop second-generation methods for targeted sequencing of all protein-coding regions ('exomes'), to reduce costs while enriching for discovery of highly penetrant variants. Here we report on the targeted capture and massively parallel sequencing of the exomes of 12 humans. These include eight HapMap individuals representing three populations, and four unrelated individuals with a rare dominantly inherited disorder, Freeman-Sheldon syndrome (FSS). We demonstrate the sensitive and specific identification of rare and common variants in over 300 megabases of coding sequence. Using FSS as a proof-of-concept, we show that candidate genes for Mendelian disorders can be identified by exome sequencing of a small number of unrelated, affected individuals. This strategy may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of non-synonymous variants by predicted functional impact.
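The candidate-gene step can be pictured as a set-intersection filter: keep genes carrying a novel, protein-altering variant in every affected individual. The sketch below is a toy version with made-up variants (MYH3 appears only as an illustrative gene name).

```python
# Toy filter: genes with a novel, protein-altering variant in every
# affected individual. All variant and gene names are made up.
affected = [
    {("MYH3", "p.R672C"), ("GENE2", "p.A10T")},   # individual 1
    {("MYH3", "p.T178I"), ("GENE3", "p.L5F")},    # individual 2
]
dbsnp = {("GENE2", "p.A10T")}                      # "known" variants

novel = [{v for v in ind if v not in dbsnp} for ind in affected]
genes_per_ind = [{gene for gene, _ in ind} for ind in novel]
candidates = set.intersection(*genes_per_ind)
print(candidates)   # genes hit by a novel variant in every affected
```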
Article
The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals necessary to sequence in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (approximately 350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (approximately 150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.
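Under a beta model for allele frequencies, the expected number of variants discovered in a new sample has a closed form, since E[(1−p)^c] = B(a, b+c)/B(a, b) for p ~ Beta(a, b). The sketch below uses hypothetical parameter values, not those fitted in the paper.

```python
# Expected variant discovery under a beta model for frequencies:
# a variant of frequency p is seen in c sampled chromosomes
# w.p. 1 - (1-p)^c. Parameters below are hypothetical.
import numpy as np
from scipy.special import betaln

def expected_discovered(total_variants, a, b, n_individuals):
    c = 2 * n_individuals                  # chromosomes sampled
    frac_missed = np.exp(betaln(a, b + c) - betaln(a, b))
    return total_variants * (1.0 - frac_missed)

for n in (150, 350, 3000):
    print(n, expected_discovered(total_variants=1e5, a=0.2, b=20,
                                 n_individuals=n))
```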
Article
Although whole-genome association studies using tagSNPs are a powerful approach for detecting common variants, they are underpowered for detecting associations with rare variants. Recent studies have demonstrated that common diseases can be due to functional variants with a wide spectrum of allele frequencies, ranging from rare to common. An effective way to identify rare variants is through direct sequencing. The development of cost-effective sequencing technologies enables association studies to use sequence data from candidate genes and, in the future, from the entire genome. Although methods used for analysis of common variants are applicable to sequence data, their performance might not be optimal. In this study, it is shown that the collapsing method, which involves collapsing genotypes across variants and applying a univariate test, is powerful for analyzing rare variants, whereas multivariate analysis is robust against inclusion of noncausal variants. Both methods are superior to analyzing each variant individually with univariate tests. In order to unify the advantages of both collapsing and multiple-marker tests, we developed the Combined Multivariate and Collapsing (CMC) method and demonstrated that the CMC method is both powerful and robust. The CMC method can be applied to either candidate-gene or whole-genome sequence data.
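A minimal sketch of the CMC idea: collapse the rare variants into a single carrier indicator, keep the common variants as separate coordinates, and compare cases with controls using Hotelling's T². The data are simulated under the null, and the 1% MAF threshold implied by the grouping is the usual convention rather than a value taken from this abstract.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)
n_case = n_ctrl = 300
m_rare, m_common = 15, 3

def genos(n):
    rare = rng.binomial(2, 0.005, size=(n, m_rare))
    common = rng.binomial(2, 0.3, size=(n, m_common)).astype(float)
    collapsed = (rare.sum(axis=1) > 0).astype(float)[:, None]
    return np.hstack([collapsed, common])   # carrier flag + common SNPs

X1, X2 = genos(n_case), genos(n_ctrl)
d = X1.mean(0) - X2.mean(0)
S = (np.cov(X1.T) * (n_case - 1) +
     np.cov(X2.T) * (n_ctrl - 1)) / (n_case + n_ctrl - 2)
t2 = (n_case * n_ctrl / (n_case + n_ctrl)) * d @ np.linalg.solve(S, d)
p_dim = X1.shape[1]
F = t2 * (n_case + n_ctrl - p_dim - 1) / ((n_case + n_ctrl - 2) * p_dim)
print("p =", f_dist.sf(F, p_dim, n_case + n_ctrl - p_dim - 1))
```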
Article
A computer method for deriving relative-to-relative genotype transition probabilities is explained for the model of two linked loci. Both autosomal and X-linked cases are treated and the results tabulated for the more common categories of relatives. These results are then applied to the problem of calculating joint distributions and correlations between relatives.
Article
Cardiomyopathies represent a variety of cardiac diseases that are an important cause of morbidity and mortality throughout the world in children and adults and whose definition and classification have evolved since the middle of this century. Currently, they are defined as “heart muscle diseases of unknown etiology” and are classified as dilated, hypertrophic, or restrictive, depending on the type of functional impairment.1 The dilated forms (dilated cardiomyopathy, DCM) are the most common variety. They are characterized by a marked ventricular dilation, poor systolic function, the development of progressive refractory congestive heart failure, and a poor prognosis. Their prevalence in the US population is estimated to be 36.5 per 100 000 persons.2 The hypertrophic forms (hypertrophic cardiomyopathy, HCM) are defined by the presence of unexplained left ventricular hypertrophy that is usually predominant in the interventricular septum and may or may not be associated with right ventricular hypertrophy. Cellular disorganization (myocardial disarray) is present in most patients in the interventricular septum as well as in the free wall. The disease is associated with diastolic dysfunction, myocardial ischemia, and life-threatening arrhythmias, and patients are prone to sudden death. The prevalence of HCM is reported to be 17.9 per 100 000 persons.2 Restrictive cardiomyopathy is extremely rare in western countries. Although apparently clear, this clinical classification presents major limitations: specific cardiac diseases such as hypertension or ischemic heart disease, as well as general disorders with cardiac involvement, can mimic the clinical presentation of idiopathic cardiomyopathies. Moreover, an overlap exists between these categories. For instance, in end-stage HCM, a marked dilation of both ventricles, similar to that observed in DCM, can be present. Most important, this classification does not address the underlying molecular disorders responsible for the development of the “clinical” cardiomyopathy. During the past few years, new and unexpected insights into …
Article
Levels of neutral genetic diversity in populations subdivided into two demes were studied by multilocus stochastic simulations. The model includes deleterious mutations at loci throughout the genome, causing 'background selection', as well as a single locus at which a polymorphism is maintained, either by frequency-dependent selection or by local selective differences. These balanced polymorphisms induce long coalescence times at linked neutral loci, so that sequence diversity at these loci is enhanced at statistical equilibrium. We study how equilibrium neutral diversity levels are affected by the degree of population subdivision, the presence or absence of background selection, and the level of inbreeding of the population. The simulation results are compared with approximate analytical formulae, assuming the infinite sites neutral model. We discuss how balancing selection can be distinguished from local selection, by determining whether peaks of diversity in the region of the polymorphic locus are seen within or between demes. The width of such diversity peaks is shown to depend on the total species population size, rather than local deme sizes. We show that, with population subdivision, local selection enhances between-deme diversity even at neutral sites distant from the polymorphic locus, producing higher FST values than with no selection; very high values can be generated at sites close to a selected locus. Background selection also increases FST, mainly because of decreased diversity within populations, which implies that its effects may be distinguishable from those of local selection. Both effects are stronger in selfing than outcrossing populations. Linkage disequilibrium between neutral sites is generated by both balancing and local selection, especially in selfing populations, because of linkage disequilibrium between the neutral sites and the selectively maintained alleles. We discuss how these theoretical results can be related to data on genetic diversity within and between local populations of a species.
Article
The problem of genetic hitch-hiking in a geographically subdivided population is analysed under the assumption that migration rates among populations are relatively small compared with the selection coefficient for a newly arising advantageous allele. The approximate method used in the paper is valid when the number of emigrants per generation (Nm) is less than one. The approximate analysis shows that hitch-hiking can result in substantial differences among populations in the frequencies of neutral alleles closely linked to the advantageous allele. Thus, in cases for which genetic hitch-hiking is thought to be responsible for low levels of genetic variability in regions of the genome with restricted crossing over, it might be possible to find confirmatory evidence for that hypothesis by finding unusual patterns of geographic differentiation in the same regions of the genome.
Article
The KE family is a large three-generation pedigree in which half the members are affected with a severe speech and language disorder that is transmitted as an autosomal dominant monogenic trait. In previously published work, we localized the gene responsible (SPCH1) to a 5.6-cM region of 7q31 between D7S2459 and D7S643. In the present study, we have employed bioinformatic analyses to assemble a detailed BAC-/PAC-based sequence map of this interval, containing 152 sequence tagged sites (STSs), 20 known genes, and >7.75 Mb of completed genomic sequence. We screened the affected chromosome 7 from the KE family with 120 of these STSs (average spacing <100 kb), but we did not detect any evidence of a microdeletion. Novel polymorphic markers were generated from the sequence and were used to further localize critical recombination breakpoints in the KE family. This allowed refinement of the SPCH1 interval to a region between new markers 013A and 330B, containing approximately 6.1 Mb of completed sequence. In addition, we have studied two unrelated patients with a similar speech and language disorder, who have de novo translocations involving 7q31. Fluorescence in situ hybridization analyses with BACs/PACs from the sequence map localized the t(5;7)(q22;q31.2) breakpoint in the first patient (CS) to a single clone within the newly refined SPCH1 interval. This clone contains the CAGH44 gene, which encodes a brain-expressed protein containing a large polyglutamine stretch. However, we found that the t(2;7)(p23;q31.3) breakpoint in the second patient (BRD) resides within a BAC clone mapping >3.7 Mb distal to this, outside the current SPCH1 critical interval. Finally, we investigated the CAGH44 gene in affected individuals of the KE family, but we found no mutations in the currently known coding sequence. These studies represent further steps toward the isolation of the first gene to be implicated in the development of speech and language.
Article
Phylogenetic footprinting is a method for the discovery of regulatory elements in a set of orthologous regulatory regions from multiple species. It does so by identifying the best conserved motifs in those orthologous regions. We describe a computer algorithm designed specifically for this purpose, making use of the phylogenetic relationships among the sequences under study to make more accurate predictions. The program is guaranteed to report all sets of motifs with the lowest parsimony scores, calculated with respect to the phylogenetic tree relating the input species. We report the results of this algorithm on several data sets of interest. A large number of known functional binding sites are identified by our method, but we also find several highly conserved motifs for which no function is yet known.
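The parsimony score being minimized here can be illustrated with Fitch's algorithm applied column by column to one candidate motif on a fixed tree. The published method searches over all motifs with a Sankoff-style dynamic program, so this is only the scoring step, with a made-up tree and sequences.

```python
# Fitch small-parsimony score of one aligned motif on a fixed tree.
def fitch(tree, column):
    """tree: nested tuples with leaf names at the tips;
    column: dict leaf -> character.
    Returns (candidate state set, substitution count)."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    (ls, lc), (rs, rc) = fitch(tree[0], column), fitch(tree[1], column)
    inter = ls & rs
    return (inter, lc + rc) if inter else (ls | rs, lc + rc + 1)

tree = (("human", "mouse"), ("chicken", "zebrafish"))
motifs = {"human": "TGACTC", "mouse": "TGACTC",
          "chicken": "TGACTA", "zebrafish": "TGAGTA"}
score = sum(fitch(tree, {sp: s[i] for sp, s in motifs.items()})[1]
            for i in range(6))
print("parsimony score:", score)   # lower = better conserved motif
```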
Article
Linkage disequilibrium (LD) plays a central role in current and proposed methods for mapping complex disease genes. LD-based methods work best when there is a single susceptibility allele at any given disease locus, and generally perform very poorly if there is substantial allelic heterogeneity. The extent of allelic heterogeneity at typical complex disease loci is not yet known, but predictions about allelic heterogeneity have important implications for the design of future mapping studies, including the proposed genome-wide association studies. In this article, we review the available data and models relating to the number and frequencies of susceptibility alleles at complex disease loci -- the 'allelic architecture' of human disease genes. We also show that the predicted frequency spectrum of disease variants at a gene depends crucially on the method of ascertainment, for example from prior linkage scans or from surveys of functional candidate loci.
Article
Genes responsible for human-specific phenotypes may have been under altered selective pressures in human evolution and thus exhibit changes in substitution rate and pattern at the protein sequence level. Using comparative analysis of human, chimpanzee, and mouse protein sequences, we identified two genes (PRM2 and FOXP2) with significantly enhanced evolutionary rates in the hominid lineage. PRM2 is a histone-like protein essential to spermatogenesis and was previously reported to be a likely target of sexual selection in humans and chimpanzees. FOXP2 is a transcription factor involved in speech and language development. Human FOXP2 experienced a >60-fold increase in substitution rate and incorporated two fixed amino acid changes in a broadly defined transcription suppression domain. A survey of a diverse group of placental mammals reveals the uniqueness of the human FOXP2 sequence and a population genetic analysis indicates possible adaptive selection behind the accelerated evolution. Taken together, our results suggest an important role that FOXP2 may have played in the origin of human speech and demonstrate a strategy for identifying candidate genes underlying the emergences of human-specific features.
Article
As large-scale sequencing efforts turn from single genome sequencing to polymorphism discovery, single nucleotide polymorphisms (SNPs) are becoming an increasingly important class of population genetic data. But because of the ascertainment biases introduced by many methods of SNP discovery, most SNP data cannot be analyzed using classical population genetic methods. Statistical methods must instead be developed that can explicitly take into account each method of SNP discovery. Here we review some of the current methods for analyzing SNPs and derive sampling distributions for single SNPs and pairs of SNPs for some common SNP discovery schemes. We also show that the ascertainment scheme has a large effect on the estimation of linkage disequilibrium and recombination, and describe some methods of correcting for ascertainment biases when estimating recombination rates from SNP data.
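For the common panel-based discovery scheme, a SNP of true frequency p is ascertained in d panel chromosomes only if both alleles appear there, with probability 1 − p^d − (1−p)^d; reweighting observed SNPs by the inverse of this probability de-biases the frequency spectrum. The counts below are hypothetical.

```python
# Simple ascertainment correction for panel-based SNP discovery.
def p_ascertained(p, d):
    """Probability both alleles of a SNP with frequency p are seen
    in a discovery panel of d chromosomes."""
    return 1.0 - p ** d - (1.0 - p) ** d

d = 4                                        # discovery-panel chromosomes
observed = {0.05: 10, 0.20: 40, 0.50: 55}    # freq -> SNPs found
corrected = {p: c / p_ascertained(p, d) for p, c in observed.items()}
print(corrected)   # rare SNPs are inflated the most, as expected
```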
Article
The ability to infer the time and place of origin of a mutation can be very useful when reconstructing the evolutionary histories of populations and species. We use forward computer simulations of population growth, migration, and mutation in an analysis of an expanding population with a wave front that advances at a constant slow rate. A pronounced founder effect can be observed among mutations arising in this wave front where extreme population bottlenecks arise and are followed by major population growth. A fraction of mutations travel with the wave front and generate mutant populations that are on average much larger than those that remain stationary. Analysis of the diffusion of these mutants makes it possible to reconstruct migratory trajectories during population expansions, thus helping us better understand observed patterns in the evolution of species such as modern humans. Examination of some historical data supports our model.
Article
In most human populations, the ability to digest lactose contained in milk usually disappears in childhood, but in European-derived populations, lactase activity frequently persists into adulthood (Scrimshaw and Murray 1988). It has been suggested (Cavalli-Sforza 1973; Hollox et al. 2001; Enattah et al. 2002; Poulter et al. 2003) that a selective advantage based on additional nutrition from dairy explains these genetically determined population differences (Simoons 1970; Kretchmer 1971; Scrimshaw and Murray 1988; Enattah et al. 2002), but formal population-genetics-based evidence of selection has not yet been provided. To assess the population-genetics evidence for selection, we typed 101 single-nucleotide polymorphisms covering 3.2 Mb around the lactase gene. In northern European-derived populations, two alleles that are tightly associated with lactase persistence (Enattah et al. 2002) uniquely mark a common (~77%) haplotype that extends largely undisrupted for >1 Mb. We provide two new lines of genetic evidence that this long, common haplotype arose rapidly due to recent selection: (1) by use of the traditional FST measure and a novel test based on p_excess, we demonstrate large frequency differences among populations for the persistence-associated markers and for flanking markers throughout the haplotype, and (2) we show that the haplotype is unusually long, given its high frequency--a hallmark of recent selection. We estimate that strong selection occurred within the past 5,000-10,000 years, consistent with an advantage to lactase persistence in the setting of dairy farming; the signals of selection we observe are among the strongest yet seen for any gene in the genome.
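The FST comparison can be sketched per SNP in the Hudson form FST = 1 − Hw/Hb, with Hw the average within-population heterozygosity and Hb the between-population heterozygosity; small-sample corrections are omitted, and the frequencies are hypothetical, chosen to mimic a strongly differentiated persistence-associated marker.

```python
# Per-SNP FST in the form FST = 1 - Hw/Hb, ignoring small-sample
# corrections; allele frequencies are hypothetical.
def fst(p1, p2):
    hw = p1 * (1 - p1) + p2 * (1 - p2)   # mean within-pop heterozygosity
    hb = p1 * (1 - p2) + p2 * (1 - p1)   # between-pop heterozygosity
    return 1.0 - hw / hb

print(fst(0.77, 0.05))   # strongly differentiated, persistence-like SNP
print(fst(0.50, 0.45))   # typical, weakly differentiated SNP
```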
Article
Demographic events affect all genes in a genome, whereas natural selection has only local effects. Using publicly available data from 151 loci sequenced in both European-American and African-American populations, we attempt to distinguish the effects of demography and selection. To analyze large sets of population genetic data such as this one, we introduce "Perlymorphism," a Unix-based suite of analysis tools. Our analyses show that the demographic histories of human populations can account for a large proportion of effects on the level and frequency of variation across the genome. The African-American population shows both a higher level of nucleotide diversity and more negative values of Tajima's D statistic than does a European-American population. Using coalescent simulations, we show that the significantly negative values of the D statistic in African-Americans and the positive values in European-Americans are well explained by relatively simple models of population admixture and bottleneck, respectively. Working within these nonequilibrium frameworks, we are still able to show deviations from neutral expectations at a number of loci, including ABO and TRPV6. In addition, we show that the frequency spectrum of mutations--corrected for levels of polymorphism--is correlated with recombination rate only in European-Americans. These results are consistent with repeated selective sweeps in non-African populations, in agreement with recent reports using microsatellite data.
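For reference, Tajima's D as used in this comparison can be computed from the number of segregating sites and the mean pairwise difference; the constants follow Tajima's 1989 definitions, and the inputs below are hypothetical.

```python
import math

def tajimas_d(n, S, pi):
    """n: chromosomes sampled; S: segregating sites;
    pi: mean number of pairwise differences."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Negative D = excess of rare variants (e.g. the bottleneck/admixture
# patterns discussed above); positive D = excess of intermediate-
# frequency variants.
print(tajimas_d(n=50, S=120, pi=18.3))
```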
Article
Members of the cytochrome P450 3A subfamily catalyze the metabolism of endogenous substrates, environmental carcinogens, and clinically important exogenous compounds, such as prescription drugs and therapeutic agents. In particular, the CYP3A4 and CYP3A5 genes play an especially important role in pharmacogenetics, since they metabolize >50% of the drugs on the market. However, known genetic variants at these two loci are not sufficient to account for the observed phenotypic variability in drug response. We used a comparative genomics approach to identify conserved coding and noncoding regions at these genes and resequenced them in three ethnically diverse human populations. We show that remarkable interpopulation differences exist with regard to frequency spectrum and haplotype structure. The non-African samples are characterized by a marked excess of rare variants and the presence of a homogeneous group of long-range haplotypes at high frequency. The CYP3A5*1/*3 polymorphism, which is likely to influence salt and water retention and risk for salt-sensitive hypertension, was genotyped in >1,000 individuals from 52 worldwide population samples. The results reveal an unusual geographic pattern whereby the CYP3A5*3 frequency shows extreme variation across human populations and is significantly correlated with distance from the equator. Furthermore, we show that an unlinked variant, AGT M235T, previously implicated in hypertension and pre-eclampsia, exhibits a similar geographic distribution and is significantly correlated in frequency with CYP3A5*1/*3. Taken together, these results suggest that variants that influence salt homeostasis were the targets of a shared selective pressure that resulted from an environmental variable correlated with latitude.
Article
Population genetic models play an important role in human genetic research, connecting empirical observations about sequence variation with hypotheses about underlying historical and biological causes. More specifically, models are used to compare empirical measures of sequence variation, linkage disequilibrium (LD), and selection to expectations under a "null" distribution. In the absence of detailed information about human demographic history, and about variation in mutation and recombination rates, simulations have of necessity used arbitrary models, usually simple ones. With the advent of large empirical data sets, it is now possible to calibrate population genetic models with genome-wide data, permitting for the first time the generation of data that are consistent with empirical data across a wide range of characteristics. We present here the first such calibrated model and show that, while still arbitrary, it successfully generates simulated data (for three populations) that closely resemble empirical data in allele frequency, linkage disequilibrium, and population differentiation. No assertion is made about the accuracy of the proposed historical and recombination model, but its ability to generate realistic data meets a long-standing need among geneticists. We anticipate that this model, for which software is publicly available, and others like it will have numerous applications in empirical studies of human genetics.
Article
Scanning the genome for association between markers and complex diseases typically requires testing hundreds of thousands of genetic polymorphisms. Testing such a large number of hypotheses exacerbates the trade-off between power to detect meaningful associations and the chance of making false discoveries. Even before the full genome is scanned, investigators often favor certain regions on the basis of the results of prior investigations, such as previous linkage scans. The remaining regions of the genome are investigated simultaneously because genotyping is relatively inexpensive compared with the cost of recruiting participants for a genetic study and because prior evidence is rarely sufficient to rule out these regions as harboring genes with variation conferring liability (liability genes). However, the multiple testing inherent in broad genomic searches diminishes power to detect association, even for genes falling in regions of the genome favored a priori. Multiple testing problems of this nature are well suited for application of the false-discovery rate (FDR) principle, which can improve power. To enhance power further, a new FDR approach is proposed that involves weighting the hypotheses on the basis of prior data. We present a method for using linkage data to weight the association P values. Our investigations reveal that if the linkage study is informative, the procedure improves power considerably. Remarkably, the loss in power is small, even when the linkage study is uninformative. For a class of genetic models, we calculate the sample size required to obtain useful prior information from a linkage study. This inquiry reveals that, among genetic models that are seemingly equal in genetic information, some are much more promising than others for this mode of analysis.
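The weighting scheme can be sketched as a weighted Benjamini–Hochberg procedure: divide each association p-value by a linkage-derived weight (averaging 1, so that FDR control is preserved) and apply the usual step-up rule. This is a generic weighted-FDR sketch, not the paper's exact method, and the inputs are hypothetical.

```python
import numpy as np

def weighted_bh(pvals, weights, q=0.05):
    """Step-up BH procedure on weighted p-values p_i / w_i;
    weights should average 1 across the hypotheses."""
    p = np.asarray(pvals) / np.asarray(weights)
    order = np.argsort(p)
    m = len(p)
    thresh = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

pvals = [1e-6, 0.003, 0.04, 0.2, 0.7]
weights = [2.0, 1.5, 0.5, 0.5, 0.5]   # up-weight linkage-favored regions
print(weighted_bh(pvals, weights))
```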
Article
Positive natural selection is the force that drives the increase in prevalence of advantageous traits, and it has played a central role in our development as a species. Until recently, the study of natural selection in humans has largely been restricted to comparing individual candidate genes to theoretical expectations. The advent of genome-wide sequence and polymorphism data brings fundamental new tools to the study of natural selection. It is now possible to identify new candidates for selection and to reevaluate previous claims by comparison with empirical distributions of DNA sequence variation across the human genome and among populations. The flood of data and analytical methods, however, raises many new challenges. Here, we review approaches to detect positive natural selection, describe results from recent analyses of genome-wide data, and discuss the prospects and challenges ahead as we expand our understanding of the role of natural selection in shaping the human genome.