[show abstract][hide abstract] ABSTRACT: We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analyses of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
[show abstract][hide abstract] ABSTRACT: Genome-wide association studies have identified hundreds of loci for type 2 diabetes, coronary artery disease and myocardial infarction, as well as for related traits such as body mass index, glucose and insulin levels, lipid levels, and blood pressure. These studies also have pointed to thousands of loci with promising but not yet compelling association evidence. To establish association at additional loci and to characterize the genome-wide significant loci by fine-mapping, we designed the "Metabochip," a custom genotyping array that assays nearly 200,000 SNP markers. Here, we describe the Metabochip and its component SNP sets, evaluate its performance in capturing variation across the allele-frequency spectrum, describe solutions to methodological challenges commonly encountered in its analysis, and evaluate its performance as a platform for genotype imputation. The metabochip achieves dramatic cost efficiencies compared to designing single-trait follow-up reagents, and provides the opportunity to compare results across a range of related traits. The metabochip and similar custom genotyping arrays offer a powerful and cost-effective approach to follow-up large-scale genotyping and sequencing studies and advance our understanding of the genetic basis of complex human diseases and traits.
[show abstract][hide abstract] ABSTRACT: zCall is a variant caller specifically designed for calling rare single-nucleotide polymorphisms from array-based technology. This caller is implemented as a post-processing step after a default calling algorithm has been applied. The algorithm uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. We demonstrate improved detection of rare alleles when applying zCall to samples that have both Illumina Infinium HumanExome BeadChip and exome sequencing data available. AVAILABILITY: http://atguweb.mgh.harvard.edu/apps/zcall. CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
[show abstract][hide abstract] ABSTRACT: Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified. To identify further genetic risk factors, here we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n = 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant, and the overall rate of mutation is only modestly higher than the expected rate. In contrast, the proteins encoded by genes that harboured de novo missense or nonsense mutations showed a higher degree of connectivity among themselves and to previous ASD genes as indexed by protein-protein interaction screens. The small increase in the rate of de novo events, when taken together with the protein interaction results, are consistent with an important but limited role for de novo point mutations in ASD, similar to that documented for de novo copy number variants. Genetic models incorporating these data indicate that most of the observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (that is, not necessarily sufficient for disease). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increases risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favour of CHD8 and KATNAL2 as genuine autism risk factors.
[show abstract][hide abstract] ABSTRACT: Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
[show abstract][hide abstract] ABSTRACT: Noncoding variants at human chromosome 9p21 near CDKN2A and CDKN2B are associated with type 2 diabetes, myocardial infarction, aneurysm, vertical cup disc ratio and at least five cancers. Here we compare approaches to more comprehensively assess genetic variation in the region. We carried out targeted sequencing at high coverage in 47 individuals and compared the results to pilot data from the 1000 Genomes Project. We imputed variants into type 2 diabetes and myocardial infarction cohorts directly from targeted sequencing, from a genotyped reference panel derived from sequencing and from 1000 Genomes Project low-coverage data. Polymorphisms with frequency >5% were captured well by all strategies. Imputation of intermediate-frequency polymorphisms required a higher density of tag SNPs in disease samples than is available on first-generation genome-wide association study (GWAS) arrays. Our association analyses identified more comprehensive sets of variants showing equivalent statistical association with type 2 diabetes or myocardial infarction, but did not identify stronger associations than the original GWAS signals.
[show abstract][hide abstract] ABSTRACT: Global studies of transcript structure and abundance in cancer cells enable the systematic discovery of aberrations that contribute to carcinogenesis, including gene fusions, alternative splice isoforms, and somatic mutations. We developed a systematic approach to characterize the spectrum of cancer-associated mRNA alterations through integration of transcriptomic and structural genomic data, and we applied this approach to generate new insights into melanoma biology. Using paired-end massively parallel sequencing of cDNA (RNA-seq) together with analyses of high-resolution chromosomal copy number data, we identified 11 novel melanoma gene fusions produced by underlying genomic rearrangements, as well as 12 novel readthrough transcripts. We mapped these chimeric transcripts to base-pair resolution and traced them to their genomic origins using matched chromosomal copy number information. We also used these data to discover and validate base-pair mutations that accumulated in these melanomas, revealing a surprisingly high rate of somatic mutation and lending support to the notion that point mutations constitute the major driver of melanoma progression. Taken together, these results may indicate new avenues for target discovery in melanoma, while also providing a template for large-scale transcriptome studies across many tumor types.
Genome Research 02/2010; 20(4):413-27. · 14.40 Impact Factor
[show abstract][hide abstract] ABSTRACT: Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.