ABSTRACT: Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequencies (2 to 5%), but because of insufficient sample size it is not clear whether the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes for nearly 700 samples. Although medical whole-exome projects are currently under way, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. In keeping with the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
Genome biology 09/2011; 12(9):R84. · 6.63 Impact Factor
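The allele-frequency threshold used in the abstract above (variants below 1% allele frequency) can be illustrated with a minimal sketch. The function names, the input layout (alternate-allele count plus number of chromosomes called) and the example counts are assumptions made for illustration; they are not the pipeline or the data described in the paper.

```python
def allele_frequency(alt_count, total_alleles):
    """Return the alternate-allele frequency for one site."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    if not 0 <= alt_count <= total_alleles:
        raise ValueError("alt_count must lie between 0 and total_alleles")
    return alt_count / total_alleles


def fraction_below(frequencies, threshold=0.01):
    """Return the fraction of sites whose frequency is strictly below threshold."""
    if not frequencies:
        raise ValueError("frequencies must not be empty")
    return sum(1 for f in frequencies if f < threshold) / len(frequencies)


# Hypothetical alternate-allele counts at six sites, each called on
# 1,400 chromosomes (700 diploid samples). Invented for illustration.
alt_counts = [1, 5, 13, 28, 70, 700]
freqs = [allele_frequency(c, 1400) for c in alt_counts]
rare_fraction = fraction_below(freqs, threshold=0.01)
```

With these invented counts, three of the six sites fall below the 1% threshold. A real analysis would also have to account for per-site differences in the number of samples successfully genotyped.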
ABSTRACT: Here, we present ContEst, a tool for estimating the level of cross-individual contamination in next-generation sequencing data. We demonstrate the accuracy of ContEst across a range of contamination levels, sources and read depths using sequencing data mixed in silico at known concentrations. We applied our tool to published cancer sequencing datasets and report their estimated contamination levels.
ContEst is a GATK module and is distributed under a BSD-style license at http://www.broadinstitute.org/cancer/cga/contest
Supplementary data are available at Bioinformatics online.
Bioinformatics 07/2011; 27(18):2601-2. · 5.47 Impact Factor
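The idea of mixing data in silico at a known concentration and then recovering that level, as described in the ContEst abstract above, can be sketched with a toy model. This is not ContEst's algorithm or interface. It assumes idealized sites where the sample is homozygous reference and the contaminant is homozygous alternate, with no sequencing error, so the alternate-read fraction directly estimates the contamination level. All function names and parameters are hypothetical.

```python
import random


def simulate_alt_counts(contamination, n_sites, depth, seed=0):
    """Simulate (alt_reads, depth) per site for a mixed sample.

    Assumption: each read comes from the contaminant with probability
    `contamination`, and only contaminant reads carry the alternate allele.
    """
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sites):
        alt = sum(1 for _ in range(depth) if rng.random() < contamination)
        counts.append((alt, depth))
    return counts


def estimate_contamination(counts):
    """Estimate contamination as the pooled alternate-read fraction."""
    alt_total = sum(alt for alt, _ in counts)
    depth_total = sum(depth for _, depth in counts)
    if depth_total == 0:
        raise ValueError("no reads to estimate from")
    return alt_total / depth_total


# Mix at a known 5% level, then check how closely it is recovered.
mixed = simulate_alt_counts(contamination=0.05, n_sites=200, depth=500)
estimate = estimate_contamination(mixed)
```

Pooling reads across sites keeps the sketch short. A realistic estimator would need to model sequencing error and contaminant genotypes that are not homozygous alternate, which this toy version ignores.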
ABSTRACT: Noncoding variants at human chromosome 9p21 near CDKN2A and CDKN2B are associated with type 2 diabetes, myocardial infarction, aneurysm, vertical cup-disc ratio and at least five cancers. Here we compare approaches to more comprehensively assess genetic variation in the region. We carried out targeted sequencing at high coverage in 47 individuals and compared the results to pilot data from the 1000 Genomes Project. We imputed variants into type 2 diabetes and myocardial infarction cohorts using three sources: directly from targeted sequencing, from a genotyped reference panel derived from sequencing, and from 1000 Genomes Project low-coverage data. Polymorphisms with frequency >5% were captured well by all strategies. Imputation of intermediate-frequency polymorphisms required a higher density of tag SNPs in disease samples than is available on first-generation genome-wide association study (GWAS) arrays. Our association analyses identified more comprehensive sets of variants showing equivalent statistical association with type 2 diabetes or myocardial infarction, but did not identify stronger associations than the original GWAS signals.
Nature Genetics 01/2011; 43(8):801-5. · 35.53 Impact Factor