Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome.
ABSTRACT Structural variation (SV) is a rich source of genetic diversity in mammals, but due to the challenges associated with mapping SV in complex genomes, basic questions regarding their genomic distribution and mechanistic origins remain unanswered. We have developed an algorithm (HYDRA) to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences. We applied these methods to two inbred mouse strains: C57BL/6J and DBA/2J. We demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications; however, our analysis of the C57BL/6J reference strain shows that incomplete reference genome assemblies are a major source of noise. We report 7196 SVs between the two strains, more than two-thirds of which are due to transposon insertions. Of the remainder, 59% are deletions (relative to the reference), 26% are insertions of unlinked DNA, 9% are tandem duplications, and 6% are inversions. To investigate the origins of SV, we characterized 3316 breakpoint sequences at single-nucleotide resolution. We find that approximately 16% of non-transposon SVs have complex breakpoint patterns consistent with template switching during DNA replication or repair, and that this process appears to preferentially generate certain classes of complex variants. Moreover, we find that SVs are significantly enriched in regions of segmental duplication, but that this effect is largely independent of DNA sequence homology and thus cannot be explained by non-allelic homologous recombination (NAHR) alone. This result suggests that the genetic instability of such regions is often the cause rather than the consequence of duplicated genomic architecture.
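The paired-end mapping idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the HYDRA algorithm; it only shows the commonly used signatures by which a single discordant read pair hints at an SV class, assuming a forward/reverse paired-end library. The `ReadPair` type, the `classify_pair` function, and the `min_span`/`max_span` thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapped positions and strands of the two ends of one read pair (hypothetical type)."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str  # "+" or "-"


def classify_pair(pair, min_span=200, max_span=600):
    """Return the SV class suggested by one read pair.

    Assumes a forward/reverse library in which a concordant pair maps to the
    same chromosome, leftmost end on "+", rightmost end on "-", with a span
    between min_span and max_span (illustrative thresholds, not from the paper).
    """
    if pair.chrom1 != pair.chrom2:
        return "unlinked"  # ends map to different chromosomes

    # Order the two ends by mapped position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    span = right_pos - left_pos

    if left_strand == right_strand:
        return "inversion"  # both ends on the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication"  # everted orientation
    if span > max_span:
        return "deletion"  # ends map farther apart than the library allows
    if span < min_span:
        return "insertion"  # ends map closer together than expected
    return "concordant"


pair = ReadPair("chr1", 1000, "+", "chr1", 6000, "-")
label = classify_pair(pair)
```

In practice a caller would cluster many such pairs before reporting a breakpoint, since a single discordant pair is weak evidence on its own.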
Available from: Camille Berthelot
ABSTRACT: Genomic rearrangements are a major source of evolutionary divergence in eukaryotic genomes, a cause of genetic diseases and a hallmark of tumor cell progression, yet the mechanisms underlying their occurrence and evolutionary fixation are poorly understood. Statistical associations between breakpoints and specific genomic features suggest that genomes may contain elusive "fragile regions" with a higher propensity for breakage. Here, we use ancestral genome reconstructions to demonstrate a near-perfect correlation between gene density and evolutionary rearrangement breakpoints. Simulations based on functional features in the human genome show that this pattern is best explained as the outcome of DNA breaks that occur in open chromatin regions coming into 3D contact in the nucleus. Our model explains how rearrangements reorganize the order of genes in an evolutionary neutral fashion and provides a basis for understanding the susceptibility of "fragile regions" to breakage. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Cell Reports 03/2015; DOI:10.1016/j.celrep.2015.02.046 · 7.21 Impact Factor
ABSTRACT: Non-allelic homologous recombination (NAHR) is a common mechanism for generating genome rearrangements and is implicated in numerous genetic disorders, but its detection in high-throughput sequencing data poses a serious challenge. We present a probabilistic model of NAHR and demonstrate its ability to find NAHR in low-coverage sequencing data from 44 individuals. We identify NAHR-mediated deletions or duplications in 109 of 324 potential NAHR loci in at least one of the individuals. These calls segregate by ancestry, are more common in closely spaced repeats, often result in duplicated genes or pseudogenes, and affect highly studied genes such as GBA and CYP2E1.
Genome Biology 04/2015; 16(1). DOI:10.1186/s13059-015-0633-1 · 10.47 Impact Factor
ABSTRACT: The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150-1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1-100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation.
Frontiers in Genetics 02/2015; 6(45). DOI:10.3389/fgene.2015.00045
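The normalize-then-segment idea described in this abstract can be sketched roughly as follows. This is not the authors' tool: it is a toy illustration that divides per-bin sample depth by a reference profile, rescales by the median ratio, and then runs a three-state Viterbi decoder with Gaussian emissions. The function names, the state set, and the `sd` and `stay` parameters are all assumptions made for the example.

```python
import math
from statistics import median

COPY_NUMBERS = (1, 2, 3)  # hypothetical state set: hemizygous loss, diploid, gain


def normalize_to_profile(depth, profile):
    """Divide sample depth by the reference profile bin by bin, then rescale
    so the median ratio is 1.0. Assumes every profile value is positive."""
    raw = [d / p for d, p in zip(depth, profile)]
    center = median(raw)
    return [r / center for r in raw]


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_copy_number(ratios, sd=0.15, stay=0.99):
    """Most likely copy-number path for a series of normalized ratios.

    The emission for copy number c is Gaussian with mean c / 2; the chain
    stays in its state with probability `stay` (illustrative values only).
    """
    n = len(COPY_NUMBERS)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (n - 1))

    # Uniform start, so only the first emission matters initially.
    scores = [gaussian_logpdf(ratios[0], c / 2, sd) for c in COPY_NUMBERS]
    backpointers = []
    for ratio in ratios[1:]:
        new_scores, pointers = [], []
        for j, c in enumerate(COPY_NUMBERS):
            candidates = [
                scores[i] + (log_stay if i == j else log_switch) for i in range(n)
            ]
            best_i = max(range(n), key=lambda i: candidates[i])
            new_scores.append(candidates[best_i] + gaussian_logpdf(ratio, c / 2, sd))
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: scores[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [COPY_NUMBERS[i] for i in path]


profile = [30.0] * 20                       # made-up reference depth per bin
depth = [30.0] * 8 + [15.0] * 6 + [30.0] * 6  # made-up sample with a half-depth run
calls = viterbi_copy_number(normalize_to_profile(depth, profile))
```

With the made-up inputs above, the half-depth run is the kind of signal a hemizygous deletion would leave; real data would need noise-aware emission parameters estimated from the profile itself.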