Sequencing the unsequenceable: Expanded CGG-repeat alleles of the fragile X gene.

Biochemistry and Molecular Medicine, University of California, Davis, School of Medicine
Genome Research (Impact Factor: 13.85). 10/2012; DOI: 10.1101/gr.141705.112
Source: PubMed

ABSTRACT The human fragile X mental retardation 1 (FMR1) gene contains a (CGG)n trinucleotide repeat in its 5' untranslated region (5'UTR). Expansions of this repeat result in a number of clinical disorders with distinct molecular pathologies, including fragile X syndrome (FXS; full mutation range, >200 CGG repeats) and fragile X-associated tremor/ataxia syndrome (FXTAS; premutation range, 55-200 repeats). Study of these diseases has been limited by an inability to sequence expanded CGG repeats, particularly in the full mutation range, with existing DNA sequencing technologies. Single-molecule, real-time (SMRT) sequencing provides an approach to sequencing that is fundamentally different from other "next-generation" sequencing platforms, and is well suited for long, repetitive DNA sequences. We report the first sequence data for expanded CGG-repeat FMR1 alleles in the full mutation range that reveal the confounding effects of CGG-repeat tracts on both cloning and PCR. A unique feature of SMRT sequencing is its ability to yield real-time information on the rates of nucleoside addition by the tethered DNA polymerase; for the CGG-repeat alleles, we find a strand-specific effect of CGG-repeat DNA on the inter-pulse distance. This kinetic signature reveals a novel aspect of the repeat element; namely, that the particular G bias within the CGG/CCG-repeat element influences polymerase activity in a manner that extends beyond simple nearest-neighbor effects. These observations provide a baseline for future kinetic studies of repeat elements, as well as for studies of epigenetic and other chemical modifications thereof.
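The repeat-size ranges quoted in the abstract can be illustrated with a minimal sketch. It assumes a read is already available as a plain string and counts only the longest uninterrupted (CGG)n run, ignoring any interruptions in the tract. The function names `longest_cgg_run` and `classify_fmr1_allele` are hypothetical, and the thresholds are the ones stated above (premutation 55-200, full mutation >200); this is not the authors' analysis pipeline.

```python
import re


def longest_cgg_run(seq: str) -> int:
    """Return the number of CGG units in the longest uninterrupted (CGG)n run."""
    runs = re.findall(r"(?:CGG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_fmr1_allele(n_repeats: int) -> str:
    """Classify a repeat count using the ranges quoted in the abstract."""
    if n_repeats > 200:
        return "full mutation"
    if n_repeats >= 55:
        return "premutation"
    # The abstract gives no labels for smaller alleles, so none are invented here.
    return "below premutation range"


if __name__ == "__main__":
    # Toy read: hypothetical flanking bases around a 60-unit CGG tract.
    read = "AT" + "CGG" * 60 + "TA"
    n = longest_cgg_run(read)
    print(n, classify_fmr1_allele(n))
```

A read with a 60-unit tract falls in the premutation range under these thresholds. In practice, the difficulty the abstract describes is obtaining a read that spans the whole tract at all.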

  • ABSTRACT: Today, the base sequence of DNA is mostly determined through sequencing by synthesis, as provided by Illumina sequencers. Although highly accurate, the resulting reads are short, which makes their analysis challenging. Recently, a new technology, Single Molecule Real-Time (SMRT) sequencing, was developed that could address these challenges, as it generates reads of several thousand bases. However, its broad application has been hampered by a high error rate. Therefore, hybrid approaches that use high-quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations place heavy demands on hardware, work only in well-defined computing infrastructures, and reject a substantial number of reads. This limits their usability considerably, especially in large sequencing projects.
    Bioinformatics 07/2014; 30(21). DOI:10.1093/bioinformatics/btu392 · 4.62 Impact Factor
  • ABSTRACT: Background: The chloroplast genome is important for plant development and plant evolution. Nelumbo nucifera is a relict plant surviving from the late Cretaceous. Recently, a new sequencing platform, PacBio RS II, known as 'SMRT (Single Molecule, Real-Time) sequencing', has been developed. Using SMRT sequencing to investigate the chloroplast genome of N. nucifera will help to elucidate the plastid evolution of basal eudicots. Results: The sizes of the de novo assembled complete chloroplast genome of N. nucifera were 163,307 bp, 163,747 bp and 163,600 bp with average depths of coverage of 7×, 712× and 105× sequenced by Sanger, Illumina MiSeq and PacBio RS II, respectively. The precise chloroplast genome of N. nucifera was obtained from PacBio RS II data proofread by Illumina MiSeq reads, with a quadripartite structure containing a large single copy region (91,846 bp) and a small single copy region (19,626 bp) separated by two inverted repeat regions (26,064 bp). The genome contains 113 different genes, including four distinct rRNAs, 30 distinct tRNAs and 79 distinct peptide-coding genes. A phylogenetic analysis of 133 taxa from 56 orders indicated that Nelumbo, with an age of 177 million years, is a sister clade to Platanus, which belongs to the basal eudicots. Basal eudicots began to emerge during the early Jurassic, with divergence times estimated at 197 million years using MCMCTree. IR expansions/contractions within the basal eudicots seem to have occurred independently. Conclusions: Because of long reads and lack of bias in coverage of AT-rich regions, PacBio RS II showed great promise for highly accurate 'finished' genomes, especially for de novo assembly of genomes. N. nucifera is a basal eudicot; however, evolutionary analyses of IR structural variations of N. nucifera and other basal eudicots suggested that IR expansions/contractions occurred independently in these basal eudicots or were caused by independent insertions and deletions. The precise chloroplast genome of N. nucifera will present new information on structural variation of chloroplast genomes and provide new insight into the evolution of basal eudicots at the primary sequence and structural level.
    BMC Plant Biology 11/2014; 14(1):289. DOI:10.1186/s12870-014-0289-0 · 3.94 Impact Factor
  • ABSTRACT: Repetitive and redundant regions of a genome are particularly problematic for mapping sequencing reads. In the present paper, we compile a list of the unmappable regions in the human genome based on the following definition: hypothetical reads of length 1 kb that cannot be uniquely mapped with zero-mismatch alignment for the described regions, considering both the forward and reverse strand. The respective collection of unmappable regions covers 0.77% of the sequence of human autosomes and 8.25% of the sex chromosomes in the reference genome GRCh37/hg19 (overall 1.23%). Not surprisingly, our unmappable regions overlap greatly with segmental duplication, transposable elements, and structural variants. About 99.8% of bases in our unmappable regions are part of either segmental duplication or transposable elements and 98.3% overlap structural variant annotations. Notably, some of these regions overlap units with important biological functions, including 4% of protein-coding genes. In contrast, these regions have zero intersection with the ultraconserved elements, very low overlap with microRNAs, tRNAs, pseudogenes, CpG islands, tandem repeats, microsatellites, sensitive non-coding regions, and the mapping blacklist regions from the ENCODE project.
    Computational Biology and Chemistry 08/2014; 53. DOI:10.1016/j.compbiolchem.2014.08.015 · 1.60 Impact Factor
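
The definition of "unmappable" in the last abstract above (a hypothetical read that has no unique zero-mismatch placement on either strand) can be sketched on a toy scale. The code below is an illustration only, not the method used in that paper: the helper names are hypothetical, the read length is a parameter rather than the 1 kb used there, and the naive in-memory counting would not scale to a whole genome.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Strand-independent key: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def unmappable_starts(reference: str, read_len: int) -> list[int]:
    """Start positions whose exact read also occurs elsewhere on either strand."""
    ref = reference.upper()
    kmers = [ref[i:i + read_len] for i in range(len(ref) - read_len + 1)]
    counts = Counter(canonical(k) for k in kmers)
    return [i for i, k in enumerate(kmers) if counts[canonical(k)] > 1]


if __name__ == "__main__":
    # "GATT" at position 0 reappears as its reverse complement "AATC" at position 5.
    print(unmappable_starts("GATTCAATC", 4))
```

Keying each read by the smaller of its two strand orientations lets a single counting pass cover both the forward and the reverse strand.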

