Predicting the molecular complexity of sequencing libraries

Department of Mathematics, University of Southern California, Los Angeles, California, USA.
Nature Methods (Impact Factor: 25.95). 02/2013; 10(4). DOI: 10.1038/nmeth.2375
Source: PubMed

ABSTRACT: Predicting the molecular complexity of a genomic sequencing library has emerged as a critical but difficult problem in modern applications of genome sequencing. Methods to determine how deeply to sequence, or to predict the benefits of additional sequencing, are almost completely lacking. We introduce an empirical Bayesian method to implicitly model any source of bias and accurately characterize the molecular complexity of a DNA sample or library in almost any sequencing application.

Available from: Timothy Daley, Apr 09, 2014
  • Source
    ABSTRACT: Conifers are the predominant gymnosperm. The size and complexity of their genomes have presented formidable technical challenges for whole-genome shotgun sequencing and assembly. We employed novel strategies that allowed us to determine the loblolly pine (Pinus taeda) reference genome sequence, the largest genome assembled to date. Most of the sequence data were derived from whole-genome shotgun sequencing of a single megagametophyte, the haploid tissue of a single pine seed. Although that constrained the quantity of available DNA, the resulting haploid sequence data were well suited for assembly. The haploid sequence was augmented with multiple linking long-fragment mate pair libraries from the parental diploid DNA. For the longest fragments, we used novel fosmid DiTag libraries. Sequences from the linking libraries that did not match the megagametophyte were identified and removed. Assembly of the sequence data was aided by condensing the enormous number of paired-end reads into a much smaller set of longer "super-reads," rendering subsequent assembly with an overlap-based assembly algorithm computationally feasible. To further improve the contiguity and biological utility of the genome sequence, additional scaffolding methods utilizing independent genome and transcriptome assemblies were implemented. The combination of these strategies resulted in a draft genome sequence of 20.15 billion bases, with an N50 scaffold size of 66.9 kbp.
    Genetics 03/2014; 196(3):875-90. DOI:10.1534/genetics.113.159715 · 4.87 Impact Factor
  • Source
    ABSTRACT: We report metrics from complete genome capture of nuclear DNA from extinct mammoths using biotinylated RNAs transcribed from an Asian elephant DNA extract. Enrichment of the nuclear genome ranged from 1.06- to 18.65-fold, to an apparent maximum threshold of about 80% on-target. This projects an order of magnitude less costly complete genome sequencing from long-dead organisms, even when a reference genome is unavailable for bait design.
    Molecular Biology and Evolution 02/2014; DOI:10.1093/molbev/msu074 · 14.31 Impact Factor
  • Source
    ABSTRACT: Lentil (Lens culinaris ssp. culinaris) is a nutritious and affordable pulse with an ancient crop domestication history. The genus Lens consists of seven taxa; however, there are many discrepancies in the taxon and gene pool classification of lentil and its wild relatives. Due to the narrow genetic basis of cultivated lentil, there is a need for a better understanding of the relationships amongst wild germplasm to assist introgression of favourable genes into lentil breeding programs. Genotyping-by-sequencing (GBS) is an easy and affordable method that allows multiplexing of up to 384 samples or more per library to generate genome-wide single-nucleotide polymorphism (SNP) markers. In this study, we aimed to characterize our lentil germplasm collection using a two-enzyme GBS approach. We constructed two 96-plex GBS libraries with a total of 60 accessions where some accessions had several samples and each sample was sequenced in two technical replicates. We developed an automated GBS pipeline and detected a total of 266,356 genome-wide SNPs. After filtering low-quality and redundant SNPs based on haplotype information, we constructed a maximum-likelihood tree using 5,389 SNPs. The phylogenetic tree grouped the germplasm collection into their respective taxa with strong support. Based on the phylogenetic tree and STRUCTURE analysis, we identified four gene pools, namely L. culinaris/L. orientalis/L. tomentosus, L. lamottei/L. odemensis, L. ervoides and L. nigricans, which form primary, secondary, tertiary and quaternary gene pools, respectively. We discovered sequencing bias problems likely due to DNA quality and observed severe run-to-run variation in the wild lentils. We examined the authenticity of the germplasm collection and identified 17% misclassified samples. Our study demonstrated that GBS is a promising and affordable tool for screening by plant breeders interested in crop wild relatives.
    PLoS ONE 03/2015; 10(3):e0122025. DOI:10.1371/journal.pone.0122025 · 3.53 Impact Factor