Daley T, Smith AD. Predicting the molecular complexity of sequencing libraries. Nat Methods 10: 325-327

Department of Mathematics, University of Southern California, Los Angeles, California, USA.
Nature Methods (Impact Factor: 32.07). 02/2013; 10(4). DOI: 10.1038/nmeth.2375
Source: PubMed


Predicting the molecular complexity of a genomic sequencing library has emerged as a critical but difficult problem in modern applications of genome sequencing. Methods to determine how deeply to sequence, or to predict the benefits of additional sequencing, are almost completely lacking. We introduce an empirical Bayesian method to implicitly model any source of bias and accurately characterize the molecular complexity of a DNA sample or library in almost any sequencing application.



Available from: Timothy Daley, Apr 09, 2014
  • Source
    • "Assessment of raw library sequencing metrics. A: Percent molecular complexity is defined as the number of unique reads divided by the number of total reads in a sample, multiplied by 100% [Daley and Smith, 2013] as a function of millions of reads sampled. B: Whole-exome regions targeted by each technology and the region in common between all technologies. "
    ABSTRACT: Next generation sequencing (NGS) has aided characterization of genomic variation. While whole genome sequencing may capture all possible mutations, whole exome sequencing remains cost-effective and captures most phenotype-altering mutations. Initial strategies for exome enrichment utilized a hybridization-based capture approach. Recently, amplicon-based methods were designed to simplify preparation and utilize smaller DNA inputs. We evaluated two hybridization capture-based and two amplicon-based whole exome sequencing approaches, utilizing both Illumina and Ion Torrent sequencers, comparing on-target alignment, uniformity, and variant calling. While the amplicon methods had higher on-target rates, the hybridization capture-based approaches demonstrated better uniformity. All methods identified many of the same single nucleotide variants, but each amplicon-based method missed variants detected by the other three methods and reported additional variants discordant with all three other technologies. Many of these potential false positives or negatives appear to result from limited coverage, low variant frequency, vicinity to read starts/ends, or the need for platform-specific variant calling algorithms. All methods demonstrated effective copy number variant calling when evaluated against a single nucleotide polymorphism array. This study illustrates some differences between whole exome sequencing approaches, highlights the need for selecting appropriate variant calling based on capture method, and will aid laboratories in selecting their preferred approach. This article is protected by copyright. All rights reserved.
    Human Mutation 06/2015; 36(9). DOI:10.1002/humu.22825 · 5.14 Impact Factor
  • Source
    • "In particular the estimator will diverge to positive or negative infinity depending on whether the largest observed coverage count is odd or even. We introduced rational function approximations to obtain globally stable estimates that still satisfy the nice local properties of the Good-Toulmin estimator (Daley and Smith, 2013). A rational function approximation to a power series is a ratio of polynomials that asymptotically approximates the power series up to a given degree, "
    ABSTRACT: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. But because single-cell experiments begin with such small amounts of starting material, the amount of information obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol employed and to variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material.
    Bioinformatics 08/2014; 30(22). DOI:10.1093/bioinformatics/btu540 · 4.98 Impact Factor
  • Source
    • "Figure 4 Library complexity curves quantify library complexity and the diminishing returns as sequencing progresses. Library complexity is an estimate of the number of distinct molecules in the library (Daley and Smith 2013). The convexly shaped complexity curve plots the number of distinct molecules observed against the number of sequenced reads. "
    ABSTRACT: Conifers are the predominant gymnosperm. The size and complexity of their genomes have presented formidable technical challenges for whole-genome shotgun sequencing and assembly. We employed novel strategies that allowed us to determine the loblolly pine (Pinus taeda) reference genome sequence, the largest genome assembled to date. Most of the sequence data were derived from whole-genome shotgun sequencing of a single megagametophyte, the haploid tissue of a single pine seed. Although that constrained the quantity of available DNA, the resulting haploid sequence data were well-suited for assembly. The haploid sequence was augmented with multiple linking long-fragment mate pair libraries from the parental diploid DNA. For the longest fragments, we used novel fosmid DiTag libraries. Sequences from the linking libraries that did not match the megagametophyte were identified and removed. Assembly of the sequence data was aided by condensing the enormous number of paired-end reads into a much smaller set of longer "super-reads," rendering subsequent assembly with an overlap-based assembly algorithm computationally feasible. To further improve the contiguity and biological utility of the genome sequence, additional scaffolding methods utilizing independent genome and transcriptome assemblies were implemented. The combination of these strategies resulted in a draft genome sequence of 20.15 billion bases, with an N50 scaffold size of 66.9 kbp.
    Genetics 03/2014; 196(3):875-90. DOI:10.1534/genetics.113.159715 · 5.96 Impact Factor
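
The quantities mentioned in the excerpts above (percent molecular complexity, the complexity curve, and the Good-Toulmin estimator) can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names are hypothetical, reads are assumed to be represented by hashable molecule identifiers, and the Good-Toulmin series is used in its classical form, Δ(t) = Σ_j (-1)^(j+1) t^j n_j, where n_j is the number of distinct molecules observed exactly j times. The sketch restricts that series to extrapolations no larger than the original experiment (0 ≤ t ≤ 1), since it becomes unstable beyond that point; the rational function approximations described in the excerpt are not implemented here.

```python
import random
from collections import Counter


def percent_complexity(reads):
    """Unique reads divided by total reads, multiplied by 100."""
    if not reads:
        raise ValueError("reads must not be empty")
    return 100.0 * len(set(reads)) / len(reads)


def complexity_curve(reads, step):
    """Distinct molecules observed after every `step` reads, in input order."""
    seen = set()
    curve = []
    for i, read in enumerate(reads, start=1):
        seen.add(read)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


def counts_histogram(reads):
    """Map j -> n_j, the number of distinct molecules seen exactly j times."""
    per_molecule = Counter(reads)
    return Counter(per_molecule.values())


def good_toulmin(hist, t):
    """Expected number of new distinct molecules in t * N further reads.

    Classical alternating series; only used here for 0 <= t <= 1.
    """
    if not 0 <= t <= 1:
        raise ValueError("this sketch only supports 0 <= t <= 1")
    return sum((-1) ** (j + 1) * t ** j * n for j, n in hist.items())


if __name__ == "__main__":
    # Simulated library (an assumption for illustration): 5000 molecules,
    # sampled uniformly with replacement for 4000 reads.
    rng = random.Random(0)
    reads = [rng.randrange(5000) for _ in range(4000)]

    print("percent complexity:", percent_complexity(reads))
    print("complexity curve:", complexity_curve(reads, 1000))
    hist = counts_histogram(reads)
    print("expected new molecules if sequencing is doubled:",
          good_toulmin(hist, 1.0))
```

In the simulated run, the curve flattens as more reads are drawn, which is the diminishing return the complexity curve is meant to expose. Real libraries are not sampled uniformly, so the simulation is only a stand-in for actual read data.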