Accurate and comprehensive sequencing of personal genomes

Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Genome Research (Impact Factor: 14.63). 07/2011; 21(9):1498-505. DOI: 10.1101/gr.123638.111
Source: PubMed


As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ∼30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAII(x) and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a "sequencing guide" for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.
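The abstract reports its coverage findings empirically and does not give a model. As a rough illustration only, the sketch below uses an idealized Poisson model of per-site read depth to estimate what fraction of sites would reach a minimum depth at a given mean coverage. The Poisson assumption, the function names, and the 20-read threshold are all assumptions made for this example, not the authors' method or filters. Real depth distributions are more dispersed than Poisson, so figures from this model are likely optimistic, which is in line with the paper's argument for measuring the callable fraction directly.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_size


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i  # P(X = i) from P(X = i - 1)
        total += term
    return min(total, 1.0)


def fraction_at_least(min_depth: int, coverage: float) -> float:
    """Fraction of sites expected to have depth >= min_depth (Poisson model)."""
    if min_depth <= 0:
        return 1.0
    return 1.0 - poisson_cdf(min_depth - 1, coverage)


if __name__ == "__main__":
    # Hypothetical threshold: require at least 20 reads at a site to call it.
    threshold = 20
    for cov in (10, 30, 60, 126):
        frac = fraction_at_least(threshold, cov)
        print(f"mean coverage {cov:>3}x: {frac:.4f} of sites at depth >= {threshold}")
```

Comparing the printed fractions with callable fractions measured from real alignments shows how far a given data set departs from the idealized model.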



Available from: Hatice Ozel Abaan, Feb 21, 2014
  • Source
    • "We identified a small number of single-nucleotide changes (SNCs) in the earlier MSC cultures (p1, 219; p8, 254), while a significant number of somatic mutations (856) were found in p13 MSCs (Figure 1B). Large-scale and massively parallel sequencing is not perfect and presents false-positive calls in SNC identification (Ajay et al., 2011). Thus, we used a mass spectrometry-based Sequenom assay to systematically validate the identified SNCs."
    ABSTRACT: Culture-expanded human mesenchymal stem cells (MSCs) are increasingly used in clinics, yet full characterization of the genomic compositions of these cells is lacking. We present a whole-genome investigation on the genetic dynamics of cultured MSCs under ex vivo establishment (passage 1 [p1]) and serial expansion (p8 and p13). We detected no significant changes in copy-number alterations (CNAs) and low levels of single-nucleotide changes (SNCs) until p8. Strikingly, a significant number (677) of SNCs were found in p13 MSCs. Using a sensitive Droplet Digital PCR assay, we tested the nonsynonymous SNCs detected by whole-genome sequencing and found that they were preexisting low-frequency mutations in uncultured mononuclear cells (∼0.01%) and early-passage MSCs (0.1%-1% at p1 and p8) but reached 17%-36% in p13. Our data demonstrate that human MSCs maintain a stable genomic composition in the early stages of ex vivo culture but are subject to clonal growth upon extended expansion.
    Stem Cell Reports 08/2014; 3(2). DOI:10.1016/j.stemcr.2014.05.019 · 5.37 Impact Factor
  • Source
    • "To reduce the effects of PCR artefacts, one strategy is to remove the duplicate reads before calculating the number of sequencing reads falling into the open reading frame of one gene. This strategy has been used in ChIP-Seq (Leleu et al., 2010; Szulwach et al., 2011), CLIP-Seq (Taliaferro et al., 2013), RNA-Seq (Nookaew et al., 2012; Ueno et al., 2013) and DNA-Seq (Ajay et al., 2011; Clark et al., 2010), and effectively removes all the duplicate reads resulting from the PCR reaction. This method in general works fine for data based on mammalian genomes due to the relatively low per base coverage. "
    ABSTRACT: High throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or under-estimation of gene expression levels will hinder accurate gene differential expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence compositions. Our results show that analysis using raw reads identifies more differentially expressed genes with more accurate fold change than using unique reads. We also demonstrate the presence of sequence composition related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still show strong coverage imbalance after correction were tagged using statistical approach.
    International Journal of Computational Biology and Drug Design 06/2014; 7(2/3):195-213. DOI:10.1504/IJCBDD.2014.061646
  • Source
    • "the same tumor DNA sample sequenced twice in two separate batches. Additionally, the greater sequencing depth in rBHK2 may contribute to the larger number of variants called in rBHK2 versus rBHK1 (Ajay et al., 2011). "
    ABSTRACT: Baby Hamster Kidney (BHK) cell lines are used in the production of veterinary vaccines and recombinant proteins. To facilitate transcriptome analysis of BHK cell lines, we embarked on an effort to sequence, assemble, and annotate transcript sequences from a recombinant BHK cell line and Syrian hamster liver and brain. RNA-seq data were supplemented with 6,170 Sanger ESTs from parental and recombinant BHK lines to generate 221,583 contigs. Annotation by homology to other species, primarily mouse, yielded more than 15,000 unique Ensembl mouse gene IDs with high coverage of KEGG canonical pathways. High coverage of enzymes and isoforms was seen for cell metabolism and N-glycosylation pathways, areas of highest interest for biopharmaceutical production. With the high sequencing depth in RNA-seq data, we set out to identify single-nucleotide variants in the transcripts. A majority of the high-confidence variants detected in both hamster tissue libraries occurred at a frequency of 50%, indicating their origin as heterozygous germline variants. In contrast, the cell line libraries' variants showed a wide range of occurrence frequency, indicating the presence of a heterogeneous population in cultured cells. The extremely high coverage of transcripts of highly abundant genes in RNA-seq enabled us to identify low-frequency variants. Experimental verification through Sanger sequencing confirmed the presence of two variants in the cDNA of a highly expressed gene in the BHK cell line. Furthermore, we detected seven potential missense mutations in the genes of the growth signaling pathways that may have arisen during the cell line derivation process. The development and characterization of a BHK reference transcriptome will facilitate future efforts to understand, monitor, and manipulate BHK cells. 
    Our study on sequencing variants is crucial for improved understanding of the errors inherent in high-throughput sequencing and to increase the accuracy of variant calling in BHK or other systems.
    Biotechnology and Bioengineering 04/2014; 111(4). DOI:10.1002/bit.25135 · 4.13 Impact Factor