Accurate and comprehensive sequencing of personal genomes

Genome Informatics Section, Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA.
Genome Research (Impact Factor: 14.63). 07/2011; 21(9):1498-505. DOI: 10.1101/gr.123638.111
Source: PubMed


As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ∼30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAII(x) and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a "sequencing guide" for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.
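
To make the link between mean coverage and the usable portion of the genome concrete, the following is a minimal sketch under an idealized assumption: per-base read depth follows a Poisson distribution with mean equal to the average coverage. This is not the method used in the paper, which determined the callable fraction empirically. The function name `fraction_at_least` and the depth threshold of 20 reads are hypothetical choices made only for illustration.

```python
import math


def fraction_at_least(mean_coverage: float, min_depth: int) -> float:
    """Expected fraction of bases with depth >= min_depth.

    Assumes (hypothetically) that per-base depth is Poisson-distributed
    with mean `mean_coverage`; real sequencing data are usually more
    dispersed than this.
    """
    if min_depth <= 0:
        return 1.0
    # Accumulate P(depth = k) for k = 0 .. min_depth - 1, then take the complement.
    term = math.exp(-mean_coverage)  # P(depth = 0)
    below = 0.0
    for k in range(min_depth):
        below += term
        term *= mean_coverage / (k + 1)  # P(depth = k + 1) from P(depth = k)
    return 1.0 - below


if __name__ == "__main__":
    # The threshold of 20 reads is an illustrative assumption, not a value from the paper.
    for coverage in (10, 30, 50, 126):
        share = fraction_at_least(coverage, 20)
        print(f"mean coverage {coverage:>3}x: fraction with depth >= 20 is {share:.4f}")
```

Because real coverage is typically less uniform than a Poisson model predicts (for example, because of GC content and regions that are hard to map), an estimate like this is best read as an optimistic upper bound rather than a substitute for the empirical curves the abstract describes.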

Available from: Hatice Ozel Abaan, Feb 21, 2014
    • "Let $X = \{x_1, \dots, x_n\}$ be a set and $A = (a_1, \dots, a_n)$, $B = (b_1, \dots, b_n)$, where $a_i, b_i \in [0, 1]$"
    ABSTRACT: Bioinformatics is a relatively new discipline in which mathematics is applied to the analysis of genetic sequences. The analysis of the genetic material of living organisms, which consists of the nucleic acids DNA and RNA, is of great importance for diagnosis and taxonomy. In the present paper we propose a new methodology for the representation of genetic sequences as fuzzy sets in the $I^{12}$ space, which can significantly improve the results of Sadegh-Zadeh and Torres & Nieto. An important characteristic of our proposed methodology is that the location of amino acids along the genetic sequences plays an important role, thus extending in a significant way the computational efficiency advantage of genetic sequence representation. We present some characteristic examples using the proposed methodology, in which we calculate the distance and similarity degree of given polynucleotides.
    Article · Aug 2015 · Journal of Intelligent and Fuzzy Systems
    • "We identified a small number of single-nucleotide changes (SNCs) in the earlier MSC cultures (p1, 219; p8, 254), while a significant number of somatic mutations (856) were found in p13 MSCs (Figure 1B). Large-scale and massively parallel sequencing is not perfect and presents false-positive calls in SNC identification (Ajay et al., 2011). Thus, we used a mass spectrometry-based Sequenom assay to systematically validate the identified SNCs."
    ABSTRACT: Culture-expanded human mesenchymal stem cells (MSCs) are increasingly used in clinics, yet full characterization of the genomic compositions of these cells is lacking. We present a whole-genome investigation on the genetic dynamics of cultured MSCs under ex vivo establishment (passage 1 [p1]) and serial expansion (p8 and p13). We detected no significant changes in copy-number alterations (CNAs) and low levels of single-nucleotide changes (SNCs) until p8. Strikingly, a significant number (677) of SNCs were found in p13 MSCs. Using a sensitive Droplet Digital PCR assay, we tested the nonsynonymous SNCs detected by whole-genome sequencing and found that they were preexisting low-frequency mutations in uncultured mononuclear cells (∼0.01%) and early-passage MSCs (0.1%-1% at p1 and p8) but reached 17%-36% in p13. Our data demonstrate that human MSCs maintain a stable genomic composition in the early stages of ex vivo culture but are subject to clonal growth upon extended expansion.
    Article · Aug 2014 · Stem Cell Reports
    • "To reduce the effects of PCR artefacts, one strategy is to remove the duplicate reads before calculating the number of sequencing reads falling into the open reading frame of one gene. This strategy has been used in ChIP-Seq (Leleu et al., 2010; Szulwach et al., 2011), CLIP-Seq (Taliaferro et al., 2013), RNA-Seq (Nookaew et al., 2012; Ueno et al., 2013) and DNA-Seq (Ajay et al., 2011; Clark et al., 2010), and effectively removes all the duplicate reads resulting from the PCR reaction. This method in general works fine for data based on mammalian genomes due to the relatively low per base coverage. "
    ABSTRACT: High-throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or under-estimation of gene expression levels will hinder accurate gene differential expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence composition. Our results show that analysis using raw reads identifies more differentially expressed genes, with more accurate fold changes, than analysis using unique reads. We also demonstrate the presence of sequence-composition-related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still showed strong coverage imbalance after correction were tagged using a statistical approach.
    Article · Jun 2014 · International Journal of Computational Biology and Drug Design