ART: A next-generation sequencing read simulator

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.
Bioinformatics (Impact Factor: 4.62). 12/2011; 28(4):593-4. DOI: 10.1093/bioinformatics/btr708
Source: PubMed

ABSTRACT ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis, including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically from large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. AVAILABILITY: Both source and binary software packages are available at
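The core idea the abstract describes, drawing reads from a reference and corrupting them according to a position-dependent error profile, can be sketched in a few lines. This is an illustrative toy, not ART's actual error model: the profile, function names, and the substitution-only error type are assumptions for the example.

```python
import random

def simulate_read(reference, start, length, error_profile, seed=0):
    """Draw a substring of `reference` and introduce substitution errors.

    `error_profile[i]` is the per-base substitution probability at read
    position i, a stand-in for the technology-specific quality profiles
    that ART parameterizes from real sequencing data.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    read = []
    for i in range(length):
        base = reference[start + i]
        if rng.random() < error_profile[i]:
            # replace the true base with one of the three other bases
            base = rng.choice([b for b in bases if b != base])
        read.append(base)
    return "".join(read)

ref = "ACGTACGTACGTACGTACGT"
# error rate rising toward the read end, as on Illumina-like platforms
profile = [0.001 * (i + 1) for i in range(10)]
read = simulate_read(ref, 2, 10, profile)
```

A real simulator additionally models indels, paired-end fragment lengths, and platform-specific quality score distributions; the point here is only the sample-then-corrupt structure.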

Available from: Weichun Huang, Aug 01, 2014
  • Source
    • "The basic approach to compare mappers is based on simulating NGS reads, aligning them to the reference genome and assessing read mapping accuracy using a tool evaluating if each individual read has been aligned correctly. There exist many read simulators (WGSIM, DWGSIM, CURESIM (Caboche et al., 2014), ART (Huang et al., 2011), MASON (Holtgrewe, 2010), PIRS (Xu et al., 2012), XS (Pratas et al., 2014), FLOWSIM (Balzer et al., 2010), GEMSIM (McElroy et al., 2012), PBSIM (Ono et al., 2013), SINC (Pattnaik et al., 2014), FASTQSIM (Shcherbina, 2014)) as well as many evaluation tools (WGSIM EVAL, DWGSIM EVAL, CURESIMEVAL, RABEMA (Holtgrewe et al., 2011), etc.). However, each read simulator encodes information about the origin of reads in its own manner. "
    ABSTRACT: Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. As a consequence, the sensitivity and precision of the mapping tool, applied with certain parameters to certain data, can critically affect the accuracy of produced results (e.g., in variant calling applications). Therefore, there has been an increasing demand for methods for comparing mappers and for measuring the effects of their parameters. Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. In order to overcome this obstacle, we have created a generic format RNF (Read Naming Format) for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package RNF containing two principal components. MIShmash applies one of the popular read simulating tools (among DwgSim, Art, Mason, CuReSim etc.) and transforms the generated reads into RNF format. LAVEnder then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities, which serve for parametrization of ROC curves, and to evaluation of the effect of read sample contamination.
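The abstract's central mechanism, packing a read's true origin into its name so any evaluator can recover it, can be illustrated with a simplified encoding. The separator and field layout below are invented for the example and are not the actual RNF grammar.

```python
def encode_origin(read_id, chrom, pos, strand):
    """Pack origin information into a read name.

    Simplified stand-in for an origin-encoding scheme like RNF;
    the "__(...)" syntax here is hypothetical.
    """
    return f"{read_id}__({chrom},{pos},{strand})"

def decode_origin(name):
    """Recover (read_id, chrom, pos, strand) from an encoded read name."""
    read_id, _, rest = name.partition("__(")
    chrom, pos, strand = rest.rstrip(")").split(",")
    return read_id, chrom, int(pos), strand

name = encode_origin("r001", "chr17", 10000123, "+")
origin = decode_origin(name)
```

With a shared encoding like this, an evaluator only needs to compare the decoded position against the mapper's reported alignment, regardless of which simulator produced the reads.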
  • Source
    • "In this work we extend the previous study through direct consideration of the scale of the data set through a mix of data: real (drawn from the Sequence Read Archive [4]) and synthetic (generated by shattering completed and assembled genomes from GenBank through the use of the ART toolset [5]). Given the relative complexity and sequential structure of SVM training, we seek to use methods that may be more effectively used in systems which partition the training and classification problems to take advantage of the available computational resources. "
    ABSTRACT: Next Generation Sequencing (NGS) has revolutionised molecular biology, resulting in an explosion of data sets and an increasing role in clinical practice. Such applications necessarily require rapid identification of the organism as a prelude to annotation and further analysis. NGS data consist of a substantial number of short sequence reads, given context through downstream assembly and annotation, a process requiring reads consistent with the assumed species or species group. Highly accurate results have been obtained for restricted sets using SVM classifiers, but such methods are difficult to parallelise and success depends on careful attention to feature selection. This work examines the problem at very large scale, using a mix of synthetic and real data with a view to determining the overall structure of the problem and the effectiveness of parallel ensembles of simpler classifiers (principally random forests) in addressing the challenges of large scale genomics.
    Procedia Computer Science 12/2014; 29:2003-2012. DOI:10.1016/j.procs.2014.05.184
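Classifiers like the SVMs and random forests discussed above typically consume fixed-length feature vectors rather than raw sequences; a common choice is k-mer frequency counts. The sketch below shows that featurization step only. It is an illustrative assumption about the pipeline, not the paper's actual feature selection.

```python
from collections import Counter
from itertools import product

def kmer_vector(seq, k=3):
    """Count k-mer frequencies of `seq` as a fixed-length feature vector.

    The vector has one slot per k-mer over {A,C,G,T}, so reads of any
    length map to the same feature space, which is what a downstream
    classifier (SVM, random forest) requires.
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for kmer, c in counts.items():
        if kmer in index:  # skip k-mers containing N or other ambiguity codes
            vec[index[kmer]] = c
    return vec

features = kmer_vector("ACGTACGT", k=2)
```

Because each read is featurized independently, this step parallelizes trivially, which fits the ensemble-of-simple-classifiers setting the abstract describes.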
  • Source
    • "• SIM: 2 million simulated 100bp paired-end reads (mean insert length 300, standard deviation 30) at 50x coverage, generated with the popular read simulator ART [15] from human chromosome 17 (the 10Mbp to 14Mbp region) after artificially introducing SNPs (SIM-SNP) and inserting indels (SIM-INDEL). • ECOL: 21 million 36bp paired-end E. coli reads (SRX000429) with a coverage of 160x at a genome size of 5Mb. "
    ABSTRACT: As high-throughput sequencers become standard equipment outside of sequencing centers, there is an increasing need for efficient methods for pre-processing and primary analysis. While a vast literature proposes methods for HTS data analysis, we argue that significant improvements can still be gained by exploiting expensive pre-processing steps which can be amortized with savings from later stages. We propose a method to accelerate and improve read mapping based on an initial clustering of possibly billions of high-throughput sequencing reads, yielding clusters of high stringency and a high degree of overlap. This clustering improves on the state-of-the-art in running time for small datasets and, for the first time, makes clustering high-coverage human libraries feasible. Given the efficiently computed clusters, only one representative read from each cluster needs to be mapped using a traditional readmapper such as BWA, instead of individually mapping all reads. On human reads, all processing steps, including clustering and mapping, only require 11%-59% of the time for individually mapping all reads, achieving speed-ups for all readmappers, while minimally affecting mapping quality. This accelerates a highly sensitive readmapper such as Stampy to be competitive with a fast readmapper such as BWA on unclustered reads.
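The map-one-representative-per-cluster idea in the abstract can be sketched with a drastic simplification: clustering only exactly identical reads (the paper's method also merges near-identical reads at high stringency). The toy `mapper` callable stands in for an external aligner such as BWA or Stampy.

```python
from collections import defaultdict

def cluster_reads(reads):
    """Group read indices by identical sequence.

    A simplification of high-stringency clustering: only exact
    duplicates share a cluster here.
    """
    clusters = defaultdict(list)
    for i, read in enumerate(reads):
        clusters[read].append(i)
    return clusters

def map_representatives(clusters, mapper):
    """Align one representative per cluster, then copy the result
    to every member, saving len(members) - 1 alignment calls."""
    positions = {}
    for seq, members in clusters.items():
        pos = mapper(seq)  # a single call to the external aligner
        for i in members:
            positions[i] = pos
    return positions

reads = ["ACGT", "TTTT", "ACGT"]
clusters = cluster_reads(reads)
# a toy "mapper" standing in for BWA/Stampy
positions = map_representatives(clusters, lambda s: hash(s) % 100)
```

High-coverage libraries contain many duplicated or near-duplicated reads, which is why amortizing one alignment over a whole cluster yields the reported speed-ups.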