ART: A next-generation sequencing read simulator

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.
Bioinformatics (Impact Factor: 4.98). 12/2011; 28(4):593-4. DOI: 10.1093/bioinformatics/btr708
Source: PubMed


ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically from large sequencing datasets. ART currently supports all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa, and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. AVAILABILITY: Both source and binary software packages are available at
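
As a rough illustration of the technique the abstract describes, the sketch below samples reads from a reference and introduces substitution errors whose per-position rates follow a Phred base-quality profile. This is a minimal sketch, not ART's actual implementation; the profile values and sequence lengths are hypothetical placeholders.

    import random

    BASES = "ACGT"

    def simulate_read(reference: str, length: int, quality_profile: list[int]) -> str:
        """Sample one read and mutate each base with probability 10^(-Q/10)."""
        start = random.randrange(len(reference) - length + 1)
        read = list(reference[start:start + length])
        for i, q in enumerate(quality_profile[:length]):
            if random.random() < 10 ** (-q / 10):  # Phred Q -> error probability
                read[i] = random.choice([b for b in BASES if b != read[i]])
        return "".join(read)

    random.seed(0)
    reference = "".join(random.choice(BASES) for _ in range(10_000))
    profile = [35] * 50 + [25] * 30 + [15] * 20  # hypothetical: quality decays toward the 3' end
    print(simulate_read(reference, 100, profile))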

  • Source
    • "We used ART (Huang et al. 2012) to simulate MiSeq whole genome shotgun sequencing experiments with paired-end reads of length 250 bp. Sequencing errors were simulated with ART's Illumina MiSeq-250 error profile. "
    ABSTRACT: Whole genome shotgun sequencing of multi-species communities using only a single library layout is commonly used to assess the taxonomic and functional diversity of microbial assemblages. Here we investigate to what extent such metagenome skimming approaches are applicable for in-depth genomic characterizations of eukaryotic communities, e.g. lichens. We address how best to assemble a particular eukaryotic metagenome skimming data set, what pitfalls can occur, and what genome quality can be expected from such data. To facilitate project-specific benchmarking, we introduce the concept of twin sets, simulated data resembling the outcome of a particular metagenome sequencing study. We show that the quality of genome reconstructions depends essentially on assembler choice. Individual tools, including the metagenome assemblers Omega and MetaVelvet, are surprisingly sensitive to low and uneven coverages. In combination with the routine practice of choosing assembly parameters to optimize the assembly N50 size, these tools can exclude an entire genome from the assembly. In contrast, MIRA, an all-purpose overlap assembler, and SPAdes, a multi-sized de Bruijn graph assembler, facilitate a comprehensive view of the individual genomes across a wide range of coverage ratios. Testing assemblers on real-world metagenome skimming data from the lichen Lasallia pustulata demonstrates the applicability of twin sets for guiding method selection. Furthermore, it reveals that the assembly outcome for the photobiont Trebouxia sp. falls short of the a priori expectation set by the simulations. Although the underlying reasons remain unclear, this highlights that further studies of this organism require special attention during sequence data generation and downstream analysis.
    Molecular Ecology Resources 09/2015; DOI:10.1111/1755-0998.12463 · 3.71 Impact Factor
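    The MiSeq simulation quoted above could be reproduced along the following lines; a minimal sketch, assuming art_illumina is on the PATH and that the MSv1 built-in profile corresponds to 250 bp MiSeq chemistry. The reference file, coverage, and fragment-size values are hypothetical, and flag names may differ across ART versions.

        import subprocess

        subprocess.run(
            [
                "art_illumina",
                "-ss", "MSv1",         # built-in MiSeq quality profile (assumed name)
                "-p",                  # paired-end simulation
                "-i", "reference.fa",  # hypothetical reference genome
                "-l", "250",           # 250 bp reads, as in the quoted study
                "-f", "20",            # hypothetical 20x fold coverage
                "-m", "500",           # hypothetical mean fragment length
                "-s", "50",            # hypothetical fragment-length std. dev.
                "-o", "simulated_",    # output file prefix
            ],
            check=True,
        )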
  • Source
    • "The basic approach to compare mappers is based on simulating NGS reads, aligning them to the reference genome and assessing read mapping accuracy using a tool evaluating if each individual read has been aligned correctly. * to whom correspondence should be addressed There exist many read simulators (WGSIM 1 , DWGSIM 2 , CURESIM (Caboche et al., 2014), ART (Huang et al., 2011), MASON (Holtgrewe, 2010), PIRS (Xu et al., 2012)), XS (Pratas et al., 2014), FLOWSIM (Balzer et al., 2010), GEMSIM (McElroy et al., 2012), PBSIM (Ono et al., 2013), SINC (Pattnaik et al., 2014), FASTQSIM (Shcherbina, 2014)) as well as many evaluation tools (WGSIM EVAL, DWGSIM EVAL, CURESIMEVAL, RABEMA (Holtgrewe et al., 2011), etc.). However, each read simulator encodes information about the origin of reads in its own manner. "
    ABSTRACT: Motivation: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate the alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of a standard for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate the reads. Results: To overcome this obstacle, we have created a generic format, RNF (Read Naming Format), for assigning reads names that encode information about their original positions. Furthermore, we have developed an associated software package, RNFTOOLS, containing two principal components. MISHMASH applies one of several popular read simulating tools (DWGSIM, ART, MASON, CURESIM, etc.) and transforms the generated reads into RNF format. LAVENDER then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities, which serve for parametrization of ROC curves, and to evaluating the effect of read sample contamination. Availability and implementation: RNFTOOLS: Spec. of RNF: Contact:
    Bioinformatics 09/2015; DOI:10.1093/bioinformatics/btv524 · 4.98 Impact Factor
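    To make the motivation for a shared naming convention concrete, here is a toy origin encoding in Python; this is a hypothetical scheme for illustration only, NOT the actual RNF specification.

        import re
        from typing import NamedTuple

        class ReadOrigin(NamedTuple):
            contig: str
            start: int
            end: int
            strand: str

        def encode_name(read_id: int, origin: ReadOrigin) -> str:
            # Hypothetical scheme (not the RNF spec): pack the source coordinates
            # into the read name so a downstream evaluator can recover them
            # without simulator-specific code.
            return f"sim_{read_id:08d}__{origin.contig}:{origin.start}-{origin.end}:{origin.strand}"

        def decode_name(name: str) -> ReadOrigin:
            m = re.fullmatch(r"sim_\d+__(.+):(\d+)-(\d+):([+-])", name)
            if m is None:
                raise ValueError(f"unrecognized read name: {name}")
            return ReadOrigin(m.group(1), int(m.group(2)), int(m.group(3)), m.group(4))

        name = encode_name(7, ReadOrigin("chr1", 10500, 10750, "+"))
        assert decode_name(name) == ReadOrigin("chr1", 10500, 10750, "+")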
  • Source
    • "In this work we extend the previous study through direct consideration of the scale of the data set through a mix of data: real (drawn fro m the Sequence Read Archive [4]) and synthetic (generated by shattering completed and assembled genomes from GenBank through the use of the NCBI ART toolset [5]). Given the relative complexity and sequential structure of SVM training, we seek to use methods that may be more effectively used in systems which partition the training and classification problems to take advantage of the available computational resources. "
    ABSTRACT: Next Generation Sequencing (NGS) has revolutionised molecular biology, resulting in an explosion of data sets and an increasing role in clinical practice. Such applications necessarily require rapid identification of the organism as a prelude to annotation and further analysis. NGS data consist of a substantial number of short sequence reads, given context through downstream assembly and annotation, a process requiring reads consistent with the assumed species or species group. Highly accurate results have been obtained for restricted sets using SVM classifiers, but such methods are difficult to parallelise and success depends on careful attention to feature selection. This work examines the problem at very large scale, using a mix of synthetic and real data with a view to determining the overall structure of the problem and the effectiveness of parallel ensembles of simpler classifiers (principally random forests) in addressing the challenges of large scale genomics.
    Procedia Computer Science 12/2014; 29:2003-2012. DOI:10.1016/j.procs.2014.05.184
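    The "shattering" step quoted above can be pictured with the minimal sketch below, which slices an assembled genome into fixed-length, error-free synthetic reads. The cited work used ART, which additionally applies empirical error models; the sequence and parameters here are hypothetical placeholders.

        def shatter(sequence: str, read_length: int = 100, step: int = 50):
            """Yield (position, read) pairs of overlapping fixed-length substrings."""
            for start in range(0, len(sequence) - read_length + 1, step):
                yield start, sequence[start:start + read_length]

        genome = "ACGT" * 500  # stand-in for an assembled GenBank genome
        reads = list(shatter(genome, read_length=100, step=50))
        print(f"generated {len(reads)} synthetic reads")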