ART: A next-generation sequencing read simulator

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.
Bioinformatics (Impact Factor: 4.98). 12/2011; 28(4):593-4. DOI: 10.1093/bioinformatics/btr708
Source: PubMed


ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. AVAILABILITY: Both source and binary software packages are available at
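The abstract's core idea, emulating sequencing with position-specific quality profiles and error models, can be illustrated with a minimal sketch. This is not ART's implementation: the function names, the read-name format, and the toy quality profile are all hypothetical. It also assumes substitution errors only (no indels), uniformly sampled read start positions, and the standard Phred relation between quality score and error probability.

```python
import random


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def simulate_reads(reference, read_len, fold_coverage, quality_profile, seed=0):
    """Yield (name, sequence, quality_string) for simulated single-end reads.

    quality_profile[i] is a list of candidate Phred scores for read position i.
    It is a hypothetical stand-in for an empirically derived profile.
    """
    if len(quality_profile) != read_len:
        raise ValueError("quality_profile needs one entry per read position")
    if read_len > len(reference):
        raise ValueError("read_len exceeds reference length")

    rng = random.Random(seed)  # seeded so that runs are reproducible
    # Number of reads needed to reach the requested fold coverage.
    n_reads = int(fold_coverage * len(reference) / read_len)

    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases, quals = [], []
        for pos in range(read_len):
            true_base = reference[start + pos]
            q = rng.choice(quality_profile[pos])
            # Introduce a substitution with the probability implied by Q.
            if rng.random() < phred_to_error_prob(q):
                base = rng.choice([b for b in "ACGT" if b != true_base])
            else:
                base = true_base
            bases.append(base)
            quals.append(chr(q + 33))  # Phred+33 (Sanger) encoding
        # Hypothetical naming scheme that records the read's origin.
        yield f"read{i}_pos{start}", "".join(bases), "".join(quals)


def to_fastq(records):
    """Format (name, sequence, quality) records as FASTQ text."""
    return "".join(f"@{n}\n{s}\n+\n{q}\n" for n, s, q in records)


if __name__ == "__main__":
    ref = "ACGT" * 50
    profile = [[30, 35, 40]] * 20  # toy profile: same choices at every position
    print(to_fastq(simulate_reads(ref, 20, 1, profile)), end="")
```

Recording the origin of each read in its name, as the sketch does, is what lets a downstream evaluation check whether an aligner placed the read correctly.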



Available from: Weichun Huang, Aug 01, 2014
  • Source
    • "We used ART (Huang et al. 2012) to simulate MiSeq whole genome shotgun sequencing experiments with paired-end reads of length 250 bp. Sequencing errors were simulated with ART's Illumina MiSeq-250 error profile. "
    ABSTRACT: Whole genome shotgun sequencing of multi-species communities using only a single library layout is commonly used to assess taxonomic and functional diversity of microbial assemblages. Here we investigate to what extent such metagenome skimming approaches are applicable for in-depth genomic characterizations of eukaryotic communities, e.g. lichens. We address how to best assemble a particular eukaryotic metagenome skimming data set, what pitfalls can occur, and what genome quality can be expected from these data. To facilitate project-specific benchmarking, we introduce the concept of twin sets, simulated data resembling the outcome of a particular metagenome sequencing study. We show that the quality of genome reconstructions depends essentially on assembler choice. Individual tools, including the metagenome assemblers Omega and MetaVelvet, are surprisingly sensitive to low and uneven coverages. In combination with the routine of assembly parameter choice to optimize the assembly N50 size, these tools can preclude an entire genome from the assembly. In contrast, MIRA, an all-purpose overlap assembler, and SPAdes, a multi-sized de Bruijn graph assembler, facilitate a comprehensive view on the individual genomes across a wide range of coverage ratios. Testing assemblers on real-world metagenome skimming data from the lichen Lasallia pustulata demonstrates the applicability of twin sets for guiding method selection. Furthermore, it reveals that the assembly outcome for the photobiont Trebouxia sp. falls behind the a priori expectation given the simulations. Although the underlying reasons remain unclear, this highlights that further studies on this organism require special attention during sequence data generation and downstream analysis.
    Molecular Ecology Resources 09/2015; DOI:10.1111/1755-0998.12463 · 3.71 Impact Factor
  • Source
    • "In order to further evaluate the effect of nonrandomly positioned reads, the real and simulated RNA-seq data sets were compared. The next-generation sequencing read simulator "ART" [43] was used to simulate RNA-seq reads. To simulate the sequencing, the sequencing read simulator assumes that the reads uniformly and randomly distribute on the transcript "
    ABSTRACT: To improve the applicability of RNA-seq technology, a large number of RNA-seq data analysis methods and correction algorithms have been developed. Although these new methods and algorithms have steadily improved transcriptome analysis, greater prediction accuracy is needed to better guide experimental designs with computational results. In this study, a new tool for the identification of differentially expressed genes with RNA-seq data, named GExposer, was developed. This tool introduces a local normalization algorithm to reduce the bias of nonrandomly positioned read depth. The naive Bayes classifier is employed to integrate fold change, transcript length, and GC content to identify differentially expressed genes. Results on several independent tests show that GExposer has better performance than other methods. The combination of the local normalization algorithm and naive Bayes classifier with three attributes can achieve better results; both false positive rates and false negative rates are reduced. However, only a small portion of genes is affected by the local normalization and GC content correction.
    09/2015; 2015(3):789516. DOI:10.1155/2015/789516
  • Source
    • "The basic approach to compare mappers is based on simulating NGS reads, aligning them to the reference genome and assessing read mapping accuracy using a tool evaluating if each individual read has been aligned correctly. There exist many read simulators (WGSIM, DWGSIM, CURESIM (Caboche et al., 2014), ART (Huang et al., 2011), MASON (Holtgrewe, 2010), PIRS (Xu et al., 2012), XS (Pratas et al., 2014), FLOWSIM (Balzer et al., 2010), GEMSIM (McElroy et al., 2012), PBSIM (Ono et al., 2013), SINC (Pattnaik et al., 2014), FASTQSIM (Shcherbina, 2014)) as well as many evaluation tools (WGSIM EVAL, DWGSIM EVAL, CURESIMEVAL, RABEMA (Holtgrewe et al., 2011), etc.). However, each read simulator encodes information about the origin of reads in its own manner. "
    ABSTRACT: Motivation: Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers. Simulation of reads is accompanied by information about their positions in the source genome. This information is then used to evaluate alignments produced by the mapper. Finally, reports containing statistics of successful read alignments are created. In the absence of standards for encoding read origins, every evaluation tool has to be made explicitly compatible with the simulator used to generate reads. Results: To overcome this obstacle, we have created a generic format RNF (Read Naming Format) for assigning read names with encoded information about original positions. Furthermore, we have developed an associated software package RNFTOOLS containing two principal components. MISHMASH applies one of the popular read simulating tools (among DWGSIM, ART, MASON, CURESIM etc.) and transforms the generated reads into RNF format. LAVENDER then evaluates a given read mapper using simulated reads in RNF format. Special attention is paid to mapping qualities that serve for parametrization of ROC curves, and to evaluation of the effect of read sample contamination. Availability and implementation: RNFTOOLS: Spec. of RNF: Contact:
    Bioinformatics 09/2015; DOI:10.1093/bioinformatics/btv524 · 4.98 Impact Factor