MetaSim: a sequencing simulator for genomics and metagenomics.

Daniel C Richter, Felix Ott, Alexander F Auch, Ramona Schmid, Daniel H Huson

ZBIT- Center for Bioinformatics Tübingen, University of Tübingen, Tübingen, Germany.

Journal Article: PLoS ONE (impact factor: 4.41). 02/2008; 3(10):e3373. DOI: 10.1371/journal.pone.0003373

Abstract

BACKGROUND: The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data in public databases. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets. METHODOLOGY/PRINCIPAL FINDINGS: To facilitate the development and improvement of metagenomic tools and the planning of metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on a database of given genomes, the program allows the user to design a metagenome by specifying the number of genomes present at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of a number of different sequencing technologies. A population sampler optionally produces evolved sequences based on source genomes and a given evolutionary tree. CONCLUSIONS/SIGNIFICANCE: MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.

Source: PubMed

MetaSim—A Sequencing Simulator for Genomics and
Metagenomics
Daniel C. Richter1*, Felix Ott1, Alexander F. Auch1, Ramona Schmid2, Daniel H. Huson1
1 ZBIT- Center for Bioinformatics Tübingen, University of Tübingen, Tübingen, Germany, 2 Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Germany
Citation: Richter DC, Ott F, Auch AF, Schmid R, Huson DH (2008) MetaSim—A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3(10): e3373.
doi:10.1371/journal.pone.0003373
Editor: Dawn Field, NERC Centre for Ecology and Hydrology, United Kingdom
Received August 11, 2008; Accepted September 16, 2008; Published October 8, 2008
Copyright: © 2008 Richter et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors have no support or funding to report.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: drichter@informatik.uni-tuebingen.de
Introduction
Metagenomics is based on the isolation and characterization of
DNA from environmental samples without the need for prior
cultivation of microorganisms. In contrast to single-genome studies,
analyses are applied to entire communities of microbes instead of only
a few isolated organisms. This approach has already led to exciting insights into the
ecology of different habitats such as ocean [1], soil [2], acid mine [3],
human and mouse gut [4,5] and even into ancient DNA [6].
The research field of metagenomics is spurred by the recent
development and improvement of next-generation sequencing
technologies like Roche’s 454 pyrosequencing [7]. Although these
high-throughput technologies promise faster and relatively
inexpensive generation of reads, Sanger sequencing has still been
used in environmental genome projects [5] to avoid the drawbacks
of shorter read lengths.
In general, studies show that algorithms developed for single-genome
assembly are only suitable for environmental sequences
under special conditions, for example in low complexity
populations [2,8]. In particular, it is very difficult to assemble
reads from highly diverse ecological systems [9]. The problem is that
the arrangement of reads into contigs fails or is misleading because
contigs are put together from reads from many different genomes.
Currently, the primary goals of metagenomic studies are the
investigation of the phylogenetic composition of the sample
(taxonomical binning, "Who is out there?"), the quantitative analysis
("How many are there?") and the prediction of genes and their
functions (functional binning, "What are they doing?"). Since the
amount of comparable environmental data is rapidly growing,
comparative studies of multiple metagenomic data sets are of great
interest as well. As of September 2008, 44 metagenome studies had
already been conducted, whereas 86 projects were still ongoing [10].
Common strategies for taxonomical binning include: (1)
detecting phylogenetic markers like rRNA, RecA, heat shock protein
(HSP70) and elongation factors (EF-Tu, EF-G) [11], (2) comparing
reads against a reference database such as NCBI-nr [12] and then
analyzing the matches to place the reads in the NCBI taxonomy
[13] and (3) measuring the oligonucleotide frequency caused by
codon usage or restriction-site frequency [14–18].
When it comes to functional binning, sequences are compared
to known protein functions, families and pathways provided by
several databases, for example COG, KEGG, PFAM, SEED,
STRING and TIGRFAM [19–24]. A de novo search for (unknown)
functional units is only feasible if either long reads or contigs are
available for the detection of open reading frames.
Another challenge in metagenomic studies is the development of
robust statistical techniques [25]. Particularly with regard to
comparative metagenomics dealing with highly variable data,
these techniques are considered indispensable for a
well-founded analysis.
Despite the enormous amount of sequence data that was
generated and analyzed in the past few years, the number of
publicly available software tools specialized in metagenomic data
analysis is surprisingly low. Hence, many studies still make use
of classic methods, software or web services that were not
originally intended for metagenomic data analysis and have to be
adapted or pipelined to produce the desired results [8].
Thus, there is a great demand for specialized metagenomic
software supporting the analysis process. Because of the complexity
of metagenomic data, it is crucial to benchmark new and
existing software with standardized test cases using simulated and
verifiable data. A first study [9] provides three data sets of
varying complexity by selecting original sequence reads from 113
isolated genomes. In their paper, the authors anticipate that these
data sets will be used as standard test cases for software testing.
Other publications have already used the software ReadSim
(an unpublished predecessor of MetaSim) to generate simulated read
data sets for testing their software [18,26].
Description of MetaSim
MetaSim takes as input a set of known genome sequences and
an abundance profile. This profile determines which genome
sequences are selected for the simulation and the relative
abundance of each genome sequence in the dataset.
MetaSim integrates an "induced tree view" of the NCBI
taxonomy [27] that can be used to interactively select taxa and
inner nodes of the taxonomy to configure their relative
abundances. Additionally, the user is able to simulate an "evolved"
population of a single genome sequence, using a population
simulator. This feature is aimed at simulating the common
real-world situation that many different, but closely related strains of a
lineage coexist in the same habitat.
Finally, for the construction of a realistic read data set, MetaSim
includes a versatile read sequencing simulator. The user is able to
choose from different (adaptable) error models of current
sequencing technologies (e.g. Sanger [28,29], Roche’s 454 [7]
and Illumina (formerly Solexa) [30]).
MetaSim allows one to construct verifiable read data sets, and
additionally, metagenomes variable in size, taxonomical composition
and abundance to reflect the diverse and complex output of
real metagenomic studies. The resulting data sets can be used to
plan and design metagenomic studies and for evaluation and
improvement of metagenomics software tools, statistical methods
or assembly algorithms.
Availability
MetaSim is written in Java and can be run with a graphical user
interface or in command line mode. Installers for Linux/Unix,
MacOS X and Windows are freely available from our website at:
http://www-ab.informatik.uni-tuebingen.de/software/metasim.
Methods
MetaSim’s processing pipeline consists of several phases:
1. Selection of source genome sequences from the internal
database
2. Configuration of the species abundance profile by setting the
relative copy number of the genome sequences
3. Sampling of fragments according to the species
abundance profile
4. Application of technology-specific error models to the fragments
to create sequencing reads
Configuration of Species Abundance Profiles
At the beginning, whole genome sequences available from
public databases can be stored locally as source sequences in an
integrated database. The user specifies the relative abundance of
each genome sequence in a text-based profile file. An interesting
feature of MetaSim is the possibility of assigning frequency values
not only at the species level but also at higher taxonomical levels.
For example, if the genus Escherichia is assigned a certain number
of genome copies, this number is split and applied uniformly to all
descendant species whose sequences are available from the
internal database.
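The uniform split of a higher-level abundance value over the available descendant species can be illustrated with a minimal Python sketch. The function name `expand_profile`, the dictionary layout and the toy species list are assumptions made for illustration; they do not reflect MetaSim's profile file format or its internal database.

```python
from collections import defaultdict

def expand_profile(profile, species_by_taxon):
    """Distribute each taxon's genome copies uniformly over its available species.

    profile: dict mapping a taxon name (species or higher rank) to a copy number.
    species_by_taxon: dict mapping a higher-rank taxon to the species that have
        sequences in a (hypothetical) local database.
    """
    per_species = defaultdict(float)
    for taxon, copies in profile.items():
        # A taxon without listed descendants is treated as a species itself.
        members = species_by_taxon.get(taxon, [taxon])
        share = copies / len(members)
        for species in members:
            per_species[species] += share
    return dict(per_species)


# Toy data: the species list and copy numbers are illustrative only.
database = {"Escherichia": ["Escherichia coli", "Escherichia fergusonii"]}
profile = {"Escherichia": 10, "Methanoculleus marisnigri": 90}
abundances = expand_profile(profile, database)
```

In this toy case the 10 copies assigned to the genus are shared equally by the two species found in the database, while the species-level entry keeps its own value.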
To facilitate this data composition process in GUI mode,
MetaSim provides an interactive taxonomy editor that visualizes the
induced NCBI taxonomy, i.e. the genome sequences listed in the
profile file are displayed as nodes in a rooted tree (Figure 1). Node
sizes reflect the relative number of genome copies for each given
taxon.
Figure 1. Taxonomy Editor. A clipping of the taxonomy editor view is shown. Three taxa are assigned an abundance value (number in
parentheses). These settings can be specified either in a text-based abundance profile file or directly in the taxonomy editor by right-clicking on a
node.
doi:10.1371/journal.pone.0003373.g001
Population sampling
The current genome databases reflect only a small part of
earth’s still unexplored microbial diversity. Thus, a simulated
metagenome only based on known genome sequences does not
adequately reflect the complexity of realistic data sets.
MetaSim therefore includes a population sampler that optionally
generates a set of evolved (mutated) offspring derived from
single source genomes, using a given evolutionary tree. This tree
describes how the offspring sequences descend from the source
sequence. By default, a random phylogenetic tree is generated
under the Yule-Harding model [31,32], but alternatively, user-defined
trees can also be loaded. As a simple model of DNA
evolution, the Jukes-Cantor formula [33] is applied to estimate a
probability of change for each base pair, with a customizable
transition rate α (0.001 by default) and time t based on the edge
weights. MetaSim then generates the designated number of
evolved genomes and adds them to the internal genome
database. As an example, a fragment recruitment plot (according
to [1]) shows 10,000 sampled Sanger reads of 100 evolved offspring
sequences (α = 0.004) mapped to the source genome (Escherichia coli
K-12 substr. MG1655) using blastn (Figure 2). Read sequences
sampled directly from the source genome show a significantly
higher identity than the mutated sequencing reads.
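The per-base change probability under the Jukes-Cantor model can be sketched in a few lines of Python. This is an illustration only: it assumes the common parameterisation in which the rate is the rate of change to each of the three other bases, giving p = 3/4 (1 - exp(-4 * rate * t)); the text does not state which parameterisation MetaSim uses, and the function names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def jc_change_probability(alpha, t):
    """Probability that a site differs after time t under Jukes-Cantor.

    Assumes alpha is the rate of change to each of the three other bases.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * alpha * t))

def evolve(sequence, alpha, t, rng):
    """Return a mutated copy of `sequence` along one tree edge of weight t."""
    p = jc_change_probability(alpha, t)
    out = []
    for base in sequence:
        if base in BASES and rng.random() < p:
            # Substitute with one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(42)
source = "ACGT" * 250
child = evolve(source, alpha=0.004, t=1.0, rng=rng)
unchanged = evolve(source, alpha=0.004, t=0.0, rng=rng)
```

Applying `evolve` once per edge while walking from the root of the tree to each leaf would yield one evolved sequence per leaf. The probability saturates at 3/4 for long edges, which is the expected behaviour of this model.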
Read sampling
MetaSim simulates both Sanger sequencing and Roche’s 454
(sequencing-by-synthesis) approach. Additionally, it provides a
flexible, empirical error model usable to simulate Illumina’s ultra-
short reads.
For the simulation of read sequences, statistical approaches are
adopted to simulate the distribution of read lengths, its frequency
rate and the use of error models depending on the chosen
sequencing technology.
To be able to model mate-pairs as well, MetaSim first extracts
large fragments called clones from the set of genomes with normally
or uniformly distributed lengths. For example, clones with a length
of 1000 bp and a standard deviation of 100 bp are modelled with
a normal distribution N(1000,100) (Figure 3). The overall number
of clones is determined by the number of reads or mate-pairs the
user desires to generate.
If only one source genome is present in the given profile, the
clones are randomly extracted from this single sequence. In
contrast, in a typical metagenome simulation, the clones have to
be sampled from many genomes of varying length, copy number
(e.g. to model the abundance of plasmids versus the organism
genomes) and abundances.
So, each genome sequence s is assigned a weight
$$w_s = l_s \times c_s \times a_s \qquad (1)$$
where $l_s$ is the length, $c_s$ is the copy number and $a_s$ is the specified
relative abundance of the genome sequence s as determined in the
profile.
For each length of the clone length distribution, the weights of
all sequences are summed up to obtain the total weight
$w_{sum}$ that is used to compute a sequence probability $p_s = w_s / w_{sum}$.
Considering the overall length distribution, a frequency value for
each source sequence is then obtained.
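As a worked illustration of the weighting, the following Python sketch computes the sequence probabilities for the two genomes listed in Table 1. The genome lengths and abundances are the rounded values from that table; the copy number of 1 and the function name `sequence_probabilities` are assumptions made for the example.

```python
def sequence_probabilities(genomes):
    """Map each genome name to p_s = w_s / w_sum with w_s = l_s * c_s * a_s."""
    weights = {
        name: g["length"] * g["copies"] * g["abundance"]
        for name, g in genomes.items()
    }
    w_sum = sum(weights.values())
    return {name: w / w_sum for name, w in weights.items()}

# Lengths and abundances taken from Table 1; copy number 1 is an assumption.
genomes = {
    "Methanoculleus marisnigri JR1": {"length": 2.5e6, "copies": 1, "abundance": 90},
    "Escherichia coli K-12 MG1655": {"length": 4.6e6, "copies": 1, "abundance": 10},
}
probs = sequence_probabilities(genomes)
for name, p in probs.items():
    print(f"{name}: {p:.4f}")
```

With these rounded genome sizes the sketch gives roughly 83% and 17%, close to the percentages of sampled reads reported in Table 1. The sketch ignores the dependence on clone length mentioned in the text.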
After the clone sampling, the ends of the clones are the basis for
the subsequent sampling of the reads or mate-pairs, respectively.
Again, read lengths can be either uniformly or normally
distributed. Finally, read sequences are processed and modified
by applying the selected error model.
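The clone-then-read sampling described above might look as follows in Python. This is a simplified sketch under stated assumptions: clone lengths follow N(1000, 100) as in the example in the text, the read length is fixed rather than drawn from a distribution, and the second read is reverse-complemented, which is the usual mate-pair convention but is not stated in the text. The function name and parameters are hypothetical.

```python
import random

def sample_mate_pair(genome, rng, clone_mean=1000, clone_sd=100, read_len=100):
    """Draw one clone from `genome` and return reads from both of its ends."""
    # Clone length ~ N(clone_mean, clone_sd), clamped to a usable range.
    clone_len = int(round(rng.gauss(clone_mean, clone_sd)))
    clone_len = max(read_len, min(clone_len, len(genome)))
    start = rng.randint(0, len(genome) - clone_len)
    clone = genome[start:start + clone_len]
    forward = clone[:read_len]
    # Assumed convention: the mate comes from the opposite strand,
    # so the far end of the clone is reverse-complemented.
    complement = str.maketrans("ACGT", "TGCA")
    reverse = clone[-read_len:].translate(complement)[::-1]
    return forward, reverse

rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
fwd, rev = sample_mate_pair(genome, rng)
```

For single reads rather than mate-pairs, only `forward` would be kept. An error model would then be applied to each returned read.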
Simulation of Sanger sequencing
A widely used approach to sequencing large DNA molecules is
Sanger sequencing, using a shotgun approach that involves
cloning small pieces (or inserts) of DNA and then determining
their sequence using fluorescent dideoxynucleotides for termination
and capillary electrophoresis.
Figure 2. Fragment Recruitment Plot. Black dots represent 10,000 sequencing reads (Sanger technology, <800 bp) drawn from 100 evolved
offspring (α = 0.004) of the source genome Escherichia coli K-12 substr. MG1655. Their sequence identity is lower compared to the mapped reads
sampled directly from the source genome (red dots).
doi:10.1371/journal.pone.0003373.g002
Figure 3. Frequency distribution of clone lengths. As an example, 250,000 clones with mean length 1000 bp and standard deviation of 100 bp
were modelled with a normal distribution.
doi:10.1371/journal.pone.0003373.g003
To simulate Sanger sequencing, we closely followed the
implementation of celsim reported in [34]. Each read is
subjected to a linearly increasing error rate. We model fixed
percentages of deletion errors, insertion errors and substitutions.
Further, the simulator is capable of modeling mate-pairs and one
can specify the length distribution of inserts.
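A linearly increasing error rate with fixed proportions of insertions, deletions and substitutions can be sketched as below. All numeric defaults (start and end error rates, error-type fractions) are placeholders chosen for illustration, not MetaSim's or celsim's settings, and the function name is hypothetical.

```python
import random

def sanger_errors(read, rng, err_start=0.01, err_end=0.02,
                  p_ins=0.2, p_del=0.2):
    """Apply a linearly increasing per-base error rate to `read`.

    All numeric defaults are placeholders. An error is an insertion with
    probability p_ins, a deletion with p_del, otherwise a substitution.
    """
    out = []
    n = len(read)
    for i, base in enumerate(read):
        # Error rate grows linearly from err_start (first base) to err_end (last).
        rate = err_start + (err_end - err_start) * (i / max(n - 1, 1))
        if rng.random() >= rate:
            out.append(base)
            continue
        kind = rng.random()
        if kind < p_ins:
            out.append(base)
            out.append(rng.choice("ACGT"))  # insert a random base after this one
        elif kind < p_ins + p_del:
            pass  # deletion: drop the base
        else:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)

rng = random.Random(7)
read = "ACGT" * 200
noisy = sanger_errors(read, rng)
clean = sanger_errors(read, rng, err_start=0.0, err_end=0.0)
```

Setting both rates to zero returns the read unchanged, which is a convenient sanity check when tuning the parameters.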
Simulation of sequencing-by-synthesis
In pyrosequencing, the intensity of emitted light is used to
estimate the length of homopolymers, i.e. runs of identical nucleotides
in a sequence. During sequencing, the four nucleotides that make up DNA
are periodically flowed over the inserts to be sequenced.
Within each flow, the intensity of the signal emitted (which is
linear up to 8 bp) reflects the number of nucleotides incorporated.
Thus, the addition of a single base or even homopolymer stretches
of multiple bases in a single flow can be detected. For chemical
and technical reasons, this signal is subject to fluctuations that lead
to sequencing errors. In [7], an error rate of about 3% is reported.
Let r denote the length of a given homopolymer. We model the
emitted light intensity using a normal distribution $N(\mu, \sigma)$, with
mean $\mu = r$ and standard deviation $\sigma = k \cdot \sqrt{r}$, where k is a fixed
Figure 4. The graphical user interface of MetaSim is divided into three panels: a project tree on the left containing all simulation
settings and taxon profiles, an overview and edit panel on the right and a message panel at the bottom. Additionally, a configuration
window is shown.
doi:10.1371/journal.pone.0003373.g004
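The homopolymer signal model from the sequencing-by-synthesis section, with intensity drawn from a normal distribution with mean r and standard deviation k times the square root of r, can be illustrated with the Python sketch below. Two things are assumptions: the value k = 0.15 is an arbitrary placeholder, and calling the homopolymer length by rounding the simulated intensity to the nearest integer is a simplification that the text does not confirm. The sketch also works on runs of the template rather than simulating the actual flow order.

```python
import math
import random
from itertools import groupby

def pyro_read(sequence, rng, k=0.15):
    """Re-call each homopolymer length from a noisy signal ~ N(r, k*sqrt(r)).

    k = 0.15 is an arbitrary placeholder for the fixed factor.
    """
    out = []
    for base, run in groupby(sequence):
        r = len(list(run))
        signal = rng.gauss(r, k * math.sqrt(r))
        # Assumed base-calling rule: round the intensity to the nearest length.
        called = max(0, int(round(signal)))
        out.append(base * called)
    return "".join(out)

rng = random.Random(3)
template = "AAAACCGTTTTTTG"
noisy = pyro_read(template, rng)
exact = pyro_read(template, rng, k=0.0)
```

Because the standard deviation grows with the run length, longer homopolymers are miscalled more often than single bases, which matches the qualitative behaviour described for pyrosequencing.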
Table 1. Species abundance and percentage of sampled reads of the simLC dataset.
Abundance | Species | Genome size (Mbp) | 454-100 [a] (%) | 454-250 [b] (%) | S-800 [c] (%)
90 | Methanoculleus marisnigri JR1 | 2.5 | 82.70 | 82.61 | 82.71
10 | Escherichia coli str. K-12 substr. MG1655 | 4.6 | 17.30 | 17.39 | 17.29
[a] 454 technology, 150,000 reads (length: 100 bp).
[b] 454 technology, 60,000 reads (length: 250 bp).
[c] Sanger technology, 18,750 reads (length: 800 bp).
doi:10.1371/journal.pone.0003373.t001