BMC Bioinformatics (BioMed Central)
Research, Open Access
Methods for comparative metagenomics
Daniel H Huson*1, Daniel C Richter1, Suparna Mitra1, Alexander F Auch1
and Stephan C Schuster2
Address: 1 Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen, Germany and 2 310 Wartik Laboratories, Penn State
University, Center for Comparative Genomics, Center for Infectious Disease Dynamics, University Park, PA 1803, USA
Email: Daniel H Huson* - huson@informatik.uni-tuebingen.de; Daniel C Richter - drichter@informatik.uni-tuebingen.de;
Suparna Mitra - mitra@informatik.uni-tuebingen.de; Alexander F Auch - auch@informatik.uni-tuebingen.de;
Stephan C Schuster - scs@bx.psu.edu
* Corresponding author
Abstract
Background: Metagenomics is a rapidly growing field of research that aims at studying uncultured
organisms to understand the true diversity of microbes, their functions, cooperation and evolution,
in environments such as soil, water, ancient remains of animals, or the digestive system of animals
and humans. The recent development of ultra-high throughput sequencing technologies, which do
not require cloning or PCR amplification, and can produce huge numbers of DNA reads at an
affordable cost, has boosted the number and scope of metagenomic sequencing projects.
Increasingly, there is a need for new ways of comparing multiple metagenomics datasets, and for
fast and user-friendly implementations of such approaches.
Results: This paper introduces a number of new methods for interactively exploring, analyzing and
comparing multiple metagenomic datasets, which will be made freely available in a new,
comparative version 2.0 of the stand-alone metagenome analysis tool MEGAN.
Conclusion: There is a great need for powerful and user-friendly tools for comparative analysis
of metagenomic data and MEGAN 2.0 will help to fill this gap.
from The Seventh Asia Pacific Bioinformatics Conference (APBC 2009)
Beijing, China. 13–16 January 2009
Published: 30 January 2009
BMC Bioinformatics 2009, 10(Suppl 1):S12 doi:10.1186/1471-2105-10-S1-S12
Supplement: Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009). Editors: Michael Q Zhang, Michael S Waterman and Xuegong Zhang.
This article is available from: http://www.biomedcentral.com/1471-2105/10/S1/S12
© 2009 Huson et al; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Metagenomics is a rapidly growing field of research that
aims at studying uncultured organisms to understand the
true diversity of microbes, their functions, cooperation
and evolution, in environments such as soil, water,
ancient remains of animals, or the digestive system of
animals and humans. Although it is clear that communities
of microbes play a vital role in such systems, a more
detailed understanding is only beginning to emerge. A
main promise of metagenomics is that it will accelerate
drug discovery and biotechnology by providing new genes
with novel functions.
Currently, the key approach used in metagenomics is
large-scale sequencing of environmental samples. The
recent development of ultra-high throughput sequencing
technologies [1,2], which do not require cloning or PCR
amplification, and can produce huge numbers of DNA
reads at an affordable cost, has boosted the number and
scope of metagenomic sequencing projects, see [3,4]. The
analysis of such datasets is aimed at determining and
comparing the biological diversity and the functional
activity of different microbial communities.
Computationally, species identification relies on the use
of reference databases or reference phylogenies that contain
sequences of known origin and gene function. The
most prominently used databases are the NR and NT data-
bases [5]. Unfortunately, substantial database biases
toward model organisms present a major hurdle for
metagenomic analysis, and in a typical metagenome data-
set as much as 90% of the reads may exhibit no similarity
to any known sequence. However, this problem is beyond
the scope of this paper. In early 2007, our group released
and published the first publicly available, stand-alone
analysis tool for metagenomic data, called MEGAN [6,7].
We initially developed this tool to analyze the microbial
community present in a sample of mammoth bone [8].
MEGAN takes as input the result of a BLAST [9] compari-
son of a set of metagenomic reads against one or more ref-
erence databases and produces as output a taxonomical
analysis of the sample, obtained by assigning the reads to
different nodes in the NCBI taxonomy using an "LCA-
algorithm".
As an exploration tool designed and optimized to run on
a laptop, MEGAN complements other systems and
resources for metagenome analysis, which are offered in
the form of databases, web portals and web services, such
as [10-14].
MEGAN now has over 400 registered users working in
many different biological labs around the world. It is
routinely used at the Joint Genome Institute (JGI), both for
quality control and to provide initial analyses of
newly sequenced datasets. Other users include researchers
at the J. Craig Venter Institute studying viral populations. In a
recent publication [15], we demonstrate how to use the
software for meta-transcriptomics as well.
Increasingly, the emphasis of metagenome analysis is
shifting from species and functional identification for
individual datasets toward comparative analysis. This
paper addresses the latter issue and provides solutions to
questions such as: Given two or more metagenome data-
sets, how similar or different are their taxonomical and
functional profiles? Are observed differences statistically
significant? Have enough reads been sequenced, i.e. what
is the current "rate of discovery" as a function of the
number of reads sequenced? In the following section, we
will discuss some new ideas for analyzing individual
metagenome datasets. Then, we will focus on new
comparative methods. Finally, we will illustrate the
application of the methods in two comparisons, one comparing
the contents of a human gut [16] with the contents of a
mouse gut [17] and the other comparing a soil sample
[18] with a recent marine sample [19].
The ideas presented in this paper are all quite simple and
unsophisticated. The main merit of this work lies in the
integrated implementation of the methods in the form of
a very robust and user-friendly program, which is easily
used by biologists. The implementation goes well beyond
the hastily written "proof of concept" implementations
that so often accompany method papers. We are currently
beta-testing version 2.0 of the MEGAN software, which
implements all ideas presented in this paper. The latest
beta version can be obtained from our website at [20].
Methods
One goal of metagenome analysis is to determine the
taxonomical content of a dataset [6,21]. There are two main
approaches toward doing this.
The phylogenetic approach is based on carefully chosen
genes that are believed to provide robust phylogenetic
information [22,23], see [21,24]. When randomly-targeted
sequencing is used, only a small fraction of the
sequences will correspond to such phylogenetic markers
[21,25]. Often, universal primers are employed to
specifically target the phylogenetic markers. The DNA sequences
obtained are usually aligned into precomputed reference
alignments and placed into precomputed reference trees
using fast heuristics, and taxonomical placements are
then deduced from these placements.
The taxonomical approach places reads directly into the
NCBI taxonomy, based on the similarity of the reads to
sequences in one or more reference databases. As
randomly sequenced reads will exhibit very different levels of
evolutionary conservation, it is important to make use of
all ranks of the NCBI taxonomy, placing more conserved
sequences higher up in the taxonomy (i.e. closer to the
root) and more distinct sequences onto nodes that are
more specific (i.e. closer to the leaves, which represent
species and strains). This can be done using the LCA
algorithm and is the basis of the MEGAN program.
In summary, the LCA algorithm works as follows. A
sequencing read is compared against a database of
reference sequences, such as the NCBI NR database, and the
taxon information of significant matches is extracted and
mapped onto the leaves of the NCBI taxonomy. The leaves
of the NCBI taxonomy represent different species and
strains. The LCA algorithm computes the lowest common
ancestor of all these hits, which will correspond to some
higher-order taxon, and will then assign the read to that
taxon. In this way, species-specific sequences will be
assigned to the leaves or specific taxa, whereas sequences
that are conserved among different species, or that are
susceptible to horizontal gene transfer, will be assigned to
taxa of less-specific rank. See the original paper [6] for
more details.
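As an illustration only, the LCA step described above can be sketched in Python. The sketch assumes a toy taxonomy given as a child-to-parent map and assumes that the significant hits of a read have already been mapped to taxa; how significance is decided is not shown. The names PARENT, path_to_root and lowest_common_ancestor are hypothetical and are not taken from MEGAN.

```python
from typing import Dict, Iterable, List, Optional

# Toy child -> parent map standing in for the NCBI taxonomy (assumption:
# taxon names are used instead of NCBI taxon IDs; the root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Clostridia": "Firmicutes",
    "Bacilli": "Firmicutes",
    "Proteobacteria": "Bacteria",
}


def path_to_root(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the taxon followed by all of its ancestors, ending at the root."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lowest_common_ancestor(
    taxa: Iterable[str], parent: Dict[str, Optional[str]]
) -> Optional[str]:
    """Return the lowest node that lies on the root path of every given taxon."""
    taxa = list(taxa)
    if not taxa:
        return None  # a read without hits cannot be placed
    first_path = path_to_root(taxa[0], parent)
    common = set(first_path)
    for taxon in taxa[1:]:
        common &= set(path_to_root(taxon, parent))
    # first_path runs from the taxon up to the root, so the first shared
    # node encountered is the lowest common ancestor.
    for node in first_path:
        if node in common:
            return node
    return None
```

A read whose hits fall in Clostridia and Bacilli would be placed on Firmicutes by this sketch, whereas a read with a single hit stays on that hit's taxon.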
The two approaches have different advantages and
drawbacks. The phylogenetic approach can use established
phylogenies that are well understood, and targeted
sequencing provides much more informative data per
sequencing run. However, a commonly acknowledged
drawback is that the "universal primers" employed may
produce only a subset of the true spectrum of different
sequences. On the other hand, random sequencing is
often used in metagenomics to analyze the gene content
of a community and then the taxonomical approach can
make full use of the data and can be complemented by a
phylogenetic approach.
Rate of discovery
One important question is whether the level of sequenc-
ing performed for a given sample is sufficient to capture
the most abundant taxa. This can be addressed by plotting
the discovery rate of a dataset, which is obtained by repeat-
edly selecting random subsamples of the dataset at 10, 20
..., 90% of the original size, and then plotting the number
of taxa predicted by the LCA algorithm, see Figure 1. This
graph can be used to estimate (roughly) how many addi-
tional species are likely to be discovered if one were to
increase the number of reads by a factor of two, say.
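The subsampling procedure can be sketched as follows. This is a minimal sketch under stated assumptions, not the MEGAN implementation: each read is assumed to already carry its assigned taxon, the number of repeats per subsample size and the random seed are hypothetical parameters, and the default counting function simply counts distinct taxa. The full dataset is added as a 100% point for reference.

```python
import random
from statistics import mean
from typing import Callable, Dict, Sequence


def count_distinct_taxa(assignments: Sequence[str]) -> int:
    """Default taxon count: the number of distinct taxa with any read."""
    return len(set(assignments))


def discovery_rate(
    assignments: Sequence[str],
    repeats: int = 5,  # hypothetical: subsamples drawn per percentage
    count_taxa: Callable[[Sequence[str]], int] = count_distinct_taxa,
    seed: int = 0,  # hypothetical: fixed seed for reproducibility
) -> Dict[int, float]:
    """Map each percentage (10, 20, ..., 100) to a mean taxon count."""
    rng = random.Random(seed)
    total = len(assignments)
    curve: Dict[int, float] = {}
    for percent in range(10, 101, 10):
        if percent == 100:
            # The full dataset needs no subsampling.
            curve[percent] = float(count_taxa(assignments))
            continue
        size = round(total * percent / 100)
        # Draw 'repeats' random subsamples without replacement and average.
        counts = [count_taxa(rng.sample(assignments, size)) for _ in range(repeats)]
        curve[percent] = float(mean(counts))
    return curve
```

A curve that is still rising steeply between 90% and 100% suggests that further sequencing would reveal additional taxa; a flat curve suggests that the most abundant taxa have been captured.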
To estimate the number of species here, one might first
consider counting the number of leaves of the taxonomy
to which reads have been assigned. However, this number
may be confounded by the presence of different strains
and isolates. To avoid this problem, in our implementa-
tion in MEGAN 2.0 we use the number of strongly sup-
ported nodes as a proxy for the number of species. We say
that a node v in the NCBI taxonomy is strongly supported at
level t, where t is a small number (≈ 5), if v has been
assigned t or more reads and no node below v has that
property.
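The definition of a strongly supported node can be sketched as follows. The sketch assumes that the read counts are those assigned directly to each node (not summed over its subtree), which is one reading of the definition, and it uses a hypothetical toy child-to-parent map; the function and variable names are illustrative and not taken from MEGAN.

```python
from typing import Dict, Optional, Set

# Toy child -> parent map standing in for the NCBI taxonomy (assumption).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Clostridia": "Firmicutes",
    "Bacilli": "Firmicutes",
    "Proteobacteria": "Bacteria",
}


def strongly_supported_nodes(
    read_counts: Dict[str, int],
    parent: Dict[str, Optional[str]],
    t: int = 5,
) -> Set[str]:
    """Return nodes with at least t reads and no descendant with at least t reads."""
    supported = {node for node, count in read_counts.items() if count >= t}
    # Every proper ancestor of a supported node has a supported node below
    # it and therefore cannot itself be strongly supported.
    shadowed: Set[str] = set()
    for node in supported:
        ancestor = parent[node]
        # Stop early at an already shadowed ancestor: everything above it
        # was marked when that ancestor was first reached.
        while ancestor is not None and ancestor not in shadowed:
            shadowed.add(ancestor)
            ancestor = parent[ancestor]
    return supported - shadowed
```

The size of the returned set is then used as the proxy for the number of species.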
Figure 1. A discovery rate plot computed by MEGAN 2.0 for the
mouse gut dataset. The x-axis ("Percentage of reads sampled",
10% to 100%) represents the percentage of reads subsampled
from the total dataset and the y-axis ("Number of strong
nodes", 0 to 1000) represents the number of strong nodes
(with t = 5) computed by the LCA algorithm, approximating
the number of identified species. The datapoint at 10 × t %
is based on t independent runs.
Functional assessment
In a functional analysis, the goal is to determine which
types of genes are available at what relative levels of
abundance. Such an analysis can be based on sequences
obtained by random sequencing either of the genomic
DNA in a metagenome, or (reverse transcribed) RNA. In
the former case, the coding potential is analyzed, whereas
in the latter case, the focus is on gene expression. A general
strategy is to compare the reads against reference
databases of gene sequences such as COG [26] and SEED [11].
A number of sequences available in the NR database are
annotated by COG [26] identifiers. Hence, after BLAST
comparison of a metagenomic dataset with the NR
database, a first analysis of the types of genes present in the
dataset can be performed by extracting all COG identifiers
from the BLAST hits and then summarizing the relative
abundances of the different COG categories, see Figure 2.
Meta-data analysis
The result of a taxonomical analysis can be enhanced by
using "meta-data" to summarize the identified species.
For example, the "Prokaryotic Attributes Table"
(obtainable from the NCBI website) lists attributes of microbes
that describe their cellular features, environment,
temperature, pathogenicity and relevance for diseases. A
summary of an analysis based on such attributes is shown in
Figure 3.
Taxonomy-guided capture of reads
Once a first analysis has been performed and reads have
been assigned to taxa, it is often desirable to be able to
identify and capture all reads that have been assigned to
one part of the NCBI taxonomy, not only to a specific
species, but also to a class, genus or other rank of the
taxonomy. This is very useful, for example, when performing
additional analysis such as determining the GC-content
for a collection of taxa, or for sequence assembly
purposes.
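Capturing the reads below a chosen node and computing their GC-content can be sketched as follows. This is an illustrative sketch only: the toy child-to-parent map, the read identifiers and the function names are hypothetical, and reads are assumed to be available as a mapping from read identifier to assigned taxon and to sequence.

```python
from typing import Dict, Iterable, List, Optional

# Toy child -> parent map standing in for the NCBI taxonomy (assumption).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Clostridia": "Firmicutes",
    "Bacilli": "Firmicutes",
    "Proteobacteria": "Bacteria",
}


def in_subtree(taxon: str, clade: str, parent: Dict[str, Optional[str]]) -> bool:
    """True if the taxon is the clade itself or lies below it in the taxonomy."""
    node: Optional[str] = taxon
    while node is not None:
        if node == clade:
            return True
        node = parent[node]
    return False


def capture_reads(
    assignments: Dict[str, str], clade: str, parent: Dict[str, Optional[str]]
) -> List[str]:
    """Return the IDs of all reads assigned to the clade or to any node below it."""
    return [
        read_id
        for read_id, taxon in assignments.items()
        if in_subtree(taxon, clade, parent)
    ]


def gc_content(sequences: Iterable[str]) -> float:
    """Fraction of G and C among all A, C, G, T bases; other symbols are ignored."""
    gc = 0
    total = 0
    for sequence in sequences:
        upper = sequence.upper()
        gc += upper.count("G") + upper.count("C")
        total += sum(upper.count(base) for base in "ACGT")
    return gc / total if total else 0.0
```

The captured read identifiers could equally be written to a file and passed to an assembler, which is the other use mentioned above.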
Comparative visualization
In a comparative analysis, different datasets are brought
together and compared for taxonomical and functional
content. To compare multiple datasets, we define a new
multiple-comparison tree view in which an arbitrary number
of different datasets are displayed together on a subtree of
the NCBI taxonomy, as shown in Figures 4 and 5. In such
a view, each node in the NCBI taxonomy is shown as a pie
chart indicating the number of reads (normalized, if
desired) from each dataset that have been assigned to that
node. An important feature is the ability to interactively
collapse or expand the presented tree at different levels of
the taxonomy, so as to be able to start at a high-level view
and then to drill down to a low-level comparison.
For publication purposes, the ability to interactively set up
and generate different types of summaries using bar and
pie charts, and also heat maps for many-way
comparisons, is important. We are developing an interactive and
fully customizable chart viewer for MEGAN 2.0 that
allows one to extract a number of different comparisons
directly from the multiple comparison tree view. For
example, one can generate a bar chart summarizing the
number of reads assigned at any desired rank of the NCBI
taxonomy, see Figure 6.
Figure 2. A classification of all COGs determined in the mouse gut sample.
Figure 3. Summary of the microbial attributes of the soil dataset based on the NCBI's "Prokaryotic Attributes Table". In each pie chart,
the number of classified species having the indicated property is displayed.
Figure 4. Two multiple-comparative tree views of a human gut metagenome [16] shown in red and a mouse gut metagenome [17]
shown in green, as computed by MEGAN 2.0, using normalized counts. In (a), we show an overview of the taxonomy down to
the phylum level, whereas in (b) we display a part of a class-level analysis. In bold we show the support values as listed in Table
1. (Tree images not reproduced; legend entries: Human_gut_summary, mouse_gut_obese1_summary.)
Statistical significance
Comparative visualizations are useful to obtain an
impression of how two datasets differ. For a more detailed
analysis, one requires information on the statistical
significance of observed differences, see Tables 1 and 2. To this
end, we have adapted a test developed for comparing
curated subsystems in metagenomic data [27]. This test