
Visualizing Patterns of Marine Eukaryotic Diversity from Metabarcoding Data Using QIIME


Abstract

PCR amplification followed by deep sequencing of homologous gene regions is increasingly used to characterize the diversity and taxonomic composition of marine eukaryotic communities. This approach may generate millions of sequences for hundreds of samples simultaneously. Therefore, tools that researchers can use to visualize complex patterns of diversity for these massive datasets are essential. Efforts by microbiologists to understand the Earth and human microbiomes using high-throughput sequencing of the 16S rRNA gene have led to the development of several user-friendly, open-source software packages that can be similarly used to analyze eukaryotic datasets. Quantitative Insights Into Microbial Ecology (QIIME) offers some of the most helpful data visualization tools. Here, we describe functionalities to import OTU tables generated with any molecular marker (e.g., 18S, COI, ITS) and associated metadata into QIIME. We then present a range of analytical tools implemented within QIIME that can be used to obtain insights about patterns of alpha and beta diversity for marine eukaryotes.
Sarah J. Bourlat (ed.), Marine Genomics: Methods and Protocols, Methods in Molecular Biology, vol. 1452,
DOI 10.1007/978-1-4939-3774-5_15, © Springer Science+Business Media New York 2016
Chapter 15
Matthieu Leray and Nancy Knowlton
Key words: Metabarcoding, QIIME, Alpha diversity, Beta diversity, Principal component analysis
1 Introduction
The world’s oceans harbor an immense diversity of life, estimated at between 0.3 and 2.2 million species belonging to 31 phyla [1, 2]. Yet, with only about 0.25 million formally described species to date, a considerable portion of that diversity remains either unknown to science or without a formal description. In addition, diagnostic morphological characters used to differentiate species can be very subtle in some invertebrate groups or even absent among microscopic soft-bodied taxa (e.g., nematodes). This taxonomic impediment has dramatically limited our understanding of the way ocean ecosystems function and respond to environmental changes, with most studies focusing on just a few indicator metazoan taxa [3]. The realization that we might not be able to study ocean diversity using morphology alone has led more and more researchers toward DNA approaches to characterize and monitor diversity [4, 5].
DNA sequencing expands the taxonomic coverage of ecological studies by providing rapid and reliable species identifiers that do not depend on having taxonomic expertise. Building upon DNA barcode resources and taking advantage of the availability of affordable High-Throughput Sequencing (HTS) technologies, the concept of DNA metabarcoding is now revolutionizing our understanding of patterns of marine diversity. Most commonly, general primer sets are used to mass amplify (via Polymerase Chain Reaction or PCR) a short hypervariable DNA fragment from a collection of organisms mixed together. Reads obtained using an HTS platform are then sorted bioinformatically to delineate Operational Taxonomic Units (OTUs) and provide estimates of richness and community composition [6]. This metabarcoding approach [7] is now widely used, and the scientific community has converged toward using standard DNA markers such as the 18S nuclear small subunit (nSSU) ribosomal RNA and the mitochondrial Cytochrome c Oxidase Subunit I (COI) genes to target a wide taxonomic range of organisms.
The ability to obtain community profiles from hundreds of samples offers the potential for an unprecedented understanding of marine diversity. While powerful analysis tools are essential to handle very large sequence datasets, visualizing patterns of diversity for complex datasets also represents a major challenge. For example, researchers need to be able to see how community profiles vary in relation to each other and in relation to various metadata variables. Quantitative Insights Into Microbial Ecology (QIIME) [8] is one of the most powerful data visualization tools. Like many other sequence data analysis workflows (e.g., Mothur [9]), QIIME was developed to analyze microbial 16S rRNA datasets, but its functionalities can be used to analyze OTU tables obtained with any molecular marker.
QIIME combines many separate programs into a user-friendly software package to perform analysis from the raw data generated by any HTS platform to the graphical representation of the data. Functionalities available within QIIME for sequence processing include sample demultiplexing, OTU picking, phylogenetic analysis, and taxonomic assignments, all of which are also implemented in other software packages (e.g., Mothur [9], CloVR [10], LotuS [11]). On the other hand, QIIME offers some unique interactive graphic tools to explore patterns of diversity in relation to metadata variables (e.g., three-dimensional visualizations with EMPeror [12]). In this chapter, we first describe how to import an OTU table and associated metadata into QIIME. We then present a QIIME tutorial to represent the taxonomic composition (e.g., histograms, heatmaps), plot OTU diversity (alpha diversity), and plot dissimilarities in OTU composition between samples (beta diversity). We illustrate community analysis using amplicon data comparing OTU composition between sessile communities collected on settlement plates in Florida and Virginia (USA). Samples were characterized using HTS of the mitochondrial Cytochrome c Oxidase Subunit I region [5].
2 Materials
2.1 QIIME Packages
To avoid installing each program used within QIIME separately, the development team offers several full installation packages.
1. The precompiled MacQIIME software package for Mac OS X. The following tutorial assumes that the user is working with this package.
2. The QIIME virtual box that can be installed and run on Mac OS X, Windows, and Linux. Instructions are provided in the following link.
3. The QIIME virtual machine on Amazon Web Services.
2.2 Getting Ready with MacQIIME 1.9.1
MacQIIME is a precompiled, easy-to-install, and easy-to-use version of QIIME that is maintained by the Werner lab. Version 1.9.1 requires Mac OS X version 10.7 (Lion) or above.
1. Download the compressed tar archive of MacQIIME here.
2. Unarchive MacQIIME 1.9.1.
Open your terminal and type:
$ cd ~/Downloads
$ tar -xvf MacQIIME_1.9.1-20150604_OS10.7.tar
3. Install MacQIIME 1.9.1.
$ cd MacQIIME_1.9.1-20150604_OS10.7
$ ./install.s
4. Launch MacQIIME 1.9.1.
$ macqiime
2.3 BIOM Formatted OTU Table
Regardless of the target gene and the bioinformatics pipeline used for sample demultiplexing, quality filtering, OTU clustering, and taxonomic assignment, the final product of any metabarcoding study is a “sample by observation contingency table” displaying the number of reads per OTU and per sample. The file format traditionally used to represent a raw contingency table is a tab-delimited text file, termed a classic OTU table. A classic OTU table can be easily visualized and manipulated in Excel or TextWrangler, but it has been deemed an inefficient file format for cross-study comparisons, for transferring data between software packages, and for optimizing the use of disk space [13]. As a result, QIIME now requires classic OTU tables to be converted into the Biological Observation Matrix (BIOM) file format.
1. Prepare your classic OTU table as shown in Table 1 and save it as a tab-delimited text file called “otu_table.txt” in a directory called “qiime_analysis.”
2. Open your terminal and navigate to the directory “qiime_analysis” where the classic OTU table is located.
$ cd ~/qiime_analysis
3. Assuming that MacQIIME is already open (see earlier), type the following command to convert your classic OTU table to a BIOM formatted table (see Note 1).
$ biom convert -i otu_table.txt -o otu_table.biom --process-obs-metadata taxonomy --table-type "OTU table" --to-json
where the parameter -i is used to specify the input file and the parameter -o is used to specify the output file (see Note 2).
4. Check conversion by summarizing the count information.
$ biom summarize-table -i otu_table.biom -o otu_table_summary_counts.txt --qualitative
The number of OTUs per sample is provided. The parameter --qualitative should be removed from the command to summarize the number of sequences per sample instead.
5. Filter out singletons.
OTUs represented by a single sequence are often considered less reliable because they may result from sequencing errors. The following command discards all OTUs that are not represented by at least two sequences.
$ -i otu_table.biom -o otu_table_nosingleton.biom -n 2
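To make the summaries in steps 4 and 5 concrete, the following Python sketch works directly on a classic OTU table held in a string. It is an illustration only and is not part of QIIME or the biom-format package: the function names and the three-row excerpt are hypothetical, and it assumes that the minimum count of two is applied to the total number of reads of an OTU across all samples.

```python
import csv
import io

# Hypothetical three-OTU excerpt in classic (tab-delimited) format.
CLASSIC = (
    "#OTU ID\tML.0136\tML.0137\tML.0138\ttaxonomy\n"
    "171\t7900\t3809\t4328\tRoot;k__Animalia;p__Bryozoa\n"
    "185\t1\t3\t2\tRoot;k__Animalia;p__Chordata\n"
    "517\t0\t1\t0\tRoot;k__Animalia;p__Porifera\n"
)


def parse_classic_otu_table(text):
    """Return (sample_ids, {otu_id: [counts]}) from a classic OTU table."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    samples = rows[0][1:-1]  # drop "#OTU ID" and the trailing taxonomy column
    counts = {row[0]: [int(x) for x in row[1:-1]] for row in rows[1:]}
    return samples, counts


def otus_per_sample(samples, counts):
    """Qualitative summary: number of OTUs detected in each sample."""
    return {s: sum(1 for c in counts.values() if c[i] > 0)
            for i, s in enumerate(samples)}


def reads_per_sample(samples, counts):
    """Quantitative summary: number of reads in each sample."""
    return {s: sum(c[i] for c in counts.values())
            for i, s in enumerate(samples)}


def filter_min_count(counts, min_count=2):
    """Keep OTUs whose total read count reaches min_count (assumed rule)."""
    return {otu: c for otu, c in counts.items() if sum(c) >= min_count}


samples, counts = parse_classic_otu_table(CLASSIC)
print(otus_per_sample(samples, counts))
print(reads_per_sample(samples, counts))
print(sorted(filter_min_count(counts)))
```

In this excerpt, OTU 517 is represented by a single read overall and is the only one discarded.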
2.4 Metadata Mapping File (Input File Required)
The mapping file contains information about each sample present in the OTU table. It can be generated by hand using Excel and saved as a tab-delimited text file. Columns “#SampleID,” “BarcodeSequence,” “LinkerPrimerSequence,” and “Description” are mandatory and should be presented in this order. Additional metadata columns may be added between “LinkerPrimerSequence” and “Description” (e.g., Location, Site; Table 2). When QIIME is used for downstream data analysis and visualization only, as is the case in this tutorial, the “BarcodeSequence” and “LinkerPrimerSequence” columns can be left empty (but keep the tabs) or filled with “NA” (Table 2).
QIIME has a functionality to test the validity of a mapping file. The -p and -b parameters need to be specified if no barcode and primer sequences are provided. The command generates a log file listing potential warnings and errors detected in the mapping file (e.g., invalid characters, duplicated sample IDs). In the following command, the parameter -m specifies the mapping file, labeled “Metadata_map.txt.”
$ validate_mapping_ -m Metadata_map.txt -p -b
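The mapping file rules stated in this chapter can be checked by hand before running the QIIME validator. The sketch below is a hypothetical helper, not QIIME's own validation code. It tests only what the text states: the mandatory columns and their order, one field per column, sample identifiers restricted to alphanumeric characters and periods, and no duplicated identifiers.

```python
import re

REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]
VALID_ID = re.compile(r"^[A-Za-z0-9.]+$")


def check_mapping(text):
    """Return a list of problems found in a tab-delimited mapping file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    problems = []
    if header[:3] != REQUIRED_FIRST:
        problems.append("first three columns must be " + ", ".join(REQUIRED_FIRST))
    if header[-1] != "Description":
        problems.append("last column must be Description")
    seen = set()
    for ln in lines[1:]:
        fields = ln.split("\t")
        sample_id = fields[0]
        if len(fields) != len(header):
            problems.append(f"{sample_id}: expected {len(header)} fields, found {len(fields)}")
        if not VALID_ID.match(sample_id):
            problems.append(f"{sample_id}: invalid characters in sample ID")
        if sample_id in seen:
            problems.append(f"{sample_id}: duplicated sample ID")
        seen.add(sample_id)
    return problems


GOOD = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tLocation\tSite\tDescription\n"
    "ML.0136\tNA\tNA\tVirginia\tSite1\tVirginia.Site1\n"
    "ML.0145\tNA\tNA\tFlorida\tSite1\tFlorida.Site1\n"
)
# Hypothetical faulty row: underscore in the ID and a missing Description field.
BAD = GOOD + "ML_0145\tNA\tNA\tFlorida\tSite1\n"

print(check_mapping(GOOD))
print(check_mapping(BAD))
```

The QIIME validator performs additional checks that are not reproduced here, so this sketch should be read as a first screen rather than a replacement.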
Table 1
Example of a classic OTU table representing three samples (ML.0136-ML.0138) with a total of 15 OTUs
#OTU ID ML.0136 ML.0137 ML.0138 taxonomy
171 7900 3809 4328 Root;k__Animalia;p__Bryozoa;c__Gymnolaemata;o__Cheilostomatida;f__Schizoporellidae
768 2201 1864 5967 Root;k__Animalia;p__Cnidaria;c__Hydrozoa;o__Leptothecata;f__Campanulariidae
1031 10272 5771 3249 Root;k__Animalia;p__Cnidaria;c__Hydrozoa
13 8 1 1 Root;k__Animalia;p__Chordata;c__Ascidiacea;o__Stolidobranchia;f__Styelidae
185 1 3 2 Root;k__Animalia;p__Chordata;c__Ascidiacea;o__Phlebobranchia;f__Ascidiidae
978 9 6 13 Root;k__Animalia;p__Porifera;c__Demospongiae;o__Homosclerophorida;f__Oscarellidae
5 935 1547 2543 Root;k__Animalia;p__Bryozoa;c__Gymnolaemata;o__Cheilostomatida;f__Bugulidae
388 463 595 904 Root;k__Animalia;p__Chordata
971 670 1433 645 Root;k__Animalia;p__Arthropoda
3 555 829 865 Root;k__Animalia;p__Bryozoa;c__Gymnolaemata;o__Cheilostomatida;f__Membraniporidae
589 1397 1449 207 Root;k__Animalia;p__Arthropoda
919 1043 1107 195 Root;k__Animalia;p__Arthropoda;c__Malacostraca;o__Decapoda;f__Panopeidae
517 0 1 0 Root;k__Animalia;p__Porifera;c__Demospongiae;o__Halichondrida
516 3 3 1 Root;k__Animalia;p__Arthropoda;c__Maxillopoda;o__Calanoida
312 3 2 2 Root;k__Animalia;p__Annelida;c__Polychaeta;o__Sabellida;f__Sabellidae
This is a portion of a larger OTU table obtained using high-throughput sequencing of the mitochondrial Cytochrome c Oxidase Subunit I region [5]. The last column of the table presents the taxonomic assignment of each OTU (see Note 3)
2.5 Phylogenetic Tree
A phylogenetic tree is used to calculate phylogenetic alpha (e.g., phylogenetic diversity) and beta diversity metrics (e.g., UniFrac). A dedicated QIIME script generates a tree using various methods (default: FastTree). Otherwise a tree can be imported into QIIME as a Newick formatted tree file. In the following tutorial, we do not use phylogenetic alpha and beta metrics.
2.6 Alpha Parameter File
A parameter file is used to specify one or a set of values of a parameter within a QIIME script. The parameter file contains the name of the script (i.e., alpha_diversity) followed by the name of the parameter (i.e., metrics), followed by a tab and finally the value of the parameter (i.e., observed_species, chao1). The following line indicates that the observed number of species and chao1 should be used to calculate alpha diversity. It should be saved as a text file (see Note 5 for a list of alpha metrics implemented in QIIME).
alpha_diversity:metrics observed_species,chao1
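The layout of a parameter file line (script name, colon, parameter name, whitespace, comma-separated values) can be illustrated with a few lines of Python. The parser below is a hypothetical sketch written for this example and is not the routine QIIME uses internally.

```python
def parse_params(text):
    """Map (script, parameter) to the list of comma-separated values."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, value = line.split(None, 1)  # split on the first run of whitespace
        script, parameter = key.split(":", 1)
        params[(script, parameter)] = value.split(",")
    return params


ALPHA_PARAMS = "alpha_diversity:metrics\tobserved_species,chao1\n"
print(parse_params(ALPHA_PARAMS))
```

A line for another script can be read by the same function because only the text before the colon changes.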
Table 2
Example of a metadata mapping file containing information about 18 samples analyzed in Leray and Knowlton [5]
#SampleID BarcodeSequence LinkerPrimerSequence Location Site Description
ML.0136 NA NA Virginia Site1 Virginia.Site1
ML.0137 NA NA Virginia Site1 Virginia.Site1
ML.0138 NA NA Virginia Site1 Virginia.Site1
ML.0139 NA NA Virginia Site3 Virginia.Site3
ML.0140 NA NA Virginia Site3 Virginia.Site3
ML.0141 NA NA Virginia Site3 Virginia.Site3
ML.0142 NA NA Virginia Site2 Virginia.Site2
ML.0143 NA NA Virginia Site2 Virginia.Site2
ML.0144 NA NA Virginia Site2 Virginia.Site2
ML.0145 NA NA Florida Site1 Florida.Site1
ML.0146 NA NA Florida Site1 Florida.Site1
ML.0147 NA NA Florida Site1 Florida.Site1
ML.0148 NA NA Florida Site2 Florida.Site2
ML.0149 NA NA Florida Site2 Florida.Site2
ML.0150 NA NA Florida Site2 Florida.Site2
ML.0151 NA NA Florida Site3 Florida.Site3
ML.0152 NA NA Florida Site3 Florida.Site3
ML.0153 NA NA Florida Site3 Florida.Site3
Each sample represents a community of sessile organisms collected on settlement plates at three sites in Florida and Virginia (see Note 4)
2.7 Beta Parameter File
The beta parameter file specifies the beta metrics to use. The format of the file is similar to the format of the alpha parameter file detailed earlier. It should also be saved as a text file (see Note 6 for a list of beta metrics implemented in QIIME).
beta_diversity:metrics bray_curtis,binary_jaccard
3 Methods
Diversity analyses presented in the following section require four input files placed in the same directory called “qiime_analysis.” See Subheading 2 for details about input file format.
3.1 Rarefy OTU Table
High-throughput sequencing experiments often result in unequal numbers of reads between samples. Differences in sequencing depth affect estimates of alpha and beta diversity because as more sequences are obtained more OTUs are detected. The goal is therefore to scale the number of sequences of the larger samples down to the smallest number of sequences that a sample contains within the dataset. The following QIIME script creates a single OTU table “otu_table_nosingleton_rarefied.biom” that has been subsampled down to 11,982 reads for all samples.
$ -i otu_table_nosingleton.biom -o otu_table_nosingleton_rarefied.biom -d 11982
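Rarefying amounts to drawing a fixed number of reads from each sample at random, without replacement. The sketch below shows the idea for a single sample using only the Python standard library. The function name, the read counts, and the choice to return None for a sample with fewer reads than the requested depth are assumptions made for this example, not a description of the QIIME implementation.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from one sample.

    `counts` maps OTU IDs to read counts. Returns None when the sample
    holds fewer than `depth` reads (assumed behavior for this sketch).
    """
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    rarefied = dict.fromkeys(counts, 0)
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied


sample = {"171": 40, "768": 25, "13": 3, "185": 1}  # hypothetical counts, 69 reads
sub = rarefy(sample, depth=30)
print(sub, sum(sub.values()))
```

Rare OTUs such as the one with a single read may drop out of the subsample, which is why richness estimates depend on the depth chosen.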
3.2 Plot Rank Abundance Curves
Rank abundance curves display OTU richness and evenness. OTU richness in a sample can be viewed as the number of ranks that a curve reaches. OTU evenness corresponds to the slope of the line (Fig. 1). A steep line means that a few OTUs dominate the sample in terms of abundance (low evenness). In the following command line, -s is used to specify the name of the sample to plot; ‘*’ means that all samples should be presented on the same plot. We also specify -a to use absolute counts and -x to represent a linear x-axis scale.
$ -i otu_table_nosingleton_rarefied.biom -s '*' -o Rank_abundance_plots.pdf -a -x
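The two properties read from a rank abundance curve can also be expressed numerically. In the sketch below, richness is the number of ranks, and Pielou's evenness index is used as a numerical stand-in for the slope of the curve. The index is a substitution made for this example and is not computed by the plotting command. Both count vectors are hypothetical.

```python
import math


def rank_abundance(counts):
    """Return nonzero counts sorted from most to least abundant."""
    return sorted((c for c in counts if c > 0), reverse=True)


def pielou_evenness(counts):
    """Shannon entropy divided by its maximum, ln(richness)."""
    ranked = rank_abundance(counts)
    if len(ranked) < 2:
        return 0.0
    total = sum(ranked)
    shannon = -sum((c / total) * math.log(c / total) for c in ranked)
    return shannon / math.log(len(ranked))


dominated = [970, 10, 10, 5, 5, 0]   # hypothetical: one OTU dominates (steep curve)
even = [200, 200, 200, 200, 200, 0]  # hypothetical: equal abundances (flat curve)

for name, counts in (("dominated", dominated), ("even", even)):
    ranked = rank_abundance(counts)
    print(name, "richness:", len(ranked), "evenness:", round(pielou_evenness(counts), 3))
```

Both samples reach five ranks, but the dominated sample has a much lower evenness value, matching the steep curve described above.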
3.3 Visualize Taxonomic Composition
Various graphical representations of taxonomic composition are implemented in QIIME. Histograms and heatmaps drawn at various taxonomic levels are particularly useful for interpreting patterns of alpha and beta diversity.
Fig. 1 Rank abundance curves for 18 communities of sessile organisms collected on settlement plates in Virginia (ML.0136-ML.0144) and Florida (ML.0145-ML.0153). Each community was characterized using HTS of COI amplicons [5]
1. Calculate the relative abundance of taxonomic groups within each sample of the rarefied OTU table. The following script creates one table per taxonomic level (i.e., kingdom, phylum, class). They are later referred to as taxonomy tables.
$ -i otu_table_nosingleton_rarefied.biom -o taxa_summary_relative_abundance
2. Generate taxonomy tables with absolute abundance rather than relative abundance using the parameter -a.
$ -i otu_table_nosingleton_rarefied.biom -o taxa_summary_absolute_abundance -a
3. Plot histograms displaying relative sequence abundance within each sample at the phylum (Fig. 2) and class levels. Open the html files to visualize the plots. QIIME also produces each plot in a pdf format.
$ -i taxa_summary_relative_abundance/otu_table_nosingleton_rarefied_L3.txt -o taxa_summary_plots/plots_phylum
$ -i taxa_summary_relative_abundance/otu_table_nosingleton_rarefied_L4.txt -o taxa_summary_plots/plots_class
4. Plot heatmaps with relative sequence abundance to further explore differences in composition between groups of samples. Like histograms, they can be produced at all taxonomic levels represented in taxonomy tables (computed in steps 1 and 2). In the following, we specify --no_log_transform because the input file contains relative abundances.
$ -i taxa_summary_rel-
e ed_L3.biom -o taxa_summary_plots/plots_
phylum/phylum_heatmap.pdf --no_log_transform
$ -i taxa_summary_rel-
e ed_L4.biom -o taxa_summary_plots/plots_
class/class_heatmap.pdf --no_log_transform
5. Plot heatmaps with log-transformed absolute sequence abundance at the phylum (Fig. 3) and class levels. Because in most datasets a few OTUs might dominate the sequence counts, transforming the data often helps better visualize differences in taxonomic composition.
$ -i taxa_summary_absolute_abundance/otu_table_nosingleton_rarefied_L3.biom -o taxa_summary_plots/plots_phylum/phylum_heatmap_log.pdf --absolute_abundance
$ -i taxa_summary_absolute_abundance/otu_table_nosingleton_rarefied_L4.biom -o taxa_summary_plots/plots_class/class_heatmap_log.pdf --absolute_abundance
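The taxonomy tables used in the steps above are obtained by pooling the reads of all OTUs that share the same assignment down to a given rank. The following sketch shows that pooling for one sample. The function name and the read counts are hypothetical, the taxonomy strings follow the format of Table 1, and level 3 corresponds to phylum because the strings start with “Root.” OTUs assigned only to a shallower rank keep their shorter label here, which may differ from how QIIME labels them.

```python
from collections import defaultdict


def summarize_level(otu_counts, taxonomy, level):
    """Pool read counts of OTUs sharing the first `level` taxonomy ranks.

    Returns relative abundances that sum to 1 for the sample.
    """
    totals = defaultdict(int)
    for otu, count in otu_counts.items():
        ranks = taxonomy[otu].split(";")
        totals[";".join(ranks[:level])] += count
    grand_total = sum(totals.values())
    return {taxon: n / grand_total for taxon, n in totals.items()}


taxonomy = {
    "171": "Root;k__Animalia;p__Bryozoa;c__Gymnolaemata;o__Cheilostomatida;f__Schizoporellidae",
    "5": "Root;k__Animalia;p__Bryozoa;c__Gymnolaemata;o__Cheilostomatida;f__Bugulidae",
    "768": "Root;k__Animalia;p__Cnidaria;c__Hydrozoa;o__Leptothecata;f__Campanulariidae",
    "388": "Root;k__Animalia;p__Chordata",
}
counts = {"171": 60, "5": 20, "768": 15, "388": 5}  # hypothetical reads in one sample

phylum = summarize_level(counts, taxonomy, level=3)
for taxon, fraction in sorted(phylum.items()):
    print(taxon, fraction)
```

The two bryozoan OTUs are merged into a single phylum-level entry, which is what the histograms and heatmaps display.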
Fig. 2 Histogram summarizing the relative number of COI amplicons within each community of sessile organisms collected in Virginia (ML.0136-ML.0144) and Florida (ML.0145-ML.0153). Relative abundance is presented at the phylum level. See metadata file for more information about each sample
Fig. 3 Heatmap representing log-transformed numbers of COI amplicons within each community of sessile organisms collected in Virginia (ML.0136-ML.0144) and Florida (ML.0145-ML.0153)
3.4 Calculate Alpha Diversity
Alpha diversity represents the diversity within each sample (see Note 5 for a list of alpha metrics implemented in QIIME). The following command line creates a table with the observed number of OTUs and Chao1 values for each sample.
$ -i otu_table_nosingleton_rarefied.biom -m observed_otus,chao1 -o alpha_
3.5 Plot Alpha Rarefaction Curves
Unlike species accumulation curves, individual-based rarefaction curves are built using a resampling approach by randomly selecting sequences at increasing levels of accumulation (e.g., 1000, 2000, 3000 reads) until all sequences have been accumulated. Many resampling iterations are done at each level to calculate the mean and standard deviation of the curve.
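The observed number of OTUs, the Chao1 estimate, and one point of a rarefaction curve can be sketched as follows. The Chao1 function uses the bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and twice. Whether QIIME's output matches this form in every case is an assumption. The function names and the count vector are hypothetical.

```python
import random
import statistics


def observed_otus(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    singles = sum(1 for c in counts if c == 1)
    doubles = sum(1 for c in counts if c == 2)
    return observed_otus(counts) + singles * (singles - 1) / (2 * (doubles + 1))


def rarefaction_point(counts, depth, iterations=10, seed=0):
    """Mean and standard deviation of observed OTUs at one depth."""
    rng = random.Random(seed)
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    values = [len(set(rng.sample(pool, depth))) for _ in range(iterations)]
    return statistics.mean(values), statistics.stdev(values)


counts = [10, 5, 2, 2, 1, 1, 1, 0]  # hypothetical reads per OTU, 22 reads in total
print(observed_otus(counts), chao1(counts))
for depth in (5, 10, 22):
    print(depth, rarefaction_point(counts, depth))
```

At the full depth of 22 reads every iteration recovers all seven OTUs, so the standard deviation falls to zero, which is where a rarefaction curve ends.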
1. Plot rarefaction curves using alpha metrics specified in the alpha_params.txt file (see Subheading 2.6). An interactive .html output file can be opened with a web browser to visualize curves built with the different alpha metrics. Because a mapping file is also specified, rarefaction curves averaged per group of samples are also displayed (e.g., Virginia vs. Florida).
$ -i otu_table_nosingleton_rarefied.biom -m Metadata_map.txt -o alpha_rarefaction -p alpha_params.txt
If no alpha parameter file is specified, the default metrics are used, among which PD_whole_tree requires a phylogenetic tree.
2. Plot rarefaction curves in .pdf format (Fig. 4; also see Note 7).
$ -i alpha_rarefaction/alpha_div_collated -o alpha_rarefaction/pdfs -g pdf -m Metadata_map.txt
Fig. 4 Alpha rarefaction curves representing the observed number of OTUs as a function of the number of resampled sequences in the OTU table. Curves were averaged per location (red: Florida; blue: Virginia)
3.6 Calculate Beta Diversity
Beta diversity is a measure of dissimilarity in species composition between samples. QIIME supports both qualitative and quantitative metrics of beta diversity. Qualitative metrics (e.g., binary_jaccard) measure changes in communities driven by presence/absence of OTUs, whereas quantitative metrics (e.g., bray_curtis) measure differences in relative abundance of OTUs between communities. The following command line calculates distance matrices using the Jaccard and Bray Curtis metrics.
$ -i otu_table_nosingleton_rarefied.biom -m binary_jaccard,bray_curtis -o
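The difference between a qualitative and a quantitative metric is easiest to see on a pair of small samples. The sketch below implements the standard definitions of the binary Jaccard and Bray Curtis dissimilarities for two count vectors. The function names and the counts are hypothetical, and the code is meant to illustrate the formulas rather than reproduce QIIME's implementation.

```python
def bray_curtis(a, b):
    """Sum of absolute differences divided by the total count of both samples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def binary_jaccard(a, b):
    """One minus the share of OTUs present in both samples among those in either."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, y in enumerate(b) if y > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


# Hypothetical read counts for the same four OTUs in two rarefied samples.
sample_a = [5, 0, 3, 2]
sample_b = [1, 4, 3, 0]

print(round(bray_curtis(sample_a, sample_b), 4))
print(binary_jaccard(sample_a, sample_b))
```

The two samples share two of the four OTUs, giving a Jaccard distance of 0.5, while the Bray Curtis value also reflects how unevenly the shared OTUs are represented.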
3.7 Visualize Beta Diversity
In the following section, we provide command lines to represent dissimilarities in OTU composition calculated between samples using both the qualitative Jaccard metric and the quantitative Bray Curtis metric.
1. Plot pairwise distances within and between categories (Fig. 5).
$ -d beta_diver-
rare ed.txt -m Metadata_map.txt -f "Location"
-o beta_boxplot/binary_jaccard
$ -d beta_diver-
e ed.txt -m Metadata_map.txt -f "Location"
-o beta_boxplot/bray_curtis
Fig. 5 Boxplot representing Bray Curtis distances within and between locations where sessile communities were collected. These plots show that, based on COI amplicon data, the taxonomic composition of sessile communities is more similar within Florida and Virginia than it is between the two locations
2. Calculate principal coordinate axes.
$ -i beta_diver-
rare ed.txt -o PC_binary_jaccard.txt
$ -i beta_diver-
e ed.txt -o PC_bray_curtis.txt
3. Plot interactive three-dimensional Principal Coordinate Analysis (PCoA) using EMPeror [12] (Fig. 6).
$ -i PC_binary_jaccard.txt -m Metadata_map.txt -o 3D_PCoA/binary_jaccard/
$ -i PC_bray_curtis.txt -m Metadata_map.txt -o 3D_PCoA/bray_curtis/
4. Plot two-dimensional PCoA using the Jaccard and Bray Curtis distance matrices.
$ -i PC_binary_jaccard.txt -m Metadata_map.txt -o 2D_PCoA_binary_jaccard
$ -i PC_bray_curtis.txt -m Metadata_map.txt -o 2D_PCoA_bray_curtis
5. Plot hierarchical clustering tree using Unweighted Pair Group
Method with Arithmetic mean ( UPGMA ). The resulting tree
can be visualized in FigTree (http://
ware/fi gtree/).
$ -i beta_diversity/
e ed.txt -o UPGMA_binary_jaccard.tre
$ -i beta_diversity/bray_curtis_otu_table_nosingleton_rarefied.txt -o UPGMA_bray_curtis.tre
Fig. 6 Three-dimensional PCoA visualized using EMPeror in a web browser. EMPeror is used to color samples by category in the metadata mapping file. Here, the PCoA was built using the Bray Curtis distance matrix
3.8 Evaluate Robustness of Beta Diversity Estimates to Sequencing Effort
The reliability of beta diversity estimates is measured by resampling random subsets of the OTU table several times, a process called jackknifing. Beta diversity is then calculated for all independent datasets and compared to the value obtained for the entire dataset.
1. Jackknife the entire dataset and compare beta diversity values. Here, we resample 8986 sequences of each sample (which corresponds to 75% of the smallest sample) to ensure that the smallest sample is also randomly resampled.
$ -i otu_table_nosingleton.biom -m Metadata_map.txt -o beta_jack -e 8986 -p beta_params.txt
This workflow creates a three-dimensional PCoA with a confidence ellipsoid around each sample. It also calculates support values for the UPGMA tree.
Fig. 7 UPGMA hierarchical clustering tree with branch support calculated using jackknifing. Branch colors represent the level of support: red for 75-100% support; yellow for 50-75% support; green for 25-50% support; blue for <25% support. Here, the tree shows a high level of support for differences in community composition between sessile communities in Virginia (ML.0136-ML.0144) and Florida (ML.0145-ML.0153). It also shows strong support for community structuring at each location
2. Add support values to the UPGMA hierarchical clustering tree (Fig. 7).
$ -m beta_jack/
binary_jaccard/upgma_cmp/master_tree.tre -s
knife_support.txt -o beta_jack/binary_jac-
$ -m beta_jack/
bray_curtis/upgma_cmp/master_tree.tre -s
support.txt -o beta_jack/bray_curtis/upgma_
3. Plot two-dimensional PCoA with confi dence ellipsoid around
each sample estimated using jackknifi ng.
$ -i beta_jack/binary_jac-
card/pcoa/ -m Metadata_map.txt -b 'Location&&Site'
--ellipsoid_opacity=0.2 -o beta_jack/binary_
$ -i beta_jack/bray_curtis/
pcoa/ -m Metadata_map.txt -b 'Location&&Site'
--ellipsoid_opacity=0.2 -o beta_jack/
4. Calculate the mean, median, and standard deviation of the dis-
tance matrices previously created by jackknifi ng the OTU
table .
$ -i beta_jack/
binary_jaccard/rare_dm/ -o beta_jack/binary_
$ -i beta_jack/
bray_curtis/rare_dm/ -o beta_jack/bray_cur-
5. Create boxplots to visualize variations in beta distances within and between categories. Here, we specify the category “Location.”
$ -m Metadata_map.txt -o beta_jack/binary_jaccard/distance_boxplot/ -d beta_jack/binary_jaccard/rare_dm/means.txt -f Location --save_raw_data
$ -m Metadata_map.txt -o beta_jack/bray_curtis/distance_boxplot/ -d beta_jack/bray_curtis/rare_dm/means.txt -f Location --save_raw_data
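The jackknifing logic can be sketched in a few lines: each sample is repeatedly subsampled to 75% of the smallest sample, a distance is computed for every replicate, and the replicates are then summarized. The code below is a simplified illustration for a single pair of samples with hypothetical counts. It uses the Bray Curtis formula and does not reproduce the QIIME workflow, which also builds PCoA plots and UPGMA trees from the replicates.

```python
import random
import statistics


def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement from a vector of OTU counts."""
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for i in rng.sample(pool, depth):
        out[i] += 1
    return out


def bray_curtis(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def jackknifed_distance(a, b, depth, replicates=10, seed=0):
    """Mean, median, and standard deviation of distances across replicates."""
    rng = random.Random(seed)
    dists = [bray_curtis(subsample(a, depth, rng), subsample(b, depth, rng))
             for _ in range(replicates)]
    return statistics.mean(dists), statistics.median(dists), statistics.stdev(dists)


# Depth used in this chapter: 75% of the smallest sample (11,982 reads).
depth_75 = int(0.75 * 11982)
print(depth_75)

# Hypothetical pair of samples with 100 reads each, resampled at 75 reads.
sample_a = [40, 30, 20, 10]
sample_b = [10, 20, 30, 40]
print(jackknifed_distance(sample_a, sample_b, depth=int(0.75 * 100)))
```

A small standard deviation across replicates indicates that the distance between the two samples is robust to the loss of a quarter of the reads.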
3.9 Test for Differences in Beta Diversity Between Groups of Samples
Several statistical approaches are implemented within QIIME for testing for differences in beta diversity between groups of samples (see Note 8). Among them, the Permutational Multivariate Analysis of Variance (PERMANOVA) is a nonparametric analog of ANOVA that tests for differences in the position of sets of objects in multivariate space. The parameter -c specifies the metadata categories that need to be compared.
table_nosingleton_rarefied.txt -m Metadata_map.txt -c Location -o beta_tests/binary_jaccard/ --method permanova
$ -i beta_diversity/bray_curtis_otu_table_nosingleton_rarefied.txt -m Metadata_map.txt -c Location -o beta_tests/bray_curtis/ --method permanova
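The statistic behind PERMANOVA can be written directly from a distance matrix. The sketch below follows the usual formulation: the pseudo-F ratio compares the among-group and within-group sums of squared distances, and the p value is the proportion of label permutations that give a ratio at least as large as the observed one. The function names and the toy distance matrix are hypothetical, and details such as the permutation scheme may differ from the implementation called by QIIME.

```python
import random
from itertools import combinations


def pseudo_f(dist, groups):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i in range(n) if groups[i] == label]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, groups, permutations=999, seed=0):
    """Return the observed pseudo-F and its permutation p value."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy matrix: distances of 0.2 within a location and 0.8 between locations.
groups = ["Virginia"] * 3 + ["Florida"] * 3
dist = [[0.0 if i == j else (0.2 if groups[i] == groups[j] else 0.8)
         for j in range(6)] for i in range(6)]

f_value, p_value = permanova(dist, groups)
print(round(f_value, 2), p_value)
```

With only three samples per group the number of distinct label arrangements is small, so the p value cannot become very low even when the groups are clearly separated. This is worth keeping in mind when interpreting tests on small designs.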
4 Notes
1. If file conversion fails and QIIME returns an error message, it is most likely due to the presence of special characters in your classic OTU table that are not supported by the BIOM format. To detect and remove these characters, open your classic OTU table in TextWrangler, and select “Zap Gremlins…” in the “Text” menu.
2. The syntax of QIIME commands is standardized. For example, -i is used to specify the input file and -o to specify the output file (or directory) in all command lines.
3. Each sample identifier should only contain alphanumeric characters and periods (.).
4. Sample identifiers should match between the OTU table and the mapping file.
5. The following alpha metrics are supported in QIIME: ace, berger_parker_d, brillouin_d, chao1, chao1_ci, dominance, doubles, enspie, equitability, esty_ci, fisher_alpha, gini_index, goods_coverage, heip_e, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, michaelis_menten_fit, observed_otus, observed_species, osd, simpson_reciprocal, robbins, shannon, simpson, simpson_e, singles, strong, PD_whole_tree. Further information about each metric can be found at
6. The following nonphylogenetic beta metrics are supported in QIIME: abund_jaccard, binary_dist_chisq, binary_dist_chord, binary_dist_euclidean, binary_dist_hamming, binary_dist_jaccard, binary_dist_lennon, binary_dist_ochiai, binary_otu_gain, binary_dist_pearson, binary_dist_sorensen_dice, dist_bray_curtis, dist_canberra, dist_chisq, dist_chord, dist_euclidean, dist_gower, dist_hellinger, dist_kulczynski, dist_manhattan, dist_morisita_horn, dist_pearson, dist_soergel, dist_spearman_approx, dist_specprof. Phylogenetic metrics that require a phylogenetic tree are the following: dist_unifrac_G, dist_unifrac_G_full_tree, dist_unweighted_unifrac, dist_unweighted_unifrac_full_tree, dist_weighted_normalized_unifrac, dist_weighted_unifrac.
7. The legend is not displayed on the .pdf files but can be seen in the html file.
8. Statistical methods available in QIIME to test for differences in composition between categories of samples are as follows: PERMDISP, db-RDA. Further information about each test can be found here
Acknowledgments
We thank Sarah Bourlat for inviting this submission. This work was supported by the Sant Chair and the Smithsonian Tennenbaum Marine Observatories Network, for which this is Contribution No. 3.
References
1. Mora C, Tittensor DP, Adl S et al (2011) How many species are there on Earth and in the Ocean? PLoS Biol 9:1001127
2. Appeltans W, Ahyong ST, Anderson G et al (2012) The magnitude of global marine species diversity. Curr Biol 22:2189–202
3. Tittensor DP, Mora C, Jetz W et al (2010) Global patterns and predictors of marine biodiversity across taxa. Nature 466:1098–101
4. Fonseca VG, Carvalho GR, Sung W et al (2011) Second-generation environmental sequencing unmasks marine metazoan biodiversity. Nat Commun 1:1–8
5. Leray M, Knowlton N (2015) DNA barcoding and metabarcoding of standardized samples reveal patterns of marine benthic diversity. Proc Natl Acad Sci U S A 112:2076–2081
6. Leray M, Yang JY, Meyer CP et al (2013) A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Front Zool 10:34
7. Taberlet P, Coissac E, Pompanon F et al (2012) Towards next-generation biodiversity assessment using DNA metabarcoding. Mol Ecol 21:2045–50
8. Caporaso JG, Kuczynski J, Stombaugh J et al (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:335–6
9. Schloss PD, Westcott SL, Ryabin T et al (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541
10. Angiuoli SV, Matalka M, Gussman A et al (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12:356
11. Hildebrand F, Tadeo R, Voigt A et al (2014) LotuS: an efficient and user-friendly OTU processing pipeline. Microbiome 2:30
12. Vázquez-Baeza Y, Pirrung M, Gonzalez A, Knight R (2013) EMPeror: a tool for visualizing high-throughput microbial community data. Gigascience 2:16
13. McDonald D, Clemente JC, Kuczynski J et al (2012) The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Gigascience 1:7