Natrix2 – Improved amplicon workflow with novel Oxford Nanopore Technologies support and enhancements in clustering, classification and taxonomic databases

Pensoft Publishers
Metabarcoding and Metagenomics
Aman Deep1, Dana Bludau1, Marius Welzel2, Sandra Clemens2, Dominik Heider2, Jens Boenigk1,3,
Daniela Beisser1,3
1 Department of Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, 45141 Essen, Germany
2 Faculty of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, 35032 Marburg, Germany
3 Centre for Water and Environmental Research, University of Duisburg-Essen, Universitätsstr. 5, 45141 Essen, Germany
Corresponding author: Daniela Beisser (daniela.beisser@uni-due.de)
Copyright: © Aman Deep et al.
This is an open access article distributed under
terms of the Creative Commons Attribution
License (Attribution 4.0 International –
CC BY 4.0).
Software Description
Abstract
Sequencing of amplied DNA is the rst step towards the generation of Amplicon Se-
quence Variants (ASVs) or Operational Taxonomic Units (OTUs) for biodiversity as-
sessment and comparative analyses of environmental communities and microbiomes.
Notably, the rapid advancements in sequencing technologies have paved the way for
the growing utilization of third-generation long-read approaches in recent years. These
sequence data imply increasing read lengths, higher error rates, and altered sequenc-
ing chemistry. Likewise, methods for amplicon classication and reference databases
have progressed, leading to the expansion of taxonomic application areas and high-
er classication accuracy. With Natrix, a user-friendly and reducible workow solution,
processing of prokaryotic and eukaryotic environmental Illumina sequences using 16S
or 18S is possible. Here, we present an updated version of the pipeline, Natrix2, which
incorporates VSEARCH as an alternative clustering method with better performance
for 16S metabarcoding approaches and mothur for taxonomic classication on further
databases, including PR2, UNITE and SILVA. Additionally, Natrix2 includes the handling
of Nanopore reads, which entails initial error correction and renement of reads using
Medaka and Racon to subsequently determine their taxonomic classication.
Key words: Amplicon sequencing, Amplicon Sequence Variants, community profiling,
metabarcoding, microbiome, Operational Taxonomic Units, Snakemake workflow,
ultra-long reads
Academic editor: Thorsten Stoeck
Received: 12 July 2023
Accepted: 4 September 2023
Published: 24 October 2023
Citation: Deep A, Bludau D, Welzel M, Clemens S, Heider D, Boenigk J, Beisser D (2023) Natrix2 – Improved amplicon workflow with novel Oxford Nanopore Technologies support and enhancements in clustering, classification and taxonomic databases. Metabarcoding and Metagenomics 7: e109389. https://doi.org/10.3897/mbmg.7.109389
Metabarcoding and Metagenomics 7: 263–271 (2023)
DOI: 10.3897/mbmg.7.109389
Introduction
Analyzing nucleotide sequences of specific prokaryotic or eukaryotic DNA regions is the
fundamental mechanism for an advanced understanding of their biodiversity and biogeography.
Amplicon sequencing of marker genes extracted from environmental samples can answer
questions concerning presence, absence and even (relative) abundance of specific species or
community composition. Due to constantly increasing demands, sequencing has developed
rapidly in recent decades. The cost- and time-intensive Sanger sequencing marks the
beginning, with further development to high-throughput sequencing like Illumina
technologies and on to the latest real-time sequencing platform from Oxford Nanopore
Technologies (ONT). Regardless of sequencing technology, raw sequencing reads need to be
processed in multiple steps and clustered into taxonomically assigned sequence
representatives for further analysis. Despite numerous available tools for each step, there
are only a few all-in-one and user-friendly workflows (Schloss et al. 2009; Callahan et al.
2016; Asbun et al. 2020; Tian and Imanian 2022).
For Illumina amplicon data, Natrix is one of the few efficient workflows for read
processing, OTU or ASV clustering and assigning amplicon sequencing reads to taxonomy, with
an adjustable workflow system (Welzel et al. 2020). It is an open-source pipeline that
includes quality control, read assembly, dereplication, chimera detection, and taxonomic
assessment. It utilizes Snakemake (Köster and Rahmann 2012) and bioconda (Grüning et al.
2018) for reproducibility and scalability. The pipeline executes various steps such as
demultiplexing, adapter trimming, and quality assessment with Cutadapt (Martin 2011),
FastQC (Andrews 2010), MultiQC (Ewels et al. 2016), and PRINSEQ (Schmieder and Edwards
2011). PANDAseq (Masella et al. 2012) is used for primer defining and paired-read assembly.
DADA2 (Callahan et al. 2016) can be used to generate ASVs. CD-HIT (Fu et al. 2012) performs
dereplication in the OTU variant of the workflow. Chimeric sequences are detected using
VSEARCH (Rognes et al. 2016) and split samples are merged with AmpliconDuo (Lange et al.
2015). OTUs are generated using Swarm (v3) (Mahé et al. 2022). Finally, taxonomic
assignments are identified using BLASTn (Altschul et al. 1990) against the SILVA (Pruesse
et al. 2007) or NCBI (Federhen 2012) databases. The final output comprises a comprehensive
table with sequence information, abundances, and taxonomic data.
Figure 1. Schematic representation of the Natrix2 workflow. The processing of two split samples using AmpliconDuo is
depicted. The color scheme represents the main steps, dashed lines outline the OTU and dotted edges outline the ASV
variant of the workflow. Stars depict updates to the original Natrix workflow. Details on the ONT part are depicted in
Fig. 2. (Created with BioRender.com).
However, sequencing platforms are undergoing constant development, so adaptations to new
sequencing technologies are required. One of the latest technologies, Nanopore, is capable
of producing read lengths of more than 800,000 base pairs (Jain et al. 2018), compared to
Illumina reads with a maximum of 300 base pairs (Hu et al. 2021). However, its error rates
range from 6 to 8%, which is much higher than those of Illumina reads. Therefore, Nanopore
data require thorough processing to address these higher error rates. In addition to rapid
advancements in sequencing platforms, classification methods have also evolved greatly in
recent years. The constantly increasing number of reads produced per sequencing run and the
associated computing capacity during processing, as well as the growth of gene reference
libraries, have made this necessary (Ye et al. 2019). Whereas a few years ago the BLAST
algorithm was the preferred classification tool for taxonomic assignment, nowadays
classifiers with higher accuracy, lower computational requirements, and more specific
reference databases are favored (Schloss et al. 2009; Gerlach and Stoye 2011; Wood and
Salzberg 2014; Murali et al. 2018). The increasing number of microbial metabarcoding
approaches has led to the development of databases specifically tailored to the research
question. One of the many existing databases, already included in Natrix, is SILVA, which
is suitable for analysis of ribosomal subunit genes of prokaryotes and eukaryotes (Pruesse
et al. 2007), while the NCBI database, which is likewise included, is suitable for a broad
taxonomic classification of different species that do not necessarily belong to the same
phylum (Federhen 2012). Instead of the often used ribosomal marker genes, the UNITE
database uses the eukaryotic internal transcribed spacer (ITS) region located between two
transcribed genes (Nilsson et al. 2019). Organismic groups including protists, fungi,
metazoa or plants can be classified using databases such as PR2. It contains nearly 200,000
sequences and annotations which are manually curated (Guillou et al. 2012). In addition to
Swarm, we have also included VSEARCH clustering (Rognes et al. 2016) as an alternative to
provide the user with more options and flexibility. It can be used as a drop-in replacement
for Swarm in the existing workflow.
Natrix2 was thus extended to meet the above-mentioned demands. On the one hand, it now
includes specific pipeline options exclusively for Nanopore sequences. The automatic
identification, reorientation and trimming of Nanopore reads were integrated, as well as
Nanopore-specific error correction and clustering. On the other hand, clustering and
taxonomic classification were improved for Illumina sequences, providing further clustering
options and additional databases for other marker genes. General improvements include the
restructuring of input and output files, error checking and a detailed description and
how-to of a complete workflow, including example sequences and configuration files, on
GitHub (https://github.com/dbeisser/Natrix2).
Package upgrade description
In the new version of Natrix, Natrix2, four major improvements have been integrated
compared to the previous version (Fig. 1): i) the implementation of VSEARCH as an
alternative clustering method, ii) the addition of mothur for taxonomic classification,
iii) the extension to further databases and marker genes, and iv) the support of Nanopore
sequence processing.
VSEARCH clustering
As an alternative to the already included Swarm clustering algorithm (Mahé et al. 2022),
VSEARCH (v2.15.2) was included for OTU generation by de novo sequence-similarity clustering
of Illumina reads, using a greedy heuristic clustering algorithm with a centroid approach
(Rognes et al. 2016). The option for choosing the clustering algorithm was added to the
configuration file. VSEARCH uses an adjustable sequence similarity threshold. By default it
is set to 0.98, so that sequences with a similarity of at least 98% to a cluster centroid
are grouped into one OTU. The integration of the optional VSEARCH clustering improves
processing of prokaryotic sequences and expands the field of application for the Natrix
pipeline. In order to enhance the accuracy and reliability of Operational Taxonomic Unit
(OTU) generation from Illumina and Nanopore reads, the mumu post-clustering algorithm was
implemented (https://github.com/frederic-mahe/mumu). Through the utilization of mumu,
incorrect OTUs are effectively eliminated by considering both the sequence similarity and
co-occurrence patterns of the reads, resulting in an improved representation of
biodiversity.
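The idea of greedy, centroid-based clustering with a similarity threshold can be illustrated with a minimal Python sketch. It is not VSEARCH's implementation: the function names are hypothetical, similarity is approximated here as 1 minus the edit distance divided by the longer sequence length (an assumption made for brevity), and sequences are simply processed in order of decreasing abundance.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def identity(a: str, b: str) -> float:
    """Toy similarity measure (assumption): 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def greedy_centroid_clustering(
    abundances: Dict[str, int], threshold: float = 0.98
) -> Dict[str, List[str]]:
    """Return {centroid: [members]}, with the centroid as first member."""
    clusters: Dict[str, List[str]] = {}
    # The most abundant sequences are processed first and so become centroids.
    ordered = sorted(abundances, key=lambda seq: (-abundances[seq], seq))
    for seq in ordered:
        best_centroid, best_identity = None, threshold
        for centroid in clusters:
            score = identity(seq, centroid)
            if score >= best_identity:
                best_centroid, best_identity = centroid, score
        if best_centroid is None:
            clusters[seq] = [seq]  # no centroid is similar enough: open a new OTU
        else:
            clusters[best_centroid].append(seq)
    return clusters
```

With the default threshold of 0.98, a 100 bp sequence that differs from a centroid at one position (99% similarity) joins that centroid's OTU, whereas one that differs at three positions (97%) starts a new OTU.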
Taxonomic classication and additional databases
In addition to BLAST searches used in the previous version of Natrix, the ‘clas-
sify.seqs’ function from the open-source mothur package was added to assign
a taxonomy from a specic database dened in the conguration le (Schloss
et al. 2009). Mothur provides packages and functions that are used for molec-
ular analysis of community sequence data. Instead of creating alignments be-
tween sequenced reads and database references, mothur uses a kmer-based
approach. Kmers are used to calculate the probability of sequences belonging
to a specic taxonomy. Sequences with the highest probability will be assigned
to the appropriate taxonomy. With the incorporation of the PR2 and UNITE da-
tabases in addition to the SILVA and NCBI nr databases, new marker genes
and organismic groups can now be addressed. The PR2 (Protist Ribosomal
Reference) database focuses on 18S rRNA metabarcoding approaches not
only for protists, but also for fungi, metazoa and plants (Guillou et al. 2012).
267
Metabarcoding and Metagenomics 7: 263–271 (2023), DOI: 10.3897/mbmg.7.109389
Aman Deep et al.: Natrix2 - Improved amplicon workflow
Through the curation of experts, the PR2 database is a reliable complement to
the Natrix pipeline, making it usable for various research approaches. With the
added UNITE database additional taxonomic analysis with focus on the eu-
karyotic nuclear ribosomal ITS region is now possible (Nilsson et al. 2019). The
addition of UNITE offers more than one million fungal reference sequences,
making Natrix an optimal tool for fungal metabarcoding. Taxonomic classica-
tion by mothur is made available for both Illumina and Nanopore reads.
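The k-mer-based assignment described above can be sketched as a small naive Bayes classifier. This is a simplified illustration, not mothur's code: the function names, the flat smoothing term of 0.5 and the tiny k-mer size are assumptions chosen to keep the example short, and k-mers absent from all references are simply ignored.

```python
import math
from typing import Dict, List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Set of all overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def train(references: Dict[str, List[str]], k: int) -> Dict[str, Dict[str, float]]:
    """Estimate P(k-mer present | taxon) for every k-mer seen in the references."""
    vocabulary: Set[str] = set()
    for sequences in references.values():
        for sequence in sequences:
            vocabulary |= kmers(sequence, k)
    model: Dict[str, Dict[str, float]] = {}
    for taxon, sequences in references.items():
        counts = dict.fromkeys(vocabulary, 0)
        for sequence in sequences:
            for word in kmers(sequence, k):
                counts[word] += 1
        # Smoothing (assumed constant 0.5) keeps every probability above zero.
        model[taxon] = {
            word: (n + 0.5) / (len(sequences) + 1) for word, n in counts.items()
        }
    return model


def classify(query: str, model: Dict[str, Dict[str, float]], k: int) -> str:
    """Return the taxon with the highest summed log-probability of the query k-mers."""
    scores: Dict[str, float] = {}
    for taxon, probabilities in model.items():
        scores[taxon] = sum(
            math.log(probabilities[word])
            for word in kmers(query, k)
            if word in probabilities  # k-mers unknown to all references are skipped
        )
    return max(scores, key=scores.get)


# Hypothetical toy references; real databases such as PR2 or UNITE are far larger.
references = {"taxon_A": ["AAAAAAAACCCC"], "taxon_B": ["GGGGGGGGTTTT"]}
model = train(references, k=4)
print(classify("AAAACCCC", model, k=4))  # prints taxon_A
```

No alignment is computed at any point; the query is scored only by which k-mers it shares with the reference sequences of each taxon.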
Nanopore support
As the rst version of Natrix was designed for Illumina sequencing reads only,
support for processing of Nanopore long-reads was added (Fig. 2). Nanopore
support can be activated within the conguration le and Nanopore reads in
FASTQ format are used as the initial starting le. Sequencing adapters and
primer sequences are identied by Pychopper (v2), a tool provided by ONT, us-
ing a combination of global and local alignments (https://github.com/epi2me-
labs/pychopper). Reads are afterwards trimmed and oriented into forward
direction. Pychopper is automatically installed using conda, and therefore ver-
sion controlled. Next to its trimming and orienting options, Pychopper writes
fused reads in an additional output le, from which reads are trimmed and ori-
entated subsequently with a specic read rescue option. Afterwards, Nanopore
reads are clustered and error corrected using CD-HIT (v4.8.1) (Li and Godzik
2006) for clustering and Medaka (v1.7.2) (https://github.com/nanoporetech/
Medaka) and Racon (v1.4.13) (Vaser et al. 2017) for error correction. First, fas-
ta transformed reads are clustered based on a similarity threshold algorithm
and representatives are mapped against the initial fasta les with Minimap2
(v2.26) (Li 2018). Second, the initial fasta les, clustering and mapping data are
used for the generation of consensus sequences of higher quality. Here, Racon
is using a distance- and quality-based alignment algorithm, whereas Medaka
is based on a neural network algorithm for creation of error corrected consen-
sus sequences. Last, consensus sequences are again aligned by Minimap2
against the initial fasta les for identication of corresponding read numbers
per consensus. Afterwards, the VSEARCH uchime3_denovo algorithm is still
used for chimera removal of Nanopore sequences (Rognes et al. 2016) before
the Nanopore reads are ltered and used further for taxonomic classication
via BLAST or mothur (Altschul et al. 1990; Schloss et al. 2009).
Figure 2. Schematic diagram of processing Nanopore reads with Natrix2 for OTU generation and taxonomic assignment.
The color scheme represents the main steps of this variant of the workflow. (Created with BioRender.com).
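Why building a consensus from clustered reads reduces sequencing errors can be shown with a deliberately simple Python sketch. It takes reads that are assumed to be already aligned to equal length and applies a column-wise majority vote. This is neither Racon's nor Medaka's algorithm, and the function name and gap symbol are hypothetical; it only illustrates the principle that random errors in individual reads are outvoted by the other reads of the same cluster.

```python
from collections import Counter
from typing import List


def majority_consensus(aligned_reads: List[str], gap: str = "-") -> str:
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:  # a column whose most common symbol is a gap is dropped
            consensus.append(base)
    return "".join(consensus)


# Five toy reads of one cluster, three of them carrying a single error each.
reads = ["ACGTACGT", "ACGAACGT", "ACGTAC-T", "TCGTACGT", "ACGTACGT"]
print(majority_consensus(reads))  # prints ACGTACGT
```

Each erroneous position is supported by only one read, so the consensus recovers the underlying sequence even though three of the five reads contain an error.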
Conclusion
With the upgraded version of Natrix, processing of Nanopore short and long sequencing
reads, including orientation, trimming, clustering and error correction, is possible. In
addition, Illumina and Nanopore reads can now be taxonomically assigned via mothur and the
accuracy of OTU clustering is enhanced via mumu post-clustering. Optionally, VSEARCH can
now be used for clustering Illumina reads. The implementation of PR2 and UNITE as new
databases makes Natrix2 a reliable tool for diverse metabarcoding approaches and now offers
processing of sequences originating from other organismic groups like fungi, metazoa and
plants or further marker genes like ITS.
Project description
Title: Natrix2 – Improved amplicon workflow with novel Oxford Nanopore Technologies
support and enhancements in clustering, classification and taxonomic databases.
Study area description: Amplicon sequence analysis.
Download page: https://github.com/dbeisser/Natrix2.
Programming language: Snakemake, Python, R, Bash.
Licence: MIT Licence.
Acknowledgements
We acknowledge support by the Open Access Publication Fund of the University
of Duisburg-Essen.
Additional information
Conict of interest
The authors have declared that no competing interests exist.
Ethical statement
No ethical statement was reported.
Funding
This study was performed as part of the Collaborative Research Center (CRC) RESIST
and analyses were performed by Project A04 (AD and DBe), funded by the German
Research Foundation (DFG) – CRC 1439/1; project number 426547801.
Author contributions
Conceptualization: MW, DH, JB, DBe. Formal analysis: SC, DBl, AD. Methodology: AD, DBl, SC,
DBe. Supervision: JB, DBe. Validation: AD. Visualization: AD, DBl. Writing – original
draft: AD, DBl, DBe. Writing – review and editing: DBl, DH, JB, AD, SC, MW, DBe.
Author ORCIDs
Aman Deep https://orcid.org/0000-0001-7321-864X
Dana Bludau https://orcid.org/0009-0003-3982-3178
Marius Welzel https://orcid.org/0000-0002-4946-2156
Sandra Clemens https://orcid.org/0000-0002-9710-1152
Dominik Heider https://orcid.org/0000-0002-3108-8311
Jens Boenigk https://orcid.org/0000-0001-8858-8889
Daniela Beisser https://orcid.org/0000-0002-0679-6631
Data availability
All of the data that support the findings of this study are available in the main text.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215(3): 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
Andrews S (2010) FastQC: a quality control tool for high throughput sequence data.
Asbun AA, Besseling MA, Balzano S, van Bleijswijk JDL, Witte HJ, Villanueva L, Engelmann JC (2020) Cascabel: A scalable and versatile amplicon sequence data analysis pipeline delivering reproducible and documented results. Frontiers in Genetics 11: e489357. https://doi.org/10.3389/fgene.2020.489357
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016) DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13(7): 581–583. https://doi.org/10.1038/nmeth.3869
Ewels P, Magnusson M, Lundin S, Käller M (2016) MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32(19): 3047–3048. https://doi.org/10.1093/bioinformatics/btw354
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Research 40(D1): D136–D143. https://doi.org/10.1093/nar/gkr1178
Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23): 3150–3152. https://doi.org/10.1093/bioinformatics/bts565
Gerlach W, Stoye J (2011) Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Research 39(14): e91. https://doi.org/10.1093/nar/gkr225
Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J (2018) Bioconda: Sustainable and comprehensive software distribution for the life sciences. Nature Methods 15(7): 475–476. https://doi.org/10.1038/s41592-018-0046-7
Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud G, de Vargas C, Decelle J, Del Campo J (2012) The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy. Nucleic Acids Research 41(D1): D597–D604. https://doi.org/10.1093/nar/gks1160
Hu T, Chitnis N, Monos D, Dinh A (2021) Next-generation sequencing technologies: An overview. Human Immunology 82(11): 801–811. https://doi.org/10.1016/j.humimm.2021.02.012
Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, Malla S, Marriott H, Nieto T, O’Grady J, Olsen HE, Pedersen BS, Rhie A, Richardson H, Quinlan AR, Snutch TP, Tee L, Paten B, Phillippy AM, Simpson JT, Loman NJ, Loose M (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology 36(4): 338–345. https://doi.org/10.1038/nbt.4060
Köster J, Rahmann S (2012) Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19): 2520–2522. https://doi.org/10.1093/bioinformatics/bts480
Lange A, Jost S, Heider D, Bock C, Budeus B, Schilling E, Strittmatter A, Boenigk J, Hoffmann D (2015) AmpliconDuo: A split-sample filtering protocol for high-throughput amplicon sequencing of microbial communities. PLoS ONE 10(11): e0141590. https://doi.org/10.1371/journal.pone.0141590
Li H (2018) Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34(18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191
Li W, Godzik A (2006) Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13): 1658–1659. https://doi.org/10.1093/bioinformatics/btl158
Mahé F, Czech L, Stamatakis A, Quince C, de Vargas C, Dunthorn M, Rognes T (2022) Swarm v3: Towards tera-scale amplicon clustering. Bioinformatics 38(1): 267–269. https://doi.org/10.1093/bioinformatics/btab493
Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal 17(1): 1–10. https://doi.org/10.14806/ej.17.1.200
Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD (2012) PANDAseq: Paired-end assembler for illumina sequences. BMC Bioinformatics 13(1): 1–31. https://doi.org/10.1186/1471-2105-13-31
Murali A, Bhargava A, Wright ES (2018) IDTAXA: A novel approach for accurate taxonomic classification of microbiome sequences. Microbiome 6(140): e140. https://doi.org/10.1186/s40168-018-0521-5
Nilsson RH, Larsson KH, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, Kennedy P, Picard K, Glöckner FO, Tedersoo L, Saar I, Kõljalg U, Abarenkov K (2019) The UNITE database for molecular identification of fungi: Handling dark taxa and parallel taxonomic classifications. Nucleic Acids Research 47(D1): D259–D264. https://doi.org/10.1093/nar/gky1022
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO (2007) SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research 35(21): 7188–7196. https://doi.org/10.1093/nar/gkm864
Rognes T, Flouri T, Nichols B, Quince C, Mahé F (2016) VSEARCH: A versatile open source tool for metagenomics. PeerJ 10: 1–22. https://doi.org/10.7717/peerj.2584
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, van Horn DJ, Weber CF (2009) Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 75(23): 7537–7541. https://doi.org/10.1128/AEM.01541-09
Schmieder R, Edwards R (2011) Quality control and preprocessing of metagenomic datasets. Bioinformatics 27(6): 863–864. https://doi.org/10.1093/bioinformatics/btr026
Tian R, Imanian B (2022) ASAP 2: A pipeline and web server to analyze marker gene amplicon sequencing data automatically and consistently. BMC Bioinformatics 23(27): 27. https://doi.org/10.1186/s12859-021-04555-0
Vaser R, Sovic I, Nagarajan N, Sikic M (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27(5): 737–746. https://doi.org/10.1101/gr.214270.116
Welzel M, Lange A, Heider D, Schwarz M, Freisleben B, Jensen M, Boenigk J, Beisser D (2020) Natrix: A Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads. BMC Bioinformatics 21(1): e526. https://doi.org/10.1186/s12859-020-03852-4
Wood DE, Salzberg SL (2014) Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15(46): R46. https://doi.org/10.1186/gb-2014-15-3-r46
Ye SH, Siddle KJ, Park DJ, Sabeti PC (2019) Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 178(4): 779–794. https://doi.org/10.1016/j.cell.2019.07.010
... Meanwhile, the lower read quality from ONT demands specialised pipelines to account for its higher error rates. Such pipelines include, alignment using Minimap2 [18,45,46,52] or EMU [39,40,53], or clustering through NanoCLUST [54] or Natrix2 [55]. The application of Unique Molecular Identifiers (UMIs) has shown promise in achieving more accurate ASVs or 97 % OTUs than NanoCLUST-based methods [11,15]. ...
Article
Full-text available
Sequence comparison of 16S rRNA PCR amplicons is an established approach to taxonomically identify bacterial isolates and profile complex microbial communities. One potential application of recent advances in long-read sequencing technologies is to sequence entire rRNA operons and capture significantly more phylogenetic information compared to sequencing of the 16S rRNA (or regions thereof) alone, with the potential to increase the proportion of amplicons that can be reliably classified to lower taxonomic ranks. Here we describe GROND ( G enome-derived R ibosomal O pero n D atabase), a publicly available database of quality-checked 16S-ITS-23S rRNA operons, accompanied by multiple taxonomic classifications. GROND will aid researchers in analysis of their data and act as a standardised database to allow comparison of results between studies.
... The Natrix2 workflow (Welzel et al., 2020;Deep et al., 2023) was used to perform bioinformatic analyses of microphytobenthos Illumina amplicon sequencing data. The workflow's operational taxonomic units (OTUs) variant was implemented using the clustering algorithm Swarm v3.0.0 (Mahé et al., 2015). ...
Article
Full-text available
OPEN ACCESS PAPER - Field observations form the basis of the majority of studies on microphytobenthic algal communities in freshwater ecosystems. Controlled mesocosm experiments data are comparatively uncommon. The few experimental mesocosm studies that have been conducted provide valuable insights into how multiple stressors affect the community structures and photosynthesis-related traits of benthic microalgae. The recovery process after the stressors have subsided, however, has received less attention in mesocosm studies. To close this gap, here we present the results of a riparian mesocosm experiment designed to investigate the effects of reduced flow velocity, increased salinity and increased temperature on microphytobenthic communities. We used a full factorial design with a semi-randomised distribution of treatments consisting of two levels of each stressor (2 × 2 × 2 treatments), with eight replicates making a total of 64 circular mesocosms, allowing a nuanced examination of their individual and combined influences. We aimed to elucidate the responses of microalgae communities seeded from stream water to the applied environmental stressors. Our results showed significant effects of reduced flow velocity and increased temperature on microphytobenthic communities. Recovery after stressor treatment led to a convergence in community composition, with priority effects (hypothesized to reflect competition for substrate between resident and newly arriving immigrant taxa) slowing down community shifts and biomass increase. Our study contributes to the growing body of literature on the ecological dynamics of microphytobenthos and emphasises the importance of rigorous experiments to validate hypotheses. These results encourage further investigation into the nuanced interactions between microphytobenthos and their environment and shed light on the complexity of ecological responses in benthic systems.
Article
Full-text available
Introduction Microalgae form an essential group of benthic organisms that respond swiftly to environmental changes. They are widely used as bioindicators of anthropogenic stressors in freshwater ecosystems. We aimed to assess the responses of microalgae communities to multiple environmental stressors in the Kinzig River catchment, home to a long-term ecological monitoring site, in Germany. Methods We used a photosynthetic biomass proxy alongside community composition of diatoms assessed by digital light microscopy, and of microalgae by 18S-V9 amplicon sequencing, to characterise microalgae at 19 sampling sites scattered across the catchment. Results Our results revealed significant effects of physical and chemical factors on microalgae biomass and community compositions. We found that conductivity, water temperature and pH were the most important factors affecting microalgae community composition, as observed in both microscopy and amplicon analysis. In addition to these three variables, the effect of total phosphate on all microalgae, together with water discharge on the diatom (Bacillariophyta) communities, as assessed by amplicon analysis, may reveal taxon-specific variations in the ecological responses of different microalgal groups. Discussion Our results highlighted the complex relationship between various environmental variables and microalgae biomass and community composition. Further investigations, involving the collection of time series data, are required to fully understand the underlying biotic and abiotic parameters that influence these microalgae communities.
Preprint
Full-text available
Microalgae form an essential group of benthic organisms that respond swiftly to environmental changes. They are widely used as bioindicators of anthropogenic stressors in freshwater ecosystems. We aimed to assess the responses of microalgae communities to multiple environmental stressors in the Kinzig River catchment, home to a long-term ecological monitoring site, in Germany. We used a photosynthetic biomass proxy alongside community composition of diatoms assessed by digital light microscopy, and of microalgae by 18S-V9 amplicon sequencing, to characterise microalgae at 19 sampling sites scattered across the catchment. Our results revealed significant effects of physical and chemical factors on microalgae biomass and community compositions. We found that conductivity, water temperature and pH were the most important factors affecting microalgae community composition, as observed in both microscopy and amplicon analysis. In addition to these three variables, the effect of total phosphate on all microalgae, together with water discharge on the diatom (Bacillariophyta) communities, as assessed by amplicon analysis, may reveal taxon-specific variations in the ecological responses of different microalgal groups. Our results highlighted the complex relationship between various environmental variables and microalgae biomass and community composition. Further investigations, involving the collection of time series data, are required to fully understand the underlying biotic and abiotic parameters that influence these microalgae communities.
Article
Background: Amplicon sequencing of marker genes such as 16S rDNA has been widely used to survey and characterize microbial communities. However, the complex data analyses have required many manual steps, often leading to inconsistencies in results. Results: Here, we have developed a pipeline, amplicon sequence analysis pipeline 2 (ASAP 2), to automate these processes without the usual manual inspection and user intervention, for instance in the detection of barcode orientation, the selection of high-quality regions of reads, and the determination of resampling depth, among others. The pipeline integrates all the analytical processes, such as importing data, demultiplexing, summarizing read profiles, trimming quality, denoising, removing chimeric sequences and making the feature table, among others. The pipeline accepts multiple file formats as input, including multiplexed or demultiplexed, paired-end or single-end, barcode inside or outside, and raw or intermediate data (e.g. feature table). The outputs include taxonomic classification, alpha/beta diversity, community composition, ordination analysis and statistical tests. ASAP 2 supports merging multiple sequencing runs, which helps integrate and compare data from different sources (public databases and collaborators). Conclusions: Our pipeline minimizes hands-on intervention and runs amplicon sequence variant (ASV)-based amplicon sequencing analysis automatically and consistently. Our web server assists researchers who have no access to a high-performance computer (HPC) or have limited bioinformatics skills. The pipeline and web server can be accessed at https://github.com/tianrenmaogithub/asap2 and https://hts.iit.edu/asap2, respectively.
Article
Motivation: Previously we presented swarm, an open-source amplicon clustering program that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here we present swarm v3 to address issues of contemporary datasets that are growing towards terabyte sizes. Results: When compared to previous swarm versions, swarm v3 has modernized C++ source code, reduced memory footprint by up to 50%, optimized CPU usage and multithreading (more than 7 times faster with default parameters), and it has been extensively tested for its robustness and logic. Availability: Source code and binaries are available at https://github.com/torognes/swarm. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Marker gene sequencing of the rRNA operon (16S, 18S, ITS) or cytochrome c oxidase I (CO1) is a popular means to assess microbial communities of the environment, microbiomes associated with plants and animals, as well as communities of multicellular organisms via environmental DNA sequencing. Since this technique is based on sequencing a single gene, or even only parts of a single gene rather than the entire genome, the number of reads needed per sample to assess the microbial community structure is lower than that required for metagenome sequencing. This makes marker gene sequencing affordable to nearly any laboratory. Despite the relative ease and cost-efficiency of data generation, analyzing the resulting sequence data requires computational skills that may go beyond the standard repertoire of a current molecular biologist/ecologist. We have developed Cascabel, a scalable, flexible, and easy-to-use amplicon sequence data analysis pipeline, which uses Snakemake and a combination of existing and newly developed solutions for its computational steps. Cascabel takes the raw data as input and delivers a table of operational taxonomic units (OTUs) or Amplicon Sequence Variants (ASVs) in BIOM and text format, together with representative sequences. Cascabel is highly versatile software that allows users to customize several steps of the pipeline, such as selecting from a set of OTU clustering methods or performing ASV analysis. In addition, we designed Cascabel to run in any Linux/Unix computing environment, from desktop computers to computing servers, making use of parallel processing if possible. The analyses and results are fully reproducible and documented in an HTML and optional PDF report. Cascabel is freely available on GitHub: https://github.com/AlejandroAb/CASCABEL.
Article
Sequencing of marker genes amplified from environmental samples, known as amplicon sequencing, allows us to resolve some of the hidden diversity and elucidate evolutionary relationships and ecological processes among complex microbial communities. The analysis of large numbers of samples at high sequencing depths generated by high throughput sequencing technologies requires efficient, flexible, and reproducible bioinformatics pipelines. Only a few existing workflows can be run in a user-friendly, scalable, and reproducible manner on different computing devices using an efficient workflow management system. Results: We present Natrix, an open-source bioinformatics workflow for preprocessing raw amplicon sequencing data. The workflow contains all analysis steps from quality assessment, read assembly, dereplication, chimera detection, split-sample merging, sequence representative assignment (OTUs or ASVs) to the taxonomic assignment of sequence representatives. The workflow is written using Snakemake, a workflow management engine for developing data analysis workflows. In addition, Conda is used for version control. Thus, Snakemake ensures reproducibility and Conda offers version control of the utilized programs. The encapsulation of rules and their dependencies supports hassle-free sharing of rules between workflows and easy adaptation and extension of existing workflows. Natrix is freely available on GitHub (https://github.com/MW55/Natrix) or as a Docker container on DockerHub (https://hub.docker.com/r/mw55/natrix). Conclusion: Natrix is a user-friendly and highly extensible workflow for processing Illumina amplicon data. Keywords: Amplicon sequencing, Operational Taxonomic Units, Amplicon Sequence Variants, Snakemake, Pipeline, Illumina
Article
UNITE (https://unite.ut.ee/) is a web-based database and sequence management environment for the molecular identification of fungi. It targets the formal fungal barcode, the nuclear ribosomal internal transcribed spacer (ITS) region, and offers all ∼1 000 000 public fungal ITS sequences for reference. These are clustered into ∼459 000 species hypotheses and assigned digital object identifiers (DOIs) to promote unambiguous reference across studies. In-house and web-based third-party sequence curation and annotation have resulted in more than 275 000 improvements to the data over the past 15 years. UNITE serves as a data provider for a range of metabarcoding software pipelines and regularly exchanges data with all major fungal sequence databases and other community resources. Recent improvements include redesigned handling of unclassifiable species hypotheses, integration with the taxonomic backbone of the Global Biodiversity Information Facility, and support for an unlimited number of parallel taxonomic classification systems.
Article
Background: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of "over classification" is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive. Results: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats. Conclusions: IDTAXA's classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. 
IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online ( http://DECIPHER.codes ).
Article
Since the days of Sanger sequencing, next-generation sequencing technologies have significantly evolved to provide increased data output, efficiencies, and applications. These next generations of technologies can be categorized based on read length. This review provides an overview of these technologies as two paradigms: short-read, or “second-generation,” technologies, and long-read, or “third-generation,” technologies. Herein, short-read sequencing approaches are represented by the most prevalent technologies, Illumina and Ion Torrent, and long-read sequencing approaches are represented by Pacific Biosciences and Oxford Nanopore technologies. All technologies are reviewed along with reported advantages and disadvantages. Until recently, short-read sequencing was thought to provide high accuracy limited by read-length, while long-read technologies afforded much longer read-lengths at the expense of accuracy. Emerging developments for third-generation technologies hold promise for the next wave of sequencing evolution, with the co-existence of longer read lengths and high accuracy.
Article
Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.
Article
Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kilobases (kb) on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 megabases (Mb) in length. Existing alignment programs are unable to process such data at scale, or do so inefficiently, which calls for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions (INDELs) and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Contact: hengli@broadinstitute.org.