Vol. 30 no. 2 2014, pages 287–288
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btt657
Sequence analysis Advance Access publication November 9, 2013
HPC-CLUST: distributed hierarchical clustering for large sets of
nucleotide sequences
João F. Matias Rodrigues and Christian von Mering*
Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
Associate Editor: Inanc Birol
ABSTRACT
Motivation: Nucleotide sequence data are being produced at an
ever increasing rate. Clustering such sequences by similarity is often
an essential first step in their analysis—intended to reduce redun-
dancy, define gene families or suggest taxonomic units. Exact clus-
tering algorithms, such as hierarchical clustering, scale relatively
poorly in terms of run time and memory usage, yet they are desirable
because heuristic shortcuts taken during clustering might have unin-
tended consequences in later analysis steps.
Results: Here we present HPC-CLUST, a highly optimized software
pipeline that can cluster large numbers of pre-aligned DNA sequences
by running on distributed computing hardware. It allocates both
memory and computing resources efficiently, and can process more
than a million sequences in a few hours on a small cluster.
Availability and implementation: Source code and binaries are
freely available at http://meringlab.org/software/hpc-clust/; the
pipeline is implemented in C++ and uses the Message Passing
Interface (MPI) standard for distributed computing.
Contact: mering@imls.uzh.ch
Supplementary Information: Supplementary data are available at
Bioinformatics online.
Received on September 6, 2013; revised on October 19, 2013;
accepted on November 7, 2013
1 INTRODUCTION
The time complexity of hierarchical clustering algorithms (HCA)
is quadratic O(N²) or even worse O(N² log N), depending on the
selected cluster linkage method (Day and Edelsbrunner, 1984).
However, HCAs have a number of advantages that make them
attractive for applications in biology: (i) they are well defined and
should be reproducible across implementations, (ii) they require
nothing but a pairwise distance matrix as input and (iii) they are
agglomerative, meaning that sets of clusters at arbitrary similar-
ity thresholds can be extracted quickly by post-processing, once a
complete clustering run has been executed. Consequently, HCAs
have been widely adopted in biology, in areas ranging from data
mining to sequence analysis to evolutionary biology.
Apart from generic implementations, a number of hierarchical
clustering implementations exist that focus on biological se-
quence data, taking advantage of the fact that distances between
sequences can be computed relatively cheaply, even in a transient
fashion. However, existing implementations such as
MOTHUR (Schloss et al., 2009), ESPRIT (Sun et al., 2009) and
RDP online clustering (Cole et al., 2009) all struggle with large
sets of sequences. In light of these performance limits, heuristic
optimizations have also been implemented such as CD-HIT (Li
and Godzik, 2006) and UCLUST (Edgar, 2010).
Hierarchical clustering starts by considering every sequence
separately and merging the two closest ones into a cluster.
Then, iteratively, larger clusters are formed, by joining the closest
sequences and/or clusters. The distance between two clusters
with several sequences will depend on the clustering linkage
chosen. In single linkage, it is the similarity between the two
most similar sequences; in complete linkage, between the two
most dissimilar sequences; and in average linkage, the average
of all pairwise similarities. The latter method is also known as the
Unweighted Pair Group Method with Arithmetic Mean
(UPGMA) and is often used in the construction of phylogenetic
guide trees.
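The three linkage definitions can be made concrete with a small sketch (in Python rather than the article's C++; the example sequences and the simple percent-identity measure are illustrative assumptions, not the pipeline's actual distance function):

```python
from itertools import product

def pairwise_identity(a, b):
    """Fraction of identical positions between two equal-length aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_distance(c1, c2, linkage):
    """Similarity between two clusters of aligned sequences under a given linkage."""
    sims = [pairwise_identity(a, b) for a, b in product(c1, c2)]
    if linkage == "single":       # similarity of the two MOST similar sequences
        return max(sims)
    if linkage == "complete":     # similarity of the two MOST dissimilar sequences
        return min(sims)
    return sum(sims) / len(sims)  # average linkage (UPGMA)

c1 = ["ACGTACGT", "ACGTACGA"]
c2 = ["ACGAACGA", "TCGAACGA"]
print(cluster_distance(c1, c2, "single"))    # 0.875
print(cluster_distance(c1, c2, "complete"))  # 0.625
print(cluster_distance(c1, c2, "average"))   # 0.75
```

Note that single linkage always yields a similarity at least as high as average linkage, which in turn is at least as high as complete linkage, which is why the three linkages behave differently near the clustering threshold.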
In the type of approach used by CD-HIT and UCLUST, each
input sequence is considered sequentially, and is either added to
an existing cluster (if it is found to meet the clustering threshold)
or is used as a seed to start a new cluster. Although this approach
is extremely efficient, it can lead to some undesired characteristics
(Sun et al., 2012): (i) it can create clusters containing sequences
that are more dissimilar than the chosen clustering threshold;
(ii) a new cluster may be seeded close to an existing cluster, at a
distance just slightly beyond the clustering threshold; from then
on, new sequences close to both clusters are split between the
two, whereas earlier sequences were added only to the first
cluster, which effectively lowers the clustering threshold locally;
and (iii) different sequence input orders yield different sets of
clusters, owing to different choices of seed sequences. Point (i)
also affects HCAs using single linkage and, to a lesser extent,
average linkage, but does not occur with complete linkage.
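A toy sketch (hypothetical sequences and deliberately simplified logic, not CD-HIT's or UCLUST's actual algorithm) illustrates points (i) and (iii): reordering the input changes the resulting clusters, and a cluster can come to contain sequences more dissimilar than the threshold:

```python
def identity(a, b):
    """Fraction of identical positions between equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold, sim):
    """Simplified greedy seed clustering: each sequence joins the first seed
    it matches at or above the threshold, otherwise it founds a new cluster."""
    seeds, clusters = [], []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if sim(s, seed) >= threshold:
                clusters[i].append(s)
                break
        else:                      # no seed close enough: start a new cluster
            seeds.append(s)
            clusters.append([s])
    return clusters

a, b, c = "AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT"
# identity(a, b) = 0.9, identity(b, c) = 0.9, but identity(a, c) = 0.8
print(greedy_cluster([a, b, c], 0.9, identity))  # two clusters: {a, b} and {c}
print(greedy_cluster([b, a, c], 0.9, identity))  # one cluster: {b, a, c}
```

With b as the seed, a and c both join it even though their mutual identity (0.8) is below the 0.9 threshold; with a as the seed, c founds a separate cluster.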
Here we present a distributed implementation of an HCA that
can handle large numbers of sequences. It can compute single-,
complete- and average-linkage clusters in a single run and pro-
duces a merge-log from which clusters can subsequently be
parsed at any threshold. In contrast to CD-HIT, UCLUST
and ESPRIT, which all take unaligned sequence data as
their input, HPC-CLUST (like MOTHUR) takes as input a
set of pre-aligned sequences. This allows for flexibility in the
choice of alignment algorithm; a future version of HPC-
CLUST may include the alignment step as well. For further de-
tails on implementation and algorithms, see the Supplementary
Material.
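The merge-log idea can be sketched as follows (the log format here is an assumption for illustration, not HPC-CLUST's actual output format): if each merge is recorded as a (similarity, cluster, cluster) triple in order of decreasing similarity, clusters at any threshold can be recovered by replaying merges with a union-find structure:

```python
class DisjointSet:
    """Minimal union-find with path compression."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def clusters_at(merge_log, n, threshold):
    """Replay merges at or above the similarity threshold; return clusters."""
    ds = DisjointSet(n)
    for sim, a, b in merge_log:   # log is sorted by decreasing similarity
        if sim < threshold:
            break
        ds.union(a, b)
    groups = {}
    for i in range(n):
        groups.setdefault(ds.find(i), []).append(i)
    return sorted(groups.values())

log = [(0.99, 0, 1), (0.95, 2, 3), (0.90, 0, 2)]
print(clusters_at(log, 5, 0.97))  # [[0, 1], [2], [3], [4]]
print(clusters_at(log, 5, 0.90))  # [[0, 1, 2, 3], [4]]
```

This replay is nearly linear in the number of merges, which is why extracting clusters at additional thresholds after a completed run is cheap compared with the clustering itself.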
*To whom correspondence should be addressed.
© The Author 2013. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
2 METHODS
For all benchmarks, we used one or more dedicated Dell Blade M605
compute nodes with 2 quad-core Opteron 2.33 GHz processors and 24
GB of random access memory. The most recent version of each software
pipeline was used: HPC-CLUST (v1.0.0), MOTHUR (v.1.29.2), ESPRIT
(Feb. 2011), CD-HIT (v4.6.1) and UCLUST (v6.0.307). Detailed infor-
mation on settings and parameters is available in the Supplementary
Material.
We compiled a dataset of publicly available full-length 16S bacterial
ribosomal RNA sequences from NCBI Genbank. Sequences were aligned
using INFERNAL v1.0.2 with a 16S model for bacteria from the ssu-
align package (Nawrocki et al., 2009). Importantly, INFERNAL uses a
profile alignment strategy that scales linearly O(N) with the number of
sequences, and can be trivially parallelized. Indels were removed and
sequences were trimmed between two well-conserved alignment columns,
such that all sequences had the same aligned length. The final dataset
consisted of 1 105 195 bacterial sequences (833 013 unique) with
an aligned length of 1301 columns.
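As a minimal sketch of this kind of preprocessing (illustrative only; the actual pipeline used INFERNAL alignments, indel removal and trimming between two conserved columns), gap-containing alignment columns can be dropped so that all sequences end up with the same aligned length:

```python
def remove_gap_columns(aligned):
    """Drop alignment columns containing a gap in any sequence, so that all
    sequences share the same gap-free coordinate system and length."""
    keep = [i for i in range(len(aligned[0]))
            if all(seq[i] != '-' for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]

print(remove_gap_columns(["AC-GT", "ACCGT", "AC-GA"]))
# ['ACGT', 'ACGT', 'ACGA']
```

Equal aligned lengths matter downstream because they let pairwise distances be computed by a single column-by-column pass without re-aligning each pair.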
3 RESULTS
3.1 Clustering performance on a single computer
HPC-CLUST has been highly optimized for computation speed
and memory efficiency. It is by far the fastest of the exact clus-
tering implementations tested here, even when running on a
single computer (Fig. 1). Compared with MOTHUR, it produces
identical or nearly identical clustering results (see Supplementary
Material). Because CD-HIT and UCLUST use a different ap-
proach to clustering, they are not directly comparable and are
included for reference only.
In HPC-CLUST, the largest fraction of computation time is
spent calculating the pairwise sequence distances, the second lar-
gest in sorting the distances and the final clustering step is the
fastest. HPC-CLUST can make use of multithreaded execution
on multiple nodes and practically achieves optimal paralleliza-
tion in the distance calculation step. Additional benchmarks are
shown and discussed in the Supplementary Material.
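The distance-calculation step parallelizes almost perfectly because every sequence pair is independent. A single-machine sketch using a Python process pool conveys the idea (HPC-CLUST itself distributes this work across nodes with C++ and MPI; the sequences here are illustrative):

```python
from itertools import combinations
from multiprocessing import Pool

# Illustrative pre-aligned sequences of equal length
SEQS = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACGA"]

def _pair_identity(pair):
    """Identity for one sequence pair; independent of all other pairs."""
    i, j = pair
    a, b = SEQS[i], SEQS[j]
    return i, j, sum(x == y for x, y in zip(a, b)) / len(a)

def all_distances(n_workers=2):
    """Split the list of all pairs across worker processes."""
    pairs = list(combinations(range(len(SEQS)), 2))
    with Pool(n_workers) as pool:
        return pool.map(_pair_identity, pairs)

if __name__ == "__main__":
    print(len(all_distances()))  # 6 pairs for 4 sequences
```

Because the N(N-1)/2 pairs can be partitioned arbitrarily among workers, wall-clock time for this step shrinks roughly in proportion to the number of cores, matching the near-optimal parallelization reported above.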
3.2 Distributed clustering performance
Clustering the full dataset (833 013 unique sequences) to 97%
identity threshold required a total of 2 h and 42 min on a com-
pute cluster of 24 nodes with 8 cores each (192 total cores).
Owing to parallelization, the distance and sorting computation
took only 57 min (wall clock time), corresponding to roughly
10 000 min of CPU time. The remaining 1 h and 45 min (wall clock time)
were spent collecting and clustering the distances. The combined
total memory used for the distance matrix was 59.8 GB, or 2.6 GB per
node. The node on which the merging step was performed used a
maximum of 4.9 GB of memory when doing single-, complete-
and average-linkage clusterings in the same run.
4 CONCLUSION
Clustering is often among the first steps when dealing with raw
sequence data, and therefore needs to be as fast and as memory
efficient as possible. The implementation of a distributed version
of hierarchical clustering in HPC-CLUST makes it now possible
to fully cluster a much larger number of sequences, essentially
limited only by the number of available computing nodes.
ACKNOWLEDGEMENT
The authors thank Thomas S. B. Schmidt for his feedback and
help in testing HPC-CLUST.
Funding: ERC grant (Starting Grant ‘UMICIS/242870’ to
C.vM.).
Conflict of Interest: none declared.
REFERENCES
Cole,J.R. et al. (2009) The Ribosomal Database Project: improved alignments and
new tools for rRNA analysis. Nucleic Acids Res., 37, D141–D145.
Day,W. and Edelsbrunner,H. (1984) Efficient algorithms for agglomerative hier-
archical clustering methods. J. Classif., 1, 7–24.
Edgar,R.C. (2010) Search and clustering orders of magnitude faster than BLAST.
Bioinformatics, 26, 2460–2461.
Li,W. and Godzik,A. (2006) CD-HIT: a fast program for clustering and comparing
large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Nawrocki,E.P. et al. (2009) Infernal 1.0: inference of RNA alignments.
Bioinformatics, 25, 1335–1337.
Schloss,P.D. et al. (2009) Introducing MOTHUR: open-source, platform-independ-
ent, community-supported software for describing and comparing microbial
communities. Appl. Environ. Microbiol., 75, 7537–7541.
Sun,Y. et al. (2009) ESPRIT: estimating species richness using large collections of
16S rRNA pyrosequences. Nucleic Acids Res., 37, e76.
Sun,Y. et al. (2012) A large-scale benchmark study of existing algorithms for tax-
onomy-independent microbial community analysis. Brief. Bioinform., 13,
107–121.
Fig. 1. Runtime comparisons. For HPC-CLUST and MOTHUR, run-
times are shown both including and excluding sequence alignment run-
time. UCLUST and CD-HIT exhibited only negligible decreases in
runtime when using multiple threads. The identity threshold for
clustering was 98%.

Supplementary resource (1)

... Various kinds of clustering methods in the past decades were developed for sequence clustering, such as mothur [42], ES-PRIT [43], HPC-CLUST [44] and mcClust [45]. Many of these methods apply the hierarchical strategy to cluster, which needs all-by-all pairwise alignment of sequences for clustering, so that they are highly computation-intensive for large-scale datasets. ...
... The Schmidt dataset is one comprehensive global 16S rRNA gene sequence dataset (http:// meringlab.org/suppdata/2014-otu_robustness/) constructed by Schmidt et al [44] (here, we named it as the Schmidt dataset). This dataset covers near the whole region of the bacterial 16S rRNA gene and contains 887870 sequences collected from NCBI GenBank, the average length is about 1401 bp. ...
Article
Full-text available
Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This great advancement has resulted in a huge amount of sequencing data. Clustering analysis is powerful to study and probes the large-scale sequence data. A number of available clustering methods have been developed in the last decade. Despite numerous comparison studies being published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared and the evaluation metrics heavily rely on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseq2, Linclust, edClust) are assessed; ii) two alignment-free methods (e.g., LZW-Kernel and Mash) are included to compare with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analyzers in choosing one reasonable clustering algorithm for processing their collected sequences, and furthermore, motivate algorithm designers to develop more efficient sequence clustering approaches.
... The resulting Amplicon Sequence Variants (ASVs) were then clustered into Operational Taxonomic Units (OTUs) at 98% sequence similarity using an open-reference approach: reads were first mapped to the pre-clustered reference set of full-length 16S rRNA sequences at 98% similarity included with MAPseq v1.2.6 [44]. Reads that did not confidently map were aligned to bacterial and archaeal secondary structure-aware SSU rRNA models using Infernal [45] and clustered into OTUs with 98% average linkage using hpc-clust [46], as described previously [47]. The resulting OTU count tables were noise filtered by asserting that samples retained at least 400 reads and taxa were prevalent in at least 1% of samples; these filters removed 45% of OTUs as spurious, corresponding to 0.16% of total reads. ...
Article
Full-text available
Liver steatosis is the most frequent liver disorder and its advanced stage, non-alcoholic steatohepatitis (NASH), will soon become the main reason for liver fibrosis and cirrhosis. The “multiple hits hypothesis” suggests that progression from simple steatosis to NASH is triggered by multiple factors including the gut microbiota composition. The Epstein Barr virus induced gene 2 (EBI2) is a receptor for the oxysterol 7a, 25-dihydroxycholesterol synthesized by the enzymes CH25H and CYP7B1. EBI2 and its ligand control activation of immune cells in secondary lymphoid organs and the gut. Here we show a concurrent study of the microbial dysregulation and perturbation of the EBI2 axis in a mice model of NASH. We used mice with wildtype, or littermates with CH25H−/−, EBI2−/−, or CYP7B1−/− genotypes fed with a high-fat diet (HFD) containing high amounts of fat, cholesterol, and fructose for 20 weeks to induce liver steatosis and NASH. Fecal and small intestinal microbiota samples were collected, and microbiota signatures were compared according to genotype and NASH disease state. We found pronounced differences in microbiota composition of mice with HFD developing NASH compared to mice did not developing NASH. In mice with NASH, we identified significantly increased 33 taxa mainly belonging to the Clostridiales order and/ or the family, and significantly decreased 17 taxa. Using an Elastic Net algorithm, we suggest a microbiota signature that predicts NASH in animals with a HFD from the microbiota composition with moderate accuracy (area under the receiver operator characteristics curve = 0.64). In contrast, no microbiota differences regarding the studied genotypes (wildtype vs knock-out CH25H−/−, EBI2−/−, or CYP7B1−/−) were observed. In conclusion, our data confirm previous studies identifying the intestinal microbiota composition as a relevant marker for NASH pathogenesis. 
Further, no link of the EBI2 – oxysterol axis to the intestinal microbiota was detectable in the current study.
... High-performance computing (HPC) workloads span from traditional computationintensive applications such as simulation of complex systems (wind tunnels, drugs Altino development, chemical industry, weather forecasting [39,42]), to new workloads such as big data [19,49], artificial intelligence [27,45,51], DNA sequencing [17,38], and autonomous driving [2,46]. Based on the degree of interaction between the concurrently running parallel processes, these workloads can be categorised as loosely coupled and tightly coupled workloads. ...
Chapter
HPC cloud aims at exploiting cloud infrastructures to run HPC workloads. Compared with traditional HPC clusters, cloud offers several advantages in terms of rapid access to elastic and diversified computing resources, economies of use, and release the users from deploying and maintaining physical infrastructures. Nevertheless, users are responsible for managing the resources rented from clouds to run their workloads, a task that becomes even more complex if we consider the heterogeneity of resources and the diversity of pricing models implemented by cloud providers. Inefficient resource management not only increases end-of-month costs, it often degrades the application performance. This chapter discusses the resource wastage problem in the context of HPC cloud and provides existing state-of-the-art solutions to tackle such situations. To this end, the chapter starts by describing the classes of HPC applications that benefit from running in the cloud, followed by a formulation of the resource management problem. Then, different metrics to detect resource inefficiencies are introduced and a comparative analysis of several scheduling-based resource optimisation strategies is provided. Although several advances happened in past years in the HPC cloud space, this study identifies the limitations of current solutions that need to be addressed in future research.
... Similarly, a temporal study of the Great Salt Lake, (Andrews, 2010;Bolger et al., 2014;Bushnell, 2014;Cantu et al., 2019;De Coster et al., 2018;Ewels et al., 2016;Fukasawa et al., 2020Fukasawa et al., , 2020Gordon & Hannon, 2010;Hufnagel et al., 2020;Jiang et al., 2014;Krueger, 2015;Lanfear et al., 2019;Martin, 2011;Patel et al., 2012;S. Chen et al., 2021; 16SrRNA (Albanese et al., 2015;Cai & Sun, 2011;Cai et al., 2017;Callahan et al., 2016, p. 2;Cheng et al., 2012;Edgar, 2010;Fu et al., 2012, p.;Ghodsi et al., 2011;Hao et al., 2011;Matias Rodrigues & von Mering, 2014;Mercier et al., 2013;Prasad et al., 2015;Rasheed et al., 2013;Rognes et al., 2016;Schloss & Handelsman, 2005; W. Chen et al., 2013;Wei & Zhang, 2015Westcott et al., 2017;Y. Namiki et al., 2013;Zheng et al., 2012) Classification tools RDP-classifier, BLAST, 16S Classifier, Kraken, OTUbase, TreeOTU, MOTHUR, METAXA2 (Beck et al., 2011;Bengtsson-Palme et al., 2015;Chaudhary et al., 2015;D. ...
Article
Hypersaline ecosystems are distributed all over the globe. They are subjected to poly-extreme stresses and are inhabited by halophilic microorganisms possessing multiple adaptations. The halophiles have many biotechnological applications such as nutrient supplements, antioxidant synthesis, salt tolerant enzyme production, osmolyte synthesis, biofuel production, electricity generation etc. However, halophiles are still underexplored in terms of complex ecological interactions and functions as compared to other niches. The advent of metagenomics and the recent advancement of next-generation sequencing tools have made it feasible to investigate the microflora of an ecosystem, its interactions and functions. Both target gene and shotgun metagenomic approaches are commonly employed for the taxonomic, phylogenetic, and functional analyses of the hypersaline microbial communities. This review discusses different types of hypersaline niches, their residential microflora, and an overview of the metagenomic approaches used to investigate them. Various applications, hurdles and the recent advancements in metagenomic approaches have also been focused on here for their better understanding and utilization in the study of hypersaline microbiome.
... Software like D-GENIES report execution times shorter than G-SAIP, but we present a software that is easily integrated into pipelines, also, G-SAIP is easier configurable than D-GENIES which is a web application useful for making dot-plots within an interactive interface, and the default parameters like maximum RAM memory, maximum file size are not intuitive modifiable. In addition, the application of HPC has demonstrated high performances for several bioinformatics tasks such as multiple sequence alignment, [41][42][43][44][45] sequence mapping, [45][46][47][48]81 analyzing transposable elements, [82][83][84] and identification of transposon insertion polymorphisms, 83 among others. Nevertheless, HPC software has not been deployed for graphical alignment. ...
Article
Full-text available
A common task in bioinformatics is to compare DNA sequences to identify similarities between organisms at the sequence level. An approach to such comparison is the dot-plots, a 2-dimensional graphical representation to analyze DNA or protein alignments. Dot-plots alignment software existed before the sequencing revolution, and now there is an ongoing limitation when dealing with large-size sequences, resulting in very long execution times. High-Performance Computing (HPC) techniques have been successfully used in many applications to reduce computing times, but so far, very few applications for graphical sequence alignment using HPC have been reported. Here, we present G-SAIP (Graphical Sequence Alignment in Parallel), a software capable of spawning multiple distributed processes on CPUs, over a supercomputing infrastructure to speed up the execution time for dot-plot generation up to 1.68× compared with other current fastest tools, improve the efficiency for comparative structural genomic analysis, phylogenetics because the benefits of pairwise alignments for comparison between genomes, repetitive structure identification, and assembly quality checking.
... oxfordjournals.org/). HPC-CLUST [23] was used to perform average linkage clustering on the distance matrix, where the distance between each protein pair was equal to 1.0 -STRING association score. Protein pairs that did not have association scores in STRING were assigned the prior probability of association (which is lower than the lowest association score stored in the database). ...
Article
Full-text available
A knowledge-based grouping of genes into pathways or functional units is essential for describing and understanding cellular complexity. However, it is not always clear a priori how and at what level of specificity functionally interconnected genes should be partitioned into pathways, for a given application. Here, we assess and compare nine existing and two conceptually novel functional classification systems, with respect to their discovery power and generality in gene set enrichment testing. We base our assessment on a collection of nearly 2000 functional genomics datasets provided by users of the STRING database. With these real-life and diverse queries, we assess which systems typically provide the most specific and complete enrichment results. We find many structural and performance differences between classification systems. Overall, the well-established, hierarchically organized pathway annotation systems yield the best enrichment performance, despite covering substantial parts of the human genome in general terms only. On the other hand, the more recent unsupervised annotation systems perform strongest in understudied areas and organisms, and in detecting more specific pathways, albeit with less informative labels.
... Regardless of agglomeration technology or splitting technology, a core problem is measuring the distance between two clusters, and time is basically spent on distance calculation. Therefore, a large number of improved algorithms that use different means to reduce the number of distance calculations have been proposed one after another to improve algorithmic efficiency [19][20][21][22][23][24][25][26][27]. Guha et al. [28] proposed the CURE algorithm, which considers sampling the data in the cluster and uses the sampled data as representative of the cluster to reduce the amount of calculation of pairwise distances. ...
Article
Full-text available
Aiming to resolve the problems of the traditional hierarchical clustering algorithm that cannot find clusters with uneven density, requires a large amount of calculation, and has low efficiency, this paper proposes an improved hierarchical clustering algorithm (referred to as PRI-MFC) based on the idea of population reproduction and fusion. It is divided into two stages: fuzzy pre-clustering and Jaccard fusion clustering. In the fuzzy pre-clustering stage, it determines the center point, uses the product of the neighborhood radius eps and the dispersion degree fog as the benchmark to divide the data, uses the Euclidean distance to determine the similarity of the two data points, and uses the membership grade to record the information of the common points in each cluster. In the Jaccard fusion clustering stage, the clusters with common points are the clusters to be fused, and the clusters whose Jaccard similarity coefficient between the clusters to be fused is greater than the fusion parameter jac are fused. The common points of the clusters whose Jaccard similarity coefficient between clusters is less than the fusion parameter jac are divided into the cluster with the largest membership grade. A variety of experiments are designed from multiple perspectives on artificial datasets and real datasets to demonstrate the superiority of the PRI-MFC algorithm in terms of clustering effect, clustering quality, and time consumption. Experiments are carried out on Chinese household financial survey data, and the clustering results that conform to the actual situation of Chinese households are obtained, which shows the practicability of this algorithm.
Article
Full-text available
The gut microbiota operates at the interface of host–environment interactions to influence human homoeostasis and metabolic networks1–4. Environmental factors that unbalance gut microbial ecosystems can therefore shape physiological and disease-associated responses across somatic tissues5–9. However, the systemic impact of the gut microbiome on the germline—and consequently on the F1 offspring it gives rise to—is unexplored¹⁰. Here we show that the gut microbiota act as a key interface between paternal preconception environment and intergenerational health in mice. Perturbations to the gut microbiota of prospective fathers increase the probability of their offspring presenting with low birth weight, severe growth restriction and premature mortality. Transmission of disease risk occurs via the germline and is provoked by pervasive gut microbiome perturbations, including non-absorbable antibiotics or osmotic laxatives, but is rescued by restoring the paternal microbiota before conception. This effect is linked with a dynamic response to induced dysbiosis in the male reproductive system, including impaired leptin signalling, altered testicular metabolite profiles and remapped small RNA payloads in sperm. As a result, dysbiotic fathers trigger an elevated risk of in utero placental insufficiency, revealing a placental origin of mammalian intergenerational effects. Our study defines a regulatory ‘gut–germline axis’ in males, which is sensitive to environmental exposures and programmes offspring fitness through impacting placenta function.
Chapter
This chapter investigates the movement of moving beyond OTU methods and discusses the necessity and possibility of this movement. First, it describes clustering-based OTU methods and the purposes of using OTUs and definitions of species and species-level analysis in microbiome studies. Then, it introduces the OTU-based methods that move toward single-nucleotide resolution. Third, it describes moving beyond the OTU methods. Finally, it discusses the necessity and possibility of moving beyond OTU methods as well as the issues of sub-OTU methods, assumption of sequence similarity predicting the ecological similarity, and functional analysis and multi-omics integration.KeywordsClustering-based OTU methodsHierarchical clustering OTU methodsHeuristic clustering OTU methodsTaxonomyOTUsSequencing errorSpecies and species-level analysisEukaryote speciesProkaryote or bacterial species16S rRNA methodPhysiological characteristicsSingle-nucleotide resolution-based OTU methodsDistribution-based clustering (DBC)Swarm2Entropy-based methodsOligotypingDenoising-based methodsPyrosequencing flowgramsCluster-free filtering (CFF)DADA2UNOISE2UNOISE3DeblurSeekDeepSub-OTU methodsSequence similarityEcological similarityFunctional analysisMulti-omics integration
Article
Full-text available
Background Recent evidence suggests a role for the microbiome in pancreatic ductal adenocarcinoma (PDAC) aetiology and progression. Objective To explore the faecal and salivary microbiota as potential diagnostic biomarkers. Methods We applied shotgun metagenomic and 16S rRNA amplicon sequencing to samples from a Spanish case–control study (n=136), including 57 cases, 50 controls, and 29 patients with chronic pancreatitis in the discovery phase, and from a German case–control study (n=76), in the validation phase. Results Faecal metagenomic classifiers performed much better than saliva-based classifiers and identified patients with PDAC with an accuracy of up to 0.84 area under the receiver operating characteristic curve (AUROC) based on a set of 27 microbial species, with consistent accuracy across early and late disease stages. Performance further improved to up to 0.94 AUROC when we combined our microbiome-based predictions with serum levels of carbohydrate antigen (CA) 19–9, the only current non-invasive, Food and Drug Administration approved, low specificity PDAC diagnostic biomarker. Furthermore, a microbiota-based classification model confined to PDAC-enriched species was highly disease-specific when validated against 25 publicly available metagenomic study populations for various health conditions (n=5792). Both microbiome-based models had a high prediction accuracy on a German validation population (n=76). Several faecal PDAC marker species were detectable in pancreatic tumour and non-tumour tissue using 16S rRNA sequencing and fluorescence in situ hybridisation. Conclusion Taken together, our results indicate that non-invasive, robust and specific faecal microbiota-based screening for the early detection of PDAC is feasible.
Article
Full-text available
The Ribosomal Database Project (RDP) provides researchers with quality-controlled bacterial and archaeal small subunit rRNA alignments and analysis tools. An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences. Substantial new analysis features include a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data. This pipeline offers a collection of tools that automate the data processing and simplify the computationally intensive analysis of large sequencing libraries. In addition, a new Taxomatic visualization tool allows rapid visualization of taxonomic inconsistencies and suggests corrections, and a new class Assignment Generator provides instructors with a lesson plan and individualized teaching materials. Details about RDP data and analytical functions can be found at http://rdp.cme.msu.edu/.
Recent advances in massively parallel sequencing technology have created new opportunities to probe the hidden world of microbes. Taxonomy-independent clustering of the 16S rRNA gene is usually the first step in analyzing microbial communities. Dozens of algorithms have been developed in the last decade, but a comprehensive benchmark study is lacking. Here, we survey algorithms currently used by microbiologists, and compare seven representative methods in a large-scale benchmark study that addresses several issues of concern. A new experimental protocol was developed that allows different algorithms to be compared using the same platform, and several criteria were introduced to facilitate a quantitative evaluation of the clustering performance of each algorithm. We found that existing methods vary widely in their outputs, and that inappropriate use of distance levels for taxonomic assignments likely resulted in substantial overestimates of biodiversity in many studies. The benchmark study identified our recently developed ESPRIT-Tree, a fast implementation of the average linkage-based hierarchical clustering algorithm, as one of the best algorithms available in terms of computational efficiency and clustering accuracy.
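For orientation, the average-linkage hierarchical clustering that ESPRIT-Tree accelerates can be sketched naively in Python. The function names, the gap-aware distance, and the toy sequences below are our own illustrative choices, not any benchmarked implementation; a real tool would avoid this brute-force O(n³) merge loop:

```python
def pairwise_dist(a, b):
    """Fraction of mismatches over columns where neither aligned sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    return sum(x != y for x, y in pairs) / len(pairs)

def average_linkage_otus(seqs, cutoff):
    """Merge the closest pair of clusters (average linkage) until the smallest
    inter-cluster distance exceeds the cutoff; return clusters of sequence indices."""
    d = {(i, j): pairwise_dist(seqs[i], seqs[j])
         for i in range(len(seqs)) for j in range(i + 1, len(seqs))}
    clusters = [[i] for i in range(len(seqs))]

    def cdist(c1, c2):
        # Average linkage: mean of all pairwise distances between the two clusters.
        return sum(d[min(i, j), max(i, j)] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), best = min((((i, j), cdist(clusters[i], clusters[j]))
                            for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                           key=lambda t: t[1])
        if best > cutoff:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Toy aligned sequences; sequences 0 and 1 differ at one of four columns (dist 0.25)
print(average_linkage_otus(["ACGT", "ACGA", "TTTT"], cutoff=0.3))  # [[0, 1], [2]]
```

The cutoff is the distance level discussed in the benchmark; for 16S data a 0.03 (97% identity) cutoff is the commonly used species-level OTU definition.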
mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h on a laptop computer.
Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, computational methods for analyzing large collections of 16S ribosomal sequences are limited. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational issues with prior methods. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs). The PC version is used for small- and medium-scale data sets and can process several tens of thousands of sequences within a few minutes, while the CC version is for large-scale problems and is able to analyze several hundred thousand reads within one day. Large-scale experiments are presented that clearly demonstrate the effectiveness of the newly proposed algorithm. The source code and user guide are freely available at http://www.biotech.ufl.edu/people/sun/esprit.html.
infernal builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments. Availability: Source code, documentation and benchmarks are downloadable from http://infernal.janelia.org. infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Contact: nawrockie, kolbed, eddys@janelia.hhmi.org
Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering; here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST. Availability: http://cd-hit.org Contact: liwz@sdsc.edu
Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch.
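The greedy, centroid-based strategy shared by CD-HIT and UCLUST can be sketched as follows. This is an illustrative simplification assuming pre-aligned, equal-length sequences and a naive column-identity measure; the function names are ours, not the USEARCH API, and real tools use fast k-mer filters and banded alignment instead of exhaustive all-vs-centroid comparison:

```python
def identity(a, b):
    """Fraction of identical columns between two equal-length aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    """Greedy centroid clustering in input order: each sequence joins the first
    centroid it matches at or above the identity threshold, otherwise it
    becomes the centroid of a new cluster."""
    centroids = []   # representative sequence per cluster
    members = []     # sequence indices per cluster
    for idx, s in enumerate(seqs):
        for c, rep in enumerate(centroids):
            if identity(s, rep) >= threshold:
                members[c].append(idx)
                break
        else:
            centroids.append(s)
            members.append([idx])
    return members

print(greedy_cluster(["AAAA", "AAAT", "TTTT"], threshold=0.75))  # [[0, 1], [2]]
```

Because assignment is order-dependent, both tools sort the input (e.g. by length or abundance) before clustering, a step this sketch omits.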
Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, non-overlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n³) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n³) worst-case time, can reasonably be expected to exhibit O(n²) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n² log n) time in the worst case. When SAHN clustering methods exhibit reasonable space-distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest-neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case O(n²) time and space.
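The nearest-neighbor-chain construction this abstract refers to can be sketched in Python. The function name and input layout are our own conventions; the recurrence in the merge step is the Lance-Williams update for average linkage, and total work is O(n²) because each chain extension is one O(n) scan and the chain grows only O(n) times overall:

```python
def average_linkage_nn_chain(d):
    """SAHN clustering via nearest-neighbor chains (average linkage).

    d maps (i, j) with i < j to the dissimilarity between items i and j.
    Returns the n - 1 merges as (a, b, distance) tuples, in merge order.
    """
    n = max(max(p) for p in d) + 1
    size = {i: 1 for i in range(n)}
    active = set(range(n))

    def dist(a, b):
        return d[(a, b)] if a < b else d[(b, a)]

    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(min(active))
        while True:
            tip = chain[-1]
            nearest = min((c for c in sorted(active) if c != tip),
                          key=lambda c: dist(tip, c))
            # On ties, prefer the previous chain element so the chain terminates.
            if len(chain) > 1 and dist(tip, nearest) == dist(tip, chain[-2]):
                nearest = chain[-2]
            if len(chain) > 1 and nearest == chain[-2]:
                break  # reciprocal nearest neighbors found: merge them
            chain.append(nearest)
        b = chain.pop()
        a = chain.pop()
        merges.append((a, b, dist(a, b)))
        # Lance-Williams update for average linkage; label 'a' now denotes the merge.
        for c in active:
            if c not in (a, b):
                key = (a, c) if a < c else (c, a)
                d[key] = (size[a] * dist(a, c) + size[b] * dist(b, c)) / (size[a] + size[b])
        size[a] += size[b]
        active.remove(b)
    return merges
```

Average linkage is "reducible", so the unfinished chain remains valid after each merge; this is the property that lets the algorithm avoid rescanning from scratch and achieve the quadratic bound.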