
HPC-CLUST: Distributed hierarchical clustering for large sets of nucleotide sequences

November 2013
Bioinformatics 30(2)


DOI:10.1093/bioinformatics/btt657

Source
PubMed

License
CC BY 3.0

Authors:

João F Matias Rodrigues

University of Zurich

Christian von Mering

University of Zurich



Vol. 30 no. 2 2014, pages 287–288

BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btt657

Sequence analysis Advance Access publication November 9, 2013

HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences

João F. Matias Rodrigues and Christian von Mering

Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland

Associate Editor: Inanc Birol

ABSTRACT

Motivation: Nucleotide sequence data are being produced at an ever-increasing rate. Clustering such sequences by similarity is often an essential first step in their analysis, intended to reduce redundancy, define gene families or suggest taxonomic units. Exact clustering algorithms, such as hierarchical clustering, scale relatively poorly in terms of run time and memory usage, yet they are desirable because heuristic shortcuts taken during clustering might have unintended consequences in later analysis steps.

Results: Here we present HPC-CLUST, a highly optimized software pipeline that can cluster large numbers of pre-aligned DNA sequences by running on distributed computing hardware. It allocates both memory and computing resources efficiently, and can process more than a million sequences in a few hours on a small cluster.

Availability and implementation: Source code and binaries are freely available at http://meringlab.org/software/hpc-clust/; the pipeline is implemented in C++ and uses the Message Passing Interface (MPI) standard for distributed computing.

Contact: mering@imls.uzh.ch

Supplementary Information: Supplementary data are available at Bioinformatics online.

Received on September 6, 2013; revised on October 19, 2013; accepted on November 7, 2013

1 INTRODUCTION

The time complexity of hierarchical clustering algorithms (HCA) is quadratic O(N^2) or even worse O(N^2 log N), depending on the selected cluster linkage method (Day and Edelsbrunner, 1984). However, HCAs have a number of advantages that make them attractive for applications in biology: (i) they are well defined and should be reproducible across implementations, (ii) they require nothing but a pairwise distance matrix as input and (iii) they are agglomerative, meaning that sets of clusters at arbitrary similarity thresholds can be extracted quickly by post-processing, once a complete clustering run has been executed. Consequently, HCAs have been widely adopted in biology, in areas ranging from data mining to sequence analysis to evolutionary biology.

Apart from generic implementations, a number of hierarchical clustering implementations exist that focus on biological sequence data, taking advantage of the fact that distances between sequences can be computed relatively cheaply, even in a transient fashion. However, existing implementations such as MOTHUR (Schloss et al., 2009), ESPRIT (Sun et al., 2009) or RDP online clustering (Cole et al., 2009) all struggle with large sets of sequences. In light of these performance limits, heuristic approaches such as CD-HIT (Li and Godzik, 2006) and UCLUST (Edgar, 2010) have also been implemented.

Hierarchical clustering starts by considering every sequence separately and merging the two closest ones into a cluster. Then, iteratively, larger clusters are formed by joining the closest sequences and/or clusters. The distance between two clusters with several sequences will depend on the clustering linkage chosen. In single linkage, it is the similarity between the two most similar sequences; in complete linkage, between the two most dissimilar sequences; and in average linkage, the average of all pairwise similarities. The latter method is also known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and is often used in the construction of phylogenetic guide trees.
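
The three linkage rules described above can be illustrated with a minimal Python sketch. It is a naive agglomerative loop over a small similarity matrix, written for clarity only; it is not HPC-CLUST's implementation, and the function names and the toy matrix `SIM` are assumptions made for this example.

```python
from itertools import combinations


def linkage_similarity(a, b, sim, method):
    """Similarity between clusters a and b (lists of item indices)."""
    values = [sim[i][j] for i in a for j in b]
    if method == "single":      # the two most similar members
        return max(values)
    if method == "complete":    # the two most dissimilar members
        return min(values)
    if method == "average":     # mean of all pairwise similarities (UPGMA)
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage: {method}")


def cluster(sim, threshold, method):
    """Naive agglomerative clustering; stops once no pair reaches threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest linkage similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_similarity(clusters[p[0]], clusters[p[1]], sim, method),
        )
        if linkage_similarity(clusters[a], clusters[b], sim, method) < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy "chain": sequences 0~1 and 1~2 are close, but 0 and 2 are not.
SIM = [
    [1.00, 0.98, 0.95],
    [0.98, 1.00, 0.98],
    [0.95, 0.98, 1.00],
]

for method in ("single", "complete", "average"):
    print(method, cluster(SIM, 0.97, method))
```

At a 0.97 threshold, single linkage chains all three toy sequences into one cluster even though sequences 0 and 2 are only 0.95 similar, whereas complete and average linkage keep sequence 2 separate.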

In the type of approach used by CD-HIT and UCLUST, each input sequence is considered sequentially, and is either added to an existing cluster (if it is found to meet the clustering threshold) or is used as a seed to start a new cluster. Although this approach is extremely efficient, it can lead to some undesired characteristics (Sun et al., 2012): (i) it will create clusters with sequences that may be more dissimilar than the chosen clustering threshold; (ii) it can occur that a new cluster is created close to an existing cluster, but at a distance just slightly greater than the clustering threshold; at this point, any new sequences close to both clusters will be split among the two clusters, whereas previous sequences will have been added to only the first cluster; this effectively results in a reduction of the clustering threshold locally; and (iii) different sequence input orders will result in different sets of clusters due to different choices of the seed sequences. Point (i) also affects HCA using single linkage and to a lesser extent average linkage, but does not occur with complete linkage.
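
Points (i) and (iii) can be reproduced with a small Python sketch of seed-based greedy clustering. This is a deliberately simplified stand-in, not the actual CD-HIT or UCLUST algorithm (both add sorting and filtering steps not modelled here); `greedy_cluster`, `toy_similarity` and the integer "sequences" are hypothetical names and data chosen for illustration.

```python
def greedy_cluster(items, threshold, similarity):
    """Add each item to the first seed it matches, else make it a new seed."""
    clusters = []  # list of (seed, members) in creation order
    for item in items:
        for seed, members in clusters:
            if similarity(seed, item) >= threshold:
                members.append(item)
                break
        else:
            # No existing seed is similar enough: start a new cluster.
            clusters.append((item, [item]))
    return [members for _, members in clusters]


def toy_similarity(x, y):
    """Hypothetical percent similarity between two integer 'sequences'."""
    return 100 - abs(x - y)


# Same three items, two input orders, 97% threshold.
print(greedy_cluster([0, 2, 4], 97, toy_similarity))
print(greedy_cluster([2, 0, 4], 97, toy_similarity))
```

With the order `[0, 2, 4]` the items fall into two clusters, while the order `[2, 0, 4]` yields a single cluster that contains items 0 and 4, whose mutual similarity (96) is below the threshold.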

Here we present a distributed implementation of an HCA that can handle large numbers of sequences. It can compute single-, complete- and average-linkage clusters in a single run and produces a merge-log from which clusters can subsequently be parsed at any threshold. In contrast to CD-HIT, UCLUST and ESPRIT, which all take unaligned sequence data as their input, HPC-CLUST (like MOTHUR) takes as input a set of pre-aligned sequences. This allows for flexibility in the choice of alignment algorithm; a future version of HPC-CLUST may include the alignment step as well. For further details on implementation and algorithms, see the Supplementary Material.
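
The idea of extracting clusters at any threshold from a merge-log can be sketched in Python as follows. The log format used here, a list of `(similarity, item_i, item_j)` records meaning "the clusters containing these two items were merged at this similarity", is an assumption for illustration; the actual HPC-CLUST output format is not described in this text.

```python
def clusters_at(merge_log, n_items, threshold):
    """Replay all merges with similarity >= threshold using union-find."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for similarity, i, j in merge_log:
        if similarity >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


# Hypothetical merge-log for five sequences, most similar merges first.
MERGE_LOG = [(0.99, 0, 1), (0.98, 2, 3), (0.96, 0, 2), (0.90, 0, 4)]

print(clusters_at(MERGE_LOG, 5, 0.97))
print(clusters_at(MERGE_LOG, 5, 0.95))
```

Because the log is written once and only replayed, clusters at a different threshold cost a single linear pass rather than a new clustering run.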

*To whom correspondence should be addressed.

© The Author 2013. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

2 METHODS

For all benchmarks, we used one or more dedicated Dell Blade M605 compute nodes with 2 quad-core Opteron 2.33 GHz processors and 24 GB of random access memory. The most recent version of each software pipeline was used: HPC-CLUST (v1.0.0), MOTHUR (v.1.29.2), ESPRIT (Feb. 2011), CD-HIT (v4.6.1) and UCLUST (v6.0.307). Detailed information on settings and parameters is available in the Supplementary Material.

We compiled a dataset of publicly available full-length 16S bacterial ribosomal RNA sequences from NCBI Genbank. Sequences were aligned using INFERNAL v1.0.2 with a 16S model for bacteria from the ssu-align package (Nawrocki et al., 2009). Importantly, INFERNAL uses a profile alignment strategy that scales linearly, O(N), with the number of sequences, and can be trivially parallelized. Indels were removed and sequences were trimmed between two well-conserved alignment columns, such that all sequences had the same aligned length. The final dataset consisted of 1 105 195 bacterial sequences (833 013 unique) with an aligned length of 1301 positions.
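
Because all input sequences share the same aligned length, a pairwise identity can be computed column by column. The Python sketch below shows one simple way to do this; the gap characters and the rule for gap columns (columns gapped in both sequences are skipped, a gap against a base counts as a mismatch) are assumptions for illustration and may differ from the definition HPC-CLUST uses, which is given in the Supplementary Material.

```python
GAP_CHARS = "-."  # assumed gap symbols


def identity(seq_a, seq_b):
    """Fraction of identical columns between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS and b in GAP_CHARS:
            continue  # assumption: columns gapped in both sequences are ignored
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


print(identity("ACGT-A", "ACGA-A"))  # 4 matching columns out of 5 compared
```

Each comparison is a single linear scan over the alignment columns, which is what makes computing distances on the fly affordable relative to aligning each pair from scratch.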

3 RESULTS

3.1 Clustering performance on a single computer

HPC-CLUST has been highly optimized for computation speed and memory efficiency. It is by far the fastest of the exact clustering implementations tested here, even when running on a single computer (Fig. 1). Compared with MOTHUR, it produces identical or nearly identical clustering results (see Supplementary Material). Because CD-HIT and UCLUST use a different approach to clustering, they are not directly comparable and are included for reference only.

In HPC-CLUST, the largest fraction of computation time is spent calculating the pairwise sequence distances, the second largest sorting the distances, and the final clustering step is the fastest. HPC-CLUST can make use of multithreaded execution on multiple nodes and achieves practically optimal parallelization in the distance calculation step. Additional benchmarks are shown and discussed in the Supplementary Material.
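
The three stages named above (distance calculation, sorting, clustering) lend themselves to a simple division of labour, sketched here in plain Python rather than MPI. Each hypothetical worker takes every `n_workers`-th sequence pair, scores it, keeps the pairs that reach the threshold and sorts them locally; the sorted lists are then merged into one stream for the clustering step. The round-robin split, the thresholding inside the worker and all function names are assumptions for this sketch, not a description of HPC-CLUST's internals.

```python
import heapq
from itertools import combinations, islice


def pairs_for_worker(n_sequences, rank, n_workers):
    """Every n_workers-th pair (i, j) with i < j, starting at offset rank."""
    return islice(combinations(range(n_sequences), 2), rank, None, n_workers)


def worker_stage(similarity, n_sequences, rank, n_workers, threshold):
    """Score this worker's pairs, keep those >= threshold, sort descending."""
    scored = (
        (similarity(i, j), i, j)
        for i, j in pairs_for_worker(n_sequences, rank, n_workers)
    )
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


def merged_stream(per_worker_lists):
    """Merge the locally sorted lists into one descending list."""
    return list(heapq.merge(*per_worker_lists, reverse=True))


def toy_similarity(i, j):
    """Hypothetical percent similarity between sequences i and j."""
    return 100 - abs(i - j)


parts = [worker_stage(toy_similarity, 5, rank, 3, 98) for rank in range(3)]
print(merged_stream(parts))
```

Since no pair depends on any other, the scoring stage needs no communication between workers until the merge, which is why this stage is the natural one to parallelize.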

3.2 Distributed clustering performance

Clustering the full dataset (833 013 unique sequences) to a 97% identity threshold required a total of 2 h and 42 min on a compute cluster of 24 nodes with 8 cores each (192 total cores). Owing to parallelization, the distance and sorting computation took only 57 min (wall clock time), corresponding to more than 10 000 min of CPU time. The remaining 1 h and 45 min (wall clock time) were spent collecting and clustering the distances. The combined total memory used for the distance matrix was 59.8 GB, or 2.6 GB per node. The node on which the merging step was performed used a maximum of 4.9 GB of memory when doing single-, complete- and average-linkage clusterings in the same run.
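
As a rough sanity check on the scale of this run, the figures quoted above can be combined in a few lines of Python. The sequence count, node count, core count and 57 min wall-clock time come from the text; the 4 bytes per stored value is an assumption made only to size a hypothetical dense half-matrix.

```python
# Figures quoted in the text.
N_UNIQUE = 833_013
NODES, CORES_PER_NODE = 24, 8
DISTANCE_WALL_MIN = 57
REPORTED_MATRIX_GB = 59.8

# Assumption for illustration only: 4 bytes per stored value.
BYTES_PER_VALUE = 4

n_pairs = N_UNIQUE * (N_UNIQUE - 1) // 2                      # distinct pairs
core_minutes = DISTANCE_WALL_MIN * NODES * CORES_PER_NODE     # upper bound on CPU time
dense_matrix_gb = n_pairs * BYTES_PER_VALUE / 1e9             # if every pair were stored

print(f"sequence pairs:             {n_pairs:,}")
print(f"core-minutes (upper bound): {core_minutes:,}")
print(f"dense half-matrix:          {dense_matrix_gb:,.0f} GB "
      f"vs {REPORTED_MATRIX_GB} GB reported")
```

The run covers roughly 3.5 × 10^11 sequence pairs, and 57 min on 192 cores is at most 10 944 core-minutes. Under the 4-byte assumption a dense half-matrix would need about 1.4 TB, far more than the 59.8 GB reported, which suggests that only part of the pairwise values is held in memory, plausibly those relevant to the clustering threshold; the text here does not state this explicitly.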

4 CONCLUSION

Clustering is often among the first steps when dealing with raw sequence data, and therefore needs to be as fast and as memory efficient as possible. The implementation of a distributed version of hierarchical clustering in HPC-CLUST now makes it possible to fully cluster a much larger number of sequences, essentially limited only by the number of available computing nodes.

ACKNOWLEDGEMENT

The authors thank Thomas S. B. Schmidt for his feedback and help in testing HPC-CLUST.

Funding: ERC grant (Starting Grant ‘UMICIS/242870’ to C.vM.).

Conflict of Interest: none declared.

REFERENCES

Cole,J.R. et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res., 37, D141–D145.

Day,W. and Edelsbrunner,H. (1984) Efficient algorithms for agglomerative hierarchical clustering methods. J. Classif., 1, 7–24.

Edgar,R.C. (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461.

Li,W. and Godzik,A. (2006) CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.

Nawrocki,E.P. et al. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics, 25, 1335–1337.

Schloss,P.D. et al. (2009) Introducing MOTHUR: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol., 75, 7537–7541.

Sun,Y. et al. (2009) ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res., 37, e76.

Sun,Y. et al. (2012) A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief. Bioinform., 13, 107–121.

Fig. 1. Runtime comparisons. For HPC-CLUST and MOTHUR, runtimes are shown both including and excluding sequence alignment runtime. UCLUST and CD-HIT exhibited only negligible decreases in runtime when using multiple threads. The identity threshold for clustering was 98%.


Supplementary Data

Data

November 2013

João F Matias Rodrigues · Christian von Mering

Comparison of Methods for Biological Sequence Clustering

Article

Full-text available

Mar 2023

Recent advances in sequencing technology have considerably promoted genomics research by providing high-throughput sequencing economically. This great advancement has resulted in a huge amount of sequencing data. Clustering analysis is powerful to study and probes the large-scale sequence data. A number of available clustering methods have been developed in the last decade. Despite numerous comparison studies being published, we noticed that they have two main limitations: only traditional alignment-based clustering methods are compared and the evaluation metrics heavily rely on labeled sequence data. In this study, we present a comprehensive benchmark study for sequence clustering methods. Specifically, i) alignment-based clustering algorithms including classical (e.g., CD-HIT, UCLUST, VSEARCH) and recently proposed methods (e.g., MMseq2, Linclust, edClust) are assessed; ii) two alignment-free methods (e.g., LZW-Kernel and Mash) are included to compare with alignment-based methods; and iii) different evaluation measures based on the true labels (supervised metrics) and the input data itself (unsupervised metrics) are applied to quantify their clustering results. The aims of this study are to help biological analyzers in choosing one reasonable clustering algorithm for processing their collected sequences, and furthermore, motivate algorithm designers to develop more efficient sequence clustering approaches.

Development of non-alcoholic steatohepatitis is associated with gut microbiota but not with oxysterol enzymes CH25H, EBI2, or CYP7B1 in mice

Article

Full-text available

Feb 2024
BMC MICROBIOL

Liver steatosis is the most frequent liver disorder and its advanced stage, non-alcoholic steatohepatitis (NASH), will soon become the main reason for liver fibrosis and cirrhosis. The “multiple hits hypothesis” suggests that progression from simple steatosis to NASH is triggered by multiple factors including the gut microbiota composition. The Epstein Barr virus induced gene 2 (EBI2) is a receptor for the oxysterol 7a, 25-dihydroxycholesterol synthesized by the enzymes CH25H and CYP7B1. EBI2 and its ligand control activation of immune cells in secondary lymphoid organs and the gut. Here we show a concurrent study of the microbial dysregulation and perturbation of the EBI2 axis in a mice model of NASH. We used mice with wildtype, or littermates with CH25H−/−, EBI2−/−, or CYP7B1−/− genotypes fed with a high-fat diet (HFD) containing high amounts of fat, cholesterol, and fructose for 20 weeks to induce liver steatosis and NASH. Fecal and small intestinal microbiota samples were collected, and microbiota signatures were compared according to genotype and NASH disease state. We found pronounced differences in microbiota composition of mice with HFD developing NASH compared to mice did not developing NASH. In mice with NASH, we identified significantly increased 33 taxa mainly belonging to the Clostridiales order and/ or the family, and significantly decreased 17 taxa. Using an Elastic Net algorithm, we suggest a microbiota signature that predicts NASH in animals with a HFD from the microbiota composition with moderate accuracy (area under the receiver operator characteristics curve = 0.64). In contrast, no microbiota differences regarding the studied genotypes (wildtype vs knock-out CH25H−/−, EBI2−/−, or CYP7B1−/−) were observed. In conclusion, our data confirm previous studies identifying the intestinal microbiota composition as a relevant marker for NASH pathogenesis. 
Further, no link of the EBI2 – oxysterol axis to the intestinal microbiota was detectable in the current study.

Avoiding Resource Wastage

Chapter

Mar 2023

HPC cloud aims at exploiting cloud infrastructures to run HPC workloads. Compared with traditional HPC clusters, cloud offers several advantages in terms of rapid access to elastic and diversified computing resources, economies of use, and release the users from deploying and maintaining physical infrastructures. Nevertheless, users are responsible for managing the resources rented from clouds to run their workloads, a task that becomes even more complex if we consider the heterogeneity of resources and the diversity of pricing models implemented by cloud providers. Inefficient resource management not only increases end-of-month costs, it often degrades the application performance. This chapter discusses the resource wastage problem in the context of HPC cloud and provides existing state-of-the-art solutions to tackle such situations. To this end, the chapter starts by describing the classes of HPC applications that benefit from running in the cloud, followed by a formulation of the resource management problem. Then, different metrics to detect resource inefficiencies are introduced and a comparative analysis of several scheduling-based resource optimisation strategies is provided. Although several advances happened in past years in the HPC cloud space, this study identifies the limitations of current solutions that need to be addressed in future research.

Unveiling the role of emerging metagenomics for the examination of hypersaline environments

Article

Apr 2023
Biotechnol Genet Eng Rev

Hypersaline ecosystems are distributed all over the globe. They are subjected to poly-extreme stresses and are inhabited by halophilic microorganisms possessing multiple adaptations. The halophiles have many biotechnological applications such as nutrient supplements, antioxidant synthesis, salt tolerant enzyme production, osmolyte synthesis, biofuel production, electricity generation etc. However, halophiles are still underexplored in terms of complex ecological interactions and functions as compared to other niches. The advent of metagenomics and the recent advancement of next-generation sequencing tools have made it feasible to investigate the microflora of an ecosystem, its interactions and functions. Both target gene and shotgun metagenomic approaches are commonly employed for the taxonomic, phylogenetic, and functional analyses of the hypersaline microbial communities. This review discusses different types of hypersaline niches, their residential microflora, and an overview of the metagenomic approaches used to investigate them. Various applications, hurdles and the recent advancements in metagenomic approaches have also been focused on here for their better understanding and utilization in the study of hypersaline microbiome.

G-SAIP: Graphical Sequence Alignment Through Parallel Programming in the Post-Genomic Era

Article

Full-text available

Jan 2023
EVOL BIOINFORM

A common task in bioinformatics is to compare DNA sequences to identify similarities between organisms at the sequence level. An approach to such comparison is the dot-plots, a 2-dimensional graphical representation to analyze DNA or protein alignments. Dot-plots alignment software existed before the sequencing revolution, and now there is an ongoing limitation when dealing with large-size sequences, resulting in very long execution times. High-Performance Computing (HPC) techniques have been successfully used in many applications to reduce computing times, but so far, very few applications for graphical sequence alignment using HPC have been reported. Here, we present G-SAIP (Graphical Sequence Alignment in Parallel), a software capable of spawning multiple distributed processes on CPUs, over a supercomputing infrastructure to speed up the execution time for dot-plot generation up to 1.68× compared with other current fastest tools, improve the efficiency for comparative structural genomic analysis, phylogenetics because the benefits of pairwise alignments for comparison between genomes, repetitive structure identification, and assembly quality checking.

Systematic assessment of pathway databases, based on a diverse collection of user-submitted experiments

Article

Full-text available

Sep 2022

A knowledge-based grouping of genes into pathways or functional units is essential for describing and understanding cellular complexity. However, it is not always clear a priori how and at what level of specificity functionally interconnected genes should be partitioned into pathways, for a given application. Here, we assess and compare nine existing and two conceptually novel functional classification systems, with respect to their discovery power and generality in gene set enrichment testing. We base our assessment on a collection of nearly 2000 functional genomics datasets provided by users of the STRING database. With these real-life and diverse queries, we assess which systems typically provide the most specific and complete enrichment results. We find many structural and performance differences between classification systems. Overall, the well-established, hierarchically organized pathway annotation systems yield the best enrichment performance, despite covering substantial parts of the human genome in general terms only. On the other hand, the more recent unsupervised annotation systems perform strongest in understudied areas and organisms, and in detecting more specific pathways, albeit with less informative labels.

An Improved Hierarchical Clustering Algorithm Based on the Idea of Population Reproduction and Fusion

Article

Full-text available

Aug 2022

Aiming to resolve the problems of the traditional hierarchical clustering algorithm that cannot find clusters with uneven density, requires a large amount of calculation, and has low efficiency, this paper proposes an improved hierarchical clustering algorithm (referred to as PRI-MFC) based on the idea of population reproduction and fusion. It is divided into two stages: fuzzy pre-clustering and Jaccard fusion clustering. In the fuzzy pre-clustering stage, it determines the center point, uses the product of the neighborhood radius eps and the dispersion degree fog as the benchmark to divide the data, uses the Euclidean distance to determine the similarity of the two data points, and uses the membership grade to record the information of the common points in each cluster. In the Jaccard fusion clustering stage, the clusters with common points are the clusters to be fused, and the clusters whose Jaccard similarity coefficient between the clusters to be fused is greater than the fusion parameter jac are fused. The common points of the clusters whose Jaccard similarity coefficient between clusters is less than the fusion parameter jac are divided into the cluster with the largest membership grade. A variety of experiments are designed from multiple perspectives on artificial datasets and real datasets to demonstrate the superiority of the PRI-MFC algorithm in terms of clustering effect, clustering quality, and time consumption. Experiments are carried out on Chinese household financial survey data, and the clustering results that conform to the actual situation of Chinese households are obtained, which shows the practicability of this algorithm.

Paternal microbiome perturbations impact offspring fitness

Article

Full-text available

May 2024
NATURE

The gut microbiota operates at the interface of host–environment interactions to influence human homoeostasis and metabolic networks1–4. Environmental factors that unbalance gut microbial ecosystems can therefore shape physiological and disease-associated responses across somatic tissues5–9. However, the systemic impact of the gut microbiome on the germline—and consequently on the F1 offspring it gives rise to—is unexplored¹⁰. Here we show that the gut microbiota act as a key interface between paternal preconception environment and intergenerational health in mice. Perturbations to the gut microbiota of prospective fathers increase the probability of their offspring presenting with low birth weight, severe growth restriction and premature mortality. Transmission of disease risk occurs via the germline and is provoked by pervasive gut microbiome perturbations, including non-absorbable antibiotics or osmotic laxatives, but is rescued by restoring the paternal microbiota before conception. This effect is linked with a dynamic response to induced dysbiosis in the male reproductive system, including impaired leptin signalling, altered testicular metabolite profiles and remapped small RNA payloads in sperm. As a result, dysbiotic fathers trigger an elevated risk of in utero placental insufficiency, revealing a placental origin of mammalian intergenerational effects. Our study defines a regulatory ‘gut–germline axis’ in males, which is sensitive to environmental exposures and programmes offspring fitness through impacting placenta function.

Moving Beyond OTU Methods

Chapter

May 2023

This chapter investigates the movement of moving beyond OTU methods and discusses the necessity and possibility of this movement. First, it describes clustering-based OTU methods and the purposes of using OTUs and definitions of species and species-level analysis in microbiome studies. Then, it introduces the OTU-based methods that move toward single-nucleotide resolution. Third, it describes moving beyond the OTU methods. Finally, it discusses the necessity and possibility of moving beyond OTU methods as well as the issues of sub-OTU methods, assumption of sequence similarity predicting the ecological similarity, and functional analysis and multi-omics integration.KeywordsClustering-based OTU methodsHierarchical clustering OTU methodsHeuristic clustering OTU methodsTaxonomyOTUsSequencing errorSpecies and species-level analysisEukaryote speciesProkaryote or bacterial species16S rRNA methodPhysiological characteristicsSingle-nucleotide resolution-based OTU methodsDistribution-based clustering (DBC)Swarm2Entropy-based methodsOligotypingDenoising-based methodsPyrosequencing flowgramsCluster-free filtering (CFF)DADA2UNOISE2UNOISE3DeblurSeekDeepSub-OTU methodsSequence similarityEcological similarityFunctional analysisMulti-omics integration

A faecal microbiota signature with high specificity for pancreatic cancer

Article

Full-text available

Mar 2022

Background Recent evidence suggests a role for the microbiome in pancreatic ductal adenocarcinoma (PDAC) aetiology and progression. Objective To explore the faecal and salivary microbiota as potential diagnostic biomarkers. Methods We applied shotgun metagenomic and 16S rRNA amplicon sequencing to samples from a Spanish case–control study (n=136), including 57 cases, 50 controls, and 29 patients with chronic pancreatitis in the discovery phase, and from a German case–control study (n=76), in the validation phase. Results Faecal metagenomic classifiers performed much better than saliva-based classifiers and identified patients with PDAC with an accuracy of up to 0.84 area under the receiver operating characteristic curve (AUROC) based on a set of 27 microbial species, with consistent accuracy across early and late disease stages. Performance further improved to up to 0.94 AUROC when we combined our microbiome-based predictions with serum levels of carbohydrate antigen (CA) 19–9, the only current non-invasive, Food and Drug Administration approved, low specificity PDAC diagnostic biomarker. Furthermore, a microbiota-based classification model confined to PDAC-enriched species was highly disease-specific when validated against 25 publicly available metagenomic study populations for various health conditions (n=5792). Both microbiome-based models had a high prediction accuracy on a German validation population (n=76). Several faecal PDAC marker species were detectable in pancreatic tumour and non-tumour tissue using 16S rRNA sequencing and fluorescence in situ hybridisation. Conclusion Taken together, our results indicate that non-invasive, robust and specific faecal microbiota-based screening for the early detection of PDAC is feasible.

The Ribosomal Database Project: Improved Alignments and New Tools for rRNA Analysis

Article

Full-text available

Jan 2009
NUCLEIC ACIDS RES

The Ribosomal Database Project (RDP) provides researchers with quality-controlled bacterial and archaeal small subunit rRNA alignments and analysis tools. An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences. Substantial new analysis features include a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data. This pipeline offers a collection of tools that automate the data processing and simplify the computationally intensive analysis of large sequencing libraries. In addition, a new Taxomatic visualization tool allows rapid visualization of taxonomic inconsistencies and suggests corrections, and a new class Assignment Generator provides instructors with a lesson plan and individualized teaching materials. Details about RDP data and analytical functions can be found at http://rdp.cme.msu.edu/.

A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis

Article

Full-text available

Apr 2011
BRIEF BIOINFORM

Recent advances in massively parallel sequencing technology have created new opportunities to probe the hidden world of microbes. Taxonomy-independent clustering of the 16S rRNA gene is usually the first step in analyzing microbial communities. Dozens of algorithms have been developed in the last decade, but a comprehensive benchmark study is lacking. Here, we survey algorithms currently used by microbiologists, and compare seven representative methods in a large-scale benchmark study that addresses several issues of concern. A new experimental protocol was developed that allows different algorithms to be compared using the same platform, and several criteria were introduced to facilitate a quantitative evaluation of the clustering performance of each algorithm. We found that existing methods vary widely in their outputs, and that inappropriate use of distance levels for taxonomic assignments likely resulted in substantial overestimates of biodiversity in many studies. The benchmark study identified our recently developed ESPRIT-Tree, a fast implementation of the average linkage-based hierarchical clustering algorithm, as one of the best algorithms available in terms of computational efficiency and clustering accuracy.

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

Article

Full-text available

Jan 2009
APPL ENVIRON MICROB

mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.

ESPRIT: Estimating species richness using large collections of 16S rRNA pyrosequences

Jun 2009
NUCLEIC ACIDS RES

Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, computational methods for analyzing large collections of 16S ribosomal sequences are limited. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational issues with prior methods. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs). The PC version is used for small- and medium-scale data sets and can process several tens of thousands of sequences within a few minutes, while the CC version is for large-scale problems and is able to analyze several hundreds of thousands of reads within one day. Large-scale experiments are presented that clearly demonstrate the effectiveness of the newly proposed algorithm. The source code and user guide are freely available at http://www.biotech.ufl.edu/people/sun/esprit.html.

Infernal 1.0: Inference of RNA Alignments

Apr 2009
BIOINFORMATICS

infernal builds consensus RNA secondary structure profiles called covariance models (CMs), and uses them to search nucleic acid sequence databases for homologous RNAs, or to create new sequence- and structure-based multiple sequence alignments. Availability: Source code, documentation and benchmark downloadable from http://infernal.janelia.org. infernal is freely licensed under the GNU GPLv3 and should be portable to any POSIX-compliant operating system, including Linux and Mac OS/X. Contact: nawrockie,kolbed,eddys{at}janelia.hhmi.org

Cd-Hit: a Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences

Aug 2006

Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283; Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering; here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST. Availability: http://cd-hit.org Contact: liwz{at}sdsc.edu

Efficient agglomerative hierarchical clustering methods

Jan 1984
J CLASSIF

Search and Clustering Orders of Magnitude Faster than BLAST

Oct 2010
BIOINFORMATICS

Robert C. Edgar

Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch.

Efficient algorithms for agglomerative hierarchical clustering methods

Feb 1984
J CLASSIF

Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n^3) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n^3) worst-case time, can reasonably be expected to exhibit O(n^2) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n^2 log n) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case O(n^2) time and space.
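The nearest-neighbor-chain idea mentioned in this abstract can be illustrated with a short sketch. The Python code below is not taken from the cited paper or from HPC-CLUST; it is a minimal illustration that assumes average linkage (updated with the Lance-Williams formula), a full symmetric distance matrix with finite entries, and a hypothetical function name `nn_chain_average_linkage`. The chain is extended by following nearest neighbors until two clusters are each other's nearest neighbor, at which point they are merged.

```python
def nn_chain_average_linkage(dist):
    """Agglomerative clustering via nearest-neighbor chains (average linkage).

    dist: symmetric n x n matrix (list of lists) of finite pairwise distances.
    Returns a list of merges (a, b, height); after each merge the combined
    cluster keeps index a and index b is retired.
    """
    n = len(dist)
    d = [list(row) for row in dist]   # working copy; the input is not modified
    size = [1] * n                    # number of original objects per cluster
    alive = [True] * n                # clusters that have not been merged away
    remaining = n
    chain = []                        # current nearest-neighbor chain (a stack)
    merges = []

    while remaining > 1:
        if not chain:
            # Start a new chain from any active cluster.
            chain.append(next(i for i in range(n) if alive[i]))
        while True:
            top = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None
            # Find the nearest active neighbor of the chain's top element.
            # On ties, prefer the previous chain element so the chain cannot cycle.
            best, best_d = None, float("inf")
            for k in range(n):
                if not alive[k] or k == top:
                    continue
                if d[top][k] < best_d or (d[top][k] == best_d and k == prev):
                    best, best_d = k, d[top][k]
            if best == prev:
                break                 # top and prev are reciprocal nearest neighbors
            chain.append(best)

        b = chain.pop()
        a = chain.pop()
        merges.append((a, b, d[a][b]))
        # Lance-Williams update for average linkage: size-weighted mean distance.
        for k in range(n):
            if alive[k] and k != a and k != b:
                merged = (size[a] * d[a][k] + size[b] * d[b][k]) / (size[a] + size[b])
                d[a][k] = d[k][a] = merged
        size[a] += size[b]
        alive[b] = False
        remaining -= 1
    return merges
```

Merges are reported in the order the chain discovers them, which is not necessarily sorted by height; sorting the list by the third field gives the usual bottom-up dendrogram order. Each nearest-neighbor scan costs O(n) and the chain grows only O(n) times in total, which is where the O(n^2) time bound comes from. The full matrix used here also needs O(n^2) memory.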
