Alexander Auch
Research interests
-
Interests: Computational Biology (Bioinformatics), Computational Science
Other
-
Languages: German, English
Publications
-
Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison.
Standards in genomic sciences. 01/2010; 2(1):117-34.
The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.
-
Standard operating procedure for calculating genome-to-genome distances based on high-scoring segment pairs.
Standards in genomic sciences. 01/2010; 2(1):142-8.
DNA-DNA hybridization (DDH) is a widely applied wet-lab technique to obtain an estimate of the overall similarity between the genomes of two organisms. To base the species concept for prokaryotes ultimately on DDH was chosen by microbiologists as a pragmatic approach for deciding about the recognition of novel species, but also allowed a relatively high degree of standardization compared to other areas of taxonomy. However, DDH is tedious and error-prone and first and foremost cannot be used to incrementally establish a comparative database. Recent studies have shown that in-silico methods for the comparison of genome sequences can be used to replace DDH. Considering the ongoing rapid technological progress of sequencing methods, genome-based prokaryote taxonomy is coming into reach. However, calculating distances between genomes is dependent on multiple choices for software and program settings. We here provide an overview of the modifications that can be applied to distance methods based on high-scoring segment pairs (HSPs) or maximally unique matches (MUMs) and that need to be documented. General recommendations on determining HSPs using BLAST or other algorithms are also provided. As a reference implementation, we introduce the GGDC web server (http://ggdc.gbdp.org).
-
1.89 Impact points
A Clustering Optimization Strategy for Molecular Taxonomy Applied to Planktonic Foraminifera SSU rDNA.
Evolutionary bioinformatics online. 01/2010; 6:97-112.
Identifying species is challenging in the case of organisms for which primarily molecular data are available. Even if morphological features are available, molecular taxonomy is often necessary to revise taxonomic concepts and to analyze environmental DNA sequences. However, clustering approaches to delineate molecular operational taxonomic units often rely on arbitrary parameter choices. Also, distance calculation is difficult for highly alignment-ambiguous sequences. Here, we applied a recently described clustering optimization method to highly divergent planktonic foraminifera SSU rDNA sequences. We determined the distance function and the clustering setting that result in the highest agreement with morphological reference data. Alignment-free distance calculation, when adapted to the use with partly non-homologous sequences caused by distinct primer pairs, outperformed multiple sequence alignment. Clustering optimization offers new perspectives for the barcoding of species diversity and for environmental sequencing. It bridges the gap between traditional and modern taxonomic disciplines by specifically addressing the issue of how to optimally account for both genetic divergence and given species concepts.
-
3.43 Impact points
Methods for comparative metagenomics.
BMC bioinformatics. 02/2009; 10 Suppl 1:S12.
BACKGROUND: Metagenomics is a rapidly growing field of research that aims at studying uncultured organisms to understand the true diversity of microbes, their functions, cooperation and evolution, in environments such as soil, water, ancient remains of animals, or the digestive system of animals and humans. The recent development of ultra-high throughput sequencing technologies, which do not require cloning or PCR amplification, and can produce huge numbers of DNA reads at an affordable cost, has boosted the number and scope of metagenomic sequencing projects. Increasingly, there is a need for new ways of comparing multiple metagenomics datasets, and for fast and user-friendly implementations of such approaches. RESULTS: This paper introduces a number of new methods for interactively exploring, analyzing and comparing multiple metagenomic datasets, which will be made freely available in a new, comparative version 2.0 of the stand-alone metagenome analysis tool MEGAN. CONCLUSION: There is a great need for powerful and user-friendly tools for comparative analysis of metagenomic data and MEGAN 2.0 will help to fill this gap.
-
Large-Scale Co-Phylogenetic Analysis on the Grid
International Journal of Grid and High Performance Computing. 01/2009; 1:39-54.
Phylogenetic data analysis represents an extremely compute-intensive area of Bioinformatics and thus requires high-performance technologies. Another compute- and memory-intensive problem is that of host-parasite co-phylogenetic analysis: given two phylogenetic trees, one for the hosts (e.g., mammals) and one for their respective parasites (e.g., lice), the question arises whether host and parasite trees are more similar to each other than expected by chance alone. CopyCat is an easy-to-use tool that allows biologists to conduct such co-phylogenetic studies within an elaborate statistical framework based on the highly optimized sequential and parallel AxParafit program. We have developed enhanced versions of these tools that efficiently exploit a Grid environment and therefore facilitate large-scale data analyses. Furthermore, we developed a freely accessible client tool that provides co-phylogenetic analysis capabilities. Since the computational bulk of the problem is embarrassingly parallel, it is well suited to a computational Grid, which reduces the response time of large-scale analyses.
-
4.41 Impact points
MetaSim: a sequencing simulator for genomics and metagenomics.
PLoS ONE. 02/2008; 3(10):e3373.
BACKGROUND: The new research field of metagenomics is providing exciting insights into various, previously unclassified ecological systems. Next-generation sequencing technologies are producing a rapid increase of environmental data in public databases. There is great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets. METHODOLOGY/PRINCIPAL FINDINGS: To facilitate the development and improvement of metagenomic tools and the planning of metagenomic projects, we introduce a sequencing simulator called MetaSim. Our software can be used to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets. Based on a database of given genomes, the program allows the user to design a metagenome by specifying the number of genomes present at different levels of the NCBI taxonomy, and then to collect reads from the metagenome using a simulation of a number of different sequencing technologies. A population sampler optionally produces evolved sequences based on source genomes and a given evolutionary tree. CONCLUSIONS/SIGNIFICANCE: MetaSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software.
-
4.93 Impact points
COPYCAT: cophylogenetic analysis tool.
Bioinformatics (Oxford, England). 05/2007; 23(7):898-900.
We have developed the software CopyCat, which provides easy and fast access to cophylogenetic analyses. It incorporates a wrapper for the program ParaFit, which conducts a statistical test for the presence of congruence between host and parasite phylogenies. CopyCat offers various features, such as the creation of customized host-parasite association data and the computation of phylogenetic host/parasite trees based on the NCBI taxonomy. AVAILABILITY: CopyCat and its manual are freely available at http://www-ab.informatik.uni-tuebingen.de/software/copycat. SUPPLEMENTARY INFORMATION: Results of the real-world example can be found at http://www-ab.informatik.uni-tuebingen.de/software/copycat or Bioinformatics online.
-
11.34 Impact points
MEGAN analysis of metagenomic data.
Genome research. 04/2007; 17(3):377-86.
Metagenomics is the study of the genomic content of a sample of organisms obtained from a common habitat using targeted or random sequencing. Goals include understanding the extent and role of microbial diversity. The taxonomical content of such a sample is usually estimated by comparison against sequence databases of known sequences. Most published studies use the analysis of paired-end reads, complete sequences of environmental fosmid and BAC clones, or environmental assemblies. Emerging sequencing-by-synthesis technologies with very high throughput are paving the way to low-cost random "shotgun" approaches. This paper introduces MEGAN, a new computer program that allows laptop analysis of large metagenomic data sets. In a preprocessing step, the set of DNA sequences is compared against databases of known sequences using BLAST or another comparison tool. MEGAN is then used to compute and explore the taxonomical content of the data set, employing the NCBI taxonomy to summarize and order the results. A simple lowest common ancestor algorithm assigns reads to taxa such that the taxonomical level of the assigned taxon reflects the level of conservation of the sequence. The software allows large data sets to be dissected without the need for assembly or the targeting of specific phylogenetic markers. It provides graphical and statistical output for comparing different data sets. The approach is applied to several data sets, including the Sargasso Sea data set, a recently published metagenomic data set sampled from a mammoth bone, and several complete microbial genomes. Also, simulations that evaluate the performance of the approach for different read lengths are presented.
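The lowest common ancestor assignment mentioned in this abstract can be illustrated with a minimal sketch. It is not MEGAN's implementation: the toy taxonomy `PARENT` and the function names are hypothetical, and the sketch assumes each read comes with the list of taxa it matched in the comparison step.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); the root has no parent.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_to_root(taxon: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the lineage of `taxon`, ordered from the root down to the taxon."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path


def lowest_common_ancestor(taxa: List[str],
                           parent: Dict[str, Optional[str]]) -> str:
    """Return the deepest taxon shared by the lineages of all matched taxa."""
    if not taxa:
        raise ValueError("a read needs at least one matched taxon")
    paths = [path_to_root(t, parent) for t in taxa]
    lca = paths[0][0]  # every lineage starts at the root
    # Walk down the lineages in parallel until they disagree.
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca


# A read matching two closely related species is placed on their shared phylum.
print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], PARENT))
```

A read whose matches are spread across distant lineages ends up high in the tree, which is the sense in which the assigned level reflects how conserved the sequence is.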
-
3.43 Impact points
AxPcoords & parallel AxParafit: statistical co-phylogenetic analyses on thousands of taxa.
BMC bioinformatics. 02/2007; 8:405.
BACKGROUND: Current tools for co-phylogenetic analyses are not able to cope with the continuous accumulation of phylogenetic data. The sophisticated statistical test for host-parasite co-phylogenetic analyses implemented in Parafit does not allow it to handle large datasets in reasonable times. The Parafit and DistPCoA programs are by far the most compute-intensive components of the Parafit analysis pipeline. We present AxParafit and AxPcoords (Ax stands for Accelerated), which are highly optimized versions of Parafit and DistPCoA, respectively. RESULTS: Both programs have been entirely re-written in C. Via optimization of the algorithm and the C code as well as integration of highly tuned BLAS and LAPACK methods, AxParafit runs 5-61 times faster than Parafit with a lower memory footprint (up to 35% reduction), while the performance benefit increases with growing dataset size. The MPI-based parallel implementation of AxParafit shows good scalability on up to 128 processors, even on medium-sized datasets. The parallel analysis with AxParafit on 128 CPUs for a medium-sized dataset with a 512 by 512 association matrix is more than 1,200/128 times faster per processor than the sequential Parafit run. AxPcoords is 8-26 times faster than DistPCoA and numerically stable on large datasets. We outline the substantial benefits of using parallel AxParafit by example of a large-scale empirical study on smut fungi and their host plants. To the best of our knowledge, this study represents the largest co-phylogenetic analysis to date. CONCLUSION: The highly efficient AxPcoords and AxParafit programs allow for large-scale co-phylogenetic analyses on several thousands of taxa for the first time. In addition, AxParafit and AxPcoords have been integrated into the easy-to-use CopyCat tool.
-
Metagenome Analysis using Megan.
Proceedings of 5th Asia-Pacific Bioinformatics Conference, APBC 2007, 15-17 January 2007, Hong Kong, China; 01/2007
-
3.43 Impact points
Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences.
BMC bioinformatics. 02/2006; 7:350.
BACKGROUND: Phylogenetic methods which do not rely on multiple sequence alignments are important tools in inferring trees directly from completely sequenced genomes. Here, we extend the recently described Genome BLAST Distance Phylogeny (GBDP) strategy to compute phylogenetic trees from all completely sequenced plastid genomes currently available and from a selection of mitochondrial genomes representing the major eukaryotic lineages. BLASTN, TBLASTX, or combinations of both are used to locate high-scoring segment pairs (HSPs) between two sequences from which pairwise similarities and distances are computed in different ways resulting in a total of 96 GBDP variants. The suitability of these distance formulae for phylogeny reconstruction is directly estimated by computing a recently described measure of "treelikeness", the so-called delta value, from the respective distance matrices. Additionally, we compare the trees inferred from these matrices using UPGMA, NJ, BIONJ, FastME, or STC, respectively, with the NCBI taxonomy tree of the taxa under study. RESULTS: Our results indicate that, at this taxonomic level, plastid genomes are much more valuable for inferring phylogenies than are mitochondrial genomes, and that distances based on breakpoints are of little use. Distances based on the proportion of "matched" HSP length to average genome length were best for tree estimation. Additionally we found that using TBLASTX instead of BLASTN and, particularly, combining TBLASTX and BLASTN leads to a small but significant increase in accuracy. Other factors do not significantly affect the phylogenetic outcome. The BIONJ algorithm results in phylogenies most in accordance with the current NCBI taxonomy, with NJ and FastME performing insignificantly worse, and STC performing as well if applied to high quality distance matrices. Delta values are found to be a reliable predictor of phylogenetic accuracy. CONCLUSION: Using the most treelike distance matrices, as judged by their delta values, distance methods are able to recover all major plant lineages, and are more in accordance with Apicomplexa organelles being derived from "green" plastids than from plastids of the "red" type. GBDP-like methods can be used to reliably infer phylogenies from different kinds of genomic data. A framework is established to further develop and improve such methods. Delta values are a topology-independent tool of general use for the development and assessment of distance methods for phylogenetic inference.
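The idea of a distance based on the proportion of matched HSP length to average genome length can be sketched as follows. This is one plausible reading only, not one of the 96 GBDP variants as published: the interval representation, the function names, and the exact normalization are assumptions made for illustration. Overlapping HSPs are merged so that no position is counted twice.

```python
from typing import List, Tuple

# Assumption: an HSP is reduced to the half-open interval [start, end)
# it covers on one genome.
Interval = Tuple[int, int]


def covered_length(intervals: List[Interval]) -> int:
    """Total number of positions covered by the union of the intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def coverage_distance(hsps_on_a: List[Interval], hsps_on_b: List[Interval],
                      len_a: int, len_b: int) -> float:
    """Hypothetical distance: 1 - (matched length on both genomes) / (summed lengths).

    Dividing the summed matched length by the summed genome length is the
    same as dividing the average matched length by the average genome length.
    """
    matched = covered_length(hsps_on_a) + covered_length(hsps_on_b)
    return 1.0 - matched / (len_a + len_b)


# Two genomes of length 100, each half covered by HSPs.
print(coverage_distance([(0, 30), (20, 50)], [(10, 60)], 100, 100))
```

Pairwise values computed this way would fill the distance matrix that a method such as NJ or BIONJ then turns into a tree.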
-
29.75 Impact points
Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA.
Science (New York, N.Y.). 02/2006; 311(5759):392-4.
We sequenced 28 million base pairs of DNA in a metagenomics approach, using a woolly mammoth (Mammuthus primigenius) sample from Siberia. As a result of exceptional sample preservation and the use of a recently developed emulsion polymerase chain reaction and pyrosequencing technique, 13 million base pairs (45.4%) of the sequencing reads were identified as mammoth DNA. Sequence identity between our data and African elephant (Loxodonta africana) was 98.55%, consistent with a paleontologically based divergence date of 5 to 6 million years. The sample includes a surprisingly small diversity of environmental DNAs. The high percentage of endogenous DNA recoverable from this single mammoth would allow for completion of its genome, unleashing the field of paleogenomics.
-
Phylogenies from whole genomes: Methodological update within a distance-based framework
published via TOBIAS-lib, http://tobias-lib.uni-tuebingen.de/volltexte/2008/3417/. 01/2006;
Methods which derive pairwise distances directly from completely sequenced genomes are a potentially important and efficient tool within the growing field of phylogenomics. We have shown in two previous studies that the Genome BLAST Distance Phylogeny (GBDP) approach leads to reliable phylogenetic estimates if applied to prokaryotic as well as plastid and mitochondrial genomes. Basically, GBDP first invokes tools such as BLAST to identify high-scoring segment pairs (HSPs) between all pairs of genomes; afterwards, pairwise distances are estimated based on different formulae. Here, we examine (1) a new GBDP distance formula, based on a combination of two previously existing ones; (2) use of BLAT instead of BLASTN and TBLASTX HSP search; (3) an alternative measure for the agreement of a distance matrix with a predefined reference topology; (4) alternative topology-independent measures of distance quality per se. All examinations were based on an enlarged dataset compared to that used in our previous study, additionally containing interesting key taxa.
-
4.93 Impact points
Whole-genome prokaryotic phylogeny.
Bioinformatics (Oxford, England). 06/2005; 21(10):2329-35.
Current understanding of the phylogeny of prokaryotes is based on the comparison of the highly conserved small ssu-rRNA subunit and similar regions. Although such molecules have proved to be very useful phylogenetic markers, mutational saturation is a problem, due to their restricted lengths. Now, a growing number of complete prokaryotic genomes are available. This paper addresses the problem of determining a prokaryotic phylogeny utilizing the comparison of complete genomes. We introduce a new strategy, GBDP, 'genome blast distance phylogeny', and show that different variants of this approach robustly produce phylogenies that are biologically sound, when applied to 91 prokaryotic genomes. In this approach, first Blast is used to compare genomes, then a distance matrix is computed, and finally a tree- or network-reconstruction method such as UPGMA, Neighbor-Joining, BioNJ or Neighbor-Net is applied.
Following (5)
-
Sandra Gesing
Eberhard-Karls-Universität Tübingen
-
Stefan R Henz
Max Planck Institute for Developmental Biology
-
Hannes Planatscher
Eberhard-Karls-Universität Tübingen
-
Vera Hemleben
Eberhard-Karls-Universität Tübingen