Jens Stoye
Publications
-
3.76Impact points
The complete genome sequence of the acarbose producer Actinoplanes sp. SE50/110.
BMC genomics. 03/2012; 13(1):112.
ABSTRACT: BACKGROUND: Actinoplanes sp. SE50/110 is known as the wild type producer of the alpha-glucosidase inhibitor acarbose, a potent drug used worldwide in the treatment of type-2 diabetes mellitus. As the incidence of diabetes is rapidly rising worldwide, an ever increasing demand for diabetes ... [more] ABSTRACT: BACKGROUND: Actinoplanes sp. SE50/110 is known as the wild type producer of the alpha-glucosidase inhibitor acarbose, a potent drug used worldwide in the treatment of type-2 diabetes mellitus. As the incidence of diabetes is rapidly rising worldwide, an ever increasing demand for diabetes drugs, such as acarbose, needs to be anticipated. Consequently, derived Actinoplanes strains with increased acarbose yields are being used in large scale industrial batch fermentation since 1990 and were continuously optimized by conventional mutagenesis and screening experiments. This strategy reached its limits and is generally superseded by modern genetic engineering approaches. As a prerequisite for targeted genetic modifications, the complete genome sequence of the organism has to be known. RESULTS: Here, we present the complete genome sequence of Actinoplanes sp. SE50/110 [GenBank:CP003170], the first publicly available genome of the genus Actinoplanes, comprising various producers of pharmaceutically and economically important secondary metabolites. The genome features a high mean G+C content of 71.32% and consists of one circular chromosome with a size of 9,239,851 bp hosting 8,270 predicted protein coding sequences. Phylogenetic analysis of the core genome revealed a rather distant relation to other sequenced species of the family Micromonosporaceae whereas Actinoplanes utahensis was found to be the closest species based on 16S rDNA comparison. Besides the already published acarbose biosynthetic gene cluster sequence, several new non-ribosomal peptide synthetase-, polyketide synthase- and hybrid-clusters were identified on the Actinoplanes genome. Another key feature of the genome represents the discovery of a functional actinomycete integrative and conjugative element. CONCLUSIONS: The complete genome sequence of Actinoplanes sp. SE50/110 marks an important step towards the rational genetic optimization of the acarbose production. In this regard, the identified actinomycete integrative and conjugative element could play a central role by providing the basis for the development of a genetic transformation system for Actinoplanes sp. SE50/110 and other Actinoplanes spp. Furthermore, the identified non-ribosomal peptide synthetase- and polyketide synthase-clusters potentially encode new antibiotics and / or other bioactive compounds, which might be of pharmacologic interest.
-
1.69Impact points
Consistency of sequence-based gene clusters.
Journal of computational biology : a journal of computational molecular cell biology. 09/2011; 18(9):1023-39.
In comparative genomics, differences or similarities of gene orders are determined to predict functional relations of genes or phylogenetic relations of genomes. For this purpose, various combinatorial models can be used to specify gene clusters--groups of genes that are co-located in a set of genom... [more] In comparative genomics, differences or similarities of gene orders are determined to predict functional relations of genes or phylogenetic relations of genomes. For this purpose, various combinatorial models can be used to specify gene clusters--groups of genes that are co-located in a set of genomes. Several approaches have been proposed to reconstruct putative ancestral gene clusters based on the gene order of contemporary species. One prevalent and natural reconstruction criterion is consistency: For a set of reconstructed gene clusters, there should exist a gene order that comprises all given clusters. For permutation-based gene cluster models, efficient methods exist to verify this condition. In this article, we discuss the consistency problem for different gene cluster models on sequences with restricted gene multiplicities. Our results range from linear-time algorithms for the simple model of adjacencies to NP-completeness proofs for more complex models like common intervals.
-
1.69Impact points
Double cut and join with insertions and deletions.
Journal of computational biology : a journal of computational molecular cell biology. 09/2011; 18(9):1167-84.
Many approaches to compute the genomic distance are still limited to genomes with the same content, without duplicated markers. However, differences in the gene content are frequently observed and can reflect important evolutionary aspects. While duplicated markers can hardly be handled by exact mod... [more] Many approaches to compute the genomic distance are still limited to genomes with the same content, without duplicated markers. However, differences in the gene content are frequently observed and can reflect important evolutionary aspects. While duplicated markers can hardly be handled by exact models, when duplicated markers are not allowed, a few polynomial time algorithms that include genome rearrangements, insertions and deletions were already proposed. In an attempt to improve these results, in the present work we give the first linear time algorithm to compute the distance between two multichromosomal genomes with unequal content, but without duplicated markers, considering insertions, deletions and double cut and join (DCJ) operations. We derive from this approach algorithms to sort one genome into another one also using DCJ operations, insertions and deletions. The optimal sorting scenarios can have different compositions and we compare two types of sorting scenarios: one that maximizes and one that minimizes the number of DCJ operations with respect to the number of insertions and deletions. We also show that, although the triangle inequality can be disrupted in the proposed genomic distance, it is possible to correct this problem adopting a surcharge on the number of non-common markers. We use our method to analyze six species of Rickettsia, a group of obligate intracellular parasites, and identify preliminary evidence of clusters of deletions.
-
1.69Impact points
Restricted DCJ model: rearrangement problems with chromosome reincorporation.
Journal of computational biology : a journal of computational molecular cell biology. 09/2011; 18(9):1231-41.
We study three classical problems of genome rearrangement--sorting, halving, and the median problem--in a restricted double cut and join (DCJ) model. In the DCJ model, introduced by Yancopoulos et al., we can represent rearrangement events that happen in multichromosomal genomes, such as inversions,... [more] We study three classical problems of genome rearrangement--sorting, halving, and the median problem--in a restricted double cut and join (DCJ) model. In the DCJ model, introduced by Yancopoulos et al., we can represent rearrangement events that happen in multichromosomal genomes, such as inversions, translocations, fusions, and fissions. Two DCJ operations can mimic transpositions or block interchanges by first extracting an appropriate segment of a chromosome, creating a temporary circular chromosome, and then reinserting it in its proper place. In the restricted model, we are concerned with multichromosomal linear genomes and we require that each circular excision is immediately followed by its reincorporation. Existing linear-time DCJ sorting and halving algorithms ignore this reincorporation constraint. In this article, we propose a new algorithm for the restricted sorting problem running in O(n log n) time, thus improving on the known quadratic time algorithm. We solve the restricted halving problem and give an algorithm that computes a multilinear halved genome in linear time. Finally, we show that the restricted median problem is NP-hard as conjectured.
-
7.48Impact points
Taxonomic classification of metagenomic shotgun sequences with CARMA3.
Nucleic acids research. 05/2011; 39(14):e91.
The vast majority of microbes are unculturable and thus cannot be sequenced by means of traditional methods. High-throughput sequencing techniques like 454 or Solexa-Illumina make it possible to explore those microbes by studying whole natural microbial communities and analysing their biological div... [more] The vast majority of microbes are unculturable and thus cannot be sequenced by means of traditional methods. High-throughput sequencing techniques like 454 or Solexa-Illumina make it possible to explore those microbes by studying whole natural microbial communities and analysing their biological diversity as well as the underlying metabolic pathways. Over the past few years, different methods have been developed for the taxonomic and functional characterization of metagenomic shotgun sequences. However, the taxonomic classification of metagenomic sequences from novel species without close homologue in the biological sequence databases poses a challenge due to the high number of wrong taxonomic predictions on lower taxonomic ranks. Here we present CARMA3, a new method for the taxonomic classification of assembled and unassembled metagenomic sequences that has been adapted to work with both BLAST and HMMER3 homology searches. We show that our method makes fewer wrong taxonomic predictions (at the same sensitivity) than other BLAST-based methods. CARMA3 is freely accessible via the web application WebCARMA from http://webcarma.cebitec.uni-bielefeld.de.
-
2.88Impact points
Sequencing of high G+C microbial genomes using the ultrafast pyrosequencing technology.
Journal of biotechnology. 04/2011; 155(1):68-77.
Next generation pyrosequencing of high G+C content genomes still poses problems to automated sequencing and assembly processes which necessitates cost and time intensive manual work in order to finish such genomes completely. The sequencing of the high G+C actinomycete Actinoplanes sp. SE50/110 was ... [more] Next generation pyrosequencing of high G+C content genomes still poses problems to automated sequencing and assembly processes which necessitates cost and time intensive manual work in order to finish such genomes completely. The sequencing of the high G+C actinomycete Actinoplanes sp. SE50/110 was performed with standard pyrosequencing technology (454 Life Sciences) and revealed a high number of gaps. The reasons for the introduction of gaps were analyzed on a previously known 41kb long DNA reference sequence from Actinoplanes sp. SE50/110, hosting the acarbose biosynthesis gene cluster. Mapping of the sequencing results on the reference gene cluster sequence revealed a fragmentation into 30 contiguous sequences of different lengths. The gaps between these sequences were characterized by extremely low read coverage which strongly correlated with the G+C content in the gap regions in a negative manner. Furthermore, the gap-sequences contained strong stem-loop structures which hindered the amplification of these sequences during the emulsion PCR. Being significantly underrepresented or absent in the subsequent sequencing process, these sequences lead to weakly or uncovered genomic regions which forces the assembly algorithm to output multiple contiguous sequences instead of one finished genome. However, by applying a different pyrosequencing protocol, it was possible to sequence the complete acarbose biosynthesis gene cluster. The changes to the protocol include longer read length and addition of chemicals to the amplification chemistry, which reduces the self-annealing of DNA fragments during the amplification process and enables the complete reconstruction of high G+C content genomes without manual intervention.
-
4.93Impact points
Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming.
Bioinformatics (Oxford, England). 03/2011; 27(10):1351-8.
The introduction of next-generation sequencing techniques and especially the high-throughput systems Solexa (Illumina Inc.) and SOLiD (ABI) made the mapping of short reads to reference sequences a standard application in modern bioinformatics. Short-read alignment is needed for reference based re-se... [more] The introduction of next-generation sequencing techniques and especially the high-throughput systems Solexa (Illumina Inc.) and SOLiD (ABI) made the mapping of short reads to reference sequences a standard application in modern bioinformatics. Short-read alignment is needed for reference based re-sequencing of complete genomes as well as for gene expression analysis based on transcriptome sequencing. Several approaches were developed during the last years allowing for a fast alignment of short sequences to a given template. Methods available to date use heuristic techniques to gain a speedup of the alignments, thereby missing possible alignment positions. Furthermore, most approaches return only one best hit for every query sequence, thus losing the potentially valuable information of alternative alignment positions with identical scores. We developed SARUMAN (Semiglobal Alignment of short Reads Using CUDA and NeedleMAN-Wunsch), a mapping approach that returns all possible alignment positions of a read in a reference sequence under a given error threshold, together with one optimal alignment for each of these positions. Alignments are computed in parallel on graphics hardware, facilitating an considerable speedup of this normally time-consuming step. Combining our filter algorithm with CUDA-accelerated alignments, we were able to align reads to microbial genomes in time comparable or even faster than all published approaches, while still providing an exact, complete and optimal result. At the same time, SARUMAN runs on every standard Linux PC with a CUDA-compatible graphics accelerator. http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html.
-
Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming.
Bioinformatics. 01/2011; 27:1351-1358.
-
3.43Impact points
On the weight of indels in genomic distances.
BMC bioinformatics. 01/2011; 12 Suppl 9:S13.
Classical approaches to compute the genomic distance are usually limited to genomes with the same content, without duplicated markers. However, differences in the gene content are frequently observed and can reflect important evolutionary aspects. A few polynomial time algorithms that include genome... [more] Classical approaches to compute the genomic distance are usually limited to genomes with the same content, without duplicated markers. However, differences in the gene content are frequently observed and can reflect important evolutionary aspects. A few polynomial time algorithms that include genome rearrangements, insertions and deletions (or substitutions) were already proposed. These methods often allow a block of contiguous markers to be inserted, deleted or substituted at once but result in distance functions that do not respect the triangular inequality and hence do not constitute metrics. In the present study we discuss the disruption of the triangular inequality in some of the available methods and give a framework to establish an efficient correction for two models recently proposed, one that includes insertions, deletions and double cut and join (DCJ) operations, and one that includes substitutions and DCJ operations. We show that the proposed framework establishes the triangular inequality in both distances, by summing a surcharge on indel operations and on substitutions that depends only on the number of markers affected by these operations. This correction can be applied a posteriori, without interfering with the already available formulas to compute these distances. We claim that this correction leads to distances that are biologically more plausible.
-
3.43Impact points
Genomic distance under gene substitutions.
BMC bioinformatics. 01/2011; 12 Suppl 9:S8.
The distance between two genomes is often computed by comparing only the common markers between them. Some approaches are also able to deal with non-common markers, allowing the insertion or the deletion of such markers. In these models, a deletion and a subsequent insertion that occur at the same p... [more] The distance between two genomes is often computed by comparing only the common markers between them. Some approaches are also able to deal with non-common markers, allowing the insertion or the deletion of such markers. In these models, a deletion and a subsequent insertion that occur at the same position of the genome count for two sorting steps. Here we propose a new model that sorts non-common markers with substitutions, which are more powerful operations that comprehend insertions and deletions. A deletion and an insertion that occur at the same position of the genome can be modeled as a substitution, counting for a single sorting step. Comparing genomes with unequal content, but without duplicated markers, we give a linear time algorithm to compute the genomic distance considering substitutions and double-cut-and-join (DCJ) operations. This model provides a parsimonious genomic distance to handle genomes free of duplicated markers, that is in practice a lower bound to the real genomic distances. The method could also be used to refine orthology assignments, since in some cases a substitution could actually correspond to an unannotated orthology.
-
3.43Impact points
Swiftly computing center strings.
BMC bioinformatics. 01/2011; 12:106.
The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP complete. In t... [more] The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP complete. In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application. We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = dopt -1 and d = dopt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.
-
1.69Impact points
Rearrangement models and single-cut operations.
Journal of computational biology : a journal of computational molecular cell biology. 09/2010; 17(9):1213-25.
There have been many widely used genome rearrangement models, such as reversals, Hannenhalli-Pevzner (HP), and double-cut and join. Though each one can be precisely defined, the general notion of a model remains undefined. In this paper, we give a formal set-theoretic definition, which allows us to ... [more] There have been many widely used genome rearrangement models, such as reversals, Hannenhalli-Pevzner (HP), and double-cut and join. Though each one can be precisely defined, the general notion of a model remains undefined. In this paper, we give a formal set-theoretic definition, which allows us to investigate and prove relationships between distances under various existing and new models. Among our results is that sorting in the HP model is equivalent to sorting in the reversal model when the initial and final genomes are linear uni-chromosomal. We also initiate the formal study of single-cut operations by giving a linear time algorithm for the distance problem under a new single-cut and join model.
-
1.69Impact points
The solution space of sorting by DCJ.
Journal of computational biology : a journal of computational molecular cell biology. 09/2010; 17(9):1145-65.
In genome rearrangements, the double cut and join (DCJ) operation, introduced by Yancopoulos et al. in 2005, allows one to represent most rearrangement events that could happen in multichromosomal genomes, such as inversions, translocations, fusions, and fissions. No restriction on the genome struct... [more] In genome rearrangements, the double cut and join (DCJ) operation, introduced by Yancopoulos et al. in 2005, allows one to represent most rearrangement events that could happen in multichromosomal genomes, such as inversions, translocations, fusions, and fissions. No restriction on the genome structure considering linear and circular chromosomes is imposed. An advantage of this general model is that it leads to considerable algorithmic simplifications compared to other genome rearrangement models. Recently, several works concerning the DCJ operation have been published, and in particular, an algorithm was proposed to find an optimal DCJ sequence for sorting one genome into another one. Here we study the solution space of this problem and give an easy-to-compute formula that corresponds to the exact number of optimal DCJ sorting sequences for a particular subset of instances of the problem. We also give an algorithm to count the number of optimal sorting sequences for any instance of the problem. Another interesting result is the demonstration of the possibility of obtaining one optimal sorting sequence by properly replacing any pair of consecutive operations in another optimal sequence. As a consequence, any optimal sorting sequence can be obtained from one other by applying such replacements successively, but the problem of finding the shortest number of replacements between two sorting sequences is still open.
-
16.87Impact points
-
Genomic Distance with DCJ and Indels.
Algorithms in Bioinformatics, 10th International Workshop, WABI 2010, Liverpool, UK, September 6-8, 2010. Proceedings; 01/2010
-
The Problem of Chromosome Reincorporation in DCJ Sorting and Halving.
Comparative Genomics - International Workshop, RECOMB-CG 2010, Ottawa, Canada, October 9-11, 2010. Proceedings; 01/2010
-
The complete genome sequence of Corynebacterium pseudotuberculosis FRC41 isolated from a 12-year-old girl with necrotizing lymphadenitis reveals insights into gene-regulatory networks contributing to virulence
BMC Genomics. 01/2010;
Abstract Background Corynebacterium pseudotuberculosis is generally regarded as an important animal pathogen that rarely infects humans. Clinical strains are occasionally recovered from human cases of lymphadenitis, such as C. pseudotuberculosis FRC41 that was isolated from the inguinal lymph node o... [more] Abstract Background Corynebacterium pseudotuberculosis is generally regarded as an important animal pathogen that rarely infects humans. Clinical strains are occasionally recovered from human cases of lymphadenitis, such as C. pseudotuberculosis FRC41 that was isolated from the inguinal lymph node of a 12-year-old girl with necrotizing lymphadenitis. To detect potential virulence factors and corresponding gene-regulatory networks in this human isolate, the genome sequence of C. pseudotuberculosis FCR41 was determined by pyrosequencing and functionally annotated. Results Sequencing and assembly of the C. pseudotuberculosis FRC41 genome yielded a circular chromosome with a size of 2,337,913 bp and a mean G+C content of 52.2%. Specific gene sets associated with iron and zinc homeostasis were detected among the 2,110 predicted protein-coding regions and integrated into a gene-regulatory network that is linked with both the central metabolism and the oxidative stress response of FRC41. Two gene clusters encode proteins involved in the sortase-mediated polymerization of adhesive pili that can probably mediate the adherence to host tissue to facilitate additional ligand-receptor interactions and the delivery of virulence factors. The prominent virulence factors phospholipase D (Pld) and corynebacterial protease CP40 are encoded in the genome of this human isolate. The genome annotation revealed additional serine proteases, neuraminidase H, nitric oxide reductase, an invasion-associated protein, and acyl-CoA carboxylase subunits involved in mycolic acid biosynthesis as potential virulence factors. The cAMP-sensing transcription regulator GlxR plays a key role in controlling the expression of several genes contributing to virulence. Conclusion The functional data deduced from the genome sequencing and the extended knowledge of virulence factors indicate that the human isolate C. pseudotuberculosis FRC41 is equipped with a distinct gene set promoting its survival under unfavorable environmental conditions encountered in the mammalian host.
-
Repeat-aware Comparative Genome Assembly.
German Conference on Bioinformatics 2010, September 20-22, 2010, Technische Universität Carolo Wilhelmina zu Braunschweig, Germany; 01/2010
Following (11)
-
Dorothea Emig
University of California, San Diego -
Lutz Krause
Queensland Institute of Medical Research -
Sebastian Jünemann
Universität Bielefeld -
Karsten Niehaus
Universität Bielefeld -
Eva Trost
Universität Bielefeld