Sonja J Prohaska

University of Leipzig, Leipzig, Saxony, Germany

Publications (101)

  • Henrike Indrischek · Nicolas Wieseke · Peter F. Stadler · Sonja J. Prohaska
    ABSTRACT: Background: The accurate annotation of genes in newly sequenced genomes remains a challenge. Although sophisticated comparative pipelines are available, computationally derived gene models are often less than perfect. This is particularly true when multiple similar paralogs are present. The issue is aggravated further when genomes are assembled only at a preliminary draft level to contigs or short scaffolds. However, these genomes deliver valuable information for studying gene families. High accuracy models of protein coding genes are needed in particular for phylogenetics and for the analysis of gene family histories. Results: We present a pipeline, ExonMatchSolver, that is designed to help the user to produce and curate high quality models of the protein-coding part of genes. The tool in particular tackles the problem of identifying those coding exon groups that belong to the same paralogous genes in a fragmented genome assembly. This paralog-to-contig assignment problem is shown to be NP-complete. It is phrased and solved as an Integer Linear Programming problem. Conclusions: The ExonMatchSolver-pipeline can be employed to build highly accurate models of protein coding genes even when spanning several genomic fragments. This sets the stage for a better understanding of the evolutionary history within particular gene families which possess a large number of paralogs and in which frequent gene duplication events occurred.
    Full-text Article · Dec 2016 · Algorithms for Molecular Biology
  • ABSTRACT: Function is a central concept in biological theories and explanations. Yet discussions about function are often based on a narrow understanding of biological systems and processes, such as idealized molecular systems or simple evolutionary, i.e., selective, dynamics. Conflicting conceptions of function continue to be used in the scientific literature to support certain claims, for instance about the fraction of "functional DNA" in the human genome. Here we argue that all biologically meaningful interpretations of function are necessarily context dependent. This implies that they derive their meaning as well as their range of applicability only within a specific theoretical and measurement context. We use this framework to shed light on the current debate about functional DNA and argue that without considering explicitly the theoretical and measurement contexts all attempts to integrate biological theories are prone to fail.
    Full-text Article · Oct 2015 · Theory in Biosciences
  • Article · Sep 2015
  • ABSTRACT: Background: Dynamic programming algorithms provide exact solutions to many problems in computational biology, such as sequence alignment, RNA folding, hidden Markov models (HMMs), and scoring of phylogenetic trees. Structurally analogous algorithms compute optimal solutions, evaluate score distributions, and perform stochastic sampling. This is explained in the theory of Algebraic Dynamic Programming (ADP) by a strict separation of state space traversal (usually represented by a context free grammar), scoring (encoded as an algebra), and choice rule. A key ingredient in this theory is the use of yield parsers that operate on the ordered input data structure, usually strings or ordered trees. The computation of ensemble properties, such as a posteriori probabilities of HMMs or partition functions in RNA folding, requires the combination of two distinct, but intimately related algorithms, known as the inside and the outside recursion. Only the inside recursions are covered by the classical ADP theory. Results: The ideas of ADP are generalized to a much wider scope of data structures by relaxing the concept of parsing. This allows us to formalize the conceptual complementarity of inside and outside variables in a natural way. We demonstrate that outside recursions are generically derivable from inside decomposition schemes. In addition to rephrasing the well-known algorithms for HMMs, pairwise sequence alignment, and RNA folding we show how the TSP and the shortest Hamiltonian path problem can be implemented efficiently in the extended ADP framework. As a showcase application we investigate the ancient evolution of HOX gene clusters in terms of shortest Hamiltonian paths. Conclusions: The generalized ADP framework presented here greatly facilitates the development and implementation of dynamic programming algorithms for a wide spectrum of applications. http://www.bioinf.uni-leipzig.de/Software/gADP/
    Article · Aug 2015 · BMC Bioinformatics
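The separation of traversal, scoring, and choice described in the abstract above can be illustrated with a small sketch. The code below is not taken from the gADP software; it is a plain Python subset dynamic program (in the style of Held-Karp) for Hamiltonian paths, and the function name `hamiltonian_path_dp` and its parameters `plus`, `times`, `zero`, `one`, and `weight` are hypothetical names chosen for this illustration. The same traversal is run twice with two different scoring algebras: min-plus for the shortest path length, and a counting algebra for the number of paths.

```python
import math
import operator


def hamiltonian_path_dp(dist, plus, times, zero, one, weight):
    """Evaluate all Hamiltonian paths over (visited set, last node) states.

    plus   : choice rule combining alternatives (e.g. min, or addition)
    times  : combines a sub-solution score with one edge score
    zero   : neutral element of plus
    one    : score of a path consisting of a single node
    weight : maps a raw distance to an edge score in the chosen algebra
    """
    n = len(dist)
    full = (1 << n) - 1
    table = {}
    # Masks are visited in increasing order, so every smaller subset
    # is already filled in when it is needed.
    for mask in range(1, full + 1):
        for j in range(n):
            if not mask & (1 << j):
                continue
            rest = mask ^ (1 << j)
            if rest == 0:
                table[mask, j] = one  # path that starts and ends at j
                continue
            acc = zero
            for i in range(n):
                if rest & (1 << i):
                    acc = plus(acc, times(table[rest, i], weight(dist[i][j])))
            table[mask, j] = acc
    result = zero
    for j in range(n):
        result = plus(result, table[full, j])
    return result


# Toy symmetric distance matrix (made-up values for illustration only).
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]

# Min-plus algebra: length of the shortest Hamiltonian path.
shortest = hamiltonian_path_dp(dist, min, operator.add, math.inf, 0, lambda d: d)

# Counting algebra: number of directed Hamiltonian paths.
n_paths = hamiltonian_path_dp(dist, operator.add, operator.mul, 0, 1, lambda d: 1)

print(shortest, n_paths)
```

Only the algebra arguments change between the two calls; the traversal of the state space is shared, which is the point the abstract makes about ADP.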
  • ABSTRACT: Background: Kiwi, comprising five species from the genus Apteryx, are endangered, ground-dwelling bird species endemic to New Zealand. They are the smallest and only nocturnal representatives of the ratites. The timing of kiwi adaptation to a nocturnal niche and the genomic innovations, which shaped sensory systems and morphology to allow this adaptation, are not yet fully understood. Results: We sequenced and assembled the brown kiwi genome to 150-fold coverage and annotated the genome using kiwi transcript data and non-redundant protein information from multiple bird species. We identified evolutionary sequence changes that underlie adaptation to nocturnality and estimated the onset time of these adaptations. Several opsin genes involved in color vision are inactivated in the kiwi. We date this inactivation to the Oligocene epoch, likely after the arrival of the ancestor of modern kiwi in New Zealand. Genome comparisons between kiwi and representatives of ratites, Galloanserae, and Neoaves, including nocturnal and song birds, show diversification of kiwi’s odorant receptor repertoire, which may reflect an increased reliance on olfaction rather than sight during foraging. Further, there is an enrichment of genes influencing mitochondrial function and energy expenditure among genes that are rapidly evolving specifically on the kiwi branch, which may also be linked to its nocturnal lifestyle. Conclusions: The genomic changes in kiwi vision and olfaction are consistent with changes that are hypothesized to occur during adaptation to nocturnal lifestyle in mammals. The kiwi genome provides a valuable genomic resource for future genome-wide comparative analyses to other extinct and extant diurnal ratites.
    Full-text Article · Jul 2015 · Genome biology
  • Heike Betat · Tobias Mede · Sandy Tretbar · [...] · Sonja J Prohaska
    ABSTRACT: Transfer RNAs (tRNAs) require the absolutely conserved sequence motif CCA at their 3'-ends, representing the site of aminoacylation. In the majority of organisms, this trinucleotide sequence is not encoded in the genome and thus has to be added post-transcriptionally by the CCA-adding enzyme, a specialized nucleotidyltransferase. In eukaryotic genomes this ubiquitous and highly conserved enzyme family is usually represented by a single gene copy. Analysis of published sequence data allows us to pin down the unusual evolution of eukaryotic CCA-adding enzymes. We show that the CCA-adding enzymes of animals originated from a horizontal gene transfer event in the stem lineage of Holozoa, i.e. Metazoa (animals) and their unicellular relatives, the Choanozoa. The tRNA nucleotidyltransferase, acquired from an α-proteobacterium, replaced the ancestral enzyme in Metazoa. However, in Choanoflagellata, the group of Choanozoa that is closest to Metazoa, both the ancestral and the horizontally transferred CCA-adding enzymes have survived. Furthermore, our data refute a mitochondrial origin of the animal tRNA nucleotidyltransferases. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
    Full-text Article · Jun 2015 · Nucleic Acids Research
  • ABSTRACT: We present an efficient generalization of algebraic dynamic programming (ADP) to unordered data types and a formalism for the automated derivation of outside grammars from their inside progenitors. These theoretical contributions are illustrated by ADP-style algorithms for shortest Hamiltonian path problems. These arise naturally when asking whether the evolutionary history of an ancient gene cluster can be explained by a series of local tandem duplications. Our framework makes it easy to compute maximum accuracy solutions, which in turn require the computation of the probabilities of individual edges in the ensemble of Hamiltonian paths. The expansion of the Hox gene clusters is investigated as a show-case application. For implementation details see: http://www.bioinf.uni-leipzig.de/Software/setgram/
    Conference Paper · Oct 2014
  • ABSTRACT: The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.
    Full-text Article · Aug 2014 · PLoS ONE
  • ABSTRACT: The cell cycle genes homology region (CHR) has been identified as a DNA element with an important role in transcriptional regulation of late cell cycle genes. It has been shown that such genes are controlled by DREAM, MMB and FOXM1-MuvB and that these protein complexes can contact DNA via CHR sites. However, it has not been elucidated which sequence variations of the canonical CHR are functional and how frequent CHR-based regulation is utilized in mammalian genomes. Here, we define the spectrum of functional CHR elements. As the basis for a computational meta-analysis, we identify new CHR sequences and compile phylogenetic motif conservation as well as genome-wide protein-DNA binding and gene expression data. We identify CHR elements in most late cell cycle genes binding DREAM, MMB, or FOXM1-MuvB. In contrast, Myb- and forkhead-binding sites are underrepresented in both early and late cell cycle genes. Our findings support a general mechanism: sequential binding of DREAM, MMB and FOXM1-MuvB complexes to late cell cycle genes requires CHR elements. Taken together, we define the group of CHR-regulated genes in mammalian genomes and provide evidence that the CHR is the central promoter element in transcriptional regulation of late cell cycle genes by DREAM, MMB and FOXM1-MuvB.
    Full-text Article · Aug 2014 · Nucleic Acids Research
  • Dirk Zeckzer · Daniel Gerighausen · Lydia Steiner · Sonja J. Prohaska
    ABSTRACT: Background: Over recent years, more and more biological data have become available. Besides the sheer amount of new data, its dimensionality (the number of different attributes per data point) has also increased. Recently, the amount of data on chromatin and its modifications in particular has increased considerably. In the field of epigenetics, appropriate visualization tools designed for highlighting the different aspects of epigenetic data are currently not available. Results: We present a tool called TiBi-Scatter enabling correlation analysis in 2D. This approach allows for analyzing multidimensional data while keeping the use of resources such as memory small. Thus, it is in particular applicable to large data sets. Conclusions: TiBi-Scatter is a resource-friendly and easy to use tool that allows for the hypothesis-free analysis of large multidimensional biological data sets.
    Full-text Conference Paper · Jul 2014
  • ABSTRACT: Enzymatic splicing in archaeal tRNAs is guided by bulge-helix-bulge (BHB) structural elements, while much less seems to be known about splicing in other small RNAs (sRNAs). We conduct a genome-wide analysis of several archaeal genomes to identify putative BHB elements and compare our findings with available RNA-seq data. We also provide an analysis of the viability of using pattern-based and stochastic structural scanning algorithms for in silico studies of the occurrence of BHB motifs. Furthermore, we comment on splicing motifs in other small RNAs, which mostly do not fit the pattern of bulge-helix-bulge motifs. Appendix and supporting files available at: http://www.bioinf.uni-leipzig.de/publications/supplements/14-001
    Conference Paper · Apr 2014
  • Daniel Gerighausen · Dirk Zeckzer · Lydia Steiner · Sonja J Prohaska
    ABSTRACT: Epigenetics studies heritable phenotypic changes which are not due to changes in the DNA sequence. The molecular basis is chromatin forming a “beads on a string”-like structure of histones. Here we present a new tool called ChromatinVis to visualize ChIP-seq data. Before visualization, the histone modification data is segmented, typically yielding several millions of data points. In our example, we process data from three cell types and three modifications resulting in eight combinations. The challenging problem is to study the global changes of histone modifications between different cell types. The data are clustered using the k-means++ algorithm. For each cluster we allow the user to study the global and local distribution of histone marks using radial windmill charts. To analyze the configuration of the clusters in the data space we use scatterplots in combination with a principal component analysis. A multitude of filtering options and several methods for outlier detection, like calculation of silhouette coefficients, allow the user to improve clustering. From a biological perspective, the tool gives a deeper insight into the relationship between histone modifications.
    Full-text Conference Paper · Mar 2014
  • Full-text Dataset · Oct 2013
  • Arli A Parikesit · Lydia Steiner · Peter F Stadler · Sonja J Prohaska
    ABSTRACT: Most investigations into the large-scale patterns of protein evolution are based on gene annotations that have been compiled in reference databases. The use of these resources for quantitative comparisons, however, is complicated by sometimes vast differences in coverage. More importantly, we also observe substantial ascertainment biases that cannot be removed by simple normalization procedures. A striking example is provided by the correlations between protein domains. We observe that statistics derived from different computational gene annotation procedures show dramatic discrepancies, and even qualitative changes from negative to positive correlation, when compared to statistics obtained from annotation databases.
    Article · Jul 2013
  • Christian Arnold · Peter F Stadler · Sonja J Prohaska
    ABSTRACT: Eukaryotic histones carry a diverse set of specific chemical modifications that accumulate over the life-time of a cell and have a crucial impact on the cell state in general and the transcriptional program in particular. Replication constitutes a dramatic disruption of the chromatin states that effectively amounts to partial erasure of stored information. To preserve its epigenetic state the cell reconstructs (at least part of) the histone modifications by means of processes that are still very poorly understood. A plausible hypothesis is that the different combinations of reader and writer domains in histone-modifying enzymes implement local rewriting rules that are capable of "recomputing" the desired parental modification patterns on the basis of the partial information contained in that half of the nucleosomes that predate replication. To test whether such a mechanism is theoretically feasible, we have developed a flexible stochastic simulation system (available at http://www.bioinf.uni-leipzig.de/Software/StoChDyn) for studying the dynamics of histone modification states. The implementation is based on Gillespie's approach, i.e., it models the master equation of a detailed chemical model. It is efficient enough to use an evolutionary algorithm to find patterns across multiple cell divisions with high accuracy. We found that it is easy to evolve a system of enzymes that can maintain a particular chromatin state roughly stable, even without explicit boundary elements separating differentially modified chromatin domains. However, the success of this task depends on several previously unanticipated factors, such as the length of the initial state, the specific pattern that should be maintained, the time between replications, and chemical parameters such as enzymatic binding and dissociation rates. All these factors also influence the accumulation of errors in the wake of cell divisions.
    Article · Jul 2013 · Journal of Theoretical Biology
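As a rough illustration of what "Gillespie's approach" in the abstract above refers to, the sketch below runs the direct-method stochastic simulation on a deliberately simplified toy model. It is not the StoChDyn implementation and does not model reader/writer enzymes or replication: it assumes `n_sites` independent nucleosomes that each switch between unmodified and modified with hypothetical rate constants `k_on` and `k_off`. The function name `gillespie_two_state` is likewise made up for this example.

```python
import random


def gillespie_two_state(n_sites, k_on, k_off, t_end, seed=0):
    """Direct-method Gillespie simulation of a toy two-state model.

    Each of n_sites nucleosomes is either unmodified or modified.
    Reactions (assumed for illustration):
      unmodified -> modified    with rate k_on  per unmodified site
      modified   -> unmodified  with rate k_off per modified site
    Returns a list of (time, number_of_modified_sites) events.
    """
    rng = random.Random(seed)
    t = 0.0
    modified = 0
    trajectory = [(t, modified)]
    while True:
        # Propensities of the two reaction channels in the current state.
        a_on = k_on * (n_sites - modified)
        a_off = k_off * modified
        a_total = a_on + a_off
        if a_total == 0:
            break  # no reaction can fire any more
        # Waiting time to the next event is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the channel with probability proportional to its propensity.
        if rng.random() * a_total < a_on:
            modified += 1
        else:
            modified -= 1
        trajectory.append((t, modified))
    return trajectory


# Example run with made-up parameters.
trajectory = gillespie_two_state(n_sites=20, k_on=1.0, k_off=0.5, t_end=10.0)
print(len(trajectory), trajectory[-1])
```

Each step draws one waiting time and one reaction channel, so the trajectory is an exact sample from the master equation of this toy model; richer rule sets would add more channels to the propensity sum.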
  • ABSTRACT: The discovery of a living coelacanth specimen in 1938 was remarkable, as this lineage of lobe-finned fish was thought to have become extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues show the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
    Full-text Article · Apr 2013 · Nature
  • Full-text Article · Mar 2013 · Physical Biology
  • ABSTRACT: Chromatin-related mechanisms, such as histone modifications, are known to be involved in regulatory switches within the transcriptome. Only recently have mathematical models of these mechanisms been established. So far they have not been applied to genome-wide data. We here introduce a mathematical model of transcriptional regulation by histone modifications and apply it to data of trimethylation of histone 3 at lysine 4 (H3K4me3) and 27 (H3K27me3) in mouse pluripotent and lineage-committed cells. The model describes binding of protein complexes to chromatin which are capable of reading and writing histone marks. Molecular interactions of the complexes with DNA and modified histones create a regulatory switch of transcriptional activity. The regulatory states of the switch depend on the activity of histone (de-)methylases, the strength of complex-DNA-binding and the number of nucleosomes capable of cooperatively contributing to complex-binding. Our model explains experimentally measured length distributions of modified chromatin regions. It suggests (i) that high CpG-density facilitates recruitment of the modifying complexes in embryonic stem cells and (ii) that re-organization of extended chromatin regions during lineage specification into neuronal progenitor cells requires targeted de-modification. Our approach represents a basic step towards multi-scale models of transcriptional control during development and lineage specification.
    Full-text Article · Mar 2013 · Physical Biology
  • Dataset: File S3
    ABSTRACT: Additional SOM images for ES-segmentation. Chromosome-specific population map and chromosomal enrichment maps for ES-segmentation are shown for all chromosomes in the mouse genome. (PDF)
    Full-text Dataset · Oct 2012