Jan Baumbach

University of Southern Denmark, Odense, South Denmark, Denmark

Publications (79) · 229.64 Total impact

  • ABSTRACT: We finally arrived in the post-genome era and face systems biology challenges of immense dimensionality. Huge graphs model the interplay of biological entities of all kinds (genes, proteins, metabolites). In parallel, the emergence of next-generation OMICS technology allows measuring their expression on a large scale and in high throughput. Horizons opened for so-called network enrichment strategies, which aim to combine these two data types, networks and expression matrices. One basically assumes that disease-specific, foreground (FG) genes have a different expression distribution than the others, the background (BG) genes, in a set of patients compared to a control group. A priori, one knows neither the FG genes, nor the BG genes, nor their expression distributions. De novo network enrichment tools seek to find densely connected sub-networks that are enriched with FG genes, i.e. deregulated disease-specific subnetworks. As we do not know all FG genes and, more importantly, many of the BG genes (i.e. genes that are not disease-related), we struggle to evaluate the real-world relevance of the sub-networks extracted by network enrichers. Here, we contribute a proof-of-principle study addressing this problem. We introduce a sampling procedure to generate artificial 'gold standards' of FG and BG genes of varying complexity. To this end, we introduce two intuitive parameters controlling how distant FG and BG genes are in their expression values (separation), and how densely the FG genes are distributed in a network (density), respectively. For the latter, we introduce two algorithms to 'hide' FG genes with a certain density in a graph. As an example, we benchmark the performance of the network enrichment tool KeyPathwayMiner in finding the FG genes that we have 'hidden' in the input network for different density and separation values. We believe that our simple but robust strategy is applicable for systematically assessing and comparing the quality of network enrichment tools in the future.
  • ABSTRACT: With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches to sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences, as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons and therefore significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours, whereas the greedy clustering method took up to 242 CPU-hours.
    Scientific Reports 10/2014; 4:6837. · 5.08 Impact Factor
  • Jan Baumbach, Richard Röttger
    ABSTRACT: A graphical abstract is available for this content
    Integrative Biology 10/2014; · 4.32 Impact Factor
  • ABSTRACT: The volatolome is the sum of volatile organic compounds that are emitted by all living cells and tissues. We seek to non-invasively "sniff" biomarker molecules that are predictive for the biomedical fate of individual patients. This holds great promise for moving the therapeutic windows to earlier stages of disease progression. While portable devices for breathomics measurement exist, we face the traditional biomarker research barrier: a lack of robustness hinders translation to the world outside laboratories. To move from biomarker discovery to validation, from separability to predictability, we have developed several bioinformatics methods for computational breath analysis, which have the potential to redefine non-invasive biomedical decision making by rapid and cheap matching of decisive medical patterns in exhaled air. We aim to provide a supplementary diagnostic tool complementing classic urine, blood and tissue samples. The presentation will review the state of the art, highlight existing challenges and introduce new data mining methods for identifying breathomics biomarkers.
    Highlight Talk at 13th European Conference on Computational Biology (ECCB), Strasbourg, France; 09/2014
  • ABSTRACT: Reverse-phase protein arrays (RPPAs) allow sensitive quantification of relative protein abundance in thousands of samples in parallel. Typical challenges involved in this technology are antibody selection, sample preparation and optimization of staining conditions. The issue of combining effective sample management and data analysis, however, has been widely neglected.
    Bioinformatics 09/2014; 30(17):i631-i638. · 4.62 Impact Factor
  • ABSTRACT: Background: Over the last decade network enrichment analysis has become popular in computational systems biology to elucidate aberrant network modules. Traditionally, these approaches focus on combining gene expression data with protein-protein interaction (PPI) networks. Nowadays, the so-called omics technologies allow for inclusion of many more data sets, e.g. protein phosphorylation or epigenetic modifications. This creates a need for analysis methods that can combine these various sources of data to obtain a systems-level view on aberrant biological networks. Results: We present a new release of KeyPathwayMiner (version 4.0) that is not limited to analyses of single omics data sets, e.g. gene expression, but is able to directly combine several different omics data types. Version 4.0 can further integrate existing knowledge by adding a search bias towards sub-networks that contain (avoid) genes provided in a positive (negative) list. Finally, the new release now also provides a set of novel visualization features and has been implemented as an app for the standard bioinformatics network analysis tool Cytoscape. Conclusion: With KeyPathwayMiner 4.0, we publish a Cytoscape app for multi-omics based sub-network extraction. It is available in Cytoscape's app store http://apps.cytoscape.org/apps/keypathwayminer or via http://keypathwayminer.mpi-inf.mpg.de.
    BMC Systems Biology 08/2014; 8(1):99. · 2.85 Impact Factor
  • ABSTRACT: In life sciences, and particularly biomedical research, linking aberrant pathways exhibiting phenotype-specific alterations to the underlying physical condition or disease is an ongoing challenge. Computationally, a key approach for pathway identification is data enrichment, combined with generation of biological networks. This allows identification of intrinsic patterns in the data and their linkage to a specific context such as cellular compartments, diseases or functions. Identification of aberrant pathways by traditional approaches is often limited to biological networks based on either gene expression, protein expression or post-translational modifications. To overcome single omics analysis, we developed a set of computational methods that allow a combined analysis of data collections from multiple omics fields utilizing hybrid interactome networks. We apply these methods to data obtained from a triple-negative breast cancer cell line model, combining data sets of gene and protein expression as well as protein phosphorylation. We focus on alterations associated with the phenotypical differences arising from epithelial-mesenchymal transition in two breast cancer cell lines exhibiting epithelial-like and mesenchymal-like morphology, respectively. Here we identified altered protein signaling activity in a complex biologically relevant network, related to focal adhesion and migration of breast cancer cells. We found dysregulated functional network modules revealing altered phosphorylation-dependent activity in concordance with the phenotypic traits and migrating potential of the tested model. In addition, we identified Ser267 on zyxin, a protein coupled to actin filament polymerization, as a potential in vivo phosphorylation target of cyclin-dependent kinase 1.
    Integrative Biology 08/2014; · 4.32 Impact Factor
  • ABSTRACT: Motivation: We address the problem of multiple protein-protein interaction (PPI) network alignment. Given a set of such networks for different species we might ask how much the network topology is conserved throughout evolution. Solving this problem will help to derive a subset of interactions that is conserved over multiple species thus forming a 'core interactome'. Methods: We model the problem as Topological Multiple one-to-one Network Alignment (TMNA), where we aim to minimize the total Graph Edit Distance (GED) between pairs of the input networks. Here, the GED between two graphs is the number of deleted and inserted edges that are required to make one graph isomorphic to another. By minimizing the GED we indirectly maximize the number of edges that are aligned in multiple networks simultaneously. However, computing an optimal GED value is computationally intractable. We thus propose an evolutionary algorithm and developed a software tool, GEDEVO-M, which is able to align multiple PPI networks using topological information only. We demonstrate the power of our approach by computing a maximal common subnetwork for a set of bacterial and eukaryotic PPI networks. GEDEVO-M thus provides great potential for computing the 'core interactome' of different species. Availability: http://gedevo.mpi-inf.mpg.de/multiple-network-alignment/.
  • ABSTRACT: Next-generation sequencing (NGS) technologies have made high-throughput sequencing available to medium- and small-size laboratories, culminating in a tidal wave of genomic information. The quantity of sequenced bacterial genomes has not only brought excitement to the field of genomics but also heightened expectations that NGS would boost antibacterial discovery and vaccine development. Although many possible drug and vaccine targets have been discovered, the success rate of genome-based analysis has remained below expectations. Furthermore, NGS has had consequences for genome quality, resulting in an exponential increase in draft (partial data) genome deposits in public databases. If no further interests are expressed for a particular bacterial genome, it is more likely that the sequencing of its genome will be limited to a draft stage, and the painstaking tasks of completing the sequencing of its genome and annotation will not be undertaken. It is important to know what is lost when we settle for a draft genome and to determine the "scientific value" of a newly sequenced genome. This review addresses the expected impact of newly sequenced genomes on antibacterial discovery and vaccinology. Also, it discusses the factors that could be leading to the increase in the number of draft deposits and the consequent loss of relevant biological information.
    World journal of biological chemistry. 05/2014; 5(2):161-168.
  • ABSTRACT: We review the level of genomic specificity regarding actinobacterial pathogenicity. As they occupy various niches in diverse habitats, one may assume the existence of lifestyle-specific genomic features. We include 240 actinobacteria classified into four pathogenicity classes: human pathogens (HPs), broad-spectrum pathogens (BPs), opportunistic pathogens (OPs) and non-pathogenic (NP). We hypothesize: (H1) Pathogens (HPs and BPs) possess specific pathogenicity signature genes. (H2) The same holds for OPs. (H3) Broad-spectrum and exclusively HPs cannot be distinguished from each other because of an observation bias, i.e. many HPs might yet be unclassified BPs. (H4) There is no intrinsic genomic characteristic of OPs compared with pathogens, as small mutations are likely to play a more dominant role to survive the immune system. To study these hypotheses, we implemented a bioinformatics pipeline that combines evolutionary sequence analysis with statistical learning methods (Random Forest with feature selection, model tuning and robustness analysis). Essentially, we present orthologous gene sets that computationally distinguish pathogens from NPs (H1). We further show a clear limit in differentiating OPs from both NPs (H2) and pathogens (H4). HPs may also not be distinguished from bacteria annotated as BPs based only on a small set of orthologous genes (H3), as many HPs might as well target a broad range of mammals but have not been annotated accordingly. In conclusion, we illustrate that even in the post-genome era and despite next-generation sequencing technology, our ability to efficiently deduce real-world conclusions, such as pathogenicity classification, remains quite limited.
    Briefings in functional genomics 05/2014; · 3.43 Impact Factor
  • ABSTRACT: We define breathomics as the metabolomics study of exhaled air. It is a strongly emerging metabolomics research field that mainly focuses on health-related volatile organic compounds (VOCs). Since the amount of these compounds varies with health status, breathomics holds great promise to deliver non-invasive diagnostic tools. Thus, the main aim of breathomics is to find patterns of VOCs related to abnormal (for instance inflammatory) metabolic processes occurring in the human body. Recently, analytical methods for measuring VOCs in exhaled air with high resolution and high throughput have been extensively developed. Yet, the application of machine learning methods for fingerprinting VOC profiles in breathomics is still in its infancy. Therefore, in this paper, we describe the current state of the art in data pre-processing and multivariate analysis of breathomics data. We start with the detailed pre-processing pipelines for breathomics data obtained from gas-chromatography mass spectrometry and an ion-mobility spectrometer coupled to multi-capillary columns. The outcome of data pre-processing is a matrix containing the relative abundances of a set of VOCs for a group of patients under different conditions (e.g. disease stage, treatment). Independently of the utilized analytical method, the most important question, 'which VOCs are discriminatory?', remains the same. Answers can be given by several modern machine learning techniques (multivariate statistics) and, therefore, are the focus of this paper. We demonstrate the advantages as well as the drawbacks of such techniques. We aim to help the community to understand how to profit from a particular method. In parallel, we hope to make the community aware of the existing data fusion methods, as yet unresearched in breathomics.
    Journal of Breath Research 04/2014; 8(2):027105. · 2.57 Impact Factor
  • ABSTRACT: Ion mobility spectrometry coupled to multi capillary columns (MCC/IMS) combines highly sensitive spectrometry with a rapid separation technique. MCC/IMS is widely used for biomedical breath analysis. The identification of molecules in such a complex sample necessitates a reference database. The existing IMS reference databases are still in their infancy and do not yet allow all analytes to be identified. With a gas chromatograph coupled to a mass selective detector (GC/MSD) set up in parallel to an MCC/IMS instrumentation we may increase the accuracy of automatic analyte identification. To overcome the time-consuming manual evaluation and comparison of the results of both devices, we developed a software tool MIMA (MS-IMS-Mapper), which can computationally generate analyte layers for MCC/IMS spectra by using the corresponding GC/MSD data. We demonstrate the power of our method by successfully identifying the analytes of a seven-component mixture. In conclusion, the main contribution of MIMA is a fast and easy computational method for assigning analyte names to yet un-assigned signals in MCC/IMS data. We believe that this will greatly impact modern MCC/IMS-based biomarker research by “giving a name” to previously detected disease-specific molecules.
    International Journal for Ion Mobility Spectrometry 04/2014; 17(2):95-101.
  • ABSTRACT: The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.
    Nucleic Acids Research 03/2014; · 8.81 Impact Factor
  • ABSTRACT: The advance of new technologies in biomedical research has led to a dramatic growth in experimental throughput. Projects therefore steadily grow in size and involve a larger number of researchers. Spreadsheets traditionally used are thus no longer suitable for keeping track of the vast amounts of samples created and need to be replaced with state-of-the-art laboratory information management systems. Such systems have been developed in large numbers, but they are often limited to specific research domains and types of data. One domain so far neglected is the management of libraries of vector clones and genetically engineered cell lines. OpenLabFramework is a newly developed web-application for sample tracking, particularly laid out to fill this gap, but with an open architecture allowing it to be extended for other biological materials and functional data. Its sample tracking mechanism is fully customizable and aids productivity further through support for mobile devices and barcoded labels.
    Scientific Reports 03/2014; 4:4278. · 5.08 Impact Factor
  • Journal of Breath Research 02/2014; 8(1):012001. · 2.57 Impact Factor
  • ABSTRACT: As high-throughput technologies become cheaper and easier to use, raw sequence data and corresponding annotations for many organisms are becoming available. However, sequence data alone is not sufficient to explain the biological behaviour of organisms, which arises largely from complex molecular interactions. There is a need to develop new platform technologies that can be applied to the investigation of whole-genome datasets in an efficient and cost-effective manner. One such approach is the transfer of existing knowledge from well-studied organisms to closely-related organisms. In this paper, we describe a system, BacillusRegNet, for the use of a model organism, Bacillus subtilis, to infer genome-wide regulatory networks in less well-studied close relatives. The putative transcription factors, their binding sequences and predicted promoter sequences along with annotations are available from the associated BacillusRegNet website (http://bacillus.ncl.ac.uk).
    Journal of integrative bioinformatics 01/2014; 11(2):244.
  • International Journal for Ion Mobility Spectrometry 01/2014; 17(2):95-101. *(Shared first author).
  • Brief Funct Genomics. 01/2014;
  • ABSTRACT: Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes making a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could already lead to a largely improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on an epigenetic level. We compared so-called random forest derived classification models based on gene expression and methylation data alone, to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification error of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model, which was also reflected in the combined model, which mainly selected features from gene expression data. However, the methylation model was able to identify unique features not considered as relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.
    Journal of integrative bioinformatics 01/2014; 11(2):236.
  • Peng Sun, Jiong Guo, Jan Baumbach
    ABSTRACT: The explosion of biological data has dramatically reformed today's biology research. The biggest challenge to biologists and bioinformaticians is the integration and analysis of large quantities of data to provide meaningful insights. One major problem is the combined analysis of data of different types. Bi-cluster editing, as a special case of clustering, which partitions two different types of data simultaneously, might be used for several biomedical scenarios. However, the underlying algorithmic problem is NP-hard. Here we contribute BiCluE, a software package designed to solve the weighted bi-cluster editing problem. It implements (1) an exact algorithm based on fixed-parameter tractability and (2) a polynomial-time greedy heuristic based on solving the hardest part, edge deletions, first. We evaluated its performance on artificial graphs. Afterwards, as an example, we applied our implementation to real-world biomedical data, GWAS data in this case. BiCluE generally works on any kind of data that can be modeled as (weighted or unweighted) bipartite graphs. To our knowledge, this is the first software package solving the weighted bi-cluster editing problem. BiCluE as well as the supplementary results are available online at http://biclue.mpi-inf.mpg.de.
    BMC proceedings 12/2013; 7(Suppl 7):S9.

Publication Stats

757 Citations
229.64 Total Impact Points


  • 2013–2014
    • University of Southern Denmark
      • Department of Mathematics and Computer Science
      Odense, South Denmark, Denmark
    • Institute of Integrative Omics and Applied Biotechnology
      Rānāghāt, Bengal, India
  • 2011–2014
    • Federal University of Minas Gerais
      • Institute of Biological Sciences
      Cidade de Minas, Minas Gerais, Brazil
  • 2011–2012
    • Max Planck Institute for Informatics
      Saarbrücken, Saarland, Germany
  • 2010–2012
    • Universität des Saarlandes
      • Institut für Medizinische Mikrobiologie und Hygiene
      Saarbrücken, Saarland, Germany
    • Max Planck Society
      München, Bavaria, Germany
  • 2006–2012
    • Bielefeld University
      • CeBiTec - Center for Biotechnology
      • Faculty of Technology
      Bielefeld, North Rhine-Westphalia, Germany
  • 2009
    • University of California, Berkeley
      Berkeley, California, United States