Jan Baumbach

University of Southern Denmark, Odense, South Denmark, Denmark

Publications (77) · 221.25 Total impact

  • Jan Baumbach, Richard Röttger
    ABSTRACT: A graphical abstract is available for this content
    Integrative Biology 10/2014; · 4.32 Impact Factor
  • ABSTRACT: The volatolom is the sum of volatile organic compounds that are emitted by all living cells and tissues. We seek to non-invasively "sniff" biomarker molecules that are predictive for the biomedical fate of individual patients. This promises great hope to move the therapeutic windows to earlier stages of disease progression. While portable devices for breathomics measurement exist, we face the traditional biomarker research barrier: A lack of robustness hinders translation to the world outside laboratories. To move from biomarker discovery to validation, from separability to predictability, we have developed several bioinformatics methods for computational breath analysis, which have the potential to redefine non-invasive biomedical decision making by rapid and cheap matching of decisive medical patterns in exhaled air. We aim to provide a supplementary diagnostic tool complementing classic urine, blood and tissue samples. The presentation will review the state of the art, highlight existing challenges and introduce new data mining methods for identifying breathomics biomarkers.
    Highlight Talk at 13th European Conference on Computational Biology (ECCB), Strasbourg, France; 09/2014
  • ABSTRACT: Reverse-phase protein arrays (RPPAs) allow sensitive quantification of relative protein abundance in thousands of samples in parallel. Typical challenges involved in this technology are antibody selection, sample preparation and optimization of staining conditions. The issue of combining effective sample management and data analysis, however, has been widely neglected.
    Bioinformatics (Oxford, England). 09/2014; 30(17):i631-i638.
  • ABSTRACT: Background: Over the last decade network enrichment analysis has become popular in computational systems biology to elucidate aberrant network modules. Traditionally, these approaches focus on combining gene expression data with protein-protein interaction (PPI) networks. Nowadays, the so-called omics technologies allow for inclusion of many more data sets, e.g. protein phosphorylation or epigenetic modifications. This creates a need for analysis methods that can combine these various sources of data to obtain a systems-level view on aberrant biological networks. Results: We present a new release of KeyPathwayMiner (version 4.0) that is not limited to analyses of single omics data sets, e.g. gene expression, but is able to directly combine several different omics data types. Version 4.0 can further integrate existing knowledge by adding a search bias towards sub-networks that contain (avoid) genes provided in a positive (negative) list. Finally the new release now also provides a set of novel visualization features and has been implemented as an app for the standard bioinformatics network analysis tool: Cytoscape. Conclusion: With KeyPathwayMiner 4.0, we publish a Cytoscape app for multi-omics based sub-network extraction. It is available in Cytoscape's app store http://apps.cytoscape.org/apps/keypathwayminer or via http://keypathwayminer.mpi-inf.mpg.de.
    BMC Systems Biology 08/2014; 8(1):99. · 2.98 Impact Factor
  • ABSTRACT: In life sciences, and particularly biomedical research, linking aberrant pathways exhibiting phenotype-specific alterations to the underlying physical condition or disease is an ongoing challenge. Computationally, a key approach for pathway identification is data enrichment, combined with generation of biological networks. This allows identification of intrinsic patterns in the data and their linkage to a specific context such as cellular compartments, diseases or functions. Identification of aberrant pathways by traditional approaches is often limited to biological networks based on either gene expression, protein expression or post-translational modifications. To overcome single omics analysis, we developed a set of computational methods that allow a combined analysis of data collections from multiple omics fields utilizing hybrid interactome networks. We apply these methods to data obtained from a triple-negative breast cancer cell line model, combining data sets of gene and protein expression as well as protein phosphorylation. We focus on alterations associated with the phenotypical differences arising from epithelial-mesenchymal transition in two breast cancer cell lines exhibiting epithelial-like and mesenchymal-like morphology, respectively. Here we identified altered protein signaling activity in a complex biologically relevant network, related to focal adhesion and migration of breast cancer cells. We found dysregulated functional network modules revealing altered phosphorylation-dependent activity in concordance with the phenotypic traits and migrating potential of the tested model. In addition, we identified Ser267 on zyxin, a protein coupled to actin filament polymerization, as a potential in vivo phosphorylation target of cyclin-dependent kinase 1.
    Integrative Biology 08/2014; · 4.32 Impact Factor
  • 07/2014;
  • ABSTRACT: Next-generation sequencing (NGS) technologies have made high-throughput sequencing available to medium- and small-size laboratories, culminating in a tidal wave of genomic information. The quantity of sequenced bacterial genomes has not only brought excitement to the field of genomics but also heightened expectations that NGS would boost antibacterial discovery and vaccine development. Although many possible drug and vaccine targets have been discovered, the success rate of genome-based analysis has remained below expectations. Furthermore, NGS has had consequences for genome quality, resulting in an exponential increase in draft (partial data) genome deposits in public databases. If no further interests are expressed for a particular bacterial genome, it is more likely that the sequencing of its genome will be limited to a draft stage, and the painstaking tasks of completing the sequencing of its genome and annotation will not be undertaken. It is important to know what is lost when we settle for a draft genome and to determine the "scientific value" of a newly sequenced genome. This review addresses the expected impact of newly sequenced genomes on antibacterial discovery and vaccinology. Also, it discusses the factors that could be leading to the increase in the number of draft deposits and the consequent loss of relevant biological information.
    World journal of biological chemistry. 05/2014; 5(2):161-168.
  • ABSTRACT: We review the level of genomic specificity regarding actinobacterial pathogenicity. As they occupy various niches in diverse habitats, one may assume the existence of lifestyle-specific genomic features. We include 240 actinobacteria classified into four pathogenicity classes: human pathogens (HPs), broad-spectrum pathogens (BPs), opportunistic pathogens (OPs) and non-pathogenic (NP). We hypothesize: (H1) Pathogens (HPs and BPs) possess specific pathogenicity signature genes. (H2) The same holds for OPs. (H3) Broad-spectrum and exclusively HPs cannot be distinguished from each other because of an observation bias, i.e. many HPs might yet be unclassified BPs. (H4) There is no intrinsic genomic characteristic of OPs compared with pathogens, as small mutations are likely to play a more dominant role to survive the immune system. To study these hypotheses, we implemented a bioinformatics pipeline that combines evolutionary sequence analysis with statistical learning methods (Random Forest with feature selection, model tuning and robustness analysis). Essentially, we present orthologous gene sets that computationally distinguish pathogens from NPs (H1). We further show a clear limit in differentiating OPs from both NPs (H2) and pathogens (H4). HPs may also not be distinguished from bacteria annotated as BPs based only on a small set of orthologous genes (H3), as many HPs might as well target a broad range of mammals but have not been annotated accordingly. In conclusion, we illustrate that even in the post-genome era and despite next-generation sequencing technology, our ability to efficiently deduce real-world conclusions, such as pathogenicity classification, remains quite limited.
    Briefings in functional genomics. 05/2014;
  • ABSTRACT: We define breathomics as the metabolomics study of exhaled air. It is a strongly emerging metabolomics research field that mainly focuses on health-related volatile organic compounds (VOCs). Since the amount of these compounds varies with health status, breathomics holds great promise to deliver non-invasive diagnostic tools. Thus, the main aim of breathomics is to find patterns of VOCs related to abnormal (for instance inflammatory) metabolic processes occurring in the human body. Recently, analytical methods for measuring VOCs in exhaled air with high resolution and high throughput have been extensively developed. Yet, the application of machine learning methods for fingerprinting VOC profiles in breathomics is still in its infancy. Therefore, in this paper, we describe the current state of the art in data pre-processing and multivariate analysis of breathomics data. We start with the detailed pre-processing pipelines for breathomics data obtained from gas-chromatography mass spectrometry and an ion-mobility spectrometer coupled to multi-capillary columns. The outcome of data pre-processing is a matrix containing the relative abundances of a set of VOCs for a group of patients under different conditions (e.g. disease stage, treatment). Independently of the utilized analytical method, the most important question, 'which VOCs are discriminatory?', remains the same. Answers can be given by several modern machine learning techniques (multivariate statistics) and, therefore, are the focus of this paper. We demonstrate the advantages as well as the drawbacks of such techniques. We aim to help the community to understand how to profit from a particular method. In parallel, we hope to make the community aware of the existing data fusion methods, as yet unresearched in breathomics.
    Journal of Breath Research 04/2014; 8(2):027105. · 2.57 Impact Factor
  • ABSTRACT: Ion mobility spectrometry coupled to multi capillary columns (MCC/IMS) combines highly sensitive spectrometry with a rapid separation technique. MCC/IMS is widely used for biomedical breath analysis. The identification of molecules in such a complex sample necessitates a reference database. The existing IMS reference databases are still in their infancy and do not yet allow all analytes to be identified. With a gas chromatograph coupled to a mass selective detector (GC/MSD) setup in parallel to an MCC/IMS instrumentation we may increase the accuracy of automatic analyte identification. To overcome the time-consuming manual evaluation and comparison of the results of both devices, we developed a software tool MIMA (MS-IMS-Mapper), which can computationally generate analyte layers for MCC/IMS spectra by using the corresponding GC/MSD data. We demonstrate the power of our method by successfully identifying the analytes of a seven-component mixture. In conclusion, the main contribution of MIMA is a fast and easy computational method for assigning analyte names to yet un-assigned signals in MCC/IMS data. We believe that this will greatly impact modern MCC/IMS-based biomarker research by "giving a name" to previously detected disease-specific molecules.
    International Journal for Ion Mobility Spectrometry 04/2014; 17(2):95-101.
  • ABSTRACT: The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.
    Nucleic Acids Research 03/2014; · 8.81 Impact Factor
  • Journal of Breath Research 02/2014; 8(1):012001. · 2.57 Impact Factor
  • ABSTRACT: As high-throughput technologies become cheaper and easier to use, raw sequence data and corresponding annotations for many organisms are becoming available. However, sequence data alone is not sufficient to explain the biological behaviour of organisms, which arises largely from complex molecular interactions. There is a need to develop new platform technologies that can be applied to the investigation of whole-genome datasets in an efficient and cost-effective manner. One such approach is the transfer of existing knowledge from well-studied organisms to closely-related organisms. In this paper, we describe a system, BacillusRegNet, for the use of a model organism, Bacillus subtilis, to infer genome-wide regulatory networks in less well-studied close relatives. The putative transcription factors, their binding sequences and predicted promoter sequences along with annotations are available from the associated BacillusRegNet website (http://bacillus.ncl.ac.uk).
    Journal of integrative bioinformatics 01/2014; 11(2):244.
  • International Journal for Ion Mobility Spectrometry 01/2014; 17(2):95-101. *(Shared first author).
  • Brief Funct Genomics. 01/2014;
  • ABSTRACT: The advance of new technologies in biomedical research has led to a dramatic growth in experimental throughput. Projects therefore steadily grow in size and involve a larger number of researchers. Spreadsheets traditionally used are thus no longer suitable for keeping track of the vast amounts of samples created and need to be replaced with state-of-the-art laboratory information management systems. Such systems have been developed in large numbers, but they are often limited to specific research domains and types of data. One domain so far neglected is the management of libraries of vector clones and genetically engineered cell lines. OpenLabFramework is a newly developed web-application for sample tracking, particularly laid out to fill this gap, but with an open architecture allowing it to be extended for other biological materials and functional data. Its sample tracking mechanism is fully customizable and aids productivity further through support for mobile devices and barcoded labels.
    Scientific Reports 01/2014; 4:4278. · 5.08 Impact Factor
  • ABSTRACT: Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes making a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could already lead to a largely improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on an epigenetic level. We compared so-called random forest derived classification models based on gene expression and methylation data alone, to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification error of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model, which was also reflected in the combined model, which mainly selected features from gene expression data. However, the methylation model was able to identify unique features not considered as relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.
    Journal of integrative bioinformatics 01/2014; 11(2):236.
  • Peng Sun, Jiong Guo, Jan Baumbach
    ABSTRACT: The explosion of biological data has dramatically reformed today's biology research. The biggest challenge to biologists and bioinformaticians is the integration and analysis of large quantities of data to provide meaningful insights. One major problem is the combined analysis of data of different types. Bi-cluster editing, as a special case of clustering, which partitions two different types of data simultaneously, might be used for several biomedical scenarios. However, the underlying algorithmic problem is NP-hard. Here we contribute BiCluE, a software package designed to solve the weighted bi-cluster editing problem. It implements (1) an exact algorithm based on fixed-parameter tractability and (2) a polynomial-time greedy heuristic based on solving the hardest part, edge deletions, first. We evaluated its performance on artificial graphs. Afterwards we applied our implementation, as an example, to real-world biomedical data, GWAS data in this case. BiCluE generally works on any kind of data types that can be modeled as (weighted or unweighted) bipartite graphs. To our knowledge, this is the first software package solving the weighted bi-cluster editing problem. BiCluE as well as the supplementary results are available online at http://biclue.mpi-inf.mpg.de.
    BMC proceedings 12/2013; 7(Suppl 7):S9.
  • ABSTRACT: It is common knowledge that the human breath contains metabolites allowing to infer a patient's health status, especially for diseases related to the respiratory system. This information is encoded in the "volatolom", a combination of volatile organic compounds (VOCs) produced by the human metabolism and environmental perturbations. Nevertheless, due to a lack of alternative analytical techniques most of the traditional diagnostic methods are still based on invasive techniques, e.g. using blood or tissue samples. During the last decade, the ion mobility spectrometer combined with a multi-capillary column (MCC/IMS) has become an established, inexpensive, and non-invasive bioanalytics technique for detecting VOCs with various potential applications in medical research. To pave the way for this technology towards daily usage in medical practice, different challenges still have to be solved. One of the main challenges is to establish an automated framework optimizing the processing algorithms in the pipeline yielding to an optimal performance for the final goals: Disease prediction and biomarker detection. Although equivalent computational methods and standard procedures exist for other biomedical applications (e.g. sequence and microarray analysis) we still are lacking such a standard protocol in breath research. In four recently published papers presented here [HBJ12, HKD+13, HSP+12, SHBB13] we aimed at solving this challenge.
    Highlight track at the German Conference on Bioinformatics, Göttingen (GER); 09/2013
  • Jan Baumbach, Jiong Guo, Rashid Ibragimov
    ABSTRACT: We introduce a variation of the graph isomorphism problem, where, given two graphs $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$ and three integers $l$, $d$, and $k$, we seek a set $D\subseteq V_1$ and a one-to-one mapping $f:V_1\to V_2$ such that $|D|\le k$ and for every vertex $v\in V_1\setminus D$ and every vertex $u\in N_{G_1}^l(v)\setminus D$ we have $f(u)\in N_{G_2}^d(f(v))$. Here, for a graph $G$ and a vertex $v$, we use $N_{G}^i(v)$ to denote the set of vertices which have distance at most $i$ to $v$ in $G$. We call this problem Neighborhood-Preserving Mapping (NPM). The main result of this paper is a complete dichotomy of the classical complexity of NPM on trees with respect to different values of $l$, $d$, $k$. Additionally, we present two dynamic programming algorithms for the case that one of the input trees is a path.
    Proceedings of the 13th international conference on Algorithms and Data Structures; 08/2013
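
The Neighborhood-Preserving Mapping (NPM) condition defined in the last abstract above can be checked mechanically for a candidate solution. The following is only an illustrative sketch, not the authors' algorithm: it verifies a given deletion set D and mapping f rather than searching for one. The function names `ball` and `is_npm_solution` and the adjacency-dictionary graph representation are assumptions made for this example.

```python
from collections import deque


def ball(adj, v, radius):
    """Return the vertices at distance at most `radius` from v, including v.

    `adj` is an undirected graph given as {vertex: iterable of neighbours}
    (a representation assumed for this sketch).
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == radius:
            continue  # do not expand beyond the requested radius
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)


def is_npm_solution(adj1, adj2, l, d, k, deleted, f):
    """Check whether (deleted, f) satisfies the NPM condition from the abstract."""
    deleted = set(deleted)
    # |D| <= k and D must be a subset of V1
    if len(deleted) > k or not deleted <= set(adj1):
        return False
    # f must be a one-to-one mapping from V1 into V2
    if set(f) != set(adj1):
        return False
    images = list(f.values())
    if len(set(images)) != len(images) or not set(images) <= set(adj2):
        return False
    # For every kept v and every kept u within distance l of v in G1,
    # f(u) must lie within distance d of f(v) in G2.
    for v in adj1:
        if v in deleted:
            continue
        allowed = ball(adj2, f[v], d)
        for u in ball(adj1, v, l):
            if u not in deleted and f[u] not in allowed:
                return False
    return True


# Two paths on three vertices each (hypothetical toy input).
g1 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
g2 = {1: [2], 2: [1, 3], 3: [2]}
print(is_npm_solution(g1, g2, 1, 1, 0, set(), {"a": 1, "b": 2, "c": 3}))
print(is_npm_solution(g1, g2, 1, 1, 0, set(), {"a": 1, "b": 3, "c": 2}))
```

Verification takes one breadth-first search per vertex in each graph, so it runs in polynomial time; the hardness discussed in the abstract concerns finding D and f, which this sketch does not attempt.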

Publication Stats

703 Citations
221.25 Total Impact Points


  • 2013–2014
    • University of Southern Denmark
      • Department of Mathematics and Computer Science
      Odense, South Denmark, Denmark
    • Institute of Integrative Omics and Applied Biotechnology
      Rānāghāt, Bengal, India
  • 2011–2014
    • Federal University of Minas Gerais
      • Institute of Biological Sciences
      Cidade de Minas, Minas Gerais, Brazil
  • 2011–2012
    • Max Planck Institute for Informatics
      Saarbrücken, Saarland, Germany
  • 2010–2012
    • Universität des Saarlandes
      • Institut für Medizinische Mikrobiologie und Hygiene
      Saarbrücken, Saarland, Germany
    • Max Planck Society
      München, Bavaria, Germany
  • 2006–2012
    • Bielefeld University
      • CeBiTec - Center for Biotechnology
      • Faculty of Technology
      Bielefeld, North Rhine-Westphalia, Germany
  • 2009
    • University of California, Berkeley
      Berkeley, California, United States