ABSTRACT: Organisms utilize a multitude of mechanisms to respond to changing environmental conditions, maintain their functional homeostasis and overcome stress situations. One of the most important mechanisms is transcriptional gene regulation. In-depth study of the transcriptional gene regulatory network can lead to various practical applications, creating a greater understanding of how organisms control their cellular behavior.
In this work, we present a new database, CMRegNet, for the gene regulatory networks of Corynebacterium glutamicum ATCC 13032 and Mycobacterium tuberculosis H37Rv. We furthermore transferred the known networks of these model organisms to 18 other non-model but phylogenetically close species (target organisms) of the CMNR group. In contrast to other network transfers, we utilized two model organisms for the first time, resulting in a more diverse and complete network for the target organisms.
CMRegNet provides easy access to a total of 3,103 known regulations in C. glutamicum ATCC 13032 and M. tuberculosis H37Rv and to 38,940 evolutionarily conserved interactions for 18 non-model species of the CMNR group. This makes CMRegNet the most comprehensive database of regulatory interactions of CMNR bacteria to date. The content of CMRegNet is publicly available online via a web interface at http://lgcm.icb.ufmg.br/cmregnet .
ABSTRACT: Computational breath analysis is a growing research area aiming at identifying volatile organic compounds (VOCs) in human breath to assist medical diagnostics of the next generation. While inexpensive and non-invasive bioanalytical technologies for metabolite detection in exhaled air and bacterial/fungal vapor exist, and the first studies on the power of supervised machine learning methods for profiling the resulting data have been conducted, we lack methods to extract hidden data features emerging from confounding factors. Here, we present Carotta, a new cluster analysis framework dedicated to uncovering such hidden substructures by sophisticated unsupervised statistical learning methods. We study the power of transitivity clustering and hierarchical clustering to identify groups of VOCs with similar expression behavior over most patient breath samples and/or groups of patients with a similar VOC intensity pattern. This enables the discovery of dependencies between metabolites. On the one hand, this allows us to eliminate the effect of potential confounding factors hindering disease classification, such as smoking. On the other hand, we may also identify VOCs associated with disease subtypes or concomitant diseases. Carotta is open source software with an intuitive graphical user interface that supports data handling, analysis and visualization. The back-end is designed to be modular, allowing for easy extension with plugins in the future, such as new clustering methods and statistics. It does not require much prior knowledge or technical skill to operate. We demonstrate its power and applicability by means of one artificial dataset. We also apply Carotta to a real-world example dataset on chronic obstructive pulmonary disease (COPD).
While the artificial data are utilized as a proof of concept, we demonstrate how Carotta finds candidate markers in our real dataset associated with confounders rather than the primary disease (COPD) and bronchial carcinoma (BC). Carotta is publicly available at http://carotta.compbio.sdu.dk .
ABSTRACT: An in-depth understanding of complex systems such as hepatitis C virus (HCV) infection and the host immunomodulatory response is an open challenge for biologists. In order to understand the mechanisms involved in immune evasion by HCV, we present a simplified formalization of the highly dynamic system consisting of HCV, its replication cycle and host immune responses at the cellular level using a Hybrid Petri Net (HPN). The approach followed in this study comprises stepwise simulation, model validation and analysis of the host immune response. This study was performed with the objective of correlating viral RNA levels, interferon (IFN) production and induction of interferon-stimulated genes (ISGs). The results correlate with the biological data, indicating that the model is useful in predicting the dynamic behavior of the signaling proteins in response to a stimulus. This study implies that HCV infection is dependent upon several key factors of the host immune response. The effect of host proteins on limiting viral infection is effectively overruled by the viral pathogen. This study also analyzes activity levels of RNase L, miR-122, IFN, ISGs and PKR induction and inhibition of TLR3/RIG1-mediated pathways in response to targeted manipulation in the presence of HCV. At the time of writing, the results are in complete agreement with the published expression studies and western blot experiments. Our model also provides some biological insights regarding the role of PKR in acute HCV infection. It might help to explain why many patients fail to clear acute HCV infection while others, with low basal ISG levels, clear HCV spontaneously. The described methodology can easily be reproduced, which supports the study of other viral infections in a formal, automated and expressive manner.
The Petri Net-based modeling approach applied here may provide valuable insights for study design and analyses to evaluate other disease associated integrated pathways in biological systems.
ABSTRACT: We have finally arrived in the post-genome era and face systems biology challenges of immense dimensionality. Huge graphs model the interplay of biological entities of all kinds (genes, proteins, metabolites). In parallel, the emergence of next-generation OMICS technology allows measuring their expression on a large scale and in high throughput. Horizons opened for so-called network enrichment strategies, which aim to combine these two data types, networks and expression matrices. One basically assumes that disease-specific, foreground (FG) genes have a different expression distribution than the others, the background (BG) genes, in a set of patients compared to a control group. A priori one knows neither the FG genes, the BG genes, nor their expression distributions. De novo network enrichment tools seek to find densely connected sub-networks that are enriched with FG genes, i.e. deregulated disease-specific sub-networks. As we do not know all FG genes and, more importantly, many of the BG genes (i.e. genes that are not disease-related), we struggle to evaluate the real-world relevance of the sub-networks extracted by network enrichment tools.
Here, we contribute a proof-of-principle study addressing this problem. We introduce a sampling procedure to generate artificial 'gold standards' of FG and BG genes of varying complexity. To this end, we introduce two intuitive parameters controlling how distant FG and BG genes are in their expression values (separation), and how densely the FG genes are distributed in a network (density), respectively. For the latter, we introduce two algorithms to 'hide' FG genes with a certain density in a graph. As an example, we benchmark the performance of the network enrichment tool KeyPathwayMiner in finding the FG genes that we have 'hidden' in the input network for different density and separation values. We believe that our simple but robust strategy is applicable for systematically assessing and comparing the quality of network enrichment tools in the future.
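The separation parameter described above can be illustrated with a minimal sketch. The abstract does not state which distributions the sampling procedure uses, so the unit-variance Gaussians, the function name `sample_gold_standard` and all of its parameters are assumptions made purely for illustration; the density parameter and the two 'hiding' algorithms are not covered here.

```python
import random
from statistics import mean


def sample_gold_standard(genes, fg_genes, separation, n_samples, seed=0):
    """Draw artificial expression values for every gene.

    Assumption (not from the abstract): BG genes are drawn from N(0, 1)
    and FG genes from N(separation, 1), so `separation` is the distance
    between the two expression means.
    """
    rng = random.Random(seed)
    foreground = set(fg_genes)
    expression = {}
    for gene in genes:
        mu = separation if gene in foreground else 0.0
        expression[gene] = [rng.gauss(mu, 1.0) for _ in range(n_samples)]
    return expression


# Hypothetical toy input: four genes, two of them labelled as foreground.
expr = sample_gold_standard(["g1", "g2", "g3", "g4"], ["g1", "g2"],
                            separation=3.0, n_samples=2000)
fg_mean = mean(expr["g1"] + expr["g2"])
bg_mean = mean(expr["g3"] + expr["g4"])
print(f"observed separation: {fg_mean - bg_mean:.2f}")
```

Because the FG and BG labels are known by construction, the output of an enrichment tool run on such data could be scored directly against them.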
ABSTRACT: With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches to sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences, as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed, which avoids a majority of sequence comparisons and therefore significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank using only a normal desktop computer within 22 CPU-hours, whereas the greedy clustering method took up to 242 CPU-hours.
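The memory argument against full pairwise similarity matrices can be made concrete with a short calculation. This is a rough sketch, assuming that only the upper triangle of the matrix is stored and that each similarity takes 4 bytes (a single-precision float); neither assumption comes from the abstract, and the function name is hypothetical.

```python
def pairwise_matrix_size(n_sequences, bytes_per_entry=4):
    """Return (number of sequence pairs, bytes needed to store one value per pair)."""
    pairs = n_sequences * (n_sequences - 1) // 2  # upper triangle only
    return pairs, pairs * bytes_per_entry


pairs, size_bytes = pairwise_matrix_size(344_239)
print(f"{pairs:,} pairs, about {size_bytes / 1e9:.0f} GB")
# prints "59,250,072,441 pairs, about 237 GB"
```

Even under these favourable assumptions the matrix for the 344,239 ITS sequences is far beyond the memory of a normal desktop computer, which is why avoiding most comparisons matters.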
ABSTRACT: The volatolome is the sum of volatile organic compounds that are emitted by all living cells and tissues.
We seek to non-invasively "sniff" biomarker molecules that are predictive of the biomedical fate of individual patients. This holds great promise for moving the therapeutic window to earlier stages of disease progression. While portable devices for breathomics measurement exist, we face the traditional biomarker research barrier: a lack of robustness hinders translation to the world outside laboratories. To move from biomarker discovery to validation, from separability to predictability, we have developed several bioinformatics methods for computational breath analysis, which have the potential to redefine non-invasive biomedical decision making by rapid and cheap matching of decisive medical patterns in exhaled air. We aim to provide a supplementary diagnostic tool complementing classic urine, blood and tissue samples. The presentation will review the state of the art, highlight existing challenges and introduce new data mining methods for identifying breathomics biomarkers.
Highlight Talk at 13th European Conference on Computational Biology (ECCB), Strasbourg, France; 09/2014
ABSTRACT: Motivation: Reverse-phase protein arrays (RPPAs) allow sensitive quantification of relative protein abundance in thousands of samples in parallel. Typical challenges involved in this technology are antibody selection, sample preparation and optimization of staining conditions. The issue of combining effective sample management and data analysis, however, has been widely neglected.
Results: This motivated us to develop MIRACLE, a comprehensive and user-friendly web application bridging the gap between spotting and array analysis by conveniently keeping track of sample information. Data processing includes correction of staining bias, estimation of protein concentration from response curves, normalization for total protein amount per sample and statistical evaluation. Established analysis methods have been integrated with MIRACLE, offering experimental scientists an end-to-end solution for sample management and for carrying out data analysis. In addition, experienced users have the possibility to export data to R for more complex analyses. MIRACLE thus has the potential to further spread utilization of RPPAs as an emerging technology for high-throughput protein analysis.
Availability: Project URL: http://www.nanocan.org/miracle/
Supplementary data are available at Bioinformatics online.
ABSTRACT: Background
Over the last decade network enrichment analysis has become popular in computational systems biology to elucidate aberrant network modules. Traditionally, these approaches focus on combining gene expression data with protein-protein interaction (PPI) networks. Nowadays, the so-called omics technologies allow for inclusion of many more data sets, e.g. protein phosphorylation or epigenetic modifications. This creates a need for analysis methods that can combine these various sources of data to obtain a systems-level view on aberrant biological networks.
Results
We present a new release of KeyPathwayMiner (version 4.0) that is not limited to analyses of single omics data sets, e.g. gene expression, but is able to directly combine several different omics data types. Version 4.0 can further integrate existing knowledge by adding a search bias towards sub-networks that contain (avoid) genes provided in a positive (negative) list. Finally, the new release also provides a set of novel visualization features and has been implemented as an app for the standard bioinformatics network analysis tool Cytoscape.
Conclusion
With KeyPathwayMiner 4.0, we publish a Cytoscape app for multi-omics based sub-network extraction. It is available in Cytoscape's app store at http://apps.cytoscape.org/apps/keypathwayminer or via http://keypathwayminer.mpi-inf.mpg.de.
BMC Systems Biology 08/2014; 8(1):99. DOI:10.1186/s12918-014-0099-x · 2.85 Impact Factor
ABSTRACT: In the life sciences, and particularly biomedical research, linking aberrant pathways exhibiting phenotype-specific alterations to the underlying physical condition or disease is an ongoing challenge. Computationally, a key approach for pathway identification is data enrichment, combined with generation of biological networks. This allows identification of intrinsic patterns in the data and their linkage to a specific context such as cellular compartments, diseases or functions. Identification of aberrant pathways by traditional approaches is often limited to biological networks based on either gene expression, protein expression or post-translational modifications. To move beyond single-omics analysis, we developed a set of computational methods that allow a combined analysis of data collections from multiple omics fields utilizing hybrid interactome networks. We apply these methods to data obtained from a triple-negative breast cancer cell line model, combining data sets of gene and protein expression as well as protein phosphorylation. We focus on alterations associated with the phenotypical differences arising from epithelial-mesenchymal transition in two breast cancer cell lines exhibiting epithelial-like and mesenchymal-like morphology, respectively. Here we identified altered protein signaling activity in a complex biologically relevant network, related to focal adhesion and migration of breast cancer cells. We found dysregulated functional network modules revealing altered phosphorylation-dependent activity in concordance with the phenotypic traits and migratory potential of the tested model. In addition, we identified Ser267 on zyxin, a protein coupled to actin filament polymerization, as a potential in vivo phosphorylation target of cyclin-dependent kinase 1.
ABSTRACT: Motivation: We address the problem of multiple protein-protein interaction (PPI) network alignment. Given a set of such networks for different species, we might ask how much of the network topology is conserved throughout evolution. Solving this problem will help to derive a subset of interactions that is conserved over multiple species, thus forming a 'core interactome'. Methods: We model the problem as Topological Multiple one-to-one Network Alignment (TMNA), where we aim to minimize the total Graph Edit Distance (GED) between pairs of the input networks. Here, the GED between two graphs is the number of deleted and inserted edges that are required to make one graph isomorphic to another. By minimizing the GED we indirectly maximize the number of edges that are aligned in multiple networks simultaneously. However, computing an optimal GED value is computationally intractable. We thus propose an evolutionary algorithm and developed a software tool, GEDEVO-M, which is able to align multiple PPI networks using topological information only. We demonstrate the power of our approach by computing a maximal common subnetwork for a set of bacterial and eukaryotic PPI networks. GEDEVO-M thus provides great potential for computing the 'core interactome' of different species. Availability: http://gedevo.mpi-inf.mpg.de/multiple-network-alignment/.
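The edge-based GED definition above can be sketched for a single pair of graphs once a one-to-one node mapping is fixed. This is an illustrative sketch only: the function name and the toy graphs are hypothetical, and the evolutionary search that GEDEVO-M uses to find a good mapping is not shown.

```python
def edit_distance_under_mapping(edges_a, edges_b, mapping):
    """Count edge deletions plus insertions needed to turn graph A into graph B.

    `mapping` sends every node of A to a distinct node of B. Edges are
    treated as undirected, so (u, v) and (v, u) are the same edge.
    """
    mapped_a = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
    set_b = {frozenset((u, v)) for u, v in edges_b}
    deleted = len(mapped_a - set_b)   # edges of A with no counterpart in B
    inserted = len(set_b - mapped_a)  # edges of B not covered by A
    return deleted + inserted


# Hypothetical toy networks: a path 1-2-3 aligned to a triangle a-b-c.
path = [(1, 2), (2, 3)]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(edit_distance_under_mapping(path, triangle, {1: "a", 2: "b", 3: "c"}))
# prints 1 (one edge, a-c, has to be inserted)
```

Minimizing this count over all admissible mappings is the intractable part; an aligner only ever evaluates it for the candidate mappings it explores.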
ABSTRACT: Next-generation sequencing (NGS) technologies have made high-throughput sequencing available to medium- and small-size laboratories, culminating in a tidal wave of genomic information. The quantity of sequenced bacterial genomes has not only brought excitement to the field of genomics but also heightened expectations that NGS would boost antibacterial discovery and vaccine development. Although many possible drug and vaccine targets have been discovered, the success rate of genome-based analysis has remained below expectations. Furthermore, NGS has had consequences for genome quality, resulting in an exponential increase in draft (partial data) genome deposits in public databases. If no further interest is expressed in a particular bacterial genome, it is more likely that its sequencing will be limited to a draft stage, and the painstaking tasks of completing the sequencing and annotation will not be undertaken. It is important to know what is lost when we settle for a draft genome and to determine the "scientific value" of a newly sequenced genome. This review addresses the expected impact of newly sequenced genomes on antibacterial discovery and vaccinology. It also discusses the factors that could be leading to the increase in the number of draft deposits and the consequent loss of relevant biological information.
ABSTRACT: We review the level of genomic specificity regarding actinobacterial pathogenicity. As they occupy various niches in diverse habitats, one may assume the existence of lifestyle-specific genomic features. We include 240 actinobacteria classified into four pathogenicity classes: human pathogens (HPs), broad-spectrum pathogens (BPs), opportunistic pathogens (OPs) and non-pathogenic (NP). We hypothesize: (H1) Pathogens (HPs and BPs) possess specific pathogenicity signature genes. (H2) The same holds for OPs. (H3) Broad-spectrum and exclusively HPs cannot be distinguished from each other because of an observation bias, i.e. many HPs might yet be unclassified BPs. (H4) There is no intrinsic genomic characteristic of OPs compared with pathogens, as small mutations are likely to play a more dominant role to survive the immune system. To study these hypotheses, we implemented a bioinformatics pipeline that combines evolutionary sequence analysis with statistical learning methods (Random Forest with feature selection, model tuning and robustness analysis). Essentially, we present orthologous gene sets that computationally distinguish pathogens from NPs (H1). We further show a clear limit in differentiating OPs from both NPs (H2) and pathogens (H4). HPs may also not be distinguished from bacteria annotated as BPs based only on a small set of orthologous genes (H3), as many HPs might as well target a broad range of mammals but have not been annotated accordingly. In conclusion, we illustrate that even in the post-genome era and despite next-generation sequencing technology, our ability to efficiently deduce real-world conclusions, such as pathogenicity classification, remains quite limited.
ABSTRACT: We define breathomics as the metabolomics study of exhaled air. It is a strongly emerging metabolomics research field that mainly focuses on health-related volatile organic compounds (VOCs). Since the amount of these compounds varies with health status, breathomics holds great promise to deliver non-invasive diagnostic tools. Thus, the main aim of breathomics is to find patterns of VOCs related to abnormal (for instance inflammatory) metabolic processes occurring in the human body. Recently, analytical methods for measuring VOCs in exhaled air with high resolution and high throughput have been extensively developed. Yet, the application of machine learning methods for fingerprinting VOC profiles in breathomics is still in its infancy. Therefore, in this paper, we describe the current state of the art in data pre-processing and multivariate analysis of breathomics data. We start with the detailed pre-processing pipelines for breathomics data obtained from gas-chromatography mass spectrometry and an ion-mobility spectrometer coupled to multi-capillary columns. The outcome of data pre-processing is a matrix containing the relative abundances of a set of VOCs for a group of patients under different conditions (e.g. disease stage, treatment). Independently of the utilized analytical method, the most important question, 'which VOCs are discriminatory?', remains the same. Answers can be given by several modern machine learning techniques (multivariate statistics), which are therefore the focus of this paper. We demonstrate the advantages as well as the drawbacks of such techniques. We aim to help the community to understand how to profit from a particular method. In parallel, we hope to make the community aware of the existing data fusion methods, as yet unresearched in breathomics.
Journal of Breath Research 04/2014; 8(2):027105. DOI:10.1088/1752-7155/8/2/027105 · 3.59 Impact Factor
ABSTRACT: Ion mobility spectrometry coupled to multi-capillary columns (MCC/IMS) combines highly sensitive spectrometry with a rapid separation technique. MCC/IMS is widely used for biomedical breath analysis. The identification of molecules in such a complex sample necessitates a reference database. The existing IMS reference databases are still in their infancy and do not yet allow all analytes to be identified. With a gas chromatograph coupled to a mass selective detector (GC/MSD) set up in parallel to an MCC/IMS instrument, we may increase the accuracy of automatic analyte identification. To overcome the time-consuming manual evaluation and comparison of the results of both devices, we developed a software tool, MIMA (MS-IMS-Mapper), which can computationally generate analyte layers for MCC/IMS spectra by using the corresponding GC/MSD data. We demonstrate the power of our method by successfully identifying the analytes of a seven-component mixture. In conclusion, the main contribution of MIMA is a fast and easy computational method for assigning analyte names to as yet unassigned signals in MCC/IMS data. We believe that this will greatly impact modern MCC/IMS-based biomarker research by “giving a name” to previously detected disease-specific
International Journal for Ion Mobility Spectrometry 04/2014; 17(2):95-101. DOI:10.1007/s12127-014-0149-5
ABSTRACT: The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model and performs biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force, at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.
Nucleic Acids Research 03/2014; 42(9). DOI:10.1093/nar/gku201 · 9.11 Impact Factor
ABSTRACT: The advance of new technologies in biomedical research has led to a dramatic growth in experimental throughput. Projects therefore steadily grow in size and involve a larger number of researchers. The spreadsheets traditionally used are thus no longer suitable for keeping track of the vast numbers of samples created and need to be replaced with state-of-the-art laboratory information management systems. Such systems have been developed in large numbers, but they are often limited to specific research domains and types of data. One domain so far neglected is the management of libraries of vector clones and genetically engineered cell lines. OpenLabFramework is a newly developed web application for sample tracking, designed specifically to fill this gap, but with an open architecture allowing it to be extended for other biological materials and functional data. Its sample tracking mechanism is fully customizable and aids productivity further through support for mobile devices and barcoded labels.
ABSTRACT: Selecting the most promising treatment strategy for breast cancer crucially depends on determining the correct subtype. In recent years, gene expression profiling has been investigated as an alternative to histochemical methods. Since databases like TCGA provide easy and unrestricted access to gene expression data for hundreds of patients, the challenge is to extract a minimal optimal set of genes with good prognostic properties from a large bulk of genes making a moderate contribution to classification. Several studies have successfully applied machine learning algorithms to solve this so-called gene selection problem. However, more diverse data from other OMICS technologies are available, including methylation. We hypothesize that combining methylation and gene expression data could already lead to a largely improved classification model, since the resulting model will reflect differences not only on the transcriptomic, but also on an epigenetic level. We compared random forest-derived classification models based on gene expression data alone and on methylation data alone to a model based on the combined features and to a model based on the gold standard PAM50. We obtained bootstrap errors of 10-20% and classification errors of 1-50%, depending on breast cancer subtype and model. The gene expression model was clearly superior to the methylation model, which was also reflected in the combined model, which mainly selected features from gene expression data. However, the methylation model was able to identify unique features not considered relevant by the gene expression model, which might provide deeper insights into breast cancer subtype differentiation on an epigenetic level.
Journal of integrative bioinformatics 01/2014; 11(2):236. DOI:10.2390/biecoll-jib-2014-236