-
ABSTRACT: Recently, a Bayesian network model for inferring non-stationary regulatory processes from gene expression time series has
been proposed. The Bayesian Gaussian Mixture (BGM) Bayesian network model divides the data into disjunct compartments (data
subsets) by a free allocation model, and infers network structures, which are kept fixed for all compartments. Fixing the
network structure allows for some information sharing among compartments, and each compartment is modelled separately and
independently with the Gaussian BGe scoring metric for Bayesian networks. The BGM model can equally be applied to both static
(steady-state) and dynamic (time series) gene expression data. However, it is this flexibility that renders its application
to time series data suboptimal. To improve the performance of the BGM model on time series data we propose a revised approach
in which the free allocation of data points is replaced by a changepoint process so as to take the temporal structure into
account. The practical inference follows the Bayesian paradigm and approximately samples the network, the number of compartments
and the changepoint locations from the posterior distribution with Markov chain Monte Carlo (MCMC). Our empirical results
show that the proposed modification leads to a more efficient inference tool for analysing gene expression time series.
Keywords: Dynamic Bayesian networks; Non-stationary gene regulatory processes; Changepoint process; Gene networks
Computational Statistics 04/2012; 26(2):199-218. · 0.28 Impact Factor
-
ABSTRACT: An important and challenging problem in systems biology is the inference of gene regulatory networks from short non-stationary time series of transcriptional profiles. A popular approach that has been widely applied to this end is based on dynamic Bayesian networks (DBNs), although traditional homogeneous DBNs fail to model the non-stationarity and time-varying nature of the gene regulatory processes. Various authors have therefore recently proposed combining DBNs with multiple changepoint processes to obtain time varying dynamic Bayesian networks (TV-DBNs). However, TV-DBNs are not without problems. Gene expression time series are typically short, which leaves the model over-flexible, leading to over-fitting or inflated inference uncertainty. In the present paper, we introduce a Bayesian regularization scheme that addresses this difficulty. Our approach is based on the rationale that changes in gene regulatory processes appear gradually during an organism's life cycle or in response to a changing environment, and we have integrated this notion in the prior distribution of the TV-DBN parameters. We have extensively tested our regularized TV-DBN model on synthetic data, in which we have simulated short non-homogeneous time series produced from a system subject to gradual change. We have then applied our method to real-world gene expression time series, measured during the life cycle of Drosophila melanogaster, under artificially generated constant light condition in Arabidopsis thaliana, and from a synthetically designed strain of Saccharomyces cerevisiae exposed to a changing environment.
Statistical Applications in Genetics and Molecular Biology 01/2012; 11(4). · 1.52 Impact Factor
-
09/2011: pages 270-289; ISBN: 9781119970606
-
ABSTRACT: Dynamic Bayesian networks (DBNs) have been applied widely to reconstruct the structure of regulatory processes from time series data, and they have established themselves as a standard modelling tool in computational systems biology. The conventional approach is based on the assumption of a homogeneous Markov chain, and many recent research efforts have focused on relaxing this restriction. An approach that enjoys particular popularity is based on a combination of a DBN with a multiple changepoint process, and the application of a Bayesian inference scheme via reversible jump Markov chain Monte Carlo (RJMCMC). In the present article, we expand this approach in two ways. First, we show that a dynamic programming scheme allows the changepoints to be sampled from the correct conditional distribution, which results in improved convergence over RJMCMC. Second, we introduce a novel Bayesian clustering and information sharing scheme among nodes, which provides a mechanism for automatic model complexity tuning.
We evaluate the dynamic programming scheme on expression time series for Arabidopsis thaliana genes involved in circadian regulation. In a simulation study we demonstrate that the regularization scheme improves the network reconstruction accuracy over that obtained with recently proposed inhomogeneous DBNs. For gene expression profiles from a synthetically designed Saccharomyces cerevisiae strain under switching carbon metabolism we show that combining dynamic programming with regularization yields an inference procedure that outperforms two alternative established network reconstruction methods from the biology literature.
A MATLAB implementation of the algorithm and a supplementary paper with algorithmic details and further results for the Arabidopsis data can be downloaded from: http://www.statistik.tu-dortmund.de/bio2010.html.
Bioinformatics 03/2011; 27(5):693-9. · 5.47 Impact Factor
-
Machine Learning. 01/2011; 83:355-419.
-
International Workshop on Statistical Modelling IWSM; 01/2011
-
Adv. Bioinformatics. 01/2010; 2010.
-
Marco Grzegorczyk
ABSTRACT: The extraction of regulatory networks and pathways from postgenomic data is important for drug discovery and development, as the extracted pathways reveal how genes or proteins regulate each other. Following up on the seminal paper of Friedman et al. (J Comput Biol 7:601-620, 2000), Bayesian networks have been widely applied as a popular tool to this end in systems biology research. Their popularity stems from the tractability of the marginal likelihood of the network structure, which is a consistent scoring scheme in the Bayesian context. This score is based on an integration over the entire parameter space, for which highly expensive computational procedures have to be applied when using more complex models based on differential equations; for example, see (Bioinformatics 24:833-839, 2008). This chapter gives an introduction to reverse engineering regulatory networks and pathways with Gaussian Bayesian networks, that is Bayesian networks with the probabilistic BGe scoring metric [see (Geiger and Heckerman 235-243, 1995)]. In the BGe model, the data are assumed to stem from a Gaussian distribution and a normal-Wishart prior is assigned to the unknown parameters. Gaussian Bayesian network methodology for analysing static observational, static interventional as well as dynamic (observational) time series data will be described in detail in this chapter. Finally, we apply these Bayesian network inference methods (1) to observational and interventional flow cytometry (protein) data from the well-known RAF pathway to evaluate the global network reconstruction accuracy of Bayesian network inference and (2) to dynamic gene expression time series data of nine circadian genes in Arabidopsis thaliana to reverse engineer the unknown regulatory network topology for this domain.
Methods in molecular biology (Clifton, N.J.) 01/2010; 662:121-47.
-
Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada.; 01/2009
-
Pattern Recognition in Bioinformatics, 4th IAPR International Conference, PRIB 2009, Sheffield, UK, September 7-9, 2009. Proceedings; 01/2009
-
ABSTRACT: The objective of the present article is to propose and evaluate a probabilistic approach based on Bayesian networks for modelling non-homogeneous and non-linear gene regulatory processes. The method is based on a mixture model, using latent variables to assign individual measurements to different classes. The practical inference follows the Bayesian paradigm and samples the network structure, the number of classes and the assignment of latent variables from the posterior distribution with Markov Chain Monte Carlo (MCMC), using the recently proposed allocation sampler as an alternative to RJMCMC.
We have evaluated the method using three criteria: network reconstruction, statistical significance and biological plausibility. In terms of network reconstruction, we found improved results both for a synthetic network of known structure and for a small real regulatory network derived from the literature. We have assessed the statistical significance of the improvement on gene expression time series for two different systems (viral challenge of macrophages, and circadian rhythms in plants), where the proposed new scheme tends to outperform the classical BGe score. Regarding biological plausibility, we found that the inference results obtained with the proposed method were in excellent agreement with biological findings, predicting dichotomies that one would expect to find in the studied systems.
Two supplementary papers on theoretical (T) and experimental (E) aspects and the datasets used in our study are available from http://www.bioss.ac.uk/associates/marco/supplement/
Bioinformatics 10/2008; 24(18):2071-8. · 5.47 Impact Factor
-
09/2008: pages 101-142; ISBN: 9783527622818
-
Marco Grzegorczyk
ABSTRACT: Toxicoproteomics integrates traditional toxicology and systems biology and seeks to infer the architecture of biochemical pathways in biological systems that are affected by and respond to chemical and environmental exposures. Different reverse engineering methods for extracting biochemical regulatory networks from data have been proposed and it is important to understand their relative strengths and weaknesses. To shed some light onto this problem, Werhli et al. (2006) cross-compared three widely used methodologies, relevance networks, graphical Gaussian models, and Bayesian networks (BN), on real cytometric and synthetic expression data. This study continues with the evaluation and compares the learning performances of two different stochastic models (BGe and BDe) for BN. Cytometric protein expression data from the RAF-signaling pathway were used for the cross-method comparison. Understanding this pathway is an important task, as it is known that RAF is a critical signaling protein whose deregulation leads to carcinogenesis. When the more flexible BDe model is employed, a data discretization, which usually incurs an inevitable information loss, is needed. However, the results of the study reveal that the BDe model is preferable to the BGe model when a sufficiently large number of observations from the pathway are available.
Journal of Toxicology and Environmental Health Part A 02/2008; 71(11-12):827-34. · 1.83 Impact Factor
-
Machine Learning. 01/2008; 71:265-305.
-
Marco Grzegorczyk
ABSTRACT: During the last decade the development of high-throughput biotechnologies has resulted in the production of exponentially expanding quantities of biological data, such as genomic and proteomic expression data. One fundamental problem in systems biology is to learn the architecture of biochemical pathways and regulatory networks in an inferential way from such postgenomic data. Along with the increasing amount of available data, a lot of novel statistical methods have been developed and proposed in the literature. This article gives a non-mathematical overview of three widely used reverse engineering methods, namely relevance networks, graphical Gaussian models, and Bayesian networks, whereby the focus is on their relative merits and shortcomings. In addition the reverse engineering results of these graphical methods on cytometric protein data from the RAF-signalling network are cross-compared via AUROC scatter plots.
Proteomics 10/2007; 7 Suppl 1:51-9. · 4.43 Impact Factor
-
ABSTRACT: An important problem in systems biology is the inference of biochemical pathways and regulatory networks from postgenomic data. Various reverse engineering methods have been proposed in the literature, and it is important to understand their relative merits and shortcomings. In the present paper, we compare the accuracy of reconstructing gene regulatory networks with three different modelling and inference paradigms: (1) Relevance networks (RNs): pairwise association scores independent of the remaining network; (2) graphical Gaussian models (GGMs): undirected graphical models with constraint-based inference, and (3) Bayesian networks (BNs): directed graphical models with score-based inference. The evaluation is carried out on the Raf pathway, a cellular signalling network describing the interaction of 11 phosphorylated proteins and phospholipids in human immune system cells. We use both laboratory data from cytometry experiments as well as data simulated from the gold-standard network. We also compare passive observations with active interventions.
On Gaussian observational data, BNs and GGMs were found to outperform RNs. The difference in performance was not significant for the non-linear simulated data and the cytoflow data, though. Also, we did not observe a significant difference between BNs and GGMs on observational data in general. However, for interventional data, BNs outperform GGMs and RNs, especially when taking the edge directions rather than just the skeletons of the graphs into account. This suggests that the higher computational costs of inference with BNs over GGMs and RNs are not justified when using only passive observations, but that active interventions in the form of gene knockouts and over-expressions are required to exploit the full potential of BNs.
Data, software and supplementary material are available from http://www.bioss.sari.ac.uk/staff/adriano/research.html
Bioinformatics 11/2006; 22(20):2523-31. · 5.47 Impact Factor
-
ABSTRACT: Most proteomics experiments make use of 'high throughput' technologies such as 2-DE, MS or protein arrays to measure simultaneously the expression levels of thousands of proteins. Such experiments yield large, high-dimensional data sets which usually reflect not only the biological but also technical and experimental factors. Statistical tools are essential for evaluating these data and preventing false conclusions. Here, an overview is given of some typical statistical tools for proteomics experiments. In particular, we present methods for data preprocessing (e.g. calibration, missing values estimation and outlier detection), comparison of protein expression in different groups (e.g. detection of differentially expressed proteins or classification of new observations) as well as the detection of dependencies between proteins (e.g. protein clusters or networks). We also discuss questions of sample size planning for some of these methods.
Proteomics 10/2006; 6 Suppl 2:48-55. · 4.43 Impact Factor
-
Bioinformatics. 01/2006; 22:2523-2531.
-
Nadine Kutzner, Ingrid Hoffmann, Christina Linke, Thomas Thienel, Marco Grzegorczyk, Wolfgang Urfer, Dirk Martin, Günther Winde, Thilo Traska, Gerd Hohlbach, Klaus-Michael Müller, Cornelius Kuhnen, Oliver Müller
ABSTRACT: The treatment of early-stage tumours decreases the overall mortality of colorectal tumour patients. In this retrospective study we determined the sensitivity and the specificity of the faecal occult blood test (FOBT) and the molecular diagnosis (MD). We analysed 57 stool samples from patients with colorectal carcinomas for the presence of occult blood using a standard FOBT and for alterations in the three different tumour relevant markers APC, BAT26 and L-DNA. Stool samples from 44 control donors were analysed to determine the specificity of the applied methods. Twenty-nine (51%; 95% confidence interval (CI): 38-63%) stool samples of the cancer patients gave positive FOBT results. Thirty-seven (65%; CI: 52-76%) samples showed alterations in at least one DNA marker. Sixteen (28%) samples were positive only in the FOBT, and 24 (42%) samples showed a positive result exclusively in MD. The combined application of both methods resulted in a sensitivity of 93% (CI: 83-97%) and an overall specificity of 89% (CI: 76-95%). The combined application of FOBT and MD resulted in an overall sensitivity, which could not be achieved by any of the methods alone and which is in the range of invasive diagnostic methods.
Cancer Letters 12/2005; 229(1):33-41. · 4.24 Impact Factor
-
Marco Grzegorczyk
ABSTRACT: An important problem in systems biology is to infer the architecture of gene regulatory networks and biochemical pathways from postgenomic data. Various reverse engineering methods have been developed and proposed in the Statistics and Bioinformatics literature, and it is important to understand their relative merits and shortcomings. To shed light on this problem, the learning performances of three widely used machine learning methodologies (relevance networks, graphical Gaussian models, and Bayesian networks) are evaluated and compared on different real and synthetic test data sets taken from the RAF signalling network, which describes the interactions between eleven phosphorylated proteins and phospholipids in human immune system cells.