Integrating Gene Regulatory Networks to identify
cancer-specific genes
Valeria Bo¹ and Allan Tucker¹
¹Department of Computer Science, Brunel University, London, UK
Abstract
Consensus approaches have been widely used to identify Gene Regulatory Networks (GRNs) that are common to multiple studies. In this research, however, we develop an application that semi-automatically identifies key mechanisms that are specific to a particular set of conditions. We analyse four different types of cancer to identify gene pathways unique to each of them. To support the reliability of the results, we calculate the prediction accuracy of each gene for the specified conditions and compare it to predictions on other conditions. The most predictive genes are validated using the GeneCards encyclopaedia [1], coupled with a statistical test for validating clusters. Finally, we implement an interface that allows the user to identify unique subnetworks of any selected combination of studies using the AND and NOT logic operators. Results show that unique genes and sub-networks can be reliably identified and that they reflect key mechanisms that are fundamental to the cancer types under study.
Introduction
When an organism is subjected to a different condition, either internal or external (environmental changes, stress, cancer, etc.), its underlying mechanisms change. To build robust and reliable Gene Regulatory Networks (GRNs) from microarrays, it is necessary to integrate data collected from multiple studies [2, 3, 4, 5]. To identify links that are common to a set of independent studies, researchers apply consensus network analysis. Swift et al. [6] apply a clustering technique coupled with a statistically based gene functional analysis to identify novel genes, while Segal et al. [7] group genes that perform similar functions into 'modules' and then build networks of these modules to identify mechanisms at a more general (higher) level. More recently, a similar approach [8] was applied to a large number of cancer datasets in which case and control are compared. For each dataset, the pairwise correlation of gene expression profiles is computed and a frequency table is built. The values in the table are then used to build a weighted gene co-expression frequency network. Sub-networks with similar members are then identified and iteratively merged to generate the final network for both cancer and healthy tissue.
In [9], we expanded on this work but, rather than focusing on consensus networks, we developed a method to 'home in' on both the similarities and differences of GRNs generated from different independent studies by using a combination of partial correlation network building and graph theory. The method goes beyond the simple pairwise correlations between genes used in [8] by building independent networks for each study using glasso, which estimates the inverse covariance matrix using the lasso penalty. Rather than identifying a consensus across studies, we detect the edges that are unique/specific to each study and build Bayesian Networks to identify the most predictive group of genes and further refine our networks.
In this work we extend the work presented in [9] by exploring the performance of the pipeline on four different cancer datasets and by identifying, through the GeneCards encyclopaedia [1], the list of genes known to be involved in each type of cancer. We apply a statistical test to measure the significance of detecting these genes in our unique networks. In addition, we develop an interface that allows the user to select combinations of studies using the AND and NOT logic operators and to identify the related unique sub-networks and genes.
Materials and Methods
In this paper we adapt the Unique Network Identification Pipeline (UNIP) developed in [9]. Each step of the pipeline, as applied to the specific case of this paper, is explained in the following sections.
Dataset Description. Four different cancer datasets are downloaded from the NCBI Gene Expression Omnibus (GEO) website [10]. To avoid platform bias, the datasets selected are all generated using the Affymetrix HU133 Plus 2.0 GeneChip. Given the raw data series, the rma (Robust Multi-Array Average) expression measure (available in the R package 'affy' [11]) is applied as a pre-processing step. The identification code, description and number of samples of each study are summarized in Table 1.
i) Selection of Informative Genes. The large discrepancy between the number of genes (in the order of thousands) and the number of samples (tens or hundreds) measured simultaneously in microarray data makes it necessary to reduce the number of variables (genes) involved in the analysis. The R package 'pvac' [12] applies PCA (Principal Component Analysis) [13] and returns a subset of the original variables: those closest to the principal components identified. To further refine the variable reduction and to select the most active genes, the standard deviation (sd) of each gene across all the samples in each separate study is calculated, and only genes that meet an sd threshold of 1.5 in at least one of the 4 studies are selected. The reduced datasets are used as input to the following steps of the analysis.
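The standard-deviation filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_active_genes`, the toy data and the reading of the threshold as "sd greater than or equal to 1.5" are not taken from the paper, and the sample standard deviation is assumed. The PCA-based step performed by 'pvac' is not reproduced here.

```python
from statistics import stdev

def select_active_genes(studies, threshold=1.5):
    """Return genes whose sample sd meets the threshold in at least one study.

    `studies` maps a study name to {gene: [expression values across samples]}.
    """
    selected = set()
    for expression in studies.values():
        for gene, values in expression.items():
            # Assumption: "meets the threshold" is read as sd >= threshold.
            if len(values) > 1 and stdev(values) >= threshold:
                selected.add(gene)
    return sorted(selected)

# Toy data invented for illustration, not taken from the GEO datasets.
toy_studies = {
    "study_A": {"g1": [2.0, 2.1, 1.9], "g2": [1.0, 5.0, 9.0]},
    "study_B": {"g1": [0.0, 4.0, 8.0], "g3": [3.0, 3.0, 3.1]},
}
active = select_active_genes(toy_studies)
print(active)
```

A gene is kept as soon as it is sufficiently variable in one study, so a gene that is flat in three studies but active in the fourth still survives the filter.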
ii) Glasso. At this stage we need to build a GRN for each condition/study in the dataset. As we want to identify networks that go beyond simple pairwise relationships, our procedure uses glasso [14, 15, 16], which estimates the inverse covariance matrix using the lasso penalty to make it as sparse as possible. In this paper, we apply glasso with the penalization parameter ρ = 0.05 to build a GRN for each study dataset. In addition, to further improve the sparsity and reduce the number of nodes involved, we retain only the connections with an inverse covariance value greater than or equal to 0.8.
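The thresholding step that follows glasso can be sketched as below, assuming the inverse covariance matrix has already been estimated elsewhere. The function name `edges_from_precision` and the toy matrix are hypothetical. The rule is applied literally as stated (value greater than or equal to 0.8); whether the original pipeline compares absolute values is not stated in the text and is not assumed here.

```python
def edges_from_precision(precision, genes, cutoff=0.8):
    """Keep undirected edges whose off-diagonal entry is >= cutoff.

    `precision` is a symmetric matrix given as a list of rows and
    `genes` lists the gene label for each row/column.
    """
    edges = set()
    n = len(genes)
    for a in range(n):
        for b in range(a + 1, n):  # upper triangle only, skip the diagonal
            if precision[a][b] >= cutoff:
                edges.add(frozenset((genes[a], genes[b])))
    return edges

# Toy inverse covariance matrix invented for illustration.
toy_precision = [
    [2.0, 0.9, 0.1],
    [0.9, 1.5, 0.8],
    [0.1, 0.8, 1.2],
]
toy_edges = edges_from_precision(toy_precision, ["g1", "g2", "g3"])
print(sorted(sorted(e) for e in toy_edges))
```

Storing each edge as a `frozenset` makes the edge undirected, which is convenient for the set comparisons between studies that come later.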
iii) Unique Bayesian Networks and Prediction. In this paper we explore four different studies and want to identify the mechanisms unique to each of them. We therefore consider each of the four studies as a study-cluster of one element, and the related glasso-network (built earlier) as the consensus network for that study-cluster. Although consensus approaches are popular, here we are interested in exploring the study-specific mechanisms through what we call unique-networks. Consider a generic graph $G = (V, E)$. We have $m$ fixed graphs $G_i$ such that $G_i = (V, E_i)$, where $V = \{1, \ldots, n\}$ is the set of vertices (nodes) of the graph, $E_i = \{e_i\} = \{(u_{i1}, v_{i1}), \ldots, (u_{ik_i}, v_{ik_i})\}$, $k_i = |E_i|$ and $k_i \le n(n-1)/2$. We define the unique function as $\Phi : G \mapsto G$, where $\hat{E}_i = \bigcup_{j=1, j \ne i}^{m} E_j$.
Definition 1: We define a function $\Phi(G_i)$ such that $\Phi(G_i) = (V, \{e : e \in E_i \text{ and } e \notin \hat{E}_i\})$.
In other words, a unique-network contains only those edges present in no other condition-specific network.
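The unique function reduces to a set difference between the edges of one study and the union of the edges of all the others. The sketch below illustrates this on toy data; the names `edge` and `unique_edges` and the example networks are hypothetical and not part of the original pipeline.

```python
def edge(u, v):
    """Undirected edge between two gene identifiers."""
    return frozenset((u, v))

def unique_edges(networks, target):
    """Edges of `target` that appear in no other network.

    `networks` maps a study name to its set of edges.
    """
    # Union of the edge sets of every study except the target.
    others = set().union(*(e for name, e in networks.items() if name != target))
    return networks[target] - others

# Toy networks invented for illustration.
toy_networks = {
    "study1": {edge(1, 2), edge(2, 3)},
    "study2": {edge(2, 3), edge(3, 4)},
    "study3": {edge(1, 4)},
}
print(unique_edges(toy_networks, "study1"))
```

The vertex set is left implicit here: the genes of the unique-network are simply the endpoints of the returned edges.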
We choose to measure the reliability of the unique-networks through prediction using Bayesian Networks (BNs) [17, 18], which naturally perform this through inference, given the graphical structure obtained from the genes involved in the unique-networks provided by glasso. Given the unique edges in the glasso-derived networks, we first build one BN for each of the study-clusters using the R package bnlearn [19, 20]. We then identify the most predictive genes (how well a gene predicts other expression level values) and the most predictable genes (how well its expression level values are predicted), both within (intra) and outside (inter) the study, using the package gRain [21] and leave-one-out cross-validation. Given the $m$ samples and $n$ genes within each study, we use $m - 1$ samples as a training set and the remaining one as the test set. Then, given $n - 1$ genes, we predict the expression value of the one left out. We compare the predicted value with the real value and return 1 if they correspond and 0 otherwise. We do this within all the studies and for all possible combinations of training and test sets of studies and genes. Finally, we average the number of correctly predicted values over the total number of predictions to obtain the correct-prediction score for each gene. The idea is that genes that are predictive or predicted better within the selected study than on other studies are more likely to be relevant to the unique-network.
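The bookkeeping of the leave-one-out score can be illustrated as follows. This sketch does not use a Bayesian Network: `predict_gene` is a hypothetical nearest-match stand-in for BN inference, chosen only so that the example is self-contained. Discretised expression levels and the toy samples are also assumptions, and only the intra-study case is shown.

```python
def predict_gene(train, test_sample, gene):
    """Hypothetical stand-in for BN inference: copy the target value from the
    training sample that agrees with the test sample on the most other genes."""
    def agreement(sample):
        return sum(sample[g] == test_sample[g] for g in test_sample if g != gene)
    return max(train, key=agreement)[gene]

def loo_accuracy(samples, gene):
    """Fraction of samples whose value for `gene` is predicted correctly
    when each sample in turn is held out as the test set."""
    hits = 0
    for i, held_out in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # the other m - 1 samples
        hits += predict_gene(train, held_out, gene) == held_out[gene]
    return hits / len(samples)

# Toy discretised expression levels invented for illustration.
toy_samples = [
    {"g1": "high", "g2": "high", "g3": "high"},
    {"g1": "high", "g2": "high", "g3": "high"},
    {"g1": "low", "g2": "low", "g3": "low"},
    {"g1": "low", "g2": "low", "g3": "low"},
    {"g1": "high", "g2": "high", "g3": "low"},
]
print(loo_accuracy(toy_samples, "g1"), loo_accuracy(toy_samples, "g3"))
```

For the inter-study score, the same comparison would be made with the training samples taken from one study and the test samples from another.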
iv) GeneCards. As we detect study-specific sub-networks, we also want to verify that our method captures study-specific genes. We query the GeneCards encyclopaedia [1] to obtain the list of genes that are known to be involved in each cancer. We compare the list for each study to the others and select the genes that appear only in the study under consideration. To compare the unique-gene list for each type of cancer with the genes found in the corresponding unique-network, we apply a probability score developed in [6], which tests the significance of observing multiple genes with known function in a given cluster against the null hypothesis of this happening by chance. The score is based on the hypothesis that, if a given cluster $i$ of size $s_i$ contains $x$ genes from a defined functional group of size $k_j$, then the probability of this occurring by chance follows a binomial distribution and is defined by
$$\Pr(\text{Observing } x \text{ from group } j) = \binom{k_j}{x} p^{x} q^{k_j - x},$$
where $p = s_i / n$, $q = 1 - p$ and $n$ is the number of genes in the dataset. When $k_j$ and $x$ are very large, as in this paper, Pr cannot be evaluated directly. We therefore use the normal approximation of the binomial distribution, where
$$z = \frac{x - \mu}{\sigma}, \quad \mu = k_j p, \quad \sigma = \sqrt{k_j p q}.$$
Positive values of $z$ mean that more genes from functional group $j$ are observed in cluster $i$ than expected by chance, and the larger $z$ is, the smaller the probability of this happening by chance (values of $z \ge 2.326$ correspond to a probability of less than 1%). The test performed is one-tailed.
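The normal approximation can be checked numerically. The sketch below computes $z$ and the one-tailed probability from $s_i$, $k_j$, $x$ and $n$; the function names are hypothetical, and returning NaN when $\sigma = 0$ is an assumption made here so that an empty functional group does not raise an error. The input values are those reported in Table 2.

```python
import math

def enrichment_z(s_i, k_j, x, n):
    """z-score of observing x genes from a group of size k_j in a cluster
    of size s_i, using the normal approximation to the binomial."""
    p = s_i / n
    q = 1.0 - p
    mu = k_j * p
    sigma = math.sqrt(k_j * p * q)
    if sigma == 0:
        # Assumption: an empty group (k_j = 0) yields an undefined score.
        return float("nan")
    return (x - mu) / sigma

def one_tailed_p(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Parameter values as listed in Table 2 (study ID: s_i, k_j, x, n).
table2 = {
    "GSE18864": (117, 2982, 11, 54675),
    "GSE9891": (61, 692, 4, 54675),
    "GSE10445": (80, 240, 3, 54675),
}
for study, params in table2.items():
    z = enrichment_z(*params)
    print(study, round(z, 2), one_tailed_p(z))
```

With these inputs the rounded z-scores agree with the values listed in Table 2 for the three studies that have a non-empty gene list.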
v) Logic and GUI. Finally, a user interface has been developed using the R package shiny [22]. The interface allows the user to input the networks obtained with glasso and to choose which combination of unique networks to identify, using the logic operators AND and NOT. For example, setting 1 AND 2 - NOT 3 identifies the sub-networks that studies 1 and 2 have in common but that do not appear in study 3. The unique sub-networks for that rule/pattern are identified and plotted on the interface together with the list of genes involved. The user can then save the network as a TIFF file and the list of genes involved in CSV format.
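The AND/NOT rule amounts to intersecting the edge sets of the included studies and then removing any edge found in an excluded study. The sketch below shows that logic only, not the shiny interface; `combine_networks` and the toy networks are hypothetical names and data.

```python
def combine_networks(networks, include, exclude=()):
    """Edges shared by every study in `include` and absent from every study
    in `exclude`, together with the sorted list of genes they involve."""
    # AND: intersection of the included studies (include must be non-empty).
    common = set.intersection(*(networks[s] for s in include))
    # NOT: drop edges that occur in any excluded study.
    for s in exclude:
        common -= networks[s]
    genes = sorted({g for e in common for g in e})
    return common, genes

# Toy networks invented for illustration; edges are undirected gene pairs.
toy_nets = {
    1: {frozenset(("a", "b")), frozenset(("b", "c"))},
    2: {frozenset(("a", "b")), frozenset(("b", "c")), frozenset(("c", "d"))},
    3: {frozenset(("b", "c"))},
}
# Rule "1 AND 2 NOT 3".
sub_network, gene_list = combine_networks(toy_nets, include=[1, 2], exclude=[3])
print(sub_network, gene_list)
```

Passing a single study in `include` and all remaining studies in `exclude` recovers the unique-network of that study.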
Results
In this study four cancer datasets from human patients are explored: breast, ovarian, medullary breast (a subtype of breast cancer) and lung. Each dataset contains a different number of samples (see Table 1). The variable selection approach reduces the number of variables/genes to analyse from 54675 to 1629. Variable reduction is followed by the application of glasso with the parameter ρ = 0.05. Given the glasso networks for each study, we consider only the edges that are present in the network under consideration but not in the others. Once the unique edges are detected, the genes involved are used to build a BN for each study; we call these unique-networks (U-Ns). An example of these networks is shown in Figure 1. The structure of the glasso U-Ns differs from the structure of the Bayesian U-Ns. In Figures 1a and 1b, the nodes with a grey background indicate genes with a prediction accuracy greater than 0.6 (a threshold based on our findings in [9]). Given the study descriptions in Table 1, we would expect breast cancer to be very similar (involving almost the same genes) to medullary breast cancer and slightly less similar to ovarian, but very different from lung cancer. This implies that the average internal prediction for each study will not differ much from the external prediction. The internal vs external prediction for each study, shown in Figure 2, reveals, as expected, a very clear difference only in Networks 3 and 4 (medullary breast and lung cancer respectively), with a small difference in 1 and 3. This deduction is supported by the p-values obtained from the applied t-test, shown in Table 1.
We now evaluate the significance of detecting the identified unique genes by calculating the probability score using the normal approximation. For this paper, $s_i$ is the size of each unique network, $k_j$ is the number of genes in the unique gene-list obtained for each cancer type by comparing the GeneCards gene lists, $x$ is the number of genes that are present in both the unique network and the corresponding unique gene-list, and $n$ is the number of genes in the original unprocessed dataset. The results in Table 2 show the z-score and the corresponding p-value, indicating that the probability of observing $x$ elements from functional group $j$ in cluster $i$ by chance is in all four cases very small. This implies that the unique genes identified by our pipeline are highly significant in all studies.
Finally, Figure 3 shows the Logic Application interface. The example allows the user to visualize the unique sub-networks, and the list of related genes, that studies 1 AND 4 have in common but that do not appear in study 2.
Table 1: Cancer datasets description and t-test p-value
Study ID    Study title                                     Samples   t-test p-value
GSE18864    Triple Negative Breast Cancer                   84        0.55
GSE9891     Ovarian Tumour                                  285       0.00
GSE21653    Medullary Breast Cancer                         266       0.02
GSE10445    Adenocarcinoma and large cell Lung Carcinoma    72        0.00
(a) Bayesian U-N for medullary-breast cancer. (b) Bayesian U-N for lung cancer.
Figure 1: Nodes with grey background indicate a prediction accuracy for the nodes greater than 0.6. Isolated
nodes do not have connections due to the structure differences between glasso U-Ns and Bayesian U-Ns.
Nodes are labelled with numbers (directly corresponding to the gene ID) for visualization purposes.
Figure 2: Internal vs External prediction accuracy for each study averaged among all genes involved in the
related unique-network.
Table 2: Parameter values, z-score and p-value for each study.
Study ID    s_i    k_j     x     n        z-score   p-value
GSE18864    117    2982    11    54675    1.83      3.4%
GSE9891     61     692     4     54675    3.68      < 1%
GSE21653    89     0       0     54675    NaN       1%
GSE10445    80     240     3     54675    4.47      < 1%
Figure 3: Logic Application interface.
Conclusions
We have developed a tool that aims to identify unique sub-networks and genes based upon a number of
microarray studies. We explore networks and genes that are robust and unique to a pre-selected number of
studies. We support our results using prediction accuracy and a score to test the significance of identifying
a subset of unique genes. Furthermore, we created an application interface which allows the user to combine
different studies through the AND and NOT logic operators. Based on the findings we conclude that our pipeline
is a robust and reliable method to analyse large sets of transcriptomic data. It detects relationships between
transcriptional expression of genes that are specific to different conditions and also highlights structures and
nodes that could be potential targets for further research.
References
1. Safran M, Dalah I, Alexander J, Rosen N, Stein TI, Shmoish M, et al. GeneCards Version 3: the human
gene integrator. Database. 2010;2010:baq020.
2. Choi JK, Yu U, Kim S, Yoo OJ. Combining multiple microarray studies and modeling interstudy
variation. Bioinformatics. 2003;19(suppl 1):i84–i90.
3. Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate
multiple datasets. Bioinformatics. 2012;28(24):3290–3297.
4. Steele E, Tucker A. Consensus and Meta-analysis regulatory networks for combining multiple microarray
gene expression datasets. Journal of biomedical informatics. 2008;41(6):914–926.
5. Anvar SY, Tucker A, et al. The identification of informative genes from multiple datasets with increasing
complexity. BMC bioinformatics. 2010;11(1):32.
6. Swift S, Tucker A, Vinciotti V, Martin N, Orengo C, Liu X, et al. Consensus clustering and functional
interpretation of gene-expression data. Genome biology. 2004;5(11):R94.
7. Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, et al. Module networks: identifying
regulatory modules and their condition-specific regulators from gene expression data. Nature genetics.
2003;34(2):166–176.
8. Zhang J, Lu K, Xiang Y, Islam M, Kotian S, Kais Z, et al. Weighted Frequent Gene Co-expression
Network Mining to Identify Genes Involved in Genome Stability. PLoS Computational Biology.
2012;8(8):e1002656.
9. Bo V, Curtis T, Lysenko A, Saqi M, Swift S, Tucker A. Discovering Study-Specific Gene Regulatory
Networks. PLoS ONE. 2014;9(9):e106524.
10. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization
array data repository. Nucleic acids research. 2002;30(1):207–210.
11. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy—analysis of Affymetrix GeneChip data at the probe
level. Bioinformatics. 2004;20(3):307–315.
12. Lu J, Bushel PR. pvac: PCA-based gene filtering for Affymetrix arrays; 2010. R package version 1.12.0.
13. Pearson K. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh,
and Dublin Philosophical Magazine and Journal of Science. 1901;2(11):559–572.
14. Friedman J, Hastie T, Tibshirani R. glasso: Graphical lasso- estimation of Gaussian graphical models;
2014. R package version 1.8.
15. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso.
Biostatistics. 2008;9(3):432–441.
16. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals
of Statistics. 2006;34(3):1436–1462.
17. Heckerman D, Geiger D, Chickering DM. Learning Bayesian networks: The combination of knowledge
and statistical data. Machine learning. 1995;20(3):197–243.
18. Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. Journal
of computational biology. 2000;7(3-4):601–620.
19. Scutari M. Learning Bayesian networks with the bnlearn R package. arXiv preprint arXiv:0908.3817. 2009.
20. Scutari M, Scutari MM. Package bnlearn. 2012.
21. Højsgaard S. Graphical Independence Networks with the gRain package for R. Journal; 2012.
22. RStudio, Inc. shiny: Web Application Framework for R; 2014. R package version 0.9.1.