ABSTRACT: Among network modeling tasks, identifying the rewiring of network structure
is particularly instrumental in revealing and pinpointing the molecular cause
of a disease. Effective incorporation of biological prior knowledge into
network learning algorithms can leverage domain knowledge and make data-driven
inference more robust and biologically relevant. We formulate the inference of
condition-specific network structures that incorporates relevant prior
knowledge as a convex optimization problem, and develop an efficient learning
algorithm to jointly infer the biological networks as well as their changes. We
test the proposed method on simulation data sets and demonstrate the
effectiveness of this method. We then apply our method to yeast cell line data
and breast cancer microarray data and obtain biologically plausible results.
ABSTRACT: Tissue heterogeneity is a major confounding factor in studying individual
populations that cannot be resolved directly by global profiling. Experimental
solutions to mitigate tissue heterogeneity are expensive, time consuming,
inapplicable to existing data, and may alter the original gene expression
patterns. Here we ask whether it is possible to deconvolute two-source mixed
expressions (estimating both proportions and cell-specific profiles) from two
or more heterogeneous samples without requiring any prior knowledge. Supported
by a well-grounded mathematical framework, we argue that both constituent
proportions and cell-specific expressions can be estimated in a completely
unsupervised mode when cell-specific marker genes exist, which do not have to
be known a priori, for each of constituent cell types. We demonstrate the
performance of unsupervised deconvolution on both simulation and real gene
expression data, together with perspective discussions.
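The role of marker genes in the abstract above can be illustrated with a small sketch. It is not the authors' algorithm; it assumes noise-free data, two mixed samples whose proportions each sum to one, and at least one marker gene per cell type. Under those assumptions, the genes with the most extreme share of signal in sample 1 are the markers, and their shares fix the mixing matrix. The function name `deconvolve_two_source` and its arguments are hypothetical.

```python
def deconvolve_two_source(x1, x2):
    """Estimate a 2x2 mixing matrix and the two source profiles.

    x1, x2: expression of the same genes in two mixed samples.
    Assumes (for illustration only) noise-free mixing, proportions that sum
    to one in each sample, and one marker gene per source.
    """
    # Share of each gene's total signal that falls in sample 1.
    ratios = [a / (a + b) for a, b in zip(x1, x2) if a + b > 0]
    p1, p2 = max(ratios), min(ratios)  # shares observed at the marker genes
    if p1 == p2:
        raise ValueError("samples have identical proportions; not identifiable")

    # Column scales that make each row of the mixing matrix sum to one.
    c1 = (1 - 2 * p2) / (p1 - p2)
    c2 = (2 * p1 - 1) / (p1 - p2)
    A = [[c1 * p1, c2 * p2],
         [c1 * (1 - p1), c2 * (1 - p2)]]

    # Invert the 2x2 mixing matrix to recover the source-specific profiles.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    s1 = [(A[1][1] * a - A[0][1] * b) / det for a, b in zip(x1, x2)]
    s2 = [(-A[1][0] * a + A[0][0] * b) / det for a, b in zip(x1, x2)]
    return A, s1, s2
```

Source 1 is labeled as the one that contributes more to sample 1; the labeling is otherwise arbitrary. With real, noisy data the extreme ratios would need a more robust estimate than a plain maximum and minimum.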
ABSTRACT: The reliability and reproducibility of gene biomarkers for classification of cancer patients have been challenged due to measurement noise and biological heterogeneity among patients. In this paper, we propose a novel module-based feature selection framework, which integrates biological network information and gene expression data to identify biomarkers not as individual genes but as functional modules. Results from four breast cancer studies demonstrate that the identified module biomarkers (i) achieve higher classification accuracy in independent validation datasets; (ii) are more reproducible than individual gene markers; (iii) improve the biological interpretability of results; and (iv) are enriched in cancer 'disease drivers'.
International Journal of Data Mining and Bioinformatics 01/2013; 7(3):284-302. · 0.39 Impact Factor
ABSTRACT: Identification of cooperative gene regulatory networks is an important topic for biological study, especially in cancer research. Traditional approaches suffer from large noise in gene expression data and false positive connections in motif binding data; they also fail to identify the modularized structure of gene regulatory networks. Methods that can reveal the underlying modularized structure and are robust to noise and false positives are needed.
We proposed and developed an integrated approach to identify gene regulatory networks, which consists of a novel clustering method (namely motif-guided affinity propagation clustering (mAPC)) and a sampling based method (called Gibbs sampler based on outlier sum statistic (GibbsOS)). mAPC is used in the first step to obtain co-regulated gene modules by clustering genes with a similarity measurement taking into account both gene expression data and binding motif information. This clustering method can reduce the noise effect from microarray data to obtain modularized gene clusters. However, due to many false positives in motif binding data, some genes not regulated by certain transcription factors (TFs) will be falsely clustered with true target genes. To overcome this problem, GibbsOS is applied in the second step to refine each cluster for the identification of true target genes. In order to evaluate the performance of the proposed method, we generated simulation data under different signal-to-noise ratios and false positive ratios to test the method. The experimental results show an improved accuracy in terms of clustering and transcription factor identification. Moreover, an improved performance is demonstrated in target gene identification as compared with GibbsOS. Finally, we applied the proposed method to two breast cancer patient datasets to identify cooperative transcriptional regulatory networks associated with recurrence of breast cancer, as supported by their functional annotations.
We have developed a two-step approach for gene regulatory network identification, featuring an integrated method to identify modularized regulatory structures and refine their target genes subsequently. Simulation studies have shown the robustness of the method against noise in gene expression data and false positives in motif binding data. The proposed method has been applied to two breast cancer gene expression datasets to infer the hidden regulation mechanisms. The experimental results demonstrate the efficacy of the method in identifying key regulatory networks related to the progression and recurrence of breast cancer.
ABSTRACT: Reliable inference of transcription regulatory networks is a challenging task in computational biology. Network component analysis (NCA) has become a powerful scheme to uncover regulatory networks behind complex biological processes. However, the performance of NCA is impaired by the high rate of false connections in binding information. In this paper, we integrate stability analysis with NCA to form a novel scheme, namely stability-based NCA (sNCA), for regulatory network identification. The method mainly addresses the inconsistency between gene expression data and binding motif information. Small perturbations are introduced to the prior regulatory network, and the distance among multiple estimated transcription factor (TF) activities is computed to reflect the stability of each TF's binding network. For target gene identification, multivariate regression and the t-statistic are used to calculate the significance of each TF-gene connection. Simulation studies are conducted and the experimental results show that sNCA can achieve an improved and robust performance in TF identification as compared to NCA. The approach for target gene identification is also demonstrated to be suitable for identifying true connections between TFs and their target genes. Furthermore, we have successfully applied sNCA to breast cancer data to uncover the role of TFs in regulating endocrine resistance in breast cancer.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 11/2012; 10(6):1347-58. · 2.25 Impact Factor
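The target-gene step described above (multivariate regression followed by a t-statistic per TF-gene connection) can be sketched as an ordinary least-squares fit. This is a generic illustration, not the published sNCA code: the no-intercept model, the function names, and the input layout are all assumptions.

```python
import math


def invert_matrix(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design: TF activities are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def tf_gene_t_statistics(tf_activity, gene_expr):
    """Regress one gene's expression on all TF activities (no intercept).

    tf_activity: list of samples, each a list with one activity per TF.
    gene_expr: the gene's expression in the same samples.
    Returns (coefficients, t_statistics), one entry per TF.
    """
    n, p = len(tf_activity), len(tf_activity[0])
    if n <= p:
        raise ValueError("need more samples than TFs")
    xtx = [[sum(row[i] * row[j] for row in tf_activity) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(tf_activity, gene_expr))
           for i in range(p)]
    inv = invert_matrix(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [y - sum(b * x for b, x in zip(beta, row))
                 for row, y in zip(tf_activity, gene_expr)]
    sigma2 = sum(r * r for r in residuals) / (n - p)  # residual variance
    t_stats = []
    for i, b in enumerate(beta):
        se = math.sqrt(sigma2 * inv[i][i])  # standard error of coefficient i
        t_stats.append(b / se if se > 0 else math.copysign(math.inf, b))
    return beta, t_stats
```

A large absolute t-statistic for a TF suggests that its activity explains the gene's expression beyond what the other TFs explain; turning it into a significance call would additionally need a reference distribution with n - p degrees of freedom and a multiple-testing correction.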
ABSTRACT: Identification of differentially expressed subnetworks from protein-protein interaction (PPI) networks has become increasingly important to our global understanding of the molecular mechanisms that drive cancer. Several methods have been proposed for PPI subnetwork identification, but the dependency among network member genes is not explicitly considered, leaving many important hub genes largely unidentified. We present a new method, based on a bagging Markov random field (BMRF) framework, to improve subnetwork identification for mechanistic studies of breast cancer. The method follows a maximum a posteriori principle to form a novel network score that explicitly considers pairwise gene interactions in PPI networks, and it searches for subnetworks with maximal network scores. To improve their robustness across data sets, a bagging scheme based on bootstrapping samples is implemented to statistically select high confidence subnetworks. We first compared the BMRF-based method with existing methods on simulation data to demonstrate its improved performance. We then applied our method to breast cancer data to identify PPI subnetworks associated with breast cancer progression and/or tamoxifen resistance. The experimental results show that the BMRF approach not only achieves improved prediction performance when tested on independent data sets, but also reveals biologically meaningful subnetworks that are relevant to breast cancer and tamoxifen resistance.
Nucleic Acids Research 11/2012; · 8.28 Impact Factor
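The bagging idea in the abstract above (resample the samples, rerun the search, keep what is selected consistently) can be sketched as follows. The selector here is a deliberately simple stand-in, a mean-difference cutoff, and not the MRF network score of the paper; the function names, the cutoff, and the frequency threshold are all hypothetical.

```python
import random


def select_genes(group_a, group_b, genes, cutoff):
    """Stand-in selector: genes whose mean differs between groups by > cutoff."""
    def mean(values):
        return sum(values) / len(values)
    return {g for g in genes
            if abs(mean([s[g] for s in group_a]) -
                   mean([s[g] for s in group_b])) > cutoff}


def bagged_selection(group_a, group_b, genes, cutoff=2.0,
                     n_boot=200, min_freq=0.9, seed=0):
    """Return {gene: selection frequency} for genes chosen in at least
    min_freq of the bootstrap replicates.

    Each replicate resamples the samples of each group with replacement
    (stratified bootstrap) and reruns the selector.
    """
    rng = random.Random(seed)
    counts = {g: 0 for g in genes}
    for _ in range(n_boot):
        boot_a = [rng.choice(group_a) for _ in group_a]
        boot_b = [rng.choice(group_b) for _ in group_b]
        for g in select_genes(boot_a, boot_b, genes, cutoff):
            counts[g] += 1
    return {g: c / n_boot for g, c in counts.items() if c / n_boot >= min_freq}
```

Any subnetwork search could be substituted for `select_genes`; the bagging wrapper only requires that it return a set of selected members for each resampled data set.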
ABSTRACT: To construct biologically interpretable gene sets for muscular dystrophy (MD) sub-type classification, we propose a novel computational scheme to integrate protein-protein interaction (PPI) network, functional gene set information, and mRNA profiling data. The workflow of the proposed scheme includes the following three major steps: firstly, we apply an affinity propagation clustering (APC) approach to identify gene sub-networks associated with each MD sub-type, in which a new distance metric is proposed for APC to combine PPI network information and gene-gene co-expression relationship; secondly, we further incorporate functional gene set knowledge, which complements the physical PPI information, into our scheme for biomarker identification; finally, based on the constructed sub-networks and gene set features, we apply multi-class support vector machines (MSVMs) for MD sub-type classification, with which to highlight the biomarkers contributing to sub-type prediction. The experimental results show that our scheme can help identify sub-networks and gene sets that are more relevant to MD than those constructed by other conventional approaches. Moreover, our integrative strategy improves the prediction accuracy substantially, especially for those 'hard-to-classify' sub-types.
ABSTRACT: With the advent of high-throughput biotechnology capable of monitoring genomic signals, it becomes increasingly promising to understand molecular cellular mechanisms through systems biology approaches. One of the active research topics in systems biology is to infer gene transcriptional regulatory networks using various genomic data; this inference problem can be formulated as a linear model with latent signals associated with some regulatory proteins called transcription factors (TFs). As common statistical assumptions may not hold for genomic signals, typical latent variable algorithms such as independent component analysis (ICA) are unable to reveal the true underlying regulatory signals. Liao et al. proposed to perform inference using an approach named network component analysis (NCA), the optimization of which is achieved by a least-squares fitting approach with biological knowledge constraints. However, the incompleteness of biological knowledge and its inconsistency with gene expression data are not considered in the original NCA solution, which could greatly affect the inference accuracy. To overcome these limitations, we propose a linear extraction scheme, namely regulatory component analysis (RCA), to infer underlying regulatory signals even with partial biological knowledge. Numerical simulations show a significant improvement of our proposed RCA over NCA, not only when the signal-to-noise ratio (SNR) is low, but also when the given biological knowledge is incomplete and inconsistent with gene expression data. Furthermore, real biological experiments on E. coli are performed for regulatory network inference in comparison with several typical linear latent variable methods, which again demonstrates the effectiveness and improved performance of the proposed algorithm.
Signal Processing 08/2012; 92(8):1902-1915. · 1.85 Impact Factor
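For orientation, the linear model and the constrained least-squares fit that the abstract attributes to NCA are commonly written as below. The symbols are notational assumptions rather than anything stated in the abstract: E is the genes-by-samples expression matrix, A holds TF-to-gene regulation strengths, P holds TF activities across samples, Γ is noise, and Z_0 is the set of matrices that are zero wherever prior knowledge reports no TF-gene binding.

```latex
E = A P + \Gamma, \qquad
\min_{A,\,P} \; \lVert E - A P \rVert_F^2
\quad \text{subject to } A \in Z_0
```

The limitation raised in the abstract corresponds to the constraint set: if the zero pattern in Z_0 is incomplete or disagrees with the expression data, the fitted A and P inherit that error.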
ABSTRACT: Identification of transcriptional regulatory networks (TRNs) is of significant importance in computational biology for cancer research, providing a critical building block to unravel disease pathways. However, existing methods for TRN identification suffer from the inclusion of excessive 'noise' in microarray data and false-positives in binding data, especially when applied to human tumor-derived cell line studies. More robust methods that can counteract the imperfection of data sources are therefore needed for reliable identification of TRNs in this context.
In this article, we propose to establish a link between the quality of one target gene to represent its regulator and the uncertainty of its expression to represent other target genes. Specifically, an outlier sum statistic was used to measure the aggregated evidence for regulation events between target genes and their corresponding transcription factors. A Gibbs sampling method was then developed to estimate the marginal distribution of the outlier sum statistic, hence, to uncover underlying regulatory relationships. To evaluate the effectiveness of our proposed method, we compared its performance with that of an existing sampling-based method using both simulation data and yeast cell cycle data. The experimental results show that our method consistently outperforms the competing method in different settings of signal-to-noise ratio and network topology, indicating its robustness for biological applications. Finally, we applied our method to breast cancer cell line data and demonstrated its ability to extract biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
The Gibbs sampler MATLAB package is freely available at http://www.cbil.ece.vt.edu/software.htm.
Supplementary data are available at Bioinformatics online.
ABSTRACT: How breast cancer cells respond to the stress of endocrine therapies determines whether they will acquire a resistant phenotype or execute a cell-death pathway. After a survival signal is successfully executed, a cell must decide whether it should replicate. How these cell-fate decisions are regulated is unclear, but evidence suggests that the signals that determine these outcomes are highly integrated. Central to the final cell-fate decision is signaling from the unfolded protein response, which can be activated following the sensing of stress within the endoplasmic reticulum. The duration of the response to stress is partly mediated by the duration of inositol-requiring enzyme-1 activation following its release from heat shock protein A5. The resulting signals appear to use several B-cell lymphoma-2 family members to both suppress apoptosis and activate autophagy. Changes in metabolism induced by cellular stress are key components of this regulatory system, and further adaptation of the metabolome is affected in response to stress. Here we describe the unfolded protein response, autophagy, and apoptosis, and how the regulation of these processes is integrated. Central topologic features of the signaling network that integrate cell-fate regulation and decision execution are discussed.
Cancer Research 03/2012; 72(6):1321-31. · 8.65 Impact Factor
ABSTRACT: NOTCH3 gene amplification plays an important role in the progression of many ovarian and breast cancers, but the targets of NOTCH3 signaling are unclear. Here, we report the use of an integrated systems biology approach to identify direct target genes for NOTCH3. Transcriptome analysis showed that suppression of NOTCH signaling in ovarian and breast cancer cells led to downregulation of genes in pathways involved in cell-cycle regulation and nucleotide metabolism. Chromatin immunoprecipitation (ChIP)-on-chip analysis defined promoter target sequences, including a new CSL binding motif (N1) in addition to the canonical CSL binding motif, that were occupied by the NOTCH3/CSL transcription complex. Integration of transcriptome and ChIP-on-chip data showed that the ChIP target genes overlapped significantly with the NOTCH-regulated transcriptome in ovarian cancer cells. From the set of genes identified, we showed that the mitotic apparatus organizing protein DLGAP5 (HURP/DLG7) was a critical target. Both the N1 motif and the canonical CSL binding motif were essential to activate DLGAP5 transcription. DLGAP5 silencing in cancer cells suppressed tumorigenicity and inhibited cellular proliferation by arresting the cell cycle at the G2-M phase. In contrast, enforced expression of DLGAP5 partially counteracted the growth inhibitory effects of a pharmacologic or RNA interference-mediated NOTCH inhibition in cancer cells. Our findings define direct target genes of NOTCH3 and highlight the role of DLGAP5 in mediating the function of NOTCH3.
Cancer Research 03/2012; 72(9):2294-303. · 8.65 Impact Factor
ABSTRACT: Identification of condition-specific protein interaction subnetworks has emerged as an attractive research field to reveal molecular mechanisms of diseases and provide reliable network biomarkers for disease diagnosis. Several methods have been proposed, which integrate gene expression and protein-protein interaction (PPI) data to identify subnetworks. However, existing methods treat differential expression of genes and network topology independently, which is an oversimplified assumption to model real biological systems. In this paper, we propose a sampling-based subnetwork identification approach to take into account the dependency between gene expression and network topology. Specifically, we apply Markov random field (MRF) theory to model the dependency of genes in the PPI network using a Bayesian framework, followed by a Markov Chain Monte Carlo (MCMC) approach to identify significant subnetworks. The MCMC approach estimates the posterior distribution of genes' significance scores and network structure iteratively. Experimental results on both synthetic data and real breast cancer data demonstrated the effectiveness of the proposed method in identifying subnetworks, especially several functionally important, aberrant subnetworks associated with pathways involved in the development and recurrence of breast cancer.
Machine Learning and Applications (ICMLA), 2012 11th International Conference on; 01/2012
ABSTRACT: Understanding the molecular changes that drive an acquired antiestrogen resistance phenotype is of major clinical relevance. Previous methodologies for addressing this question have taken a single gene/pathway approach and the resulting gains have been limited in terms of their clinical impact. Recent systems biology approaches allow for the integration of data from high throughput "-omics" technologies. We highlight recent advances in the field of antiestrogen resistance with a focus on transcriptomics, proteomics and methylomics.
Drug Discovery Today Disease Mechanisms 01/2012; 9(1-2):e11-e17.
ABSTRACT: Ovarian cancer is often called the 'silent killer' because early detection and prognosis are difficult. Understanding the biological mechanisms related to ovarian cancer is therefore extremely important for treatment. We propose an integrative framework to identify pathway-related networks based on large-scale TCGA copy number data and gene expression profiles. The integrative approach first detects highly conserved copy number altered genes and regards them as seed genes, and then applies a network-based method to identify subnetworks that can differentiate gene expression patterns between different phenotypes of ovarian cancer patients. The identified subnetworks are further validated on an independent gene expression data set using a network-based classification method. The experimental results show that our approach can not only achieve good prediction performance across different data sets but also identify biologically meaningful subnetworks involved in many signaling pathways related to ovarian cancer.
Pacific Symposium on Biocomputing 01/2012
ABSTRACT: It is biologically important to integrate high-throughput data to identify aberrant signal transduction pathways in cancer research. The high-throughput data acquired from The Cancer Genome Atlas (TCGA) Project offer a comprehensive picture of the genomic and transcriptional changes across hundreds of tumor samples. In this paper we propose a novel method, namely Gibbs sampler to Infer Signal Transduction pathways (GIST), to detect aberrant pathways that are highly associated with biological phenotypes or clinical information. GIST endeavors to estimate the edge probability by using a Markov Chain Monte Carlo (MCMC) method (i.e., a Gibbs sampling strategy). Through the sampling process, GIST is able to infer the correct signal transduction direction because the sampled edge probabilities are jointly determined by gene expression data and network topology. We first tested the efficacy of the GIST algorithm on yeast data and successfully uncovered several biologically meaningful signaling pathways. A case study on TCGA ovarian cancer data was further designed, aiming to unravel diverse signaling pathways associated with the development of ovarian cancer. The experimental results demonstrated the feasibility of applying GIST to identify and prioritize important signaling pathways in ovarian cancer for further biological validation.
ABSTRACT: Identification of intracellular signal transduction pathways plays an important role in understanding the mechanisms of how cells respond to external stimuli. The availability of high throughput microarray expression data and accumulating knowledge of protein-protein interactions have provided us with useful information to infer condition-specific signal transduction pathways. We propose a novel method called Gibbs sampler to Infer Signal Transduction pathways (GIST) to search for dysregulated pathways in large-scale protein-protein interaction networks. GIST incorporates different knowledge sources to extract paths that are highly associated with biological phenotypes or clinical information. One of the most attractive features of GIST is that the algorithm not only provides the single optimal path according to the defined cost function but also reveals multiple suboptimal paths as alternative solutions, which can be used to study pathway crosstalk. As a proof-of-concept, we test our GIST algorithm on yeast PPI networks, and the identified MAPK signaling pathways are well supported by existing biological knowledge. We also apply the GIST algorithm to a breast cancer patient dataset to show the feasibility of identifying potential pathways for further biological validation.
Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference 08/2011; 2011:2434-7.
ABSTRACT: Lack of understanding of endocrine resistance remains one of the major challenges for breast cancer researchers, clinicians, and patients. Current reductionist approaches to understanding the molecular signaling driving resistance have offered mostly incremental progress over the past 10 years. As the field of systems biology has begun to mature, the approaches and network modeling tools being developed and applied therein offer a different way to think about how molecular signaling and the regulation of critical cellular functions are integrated. To gain novel insights, we first describe some of the key challenges facing network modeling of endocrine resistance, many of which arise from the properties of the data spaces being studied. We then use activation of the unfolded protein response (UPR) following induction of endoplasmic reticulum stress in breast cancer cells by antiestrogens, to illustrate our approaches to computational modeling. Activation of UPR is a key determinant of cell fate decision making and regulation of autophagy and apoptosis. These initial studies provide insight into a small subnetwork topology obtained using differential dependency network analysis and focused on the UPR gene XBP1. The XBP1 subnetwork topology incorporates BCAR3, BCL2, BIK, NFκB, and other genes as nodes; the connecting edges represent the dependency structures amongst these nodes. As data from ongoing cellular and molecular studies become available, we will build detailed mathematical models of this XBP1-UPR network.
Hormone molecular biology and clinical investigation 03/2011; 5(1):35-44.
ABSTRACT: Phenotypic Up-regulated Gene Support Vector Machine (PUGSVM) is a cancer Biomedical Informatics Grid (caBIG™) analytical tool for multiclass gene selection and classification. PUGSVM addresses the problem of imbalanced class separability, small sample size and high gene space dimensionality, where multiclass gene markers are defined by the union of one-versus-everyone phenotypic upregulated genes, and used by a well-matched one-versus-rest support vector machine. PUGSVM provides a simple yet more accurate strategy to identify statistically reproducible mechanistic marker genes for characterization of heterogeneous diseases. AVAILABILITY: http://www.cbil.ece.vt.edu/caBIG-PUGSVM.htm.
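The marker definition above ("the union of one-versus-everyone phenotypic upregulated genes") can be illustrated with a small sketch. It reflects one plausible reading, assumed here rather than taken from the PUGSVM tool: a gene is upregulated for a class when its mean expression in that class exceeds its mean in every other class, and genes are ranked by the ratio to the highest competing class mean. The function name, the input layout, and the `genes_per_class` parameter are hypothetical, and strictly positive expression values are assumed.

```python
def ove_upregulated_markers(mean_expr_by_class, genes_per_class=1):
    """Union of per-class one-versus-everyone upregulated genes.

    mean_expr_by_class: {class_label: {gene: mean expression in that class}},
    with at least two classes and strictly positive means (assumption).
    """
    markers = set()
    for label, means in mean_expr_by_class.items():
        def ove_ratio(gene):
            # Compare against each other class separately, not a pooled rest.
            strongest_other = max(m[gene]
                                  for other, m in mean_expr_by_class.items()
                                  if other != label)
            return means[gene] / strongest_other

        upregulated = [g for g in means if ove_ratio(g) > 1]
        upregulated.sort(key=ove_ratio, reverse=True)
        markers.update(upregulated[:genes_per_class])
    return markers
```

Comparing against the strongest single competing class, instead of the pooled remainder, keeps a gene from being selected merely because it is low in one large class. The selected union would then be passed to a multiclass classifier such as the one-versus-rest SVM the abstract mentions.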
ABSTRACT: Genes work coordinately as gene modules or gene networks. Various computational approaches have been proposed to find gene modules based on gene expression data; for example, gene clustering is a popular method for grouping genes with similar gene expression patterns. However, traditional gene clustering often yields unsatisfactory results for regulatory module identification because the resulting gene clusters are co-expressed but not necessarily co-regulated.
We propose a novel approach, motif-guided sparse decomposition (mSD), to identify gene regulatory modules by integrating gene expression data and DNA sequence motif information. The mSD approach is implemented as a two-step algorithm comprising estimates of (1) transcription factor activity and (2) the strength of the predicted gene regulation event(s). Specifically, a motif-guided clustering method is first developed to estimate the transcription factor activity of a gene module; sparse component analysis is then applied to estimate the regulation strength, and so predict the target genes of the transcription factors. The mSD approach was first tested for its improved performance in finding regulatory modules using simulated and real yeast data, revealing functionally distinct gene modules enriched with biologically validated transcription factors. We then demonstrated the efficacy of the mSD approach on breast cancer cell line data and uncovered several important gene regulatory modules related to endocrine therapy of breast cancer.
We have developed a new integrated strategy, namely motif-guided sparse decomposition (mSD) of gene expression data, for regulatory module identification. The mSD method features a novel motif-guided clustering method for transcription factor activity estimation by finding a balance between co-regulation and co-expression. The mSD method further utilizes a sparse decomposition method for regulation strength estimation. The experimental results show that such a motif-guided strategy can provide context-specific regulatory modules in both yeast and breast cancer studies.