ABSTRACT: Chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) has greatly improved the reliability with which transcription factor binding sites (TFBSs) can be identified from genome-wide profiling studies. Many computational tools have been developed to detect binding events or peaks; however, the robust detection of weak binding events remains a challenge for current peak calling tools. We have developed a novel Bayesian approach (ChIP-BIT) to reliably detect TFBSs and their target genes by jointly modeling binding signal intensities and binding locations of TFBSs. Specifically, a Gaussian mixture model is used to capture both binding and background signals in sample data. As a unique feature of ChIP-BIT, background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. Extensive simulation studies showed a significantly improved performance of ChIP-BIT in target gene prediction, particularly for detecting weak binding signals at gene promoter regions. We applied ChIP-BIT to find target genes from NOTCH3 and PBX1 ChIP-seq data acquired from MCF-7 breast cancer cells. TF knockdown experiments have initially validated about 30% of co-regulated target genes identified by ChIP-BIT as being differentially expressed in MCF-7 cells. Functional analysis on these genes further revealed the existence of crosstalk between Notch and Wnt signaling pathways.
Full-text · Article · Dec 2015 · Nucleic Acids Research
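As a rough illustration of the mixture idea described in the abstract above, the following Python sketch fits a two-component Gaussian mixture in which the background component is estimated from a control (input) sample and held fixed. It is not the ChIP-BIT algorithm: the function name `fit_binding_mixture`, the use of plain EM in place of Bayesian inference, the single global background estimate, and the omission of binding-location modeling are all simplifying assumptions made here.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_binding_mixture(chip, control, n_iter=100):
    """Fit a two-component Gaussian mixture to ChIP signal intensities.

    The background component is estimated from the control sample and
    held fixed; only the binding component and the mixing weight are
    updated by EM. Returns (weight, mean, sd, posteriors).
    """
    # Assumption: one global background estimate, not a local one.
    bg_mean = sum(control) / len(control)
    bg_sd = math.sqrt(sum((c - bg_mean) ** 2 for c in control) / len(control))

    # Start the binding component at the strongest observed signal.
    weight, mean, sd = 0.5, max(chip), bg_sd
    post = [0.0] * len(chip)
    for _ in range(n_iter):
        # E-step: posterior probability that each region is bound.
        for i, x in enumerate(chip):
            p_bind = weight * normal_pdf(x, mean, sd)
            p_bg = (1.0 - weight) * normal_pdf(x, bg_mean, bg_sd)
            denom = p_bind + p_bg
            post[i] = p_bind / denom if denom > 0.0 else 0.0
        # M-step: update the binding component and mixing weight only.
        total = max(sum(post), 1e-12)
        weight = total / len(chip)
        mean = sum(p * x for p, x in zip(post, chip)) / total
        var = sum(p * (x - mean) ** 2 for p, x in zip(post, chip)) / total
        sd = max(math.sqrt(var), 1e-3)
    return weight, mean, sd, post


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical log-scale intensities: 80 background and 20 bound regions.
    control = [rng.gauss(1.0, 0.5) for _ in range(200)]
    chip = [rng.gauss(1.0, 0.5) for _ in range(80)]
    chip += [rng.gauss(3.5, 0.5) for _ in range(20)]
    weight, mean, sd, post = fit_binding_mixture(chip, control)
    print("estimated binding fraction:", round(weight, 3))
    print("estimated binding mean:", round(mean, 3))
```

Fixing the background from the control sample means that only the binding component has to be learned from the ChIP data, which is the intuition behind estimating the background from input data.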
ABSTRACT: Identification of protein interaction networks is a very important step for understanding the molecular mechanisms in cancer. Several methods have been developed to integrate protein-protein interaction (PPI) data with gene expression data for network identification. However, they often fail to model the dependency between genes in the network, which leaves many important genes, especially the upstream genes, unidentified. It is necessary to develop a method to improve the network identification performance by incorporating the dependency between genes.
We proposed an approach for identifying protein interaction networks by incorporating mutual information (MI) into a Markov random field (MRF) based framework to model the dependency between genes. MI is widely used in information theory to measure the mutual dependence between random variables. Unlike the traditional Pearson correlation test, MI is capable of capturing both linear and non-linear relationships between random variables. Among all the existing MI estimators, we choose to use the k-nearest neighbor MI (kNN-MI) estimator, which has been shown to have minimum bias. The estimated MI is integrated with an MRF framework to model the gene dependency in the context of the network. Maximum a posteriori (MAP) estimation is applied to the MRF-based model to estimate the network score. In order to reduce the computational complexity of finding the optimal network, a probabilistic searching algorithm is implemented. We further increase the robustness and reproducibility of the results by applying a non-parametric bootstrapping method to measure the confidence level of the identified genes. To evaluate the performance of the proposed method, we test the method on simulation data under different conditions. The experimental results show an improved accuracy in terms of subnetwork identification compared to existing methods. Furthermore, we applied our method to real breast cancer patient data; the identified protein interaction network shows a close association with the recurrence of breast cancer, which is supported by functional annotation. We also show that the identified subnetworks can be used to predict the recurrence status of cancer patients by survival analysis.
We have developed an integrated approach for protein interaction network identification, which combines Markov random field framework and mutual information to model the gene dependency in PPI network. Improvements in subnetwork identification have been demonstrated with simulation datasets compared to existing methods. We then apply our method onto breast cancer patient data to identify recurrence related subnetworks. The experiment results show that the identified genes are enriched in the pathway and functional categories relevant to progression and recurrence of breast cancer. Finally, the survival analysis based on identified subnetworks achieves a good result of classifying the recurrence status of cancer patients.
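To make the contrast between Pearson correlation and k-nearest neighbor mutual information concrete, here is a small Python sketch. It assumes the Kraskov-Stögbauer-Grassberger (KSG) estimator, a widely used kNN-MI estimator; the abstract does not say which variant the authors used, and the function names below are illustrative only. The MRF model, MAP scoring and bootstrapping steps are not shown.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: psi(n) = -gamma + sum_{k=1}^{n-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))


def knn_mutual_information(xs, ys, k=3):
    """KSG (variant 1) k-nearest-neighbour MI estimate, in nats. O(n^2) time."""
    n = len(xs)
    acc = 0.0
    for i in range(n):
        # Max-norm distances from point i to every other point in joint space.
        dists = sorted(
            max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
            for j in range(n) if j != i
        )
        eps = dists[k - 1]  # distance to the k-th nearest neighbour
        # Neighbours strictly closer than eps in each marginal space.
        nx = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        acc += digamma_int(nx + 1) + digamma_int(ny + 1)
    return digamma_int(k) + digamma_int(n) - acc / n


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical pair of "genes" with a purely non-linear relationship.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(300)]
    ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]
    print("Pearson r:", round(pearson(xs, ys), 3))
    print("kNN MI (nats):", round(knn_mutual_information(xs, ys), 3))
```

For a symmetric quadratic relationship like the one in the demo, the Pearson coefficient should stay near zero while the MI estimate is clearly positive, which is the property the abstract relies on.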
ABSTRACT: Soy flour diet (MS) prevented isoflavones from stimulating MCF-7 tumor growth in athymic nude mice, indicating that other bioactive compounds in soy can negate the estrogenic properties of isoflavones. The underlying signal transduction pathways to explain the protective effects of soy flour consumption were studied here.
Ovariectomized athymic nude mice inoculated with MCF-7 human breast cancer cells were fed either MS or purified isoflavone mix (MI), both with equivalent amounts of genistein. Positive controls received estradiol pellets and negative controls received sham pellets. GeneChip-Human-Genome-U133-Plus-2.0 Array platform was used to evaluate gene expressions, and results were analyzed using bioinformatics approaches. Tumors in MS-fed mice exhibited higher expression of tumor-growth-suppressing genes ATP2A3 and BLNK, and lower expression of oncogene MYC. Tumors in MI-fed mice expressed higher level of oncogene MYB and lower level of MHC-I and MHC-II, allowing tumor cells to escape immunosurveillance. MS-induced gene expression alterations were predictive of prolonged survival among estrogen-receptor-positive breast cancer patients, whilst MI-induced gene changes were predictive of shortened survival.
Our findings suggest dietary soy flour affects gene expression differently than purified isoflavones, which may explain why soy foods prevent isoflavones-induced stimulation of MCF-7 tumor growth in athymic nude mice. This article is protected by copyright. All rights reserved.
No preview · Article · Mar 2015 · Molecular Nutrition & Food Research
ABSTRACT: Characterizing the origin of high-grade serous ovarian cancer has significant practical importance for advancing biological knowledge and improving clinical treatments. Rapid advances in molecular profiling technologies and machine learning based data analytics provide new opportunities to investigate this important question using data-driven approaches at the molecular and network levels. We now report novel analytic results in assessing the origin of high-grade serous ovarian carcinoma. Using genome-wide gene expression data and effective machine learning approaches, we design proper statistical significance tests and perform both genomic and network analyses to discriminate among three possible origins. The experimental results are consistent with recent scientific hypotheses and independent findings.
ABSTRACT: We sought to determine the mechanisms underlying failure of muscle regeneration that is observed in dystrophic muscle through hypothesis generation using muscle profiling data (human dystrophy and murine regeneration). We found that transforming growth factor β-centered networks strongly associated with pathological fibrosis and failed regeneration were also induced during normal regeneration but at distinct time points. We hypothesized that asynchronously regenerating microenvironments are an underlying driver of fibrosis and failed regeneration. We validated this hypothesis using an experimental model of focal asynchronous bouts of muscle regeneration in wild-type (WT) mice. A chronic inflammatory state and reduced mitochondrial oxidative capacity were observed in bouts separated by 4 d, whereas a chronic profibrotic state was seen in bouts separated by 10 d. Treatment of asynchronously remodeling WT muscle with either prednisone or VBP15 mitigated the molecular phenotype. Our asynchronous regeneration model for pathological fibrosis and muscle wasting in the muscular dystrophies is likely generalizable to tissue failure in chronic inflammatory states in other regenerative tissues.
Full-text · Article · Oct 2014 · The Journal of Cell Biology
ABSTRACT:
We have developed an integrated molecular network learning method, within a well-grounded mathematical framework, to construct differential dependency networks with significant rewiring. This knowledge-fused differential dependency networks (KDDN) method, implemented as a Java Cytoscape app, can be used to optimally integrate prior biological knowledge with measured data to simultaneously construct both common and differential networks, to quantitatively assign model parameters and significant rewiring p-values and to provide user-friendly graphical results. The KDDN algorithm is computationally efficient and provides users with parallel computing capability using ubiquitous multi-core machines. We demonstrate the performance of KDDN on various simulations and real gene expression datasets, and further compare the results with those obtained by the most relevant peer methods. The acquired biologically plausible results provide new insights into network rewiring as a mechanistic principle and illustrate KDDN's ability to detect them efficiently and correctly. Although the principal application here involves microarray gene expressions, our methodology can be readily applied to other types of quantitative molecular profiling data.
Source code and compiled package are freely available for download at http://apps.cytoscape.org/apps/kddn.
Supplementary data are available at Bioinformatics online.
ABSTRACT: A statistical volumetric model, showing the probability map of localized
prostate cancer within the host anatomical structure, has been developed from
90 optically-imaged surgical specimens. This master model permits an accurate
characterization of prostate cancer distribution patterns and an atlas-informed
biopsy sampling strategy. The model is constructed by mapping individual
prostate models onto a site model, together with localized tumors. An accurate
multi-object non-rigid warping scheme is developed based on a mixture of
principal-axis registrations. We report our evaluation and pilot studies on the
effectiveness of the method and its application to optimizing needle biopsy
ABSTRACT: We develop a novel unsupervised deconvolution method, within a well-grounded mathematical framework, to dissect mixed gene expressions in heterogeneous tumor samples. We implement an R package, UNsupervised DecOnvolution (UNDO), that can be used to automatically detect cell-specific marker genes (MGs) located on the scatter radii of mixed gene expressions, estimate cellular proportions in each sample and deconvolute mixed expressions into cell-specific expression profiles. We demonstrate the performance of UNDO over a wide range of tumor-stroma mixing proportions, validate UNDO on various biologically mixed benchmark gene expression datasets and further estimate tumor purity in TCGA/CPTAC datasets. The highly accurate deconvolution results obtained suggest not only the existence of cell-specific MGs but also UNDO's ability to detect them blindly and correctly. Although the principal application here involves microarray gene expressions, our methodology can be readily applied to other types of quantitative molecular profiling data.
Availability and implementation:
UNDO is available at http://bioconductor.org/packages.
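The geometric idea behind marker genes lying on the scatter radii can be sketched for the simplest case of two cell types and two mixed samples. The Python code below is not the UNDO package or its API (UNDO is an R package); the function names are hypothetical, and the sketch assumes noise-free linear mixing, non-negative expression, no all-zero genes, at least one marker gene per cell type, and mixing proportions that sum to one in each sample.

```python
import math


def estimate_mixing_matrix(x1, x2):
    """Estimate a 2x2 mixing matrix A from two mixed expression profiles.

    x1 and x2 hold the expression of the same genes in mixed samples 1
    and 2. Column order of A follows the scatter angle, so which column
    is "tumor" and which is "stroma" is not determined here.
    """
    # Marker genes sit on the two edges of the scatter cone, i.e. at the
    # smallest and largest angles in the (sample 1, sample 2) plane.
    angles = [math.atan2(b, a) for a, b in zip(x1, x2)]
    lo = angles.index(min(angles))
    hi = angles.index(max(angles))
    d1 = (x1[lo], x2[lo])  # direction of column 1, known up to scale
    d2 = (x1[hi], x2[hi])  # direction of column 2, known up to scale

    # Choose scales c1, c2 so that each row of A sums to one:
    #   d1[0]*c1 + d2[0]*c2 = 1
    #   d1[1]*c1 + d2[1]*c2 = 1
    det = d1[0] * d2[1] - d2[0] * d1[1]
    c1 = (d2[1] - d2[0]) / det
    c2 = (d1[0] - d1[1]) / det
    return [[d1[0] * c1, d2[0] * c2], [d1[1] * c1, d2[1] * c2]]


def deconvolve(x1, x2, A):
    """Recover cell-type-specific profiles by inverting the 2x2 matrix A."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    s1 = [(A[1][1] * a - A[0][1] * b) / det for a, b in zip(x1, x2)]
    s2 = [(-A[1][0] * a + A[0][0] * b) / det for a, b in zip(x1, x2)]
    return s1, s2


if __name__ == "__main__":
    # Hypothetical pure profiles; genes 0 and 1 are markers of types 1 and 2.
    s1_true = [10.0, 0.0, 5.0, 3.0, 6.0]
    s2_true = [0.0, 8.0, 5.0, 7.0, 2.0]
    A_true = [[0.7, 0.3], [0.2, 0.8]]
    x1 = [A_true[0][0] * a + A_true[0][1] * b for a, b in zip(s1_true, s2_true)]
    x2 = [A_true[1][0] * a + A_true[1][1] * b for a, b in zip(s1_true, s2_true)]
    A = estimate_mixing_matrix(x1, x2)
    print("estimated proportions:", [[round(v, 3) for v in row] for row in A])
```

With real, noisy data the cone edges would have to be estimated more robustly than by taking the two extreme genes, which is where a dedicated method is needed.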
ABSTRACT: High coverage whole genome DNA-sequencing enables identification of somatic structural variation (SSV) more evident in paired tumor and normal samples. Recent studies show that simultaneous analysis of paired samples provides a better resolution of SSV detection than subtracting shared SVs. However, available tools can neither identify all types of SSVs nor provide any rank information regarding their somatic features. In this paper, we have developed a Bayesian framework, by integrating read alignment information from both tumor and normal samples, called BSSV, to calculate the significance of each SSV. Tested by simulated data, the precision of BSSV is comparable to that of available tools and the false negative rate is significantly lowered. We have also applied this approach to The Cancer Genome Atlas breast cancer data for SSV detection. Many known breast cancer specific mutated genes like RAD51, BRIP1, ER, PGR and PTPRD have been successfully identified.
ABSTRACT: Background
Modeling biological networks serves as both a major goal and an effective tool of systems biology in studying mechanisms that orchestrate the activities of gene products in cells. Biological networks are context-specific and dynamic in nature. To systematically characterize the selectively activated regulatory components and mechanisms, modeling tools must be able to effectively distinguish significant rewiring from random background fluctuations. While differential networks cannot be constructed by existing knowledge alone, novel incorporation of prior knowledge into data-driven approaches can improve the robustness and biological relevance of network inference. However, the major unresolved roadblocks include: big solution space but a small sample size; highly complex networks; imperfect prior knowledge; missing significance assessment; and heuristic structural parameter learning.
Results
To address these challenges, we formulated the inference of differential dependency networks that incorporate both conditional data and prior knowledge as a convex optimization problem, and developed an efficient learning algorithm to jointly infer the conserved biological network and the significant rewiring across different conditions. We used a novel sampling scheme to estimate the expected error rate due to "random" knowledge. Based on that scheme, we developed a strategy that fully exploits the benefit of this data-knowledge integrated approach. We demonstrated and validated the principle and performance of our method using synthetic datasets. We then applied our method to yeast cell line and breast cancer microarray data and obtained biologically plausible results. The open-source R software package and the experimental data are freely available at http://www.cbil.ece.vt.edu/software.htm.
Conclusions
Experiments on both synthetic and real data demonstrate the effectiveness of the knowledge-fused differential dependency network in revealing the statistically significant rewiring in biological networks. The method efficiently leverages data-driven evidence and existing biological knowledge while remaining robust to the false positive edges in the prior knowledge. The identified network rewiring events are supported by previous studies in the literature and also provide new mechanistic insight into the biological systems. We expect the knowledge-fused differential dependency network analysis, together with the open-source R package, to be an important and useful bioinformatics tool in biological network analyses.
Full-text · Article · Jul 2014 · BMC Systems Biology
ABSTRACT: Background
Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated due to biological and technical variations in RNA-Seq data.
We systematically study the variation in count data and dissect the sources of variation into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimate of gene level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application on breast cancer dataset has further illustrated that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases.
We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated desirable performance of the proposed method. The software package is available at http://www.cbil.ece.vt.edu/software.htm.
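The over-dispersion that motivates hierarchical count models such as the one above can be seen in a short simulation. The Python sketch below is not the authors' software: it only compares plain Poisson counts with Poisson-lognormal counts, using hypothetical parameter values and function names chosen here, and it does not implement the Gamma-Gamma between-sample model or any inference.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's method (moderate rates only)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def poisson_lognormal_counts(log_mean, log_sd, n, rng):
    """Counts whose Poisson rate is itself lognormally distributed."""
    return [
        sample_poisson(math.exp(rng.gauss(log_mean, log_sd)), rng)
        for _ in range(n)
    ]


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson, larger if over-dispersed."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean


if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical read counts with a typical rate of about 20.
    pois = [sample_poisson(20.0, rng) for _ in range(2000)]
    pln = poisson_lognormal_counts(math.log(20.0), 0.5, 2000, rng)
    print("Poisson dispersion index:", round(dispersion_index(pois), 2))
    print("Poisson-lognormal dispersion index:", round(dispersion_index(pln), 2))
```

A dispersion index well above one for the second set of counts shows why a single Poisson rate is too restrictive for such data.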
ABSTRACT: The rapid development of biotechnology makes it possible to explore genome-wide DNA methylation mapping, which has been demonstrated to be related to diseases including cancer. However, it also poses substantial challenges in identifying biologically meaningful methylation pattern changes. Several algorithms have been proposed to detect differential methylation events, such as differentially methylated CpG sites and differentially methylated regions. However, the intrinsic dependency of the CpG sites in a neighboring area has not yet been fully considered. In this paper, we propose a novel method for the identification of differentially methylated genes in a Markov random field-based Bayesian framework. Specifically, we use a Markov random field to model the dependency of the neighboring CpG sites, and then estimate the differential methylation score of the CpG sites in a Bayesian framework through a sampling scheme. Finally, the differential methylation statuses of the genes are determined by the estimated scores of the involved CpG sites. In addition, a significance test is conducted to assess the significance of the identified differentially methylated genes. Experimental results on both synthetic data and real data demonstrate the effectiveness of the proposed method in identifying genes with differential methylation patterns under different conditions.
ABSTRACT: ChIP-Seq experiments provide accurate measurements of the regulatory roles of transcription factors (TFs) under specific conditions. Downstream target genes can be detected by analyzing the enriched TF binding sites (TFBSs) in genes' promoter regions. The location and statistical information of TFBSs make it possible to evaluate the relative importance of each binding. Based on the assumption that the TFBSs of one ChIP-Seq experiment follow the same specific location distribution, a statistical model is first proposed using both location and significance information of peaks to weigh target genes. With genes' binding scores from different TFs, we merge them into a weighted binding matrix. A Markov Chain Monte Carlo (MCMC) based approach is then applied to the binding matrix for co-regulatory module identification. We demonstrate the efficiency of our statistical model on an ER-α ChIP-Seq dataset and further identify co-regulatory modules by using eleven breast cancer related TFs from ENCODE ChIP-Seq datasets. The results show that the TFs in each module regulate common high-scoring target genes; the association of TFs is biologically meaningful, and the functional roles of TFs and target genes are consistent.
ABSTRACT: Identification of cooperative gene regulatory networks is an important topic in biological study, especially in cancer research. Traditional approaches suffer from large noise in gene expression data and false positive connections in motif binding data; they also fail to identify the modularized structure of gene regulatory networks. Methods are needed that can reveal the underlying modularized structure while remaining robust to noise and false positives.
We proposed and developed an integrated approach to identify gene regulatory networks, which consists of a novel clustering method (namely motif-guided affinity propagation clustering (mAPC)) and a sampling based method (called Gibbs sampler based on outlier sum statistic (GibbsOS)). mAPC is used in the first step to obtain co-regulated gene modules by clustering genes with a similarity measurement taking into account both gene expression data and binding motif information. This clustering method can reduce the noise effect from microarray data to obtain modularized gene clusters. However, due to many false positives in motif binding data, some genes not regulated by certain transcription factors (TFs) will be falsely clustered with true target genes. To overcome this problem, GibbsOS is applied in the second step to refine each cluster for the identification of true target genes. In order to evaluate the performance of the proposed method, we generated simulation data under different signal-to-noise ratios and false positive ratios to test the method. The experimental results show an improved accuracy in terms of clustering and transcription factor identification. Moreover, an improved performance is demonstrated in target gene identification as compared with GibbsOS. Finally, we applied the proposed method to two breast cancer patient datasets to identify cooperative transcriptional regulatory networks associated with recurrence of breast cancer, as supported by their functional annotations.
We have developed a two-step approach for gene regulatory network identification, featuring an integrated method to identify modularized regulatory structures and refine their target genes subsequently. Simulation studies have shown the robustness of the method against noise in gene expression data and false positives in motif binding data. The proposed method has been applied to two breast cancer gene expression datasets to infer the hidden regulation mechanisms. The experimental results demonstrate the efficacy of the method in identifying key regulatory networks related to the progression and recurrence of breast cancer.
Full-text · Article · Dec 2013 · BMC Systems Biology
ABSTRACT: ChIP-chip experiments are performed to determine binding sites for transcription factors (TFs). Conventionally, TF-gene regulation is inferred from a p-value cutoff on the binding sites and their distance to the nearest genes. Taking into account that binding sites of one ChIP-chip experiment should follow the same specific location distribution, we proposed a statistical model using both location and significance information to weigh target genes. With multiple ChIP-chip experiments and gene expression data, we identified co-regulatory and differentially expressed gene modules with a joint clustering and Metropolis sampling approach. We demonstrated the efficiency of our method on a ChIP-chip data set with 38 breast cancer related TFs.
ABSTRACT: Among network modeling tasks, identifying the rewiring of network structure
is particularly instrumental in revealing and pinpointing the molecular cause
of a disease. Effective incorporation of biological prior knowledge into
network learning algorithms can leverage domain knowledge and make data driven
inference more robust and biologically relevant. We formulate the inference of
condition specific network structures that incorporates relevant prior
knowledge as a convex optimization problem, and develop an efficient learning
algorithm to jointly infer the biological networks as well as their changes. We
test the proposed method on simulation data sets and demonstrate the
effectiveness of this method. We then apply our method to yeast cell line data
and breast cancer microarray data and obtain biologically plausible results.