ABSTRACT: We sought to determine the mechanisms underlying the failure of muscle regeneration observed in dystrophic muscle through hypothesis generation using muscle profiling data (human dystrophy and murine regeneration). We found that transforming growth factor β-centered networks strongly associated with pathological fibrosis and failed regeneration were also induced during normal regeneration but at distinct time points. We hypothesized that asynchronously regenerating microenvironments are an underlying driver of fibrosis and failed regeneration. We validated this hypothesis using an experimental model of focal asynchronous bouts of muscle regeneration in wild-type (WT) mice. A chronic inflammatory state and reduced mitochondrial oxidative capacity were observed in bouts separated by 4 d, whereas a chronic profibrotic state was seen in bouts separated by 10 d. Treatment of asynchronously remodeling WT muscle with either prednisone or VBP15 mitigated the molecular phenotype. Our asynchronous regeneration model for pathological fibrosis and muscle wasting in the muscular dystrophies is likely generalizable to tissue failure in chronic inflammatory states in other regenerative tissues.
The Journal of Cell Biology 10/2014; 207(1):139-158. · 9.69 Impact Factor
ABSTRACT: We have developed an integrated molecular network learning method, within a well-grounded mathematical framework, to construct differential dependency networks with significant rewiring. This knowledge-fused differential dependency networks (KDDN) method, implemented as a Java Cytoscape app, can be used to optimally integrate prior biological knowledge with measured data to simultaneously construct both common and differential networks, to quantitatively assign model parameters and significant rewiring p-values, and to provide user-friendly graphical results. The KDDN algorithm is computationally efficient and provides users with parallel computing capability utilizing ubiquitous multi-core machines. We demonstrate the performance of KDDN on various simulations and real gene expression datasets, and further compare the results with those obtained by the most relevant peer methods. The acquired biologically plausible results provide new insights into network rewiring as a mechanistic principle and illustrate KDDN's ability to detect such rewiring efficiently and correctly. While the principal application here involves microarray gene expressions, our methodology can be readily applied to other types of quantitative molecular profiling data.
Availability: Source code and compiled package are freely available for download at http://apps.cytoscape.org/apps/kddn
Supplementary information: Supplementary data are available at Bioinformatics online.
ABSTRACT: A statistical volumetric model, showing the probability map of localized
prostate cancer within the host anatomical structure, has been developed from
90 optically-imaged surgical specimens. This master model permits an accurate
characterization of prostate cancer distribution patterns and an atlas-informed
biopsy sampling strategy. The model is constructed by mapping individual
prostate models onto a site model, together with localized tumors. An accurate
multi-object non-rigid warping scheme is developed based on a mixture of
principal-axis registrations. We report our evaluation and pilot studies on the
effectiveness of the method and its application to optimizing needle biopsy
ABSTRACT: We develop a novel unsupervised deconvolution method, within a well-grounded mathematical framework, to dissect mixed gene expressions in heterogeneous tumor samples. We implement an R package, UNsupervised DecOnvolution (UNDO), that can be used to automatically detect cell-specific marker genes located on the scatter radii of mixed gene expressions, estimate cellular proportions in each sample, and deconvolute mixed expressions into cell-specific expression profiles. We demonstrate the performance of UNDO over a wide range of tumor-stroma mixing proportions, validate UNDO on various biologically-mixed benchmark gene expression datasets, and further estimate tumor purity in TCGA/CPTAC datasets. The obtained highly accurate deconvolution results suggest not only the existence of cell-specific marker genes but also UNDO's ability to detect them blindly and correctly. While the principal application here involves microarray gene expressions, our methodology can be readily applied to other types of quantitative molecular profiling data. Availability: UNDO is available at http://bioconductor.org/packages.
ABSTRACT: High coverage whole genome DNA-sequencing enables identification of somatic structural variation (SSV) more evident in paired tumor and normal samples. Recent studies show that simultaneous analysis of paired samples provides a better resolution of SSV detection than subtracting shared SVs. However, available tools can neither identify all types of SSVs nor provide any rank information regarding their somatic features. In this paper, we have developed a Bayesian framework, called BSSV, that integrates read alignment information from both tumor and normal samples to calculate the significance of each SSV. In tests on simulated data, the precision of BSSV is comparable to that of available tools and the false negative rate is significantly lowered. We have also applied this approach to The Cancer Genome Atlas breast cancer data for SSV detection. Many known breast cancer specific mutated genes like RAD51, BRIP1, ER, PGR and PTPRD have been successfully identified.
ABSTRACT: Background
Modeling biological networks serves as both a major goal and an effective tool of systems biology in studying mechanisms that orchestrate the activities of gene products in cells. Biological networks are context-specific and dynamic in nature. To systematically characterize the selectively activated regulatory components and mechanisms, modeling tools must be able to effectively distinguish significant rewiring from random background fluctuations. While differential networks cannot be constructed by existing knowledge alone, novel incorporation of prior knowledge into data-driven approaches can improve the robustness and biological relevance of network inference. However, the major unresolved roadblocks include: big solution space but a small sample size; highly complex networks; imperfect prior knowledge; missing significance assessment; and heuristic structural parameter learning.
Results
To address these challenges, we formulated the inference of differential dependency networks that incorporate both conditional data and prior knowledge as a convex optimization problem, and developed an efficient learning algorithm to jointly infer the conserved biological network and the significant rewiring across different conditions. We used a novel sampling scheme to estimate the expected error rate due to "random" knowledge. Based on that scheme, we developed a strategy that fully exploits the benefit of this data-knowledge integrated approach. We demonstrated and validated the principle and performance of our method using synthetic datasets. We then applied our method to yeast cell line and breast cancer microarray data and obtained biologically plausible results. The open-source R software package and the experimental data are freely available at http://www.cbil.ece.vt.edu/software.htm.
Conclusions
Experiments on both synthetic and real data demonstrate the effectiveness of the knowledge-fused differential dependency network in revealing the statistically significant rewiring in biological networks. The method efficiently leverages data-driven evidence and existing biological knowledge while remaining robust to the false positive edges in the prior knowledge. The identified network rewiring events are supported by previous studies in the literature and also provide new mechanistic insight into the biological systems. We expect the knowledge-fused differential dependency network analysis, together with the open-source R package, to be an important and useful bioinformatics tool in biological network analyses.
BMC Systems Biology 07/2014; 8(1):87. · 2.85 Impact Factor
ABSTRACT: Background: Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and
resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of
differentially expressed genes are complicated due to biological and technical variations in RNA-Seq data.
Results: We systematically study the variation in count data and dissect the sources of variation into
between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimation
of gene level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve
the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model
within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which
accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts
are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the
proposed method in both quantification of mRNA abundance and identification of differentially expressed genes.
Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC
spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis.
Application to a breast cancer dataset has further illustrated that the proposed Bayesian model can
‘blindly’ estimate sources of variation caused by sequencing biases.
Conclusions: We have developed a novel Bayesian hierarchical approach to investigate within-sample and
between-sample variations in RNA-Seq data. Simulation and real data applications have validated desirable performance
of the proposed method.
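The over-dispersion that motivates the hierarchical model above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' model or inference procedure: it only shows that counts drawn from a Poisson distribution whose rate is itself lognormal have a variance well above their mean, whereas plain Poisson counts do not. The function names, the parameter values, and the use of a variance-to-mean ratio as a summary are all assumptions made for illustration.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_lognormal_counts(mu, sigma, n, rng):
    """Counts whose Poisson rate is itself random: rate ~ LogNormal(mu, sigma)."""
    return [sample_poisson(rng.lognormvariate(mu, sigma), rng) for _ in range(n)]


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for a plain Poisson sample."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(0)
mu, sigma = math.log(5.0), 0.5  # hypothetical parameters, chosen for illustration
mixed = poisson_lognormal_counts(mu, sigma, 5000, rng)

# Plain Poisson sample with the same expected count, for comparison.
matched_rate = math.exp(mu + sigma ** 2 / 2)
plain = [sample_poisson(matched_rate, rng) for _ in range(5000)]

print("Poisson-lognormal dispersion index:", dispersion_index(mixed))
print("Plain Poisson dispersion index:", dispersion_index(plain))
```

A dispersion index clearly above 1 for the mixed sample is the kind of extra variability that a plain Poisson model would miss.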
ABSTRACT: The rapid development of biotechnology makes it possible to explore genome-wide DNA methylation mapping, which has been shown to be related to diseases including cancer. However, it also poses substantial challenges in identifying biologically meaningful methylation pattern changes. Several algorithms have been proposed to detect differential methylation events, such as differentially methylated CpG sites and differentially methylated regions. However, the intrinsic dependency of the CpG sites in a neighboring area has not yet been fully considered. In this paper, we propose a novel method for the identification of differentially methylated genes in a Markov random field-based Bayesian framework. Specifically, we use a Markov random field to model the dependency of the neighboring CpG sites, and then estimate the differential methylation score of the CpG sites in a Bayesian framework through a sampling scheme. Finally, the differential methylation statuses of the genes are determined by the estimated scores of the involved CpG sites. In addition, a significance test is conducted to assess the significance of the identified differentially methylated genes. Experimental results on both synthetic data and real data demonstrate the effectiveness of the proposed method in identifying genes with differential methylation patterns under different conditions.
2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB); 05/2014
ABSTRACT: Among network modeling tasks, identifying the rewiring of network structure
is particularly instrumental in revealing and pinpointing the molecular cause
of a disease. Effective incorporation of biological prior knowledge into
network learning algorithms can leverage domain knowledge and make data-driven
inference more robust and biologically relevant. We formulate the inference of
condition-specific network structures that incorporate relevant prior
knowledge as a convex optimization problem, and develop an efficient learning
algorithm to jointly infer the biological networks as well as their changes. We
test the proposed method on simulation data sets and demonstrate the
effectiveness of this method. We then apply our method to yeast cell line data
and breast cancer microarray data and obtain biologically plausible results.
ABSTRACT: Tissue heterogeneity is a major confounding factor in studying individual
populations that cannot be resolved directly by global profiling. Experimental
solutions to mitigate tissue heterogeneity are expensive, time consuming,
inapplicable to existing data, and may alter the original gene expression
patterns. Here we ask whether it is possible to deconvolute two-source mixed
expressions (estimating both proportions and cell-specific profiles) from two
or more heterogeneous samples without requiring any prior knowledge. Supported
by a well-grounded mathematical framework, we argue that both constituent
proportions and cell-specific expressions can be estimated in a completely
unsupervised mode when cell-specific marker genes exist, which do not have to
be known a priori, for each of constituent cell types. We demonstrate the
performance of unsupervised deconvolution on both simulation and real gene
expression data, together with perspective discussions.
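The geometric idea behind two-source unsupervised deconvolution can be sketched in a few lines. This is a minimal, noise-free illustration, not the published method or package: it assumes two mixed samples, mixing proportions that sum to 1 within each sample, and at least one marker gene per source (a gene expressed in only one source). Under those assumptions the marker genes give the extreme ratios between the two samples, and those two ratios determine the mixing matrix. The function names and the convention that source 1 is the one with the smaller ratio are choices made here for illustration.

```python
def estimate_mixing(sample1, sample2):
    """Estimate the 2x2 mixing matrix from the extreme sample2/sample1 ratios.

    Marker genes of each source sit on the two boundary rays of the scatter
    plot, so the smallest and largest ratios give the column directions.
    Assumes the two extreme ratios differ (the samples are mixed differently).
    """
    slopes = [x2 / x1 for x1, x2 in zip(sample1, sample2) if x1 > 0]
    r1, r2 = min(slopes), max(slopes)
    # Scale the two directions so that each sample's proportions sum to 1.
    a11 = (1 - r2) / (r1 - r2)
    a12 = 1 - a11
    return [[a11, a12], [r1 * a11, r2 * a12]]


def unmix(mixing, sample1, sample2):
    """Recover the two source profiles by inverting the 2x2 mixing matrix."""
    (a, b), (c, d) = mixing
    det = a * d - b * c
    source1 = [(d * x1 - b * x2) / det for x1, x2 in zip(sample1, sample2)]
    source2 = [(a * x2 - c * x1) / det for x1, x2 in zip(sample1, sample2)]
    return source1, source2


# Hypothetical toy data: gene 0 is a marker for source 1, gene 1 for source 2.
true_source1 = [10.0, 0.0, 4.0, 6.0]
true_source2 = [0.0, 8.0, 5.0, 2.0]
true_mixing = [[0.7, 0.3], [0.2, 0.8]]

mixed1 = [true_mixing[0][0] * s1 + true_mixing[0][1] * s2
          for s1, s2 in zip(true_source1, true_source2)]
mixed2 = [true_mixing[1][0] * s1 + true_mixing[1][1] * s2
          for s1, s2 in zip(true_source1, true_source2)]

estimated_mixing = estimate_mixing(mixed1, mixed2)
estimated_source1, estimated_source2 = unmix(estimated_mixing, mixed1, mixed2)
print(estimated_mixing)
```

With real, noisy data the extreme ratios would have to be estimated robustly rather than taken as a raw minimum and maximum; that step is outside the scope of this sketch.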
ABSTRACT: Despite encouraging progress made by integrating multi-platform data for regulatory network reconstruction, identification of transcriptional regulatory networks remains challenging due to imperfection in current biotechnology and complexity of biological systems. It is important to develop new computational approaches for reliable regulatory network reconstruction, especially approaches that are robust against noise in gene expression data and 'structural error' (i.e., false connections) in binding data. We propose a new method, namely probabilistic network component analysis (pNCA), to estimate the posterior binding matrix given observed gene expression and binding data. The elements in the binding matrix, instead of taking deterministic binary values, are modeled as unknown Bernoulli random variables that represent the probability of regulation. A novel two-stage Gibbs sampling framework is employed to iteratively estimate both hidden transcription factor activities and the posterior distribution of the binding matrix. Numerical simulation on synthetic data has demonstrated improved performance of the proposed method over several existing methods for regulatory network identification. Notably, the robustness of pNCA against 'structural error' in initial binding data is fortified with high tolerance of false negative connections in addition to that of false positive connections. The proposed method has been applied to breast cancer cell line data to reconstruct biologically meaningful regulatory networks, revealing condition-specific regulatory rewiring and important cooperative regulation associated with estrogen signaling and action in breast cancer cells.
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics; 09/2013
ABSTRACT: Identification of cooperative gene regulatory networks is an important topic for biological study, especially in cancer research. Traditional approaches suffer from large noise in gene expression data and false positive connections in motif binding data; they also fail to identify the modularized structure of gene regulatory networks. Methods that can reveal the underlying modularized structure and are robust to noise and false positives still need to be developed.
We proposed and developed an integrated approach to identify gene regulatory networks, which consists of a novel clustering method (namely motif-guided affinity propagation clustering (mAPC)) and a sampling based method (called Gibbs sampler based on outlier sum statistic (GibbsOS)). mAPC is used in the first step to obtain co-regulated gene modules by clustering genes with a similarity measurement taking into account both gene expression data and binding motif information. This clustering method can reduce the noise effect from microarray data to obtain modularized gene clusters. However, due to many false positives in motif binding data, some genes not regulated by certain transcription factors (TFs) will be falsely clustered with true target genes. To overcome this problem, GibbsOS is applied in the second step to refine each cluster for the identification of true target genes. In order to evaluate the performance of the proposed method, we generated simulation data under different signal-to-noise ratios and false positive ratios to test the method. The experimental results show an improved accuracy in terms of clustering and transcription factor identification. Moreover, an improved performance is demonstrated in target gene identification as compared with GibbsOS. Finally, we applied the proposed method to two breast cancer patient datasets to identify cooperative transcriptional regulatory networks associated with recurrence of breast cancer, as supported by their functional annotations.
We have developed a two-step approach for gene regulatory network identification, featuring an integrated method to identify modularized regulatory structures and refine their target genes subsequently. Simulation studies have shown the robustness of the method against noise in gene expression data and false positives in motif binding data. The proposed method has been applied to two breast cancer gene expression datasets to infer the hidden regulation mechanisms. The experimental results demonstrate the efficacy of the method in identifying key regulatory networks related to the progression and recurrence of breast cancer.
ABSTRACT: The reliability and reproducibility of gene biomarkers for classification of cancer patients have been challenged due to measurement noise and biological heterogeneity among patients. In this paper, we propose a novel module-based feature selection framework, which integrates biological network information and gene expression data to identify biomarkers not as individual genes but as functional modules. Results from four breast cancer studies demonstrate that the identified module biomarkers achieve higher classification accuracy in independent validation datasets, are more reproducible than individual gene markers, improve the biological interpretability of results, and are enriched in cancer 'disease drivers'.
International Journal of Data Mining and Bioinformatics 01/2013; 7(3):284-302. · 0.66 Impact Factor
ABSTRACT: We describe an R-Java CAM (convex analysis of mixtures) package that provides comprehensive analytic functions and a graphical user interface (GUI) for blindly separating mixed nonnegative sources. This open-source multiplatform software implements recent and classic algorithms in the literature including Chan et al. (2008), Wang et al. (2010), Chen et al. (2011a) and Chen et al. (2011b). The CAM package offers several attractive features: (1) instead of using proprietary MATLAB, its analytic functions are written in R, which makes the code more portable and easier to modify; (2) besides producing and plotting results in R, it also provides a Java GUI for automatic progress update and convenient visual monitoring; (3) multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, ensuring that the whole CAM software runs responsively; (4) the package offers a simple mechanism to allow others to plug in additional R functions.
Journal of Machine Learning Research 01/2013; 14(1):2899-2903. · 2.85 Impact Factor
ABSTRACT: Reliable inference of transcription regulatory networks is a challenging task in computational biology. Network component analysis (NCA) has become a powerful scheme to uncover regulatory networks behind complex biological processes. However, the performance of NCA is impaired by the high rate of false connections in binding information. In this paper, we integrate stability analysis with NCA to form a novel scheme, namely stability-based NCA (sNCA), for regulatory network identification. The method mainly addresses the inconsistency between gene expression data and binding motif information. Small perturbations are introduced to the prior regulatory network, and the distance among multiple estimated transcription factor (TF) activities is computed to reflect the stability of each TF's binding network. For target gene identification, multivariate regression and the t-statistic are used to calculate the significance of each TF-gene connection. Simulation studies are conducted and the experimental results show that sNCA can achieve an improved and robust performance in TF identification as compared to NCA. The approach for target gene identification is also demonstrated to be suitable for identifying true connections between TFs and their target genes. Furthermore, we have successfully applied sNCA to breast cancer data to uncover the role of TFs in regulating endocrine resistance in breast cancer.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 11/2012; 10(6):1347-58. · 2.25 Impact Factor
ABSTRACT: Identification of differentially expressed subnetworks from protein-protein interaction (PPI) networks has become increasingly important to our global understanding of the molecular mechanisms that drive cancer. Several methods have been proposed for PPI subnetwork identification, but the dependency among network member genes is not explicitly considered, leaving many important hub genes largely unidentified. We present a new method, based on a bagging Markov random field (BMRF) framework, to improve subnetwork identification for mechanistic studies of breast cancer. The method follows a maximum a posteriori principle to form a novel network score that explicitly considers pairwise gene interactions in PPI networks, and it searches for subnetworks with maximal network scores. To improve their robustness across data sets, a bagging scheme based on bootstrapping samples is implemented to statistically select high confidence subnetworks. We first compared the BMRF-based method with existing methods on simulation data to demonstrate its improved performance. We then applied our method to breast cancer data to identify PPI subnetworks associated with breast cancer progression and/or tamoxifen resistance. The experimental results show that the BMRF approach not only achieves improved prediction performance when tested on independent data sets but also reveals biologically meaningful subnetworks that are relevant to breast cancer and tamoxifen resistance.
Nucleic Acids Research 11/2012; · 8.81 Impact Factor
ABSTRACT: To construct biologically interpretable gene sets for muscular dystrophy (MD) sub-type classification, we propose a novel computational scheme to integrate protein-protein interaction (PPI) network, functional gene set information, and mRNA profiling data. The workflow of the proposed scheme includes the following three major steps: firstly, we apply an affinity propagation clustering (APC) approach to identify gene sub-networks associated with each MD sub-type, in which a new distance metric is proposed for APC to combine PPI network information and gene-gene co-expression relationship; secondly, we further incorporate functional gene set knowledge, which complements the physical PPI information, into our scheme for biomarker identification; finally, based on the constructed sub-networks and gene set features, we apply multi-class support vector machines (MSVMs) for MD sub-type classification, with which to highlight the biomarkers contributing to sub-type prediction. The experimental results show that our scheme can help identify sub-networks and gene sets that are more relevant to MD than those constructed by other conventional approaches. Moreover, our integrative strategy improves the prediction accuracy substantially, especially for those 'hard-to-classify' sub-types.
ABSTRACT: With the advent of high-throughput biotechnology capable of monitoring genomic signals, it becomes increasingly promising to understand molecular cellular mechanisms through systems biology approaches. One of the active research topics in systems biology is to infer gene transcriptional regulatory networks using various genomic data; this inference problem can be formulated as a linear model with latent signals associated with some regulatory proteins called transcription factors (TFs). As common statistical assumptions may not hold for genomic signals, typical latent variable algorithms such as independent component analysis (ICA) are incapable of revealing the true underlying regulatory signals. Liao et al. proposed to perform inference using an approach named network component analysis (NCA), the optimization of which is achieved by a least-squares fitting approach with biological knowledge constraints. However, the incompleteness of biological knowledge and its inconsistency with gene expression data are not considered in the original NCA solution, which could greatly affect the inference accuracy. To overcome these limitations, we propose a linear extraction scheme, namely regulatory component analysis (RCA), to infer underlying regulatory signals even with partial biological knowledge. Numerical simulations show a significant improvement of our proposed RCA over NCA, not only when the signal-to-noise ratio (SNR) is low, but also when the given biological knowledge is incomplete and inconsistent with gene expression data. Furthermore, real biological experiments on E. coli are performed for regulatory network inference in comparison with several typical linear latent variable methods, which again demonstrates the effectiveness and improved performance of the proposed algorithm.
Signal Processing 08/2012; 92(8):1902-1915. · 2.24 Impact Factor
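The linear model with latent TF signals and the constrained least-squares fit described in the abstract above can be written compactly. The notation below is an assumption introduced for illustration (it is not taken from the abstract): $E$ is the genes-by-samples expression matrix, $A$ the genes-by-TFs matrix of regulatory strengths, $P$ the TFs-by-samples matrix of latent TF activities, and $\Gamma$ a residual term. The biological knowledge constraint is taken here to mean that entries of $A$ are fixed to zero wherever prior knowledge indicates no regulation.

```latex
E = A P + \Gamma,
\qquad
\min_{A,\,P} \; \lVert E - A P \rVert_F^{2}
\quad \text{subject to } A_{ij} = 0 \text{ whenever TF } j \text{ is not assumed to regulate gene } i .
```

Read this way, incomplete or inconsistent prior knowledge corresponds to a misspecified zero pattern in $A$, which is the limitation the abstract says RCA is designed to tolerate.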
ABSTRACT: Understanding the molecular changes that drive an acquired antiestrogen resistance phenotype is of major clinical relevance. Previous methodologies for addressing this question have taken a single gene/pathway approach and the resulting gains have been limited in terms of their clinical impact. Recent systems biology approaches allow for the integration of data from high throughput "-omics" technologies. We highlight recent advances in the field of antiestrogen resistance with a focus on transcriptomics, proteomics and methylomics.
Drug Discovery Today Disease Mechanisms 06/2012; 9(1-2):e11-e17.
ABSTRACT: Identification of transcriptional regulatory networks (TRNs) is of significant importance in computational biology for cancer research, providing a critical building block to unravel disease pathways. However, existing methods for TRN identification suffer from the inclusion of excessive 'noise' in microarray data and false-positives in binding data, especially when applied to human tumor-derived cell line studies. More robust methods that can counteract the imperfection of data sources are therefore needed for reliable identification of TRNs in this context.
In this article, we propose to establish a link between the quality of one target gene to represent its regulator and the uncertainty of its expression to represent other target genes. Specifically, an outlier sum statistic was used to measure the aggregated evidence for regulation events between target genes and their corresponding transcription factors. A Gibbs sampling method was then developed to estimate the marginal distribution of the outlier sum statistic, hence, to uncover underlying regulatory relationships. To evaluate the effectiveness of our proposed method, we compared its performance with that of an existing sampling-based method using both simulation data and yeast cell cycle data. The experimental results show that our method consistently outperforms the competing method in different settings of signal-to-noise ratio and network topology, indicating its robustness for biological applications. Finally, we applied our method to breast cancer cell line data and demonstrated its ability to extract biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
The Gibbs sampler MATLAB package is freely available at http://www.cbil.ece.vt.edu/software.htm.
Supplementary data are available at Bioinformatics online.
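To make the notion of an outlier sum statistic in the abstract above more concrete, here is a minimal sketch of one generic form of such a statistic. It is an assumption for illustration, not the definition used by the authors or by their MATLAB package: values are standardized by their median and median absolute deviation (MAD), and the standardized values lying above the third quartile plus one interquartile range are summed. The function name and the threshold rule are hypothetical choices.

```python
import statistics


def outlier_sum(values):
    """Sum of robustly standardized values that lie above Q3 + IQR.

    Assumed, generic form of an outlier-sum statistic: standardize with the
    median and the median absolute deviation, then add up the large values.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return 0.0  # no spread, so nothing can be flagged as an outlier
    scaled = [(v - center) / mad for v in values]
    q1, _, q3 = statistics.quantiles(scaled, n=4)
    threshold = q3 + (q3 - q1)
    return sum(s for s in scaled if s > threshold)


# Hypothetical expression values with one strongly deviating sample.
print(outlier_sum([1, 2, 3, 4, 5, 6, 7, 8, 100]))
# The same values without the deviating sample contribute nothing.
print(outlier_sum([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

A large value indicates that a few samples deviate strongly from the bulk, which is the kind of aggregated evidence a sampling procedure could then accumulate across target genes.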