Identifying Responsive Modules by Mathematical Programming: An Application to Budding Yeast Cell Cycle

Key Laboratory of Systems Biology, SIBS-Novo Nordisk Translational Research Centre for PreDiabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China.
PLoS ONE (Impact Factor: 3.53). 07/2012; 7(7):e41854. DOI: 10.1371/journal.pone.0041854
Source: PubMed

ABSTRACT High-throughput biological data offer an unprecedented opportunity to fully characterize biological processes. However, extracting meaningful biological information from these datasets remains a significant challenge. Recently, pathway-based analysis has made considerable progress in identifying biomarkers for some phenotypes. Nevertheless, these so-called pathway-based methods are mainly individual-gene-based or molecule-complex-based analyses. In this paper, we developed a novel module-based method to reveal causal or dependent relations between network modules and biological phenotypes by integrating gene expression data with a protein-protein interaction network. Specifically, we first formulated the identification of the responsive modules underlying biological phenotypes as a mathematical programming model that exploits phenotype difference, which can also be viewed as a multi-classification problem. Then, we applied it to study the cell-cycle process of budding yeast using microarray data from our own biological experiments, and identified important phenotype- and transition-based responsive modules for different stages of the cell cycle. The resulting responsive modules provide new insight into the regulation mechanisms of the cell cycle from a network viewpoint. Moreover, the identification of transition modules provides a new way to study dynamical processes at the functional module level. In particular, we found that the dysfunction of one well-known module and two new modules may directly result in cell-cycle arrest at S phase. In addition to our biological experiments, the identified responsive modules were also validated on two independent datasets on the budding yeast cell cycle.
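The idea of scoring a network module by phenotype difference can be illustrated with a small sketch. This is not the mathematical programming model of the paper, whose formulation is not given in the abstract. It is a greedy heuristic under stated assumptions: only two phenotypes, module activity defined as the mean expression of the module's genes, and a toy expression table and interaction list invented for illustration. The names `module_score` and `grow_module` are hypothetical.

```python
from statistics import mean


def module_score(module, expr, labels):
    """Absolute difference between the two phenotypes' mean module activity.

    Assumption: activity of a module in one sample is the mean expression
    of its genes; the source abstract does not specify the scoring function.
    """
    activity = [mean(expr[g][s] for g in module) for s in range(len(labels))]
    first, second = sorted(set(labels))
    a = [x for x, y in zip(activity, labels) if y == first]
    b = [x for x, y in zip(activity, labels) if y == second]
    return abs(mean(a) - mean(b))


def grow_module(seed, expr, labels, edges, max_size=5):
    """Greedily add interaction-network neighbours while the score improves."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    module = {seed}
    best = module_score(module, expr, labels)
    while len(module) < max_size:
        # Candidate genes: direct neighbours of the current module.
        frontier = set().union(*(neighbours.get(g, set()) for g in module)) - module
        scored = [(module_score(module | {g}, expr, labels), g) for g in sorted(frontier)]
        if not scored:
            break
        top_score, top_gene = max(scored)
        if top_score <= best:
            break  # no neighbour improves the phenotype difference
        module.add(top_gene)
        best = top_score
    return module, best


# Toy data (hypothetical): four genes, two samples per phenotype.
expr = {"A": [1, 1, 5, 5], "B": [0, 1, 6, 7], "C": [3, 3, 3, 3], "D": [0, 0, 9, 9]}
labels = ["G1", "G1", "S", "S"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]

module, score = grow_module("A", expr, labels, edges)
print(sorted(module), score)
```

On this toy input the search stops at the connected pair A, B: adding the unresponsive gene C lowers the score, so the strongly responsive gene D is never reached. An exact optimization model avoids this kind of local stopping, which is one motivation for a mathematical programming formulation.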

  • "In a different approach to prior knowledge, gene-gene relationships (pathway-based or protein-protein interaction (PPI) networks) are used to improve classification accuracy [21], [22], [23], [24], [25], [30], consistency of biomarker discovery [26], [27] and targeted therapeutic strategies [28], [29]. The majority of these studies utilize gene expressions corresponding to sub-networks in PPI networks, for instance: mean or median of gene expression values in gene ontology network modules [21], probabilistic inference of pathway activity [24], and producing candidate sub-networks via a Markov clustering algorithm applied to high quality PPI networks [26], [31]. None of these methods incorporate the regulating mechanisms (activating or suppressing) into classification or feature selection."
    ABSTRACT: Small samples are commonplace in genomic/proteomic classification, the result being inadequate classifier design and poor error estimation. The problem has recently been addressed by utilizing prior knowledge in the form of a prior distribution on an uncertainty class of feature-label distributions. A critical issue remains: how to incorporate biological knowledge into the prior distribution. For genomics/proteomics, the most common kind of knowledge is in the form of signaling pathways. Thus, it behooves us to find methods of transforming pathway knowledge into knowledge of the feature-label distribution governing the classification problem. In this paper, we address the problem of prior probability construction by proposing a series of optimization paradigms that utilize the incomplete prior information contained in pathways (both topological and regulatory). The optimization paradigms employ the marginal log-likelihood, established using a small number of feature-label realizations (sample points) regularized with the prior pathway information about the variables. In the special case of a Normal-Wishart prior distribution on the mean and inverse covariance matrix (precision matrix) of a Gaussian distribution, these optimization problems become convex. Companion website:
    IEEE/ACM Transactions on Computational Biology and Bioinformatics 01/2014; 11(1). DOI:10.1109/TCBB.2013.143 · 1.54 Impact Factor
  • ABSTRACT: BACKGROUND: Many methods have been developed to identify disease genes and further module biomarkers of complex diseases based on gene expression data. It is generally difficult to distinguish whether the variations in gene expression are causative or merely the effect of a disease. The limitation of relying on gene expression data alone highlights the need to develop new approaches that can explore various data to reflect the causal relationship between network modules and disease traits. METHODS: In this work, we developed a novel network-based approach to identify putative causal module biomarkers of complex diseases by integrating heterogeneous information, for example, epigenomic data, gene expression data, and protein-protein interaction networks. We first formulated the identification of modules as a mathematical programming problem, which can be solved efficiently and effectively in an accurate manner. Then, we applied our approach to colorectal cancer (CRC) and identified several network modules that can serve as potential module biomarkers for characterizing CRC. Further validations using three additional gene expression datasets verified their candidate biomarker properties and the effectiveness of the method. Functional enrichment analysis also revealed that the identified modules are strongly related to hallmarks of cancer, and the enriched functions, such as inflammatory response, receptor and signaling pathways, are specific to CRC. RESULTS: Through constructing a transcription factor (TF)-module network, we found that aberrant DNA methylation of genes encoding TFs contributes considerably to the activity change of some genes, which may function as causal genes of CRC, and that can also be exploited to develop efficient therapies or effective drugs. CONCLUSION: Our method can potentially be extended to the study of other complex diseases and the multiclassification problem.
    Journal of the American Medical Informatics Association 09/2012; 20(4). DOI:10.1136/amiajnl-2012-001168 · 3.93 Impact Factor
  • ABSTRACT: Systematically identifying biomarkers, in particular network biomarkers, from high-throughput data is an important and challenging task, and many methods for two-class comparison have been developed to exploit the information in high-throughput data. However, as high-throughput data with multiple phenotypes become available, there is a great need to develop effective multi-classification models. In this study, we proposed a novel approach, called MCentridFS (Multi-class Centroid Feature Selection), to systematically identify responsive modules or network biomarkers for classifying multi-phenotypes from high-throughput data. MCentridFS formulated the multi-classification model by network modules as a binary integer linear programming problem, which can be solved efficiently and effectively in an accurate manner. The approach is evaluated with respect to two diseases, i.e., multi-stage HCV-induced dysplasia and hepatocellular carcinoma and multi-tissue breast cancer, both of which demonstrated the high classification rate and the cross-validation rate of the approach. The computational results of the five-fold cross-validation on the two datasets show that MCentridFS outperforms the state-of-the-art multi-classification methods. We further verified the effectiveness of MCentridFS in characterizing the multi-phenotype processes using module biomarkers on two independent datasets. In addition, functional enrichment analysis revealed that the identified network modules are strongly related to the corresponding biological processes and pathways. All these results suggest that it can serve as a useful tool for module biomarker detection in multiple biological processes or multi-classification problems by exploring both big biological data and network information. The Matlab code for MCentridFS is freely available from .
    Molecular BioSystems 07/2014; DOI:10.1039/c4mb00325j · 3.18 Impact Factor
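The last abstract describes feature selection posed as a binary integer linear program over class centroids. The sketch below shows the general shape of such a problem and is not the MCentridFS formulation, which the abstract does not spell out. Assumptions made here for illustration: each feature j gets a binary variable x_j, its weight is the summed absolute difference between class centroids over all class pairs, and at most k features may be selected. The tiny instance is solved by exhaustive enumeration rather than by an integer programming solver, and `select_features` is a hypothetical name.

```python
from itertools import combinations, product


def centroid(rows):
    """Per-feature mean of a list of equally long sample vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def select_features(samples, labels, k):
    """Maximise sum_j w_j * x_j subject to sum_j x_j <= k, x_j in {0, 1}.

    w_j is the total absolute centroid difference of feature j over all
    class pairs (an assumed objective, chosen because it is linear in x).
    """
    classes = sorted(set(labels))
    cents = {
        c: centroid([s for s, y in zip(samples, labels) if y == c])
        for c in classes
    }
    n = len(samples[0])
    w = [
        sum(abs(cents[p][j] - cents[q][j]) for p, q in combinations(classes, 2))
        for j in range(n)
    ]
    best_x, best_val = None, float("-inf")
    # Exhaustive search over all binary vectors; feasible only for tiny n.
    for x in product((0, 1), repeat=n):
        if sum(x) > k:
            continue  # violates the cardinality constraint
        val = sum(wj * xj for wj, xj in zip(w, x))
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Toy data (hypothetical): three classes, four features.
samples = [
    [1, 5, 0, 2], [1, 5, 0, 4],
    [3, 5, 4, 2], [3, 5, 4, 4],
    [5, 5, 8, 2], [5, 5, 8, 4],
]
labels = ["a", "a", "b", "b", "c", "c"]

x, value = select_features(samples, labels, k=2)
print(x, value)
```

With only a cardinality constraint the optimum is simply the k highest-weight features. An integer programming formulation becomes useful once further linear constraints are added, for example constraints that keep the selected genes connected in an interaction network.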