Binding Profiles of Chromatin-Modifying Proteins Are Predictive for Transcriptional Activity and Promoter-Proximal Pausing
The establishment and maintenance of proper gene expression patterns is essential for stable cell differentiation. Using unsupervised learning techniques, chromatin states have been linked to discrete gene expression states, but these models cannot predict continuous gene expression levels, nor do they reveal detailed insight into the chromatin-based control of gene expression. Here, we employ regularized regression techniques to link, in a quantitative manner, binding profiles of chromatin proteins to gene expression levels and promoter-proximal pausing of RNA polymerase II in Drosophila melanogaster on a genome-wide scale. We apply stability selection to reliably detect interactions of chromatin features and predict several known, suggested, and novel proteins and protein pairs as transcriptional activators or repressors. Our integrative analysis reveals new insights into the complex interplay of transcriptional regulators in the context of gene expression. Supplementary Material is available at www.libertonline.com/cmb.
ABSTRACT: In an interesting and quite exhaustive review on Random Forests (RF) methodology in bioinformatics, Touw et al. address, among other topics, the problem of the detection of interactions between variables based on RF methodology. We feel that some important statistical concepts, such as 'interaction', 'conditional dependence' or 'correlation', are sometimes employed inconsistently in the bioinformatics literature in general and in the literature on RF in particular. In this letter to the Editor, we aim to clarify some of the central statistical concepts and point out some confusing interpretations concerning RF given by Touw et al. and other authors.
- "However, we feel that whenever RF methodologies are investigated in relation to interactions, the latter term should be defined precisely and the investigated role of RF in this context should be clearly stated. For example, does it relate to the ability of RF to yield high individual VIMs for predictor variables involved in interactions, the possibility to directly identify which predictor variables interact with each other by examining a RF [32, 1], or the combination of RF with other analysis tools with the aim of identifying interactions? In any case, when an algorithm based on RF (possibly combined with other tools) is suggested to identify which predictor variables interact with each other, we claim that this algorithm should be assessed in simulations using adequate measures such as, for example, sensitivity, the proportion of pairs of interacting variables that are correctly identified as interacting; specificity, the proportion of pairs of non-interacting variables that are correctly identified as non-interacting; or false positive rate, the proportion of pairs of non-interacting variables within the pairs identified as interacting."
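The three measures named in the excerpt above can be sketched as follows. This is only an illustrative sketch: the function name `pair_detection_metrics` and the toy variables are assumptions, not part of any cited work. The "false positive rate" follows the excerpt's own definition (false pairs among the pairs identified as interacting), a quantity that is elsewhere often called the false discovery proportion.

```python
from itertools import combinations


def pair_detection_metrics(variables, true_pairs, detected_pairs):
    """Score detected interaction pairs against known (simulated) ones.

    Pairs are unordered, so (A, B) and (B, A) count as the same pair.
    """
    all_pairs = {frozenset(p) for p in combinations(variables, 2)}
    truth = {frozenset(p) for p in true_pairs}
    found = {frozenset(p) for p in detected_pairs}

    tp = len(truth & found)              # interacting, identified as interacting
    fp = len(found - truth)              # non-interacting, identified as interacting
    fn = len(truth - found)              # interacting, missed
    tn = len(all_pairs - truth - found)  # non-interacting, correctly left out

    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        # Definition used in the excerpt: share of non-interacting pairs
        # within the pairs identified as interacting.
        "false_positive_rate": fp / (tp + fp) if tp + fp else nan,
    }


# Toy simulation truth: A-B and C-D interact; a hypothetical method reports A-B and A-C.
metrics = pair_detection_metrics(
    variables=["A", "B", "C", "D"],
    true_pairs=[("A", "B"), ("C", "D")],
    detected_pairs=[("B", "A"), ("A", "C")],
)
print(metrics)
```

In a real assessment these values would be averaged over many simulated data sets rather than computed once.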
ABSTRACT: In metazoans, each cell type follows a characteristic, spatio-temporally regulated DNA replication program. Histone modifications (HMs) and chromatin binding proteins (CBPs) are fundamental for a faithful progression and completion of this process. However, no individual HM is strictly indispensable for origin function, suggesting that HMs may act combinatorially in analogy to the histone code hypothesis for transcriptional regulation. In contrast to gene expression, however, the relationship between combinations of chromatin features and DNA replication timing has not yet been demonstrated. Here, by exploiting a comprehensive data collection consisting of 95 CBPs and HMs, we investigated their combinatorial potential for the prediction of DNA replication timing in Drosophila using quantitative statistical models. We found that while combinations of CBPs exhibit moderate predictive power for replication timing, pairwise interactions between HMs lead to accurate predictions genome-wide that can be locally further improved by CBPs. Independent feature importance and model analyses led us to derive a simplified, biologically interpretable model of the relationship between chromatin landscape and replication timing, reaching 80% of the full model accuracy using six model terms. Finally, we show that pairwise combinations of HMs are able to predict differential DNA replication timing across different cell types. All in all, our work provides support for the existence of combinatorial HM patterns for DNA replication and reveals cell-type independent key elements thereof, whose experimental investigation might help to elucidate the regulatory mode of this fundamental cellular process.
- "Stability analysis of model coefficients was performed essentially as described in . Feature selection probabilities (normalized frequencies of non-zero coefficients) were computed using bootstrap-Lasso. "
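The selection probabilities mentioned in the excerpt above (normalized frequencies of non-zero Lasso coefficients over bootstrap resamples) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the coordinate-descent solver, the function names `lasso_cd` and `selection_probabilities`, the penalty `alpha`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=100):
    """Lasso by cyclic coordinate descent.

    Minimizes (1 / (2n)) * ||y - X b||^2 + alpha * ||b||_1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # equals y - X @ beta while beta is zero
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Covariance of feature j with the partial residual (feature j excluded).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coefficient j.
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def selection_probabilities(X, y, alpha, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        Xb = X[idx] - X[idx].mean(axis=0)  # center, so no intercept is needed
        yb = y[idx] - y[idx].mean()
        counts += lasso_cd(Xb, yb, alpha) != 0
    return counts / n_boot


# Synthetic example: only the first two of ten features influence y.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(200)
probs = selection_probabilities(X, y, alpha=0.3, n_boot=50)
print(np.round(probs, 2))
```

Features with a selection probability close to 1 are those the Lasso picks consistently across resamples. The choice of penalty and of any probability cutoff is left open here, since the excerpt does not state them.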
ABSTRACT: DNA sequence variation causes changes in gene expression, which in turn has profound effects on cellular states. These variations affect tissue development and may ultimately lead to pathological phenotypes. A genetic locus containing a sequence variation that affects gene expression is called an "expression quantitative trait locus" (eQTL). Whereas the impact of cellular context on expression levels in general is well established, a lot less is known about the cell-state specificity of eQTL. Previous studies differed with respect to how "dynamic eQTL" were defined. Here, we propose a unified framework distinguishing static, conditional and dynamic eQTL and suggest strategies for mapping these eQTL classes. Further, we introduce a new approach to simultaneously infer eQTL from different cell types. By using murine mRNA expression data from four stages of hematopoiesis and 14 related cellular traits, we demonstrate that static, conditional and dynamic eQTL, although derived from the same expression data, represent functionally distinct types of eQTL. While static eQTL affect generic cellular processes, non-static eQTL are more often involved in hematopoiesis and immune response. Our analysis revealed substantial effects of individual genetic variation on cell type-specific expression regulation. Among a total number of 3,941 eQTL we detected 2,729 static eQTL, 1,187 eQTL were conditionally active in one or several cell types, and 70 eQTL affected expression changes during cell type transitions. We also found evidence for feedback control mechanisms reverting the effect of an eQTL specifically in certain cell types. Loci correlated with hematological traits were enriched for conditional eQTL, thus demonstrating the importance of conditional eQTL for understanding molecular mechanisms underlying physiological trait variation. The classification proposed here has the potential to streamline and unify future analysis of conditional and dynamic eQTL as well as many other kinds of QTL data.
- "In principle, the second step of the simultaneous eQTL mapping, the distinction between conditional and static eQTL, could be directly resolved in the primary eQTL mapping step. The RF framework allows epistatic interactions between predictors to be extracted directly from the trees [16, 69, 70, 71]. However, this requires a large enough sample size in order to grow deep trees where different combinations of variables will be used for splitting in the same branch."