Using a structural and logics systems approach to infer bHLH–DNA binding specificity determinants

Department of Medicine, Division of Genetics, Brigham & Women's Hospital and Harvard Medical School, Boston, MA 02115, USA.
Nucleic Acids Research (Impact Factor: 9.11). 02/2011; 39(11):4553-63. DOI: 10.1093/nar/gkr070
Source: PubMed


Numerous efforts are underway to determine gene regulatory networks that describe physical relationships between transcription factors (TFs) and their target DNA sequences. Members of paralogous TF families typically recognize similar DNA sequences. Knowledge of the molecular determinants of protein-DNA recognition by paralogous TFs is of central importance for understanding how small differences in DNA specificities can dictate target gene selection. Previously, we determined the in vitro DNA binding specificities of 19 Caenorhabditis elegans basic helix-loop-helix (bHLH) dimers using protein binding microarrays. These TFs bind E-box (CANNTG) and E-box-like sequences. Here, we combine these data with logics, bHLH-DNA co-crystal structures and computational modeling to infer which bHLH monomer can interact with which CAN E-box half-site and we identify a critical residue in the protein that dictates this specificity. Validation experiments using mutant bHLH proteins provide support for our inferences. Our study provides insights into the mechanisms of DNA recognition by bHLH dimers as well as a blueprint for system-level studies of the DNA binding determinants of other TF families in different model organisms and humans.
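As an illustration of the E-box and half-site terminology used above, here is a minimal sketch that scans a DNA string for CANNTG matches and splits each match into its two CAN half-sites. This is not the authors' analysis pipeline. The function names `find_eboxes` and `reverse_complement` are hypothetical, and reading the right-hand half-site on the opposite strand (so that both halves are written as CAN) is a convention assumed here for convenience.

```python
import re

# Translation table used to complement a DNA string (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_eboxes(seq):
    """Find every CANNTG match in seq, including overlapping ones.

    Returns a list of (position, site, (left_half, right_half)) tuples.
    The right half is reverse-complemented so that both half-sites are
    written in the CAN orientation (an assumed convention).
    """
    seq = seq.upper()
    hits = []
    # A lookahead lets the regex report overlapping matches.
    for match in re.finditer(r"(?=(CA[ACGT]{2}TG))", seq):
        site = match.group(1)
        left_half = site[:3]
        right_half = reverse_complement(site[3:])
        hits.append((match.start(), site, (left_half, right_half)))
    return hits


if __name__ == "__main__":
    for position, site, halves in find_eboxes("ttCACGTGaaCAGCTGg"):
        print(position, site, halves)
```

Under this convention a palindromic E-box such as CACGTG yields two identical half-sites, whereas an asymmetric site such as CACCTG yields two different ones (CAC and CAG), which is the situation where assigning each half-site to a monomer becomes informative.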



Available from: Federico De Masi, Mar 12, 2014
  • Source
    • "This observation is important because it suggests that the sequence preferences of TFs may be broadly inferred from data for only a small subset of TFs (Alleyne et al., 2009; Berger et al., 2008; Bernard et al., 2012; Noyes et al., 2008). However, these analyses have utilized data for only a handful of DBD classes and species and they contrast with numerous demonstrations that mutation of one or a few critical DBD AAs can alter the sequence preferences of a TF (Aggarwal et al., 2010; Cook et al., 1994; De Masi et al., 2011; Mathias et al., 2001; Noyes et al., 2008), which suggest that prediction of DNA binding preferences by homology should be highly error-prone. To our knowledge, rigorous and exhaustive analyses of the accuracy and limitations of inference approaches to predicting TF DNA-binding motifs using DBD sequences has not been done."
    ABSTRACT: Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ∼1% of eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ∼34% of the ∼170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in chromatin immunoprecipitation sequencing (ChIP-seq) peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.
    Cell 09/2014; 158(6):1431-43. DOI:10.1016/j.cell.2014.08.009 · 32.24 Impact Factor
  • Source
    • "(36). Despite this, heterodimerization of NAC TFs (11) may expand the DNA-binding specificity spectrum in vitro, as suggested for the bHLH TFs (15,16). This variability between single or double binding sites can bring yet another level of genetic regulation in NAC-dependent stress response in A. thaliana. "
    ABSTRACT: Target gene identification for transcription factors is a prerequisite for the systems wide understanding of organismal behaviour. NAM-ATAF1/2-CUC2 (NAC) transcription factors are amongst the largest transcription factor families in plants, yet limited data exist from unbiased approaches to resolve the DNA-binding preferences of individual members. Here, we present a TF-target gene identification workflow based on the integration of novel protein binding microarray data with gene expression and multi-species promoter sequence conservation to identify the DNA-binding specificities and the gene regulatory networks of 12 NAC transcription factors. Our data offer specific single-base resolution fingerprints for most TFs studied and indicate that NAC DNA-binding specificities might be predicted from their DNA-binding domain's sequence. The developed methodology, including the application of complementary functional genomics filters, makes it possible to translate, for each TF, protein binding microarray data into a set of high-quality target genes. With this approach, we confirm NAC target genes reported from independent in vivo analyses. We emphasize that candidate target gene sets together with the workflow associated with functional modules offer a strong resource to unravel the regulatory potential of NAC genes and that this workflow could be used to study other families of transcription factors.
    Nucleic Acids Research 06/2014; 42(12). DOI:10.1093/nar/gku502 · 9.11 Impact Factor
  • Source
    • "This binding was traditionally studied in isolation, despite the fact that many well-studied TFs were known to bind cooperatively to DNA by forming well-defined dimers or (in some cases) higher-order complexes. Important examples of such direct cooperativity include the p53 homotetramer [1], the NF-κB heterodimer [2], various bHLH dimers [3], SOX2–POU5F1 (SOX2–OCT4) dimerization in embryonic stem cells [4] and, more recently, AR–FOXA1 dimerization in prostate cancer cells [5]. In all these cases, the genomic binding sites of cooperating TFs form well-defined rigidly spaced motif complexes, i.e. motif pairs with fixed relative orientation and spacing. "
    ABSTRACT: Cooperative binding of transcription factor (TF) dimers to DNA is increasingly recognized as a major contributor to binding specificity. However, it is likely that the set of known TF dimers is highly incomplete, given that they were discovered using ad hoc approaches, or through computational analyses of limited datasets. Here, we present TACO (Transcription factor Association from Complex Overrepresentation), a general-purpose standalone software tool that takes as input any genome-wide set of regulatory elements and predicts cell-type-specific TF dimers based on enrichment of motif complexes. TACO is the first tool that can accommodate motif complexes composed of overlapping motifs, a characteristic feature of many known TF dimers. Our method comprehensively outperforms existing tools when benchmarked on a reference set of 29 known dimers. We demonstrate the utility and consistency of TACO by applying it to 152 DNase-seq datasets and 94 ChIP-seq datasets. Based on these results, we uncover a general principle governing the structure of TF-TF-DNA ternary complexes, namely that the flexibility of the complex is correlated with, and most likely a consequence of, inter-motif spacing.
    BMC Genomics 03/2014; 15(1):208. DOI:10.1186/1471-2164-15-208 · 3.99 Impact Factor