Ab initio prediction of transcription factor binding sites

Department of Biomedical Engineering, High-Throughput Biology Center, Johns Hopkins University, Baltimore, MD 21218, USA.
Pacific Symposium on Biocomputing 02/2007; 12:484-95. DOI: 10.1142/9789812772435_0046
Source: PubMed


Transcription factors are DNA-binding proteins that control gene transcription by binding specific short DNA sequences. Experiments that identify transcription factor binding sites are often laborious and expensive, and the binding sites of many transcription factors remain unknown. We present a computational scheme to predict the binding sites directly from transcription factor sequence using all-atom molecular simulations. This method is a computational counterpart to recent high-throughput experimental technologies that identify transcription factor binding sites (ChIP-chip and protein-dsDNA binding microarrays). The only requirement of our method is an accurate 3D structural model of a transcription factor-DNA complex. We apply free energy calculations by thermodynamic integration to compute the change in binding energy of the complex due to a single base pair mutation. By calculating the binding free energy differences for all possible single mutations, we construct a position weight matrix for the predicted binding sites that can be directly compared with experimental data. As water-bridged hydrogen bonds between the transcription factor and DNA often contribute to the binding specificity, we include explicit solvent in our simulations. We present successful predictions for the yeast MAT-alpha2 homeodomain and GCN4 bZIP proteins. Water-bridged hydrogen bonds are found to be more prevalent than direct protein-DNA hydrogen bonds at the binding interfaces, indicating why empirical potentials with implicit water may be less successful in predicting binding. Our methodology can be applied to a variety of DNA-binding proteins.
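The abstract describes building a position weight matrix from binding free energy differences for all single base pair mutations, but it does not say how the energies are converted into matrix entries. A minimal sketch of one common convention, Boltzmann weighting, follows. The function name `pwm_from_ddg`, the 300 K temperature, and the numerical values are assumptions for illustration only; they are not taken from the paper.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 0.0019872041
BASES = ("A", "C", "G", "T")


def pwm_from_ddg(ddg_by_position, temperature=300.0):
    """Turn per-position binding free energy differences into base probabilities.

    ddg_by_position: list of dicts mapping each base to its ddG in kcal/mol,
    measured relative to the reference base at that position (ddG = 0).
    Assumption: each position contributes independently and bases are
    weighted by exp(-ddG / RT). The paper may use a different mapping.
    """
    rt = R_KCAL * temperature
    pwm = []
    for ddg in ddg_by_position:
        # Lower (more favorable) ddG gives a larger Boltzmann weight.
        weights = {b: math.exp(-ddg[b] / rt) for b in BASES}
        total = sum(weights.values())
        pwm.append({b: weights[b] / total for b in BASES})
    return pwm


# Hypothetical ddG values for a two-position site (not results from the paper).
example_ddg = [
    {"A": 0.0, "C": 1.2, "G": 0.8, "T": 2.0},
    {"A": 1.5, "C": 1.5, "G": 0.0, "T": 1.5},
]

pwm = pwm_from_ddg(example_ddg)
# The consensus base at each position is the one with the highest probability.
consensus = "".join(max(column, key=column.get) for column in pwm)
print(consensus)
for column in pwm:
    print({base: round(p, 3) for base, p in column.items()})
```

Each column of the result sums to one, so it can be compared with frequency matrices derived from experimental binding data. Treating positions as independent is itself an assumption, the additivity assumption that one of the citing excerpts below notes has been examined in the literature.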

Available from: Joel S Bader
  • Source
    • "Structural modeling has generated important insights into protein-DNA recognition mechanisms, from studies of the relative contributions of direct and indirect recognition (6–8), and the role of DNA shape (9,10) and interfacial waters (11,12) in sequence-specific recognition, to the validity of the additivity assumption in protein-DNA energetics (13,14). Structural modeling has also been used to predict DNA binding preferences, using a wide range of sampling algorithms and energy functions, including database-derived potentials (15–17), all-atom molecular mechanics force fields (11,18–23), and hybrid scoring functions (12,14). These approaches can often generate highly accurate predictions when given an X-ray crystal structure of the target protein in complex with a high affinity binding site. "
    ABSTRACT: Sequence-specific DNA recognition by gene regulatory proteins is critical for proper cellular functioning. The ability to predict the DNA binding preferences of these regulatory proteins from their amino acid sequence would greatly aid in reconstruction of their regulatory interactions. Structural modeling provides one route to such predictions: by building accurate molecular models of regulatory proteins in complex with candidate binding sites, and estimating their relative binding affinities for these sites using a suitable potential function, it should be possible to construct DNA binding profiles. Here, we present a novel molecular modeling protocol for protein-DNA interfaces that borrows conformational sampling techniques from de novo protein structure prediction to generate a diverse ensemble of structural models from small fragments of related and unrelated protein-DNA complexes. The extensive conformational sampling is coupled with sequence space exploration so that binding preferences for the target protein can be inferred from the resulting optimized DNA sequences. We apply the algorithm to predict binding profiles for a benchmark set of eleven C2H2 zinc finger transcription factors, five of known and six of unknown structure. The predicted profiles are in good agreement with experimental binding data; furthermore, examination of the modeled structures gives insight into observed binding preferences.
    Nucleic Acids Research 02/2011; 39(11):4564-76. DOI:10.1093/nar/gkr048 · 9.11 Impact Factor
  • Source
    • "We further investigated the contribution of the sequence-dependent DNA binding and tested whether computations can be accelerated using an EBWM approach. This systematic and diverse testing makes our study complementary to other recent works (7–12,31)."
    ABSTRACT: The binding of a transcription factor (TF) to a DNA operator site can initiate or repress the expression of a gene. Computational prediction of sites recognized by a TF has traditionally relied upon knowledge of several cognate sites, rather than an ab initio approach. Here, we examine the possibility of using structure-based energy calculations that require no knowledge of bound sites but rather start with the structure of a protein-DNA complex. We study the PurR Escherichia coli TF, and explore to which extent atomistic models of protein-DNA complexes can be used to distinguish between cognate and noncognate DNA sites. Particular emphasis is placed on systematic evaluation of this approach by comparing its performance with bioinformatic methods, by testing it against random decoys and sites of homologous TFs. We also examine a set of experimental mutations in both DNA and the protein. Using our explicit estimates of energy, we show that the specificity for PurR is dominated by direct protein-DNA interactions, and weakly influenced by bending of DNA.
    Nucleic Acids Research 11/2008; 36(19):6209-17. DOI:10.1093/nar/gkn589 · 9.11 Impact Factor
  • Source
    ABSTRACT: Peptide recognition modules (PRMs) are used throughout biology to mediate protein-protein interactions, and many PRMs are members of large protein domain families. Members of these families are often quite similar to each other, but each domain recognizes a distinct set of peptides, raising the question of how peptide recognition specificity is achieved using similar protein domains. The analysis of individual protein complex structures often gives answers that are not easily applicable to other members of the same PRM family. Bioinformatics-based approaches, on the other hand, may be difficult to interpret physically. Here we integrate structural information with a large, quantitative data set of SH2-peptide interactions to study the physical origin of domain-peptide specificity. We develop an energy model, inspired by protein folding, based on interactions between the amino acid positions in the domain and peptide. We use this model to successfully predict which SH2 domains and peptides interact and uncover the positions in each that are important for specificity. The energy model is general enough that it can be applied to other members of the SH2 family or to new peptides, and the cross-validation results suggest that these energy calculations will be useful for predicting binding interactions. It can also be adapted to study other PRM families, predict optimal peptides for a given SH2 domain, or study other biological interactions, e.g. protein-DNA interactions.