[Show abstract][Hide abstract] ABSTRACT: Cys2-His2 zinc finger proteins (ZFPs) are the largest family of transcription factors in higher metazoans. They also represent the most diverse family with regards to the composition of their recognition sequences. Although there are a number of ZFPs with characterized DNA-binding preferences, the specificity of the vast majority of ZFPs is unknown and cannot be directly inferred by homology due to the diversity of recognition residues present within individual fingers. Given the large number of unique zinc fingers and assemblies present across eukaryotes, a comprehensive predictive recognition model that could accurately estimate the DNA-binding specificity of any ZFP based on its amino acid sequence would have great utility. Toward this goal, we have used the DNA-binding specificities of 678 two-finger modules from both natural and artificial sources to construct a random forest-based predictive model for ZFP recognition. We find that our recognition model outperforms previously described determinant-based recognition models for ZFPs, and can successfully estimate the specificity of naturally occurring ZFPs with previously defined specificities.
Nucleic Acids Research 02/2014; 42(8). DOI:10.1093/nar/gku132 · 9.11 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: The identification of the trans-acting factors and cis-regulatory modules that are involved in human pluripotent stem cell (hPSC) maintenance and differentiation is necessary to dissect the operating regulatory networks in these processes and thereby identify nodes where signal input will direct desired cell fate decisions in vitro or in vivo. To deconvolute these networks, we established a method to influence the differentiation state of hPSCs with a CRISPR-associated catalytically inactive dCas9 fused to an effector domain. In human embryonic stem cells, we find that the dCas9 effectors can exert positive or negative regulation on the expression of developmentally relevant genes, which can influence cell differentiation status when impinging on a key node in the regulatory network that governs the cell state. This system provides a platform for the interrogation of the underlying regulators governing specific differentiation decisions, which can then be employed to direct cellular differentiation down desired pathways.
Development 01/2014; 141(1):219-223. DOI:10.1242/dev.103341 · 6.46 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Cys2-His2 zinc finger proteins (ZFPs) are the largest group of transcription factors in higher metazoans. A complete characterization of these ZFPs and their associated target sequences is pivotal to fully annotate transcriptional regulatory networks in metazoan genomes. As a first step in this process, we have characterized the DNA-binding specificities of 129 Zinc finger sets from Drosophila using a bacterial one-hybrid system. This data set contains the DNA-binding specificities for at least one encoded ZFP from 70 unique genes and 23 alternate splice isoforms representing the largest set of characterized ZFPs from any organism described to date. These recognition motifs can be used to predict genomic binding sites for these factors within the fruit fly genome. Subsets of fingers from these ZFPs were characterized to define their orientation and register on their recognition sequences, thereby allowing us to define the recognition diversity within this finger set. We find that the characterized fingers can specify 47 of the 64 possible DNA triplets. To confirm the utility of our finger recognition models, we employed subsets of Drosophila fingers in combination with an existing archive of artificial zinc finger modules to create ZFPs with novel DNA-binding specificity. These hybrids of natural and artificial fingers can be used to create functional Zinc Finger Nucleases for editing vertebrate genomes.
Genome Research 03/2013; 23(6). DOI:10.1101/gr.151472.112 · 14.63 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Motivation: Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors the best available methods utilize k-nearest neighbor (KNN) algorithms to make specificity predictions based on the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes, and as a consequence an effective recognition model for this family would facilitate predictive models of many transcriptional regulatory networks within these genomes.
Results: Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines and random forests (RFs) can produce recognition models for HD proteins that are significant improvements over KNN-based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting position frequency matrices from protein sequence using a RF-based model.
[Show abstract][Hide abstract] ABSTRACT: FlyFactorSurvey (http://pgfe.umassmed.edu/TFDBS/) is a database of DNA binding specificities for Drosophila transcription factors (TFs) primarily determined using the bacterial one-hybrid system. The database provides community access to over 400 recognition motifs and position weight matrices for over 200 TFs, including many unpublished motifs. Search tools and flat file downloads are provided to retrieve binding site information (as sequences, matrices and sequence logos) for individual TFs, groups of TFs or for all TFs with characterized binding specificities. Linked analysis tools allow users to identify motifs within our database that share similarity to a query matrix or to view the distribution of occurrences of an individual motif throughout the Drosophila genome. Together, this database and its associated tools provide computational and experimental biologists with resources to predict interactions between Drosophila TFs and target cis-regulatory sequences.