[Show abstract][Hide abstract] ABSTRACT: Cell type-specific gene expression in humans involves complex interactions between regulatory factors and DNA at enhancers and promoters. Mapping studies for expression quantitative trait loci (eQTLs), transcription factors (TFs) and chromatin markers have become widely used tools for identifying gene regulatory elements, but prediction of target genes remains a major challenge. Here, we integrate genome-wide data on TF-binding sites, chromatin markers and functional annotations to predict genes associated with human eQTLs. Using the random forest classifier, we found that genomic proximity plus five TF and chromatin features are able to predict >90% of target genes within 1 megabase of eQTLs. Despite being regularly used to map target genes, proximity is not a good indicator of eQTL targets for genes 150 kilobases away, but insulators, TF co-occurrence, open chromatin and functional similarities between TFs and genes are better indicators. Using all six features in the classifier achieved an area under the specificity and sensitivity curve of 0.91, much better compared with at most 0.75 for using any single feature. We hope this study will not only provide validation of eQTL-mapping studies, but also provide insight into the molecular mechanisms explaining how genetic variation can influence gene expression.
Nucleic Acids Research 12/2012; · 8.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Cellular development requires the precise control of gene expression states. Transcription factors are involved in this regulatory process through their combinatorial binding with DNA. Information about transcription factor binding sites can help determine which combinations of factors work together to regulate a gene, but it is unclear how far the binding data from one cell type can inform about regulation in other cell types.
By integrating data on co-localized transcription factor binding sites in the K562 cell line with expression data across 38 distinct hematopoietic cell types, we developed regression models to describe the relationship between the expression of target genes and the transcription factors that co-localize nearby. With K562 binding sites identifying the predictors, the proportion of expression explained by the models is statistically significant only for monocytic cells (p-value< 0.001), which are closely related to K562. That is, cell type specific binding patterns are crucial for choosing the correct transcription factors for the model. Comparison of predictors obtained from binding sites in the GM12878 cell line with those from K562 shows that the amount of difference between binding patterns is directly related to the quality of the prediction. By identifying individual genes whose expression is predicted accurately by the binding sites, we are able to link transcription factors FOS, TAF1 and YY1 to a sparsely studied gene LRIG2. We also find that the activity of a transcription factor may be different depending on the cell type and the identity of other co-localized factors.
Our approach shows that gene expression can be explained by a modest number of co-localized transcription factors, however, information on cell-type specific binding is crucial for understanding combinatorial gene regulation.