Integration of ChIP-seq and machine learning reveals enhancers and a predictive regulatory sequence vocabulary in melanocytes

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
Genome Research (Impact Factor: 14.63). 09/2012; 22(11). DOI: 10.1101/gr.139360.112
Source: PubMed


We take a comprehensive approach to the study of regulatory control of gene expression in melanocytes that proceeds from large-scale enhancer discovery facilitated by ChIP-seq; to rigorous validation in silico, in vitro, and in vivo; and finally to the use of machine learning to elucidate a regulatory vocabulary with genome-wide predictive power. We identify 2489 putative melanocyte enhancer loci in the mouse genome by ChIP-seq for EP300 and H3K4me1. We demonstrate that these putative enhancers are evolutionarily constrained, enriched for sequence motifs predicted to bind key melanocyte transcription factors, located near genes relevant to melanocyte biology, and capable of driving reporter gene expression in melanocytes in culture (86%; 43/50) and in transgenic zebrafish (70%; 7/10). Next, using the sequences of these putative enhancers as a training set for a supervised machine learning algorithm, we develop a vocabulary of 6-mers predictive of melanocyte enhancer function. Lastly, we demonstrate that this vocabulary has genome-wide predictive power in both the mouse and human genomes. This study provides deep insight into the regulation of gene expression in melanocytes and demonstrates a powerful approach to the investigation of regulatory sequences that can be applied to other cell types.
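The 6-mer vocabulary idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the supervised algorithm, so the code below swaps in a plain log-odds enrichment score as a stand-in; it is not the authors' method. The function names (`canonical`, `count_kmers`, `kmer_weights`, `score`), the choice to merge each 6-mer with its reverse complement, the pseudocount, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import log

K = 6  # k-mer length named in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(seq, k=K):
    """Count canonical k-mers in one sequence, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def kmer_weights(positives, negatives, k=K, pseudocount=1.0):
    """Assign each k-mer a log-odds weight; a positive weight means enrichment in `positives`."""
    pos, neg = Counter(), Counter()
    for seq in positives:
        pos.update(count_kmers(seq, k))
    for seq in negatives:
        neg.update(count_kmers(seq, k))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    return {
        kmer: log((pos[kmer] + pseudocount) / pos_total)
        - log((neg[kmer] + pseudocount) / neg_total)
        for kmer in vocab
    }


def score(seq, weights, k=K):
    """Average weight over the k-mers of a sequence; unseen k-mers contribute zero."""
    counts = count_kmers(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items()) / total


# Toy sequences, made up for illustration only (not data from the study).
toy_enhancers = ["TTCACGTGAA", "GGCACGTGCC"]
toy_background = ["TTTTTTTTTT", "GGGGAAAATT"]
weights = kmer_weights(toy_enhancers, toy_background)
print(score("AACACGTGTT", weights), score("TTTTTTTTTT", weights))
```

In this sketch, sliding `score` over a genome in fixed windows would stand in for the genome-wide prediction step; a real analysis would need a properly matched background set and held-out evaluation.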

    • "spanning the 22 genes (RALY-UQCC) harboring the identified associated skin color SNPs. We considered several data sets that represent features associated with regulatory regions: ChIP-seq analysis in a lightly pigmented melanocytic cell line (LP22), a darkly pigmented melanocytic cell line (DP74) (Palstra et al. manuscript in preparation), and in a normal human epidermal keratinocytic cell line [NHEK (Rosenbloom et al. 2013)] of acetylated histone H3 (H3K27Ac), an active chromatin mark (Creyghton et al. 2010); DNaseI hypersensitive sites in epidermal skin melanocytes and in the NHEK cell line (Rosenbloom et al. 2013); ChIP-seq data in melanocytic cells (Strub et al. 2011) for the transcription factor MITF, the melanocyte master regulator (Levy et al. 2006); ChIP-seq data in MALME-3M melanoma cells for the transcription factor YY1 (Li et al. 2012), a ubiquitously expressed transcription factor that was reported to play an important role in melanocyte development by interacting with the melanocyte-specific isoform of MITF (Li et al. 2012); predicted melanocyte-specific enhancers (Gorkin et al. 2012); and PhastCons conserved elements inferred from 46-way alignments of placental mammals (Siepel et al. 2005)."
    ABSTRACT: In the International Visible Trait Genetics (VisiGen) Consortium, we investigated the genetics of human skin color by combining a series of genome-wide association studies (GWAS) in a total of 17,262 Europeans with functional follow-up of discovered loci. Our GWAS provide the first genome-wide significant evidence that chromosome 20q11.22, which harbors the ASIP gene, is explicitly associated with skin color in Europeans. In addition, genomic loci at 5p13.2 (SLC45A2), 6p25.3 (IRF4), 15q13.1 (HERC2/OCA2), and 16q24.3 (MC1R) were confirmed to be involved in skin coloration in Europeans. In follow-up gene expression and regulation studies of 22 genes in 20q11.22, we highlighted two novel genes, EIF2S2 and GSS, that serve as competing functional candidates in this region and provide future research lines. A genetically inferred skin color score obtained from the 9 top-associated SNPs from 9 genes in 940 worldwide samples (HGDP-CEPH) showed a clear gradual pattern in Western Eurasians similar to the distribution of physical skin color, suggesting that these 9 SNPs are suitable markers for DNA prediction of skin color in Europeans and neighboring populations, relevant in future forensic and anthropological investigations. Electronic supplementary material: The online version of this article (doi:10.1007/s00439-015-1559-0) contains supplementary material, which is available to authorized users.
    Full-text · Article · May 2015 · Human Genetics
    • "We found that these predictions were cell type-specific in K562 cells and could accurately distinguish enhancer sequences from non-enhancer sequences. Our results suggest that combinations of TF-binding preferences, not histone modifications alone, are most predictive of actively expressing genomic sequences, a result supported by other attempts to define the sequence features of enhancers (Heinz et al. 2010; Lee et al. 2011; Arvey et al. 2012; Gorkin et al. 2012; Smith et al. 2013). These results support a model where TF binding and subsequent tran- [Figure 3. Chromatin features and sequence-specific binding identify active sequences.]"
    ABSTRACT: The histone modification state of genomic regions is hypothesized to reflect the regulatory activity of the underlying genomic DNA. Based on this hypothesis, the ENCODE Project Consortium measured the status of multiple histone modifications across the genome in several cell types and used these data to segment the genome into regions with different predicted regulatory activities. We measured the cis-regulatory activity of more than 2000 of these predictions in the K562 leukemia cell line. We tested genomic segments predicted to be Enhancers, Weak Enhancers, or Repressed elements in K562 cells, along with other sequences predicted to be Enhancers specific to the H1 human embryonic stem cell line (H1-hESC). Both Enhancer and Weak Enhancer sequences in K562 cells were more active than negative controls, although surprisingly, Weak Enhancer segmentations drove expression higher than did Enhancer segmentations. Lower levels of the covalent histone modifications H3K36me3 and H3K27ac, thought to mark active enhancers and transcribed gene bodies, associate with higher expression and partly explain the higher activity of Weak Enhancers over Enhancer predictions. While DNase I hypersensitivity (HS) is a good predictor of active sequences in our assay, transcription factor (TF) binding models need to be included in order to accurately identify highly expressed sequences. Overall, our results show that a significant fraction (~26%) of the ENCODE enhancer predictions have regulatory activity, suggesting that histone modification states can reflect the cis-regulatory activity of sequences in the genome, but that specific sequence preferences, such as TF-binding sites, are the causal determinants of cis-regulatory activity.
    Preview · Article · Jul 2014 · Genome Research
    • "Additional evidence suggests that such DNA motifs representing putative TF binding sites are predictive of promoter activity, including tissue-specific expression of their target gene (for example, [13,14]). In addition, DNA motif enrichment analyses have shown that DNA motifs are highly predictive of enhancer activity [15-18]. "
    ABSTRACT: Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% of the tissues/cell types, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning distal noncoding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity in the mouse liver, whereas a control set of five sequences was completely negative. We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific signatures that can be learned and used for the de novo prediction of enhancers.
    Full-text · Article · Oct 2013 · Genome Biology
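
The last abstract above reports classifier quality as an area under the receiver operating characteristic curve (AUC). As a minimal sketch of what that number measures, the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the example scores are hypothetical and are not taken from any of the cited studies.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    if not scores_pos or not scores_neg:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up classifier scores: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0.9, 0.3], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based computation would be preferable for large sets.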