Integration of ChIP-seq and machine learning reveals enhancers and a predictive regulatory sequence vocabulary in melanocytes

McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
Genome Research (Impact Factor: 13.85). 09/2012; 22(11). DOI: 10.1101/gr.139360.112
Source: PubMed

ABSTRACT We take a comprehensive approach to the study of regulatory control of gene expression in melanocytes that proceeds from large-scale enhancer discovery facilitated by ChIP-seq; to rigorous validation in silico, in vitro, and in vivo; and finally to the use of machine learning to elucidate a regulatory vocabulary with genome-wide predictive power. We identify 2489 putative melanocyte enhancer loci in the mouse genome by ChIP-seq for EP300 and H3K4me1. We demonstrate that these putative enhancers are evolutionarily constrained, enriched for sequence motifs predicted to bind key melanocyte transcription factors, located near genes relevant to melanocyte biology, and capable of driving reporter gene expression in melanocytes in culture (86%; 43/50) and in transgenic zebrafish (70%; 7/10). Next, using the sequences of these putative enhancers as a training set for a supervised machine learning algorithm, we develop a vocabulary of 6-mers predictive of melanocyte enhancer function. Lastly, we demonstrate that this vocabulary has genome-wide predictive power in both the mouse and human genomes. This study provides deep insight into the regulation of gene expression in melanocytes and demonstrates a powerful approach to the investigation of regulatory sequences that can be applied to other cell types.
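The 6-mer vocabulary described in the abstract rests on representing each enhancer sequence by the 6-mers it contains. The following is a minimal sketch of that feature-extraction step only, not the authors' implementation. It assumes that a 6-mer and its reverse complement are counted as one feature and that windows containing ambiguous bases are skipped; the abstract states neither choice, and the function names are hypothetical.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Collapse a k-mer and its reverse complement to one key.

    Assumption: strand-independent features; the lexicographically
    smaller of the two strings is used as the representative.
    """
    return min(kmer, reverse_complement(kmer))


def kmer_counts(seq, k=6):
    """Count canonical k-mers in seq, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def kmer_frequencies(seq, k=6):
    """Normalize counts to frequencies so sequences of different length compare."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Frequency vectors of this kind, computed for putative enhancers and for background sequences, could then serve as input to a supervised classifier; the choice of classifier and of background set is outside this sketch.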

  • Source
    • "spanning the 22 genes (RALY-UQCC) harboring the identified associated skin color SNPs. We considered several data sets that represent features associated with regulatory regions: ChIP-seq analysis in a lightly pigmented melanocytic cell line (LP22), a darkly pigmented melanocytic cell line (DP74) (Palstra et al. manuscript in preparation), and in a normal human epidermal keratinocytic cell line [NHEK (Rosenbloom et al. 2013)] of acetylated histone H3 (H3K27Ac), an active chromatin mark (Creyghton et al. 2010), DNaseI hypersensitive sites in epidermal skin melanocytes and in the NHEK cell line (Rosenbloom et al. 2013); ChIP-seq data for the transcription factor MITF in melanocytic cells (Strub et al. 2011), MITF is the melanocyte master regulator (Levy et al. 2006), ChIP-seq data in MALME-3M melanoma cells for the transcription factor YY1 (Li et al. 2012), a ubiquitously expressed transcription factor that was reported to play an important role in melanocyte development by interacting with the melanocyte-specific isoform of MITF (Li et al. 2012); predicted melanocyte-specific enhancers (Gorkin et al. 2012) and Phastcons conserved elements inferred from 46-way alignments of placental mammals (Siepel et al. 2005). "
    ABSTRACT: In the International Visible Trait Genetics (VisiGen) Consortium, we investigated the genetics of human skin color by combining a series of genome-wide association studies (GWAS) in a total of 17,262 Europeans with functional follow-up of discovered loci. Our GWAS provide the first genome-wide significant evidence for chromosome 20q11.22 harboring the ASIP gene being explicitly associated with skin color in Europeans. In addition, genomic loci at 5p13.2 (SLC45A2), 6p25.3 (IRF4), 15q13.1 (HERC2/OCA2), and 16q24.3 (MC1R) were confirmed to be involved in skin coloration in Europeans. In follow-up gene expression and regulation studies of 22 genes in 20q11.22, we highlighted two novel genes EIF2S2 and GSS, serving as competing functional candidates in this region and providing future research lines. A genetically inferred skin color score obtained from the 9 top-associated SNPs from 9 genes in 940 worldwide samples (HGDP-CEPH) showed a clear gradual pattern in Western Eurasians similar to the distribution of physical skin color, suggesting the used 9 SNPs as suitable markers for DNA prediction of skin color in Europeans and neighboring populations, relevant in future forensic and anthropological investigations.
    Human Genetics 05/2015; DOI:10.1007/s00439-015-1559-0 · 4.52 Impact Factor
  • Source
    ABSTRACT: Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167-80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at
    Nucleic Acids Research 06/2013; 41(Web Server issue). DOI:10.1093/nar/gkt519 · 9.11 Impact Factor
  • Source
    ABSTRACT: Chromatin modifications have been comprehensively illustrated to play important roles in gene regulation and cell diversity in recent years. Given the rapid accumulation of genome-wide chromatin modification maps across multiple cell types, there is an urgent need for computational methods to analyze multiple maps to reveal combinatorial modification patterns and define functional DNA elements, especially those that are specific to cell types or tissues. In this study, we developed a computational method using differential chromatin modification analysis (dCMA) to identify cell-type-specific genomic regions with distinctive chromatin modifications. We then applied this method to a public data set with modification profiles of nine marks for nine cell types to evaluate its effectiveness. We found cell-type-specific elements unique to each cell type investigated. These unique features show significant cell-type-specific biological relevance and tend to be located within functional regulatory elements. These results demonstrate the power of a differential comparative epigenomic strategy in deciphering the human genome and characterizing cell specificity.
    Nucleic Acids Research 08/2013; 41(20). DOI:10.1093/nar/gkt712 · 9.11 Impact Factor