Towards Better Prioritization of Epigenetically Modified DNA Regions

Authors:
  • IFBR-BSRC Alexander Fleming

I. Maglogiannis, V. Plagianakos, and I. Vlahavas (Eds.): SETN 2012, LNAI 7297, pp. 270–277, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Ernesto Iacucci1, Dusan Popovic1, Georgios A. Pavlopoulos1,
Léon-Charles Tranchevent1, Marijke Bauters2,
Bart De Moor1, and Yves Moreau1
1 ESAT-SCD / IBBT-K.U.Leuven Future Health Department, Katholieke Universiteit Leuven,
Kasteelpark Arenberg 10, Box 2446, 3001, Leuven, Belgium
2 Department of Human Genetics / IBBT-K.U.Leuven Future Health Department, Katholieke
Universiteit Leuven, Kasteelpark Arenberg 10, Box 2446, 3001, Leuven, Belgium
{Ernesto.Iacucci,Dusan.Popovic,Georgios.Pavlopoulos,
Leon-Charles.Tranchevent,Bart.DeMoor,
Yves.Moreau}@esat.kuleuven.be,
marijke.bauters@cme.vib-kuleuven.be.be
Abstract. Epigenetic modifications of the genome can cause profound changes
in the phenotype of an organism. Experimental methods allow us to detect regions
of the DNA that have been epigenetically modified; these regions are said to be
enriched in a queried state versus a control. Detecting the enriched regions is
not a simple matter, as making sense of the data involves multiple analytical
steps and often results in false calls. In this study, we analyze the utility of using
additional features of the data (such as the transcription start site (TSS) and the
histone coverage) to detect enrichment. We train a decision tree ensemble using
these two features together with the ChIP-chip enrichment score and review how
well the three features identify regions that are truly enriched (as validated by
q-PCR). We find that the enrichment score derived directly from ChIP-chip
experiment data is less informative than the histone coverage.
Keywords: ChIP-chip, data integration, protein-DNA, machine learning,
decision trees.
1 Introduction
The detection of protein-DNA interactions is an important area of research. Protein-
DNA interactions account for various cellular events such as DNA repair and
transcription factor binding [1-3]. Transcription factors regulate the expression level
of gene products that carry out the majority of processes in the cell. Histone-DNA
interactions are a specific type of protein-DNA interaction that also influences
the expression of genes. Indeed, DNA that is wound around histone-bodies (the
complex form of histones) is less accessible to the cellular transcriptional machinery
and thus genes located in these regions are less likely to be expressed [4,5]. These
modifications are considered epigenetic as they alter the expression of genes while
not changing their sequences.
A widely used technique to measure protein-DNA interaction is chromatin
immunoprecipitation followed by DNA microarray hybridization (ChIP-chip). Using
ChIP-chip, one is able to identify areas of the genome that are enriched between
two conditions of interest (e.g., disease vs. control) [1,6]. Detecting the enriched
regions is not a simple matter, as making sense of the data involves multiple
analytical steps and often results in false calls [7,8]. In this study, we assess
whether using additional features enhances the detection of enriched regions [9,10].
In addition to the enrichment scores extracted from ChIP-chip data, the
transcription start site (TSS) and histone coverage scores are defined and used to
train a decision tree-based algorithm.
While the primary feature resulting from a ChIP-chip experiment is the enrichment
score for a region, the other two features are easily derived. The TSS score is based
on the distance of the region to the nearest predicted TSS. The histone coverage is a
unit value which is calculated from a region's size (in base-pairs) in relation to the
size of a full turn of the DNA around a histone body (147 base-pairs).
We then review how well these three features perform in predicting the regions
that are truly enriched (as validated by q-PCR).
2 Methods
Our dataset consists of 25 DNA regions for which we have ChIP-chip enrichment
scores, region sizes and distances to the nearest transcription start site, and validated
q-PCR values. Our dataset is derived from ChIP-chip experiments assaying fragile-X
patient samples (data unpublished). The q-PCR values define the positive and
negative examples and are treated as binary for the purposes of this work.
The ChIP-chip enrichment score is derived from a data analysis procedure
described in [11,12]. Briefly, the data is processed as follows:
1. The outliers in the data are removed (probes in a 5-probe window are
   averaged and probes which are more than 2 standard deviations from the mean
   are removed).
2. The data is normalized by adjusting the mean of the entire distribution to zero.
3. The differences between the two samples are calculated (one sample is the
   condition/disease sample and the other is the control).
4. The data is smoothed (a 3-point moving average is calculated for each peak).
5. The probes which show significant differences are identified (those more than 2
   standard deviations from the mean).
6. The regions of consistent difference defined by multiple probes (4 probes of a
   5-probe window) are called (flagged as significant).
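The normalization, differencing, smoothing, thresholding and region-calling steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of [11,12]: the function names `moving_average` and `call_regions` are hypothetical, the outlier-removal step is omitted because its exact definition is not given here, and the two-sided threshold and the truncation of the moving average at the array ends are assumptions.

```python
from statistics import mean, pstdev

def moving_average(values, width=3):
    """Centered moving average; the window is truncated at the array ends."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def call_regions(disease, control, z_cut=2.0, window=5, min_hits=4):
    """Return one boolean per probe: True where a consistent difference is called."""
    # Normalization: shift each sample so that its mean is zero.
    d_mean, c_mean = mean(disease), mean(control)
    # Difference between the condition sample and the control, probe by probe.
    diff = [(d - d_mean) - (c - c_mean) for d, c in zip(disease, control)]
    # Smoothing with a 3-point moving average.
    smooth = moving_average(diff, 3)
    # Significant probes: more than z_cut standard deviations from the mean
    # (two-sided here, which is an assumption).
    mu, sigma = mean(smooth), pstdev(smooth)
    significant = [sigma > 0 and abs(s - mu) > z_cut * sigma for s in smooth]
    # Region calling: flag every window of `window` probes that holds
    # at least `min_hits` significant probes.
    flagged = [False] * len(smooth)
    for start in range(len(smooth) - window + 1):
        if sum(significant[start:start + window]) >= min_hits:
            for i in range(start, start + window):
                flagged[i] = True
    return flagged
```

Calling `call_regions(disease, control)` with two equally long lists of probe intensities returns one flag per probe; runs of `True` values correspond to called regions.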
The transcription start site (TSS) feature is calculated as the distance from the
nearest TSS. These distances (measured in base-pairs) are then mapped to an integer
score which varies from 0 to 5. The histone coverage is a feature which is
computed from the size (measured in base-pairs) of the enriched region. The size of
the region is transformed into a unit value by applying the equation displayed in
Figure 1.
Fig. 1. Histone coverage score calculation
The dataset, consisting of the three features and the validated q-PCR outcomes,
is then fed to a decision tree learning algorithm (classregtree, Matlab v7.10.0). In
addition, a bagged decision tree ensemble classifier is trained on the whole
dataset.
This algorithm builds individual trees on bootstrap replicates of the original
dataset and then uses the out-of-bag observations to compute unbiased estimates of
the classification error. These estimates are often exploited to measure feature
importance: for each feature, its values are permuted across all the observations,
after which the difference in mean squared error is examined. A higher positive
difference implies greater importance for that feature. Furthermore, we validated
our results using leave-one-out cross validation.
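To make the out-of-bag permutation idea concrete, the following Python sketch bags very simple trees and measures, for each feature, how much the out-of-bag mean squared error grows when that feature is shuffled. It is a minimal illustration under stated assumptions, not the Matlab implementation used in this work: depth-one trees (stumps) stand in for full decision trees to keep the example short, labels are assumed to be 0/1, rows are assumed to be Python lists, and the names `fit_stump`, `mse` and `oob_permutation_importance` are hypothetical.

```python
import random

def fit_stump(X, y):
    """Depth-one tree: pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for above in (0, 1):  # class predicted when the value exceeds the threshold
                errors = sum((above if row[j] > t else 1 - above) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    if best is None:  # no feature varies in this sample: predict the majority class
        majority = int(2 * sum(y) >= len(y))
        return lambda row: majority
    _, j, t, above = best
    return lambda row: above if row[j] > t else 1 - above

def mse(model, X, y):
    """Mean squared error; for 0/1 labels this equals the misclassification rate."""
    return sum((model(row) - label) ** 2 for row, label in zip(X, y)) / len(y)

def oob_permutation_importance(X, y, n_trees=50, seed=0):
    """Average increase in out-of-bag error after permuting each feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    totals = [0.0] * n_features
    used = 0
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap replicate
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # out-of-bag observations
        if not oob:
            continue
        model = fit_stump([X[i] for i in boot], [y[i] for i in boot])
        X_oob, y_oob = [X[i] for i in oob], [y[i] for i in oob]
        base = mse(model, X_oob, y_oob)
        for j in range(n_features):
            column = [row[j] for row in X_oob]
            rng.shuffle(column)                          # permute one feature only
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_oob, column)]
            totals[j] += mse(model, permuted, y_oob) - base
        used += 1
    return [t / max(used, 1) for t in totals]
```

A feature that the trees never rely on yields a difference close to zero, and a negative value means that shuffling the feature happened to reduce the out-of-bag error.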
3 Results and Discussion
We run a decision tree learning algorithm on the whole dataset to examine which
features are selected as the most informative and in which order. We construct a
ROC curve for each feature and examine the AUC as a heuristic to determine which
features are the most important. Out of the three features considered, we find that
enrichment performs the most poorly (AUC TSS: 0.62, AUC enrichment score: 0.60,
AUC histone score: 0.73). This result suggests that the use of enrichment scores
alone is not an optimal strategy to predict truly enriched regions. We observe that
the TSS score and histone coverage are necessary to improve performance in the
prediction task. This observation is consistent with the results from the out-of-bag
feature importance analysis (see Figure 2).
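The per-feature AUC used above can be computed without drawing the ROC curve, since it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below is an illustration only; the function name `auc` is hypothetical, labels are assumed to be 0/1, and the orientation of each feature (whether high or low values indicate enrichment) is left to the caller because it is not specified here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count the positive/negative pairs in which the positive example ranks higher.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With this definition, a feature whose low values mark the positive examples gives an AUC below 0.5; negating the feature before the call gives the complementary value.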
Fig. 2. Out of bag importance of features
Figure 2 demonstrates that the highest positive difference occurs with the histone
coverage, which implies greater importance of this feature. Surprisingly, the
enrichment feature is associated with a negative difference, indicating that it is the
least important of the features. In order to illustrate this finding with the original
data, we create a scatter plot that compares the histone coverage with the enrichment
value while at the same time indicating the positive and negative regions (see
Figure 3).
Fig. 3. Scatter plot of histone coverage vs enrichment score. Circles indicate negative examples
and crosses indicate positive examples.
Figure 3 demonstrates that negative examples (circles) are concentrated at higher
histone coverage values while they are spread across high and low enrichment
values. Positive examples (crosses) are also spread across high and low enrichment
values but are mostly found at lower histone coverage values.
The utility of the TSS score and the histone coverage is more apparent when one
considers the topology of the decision tree constructed using the whole dataset: the
first split is on the histone coverage, the second split is on the TSS score, and
there is no split on the enrichment score (see Figure 4).
In order to assess the reliability of this approach, we ran 100 iterations of a
leave-one-out cross validation analysis. The results were as follows: using the
enrichment feature alone, a single decision tree has a mean accuracy of 0.44
(st. dev. 0.022); when we use all three features, the value rises to 0.64 (st. dev.
0.045); and when we use the TSS score and the histone coverage (and no enrichment
value), the best value, 0.76 (st. dev. 0.022), is achieved.
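A leave-one-out cross validation loop of the kind described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: `train_1nn` is a hypothetical placeholder learner (1-nearest-neighbour) standing in for the decision tree learners, and `leave_one_out_accuracy` accepts any training function that returns a prediction function.

```python
def train_1nn(X, y):
    """Placeholder learner: 1-nearest-neighbour (not the tree learner used in the paper)."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in X]
        return y[dists.index(min(dists))]
    return predict

def leave_one_out_accuracy(X, y, train):
    """Hold out each example once, train on the rest, and score the held-out prediction."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = train(X_train, y_train)
        correct += int(model(X[i]) == y[i])
    return correct / len(X)
```

When the learner is randomized, for example through internal bootstrapping, repeating this loop many times and reporting the mean and standard deviation of the accuracy gives figures of the kind quoted above.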
Fig. 4. Decision Tree
As presented in Table 1, the results obtained with the random forests algorithm
(Matlab TreeBagger class) are consistent with the ones obtained with single
decision trees. The lower accuracy of random forests compared to single trees in
this particular case could be explained by the small size of the training data sets
throughout the iterations. Random forests perform an additional bootstrapping step
internally, which reduces the available training data even further.
Table 1. Metric comparison across 100 experiments
Features                                  Random forest      Decision trees
                                          mean     std.      mean     std.
TSS score, Histone coverage               0.6380   0.0237    0.7600   0.0223
TSS score, Histone coverage, Enrichment   0.5948   0.0315    0.6400   0.0446
Enrichment                                0.4508   0.0196    0.4400   0.0223
4 Conclusions
This work presents novel insight into the task of predicting true enrichment in regions
detected by ChIP-chip experimentation. Our main technical contribution is two-fold.
First, we demonstrate that the use of enrichment scores alone is not an optimal
strategy. Second, we show that the use of two additional features, namely TSS and
histone coverage, provides unique information and is necessary to improve the
prediction results. Looking forward, we plan to examine the integration of other
features and the development of other strategies which might increase predictive
power.
Acknowledgements. Funding: The authors would like to acknowledge support from:
Research Council KUL: ProMeta, GOA Ambiorics, GOA MaNet, CoE EF/05/007
SymBioSys and KUL PFV/10/016 SymBioSys, START 1, several PhD/postdoc &
fellow grants. Flemish Government: FWO: PhD/postdoc grants, projects G.0318.05
(subfunctionalization), G.0553.06 (VitamineD), G.0302.07 (SVM/Kernel), research
communities (ICCoS, ANMMM, MLDM); G.0733.09 (3UTR); G.082409 (EGFR)
IWT: PhD Grants, Silicos; SBO-BioFrame, SBO-MoKa, TBM-IOTA3 FOD: Cancer
plans, IBBT. Belgian Federal Science Policy Office: IUAP P6/25 (BioMaGNet,
Bioinformatics and Modeling: from Genomes to Networks, 2007-2011); EU-RTD:
ERNSI: European Research Network on System Identification; FP7-HEALTH;
CHeartED.
References
1. Bieda, M., Xu, X., Singer, M.A., Green, R., Farnham, P.J.: Unbiased location analysis of
E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome
Res. 16(5), 595–605 (2006)
2. Kreuz, M., Rosolowski, M., Berger, H., Schwaenen, C., Wessendorf, S., Loeffler, M.,
Hasenclever, D.: Development and implementation of an analysis tool for array-based
comparative genomic hybridization. Methods Inf. Med. 46(5), 608–613 (2007)
3. Pelizzola, M., Koga, Y., Urban, A.E., Krauthammer, M., Weissman, S., Halaban, R.,
Molinaro, A.M.: MEDME: an experimental and analytical methodology for the estimation
of DNA methylation levels based on microarray derived MeDIP-enrichment. Genome
Res. 18(10), 1652–1659 (2008)
4. Dowell, R.D.: Transcription factor binding variation in the evolution of gene regulation.
Trends Genet. 26(11), 468–475 (2010)
5. Gilchrist, D.A., Fargo, D.C., Adelman, K.: Using ChIP-chip and ChIP-seq to study the
regulation of gene expression: genome-wide localization studies reveal widespread
regulation of transcription elongation. Methods 48(4), 398–408 (2009)
6. MacQuarrie, K.L., Fong, A.P., Morse, R.H., Tapscott, S.J.: Genome-wide transcription
factor binding: beyond direct target regulation. Trends Genet. 27(4), 141–148 (2011)
7. Toedling, J., Huber, W.: Analyzing ChIP-chip data using Bioconductor. PLoS Comput.
Biol. 4(11), e1000227 (2008)
8. Toedling, J., Sklyar, O., Krueger, T., Fischer, J.J., Sperling, S., Huber, W.: Ringo - an
R/Bioconductor package for analyzing ChIP-chip readouts. BMC Bioinformatics 8, 221
(2007)
9. Chen, K.B., Zhang, Y.: A varying threshold method for ChIP peak-calling using multiple
sources of information. Bioinformatics 26(18), i504–i510 (2010)
10. Johnson, D.S., Li, W., Gordon, D.B., Bhattacharjee, A., Curry, B., Ghosh, J., Brizuela, L.,
Carroll, J.S., Brown, M., Flicek, P., Koch, C.M., Dunham, I., Bieda, M., Xu, X., Farnham,
P.J., Kapranov, P., Nix, D.A., Gingeras, T.R., Zhang, X., Holster, H., Jiang, N., Green,
R.D., Song, J.S., McCuine, S.A., Anton, E., Nguyen, L., Trinklein, N.D., Ye, Z., Ching,
K., Hawkins, D., Ren, B., Scacheri, P.C., Rozowsky, J., Karpikov, A., Euskirchen, G.,
Weissman, S., Gerstein, M., Snyder, M., Yang, A., Moqtaderi, Z., Hirsch, H., Shulha,
H.P., Fu, Y., Weng, Z., Struhl, K., Myers, R.M., Lieb, J.D., Liu, X.S.: Systematic
evaluation of variability in ChIP-chip experiments using predefined DNA targets.
Genome Res. 18(3), 393–403 (2008)
11. Sharp, A.J., Migliavacca, E., Dupre, Y., Stathaki, E., Sailani, M.R., Baumer, A., Schinzel,
A., Mackay, D.J., Robinson, D.O., Cobellis, G., Cobellis, L., Brunner, H.G., Steiner, B.,
Antonarakis, S.E.: Methylation profiling in individuals with uniparental disomy
identifies novel differentially methylated regions on chromosome 15. Genome Res. 20(9),
1271–1278 (2010)
12. Sharp, A.J., Stathaki, E., Migliavacca, E., Brahmachary, M., Montgomery, S.B., Dupre,
Y., Antonarakis, S.E.: DNA methylation profiles of human active and inactive X
chromosomes. Genome Res. 21(10), 1592–1600 (2011)