
The p53HMM algorithm: Using profile hidden markov models to detect p53-responsive genes

Abstract and Figures

A computational method (called p53HMM) is presented that utilizes Profile Hidden Markov Models (PHMMs) to estimate the relative binding affinities of putative p53 response elements (REs), both p53 single-sites and cluster-sites. These models incorporate a novel "Corresponded Baum-Welch" training algorithm that provides increased predictive power by exploiting the redundancy of information found in the repeated, palindromic p53-binding motif. The predictive accuracy of these new models is compared against that of other predictive models, including position-specific score matrices (PSSMs, or weight matrices). We also present a new dynamic acceptance threshold, dependent upon a putative binding site's distance from the Transcription Start Site (TSS) and its estimated binding affinity. This new criterion for classifying putative p53-binding sites increases predictive accuracy by reducing the false positive rate. Training a Profile Hidden Markov Model with corresponding positions matching a combined-palindromic p53-binding motif creates the best p53-RE predictive model. The p53HMM algorithm is available online (http://tools.csb.ias.edu). Using Profile Hidden Markov Models with training methods that exploit the redundant information of the homotetramer p53 binding site provides better predictive models than weight matrices (PSSMs). These methods may also boost performance when applied to other transcription factor binding sites.
The Topologies of p53 Single-site and Cluster-site Models. (a) A Profile Hidden Markov Model (PHMM) contains three hidden states for each position in a sequence motif of length n: a match state (green squares), an insertion state (orange diamonds), and a delete state (gray circles). The arrows represent allowed transitions between states and have associated probabilities. The match and insertion states also have associated nucleotide emission probabilities. The first and last insertion states (I-0 and I-n) and associated transitions (in red) are shown for completeness; however, they are not present in the p53 models, where they are replaced by FIM and FEM models. (b) The topology of the Finite Emission Module (FEM) of length N can model any distribution of spacer-lengths between 1 and N. For the p53 models, the model and background probabilities within the FEM modules are identically uniform, so that there is no cost for spacer-lengths between 1 and N; these modules are referred to as "no-cost FEMs". (c) The topology of the Free Insertion Module (FIM) can model an exponentially decaying distribution of spacer-lengths. However, by setting the model and background probabilities to identically uniform, the FIM can model any sequence of infinite length with no associated cost to the overall score (hence the word "Free"). (d) The main components of the p53 single-site model are the left and right half-site PHMMs, which potentially contain corresponding positions between them. These two half-site models are separated by a no-cost FEM model that limits the length of any intervening spacer sequence to 20 bp. The half-site models are also wrapped by two FIMs that allow the Viterbi algorithm to find the best matching motifs anywhere in the candidate sequences.
(e) The topology of the p53 cluster-site model consists of a single PHMM that models a general half-site, and two back-transitions that allow for modeling an infinite number of half-sites within the cluster-site. The back-transition through the no-cost FEM-14 model limits the spacer-sequence between the half-sites to lengths ≤ 14 bp.
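The role of the no-cost spacer in the single-site model (panel d) can be illustrated with a much simpler stand-in than a full PHMM. The sketch below is not the p53HMM implementation: it assumes that a per-position half-site score has already been computed by some scoring model (the hypothetical list `half_scores`), uses one score list for both half-sites although the paper describes separate left and right PHMMs, and assumes a spacer of 0 to 20 bp. It only shows how a spacer that contributes nothing to the score turns site finding into picking the best pair of half-sites within the allowed distance.

```python
from typing import List, Optional, Tuple

HALF_SITE_LEN = 10  # length of one p53 half-site
MAX_SPACER = 20     # spacer limit given in the caption for the single-site model


def best_single_site(
    half_scores: List[float], max_spacer: int = MAX_SPACER
) -> Optional[Tuple[float, int, int]]:
    """Return (total_score, left_start, right_start) for the best half-site pair.

    half_scores[i] is a hypothetical, precomputed score for a half-site
    starting at position i. The spacer adds no term to the total, which
    mimics the "no-cost" behaviour described for the FEM.
    """
    best: Optional[Tuple[float, int, int]] = None
    n = len(half_scores)
    for left in range(n):
        first_right = left + HALF_SITE_LEN  # spacer of length 0 (assumed allowed)
        last_right = min(n, first_right + max_spacer + 1)
        for right in range(first_right, last_right):
            total = half_scores[left] + half_scores[right]
            if best is None or total > best[0]:
                best = (total, left, right)
    return best


# Toy scores: three candidate half-sites at positions 2, 15 and 35.
scores = [0.0] * 40
scores[2], scores[15], scores[35] = 5.0, 4.0, 9.0
print(best_single_site(scores))
```

In this toy input the pair at positions 15 and 35 wins: the pair at 2 and 35 would score higher in total, but its spacer of 23 bp exceeds the limit, so it is never considered.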

... Another identified pair was FAS and ACTA2 (Fig. S6b). While the short ACTA2 transcript is known to be regulated by p53 via a functional RE, 58 the longer ACTA2 transcript (shown on23,24,53,54 For analysis of the identified ChIP-seq peaks, we used the p53MH algorithm 16 and demonstrated that the ChIP-seq peaks are enriched for putative p53MH sites, unlike the Input peaks. The enrichment was tightly confined within a 100 nt window centered at the peak maximum (Fig. 3A ). ...
... 63 Most p53 ChIP-seq peaks are in proximal although some are in distal CGIs (Figs. 6D and S9). ChIP-seq peaks, both in proximal (Fig. 6E)23,24,53,54 Because the p53 binding sites in IMR90 cells are enriched at CGIs, we asked if they contained sequence motif(s) distinct from those reported by genome-wide studies in cancer cells. A de novo motif search was conducted on the 743 high-confidence peaks (350 nt regions centered at the peak maxima) using MEME 4.6.1 (see Sup. methods). ...
Article
We report here genome-wide analysis of the tumor suppressor p53 binding sites in normal human cells. 743 high-confidence ChIP-seq peaks representing putative genomic binding sites were identified in normal IMR90 fibroblasts using a reference chromatin sample. More than 40% were located within 2 kb of a transcription start site (TSS), a distribution similar to that documented for individually studied, functional p53 binding sites and, to date, not observed by previous p53 genome-wide studies. Nearly half of the high-confidence binding sites in the IMR90 cells reside in CpG islands, in marked contrast to sites reported in cancer-derived cells. The distinct genomic features of the IMR90 binding sites do not reflect a distinct preference for specific sequences, since the de novo developed p53 motif based on our study is similar to those reported by genome-wide studies of cancer cells. More likely, the different chromatin landscape in normal, compared with cancer-derived cells, influences p53 binding via modulating availability of the sites. We compared the IMR90 ChIPseq peaks to the recently published IMR90 methylome and demonstrated that they are enriched at hypomethylated DNA. Our study represents the first genome-wide, de novo mapping of p53 binding sites in normal human cells and reveals that p53 binding sites reside in distinct genomic landscapes in normal and cancer-derived human cells.
... Cellular senescence is mainly triggered by telomere shortening, which is commonly operated by the p53 pathway. The p53 protein is able to regulate the biogenesis and secretion of exosomes to activate senescence, but is also known to regulate the transcription of growth factors, which are able to modulate the microenvironment [58]. ...
Article
Purpose of the review: To summarize the scientific evidence regarding the effects of environmental exposures on extracellular vesicle (EV) release and their contents. As environmental exposures might influence the aging phenotype in a very strict way, we will also report the role of EVs in the biological aging process. Recent findings: EV research is a new and quickly developing field. With many investigations conducted so far, only a limited number of studies have explored the potential role EVs play in the response and adaptation to environmental stimuli. The investigations available to date have identified several exposures or lifestyle factors able to modify EV trafficking including air pollutants, cigarette smoke, alcohol, obesity, nutrition, physical exercise, and oxidative stress. EVs are a very promising tool, as biological fluids are easily obtainable biological media that, if successful in identifying early alterations induced by the environment and predictive of disease, would be amenable to use for potential future preventive and diagnostic applications.
... Moreover, basic PWMs are restricted to the detection of motifs with a fixed length. This constraint has previously led to alternative heuristic approaches for the modeling of TFBS for TFs tolerant of variable widths, such as nuclear receptors [20] and p53 [21,22]. The analysis of variable spacing in the context of Escherichia coli promoter prediction was an early advance in the field, with the spacing between elements addressed with either PFMs [23] or PWMs using logarithms of the probabilities [24] (see [8] for a review). ...
Article
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
... Hence, further progress in the field will almost certainly require the inclusion of additional types of information to reduce the number of false-positive predictions. Computational sequence-based approaches to predict binding sites include position weight matrices (PWMs) (8–16), hidden Markov models (17–19) and support vector machines (20,21). Compared with a simple regular expression, these more accurately represent the extent of variation at specific base positions in the binding sequence. ...
Article
Genome-wide prediction of transcription factor binding sites is notoriously difficult. We have developed and applied a logistic regression approach for prediction of binding sites for the p53 transcription factor that incorporates sequence information and chromatin modification data. We tested this by comparison of predicted sites with known binding sites defined by chromatin immunoprecipitation (ChIP), by the location of predictions relative to genes, by the function of nearby genes and by analysis of gene expression data after p53 activation. We compared the predictions made by our novel model with predictions based only on matches to a sequence position weight matrix (PWM). In whole genome assays, the fraction of known sites identified by the two models was similar, suggesting that there was little to be gained from including chromatin modification data. In contrast, there were highly significant and biologically relevant differences between the two models in the location of the predicted binding sites relative to genes, in the function of nearby genes and in the responsiveness of nearby genes to p53 activation. We propose that these contradictory results can be explained by PWM and ChIP data reflecting primarily biophysical properties of protein–DNA interactions, whereas chromatin modification data capture biologically important functional information.
... The region homologous to the murine clustered p53 REs is below, with putative half-sites as before (those with a single mismatch are underlined). The parentheses indicate a nonamer that might be a putative half-site with a deletion within the core CWWG motif (the minus sign indicates the deletion), a situation observed in about 5% of p53 binding half-sites [41]. doi:10.1371/journal.pgen.1002731.g001 ...
Article
Author Summary TP53, the gene encoding p53, is mutated in more than half of human cancers. Consequently, p53 is one of the most studied transcription factors, shown to directly regulate more than 150 genes. The mouse is a model of choice to study p53 mutants and cancer. However, differences were found between tumorigenesis in mice and humans, and these should be investigated to improve the relevance of mouse models. The distinct mutational events required to initiate retinoblastomas in these species constitute a classic example of such differences. Here we show that p53 regulates the Retinoblastoma-like 2 (Rbl2) gene, encoding tumor suppressor p130, in murine but not human cells. The p53-dependent regulation of murine Rbl2/p130 relies on clustered p53 response elements, located within tandem repeats poorly conserved in evolution. A similar situation was found for two other genes, also p53 targets in mice but not in humans. Thus, tandem repeats may shape differences in mammalian p53 regulatory networks. By uncovering differences in p53 target gene repertoires between mice and humans, our findings may help to improve mice as models of human disease. In addition, the role of tandem repeats in shaping the target gene repertoires of other mammalian transcription factors should be considered.
... Meanwhile, we searched for target binding sites of proteins using FIMO, which does not allow for insertion and deletions in motif matching. However, it is known that the DNA target sites of some proteins contain indels [21]. Therefore, more flexible motif finding algorithms that take into account special sequence patterns (e.g. ...
Article
The regulatory mechanism of recombination is a fundamental problem in genomics, with wide applications in genome-wide association studies, birth-defect diseases, molecular evolution, cancer research, etc. In mammalian genomes, recombination events cluster into short genomic regions called "recombination hotspots". Recently, a 13-mer motif enriched in hotspots is identified as a candidate cis-regulatory element of human recombination hotspots; moreover, a zinc finger protein, PRDM9, binds to this motif and is associated with variation of recombination phenotype in human and mouse genomes, thus is a trans-acting regulator of recombination hotspots. However, this pair of cis and trans-regulators covers only a fraction of hotspots, thus other regulators of recombination hotspots remain to be discovered. In this paper, we propose an approach to predicting additional trans-regulators from DNA-binding proteins by comparing their enrichment of binding sites in hotspots. Applying this approach on newly mapped mouse hotspots genome-wide, we confirmed that PRDM9 is a major trans-regulator of hotspots. In addition, a list of top candidate trans-regulators of mouse hotspots is reported. Using GO analysis we observed that the top genes are enriched with function of histone modification, highlighting the epigenetic regulatory mechanisms of recombination hotspots.
... Meanwhile, we searched for target binding sites of proteins using FIMO, which does not allow for insertion and deletions in motif matching. However, it is known that the DNA target sites of some proteins contain indels [22]. Therefore, more flexible motif finding algorithms that take into account special sequence patterns (e.g. ...
Conference Paper
The regulatory mechanism of recombination is a fundamental problem in genomics, with wide applications in genome-wide association studies, birth-defect diseases, molecular evolution, cancer research, etc. In mammalian genomes, recombination events cluster into short genomic regions called "recombination hotspots". Recently, a 13-mer motif enriched in hotspots is identified as a candidate cis-regulatory element of human recombination hotspots; moreover, a zinc finger protein, PRDM9, binds to this motif and is associated with variation of recombination phenotype in human and mouse genomes, thus is a trans-acting regulator of recombination hotspots. However, this pair of cis and trans-regulators covers only a fraction of hotspots, thus other regulators of recombination hotspots remain to be discovered. In this paper, we propose an approach to predicting additional trans-regulators from DNA-binding proteins by comparing their enrichment of binding sites in hotspots. Applying this approach on newly mapped mouse hotspots genome-wide, we confirmed that PRDM9 is a major trans-regulator of hotspots. In addition, a list of top candidate trans-regulators of mouse hotspots is reported. Using GO analysis we observed that the top genes are enriched with function of histone modification, highlighting the epigenetic regulatory mechanisms of recombination hotspots.
... A profile Hidden Markov Model (HMM) architecture, which statistically represents a pattern of position-specific conservation for a series of related sequences, is a probabilistic machine learning approach that has the potential to discover patterns in sets of data that are difficult to notice by direct observation [35]. HMMs have proven useful, for example, in sequence alignment of protein families and prediction of novel family members [36,37], prediction of signal peptides [38], and prediction of p53-binding sites [39]. HMMs also form the foundation of Pfam's homology searching capabilities [40]. ...
Article
Author Summary Protein kinases are enzymes that regulate key cellular processes by covalently attaching a phosphate group to substrate proteins; they are crucial components of signaling pathways involved in cancer, diabetes, and many other diseases. Identifying the substrates of particular protein kinases is challenging, and many existing biochemical methods are biased against weakly expressed proteins like transcription factors. Here we exploited the observation that mitogen-activated protein kinases (MAPKs) briefly attach to many of their substrates before phosphorylating them, docking onto a sequence known as the ‘D-site’. We developed D-finder, a computational tool that uses a combination of expert knowledge and machine learning to search genome databases for D-sites. We then verified several of D-finder's predictions using rigorous and well-established biochemical assays. The most intriguing predicted and verified substrates were the Gli1 and Gli3 transcription factors of the ‘hedgehog’ signaling pathway. Gli transcription factors are involved in embryonic development and stem cell differentiation, and have also been found to be hyperactive in several types of cancer. There is emerging evidence that crosstalk with MAPK pathways is important in Gli-mediated regulation. Our study, however, is the first to show that MAPKs directly phosphorylate Gli transcription factors.
Article
The tumour suppressor p53 has a central role in the response to cellular stress. Activated p53 transcriptionally regulates hundreds of genes that are involved in multiple biological processes, including in DNA damage repair, cell cycle arrest, apoptosis and senescence. In the context of DNA damage, p53 is thought to be a decision-making transcription factor that selectively activates genes as part of specific gene expression programmes to determine cellular outcomes. In this Review, we discuss the multiple molecular mechanisms of p53 regulation and how they modulate the induction of apoptosis or cell cycle arrest following DNA damage. Specifically, we discuss how the interaction of p53 with DNA and chromatin affects gene expression, and how p53 post-translational modifications, its temporal expression dynamics and its interactions with chromatin regulators and transcription factors influence cell fate. These multiple layers of regulation enable p53 to execute cellular responses that are appropriate for specific cellular states and environmental conditions.
Article
Motivation: BioJava is an open-source project for processing of biological data in the Java programming language. We have recently released a new version (3.0.5), which is a major update to the code base that greatly extends its functionality. Results: BioJava now consists of several independent modules that provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detection of protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats using a biologically meaningful data model. Availability: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.6 or higher. All inquiries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists Contact: andreas.prlic@gmail.com
Article
BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. Availability: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.5 or higher. Contact: andreas.prlic@gmail.com. All queries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists.
Article
Recent studies have demonstrated transcriptional activation domains within the tumor suppressor protein p53, while others have described specific DNA-binding sites for p53, implying that the protein may act as a transcriptional regulatory factor. We have used a reiterative selection procedure (CASTing: cyclic amplification and selection of targets) to identify new specific binding sites for p53, using nuclear extracts from normal human fibroblasts as the source of p53 protein. The preferred consensus is the palindrome GGACATGCCCGGGCATGTCC. In vitro-translated p53 binds to this sequence only when mixed with nuclear extracts, suggesting that p53 may bind DNA after posttranslational modification or as a complex with other protein partners. When placed upstream of a reporter construct, this sequence promotes p53-dependent transcription in transient transfection assays.
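The palindromic nature of the consensus quoted above can be checked directly: a DNA sequence is a reverse-complement palindrome when it reads the same on both strands. The following is a minimal sketch; the function names are illustrative and not taken from the cited work.

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_reverse_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


consensus = "GGACATGCCCGGGCATGTCC"  # consensus reported in the abstract above
print(is_reverse_palindrome(consensus))  # True
```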
Book
This talk will review a little over a decade's research on applying certain stochastic models to biological sequence analysis. The models themselves have a longer history, going back over 30 years, although many novel variants have arisen since that time. The function of the models in biological sequence analysis is to summarize the information concerning what is known as a motif or a domain in bioinformatics, and to provide a tool for discovering instances of that motif or domain in a separate sequence segment. We will introduce the motif models in stages, beginning from very simple, non-stochastic versions, progressively becoming more complex, until we reach modern profile HMMs for motifs. A second example will come from gene finding using sequence data from one or two species, where generalized HMMs or generalized pair HMMs have proved to be very effective.
Article
The endosomal compartment of the cell is involved in a number of functions including: (a) internalizing membrane proteins to multivesicular bodies and lysosomes; (b) producing vesicles that are secreted from the cell (exosomes); and (c) generating autophagic vesicles that, especially in times of nutrient deprivation, supply cytoplasmic components to the lysosome for degradation and recycling of nutrients. The p53 protein responds to various stress signals by initiating a transcriptional program that restores cellular homeostasis and prevents the accumulation of errors in a cell. As part of this process, p53 regulates the transcription of a set of genes encoding proteins that populate the endosomal compartment and impact upon each of these endosomal functions. Here, we demonstrate that p53 regulates transcription of the genes TSAP6 and CHMP4C, which enhance exosome production, and CAV1 and CHMP4C, which produce a more rapid endosomal clearance of the epidermal growth factor receptor from the plasma membrane. Each of these p53-regulated endosomal functions results in the slowing of cell growth and division, the utilization of catabolic resources and cell-to-cell communication by exosomes after a stress signal is detected by the p53 protein. These processes avoid errors during stress and restore homeostasis once the stress is resolved.
Article
Recent experiments have suggested that p53 action may be mediated through its interaction with DNA. We have now identified 18 human genomic clones that bind to p53 in vitro. Precise mapping of the binding sequences within these clones revealed a consensus binding site with a striking internal symmetry, consisting of two copies of the 10 base pair motif 5'-PuPuPuC(A/T)(T/A)GPyPyPy-3' separated by 0-13 base pairs. One copy of the motif was insufficient for binding, and subtle alterations of the motif, even when present in multiple copies, resulted in loss of affinity for p53. Mutants of p53, representing each of the four "hot spots" frequently altered in human cancers, failed to bind to the consensus dimer. These results define the DNA sequence elements with which p53 interacts in vitro and which may be important for p53 action in vivo.
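The consensus described above (two copies of PuPuPuC(A/T)(T/A)GPyPyPy separated by 0-13 bp) translates directly into a regular expression, with Pu = A or G and Py = C or T. The sketch below is only an exact-consensus scan on one strand; unlike PSSM or PHMM approaches it gives no score and tolerates no mismatches. The function name `find_consensus_sites` is illustrative.

```python
import re
from typing import List, Tuple

# One 10 bp half-site: PuPuPu C (A/T)(T/A) G PyPyPy
HALF_SITE = "[AG]{3}C[AT][AT]G[CT]{3}"

# Two half-sites separated by a 0-13 bp spacer. The lookahead lets
# re.finditer report candidate sites that overlap one another.
FULL_SITE = re.compile(f"(?=({HALF_SITE}[ACGT]{{0,13}}{HALF_SITE}))")


def find_consensus_sites(seq: str) -> List[Tuple[int, str]]:
    """Return (start, matched_site) for each position where a full site begins."""
    return [(m.start(), m.group(1)) for m in FULL_SITE.finditer(seq.upper())]


# Zero-length spacer, with two flanking bases on each side.
print(find_consensus_sites("TTGGACATGCCCGGGCATGTCCAA"))
# A single half-site is not reported as a site.
print(find_consensus_sites("GGACATGCCC"))
```

Because the spacer quantifier is greedy, each start position reports the match with the longest spacer that still ends in a valid half-site; enumerating every admissible spacer length per start would need an explicit loop.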
Article
Mutant forms of the gene encoding the tumor suppressor p53 are found in numerous human malignancies, but the physiologic function of p53 and the effects of mutations on this function are unknown. The p53 protein binds DNA in a sequence-specific manner and thus may regulate gene transcription. Cotransfection experiments showed that wild-type p53 activated the expression of genes adjacent to a p53 DNA binding site. The level of activation correlated with DNA binding in vitro. Oncogenic forms of p53 lost this activity. Moreover, all mutants inhibited the activity of coexpressed wild-type p53, providing a basis for the selection of such mutants during tumorigenesis.
Article
A graphical method is presented for displaying the patterns in a set of aligned sequences. The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. The height of each letter is made proportional to its frequency, and the letters are sorted so the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From these ‘sequence logos’, one can determine not only the consensus sequence but also the relative frequency of bases and the information content (measured in bits) at every position in a site or sequence. The logo displays both significant residues and subtle sequence patterns.
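The stack and letter heights described above can be computed per alignment column. The sketch below uses the standard formulation for DNA, information = log2(4) minus the Shannon entropy of the column, and omits the small-sample correction used in practice; the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict


def column_information(column: str, alphabet: str = "ACGT") -> float:
    """Information content (bits) of one alignment column, without
    small-sample correction: log2(|alphabet|) - Shannon entropy."""
    counts = Counter(column.upper())
    total = sum(counts[b] for b in alphabet)
    entropy = 0.0
    for b in alphabet:
        if counts[b]:
            p = counts[b] / total
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy


def letter_heights(column: str, alphabet: str = "ACGT") -> Dict[str, float]:
    """Height of each letter: its frequency times the column's information."""
    counts = Counter(column.upper())
    total = sum(counts[b] for b in alphabet)
    info = column_information(column, alphabet)
    return {b: counts[b] / total * info for b in alphabet if counts[b]}


print(column_information("AAAA"))  # fully conserved column: 2 bits
print(column_information("ACGT"))  # uniform column: 0 bits
print(letter_heights("AAAT"))
```

The letter heights in a column sum to the stack height, so a fully conserved position gives a single 2-bit letter and a uniform position gives an empty stack.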
Article
Aligned sequences from the same family (e.g. the haemoglobins) are seldom representative of the entire family. This is because (1) the sequence databases are heavily skewed toward a small number of organisms and (2) only a minute fraction of all the different family members have been sequenced. For many applications, such as using alignments or profiles to perform database searches for distantly related family members, such unequal representation requires correction. An algorithm to perform appropriate weighting of individual sequences is presented along with examples illustrating its efficacy.