The p53HMM algorithm: Using profile hidden markov models to detect p53-responsive genes

May 2009
BMC Bioinformatics 10(1):111

DOI:10.1186/1471-2105-10-111

Source
PubMed

License
CC BY 2.0

Authors:

Xin Yu

Rutgers, The State University of New Jersey

Eduardo D Sontag

Northeastern University

A computational method, p53HMM, is presented that uses Profile Hidden Markov Models (PHMMs) to estimate the relative binding affinities of putative p53 response elements (REs), both p53 single-sites and cluster-sites. These models incorporate a novel "Corresponded Baum-Welch" training algorithm that provides increased predictive power by exploiting the redundancy of information found in the repeated, palindromic p53-binding motif. The predictive accuracy of these new models is compared against that of other predictive models, including position specific score matrices (PSSMs, or weight matrices). We also present a new dynamic acceptance threshold that depends on a putative binding site's distance from the Transcription Start Site (TSS) and on its estimated binding affinity. This new criterion for classifying putative p53-binding sites increases predictive accuracy by reducing the false positive rate. Training a Profile Hidden Markov Model with corresponding positions matching a combined-palindromic p53-binding motif creates the best p53-RE predictive model. The p53HMM algorithm is available online (http://tools.csb.ias.edu). Using Profile Hidden Markov Models with training methods that exploit the redundant information of the homotetramer p53 binding site provides better predictive models than weight matrices (PSSMs). These methods may also boost performance when applied to other transcription factor binding sites.

Original Data from El-Deiry et al., Used To Define The p53 Consensus Binding Site. The original DNA fragments collected from a genome-wide p53-antibody immunoprecipitation, which were used to define the head-to-head (HH) p53 consensus binding site, are presented graphically [3]. The yellow columns corresponding to the 1st and 2nd half-sites were used to define the consensus p53 motif. The p53 binding site is highly degenerate. Within the yellow columns, 7 of the 20 DNA target sites (35%) had at least one nucleotide insertion (green), deletion (red), or both (magenta) relative to the discovered 10 bp-spacer-10 bp consensus. Since insertions and deletions throw off the reading frame of a weight matrix, any PSSM approach will inherently mis-score at least 35% of these 20 sites. Alignments of the 160 experimentally validated p53 binding sites likewise reveal that any PSSM approach would inherently mis-score at least 30% of them. In addition, in 15 of the 20 target sites (75%), further p53 half-sites (in yellow) lie immediately adjacent to the ones used to define the consensus. Since the genome-wide immunoprecipitation study was designed to pull down the highest-affinity sites, the fact that 75% of the target sites are actually p53 cluster-sites is the first indication that cluster-sites of 3 or more half-sites confer higher binding affinity [22].
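The frame-shift problem described in this caption can be illustrated with a minimal sketch. It assumes a toy log-odds weight matrix built from the half-site consensus RRRCWWGYYY with hypothetical probabilities and a uniform background; none of these numbers come from the paper, and the function names are illustrative only.

```python
import math

# IUPAC codes that appear in the half-site consensus RRRCWWGYYY.
IUPAC = {"R": "AG", "W": "AT", "Y": "CT", "C": "C", "G": "G"}

def toy_pssm(consensus="RRRCWWGYYY", favored=0.9):
    """Build a log2-odds matrix; `favored` is a hypothetical probability mass."""
    pssm = []
    for code in consensus:
        allowed = IUPAC[code]
        column = {}
        for base in "ACGT":
            if base in allowed:
                p = favored / len(allowed)
            else:
                p = (1.0 - favored) / (4 - len(allowed))
            column[base] = math.log2(p / 0.25)  # uniform background assumed
        pssm.append(column)
    return pssm

def score(pssm, site):
    """Sum per-column log-odds; the site is read strictly column by column."""
    return sum(column[base] for column, base in zip(pssm, site))

def best_window(pssm, seq):
    """Best ungapped score over every window of the matrix width."""
    width = len(pssm)
    return max(score(pssm, seq[i:i + width])
               for i in range(len(seq) - width + 1))

pssm = toy_pssm()
perfect = "GGGCATGCCC"    # matches the consensus at every position
inserted = "GGGCAATGCCC"  # the same site with one extra A in the core

print("perfect site:", round(score(pssm, perfect), 3))
print("best window with insertion:", round(best_window(pssm, inserted), 3))
```

Every ungapped window over the inserted site misaligns part of the motif, so the best window still scores well below the intact site even though only one nucleotide was added.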

…

Normalized Experimental Affinity of Cluster-sites

…

The Topologies of p53 Single-site and Cluster-site Models. (a) A Profile Hidden Markov Model (PHMM) contains three hidden states for each position in a sequence motif of length n: a match state (green squares), an insertion state (orange diamonds), and a delete state (gray circles). The arrows represent allowed transitions between states and have associated probabilities. The match and insertion states also have associated nucleotide emission probabilities. The first and last insertion states (I-0 and I-n) and their associated transitions (in red) are shown for completeness; they are not present in the p53 models, where they are replaced by FIM and FEM modules. (b) The topology of the Finite Emission Module (FEM) of length N can model any distribution of spacer lengths between 1 and N. For the p53 models, the model and background probabilities within the FEMs are identically uniform, so there is no cost for spacer lengths between 1 and N; these modules are referred to as "no-cost FEMs". (c) The topology of the Free Insertion Module (FIM) can model an exponentially decaying distribution of spacer lengths. However, when the model and background probabilities are set to be identically uniform, the FIM can model a sequence of any length with no associated cost to the overall score (hence the word "Free"). (d) The main components of the p53 single-site model are the left and right half-site PHMMs, which potentially contain corresponding positions between them. These two half-site models are separated by a no-cost FEM that limits the length of any intervening spacer sequence to 20 bp. The half-site models are also flanked by two FIMs that allow the Viterbi algorithm to find the best matching motifs anywhere in the candidate sequences. (e) The topology of the p53 cluster-site model consists of a single PHMM that models a general half-site, and two back-transitions that allow an unbounded number of half-sites to be modeled within the cluster-site. The back-transition through the no-cost FEM-14 module limits the spacer sequence between half-sites to lengths ≤ 14 bp.
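The match/insert/delete topology of panel (a) can be sketched as a log-space Viterbi recursion over a single half-site. This is a simplified illustration, not the p53HMM implementation: the emission and transition probabilities are hypothetical, transitions are shared by all positions, direct insert-to-delete and delete-to-insert transitions are omitted, insert emissions equal the background (log-odds 0), and the FIM and FEM modules are left out. As in the caption, there are no I-0 or I-n states.

```python
import math

NEG_INF = float("-inf")

# Hypothetical transition probabilities (log2), shared by every position.
LOG_T = {
    ("M", "M"): math.log2(0.90), ("M", "I"): math.log2(0.05),
    ("M", "D"): math.log2(0.05),
    ("I", "M"): math.log2(0.80), ("I", "I"): math.log2(0.20),
    ("D", "M"): math.log2(0.80), ("D", "D"): math.log2(0.20),
}

def match_log_odds(consensus="RRRCWWGYYY", favored=0.9):
    """Toy match-state emissions as log2 odds against a uniform background."""
    iupac = {"R": "AG", "W": "AT", "Y": "CT", "C": "C", "G": "G"}
    columns = []
    for code in consensus:
        allowed = iupac[code]
        columns.append({
            base: math.log2(
                (favored / len(allowed) if base in allowed
                 else (1 - favored) / (4 - len(allowed))) / 0.25)
            for base in "ACGT"
        })
    return columns

def viterbi_score(emissions, seq):
    """Best global path score of seq through match/insert/delete states."""
    n, length = len(emissions), len(seq)
    M = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    I = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    D = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # the begin state is treated as match state 0
    for j in range(n + 1):
        for i in range(length + 1):
            if j >= 1 and i >= 1:  # match state j emits seq[i - 1]
                M[j][i] = emissions[j - 1][seq[i - 1]] + max(
                    M[j - 1][i - 1] + LOG_T["M", "M"],
                    I[j - 1][i - 1] + LOG_T["I", "M"],
                    D[j - 1][i - 1] + LOG_T["D", "M"])
            if 1 <= j < n and i >= 1:  # insert state j; emission log-odds is 0
                I[j][i] = max(M[j][i - 1] + LOG_T["M", "I"],
                              I[j][i - 1] + LOG_T["I", "I"])
            if j >= 1:  # delete state j consumes no nucleotide
                D[j][i] = max(M[j - 1][i] + LOG_T["M", "D"],
                              D[j - 1][i] + LOG_T["D", "D"])
    return max(M[n][length], D[n][length])

emissions = match_log_odds()
for site in ("GGGCATGCCC", "GGGCAATGCCC", "GGGCTGCCC"):
    print(site, round(viterbi_score(emissions, site), 3))
```

The second and third sequences carry a one-base insertion and a one-base deletion. Both still receive a finite score because the path can detour through an insert or delete state and then resume matching in frame, paying only the transition cost.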

…

The Four p53 Correspondence Motifs. The four correspondence motifs for the repeated, palindromic p53 RE are graphically represented. In the top three motifs, each line ties two synonymous positions. In the bottom motif, the previously independent half-sites are also tied to each other by the yellow connecting lines, so that four synonymous positions now correspond. The completely un-tied motif (not shown) has no correspondence, and thus no connecting lines, between any of the positions in the motif. (R = A or G, W = A or T, and Y = C or T. Position ã has the complement nucleotide emission distribution of a.)
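The idea of tying four synonymous positions can be sketched by pooling nucleotide counts. This is a plain count-pooling illustration of the combined-palindromic correspondence, not the Corresponded Baum-Welch algorithm itself. It assumes ungapped 20 bp sites made of two adjacent 10 bp half-sites; the example sites, the pseudocount, and all function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
HALF = 10  # length of one p53 half-site

def column_counts(sites):
    """Per-position nucleotide counts for equal-length, ungapped sites."""
    width = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(width)]

def tie_combined_palindromic(counts, pseudocount=1.0):
    """Pool counts over 4 corresponding positions; return tied distributions."""
    tied = [None] * (2 * HALF)
    for k in range(HALF // 2):
        direct = (k, HALF + k)                       # read as-is
        mirrored = (HALF - 1 - k, 2 * HALF - 1 - k)  # read as complements
        pooled = {base: pseudocount for base in "ACGT"}
        for pos in direct:
            for base in "ACGT":
                pooled[base] += counts[pos][base]
        for pos in mirrored:
            for base in "ACGT":
                pooled[base] += counts[pos][COMPLEMENT[base]]
        total = sum(pooled.values())
        dist = {base: pooled[base] / total for base in "ACGT"}
        # A mirrored position emits the complement distribution of its partner.
        comp = {base: dist[COMPLEMENT[base]] for base in "ACGT"}
        for pos in direct:
            tied[pos] = dist
        for pos in mirrored:
            tied[pos] = comp
    return tied

# Hypothetical 20 bp sites (two 10 bp half-sites, no spacer).
sites = ["GGACATGCCCGGGCATGTCC", "AGGCTTGTCTGAACAAGCCC"]
tied = tie_combined_palindromic(column_counts(sites))
print({base: round(p, 3) for base, p in tied[0].items()})
```

Because each tied distribution is estimated from four columns instead of one, every parameter sees four times as many observations, which is the redundancy the correspondence motifs are meant to exploit.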

…

Cross Validation with Receiver Operating Characteristic (ROC) curves reveals increased predictive power over weight matrices. 1000 iterations of 10-fold random-split cross validation reveal that the most predictive models utilize the correspondence structures. The combined-palindromic model is the best model since it contains roughly half as many parameters as the other three correspondence models. The positive set contains 160 experimentally validated p53 binding sites, and the negative set contains 40 bp random samples from the mononucleotide content of the training set. The true positive and false positive rates are calculated and plotted for all possible threshold values for each model. The predictive measure for comparing the curves is the AUC (Area Under the Curve). In all the PHMM models the insert-state emissions are fixed to the A, G, C, T nucleotide distribution of the training set. The best classifier uses the combined-palindromic training motif. (Position ã has the complement nucleotide emission distribution of a).
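The evaluation described in this caption can be sketched in a few lines. The sketch assumes that each model reduces a sequence to a single score; the helper names and the example scores are hypothetical, and the cross-validation loop itself is omitted. Negatives are drawn from the pooled mononucleotide content of the training sites, and the AUC is computed with the trapezoidal rule over all distinct thresholds.

```python
import random

def sample_background(training_sites, length=40, rng=random):
    """Draw one random sequence from the sites' mononucleotide content."""
    pooled = "".join(training_sites)
    return "".join(rng.choice(pooled) for _ in range(length))

def roc_curve(pos_scores, neg_scores):
    """Return (FPR, TPR) points for every distinct score threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Trapezoidal area under a curve given as ordered (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for held-out positives and sampled negatives.
positives = [9.1, 7.4, 6.8, 3.2]
negatives = [4.0, 2.5, 1.9, 0.3]
print("AUC:", auc(roc_curve(positives, negatives)))
print(sample_background(["GGACATGCCC", "AGGCTTGTCT"], rng=random.Random(0)))
```

An AUC of 0.5 corresponds to a classifier that cannot separate the two sets, and 1.0 to perfect separation, which is why the area is a convenient single number for ranking the models.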

…

Additional File 1

Data

April 2009

Todd Riley · Xin Yu · Eduardo D Sontag · Arnold Levine

Distinct p53 genomic binding patterns in normal and cancer-derived human cells

Article

Dec 2011
CELL CYCLE

We report here genome-wide analysis of the tumor suppressor p53 binding sites in normal human cells. 743 high-confidence ChIP-seq peaks representing putative genomic binding sites were identified in normal IMR90 fibroblasts using a reference chromatin sample. More than 40% were located within 2 kb of a transcription start site (TSS), a distribution similar to that documented for individually studied, functional p53 binding sites and, to date, not observed by previous p53 genome-wide studies. Nearly half of the high-confidence binding sites in the IMR90 cells reside in CpG islands, in marked contrast to sites reported in cancer-derived cells. The distinct genomic features of the IMR90 binding sites do not reflect a distinct preference for specific sequences, since the de novo developed p53 motif based on our study is similar to those reported by genome-wide studies of cancer cells. More likely, the different chromatin landscape in normal, compared with cancer-derived cells, influences p53 binding via modulating availability of the sites. We compared the IMR90 ChIPseq peaks to the recently published IMR90 methylome and demonstrated that they are enriched at hypomethylated DNA. Our study represents the first genome-wide, de novo mapping of p53 binding sites in normal human cells and reveals that p53 binding sites reside in distinct genomic landscapes in normal and cancer-derived human cells.

Extracellular Vesicles: How the External and Internal Environment Can Shape Cell-To-Cell Communication

Article

Mar 2017

Purpose of the review: To summarize the scientific evidence regarding the effects of environmental exposures on extracellular vesicle (EV) release and their contents. As environmental exposures might influence the aging phenotype in a very strict way, we will also report the role of EVs in the biological aging process. Recent findings: EV research is a new and quickly developing field. With many investigations conducted so far, only a limited number of studies have explored the potential role EVs play in the response and adaptation to environmental stimuli. The investigations available to date have identified several exposures or lifestyle factors able to modify EV trafficking including air pollutants, cigarette smoke, alcohol, obesity, nutrition, physical exercise, and oxidative stress. EVs are a very promising tool, as biological fluids are easily obtainable biological media that, if successful in identifying early alterations induced by the environment and predictive of disease, would be amenable to use for potential future preventive and diagnostic applications.

The Next Generation of Transcription Factor Binding Site Prediction

Article

Sep 2013
PLOS COMPUT BIOL

Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that convey properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.

Models incorporating chromatin modification data identify functionally important p53 binding sites

Article

Apr 2013
NUCLEIC ACIDS RES

Genome-wide prediction of transcription factor binding sites is notoriously difficult. We have developed and applied a logistic regression approach for prediction of binding sites for the p53 transcription factor that incorporates sequence information and chromatin modification data. We tested this by comparison of predicted sites with known binding sites defined by chromatin immunoprecipitation (ChIP), by the location of predictions relative to genes, by the function of nearby genes and by analysis of gene expression data after p53 activation. We compared the predictions made by our novel model with predictions based only on matches to a sequence position weight matrix (PWM). In whole genome assays, the fraction of known sites identified by the two models was similar, suggesting that there was little to be gained from including chromatin modification data. In contrast, there were highly significant and biologically relevant differences between the two models in the location of the predicted binding sites relative to genes, in the function of nearby genes and in the responsiveness of nearby genes to p53 activation. We propose that these contradictory results can be explained by PWM and ChIP data reflecting primarily biophysical properties of protein–DNA interactions, whereas chromatin modification data capture biologically important functional information.

Fuzzy Tandem Repeats Containing p53 Response Elements May Define Species-Specific p53 Target Genes

Article

Jun 2012
PLOS GENET

Author Summary TP53, the gene encoding p53, is mutated in more than half of human cancers. Consequently, p53 is one of the most studied transcription factors, shown to directly regulate more than 150 genes. The mouse is a model of choice to study p53 mutants and cancer. However, differences were found between tumorigenesis in mice and humans, and these should be investigated to improve the relevance of mouse models. The distinct mutational events required to initiate retinoblastomas in these species constitute a classic example of such differences. Here we show that p53 regulates the Retinoblastoma-like 2 (Rbl2) gene, encoding tumor suppressor p130, in murine but not human cells. The p53-dependent regulation of murine Rbl2/p130 relies on clustered p53 response elements, located within tandem repeats poorly conserved in evolution. A similar situation was found for two other genes, also p53 targets in mice but not in humans. Thus, tandem repeats may shape differences in mammalian p53 regulatory networks. By uncovering differences in p53 target gene repertoires between mice and humans, our findings may help to improve mice as models of human disease. In addition, the role of tandem repeats in shaping the target gene repertoires of other mammalian transcription factors should be considered.

Epigenetic functions enriched in transcription factors binding to mouse recombination hotspots

Article

Jun 2012
PROTEOME SCI

The regulatory mechanism of recombination is a fundamental problem in genomics, with wide applications in genome-wide association studies, birth-defect diseases, molecular evolution, cancer research, etc. In mammalian genomes, recombination events cluster into short genomic regions called "recombination hotspots". Recently, a 13-mer motif enriched in hotspots is identified as a candidate cis-regulatory element of human recombination hotspots; moreover, a zinc finger protein, PRDM9, binds to this motif and is associated with variation of recombination phenotype in human and mouse genomes, thus is a trans-acting regulator of recombination hotspots. However, this pair of cis and trans-regulators covers only a fraction of hotspots, thus other regulators of recombination hotspots remain to be discovered. In this paper, we propose an approach to predicting additional trans-regulators from DNA-binding proteins by comparing their enrichment of binding sites in hotspots. Applying this approach on newly mapped mouse hotspots genome-wide, we confirmed that PRDM9 is a major trans-regulator of hotspots. In addition, a list of top candidate trans-regulators of mouse hotspots is reported. Using GO analysis we observed that the top genes are enriched with function of histone modification, highlighting the epigenetic regulatory mechanisms of recombination hotspots.

Prediction of Trans-regulators of Recombination Hotspots in Mouse Genome

Conference Paper

Nov 2011

The regulatory mechanism of recombination is a fundamental problem in genomics, with wide applications in genome wide association studies, birth-defect diseases, molecular evolution, cancer research, etc. In mammalian genomes, recombination events cluster into short genomic regions called "recombination hotspots". Recently, a 13-mer motif enriched in hotspots is identified as a candidate cis-regulatory element of human recombination hotspots, moreover, a zinc finger protein, PRDM9, binds to this motif and is associated with variation of recombination phenotype in human and mouse genomes, thus is a trans-acting regulator of recombination hotspots. However, this pair of cis and trans-regulators covers only a fraction of hotspots, thus other regulators of recombination hotspots remain to be discovered. In this paper, we propose an approach to predicting additional trans-regulators from DNA-binding proteins by comparing their enrichment of binding sites in hotspots. Applying this approach on newly mapped mouse hotspots genome-wide, we confirmed that PRDM9 is a major trans-regulator of hotspots. In addition, a list of top candidate trans-regulators of mouse hotspots is reported. Using GO analysis we observed that the top genes are enriched with function of histone modification, highlighting the epigenetic regulatory mechanisms of recombination hotspots.

Computational Prediction and Experimental Verification of New MAP Kinase Docking Sites and Substrates Including Gli Transcription Factors

Article

Aug 2010
PLOS COMPUT BIOL

Author Summary Protein kinases are enzymes that regulate key cellular processes by covalently attaching a phosphate group to substrate proteins; they are crucial components of signaling pathways involved in cancer, diabetes, and many other diseases. Identifying the substrates of particular protein kinases is challenging, and many existing biochemical methods are biased against weakly expressed proteins like transcription factors. Here we exploited the observation that mitogen-activated protein kinases (MAPKs) briefly attach to many of their substrates before phosphorylating them, docking onto a sequence known as the ‘D-site’. We developed D-finder, a computational tool that uses a combination of expert knowledge and machine learning to search genome databases for D-sites. We then verified several of D-finder's predictions using rigorous and well-established biochemical assays. The most intriguing predicted and verified substrates were the Gli1 and Gli3 transcription factors of the ‘hedgehog’ signaling pathway. Gli transcription factors are involved in embryonic development and stem cell differentiation, and have also been found to be hyperactive in several types of cancer. There is emerging evidence that crosstalk with MAPK pathways is important in Gli-mediated regulation. Our study, however, is the first to show that MAPKs directly phosphorylate Gli transcription factors.

The multiple mechanisms that regulate p53 activity and cell fate

Article

Mar 2019

The tumour suppressor p53 has a central role in the response to cellular stress. Activated p53 transcriptionally regulates hundreds of genes that are involved in multiple biological processes, including in DNA damage repair, cell cycle arrest, apoptosis and senescence. In the context of DNA damage, p53 is thought to be a decision-making transcription factor that selectively activates genes as part of specific gene expression programmes to determine cellular outcomes. In this Review, we discuss the multiple molecular mechanisms of p53 regulation and how they modulate the induction of apoptosis or cell cycle arrest following DNA damage. Specifically, we discuss how the interaction of p53 with DNA and chromatin affects gene expression, and how p53 post-translational modifications, its temporal expression dynamics and its interactions with chromatin regulators and transcription factors influence cell fate. These multiple layers of regulation enable p53 to execute cellular responses that are appropriate for specific cellular states and environmental conditions.

A comprehensive approach for validating p53 binding site predictions

Conference Paper

May 2017

BioJava: An open-source framework for bioinformatics in 2012

Article

Aug 2012
BIOINFORMATICS

Motivation: BioJava is an open-source project for processing of biological data in the Java programming language. We have recently released a new version (3.0.5), which is a major update to the code base that greatly extends its functionality. Results: BioJava now consists of several independent modules that provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detection of protein modifications and prediction of disordered regions in proteins as well as parsers for common file formats using a biologically meaningful data model. Availability: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.6 or higher. All inquiries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists Contact: andreas.prlic@gmail.com

BioJava: An Open-Source Framework for Bioinformatics

Article

Oct 2008
BIOINFORMATICS

BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. Availability: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.5 or higher. Contact: andreas.prlic@gmail.com. All queries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists.

A transcriptionally active DNA-binding site for human p53 protein complexes

Article

Jul 1992

Recent studies have demonstrated transcriptional activation domains within the tumor suppressor protein p53, while others have described specific DNA-binding sites for p53, implying that the protein may act as a transcriptional regulatory factor. We have used a reiterative selection procedure (CASTing: cyclic amplification and selection of targets) to identify new specific binding sites for p53, using nuclear extracts from normal human fibroblasts as the source of p53 protein. The preferred consensus is the palindrome GGACATGCCCGGGCATGTCC. In vitro-translated p53 binds to this sequence only when mixed with nuclear extracts, suggesting that p53 may bind DNA after posttranslational modification or as a complex with other protein partners. When placed upstream of a reporter construct, this sequence promotes p53-dependent transcription in transient transfection assays.

Hidden Markov models in computational biology: Applications to protein modeling

Article

Jan 1994

Biological Sequence Analysis

Book

Jan 1998

This talk will review a little over a decade's research on applying certain stochastic models to biological sequence analysis. The models themselves have a longer history, going back over 30 years, although many novel variants have arisen since that time. The function of the models in biological sequence analysis is to summarize the information concerning what is known as a motif or a domain in bioinformatics, and to provide a tool for discovering instances of that motif or domain in a separate sequence segment. We will introduce the motif models in stages, beginning from very simple, non-stochastic versions, progressively becoming more complex, until we reach modern profile HMMs for motifs. A second example will come from gene finding using sequence data from one or two species, where generalized HMMs or generalized pair HMMs have proved to be very effective.

The regulation of the endosomal compartment by p53 the tumor suppressor gene

Article

May 2009
FEBS J

The endosomal compartment of the cell is involved in a number of functions including: (a) internalizing membrane proteins to multivesicular bodies and lysosomes; (b) producing vesicles that are secreted from the cell (exosomes); and (c) generating autophagic vesicles that, especially in times of nutrient deprivation, supply cytoplasmic components to the lysosome for degradation and recycling of nutrients. The p53 protein responds to various stress signals by initiating a transcriptional program that restores cellular homeostasis and prevents the accumulation of errors in a cell. As part of this process, p53 regulates the transcription of a set of genes encoding proteins that populate the endosomal compartment and impact upon each of these endosomal functions. Here, we demonstrate that p53 regulates transcription of the genes TSAP6 and CHMP4C, which enhance exosome production, and CAV1 and CHMP4C, which produce a more rapid endosomal clearance of the epidermal growth factor receptor from the plasma membrane. Each of these p53-regulated endosomal functions results in the slowing of cell growth and division, the utilization of catabolic resources and cell-to-cell communication by exosomes after a stress signal is detected by the p53 protein. These processes avoid errors during stress and restore homeostasis once the stress is resolved.

Definition of a consensus binding site for p53

Article

May 1992

Recent experiments have suggested that p53 action may be mediated through its interaction with DNA. We have now identified 18 human genomic clones that bind to p53 in vitro. Precise mapping of the binding sequences within these clones revealed a consensus binding site with a striking internal symmetry, consisting of two copies of the 10 base pair motif 5'-PuPuPuC(A/T)(T/A)GPyPyPy-3' separated by 0-13 base pairs. One copy of the motif was insufficient for binding, and subtle alterations of the motif, even when present in multiple copies, resulted in loss of affinity for p53. Mutants of p53, representing each of the four "hot spots" frequently altered in human cancers, failed to bind to the consensus dimer. These results define the DNA sequence elements with which p53 interacts in vitro and which may be important for p53 action in vivo.

Oncogenic forms of p53 inhibit p53-regulated gene expression

Article

Jun 1992

Mutant forms of the gene encoding the tumor suppressor p53 are found in numerous human malignancies, but the physiologic function of p53 and the effects of mutations on this function are unknown. The p53 protein binds DNA in a sequence-specific manner and thus may regulate gene transcription. Cotransfection experiments showed that wild-type p53 activated the expression of genes adjacent to a p53 DNA binding site. The level of activation correlated with DNA binding in vitro. Oncogenic forms of p53 lost this activity. Moreover, all mutants inhibited the activity of coexpressed wild-type p53, providing a basis for the selection of such mutants during tumorigenesis.

Sequence Logos: A New Way to Display Consensus Sequences

Article

Nov 1990

A graphical method is presented for displaying the patterns in a set of aligned sequences. The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. The height of each letter is made proportional to its frequency, and the letters are sorted so the most common one is on top. The height of the entire stack is then adjusted to signify the information content of the sequences at that position. From these ‘sequence logos’, one can determine not only the consensus sequence but also the relative frequency of bases and the information content (measured in bits) at every position in a site or sequence. The logo displays both significant residues and subtle sequence patterns.

Weighting aligned protein or nucleic acid sequences to correct for unequal representation

Article

Jan 1991

Aligned sequences from the same family (e.g. the haemoglobins) are seldom representative of the entire family. This is because (1) the sequence databases are heavily skewed toward a small number of organisms and (2) only a minute fraction of all the different family members have been sequenced. For many applications, such as using alignments or profiles to perform database searches for distantly related family members, such unequal representation requires correction. An algorithm to perform appropriate weighting of individual sequences is presented along with examples illustrating its efficacy.
