Conference Paper

Selection of putative cis-regulatory motifs through regional and global conservation

Nat. Res. Council of Canada, Ottawa, Ont., Canada
DOI: 10.1109/CSB.2004.1332545
Conference: Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE
Source: IEEE Xplore


Cis-regulatory motifs are often overrepresented in promoters and may exhibit frequency biases in subpromoter regions (SPRs). Many probabilistic algorithms have been used to predict such motifs, but they tend to generate many false positives. We devised a novel algorithm, MotifFilter, that computes representation indices (RIs) for putative motifs. An RI is the ratio of the observed to the expected frequency of a motif in promoters, SPRs, or random genomic DNA, where the expected frequency takes into account the nucleotide probability distribution in each region. This approach was applied to a genome-wide survey of putative cAMP-response elements (CREs) generated by a profile hidden Markov model. Twenty of the 144 putative CRE motifs found in the survey were retained by MotifFilter.
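The abstract does not give the exact expectation formula, so the following is only a minimal sketch of an observed-over-expected ratio, assuming a zeroth-order background model in which nucleotides are independent and their probabilities are estimated from the region being scanned. The function names (`nucleotide_probs`, `count_occurrences`, `representation_index`) are hypothetical and are not taken from MotifFilter.

```python
from math import prod


def nucleotide_probs(seqs):
    """Estimate A/C/G/T probabilities from the sequences of one region."""
    counts = {b: 0 for b in "ACGT"}
    for s in seqs:
        for b in s.upper():
            if b in counts:
                counts[b] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {b: c / total for b, c in counts.items()}


def count_occurrences(motif, seqs):
    """Count exact matches of the motif on the given strand, overlaps included."""
    k = len(motif)
    return sum(
        1
        for s in seqs
        for i in range(len(s) - k + 1)
        if s[i:i + k].upper() == motif
    )


def representation_index(motif, seqs):
    """Observed count divided by the count expected under the assumed
    independent-nucleotide background for this region."""
    motif = motif.upper()
    probs = nucleotide_probs(seqs)
    # Probability of the motif at any one start position (assumption:
    # positions are independent, background is region-specific).
    p_motif = prod(probs[b] for b in motif)
    # Number of start positions at which the motif could occur.
    positions = sum(max(len(s) - len(motif) + 1, 0) for s in seqs)
    expected = p_motif * positions
    observed = count_occurrences(motif, seqs)
    return observed / expected if expected > 0 else float("nan")


if __name__ == "__main__":
    promoters = ["ACGTACGT"]
    print(representation_index("ACGT", promoters))
```

Under this assumed model, a value well above 1 would indicate that the motif occurs more often than the region's base composition alone predicts. Computing the index separately for promoters, SPRs, and random genomic DNA only requires passing a different sequence set.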

Available from: Youlian Pan, Oct 06, 2015
  • Source
    • "For this reason, we use genome-specific nucleotide frequency. Additionally, the nucleotide frequencies change over various regions of genomic sequences [24]-[25]. For more precise prediction, regional nucleotide frequencies should be applied. Occasionally, the log-odd score of a motif instance could be dominated by one or a few positions because of their extremely high or low frequency values for certain nucleotide(s)."
    ABSTRACT: In biological sequence research, the positional weight matrix (PWM) is often used to search for putative transcription factor binding sites. A log-odd score is usually applied to measure the closeness of a subsequence to the PWM. However, the log-odd score is motif-length-dependent, and thus there is no universally applicable threshold. In this paper, we propose an alternative scoring index (G) varying from zero, where the subsequence is not much different from the background, to one, where the subsequence fits the PWM best. We also propose a measure evaluating the statistical expectation at each G index. We investigated the PWMs from the TRANSFAC database and found that the statistical expectation is significantly (p < 0.0001) correlated with both the length of the PWMs and the threshold G value. We applied this method to two PWMs (GCN4_C and ROX1_Q6) of yeast transcription factor binding sites and two PWMs (HIC1-02, HIC1_03) of the human tumor suppressor (HIC-1) binding sites from the TRANSFAC database. Finally, our method compares favorably with the broadly used Match method. The results indicate that our method is more flexible and can provide better confidence.
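As an illustration of the conventional log-odd score that the abstract contrasts with its G index, the sketch below scores a subsequence against a PWM using a region-specific background. It does not implement the G index, whose definition is not given here. The toy PWM, the background values, and the function names are assumptions made for the example, and PWM probabilities are assumed to be strictly positive (for example, after adding pseudocounts).

```python
from math import log2


def position_terms(subseq, pwm, background):
    """Per-position log2(p_pwm / p_background) terms for one subsequence.

    pwm is a list of dicts (one per motif position) mapping A/C/G/T to
    probabilities; background maps A/C/G/T to regional frequencies.
    """
    if len(subseq) != len(pwm):
        raise ValueError("subsequence and PWM must have the same length")
    return [log2(col[b] / background[b]) for col, b in zip(pwm, subseq.upper())]


def log_odds_score(subseq, pwm, background):
    """Conventional log-odd score: the sum of the per-position terms."""
    return sum(position_terms(subseq, pwm, background))


if __name__ == "__main__":
    # Hypothetical 3-position PWM; values are illustrative only.
    toy_pwm = [
        {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
        {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    ]
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # assumed regional background

    for name, bg in (("uniform", uniform), ("AT-rich", at_rich)):
        terms = position_terms("AAG", toy_pwm, bg)
        print(name, [round(t, 2) for t in terms], round(sum(terms), 2))
```

Printing the per-position terms makes it easy to see when one strongly skewed column contributes most of the total, and how the same subsequence scores differently once a regional background replaces a uniform one. Because the total is a sum over positions, its range grows with motif length, which is the length dependence the abstract points out.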
  • Source
    • "The representation index (RI) was used to measure representation of a motif in a set of sequences and was defined as the total number of occurrences of a motif (N) divided by the statistical expectation value (E) of the motif in a given set of sequences [44]. Motif expectation values were calculated using the following formula: "
    ABSTRACT: Transcription factors regulate gene expression by interacting with their specific DNA binding sites. Some transcription factors, particularly those involved in transcription initiation, always bind close to transcription start sites (TSS). Others have no such preference and are functional on sites even tens of thousands of base pairs (bp) away from the TSS. The cyclic-AMP response element (CRE) binding protein (CREB) binds preferentially to a palindromic sequence (TGACGTCA), known as the canonical CRE, and also to other CRE variants. CREB can activate transcription at CREs thousands of bp away from the TSS, but in mammals CREs are found far more frequently within 1 to 150 bp upstream of the TSS than in any other region. This property is termed positional bias. The strength of CREB binding to DNA depends on the sequence of the CRE motif. The central CpG dinucleotide in the canonical CRE (TGACGTCA) is critical for strong binding of CREB dimers. Methylation of the cytosine in the CpG can inhibit binding of CREB. Deamination of the methylated cytosine causes a C to T transition, resulting in a functional but lower-affinity CRE variant, TGATGTCA. We performed genome-wide surveys of CREs in a number of species (from worm to human) and showed that only vertebrates exhibited a CRE positional bias. We performed pair-wise comparisons of human CREs with orthologous sequences in mouse, rat and dog genomes and found that canonical and TGATGTCA variant CREs are highly conserved in mammals. However, when orthologous sequences differ, canonical CREs in human are most frequently TGATGTCA in the other species and vice versa. We have identified 207 human CREs showing such differences. Our data suggest that the positional bias of CREs likely evolved after the separation of Urochordata and Vertebrata. Although many canonical CREs are conserved among mammals, there are a number of orthologous genes that have canonical CREs in one species but the TGATGTCA variant in another. These differences are likely due to deamination of the methylated cytosines in the CpG and may contribute to differential transcriptional regulation among orthologous genes.
    BMC Evolutionary Biology 02/2007; 7(Suppl 1):S15. DOI:10.1186/1471-2148-7-S1-S15
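The sequence properties described in this abstract can be checked with a short sketch: that the canonical CRE equals its own reverse complement, that the TGATGTCA variant differs from it by a single C to T change, and how occurrences within a fixed window upstream of the TSS might be counted. The helper names and the convention that `upstream` is a 5'-to-3' string ending immediately before the TSS are assumptions made for illustration, not the survey's actual pipeline.

```python
CANONICAL_CRE = "TGACGTCA"
VARIANT_CRE = "TGATGTCA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_palindromic(seq):
    """A DNA palindrome reads the same on both strands."""
    return seq == reverse_complement(seq)


def differences(a, b):
    """List (index, base_in_a, base_in_b) for each mismatching position."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def count_in_window(upstream, motifs, window=150):
    """Count motif occurrences on either strand in the last `window` bp
    of `upstream` (assumed to end immediately before the TSS)."""
    if window <= 0:
        return 0
    region = upstream.upper()[-window:]
    total = 0
    for motif in motifs:
        # A set avoids double counting when the motif is palindromic.
        for form in {motif, reverse_complement(motif)}:
            k = len(form)
            total += sum(
                1 for i in range(len(region) - k + 1) if region[i:i + k] == form
            )
    return total


if __name__ == "__main__":
    print(is_palindromic(CANONICAL_CRE), is_palindromic(VARIANT_CRE))
    print(differences(CANONICAL_CRE, VARIANT_CRE))
    toy_upstream = "A" * 200 + CANONICAL_CRE + "A" * 50  # made-up sequence
    print(count_in_window(toy_upstream, [CANONICAL_CRE, VARIANT_CRE]))
```

Comparing counts from the 150 bp window with counts from windows farther upstream would be one simple way to look for the positional bias the abstract describes, though a real analysis would also need to normalize for region length and base composition.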