Conference Paper

# Selection of putative cis-regulatory motifs through regional and global conservation

Nat. Res. Council of Canada, Ottawa, Ont., Canada

DOI: 10.1109/CSB.2004.1332545; Conference: Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE; Source: IEEE Xplore


**ABSTRACT:** In biological sequence research, the positional weight matrix (PWM) is often used to search for putative transcription factor binding sites. A log-odds score is usually applied to measure the closeness of a subsequence to the PWM. However, the log-odds score depends on motif length, so no universally applicable threshold exists. In this paper, we propose an alternative scoring index (G) that varies from zero, where the subsequence is not much different from the background, to one, where the subsequence fits the PWM best. We also propose a measure evaluating the statistical expectation at each G index. We investigated the PWMs from TRANSFAC and found that the statistical expectation is significantly (p < 0.0001) correlated with both the length of the PWMs and the threshold G value. We applied this method to two PWMs (GCN4_C and ROX1_Q6) of yeast transcription factor binding sites and two PWMs (HIC1-02, HIC1_03) of the human tumor suppressor (HIC-1) binding sites from the TRANSFAC database. Finally, our method compares favorably with the broadly used Match method. The results indicate that our method is more flexible and can provide better confidence.

Lecture Notes in Engineering and Computer Science. 01/2008;
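For reference, the log-odds score mentioned in the abstract is commonly written as below. This is the standard textbook form, not a formula quoted from the paper, and the notation is assumed here for illustration: $w$ is the motif length, $f_{i,b}$ is the PWM frequency of base $b$ in column $i$, $p_b$ is the background frequency of base $b$, and $b_1 \dots b_w$ is the subsequence being scored.

$$
S(b_1 \dots b_w) = \sum_{i=1}^{w} \log_2 \frac{f_{i,b_i}}{p_{b_i}},
\qquad
S_{\max} = \sum_{i=1}^{w} \max_{b \in \{A,C,G,T\}} \log_2 \frac{f_{i,b}}{p_b}
$$

Each column contributes a non-negative term to $S_{\max}$, so the attainable score range widens as $w$ grows. A cutoff tuned for a short PWM therefore does not carry over to a longer one, which is the length dependence the abstract refers to.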

**ABSTRACT:** In biological sequence research, the positional weight matrix (PWM) is often used to search for putative transcription factor binding sites. A set of experimentally verified oligonucleotides known to be functional motifs is collected and aligned. The frequency of each nucleotide (A, C, G, or T) in each column of the alignment is recorded in the matrix. Once a PWM is constructed, it can be used to search a nucleotide sequence for subsequences that possibly perform the same function. The match between a subsequence and a PWM is usually described by a score function, which measures the closeness of the subsequence to the PWM as compared with the given background. Nevertheless, the score function usually depends on motif length, so no universally applicable threshold exists. In this paper, we propose an alternative scoring index (G) that varies from zero, where the subsequence is not much different from the background, to one, where the subsequence fits the PWM best. We also propose a measure evaluating the statistical expectation at each G index. We investigated the PWMs from TRANSFAC and found that the statistical expectation is significantly (p < 0.0001) correlated with both the length of the PWMs and the threshold G value. We applied this method to two PWMs (GCN4_C and ROX1_Q6) of yeast transcription factor binding sites and two PWMs (HIC1-02, HIC1_03) of the human tumor suppressor (HIC-1) binding sites from the TRANSFAC database. Finally, our method compares favorably with the broadly used Match method. The results indicate that our method is more flexible and can provide better confidence.

Engineering Letters 01/2008; 16(4):498-504.
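The construction and search steps described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not define the G index, so the 0-to-1 value computed here is a generic min-max rescaling of the log-odds score, used only as a stand-in. The function names, the pseudocount, the uniform background, and the toy aligned sites are all hypothetical.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Column-wise base frequencies from equal-length aligned sites.

    The pseudocount is an assumption added here to avoid zero frequencies.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for i in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm

def log_odds(pwm, subseq, background):
    """Sum of per-column log2(PWM frequency / background frequency)."""
    return sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, subseq))

def score_bounds(pwm, background):
    """Lowest and highest log-odds score any subsequence can reach."""
    lowest = sum(min(math.log2(col[b] / background[b]) for b in BASES) for col in pwm)
    highest = sum(max(math.log2(col[b] / background[b]) for b in BASES) for col in pwm)
    return lowest, highest

def scan(pwm, sequence, background, threshold):
    """Return (start, window, scaled score) for windows at or above threshold.

    The scaled score is a min-max rescaling to [0, 1]; it is NOT the paper's G.
    """
    lowest, highest = score_bounds(pwm, background)
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scaled = (log_odds(pwm, window, background) - lowest) / (highest - lowest)
        if scaled >= threshold:
            hits.append((start, window, scaled))
    return hits

# Toy, made-up aligned sites (not taken from TRANSFAC).
toy_sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTCT"]
uniform = {base: 0.25 for base in BASES}
toy_pwm = build_pwm(toy_sites)
hits = scan(toy_pwm, "CCTGACTCAGG", uniform, threshold=0.9)
```

Rescaling by the per-PWM minimum and maximum puts motifs of different lengths on the same 0-to-1 scale, which is the kind of length-independent threshold the abstract argues for. Whether this matches the paper's definition of G cannot be determined from the abstract alone.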
