MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices

Department of Molecular, Cellular, and Developmental Biology, University of Colorado, Boulder 80309-0347, USA.
Computer applications in the biosciences: CABIOS 11/1995; 11(5):563-6. DOI: 10.1093/bioinformatics/11.5.563
Source: PubMed


The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, has been developed. MATRIX SEARCH, a program that finds potential transcription factor binding sites in DNA sequences using the IMD, accompanies the database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. It allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is written in C, and the program is available for Unix platforms.
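
As a rough illustration of what scanning a DNA sequence with a weight matrix involves, the Python sketch below slides a matrix along a sequence and reports every window whose summed score reaches a threshold. It is not the MATRIX SEARCH implementation (which is written in C), and it does not use the IMD format: the matrix values, the names `EXAMPLE_MATRIX`, `score_window` and `scan_sequence`, and the threshold are all assumptions made for this example.

```python
BASES = "ACGT"

# Hypothetical 4-position weight matrix favouring the site "TATA".
# One dict per position; the scores are illustrative, not taken from the IMD.
EXAMPLE_MATRIX = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
]


def score_window(matrix, window):
    """Sum the per-position scores of the bases in one window."""
    return sum(column[base] for column, base in zip(matrix, window))


def scan_sequence(matrix, sequence, threshold):
    """Return (start, site, score) for each window scoring >= threshold.

    Start positions are 0-based. Windows containing characters other
    than A, C, G or T (for example N) are skipped.
    """
    width = len(matrix)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue
        score = score_window(matrix, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    for start, site, score in scan_sequence(EXAMPLE_MATRIX, "GGTATACCNTATA", 6):
        print(start, site, score)
```

A real scanner would normally also search the reverse complement strand and loop over every matrix in the database; both are omitted here to keep the sketch short.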



Available from: Gary D Stormo, Oct 10, 2015
    • "In addition, a great variety of computer programs have been developed in order to match pre-compiled motifs within genomic landscapes (12,22–26). These computational approaches effectively scan PSSMs along large genomic sequences in order to locate putative cis-regulatory elements, those with alignment scores above some arbitrary threshold. "
    ABSTRACT: Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare, a structural alignment method for protein-DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein-DNA interfaces is significantly more conserved than their sequence, and we measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments.
    Nucleic Acids Research 12/2012; 41(3). DOI:10.1093/nar/gks1301 · 9.11 Impact Factor
    • "In spite of the ubiquity of pseudocounts, there is no standard way of choosing them. Many types of pseudocounts are used, 0.01 (4), 1 (5), 1.5 (6), 2 (7), 4 (8) and the square root of the number of binding sites in the sample (9). (In this article, when we speak of a pseudocount of, e.g. "
    ABSTRACT: To represent the sequence specificity of transcription factors, the position weight matrix (PWM) is widely used. In most cases, each element is defined as a log likelihood ratio of a base appearing at a certain position, which is estimated from a finite number of known binding sites. To avoid bias due to this small sample size, a certain numeric value, called a pseudocount, is usually allocated for each position, and its fraction according to the background base composition is added to each element. So far, there has been no consensus on the optimal pseudocount value. In this study, we simulated the sampling process by artificially generating binding sites based on observed nucleotide frequencies in a public PWM database, and then the generated matrix with an added pseudocount value was compared to the original frequency matrix using various measures. Although the results were somewhat different between measures, in many cases, we could find an optimal pseudocount value for each matrix. These optimal values are independent of the sample size and are clearly correlated with the entropy of the original matrices, meaning that larger pseudocount values are preferable for less conserved binding sites. As a simple representative, we suggest the value of 0.8 for practical uses.
    Nucleic Acids Research 02/2009; 37(3):939-44. DOI:10.1093/nar/gkn1019 · 9.11 Impact Factor
    • "To abstract a motif from the sequences found by the modified SELEX process, we need a computational method: a supervised algorithm, trained on a set of binding sites identified directly by experimental measurements [23,24,9]. We will compare different supervised methods for extraction of parameters and use CAP targets as a benchmark. "
    ABSTRACT: Characterizing transcription factor binding motifs is a common bioinformatics task. For transcription factors with variable binding sites, we need many suboptimal binding sites in the training dataset to obtain accurate estimates of the free energy penalties for deviating from the consensus DNA sequence. One procedure for doing so involves a modified SELEX (Systematic Evolution of Ligands by Exponential Enrichment) method designed to produce many such sequences. We analyzed low stringency SELEX data for E. coli Catabolic Activator Protein (CAP), and we show here that appropriate quantitative analysis improves our ability to predict in vitro affinity. To obtain the large number of sequences required for this analysis, we used a SELEX SAGE protocol developed by Roulet et al. The sequences obtained were subjected to bioinformatic analysis. The resulting model characterizes the sequence specificity of the protein more accurately than previous analyses based on only the few known binding sites available in the literature. The consequences of this increase in accuracy for prediction of in vivo binding sites (and especially functional ones) in the E. coli genome are also discussed. We measured the dissociation constants of several putative CAP binding sites by EMSA (Electrophoretic Mobility Shift Assay) and compared the affinities to the bioinformatics scores provided by methods such as the weight matrix method and QPMEME (Quadratic Programming Method of Energy Matrix Estimation), trained on known binding sites as well as on the new sites from SELEX SAGE data. We also checked predicted genome sites for conservation in the related species S. typhimurium. We found that bioinformatics scores based on SELEX SAGE data do better both in predicting physical binding energies and in detecting functional sites. We think that training binding site detection algorithms on datasets from binding assays leads to better prediction. The improvements in accuracy came from the unbiased nature of the SELEX dataset rather than from the number of sites available. We believe that, with progress in short-read sequencing technology, one could use SELEX methods to characterize the binding affinities of many low specificity transcription factors.
    BMC Biotechnology 01/2009; 8(1):94. DOI:10.1186/1472-6750-8-94 · 2.03 Impact Factor
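
The pseudocount scheme described in the second abstract above (a pseudocount allocated per position, with a fraction proportional to the background base composition added to each element) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from any of the cited papers: the function name `log_odds_matrix`, the uniform default background, the base-2 logarithm and the four example sites are hypothetical choices. The default pseudocount of 0.8 is the value that abstract suggests.

```python
import math

BASES = "ACGT"


def log_odds_matrix(sites, pseudocount=0.8, background=None):
    """Build a log-odds position weight matrix from aligned binding sites.

    Each base at each position receives pseudocount * background[base]
    on top of its observed count, so bases that were never observed
    still get a finite score. A uniform background is assumed when
    none is supplied.
    """
    if background is None:
        background = {base: 0.25 for base in BASES}
    n_sites = len(sites)
    width = len(sites[0])
    matrix = []
    for pos in range(width):
        column = {}
        for base in BASES:
            count = sum(1 for site in sites if site[pos] == base)
            freq = (count + pseudocount * background[base]) / (n_sites + pseudocount)
            column[base] = math.log2(freq / background[base])
        matrix.append(column)
    return matrix


if __name__ == "__main__":
    # Hypothetical aligned sites, for illustration only.
    example_sites = ["TATA", "TATA", "TATT", "TACA"]
    for column in log_odds_matrix(example_sites):
        print({base: round(score, 2) for base, score in column.items()})
```

With a pseudocount of zero, any base missing from a column would have frequency 0 and the logarithm would be undefined, which is the small-sample problem the pseudocount is meant to avoid.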