MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices.
ABSTRACT The information matrix database (IMD), a database of weight matrices of transcription factor binding sites, is developed. MATRIX SEARCH, a program which can find potential transcription factor binding sites in DNA sequences using the IMD database, is also developed and accompanies the IMD database. MATRIX SEARCH adopts a user interface very similar to that of the SIGNAL SCAN program. MATRIX SEARCH allows the user to search an input sequence with the IMD automatically, to visualize the matrix representations of sites for particular factors, and to retrieve journal citations. The source code for MATRIX SEARCH is in the 'C' language, and the program is available for unix platforms.
Vol. 11 no. 5 1995
Qing K. Chen1, Gerald Z. Hertz and Gary D. Stormo
Advances in the understanding of gene regulation have
produced an extensive body of data on transcription factor
binding sites. This has led to the development of various
transcription factor databases, among them Ghosh's
transcription factor database (Ghosh, 1990) and
Wingender's transcription element database (Wingender,
1988). However, these databases contain considerable
redundancy. Factors are represented under different aliases,
and even identical binding sites for the same factor are
represented multiple times in each database because of
alternative site definitions. In addition, some entries
contain symbols that indicate allowable variations at
certain positions (e.g. W means A or T), yet the likelihood
of occurrence of each variation is not assessed.
Department of Molecular, Cellular, and Developmental Biology, University
of Colorado, Boulder, CO 80309-0347, USA
To whom correspondence should be addressed. Email: chenq@boulder

Programs such as SIGNAL SCAN (Prestridge and
Stormo, 1993) use these databases to find potential
transcription factor binding sites in DNA sequences.
These programs search for exact matches between the
input sequences and the database entries. This approach
has several potential pitfalls. First, if the actual
conserved pattern of a factor binding site is shorter than its
entries in the databases, a linear search using these
entries could fail to detect a substantial number
of the binding sites for this factor; conversely, a
great number of false positives could appear if a
database entry is shorter than the conserved pattern.
Second, because this type of search requires an exact match
between the input sequence and a database entry, it may
fail to report sites that fit a conserved pattern yet do
not match any of the database entries. Third, if an entry
allows for nucleotide variations at certain positions, a
linear search will report all sequences with those
variations, without assessing the quality of the matches and
without considering the possibility that some combinations
of the allowable variations are in fact not allowed.
Fourth, this method treats all database entries as
equally important. That is inadequate because some
factors might be more important as general transcription
factors, and some sequences may resemble the consensus
more closely than others. In addition, a significant number of
transcription factor binding sites have only a single entry
in the databases. The absence of additional published
studies on them suggests that they may not commonly
occur in promoters, or that some of the data
may be flawed.
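The exact-match (linear) search discussed above can be illustrated with a minimal sketch. This is not SIGNAL SCAN's implementation; the function name `consensus_hits` and the test sequence are hypothetical, and the only outside fact used is the standard IUPAC ambiguity table. It shows how an entry shorter than the conserved pattern (CGTCA versus TGACGTCA) reports more sites.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_hits(sequence, consensus):
    """Return 1-based start positions where the consensus matches exactly."""
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    # The lookahead lets overlapping matches be reported as well.
    return [m.start() + 1
            for m in re.finditer("(?=" + pattern + ")", sequence.upper())]

seq = "AATGACGTCATTCGTCAGG"  # hypothetical sequence, for illustration only
print(consensus_hits(seq, "TGACGTCA"))  # [3]
print(consensus_hits(seq, "CGTCA"))     # [6, 13]
```

Every hit is reported with equal standing; nothing in this kind of search says whether the second CGTCA hit is a weaker candidate than the first.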
Another approach to searching for transcription factor
binding sites is to develop matrix representations of
these sites. This approach is based on the observation that
the binding sites for transcription factors are conserved,
though somewhat ambiguously, due to functional
constraints. DNA sequences known to interact
with a particular factor are therefore aligned, and statistical
methods such as log-likelihood statistics (Hertz et al.,
1990) are used to identify the best alignment, which is the
one least likely to occur by chance. The conserved patterns
thus determined are stored in matrices (Figure 1), and
these matrices can be used to assess the probability that
any DNA sequence contains binding sites for the
corresponding transcription factors.
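As a minimal sketch of how an alignment is summarized in a matrix, the following counts how often each base occurs at each position of a set of already aligned sites. The function name `count_matrix` and the four example sites are hypothetical and are not taken from the IMD database.

```python
from collections import Counter

BASES = "ACGT"

def count_matrix(aligned_sites):
    """Return {base: [count at each position]} for equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    # One Counter per alignment column; missing bases count as 0.
    columns = [Counter(site[i] for site in aligned_sites) for i in range(length)]
    return {b: [col[b] for col in columns] for b in BASES}

sites = ["TGACTCA", "TGAGTCA", "TGACTAA", "TTACTCA"]  # hypothetical sites
matrix = count_matrix(sites)
for base in BASES:
    print(base, matrix[base])
```

Each column of the result sums to the number of contributing sequences, which is the quantity N used in the formulas of the following sections.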
Bucher has developed four such matrices for the TATA-box,
CCAAT-box, GC-box, and Cap signal (Bucher, 1990), and
32 matrices for transcription factors are also
supplied with Wingender's database (Wingender, 1988).
However, an extensive transcription factor matrix database
with a user-friendly program to scan DNA sequences
is not available. In this paper, we describe the development
of an information matrix database (IMD) and the
program MATRIX SEARCH, which scans an input DNA
segment for putative transcription factor binding sites.
In developing IMD, care was taken to exclude
multiple representations of the same sites, and we tried to
group all entries for the same factor together, even
when they appear under different aliases. The programs
wconsensus (Hertz and Stormo, 1994) and consensus
(Hertz et al., 1990) were used to generate optimal weight
matrices for each factor binding site. The resulting
matrices are organized into an indexed database, which
the program MATRIX SEARCH can search easily.
The MATRIX SEARCH program
addresses the problems mentioned above and can thus be
used to supplement databases and search programs based
on exact sequence matches.

Fig. 1. An example of a matrix summarizing a DNA sequence alignment.
On the top is an alignment of four 6-mers. Below is a matrix containing the
number of times that the indicated letter is observed at the indicated
position of this alignment.

© Oxford University Press
The IMD database
The original data on transcription factor binding sites are
taken from Ghosh's database (Ghosh, 1990), from
Wingender's database (Wingender, 1988) or, whenever
necessary, directly from the original references. The
sites are organized into seven classes according to the
organism containing the corresponding transcription
factor: mammalian, avian, amphibian, insect, plant,
yeast and prokaryote.
The aliases of the factors are determined according to
either the article by Faisst and Meyer (1992) or the
information supplied in the factor.dat file of Wingender's
database. Sites for the same factors are grouped
together and used to generate weight matrices.
The programs used to generate the matrices are the
consensus and wconsensus programs developed in our
laboratory. Both programs identify consensus patterns
in unaligned DNA sequences, with one difference:
wconsensus determines the length of the
conserved pattern automatically, whereas consensus
requires the user to supply the length. The goal of
these programs is to identify a matrix that describes the
common sequence pattern shared by N sequences known
to bind a specific transcription factor. The matrix with the
maximum information content, as described by the
following formula, is selected as the matrix least
likely to occur by chance.

$$I = \sum_{i=1}^{L} \sum_{b \in \{A,C,G,T\}} \frac{N_{bi}}{N} \log \frac{N_{bi}/N}{p_b} \qquad (1)$$

I represents the information content of the matrix; L is the
length of the pattern; b refers to the row of the matrix (i.e.
the bases A, C, G, T); i refers to the position in the pattern;
$N_{bi}$ refers to the number of occurrences of base b at position i;
N is the total number of sequences contributing to the matrix; and
$p_b$ is the genomic frequency of base b.
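The information content defined by these symbols can be computed as in the sketch below. It is an illustration rather than the consensus or wconsensus code: the function name `information_content` is hypothetical, the natural logarithm and a uniform background ($p_b$ = 0.25) are assumptions, and terms with $N_{bi}$ = 0 are taken to contribute nothing.

```python
import math

def information_content(counts, background=None):
    """Information content of a count matrix {base: [counts per position]}."""
    bases = sorted(counts)
    if background is None:
        # Assumption: equal genomic frequencies for all bases.
        background = {b: 1.0 / len(bases) for b in bases}
    length = len(counts[bases[0]])
    total = 0.0
    for i in range(length):
        n = sum(counts[b][i] for b in bases)  # N: sequences in this column
        for b in bases:
            f = counts[b][i] / n
            if f > 0:  # a base never observed contributes 0
                total += f * math.log(f / background[b])
    return total

conserved = {"A": [4, 0], "C": [0, 4], "G": [0, 0], "T": [0, 0]}
flat = {"A": [1, 1], "C": [1, 1], "G": [1, 1], "T": [1, 1]}
print(information_content(conserved))  # 2 * ln(4), about 2.77
print(information_content(flat))       # 0.0
```

A fully conserved column contributes the most and a column matching the background contributes nothing, which is why the highest-scoring alignment is the one least expected by chance.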
After selection of the optimal matrices for the
transcription factors, a unique IMD number is assigned to
each matrix, and journal citations for the
corresponding factors are provided. The seven classes of
matrices are then organized into an indexed database of
information matrices. Altogether, the current IMD
database has a total of 352 matrices: 204 mammalian, 27
avian, 8 amphibian, 40 insect, 17 plant, 39 yeast, and 17
prokaryotic.
Factors with a single binding site entry in the combined
Ghosh and Wingender databases are not represented
in the IMD database. We exclude single
entries because (i) there is a higher probability that they
give flawed information, owing to the lack of independent
confirmation; (ii) they may not bind important general
transcription factors, unless future data suggest otherwise;
and (iii) they can already be searched with SIGNAL SCAN
or other methods.
The MATRIX SEARCH program
The MATRIX SEARCH program can be used to score an
input sequence against matrices in IMD. It is based on the
PATSER program previously developed in our
laboratory. MATRIX SEARCH uses the following formula to
determine the scores:

$$S = \sum_{i=1}^{L} \log \frac{(N_{bi} + 0.01)/(N + 0.04)}{p_b} \qquad (2)$$

where L, i, $N_{bi}$, N and $p_b$ are as defined in formula (1), and b is
the base observed at position i of the sequence being scored. S
measures the log-likelihood ratio of a match, such that a
higher S represents a better match to the conserved
pattern described by the matrix. We add 0.01 in this
formula instead of adding 1 as in the original formula
(Hertz et al., 1990) so that the scoring discriminates more
strongly against sequences with bases that do not
occur in the available data. Cutoff scores for the matrices
are determined so that a matrix detects as many true
signals (as represented by the contributing entries) as
possible while excluding as many false positives as possible.
Across all matrices, roughly 75-100% of the contributing
entries scored above the cutoffs. For the few matrices that give
many false positive detections, the user is given the choice
of whether to include them in the search. To further reduce
false positives, the program disallows overlapping
sites for the same factor: should a factor have
overlapping sites above the cutoff score, only the one with the
highest S score is reported.
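The scoring, cutoff and overlap steps just described can be sketched as follows. This is not the PATSER or MATRIX SEARCH source: all function names, the example matrix, the uniform background and the cutoff value are hypothetical, the natural logarithm and a denominator of N + 4 x 0.01 are assumptions, and the greedy highest-score-first rule is only one reading of the overlap rule. Input is assumed to contain only A, C, G and T.

```python
import math

PSEUDO = 0.01  # small count added to each cell, as described in the text

def score_window(window, counts, background):
    """Log-likelihood ratio score S of one window against a count matrix."""
    n = sum(counts[b][0] for b in counts)  # N, taken from the first column
    s = 0.0
    for i, base in enumerate(window):
        freq = (counts[base][i] + PSEUDO) / (n + PSEUDO * len(counts))
        s += math.log(freq / background[base])
    return s

def scan(sequence, counts, background, cutoff):
    """Return [(start, score)] for windows scoring >= cutoff (1-based starts)."""
    width = len(next(iter(counts.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score_window(sequence[start:start + width], counts, background)
        if s >= cutoff:
            hits.append((start + 1, s))
    return hits

def drop_overlaps(hits, width):
    """Among overlapping hits keep the highest-scoring one (greedy)."""
    kept = []
    for start, s in sorted(hits, key=lambda h: -h[1]):
        if all(abs(start - k) >= width for k, _ in kept):
            kept.append((start, s))
    return sorted(kept)

counts = {  # hypothetical count matrix built from four 7-mers
    "A": [0, 0, 4, 0, 0, 1, 4], "C": [0, 0, 0, 3, 0, 3, 0],
    "G": [0, 3, 0, 1, 0, 0, 0], "T": [4, 1, 0, 0, 4, 0, 0],
}
background = {b: 0.25 for b in "ACGT"}  # assumed uniform base frequencies
hits = scan("GGTGACTCAGG", counts, background, cutoff=0.0)
print(drop_overlaps(hits, width=7))
```

With the 0.01 pseudocount, a single base never seen in the contributing sites costs about ln(0.01) relative to a neutral position, so such windows fall well below any reasonable cutoff.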
Match ratios, defined as the ratio of the S score of a particular
alignment to the S score of the best-scoring alignment for that
matrix, are also determined. The higher the match ratio, the better the
match.
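A minimal sketch of the match ratio follows, assuming the best attainable S is obtained by taking the highest-weight base at every position. The function names, the example matrix and the uniform background are hypothetical, and the weights use the same assumed pseudocount form (0.01 per cell, natural logarithm) as the score.

```python
import math

def weight(count, n, p, pseudo=0.01, n_bases=4):
    """Log-likelihood weight of one matrix cell (pseudocount form assumed)."""
    return math.log((count + pseudo) / (n + pseudo * n_bases) / p)

def score(site, counts, background):
    """S score of a site: sum of the weights of its bases."""
    n = sum(counts[b][0] for b in counts)
    return sum(weight(counts[base][i], n, background[base])
               for i, base in enumerate(site))

def best_score(counts, background):
    """Highest S any sequence can reach: best base at every position."""
    n = sum(counts[b][0] for b in counts)
    width = len(next(iter(counts.values())))
    return sum(max(weight(counts[b][i], n, background[b]) for b in counts)
               for i in range(width))

def match_ratio(site, counts, background):
    return score(site, counts, background) / best_score(counts, background)

counts = {  # hypothetical count matrix
    "A": [0, 0, 4, 0, 0, 1, 4], "C": [0, 0, 0, 3, 0, 3, 0],
    "G": [0, 3, 0, 1, 0, 0, 0], "T": [4, 1, 0, 0, 4, 0, 0],
}
background = {b: 0.25 for b in "ACGT"}
print(match_ratio("TGACTCA", counts, background))  # best site
print(match_ratio("TGAGTCA", counts, background))  # one weaker position
```

Dividing by the best score puts matrices of different lengths and information content on a comparable scale, so a cutoff can be quoted as a ratio.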
MATRIX SEARCH is a user-friendly program. Its user
interface is very similar to that of the SIGNAL SCAN
program developed by Dr Dan Prestridge (Prestridge and
Stormo, 1993). The input is the DNA sequence to be
analysed, and the output is a list of putative sites found on
that DNA segment. The output contains the factor names,
the starting positions of the sites, the strand on which each
site lies, the match ratio, and the IMD numbers, which can be
used to get the composition of individual matrices and reference
citations.
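Because the output reports the strand on which each site lies, a scan has to consider both strands. The sketch below shows one way to do this by scoring each window and its reverse complement. The names `scan_both_strands` and `toy_score` are hypothetical, and the convention of reporting minus-strand hits by the start of the window on the input strand is an assumption, not a statement about the MATRIX SEARCH output format.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_both_strands(sequence, width, score_fn, cutoff):
    """Return [(start, strand, score)] with 1-based starts on the input strand."""
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            s = score_fn(site)
            if s >= cutoff:
                hits.append((start + 1, strand, s))
    return hits

def toy_score(site):
    """Hypothetical stand-in for a matrix score: 1.0 for TGACTCA, else 0.0."""
    return 1.0 if site == "TGACTCA" else 0.0

print(scan_both_strands("CCTGAGTCACC", 7, toy_score, cutoff=0.5))
# [(3, '-', 1.0)]
```

Any scoring function that takes a site string, such as a weight-matrix score, can be passed in place of `toy_score`.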
The major limitation of the IMD database and the
MATRIX SEARCH program is that, because we have set
the cutoff scores stringently, some real sites could escape
detection. In addition, the program does not detect sites for
factors with single entries in the combined Ghosh and
Wingender databases, since there are no matrices for them.
Other programs, such as SIGNAL SCAN, should be used
in conjunction with MATRIX SEARCH if desired.
MATRIX SEARCH analysis
The analysis of sequences using a database of
transcriptional elements has led to the discovery of previously
unknown functional transcriptional elements in DNA
sequences (Prestridge and Stormo, 1993). However, any
sequence scan reports many putative transcriptional elements,
most of which are probably irrelevant. To
demonstrate the advantages and limitations of the matrix
search approach in comparison with the linear search
approach, the MATRIX SEARCH and SIGNAL
SCAN programs were used to scan a well-characterized
DNA segment, the SV40 transcription/replication region
(Lebowitz and Ghosh, 1982). Both programs find almost
all of the 30 confirmed sites known to us, including sites for
AP-1, AP-4, Sp1, Oct-1/NFIII, NF-kB, T-Ag, etc.
Both failed to report a couple of weak
sites, and MATRIX SEARCH failed to report the LSF
site because it has only a single entry in the combined
database and is thus not represented in the IMD database.
Many more sites are reported by both programs.
MATRIX SEARCH reported a total of 140 sites, whereas
SIGNAL SCAN reported a total of 360 sites. The number
of sites reported by SIGNAL SCAN can be reduced to 180
after taking into consideration the redundancy of the
SIGNAL SCAN database. As an example of that
redundancy, the 3 AP-3 sites were reported a total
of 32 times under different names. Many of the
unconfirmed sites reported by both programs are likely
to be false positives. One reason is that flanking
sequences, though not conserved at the level of primary
sequence, are also important in interacting with the
binding factors; beyond that, both programs report additional
false positives because of their intrinsic limitations.
Fig. 2. Examples of potential false positives detected by SIGNAL SCAN but not MATRIX SEARCH, within the SV40 transcription/replication control
region. CUTOFF: the match ratios of the cutoff score for corresponding matrices.
Fig. 3. Alignment weight matrix for AP-3.

Figure 2A shows a few examples of the potential false
positives reported by SIGNAL SCAN but not by MATRIX
SEARCH. The unconfirmed CREB site at position 14 is
picked up because it is an exact match to the entry
CGTCA, although that entry is three nucleotides shorter than the
CREB consensus TGACGTCA. The AP-1 site reported at
position 589, GAGAGGA, does not resemble the AP-1
consensus at all. Checking the reference for that entry (Distel
et al., 1987) reveals that the AP-1 consensus TGACTCA is
on the same DNA segment used to perform the binding
assay. Therefore, the authors may have made a mistake in
assessing the precise AP-1 binding site. The NF-IL6 site at
position 41 is reported by SIGNAL SCAN because it matches the
consensus TKNNGNAAK. Consensus patterns with this
type of representation only indicate that variations at certain
positions are allowed, but fail to exclude
disallowed combinations. These types of false positives
can be avoided by using the matrix search approach.
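To give a sense of how permissive a degenerate consensus is, the short sketch below counts how many distinct sequences a consensus string accepts. The function name `n_matching_sequences` is hypothetical; the only outside fact used is the standard IUPAC ambiguity table (K = G or T, N = any base).

```python
import math

# Number of bases each standard IUPAC code stands for.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

def n_matching_sequences(consensus):
    """Count the distinct sequences that match a consensus exactly."""
    return math.prod(DEGENERACY[c] for c in consensus.upper())

print(n_matching_sequences("TKNNGNAAK"))  # 256
print(n_matching_sequences("TGACGTCA"))   # 1
```

All 256 sequences accepted by TKNNGNAAK are reported with equal standing by an exact-match search, whereas a weight matrix assigns each of them its own score.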
One advantage of MATRIX SEARCH is that it can
detect sites that fit into a conserved pattern, yet do not
match exactly any of the known sites. However, for
matrices with low information content, such as that of
AP-3 (Figure 3) in which variations at multiple positions
are allowed, this can increase the detection of false
positives. For example, searching the SV40 region for
AP-3 sites with SIGNAL SCAN yielded five potential
false positives, whereas MATRIX SEARCH detected
In conclusion, both the MATRIX SEARCH and the
SIGNAL SCAN programs are capable of detecting the
vast majority of known transcription factor binding sites
on DNA segments. They complement each other in
reducing false positive detections. The main advantages
of MATRIX SEARCH over most other site recognition
programs are that it can detect a protein binding site that
is not an exact match to any published site for that protein,
it reduces false positive detections, its scores indicate the
relative strength of sites, and it reduces the complexity and
confusion caused by the redundancy of database entries.
As for SIGNAL SCAN, the best uses of MATRIX SEARCH are:
(i) to find candidate binding protein for a known site
found by an investigator;
(ii) to scan a promoter sequence to identify possible
(iii) to locate a regulatory element within a DNA
sequence if there is evidence for its existence. (Prestridge
and Stormo, 1993.)
MATRIX SEARCH is currently available for Unix
platforms by anonymous ftp to beagle.colorado.edu (cd
pub, get imd.tar). An IBM-compatible PC version is
currently under development in collaboration with Dr
Dan Prestridge. Dr Prestridge is also currently developing
a program that combines MATRIX SEARCH and
SIGNAL SCAN. For more information on
MATRIX SEARCH or on how to obtain a copy of the
program, send a request to firstname.lastname@example.org,
or to email@example.com if you are also interested in the
SIGNAL SCAN program.
We would like to thank Dr Dan Prestridge of the University of Minnesota for
his suggestions and for the source code for the user interface. We would also like
to thank Dr David Ghosh for suggestions and his database, and Dr Edgar
Wingender for his database. This research is supported by NIH grants
7F32HG00124 and HG00249.
Ghosh,D. (1990) A relational database of transcription factors. Nucleic
Acids Res., 18, 1749-1756.
Wingender,E. (1988) Compilation of transcription regulating proteins.
Nucleic Acids Res., 16, 1879-1902.
Prestridge,D. and Stormo,G.D. (1993) SIGNAL SCAN 3.0: new
database and program features. Comput. Applic. Biosci., 7, 203-206.
Hertz,G.Z., Hartzell,G.W. and Stormo,G.D. (1990) Identification of
consensus patterns in unaligned DNA sequences known to be
functionally related. Comput. Applic. Biosci., 6, 81-92.
Bucher,P. (1990) Weight matrix description of four eukaryotic RNA
polymerase II promoter elements derived from 502 unrelated promoter
sequences. J. Mol. Biol., 212, 563-578.
Hertz,G.Z. and Stormo,G.D. (1994) In The Third International
Conference on Bioinformatics and Genome Research (Lim,H.A. and
Cantor,C.R. eds), World-Wide Web
Faisst,S. and Meyer,S. (1992) Compilation of vertebrate-encoded
transcription factors. Nucleic Acids Res., 20, 3-26.
Lebowitz,J. and Ghosh,P.K. (1982) Initiation and regulation of Simian
Virus 40 early transcription in vitro. J. Virol., 41, 449-461.
Distel,R.J., Ro,H., Rosen,B.S., Groves,D.L. and Spiegelman,B.M.
(1987) Nucleoprotein complexes that regulate gene expression in
adipocyte differentiation: direct participation of c-fos. Cell, 49, 835-
Received on May 1, 1995; accepted on May 24, 1995