Search of regular sequences in promoters from eukaryotic genomes.
ABSTRACT In this paper, the notion of "regularity" is introduced to describe the structural features of DNA sequences. This notion expands the "latent periodicity" term. The novel method for revealing regularity based on the runs test is described. The search of regular sequences in eukaryotic promoters has shown that more than 60% of them possess a regularity property on statistically significant level. Possible biological functions of regularity are discussed together with the possibility of using this characteristic for performing promoter annotation.

Computational Biology and Chemistry 33 (2009) 196–204
Research Article
Search of regular sequences in promoters from eukaryotic genomes
Andrew Shelenkov∗, Eugene Korotkov
Bioengineering Centre of Russian Academy of Sciences, 117312 Moscow, Pr-t 60-tya Oktyabrya, 7/1, Russian Federation
Article info
Article history:
Received 26 June 2008
Received in revised form 8 February 2009
Accepted 18 March 2009
Keywords:
DNA sequence analysis
Periodicity
Promoters
Protein binding sites
Regularity
Abstract
In this paper, the notion of “regularity” is introduced to describe the structural features of DNA sequences.
This notion expands the “latent periodicity” term. The novel method for revealing regularity based on the
runs test is described. The search of regular sequences in eukaryotic promoters has shown that more than
60% of them possess a regularity property on statistically significant level. Possible biological functions of
regularity are discussed together with the possibility of using this characteristic for performing promoter
annotation.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Wide-scale analysis of DNA sequences from various genomes, including the human one, is currently under way. One of the most important goals of this analysis is to characterize and determine the functions of various genes. During the last decade, several reliable methods of predicting protein-coding DNA regions have been proposed (Claverie, 1997). However, the prediction of regulatory regions, in particular promoters, remains a challenging task, although different methods have also been proposed for it (Bajic et al., 2002; Davuluri et al., 2001; Ohler et al., 2002). A promoter is a genome region located near the site of transcription initiation that plays the key role in genetic regulation (Pedersen et al., 1999). Promoters receive signals from various sources (e.g., from cell receptors) and control the level of transcription initiation, which to a great extent determines gene expression (Dieterich et al., 2005). Thus, identifying promoters is an important step in gene annotation.
To distinguish genome regions that contain promoters from those that do not (the latter, obviously, being the majority), a large set of characteristics has been used, for example, CpG islands (Davuluri et al., 2001; Bajic and Seah, 2003), TATA-boxes (Ohler et al., 2002; Knudsen, 1999), CAAT-boxes (Ohler et al., 2002; Knudsen, 1999), some typical transcription factor binding sites (Ohler et al., 2002; Knudsen, 1999; Solovyev and Shahmuradov, 2003), pentamer matrices (Bajic et al., 2002), oligonucleotides (Scherf et al., 2000), and combined approaches (Xie et al., 2006). In addition, pattern recognition techniques have been used, such as neural networks (Ohler et al., 2002; Bajic and Seah, 2003; Knudsen, 1999), linear and discriminant analysis (Davuluri et al., 2001; Solovyev and Shahmuradov, 2003), interpolated Markov models (Ohler et al., 2002), independent component analysis (Matsuyama and Kawamura, 2004) and some other methods (Gershenzon et al., 2006). However, the analysis of experimental data has shown (Bajic et al., 2004) that the issue of choosing statistically significant biological signals to be used in promoter prediction programs still remains unsolved. There is no characteristic which describes the whole variety of promoters, and each of the features revealed during examination of promoter sequences has its own usage limitations (Xie et al., 2006). Therefore, there is a need for new characteristics which would be specific to promoters but at the same time flexible enough to cover the variety of their types.
∗Corresponding author. Tel.: +7 499 135 2161; fax: +7 499 135 0571.
E-mail address: fallandar@gmail.com (A. Shelenkov).
The presence of periodicity and other structural elements in genetic texts has previously been revealed by various mathematical methods (Benson, 1999; Konopka, 1994; Konopka and Martindale, 1995). In this paper we introduce a new term, regularity, to characterize a sequence and propose a method for identifying regular sequences. We apply the method to search for regular sequences in promoters from various eukaryotic genomes.
The notion of regularity is an extension of the “latent periodicity” term (Korotkov et al., 2003; Laskin et al., 2005; Shelenkov et al., 2006, 2008) to the case when the positions of individual nucleotides in a period are not fixed rigidly. Regularity is a non-random nucleotide distribution inside the periods of the nucleotide sequence being examined. To reveal regularity, we divide the sequence into equal-sized intervals (or “periods”) and calculate the degree of “non-randomness” of each nucleotide distribution in these periods. We used the runs test (Hoel, 1966; Brownlee, 1965) as a criterion for estimating the non-randomness of these distributions. The method of calculating the number of runs is illustrated below. The definition of a regular sequence and the calculation of a quantitative measure of the level of sequence regularity are given in Section 2.1.
1476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compbiolchem.2009.03.001
The method described in the current paper can be used for searching for latent periodicity, and it can also reveal sequences which could not nominally be considered periodic, although their structural composition is similar to that of periodic sequences. The main advantages of the method are its comparative insensitivity to the presence of indels, no need for preliminary assignment of the period type (e.g., in the form of a frequency matrix), and, most importantly, the possibility of revealing sequence regularity when nucleotide indels and substitutions are simultaneously present in the analyzed sequence.
In Section 3 it is shown that the majority of known promoters possess a regular structure.
2. Methods and Algorithms
2.1. Regularity Definition and Its Quantitative Measure Defined Using Runs Test
Let us define regular sequences as DNA sequences that have the same (or a close) number of nucleotides of a certain type (or of all types) in each period, with the nucleotides distributed in the same way along the periods. Here the periods of length n are DNA fragments of the same length located one after another in a sequence. Let us consider the sequence S = s1, s2, ..., sL, containing characters from the alphabet Q = {a, t, c, g}. In Fig. 1, a DNA sequence is divided into periods of length n = 3 by using the symbol F. At first, we place F at positions 0 and L+1 of S. Then we count n nucleotides starting from the first position of S and place another symbol F. We repeat this process till the end of the sequence S. As a result, we obtain the sequence S′ = F s1, s2, ..., sn, F sn+1, sn+2, ..., s2n, F, ..., sL F (the length of the last period can be smaller than n). Practically, the presence of regularity means that the distribution of positions of some nucleotide along the sequence is the same as the distribution of the positions of the symbol F along this sequence. We use the runs test to define the quantitative measure of the regularity.
Fig. 1. Application of the runs test to determination of adenine distribution “non-randomness” along periods of length 3 nucleotides. The sequence under study is divided into separate periods by insertion of the symbol F after each three nucleotides starting from the zero position of the sequence. Then the sequence of categories (or codes) is formed by changing the symbol F to 0 and the symbol ‘a’ to 1, and the number of runs is calculated. A run is a sequence of identical elements that is preceded and followed by different elements or no element (the beginning or the end of the sequence). For example, the sequences of codes 000111 and 01 each contain two runs. The number of runs in a periodic sequence (B) is always greater than or equal to that in a non-periodic sequence (A). This allows using the number of runs as a periodicity measure which is almost insensitive to nucleotide indels. This is shown in (C), where ‘c’ is deleted from the 9th position of the periodic sequence. It is clear that the number of runs does not change and remains equal to 17.
The runs test (Wald–Wolfowitz test) (Hoel, 1966; Brownlee, 1965) is a non-parametric test of the hypothesis H0 that two (or, generally, more) datasets are random independent samples of sizes n1 and n2 (n1+n2=N) from the same population (universal set), i.e., that the distribution functions of these samples are the same. The test uses the number of runs in a series whose elements may generally take k different values. Let us consider the case k=2. The observations are written as the static series of the joint sample T, while the group to which each observation belongs is given by a coding variable that takes two values (0 for a value from the first sample and 1 for a value from the second one). The values of the coding variable form the series R, which is called “the sequence of categories” or “the sequence of codes”. To apply the runs test, the elements of the series T are sorted in ascending order, with the same rearrangements made simultaneously in the series R.
A run is a sequence of identical elements that is preceded and followed by different elements or no element (the beginning or the end of the sequence). The test statistic is the number of runs in the sequence of codes. If the hypothesis H0 is true, then both samples should be mixed well in the static series and thus the number of runs should be large. If the samples have been drawn from populations having different distributions (which vary in mean value or dispersion), then the number of runs will be small.
For example, if the sequence S is tcgcgcattattcaagtacc, then S′ = FtcgcgFcattaFttcaaFgtaccF for a period length of 5 nucleotides, and thus the static series for the symbol F and nucleotide ‘a’ is {0, 0, 1, 1, 0, 1, 1, 0, 1, 0}; its runs are 00, 11, 0, 11, 0, 1, 0. Here the value in the sequence of codes corresponding to the symbol F is 0, and the one corresponding to the nucleotide ‘a’ is 1. In this example the number of runs is 7.
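The construction of the sequence of codes and the run count in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function names `code_sequence` and `count_runs` are ours.

```python
def code_sequence(seq, n, nucleotide):
    """Return the 0/1 codes: 0 for every separator F, 1 for every `nucleotide`."""
    codes = [0]  # F placed before the first nucleotide (position 0)
    for pos, base in enumerate(seq, start=1):
        if base == nucleotide:
            codes.append(1)
        # F follows every n-th nucleotide and also closes the sequence
        if pos % n == 0 or pos == len(seq):
            codes.append(0)
    return codes


def count_runs(codes):
    """Count maximal blocks of identical consecutive elements."""
    if not codes:
        return 0
    return 1 + sum(1 for a, b in zip(codes, codes[1:]) if a != b)


codes = code_sequence("tcgcgcattattcaagtacc", 5, "a")
print(codes)              # [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
print(count_runs(codes))  # 7
```

All nucleotides other than the one under study are simply skipped, which is why the code sequence is shorter than S′.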
To introduce the quantitative measure for the case of nucleotide sequences, we derive four sequences of codes K(i) from the sequence S′. The sequence of codes K(i), i=1, 2, 3, 4, is derived from the sequence S′ by changing the symbol F to 0 and the symbol q(i) to 1, while all other symbols are ignored. Here it is assumed that the properties of a source sequence are represented by the properties of binary sequences (Lobzin and Chechetkin, 2000), and q(i) is a symbol from the alphabet Q defined above. Effectively, this means that we compare the samples containing the position numbers of F and q(i) in the sequence S′. We calculated the number of runs r(i) for each sequence of codes. The statistical significance of the number of runs obtained was calculated using simulation modeling (the Monte Carlo method). The symbols of the initial sequence S were randomly shuffled, then the sequence S′ was built for the newly obtained sequence together with the sequences K(i), and the numbers of runs r(i), i=1, 2, 3, 4, were determined. Repeating this P times gave P values of each number of runs r(i) for random sequences (we used P=100). After this, the mean value $\bar{r}(i)$ and the standard deviation $\sigma_r(i)$ were calculated for each of the four sets of obtained values. The formula for calculating the statistical significance Z(i) of the number of runs r(i) was:
$$Z(i) = \frac{r(i) - \bar{r}(i)}{\sigma_r(i)} \qquad (1)$$
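The Monte Carlo estimate behind formula (1) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the helper names, the fixed random seed, the use of the sample standard deviation and the convention of returning 0 when all shuffled run counts coincide are our assumptions.

```python
import random
import statistics


def runs_for(seq, n, nucleotide):
    """Number of runs in the F/nucleotide code sequence for period length n."""
    codes = [0]
    for pos, base in enumerate(seq, start=1):
        if base == nucleotide:
            codes.append(1)
        if pos % n == 0 or pos == len(seq):
            codes.append(0)
    return 1 + sum(1 for a, b in zip(codes, codes[1:]) if a != b)


def z_score(seq, n, nucleotide, shuffles=100, rng=None):
    """Formula (1): (observed runs - mean of shuffled runs) / their standard deviation."""
    rng = rng or random.Random(0)  # fixed seed is an assumption, for reproducibility
    observed = runs_for(seq, n, nucleotide)
    letters = list(seq)
    shuffled_runs = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        shuffled_runs.append(runs_for(letters, n, nucleotide))
    mean = statistics.mean(shuffled_runs)
    sd = statistics.stdev(shuffled_runs)  # sample standard deviation (assumed)
    if sd == 0:
        return 0.0  # degenerate case, e.g. the nucleotide is absent (our convention)
    return (observed - mean) / sd


# A perfectly periodic sequence should score far above a shuffled one.
print(z_score("attcg" * 80, 5, "a"))
```

Because the value depends on the random shuffles, repeated calls with different seeds give slightly different Z values.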
It should be noted that the runs test compares samples reliably only when their sizes are comparable (Sheskin, 2000), i.e., the number of symbols F should be approximately equal to the number of nucleotides of each type. If the number of nucleotides of some type is much larger or smaller (by more than a factor of two) than the number of symbols F, then the runs test may fail to reveal the sequence regularity. Let us modify the example given above. Let the sequence S = tcgcgaaaattaaacaaaag, S′ = FtcgcgFaaaatFtaaacFaaaagF. In this case the static series for the nucleotide ‘a’ is {0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0}. This example shows that the number of runs (still 7) does not increase when the number of symbols in each period increases. For long periods (>4) this can cause some regular sequences to be missed.
To solve this problem, we inserted additional symbols F into the sequence S′. In this case every period of the sequence S is divided in the same way into parts (or “boxes”, by analogy with the models used in combinatorial calculus). For the example given above, the sequence S′ after inserting an additional symbol F into each period may look like FtcgFcgFaaaFatFtaaFacFaaaFagF (the new symbols are the F after the third nucleotide of each period). We considered all possible placements of the new symbol and chose the position (identical in each period) for which the number of runs was the greatest. Then the statistical significance was calculated again using the Monte Carlo method as described above: the position of the additional symbol in a period giving the greatest number of runs was determined for each sequence obtained by random shuffling of the initial sequence S. Thus we determined the maximal number of runs rp(i), where p is the order number of a random sequence. We shuffled the initial sequence 100 times, so p=1, 2, ..., 100. Then the mean value $\bar{r}$ and the standard deviation $\sigma_r$ were calculated for the set of 100 values, and the corresponding value Z1 was calculated by formula (1) (here the index shows the number of additional symbols F in each period of the sequence S′).
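The search for the best position of one additional symbol F can be sketched as below. This is a minimal sketch under our own naming (`split_periods`, `count_runs`, `best_extra_cut`); ties between positions are broken in favour of the later position simply because tuples are compared, which is an arbitrary choice of ours.

```python
def split_periods(seq, n, cuts):
    """Insert F at period borders and, inside every period, after each offset in `cuts`."""
    out = ["F"]
    for pos, base in enumerate(seq, start=1):
        out.append(base)
        offset = pos % n  # 0 marks the end of a period
        if offset == 0 or pos == len(seq) or offset in cuts:
            out.append("F")
    return "".join(out)


def count_runs(marked, nucleotide):
    """Runs in the code sequence where F -> 0, nucleotide -> 1, other symbols dropped."""
    codes = [0 if ch == "F" else 1 for ch in marked if ch in ("F", nucleotide)]
    return 1 + sum(1 for a, b in zip(codes, codes[1:]) if a != b)


def best_extra_cut(seq, n, nucleotide):
    """Try every in-period position for one extra F; return (max runs, position)."""
    return max((count_runs(split_periods(seq, n, {c}), nucleotide), c)
               for c in range(1, n))


seq = "tcgcgaaaattaaacaaaag"
print(split_periods(seq, 5, {3}))                   # FtcgFcgFaaaFatFtaaFacFaaaFagF
print(count_runs(split_periods(seq, 5, set()), "a"))  # 7
print(best_extra_cut(seq, 5, "a"))                  # (13, 3)
```

For this example one extra separator raises the number of runs for ‘a’ from 7 to 13, which illustrates why the additional symbols help when a nucleotide is much more frequent than the period borders.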
To simplify the description of the method, we discuss the calculation for only one nucleotide type. The calculations were performed in the same way for each type.
If m(q(i))−m(F(i))>L/n (i.e., numbers of additional and original symbols are comparable), then one more symbol F was added to each period in the same way as described above. Thus the value Z2 was calculated. Here m(q(i)) is the number of nucleotides of type i, and m(F(i)) is the number of symbols F in the sequence S′ for this nucleotide. Upon completion of the symbol addition process, we have the set of values Z0, Z1, Z2, ..., Zk. We choose the maximal one from these values; let us denote it Z.
A large positive value of Z means that the observed number of runs greatly exceeds the expected one, i.e., the average length of a run in the static series is close to 1. This indicates that the numbers of times the symbol under study occurs in the “boxes” (or in plain periods) are similar at a statistically significant level. In the case of a large negative value, the presence of a symbol in the “boxes” differs from period to period, so artificial symbols and nucleotides are not “mixed” well. In this work, we focus on large positive values since they suggest that the source sequence possesses periodicity (possibly fuzzy) or, generally, that the sequence is regular.
All sequences described in Section 3 are regular. We chose this term because these sequences possess some regular structures which are represented by the regular occurrence of nucleotides inside the “boxes” limited by the symbols F. Such regularity of nucleotide distribution in the “boxes” differs at a statistically significant level from the distribution of symbols in random sequences. This fact is reflected by the large values of the corresponding Z variables. However, not all of these sequences are periodic in the traditional sense. In fact, most of them are not, which is why the existing methods of revealing periodicity are not able to find them.
2.2. Calculation of Joint Statistic and Regularity Search Organization
We have considered above the application of the runs test for searching for the regularity of one nucleotide in the sequences under study. However, a deeper comparative analysis of the sequences requires estimating regularity for all four nucleotides simultaneously. To do this, we first calculate the statistic Z (formula (1)) for each nucleotide separately. Let us designate the values obtained for a, t, c and g as Za, Zt, Zc and Zg, respectively. All these variables have a distribution close to normal, as will be shown below. Let us use the property of the normal distribution to get a new variable whose distribution is also normal. We obtain:
$$Z_{sum} = \frac{Z_a + Z_t + Z_c + Z_g}{\sqrt{4}} = \frac{Z_a + Z_t + Z_c + Z_g}{2} \sim N(0; 1) \qquad (2)$$
The search for regularity was conducted as follows. A DNA sequence was scanned with a window of length 500 nucleotides (the length of the promoter sequences under consideration). A period length n was chosen, and four sequences S′ (one for each nucleotide) were built for the sequence S in the window by introducing the symbols F. n belonged to the range 2–16. For each of these four sequences the maximal value of Z was determined as described above. Then the value Zsum was calculated for the obtained values of Za, Zt, Zc and Zg by using formula (2).
We tried to obtain the most exact borders of the subsequence to facilitate future investigation of its function. So the search for the regular subsequence with the maximal value of Zsum within a window was performed by changing the borders of the analyzed sequence: its left and right borders were changed independently with a step of 2, and the statistic was calculated for each such subsequence. Thus, the subsequences of the given window having lengths from 50 to 500 nucleotides were considered. The results included all non-overlapping subsequences of the window for which the value of Zsum exceeded the threshold. Subsequences shorter than 50 nucleotides were not considered because the sample size in this case was not large enough to get reliable results at the chosen level of statistical significance.
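The border search inside one window can be sketched as a double loop over the left and right borders. This is a simplified sketch: `score` is a hypothetical callable standing in for the Zsum calculation, and the function keeps only the single best subsequence rather than all non-overlapping ones above the threshold.

```python
def best_subsequence(window, score, step=2, min_len=50):
    """Move both borders independently in `step` increments; keep the best-scoring span."""
    best = None
    for left in range(0, len(window) - min_len + 1, step):
        for right in range(len(window), left + min_len - 1, -step):
            value = score(window[left:right])
            if best is None or value > best[0]:
                best = (value, left, right)
    return best  # (score, left border, right border), half-open interval


# Toy score (not Zsum): rewards 'a', penalises 'c', so the 'a' block is recovered.
window = "c" * 100 + "a" * 60 + "c" * 100
print(best_subsequence(window, lambda s: s.count("a") - s.count("c")))  # (60, 100, 160)
```

For a 500-nucleotide window this enumerates on the order of 25,000 subsequences, each of which would need its own Monte Carlo estimate, which is presumably why the step and minimum length matter for running time.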
The parameters of sequence scanning (window=500, step inside a window=2) have been chosen, on the one hand, to provide a reasonable rate of calculations and, on the other hand, to reveal all statistically significant regular subsequences in the considered DNA sequences.
Since it is of interest to investigate the regions of a sequence which possess regularity not only for all four nucleotides but also for any of their subsets (including regularity for a single nucleotide), we calculated the values of the statistic for all possible combinations of nucleotides for the considered subsequence (using formulas analogous to (2)) and then chose the maximal value among them.
If we consider only the statistic for all four nucleotides, then a negative value for one nucleotide can decrease the value of Zsum below the threshold. For example, for the case Za=−4, Zt=3, Zc=3 and Zg=3 we obtain Zsum=2.5, whereas Ztcg=9/√3≈5.2, i.e., there is a regularity for three nucleotides. Thus, performing the calculations for all nucleotide combinations reduced the influence of negative statistic values for some nucleotides on the ability to reveal regularity.
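The joint statistic and the search over nucleotide subsets can be sketched as follows, assuming that the "formulas analogous to (2)" divide the sum of k values of Z by √k. The function names are ours.

```python
from itertools import combinations
from math import sqrt


def joint_z(z_values):
    """Formula (2) generalised to k nucleotides: sum of Z divided by sqrt(k)."""
    return sum(z_values) / sqrt(len(z_values))


def best_subset(z_by_nucleotide):
    """Return (statistic, nucleotides) for the non-empty subset with the largest joint Z."""
    best = None
    for k in range(1, len(z_by_nucleotide) + 1):
        for subset in combinations(sorted(z_by_nucleotide), k):
            score = joint_z([z_by_nucleotide[q] for q in subset])
            if best is None or score > best[0]:
                best = (score, subset)
    return best


z = {"a": -4.0, "t": 3.0, "c": 3.0, "g": 3.0}
print(joint_z(list(z.values())))  # 2.5
print(best_subset(z))             # best subset is ('c', 'g', 't'), statistic about 5.2
```

With four nucleotides there are only 15 non-empty subsets, so the exhaustive search adds negligible cost compared with the Monte Carlo step.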
2.3. Building of the Regularity Scheme
It is evident that using the statistical significance Z as the only characteristic of regular sequences does not allow revealing regularity types and comparing them. Yet a characteristic that does allow this is necessary, especially for investigating the biological role of regularity. We propose to use the regularity scheme as such a characteristic.
A “regularity scheme” is a schematic representation of the F symbol arrangement in the sequence for which the nucleotide distribution turned out to be the “most non-random”, that is, for which the value of Zsum was maximal. For example, the scheme given below shows that the maximal value of the test statistic has been obtained for the sequence S′ having nucleotide “a” in any period position from the 1st to the 4th nucleotide, and also in some of the 5th and 6th positions. Nucleotide “c” may occur in the 1st–3rd and 4th–6th positions of the period, etc. In this case there exists regularity for three nucleotides, “a, c, g”, while the period length equals 6 nucleotides.
a: F   F—F
c: F  F  F
t: FF F  F
g: FFF  FF
2.4. Distribution Density of Z for Random Sequences. Choosing the Threshold Value of Zsum for Searching Regular Sequences in Promoters
The results obtained by simulation modeling need to be checked by applying the method to random sequences. This check allows choosing the threshold value for the test statistic in such a way that the probability of revealing regularity in a random sequence at a significance level above the threshold tends to zero. In addition, building the spectrum of the test statistic for random sequences makes it possible to assess whether the normal approximation is appropriate for calculating Zsum.
The spectrum of the statistic Z for one nucleotide was built for a sequence generated by a random number generator (the period of the generator was ∼2×10^18) for period lengths from 2 to 16 nucleotides. The length of the sequence was 10 million nucleotides (i.e., approximately 10 times larger than the total length of the analyzed promoter sequences). The frequencies of occurrence of each nucleotide were equal to those in the analyzed set of promoters. The generated sequence was analyzed by using a scanning window as described above. For example, for the spectrum obtained for the period length n=4, the mean value of Z was 0.0124 and the standard deviation was 1.1267; no values Z≥5.0 were found. The number of Z values greater than 4.0 was 3, and the maximal value was 4.1982. The results obtained for other period lengths were analogous.
In addition, the spectrum of the joint statistic Zsum was built for the same random sequence in order to check whether the normal approximation is appropriate for its calculation. The joint statistic was calculated for all four nucleotides and for all possible combinations of two and three nucleotides (i.e., in the same way as for searching for regularity in promoters). The distribution of the statistic values in all cases is analogous to the one obtained for single nucleotides.
Therefore, the threshold value Z=4.0 ensures that not more than one random sequence will be found to be regular per 1 million nucleotides (the total length of the promoters analyzed).
To get a more accurate estimate, we performed the same calculations for a random sequence of length 100 million nucleotides. In this case 39 regular regions with Zsum≥4.0 were found, while the maximal value was 4.6. Thus, using these results we can also conclude that not more than one DNA sequence will be considered regular by chance in the analyzed set of promoters when using our algorithm with the threshold value Zsum=4.0. Let us also note that no sequences with Zsum≥5.0 were revealed.
So, we consider a DNA sequence from the analyzed set of promoters to be regular if the statistic value for it is Zsum≥4.0. For the reasons stated in this section, we believe this value to be appropriate for real data.
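The expected number of chance findings in the promoter set can be checked with a short calculation. This sketch assumes that false positives scale in proportion to sequence length, which is our simplification; the promoter set size (2236 sequences of 500 nucleotides) is the one given in Section 2.6.

```python
random_length = 100_000_000    # nucleotides in the longer random sequence
random_hits = 39               # regions with Zsum >= 4.0 reported for it
promoter_length = 2236 * 500   # 2236 promoters of 500 nucleotides each

# Proportional scaling from the random sequence to the promoter set (assumption).
expected_false_hits = random_hits * promoter_length / random_length
print(round(expected_false_hits, 2))  # 0.44
```

The result is below one, which agrees with the statement that not more than one promoter sequence is expected to be called regular by chance at this threshold.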
2.5. An Algorithm of Filtering the Results
When the window borders are changed in the range 50–500 with a step of 2, the sequences revealed as regular at a statistically significant level may overlap each other. Such an overlap (either full or partial) may occur when the same sequence is revealed at different scanning window positions. To prevent this, if some sequences overlapped each other by more than 80%, we kept only the sequence having the greatest value of Zsum.
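The overlap filter can be sketched as a greedy selection by decreasing Zsum. The text does not say how the overlap percentage is measured, so this sketch assumes it is the shared length divided by the length of the shorter of the two sequences; the function names are hypothetical.

```python
def overlap_fraction(a, b):
    """Shared length of two half-open intervals divided by the shorter interval's length."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(shared, 0) / min(a[1] - a[0], b[1] - b[0])


def filter_overlaps(hits, max_overlap=0.8):
    """hits: list of (start, end, z_sum). Keep higher-scoring hits greedily."""
    kept = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(overlap_fraction(hit, k) <= max_overlap for k in kept):
            kept.append(hit)
    return sorted(kept)


hits = [(0, 100, 4.5), (10, 100, 5.0), (60, 160, 4.1), (200, 300, 4.2)]
print(filter_overlaps(hits))  # [(10, 100, 5.0), (60, 160, 4.1), (200, 300, 4.2)]
```

Measuring the overlap against the longer sequence or against the union instead would make the filter more permissive, so the choice of denominator affects how many sequences survive.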
We use the runs test to search for regularity with a period length n in DNA sequences. However, the fact that Zsum for some sequence is greater than the threshold for some period length n does not allow an unambiguous conclusion that regularity is present exactly for this length. The reason is that the regularity for the length n may be caused, for example, by the presence of regularity for multiple periods, i.e., for the period lengths kn, where k=2, 3, .... In this case the period length n may give a statistical significance greater than the threshold, while the real regularity occurs for the lengths kn. We will consider that there is a regularity with a period length n in a DNA sequence if the value of Zsum for this length is greater than the threshold and the values of the test statistic for all other analyzed period lengths are smaller (they may either exceed the threshold or not). Thus we calculated the value of Zsum for all period lengths in the range from 2 to 16 nucleotides for the sequence found. If the maximal value of Zsum did not correspond to the period length n, then the sequence was excluded from consideration.
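This period-length check reduces to a comparison over the Zsum values computed for each analyzed period. A minimal sketch, with a hypothetical function name and made-up Zsum values used only for illustration:

```python
def passes_period_filter(z_sum_by_period, n, threshold=4.0):
    """True if Zsum at period n reaches the threshold and is the maximum over all periods."""
    z_n = z_sum_by_period[n]
    return z_n >= threshold and all(
        z_n > z for m, z in z_sum_by_period.items() if m != n
    )


# Illustrative values only: regularity at n=5 also raises Zsum at the multiple n=10.
z_sum = {3: 1.0, 5: 5.0, 10: 4.6}
print(passes_period_filter(z_sum, 5))   # True
print(passes_period_filter(z_sum, 10))  # False
```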
In addition, we performed one more filtration step to provide the maximal reliability of the results. For each of the sequences revealed to be regular we built the information decomposition spectrum (Korotkov et al., 2003), which shows the statistical significance of the found period, expressed in terms of Z (formula (1)), against the period length. Information decomposition reveals periodicity without indels in nucleotide sequences, but it can also reveal latent periodicity (Korotkov et al., 2003). Let the period having the maximal value of Z be referred to as the “maximal period”. We will consider that the maximal period really exists in a sequence if Z≥4.0 for this period in the information decomposition spectrum. When such a maximal period existed in a sequence and its length did not correspond to the length of regularity n, we excluded the analyzed DNA sequence from consideration because it was most likely that this length of regularity did not correspond to the maximal value of Zsum.
The set of sequences that passed through all filtration steps was considered the set of final results.
2.6. Choosing of Promoter Sequences to be Analyzed for Regularity
The source promoter sequences were obtained from EPD (Eukaryotic Promoter Database) (Schmid et al., 2006), version 93 (the total number of sequences contained in the database, excluding the preliminary data, was 4809). We have chosen 2236 sequences representing all groups of organisms. To do this, we used the database option “representative set of not closely related sequences”. The use of this option ensures that the set of selected sequences will not contain any two sequences having a pairwise similarity of more than 50%. The range of promoter regions was (−499, +1), where +1 corresponded to the transcription initiation site. Since we were searching for regularity only in promoter sequences, that is, not including the genes, the analyzed set of promoters included 2236 sequences, each of length 500. For each of these sequences the calculations described above were performed.
Fig. 2. An example of the periodic sequence revealed by the method of information decomposition which was also revealed as regular. Here the periodic sequence (shown
in bold) is a part of regular sequence (A). An alignment of the modified and initial sequences. Similarity level is 61.4% (B).
3. Results
3.1. Search of a Regularity in Artificial Periodic Sequence with
Indels
Let us consider an example of a regular nucleotide sequence. The method we developed reveals regularity in it at a statistically significant level, while other methods (e.g., Korotkov et al., 2003; Benson, 1999; Sharma et al., 2004), including the one based on the Fourier transform (Sharma et al., 2004), are not able to reveal the periodicity.
A sequence of length 400 possessing perfect periodicity with the period length=5 was chosen as the initial one. It contains 80 copies of the {attcg} subsequence. Then indels and substitutions simulating mutations were made in it by using a random number generator. At each change of the sequence the value Zsum was calculated for n=5. Upon making 80 indels, the following sequence has been obtained (length=399):
atcgaggagtatcatttgatgcgatttacgatcgccatgtttgatttcgatattcggttgattcg
ttcgttcgatcctggatcgaaatggtcgaatcccagaatccgatcgactcgattggttcgattg
attcgatttcgatcgccattatcgtaaaagatcgatccggataagtttcggtcgggatactgtg
ataatctgatcgaaccttttcattcttgacgctgtcgggggggatgattcttggattcattcatt
atccgtcgacatgtcgatgtattcgagtttcatggattgatctgattcgttcgattaagattcttc
cgattcgattcgagttcgattcatagattgatcgggcgattcatttccccgattttttctttgcga
agatcttgc
An alignment of the initial and the newly obtained sequences
is shown in Fig. 2B. The value of the test statistic for this new
sequence was Zsum=5.02 (Za=3.41, Zt=0.92, Zc=2.49, Zg=3.21). At
the same time, the methods based on the Fourier transform (Sharma
et al., 2004), information decomposition (Korotkov et al., 2003)
and dynamic programming (Benson, 1999) did not reveal periodicity
with the period length n=5 in this sequence at a statistically
significant level. Thus the method we developed is less sensitive
to nucleotide indels than these methods of searching for periodicity.
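The construction of the artificial test sequence can be sketched as follows. This is a minimal illustration, not the authors' program: the function names, the equal mix of insertions, deletions and substitutions, and the seed are assumptions, since the text only states that a random number generator was used.

```python
import random

def make_periodic(unit="attcg", copies=80):
    """Perfectly periodic sequence: `copies` tandem repeats of `unit`."""
    return unit * copies

def mutate(seq, n_changes, rng, alphabet="acgt"):
    """Apply n_changes random edits; the edit mix is an assumption."""
    s = list(seq)
    for _ in range(n_changes):
        kind = rng.choice(("ins", "del", "sub"))
        if kind == "ins":
            # insert a random nucleotide at a random position
            s.insert(rng.randrange(len(s) + 1), rng.choice(alphabet))
        elif kind == "del" and s:
            # delete one randomly chosen nucleotide
            del s[rng.randrange(len(s))]
        elif s:
            # replace one nucleotide by a different one
            pos = rng.randrange(len(s))
            s[pos] = rng.choice([b for b in alphabet if b != s[pos]])
    return "".join(s)

initial = make_periodic()                       # 80 copies of attcg
mutated = mutate(initial, 80, random.Random(0))
```

In the experiment described above, Zsum for n=5 would be recomputed after every single change; that step is omitted here because it depends on the statistic defined in the Methods section.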
3.2. Regularity Search in Sequences Possessing the Latent Periodicity Revealed by Information Decomposition Method
We have also studied whether our method reveals regular
subsequences in DNA sequences with latent periodicity revealed by
the method of information decomposition (Korotkov et al., 2003).
To do this, we searched for latent periodicity in the same set of
promoters using information decomposition. The total number of
periodic sequences with a period length in the range from 2 to 16
was 109, and 62 of them were longer than 50 nucleotides (the
minimal length for revealing regularity). All these sequences were
revealed as regular, with a length of regularity corresponding to the
length of the latent period. Therefore, in this case the regularity
search reveals the sequences possessing latent periodicity.
An example of a sequence revealed by both methods is shown
in Fig. 2A.
3.3. An Example of a Regular Sequence Found in a Promoter
Let us give an example of a regular sequence revealed in a
promoter. Regularity with a length of 10 nucleotides was
revealed in locus EP77531 in the region (−113, −44). The test
statistic for it was Zsum=5.1 (Za=1.9, Zt=3.1, Zc=2.0, Zg=3.1). The
DNA sequence was:
tacactatcgatagccaactgtgcaatcgatagcgtgtcatctctgactcaaatgcactc
gaatgcagcatgaccgttta
Additional symbols F were used for regularity identification. The
sequences S? for the nucleotides a, t, c and g were:
FF  F   F—F
FF     F  F
FFF  F  F—F
FF     F  F
a
t
c
g
Table 1
The number of regular sequences found in promoters for the period lengths 2–16.
Period length    The number of regular sequences
2                111
3                227
4                56
5                118
6                131
7                142
8                91
9                185
10               363
11               374
12               296
13               189
14               163
15               151
16               136
It should be noted that no value of statistical significance Z for
single nucleotides exceeds the threshold value, that is, regularity is
a joint property of all four nucleotides of a promoter sequence.
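The Z values quoted in this section can be checked numerically. In both worked examples the reported Zsum agrees, to rounding, with the sum of the four single-nucleotide Z values divided by the square root of 4. This relation is inferred here from the quoted numbers only and is an assumption of the sketch; the defining formula belongs to the Methods section. The threshold of 4.0 is the value stated in the Discussion.

```python
import math

THRESHOLD = 4.0  # significance threshold quoted in the Discussion

def z_sum(z_values):
    """Combine per-nucleotide Z values (relation inferred, see lead-in)."""
    z_values = list(z_values)
    return sum(z_values) / math.sqrt(len(z_values))

# Per-nucleotide values reported for the EP77531 region (-113, -44)
promoter = {"a": 1.9, "t": 3.1, "c": 2.0, "g": 3.1}
# Per-nucleotide values reported for the artificial sequence of Section 3.1
artificial = {"a": 3.41, "t": 0.92, "c": 2.49, "g": 3.21}

promoter_zsum = z_sum(promoter.values())
artificial_zsum = z_sum(artificial.values())

# No single nucleotide reaches the threshold, but the combined value does
no_single_exceeds = all(z < THRESHOLD for z in promoter.values())
combined_exceeds = promoter_zsum > THRESHOLD
```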
3.4. Results of Searching Regular Sequences for a Period Length 2–16 in Promoters
We searched for regular subsequences in promoter sequences
from the EPD (Schmid et al., 2006) using the following scanning
parameters: window size=500, step for window border
variation=2. The length of the regular subsequences revealed lay
in the range 50–500. The data regarding the number of revealed
sequences are given in Table 1.
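To give a feel for the size of the search space implied by these parameters, the sketch below enumerates candidate subsequence borders inside one 500-nucleotide window. The exact scanning scheme is not spelled out in this section, so moving both borders in steps of 2 from the window start, and the half-open (start, end) convention, are assumptions made for illustration.

```python
def candidate_borders(window=500, step=2, min_len=50):
    """Yield (start, end) half-open borders, both varied with `step`."""
    for start in range(0, window - min_len + 1, step):
        for end in range(start + min_len, window + 1, step):
            yield start, end

borders = list(candidate_borders())
n_candidates = len(borders)          # candidates per promoter and period
lengths = [end - start for start, end in borders]
```

Under these assumptions each promoter yields a few tens of thousands of candidate subsequences per period length, which is why overlapping hits have to be filtered afterwards.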
We analyzed 2236 promoters in total, and regularity was
revealed at a statistically significant level in 1342 of them. Thus
more than 60% contain regular subsequences with a period
length in the range from 2 to 16 nucleotides. Regularity was also
revealed in the other promoter sequences, but the level of statistical
significance did not exceed the threshold. Let us consider the
distribution of the length of regular subsequences (Table 2) and their
arrangement in promoters.
The longest regular sequence had a length of 484 bp. However, as
can be seen from Table 2, the majority of the sequences had
a length in the range 50–150. Since during filtration of overlapping
sequences we selected the ones with the greater statistical
significance, not length (i.e., we did not try to maximize the length of
the sequence revealed), such a length distribution is to be expected.
This selection process allows the regularity borders to be set more
exactly, which, in turn, increases the reliability of the results obtained.
Table 2
Length distribution of the regular sequences revealed.
Length range, base pairs    The number of the found sequences
50–100                      750
100–150                     428
150–200                     182
200–250                     91
250–300                     40
300–350                     23
350–400                     13
400–450                     12
450–500                     3
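The statement that most regular sequences are 50–150 bp long can be quantified directly from the counts in Table 2. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Counts from Table 2, keyed by (lower, upper) length range in bp
length_counts = {
    (50, 100): 750, (100, 150): 428, (150, 200): 182, (200, 250): 91,
    (250, 300): 40, (300, 350): 23, (350, 400): 13, (400, 450): 12,
    (450, 500): 3,
}

total = sum(length_counts.values())
# Sequences in the two shortest ranges, i.e. 50-150 bp
short = sum(n for (lo, hi), n in length_counts.items() if hi <= 150)
share_short = short / total

print(f"{short} of {total} sequences ({share_short:.1%}) are 50-150 bp long")
```

Roughly three quarters of the revealed sequences fall into the two shortest length ranges.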
Investigation of the arrangement of regular sequences in promoters is
more important for understanding the biological role of
regularity than the sequence length distribution. We divided a promoter
sequence into intervals of length 10 bp and counted
the number of sequences that fall into each of these intervals. Since
the minimal length of a regular sequence was 50 bp, each
sequence fell into several intervals. First of all, we wanted to
know whether the distribution of regular sequences along the
promoters was random. If such a distribution has maxima or minima in
certain regions of the promoter, the regularity may be connected
to some biological function which these regions possess.
We used simulation modeling (the Monte Carlo method) to
determine whether the distribution of regular sequences was random. To
do this, we randomly placed all the regular sequences that were
found along the promoter sequence. The lengths of the placed
sequences corresponded to those in the set of revealed
sequences, e.g., if 22 revealed sequences had the length 96,
then the same number of sequences of this length
were randomly placed. In practice, we used a random number
generator to determine only the starting point of each sequence. Its
coordinate could vary in the range (1, 500 − k), where k is the length of
the sequence to be placed, i.e., the sequence could not go beyond the
promoter borders. This placement was repeated 200
times, and after each iteration the number of sequences falling into
each interval of length 10 was determined. Based on these data,
the mean value and the variance were calculated, and then the
statistical significance Z was calculated using a formula
analogous to (1). The values of Z obtained for each interval are shown in
Fig. 3. As the figure shows, the number of regular sequences
revealed in the interval (−99, +1) relative to the beginning of a gene
considerably exceeds the number expected for random placement
of the sequences. Starting from the −99th nucleotide, the value of
Z exceeds 5.0. In addition, a deviation of Z in the negative direction
was observed in the intervals (−389, −169). We think that this
is caused by the prevalent localization of the regular regions in the
interval (−99, +1).
Fig. 3. Deviation of the number of regular sequences from the number expected for a random distribution, in units of standard deviation. The right borders of the intervals are shown.
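The Monte Carlo procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: formula (1) is not reproduced in this section, so Z is taken here as the usual standardized deviation (observed minus simulated mean, divided by the simulated standard deviation); coordinates run from 1 to 500 along the promoter; and all function names and the toy input are illustrative.

```python
import random
from statistics import mean, pstdev

def interval_counts(placements, promoter_len=500, bin_size=10):
    """For each 10-bp interval, count the sequences overlapping it."""
    counts = [0] * (promoter_len // bin_size)
    for start, length in placements:            # start is 1-based
        first = (start - 1) // bin_size
        last = (start + length - 2) // bin_size
        for b in range(first, last + 1):
            counts[b] += 1
    return counts

def random_placements(lengths, rng, promoter_len=500):
    """Draw a start in (1, 500 - k) for every sequence length k."""
    return [(rng.randint(1, max(1, promoter_len - k)), k) for k in lengths]

def z_per_interval(observed, lengths, n_iter=200, seed=0):
    """Z for each interval against n_iter random placements."""
    rng = random.Random(seed)
    sims = [interval_counts(random_placements(lengths, rng))
            for _ in range(n_iter)]
    z = []
    for b, obs in enumerate(observed):
        column = [s[b] for s in sims]
        mu, sigma = mean(column), pstdev(column)
        z.append((obs - mu) / sigma if sigma > 0 else 0.0)
    return z

# Toy input (not the paper's data): 100 sequences of length 60,
# all ending at the last promoter position.
toy_lengths = [60] * 100
toy_observed = interval_counts([(441, 60)] * 100)
toy_z = z_per_interval(toy_observed, toy_lengths)
```

With the toy input, intervals at the right end of the promoter receive large positive Z and intervals in the middle receive negative Z, the same qualitative pattern as reported for Fig. 3.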
4. Discussion
Currently, great emphasis is placed on the investigation of
promoter sequences since they determine gene activity in eukaryotic
and prokaryotic cells. If it is possible to classify promoters by some
quantitative sequence properties and then to find a connection
between these properties and gene expression in certain
cells or at a certain moment of organism development, this
may open the way to the reconstruction of genetic networks in the cell
based on whole-genome sequences (Fickett and Hatzigeorgiou,
1997; Werner, 2003). On the other hand, the improvement of
algorithms for searching promoter sequences may make the gene search
in eukaryotic genomes more accurate (Fickett and Hatzigeorgiou,
1997; Novichkov et al., 2001; Werner, 2003; Hertel, 2008), since the
detection of a promoter region points to the beginning of a gene. It is
important to develop new mathematical approaches to these
problems, which may reveal new rules of promoter sequence
organization and which can be used both for identification of
promoter sequences and for their classification. In the present work
we have introduced the term “regularity” and have found that
regular sequences are mostly present in the region from −99 to +1
relative to the site of transcription initiation. RNA
polymerase usually binds to the promoter in the region (−45, +5). In total,
some tens of proteins are involved in transcription complex
formation. According to previously obtained data (Zhang, 2007), the promoter
region (−38, +5) contains the binding sites of TFIIB (−38, −32), TBP (−31,
−24), TFIIB (−23, −17) and TAF1 (−2, +5). The distribution of
regular sequences for each period length along 10-nucleotide intervals of
the region (−99, +1) is shown in Table 3. A regular sequence spans
several intervals. Each value in Table 3 represents the number
of regular sequences of a given period found in the given interval. For
example, the number of sequences with a period length equal
to 10 in the interval (−59, −50) was 60 (each of these
sequences also appears in some other intervals since its minimal
length is 50). Comparing the previously obtained data with Table 3
and Fig. 3, we see that the binding regions of transcription factors
and RNA polymerase lie entirely within the regions in
which the number of regular sequences exceeds the expected
values. Based on this, we can suppose that sequence regularity
is important for the binding of these proteins to DNA. It is likely that a
certain regular interchange of nucleotides is necessary for such binding
(Kutuzova et al., 1997, 1999; Ioshikhes et al., 1999). This interchange
may be connected with the evolutionary relationship of various
transcription factors, which leads to similarity of the DNA sequences to
which these factors are able to bind. Such similarity may produce
regularity. The regularity in the region (−99, +1) was revealed
at a statistically significant level for approximately 60% of the
promoter sequences under study, while for the other 40% it was also
revealed, but the significance level was lower than the threshold
value 4.0. This may be due to differences in the set of transcription
factors for different promoters. Different sets, in turn, may lead to
different statistical significances and regularity lengths in the region
(−99, +1).
In addition, the revealed regularity may be connected with a bend of
the DNA molecule in the region of RNA polymerase binding (Mizuno,
1987; Tchernaenko et al., 2008). The regular interchange of
nucleotides may cause DNA bending, possibly in order to
facilitate transcription complex formation (Ozoline et
al., 1999; Bolshoy and Nevo, 2000).
Periodicity of promoter sequences was previously investigated
by methods based on the Fourier transform (Ioshikhes et al., 1999;
Kutuzova et al., 1997, 1999; Bolshoy and Nevo, 2000). The results
obtained by Kutuzova et al. (1997, 1999) show that there
is a periodicity with a period length from 6 to 8 nucleotides in the
region (−99, +1). The authors suppose that the periodicity revealed
is mainly concerned with the interaction of RNA polymerase with the
region from −99 to +1 (core promoter). These conclusions are in
full agreement with the data obtained in the present work.
Ioshikhes et al. (1999) and Bolshoy and Nevo (2000) also investigated
the periodicity of promoter regions by the Fourier transform and
revealed a periodicity with a period length between 10
and 11 nucleotides. The authors concluded that
this type of periodicity is specific for nucleosome binding near the +1
position. It was also noted that this periodicity arises from a
regular interchange of AA and TT dinucleotides (Herzel et al., 1999;
Cohanim et al., 2006; Salih et al., 2007). These data are also in good
agreement with our results, since it is clear from
Table 3 that the main contribution to the observed regularity in the
interval (−99, +1) is made by the period lengths from 10 to
12 bp.
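The contribution of the period lengths 10–12 can be read off Table 3 by computing, for each 10-bp interval, the share of counted regular sequences that have one of these three period lengths. The snippet below does this with the Table 3 values; the helper name is illustrative.

```python
# Table 3: rows are period lengths, columns are the intervals
# (-99;-90), (-89;-80), ..., (-9;+1)
TABLE3 = {
    2:  [28, 28, 27, 26, 21, 19, 15, 14, 7, 5],
    3:  [43, 44, 47, 48, 44, 43, 32, 25, 18, 12],
    4:  [9, 9, 8, 8, 9, 10, 7, 7, 7, 4],
    5:  [9, 7, 6, 6, 7, 6, 4, 4, 4, 4],
    6:  [9, 9, 6, 8, 9, 8, 7, 6, 3, 1],
    7:  [15, 11, 11, 10, 10, 10, 9, 9, 8, 8],
    8:  [8, 9, 9, 11, 9, 9, 9, 7, 4, 2],
    9:  [15, 15, 19, 20, 19, 18, 14, 10, 4, 3],
    10: [66, 69, 64, 63, 60, 54, 43, 34, 20, 11],
    11: [68, 65, 62, 66, 62, 57, 42, 35, 20, 14],
    12: [46, 44, 45, 45, 40, 40, 37, 31, 17, 10],
    13: [50, 49, 51, 48, 46, 44, 30, 21, 14, 8],
    14: [29, 28, 28, 28, 26, 22, 18, 13, 7, 5],
    15: [32, 33, 34, 32, 30, 30, 25, 23, 16, 9],
    16: [27, 27, 27, 26, 21, 17, 16, 10, 5, 3],
}

def share_of_periods(table, periods):
    """Per-interval share of counts belonging to the given period lengths."""
    n_intervals = len(next(iter(table.values())))
    totals = [sum(row[i] for row in table.values()) for i in range(n_intervals)]
    selected = [sum(table[p][i] for p in periods) for i in range(n_intervals)]
    return [s / t for s, t in zip(selected, totals)]

shares = share_of_periods(TABLE3, (10, 11, 12))
```

Three of the fifteen period lengths account for roughly 35 to 40% of the counts in every interval of the region (−99, +1).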
The only difference is that the mathematical approach we
developed reveals the regularity individually for each promoter, while
the Fourier transform can reveal periodicity with a period
length from 10 to 11 only for a set of promoter sequences
(Ioshikhes et al., 1999). This shows that our approach is more
sensitive and thus can be used for developing recognition
algorithms for individual promoter sequences.
Overall, the fact that the largest part of the regular sequences
in the region (−99, +1) possesses a regularity with a length of
10–12 nucleotides suggests that the nature of regularity differs for
different regularity lengths. Comparatively short regularity (2–6
nucleotides), as noted above, may be important for the binding
of both RNA polymerase II and transcription factors, while
regularity with a length of 10–12 nucleotides may be related to DNA bends
in the promoter region. Such a bend may provide nucleosome
positioning and/or facilitate the binding of RNA polymerase near the +1
position.
Table 3
Distribution of regular sequences by the period length in the region (−99, +1).
Period length   −99; −90   −89; −80   −79; −70   −69; −60   −59; −50   −49; −40   −39; −30   −29; −20   −19; −10   −9; +1
2               28         28         27         26         21         19         15         14         7          5
3               43         44         47         48         44         43         32         25         18         12
4               9          9          8          8          9          10         7          7          7          4
5               9          7          6          6          7          6          4          4          4          4
6               9          9          6          8          9          8          7          6          3          1
7               15         11         11         10         10         10         9          9          8          8
8               8          9          9          11         9          9          9          7          4          2
9               15         15         19         20         19         18         14         10         4          3
10              66         69         64         63         60         54         43         34         20         11
11              68         65         62         66         62         57         42         35         20         14
12              46         44         45         45         40         40         37         31         17         10
13              50         49         51         48         46         44         30         21         14         8
14              29         28         28         28         26         22         18         13         7          5
15              32         33         34         32         30         30         25         23         16         9
16              27         27         27         26         21         17         16         10         5          3
Testing of the developed mathematical approach has shown
that it is able to reveal perfect and latent periodicity
with a comparatively small number of indels. However, the method
could not find regions where long nucleotide insertions
have occurred. In this case the sequences of codes K(i) may contain
a rather large number of empty “boxes”, which leads to a substantial
decrease in statistical significance for such a region. This may be
the reason why the regularity in some promoter sequences in the
region (−99, +1) was revealed with Z<4.0.
Our method is not seriously affected by transpositions of
neighboring nucleotides. Within the limits of one “box” in a sequence of
codes K(i), the nucleotide position is not important for the method;
thus transpositions of the type AT→TA and similar ones do not
affect the possibility of revealing regularity in nucleotide
sequences. Taking such transpositions into account is important for
revealing the regions of protein binding to DNA, since due to a certain
conformational mobility of the proteins this process may require
the presence of certain nucleotides not in a specific promoter
position, but in the vicinity of this position (Fickett and Hatzigeorgiou,
1997; Werner, 2003). Such transpositions cannot be taken into
account by methods based on dynamic programming when
nucleotide sequences are compared. The presence of the
pair AT against the pair TA in an alignment is treated as a lack
of similarity, which decreases the weight of the final alignment.
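The difference between the two kinds of comparison can be illustrated with a toy example. The sketch below is not the authors' K(i) encoding, which is defined in the Methods section; it only contrasts a position-by-position identity score, as an ungapped alignment would compute it, with a position-independent nucleotide count inside one short stretch. The function names and the example strings are hypothetical.

```python
from collections import Counter

def positional_identity(a, b):
    """Fraction of aligned positions with identical nucleotides (no gaps)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def same_composition(a, b):
    """True if both stretches contain the same nucleotides, in any order."""
    return Counter(a) == Counter(b)

original = "gcatgc"
transposed = "gctagc"    # the neighbouring pair at -> ta is swapped

identity = positional_identity(original, transposed)   # two mismatches
unchanged = same_composition(original, transposed)     # counts are equal
```

The swap costs two mismatches in the positional comparison, while the composition of the stretch is unchanged.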
Our results may be useful for the identification
of potential promoter sequences in eukaryotic genomes. Multiple
algorithms for searching and identifying promoter
sequences have been proposed earlier (Pedersen et al., 1999; Fickett and
Hatzigeorgiou, 1997; Werner, 2003; Ohler and Niemann, 2001;
Anwar et al., 2008; Hutchinson, 1996; Prestridge, 1995; Reese,
2001). Most of these approaches are based on searching for certain
motifs in DNA sequences, although the mathematical methods of
motif identification differ. The common result of these
works is that it is possible to reveal a significant part of the known
promoters (true positives), while the number of incorrect
predictions (false positives) remains rather high. Usually it is possible to
reveal more than 50% of existing promoters with a background
of about one false prediction per 1000 DNA bases (Fickett and
Hatzigeorgiou, 1997). We think that the regularity we have revealed can
lower the number of false positive predictions. To achieve this, we
should reveal regularity in all known promoter sequences and try
to classify the regularity revealed for each DNA base of each known
promoter. Then it is possible to develop a method of searching for
regularity classes in genome sequences based on the classes obtained
earlier. The previously developed algorithms of promoter identification
may then be applied only to the DNA sequences in which regularity
of a certain class has been found. We suppose that such a combined
approach can reduce the current number of false positives
severalfold.
On the whole, our work shows that a regular structure is likely
to be an intrinsic property of DNA promoter regions.
References
Anwar, F., Baker, S.M., Jabid, T., Hasan, M.M., Shoyaib, M., Khan, H., Walshe, R., 2008. Pol II promoter prediction using characteristic 4-mer motifs: a machine learning approach. BMC Bioinform. 9, 414–422.
Bajic, V.B., Chong, A., Seah, S.H., Brusic, V., 2002. An intelligent system for vertebrate promoter recognition. IEEE Intell. Syst. Mag. 17, 64–70.
Bajic, V.B., Seah, S.H., 2003. Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Res. 13, 1923–1929.
Bajic, V.B., Tan, S.L., Suzuki, Y., Sugano, S., 2004. Promoter prediction analysis on the whole human genome. Nat. Biotechnol. 22, 1467–1473.
Benson, G., 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580.
Bolshoy, A., Nevo, E., 2000. Ecologic genomics of DNA: upstream bending in prokaryotic promoters. Genome Res. 10, 1185–1193.
Brownlee, K.A., 1965. Statistical Theory and Methodology in Science and Engineering, second ed. Wiley, New York.
Claverie, J.M., 1997. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735–1744.
Cohanim, A.B., Trifonov, E.N., Kashi, Y., 2006. Specific selection pressure at the third codon positions: contribution to 10- to 11-base periodicity in prokaryotic genomes. J. Mol. Evol. 63, 393–400.
Davuluri, R.V., Grosse, I., Zhang, M.Q., 2001. Computational identification of promoters and first exons in the human genome. Nat. Genet. 29, 412–417.
Dieterich, C., et al., 2005. Comparative promoter region analysis powered by CORG. BMC Genomics 6, 24.
Fickett, J.W., Hatzigeorgiou, A.G., 1997. Eukaryotic promoter recognition. Genome Res. 7, 861–878.
Gershenzon, N.I., Trifonov, E.N., Ioshikhes, I.P., 2006. The features of Drosophila core promoters revealed by statistical analysis. BMC Genomics 7, 161.
Hertel, K.J., 2008. Combinatorial control of exon recognition. J. Biol. Chem. 283, 1211–1215.
Herzel, H., Weiss, O., Trifonov, E.N., 1999. 10–11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15, 187–193.
Hoel, P.G., 1966. Introduction to Mathematical Statistics, third ed. Wiley, New York.
Hutchinson, G.B., 1996. The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comput. Appl. Biosci. 12, 391–398.
Ioshikhes, I., Trifonov, E.N., Zhang, M.Q., 1999. Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc. Natl. Acad. Sci. U.S.A. 96, 2891–2895.
Knudsen, S., 1999. Promoter2.0: for the recognition of PolII promoter sequences. Bioinformatics 15, 356–361.
Konopka, A.K., 1994. Sequence and codes: fundamental of biomolecular cryptology. In: Smith, D. (Ed.), Biocomputing: Informatics and Genome Projects. Academic Press, San Diego, pp. 119–174.
Konopka, A.K., Martindale, C., 1995. Noncoding DNA, Zipf’s law, and language. Science 268, 789.
Korotkov, E.V., Korotkova, M.A., Kudryashov, N.A., 2003. Information decomposition method to analyze symbolical sequences. Phys. Lett. A 312, 198–210.
Kutuzova, G.I., Frank, G.K., Esipova, N.G., Makeev, V.Iu., Polozov, R.V., 1997. Fourier analysis of nucleotide sequences. Periodicity in E. coli promoter sequences. Biofizika 42, 354–362.
Kutuzova, G.I., Frank, G.K., Esipova, N.G., Makeev, V.Iu., Polozov, R.V., 1999. Periodicity in contacts of RNA polymerase with promoters. Biofizika 44, 216–223.
Laskin, A.A., Kudryashov, N.A., Skryabin, K.G., Korotkov, E.V., 2005. Latent periodicity of serine-threonine and tyrosine protein kinases and other protein families. Comput. Biol. Chem. 29, 229–243.
Lobzin, V.V., Chechetkin, V.R., 2000. The order and correlations in genomic DNA sequences. Spectral approach. Uspehi Fizicheskih Nauk (Russian) 170, 57–81.
Matsuyama, Y., Kawamura, R., 2004. Promoter recognition for E. coli DNA segments by independent component analysis. Proc. Comput. Syst. Bioinform. Conf., 686–691.
Mizuno, T., 1987. Static bend of DNA helix at the activator recognition site of the ompF promoter in Escherichia coli. Gene 54, 57–64.
Novichkov, P.S., Gelfand, M.S., Mironov, A.A., 2001. Gene recognition in eukaryotic DNA by comparison of genomic sequences. Bioinformatics 17, 1011–1018.
Ohler, U., Liao, G.C., Niemann, H., Rubin, G.M., 2002. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3, 12, research0087.1–0087.12.
Ohler, U., Niemann, H., 2001. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. 17, 56–60.
Ozoline, O.N., Deev, A.A., Trifonov, E.N., 1999. DNA bendability - a novel feature in E. coli promoter recognition. J. Biomol. Struct. Dyn. 16, 825–831.
Pedersen, A.G., Baldi, P., Chauvin, Y., Brunak, S., 1999. The biology of Eukaryotic promoter prediction: a review. Comput. Chem. 23, 191–207.
Prestridge, D.S., 1995. Predicting Pol II promoter sequences using transcriptional factor binding sites. J. Mol. Biol. 249, 923–932.
Reese, M.G., 2001. Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Comput. Chem. 26, 51–56.
Salih, F., Salih, B., Trifonov, E.N., 2007. Sequence-directed mapping of nucleosome positions. J. Biomol. Struct. Dyn. 24, 489–493.
Scherf, M., Klingenhoff, A., Werner, T., 2000. Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol. 297, 599–606.
Schmid, C.D., Perier, R., Praz, V., Bucher, P., 2006. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 34, D82–D85.
Sharma, D., Issac, B., Raghava, G.P., Ramaswamy, R., 2004. Spectral repeat finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 20, 1405–1412.
Shelenkov, A.A., Skryabin, K.G., Korotkov, E.V., 2006. Search and classification of potential minisatellite sequences from bacterial genomes. DNA Res. 13, 89–102.
Shelenkov, A.A., Korotkov, A.E., Korotkov, E.V., 2008. MMsat - a database of potential micro- and minisatellites. Gene 409, 53–60.
Sheskin, D.J., 2000. Handbook of Parametric and Nonparametric Statistical Procedures, second ed. Chapman & Hall/CRC, New York.
Solovyev, V.V., Shahmuradov, I.A., 2003. PromH: promoters identification using orthologous genomic sequences. Nucleic Acids Res. 31, 3540–3545.
Tchernaenko, V., Radlinska, M., Lubkowska, L., Halvorson, H.R., Kashlev, M., Lutter, L.C., 2008. DNA bending in transcription initiation. Biochemistry 47, 1885–1895.
Werner, T., 2003. The state of the art of mammalian promoter recognition. Brief Bioinform. 4, 22–30.
Xie, X., Wu, S., Lam, K.M., Yan, H., 2006. PromoterExplorer: an effective promoter identification method based on the AdaBoost algorithm. Bioinformatics 22, 2722–2728.
Zhang, M.Q., 2007. Computational analyses of eukaryotic promoters. BMC Bioinform. 8 (Suppl. 6), S3.