Author's personal copy

Computational Biology and Chemistry 33 (2009) 196–204

journal homepage: www.elsevier.com/locate/compbiolchem

Research Article

Search of regular sequences in promoters from eukaryotic genomes

Andrew Shelenkov∗, Eugene Korotkov

Bioengineering Centre of Russian Academy of Sciences, 117312 Moscow, Pr-t 60-tya Oktyabrya, 7/1, Russian Federation

Article info

Article history:

Received 26 June 2008

Received in revised form 8 February 2009

Accepted 18 March 2009

Keywords:

DNA sequence analysis

Periodicity

Promoters

Protein binding sites

Regularity

Abstract

In this paper, the notion of “regularity” is introduced to describe the structural features of DNA sequences. This notion extends the term “latent periodicity”. A novel method for revealing regularity, based on the runs test, is described. A search for regular sequences in eukaryotic promoters has shown that more than 60% of them possess regularity at a statistically significant level. Possible biological functions of regularity are discussed, together with the possibility of using this characteristic for promoter annotation.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

At present, DNA sequences from various genomes, including the human genome, are being analyzed on a wide scale. One of the most important goals of this analysis is to characterize and determine the functions of various genes. During the last decade, several reliable methods of predicting protein-coding DNA regions have been proposed (Claverie, 1997). However, the prediction of regulatory regions, in particular promoters, remains a challenging task, although different methods have also been proposed for it (Bajic et al., 2002; Davuluri et al., 2001; Ohler et al., 2002). A promoter is a genome region located near the site of transcription initiation that plays the key role in genetic regulation (Pedersen et al., 1999). Promoters receive signals from various sources (e.g., from cell receptors) and control the level of transcription initiation, which to a great extent determines gene expression (Dieterich et al., 2005). Thus promoter identification is an important step in gene annotation.

To distinguish genome regions that contain promoters from those that do not (the latter, obviously, being the majority), a large set of characteristics has been used, for example, CpG islands (Davuluri et al., 2001; Bajic and Seah, 2003), TATA-boxes (Ohler et al., 2002; Knudsen, 1999), CAAT-boxes (Ohler et al., 2002; Knudsen, 1999), some typical transcription factor binding sites (Ohler et al., 2002; Knudsen, 1999; Solovyev and Shahmuradov, 2003), pentamer matrices (Bajic et al., 2002), oligonucleotides (Scherf et al., 2000), and combined approaches (Xie et al., 2006). In addition, pattern recognition techniques have been used, such as neural networks (Ohler et al., 2002; Bajic and Seah, 2003; Knudsen, 1999), linear and discriminant analysis (Davuluri et al., 2001; Solovyev and Shahmuradov, 2003), the interpolation Markov model (Ohler et al., 2002), independent component analysis (Matsuyama and Kawamura, 2004) and some other methods (Gershenzon et al., 2006). However, the analysis of experimental data has shown (Bajic et al., 2004) that the problem of choosing statistically significant biological signals for promoter prediction programs remains unsolved. No single characteristic describes the whole variety of promoters, and each feature revealed by examining promoter sequences has its own limitations (Xie et al., 2006). Therefore, there is a need for new characteristics that are specific to promoters but flexible enough to cover the variety of their types.

∗Corresponding author. Tel.: +7 499 135 2161; fax: +7 499 135 0571. E-mail address: fallandar@gmail.com (A. Shelenkov).

The presence of periodicity and other structural elements in genetic texts has previously been revealed by various mathematical methods (Benson, 1999; Konopka, 1994; Konopka and Martindale, 1995). In this paper we introduce a new term, regularity, to characterize a sequence and propose a method for identifying regular sequences. We apply this method to search for regular sequences in promoters from various eukaryotic genomes.

The notion of regularity is an extension of the term “latent periodicity” (Korotkov et al., 2003; Laskin et al., 2005; Shelenkov et al., 2006, 2008) to the case when the positions of individual nucleotides in a period are not fixed rigidly. Regularity is a non-random nucleotide distribution inside the periods of the nucleotide sequence being examined. To reveal regularity, we divide the sequence into equal-sized intervals (or “periods”) and calculate the degree of “non-randomness” of each nucleotide distribution in these periods. We used the runs test (Hoel, 1966; Brownlee, 1965) as a criterion for estimating the non-randomness of these distributions. The method of calculating the number of runs is illustrated below. The definition of a regular sequence and the calculation of a quantitative measure of the level of sequence regularity are given in Section 2.1.

1476-9271/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2009.03.001

The method described in this paper can be used to search for latent periodicity, and it can also reveal sequences that cannot formally be considered periodic although their structural composition is similar to that of periodic sequences. The main advantages of the method are its comparative insensitivity to the presence of indels, no need to specify the period type in advance (e.g., in the form of a frequency matrix), and, most importantly, the possibility of revealing sequence regularity when nucleotide indels and substitutions are present simultaneously in the analyzed sequence.

In Section 3 it is shown that the majority of known promoters

possess the regular structure.

2. Methods and Algorithms

2.1. Regularity Definition and Its Quantitative Measure Defined Using Runs Test

Let us define regular sequences as DNA sequences which have the same (or a close) number of nucleotides of a certain type (or of all types) in each period, while the nucleotides are distributed in the same way along the periods. Here the periods of length n are DNA fragments of the same length located one after another in a sequence. Let us consider the sequence S = s1, s2, ..., sL, containing characters from the alphabet Q = {a, t, c, g}. In Fig. 1, a DNA sequence is divided into periods of length 3 by using the symbol F. First, we place F at positions 0 and L+1 of S. Then we count n nucleotides starting from the first position of S and place another symbol F. We repeat this process until the end of the sequence S. As a result, we obtain the sequence S′ = F s1, s2, ..., sn, F sn+1, sn+2, ..., s2n, F, ..., sL F (the last period can be shorter than the others). In practice, the presence of regularity means that the distribution of the positions of some nucleotide along the sequence is the same as the distribution of the positions of the symbol F along this sequence. We use the runs test to define a quantitative measure of regularity.

Fig. 1. Application of the runs test to determining the “non-randomness” of the adenine distribution along periods of length 3 nucleotides. The sequence under study is divided into separate periods by inserting the symbol F after every three nucleotides, starting from the zero position of the sequence. Then the sequence of categories (or codes) is formed by changing the symbol F to 0 and the symbol ‘a’ to 1, and the number of runs is calculated. A run is a sequence of identical elements that is preceded and followed by different elements or by no element (the beginning or the end of the sequence). For example, the code sequences 000111 and 01 each contain two runs. The number of runs in the periodic sequence (B) is always greater than or equal to that in the non-periodic sequence (A). This allows the number of runs to be used as a periodicity measure which is almost insensitive to nucleotide indels. This is shown in (C), where ‘c’ is deleted from the 9th position of the periodic sequence. It is clear that the number of runs does not change and remains equal to 17.

The runs test (Wald–Wolfowitz test) (Hoel, 1966; Brownlee, 1965) is a non-parametric test of the hypothesis H0 stating that two (or, generally, more) datasets are random independent samples of sizes n1 and n2 (n1 + n2 = N) from the same population, i.e., that the distribution functions of these samples are the same. The test evaluates the number of runs in a series whose elements may in general take k different values. Let us consider the case k = 2. The observations are written as the static series of the joint sample T, while the membership of each observation in a certain group is given by a coding variable that takes two values (0 and 1, where 0 marks a value belonging to the first sample and 1 a value belonging to the second one). The values of the coding variable form the series R, which is called “the sequence of categories” or “the sequence of codes”. To apply the runs test, the elements of the series T are sorted in ascending order, and the same rearrangements are made simultaneously in the series R.

A run is a sequence of identical elements that is preceded and followed by different elements or by no element (the beginning or the end of the sequence). The test statistic is the number of runs in the sequence of codes. If the hypothesis H0 is true, then both samples should be mixed well in the static series and thus the number of runs should be large. If the samples have been drawn from populations with different distributions (which differ in mean value or variance), then the number of runs will be small.

For example, if the sequence S is tcgcgcattattcaagtacc, then S′ = FtcgcgFcattaFttcaaFgtaccF, and thus the static series for a period length of 5 nucleotides, for the symbol F and the nucleotide ‘a’, is {0, 0, 1, 1, 0, 1, 1, 0, 1, 0}; its runs are 00, 11, 0, 11, 0, 1, 0. Here the value in the sequence of codes corresponding to the symbol F is 0, and the one corresponding to the nucleotide ‘a’ is 1. In this example the number of runs is 7.
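The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `insert_period_marks`, `code_sequence` and `count_runs` are hypothetical helpers introduced here for illustration and do not come from the original software.

```python
def insert_period_marks(seq: str, n: int) -> str:
    """Build the marked sequence: F before every n-nucleotide period and F at the end."""
    periods = [seq[i:i + n] for i in range(0, len(seq), n)]
    return "F" + "F".join(periods) + "F"


def code_sequence(marked: str, nucleotide: str) -> list[int]:
    """Map F -> 0 and the chosen nucleotide -> 1; every other symbol is ignored."""
    codes = []
    for ch in marked:
        if ch == "F":
            codes.append(0)
        elif ch == nucleotide:
            codes.append(1)
    return codes


def count_runs(codes: list[int]) -> int:
    """Count maximal blocks of identical neighbouring elements."""
    if not codes:
        return 0
    return 1 + sum(1 for prev, cur in zip(codes, codes[1:]) if prev != cur)


s = "tcgcgcattattcaagtacc"
marked = insert_period_marks(s, 5)
codes = code_sequence(marked, "a")
print(marked)             # FtcgcgFcattaFttcaaFgtaccF
print(codes)              # [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
print(count_runs(codes))  # 7
```

If the sequence length is not a multiple of n, the last period produced by this sketch is simply shorter, in line with the remark in the text.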

To introduce the quantitative measure for nucleotide sequences, we derive four sequences of codes K(i) from the sequence S′. The sequence of codes K(i), i = 1, 2, 3, 4, is derived from S′ by changing the symbol F to 0 and the symbol q(i) to 1, while all other symbols are ignored. It is assumed here that the properties of a source sequence are represented by the properties of binary sequences (Lobzin and Chechetkin, 2000). Here q(i) is a symbol from the alphabet Q defined above. In effect, this means that we compare the samples containing the position numbers of F and q(i) in the sequence S′. We calculated the number of runs r(i) for each sequence of codes. The statistical significance of the number of runs obtained was calculated by simulation modeling (Monte Carlo method). The symbols of the initial sequence S were shuffled randomly, then the sequence S′ and the sequences K(i) were built for the newly obtained sequence, and the numbers of runs r(i), i = 1, 2, 3, 4, were determined. As a result, P values of each number of runs r(i) were obtained for random sequences (we used P = 100). After this, the mean value $\bar{r}(i)$ and the standard deviation $\sigma_r(i)$ were calculated for each of the four sets of values obtained. The formula for calculating the statistical significance Z(i) of the number of runs r(i) was:

\[
Z(i) = \frac{r(i) - \bar{r}(i)}{\sigma_r(i)} \tag{1}
\]
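The Monte Carlo estimate behind formula (1) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `runs_for_nucleotide` and `monte_carlo_z` are hypothetical, the sample standard deviation is used (the text does not say which estimator was applied), and a zero spread is mapped to Z = 0 purely to avoid division by zero.

```python
import random
import statistics


def runs_for_nucleotide(seq: str, n: int, nucleotide: str) -> int:
    """Number of runs in the 0/1 code sequence (F -> 0, nucleotide -> 1)."""
    marked = "F" + "F".join(seq[i:i + n] for i in range(0, len(seq), n)) + "F"
    codes = [0 if ch == "F" else 1 for ch in marked if ch in ("F", nucleotide)]
    return 1 + sum(1 for a, b in zip(codes, codes[1:]) if a != b)


def monte_carlo_z(seq, n, nucleotide, shuffles=100, rng=None):
    """Return (observed runs, mean, std, Z) using `shuffles` random shufflings of seq."""
    rng = rng or random.Random()
    observed = runs_for_nucleotide(seq, n, nucleotide)
    symbols = list(seq)
    simulated = []
    for _ in range(shuffles):
        rng.shuffle(symbols)  # random permutation of the original symbols
        simulated.append(runs_for_nucleotide("".join(symbols), n, nucleotide))
    mean = statistics.fmean(simulated)
    std = statistics.stdev(simulated)  # sample standard deviation (assumption)
    z = (observed - mean) / std if std > 0 else 0.0
    return observed, mean, std, z


# A perfectly periodic test sequence: 20 copies of "attcg", period length 5.
observed, mean, std, z = monte_carlo_z("attcg" * 20, 5, "a", rng=random.Random(0))
print(observed, round(mean, 2), round(std, 2), round(z, 2))
```

For the perfectly periodic input the symbols F and ‘a’ alternate strictly, so the observed number of runs is the largest possible one and Z is positive.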

It should be noted that the runs test gives a reliable comparison only for samples of comparable sizes (Sheskin, 2000), i.e., the number of symbols F should be approximately equal to the number of nucleotides of each type. If the number of nucleotides of some type is much larger or smaller (by more than a factor of two) than the number of symbols F, then the runs test may fail to reveal the sequence regularity. Let us modify the example given above. Let the sequence S = tcgcgaaaattaaacaaaag, so that S′ = FtcgcgFaaaatFtaaacFaaaagF. In this case the static series for the nucleotide ‘a’ is {0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0}, which again contains 7 runs. This example shows that the number of runs does not increase when the number of symbols in each period increases. For long periods (>4) this can prevent some regular sequences from being revealed.

To solve this problem, we inserted additional symbols F into the sequence S′. In this case each period of the sequence S is divided into parts (or “boxes”, by analogy with the models used in combinatorics), in the same way for every period. For the example given above, the sequence S′ after inserting one additional symbol F into each period may look like FtcgFcgFaaaFatFtaaFacFaaaFagF (here the additional symbol follows the third nucleotide of each period). We considered all possible placements of the new symbol and chose the position (identical in each period) for which the number of runs was the greatest. Then the statistical significance was calculated again by the Monte Carlo method as described above; namely, the position of the additional symbol in a period giving the greatest number of runs was determined for each sequence obtained by random shuffling of the initial sequence S. Thus we determined the maximal number of runs rp(i), where p is the index of a random sequence. We shuffled the initial sequence 100 times, so p = 1, 2, ..., 100. Then the mean value $\bar{r}$ and the standard deviation $\sigma_r$ were calculated for the set of 100 values, and the corresponding value Z1 was calculated by formula (1) (here the index shows the number of additional symbols F added to each period of the sequence S′).
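The search for the best position of one additional symbol F can be sketched as below. The helper names `mark_periods` and `best_extra_marker` are hypothetical, and returning the first position on ties is an assumption of this sketch; the text only says that the position with the greatest number of runs was chosen.

```python
from typing import Optional


def count_runs(marked: str, nucleotide: str) -> int:
    """Runs in the 0/1 code sequence (F -> 0, nucleotide -> 1, others ignored)."""
    codes = [0 if ch == "F" else 1 for ch in marked if ch in ("F", nucleotide)]
    return 1 + sum(1 for a, b in zip(codes, codes[1:]) if a != b)


def mark_periods(seq: str, n: int, split: Optional[int] = None) -> str:
    """Mark periods with F; optionally add one extra F after `split` nucleotides of each period."""
    pieces = []
    for i in range(0, len(seq), n):
        period = seq[i:i + n]
        if split is not None and 0 < split < len(period):
            period = period[:split] + "F" + period[split:]
        pieces.append(period)
    return "F" + "F".join(pieces) + "F"


def best_extra_marker(seq: str, n: int, nucleotide: str):
    """Try every split position 1..n-1 and return (position, runs) with the most runs."""
    candidates = [(count_runs(mark_periods(seq, n, j), nucleotide), j) for j in range(1, n)]
    runs, position = max(candidates, key=lambda c: c[0])  # first maximum wins on ties
    return position, runs


s = "tcgcgaaaattaaacaaaag"
print(count_runs(mark_periods(s, 5), "a"))  # 7 runs without an extra marker
print(mark_periods(s, 5, 3))                # FtcgFcgFaaaFatFtaaFacFaaaFagF
print(best_extra_marker(s, 5, "a"))         # 13 runs at the best position
```

For this example the extra marker raises the number of runs from 7 to 13, which is the effect the additional symbols are meant to achieve.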

To simplify the description of the method, we discuss the calculation for one nucleotide type only. The calculations were performed in the same way for each type.

If m(q(i)) − m(F(i)) > L/n (i.e., numbers of additional and original symbols are comparable), then one more symbol F was added to each period in the same way as described above, and the value Z2 was calculated. Here m(q(i)) is the number of nucleotides of type i and m(F(i)) is the number of symbols F in the sequence S′ for this nucleotide. Upon completion of the symbol addition process, we have the set of values Z0, Z1, Z2, ..., Zk. We choose the maximal one of these values and denote it Z.

A large positive value of Z means that the observed number of runs greatly exceeds the expected one, i.e., the average length of a run in the static series is close to 1. This indicates that the numbers of times the symbol under study occurs in the “boxes” (or in plain periods) are similar at a statistically significant level. In the case of a large negative value, the presence of a symbol in “boxes” differs from period to period, so artificial symbols and nucleotides are not “mixed” well. In this work, we focus on large positive values, since they suggest that the source sequence possesses periodicity (possibly fuzzy) or, more generally, that the sequence is regular.

All sequences described in Section 3 are regular. We chose this term because these sequences possess regular structures, represented by the regular occurrence of nucleotides inside the “boxes” bounded by the symbols F. Such regularity of nucleotide distribution in “boxes” differs at a statistically significant level from the distribution of symbols in random sequences, which is reflected by the large values of the corresponding Z variables. However, not all of these sequences are periodic in the traditional sense. In fact, most of them are not, which is why existing methods of revealing periodicity are not able to find them.

2.2. Calculation of Joint Statistic and Regularity Search Organization

We have considered above the application of the runs test to searching for the regularity of one nucleotide in the sequences under study. However, a deeper comparative analysis of the sequences requires estimating regularity for all four nucleotides simultaneously. To do this, we first calculate the statistic Z (formula (1)) for each nucleotide separately. Let us designate the values obtained for a, t, c and g as Za, Zt, Zc and Zg, respectively. All these variables have a distribution close to the normal one, as will be shown below. Let us use the properties of the normal distribution to obtain a new variable whose distribution is also normal. We obtain:

\[
Z_{sum} = \frac{Z_a + Z_t + Z_c + Z_g}{\sqrt{4}} = \frac{Z_a + Z_t + Z_c + Z_g}{2} \sim N(0;1) \tag{2}
\]

The search for regularity was conducted as follows. The DNA sequence was scanned with a window of length 500 nucleotides (the length of the promoter sequences under consideration). A period length n was chosen from the range 2–16, and four sequences S′ (one for each nucleotide) were built for the sequence S in the window by introducing the symbols F. For each of these four sequences the maximal value of Z was determined as described above. Then the value Zsum was calculated from the obtained values of Za, Zt, Zc and Zg by using formula (2).

We tried to determine the borders of the regular subsequence as exactly as possible to make future investigation of its function easier. So the search for the regular subsequence with the maximal value of Zsum within a window was performed by changing the borders of the analyzed sequence: its left and right borders were moved independently with a step of 2, and the statistic was calculated for each such subsequence. Thus, subsequences of the given window with lengths from 50 to 500 nucleotides were considered. The results included all non-overlapping subsequences of the window for which the value of Zsum exceeded the threshold. Subsequences shorter than 50 nucleotides were not considered because the sample size in this case was not large enough to give reliable results at the chosen level of statistical significance.
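One way to enumerate the candidate subsequences of a window is sketched below. The exact starting positions of the two borders are not given in the text, so the half-open (left, right) convention and the starting offsets used here are assumptions, and `candidate_subsequences` is a hypothetical name.

```python
def candidate_subsequences(window_len=500, step=2, min_len=50):
    """Yield half-open (left, right) borders inside a window, each moved in steps of `step`."""
    for left in range(0, window_len - min_len + 1, step):
        for right in range(left + min_len, window_len + 1, step):
            yield left, right


pairs = list(candidate_subsequences())
print(len(pairs))  # number of subsequences scored per window and period length
```

Under these assumptions each window yields 25651 candidate subsequences per period length, which illustrates why a step of 2 was a compromise between speed and resolution.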

The scanning parameters (window = 500, step inside a window = 2) were chosen, on the one hand, to provide a reasonable rate of calculations and, on the other hand, to reveal all statistically significant regular subsequences in the DNA sequences considered.

Since it is interesting to investigate the regions of a sequence that possess regularity not only for all four nucleotides but also for any subset of them (including regularity for a single nucleotide), we calculated the values of the statistic for all possible combinations of nucleotides for the considered subsequence (using formulas analogous to (2)) and then chose the maximal value among them.

If we consider only the statistic for all four nucleotides, then a negative value for one nucleotide can decrease the value of Zsum below the threshold. For example, for the case Za = −4, Zt = 3, Zc = 3 and Zg = 3 we obtain Zsum = 2.5, whereas Ztcg = 5.2, i.e., there is a regularity for three nucleotides. Thus performing the calculations for all nucleotide combinations reduced the influence of negative statistic values for some nucleotides on the ability to reveal regularity.
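The joint statistic and the search over nucleotide combinations can be sketched as follows. It is assumed here that the "formulas analogous to (2)" divide the sum of the selected Z values by the square root of their number, which reproduces both numbers in the example above; `joint_z` and `best_combination` are hypothetical names.

```python
from itertools import combinations
from math import sqrt


def joint_z(z_values):
    """Sum of Z values divided by the square root of their count, as in formula (2)."""
    values = list(z_values)
    return sum(values) / sqrt(len(values))


def best_combination(z_by_nucleotide):
    """Return the nucleotide subset with the largest joint statistic and that value."""
    best_subset, best_value = None, float("-inf")
    for size in range(1, len(z_by_nucleotide) + 1):
        for subset in combinations(sorted(z_by_nucleotide), size):
            value = joint_z(z_by_nucleotide[q] for q in subset)
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value


z = {"a": -4.0, "t": 3.0, "c": 3.0, "g": 3.0}
print(joint_z(z.values()))   # 2.5 for all four nucleotides
print(best_combination(z))   # the subset {c, g, t} with a value of about 5.2
```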

2.3. Building of the Regularity Scheme

It is evident that using the statistical significance Z as the only characteristic of regular sequences does not allow regularity types to be revealed and compared. Yet such a characteristic is necessary, especially for investigating the biological role of regularity. We propose to use the regularity scheme as such a characteristic.

A “regularity scheme” is a schematic representation of the arrangement of F symbols in the sequence for which the nucleotide distribution turned out to be the “most non-random”, that is, for which the value of Zsum was maximal. For example, the scheme given below shows that the maximal value of the test statistic has been obtained for the sequence S′ having nucleotide “a” in any period position from the 1st to the 4th nucleotide, and also in some of the 5th and 6th positions. Nucleotide “c” may occur in the 1st–3rd and 4th–6th positions of the period, etc. In this case there exists regularity for three nucleotides (“a, c, g”), while the period length equals 6 nucleotides.

a  F----F--F
c  F---F---F
t  F-F--F---F
g  F-F-F---F-F

2.4. Distribution Density of Z for Random Sequences. Choosing the Threshold Value of Zsum for Searching Regular Sequences in Promoters

The results obtained by simulation modeling need to be checked by applying the method to random sequences. This check allows the threshold value of the test statistic to be chosen so that the probability of revealing regularity in a random sequence at a significance level above the threshold tends to zero. Besides, building the spectrum of the test statistic for random sequences makes it possible to estimate the validity of using the normal approximation for the calculation of Zsum.

The spectrum of the statistic Z for one nucleotide was built for a sequence produced by a random number generator (the period of the generator was ∼2×10^18) for period lengths from 2 to 16 nucleotides. The length of the sequence was 10 million nucleotides (i.e., approximately 10 times larger than the total length of the analyzed promoter sequences). The frequencies of occurrence of each nucleotide were equal to those in the analyzed set of promoters. The generated sequence was analyzed by using a scanning window as described above. For example, for the spectrum obtained for the period length n = 4, the mean value of Z was 0.0124 and the standard deviation was 1.1267; no values Z ≥ 5.0 were found. The number of Z values greater than 4.0 was 3, and the maximal value was 4.1982. The results obtained for other period lengths were analogous.

In addition, the spectrum of the joint statistic Zsum was built for the same random sequence in order to check the validity of using the normal approximation for its calculation. The joint statistic was calculated for all four nucleotides and for all possible combinations of two and three nucleotides (i.e., in the same way as for searching for regularity in promoters). The distribution of the statistic values in all cases is analogous to the one obtained for single nucleotides.

Therefore, the threshold value Z = 4.0 ensures that not more than one random sequence will be found to be regular per 1 million nucleotides (the total length of the promoters analyzed).

To get a more accurate estimate, we performed the same calculations for a random sequence 100 million nucleotides long. In this case 39 regular regions with Zsum ≥ 4.0 were found, and the maximal value was 4.6. Using these results, we can also conclude that not more than one DNA sequence will be considered regular by chance in the analyzed set of promoters when our algorithm is used with the threshold value Zsum = 4.0. Let us also note that no sequences with Zsum ≥ 5.0 were revealed.

So, we consider a DNA sequence from the analyzed set of promoters to be regular if its statistic value is Zsum ≥ 4.0. For the reasons stated in this section, we believe this value to be relevant for real data.
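As a rough cross-check of this threshold, the count of 39 chance regions in 10^8 random nucleotides can be rescaled to the size of the promoter set (2236 sequences of 500 nucleotides). The calculation below assumes that the number of chance findings scales linearly with sequence length, which is an assumption of this sketch rather than a result of the simulations:

```latex
\[
2236 \times 500 = 1.118 \times 10^{6}, \qquad
39 \times \frac{1.118 \times 10^{6}}{10^{8}} \approx 0.44 < 1 .
\]
```

Under that assumption, fewer than one promoter sequence is expected to pass the threshold Zsum = 4.0 by chance, in agreement with the conclusion stated above.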

2.5. An Algorithm of Filtering the Results

When the window borders are changed in the range 50–500 with a step of 2, it is possible to reveal sequences that possess regularity at a statistically significant level and overlap with each other. Such an overlap (either full or partial) may occur when the same sequence is revealed at different scanning window positions. To prevent this, if some sequences overlapped each other by more than 80%, we kept only the sequence with the greatest value of Zsum.
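A possible form of this overlap filter is sketched below. The text does not state relative to which length the 80% overlap is measured or in which order hits are compared; this sketch assumes the overlap is taken relative to the shorter of the two sequences and processes hits greedily from the highest Zsum downwards. The names `overlap_fraction` and `filter_overlaps` are hypothetical.

```python
def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter interval (half-open borders)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return max(shared, 0) / shorter


def filter_overlaps(hits, max_overlap=0.8):
    """hits: list of (start, end, z_sum). Keep the higher-scoring hit of heavily overlapping pairs."""
    kept = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(overlap_fraction(hit, other) <= max_overlap for other in kept):
            kept.append(hit)
    return sorted(kept)


hits = [(0, 100, 4.5), (10, 105, 5.2), (300, 380, 4.1)]
print(filter_overlaps(hits))  # [(10, 105, 5.2), (300, 380, 4.1)]
```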

We use the runs test to search for regularity with a period length n in DNA sequences. However, the fact that Zsum for some sequence is greater than the threshold for some period length n does not allow an unambiguous conclusion that regularity is present exactly for this length. The point is that the regularity for the length n may be caused, for example, by the presence of regularity for multiple periods, i.e., for the period lengths kn, where k = 2, 3, .... In this case the period length n may give a statistical significance greater than the threshold, while the real regularity occurs for the lengths kn. We consider that there is regularity with a period length n in a DNA sequence if the value of Zsum for this length is greater than the threshold and the values of the test statistic for all other analyzed period lengths are smaller (they may either exceed the threshold or not). Thus we calculated the value of Zsum for all period lengths in the range from 2 to 16 nucleotides for the sequence found. If the maximal value of Zsum did not correspond to the period length n, then the sequence was excluded from consideration.
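The period-length check described above amounts to a simple comparison, sketched here. The function name `passes_period_filter` and the dictionary layout are hypothetical, and the comparison with the threshold uses Zsum ≥ 4.0 as in Section 2.4; how exact ties between period lengths were handled is not stated, so this sketch simply takes the first maximum.

```python
def passes_period_filter(z_sum_by_period, n, threshold=4.0):
    """True if period length n reaches the threshold and has the largest Zsum of all lengths."""
    best_period = max(z_sum_by_period, key=z_sum_by_period.get)
    return z_sum_by_period[n] >= threshold and best_period == n


# Hypothetical Zsum values for one sequence at three period lengths.
scores = {5: 4.3, 10: 5.6, 15: 4.1}
print(passes_period_filter(scores, 10))  # True: 10 has the maximal Zsum
print(passes_period_filter(scores, 5))   # False: the maximum lies at a multiple of 5
```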

In addition, we performed one more filtration step to maximize the reliability of the results. For each of the sequences revealed to be regular we built the information decomposition spectrum (Korotkov et al., 2003), which shows the statistical significance of the found period, expressed in terms of Z (formula (1)), against the period length. Information decomposition reveals periodicity without indels in nucleotide sequences, but it can also reveal latent periodicity (Korotkov et al., 2003). Let the period having the maximal value of Z be referred to as the “maximal period”. We consider that the maximal period really exists in a sequence if Z ≥ 4.0 for this period in the information decomposition spectrum. When such a maximal period existed in a sequence and its length did not correspond to the length of regularity n, we excluded the analyzed DNA sequence from consideration because it was most likely that this length of regularity did not correspond to the maximal value of Zsum.

The set of sequences that passed through all steps of filtration was considered as the set of final results.

2.6. Choosing of Promoter Sequences to be Analyzed for Regularity

The source promoter sequences were obtained from the EPD (Eukaryotic Promoter Database) (Schmid et al., 2006), version 93 (the total number of sequences contained in the database, excluding the preliminary data, was 4809). We chose 2236 sequences representing all groups of organisms. To do this, we used the database option “representative set of not closely related sequences”. The use of this option ensures that the set of selected sequences does not contain any two sequences with a pairwise similarity of more than 50%. The range of promoter regions was (−499, +1), where +1 corresponds to the transcription initiation site. Since we were searching for regularity only in promoter sequences, that is, not including the genes, the analyzed set of promoters included 2236 sequences, each 500 nucleotides long. For each of these sequences the calculations described above were performed.

Fig. 2. An example of the periodic sequence revealed by the method of information decomposition which was also revealed as regular. Here the periodic sequence (shown

in bold) is a part of regular sequence (A). An alignment of the modified and initial sequences. Similarity level is 61.4% (B).

3. Results

3.1. Search of a Regularity in Artificial Periodic Sequence with Indels

Let us consider an example of a regular nucleotide sequence. The method we developed reveals regularity in it at a statistically significant level, while other methods (e.g., Korotkov et al., 2003; Benson, 1999; Sharma et al., 2004), including the one based on Fourier transform (Sharma et al., 2004), are not able to reveal the periodicity.

A sequence of length 400 possessing perfect periodicity with the period length 5 was chosen as the initial one. It contains 80 copies of the {attcg} subsequence. Then indels and substitutions simulating mutations were made in it by using a random number generator. After each change of the sequence, the value Zsum was calculated for n = 5. After 80 indels, the following sequence was obtained (length = 399):

atcgaggagtatcatttgatgcgatttacgatcgccatgtttgatttcgatattcggttgattcg

ttcgttcgatcctggatcgaaatggtcgaatcccagaatccgatcgactcgattggttcgattg

attcgatttcgatcgccattatcgtaaaagatcgatccggataagtttcggtcgggatactgtg

ataatctgatcgaaccttttcattcttgacgctgtcgggggggatgattcttggattcattcatt

atccgtcgacatgtcgatgtattcgagtttcatggattgatctgattcgttcgattaagattcttc

cgattcgattcgagttcgattcatagattgatcgggcgattcatttccccgattttttctttgcga

agatcttgc

An alignment of the initial and the newly obtained sequences is shown in Fig. 2B. The value of the test statistic for this new sequence was Zsum = 5.02 (Za = 3.41, Zt = 0.92, Zc = 2.49, Zg = 3.21). At the same time, the methods based on Fourier transform (Sharma et al., 2004), information decomposition (Korotkov et al., 2003) and dynamic programming (Benson, 1999) did not reveal periodicity with the period length n = 5 in this sequence at a statistically significant level. Thus the method we developed is less sensitive to nucleotide indels than these methods of searching for periodicity.

3.2. Regularity Search in Sequences Possessing the Latent Periodicity Revealed by Information Decomposition Method

We have also studied whether our method reveals regular subsequences in DNA sequences with latent periodicity found by the method of information decomposition (Korotkov et al., 2003). To do this, we searched for latent periodicity in the same set of promoters by using information decomposition. The total number of periodic sequences with a period length in the range from 2 to 16 was 109, and 62 of them were longer than 50 nucleotides (the minimal length for revealing regularity). All of these 62 sequences were revealed as regular with a length of regularity corresponding to the length of the latent period. Therefore, in this case the regularity search reveals the sequences possessing latent periodicity.

An example of a sequence revealed by both methods is shown in Fig. 2A.

3.3. An Example of a Regular Sequence Found in a Promoter

Let us give an example of a regular sequence revealed in a

promoter. The regularity with length of 10 nucleotides has been

revealed in locus EP77531 in the region (−113, −44). The test statis-

tic for it was Zsum=5.1 (Za=1.9, Zt=3.1, Zc=2.0, Zg=3.1). The DNA

sequence looked like:

tacactatcgatagccaactgtgcaatcgatagcgtgtcatctctgactcaaatgcactc-

gaatgcagcatgaccgttta

Additional symbols F were used for regularity identification. The sequences S? for the nucleotides a, t, c and g were:

a: F-F- - -F- - - -F—F

t: F-F- - - - - -F- - -F

c: F-F-F- - -F- - -F—F

g: F-F- - - - - -F- - -F


Table 1

The number of regular sequences found in promoters for the period lengths 2–16.

Period length    Number of regular sequences
2                111
3                227
4                56
5                118
6                131
7                142
8                91
9                185
10               363
11               374
12               296
13               189
14               163
15               151
16               136

It should be noted that no value of statistical significance Z for

single nucleotides exceeds the threshold value, that is, regularity is

a joint property of all four nucleotides of a promoter sequence.

3.4. Results of Searching Regular Sequences for a Period Length

2–16 in Promoters

We searched for regular subsequences in promoter sequences

from the EPD (Schmid et al., 2006) using the following scanning

parameters: window size=500, step for window border vari-

ation=2. The length of the regular subsequences revealed lay

in a range 50–500. The data regarding the number of revealed

sequences are given in Table 1.

In total, we analyzed 2236 promoters, and regularity was revealed at a statistically significant level in 1342 of them. Thus more than 60% of the promoters contain regular subsequences with a period length in the range from 2 to 16 nucleotides. Regularity has also been revealed in other promoter sequences, but its level of statistical significance did not exceed the threshold. Let us consider the distribution of the length of regular subsequences (Table 2) and their arrangement in promoters.

The longest regular sequence had a length of 484 bp. However, as can be seen from Table 2, the majority of the sequences had a length in the range 50–150. Since during filtration of overlapping sequences we selected the ones having the greater statistical significance, not length (i.e., we did not try to maximize the length of the sequence revealed), such a length distribution is to be expected. This selection process allows the regularity borders to be set more exactly, which, in turn, increases the reliability of the results obtained.

Table 2

Length distribution of the regular sequences revealed.

Length range (bp)    Number of sequences found
50–100               750
100–150              428
150–200              182
200–250              91
250–300              40
300–350              23
350–400              13
400–450              12
450–500              3
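The statement that most sequences are 50–150 bp long can be checked directly from the counts in Table 2. The short sketch below is only an illustration: the dictionary layout and the helper name `share_below` are our own assumptions and are not part of the original analysis.

```python
# Counts copied from Table 2: (lower, upper) length bound in bp -> number of sequences.
length_counts = {
    (50, 100): 750,
    (100, 150): 428,
    (150, 200): 182,
    (200, 250): 91,
    (250, 300): 40,
    (300, 350): 23,
    (350, 400): 13,
    (400, 450): 12,
    (450, 500): 3,
}


def share_below(counts, upper):
    """Fraction of sequences whose length range ends at or below `upper` bp."""
    total = sum(counts.values())
    selected = sum(n for (_, hi), n in counts.items() if hi <= upper)
    return selected / total


total_sequences = sum(length_counts.values())
short_share = share_below(length_counts, 150)
print(f"{total_sequences} sequences, {short_share:.1%} of them in the range 50-150 bp")
```

With the Table 2 counts, roughly three quarters of the sequences fall into the two shortest length ranges.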

Investigation of regular sequence arrangement in promoters is

more important for understanding the biological role of the reg-

ularity than sequence length distribution. We divided a promoter

sequence into the intervals of length equal to 10bp and counted

the number of sequences that fall into each of these intervals. Since

the minimal length of regular sequence was 50bp, it is evident that

each sequence fell into several intervals. First of all, we wanted to

know if the distribution of regular sequences along the promoters was random. If such a distribution has maxima or minima in certain regions of the promoter, the regularity may be connected to some biological function of these regions.
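The interval counting described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name `interval_counts`, the 1-based inclusive coordinates, and the promoter length of 500 are assumptions made for the example.

```python
def interval_counts(segments, promoter_len=500, bin_size=10):
    """Count how many segments overlap each interval (bin) of the promoter.

    segments: iterable of (start, end) pairs, 1-based and inclusive
    (a hypothetical coordinate convention chosen for this sketch).
    """
    n_bins = promoter_len // bin_size
    counts = [0] * n_bins
    for start, end in segments:
        first_bin = (start - 1) // bin_size
        last_bin = (end - 1) // bin_size
        # A sequence of 50 bp or more always spans several 10-bp intervals.
        for b in range(first_bin, last_bin + 1):
            counts[b] += 1
    return counts


# Two hypothetical regular sequences, 50 and 55 nucleotides long.
example_segments = [(1, 50), (46, 100)]
example_counts = interval_counts(example_segments)
print(example_counts[:10])  # [1, 1, 1, 1, 2, 1, 1, 1, 1, 1]
```

Because each sequence is counted once in every interval it touches, the per-interval counts sum to more than the number of sequences.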

We used the simulation modeling (Monte Carlo method) to

determine if the regular sequence distribution was random. To

do this, we randomly placed all the regular sequences that were

found along the promoter sequence. At this, the lengths of reg-

ular sequences corresponded to the ones in the set of revealed

sequences, e.g., if there had been 22 revealed sequences with the

length 96, then the same number of sequences having this length

were randomly placed. Practically, we used a random number gen-

erator to determine only the starting point of the sequence. Its

coordinate might vary in a range (1, 500−k), where k is the length of the sequence to be placed, i.e., the sequence could not go beyond the promoter borders. Such placement of sequences was repeated 200

times, and after each iteration the number of sequences fallen into

the intervals with length 10 was determined. Based on these data,

the mean value and the dispersion were calculated, and then the

statistical significance Z was calculated by using the formula anal-

ogous to (1). The values of Z obtained for each interval are shown in

Fig. 3. As shown in the figure, the number of the regular sequences revealed in the interval (−99, +1) relative to the beginning of a gene considerably exceeds the number expected for the random placement of the sequences. Starting from the −99th nucleotide, the value of Z exceeds 5.0. In addition, a deviation of Z in the negative direction was observed in the intervals (−389, −169). We think that this fact

Fig. 3. Deviation of the number of regular sequences from the number expected for a random distribution, in units of standard deviation. The right borders of the intervals are shown.


is caused by the prevalent localization of the regular regions in the

interval (−99, +1).
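The random-placement test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: formula (1) is not reproduced in this part of the paper, so Z is assumed here to be (observed − mean) / standard deviation; the function names, the coordinate convention and the toy input are hypothetical.

```python
import random
import statistics


def interval_counts(segments, promoter_len=500, bin_size=10):
    """Count how many (start, end) segments (1-based, inclusive) overlap each bin."""
    counts = [0] * (promoter_len // bin_size)
    for start, end in segments:
        for b in range((start - 1) // bin_size, (end - 1) // bin_size + 1):
            counts[b] += 1
    return counts


def random_placement(lengths, promoter_len, rng):
    """Place one segment per length; only the starting point is random."""
    segments = []
    for k in lengths:
        # Start varies in (1, promoter_len - k) as in the text; max() guards the
        # case k == promoter_len, which is an assumption of this sketch.
        start = rng.randint(1, max(1, promoter_len - k))
        segments.append((start, start + k - 1))
    return segments


def z_scores(observed_segments, promoter_len=500, bin_size=10, iterations=200, seed=0):
    """Z per interval: observed count against the random-placement distribution."""
    rng = random.Random(seed)
    lengths = [end - start + 1 for start, end in observed_segments]
    observed = interval_counts(observed_segments, promoter_len, bin_size)
    simulated = [
        interval_counts(random_placement(lengths, promoter_len, rng), promoter_len, bin_size)
        for _ in range(iterations)
    ]
    z = []
    for b, obs in enumerate(observed):
        column = [sim[b] for sim in simulated]
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        z.append((obs - mean) / sd if sd > 0 else 0.0)
    return z


# Toy input: 40 hypothetical sequences of length 60, all at the same place.
toy_segments = [(401, 460)] * 40
toy_z = z_scores(toy_segments)
print(f"max Z = {max(toy_z):.1f}, min Z = {min(toy_z):.1f}")
```

In the toy input all sequences cluster in one region, so Z is strongly positive there and negative elsewhere, which mirrors the qualitative pattern reported for the interval (−99, +1) and the intervals (−389, −169).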

4. Discussion

Currently high emphasis is placed on the investigation of pro-

moter sequences since they determine gene activity in eukaryotic

and prokaryotic cells. If it is possible to classify promoters by some

quantitative sequence properties and, after that, to find a connec-

tion between these properties and the gene expression in certain

cells or at a certain moment of organism development, then this may open the way to reconstruction of genetic networks in the cell based on the sequences of whole genomes (Fickett and Hatzigeorgiou, 1997; Werner, 2003). On the other hand, the improvement of algorithms for searching promoter sequences may make the gene search in eukaryotic genomes more accurate (Fickett and Hatzigeorgiou, 1997; Novichkov et al., 2001; Werner, 2003; Hertel, 2008), since the revealing of a promoter region points to the beginning of a gene. It is

important to develop new mathematical approaches to solve these

problems, which may reveal the new rules of promoter sequence

organization, and which will be used both for identification of pro-

moter sequences and for their classification. In the present work

we have introduced the “regularity” term and have found that reg-

ular sequences are mostly present in the region from −99 to +1

nucleotide relative to the site of transcription initiation. RNA polymerase usually binds to the promoter in the region (−45, +5). In total, several tens of proteins are involved in transcription complex formation. According to previously obtained data (Zhang, 2007), the promoter region (−38, +5) contains the binding sites of TFIIB (−38, −32), TBP (−31, −24), TFIIB (−23, −17) and TAF1 (−2, +5). The distribution of regular sequences for each period length along 10-nucleotide intervals of the region (−99, +1) is shown in Table 3. A regular sequence spans

several intervals. Each value of the Table 3 represents the number

of regular sequences of some period found in the given interval. For

example, the number of sequences with the period length equal

to 10 in the interval (−59, −50) was 60 (of course, each of these

sequences also appears in some other intervals since its minimal

length is 50). Comparing the previously obtained data with Table 3

and Fig. 3, we see that the binding regions of transcription factors

and RNA polymerase are entirely overlapped with the regions in

which the number of the regular sequences has exceeded expected

values. Based on this, we can suppose that the sequence regularity is important for binding of these proteins to DNA. It is likely that a certain regular interchange of nucleotides is necessary for such binding (Kutuzova et al., 1997, 1999; Ioshikhes et al., 1999). This interchange may be connected with the evolutionary relationship of various transcription factors, which leads to similarity of the DNA sequences to which these factors are able to bind. Such similarity may produce regularity. The regularity in the region (−99, +1) has been revealed at a statistically significant level for approximately 60% of the promoter sequences under study, while for the other 40% it has also been revealed, but the significance level was lower than the threshold value 4.0. This may be due to differences in the set of transcription factors for different promoters. Different sets, in turn, may lead to different statistical significance and regularity length in the region (−99, +1).

Besides, the revealed regularity may be connected with a DNA

molecule bend in the region of RNA polymerase binding (Mizuno,

1987; Tchernaenko et al., 2008). The regular interchange of nucleotides may cause DNA bending, possibly in order to facilitate transcription complex formation (Ozoline et al., 1999; Bolshoy and Nevo, 2000).

Periodicity of promoter sequences was previously investigated

by the methods based on Fourier transform (Ioshikhes et al., 1999;

Kutuzova et al., 1997, 1999; Bolshoy and Nevo, 2000). Results

obtained in the papers (Kutuzova et al., 1997, 1999) show that there

is a periodicity with a period length from 6 to 8 nucleotides in the

region (−99, +1). The authors suppose that the periodicity revealed is mainly concerned with interaction of an RNA polymerase with the region from −99 to +1 (core promoter). These conclusions are in complete accordance with the data obtained in the present work.

In (Ioshikhes et al., 1999; Bolshoy and Nevo, 2000) periodicity of

promoter regions was also investigated by Fourier transform, as a

result of which the periodicity with a period length between 10

and 11 nucleotides has been revealed. The authors concluded that

such type of periodicity is specific for nucleosome binding near +1

position. It was also noted that this periodicity occurred because of

regular interchange of AA and TT dinucleotides (Herzel et al., 1999;

Cohanim et al., 2006; Salih et al., 2007). These data are also in good

accordance with the results obtained by us, since it is clear from

Table 3 that the main contribution to the observed regularity in the

interval (−99, +1) is made by exactly the period length from 10 to

12bp.
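The contribution of the period lengths 10–12 can be quantified from the counts in Table 3. The sketch below is illustrative only: the data layout and the helper name `period_share` are assumptions, and because a sequence is counted in every interval it spans, the shares describe interval occupancy, not numbers of distinct sequences.

```python
# Columns of Table 3, from (-99; -90) on the left to (-9; +1) on the right.
intervals = ["-99;-90", "-89;-80", "-79;-70", "-69;-60", "-59;-50",
             "-49;-40", "-39;-30", "-29;-20", "-19;-10", "-9;+1"]

# Period length -> counts of regular sequences in each interval (copied from Table 3).
table3 = {
    2:  [28, 28, 27, 26, 21, 19, 15, 14, 7, 5],
    3:  [43, 44, 47, 48, 44, 43, 32, 25, 18, 12],
    4:  [9, 9, 8, 8, 9, 10, 7, 7, 7, 4],
    5:  [9, 7, 6, 6, 7, 6, 4, 4, 4, 4],
    6:  [9, 9, 6, 8, 9, 8, 7, 6, 3, 1],
    7:  [15, 11, 11, 10, 10, 10, 9, 9, 8, 8],
    8:  [8, 9, 9, 11, 9, 9, 9, 7, 4, 2],
    9:  [15, 15, 19, 20, 19, 18, 14, 10, 4, 3],
    10: [66, 69, 64, 63, 60, 54, 43, 34, 20, 11],
    11: [68, 65, 62, 66, 62, 57, 42, 35, 20, 14],
    12: [46, 44, 45, 45, 40, 40, 37, 31, 17, 10],
    13: [50, 49, 51, 48, 46, 44, 30, 21, 14, 8],
    14: [29, 28, 28, 28, 26, 22, 18, 13, 7, 5],
    15: [32, 33, 34, 32, 30, 30, 25, 23, 16, 9],
    16: [27, 27, 27, 26, 21, 17, 16, 10, 5, 3],
}


def period_share(table, periods, column):
    """Share of the counts in one interval that belongs to the given period lengths."""
    total = sum(row[column] for row in table.values())
    selected = sum(table[p][column] for p in periods)
    return selected / total


shares = [period_share(table3, (10, 11, 12), c) for c in range(len(intervals))]
for name, share in zip(intervals, shares):
    print(f"({name}): periods 10-12 account for {share:.1%} of the counts")
```

Three of the fifteen period lengths would account for 20% of the counts if all lengths contributed equally, so shares well above that level indicate over-representation of the periods 10–12.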

The only difference is that the mathematical approach we developed reveals the regularity individually for each promoter, while the use of Fourier transform can reveal periodicity with a period length from 10 to 11 only for a certain set of promoter sequences (Ioshikhes et al., 1999). This suggests that our approach is more sensitive and thus can be used for developing recognition algorithms for individual promoter sequences.

Taken together, the fact that the largest part of regular sequences in the region (−99, +1) possesses a regularity with a length of 10–12 nucleotides suggests that the nature of regularity differs for different regularity lengths. Comparatively short regularity (2–6 nucleotides), as noted above, may be important for binding both RNA polymerase II and transcription factors, while the one

Table 3

Distribution of regular sequences by the period length in the region (−99, +1). Rows are period lengths; columns are 10-nucleotide intervals.

Period length   −99;−90  −89;−80  −79;−70  −69;−60  −59;−50  −49;−40  −39;−30  −29;−20  −19;−10  −9;+1
2               28       28       27       26       21       19       15       14       7        5
3               43       44       47       48       44       43       32       25       18       12
4               9        9        8        8        9        10       7        7        7        4
5               9        7        6        6        7        6        4        4        4        4
6               9        9        6        8        9        8        7        6        3        1
7               15       11       11       10       10       10       9        9        8        8
8               8        9        9        11       9        9        9        7        4        2
9               15       15       19       20       19       18       14       10       4        3
10              66       69       64       63       60       54       43       34       20       11
11              68       65       62       66       62       57       42       35       20       14
12              46       44       45       45       40       40       37       31       17       10
13              50       49       51       48       46       44       30       21       14       8
14              29       28       28       28       26       22       18       13       7        5
15              32       33       34       32       30       30       25       23       16       9
16              27       27       27       26       21       17       16       10       5        3


with a length of 10–12 nucleotides may be concerned with DNA bends in the promoter region. Such a bend may provide the nucleosome positioning and/or facilitate the binding of RNA polymerase near the +1 position.

The testing of the developed mathematical approach has shown that it is able to reveal both perfect and latent periodicity with a comparatively small number of indels. However, the method could not find the regions where long nucleotide insertions have occurred. In this case the sequences of codes K(i) may contain a rather large number of empty “boxes”, which leads to a substantial decrease in statistical significance for such a region. This may be the reason why the regularity in some promoter sequences in the region (−99, +1) was revealed with Z<4.0.

Our method is not seriously affected by transpositions of neigh-

boring nucleotides. Within the limits of one “box” in a sequence of

codes K(i) the nucleotide position is not important for the method,

thus the transpositions of the type AT→TA and similar ones do not

influence the possibility of revealing the regularity in nucleotide sequences. Taking such transpositions into account is important for revealing the regions of protein binding to DNA, since, due to a certain conformational mobility of the proteins, this process may require the presence of certain nucleotides not in a specific promoter position, but in the vicinity of this position (Fickett and Hatzigeorgiou, 1997; Werner, 2003). Such transpositions cannot be taken into account by the methods based on dynamic programming when nucleotide sequences are compared. The presence of the pair AT against the pair TA in an alignment will be considered as a lack of similarity, which will decrease the weight of the final alignment.

The results obtained by us may be useful for identification

of potential promoter sequences in eukaryotic genomes. Ear-

lier multiple algorithms of searching and identifying promoter

sequences have been proposed (Pedersen et al., 1999; Fickett and

Hatzigeorgiou, 1997; Werner, 2003; Ohler and Niemann, 2001;

Anwar et al., 2008; Hutchinson, 1996; Prestridge, 1995; Reese,

2001). Most of these approaches are based on searching certain

motifs in DNA sequences, wherein the mathematical methods of

motif identification can be different. The common result of these

works is that it is possible to reveal the significant part of known

promoters (true positives), while the number of incorrect predic-

tions (false positives) remains rather high. Usually it is possible to

reveal more than 50% of existing promoters with a background

of one false prediction per about 1000 DNA bases (Fickett and

Hatzigeorgiou, 1997). We think that a regularity revealed by us can

lower the number of false positive predictions. To achieve this, we

should reveal regularity in all known promoter sequences and try

to classify the regularity revealed for each DNA base of each known promoter. Then it is possible to develop a method of regularity class searching in genome sequences based on the classes obtained earlier. The algorithms of promoter identification developed earlier

may be applied only to the DNA sequences in which the regularity

of certain class has been found. We suppose that such a combined

approach can significantly (by several times) reduce the number of

false positives which are revealed currently.

On the whole, our work shows that the regular structure is likely to be an intrinsic property of DNA promoter regions.

References

Anwar, F., Baker, S.M., Jabid, T., Hasan, M.M., Shoyaib, M., Khan, H., Walshe, R., 2008.

Pol II promoter prediction using characteristic 4-mer motifs: a machine learning

approach. BMC Bioinform. 9, 414–422.

Bajic, V.B., Chong, A., Seah, S.H., Brusic, V., 2002. An intelligent system for vertebrate

promoter recognition. IEEE Intell. Syst. Mag. 17, 64–70.

Bajic, V.B., Seah, S.H., 2003. Dragon gene start finder: an advanced system for finding

approximate locations of the start of gene transcriptional units. Genome Res. 13,

1923–1929.

Bajic, V.B., Tan, S.L., Suzuki, Y., Sugano, S., 2004. Promoter prediction analysis on the

whole human genome. Nat. Biotechnol. 22, 1467–1473.

Benson, G., 1999. Tandem repeats finder: a program to analyze DNA sequences.

Nucleic Acids Res. 27, 573–580.

Bolshoy, A., Nevo, E., 2000. Ecologic genomics of DNA: upstream bending in prokaryotic promoters. Genome Res. 10, 1185–1193.

Brownlee, K.A., 1965. Statistical Theory and Methodology in Science and Engineering, second ed. Wiley, New York.

Claverie, J.M., 1997. Computational methods for the identification of genes in verte-

brate genomic sequences. Hum. Mol. Genet. 6, 1735–1744.

Cohanim, A.B., Trifonov, E.N., Kashi, Y., 2006. Specific selection pressure at the

third codon positions: contribution to 10- to 11-base periodicity in prokaryotic

genomes. J. Mol. Evol. 63, 393–400.

Davuluri, R.V., Grosse, I., Zhang, M.Q., 2001. Computational identification of promot-

ers and first exons in the human genome. Nat. Genet. 29, 412–417.

Dieterich, C., et al., 2005. Comparative promoter region analysis powered by CORG.

BMC Genomics 6, 24.

Fickett, J.W., Hatzigeorgiou, A.G., 1997. Eukaryotic promoter recognition. Genome

Res. 7, 861–878.

Gershenzon, N.I., Trifonov, E.N., Ioshikhes, I.P., 2006. The features of Drosophila core

promoters revealed by statistical analysis. BMC Genomics 7, 161.

Hertel, K.J., 2008. Combinatorial control of exon recognition. J. Biol. Chem. 283,

1211–1215.

Herzel, H., Weiss, O., Trifonov, E.N., 1999. 10–11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15, 187–193.

Hoel, P.G., 1966. Introduction to Mathematical Statistics, third ed. Wiley, New York.

Hutchinson, G.B., 1996. The prediction of vertebrate promoter regions using

differential hexamer frequency analysis. Comput. Appl. Biosci. 12, 391–

398.

Ioshikhes, I., Trifonov, E.N., Zhang, M.Q., 1999. Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc. Natl. Acad. Sci. U.S.A. 96, 2891–2895.

Knudsen, S., 1999. Promoter2.0: for the recognition of PolII promoter sequences. Bioinformatics 15, 356–361.

Konopka, A.K., 1994. Sequence and codes: fundamental of biomolecular cryptology.

In: Smith, D. (Ed.), Biocomputing: Informatics and Genome Projects. Academic

Press, San Diego, pp. 119–174.

Konopka, A.K., Martindale, C., 1995. Non-coding DNA, Zipf’s law, and language. Sci-

ence 268, 789.

Korotkov, E.V., Korotkova, M.A., Kudryashov, N.A., 2003. Information decomposition

method to analyze symbolical sequences. Phys. Lett. A 312, 198–210.

Kutuzova, G.I., Frank, G.K., Esipova, N.G., Makeev, V.Iu., Polozov, R.V., 1997. Fourier

analysis of nucleotide sequences, Periodicity in E. coli promoter sequences.

Biofizika 42, 354–362.

Kutuzova, G.I., Frank, G.K., Esipova, N.G., Makeev, V.Iu., Polozov, R.V., 1999. Periodicity in contacts of RNA-polymerase with promoters. Biofizika 44, 216–223.

Laskin, A.A., Kudryashov, N.A., Skryabin, K.G., Korotkov, E.V., 2005. Latent periodic-

ity of serine-threonine and tyrosine protein kinases and other protein families.

Comput. Biol. Chem. 29, 229–243.

Lobzin, V.V., Chechetkin, V.R., 2000. The order and correlations in genomic DNA

sequences. Spectral approach. Uspehi Fizicheskih Nauk (Russian) 170, 57–81.

Matsuyama, Y., Kawamura, R., 2004. Promoter recognition for E. coli DNA segments by independent component analysis. Proc. Comput. Syst. Bioinform. Conf., 686–691.

Mizuno, T., 1987. Static bend of DNA helix at the activator recognition site of the

ompF promoter in Escherichia coli. Gene 54, 57–64.

Novichkov, P.S., Gelfand, M.S., Mironov, A.A., 2001. Gene recognition in eukaryotic

DNA by comparison of genomic sequences. Bioinformatics 17, 1011–1018.

Ohler, U., Liao, G.C., Niemann, H., Rubin, G.M., 2002. Computational analysis of

core promoters in the Drosophila genome. Genome Biol. 3, 12, research0087.1-

0087.12.

Ohler, U., Niemann, H., 2001. Identification and analysis of eukaryotic promoters:

recent computational approaches. Trends Genet. 17, 56–60.

Ozoline, O.N., Deev, A.A., Trifonov, E.N., 1999. DNA bendability—a novel feature in E.

coli promoter recognition. J. Biomol. Struct. Dyn. 16, 825–831.

Pedersen, A.G., Baldi, P., Chauvin, Y., Brunak, S., 1999. The biology of Eukaryotic

promoter prediction: a review. Comp. Chem. 23, 191–207.

Prestridge, D.S., 1995. Predicting Pol II promoter sequences using transcriptional

factor binding sites. J. Mol. Biol. 249, 923–932.

Reese, M.G., 2001. Application of a time-delay neural network to promoter annota-

tion in the Drosophila melanogaster genome. Comput. Chem. 26, 51–56.

Salih, F., Salih, B., Trifonov, E.N., 2007. Sequence-directed mapping of nucleosome

positions. J. Biomol. Struct. Dyn. 24, 489–493.

Scherf, M., Klingenhoff, A., Werner, T., 2000. Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol. 297, 599–606.

Schmid, C.D., Perier, R., Praz, V., Bucher, P., 2006. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 34, D82–D85.

Sharma, D., Issac, B., Raghava, G.P., Ramaswamy, R., 2004. Spectral repeat finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics 20, 1405–1412.

Shelenkov, A.A., Skryabin, K.G., Korotkov, E.V., 2006. Search and classification of

potential minisatellite sequences from bacterial genomes. DNA Res. 13, 89–102.

Shelenkov, A.A., Korotkov, A.E., Korotkov, E.V., 2008. MMsat—a database of potential

micro- and mini-satellites. Gene 409, 53–60.

Sheskin, D.J., 2000. Handbook of Parametric and Non-parametric Statistical Proce-

dures, second ed. Chapman & Hall/CRC, New York.


Solovyev, V.V., Shahmuradov, I.A., 2003. PromH: promoters identification using

orthologous genomic sequences. Nucleic Acids Res. 31, 3540–3545.

Tchernaenko, V., Radlinska, M., Lubkowska, L., Halvorson, H.R., Kashlev, M., Lut-

ter, L.C., 2008. DNA bending in transcription initiation. Biochemistry 47, 1885–

1895.

Werner, T., 2003. The state of the art of mammalian promoter recognition. Brief

Bioinform. 4, 22–30.

Xie, X., Wu, S., Lam, K.-M., Yan, H., 2006. Promoterexplorer: an effective promoter

identification method based on the AdaBoost algorithm. Bioinformatics 22,

2722–2728.

Zhang, M.Q., 2007. Computational analyses of eukaryotic promoters. BMC Bioinform. 8 (Suppl. 6), S3.