Short Communication
An application of nonlinear optimization in molecular biology
J.G. Ecker (a), M. Kupferschmid (c,*), C.E. Lawrence (b), A.A. Reilly (b), A.C.H. Scott (d)
(a) Department of Mathematical Sciences, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, USA
(b) New York State Department of Health, Wadsworth Biometric Laboratory, Albany, NY 12201, USA
(c) Academic Computing Services, Alan M. Voorhees Computing Center, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, USA
(d) Aereal Inc., San Francisco, CA 94107, USA
Received 1 September 1999; accepted 16 February 2001
Abstract
A maximum likelihood approach has been proposed for finding protein binding sites on strands of DNA [G.D.
Stormo, G.W. Hartzell, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183]. We formulate an
optimization model for the problem and present calculations with experimental sequence data to study the behavior of
this site identification method. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Optimization; Nonlinear programming; Molecular biology; Protein binding; Maximum likelihood
1. The problem
Suppose we are given several letter sequences each 105 positions long, and that each position contains a
letter from the set {A,T,C,G}. One such sequence is shown below, with dots beneath the letters in positions
10, 20, etc.
Within each sequence there is a subsequence of length 16, called a site, having the pattern
$$L_1 L_2 L_3 L_4 L_5 \,\cdot\,\cdot\,\cdot\,\cdot\,\cdot\,\cdot\, L_6 L_7 L_8 L_9 L_{10}$$
where the six middle positions, shown here as dots, can be any pattern of the four letters. If the data were perfect, all of the sites would have the same ideal or consensus pattern of $L_i$'s, but in the data we are given, most sites vary slightly from the consensus pattern and some vary significantly. A site can be located anywhere within a sequence, and the sites in the different sequences can be located differently.
The problem is to find the $L_i$'s of the consensus site pattern and the locations of the sites in the given sequences of imperfect data.
European Journal of Operational Research 138 (2002) 452–458
www.elsevier.com/locate/dsw
* Corresponding author. Tel.: +1-518-276-6558; fax: +1-518-276-2809.
E-mail address: kupfem@rpi.edu (M. Kupferschmid).
0377-2217/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0377-2217(01)00122-9
2. A probabilistic model
In addition to having the pattern shown above, sites differ from the nonsite parts of the sequences in that the frequencies of the letters in the $L_i$ positions are different from those in other positions (that is, positions outside the site and the middle six positions within the site). A maximum likelihood approach has been proposed [7] to exploit these frequency differences to identify the sites.
Let $w_1$, $w_2$, $w_3$, and $w_4$ be, respectively, the probabilities of there being an A, T, C, or G in any position outside a site or in the middle six positions, and $w_{4i+1}$, $w_{4i+2}$, $w_{4i+3}$, and $w_{4i+4}$ the probabilities of an A, T, C, or G in the $L_i$ position of each site, for $i = 1, \ldots, 10$. We assume that the letter occurrences are statistically independent. Thus there are 44 variables $w_j$ in this problem, and $105 - 16 + 1 = 90$ possible site starting positions $s$ in each 105-position sequence.
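The indexing of the $w_j$ described above can be sketched in Python as follows. The names `LETTERS`, `w_index`, and `n_start_positions` are illustrative choices for this sketch and are not notation from the model itself; the letter order A, T, C, G is the order in which the text lists the probabilities.

```python
LETTERS = "ATCG"   # order in which the text lists the probabilities
SEQ_LEN = 105      # positions per sequence
SITE_LEN = 16      # 5 pattern positions + 6 free positions + 5 pattern positions


def n_start_positions(seq_len=SEQ_LEN, site_len=SITE_LEN):
    """Number of positions at which a site of site_len can start."""
    return seq_len - site_len + 1


def w_index(letter, i=0):
    """1-based index j of w_j for a letter.

    i = 0 means a position outside the site or in its middle six positions;
    i = 1..10 means the L_i position of the site.
    """
    return 4 * i + LETTERS.index(letter) + 1
```

For example, `w_index("T", 1)` gives 6, matching the use of $w_6$ for a T in the $L_1$ position.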
If we knew the probabilities $w_1, \ldots, w_{44}$, we could calculate (as illustrated in Section 3) the probability of an observed sequence given that a binding site starts in position $s$. Letting $S_s$ be the event that a binding site begins in position $s$, we have
$\Pr\{\text{sequence} \mid S_s\}$ = probability of the sequence data given that a binding site starts in position $s$;
$\Pr\{S_s \mid \text{sequence}\}$ = probability that a binding site starts in position $s$ given the sequence data;
$\Pr\{S_s\}$ = probability that a binding site starts in position $s$.
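We assume the starting positions are equally likely, so $\Pr\{S_s\} = \frac{1}{90}$ for $s = 1, \ldots, 90$. Then from Bayes' theorem,
$$\Pr\{S_s \mid \text{sequence}\} = \frac{\Pr\{\text{sequence} \mid S_s\}\,\Pr\{S_s\}}{\Pr\{\text{sequence}\}} = \frac{\Pr\{\text{sequence} \mid S_s\}}{\sum_{s'=1}^{90} \Pr\{\text{sequence} \mid S_{s'}\}}.$$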
For each sequence we can then select likely starting positions $s$ for the binding sites based on the probabilities $\Pr\{S_s \mid \text{sequence}\}$. Of course, the $w_{4i+1}$, $w_{4i+2}$, $w_{4i+3}$, and $w_{4i+4}$ give the probabilities of an A, T, C, or G in the $L_i$ positions, so knowing the $w_j$ we can identify the letter most likely to be in the $L_i$ position.
This probabilistic model thus yields estimates of the starting positions and the letter in each $L_i$ position of the site, which is the problem of Section 1. How can we find the $w_j$?
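The Bayes' theorem step with a uniform prior amounts to normalizing the conditional probabilities over the starting positions. A minimal Python sketch follows; it assumes the values $\ln \Pr\{\text{sequence} \mid S_s\}$ are already available as a list, and the function name `site_posterior` and the use of logarithms for numerical safety are choices made for this sketch.

```python
import math


def site_posterior(log_lik):
    """Return Pr{S_s | sequence} for each s, given ln Pr{sequence | S_s}.

    Assumes equally likely starting positions, so the prior cancels.
    The largest log-likelihood is subtracted before exponentiating
    to avoid underflow; this does not change the normalized result.
    """
    m = max(log_lik)
    weights = [math.exp(v - m) for v in log_lik]
    total = sum(weights)
    return [wt / total for wt in weights]
```

With 90 equal log-likelihoods the function returns 1/90 for every starting position, as expected when the data carry no information about the site location.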
3. Formulation of the nonlinear program
We use the maximum likelihood approach to find $w_j$'s that best fit the sequence data, by maximizing the probability of the observed sequences [5]. Given a trial set of values for the $w_j$ and a trial starting position $s$ for a single site in a sequence, we can write down the probability that we would observe this sequence if it were generated at random according to the probability model of Section 2. For example, suppose $s = 20$.
Then the site is as shown within the sequence given earlier.
The conditional probability of observing this sequence is then
$$\Pr\{\text{sequence} \mid S_{20}\} = w_1^{23}\, w_2^{34}\, w_3^{14}\, w_4^{24}\, (w_6 w_{12} w_{14} w_{20} w_{24})(w_{28} w_{31} w_{36} w_{37} w_{44}).$$
There are 23 A's, 34 T's, 14 C's, and 24 G's outside of the $L_i$ positions of the site. These letters occur where they do with probabilities $w_1$, $w_2$, $w_3$, and $w_4$, respectively, so the probability that they all appear where they do is $w_1^{23} w_2^{34} w_3^{14} w_4^{24}$. In the $L_i$ positions of the site we find the letters TGTGG, with probabilities $w_6$, $w_{12}$, $w_{14}$, $w_{20}$, and $w_{24}$, and GCGAG having the probabilities $w_{28}$, $w_{31}$, $w_{36}$, $w_{37}$, and $w_{44}$.
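The product above can be evaluated mechanically for any sequence and starting position. The sketch below works with logarithms; the offsets 0 to 4 and 11 to 15 for the $L_1, \ldots, L_{10}$ positions are inferred from the 5 + 6 + 5 layout of the 16-position site, and the names `PATTERN_OFFSETS` and `log_lik_given_start` are hypothetical.

```python
import math

LETTERS = "ATCG"
# Offsets of L_1..L_10 from the site start (inferred from the 5 + 6 + 5 layout).
PATTERN_OFFSETS = [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]


def log_lik_given_start(seq, s, w):
    """Return ln Pr{sequence | S_s}.

    seq : string over A, T, C, G
    s   : 1-based starting position of the site
    w   : list of 44 probabilities, w[0] being w_1
    """
    start = s - 1
    # Map each sequence position inside a pattern slot to its i (1..10).
    site_pos = {start + off: i for i, off in enumerate(PATTERN_OFFSETS, start=1)}
    total = 0.0
    for pos, letter in enumerate(seq):
        i = site_pos.get(pos, 0)  # 0 = outside the site or middle six positions
        total += math.log(w[4 * i + LETTERS.index(letter)])
    return total
```

Summing logarithms instead of multiplying 105 probabilities avoids underflow, and exponentiating the result recovers the product form shown in the text.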
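The probability $T_l$ of observing sequence $l$ is
$$T_l = \sum_{s=1}^{90} \Pr\{\text{sequence} \mid S_s\}\,\Pr\{S_s\} = \sum_{s=1}^{90} \left( \prod_{j=1}^{44} w_j^{p_{jsl}} \right) \frac{1}{90},$$
where, for $s = 1, \ldots, 90$ and $j = 1, \ldots, 44$, $p_{jsl}$ is the number of occurrences in sequence $l$ of the letter associated with $w_j$, counted over the positions associated with $w_j$ when the site starts in position $s$. Thus, in the example above ($l = 1$, $s = 20$), some typical $p_{jsl}$ are
$p_{5,20,1}$ = number of occurrences of the letter associated with $w_5$ = number of A's in the $L_1$ position of the site = 0;
$p_{12,20,1}$ = number of G's in $L_2$ = 1.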
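The counts $p_{jsl}$ for one sequence and one starting position can be tabulated in a single pass over the sequence. The following is a sketch under the same inferred site layout (pattern offsets 0 to 4 and 11 to 15); the function name `letter_counts` is hypothetical and the example sequence in the usage note is synthetic, not experimental data.

```python
LETTERS = "ATCG"
# Offsets of L_1..L_10 from the site start (inferred from the 5 + 6 + 5 layout).
PATTERN_OFFSETS = [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]


def letter_counts(seq, s):
    """Return the 44 counts p_j for one sequence and 1-based site start s.

    Entry 0 corresponds to w_1, entry 43 to w_44.
    """
    start = s - 1
    site_pos = {start + off: i for i, off in enumerate(PATTERN_OFFSETS, start=1)}
    counts = [0] * 44
    for pos, letter in enumerate(seq):
        i = site_pos.get(pos, 0)  # 0 = nonsite or middle-six position
        counts[4 * i + LETTERS.index(letter)] += 1
    return counts
```

For any 105-position sequence the first four counts sum to 95 and each later group of four sums to 1, because every $L_i$ position holds exactly one letter.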
The probability of observing sequence $l$ is $T_l$, so the probability of observing, say, 18 sequences is
$$\Pr\{\text{data}\} = \prod_{l=1}^{18} T_l = \prod_{l=1}^{18} \sum_{s=1}^{90} \left( \prod_{j=1}^{44} w_j^{p_{jsl}} \right) \frac{1}{90},$$
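and this is the probability that we must find $w_j$'s to maximize. The $w_j$ are probabilities, so $0 \le w_j \le 1$. Thus, we can introduce the transformation $w_j = e^{x_j}$ to obtain
$$\Pr\{\text{data}\} = \prod_{l=1}^{18} T_l = \left(\frac{1}{90}\right)^{18} \prod_{l=1}^{18} \sum_{s=1}^{90} \exp\left( \sum_{j=1}^{44} x_j p_{jsl} \right),$$
which eliminates the inner continued product. The constant factor $\left(\frac{1}{90}\right)^{18}$ does not affect the maximizing point and the logarithm is monotone increasing, so to remove the constant factor and the remaining continued product, we equivalently maximize
$$F(x) = \sum_{l=1}^{18} \ln\left( \sum_{s=1}^{90} \exp\left( \sum_{j=1}^{44} x_j p_{jsl} \right) \right).$$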
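Evaluating $F(x)$ directly can overflow or underflow because the inner sums $\sum_j x_j p_{jsl}$ are large in magnitude. The sketch below uses the standard log-sum-exp shift to avoid this; it assumes the counts are stored as a nested list `p[l][s][j]`, which is a layout chosen for this illustration.

```python
import math


def objective(x, p):
    """Evaluate F(x) = sum_l ln( sum_s exp( sum_j x_j * p[l][s][j] ) ).

    x : list of variables x_j
    p : nested list of counts indexed as p[l][s][j]
    """
    total = 0.0
    for p_l in p:  # one entry per sequence l
        exponents = [sum(xj * c for xj, c in zip(x, p_s)) for p_s in p_l]
        m = max(exponents)
        # ln(sum exp(e)) = m + ln(sum exp(e - m)), which is numerically stable
        total += m + math.log(sum(math.exp(e - m) for e in exponents))
    return total
```

As a small check on a toy instance with one sequence, two starting positions, and two variables, `objective([math.log(0.5)] * 2, [[[1, 0], [0, 1]]])` evaluates $\ln(0.5 + 0.5)$, which is zero up to rounding.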
Because only one letter can occupy any location in the sequence, we have the following constraints, which the optimization will force to be satisfied as equalities:
$$\begin{aligned}
w_1 + w_2 + w_3 + w_4 &\le 1,\\
w_5 + w_6 + w_7 + w_8 &\le 1,\\
&\;\;\vdots\\
w_{41} + w_{42} + w_{43} + w_{44} &\le 1.
\end{aligned}$$
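Applying the transformation $w_j = e^{x_j}$ to these constraints, we obtain the nonlinear programming problem
NLP:
$$\max_{x_1, \ldots, x_{44}} \; F(x) = \sum_{l=1}^{18} \ln\left( \sum_{s=1}^{90} \exp\left( \sum_{j=1}^{44} x_j p_{jsl} \right) \right)$$
subject to
$$\begin{aligned}
e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4} &\le 1,\\
&\;\;\vdots\\
e^{x_{41}} + e^{x_{42}} + e^{x_{43}} + e^{x_{44}} &\le 1,\\
x_j \text{ free}, \quad j &= 1, \ldots, 44.
\end{aligned}$$
NLP is completely specified once the letter counts $p_{jsl}$ are known, and they depend only on the sequence data in the way described at the beginning of this section.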
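The eleven constraints of NLP can be evaluated in the form $g_i(x) = \sum_{j \in \text{group } i} e^{x_j} - 1 \le 0$. The following sketch uses that form; the function names and the feasibility tolerance are assumptions made for illustration.

```python
import math


def constraint_values(x, group_size=4):
    """Return g_i(x) = sum of exp(x_j) over group i, minus 1, for each group.

    A point is feasible when every returned value is <= 0.
    """
    return [
        sum(math.exp(v) for v in x[k:k + group_size]) - 1.0
        for k in range(0, len(x), group_size)
    ]


def is_feasible(x, tol=1e-9):
    """True when all constraints hold to within tol (tol is an arbitrary choice)."""
    return all(g <= tol for g in constraint_values(x))
```

The point with every $x_j = \ln 0.25$ makes all eleven constraints active, which corresponds to uniform letter probabilities in every position.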
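The objective function $F(x)$ is convex because it is the sum of functions having the form
$$f(x) = \ln\left( \sum_s e^{a_s^T x} \right),$$
and $f(x)$ is convex, as we show now. The function $f$ is continuous, so according to [4, Section 1.4] it is convex if and only if for every two vectors $x$ and $y$,
$$f\left( \tfrac{1}{2} x + \tfrac{1}{2} y \right) \le \tfrac{1}{2} f(x) + \tfrac{1}{2} f(y)$$
or
$$\ln\left( \sum_s e^{a_s^T \left( \frac{1}{2} x + \frac{1}{2} y \right)} \right) \le \tfrac{1}{2} \ln\left( \sum_s e^{a_s^T x} \right) + \tfrac{1}{2} \ln\left( \sum_s e^{a_s^T y} \right).$$
Letting $u_s = \exp\left(a_s^T \left(\tfrac{1}{2} x\right)\right)$ and $v_s = \exp\left(a_s^T \left(\tfrac{1}{2} y\right)\right)$, we need to show that
$$\ln\left( \sum_s u_s v_s \right) \le \tfrac{1}{2} \ln\left( \sum_s u_s^2 \right) + \tfrac{1}{2} \ln\left( \sum_s v_s^2 \right)$$
or equivalently
$$\sum_s u_s v_s \le \exp\left( \tfrac{1}{2} \ln \sum_s u_s^2 + \tfrac{1}{2} \ln \sum_s v_s^2 \right)$$
or
$$\sum_s u_s v_s \le \left( \sum_s u_s^2 \right)^{1/2} \left( \sum_s v_s^2 \right)^{1/2}.$$
But each $u_s$ and $v_s$ is positive so
$$\left( \sum_s u_s v_s \right)^2 \le \sum_s u_s^2 \sum_s v_s^2,$$
which is just Cauchy's inequality. Thus, $F(x)$ is convex.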
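The midpoint inequality used in this argument can also be spot-checked numerically. The sketch below evaluates the gap $\tfrac{1}{2} f(x) + \tfrac{1}{2} f(y) - f(\tfrac{1}{2} x + \tfrac{1}{2} y)$ for given coefficient vectors $a_s$; it is only an illustration on whatever data is supplied, not a substitute for the proof, and the function names are hypothetical.

```python
import math


def log_sum_exp(a, x):
    """f(x) = ln( sum_s exp(a_s . x) ), where a is a list of rows a_s."""
    dots = [sum(ai * xi for ai, xi in zip(row, x)) for row in a]
    m = max(dots)
    return m + math.log(sum(math.exp(d - m) for d in dots))


def midpoint_gap(a, x, y):
    """Return 0.5*f(x) + 0.5*f(y) - f(0.5*x + 0.5*y).

    Midpoint convexity of f means this value is never negative
    (up to floating-point rounding).
    """
    mid = [0.5 * (xi + yi) for xi, yi in zip(x, y)]
    return 0.5 * log_sum_exp(a, x) + 0.5 * log_sum_exp(a, y) - log_sum_exp(a, mid)
```

Calling `midpoint_gap` on randomly drawn `a`, `x`, and `y` should give values that are nonnegative apart from rounding error, and the gap is zero when `x` equals `y`.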
Because $F(x)$ is convex, NLP has local maxima located on the boundary of the feasible set and the optimal point $x^*$ will be among them.
4. The protein binding site application
The problem of Section 1 models the protein binding site problem considered by Stormo and Hartzell [7] and Lawrence and Reilly [3] in which the data consist of 18 DNA sequences each 105 bases long. The sequences are listed in Fig. 1. Each sequence contains one or two sites of the kind described in Section 1, so we can use these data and the nonlinear programming model NLP of Section 3 to find the consensus site pattern and the most likely locations of the sites.
From the data we computed the $p_{jsl}$ as described in Section 3, and we solved NLP obtaining the results shown in Fig. 2. The solution is a Karush–Kuhn–Tucker point, with Lagrange multipliers of 1710 for the first constraint and 18 for the others.
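The reported multiplier values are consistent with the stationarity conditions of NLP. The following is a sketch of that check, assuming all eleven constraints are active at the solution (as stated in Section 3). The symbols $G_i$ (the index set of the $i$th constraint), $\lambda_i$ (its multiplier), and $\pi_{sl}$ are introduced here for illustration only.

$$\frac{\partial F}{\partial x_j} = \sum_{l=1}^{18} \sum_{s=1}^{90} \pi_{sl}\, p_{jsl}, \qquad \pi_{sl} = \frac{\exp\left( \sum_k x_k p_{ksl} \right)}{\sum_{s'=1}^{90} \exp\left( \sum_k x_k p_{ks'l} \right)}, \qquad \sum_{s=1}^{90} \pi_{sl} = 1.$$

Stationarity requires $\partial F / \partial x_j = \lambda_i e^{x_j}$ for $j \in G_i$. Summing over $j \in G_i$ and using $\sum_{j \in G_i} e^{x_j} = 1$ gives

$$\lambda_i = \sum_{l=1}^{18} \sum_{s=1}^{90} \pi_{sl} \sum_{j \in G_i} p_{jsl}.$$

For every $s$ and $l$, the inner sum $\sum_{j \in G_i} p_{jsl}$ equals 95 for the first group (the $105 - 10$ positions outside the $L_i$ positions) and 1 for each of the ten $L_i$ groups. Hence $\lambda_1 = 18 \times 95 = 1710$ and $\lambda_i = 18 \times 1 = 18$ for the others, matching the values above.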
Fig. 1. Experimentally determined sequence data.
Fig. 2. The solution to NLP.
Recalling that the $x_j$ are the logs of probabilities for the appearance of an A, T, C, or G outside the sites and then in the 10 site positions $L_1, \ldots, L_{10}$, the first line in Fig. 2 corresponds to outside the sites, the second line corresponds to $L_1$, and so on. Then we can see by inspection of $x^*$ that the consensus site pattern is TGTGATCACA. This agrees with the consensus site pattern reported in [3].
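Reading the consensus pattern from a solution vector amounts to picking, for each $L_i$, the letter with the largest $x_j$ in that group of four. A minimal sketch follows; the function name `consensus_from_x` is hypothetical, and since the Fig. 2 values are not reproduced here, any input used to try it has to be supplied by the reader.

```python
LETTERS = "ATCG"


def consensus_from_x(x):
    """Return the 10-letter consensus implied by a 44-entry vector x.

    x[0:4] are the nonsite log-probabilities and are ignored;
    x[4*i : 4*i + 4] are the log-probabilities of A, T, C, G in L_i.
    """
    pattern = []
    for i in range(1, 11):
        group = x[4 * i:4 * i + 4]
        pattern.append(LETTERS[group.index(max(group))])
    return "".join(pattern)
```

Because the logarithm is monotone, taking the largest $x_j$ in a group selects the same letter as taking the largest $w_j$.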
From $x^*$ we found the $w_j$ and used the Bayes' theorem analysis of Section 2 to compute for each sequence $l$ the probabilities $\Pr\{S_s \mid \text{sequence } l\}$ of all the possible site starting positions $s = 1, \ldots, 90$. The table in Fig. 3 reports for each sequence the true site starting position or positions, along with the two most likely estimated starting positions and their probabilities.
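Selecting the two most likely starting positions from the posterior probabilities is a simple ranking step. A sketch, with the hypothetical name `top_starts` and 1-based positions to match the text:

```python
def top_starts(posterior, k=2):
    """Return the k most probable starting positions as (s, probability) pairs.

    posterior[0] is the probability for s = 1. Ties keep the earlier position.
    """
    ranked = sorted(enumerate(posterior, start=1), key=lambda pair: -pair[1])
    return ranked[:k]
```

For instance, `top_starts([0.1, 0.6, 0.3])` returns `[(2, 0.6), (3, 0.3)]`.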
The maximum likelihood model predicts a true starting position in every sequence except $l = 07$, and in
two of the six sequences that contain two sites it predicts both of them. Sequence 07 is reproduced below,
with the consensus site pattern shown above at the predicted starting position of 54 and below at the true
starting position of 45.
There are two mismatches between the consensus site and the corresponding sequence positions when the
site starting position is 54, but three mismatches when the starting position is 45, so it is not surprising that
the maximum likelihood model predicts a starting position of 54 in this sequence.
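The mismatch comparison can be reproduced by counting disagreements between the consensus and the ten pattern positions of a candidate site. The sketch below assumes the inferred site layout (pattern offsets 0 to 4 and 11 to 15); the function name is hypothetical, and a mismatch count is only a rough proxy for the likelihood, which also weights each position by its letter probabilities.

```python
# Offsets of L_1..L_10 from the site start (inferred from the 5 + 6 + 5 layout).
PATTERN_OFFSETS = [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]


def count_mismatches(seq, s, consensus):
    """Count pattern positions where seq differs from the 10-letter consensus.

    s is the 1-based starting position of the candidate site.
    """
    start = s - 1
    return sum(
        1
        for off, letter in zip(PATTERN_OFFSETS, consensus)
        if seq[start + off] != letter
    )
```

Applied to sequence 07 with the consensus TGTGATCACA, this count would be compared at starting positions 54 and 45.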
5. Solving the nonlinear program
The focus of this paper is on formulating the protein binding site identification problem as a nonlinear
optimization, and on interpreting the coordinates of the maximizing point to identify the sites. However, as
noted in Section 3, NLP has local maxima on the boundary of the feasible set and this makes it difficult to
find $x^*$. The problem of solving NLP is therefore itself also of some interest.
Fig. 3. Starting positions in the experimental data.
Our previous computational experience [1] showed that the ellipsoid algorithm of Shor [6,2] is often an
effective heuristic for solving optimization problems having local extrema. This method is not guaranteed to
find a global optimum nor even a Karush–Kuhn–Tucker point for NLP, but by trying various starting
points and repeatedly restarting the algorithm with a new initial ellipsoid centered on the best point discovered so far, we were able to find the solution reported in Section 4.
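For readers unfamiliar with the method, the following is a minimal sketch of the textbook central-cut ellipsoid update, applied to a toy unconstrained convex minimization. It is not the authors' implementation and omits their restart strategy. For a constrained problem such as NLP, a common arrangement (assumed here, not taken from the paper) is to cut on the gradient of a violated constraint when the center is infeasible and on the gradient of $-F$ otherwise.

```python
import math


def ellipsoid_step(x, Q, g):
    """One central-cut update of the ellipsoid {z : (z-x)^T Q^-1 (z-x) <= 1}.

    g is the cut direction (a gradient at x). Requires dimension n >= 2.
    Returns (x_new, Q_new), or None if the cut is degenerate.
    """
    n = len(x)
    Qg = [sum(Q[r][c] * g[c] for c in range(n)) for r in range(n)]
    gQg = sum(g[r] * Qg[r] for r in range(n))
    if gQg <= 0.0:
        return None  # zero gradient, or Q no longer positive definite numerically
    d = [v / math.sqrt(gQg) for v in Qg]
    x_new = [x[r] - d[r] / (n + 1) for r in range(n)]
    scale = n * n / (n * n - 1.0)
    Q_new = [
        [scale * (Q[r][c] - 2.0 / (n + 1) * d[r] * d[c]) for c in range(n)]
        for r in range(n)
    ]
    return x_new, Q_new


def ellipsoid_minimize(f, grad, x0, radius, iters=200):
    """Minimize a convex f starting from a ball of the given radius around x0.

    Returns the best center found and its function value.
    """
    n = len(x0)
    x = list(x0)
    Q = [[radius ** 2 if r == c else 0.0 for c in range(n)] for r in range(n)]
    best_x, best_f = list(x), f(x)
    for _ in range(iters):
        step = ellipsoid_step(x, Q, grad(x))
        if step is None:
            break
        x, Q = step
        fx = f(x)
        if fx < best_f:
            best_x, best_f = list(x), fx
    return best_x, best_f
```

Tracking the best center matters because successive centers do not improve monotonically. A restart in the spirit of the text would call `ellipsoid_minimize` again with `x0` set to the best point found so far.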
The structure of the constraints in this problem suggests that a special-purpose algorithm might be
devised to more efficiently search for $x^*$ in the boundary of the feasible set.
References
[1] J.G. Ecker, M. Kupferschmid, A computational comparison of the ellipsoid algorithm with several nonlinear programming
algorithms, SIAM Journal on Control and Optimization 23 (5) (1985) 657–674.
[2] J.G. Ecker, M. Kupferschmid, in: Introduction to Operations Research, Krieger, Malabar, FL, 1991, pp. 315–322.
[3] C.E. Lawrence, A.A. Reilly, An expectation maximization (EM) algorithm for the identification and characterization of common
sites in unaligned biopolymer sequences, PROTEINS: Structure, Function, and Genetics 7 (1990) 41–51.
[4] D.S. Mitrinović, Analytic Inequalities, Springer, New York, 1970.
[5] A.C.H. Scott, Locating binding sites for cyclic-AMP receptor proteins on unaligned DNA fragments using nonlinear
programming, Ph.D. thesis, Rensselaer Polytechnic Institute, 1993.
[6] N.Z. Shor, Cut-off method with space extension in convex programming problems, Cybernetics 12 (1977) 94–96.
[7] G.D. Stormo, G.W. Hartzell, Identifying protein-binding sites from unaligned DNA fragments, Proceedings of the National
Academy of Sciences of the USA 86 (1989) 1183–1187.