Short Communication
An application of nonlinear optimization in molecular biology
J.G. Ecker^a, M. Kupferschmid^{c,*}, C.E. Lawrence^b, A.A. Reilly^b, A.C.H. Scott^d
^a Department of Mathematical Sciences, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, USA
^b New York State Department of Health, Wadsworth Biometric Laboratory, Albany, NY 12201, USA
^c Academic Computing Services, Alan M. Voorhees Computing Center, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, USA
^d Aereal Inc., San Francisco, CA 94107, USA
Received 1 September 1999; accepted 16 February 2001
Abstract
A maximum likelihood approach has been proposed for finding protein binding sites on strands of DNA [G.D. Stormo, G.W. Hartzell, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183]. We formulate an optimization model for the problem and present calculations with experimental sequence data to study the behavior of this site identification method. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Optimization; Nonlinear programming; Molecular biology; Protein binding; Maximum likelihood
European Journal of Operational Research 138 (2002) 452–458
www.elsevier.com/locate/dsw
* Corresponding author. Tel.: +1-518-276-6558; fax: +1-518-276-2809.
E-mail address: kupfem@rpi.edu (M. Kupferschmid).
0377-2217/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0377-2217(01)00122-9
1. The problem
Suppose we are given several letter sequences, each 105 positions long, and that each position contains a letter from the set {A, T, C, G}. One such sequence is shown below, with dots beneath the letters in positions 10, 20, etc.
Within each sequence there is a subsequence of length 16, called a site, having the pattern
$$L_1 L_2 L_3 L_4 L_5 \;\cdot\;\cdot\;\cdot\;\cdot\;\cdot\;\cdot\; L_6 L_7 L_8 L_9 L_{10},$$
where the six middle positions (marked here by dots) can be any pattern of the four letters. If the data were perfect, all of the sites would have the same ideal or consensus pattern of $L_i$'s, but in the data we are given, most sites vary slightly from the consensus pattern and some vary significantly. A site can be located anywhere within a sequence, and the sites in the different sequences can be located differently.
The problem is to find the $L_i$'s of the consensus site pattern and the locations of the sites in the given sequences of imperfect data.
2. A probabilistic model
In addition to having the pattern shown above, sites differ from the nonsite parts of the sequences in that the frequencies of the letters in the $L_i$ positions are different from those in other positions (that is, positions outside the site and the middle six positions within the site). A maximum likelihood approach has been proposed [7] to exploit these frequency differences to identify the sites.
Let $w_1$, $w_2$, $w_3$, and $w_4$ be, respectively, the probabilities of there being an A, T, C, or G in any position outside a site or in the middle six positions, and $w_{4i+1}$, $w_{4i+2}$, $w_{4i+3}$, and $w_{4i+4}$ the probabilities of an A, T, C, or G in the $L_i$ position of each site, for $i = 1, \ldots, 10$. We assume that the letter occurrences are statistically independent. Thus there are 44 variables $w_j$ in this problem, and $105 - 16 + 1 = 90$ possible site starting positions $s$ in each 105-position sequence.
If we knew the probabilities $w_1, \ldots, w_{44}$, we could calculate (as illustrated in Section 3) the probability of an observed sequence given that a binding site starts in position $s$. Letting $S_s$ be the event that a site begins in position $s$, we have
$\Pr\{\text{sequence} \mid S_s\}$ = probability of the sequence data given that a binding site starts in position $s$;
$\Pr\{S_s \mid \text{sequence}\}$ = probability that a binding site starts in position $s$ given the sequence data;
$\Pr\{S_s\}$ = probability that a binding site starts in position $s$.
We assume the starting positions are equally likely, so $\Pr\{S_s\} = \frac{1}{90}$ for $s = 1, \ldots, 90$. Then from Bayes' theorem,
$$\Pr\{S_s \mid \text{sequence}\} = \frac{\Pr\{\text{sequence} \mid S_s\}\,\Pr\{S_s\}}{\Pr\{\text{sequence}\}} = \frac{\Pr\{\text{sequence} \mid S_s\}}{\sum_{t=1}^{90} \Pr\{\text{sequence} \mid S_t\}}.$$
For each sequence we can then select likely starting positions $s$ for the binding sites based on the probabilities $\Pr\{S_s \mid \text{sequence}\}$. Of course, the $w_{4i+1}$, $w_{4i+2}$, $w_{4i+3}$, and $w_{4i+4}$ give the probabilities of an A, T, C, or G in the $L_i$ positions, so knowing the $w_j$ we can identify the letter most likely to be in the $L_i$ position.
This probabilistic model thus yields estimates of the starting positions and the letter in each $L_i$ position of the site, which is the problem of Section 1. How can we find the $w_j$?
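The Bayes' theorem calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 0-based list layout of the $w_j$ (so `w[4*i + k]` holds $w_{4i+k+1}$), and the letter order A, T, C, G are choices made for this sketch, not part of the original paper. Logarithms are used so that products of many small probabilities do not underflow.

```python
import math

LETTERS = "ATCG"                 # order assumed for w_1..w_4 (A, T, C, G)
SITE_LEN, ARM, GAP = 16, 5, 6    # site = 5 L_i positions, 6 free, 5 L_i positions


def log_prob_given_start(seq, s, w):
    """Return ln Pr{sequence | S_s} for a 1-based start s and a list w of 44 probabilities."""
    li_offsets = list(range(ARM)) + list(range(ARM + GAP, SITE_LEN))
    group_of = {s - 1 + off: i for i, off in enumerate(li_offsets, start=1)}
    total = 0.0
    for pos, letter in enumerate(seq):
        i = group_of.get(pos, 0)  # group 0 = outside the site or in the middle six
        total += math.log(w[4 * i + LETTERS.index(letter)])
    return total


def posterior_starts(seq, w):
    """Return [Pr{S_s | sequence} for s = 1..n], assuming equally likely starts."""
    n_starts = len(seq) - SITE_LEN + 1
    logs = [log_prob_given_start(seq, s, w) for s in range(1, n_starts + 1)]
    top = max(logs)  # subtract the maximum before exponentiating to avoid underflow
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [v / norm for v in weights]
```

Because the prior $\Pr\{S_s\}$ is the same for every $s$, it cancels in the ratio, which is why the sketch never uses the value $\frac{1}{90}$ explicitly.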
3. Formulation of the nonlinear program
We use the maximum likelihood approach to find $w_j$'s that best fit the sequence data, by maximizing the probability of the observed sequences [5]. Given a trial set of values for the $w_j$ and a trial starting position $s$ for a single site in a sequence, we can write down the probability that we would observe this sequence if it were generated at random according to the probability model of Section 2. For example, suppose $s = 20$. Then the site is as shown within the sequence given earlier.
The conditional probability of observing this sequence is then
$$\Pr\{\text{sequence} \mid S_{20}\} = w_1^{23}\, w_2^{34}\, w_3^{14}\, w_4^{24}\, (w_6 w_{12} w_{14} w_{20} w_{24})(w_{28} w_{31} w_{36} w_{37} w_{44}).$$
There are 23 A's, 34 T's, 14 C's, and 24 G's outside of the $L_i$ positions of the site. These letters occur where they do with probabilities $w_1$, $w_2$, $w_3$, and $w_4$, respectively, so the probability that they all appear where they do is $w_1^{23} w_2^{34} w_3^{14} w_4^{24}$. In the $L_i$ positions of the site we find the letters TGTGG, with probabilities $w_6$, $w_{12}$, $w_{14}$, $w_{20}$, and $w_{24}$, and GCGAG having the probabilities $w_{28}$, $w_{31}$, $w_{36}$, $w_{37}$, and $w_{44}$.
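The subscripts in this example follow directly from the indexing rule of Section 2: letter number $k$ (A = 1, T = 2, C = 3, G = 4) in position $L_i$ corresponds to $w_{4i+k}$. A small sketch that reproduces the subscripts quoted above, with the helper names `w_index` and `site_factors` being hypothetical names chosen for illustration:

```python
LETTERS = "ATCG"  # letter order assumed from w_1..w_4 = Pr(A), Pr(T), Pr(C), Pr(G)


def w_index(i, letter):
    """1-based subscript j of the w_j for `letter` in position L_i (i = 0 means background)."""
    return 4 * i + LETTERS.index(letter) + 1


def site_factors(li_letters):
    """Subscripts of the w_j that appear once each for the ten L_i letters of a site."""
    return [w_index(i, c) for i, c in enumerate(li_letters, start=1)]


# Letters found in L_1..L_5 and L_6..L_10 in the worked example (s = 20).
indices = site_factors("TGTGG" + "GCGAG")

# The background exponents must account for every position outside the ten L_i positions.
background_positions = 23 + 34 + 14 + 24
```

Here `indices` lists the ten subscripts of the site factors, and `background_positions` equals 95, the 105 positions of the sequence minus the ten $L_i$ positions.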
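The probability $T_l$ of observing sequence $l$ is
$$T_l = \sum_{s=1}^{90} \Pr\{\text{sequence} \mid S_s\}\,\Pr\{S_s\} = \sum_{s=1}^{90} \left( \prod_{j=1}^{44} w_j^{p_{jsl}} \right) \frac{1}{90},$$
where, for $s = 1, \ldots, 90$ and $j = 1, \ldots, 44$, $p_{jsl}$ is the number of occurrences of the letter associated with $w_j$, counted over the positions of sequence $l$ to which $w_j$ applies when the site starts in position $s$. Thus, in the example above ($l = 1$, $s = 20$), some typical $p_{jsl}$ are
$p_{5,20,1}$ = number of occurrences of the letter associated with $w_5$ = number of A's in the $L_1$ position of the site = 0;
$p_{12,20,1}$ = number of G's in $L_2$ = 1.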
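As an illustration of how the counts $p_{jsl}$ could be tabulated, the following sketch computes the 44 counts for one sequence and one trial start. The function name `letter_counts`, the 0-based list layout (entry `4*i + k` holds the count for $w_{4i+k+1}$), and the 20-letter toy sequence used to exercise it are assumptions made for this example; the toy sequence is not taken from the experimental data.

```python
LETTERS = "ATCG"                 # order assumed for w_1..w_4 (A, T, C, G)
SITE_LEN, ARM, GAP = 16, 5, 6    # 5 L_i positions, 6 free positions, 5 L_i positions


def letter_counts(seq, s):
    """Return the 44 counts p_j for one sequence and a 1-based site start s."""
    li_offsets = list(range(ARM)) + list(range(ARM + GAP, SITE_LEN))
    # Map absolute position -> i for the ten L_i positions of a site starting at s.
    group_of = {s - 1 + off: i for i, off in enumerate(li_offsets, start=1)}
    counts = [0] * 44
    for pos, letter in enumerate(seq):
        i = group_of.get(pos, 0)  # 0 = background (outside site or middle six)
        counts[4 * i + LETTERS.index(letter)] += 1
    return counts


# Toy sequence (hypothetical): a TGTGA......TCACA site starting at position 3.
toy_counts = letter_counts("ACTGTGAACGTACTCACAGT", 3)
```

For any sequence and start, the four background counts sum to the sequence length minus 10, and each group of four counts belonging to an $L_i$ position sums to exactly 1.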
The probability of observing sequence $l$ is $T_l$, so the probability of observing, say, 18 sequences is
$$\Pr\{\text{data}\} = \prod_{l=1}^{18} T_l = \prod_{l=1}^{18} \sum_{s=1}^{90} \left( \prod_{j=1}^{44} w_j^{p_{jsl}} \right) \frac{1}{90},$$
and this is the probability that we must find $w_j$'s to maximize. The $w_j$ are probabilities, so $0 \le w_j \le 1$. Thus, we can introduce the transformation $w_j = e^{x_j}$ to obtain
$$\Pr\{\text{data}\} = \prod_{l=1}^{18} T_l = \left(\frac{1}{90}\right)^{18} \prod_{l=1}^{18} \sum_{s=1}^{90} \exp\left( \sum_{j=1}^{44} x_j p_{jsl} \right),$$
which eliminates the inner continued product. To remove the constant factor $\left(\frac{1}{90}\right)^{18}$ (one factor of $\frac{1}{90}$ from each $T_l$) and the remaining continued product, we equivalently maximize
$$F(x) = \sum_{l=1}^{18} \ln\left( \sum_{s=1}^{90} \exp\left( \sum_{j=1}^{44} x_j p_{jsl} \right) \right).$$
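A minimal sketch of how $F(x)$ might be evaluated numerically is given below. It assumes the counts are stored as a nested list `p[l][s][j]` (a layout chosen for this example) and uses the standard log-sum-exp shift, since the exponents are sums of about a hundred log-probabilities and can be far below the range where `exp` is accurate. The function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Return ln(sum(exp(v) for v in values)) without overflow or underflow."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def objective(x, p):
    """F(x) = sum over sequences l of ln sum_s exp(sum_j x_j * p[l][s][j])."""
    total = 0.0
    for counts_for_sequence in p:
        exponents = [
            sum(xj * pj for xj, pj in zip(x, counts))
            for counts in counts_for_sequence
        ]
        total += log_sum_exp(exponents)
    return total
```

Since $F(x)$ differs from $\ln \Pr\{\text{data}\}$ only by an additive constant, maximizing one maximizes the other.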
Because only one letter can occupy any location in the sequence, we have the following constraints, which the optimization will force to be satisfied as equalities:
$$\begin{aligned}
w_1 + w_2 + w_3 + w_4 &\le 1, \\
w_5 + w_6 + w_7 + w_8 &\le 1, \\
&\;\;\vdots \\
w_{41} + w_{42} + w_{43} + w_{44} &\le 1.
\end{aligned}$$
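Applying the transformation $w_j = e^{x_j}$ to these constraints, we obtain the nonlinear programming problem NLP:
$$\begin{aligned}
\max_{x_1, \ldots, x_{44}} \quad & F(x) = \sum_{l=1}^{18} \ln\left( \sum_{s=1}^{90} \exp\left( \sum_{j=1}^{44} x_j p_{jsl} \right) \right) \\
\text{subject to} \quad & e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4} \le 1, \\
& \qquad\vdots \\
& e^{x_{41}} + e^{x_{42}} + e^{x_{43}} + e^{x_{44}} \le 1, \\
& x_j \ \text{free}, \quad j = 1, \ldots, 44.
\end{aligned}$$
NLP is completely specified once the letter counts $p_{jsl}$ are known, and they depend only on the sequence data in the way described at the beginning of this section.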
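The eleven constraints of NLP are easy to evaluate for a trial point. The sketch below, with hypothetical helper names and a 0-based list for $x_1, \ldots, x_{44}$, computes the left-hand sides and checks feasibility within a small tolerance; it is meant only to make the constraint structure concrete, not to describe how the authors implemented it.

```python
import math


def constraint_sums(x):
    """Return the 11 left-hand sides e^{x_{4k+1}} + ... + e^{x_{4k+4}}, k = 0..10."""
    return [sum(math.exp(v) for v in x[4 * k: 4 * k + 4]) for k in range(11)]


def is_feasible(x, tol=1e-9):
    """True if every group of four transformed probabilities sums to at most 1."""
    return all(total <= 1.0 + tol for total in constraint_sums(x))


def x_from_w(w):
    """Apply the transformation x_j = ln(w_j) to a list of positive probabilities."""
    return [math.log(v) for v in w]
```

For example, the uniform point $w_j = \frac{1}{4}$ for all $j$ maps to $x_j = \ln\frac{1}{4}$ and satisfies every constraint with equality.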
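The objective function $F(x)$ is convex because it is the sum of functions having the form
$$f(x) = \ln\left( \sum_s e^{a_s^T x} \right),$$
and $f(x)$ is convex, as we show now. The function $f$ is continuous, so according to [4, Section 1.4] it is convex if and only if for every two vectors $x$ and $y$,
$$f\left( \tfrac{1}{2} x + \tfrac{1}{2} y \right) \le \tfrac{1}{2} f(x) + \tfrac{1}{2} f(y)$$
or
$$\ln\left( \sum_s e^{a_s^T \left( \frac{1}{2} x + \frac{1}{2} y \right)} \right) \le \tfrac{1}{2} \ln\left( \sum_s e^{a_s^T x} \right) + \tfrac{1}{2} \ln\left( \sum_s e^{a_s^T y} \right).$$
Letting $u_s = \exp\left( a_s^T \left( \tfrac{1}{2} x \right) \right)$ and $v_s = \exp\left( a_s^T \left( \tfrac{1}{2} y \right) \right)$, we need to show that
$$\ln\left( \sum_s u_s v_s \right) \le \tfrac{1}{2} \ln\left( \sum_s u_s^2 \right) + \tfrac{1}{2} \ln\left( \sum_s v_s^2 \right)$$
or equivalently
$$\sum_s u_s v_s \le \exp\left( \tfrac{1}{2} \ln \sum_s u_s^2 + \tfrac{1}{2} \ln \sum_s v_s^2 \right)$$
or
$$\sum_s u_s v_s \le \left( \sum_s u_s^2 \right)^{1/2} \left( \sum_s v_s^2 \right)^{1/2}.$$
But each $u_s$ and $v_s$ is positive so
$$\left( \sum_s u_s v_s \right)^2 \le \left( \sum_s u_s^2 \right) \left( \sum_s v_s^2 \right),$$
which is just Cauchy's inequality. Thus, $F(x)$ is convex.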
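The midpoint inequality used in this argument can also be checked numerically. The following sketch evaluates $\tfrac{1}{2} f(x) + \tfrac{1}{2} f(y) - f\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right)$ for randomly drawn vectors; the dimensions, ranges, and function names are arbitrary choices for illustration, and a numerical spot check of this kind supplements the proof rather than replacing it.

```python
import math
import random


def f(x, a):
    """f(x) = ln(sum_s exp(a_s . x)), evaluated with a log-sum-exp shift."""
    exponents = [sum(ai * xi for ai, xi in zip(row, x)) for row in a]
    top = max(exponents)
    return top + math.log(sum(math.exp(e - top) for e in exponents))


def midpoint_gap(x, y, a):
    """Return (f(x) + f(y)) / 2 - f((x + y) / 2); nonnegative when f is convex."""
    mid = [(xi + yi) / 2 for xi, yi in zip(x, y)]
    return 0.5 * f(x, a) + 0.5 * f(y, a) - f(mid, a)


def min_gap_over_trials(trials, seed):
    """Smallest midpoint gap seen over random a_s, x and y (hypothetical test harness)."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        a = [[rng.uniform(-3, 3) for _ in range(4)] for _ in range(6)]
        x = [rng.uniform(-2, 2) for _ in range(4)]
        y = [rng.uniform(-2, 2) for _ in range(4)]
        smallest = min(smallest, midpoint_gap(x, y, a))
    return smallest
```

Up to floating-point rounding, the gap should never be negative, and it is zero when $x = y$.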
Because $F(x)$ is convex, NLP has local maxima located on the boundary of the feasible set and the optimal point $x^*$ will be among them.
4. The protein binding site application
The problem of Section 1 models the protein binding site problem considered by Stormo and Hartzell [7] and Lawrence and Reilly [3], in which the data consist of 18 DNA sequences each 105 bases long. The sequences are listed in Fig. 1. Each sequence contains one or two sites of the kind described in Section 1, so we can use these data and the nonlinear programming model NLP of Section 3 to find the consensus site pattern and the most likely locations of the sites.
From the data we computed the $p_{jsl}$ as described in Section 3, and we solved NLP, obtaining the results shown in Fig. 2. The solution is a Karush–Kuhn–Tucker point, with Lagrange multipliers of 1710 for the first constraint and 18 for the others.
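The multiplier values 1710 and 18 can be understood from the structure of NLP. The following derivation is a sketch, assuming that all eleven constraints are active at the solution (as Section 3 says the optimization forces); the symbols $G_k$, $\lambda_k$, $q_{sl}$, and $n_k$ are notation introduced here for illustration.

```latex
% Group k = 0,...,10 contains the indices G_k = {4k+1, ..., 4k+4}.
% Stationarity of the Lagrangian for the maximization problem:
\frac{\partial F}{\partial x_j} = \lambda_k e^{x_j}, \qquad j \in G_k .
% Differentiating F gives a weighted average of the counts:
\frac{\partial F}{\partial x_j} = \sum_{l=1}^{18} \sum_{s=1}^{90} q_{sl}\, p_{jsl},
\qquad
q_{sl} = \frac{\exp\bigl(\sum_{j} x_j p_{jsl}\bigr)}
              {\sum_{t=1}^{90} \exp\bigl(\sum_{j} x_j p_{jtl}\bigr)},
\qquad \sum_{s=1}^{90} q_{sl} = 1 .
% For every l and s, the counts in group k sum to the number of positions n_k
% governed by that group: n_0 = 105 - 10 = 95 and n_k = 1 for k = 1,...,10.
\sum_{j \in G_k} \frac{\partial F}{\partial x_j}
  = \sum_{l=1}^{18} \sum_{s=1}^{90} q_{sl}\, n_k = 18\, n_k .
% Summing the stationarity condition over G_k with \sum_{j \in G_k} e^{x_j} = 1:
\lambda_k = 18\, n_k
\quad\Longrightarrow\quad
\lambda_0 = 18 \cdot 95 = 1710, \qquad \lambda_k = 18 \ \ (k = 1, \ldots, 10).
```

Under these assumptions the multipliers do not depend on the particular sequence letters, only on the number of sequences and the number of positions each constraint covers.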
Fig. 1. Experimentally determined sequence data.
Fig. 2. The solution to NLP.
Recalling that the $x_j$ are the logs of probabilities for the appearance of an A, T, C, or G outside the sites and then in the 10 site positions $L_1, \ldots, L_{10}$, the first line in Fig. 2 corresponds to outside the sites, the second line corresponds to $L_1$, and so on. Then we can see by inspection of $x^*$ that the consensus site pattern is TGTGATCACA. This agrees with the consensus site pattern reported in [3].
From $x^*$ we found the $w_j$ and used the Bayes' theorem analysis of Section 2 to compute for each sequence $l$ the probabilities $\Pr\{S_s \mid \text{sequence } l\}$ of all the possible site starting positions $s = 1, \ldots, 90$. The table in Fig. 3 reports for each sequence the true site starting position or positions, along with the two most likely estimated starting positions and their probabilities.
The maximum likelihood model predicts a true starting position in every sequence except $l = 07$, and in two of the six sequences that contain two sites it predicts both of them. Sequence 07 is reproduced below, with the consensus site pattern shown above at the predicted starting position of 54 and below at the true starting position of 45.
There are two mismatches between the consensus site and the corresponding sequence positions when the
site starting position is 54, but three mismatches when the starting position is 45, so it is not surprising that
the maximum likelihood model predicts a starting position of 54 in this sequence.
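The mismatch counts quoted here compare the ten consensus letters with the letters in the $L_i$ positions of a candidate site. A short sketch of that comparison is given below; the function name and the 20-letter toy sequence are assumptions for illustration, since sequence 07 itself is not reproduced in this text.

```python
SITE_LEN, ARM, GAP = 16, 5, 6  # 5 L_i positions, 6 free positions, 5 L_i positions


def count_mismatches(seq, s, consensus):
    """Number of L_i positions where a site starting at 1-based s differs from consensus."""
    li_offsets = list(range(ARM)) + list(range(ARM + GAP, SITE_LEN))
    return sum(
        1
        for off, expected in zip(li_offsets, consensus)
        if seq[s - 1 + off] != expected
    )


# Toy sequence (hypothetical) with a perfect site starting at position 3.
toy = "ACTGTGAACGTACTCACAGT"
mismatches_at_3 = count_mismatches(toy, 3, "TGTGATCACA")
mismatches_at_1 = count_mismatches(toy, 1, "TGTGATCACA")
```

A mismatch count treats all positions and letters alike, whereas the likelihood weights each letter by its estimated probability, so the two measures usually agree in ranking candidate starts but need not always do so.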
5. Solving the nonlinear program
The focus of this paper is on formulating the protein binding site identification problem as a nonlinear optimization, and on interpreting the coordinates of the maximizing point to identify the sites. However, as noted in Section 3, NLP has local maxima on the boundary of the feasible set and this makes it difficult to find $x^*$. The problem of solving NLP is therefore itself also of some interest.
Fig. 3. Starting positions in the experimental data.
Our previous computational experience [1] showed that the ellipsoid algorithm of Shor [6,2] is often an effective heuristic for solving optimization problems having local extrema. This method is not guaranteed to find a global optimum nor even a Karush–Kuhn–Tucker point for NLP, but by trying various starting points and repeatedly restarting the algorithm with a new initial ellipsoid centered on the best point discovered so far, we were able to find the solution reported in Section 4.
The structure of the constraints in this problem suggests that a special-purpose algorithm might be devised to more efficiently search for $x^*$ in the boundary of the feasible set.
References
[1] J.G. Ecker, M. Kupferschmid, A computational comparison of the ellipsoid algorithm with several nonlinear programming algorithms, SIAM Journal on Control and Optimization 23 (5) (1985) 657–674.
[2] J.G. Ecker, M. Kupferschmid, in: Introduction to Operations Research, Krieger, Malabar, FL, 1991, pp. 315–322.
[3] C.E. Lawrence, A.A. Reilly, An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, PROTEINS: Structure, Function, and Genetics 7 (1990) 41–51.
[4] D.S. Mitrinović, Analytic Inequalities, Springer, New York, 1970.
[5] A.C.H. Scott, Locating binding sites for cyclic-AMP receptor proteins on unaligned DNA fragments using nonlinear programming, Ph.D. thesis, Rensselaer Polytechnic Institute, 1993.
[6] N.Z. Shor, Cut-off method with space extension in convex programming problems, Cybernetics 12 (1977) 94–96.
[7] G.D. Stormo, G.W. Hartzell, Identifying protein-binding sites from unaligned DNA fragments, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183–1187.