
An application of nonlinear optimization in molecular biology

April 2002
European Journal of Operational Research 138(2):452-458


DOI:10.1016/S0377-2217(01)00122-9


Authors:

J. G. Ecker

Rensselaer Polytechnic Institute

Charles Lawrence

Brown University


A maximum likelihood approach has been proposed for finding protein binding sites on strands of DNA [G.D. Stormo, G.W. Hartzell, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183]. We formulate an optimization model for the problem and present calculations with experimental sequence data to study the behavior of this site identification method.


Short Communication

An application of nonlinear optimization in molecular biology

J.G. Ecker, M. Kupferschmid *, C.E. Lawrence, A.A. Reilly, A.C.H. Scott

Department of Mathematical Sciences, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, USA

New York State Department of Health, Wadsworth Biometric Laboratory, Albany, NY 12201, USA

Academic Computing Services, Alan M. Voorhees Computing Center, Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY 12180-3590, USA

Aereal Inc., San Francisco, CA 94107, USA

Received 1 September 1999; accepted 16 February 2001

Abstract

A maximum likelihood approach has been proposed for finding protein binding sites on strands of DNA [G.D. Stormo, G.W. Hartzell, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183]. We formulate an optimization model for the problem and present calculations with experimental sequence data to study the behavior of this site identification method.

Keywords: Optimization; Nonlinear programming; Molecular biology; Protein binding; Maximum likelihood

1. The problem

Suppose we are given several letter sequences each 105 positions long, and that each position contains a letter from the set {A, T, C, G}. One such sequence is shown below, with dots beneath the letters in positions 10, 20, etc.

Within each sequence there is a subsequence of length 16, called a site, having the pattern

$$L_1 L_2 L_3 L_4 L_5 \cdot\cdot\cdot\cdot\cdot\cdot L_6 L_7 L_8 L_9 L_{10},$$

where the six middle positions, denoted by dots, can be any pattern of the four letters. If the data were perfect, all of the sites would have the same ideal or consensus pattern of $L_i$'s, but in the data we are given, most sites vary slightly from the consensus pattern and some vary significantly. A site can be located anywhere within a sequence, and the sites in the different sequences can be located differently.

European Journal of Operational Research 138 (2002) 452–458

www.elsevier.com/locate/dsw

* Corresponding author. Tel.: +1-518-276-6558; fax: +1-518-276-2809.

E-mail address: kupfem@rpi.edu (M. Kupferschmid).

PII: S0377-2217(01)00122-9

The problem is to find the $L_i$'s of the consensus site pattern and the locations of the sites in the given sequences of imperfect data.

2. A probabilistic model

In addition to having the pattern shown above, sites differ from the nonsite parts of the sequences in that the frequencies of the letters in the $L_i$ positions are different from those in other positions (that is, positions outside the site and the middle six positions within the site). A maximum likelihood approach has been proposed [7] to exploit these frequency differences to identify the sites.

Let $w_1$, $w_2$, $w_3$, and $w_4$ be, respectively, the probabilities of there being an A, T, C, or G in any position outside a site or in the middle six positions, and $w_{4i+1}$, $w_{4i+2}$, $w_{4i+3}$, and $w_{4i+4}$ the probabilities of an A, T, C, or G in the $L_i$ position of each site, $i = 1, \ldots, 10$. We assume that the letter occurrences are statistically independent. Thus there are 44 variables $w_j$ in this problem, and $105 - 16 + 1 = 90$ possible site starting positions $s$ in each 105-position sequence.
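The indexing convention just described can be written down as a short Python sketch. The names `w_index`, `LETTERS`, `SEQ_LEN`, and `SITE_LEN` are illustrative assumptions, and class `i = 0` is used here as shorthand for the background (outside a site or in the middle six positions).

```python
LETTERS = "ATCG"   # letter order used in the text: A, T, C, G
SEQ_LEN = 105      # positions per sequence
SITE_LEN = 16      # 5 pattern positions + 6 free positions + 5 pattern positions


def w_index(i, letter):
    """Return j such that w_j is the probability of `letter` in class i.

    i = 0 is the background class (an assumption of this sketch);
    i = 1..10 is the pattern position L_i.
    """
    if not 0 <= i <= 10:
        raise ValueError("i must be in 0..10")
    return 4 * i + LETTERS.index(letter) + 1


num_variables = 4 * 11                # background + ten pattern positions = 44
num_starts = SEQ_LEN - SITE_LEN + 1   # 105 - 16 + 1 = 90
```

For instance, `w_index(1, "T")` gives 6 and `w_index(2, "G")` gives 12, the indices that appear for a T in $L_1$ and a G in $L_2$ in the example of Section 3.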

If we knew the probabilities $w_1, \ldots, w_{44}$, we could calculate (as illustrated in Section 3) the probability of an observed sequence given that a binding site starts in position $s$. Letting $S_s$ be the event that a binding site begins in position $s$, we have

$\Pr\{\text{sequence} \mid S_s\}$ = probability of the sequence data given that a binding site starts in position $s$;

$\Pr\{S_s \mid \text{sequence}\}$ = probability that a binding site starts in position $s$ given the sequence data;

$\Pr\{S_s\}$ = probability that a binding site starts in position $s$.

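We assume the starting positions are equally likely, so $\Pr\{S_s\} = \frac{1}{90}$ for $s = 1, \ldots, 90$. Then from Bayes' theorem,

$$\Pr\{S_s \mid \text{sequence}\} = \frac{\Pr\{\text{sequence} \mid S_s\}\,\Pr\{S_s\}}{\Pr\{\text{sequence}\}} = \frac{\Pr\{\text{sequence} \mid S_s\}}{\sum_{t=1}^{90} \Pr\{\text{sequence} \mid S_t\}}.$$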

For each sequence we can then select likely starting positions $s$ for the binding sites based on the probabilities $\Pr\{S_s \mid \text{sequence}\}$. Of course, the $w_{4i+1}$, $w_{4i+2}$, $w_{4i+3}$, and $w_{4i+4}$ give the probabilities of an A, T, C, or G in the $L_i$ positions, so knowing the $w_j$ we can identify the letter most likely to be in the $L_i$ position.
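The Bayes' theorem step with a uniform prior amounts to normalizing the conditional probabilities over the 90 starting positions. The sketch below assumes the conditional probabilities are supplied as natural logarithms, a choice made here because products of 105 small probabilities underflow easily; the function names are illustrative, not from the paper.

```python
import math


def site_posteriors(log_likelihoods):
    """Return Pr{S_s | sequence} for each start, assuming a uniform prior.

    log_likelihoods[k] holds ln Pr{sequence | S_(k+1)}.
    """
    m = max(log_likelihoods)                      # shift to avoid underflow
    weights = [math.exp(v - m) for v in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def most_likely_starts(posteriors, k=2):
    """Return the k most probable (1-based start, probability) pairs."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda idx: posteriors[idx], reverse=True)
    return [(idx + 1, posteriors[idx]) for idx in ranked[:k]]
```

With two candidate starts whose conditional probabilities are in the ratio 1:3, `site_posteriors` returns posteriors of 0.25 and 0.75, and `most_likely_starts` ranks the second start first.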

This probabilistic model thus yields estimates of the starting positions and the letter in each $L_i$ position of the site, which is the problem of Section 1. How can we find the $w_j$?

3. Formulation of the nonlinear program

We use the maximum likelihood approach to find $w_j$'s that best fit the sequence data, by maximizing the probability of the observed sequences [5]. Given a trial set of values for the $w_j$ and a trial starting position $s$ for a single site in a sequence, we can write down the probability that we would observe this sequence if it were generated at random according to the probability model of Section 2. For example, suppose $s = 20$. Then the site is as shown within the sequence given earlier.


The conditional probability of observing this sequence is then

$$\Pr\{\text{sequence} \mid S_{20}\} = w_1^{23}\, w_2^{34}\, w_3^{14}\, w_4^{24}\,(w_6 w_{12} w_{14} w_{20} w_{24})(w_{28} w_{31} w_{36} w_{37} w_{44}).$$

There are 23 A's, 34 T's, 14 C's, and 24 G's outside of the $L_i$ positions of the site. These letters occur where they do with probabilities $w_1$, $w_2$, $w_3$, and $w_4$, respectively, so the probability that they all appear where they do is $w_1^{23} w_2^{34} w_3^{14} w_4^{24}$. In the $L_i$ positions of the site we find the letters TGTGG, with probabilities $w_6$, $w_{12}$, $w_{14}$, $w_{20}$, and $w_{24}$, and GCGAG having the probabilities $w_{28}$, $w_{31}$, $w_{36}$, $w_{37}$, and $w_{44}$.

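The probability $T_l$ of observing sequence $l$ is

$$T_l = \sum_{s=1}^{90} \Pr\{\text{sequence} \mid S_s\}\,\Pr\{S_s\} = \sum_{s=1}^{90} \frac{1}{90} \prod_{j=1}^{44} w_j^{p_{jsl}},$$

where, for $s = 1, \ldots, 90$ and $j = 1, \ldots, 44$, $p_{jsl}$ is the number of occurrences in sequence $l$ of the letter associated with $w_j$ when the site starts in position $s$. Thus, in the example above ($l = 1$, $s = 20$), some typical $p_{jsl}$ are

$p_{5,20,1}$ = number of occurrences of the letter associated with $w_5$ = number of A's in the $L_1$ position of the site = 0;

$p_{12,20,1}$ = number of G's in $L_2$ = 1.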
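The counting rule for the $p_{jsl}$ can be sketched in Python as follows. This is an illustration rather than the authors' code: the name `letter_counts`, the 0-based list layout (entry `j - 1` holds $p_j$), and the synthetic sequence are assumptions introduced here, since the experimental sequence is not reproduced in this text.

```python
LETTERS = "ATCG"
SITE_LEN = 16
# 0-based offsets from the site start of L1..L5 and L6..L10;
# offsets 5..10 are the six free middle positions.
PATTERN_OFFSETS = [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]


def letter_counts(sequence, s):
    """Return [p_1, ..., p_44] for one sequence and a 1-based site start s."""
    if not 1 <= s <= len(sequence) - SITE_LEN + 1:
        raise ValueError("site does not fit in the sequence")
    # Map absolute 0-based position -> pattern index i (1..10).
    pattern_at = {s - 1 + off: i
                  for i, off in enumerate(PATTERN_OFFSETS, start=1)}
    counts = [0] * 44
    for pos, letter in enumerate(sequence):
        i = pattern_at.get(pos, 0)          # 0 means background
        counts[4 * i + LETTERS.index(letter)] += 1
    return counts


# Synthetic 105-letter sequence (not the paper's data): all A's except for
# TGTGG in L1..L5 and GCGAG in L6..L10 of a site starting at position 20.
demo_sequence = "A" * 19 + "TGTGG" + "A" * 6 + "GCGAG" + "A" * 70
demo_counts = letter_counts(demo_sequence, 20)
```

For this synthetic sequence the nonzero site entries fall at $j$ = 6, 12, 14, 20, 24, 28, 31, 36, 37, and 44, the same indices that appear in the expression for $\Pr\{\text{sequence} \mid S_{20}\}$ above.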

The probability of observing sequence $l$ is $T_l$, so the probability of observing, say, 18 sequences is

$$\Pr\{\text{data}\} = \prod_{l=1}^{18} T_l = \prod_{l=1}^{18} \sum_{s=1}^{90} \frac{1}{90} \prod_{j=1}^{44} w_j^{p_{jsl}},$$

and this is the probability that we must find $w_j$'s to maximize. The $w_j$ are probabilities, so $0 \le w_j \le 1$. Thus, we can introduce the transformation $w_j = e^{x_j}$ to obtain

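$$\Pr\{\text{data}\} = \prod_{l=1}^{18} T_l = \left(\frac{1}{90}\right)^{18} \prod_{l=1}^{18} \sum_{s=1}^{90} \exp\left(\sum_{j=1}^{44} x_j p_{jsl}\right),$$

which eliminates the inner continued product. To remove the constant factor $\left(\frac{1}{90}\right)^{18}$ and the remaining continued product, we equivalently maximize

$$F(x) = \sum_{l=1}^{18} \ln\left(\sum_{s=1}^{90} \exp\left(\sum_{j=1}^{44} x_j p_{jsl}\right)\right).$$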
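A direct evaluation of $F(x)$ might look like the sketch below. It assumes the counts are already available as a nested list `p[l][s]` of per-letter count vectors (a layout chosen here for illustration) and uses the usual shift-by-maximum trick so that the inner exponentials do not underflow.

```python
import math


def log_sum_exp(values):
    """Return ln(sum(exp(v))) computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def objective(x, p):
    """F(x) = sum over l of ln sum over s of exp(sum over j of x_j p_jsl).

    p[l][s] is the list of counts [p_1sl, ..., p_nsl] for sequence l
    and start s; x is the list [x_1, ..., x_n].
    """
    total = 0.0
    for per_sequence in p:
        exponents = [sum(xj * pj for xj, pj in zip(x, counts))
                     for counts in per_sequence]
        total += log_sum_exp(exponents)
    return total
```

As a small check of the algebra, with two variables $x = (\ln 0.5, \ln 0.25)$ and one sequence whose two candidate starts have counts $(1, 0)$ and $(0, 1)$, the function returns $\ln(0.5 + 0.25)$.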

Because only one letter can occupy any location in the sequence, we have the following constraints, which the optimization will force to be satisfied as equalities:

$$\begin{aligned}
w_1 + w_2 + w_3 + w_4 &\le 1,\\
w_5 + w_6 + w_7 + w_8 &\le 1,\\
&\;\;\vdots\\
w_{41} + w_{42} + w_{43} + w_{44} &\le 1.
\end{aligned}$$


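Applying the transformation $w_j = e^{x_j}$ to these constraints, we obtain the nonlinear programming problem

NLP:

$$\begin{aligned}
\max_{x_1, \ldots, x_{44}} \quad & F(x) = \sum_{l=1}^{18} \ln\left(\sum_{s=1}^{90} \exp\left(\sum_{j=1}^{44} x_j p_{jsl}\right)\right)\\
\text{subject to} \quad & e^{x_1} + e^{x_2} + e^{x_3} + e^{x_4} \le 1,\\
& \qquad\vdots\\
& e^{x_{41}} + e^{x_{42}} + e^{x_{43}} + e^{x_{44}} \le 1,\\
& x_j \ \text{free}, \quad j = 1, \ldots, 44.
\end{aligned}$$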

NLP is completely specified once the letter counts $p_{jsl}$ are known, and they depend only on the sequence data in the way described at the beginning of this section.

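The objective function $F(x)$ is convex because it is the sum of functions having the form

$$f(x) = \ln \sum_s e^{a_s^T x},$$

and $f(x)$ is convex, as we show now. The function $f$ is continuous, so according to [4, Section 1.4] it is convex if and only if for every two vectors $x$ and $y$,

$$f\left(\tfrac{1}{2}x + \tfrac{1}{2}y\right) \le \tfrac{1}{2} f(x) + \tfrac{1}{2} f(y),$$

that is,

$$\ln \sum_s e^{a_s^T\left(\frac{1}{2}x + \frac{1}{2}y\right)} \le \tfrac{1}{2} \ln \sum_s e^{a_s^T x} + \tfrac{1}{2} \ln \sum_s e^{a_s^T y}.$$

Letting $u_s = \exp\left(a_s^T\left(\tfrac{1}{2}x\right)\right)$ and $v_s = \exp\left(a_s^T\left(\tfrac{1}{2}y\right)\right)$, we need to show that

$$\ln \sum_s u_s v_s \le \tfrac{1}{2} \ln \sum_s u_s^2 + \tfrac{1}{2} \ln \sum_s v_s^2,$$

or equivalently

$$\sum_s u_s v_s \le \exp\left(\tfrac{1}{2} \ln \sum_s u_s^2 + \tfrac{1}{2} \ln \sum_s v_s^2\right) = \sqrt{\sum_s u_s^2}\,\sqrt{\sum_s v_s^2}.$$

But each $u_s$ and $v_s$ is positive, so this is equivalent to

$$\left(\sum_s u_s v_s\right)^2 \le \left(\sum_s u_s^2\right)\left(\sum_s v_s^2\right),$$

which is just Cauchy's inequality. Thus, $F(x)$ is convex.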
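The midpoint inequality used in this argument can also be spot-checked numerically. The sketch below does so for a small, randomly generated set of nonnegative integer vectors $a_s$; the dimensions, seed, and sampling ranges are arbitrary assumptions, and a numerical check of this kind illustrates the inequality but does not replace the proof.

```python
import math
import random


def f(x, a):
    """f(x) = ln sum over s of exp(a_s . x)."""
    return math.log(sum(math.exp(sum(ai * xi for ai, xi in zip(a_s, x)))
                        for a_s in a))


def midpoint_gap(x, y, a):
    """Return (f(x) + f(y)) / 2 - f((x + y) / 2); nonnegative if f is convex."""
    mid = [(xi + yi) / 2.0 for xi, yi in zip(x, y)]
    return 0.5 * f(x, a) + 0.5 * f(y, a) - f(mid, a)


rng = random.Random(0)
a = [[rng.randint(0, 5) for _ in range(3)] for _ in range(4)]  # count-like vectors
gaps = []
for _ in range(1000):
    x = [rng.uniform(-2.0, 0.0) for _ in range(3)]   # logs of probabilities are <= 0
    y = [rng.uniform(-2.0, 0.0) for _ in range(3)]
    gaps.append(midpoint_gap(x, y, a))
min_gap = min(gaps)   # should not be negative beyond rounding error
```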


Because $F(x)$ is convex, NLP has local maxima located on the boundary of the feasible set and the optimal point $x^\star$ will be among them.

4. The protein binding site application

The problem of Section 1 models the protein binding site problem considered by Stormo and Hartzell [7] and Lawrence and Reilly [3] in which the data consist of 18 DNA sequences each 105 bases long. The sequences are listed in Fig. 1. Each sequence contains one or two sites of the kind described in Section 1, so we can use this data and the nonlinear programming model NLP of Section 3 to find the consensus site pattern and the most likely locations of the sites.

From the data we computed the $p_{jsl}$ as described in Section 3, and we solved NLP obtaining the results shown in Fig. 2. The solution is a Karush–Kuhn–Tucker point, with Lagrange multipliers of 1710 for the first constraint and 18 for the others.
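One way to see why multipliers of exactly this size are plausible is the following sketch, which assumes that all eleven constraints are active at the solution. The multipliers $\lambda_i$ ($i = 0$ for the first constraint, $i = 1, \ldots, 10$ for the others) and the weights $\pi_{sl}$ are notation introduced here, not in the paper.

```latex
\begin{aligned}
\frac{\partial F}{\partial x_j}
  &= \sum_{l=1}^{18} \sum_{s=1}^{90} \pi_{sl}\, p_{jsl},
  \qquad
  \pi_{sl} = \frac{\exp\bigl(\sum_{k=1}^{44} x_k p_{ksl}\bigr)}
                  {\sum_{t=1}^{90} \exp\bigl(\sum_{k=1}^{44} x_k p_{ktl}\bigr)},
  \qquad \sum_{s=1}^{90} \pi_{sl} = 1, \\
\lambda_i\, e^{x_j}
  &= \frac{\partial F}{\partial x_j},
  \qquad j = 4i+1, \ldots, 4i+4
  \quad \text{(stationarity)}, \\
\lambda_i
  &= \lambda_i \sum_{j=4i+1}^{4i+4} e^{x_j}
   = \sum_{l=1}^{18} \sum_{s=1}^{90} \pi_{sl} \sum_{j=4i+1}^{4i+4} p_{jsl}
  \quad \text{(active constraint)}.
\end{aligned}
```

For $i = 0$ the inner sum counts the 95 letters outside the $L_i$ positions for every $s$ and $l$, which gives $\lambda_0 = 18 \times 95 = 1710$. For $i = 1, \ldots, 10$ it counts the single letter in position $L_i$, which gives $\lambda_i = 18$. Both values match those reported above.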

Fig. 1. Experimentally determined sequence data.

Fig. 2. The solution to NLP.


Recalling that the $x_j$ are the logs of probabilities for the appearance of an A, T, C, or G outside the sites and then in the 10 site positions $L_1, \ldots, L_{10}$, the first line in Fig. 2 corresponds to outside the sites, the second line corresponds to $L_1$, and so on. Then we can see by inspection of $x^\star$ that the consensus site pattern is TGTGATCACA. This agrees with the consensus site pattern reported in [3].

From $x^\star$ we found the $w_j$ and used the Bayes' theorem analysis of Section 2 to compute for each sequence $l$ the probabilities $\Pr\{S_s \mid \text{sequence } l\}$ of all the possible site starting positions $s = 1, \ldots, 90$. The table in Fig. 3 reports for each sequence the true site starting position or positions, along with the two most likely estimated starting positions and their probabilities.

The maximum likelihood model predicts a true starting position in every sequence except $l = 07$, and in two of the six sequences that contain two sites it predicts both of them. Sequence 07 is reproduced below, with the consensus site pattern shown above at the predicted starting position of 54 and below at the true starting position of 45.

There are two mismatches between the consensus site and the corresponding sequence positions when the site starting position is 54, but three mismatches when the starting position is 45, so it is not surprising that the maximum likelihood model predicts a starting position of 54 in this sequence.
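The mismatch count used in this comparison can be sketched as below. The function name and the synthetic sequences are assumptions made for illustration; sequence 07 itself is not reproduced in this text, so the counts of two and three quoted above are not recomputed here.

```python
# 0-based offsets from the site start of L1..L5 and L6..L10; the six
# middle positions (offsets 5..10) are free and are not compared.
PATTERN_OFFSETS = [0, 1, 2, 3, 4, 11, 12, 13, 14, 15]


def count_mismatches(sequence, s, consensus):
    """Count L_i positions that differ from the consensus for 1-based start s."""
    if len(consensus) != len(PATTERN_OFFSETS):
        raise ValueError("consensus must have 10 letters")
    window = [sequence[s - 1 + off] for off in PATTERN_OFFSETS]
    return sum(1 for seen, want in zip(window, consensus) if seen != want)


consensus = "TGTGATCACA"
# Synthetic examples: a perfect site and one with two altered letters,
# both starting at position 11.
perfect = "C" * 10 + "TGTGA" + "AAAAAA" + "TCACA" + "C" * 10
altered = "C" * 10 + "TGAGA" + "AAAAAA" + "TCACC" + "C" * 10
perfect_mismatches = count_mismatches(perfect, 11, consensus)
altered_mismatches = count_mismatches(altered, 11, consensus)
```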

5. Solving the nonlinear program

The focus of this paper is on formulating the protein binding site identification problem as a nonlinear optimization, and on interpreting the coordinates of the maximizing point to identify the sites. However, as noted in Section 3, NLP has local maxima on the boundary of the feasible set and this makes it difficult to find $x^\star$. The problem of solving NLP is therefore itself also of some interest.

Fig. 3. Starting positions in the experimental data.


Our previous computational experience [1] showed that the ellipsoid algorithm of Shor [6,2] is often an effective heuristic for solving optimization problems having local extrema. This method is not guaranteed to find a global optimum nor even a Karush–Kuhn–Tucker point for NLP, but by trying various starting points and repeatedly restarting the algorithm with a new initial ellipsoid centered on the best point discovered so far, we were able to find the solution reported in Section 4.
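For readers unfamiliar with the method, the basic central-cut ellipsoid iteration can be sketched as follows. This is a generic textbook form written as a minimizer and demonstrated on a small convex toy problem; it is not the authors' implementation, and the function name, argument layout, and toy problem are assumptions. To apply it to NLP one would minimize $-F(x)$, where, as noted above, it acts only as a heuristic.

```python
import math


def ellipsoid_minimize(f, grad_f, constraints, x0, radius, iterations=100):
    """Central-cut ellipsoid iteration for min f(x) s.t. g(x) <= 0 (n >= 2).

    constraints is a list of (g, grad_g) pairs. Returns the best feasible
    center visited and its objective value, or (None, inf) if none was feasible.
    """
    n = len(x0)
    x = list(x0)
    # Q defines the ellipsoid {z : (z - x)^T Q^{-1} (z - x) <= 1}.
    Q = [[radius * radius if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    best_x, best_val = None, math.inf
    for _ in range(iterations):
        violated = next(((g, dg) for g, dg in constraints if g(x) > 0.0), None)
        if violated is None:
            val = f(x)
            if val < best_val:
                best_x, best_val = list(x), val
            cut = grad_f(x)            # objective cut at a feasible center
        else:
            cut = violated[1](x)       # feasibility cut at an infeasible center
        Qg = [sum(Q[i][j] * cut[j] for j in range(n)) for i in range(n)]
        gQg = sum(cut[i] * Qg[i] for i in range(n))
        if not gQg > 0.0:              # zero gradient or numerical breakdown
            break
        d = [v / math.sqrt(gQg) for v in Qg]
        x = [x[i] - d[i] / (n + 1) for i in range(n)]
        scale = n * n / (n * n - 1.0)
        Q = [[scale * (Q[i][j] - 2.0 / (n + 1) * d[i] * d[j])
              for j in range(n)] for i in range(n)]
    return best_x, best_val


# Toy problem: minimize (x - 2)^2 + (y - 1)^2 subject to x + y <= 1.
def toy_f(x):
    return (x[0] - 2.0) * (x[0] - 2.0) + (x[1] - 1.0) * (x[1] - 1.0)


def toy_grad(x):
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]


toy_constraints = [(lambda x: x[0] + x[1] - 1.0, lambda x: [1.0, 1.0])]
best_x, best_val = ellipsoid_minimize(toy_f, toy_grad, toy_constraints,
                                      x0=[0.0, 0.0], radius=5.0)
```

The toy problem has its constrained minimum at $(1, 0)$ with value 2. A restart of the kind described above would call `ellipsoid_minimize` again with `x0=best_x` and a fresh radius.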

The structure of the constraints in this problem suggests that a special-purpose algorithm might be devised to more efficiently search for $x^\star$ in the boundary of the feasible set.

References

[1] J.G. Ecker, M. Kupferschmid, A computational comparison of the ellipsoid algorithm with several nonlinear programming algorithms, SIAM Journal on Control and Optimization 23 (5) (1985) 657–674.

[2] J.G. Ecker, M. Kupferschmid, in: Introduction to Operations Research, Krieger, Malabar, FL, 1991, pp. 315–322.

[3] C.E. Lawrence, A.A. Reilly, An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, PROTEINS: Structure, Function, and Genetics 7 (1990) 41–51.

[4] D.S. Mitrinović, Analytic Inequalities, Springer, New York, 1970.

[5] A.C.H. Scott, Locating binding sites for cyclic-AMP receptor proteins on unaligned DNA fragments using nonlinear programming, Ph.D. thesis, Rensselaer Polytechnic Institute, 1993.

[6] N.Z. Shor, Cut-off method with space extension in convex programming problems, Cybernetics 12 (1977) 94–96.

[7] G.D. Stormo, G.W. Hartzell, Identifying protein-binding sites from unaligned DNA fragments, Proceedings of the National Academy of Sciences of the USA 86 (1989) 1183–1187.

