A Multiobjective Genetic Algorithm Based Approach to the Optimization of Oligonucleotide Microarray Production Process
ABSTRACT Microarrays are increasingly used as an experimental platform in molecular biology. Although they are rapidly becoming
more affordable, these micro devices still have a fairly high production cost, which limits their commercial appeal. Here we present
a novel multiobjective evolutionary approach to the optimization of the production process of microarray devices, mainly aimed
at lowering the number of fabrication steps. To make the description concrete, we report a detailed real-world case study
carried out on the most recent microarray platforms of the market leader in this field. A comparative analysis of the most
widely used approaches and the main potentialities and drawbacks of the proposed approach are presented.

Filippo Menolascina, Vitoantonio Bevilacqua, Caterina Ciminelli,
Mario Nicola Armenise, and Giuseppe Mastronardi
Department of Electrotechnics and Electronics, Technical University of Bari,
70126, Italy
f.menolascina@ieee.org
1 Introduction
An oligonucleotide microarray is a piece of glass or plastic material on which
single-stranded fragments of DNA, called probes, are placed or synthesized. The
chips produced can, for instance, contain more than one million spots (or
features) as small as 11 μm, with each spot accommodating several million copies of
a probe. Probes are typically 25 nucleotides long and are synthesized in parallel,
on the chip, in a series of repetitive steps. Each step appends the same nucleotide
to the probes in selected regions of the chip. Selection occurs by exposure to light
with the help of a photolithographic mask [1].
Formally, we have a set of probes $P = \{p_1, p_2, \ldots, p_n\}$ that are produced by a
series of masks $M = \{m_1, m_2, \ldots, m_T\}$, each mask $m_t$ allowing the addition of
a particular nucleotide $S_t \in \{A, C, G, T\}$ to a subset of $P$. The
nucleotide deposition sequence $S = S_1 S_2 \ldots S_T$, corresponding to the sequence
of nucleotides added at each masking step, is therefore a supersequence of all
$p \in P$ [10].
D.S. Huang et al. (Eds.): ICIC 2008, LNAI 5227, pp. 1039–1046, 2008. © Springer-Verlag Berlin Heidelberg 2008

In general, a probe can be embedded within $S$ in several ways. The embedding
of $p_k$ can be described as a $T$-tuple $\varepsilon_k = (e_{k,1}, e_{k,2}, \ldots, e_{k,T})$ in which $e_{k,t} = 1$
if probe $p_k$ receives nucleotide $S_t$ (at step $t$), and $e_{k,t} = 0$ otherwise. The deposition
sequence is often a repeated permutation of the alphabet, mainly because
of its regular structure and because such sequences maximize the number of
distinct subsequences. We distinguish between synchronous and asynchronous
embeddings. In the first case, each probe has exactly one nucleotide synthesized
in every cycle of the deposition sequence; hence, 25 cycles, or 100 steps, are needed
to synthesize probes of length 25. In the case of asynchronous embeddings, probes
can have any number of nucleotides synthesized in any given cycle, allowing
shorter deposition sequences. All chips manufactured by the market leader can be
asynchronously synthesized in 74 steps (18.5 cycles), which is probably due to
careful probe selection. The problem of finding the sequence that minimizes the
number of steps required to complete the microarray production process is
the Shortest Common Supersequence (SCS) problem, which is well known
to be NP-complete. In this paper, we present a novel approach to the SCS
problem based on a multiobjective genetic algorithm that tries to
minimize both the number of steps and the number of mask changes, in order to
minimize the costs related to microarray production.
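The embedding tuple and the synchronous step count described above can be illustrated with a minimal Python sketch. It assumes a leftmost (as-early-as-possible) asynchronous embedding and a cycle order of A, C, G, T; both choices, as well as the function names, are illustrative assumptions and are not taken from the paper.

```python
ALPHABET = "ACGT"  # assumed order of nucleotides within one synthesis cycle


def leftmost_embedding(probe: str, deposition: str) -> list[int]:
    """Return (e_1, ..., e_T) as a 0/1 list, placing each base as early as possible."""
    embedding = [0] * len(deposition)
    next_base = 0  # index of the next probe nucleotide still to be synthesized
    for step, nucleotide in enumerate(deposition):
        if next_base < len(probe) and probe[next_base] == nucleotide:
            embedding[step] = 1
            next_base += 1
    if next_base != len(probe):
        raise ValueError("deposition sequence is not a supersequence of the probe")
    return embedding


def synchronous_steps(probe_length: int) -> int:
    """Steps needed when every probe receives exactly one nucleotide per cycle."""
    return probe_length * len(ALPHABET)


if __name__ == "__main__":
    # Two full cycles are enough to embed the toy probe "GAT" asynchronously.
    print(leftmost_embedding("GAT", ALPHABET * 2))
    print(synchronous_steps(25))
```

A probe that cannot be embedded raises an error, which corresponds to the deposition sequence failing to be a supersequence of that probe.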
2 The SCS Problem with Applications in Bioinformatics
and Nanotechnology
2.1 Microarray Technology
Several microarray technologies are available today, based on a variety of fabrication
techniques including printing with fine-pointed pins onto glass slides, ink-jet
printing, electrochemistry on microelectrode arrays and photolithography. This
paper is mainly concerned with the production of high-density oligonucleotide
microarrays, also called DNA chips or gene chips, that are fabricated by
photolithography. This type of microarray consists of relatively short DNA probes
synthesized at specific locations, named features or spots, on a solid surface.
Each probe is a single-stranded DNA molecule of 10 to 70 nucleotides that
perfectly matches a specific portion of a target molecule. The sequence of
nucleotides added in each step is called the deposition sequence or synthesis schedule.
The selection of which probes receive the nucleotide is achieved by
photolithography [1][2]. Figure 1 illustrates this process: the quartz wafer of a GeneChip
array is initially coated with a chemical compound topped with a light-sensitive
protecting group that is removed when exposed to ultraviolet light, activating
the compound for chemical coupling. A lithographic mask is used to direct
light and remove the protecting groups of only those positions that should
receive the nucleotide of a particular synthesis step. A solution containing
adenine (A), thymine (T), cytosine (C) or guanine (G) is then flushed over the
chip surface, but the chemical coupling occurs only in those positions that have
been previously deprotected. Each coupled nucleotide also bears another protecting
group so that the process can be repeated until all probes have been fully
synthesized.
Fig. 1. Probe synthesis via photolithographic masks. The chip is coated with a chemical
compound and a light-sensitive protecting group; masks are used to direct light and
activate selected probes for chemical coupling; nucleotides are appended to deprotected
probes; the process is repeated until all probes have been fully synthesized.
2.2 SCS Problem
The problem of finding the Shortest Common Supersequence (SCS) of a given
set of sequences is a very important problem in computer science, especially in
computational molecular biology. It can be stated
as follows. Given two sequences $S = s_1 s_2 \ldots s_m$ and $T = t_1 t_2 \ldots t_n$ over an
alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_q\}$, we say that $S$ is a subsequence of $T$ (and,
equivalently, $T$ is a supersequence of $S$) if there exist indices
$1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $s_j = t_{i_j}$ for every $j$, $1 \le j \le m$.
Given a finite set of sequences $S = \{S_1, S_2, \ldots, S_k\}$,
a common supersequence of $S$ is a sequence $T$ such that $T$ is a supersequence
of every sequence $S_j$ ($1 \le j \le k$) in $S$. A shortest common supersequence
(SCS) of $S$ is then a common supersequence of $S$ that has minimum length. In this paper,
we shall assume that $k$ is the number of sequences in $S$, $n$ is the length of each
sequence, and $q = |\Sigma|$ is the size of the alphabet. The SCS problem has
applications in many diverse areas, including data compression [3], scheduling [4],
query optimization [5], text comparison and analysis, and biological sequence
comparison and analysis [6][7]. As a result, the SCS problem has been very
intensively investigated [8][9]. One basic result is that the SCS of two sequences
of length $n$ can be computed using dynamic programming in $O(n^2)$ time and
$O(n^2)$ space (see, for example, [10]). Several papers have also reported
improvements on the running time and space required by dynamic programming
algorithms [9]. For a fixed $k$, the dynamic programming algorithm can be
extended to solve the SCS problem for $k$ sequences of length $n$ in $O(n^k)$ time
and space. Clearly, this algorithm is not practical for large $k$. The general SCS
problem on arbitrary $k$ sequences of length $n$ is well known to be NP-hard. In
fact, Jiang and Li [10] showed that even the problem of finding a constant-ratio
approximation solution is NP-hard.
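The subsequence definition and the quadratic dynamic program for two sequences can be sketched as follows. This is a textbook-style illustration written for this section, not code from the paper; the function names are hypothetical.

```python
def is_subsequence(s: str, t: str) -> bool:
    """True if s can be obtained from t by deleting characters (s_j = t_{i_j})."""
    remaining = iter(t)
    return all(ch in remaining for ch in s)


def scs_pair(a: str, b: str) -> str:
    """Shortest common supersequence of two strings in O(len(a) * len(b)) time and space."""
    m, n = len(a), len(b)
    # length[i][j] holds the SCS length of the suffixes a[i:] and b[j:].
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                length[i][j] = n - j
            elif j == n:
                length[i][j] = m - i
            elif a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = 1 + min(length[i + 1][j], length[i][j + 1])
    # Walk the table from the top-left corner to rebuild one optimal supersequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif length[i + 1][j] <= length[i][j + 1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return "".join(out)


if __name__ == "__main__":
    print(scs_pair("ACG", "CGT"))  # ACGT
```

Extending the table to $k$ dimensions gives the $O(n^k)$ algorithm mentioned above, which is why exact dynamic programming is only practical for very small $k$.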
Previous Research on the SCS Problem. We now present a brief survey of the
most popular heuristic algorithms proposed in the literature. Let $S$ be any instance
of the SCS problem and let $CS_A(S)$ be the supersequence of $S$ identified by a
heuristic algorithm $A$. Let $opt(S)$ denote an optimal solution for the instance $S$.
Then, we say that $A$ has an approximation ratio of $\lambda$ if $|CS_A(S)| / |opt(S)| \le \lambda$
for all instances $S$.
Alphabet Algorithm
The Alphabet algorithm is a quite simple approach to the problem under
investigation [8]. Let $S$ be a set of sequences of maximum length $n$ over the
alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_q\}$; then the Alphabet algorithm outputs the common
supersequence $(\sigma_1 \sigma_2 \cdots \sigma_q)^n$. The Alphabet algorithm has an approximation
ratio of $q = |\Sigma|$. The time complexity of the Alphabet algorithm is $O(qn)$.
There have also been modifications of the Alphabet algorithm that use information
from $S$ to 'remove' redundant characters in $(\sigma_1 \sigma_2 \cdots \sigma_q)^n$. These methods
improve the performance in practice, but not the worst-case approximation
ratio of $q$.
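As an illustration, the Alphabet algorithm and one possible trimming variant can be sketched in Python. The trimming rule used here (drop every position that no sequence uses in its leftmost embedding) is an assumption chosen for simplicity; the modifications cited in the text may work differently.

```python
def alphabet_supersequence(sequences: list[str], alphabet: str = "ACGT") -> str:
    """Repeat the whole alphabet n times, where n is the maximum sequence length."""
    n = max((len(s) for s in sequences), default=0)
    return alphabet * n


def drop_unused(supersequence: str, sequences: list[str]) -> str:
    """Keep only the positions used by the leftmost embedding of at least one sequence.

    Hypothetical trimming rule for illustration; the result is still a common
    supersequence because every sequence keeps all positions it was embedded in.
    """
    used = [False] * len(supersequence)
    for seq in sequences:
        nxt = 0  # next character of seq still to be matched
        for pos, ch in enumerate(supersequence):
            if nxt < len(seq) and seq[nxt] == ch:
                used[pos] = True
                nxt += 1
    return "".join(ch for ch, keep in zip(supersequence, used) if keep)


if __name__ == "__main__":
    seqs = ["ACG", "CGT"]
    full = alphabet_supersequence(seqs)
    print(full, len(full))          # length q * n = 4 * 3
    print(drop_unused(full, seqs))  # ACGT
```

In this toy example the untrimmed output has length $qn = 12$, while the trimmed one has length 4, which shows the practical gain even though the worst-case ratio is unchanged.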
Majority Merge Algorithm
The Majority-Merge algorithm (MM) [10] is a simple, greedy heuristic algorithm.
Suppose we analyze every sequence from left to right; the frontier is defined
as the leftmost characters still to be analyzed. Initially, the supersequence $CS$ is
empty. At each step, let $s$ be the majority among the 'frontier' characters of the
remaining portions of the sequences in $S$. Set $CS = CS \| s$ (where $\|$ represents
concatenation) and delete the 'frontier' $s$ characters from the sequences in $S$.
Repeat until no sequences are left. This algorithm is the same as the Sum Height
algorithm (SH) proposed in [12]. This algorithm does not have any worst-case
approximation ratio, but performs very well in practice. The time complexity of
the Majority-Merge algorithm is $O(qkn)$.
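The Majority-Merge procedure can be written in a few lines of Python. This sketch breaks ties between equally frequent frontier characters alphabetically, which is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def majority_merge(sequences: list[str]) -> str:
    """Greedy common supersequence: repeatedly emit the most frequent frontier character."""
    remaining = [s for s in sequences if s]
    supersequence = []
    while remaining:
        # The frontier is the first unprocessed character of every remaining sequence.
        counts = Counter(s[0] for s in remaining)
        # Assumed tie-break: most frequent first, then alphabetical order.
        best = min(counts, key=lambda ch: (-counts[ch], ch))
        supersequence.append(best)
        # Consume the chosen character wherever it sits on the frontier.
        remaining = [s[1:] if s[0] == best else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(supersequence)


if __name__ == "__main__":
    print(majority_merge(["ACG", "CGT"]))      # ACGT
    print(majority_merge(["AC", "CA", "CC"]))  # CAC
```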
Greedy and Tournament algorithms
The Greedy algorithm (GRDY) and the Tournament algorithm (TOUR) studied
in [13] are two variations of an iterative scheme based on combining optimal
sequence pairs. Given any pair of sequences, $S_i$ and $S_j$, an optimal supersequence
of the pair, denoted by $SCS(S_i, S_j)$, can be computed in $O(n^2)$ time using
dynamic programming. The Greedy algorithm first chooses the 'best' sequence
pair, that is, the pair that gives the shortest $SCS(S_i, S_j)$. Without loss of generality, we
assume that these two sequences are $S_1$ and $S_2$. The algorithm then replaces the
two sequences $S_1$ and $S_2$ by their supersequence, $SCS(S_1, S_2)$. The algorithm
proceeds recursively. Thus, we can express it as follows:
$Greedy(S_1, S_2, \ldots, S_k) = Greedy(SCS(S_1, S_2), S_3, \ldots, S_k)$
The Tournament algorithm is similar to the Greedy algorithm. It builds a
'tournament' based on finding multiple best pairs at each round and can be expressed
schematically as follows:
$Tournament(S_1, S_2, \ldots, S_k) = Tournament(SCS(S_1, S_2), SCS(S_3, S_4), \ldots, SCS(S_{k-1}, S_k))$
Both the Greedy and Tournament algorithms have $O(k^2 n^2)$ time complexity and
$O(kn + n^2)$ space complexity. Unfortunately, it was shown in [11] that neither
Greedy nor Tournament has an approximation ratio.
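Both pairwise-merging schemes can be sketched on top of a two-sequence SCS routine. In this illustrative Python version the Tournament rounds simply pair neighbouring sequences in the given order, as in the schematic formula; selecting the best pairs in each round, as described in [13], is left out for brevity. All function names are hypothetical.

```python
from itertools import combinations


def is_subsequence(s: str, t: str) -> bool:
    remaining = iter(t)
    return all(ch in remaining for ch in s)


def scs_pair(a: str, b: str) -> str:
    """Optimal supersequence of two strings by dynamic programming."""
    m, n = len(a), len(b)
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m or j == n:
                length[i][j] = (m - i) + (n - j)
            elif a[i] == b[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = 1 + min(length[i + 1][j], length[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif length[i + 1][j] <= length[i][j + 1]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return "".join(out) + a[i:] + b[j:]


def greedy_scs(sequences: list[str]) -> str:
    """Repeatedly replace the pair with the shortest pairwise SCS by that SCS."""
    pool = list(sequences)
    while len(pool) > 1:
        i, j = min(combinations(range(len(pool)), 2),
                   key=lambda ij: len(scs_pair(pool[ij[0]], pool[ij[1]])))
        merged = scs_pair(pool[i], pool[j])
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


def tournament_scs(sequences: list[str]) -> str:
    """Merge neighbouring pairs round by round (simplified pairing, see lead-in)."""
    pool = list(sequences)
    while len(pool) > 1:
        merged = [scs_pair(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2:
            merged.append(pool[-1])  # odd sequence out advances to the next round
        pool = merged
    return pool[0] if pool else ""


if __name__ == "__main__":
    probes = ["ACG", "CGT", "GTA"]
    print(greedy_scs(probes), tournament_scs(probes))
```

Each merge is optimal for its pair, yet the final result need not be optimal for the whole set, which is consistent with the lack of a guaranteed approximation ratio noted above.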
3 Multiobjective Genetic Algorithms in Microarray
Production Process Optimization
Multi-Objective Genetic Algorithms (MOGAs) are a relatively recent extension
of Genetic Algorithms (GAs), which are well-established bio-inspired
computational optimization approaches with a wide range of applications spanning
from finance to medicine. The concept of GAs was developed by Holland and his
colleagues in the 1960s and 1970s [14]. GAs are inspired by the evolutionist theory
explaining the origin of species. In nature, weak and unfit species within their
environment are faced with extinction by natural selection. The strong ones have
greater opportunity to pass their genes to future generations via reproduction.
Occasionally, random changes occur in genes.
If these changes provide additional advantages in the challenge for survival, new
species evolve from the old ones. Unsuccessful changes are eliminated by natural
selection. In GA terminology, a solution vector $x \in X$ is called an individual or
a chromosome. Chromosomes are made of discrete units called genes. Each gene
controls one or more features of the chromosome. In the original implementation
of GAs by Holland, genes are assumed to be binary digits. In later
implementations, more varied gene types have been introduced. Normally, a chromosome
corresponds to a unique solution $x$ in the solution space. This requires a mapping
mechanism between the solution space and the chromosomes.
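To make the chromosome-to-solution mapping concrete, the following Python sketch encodes a deposition sequence as a list of integer genes. The encoding (one gene per deposition step, values 0 to 3), the one-point crossover and the per-gene mutation are hypothetical choices for illustration only; the paper does not specify the representation or operators it uses.

```python
import random

ALPHABET = "ACGT"  # gene value g maps to nucleotide ALPHABET[g] (assumed encoding)


def decode(chromosome: list[int]) -> str:
    """Map a chromosome (list of genes) to the deposition sequence it represents."""
    return "".join(ALPHABET[gene] for gene in chromosome)


def one_point_crossover(parent_a: list[int], parent_b: list[int], point: int) -> list[int]:
    """Child takes the genes of parent_a before `point` and of parent_b from `point` on."""
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome: list[int], rate: float, rng: random.Random) -> list[int]:
    """Replace each gene by a random nucleotide index with probability `rate`."""
    return [rng.randrange(len(ALPHABET)) if rng.random() < rate else gene
            for gene in chromosome]


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = [2, 0, 3, 1], [0, 0, 1, 1]
    child = one_point_crossover(a, b, point=2)
    print(decode(a), decode(b), decode(child))
    print(decode(mutate(child, rate=0.25, rng=rng)))
```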
Being a population-based approach, GAs are well suited to solving multiobjective
optimization problems. A generic single-objective GA can be modified to find a
set of multiple non-dominated solutions in a single run. The ability of GAs to
simultaneously search different regions of a solution space makes it possible to find
a diverse set of solutions for difficult problems with non-convex, discontinuous,
and multimodal solution spaces. Most multiobjective GAs do not require the
user to prioritize, scale, or weigh objectives. Therefore, GAs have been the most
popular heuristic approach to multiobjective design and optimization problems.
Jones et al. [15] reported that 90% of the approaches to multiobjective
optimization aimed to approximate the true Pareto front for the underlying problem. A
majority of these used a metaheuristic technique, and 70% of all metaheuristic
approaches were based on evolutionary approaches. From this perspective it
is easy to see how MOGAs can be used to carry out an
optimization that minimizes the number of steps
required for manufacturing a microarray and, at the same time, the setup
time and cost associated with changing the base to be deposited on the
surface of the biochip assembled so far. In the next section we will show how this
problem has been addressed using a MOGA and we will present computational
results relating to a specific instance of this problem.
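The notion of non-dominated solutions for the two objectives named above can be sketched in Python. The second objective is modelled here as the number of positions at which the deposited base differs from the previous one; this concrete formula is an assumption, since the paper does not give the exact cost function.

```python
def objectives(deposition: str) -> tuple[int, int]:
    """Return (number of steps, number of base changes) for a deposition sequence.

    The base-change count is an assumed stand-in for the setup cost.
    """
    steps = len(deposition)
    changes = sum(1 for prev, cur in zip(deposition, deposition[1:]) if prev != cur)
    return steps, changes


def dominates(a: tuple[int, ...], b: tuple[int, ...]) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: list[tuple[int, ...]]) -> list[tuple[int, ...]]:
    """Keep only the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    print(objectives("GGCTA"))  # (5, 3)
    candidates = [(30, 40), (35, 35), (40, 30), (40, 40), (35, 45)]
    print(pareto_front(candidates))  # [(30, 40), (35, 35), (40, 30)]
```

A MOGA maintains and refines such a front across generations instead of collapsing the objectives into a single weighted score.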
4 Novel Optimisation Procedure
In the proposed experimental design we evaluated the performance of the MOGA
approach (Pareto front in Fig. 2), including the computation times required. In order to make our
evaluation as close to a real case as possible we used 4 oligonucleotide sequences
taken from the most advanced chip for gene expression evaluation from the
market leader. It is well known that the manufacturing process consists of depositing
nucleotides in a step-by-step fashion, so as to reach the 25-nucleotide length
for each of the features on the array. We selected the sequences:
8146645 gaagactcgcctgttgggacagcgc
8054479 gcatgtggctacttagtaaatagta
8154660 gcttagaaaacaggtcctcagcaca
8162631 ggtagcaaccgtcacaatctggatg
Fig. 2. Pareto front of the solutions found by the MOGA (plot axes: Objective 1, Objective 2)
Table 1. The deposition sequence and the corresponding embedding matrix
Step          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Deposition    G  G  C  T  A  C  G  T  G  C  T  G  A  C  A  G  C  T  A  G  C  G  A
ATCGAAGCGCGA  .  .  .  .  A  .  .  T  .  C  .  G  A  .  A  G  C  .  .  G  C  G  A
CCGTCCAGCTAA  .  .  C  .  .  C  G  T  .  C  .  .  .  C  A  G  C  T  A  .  .  .  A
GTACTTAGCTAC  G  .  .  T  A  C  .  T  .  .  T  .  A  .  .  G  C  T  A  .  C  .  .
GGATGGAGCTAC  G  G  .  .  A  .  .  T  G  .  .  G  A  .  .  G  C  T  A  .  C  .  .
Embedding     0  0  0  0  1  0  0  1  0  1  0  1  1  0  1  1  1  0  0  1  1  1  1
matrix        0  0  1  0  0  1  1  1  0  1  0  0  0  1  1  1  1  1  1  0  0  0  1
              1  0  0  1  1  1  0  1  0  0  1  0  1  0  0  1  1  1  1  0  1  0  0
              1  1  0  0  1  0  0  1  1  0  0  1  1  0  0  1  1  1  1  0  1  0  0
and we evaluated the performance of the algorithm using the following protocol:
we started by taking 12 bases from each sequence and running the optimization
algorithm 100 times on the same problem. We proceeded by adding one base at a
time to each sequence and re-evaluating the performance of the
algorithm, until the end of the sequences was reached. At each step we
recorded the time required for each optimization task, the number of steps
required by each solution and the reason why the optimization algorithm terminated.
For each of the 14 tests carried out (prefix lengths from 12 to 25 bases) we extracted the mean and the confidence
intervals for both optimization times and optimization results; this was done in
order to extract the main estimators of the performance of the proposed algorithm
under the hypothesis that the process under observation is ergodic (so that the
mean estimate is independent of the specific realization and tends
to the real value as the number of observations $n \to \infty$). Computational time
analysis of the optimization task revealed that the times required by the proposed
algorithm can be adequately fitted with a relatively low-order polynomial, which
seems to support the thesis that the computational complexity of this
approach can be approximated by $O(x^6)$. The algorithm was able to complete
the optimization task proposed in the previous section using only 23 deposition
steps, which are reported in Table 1. As can be observed, no deposition step can be
suppressed without affecting the whole process. This suggests that the necessary
condition for optimality is at least satisfied. The results reported herein seem to
confirm the robustness and versatility of the proposed algorithm and motivate
further research in this field.
5 Discussion
In this paper, we have proposed a novel multi-objective approach for the reduction
of the manufacturing steps required for microarray assembly. The algorithm is built
on a MOGA that first generates random templates and then applies an evolution process
that reduces the templates in the template pool to obtain shorter and less expensive results.
These processes are shown to be powerful for solving the SCS problem.
Comparing the performance of our approach with the industry gold standard, we can state
that the proposed system is able to outperform alternative approaches under
predefined conditions. The proposed solution is quite interesting in terms
of result optimality; however, it should be noted that the computational complexity
of the algorithm under investigation is not negligible and that it requires many
optimization tasks before completion. Moreover, much research effort must still be spent
on border conflict minimization in microarray production, due to the specific
technological limitations that characterize photolithographic processes.
References
1. Fodor, S., Read, J., Pirrung, M., Stryer, L., Lu, A., Solas, D.: Light-directed,
Spatially Addressable Parallel Chemical Synthesis. Science 251, 767–773 (1991)
2. Hannenhalli, S., Hubbell, E., Lipshutz, R., Pevzner, P.A.: Combinatorial Algorithms
for Design of DNA Arrays. Advances in Biochemical Engineering/Biotechnology 77,
1–9 (2002)
3. Storer, J.A.: Data Compression: Methods and Theory. Computer Science Press
(1988)
4. Foulser, D.E., Li, M., Yang, Q.: Theory and Algorithms for Plan Merging. Artificial
Intelligence 57(2), 143–181 (1992)
5. Sellis, T.K.: Multiple-query Optimization. ACM Transactions on Database Systems
(TODS) 13(1), 23–52 (1988)
6. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms,
2nd edn. MIT Press/McGraw-Hill (2001)
7. Sankoff, D., Kruskal, J.: Time Warps, String Edits and Macromolecules: the Theory
and Practice of Sequence Comparisons. Addison Wesley, Reading (1983)
8. Barone, P., Bonizzoni, P., Vedova, G.D., Mauri, G.: An Approximation Algorithm
for the Shortest Common Supersequence Problem: an Experimental Analysis. In:
Proceedings of the 2001 ACM Symposium on Applied Computing, pp. 56–60 (2001)
9. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and
Computational Biology. Cambridge University Press, New York (1997)
10. Jiang, T., Li, M.: On the Approximation of Shortest Common Supersequences and
Longest Common Subsequences. SIAM Journal on Computing 24(5), 1122–1139
(1995)
11. Timkovsky, V.G.: On the Approximation of Shortest Common Nonsubsequences
and Supersequences. Technical report (1993)
12. Kasif, S., Weng, Z., Derti, A., Beigel, R., DeLisi, C.: A Computational Framework
for Optimal Masking in the Synthesis of Oligonucleotide Microarrays. Nucleic Acids
Research 30(20) (2002)
13. Irving, R.W., Fraser, C.: On the Worst-Case Behaviour of Some Approximation
Algorithms for the Shortest Common Supersequence of k Strings. In: Proceedings of
the 4th Annual Symposium on Combinatorial Pattern Matching, pp. 63–73 (1993)
14. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of
Michigan Press, Ann Arbor (1975)
15. Jones, D.F., Mirrazavi, S.K., Tamiz, M.: Multiobjective Metaheuristics: An
Overview of the Current State-of-the-art. Eur. J. Oper. Res. 137(1), 1 (2002)