
INFORMS Journal on Computing

Articles in Advance, pp. 1–15

ISSN 1091-9856 | EISSN 1526-5528

DOI 10.1287/ijoc.1090.0322

©2009 INFORMS

New Optimization Model and Algorithm for Sibling

Reconstruction from Genetic Markers

W. Art Chaovalitwongse, Chun-An Chou

Department of Industrial and Systems Engineering, Rutgers University, Piscataway,

New Jersey 08854 {wchaoval@rci.rutgers.edu, joechou@rci.rutgers.edu}

Tanya Y. Berger-Wolf, Bhaskar DasGupta, Saad Sheikh

Department of Computer Science, University of Illinois, Chicago, Illinois 60607

{tanyabw@uic.edu, dasgupta@bert.cs.uic.edu, ssheik3@uic.edu}

Mary V. Ashley, Isabel C. Caballero

Department of Biological Sciences, University of Illinois, Chicago, Illinois 60607

{ashley@uic.edu, icabal2@uic.edu}

With improved tools for collecting genetic data from natural and experimental populations, new opportunities arise to study fundamental biological processes, including behavior, mating systems, adaptive trait evolution, and dispersal patterns. Full use of the newly available genetic data often depends upon reconstructing genealogical relationships of individual organisms, such as sibling reconstruction. This paper presents a new optimization framework for sibling reconstruction from single generation microsatellite genetic data. Our framework is based on assumptions of parsimony and combinatorial concepts of Mendel’s inheritance rules. Here, we develop a novel optimization model for sibling reconstruction as a large-scale mixed-integer program (MIP), shown to be a generalization of the set covering problem. We propose a new heuristic approach to efficiently solve this large-scale optimization problem. We test our approach on real biological data as presented in other studies as well as simulated data, and compare our results with other state-of-the-art sibling reconstruction methods. The empirical results show that our approaches are very efficient and outperform other methods while providing the most accurate solutions for two benchmark data sets. The results suggest that our framework can be used as an analytical and computational tool for biologists to better study ecological and evolutionary processes involving knowledge of familial relationships in a wide variety of biological systems.

Key words: set covering; genetic markers; simulation; mixed-integer program; analysis of algorithms; sibling

reconstruction

History: Accepted by Harvey Greenberg, Area Editor for Computational Biology and Medical Applications;

received January 2008; revised August 2008, January 2009, February 2009; accepted February 2009. Published

online in Articles in Advance.

1. Introduction

As more and more highly variable molecular markers become available for a wider range of species, investigators can increasingly characterize evolutionary, ecological, population, and demographic parameters in diverse species of plants, animals, and microbes. To effectively extract ecological and evolutionary information from these emerging genotypic data sets, computational approaches for accurate reconstruction of familial relationships need to keep pace with our ability to sample organisms and obtain their genotypes. Therefore, improved methods for reconstruction of genealogical relationships from genetic data will be extremely useful for biologists working on a wide range of such questions. For wild species, kinship and pedigrees cannot be inferred from field observations alone. Several modern tools, like codominant molecular markers such as DNA microsatellites, provide new

possibilities to develop novel computational meth-

ods of establishing pedigree relationship. In studies

where organisms are sampled and genotyped without

information about their parents, it may be possible

to identify cohorts of siblings based on microsatellite

data. The sibling group identification allows inference

of many interesting biological parameters, including

the number of reproducing adults, their fecundity,

and the average size of litters. For threatened species,

knowledge of sibship relationships can be important

for conservation and aid in management strategies.

For studies of evolutionary genetics, sibling recon-

struction can be used for assessing the heritabilities of

adaptive traits and how they will respond to natural

selection.

Although several methods for sibling reconstruc-

tion from microsatellite data have been previously

proposed (Almudevar and Field 1999, Almudevar


INFORMS holds copyright to this article and distributed this copy as a courtesy to the author(s).

Additional information, including rights and permission policies, is available at http://journals.informs.org/.


2003, Beyer and May 2003, Konovalov et al. 2004,

Painter 1997, Smith et al. 2001, Thomas and Hill 2002,

Wang 2004), most techniques have offered very limited

applications and have not been very practical (Butler

et al. 2004). The main reason is that most sibling recon-

struction methods use statistical likelihood models to

infer genealogical relationships and are based on the

knowledge of typical allele distribution and frequency,

family sizes, and other information about the species

(Blouin 2003). For this reason, previous techniques are

often constrained by the availability of thorough pop-

ulation sampling. They are also heavily biased toward

parentage assignment because parentage is more eas-

ily resolved.

In this study, we focus on a combinatorial approach

that does not require prior genetic information about

the species such as population allele frequencies.

Our approach is based on combinatorial concepts

of the Mendelian laws of inheritance, and for now,

we are limiting our methods to diploid organisms

(Berger-Wolf et al. 2005, Chaovalitwongse et al. 2007).

Similar combinatorial methods have also been suc-

cessful previously for closely related molecular genet-

ics questions, such as haplotype reconstruction (Eskin

et al. 2003, Li and Jiang 2003). Generally speaking,

our sibling reconstruction approach uses the simple

Mendelian inheritance rules to impose combinatorial constraints (referred to as the 4-allele and 2-allele constraints) to allow only genetically possible sibling

groups to be reconstructed. Note that the main chal-

lenge of combinatorial approaches is that the actual

number of sibling sets is not known a priori. We there-

fore employ parsimony assumptions and aim to find

the smallest number of sibling groups, each satisfy-

ing the Mendelian constraints. In our previous study,

we proposed an algorithm to (1) reconstruct all pos-

sible sibsets satisfying the 4-allele constraint, which is

a looser version of 2-allele constraint; and (2) assign

individuals to sibsets by solving a set covering prob-

lem. This algorithm was able to reconstruct sib-

sets with some degree of success (Chaovalitwongse

et al. 2007). Most recently, we developed a heuris-

tic approach to generate all maximal sibsets that sat-

isfy the 2-allele constraint, which can be theoretically

proven to provide only the lower bound of the real

number of sibsets (Berger-Wolf et al. 2007).

In this paper we present the first integrated math-

ematical programming model to construct and assign

individuals into sibsets that satisfy the 2-allele constraint. This model is provably a true representation of the sibling reconstruction problem under the parsimony assumption. Specifically, the objective of this model is

to minimize the number of reconstructed sibsets with

provably true constraints equivalent to the Mendelian

rules. This model is a very large-scale mixed-integer

program (MIP), which is very difficult to solve. Since

the model has a set covering structure, we propose

a new heuristic approach based on a well-known

approximation algorithm of the set covering problem

to solve this large-scale optimization problem.

The rest of the paper is organized as follows. In §2,

some basic background in genetics and population

biology and a brief description of combinatorial con-

cepts of the Mendelian inheritance laws are given.

In §3, the complexity issues of the sibling reconstruc-

tion problem, the proposed optimization model, and

the algorithmic framework of our solution approach

are presented. Section 4 describes our computational

experience including the characteristics of real biolog-

ical data, the random data generator, a measure of

solution accuracy, and the comparison of performance

characteristics of our framework with those obtained

by some related computational approaches in the lit-

erature. The concluding remarks and discussion are

given in §5.

2. Background

In this section, we give basic definitions of the biological terms related to sibling reconstruction that are used frequently in this paper. We also give some background on microsatellite data and the combinatorial concepts of Mendel’s inheritance laws.

2.1. Microsatellite Data

Although there are several molecular markers used in population genetics, microsatellites are the most commonly used in kinship and population studies. Microsatellites are polymorphic loci present in nuclear genomes, usually noncoding, consisting of repeating units of nucleotides. Microsatellites are short (one to six base pairs) simple repeats such as (CA/GT)_n or (AGC/TCG)_n that are scattered around eukaryotic genomes. They are also known as simple sequence repeats (SSRs). Microsatellites are especially useful for studying population demographics and reproductive patterns because they are neutral and codominant markers, and the inference of genotypes at each locus is straightforward. More importantly, microsatellites are preferred because of high numbers of alleles and heterozygosities, providing the highest resolution for identifying related individuals (Queller et al. 1993). Because of these advantages and their widespread use, we focus our development of sibling relationship (sibship) reconstruction methods on unlinked, multi-allelic, codominantly inherited markers such as microsatellites. Figure 1 shows a schematic example of microsatellites sampled at two loci and their resulting genotypes (alleles). Note that in reality the allele codings of tandem repeats at different loci may be different. This makes alleles at different loci independent of each other. For example, allele 1 at locus 1 will have a different sequence (i.e., number of tandem repeats) from that of allele 1 at locus 2.


Figure 1. A Schematic Example of a Microsatellite Marker

(Schematic: a chromosome pair with microsatellites CACA and CACACA at locus 1 and GAGAGA and GAGAGAGA at locus 2, giving genotypes 1/2 and 2/3.)

Locus 1: Allele #1 = CACA, Allele #2 = CACACA, Allele #3 = CACACACA

Locus 2: Allele #1 = GAGA, Allele #2 = GAGAGA, Allele #3 = GAGAGAGA

Notes. Given a chromosome pair of an individual, two loci were sampled. At each locus, genotypes were extracted and allele encoded. In this example, the microsatellite data of this individual are (1/2), (2/3) for loci 1 and 2, respectively.

2.2. Basic Definitions

A sibset is a group of individuals that share at least one parent. When they share both parents they are called full siblings, and when they share exactly one of the parents they are called half siblings. Here, when we refer to sibling groups or sibsets we mean full siblings. Microsatellites are short, tandem repeats of a DNA sequence that vary in length. In the genome, microsatellites occur at a specific location on a chromosome, which is called a locus. In other words, a locus is a particular chromosomal location of a DNA sequence in the genome (in this case, a microsatellite DNA sequence). An allele is a distinct pattern of variable DNA sequences in microsatellites, which is determined by the length of the tandem repeat that occupies a given locus (position) on a chromosome. Usually, numerous alleles occur at a locus, with each allele differing in the number of repeat motifs. In diploid organisms like humans, there are two homologous copies of each chromosome, and the two alleles at a locus make up the genotype. The alleles at each locus are inherited one from each parent (one from the mother and one from the father). A homozygous individual has two identical alleles at a particular genetic locus, whereas a heterozygous individual has two different alleles at a particular genetic locus.

2.3. Combinatorial Concepts of Mendelian Inheritance Laws

Our basic framework for sibling reconstruction is

built around the combinatorial concepts of Mendel’s

laws (Mendel 1866, Bowler 1989). The law of seg-

regation, known as Mendel’s first law, essentially

concludes that the two alleles for each characteris-

tic segregate during gamete production to preserve

the population variation. In other words, offspring

inherit two alleles (one from the mother and one from

the father) on each of the chromosome pairs. The law

of independent assortment, known as the inheritance

law or Mendel’s second law, states that the inheri-

tance pattern of one trait will not affect the inheri-

tance pattern of another. This implies that alleles of

different loci assort independently of one another dur-

ing gamete formation so that there are no correlations

across different loci. In short, these laws lay down

a very simple rule for gene inheritance: An offspring

inherits one allele from each of its parents independently

for each locus.

Based on this simple rule, we introduce two Mendelian constraints to ensure genetically consistent sibling groups (full siblings). These constraints can be mathematically defined as follows. Given a set $U$ of $|U| = n$ individuals from the same generation, each individual $1 \le i \le n$ is represented by a genetic marker of $l$ loci $(\langle a_{ij}, b_{ij} \rangle)_{1 \le j \le l}$. The numbers $a_{ij}$ and $b_{ij}$ represent a specific allele pair, denoted the front and back alleles, respectively, of individual $i$ at locus $j$. The above-mentioned Mendelian laws impose the following conditions on a group of individuals $S \subseteq U$ to be full siblings (Berger-Wolf et al. 2005):

Definition 1. A set $S \subseteq U$ of $l$-locus individuals is said to satisfy the 4-allele condition if $\left|\bigcup_{i \in S} \{a_{ij}, b_{ij}\}\right| \le 4$ at locus $j$ for $1 \le j \le l$.

Definition 2. Assuming there is no mutation (allele swapping) in the gene, a set $S \subseteq U$ of $l$-locus individuals is said to satisfy the 2-allele condition if $\left|\bigcup_{i \in S} a_{ij}\right| \le 2$ and $\left|\bigcup_{i \in S} b_{ij}\right| \le 2$ for $1 \le j \le l$.

Clearly, the 4-allele and 2-allele conditions are necessary (but not sufficient) combinatorial constraints of Mendelian inheritance laws. In other words, if one knows the maternal and paternal alleles, the offspring's alleles in the sibset must satisfy these two conditions. However, a group of individuals whose alleles satisfy these two conditions are not necessarily siblings. It is easy to see that the 2-allele condition is stronger (tighter) than the 4-allele condition.

Based on the 4-allele and 2-allele conditions, we can mathematically derive combinatorial constraints of sibsets used for sibset reconstruction (Berger-Wolf et al. 2005, Chaovalitwongse et al. 2007). It is important to note that for the 2-allele condition to hold, we need to assume that there is no front- and back-allele swapping (i.e., the parental alleles always appear on the same side). In real life, the allele order is unknown, which results in swapped alleles at a single locus. Nevertheless, we can derive combinatorial constraints that are theoretically equivalent to the 2-allele condition, even in the case where there is allele swapping. The combinatorial 4-allele and 2-allele constraints can be defined as follows. Given a set of individuals $S$, we let $A$ be a collection


of distinct alleles present at a given locus and let $R$ be a collection of distinct homozygous alleles (alleles that appear paired with themselves) present at a given locus.

Definition 3. A set of individuals satisfies the 4-allele condition if $|A| \le 4$.

Definition 4. A set of individuals satisfies the 2-allele condition if the following two conditions hold: (1) $|A| + |R| \le 4$, and (2) no allele appears together with more than two other alleles (excluding itself).
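Definitions 3 and 4 can be checked mechanically at a single locus. The following is a minimal Python sketch, not the authors' implementation; the function names and the representation of a genotype as an unordered pair `(a, b)` are assumptions made for illustration. A full check would apply the same test at every locus.

```python
from itertools import chain


def allele_sets(genotypes):
    """Return (A, R, partners) for one locus.

    genotypes: list of (a, b) allele pairs, one per individual (hypothetical encoding).
    A: distinct alleles; R: alleles seen in a homozygous pair;
    partners[k]: the other alleles that k appears together with.
    """
    A = set(chain.from_iterable(genotypes))
    R = {a for a, b in genotypes if a == b}
    partners = {k: set() for k in A}
    for a, b in genotypes:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return A, R, partners


def satisfies_4_allele(genotypes):
    """Definition 3: |A| <= 4."""
    A, _, _ = allele_sets(genotypes)
    return len(A) <= 4


def satisfies_2_allele(genotypes):
    """Definition 4: |A| + |R| <= 4 and each allele has at most two partners."""
    A, R, partners = allele_sets(genotypes)
    return len(A) + len(R) <= 4 and all(len(p) <= 2 for p in partners.values())


# Offspring consistent with parents 1/2 and 3/4.
family = [(1, 3), (1, 4), (2, 3), (2, 4)]
# Adding genotype 1/2 keeps four alleles but gives allele 1 three partners.
mixed = family + [(1, 2)]
print(satisfies_4_allele(family), satisfies_2_allele(family))
print(satisfies_4_allele(mixed), satisfies_2_allele(mixed))
```

The second example illustrates why the 2-allele condition is tighter: the group still has only four distinct alleles, yet it cannot come from a single pair of parents.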

2.4. Sibling Reconstruction Challenges

For field biologists, familial relationships are needed to learn about a species’ evolutionary potential, mating systems and reproductive patterns, dispersal, and inbreeding. Sibling reconstruction is thus needed when wild samples consist primarily of offspring cohorts, in cases where parental samples are lacking. The real objective of sibling reconstruction is to identify sets of individuals that are siblings. Based on genetic samples of offspring cohorts alone, there is no real objective (function) for the sibling reconstruction problem because the actual pedigree and sibgroups are not known. There are two common frameworks proposed to tackle this sibling reconstruction problem. The first one is to use statistical estimates of relatedness among individuals and try to reconstruct groups containing individuals with very similar allele patterns. The second one is to use the combinatorial concepts of Mendelian rules, as mentioned in the previous section. The real challenge of the combinatorial approach is that it only imposes a rule for biologically consistent sibsets but does not have a real objective function. For example, any set of two individuals can be siblings. One can simply group the offspring into pairs and say that they are siblings, and all the reconstructed sibling groups will always satisfy the Mendelian rules. To explain the population and its sibling groups when using the Mendelian rules, our approach uses parsimony assumptions to find the smallest number of sibling groups, each satisfying the Mendelian rules. Specifically, one can, in turn, formulate the sibling reconstruction problem as a problem of minimizing the number of sibsets that contain all individuals and satisfy the Mendelian rules (i.e., the 2-allele constraint). This problem is very difficult, and the complexity of enumerating all possible sibsets satisfying the 2-allele constraint is exponential. These computational challenges are addressed in this paper.

3. Sibling Reconstruction Problem

Under Parsimony Assumptions

We present a new optimization model for the sib-

ling relationship reconstruction problem based on

microsatellite data acquired from individuals from a

single generation. The reconstruction will be based on

the 2-allele constraint while we apply a parsimony-

driven explanation of the sibsets. In other words, we

model the objective of this optimization by assigning

individuals parsimoniously into the smallest number

of (possibly overlapping) groups that satisfy the nec-

essary 2-allele constraint.

3.1. Complexity and Approximation Issues

First, we discuss the complexity and approximation issues of the sibling reconstruction problem based on the 4-allele constraint and the 2-allele constraint. We consider a set $U$ of $n$ different individuals, each with $l$ loci. The 2-allele problem with parameters $n$ and $l$ is denoted by 2-ALLELE$_{n,l}$, and the 4-allele problem is denoted by 4-ALLELE$_{n,l}$. Let $g$ be the parameter denoting the maximum number of individuals that can be full siblings in an instance of 4-ALLELE$_{n,l}$ or 2-ALLELE$_{n,l}$. Since no two individuals are the same (i.e., their alleles must differ at some locus), $2 \le g \le \left(\binom{4}{1} + 2\binom{4}{2}\right)^l = 16^l$ for 4-ALLELE$_{n,l}$. Similarly, we can derive $2 \le g \le 8^l$ for 2-ALLELE$_{n,l}$. Either problem has a trivial optimal solution if $g = 2$. Furthermore, if $g$ is a constant, both 4-ALLELE$_{n,l}$ and 2-ALLELE$_{n,l}$ can be posed as a set cover problem with $n$ elements and $\binom{n}{g} = O(n^g)$ sets, with the maximum set size being $g$, and thus have a $(1 + \ln g)$-approximation by using standard algorithms for the set cover problem (Vazirani 2001). For general $g$, since any two individuals can be put in the same sibling group, either problem has a trivial $g/2$-approximation. Next, we discuss the approximability results of 2-ALLELE$_{n,l}$ and 4-ALLELE$_{n,l}$ for $g = 3$ and any arbitrary $g$.

Theorem 1 (Ashley et al. 2009). Both 4-ALLELE$_{n,l}$ and 2-ALLELE$_{n,l}$ are $((153/152) - \varepsilon)$-inapproximable even if $g = 3$, assuming RP $\ne$ NP.

This theorem can be proved by providing an approximation-preserving reduction from the triangle packing problem to our allele problems. The triangle packing problem requires one to find a maximum number of node-disjoint triangles (3-cycles) in a graph. Conceptually, we provide a reduction from an instance of the triangle packing problem to an instance of our allele problems such that three nodes form a 3-cycle in the graph if and only if the individuals corresponding to those nodes can be covered by a sibling group. For more details, please refer to Ashley et al. (2009).

Theorem 2 (Ashley et al. 2009). For any two constants $0 < \varepsilon < \delta < 1$ with $g = n^{\delta}$, 2-ALLELE$_{n,l}$ and 4-ALLELE$_{n,l}$ cannot be approximated to within a ratio of $n^{\varepsilon}$ unless NP $\subseteq$ ZPP.

This theorem can be proved by providing a reduc-

tion from the well-known graph coloring problem to


our allele problems such that there is an individual

corresponding to each node in the graph, and a color-

ing of the graph translates to a cover by sibling groups

with a constant factor blowup in the approximation.

For more details, please refer to Ashley et al. (2009).
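The $(1 + \ln g)$-approximation cited above for constant $g$ comes from the standard greedy algorithm for set cover, which repeatedly picks the set covering the most still-uncovered elements. The sketch below is a generic Python illustration of that greedy rule; the function name and the toy instance are assumptions, and the candidate sets are supplied directly instead of being enumerated from genotypes.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedy set cover: return indices of the chosen sets, in order of selection.

    universe: iterable of elements to cover.
    candidate_sets: list of Python sets (hypothetical input format).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most uncovered elements (first one on ties).
        best = max(range(len(candidate_sets)),
                   key=lambda s: len(uncovered & candidate_sets[s]))
        gain = uncovered & candidate_sets[best]
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


universe = range(1, 7)
candidates = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 4}, {2, 5}, {6}]
cover = greedy_set_cover(universe, candidates)
print(cover)
```

In the sibling setting, each candidate set would be a group of individuals satisfying the relevant allele condition, which is why the number of candidates grows as $O(n^g)$.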

3.2. 2-Allele Optimization Model

For the sake of simplicity, we shall call the optimization model for the sibling reconstruction problem based on the 2-allele constraint the 2-allele optimization model (2AOM). The objective function of 2AOM is to minimize the total number of sibsets while assigning every individual into groups satisfying the 2-allele constraint. The 2AOM problem can be mathematically formulated as follows. First, we define the following sets that will be used throughout this paper: $i \in I$ is a set of individuals, $j \in J$ is a set of reconstructed sibsets, $k \in K$ is a set of alleles, and $l \in L$ is a set of loci. In general, from biological data, we are given a set of $|I|$ individuals, each scored at $|L|$ loci. Figure 2 illustrates an example of microsatellite genotypes for seven individuals scored at two loci each. We can subsequently present the data in a multidimensional 0-1 matrix format. From the input matrix, $f^l_{ik}$ is defined as an indicator if the front allele at locus $l$ of individual $i$ is $k$, $b^l_{ik}$ is an indicator if the back allele at locus $l$ of individual $i$ is $k$, $\bar{a}^l_{ik} = \max\{f^l_{ik}, b^l_{ik}\}$ is an indicator if allele $k$ appears at locus $l$ of individual $i$, and $\hat{a}^l_{ik} = f^l_{ik} \ast b^l_{ik}$ is an indicator if individual $i$ is homozygous (allele $k$ appears twice) at locus $l$. On the right, Figure 2 also shows how the markers are converted into a multidimensional 0-1 matrix representing the input variables $\bar{a}^l_{ik}$. In this example, there are a total of four distinct alleles at locus 1; therefore, we have a $7 \times 4$ matrix at locus 1. The matrix of input variables ($\hat{a}^l_{ik}$) can be constructed similarly.

Next we define the following decision variables:

• $z_j \in \{0, 1\}$: indicates if any individual is selected to be a member of sibset $j$;

Figure 2. An Example of an Input Data Matrix ($\bar{a}^l_{ik}$) from Microsatellite Markers

(Left panel: genotypes of seven individuals at Locus 1 and Locus 2. Right panel: the corresponding 0-1 matrices over alleles #1 to #4 for Locus 1 and Locus 2.)

• $x_{ij} \in \{0, 1\}$: indicates if individual $i$ is selected to be a member of sibset $j$;

• $y^l_{jk} \in \{0, 1\}$: indicates if any member of sibset $j$ has allele $k$ at locus $l$;

• $o^l_{jk} \in \{0, 1\}$: indicates if there is at least one homozygous individual in sibset $j$ with allele $k$ appearing twice at locus $l$; and

• $v^l_{jkk'} \in \{0, 1\}$: indicates if allele $k$ appears with allele $k'$ in sibset $j$ at locus $l$.

The mathematical programming formulation of the 2AOM problem is given by the following.

Objective Function. The overall objective function in Equation (1) is to minimize the total number of sibsets:

$$\min \sum_{j \in J} z_j. \tag{1}$$

(1) Cover and Logical Constraints. Equation (2) represents the cover constraints ensuring that every individual is assigned to at least one sibset. Equation (3) ensures that the binary sibset variables must be activated for the assignment of any individual $i$ to sibset $j$.

$$\sum_{j \in J} x_{ij} \ge 1 \quad \forall i \in I, \tag{2}$$

$$x_{ij} \le z_j \quad \forall i \in I,\ \forall j \in J. \tag{3}$$

(2) 2-Allele Constraints. Equation (4) ensures that the binary variable for allele indication $y^l_{jk}$ must be activated for the assignment of any individual $i$ to sibset $j$. Equation (5) ensures that the binary variable for homozygous indication $o^l_{jk}$ must be activated for the existence of a homozygous individual with allele $k$ appearing twice at locus $l$ in sibset $j$. Equation (7) ensures that the binary variable for allele pair indication $v^l_{jkk'}$ must be activated for any assignment of individual $i$ to sibset $j$. Note that $M_1$, $M_2$, and $M_3$ are large constants, which can be defined as $M_1 = 2 \ast |I| + 1$ and $M_2 = M_3 = |I| + 1$. Equation (6) ensures that the number of distinct alleles plus the number of homozygous alleles is less than or equal to 4. Equation (8) ensures that every allele in the set does not appear with more than two other alleles (excluding itself).

$$\sum_{i \in I} \bar{a}^l_{ik} x_{ij} \le M_1 y^l_{jk} \quad \forall j \in J,\ \forall k \in K,\ \forall l \in L, \tag{4}$$

$$\sum_{i \in I} \hat{a}^l_{ik} x_{ij} \le M_2 o^l_{jk} \quad \forall j \in J,\ \forall k \in K,\ \forall l \in L, \tag{5}$$

$$\sum_{k \in K} \left(y^l_{jk} + o^l_{jk}\right) \le 4 \quad \forall j \in J,\ \forall l \in L, \tag{6}$$

$$\sum_{i \in I} \bar{a}^l_{ik} \bar{a}^l_{ik'} x_{ij} \le M_3 v^l_{jkk'} \quad \forall j \in J,\ \forall k \in K,\ \forall k' \in K \setminus k,\ \forall l \in L, \tag{7}$$

$$\sum_{k' \in K \setminus k} v^l_{jkk'} \le 2 \quad \forall j \in J,\ \forall k \in K,\ \forall l \in L. \tag{8}$$


(3) Binary and Nonnegativity Constraints.

$$z_j,\ x_{ij},\ y^l_{jk},\ o^l_{jk} \in \{0, 1\} \quad \forall i \in I,\ \forall j \in J,\ \forall k \in K,\ \forall l \in L.$$

The total number of discrete variables is $O(\max\{|J| \ast |K| \ast |L|,\ |I| \ast |J|\})$, and the total number of constraints is $O(|J| \ast |K|^2 \ast |L|)$. It is easy to see that the 2AOM problem is a very large-scale MIP problem and may not be easy to solve in large instances.
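The input indicators of the model can be derived directly from front/back genotype pairs. The Python sketch below builds them for a single locus; the function name, the list-of-lists layout, and the three sample genotypes are assumptions for illustration and are not the data of Figure 2.

```python
def build_indicators(genotypes, alleles):
    """Build 0-1 input matrices for one locus.

    genotypes[i] = (front, back) allele of individual i (hypothetical encoding).
    alleles: ordered list of allele labels, one matrix column per label.
    Returns (f, b, a_bar, a_hat), each a list of rows indexed [i][k].
    """
    f = [[int(front == k) for k in alleles] for front, _ in genotypes]
    b = [[int(back == k) for k in alleles] for _, back in genotypes]
    # a_bar = max(f, b): allele k appears in individual i.
    a_bar = [[max(fk, bk) for fk, bk in zip(fr, br)] for fr, br in zip(f, b)]
    # a_hat = f * b: individual i is homozygous for allele k.
    a_hat = [[fk * bk for fk, bk in zip(fr, br)] for fr, br in zip(f, b)]
    return f, b, a_bar, a_hat


locus = [(1, 3), (1, 4), (2, 2)]
f, b, a_bar, a_hat = build_indicators(locus, alleles=[1, 2, 3, 4])
for row in a_bar:
    print(row)
```

With $|I|$ individuals and $|K|$ alleles at a locus, each matrix has $|I| \times |K|$ entries, matching the $7 \times 4$ example described for locus 1.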

Next, we will prove the correctness of our model

and show that the 2AOM problem is NP-hard in the

strong sense. The 2AOM problem is considered to

be a generalization of the well-known set covering

problem with additional constraints to satisfy the

2-allele condition.

Proposition 1. The constraints in Equations (4)–(8)

are equivalent to the 2-allele constraint.

Proof. By Equation (4), $y^l_{jk} = 1$ indicates that there exists at least one member in sibset $j$ with allele $k$ at locus $l$. Therefore, in sibset $j$, the total number of distinct alleles at locus $l$ is equal to $\sum_{k \in K} y^l_{jk}$, which is equivalent to $|A|$ in the 2-allele condition of Definition 4. By Equation (5), $o^l_{jk} = 1$ indicates that there exists at least one homozygous member in sibset $j$ with allele $k$ appearing twice at locus $l$. Therefore, in sibset $j$, the total number of distinct homozygous alleles at locus $l$ is equal to $\sum_{k \in K} o^l_{jk}$, which is equivalent to $|R|$ in the 2-allele condition. This makes $|A| + |R| \le 4$ equivalent to $\sum_{k \in K} (y^l_{jk} + o^l_{jk}) \le 4$, which is Equation (6). By Equation (7), $v^l_{jkk'} = 1$ indicates that allele $k$ appears together with allele $k'$ at locus $l$ in sibset $j$. By Equation (8), we guarantee that no allele appears with more than two other alleles at any locus. This completes the proof. □
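The argument in this proof can be traced numerically for a single sibset and a single locus: set $y$, $o$, and $v$ to the smallest values allowed by Equations (4), (5), and (7), then test Equations (6) and (8). The Python sketch below does this; the function name and the hard-coded matrices are illustrative assumptions, and no MIP solver is involved.

```python
def sibset_feasible(a_bar, a_hat, x):
    """Check Equations (6) and (8) for one sibset at one locus.

    a_bar[i][k], a_hat[i][k]: 0-1 input indicators; x[i] = 1 if individual i is in the sibset.
    y, o, v take the smallest values permitted by Equations (4), (5), and (7).
    """
    n_alleles = len(a_bar[0])
    members = [i for i, xi in enumerate(x) if xi]
    y = [int(any(a_bar[i][k] for i in members)) for k in range(n_alleles)]
    o = [int(any(a_hat[i][k] for i in members)) for k in range(n_alleles)]
    v = [[int(k != kp and any(a_bar[i][k] and a_bar[i][kp] for i in members))
          for kp in range(n_alleles)] for k in range(n_alleles)]
    eq6 = sum(y) + sum(o) <= 4              # |A| + |R| <= 4
    eq8 = all(sum(row) <= 2 for row in v)   # at most two partner alleles each
    return eq6 and eq8


# Genotypes 1/3, 1/4, 2/3, 2/4, 1/2 over alleles 1..4 (no homozygotes).
a_bar = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
a_hat = [[0, 0, 0, 0] for _ in a_bar]
print(sibset_feasible(a_bar, a_hat, [1, 1, 1, 1, 0]))
print(sibset_feasible(a_bar, a_hat, [1, 1, 1, 1, 1]))
```

Including the fifth individual makes allele 1 appear with three other alleles, so Equation (8) fails even though Equation (6) still holds.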

Proposition 2. If we introduce a weight or cost $c_j$ to each set $z_j$ $\forall j \in J$ in Equation (1), the set covering problem can be reduced to the 2AOM problem.

Proof. Consider a standard set covering problem: $\min \sum_{j \in J} c_j z_j$ subject to $\sum_{j \in J} a_{ij} z_j \ge 1$, $z_j \in \{0, 1\}$. We can reduce this set covering problem to the 2AOM problem as follows. First, relax the constraints in Equations (5)–(8) in the 2AOM problem, and let $|K| = |L| = 1$. If $a_{ij} = 0$ in the set covering problem, define $\bar{a}^l_{ik} = M_1 + 1$; otherwise, $\bar{a}^l_{ik} = 1$, where $M_1 = |I| + 1$. Equation (4) can then be expressed by $\sum_{i \in I} \bar{a}_i x_{ij} \le M_1 y_j$, which allows $x_{ij} = 1$ only if individual $i$ can be covered by set $j$ ($a_{ij} = 1$). Multiplying both sides of Equation (3) by $x_{ij}$, we obtain $(x_{ij})^2 = x_{ij} \le x_{ij} z_j$ (because $x_{ij} \in \{0, 1\}$, $x_{ij}^2 = x_{ij}$). Summing the expression over $J$, we obtain $\sum_{j \in J} x_{ij} \le \sum_{j \in J} x_{ij} z_j$. Combining this expression with Equation (2), we can derive the following expression: $\sum_{j \in J} x_{ij} z_j \ge \sum_{j \in J} x_{ij} \ge 1$, which is equivalent to the constraints $\sum_{j \in J} a_{ij} z_j \ge 1$ in the set covering problem. This completes the proof. □

Proposition 3. The 2AOM problem is strongly NP-hard.

Proof. According to Ashley et al. (2009), the 2-ALLELE$_{n,l}$ problem is NP-complete. The 2AOM is an optimization version of 2-ALLELE$_{n,l}$ and a generalization of the set covering problem. Therefore, the 2AOM problem is NP-hard.

It is very important to note that the 2AOM problem

requires an initialization of the number of sibsets.

If the initial number of sibsets is too small, the problem

will become infeasible. If the initial number of sibsets

is too large, we will have to introduce many more

binary variables than needed. The proposed heuristic

approach discussed next can also be used to initial-

ize the number of sibsets as its solution can be theo-

retically shown to be an upper bound of the 2AOM

problem.

In general, the objective and covering constraints of the 2AOM problem are rather artificial with respect to the sibling reconstruction problem because they are built upon parsimony assumptions. In addition, the proposed heuristic approach (to be discussed in the next section) provides a sibset reconstruction solution in the form of a set partition. We should therefore investigate a variant of the 2AOM problem with set partitioning constraints. Specifically, we will test this modified 2AOM problem ($\widetilde{\text{2AOM}}$) by replacing the set covering constraints in Equation (2) with set partitioning constraints given by

$$\sum_{j \in J} x_{ij} = 1 \quad \forall i \in I. \tag{9}$$

3.3. Heuristic Approach: Iterative Maximum

Covering Set

As mentioned earlier, the 2AOM problem is a very

large-scale MIP problem. In addition, based on the parsimony assumptions, the minimum number of sibsets may not give the most accurate sibling reconstruction, which is the real objective of our sibling reconstruction problem. Moreover, we can only say that the optimal solution to 2AOM (the number of sibsets) is biologically a true lower bound on the real number of sibsets. Therefore, to solve our problem more efficiently,

we herein propose a heuristic approach—namely, an

iterative maximum covering set (IMCS)—which is an

iterative optimization approach motivated by a widely

known approximation algorithm for the set covering

problem, i.e., a maximum coverage approach. The idea

behind this approach is to construct one sibset maxi-

mizing the individual cover in each iteration. Essen-

tially, in each iteration, we solve a reduced problem of

2AOM. The objective of IMCS is to maximize the total

number of individuals to be covered by a sibset, which

satisfies the 2-allele property. The IMCS problem can


be formally defined as follows. First, we define the following decision variables:

• $x_i \in \{0,1\}$: indicates if individual $i$ is selected to be a member of the current sibset;

• $y_k^l \in \{0,1\}$: indicates if any member of the current sibset has allele $k$ at locus $l$;

• $o_k^l \in \{0,1\}$: indicates if there is at least one homozygous member in the current sibset with allele $k$ appearing twice at locus $l$; and

• $v_{kk'}^l \in \{0,1\}$: indicates if allele $k$ appears with allele $k'$ in the current sibset at locus $l$.

We mathematically formulate the IMCS problem at each iteration as follows.

Objective Function. The overall objective function in Equation (10) is to maximize the total number of individuals to be selected as members of the current sibset:

$$\max \sum_{i \in I} x_i. \tag{10}$$

(1) 2-Allele Constraints. Equation (11) ensures that the binary variables for allele indication must be activated for the assignment of individual $i$ to the current sibset. Equation (12) ensures that the binary variables for homozygous indication must be activated for the existence of a homozygous individual, with allele $k$ appearing twice at locus $l$ in the current sibset. Equation (13) restricts that the binary variables for allele pair indication $v_{kk'}^l$ must be activated for the selection of individual $i$. Note that $M_1$, $M_2$, and $M_3$ are large constants, which can be defined as $M_1 = 2|I| + 1$ and $M_2 = M_3 = |I| + 1$. Equation (14) ensures that the combination of the number of distinct alleles and the number of homozygous alleles in the current sibset is less than or equal to 4. Equation (15) ensures that every allele of each individual does not appear together with more than two other alleles (excluding itself).

$$\sum_{i \in I} \bar{a}_{ik}^l x_i \le M_1 y_k^l \quad \forall k \in K,\ \forall l \in L, \tag{11}$$

$$\sum_{i \in I} \hat{a}_{ik}^l x_i \le M_2 o_k^l \quad \forall k \in K,\ \forall l \in L, \tag{12}$$

$$\sum_{i \in I} \bar{a}_{ik}^l \bar{a}_{ik'}^l x_i \le M_3 v_{kk'}^l \quad \forall k \in K,\ \forall k' \in K \setminus k,\ \forall l \in L, \tag{13}$$

$$\sum_{k \in K} \left( y_k^l + o_k^l \right) \le 4 \quad \forall l \in L, \tag{14}$$

$$\sum_{k' \in K \setminus k} v_{kk'}^l \le 2 \quad \forall k \in K,\ \forall l \in L. \tag{15}$$

(2) Binary and Nonnegativity Constraints.

$$x_i,\ y_k^l,\ o_k^l \in \{0,1\} \quad \forall i \in I,\ \forall k \in K,\ \forall l \in L.$$

It is easy to see that the IMCS problem is much more compact than the 2AOM problem and it does not require an initialization in terms of the total number of sibsets. The total number of discrete variables in the IMCS problem is $O(\max\{|K| \cdot |L|, |I|\})$, and the total number of constraints is $O(|K|^2 \cdot |L|)$.
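To make the 2-allele constraints concrete, the following minimal sketch checks, for a fixed candidate sibset, the two per-locus conditions that Equations (14) and (15) impose. It is an illustration only, not the authors' implementation: the genotype encoding (one pair of integer allele labels per locus), the function name `satisfies_two_allele`, and the absence of wildcard handling for missing alleles are all assumptions made here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: a genotype is a list with one (allele, allele) pair per locus.
Genotype = List[Tuple[int, int]]


def satisfies_two_allele(group: List[Genotype]) -> bool:
    """Check the per-locus conditions of Eqs. (14) and (15) for a candidate sibset."""
    if not group:
        return True
    for locus in range(len(group[0])):
        distinct: Set[int] = set()           # alleles k with y_k^l = 1
        homozygous: Set[int] = set()         # alleles k with o_k^l = 1
        partners: Dict[int, Set[int]] = {}   # k -> alleles k' with v_kk'^l = 1
        for genotype in group:
            a, b = genotype[locus]
            distinct.update((a, b))
            if a == b:
                homozygous.add(a)
            else:
                partners.setdefault(a, set()).add(b)
                partners.setdefault(b, set()).add(a)
        # Eq. (14): distinct alleles plus homozygous alleles must not exceed 4.
        if len(distinct) + len(homozygous) > 4:
            return False
        # Eq. (15): no allele may be paired with more than two other alleles.
        if any(len(others) > 2 for others in partners.values()):
            return False
    return True


if __name__ == "__main__":
    # Offspring of hypothetical parents (1, 2) x (3, 4) at a single locus.
    siblings = [[(1, 3)], [(1, 4)], [(2, 3)], [(2, 4)]]
    print(satisfies_two_allele(siblings))
    print(satisfies_two_allele(siblings + [[(5, 6)]]))
```

In this example the four offspring pass both checks, while adding an individual with genotype (5, 6) raises the number of distinct alleles at the locus to six and violates the condition of Equation (14).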

Iterative Procedure. The iterative procedure of the proposed heuristic approach is motivated by Khuller et al. (1999). The approach solves the IMCS problem over multiple iterations (m), where m is the final number of sibsets at the termination of our approach. In each iteration, the solution to the IMCS problem gives a list of individuals to be assigned to the current sibset. Then we remove the assigned individuals from the set I and repeat this procedure until there are no individuals in set I. The procedure of the iterative maximum covering set approach is given in Figure 3. The overall approach can be viewed as solving an assignment problem rather than a set covering problem, because each individual belongs to only one set. We note that this approach is very fast and scalable and can be used for very large-scale sibset reconstruction problems. This is because after every iteration, the IMCS problem becomes significantly smaller as we remove the largest possible group of assigned individuals, together with the alleles that do not appear in the remaining individuals.
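The iterative procedure can be sketched in a few lines of Python. The sketch below is a toy illustration under explicit assumptions: the MIP solve of each iteration is replaced by an exhaustive search over subsets (exponential time, so only usable on tiny inputs), genotypes are encoded as one pair of integer allele labels per locus, missing alleles are not handled, and the names `two_allele_ok`, `largest_sibset`, and `imcs` are hypothetical. The paper itself solves each iteration with CPLEX.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Genotype = List[Tuple[int, int]]  # one (allele, allele) pair per locus


def two_allele_ok(group: List[Genotype]) -> bool:
    """Per-locus feasibility test mirroring Eqs. (14) and (15)."""
    for locus in range(len(group[0])):
        distinct: Set[int] = set()
        homozygous: Set[int] = set()
        partners: Dict[int, Set[int]] = {}
        for genotype in group:
            a, b = genotype[locus]
            distinct.update((a, b))
            if a == b:
                homozygous.add(a)
            else:
                partners.setdefault(a, set()).add(b)
                partners.setdefault(b, set()).add(a)
        if len(distinct) + len(homozygous) > 4:
            return False
        if any(len(others) > 2 for others in partners.values()):
            return False
    return True


def largest_sibset(ids: List[int], genotypes: List[Genotype]) -> List[int]:
    """Exhaustive stand-in for one IMCS solve: try the largest subsets first."""
    for size in range(len(ids), 0, -1):
        for subset in combinations(ids, size):
            if two_allele_ok([genotypes[i] for i in subset]):
                return list(subset)
    return []


def imcs(genotypes: List[Genotype]) -> List[List[int]]:
    """Greedy loop: build one maximum sibset, remove its members, repeat."""
    remaining = list(range(len(genotypes)))
    sibsets: List[List[int]] = []
    while remaining:  # WHILE I is not empty
        chosen = largest_sibset(remaining, genotypes)
        sibsets.append(chosen)
        remaining = [i for i in remaining if i not in chosen]
    return sibsets


if __name__ == "__main__":
    # Two hypothetical families at one locus: (1,2)x(3,4) and (5,6)x(7,8).
    data = [[(1, 3)], [(1, 4)], [(2, 3)], [(2, 4)], [(5, 7)], [(6, 8)]]
    print(imcs(data))  # [[0, 1, 2, 3], [4, 5]]
```

Because a single individual always satisfies the two checks, every iteration assigns at least one individual and the loop terminates. The allele-pruning step of the procedure is not needed here, since the exhaustive search never builds an explicit model over the allele set.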

3.4. Solution Perturbation of Iterative

Maximum Covering Set

Because the mathematical programming formulation

of IMCS has a combinatorial objective function, it is

very likely that multiple or alternate optimal solu-

tions exist (called degeneracy). It might be worth-

while investigating the accuracy of alternate optimal

and second-best solutions. We note that the IMCS

approach is greedy-based; the reconstructed sibsets

are very much dictated by the sibset constructed in

the first (and possibly second) iteration. Here, we can

perturb the reconstructed sibsets by exploring alter-

nate optimal or second-best solutions in the first iter-

ation only, and in both first and second iterations.

To obtain alternate optimal and/or second-best solutions, we first optimally solve the IMCS problem in the first (or second) iteration, use a cut constraint to delete the optimal solution from the feasible space, and re-solve the IMCS problem with the additional cut constraint. We subsequently repeat the steps of

Iterative Maximum Covering Set approach
Input: Set I of unassigned individuals
Output: The number of sibsets and sibset assignment for individual set I, allele set K, locus set L
WHILE I ≠ ∅ DO
    Solve the IMCS problem
    IF optimal solution shows x_i = 1 THEN
        Remove individual i from set I
    IF there is no individual in set I having allele k at any loci in L THEN
        Remove allele k from set K

Figure 3 Pseudo-Code of the IMCS Approach


[Figure 4 Flowchart of Solution Perturbation of IMCS by Applying a Cut Constraint]

Note. t is the number of iterations where the cut constraint is applied (t = 1, 2).

IMCS approach in other iterations as usual. Figure 4 illustrates the flowchart of the alternate optimal and second-best solution procedure. Here, we use two types of cut constraints. The first cut constraint, called Opt Cut, is a simple constraint to ensure that the optimal solution is deleted from the feasible space, which is given by $\sum_{i' \in K} x_{i'} \le |K| - 1$, where $K$ is a subset of $I$ whose $x_i^* = 1$ in the previous optimal solution (in this section, $K$ denotes this set of previously selected individuals rather than the allele set). It may be possible that this cut will only delete one individual from the sibset in the first iteration, and the reconstructed sibsets might be very similar to the ones without the cut. The second cut constraint, called Complementary Cut, is proposed to ensure that the selected individuals in the first sibset will be somewhat different from the previous optimal solution. In other words, we want to ensure that at least one of the individuals that was not selected in the previous optimal solution must be selected in the perturbed solution. The second cut constraint, in fact, ensures that the set of complements is covered, which is given by $\sum_{i' \in I \setminus K} x_{i'} \ge 1$. Although adding the second cut constraint cannot guarantee that the new solution is an alternate optimal or second-best solution, the constraint gives more diversification to the solutions; also note that the second cut constraint is tighter than the first cut constraint.
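The two cuts can be written as simple predicates on a 0/1 selection vector, which also makes the "tighter" remark easy to check on a small case. The sketch below is illustrative only; the function names and the toy instance are assumptions. The last helper enumerates every selection that is no larger than the previous optimum (a feasible selection cannot be larger, since the previous solution was a maximum) and confirms that any selection satisfying the Complementary Cut also satisfies the Opt Cut.

```python
from itertools import product
from typing import Sequence, Set


def opt_cut(x: Sequence[int], prev: Set[int]) -> bool:
    """Opt Cut: sum of x_i over the previous solution is at most |K| - 1."""
    return sum(x[i] for i in prev) <= len(prev) - 1


def complementary_cut(x: Sequence[int], prev: Set[int]) -> bool:
    """Complementary Cut: at least one individual outside the previous solution is selected."""
    return sum(x[i] for i in range(len(x)) if i not in prev) >= 1


def complementary_implies_opt(n: int, prev: Set[int]) -> bool:
    """Enumerate selections of size <= |K| and test the implication."""
    return all(
        opt_cut(x, prev)
        for x in product((0, 1), repeat=n)
        if sum(x) <= len(prev) and complementary_cut(x, prev)
    )


if __name__ == "__main__":
    previous = {0, 1, 2}  # hypothetical first-iteration optimum among 5 individuals
    print(opt_cut([1, 1, 1, 0, 0], previous))            # previous optimum is cut off
    print(complementary_cut([1, 1, 0, 0, 0], previous))  # a mere subset is cut off too
    print(complementary_implies_opt(5, previous))
```

The second printed case shows the practical difference: dropping one individual from the previous sibset passes the Opt Cut but fails the Complementary Cut, which is why the latter yields more diversified solutions.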

4. Computational Experience

This section describes the characteristics of our data set (both real biological and simulated data) used in this study to evaluate the performance of the proposed 2AOM, $\widetilde{\text{2AOM}}$, and IMCS approaches. The 2AOM and IMCS approaches are made available at http://kinalyzer.cs.uic.edu. The performance was assessed by the accuracy of reconstructed sibsets with respect to the real (known) sibling groups.

4.1.

We used both real biological data and randomly gen-

erated data to assess the performance of our opti-

mization model and algorithm. Some of the real data

used in this study have been previously used for

sibling reconstruction (Almudevar and Field 1999).

These data were considered benchmark data because

the true sibling relationships were known. However,

because of the limitations of the real data, including

scoring errors and missing alleles, we developed a

random population (problem) generator used to con-

trol the characteristics of the data set to validate our

approaches.

4.1.1. Real Biological Data. In this study, we used

four real biological data sets of microsatellite geno-

types scored from individuals whose true sibling

groups were known. Although the data sets were

obtained from wild species (animals and plants), they

came from controlled crosses because true sibling

groups are typically not known in wild populations.

Most data sets analyzed in this study were imper-

fect because of the technical errors in acquiring and

scoring microsatellite data, which resulted in missing

alleles and/or genotyping errors. There are several

possible and relatively common causes of imperfect

data, including allelic dropout and null alleles. In this

study, any missing alleles or detected genotyping

error was replaced by a wildcard (∗) to indicate the

missing information. When checking for genetic fea-

sibility of membership of a new individual in a sib-

ling group, a wildcard could correspond to any allele.

Generally speaking, if the data sets were complete

and the sample sizes were large enough, one should

be able to reconstruct very accurate sibsets (although

one cannot guarantee a perfect reconstruction). Given

that we had almost complete allele information for the

salmon and shrimp data sets, we expected to obtain

more reliable and accurate sibset results than those

obtained in the radish and fly data sets. The charac-

teristics of the data sets are shown in Table 1.

Table 1 Characteristics of Real Biological Data Sets

Data set    No. of        No. of    No. of   Avg. no. of     Percentage of
(species)   individuals   sibsets   loci     alleles/locus   missing alleles
Salmon∗     351           6         4        7.8             0.00
Radish∗     531           2         5        3.0             3.99
Shrimp      59            13        7        14.9            0.06
Fly         190           6         2        7.0             37.89

∗Some known sibsets in the data set are biologically inconsistent because of genotyping errors during the data collection.


Salmon. The Atlantic salmon Salmo salar data set

was acquired from the genetic improvement program

of the Atlantic Salmon Federation (Herbinger et al.

1999). We used a truncated sample of microsatellite

genotypes of 250 individuals from five families with

four loci per individual. The data did not have miss-

ing alleles at any locus. This data set was a subset of

one used by Almudevar and Field (1999) to illustrate

their technique.

Radish. The wild radish Raphanus raphanistrum

data set (Conner 2005) consisted of samples from 150

radishes from two families with five loci and five alle-

les per locus. There were 37 missing alleles among all

the loci. The parent genotypes were available.

Shrimp. The tiger shrimp Penaeus monodon data

set (Jerry et al. 2006) consisted of 59 individuals from

13 families with seven loci. There were 16 missing

alleles among all the loci. The parent genotypes were

available.

Fly. The Scaptodrosophila hibisci data set (Wilson

et al. 2002) consisted of 190 same generation individ-

uals (flies) from six families sampled at various num-

ber of loci with up to eight alleles per locus. Parent

genotypes were known. All individuals shared two

sampled loci that were chosen for our study. A sub-

stantial proportion of alleles were missing for some

individuals.

4.1.2. Random Data. To create a set of simulation data, we developed a random population generator that works as follows. The generator first constructed a group of adults (parents) with the full genetic information. Based on this information, a single generation of sibling data was generated and the parentage information was retained so that the true sibling groups were known. The sibling population generator requires the following parameters: $M$ is the number of adult males, $F$ is the number of adult females, $l$ is the number of sampled loci, $a$ is the number of alleles per locus, $j$ is the number of juveniles in the population per one adult female, and $o$ is the maximum number of offspring per parent couple. The procedure of our random generator can be described in detail as follows:

Step 1. First, we generated the parent population of $M$ males and $F$ females, each parent with $l$ loci and each locus having $a$ distinct alleles.

Step 2. After the parents were generated, we created a population of their offspring by randomly selecting $j$ pairs of parents. A male and a female were chosen independently and uniformly at random from the parent population.

Step 3. For each of the chosen parent pairs, we generated a specified number of offspring, $o$, each randomly receiving one allele from its mother and one from its father at each locus.

This population generator is a rather simplistic

approach; however, it is consistent with the genetics

of known parents and provides a baseline for testing

the accuracy of the algorithm. To produce a simulated

data set used in this study, we varied the parameters

of the population generator as follows:

• The number of adult females (F) and the number

of adult males (M) are set to 10;

• The number of sampled loci (l) is set to 2, 4, 6,

and 10;

• The number of alleles per locus (a) is set to 2, 5,

10, and 20;

• The number of families (j) is 1, 2, 5, and 10; and

• The maximum number of offspring per couple (o) is set to 2, 5, and 10.

For each parameter setting, we obtained an offspring population with known parent pairs. In each population, there were o×j individuals with j known sibling groups. Random data are made available at http://kinalyzer.cs.uic.edu.
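The three generator steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used in the study: parent alleles are drawn uniformly from `a` integer labels per locus, exactly `o` offspring are produced per selected couple, and each offspring is labeled with its (male index, female index) pair so that a couple drawn twice yields a single sibling group. The function name and return layout are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Genotype = List[Tuple[int, int]]  # one (allele, allele) pair per locus


def generate_population(
    M: int, F: int, l: int, a: int, j: int, o: int, seed: Optional[int] = None
) -> Tuple[List[Genotype], List[Tuple[int, int]], List[Genotype], List[Genotype]]:
    """Return (offspring, couple labels, males, females) for one simulated population."""
    rng = random.Random(seed)

    def random_parent() -> Genotype:
        # Step 1 (assumption): each allele is drawn uniformly from a labels.
        return [(rng.randrange(a), rng.randrange(a)) for _ in range(l)]

    males = [random_parent() for _ in range(M)]
    females = [random_parent() for _ in range(F)]

    offspring: List[Genotype] = []
    labels: List[Tuple[int, int]] = []
    for _ in range(j):
        # Step 2: pick a male and a female independently and uniformly at random.
        m_idx, f_idx = rng.randrange(M), rng.randrange(F)
        for _ in range(o):
            # Step 3: one allele from the mother and one from the father per locus.
            child = [
                (rng.choice(females[f_idx][loc]), rng.choice(males[m_idx][loc]))
                for loc in range(l)
            ]
            offspring.append(child)
            labels.append((m_idx, f_idx))
    return offspring, labels, males, females


if __name__ == "__main__":
    kids, couples, _, _ = generate_population(M=10, F=10, l=4, a=5, j=3, o=5, seed=1)
    print(len(kids), len(set(couples)))
```

Retaining the couple labels is what makes the true sibling groups known, so the reconstructed sibsets can later be scored against them.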

4.2. Evaluation and Assessment

We evaluated the effectiveness of our approaches by comparing the reconstructed sibling groups with the actual known sibling groups. The error measurement was obtained by calculating the minimum partition distance (Gusfield 2002). The error rate (1 − accuracy) was defined as the ratio of the partition distance to the total number of individuals. In other words, the accuracy used in this study is the percentage of individuals correctly assigned to sibling groups.

The minimum partition distance is known to be equivalent to the maximum linear assignment problem (MLAP) that can be solved in polynomial time. This problem is also known as the maximum bipartite weighted matching problem. The MLAP for the sibling reconstruction problem can be defined as follows: Given two collections of sibsets $\{A_1, \ldots, A_n\}$ and $\{B_1, \ldots, B_m\}$, let $C$ be the $n \times m$ cost matrix where $c_{ij}$ is the cost of assigning sibset $A_i$ to sibset $B_j$. Then the MLAP is to find an assignment of sibsets in $A$ to sibsets in $B$ at the maximum cost (individual matchings) such that each sibset in $A$ is assigned to at most one sibset in $B$, and vice versa. The MLAP can be formulated as a MIP problem given by

$$\max \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij} \tag{16}$$

$$\text{s.t.} \quad \sum_{j=1}^{m} x_{ij} \le 1 \quad \text{for } i = 1, \ldots, n, \tag{17}$$

$$\sum_{i=1}^{n} x_{ij} \le 1 \quad \text{for } j = 1, \ldots, m, \tag{18}$$

$$x_{ij} \in \{0,1\}.$$


We note that the solution to MLAP, $|U| -$ (maximum assignment), where $U$ is the set of all individuals, can be represented as the minimum number of individuals to be deleted so that these two sibset collections are identical. This distance counts the errors (misassignments) and is used to calculate the accuracy of our approaches. However, note that the relationships among parents are not necessarily monogamous; i.e., some sibsets in $A$ (or $B$) may not be disjoint. As our MIP model is a covering model, the solution (set of full sibling groups) does not induce a partition on the individuals. Thus, to make this accuracy measure more appropriate, we propose the following modification of the MLAP: given two collections of non-disjoint sets $\{P_1, \ldots, P_n\}$ and $\{Q_1, \ldots, Q_m\}$ of elements in $U$ and a solution (maximum assignment of $|\cup P_i \cap Q_j|$) to the MLAP over the matrix $c_{ij} = |P_i \cap Q_j|$, the minimum distance between two sibling sets is $|U| -$ (maximum assignment).
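The accuracy measure can be illustrated with a small brute-force sketch. It assumes non-empty collections of Python sets over integer individual IDs, takes $c_{ij} = |A_i \cap B_j|$, and enumerates all one-to-one assignments with `itertools.permutations`. This is only practical for a handful of sibsets, whereas a polynomial-time assignment algorithm would be used in practice; the function names are hypothetical.

```python
from itertools import permutations
from typing import List, Set


def max_assignment(A: List[Set[int]], B: List[Set[int]]) -> int:
    """Brute-force optimum of Eqs. (16)-(18) with c_ij = |A_i & B_j|."""
    if len(A) > len(B):
        A, B = B, A  # match every set of the smaller collection
    best = 0
    for perm in permutations(range(len(B)), len(A)):
        total = sum(len(A[i] & B[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best


def accuracy(true_sets: List[Set[int]], found_sets: List[Set[int]]) -> float:
    """Accuracy = (maximum assignment) / |U|, i.e., 1 - distance / |U|."""
    universe = set().union(*true_sets)
    return max_assignment(true_sets, found_sets) / len(universe)


if __name__ == "__main__":
    truth = [{0, 1, 2, 3}, {4, 5}]
    found = [{0, 1, 2}, {3, 4, 5}]
    print(accuracy(truth, found))  # 5 of 6 individuals are matched
```

Since all costs are nonnegative and each sibset can be used at most once, matching every set of the smaller collection loses nothing, which is why the enumeration over injective assignments suffices. In the example, individual 3 is the single misassignment, so the distance is 1 and the accuracy is 5/6.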

4.3. Empirical Results

We used the 2AOM, $\widetilde{\text{2AOM}}$, and IMCS approaches described above to reconstruct the sibling groups from real and simulated data. We subsequently measured the accuracies of the reconstructed sibling sets in reference to the true sibling groups by solving the MLAP for every data set. In addition, we compared the performance characteristics (solution accuracies) of our approaches to the ones obtained by our previous approach (Berger-Wolf et al. 2007) and three other well-known sibship reconstruction methods in the literature (Almudevar and Field 1999, Beyer and May 2003, Konovalov et al. 2004). All tests of our new approaches were run on an Intel Xeon quad-core 3.0 GHz workstation with 8 GB of RAM. Computational times reported in this section were obtained from the desktop's internal timing calculations, which included time used for preprocessing, perturbation, and postprocessing. All the mathematical modeling and algorithms were implemented in MATLAB and solved using a callable General Algebraic Modeling System (GAMS) library with CPLEX version 10.0 (default setting). The tests of our previous and other methods were run on a single processor with 4 GB of RAM on the 64-node cluster running RedHat Linux 9.0. The difference in platforms

Table 2 Performance Characteristics of the Proposed 2AOM, 2-Allele Optimization Model with Set Partitioning Constraints ($\widetilde{\text{2AOM}}$), and IMCS Approaches on Real Biological Data Sets

                      2AOM                            2AOM (set partitioning)         IMCS
Species  Real no.     No. of   Accuracy  CPU          No. of   Accuracy  CPU          No. of   Accuracy  CPU
         of sibsets   sibsets  (%)       time         sibsets  (%)       time         sibsets  (%)       time
Salmon   6            8        94.02     >72,000      14       76.07     >72,000      7        98.30     149.19
Radish   2            3        51.98     75.17        3        49.15     363.23       3        52.54     26.31
Shrimp   13           14       94.92     >72,000      13       100.00    >72,000      13       100.00    184.72
Fly      6            7        66.84     >72,000      7        55.26     >72,000      8        47.37     22.78

and operating systems was dictated by the available

software licenses and provided binary codes.

4.3.1. Results from Real Biological Data. We used the 2AOM, $\widetilde{\text{2AOM}}$, and IMCS approaches described in §3 to reconstruct the sibling sets on all four real biological data sets. As mentioned, CPLEX was used to solve the optimization models in all approaches and the stopping criterion was set to be either less than 0.01% of solution gap or 20 hours (72,000 seconds) of running time. We note that the IMCS approach obtained the optimal solutions in all instances, whereas the 2AOM and $\widetilde{\text{2AOM}}$ approaches obtained the optimal solution only in the radish data set. Specifically, in the salmon, shrimp, and fly data sets, when using the 2AOM and $\widetilde{\text{2AOM}}$ approaches, CPLEX failed to obtain the optimal solution under the 20-hour time limit, and the reported results were based on the best integer-feasible solutions. The quality of sibset solutions was assessed in terms of sibset accuracy as explained in §4.2. Table 2 gives the performance characteristics and solution quality of the 2AOM, $\widetilde{\text{2AOM}}$, and IMCS approaches. The computational times reported in Table 2 are in seconds. We note that the objective functions of both approaches are to minimize the number of sibsets.

Although both the set covering (2AOM) and set partitioning ($\widetilde{\text{2AOM}}$) models only obtained the optimal solution within the time limit for the radish data set, they provided quite good integer-feasible solutions of sibling reconstruction in all other data sets. In terms of the validity of the parsimony assumption, sibset solution gaps were computed with respect to the true numbers of sibling sets in the biological data sets. The set covering 2AOM approach yielded 33% (2/6), 50% (1/2), 8% (1/13), and 17% (1/6) relative gaps to the real number of sibsets, respectively. The set partitioning $\widetilde{\text{2AOM}}$ approach yielded 133% (8/6), 50% (1/2), 8% (1/13), and 17% (1/6) relative gaps to the real number of sibsets, respectively. The optimal heuristic solutions of the IMCS approach yielded relative gaps for the salmon, radish, and fly data sets of about 17% (1/6), 50% (1/2), and 33% (2/6), respectively. Nevertheless, we note that the objective function values were rather artificial because the real solution of


interest was the sibset assignments provided by these approaches. More importantly, the data sets were not perfect; i.e., there were missing alleles and genotyping errors. These errors would have made the true known sibset assignments violate the Mendelian constraints. Thus, the optimal solutions to both 2AOM and $\widetilde{\text{2AOM}}$ models, if obtained, would have provided solution gaps in terms of the number of sibsets.

We observed that all methods provided relatively accurate reconstructed sibset results. In particular, the IMCS and $\widetilde{\text{2AOM}}$ approaches provided a perfect sibset reconstruction for the shrimp data set with 100% accuracy. All approaches provided accurate reconstructed sibling relationships for the salmon data set, and the reason that we did not obtain 100% accuracy is that there were genotyping errors and an inaccurate known sibset assignment in the data set. For the fly data set, the 2AOM approach obtained a more accurate sibset solution than that obtained by the IMCS approach (66.84% versus 47.37% in Table 2). On the other hand, the IMCS approach was more accurate in three other data sets. In all data sets except shrimp, the set covering 2AOM approach consistently provided better reconstruction results than the set partitioning $\widetilde{\text{2AOM}}$ approach.

It is worth noting that none of the approaches performed well on the radish and fly data sets because there were a lot of missing data and genotyping errors. For the radish data set, we investigated the input genotypes and observed that the true sibsets (given solutions) violated the 2-allele property, which was biologically impossible. We did not cleanse or correct the data because this data set was used in the literature before and we wanted to compare our solution with the previous ones. It is important to note that although the IMCS approach was required to solve optimization problems iteratively, the computational times required by the IMCS approach were significantly less than those required by the 2AOM and $\widetilde{\text{2AOM}}$ approaches.

Solution Perturbation. We investigated and compared the accuracy of reconstructed sibsets by perturbing the solutions in the first and second iterations of the IMCS approach. Table 3 presents the performance characteristics of IMCS with Opt Cut and

Table 3 Performance Characteristics of the IMCS Approaches with Opt Cut and Complementary Cut Constraints Applied in the First Iteration

                      IMCS with Opt Cut               IMCS with Complementary Cut
Species  Real no.     No. of   Accuracy  CPU          No. of   Accuracy  CPU
         of sibsets   sibsets  (%)       time         sibsets  (%)       time
Salmon   6            7        98.01     133.59       7        98.30     124.78
Radish   2            3        52.35     23.34        3        51.41     19.94
Shrimp   13           13       100.00    159.30       13       100.00    154.31
Fly      6            8        44.74     23.16        8        36.84     18.67

IMCS with Complementary Cut applied in the first

iteration only. For the shrimp data set, all approaches

performed very well and were able to perfectly recon-

struct the true sibsets. In fact, all approaches recon-

structed the same sibsets, but the only difference was

the order of reconstructed sibsets because of alternate

optimal solutions in the first iterations. For the salmon

data set, the Complementary Cut also provided an

alternate best sibset reconstruction while the Opt Cut

provided a slightly less accurate solution. For other

data sets, the perturbed solutions did not perform as

well as the optimal solution to IMCS, although the

accuracies are very close. Table 4 presents the perfor-

mance characteristics of IMCS with Opt Cut and IMCS

with Complementary Cut applied in the first and sec-

ond iterations. For both shrimp and salmon data sets,

the Complementary Cuts again provided an alternate

best sibset reconstruction. Nevertheless, the Opt Cut

was able to obtain slightly more accurate solutions in

other data sets. Based on these results, we concluded

that if the data had a good separation among sibsets

like the shrimp data set, any of these techniques would

have been able to accurately reconstruct the true sib-

sets. However, for imperfect data, these results sug-

gested that the greedy approach that used only the

optimal solution may be a better option in practice.

Performance Comparison. Next, we compared the accuracies of sibset solutions obtained by the 2AOM, $\widetilde{\text{2AOM}}$, and IMCS approaches to four current state-of-the-art methods for sibling reconstruction in Table 5. These methods are based on very diverse approaches with different mechanisms and solution behaviors. The BWG algorithm, proposed previously by our group in Berger-Wolf et al. (2007), is based on a 2-allele set construction version of the set covering method proposed in Chaovalitwongse et al. (2007). The A&F algorithm, proposed in Almudevar and Field (1999), is based on a completely combinatorial approach to exhaustively enumerate all possible sibling sets satisfying the 2-allele property (although the authors do not explicitly state the property) and obtain a maximal, not necessarily optimal, collection of sibling sets. The B&M algorithm, proposed in Beyer and May (2003), is based on a mixture of likelihood and combinatorial techniques used to construct a graph


Table 4 Performance Characteristics of the IMCS Approaches with Opt Cut and Complementary Cut Constraints Applied in the First and Second Iterations

                      IMCS with Opt Cut               IMCS with Complementary Cut
Species  Real no.     No. of   Accuracy  CPU          No. of   Accuracy  CPU
         of sibsets   sibsets  (%)       time         sibsets  (%)       time
Salmon   6            8        97.72     161.50       7        98.30     134.08
Radish   2            3        52.17     28.58        4        42.94     30.92
Shrimp   13           13       100.00    154.28       13       100.00    165.75
Fly      6            8        46.84     20.84        7        44.21     17.91

with individuals as nodes and the edges weighted by

the pairwise likelihood (relatedness) ratio, and iden-

tify potential sibling sets by the connected compo-

nents in the graph. The KG or KinGroup algorithm,

proposed in Konovalov et al. (2004), is based on the

likelihood estimates of partitions of individuals into

sibling groups by comparing, for every individual, the

likelihood of being part of any existing sibling group

with the likelihood of starting its own group.

From the results in Table 5, we observed that the

proposed 2AOM and IMCS approaches outperformed

other methods on the shrimp data set. The main rea-

son that our approaches performed very effectively

was that this data set was almost complete in allele

information and the average number of distinct alle-

les per locus was very high compared to other data

sets, intuitively making the distinction among differ-

ent sibsets easier. Nevertheless, our approaches were

also competitive in the data sets with missing allele

information. We observed that the radish data set presented a problem for all methods except BWG, since it had partial self-reproduction, and the offspring of a selfed individual were hard to separate from their half-siblings produced by that and any other individual. Our approaches did not take this species-dependent constraint into account.

4.3.2. Results from Simulated Data. We also validated the proposed 2AOM and IMCS approaches on the simulated data set and compared the results to the actual known sibling groups in the data to assess the accuracy of our constructed sibling sets. In addition, we compared the accuracy of our approaches to that of the M4SCP proposed in our previous paper (Chaovalitwongse et al. 2007). The reason that we did

Table 5 Accuracies of the Sibling Sets Constructed by Our Approaches and Other Approaches from Real Biological Data Sets

Species  2AOM    2AOM (set            IMCS    BWG     A&F     B&M     KG
         (%)     partitioning) (%)    (%)     (%)     (%)     (%)     (%)
Salmon   94.02   76.07                98.30   98.30   N/A∗    99.71   96.02
Radish   51.98   49.15                52.54   75.90   N/A∗    53.30   29.95
Shrimp   94.92   100.00               100.00  77.97   67.80   77.97   77.97
Fly      66.84   55.26                47.37   100.00  31.05   27.89   54.73

∗A&F ran out of 4 GB memory as it enumerates all possible sibling sets.

not compare the $\widetilde{\text{2AOM}}$ approach was that it was almost always outperformed by the 2AOM approach. Because there were several parameter combinations in this simulation, we limited the running time of CPLEX to two hours (7,200 seconds) for the 2AOM and IMCS approaches. The comparison of the three approaches on randomly simulated data is shown in Table 6. Because there were four-dimensional parameter settings (i.e., four parameters to a set), we reported the results by fixing one parameter at a time. The accuracies and computational time were calculated based on the average of all other varying parameters. From the results in Table 6, we observed that the proposed 2AOM and IMCS approaches outperformed the M4SCP on average for all the fixed parameters except a = 2. Note that this was not always the case for all parameter settings. The reconstruction based on the IMCS approach was generally better than that based on the 2AOM approach (the exceptions in Table 6 being l = 2 and j = 5), and its computational time was consistently lower. We observed that the computational time of IMCS drastically increased when l = 10, a = 10, j = 10, and o = 10 because there was an instance when the IMCS approach failed to solve the simulated data with that setting to optimality. Therefore, the running

Table 6Accuracies of the Sibling Sets Constructed by Our Approaches

and the M4SCP Approach (Chaovalitwongse et al. 2007) from

Simulated Data Set

2AOMIMCSM4SCP

Parameter

settings

Accuracy

(%)

CPU

time

Accuracy

(%)

CPU

time

Accuracy

(%)

CPU

time

l =2

l =4

l =6

l =10

a=2

a=5

a=10

a=20

j =2

j =5

j =10

o =2

o =5

o =10

Note. The CPU time is reported in seconds.

59?25

63?94

64?28

60?56

26?67

69?42

71?81

80?14

76?67

64?63

44?73

49?48

69?46

67?08

2?273?04

2?754?80

3?005?49

3?078?93

57?61

66?53

71?44

71?89

26?67

72?19

81?83

86?78

78?13

64?58

57?90

54?38

69?83

76?40

2?28

8?28

28?96

239?21

0?21

30?54

225?17

22?81

0?72

3?65

204?68

2?67

14?41

191?97

54?18

52?71

54?78

55?28

36?98

58?34

60?71

60?91

62?88

49?56

34?00

18?19

36?66

53?98

0?26

0?21

0?19

0?19

0?16

0?16

0?39

0?19

0?02

0?11

0?75

0?22

0?27

0?22

0?56

3?679?45

3?699?62

3?732?64

1?50

3?079?56

5?253?14

1?711?83

3?250?27

3?372?10

INFORMS holds copyright to this article and distributed this copy as a courtesy to the author(s).

Additional information, including rights and permission policies, is available at http://journals.informs.org/.

Page 13

Chaovalitwongse et al.: Optimization Methods for Sibling Reconstruction from Genetic Markers

INFORMS Journal on Computing, Articles in Advance, pp. 1–15, ©2009 INFORMS

13

time went up to 7,200 seconds. In all other cases, the

IMCS approach always obtained optimal solutions in

very reasonable time. In most cases, except when a=

2 and j = 2, the 2AOM approach failed to solve the

sibling reconstruction problems to optimality.
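The one-parameter-at-a-time summary used for Table 6 can be sketched as follows. This is a minimal illustration only, assuming each simulation run is stored as a record holding its four parameter values (l, a, j, o) and its measured accuracy. The function name `marginal_means`, the record layout, and the numbers in `runs` are hypothetical and are not taken from the experiments.

```python
from collections import defaultdict
from statistics import mean

# Parameter names as used in the text: loci, alleles per locus,
# families, offspring per family.
PARAMS = ("l", "a", "j", "o")


def marginal_means(runs, metric):
    """Average `metric` over all runs that share one fixed parameter value."""
    buckets = defaultdict(list)
    for run in runs:
        for name in PARAMS:
            # Each run contributes to one bucket per parameter, e.g. ("l", 2).
            buckets[(name, run[name])].append(run[metric])
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical runs: made-up numbers, not results from the paper.
runs = [
    {"l": 2, "a": 2, "j": 2, "o": 2, "accuracy": 40.0},
    {"l": 2, "a": 5, "j": 2, "o": 2, "accuracy": 60.0},
    {"l": 4, "a": 2, "j": 2, "o": 2, "accuracy": 50.0},
    {"l": 4, "a": 5, "j": 2, "o": 2, "accuracy": 90.0},
]

table = marginal_means(runs, "accuracy")
for (name, value), avg in sorted(table.items()):
    print(f"{name} = {value}: mean accuracy {avg:.2f}")
```

Each row of such a summary hides the spread across the other three parameters, which is why an average can favor one approach even when individual settings do not.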

Figure 5 illustrates the performance trend for all three approaches when varying the number of alleles per locus (a) and the number of sampled loci (l) while fixing the number of families (j) and the number of offspring per family (o) at 10. Intuitively, the sibling reconstruction problem should be easier to solve when the number of alleles per locus increases because there is greater variation in the allele frequency distribution, which should help to distinguish one sibling group from another. In Figure 5, the accuracy increases as a increases for both the 2AOM and IMCS approaches. However, we did not see the same behavior in the M4SCP approach, which was not robust to the goodness and completeness of the data. Similarly, the reconstructed sibling sets should be more accurate if there are more sampled loci (more combinatorial constraints). We observed a clear upward accuracy trend in the IMCS approach. However, the accuracies of the M4SCP and 2AOM approaches did not improve with the allelic information from additional loci. We speculated that this happened with the 2AOM approach because it failed to efficiently and effectively solve the optimization problems as the problem size increased. Thus, the reconstructed sibling sets were from the best feasible integer (not optimal) solutions. These results indicate that the proposed IMCS approach is the more robust, efficient, and accurate approach.

Figure 5. Accuracies of the Sibling Sets Constructed by the 2AOM, IMCS, and M4SCP Approaches on Randomly Generated Data

[Two panels comparing M4SCP, 2AOM, and IMCS. Left panel, "Solution accuracies (l = 2, j = 10, o = 10)": accuracy of reconstructed sibsets versus number of alleles per locus (a). Right panel, "Solution accuracies (a = 5, j = 10, o = 10)": accuracy of reconstructed sibsets versus number of sampled loci (l).]

Notes. The y-axis shows the accuracy of reconstruction as a function of the number of alleles per locus (left) and the number of sampled loci (right). The title shows the value of the fixed parameters.

Figure 6 illustrates the performance trend for the 2AOM, IMCS, BWG, B&M, and KG approaches when varying the number of alleles per locus (a) and the number of sampled loci (l). We observed increasing accuracy as a increases with all approaches except the 2AOM approach. Although the 2AOM model was a complete mathematical formulation of the sibling reconstruction problem, it failed to obtain the optimal solutions in most cases because of the time limit. Compared with all other approaches, the IMCS approach was the best in terms of the trade-off between solution quality and computational time. Our previous BWG approach outperformed the IMCS approach when the number of alleles per locus was small; however, the computational time required by the BWG approach was much larger. In conclusion, the proposed 2AOM and IMCS approaches gave accurate and reliable sibset solutions when there was enough separation in the data (e.g., number of loci and number of alleles per locus). Note that although the 2AOM approach would require more computation time (e.g., days or weeks) to solve the MIP problem to optimality, it should deliver the best possible solution. The choice between the two approaches therefore depends on the application.

Figure 6. Accuracies of the Sibling Sets Constructed by 2AOM, IMCS, BWG, B&M, and KG Approaches from Simulated Data Sets with the Parameter Settings M = F = 10, j = 10, o = 10, l = 2, 4, and a = 5, 10, 15

[Two panels comparing BWG, B&M, KG, 2AOM, and IMCS, each plotting accuracy (%) versus number of alleles per locus. Left panel: "Males/females = 10, families = 10, offsprings = 10, Loci = 2". Right panel: "Males/females = 10, families = 10, offsprings = 10, Loci = 4".]

5. Conclusion and Discussion

This paper presents a novel optimization model and solution approach for the problem of sibling reconstruction from single generation microsatellite genetic data. The sibling reconstruction problem is an extremely difficult problem that has been shown to be NP-complete and cannot be approximated within a ratio of n^ε, where n is the number of individuals and 0 < ε < 1. A new optimization model for this problem, 2AOM, was developed herein and shown to be a generalization of the well-known NP-hard set covering problem. A heuristic approach, IMCS, was developed to efficiently solve the 2AOM model; it is based on a well-known approximation algorithm for the set covering problem and iteratively solves the decomposed problems of 2AOM. The IMCS approach is able to accurately reconstruct sibling groups without knowledge of the underlying population allele frequencies, which is required by likelihood-based sibling reconstruction approaches. This makes our work very practical because it may be difficult to obtain accurate estimates of underlying population allele frequencies independently of the sample of potential siblings.
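For readers unfamiliar with approximation algorithms for set covering, the following is a minimal sketch of the classic greedy heuristic (see Vazirani 2001). It is not the IMCS procedure itself, and it may not be the exact variant on which IMCS is built. The function name `greedy_set_cover` and the toy instance are illustrative assumptions. In the sibling setting, one might think of the elements as individuals and the candidate sets as feasible sibling groups.

```python
def greedy_set_cover(universe, candidate_sets):
    """Classic greedy heuristic: repeatedly pick the candidate set that
    covers the most still-uncovered elements.

    Returns the indices of the chosen sets, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the set with the largest overlap with the uncovered
        # elements; ties go to the lowest index.
        best = max(range(len(candidate_sets)),
                   key=lambda i: len(candidate_sets[i] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:
            raise ValueError("candidate sets do not cover the universe")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy instance (hypothetical): six individuals, five candidate groups.
universe = {1, 2, 3, 4, 5, 6}
candidate_sets = [
    frozenset({1, 2, 3}),
    frozenset({4, 5}),
    frozenset({3, 4}),
    frozenset({6}),
    frozenset({5, 6}),
]

cover = greedy_set_cover(universe, candidate_sets)
print(cover)
```

The greedy rule gives a logarithmic approximation guarantee for set covering but no guarantee of optimality, which is consistent with treating a procedure built on it as a heuristic.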

We implemented and tested our approaches on both real biological and simulated data, and then compared the solution quality of our approaches with that of other state-of-the-art sibling reconstruction methods in the literature. For biological data, our approaches performed as well as or better than other methods. Most importantly, our approaches were able to perfectly reconstruct the true sibling sets in the shrimp data set, a result not obtained by our previously published methods. The results suggested that our combinatorial-based approaches gave accurate and reliable sibset solutions for clean and well-separated data sets. On the other hand, our approaches did not perform well for the radish and fly data sets because of missing alleles and biologically inconsistent sibset solutions. These are errors typically present in microsatellite data. For example, allelic dropout occurs when one or both alleles are not amplified during polymerase chain reaction (PCR). Heterozygous mistyping occurs when two alleles are amplified by PCR, but one or both of them, for a variety of reasons, are not recorded as present. Homozygous mistyping occurs when only one allele is amplified by PCR, and it is not any of the parental alleles. Allele combination error occurs when one or both alleles at a locus are present in the parents (or sibling group) but Mendelian inheritance rules are still violated. The experiments on simulated data allowed us to assess the accuracy of our approaches on error-free data. In all cases except a = 2, the proposed IMCS approach reconstructed the sibling sets with an accuracy greater than 50%.

In parallel with this work, we have addressed the possibility of errors in data or missing allele information by using the concept of consensus methods (Sheikh et al. 2008). We have developed an error-tolerant method for reconstructing sibling relationships that tolerates genotyping errors and mutations in data. The key idea of this method is to remove microsatellite data from one locus at a time, assuming it to be erroneous, and obtain a sibling reconstruction solution based on the remaining loci. We consider a pair of individuals to be siblings if there is a consensus among (almost) all the reconstructed solutions. Preliminary results are presented in Sheikh et al. (2008).
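The leave-one-locus-out consensus idea described above can be sketched as follows. This is a rough illustration under stated assumptions, not the method of Sheikh et al. (2008): the function names, the data layout (a dictionary from individual to a list of per-locus genotypes), the `min_agreement` threshold, and the toy `group_identical` stand-in for a real sibling reconstruction solver are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_sibling_pairs(genotypes, reconstruct, min_agreement=None):
    """Drop one locus at a time, reconstruct sibling groups on the rest,
    and keep the pairs grouped together in at least `min_agreement` runs.

    `reconstruct` is any callable that maps a genotype dictionary to a
    list of sets of individuals (one set per sibling group).
    """
    n_loci = len(next(iter(genotypes.values())))
    if min_agreement is None:
        min_agreement = n_loci  # strict consensus: every run must agree
    votes = Counter()
    for dropped in range(n_loci):
        # Genotype data with locus `dropped` removed for every individual.
        reduced = {ind: [g for k, g in enumerate(loci) if k != dropped]
                   for ind, loci in genotypes.items()}
        for group in reconstruct(reduced):
            for pair in combinations(sorted(group), 2):
                votes[pair] += 1
    return {pair for pair, count in votes.items() if count >= min_agreement}


def group_identical(genotypes):
    """Toy stand-in solver: group individuals with identical genotypes."""
    groups = {}
    for ind, loci in genotypes.items():
        groups.setdefault(tuple(loci), set()).add(ind)
    return list(groups.values())


# Hypothetical data: four individuals typed at three loci.
genotypes = {
    "A": [(1, 2), (3, 4), (5, 6)],
    "B": [(1, 2), (3, 4), (5, 6)],
    "C": [(1, 2), (3, 4), (9, 9)],
    "D": [(7, 8), (7, 8), (7, 8)],
}

strict = consensus_sibling_pairs(genotypes, group_identical)
loose = consensus_sibling_pairs(genotypes, group_identical, min_agreement=1)
print(strict, loose)
```

Lowering `min_agreement` below the number of loci corresponds to requiring agreement among almost all, rather than all, of the reconstructed solutions.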

In the future, we plan to validate our approaches on other biological data sets and more realistic simulated populations (e.g., non-uniform allele distributions). In addition, we will modify our approaches for populations that contain partial self-reproduction and half-siblings by incorporating species-dependent constraints (field knowledge) into our models.

Acknowledgments

This research is supported by the following grants: National Science Foundation (NSF) IIS-0611998 and NSF CCF-0546574 (to the first author), NSF IIS-0612044 (to the third, fourth, and sixth authors), DBI-0543365 and IIS-0346973 (to the fourth author), Fulbright Scholarship (to the fifth author), and DIMACS special focus on Computational and Mathematical Epidemiology (to the fourth author). The authors are grateful to the people who have shared their data: Jeff Conner, Atlantic Salmon Federation, Dean Jerry, and Stuart Barker. The authors would also like to thank Anthony Almudevar, Bernie May, and Dmitri Konovalov for sharing their software.

References

Almudevar, A. 2003. A simulated annealing algorithm for maximum likelihood pedigree reconstruction. Theoret. Population Biol. 63(2) 63–75.

Almudevar, A., C. Field. 1999. Estimation of single-generation sibling relationships based on DNA markers. J. Agricultural, Biol., Environment. Statist. 4(2) 136–165.

Ashley, M. V., T. Y. Berger-Wolf, P. Berman, W. Chaovalitwongse, B. DasGupta, M.-Y. Kao. 2009. On approximating four covering and packing problems. J. Comput. System Sci. 75(5) 287–302.

Berger-Wolf, T. Y., B. DasGupta, W. Chaovalitwongse, M. V. Ashley. 2005. Combinatorial reconstruction of sibling relationships. Proc. 6th Internat. Sympos. Computational Biol. Genome Informatics (CBGI 05), Salt Lake City, UT, 1252–1255.

Berger-Wolf, T. Y., S. Sheikh, B. DasGupta, M. V. Ashley, I. C. Caballero, W. Chaovalitwongse, S. L. Putrevu. 2007. Reconstructing sibling relationships in wild populations. Bioinformatics 23(13) i49–i56.

Beyer, J., B. May. 2003. A graph-theoretic approach to the partition of individuals into full-sib families. Molecular Ecology 12(8) 2243–2250.

Blouin, M. S. 2003. DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecology Evolution 18(10) 503–511.

Bowler, P. J. 1989. The Mendelian Revolution: The Emergence of Hereditarian Concepts in Modern Science and Society. The Johns Hopkins University Press, Baltimore.

Butler, K., C. Field, C. M. Herbinger, B. R. Smith. 2004. Accuracy, efficiency and robustness of four algorithms allowing full sibship reconstruction from DNA marker data. Molecular Ecology 13(6) 1589–1600.

Chaovalitwongse, W., T. Y. Berger-Wolf, B. DasGupta, M. V. Ashley. 2007. Set covering approach for reconstruction of sibling relationships. Optim. Methods Software 22(1) 11–24.

Conner, J. K. 2005. Personal communication (December 8).

Eskin, E., E. Halperin, R. M. Karp. 2003. Efficient reconstruction of haplotype structure via perfect phylogeny. J. Bioinformatics Comput. Biol. 1(1) 1–20.

Gusfield, D. 2002. Partition-distance: A problem and class of perfect graphs arising in clustering. Inform. Processing Lett. 82(3) 159–164.

Herbinger, C., P. T. O’Reilly, R. W. Doyle, J. M. Wright, F. O’Flynn. 1999. Early growth performance of Atlantic salmon full-sib families reared in single family tanks or in mixed family tanks. Aquaculture 173(1–4) 105–116.

Jerry, D. R., B. S. Evans, M. Kenway, K. Wilson. 2006. Development of a microsatellite DNA parentage marker suite for black tiger shrimp Penaeus monodon. Aquaculture 255(1–4) 542–547.

Khuller, S., A. Moss, J. Naor. 1999. The budgeted maximum coverage problem. Inform. Processing Lett. 70(1) 39–45.

Konovalov, D. A., C. Manning, M. T. Henshaw. 2004. KINGROUP: A program for pedigree relationship reconstruction and kin group assignments using genetic markers. Molecular Ecology Notes 4(4) 779–782.

Li, J., T. Jiang. 2003. Efficient inference of haplotypes from genotype on a pedigree. J. Bioinformatics Comput. Biol. 1(1) 41–69.

Mendel, G. 1866. Versuche über Pflanzen-Hybriden. Verhandlungen des Naturforschenden Vereins in Brünn, Bd. IV für das Jahr 1865, 3–47. [Translated as Experiments in plant hybridisation (J. Roy Horticultural Soc. 26 1–32, 1901)].

Painter, I. 1997. Sibship reconstruction without parental information. J. Agricultural, Biol., Environment. Statist. 2 212–229.

Queller, D. C., J. E. Strassmann, C. R. Hughes. 1993. Microsatellites and kinship. Trends Ecology Evolution 8 285–288.

Sheikh, S. I., T. Y. Berger-Wolf, M. V. Ashley, I. C. Caballero, W. Chaovalitwongse, B. DasGupta. 2008. Error-tolerant sibship reconstruction in wild populations. Proc. 7th Ann. Internat. Conf. Computational Systems Bioinformatics, Stanford, CA, 273–284.

Smith, B. R., C. M. Herbinger, H. R. Merry. 2001. Accurate partition of individuals into full-sib families from genetic data without parental information. Genetics 158(3) 1329–1338.

Thomas, S. C., W. G. Hill. 2002. Sibship reconstruction in hierarchical population structures using Markov chain Monte Carlo techniques. Genetic Res. 79(3) 227–234.

Vazirani, V. V. 2001. Approximation Algorithms. Springer-Verlag, New York.

Wang, J. 2004. Sibship reconstruction from genetic data with typing errors. Genetics 166(4) 1968–1979.

Wilson, A. C. C., P. Sunnucks, J. S. F. Barker. 2002. Isolation and characterization of 20 polymorphic microsatellite loci for Scaptodrosophila hibisci. Molecular Ecology Notes 2(3) 242–244.
