# Reconstructing sibling relationships in wild populations.

**ABSTRACT** Reconstruction of sibling relationships from genetic data is an important component of many biological applications. In particular, the growing application of molecular markers (microsatellites) to study wild populations of plants and animals has created the need for new computational methods of establishing pedigree relationships, such as sibgroups, among individuals in these populations. Most current methods for sibship reconstruction from microsatellite data use statistical and heuristic techniques that rely on a priori knowledge about various parameter distributions. Moreover, these methods are designed for data with a large number of sampled loci and small family groups, both of which typically do not hold for wild populations. We present a deterministic technique that parsimoniously reconstructs sibling groups using only Mendelian laws of inheritance. We validate our approach using both simulated and real biological data and compare it to other methods. Our method is highly accurate on real data and compares favorably with other methods on simulated data with few loci and large family groups. It is the only method that does not rely on a priori knowledge about the population under study. Thus, our method is particularly appropriate for reconstructing sibling groups in wild populations.


BIOINFORMATICS, Vol. 00 no. 00 2007, Pages 1–9


Tanya Y. Berger-Wolfᵃ,∗, Saad I. Sheikhᵃ, Bhaskar DasGuptaᵃ, Mary V. Ashleyᵇ, Isabel C. Caballeroᵇ, Wanpracha Chaovalitwongseᶜ, S. Lahari Putrevuᵃ

ᵃDepartment of Computer Science, ᵇDepartment of Biological Sciences, University of Illinois at Chicago, Chicago, IL 60607, {tanyabw,ssheik3,bdasgup,ashley,icabal2,sputre2}@uic.edu

ᶜDepartment of Industrial Engineering, Rutgers University, Piscataway, NJ 08855, wchaoval@rci.rutgers.edu


∗To whom correspondence should be addressed.

## 1 INTRODUCTION

For wild populations, the growing development and application of molecular markers provides new possibilities for establishing kinship and reconstructing pedigrees in species where such information cannot be obtained from field observations alone. Knowledge of kinship in wild or experimental populations of non-model organisms allows the investigation of many fundamental biological phenomena, including mating systems, selection and adaptation, kin selection, and dispersal patterns. The power and potential of the genotypic information obtained in these studies often rests in our ability to reconstruct genealogical relationships among individuals (Garant and Kruuk, 2005). These relationships include parentage, full and half-sibships, and higher order aspects of pedigrees (Blouin, 2003; Butler et al., 2004; Jones and Ardren, 2003). In this paper we are only concerned with full sibling relationships.

While there are several potential molecular markers that could be applied to pedigree reconstruction, microsatellites (also known as SSRs, STRs, SSLPs, and VNTRs) are the most widely used marker and offer several advantages. Unlike dominant markers such as AFLPs and ISSRs, microsatellite alleles are codominant, so inference of genotypes and allele frequencies at each locus is straightforward. Development of SNPs is more difficult and expensive than microsatellite development for species not subject to large-scale genome projects. More importantly, the power to identify related individuals depends mainly on the number of alleles per locus and their heterozygosity, and microsatellites are clearly superior to other markers in both regards, with 5-20 alleles and heterozygosities of > 0.700 being typical, as reported in many wild populations. Finally, many field studies wish to estimate population parameters as well as individual relationships, so development and application of microsatellites is the best investment of resources for accomplishing such multiple goals.

Because of these advantages of microsatellites over other markers, together with their current widespread use, we focus our development of sibship reconstruction methods on unlinked, multi-allelic, codominantly inherited markers, as these features describe microsatellite markers. Generally, phase or haplotype information is not available for microsatellite loci in non-model organisms.

While several methods for sibling reconstruction from multi-allelic microsatellite data have been proposed (Almudevar and Field, 1999; Almudevar, 2003; Beyer and May, 2003; Konovalov et al., 2004; Painter, 1997; Smith et al., 2001; Thomas and Hill, 2002; Wang, 2004), most have not been 'ground-truthed' (but see Butler et al., 2004) and have received relatively limited application. The majority of the kinship and pedigree reconstruction methods rely on knowledge about typical allele distribution and frequency, family sizes, etc., and use statistical likelihood models to infer genealogical relationships (Blouin, 2003). We build on our earlier work (Berger-Wolf et al., 2005; Chaovalitwongse et al., 2006) and propose a new algorithm for sibship reconstruction using combinatorial optimization. There have been no truly combinatorial methods for kinship reconstruction problems (Almudevar and Field, 1999; Beyer and May, 2003). Combinatorial methods have been very successful in closely related molecular genetics questions, such as haplotype reconstruction (Eskin et al., 2003; Li and Jiang, 2003).

Our approach uses the simple Mendelian inheritance rules to impose constraints on the genetic content possibilities of a sibling group. We formulate the inferred combinatorial constraints and, under the parsimony assumption, use a provably correct algorithm to construct the smallest number of groups of individuals that satisfy these constraints. We test our approach on both simulated and real biological data.

© Oxford University Press 2007.

## 2 METHODS

### 2.1 Microsatellite Genetic Markers

Microsatellites, also known as Short Tandem Repeats (STR), Simple Tandem Repeats (STR), Simple Sequence Repeats (SSR), Simple Sequence Length Polymorphisms (SSLP), or Variable Number of Tandem Repeats (VNTR), are short sequences of repeated DNA (typically two to four base-pairs). Different individual organisms can have microsatellites with different numbers of repeats at the same locus (part of DNA). In fact, this variability is what makes microsatellites so useful for genetic analysis. In diploid organisms an individual has two copies of each microsatellite sequence, called alleles, one from the mother and one from the father. The two copies may differ in the number of repeats of the same segment, depending on the parental DNA. For example, if the mother has “CA” repeated 8 times and 12 times, and the father has 10 and 13 repeats, then the offspring may have 12 and 10 “CA” repeats at that locus.

Finding each new microsatellite locus is time and resource consuming. Thus, microsatellite markers for non-model species typically consist of very few loci, 2–20. Yet, once a locus is identified and the specific PCR primers are designed, screening each individual is relatively quick and cheap. Together with the high variability (high number of alleles per locus), this makes microsatellites the marker of choice for genetic research of wild populations.

### 2.2 Sibling Reconstruction Problem Statement

The main focus of our paper is to design a method that accurately reconstructs sibling groups from microsatellite data of a single generation. We now define the sibling reconstruction problem more formally. Given a genetic (microsatellite) sample at $l$ loci from a population $U$ of $n$ diploid individuals of the same generation, the goal is to reconstruct the full sibling groups (groups of individuals with the same parents). We assume no knowledge of parental information.

$$U = \{X_1, \ldots, X_n\}, \quad \text{where } X_i = (\langle a_{i1}, b_{i1} \rangle, \ldots, \langle a_{il}, b_{il} \rangle)$$

and $a_{ij}$ and $b_{ij}$ are the two alleles of the individual $i$ at locus $j$. The goal is to find a partition of individuals $P_1, \ldots, P_m$ such that

$$\forall\, 1 \le k \le m,\ \forall X_u, X_v \in P_k:\ \mathrm{Parents}(X_u) = \mathrm{Parents}(X_v)$$

Notice here that we have not defined the function Parents(x). This is a biological objective. We will discuss computational approaches to achieve a good estimate of the biological sibling relationship.

### 2.3 2-Allele and 4-Allele Properties

Mendelian genetics lays down a very simple rule for inheritance in diploid organisms: an offspring inherits one allele from each of its parents for each locus. This introduces two overlapping necessary (but not sufficient) constraints on full sibling groups: the 4-allele property and the 2-allele property (Berger-Wolf et al., 2005).

**4-Allele Property:** The total number of distinct alleles occurring at any locus may not exceed 4.

Formally, a set $S \subseteq U$ has the 4-allele property if

$$\forall\, 1 \le j \le l:\ \left| \bigcup_{i \in S} \{a_{ij}, b_{ij}\} \right| \le 4.$$

Clearly, the 4-allele property is necessary since a group of siblings can inherit only combinations of the 4 alleles of their common parents. The 4-allele property is effective for identifying sibling groups where the data are mostly heterozygous and the parent individuals share few common alleles. Generally, as in Table 1, a set consisting of any two individuals satisfies the 4-allele property. The set of individuals 1, 3 and 4 from Table 1 satisfies the 4-allele property. However, the set of individuals 2, 3 and 5 fails to satisfy it as the alleles occurring at the first locus are {12, 31, 56, 44, 51}.
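As a minimal sketch of the 4-allele check, the following Python snippet encodes the Table 1 genotypes and tests a group of individuals locus by locus. The dictionary layout and the name `has_4_allele_property` are illustrative assumptions, not part of the original method description.

```python
# Table 1 genotypes: individual -> list of (a, b) allele pairs, one pair per locus.
# This data layout is an assumption made for illustration.
TABLE_1 = {
    1: [(44, 44), (55, 23)],
    2: [(12, 56), (14, 31)],
    3: [(31, 44), (55, 14)],
    4: [(13, 13), (31, 23)],
    5: [(31, 51), (14, 31)],
}


def has_4_allele_property(group, genotypes):
    """Return True if no locus shows more than 4 distinct alleles in the group."""
    n_loci = len(next(iter(genotypes.values())))
    for j in range(n_loci):
        alleles = set()
        for i in group:
            alleles.update(genotypes[i][j])
        if len(alleles) > 4:
            return False
    return True


print(has_4_allele_property({1, 3, 4}, TABLE_1))
print(has_4_allele_property({2, 3, 5}, TABLE_1))
```

The first call corresponds to the group of individuals 1, 3 and 4 discussed in the text, and the second to the group 2, 3 and 5, which shows five distinct alleles at the first locus.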

**2-Allele Property:** There exists an assignment of individual alleles within a locus to maternal and paternal such that the number of distinct alleles assigned to each parent at this locus does not exceed 2.

Formally, a set $S \subseteq U$ has the 2-allele property if for each $X_i$ in each locus there exists an assignment of $a_{ij} = c_{ij}$ or $b_{ij} = c_{ij}$ (and the other allele assigned to $\bar{c}_{ij}$) such that

$$\forall\, 1 \le j \le l:\ \left| \bigcup_{i \in S} \{c_{ij}\} \right| \le 2 \quad \text{and} \quad \left| \bigcup_{i \in S} \{\bar{c}_{ij}\} \right| \le 2$$

The 2-allele property is clearly stricter than the 4-allele property. In Table 1, our previous 4-allele set of individuals 1, 3 and 4 fails to satisfy the stricter 2-allele property, as more than two alleles appear on the left side at locus 1: {44, 31, 13}. Moreover, there is no swapping of alleles that will bring down the number of alleles on each side to two: the 1st and 4th individuals with alleles 44/44 and 13/13 already fill the capacity.

The 2-allele property takes into account the fact that the parents can contribute only two alleles each to their offspring. Note that the 2-allele property is, again, a necessary but not a sufficient constraint for a group of individuals to be siblings. Notice, also, that any two individuals necessarily satisfy the 2-allele property as well, since by default the number of alleles on each side of any locus is at most two.

Table 1. An example of input data for the sibling reconstruction problem. The five individuals have been sampled at two genetic loci. Each allele is represented by a number. Same numbers represent the same alleles.

| Individual | Alleles (a/b) at locus 1 | Alleles (a/b) at locus 2 |
| --- | --- | --- |
| Radish 1 | 44/44 | 55/23 |
| Radish 2 | 12/56 | 14/31 |
| Radish 3 | 31/44 | 55/14 |
| Radish 4 | 13/13 | 31/23 |
| Radish 5 | 31/51 | 14/31 |
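To make the 2-allele definition concrete, here is a brute-force sketch that tries every maternal/paternal orientation of the genotypes at a locus. The function names and data layout are assumptions for illustration; the search is exponential in the group size and is not the procedure used by the authors.

```python
from itertools import product

# Table 1 genotypes: individual -> list of (a, b) allele pairs, one pair per locus.
TABLE_1 = {
    1: [(44, 44), (55, 23)],
    2: [(12, 56), (14, 31)],
    3: [(31, 44), (55, 14)],
    4: [(13, 13), (31, 23)],
    5: [(31, 51), (14, 31)],
}


def locus_has_2_allele_property(pairs):
    """Try every way of orienting each genotype as (maternal, paternal)."""
    for flips in product((False, True), repeat=len(pairs)):
        maternal, paternal = set(), set()
        for (a, b), flip in zip(pairs, flips):
            if flip:
                a, b = b, a
            maternal.add(a)
            paternal.add(b)
        # Each parent may contribute at most two distinct alleles.
        if len(maternal) <= 2 and len(paternal) <= 2:
            return True
    return False


def has_2_allele_property(group, genotypes):
    """The group must pass the orientation test at every locus."""
    n_loci = len(next(iter(genotypes.values())))
    return all(
        locus_has_2_allele_property([genotypes[i][j] for i in group])
        for j in range(n_loci)
    )


print(has_2_allele_property({1, 3, 4}, TABLE_1))
print(has_2_allele_property({1, 3, 5}, TABLE_1))
```

The group of individuals 1, 3 and 4 is rejected at locus 1, as argued in the text, whereas individuals 1, 3 and 5 admit a valid orientation at both loci.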

The 2-allele property reduces the possible combinations of alleles at a locus in a group of siblings down to a few canonical options (modulo the numbering of the alleles). Assuming the alleles are numbered 1 through 4, Table 2 lists all different types of sibling groups possible with the 2-allele property. We do this by listing all possible pairs of parents whose alleles are among 1, 2, 3, and 4 and all the offspring they can produce. However, in any sibling group with a given set of parents only a subset of the offspring possibilities from the table may be present.

It is important to note that Table 2 gives an exhaustive list of canonical possibilities of allele combinations at a given locus in a group of siblings without violating the 2-allele property. Without loss of generality, we assume that the alleles at each locus are numbered 1 through 4. This is sufficient since, according to the 4-allele property, the number of alleles in any sibling group cannot exceed four. Further, there are 4! = 24 possible mappings of any four alleles onto numbers 1–4. However, we list only the canonical minimal options (parents' alleles being numbered sequentially). It is not hard to check that the list of parents is exhaustive. Hence, Table 2 presents an exhaustive canonical list of possible sibling groups. It is also easy to verify that the resulting sibling groups indeed conform to the 2-allele property.
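The offspring column of Table 2 can be regenerated mechanically from the seven canonical parent pairs. The sketch below does so in Python; it lists each offspring genotype once as an unordered pair, whereas the table lists ordered a/b pairs, and the names `CANONICAL_PARENTS` and `offspring` are assumptions for illustration.

```python
from itertools import product

# The seven canonical parent pairs of Table 2 (alleles numbered 1 through 4).
CANONICAL_PARENTS = [
    ((1, 2), (3, 4)),
    ((1, 2), (1, 3)),
    ((1, 2), (1, 2)),
    ((1, 1), (1, 1)),
    ((1, 1), (1, 2)),
    ((1, 1), (2, 3)),
    ((1, 1), (2, 2)),
]


def offspring(parent_1, parent_2):
    """Unordered genotypes obtained by taking one allele from each parent."""
    return sorted({tuple(sorted(g)) for g in product(parent_1, parent_2)})


for p1, p2 in CANONICAL_PARENTS:
    print(p1, p2, "->", offspring(p1, p2))
```

Applying all 24 relabelings of the alleles 1–4 to these patterns would give the extended version of the table that the implementation section refers to.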

### 2.4 Minimum 2-Allele Set Cover

As we have mentioned, the biological function Parents(x) cannot be defined mathematically. We model the objective of reconstructing the sibling relationships mathematically by assigning individuals parsimoniously into the smallest number of (possibly overlapping) groups that satisfy the necessary 2-allele constraint. Formally, recall that we are given a population $U$ of $n$ diploid individuals sampled at $l$ loci

$$U = \{X_1, \ldots, X_n\}, \quad \text{where } X_i = (\langle a_{i1}, b_{i1} \rangle, \ldots, \langle a_{il}, b_{il} \rangle)$$

and $a_{ij}$ and $b_{ij}$ are the two alleles of the individual $i$ at locus $j$. The goal of the MINIMUM 2-ALLELE SET COVER problem is to find the smallest number of subsets $S_1, \ldots, S_m$ such that each $S_i \subseteq U$ satisfies the 2-allele constraint and $\bigcup_i S_i = U$.

Table 2. Canonical possible combinations of parent alleles and all resulting offspring allele combinations at a single locus. Offspring are listed as allele a / allele b.

| Parents | Offspring (allele a / allele b) |
| --- | --- |
| (1 / 2) (3 / 4) | 1/3, 2/4, 1/4, 2/3, 3/1, 4/2, 4/1, 3/2 |
| (1 / 2) (1 / 3) | 1/3, 2/3, 1/1, 2/1, 3/1, 3/2, 1/1, 1/2 |
| (1 / 2) (1 / 2) | 1/1, 1/2, 2/1, 2/2 |
| (1 / 1) (1 / 1) | 1/1 |
| (1 / 1) (1 / 2) | 1/1, 1/2, 2/1 |
| (1 / 1) (2 / 3) | 1/2, 1/3, 2/1, 3/1 |
| (1 / 1) (2 / 2) | 1/2, 2/1 |

We conjecture that MINIMUM 2-ALLELE SET COVER is NP-complete. A simple corollary of the following theorem from Berger-Wolf et al., 2005 shows that it is in NP.

THEOREM 1 (Berger-Wolf et al., 2005). Let $R$ be the number of alleles that are homozygous or appear with 3 other distinct alleles in a given locus and $A$ be the total number of distinct alleles at a locus. Then a set of individuals satisfies the 2-allele property if and only if for every locus it satisfies the constraint

$$A + R \le 4$$

It is easy to see that given a set of individuals we can verify that it satisfies the 2-allele property in $O(nl)$ time using the constraint above. Thus, MINIMUM 2-ALLELE SET COVER is in NP.
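The constraint of Theorem 1 can be evaluated in a single pass over the genotypes at a locus. The sketch below is one possible reading of the theorem: it assumes that "appears with 3 other distinct alleles" means the allele co-occurs, in heterozygous genotypes, with at least three distinct partner alleles. The function name is hypothetical.

```python
def satisfies_theorem_1(pairs):
    """Check A + R <= 4 for the (a, b) genotypes observed at one locus.

    A: number of distinct alleles at the locus.
    R: number of alleles that are homozygous in some individual or that
       co-occur with three or more distinct partner alleles (an assumed
       reading of the theorem statement).
    """
    alleles = set()
    homozygous = set()
    partners = {}
    for a, b in pairs:
        alleles.update((a, b))
        if a == b:
            homozygous.add(a)
        else:
            partners.setdefault(a, set()).add(b)
            partners.setdefault(b, set()).add(a)
    r = sum(
        1
        for x in alleles
        if x in homozygous or len(partners.get(x, ())) >= 3
    )
    return len(alleles) + r <= 4


# Locus 1 of Table 1 for individuals 1, 3, 4: A = 3, R = 2, so the test fails.
print(satisfies_theorem_1([(44, 44), (31, 44), (13, 13)]))
# Locus 1 of Table 1 for individuals 1, 3, 5: A = 3, R = 1, so the test passes.
print(satisfies_theorem_1([(44, 44), (31, 44), (31, 51)]))
```

Running the check once per locus gives the linear-time verification that the text relies on for membership in NP.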

Since MINIMUM 2-ALLELE SET COVER is likely to be NP-hard, one approach is to design approximation algorithms or heuristics that will produce suboptimal solutions. Instead, we use the commercial MIP solver CPLEX¹ to solve the problem to optimality.

### 2.5 Minimum 2-Allele Set Cover Algorithm

We now present our algorithm for solving the sibling reconstruction problem abstracted as the MINIMUM 2-ALLELE SET COVER. Our algorithm uses the 2-allele and 4-allele properties (specifically, Table 2) to generate all maximal potential sibling sets. We then restate the problem as a MINIMUM SET COVER to find the minimum number of sibling sets containing all the individuals. Thus, the algorithm has two steps:

1. Create potential sibling sets based on the 2-allele property for each locus and maximally assign individuals to each set without violating the 2-allele property in any locus.
2. Use minimum set cover to find the minimum number of the 2-allele sets from step 1 whose union contains all the individuals.

We now explain the algorithm in more detail. In step 1, we build on the approach presented in Berger-Wolf et al., 2005 and Chaovalitwongse et al., 2006 by generating sets that satisfy the 2-allele property. In the implementation of the algorithm we use the complete version of Table 2, with all 24 possible mappings of alleles to numbers 1–4, to generate all maximal possible sets. Since the list is exhaustive, if a set does not match one of the patterns in Table 2 under some mapping of its alleles onto numbers 1–4, it cannot possibly be a sibling group. During both steps of our algorithm we maintain an index or lookup of all sets to ensure there are no duplications.

#### 2.5.1 Algorithm 2-allele

Recall that any pair of individuals necessarily satisfies the 2-allele property. Thus, initially we use all $\binom{n}{2}$ pairs of $n$ individuals to generate the candidate sets. Each set is generated using the initial possible canonical sets from Table 2 for each locus $j$. Each allele is assigned a number between 1 and 4 based on the order of its occurrence. Then, for each pair of individual alleles we search for all matching canonical sets in Table 2 to determine the set of possibilities, PossibilitiesSet.

After generating these initial sets based on pairs of individuals, the algorithm repeatedly iterates through all the individuals, testing each set for a possible assignment of the individual to the set. In each cycle of the iterations, only the sets that were present at the beginning of the cycle are considered for each individual. An individual is assigned to a set if its alleles match the possibilities of the set as defined by the extended Table 2.

¹CPLEX is a registered trademark of ILOG.

However, adding an individual to a potential sibling set may reduce the set of the matching canonical patterns. For example, adding an individual with alleles 3/1 to a set of two individuals with alleles 1/2 and 2/1 changes the potential set of parents from {(1,1)(1,2); (1,1)(2,2); (1,2)(1,2); (1,1)(2,3); (1,2)(1,3); (1,2)(3,4)} to just {(1,1)(2,3); (1,2)(1,3); (1,2)(3,4)}. Thus, when adding a new individual to a set, we check if a new valid set can be created to accommodate all of the individuals already assigned to the set as well as the new individual. The validity of the new set is determined by the 4-allele property and the extended Table 2. The alleles at every locus of the new individual must match at least one of the canonical patterns that collectively satisfy all the previous individuals assigned to the set. Once we determine that the set can be expanded (and its set of possible matching parents reduced) to accommodate the new individual in a valid way, we create a modified copy of the set. The individual is then checked against this new set for all the remaining loci. After we have verified that the new individual does not violate the 2-allele property of the new set at every locus, as explained above, and that the set does not already exist, we add the set to the collection of potential sibling sets. However, for the remainder of the iteration cycle all the individuals are checked only against the sets that had been present at the beginning of the cycle. This ordering ensures that each individual is checked against each set exactly once.

We repeat this process, cycling through all the individuals in the population. Once a set present at the beginning of the cycle has been inspected against all the individuals, the set is marked as done and is not revisited. This ensures that all sibling pairs that could possibly occur are evaluated, and that no sibling sets are generated that never occur in the data.

The cycles of iterations over the individuals continue until all sets are marked as done. As the last step, a singleton set containing just that element is added for each of the elements, to ensure that a family group containing one offspring is possible.

After all the potential sibling sets are generated we apply the minimum set cover to find the minimum number of sibling groups whose union contains all the individuals.
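The set-generation step described above can be sketched as follows for the Table 1 data. This is a simplified illustration, not the authors' implementation: it replaces the lookup in the extended Table 2 with a brute-force 2-allele test, and all names (`GENOTYPES`, `two_allele_ok`, `candidate_sibling_sets`, `maximal_only`) are hypothetical.

```python
from itertools import combinations, product

# Table 1 genotypes: individual -> (a, b) allele pair per locus.
GENOTYPES = {
    1: [(44, 44), (55, 23)],
    2: [(12, 56), (14, 31)],
    3: [(31, 44), (55, 14)],
    4: [(13, 13), (31, 23)],
    5: [(31, 51), (14, 31)],
}
N_LOCI = 2


def two_allele_ok(group):
    """Brute-force stand-in for matching a set against the extended Table 2."""
    for j in range(N_LOCI):
        pairs = [GENOTYPES[i][j] for i in group]
        feasible = False
        for flips in product((0, 1), repeat=len(pairs)):
            maternal = {p[f] for p, f in zip(pairs, flips)}
            paternal = {p[1 - f] for p, f in zip(pairs, flips)}
            if len(maternal) <= 2 and len(paternal) <= 2:
                feasible = True
                break
        if not feasible:
            return False
    return True


def candidate_sibling_sets():
    individuals = sorted(GENOTYPES)
    # Every pair satisfies the 2-allele property, so pairs seed the search.
    frontier = {frozenset(p) for p in combinations(individuals, 2)}
    seen = set(frontier)  # lookup that prevents duplicate sets
    while frontier:
        grown = set()
        for group in frontier:  # only sets present at the start of the cycle
            for i in individuals:
                bigger = group | {i}
                if i in group or bigger in seen:
                    continue
                if two_allele_ok(bigger):
                    seen.add(bigger)
                    grown.add(bigger)
        frontier = grown  # sets from earlier cycles are done
    # Singleton sets allow families with a single sampled offspring.
    seen.update(frozenset([i]) for i in individuals)
    return seen


def maximal_only(sets):
    """Keep the sets that are not strictly contained in another set."""
    return {s for s in sets if not any(s < t for t in sets)}


all_sets = candidate_sibling_sets()
print(sorted(sorted(s) for s in maximal_only(all_sets)))
```

Each cycle only extends sets that existed at its start, and every newly created set is strictly larger than the set it was grown from, which mirrors the termination argument given in the text.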

#### 2.5.2 Proof of Correctness and Termination

First, we note that the algorithm terminates since the sets newly added in each iteration cycle are always bigger than the sets present at the beginning of the iteration cycle and each individual can occur at most once in a set.

We already showed that Table 2 exhaustively lists all the canonical possibilities of sibling groups (modulo the mapping of alleles to the numbers 1–4). We show that our algorithm produces all the sibling groups that conform to the listings in this table, and no sibling group is generated that does not satisfy one of the canonical possibilities.

THEOREM 2. Algorithm 2-allele produces all and only the possible 2-allele groups that are supported by the data.

Proof. As we have stated before, all possible pairs of individuals create minimal (non-singleton) valid sibling groups and must correspond to at least one of the entries in Table 2 by default. The algorithm then exhaustively compares every individual against every such possible sibling set and generates new sets as necessary if the 2-allele property is not violated. Thus, every combination of individuals that can be siblings will be generated. Suppose, to the contrary, there exists a valid maximal sibling group $S$ that has never been generated and consider the smallest such group. Let $X_i$ be the individual with the highest index $i$ in this group. When we remove the individual $X_i$ from the population, all the individuals that could be siblings before can still be siblings. Thus, $S - X_i$ is still a valid sibling group and, by the inductive hypothesis, it must have been generated by the algorithm. We examine $X_i$ against the group $S - X_i$. Adding $X_i$ does not violate the 2-allele property (since $S$ is a sibling group) and therefore there exists a corresponding canonical set in Table 2 that contains $S$. Thus, we would add the corresponding possible set if it was not already among the sets.

Since we check every sibling group at all loci before adding it to the collection of potential sets, we ensure that we never generate a set that does not satisfy the 2-allele property at every locus.

After all possible sibling groups are generated we use the minimum set cover approach to find the smallest number of sibling groups whose union contains all the individuals. While the minimum set cover problem is NP-complete, modern mixed integer program solvers can solve it to optimality in most instances. Thus, it is not meaningful to discuss the theoretical computational complexity of the algorithm.

#### 2.5.3 Minimum Set Cover

The minimum set cover problem is a classical NP-complete problem (Karp, 1972). Minimum Set Cover is defined as follows: given a universe $U$ of elements $X_1, \ldots, X_n$ and a collection $S$ of subsets of $U$, the goal is to find the minimum collection of subsets $C \subseteq S$ whose union is the entire universe $U$.

Formally, given $U = \{X_1, X_2, \ldots, X_n\}$ and $S = \{S_1, S_2, \ldots, S_m\}$, find

$$\min |C| \quad \text{s.t.} \quad C \subseteq S \ \text{ and } \bigcup_{S_i \in C} S_i = U$$

Set cover cannot be approximated in polynomial time to within a factor of $(1 - \epsilon) \ln n$ unless $NP \subseteq DTIME(n^{\log \log n})$ (Feige, 1998). Johnson introduced a $1 + \ln n$ approximation in 1974 (Johnson, 1974).

In order to solve set cover we use standard integer programming solvers. The integer program formulation of the set cover problem is as follows: given a matrix $A$ with entries

$$a_{ij} = \begin{cases} 1 & \text{if } X_i \in S_j \\ 0 & \text{otherwise} \end{cases}$$

the set cover problem is

$$\min \sum_{j=1}^{m} x_j \quad \text{s.t.} \quad Ax \ge 1, \quad x_j \in \{0, 1\}$$

where $x_j = 1$ exactly when the subset $S_j$ is selected.
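For very small instances the minimum set cover can be found by exhaustive search, which makes the objective easy to inspect. The sketch below is such a search and stands in for the integer programming solver used in the paper; the function name and the small collection of candidate sets are assumptions chosen for illustration.

```python
from itertools import combinations


def minimum_set_cover(universe, subsets):
    """Return a smallest sub-collection of `subsets` whose union is `universe`.

    Tries collections of size 1, 2, ... in turn, so the running time is
    exponential; this is only practical for toy inputs.
    """
    universe = frozenset(universe)
    for k in range(1, len(subsets) + 1):
        for chosen in combinations(subsets, k):
            if frozenset().union(*chosen) == universe:
                return list(chosen)
    return None  # the subsets do not cover the universe


# Hypothetical candidate sibling sets over five individuals.
candidates = [
    frozenset({1, 3, 5}),
    frozenset({1, 2}),
    frozenset({1, 4}),
    frozenset({2, 3}),
    frozenset({2, 4}),
    frozenset({2, 5}),
    frozenset({3, 4}),
    frozenset({4, 5}),
]
cover = minimum_set_cover({1, 2, 3, 4, 5}, candidates)
print([sorted(s) for s in cover])
```

In the integer program above, each candidate set corresponds to one binary variable, and the search simply enumerates assignments with an increasing number of selected sets.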

## 3 EVALUATION AND RESULTS

To validate and assess the accuracy of our approach we use datasets with known genetics and genealogy. However, such biological datasets containing no errors are rare. In addition, we create simulated sets using a large number of parameters over a wide range of values. In each instance we compare our algorithm to other methods for sibship reconstruction.

We measure the error by comparing the known sibling sets with those generated by our algorithm, and calculating the minimum partition distance (Gusfield, 2002). The error is the percentage of individuals that would need to be removed to make the reconstructed sibling sets equal to the true sibling sets. Note that we are computing the error in terms of individuals, not in terms of the number of sibling groups reconstructed incorrectly. Thus, the accuracy is the percentage of individuals correctly assigned to sibling groups.

The experiments were run on a combination of a cluster of 64 mixed AMD and Intel Xeon nodes with 2.8 GHz and 3.0 GHz processors and a single Intel Pentium D dual-core 3.2 GHz machine with 4 GB of RAM.
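The error measure described in this section, the number of individuals that must be removed to make two partitions identical, can be computed by matching blocks of the two partitions so that the total overlap is maximal. The sketch below does this by trying every block matching, which is only feasible for a handful of groups; an assignment-problem solver would be the usual choice for real data. The function name and the example partitions are assumptions.

```python
from itertools import permutations


def partition_distance(true_groups, found_groups):
    """Minimum number of individuals to remove so both partitions coincide."""
    a = [set(g) for g in true_groups]
    b = [set(g) for g in found_groups]
    n = sum(len(g) for g in a)
    # Pad the shorter partition with empty blocks for a one-to-one matching.
    size = max(len(a), len(b))
    a += [set()] * (size - len(a))
    b += [set()] * (size - len(b))
    best_overlap = max(
        sum(len(a[i] & b[j]) for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return n - best_overlap


true_groups = [{1, 3, 5}, {2, 4}]
found_groups = [{1, 3}, {2, 4, 5}]
distance = partition_distance(true_groups, found_groups)
accuracy = 100.0 * (5 - distance) / 5
print(distance, accuracy)
```

In this hypothetical example, removing individual 5 makes the two partitions identical, so one of five individuals counts as an error.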

### 3.1 Sibship Reconstruction Methods

We compare the performance of our algorithm to three other sibship reconstruction methods. The methods span a variety of approaches and have different behavior on different parameters. We now describe the methods.

**Almudevar and Field.** Our algorithm is based on a very similar idea proposed by Almudevar and Field, 1999, which is a completely combinatorial approach. Here, too, potential sibling sets are constructed using the 2-allele property (although the authors do not explicitly state the property). However, these sets are constructed by exhaustively enumerating all combinations of individuals and testing those for compliance with the 2-allele property. At the end, a maximal, not necessarily optimal, collection of sibling sets is returned as a solution.

**Beyer and May.** The approach proposed in Beyer and May, 2003 is a mixture of likelihood and combinatorial techniques. The algorithm constructs a graph with individuals as nodes and the edges weighted by the pairwise likelihood ratio that the individuals are siblings versus being unrelated. Very light edges are ignored. Potential families are identified by the connected components in this graph.

**KinGroup.** Konovalov et al., 2004 have proposed an algorithm based entirely on likelihood estimates of partitions of individuals into sibling groups. The individuals are considered one at a time. For each individual, the likelihood of it being part of any existing sibling group, as well as starting its own group, is calculated. The individual is placed into the group to which it is most likely to belong. Unfortunately, the outcome heavily depends on the order in which the individuals are considered.
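The graph step in the Beyer and May description above (drop very light edges, then read families off the connected components) can be sketched as follows. The edge weights, the threshold value, and the function name are hypothetical; computing the actual pairwise likelihood ratios is outside the scope of this sketch.

```python
def families_from_edges(individuals, weighted_edges, threshold):
    """Group individuals by connected components of the thresholded graph."""
    adjacency = {i: set() for i in individuals}
    for u, v, weight in weighted_edges:
        if weight >= threshold:  # very light edges are ignored
            adjacency[u].add(v)
            adjacency[v].add(u)

    families, visited = [], set()
    for start in individuals:
        if start in visited:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        visited |= component
        families.append(component)
    return families


# Made-up likelihood ratios for five individuals, for illustration only.
edges = [(1, 3, 8.0), (3, 5, 5.5), (2, 4, 6.0), (1, 2, 0.1)]
print(families_from_edges([1, 2, 3, 4, 5], edges, threshold=1.0))
```

With these illustrative weights the light edge between individuals 1 and 2 is dropped, so the two groups stay separate.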

### 3.2 Biological Data

We have identified four biological datasets of microsatellite data where sibling groups are known. These are not wild populations since in wild populations we typically do not know the true sibling groups, which is precisely why we need the sibling reconstruction method.

Radishes. The wild radish Raphanus raphanistrum dataset

(Conner, 2006) consists of samples from 150 radishes from

two families with 17 sampled loci. There are missing alleles

among all the loci. The parent genotypes are available.

Salmon. The Atlantic Salmon Salmo salar dataset comes from

the genetic improvement program of the Atlantic Salmon

Federation (Herbinger et al., 1999). We use a truncated

sample of microsatellite genotypes of 351 individuals from

6 families with 4 loci per individual. The data does not have

missing alleles at any locus. This dataset is a subset of one

of the samples of genotyped individuals used by Almudevar

and Field, 1999 to illustrate their technique.

Shrimp. The tiger shrimp Penaeus monodon dataset (Jerry

et al., 2006) consists of 59 individuals from 13 families with

7 loci. There are 16 missing alleles. The parentage is known.

Flies. The Scaptodrosophila hibisci dataset (Wilson et al., 2002) consists of 190 same-generation individuals (flies) from 6 families sampled at varying numbers of loci with up to 8 alleles per locus. Parent genotypes were known. All individuals shared 2 sampled loci, which were chosen for our study. Some of the alleles were missing for some of the individuals.

Table 3 summarizes the results of the four algorithms on the

biological datasets.

Table 3. Accuracy (percent) of our algorithm and the three

reference algorithms on biological datasets. Here l is the number

of loci in a dataset and “Inds” column gives the number of

individuals in the dataset. The three reference algorithms are

Almudevar and Field, 1999 (A&F), Beyer and May, 2003

(B&M), and the KinGroup by Konovalov et al., 2004 (KG).

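| Dataset | l | Inds | Ours | A&F | B&M | KG |
|---|---|---|---|---|---|---|
| Shrimp | 7 | 59 | 77.97 | 67.80 | 77.97 | 77.97 |
| Salmon | 4 | 351 | 98.30 | Out of memory | 99.71 | 96.02 |
| Radishes | 5 | 531 | 75.90 | Out of memory | 53.30 | 29.95 |
| Flies | 2 | 190 | 100.00 | 31.05 | 27.89 | 54.73 |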

Almudevar and Field’s algorithm ran out of 4GB memory on the salmon

and radish datasets.

3.3 Random Simulations

In addition, we validate our approach using random simulations. We first create random diploid parents and then generate complete genetic data for offspring, varying the number of males, females, alleles, loci, number of families and number of offspring per family. We then use the 2-allele algorithm described above to reconstruct the sibling groups.

We compare our results to the actual known sibling groups

in the data to assess accuracy. We measure the error rate

of each algorithm using the Gusfield Partition Distance (Gusfield,

2002). In addition, we compare the accuracy of our 2-allele

algorithm to the two reference sibling reconstruction methods,

Beyer and May, 2003 and Konovalov et al., 2004, described

above. We repeat the entire process for each fixed combination

of parameter values 1000 times. We omit the comparison of

the results to the algorithm of Almudevar and Field, 1999

since the current version of provided software requires user

interaction and therefore it is infeasible to use it in the

automated simulation pipeline of 1000 iterations of over a

hundred combinations of parameter values.
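The partition distance of Gusfield (2002) is the smallest number of elements that must be removed so that two partitions become identical, and it reduces to an assignment problem on the overlaps between groups. The sketch below is one way to compute it in plain Python. The function names and the Hungarian-style solver are illustrative choices, not the code used in the study, and the conversion to a percentage is an assumption; it does agree with Table 3, where for instance the Shrimp value 77.97 corresponds to 46 of 59 individuals.

```python
def min_cost_assignment(cost):
    """Hungarian algorithm on a square cost matrix; returns the minimum total cost."""
    n = len(cost)
    INF = float("inf")
    u, v = [0] * (n + 1), [0] * (n + 1)     # row and column potentials
    p, way = [0] * (n + 1), [0] * (n + 1)   # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                            # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(cost[p[j] - 1][j - 1] for j in range(1, n + 1))

def partition_distance(part_a, part_b):
    """Minimum number of elements to delete so that the two partitions coincide."""
    a = [set(g) for g in part_a]
    b = [set(g) for g in part_b]
    n = sum(len(g) for g in a)
    k = max(len(a), len(b))
    cost = [[0] * k for _ in range(k)]       # padded with empty groups
    for i, ga in enumerate(a):
        for j, gb in enumerate(b):
            cost[i][j] = -len(ga & gb)       # maximise the total overlap
    return n + min_cost_assignment(cost)

def accuracy_percent(true_groups, found_groups):
    """Assumed accuracy measure: share of individuals that need not be removed."""
    n = sum(len(g) for g in true_groups)
    return 100.0 * (n - partition_distance(true_groups, found_groups)) / n
```

For the partitions {1,2,3},{4,5} and {1,2},{3,4,5}, removing individual 3 makes them identical, so the distance is 1 and the assumed accuracy is 80 percent.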

First, we generate the parent generation of M males and F females, each with l loci and a specified number of alleles per locus a. We create populations with uniform as well as non-uniform allele distributions. After the parents are created, their offspring are generated by selecting f pairs of parents. A male and a female are chosen independently and uniformly at random from the parent population. For these parents a specified number of offspring o is generated. Here, too, we create populations with a uniform as well as a skewed family size distribution. Each offspring randomly receives one allele each from its mother and father at each locus. This is a rather simplistic approach; however, it is consistent with the genetics of known parents and provides a baseline for the accuracy of the algorithm, since biological data are generally not random and uniform.

The parameter ranges for the study are as follows:

• The number of adult females F and the number of adult males M were equal and set to 5, 10 or 15.

• The number of loci sampled: l = 2, 4, 6.

• The number of alleles per locus (for the uniform allele frequency distribution): a = 5, 10, 15.

• Non-uniform allele frequency distribution (for 4 alleles): 12 - 4 - 1 - 1, as in Almudevar, 2003.

• The number of families in the population: f = 2, 5, 10.

• The number of offspring per couple (for the uniform family size distribution): o = 2, 5, 10.

• Non-uniform family size distribution (for 5 families): 25 - 10 - 10 - 4 - 1, as in Almudevar, 2003.
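The generation procedure can be sketched as follows, reusing the symbols M, F, l, a, f and o from the text. The function name, the integer coding of alleles, the `seed` and `allele_weights` arguments, and the labelling of a family by its parent pair are assumptions made for this illustration; skewed family sizes are not covered.

```python
import random

def simulate_population(M, F, l, a, f, o, seed=None, allele_weights=None):
    """Return (offspring, family_labels) for f random parent pairs with o offspring each.

    Alleles are coded 0..a-1. allele_weights=None gives uniform frequencies;
    a list such as [12, 4, 1, 1] (with a=4) gives a skewed distribution.
    """
    rng = random.Random(seed)

    def random_allele():
        return rng.choices(range(a), weights=allele_weights)[0]

    def random_parent():
        return [(random_allele(), random_allele()) for _ in range(l)]

    males = [random_parent() for _ in range(M)]
    females = [random_parent() for _ in range(F)]

    offspring, labels = [], []
    for _ in range(f):
        # The father and the mother are drawn independently and uniformly.
        m_idx, f_idx = rng.randrange(M), rng.randrange(F)
        for _ in range(o):
            # One randomly chosen allele from each parent at every locus.
            child = [(rng.choice(males[m_idx][k]), rng.choice(females[f_idx][k]))
                     for k in range(l)]
            offspring.append(child)
            labels.append((m_idx, f_idx))   # same pair drawn twice => same family
    return offspring, labels
```

A call such as `simulate_population(5, 5, 2, 5, 3, 4, seed=1)` yields 12 offspring genotyped at 2 loci, together with the true family label of each, which can then be compared against a reconstructed partition.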

All datasets were generated on a 64-node cluster running RedHat Linux 9.0. The 2-allele algorithm is used on the generated population to find the smallest number of 2-allele sets necessary to explain the juvenile population. We use the commercial MIP solver CPLEX 9.0 for Windows XP on a single processor machine to solve the minimum set cover problem to optimality. The reference algorithms were run on a single processor machine running Windows XP.²

² The difference in platforms and operating systems is dictated by the available software licenses and provided binary code.
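The minimum set cover step asks for the fewest candidate sibling sets whose union contains every individual. As an integer program this uses one binary variable per candidate set and one covering constraint per individual. The sketch below is not the CPLEX formulation used in the study; it is a brute-force stand-in, with hypothetical names, that is exact but practical only for very small instances.

```python
from itertools import combinations

def minimum_set_cover(universe, candidate_sets):
    """Indices of a smallest collection of candidate sets covering the universe.

    Tries collections of size 1, 2, ... and returns the first one that covers
    everything; returns None if no cover exists. Exponential in the worst case.
    """
    universe = set(universe)
    if not universe:
        return []
    for k in range(1, len(candidate_sets) + 1):
        for combo in combinations(range(len(candidate_sets)), k):
            covered = set().union(*(candidate_sets[i] for i in combo))
            if universe <= covered:
                return list(combo)
    return None
```

With individuals 1 to 6 and candidate sets {1,2,3}, {4,5}, {6}, {1,4}, {2,5,6}, {3}, no two sets cover everyone, and the first cover of size three found is the sets at indices 0, 1 and 2.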


We measure the reconstruction accuracy of the 2-allele algorithm as a function of the number of alleles per locus, family size (number of offspring), number of families (and polygamy), and the variation in allele frequency and family size distributions.

Figure 1 shows representative results for the accuracy of our 2-allele algorithm and the two reference algorithms on uniform allele frequency and family size distributions. Figure 2 shows results for the datasets with skewed family size and allele frequency distributions. Each bar represents the mean value of 1000 random repetitions and the error bars show the standard deviation.

4 DISCUSSION AND CONCLUSIONS

We have proposed a new fully combinatorial algorithm for the problem of reconstructing sibling relationships from single-generation microsatellite genetic data. We have implemented and tested our approach on both real biological and simulated data.

On biological data our algorithm performed as well as or better than other sibling reconstruction methods. The difference is particularly striking for the flies dataset with 2 loci. Our algorithm accurately reconstructed the sibling groups despite some missing alleles, while the other methods all have error rates over 45%. The radish dataset presented a problem for all methods since it has partial self-reproduction, which none of the methods take into account. Offspring of a selfed individual are hard to separate from their half-siblings produced by that and any other individual. Still, even on this dataset our method performed significantly better than the others.

The simulated data provide a baseline for the accuracy estimate of our algorithm, with real biological data likely to have better reconstruction accuracy, as indicated by the results on the biological datasets. For the datasets with uniform allele frequency and family size distributions, when the number of alleles per locus is above 5 and the number of offspring per family is above 5, the accuracy of our algorithm is above 50 percent in most cases, rapidly increasing as the number of offspring or alleles increases. Our algorithm performs significantly better than other methods when the number of loci is very small and there is reasonable diversity of alleles. In fact, the algorithm of Beyer and May, 2003 has high error rates specifically for those parameter values. Thus, our algorithm is particularly well suited for natural populations of plants and animals, with large family groups and few sampled loci.

However, we have conducted a very limited set of experiments on datasets with non-uniform parameter distributions. To obtain conclusive results, we need to explore a wider range of non-uniform distributions. To fully evaluate the performance of our algorithm, we need to validate the results on other biological datasets and more realistic simulated populations.

In addition, we have yet to address the possibility of errors in the data. The fact that our algorithm accurately reconstructed sibling groups on biological data with missing alleles is encouraging. However, our algorithm would need to be modified to address errors from mistyped and misidentified alleles.

It is impossible in the current setting of the experiments to accurately compare running times of the algorithms. However, the algorithm of Almudevar and Field, 1999, which uses exhaustive enumeration of all potential sibling sets, unsurprisingly ran out of memory on datasets of more than 200 individuals. It took on the order of hours to complete on the dataset of 190 individuals. Moreover, since the current implementation requires user interaction, the performance of the algorithm could not be evaluated in the random simulations. The likelihood approaches of Beyer and May, 2003 and Konovalov et al., 2004 are very fast. Both produced answers in a matter of seconds on datasets of 100 individuals and in less than 2 minutes on datasets of 500 individuals. For our algorithm, we first (in a matter of seconds) formulate the 2-allele set cover problem; this formulation is then imported into CPLEX and solved as a set cover problem. Recall that the set cover problem is NP-complete and that CPLEX is commercial software designed specifically to solve such computationally hard problems to optimality. The entire (automated) process takes about 2 hours on datasets of 500 individuals. At this point, our focus has been on evaluating the accuracy and viability of our approach. Now that our approach has proven viable, we will concentrate on improving the running time and the overall usability of our method.

The main advantage of the combinatorial approach is

its lack of reliance on a priori knowledge about various

population parameters, such as allele frequency and the

degree of inbreeding. However, Mendelian constraints do not

provide any basis for distinguishing between family groups

when the groups are small, or when individuals share many

common alleles. Additional information, such as relative

allele frequency within the sample can be easily added to

generate combinatorial constraints on the potential sibling sets.

Unlike likelihood methods, combinatorial approaches use that

information only for comparison purposes, and do not require a

background data model or an accurate estimate for any of

the parameters. Thus, we believe that combinatorial approaches

are particularly appropriate for analysis of natural animal and

plant populations where background information is difficult to

obtain.

Our technique can be extended to solve a number of related

problems. The first immediate variation is reconstructing

sibship relationships when partial information, such as one of

the parents, is available. We have already implemented and

applied the extended version of our algorithm to identifying

the minimum number of necessary male oak trees that have

pollinated a single female tree to produce the sampled acorns.

Our approach provided additional supporting evidence that oak

pollen disperses much further than previously thought.

Another simple variation of our algorithm produces half-

sibling groups as well as full sibs. In addition, our algorithm

can be used to identify the parsimonious set of parental

genotypes necessary to produce the sibling groups.


Fig. 1. Accuracy of the sibling group reconstruction using the 2-allele algorithm on randomly generated data. The y-axis shows the accuracy

of reconstruction as a function of various simulation parameters. The accuracy of our algorithm is shown, as well as that of the two reference

algorithms: Beyer and May, 2003 and Konovalov et al., 2004 (KinGroup). The title shows the values of the fixed parameters: the number of adult

males/females, number of families, the number of offspring per family, the number of loci, and the number of alleles per locus.

Fig. 2. Accuracy of the sibling group reconstruction using our 2-allele algorithm and the two reference methods on the datasets with skewed family

sizes and allele frequency distributions.


Finally, from the algorithmic perspective, there are a number

of alternatives that would improve the performance of our

method that we are currently exploring.

To conclude, we have presented a fully combinatorial

approach to reconstructing sibling groups from microsatellite

data. Our method does not rely on any a priori knowledge about

data parameters yet provides results with accuracy comparable

to or better than those of likelihood methods.

ACKNOWLEDGMENTS

This research is supported by the following grants: NSF IIS-0612044 and IIS-0611998 (Berger-Wolf, Ashley, Chaovalitwongse, DasGupta), Fulbright Scholarship (Sheikh), and NSF CCF-0546574 (Chaovalitwongse). We are grateful to the people who have shared their data with us: Jeff Conner, Atlantic Salmon Federation, Dean Jerry, and Stuart Barker. We would also like to thank Anthony Almudevar, Bernie May, and Dmitry Konovalov for sharing their software and the anonymous reviewers for very helpful comments.

REFERENCES

Almudevar, A. (2003). A simulated annealing algorithm for maximum likelihood pedigree reconstruction. Theoretical Population Biology, 63, 63–75.

Almudevar, A. and Field, C. (1999). Estimation of single generation sibling relationships based on DNA markers. Journal of Agricultural, Biological, and Environmental Statistics, 4, 136–165.

Berger-Wolf, T. Y., DasGupta, B., Chaovalitwongse, W., and Ashley, M. V. (2005). Combinatorial reconstruction of sibling relationships. In Proceedings of the 6th International Symposium on Computational Biology and Genome Informatics (CBGI 05), pages 1252–1255, Utah.

Beyer, J. and May, B. (2003). A graph-theoretic approach to the partition of individuals into full-sib families. Molecular Ecology, 12, 2243–2250.

Blouin, M. S. (2003). DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. TRENDS in Ecology and Evolution, 18(10), 503–511.

Butler, K., Field, C., Herbinger, C., and Smith, B. (2004). Accuracy, efficiency and robustness of four algorithms allowing full sibship reconstruction from DNA marker data. Molecular Ecology, 13, 1589–1600.

Chaovalitwongse, A., Berger-Wolf, T., Dasgupta, B., and Ashley, M. (2006). Set covering approach for reconstruction of sibling relationships. Optimization Methods and Software.

Conner, J. K. (2006). Personal communication.

Eskin, E., Halperin, E., and Karp, R. M. (2003). Efficient reconstruction of haplotype structure via perfect phylogeny. Journal of Bioinformatics and Computational Biology, 1(1), 1–20.

Feige, U. (1998). A threshold of ln n for approximating set cover. Journal of the ACM, 45, 634–652.

Garant, D. and Kruuk, L. E. B. (2005). How to use molecular marker data to measure evolutionary parameters in wild populations. Molecular Ecology, 14, 1843–1859.

Gusfield, D. (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82(3), 159–164.

Herbinger, C., O'Reilly, P. T., Doyle, R. W., Wright, J. M., and O'Flynn, F. (1999). Early growth performance of Atlantic salmon full-sib families reared in single family tanks or in mixed family tanks. Aquaculture, 173, 105–116.

Jerry, D. R., Evans, B. S., Kenway, M., and Wilson, K. (2006). Development of a microsatellite DNA parentage marker suite for black tiger shrimp Penaeus monodon. Aquaculture, 255(1–4), 542–547.

Johnson, D. S. (1974). Approximation algorithms for combinatorial problems. J. Comput. System Sci., 9, 256–278.

Jones, A. G. and Ardren, W. R. (2003). Methods of parentage analysis in natural populations. Molecular Ecology, 12, 2511–2523.

Karp, R. M. (1972). Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press.

Konovalov, D. A., Manning, C., and Henshaw, M. T. (2004). KINGROUP: a program for pedigree relationship reconstruction and kin group assignments using genetic markers. Molecular Ecology Notes. doi: 10.1111/j.1471-8286.2004.00796.x.

Li, J. and Jiang, T. (2003). Efficient inference of haplotypes from genotype on a pedigree. Journal of Bioinformatics and Computational Biology, 1(1), 41–69.

Painter, I. (1997). Sibship reconstruction without parental information. Journal of Agricultural, Biological, and Environmental Statistics, 2, 212–229.

Smith, B. R., Herbinger, C. M., and Merry, H. R. (2001). Accurate partition of individuals into full-sib families from genetic data without parental information. Genetics, 158, 1329–1338.

Thomas, C. S. and Hill, W. G. (2002). Sibship reconstruction in hierarchical population structures using Markov chain Monte Carlo techniques. Genet. Res., Camb., 79, 227–234.

Wang, J. (2004). Sibship reconstruction from genetic data with typing errors. Genetics, 166, 1968–1979.

Wilson, A., Sunnucks, P., and Barker, J. (2002). Isolation and characterization of 20 polymorphic microsatellite loci for Scaptodrosophila hibisci. Molecular Ecology Notes, 2, 242–244.
