BIOINFORMATICS
Vol. 00 no. 00 2007
Pages 1–9
Reconstructing Sibling Relationships in Wild Populations
Tanya Y. Berger-Wolf (a,∗), Saad I. Sheikh (a), Bhaskar DasGupta (a), Mary V. Ashley (b), Isabel C. Caballero (b), Wanpracha Chaovalitwongse (c), S. Lahari Putrevu (a)
(a) Department of Computer Science, (b) Department of Biological Sciences, University of Illinois at Chicago, Chicago, IL 60607
{tanyabw,ssheik3,bdasgup,ashley,icabal2,sputre2}@uic.edu
(c) Department of Industrial Engineering, Rutgers University, Piscataway, NJ 08855
wchaoval@rci.rutgers.edu
ABSTRACT
Reconstruction of sibling relationships from genetic data
is an important component of many biological applications.
In particular, the growing application of molecular markers
(microsatellites) to study wild populations of plants and animals
has created the need for new computational methods of
establishing pedigree relationships, such as sibgroups, among
individuals in these populations. Most current methods for
sibship reconstruction from microsatellite data use statistical
and heuristic techniques that rely on a priori knowledge about
various parameter distributions. Moreover, these methods are
designed for data with a large number of sampled loci and
small family groups, both of which typically do not hold for
wild populations. We present a deterministic technique that
parsimoniously reconstructs sibling groups using only Mendelian
laws of inheritance. We validate our approach using both
simulated and real biological data and compare it to other
methods. Our method is highly accurate on real data and
compares favorably with other methods on simulated data with
few loci and large family groups. It is the only method that does
not rely on a priori knowledge about the population under study.
Thus, our method is particularly appropriate for reconstructing
sibling groups in wild populations.
∗To whom correspondence should be addressed.
1 INTRODUCTION
For wild populations, the growing development and application of molecular markers provides new possibilities for establishing kinship and reconstructing pedigrees in species where such information cannot be obtained from field observations alone. Knowledge of kinship in wild or experimental populations of non-model organisms allows the investigation of many fundamental biological phenomena, including mating systems, selection and adaptation, kin selection, and dispersal patterns. The power and potential of the genotypic information obtained in these studies often rest in our ability to reconstruct genealogical relationships among individuals (Garant and Kruuk, 2005). These relationships include parentage, full and half-sibships, and higher order aspects of pedigrees (Blouin, 2003; Butler et al., 2004; Jones and Ardren, 2003). In this paper we are only concerned with full sibling relationships.
While there are several potential molecular markers that could be applied to pedigree reconstruction, microsatellites (also known as SSRs, STRs, SSLPs, and VNTRs) are the most widely used marker and offer several advantages. Unlike dominant markers such as AFLPs and ISSRs, microsatellite alleles are codominant, so inference of genotypes and allele frequencies at each locus is straightforward. Development of SNPs is more difficult and expensive than microsatellite development for species not subject to large-scale genome projects. More importantly, the power to identify related individuals depends mainly on the number of alleles per locus and their heterozygosity, and microsatellites are clearly superior to other markers in both regards, with 5-20 alleles and heterozygosities of > 0.700 being typical, as reported in many wild populations. Finally, many field studies wish to estimate population parameters as well as individual relationships, so development and application of microsatellites is the best investment of resources for accomplishing such multiple goals.
Because of these advantages of microsatellites over other markers, together with their current widespread use, we focus our development of sibship reconstruction methods on unlinked, multi-allelic, codominantly inherited markers, as these features describe microsatellite markers. Generally, phase or haplotype information is not available for microsatellite loci in non-model organisms.
While several methods for sibling reconstruction from multi-allelic microsatellite data have been proposed (Almudevar and Field, 1999; Almudevar, 2003; Beyer and May, 2003; Konovalov et al., 2004; Painter, 1997; Smith et al., 2001; Thomas and Hill, 2002; Wang, 2004), most have not been 'ground-truthed' (but see Butler et al., 2004) and have received relatively limited application. The majority of the kinship and pedigree reconstruction methods rely on knowledge about typical allele distribution and frequency, family sizes, etc. and use statistical likelihood models to infer genealogical relationships (Blouin, 2003). We build on our earlier work (Berger-Wolf et al., 2005; Chaovalitwongse et al., 2006) and propose a new algorithm for sibship reconstruction using combinatorial optimization. There have been no truly combinatorial methods for kinship reconstruction problems (Almudevar and Field, 1999; Beyer and May, 2003). Combinatorial methods have been very successful in closely related molecular genetics questions, such as haplotype reconstruction (Eskin et al., 2003; Li and Jiang, 2003).
© Oxford University Press 2007.
Our approach uses the simple Mendelian inheritance rules
to impose constraints on the genetic content possibilities of
a sibling group. We formulate the inferred combinatorial
constraints and, under the parsimony assumption, use a
provably correct algorithm to construct the smallest number of
groups of individuals that satisfy these constraints. We test our
approach on both simulated and real biological data.
2 METHODS
2.1 Microsatellite Genetic Markers
Microsatellites, also known as Short Tandem Repeats (STR), Simple Tandem Repeats (STR), Simple Sequence Repeats (SSR), Simple Sequence Length Polymorphisms (SSLP), or Variable Number of Tandem Repeats (VNTR), are short sequences of repeated DNA (typically two to four base pairs). Different individual organisms can have microsatellites with different numbers of repeats at the same locus (part of DNA). In fact, this variability is what makes microsatellites so useful for genetic analysis. In diploid organisms an individual will have two copies of each microsatellite sequence, called alleles, one from the mother and one from the father. The two copies may differ in the number of repeats of the same segment, depending on the parental DNA. For example, if the mother has "CA" repeated 8 times and 12 times, and the father has 10 and 13 repeats, then the offspring may have 12 and 10 "CA" repeats at that locus.
Finding each new microsatellite locus is time and resource consuming. Thus, microsatellite markers for non-model species typically consist of very few (2 to 20) loci. Yet, once a locus is identified and the specific PCR primers are designed, screening each individual is relatively quick and cheap. Together with the high variability (high number of alleles per locus), this makes microsatellites the marker of choice for genetic research of wild populations.
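The "CA" repeat example above can be made concrete with a minimal Python sketch. The function name `possible_offspring` and the encoding of an allele as its repeat count are our assumptions for illustration only; the sketch simply enumerates every genotype a child can inherit at one locus when it receives one allele from each parent.

```python
from itertools import product


def possible_offspring(mother, father):
    """Return the unordered genotypes a child can inherit at one locus.

    Each parent is a pair of alleles (here: repeat counts). The child
    receives exactly one allele from the mother and one from the father.
    """
    return {tuple(sorted((m, f))) for m, f in product(mother, father)}


mother = (8, 12)   # "CA" repeated 8 and 12 times
father = (10, 13)  # "CA" repeated 10 and 13 times
children = possible_offspring(mother, father)
print(sorted(children))
```

The genotype with 12 and 10 repeats mentioned in the text is one of the four possibilities produced for these parents.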
2.2 Sibling Reconstruction Problem Statement
The main focus of our paper is to design a method that accurately reconstructs sibling groups from microsatellite data of a single generation. We now define the sibling reconstruction problem more formally. Given a genetic (microsatellite) sample at $l$ loci from a population $U$ of $n$ diploid individuals of the same generation, the goal is to reconstruct the full sibling groups (groups of individuals with the same parents). We assume no knowledge of parental information. Let
\[ U = \{X_1, \ldots, X_n\}, \quad \text{where } X_i = (\langle a_{i1}, b_{i1} \rangle, \ldots, \langle a_{il}, b_{il} \rangle) \]
and $a_{ij}$ and $b_{ij}$ are the two alleles of individual $i$ at locus $j$. The goal is to find a partition of individuals $P_1, \ldots, P_m$ such that
\[ \forall\, 1 \le k \le m,\ \forall\, X_u, X_v \in P_k:\ \mathrm{Parents}(X_u) = \mathrm{Parents}(X_v). \]
Notice here that we have not defined the function Parents(x). This is a biological objective. We will discuss computational approaches to achieve a good estimate of the biological sibling relationship.
2.3 2-Allele and 4-Allele Properties
Mendelian genetics lays down a very simple rule for inheritance in diploid organisms: an offspring inherits one allele from each of its parents for each locus. This introduces two overlapping necessary (but not sufficient) constraints on full sibling groups: the 4-allele property and the 2-allele property (Berger-Wolf et al., 2005).
4-Allele Property: The total number of distinct alleles occurring at any locus may not exceed 4.
Formally, a set $S \subseteq U$ has the 4-allele property if
\[ \forall\, 1 \le j \le l:\ \left| \bigcup_{i \in S} \{a_{ij}, b_{ij}\} \right| \le 4. \]
Clearly, the 4-allele property is necessary, since a group of siblings can inherit only combinations of the 4 alleles of their common parents. The 4-allele property is effective for identifying sibling groups where the data are mostly heterozygous and the parent individuals share few common alleles. Generally, as in Table 1, a set consisting of any two individuals satisfies the 4-allele property. The set of individuals 1, 3 and 4 from Table 1 satisfies the 4-allele property. However, the set of individuals 2, 3 and 5 fails to satisfy it, as five distinct alleles occur at the first locus: {12, 31, 56, 44, 51}.
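The 4-allele check on the Table 1 examples can be sketched in a few lines of Python. This is only an illustration: the dictionary layout `TABLE_1` and the function name `has_4_allele_property` are our assumptions, not part of the original implementation.

```python
# Table 1 genotypes: individual -> [(a, b) at locus 1, (a, b) at locus 2]
TABLE_1 = {
    1: [(44, 44), (55, 23)],
    2: [(12, 56), (14, 31)],
    3: [(31, 44), (55, 14)],
    4: [(13, 13), (31, 23)],
    5: [(31, 51), (14, 31)],
}


def has_4_allele_property(individuals, genotypes=TABLE_1):
    """True if no locus shows more than 4 distinct alleles in the group."""
    num_loci = len(next(iter(genotypes.values())))
    for j in range(num_loci):
        alleles = set()
        for i in individuals:
            alleles.update(genotypes[i][j])
        if len(alleles) > 4:
            return False
    return True


print(has_4_allele_property([1, 3, 4]))  # individuals 1, 3 and 4
print(has_4_allele_property([2, 3, 5]))  # individuals 2, 3 and 5
```

The first call corresponds to the group that satisfies the property and the second to the group that violates it at the first locus.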
2-Allele Property: There exists an assignment of individual alleles within a locus to maternal and paternal such that the number of distinct alleles assigned to each parent at this locus does not exceed 2.
Formally, a set $S \subseteq U$ has the 2-allele property if for each $X_i$ at each locus there exists an assignment of $a_{ij} = c_{ij}$ or $b_{ij} = c_{ij}$ (and the other allele assigned to $\bar{c}_{ij}$) such that
\[ \forall\, 1 \le j \le l:\ \left| \bigcup_{i \in S} \{c_{ij}\} \right| \le 2 \quad \text{and} \quad \left| \bigcup_{i \in S} \{\bar{c}_{ij}\} \right| \le 2. \]
The 2-allele property is clearly stricter than the 4-allele property. Looking at Table 1, our previous 4-allele set of individuals 1, 3 and 4 fails to satisfy the stricter 2-allele property, as more than two alleles appear on the left side at locus 1: {44, 31, 13}. Moreover, no swapping of alleles will bring the number of alleles on each side down to two: the 1st and 4th individuals, with alleles 44/44 and 13/13, already fill the capacity.
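To make the definition tangible, the following Python sketch tests the 2-allele property by brute force: it searches for two parental allele sets of size at most two such that every individual can take one allele from each. This exhaustive search is our illustrative stand-in, not the table-driven procedure used in the paper, and the names `TABLE_1`, `locus_has_2_allele_property` and `has_2_allele_property` are hypothetical.

```python
from itertools import combinations_with_replacement

# Table 1 genotypes: individual -> [(a, b) at locus 1, (a, b) at locus 2]
TABLE_1 = {
    1: [(44, 44), (55, 23)],
    2: [(12, 56), (14, 31)],
    3: [(31, 44), (55, 14)],
    4: [(13, 13), (31, 23)],
    5: [(31, 51), (14, 31)],
}


def locus_has_2_allele_property(genotypes):
    """genotypes: list of (a, b) allele pairs observed at one locus."""
    if not genotypes:
        return True
    alleles = sorted({x for g in genotypes for x in g})
    if len(alleles) > 4:
        return False
    # A candidate parent is an unordered pair of observed alleles.
    candidates = list(combinations_with_replacement(alleles, 2))
    for father in candidates:
        for mother in candidates:
            if all((a in father and b in mother) or
                   (b in father and a in mother)
                   for a, b in genotypes):
                return True
    return False


def has_2_allele_property(individuals, table=TABLE_1):
    """True if the group passes the 2-allele test at every locus."""
    num_loci = len(next(iter(table.values())))
    return all(
        locus_has_2_allele_property([table[i][j] for i in individuals])
        for j in range(num_loci)
    )


print(has_2_allele_property([1, 3, 4]))  # the group discussed above
print(has_2_allele_property([1, 3, 5]))
```

Restricting candidate parents to observed alleles loses nothing here, because an unobserved parental allele can always be replaced by an observed one without invalidating any child.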
The 2-allele property takes into account the fact that the parents can contribute only two alleles each to their offspring. Note that the 2-allele property is, again, a necessary but not a sufficient constraint for a group of individuals to be siblings. Notice also that any two individuals necessarily satisfy the 2-allele property, since by default the number of alleles on each side of any locus is at most two.

Table 1. An example of input data for the sibling reconstruction problem. The five individuals have been sampled at two genetic loci. Each allele is represented by a number. Same numbers represent the same alleles.

Individual   Alleles (a/b) at locus 1   Alleles (a/b) at locus 2
Radish 1     44/44                      55/23
Radish 2     12/56                      14/31
Radish 3     31/44                      55/14
Radish 4     13/13                      31/23
Radish 5     31/51                      14/31

The 2-allele property reduces the possible combinations of alleles at a locus in a group of siblings down to a few canonical options (modulo the numbering of the alleles). Assuming the alleles are numbered 1 through 4, Table 2 lists all different types of sibling groups possible with the 2-allele property. We do this by listing all possible pairs of parents whose alleles are among 1, 2, 3, and 4 and all the offspring they can produce. However, in any sibling group with a given set of parents only a subset of the offspring possibilities from the table may be present.
It is important to note that Table 2 gives an exhaustive list of canonical possibilities of allele combinations at a given locus in a group of siblings without violating the 2-allele property. Without loss of generality, we assume that the alleles at each locus are numbered 1 through 4. This is sufficient since, according to the 4-allele property, the number of alleles in any sibling group cannot exceed four. Further, there are 4! = 24 possible mappings of any four alleles onto the numbers 1–4. However, we list only the canonical minimal options (parents' alleles being numbered sequentially). It is not hard to check that the list of parents is exhaustive. Hence, Table 2 presents an exhaustive canonical list of possible sibling groups. It is also easy to verify that the resulting sibling groups indeed conform to the 2-allele property.
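The exhaustiveness claim can be checked mechanically. The Python sketch below, written for illustration only (the helper names `canonical` and `offspring` are ours), enumerates every pair of parents over the alleles 1 to 4, reduces each pair to its smallest form under the 24 renumberings, and lists the ordered offspring genotypes of each canonical pair.

```python
from itertools import combinations_with_replacement, permutations, product

ALLELES = (1, 2, 3, 4)


def canonical(parents):
    """Smallest form of a parent pair under all 4! allele renumberings."""
    best = None
    for perm in permutations(ALLELES):
        relabel = dict(zip(ALLELES, perm))
        form = tuple(sorted(
            tuple(sorted(relabel[x] for x in parent)) for parent in parents
        ))
        if best is None or form < best:
            best = form
    return best


def offspring(parents):
    """All ordered (allele a, allele b) genotypes the parents can produce."""
    father, mother = parents
    kids = set()
    for f, m in product(father, mother):
        kids.add((f, m))
        kids.add((m, f))
    return sorted(kids)


genotypes = list(combinations_with_replacement(ALLELES, 2))
canonical_parents = sorted(
    {canonical((p, q)) for p in genotypes for q in genotypes}
)
for parents in canonical_parents:
    print(parents, offspring(parents))
```

Under these assumptions the enumeration yields seven canonical parent pairs, the same number of parent sets listed in Table 2.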
2.4 Minimum 2-Allele Set Cover
As we have mentioned, the biological function Parents(x) cannot be defined mathematically. We model the objective of reconstructing the sibling relationships mathematically by assigning individuals parsimoniously into the smallest number of (possibly overlapping) groups that satisfy the necessary 2-allele constraint. Formally, recall that we are given a population $U$ of $n$ diploid individuals sampled at $l$ loci,
\[ U = \{X_1, \ldots, X_n\}, \quad \text{where } X_i = (\langle a_{i1}, b_{i1} \rangle, \ldots, \langle a_{il}, b_{il} \rangle) \]
and $a_{ij}$ and $b_{ij}$ are the two alleles of individual $i$ at locus $j$. The goal of the MINIMUM 2-ALLELE SET COVER problem is to find the smallest number of subsets $S_1, \ldots, S_m$ such that each $S_i \subseteq U$ satisfies the 2-allele constraint and $\bigcup_i S_i = U$.
Table 2. Canonical possible combinations of parent alleles and all resulting offspring allele combinations at a single locus. Each offspring is written as allele a / allele b.

Parents        Offspring (allele a / allele b)
(1/2) (3/4)    1/3  2/4  1/4  2/3  3/1  4/2  4/1  3/2
(1/2) (1/3)    1/1  2/3  1/3  2/1  3/1  3/2  1/2
(1/2) (1/2)    1/1  1/2  2/1  2/2
(1/1) (1/1)    1/1
(1/1) (1/2)    1/1  1/2  2/1
(1/1) (2/3)    1/2  1/3  2/1  3/1
(1/1) (2/2)    1/2  2/1

We conjecture that MINIMUM 2-ALLELE SET COVER is NP-complete. A simple corollary of the following theorem from Berger-Wolf et al., 2005 shows that it is in NP.
THEOREM 1 (Berger-Wolf et al., 2005). Let $R$ be the number of alleles that are homozygous or appear with 3 other distinct alleles in a given locus, and let $A$ be the total number of distinct alleles at a locus. Then a set of individuals satisfies the 2-allele property if and only if for every locus it satisfies the constraint
\[ A + R \le 4. \]
It is easy to see that, given a set of individuals, we can verify that it satisfies the 2-allele property in $O(nl)$ time using the constraint above. Thus, MINIMUM 2-ALLELE SET COVER is in NP.
Since MINIMUM 2-ALLELE SET COVER is likely to be NP-hard, one approach is to design approximation algorithms or heuristics that will produce suboptimal solutions. Instead, we use the commercial MIP solver CPLEX¹ to solve the problem to optimality.
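The constraint of Theorem 1 translates directly into a single pass over the genotypes at a locus. The Python sketch below is our reading of the theorem as stated here, with a hypothetical function name `satisfies_theorem_1`; it counts A as the number of distinct alleles and R as the number of alleles that are homozygous in some individual or that occur together with at least three other distinct alleles.

```python
from collections import defaultdict


def satisfies_theorem_1(genotypes):
    """Check A + R <= 4 at one locus; genotypes is a list of (a, b) pairs."""
    partners = defaultdict(set)  # allele -> distinct alleles it appears with
    homozygous = set()
    for a, b in genotypes:
        if a == b:
            homozygous.add(a)
            partners[a]  # make sure the allele is counted in A
        else:
            partners[a].add(b)
            partners[b].add(a)
    num_alleles = len(partners)                      # A
    num_restricted = sum(                            # R
        1 for x in partners if x in homozygous or len(partners[x]) >= 3
    )
    return num_alleles + num_restricted <= 4


# Locus 1 of individuals 1, 3 and 4 from Table 1: A = 3, R = 2.
print(satisfies_theorem_1([(44, 44), (31, 44), (13, 13)]))
# Four offspring of canonical parents (1/2)(1/3): A = 3, R = 1.
print(satisfies_theorem_1([(1, 1), (2, 3), (1, 3), (2, 1)]))
```

Each genotype is visited once per locus, which matches the linear-time verification argument in the text.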
2.5 Minimum 2-Allele Set Cover Algorithm
We now present our algorithm for solving the sibling reconstruction problem abstracted as MINIMUM 2-ALLELE SET COVER. Our algorithm uses the 2-allele and 4-allele properties (specifically, Table 2) to generate all maximal potential sibling sets. We then restate the problem as a MINIMUM SET COVER to find the minimum number of sibling sets containing all the individuals. Thus, the algorithm has two steps:
1. Create potential sibling sets based on the 2-allele property for each locus and maximally assign individuals to each set without violating the 2-allele property in any locus.
2. Use minimum set cover to find the minimum number of the 2-allele sets from step 1 whose union contains all the individuals.
We now explain the algorithm in more detail. In step 1, we build on the approach presented in Berger-Wolf et al., 2005 and Chaovalitwongse et al., 2006 by generating sets that satisfy the 2-allele property. In the implementation of the algorithm we use the complete version of Table 2, with all 24 possible mappings of alleles to numbers 1–4, to generate all maximal possible sets. Since the list is exhaustive, if a set does not match one of the patterns in Table 2 under some mapping of its alleles onto numbers 1–4, it cannot possibly be a sibling group. During both steps of our algorithm we maintain an index or lookup of all sets to ensure there are no duplications.
2.5.1 Algorithm 2-allele. Recall that any pair of individuals necessarily satisfies the 2-allele property. Thus, initially we use all $\binom{n}{2}$ pairs of $n$ individuals to generate the candidate sets. Each set is generated using the initial possible canonical sets from Table 2 for each locus $j$. Each allele is assigned a number between 1 and 4 based on the order of its occurrence. Then, for each pair of individual alleles we search for all matching canonical sets in Table 2 to determine the set of possibilities, PossibilitiesSet.
After generating these initial sets based on pairs of individuals, the algorithm repeatedly iterates through all the individuals, testing each set for a possible assignment of the individual to the set. In each cycle of the iterations, only the sets that were present at the beginning of the cycle are considered for each individual. An individual is assigned to a set if its alleles match the possibilities of the set as defined by the extended Table 2.
¹ CPLEX is a registered trademark of ILOG.
However, adding an individual to a potential sibling set may reduce the set of the matching canonical patterns. For example, adding an individual with alleles 3/1 to a set of two individuals with alleles 1/2 and 2/1 changes the potential set of parents from {(1,1)(2,2); (1,1)(1,2); (1,2)(1,2); (1,1)(2,3); (1,2)(1,3); (1,2)(3,4)} to just {(1,1)(2,3); (1,2)(1,3); (1,2)(3,4)}. Thus, when adding a new individual to a set, we check if a new valid set can be created to accommodate all of the individuals already assigned to the set as well as the new individual. The validity of the new set is determined by the 4-allele property and the extended Table 2. The alleles at every locus of the new individual must match at least one of the canonical patterns that collectively satisfy all the previous individuals assigned to the set. Once we determine that the set can be expanded (and its set of possible matching parents reduced) to accommodate the new individual in a valid way, we create a modified copy of the set. The individual is then checked against this new set for all the remaining loci. After we have verified that the new individual does not violate the 2-allele property of the new set at every locus, as explained above, and that the set does not already exist, we add the set to the collection of potential sibling sets. However, for the remainder of the iteration cycle all the individuals are checked only against the sets that had been present at the beginning of the cycle. This ordering ensures that each individual is checked against each set exactly once.
We repeat this process, cycling through all the individuals in
the population. Once a set present at the beginning of the cycle
has been inspected against all the individuals, the set is marked
as done and is not revisited. This ensures that all sibling pairs
that could possibly occur are evaluated, and that no sibling sets
are generated that never occur in data.
The cycles of iterations over the individuals continue until all sets are marked as done. As the last step, a singleton set containing just that element is added for each of the elements, to ensure that a family group containing one offspring is possible.
After all the potential sibling sets are generated we apply
the minimum set cover to find the minimum number of sibling
groups whose union contains all the individuals.
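The set-generation step described above can be sketched compactly in Python. This is a simplified illustration under our own assumptions, not the implementation used in the paper: it replaces the lookup in the extended Table 2 with a direct brute-force 2-allele test, and the names `TABLE_1`, `locus_ok`, `is_2_allele_set` and `candidate_sibling_sets` are hypothetical. It keeps the key ideas of seeding with all pairs, extending only the sets present at the start of each cycle, indexing sets to avoid duplicates, and adding singletons at the end.

```python
from itertools import combinations, combinations_with_replacement

# Table 1 genotypes: individual -> [(a, b) at locus 1, (a, b) at locus 2]
TABLE_1 = {
    1: [(44, 44), (55, 23)],
    2: [(12, 56), (14, 31)],
    3: [(31, 44), (55, 14)],
    4: [(13, 13), (31, 23)],
    5: [(31, 51), (14, 31)],
}


def locus_ok(genotypes):
    """Brute-force 2-allele test at one locus (stand-in for Table 2)."""
    alleles = sorted({x for g in genotypes for x in g})
    if len(alleles) > 4:
        return False
    parents = list(combinations_with_replacement(alleles, 2))
    return any(
        all((a in f and b in m) or (b in f and a in m) for a, b in genotypes)
        for f in parents for m in parents
    )


def is_2_allele_set(members, table):
    num_loci = len(next(iter(table.values())))
    return all(
        locus_ok([table[i][j] for i in members]) for j in range(num_loci)
    )


def candidate_sibling_sets(table):
    individuals = sorted(table)
    # Any pair of individuals satisfies the 2-allele property.
    seen = {frozenset(p) for p in combinations(individuals, 2)}
    frontier = set(seen)
    while frontier:  # one pass of this loop is one iteration cycle
        new_sets = set()
        for group in frontier:
            for i in individuals:
                bigger = group | {i}
                if i in group or bigger in seen or bigger in new_sets:
                    continue
                if is_2_allele_set(bigger, table):
                    new_sets.add(bigger)
        seen |= new_sets
        frontier = new_sets
    # Singletons allow family groups with a single offspring.
    seen |= {frozenset([i]) for i in individuals}
    return seen


sets = candidate_sibling_sets(TABLE_1)
print(sorted(sorted(s) for s in sets if len(s) > 2))
```

On the Table 1 data this sketch finds a single candidate set larger than a pair, namely individuals 1, 3 and 5, which together with the pair of individuals 2 and 4 covers the whole sample.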
2.5.2 Proof of Correctness and Termination. First, we note that the algorithm terminates, since the sets newly added in each iteration cycle are always bigger than the sets present at the beginning of the iteration cycle and each individual can occur at most once in a set.
We already showed that Table 2 exhaustively lists all the canonical possibilities of sibling groups (modulo the mapping of alleles to the numbers 1–4). We show that our algorithm produces all the sibling groups that conform to the listings in this table, and no sibling group is generated that does not satisfy one of the canonical possibilities.
THEOREM 2. Algorithm 2-allele produces all and only the possible 2-allele groups that are supported by the data.
Proof. As we have stated before, all possible pairs of individuals create minimal (non-singleton) valid sibling groups and must correspond to at least one of the entries in Table 2 by default. The algorithm then exhaustively compares every individual against every such possible sibling set and generates new sets as necessary if the 2-allele property is not violated. Thus, every combination of individuals that can be siblings will be generated. Suppose, to the contrary, there exists a valid maximal sibling group $S$ that has never been generated, and consider the smallest such group. Let $X_i$ be the individual with the highest index $i$ in this group. When we remove the individual $X_i$ from the population, all the individuals that could be siblings before can still be siblings. Thus, $S - X_i$ is still a valid sibling group and, by the inductive hypothesis, it must have been generated by the algorithm. We examine $X_i$ against the group $S - X_i$. Adding $X_i$ does not violate the 2-allele property (since $S$ is a sibling group), and therefore there exists a corresponding canonical set in Table 2 that contains $S$. Thus, we would add the corresponding possible set if it was not already among the sets.
Since we check every sibling group at all loci before adding it to the collection of potential sets, we ensure that we never generate a set that does not satisfy the 2-allele property at every locus.
After all possible sibling groups are generated, we use the minimum set cover approach to find the smallest number of sibling groups whose union contains all the individuals. While the minimum set cover problem is NP-complete, modern mixed integer program solvers can solve it to optimality in most instances. Thus, it is not meaningful to discuss the theoretical computational complexity of the algorithm.
2.5.3 Minimum Set Cover. The minimum set cover problem is a classical NP-complete problem (Karp, 1972). Minimum Set Cover is defined as follows: given a universe $U$ of elements $X_1, \ldots, X_n$ and a collection $S$ of subsets of $U$, the goal is to find the minimum collection of subsets $C \subseteq S$ whose union is the entire universe $U$.
Formally, given $U = \{X_1, X_2, \ldots, X_n\}$ and $S = \{S_1, S_2, \ldots, S_m\}$, find
\[ \min |C| \quad \text{s.t.} \quad C \subseteq S \ \text{ and } \bigcup_{S_i \in C} S_i = U. \]
Set cover cannot be approximated in polynomial time to within a factor of $(1 - \epsilon) \ln n$ unless $NP \subseteq DTIME(n^{\log \log n})$ (Feige, 1998). Johnson introduced a $1 + \ln n$ approximation in 1974 (Johnson, 1974).
In order to solve set cover we use standard integer programming solvers. The integer program formulation of the set cover problem is as follows: given a matrix $A$ with entries
\[ a_{ij} = \begin{cases} 1 & \text{if } X_i \in S_j \\ 0 & \text{otherwise} \end{cases} \]
the set cover problem is
\[ \min \sum_{j=1}^{m} x_j \quad \text{s.t.} \quad Ax \ge 1, \quad x_j \in \{0, 1\}. \]
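For very small instances the set cover objective can be solved exactly without an integer programming solver. The Python sketch below is an illustrative substitute for the solver-based approach described in the text, not a reproduction of it: it enumerates collections of subsets in order of increasing size and returns the first one whose union is the universe. The function name `minimum_set_cover` and the example subsets (the maximal 2-allele sets we derived for the Table 1 data) are our assumptions.

```python
from itertools import combinations


def minimum_set_cover(universe, subsets):
    """Exact minimum set cover by enumeration (exponential time)."""
    universe = set(universe)
    subsets = [frozenset(s) for s in subsets]
    if not universe:
        return []
    if set().union(*subsets) != universe:
        raise ValueError("the subsets do not cover the universe")
    # Try 1 subset, then 2, ... so the first cover found is a minimum one.
    for size in range(1, len(subsets) + 1):
        for combo in combinations(subsets, size):
            if set().union(*combo) >= universe:
                return list(combo)


# Maximal candidate sibling sets for the five individuals of Table 1
# (assumed here; any family of subsets covering the universe works).
candidates = [{1, 3, 5}, {1, 2}, {1, 4}, {2, 3}, {2, 4}, {2, 5}, {3, 4}, {4, 5}]
cover = minimum_set_cover({1, 2, 3, 4, 5}, candidates)
print([sorted(s) for s in cover])
```

Enumeration is only practical for a handful of candidate sets; it is shown here because it makes the objective and the covering constraint explicit.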
3 EVALUATION AND RESULTS
To validate and assess the accuracy of our approach we use datasets with known genetics and genealogy. However, such biological datasets containing no errors are rare. In addition, we create simulated sets using a large number of parameters over a wide range of values. In each instance we compare our algorithm to other methods for sibship reconstruction.
We measure the error by comparing the known sibling sets with those generated by our algorithm, and calculating the minimum partition distance (Gusfield, 2002). The error is the percentage of individuals that would need to be removed to make the reconstructed sibling sets equal to the true sibling sets. Note that we are computing the error in terms of individuals, not in terms of the number of sibling groups reconstructed incorrectly. Thus, the accuracy is the percentage of individuals correctly assigned to sibling groups.
The experiments were run on a combination of a cluster of 64 mixed AMD and Intel Xeon nodes with 2.8 GHz and 3.0 GHz processors and a single Intel Pentium D Dual Core 3.2 GHz processor with 4 GB of RAM.
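The error measure based on the minimum partition distance can be illustrated with a short Python sketch. It is a brute-force illustration under our own assumptions (the function name `partition_distance` and the toy groups are hypothetical): it tries every one-to-one matching between true and reconstructed groups and keeps the matching that preserves the most individuals. The running time is factorial in the number of groups, so for larger inputs an assignment-problem solver would be the natural replacement.

```python
from itertools import permutations


def partition_distance(true_groups, inferred_groups):
    """Minimum number of individuals to remove so both partitions agree."""
    a = [set(g) for g in true_groups]
    b = [set(g) for g in inferred_groups]
    # Pad the shorter partition with empty groups so they can be matched.
    while len(a) < len(b):
        a.append(set())
    while len(b) < len(a):
        b.append(set())
    total = sum(len(g) for g in true_groups)
    # Best one-to-one matching of groups by number of shared individuals.
    kept = max(
        sum(len(x & y) for x, y in zip(a, perm)) for perm in permutations(b)
    )
    return total - kept


true_groups = [{1, 2, 3}, {4, 5}]
inferred_groups = [{1, 2}, {3, 4, 5}]
distance = partition_distance(true_groups, inferred_groups)
error = 100.0 * distance / 5
print(distance, error, 100.0 - error)
```

In this toy case one individual (number 3) must be removed for the two partitions to coincide, so the error is 20 percent and the accuracy 80 percent.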
3.1 Sibship Reconstruction Methods
We compare the performance of our algorithm to three other sibship reconstruction methods. The methods span a variety of approaches and have different behavior on different parameters. We now describe the methods.
Almudevar and Field. Our algorithm is based on a very similar idea proposed by Almudevar and Field, 1999, which is a completely combinatorial approach. Here, too, potential sibling sets are constructed using the 2-allele property (although the authors do not explicitly state the property). However, these sets are constructed by enumerating exhaustively all combinations of individuals and testing those for compliance with the 2-allele property. At the end, a maximal, not necessarily optimal, collection of sibling sets is returned as a solution.
Beyer and May. The approach proposed in Beyer and May, 2003 is a mixture of likelihood and combinatorial techniques. The algorithm constructs a graph with individuals as nodes and the edges weighted by the pairwise likelihood ratio that the individuals are siblings versus being unrelated. Very light edges are ignored. Potential families are identified by the connected components in this graph.
KinGroup. Konovalov et al., 2004 have proposed an algorithm based entirely on likelihood estimates of partitions of individuals into sibling groups. The individuals are considered one at a time. For each individual, the likelihood of it being part of any existing sibling group, as well as starting its own group, is calculated. The individual is placed into the group to which it is most likely to belong. Unfortunately, the outcome heavily depends on the order in which the individuals are considered.
3.2 Biological Data
We have identified four biological datasets of microsatellite data where sibling groups are known. These are not wild populations since in wild populations we typically do not know the true sibling groups, which is precisely why we need the sibling reconstruction method.
Radishes. The wild radish Raphanus raphanistrum dataset
(Conner, 2006) consists of samples from 150 radishes from
two families with 17 sampled loci. There are missing alleles
among all the loci. The parent genotypes are available.
Salmon. The Atlantic Salmon Salmo salar dataset comes from
the genetic improvement program of the Atlantic Salmon
Federation (Herbinger et al., 1999). We use a truncated
sample of microsatellite genotypes of 351 individuals from
6 families with 4 loci per individual. The data does not have
missing alleles at any locus. This dataset is a subset of one
of the samples of genotyped individuals used by Almudevar
and Field, 1999 to illustrate their technique.
Shrimp. The tiger shrimp Penaeus monodon dataset (Jerry
et al., 2006) consists of 59 individuals from 13 families with
7 loci. There are 16 missing alleles. The parentage is known.
Flies. The Scaptodrosophila hibisci dataset (Wilson et al., 2002)
consists of 190 same-generation individuals (flies) from
6 families sampled at varying numbers of loci with up to
8 alleles per locus. Parent genotypes were known. All
individuals shared 2 sampled loci, which were chosen for
our study. Some of the alleles were missing for some of the
individuals.
Table 3 summarizes the results of the four algorithms on the
biological datasets.
Table 3. Accuracy (percent) of our algorithm and the three
reference algorithms on biological datasets. Here l is the number
of loci in a dataset and “Inds” column gives the number of
individuals in the dataset. The three reference algorithms are
Almudevar and Field, 1999 (A&F), Beyer and May, 2003
(B&M), and the KinGroup by Konovalov et al., 2004 (KG).
Dataset    l   Inds   Ours     A&F             B&M     KG
Shrimp     7    59    77.97    67.80           77.97   77.97
Salmon     4   351    98.30    Out of memory   99.71   96.02
Radishes   5   531    75.90    Out of memory   53.30   29.95
Flies      2   190   100.00    31.05           27.89   54.73
Almudevar and Field's algorithm ran out of memory (4 GB) on the salmon
and radish datasets.
3.3 Random Simulations
In addition, we validate our approach using random
simulations. We first create random diploid parents and then
generate complete genetic data for offspring, varying the
number of males, females, alleles, loci, families,
and offspring per family. We then use the 2-allele
algorithm described above to reconstruct the sibling groups.
We compare our results to the actual known sibling groups
in the data to assess accuracy. We measure the error rates
of the algorithm using the Gusfield Partition Distance (Gusfield,
2002). In addition, we compare the accuracy of our 2-allele
algorithm to two of the reference sibling reconstruction methods,
Beyer and May, 2003 and Konovalov et al., 2004, described
above. We repeat the entire process 1000 times for each fixed
combination of parameter values. We omit the comparison with
the algorithm of Almudevar and Field, 1999,
since the current version of the provided software requires user
interaction, which makes it infeasible to use in an
automated simulation pipeline of 1000 iterations for over a
hundred combinations of parameter values.
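The partition distance between a reconstructed and a true grouping is the minimum number of individuals that must be removed so that the two partitions become identical. A minimal sketch follows; it finds the best one-to-one matching of groups by brute force, which is only practical for a handful of groups (an assignment solver would be needed for larger instances). The function name and the input encoding are assumptions for illustration.

```python
from itertools import permutations


def partition_distance(p, q):
    """Minimum number of elements to delete so that partitions p and q agree.

    p, q: lists of disjoint sets over the same individuals.
    Brute force over block matchings; suitable for small examples only.
    """
    blocks_p = [set(b) for b in p]
    blocks_q = [set(b) for b in q]
    n = sum(len(b) for b in blocks_p)
    # Pad the shorter partition with empty blocks so matchings are one-to-one.
    k = max(len(blocks_p), len(blocks_q))
    blocks_p += [set()] * (k - len(blocks_p))
    blocks_q += [set()] * (k - len(blocks_q))
    # The best matching keeps as many individuals as possible.
    kept = max(
        sum(len(blocks_p[i] & blocks_q[j]) for i, j in enumerate(perm))
        for perm in permutations(range(k))
    )
    return n - kept
```

For the partitions {1,2,3},{4,5} and {1,2},{3,4,5}, removing individual 3 makes the two identical, so the distance is 1.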
First, we generate the parent generation of M males and
F females, each with l loci and a specified number
of alleles per locus a. We create populations with uniform
as well as nonuniform allele distributions. After the parents
are created, their offspring are generated by selecting f pairs
of parents. A male and a female are chosen independently,
randomly and uniformly from the parent population. For these
parents a specified number of offspring o is generated. Here,
too, we create populations with a uniform as well as a skewed
family size distribution. Each offspring randomly receives one
allele each from its mother and father at each locus. This
is a rather simplistic approach; however, it is consistent with
the genetics of known parents and provides a baseline for the
accuracy of the algorithm, since biological data are generally
not random and uniform.
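The generation procedure for the uniform case can be sketched as follows. This is an illustrative reading of the description above, not the simulation code used in the study: the function and parameter names are hypothetical, alleles are encoded as integers, and nothing prevents the same pair of parents from being drawn more than once, so families are labelled by their parent pair.

```python
import random


def simulate_population(num_males, num_females, num_loci, num_alleles,
                        num_families, offspring_per_family, seed=0):
    """Return (offspring, labels, males, females) for one simulated population."""
    rng = random.Random(seed)

    def random_parent():
        # One diploid genotype (two alleles) per locus, alleles drawn uniformly.
        return [(rng.randrange(num_alleles), rng.randrange(num_alleles))
                for _ in range(num_loci)]

    males = [random_parent() for _ in range(num_males)]
    females = [random_parent() for _ in range(num_females)]

    offspring, labels = [], []
    for _ in range(num_families):
        # A male and a female are chosen independently and uniformly.
        fi, mi = rng.randrange(num_males), rng.randrange(num_females)
        for _ in range(offspring_per_family):
            # At each locus: one maternal allele, then one paternal allele.
            child = [(rng.choice(females[mi][k]), rng.choice(males[fi][k]))
                     for k in range(num_loci)]
            offspring.append(child)
            labels.append((fi, mi))
    return offspring, labels, males, females
```

The returned labels give the true sibling groups against which a reconstruction can be scored.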
The parameter ranges for the study are as follows:
• The number of adult females F and the number of adult
males M were equal and set to 5, 10 or 15.
• The number of loci sampled l = 2, 4, 6.
• The number of alleles per locus (for the uniform allele
frequency distribution) a = 5, 10, 15.
• Nonuniform allele frequency distribution (for 4 alleles):
12-4-1-1, as in Almudevar, 2003.
• The number of families in the population f = 2, 5, 10.
• The number of offspring per couple (for the uniform
family size distribution) o = 2, 5, 10.
• Nonuniform family size distribution (for 5 families):
25-10-10-4-1, as in Almudevar, 2003.
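The two nonuniform settings can be encoded directly. The sketch below assumes that the allele figures are relative frequencies (weights) of the four alleles and that the family figures are offspring counts for the five families; both readings, and all names in the code, are assumptions made for illustration.

```python
import random

ALLELE_WEIGHTS = [12, 4, 1, 1]      # assumed relative frequencies of 4 alleles
FAMILY_SIZES = [25, 10, 10, 4, 1]   # assumed offspring counts for 5 families


def skewed_genotype(rng, num_loci, weights=ALLELE_WEIGHTS):
    """Draw a diploid genotype whose alleles follow the given relative weights."""
    alleles = list(range(len(weights)))
    return [tuple(rng.choices(alleles, weights=weights, k=2))
            for _ in range(num_loci)]
```

Under this reading, the most common allele is drawn with probability 12/18 at each position, and the skewed population contains 50 offspring in total.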
All datasets were generated on the 64-node cluster running
RedHat Linux 9.0. The 2-allele algorithm is used on this
generated population to find the smallest number of 2-allele
sets necessary to explain this juvenile population. We use
the commercial MIP solver CPLEX 9.0 for Windows XP on
a single processor machine to solve the minimum set cover
problem to optimality. The reference algorithms were run on
a single processor machine running Windows XP (see footnote 2).
2 The difference in platforms and operating systems is dictated by the
available software licenses and provided binary code.
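The optimization step can be illustrated on a toy instance. The sketch below is not the CPLEX formulation: it is a hypothetical exhaustive search that returns a smallest collection of candidate sets covering all individuals, and it is exponential in the number of candidate sets, so it is usable only for very small inputs.

```python
from itertools import combinations


def minimum_set_cover(universe, candidate_sets):
    """Return indices of a smallest sub-collection covering the universe.

    universe: iterable of individuals to be explained.
    candidate_sets: list of sets (e.g. feasible sibling sets).
    Returns None if no cover exists. Exhaustive search, for illustration only.
    """
    universe = set(universe)
    if not universe:
        return []
    for size in range(1, len(candidate_sets) + 1):
        for combo in combinations(range(len(candidate_sets)), size):
            covered = set().union(*(candidate_sets[i] for i in combo))
            if universe <= covered:
                return list(combo)
    return None
```

Trying collection sizes in increasing order guarantees that the first cover found is a minimum one, which mirrors the parsimony objective of explaining the juveniles with as few sibling sets as possible.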
We measure the reconstruction accuracy of the 2-allele
algorithm as a function of the number of alleles per
locus, family size (number of offspring), number of families
(and polygamy), and the variation in allele frequency and
family size distributions.
Figure 1 shows representative results for the accuracy of our
2-allele algorithm and the two reference algorithms on uniform
allele frequency and family size distributions. Figure 2 shows
results for the datasets with skewed family size and allele
frequency distributions. Each bar represents the mean value over
1000 random repetitions and the error bars show the standard
deviation.
4 DISCUSSION AND CONCLUSIONS
We have proposed a new fully combinatorial algorithm for the
problem of reconstructing sibling relationships from
single-generation microsatellite genetic data. We have implemented
and tested our approach on both real biological and simulated
data.
On biological data our algorithm performed as well as or better
than other sibling reconstruction methods. The difference is
particularly striking for the flies dataset with 2 loci. Our
algorithm accurately reconstructed the sibling groups despite
some missing alleles, while the other methods all have error
rates above 45%. The radish dataset presented a problem for all
methods since it has partial self-reproduction, which none of
the methods take into account. Offspring of a selfed individual
are hard to separate from their half-siblings produced by that
and any other individual. Still, even on this dataset our method
performed significantly better than the others.
The simulated data provide a baseline for the accuracy
estimate of our algorithm, with real biological data likely to
have better reconstruction accuracy, as indicated by the results
on the biological datasets. For the datasets with uniform
allele frequency and family size distributions, when the number
of alleles per locus is above 5 and the number of offspring
per family is above 5, the accuracy of our algorithm is above
50 percent in most cases, rapidly increasing as the number
of offspring or alleles increases. Our algorithm performs
significantly better than other methods when the number of loci
is very small and there is reasonable diversity of alleles. In fact,
the algorithm of Beyer and May, 2003 has high error rates
specifically for those parameter values. Thus, our algorithm is
particularly well suited for natural populations of plants and
animals, with large family groups and few sampled loci.
However, we have conducted a very limited set of
experiments on datasets with nonuniform parameter distributions.
To obtain conclusive results, we need to explore a wider range
of nonuniform distributions. To fully evaluate the performance
of our algorithm, we need to validate the results on other
biological datasets and more realistic simulated populations.
In addition, we have yet to address the possibility of errors
in the data. The fact that our algorithm accurately reconstructed
sibling groups on biological data with missing alleles is
encouraging. However, our algorithm would need to be
modified to address errors from mistyped and misidentified
alleles.
It is impossible in the current setting of the experiments
to accurately compare running times of the algorithms.
However, the algorithm of Almudevar and Field, 1999, which
uses exhaustive enumeration of all potential sibling sets,
unsurprisingly ran out of memory on datasets larger than
200 individuals. It took on the order of hours to complete on
the dataset of 190 individuals. Moreover, since the current
implementation requires user interaction, the performance of
the algorithm could not be evaluated in the random simulations.
The likelihood approaches of Beyer and May, 2003 and
Konovalov et al., 2004 are very fast. Both produced answers
in a matter of seconds on datasets of 100 individuals and
in less than 2 minutes on datasets of 500 individuals. For our
algorithm, we first (in a matter of seconds) formulate the
2-allele set cover problem; this formulation is then imported into
CPLEX and solved as a set cover problem. Recall that the set
cover problem is NP-complete and that CPLEX is commercial
software designed specifically to solve such computationally
hard problems to optimality. The entire (automated) process
takes about 2 hours on datasets of 500 individuals. At this point,
our focus has been on evaluating the accuracy and viability of
our approach. Now that our approach has proven viable, we
will concentrate on improving the running time and the overall
usability of our method.
The main advantage of the combinatorial approach is
its lack of reliance on a priori knowledge about various
population parameters, such as allele frequency and the
degree of inbreeding. However, Mendelian constraints do not
provide any basis for distinguishing between family groups
when the groups are small, or when individuals share many
common alleles. Additional information, such as relative
allele frequency within the sample, can be easily added to
generate combinatorial constraints on the potential sibling sets.
Unlike likelihood methods, combinatorial approaches use that
information only for comparison purposes, and do not require a
background data model or an accurate estimate of any of
the parameters. Thus, we believe that combinatorial approaches
are particularly appropriate for analysis of natural animal and
plant populations where background information is difficult to
obtain.
Our technique can be extended to solve a number of related
problems. The first immediate variation is reconstructing
sibship relationships when partial information, such as one of
the parents, is available. We have already implemented and
applied the extended version of our algorithm to identifying
the minimum number of necessary male oak trees that have
pollinated a single female tree to produce the sampled acorns.
Our approach provided additional supporting evidence that oak
pollen disperses much further than previously thought.
Another simple variation of our algorithm produces
half-sibling groups as well as full sibs. In addition, our algorithm
can be used to identify the parsimonious set of parental
genotypes necessary to produce the sibling groups.
Fig. 1. Accuracy of the sibling group reconstruction using the 2-allele algorithm on randomly generated data. The y-axis shows the accuracy
of reconstruction as a function of various simulation parameters. The accuracy of our algorithm is shown, as well as that of the two reference
algorithms: Beyer and May, 2003 and Konovalov et al., 2004 (KinGroup). The title shows the values of the fixed parameters: the number of adult
males/females, the number of families, the number of offspring per family, the number of loci, and the number of alleles per locus.
Fig. 2. Accuracy of the sibling group reconstruction using our 2-allele algorithm and the two reference methods on the datasets with skewed family
size and allele frequency distributions.
Finally, from the algorithmic perspective, there are a number
of alternatives that would improve the performance of our
method that we are currently exploring.
To conclude, we have presented a fully combinatorial
approach to reconstructing sibling groups from microsatellite
data. Our method does not rely on any a priori knowledge about
data parameters yet provides results with accuracy comparable
to or better than those of likelihood methods.
ACKNOWLEDGMENTS
This research is supported by the following grants: NSF
IIS-0612044 and IIS-0611998 (Berger-Wolf, Ashley, Chaovalitwongse,
DasGupta), Fulbright Scholarship (Sheikh), and NSF
CCF-0546574 (Chaovalitwongse). We are grateful to the people who
have shared their data with us: Jeff Conner, Atlantic Salmon
Federation, Dean Jerry, and Stuart Barker. We would also
like to thank Anthony Almudevar, Bernie May, and Dmitry
Konovalov for sharing their software and the anonymous
reviewers for very helpful comments.
REFERENCES
Almudevar, A. (2003). A simulated annealing algorithm for maximum likelihood
pedigree reconstruction. Theoretical Population Biology, 63, 63–75.
Almudevar, A. and Field, C. (1999). Estimation of single generation sibling
relationships based on DNA markers. Journal of Agricultural, Biological,
and Environmental Statistics, 4, 136–165.
Berger-Wolf, T. Y., DasGupta, B., Chaovalitwongse, W., and Ashley, M. V.
(2005). Combinatorial reconstruction of sibling relationships. In Proceedings
of the 6th International Symposium on Computational Biology and Genome
Informatics (CBGI 05), pages 1252–1255, Utah.
Beyer, J. and May, B. (2003). A graph-theoretic approach to the partition of
individuals into full-sib families. Molecular Ecology, 12, 2243–2250.
Blouin, M. S. (2003). DNA-based methods for pedigree reconstruction and
kinship analysis in natural populations. TRENDS in Ecology and Evolution,
18(10), 503–511.
Butler, K., Field, C., Herbinger, C., and Smith, B. (2004). Accuracy, efficiency
and robustness of four algorithms allowing full sibship reconstruction from
DNA marker data. Molecular Ecology, 13, 1589–1600.
Chaovalitwongse, A., Berger-Wolf, T., Dasgupta, B., and Ashley, M. (2006). Set
covering approach for reconstruction of sibling relationships. Optimization
Methods and Software.
Conner, J. K. (2006). personal communication.
Eskin, E., Halperin, E., and Karp, R. M. (2003). Efficient reconstruction of
haplotype structure via perfect phylogeny. Journal of Bioinformatics and
Computational Biology, 1(1), 1–20.
Feige, U. (1998). A threshold of ln n for approximating set cover. Journal of the
ACM, 45, 634–652.
Garant, D. and Kruuk, L. E. B. (2005). How to use molecular marker data to
measure evolutionary parameters in wild populations. Molecular Ecology,
14, 1843–1859.
Gusfield, D. (2002). Partition-distance: A problem and class of perfect graphs
arising in clustering. Information Processing Letters, 82(3), 159–164.
Herbinger, C., O'Reilly, P. T., Doyle, R. W., Wright, J. M., and O'Flynn, F. (1999).
Early growth performance of Atlantic salmon full-sib families reared in single
family tanks or in mixed family tanks. Aquaculture, 173, 105–116.
Jerry, D. R., Evans, B. S., Kenway, M., and Wilson, K. (2006). Development
of a microsatellite DNA parentage marker suite for black tiger shrimp Penaeus
monodon. Aquaculture, 255(1–4), 542–547.
Johnson, D. S. (1974). Approximation algorithms for combinatorial problems. J.
Comput. System Sci., 9, 256–278.
Jones, A. G. and Ardren, W. R. (2003). Methods of parentage analysis in natural
populations. Molecular Ecology, 12, 2511–2523.
Karp, R. M. (1972). Reducibility among combinatorial problems. In R. E. Miller
and J. W. Thatcher, editors, Complexity of Computer Computations, pages
85–103. Plenum Press.
Konovalov, D. A., Manning, C., and Henshaw, M. T. (2004). KINGROUP: a
program for pedigree relationship reconstruction and kin group assignments
using genetic markers. Molecular Ecology Notes.
doi: 10.1111/j.1471-8286.2004.00796.x.
Li, J. and Jiang, T. (2003). Efficient inference of haplotypes from genotype on a
pedigree. Journal of Bioinformatics and Computational Biology, 1(1), 41–69.
Painter, I. (1997). Sibship reconstruction without parental information. Journal
of Agricultural, Biological, and Environmental Statistics, 2, 212–229.
Smith, B. R., Herbinger, C. M., and Merry, H. R. (2001). Accurate partition
of individuals into full-sib families from genetic data without parental
information. Genetics, 158, 1329–1338.
Thomas, C. S. and Hill, W. G. (2002). Sibship reconstruction in hierarchical
population structures using Markov chain Monte Carlo techniques. Genet. Res.,
Camb., 79, 227–234.
Wang, J. (2004). Sibship reconstruction from genetic data with typing errors.
Genetics, 166, 1968–1979.
Wilson, A., Sunnucks, P., and Barker, J. (2002). Isolation and characterization
of 20 polymorphic microsatellite loci for Scaptodrosophila hibisci. Molecular
Ecology Notes, 2, 242–244.