RESEARCH ARTICLE  Open Access
A fast least-squares algorithm for population
inference
R Mitchell Parry1 and May D Wang1,2,3*
Abstract
Background: Population inference is an important problem in genetics used to remove population stratification in
genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can
be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those
populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling
methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential
quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model
motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily
incorporates the degree of admixture within the sample of individuals and improves estimates without requiring
trial-and-error tuning.
Results: We show that the expected value of the least-squares solution across all possible genotype datasets is
equal to the true solution when part of the problem has been solved, and that the variance of the solution
approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these
theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and
difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a
greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real
population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than
least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual
genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of
each other. Significantly, the least-squares approach nearly always converges 1.5- to 6-times faster.
Conclusions: The computational advantage of the least-squares approach along with its good estimation
performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in
estimation performance between all algorithms decreases. In addition, when prior information is known, the
least-squares approach easily incorporates the expected degree of admixture to improve the estimate.
Background
The inference of population structure from the geno-
types of admixed individuals poses a significant problem
in population genetics. For example, genome-wide association studies (GWAS) compare the genetic makeup of
different individuals in order to extract differences in the
genome that may contribute to the development or
suppression of disease. Of particular interest are single
nucleotide polymorphisms (SNPs) that reveal genetic
changes at a single nucleotide in the DNA chain. When
a particular SNP variant is associated with a disease, this
may indicate that the gene plays a role in the disease
pathway, or that the gene was simply inherited from a
population that is more (or less) predisposed to the dis-
ease. Determining the inherent population structure within
a sample removes confounding factors before further ana-
lysis and reveals migration patterns and ancestry [1]. This
paper deals with the problem of inferring the proportion of
an individual’s genome originating from multiple ancestral
populations and the allele frequencies in these ancestral
populations from genotype data.

* Correspondence: maywang@bme.gatech.edu
1 The Wallace H. Coulter Department of Biomedical Engineering, Georgia
Institute of Technology and Emory University, Atlanta, GA 30332, USA
2 Parker H. Petit Institute of Bioengineering and Biosciences and Department
of Electrical and Computer Engineering, Georgia Institute of Technology,
Atlanta, GA 30332, USA
Full list of author information is available at the end of the article

© 2013 Parry and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

Parry and Wang BMC Bioinformatics 2013, 14:28
http://www.biomedcentral.com/1471-2105/14/28
Methods for revealing population structure are divided
into fast multivariate analysis techniques and slower
discrete admixture models [2]. Fast multivariate techniques
such as principal components analysis (PCA) [2-8] reveal
subspaces in the genome where large differences between
individuals are observed. For case–control studies, the lar-
gest differences commonly due to ancestry are removed to
reduce false positives [4]. Although PCA provides a fast so-
lution, it does not directly infer the variables of interest:
the population allele frequencies and individual admixture
proportions. On the other hand, discrete admixture models
that estimate these variables typically require much more
computation time. Following a recent trend toward faster
gradient-based methods, we propose a faster, simpler least-squares algorithm for estimating both the population allele
frequencies and individual admixture proportions.
Pritchard et al. [9] originally proposed a discrete admixture likelihood model based on the random union of
gametes for the purpose of population inference. In par-
ticular, their model assumes Hardy-Weinberg equilibrium
within the ancestral populations (i.e., allele frequencies are
constant) and linkage equilibrium between markers within
each population (i.e., markers are independent). Each indi-
vidual in the current sample is modeled as having some
fraction of their genome originating from each of the an-
cestral populations. The goal of population inference is to
estimate the ancestral population allele frequencies, P, and
the admixture of each individual, Q, from the observed
genotypes, G. If the population of origin for every allele, Z,
is known, then the population allele frequencies and the
admixture for each individual have a Dirichlet distribution.
If, on the other hand, P and Q are known, the population
of origin for each individual allele has a multinomial distri-
bution. Pritchard et al. infer populations by alternately
sampling Z from a multinomial distribution based on P
and Q; and P and Q from Dirichlet distributions based on
Z. Ideally, this Markov Chain Monte Carlo sampling
method produces independent identically distributed sam-
ples (P,Q) from the posterior distribution P(P,Q|G). The
inferred parameters are taken as the mean of the posterior.
This algorithm is implemented in an open-source software
tool called Structure [9].
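Under this model, the frequency of the reference allele at locus l for individual i is f_li = Σ_k p_lk q_ki, so the genotype is distributed as g_li ~ Binomial(2, f_li). A minimal sketch of drawing genotypes under these assumptions (Python/numpy stand-ins for the paper's notation; this is illustrative code, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 1000, 50, 3            # loci, individuals, populations

# Ancestral allele frequencies P (M x K), drawn uniformly for illustration.
P = rng.uniform(0.05, 0.95, size=(M, K))

# Admixture proportions Q (K x N): each column drawn from a symmetric
# Dirichlet with shape parameter alpha, so each column sums to 1.
alpha = 1.0
Q = rng.dirichlet(alpha * np.ones(K), size=N).T

# Expected reference-allele frequency per (locus, individual) pair.
F = P @ Q                         # M x N, entries in [0, 1]

# Each genotype counts reference alleles across two independent draws.
G = rng.binomial(2, F)            # M x N, entries in {0, 1, 2}
```

The assumption of linkage equilibrium appears here as the independent binomial draw at every locus; Hardy-Weinberg equilibrium appears as the two independent allele copies per genotype.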
The binomial likelihood model proposed by Pritchard
et al. was originally used for datasets of tens or hundreds of
loci. However, as datasets become larger, especially consid-
ering genome-wide association studies with thousands or
millions of loci, two problems emerge. For one, linkage
disequilibrium introduces correlations between markers.
Although Falush et al. [10] extended Structure to incorpor-
ate loose linkage between loci, larger datasets also pose a
computational challenge that has not been met by these
sampling-based approaches. This has led to a series of more
efficient optimization algorithms for the same likelihood
model with uncorrelated loci. This paper focuses on im-
proving computational performance, leaving the treatment
of correlated loci to future research.
Tang et al. [11] proposed a more efficient expectation
maximization (EM) approach. Instead of randomly sam-
pling from the posterior distribution, the FRAPPE EM
algorithm [11] starts with a randomly initialized Z, then
alternates between updating the values of P and Q for
fixed Z, and maximizing the likelihood of Z for fixed P
and Q. Their approach achieves similar accuracy to
Structure and requires much less computation time. Wu
et al. [12] specialized the EM algorithm in FRAPPE to
accommodate the model without admixture, and gener-
alized it to have different mixing proportions at each
locus. However, these EM algorithms estimate an un-
necessary and unobservable variable Z, something that
more efficient algorithms could avoid.
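The E-step/M-step structure described above can be sketched directly: each of the two allele copies at a (locus, individual) pair receives a soft assignment to a population of origin, and P and Q are then re-estimated from the expected counts. This is our reconstruction of an EM update in the spirit of FRAPPE, not the published implementation; numpy is assumed.

```python
import numpy as np

def frappe_em(G, K, n_iter=50, seed=0):
    """EM sketch: soft-assign each of the two allele copies at each
    (locus, individual) to a population of origin (the latent Z), then
    re-estimate P and Q from the expected counts."""
    rng = np.random.default_rng(seed)
    M, N = G.shape
    P = rng.uniform(0.2, 0.8, size=(M, K))   # allele frequencies, M x K
    Q = np.full((K, N), 1.0 / K)             # admixture proportions, K x N
    ref = G[:, :, None].astype(float)        # reference-allele copy counts
    alt = 2.0 - ref                          # alternate-allele copy counts
    for _ in range(n_iter):
        # E-step: posterior origin of reference and alternate copies.
        A = Q.T[None, :, :] * P[:, None, :]            # (M, N, K)
        B = Q.T[None, :, :] * (1.0 - P[:, None, :])
        A /= A.sum(axis=2, keepdims=True)
        B /= B.sum(axis=2, keepdims=True)
        exp_ref = ref * A                              # expected ref copies per pop
        exp_alt = alt * B                              # expected alt copies per pop
        # M-step: allele frequency = reference fraction of copies assigned
        # to k; admixture = fraction of an individual's 2M copies from k.
        P = exp_ref.sum(axis=1) / (exp_ref.sum(axis=1) + exp_alt.sum(axis=1))
        P = np.clip(P, 1e-6, 1 - 1e-6)                 # avoid degenerate 0/0
        Q = (exp_ref + exp_alt).sum(axis=0).T / (2.0 * M)
    return P, Q

# Smoke test on data simulated from the model itself.
rng = np.random.default_rng(11)
M, N, K = 300, 40, 2
P_true = rng.uniform(0.05, 0.95, size=(M, K))
Q_true = rng.dirichlet(np.ones(K), size=N).T
G = rng.binomial(2, P_true @ Q_true)
P_hat, Q_hat = frappe_em(G, K)
```

The (M, N, K) responsibility arrays make explicit why estimating Z is wasteful for large M and N, which motivates the algorithms that bypass it.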
Alexander et al. [13] proposed an even faster approach
for inferring P and Q using the same binomial likelihood
model but bypassing the unobservable variable Z. Their
closed-source software, Admixture, starts at a random
feasible solution for P and Q and then alternates be-
tween maximizing the likelihood function with respect
to P and then maximizing it with respect to Q. The likelihood is guaranteed not to decrease at each step, eventually converging to a local maximum or saddle point. For
a moderate problem of approximately 10000 loci, Ad-
mixture achieves comparable accuracy to Structure and
requires only minutes to execute compared to hours for
Structure [13].
Another feature of Structure’s binomial likelihood
model is that it allowed the user to input prior know-
ledge about the degree of admixture. The prior distribu-
tion for Q takes the form of a Dirichlet distribution with
a degree of admixture parameter, α, for every population.
For α = 0, all of an individual’s alleles originate from the
same ancestral population; for α > 0, individuals contain
a mixture of alleles from different populations; for α = 1,
every assignment of alleles to populations is equally
likely (i.e., the non-informative prior); and for α → ∞, all
individuals have equal contributions from every ancestral
population. Alexander et al. replace the population de-
gree of admixture parameter in Structure with two para-
meters, λ and γ, that when increased also decrease the
level of admixture of the resulting individuals. However,
the authors admit that tuning these parameters is non-
trivial [14].
This paper contributes to population inference research
by (1) proposing a novel least-squares simplification of the
binomial likelihood model that results in a faster algorithm,
and (2) directly incorporating the prior parameter α that
improves estimates without requiring trial-and-error tun-
ing. Specifically, we utilize a two-block coordinate descent
method [15] to alternately minimize the criterion for P and
then for Q. We adapt a fast non-negative least-squares
algorithm [16] to additionally include a sum-to-one con-
straint for Q and an upper-bound for P. We show that the
expected value for the estimates of P (or Q) across all possible genotype datasets is equal to the true values when Q (or P) is known and that the variance of this estimate
approaches zero as the problem size increases. Compared
to Admixture, the least-squares approach provides a slightly
worse estimate of P or Q when the other is known. How-
ever, when estimating P and Q from only the genotype
data, the least-squares approach sometimes provides better
estimates, particularly with a large number of populations,
small number of samples, or more admixed individuals.
The least-squares approximation provides a simpler and
faster algorithm, and we provide it as Matlab scripts on our
website.
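The two-block scheme above alternates a constrained least-squares solve for P with one for Q. The toy implementation below is ours, not the paper's Matlab scripts: it substitutes simple projections (clipping P to [0,1] and renormalizing the columns of Q) for the constrained non-negative least-squares solver adapted from [16], so it illustrates the structure of the algorithm rather than its exact updates.

```python
import numpy as np

def als_admixture(G, K, n_iter=200, seed=0):
    """Two-block coordinate descent sketch for G ≈ 2 P Q.

    G : (M, N) genotype matrix with entries in {0, 1, 2}.
    Returns P (M, K) in [0, 1] and Q (K, N) with columns on the simplex.
    Clip-and-renormalize stands in for the paper's constrained solver,
    so monotone descent is not guaranteed here.
    """
    rng = np.random.default_rng(seed)
    M, N = G.shape
    P = rng.uniform(0.2, 0.8, size=(M, K))
    Q = rng.dirichlet(np.ones(K), size=N).T
    H = G / 2.0                               # target for the product P @ Q
    for _ in range(n_iter):
        # P-step: row-wise least squares for fixed Q, then box-project.
        P = np.clip(H @ Q.T @ np.linalg.pinv(Q @ Q.T), 0.0, 1.0)
        # Q-step: column-wise least squares for fixed P, then a crude
        # simplex projection (clip and renormalize each column).
        Q = np.clip(np.linalg.pinv(P.T @ P) @ P.T @ H, 1e-9, None)
        Q /= Q.sum(axis=0, keepdims=True)
    return P, Q

# Smoke test on data simulated from the model.
rng = np.random.default_rng(42)
M, N, K = 400, 60, 2
P_true = rng.uniform(0.05, 0.95, size=(M, K))
Q_true = rng.dirichlet(np.ones(K), size=N).T
G = rng.binomial(2, P_true @ Q_true)
P_hat, Q_hat = als_admixture(G, K)
```

Each half-step reduces to K×K linear algebra plus a matrix product, which is where the speed advantage over likelihood-based updates comes from.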
Results
First, we motivate a least-squares simplification of the bino-
mial likelihood model by deriving the expected value and
covariance of the least-squares estimate across all possible
genotype matrices for partially solved problems. Second, we
compare least-squares to sequential quadratic program-
ming (Admixture’s optimization algorithm) for these cases.
Third, we compare Admixture, FRAPPE, and least-squares
using simulated datasets with a factorial design varying
dataset properties in G. Fourth, we compare Admixture
and least-squares using real population allele frequencies
from the HapMap Phase 3 project. Finally, we compare the
results of applying Admixture and least-squares to real data
from the HapMap Phase 3 project where the true popula-
tion structure is unknown.
The algorithms we discuss accept as input the number of
populations, K, and the genotypes, gli ∈ {0,1,2}, representing
the number of copies of the reference allele at locus l for
individual i. Then, the algorithms attempt to infer the
population allele frequencies, plk ∈ [0,1], for locus l and
population k, as well as the individual admixture proportions,
qki ∈ [0,1], where Σk qki = 1. In all cases, 1 ≤ l ≤ M,
1 ≤ i ≤ N, and 1 ≤ k ≤ K. Table 1 summarizes the matrix
notation.

Table 1 Matrix notation

Genotype matrix: G = [gli], an M × N matrix with
gli ∈ {0,1,2} the number of reference alleles at the
lth locus for the ith individual.
Empirical estimate and upper bound on total variance
To validate our derived bounds on the total variance (Equa-
tions 13, 17, 18 and 19), we generate simulated genotypes
from a known target for p = [0.1, 0.7]T. We simulate N in-
dividual genotypes using the full matrix Q with each col-
umn drawn from a Dirichlet distribution with shape
parameter α. We repeat the experiment 10000 times produ-
cing an independent and identically distributed genotype
each time. Each trial produces one estimate for p. We then
compute the mean and covariance of the estimates of p
and compare them to those predicted in the bounds. For
α = 1 and N = 100,
?
cov ^ p
ð Þ ¼
?0:0015
trace cov ^ p ½ ?
The bound using the sample covariance of q in Equation
13 provides the following:
?
trace cov ^ p ½ ?
mean ^ p
ð Þ ¼
0:0999
0:7002
0:0027
?
? 0:0015
0:0046
??
ð Þ ¼ 0:0073
ð1Þ
QQT¼
36:62
16:20
Þ≤0:0097
16:20
30:99
?
ð
ð2Þ
The bound using the properties of the Dirichlet distri-
bution in Equation 17 provides a bound of 0.01. As the
number of samples increases, the difference between the
bound and the asymptotic bound for the Dirichlet dis-
tributed q will approach zero.
Figure 1 plots the total variance (trace of the covariance matrix) for a variety of values of N and α using
the same target value for p. Because the expected value
of the estimate is equal to the true value of p, the total
variance is analogous to the sum of the squared error
(SSE) between the true p and its estimate. Clearly, the
total variance decreases with N. For N = 10000, the root
mean squared error falls below 1%.
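This Monte Carlo check can be reproduced in outline: holding Q fixed, each trial draws a fresh genotype row and solves the least-squares problem for p; across trials, the mean of the estimates should sit at the true p and the total variance should shrink as N grows. A sketch under the same setup (one locus, K = 2, α = 1, N = 100; numpy assumed), with the paper's constrained solver replaced by plain unconstrained least squares so that only the mean and total variance are meant to match:

```python
import numpy as np

rng = np.random.default_rng(7)
p_true = np.array([0.1, 0.7])        # target allele frequencies, K = 2
alpha, N, trials = 1.0, 100, 2000

Q = rng.dirichlet(alpha * np.ones(2), size=N).T   # one fixed 2 x N admixture
f = p_true @ Q                                     # per-individual frequency

estimates = np.empty((trials, 2))
for t in range(trials):
    g = rng.binomial(2, f)                         # fresh genotype row (N,)
    # Least-squares estimate of p from g ≈ 2 p Q: min_p ||g - 2 p Q||^2.
    estimates[t] = np.linalg.lstsq(2.0 * Q.T, g, rcond=None)[0]

mean_p = estimates.mean(axis=0)                    # should approach p_true
total_var = np.trace(np.cov(estimates.T))          # should shrink with N
```

With these settings the empirical mean lands near [0.1, 0.7] and the total variance is of the order reported in Equation (1) above.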
Table 1 (continued)

Population allele frequencies matrix: P = [plk], an
M × K matrix with 0 ≤ plk ≤ 1 the frequency of reference
alleles at the lth locus in the kth population.

Individual admixture matrix: Q = [qki], a K × N matrix
with qki ≥ 0 and Σk qki = 1 the fraction of the ith
individual's genome originating from the kth population.

M = number of loci (markers), 1 ≤ l ≤ M; N = number of
individuals, 1 ≤ i ≤ N; K = number of populations, 1 ≤ k ≤ K.
Intuitively, the error in the least-squares estimate for P
and Q decreases as the number of individuals and the num-
ber of loci increases, respectively. Figure 1 supports this no-
tion, suggesting that on very large problems for which the
gradient based and expectation maximization algorithms
were designed, the error in the least-squares estimate
approaches zero.
Comparing least-squares approximation to binomial
likelihood model
Given estimates of the population allele frequencies, early
research focused on estimating the individual admixture
[17]. We also note that the number of iterations and con-
vergence properties confound the comparison of iterative
algorithms. To avoid these problems and emulate a prac-
tical research scenario, we compare least-squares to se-
quential quadratic programming (used in Admixture)
when P or Q are known a priori. In this scenario, each al-
gorithm converges in exactly one step making it possible
to compare the underlying updates for P and Q independ-
ently. For N = 100, 1000, and 10000; and α = 0.1, 1, and 2;
we consider a grid of two-dimensional points for p, where
pi ∈ {0.05, 0.15, ..., 0.95}. For each trial, we first generate
a random Q such that every column is drawn from a
Dirichlet distribution with shape parameter, α. Then, we
randomly generate a genotype using Equation 11. We
compute the least-squares solution using Equation 27 and
use Matlab’s built-in function ‘fmincon’ to minimize the
negative of the log-likelihood in Equation 7, similar to
Admixture’s approach. We repeat the process for 1000
trials and aggregate the results.
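For the update of a single locus given Q, the binomial log-likelihood being maximized is ℓ(p) = Σi [gi ln fi + (2 − gi) ln(1 − fi)] with fi = Σk pk qki. The sketch below maximizes it by projected gradient ascent rather than sequential quadratic programming or Matlab's fmincon, so it mirrors the comparison only loosely; the step size and iteration count are arbitrary choices of ours (numpy assumed):

```python
import numpy as np

def loglik_grad(p, Q, g, eps=1e-9):
    """Gradient in p of the per-locus binomial log-likelihood
    sum_i [g_i log f_i + (2 - g_i) log(1 - f_i)], with f = p @ Q."""
    f = np.clip(p @ Q, eps, 1 - eps)
    # d/df of the summand is g/f - (2 - g)/(1 - f); chain rule via Q.
    return Q @ (g / f - (2 - g) / (1 - f))

rng = np.random.default_rng(3)
K, N = 2, 2000
p_true = np.array([0.3, 0.8])
Q = rng.dirichlet(np.ones(K), size=N).T
g = rng.binomial(2, p_true @ Q)

p = np.full(K, 0.5)                    # feasible starting point
for _ in range(500):
    # Gradient ascent step, projected back into the open box (0, 1).
    p = np.clip(p + 1e-4 * loglik_grad(p, Q, g), 1e-6, 1 - 1e-6)
```

The least-squares update replaces this iterative maximization with a single linear solve, which is the source of the speed difference discussed below.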
Figure 2 illustrates the root mean squared error in es-
timating p given the true value of Q. Both algorithms
present the same pattern of performance as a function
of p = [p1, p2]. Values of p near 0.5 present the most dif-
ficult scenarios. Positively correlated values (e.g., p1= p2)
present slightly less error than negatively correlated
values (e.g., p1= 1 – p2). Table 2 summarizes the per-
formance over all values of p for varying N and α. In all
cases, fmincon performs slightly better than least-
squares and both algorithms approach zero error as N
increases. We repeat this analysis for known values for P
and estimate q using the two approaches. Figure 3 illus-
trates the difference in performance for the two algo-
rithms as we vary q1 between 0.05 and 0.95 with
q2 = 1 − q1. Again, fmincon performs slightly better in all
cases but both approach zero as M increases. In the next
section we show that the additional error introduced by
the least-squares approximation to the objective function
remains small relative to the error introduced by the
characteristics of the genotype data.
Simulated experiments to compare least-squares to
Admixture and FRAPPE
In the previous sections, we consider the best-case sce-
nario where the true value of P or Q is known. In a real-
istic scenario, the algorithms must estimate both P and
Q from only the genotype information.

Figure 1 Bound on total variance. Solid and dashed lines correspond to the empirical estimate of the total variance and the upper bound for total variance, respectively.

Table 3
summarizes the results of a four-way analysis of variance
with 2-way interactions among experimental factors. By
far the factor with the most impact on performance is
the number of individuals, N. The degree of admixture,
α, and the number of populations, K, account for the second and third most variation, respectively. These
three factors and two-way interactions between them ac-
count for the vast majority of variation. In particular, the
choice of algorithm accounts for less than about 1% of
the variation in estimation performance. That is, when
estimating population structure from genotype data, the
number of samples, the number of populations, and
the degree of admixture play a much more important
role than the choice between least-squares, Admixture,
and FRAPPE and least-squares. However, as shown
in Figure 4, when considering the computation time
required by the algorithm, the choice of algorithm con-
tributes about 40% of the variation including interac-
tions. Therefore, for the range of population inference
problems described in this study, the choice of algorithm
plays a very small role in the estimation of P and Q but
a larger role in computation time.
Further exploration reveals that the preferred algo-
rithm depends on K, N, and α. Table 4 lists the root
mean squared error for the estimation of Q for all com-
binations of parameters across n = 50 trials. Out of the
36 scenarios, Admixture, least-squares, and FRAPPE per-
form significantly better than their peers 13, six, and
zero times, respectively; they perform insignificantly
worse than the best algorithm 30, 17, and 10 times, re-
spectively. The least-squares algorithm appears to perform well on the more difficult problems with combinations of large K, small N, or large α. Table 5 lists
the root mean squared error for estimating P. For N =
100, the algorithms do not perform significantly differ-
ently. For N = 10000, all algorithms perform with less
than 2.5% root mean squared error (RMSE). In all, Ad-
mixture performs significantly better than its peers 11
times out of 36. However, Admixture never performs
significantly worse than its peers. Least-squares and
FRAPPE perform insignificantly worse than Admixture
17 and 20 times out of 36, respectively. Table 6 sum-
marizes the timing results. Least-squares converges significantly faster in 34 out of 36 cases, with an insignificant
difference for the remaining two scenarios. FRAPPE con-
verges significantly slower in all scenarios. With two
exceptions, the least-squares algorithm provides a 1.5- to
5-times speedup.
Comparison on admixtures derived from the HapMap3
dataset
Tables 7 and 8 list the performance and computation
time for the least-squares approach and Admixture using
a convergence threshold of ε = 1.0e-4 and ε = 1.4e-3, re-
spectively. Each marker in the illustrations represents
one individual. A short black line emanating from each
Figure 2 Precision of best-case scenario for estimating P. Root mean squared error for different values of p using (a) Admixture's Sequential Quadratic Programming or (b) the least-squares approximation.

Table 2 Root mean squared error in P for known Q and K = 2

RMSE (%)   N = 100                  N = 1000                 N = 10000
           α=0.1  α=1.0  α=2.0     α=0.1  α=1.0  α=2.0     α=0.1  α=1.0  α=2.0
SQP        4.35   6.03   7.41      1.37   1.90   2.37      0.43   0.60   0.75
LS         4.37   6.16   7.68      1.38   1.93   2.40      0.44   0.61   0.76
marker indicates the offset from the original (correct)
position. For all simulations, the least-squares algorithms
perform within 0.1% of Admixture for estimating the true
population allele frequencies in P. For well-mixed popula-
tions in Simulation 1 and 2, the least-squares algorithms
perform comparably well or even better than Admixture.
However, for less admixed data in Simulations 3 – 6, Ad-
mixture provides better estimates of the true population
proportions depicted in the scatter plots. In all cases, the
least-squares algorithms perform within 1.5% of Admixture
and between about 2- and 3-times faster than Admixture.
The apparent advantage of Admixture involves indivi-
duals on the periphery of the unit simplex defining the
space of Q. In Table 7, this corresponds to individuals
on the boundary of the right triangle defined by the
x-axis, y-axis, and y = 1 – x diagonal line. For Simulation
1, the original Q contains very few individuals on the
boundary, Admixture estimates far more on the boundary, and the least-squares estimate was closer to the ground truth.
For Simulations 2–6, the ground truth contains more
individuals on the boundary, Admixture correctly esti-
mates these boundary points but the least-squares
Figure 3 Precision of best-case scenario for estimating Q. Solid and dashed lines correspond to Admixture's Sequential Quadratic Programming optimization and the least-squares approximation, respectively.

Table 3 Sources of variation in root mean squared error (ANOVA)

Factors and       Error variance for P      Error variance for Q      Time variance
interactions      SSE (×10^-2)  Percent     SSE (×10^-4)  Percent     SSE (×10^4)  Percent
K                 59.0          8.2         44.0          3.9         58.7         3.2
N                 519.6         72.4        376.2         33.0        585.5        32.2
α                 63.1          8.8         341.1         29.9        33.2         1.8
Algorithm         0.1           0.0         1.7           0.1         266.3        14.6
K × N             32.1          4.5         32.6          2.9         98.2         5.4
K × α             9.0           1.3         8.2           0.7         4.4          0.2
K × Algorithm     0.0           0.0         0.4           0.0         55.1         3.0
N × α             29.1          4.1         282.6         24.8        58.8         3.2
N × Algorithm     0.0           0.0         2.1           0.2         445.6        24.5
α × Algorithm     0.2           0.0         8.4           0.7         10.5         0.6
Error             5.7           0.8         43.2          3.8         204.4        11.2
Total             717.9         100.0       1140.4        100.0       1820.4       100.0
algorithms predict fewer points on the boundary. Simu-
lation 6 provides the most obvious example where Ad-
mixture estimates individuals exactly on the boundary
and least-squares contains a jumble of individuals near
but not exactly on the line.
Real dataset from the HapMap phase 3 project
Over 20 repeated trials, Admixture converged in an
average of 42.1 seconds with standard deviation of 9.1
seconds, and the least-squares approach converged in
33.6 seconds with a standard deviation of 9.8 seconds.
Figure 5 illustrates the inferred population proportions
for one run. The relative placement of individuals from
each known population is qualitatively similar. The two
methods differ at extreme points such as those values of
q1, q2, or 1 − q1 − q2 that are near zero. The Admixture
solution has more individuals on the boundary and the
least-squares approach has fewer. Although we cannot
estimate the error of these estimates because the real
world data has no ground truth, we can compare their
results quantitatively. The Admixture and the least-
squares solution differed by an average of 1.2% root
mean squared difference across the 20 trials. We esti-
mate α = 0.12 from the Admixture solution’s total vari-
ance using Equation 31. This roughly corresponds to the
simulated experiment with three populations, 100 sam-
ples, and a degree of admixture of 0.1. In that case, Ad-
mixture and least-squares exhibited a very small root
mean squared error of 0.62% and 0.74%, respectively
(Table 4).
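A caveat when comparing two inferred Q matrices, as in the 1.2% root mean squared difference above, is that population labels are arbitrary (label switching), so the difference is only meaningful after matching labels between the two solutions. The permutation-matching step below is our assumption about how such a comparison can be made, not necessarily the paper's exact procedure (numpy assumed):

```python
import numpy as np
from itertools import permutations

def rms_difference(Q_a, Q_b):
    """Smallest root-mean-square difference between two K x N admixture
    estimates over all permutations of the K population labels."""
    K = Q_a.shape[0]
    best = np.inf
    for perm in permutations(range(K)):
        d = np.sqrt(np.mean((Q_a[list(perm)] - Q_b) ** 2))
        best = min(best, d)
    return best

# Identical solutions with shuffled labels have zero RMS difference.
rng = np.random.default_rng(9)
Q1 = rng.dirichlet(np.ones(3), size=50).T    # 3 x 50 admixture matrix
Q2 = Q1[[2, 0, 1]]                           # same solution, relabeled
```

Exhaustive search over K! permutations is fine for the small K considered here; for larger K an assignment solver would be the usual substitute.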
Discussion
This work contributes to the population inference litera-
ture by providing a novel simplification of the binomial
likelihood model that improves the computational effi-
ciency of discrete admixture inference. This approxima-
tion results in an inference algorithm based on minimizing
the squared distance between the genotype matrix G and
twice the product of the population allele frequencies and
individual admixture proportions: 2PQ. This Euclidean
distance-based interpretation aligns with previous results
employing multivariate statistics. For example, researchers
have found success using principal component analysis to
reveal and remove stratification [2-4] or even to reveal
clusters of individuals in subpopulations [5-7]. Recently,
McVean [5] proposed a genealogical interpretation of principal component analysis and used it to reveal information about migration, geographic isolation, and admixture. In
particular, given two populations, individuals cluster along
the first principal component. Admixture proportion is the
fractional distance between the two population centers.
However, these cluster centers must be known or inferred in
order to estimate ancestral population allele frequencies.
The least-squares approach infers these estimates effi-
ciently and directly.
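McVean's observation can be reproduced on simulated data: with two ancestral populations, each individual's score on the first principal component is nearly an affine function of its admixture proportion. A sketch (numpy only; the population allele frequencies are drawn uniformly purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 2000, 100

# Two ancestral populations with independently drawn allele frequencies.
P = np.stack([rng.uniform(0.05, 0.95, M), rng.uniform(0.05, 0.95, M)], axis=1)
q1 = rng.uniform(0.0, 1.0, N)        # admixture proportion from population 1
Q = np.stack([q1, 1.0 - q1])
G = rng.binomial(2, P @ Q).astype(float)

# PCA via SVD of the column-centered genotype matrix.
X = G - G.mean(axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]                          # each individual's score on PC1

# PC1 is (up to sign and scale) the admixture proportion.
r = np.corrcoef(pc1, q1)[0, 1]
```

Given the two cluster centers at the extremes of PC1, an individual's admixture proportion is its fractional distance between them, which is exactly the interpretation that still leaves the ancestral allele frequencies to be inferred.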
Typically, discrete admixture models employ a binomial
likelihood function rather than a Euclidean distance-based
one. Pritchard et al. detail one such model and use a slow
sampling based approach to infer the admixed ancestral
populations for individuals in a sample [9]. Recognizing the
performance advantage of maximizing the likelihood rather
than sampling the posterior, Tang et al. proposed an expectation maximization algorithm and Alexander et al. [13]
proposed a sequential quadratic programming (SQP) ap-
proach using the same likelihood function [9]. We take this
approach a step further by simplifying the model proposed
by Pritchard et al. to introduce a least-squares criterion. By
justifying the least-squares simplification, we connect the
fast and practical multivariate statistical approaches to the
theoretically grounded binomial likelihood model. We val-
idate our approach on a variety of simulated and real
datasets.
First, we show that if the true value of P (or Q) is
known, the expected value of the least squares solution
for Q (or P) across all possible genotype matrices is equal
to the true value, and the variance of this estimate
decreases with M (or N). In this best-case scenario, we
show that SQP provides a slightly better estimate than the
Figure 4 Computational timing comparison. Box plots show the median (red line) and inter-quartile range (blue box) for computation time on a logarithmic scale using (a) N=1000, α=0.5, and varying K; (b) K=4, α=0.5, and varying N; and (c) K=4, N=1000, and varying α.
least-squares solution for a variety of problem sizes and
difficulty. For more common scenarios where the algo-
rithms must estimate P and Q using only the genotype in-
formation in G, we show that for particularly difficult
problems with small N, large K, or large α, the least-squares approach often performs better than its peers. For
about one-third of the parameter sets, Admixture per-
forms significantly better than least-squares and FRAPPE
but all algorithms approach zero error as N becomes very
large. In addition, the error introduced by the choice of
Table 4 Root mean squared error for Q

K  N      α     AD    LS    FRAPPE  Significance   LSα
2  100    0.10  0.48  0.72  0.52    AD = FR < LS   0.64
2  100    0.50  1.12  1.13  1.03    FR = AD = LS   1.18
2  100    1.00  2.22  2.22  2.29    AD = LS = FR   2.22
2  100    2.00  4.13  4.11  4.50    LS = AD = FR   3.84
2  1000   0.10  0.57  0.97  0.63    AD < FR < LS   0.74
2  1000   0.50  0.69  0.74  0.71    AD < FR < LS   0.74
2  1000   1.00  0.86  0.91  1.00    AD < LS < FR   0.91
2  1000   2.00  1.58  1.65  2.33    AD = LS < FR   0.93
2  10000  0.10  0.59  1.03  0.61    AD < FR < LS   0.76
2  10000  0.50  0.70  0.81  0.72    AD < FR < LS   0.73
2  10000  1.00  0.74  0.77  0.79    AD < LS < FR   0.77
2  10000  2.00  0.89  0.97  1.32    AD < LS < FR   0.96
3  100    0.10  0.62  0.74  0.63    AD = FR < LS   0.66
3  100    0.50  2.01  1.81  2.00    LS < FR = AD   1.91
3  100    1.00  3.49  3.23  3.60    LS < AD = FR   3.23
3  100    2.00  5.77  5.39  5.89    LS < AD = FR   5.00
3  1000   0.10  0.68  1.15  0.73    AD < FR < LS   0.76
3  1000   0.50  0.85  0.88  0.89    AD < LS = FR   0.93
3  1000   1.00  1.18  1.17  1.35    LS = AD < FR   1.17
3  1000   2.00  1.94  1.92  2.49    LS = AD < FR   1.20
3  10000  0.10  0.74  1.26  0.76    AD < FR < LS   0.79
3  10000  0.50  0.87  0.97  0.87    AD = FR < LS   0.87
3  10000  1.00  0.89  0.92  0.95    AD < LS < FR   0.92
3  10000  2.00  1.07  1.09  1.49    AD < LS < FR   1.09
4  100    0.10  0.79  0.76  0.80    LS = AD = FR   0.77
4  100    0.50  2.81  2.40  2.85    LS < AD = FR   2.56
4  100    1.00  4.43  4.01  4.55    LS < AD = FR   4.01
4  100    2.00  6.63  6.13  6.81    LS < AD = FR   5.65
4  1000   0.10  0.73  1.17  0.74    AD = FR < LS   0.72
4  1000   0.50  0.95  0.95  1.00    LS = AD < FR   1.07
4  1000   1.00  1.34  1.32  1.47    LS = AD < FR   1.32
4  1000   2.00  2.09  2.06  2.50    LS = AD < FR   1.32
4  10000  0.10  0.84  1.33  0.84    AD = FR < LS   0.74
4  10000  0.50  0.96  1.03  0.96    AD = FR < LS   0.95
4  10000  1.00  0.97  0.99  1.03    AD < LS < FR   0.99
4  10000  2.00  1.14  1.15  1.51    AD = LS < FR   1.15

‘AD’ = Admixture with ε = MN×10^-4, ‘LS’ = least-squares with ε = MN×10^-4 and
α = 1, ‘FR’ = FRAPPE with ε = 1. Bold values indicate significantly less error
than those without bold. ‘<’ indicates significantly less at the 4.6e-4 level, and ‘=’
indicates an insignificant difference. ‘LSα’ = least-squares with the correct α, provided
only for reference.
Table 5 Root mean squared error for P

K  N      α     AD     LS     FRAPPE  Significance   LSα
2  100    0.10  4.33   4.37   4.33    AD = FR = LS   4.36
2  100    0.50  5.13   5.17   5.14    AD = FR = LS   5.17
2  100    1.00  5.99   6.03   5.99    AD = FR = LS   6.03
2  100    2.00  7.24   7.28   7.29    AD = LS = FR   7.25
2  1000   0.10  1.37   1.42   1.38    AD < FR < LS   1.39
2  1000   0.50  1.62   1.65   1.63    AD = FR < LS   1.65
2  1000   1.00  1.90   1.93   1.92    AD < FR = LS   1.93
2  1000   2.00  2.52   2.58   2.82    AD = LS < FR   2.38
2  10000  0.10  0.46   0.57   0.46    AD < FR < LS   0.48
2  10000  0.50  0.52   0.56   0.53    AD < FR < LS   0.52
2  10000  1.00  0.60   0.61   0.62    AD < LS < FR   0.61
2  10000  2.00  0.81   0.87   1.14    AD < LS < FR   0.92
3  100    0.10  5.58   5.64   5.58    AD = FR = LS   5.62
3  100    0.50  7.37   7.42   7.38    AD = FR = LS   7.42
3  100    1.00  9.05   9.06   9.06    AD = FR = LS   9.06
3  100    2.00  11.36  11.33  11.39   LS = AD = FR   11.30
3  1000   0.10  1.78   1.87   1.78    AD = FR < LS   1.80
3  1000   0.50  2.35   2.40   2.35    AD = FR < LS   2.39
3  1000   1.00  2.97   3.00   3.01    AD < LS = FR   3.00
3  1000   2.00  4.11   4.14   4.41    AD = LS < FR   3.89
3  10000  0.10  0.61   0.82   0.62    AD < FR < LS   0.61
3  10000  0.50  0.78   0.84   0.78    AD = FR < LS   0.76
3  10000  1.00  0.93   0.95   0.98    AD < LS < FR   0.95
3  10000  2.00  1.35   1.36   1.82    AD = LS < FR   1.49
4  100    0.10  6.83   6.90   6.84    AD = FR = LS   6.87
4  100    0.50  9.61   9.63   9.62    AD = FR = LS   9.62
4  100    1.00  11.90  11.89  11.92   LS = AD = FR   11.89
4  100    2.00  14.94  14.89  15.01   LS = AD = FR   14.89
4  1000   0.10  2.16   2.28   2.16    AD = FR < LS   2.17
4  1000   0.50  3.10   3.15   3.11    AD = FR < LS   3.15
4  1000   1.00  4.04   4.06   4.08    AD < LS = FR   4.06
4  1000   2.00  5.61   5.62   5.88    AD = LS < FR   5.36
4  10000  0.10  0.76   1.02   0.77    AD = FR < LS   0.71
4  10000  0.50  1.04   1.11   1.04    AD = FR < LS   1.01
4  10000  1.00  1.28   1.30   1.33    AD < LS < FR   1.30
4  10000  2.00  1.87   1.87   2.36    AD = LS < FR   2.06

‘AD’ = Admixture with ε = MN×10^-4, ‘LS’ = least-squares with ε = MN×10^-4 and
α = 1, ‘FR’ = FRAPPE with ε = 1. Bold values indicate significantly less error
than those without bold. ‘<’ indicates significantly less at the 4.6e-4 level, and ‘=’
indicates an insignificant difference. ‘LSα’ = least-squares with the correct α, provided
only for reference.
Parry and Wang BMC Bioinformatics 2013, 14:28
http://www.biomedcentral.com/1471-2105/14/28
Page 8 of 17
algorithms was relatively small compared to other charac-
teristics of the experiment such as sample size, number of
populations, and the degree of admixture in the sample.
That is, improving accuracy has more to do with improving
the dataset than with selecting the algorithm, suggesting
that algorithm selection may depend on other criteria such
as its speed. In nearly all cases, the least-squares method computes its solution faster, typically 1.5 to 5 times faster. At the current problem size involving about 10000 loci, this speed improvement may justify the use of least-squares
algorithms. For a single point estimate, researchers may
prefer a slightly more accurate algorithm at the cost of sec-
onds or minutes. For researchers testing several values of K
and α and using multiple runs to gauge the fitness of each
parameter set, or those estimating standard errors [13], the
speed improvement could be the difference between hours
and days of computation. As the number of loci increases to hundreds of thousands or even millions, speed may be
more important. The least-squares approach offers an alter-
native simpler and faster algorithm for population inference
that provides qualitatively similar results.
The key speed advantage of the least-squares approach
comes from a single nonnegative least-squares update that
minimizes a quadratic criterion for P and then for Q per it-
eration. Admixture, on the other hand, minimizes several
quadratic criteria sequentially as it fits the true binomial
model. Although the least-squares algorithm completes
each update in less time and is guaranteed to converge to a
local minimum or saddle point, predicting the number of
iterations to convergence presents a challenge. We provide
empirical timing results and note that selecting a suitable
stopping criterion for these iterative methods can change
the timing and accuracy results. For comparison, we use
the same stopping criterion with published thresholds for
Admixture and FRAPPE [13], and a threshold of MN×10-10
for least-squares.
This work is motivated in part by the desire to analyze
larger genotype datasets. In this paper, we focus on the
computational challenges of analyzing very large num-
bers of markers and individuals. However, linkage dis-
equilibrium introduces correlations between loci that
cannot be avoided in very large datasets. Large datasets
can be pruned to diminish the correlation between loci.
For example, Alexander et al. prune the HapMap phase
3 dataset from millions of SNPs down to around 10000
to avoid correlations. In this study, we assume linkage
equilibrium and therefore uncorrelated markers and
limit our analysis to datasets less than about 10000
SNPs. Incorporating linkage disequilibrium in gradient-
based optimizations of the binomial likelihood model
remains an open problem.
Estimating the number of populations K from the
admixed samples continues to pose a difficult challenge
for clustering algorithms in general and population in-
ference in particular. In practice, experiments can be
designed to include individual samples that are expected
to be distributed close to their ancestors. For example,
Tang et al. [11] suggested using domain knowledge to
collect an appropriate number of pseudo-ancestors that
Table 6 Computation time
K  N      α     AD       LS       FRAPPE    Significance   LSα
2  100    0.10  4.71     1.00     9.97      LS < AD < FR   0.77
2  100    0.50  4.69     1.16     8.22      LS < AD < FR   1.12
2  100    1.00  5.46     1.78     8.31      LS < AD < FR   1.77
2  100    2.00  6.25     2.37     10.40     LS < AD < FR   2.55
2  1000   0.10  43.37    11.87    136.88    LS < AD < FR   8.06
2  1000   0.50  51.70    13.98    112.41    LS < AD < FR   12.34
2  1000   1.00  62.00    24.43    118.90    LS < AD < FR   24.03
2  1000   2.00  83.07    51.33    195.43    LS < AD < FR   48.43
2  10000  0.10  447.68   142.14   1963.83   LS < AD < FR   93.61
2  10000  0.50  570.12   209.39   1908.72   LS < AD < FR   157.44
2  10000  1.00  687.88   352.24   2242.18   LS < AD < FR   349.51
2  10000  2.00  1037.45  796.83   3762.70   LS < AD < FR   406.63
3  100    0.10  6.10     1.84     15.29     LS < AD < FR   1.48
3  100    0.50  6.42     2.05     15.75     LS < AD < FR   1.90
3  100    1.00  7.19     2.71     16.78     LS < AD < FR   2.74
3  100    2.00  9.00     4.01     19.80     LS < AD < FR   4.24
3  1000   0.10  69.41    18.32    223.32    LS < AD < FR   12.53
3  1000   0.50  78.73    24.10    264.85    LS < AD < FR   21.42
3  1000   1.00  96.89    38.06    305.50    LS < AD < FR   36.63
3  1000   2.00  121.45   60.79    355.51    LS < AD < FR   55.54
3  10000  0.10  791.36   155.56   3256.83   LS < AD < FR   121.19
3  10000  0.50  883.99   301.52   4251.68   LS < AD < FR   264.77
3  10000  1.00  1175.25  617.80   5111.92   LS < AD < FR   578.42
3  10000  2.00  1506.20  1404.27  7052.33   LS < AD < FR   901.56
4  100    0.10  8.06     2.45     23.93     LS < AD < FR   2.00
4  100    0.50  8.78     2.66     26.56     LS < AD < FR   2.72
4  100    1.00  10.03    3.70     30.89     LS < AD < FR   3.43
4  100    2.00  12.94    5.00     37.26     LS < AD < FR   4.86
4  1000   0.10  81.72    17.32    386.11    LS < AD < FR   13.45
4  1000   0.50  99.92    24.37    433.17    LS < AD < FR   22.68
4  1000   1.00  117.71   36.94    508.49    LS < AD < FR   36.01
4  1000   2.00  156.39   58.02    564.57    LS < AD < FR   57.62
4  10000  0.10  879.95   229.06   5798.15   LS < AD < FR   176.27
4  10000  0.50  1170.97  480.99   7051.69   LS < AD < FR   505.45
4  10000  1.00  1555.90  1017.41  8108.08   LS < AD < FR   1051.81
4  10000  2.00  2202.08  2538.54  10445.75  AD = LS < FR   1308.79
'AD' = Admixture with ε = MN×10-4, 'LS' = Least-squares with ε = MN×10-4 and α = 1, 'FR' = FRAPPE with ε = 1. Bold values in the original table indicate significantly less time than those without bold. '<' indicates significantly less at the 4.6e-4 level, and '=' indicates an insignificant difference. 'LSα' = Least-squares with the correct α, provided only for reference.
reveal allele frequencies of the ancestral populations.
The number of groups considered provides a convenient
starting point for K. Lacking domain knowledge, compu-
tational approaches can be used to try multiple reasonable
values for K and evaluating their fitness. For example,
Pritchard et al. [9] estimated the posterior distribution of
K and select the most probable K. Another approach is to
evaluate the consistency of inference for different values
of K. If the same value of K leads to very different infer-
ences of P and Q from different random starting points,
the inference can be considered inconsistent. Brunet et al.
[18] proposed this method of model selection called con-
sensus clustering.
For realistic population allele frequencies, P, from the
HapMap Phase 3 dataset and very little admixture in Q,
Admixture provides better estimates of Q. The key advan-
tage of Admixture appears to be for individuals containing
nearly zero contribution from one or more inferred popula-
tions, whereas the least-squares approach performs better
when the individuals are well-mixed. Visually, both
approaches reveal population structure. Using the two
approaches to infer three ancestral populations from four
Table 7 Simulation experiments (1–3) using realistic population allele frequencies from the HapMap phase 3 project

                Simulation 1: q ~ Dir(1,1,1)           Simulation 2: q ~ Dir(.5,.5,.5)        Simulation 3: q ~ Dir(.1,.1,.1)
                RMSE P (%)   RMSE Q (%)   Time (s)     RMSE P (%)   RMSE Q (%)   Time (s)     RMSE P (%)   RMSE Q (%)   Time (s)
AD (ε=1e-4)     2.50 ± 0.04  2.19 ± 0.11  105 ± 13     1.99 ± 0.02  1.44 ± 0.04  88 ± 9       1.54 ± 0.01  0.76 ± 0.02  86 ± 7
AD (ε=1.4e-3)   2.50 ± 0.04  2.19 ± 0.11  98 ± 13      1.99 ± 0.02  1.44 ± 0.04  87 ± 11      1.54 ± 0.01  0.76 ± 0.02  83 ± 9
LS1 (ε=1.4e-3)  2.51 ± 0.03  1.85 ± 0.07  51 ± 6       2.04 ± 0.02  1.43 ± 0.04  37 ± 8       1.63 ± 0.01  1.75 ± 0.05  27 ± 5
LSα (ε=1.4e-3)  2.51 ± 0.03  1.85 ± 0.07  54 ± 8       2.03 ± 0.02  1.53 ± 0.04  28 ± 4       1.57 ± 0.01  1.08 ± 0.02  15 ± 4
Values are mean ± standard deviation. (The original table also displays scatter plots of the Original, Admixture, Least-squares (α=1), and Least-squares with α solutions for each simulation.)
HapMap Phase 3 sampling populations reveals qualitatively
similar results.
We believe the computational advantage of the least-
squares approach along with its good estimation perform-
ance warrants further research especially for very large
datasets. For example, we plan to adapt and apply the
least-squares approach to datasets utilizing microsatellite
data rather than SNPs and consider the case of more than
two alleles per locus. Researchers have incorporated geo-
spatial information into sampling-based [19] and PCA-
based [8] approaches. Multiple other extensions to
sampling-based or PCA-based algorithms have yet to be
incorporated into faster gradient-based approaches.
Conclusion
This paper explores the utility of a least-squares
approach for the inference of population structure in
genotype datasets. Whereas previous Euclidean distance-
based approaches received little theoretical justification,
we show that a least-squares approach is the result of a
first-order approximation of the negative log-likelihood
function for the binomial generative model. In addition,
Table 8 Simulation experiments (4–6) using realistic population allele frequencies from the HapMap phase 3 project

                Simulation 4: q ~ Dir(.2,.2,.05)       Simulation 5: q ~ Dir(.2,.2,.5)        Simulation 6: q ~ Dir(.05,.05,.01)
                RMSE P (%)   RMSE Q (%)   Time (s)     RMSE P (%)   RMSE Q (%)   Time (s)     RMSE P (%)   RMSE Q (%)   Time (s)
AD (ε=1e-4)     2.01 ± 0.05  0.87 ± 0.02  94 ± 12      1.98 ± 0.03  1.16 ± 0.03  93 ± 17      1.96 ± 0.07  0.53 ± 0.02  91 ± 9
AD (ε=1.4e-3)   2.01 ± 0.05  0.87 ± 0.02  82 ± 5       1.98 ± 0.03  1.16 ± 0.03  86 ± 13      1.96 ± 0.07  0.53 ± 0.02  82 ± 7
LS1 (ε=1.4e-3)  2.09 ± 0.05  1.70 ± 0.05  31 ± 7       2.06 ± 0.03  1.60 ± 0.04  34 ± 5       2.04 ± 0.07  2.00 ± 0.04  27 ± 7
LSα (ε=1.4e-3)  2.05 ± 0.05  1.17 ± 0.03  17 ± 3       2.02 ± 0.04  1.34 ± 0.04  24 ± 4       1.99 ± 0.07  1.09 ± 0.03  14 ± 3
Values are mean ± standard deviation. (The original table also displays scatter plots of the Original, Admixture, Least-squares (α=1), and Least-squares with α solutions for each simulation.)
we show that the error in this approximation approaches
zero as the number of samples (individuals and loci)
increases. We compare our algorithm to state-of-the-art
algorithms, Admixture and FRAPPE, for optimizing the
binomial likelihood model, and show that our approach
requires less time and performs comparably well. We
provide both quantitative and visual comparisons that il-
lustrate the advantage of Admixture at estimating indivi-
duals with little admixture, and show that our approach
infers qualitatively similar results. Finally, we incorporate
a degree of admixture parameter that improves estimates
for known levels of admixture without requiring add-
itional parameter tuning as is the case for Admixture.
Methods
The algorithms we discuss accept the number of popula-
tions, K, and an M × N genotype matrix, G, as input:

$$G = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1} & g_{M2} & \cdots & g_{MN} \end{bmatrix} \tag{3}$$

where g_li ∈ {0, 1, 2} represents the number of copies of
the reference allele at the lth locus for the ith individual,
M is the number of markers (loci), and N is the number
of individuals. Given the genotype matrix, G, the algo-
rithms attempt to infer the population allele frequencies
and the individual admixture proportions. The matrix P
contains the population allele frequencies:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{M1} & p_{M2} & \cdots & p_{MK} \end{bmatrix} \tag{4}$$

where 0 ≤ p_lk ≤ 1 represents the fraction of reference
alleles out of all alleles at the lth locus in the kth population.
The matrix Q contains the individual admixture proportions:

$$Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1N} \\ q_{21} & q_{22} & \cdots & q_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ q_{K1} & q_{K2} & \cdots & q_{KN} \end{bmatrix} \tag{5}$$

where 0 ≤ q_ki ≤ 1 represents the fraction of the ith
individual's genome originating from the kth population
and, for all i, Σ_k q_ki = 1. Table 1 summarizes the matrix
notation we use.
Likelihood function
Alexander et al. model the genotype (i.e., the number
of reference alleles at a particular locus) as the result
of two draws from a binomial distribution [13]. In the
generative model, each allele copy for one individual
at one locus has an equal chance, mli, of receiving the
reference allele:
$$m_{li} = \sum_{k=1}^{K} p_{lk} q_{ki} \tag{6}$$

The log-likelihood of the parameters P and Q from
the original Structure binomial model, ignoring an
additive constant, is the following [13]:

$$\mathcal{L}(M) = \sum_{l=1}^{M} \sum_{i=1}^{N} \left[ g_{li} \ln m_{li} + (2 - g_{li}) \ln (1 - m_{li}) \right] \tag{7}$$
[Figure 5 appears here: two scatter panels, (a) Admixture (ε = 0.0001) and (b) least-squares (ε = 0.0001), plotting q1 versus q2 for ASW, CEU, MEX, and YRI individuals.]
Figure 5 Comparison on HapMap Phase 3 dataset. Inferred population membership proportions using (a) Admixture and (b) least-squares with α=1.
Each point represents a different individual among the four populations: ASW, CEU, MEX, and YRI. The axes represent the proportion of each individual's
genome originating from each inferred population. The proportion belonging to the third inferred population is given by q3 = 1 − q1 − q2.
To see the effect on gradient-based optimization, we
also present the derivative of the likelihood with respect
to a particular mli:
$$\frac{\partial}{\partial m_{li}} \mathcal{L}(M) = \frac{g_{li} - 2m_{li}}{m_{li}(1 - m_{li})} \approx 4 (g_{li} - 2m_{li}) \tag{8}$$

In order to achieve a least-squares criterion, we must approximate
this derivative with a line. Figure 6 plots this derivative
with respect to m_li for the three possible values of
g_li (0, 1, or 2). To avoid biasing the approximation toward high
or low values of m_li, we approximate the derivative with its
first-order Taylor approximation in the neighborhood of
m_li = 1/2. More complex optimizations might update the
neighborhood of the Taylor approximation during the
optimization. In the interest of simplicity, we select one
neighborhood for all iterations, genotypes, individuals, and
loci. The following least-squares objective function has the
approximated derivative in the above equation:

$$-\mathcal{L}(M) \approx \sum_{l=1}^{M} \sum_{i=1}^{N} (2m_{li} - g_{li})^2 = \left\| 2M - G \right\|_2^2 \tag{9}$$

The right-hand side of Equation 9 provides the least-squares
criterion. Figure 6 shows the deviation between
the linear approximation and the true slope. Values
match closely for 0.35 ≤ m_li ≤ 0.65, but as m_li approaches
zero or one the true slope diverges for two of the three
genotypes. Therefore, we have the following least-squares
optimization problem:

$$\arg\min_{P,Q} \left\| 2PQ - G \right\|_2^2, \quad \text{such that} \quad \begin{cases} 0 \le P \le 1 \\ Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1 \end{cases} \tag{10}$$
Bounded error for the least-squares approach
We justify the least-squares approach by showing that
the expected value across all genotypes is equal to the
true value in the binomial likelihood model, and that the
covariance approaches zero as the size of the data
increases. In order to analyze the least squares perform-
ance across all possible genotype matrices, we consider
the generative model for G. Given the true ancestral
population allele frequencies, P, and the proportion of
each individual’s alleles originating from each popula-
tion, Q, the genotype at locus l for individual i is a bino-
mial random variable, gli:
gli∼Binomial 2;mli
mli¼ ΣK
If M was directly observable, we could solve for P or
Q given the other using P = MQ#or Q = P#M, where #
is the Moore-Penrose pseudo-inverse. However, we only
observe the elements of G which is only partially in-
formative of M. First we consider the uncertainty in esti-
mating P. Each gliis an independent random variable
with the following mean and bound on the variance:
ðÞ
k¼1p1kqki
ð11Þ
E gli
½ ? ¼ 2mli
var gli
½ ? ??
1
2
ð12Þ
Mean and total variance of the estimate of p
For ease of notation, we focus on one locus at index l: one
row of P, p̂ = [p̂_l1, p̂_l2, ..., p̂_lK]^T, and one row of G,
g = [g_l1, g_l2, ..., g_lN]^T. We estimate the mean and
covariance, and provide a bound on the total variance of the
estimate:

$$\hat{p} = \tfrac{1}{2} (QQ^T)^{-1} Q g, \qquad \mathrm{E}[\hat{p}] = p, \qquad \mathrm{cov}[\hat{p}] = \tfrac{1}{4} (QQ^T)^{-1} Q \, \mathrm{cov}[g] \, Q^T (QQ^T)^{-1}, \qquad \mathrm{trace}\big(\mathrm{cov}[\hat{p}]\big) \le \tfrac{1}{8} \mathrm{trace}\big( (QQ^T)^{-1} \big) \tag{13}$$
Intuitively, QQTscales linearly with N and we expect
the bound on the trace to decrease linearly with N. If
Figure 6 First-order approximation for slope of log-likelihood of m. Solid and dashed lines correspond to the true and approximated slope,
respectively. The red, green, and blue lines correspond to g = 0, g = 1, and g = 2, respectively.
the columns, q, of Q are independent and identically
distributed, QQ^T approaches N×E[qq^T], resulting in a
bound that decreases linearly with N:

$$\mathrm{trace}\big(\mathrm{cov}[\hat{p}]\big) \le \frac{1}{8N} \mathrm{trace}\Big( \big(\mathrm{E}[qq^T]\big)^{-1} \Big) \tag{14}$$

To put this bound in more familiar terms, we consider
q drawn from a Dirichlet distribution with shape parameter
α, resulting in the following (shown for K = 2):

$$\mathrm{E}[qq^T] = \frac{1}{4\alpha + 2} \begin{bmatrix} \alpha + 1 & \alpha \\ \alpha & \alpha + 1 \end{bmatrix} \tag{15}$$

Asymptotically, QQ^T approaches N×E[qq^T] and (QQ^T)^{-1}
approaches:

$$\frac{2}{N} \begin{bmatrix} \alpha + 1 & -\alpha \\ -\alpha & \alpha + 1 \end{bmatrix} \tag{16}$$

resulting in the following asymptotic bound on the total
variance:

$$\mathrm{trace}\big(\mathrm{cov}[\hat{p}]\big) \le \frac{\alpha + 1}{2N} \tag{17}$$
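The closed form for E[qq^T] in Equation 15 can be checked numerically. This sketch (ours, assuming NumPy) draws many Dirichlet(α, α) samples for K = 2 and compares the empirical second moment against the closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.5, 200_000

# Columns of Q drawn i.i.d. from Dirichlet(alpha, alpha), K = 2
q = rng.dirichlet([alpha, alpha], size=n)   # n x 2 samples
emp = q.T @ q / n                           # empirical E[q q^T]

# Closed form from Equation 15
closed = np.array([[alpha + 1, alpha],
                   [alpha, alpha + 1]]) / (4 * alpha + 2)
print(np.max(np.abs(emp - closed)))         # small Monte Carlo error
```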
Mean and total variance of the estimate for q
The same analysis can be repeated for one individual at
index i: one column of Q, q̂ = [q̂_1i, q̂_2i, ..., q̂_Ki]^T, and
one column of G, g = [g_1i, g_2i, ..., g_Mi]^T:

$$\hat{q} = \tfrac{1}{2} (P^T P)^{-1} P^T g, \qquad \mathrm{E}[\hat{q}] = q, \qquad \mathrm{cov}[\hat{q}] = \tfrac{1}{4} (P^T P)^{-1} P^T \mathrm{cov}[g] \, P (P^T P)^{-1}, \qquad \mathrm{trace}\big(\mathrm{cov}[\hat{q}]\big) \le \tfrac{1}{8} \mathrm{trace}\big( (P^T P)^{-1} \big) \tag{18}$$

Intuitively, P^T P increases linearly with M, and we expect
the bound on the total variance to decrease linearly
with M. Similarly, if the rows, p, of P are independent
and identically distributed, P^T P approaches M×E[p^T p],
resulting in an asymptotic bound that decreases linearly
with M:

$$\mathrm{trace}\big(\mathrm{cov}[\hat{q}]\big) \le \frac{1}{8M} \mathrm{trace}\Big( \big(\mathrm{E}[p^T p]\big)^{-1} \Big) \tag{19}$$
Incorporating degree of admixture, α
Pritchard et al. [9] use a prior distribution to bias their
solution toward those with a desired level of admixture.
This prior on the columns of Q takes the form of a
Dirichlet distribution:

$$q \sim \mathcal{D}(\alpha, \alpha, \ldots, \alpha) \tag{20}$$

Because all the shape parameters (α) are equal, this
prior assumes that all ancestral populations are
equally represented in the current sample. The log of
this prior probability, ignoring an additive constant, is
the following:

$$\ln P(q) = (\alpha - 1) \sum_{k=1}^{K} \ln q_k, \quad \text{where } q_K = 1 - \sum_{k=1}^{K-1} q_k \tag{21}$$

The derivative of the log prior with respect to q_k and
its first-order approximation at the mean of q_k = 1/K is
the following:

$$\frac{\partial}{\partial q_k} \ln P(q) = -(\alpha - 1) \frac{q_k - q_K}{q_k q_K} \approx -2K^2 (\alpha - 1) \left( q_k - \frac{1}{K} \right) \tag{22}$$
Figure 7 First-order approximation for slope of log-likelihood of q. Solid and dashed lines correspond to the true and approximated slope,
respectively, for K = 2. The blue, green, red, and orange lines correspond to α = 0.1, α = 0.5, α = 1, and α = 2, respectively.
The following penalty function combines the columns
of Q into a single negative log-likelihood function with
the approximated derivative in the above equation:
$$-\ln p(Q) \approx K^2 (\alpha - 1) \sum_{i=1}^{N} \sum_{k=1}^{K} \left( q_{ki} - \frac{1}{K} \right)^2 = K^2 (\alpha - 1) \left\| Q - \frac{1}{K} \right\|_2^2 \tag{23}$$

The right-hand side of Equation 23 acts as a penalty
term for the least-squares criterion in Equation 9. Figure 7
shows the difference between the real and approximated
slope. For q near its mean of 1/K, the approximation fits
closely, but for extreme values of q the true slope diverges.
Combining the terms in Equations 9 and 23 and including the
problem constraints, we have the following least-squares
optimization problem:

$$\arg\min_{P,Q} \left\| 2PQ - G \right\|_2^2 + K^2 (\alpha - 1) \left\| Q - \frac{1}{K} \right\|_2^2, \quad \text{such that} \quad \begin{cases} 0 \le P \le 1 \\ Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1 \end{cases} \tag{24}$$
Optimization algorithm
The non-convex optimization problem in Equation 10 can
be approached as a two-block coordinate descent problem
[15,20]. We initialize Q with nonnegative values such that
each column sums to one. Then, we alternate between
minimizing the criterion function with respect to P with
fixed Q:
$$\arg\min_{0 \le P \le 1} \left\| 2PQ - G \right\|_2^2 \tag{25}$$

and then minimizing with respect to Q with fixed P:

$$\arg\min_{\substack{Q \ge 0 \\ \sum_k q_{ki} = 1}} \left\| 2PQ - G \right\|_2^2 + K^2 (\alpha - 1) \left\| Q - \frac{1}{K} \right\|_2^2 \tag{26}$$
This process is repeated until the change in the criter-
ion function is less than ε at which point we consider
the algorithm to have converged. The Admixture algo-
rithm suggests a threshold of ε = 1e-4 but we have found
that a larger threshold often suffices. Unless otherwise
stated, we use a threshold that depends on the size of
the problem: ε = MN×10-10, corresponding to 1e-4 when
M = 10000 and N = 100.
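The two-block loop can be sketched compactly. The sketch below (ours, assuming NumPy) substitutes simple clip-and-renormalize projections for the exact active/passive-set solves described in the following subsections, so it only approximates the published algorithm; the sizes, seed, and stopping rule follow the text above:

```python
import numpy as np

def alternating_ls(G, K, eps=None, max_iter=500, seed=0):
    """Sketch of the two-block coordinate descent for Equation 10
    (alpha = 1, so the admixture penalty vanishes). Uses crude
    clip-and-renormalize projections instead of exact constrained
    least squares, so results only approximate the paper's method."""
    rng = np.random.default_rng(seed)
    M, N = G.shape
    eps = M * N * 1e-10 if eps is None else eps
    Q = rng.dirichlet(np.ones(K), size=N).T     # columns sum to one
    prev = np.inf
    for _ in range(max_iter):
        # Update P with Q fixed (Equation 25), then clip to [0, 1]
        P = np.clip(0.5 * G @ np.linalg.pinv(Q), 0.0, 1.0)
        # Update Q with P fixed (Equation 26), then re-impose constraints
        Q = np.clip(0.5 * np.linalg.pinv(P) @ G, 1e-12, None)
        Q /= Q.sum(axis=0, keepdims=True)
        crit = np.sum((2 * P @ Q - G) ** 2)
        if prev - crit < eps:                   # converged
            break
        prev = crit
    return P, Q

# Demo on a small simulated dataset (hypothetical sizes)
rng = np.random.default_rng(2)
Pt = rng.uniform(size=(300, 2))
Qt = rng.dirichlet(np.ones(2), size=40).T
G = rng.binomial(2, Pt @ Qt).astype(float)
P_est, Q_est = alternating_ls(G, K=2)
print(Q_est.sum(axis=0)[:3])   # each column of Q sums to one
```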
Least-squares solution for P
Van Benthem and Keenan [16] propose a fast nonnega-
tively constrained active/passive set algorithm that avoids
redundant calculations for problems with multiple right-
hand-sides. Without considering the constraints on P,
Equation 25 can be classically solved using the pseudo-
inverse of Q:
$$\hat{P} = \tfrac{1}{2} G Q^T \big( Q Q^T \big)^{-1} \tag{27}$$
However, some of the elements of P may be less than
zero. In the active/passive set approach, if elements of P
are negative, they are clamped at zero and added to the
active set. The unconstrained solution is then applied to
the remaining passive elements of P. If the solution hap-
pens to be nonnegative, the algorithm finishes. If not,
negative elements are added to the active set and ele-
ments in the active set with a negative gradient (will de-
crease the criterion by increasing) are added back to the
passive set. The process is repeated until the passive set
is non-negative and the active set contains only elements
with a positive gradient at zero. We extend the approach
of Van Benthem and Keenan to include an upper bound
at one. Therefore, we maintain two active sets: those
clamped at zero and those clamped at one and update
both after the unconstrained optimization of the passive
set at each iteration. We provide Matlab source code
that implements this algorithm on our website.
Least-squares solution for Q
When solving for Q it is convenient to reformulate
Equation 26 into simpler terms:
$$\arg\min_{\substack{Q \ge 0 \\ \sum_k q_{ki} = 1}} \left\| \bar{P} Q - \bar{G} \right\|_2^2, \qquad \bar{P} = \begin{bmatrix} 2P \\ K (\alpha - 1)^{1/2} I_K \end{bmatrix}, \qquad \bar{G} = \begin{bmatrix} G \\ (\alpha - 1)^{1/2} \mathbf{1}_{K \times N} \end{bmatrix} \tag{28}$$

The unconstrained solution for this equation is the following:

$$\hat{Q} = \big( 4 P^T P + K^2 (\alpha - 1) I \big)^{-1} \big( 2 P^T G + K (\alpha - 1) \mathbf{1}_{K \times N} \big) = \big( \bar{P}^T \bar{P} \big)^{-1} \bar{P}^T \bar{G} \tag{29}$$
When prior information is known about the sparse-
ness, we use α in the equations above. When no prior
information is known, we use α = 1 corresponding to
the uninformative prior and resulting in the ordinary
pseudo-inverse solution. In order to incorporate the
sum-to-one constraint on the columns of Q, we employ
the method of Lagrange multipliers using Equation 11 in
the work of Settle and Drake substituting the identity
matrix for the noise matrix, N [21]. For completeness,
we include the solution below:
$$\hat{Q} = a W j \mathbf{1}_{1 \times N} + (I_K - a W J) U, \qquad U = W \bar{P}^T \bar{G}, \qquad W = \big( \bar{P}^T \bar{P} \big)^{-1}, \qquad j = [1, 1, \ldots, 1]^T, \qquad J = j j^T, \qquad a = \Big( \sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij} \Big)^{-1} \tag{30}$$

where w_ij are the elements of W; each column of the resulting Q̂ sums to one.
As before, some elements of Q may be negative. In that
case, we utilize the active set method to clamp elements
of Q at zero and update active and passive sets at each it-
eration until convergence as described above. We adapt
the Matlab script by Van Benthem and Keenan so that the
unconstrained solution uses Equation 30 instead of the
standard pseudo-inverse and provide it on our website.
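The Lagrange-multiplier correction above is compact enough to verify directly. This sketch (ours, assuming NumPy; nonnegativity is deliberately omitted, since the paper handles it with the active-set loop afterwards) checks that the closed form returns columns summing exactly to one:

```python
import numpy as np

def sum_to_one_ls(Pbar, Gbar):
    """Least squares for Q with an exact sum-to-one correction via
    Lagrange multipliers (cf. Equation 30). Negative entries are not
    handled here; the active-set loop would clamp them afterwards."""
    K = Pbar.shape[1]
    W = np.linalg.inv(Pbar.T @ Pbar)   # (Pbar^T Pbar)^{-1}
    U = W @ Pbar.T @ Gbar              # unconstrained solution
    j = np.ones((K, 1))
    a = 1.0 / float(j.T @ W @ j)       # reciprocal of the sum of W's entries
    ones_row = np.ones((1, Gbar.shape[1]))
    return U + a * (W @ j) @ (ones_row - j.T @ U)

rng = np.random.default_rng(3)
Pbar = 2 * rng.uniform(size=(100, 3))                      # alpha = 1 case
Gbar = rng.binomial(2, 0.5, size=(100, 20)).astype(float)
Q = sum_to_one_ls(Pbar, Gbar)
print(np.allclose(Q.sum(axis=0), 1.0))   # True
```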
Simulated experiments to compare the proposed
approach to Admixture and FRAPPE
We generate simulated genotype data for a variety of
problems using M = 10000 markers, and varying N be-
tween 100, 1000, and 10000; K between 2, 3, and 4; and
α between 0.1, 0.5, 1, and 2, for a total of 36 parameter
sets. For each combination of N, K, and α, we generate
the ground truth P from a uniform distribution, and Q
from a Dirichlet distribution parameterized by α. Then,
we draw a random genotype for each individual using
the binomial distribution in Equation 11. We estimate P
and Q using only the genotype information and the true
number of populations, K. We repeat the experiment
50 times drawing new, P, Q, and G matrices each time.
Finally, we record the performance of Admixture using
the published tight convergence threshold of ε = 1e-4
[13] and a loose convergence threshold of ε = MN×10-4;
the least-squares algorithm using an uninformative prior
(α = 1) and ε = MN×10-4, and the FRAPPE EM algo-
rithm using the published threshold of ε = 1. For refer-
ence, we also include the least-squares algorithm with
informative prior (known α) with convergence threshold
of ε = MN×10-4. In all experiments, Admixture's performances
with the two convergence thresholds were nearly
identical and we only report the results for ε = MN×10-4,
resulting in shorter computation times. We used a
four-way analysis of variance (ANOVA) with a fixed-effects
model to reveal which factors (including algorithm)
contribute more or less to the estimation error
and computation time.
Statistical significance of root mean squared error and
computation time
For each combination of K, N, and α, we perform a
Kruskal-Wallis test to determine if Admixture, Least-
Squares, and FRAPPE perform significantly differently at
a Bonferroni adjusted significance level of 0.05/(36 par-
ameter sets) = 0.0014. If there is no significant differ-
ence, we consider their performances equal. If there is a
significant difference, we perform pair-wise Mann–Whitney
U-tests to determine significant differences between specific
algorithms. We use a Bonferroni adjusted significance level
of 0.05/(36 parameter sets)/(3 pair-wise comparisons) =
4.6e-4. The ‘Summary’ columns contain the order of per-
formance among the algorithms such that every algorithm
to the left of a ‘<’ symbol performs better than every algo-
rithm to the right. An ‘=’ symbol indicates that the adjacent
algorithms do not perform significantly differently.
Comparison on admixtures derived from the HapMap3
dataset
In the original Admixture paper [13], the authors simulate
admixed genotypes from population allele frequencies
derived from the HapMap Phase 3 dataset [22]. We follow
their example to compare the algorithms with more realis-
tic population allele frequencies. Rather than drawing P
from a uniform distribution, we estimate the population al-
lele frequencies for unrelated individuals in the HapMap
Phase 3 dataset using individuals from the following
groups: Han Chinese in Beijing, China (CHB), Utah resi-
dents with ancestry from Northern and Western Europe
(CEU), and Yoruba individuals in Ibadan, Nigeria (YRI)
[22]. We use the same 13928 SNPs provided in the sample
data on the Admixture webpage [23]. We randomly simu-
late 1000 admixed individuals: q ~ Dirichlet(α1, α2, α3).
When the Dirichlet parameters are not equal, we use the
degree of admixture, α, for LSα that results in the same
total variance as the combination of α1, α2, and α3:
$$\alpha = \frac{K - 1}{K^2 v} - \frac{1}{K}, \quad \text{where the total variance } v = \sum_{k=1}^{K} \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)} \quad \text{and} \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k \tag{31}$$
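The mapping in Equation 31 reduces to a small helper function. In this sketch (ours, assuming NumPy), the symmetric case recovers the common parameter, since Dirichlet(α, ..., α) has total variance v = (K−1)/(K(Kα+1)):

```python
import numpy as np

def effective_alpha(alphas):
    """Symmetric Dirichlet parameter with the same total variance as
    Dirichlet(alphas), following Equation 31."""
    a = np.asarray(alphas, dtype=float)
    K = a.size
    a0 = a.sum()                                        # alpha_0
    v = np.sum(a * (a0 - a)) / (a0 ** 2 * (a0 + 1))     # total variance
    return (K - 1) / (K ** 2 * v) - 1.0 / K

print(round(effective_alpha([0.5, 0.5, 0.5]), 6))   # 0.5
```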
Real dataset from the HapMap phase 3 project
In the original Admixture paper [13], the authors use
Admixture to infer three hypothetical ancestral popula-
tions from four known populations in the HapMap
Phase 3 dataset, including individuals with African an-
cestry in the American Southwest (ASW), individuals
with Mexican ancestry in Los Angeles (MEX), and the
same CEU and YRI individuals from the previous
example. We ran each algorithm 20 times on the dataset
using a convergence threshold of ε = 1e-4, recording the
convergence times for each trial.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RMP conceived of the least-squares approach to inferring population
structure, designed the study, and drafted the document. MDW initiated the
SNP data analysis project, acquired funding to sponsor this effort, and
directed the project and publication. All authors read and approved the final
manuscript.
Acknowledgements
This work was supported in part by grants from Microsoft Research, National
Institutes of Health (Bioengineering Research Partnership R01CA108468,
P20GM072069, Center for Cancer Nanotechnology Excellence U54CA119338,
and 1RC2CA148265), and Georgia Cancer Coalition (Distinguished Cancer
Scholar Award to Professor M. D. Wang).
Author details
1The Wallace H. Coulter Department of Biomedical Engineering, Georgia
Institute of Technology and Emory University, Atlanta, GA 30332, USA.
2Parker H. Petit Institute of Bioengineering and Biosciences and Department
of Electrical and Computer Engineering, Georgia Institute of Technology,
Atlanta, GA 30332, USA.3Winship Cancer Institute and Hematology and
Oncology Department, Emory University, Atlanta, GA 30322, USA.
Received: 15 March 2012 Accepted: 6 November 2012
Published: 23 January 2013
References
1.Beaumont M, Barratt EM, Gottelli D, Kitchener AC, Daniels MJ, Pritchard JK,
Bruford MW: Genetic diversity and introgression in the Scottish wildcat.
Mol Ecol 2001, 10:319–336.
2.Novembre J, Ramachandran S: Perspectives on human population
structure at the cusp of the sequencing era. Annu Rev Genomics Hum
Genet 2011, 12.
3.Menozzi P, Piazza A, Cavalli-Sforza L: Synthetic maps of human gene
frequencies in Europeans. Science 1978, 201:786–792.
4. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D:
Principal components analysis corrects for stratification in genome-wide
association studies. Nat Genet 2006, 38:904–909.
5. McVean G: A genealogical interpretation of principal components
analysis. PLoS Genet 2009, 5:e1000686.
6. Patterson N, Price AL, Reich D: Population structure and eigenanalysis.
PLoS Genet 2006, 2:e190.
7. Lee C, Abdool A, Huang CH: PCA-based population structure inference
with generic clustering algorithms. BMC Bioinforma 2009, 10.
8.Novembre J, Stephens M: Interpreting principal component analyses of
spatial population genetic variation. Nat Genet 2008, 40:646–649.
9.Pritchard JK, Stephens M, Donnelly P: Inference of population structure
using multilocus genotype data. Genetics 2000, 155:945–959.
10. Falush D, Stephens M, Pritchard JK: Inference of population structure
using multilocus genotype data linked loci and correlated allele
frequencies. Genetics 2003, 164:1567–1587.
11. Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture:
Analytical and study design considerations. Genet Epidemiol 2005,
28:289–301.
12. Wu B, Liu N, Zhao H: PSMIX: an R package for population structure
inference via maximum likelihood method. BMC Bioinforma 2006, 7:317.
13.Alexander DH, Novembre J, Lange K: Fast model-based estimation of
ancestry in unrelated individuals. Genome Res 2009, 19:1655.
14.Alexander D, Lange K: Enhancements to the ADMIXTURE algorithm for
individual ancestry estimation. BMC Bioinforma 2011, 12:246.
15. Kim H, Park H: Non-negative matrix factorization based on alternating
non-negativity constrained least squares and active set method. SIAM
Journal in Matrix Analysis and Applications 2008, 30:713–730.
16. Van Benthem MH, Keenan MR: Fast algorithm for the solution of large-
scale non-negativity-constrained least squares problems. J Chemom 2004,
18:441–450.
17. Hanis CL, Chakraborty R, Ferrell RE, Schull WJ: Individual admixture
estimates: disease associations and individual risk of diabetes and
gallbladder disease among Mexican Americans in Starr County, Texas.
Am J Phys Anthropol 1986, 70:433–441.
18. Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular
pattern discovery using matrix factorization. Proc Natl Acad Sci U S A
2004, 101:4164.
19. Guillot G, Estoup A, Mortier F, Cosson JF: A spatial statistical model for
landscape genetics. Genetics 2005, 170:1261–1280.
20. Bertsekas DP: Nonlinear programming. Belmont, Mass.: Athena Scientific;
1995.
21. Settle JJ, Drake NA: Linear mixing and the estimation of ground cover
proportions. Int J Remote Sens 1993, 14:1159–1177.
22. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, Gibbs
RA, Belmont JW, Boudreau A, Leal SM: A haplotype map of the human
genome. Nature 2005, 437:1299–1320.
23. ADMIXTURE: fast ancestry estimation. [http://www.genetics.ucla.edu/
software/admixture/download.html].
doi:10.1186/1471-2105-14-28
Cite this article as: Parry and Wang: A fast least-squares algorithm for
population inference. BMC Bioinformatics 2013 14:28.