# A fast least-squares algorithm for population inference.


RESEARCH ARTICLE Open Access

R Mitchell Parry^1 and May D Wang^1,2,3*

Abstract

Background: Population inference is an important problem in genetics, used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual's genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. The result is a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.

Results: We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture in these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares; the least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5- to 6-times faster.

Conclusions: The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.

Background

The inference of population structure from the genotypes of admixed individuals poses a significant problem in population genetics. For example, genome-wide association studies (GWAS) compare the genetic makeup of different individuals in order to extract differences in the genome that may contribute to the development or suppression of disease. Of particular interest are single nucleotide polymorphisms (SNPs) that reveal genetic changes at a single nucleotide in the DNA chain. When a particular SNP variant is associated with a disease, this may indicate that the gene plays a role in the disease pathway, or that the gene was simply inherited from a population that is more (or less) predisposed to the disease. Determining the inherent population structure within a sample removes confounding factors before further analysis and reveals migration patterns and ancestry [1]. This paper deals with the problem of inferring, from genotype data, the proportion of an individual's genome originating from multiple ancestral populations and the allele frequencies in these ancestral populations.

* Correspondence: maywang@bme.gatech.edu
^1 The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
^2 Parker H. Petit Institute of Bioengineering and Biosciences and Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Full list of author information is available at the end of the article

© 2013 Parry and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Parry and Wang BMC Bioinformatics 2013, 14:28
http://www.biomedcentral.com/1471-2105/14/28

Methods for revealing population structure are divided into fast multivariate analysis techniques and slower discrete admixture models [2]. Fast multivariate techniques such as principal component analysis (PCA) [2-8] reveal subspaces in the genome where large differences between individuals are observed. For case–control studies, the largest differences, commonly due to ancestry, are removed to reduce false positives [4]. Although PCA provides a fast solution, it does not directly infer the variables of interest: the population allele frequencies and individual admixture proportions. On the other hand, discrete admixture models that estimate these variables typically require much more computation time. Following a recent trend toward faster gradient-based methods, we propose a faster, simpler least-squares algorithm for estimating both the population allele frequencies and individual admixture proportions.

Pritchard et al. [9] originally proposed a discrete admixture likelihood model based on the random union of gametes for the purpose of population inference. In particular, their model assumes Hardy-Weinberg equilibrium within the ancestral populations (i.e., allele frequencies are constant) and linkage equilibrium between markers within each population (i.e., markers are independent). Each individual in the current sample is modeled as having some fraction of their genome originating from each of the ancestral populations. The goal of population inference is to estimate the ancestral population allele frequencies, P, and the admixture of each individual, Q, from the observed genotypes, G. If the population of origin for every allele, Z, is known, then the population allele frequencies and the admixture for each individual have a Dirichlet distribution. If, on the other hand, P and Q are known, the population of origin for each individual allele has a multinomial distribution. Pritchard et al. infer populations by alternately sampling Z from a multinomial distribution based on P and Q, and P and Q from Dirichlet distributions based on Z. Ideally, this Markov chain Monte Carlo sampling method produces independent identically distributed samples (P, Q) from the posterior distribution P(P,Q|G). The inferred parameters are taken as the mean of the posterior. This algorithm is implemented in an open-source software tool called Structure [9].
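The generative model just described can be sketched with a short simulation (a toy illustration with made-up sizes, not code from the paper; numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 1000, 50, 3          # loci, individuals, ancestral populations (toy sizes)
alpha = 1.0                    # symmetric Dirichlet degree of admixture

P = rng.uniform(0.05, 0.95, size=(M, K))         # ancestral allele frequencies
Q = rng.dirichlet(alpha * np.ones(K), size=N).T  # K x N admixture proportions

# Each individual carries two allele copies per locus. Drawing the population
# of origin Z for each copy from Q, then the allele from that population's
# frequency, is equivalent to sampling the genotype from Binomial(2, [PQ]_li).
F = P @ Q                      # M x N per-copy reference-allele probabilities
G = rng.binomial(2, F)         # observed genotypes in {0, 1, 2}
```

Inference reverses this process: recover P and Q from the genotype matrix G alone.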

The binomial likelihood model proposed by Pritchard et al. was originally used for datasets of tens or hundreds of loci. However, as datasets become larger, especially considering genome-wide association studies with thousands or millions of loci, two problems emerge. First, linkage disequilibrium introduces correlations between markers. Although Falush et al. [10] extended Structure to incorporate loose linkage between loci, larger datasets also pose a computational challenge that has not been met by these sampling-based approaches. This has led to a series of more efficient optimization algorithms for the same likelihood model with uncorrelated loci. This paper focuses on improving computational performance, leaving the treatment of correlated loci to future research.

Tang et al. [11] proposed a more efficient expectation maximization (EM) approach. Instead of randomly sampling from the posterior distribution, the FRAPPE EM algorithm [11] starts with a randomly initialized Z, then alternates between updating the values of P and Q for fixed Z, and maximizing the likelihood of Z for fixed P and Q. Their approach achieves accuracy similar to Structure and requires much less computation time. Wu et al. [12] specialized the EM algorithm in FRAPPE to accommodate the model without admixture, and generalized it to allow different mixing proportions at each locus. However, these EM algorithms estimate an unnecessary and unobservable variable Z, something that more efficient algorithms could avoid.

Alexander et al. [13] proposed an even faster approach for inferring P and Q using the same binomial likelihood model but bypassing the unobservable variable Z. Their closed-source software, Admixture, starts at a random feasible solution for P and Q and then alternates between maximizing the likelihood function with respect to P and maximizing it with respect to Q. The likelihood is guaranteed not to decrease at each step, eventually converging to a local maximum or saddle point. For a moderate problem of approximately 10,000 loci, Admixture achieves accuracy comparable to Structure and requires only minutes to execute, compared to hours for Structure [13].

Another feature of Structure's binomial likelihood model is that it allows the user to input prior knowledge about the degree of admixture. The prior distribution for Q takes the form of a Dirichlet distribution with a degree-of-admixture parameter, α, for every population. For α = 0, all of an individual's alleles originate from the same ancestral population; for α > 0, individuals contain a mixture of alleles from different populations; for α = 1, every assignment of alleles to populations is equally likely (i.e., the non-informative prior); and for α → ∞, all individuals have equal contributions from every ancestral population. Alexander et al. replace the population degree-of-admixture parameter in Structure with two parameters, λ and γ, that when increased also decrease the level of admixture of the resulting individuals. However, the authors admit that tuning these parameters is non-trivial [14].

This paper contributes to population inference research by (1) proposing a novel least-squares simplification of the binomial likelihood model that results in a faster algorithm, and (2) directly incorporating the prior parameter α, which improves estimates without requiring trial-and-error tuning. Specifically, we utilize a two-block coordinate descent method [15] to alternately minimize the criterion for P and then for Q. We adapt a fast non-negative least-squares algorithm [16] to additionally include a sum-to-one constraint for Q and an upper bound for P. We show that the expected values of the estimates of P (or Q) across all possible genotype datasets are equal to the true values when Q (or P) is known, and that the variance of this estimate approaches zero as the problem size increases. Compared to Admixture, the least-squares approach provides a slightly worse estimate of P or Q when the other is known. However, when estimating P and Q from only the genotype data, the least-squares approach sometimes provides better estimates, particularly with a large number of populations, a small number of samples, or more admixed individuals. The least-squares approximation provides a simpler and faster algorithm, and we provide it as Matlab scripts on our website.
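The two-block coordinate descent just outlined can be illustrated roughly as follows. This is only a sketch: the paper adapts a fast active-set non-negative least-squares routine [16], whereas this stand-in uses SciPy's generic bounded solver and enforces the sum-to-one constraint on Q with a weighted penalty row followed by an exact renormalization.

```python
import numpy as np
from scipy.optimize import lsq_linear

def als_admixture(G, K, iters=5, w=100.0, seed=0):
    """Alternating constrained least squares for G ~ 2 P Q (illustrative sketch).

    Stand-in for the paper's adapted NNLS: generic bounded solvers, with the
    sum-to-one constraint on each column of Q imposed via a penalty row.
    """
    rng = np.random.default_rng(seed)
    M, N = G.shape
    Q = rng.dirichlet(np.ones(K), size=N).T           # K x N, columns sum to 1
    for _ in range(iters):
        # Update each row of P with box constraints 0 <= p_lk <= 1.
        A = 2.0 * Q.T                                 # N x K design matrix
        P = np.vstack([lsq_linear(A, G[l], bounds=(0, 1)).x for l in range(M)])
        # Update each column of Q: q >= 0, with sum(q) = 1 encoded softly
        # by appending the heavily weighted row w * 1^T q = w.
        B = np.vstack([2.0 * P, w * np.ones((1, K))])
        Q = np.column_stack([
            lsq_linear(B, np.append(G[:, i], w), bounds=(0, np.inf)).x
            for i in range(N)
        ])
        Q /= Q.sum(axis=0, keepdims=True)             # exact renormalization
    return P, Q
```

Each block update is a convex problem, so the squared-error criterion is non-increasing across the alternation, mirroring the monotonicity argument used for the likelihood-based methods above.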

Results

First, we motivate a least-squares simplification of the binomial likelihood model by deriving the expected value and covariance of the least-squares estimate across all possible genotype matrices for partially solved problems. Second, we compare least-squares to sequential quadratic programming (Admixture's optimization algorithm) for these cases. Third, we compare Admixture, FRAPPE, and least-squares using simulated datasets with a factorial design varying dataset properties in G. Fourth, we compare Admixture and least-squares using real population allele frequencies from the HapMap Phase 3 project. Finally, we compare the results of applying Admixture and least-squares to real data from the HapMap Phase 3 project, where the true population structure is unknown.

The algorithms we discuss accept as input the number of populations, K, and the genotypes, $g_{li} \in \{0, 1, 2\}$, representing the number of copies of the reference allele at locus l for individual i. The algorithms then attempt to infer the population allele frequencies, $p_{lk} \in [0, 1]$, for locus l and population k, as well as the individual admixture proportions, $q_{ki} \in [0, 1]$, where $\sum_k q_{ki} = 1$. In all cases, $1 \leq l \leq M$, $1 \leq i \leq N$, and $1 \leq k \leq K$. Table 1 summarizes the matrix notation.

Table 1 Matrix notation

Genotype matrix (M × N):

$$G = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1} & g_{M2} & \cdots & g_{MN} \end{bmatrix}$$

$g_{li} \in \{0, 1, 2\}$: number of reference alleles at the lth locus for the ith individual.

Empirical estimate and upper bound on total variance

To validate our derived bounds on the total variance (Equations 13, 17, 18 and 19), we generate simulated genotypes from a known target for p = [0.1, 0.7]^T. We simulate N individual genotypes using the full matrix Q, with each column drawn from a Dirichlet distribution with shape parameter α. We repeat the experiment 10,000 times, producing an independent and identically distributed genotype each time. Each trial produces one estimate of p. We then compute the mean and covariance of the estimates of p and compare them to those predicted by the bounds. For α = 1 and N = 100,

$$\mathrm{mean}(\hat{p}) = \begin{bmatrix} 0.0999 \\ 0.7002 \end{bmatrix}, \quad \mathrm{cov}(\hat{p}) = \begin{bmatrix} 0.0027 & 0.0015 \\ 0.0015 & 0.0046 \end{bmatrix}, \quad \mathrm{trace}\left[\mathrm{cov}(\hat{p})\right] = 0.0073 \quad (1)$$

The bound using the sample covariance of q in Equation 13 provides the following:

$$QQ^T = \begin{bmatrix} 36.62 & 16.20 \\ 16.20 & 30.99 \end{bmatrix}, \quad \mathrm{trace}\left[\mathrm{cov}(\hat{p})\right] \leq 0.0097 \quad (2)$$

The bound using the properties of the Dirichlet distribution in Equation 17 provides a bound of 0.01. As the number of samples increases, the difference between the bound and the asymptotic bound for the Dirichlet-distributed q will approach zero.
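This experiment can be reproduced with a small Monte Carlo sketch, assuming the unconstrained single-locus least-squares update $\hat{p} = (QQ^T)^{-1} Q g / 2$ (fewer trials than the 10,000 used above, so the empirical numbers will vary slightly):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, trials = 2, 100, 2000
alpha = 1.0
p_true = np.array([0.1, 0.7])

estimates = np.empty((trials, K))
for t in range(trials):
    Q = rng.dirichlet(alpha * np.ones(K), size=N).T   # K x N admixture matrix
    g = rng.binomial(2, p_true @ Q)                   # genotypes at one locus
    # Unconstrained least squares: minimize ||g - 2 Q^T p||^2 over p
    estimates[t] = np.linalg.solve(Q @ Q.T, Q @ g) / 2.0

mean_p = estimates.mean(axis=0)            # close to p_true (unbiasedness)
total_var = np.trace(np.cov(estimates.T))  # compare against the derived bounds
```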

Figure 1 plots the total variance (trace of the covariance matrix) for a variety of values of N and α using the same target value for p. Because the expected value of the estimate is equal to the true value of p, the total variance is analogous to the sum of the squared error (SSE) between the true p and its estimate. Clearly, the total variance decreases with N. For N = 10,000, the root mean squared error falls below 1%.

Table 1 Matrix notation (continued)

Population allele frequencies matrix (M × K):

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{M1} & p_{M2} & \cdots & p_{MK} \end{bmatrix}$$

$0 \leq p_{lk} \leq 1$: percentage of reference alleles at the lth locus in the kth population.

Individual admixture matrix (K × N):

$$Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1N} \\ q_{21} & q_{22} & \cdots & q_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ q_{K1} & q_{K2} & \cdots & q_{KN} \end{bmatrix}$$

$q_{ki} \geq 0$, $\sum_{k=1}^{K} q_{ki} = 1$: fraction of the ith individual's genome originating from the kth population.

M = number of loci (markers), $1 \leq l \leq M$; N = number of individuals, $1 \leq i \leq N$; K = number of populations, $1 \leq k \leq K$.

Intuitively, the error in the least-squares estimates of P and Q decreases as the number of individuals and the number of loci increase, respectively. Figure 1 supports this notion, suggesting that on the very large problems for which the gradient-based and expectation maximization algorithms were designed, the error in the least-squares estimate approaches zero.

Comparing the least-squares approximation to the binomial likelihood model

Given estimates of the population allele frequencies, early research focused on estimating the individual admixture [17]. We also note that the number of iterations and convergence properties confound the comparison of iterative algorithms. To avoid these problems and emulate a practical research scenario, we compare least-squares to sequential quadratic programming (used in Admixture) when P or Q is known a priori. In this scenario, each algorithm converges in exactly one step, making it possible to compare the underlying updates for P and Q independently. For N = 100, 1000, and 10000, and α = 0.1, 1, and 2, we consider a grid of two-dimensional points for p, where p_i ∈ {0.05, 0.15, ..., 0.95}. For each trial, we first generate a random Q such that every column is drawn from a Dirichlet distribution with shape parameter α. Then, we randomly generate a genotype using Equation 11. We compute the least-squares solution using Equation 27 and use Matlab's built-in function 'fmincon' to minimize the negative of the log-likelihood in Equation 7, similar to Admixture's approach. We repeat the process for 1000 trials and aggregate the results.

Figure 2 illustrates the root mean squared error in estimating p given the true value of Q. Both algorithms present the same pattern of performance as a function of p = [p1, p2]. Values of p near 0.5 present the most difficult scenarios. Positively correlated values (e.g., p1 = p2) present slightly less error than negatively correlated values (e.g., p1 = 1 − p2). Table 2 summarizes the performance over all values of p for varying N and α. In all cases, fmincon performs slightly better than least-squares, and both algorithms approach zero error as N increases. We repeat this analysis for known values of P and estimate q using the two approaches. Figure 3 illustrates the difference in performance for the two algorithms as we vary q1 between 0.05 and 0.95 with q2 = 1 − q1. Again, fmincon performs slightly better in all cases, but both approach zero error as M increases. In the next section, we show that the additional error introduced by the least-squares approximation to the objective function remains small relative to the error introduced by the characteristics of the genotype data.
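The best-case comparison of the two updates can be sketched for a single individual with known P. This is an illustrative stand-in, not the paper's code: SciPy's SLSQP replaces Matlab's fmincon, the binomial log-likelihood is written directly from the model rather than copied from Equation 7, and a clipped-and-renormalized unconstrained solution stands in for the constrained least-squares update of Equation 27.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M, K = 10000, 2
P = rng.uniform(0.05, 0.95, size=(M, K))
q_true = np.array([0.3, 0.7])
g = rng.binomial(2, P @ q_true)                   # genotypes given true q

def neg_log_lik(q):
    f = np.clip(P @ q, 1e-9, 1 - 1e-9)            # per-copy allele probability
    return -np.sum(g * np.log(f) + (2 - g) * np.log(1 - f))

# Binomial likelihood maximization (stand-in for the SQP/fmincon update)
res = minimize(neg_log_lik, x0=np.full(K, 1.0 / K),
               bounds=[(0, 1)] * K,
               constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1}])
q_mle = res.x

# Least-squares counterpart: minimize ||g - 2 P q||^2, then project crudely
q_ls = np.linalg.solve(P.T @ P, P.T @ g) / 2.0
q_ls = np.clip(q_ls, 0, None)
q_ls /= q_ls.sum()
```

With M = 10000 loci, both estimates land close to the true q, matching the pattern reported above in which the two updates differ only slightly.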

Figure 1 Bound on total variance. Solid and dashed lines correspond to the empirical estimate of the total variance and the upper bound for the total variance, respectively.

Simulated experiments to compare least-squares to Admixture and FRAPPE

In the previous sections, we considered the best-case scenario where the true value of P or Q is known. In a realistic scenario, the algorithms must estimate both P and Q from only the genotype information. Table 3

summarizes the results of a four-way analysis of variance with two-way interactions among experimental factors. By far the factor with the most impact on performance is the number of individuals, N. The degree of admixture, α, and the number of populations, K, account for the second and third most variation, respectively. These three factors and the two-way interactions between them account for the vast majority of variation. In particular, the choice of algorithm accounts for less than about 1% of the variation in estimation performance. That is, when estimating population structure from genotype data, the number of samples, the number of populations, and the degree of admixture play a much more important role than the choice between least-squares, Admixture, and FRAPPE. However, as shown in Figure 4, when considering the computation time required by the algorithm, the choice of algorithm contributes about 40% of the variation, including interactions. Therefore, for the range of population inference problems described in this study, the choice of algorithm plays a very small role in the estimation of P and Q but a larger role in computation time.

Further exploration reveals that the preferred algorithm depends on K, N, and α. Table 4 lists the root mean squared error for the estimation of Q for all combinations of parameters across n = 50 trials. Out of the 36 scenarios, Admixture, least-squares, and FRAPPE perform significantly better than their peers 13, six, and zero times, respectively; they perform insignificantly worse than the best algorithm 30, 17, and 10 times, respectively. The least-squares algorithm appears to perform well on the more difficult problems with combinations of large K, small N, or large α. Table 5 lists the root mean squared error for estimating P. For N = 100, the algorithms do not perform significantly differently. For N = 10000, all algorithms perform with less than 2.5% root mean squared error (RMSE). In all, Admixture performs significantly better than its peers 11 times out of 36. However, Admixture never performs significantly worse than its peers. Least-squares and FRAPPE perform insignificantly worse than Admixture 17 and 20 times out of 36, respectively. Table 6 summarizes the timing results. Least-squares converges significantly faster 34 out of 36 times, with an insignificant difference in the remaining two scenarios. FRAPPE converges significantly slower in all scenarios. With two exceptions, the least-squares algorithm provides a 1.5- to 5-times speedup.

Figure 2 Precision of best-case scenario for estimating P. Root mean squared error for different values of p using (a) Admixture's sequential quadratic programming or (b) the least-squares approximation. Both panels use N = 100 and α = 1.00.

Table 2 Root mean squared error in P for known Q and K = 2

| RMSE (%) | N = 100, α = 0.1 | α = 1.0 | α = 2.0 | N = 1000, α = 0.1 | α = 1.0 | α = 2.0 | N = 10000, α = 0.1 | α = 1.0 | α = 2.0 |
|---|---|---|---|---|---|---|---|---|---|
| SQP | 4.35 | 6.03 | 7.41 | 1.37 | 1.90 | 2.37 | 0.43 | 0.60 | 0.75 |
| LS | 4.37 | 6.16 | 7.68 | 1.38 | 1.93 | 2.40 | 0.44 | 0.61 | 0.76 |

Comparison on admixtures derived from the HapMap3 dataset

Tables 7 and 8 list the performance and computation time for the least-squares approach and Admixture using convergence thresholds of ε = 1.0e-4 and ε = 1.4e-3, respectively. Each marker in the illustrations represents one individual. A short black line emanating from each marker indicates the offset from the original (correct) position. For all simulations, the least-squares algorithms

perform within 0.1% of Admixture for estimating the true population allele frequencies in P. For the well-mixed populations in Simulations 1 and 2, the least-squares algorithms perform comparably well or even better than Admixture. However, for the less admixed data in Simulations 3–6, Admixture provides better estimates of the true population proportions depicted in the scatter plots. In all cases, the least-squares algorithms perform within 1.5% of Admixture and run between about 2- and 3-times faster.

The apparent advantage of Admixture involves individuals on the periphery of the unit simplex defining the space of Q. In Table 7, this corresponds to individuals on the boundary of the right triangle defined by the x-axis, the y-axis, and the y = 1 − x diagonal line. For Simulation 1, the original Q contains very few individuals on the boundary, Admixture estimates far more on the boundary, and least-squares was closer to the ground truth. For Simulations 2–6, the ground truth contains more individuals on the boundary; Admixture correctly estimates these boundary points, but the least-squares

Figure 3 Precision of best-case scenario for estimating Q. Solid and dashed lines correspond to Admixture's sequential quadratic programming optimization and the least-squares approximation, respectively.

Table 3 Sources of variation in root mean squared error (ANOVA)

| Factors and interactions | Error variance for P: SSE (×10⁻²) | Percent | Error variance for Q: SSE (×10⁻⁴) | Percent | Time variance: SSE (×10⁴) | Percent |
|---|---|---|---|---|---|---|
| K | 59.0 | 8.2 | 44.0 | 3.9 | 58.7 | 3.2 |
| N | 519.6 | 72.4 | 376.2 | 33.0 | 585.5 | 32.2 |
| α | 63.1 | 8.8 | 341.1 | 29.9 | 33.2 | 1.8 |
| Algorithm | 0.1 | 0.0 | 1.7 | 0.1 | 266.3 | 14.6 |
| K × N | 32.1 | 4.5 | 32.6 | 2.9 | 98.2 | 5.4 |
| K × α | 9.0 | 1.3 | 8.2 | 0.7 | 4.4 | 0.2 |
| K × Algorithm | 0.0 | 0.0 | 0.4 | 0.0 | 55.1 | 3.0 |
| N × α | 29.1 | 4.1 | 282.6 | 24.8 | 58.8 | 3.2 |
| N × Algorithm | 0.0 | 0.0 | 2.1 | 0.2 | 445.6 | 24.5 |
| α × Algorithm | 0.2 | 0.0 | 8.4 | 0.7 | 10.5 | 0.6 |
| Error | 5.7 | 0.8 | 43.2 | 3.8 | 204.4 | 11.2 |
| Total | 717.9 | 100.0 | 1140.4 | 100.0 | 1820.4 | 100.0 |

algorithms predict fewer points on the boundary. Simulation 6 provides the most obvious example, where Admixture estimates individuals exactly on the boundary and least-squares produces a jumble of individuals near, but not exactly on, the line.

Real dataset from the HapMap Phase 3 project

Over 20 repeated trials, Admixture converged in an average of 42.1 seconds with a standard deviation of 9.1 seconds, and the least-squares approach converged in an average of 33.6 seconds with a standard deviation of 9.8 seconds. Figure 5 illustrates the inferred population proportions for one run. The relative placement of individuals from each known population is qualitatively similar. The two methods differ at extreme points, such as values of q1, q2, or 1 − q1 − q2 that are near zero. The Admixture solution has more individuals on the boundary, and the least-squares approach has fewer. Although we cannot estimate the error of these estimates because the real-world data has no ground truth, we can compare the results quantitatively. The Admixture and least-squares solutions differed by an average of 1.2% root mean squared difference across the 20 trials. We estimate α = 0.12 from the Admixture solution's total variance using Equation 31. This roughly corresponds to the simulated experiment with three populations, 100 samples, and a degree of admixture of 0.1. In that case, Admixture and least-squares exhibited very small root mean squared errors of 0.62% and 0.74%, respectively (Table 4).
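Equation 31 is not reproduced in this excerpt. For a symmetric Dirichlet(α) prior, however, the total variance of an admixture vector is (K − 1)/(K(Kα + 1)), and inverting that relation gives a moment-matching estimate of α; the sketch below illustrates this assumption and is not necessarily the paper's Equation 31.

```python
import numpy as np

def alpha_from_total_variance(total_var, K):
    """Invert the symmetric-Dirichlet relation V = (K - 1) / (K * (K*alpha + 1))."""
    return ((K - 1) / (K * total_var) - 1.0) / K

rng = np.random.default_rng(0)
K, alpha = 3, 0.5
Q = rng.dirichlet(alpha * np.ones(K), size=20000)  # sampled admixture vectors
V = np.trace(np.cov(Q.T))                          # empirical total variance
print(alpha_from_total_variance(V, K))             # approximately 0.5
```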

Discussion

This work contributes to the population inference literature by providing a novel simplification of the binomial likelihood model that improves the computational efficiency of discrete admixture inference. This approximation results in an inference algorithm based on minimizing the squared distance between the genotype matrix G and twice the product of the population allele frequencies and individual admixture proportions, 2PQ. This Euclidean distance-based interpretation aligns with previous results employing multivariate statistics. For example, researchers have found success using principal component analysis to reveal and remove stratification [2-4] or even to reveal clusters of individuals in subpopulations [5-7]. Recently, McVean [5] proposed a genealogical interpretation of principal component analysis and used it to reveal information about migration, geographic isolation, and admixture. In particular, given two populations, individuals cluster along the first principal component, and the admixture proportion is the fractional distance between the two population centers. However, these cluster centers must be known or inferred in order to estimate ancestral population allele frequencies. The least-squares approach infers these estimates efficiently and directly.
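McVean's reading of PCA can be illustrated with a toy simulation (not code from the paper): individuals simulated as mixtures of two populations line up along the first principal component, and the fractional position between the extremes tracks the admixture proportion.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5000, 60
p1 = rng.uniform(0.05, 0.95, M)                # allele frequencies, population 1
p2 = rng.uniform(0.05, 0.95, M)                # allele frequencies, population 2
q = rng.uniform(0, 1, N)                       # true ancestry fraction from pop 1
F = np.outer(p1, q) + np.outer(p2, 1 - q)      # per-copy allele probabilities
G = rng.binomial(2, F).astype(float)           # M x N genotype matrix

X = (G - G.mean(axis=1, keepdims=True)).T      # center each locus; rows = people
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]                                # first principal component scores

# Use the most extreme individuals as proxy cluster centers, then read
# admixture as the fractional distance along PC1 between them.
lo, hi = pc1.min(), pc1.max()
q_hat = (pc1 - lo) / (hi - lo)
if np.corrcoef(q_hat, q)[0, 1] < 0:            # the sign of a PC is arbitrary
    q_hat = 1 - q_hat
```

The anchoring step highlights the limitation noted above: the endpoints (cluster centers) must be known or inferred before PC1 positions can be converted into admixture proportions.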

Typically, discrete admixture models employ a binomial likelihood function rather than a Euclidean distance-based one. Pritchard et al. detail one such model and use a slow sampling-based approach to infer the admixed ancestral populations for individuals in a sample [9]. Recognizing the performance advantage of maximizing the likelihood rather than sampling the posterior, Tang et al. [11] proposed an expectation maximization algorithm and Alexander et al. [13] proposed a sequential quadratic programming (SQP) approach using the same likelihood function [9]. We take this approach a step further by simplifying the model proposed by Pritchard et al. to introduce a least-squares criterion. By justifying the least-squares simplification, we connect the fast and practical multivariate statistical approaches to the theoretically grounded binomial likelihood model. We validate our approach on a variety of simulated and real datasets.

First, we show that if the true value of P (or Q) is known, the expected value of the least-squares solution for Q (or P) across all possible genotype matrices is equal to the true value, and the variance of this estimate decreases with M (or N). In this best-case scenario, we show that SQP provides a slightly better estimate than the


Figure 4 Computational timing comparison. Box plots show the median (red line) and inter-quartile range (blue box) for computation time

on a logarithmic scale using (a) N=1000, α=0.5, and varying K; (b) K=4, α=0.5, and varying N; and (c) K=4, N=1000, and varying α.

Parry and Wang BMC Bioinformatics 2013, 14:28

http://www.biomedcentral.com/1471-2105/14/28


least-squares solution for a variety of problem sizes and difficulties. For more common scenarios where the algorithms must estimate P and Q using only the genotype information in G, we show that for particularly difficult problems with small N, large K, or large α, the least-squares approach often performs better than its peers. For about one-third of the parameter sets, Admixture performs significantly better than least-squares and FRAPPE, but all algorithms approach zero error as N becomes very large. In addition, the error introduced by the choice of

Table 4 Root mean squared error for Q

K  N      α     AD    LS1   FRAPPE  Significance   LSα
2  100    0.10  0.48  0.72  0.52    AD = FR < LS   0.64
2  100    0.50  1.12  1.13  1.03    FR = AD = LS   1.18
2  100    1.00  2.22  2.22  2.29    AD = LS = FR   2.22
2  100    2.00  4.13  4.11  4.50    LS = AD = FR   3.84
2  1000   0.10  0.57  0.97  0.63    AD < FR < LS   0.74
2  1000   0.50  0.69  0.74  0.71    AD < FR < LS   0.74
2  1000   1.00  0.86  0.91  1.00    AD < LS < FR   0.91
2  1000   2.00  1.58  1.65  2.33    AD = LS < FR   0.93
2  10000  0.10  0.59  1.03  0.61    AD < FR < LS   0.76
2  10000  0.50  0.70  0.81  0.72    AD < FR < LS   0.73
2  10000  1.00  0.74  0.77  0.79    AD < LS < FR   0.77
2  10000  2.00  0.89  0.97  1.32    AD < LS < FR   0.96
3  100    0.10  0.62  0.74  0.63    AD = FR < LS   0.66
3  100    0.50  2.01  1.81  2.00    LS < FR = AD   1.91
3  100    1.00  3.49  3.23  3.60    LS < AD = FR   3.23
3  100    2.00  5.77  5.39  5.89    LS < AD = FR   5.00
3  1000   0.10  0.68  1.15  0.73    AD < FR < LS   0.76
3  1000   0.50  0.85  0.88  0.89    AD < LS = FR   0.93
3  1000   1.00  1.18  1.17  1.35    LS = AD < FR   1.17
3  1000   2.00  1.94  1.92  2.49    LS = AD < FR   1.20
3  10000  0.10  0.74  1.26  0.76    AD < FR < LS   0.79
3  10000  0.50  0.87  0.97  0.87    AD = FR < LS   0.87
3  10000  1.00  0.89  0.92  0.95    AD < LS < FR   0.92
3  10000  2.00  1.07  1.09  1.49    AD < LS < FR   1.09
4  100    0.10  0.79  0.76  0.80    LS = AD = FR   0.77
4  100    0.50  2.81  2.40  2.85    LS < AD = FR   2.56
4  100    1.00  4.43  4.01  4.55    LS < AD = FR   4.01
4  100    2.00  6.63  6.13  6.81    LS < AD = FR   5.65
4  1000   0.10  0.73  1.17  0.74    AD = FR < LS   0.72
4  1000   0.50  0.95  0.95  1.00    LS = AD < FR   1.07
4  1000   1.00  1.34  1.32  1.47    LS = AD < FR   1.32
4  1000   2.00  2.09  2.06  2.50    LS = AD < FR   1.32
4  10000  0.10  0.84  1.33  0.84    AD = FR < LS   0.74
4  10000  0.50  0.96  1.03  0.96    AD = FR < LS   0.95
4  10000  1.00  0.97  0.99  1.03    AD < LS < FR   0.99
4  10000  2.00  1.14  1.15  1.51    AD = LS < FR   1.15

'AD' = Admixture with ε = MN×10⁻⁴, 'LS1' = Least-squares with ε = MN×10⁻⁴ and α = 1, 'FR' = FRAPPE with ε = 1. Bold values indicate significantly less error than those without bold. '<' indicates significantly less at the 4.6e-4 level, and '=' indicates an insignificant difference. 'LSα' = Least-squares with the correct α, provided only for reference.

Table 5 Root mean squared error for P

K  N      α     AD     LS1    FRAPPE  Significance   LSα
2  100    0.10  4.33   4.37   4.33    AD = FR = LS   4.36
2  100    0.50  5.13   5.17   5.14    AD = FR = LS   5.17
2  100    1.00  5.99   6.03   5.99    AD = FR = LS   6.03
2  100    2.00  7.24   7.28   7.29    AD = LS = FR   7.25
2  1000   0.10  1.37   1.42   1.38    AD < FR < LS   1.39
2  1000   0.50  1.62   1.65   1.63    AD = FR < LS   1.65
2  1000   1.00  1.90   1.93   1.92    AD < FR = LS   1.93
2  1000   2.00  2.52   2.58   2.82    AD = LS < FR   2.38
2  10000  0.10  0.46   0.57   0.46    AD < FR < LS   0.48
2  10000  0.50  0.52   0.56   0.53    AD < FR < LS   0.52
2  10000  1.00  0.60   0.61   0.62    AD < LS < FR   0.61
2  10000  2.00  0.81   0.87   1.14    AD < LS < FR   0.92
3  100    0.10  5.58   5.64   5.58    AD = FR = LS   5.62
3  100    0.50  7.37   7.42   7.38    AD = FR = LS   7.42
3  100    1.00  9.05   9.06   9.06    AD = FR = LS   9.06
3  100    2.00  11.36  11.33  11.39   LS = AD = FR   11.30
3  1000   0.10  1.78   1.87   1.78    AD = FR < LS   1.80
3  1000   0.50  2.35   2.40   2.35    AD = FR < LS   2.39
3  1000   1.00  2.97   3.00   3.01    AD < LS = FR   3.00
3  1000   2.00  4.11   4.14   4.41    AD = LS < FR   3.89
3  10000  0.10  0.61   0.82   0.62    AD < FR < LS   0.61
3  10000  0.50  0.78   0.84   0.78    AD = FR < LS   0.76
3  10000  1.00  0.93   0.95   0.98    AD < LS < FR   0.95
3  10000  2.00  1.35   1.36   1.82    AD = LS < FR   1.49
4  100    0.10  6.83   6.90   6.84    AD = FR = LS   6.87
4  100    0.50  9.61   9.63   9.62    AD = FR = LS   9.62
4  100    1.00  11.90  11.89  11.92   LS = AD = FR   11.89
4  100    2.00  14.94  14.89  15.01   LS = AD = FR   14.89
4  1000   0.10  2.16   2.28   2.16    AD = FR < LS   2.17
4  1000   0.50  3.10   3.15   3.11    AD = FR < LS   3.15
4  1000   1.00  4.04   4.06   4.08    AD < LS = FR   4.06
4  1000   2.00  5.61   5.62   5.88    AD = LS < FR   5.36
4  10000  0.10  0.76   1.02   0.77    AD = FR < LS   0.71
4  10000  0.50  1.04   1.11   1.04    AD = FR < LS   1.01
4  10000  1.00  1.28   1.30   1.33    AD < LS < FR   1.30
4  10000  2.00  1.87   1.87   2.36    AD = LS < FR   2.06

'AD' = Admixture with ε = MN×10⁻⁴, 'LS1' = Least-squares with ε = MN×10⁻⁴ and α = 1, 'FR' = FRAPPE with ε = 1. Bold values indicate significantly less error than those without bold. '<' indicates significantly less at the 4.6e-4 level, and '=' indicates an insignificant difference. 'LSα' = Least-squares with the correct α, provided only for reference.


algorithms was relatively small compared to other characteristics of the experiment such as sample size, number of populations, and the degree of admixture in the sample. That is, improving accuracy has more to do with improving the dataset than with selecting the algorithm, suggesting that algorithm selection may depend on other criteria such as speed. In nearly all cases, the least-squares method computes its solution faster, typically 1.5 to 5 times faster. At the current problem size involving about 10000 loci, this speed improvement may justify the use of least-squares algorithms. For a single point estimate, researchers may prefer a slightly more accurate algorithm at the cost of seconds or minutes. For researchers testing several values of K and α and using multiple runs to gauge the fitness of each parameter set, or those estimating standard errors [13], the speed improvement could be the difference between hours and days of computation. As the number of loci increases to hundreds of thousands or even millions, speed may become more important. The least-squares approach offers a simpler and faster alternative algorithm for population inference that provides qualitatively similar results.

The key speed advantage of the least-squares approach comes from a single nonnegative least-squares update that minimizes a quadratic criterion for P and then for Q per iteration. Admixture, on the other hand, minimizes several quadratic criteria sequentially as it fits the true binomial model. Although the least-squares algorithm completes each update in less time and is guaranteed to converge to a local minimum or saddle point, predicting the number of iterations to convergence presents a challenge. We provide empirical timing results and note that selecting a suitable stopping criterion for these iterative methods can change the timing and accuracy results. For comparison, we use the same stopping criterion with published thresholds for Admixture and FRAPPE [13], and a threshold of MN×10⁻¹⁰ for least-squares.

This work is motivated in part by the desire to analyze larger genotype datasets. In this paper, we focus on the computational challenges of analyzing very large numbers of markers and individuals. However, linkage disequilibrium introduces correlations between loci that cannot be avoided in very large datasets. Large datasets can be pruned to diminish the correlation between loci. For example, Alexander et al. prune the HapMap phase 3 dataset from millions of SNPs down to around 10000 to avoid correlations. In this study, we assume linkage equilibrium, and therefore uncorrelated markers, and limit our analysis to datasets of fewer than about 10000 SNPs. Incorporating linkage disequilibrium into gradient-based optimizations of the binomial likelihood model remains an open problem.

Estimating the number of populations K from the admixed samples continues to pose a difficult challenge for clustering algorithms in general and population inference in particular. In practice, experiments can be designed to include individual samples that are expected to be distributed close to their ancestors. For example, Tang et al. [11] suggested using domain knowledge to collect an appropriate number of pseudo-ancestors that

Table 6 Computation time

K  N      α     AD       LS1      FRAPPE    Significance   LSα
2  100    0.10  4.71     1.00     9.97      LS < AD < FR   0.77
2  100    0.50  4.69     1.16     8.22      LS < AD < FR   1.12
2  100    1.00  5.46     1.78     8.31      LS < AD < FR   1.77
2  100    2.00  6.25     2.37     10.40     LS < AD < FR   2.55
2  1000   0.10  43.37    11.87    136.88    LS < AD < FR   8.06
2  1000   0.50  51.70    13.98    112.41    LS < AD < FR   12.34
2  1000   1.00  62.00    24.43    118.90    LS < AD < FR   24.03
2  1000   2.00  83.07    51.33    195.43    LS < AD < FR   48.43
2  10000  0.10  447.68   142.14   1963.83   LS < AD < FR   93.61
2  10000  0.50  570.12   209.39   1908.72   LS < AD < FR   157.44
2  10000  1.00  687.88   352.24   2242.18   LS < AD < FR   349.51
2  10000  2.00  1037.45  796.83   3762.70   LS < AD < FR   406.63
3  100    0.10  6.10     1.84     15.29     LS < AD < FR   1.48
3  100    0.50  6.42     2.05     15.75     LS < AD < FR   1.90
3  100    1.00  7.19     2.71     16.78     LS < AD < FR   2.74
3  100    2.00  9.00     4.01     19.80     LS < AD < FR   4.24
3  1000   0.10  69.41    18.32    223.32    LS < AD < FR   12.53
3  1000   0.50  78.73    24.10    264.85    LS < AD < FR   21.42
3  1000   1.00  96.89    38.06    305.50    LS < AD < FR   36.63
3  1000   2.00  121.45   60.79    355.51    LS < AD < FR   55.54
3  10000  0.10  791.36   155.56   3256.83   LS < AD < FR   121.19
3  10000  0.50  883.99   301.52   4251.68   LS < AD < FR   264.77
3  10000  1.00  1175.25  617.80   5111.92   LS < AD < FR   578.42
3  10000  2.00  1506.20  1404.27  7052.33   LS < AD < FR   901.56
4  100    0.10  8.06     2.45     23.93     LS < AD < FR   2.00
4  100    0.50  8.78     2.66     26.56     LS < AD < FR   2.72
4  100    1.00  10.03    3.70     30.89     LS < AD < FR   3.43
4  100    2.00  12.94    5.00     37.26     LS < AD < FR   4.86
4  1000   0.10  81.72    17.32    386.11    LS < AD < FR   13.45
4  1000   0.50  99.92    24.37    433.17    LS < AD < FR   22.68
4  1000   1.00  117.71   36.94    508.49    LS < AD < FR   36.01
4  1000   2.00  156.39   58.02    564.57    LS < AD < FR   57.62
4  10000  0.10  879.95   229.06   5798.15   LS < AD < FR   176.27
4  10000  0.50  1170.97  480.99   7051.69   LS < AD < FR   505.45
4  10000  1.00  1555.90  1017.41  8108.08   LS < AD < FR   1051.81
4  10000  2.00  2202.08  2538.54  10445.75  AD = LS < FR   1308.79

'AD' = Admixture with ε = MN×10⁻⁴, 'LS1' = Least-squares with ε = MN×10⁻⁴ and α = 1, 'FR' = FRAPPE with ε = 1. Bold values indicate significantly less time than those without bold. '<' indicates significantly less at the 4.6e-4 level, and '=' indicates an insignificant difference. 'LSα' = Least-squares with the correct α, provided only for reference.


reveal allele frequencies of the ancestral populations. The number of groups considered provides a convenient starting point for K. Lacking domain knowledge, computational approaches can be used to try multiple reasonable values of K and evaluate their fitness. For example, Pritchard et al. [9] estimate the posterior distribution of K and select the most probable value. Another approach is to evaluate the consistency of inference for different values of K. If the same value of K leads to very different inferences of P and Q from different random starting points, the inference can be considered inconsistent. Brunet et al. [18] proposed this method of model selection, called consensus clustering.

For realistic population allele frequencies, P, from the HapMap Phase 3 dataset and very little admixture in Q, Admixture provides better estimates of Q. The key advantage of Admixture appears to be for individuals containing nearly zero contribution from one or more inferred populations, whereas the least-squares approach performs better when the individuals are well mixed. Visually, both approaches reveal population structure. Using the two approaches to infer three ancestral populations from four

Table 7 Simulation experiments (1–3) using realistic population allele frequencies from the HapMap phase 3 project

[The original table also includes panels labeled Original, Admixture, Least-squares (α=1), and Least-squares with α for each simulation.]

                Simulation 1: q ~ Dir(1,1,1)        Simulation 2: q ~ Dir(.5,.5,.5)     Simulation 3: q ~ Dir(.1,.1,.1)
                RMSE P (%)   RMSE Q (%)   Time (s)   RMSE P (%)   RMSE Q (%)   Time (s)   RMSE P (%)   RMSE Q (%)   Time (s)
AD (ε=1e-4)     2.50 ± 0.04  2.19 ± 0.11  105 ± 13   1.99 ± 0.02  1.44 ± 0.04  88 ± 9     1.54 ± 0.01  0.76 ± 0.02  86 ± 7
AD (ε=1.4e-3)   2.50 ± 0.04  2.19 ± 0.11  98 ± 13    1.99 ± 0.02  1.44 ± 0.04  87 ± 11    1.54 ± 0.01  0.76 ± 0.02  83 ± 9
LS1 (ε=1.4e-3)  2.51 ± 0.03  1.85 ± 0.07  51 ± 6     2.04 ± 0.02  1.43 ± 0.04  37 ± 8     1.63 ± 0.01  1.75 ± 0.05  27 ± 5
LSα (ε=1.4e-3)  2.51 ± 0.03  1.85 ± 0.07  54 ± 8     2.03 ± 0.02  1.53 ± 0.04  28 ± 4     1.57 ± 0.01  1.08 ± 0.02  15 ± 4

RMSE and computation time are reported as mean ± standard deviation.


HapMap Phase 3 sampling populations reveals qualitatively

similar results.

We believe the computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. For example, we plan to adapt and apply the least-squares approach to datasets utilizing microsatellite data rather than SNPs and to consider the case of more than two alleles per locus. Researchers have incorporated geospatial information into sampling-based [19] and PCA-based [8] approaches. Multiple other extensions to sampling-based or PCA-based algorithms have yet to be incorporated into faster gradient-based approaches.

Conclusion

This paper explores the utility of a least-squares approach for the inference of population structure in genotype datasets. Whereas previous Euclidean distance-based approaches received little theoretical justification, we show that a least-squares approach is the result of a first-order approximation of the negative log-likelihood function for the binomial generative model. In addition,

Table 8 Simulation experiments (4–6) using realistic population allele frequencies from the HapMap phase 3 project

[The original table also includes panels labeled Original, Admixture, Least-squares (α=1), and Least-squares with α for each simulation.]

                Simulation 4: q ~ Dir(.2,.2,.05)    Simulation 5: q ~ Dir(.2,.2,.5)     Simulation 6: q ~ Dir(.05,.05,.01)
                RMSE P (%)   RMSE Q (%)   Time (s)   RMSE P (%)   RMSE Q (%)   Time (s)   RMSE P (%)   RMSE Q (%)   Time (s)
AD (ε=1e-4)     2.01 ± 0.05  0.87 ± 0.02  94 ± 12    1.98 ± 0.03  1.16 ± 0.03  93 ± 17    1.96 ± 0.07  0.53 ± 0.02  91 ± 9
AD (ε=1.4e-3)   2.01 ± 0.05  0.87 ± 0.02  82 ± 5     1.98 ± 0.03  1.16 ± 0.03  86 ± 13    1.96 ± 0.07  0.53 ± 0.02  82 ± 7
LS1 (ε=1.4e-3)  2.09 ± 0.05  1.70 ± 0.05  31 ± 7     2.06 ± 0.03  1.60 ± 0.04  34 ± 5     2.04 ± 0.07  2.00 ± 0.04  27 ± 7
LSα (ε=1.4e-3)  2.05 ± 0.05  1.17 ± 0.03  17 ± 3     2.02 ± 0.04  1.34 ± 0.04  24 ± 4     1.99 ± 0.07  1.09 ± 0.03  14 ± 3

RMSE and computation time are reported as mean ± standard deviation.


we show that the error in this approximation approaches zero as the number of samples (individuals and loci) increases. We compare our algorithm to the state-of-the-art algorithms for optimizing the binomial likelihood model, Admixture and FRAPPE, and show that our approach requires less time and performs comparably well. We provide both quantitative and visual comparisons that illustrate the advantage of Admixture at estimating individuals with little admixture, and show that our approach infers qualitatively similar results. Finally, we incorporate a degree-of-admixture parameter that improves estimates for known levels of admixture without requiring additional parameter tuning, as is the case for Admixture.

Methods

The algorithms we discuss accept the number of populations, K, and an M × N genotype matrix, G, as input:

$$G = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1} & g_{M2} & \cdots & g_{MN} \end{bmatrix} \tag{3}$$

where g_{li} ∈ {0, 1, 2} represents the number of copies of the reference allele at the l-th locus for the i-th individual, M is the number of markers (loci), and N is the number of individuals. Given the genotype matrix, G, the algorithms attempt to infer the population allele frequencies and the individual admixture proportions. The matrix P contains the population allele frequencies:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{M1} & p_{M2} & \cdots & p_{MK} \end{bmatrix} \tag{4}$$

where 0 ≤ p_{lk} ≤ 1 represents the fraction of reference alleles out of all alleles at the l-th locus in the k-th population. The matrix Q contains the individual admixture proportions:

$$Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1N} \\ q_{21} & q_{22} & \cdots & q_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ q_{K1} & q_{K2} & \cdots & q_{KN} \end{bmatrix} \tag{5}$$

where 0 ≤ q_{ki} ≤ 1 represents the fraction of the i-th individual's genome originating from the k-th population and, for all i, Σ_k q_{ki} = 1. Table 1 summarizes the matrix notation we use.
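To make the notation concrete, the following sketch (our own illustration, assuming NumPy; the variable names are ours) builds a small random instance of P and Q that satisfies the stated constraints and forms the M × N matrix of per-copy reference-allele probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 4, 3        # loci, individuals, populations (toy sizes)

# P: M x K matrix of allele frequencies, each entry in [0, 1]
P = rng.random((M, K))

# Q: K x N matrix of admixture proportions; each column lies on the simplex
Q = rng.dirichlet(np.ones(K), size=N).T

# m_li = sum_k p_lk q_ki, the per-copy reference-allele probability
Mmat = P @ Q

assert Mmat.shape == (M, N)
assert np.allclose(Q.sum(axis=0), 1.0)          # sum_k q_ki = 1 for every i
assert (Mmat >= 0).all() and (Mmat <= 1).all()  # valid probabilities
```

Because each column of Q is a convex-combination weight vector, every entry of PQ stays inside [0, 1], which is what lets 2PQ serve as the expected genotype matrix below.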

Likelihood function

Alexander et al. model the genotype (i.e., the number of reference alleles at a particular locus) as the result of two draws from a binomial distribution [13]. In the generative model, each allele copy for one individual at one locus has an equal chance, m_{li}, of receiving the reference allele:

$$m_{li} = \sum_{k=1}^{K} p_{lk}\, q_{ki} \tag{6}$$

The log-likelihood of the parameters P and Q from the original Structure binomial model, ignoring an additive constant, is the following [13]:

$$\mathcal{L}(M) = \sum_{l=1}^{M} \sum_{i=1}^{N} \big[\, g_{li} \ln m_{li} + (2 - g_{li}) \ln(1 - m_{li}) \,\big] \tag{7}$$


Figure 5 Comparison on HapMap Phase 3 dataset. Inferred population membership proportions using (a) Admixture and (b) least-squares with α=1.

Each point represents a different individual among the four populations: ASW, CEU, MEX, and YRI. The axes represent the proportion of each individual’s

genome originating from each inferred population. The proportion belonging to the third inferred population is given by q3= 1 – q1– q2.


To see the effect on gradient-based optimization, we also present the derivative of the likelihood with respect to a particular m_{li}:

$$\frac{\partial}{\partial m_{li}} \mathcal{L}(M) = \frac{g_{li} - 2m_{li}}{m_{li}(1 - m_{li})} \approx 4\,(g_{li} - 2m_{li}) \tag{8}$$
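The quality of this linearization is easy to check numerically. In this sketch (our own, assuming NumPy), the true and approximated slopes coincide exactly at the expansion point m = 1/2 for every genotype value, and diverge toward the boundary:

```python
import numpy as np

def slope_true(g, m):
    # True slope of the log-likelihood with respect to m (Equation 8)
    return (g - 2 * m) / (m * (1 - m))

def slope_approx(g, m):
    # First-order Taylor approximation of the slope around m = 1/2
    return 4 * (g - 2 * m)

# Exact agreement at the expansion point m = 1/2 for g in {0, 1, 2}
for g in (0, 1, 2):
    assert np.isclose(slope_true(g, 0.5), slope_approx(g, 0.5))

# Close nearby, divergent toward the boundary
for m in (0.45, 0.95):
    print(m, slope_true(0, m), slope_approx(0, m))
```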

In order to achieve a least-squares criterion, we must approximate this derivative with a line. Figure 6 plots this derivative with respect to m_{li} for the three possible values of g_{li} (0, 1, or 2). To avoid biasing the approximation toward high or low values of m_{li}, we approximate the derivative with its first-order Taylor approximation in the neighborhood of m_{li} = 1/2. More complex optimizations might update the neighborhood of the Taylor approximation during the optimization. In the interest of simplicity, we select one neighborhood for all iterations, genotypes, individuals, and loci. The following least-squares objective function has the approximated derivative given in the equation above:

$$-\mathcal{L}(M) \approx \sum_{l=1}^{M} \sum_{i=1}^{N} (2m_{li} - g_{li})^2 = \| 2M - G \|_2^2 \tag{9}$$

The right-hand side of Equation 9 provides the least-squares criterion. Figure 6 shows the deviation between the linear approximation and the true slope. Values match closely for 0.35 ≤ m_{li} ≤ 0.65, but as m_{li} approaches zero or one the true slope diverges for two of the three genotypes. Therefore, we have the following least-squares optimization problem:

$$\underset{P,\,Q}{\arg\min}\; \| 2PQ - G \|_2^2 \quad \text{such that} \quad \begin{cases} 0 \le P \le 1 \\ Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1 \end{cases} \tag{10}$$
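Both criteria are straightforward to evaluate on simulated data. This sketch (our own, assuming NumPy) computes the exact negative log-likelihood of Equation 7 and the least-squares surrogate of Equation 9 at the true parameters; it illustrates the computation only, not a claim about their relative magnitudes:

```python
import numpy as np

rng = np.random.default_rng(1)
M_loci, N, K = 200, 30, 3

P = rng.uniform(0.05, 0.95, (M_loci, K))   # keep m_li away from 0 and 1
Q = rng.dirichlet(np.ones(K), size=N).T    # K x N, columns sum to one
Mmat = P @ Q                               # m_li
G = rng.binomial(2, Mmat)                  # simulated genotypes

# Exact negative log-likelihood (Equation 7, up to an additive constant)
neg_loglik = -np.sum(G * np.log(Mmat) + (2 - G) * np.log(1 - Mmat))

# Least-squares surrogate (Equation 9): ||2M - G||_2^2
surrogate = np.sum((2 * Mmat - G) ** 2)

print(neg_loglik, surrogate)
```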

Bounded error for the least-squares approach

We justify the least-squares approach by showing that the expected value across all genotypes is equal to the true value in the binomial likelihood model, and that the covariance approaches zero as the size of the data increases. In order to analyze the least-squares performance across all possible genotype matrices, we consider the generative model for G. Given the true ancestral population allele frequencies, P, and the proportion of each individual's alleles originating from each population, Q, the genotype at locus l for individual i is a binomial random variable, g_{li}:

$$g_{li} \sim \mathrm{Binomial}(2,\, m_{li}), \qquad m_{li} = \sum_{k=1}^{K} p_{lk}\, q_{ki} \tag{11}$$

If M were directly observable, we could solve for P or Q given the other using P = MQ^{#} or Q = P^{#}M, where # denotes the Moore-Penrose pseudo-inverse. However, we only observe the elements of G, which is only partially informative of M. First we consider the uncertainty in estimating P. Each g_{li} is an independent random variable with the following mean and bound on the variance:

$$E[g_{li}] = 2m_{li}, \qquad \mathrm{var}[g_{li}] = 2m_{li}(1 - m_{li}) \le \tfrac{1}{2} \tag{12}$$
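The moments in Equation 12 follow directly from the binomial model and can be confirmed by simulation (our own sketch, assuming NumPy); the variance 2m(1-m) attains its maximum of 1/2 at m = 1/2:

```python
import numpy as np

rng = np.random.default_rng(2)

for m in (0.1, 0.5, 0.8):
    g = rng.binomial(2, m, size=200_000)          # simulated genotypes
    assert abs(g.mean() - 2 * m) < 0.01           # E[g] = 2m
    assert abs(g.var() - 2 * m * (1 - m)) < 0.01  # var[g] = 2m(1-m)
    assert g.var() <= 0.5 + 0.01                  # var[g] <= 1/2
```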

Mean and total variance of the estimate of p

For ease of notation, we focus on one locus at index l: one row of P, p̂ = [p̂_{l1}, p̂_{l2}, ..., p̂_{lK}]^T, and the corresponding row of G, g = [g_{l1}, g_{l2}, ..., g_{lN}]^T. We estimate the mean and covariance and provide a bound on the total variance of the estimate:

$$\hat{p} = \tfrac{1}{2}\,(Q^T)^{\#} g, \qquad E[\hat{p}] = p, \qquad \operatorname{cov}[\hat{p}] = \tfrac{1}{4}\,(Q^T)^{\#}\, \operatorname{cov}[g]\, Q^{\#}, \qquad \operatorname{trace}\!\big(\operatorname{cov}[\hat{p}]\big) \le \tfrac{1}{8}\operatorname{trace}\!\big((QQ^T)^{-1}\big) \tag{13}$$

Intuitively, QQTscales linearly with N and we expect

the bound on the trace to decrease linearly with N. If


Figure 6 First-order approximation for slope of log-likelihood of m. Solid and dashed lines correspond to the true and approximated slope,

respectively. The red, green, and blue lines correspond to g = 0, g = 1, and g = 2, respectively.


the columns, q, of Q are independent and identically distributed, QQ^T approaches N × E[qq^T], resulting in a bound that decreases linearly with N:

$$\operatorname{trace}\!\big(\operatorname{cov}[\hat{p}]\big) \le \frac{1}{8N}\operatorname{trace}\!\left(E[qq^T]^{-1}\right) \tag{14}$$

To put this bound in more familiar terms, we consider q drawn from a Dirichlet distribution with shape parameter α, resulting in the following (written out for K = 2):

$$E[qq^T] = \frac{1}{4\alpha + 2}\begin{bmatrix}\alpha + 1 & \alpha\\ \alpha & \alpha + 1\end{bmatrix} \tag{15}$$

Asymptotically, QQ^T approaches N × E[qq^T] and (QQ^T)^{-1} approaches:

$$(QQ^T)^{-1} \to \frac{2}{N}\begin{bmatrix}\alpha + 1 & -\alpha\\ -\alpha & \alpha + 1\end{bmatrix} \tag{16}$$

resulting in the following asymptotic bound on the total variance:

$$\operatorname{trace}\!\big(\operatorname{cov}[\hat{p}]\big) \le \frac{\alpha + 1}{2N} \tag{17}$$
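These two claims, unbiasedness and a total variance that shrinks like 1/N, can be checked with a Monte Carlo sketch (our own, assuming NumPy; p̂ is formed with the pseudo-inverse exactly as in Equation 13):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 3
p = np.array([0.2, 0.5, 0.8])                  # one true row of P

def estimate_p(N, trials=2000):
    Q = rng.dirichlet(np.ones(K), size=N).T    # K x N admixture, alpha = 1
    W = 0.5 * np.linalg.pinv(Q.T)              # p_hat = (1/2) (Q^T)^# g
    m = p @ Q                                  # per-individual allele frequency
    g = rng.binomial(2, m, size=(trials, N))   # rows of G across trials
    return g @ W.T                             # trials x K estimates of p

est_small, est_large = estimate_p(50), estimate_p(5000)

# Unbiasedness: the average estimate approaches the true p
assert np.allclose(est_large.mean(axis=0), p, atol=0.02)

# Total variance shrinks with N (asymptotic bound (alpha + 1) / (2N))
assert est_large.var(axis=0).sum() < est_small.var(axis=0).sum()
```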

Mean and total variance of the estimate for q

The same analysis can be repeated for one individual at index i: one column of Q, q̂ = [q̂_{1i}, q̂_{2i}, ..., q̂_{Ki}]^T, and the corresponding column of G, g = [g_{1i}, g_{2i}, ..., g_{Mi}]^T:

$$\hat{q} = \tfrac{1}{2}\,P^{\#} g, \qquad E[\hat{q}] = q, \qquad \operatorname{cov}[\hat{q}] = \tfrac{1}{4}\,P^{\#}\, \operatorname{cov}[g]\, (P^{\#})^T, \qquad \operatorname{trace}\!\big(\operatorname{cov}[\hat{q}]\big) \le \tfrac{1}{8}\operatorname{trace}\!\big((P^T P)^{-1}\big) \tag{18}$$

Intuitively, P^T P increases linearly with M, and we expect the bound on the total variance to decrease linearly with M. Similarly, if the rows, p, of P are independent and identically distributed, P^T P approaches M × E[p^T p], resulting in an asymptotic bound that decreases linearly with M:

$$\operatorname{trace}\!\big(\operatorname{cov}[\hat{q}]\big) \le \frac{1}{8M}\operatorname{trace}\!\left(E[p^T p]^{-1}\right) \tag{19}$$

Incorporating degree of admixture, α

Pritchard et al. [9] use a prior distribution to bias the solution toward those with a desired level of admixture. This prior on the columns of Q takes the form of a Dirichlet distribution:

$$q \sim D(\alpha, \alpha, \ldots, \alpha) \tag{20}$$

Because all the shape parameters (α) are equal, this prior assumes that all ancestral populations are equally represented in the current sample. The log of this prior probability, ignoring an additive constant, is the following:

$$\ln P(q) = (\alpha - 1) \sum_{k=1}^{K} \ln q_k, \quad \text{where } q_K = 1 - \sum_{k=1}^{K-1} q_k \tag{21}$$

The derivative of the log prior with respect to q_k and its first-order approximation at the mean, q_k = 1/K, are the following:

$$\frac{\partial}{\partial q_k} \ln P(q) = -(\alpha - 1)\,\frac{q_k - q_K}{q_k\, q_K} \approx -2K^2(\alpha - 1)\left(q_k - \frac{1}{K}\right) \tag{22}$$


Figure 7 First-order approximation for slope of log-likelihood of q. Solid and dashed lines correspond to the true and approximated slope,

respectively, for K = 2. The blue, green, red, and orange lines correspond to α = 0.1, α = 0.5, α = 1, and α = 2, respectively.


The following penalty function combines the columns of Q into a single negative log-likelihood function with the approximated derivative given in the equation above:

$$-\ln p(Q) \approx \sum_{i=1}^{N} \sum_{k=1}^{K} K^2(\alpha - 1)\left(q_{ki} - \frac{1}{K}\right)^2 = K^2(\alpha - 1)\left\| Q - \frac{1}{K} \right\|_2^2 \tag{23}$$
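A quick numerical check of Equation 23 (our own sketch, assuming NumPy): the gradient of the penalty is 2K²(α-1)(q_ki - 1/K), the negative of the approximated log-prior slope in Equation 22, and a finite difference confirms it:

```python
import numpy as np

K, N, alpha = 4, 5, 2.0
rng = np.random.default_rng(4)
Q = rng.dirichlet(np.ones(K), size=N).T

def penalty(Q):
    # Equation 23: K^2 (alpha - 1) ||Q - 1/K||_2^2
    return K**2 * (alpha - 1) * np.sum((Q - 1.0 / K) ** 2)

# Analytic gradient: 2 K^2 (alpha - 1) (q_ki - 1/K)
grad = 2 * K**2 * (alpha - 1) * (Q - 1.0 / K)

# Finite-difference check of the (0, 0) entry
eps = 1e-6
Q_pert = Q.copy()
Q_pert[0, 0] += eps
fd = (penalty(Q_pert) - penalty(Q)) / eps
assert abs(fd - grad[0, 0]) < 1e-3
```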

The right-hand side of Equation 23 acts as a penalty term for the least-squares criterion in Equation 9. Figure 7 shows the difference between the real and approximated slope. For q near its mean of 1/K, the approximation fits closely, but for extreme values of q the true slope diverges. Combining the terms in Equations 9 and 23 and including the problem constraints, we have the following least-squares optimization problem:

$$\underset{P,\,Q}{\arg\min}\; \| 2PQ - G \|_2^2 + K^2(\alpha - 1)\left\| Q - \frac{1}{K} \right\|_2^2 \quad \text{such that} \quad \begin{cases} 0 \le P \le 1 \\ Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1 \end{cases} \tag{24}$$

Optimization algorithm

The non-convex optimization problem in Equation 10 can be approached as a two-block coordinate descent problem [15,20]. We initialize Q with nonnegative values such that each column sums to one. Then, we alternate between minimizing the criterion function with respect to P with Q fixed:

$$\underset{0 \le P \le 1}{\arg\min}\; \| 2PQ - G \|_2^2 \tag{25}$$

and then minimizing with respect to Q with P fixed:

$$\underset{\substack{Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1}}{\arg\min}\; \| 2PQ - G \|_2^2 + K^2(\alpha - 1)\left\| Q - \frac{1}{K} \right\|_2^2 \tag{26}$$

This process is repeated until the change in the criterion function is less than ε, at which point we consider the algorithm to have converged. The Admixture algorithm suggests a threshold of ε = 10⁻⁴, but we have found that a larger threshold often suffices. Unless otherwise stated, we use a threshold that depends on the size of the problem, ε = MN×10⁻¹⁰, corresponding to 10⁻⁴ when M = 10000 and N = 100.
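The two-block scheme can be sketched end to end (our own simplified illustration, assuming NumPy). The constrained sub-problems of Equations 25 and 26 are replaced here by the unconstrained solutions of Equations 27 and 29 followed by clipping and renormalization, rather than by the exact active/passive-set solver described below:

```python
import numpy as np

def fit_ls(G, K, alpha=1.0, eps=1e-6, max_iter=500, seed=0):
    """Simplified two-block coordinate descent for Equation 24 (sketch)."""
    rng = np.random.default_rng(seed)
    M, N = G.shape
    Q = rng.dirichlet(np.ones(K), size=N).T            # feasible start
    prev = np.inf
    for _ in range(max_iter):
        # P-step (Equation 27), projected onto the box [0, 1]
        P = np.clip(0.5 * G @ np.linalg.pinv(Q), 0.0, 1.0)
        # Q-step (Equation 29), then projected back to the simplex
        A = 4 * P.T @ P + K**2 * (alpha - 1) * np.eye(K)
        B = 2 * P.T @ G + K * (alpha - 1) * np.ones((K, N))
        Q = np.clip(np.linalg.solve(A, B), 1e-9, None)
        Q /= Q.sum(axis=0, keepdims=True)
        crit = np.sum((2 * P @ Q - G) ** 2)
        if prev - crit < eps:                          # convergence test
            break
        prev = crit
    return P, Q, crit

# Toy usage on simulated data
rng = np.random.default_rng(1)
M, N, K = 100, 40, 2
P0 = rng.uniform(0.1, 0.9, (M, K))
Q0 = rng.dirichlet(np.ones(K), size=N).T
G = rng.binomial(2, P0 @ Q0)
P, Q, crit = fit_ls(G, K)
```

Note that the clip-and-renormalize projection does not guarantee the monotone descent that the exact constrained solves provide; it is used here only to keep the sketch short.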

Least-squares solution for P

Van Benthem and Keenan [16] propose a fast nonnegatively constrained active/passive set algorithm that avoids redundant calculations for problems with multiple right-hand sides. Without considering the constraints on P, Equation 25 can be solved classically using the pseudo-inverse of Q:

$$\hat{P} = \tfrac{1}{2}\, G Q^T \big(QQ^T\big)^{-1} \tag{27}$$

However, some of the elements of P may be less than zero. In the active/passive set approach, if elements of P are negative, they are clamped at zero and added to the active set. The unconstrained solution is then applied to the remaining passive elements of P. If that solution happens to be nonnegative, the algorithm finishes. If not, negative elements are added to the active set, and elements in the active set with a negative gradient (which would decrease the criterion by increasing) are moved back to the passive set. The process repeats until the passive set is nonnegative and the active set contains only elements with a positive gradient at zero. We extend the approach of Van Benthem and Keenan to include an upper bound at one. Therefore, we maintain two active sets, those clamped at zero and those clamped at one, and update both after the unconstrained optimization of the passive set at each iteration. We provide Matlab source code that implements this algorithm on our website.
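The active/passive-set machinery is involved; as a compact stand-in (our own illustration, assuming NumPy, and not the Van Benthem and Keenan algorithm), projected gradient descent solves the same box-constrained problem for P:

```python
import numpy as np

def solve_P_box(G, Q, iters=1000):
    """Minimize ||2PQ - G||_2^2 subject to 0 <= P <= 1 by projected
    gradient descent (illustrative substitute for the active/passive-set
    solver; step size 1/L with L the Lipschitz constant of the gradient)."""
    K = Q.shape[0]
    P = np.full((G.shape[0], K), 0.5)
    L = 8.0 * np.linalg.norm(Q @ Q.T, 2)       # spectral norm bound
    for _ in range(iters):
        grad = 4.0 * (2.0 * P @ Q - G) @ Q.T   # gradient of ||2PQ - G||^2 in P
        P = np.clip(P - grad / L, 0.0, 1.0)    # step, then project onto the box
    return P

# On noiseless data G = 2 P0 Q0 with interior P0, the solver recovers P0
rng = np.random.default_rng(5)
M, N, K = 30, 50, 2
P0 = rng.uniform(0.1, 0.9, (M, K))
Q0 = rng.dirichlet(np.ones(K), size=N).T
P_rec = solve_P_box(2 * P0 @ Q0, Q0)
assert np.allclose(P_rec, P0, atol=1e-4)
```

The projected-gradient variant is much slower per unit of accuracy than an active/passive-set solver, which is why the latter is preferred in the paper.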

Least-squares solution for Q

When solving for Q, it is convenient to reformulate Equation 26 in simpler terms:

$$\underset{\substack{Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1}}{\arg\min}\; \left\| \bar{P} Q - \bar{G} \right\|_2^2, \qquad \bar{P} = \begin{bmatrix} 2P \\ K(\alpha - 1)^{1/2}\, I_K \end{bmatrix}, \qquad \bar{G} = \begin{bmatrix} G \\ (\alpha - 1)^{1/2}\, \mathbf{1}_{K \times N} \end{bmatrix} \tag{28}$$

The unconstrained solution of this equation is the following:

$$\hat{Q} = \big( 4P^T P + K^2(\alpha - 1) I \big)^{-1} \big( 2P^T G + K(\alpha - 1)\, \mathbf{1}_{K \times N} \big) = \big( \bar{P}^T \bar{P} \big)^{-1} \bar{P}^T \bar{G} \tag{29}$$
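The equivalence of the stacked form (Equation 28) and the normal-equations form (Equation 29) can be verified numerically (our own sketch, assuming NumPy and α > 1 so that the square roots are real):

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, K, alpha = 60, 10, 3, 2.0
P = rng.uniform(0.1, 0.9, (M, K))
G = rng.binomial(2, 0.5, (M, N)).astype(float)

# Stacked matrices of Equation 28
P_bar = np.vstack([2 * P, K * np.sqrt(alpha - 1) * np.eye(K)])
G_bar = np.vstack([G, np.sqrt(alpha - 1) * np.ones((K, N))])

# Equation 29: normal-equations form ...
Q1 = np.linalg.solve(4 * P.T @ P + K**2 * (alpha - 1) * np.eye(K),
                     2 * P.T @ G + K * (alpha - 1) * np.ones((K, N)))
# ... equals the unconstrained least-squares solution of the stacked system
Q2 = np.linalg.pinv(P_bar) @ G_bar
assert np.allclose(Q1, Q2)
```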

When prior information is known about the sparse-

ness, we use α in the equations above. When no prior

information is known, we use α = 1 corresponding to

the uninformative prior and resulting in the ordinary
