# A fast least-squares algorithm for population inference

**Abstract**

**Background**
Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo, or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.
**Results**
We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5- to 6-times faster.
**Conclusions**
The computational advantage of the least-squares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance among all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.

*Research Article · Open Access*

R Mitchell Parry¹ and May D Wang¹,²,³*


## Background

The inference of population structure from the genotypes of admixed individuals poses a significant problem in population genetics. For example, genome-wide association studies (GWAS) compare the genetic makeup of different individuals in order to extract differences in the genome that may contribute to the development or suppression of disease. Of particular interest are single nucleotide polymorphisms (SNPs) that reveal genetic changes at a single nucleotide in the DNA chain. When a particular SNP variant is associated with a disease, this may indicate that the gene plays a role in the disease pathway, or that the gene was simply inherited from a population that is more (or less) predisposed to the disease. Determining the inherent population structure within a sample removes confounding factors before further analysis and reveals migration patterns and ancestry [1]. This paper deals with the problem of inferring the proportion of an individual’s genome originating from multiple ancestral populations and the allele frequencies in these ancestral populations from genotype data.

\* Correspondence: maywang@bme.gatech.edu
¹ The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
² Parker H. Petit Institute of Bioengineering and Biosciences and Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Full list of author information is available at the end of the article

© 2013 Parry and Wang; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Parry and Wang BMC Bioinformatics 2013, 14:28
http://www.biomedcentral.com/1471-2105/14/28

Methods for revealing population structure are divided into fast multivariate analysis techniques and slower discrete admixture models [2]. Fast multivariate techniques such as principal components analysis (PCA) [2-8] reveal subspaces in the genome where large differences between individuals are observed. For case–control studies, the largest differences, commonly due to ancestry, are removed to reduce false positives [4]. Although PCA provides a fast solution, it does not directly infer the variables of interest: the population allele frequencies and individual admixture proportions. On the other hand, discrete admixture models that estimate these variables typically require much more computation time. Following a recent trend toward faster gradient-based methods, we propose a faster, simpler least-squares algorithm for estimating both the population allele frequencies and individual admixture proportions.

Pritchard et al. [9] originally proposed a discrete admixture likelihood model based on the random union of gametes for the purpose of population inference. In particular, their model assumes Hardy-Weinberg equilibrium within the ancestral populations (i.e., allele frequencies are constant) and linkage equilibrium between markers within each population (i.e., markers are independent). Each individual in the current sample is modeled as having some fraction of their genome originating from each of the ancestral populations. The goal of population inference is to estimate the ancestral population allele frequencies, P, and the admixture of each individual, Q, from the observed genotypes, G. If the population of origin for every allele, Z, is known, then the population allele frequencies and the admixture for each individual have a Dirichlet distribution. If, on the other hand, P and Q are known, the population of origin for each individual allele has a multinomial distribution. Pritchard et al. infer populations by alternately sampling Z from a multinomial distribution based on P and Q, and P and Q from Dirichlet distributions based on Z. Ideally, this Markov Chain Monte Carlo sampling method produces independent identically distributed samples (P, Q) from the posterior distribution P(P,Q|G). The inferred parameters are taken as the mean of the posterior. This algorithm is implemented in an open-source software tool called Structure [9].

The binomial likelihood model proposed by Pritchard et al. was originally used for datasets of tens or hundreds of loci. However, as datasets become larger, especially considering genome-wide association studies with thousands or millions of loci, two problems emerge. For one, linkage disequilibrium introduces correlations between markers. Although Falush et al. [10] extended Structure to incorporate loose linkage between loci, larger datasets also pose a computational challenge that has not been met by these sampling-based approaches. This has led to a series of more efficient optimization algorithms for the same likelihood model with uncorrelated loci. This paper focuses on improving computational performance, leaving the treatment of correlated loci to future research.

Tang et al. [11] proposed a more efficient expectation maximization (EM) approach. Instead of randomly sampling from the posterior distribution, the FRAPPE EM algorithm [11] starts with a randomly initialized Z, then alternates between updating the values of P and Q for fixed Z, and maximizing the likelihood of Z for fixed P and Q. Their approach achieves similar accuracy to Structure and requires much less computation time. Wu et al. [12] specialized the EM algorithm in FRAPPE to accommodate the model without admixture, and generalized it to have different mixing proportions at each locus. However, these EM algorithms estimate an unnecessary and unobservable variable Z, something that more efficient algorithms could avoid.

Alexander et al. [13] proposed an even faster approach for inferring P and Q using the same binomial likelihood model but bypassing the unobservable variable Z. Their closed-source software, Admixture, starts at a random feasible solution for P and Q and then alternates between maximizing the likelihood function with respect to P and then maximizing it with respect to Q. The likelihood is guaranteed not to decrease at each step, eventually converging at a local maximum or saddle point. For a moderate problem of approximately 10000 loci, Admixture achieves comparable accuracy to Structure and requires only minutes to execute compared to hours for Structure [13].

Another feature of Structure’s binomial likelihood model is that it allows the user to input prior knowledge about the degree of admixture. The prior distribution for Q takes the form of a Dirichlet distribution with a degree-of-admixture parameter, α, for every population. For α = 0, all of an individual’s alleles originate from the same ancestral population; for α > 0, individuals contain a mixture of alleles from different populations; for α = 1, every assignment of alleles to populations is equally likely (i.e., the non-informative prior); and for α → ∞, all individuals have equal contributions from every ancestral population. Alexander et al. replace the population degree-of-admixture parameter in Structure with two parameters, λ and γ, that when increased also decrease the level of admixture of the resulting individuals. However, the authors admit that tuning these parameters is nontrivial [14].
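The qualitative effect of α can be checked directly by sampling admixture vectors. The sketch below is our illustration (assuming NumPy), not code from Structure or Admixture; it draws columns of Q from a symmetric Dirichlet over K populations and summarizes how concentrated each individual's ancestry is:

```python
import numpy as np

def sample_admixture(K, N, alpha, seed=0):
    """Draw N admixture columns q_i from a symmetric Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    # Each column lies on the (K-1)-simplex: entries >= 0 and sum to one.
    return rng.dirichlet(np.full(K, alpha), size=N).T  # shape (K, N)

# Small alpha concentrates mass on a single population; large alpha pushes
# individuals toward equal contributions from every population.
for alpha in (0.1, 1.0, 2.0):
    Q = sample_admixture(K=3, N=5000, alpha=alpha)
    print(alpha, Q.max(axis=0).mean())  # mean of each individual's largest component
```

As α grows, the mean largest component shrinks toward 1/K, matching the limiting behaviors described above.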

This paper contributes to population inference research by (1) proposing a novel least-squares simplification of the binomial likelihood model that results in a faster algorithm, and (2) directly incorporating the prior parameter α, which improves estimates without requiring trial-and-error tuning. Specifically, we utilize a two-block coordinate descent method [15] to alternately minimize the criterion for P and then for Q. We adapt a fast non-negative least-squares algorithm [16] to additionally include a sum-to-one constraint for Q and an upper bound for P. We show that the expected values of the estimates of P (or Q) across all possible genotype datasets are equal to the true values when Q (or P) is known, and that the variance of this estimate approaches zero as the problem size increases. Compared to Admixture, the least-squares approach provides a slightly worse estimate of P or Q when the other is known. However, when estimating P and Q from only the genotype data, the least-squares approach sometimes provides better estimates, particularly with a large number of populations, a small number of samples, or more admixed individuals. The least-squares approximation provides a simpler and faster algorithm, and we provide it as Matlab scripts on our website.
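The two-block coordinate descent scheme can be sketched in a few lines. This is a simplified stand-in, not the authors' Matlab implementation: it uses SciPy's generic bounded solver `lsq_linear` in place of the adapted fast non-negative least-squares algorithm of [16], and it enforces the sum-to-one constraint on Q by a crude renormalization rather than exactly:

```python
import numpy as np
from scipy.optimize import lsq_linear

def alternating_ls(G, K, iters=20, seed=0):
    """Minimize ||G - 2PQ||_F^2 with 0 <= P <= 1 and simplex columns in Q."""
    rng = np.random.default_rng(seed)
    M, N = G.shape
    P = rng.uniform(size=(M, K))
    Q = rng.dirichlet(np.ones(K), size=N).T
    for _ in range(iters):
        # Update each row of P with Q fixed: g_l ~ 2 Q^T p_l, bounded in [0, 1].
        for l in range(M):
            P[l] = lsq_linear(2 * Q.T, G[l], bounds=(0, 1)).x
        # Update each column of Q with P fixed: g_i ~ 2 P q_i with q_i >= 0,
        # then renormalize to the simplex (approximate projection, not exact).
        for i in range(N):
            q = lsq_linear(2 * P, G[:, i], bounds=(0, np.inf)).x
            Q[:, i] = q / max(q.sum(), 1e-12)
    return P, Q
```

On genotypes generated from the model, the residual ‖G − 2PQ‖ shrinks over the alternating updates; the paper's actual algorithm replaces both inner solves with the constrained fast NNLS variant.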

## Results

First, we motivate a least-squares simplification of the binomial likelihood model by deriving the expected value and covariance of the least-squares estimate across all possible genotype matrices for partially solved problems. Second, we compare least-squares to sequential quadratic programming (Admixture’s optimization algorithm) for these cases. Third, we compare Admixture, FRAPPE, and least-squares using simulated datasets with a factorial design varying dataset properties in G. Fourth, we compare Admixture and least-squares using real population allele frequencies from the HapMap Phase 3 project. Finally, we compare the results of applying Admixture and least-squares to real data from the HapMap Phase 3 project where the true population structure is unknown.

The algorithms we discuss accept as input the number of populations, K, and the genotypes, $g_{li} \in \{0, 1, 2\}$, representing the number of copies of the reference allele at locus l for individual i. Then, the algorithms attempt to infer the population allele frequencies, $p_{lk} \in [0, 1]$, for locus l and population k, as well as the individual admixture proportions, $q_{ki} \in [0, 1]$, where $\sum_k q_{ki} = 1$. In all cases, 1 ≤ l ≤ M, 1 ≤ i ≤ N, and 1 ≤ k ≤ K. Table 1 summarizes the matrix notation.
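Under this notation, the binomial likelihood model generates each genotype as $g_{li} \sim \mathrm{Binomial}(2, (PQ)_{li})$. The following sketch (assuming NumPy; the function name is ours) makes the data model concrete:

```python
import numpy as np

def simulate_genotypes(P, Q, seed=0):
    """Draw G with g_li ~ Binomial(2, sum_k p_lk * q_ki)."""
    rng = np.random.default_rng(seed)
    F = P @ Q  # (M, N) expected reference-allele frequency per locus/individual
    return rng.binomial(2, F)

M, N, K = 1000, 50, 3
rng = np.random.default_rng(1)
P = rng.uniform(size=(M, K))             # population allele frequencies in [0, 1]
Q = rng.dirichlet(np.ones(K), size=N).T  # admixture columns on the simplex
G = simulate_genotypes(P, Q)             # entries in {0, 1, 2}
```

Because each column of Q sums to one and P lies in [0, 1], the product PQ is a valid matrix of binomial probabilities.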

### Empirical estimate and upper bound on total variance

To validate our derived bounds on the total variance (Equations 13, 17, 18 and 19), we generate simulated genotypes from a known target for $p = [0.1, 0.7]^T$. We simulate N individual genotypes using the full matrix Q with each column drawn from a Dirichlet distribution with shape parameter α. We repeat the experiment 10000 times, producing an independent and identically distributed genotype each time. Each trial produces one estimate for p. We then compute the mean and covariance of the estimates of p and compare them to those predicted in the bounds. For α = 1 and N = 100,

$$\mathrm{mean}(\hat{p}) = \begin{bmatrix} 0.0999 \\ 0.7002 \end{bmatrix}, \quad \mathrm{cov}(\hat{p}) = \begin{bmatrix} 0.0027 & 0.0015 \\ 0.0015 & 0.0046 \end{bmatrix}, \quad \mathrm{trace}[\mathrm{cov}(\hat{p})] = 0.0073 \qquad (1)$$

The bound using the sample covariance of q in Equation 13 provides the following:

$$QQ^T = \begin{bmatrix} 36.62 & 16.20 \\ 16.20 & 30.99 \end{bmatrix}, \quad \mathrm{trace}[\mathrm{cov}(\hat{p})] \le 0.0097 \qquad (2)$$

The bound using the properties of the Dirichlet distribution in Equation 17 provides a bound of 0.01. As the number of samples increases, the difference between the bound and the asymptotic bound for the Dirichlet-distributed q will approach zero.
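The Monte Carlo check above can be reproduced schematically. This is a sketch under our own assumptions (fewer trials than the paper's 10000, and a plain unconstrained least-squares solve standing in for Equation 27, which this excerpt does not reproduce):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.1, 0.7])       # target allele frequencies, K = 2, one locus
N, alpha, trials = 100, 1.0, 2000   # fewer trials than the paper's 10000

estimates = []
for _ in range(trials):
    Q = rng.dirichlet([alpha, alpha], size=N).T  # (2, N), columns on the simplex
    g = rng.binomial(2, p_true @ Q)              # one simulated genotype row
    # Unconstrained least squares of g on 2*Q^T (estimates may leave [0, 1]).
    p_hat, *_ = np.linalg.lstsq(2 * Q.T, g, rcond=None)
    estimates.append(p_hat)

E = np.array(estimates)
print(E.mean(axis=0))         # close to p_true: the estimate is unbiased
print(np.trace(np.cov(E.T)))  # empirical total variance, to compare to the bounds
```

With α = 1 and N = 100, the empirical total variance lands near the values reported in Equation 1 and under the 0.0097 bound of Equation 2.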

Figure 1 plots the total variance (trace of the covariance matrix) for a variety of values of N and α using the same target value for p. Because the expected value of the estimate is equal to the true value of p, the total variance is analogous to the sum of the squared error (SSE) between the true p and its estimate. Clearly, the total variance decreases with N. For N = 10000, the root mean squared error falls below 1%.

**Table 1 Matrix notation**

$$G = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1} & g_{M2} & \cdots & g_{MN} \end{bmatrix} \quad P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{M1} & p_{M2} & \cdots & p_{MK} \end{bmatrix} \quad Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1N} \\ q_{21} & q_{22} & \cdots & q_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ q_{K1} & q_{K2} & \cdots & q_{KN} \end{bmatrix}$$

- $g_{li} \in \{0, 1, 2\}$: number of reference alleles at the lth locus for the ith individual.
- $0 \le p_{lk} \le 1$: percentage of reference alleles at the lth locus in the kth population.
- $q_{ki} \ge 0$, $\sum_{k=1}^{K} q_{ki} = 1$: fraction of the ith individual’s genome originating from the kth population.
- M = number of loci (markers), 1 ≤ l ≤ M; N = number of individuals, 1 ≤ i ≤ N; K = number of populations, 1 ≤ k ≤ K.

Intuitively, the error in the least-squares estimate for P and Q decreases as the number of individuals and the number of loci increases, respectively. Figure 1 supports this notion, suggesting that on very large problems, for which the gradient-based and expectation maximization algorithms were designed, the error in the least-squares estimate approaches zero.

### Comparing least-squares approximation to binomial likelihood model

Given estimates of the population allele frequencies, early research focused on estimating the individual admixture [17]. We also note that the number of iterations and convergence properties confound the comparison of iterative algorithms. To avoid these problems and emulate a practical research scenario, we compare least-squares to sequential quadratic programming (used in Admixture) when P or Q are known a priori. In this scenario, each algorithm converges in exactly one step, making it possible to compare the underlying updates for P and Q independently. For N = 100, 1000, and 10000, and α = 0.1, 1, and 2, we consider a grid of two-dimensional points for p, where $p_i \in \{0.05, 0.15, \ldots, 0.95\}$. For each trial, we first generate a random Q such that every column is drawn from a Dirichlet distribution with shape parameter α. Then, we randomly generate a genotype using Equation 11. We compute the least-squares solution using Equation 27 and use Matlab’s built-in function ‘fmincon’ to minimize the negative of the log-likelihood in Equation 7, similar to Admixture’s approach. We repeat the process for 1000 trials and aggregate the results.
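In Python, the likelihood-based update that ‘fmincon’ performs can be emulated with SciPy's bounded L-BFGS-B solver. A hedged sketch of one such trial follows (Equations 7, 11 and 27 are not shown in this excerpt, so the standard binomial log-likelihood and an unconstrained-then-clipped least-squares solve stand in for them):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, alpha = 1000, 1.0
p_true = np.array([0.35, 0.65])

Q = rng.dirichlet([alpha, alpha], size=N).T  # known admixture, shape (2, N)
g = rng.binomial(2, p_true @ Q)              # simulated genotypes at one locus

# Least-squares estimate: minimize ||g - 2 Q^T p||^2, then clip to [0, 1].
p_ls = np.clip(np.linalg.lstsq(2 * Q.T, g, rcond=None)[0], 0, 1)

# Likelihood estimate: maximize sum_i [g_i log f_i + (2 - g_i) log(1 - f_i)]
# with f = Q^T p, as in the binomial model.
def negloglik(p):
    f = np.clip(Q.T @ p, 1e-9, 1 - 1e-9)
    return -np.sum(g * np.log(f) + (2 - g) * np.log(1 - f))

p_ml = minimize(negloglik, x0=np.full(2, 0.5),
                bounds=[(0, 1)] * 2, method="L-BFGS-B").x

print(p_ls, p_ml)  # both land near p_true for N this large
```

Repeating this over many trials and grid points for p and aggregating the errors reproduces the kind of comparison summarized in Figure 2 and Table 2.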

Figure 2 illustrates the root mean squared error in estimating p given the true value of Q. Both algorithms present the same pattern of performance as a function of $p = [p_1, p_2]$. Values of p near 0.5 present the most difficult scenarios. Positively correlated values (e.g., $p_1 = p_2$) present slightly less error than negatively correlated values (e.g., $p_1 = 1 - p_2$). Table 2 summarizes the performance over all values of p for varying N and α. In all cases, fmincon performs slightly better than least-squares and both algorithms approach zero error as N increases. We repeat this analysis for known values of P and estimate q using the two approaches. Figure 3 illustrates the difference in performance for the two algorithms as we vary $q_1$ between 0.05 and 0.95 with $q_2 = 1 - q_1$. Again, fmincon performs slightly better in all cases but both approach zero error as M increases. In the next section we show that the additional error introduced by the least-squares approximation to the objective function remains small relative to the error introduced by the characteristics of the genotype data.

### Simulated experiments to compare least-squares to Admixture and FRAPPE

**Figure 1 Bound on total variance.** Solid and dashed lines correspond to the empirical estimate of the total variance and the upper bound for total variance, respectively.

In the previous sections, we consider the best-case scenario where the true value of P or Q is known. In a realistic scenario, the algorithms must estimate both P and Q from only the genotype information. Table 3 summarizes the results of a four-way analysis of variance with 2-way interactions among experimental factors. By far the factor with the most impact on performance is the number of individuals, N. The degree of admixture, α, and the number of populations, K, account for the second and third most variation, respectively. These three factors and the two-way interactions between them account for the vast majority of variation. In particular, the choice of algorithm accounts for less than about 1% of the variation in estimation performance. That is, when estimating population structure from genotype data, the number of samples, the number of populations, and the degree of admixture play a much more important role than the choice between least-squares, Admixture, and FRAPPE. However, as shown in Figure 4, when considering the computation time required by the algorithm, the choice of algorithm contributes about 40% of the variation including interactions. Therefore, for the range of population inference problems described in this study, the choice of algorithm plays a very small role in the estimation of P and Q but a larger role in computation time.

Further exploration reveals that the preferred algorithm depends on K, N, and α. Table 4 lists the root mean squared error for the estimation of Q for all combinations of parameters across n = 50 trials. Out of the 36 scenarios, Admixture, least-squares, and FRAPPE perform significantly better than their peers 13, six, and zero times, respectively; they perform insignificantly worse than the best algorithm 30, 17, and 10 times, respectively. The least-squares algorithm appears to perform well on the more difficult problems with combinations of large K, small N, or large α. Table 5 lists the root mean squared error for estimating P. For N = 100, the algorithms do not perform significantly differently. For N = 10000, all algorithms perform with less than 2.5% root mean squared error (RMSE). In all, Admixture performs significantly better than its peers 11 times out of 36. However, Admixture never performs significantly worse than its peers. Table 6 summarizes the timing results. Least-squares converges significantly faster 34 out of 36 times, with an insignificant difference for the remaining two scenarios. FRAPPE converges significantly slower in all scenarios. With two exceptions, the least-squares algorithm provides a 1.5- to 5-times speedup.

### Comparison on admixtures derived from the HapMap3 dataset

Tables 7 and 8 list the performance and computation time for the least-squares approach and Admixture using convergence thresholds of ε = 1.0e-4 and ε = 1.4e-3, respectively. Each marker in the illustrations represents one individual. A short black line emanating from each marker indicates the offset from the original (correct) position. For all simulations, the least-squares algorithms perform within 0.1% of Admixture for estimating the true population allele frequencies in P. For well-mixed populations in Simulations 1 and 2, the least-squares algorithms perform comparably well or even better than Admixture. However, for less admixed data in Simulations 3–6, Admixture provides better estimates of the true population proportions depicted in the scatter plots. In all cases, the least-squares algorithms perform within 1.5% of Admixture and run between about 2- and 3-times faster than Admixture.

**Figure 2 Precision of best-case scenario for estimating P.** Root mean squared error for different values of p using (a) Admixture’s Sequential Quadratic Programming or (b) the least-squares approximation.

**Table 2 Root mean squared error in P for known Q and K = 2**

| RMSE (%) | N=100, α=0.1 | N=100, α=1.0 | N=100, α=2.0 | N=1000, α=0.1 | N=1000, α=1.0 | N=1000, α=2.0 | N=10000, α=0.1 | N=10000, α=1.0 | N=10000, α=2.0 |
|---|---|---|---|---|---|---|---|---|---|
| SQP | 4.35 | 6.03 | 7.41 | 1.37 | 1.90 | 2.37 | 0.43 | 0.60 | 0.75 |
| LS | 4.37 | 6.16 | 7.68 | 1.38 | 1.93 | 2.40 | 0.44 | 0.61 | 0.76 |

The apparent advantage of Admixture involves individuals on the periphery of the unit simplex defining the space of Q. In Table 7, this corresponds to individuals on the boundary of the right triangle defined by the x-axis, y-axis, and y = 1 − x diagonal line. For Simulation 1, the original Q contains very few individuals on the boundary, Admixture estimates far more on the boundary, and least-squares was closer to the ground truth. For Simulations 2–6, the ground truth contains more individuals on the boundary; Admixture correctly estimates these boundary points but the least-squares algorithms predict fewer points on the boundary. Simulation 6 provides the most obvious example, where Admixture estimates individuals exactly on the boundary and least-squares contains a jumble of individuals near but not exactly on the line.

**Figure 3 Precision of best-case scenario for estimating Q.** Solid and dashed lines correspond to Admixture’s Sequential Quadratic Programming optimization and the least-squares approximation, respectively.

**Table 3 Sources of variation in root mean squared error**

| Factors and interactions | Error variance for P: SSE (×10⁻²) | Percent | Error variance for Q: SSE (×10⁻⁴) | Percent | Time variance: SSE (×10⁴) | Percent |
|---|---|---|---|---|---|---|
| K | 59.0 | 8.2 | 44.0 | 3.9 | 58.7 | 3.2 |
| N | 519.6 | 72.4 | 376.2 | 33.0 | 585.5 | 32.2 |
| α | 63.1 | 8.8 | 341.1 | 29.9 | 33.2 | 1.8 |
| Algorithm | 0.1 | 0.0 | 1.7 | 0.1 | 266.3 | 14.6 |
| K × N | 32.1 | 4.5 | 32.6 | 2.9 | 98.2 | 5.4 |
| K × α | 9.0 | 1.3 | 8.2 | 0.7 | 4.4 | 0.2 |
| K × Algorithm | 0.0 | 0.0 | 0.4 | 0.0 | 55.1 | 3.0 |
| N × α | 29.1 | 4.1 | 282.6 | 24.8 | 58.8 | 3.2 |
| N × Algorithm | 0.0 | 0.0 | 2.1 | 0.2 | 445.6 | 24.5 |
| α × Algorithm | 0.2 | 0.0 | 8.4 | 0.7 | 10.5 | 0.6 |
| Error | 5.7 | 0.8 | 43.2 | 3.8 | 204.4 | 11.2 |
| Total | 717.9 | 100.0 | 1140.4 | 100.0 | 1820.4 | 100.0 |

### Real dataset from the HapMap Phase 3 project

Over 20 repeated trials, Admixture converged in an average of 42.1 seconds with a standard deviation of 9.1 seconds, and the least-squares approach converged in 33.6 seconds with a standard deviation of 9.8 seconds. Figure 5 illustrates the inferred population proportions for one run. The relative placement of individuals from each known population is qualitatively similar. The two methods differ at extreme points, such as those values of $q_1$, $q_2$, or $1 - q_1 - q_2$ that are near zero. The Admixture solution has more individuals on the boundary and the least-squares approach has fewer. Although we cannot estimate the error of these estimates because the real-world data has no ground truth, we can compare their results quantitatively. The Admixture and least-squares solutions differed by an average of 1.2% root mean squared difference across the 20 trials. We estimate α = 0.12 from the Admixture solution’s total variance using Equation 31. This roughly corresponds to the simulated experiment with three populations, 100 samples, and a degree of admixture of 0.1. In that case, Admixture and least-squares exhibited a very small root mean squared error of 0.62% and 0.74%, respectively (Table 4).

## Discussion

This work contributes to the population inference literature by providing a novel simplification of the binomial likelihood model that improves the computational efficiency of discrete admixture inference. This approximation results in an inference algorithm based on minimizing the squared distance between the genotype matrix G and twice the product of the population allele frequencies and individual admixture proportions: 2PQ. This Euclidean distance-based interpretation aligns with previous results employing multivariate statistics. For example, researchers have found success using principal component analysis to reveal and remove stratification [2-4] or even to reveal clusters of individuals in subpopulations [5-7]. Recently, McVean [5] proposed a genealogical interpretation of principal component analysis and used it to reveal information about migration, geographic isolation, and admixture. In particular, given two populations, individuals cluster along the first principal component, and the admixture proportion is the fractional distance between the two population centers. However, these cluster centers must be known or inferred in order to estimate ancestral population allele frequencies. The least-squares approach infers these estimates efficiently and directly.

Typically, discrete admixture models employ a binomial likelihood function rather than a Euclidean distance-based one. Pritchard et al. detail one such model and use a slow sampling-based approach to infer the admixed ancestral populations for individuals in a sample [9]. Recognizing the performance advantage of maximizing the likelihood rather than sampling the posterior, Tang et al. proposed an expectation maximization algorithm and Alexander et al. [13] proposed a sequential quadratic programming (SQP) approach using the same likelihood function [9]. We take this approach a step further by simplifying the model proposed by Pritchard et al. to introduce a least-squares criterion. By justifying the least-squares simplification, we connect the fast and practical multivariate statistical approaches to the theoretically grounded binomial likelihood model. We validate our approach on a variety of simulated and real datasets.

First, we show that if the true value of P (or Q)is

known, the expected value of the least squares solution

for Q (or P) across all possible genotype matrices is equal

to the true value, and the variance of this estimate

decreases with M (or N). In this best-case scenario, we

show that SQP provides a slightly better estimate than the


Figure 4 Computational timing comparison. Box plots show the median (red line) and inter-quartile range (blue box) for computation time

on a logarithmic scale using (a) N=1000, α=0.5, and varying K; (b) K=4, α=0.5, and varying N; and (c) K=4, N=1000, and varying α.

Parry and Wang BMC Bioinformatics 2013, 14:28 Page 7 of 17

http://www.biomedcentral.com/1471-2105/14/28

least-squares solution for a variety of problem sizes and difficulties. For more common scenarios where the algorithms must estimate P and Q using only the genotype information in G, we show that for particularly difficult problems with small N, large K, or large α, the least-squares approach often performs better than its peers. For about one-third of the parameter sets, Admixture performs significantly better than least-squares and FRAPPE, but all algorithms approach zero error as N becomes very large. In addition, the error introduced by the choice of

Table 4 Root mean squared error for Q

K N α AD LS FRAPPE Significance LSα

2 100 0.10 0.48 0.72 0.52 AD = FR < LS 0.64

2 100 0.50 1.12 1.13 1.03 FR = AD = LS 1.18

2 100 1.00 2.22 2.22 2.29 AD = LS = FR 2.22

2 100 2.00 4.13 4.11 4.50 LS = AD = FR 3.84

2 1000 0.10 0.57 0.97 0.63 AD < FR < LS 0.74

2 1000 0.50 0.69 0.74 0.71 AD < FR < LS 0.74

2 1000 1.00 0.86 0.91 1.00 AD < LS < FR 0.91

2 1000 2.00 1.58 1.65 2.33 AD = LS < FR 0.93

2 10000 0.10 0.59 1.03 0.61 AD < FR < LS 0.76

2 10000 0.50 0.70 0.81 0.72 AD < FR < LS 0.73

2 10000 1.00 0.74 0.77 0.79 AD < LS < FR 0.77

2 10000 2.00 0.89 0.97 1.32 AD < LS < FR 0.96

3 100 0.10 0.62 0.74 0.63 AD = FR < LS 0.66

3 100 0.50 2.01 1.81 2.00 LS < FR = AD 1.91

3 100 1.00 3.49 3.23 3.60 LS < AD = FR 3.23

3 100 2.00 5.77 5.39 5.89 LS < AD = FR 5.00

3 1000 0.10 0.68 1.15 0.73 AD < FR < LS 0.76

3 1000 0.50 0.85 0.88 0.89 AD < LS = FR 0.93

3 1000 1.00 1.18 1.17 1.35 LS = AD < FR 1.17

3 1000 2.00 1.94 1.92 2.49 LS = AD < FR 1.20

3 10000 0.10 0.74 1.26 0.76 AD < FR < LS 0.79

3 10000 0.50 0.87 0.97 0.87 AD = FR < LS 0.87

3 10000 1.00 0.89 0.92 0.95 AD < LS < FR 0.92

3 10000 2.00 1.07 1.09 1.49 AD < LS < FR 1.09

4 100 0.10 0.79 0.76 0.80 LS = AD = FR 0.77

4 100 0.50 2.81 2.40 2.85 LS < AD = FR 2.56

4 100 1.00 4.43 4.01 4.55 LS < AD = FR 4.01

4 100 2.00 6.63 6.13 6.81 LS < AD = FR 5.65

4 1000 0.10 0.73 1.17 0.74 AD = FR < LS 0.72

4 1000 0.50 0.95 0.95 1.00 LS = AD < FR 1.07

4 1000 1.00 1.34 1.32 1.47 LS = AD < FR 1.32

4 1000 2.00 2.09 2.06 2.50 LS = AD < FR 1.32

4 10000 0.10 0.84 1.33 0.84 AD = FR < LS 0.74

4 10000 0.50 0.96 1.03 0.96 AD = FR < LS 0.95

4 10000 1.00 0.97 0.99 1.03 AD < LS < FR 0.99

4 10000 2.00 1.14 1.15 1.51 AD = LS < FR 1.15

‘AD’ = Admixture with ε = MN×10⁻⁴; ‘LS’ = Least-squares with ε = MN×10⁻⁴ and α = 1; ‘FR’ = FRAPPE with ε = 1. Bold values indicate significantly less error than those without bold. ‘<’ indicates significantly less at the 4.6e-4 level, and ‘=’ indicates an insignificant difference. ‘LSα’ = Least-squares with the correct α, provided only for reference.

Table 5 Root mean squared error for P

K N α AD LS FRAPPE Significance LSα

2 100 0.10 4.33 4.37 4.33 AD = FR = LS 4.36

2 100 0.50 5.13 5.17 5.14 AD = FR = LS 5.17

2 100 1.00 5.99 6.03 5.99 AD = FR = LS 6.03

2 100 2.00 7.24 7.28 7.29 AD = LS = FR 7.25

2 1000 0.10 1.37 1.42 1.38 AD < FR < LS 1.39

2 1000 0.50 1.62 1.65 1.63 AD = FR < LS 1.65

2 1000 1.00 1.90 1.93 1.92 AD < FR = LS 1.93

2 1000 2.00 2.52 2.58 2.82 AD = LS < FR 2.38

2 10000 0.10 0.46 0.57 0.46 AD < FR < LS 0.48

2 10000 0.50 0.52 0.56 0.53 AD < FR < LS 0.52

2 10000 1.00 0.60 0.61 0.62 AD < LS < FR 0.61

2 10000 2.00 0.81 0.87 1.14 AD < LS < FR 0.92

3 100 0.10 5.58 5.64 5.58 AD = FR = LS 5.62

3 100 0.50 7.37 7.42 7.38 AD = FR = LS 7.42

3 100 1.00 9.05 9.06 9.06 AD = FR = LS 9.06

3 100 2.00 11.36 11.33 11.39 LS = AD = FR 11.30

3 1000 0.10 1.78 1.87 1.78 AD = FR < LS 1.80

3 1000 0.50 2.35 2.40 2.35 AD = FR < LS 2.39

3 1000 1.00 2.97 3.00 3.01 AD < LS = FR 3.00

3 1000 2.00 4.11 4.14 4.41 AD = LS < FR 3.89

3 10000 0.10 0.61 0.82 0.62 AD < FR < LS 0.61

3 10000 0.50 0.78 0.84 0.78 AD = FR < LS 0.76

3 10000 1.00 0.93 0.95 0.98 AD < LS < FR 0.95

3 10000 2.00 1.35 1.36 1.82 AD = LS < FR 1.49

4 100 0.10 6.83 6.90 6.84 AD = FR = LS 6.87

4 100 0.50 9.61 9.63 9.62 AD = FR = LS 9.62

4 100 1.00 11.90 11.89 11.92 LS = AD = FR 11.89

4 100 2.00 14.94 14.89 15.01 LS = AD = FR 14.89

4 1000 0.10 2.16 2.28 2.16 AD = FR < LS 2.17

4 1000 0.50 3.10 3.15 3.11 AD = FR < LS 3.15

4 1000 1.00 4.04 4.06 4.08 AD < LS = FR 4.06

4 1000 2.00 5.61 5.62 5.88 AD = LS < FR 5.36

4 10000 0.10 0.76 1.02 0.77 AD = FR < LS 0.71

4 10000 0.50 1.04 1.11 1.04 AD = FR < LS 1.01

4 10000 1.00 1.28 1.30 1.33 AD < LS < FR 1.30

4 10000 2.00 1.87 1.87 2.36 AD = LS < FR 2.06

‘AD’ = Admixture with ε = MN×10⁻⁴; ‘LS’ = Least-squares with ε = MN×10⁻⁴ and α = 1; ‘FR’ = FRAPPE with ε = 1. Bold values indicate significantly less error than those without bold. ‘<’ indicates significantly less at the 4.6e-4 level, and ‘=’ indicates an insignificant difference. ‘LSα’ = Least-squares with the correct α, provided only for reference.


algorithms was relatively small compared to other characteristics of the experiment such as sample size, number of populations, and the degree of admixture in the sample. That is, improving accuracy has more to do with improving the dataset than with selecting the algorithm, suggesting that algorithm selection may depend on other criteria such as speed. In nearly all cases, the least-squares method computes its solution faster, typically 1.5- to 5-times faster. At the current problem size involving about 10000 loci, this speed improvement may justify the use of least-squares algorithms. For a single point estimate, researchers may prefer a slightly more accurate algorithm at the cost of seconds or minutes. For researchers testing several values of K and α and using multiple runs to gauge the fitness of each parameter set, or those estimating standard errors [13], the speed improvement could be the difference between hours and days of computation. As the number of loci increases to hundreds of thousands or even millions, speed may be more important. The least-squares approach offers a simpler and faster alternative algorithm for population inference that provides qualitatively similar results.

The key speed advantage of the least-squares approach comes from a single nonnegative least-squares update that minimizes a quadratic criterion for P and then for Q per iteration. Admixture, on the other hand, minimizes several quadratic criteria sequentially as it fits the true binomial model. Although the least-squares algorithm completes each update in less time and is guaranteed to converge to a local minimum or saddle point, predicting the number of iterations to convergence presents a challenge. We provide empirical timing results and note that selecting a suitable stopping criterion for these iterative methods can change the timing and accuracy results. For comparison, we use the same stopping criterion with published thresholds for Admixture and FRAPPE [13], and a threshold of MN×10⁻¹⁰ for least-squares.

This work is motivated in part by the desire to analyze larger genotype datasets. In this paper, we focus on the computational challenges of analyzing very large numbers of markers and individuals. However, linkage disequilibrium introduces correlations between loci that cannot be avoided in very large datasets. Large datasets can be pruned to diminish the correlation between loci. For example, Alexander et al. prune the HapMap phase 3 dataset from millions of SNPs down to around 10000 to avoid correlations. In this study, we assume linkage equilibrium and therefore uncorrelated markers, and limit our analysis to datasets of fewer than about 10000 SNPs. Incorporating linkage disequilibrium in gradient-based optimizations of the binomial likelihood model remains an open problem.

Estimating the number of populations K from the admixed samples continues to pose a difficult challenge for clustering algorithms in general and population inference in particular. In practice, experiments can be designed to include individual samples that are expected to be distributed close to their ancestors. For example, Tang et al. [11] suggested using domain knowledge to collect an appropriate number of pseudo-ancestors that

Table 6 Computation time

K N α AD LS FRAPPE Significance LSα

2 100 0.10 4.71 1.00 9.97 LS < AD < FR 0.77

2 100 0.50 4.69 1.16 8.22 LS < AD < FR 1.12

2 100 1.00 5.46 1.78 8.31 LS < AD < FR 1.77

2 100 2.00 6.25 2.37 10.40 LS < AD < FR 2.55

2 1000 0.10 43.37 11.87 136.88 LS < AD < FR 8.06

2 1000 0.50 51.70 13.98 112.41 LS < AD < FR 12.34

2 1000 1.00 62.00 24.43 118.90 LS < AD < FR 24.03

2 1000 2.00 83.07 51.33 195.43 LS < AD < FR 48.43

2 10000 0.10 447.68 142.14 1963.83 LS < AD < FR 93.61

2 10000 0.50 570.12 209.39 1908.72 LS < AD < FR 157.44

2 10000 1.00 687.88 352.24 2242.18 LS < AD < FR 349.51

2 10000 2.00 1037.45 796.83 3762.70 LS < AD < FR 406.63

3 100 0.10 6.10 1.84 15.29 LS < AD < FR 1.48

3 100 0.50 6.42 2.05 15.75 LS < AD < FR 1.90

3 100 1.00 7.19 2.71 16.78 LS < AD < FR 2.74

3 100 2.00 9.00 4.01 19.80 LS < AD < FR 4.24

3 1000 0.10 69.41 18.32 223.32 LS < AD < FR 12.53

3 1000 0.50 78.73 24.10 264.85 LS < AD < FR 21.42

3 1000 1.00 96.89 38.06 305.50 LS < AD < FR 36.63

3 1000 2.00 121.45 60.79 355.51 LS < AD < FR 55.54

3 10000 0.10 791.36 155.56 3256.83 LS < AD < FR 121.19

3 10000 0.50 883.99 301.52 4251.68 LS < AD < FR 264.77

3 10000 1.00 1175.25 617.80 5111.92 LS < AD < FR 578.42

3 10000 2.00 1506.20 1404.27 7052.33 LS < AD < FR 901.56

4 100 0.10 8.06 2.45 23.93 LS < AD < FR 2.00

4 100 0.50 8.78 2.66 26.56 LS < AD < FR 2.72

4 100 1.00 10.03 3.70 30.89 LS < AD < FR 3.43

4 100 2.00 12.94 5.00 37.26 LS < AD < FR 4.86

4 1000 0.10 81.72 17.32 386.11 LS < AD < FR 13.45

4 1000 0.50 99.92 24.37 433.17 LS < AD < FR 22.68

4 1000 1.00 117.71 36.94 508.49 LS < AD < FR 36.01

4 1000 2.00 156.39 58.02 564.57 LS < AD < FR 57.62

4 10000 0.10 879.95 229.06 5798.15 LS < AD < FR 176.27

4 10000 0.50 1170.97 480.99 7051.69 LS < AD < FR 505.45

4 10000 1.00 1555.90 1017.41 8108.08 LS < AD < FR 1051.81

4 10000 2.00 2202.08 2538.54 10445.75 AD = LS < FR 1308.79

‘AD’ = Admixture with ε = MN×10⁻⁴; ‘LS’ = Least-squares with ε = MN×10⁻⁴ and α = 1; ‘FR’ = FRAPPE with ε = 1. Bold values indicate significantly less computation time than those without bold. ‘<’ indicates significantly less at the 4.6e-4 level, and ‘=’ indicates an insignificant difference. ‘LSα’ = Least-squares with the correct α, provided only for reference.


reveal the allele frequencies of the ancestral populations. The number of groups considered provides a convenient starting point for K. Lacking domain knowledge, computational approaches can be used to try multiple reasonable values for K and evaluate their fitness. For example, Pritchard et al. [9] estimate the posterior distribution of K and select the most probable K. Another approach is to evaluate the consistency of inference for different values of K. If the same value of K leads to very different inferences of P and Q from different random starting points, the inference can be considered inconsistent. Brunet et al. [18] proposed this method of model selection, called consensus clustering.

For realistic population allele frequencies, P, from the HapMap Phase 3 dataset and very little admixture in Q, Admixture provides better estimates of Q. The key advantage of Admixture appears to be for individuals containing nearly zero contribution from one or more inferred populations, whereas the least-squares approach performs better when the individuals are well-mixed. Visually, both approaches reveal population structure. Using the two approaches to infer three ancestral populations from four

Table 7 Simulation experiments (1–3) using realistic population allele frequencies from the HapMap phase 3 project

Simulation 1: q ~ Dir(1,1,1); Simulation 2: q ~ Dir(.5,.5,.5); Simulation 3: q ~ Dir(.1,.1,.1). Entries are RMSE (%) ± Std. Dev. for P and Q, and Time (s.) ± Std. Dev. (The scatter plots of the original, Admixture, and least-squares admixtures shown in the published table are omitted here.)

                 Simulation 1                     Simulation 2                     Simulation 3
                 P           Q           Time     P           Q           Time     P           Q           Time
AD (ε=1e-4)      2.50±0.04   2.19±0.11   105±13   1.99±0.02   1.44±0.04   88±9     1.54±0.01   0.76±0.02   86±7
AD (ε=1.4e-3)    2.50±0.04   2.19±0.11   98±13    1.99±0.02   1.44±0.04   87±11    1.54±0.01   0.76±0.02   83±9
LS1 (ε=1.4e-3)   2.51±0.03   1.85±0.07   51±6     2.04±0.02   1.43±0.04   37±8     1.63±0.01   1.75±0.05   27±5
LSα (ε=1.4e-3)   2.51±0.03   1.85±0.07   54±8     2.03±0.02   1.53±0.04   28±4     1.57±0.01   1.08±0.02   15±4


HapMap Phase 3 sampling populations reveals qualitatively similar results.

We believe the computational advantage of the least-squares approach along with its good estimation performance warrants further research, especially for very large datasets. For example, we plan to adapt and apply the least-squares approach to datasets utilizing microsatellite data rather than SNPs and to consider the case of more than two alleles per locus. Researchers have incorporated geospatial information into sampling-based [19] and PCA-based [8] approaches. Multiple other extensions to sampling-based or PCA-based algorithms have yet to be incorporated into faster gradient-based approaches.

Conclusion

This paper explores the utility of a least-squares

approach for the inference of population structure in

genotype datasets. Whereas previous Euclidean distance-

based approaches received little theoretical justification,

we show that a least-squares approach is the result of a

first-order approximation of the negative log-likelihood

function for the binomial generative model. In addition,

Table 8 Simulation experiments (4–6) using realistic population allele frequencies from the HapMap phase 3 project

Simulation 4: q ~ Dir(.2,.2,.05); Simulation 5: q ~ Dir(.2,.2,.5); Simulation 6: q ~ Dir(.05,.05,.01). Entries are RMSE (%) ± Std. Dev. for P and Q, and Time (s.) ± Std. Dev. (The scatter plots of the original, Admixture, and least-squares admixtures shown in the published table are omitted here.)

                 Simulation 4                     Simulation 5                     Simulation 6
                 P           Q           Time     P           Q           Time     P           Q           Time
AD (ε=1e-4)      2.01±0.05   0.87±0.02   94±12    1.98±0.03   1.16±0.03   93±17    1.96±0.07   0.53±0.02   91±9
AD (ε=1.4e-3)    2.01±0.05   0.87±0.02   82±5     1.98±0.03   1.16±0.03   86±13    1.96±0.07   0.53±0.02   82±7
LS1 (ε=1.4e-3)   2.09±0.05   1.70±0.05   31±7     2.06±0.03   1.60±0.04   34±5     2.04±0.07   2.00±0.04   27±7
LSα (ε=1.4e-3)   2.05±0.05   1.17±0.03   17±3     2.02±0.04   1.34±0.04   24±4     1.99±0.07   1.09±0.03   14±3


we show that the error in this approximation approaches zero as the number of samples (individuals and loci) increases. We compare our algorithm to the state-of-the-art algorithms, Admixture and FRAPPE, for optimizing the binomial likelihood model, and show that our approach requires less time and performs comparably well. We provide both quantitative and visual comparisons that illustrate the advantage of Admixture at estimating individuals with little admixture, and show that our approach infers qualitatively similar results. Finally, we incorporate a degree-of-admixture parameter that improves estimates for known levels of admixture without requiring additional parameter tuning, as is the case for Admixture.

Methods

The algorithms we discuss accept the number of populations, K, and an M × N genotype matrix, G, as input:

$$G = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1N} \\ g_{21} & g_{22} & \cdots & g_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1} & g_{M2} & \cdots & g_{MN} \end{bmatrix} \quad (3)$$

where $g_{li} \in \{0, 1, 2\}$ represents the number of copies of the reference allele at the $l$th locus for the $i$th individual,

M is the number of markers (loci), and N is the number

of individuals. Given the genotype matrix, G, the algo-

rithms attempt to infer the population allele frequencies

and the individual admixture proportions. The matrix P

contains the population allele frequencies:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{M1} & p_{M2} & \cdots & p_{MK} \end{bmatrix} \quad (4)$$

where $0 \le p_{lk} \le 1$ represents the fraction of reference alleles out of all alleles at the $l$th locus in the $k$th population. The matrix Q contains the individual admixture proportions:

$$Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1N} \\ q_{21} & q_{22} & \cdots & q_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ q_{K1} & q_{K2} & \cdots & q_{KN} \end{bmatrix} \quad (5)$$

where $0 \le q_{ki} \le 1$ represents the fraction of the $i$th individual's genome originating from the $k$th population, and $\sum_{k} q_{ki} = 1$ for all $i$. Table 1 summarizes

the matrix notation we use.

Likelihood function

Alexander et al. model the genotype (i.e., the number

of reference alleles at a particular locus) as the result

of two draws from a binomial distribution [13]. In the

generative model, each allele copy for one individual at one locus has an equal chance, $m_{li}$, of receiving the reference allele:

$$m_{li} = \sum_{k=1}^{K} p_{lk} q_{ki} \quad (6)$$

The log-likelihood of the parameters P and Q from the original Structure binomial model, ignoring an additive constant, is the following [13]:

$$L(M) = \sum_{l=1}^{M} \sum_{i=1}^{N} \left[ g_{li} \ln m_{li} + (2 - g_{li}) \ln(1 - m_{li}) \right] \quad (7)$$

Figure 5 Comparison on HapMap Phase 3 dataset. Inferred population membership proportions using (a) Admixture and (b) least-squares with α=1, both with ε=0.0001. Each point represents a different individual among the four populations: ASW, CEU, MEX, and YRI. The axes represent the proportions $q_1$ and $q_2$ of each individual's genome originating from each inferred population. The proportion belonging to the third inferred population is given by $q_3 = 1 - q_1 - q_2$.


To see the effect on gradient-based optimization, we also present the derivative of the likelihood with respect to a particular $m_{li}$:

$$\frac{\partial}{\partial m_{li}} L(M) = \frac{g_{li} - 2 m_{li}}{m_{li}(1 - m_{li})} \approx 4 \left(g_{li} - 2 m_{li}\right) \quad (8)$$

In order to achieve a least-squares criterion, we must approximate this derivative with a line. Figure 6 plots this derivative with respect to $m_{li}$ for the three possible values of $g_{li}$ (0, 1, or 2). To avoid biasing the approximation toward high or low values of $m_{li}$, we approximate the derivative with its first-order Taylor approximation in the neighborhood of $m_{li} = 1/2$. More complex optimizations might update the neighborhood of the Taylor approximation during the optimization. In the interest of simplicity, we select one neighborhood for all iterations, genotypes, individuals, and loci. The following least-squares objective function has the approximated derivative in the above equation:

$$L(M) \approx -\sum_{l=1}^{M} \sum_{i=1}^{N} \left(2 m_{li} - g_{li}\right)^2 = -\left\| 2M - G \right\|_2^2 \quad (9)$$

The right-hand side of Equation 9 provides the least-squares criterion. Figure 6 shows the deviation between the linear approximation and the true slope. The values match closely for $0.35 \le m_{li} \le 0.65$, but as $m_{li}$ approaches zero or one the true slope diverges for two of the three genotypes. Therefore, we have the following least-squares optimization problem:

$$\arg\min_{P,Q} \left\| 2PQ - G \right\|_2^2, \quad \text{such that} \quad \begin{cases} 0 \le P \le 1 \\ Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1 \end{cases} \quad (10)$$
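Concretely, the criterion in Equation 10 is a squared Frobenius-norm misfit between the model term 2PQ and the observed genotypes. A minimal sketch in NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

def ls_criterion(P, Q, G):
    """Least-squares criterion ||2PQ - G||_2^2 from Equation 10.

    P : (M, K) allele frequencies, Q : (K, N) admixture proportions,
    G : (M, N) genotypes in {0, 1, 2}.
    """
    R = 2.0 * (P @ Q) - G            # residual between modeled and observed genotypes
    return float(np.sum(R * R))      # squared Frobenius norm

# One locus, one individual, one population: the modeled genotype is
# 2 * 0.5 * 1.0 = 1 and the observed genotype is 2, so the criterion is 1.
print(ls_criterion(np.array([[0.5]]), np.array([[1.0]]), np.array([[2.0]])))  # 1.0
```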

Bounded error for the least-squares approach

We justify the least-squares approach by showing that the expected value across all genotypes is equal to the true value in the binomial likelihood model, and that the covariance approaches zero as the size of the data increases. In order to analyze the least-squares performance across all possible genotype matrices, we consider the generative model for G. Given the true ancestral population allele frequencies, P, and the proportion of

population allele frequencies, P, and the proportion of

each individual’s alleles originating from each popula-

tion, Q, the genotype at locus l for individual i is a bino-

mial random variable, g

li

:

g

li

∼Binomial 2; m

li

ðÞ

m

li

¼ Σ

K

k¼1

p

1k

q

ki

ð11Þ

If M were directly observable, we could solve for P or Q given the other using $P = M Q^{\#}$ or $Q = P^{\#} M$, where $\#$ denotes the Moore-Penrose pseudo-inverse. However, we only observe the elements of G, which is only partially informative of M. First we consider the uncertainty in estimating P. Each $g_{li}$ is an independent random variable with the following mean and bound on the variance:

$$E[g_{li}] = 2 m_{li}, \qquad \mathrm{var}[g_{li}] \le \tfrac{1}{2} \quad (12)$$

Mean and total variance of the estimate of p

For ease of notation, we focus on one locus at index l in one row of P, $\hat{p} = [\hat{p}_{l1}, \hat{p}_{l2}, \ldots, \hat{p}_{lK}]^T$, and one row of G, $g = [g_{l1}, g_{l2}, \ldots, g_{lN}]^T$, and estimate the mean and covariance and provide a bound on the total variance of the estimate:

$$\hat{p} = \tfrac{1}{2} Q^{T\#} g, \qquad E[\hat{p}] = p, \qquad \mathrm{cov}[\hat{p}] = \tfrac{1}{4} Q^{T\#} \mathrm{cov}[g] \left(Q^{T\#}\right)^T, \qquad \mathrm{trace}\left(\mathrm{cov}[\hat{p}]\right) \le \tfrac{1}{8} \mathrm{trace}\left(\left(Q Q^T\right)^{-1}\right) \quad (13)$$

Figure 6 First-order approximation for slope of log-likelihood of m. Solid and dashed lines correspond to the true and approximated slope, respectively. The red, green, and blue lines correspond to g = 0, g = 1, and g = 2, respectively.

Intuitively, $Q Q^T$ scales linearly with N, and we expect the bound on the trace to decrease linearly with N. If

the columns, q, of Q are independent and identically distributed, $Q Q^T$ approaches $N \times E[q q^T]$, resulting in a bound that decreases linearly with N:

$$\mathrm{trace}\left(\mathrm{cov}[\hat{p}]\right) \le \frac{1}{8N} \mathrm{trace}\left(E[q q^T]^{-1}\right) \quad (14)$$

To put this bound in more familiar terms, we consider q drawn from a Dirichlet distribution with shape parameter α, resulting in the following (shown here for K = 2):

$$E[q q^T] = \frac{1}{4\alpha + 2} \begin{bmatrix} \alpha + 1 & \alpha \\ \alpha & \alpha + 1 \end{bmatrix} \quad (15)$$

Asymptotically, $Q Q^T$ approaches $N \times E[q q^T]$ and $(Q Q^T)^{-1}$ approaches:

$$\frac{2}{N} \begin{bmatrix} \alpha + 1 & -\alpha \\ -\alpha & \alpha + 1 \end{bmatrix} \quad (16)$$

resulting in the following asymptotic bound on the total variance:

$$\mathrm{trace}\left(\mathrm{cov}[\hat{p}]\right) \le \frac{\alpha + 1}{2N} \quad (17)$$
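As a quick sanity check (our own, not from the paper), the K = 2 moment matrix in Equation 15 can be confirmed by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of Equation 15: E[qq^T] for q ~ Dirichlet(alpha, alpha), K = 2.
rng = np.random.default_rng(0)
alpha = 0.5
q = rng.dirichlet([alpha, alpha], size=200_000)       # each row is one sample of q
emp = q.T @ q / len(q)                                # empirical E[qq^T]
closed = np.array([[alpha + 1.0, alpha],
                   [alpha, alpha + 1.0]]) / (4.0 * alpha + 2.0)
print(np.abs(emp - closed).max())                     # on the order of 1e-3
```

For α = 0.5 the closed form gives diagonal entries 0.375 and off-diagonal entries 0.125, which the empirical second moments match to within sampling error.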

Mean and total variance of the estimate for q

The same analysis can be repeated for one individual at index i in one column of Q, $\hat{q} = [\hat{q}_{1i}, \hat{q}_{2i}, \ldots, \hat{q}_{Ki}]^T$, and one column of G, $g = [g_{1i}, g_{2i}, \ldots, g_{Mi}]^T$:

$$\hat{q} = \tfrac{1}{2} P^{\#} g, \qquad E[\hat{q}] = q, \qquad \mathrm{cov}[\hat{q}] = \tfrac{1}{4} P^{\#} \mathrm{cov}[g] \left(P^{\#}\right)^T, \qquad \mathrm{trace}\left(\mathrm{cov}[\hat{q}]\right) \le \tfrac{1}{8} \mathrm{trace}\left(\left(P^T P\right)^{-1}\right) \quad (18)$$

Intuitively, $P^T P$ increases linearly with M, and we expect the bound on the total variance to decrease linearly with M. Similarly, if the rows, p, of P are independent and identically distributed, $P^T P$ approaches $M \times E[p^T p]$, resulting in an asymptotic bound that decreases linearly with M:

$$\mathrm{trace}\left(\mathrm{cov}[\hat{q}]\right) \le \frac{1}{8M} \mathrm{trace}\left(E[p^T p]^{-1}\right) \quad (19)$$

Incorporating degree of admixture, α

Pritchard et al. [9] use a prior distribution to bias the solution toward those with a desired level of admixture. This prior on the columns of Q takes the form of a Dirichlet distribution:

$$q \sim \mathcal{D}(\alpha, \alpha, \ldots, \alpha) \quad (20)$$

Because all the shape parameters (α) are equal, this prior assumes that all ancestral populations are equally represented in the current sample. The log of this prior probability, ignoring an additive constant, is the following:

$$\ln P(q) = (\alpha - 1) \sum_{k=1}^{K} \ln q_k, \qquad \text{where } q_K = 1 - \sum_{k=1}^{K-1} q_k \quad (21)$$

The derivative of the log prior with respect to $q_k$ and its first-order approximation at the mean $q_k = 1/K$ is the following:

$$\frac{\partial}{\partial q_k} \ln P(q) = \frac{(\alpha - 1)\left(q_K - q_k\right)}{q_k q_K} \approx -2 K^2 (\alpha - 1) \left(q_k - \frac{1}{K}\right) \quad (22)$$

Figure 7 First-order approximation for slope of log-likelihood of q. Solid and dashed lines correspond to the true and approximated slope, respectively, for K = 2. The blue, green, red, and orange lines correspond to α = 0.1, α = 0.5, α = 1, and α = 2, respectively.

The following penalty function combines the columns of Q into a single negative log-likelihood function with the approximated derivative in the above equation:

$$-\ln p(Q) \approx K^2 (\alpha - 1) \sum_{i=1}^{N} \sum_{k=1}^{K} \left(q_{ki} - \frac{1}{K}\right)^2 = K^2 (\alpha - 1) \left\| Q - \frac{1}{K} \right\|_2^2 \quad (23)$$

The right-hand side of Equation 23 acts as a penalty term for the least-squares criterion in Equation 9. Figure 7 shows the difference between the real and approximated slope. For q near its mean of 1/K, the approximation fits closely, but for extreme values of q the true slope diverges. Combining the terms in Equations 9 and 23 and including the problem constraints, we have the following least-squares optimization problem:

$$\arg\min_{P,Q} \left\| 2PQ - G \right\|_2^2 + K^2 (\alpha - 1) \left\| Q - \frac{1}{K} \right\|_2^2, \quad \text{such that} \quad \begin{cases} 0 \le P \le 1 \\ Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1 \end{cases} \quad (24)$$

Optimization algorithm

The non-convex optimization problem in Equation 10 can be approached as a two-block coordinate descent problem [15,20]. We initialize Q with nonnegative values such that each column sums to one. Then, we alternate between minimizing the criterion function with respect to P with fixed Q:

$$\arg\min_{0 \le P \le 1} \left\| 2PQ - G \right\|_2^2 \quad (25)$$

and then minimizing with respect to Q with fixed P:

$$\arg\min_{\substack{Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1}} \left\| 2PQ - G \right\|_2^2 + K^2 (\alpha - 1) \left\| Q - \frac{1}{K} \right\|_2^2 \quad (26)$$

This process is repeated until the change in the criterion function is less than ε, at which point we consider the algorithm to have converged. The Admixture algorithm suggests a threshold of ε = 1e-4, but we have found that a larger threshold often suffices. Unless otherwise stated, we use a threshold that depends on the size of the problem: ε = MN×10⁻¹⁰, corresponding to 1e-4 when M = 10000 and N = 100.
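The two-block iteration in Equations 25 and 26 can be sketched in a few lines. This is a simplified illustration for the α = 1 case, with our own names, and it replaces the constrained active/passive-set solvers described below with unconstrained least squares followed by clipping and renormalization, so it shows the iteration structure rather than the exact algorithm:

```python
import numpy as np

def ls_admixture(G, K, eps=None, max_iter=500, seed=0):
    """Two-block coordinate descent on ||2PQ - G||_2^2 (the alpha = 1 case).

    Simplified sketch: each block is solved by unconstrained least squares
    and then projected onto its constraints (P clipped to [0, 1]; Q clipped
    to be nonnegative and its columns renormalized to sum to one).
    """
    M, N = G.shape
    if eps is None:
        eps = M * N * 1e-10              # size-dependent threshold from the text
    rng = np.random.default_rng(seed)
    Q = rng.random((K, N))
    Q /= Q.sum(axis=0)                   # columns sum to one
    prev = np.inf
    for _ in range(max_iter):
        # Equation 25: update P with Q fixed, then project onto [0, 1].
        P = np.linalg.lstsq(Q.T, G.T / 2.0, rcond=None)[0].T
        P = np.clip(P, 0.0, 1.0)
        # Equation 26 (alpha = 1): update Q with P fixed, then project.
        Q = np.linalg.lstsq(2.0 * P, G, rcond=None)[0]
        Q = np.clip(Q, 1e-12, None)
        Q /= Q.sum(axis=0)
        crit = np.sum((2.0 * P @ Q - G) ** 2)
        if prev - crit < eps:            # criterion barely changed: converged
            break
        prev = crit
    return P, Q
```

Because the projections are applied after each unconstrained solve, this sketch is not guaranteed to match the active-set solution; it is only meant to make the alternating structure concrete.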

Least-squares solution for P

Van Benthem and Keenan [16] propose a fast nonnegatively constrained active/passive set algorithm that avoids redundant calculations for problems with multiple right-hand sides. Without considering the constraints on P, Equation 25 can be classically solved using the pseudo-inverse of Q:

$$\hat{P} = \frac{1}{2} G Q^T \left(Q Q^T\right)^{-1} \quad (27)$$

However, some of the elements of P may be less than zero. In the active/passive set approach, if elements of P are negative, they are clamped at zero and added to the active set. The unconstrained solution is then applied to the remaining passive elements of P. If the solution happens to be nonnegative, the algorithm finishes. If not, negative elements are added to the active set, and elements in the active set with a negative gradient (which would decrease the criterion by increasing) are moved back to the passive set. The process is repeated until the passive set is nonnegative and the active set contains only elements with a positive gradient at zero. We extend the approach of Van Benthem and Keenan to include an upper bound at one. Therefore, we maintain two active sets, those clamped at zero and those clamped at one, and update both after the unconstrained optimization of the passive set at each iteration. We provide Matlab source code that implements this algorithm on our website.

Least-squares solution for Q

When solving for Q, it is convenient to reformulate Equation 26 in simpler terms:

$$\arg\min_{\substack{Q \ge 0 \\ \sum_{k=1}^{K} q_{ki} = 1}} \left\| \bar{P} Q - \bar{G} \right\|_2^2, \qquad \bar{P} = \begin{bmatrix} 2P \\ K (\alpha - 1)^{1/2} I_K \end{bmatrix}, \qquad \bar{G} = \begin{bmatrix} G \\ (\alpha - 1)^{1/2} \mathbf{1}_{K \times N} \end{bmatrix} \quad (28)$$

The unconstrained solution for this equation is the following:

$$\hat{Q} = \left(4 P^T P + K^2 (\alpha - 1) I\right)^{-1} \left(2 P^T G + K (\alpha - 1) \mathbf{1}\right) = \left(\bar{P}^T \bar{P}\right)^{-1} \bar{P}^T \bar{G} \quad (29)$$

When prior information is known about the sparseness, we use α in the equations above. When no prior information is known, we use α = 1, corresponding to the uninformative prior and resulting in the ordinary pseudo-inverse solution. In order to incorporate the sum-to-one constraint on the columns of Q, we employ the method of Lagrange multipliers using Equation 11 in the work of Settle and Drake, substituting the identity matrix for the noise matrix, N [21]. For completeness, we include the solution below:

$$Q = a U j + \left(U - a U J U\right) \bar{P}^T \bar{G}, \qquad U = \left(\bar{P}^T \bar{P}\right)^{-1}, \qquad a = \left(\sum_{i=1}^{K} \sum_{j=1}^{K} u_{ij}\right)^{-1}, \qquad j = [1, 1, \ldots, 1]^T, \qquad J = j j^T \quad (30)$$

As before, some elements of Q may be negative. In that case, we utilize the active set method to clamp elements of Q at zero and update the active and passive sets at each iteration until convergence, as described above. We adapt the Matlab script by Van Benthem and Keenan so that the unconstrained solution uses Equation 30 instead of the standard pseudo-inverse and provide it on our website.
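A minimal sketch of the closed form in Equation 30, before any active-set clamping (our notation: `A` plays the role of P̄ and `B` of Ḡ):

```python
import numpy as np

def sum_to_one_ls(A, B):
    """Least squares min_Q ||A Q - B||_2^2 with each column of Q summing to
    one, via the Lagrange-multiplier solution in Equation 30."""
    K = A.shape[1]
    U = np.linalg.inv(A.T @ A)           # U = (A^T A)^{-1}
    j = np.ones((K, 1))                  # j = [1, 1, ..., 1]^T
    J = j @ j.T                          # J = j j^T
    a = 1.0 / U.sum()                    # a = (sum_ij u_ij)^{-1}
    return a * (U @ j) + (U - a * (U @ J @ U)) @ (A.T @ B)
```

A direct calculation shows that every column of the returned Q satisfies the sum-to-one constraint exactly, which is also a convenient check on an implementation.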

Simulated experiments to compare the proposed

approach to Admixture and FRAPPE

We generate simulated genotype data for a variety of problems using M = 10000 markers, and varying N between 100, 1000, and 10000; K between 2, 3, and 4; and α between 0.1, 0.5, 1, and 2, for a total of 36 parameter sets. For each combination of N, K, and α, we generate the ground truth P from a uniform distribution and Q from a Dirichlet distribution parameterized by α. Then, we draw a random genotype for each individual using the binomial distribution in Equation 11. We estimate P and Q using only the genotype information and the true number of populations, K. We repeat the experiment 50 times, drawing new P, Q, and G matrices each time.
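The data-generation procedure just described can be sketched as follows (a minimal version under our reading of the setup, not the authors' code):

```python
import numpy as np

def simulate_genotypes(M, N, K, alpha, seed=0):
    """Draw ground truth P, Q and a genotype matrix G as in the simulations:
    P uniform on [0, 1], columns of Q ~ Dirichlet(alpha, ..., alpha), and
    g_li ~ Binomial(2, m_li) with m_li = sum_k p_lk q_ki (Equation 11)."""
    rng = np.random.default_rng(seed)
    P = rng.random((M, K))                       # population allele frequencies
    Q = rng.dirichlet([alpha] * K, size=N).T     # (K, N); columns sum to one
    m = P @ Q                                    # per-locus reference-allele chance
    G = rng.binomial(2, m)                       # genotypes in {0, 1, 2}
    return P, Q, G

P, Q, G = simulate_genotypes(M=1000, N=100, K=3, alpha=0.5)
```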

Finally, we record the performance of Admixture using the published tight convergence threshold of ε = 1e-4 [13] and a loose convergence threshold of ε = MN×10⁻⁴; the least-squares algorithm using an uninformative prior (α = 1) and ε = MN×10⁻⁴; and the FRAPPE EM algorithm using the published threshold of ε = 1. For reference, we also include the least-squares algorithm with an informative prior (known α) and a convergence threshold of ε = MN×10⁻⁴. In all experiments, Admixture's performances with the two convergence thresholds were nearly identical, and we only report the results for ε = MN×10⁻⁴, resulting in shorter computation times. We used a four-way analysis of variance (ANOVA) with a fixed-effects model to reveal which factors (including algorithm) contribute more or less to the estimation error and computation time.

Statistical significance of root mean squared error and computation time

For each combination of K, N, and α, we perform a Kruskal-Wallis test to determine whether Admixture, Least-Squares, and FRAPPE perform significantly differently at a Bonferroni-adjusted significance level of 0.05/(36 parameter sets) = 0.0014. If there is no significant difference, we consider their performances equal. If there is a significant difference, we perform pair-wise Mann-Whitney U-tests to determine significant differences between specific algorithms. We use a Bonferroni-adjusted significance level of 0.05/(36 parameter sets)/(3 pair-wise comparisons) = 4.6e-4. The ‘Summary’ columns contain the order of performance among the algorithms such that every algorithm to the left of a ‘<’ symbol performs better than every algorithm to the right. An ‘=’ symbol indicates that the adjacent algorithms do not perform significantly differently.
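The two-stage testing procedure above can be sketched with SciPy. The RMSE samples here are synthetic stand-ins (not the paper's results); the structure — an omnibus Kruskal-Wallis test at 0.05/36, followed by pairwise Mann-Whitney U-tests at 0.05/36/3 only when the omnibus test rejects — follows the text.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(1)

# Hypothetical RMSE values over 50 trials for one (K, N, alpha) parameter set.
rmse = {
    "Admixture":    rng.normal(0.05, 0.01, 50),
    "LeastSquares": rng.normal(0.05, 0.01, 50),
    "FRAPPE":       rng.normal(0.08, 0.01, 50),
}

# Omnibus test across the three algorithms, Bonferroni-adjusted over 36 parameter sets.
h_stat, p_omnibus = kruskal(*rmse.values())

if p_omnibus < 0.05 / 36:
    # Pairwise follow-up tests at 0.05 / 36 / 3 = 4.6e-4.
    names = list(rmse)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            u, p = mannwhitneyu(rmse[names[i]], rmse[names[j]])
            print(names[i], "vs", names[j], "significant:", p < 0.05 / 36 / 3)
```

Both tests are rank-based, so they make no normality assumption about the RMSE distributions, which matches the heavy Bonferroni correction used across parameter sets.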

Comparison on admixtures derived from the HapMap3 dataset

In the original Admixture paper [13], the authors simulate admixed genotypes from population allele frequencies derived from the HapMap Phase 3 dataset [22]. We follow their example to compare the algorithms with more realistic population allele frequencies. Rather than drawing P from a uniform distribution, we estimate the population allele frequencies for unrelated individuals in the HapMap Phase 3 dataset using individuals from the following groups: Han Chinese in Beijing, China (CHB); Utah residents with ancestry from Northern and Western Europe (CEU); and Yoruba individuals in Ibadan, Nigeria (YRI) [22]. We use the same 13,928 SNPs provided in the sample data on the Admixture webpage [23]. We randomly simulate 1000 admixed individuals: q ~ Dirichlet(α₁, α₂, α₃). When the Dirichlet parameters are not equal, we use the degree of admixture, α, for LSα that results in the same total variance as the combination of α₁, α₂, and α₃:

$$\alpha = \frac{K-1}{K^{2}v} - \frac{1}{K}, \quad \text{where the total variance } v = \sum_{k=1}^{K} \frac{\alpha_k\left(\alpha_0 - \alpha_k\right)}{\alpha_0^{2}\left(\alpha_0 + 1\right)} \text{ and } \alpha_0 = \sum_{k=1}^{K} \alpha_k. \qquad (31)$$
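Equation 31 can be implemented directly as a quick check; `matched_alpha` is an illustrative helper name, not from the paper. When the input parameters are already symmetric, the matched symmetric parameter should recover the input value, since v is then exactly the total variance of a symmetric Dirichlet with that parameter.

```python
import numpy as np

def matched_alpha(alphas):
    """Symmetric Dirichlet parameter with the same total variance as Dirichlet(alphas).

    Implements Equation 31: the per-component Dirichlet variance is
    a_k (a_0 - a_k) / (a_0^2 (a_0 + 1)), summed over k to give v, and the
    symmetric parameter is then alpha = (K - 1) / (K^2 v) - 1 / K.
    """
    a = np.asarray(alphas, dtype=float)
    K = a.size
    a0 = a.sum()
    v = np.sum(a * (a0 - a)) / (a0**2 * (a0 + 1.0))
    return (K - 1) / (K**2 * v) - 1.0 / K

# Symmetric input recovers itself (up to floating point).
print(matched_alpha([0.5, 0.5, 0.5]))  # 0.5
```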

Real dataset from the HapMap phase 3 project

In the original Admixture paper [13], the authors use Admixture to infer three hypothetical ancestral populations from four known populations in the HapMap Phase 3 dataset, including individuals with African ancestry in the American Southwest (ASW), individuals with Mexican ancestry in Los Angeles (MEX), and the same CEU and YRI individuals from the previous example. We ran each algorithm 20 times on the dataset using a convergence threshold of ε = 1e-4, recording the convergence times for each trial.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
RMP conceived of the least-squares approach to inferring population structure, designed the study, and drafted the document. MDW initiated the SNP data analysis project, acquired funding to sponsor this effort, and directed the project and publication. All authors read and approved the final manuscript.

Acknowledgements
This work was supported in part by grants from Microsoft Research, National Institutes of Health (Bioengineering Research Partnership R01CA108468, P20GM072069, Center for Cancer Nanotechnology Excellence U54CA119338, and 1RC2CA148265), and Georgia Cancer Coalition (Distinguished Cancer Scholar Award to Professor M. D. Wang).

Author details
¹The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA. ²Parker H. Petit Institute of Bioengineering and Biosciences and Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA. ³Winship Cancer Institute and Hematology and Oncology Department, Emory University, Atlanta, GA 30322, USA.

Received: 15 March 2012 Accepted: 6 November 2012

Published: 23 January 2013

References
1. Beaumont M, Barratt EM, Gottelli D, Kitchener AC, Daniels MJ, Pritchard JK, Bruford MW: Genetic diversity and introgression in the Scottish wildcat. Mol Ecol 2001, 10:319–336.
2. Novembre J, Ramachandran S: Perspectives on human population structure at the cusp of the sequencing era. Annu Rev Genomics Hum Genet 2011, 12.
3. Menozzi P, Piazza A, Cavalli-Sforza L: Synthetic maps of human gene frequencies in Europeans. Science 1978, 201:786–792.
4. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006, 38:904–909.
5. McVean G: A genealogical interpretation of principal components analysis. PLoS Genet 2009, 5:e1000686.
6. Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet 2006, 2:e190.
7. Lee C, Abdool A, Huang CH: PCA-based population structure inference with generic clustering algorithms. BMC Bioinformatics 2009, 10.
8. Novembre J, Stephens M: Interpreting principal component analyses of spatial population genetic variation. Nat Genet 2008, 40:646–649.
9. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155:945–959.
10. Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 2003, 164:1567–1587.
11. Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture: analytical and study design considerations. Genet Epidemiol 2005, 28:289–301.
12. Wu B, Liu N, Zhao H: PSMIX: an R package for population structure inference via maximum likelihood method. BMC Bioinformatics 2006, 7:317.
13. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 2009, 19:1655.
14. Alexander D, Lange K: Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 2011, 12:246.
15. Kim H, Park H: Non-negative matrix factorization based on alternating non-negativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications 2008, 30:713–730.
16. Van Benthem MH, Keenan MR: Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. J Chemom 2004, 18:441–450.
17. Hanis CL, Chakraborty R, Ferrell RE, Schull WJ: Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican Americans in Starr County, Texas. Am J Phys Anthropol 1986, 70:433–441.
18. Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 2004, 101:4164.
19. Guillot G, Estoup A, Mortier F, Cosson JF: A spatial statistical model for landscape genetics. Genetics 2005, 170:1261–1280.
20. Bertsekas DP: Nonlinear programming. Belmont, Mass.: Athena Scientific; 1995.
21. Settle JJ, Drake NA: Linear mixing and the estimation of ground cover proportions. Int J Remote Sens 1993, 14:1159–1177.
22. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, Gibbs RA, Belmont JW, Boudreau A, Leal SM: A haplotype map of the human genome. Nature 2005, 437:1299–1320.
23. ADMIXTURE: fast ancestry estimation. [http://www.genetics.ucla.edu/software/admixture/download.html].

doi:10.1186/1471-2105-14-28
Cite this article as: Parry and Wang: A fast least-squares algorithm for population inference. BMC Bioinformatics 2013, 14:28.
