
Compressive Sensing on Manifolds Using a Nonparametric Mixture of Factor Analyzers:

Algorithm and Performance Bounds

¹Minhua Chen, ¹Jorge Silva, ¹John Paisley, ¹Chunping Wang,

²David Dunson and ¹Lawrence Carin

1Electrical and Computer Engineering Department

2Statistics Department

Duke University

Durham, NC 27708-0291

Abstract

Nonparametric Bayesian methods are employed to constitute a mixture of low-rank Gaussians,

for data x ∈ R^N that are of high dimension N but are constrained to reside in a low-dimensional

subregion of RN. The number of mixture components and their rank are inferred automatically from

the data. The resulting algorithm can be used for learning manifolds and for reconstructing signals from

manifolds, based on compressive sensing (CS) projection measurements. The statistical CS inversion

is performed analytically. We derive the required number of CS random measurements needed for

successful reconstruction, based on easily computed quantities, drawing on block–sparsity properties.

The proposed methodology is validated on several synthetic and real datasets.

I. INTRODUCTION

Compressive sensing (CS) theory [1] shows that if a signal may be sparsely rendered in

some basis, it may be recovered perfectly based on a relatively small set of random–projection

measurements. In practice most signals are compressible in an appropriate basis (not exactly

sparsely represented), and in this case highly accurate CS reconstructions are realized based on

such measurements. Recent work has extended the notion of sparsity–based CS to the more

general framework of manifold–based CS [2]–[4]. In manifold–based CS, the signal is assumed

to belong to a manifold, and low information content corresponds to low intrinsic dimension

of the manifold. For a simple example (taken from [4]), consider signals generated by taking

samples of a truncated and shifted Gaussian pulse, as shown in the left panel of Figure 1.

Although the ambient dimension (number of samples) of the signals is high, the only degree

of freedom is the scalar shift a; therefore, the signals belong to a one–dimensional manifold.

If we have a collection of signals, corresponding to different shifts, and project them onto,

say, a random three–dimensional subspace, then the signals will, with high probability, form a

Fig. 1. Left: samples from a truncated and shifted Gaussian, with peak shift a. Right: projections of multiple such signals onto
a random 3D subspace, with the different points on the curve corresponding to different shifts a.

twisting curve that does not self–intersect (right panel of Figure 1). Thus, very few measurements

are required to capture the characteristics of a signal on this manifold. We will return to this

illustrative example in our experimental results. There are many more real examples of low-

dimensional manifold signals, such as digit and face images (see, e.g., [5]).

Although a theoretical analysis for CS on manifolds has been established in [2], [3], very

few algorithms exist for practical implementation. Moreover, existing performance guarantees

depend on quantities that are not easily computable, such as the manifold condition number. In

this paper we propose a statistical framework for CS on manifolds, using a well-studied statistical

model: a mixture of factor analyzers (MFA) [6], [7]. We model a manifold as a finite mixture

of Gaussians, but we depart from conventional Gaussian mixture models by imposing a very

particular structure – the covariances should be approximately low–rank, and the rank should

equal the intrinsic dimension of the manifold. We employ nonparametric statistical methods [8],

[9] to infer an appropriate number of mixture components for a given data set, as well as the

associated rank of the Gaussians.

This model class is rich and can be used for modeling compact manifolds (see, for instance,

[10]). To give some intuition why this is the case, note that if a manifold is compact (which

among other things implies that it is bounded) then it admits a finite covering by topological

disks whose dimensionality equals the intrinsic dimension. We can equate these topological disks

to the principal hyperplanes of sufficiently flat ellipsoids corresponding to high–probability mass

sets of our Gaussians. If there are a sufficiently high number of ellipsoids, then the hyperplanes

are approximately tangent to the manifold and (by definition of manifold) we can establish locally

Fig. 2. Modeling a manifold with a mixture of Gaussians: (a) shows a covering of the manifold by high–probability mass
ellipsoids; the shaded blue area is zoomed in (b), where we can see that point x on the manifold can be made arbitrarily close
to projections x*_1 and x*_2 if the mixture contains enough Gaussians; in (c), Gaussians G1 and G2 with common mean µ model
a union of hyperplanes which is not a manifold due to self–intersection.

valid one-to-one mappings between points on the manifold and points on the hyperplanes, with

arbitrarily small distance between those points and their mappings. Figure 2a–b illustrates this.

Moreover, there are certain sets that are not manifolds but can still be well modeled by an MFA –

a simple example consists of two Gaussians with the same mean but differently oriented principal

hyperplanes, which do not constitute a manifold due to self intersection, as shown in Figure 2c.

Thus, while the proposed model is appropriate for learning the statistics of manifolds, it is more

generally applicable to data that reside in a low-dimensional region of a high-dimensional space.

In this paper we show how to nonparametrically learn the MFA based upon available training

data, and how to reconstruct signals based on limited random-projection measurements (while

motivated by manifold learning, we emphasize, as discussed above, the MFA may also be applied

to high-dimensional signals x ∈ R^N that reside in a small subregion of R^N but not necessarily

restricted to a manifold). We obtain guarantees similar to those in CS for sparse signals, using

subgaussian random projections that are, with high probability, incoherent with a certain block–

sparsity dictionary. Unlike typical CS results, the dictionary is not, in general, an orthonormal

basis. Namely, we derive a restricted isometry property (RIP) for the composite measurement–

dictionary ensemble, drawing on one of two assumptions: (i) separability of the Gaussian means

and (ii) block-incoherence of the low–rank hyperplanes spanned by the principal directions of

the covariances. Importantly, it is shown that only one of the conditions, (i) or (ii), needs to hold.

Our contributions are the following:

• We develop a hierarchical Bayesian algorithm that learns an MFA for the manifold based on


training data. Unlike existing MFA inference algorithms [7], [11], [12], we adopt nonpara-

metric techniques to simultaneously infer the number of clusters and the intrinsic subspace

dimensionality.

• We present a method for reconstructing out–of–sample data using compressed random

measurements. By using the probability density function learned from the MFA as the

prior distribution, the reconstruction can be found analytically by Bayes’ rule.

• We derive bounds on the number of required random-projection measurements, in terms

of easily computable quantities, such as the rank of the covariances, number of clusters,

separation of the Gaussians, coherence and subcoherence of the dictionary.

The remainder of the paper is organized as follows. In Section II we develop the nonparametric

mixture of factor analyzers, and in Section III we discuss how this model may be employed

for CS inversion. Section IV provides performance bounds for CS assuming that the signals

of interest are drawn from a known low-rank Gaussian mixture model, with example results

presented in Section V. Conclusions are provided in Section VI.

II. NONPARAMETRIC MIXTURE OF FACTOR ANALYZERS

We assume the signals x ∈ R^N under study are drawn from a Gaussian mixture model, and

that the rank of each mixture component is small relative to N. We represent this statistical

model as a mixture of factor analyzers (MFA) [7], [13]–[15], with the MFA parameters learned

from training data. As a special case this model may be applied to data drawn from a manifold

or manifolds. In the discussion that follows, for conciseness, we will continually refer to data

drawn from manifolds, since this is an important and motivating sub-problem.

Many existing inference algorithms for learning an MFA [7], [11], [12] require one to a priori

fix the subspace dimension and the number of mixture components (clusters). Unfortunately,

these quantities are usually unknown in advance. We address this issue by placing Dirichlet

Process (DP) [8], [16] and Beta Process (BP) [9], [17] priors on the MFA model, to infer the

above mentioned quantities in a data–driven manner. As discussed further below, the DP is a

nonparametric tool for mixture modeling, appropriate for inferring an appropriate number of

mixture components. The BP is a tool that allows one to uncover an appropriate number of

factors [9], [17]. By integrating DP and BP into a single algorithm, we address both of the

aforementioned problems associated with previous development of mixtures of factor analyzers.


A. Beta process for inferring number of factors

Assume access to n samples x_i ∈ R^N, with i = 1, ···, n (all vectors are column vectors).

The data are said to be drawn from a factor model if for all i

x_i ∼ N(A w_i + µ, α^{-1} I_N),   w_i ∼ N(0, I_J)        (1)

where A ∈ R^{N×J}, µ ∈ R^N, and I_N is the N × N identity matrix (with I_J similarly
defined). The precision is α ∈ R^+, and one typically places a gamma prior on this quantity
(discussed further below). We initially assume we know the number of factors J, and at this
point we do not consider a mixture model.

By first considering (1) we may motivate the model that follows. Specifically, we may re-

express (1) as

x_i ∼ N(µ, Σ)        (2)

where

Σ = A A^T + α^{-1} I_N        (3)

If J ≤ N and the columns of A are linearly independent, we may express A A^T = Σ_{j=1}^{J} ζ_j v_j v_j^T,
with orthonormal v_j ∈ R^N and singular values ζ_j ∈ R^+, where ζ_1 through ζ_J are ordered in
decreasing amplitude. If 1/α is small relative to ζ_J, then x_i is drawn approximately from a
Gaussian of covariance rank J, and if J ≪ N these Gaussians correspond to localized, low-dimensional
tubes in R^N (see Figure 2). As is well known [18], the vectors v_j define the principal
coordinates of the low-rank Gaussian, which is centered about mean position µ.
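As a quick numerical check of this factor model, the following numpy sketch (not from the paper; dimensions, seed, and precision are illustrative) draws samples from (1) and verifies that their empirical covariance approaches Σ = A A^T + α^{-1} I_N as in (3):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): ambient dimension N, J factors.
N, J, n = 64, 3, 20000
alpha = 1e4                        # large precision -> nearly rank-J covariance

A = rng.normal(size=(N, J))        # factor loadings
mu = rng.normal(size=N)            # mean

# x_i = A w_i + mu + noise, with w_i ~ N(0, I_J), as in (1).
W = rng.normal(size=(n, J))
X = W @ A.T + mu + rng.normal(scale=alpha ** -0.5, size=(n, N))

# Empirical covariance approaches Sigma = A A^T + alpha^{-1} I_N, as in (3).
Sigma = A @ A.T + np.eye(N) / alpha
emp = np.cov(X, rowvar=False)
print(float(np.abs(emp - Sigma).max()))
```

For large n the discrepancy shrinks; the near-rank-J structure of Σ is what the low-rank MFA components exploit.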

Before generalizing (1) to a mixture of factor analyzers, we wish to address the problem of

inferring J, which defines the rank of the Gaussians and is related to the dimensionality of

the manifold. Toward this end, we modify (1) as

x_i ∼ N(A w_i + µ, α^{-1} I_N),   w_i = ŵ_i ◦ z,   ŵ_i ∼ N(0, I_K)        (4)

z ∼ ∏_{k=1}^{K} Bernoulli(π_k),   π ∼ ∏_{k=1}^{K} Beta(a/K, b(K − 1)/K)

where K is an integer chosen to be large relative to the number of anticipated factors (e.g., we
may set K = N), π_k is the kth component of π, and ◦ represents the point-wise (Hadamard)


vector product.

As discussed in [9], when K → ∞ the number of non-zero components in z is drawn

from Poisson(a/b). For finite K, of interest here, one may show that the number of non-zero

components of z is drawn from Binomial(K,a/(a+b(K−1))), and therefore one may set a and

b to impose prior belief on the number of factors that will be important per mixture component.

The expected number of non-zero components in z is aK/[a+b(K −1)]. This construction has

been related to a Beta Process, as discussed in [9].
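The Binomial(K, a/(a + b(K − 1))) behavior of the number of active factors is easy to check numerically; this is a minimal simulation sketch with illustrative values of K, a, and b:

```python
import numpy as np

rng = np.random.default_rng(1)
K, a, b = 50, 1.0, 1.0            # illustrative values; the paper sets a = b = 1
trials = 20000

# pi_k ~ Beta(a/K, b(K-1)/K) and z_k ~ Bernoulli(pi_k), independently in k.
pi = rng.beta(a / K, b * (K - 1) / K, size=(trials, K))
z = rng.random((trials, K)) < pi

# Expected number of active factors: aK / (a + b(K-1)).
expected = a * K / (a + b * (K - 1))
print(expected, float(z.sum(axis=1).mean()))
```

With a = b = 1 and K = 50 the prior expects one active factor; the simulated mean should sit close to that value.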

When one performs Bayesian inference using a construction like (4), the posterior density

function on the binary vector z defines the number of columns of A that contribute to the factor

analysis model, and hence this yields the rank of the Gaussian. As discussed further below,

Bayesian inference may be performed relatively simply via this construction.

B. Dirichlet process for mixture of factor analyzers

The hierarchical model in (4) may be used to infer the number of factors in a factor model,

with all x_i residing in a single associated subspace. However, to capture the nonlinear shape of

a manifold, we are interested in a mixture of low-rank Gaussians (see Figure 2). In general one

does not know a priori the proper number of mixture components. We consequently employ

the Dirichlet process (DP) [8], [16]. A draw G from a DP may be expressed as G ∼ DP(η G_0),
where η ∈ R^+ and G_0 is a base probability measure. A constructive representation for such a

draw may be represented as [8]

G = Σ_{t=1}^{∞} λ_t δ_{θ*_t},   λ_t = v_t ∏_{l=1}^{t-1} (1 − v_l),   v_t ∼ Beta(1, η),   θ*_t ∼ G_0        (5)

where we note that, by construction, Σ_{t=1}^{∞} λ_t = 1; the expression δ_{θ*_t} is a point measure
situated at θ*_t. The observed data samples {x_i}_{i=1,N} may be drawn from a parametric model
f(θ_i) with associated parameter θ_i, with θ_i ∼ G. Since G is of the form in (5), for a relatively
large number of samples N, many of the x_i will share the same parameters θ*_t, and therefore the
{x_i}_{i=1,N} are drawn from a mixture model. In our problem f(·) is a Gaussian, and hence we
yield a Gaussian mixture model. While there are an infinite number of components (in principle)
within G, via posterior inference we infer an appropriate number of mixture components for the
data {x_i}_{i=1,N}.
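A minimal sketch of the truncated stick-breaking construction in (5) follows (the truncation level and concentration are illustrative; setting the last stick variable to one forces the truncated weights to sum to one, as in the truncated representation of [19]):

```python
import numpy as np

rng = np.random.default_rng(2)
eta, T = 1.0, 50                  # concentration and truncation (illustrative)

# lambda_t = v_t * prod_{l<t} (1 - v_l), with v_t ~ Beta(1, eta);
# the last stick takes the remaining mass so the weights sum to one.
v = rng.beta(1.0, eta, size=T)
v[-1] = 1.0
lam = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

print(float(lam.sum()))           # 1.0 by construction
print(np.sort(lam)[::-1][:5])     # most mass sits in a few dominant sticks
```

For moderate η most of the probability mass concentrates in a few sticks, which is why posterior inference tends to occupy only a small number of the T available components.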


For the mixture of factor analyzers (MFA) model proposed here, the base measure G_0 is defined
to be consistent with the FA model. Specifically, the model parameters θ*_t associated with mixture
component t are defined to be consistent with the FA model. We have

x_i ∼ N(A_{t(i)} w_i + µ_{t(i)}, α_{t(i)}^{-1} I_N)        (6)

A_{t(i)} = Ã_{t(i)} ∆_{t(i)}        (7)

w_i = ŵ_i ◦ z_{t(i)},   ŵ_i ∼ N_{t(i)}(0, I_K)        (8)

t(i) ∼ Mult(1; λ_1,...,λ_T)        (9)

λ_t = v_t ∏_{l=1}^{t-1} (1 − v_l)        (10)

v_t ∼ Beta(1, η)        (11)

z_t ∼ ∏_{k=1}^{K} Bernoulli(π_k)        (12)

π_t ∼ ∏_{k=1}^{K} Beta(a/K, b(K − 1)/K)        (13)

µ_t ∼ N(µ, τ_0^{-1} I_N)        (14)

Ã_t ∼ ∏_{k=1}^{K} N(0, (1/N) I_N)        (15)

∆_t ∼ ∏_{k=1}^{K} N(0, τ_{tk}^{-1})        (16)

with λ_T = 1 − Σ_{t=1}^{T-1} λ_t, where we have truncated the DP sum to T terms (properties of this
truncation are discussed in [19]). The notation ∆_t ∼ ∏_{k=1}^{K} N(0, τ_{tk}^{-1}) is meant to mean that
the K diagonal elements of ∆_t are drawn from N(0, τ_{tk}^{-1}), with k = 1, 2, ···, K. The diagonal
matrix ∆_t encodes the importance of each column in Ã_t, playing the same role as singular values
in an SVD. The expression Mult(1; λ_1,...,λ_T) represents drawing one sample from a multinomial
distribution defined by (λ_1,...,λ_T), and t(i) corresponds to the mixture component associated
with the ith draw. The expression N_{t(i)}(0, I_K) is meant to indicate that the factor score associated
with a given sample i is explicitly linked to a particular mixture component (this is important for
yielding a mixture-component-dependent posterior density function of w, and impacts the
Gibbs-sampler update equations). The vector µ represents the mean computed based on all training

data used to design the model, i.e., µ = (1/n) Σ_{i=1}^{n} x_i.

The expression Ã_t ∼ ∏_{k=1}^{K} N(0, (1/N) I_N) means that each of the K columns of Ã_t are drawn
independently from N(0, (1/N) I_N), implying that on average these columns will have unit norm
(although any given draw will not exactly have unit norm). The covariance associated with
mixture component t, as constituted via the prior, is Σ_t = Ã_t ∆̃_t^2 Ã_t^T + α_t^{-1} I_N, where ∆̃_t =
∆_t diag(z_{t1},...,z_{tK}). Hence, the number of non-zero components in the binary vector z_t defines
the approximate rank of Σ_t, assuming the smallest diagonal element of ∆_t^2 is large relative to
α_t^{-1}. There are other ways one may draw A_t from a hierarchical generative model, but this
construction appears to be the most stable among many we have examined.

Note that in (12)–(16), within the prior, the T components {z_t}_{t=1,T}, {π_t}_{t=1,T}, {µ_t}_{t=1,T},
{Ã_t}_{t=1,T} and {∆_t}_{t=1,T} are drawn once for all samples {x_i}_{i=1,N}, and these effectively
correspond to the draws from the base measure G_0; (10)–(11) are also drawn once, these yielding
the T mixture weights in the truncated “stick-breaking” representation of the DP [19]. The ŵ_i
is drawn separately for each of the N samples. Hence, all data drawn from a given mixture
component t share the factor-loading matrix A_t ∈ R^{N×K} and the same set of important columns
(factor loadings) defined by z_t, but each draw from a given mixture component has unique
weights w_i.

Concerning the way in which µ_t ∈ R^N and A_t ∈ R^{N×K} are drawn, the use of independent
Gaussians allows for convenient inference. It is important to note that (14) and (15) simply
constitute convenient priors, while the posterior Bayesian analysis will infer the correlations
within these terms. The same is true for the factor score ŵ_i.
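To make the generative process concrete, the following sketch draws from a simplified version of the truncated MFA prior (all sizes and hyper-parameter values are illustrative, and the per-factor precisions τ_tk are fixed at one rather than drawn from their gamma prior):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, T, n = 32, 10, 8, 200       # illustrative truncations and sizes
a, b, eta, tau0, alpha = 1.0, 1.0, 1.0, 1.0, 1e3

# Mixture weights via truncated stick-breaking.
v = rng.beta(1.0, eta, size=T)
v[-1] = 1.0
lam = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

# Per-component parameters, drawn once.
mu_t = rng.normal(0.0, tau0 ** -0.5, size=(T, N))
A_tilde = rng.normal(0.0, (1.0 / N) ** 0.5, size=(T, N, K))
pi = rng.beta(a / K, b * (K - 1) / K, size=(T, K))
z = (rng.random((T, K)) < pi).astype(float)      # shared sparsity per component
delta = rng.normal(size=(T, K))                  # diagonal of Delta_t (tau_tk = 1 here)

# Per-sample draws: component label, factor scores, observation.
labels = rng.choice(T, size=n, p=lam)
X = np.empty((n, N))
for i in range(n):
    t = labels[i]
    w = rng.normal(size=K) * z[t] * delta[t]     # w_i = w_hat_i ∘ z_t, scaled by Delta_t
    X[i] = A_tilde[t] @ w + mu_t[t] + rng.normal(scale=alpha ** -0.5, size=N)
print(X.shape)
```

All samples assigned to a component share that component's mean, loadings, and active-factor pattern, while each sample has its own factor scores, mirroring the structure described above.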

C. Inference and hyper-parameter settings

To complete the model, we assume

η ∼ Gamma(c, d),   τ_{tk} ∼ Gamma(e, f),   α_t ∼ Gamma(g, h)        (17)

Non-informative hyper-parameters are employed throughout, setting c = d = e = f = g = h =
10^{-6}. The precision τ_0 = 10^{-6}, implying that the means µ_t were drawn almost from a uniform

prior. Further, we set a = b = 1 in all examples. While there may appear to be a large number of

hyper-parameters, all of these settings are “standard” [9], [20], and there has been no parameter

tuning for any of the examples. In all examples below, we set the truncations as K = T = 50.


The model parameters can be estimated using Gibbs sampling – we provide a detailed

explanation of the procedure in the Appendix. One may also perform variational Bayesian (VB)

[11] analysis for this model. However, learning of the model need only be performed once, and

therefore Gibbs sampling has been employed for this purpose. In all examples presented below,

we employed 2000 burn-in iterations, and 1000 collection iterations. As discussed in the next

section, the CS inversion using this model may be performed analytically.

D. Probability density estimation from the above nonparametric MFA model

The above model can be interpreted as a Bayesian local PCA model, in which the signal

manifold is approximated by a mixture of local subspaces. After model inference, the probability

density function (pdf) of the signal can be estimated as follows:

p(x) = Σ_{t=1}^{T} λ_t ∫ N(x; Ã_t(∆_t diag(z_t)) ŵ + µ_t, α_t^{-1} I_N) N(ŵ; ξ_t, Λ_t) dŵ = Σ_{t=1}^{T} λ_t N(x; χ_t, Ω_t)        (18)

where

χ_t = µ_t + Ã_t(∆_t diag(z_t)) ξ_t;   Ω_t = Ã_t(∆_t diag(z_t)) Λ_t (diag(z_t) ∆_t) Ã_t^T + α_t^{-1} I_N        (19)

This is the explicit form of the low-rank GMM density we estimate. If we use the prior distribution
for ŵ, i.e., ξ_t = 0 and Λ_t = I_K, then χ_t = µ_t and Ω_t = Σ_t, with Σ_t defined in Section
II-B. However, this estimator may not be accurate since ŵ has its own posterior distribution in
each mixture component. Thus, we use the following posterior mean and covariance estimate
for ŵ from Gibbs sampling:

ξ_t = (1/M) Σ_{m=1}^{M} [ (Σ_{i: t(i)=t} ŵ_i^{(m)}) / (Σ_{i: t(i)=t} 1) ];   Λ_t = (1/M) Σ_{m=1}^{M} [ (Σ_{i: t(i)=t} ŵ_i^{(m)} ŵ_i^{(m)T}) / (Σ_{i: t(i)=t} 1) ] − ξ_t ξ_t^T

where ŵ_i^{(m)} is the mth collected Gibbs sample. We use the mean of the Gibbs samples to estimate
Ã_t. The estimated pdf in (18) will be used as the prior distribution for new testing signals in the
compressive sensing application, which is illustrated in the following section.
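The moment computation in (19) amounts to a diagonal scaling of the columns of Ã_t. A minimal helper sketch (function and variable names are ours, not the paper's code), together with a sanity check that the prior moments ξ_t = 0, Λ_t = I_K recover χ_t = µ_t and Ω_t = Σ_t:

```python
import numpy as np

def mixture_moments(A_tilde, delta, z, mu, xi, Lam, alpha):
    """chi_t and Omega_t of (19) for one component; Delta_t diag(z_t) is
    diagonal, so it acts by scaling the columns of A_tilde."""
    B = A_tilde * (delta * z)                # A_tilde @ (Delta_t diag(z_t))
    chi = mu + B @ xi
    Omega = B @ Lam @ B.T + np.eye(len(mu)) / alpha
    return chi, Omega

# Sanity check with the prior moments xi = 0, Lam = I_K.
rng = np.random.default_rng(4)
N, K = 6, 3
At = rng.normal(size=(N, K))
d = rng.normal(size=K)
zt = np.array([1.0, 0.0, 1.0])
mu_t = rng.normal(size=N)
chi, Omega = mixture_moments(At, d, zt, mu_t, np.zeros(K), np.eye(K), 10.0)
print(np.allclose(chi, mu_t))
```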


III. COMPRESSIVE SENSING USING A LOW-RANK MFA

A. CS Inversion for data drawn from a low-rank GMM

Using the procedure discussed in the previous section, assume access to an MFA for a low-rank
Gaussian mixture model of interest (e.g., for representation of a manifold). Let x ∈ R^N be
a vector drawn from this distribution. Rather than measuring x directly, we perform a projection
measurement y = Φx + ν, where Φ ∈ R^{m×N} is a measurement projection matrix, and we are
interested in the case m ≪ N. Details on the design of Φ are provided in Section IV. The
vector ν ∈ R^m represents measurement noise. Our goal is to recover x from y, with this done
effectively with m ≪ N measurements because x is known to be drawn from a low-dimensional
MFA model. Bounds on requirements for m, based upon the properties of the MFA, are discussed
in Section IV. Our objective here is to describe a general-purpose algorithm for recovering x
from y.

Let p(x) represent the MFA learned via the procedure discussed in Section II. Note that the

nonparametric learning procedure discussed there infers a full posterior density function on all

parameters of the mixture model. Within the CS inversion we utilize the inferred mean of each

mixture component, and an approximation to the covariance matrix based on averaging across

all collection samples. Specifically, we have

p(x) = Σ_{t=1}^{T} λ_t N(x; χ_t, Ω_t)        (20)

where χt represents the mean for mixture component t, and Ωt is the approximate inferred

covariance matrix defined in (18) and (19). The λ_t are the mean mixture weights learned via

the DP analysis, noting that in practice many of these will be very near zero (hence, we infer

the proper number of meaningful mixture components, with T simply a large-valued truncation

of the DP stick-breaking representation [19]).

The noise ν is assumed drawn from a zero-mean Gaussian with precision matrix (inverse

covariance) R. The conditional distribution of x given y may be evaluated analytically as

p(x|y) = p(x) p(y|x) / ∫ p(x) p(y|x) dx = [ Σ_{t=1}^{T} λ_t N(x; χ_t, Ω_t) × N(y; Φx, R^{-1}) ] / [ ∫ Σ_{l=1}^{T} λ_l N(x; χ_l, Ω_l) × N(y; Φx, R^{-1}) dx ] = Σ_{t=1}^{T} λ̃_t N(x; χ̃_t, Ω̃_t)        (21)


with

λ̃_t = λ_t N(y; Φχ_t, R^{-1} + ΦΩ_tΦ^T) / Σ_{l=1}^{T} λ_l N(y; Φχ_l, R^{-1} + ΦΩ_lΦ^T)

Ω̃_t = (Φ^T R Φ + Ω_t^{-1})^{-1} = Ω_t − Ω_tΦ^T (R^{-1} + ΦΩ_tΦ^T)^{-1} ΦΩ_t

χ̃_t = Ω̃_t (Φ^T R y + Ω_t^{-1} χ_t) = Ω_tΦ^T (R^{-1} + ΦΩ_tΦ^T)^{-1} (y − Φχ_t) + χ_t

In the above computations, the following identity for the normal distribution is used:

N(x; χ_t, Ω_t) × N(y; Φx, R^{-1}) = N(x; χ̃_t, Ω̃_t) × N(y; Φχ_t, R^{-1} + ΦΩ_tΦ^T)

with the reader referred to equation (10) and the associated discussion in [20] for a fuller

description of a related derivation. In the results presented below we consider the case for
which the components of R^{-1} tend to zero, therefore assuming noise-free measurements. If
the measurements are noisy one may infer R within a hierarchical Bayesian analysis, but the
inversion for x is no longer analytic (unless the noise covariance R^{-1} is known a priori).
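The analytic inversion in (21) and the accompanying identities can be sketched directly in numpy; this is an illustrative implementation (function and variable names are ours), shown for a generic noise precision R:

```python
import numpy as np

def gauss_logpdf(y, m, S):
    # Log-density of N(y; m, S), computed via solve/slogdet for stability.
    d = y - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(S, d))

def cs_posterior(y, Phi, lam, chi, Omega, R):
    """Posterior mixture of (21): weights, means (chi-tilde), covariances (Omega-tilde)."""
    T, N = chi.shape
    Rinv = np.linalg.inv(R)
    logw = np.empty(T)
    means = np.empty((T, N))
    covs = np.empty((T, N, N))
    for t in range(T):
        S = Rinv + Phi @ Omega[t] @ Phi.T          # covariance of y under component t
        logw[t] = np.log(lam[t]) + gauss_logpdf(y, Phi @ chi[t], S)
        G = Omega[t] @ Phi.T @ np.linalg.inv(S)    # gain
        means[t] = chi[t] + G @ (y - Phi @ chi[t])
        covs[t] = Omega[t] - G @ Phi @ Omega[t]
    w = np.exp(logw - logw.max())
    return w / w.sum(), means, covs

# Toy usage: with near-noise-free measurements the posterior mean honors y = Phi x.
rng = np.random.default_rng(5)
N, m, T = 8, 4, 2
Phi = rng.normal(size=(m, N))
chi = rng.normal(size=(T, N))
Omega = np.stack([0.5 * np.eye(N)] * T)
lam = np.array([0.5, 0.5])
R = 1e8 * np.eye(m)                                # very high noise precision
y = Phi @ (chi[0] + 0.1 * rng.normal(size=N))
w, means, covs = cs_posterior(y, Phi, lam, chi, Omega, R)
xhat = w @ means
print(np.abs(Phi @ xhat - y).max())
```

Note the recovery is a full mixture posterior; the point estimate above is its mean, matching the reporting convention used in the experiments.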

It is interesting to note that the MFA mixture model may be relatively computationally

expensive to learn, depending on the number of samples one has to learn the properties of p(x)

(it is desirable that the number of samples be as large as possible, to improve model quality).

However, once p(x) is learned “off line”, the CS recovery y → x is analytic. Moreover, rather

than simply yielding a single “point” estimate for x, we recover the full distribution p(x|y).

When presenting results we plot the mean value of the inferred x.

B. Illustration using simple manifold data

To make the discussion more concrete, we return to the shifted Gaussian manifold example

discussed in the Introduction. This simple example is considered to examine learning p(x), as

well as inference of p(y|x) for CS inversion. The training set consists of n = 900 shifted

Gaussian samples, each with dimension N = 128.

Figure 3 examines the number of mixture components inferred, as well as the properties of

any particular data point as viewed from the MFA. Specifically, a total of eight important mixture

components are inferred, and the vast majority of data samples (896 of 900) are associated with

only one mixture component (cluster). This implies that the signals possess block sparsity, in


that a given signal x only employs factors from one of the mixture components. We exploit this

property when deriving bounds for the number of required CS measurements.

In Figure 4 are shown the properties of the individual factor models (for a given mixture

component) of the MFA. Note that each mixture component is dominated by a single factor,

consistent with the one-dimensional character of the manifold.

Fig. 3. Learned MFA mixture-component occupancy for the shifted Gaussian data. Bottom left: Number of mixture components

(clusters) with significant (> 0.1) posterior probability, for each training point. Note how only four out of 900 points on the

manifold have more than one significant cluster. Top left: plots of the posterior p(ti|xi,−−), for i = 208 and i = 300. In most

cases, the posterior behaves as shown for i = 300. Only weights for mixture components 1–10 are shown, with the remaining

components having negligible mixture weights for these samples. Top right: Cluster occupation probability for each training

point. Bottom right: Learned weights of the clusters in the mixture (in total, only eight dominant mixture components are

inferred).

After the training process using the n = 900 samples, we have an MFA model for p(x),

which we may now employ in CS inversion. To test the CS inversion, we generated 100 new

(noise-free) test signals with different peak positions a, and generated the associated compressive
projection measurements y (with the components of Φ drawn i.i.d. from N(0,1)). The

performance is shown in Figure 5 and examples of the reconstruction are given in Figure 6 for

m = 5 CS measurements. Using very few measurements (m ≪ N), we reconstruct x almost

perfectly.


Fig. 4. Properties of the individual (mixture-component-dependent) factor models for the MFA, considering the shifted Gaussian
data. The left figure shows the center of each cluster (χ_t), and the right figure shows the subspace usage of each cluster
(∆_t diag(z_t)).

Fig. 5. Average relative reconstruction error for the shifted Gaussian data, as a function of the number of measurements
(percentage of the N = 128).

C. Gibbs sampling and label switching

When performing Gibbs sampling to learn the MFA, we infer the number of factors per

mixture component, as well as the number of mixtures. Further, we set the truncation levels for

the number of factors and mixture components to large values (K = T = 50). In principle, the

indexes on the factors and mixtures are exchangeable, and therefore within the Gibbs sampler

the indexes (between the 50 possible values) of the factors and mixture components may be

inter-changed between consecutive Gibbs samples (in fact, this would be an indication of good


Fig. 6. CS reconstruction result for the shifted Gaussian data. The left figure shows examples of the test signals, and the right
figure shows the reconstructed signals with 3.9% measurements (recovery of x ∈ R^{128} based on y ∈ R^5).

mixing, since the labels are exchangeable). Plots like the right figures in Figure 3 indicate that

the labels of the factors and mixtures converge to a local mode, as the associated labels are stable

after a sufficient number of Gibbs burn-in iterations. We note that if label switching becomes a

problem for particular MFA examples, techniques are available to address this [21], [22], such

that one may recover an analytic expression for the mixture model. In all examples considered

here label switching was not found to be a problem.

IV. BOUNDS ON THE NUMBER OF RANDOM MEASUREMENTS

We derive sample complexity bounds for CS reconstruction under the MFA model. Our

analysis assumes a similar setting to that of traditional CS, in the sense that we assume access

to the sparsity dictionary, which is fixed. In practice, we do not know the dictionary because we

do not know the parameters of the MFA a priori – in fact, we propose a method for learning

those parameters in this paper. Nevertheless, we can separate the problem of CS reconstruction

from that of MFA estimation. Therefore, we use plug–in estimates of the MFA model, which

gives us a baseline to compare against other CS methods, with the understanding that, like other

CS results, our bounds do not contemplate parameter uncertainty on the dictionary; in any case,

that would not be possible without taking into account the size of the training set for the MFA

learning step, which we do not attempt here.

Each observation xiis assumed drawn from an MFA as in (6), where in the analysis below we


consider the case α_{t(i)} → ∞; hence, we ignore additive measurement noise for simplicity. For

notational simplicity, we additionally assume that each mixture component is composed of d
factors. We define a matrix Ψ ∈ R^{N×(d+1)T}, where consecutive blocks of d+1 contiguous columns

correspond to the associated columns in the mixture-component-dependent factor matrices A_t,
for t = 1,...,T. The first column in each block corresponds to the respective normalized (to
unit Euclidean norm) mean vector, and the remaining d columns in each block are defined

by the columns of A_t (which are assumed to be linearly independent). Any x satisfies x = Ψθ,

where θ is assumed to have only d + 1 non-zero components, corresponding to the respective

block. Note that this assumption comes from the way in which we have arranged Ψ to match the

MFA model: each contiguous block of d + 1 columns corresponds to one mixture component;

within each block the first column is the normalized mean of that component and the other d

columns are the factors. This block structure in Ψ naturally imposes the same structure on θ,

which we exploit. Thus, the dictionary Ψ contains a subset of the information present in the

MFA parameters, namely the means and covariances of the mixture components.

Recovering x from random projections amounts to recovering θ, which we assume to follow a

particular sparsity pattern: θ is block–sparse, meaning that the non-zero coordinates of θ appear

in predefined (d + 1)–sized blocks; in this case, it also happens that the first element is always
equal to ‖µ_t‖ for the t-th block.
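The arrangement of Ψ and the block-sparse θ can be illustrated with a toy construction (all MFA parameters below are random placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, T = 16, 2, 4                # illustrative sizes

# Placeholder MFA parameters: per-component mean and d factor columns.
mus = rng.normal(size=(T, N))
As = rng.normal(size=(T, N, d))

# Psi: each block is [normalized mean | d factor columns].
blocks = []
for t in range(T):
    blocks.append(mus[t][:, None] / np.linalg.norm(mus[t]))
    blocks.append(As[t])
Psi = np.hstack(blocks)           # shape (N, (d + 1) * T)

# A signal from component t uses only block t; the block's first entry is ||mu_t||.
t = 1
w = rng.normal(size=d)
theta = np.zeros((d + 1) * T)
theta[t * (d + 1)] = np.linalg.norm(mus[t])
theta[t * (d + 1) + 1:(t + 1) * (d + 1)] = w
x = Psi @ theta
print(Psi.shape, np.allclose(x, mus[t] + As[t] @ w))
```

Multiplying the normalized mean column by ‖µ_t‖ restores µ_t, so x = µ_t + A_t w, i.e., a point on the t-th affine hyperplane.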

The role of block–sparsity has recently been noted in the problem of reconstructing signals that

live in a union of linear subspaces [23], [24]. The related notion of block–coherence, introduced

in the same work, will be of use here, although in our setting we are interested in a union of

affine spaces rather than linear subspaces (our hyperplanes will generally not include the origin).

An additional difference is that our dictionaries can be, and usually are under–complete, i.e.,

(d+1)T < N. Also of crucial importance is the role of separability, which has been explored by

Dasgupta [18] in the context of learning mixtures of high dimensional Gaussians (not necessarily

low–rank) from random projections. We build on these results and extend them to certain cases

where separability does not hold.

We now define block–sparsity, block–coherence and separability more precisely, following

[23] and [18], respectively.

a) Block–sparsity [25]: Let x ∈ R^N be represented in a dictionary Ψ ∈ R^{N×(d+1)T}, so
that x = Ψθ, where θ ∈ R^{(d+1)T} is a parameter vector. We say that x is block L–sparse in Ψ,
