
FAST ADAPTIVE VARIATIONAL SPARSE BAYESIAN LEARNING WITH AUTOMATIC RELEVANCE DETERMINATION

Dmitriy Shutin†, Thomas Buchgraber⋆, Sanjeev R. Kulkarni†, H. Vincent Poor†

†Department of Electrical Engineering, Princeton University, USA

⋆Signal Processing and Speech Comm. Lab., Graz University of Technology, Austria

ABSTRACT

In this work a new adaptive fast variational sparse Bayesian learning (V-SBL) algorithm is proposed that is a variational counterpart of the fast marginal likelihood maximization approach to SBL. It allows one to adaptively construct a sparse regression or classification function as a linear combination of a few basis functions by minimizing the variational free energy. In the case of non-informative hyperpriors, also referred to as automatic relevance determination, the minimization of the free energy can be efficiently realized by computing the fixed points of the update expressions for the variational distribution of the sparsity parameters. The criteria that establish convergence to these fixed points, termed pruning conditions, allow an efficient addition or removal of basis functions; they also have a simple and intuitive interpretation in terms of a component’s signal-to-noise ratio. It has been demonstrated that this interpretation allows a simple empirical adjustment of the pruning conditions, which in turn improves sparsity of SBL and drastically accelerates the convergence rate of the algorithm. The experimental evidence collected with synthetic data demonstrates the effectiveness of the proposed learning scheme.

1. INTRODUCTION

During the past decade, research on sparse signal representations has received considerable attention [1–5]. With a few minor variations, the general goal of sparse reconstruction is to optimally estimate the parameters of the following canonical model:

$$\mathbf{t} = \boldsymbol{\Phi}\mathbf{w} + \boldsymbol{\xi}, \tag{1}$$

where $\mathbf{t} \in \mathbb{R}^N$ is a vector of targets, $\boldsymbol{\Phi} = [\boldsymbol{\varphi}_1, \ldots, \boldsymbol{\varphi}_L]$ is a design matrix with $L$ columns corresponding to basis functions $\boldsymbol{\varphi}_l \in \mathbb{R}^N$, $l = 1, \ldots, L$, and $\mathbf{w} = [w_1, \ldots, w_L]^T$ is a vector of weights that are to be estimated. The additive perturbation $\boldsymbol{\xi}$ is typically assumed to be a white Gaussian random vector with zero mean and covariance matrix $\boldsymbol{\Sigma} = \tau^{-1}\mathbf{I}$, where $\tau$ is a noise precision parameter. Imposing constraints on the model parameters $\mathbf{w}$ is key to sparse signal modeling [3].

In sparse Bayesian learning (SBL) [2, 4, 6] the weights $\mathbf{w}$ are constrained using a parametric prior probability density function (pdf) $p(\mathbf{w}|\boldsymbol{\alpha}) = N(\mathbf{w}|\mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1})$, with the prior parameters $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_L]^T$, also called sparsity parameters, being inversely proportional to the width of the pdf. Naturally, a large value of $\alpha_l$ will drive the corresponding weight $w_l$ to zero, thus encouraging a solution with only a few nonzero coefficients.

This work was supported in part by an Erwin Schrödinger Postdoctoral Fellowship, FWF Project J2909-N23, in part by the Austrian Science Fund (FWF) under Award S10604-N13 within the national research network SISE, and in part by the U.S. Office of Naval Research under Grant N00014-09-1-0342, the U.S. Army Research Office under Grant W911NF-07-1-0185, and by the NSF Science and Technology Center Grant CCF-0939370.

In the relevance vector machine (RVM) approach to the SBL problem [2] the sparsity parameters $\boldsymbol{\alpha}$ are estimated by maximizing the marginal likelihood $p(\mathbf{t}|\boldsymbol{\alpha},\tau) = \int p(\mathbf{t}|\mathbf{w},\tau)\,p(\mathbf{w}|\boldsymbol{\alpha})\,d\mathbf{w}$, which is also termed model evidence [2,6]; the corresponding estimation approach is then referred to as the Evidence Procedure (EP) [2]. Unfortunately, the RVM solution is known to converge rather slowly and the computational complexity of the algorithm scales as $O(L^3)$ [2,7]; this makes the application of RVMs to large data sets impractical. In [7] an alternative learning scheme was proposed to alleviate this drawback. This scheme exploits the structure of the marginal likelihood function to accelerate the maximization via a sequential addition and deletion of candidate basis functions, thus allowing efficient implementations of SBL even for “very wide” matrices $\boldsymbol{\Phi}$.

An alternative approach to SBL is based on approximating the posterior $p(\mathbf{w},\tau,\boldsymbol{\alpha}|\mathbf{t})$ with a variational proxy pdf $q(\mathbf{w},\tau,\boldsymbol{\alpha}) = q(\mathbf{w})q(\tau)q(\boldsymbol{\alpha})$ [8] such as to minimize the variational free energy [9]. There are several advantages of the variational solution to SBL as compared to that proposed in [2] and [7]: first, the distributions rather than point estimates of the unobserved variables can be obtained. Second, the variational approach to SBL allows one to obtain analytical approximations to the distributions of interest even when exact inference of these distributions is intractable. Finally, the variational methodology provides a general tool for inference on graphical models that represent extensions of (1), e.g., different priors, parametric design matrices, etc. Unfortunately, the variational approach in [8] is equivalent to RVM in terms of estimation complexity and rate of convergence. Also, due to the nature of the variational approximation, it is no longer possible to exploit the structure of the marginal likelihood function to implement the learning more efficiently: the pdfs $q(\boldsymbol{\alpha})$ and $q(\tau)$ are estimated such as to approximate the true posterior pdfs, thus obscuring the structure of the marginal likelihood that was exploited in [7].

Nonetheless, it can be shown [10] that by computing the fixed points of the update expressions for the variational parameters of $q(\boldsymbol{\alpha})$ one can establish a dependency between the optimum of the variational free energy and a sparsity parameter of a single basis function $\boldsymbol{\varphi}_l$. Our goal in this paper is to extend these results by constructing a fast adaptive variational SBL scheme that can be used to implement adaptive SBL by allowing deletion and addition of new basis functions. We show that the criteria that guarantee the convergence of a sparsity parameter for a basis function, which we term the pruning condition, can be used either to prune or to add new basis functions. The computation of the pruning conditions requires knowing only the target vector $\mathbf{t}$ and the posterior covariance matrix of the weights $\mathbf{w}$. Moreover, the proposed adaptive scheme requires only $O(L^2)$ operations for adding or deleting a basis function. We also show that, when adding a new basis function to the model, the fixed points of V-SBL also maximize the marginal likelihood function. However, in contrast to the pruning conditions used in [7], our conditions have a simple interpretation that allows constructing a sparse representation that guarantees a certain desired quality of the estimated components in terms of their individual signal-to-noise ratios (SNRs).

978-1-4577-0539-7/11/\$26.00 ©2011 IEEE, ICASSP 2011

Throughout the paper we shall make use of the following notation. Vectors and matrices are represented as respectively boldface lowercase letters, e.g., $\mathbf{x}$, and boldface uppercase letters, e.g., $\mathbf{X}$. For vectors and matrices $(\cdot)^T$ denotes the transpose. The expression $[\mathbf{B}]_{\bar{l}\bar{k}}$ denotes a matrix obtained by deleting the $l$th row and $k$th column from the matrix $\mathbf{B}$; similarly, $[\mathbf{b}]_{\bar{l}}$ denotes a vector obtained by deleting the $l$th element from the vector $\mathbf{b}$. With a slight abuse of notation we will sometimes refer to a matrix as a set of column vectors; for instance we write $\mathbf{a} \in \mathbf{X}$ to imply that $\mathbf{a}$ is a column in $\mathbf{X}$, and $\mathbf{X} \setminus \mathbf{a}$ to denote a matrix obtained by deleting the column vector $\mathbf{a} \in \mathbf{X}$. We use $\mathbf{e}_l = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ to denote a canonical vector of appropriate dimension. Finally, for a random vector $\mathbf{x}$, $N(\mathbf{x}|\mathbf{a},\mathbf{B})$ denotes a multivariate Gaussian pdf with mean $\mathbf{a}$ and covariance matrix $\mathbf{B}$; similarly, for a random variable $x$, $\mathrm{Ga}(x|a,b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp(-bx)$ denotes a gamma pdf with parameters $a$ and $b$.

2. VARIATIONAL SPARSE BAYESIAN LEARNING

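For the purpose of further analysis let us assume that we have a dictionary $\mathcal{D}$ of some potential basis functions. $\mathcal{D}$ is assumed to consist of an active dictionary $\boldsymbol{\Phi}$ used in (1) and a passive dictionary $\boldsymbol{\Phi}_C$ such that $\mathcal{D} = \boldsymbol{\Phi} \cup \boldsymbol{\Phi}_C$ and $\boldsymbol{\Phi} \cap \boldsymbol{\Phi}_C = \emptyset$.

In SBL it is assumed that the joint pdf of all the variables factors as $p(\mathbf{w},\tau,\boldsymbol{\alpha},\mathbf{t}) = p(\mathbf{t}|\mathbf{w},\tau)p(\mathbf{w}|\boldsymbol{\alpha})p(\boldsymbol{\alpha})p(\tau)$ [2,4,6]. Under the Gaussian noise assumption, $p(\mathbf{t}|\mathbf{w},\tau) = N(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\tau^{-1}\mathbf{I})$. The sparsity prior $p(\mathbf{w}|\boldsymbol{\alpha})$ is assumed to factor as $p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{l=1}^{L} p(w_l|\alpha_l)$, where $p(w_l|\alpha_l) = N(w_l|0,\alpha_l^{-1})$. The choice of the prior $p(\tau)$ is arbitrary in the context of this work; a convenient choice would be a gamma distribution due to conjugacy properties, e.g., $p(\tau) = \mathrm{Ga}(\tau|c,d)$. The prior $p(\alpha_l)$, also called the hyperprior of the $l$th component, is selected as a gamma pdf $\mathrm{Ga}(\alpha_l|a_l,b_l)$. We will however consider an automatic relevance determination scenario, obtained when $a_l = b_l = 0$ for all components; this choice renders the hyperpriors non-informative [2,6].

The variational solution to SBL is obtained by finding an approximating pdf $q(\mathbf{w},\boldsymbol{\alpha},\tau) = q(\mathbf{w})q(\tau)\prod_{k=1}^{L} q(\alpha_k)$, where $q(\mathbf{w}) = N(\mathbf{w}|\hat{\mathbf{w}},\hat{\mathbf{S}})$, $q(\alpha_l) = \mathrm{Ga}(\alpha_l|\hat{a}_l,\hat{b}_l)$, and $q(\tau) = \mathrm{Ga}(\tau|\hat{c},\hat{d})$ are the variational approximating factors. With this choice of $q(\mathbf{w},\boldsymbol{\alpha},\tau)$ the variational parameters $\{\hat{\mathbf{w}},\hat{\mathbf{S}},\hat{c},\hat{d},\hat{a}_1,\hat{b}_1,\ldots,\hat{a}_L,\hat{b}_L\}$ can be found in closed form as follows [8]:

$$\hat{\mathbf{S}} = \left(\hat{\tau}\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathrm{diag}(\hat{\boldsymbol{\alpha}})\right)^{-1}, \qquad \hat{\mathbf{w}} = \hat{\tau}\hat{\mathbf{S}}\boldsymbol{\Phi}^T\mathbf{t}, \tag{2}$$

$$\hat{a}_l = a_l + \tfrac{1}{2}, \qquad \hat{b}_l = b_l + \tfrac{1}{2}\left(|\hat{w}_l|^2 + \hat{S}_{ll}\right), \tag{3}$$

$$\hat{c} = c + \frac{N}{2}, \quad \text{and} \quad \hat{d} = d + \frac{\|\mathbf{t} - \boldsymbol{\Phi}\hat{\mathbf{w}}\|^2 + \mathrm{Trace}(\hat{\mathbf{S}}\boldsymbol{\Phi}^T\boldsymbol{\Phi})}{2}, \tag{4}$$

where $\hat{\tau} = E_{q(\tau)}\{\tau\} = \hat{c}/\hat{d}$, $\hat{\alpha}_l = E_{q(\alpha_l)}\{\alpha_l\} = \hat{a}_l/\hat{b}_l$, $\hat{w}_l$ is the $l$th element of the vector $\hat{\mathbf{w}}$, and $\hat{S}_{ll}$ is the $l$th element on the main diagonal of the matrix $\hat{\mathbf{S}}$.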
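The updates (2)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `vsbl_sweep`, the update order, and the toy data are assumptions made here, and the direct matrix inverse costs $O(L^3)$ per sweep.

```python
import numpy as np

def vsbl_sweep(Phi, t, alpha_hat, tau_hat, a=0.0, b=0.0, c=0.0, d=0.0):
    """One pass over the closed-form updates (2)-(4); ARD corresponds to a = b = 0."""
    N = Phi.shape[0]
    # (2): Gaussian factor q(w) = N(w | w_hat, S_hat)
    S_hat = np.linalg.inv(tau_hat * Phi.T @ Phi + np.diag(alpha_hat))
    w_hat = tau_hat * S_hat @ Phi.T @ t
    # (3): gamma factors q(alpha_l); alpha_hat_l is the mean a_hat_l / b_hat_l
    a_hat = a + 0.5
    b_hat = b + 0.5 * (w_hat**2 + np.diag(S_hat))
    alpha_hat = a_hat / b_hat
    # (4): gamma factor q(tau); tau_hat is the mean c_hat / d_hat
    c_hat = c + 0.5 * N
    residual = t - Phi @ w_hat
    d_hat = d + 0.5 * (residual @ residual + np.trace(S_hat @ Phi.T @ Phi))
    return w_hat, S_hat, alpha_hat, c_hat / d_hat

# Toy usage (hypothetical data): two relevant columns out of five.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 5))
t = Phi @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(50)
alpha_hat, tau_hat = np.ones(5), 1.0
for _ in range(20):
    w_hat, S_hat, alpha_hat, tau_hat = vsbl_sweep(Phi, t, alpha_hat, tau_hat)
```

With $a_l = b_l = 0$ the returned sparsity parameters satisfy $\hat{\alpha}_l = 1/(\hat{w}_l^2 + \hat{S}_{ll})$, which is the quantity whose fixed points are studied in the next subsection.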

2.1. Adaptive fast variational SBL

Although expressions (2)-(4) reduce to those obtained in [2] when the approximating factors $q(\tau)$ and $q(\alpha_l)$ are chosen as Dirac measures on the corresponding domains¹, they do not reveal the structure of the marginal likelihood function that leads to an efficient SBL algorithm in [7]. Nonetheless, an analysis similar to [7] can be performed by computing the fixed points of the update expression for the variational parameters of a single factor $q(\alpha_l)$ [10]. In [10] we have shown that, for a given basis function $\boldsymbol{\varphi}_l \in \boldsymbol{\Phi}$, the sequence of estimates $\{\hat{\alpha}_l^{[m]}\}_{m=1}^{M}$, obtained by successively updating pdfs $q(\mathbf{w})$ and $q(\alpha_l)$, converges to the following fixed point $\hat{\alpha}_l^{[\infty]}$ as $M \to \infty$:²

$$\hat{\alpha}_l^{[\infty]} = \begin{cases} (\omega_l^2 - \varsigma_l)^{-1}, & \omega_l^2 > \varsigma_l \\ \infty, & \omega_l^2 \le \varsigma_l \end{cases}, \tag{5}$$

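where $\varsigma_l$ and $\omega_l$ are the pruning parameters defined as

$$\varsigma_l = \left(\hat{\tau}\boldsymbol{\varphi}_l^T\boldsymbol{\varphi}_l - \hat{\tau}^2\boldsymbol{\varphi}_l^T\boldsymbol{\Phi}_{\bar{l}}\hat{\mathbf{S}}_{\bar{l}}\boldsymbol{\Phi}_{\bar{l}}^T\boldsymbol{\varphi}_l\right)^{-1}, \tag{6}$$

$$\omega_l^2 = \left(\hat{\tau}\varsigma_l\boldsymbol{\varphi}_l^T\mathbf{t} - \hat{\tau}^2\varsigma_l\boldsymbol{\varphi}_l^T\boldsymbol{\Phi}_{\bar{l}}\hat{\mathbf{S}}_{\bar{l}}\boldsymbol{\Phi}_{\bar{l}}^T\mathbf{t}\right)^2. \tag{7}$$

The parameters $\varsigma_l$ and $\omega_l^2$ depend on $\boldsymbol{\Phi}_{\bar{l}} = \boldsymbol{\Phi} \setminus \boldsymbol{\varphi}_l$, $\hat{\boldsymbol{\alpha}}_{\bar{l}} = [\hat{\boldsymbol{\alpha}}]_{\bar{l}}$, and

$$\hat{\mathbf{S}}_{\bar{l}} = \left(\hat{\tau}\boldsymbol{\Phi}_{\bar{l}}^T\boldsymbol{\Phi}_{\bar{l}} + \mathrm{diag}(\hat{\boldsymbol{\alpha}}_{\bar{l}})\right)^{-1} = \left[\hat{\mathbf{S}} - \frac{\hat{\mathbf{S}}\mathbf{e}_l\mathbf{e}_l^T\hat{\mathbf{S}}}{\mathbf{e}_l^T\hat{\mathbf{S}}\mathbf{e}_l}\right]_{\bar{l}\bar{l}}, \tag{8}$$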

where (8) is the posterior covariance matrix of the weights obtained when the basis function $\boldsymbol{\varphi}_l$ is removed from $\boldsymbol{\Phi}$. The result (5) provides a simple criterion for pruning a basis function from the active dictionary: a finite value of $\hat{\alpha}_l^{[\infty]}$ instructs us to keep the $l$th component in the model since it should minimize the free energy, while the infinite value of $\hat{\alpha}_l^{[\infty]}$ indicates that the basis function $\boldsymbol{\varphi}_l$ is superfluous. It can be shown [10] that the test parameters $\omega_l^2$ and $\varsigma_l$ are the squared weight of the basis function $\boldsymbol{\varphi}_l$ and the weight’s variance computed when $\hat{\alpha}_l$ equals zero – a fact that will become more evident later when we consider an inclusion of a basis function in the active dictionary. The pruning test is applied sequentially to all the basis functions in the active dictionary $\boldsymbol{\Phi}$ to determine which components should be pruned. Algorithm 1 summarizes the key steps of this procedure. For further details we refer the reader to [10].
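As a numerical illustration of (5)-(8), the sketch below evaluates the pruning parameters for one column and the resulting fixed point. The helper names (`covariance_without`, `pruning_parameters`, `alpha_fixed_point`) are hypothetical and chosen here for readability; they are not taken from [10].

```python
import numpy as np

def covariance_without(S_hat, l):
    """Right-hand side of (8): covariance of the remaining weights once column l is removed."""
    s_l = S_hat[:, l]
    downdated = S_hat - np.outer(s_l, s_l) / S_hat[l, l]
    return np.delete(np.delete(downdated, l, axis=0), l, axis=1)

def pruning_parameters(Phi, t, S_hat, tau_hat, l):
    """Return (varsigma_l, omega_l^2) from (6) and (7)."""
    phi = Phi[:, l]
    Phi_bar = np.delete(Phi, l, axis=1)
    S_bar = covariance_without(S_hat, l)
    u = Phi_bar @ (S_bar @ (Phi_bar.T @ phi))   # Phi_bar S_bar Phi_bar^T phi_l
    varsigma = 1.0 / (tau_hat * (phi @ phi) - tau_hat**2 * (phi @ u))
    omega2 = (tau_hat * varsigma * (phi @ t) - tau_hat**2 * varsigma * (u @ t)) ** 2
    return varsigma, omega2

def alpha_fixed_point(varsigma, omega2):
    """Fixed point (5): finite if the component is kept, infinite if it is pruned."""
    return 1.0 / (omega2 - varsigma) if omega2 > varsigma else np.inf
```

A simple consistency check is to set $\hat{\alpha}_l$ to the returned finite value, recompute $\hat{\mathbf{S}}$ and $\hat{\mathbf{w}}$ from (2), and confirm that $1/(\hat{w}_l^2 + \hat{S}_{ll})$ reproduces the same value, i.e., that the update of $q(\alpha_l)$ leaves it unchanged.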

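Algorithm 1 A test for a deletion of a basis function $\boldsymbol{\varphi}_l \in \boldsymbol{\Phi}$

function TestComponentPrune($l$)
    Compute $\varsigma_l$ from (6) and $\omega_l^2$ from (7)
    if $\omega_l^2 > \varsigma_l$ then
        $\Delta\hat{\alpha}_l \leftarrow ((\omega_l^2 - \varsigma_l)^{-1} - \hat{\alpha}_l)$;
        $\hat{\mathbf{S}} \leftarrow \hat{\mathbf{S}} - \dfrac{\hat{\mathbf{S}}\mathbf{e}_l\mathbf{e}_l^T\hat{\mathbf{S}}}{\Delta\hat{\alpha}_l^{-1} + \mathbf{e}_l^T\hat{\mathbf{S}}\mathbf{e}_l}$, $\hat{\alpha}_l \leftarrow (\omega_l^2 - \varsigma_l)^{-1}$
    else
        $\hat{\mathbf{S}} \leftarrow \hat{\mathbf{S}}_{\bar{l}}$, $\hat{\boldsymbol{\alpha}} \leftarrow [\hat{\boldsymbol{\alpha}}]_{\bar{l}}$, $\boldsymbol{\Phi}_C \leftarrow [\boldsymbol{\Phi}_C, \boldsymbol{\varphi}_l]$, $\boldsymbol{\Phi} \leftarrow \boldsymbol{\Phi} \setminus \boldsymbol{\varphi}_l$
    end if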
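The steps of Algorithm 1 can be sketched in NumPy as follows. This is one possible reading of the pseudocode, not the authors' code: the function name `try_prune` and its return convention are assumptions, and the passive dictionary is left to the caller.

```python
import numpy as np

def try_prune(Phi, t, S_hat, alpha_hat, tau_hat, l):
    """Sketch of Algorithm 1; returns (Phi, S_hat, alpha_hat, kept)."""
    phi = Phi[:, l]
    Phi_bar = np.delete(Phi, l, axis=1)
    s_l = S_hat[:, l]
    # (8): covariance of the remaining weights with phi_l removed
    S_bar = np.delete(np.delete(S_hat - np.outer(s_l, s_l) / S_hat[l, l], l, axis=0), l, axis=1)
    u = Phi_bar @ (S_bar @ (Phi_bar.T @ phi))
    varsigma = 1.0 / (tau_hat * (phi @ phi) - tau_hat**2 * (phi @ u))                  # (6)
    omega2 = (tau_hat * varsigma * (phi @ t) - tau_hat**2 * varsigma * (u @ t)) ** 2   # (7)
    if omega2 > varsigma:
        alpha_new = 1.0 / (omega2 - varsigma)          # fixed point (5)
        delta = alpha_new - alpha_hat[l]
        if delta != 0.0:
            # rank-one update of S_hat for the change of a single diagonal entry
            S_hat = S_hat - np.outer(s_l, s_l) / (1.0 / delta + S_hat[l, l])
        alpha_hat = alpha_hat.copy()
        alpha_hat[l] = alpha_new
        return Phi, S_hat, alpha_hat, True
    # prune: the caller moves phi_l to the passive dictionary
    return Phi_bar, S_bar, np.delete(alpha_hat, l), False
```

In both branches the returned covariance should agree with a direct evaluation of (2) for the returned dictionary and sparsity parameters, which gives a convenient check of the $O(L^2)$ updates.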

It is natural to ask whether this scheme can be made fully adaptive by also allowing inclusion of new basis functions from the passive dictionary $\boldsymbol{\Phi}_C$. Let us assume that at some iteration of the algorithm we have $L$ basis functions in the active dictionary $\boldsymbol{\Phi}$ and an estimate of $\hat{\mathbf{S}}$ and $\hat{\boldsymbol{\alpha}}$. Our goal is to test whether a basis function $\boldsymbol{\varphi}_{L+1} \in \boldsymbol{\Phi}_C$ should be included in $\boldsymbol{\Phi}$.

¹ In [2] the posterior pdf of the weights $\mathbf{w}$ is Gaussian; its parameters coincide with the variational parameters of $q(\mathbf{w})$ in (2).

² Notice that since $a_l = b_l = 0$, the parameters of $q(\alpha_l)$ in (3) can be specified as $\hat{a}_l = 1/2$ and $\hat{b}_l = 1/(2\hat{\alpha}_l)$, where $\hat{\alpha}_l = 1/(\hat{w}_l^2 + \hat{S}_{ll})$. Thus, it makes sense to study the fixed point of the variational update expressions in terms of $\hat{\alpha}_l$, rather than in terms of $\hat{a}_l$ and $\hat{b}_l$.

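Assume for the moment that $\boldsymbol{\varphi}_{L+1}$ is in the active dictionary and that $\boldsymbol{\Phi}_{L+1} = [\boldsymbol{\Phi}, \boldsymbol{\varphi}_{L+1}]$, $\hat{\boldsymbol{\alpha}}_{L+1} = [\hat{\boldsymbol{\alpha}}^T, \hat{\alpha}_{L+1}]^T$ and $\hat{\mathbf{S}}_{L+1} = (\hat{\tau}\boldsymbol{\Phi}_{L+1}^T\boldsymbol{\Phi}_{L+1} + \mathrm{diag}(\hat{\boldsymbol{\alpha}}_{L+1}))^{-1}$ are available. Then we can compute $\hat{\alpha}_{L+1}^{[\infty]}$ from (5) and determine whether the basis function $\boldsymbol{\varphi}_{L+1}$ should be kept in the model. This can be done efficiently using only the current active dictionary $\boldsymbol{\Phi}$ and the corresponding covariance matrix $\hat{\mathbf{S}}$. First, consider a matrix $\mathbf{S}_{L+1}$ obtained from $\hat{\mathbf{S}}_{L+1}$ by setting $\hat{\alpha}_{L+1} = 0$:

$$\mathbf{S}_{L+1} = \begin{bmatrix} \hat{\tau}\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathrm{diag}(\hat{\boldsymbol{\alpha}}) & \hat{\tau}\boldsymbol{\Phi}^T\boldsymbol{\varphi}_{L+1} \\ \hat{\tau}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi} & \hat{\tau}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\varphi}_{L+1} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{X}_{L+1}^{-1} & -\hat{\tau}\hat{\mathbf{S}}\boldsymbol{\Phi}^T\boldsymbol{\varphi}_{L+1}y_{L+1}^{-1} \\ -\hat{\tau}y_{L+1}^{-1}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi}\hat{\mathbf{S}} & y_{L+1}^{-1} \end{bmatrix}, \tag{9}$$

where $\mathbf{X}_{L+1} = \hat{\mathbf{S}}^{-1} - \hat{\tau}\boldsymbol{\Phi}^T\boldsymbol{\varphi}_{L+1}(\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\varphi}_{L+1})^{-1}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi}$ and

$$y_{L+1} = \hat{\tau}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\varphi}_{L+1} - \hat{\tau}^2\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi}\hat{\mathbf{S}}\boldsymbol{\Phi}^T\boldsymbol{\varphi}_{L+1}. \tag{10}$$

By comparing (6) and (10) we immediately notice that $\varsigma_{L+1} = y_{L+1}^{-1}$, which is the variance of the weight for $\boldsymbol{\varphi}_{L+1}$ when $\hat{\alpha}_{L+1} = 0$. Thus, $\omega_{L+1}^2$ can be computed from $\boldsymbol{\Phi}_{L+1}$ and $\mathbf{S}_{L+1}$ as

$$\omega_{L+1}^2 = \mathbf{e}_{L+1}^T(\hat{\tau}\mathbf{S}_{L+1}\boldsymbol{\Phi}_{L+1}^T\mathbf{t})(\hat{\tau}\mathbf{S}_{L+1}\boldsymbol{\Phi}_{L+1}^T\mathbf{t})^T\mathbf{e}_{L+1}. \tag{11}$$

By substituting (9) into (11) and simplifying the resulting expression we finally obtain

$$\omega_{L+1}^2 = \left(\hat{\tau}\varsigma_{L+1}\boldsymbol{\varphi}_{L+1}^T\mathbf{t} - \hat{\tau}^2\varsigma_{L+1}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi}\hat{\mathbf{S}}\boldsymbol{\Phi}^T\mathbf{t}\right)^2, \tag{12}$$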

which is identical to (7) with the exception that (12) uses a new basis function $\boldsymbol{\varphi}_{L+1}$, the current design matrix $\boldsymbol{\Phi}$ and the weight covariance matrix $\hat{\mathbf{S}}$. Once the test parameters $\omega_{L+1}^2$ and $\varsigma_{L+1}$ are computed, we can test if the basis function $\boldsymbol{\varphi}_{L+1}$ should be included in the active dictionary $\boldsymbol{\Phi}$. It should be mentioned that in [7] the authors compute parameters $q_{L+1}$ and $s_{L+1}$ that maximize the marginal likelihood function (see Eq. (19) in [7]). In fact, it is easy to notice that the expressions for $\omega_{L+1}^2$ in (12) and $\varsigma_{L+1} = y_{L+1}^{-1}$ in (10) coincide respectively with $q_{L+1}^2/s_{L+1}^2$ and $s_{L+1}^{-1}$. Consequently, the corresponding pruning test and the value of the sparsity parameter coincide. In the case of pruning an existing basis function the relationship is not that straightforward; nonetheless, the simulation results indicate that both adaptive schemes achieve almost identical performance in terms of the mean squared error and the sparsity of the estimated models.

When the new basis function is accepted, the weight covariance matrix has to be updated as well. Luckily this can be done quite simply as follows:

$$\hat{\mathbf{S}}_{L+1} = \begin{pmatrix} \breve{\mathbf{X}}_{L+1}^{-1} & -\hat{\tau}\dfrac{\hat{\mathbf{S}}\boldsymbol{\Phi}^T\boldsymbol{\varphi}_{L+1}}{\hat{\alpha}_{L+1}^{[\infty]} + y_{L+1}} \\ -\hat{\tau}\dfrac{\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi}\hat{\mathbf{S}}}{\hat{\alpha}_{L+1}^{[\infty]} + y_{L+1}} & (\hat{\alpha}_{L+1}^{[\infty]} + y_{L+1})^{-1} \end{pmatrix}, \tag{13}$$

where $\breve{\mathbf{X}}_{L+1} = \hat{\mathbf{S}}^{-1} - \dfrac{\hat{\tau}^2\boldsymbol{\Phi}^T\boldsymbol{\varphi}_{L+1}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\Phi}}{\hat{\alpha}_{L+1}^{[\infty]} + \hat{\tau}\boldsymbol{\varphi}_{L+1}^T\boldsymbol{\varphi}_{L+1}}$. The inverse of the Schur complement $\breve{\mathbf{X}}_{L+1}$ can be computed efficiently using a rank-one update [11].
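The coincidence with the quantities of [7] noted above can be made explicit with a short calculation. The following worked step is a sketch that takes the stated identifications $\omega_{L+1}^2 = q_{L+1}^2/s_{L+1}^2$ and $\varsigma_{L+1} = s_{L+1}^{-1}$ as given and assumes $s_{L+1} > 0$:

```latex
\hat{\alpha}^{[\infty]}_{L+1}
  = \frac{1}{\omega_{L+1}^2 - \varsigma_{L+1}}
  = \frac{1}{q_{L+1}^2/s_{L+1}^2 - 1/s_{L+1}}
  = \frac{s_{L+1}^2}{q_{L+1}^2 - s_{L+1}},
\qquad
\omega_{L+1}^2 > \varsigma_{L+1}
  \;\Longleftrightarrow\;
  q_{L+1}^2 > s_{L+1}.
```

Under these assumptions both the inclusion test and the resulting value of the sparsity parameter are expressed purely in terms of $q_{L+1}$ and $s_{L+1}$, which is the sense in which the two schemes agree when a basis function is added.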

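In Algorithm 2 we summarize the main steps of this procedure. Observe that the conditions for adding or deleting a basis function depend exclusively on the measurement $\mathbf{t}$ and the matrix $\hat{\mathbf{S}}$ that essentially determines how well a basis function “aligns” or correlates with the other basis functions in the active dictionary.

Algorithm 2 A test for an addition of a basis function $\boldsymbol{\varphi}_{L+1} \in \boldsymbol{\Phi}_C$

function TestComponentAdd($\boldsymbol{\varphi}_{L+1}$)
    Compute $\varsigma_{L+1} = y_{L+1}^{-1}$ from (10) and $\omega_{L+1}^2$ from (12)
    if $\omega_{L+1}^2 > \varsigma_{L+1}$ then
        $\hat{\alpha}_{L+1}^{[\infty]} \leftarrow (\omega_{L+1}^2 - \varsigma_{L+1})^{-1}$; $\hat{\boldsymbol{\alpha}} \leftarrow [\hat{\boldsymbol{\alpha}}^T, \hat{\alpha}_{L+1}^{[\infty]}]^T$
        $\boldsymbol{\Phi} \leftarrow [\boldsymbol{\Phi}, \boldsymbol{\varphi}_{L+1}]$; $\boldsymbol{\Phi}_C \leftarrow \boldsymbol{\Phi}_C \setminus \boldsymbol{\varphi}_{L+1}$; update $\hat{\mathbf{S}}$ from (13)
    else
        Reject $\boldsymbol{\varphi}_{L+1}$
    end if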
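Algorithm 2 can be sketched in the same style. The function name `try_add` and the return convention are assumptions made for illustration. The upper-left block of (13) is formed here as $\hat{\mathbf{S}}$ plus a rank-one term, which follows from the standard block-inverse identity and avoids inverting the Schur complement explicitly.

```python
import numpy as np

def try_add(Phi, t, S_hat, alpha_hat, tau_hat, phi_new):
    """Sketch of Algorithm 2; returns (Phi, S_hat, alpha_hat, accepted)."""
    b = Phi.T @ phi_new
    v = S_hat @ b                                              # S_hat Phi^T phi_{L+1}
    y = tau_hat * (phi_new @ phi_new) - tau_hat**2 * (b @ v)   # (10)
    varsigma = 1.0 / y
    omega2 = (tau_hat * varsigma * (phi_new @ t)
              - tau_hat**2 * varsigma * (v @ (Phi.T @ t))) ** 2   # (12)
    if omega2 <= varsigma:
        return Phi, S_hat, alpha_hat, False                    # reject phi_{L+1}
    alpha_new = 1.0 / (omega2 - varsigma)                      # fixed point (5)
    g = 1.0 / (alpha_new + y)                                  # lower-right entry of (13)
    col = -tau_hat * g * v                                     # off-diagonal block of (13)
    top_left = S_hat + tau_hat**2 * g * np.outer(v, v)         # rank-one form of the upper-left block
    S_new = np.block([[top_left, col[:, None]],
                      [col[None, :], np.array([[g]])]])
    return np.column_stack([Phi, phi_new]), S_new, np.append(alpha_hat, alpha_new), True
```

When the candidate is accepted, the returned covariance can be compared against a direct inverse of $\hat{\tau}\boldsymbol{\Phi}_{L+1}^T\boldsymbol{\Phi}_{L+1} + \mathrm{diag}(\hat{\boldsymbol{\alpha}}_{L+1})$ to confirm the block update.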

Notice that the ratio $\omega_l^2/\varsigma_l$ can be interpreted as an estimate of the component signal-to-noise ratio³ $\mathrm{SNR}_l = \omega_l^2/\varsigma_l$ [10]. Thus, SBL prunes a component if its estimated SNR is below 0 dB. This interpretation allows a simple adjustment of the pruning condition as follows:

$$\omega_l^2 > \varsigma_l \times \mathrm{SNR}', \tag{14}$$

where $\mathrm{SNR}'$ is the adjustment SNR. This modified pruning condition (14) can be used both when adding as well as when pruning components. Such adjustment might be of a practical interest in scenarios for which the true SNR is known and the goal is to delete spurious components introduced by SBL due to the “imperfection” of the Gaussian sparsity prior or when we wish a sparse estimator that guarantees a certain quality of the estimated components in terms of their SNRs.
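The adjusted test (14) amounts to a one-line comparison. The sketch below assumes that the adjustment SNR is supplied in decibels and converted to a linear ratio before use; that convention, like the function names, is a choice made here for illustration.

```python
import numpy as np

def component_snr_db(omega2, varsigma):
    """Estimated component SNR omega^2 / varsigma, expressed in dB."""
    return 10.0 * np.log10(omega2 / varsigma)

def keep_component(omega2, varsigma, snr_adjust_db=0.0):
    """Condition (14); snr_adjust_db = 0 recovers the unadjusted test omega^2 > varsigma."""
    snr_adjust = 10.0 ** (snr_adjust_db / 10.0)   # dB to linear ratio (assumed convention)
    return omega2 > varsigma * snr_adjust
```

For example, a component with $\omega_l^2 = 4$ and $\varsigma_l = 1$ (about 6 dB) passes the unadjusted test but is rejected once the adjustment is raised to 10 dB.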

2.2. Implementation aspects

Variational inference typically requires choosing initial values for the variational parameters of $q(\mathbf{w},\boldsymbol{\alpha},\tau)$. Obviously, the adaptive ability of the algorithm can be exploited to recover the initial factorization by assuming $\boldsymbol{\Phi} = \emptyset$ and $\boldsymbol{\Phi}_C = \mathcal{D}$ and selecting the initial value of the noise precision $\hat{\tau}$. The algorithm sequentially adds components using Alg. 2 and prunes irrelevant ones using Alg. 1. The pdfs $q(\mathbf{w})$ and $q(\tau)$ can be re-computed from (2) and (4) at any stage of the algorithm. It is important to mention that the order in which basis functions are added from the passive dictionary influences the final sparsity of the algorithm, which, as has been pointed out in [7], is related to the greediness of the fast SBL method. In our implementation of the fast adaptive V-SBL algorithm we rank all the components in $\boldsymbol{\Phi}_C$ by pre-computing their sparsity parameters $\hat{\alpha}_l$ and begin the inclusion with those basis functions that have the smallest value of $\hat{\alpha}_l$, i.e., those functions that are best aligned with the measurement $\mathbf{t}$. Also notice that updating $q(\tau)$ requires re-computing $\hat{\mathbf{S}}$, which requires $O(L^3)$ operations.
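The overall procedure can be sketched as a loop that starts from an empty active dictionary and alternately tests every dictionary column. The sketch below is a simplification under several assumptions that are not part of the paper: the noise precision is taken as known, $\hat{\mathbf{S}}$ is recomputed by direct inversion instead of the $O(L^2)$ updates, addition and pruning share one test (justified by (12) having the same form as (7)), and the adjustment factor satisfies $\mathrm{SNR}' \ge 1$. The names `pruning_params` and `fast_vsbl` are hypothetical.

```python
import numpy as np

def pruning_params(phi, Phi, alpha, t, tau):
    """(varsigma, omega^2) of column phi given the other active columns Phi, cf. (10) and (12)."""
    if Phi.shape[1] == 0:
        S = np.zeros((0, 0))                                  # empty active dictionary
    else:
        S = np.linalg.inv(tau * Phi.T @ Phi + np.diag(alpha))  # (2), direct O(L^3) form
    b = Phi.T @ phi
    v = S @ b
    varsigma = 1.0 / (tau * (phi @ phi) - tau**2 * (b @ v))
    omega = tau * varsigma * (phi @ t) - tau**2 * varsigma * (v @ (Phi.T @ t))
    return varsigma, omega**2

def fast_vsbl(D, t, tau, snr_adjust=1.0, max_sweeps=50, tol=1e-4):
    """Add/prune sweeps over the dictionary D; returns (sorted active indices, w_hat of length K)."""
    K = D.shape[1]
    empty = D[:, :0]

    def initial_alpha(k):
        vs, om2 = pruning_params(D[:, k], empty, np.zeros(0), t, tau)
        return 1.0 / (om2 - vs) if om2 > vs else np.inf

    order = sorted(range(K), key=initial_alpha)   # smallest alpha first
    model = {}                                    # column index -> alpha_hat
    for _ in range(max_sweeps):
        previous = dict(model)
        for k in order:
            others = [j for j in model if j != k]
            Phi = D[:, np.array(others, dtype=int)]
            alpha = np.array([model[j] for j in others], dtype=float)
            vs, om2 = pruning_params(D[:, k], Phi, alpha, t, tau)
            if om2 > vs * snr_adjust:
                model[k] = 1.0 / (om2 - vs)       # add, or move alpha_k to the fixed point (5)
            else:
                model.pop(k, None)                # reject (Alg. 2) or prune (Alg. 1)
        change = sum((model[j] - previous[j]) ** 2 for j in model if j in previous) ** 0.5
        if model.keys() == previous.keys() and change < tol:
            break
    active = sorted(model)
    w_hat = np.zeros(K)
    if active:
        Phi = D[:, active]
        alpha = np.array([model[j] for j in active])
        S = np.linalg.inv(tau * Phi.T @ Phi + np.diag(alpha))
        w_hat[active] = tau * S @ Phi.T @ t       # posterior mean from (2)
    return active, w_hat
```

A practical implementation would replace the direct inverses with the rank-one updates of Algorithms 1 and 2 and would re-estimate $q(\tau)$ from (4) when the noise precision is unknown.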

3. SIMULATION RESULTS

In this section we compare the performance results of the fast adaptive V-SBL with the standard RVM algorithm [2], the fast marginal likelihood maximization method [7], and fast adaptive variational SBL with an SNR-adjusted threshold, via simulations. The standard RVM scheme is non-adaptive; we thus assume that all the available basis functions are included in the active dictionary. Note that the standard RVM algorithm requires one to specify a threshold for the hyperparameters $\hat{\alpha}_l$, $\forall l$, to “detect” the divergence of the hyperparameter values. Obviously, this affects the performance of the standard RVM algorithm. In order to simplify the analysis of the simulation results we assumed the variance $\hat{\tau}^{-1}$ of the noise to be known in all simulations. For all compared algorithms the same convergence criterion is used: an algorithm stops when (i) the number of basis functions between two consecutive iterations has stabilized and when (ii) the $\ell_2$-norm of the difference between the values of hyperparameters at two consecutive iterations is less than $10^{-4}$.

³ To gain some intuition into why this is so consider (9) and (11) when $L = 0$ and $\boldsymbol{\Phi} = \emptyset$.

To test the algorithms we use basis functions $\boldsymbol{\varphi}_k \in \mathbb{R}^N$, $k = 1, \ldots, K$, generated by drawing samples from a multivariate Gaussian distribution with zero mean and covariance matrix $\mathbf{I}$, and a sparse vector $\mathbf{w}$ with $L = 10$ nonzero elements equal to 1 at random locations. With this setting we aim to test how the algorithm’s pruning mechanism performs when the exact sparsity of the model is known. We set $N = 100$ and $K = 200$. The target vector $\mathbf{t}$ is generated according to (1). The performance of the tested algorithms, averaged over 100 independent runs, is summarized in Fig. 1. For adaptive algorithms each iteration corresponds to two steps: 1) adding components from the passive dictionary and 2) removing components from the active dictionary. As we see in Fig. 1(a) and 1(b) the variational SBL with adjusted pruning condition (14) outperforms the other estimation methods in terms of normalized mean-square error (NMSE) as well as in terms of the number of estimated components. In fact, it is able to recover the true model sparsity for SNR > 10 dB. The standard RVM recovers the true model sparsity only for SNR > 40 dB with a pruning threshold $10^4$; increasing this threshold leads to an over-estimation of the true sparsity. In Fig. 1(c) we plot the convergence rate of the algorithms for SNR = 30 dB. Here the variational SBL with SNR-adjusted pruning is also a clear winner, reaching the stopping criterion in fewer than 10 iterations. Note, however, that both the fast variational SBL algorithm without the SNR-adjusted pruning and the fast marginal likelihood maximization algorithm exhibit very fast convergence; nonetheless they tend to overestimate the true model sparsity, as seen from Fig. 1(b).

[Figure 1: three panels. (a) NMSE (dB) versus SNR (dB); (b) number of basis functions versus SNR (dB); (c) number of basis functions versus number of iterations. Curves: Standard RVM, threshold $10^4$; Standard RVM, threshold $10^{14}$; Fast marginal likelihood; Fast adaptive SBL; Fast adaptive SBL (SNR-adjusted).]

Fig. 1. SBL performance. (a) Normalized mean-square error (NMSE) versus the signal to noise ratio (SNR); (b) the estimated number of components versus the SNR; (c) the estimated number of components versus the number of iterations for SNR = 30 dB.
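The synthetic setup described in this section can be reproduced along the following lines. The text does not spell out how the SNR or the NMSE are computed, so the sketch assumes SNR $= \|\boldsymbol{\Phi}\mathbf{w}\|^2/\|\boldsymbol{\xi}\|^2$ (enforced exactly by rescaling the noise draw) and an NMSE measured on the weight vector. Both definitions, and the names `make_problem` and `nmse_db`, are assumptions.

```python
import numpy as np

def make_problem(N=100, K=200, n_nonzero=10, snr_db=30.0, seed=0):
    """Draw a Gaussian dictionary, a sparse weight vector and noisy targets according to (1)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((N, K))               # columns are the basis functions
    w = np.zeros(K)
    w[rng.choice(K, size=n_nonzero, replace=False)] = 1.0
    signal = D @ w
    noise = rng.standard_normal(N)
    # Assumed SNR definition: ||signal||^2 / ||noise||^2, met exactly by rescaling the noise
    noise *= np.linalg.norm(signal) / np.linalg.norm(noise) * 10.0 ** (-snr_db / 20.0)
    tau = N / (noise @ noise)                     # known noise precision, as in the simulations
    return D, w, signal + noise, tau

def nmse_db(w_true, w_est):
    """Normalized mean-square error of the weight estimate in dB (assumed definition)."""
    return 10.0 * np.log10(np.sum((w_est - w_true) ** 2) / np.sum(w_true ** 2))
```

Averaging `nmse_db` and the number of retained columns over independent seeds and a grid of `snr_db` values gives curves of the kind shown in Fig. 1(a) and 1(b).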

4. CONCLUSION

In this work a fast adaptive variational Sparse Bayesian Learning (V-SBL) framework has been considered. The fast V-SBL algorithm optimizes a variational free energy with respect to variational parameters of the pdf of a single component. The fixed points of sparsity parameter update expressions as well as conditions that guarantee convergence to these fixed points – pruning conditions – have been obtained in a closed form. This significantly improves the convergence rate of V-SBL. The pruning conditions also reveal the relationship between the performance of SBL in terms of the number of estimated components and a measure of SNR. This relationship enables an empirical adjustment that allows inclusion of the components that guarantee a predefined quality in terms of their individual SNRs. Setting the adjustment parameter to the true SNR allows extraction of the true sparsity in simulated scenarios. Simulation studies demonstrate that this adjustment further accelerates the convergence rate of the algorithm.

5. REFERENCES

[1] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Scientific Computing, vol. 20, pp. 33–61, 1998.

[2] M. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Machine Learning Res., vol. 1, pp. 211–244, June 2001.

[3] M. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008.

[4] D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, “The variational approximation for Bayesian inference,” IEEE Signal Process. Mag., vol. 25, no. 6, pp. 131–146, Nov. 2008.

[5] W. Bajwa, J. Haupt, A. Sayeed, and R. Nowak, “Compressed channel sensing: A new approach to estimating sparse multipath channels,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1058–1076, Jun. 2010.

[6] D. Wipf and B. Rao, “Sparse Bayesian learning for basis selection,” IEEE Trans. on Signal Process., vol. 52, no. 8, pp. 2153–2164, Aug. 2004.

[7] M. E. Tipping and A. C. Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” in Proc. 9th Int. Workshop Artif. Intelligence and Stat., Key West, FL, Jan. 2003.

[8] C. M. Bishop and M. E. Tipping, “Variational relevance vector machines,” in Proc. 16th Conf. Uncer. in Artif. Intell. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 46–53.

[9] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.

[10] D. Shutin, T. Buchgraber, S. R. Kulkarni, and H. V. Poor, “Fast variational sparse Bayesian learning with automatic relevance determination,” submitted to IEEE Trans. on Signal Process.

[11] W. W. Hager, “Updating the inverse of a matrix,” SIAM Review, vol. 31, no. 2, pp. 221–239, 1989.
