
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 10, OCTOBER 2010

Independent Component Analysis by Entropy Bound Minimization

Xi-Lin Li and Tülay Adalı, Fellow, IEEE

Abstract—A novel (differential) entropy estimator is introduced where the maximum entropy bound is used to approximate the entropy given the observations, and is computed using a numerical procedure, thus resulting in accurate estimates for the entropy. We show that such an estimator exists for a wide class of measuring functions, and provide a number of design examples to demonstrate its flexible nature. We then derive a novel independent component analysis (ICA) algorithm that uses the entropy estimate thus obtained, ICA by entropy bound minimization (ICA-EBM). The algorithm adopts a line search procedure, and initially uses updates that constrain the demixing matrix to be orthogonal for robust performance. We demonstrate the superior performance of ICA-EBM and its ability to match sources that come from a wide range of distributions using simulated and real-world data.

Index Terms—Blind source separation (BSS), differential entropy, independent component analysis (ICA), principle of maximum entropy.

I. INTRODUCTION

INDEPENDENT component analysis (ICA) has been one of the most attractive solutions for the blind source separation (BSS) problem. BSS algorithms can exploit either non-Gaussianity, nonstationarity, or correlation—see, e.g., [1]–[18]. The natural cost for exploiting non-Gaussianity that leads to ICA is the mutual information among separated components, which can be shown to be equivalent to maximum likelihood estimation [9], and to negentropy maximization [1], [4] when we constrain the demixing matrix to be orthogonal. In these approaches, we either estimate a parametric density model [6]–[10] along with the demixing matrix, maximize the information transferred in a network of nonlinear units [11], [12], or estimate/approximate the entropy [1], [4], [13], [14], [16].

Manuscript received June 22, 2009; accepted June 22, 2010. Date of publication July 01, 2010; date of current version September 15, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Konstantinos I. Diamantaras. This work was supported by the NSF Grants NSF-CCF 0635129 and NSF-IIS 0612076. The authors are with the Department of CSEE, University of Maryland, Baltimore County, Baltimore, MD 21250 USA (e-mail: lixilin@umbc.edu; adali@umbc.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSP.2010.2055859.

In this paper, we first introduce a novel (differential) entropy estimator (since discrete-valued variables are not considered in this paper, we refer to differential entropy simply as entropy) that approximates the entropy of a random variable given the observations by using the maximum entropy bound that is compatible with finite measurements. In this way, the

maximum entropy density matching can be "consistent to the largest extent with the available data and least committed with respect to unseen data" [23]. Thus, we neither use an approximation as in [13], nor rely on the calculation of higher-order moments as in [14], which are known to be sensitive to outliers. Another key difference is that we calculate several maximum entropy bounds and use the tightest one as the final entropy estimate, rather than using a single entropy approximation or bound. Next, we show that this entropy estimator is a very desirable tool for performing ICA and introduce an ICA algorithm, ICA by entropy bound minimization (ICA-EBM), that uses the tightest maximum entropy bound. Because the entropy bound estimator is quite flexible and can approximate the entropies of a wide range of distributions, it can be used to perform ICA for sources that come from distributions that are sub- or super-Gaussian, unimodal or multimodal, symmetric or skewed, by using only a small class of nonlinear functions.

Natural (relative) gradient descent updates [34], Givens rotations [5], [14], [16], (quasi-)Newton algorithms [4], [10], [15], and steepest descent on the Stiefel manifold [22] are commonly used approaches for optimizing the selected cost function for ICA. In ICA-EBM, we use a line search procedure and initially constrain the demixing matrix to be orthogonal for better convergence behavior. We demonstrate the superior performance of ICA-EBM with respect to a class of competing algorithms using simulations and discuss its properties. We introduced the entropy estimator using the tightest bound in [32] and demonstrated its application to ICA. In this paper, we provide a complete treatment of the entropy estimator, including its implementation and a proof of the existence of a solution for a general class of measuring functions, as well as the derivation of the ICA algorithm and its fast implementation. We also present comprehensive simulation results to study its performance.

The remainder of this paper is organized as follows. In Section II, we provide background for ICA and our approach. The novel entropy estimator is introduced in Section III. A numerical design method and examples of this entropy estimator are presented in Section IV. In Section V, the new ICA algorithm, ICA-EBM, is presented. To demonstrate the effectiveness of ICA-EBM, a number of simulation experiments are presented in Section VI, and conclusions are given in Section VII.

II. BACKGROUND

Let $N$ statistically independent, zero-mean sources $\mathbf{s}(t) = [s_1(t), \ldots, s_N(t)]^T$ be mixed through an $N \times N$ nonsingular mixing matrix $\mathbf{A}$ so that we obtain the mixtures as $\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t)$, where superscript $T$ denotes the transpose, and $t$ is the discrete time index. The mixtures are separated by forming $\mathbf{y}(t) = \mathbf{W}\mathbf{x}(t)$, where $\mathbf{y}(t) = [y_1(t), \ldots, y_N(t)]^T$, and $\mathbf{W}$ is the separation or demixing matrix. A natural cost for achieving the separation of these $N$ independent sources is the mutual information $I(y_1; \ldots; y_N)$ among the $N$ random variables $y_1, \ldots, y_N$:

$$J(\mathbf{W}) = I(y_1; \ldots; y_N) = \sum_{n=1}^{N} H(y_n) - \log|\det(\mathbf{W})| - H(\mathbf{x}) \quad (1)$$

where $H(\mathbf{x})$ is the entropy of observations $\mathbf{x}$. Thus this cost function assumes the same form as the maximum likelihood cost. In the subsequent discussions, the time index $t$ is suppressed for simplicity.

In orthogonal ICA approaches, the mixtures are pre-whitened and the demixing matrix is constrained to be an orthogonal matrix. Since $|\det(\mathbf{W})| = 1$ for an orthogonal matrix, the orthogonal ICA algorithms minimize the cost function

$$J_o(\mathbf{W}) = \sum_{n=1}^{N} H(y_n) + C \quad (2)$$

under the orthogonality constraint $\mathbf{W}\mathbf{W}^T = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix, $H(y_n)$ is the entropy of the $n$th separated source, and $C$ is a constant with respect to $\mathbf{W}$. Even though it is commonly used, the orthogonality constraint may lead to suboptimal performance [27], [33].

As observed in (1) and (2), estimation of the entropy or its approximation plays a key role in the development of ICA algorithms. Commonly used entropy estimators for ICA include the nonparametric entropy estimator [15], [16], [20], [21], the Edgeworth expansion approximation [1], [35], and estimators based on the principle of maximum entropy [13], [14]. Nonparametric entropy estimation is recognized to be practically difficult and computationally demanding. The Edgeworth expansion and the estimator given in [14] lead to the use of higher-order moments or cumulants, which have large estimation variances and are highly sensitive to outliers. The estimator in [13] uses an approximation of the expansion by assuming that the true density of the source is close to the Gaussian density with the same mean and variance. Thus it may be inaccurate when the true density of the source is far from Gaussian. Another approach to the minimization of (1) and (2) is to use density matching through a parametric model and to estimate the parameters of the density along with the demixing matrix [6]–[10]. These ICA algorithms may have poor performance if the assumed distributions are far from the true ones [24], or may be overly complicated due to the use of complex density models.

For the ICA algorithm we introduce in this paper, ICA-EBM, entropy is estimated by bounding the entropy of the estimates using numerical computation. By using a few simple measuring functions, a tight entropy bound can be determined for sources that come from a wide range of distributions: those that have sub- or super-Gaussian, unimodal, multimodal, symmetric, or skewed probability density functions (pdfs), where we define sub- and super-Gaussianity with respect to normalized kurtosis as in [2].

Natural (relative) gradient descent updates [34] are commonly used to minimize the cost function given in (1). When $\mathbf{W}$ is constrained to be orthogonal as in (2), Givens rotations and steepest descent on the Stiefel manifold are commonly used to estimate $\mathbf{W}$ [5], [14], [16], [22]. Since pre-whitening is a standard preprocessing procedure for many ICA algorithms and can simplify the discussion, we always assume that the mixtures have been pre-whitened, i.e., $E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{I}$. But we do not constrain the demixing matrix to be orthogonal in ICA-EBM. Next, we first present the new entropy estimator.
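Since the pre-whitening assumption $E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{I}$ is used throughout, a minimal sketch of this preprocessing step may be useful. The sketch below uses a standard eigendecomposition-based whitening in NumPy; the function name `prewhiten` is our own, not from the paper.

```python
import numpy as np

def prewhiten(x):
    """Whiten zero-mean mixtures x (N x T) so that E{z z^T} = I.

    Uses the eigendecomposition of the sample covariance, a standard
    preprocessing step assumed throughout the discussion.
    """
    x = x - x.mean(axis=1, keepdims=True)      # enforce zero mean
    cov = x @ x.T / x.shape[1]                 # sample covariance
    d, e = np.linalg.eigh(cov)                 # cov = E diag(d) E^T
    v = e @ np.diag(1.0 / np.sqrt(d)) @ e.T    # cov^{-1/2}
    return v @ x

# usage: mix two sources and check the whitened covariance
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 10000))
a = np.array([[1.0, 0.6], [0.4, 1.0]])
z = prewhiten(a @ s)
print(np.round(z @ z.T / z.shape[1], 2))       # approximately the identity
```

After this step the sample covariance of `z` is the identity by construction, which is what allows the unit-length row constraint used later to imply unit-variance source estimates.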

III. THE ENTROPY ESTIMATOR

Rather than directly trying to estimate the entropy $H(X)$ of a random variable $X$ given $T$ independent samples, we determine an upper bound for $H(X)$, which, as we show next, provides a more practical and effective approach for approximating the entropy.

Assume that $G(x)$ is a measuring function [13], and $\mu$ is the expected value of $G(X)$ evaluated over the observed samples. An upper bound of $H(X)$ can be accurately determined by solving for the maximum entropy distribution that maximizes the entropy and, at the same time, is compatible with the constraint $E\{G(X)\} = \mu$, where $E\{\cdot\}$ denotes the expectation; in practice, $\mu$ can be estimated as the sample average of $G(X)$ according to the mean ergodic theorem. In this way, we can obtain several, say $K$, entropy bounds by using $K$ different measuring functions. It is clear that the tightest entropy bound is the closest one to the true entropy of the source, and can be used as the entropy estimate of the source. Although this entropy estimator can only provide an upper bound of the entropy in general, it is useful for ICA since the entropy or the source distributions do not need to be estimated with great precision for reliable ICA performance. Furthermore, the entropy estimator we introduce is quite flexible. As we demonstrate, with a few measuring functions, entropy bounds for sources from a wide range of distributions can be obtained.

A. The Maximum Entropy Distribution

Given the normalized variable $X' = (X - m_X)/\sigma_X$, where $m_X = E\{X\}$ and $\sigma_X^2 = E\{(X - m_X)^2\}$, we have $H(X) = H(X') + \log \sigma_X$. Hence, we only need to estimate $H(X')$, and for simplicity of discussion, we always assume that $X$ has zero mean and unit variance in the rest of this paper.

Suppose that the expectation $E\{G(X)\} = \mu$ is evaluated over the observed samples. According to the principle of maximum entropy [23], we may assume that the samples are drawn from the distribution $p(x)$ which maximizes the entropy $H(X) = -\int p(x)\log p(x)\,dx$ under the constraints $\int x\,p(x)\,dx = 0$, $\int x^2 p(x)\,dx = 1$, $\int G(x)\,p(x)\,dx = \mu$, and the normalization condition $\int p(x)\,dx = 1$. Thus we have the following entropy maximization problem:

$$\max_{p}\; -\int p(x)\log p(x)\,dx \quad \text{s.t.}\; \int p(x)\,dx = 1,\; \int x\,p(x)\,dx = 0,\; \int x^2 p(x)\,dx = 1,\; \int G(x)\,p(x)\,dx = \mu. \quad (3)$$


The optimization problem in (3) can be rewritten as a Lagrangian function:

$$L(p) = -\int p(x)\log p(x)\,dx + \lambda_0\left(\int p(x)\,dx - 1\right) + \lambda_1\int x\,p(x)\,dx + \lambda_2\left(\int x^2 p(x)\,dx - 1\right) + \beta\left(\int G(x)\,p(x)\,dx - \mu\right)$$

where $\lambda_0$, $\lambda_1$, $\lambda_2$, $\beta$ are the Lagrangian multipliers. By letting $\partial L/\partial p = 0$, one finds that $p(x)$ has the form

$$p(x) = \exp\left(\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2 + \beta G(x)\right) \quad (4)$$

where the parameters $\lambda_0$, $\lambda_1$, $\lambda_2$, and $\beta$ are to be determined to satisfy the constraints in (3). The maximum entropy is then given by

$$H_{\max}(\mu) = -\int p(x)\log p(x)\,dx. \quad (5)$$

We rewrite $H_{\max}(\mu)$ as

$$H_{\max}(\mu) = H_g - J(\mu) \quad (6)$$

where $H_g = \frac{1}{2}\log(2\pi e)$ is the entropy of a standard Gaussian random variable with zero mean and unit variance. We have written both $p(x)$ and $H_{\max}$ as functions of $\mu$, since the parameters in (6) are to be determined by the constraint $E\{G(X)\} = \mu$. From (5) we know that $J(\mu)$ is always nonnegative, since $H_g$ is the maximum entropy under the zero mean and unit variance constraints, and it is achieved by a standard Gaussian variable. Thus, we call $J(\mu)$ negentropy as in [4].

Then, the maximum entropy problem reduces to the problem of solving for the function $J(\mu)$ given in (6). Even though an analytic solution for $J(\mu)$ cannot be obtained, we can solve for it numerically as we show in Section IV.
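As a sanity check on the form (4), setting the coefficient of $G(x)$ to zero should recover the standard Gaussian, whose entropy is exactly the bound $H_g = \frac{1}{2}\log(2\pi e)$. A small numerical sketch (crude Riemann sums, our own illustration rather than the paper's implementation):

```python
import numpy as np

# Form (4) with beta = 0: the constraints force (lambda_1, lambda_2) =
# (0, -1/2), i.e., the standard Gaussian with lambda_0 - 1 = -log(2*pi)/2.
x = np.linspace(-12, 12, 48001)
dx = x[1] - x[0]
p = np.exp(-0.5 * np.log(2 * np.pi) - 0.5 * x**2)

mass = p.sum() * dx                 # normalization constraint: should be 1
mean = (x * p).sum() * dx           # zero-mean constraint: should be 0
var = (x**2 * p).sum() * dx         # unit-variance constraint: should be 1
h = -(p * np.log(p)).sum() * dx     # entropy of this maximum entropy pdf

print(round(mass, 6), round(mean, 6), round(var, 6))
print(np.isclose(h, 0.5 * np.log(2 * np.pi * np.e)))  # h equals H_g
```

This confirms numerically that the Gaussian saturates the bound, i.e., the negentropy $J(\mu)$ vanishes in the Gaussian case.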

B. Existence of Maximum Entropy Distribution

The question of the existence of a maximum entropy distribution naturally arises in the new entropy estimator. Existence of a maximum entropy distribution with given moment constraints is well studied in the literature [25], [26]. However, using high-order moments as the measurements, we can only match a small class of pdfs, and the estimates of high-order moments are sensitive to outliers. For the approach we adopt in this paper, which uses a number of measuring functions and seeks a maximum entropy distribution with general measurement constraints, the literature on the existence question is quite limited and considers only specific forms of measuring functions. Here, we show that for the maximum entropy problem given in (3), the maximum entropy solution always exists if the measuring function $G(x)$ is bounded.

Considering the constraints $\int x\,p(x)\,dx = 0$ and $\int x^2 p(x)\,dx = 1$, as well as the normalization constant in (4), we find that the maximum entropy problem given in (3) leads to the following two equations:

$$\int x \exp\left(\lambda_1 x + \lambda_2 x^2 + \beta G(x)\right)dx = 0 \quad (7)$$

$$\int (x^2 - 1)\exp\left(\lambda_1 x + \lambda_2 x^2 + \beta G(x)\right)dx = 0. \quad (8)$$

Hence, for a given measuring function $G(x)$ and a constant $\beta$, we are interested in the existence of a solution for $\lambda_1$ and $\lambda_2$ in (7) and (8), and prove the following result.

Proposition 1: If the measuring function $G(x)$ is bounded, then a solution for $\lambda_1$ and $\lambda_2$ given in (7) and (8) exists for any given $\beta$.

Proof: See Appendix A.

However, for an unbounded measuring function, a solution for $\lambda_1$ and $\lambda_2$ may not exist for certain values of $\beta$. For example, for the unbounded measuring function $G(x) = x^4$, which is widely used in ICA, $\beta$ must be nonpositive so that all the considered integrals exist. As a result, the maximum entropy pdf given in (4) can only match sub-Gaussian densities. Thus, if the observed signals are super-Gaussian, and hence $\mu = E\{X^4\} > 3$, no maximum entropy distribution that is compatible with the measurement $E\{G(X)\} = \mu$ exists. In general, because the estimation of the expectation of an unbounded measuring function may be inaccurate for heavy-tailed source pdfs, it is desirable to constrain the use of unbounded measuring functions for entropy estimation or density matching. Certain entropy estimators, e.g., the Edgeworth expansion approximation and the one proposed in [14], use higher-order statistics for both sub- and super-Gaussian sources. Thus the accuracy of these estimators cannot be guaranteed in general, due to the large estimation variances of higher-order statistics, particularly for super-Gaussian sources. In our entropy estimator, we usually use an unbounded measuring function and a bounded one together, to ensure that at least one entropy bound exists.
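The integrability issue with $G(x) = x^4$ can be seen numerically: with a nonpositive coefficient the density form in (4) integrates to a finite (sub-Gaussian) mass, while with a positive coefficient the integral keeps growing with the integration range. A rough sketch using crude Riemann sums (an illustration of the sign condition, not the paper's code):

```python
import numpy as np

def mass(beta, lim, n=40001):
    # Riemann sum of exp(-x^2/2 + beta*x^4), the unnormalized maximum
    # entropy form (4) with G(x) = x^4 and lambda_1 = 0, lambda_2 = -1/2.
    x = np.linspace(-lim, lim, n)
    return np.exp(-0.5 * x**2 + beta * x**4).sum() * (x[1] - x[0])

# beta <= 0: the integral converges (a valid sub-Gaussian shape)
print(mass(-0.05, 5), mass(-0.05, 10))   # nearly identical values
# beta > 0: the mass explodes as the integration range grows
print(mass(0.05, 5), mass(0.05, 10))
```

The first pair of values stabilizes because the integrand decays; the second pair grows without bound, which is why no maximum entropy solution exists there.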

C. Entropy Estimation Procedure

Given $K$ measuring functions $G_k(x)$, $k = 1, \ldots, K$, the expectation of each measuring function, $E\{G_k(X)\}$, is evaluated over the $T$ observed samples, and each expectation leads to an upper bound estimate of $H(X)$ as

$$\hat{H}_k(X) = H_g - J_k\big(E\{G_k(X)\}\big)$$

where the constant $H_g = \frac{1}{2}\log(2\pi e)$ is the entropy of a standard Gaussian random variable, $J_k(\cdot)$ is a function that can be determined numerically, and $J_k(E\{G_k(X)\})$ is the negentropy, which is defined to be zero if the maximum entropy distribution does not exist for an estimate of $E\{G_k(X)\}$. In practice, $E\{G_k(X)\}$ is estimated using the sample average of $G_k(X)$. In the rest of the implementation discussion and in the algorithm presentation, we keep the expectation operator to simplify the notation, though note that these all refer to sample averages, which we use in the implementation. The tightest maximum entropy bound is used as the final estimate of $H(X)$, i.e.,

$$\hat{H}(X) = \min_{1 \le k \le K} \hat{H}_k(X) = H_g - \max_{1 \le k \le K} J_k\big(E\{G_k(X)\}\big) \quad (9)$$

as it provides the best entropy approximation among the estimates. Also, note the relation between the entropy estimate and the likelihood given by

$$\hat{H}_k(X) = -E\{\log \hat{p}_k(X)\}$$

where $\hat{p}_k(x)$ is the matched maximum entropy density based on the measurement information $E\{G_k(X)\}$. Hence, the estimate of the entropy with the tightest entropy bound has the highest likelihood. At the same time, the maximum entropy density model given in (4) associated with the tightest entropy bound provides the best match for the true pdf of the source. Hence, the estimation defined in (9) is in agreement with the maximum likelihood estimation principle.
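The tightest-bound rule in (9) can be illustrated with two classical maximum entropy bounds that have closed forms (the paper's bounds instead come from the numerically designed functions $J_k$): a variance measurement gives the Gaussian bound $\frac{1}{2}\log(2\pi e \sigma^2)$, and a support measurement gives the uniform bound $\log(b - a)$, here both estimated from samples.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-np.sqrt(3), np.sqrt(3), 100000)    # unit-variance uniform source

# Two maximum entropy bounds on H(X), each from a single measurement:
h_gauss = 0.5 * np.log(2 * np.pi * np.e * x.var())  # variance measurement
h_unif = np.log(x.max() - x.min())                  # support (range) measurement

h_hat = min(h_gauss, h_unif)        # keep the tightest bound, as in (9)
h_true = np.log(2 * np.sqrt(3))     # exact entropy of this uniform source
print(round(h_gauss, 4), round(h_unif, 4), round(h_true, 4))
```

For this source the support bound is much tighter than the Gaussian bound and essentially recovers the true entropy; note that since both bounds are estimated from samples, the chosen bound can sit marginally below the true value.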

IV. NUMERICAL COMPUTATION OF THE ENTROPY ESTIMATOR

To use the entropy estimator given in (9), we need to select a set of measuring functions $\{G_k(x)\}_{k=1}^{K}$, solve for the functions $J_k(\cdot)$ in advance, and store these values. In this section, we first propose a numerical approach for solving for the function $J(\cdot)$, and then study a number of design examples for this entropy estimator.

A. Numerical Approach

With a given measuring function $G(x)$ and a given $\beta$, we can solve for $\lambda_1$ and $\lambda_2$ in (7) and (8) by the following Newton iteration:

$$\begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} \leftarrow \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} - \mathbf{J}^{-1}\begin{bmatrix} f_1(\lambda_1, \lambda_2) \\ f_2(\lambda_1, \lambda_2) \end{bmatrix} \quad (10)$$

where $f_1$ and $f_2$ denote the left-hand sides of (7) and (8), starting from an initial guess that is close enough to the solution, where $\mathbf{J}$ is the Jacobian matrix. It is clear that $(\lambda_1, \lambda_2) = (0, -1/2)$ is the solution of (7) and (8) with $\beta = 0$. Thus, initially we can use $(0, -1/2)$ as an initial guess with the value of $\beta$ kept close to zero, and then use the Newton updates given in (10). We can then keep generating sets of solutions for (7) and (8) by using the previous solutions for $(\lambda_1, \lambda_2)$ as initial guesses in the Newton iterations and using a $\beta$ close to the previous value.

After finding the set of solutions for (7) and (8), the normalization constant, the expectation $\mu = E\{G(X)\}$, and the negentropy $J(\mu)$ can be readily calculated by numerical integration. In this way, we can obtain many points $(\mu_i, J(\mu_i))$, $i = 1, \ldots, P$, in the range of interest. Then we can use an interpolation method to obtain the value of $J(\mu)$ for any $\mu$ in this range. As a result, the function $J(\mu)$ is determined over the whole range.

For certain special measuring functions, the above numerical design method can be simplified. For example, for an even measuring function $G(x)$, from (7) we can show that $\lambda_1 = 0$, and the Newton iteration given in (10) simplifies to a scalar iteration for $\lambda_2$. For an odd measuring function, we can show that $(-\lambda_1, \lambda_2)$ is a solution of (7) and (8) with $-\beta$ if $(\lambda_1, \lambda_2)$ is a solution of (7) and (8) with $\beta$, and the function $J(\mu)$ is even. Thus we only need to determine $J(\mu)$ for positive $\mu$.
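A simplified, concrete instance of this design procedure is sketched below for an even, bounded measuring function, so that $\lambda_1 = 0$ and the Newton iteration in (10) reduces to a scalar update on $\lambda_2$ from the unit-variance condition (8). The specific $G$ is an illustrative choice (not one from Table I), and the integrals are crude Riemann sums rather than the paper's numerical scheme.

```python
import numpy as np

# Illustrative even, bounded measuring function (a stand-in, not Table I).
G = lambda x: x**2 / (1 + x**2)

xs = np.linspace(-12, 12, 24001)
dx = xs[1] - xs[0]

def solve_lam2(beta, lam2=-0.5):
    # Scalar Newton solve of (8) with lambda_1 = 0 (G is even).
    for _ in range(50):
        w = np.exp(lam2 * xs**2 + beta * G(xs))      # unnormalized pdf (4)
        f = ((xs**2 - 1) * w).sum() * dx             # residual of (8)
        df = ((xs**2 - 1) * xs**2 * w).sum() * dx    # d f / d lambda_2
        lam2 = min(lam2 - f / df, -0.05)             # keep the pdf integrable
    return lam2

def negentropy_point(beta):
    # One (mu, J(mu)) sample point for tabulation/interpolation.
    lam2 = solve_lam2(beta)
    w = np.exp(lam2 * xs**2 + beta * G(xs))
    p = w / (w.sum() * dx)                           # normalize
    mu = (G(xs) * p).sum() * dx                      # E{G(X)}
    h = -(p * np.log(p)).sum() * dx                  # entropy of p
    return mu, 0.5 * np.log(2 * np.pi * np.e) - h    # J(mu) = H_g - h

# sweep beta around 0 to tabulate points for spline interpolation
table = [negentropy_point(b) for b in np.linspace(-1.0, 1.0, 5)]
mu0, j0 = negentropy_point(0.0)
print(round(j0, 6))   # essentially zero negentropy at the Gaussian point
```

At $\beta = 0$ the solve returns $\lambda_2 = -1/2$ and the negentropy vanishes, matching the Gaussian special case noted above; nonzero $\beta$ values yield the nonnegative $J(\mu)$ samples that would feed the interpolation table.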

B. Examples

It is clear that as the number of measuring functions increases, the entropy bound becomes tighter, and thus the proposed entropy estimator becomes more accurate. In practice, however, it is desirable to use a few simple measuring functions to reduce the computational load. Thus the measuring functions should be properly designed and selected. Given some prior information, we can also choose the appropriate measuring functions; for example, we can select only even measuring functions to match symmetric densities, or odd ones to match skewed densities. Our experience suggests that using a few even and odd measuring functions provides satisfactory performance for a wide range of distributions when no prior information is available. The two even and two odd rational measuring functions used in this paper are listed in Table I. A number of typical densities that can be matched by these measuring functions are shown in Fig. 1, where we observe that by using these simple rational measuring functions, one can match sub-Gaussian, super-Gaussian, unimodal, bimodal, symmetric, as well as skewed pdfs. In this paper, the negentropy function $J(\mu)$ is obtained by cubic spline interpolation and saved as piecewise polynomials of order 3.
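Since the entries of Table I do not survive in this extraction, the following bounded rational functions are illustrative stand-ins (our own, not the paper's) that show the even/odd split numerically: the even one responds only to the symmetric shape of a pdf, while the odd one captures skewness.

```python
import numpy as np

# Illustrative bounded rational measuring functions (stand-ins, not the
# entries of Table I): an even one for symmetric pdfs, an odd one for
# skewed pdfs.
G_even = lambda x: x**2 / (1 + x**2)
G_odd = lambda x: x / (1 + x**2)

x = np.linspace(-50, 50, 1001)
print(np.allclose(G_even(x), G_even(-x)))     # even symmetry
print(np.allclose(G_odd(x), -G_odd(-x)))      # odd symmetry
print(float(np.abs(G_even(x)).max()), float(np.abs(G_odd(x)).max()))  # bounded
```

Boundedness is the property that guarantees, by Proposition 1, that the associated maximum entropy bound always exists.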

C. Performance of the New Entropy Estimator

Even though the entropy estimator we introduce in this paper uses the entropy bound, and is thus approximate, it provides reliable estimates of the entropy using properly selected measuring functions, as we demonstrate in this section.

To demonstrate the performance of the new entropy estimator, we study the estimation of the entropy of sources of unit variance, drawn from the generalized Gaussian distribution (GGD), which has a pdf of the form

$$p(x) = \frac{\alpha}{2\beta\,\Gamma(1/\alpha)}\exp\left(-\left|\frac{x}{\beta}\right|^{\alpha}\right)$$

where $\alpha > 0$ is the shape parameter, and $\beta$ is a constant depending on $\alpha$, chosen here so that the variance is unity. This is a symmetric and unimodal pdf which assumes the Gaussian pdf for $\alpha = 2$, and is sub-Gaussian for $\alpha > 2$ and super-Gaussian for $\alpha < 2$. Three entropy estimators, the Edgeworth expansion approximation [1], [35], the nonparametric entropy estimator used in [16], and the proposed one, are used to estimate the entropies of sources of GGD with varying $\alpha$ and sample sizes. Fig. 2 summarizes the results, where we


Fig. 1. Plots of typical pdfs that can be matched by using the measuring functions given in Table I. (a) Symmetric. (b) Asymmetric.

TABLE I: THE TWO EVEN AND TWO ODD RATIONAL MEASURING FUNCTIONS AND THEIR FIRST- AND SECOND-ORDER DERIVATIVES

observe that the Edgeworth expansion approximation is neither accurate nor robust to outliers; its estimation variances are large for super-Gaussian sources. The nonparametric entropy estimator is inclined to underestimate the entropies, and its estimates are inaccurate with small sample sizes. The proposed entropy estimator always gives more accurate entropy estimates than its competitors for generalized Gaussian sources over the considered range of shape parameters.
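The claimed sub/super-Gaussian behavior of the GGD versus its shape parameter can be checked directly against the normalized-kurtosis definition used in the paper; SciPy implements the GGD as `gennorm`, whose shape argument plays the role of $\alpha$ here.

```python
from scipy import stats

# Excess (normalized) kurtosis of the generalized Gaussian distribution:
# positive for alpha < 2 (super-Gaussian), zero at alpha = 2 (Gaussian),
# negative for alpha > 2 (sub-Gaussian).
for alpha in (1.0, 2.0, 4.0):
    k = float(stats.gennorm.stats(alpha, moments='k'))
    print(alpha, round(k, 3))
```

At $\alpha = 1$ (Laplace) the excess kurtosis is 3, and it decreases monotonically through 0 at the Gaussian case to negative values for flatter, sub-Gaussian shapes.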

V. THE ICA-EBM ALGORITHM

Since the orthogonality constraint improves the stability, and hence the convergence properties, of ICA algorithms (see, e.g., [28] and [36]), we adopt a two-stage procedure in the implementation of ICA-EBM: we first use updates that constrain the demixing matrix to be orthogonal, and after the convergence of orthogonal ICA-EBM, we directly minimize (1). In what follows, we first derive the general line search algorithm minimizing the ICA cost function given in (1), and then obtain the orthogonal ICA-EBM algorithm as a special case of the general nonorthogonal ICA-EBM algorithm.

A. ICA-EBM Algorithm

The basic idea of ICA-EBM is to divide the problem of minimizing $J(\mathbf{W})$ with respect to $\mathbf{W}$ into a series of subproblems such that we minimize $J(\mathbf{W})$ with respect to each of the row vectors $\mathbf{w}_n$, $n = 1, \ldots, N$, which is an easier problem to solve. Hence, we first update $\mathbf{w}_1$ while $\mathbf{w}_2, \ldots, \mathbf{w}_N$ are kept constant. For this task, we first write the cost function in (1) as a function of only $\mathbf{w}_n$. Since $|\det(\mathbf{W})|$ is the volume of the parallelepiped spanned by the row vectors $\mathbf{w}_1, \ldots, \mathbf{w}_N$, it can be calculated as

$$|\det(\mathbf{W})| = A_n\,\big|\mathbf{h}_n^T \mathbf{w}_n\big| \quad (11)$$

where $A_n$ is the area of the parallelepiped spanned by all the row vectors of $\mathbf{W}$ except $\mathbf{w}_n$, and $\mathbf{h}_n$ is a vector of unit Euclidean length perpendicular to all the row vectors of $\mathbf{W}$ except $\mathbf{w}_n$. The same trick is used in [37] to write the determinant $\det(\mathbf{W})$ as a function of only $\mathbf{w}_n$. Using (11), we can write (1) as a function of only $\mathbf{w}_n$ as

$$J(\mathbf{w}_n) = H(y_n) - \log\big|\mathbf{h}_n^T \mathbf{w}_n\big| + C_1 \quad (12)$$

where $C_1$ is a quantity independent of $\mathbf{w}_n$, and the term $-\log|\mathbf{h}_n^T \mathbf{w}_n|$ can be regarded as a penalty function that tries to keep $\mathbf{w}_n$ orthogonal to all the other row vectors of the demixing matrix $\mathbf{W}$. We always assume that $\mathbf{w}_n$ has unit length, i.e., $\|\mathbf{w}_n\| = 1$, and thus $y_n$ has unit variance, since the mixtures are pre-whitened. Now, by using the entropy estimator given in (9), we can write (12) as

$$J(\mathbf{w}_n) \approx H_g - J_{k(n)}\big(E\{G_{k(n)}(y_n)\}\big) - \log\big|\mathbf{h}_n^T \mathbf{w}_n\big| + C_2 \quad (13)$$

where $C_2$ is a quantity independent of $\mathbf{w}_n$, and we write the measuring function index $k(n)$ as a function of $n$, since for different sources, different measuring functions are selected according to (9). We have the gradient

$$\frac{\partial J(\mathbf{w}_n)}{\partial \mathbf{w}_n} = -J'_{k(n)}\big(E\{G_{k(n)}(y_n)\}\big)\,E\big\{\mathbf{x}\,G'_{k(n)}(y_n)\big\} - \frac{\mathbf{h}_n}{\mathbf{h}_n^T \mathbf{w}_n} \quad (14)$$

where $J'_{k(n)}$ and $G'_{k(n)}$ are the first-order derivatives of $J_{k(n)}$ and $G_{k(n)}$, respectively. Notice that the term in (14) that is collinear with the previous $\mathbf{w}_n$ has no contribution to the ICA learning process. Hence, as in [30], we can project the gradient in (14) onto the tangent hyperplane of the unit sphere at the point $\mathbf{w}_n$ to obtain the steepest descent direction on the unit sphere as

$$\mathbf{u}_n = \left(\mathbf{I} - \mathbf{w}_n \mathbf{w}_n^T\right)\frac{\partial J(\mathbf{w}_n)}{\partial \mathbf{w}_n}. \quad (15)$$
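A rough sketch of one such row update may make the geometry concrete. The weighting of the entropy-gradient term here is illustrative only (a fixed kurtosis-type measuring function $G(y) = y^4$ with an arbitrary constant weight, in place of the numerically designed $J'_k$), so this demonstrates just the mechanics of (11), (14), and (15): compute the perpendicular vector, form a gradient, project it onto the tangent plane of the unit sphere, and step while staying on the sphere.

```python
import numpy as np

def perp_unit(W, n):
    # h_n in (11): unit vector orthogonal to all rows of W except row n.
    # The n-th column of W^{-1} is orthogonal to every other row of W.
    h = np.linalg.inv(W)[:, n]
    return h / np.linalg.norm(h)

def row_step(W, x, n, mu=0.1, c=0.1):
    # One steepest descent step for row w_n on the unit sphere.
    w = W[n] / np.linalg.norm(W[n])
    y = w @ x                                    # current source estimate
    h = perp_unit(W, n)
    # illustrative gradient in the spirit of (14): an entropy-related term
    # with G(y) = y^4 and constant weight c, plus the penalty term of (12)
    grad = c * np.mean(x * (4 * y**3), axis=1) - h / (h @ w)
    u = grad - (w @ grad) * w                    # projection (15)
    w_new = w - mu * u                           # move against the gradient
    W = W.copy()
    W[n] = w_new / np.linalg.norm(w_new)         # renormalize to unit length
    return W

# usage on a toy whitened 2 x T data set
rng = np.random.default_rng(2)
x = rng.standard_normal((2, 5000))
W = np.linalg.qr(rng.standard_normal((2, 2)))[0]  # orthonormal initial rows
W1 = row_step(W, x, 0)
print(np.linalg.norm(W1[0]))   # the updated row stays on the unit sphere
```

The full algorithm would wrap such row updates in a line search over the step size and sweep over all rows until convergence, first under the orthogonality constraint and then without it, as described above.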