IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 10, OCTOBER 20105151
Independent Component Analysis by
Entropy Bound Minimization
Xi-Lin Li and Tülay Adalı, Fellow, IEEE
Abstract—A novel (differential) entropy estimator is introduced
where the maximum entropy bound is used to approximate the en-
tropy given the observations, and is computed using a numerical
procedure thus resulting in accurate estimates for the entropy. We
show that such an estimator exists for a wide class of measuring
functions, and provide a number of design examples to demon-
strate its flexible nature. We then derive a novel independent com-
ponent analysis (ICA) algorithm that uses the entropy estimate
thus obtained, ICA by entropy bound minimization (ICA-EBM).
The algorithm adopts a line search procedure, and initially uses
updates that constrain the demixing matrix to be orthogonal for
robust performance. We demonstrate the superior performance of
ICA-EBM and its ability to match sources that come from a wide
range of distributions using simulated and real-world data.
Index Terms—Blind source separation (BSS), differential
entropy, independent component analysis (ICA), principle of
separation (BSS) problem. BSS algorithms can exploit either
non-Gaussianity, nonstationarity, or correlation—see, e.g.,
–. The natural cost for exploiting non-Gaussianity that
leads to ICA is the mutual information among separated com-
ponents, which can be shown to be equivalent to maximum
likelihood estimation , and to negentropy maximization ,
 when we constrain the demixing matrix to be orthogonal.
In these approaches, we either estimate a parametric density
model – along with the demixing matrix, or maximize
the information transferred in a network of non-linear units
, , or estimate/approximate the entropy , , ,
In this paper, we first introduce a novel (differential) entropy1
estimator that approximates the entropy of a random variable
given the observations by using the maximum entropy bound
that is compatible with finite measurements. In this way, the
NDEPENDENT component analysis (ICA) has been
one of the most attractive solutions for the blind source
Manuscript received June 22, 2009; accepted June 22, 2010. Date of publica-
tion July 01, 2010; date of current version September 15, 2010. The associate
editor coordinating the review of this manuscript and approving it for publica-
tion was Prof. Konstantinos I. Diamantaras. This work was supported by the
NSF Grants NSF-CCF 0635129 and NSF-IIS 0612076.
The authors are with the Department of CSEE, University of Maryland —
Baltimore County, Baltimore, MD 21250 USA (e-mail: firstname.lastname@example.org;
Color versions of one or more of the figures in this paper are available online
Digital Object Identifier 10.1109/TSP.2010.2055859
1Since discrete-valued variables are not considered in this paper, we refer to
differential entropy as simply entropy in the paper.
maximum entropy density matching can be “consistent to the
largest extent with the available data and least committed with
respect to unseen data” . Thus we do not use an approxima-
tion as in  and rely on calculation of higher-order moments
as in  which are known to be sensitive to outliers. Another
key difference is that we calculate several maximum entropy
bounds and use the tightest one as the final entropy estimate,
we show that this entropy estimator is a very desirable tool for
performing ICA and introduce an ICA algorithm, ICA by en-
tropy bound minimization (ICA-EBM), that uses the tightest
maximum entropy bound. Because the entropy bound estimator
is quite flexible and can approximate the entropies of a wide
range of distributions, it can be used to perform ICA for sources
that come from distributions that are sub- or super-Gaussian,
unimodal or multimodal, symmetric or skewed by using only a
small class of nonlinear functions.
Natural (relative) gradient descent updates , Givens rota-
tions , , , (quasi-) Newton algorithm , , ,
and steepest descent on the Stiefel manifold  are commonly
used approaches for optimizing the selected cost function for
ICA. In ICA-EBM, we use a line search procedure and initially
constrain the demixing matrix to be orthogonal for better con-
vergence behavior. We demonstrate the superior performance
of ICA-EBM with respect to a class of competing algorithms
using simulations and discuss its properties. We introduced the
entropy estimator using the tightest bound in  and demon-
strated its application to ICA. In this paper, we provide a com-
plete treatmentof theentropy estimatorincluding itsimplemen-
tation and a proof for the existence of a solution with a general
class of measuring functions as well as derivation of the ICA
algorithm and its fast implementation. We also present compre-
hensive simulation results to study its performance.
The remainder of this paper is organized as follows. In
Section II, we provide background for ICA and our approach.
The novel entropy estimator is introduced in Section III. A
numerical design method and examples of this entropy es-
timator are presented in Section IV. In Section V, the new
ICA algorithm, ICA-EBM, is presented. To demonstrate the
effectiveness of ICA-EBM, a number of simulation experi-
ments are presented in Section VI, and conclusions are given
in Section VII.
Letstatistically independent, zero mean sources
be mixed through an
nonsingular mixing matrix so that we obtain the mixtures
1053-587X/$26.00 © 2010 IEEE
5152 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 10, OCTOBER 2010
as, where super-
The mixtures are separated by forming
denotes the transpose, and is the discrete time index.
is the, and
separation or demixing matrix. A natural cost for achieving
the separation of these
independent sources is the mutual
entropy of observations
Thus this cost function assumes the same form as the maximum
likelihood cost. In the subsequent discussions, the time index
is suppressed for simplicity.
and the demixing matrix is constrained to be an orthogonal ma-
for an orthogonal matrix, the orthog-
onal ICA algorithms minimize the cost function
is the entropy of the th separated source, and
is a constant with respect to.
under the orthogonality constraint
identity matrix. Even though it is commonly used, the orthogo-
As observed in (1) and (2), estimation of the entropy or its
approximation plays a key role in the development of ICA al-
gorithms. Commonly used entropy estimators for ICA include
worth expansion approximation , , and estimators based
on the principle of maximum entropy , . Nonparametric
entropy estimation is recognized to be practically difficult and
computationally demanding. The Edgeworth expansion and the
estimator given in  lead to the use of higher-order moments
or cumulants, which have large estimation variances and are
highly sensitive to outliers. The estimator in  uses approx-
imation of the expansion by assuming that the true density of
source is close to the Gaussian density with the same mean and
variance. Thus it may be inaccurate when the true density of
source is far from Gaussian. Another approach to the minimiza-
tion of (1) and (2) is to use density matching through a para-
with the demixing matrix –. These ICA algorithms may
have poor performance if the assumed distribution is far from
the true ones , or over complicated by using complex den-
For the ICA algorithm we introduce in this paper, ICA-EBM,
entropy is estimated by bounding the entropy of estimates using
numerical computation. By using a few simple measuring func-
tions, a tight entropy bound can be determined for sources that
come from a wide range of distributions, those that have sub- or
super-Gaussian, unimodal, multimodal, symmetric or skewed
probability density functions (pdfs) where we define sub- and
super-Gaussianity with respect to normalized kurtosis as in .
Natural (relative) gradient descent updates are commonly
used to minimize the cost function given in [34, Eq. 1]. When
, where is the
is constrained to be orthogonal as in (2), Givens rotations
and steepest descent on the Stiefel manifold are commonly
used to estimate
, , , . Since pre-whitening is
a standard preprocessing procedure for many ICA algorithms
and can simplify the discussion, we always assume that the
mixtures have been pre-whitened, i.e.,
do not constrain the demixing matrix to be orthogonal in
ICA-EBM. Next, we first present the new entropy estimator.
. But we
III. THE ENTROPY ESTIMATOR
Rather than directly trying to estimate the entropy
termine an upper bound for
vides a morepractical and effectiveapproach for approximating
is a measuring function , and
the expected value of
evaluated over the observed sam-
ples. An upper bound of
can be accurately determined by
solving for the maximum entropy distribution that maximizes
the entropy, and, at the same time, is compatible with the con-
, and in practice, it can be estimated as the sample av-
according to the mean ergodic theorem. In this
different measuring functions. It is clear that the tightest en-
tropy bound is the closest one to the true entropy of source, and
can be used as the entropy estimate of source. Although this en-
tropy estimator can only provide an upper bound of the entropy
in general, it is useful for ICA since the entropy or the source
distributions do not need to be estimated with great precision
in ICA for reliable performance. Furthermore, the entropy esti-
mator we introduce is quite flexible. As we demonstrate, with
a few measuring functions, entropy bound for sources from a
wide range of distributions can be obtained.
given independent samples, we de-
, which, as we show next, pro-
denotes the expectation
A. The Maximum Entropy Distribution
Given the normalized variable
, we have
, and for sim-
has zero mean
. Hence, we can only estimate
plicity of discussion, we always assume that
and unit variance in the rest of this paper.
Suppose that the expectation
the observed samples, and we have
principle of maximum entropy , we may assume that the
samples are drawn from the distribution
and the normalization condition
. Thus we have the following entropy maxi-
is evaluated over
. According to the
, which max-
LI AND ADALI: ICA-EBM 5153
The optimization problem in (3) can be rewritten as a La-
, , are the Lagrangian multipliers. By let-
, one finds thathas the form
the constraints in (3). The maximum entropy is then given by
, , , andare to be determined to satisfy
random variable with zero mean and unit variance, and
is the entropy of a standard Gaussian
where we have written
as a function ofsince
in (6) are to be determined by the constraint
. From (5) we know that
ways nonnegative, since
under the zero mean and unit variance constraints, and it is
achieved by a standard Gaussian variable. Thus, we call
negentropy as in .
Then, the maximum entropy problem reduces to the problem
of solving for the function
analytic solution for
cannot be obtained, we can solve for
numerically as we show in Section IV.
is the maximum entropy
given in (6). Even though an
B. Existence of Maximum Entropy Distribution
The problem of the existence of maximum entropy distribu-
tion naturallyarisesinthenew entropyestimator. Existence ofa
using high-order moments as the measurements, we can only
match a small class of pdfs, and the estimations of high-order
moments are sensitive to outliers. For the approach we adopt in
maximum entropy distribution with general measurement con-
straints, the literature is quite limited on the existence question
and considers only specific forms of measuring functions. Here,
maximum entropy solution always exists if the measuring func-
Considering the constraints
in (4), we find that the maximum entropy problem given
in (3) leads to the following two equations:
as well as the normalization constant
Hence, for a given measuring function
we are interested in the existence of a solution for
and (8), and prove the following result.
Proposition 1: If the measuring function
then a solution for
Proof: See Appendix A.
However, for an unbounded measuring function, a solu-
and may not exist for certain values of
. For example, for the unbounded measuring function
, which is widely used in ICA,
must be nonnegative so that all the considered integrals exist.
As a result, the maximum entropy pdf given in (4) can only
match sub-Gaussian densities. Thus if the observed signals
are super-Gaussian and hence
entropy distribution that is compatible with measurements
the estimation of the expectation of an unbounded measuring
function may be inaccurate for heavy tailed source pdfs, it is de-
sirable to constrain the use of unbounded measuring functions
for entropy estimation or density matching. Certain entropy
estimators, e.g., the Edgeworth expansion approximation and
the one proposed in , use higher-order statistics, both for
sub- and super-Gaussian sources. Thus the accuracy of these
estimators cannot be guaranteed in general, due to the large
estimation variances of higher-order statistics particularly for
super-Gaussian sources. In our entropy estimator, usually we
use an unbounded measuring function and a bounded one
together to ensure that at least one entropy bound exists.
and a constant ,
and in (7)
given in (7) and (8) exists for any
in (7) and (8)
, no maximum
exists. In general, because
C. Entropy Estimation Procedure
expectation of each measuring function
bound estimate of
measuring functions, , the
Gaussian random variable,
determined numerically, and
which is defined to be zero if the maximum entropy distribu-
tion does not exist for an estimate of
is estimated using the sample average of
the rest of the implementation discussion and in the algorithm
is the entropy of a standard
is a function that can be
is the negentropy,
. In practice,
5154 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 10, OCTOBER 2010
presentation we keep the expectation operator to simplify the
notation, though note that these all refer to sample averages,
which we use in the implementation. The tightest maximum
entropy bound is used as the final estimate of
as it provides the best entropy approximation of the estimates.
Also, note the relation between the entropy estimate and the
likelihood given by
imum entropy density based on the measurement information
. Hence, the estimate of the entropy with the
tightest entropy bound has the highest likelihood. At the same
time, the maximum entropy density model given in (4) associ-
ated with the tightest entropy bound provides the best match for
the true pdf of the source. Hence, the estimation defined in (9)
is in agreement with the maximum likelihood estimation prin-
is the matched max-
IV. NUMERICAL COMPUTATION OF THE ENTROPY ESTIMATOR
To use the entropy estimator given in (9), we need to select a
advance and store these values. In this section, we first propose
a numerical approach for solving for function
study a number of design examples for this entropy estimator.
, and then
A. Numerical Approach
With a given measuring function
and a given , we can
starting from an initial guess
that is close enough to the
is the Jacobian matrix. It is clear that
solution of (7) and (8) with
we can use
(10). We can then keep generating sets of solutions for (7) and
(8) by using the previous solutions for
iterations and using a
close to the previous value.
After finding the set of solutions for (7) and (8), normaliza-
readily calculated as
as an initial guess with the value
. Thus, initially
and using the Newton
In this way, we can obtain many points
terpolation method to obtain the value of
As a result, the function
. Then we can use an in-
is determined in the range
For certain special measuring functions, the above numerical
design method can be simplified. For example, for an even mea-
, from (7) we can show that
Newton iteration given in (10) simplifies to
, and the
For an odd measuring function, we can show that
a solution of (7) and (8) with
and (8) with , and function
for positive .
is even. Thus we only need to
is a solution of (7)
It is clear that with the increase of the number of measuring
functions, the entropy bound will be tighter, and thus the pro-
posed entropy estimator will be more accurate. In practice, it
is desirable to use a few simple measuring functions to reduce
properly designed and selected. Given some prior information,
we can also choose the appropriate measuring functions. For
example, we can only select even measuring functions to match
symmetric densities, or odd ones to match skewed ones. Our
experience suggests that using a few even and odd measuring
functions provides satisfactory performance for a wide range
of distributions when no prior information is available. The two
are listed in Table I. A number of typical densities that can
be matched by these measuring functions are shown in Fig. 1,
where we observe that by using these simple rational measuring
functions, one can match sub-Gaussian, super-Gaussian, uni-
modal, bimodal, symmetric, as well as skewed pdfs. In this
paper, the negentropy function
interpolation and saved as piecewise polynomials of order 3.
is obtained by cubic spline
C. Performance of the New Entropy Estimator
Even though the entropy estimator we introduce in this paper
uses the entropy bound, and is thus approximate, it provides re-
liable estimations of the entropy using properly selected mea-
suring functions as we demonstrate in this section.
To demonstrate the performance of the new entropy esti-
mator, we study the estimation of entropy of sources of unit
variance, drawn from the generalized Gaussian distribution
(GGD), which has a pdf of the form
is the shape parameter, and
. This is a symmetric and unimodal pdf which assumes
the Gaussian pdf for
, sub-Gaussian for
Edgeworth expansion approximation , , the nonpara-
metric entropy estimator used in , and the proposed one, are
used to estimate the entropies of sources of GGD with varying
and sample sizes. Fig. 2 summarizes the results where we
is a constant depending
. Three entropy estimators, the
LI AND ADALI: ICA-EBM 5155
Fig. 1. Plots of typical pdfs that can be matched by using the measuring func-
tions given in Table I. (a) Symmetric. (b) Asymmetric.
THE TWO EVEN AND TWO ODD RATIONAL MEASURING FUNCTIONS AND
THEIR FIRST- AND SECOND-ORDER DERIVATIVES
observe that the Edgeworth expansion approximation is neither
accurate nor robust to outliers. Its estimation variances are
large for super-Gaussian sources. The nonparametric entropy
estimator is inclined to underestimate the entropies, and the
estimates are inaccurate with small sample sizes. The proposed
entropy estimator always gives more accurate entropy esti-
mates than its competitors for generalized Gaussian sources
V. THE ICA-EBM ALGORITHM
Since orthogonality constraint improves the stability and
hence convergence properties of ICA algorithms (see, e.g., 
and ), we adopt a two stage procedure in the implementa-
tion of ICA-EBM, where we first use updates that constrain the
demixing matrix to be orthogonal, and after the convergence
of orthogonal ICA-EBM, we directly minimize (1). In what
follows, we first derive the general line search algorithm mini-
mizing the ICA cost function given in (1), and then obtain the
orthogonal ICA-EBM algorithm as a special case of the general
nonorthogonal ICA-EBM algorithm.
A. ICA-EBM Algorithm
The basic idea of ICA-EBM is to divide the problem of min-
with respect to
into a series of subproblems such that we minimize
with respect to each of the row vectors
, which is an easier problem to solve. Hence, we
kept constant. For this task, we first write the cost function in
(1) as a function of only
the parallelepiped spanned by the vectors
can be calculated as
is the volume of
, , it
same trick is used in  to write the determinant
is the area of the parallelepiped spanned by all the row
except, and is a vector of unit Euclidian
is a quantity independent of
can be regarded as a penalty function that
tries to keep
orthogonal to all the other row vectors of
the demixing matrix
. We always assume that
, and thus
are pre-whitened. Now, by using the entropy estimator given in
(9), we can write (12) as
, and the term
, since the mixtures
different measuring functions are selected according to (9). We
have the gradient
is a quantity independent of
as a function of , i.e.,
, and we write the
, since for different sources,
collinear with the previous
learning process. Hence, as in , we can project the gradient
in (14) onto the tangent hyperplane of the unit sphere at point
to obtain the steepest descent direction on the unit sphere
respectively. Notice that the term in (14) that is
has no contribution to the ICA
are the first order derivatives of
5156 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 10, OCTOBER 2010
Fig. 2. Comparison of the accuracy of three different entropy estimators for the estimation of entropy of generalized Gaussian sources with zero mean and
unit variance. The shape parameter ? takes values from the set ?????????????????????????????????? The adopted Edgeworth expansion approximation is
???? ? ??????????? ? ? ??????, where ? ??? ? ??? ? ? ? is the fourth-order cumulant of normalized random variable ?. Results for Edgeworth expansion
approximation with 0.2 and 0.5 are not shown due to their extremely large estimation variances. The spacing parameter in the nonparametric estimator used in 
is set to??. (a) 100 samples. (b) 1000 samples. (c) 10000 samples.
length, orthogonal tothe previous
ascent direction. Thus we obtain the following line search algo-
. It is
, and pointsin thesteepest
The ICA-EBM algorithm repeats the line search given in (16)
over different row vectors of
The orthogonal ICA-EBM algorithm can be readily derived
from the general line search algorithm given in (16). When we
impose orthogonality constraint to
and (16) reduces to:
is the step length, andis computed using (15).
, naturally becomes
After each row vector in the demixing matrix
dated once, a symmetric decorrelation procedure is performed
to keep the demixing matrix orthogonal, i.e., we use
has been up-
It can be shown that if we choose
given in (17) reduces to the following algorithm:
is clear that the line search algorithm given in (19) is equiva-
lent to the well-known FastICA algorithm . In fact, several
fast blind deconvolution and separation algorithms, e.g., Fas-
tICA and super exponential algorithm (SEA) , , are line
search algorithms, and do not converge faster than an exact line
search algorithm , . Our experience with a large class of
sources suggests that, unlike FastICA, which may occasionally
exhibit oscillatory behavior, the line search algorithms given in
(16) and (17) can provide more robust convergence behavior by
using a simple step length control strategy, which we explain in
the next subsection.
is the second-order derivative of. It
B. Implementation and Computational Complexity of
1) Implementation: To achieve faster convergence, we first
use the orthogonal ICA-EBM with a few measuring functions
to provide a rough initial guess. In our implementation, we use
the FastICA learning rule given in (19) with measuring func-
, maximum number of iterations of 100, and threshold
value of 0.001 in (21) to provide the initial guess. Typically, this
stage requires iterations on the order of 10–20. Then ICA-EBM
uses the line search algorithms given in (19), (17) and (16) se-
quentially with all the measuring functions listed in Table I to
estimate the demixing matrix.
occasionally converge to a saddle point if the sample sizes are
LI AND ADALI: ICA-EBM 5157
small. We mimic the method proposed in  to detect and re-
move saddle convergence. Assume
arated components. We can rotate pair
andare a pair of sep-
with an angle
is detected, and we can rotate the
rows of andas
to remove this saddle point. When the cost function given in
(2) cannot be further reduced by rotating any pair of
(20), the saddle point detection is finished. After finishing the
saddle point detection, we use the orthogonal ICA-EBM algo-
rithm again to refine the solution if any saddle point is detected.
In the nonorthogonal ICA-EBM, we need to determine the
. The algorithm for the calculation of
 can be computationally demanding when the dimension of
is high. A recursive algorithm is proposed for the fast
in Appendix B.
A simple step size control strategy is used in ICA-EBM.
When the algorithm detects that the cost function oscillates, the
step size is halved, and the algorithm begins a new line search
starting from the best solution that has been found.
The maximum number of iterations for the line search algo-
rithms in (19), (17), and (16) are set to 100. The stopping crite-
with a typical value of 0.0001 for .
2) Computational Complexity: In the orthogonal ICA-EBM,
the computational complexity for separating one component is
per iteration. The symmetric decorrelation procedure
has a complexity of
. Thus the total computational
complexity of each full iteration is
since in general
. The nonorthogonal ICA-EBM has
the same computational complexity as the orthogonal version
when we adopt the fast algorithm for the calculation of
given in Appendix C. Thus the total computational complexity
of ICA-EBM is
VI. EXPERIMENTAL RESULTS
The ICA-EBM algorithm is compared with six competitive
ICA algorithms: (Joint Approximate Diagonalization of Eigen-
matrices) JADE , FastICA , efficient variant of algorithm
FastICA (EFICA) , PearsonICA , AMICA , and ro-
bust, accurate, direct independent components analysis algo-
rithm (RADICAL) . JADE is a cumulant-based batch al-
gorithm for source separation and we use the
comparisons. FastICA is based on entropy approximation, and
we use the symmetric decorrelation approach. Two nonlineari-
ties, tanh and skew, which are for the separation of symmetric
and skewed sources respectively, are considered. EFICA uses a
version in the
Fig. 3. Performance comparison of seven ICA algorithms in the separation of
mixture of ? ? ?? sources of generalized Gaussian distribution with shape pa-
rameters? ? ????for? ? ????????,and? ? ???for? ? ?????????.
Each simulation point is averaged over 100 independent runs.
generalized Gaussian distribution source pdf matching mecha-
nism for FastICA. In PearsonICA, the source pdfs are matched
using a Pearson density model. AMICA adopts a mixture of
generalized Gaussians model for density matching, and a quasi-
Newton optimization technique. RADICAL is a nonparametric
ICA algorithm using spacings estimates of entropy and exhaus-
tivesearch optimizationmethod. Toincrease thespeedofRAD-
ICAL, we use its fast version, where no “smoothing points” or
auxiliary points are used. The code of FastICA is downloadable
at http://www.cis.hut.fi/projects/ica/fastica/, code for AMICA
is available at http://sccn.ucsd.edu/ jason/, and the codes for
JADE, PearsonICA and RADICAL are the versions available
on the ICA Central website (http://www.tsi.enst.fr/icacentral/
algos.html). All the graphics and text output functions in these
ICA algorithms are disabled to increase the speed.
Three performance indices are used to evaluate the per-
formance of an ICA algorithm. We assume that all sources
have the same variance. The first performance index is the
percentage of failed trials. Let
demixing-mixing matrix. We say that
bined demixing-mixing matrix if the locations of the largest
squared elements in any two rows are different. Otherwise,
is a failed combined demixing-mixing matrix. The second
performance index is the average interference to signal ratio
(average ISR), which is defined for a successful combined
demixing-mixing matrix. The ISR in each row of
be the combined
is a successful com-
5158 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 10, OCTOBER 2010
Fig. 4. Performance comparison of seven ICA algorithms in the separation
of mixtures of ? ? ?? sources that come from a skewed and unimodal pdf
(Gamma with ????
?????????? ? ?? with parameter ?
???????? for the ??? source where ? ? ???????). Each simulation point is
averaged over 100 independent runs.
as the ratio of the sum of all squared elements in the row except
for the largest to the largest squared element in the row. The
average ISR of
is the average ISR of all the row ISRs. The
third performanceindexis theaverageconsumedCPUtime.All
the algorithms are programmed in Matlab (http://www.math-
In the following experiments, we consider a number of cases:
estimation of combination of sub- and super-Gaussian sources
drawn from a GGD, which are unimodal and symmetric, esti-
mation of sources from a skewed (Gamma distribution) and bi-
modal (mixture of Gaussians) distributions as well as speech
signals. We also demonstrate an example that shows perfor-
mance with increasing number of sources in the mixture.
1) Experiment 1: In this experiment, we study separation of
sources that come from the GGD family. We generate 10 super-
and 10 sub-Gaussian sources as well as one Gaussian source
with shape parameters
for. They are mixed with a random
21 mixing matrix, whose elements are drawn from a zero
mean, unit variance Gaussian distribution. Fig. 3 summarizes
the performance indices. We observe that FastICA with tanh,
EFICA, ICA-EBM and PearsonICA exhibit very good perfor-
mance. Except the occasional failures with small sample sizes,
ICA-EBM performs as well as EFICA, which does assume a
generalized Gaussian distribution and hence has the clear ad-
Fig. 5. Performance comparison of seven ICA algorithms in the separation of
mixture of ? ? ?? sources that come from skewed and bimodal pdf (Gaussian
mixture with ????
? ????????????? ?????? ??????????????? ?
? ?????? ? ?? for the ??? source where ? ? ???????). Each
simulation point is averaged over 100 independent runs.
parameter values, the sources have heavy tailed distributions,
and the cumulants cannot be reliably estimated. AMICA and
RADICAL are the two most computationally demanding algo-
rithms, and they show limited performance when the sample
size is small. Furthermore, AMICA fails frequently even for
large sample sizes. FastICA with skew fails completely since
all the sources are symmetric.
2) Experiment 2: In this experiment, we consider the sep-
aration of sources from a Gamma distribution with pdf of the
, we obtain different unimodal skewed pdfs. We
independent sources with density parame-
mixing matrix is a random 23
the performance indices. From Fig. 4 we observe that FastICA
with skew nonlinearity and ICA-EBM perform very well. Al-
though PearsonICA that does include skewed density models
crease the sample size from 5000 to 10000. Again, RADICAL
and AMICA are the two slowest algorithms, and fail frequently
in this experiment.
3) Experiment 3: In this experiment, the sources are drawn
from a Gaussian mixture distribution with pdf of the form
, where , and
, as the sources. The
23 matrix. Fig. 4 summarizes
LI AND ADALI: ICA-EBM5159
Fig. 6. Performance comparison of seven ICA algorithms with increasing
number of sources. The number of sources varies from 10 to 50, and the sample
size is 2500. For each run, each source is drawn from a randomly selected
distribution considered in Experiments 1–3. Each simulation point is averaged
over 100 independent runs.
By varying , we can obtain different skewed and multimodal
pdfs. Here, we consider the separation of mixtures of
independent sourceswithdensity parameters
. The mixing matrix is a random 25
Fig. 5 summarizes the performance indices. We observe that
ICA-EBM shows the best performance. RADICAL also per-
forms very well if the sample size is large enough. Although
AMICA performs very well if the sample size is very large and
it converges to a successful demixing matrix, it fails frequently.
All the other algorithms show limit performance in this experi-
4) Experiment 4: In this experiment, we study the per-
formance with increasing number of sources. The number of
sources varies from 10 to 50, and the sample size is 2500.
For each run, each source is drawn from a randomly selected
distribution considered in Experiments 1–3. Thus the sources
are from different families of distributions, and can be sub- or
super-Gaussian, unimodal or bimodal, symmetric or skewed.
Fig. 6 summarizes the performance indices where we observe
that ICA-EBM performs the best. RADICAL performs well
if the number of sources is small, and fails frequently when
the number of sources increases. Although AMICA performs
very well if the number of sources is small and it converges
to a successful demixing matrix, it does fail often as well.
All the other algorithms show limited performance. In this experiment, RADICAL, JADE, and AMICA are the three most computationally demanding algorithms for a large number of sources. The CPU time consumed by JADE increases rapidly with the number of sources.
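The performance index plotted in these figures is not defined in this excerpt; a commonly used choice in the ICA literature is the Amari-style index on the global matrix G = WA, which is sketched below as an assumption, not as the paper's exact metric.

```python
import numpy as np

def amari_index(W, A):
    """Amari-style performance index of the global matrix G = W A.

    It is zero iff G is a scaled permutation matrix, i.e., perfect
    separation up to the inherent ICA scaling/ordering ambiguities.
    """
    G = np.abs(W @ A)
    n = G.shape[0]
    row = (G / G.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col = (G / G.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row.sum() + col.sum()) / (2 * n * (n - 1))

A = np.random.default_rng(2).standard_normal((5, 5))
print(amari_index(np.linalg.inv(A), A))  # close to 0 for a perfect demixer
```

Averaging this index over independent runs, as done for each simulation point here, gives a scale- and permutation-invariant measure of separation quality.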
5) Experiment 5: In this experiment, 20 natural audio signals obtained from  are used as the sources. The kurtosis and skewness of these audio signals vary over a considerable range; most of the sources are super-Gaussian and slightly skewed. Unlike the case of the computer-generated sources in the previous experiments, which were all generated as samples from independent distributions, independent natural signals may exhibit slight dependence. Another practical issue is the nonwhiteness of the sources, which further decreases the effective sample size. This slight dependence among sources changes the contour of an ICA cost and may introduce many false stationary points, which makes the optimization difficult. By using the nonparametric mutual information estimation method given in , we find that the mutual information between certain pairs of audio signals is large enough to indicate that these sources are moderately statistically dependent.
In each run, we randomly choose audio signals as the sources, and mix them through a random matrix. The top panel and bottom panel of Fig. 7 summarize the performance indices when we leave the orders of the samples of all sources untouched, and when we independently and randomly permute the samples of all sources, respectively. The random permutation of samples does not change the marginal distribution of a source, but tends to reduce the mutual information among sources and removes false stationary points of an ICA cost. From Fig. 7, we observe that RADICAL and ICA-EBM perform very well in both of these cases, i.e., whether the sources are slightly dependent or almost independent. This fact suggests that they are robust to slight dependence among the sources; note that RADICAL uses an exhaustive search over the whole parameter space, while ICA-EBM uses a gradient search and may converge locally. AMICA shows good performance when it converges successfully, but it fails frequently even when the number of sources is small. Several of the other algorithms, EFICA among them, exhibit poor performance before the random permutation and much better performance after it, which implies that these algorithms are sensitive to the slight dependence among sources. Note that in practice, only the mixtures are observed, and it is impossible to independently permute the samples of the sources. Hence, RADICAL and ICA-EBM are more desirable than the other algorithms in the sense that they exhibit more reliable convergence.
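The effect of independently permuting the samples of each source can be illustrated with a crude histogram-based mutual information estimate; both the estimator and the synthetic signals below are illustrative assumptions, not the nonparametric method cited in the paper.

```python
import numpy as np

def hist_mi(x, y, bins=32):
    """Crude plug-in mutual information estimate (in nats) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(3)
n = 10000
common = rng.standard_normal(n)
s1 = common + 0.5 * rng.standard_normal(n)  # two signals sharing a common
s2 = common + 0.5 * rng.standard_normal(n)  # component: clearly dependent
s2_perm = rng.permutation(s2)               # permutation breaks the dependence

print(hist_mi(s1, s2), hist_mi(s1, s2_perm))  # the first is clearly larger
```

The permuted signal has exactly the same marginal distribution (the same multiset of samples) while its mutual information with the other signal collapses toward zero, which is the mechanism exploited in the bottom panel of Fig. 7.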
From the results of Experiments 1–5 we have the following
summary of observations. AMICA and RADICAL are the two
most computationally demanding algorithms, and both require
large sample sizes for satisfactory performance. In most cases,
they are about 1000 times slower than FastICA, EFICA, Pear-
sonICA and ICA-EBM. Furthermore, AMICA suffers from
poor local convergence, and fails frequently in our experiments.
Fig. 7. Performance comparison of seven ICA algorithms in the separation of artificial mixtures of real audio signals. In the top panel, the original speech signals
are used as the sources; while in the bottom panel, samples of all speech signals are independently and randomly permuted to reduce the statistical dependence
among sources. Each simulation point is averaged over 100 independent runs.
JADE, FastICA, EFICA and PearsonICA have reasonable
separation performance for sources of simple distributions.
Experiment 3 and our experience suggest that PearsonICA and
EFICA may be inconsistent, as they demonstrate either no or
very little performance gain with increasing sample size. In
fact, the density matching methods of PearsonICA and EFICA
are based on the matching of certain statistics, and a better
matching of certain statistics does not imply a higher likelihood
if the densities of sources are far from the assumed forms.
The quasi-maximum likelihood approach in  approximates the score function by a set of basis functions whose linear mixing coefficients are estimated under a mean square error (MSE) criterion, rather than a maximum likelihood criterion.
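This score-matching construction can be sketched from samples alone, using the integration-by-parts identity E[g'(x)] = E[g(x)ψ(x)] for a smooth score ψ; the basis functions and test density below are illustrative assumptions, not those of the cited method.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.laplace(size=200000)  # unit Laplacian: its true score is sign(x)

# Basis functions and their derivatives.
g = np.stack([x, np.tanh(x), x**3])
gp = np.stack([np.ones_like(x), 1 - np.tanh(x)**2, 3 * x**2])

# MSE-optimal coefficients c solve  E[g g^T] c = E[g psi],
# and integration by parts gives E[g(x) psi(x)] = E[g'(x)],
# so c is computable from the samples without knowing psi.
G = g @ g.T / x.size
b = gp.mean(axis=1)
c = np.linalg.solve(G, b)
score_hat = c @ g  # MSE approximation of the score sign(x)
```

A good MSE fit of the score does not by itself guarantee a high likelihood, which is the inconsistency concern raised above.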
Thus these ICA algorithms may be inconsistent in general. We would like to point out that for a large number of sources, JADE can be computationally demanding, as there are approximately  cumulant matrices to be jointly diagonalized by Givens rotations. FastICA, EFICA, PearsonICA, and ICA-EBM have similar computational complexity, and thus are computationally more attractive than JADE when the number of sources increases. When compared to the others, ICA-EBM is more attractive due to its superior separation performance, reliable convergence, moderate computational complexity, and high flexibility in density matching.
CONCLUSION
We introduced a new entropy estimator based on the principle of maximum entropy, studied the conditions for its existence, and proposed a numerical design method along with design examples. Based on this accurate entropy estimator, we developed an ICA algorithm, ICA by entropy bound minimization (ICA-EBM), which performs well in the separation of sources that come from different distributions.
It is important to note that the approach we presented is quite different from the traditional parametric approach, where a density model is chosen and the parameters of the density are estimated during the adaptation of the demixing matrix. Our approach realizes a wide class of probability density functions, but indirectly, as each measuring function represents a wide class of densities through the use of different values for the scalars in (4). The algorithm uses several simple measuring functions (four for the version we presented here) and then chooses the best density among the ones represented by these four. The small set of measuring functions we used can model skewed and symmetric, super- and sub-Gaussian, and unimodal and bimodal densities. As demonstrated by the experiments with simulated and real-world data, the performance of ICA-EBM is quite robust with the use of just these four nonlinearities. Also, it is important to remember that an exact density match is not critical for the performance of ICA algorithms. However, at the expense of some increase in computational complexity, we can easily add more measuring functions to the implementation and improve the performance of ICA-EBM even further. Also, if any prior information is available on the source distributions, e.g., on their multimodal nature or other such characteristics, we can easily design measuring functions of higher efficiency specifically for that problem.
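The maximum entropy principle underlying the estimator has a simple closed-form special case worth recalling: with the single measuring function G(x) = x^2, the maximum entropy bound is the entropy of a Gaussian with the same variance, which upper-bounds the differential entropy of any distribution with that variance. The general estimator in the paper uses richer measuring functions and a numerical procedure; the sketch below shows only this special case.

```python
import numpy as np

def gaussian_entropy_bound(x):
    """Max-entropy bound on differential entropy given only E[x^2]:
    the entropy of a Gaussian with the sample variance (in nats)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

# A unit-variance Laplacian source: its true entropy is
# 1 + log(sqrt(2)) ~= 1.347 nats, below the Gaussian bound.
rng = np.random.default_rng(5)
x = rng.laplace(scale=1 / np.sqrt(2), size=100000)
print(gaussian_entropy_bound(x))  # ~= 1.419, the Gaussian entropy
```

Minimizing the tightest such bound over a family of measuring functions, rather than only over G(x) = x^2, is what allows ICA-EBM to match sub-Gaussian, super-Gaussian, skewed, and bimodal sources.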
Another possibility is to use vector-valued measuring func-
tions in the entropy estimator. However, further research is
needed to determine if such an enhanced entropy estimator
provides significant improvement for the performance of ICA
algorithm. Another interesting study is entropy estimation of
complex random variables based on the principle of maximum
entropy, and its application to complex-valued ICA .
PROOF OF PROPOSITION 1
As  is bounded, we require that  grows slower than  so that all the considered integrals exist. From (8), we find that  is a constant. First, we show that for any , a solution  that satisfies (8) exists. Then, we show that for any  and ,  is the solution that simultaneously satisfies (7) and (8).
Existence of a Solution That Satisfies (8): It is straightforward to show that  and  are monotonically decreasing, convex functions of . We consider the upper and lower bounds of  and  with fixed  and . For , we have the following bounds: . For , we have the following bounds: . Comparing the bounds for  and , we note that there exists a value of  for which the lower bound  is always greater than or equal to the upper bound , i.e., for , we have . By solving the above inequality, we obtain . On the other hand, there is an  such that for , . By solving the above inequality, we obtain . Since  and  are monotonically decreasing and convex functions of , for given  and , there must exist a unique  in the range [see (23) at the bottom of the page] such that . The relationship of the bounds of the two functions and the existence of a unique  are demonstrated in Fig. 8.
Existence of a Solution That Satisfies (7) With Any : From (7), we have , where we write  as a function of , i.e., , since  is uniquely determined by (22) with a given . In the following, we write it as  for simplicity. We study the bounds of  and . It is clear that  and  are nonnegative, and achieve their lower bound zero by letting . To obtain the upper bounds of  and , we will use the definite integral shown in the equation at the bottom of the page, where  is the error function, and  is bounded. At the same time, from (23) we know that . Then for  we have the following limit: , where we made use of  in the first line,  and  in the third line, and the fact that an exponential grows much faster than a polynomial function in the last line. Similarly, we can show . Thus we have  and . Since both  and  with fixed  are continuous functions of , we conclude that there must exist at least one  that satisfies (7). This completes the proof of Proposition 1.
FAST CALCULATION OF
We introduce the following matrix: . Then  can be calculated as , where  is an arbitrary vector that is not orthogonal to , , and . Direct calculation involves the inverse of , and leads to a computational complexity of . We derive a recursive procedure to reduce this cost. By introducing the following matrix, we find that  is a sparse matrix where only the th column and the th row have nonzero elements. We write down the expression of  in (24), shown at the bottom of the page, and we can rewrite it compactly as . This matrix has rank 2, and we can obtain the following recursive equation for the inverse of  by using the matrix inversion lemma : . Thus all the inverses of  can be recursively calculated based on the previous result, the inverse of . The inverse of  can be quickly calculated from the inverse of . Although  is not a matrix of low rank,  is a sparse matrix of rank 2 for a proper exchanging matrix  that permutes  rowwise. Thus we can obtain the inverse of  by performing a similar rank-2 modification on the result for , as in (25).
Fig. 8. An example of the typical behavior of the two bounding functions, demonstrating the existence and uniqueness of the solution of (22) for bounded measuring functions.
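The rank-2 recursion described above rests on the matrix inversion lemma (Sherman-Morrison-Woodbury). A generic sketch of such a rank-2 inverse update, with hypothetical matrix names rather than the paper's notation, is:

```python
import numpy as np

def rank2_inverse_update(A_inv, U, V):
    """Given A_inv = inv(A), return inv(A + U @ V.T) for n x 2 factors U, V,
    via the matrix inversion lemma; costs O(n^2) instead of O(n^3)."""
    # inv(A + U V^T) = A_inv - A_inv U inv(I + V^T A_inv U) V^T A_inv
    K = np.linalg.inv(np.eye(2) + V.T @ A_inv @ U)  # only a 2 x 2 inverse
    return A_inv - A_inv @ U @ K @ V.T @ A_inv

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)) + 10 * np.eye(n)  # well-conditioned example
U, V = rng.standard_normal((n, 2)), rng.standard_normal((n, 2))
direct = np.linalg.inv(A + U @ V.T)
recursive = rank2_inverse_update(np.linalg.inv(A), U, V)
print(np.allclose(direct, recursive))  # → True
```

Applying such an update once per step, with each new matrix differing from the previous one by a sparse rank-2 term, is what reduces the overall cost of the recursion described above.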
 P. Comon, "Independent component analysis: A new concept?," Signal Process., vol. 36, no. 3, pp. 287–314, 1994.
 A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Anal-
ysis. New York: Wiley, 2001.
 A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. New York: Wiley, 2002.
 A. Hyvärinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Netw., vol. 13, no. 4–5, pp. 411–430, 2000.
 J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-Gaussian signals," IEE Proc. F, vol. 140, no. 6, pp. 362–370, 1993.
 J. Karvanen, J. Eriksson, and V. Koivunen, "Pearson system based method for blind separation," in Proc. 2nd Int. Workshop on Independ. Compon. Anal. Blind Signal Separation, Helsinki, Finland, 2000.
 Z. Koldovský, P. Tichavský, and E. Oja, "Efficient variant of algorithm FastICA for independent component analysis attaining the Cramér–Rao lower bound," IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1265–1277, 2006.
 P. Tichavský, Z. Koldovský, and E. Oja, "Speed and accuracy enhancement of linear ICA techniques using rational nonlinear functions," in Proc. ICA 2007, 2007, pp. 285–292.
 D. T. Pham and P. Garat, “Blind separation of mixture of independent
sources through a quasi-maximum likelihood approach,” IEEE Trans.
Signal Process., vol. 45, no. 7, pp. 1712–1725, 1997.
 J. A. Palmer, S. Makeig, K. Kreutz-Delgado, and B. D. Rao, "Newton method for the ICA mixture model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Las Vegas, NV, Apr. 2008.
 A. Bell and T. Sejnowski, “An information-maximization approach to
blind separation and blind deconvolution,” Neural Computat., vol. 7,
pp. 1129–1159, 1995.
 T.-W. Lee, M. Girolami, and T. J. Sejnowski, "Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources," Neural Computat., vol. 11, no. 2, pp. 417–441, 1999.
 A. Hyvärinen, "New approximations of differential entropy for independent component analysis and projection pursuit," in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 1998, vol. 10, pp. 273–279.
 D. Erdogmus, K. E. Hild, II, Y. N. Rao, and J. C. Principe, “Min-
imax mutual information approach for independent component anal-
ysis,” Neural Computat., vol. 16, no. 6, pp. 1235–1252, 2004.
 R. Boscolo, H. Pan, and V. P. Roychowdhury, “Independent compo-
nent analysis based on nonparametric density estimation,”IEEE Trans.
Neural Netw., vol. 15, no. 1, pp. 55–65, 2004.
 E. G. Learned-Miller et al., “ICA using spacings estimates of entropy,”
J. Mach. Learn. Res., vol. 4, pp. 1271–1295, 2003.
 D.-T. Pham and J. F. Cardoso, “Blind separation of instantaneous mix-
tures of nonstationary sources,” IEEE Trans. Signal Process., vol. 49,
no. 9, pp. 1837–1848, 2001.
 A. Belouchrani, K. A. Meraim, J. F. Cardoso, and E. Moulines, “A
blind source separation technique based on second order statistics,”
IEEE Trans. Signal Process., vol. 45, no. 2, pp. 434–444, 1997.
 A. Yeredor, “Blind separation of Gaussian sources via second-order
statistics with asymptotically optimal weighting,” IEEE Signal
Process. Lett., vol. 7, pp. 197–200, 2000.
 B. W. Silverman, Density Estimation for Statistics and Data Anal-
ysis. London, U.K.: Chapman and Hall, 1986.
 J. Beirlant, E. Dudewicz, L. Gyorfi, and E. van der Meulen, “Nonpara-
metric entropy estimation: An overview,” Int. J. Math. Statist. Sci., vol.
6, pp. 17–39, 1997.
 S. Fiori, “A theory for learning by weight flow on Stiefel-Grassman
manifold,” Neural Comput., vol. 13, no. 7, pp. 1625–1647, Jul. 2001.
 E. T. Jaynes, “Information theory and statistical mechanics,” Phys.
Rev., vol. 106, pp. 620–630, 1957.
 Unsupervised Adaptive Filtering, Volume 1, Blind Source Separation, S. Haykin, Ed. New York: Wiley, 2000.
 J. M. Einbu, "On the existence of a class of maximum-entropy probability density functions," IEEE Trans. Inf. Theory, vol. IT-23, pp. 772–775, Nov. 1977.
 P. Ishwar and P. Moulin, “On the existence and characterization of
the maxent distribution under general moment inequality constraints,”
IEEE Trans. Inf. Theory, vol. 51, pp. 3322–3333, Sep. 2005.
 J. F. Cardoso, “On the performance of orthogonal source separation
algorithms,” in Proc. Eur. Assoc. Signal Process. Signal Process. VII,
’94, Edinburgh, Scotland, 1994, pp. 776–779.
 J. F. Cardoso, “On the stability of source separation algorithms,” J.
VLSI Signal Process. Syst., vol. 26, no. 1–2, pp. 7–14, 2000.
 O. Shalvi and E. Weinstein, "Super-exponential method for blind deconvolution," IEEE Trans. Inf. Theory, vol. 39, pp. 504–519, Mar. 1993.
 X.-L. Li, "A new gradient search interpretation of super-exponential algorithm," IEEE Signal Process. Lett., vol. 13, no. 3, pp. 173–176, 2006.
 V. Zarzoso and P. Comon, "Comparative speed analysis of FastICA," in Proc. ICA'07, 2007, pp. 293–300.
 X.-L. Li and T. Adalı, “A novel entropy estimator and its application
to ICA,” in Proc. IEEE Workshop on Mach. Learn. Signal Process.,
Grenoble, France, Sep. 2009.
 T. Adalı, H. Li, M. Novey, and J. F. Cardoso, "Complex ICA using nonlinear functions," IEEE Trans. Signal Process., vol. 56, no. 9, pp. 4536–4544, Sep. 2008.
 S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems 1995. Boston, MA: MIT Press, 1996, pp. 752–763.
 M. Jones and R. Sibson, “What is projection pursuit?,” J. Royal Statist.
Soc. A, vol. 150, no. 1, pp. 1–36, 1987.
 H. Li and T. Adalı, “Stability analysis of complex maximum likeli-
hood ICA using Wirtinger calculus,” in Proc. IEEE Int. Conf. Acoust.,
Speech, Signal Process. (ICASSP), Las Vegas, NV, Apr. 2008.
 X.-L. Li and X.-D. Zhang, "Nonorthogonal joint diagonalization free of degenerate solution," IEEE Trans. Signal Process., vol. 55, no. 5, pp. 1803–1814, May 2007.
 H. Lütkepohl, Handbook of Matrices. New York: Wiley, 1996.
 A. Cichocki et al., ICALAB Toolboxes [Online]. Available:
 H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005.
 J. W. Xu, D. Erdogmus, Y. N. Rao, and J. C. Principe, "Minimax mutual information approach for ICA of complex-valued linear mixtures," in Proc. ICA'04, Granada, Spain, Sep. 2004, pp. 311–318.
Xi-Lin Li received the B.S. and M.S. degrees in electrical engineering from the Dalian University of Technology, Dalian, China, in 2001 and 2004, respectively, and the Ph.D. degree in electrical engineering from Tsinghua University, Beijing, China, in 2008.
From 2008 to 2009, he was a researcher with
ForteMedia, Inc. Since 2009, he has been a Research
Associate with the Machine Learning for Signal
Processing Lab, University of Maryland, Baltimore
County. His research interests include speech signal
processing, blind source separation, and complex
valued signal processing.
Tülay Adalı (S’89–M’93–SM’98–F’09) received
the Ph.D. degree in electrical engineering from
North Carolina State University, Raleigh, in 1992.
She joined the faculty of the University of Maryland Baltimore County (UMBC), Baltimore, in 1992.
She is currently a Professor with the Department
of Computer Science and Electrical Engineering,
UMBC. Her research interests are in the areas of
statistical signal processing, machine learning for
signal processing, and biomedical data analysis.
Dr. Adalı was the General Co-Chair, NNSP
(2001–2003); Technical Chair, MLSP (2004–2008); Publicity Chair, ICASSP
(2000 and 2005); Publications Co-Chair, ICASSP 2008; and Program
Co-Chair, 2009 International Conference on Independent Component Analysis
and Source Separation, 2009 MLSP. She chaired the IEEE SPS Machine
Learning for Signal Processing Technical Committee (2003–2005); Member,
SPS Conference Board (1998–2006); Member, Bio Imaging and Signal
Processing Technical Committee (2004–2007); and was an Associate Editor
for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (2003–2006), and the
Elsevier Signal Processing Journal (2007–2010). She is currently Chair of
Technical Committee 14: Signal Analysis for Machine Intelligence of the
International Association for Pattern Recognition; Member, Machine Learning
for Signal Processing and Signal Processing Theory and Methods technical
committees; an Associate Editor for the IEEE TRANSACTIONS ON BIOMEDICAL
ENGINEERING and JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL,
IMAGE, AND VIDEO TECHNOLOGY, and Senior Editorial Board member of the
IEEE JOURNAL OF SELECTED AREAS IN SIGNAL PROCESSING. She is a Fellow
of the AIMBE and the past recipient of an NSF CAREER Award.