
A Recursive Sparse Blind Source Separation Method and its Application to Correlated Data in NMR Spectroscopy of Biofluids

Yuanchang Sun∗ and Jack Xin∗

Abstract

Motivated by the nuclear magnetic resonance (NMR) spectroscopy of bioflu-

ids (urine and blood serum), we present a recursive blind source separation

(rBSS) method for nonnegative and correlated data. A BSS problem arises when

one attempts to recover a set of source signals from a set of mixture signals with-

out knowing the mixing process. Various approaches have been developed to

solve BSS problems relying on the assumption of statistical independence of the

source signals. However, signal independence is not guaranteed in many real-

world data like the NMR spectra of chemical compounds. The rBSS method

introduced in this paper deals with the nonnegative and correlated signals aris-

ing in NMR spectroscopy of biofluids. The statistical independence requirement

is replaced by a constraint which requires dominant interval(s) from each source

signal over some of the other source signals in a hierarchical manner. This con-

dition is applicable for many real-world signals such as NMR spectra of urine

and blood serum for metabolic fingerprinting and disease diagnosis. Exploiting

the hierarchically dominant intervals from the source signals, the rBSS method

reduces the BSS problem into a series of sub-BSS problems by a combination

of data clustering, linear programming, and successive elimination of variables.

Then in each sub-BSS problem, an $\ell_1$ minimization problem is formulated for

recovering the source signals in a sparse transformed domain. The method

is substantiated by examples from NMR spectroscopy data and shows promise for separation and detection in complex chemical spectra without requiring expensive multi-dimensional NMR data.

∗Department of Mathematics, University of California at Irvine, Irvine, CA 92697, USA.


1 Introduction

Blind source separation (BSS) is a major area of research in signal and image process-

ing. It aims at recovering source signals from their mixtures without detailed knowl-

edge of the mixing process. Applications of BSS include signal analysis and processing

of speech, image, and biomedical signals, in particular signal extraction, enhancement, denoising, model reduction, and classification problems [7]. Recently, nonnegative BSS has received wide attention in various fields such as computed tomography, biomedical image processing, and analytical chemistry [2, 3, 13, 16, 17, 22, 24, 25, 26, 27, 28, 29, 30, 32], where nonnegativity constraints are imposed on the mixing process and/or the estimated source signals. The nonnegative BSS problem is defined by the following

matrix model

$$X = AS, \quad \text{with } A_{ij} \ge 0,\ S_{ij} \ge 0, \tag{1.1}$$

where $X \in \mathbb{R}^{m \times p}$ is the mixture matrix containing known mixture signals as its rows, $S \in \mathbb{R}^{n \times p}$ is the unknown source matrix, and $A \in \mathbb{R}^{m \times n}$ is the unknown mixing matrix.

The dimensions of the matrices are expressed in terms of three numbers: (1) p is the

number of available samples, (2) m is the number of mixture signals, and (3) n is

the number of source signals. Both X and S are sampled functions of an acquisition

variable which may be time, frequency, position, or wavenumber depending on the

measurement device. The mathematical problem is to estimate nonnegative A and S from X. The problem is also known as nonnegative matrix factorization (NMF) [17].

Similar to factorizing a composite number (48 = 6 × 8 = 8 × 6 = 4 × 12 = 12 × 4 = 2 × 24 = 24 × 2 = 3 × 16 = 16 × 3), there are permutation and scaling ambiguities in solutions to BSS. For any permutation matrix $P$ and invertible diagonal matrix $\Lambda$, $(AP\Lambda,\ \Lambda^{-1}P^{-1}S)$ is another pair equivalent to the solution $(A, S)$, since

$$X = AS = (AP\Lambda)(\Lambda^{-1}P^{-1}S). \tag{1.2}$$
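The ambiguity in (1.2) can be checked numerically on a small case. The following is a minimal sketch, not taken from the paper: the matrices `A`, `S`, `P`, and `Lam` are hypothetical values chosen for illustration, and `matmul` is a helper written here to keep the example self-contained.

```python
# Illustrative sketch: permutation and scaling ambiguity in X = AS.
# All numerical values below are hypothetical and chosen only for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Nonnegative 2x2 mixing matrix A and 2x3 source matrix S.
A = [[1.0, 2.0],
     [3.0, 1.0]]
S = [[0.0, 1.0, 2.0],
     [4.0, 0.0, 1.0]]

P = [[0.0, 1.0],        # permutation matrix swapping the two sources
     [1.0, 0.0]]
P_inv = [[0.0, 1.0],    # the inverse of a permutation matrix is its transpose
         [1.0, 0.0]]
Lam = [[2.0, 0.0],      # diagonal scaling with positive entries
       [0.0, 0.5]]
Lam_inv = [[0.5, 0.0],
           [0.0, 2.0]]

X = matmul(A, S)
A_alt = matmul(matmul(A, P), Lam)           # A P Lambda
S_alt = matmul(matmul(Lam_inv, P_inv), S)   # Lambda^{-1} P^{-1} S
X_alt = matmul(A_alt, S_alt)                # reproduces the same mixtures
```

Here `A_alt` and `S_alt` differ from `A` and `S`, yet `X_alt` equals `X`, so the mixtures alone cannot distinguish the two factorizations. The scaling is taken with positive diagonal entries so that the alternative factors stay nonnegative.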

Various approaches, methods, and techniques to BSS problems have been devel-

oped relying on a priori knowledge of source signals such as spatio-temporal decorre-

lation, statistical independence, sparseness, nonnegativity, etc., [6, 7, 11, 15, 18, 19,

20, 24, 29, 30]. For instance, independent component analysis (ICA) [8, 9] recovers

statistically independent source signals and mixing matrix A. Recently there have

been several studies of nonnegative ICA and its applications. For example, [26, 27]

offer theoretical and algorithmic studies of nonnegative ICA. Statistical independence requires uncorrelated source signals; however, this condition does not always hold in real-world problems. For example, statistical independence should not be assumed for the NMR spectra of many chemical compounds, because they are known to be correlated when the molecules responsible for each source share common structural features. Moreover, the properly phased absorption-mode NMR spectral signals from a single-pulse experiment are positive [12]. Recently, there has been considerable activity on nonnegative BSS in NMR spectroscopy, with applications in identification of organic compounds, metabolic fingerprinting, and disease diagnosis [1, 16, 24, 27, 28, 30, 31, 33, 34]. For example, Naanaa and Nuzillard (NN) proposed

a nonnegative BSS method in [24] based on a strict local sparseness assumption of

the source signals. The NN assumption (NNA) requires the source signals to be

strictly non-overlapping at some locations of acquisition variable (e.g., frequency). In


Figure 1.1: NMR spectra of two chemical compounds. In the circled region, β-sitosterol (blue) has a stand-alone peak. Clearly, menthol (red) does not have such a region.

other words, each source signal must have a stand-alone peak where the other sources are strictly zero. Such a strict sparseness condition leads to a dramatic mathematical simplification of the general nonnegative matrix factorization problem (1.1), which

is non-convex. Geometrically speaking, the problem of finding the mixing matrix A

reduces to the identification of a minimal cone containing the columns of mixture

matrix X. The latter can be done by linear programming. In fact, NN’s sparseness assumption and the geometric construction of columns of A were known in the 1990s [2, 32] in the problem of blind hyperspectral unmixing, where the same mathematical model (1.1) is used. The analogue of NN’s assumption is called the pixel purity assumption [5]. The resulting geometric (cone) method is the so-called N-findr [32], and is now a benchmark in hyperspectral unmixing. NN’s method can be viewed as an

application of N-findr to NMR data. Measured NMR data may not strictly satisfy NN’s sparseness conditions, which introduces spurious peaks in the results. Postprocessing methods are needed to address the resulting errors. Such a study has been performed recently in the case of (over-)determined mixtures [29], where it is found that larger peaks in the signals are more reliable and can be used to minimize errors due to the lack of strict sparseness. However, the geometric cone method (NN method, N-findr method) and its postprocessing would fail if the measured data do not satisfy the NN assumption. Therefore, there is a need for new BSS methods which can

separate non-NNA source signals. The following two examples show that a different

condition on source signals is called for in this regard.
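The cone picture described above can be illustrated in the simplest setting of two mixtures and two sources. The sketch below is not the NN or N-findr algorithm: it replaces the linear programming step by a comparison of angles, which is valid only in two dimensions. The function name `cone_edges_2d`, the matrices `A` and `S`, and the normalization of each column of `A` to sum to one are assumptions made for illustration.

```python
import math

def cone_edges_2d(X):
    """Return the two extreme column directions of a 2 x p nonnegative matrix X.

    Each direction is rescaled so that its two entries sum to one.
    This angle-based search is a 2D stand-in for the linear programming
    step used in higher dimensions.
    """
    cols = [(X[0][j], X[1][j]) for j in range(len(X[0]))
            if X[0][j] + X[1][j] > 0]          # skip all-zero columns

    def angle(c):
        return math.atan2(c[1], c[0])

    def normalize(c):
        total = c[0] + c[1]
        return (c[0] / total, c[1] / total)

    return normalize(min(cols, key=angle)), normalize(max(cols, key=angle))

# Hypothetical mixing matrix (columns sum to one) and sources satisfying NNA:
# source 1 stands alone in column 0, source 2 stands alone in column 3.
A = [[0.8, 0.3],
     [0.2, 0.7]]
S = [[2.0, 1.0, 0.5, 0.0],
     [0.0, 1.0, 3.0, 4.0]]
X = [[sum(A[i][k] * S[k][j] for k in range(2)) for j in range(4)]
     for i in range(2)]

edge_a, edge_b = cone_edges_2d(X)   # recovers the columns of A up to rounding
```

Because every column of `X` is a nonnegative combination of the columns of `A`, and the stand-alone positions give columns of `X` that are multiples of a single column of `A`, the two extreme directions coincide with the columns of `A`. If a source such as menthol in Fig. 1.1 had no stand-alone position, the corresponding edge of the cone would not be a column of `A`, which is the failure mode discussed above.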

Example 1: Consider the NMR spectra of two chemical compounds, β-sitosterol and menthol, in Fig. 1.1. As shown in the figure, β-sitosterol (blue) has stand-alone peaks, whereas menthol (red) does not have such a peak. Hence NNA does not hold. Instead, β-sitosterol overlaps with menthol over the acquisition region and has dominant intervals over menthol in their NMR spectra.


Figure 1.2: Examples of standard NMR spectra of serum and urine, showing repre-

sentative structural complexity produced by multiple metabolite signals (plot from

[1]).

Example 2: The data in Fig. 1.2 are from NMR spectroscopy of urine and blood serum. The complicated NMR spectra contain both wide-peak source signals and narrow-peak source signals. For example, the blood serum has constituents with wide spectral peaks which overlap others almost over the whole acquisition region. Similar signal peaks are observed in urine NMR spectrum. NN’s method and its postprocessing would not work for this type of data.

The above two examples indicate that new BSS methods should be developed for

these non-NNA signals. For urine-type NMR data, the method needs to be able to separate signals with wide spectral peaks from narrow-peak signals. The method should also handle signals with dominant intervals over one another, such as the data in Example 1: although there are no wide-peak signals, one source dominates the other over the region. In this paper, we shall develop a new BSS method to separate these

two types of non-NNA data. This work is mainly motivated by NMR spectroscopy

of urine and blood serum. Analysis of NMR spectra of biofluids such as urine and

blood serum can provide extremely important information for metabolic fingerprinting

and disease diagnosis (see [1, 31, 33, 34] and references therein). Identification and assignment of constituents in urine samples depend heavily on 2D NMR spectra. However, the complexity of urinary composition makes complete assignment of the urinary spectra difficult, mainly because of the lack of reference spectra for unknown metabolites. Consequently, as of now, only about one-third of detectable urinary metabolites have been assigned unambiguously [34]. A similar situation exists in the NMR spectroscopy of blood serum. Our method can be used to separate and detect the unknown sources in the residuals of a regular spectral fitting with reference spectra data. In this context, it is unnecessary to separate all the source signals from urine- and serum-type data in a completely blind fashion. Our hope is to offer an assistive computational tool to produce a short list of possible unknown sources for a


knowledgeable chemist to pursue further analysis.

The main challenge of the non-NNA problem we face is that the complicated

NMR spectra contain both wide-peak source signals and narrow-peak source signals (in urine and serum NMR spectra). As a result, the mixing matrix A cannot be recovered from the data matrix X independently of S as in [24], and so A and S are much more coupled. This paper uses a divide-and-conquer strategy to retrieve A and

S in a recursive way. The proposed method splits the source separation process into

two major steps. The first step is a backward procedure where clustering and linear

programming techniques are employed to recursively identify columns of the mixing

matrix while simultaneously eliminating source variables. The first step also serves to

convexify the original non-convex matrix factorization problem. After this step, half of the unknowns are estimated. The second step is a forward step that solves a sequence of $\ell_1$-regularized convex optimization problems to recover the source signals. It should be pointed out that although the method is motivated by the NMR spectroscopy of biofluids, the

underlying ideas certainly can be generalized to other source separation applications.

The paper is outlined as follows. In section 2, we propose a new condition on the

source signals motivated by NMR spectroscopy data of biofluids. In section 3, we

introduce our recursive BSS method. In section 4, we further illustrate our method

with numerical examples including the processing of an experimental NMR data set.

Section 5 is the conclusion. We shall use the following notation throughout the paper. The notation $A^j$ stands for the j-th column of matrix A, $S^j$ for the j-th column of matrix S, and $X^j$ for the j-th column of matrix X, while $S_j$ and $X_j$ are the j-th rows of matrices S and X, or the j-th source and mixture, respectively.

This work was partially supported by NSF-ADT grant DMS-0911277 and NSF

grant DMS-0712881. The authors thank Professor A.J. Shaka and Dr. Hasan Celik

for helpful discussions and their experimental NMR data.

2 Assumption on Source Signals

Let us consider the determined case (m = n). The results can be easily extended to the over-determined case (m > n). Consider the linear model (1.1), where each column in X represents data collected at a particular value of the acquisition variable, and each row represents a mixture spectrum.

Recently the authors developed a postprocessing approach that improves NN results when mixture data are abundant and improves mixing matrix estimation with corrections based on major peaks [29]. The work in [29] actually considered a relaxed NNA (rNNA) condition:

Assumption (rNNA): For each $i \in \{1, 2, \dots, n\}$ there exists a $j_i \in \{1, 2, \dots, p\}$ such that $s_{i,j_i} > 0$ and $s_{k,j_i} = \epsilon_k$ ($k = 1, \dots, i-1, i+1, \dots, n$), where $\epsilon_k \ll s_{i,j_i}$.

Simply said, each source signal has a dominant peak at an acquisition position where the other sources are allowed to be nonzero. NNA results if all $\epsilon_k = 0$. The rNNA is more realistic and robust than the ideal NNA for real-world NMR data [24].
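As a concrete reading of the rNNA condition, the following sketch searches a small source matrix for the positions $j_i$. It is an illustration only: the function name `dominant_positions`, the threshold `ratio` (a hypothetical stand-in for the informal requirement $\epsilon_k \ll s_{i,j_i}$), and the example matrix `S` are assumptions, not part of the paper.

```python
def dominant_positions(S, ratio=0.05):
    """For each source i, find a column j_i where source i dominates.

    `ratio` is a hypothetical threshold standing in for "epsilon_k << s_{i,j_i}":
    column j qualifies if s_ij > 0 and every other source k satisfies
    s_kj <= ratio * s_ij. With ratio=0 this becomes the strict NNA test.
    Returns a list of 0-based j_i, with None where no column qualifies.
    """
    n, p = len(S), len(S[0])
    result = []
    for i in range(n):
        best_j, best_rel = None, float("inf")
        for j in range(p):
            if S[i][j] <= 0:
                continue
            # Largest competing source value at this acquisition position.
            others = max((S[k][j] for k in range(n) if k != i), default=0.0)
            rel = others / S[i][j]
            if rel <= ratio and rel < best_rel:
                best_j, best_rel = j, rel
        result.append(best_j)
    return result

# Hypothetical sources: each has a dominant peak, but neither is ever
# strictly alone, so rNNA holds while NNA does not.
S = [[1.00, 0.30, 0.02, 0.10],
     [0.01, 0.40, 0.90, 0.20]]

relaxed = dominant_positions(S)            # [0, 2]
strict = dominant_positions(S, ratio=0)    # [None, None]
```

With the default threshold, source 1 dominates at column 0 and source 2 at column 2, whereas the strict test finds no stand-alone position for either source. The choice of `ratio` is arbitrary here; in practice it would depend on the noise level and the peak heights of the data.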
