A General and Unifying Framework for Feature

Construction, in Image-Based Pattern

Classification

Nematollah Batmanghelich1, Ben Taskar2, and Christos Davatzikos1

1Section of Biomedical Image Analysis, Radiology Department, University of

Pennsylvania, Philadelphia PA 19014, USA

2Computer and Information Department, University of Pennsylvania, Philadelphia

PA 19104, USA

{batmangh@seas,taskar@cis,christos@rad}.upenn.edu

Abstract. This paper presents a general and unifying optimization

framework for the problem of feature extraction and reduction for high-

dimensional pattern classification of medical images. Feature extraction

is often an ad hoc and case-specific task. Herein, we formulate it as a

problem of sparse decomposition of images into a basis that is desired to

possess several properties: 1) Sparsity and local spatial support, which

usually provides good generalization ability on new samples, and lends

itself to anatomically intuitive interpretations; 2) good discrimination

ability, so that projection of images onto the optimal basis yields dis-

criminant features to be used in a machine learning paradigm; 3) spa-

tial smoothness and contiguity of the estimated basis functions. Our

method yields a parts-based representation, which guarantees that the

image is decomposed into a number of positive regional projections. A

non-negative matrix factorization scheme is used, and the resulting problem

is solved by a numerical scheme with proven convergence. Results on the classification

of Alzheimer's patients from the ADNI study are presented.

1 Introduction

Voxel-based analysis (VBA) has been widely used in the medical imaging com-

munity. It typically consists of mapping image data to a standard template space,

and then applying voxel-wise linear statistical tests on a Jacobian determinant

[1], [2], transformation-residuals [3], or tissue density maps [4], [5] or directly on

voxel intensity (e.g. diffusion imaging [6]). It therefore identifies regions in which

two groups differ (e.g. patients and controls [2]), or regions in which other vari-

ables (e.g. disease severity [7]) correlate with imaging measurements. However,

this method has limited ability to identify complex population differences, be-

cause it does not take into account the multivariate relationships in data [8], [9].

Moreover, since typically no single anatomical region offers sufficient sensitivity

and specificity in identifying pathologies that span multiple anatomical regions,

it has very limited diagnostic power on an individual basis. In other words, val-

ues of voxels or ROIs showing significant group difference are not necessarily

good discriminants when one wants to classify individuals into groups.

J.L. Prince, D.L. Pham, and K.J. Myers (Eds.): IPMI 2009, LNCS 5636, pp. 423–434, 2009.

© Springer-Verlag Berlin Heidelberg 2009

In order to overcome the limitations, high-dimensional pattern classification

methods have been proposed in the relatively recent literature [9,10,11,12,13],

which capture multi-variate nonlinear relationships in the data, and aim to

achieve high classification accuracy of individual scans. A fundamental difficulty

in these methods has been the availability of enough training samples, relative

to the high dimensionality of the data. A critical problem has therefore persisted

in these methods, namely how to optimally perform feature extraction and se-

lection, i.e. to find a parsimonious set of image features that best differentiate

between two or more groups, and which generalize well to new samples.

Feature reduction methods can be categorized into two general families: 1) fea-

ture selection and 2) feature construction [14]. Feature selection methods (e.g.

SVM-RFE [15]) have two problems: first, they do not scale up for medical im-

ages; second, they do not consider domain knowledge (in our case: the fact that

data is coming from images) thus they may end up selecting a subset of fea-

tures which is not biologically interpretable. Another family of feature reduction

methods includes feature construction like PCA, LDA or other linear or non-

linear transformations. These methods can take into account domain knowledge

but they face several issues. First, constructed features do not have local

support, but are typically extracted from spatially extensive and overlapping

regions. Second, they use both positive and negative weights, which makes

anatomical interpretation difficult. Finally, the number of basis vectors is usually

bounded by the number of samples, which is usually less than the dimensionality

of features.

In this paper, we propose a novel method which falls into the feature con-

struction category. Finding optimal linear construction can be viewed as finding

a linear transformation, i.e. basis matrix, which is to be estimated from data

according to some desired properties that are discussed next. 1) The basis must

be biologically meaningful: this means that a constructed basis vector should

correspond to contiguous anatomical regions preferably in areas which are bio-

logically related to a pathology of interest. Having local spatial support can be

viewed mathematically as sparsity of a basis vector in combining voxel values.

2) The basis must be discriminant: we are interested in finding features, i.e.

projection onto the basis, that construct spatial patterns that best differentiate

between groups, e.g. patients and controls. 3) The basis must be representative

of the data: in order to represent data, we derive a basis matrix with afore-

mentioned properties and corresponding loadings. Matrix factorization has been

adopted as a framework. Having simultaneously representative and parsimonious

representation of an image is usually referred to as a parts-based representation in

the literature. A specific variant of Matrix Factorization (MF) which is confined

to be nonnegative (NMF) has been shown experimentally [16], and under some

conditions mathematically [17], to yield parts-based representations of an im-

age. Since general NMF does not consider that underlying data is an image, we

have introduced a Markovian prior to address this issue. Furthermore, we have

an extra prior to enforce sparsity (parts-based representation) of an image. 4)

Generalization: the proposed method is general and can be applied to a wide

variety of problems and data sets without significant adjustments. In this paper,

we have formulated our problem as an optimization problem that seeks to satisfy

the four criteria above. Moreover, we propose a novel numerical scheme, with

a proof of convergence, to solve it. Unlike in LDA and PCA, the number of basis

vectors is not bounded by the number of samples in our method; thus we are able

to have more basis vectors than samples.

In the Methods section, we first discuss the idea of matrix factorization in

general and NMF in particular (Sect. 2.1). In the subsequent sections, a likelihood

term (Sect. 2.2) and proper regularization terms are introduced (Sect. 2.3, 2.4).

In Sect. 2.5, the final optimization problem is formed and a proper method is

suggested to solve it. In the Results section (Sect. 3), we apply our method to the

problem of classification of Alzheimer’s disease patients and healthy controls.

2 Methods

2.1 General Formulation

Let us assume that we collect data into a matrix, $X \in \mathbb{R}_+^{D \times N}$, such that each column $x_i$ represents one image. This can be done by lexicographical ordering of voxels. D is the number of voxels and N is the number of samples. For this case, we assume that the $x_i$'s reside in the positive orthant, which is a reasonable assumption for medical images. The goal is to decompose the data matrix, X, into a positive matrix, B, whose columns are the constructed basis vectors, and a loadings matrix, C, which holds the corresponding loadings of the basis, namely X ≈ BC. The elements of C will form the features extracted from the data via projection on B; they will be subsequently used for classification. In the literature, this decomposition is called Non-Negative Matrix Factorization (NMF). It is straightforward to verify that this is an ill-posed problem: for example, rescaling the columns of B by any positive diagonal matrix S and the rows of C by its inverse leaves the product BC unchanged. Hence, regularization is necessary. We formulate the problem as a MAP (Maximum a Posteriori) estimation problem as follows:

\[
p(B,C|X) = \frac{p(X|B,C)\,p(B,C)}{p(X)} = \frac{p(X|B,C)\,p(B)\,p(C)}{p(X)} \tag{1}
\]

Here, we assumed that B and C are independent. Since p(X) does not depend on B or C, the MAP estimation problem is formulated as an optimization problem as follows:

\[
\max_{B,C} \log p(B,C|X) \equiv \max_{B,C} \; \log p(X|B,C) + \log p(B) + \log p(C) \tag{2}
\]

in which the first term on the right-hand side is a likelihood term and the second and third terms are priors for B and C, respectively. Thus, we need to choose proper priors and a proper likelihood function according to our problem. In general, NMF can be written as the following optimization problem:

\[
\min_{B,C > 0} \; D(X;BC) + \alpha(B) + \beta(C) \tag{3}
\]

where D(X;BC) is a negative log-likelihood function that measures the goodness of fit, and where the second (α(B)) and third (β(C)) terms form negative log priors on B and C. Next, we discuss different choices for D(.,.), α(.), and β(.).
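To make the factorization X ≈ BC concrete, the following is a minimal sketch of plain, unregularized NMF under the KL divergence, using the standard Lee-Seung multiplicative updates. It is not the regularized solver proposed in this paper: the α(B) and β(C) terms of (3) are omitted, and the function names, toy data, iteration count, and the small constant `eps` are illustrative assumptions.

```python
import numpy as np

def kl_divergence(X, Y, eps=1e-12):
    """Generalized KL divergence sum(x log(x/y) - x + y) over all entries."""
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))

def nmf_kl(X, r, n_iter=200, seed=0, eps=1e-12):
    """Unregularized NMF, X ~ B @ C, via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    B = rng.random((D, r)) + 0.1   # D x r basis (columns are basis vectors)
    C = rng.random((r, N)) + 0.1   # r x N loadings (columns are features)
    history = []
    for _ in range(n_iter):
        R = X / (B @ C + eps)                              # element-wise ratio
        C *= (B.T @ R) / (B.sum(axis=0)[:, None] + eps)    # update loadings
        R = X / (B @ C + eps)
        B *= (R @ C.T) / (C.sum(axis=1)[None, :] + eps)    # update basis
        history.append(kl_divergence(X, B @ C))
    return B, C, history

# Toy data (hypothetical): D = 30 "voxels", N = 8 "images".
rng = np.random.default_rng(1)
X = rng.random((30, 8)) + 0.1
B, C, history = nmf_kl(X, r=3)
print(B.shape, C.shape, history[0] > history[-1])
```

Each column of `C` is the feature vector of one image. Because the updates are multiplicative, entries that start positive stay non-negative, so no explicit projection step is needed in this sketch.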

2.2 Likelihood Term: D(X;BC)

As it is discussed in [18], given a convex function $\phi : S \subseteq \mathbb{R} \to \mathbb{R}$, Bregman divergence is a family of D(.,.) functions $D_\phi : S \times \mathrm{int}(S) \to \mathbb{R}_+$ which are defined as follows:

\[
D_\phi(x;y) := \phi(x) - \phi(y) - \phi'(y)(x - y) \tag{4}
\]

where int(S) is the interior of set S. For cases in which x and y are matrices, it can be augmented as summation over all elements of a matrix:

\[
D_\phi(X;Y) := \sum_{ij} D_\phi(x_{ij}; y_{ij}) \tag{5}
\]

In this paper, we used $\phi(x) = x \log x$, which readily converts (5) to the KL-Divergence:

\[
D_\phi(X;BC) = \sum_{ij} x_{ij} \log \frac{x_{ij}}{\sum_k b_{ik} c_{kj}} - \sum_{ij} x_{ij} + \sum_{ijk} b_{ik} c_{kj} \tag{6}
\]

It is worth mentioning that other choices for $\phi$ are also possible (e.g. $\frac{1}{2}x^2$) and they yield other distance measures (e.g. Frobenius distance between matrices).
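As a numerical check of the definitions above, the sketch below evaluates the element-wise Bregman divergence of (4), summed as in (5), and compares it with the closed form (6) for φ(x) = x log x and with half the squared Frobenius distance for φ(x) = x²/2. The function names and the random test matrices are illustrative assumptions; `Y` stands in for the product BC.

```python
import numpy as np

def bregman(X, Y, phi, dphi):
    """Element-wise Bregman divergence of Eq. (4), summed as in Eq. (5)."""
    return float(np.sum(phi(X) - phi(Y) - dphi(Y) * (X - Y)))

def generalized_kl(X, Y):
    """Closed form of Eq. (6), with Y standing in for the product BC."""
    return float(np.sum(X * np.log(X / Y) - X + Y))

rng = np.random.default_rng(0)
X = rng.random((4, 5)) + 0.5      # strictly positive toy matrices
Y = rng.random((4, 5)) + 0.5

# phi(x) = x log x, with derivative log x + 1, recovers the KL divergence.
kl_bregman = bregman(X, Y, lambda x: x * np.log(x), lambda x: np.log(x) + 1.0)
kl_closed = generalized_kl(X, Y)

# phi(x) = x^2 / 2 gives half the squared Frobenius distance instead.
frob_bregman = bregman(X, Y, lambda x: 0.5 * x**2, lambda x: x)
frob_closed = 0.5 * np.linalg.norm(X - Y, "fro") ** 2

print(np.isclose(kl_bregman, kl_closed), np.isclose(frob_bregman, frob_closed))
```

The agreement follows from expanding (4): for φ(x) = x log x the terms in y log y cancel, leaving x log(x/y) − x + y for each element.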

2.3 Regularizing the Basis: α(B)

The regularization term can be broken down into two terms according to respec-

tive criteria that will be discussed in more detail in this section:

\[
\alpha(B) = \alpha_1(B) + \alpha_2(B) \tag{7}
\]

In our implementation, each regularization term has a weighting term which

determines its contribution; however, we have omitted the weighting terms for

the sake of simplicity in the notation.

It is reasonable to assume that anatomical regions are expected to display

similar structural and functional characteristics, hence voxels should be grouped

together into regional features. As discussed in the Introduction, local support

and sparsity are two desirable properties which both can be achieved using the

following terms:

\[
\alpha_1(B) = \mathbf{1}^T B^T B \mathbf{1}, \qquad \|b_i\|_1 = 1 \tag{8}
\]

In order to see why this regularization enforces a parts-based representation, we

should interpret it mathematically. A parts-based representation means that we

do not want our basis vectors, $b_i$, to have a lot of overlap with each other.

Considering the fact that the basis vectors are positive (hence, bounded below), having

the least overlap can be translated to orthogonality. Mathematically speaking,

$\langle b_i, b_j \rangle \approx 0$ if $i \neq j$, which means that the off-diagonal elements of $B^T B$ should be

minimized.
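The effect of the overlap term can be illustrated with a small sketch. It evaluates α1(B) from (8) for two hypothetical bases whose columns have unit L1 norm and the same sparsity: one with disjoint supports and one whose two basis vectors share the same support. The helper names and the toy matrices are assumptions made for illustration only.

```python
import numpy as np

def overlap_penalty(B):
    """alpha_1(B) = 1^T B^T B 1, the sum of all entries of the Gram matrix."""
    ones = np.ones(B.shape[1])
    return float(ones @ (B.T @ B) @ ones)

def normalize_columns(B):
    """Scale each column to unit L1 norm, as in the constraint of Eq. (8)."""
    return B / np.abs(B).sum(axis=0, keepdims=True)

# Two non-negative bases over D = 6 voxels, each with r = 2 basis vectors.
disjoint = normalize_columns(np.array([[1, 0], [1, 0], [1, 0],
                                       [0, 1], [0, 1], [0, 1]], dtype=float))
shared = normalize_columns(np.array([[1, 1], [1, 1], [1, 1],
                                     [0, 0], [0, 0], [0, 0]], dtype=float))

p_disjoint = overlap_penalty(disjoint)
p_shared = overlap_penalty(shared)

# Off-diagonal Gram entry <b_1, b_2>: zero only when the supports are disjoint.
offdiag_disjoint = float((disjoint.T @ disjoint)[0, 1])
offdiag_shared = float((shared.T @ shared)[0, 1])

print(p_disjoint, p_shared, offdiag_disjoint, offdiag_shared)
```

Note that the penalty sums the diagonal of the Gram matrix as well as the off-diagonal entries. In this example the diagonal contributions are identical for both bases, so the difference in penalty comes entirely from the overlap between the two basis vectors.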

It is also worth mentioning that it has been shown empirically [16] and under

some mild conditions mathematically [17] that NMF yields a sparse basis. Nevertheless,

the equality constraint in (8), in addition to the non-negativity constraint,

enforces sparsity even further. This ends the justification of the terms introduced

in (8).

This is the first criterion for the prior over B and was mentioned earlier in

[19]; nevertheless this is not enough when one deals with image data. Diseases

typically affect anatomy and function in a somewhat continuous way. Therefore,

we would prefer that bi represents smooth and contiguous anatomical regions.

Although smoothing can be applied as post processing after optimization and

deriving B, it is preferable to add a smoothness penalty term to the prior of

B. Similar to [20], we exploit the widely used Markov Random Field (MRF)

model. In this model, voxels within a neighborhood interact with each other and

smoothness of an image is modeled as in the Gibbs distribution as follows:

\[
p(I) = \frac{1}{Z} \exp(-c\,\alpha_2(B)) \;\Rightarrow\; -\log p(I) = c\,\alpha_2(B) + \log Z \tag{9}
\]

where I is a vector made by concatenating image voxels (e.g. lexicographically),

Z is a normalization constant called the partition function, and c is a constant.

α2(.) is a nonlinear energy function measuring non-smoothness of an image. For

basis matrix B, we can write α2(B) as follows:

\[
\alpha_2(B) = \sum_{j=1}^{r} \sum_{i=1}^{D} \sum_{l \in U_i} w_{il}\, \psi(b_{ji} - b_{jl}, \delta) \tag{10}
\]

where r is the number of basis vectors, D is the dimensionality of the images,

$U_i$ is the set containing the neighborhood indices of the i'th voxel, ψ(.,δ) is a

potential function, δ is a free parameter, and $w_{il}$ are weighting factors. There

are plenty of choices for the potential function. We adopt a simple quadratic function

that has all the desired properties, including nonnegativity, being strictly increasing,

unboundedness and, more importantly, convexity. In addition, it

can be simply represented in matrix form, which will help us to derive an

appropriate auxiliary function:

\[
\psi(x,\delta) = \left(\frac{x}{\delta}\right)^2 \tag{11}
\]

Adding both terms, α1 and α2, for the basis, the total regularization penalty

becomes:

\[
\alpha(B) = \mathbf{1}^T B^T B \mathbf{1} + \sum_{j=1}^{r} \sum_{i=1}^{D} \sum_{l \in U_i} w_{il}\, \psi(b_{ji} - b_{jl}, \delta) \tag{12}
\]
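The smoothness term can be sketched in a few lines for a 2-D image grid. The code below evaluates α2(B) from (10) with the quadratic potential (11), and the unweighted total penalty (12), under assumptions that are not specified in the text: unit weights w_il = 1, a 4-connected neighborhood U_i, and basis vectors stored as columns of `B`. The function names and the two toy basis vectors are hypothetical.

```python
import numpy as np

def smoothness_penalty(B, shape, delta=1.0):
    """alpha_2(B) of Eq. (10) with the quadratic potential of Eq. (11).

    Assumes unit weights w_il = 1 and a 4-connected neighborhood U_i on a
    2-D grid; each column of B is one basis vector in lexicographic order.
    """
    total = 0.0
    for j in range(B.shape[1]):
        img = B[:, j].reshape(shape)
        dv = np.diff(img, axis=0)   # differences between vertical neighbors
        dh = np.diff(img, axis=1)   # differences between horizontal neighbors
        # Each unordered pair {i, l} appears twice in the double sum of (10).
        total += 2.0 * (np.sum((dv / delta) ** 2) + np.sum((dh / delta) ** 2))
    return float(total)

def total_basis_penalty(B, shape, delta=1.0):
    """alpha(B) of Eq. (12): overlap term plus smoothness term, unweighted."""
    ones = np.ones(B.shape[1])
    return float(ones @ (B.T @ B) @ ones) + smoothness_penalty(B, shape, delta)

shape = (4, 4)
blob = np.zeros(shape)
blob[:2, :2] = 1.0                      # one contiguous 2x2 region
scattered = np.zeros(shape)
scattered[[0, 0, 3, 3], [0, 3, 0, 3]] = 1.0   # four isolated corner voxels

B_blob = (blob / blob.sum()).reshape(-1, 1)              # unit L1 norm, r = 1
B_scattered = (scattered / scattered.sum()).reshape(-1, 1)

print(smoothness_penalty(B_blob, shape), smoothness_penalty(B_scattered, shape))
```

Both toy basis vectors have the same L1 norm and the same number of non-zero voxels, so the overlap term is identical for them; only the smoothness term separates the contiguous region from the scattered one.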

2.4 Regularizing Coefficients: β(C)

In this section, we will discuss the regularization term for the coefficient matrix.

The main goal of these regularization terms is to boost bases that produce