
Hierarchical Feature Extraction for Compact

Representation and Classification of Datasets

Markus Schubert and Jens Kohlmorgen

Fraunhofer FIRST.IDA

Kekuléstr. 7, 12489 Berlin, Germany

{markus,jek}@first.fraunhofer.de

http://ida.first.fraunhofer.de

Abstract. Feature extraction methods generally do not account for hi-

erarchical structure in the data. For example, PCA and ICA provide

transformations that solely depend on global properties of the overall

dataset. We here present a general approach for the extraction of feature

hierarchies from datasets and their use for classification or clustering.

A hierarchy of features extracted from a dataset thereby constitutes a

compact representation of the set that on the one hand can be used to

characterize and understand the data and on the other hand serves as a

basis to classify or cluster a collection of datasets. As a proof of concept,

we demonstrate the feasibility of this approach with an application to

mixtures of Gaussians with varying degree of structuredness and to a

clinical EEG recording.

1 Introduction

The vast majority of feature extraction methods does not account for hierarchical

structure in the data. For example, PCA [1] and ICA [2] provide transformations

that solely depend on global properties of the overall data set. The ability to

model the hierarchical structure of the data, however, might certainly help to

characterize and understand the information contained in the data. For example,

neural dynamics are often characterized by a hierarchical structure in space and

time, where methods for hierarchical feature extraction might help to group

and classify such data. A particular demand for these methods exists in EEG

recordings, where slow dynamical components (sometimes interpreted as internal

“state” changes) and the variability of features make data analysis difficult.

Hierarchical feature extraction is so far mainly related to 2-D pattern anal-

ysis. In these approaches, pioneered by Fukushima’s work on the Neocognitron

[3], the hierarchical structure is typically a priori hard-wired in the architecture

and the methods primarily apply to a 2-D grid structure. There are, however,

more recent approaches, like local PCA [4] or tree-dependent component analysis

[5], that are promising steps towards structured feature extraction methods that

derive also the structure from the data. While local PCA in [4] is not hierarchical

and tree-dependent component analysis in [5] is restricted to the context of ICA,

we here present a general approach for the extraction of feature hierarchies and


their use for classification and clustering. We exemplify this by using PCA as

the core feature extraction method.

In [6] and [7], hierarchies of two-dimensional PCA projections (using proba-

bilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional

data. For obtaining the hierarchies, the selection of sub-clusters was performed

either manually [6] or automatically by using a model selection criterion (AIC,

MDL) [7], but in both cases based on two-dimensional projections. A 2-D pro-

jection of high-dimensional data, however, is often not sufficient to unravel the

structure of the data, which thus might hamper both approaches, in particular,

if the sub-clusters get superimposed in the projection. In contrast, our method is

based on hierarchical clustering in the original data space, where the structural

information is unchanged and therefore undiminished. Also, the focus of this

paper is not on visualizing the data itself, which obviously is limited to 2-D or

3-D projections, but rather on the extraction of the hierarchical structure of the

data (which can be visualized by plotting trees) and on replacing the data by

a compact hierarchical representation in terms of a tree of extracted features,

which can be used for classification and clustering. The individual quantity to

be classified or clustered in this context is a tree of features representing a set of

data points. Note that classifying sets of points is a more general problem than

the well-known problem of classifying individual data points. Other approaches

to classify sets of points can be found, e.g., in [9, 10], where the authors define

a kernel on sets, which can then be used with standard kernel classifiers.

The paper is organized as follows. In section 2, we describe the hierarchical

feature extraction method. In section 3, we show how feature hierarchies can

be used for classification and clustering, and in section 4 we provide a proof

of concept with an application to mixtures of Gaussians with varying degree

of structuredness and to a clinical EEG recording. Section 5 concludes with a

discussion.

2 Hierarchical Feature Extraction

We pursue a straightforward approach to hierarchical feature extraction that

allows us to make any standard feature extraction method hierarchical: we per-

form hierarchical clustering of the data prior to feature extraction. The feature

extraction method is then applied locally to each significant cluster in the hi-

erarchy, resulting in a representation (or replacement) of the original dataset in

terms of a tree of features.

2.1 Hierarchical Clustering

There are many known variants of hierarchical clustering algorithms (see, e.g.,

[11, 12]), which can be subdivided into divisive top-down procedures and ag-

glomerative bottom-up procedures. More important than this procedural aspect,

however, is the dissimilarity function that is used in most methods to quantify

the dissimilarity between two clusters. This function is used as the criterion to


determine the clusters to be split (or merged) at each iteration of the top-down

(or bottom-up) process. Thus, it is this function that determines the clustering

result and it implicitly encodes what a “good” cluster is. Common agglomerative

procedures are single-linkage, complete-linkage, and average-linkage. They differ

simply in that they use different dissimilarity functions [12].

We here use Ward’s method [13], also called the minimum variance method,

which is agglomerative and successively merges the pair of clusters that causes

the smallest increase in terms of the total sum-of-squared-errors (SSE), where

the error is defined as the Euclidean distance of a data point to its cluster mean.

The increase in square-error caused by merging two clusters, D_i and D_j, is given by

    d(D_i, D_j) = sqrt( n_i n_j / (n_i + n_j) ) ‖m_i − m_j‖,        (1)

where n_i and n_j are the number of points in each cluster, and m_i and m_j are the means of the points in each cluster [12]. Ward's method can now simply

be described as a standard agglomerative clustering procedure [11, 12] with the

particular dissimilarity function d given in Eq. (1). We use Ward’s criterion,

because it is based on a global fitness criterion (SSE) and in [11] it is reported

that the method outperformed other hierarchical clustering methods in several

comparative studies. Nevertheless, depending on the particular application, other

criteria might be useful as well.
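Ward's procedure can be sketched directly with SciPy's agglomerative clustering routine (a minimal sketch, assuming SciPy is available; the toy data and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

# Each row of Z records one merger: the ids of the two merged
# clusters, the dissimilarity d(Di, Dj) at which they were merged,
# and the size of the resulting cluster.
Z = linkage(X, method="ward")
```

For n samples, Z has n − 1 rows, one per merger, with the merge dissimilarities in nondecreasing order (Ward's criterion is monotone).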

The result of a hierarchical clustering procedure that successively splits or

merges two clusters is a binary tree. At each hierarchy level, k = 1,...,n, it defines

a partition of the given n samples into k clusters. The leaf node level consists of

n nodes describing a partition into n clusters, where each cluster/node contains

exactly one sample. Each hierarchy level further up contains one node with edges

to the two child nodes that correspond to the clusters that have been merged.

The tree can be depicted graphically as a dendrogram, which aligns the leaf

nodes along the horizontal axis and connects them by lines to the higher level

nodes along the vertical axis. The position of the nodes along the vertical axis

could in principle correspond linearly to the hierarchy level k. This, however,

would reveal almost nothing of the structure in the data. Most of the structural

information is actually contained in the dissimilarity values. One therefore usu-

ally positions the node at level k vertically with respect to the dissimilarity value

of its two corresponding child clusters, D_i and D_j,

    δ(k) = d(D_i, D_j).        (2)

For k = n, there are no child clusters, and therefore δ(n) = 0 [11]. The function

δ can be regarded as within-cluster dissimilarity. By using δ as the vertical scale

in a dendrogram, a large gap between two levels, for example k and k+1, means

that two very dissimilar clusters have been merged at level k.
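The gap inspection described above can be read off the linkage output. A hypothetical sketch (SciPy assumed; here δ(k), the within-cluster dissimilarity at level k, is the height of the merger that reduces k + 1 clusters to k):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.4, size=(30, 2)),
               rng.normal(6.0, 0.4, size=(30, 2))])
Z = linkage(X, method="ward")

# Z[:, 2] holds merge dissimilarities in increasing order, so
# reversing it gives delta(1), delta(2), ... for the dendrogram.
delta = Z[::-1, 2]              # delta[i] corresponds to delta(i + 1)
gaps = delta[:-1] - delta[1:]   # gap between levels k and k + 1

# The largest gap marks the level where very dissimilar clusters
# were merged; below it the data splits into that many clusters.
k_best = int(np.argmax(gaps)) + 2
```

For the two-blob toy data above, the dominant gap lies between δ(1) and δ(2), suggesting two clusters.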

2.2 Extracting a Tree of Significant Clusters

As we have seen in the previous subsection, a hierarchical clustering algorithm

always generates a tree containing n − 1 non-singleton clusters. This does not


necessarily mean that any of these clusters is clearly separated from the rest of

the data or that there is any structure in the data at all. The identification of

clearly separated clusters is usually done by visual inspection of the dendrogram,

i.e. by identifying large gaps. For an automatic detection of significant clusters,

we use the following straightforward criterion

    δ(parent(k)) / δ(k) > α,    for 1 < k < n,        (3)

where parent(k) is the parent cluster level of the cluster obtained at level k and

α is a significance threshold. If a cluster at level k is merged into a cluster that

has a within-cluster dissimilarity which is more than α times higher than that

of cluster k, we call cluster k a significant cluster. That means that cluster k

is significantly more compact than its merger (in the sense of the dissimilarity

function). Note that this does not necessarily mean that the sibling of cluster k is

also a significant cluster, as it might have a higher dissimilarity value than cluster

k. The criterion directly corresponds to the relative increase of the dissimilarity

value in a dendrogram from one merger level to the next. For small clusters that

contain only a few points, the relative increase in dissimilarity can be large just

because of the small sample size. To avoid that these clusters are detected as

being significant, we require a minimum cluster size M for significant clusters.
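Criterion (3) with the minimum-size requirement can be sketched on a SciPy linkage matrix as follows (an illustrative implementation under the assumption of SciPy's linkage-matrix layout; `alpha` and `min_size` stand for the paper's α and M):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def significant_clusters(Z, alpha=2.0, min_size=5):
    """Return rows of Z whose clusters satisfy criterion (3):
    delta(parent(k)) / delta(k) > alpha, with at least min_size points."""
    n = Z.shape[0] + 1                 # number of original samples
    significant = []
    for row in range(Z.shape[0]):
        cluster_id = n + row           # SciPy's id for the cluster formed here
        delta_k = Z[row, 2]            # within-cluster dissimilarity
        size_k = Z[row, 3]
        # The parent merger is the row that absorbs this cluster.
        parents = np.where((Z[:, 0] == cluster_id) |
                           (Z[:, 1] == cluster_id))[0]
        if len(parents) == 0:
            continue                   # the root cluster has no parent
        delta_parent = Z[parents[0], 2]
        if size_k >= min_size and delta_k > 0 and delta_parent / delta_k > alpha:
            significant.append(row)
    return significant

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(5.0, 0.3, size=(40, 2))])
Z = linkage(X, method="ward")
sig = significant_clusters(Z, alpha=2.0, min_size=5)
```

On the two-blob toy data, both blobs are detected as significant: each is far more compact than the merger that joins them.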

After having identified the significant clusters in the binary cluster tree, we

can extract the tree of significant clusters simply by linking each significant

cluster node to the next highest significant node in the tree, or, if there is none,

to the root node (which is just for the convenience of getting a tree and not a

forest). The tree of significant clusters is generally much smaller than the original

tree and it is not necessarily a binary tree anymore. Also note that there might

be data points that are not in any significant cluster, e.g., outliers.

The criterion in (3) is somewhat related to the criterion in [14], which is used

to take out clusters from the merging process in order to obtain a plain, non-

hierarchical clustering. The criterion in [14] accounts for the relative change of

the absolute dissimilarity increments, which seems to be somewhat less intuitive

and unnecessarily complicated. This criterion might also be overly sensitive to

small variations in the dissimilarities.

2.3 Obtaining a Tree of Features

To obtain a representation of the original dataset in terms of a tree of features,

we can now apply any standard feature extraction method to the data points in

each significant cluster in the tree and then replace the data points in the cluster

by their corresponding features. For PCA, for example, the data points in each

significant cluster are replaced by their mean vector and the desired number

of principal components, i.e. the eigenvectors and eigenvalues of the covariance

matrix of the data points. The obtained hierarchy of features thus constitutes

a compact representation of the dataset that does not contain the individual

data points anymore, which can save a considerable amount of memory. This


representation is also independent of the size of the dataset. The hierarchy can

on the one hand be used to analyze and understand the structure of the data,

on the other hand – as we will further explain in the next section – it can be

used to perform classification or clustering in cases where the individual input

quantity to be classified (or clustered) is an entire dataset and not, as usual, a

single data point.
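The per-cluster feature extraction step with PCA can be sketched as follows (a minimal sketch; the function name and toy data are illustrative):

```python
import numpy as np

def cluster_features(points, n_components=2):
    """Replace a cluster's points by its mean and the leading
    principal components, i.e. the eigenvectors and eigenvalues
    of the cluster's covariance matrix."""
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # largest first
    return {"mean": mean,
            "eigenvalues": eigvals[order],
            "components": eigvecs[:, order]}

# Toy cluster stretched along the x-axis: the first principal
# component should align with the (1, 0) direction.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
feat = cluster_features(pts)
```

Applying this to every significant cluster in the tree yields the compact hierarchical representation: only means, eigenvectors, and eigenvalues are stored, independent of the number of data points.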

3 Classification of Feature Trees

The classification problem that we address here is not the well-known problem

of classifying individual data points or vectors. Instead, it relates to the classifi-

cation of objects that are sets of data points, for example, time series. Given a

“training set” of such objects, i.e. a number of datasets, each one attached with a

certain class label, the problem consists in assigning one class label to each new,

unlabeled dataset. This can be accomplished by transforming each individual

dataset into a tree of features and by defining a suitable distance function to

compare each pair of trees. For example, trees of principal components can be re-

garded as (hierarchical) mixtures of Gaussians, since the principal components

of each node in the tree (the eigenvectors and eigenvalues) describe a normal

distribution, which is an approximation to the true distribution of the underly-

ing data points in the corresponding significant cluster. Two mixtures (sums) of

Gaussians, f and g, corresponding to two trees of principal components (of two

datasets), can be compared, e.g., by using the squared L2-norm as distance function, which is also called the integrated squared error (ISE),

    ISE(f, g) = ∫ (f − g)² dx.        (4)

The ISE has the advantage that the integral is analytically tractable for mixtures

of Gaussians.
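The analytic tractability rests on the closed form of the integral of a product of two Gaussian densities, ∫ N(x; m₁, S₁) N(x; m₂, S₂) dx = N(m₁ − m₂; 0, S₁ + S₂). A sketch of the resulting ISE computation for two mixtures, each given as (weight, mean, covariance) triples (function names are illustrative):

```python
import numpy as np

def gauss_overlap(m1, S1, m2, S2):
    """Integral of the product of two Gaussian densities:
    N(m1 - m2; 0, S1 + S2), evaluated in closed form."""
    d = len(m1)
    S = S1 + S2
    diff = m1 - m2
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))

def ise(mix_f, mix_g):
    """ISE between two Gaussian mixtures (Eq. (4)), each a list of
    (weight, mean, covariance) triples."""
    def cross(a, b):
        return sum(wi * wj * gauss_overlap(mi, Si, mj, Sj)
                   for wi, mi, Si in a for wj, mj, Sj in b)
    return cross(mix_f, mix_f) + cross(mix_g, mix_g) - 2 * cross(mix_f, mix_g)

I = np.eye(2)
f = [(1.0, np.zeros(2), I)]
g = [(1.0, np.zeros(2), I)]           # identical mixture: ISE is zero
h = [(1.0, np.array([3.0, 0.0]), I)]  # shifted mixture: ISE is positive
```

Expanding the square in Eq. (4) gives the three cross terms above, each a finite sum of Gaussian overlap integrals, so no numerical integration is needed.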

Note that the computation of a tree of principal components, as described

in the previous section, is in itself an interesting way to obtain a mixture of

Gaussians representation of a dataset: without the need to specify the number

of components in advance and without the need to run a maximum likelihood

(gradient ascent) algorithm like, for example, expectation–maximization [15],

which is prone to get stuck in local optima.

Having obtained a distance function on feature trees, the next step is to

choose a classification method that only requires pairwise distances to classify

the trees (and their corresponding datasets). A particularly simple method is

first-nearest-neighbor (1-NN) classification. For 1-NN classification, the tree of

a test dataset is assigned the label of the nearest tree of a collection of trees

that were generated from a labeled “training set” of datasets. If the generated

trees are sufficiently different among the classes, first- (or k-) nearest-neighbor

classification can already be sufficient to obtain a good classification result, as

we demonstrate in the next section.
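The 1-NN rule over feature trees only needs pairwise distances, for example the ISE between the trees' Gaussian-mixture representations. A minimal sketch (the scalar stand-ins for trees and the distance function are illustrative, not the paper's setup):

```python
import numpy as np

def one_nn_classify(test_tree, train_trees, train_labels, dist):
    """1-NN classification: assign the test tree the label of the
    nearest training tree under the distance function `dist`."""
    distances = [dist(test_tree, t) for t in train_trees]
    return train_labels[int(np.argmin(distances))]

# Toy illustration: scalars stand in for feature trees, and the
# absolute difference stands in for the tree distance (e.g. ISE).
labels = ["A", "A", "B"]
trees = [0.1, 0.2, 5.0]
pred = one_nn_classify(0.15, trees, labels, lambda a, b: abs(a - b))
```

In the paper's setting, `dist` would be the ISE between the two mixtures of Gaussians induced by the trees of principal components.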