Hierarchical Feature Extraction for Compact
Representation and Classification of Datasets
Markus Schubert and Jens Kohlmorgen
Fraunhofer FIRST.IDA
Kekuléstr. 7, 12489 Berlin, Germany
{markus,jek}@first.fraunhofer.de
http://ida.first.fraunhofer.de
Abstract. Feature extraction methods generally do not account for hierarchical structure in the data. For example, PCA and ICA provide transformations that solely depend on global properties of the overall dataset. We here present a general approach for the extraction of feature hierarchies from datasets and their use for classification or clustering.
A hierarchy of features extracted from a dataset thereby constitutes a
compact representation of the set that on the one hand can be used to
characterize and understand the data and on the other hand serves as a
basis to classify or cluster a collection of datasets. As a proof of concept,
we demonstrate the feasibility of this approach with an application to
mixtures of Gaussians with varying degree of structuredness and to a
clinical EEG recording.
1 Introduction
The vast majority of feature extraction methods does not account for hierarchical
structure in the data. For example, PCA [1] and ICA [2] provide transformations
that solely depend on global properties of the overall data set. The ability to
model the hierarchical structure of the data, however, can certainly help to
characterize and understand the information contained in the data. For example,
neural dynamics are often characterized by a hierarchical structure in space and
time, where methods for hierarchical feature extraction might help to group
and classify such data. A particular demand for these methods exists in EEG
recordings, where slow dynamical components (sometimes interpreted as internal
“state” changes) and the variability of features make data analysis difficult.
Hierarchical feature extraction has so far mainly been related to 2D pattern analysis. In these approaches, pioneered by Fukushima's work on the Neocognitron [3], the hierarchical structure is typically hardwired a priori in the architecture and the methods primarily apply to a 2D grid structure. There are, however, more recent approaches, like local PCA [4] or tree-dependent component analysis [5], that are promising steps towards structured feature extraction methods that also derive the structure from the data. While local PCA in [4] is not hierarchical and tree-dependent component analysis in [5] is restricted to the context of ICA,
we here present a general approach for the extraction of feature hierarchies and
their use for classification and clustering. We exemplify this by using PCA as
the core feature extraction method.
In [6] and [7], hierarchies of two-dimensional PCA projections (using probabilistic PCA [8]) were proposed for the purpose of visualizing high-dimensional data. For obtaining the hierarchies, the selection of subclusters was performed either manually [6] or automatically by using a model selection criterion (AIC, MDL) [7], but in both cases based on two-dimensional projections. A 2D projection of high-dimensional data, however, is often not sufficient to unravel the structure of the data, which thus might hamper both approaches, in particular, if the subclusters get superimposed in the projection. In contrast, our method is
based on hierarchical clustering in the original data space, where the structural
information is unchanged and therefore undiminished. Also, the focus of this
paper is not on visualizing the data itself, which obviously is limited to 2D or
3D projections, but rather on the extraction of the hierarchical structure of the
data (which can be visualized by plotting trees) and on replacing the data by
a compact hierarchical representation in terms of a tree of extracted features,
which can be used for classification and clustering. The individual quantity to be classified or clustered in this context is a tree of features representing a set of data points. Note that classifying sets of points is a more general problem than the well-known problem of classifying individual data points. Other approaches to classify sets of points can be found, e.g., in [9, 10], where the authors define a kernel on sets, which can then be used with standard kernel classifiers.
The paper is organized as follows. In section 2, we describe the hierarchical
feature extraction method. In section 3, we show how feature hierarchies can
be used for classification and clustering, and in section 4 we provide a proof
of concept with an application to mixtures of Gaussians with varying degree
of structuredness and to a clinical EEG recording. Section 5 concludes with a
discussion.
2 Hierarchical Feature Extraction
We pursue a straightforward approach to hierarchical feature extraction that
allows us to make any standard feature extraction method hierarchical: we perform hierarchical clustering of the data prior to feature extraction. The feature extraction method is then applied locally to each significant cluster in the hierarchy, resulting in a representation (or replacement) of the original dataset in terms of a tree of features.
2.1 Hierarchical Clustering
There are many known variants of hierarchical clustering algorithms (see, e.g., [11, 12]), which can be subdivided into divisive top-down procedures and agglomerative bottom-up procedures. More important than this procedural aspect, however, is the dissimilarity function that is used in most methods to quantify the dissimilarity between two clusters. This function is used as the criterion to
determine the clusters to be split (or merged) at each iteration of the top-down (or bottom-up) process. Thus, it is this function that determines the clustering result and it implicitly encodes what a "good" cluster is. Common agglomerative procedures are single-linkage, complete-linkage, and average-linkage. They differ simply in that they use different dissimilarity functions [12].
We here use Ward's method [13], also called the minimum variance method, which is agglomerative and successively merges the pair of clusters that causes the smallest increase in terms of the total sum-of-squared-errors (SSE), where the error is defined as the Euclidean distance of a data point to its cluster mean.
The increase in square-error caused by merging two clusters, D_i and D_j, is given by

    d(D_i, D_j) = √( n_i n_j / (n_i + n_j) ) ‖m_i − m_j‖,    (1)

where n_i and n_j are the number of points in each cluster, and m_i and m_j are the means of the points in each cluster [12]. Ward's method can now simply be described as a standard agglomerative clustering procedure [11, 12] with the particular dissimilarity function d given in Eq. (1). We use Ward's criterion, because it is based on a global fitness criterion (SSE) and in [11] it is reported that the method outperformed other hierarchical clustering methods in several comparative studies. Nevertheless, depending on the particular application, other criteria might be useful as well.
The result of a hierarchical clustering procedure that successively splits or merges two clusters is a binary tree. At each hierarchy level, k = 1, ..., n, it defines a partition of the given n samples into k clusters. The leaf node level consists of n nodes describing a partition into n clusters, where each cluster/node contains exactly one sample. Each hierarchy level further up contains one node with edges to the two child nodes that correspond to the clusters that have been merged. The tree can be depicted graphically as a dendrogram, which aligns the leaf nodes along the horizontal axis and connects them by lines to the higher level nodes along the vertical axis. The position of the nodes along the vertical axis could in principle correspond linearly to the hierarchy level k. This, however, would reveal almost nothing of the structure in the data. Most of the structural information is actually contained in the dissimilarity values. One therefore usually positions the node at level k vertically with respect to the dissimilarity value of its two corresponding child clusters, D_i and D_j,

    δ(k) = d(D_i, D_j).    (2)
For k = n, there are no child clusters, and therefore δ(n) = 0 [11]. The function δ can be regarded as a within-cluster dissimilarity. By using δ as the vertical scale in a dendrogram, a large gap between two levels, for example k and k+1, means that two very dissimilar clusters have been merged at level k.
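As a concrete illustration, Eq. (1) can be evaluated directly; the following sketch (assuming NumPy is available; the function and variable names are ours, not the paper's) computes the Ward dissimilarity of two explicit clusters:

```python
import numpy as np

def ward_dissimilarity(Xi, Xj):
    """Ward dissimilarity of Eq. (1):
    d(D_i, D_j) = sqrt(n_i*n_j / (n_i + n_j)) * ||m_i - m_j||."""
    ni, nj = len(Xi), len(Xj)
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    return np.sqrt(ni * nj / (ni + nj)) * np.linalg.norm(mi - mj)

# two small clusters with means (0, 1) and (4, 1)
Di = np.array([[0.0, 0.0], [0.0, 2.0]])
Dj = np.array([[4.0, 0.0], [4.0, 2.0]])
print(ward_dissimilarity(Di, Dj))  # sqrt(4/4) * 4 = 4.0
```

In practice the full agglomeration need not be coded by hand; for instance, SciPy's scipy.cluster.hierarchy.linkage(X, method="ward") returns these merge heights δ(k) in the third column of its output.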
2.2 Extracting a Tree of Significant Clusters
As we have seen in the previous subsection, a hierarchical clustering algorithm
always generates a tree containing n − 1 non-singleton clusters. This does not
necessarily mean that any of these clusters is clearly separated from the rest of
the data or that there is any structure in the data at all. The identification of
clearly separated clusters is usually done by visual inspection of the dendrogram,
i.e. by identifying large gaps. For an automatic detection of significant clusters,
we use the following straightforward criterion

    δ(parent(k)) / δ(k) > α,    for 1 < k < n,    (3)
where parent(k) is the parent cluster level of the cluster obtained at level k and
α is a significance threshold. If a cluster at level k is merged into a cluster that
has a within-cluster dissimilarity which is more than α times higher than that
of cluster k, we call cluster k a significant cluster. That means that cluster k
is significantly more compact than its merger (in the sense of the dissimilarity
function). Note that this does not necessarily mean that the sibling of cluster k is
also a significant cluster, as it might have a higher dissimilarity value than cluster
k. The criterion directly corresponds to the relative increase of the dissimilarity
value in a dendrogram from one merger level to the next. For small clusters that
contain only a few points, the relative increase in dissimilarity can be large just
because of the small sample size. To avoid that these clusters are detected as
being significant, we require a minimum cluster size M for significant clusters.
After having identified the significant clusters in the binary cluster tree, we
can extract the tree of significant clusters simply by linking each significant
cluster node to the next highest significant node in the tree, or, if there is none,
to the root node (which is just for the convenience of getting a tree and not a
forest). The tree of significant clusters is generally much smaller than the original
tree and it is not necessarily a binary tree anymore. Also note that there might
be data points that are not in any significant cluster, e.g., outliers.
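A minimal sketch of the significance test in Eq. (3), applied to the merge table produced by SciPy's Ward linkage (the function name is our own, and the values α = 3 and M = 5 are arbitrary demo choices, not recommendations from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def significant_clusters(Z, n, alpha=3.0, M=5):
    """Indices i of merge rows in Z whose cluster (id n+i) satisfies
    Eq. (3): delta(parent) / delta > alpha, with at least M points."""
    parent = {}                      # cluster id -> row of its merger
    for i, row in enumerate(Z):
        parent[int(row[0])] = i
        parent[int(row[1])] = i
    sig = []
    for i, (a, b, delta, size) in enumerate(Z):
        cid = n + i
        if cid not in parent or delta <= 0:
            continue                 # root, or zero-height merge
        if Z[parent[cid], 2] / delta > alpha and size >= M:
            sig.append(i)
    return sig

# two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)),
               rng.normal(5.0, 0.3, (40, 2))])
Z = linkage(X, method="ward")        # Ward's agglomerative clustering
print(significant_clusters(Z, len(X)))
```

On such data the two blobs are detected as significant clusters, since the final merger is far more dissimilar than either of its children.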
The criterion in (3) is somewhat related to the criterion in [14], which is used
to take out clusters from the merging process in order to obtain a plain, non-hierarchical clustering. The criterion in [14] accounts for the relative change of
the absolute dissimilarity increments, which seems to be somewhat less intuitive
and unnecessarily complicated. This criterion might also be overly sensitive to
small variations in the dissimilarities.
2.3 Obtaining a Tree of Features
To obtain a representation of the original dataset in terms of a tree of features,
we can now apply any standard feature extraction method to the data points in
each significant cluster in the tree and then replace the data points in the cluster
by their corresponding features. For PCA, for example, the data points in each
significant cluster are replaced by their mean vector and the desired number
of principal components, i.e. the eigenvectors and eigenvalues of the covariance
matrix of the data points. The obtained hierarchy of features thus constitutes
a compact representation of the dataset that does not contain the individual
data points anymore, which can save a considerable amount of memory. This
representation is also independent of the size of the dataset. The hierarchy can
on the one hand be used to analyze and understand the structure of the data,
on the other hand – as we will further explain in the next section – it can be
used to perform classification or clustering in cases where the individual input
quantity to be classified (or clustered) is an entire dataset and not, as usual, a
single data point.
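With PCA as the core method, the per-cluster features are just the cluster mean and the leading eigenpairs of the covariance matrix. A minimal sketch of this replacement step (our own naming, not the authors' code):

```python
import numpy as np

def cluster_features(X, k=2):
    """Replace a cluster's points by its mean and its top-k principal
    components (eigenvectors/eigenvalues of the covariance matrix)."""
    m = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]    # keep the k largest
    return m, vals[order], vecs[:, order].T

rng = np.random.default_rng(1)
# anisotropic cluster: large variance along x, small along y
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
mean, eigvals, pcs = cluster_features(X)
print(eigvals)   # roughly [9.0, 0.25]
```

Storing only (mean, eigenvalues, eigenvectors) per significant cluster is what makes the representation independent of the number of data points.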
3 Classification of Feature Trees
The classification problem that we address here is not the well-known problem of classifying individual data points or vectors. Instead, it relates to the classification of objects that are sets of data points, for example, time series. Given a
“training set” of such objects, i.e. a number of datasets, each one attached with a
certain class label, the problem consists in assigning one class label to each new,
unlabeled dataset. This can be accomplished by transforming each individual
dataset into a tree of features and by defining a suitable distance function to
compare each pair of trees. For example, trees of principal components can be regarded as (hierarchical) mixtures of Gaussians, since the principal components of each node in the tree (the eigenvectors and eigenvalues) describe a normal distribution, which is an approximation to the true distribution of the underlying data points in the corresponding significant cluster. Two mixtures (sums) of
Gaussians, f and g, corresponding to two trees of principal components (of two datasets), can be compared, e.g., by using the squared L2-norm as a distance function, which is also called the integrated squared error (ISE),

    ISE(f, g) = ∫ (f − g)² dx.    (4)
The ISE has the advantage that the integral is analytically tractable for mixtures
of Gaussians.
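To make the closed form explicit: expanding ISE(f, g) = ∫f² − 2∫fg + ∫g² reduces everything to pairwise cross-terms, using ∫ N(x; m₁, v₁) N(x; m₂, v₂) dx = N(m₁; m₂, v₁ + v₂). A one-dimensional sketch (our own code, not the authors' implementation):

```python
import numpy as np

def gauss(x, m, v):
    """Normal density N(x; m, v) with variance v."""
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def ise(f, g):
    """Integrated squared error between two 1-D Gaussian mixtures,
    each a list of (weight, mean, variance) triples, using
    int N(x;m1,v1) N(x;m2,v2) dx = N(m1; m2, v1 + v2)."""
    def cross(p, q):
        return sum(wp * wq * gauss(mp, mq, vp + vq)
                   for wp, mp, vp in p for wq, mq, vq in q)
    return cross(f, f) - 2.0 * cross(f, g) + cross(g, g)

f = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
g = [(1.0, 0.0, 1.0)]
print(ise(f, f))   # identical mixtures -> 0
print(ise(f, g))   # strictly positive
```

The multivariate case is analogous, with the cross-terms evaluated as multivariate normal densities with summed covariance matrices.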
Note that the computation of a tree of principal components, as described
in the previous section, is in itself an interesting way to obtain a mixture of
Gaussians representation of a dataset: without the need to specify the number
of components in advance and without the need to run a maximum likelihood
(gradient ascent) algorithm like, for example, expectation–maximization [15],
which is prone to get stuck in local optima.
Having obtained a distance function on feature trees, the next step is to
choose a classification method that only requires pairwise distances to classify
the trees (and their corresponding datasets). A particularly simple method is first-nearest-neighbor (1-NN) classification. For 1-NN classification, the tree of a test dataset is assigned the label of the nearest tree from a collection of trees that were generated from a labeled "training set" of datasets. If the generated trees are sufficiently different among the classes, first (or k-) nearest-neighbor classification can already be sufficient to obtain a good classification result, as we demonstrate in the next section.
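Given a matrix of pairwise tree distances (e.g. the ISE of Eq. (4)), the 1-NN rule described above is straightforward; a sketch with hypothetical class labels:

```python
import numpy as np

def one_nn(dist, train_labels):
    """1-NN: dist[i, j] is the distance from test tree i to training
    tree j; each test tree gets its nearest neighbor's label."""
    return [train_labels[j] for j in np.argmin(dist, axis=1)]

# hypothetical distances of 2 test trees to 3 labeled training trees
dist = np.array([[0.1, 2.0, 1.5],
                 [3.0, 0.2, 0.9]])
print(one_nn(dist, ["seizure", "normal", "normal"]))
```

Any classifier that operates on pairwise distances alone (e.g. k-NN) can be substituted without changing the tree representation.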