
A General Approach to Mining Quality

Pattern-based Clusters from Microarray Data*

Daxin Jiang1

Jian Pei2

Aidong Zhang1

1State University of New York at Buffalo, USA, {djiang3, azhang}@cse.buffalo.edu

2Simon Fraser University, Canada, jpei@cs.sfu.ca

Abstract. Pattern-based clustering has broad applications in microar-

ray data analysis, customer segmentation, e-business data analysis, etc.

However, pattern-based clustering often returns a large number of

highly-overlapping clusters, which makes it hard for users to identify

interesting patterns from the mining results. Moreover, there is no general

model for pattern-based clustering. Different kinds of patterns or differ-

ent measures on the pattern coherence may require different algorithms.

In this paper, we address the above two problems by proposing a general

quality-driven approach to mining top-k quality pattern-based clusters.

We examine our quality-driven approach using real world microarray

data sets. The experimental results show that our method is general,

effective and efficient.

1 Introduction

Clustering is an important data mining problem. For a set of objects, a clustering

algorithm partitions the objects into a set of clusters, such that objects within a

cluster are similar to each other, and objects in different clusters are dissimilar.

While many traditional clustering methods often assume that the clusters are

mutually exclusive and rely on metric distance between objects, some recently

emerging applications, such as those in bio-informatics and e-business, pose the

challenges of mining non-exclusive, non-distance-based clusters in various sub-

spaces from large databases.

As a typical application, a microarray data set can be modelled as a numerical

data matrix recording the expression levels of genes on samples. An important

task of analyzing microarray data is to find co-expressed genes and phenotypes.

A group of co-expressed genes are the ones that demonstrate similar expression

patterns over a substantial subset of samples, and the subset of samples may

correspond to some phenotype.

Moreover, given a microarray data set, a gene can belong to more than one

co-expressed gene group, since it may correlate to more than one phenotype;

*This research is partly supported by NSF grants DBI-0234895 and IIS-0308001, NIH

grant 1 P20 GM067650-01A1, the Endowed Research Fellowship and the President

Research Grant from Simon Fraser University. All opinions, findings, conclusions

and recommendations in this paper are those of the authors and do not necessarily

reflect the views of the funding agencies.


and a sample can manifest more than one phenotype, such as tumor vs. normal

tissues and male vs. female samples. To address these novel requirements,

a new theme, pattern-based clustering, has recently been developed [1,5,6,9,10] (please

see Section 2 for a brief review).

As indicated by the previous studies, pattern-based clustering is effective for

mining non-exclusive, non-distance-based clusters. However, the state-of-the-art

methods for pattern-based clustering are still facing the following two serious

challenges, which will be addressed in this paper.

Challenge 1: Pattern-based clustering may return a large number of

highly-overlapping clusters.

To filter out trivial clusters, most of the pattern-based clustering methods

adopt some thresholds, such as the minimum number of objects in a cluster,

the minimum number of attributes in a cluster, and the minimum degree of

coherence of a cluster. Since overly tight threshold values may prune out most of

the clusters, including those bearing interesting patterns, loose threshold values

are usually preferred.

However, pattern-based clustering will return the complete set of possible

combinations of objects and attributes that pass the thresholds. When loose

threshold values are specified, thousands or tens of thousands of clusters will be

reported. Moreover, since the microarray data are typically highly-connected [3],

the reported clusters are often highly overlapping. For example, our empirical

study has shown that the average overlap among the clusters returned by a

representative pattern-based clustering algorithm may be as high as 79% (Please

see Section 5 for details). Clearly, it is hard for users to identify useful patterns

from such voluminous and redundant mining results. Can we develop an effective

method that can automatically focus on finding a small set of representative

clusters with respect to loose threshold values?

Our contribution. In this paper, we propose a theme of mining top-k quality

pattern-based clusters, based on a user-specified quality/utilization function. In

particular, the top-k clusters are sorted according to their quality, and the clus-

ters with higher quality are reported before those with lower quality. We show

that, by intuitive quality functions, highly overlapping clusters can be avoided.

Challenge 2: There are numerous pattern-based clustering models due

to various definitions of patterns and coherence measures.

For example, Cheng and Church [1] measured the coherence of clusters by

the mean squared residue score. Wang et al. [9] introduced the notion of pScore

to measure the similarity between the objects in clusters. Liu and Wang [5]

defined patterns by ordering attributes in value ascending order. Jiang et al. [4]

constrained the coherence within groups of samples by the minimum coherence

threshold. Different algorithms are proposed to handle specific models. Even

with a minor change to the specific pattern-based clustering model, such as the

definition of coherence function, we may have to write a new algorithm.

Given that pattern-based clustering methods share essential intuitions and

principles, can we have a general approach such that many different pattern-

based clustering models can be handled consistently?


Our contribution. In this paper, we develop a general model for pattern-based

clustering to address the above challenge. Our new pattern-based clustering

model is a generalization of several previous models, including bi-Cluster [1], δ-

pCluster [9], OP-Cluster [5] and coherent gene cluster [4]. We study how to mine

top-k quality pattern-based clusters under the general model, and give a general

and efficient algorithm.

The remainder of the paper is organized as follows. In Section 2, we review

the related work, and also clarify the novel progress that we make in this paper

compared with our previous studies on mining microarray data. A general

quality-driven model is introduced in Section 3. A general approach to mining

top-k quality pattern-based clusters is presented in Section 4. We report the

experimental results in Section 5. Finally, we conclude this paper in Section 6.

2 Related Work

Our research is highly related to pattern-based clustering. Cheng and Church

[1] introduced the bi-cluster model. Given a subset of objects I and a subset of

attributes J, the coherence of the submatrix (I,J) is measured by the mean

squared residue score.

$$r_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2, \tag{1}$$

where $a_{ij}$ is the value of object i on attribute j, $a_{iJ}$ is the average value of row i, $a_{Ij}$ is the

average value of column j, and $a_{IJ}$ is the average value of the submatrix (I,J).

The problem of bi-clustering is to mine submatrices with low mean squared

residue scores. Yang et al. [10] proposed a move-based algorithm to find biclusters

more efficiently. The algorithms in [1] and [10] adopt heuristic search strategies,

and thus are not guaranteed to find the optimal biclusters in a data set.
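As an illustration of Equation 1, the following minimal sketch computes the mean squared residue of a submatrix in plain Python. The function name `mean_squared_residue`, the helper `mean`, and the toy matrix are assumptions made for this example; they do not come from [1] or [10].

```python
def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue r_IJ of the submatrix (rows, cols), following Equation 1."""
    # Row means a_iJ, column means a_Ij, and the overall mean a_IJ of the submatrix.
    row_mean = {i: mean(matrix[i][j] for j in cols) for i in rows}
    col_mean = {j: mean(matrix[i][j] for i in rows) for j in cols}
    all_mean = mean(matrix[i][j] for i in rows for j in cols)
    total = sum(
        (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    )
    return total / (len(rows) * len(cols))

# Toy data (made up): rows 0 and 1 differ by a constant shift, row 2 does not.
toy = [
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],
    [5.0, 1.0, 0.0],
]
shifted = mean_squared_residue(toy, [0, 1], [0, 1, 2])   # essentially zero
mixed = mean_squared_residue(toy, [0, 1, 2], [0, 1, 2])  # clearly positive
```

In this toy case the two shifted rows give a residue of essentially zero, while adding the third row raises the score, which matches the intuition that low residue indicates a coherent submatrix.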

In [9], Wang et al. proposed the model of δ-pCluster. A subset of objects

O and a subset of attributes A form a pattern-based cluster if for any pair of

objects x, y ∈ O, and any pair of attributes a, b ∈ A, the difference of change of

values on attributes a and b between objects x and y is no greater than a threshold

δ, i.e., |(x.a − y.a) − (x.b − y.b)| ≤ δ. In a recent study [6], Pei et al. developed

MaPle, an efficient algorithm to mine the complete set of maximal pattern-based

clusters (i.e., non-redundant pattern-based clusters).
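The δ-pCluster condition can be checked directly from its definition. The sketch below is a brute-force test over all object pairs and attribute pairs; it is not the MaPle algorithm of [6], and the names `p_score` and `is_delta_pcluster` as well as the toy matrix are assumptions made for illustration.

```python
from itertools import combinations

def p_score(matrix, x, y, a, b):
    """|(x.a - y.a) - (x.b - y.b)| for objects x, y and attributes a, b."""
    return abs((matrix[x][a] - matrix[y][a]) - (matrix[x][b] - matrix[y][b]))

def is_delta_pcluster(matrix, objects, attributes, delta):
    """True if every pair of objects and every pair of attributes satisfies the bound."""
    return all(
        p_score(matrix, x, y, a, b) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attributes, 2)
    )

# Toy data (made up): rows 0 and 1 follow nearly the same shifting pattern.
toy = [
    [1.0, 2.0, 4.0],
    [3.0, 4.5, 6.0],
    [5.0, 1.0, 0.0],
]
tight = is_delta_pcluster(toy, [0, 1], [0, 1, 2], delta=0.5)
loose = is_delta_pcluster(toy, [0, 1, 2], [0, 1, 2], delta=0.5)
```

The check enumerates every pair of objects and every pair of attributes, so its cost grows quadratically in both; it is meant only to make the definition concrete.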

In [5], Liu and Wang presented the model of OP-Cluster. Under this model,

two objects $g_i$, $g_j$ are similar on a subset of attributes S if the values of these two

objects induce the same relative order of those attributes. An efficient algorithm,

OPC-Tree, was developed.
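The order-preserving idea can be sketched as follows: sort the attributes of each object by value and compare the resulting orderings. This is a simplified illustration, not the OPC-Tree algorithm of [5]; the function names, the tie-breaking by attribute index, and the toy matrix are assumptions.

```python
def attribute_order(values, attributes):
    """Attributes sorted by ascending value for one object.

    Ties are broken by attribute index, which is an assumption of this sketch.
    """
    return tuple(sorted(attributes, key=lambda a: (values[a], a)))

def same_order(matrix, objects, attributes):
    """True if all objects induce the same relative order on the attributes."""
    orders = {attribute_order(matrix[g], attributes) for g in objects}
    return len(orders) <= 1

# Toy data (made up): rows 0 and 1 both rise from attribute 0 to 2, row 2 falls.
toy = [
    [1.0, 2.0, 4.0],
    [3.0, 4.5, 6.0],
    [5.0, 1.0, 0.0],
]
rising = same_order(toy, [0, 1], [0, 1, 2])
mixed = same_order(toy, [0, 1, 2], [0, 1, 2])
```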

2.1 New Progress in This Paper

Since 2002, we have been systematically developing pattern-based clustering

methods for mining microarray data, e.g., [6,4,3]. For example, we proposed a


model for coherent clusters, a specific type of pattern-based clusters, in the novel

gene-sample-time series microarray data sets, and developed algorithms Sample-Gene

Search and Gene-Sample Search [4]. Sample-Gene Search was shown to be more

efficient.

This paper is critically different from [4] and other previous studies on pattern-based

clustering in the following respects. First, the methods discussed

in [4] enumerate all pattern-based clusters. As discussed before, although MaPle,

OPC-Tree, Gene-Sample Search and Sample-Gene Search can find the complete

set of the pattern-based clusters in a data set, they may not be effective in

handling the two challenges discussed in Section 1. In this paper, we address the

challenges by proposing a general quality-driven pattern-based clustering frame-

work. Instead of enumerating all the pattern-based clusters, we mine only the

top-k clusters here according to a quality/utilization function specified by users.

None of the existing methods can mine such top-k clusters.

Second, [4] studies a specific type of microarray data sets. In this paper,

we do not focus on a specific model. Instead, we generalize several previously

proposed pattern-based clustering models and propose a general approach.

Last, [4] and this paper share the framework of pattern-growth approaches,

i.e., both methods conduct depth-first search. However, due to the quality-driven

mining requirements, in this paper, we develop techniques to prune futile search

subspaces using the quality criteria (e.g., Section 4.1 and Rule 3). The algorithm

developed in this paper inherits and generalizes the technical merits from [4,6].

3 Mining Quality Pattern-based Clusters

For a set of n genes G-Set = {$g_1$,...,$g_n$} and a set of m samples S-Set =

{$s_1$,...,$s_m$}, the expression levels of the genes on the samples form a matrix

M = {$m_{i,j}$}, where $m_{i,j}$ is the expression level of gene $g_i$ (1 ≤ i ≤ n) on sample

$s_j$ (1 ≤ j ≤ m). A cluster is a submatrix C = (G,S) of M, i.e., G ⊆ G-Set

and S ⊆ S-Set, such that C is coherent. Here, the coherence of C describes how

coherently the genes in G exhibit expression patterns on the set of samples S.

The measure of coherence varies in different specific pattern-based clustering

models. In this paper, we are interested in constructing a general model instead

of proposing another measure of coherence. Thus, we assume that the coherence

of a submatrix is given by a function cScore such that (1) cScore(C) ≥ 0 for any

submatrix C; and (2) for submatrices $C_1$ and $C_2$, if cScore($C_1$) > cScore($C_2$),

then $C_1$ is more coherent than $C_2$.

For a specific model, it is easy to revise the coherence measure to satisfy the

above two requirements. For example, the bi-Cluster model [1] minimizes the

mean squared residue score $r_{IJ}$ (Equation 1). Since the score is always greater

than or equal to 0, minimizing $r_{IJ}$ is equivalent to maximizing $1/r_{IJ}$. Thus, we

can use the following cScore() function.

$$\mathrm{cScore}(C) = \frac{1}{\sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2} \tag{2}$$


For δ-pCluster, we can use the following function.

$$\mathrm{cScore}(C) = \begin{cases} 1 & \text{if } \mathrm{pScore}(X) \le \delta \text{ for any } 2 \times 2 \text{ submatrix } X \text{ of } C \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

For OP-Cluster, we have

$$\mathrm{cScore}(C) = \begin{cases} 1 & \text{if patterns in } C \text{ follow the same ordering} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Moreover, for coherent gene cluster [4], we can specify the cScore function as

follows.

$$\mathrm{cScore}(C) = \begin{cases} 1 & \text{if in } C \text{ each gene is coherent across the samples} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

In real applications, users often have a preference among the clusters. For

example, in mining gene expression data, clusters with a high coherence score

and a large number of genes and samples are strongly preferred. Accordingly, we

define the quality measure of clusters as follows.

Definition 1 (Quality of a cluster). Let C = (G,S) be a submatrix of a

microarray data set M. The quality of C is defined as quality(C) = size(C) ·

cScore(C), where size(C) = |G| · |S| and cScore is the coherence function.
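Definition 1 can be illustrated with a small sketch that multiplies the size of a submatrix by a pluggable coherence function. The 0/1 coherence function `binary_c_score` below follows the δ-pCluster style score; its name, the default `delta`, and the toy matrix are assumptions made for this example.

```python
from itertools import combinations

def binary_c_score(matrix, genes, samples, delta=0.5):
    """Hypothetical 0/1 coherence: 1 if every 2 x 2 submatrix has pScore <= delta."""
    coherent = all(
        abs((matrix[x][a] - matrix[y][a]) - (matrix[x][b] - matrix[y][b])) <= delta
        for x, y in combinations(genes, 2)
        for a, b in combinations(samples, 2)
    )
    return 1 if coherent else 0

def quality(matrix, genes, samples, c_score):
    """quality(C) = size(C) * cScore(C), where size(C) = |G| * |S|."""
    return len(genes) * len(samples) * c_score(matrix, genes, samples)

# Toy data (made up): genes 0 and 1 are coherent on all three samples.
toy = [
    [1.0, 2.0, 4.0],
    [3.0, 4.5, 6.0],
    [5.0, 1.0, 0.0],
]
q_large = quality(toy, [0, 1], [0, 1, 2], binary_c_score)     # 6 cells, coherent
q_small = quality(toy, [0, 1], [0, 2], binary_c_score)        # 4 cells, coherent
q_whole = quality(toy, [0, 1, 2], [0, 1, 2], binary_c_score)  # 9 cells, not coherent
```

With a 0/1 coherence function, quality reduces to the size of the cluster whenever it is coherent, so the larger coherent submatrix is ranked above its sub-clusters.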

For a set of clusters that have no overlap, the quality of the set of clusters is

simply the sum of the quality of each cluster. However, when there exist some

overlaps, we have to make sure that each overlapping cell contributes to the

total quality only once, and the contribution goes to the highest-quality cluster

that contains the overlap.
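The counting rule for overlapping cells can be sketched as follows. The sketch takes the contribution of a cell to be the highest cScore among the clusters that contain it, as in Definition 2; representing a cluster as a `(genes, samples, c_score)` triple and the name `set_quality` are assumptions made for illustration.

```python
def set_quality(clusters):
    """Quality of a set of clusters given as (genes, samples, c_score) triples.

    Each covered cell (i, j) is counted once, with the highest cScore
    among the clusters that contain it.
    """
    best = {}  # cell (i, j) -> highest cScore among the clusters covering it
    for genes, samples, score in clusters:
        for i in genes:
            for j in samples:
                best[(i, j)] = max(best.get((i, j), 0.0), score)
    return sum(best.values())

# Toy clusters (made up): c2 overlaps c1 on cells (1, 1) and (1, 2).
c1 = ([0, 1], [0, 1, 2], 2.0)  # 6 cells with cScore 2.0
c2 = ([1, 2], [1, 2], 1.0)     # 4 cells with cScore 1.0
separate = 6 * 2.0 + 4 * 1.0   # naive sum of the individual qualities
combined = set_quality([c1, c2])
```

In this example the two overlapping cells are credited to c1 only, so the combined quality (14.0) is lower than the naive sum of the two individual qualities (16.0).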

Definition 2 (Quality of a set of clusters). Let Ω be a set of submatrices.

The quality of Ω is defined as

$$\mathrm{quality}(\Omega) = \sum_{m_{i,j} \in \bigcup_{C \in \Omega} C} Q(m_{i,j}),$$

where $Q(m_{i,j}) = \max\{\mathrm{cScore}(C) \mid (C \in \Omega) \wedge (m_{i,j} \in C)\}$.

Suppose a user wants a set of k clusters that have the best quality. The problem

can then be formulated as computing a set Ω = {$C_1$,...,$C_k$} of k submatrices

such that quality(Ω) is globally maximized. However, given different numbers of

clusters k and k′ such that k < k′, the corresponding optimized sets of clusters

Ω and Ω′ may not be consistent. In other words, since we maximize the quality

function on a global level, a quality cluster C ∈ Ω may not necessarily appear in

Ω′. The inconsistency among the mining results with respect to different numbers

of clusters is undesirable, since the number of clusters k is usually unknown

a priori.

To address this problem, we turn to a greedy framework. The main idea is

that we compute a series of k clusters Ω = {$C_1$, ..., $C_k$} such that (1) $C_1$

is the cluster with the highest quality; and (2) for $C_i$ (i ≥ 2), $C_i$ is a cluster

maximizing the “quality improvement” with respect to $C_1$,...,$C_{i−1}$. In this
