Clustering Scatter Plots Using Data Depth Measures.
ABSTRACT Clustering is rapidly becoming a powerful data mining technique, and has been broadly applied to many domains such as bioinformatics and text mining. However, the existing methods can only deal with a data matrix of scalars. In this paper, we introduce a hierarchical clustering procedure that can handle a data matrix of scatter plots. To more accurately reflect the nature of data, we introduce a dissimilarity statistic based on "data depth" to measure the discrepancy between two bivariate distributions without oversimplifying the nature of the underlying pattern. We then combine hypothesis testing with hierarchical clustering to simultaneously cluster the rows and columns of the data matrix of scatter plots. We also propose novel painting metrics and construct heat maps to allow visualization of the clusters. We demonstrate the utility and power of our new clustering method through simulation studies and application to a microbe-host interaction study.
Research Article
Open Access
Zhang, et al. J Biomet Biostat 2011, S5
http://dx.doi.org/10.4172/21556180.S5001
Biometrics & Biostatistics
J Biomet Biostat ISSN:21556180 JBMBS, an open access journal Biostatistics: Computational statistics
Keywords: Clustering; Scatter Plot; Data Depth; Quality Index; Visualization
Introduction
Clustering is rapidly becoming a powerful data mining technique,
and has been broadly applied to many domains such as bioinformatics
[1,2] and text mining [3]. Usually the data are arranged in a data matrix
where each row corresponds to an object and each column to a variable
on which objects are characterized. Each element of this matrix is a
real number, representing the measurement of an object on a specific
variable. To avoid confusion, we call this matrix “the data matrix of
scalars”.
Two one-dimensional clustering methods are commonly used.
Hierarchical clustering builds a hierarchy of clusters based on the
dissimilarity measures among objects, and its results can be presented
graphically in a tree structure called a dendrogram. Partitioning
clustering, such as k-means, divides the objects into a pre-specified
number of clusters in which each object belongs to the cluster with the
nearest mean. See [4,5] for a survey.
Co-clustering, also called biclustering, bivariate clustering, or
two-mode clustering, clusters rows and columns simultaneously. Unlike
the one-dimensional clustering methods that seek to identify similar
rows or columns independently, co-clustering seeks to identify “blocks”
(or “co-clusters”) of rows and columns that show highly interrelated
coherence. For example, in gene expression analysis, co-clustering
can be used to solve the dual problem of identifying a set of genes and
conditions simultaneously involved in a metabolic process, a problem
that traditional one-dimensional clustering methods cannot handle.
See [6-9] for detailed reviews.
However, when each cell of the data matrix contains a scatter plot
rather than a single numerical value, none of the existing clustering
methods is applicable. One could adapt the current clustering methods by
using a single measure, say the Pearson correlation coefficient, to
summarize the association between each row variable and each column
variable, which reduces the data matrix of scatter plots to a data matrix
of scalars. But the Pearson correlation coefficient is not always
sufficient, since it measures only linear association and is very sensitive
to outliers. Distance measures among objects based on such coefficients
will therefore hinder the discovery of clusters of scatter plots
with nonlinear patterns and/or clusters with outliers.
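The two weaknesses of the Pearson coefficient mentioned above can be seen on toy data. The following is a minimal sketch using only the Python standard library; the data values and the helper name `pearson` are made up for illustration and do not come from the study.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A deterministic but nonlinear pattern: y = x^2 on a symmetric grid.
xs = [-2, -1, 0, 1, 2]
r_parabola = pearson(xs, [x * x for x in xs])      # 0: the pattern is invisible

# A perfectly decreasing line, then the same points plus one outlier.
line_x, line_y = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
r_line = pearson(line_x, line_y)                   # -1
r_outlier = pearson(line_x + [20], line_y + [20])  # sign flips, roughly 0.92
```

A single added point is enough to turn a correlation of -1 into a strongly positive one, which is the kind of distortion a depth-based comparison of whole scatter plots is meant to avoid.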
In this paper we introduce a hierarchical clustering procedure that
can handle a data matrix of scatter plots. In Section 2, to more accurately
reflect the nature of data, we introduce a dissimilarity statistic based
on “data depth” to measure the discrepancy between two bivariate
distributions without oversimplifying the nature of the underlying
pattern. We then combine hypothesis testing with hierarchical
clustering to simultaneously cluster the rows and columns of the data
matrix of scatter plots. We also propose novel painting metrics and
construct heat maps to allow visualization of the clusters. In Sections
3 and 4, we demonstrate the utility and power of our new clustering
method through simulation studies and application to a microbe-host
interaction study.
Methodology
Clustering procedure
Consider a set of row variables $\{X_1, X_2, \ldots, X_M\}$ and a set of column
variables $\{Y_1, Y_2, \ldots, Y_N\}$. For each pair of row and column, a number
of observations are taken that can be drawn as a scatter plot in the
Cartesian plane. Our goal is to cluster both rows and columns based on
these M×N scatter plots.
Zhanpan Zhang1, Xinping Cui1, Daniel R Jeske1, Xiaoxiao Li2, Jonathan Braun2,3 and James Borneman4
1Department of Statistics, University of California, Riverside, CA, USA
2Department of Molecular and Medical Pharmacology, University of California, Los Angeles, CA, USA
3Department of Pathology and Laboratory Medicine, University of California, Los Angeles, CA, USA
4Department of Plant Pathology and Microbiology, University of California, Riverside, CA, USA
*Corresponding author: Xinping Cui, Department of Statistics, University of California, Riverside, CA, USA, Email: xinping.cui@ucr.edu
Received October 13, 2011; Accepted December 01, 2011; Published December 25, 2011
Citation: Zhang Z, Cui X, Jeske DR, Li X, Braun J, et al. (2011) Clustering Scatter Plots Using Data Depth Measures. J Biomet Biostat S5:001. doi:10.4172/21556180.S5001
Copyright: © 2011 Zhang Z, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
To obtain the distance matrix for the hierarchical clustering of rows,
we have to calculate the distance between any two rows. Given the ith
row and the jth row, we would like to measure how similar these two rows
are to each other by comparing the corresponding N pairs of scatter
plots. For each column, say the kth column, the pair of scatter plots can
be thought of as samples taken from two independent bivariate
distributions $F_{ik}$ and $F_{jk}$, respectively, as shown in Figure 1,
in which each square contains a scatter plot. As a result, the problem of
comparing the pair of scatter plots can be formulated as testing the
following hypotheses:

$$H_0: F_{ik} = F_{jk} \quad \text{vs.} \quad H_a: F_{ik} \neq F_{jk}. \tag{1}$$

Denote by $(p\text{-value})_{ijk}$ the p-value for testing the above hypotheses.
The smaller the p-value, the less similar the two scatter plots are to
each other. By testing the same kind of hypotheses for all N columns,
we define the dissimilarity (distance) between the ith row and the jth
row as

$$dist_{ij} = \sum_{k=1}^{N} \left(1 - (p\text{-value})_{ijk}\right). \tag{2}$$

The distance matrix for rows $\{dist_{ij}\}$ ($i, j = 1, 2, \ldots, M$, $i \neq j$) is
then passed to a standard hierarchical clustering algorithm, which
initially regards each row as an individual cluster and, at each step,
merges the closest pair of clusters until all the rows are merged into one
cluster. In doing so, hierarchical clustering creates a hierarchy of row
clusters that can be represented in a tree structure called a dendrogram.
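The step from p-values to a row dendrogram can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the p-values are hypothetical, the function names are ours, and the naive average-linkage loop stands in for a library routine (in practice something like SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would normally be used).

```python
from itertools import combinations

def row_distance(pvalues_ij):
    """dist_ij = sum over the N columns of (1 - p-value_ijk)."""
    return sum(1.0 - p for p in pvalues_ij)

def average_linkage(dist, n):
    """Naive agglomerative clustering with average linkage.

    dist maps frozenset({i, j}) to the distance between rows i and j.
    Returns the merges in order as (cluster_a, cluster_b, height).
    """
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        def between(a, b):
            # Average of all pairwise row distances between two clusters.
            total = sum(dist[frozenset((i, j))] for i in a for j in b)
            return total / (len(a) * len(b))
        a, b = min(combinations(clusters, 2), key=lambda ab: between(*ab))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Hypothetical p-values for M = 3 rows and N = 2 columns.
pvalues = {
    (0, 1): [0.90, 0.80],  # rows 0 and 1 look alike in both columns
    (0, 2): [0.02, 0.10],
    (1, 2): [0.05, 0.01],
}
dist = {frozenset(pair): row_distance(p) for pair, p in pvalues.items()}
merges = average_linkage(dist, n=3)  # rows 0 and 1 merge first
```

With N columns each distance lies between 0 and N, so the merge heights in the dendrogram are on that scale as well.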
The same clustering procedure can be applied to the columns.
The rows and the columns in the original data matrix of
scatter plots (Figure 1) are then reordered according to the row dendrogram
and the column dendrogram, respectively, which produces a new data
matrix of scatter plots that is the output of our proposed clustering
procedure.
Hypotheses testing
Liu and Singh [10] proposed a multivariate rank sum test for
the hypotheses $H_0: F_{ik} = F_{jk}$ vs. $H_a: F_{ik} \neq F_{jk}$, where $F_{ik}$ and $F_{jk}$
are the distribution functions of two independent populations.
Specifically, the test statistic is based on a quality index that measures
the overall “outlyingness” of population $F_{jk}$ relative to population $F_{ik}$,

$$Q(F_{ik}, F_{jk}) = P\left(D(F_{ik}; U) \le D(F_{ik}; V) \mid U \sim F_{ik},\ V \sim F_{jk}\right), \tag{3}$$

where $D(F_{ik}; \cdot)$ is an affine-invariant data depth function with respect
to $F_{ik}$, such as Mahalanobis depth, Tukey (halfspace) depth, or
simplicial depth (refer to Section 6.1 for details).
Given two samples $\{U_1, \ldots, U_S\}$ from $F_{ik}$ and $\{V_1, \ldots, V_T\}$ from $F_{jk}$,
$Q(F_{ik}, F_{jk})$ can be estimated as

$$Q(F_{ik}^S, F_{jk}^T) = \frac{1}{T} \sum_{t=1}^{T} R(F_{ik}^S; V_t), \tag{4}$$

where $F_{ik}^S$ and $F_{jk}^T$ are the empirical distributions, $R(F_{ik}^S; V_t)$ is the
proportion of $U_s$'s with $D(F_{ik}^S; U_s) \le D(F_{ik}^S; V_t)$, and $D(F_{ik}^S; \cdot)$ is the
empirical data depth with respect to $F_{ik}^S$. From [10,11], we have

$$Q(F_{ik}^S, F_{jk}^T) - 1/2 \sim AN\left(0,\ (1/S + 1/T)/12\right) \tag{5}$$

under $H_0: F_{ik} = F_{jk}$ for many commonly used data depth functions
(under general regularity conditions), where $AN$ denotes asymptotic
normality.
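To make the estimate of the quality index concrete, here is a small sketch for bivariate data using the sample version of Mahalanobis depth. It is written under our own assumptions: the function names, sample sizes and simulated data are illustrative only, and the 2×2 covariance matrix is inverted by hand to keep the block free of dependencies.

```python
import random

def mahalanobis_depth_fn(sample):
    """Return x -> 1 / (1 + (x - mean)' Cov^{-1} (x - mean)) for a 2-D sample."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy ** 2
    def depth(x):
        dx, dy = x[0] - mx, x[1] - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return 1.0 / (1.0 + d2)
    return depth

def quality_index(u_sample, v_sample):
    """Average over the V_t of the proportion of U_s no deeper than V_t."""
    depth = mahalanobis_depth_fn(u_sample)      # depth w.r.t. the reference sample
    u_depths = [depth(u) for u in u_sample]
    total = 0.0
    for v in v_sample:
        dv = depth(v)
        total += sum(du <= dv for du in u_depths) / len(u_depths)
    return total / len(v_sample)

rng = random.Random(0)  # fixed seed, illustration only

def draw(n, shift):
    """n points from a standard bivariate normal moved by (shift, shift)."""
    return [(rng.gauss(0, 1) + shift, rng.gauss(0, 1) + shift) for _ in range(n)]

u = draw(200, 0.0)
q_same = quality_index(u, draw(200, 0.0))      # expected to be close to 1/2
q_shifted = quality_index(u, draw(200, 3.0))   # expected to be close to 0
```

When both samples come from the same distribution the estimate should sit near 1/2, while a location shift pushes the second sample to the outskirts of the first and drives the estimate toward 0.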
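Notice that the overall “outlyingness” of $F_{ik}$ relative to $F_{jk}$ can also be
measured by a quality index

$$Q(F_{jk}, F_{ik}) = P\left(D(F_{jk}; V) \le D(F_{jk}; U) \mid V \sim F_{jk},\ U \sim F_{ik}\right), \tag{6}$$

where $D(F_{jk}; \cdot)$ is an affine-invariant data depth function with respect to
$F_{jk}$. Likewise, $Q(F_{jk}, F_{ik})$ may be estimated as

$$Q(F_{jk}^T, F_{ik}^S) = \frac{1}{S} \sum_{s=1}^{S} R(F_{jk}^T; U_s), \tag{7}$$

where $R(F_{jk}^T; U_s)$ is the proportion of $V_t$'s with $D(F_{jk}^T; V_t) \le D(F_{jk}^T; U_s)$,
and $D(F_{jk}^T; \cdot)$ is the empirical data depth with respect to $F_{jk}^T$.
It can be shown that $Q(F_{jk}, F_{ik})$ is not directly related to $Q(F_{ik}, F_{jk})$
(refer to Section 6.2 for more explanation). Intuitively, however,
we would like to have a unique parameter to measure the difference
between two distributions, whether comparing $F_{ik}$ to $F_{jk}$ or $F_{jk}$ to $F_{ik}$.
Under $H_0: F_{ik} = F_{jk}$, $Q(F_{ik}, F_{jk}) = Q(F_{jk}, F_{ik}) = 1/2$. With a location
shift and/or scale change between $F_{ik}$ and $F_{jk}$, either $Q(F_{ik}, F_{jk})$ or
$Q(F_{jk}, F_{ik})$, or both, would deviate noticeably from 1/2. Therefore, to avoid
having one distribution as the reference distribution, we propose a new
quality index, called TS, to measure the overall “difference” between
$F_{ik}$ and $F_{jk}$,

$$TS = \begin{cases} Q(F_{ik}, F_{jk}), & \text{if } |Q(F_{ik}, F_{jk}) - 1/2| > |Q(F_{jk}, F_{ik}) - 1/2|; \\ Q(F_{jk}, F_{ik}), & \text{if } |Q(F_{ik}, F_{jk}) - 1/2| < |Q(F_{jk}, F_{ik}) - 1/2|. \end{cases} \tag{8}$$

The test statistic for testing $H_0: F_{ik} = F_{jk}$ vs. $H_a: F_{ik} \neq F_{jk}$ is the
estimate of TS,

$$\widehat{TS} = \begin{cases} Q(F_{ik}^S, F_{jk}^T), & \text{if } |Q(F_{ik}^S, F_{jk}^T) - 1/2| > |Q(F_{jk}^T, F_{ik}^S) - 1/2|; \\ Q(F_{jk}^T, F_{ik}^S), & \text{if } |Q(F_{ik}^S, F_{jk}^T) - 1/2| < |Q(F_{jk}^T, F_{ik}^S) - 1/2|. \end{cases} \tag{9}$$

Then $(p\text{-value})_{ijk}$ is calculated by the following permutation test
procedure:
1. Pool the two samples $\{U_1, \ldots, U_S\}$ and $\{V_1, \ldots, V_T\}$.
2. Take a sample of size S without replacement, $\{U_1^*, \ldots, U_S^*\}$, from
the pooled sample; the remaining points form $\{V_1^*, \ldots, V_T^*\}$. These are
called the two permutation samples.
3. Estimate $Q(F_{ik}, F_{jk})$ and $Q(F_{jk}, F_{ik})$ by $Q^*(F_{ik}^S, F_{jk}^T)$ and
$Q^*(F_{jk}^T, F_{ik}^S)$, respectively, based on the permutation samples obtained in
Step 2.
4. Set $TS^*$ to be equal to $Q^*(F_{ik}^S, F_{jk}^T)$ if
$|Q^*(F_{ik}^S, F_{jk}^T) - 1/2| > |Q^*(F_{jk}^T, F_{ik}^S) - 1/2|$; and equal to
$Q^*(F_{jk}^T, F_{ik}^S)$ otherwise.
5. Repeat the above steps (Step 2 - Step 4) B times to yield B
values of $TS^*$, denoted by $TS_b^*$ ($b = 1, 2, \ldots, B$), whose distribution
estimates the sampling distribution of the test statistic $\widehat{TS}$
under $H_0: F_{ik} = F_{jk}$.
6. Let $p_{lower}$ be the proportion of $TS_b^*$'s with $TS_b^* \le \widehat{TS}$, and $p_{upper}$ the
proportion of $TS_b^*$'s with $TS_b^* \ge \widehat{TS}$. Hence
$(p\text{-value})_{ijk} = 2 \times \min(p_{lower}, p_{upper})$.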
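The permutation procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: it uses the sample Mahalanobis depth for bivariate data (restated here so the block runs on its own), treats ties in the TS rule as the "otherwise" branch, caps the returned value at 1, and uses hypothetical function names, sample sizes and B.

```python
import random

def depth_fn(sample):
    """Empirical 2-D Mahalanobis depth with respect to `sample`."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy ** 2
    def depth(x):
        dx, dy = x[0] - mx, x[1] - my
        return 1.0 / (1.0 + (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return depth

def q_hat(ref, other):
    """Estimated quality index of `other` relative to the reference sample."""
    depth = depth_fn(ref)
    ref_depths = [depth(p) for p in ref]
    ranks = [sum(d <= d_o for d in ref_depths) / len(ref) for d_o in map(depth, other)]
    return sum(ranks) / len(other)

def ts_hat(u, v):
    """Keep whichever of the two estimated indices deviates more from 1/2."""
    q_uv, q_vu = q_hat(u, v), q_hat(v, u)
    return q_uv if abs(q_uv - 0.5) > abs(q_vu - 0.5) else q_vu

def permutation_pvalue(u, v, B=200, rng=None):
    rng = rng or random.Random(0)
    observed = ts_hat(u, v)
    pooled = list(u) + list(v)                        # Step 1
    stats = []
    for _ in range(B):                                # Step 5
        rng.shuffle(pooled)                           # Step 2: random split S / T
        stats.append(ts_hat(pooled[:len(u)], pooled[len(u):]))  # Steps 3-4
    p_lower = sum(s <= observed for s in stats) / B   # Step 6
    p_upper = sum(s >= observed for s in stats) / B
    return min(1.0, 2 * min(p_lower, p_upper))        # cap at 1 is our addition

data_rng = random.Random(1)
u = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
v_same = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
v_far = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(40)]
p_same = permutation_pvalue(u, v_same, rng=random.Random(2))
p_far = permutation_pvalue(u, v_far, rng=random.Random(2))   # expected to be tiny
```

Shuffling the pooled list and splitting it at position S is equivalent to drawing S points without replacement and keeping the remainder as the second permutation sample.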
Figure 1: Data structure: a data matrix of scatter plots. Rows correspond to $X_1, \ldots, X_M$ and columns to $Y_1, \ldots, Y_N$; the cells of rows $i$ and $j$ in column $k$ hold samples from $F_{ik}$ and $F_{jk}$.
Data visualization
Data visualization is an important aspect of the clustering technique.
In the traditional hierarchical clustering application, where cells in a
data matrix are scalars, the original data can be rearranged according
to the dissimilarity scores between rows (or columns). The smaller the
dissimilarity score between two rows (or columns), the closer the two
rows (or columns) are placed. A graphical representation of the rearranged
data matrix, called a heat map, can be created where the cells are color
coded based on their scalar values. We would expect cells in close
proximity to each other to have a similar color.
However, it is not straightforward to apply the above painting
strategy to a data matrix of scatter plots, since scatter plots cannot be
distinguished from each other by a single color painting system alone. In
the following we introduce three painting metrics and demonstrate how
to use them to graphically represent the clusters of scatter plots
so that similar scatter plots share similar color paintings whereas
different scatter plots correspond to different color paintings.
1. Center Deviation Index (CDI): All the M×N scatter plots are
pooled as a single scatter plot that is thought of as a sample from
the bivariate distribution $F_{pool}$. For any scatter plot, regarded as a
sample from a bivariate distribution, we define its center as the point
that maximizes the empirical data depth with respect to that sample.
Then the CDI for a scatter plot is the distance between its center and
the center of the pooled scatter plot. For example, in Figure 2a, the
length of the red segment is the CDI measuring the deviation of the
scatter plot consisting of blue points from the pooled scatter plot
consisting of black points.
2. Center Deviation Direction Index (CDDI): By taking the
center of the pooled scatter plot as the origin of a new Cartesian
coordinate system, the CDDI for a scatter plot is the angle formed by
the vector from the origin to its center and the positive x-axis, which
ranges from −π to π. The CDDI depicts the relative location of a scatter
plot with respect to the pooled scatter plot, and hence the relative
locations among the scatter plots. For example, in Figure 2b, the CDDI
for the blue scatter plot is the angle formed by the two red vectors.
3. Dispersion Index (DI): Consider a scatter plot that is regarded
as a sample from the bivariate distribution $G$. We move this
scatter plot such that its center and the center of the pooled
scatter plot overlap, which produces a shifted scatter plot that is
regarded as a sample from a new bivariate distribution $G'$. The
DI for the original scatter plot is the estimate of the quality
index $Q(F_{pool}, G')$, which accounts for the difference between the
original scatter plot and the pooled scatter plot excluding the
effect due to the location shift. For example, in Figure 2c, the
DI for the blue scatter plot is the estimated quality index of the
red scatter plot (obtained by moving the blue scatter plot) with
respect to the pooled scatter plot.
To better understand the utility of the above three painting
metrics, we present a number of painting examples. In each
example, an 8×8 matrix of scatter plots (each scatter plot contains 100
data points) was generated with the top left 4×4, top right 4×4,
bottom left 4×4 and bottom right 4×4 scatter plots following the
four different distributions specified in the left panels of Figures
3-6. We then obtained 8×8 matrices of CDI, CDDI and DI,
from which three heat maps were generated, as
shown in the right panels of Figures 3-6, where the “red” heat
map is based on CDI, the “blue” heat map on CDDI, the “green”
heat map on DI, and “black” stands for the minimum
index value in all three heat maps. Note that for simplicity, we
used Mahalanobis depth in all the examples discussed here.
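The two location-based metrics, and the centering step that precedes the DI, can be sketched as below. The sketch assumes Mahalanobis depth, for which the point of maximal empirical depth is the sample mean; with another depth function the center would have to be found differently. The function names and the toy scatter plots are ours.

```python
import math

def center(points):
    """Sample mean, the point of maximal empirical Mahalanobis depth."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def cdi(points, pooled):
    """Distance between the plot's center and the pooled center."""
    (cx, cy), (px, py) = center(points), center(pooled)
    return math.hypot(cx - px, cy - py)

def cddi(points, pooled):
    """Angle of the plot's center seen from the pooled center, in (-pi, pi]."""
    (cx, cy), (px, py) = center(points), center(pooled)
    return math.atan2(cy - py, cx - px)

def shift_to_pooled_center(points, pooled):
    """Translate the plot so that its center coincides with the pooled center."""
    (cx, cy), (px, py) = center(points), center(pooled)
    return [(x - cx + px, y - cy + py) for x, y in points]

# Two toy scatter plots placed symmetrically around the pooled center (3, 3).
plot_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
plot_b = [(4, 4), (6, 4), (4, 6), (6, 6)]
pooled = plot_a + plot_b

same_cdi = abs(cdi(plot_a, pooled) - cdi(plot_b, pooled)) < 1e-12  # True
angle_a, angle_b = cddi(plot_a, pooled), cddi(plot_b, pooled)      # -3*pi/4, pi/4
shifted_a = shift_to_pooled_center(plot_a, pooled)
```

The DI would then be obtained by applying the quality index estimate to the pooled scatter plot and the shifted plot. In this toy case the two plots have identical CDI but opposite CDDI, which mirrors the situation described for Painting Example 2.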
Figure 2: Three painting metrics: (a) CDI, (b) CDDI, (c) DI.
Figure 3: Painting Example 1: the bivariate normal distributions differ by location only and are asymmetric about the origin, therefore only CDI can reveal clusters of scatter plots. (Heat map panels: CDI, CDDI, DI.)
Figure 4: Painting Example 2: the bivariate normal distributions differ by location only and are symmetric about the origin, therefore only CDDI can reveal clusters of scatter plots. (Heat map panels: CDI, CDDI, DI.)
Figure 5: Painting Example 3: the bivariate normal distributions differ by scale only, therefore only DI can reveal clusters of scatter plots. (Heat map panels: CDI, CDDI, DI.)
Figure 6: Painting Example 4: the bivariate normal distributions differ by both location and scale and are asymmetric about the origin, therefore CDI, CDDI and DI all reveal clusters of scatter plots. (Heat map panels: CDI, CDDI, DI.)
Simulation study
To investigate the power of our proposed clustering method, we
performed a series of simulation studies. The basic procedure is as
follows:
1. Specify a “checkerboard” data pattern with a set of row clusters
and column clusters in which all cells of a block share the same
bivariate distribution;
2. Generate random samples from the given bivariate
distributions, and create a data matrix of scatter plots;
3. Apply our proposed clustering method to this data matrix of
scatter plots, and check whether the original data pattern can be
retrieved. That is, we check whether rows within the same
block are still close to each other compared to other rows in the
row dendrogram, and likewise for columns; or equivalently, whether
there exists a cut of the row dendrogram such that the resulting
set of branches is exactly the same as the original set of row
clusters, and likewise for columns;
4. Repeat Steps 2 and 3 a number of times, and record the success
rate, the proportion of times that we succeed in retrieving the
original data pattern, which acts as the power measurement for
our proposed clustering method.
Intuitively, the total number of rows and columns (the size of the
data matrix of scatter plots, or the data size), the number of rows and
columns within each block (the block size), and the number of blocks
would affect the success rate. Therefore, we considered the three data
pattern settings shown in Figure 7.
For each setting, we specified a class of bivariate normal distributions
for the blocks, which differ only in location. Specifically, the
x-coordinate of the mean increases equidistantly along the row direction,
starting from 0, with the y-coordinate of the mean remaining the same;
whereas the y-coordinate of the mean increases equidistantly along the
column direction, starting from 0, with the x-coordinate of the mean
remaining the same. For example, with a location shift of 1, the mean of
the top left bivariate normal distribution in “R2C2” and “2*R2C2” is
$(0, 0)'$, the top right $(1, 0)'$, the bottom left $(0, 1)'$, and the bottom
right $(1, 1)'$. Furthermore, 50 data points were generated for each scatter
plot, Mahalanobis depth was adopted, 500 resamples were drawn for the
permutation test, and the “average” linkage method was chosen for the
hierarchical clustering procedure. We performed 500 simulations for each
setting. The relationship between the success rate and the location
shift is summarized in Figure 8, where the solid lines stand for the
variance-covariance matrix $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ specified for the bivariate
normal distributions, and the dashed lines for $\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ with the
correlation coefficient r = 0.5.
From Figure 8, we may observe the following:
1. Comparing the solid line with the dashed line for each setting
shows that correlation in the bivariate normal distribution improves
the success rate.
2. Comparing “R2C2” with “2*R2C2”, which have the same number
of blocks: with a relatively large location shift, the larger the block
size, the higher the success rate; with a relatively small location
shift, the smaller the block size, the higher the success rate.
That is, more scatter plots with a larger distance between blocks
improve the chance of capturing the pattern, whereas more
scatter plots with a smaller distance between blocks introduce a
higher chance of noise in the clustering.
3. Comparing “2*R2C2” with “R4C4”, which have the same data
size: the smaller the number of blocks, the higher the success
rate, which means that a finer partition (more row clusters and
column clusters) is harder to recover.
4. Comparing “R2C2” with “R4C4”, which have the same block size:
with a relatively small location shift, the smaller the number of
blocks, the higher the success rate; with a relatively large location
shift, the larger the number of blocks, the higher the success
rate. The reason is similar to that discussed in the
comparison of “R2C2” with “2*R2C2”.
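The data-generating part of the simulation (Steps 1 and 2 of the procedure) can be sketched as follows. This is a minimal sketch with hypothetical function names: block means follow the location-shift rule described in the text, variances are fixed at 2, and `cov_xy` is 0 for the uncorrelated case or 1 for the case with r = 0.5. The clustering and success check of Steps 3 and 4 are not included.

```python
import math
import random

def block_means(n_row_clusters, n_col_clusters, shift):
    """Mean of each block: x grows along a row, y grows down a column."""
    return [[(c * shift, r * shift) for c in range(n_col_clusters)]
            for r in range(n_row_clusters)]

def scatter_matrix(n_row_clusters, n_col_clusters, block_size, shift,
                   cov_xy=0.0, n_points=50, rng=None):
    """Checkerboard matrix of scatter plots with covariance [[2, cov_xy], [cov_xy, 2]]."""
    rng = rng or random.Random(0)
    means = block_means(n_row_clusters, n_col_clusters, shift)
    # Hand-written Cholesky factor of the 2x2 covariance matrix.
    a = math.sqrt(2.0)
    b = cov_xy / a
    c = math.sqrt(2.0 - b * b)
    def draw(mean):
        pts = []
        for _ in range(n_points):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            pts.append((mean[0] + a * z1, mean[1] + b * z1 + c * z2))
        return pts
    rows = n_row_clusters * block_size
    cols = n_col_clusters * block_size
    return [[draw(means[i // block_size][j // block_size]) for j in range(cols)]
            for i in range(rows)]

# The three settings as (row clusters, column clusters, block size).
settings = {"R2C2": (2, 2, 2), "2*R2C2": (2, 2, 4), "R4C4": (4, 4, 2)}
data = {name: scatter_matrix(r, c, b, shift=1.0, cov_xy=1.0)
        for name, (r, c, b) in settings.items()}
```

Each entry of `data` is a list of rows, each row a list of scatter plots, and each scatter plot a list of 50 (x, y) pairs, so “R2C2” yields a 4×4 matrix and the other two settings 8×8 matrices.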
Figure 7: Three data pattern settings: (a) “R2C2”: there are 2×2 blocks (2 row clusters and 2 column clusters), each of which contains 2×2 cells, thus the data size is 4×4. (b) “2*R2C2”: the block size is doubled relative to the “R2C2” setting, thus the data size is 8×8. (c) “R4C4”: there are 4×4 blocks (4 row clusters and 4 column clusters), each of which contains 2×2 cells, thus the data size is 8×8.
Figure 8: Success rate versus location shift.
Application
Our new clustering method should have utility for a variety of
biological applications, one of which is examining host-microbe
interactions. For example, consider the case of inflammatory bowel
disease (IBD). IBD is a disease that is likely caused by several factors,
including genetics, lifestyle and intestinal bacteria. To identify
putatively important host-microbe interactions, we recently examined
the amounts of bacteria and proteins in mucosal luminal interface
samples from IBD and healthy subjects [12].
Two datasets were generated from the experiment. “Microbe” data
were arranged as a data matrix with 81 rows (3 rows containing missing
values were excluded) standing for samples, 15 columns for microbes,
and each cell being a single numerical value recording the level of a
microbe in a sample. “Protein” data were also arranged as a data matrix
with 81 rows standing for the same set of samples, 440 columns for
proteins, and each cell being a single numerical value recording the
level of a protein in a sample. To identify associations between levels of
the microbes and proteins, we combined the above two data matrices
of scalars by treating each pair of columns (one from “Microbe” data,
the other from “Protein” data) as bivariate data with the x-axis being
microbe level and the y-axis being protein level, which results in a data
matrix of scatter plots as shown in Figure 1, where M=440, N=15, and
each scatter plot contains 81 data points.
Regarding the scatter plots as samples taken from the corresponding
independent bivariate distributions, we applied our proposed clustering
method to these 440×15 scatter plots, and clustered both proteins (rows)
and microbes (columns). Specifically, we used Mahalanobis depth as
the data depth measure, B=500 resamples for the permutation
test, and the “average” linkage method to perform the hierarchical
clustering.
We then cut the “Protein” dendrogram at the height of 6, which
generated 80 protein branches/clusters. The proteins within the same
branch are more similar to each other, or show more similar
microbe-protein patterns, than those in other branches. From the 80 protein
clusters, we selected only those containing at least 20 proteins, which
led to 5 protein clusters. We also generated 4 microbe clusters by
cutting the “Microbe” dendrogram at the height of 430, and selected
those containing at least 5 microbes. One pair of a selected protein
cluster and microbe cluster is depicted in Figure 9, where the heat
map with the DI painting metric is shown. The promise of these results
is demonstrated by the fact that most of the identified proteins have
been previously associated with IBD [13-18].
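Cutting a dendrogram at a given height and keeping only the larger clusters can be sketched as follows. The merge list here is hypothetical, with each merge recorded as (member of first cluster, member of second cluster, height); the function name is ours, and the sketch relies on merge heights being non-decreasing, as they are for average linkage. With SciPy one would typically obtain the same result from `scipy.cluster.hierarchy.fcluster` with `criterion="distance"`.

```python
def cut_tree(merges, n, height):
    """Clusters obtained by applying only the merges at or below `height`."""
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, h in merges:
        if h <= height:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Hypothetical dendrogram over 5 items.
merges = [(0, 1, 0.3), (2, 3, 0.5), (0, 2, 1.9), (4, 0, 5.0)]

clusters = cut_tree(merges, n=5, height=1.0)        # {0, 1}, {2, 3}, {4}
selected = [c for c in clusters if len(c) >= 2]     # drop singletons
coarser = cut_tree(merges, n=5, height=2.0)         # {0, 1, 2, 3}, {4}
```

The size filter in the last step plays the same role as keeping only protein clusters with at least 20 members in the analysis above.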
Examining such relationships will have utility for several purposes.
First, by clustering relationships of various host and microbial
variables, one can identify groups of relationships that have similar
and/or dissimilar associations by visually examining the heat maps.
Large assemblages of individual relationships with similar associations
may point toward those that have increased importance, because
they indicate organisms having a greater impact on the host, or vice
versa. Assemblages with similar associations might also be used to
identify different taxa with similar functions as well as direct decisions
concerning which of the myriad of unidentified variables should be
examined further. This latter feature addresses the nature of data
generated in this “omics era,” where most of the variables cannot be
identified by simple database searches, but instead require procedures
consuming considerable amounts of time and effort. Lastly, dissimilar
relationships could provide key information, for example, in identifying
relationships between host defense molecules and the bacteria they
target.
Conclusion
Our proposed method showed a significant utility and power
in handling a data matrix of scatter plots. More importantly, this
clustering procedure can be easily extended to the high-dimensional
case when one or more sets of variables need to be analyzed. Moreover,
the novel painting metrics we proposed can be easily extended to
multi-dimensional clusters of multivariate plots.
Co-clustering is desirable over traditional one-dimensional
clustering as it is more informative and easily interpretable while
preserving most of the information contained in the original data;
it also allows dimension reduction along both axes simultaneously and
hence leads to a much more compact representation of the original data
for subsequent analysis. Hence, our future work is to develop a new
co-clustering method to deal with a data matrix of scatter plots.
Finally, although these methods were developed to analyze
microbe-host interactions, we anticipate that this general approach
will have utility for a wide range of investigations, including those
examining relationships among gene expression profiles, metabolites,
genes and epigenetic parameters.
Appendix
Data depth
Let $F$ be a probability distribution in $\mathbb{R}^p$ with $p \ge 1$ and $x$ a point in
$\mathbb{R}^p$. The data depth at $x$ with respect to $F$ is denoted by $D(F; x)$, which
measures how deep (or central) the point $x$ is with respect to $F$. The
larger $D(F; x)$, the deeper (or more central) the point $x$ with respect to $F$. Some
commonly used data depth functions are listed as follows.
1. Mahalanobis Depth [19]:
$$M_hD(F; x) = 1 / [1 + (x - \mu_F)' \Sigma_F^{-1} (x - \mu_F)],$$
where $\mu_F$ and $\Sigma_F$ are the mean and variance-covariance matrix of $F$,
respectively. The sample version of $M_hD(F; x)$ is obtained by replacing
$\mu_F$ and $\Sigma_F$ with their sample estimates.
2. Tukey Depth / Halfspace Depth [20]:
$$TD(F; x) = \inf\{P_F(H) : H \text{ is a closed half space in } \mathbb{R}^p \text{ containing } x\}.$$
The sample version of $TD(F; x)$ is $TD(F_n; x)$, where $F_n$ is the
empirical distribution.
3. Simplicial Depth [21]:
$$SD(F; x) = P_F(x \text{ is inside the closed simplex whose vertices are } \{X_1, \ldots, X_{p+1}\}),$$
where $\{X_1, \ldots, X_{p+1}\}$ is a random sample from $F$. The sample version of
$SD(F; x)$ is the fraction of the sample random simplexes containing the
point $x$.
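The sample versions of the three depth functions can be sketched in a few lines of Python for the bivariate case. This is only an illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the simplicial depth enumerates every triangle (so it is practical only for small samples), and the Tukey depth is approximated by scanning a finite grid of directions rather than computed exactly.

```python
from itertools import combinations

import numpy as np


def mahalanobis_depth(sample, x):
    """Sample Mahalanobis depth: 1 / (1 + (x - mean)' S^{-1} (x - mean))."""
    sample = np.asarray(sample, dtype=float)
    diff = np.asarray(x, dtype=float) - sample.mean(axis=0)
    cov = np.atleast_2d(np.cov(sample, rowvar=False))  # sample covariance
    return float(1.0 / (1.0 + diff @ np.linalg.solve(cov, diff)))


def _orient(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def simplicial_depth_2d(sample, x):
    """Fraction of closed sample triangles that contain x (brute force)."""
    sample = np.asarray(sample, dtype=float)
    inside = total = 0
    for i, j, k in combinations(range(len(sample)), 3):
        a, b, c = sample[i], sample[j], sample[k]
        d = (_orient(a, b, x), _orient(b, c, x), _orient(c, a, x))
        # x lies in the closed triangle when the three signs do not disagree.
        # Degenerate (collinear) triangles are not treated specially here.
        inside += not (min(d) < 0 and max(d) > 0)
        total += 1
    return inside / total


def tukey_depth_2d(sample, x, n_dirs=360):
    """Approximate halfspace depth: the smallest fraction of sample points
    in a closed half plane through x, over a grid of n_dirs directions.
    The grid makes this an upper bound on the exact sample depth."""
    sample = np.asarray(sample, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    proj = (sample - np.asarray(x, dtype=float)) @ dirs.T  # shape (n, n_dirs)
    return float((proj >= 0).mean(axis=0).min())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.normal(size=(40, 2))  # a toy scatter plot
    for point in ([0.0, 0.0], [3.0, 3.0]):
        print(point,
              mahalanobis_depth(cloud, point),
              tukey_depth_2d(cloud, point),
              simplicial_depth_2d(cloud, point))
```

In this toy run, a point near the center of the cloud should receive larger depth values than a point far outside it under all three definitions, which is the property the dissimilarity statistic relies on.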
Figure 9: Heat map with the DI painting metric. [Axis labels: numeric protein feature labels, some annotated (serum amyloid A, haptoglobin, protein S100A12, complement C3, complement C4, transthyretin, apolipoprotein CI, CC motif chemokine 13), and bacterial taxa (Clostridium, Ruminococcus, Roseburia, Eubacterium, Akkermansia).]
Citation: Zhang Z, Cui X, Jeske DR, Li X, Braun J, et al. (2011) Clustering Scatter Plots Using Data Depth Measures. J Biomet Biostat S5:001.
doi:10.4172/2155-6180.S5-001
J Biomet Biostat ISSN: 2155-6180 JBMBS, an open access journal. Biostatistics: Computational statistics
algorithms: finding a match for a biomedical application. Briefings in Bioinformatics
10: 297–314.
5. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a
survey. IEEE Trans Knowl Data Eng 16: 1370–1386.
6. Busygin S, Prokopyev O, Pardalos PM (2008) Biclustering in data mining.
Computers & Operations Research 35: 2964–2987.
7. Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data
analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1: 24–45.
8. Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods:
a structured overview. Stat Methods Med Res 13: 363–394.
9. Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, et al. (2006) A
systematic comparison and evaluation of biclustering methods for gene
expression data. Bioinformatics 22: 1122–1129.
10. Liu RY, Singh K (1993) A quality index based on data depth and multivariate
rank tests. J Am Stat Assoc 88: 252–260.
11. Zuo Y, He X (2006) On the limiting distributions of multivariate depth-based
rank sum statistics and related tests. Ann Stat 34: 2879–2896.
12. Li X, LeBlanc L, Elashoff D, Borneman J, Goodglick L, et al. (2010) Detecting
Disease-Related Biological Neighborhoods by Human Mucosal Interface
Metaproteome Analysis. Abstract to be presented at the DDW meeting.
13. Ahrenstedt O, Knutson L, Nilsson B, Nilsson-Ekdahl K, Odlind B, et al. (1990)
Enhanced local production of complement components in the small intestines
of patients with Crohn’s disease. N Engl J Med 322: 1345–1349.
14. Broedl UC, Schachinger V, Lingenhel A, Lehrke M, Stark R, et al. (2007)
Apolipoprotein A-IV is an independent predictor of disease activity in patients
with inflammatory bowel disease. Inflamm Bowel Dis 13: 391–397.
15. Foell D, Kucharzik T, Kraft M, Vogl T, Sorg C, et al. (2003) Neutrophil derived
human S100A12 (EN-RAGE) is strongly expressed during chronic active
inflammatory bowel disease. Gut 52: 847–853.
16. Greenstein AJ, Sachar DB, Panday AK, Dikman SH, Meyers S, et al. (1992)
Amyloidosis and inflammatory bowel disease. A 50-year experience with 25
patients. Medicine (Baltimore) 71: 261–270.
17. Hansen JJ, Holt L, Sartor RB (2009) Gene expression patterns in experimental
colitis in IL-10-deficient mice. Inflamm Bowel Dis 15: 89899.
18. Larsson AE, Melgar S, Rehnström E, Michaëlsson E, Svensson L, et al. (2006)
Magnetic resonance imaging of experimental mouse colitis and association
with inflammatory activity. Inflamm Bowel Dis 12: 478–485.
19. Mahalanobis PC (1936) On the generalized distance in statistics. Proceedings
of the National Academy of India 12: 49–55.
20. Tukey JW (1974) Mathematics and picturing data. Proceedings of the
International Congress of Mathematicians, Vancouver 2: 523–531.
21. Liu RY (1990) On a notion of data depth based on random simplices. Ann Stat
18: 405–414.
Q(F,G) vs. Q(G,F)
Consider two independent distributions F and G, and two variables
X ~ F and Y ~ G. We present several examples to show the relationship
between Q(F,G) and Q(G,F). For simplicity, univariate normal
distributions and Mahalanobis depth are adopted here.
Example 1: For $F = N(\mu_0, \sigma_0^2)$, $G = N(\mu_0, \sigma_1^2)$, and $\sigma_1^2 > \sigma_0^2$, we have
$$Q(F,G) = P\big((X - \mu_0)^2 / \sigma_0^2 \ge (Y - \mu_0)^2 / \sigma_0^2\big) < 1/2,$$
$$Q(G,F) = P\big((Y - \mu_0)^2 / \sigma_1^2 \ge (X - \mu_0)^2 / \sigma_1^2\big) > 1/2,$$
and $Q(F,G) + Q(G,F) = 1$.
Example 2: For $F = N(\mu_0, \sigma_0^2)$, $G = N(\mu_1, \sigma_0^2)$, and $\mu_0 \ne \mu_1$, we have
$$Q(F,G) = P\big((X - \mu_0)^2 / \sigma_0^2 \ge (Y - \mu_0)^2 / \sigma_0^2\big) < 1/2,$$
$$Q(G,F) = P\big((Y - \mu_1)^2 / \sigma_0^2 \ge (X - \mu_1)^2 / \sigma_0^2\big) < 1/2,$$
and $Q(F,G) = Q(G,F)$.
Example 3: For $F = N(\mu_0, \sigma_0^2)$, $G = N(\mu_1, \sigma_1^2)$, $\mu_0 \ne \mu_1$, and $\sigma_1^2 > \sigma_0^2$, we have
$$Q(F,G) = P\big((X - \mu_0)^2 / \sigma_0^2 \ge (Y - \mu_0)^2 / \sigma_0^2\big) < 1/2,$$
and
$$Q(G,F) = P\big((Y - \mu_1)^2 / \sigma_1^2 \ge (X - \mu_1)^2 / \sigma_1^2\big) < 1/2, \; = 1/2, \; \text{or} \; > 1/2.$$
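The three examples can be checked numerically with a small Monte Carlo sketch. It assumes, as the inequalities in the examples suggest, that Q(F,G) is the probability that the depth of X with respect to F does not exceed the depth of Y with respect to F, where X ~ F and Y ~ G are independent. The function names, parameter values, sample size, and seed below are illustrative choices, not values from the paper.

```python
import numpy as np


def mahalanobis_depth_1d(x, mean, var):
    """Univariate Mahalanobis depth 1 / (1 + (x - mean)^2 / var)."""
    return 1.0 / (1.0 + (x - mean) ** 2 / var)


def estimate_q(mean_f, var_f, mean_g, var_g, n=200_000, seed=0):
    """Monte Carlo estimate of Q(F, G) = P(D(F; X) <= D(F; Y)),
    with X ~ F = N(mean_f, var_f) and Y ~ G = N(mean_g, var_g) independent.
    Both depths are taken with respect to the first distribution, F."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mean_f, np.sqrt(var_f), n)
    y = rng.normal(mean_g, np.sqrt(var_g), n)
    depth_x = mahalanobis_depth_1d(x, mean_f, var_f)
    depth_y = mahalanobis_depth_1d(y, mean_f, var_f)
    return float(np.mean(depth_x <= depth_y))


if __name__ == "__main__":
    # (mean_f, var_f, mean_g, var_g) for each scenario; values are illustrative.
    scenarios = {
        "Example 1 (same mean, larger variance in G)": (0.0, 1.0, 0.0, 4.0),
        "Example 2 (different means, same variance)": (0.0, 1.0, 1.0, 1.0),
        "Example 3a (large shift, slightly larger variance)": (0.0, 1.0, 3.0, 1.21),
        "Example 3b (small shift, much larger variance)": (0.0, 1.0, 0.1, 100.0),
    }
    for name, (mf, vf, mg, vg) in scenarios.items():
        q_fg = estimate_q(mf, vf, mg, vg)
        q_gf = estimate_q(mg, vg, mf, vf)  # roles of F and G swapped
        print(f"{name}: Q(F,G) ~ {q_fg:.3f}, Q(G,F) ~ {q_gf:.3f}")
```

The two Example 3 settings are chosen so that Q(G,F) lands on opposite sides of 1/2, illustrating why Q(F,G) and Q(G,F) must be considered together when both location and scale differ.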
Acknowledgment
The research is supported in part by NIH grant 5R01AI078885. An earlier draft
of this paper was first published in Proceedings of the International Conference
on Data Mining (DMIN’10: July 2010, USA; ISBN #: 1601321384;
http://www.worldacademyofscience.org/; Editors: Robert Stahlbock & Sven F. Crone; Assoc.
Editors: M. Abou-Nasr, H. R. Arabnia, N. Kourentzes, P. Lenca, W.-M. Lippe, G. M.
Weiss).
References
1. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and
display of genome-wide expression patterns. Proc Natl Acad Sci 95: 14863–14868.
2. Gasch AP, Eisen MB (2002) Exploring the conditional coregulation of
yeast gene expression through fuzzy k-means clustering. Genome Biol 3:
RESEARCH0059.
3. Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering.
KDD 89–98.
4. Andreopoulos B, An A, Wang X, Schroeder M (2009) A roadmap of clustering
This article was originally published in a special issue, Biostatistics:
Computational statistics, handled by Editor(s) Dr. Saonli Basu Carriere,
University of Minnesota, USA.