
Diplomarbeit: Identifying Gene Regulatory Modules in Arabidopsis thaliana.


Identifying Gene Regulatory Modules
in Arabidopsis thaliana.
Martin Strauch
September 2006
Abstract
Gene regulatory modules in three-dimensional microarray data comprise genes that exhibit similar
patterns of expression over a subset of experimental conditions and time-points. Given the observed
co-expression, they are likely to be co-regulated by the same set of transcriptional control elements.
In this work, two clustering algorithms which work on three-dimensional data are presented and
applied to the Arabidopsis thaliana abiotic stress dataset provided by the AtGenExpress consortium.
These algorithms are able to identify gene regulatory modules which show coherent patterns over all
conditions they comprise. Additionally, the algorithms find modules specific for one stress condition
and, in contrast to most other clustering techniques, modules which show different response patterns
under a number of different conditions. These so-called independent response (IR) modules contain
genes that cluster together under all conditions in the module, but with a different response pattern
for each condition.
The first of the algorithms, named the two-step clustering approach, extends classical clustering
techniques in a way that makes them applicable to three-dimensional data. It clusters a two-
dimensional matrix of genes over all points in time under just one condition and then, as a second
step, follows the significant clusters found therein through the condition dimension. In this manner,
an initial cluster under one condition can be extended to one of the three module types, depending
on the patterns which emerge when viewing the genes under the other conditions.
The second algorithm is of randomised nature. The Embedded Dimension Iterative Signature
Algorithm (EDISA) adapts the ISA (Bergmann et al., 2003), a biclustering algorithm, for application
on three-dimensional data. The algorithm draws genes at random and reduces these samples to the
strongest signal found therein. By employing Pearson distances, time-trajectories from a three-
dimensional sample can be embedded into a two-dimensional matrix. From this matrix, columns
and rows with the greatest contribution to the row or column average, respectively, are removed in
an iterative process, which breaks the sample down to its signature. The EDISA is able to explicitly
return only modules of the desired type.
Employing the above algorithms, numerous modules of all three types were found in the AtGenExpress
dataset and annotated, where possible, with their biological function. It thus becomes possible to
characterise the principal components of the Arabidopsis stress response. Several of the modules
also give rise to further research, as they contain significant patterns of differentially expressed genes
whose function is not yet known. It is likely that these genes possess regulation mechanisms and
functions similar to the known genes in the same modules. Finally, as a further interesting result,
a module finding showed that the Arabidopsis circadian clock, responsible for the 24 h
rhythm, is affected by cold stress.
Contents
I Introduction
II Biological background
II.1 Gene expression data
II.1.1 The microarray technology
II.1.2 Pre-processing of microarray data
II.2 The AtGenExpress dataset
II.3 Biological motivation: finding gene regulatory modules
III Theoretical background
III.1 Notation
III.2 Clusters and Biclusters
III.2.1 Defining clusters
III.2.2 Defining biclusters
III.2.3 (Bi)cluster types and clustering structures
III.3 Triclusters and gene regulatory modules
III.3.1 Triclusters
III.3.2 Gene regulatory modules
III.4 Clustering algorithms
III.4.1 Distances
III.4.2 Clustering techniques
III.4.3 Biclustering methods
III.4.4 Clustering on three-dimensional datasets
III.5 Motivation for the algorithms
IV The algorithms
IV.1 Two-step clustering exploratory approach
IV.1.1 Assumptions and introductory remarks
IV.1.2 The two-step clustering algorithm
IV.2 Embedded Dimension ISA
IV.2.1 Concept
IV.2.2 An example
IV.2.3 Mathematical description and algorithm
IV.3 Choosing the seeds
IV.3.1 Guide tree
IV.3.2 Nearest neighbours
IV.4 Choosing the parameters
IV.5 Post-processing
IV.5.1 Module scores
IV.5.2 Merging
IV.5.3 Adding genes
IV.6 In silico dataset
V Results
V.1 Two-step clustering
V.1.1 Pre-processing and significance issues
V.1.2 Results for the two-step clustering approach
V.2 The EDISA
VI Discussion
VI.1 Results of the two-step clustering approach
VI.1.1 Single modules
VI.1.2 Coherent modules
VI.1.3 IR modules
VI.1.4 Overview of the stress responses
VI.2 Results for the EDISA
VI.2.1 Choosing seeds and parameters
VI.2.2 AtGenExpress data
VI.3 Comparison of the algorithms
VII Conclusion
VII.1 Possibilities and limitations
VII.1.1 Possibilities
VII.1.2 Limitations
VII.2 Outlook
VII.3 Conclusion
Appendices
A Matlab program
A.1 GUI for the two-step clustering approach
B Supplementary material
Bibliography
Chapter I
Introduction
With more and more genomes sequenced, the logical next step lies in uncovering the function of
genes and their regulation. Often, a considerable amount of work in the laboratory is necessary to
portray a gene’s behaviour, the transcription factors which control its expression, the interactions
of the expressed protein and, finally, to categorise it. The availability of genome-wide expression
measurements does not render further experimentation obsolete, but offers an additional tool for
data driven hypothesis generation regarding gene regulation. As genes which are regulated by the
same underlying mechanisms are expected to show similar expression patterns, a possible point of
attack in the search for these regulatory mechanisms is to detect gene modules, i.e. genes exhibiting
similar expression patterns under a number of conditions.
The AtGenExpress dataset available at TAIR (Rhee et al., 2003), which motivates the
algorithms presented in this work, covers the gene expression of Arabidopsis thaliana in the course
of a day under different abiotic stress conditions. Observing similar responses of genes to a number
of these stress conditions is a strong indicator of co-regulation. Based on module findings, further
research, both bioinformatical and biological, can be performed in order to improve the resolution
of the picture we already have of gene regulation.
In short, the objective of this work is to develop and implement methods which serve the purpose
of finding such gene modules and, eventually, to apply them to the three-dimensional Arabidopsis
thaliana dataset at hand.
This work is organised as follows. After a brief introduction to the technology behind gene
expression measurements and a description of the dataset in Chapter II, Chapter III covers the
computational theory underlying module finding. Clustering methods are a natural fit for this
purpose. An overview of the existing clustering techniques for two- and three-dimensional data
is given, including rigorous definitions of significant clusters and gene modules. It becomes
apparent that the existing methods usually search for modules
which show coherent patterns over all conditions they comprise. Evaluation of the Arabidopsis
dataset, however, reveals an observation that is as frequent as it is biologically relevant: a considerable
number of modules exist for which the genes cluster together under several conditions, but with a different
pattern for each of the conditions. The same regulation mechanism seems to act here on all of the
participating genes: they respond in unison to salt stress, for example, and in unison to
heat stress, yet the patterns for heat and salt differ. The
module finding algorithms are thus required to search for coherent modules as well as for this special
type of module, for which the term Independent Response (IR) module is coined. A rather simple
but nonetheless interesting module type is the single module, which shows a significant pattern
for only one condition. Such an observation is of biological value if it can be asserted that no
significant pattern different from the control measurement exists under any other condition.
The computational equivalent of the biological idea of a gene module is a three-dimensional cluster
spanning the gene, condition and time axes. A definition can be found in Section III.3. As only
few attempts at searching for three-dimensional clusters have been published as of today, the need
for new approaches in this field is apparent. A review of the clustering approaches available for
three-dimensional data is given in Section III.4.4. The concept of finding triclusters pursued by
these methods appears to be a promising attempt. However, neither do the published approaches
search for the biologically interesting IR modules, nor are there software implementations available.
Making use of existing clustering and biclustering techniques, which are reviewed in Sections III.4
and III.4.3.1, two algorithms will be presented which adapt these techniques for the application to
three-dimensional data. The algorithms are described in detail in Chapter IV. The first method
is called the two-step clustering approach. It involves finding an initial cluster of genes under just
one condition over all points in time using a conventional clustering algorithm. In the second step,
the cluster is then followed through the condition dimension. By comparing the patterns that the genes
from the initial cluster exhibit under the other conditions, it becomes possible to distinguish between
the three types of modules. The method provides a significance analysis for its findings and assigns
p-values to them as a measure of significance. A further interesting feature lies in
the possibility for user interaction: A Matlab implementation of the algorithm is available which
visualises the clustering steps, allowing the user to select initial conditions, interactively change
parameters and to apply post-processing operations in order to remove outliers or to find significant
sub-partitions inside a module. Screenshots for the Matlab program are provided in the Appendix.
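The two steps described above can be illustrated with a minimal Python sketch. It is not the Matlab implementation of this work: the data layout `data[condition][gene]`, the function names, the seed-based stand-in for the conventional clustering step and the fixed correlation threshold `min_r` (used here in place of the p-value based significance analysis) are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def mean_profile(data, cond, genes):
    """Average time trajectory of the given genes under one condition."""
    rows = [data[cond][g] for g in genes]
    return [sum(col) / len(rows) for col in zip(*rows)]

def is_tight(data, cond, genes, min_r):
    """True if every gene correlates with the mean profile under this condition."""
    centre = mean_profile(data, cond, genes)
    return all(pearson(data[cond][g], centre) >= min_r for g in genes)

def initial_cluster(data, cond, seed, min_r=0.8):
    """Step 1 (hypothetical stand-in): genes correlated with a seed gene."""
    return [g for g in data[cond]
            if pearson(data[cond][g], data[cond][seed]) >= min_r]

def classify_module(data, start_cond, genes, min_r=0.8):
    """Step 2: follow the cluster through the other conditions."""
    start = mean_profile(data, start_cond, genes)
    conds, same_pattern = [start_cond], True
    for cond in data:
        if cond == start_cond or not is_tight(data, cond, genes, min_r):
            continue
        conds.append(cond)
        # Genes still cluster together; does the pattern match the initial one?
        if pearson(mean_profile(data, cond, genes), start) < min_r:
            same_pattern = False
    if len(conds) == 1:
        return "single", conds
    return ("coherent" if same_pattern else "IR"), conds
```

In this sketch a cluster that stays tight under further conditions with the same mean pattern is labelled coherent, one that stays tight with differing patterns is labelled IR, and one that is tight only under the starting condition is labelled single.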
The second algorithm is based on the Iterative Signature Algorithm (ISA) by Bergmann et al. (2003),
which works on two-dimensional data. Employing Pearson distances based on time trajectories, the
third dimension is embedded into a two-dimensional matrix such that the basic structure of the ISA can
be applied. The Embedded Dimension Iterative Signature Algorithm (EDISA) is presented in Section IV.2.
Similar to the ISA, it is a randomised method which takes samples from the data matrix and reduces
them to their signature, i.e. the prevalent pattern in the subset of randomly sampled genes. In the
end, from the reduced samples one can select those which receive a good score based on Pearson
distances and, if necessary, merge the overlapping ones or the duplicates.
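The embedding idea can be sketched in a few lines of Python. This is a simplified illustration and not the EDISA as specified in Section IV.2: the data layout `data[condition][gene]`, the use of the sample's mean trajectory as reference, the greedy removal rule and the threshold `max_dist` are assumptions chosen to keep the example short.

```python
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation; 1.0 if either trajectory is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / sqrt(sxx * syy)

def embed(data, genes, conds):
    """Collapse the time axis: one distance per (gene, condition) pair.

    Each entry is the Pearson distance between a gene's time trajectory and
    the mean trajectory of the sample under the same condition.
    """
    matrix = {g: {} for g in genes}
    for c in conds:
        rows = [data[c][g] for g in genes]
        centre = [sum(col) / len(rows) for col in zip(*rows)]
        for g in genes:
            matrix[g][c] = pearson_distance(data[c][g], centre)
    return matrix

def reduce_sample(data, genes, conds, max_dist=0.2):
    """Greedily drop the worst gene or condition until the sample is coherent."""
    genes, conds = list(genes), list(conds)
    while len(genes) > 1 and conds:
        m = embed(data, genes, conds)
        row_avg = {g: sum(m[g].values()) / len(conds) for g in genes}
        col_avg = {c: sum(m[g][c] for g in genes) / len(genes) for c in conds}
        worst_g = max(genes, key=row_avg.get)
        worst_c = max(conds, key=col_avg.get)
        if max(row_avg[worst_g], col_avg[worst_c]) <= max_dist:
            break  # every remaining row and column average is small enough
        if row_avg[worst_g] >= col_avg[worst_c]:
            genes.remove(worst_g)
        else:
            conds.remove(worst_c)
    return genes, conds
```

Calling `reduce_sample` on a random subset of genes and all conditions returns the genes and conditions that survive the reduction. A randomised search would repeat this for many samples and keep the well-scoring results.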
Both algorithms were applied to the AtGenExpress Arabidopsis thaliana dataset. Results can be
found in Chapter V. In Chapter VI, GO annotations are presented which help to uncover the
biological function of the modules. Several specific responses to a single stress condition could
be identified, as well as several stress response mechanisms of more general nature. Especially
dense modules can be found for heat shock related genes, which are also active in response to stress
conditions other than heat. Furthermore, two response mechanisms to salt stress could be identified:
Usually, responses to salt and osmotic stress are in unison, whereas in one case, genes appear to
respond to salt, but not to osmotic stress. Moreover, in this case, the genes which react to salt in
the root show a response under UV-B stress in the shoot tissue.
In Chapter VII, the possibilities and limitations of module finding are discussed, along with a
comparison of the two module finding algorithms presented in this work. Obviously, module
finding is restricted to transcriptional interactions and thus might miss mechanisms involving post-
transcriptional components. It appears, however, to have great potential for generating hypotheses
which motivate biological experimentation. A finding regarding the Arabidopsis circadian
clock mechanism illustrates the possibilities which arise when combining module finding and
data-mining methods.
Finally, paths for further development are outlined, which will be pursued as part of a follow-on
Chapter II
Biological background
II.1 Gene expression data
II.1.1 The microarray technology
The microarray technology enables scientists to measure gene expression on a genome-wide scale.
Previously, only small numbers of mRNAs could be detected simultaneously using the Northern Blot
(Alwine et al., 1979). Schena et al. (1995) were the first to publish a comparative gene expression
experiment covering 45 Arabidopsis thaliana genes. Soon, the microarray technology was improved
and became a standard tool in genetics. A review is given by Brown and Botstein (1999). Today,
whole-genome arrays for a number of species are available, which allow for monitoring the whole
transcriptome at the time of mRNA-extraction.
Different array designs compete; however, the basic principle behind all approaches remains the
same. The technology is based on the fact that, theoretically, complementary strands of nucleic
acids will hybridise, whereas others will not. The problem of unspecific hybridisation is addressed
in a later section (see II.1.2). Consequently, as already known from the smaller-scale Northern Blot
experiments, the presence of a specific mRNA transcript is detected through hybridisation with its
complement. The probes, i.e. the transcripts we wish to detect, are situated on the glass surface of
the array and the samples, i.e. the transcripts extracted from a cell, are then hybridised with the
probes. For this procedure to work, both strands, probe and sample, need to be complementary.
Hence, the mRNA transcripts are converted to cDNA by reverse transcription. Only then are sample
and probe for the same gene exactly complementary. The actual detection relies on
fluorescence: the samples are labelled with fluorescent markers, which can be
detected by scanning the array.
Two-fluor comparative microarray experiments were conducted, for example, by DeRisi et al. (1997), who
identified genes participating in the diauxic shift in Saccharomyces cerevisiae - the metabolic change
from fermentation to aerobic growth in yeast. An example of a comparative microarray experiment
is presented in Figure II.1. For the chip design used in this example, mRNAs have been hybridised
to the array surface in order to produce the probes. For each gene there are multiple identical probes.
Two different fluors are employed to label the samples, e.g. red for a wildtype and green for control.
From the green to red ratio of the samples which hybridise to the probes for a specific gene, one can
thus infer the ratio of the abundance of the respective gene in the wildtype and in control.
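As a small illustration of how a two-colour ratio might be turned into a number, the following Python sketch computes a log2 ratio per probe and summarises the replicate probes of one gene by their median. The function names and the choice of the median are assumptions for this example; they are not taken from the experiments cited above.

```python
from math import log2
from statistics import median

def log_ratio(red, green):
    """log2(red / green) for one probe; positive means the red-labelled
    sample is more abundant, negative the green-labelled one."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return log2(red / green)

def gene_log_ratio(spots):
    """Median log ratio over the replicate probes of one gene.

    spots: list of (red_intensity, green_intensity) pairs.
    """
    return median(log_ratio(red, green) for red, green in spots)
```

A value of 2, for instance, corresponds to a fourfold higher abundance in the red-labelled sample.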
Another very common microarray design is the GeneChip®, a proprietary format of Affymetrix.
For the Affymetrix arrays, oligonucleotides are synthesised directly on the array through a
photolithographic process, as opposed to hybridising mRNAs with the array surface. Oligonucleotides of
25 bases in length, which are chosen to represent the genes, are the probes for this type of array
design. Again, numerous probes correspond to one gene, which allows for measuring the abundance
of the respective sample. The Affymetrix system uses only one fluor, i.e. comparative experiments
involve two arrays, as control and wildtype can no longer be measured on the same array. Reviews of
the oligonucleotide arrays and technical details on their production are provided by Lipshutz et al.
(1999) and Lockhart et al. (1996).
Figure II.1: The course of a comparative microarray experiment: In this example, vegetative and
sporulating yeast cells are compared. From both cell populations mRNA is extracted and converted
into cDNA by reverse transcription. The cDNAs from the vegetative and sporulating cells are
labelled with different fluors: green for vegetative and red for sporulating. As a result of the greater
relative abundance of TEP1 in sporulating cells, more red-labelled cDNA binds to the respective
array element (probe) composed of DNA from TEP1. From the red to green ratio it can thus be
inferred that sporulating cells produce more telomerase associated protein TEP1 than vegetative
cells. The figure is taken from the review of Brown and Botstein (1999).
II.1.2 Pre-processing of microarray data
In order to properly evaluate the results of a microarray experiment, statistical methods have to be
employed to cope with noise and to correct for biases due to the experimental hardware. For
example, questions of segmentation and signal extraction arise during image analysis of the
scanned array. Details on image analysis, normalisation in microarray experiments and statistical methods
for detecting differentially expressed genes are given by Dudoit et al. (2001).
The arrays employed in generating the AtGenExpress dataset are of the Affymetrix GeneChip®
design. Details on data preparation and expression value calculation can be found in a whitepaper
documenting the current Affymetrix standard MAS 5.0 (Affymetrix, 2002). In one important point,
however, the pre-processing applied to the AtGenExpress dataset differs from the standard, namely,
the treatment of background noise. For a better understanding of the different approaches, first
the standard Affymetrix method is outlined, followed by the GCRMA approach pursued in pre-
processing the AtGenExpress data.
A special feature of Affymetrix arrays is the presence of perfect match and mismatch probes, which
can be used for background adjustment. Recall that on an Affymetrix array each mRNA of interest is
represented by numerous probes, typically 11-20. One might argue that this is not sufficient
to make reliable statistical statements about the expression level of the corresponding mRNAs.
Moreover, unspecific binding of similar mRNAs causes a considerable amount of background
noise. In order to account for the latter, on Affymetrix arrays each probe (perfect match, PM) has
a counterpart which is called a mismatch probe (MM). Both are 25 bases long and contain
essentially the same base sequence, except for the middle base: in the MM probe, the 13th base
is replaced with its complementary base. Following the MAS 5.0 standard (Affymetrix, 2002), the
computed intensities are corrected by taking the mismatch intensities into account. The standard
method is based on the difference PM-MM.
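The relation between a perfect match probe and its mismatch counterpart, and the idea of a PM-MM difference, can be made concrete with a short Python sketch. The function names are hypothetical, and `naive_signal` is a deliberately plain average of differences for illustration only; the actual MAS 5.0 procedure is more involved than this.

```python
# Watson-Crick complements for DNA bases
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm_sequence):
    """Return the MM sequence: identical to PM except for the 13th base,
    which is replaced with its complement."""
    if len(pm_sequence) != 25:
        raise ValueError("expected a 25-base probe")
    middle = 12  # zero-based index of the 13th base
    return (pm_sequence[:middle]
            + COMPLEMENT[pm_sequence[middle]]
            + pm_sequence[middle + 1:])

def naive_signal(probe_pairs):
    """Average PM - MM difference over the probe pairs of one gene.

    probe_pairs: list of (pm_intensity, mm_intensity) tuples.
    """
    return sum(pm - mm for pm, mm in probe_pairs) / len(probe_pairs)
```

Note that a plain difference can become negative when the MM intensity exceeds the PM intensity, which is one reason why more robust corrections are used in practice.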
An alternative background adjustment method is the robust multi-array average (RMA) (Irizarry
et al., 2003), which ignores the MM measurements. The RMA accounts for noise by modelling the
observed intensity as the sum of normally distributed additive background noise and a signal
following an exponential distribution. The authors show that the RMA outperforms the standard Affymetrix method.
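Written out, the distributional assumption described above takes the following form. The symbols O, S, B, alpha, mu and sigma are notation introduced here for illustration and are not part of the notation of this work.

```latex
% Assumed symbols: O = observed PM intensity, S = true signal, B = background noise
O = S + B, \qquad
S \sim \mathrm{Exp}(\alpha), \qquad
B \sim \mathcal{N}(\mu, \sigma^{2}), \qquad
S \text{ and } B \text{ independent}

% The background-corrected value is the expected signal given the observation o
\hat{S} = \mathbb{E}\left[\, S \mid O = o \,\right]
```

Because the signal is modelled as non-negative, the corrected value stays non-negative, in contrast to a plain PM-MM difference.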
An extension of RMA which corrects for the GC content of the sample was proposed by Naef and
Magnasco (2003) and Wu et al. (2004). The so-called GCRMA was applied to the AtGenExpress
dataset instead of the standard Affymetrix PM-MM algorithm. The GCRMA takes into account
that the GC-content of a probe affects non-specific hybridisation. The motivation behind this is
that a complementary GC-pair has three hydrogen bonds in contrast to just two hydrogen bonds for
an AU-pair. Probes with high GC-content might thus have a higher affinity for unspecific binding.
To correct for this, MM probes are categorised according to their GC-content and the background
is estimated for each category, i.e. GC-content in a certain range, separately.
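The binning idea described above can be sketched as follows. This is only an illustration of grouping MM probes by GC-content and estimating one background level per group; it is not the GCRMA procedure itself, and the function names, the median as estimator and the parameter `bin_width` are assumptions.

```python
from collections import defaultdict
from statistics import median

def gc_count(sequence):
    """Number of G and C bases in a probe sequence."""
    return sum(base in "GC" for base in sequence.upper())

def background_by_gc(mm_probes, bin_width=1):
    """Median MM intensity per GC-content bin.

    mm_probes: iterable of (sequence, intensity) pairs for mismatch probes.
    Returns {bin_index: median intensity}, bin_index = gc_count // bin_width.
    """
    bins = defaultdict(list)
    for sequence, intensity in mm_probes:
        bins[gc_count(sequence) // bin_width].append(intensity)
    return {b: median(values) for b, values in bins.items()}

def corrected_intensity(pm_sequence, pm_intensity, background, bin_width=1):
    """Subtract the background level of the PM probe's GC bin, floored at zero."""
    level = background[gc_count(pm_sequence) // bin_width]
    return max(pm_intensity - level, 0.0)
```

A larger `bin_width` pools probes over a wider GC range, trading resolution for more probes per background estimate.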
II.2 The AtGenExpress dataset
The thale cress Arabidopsis thaliana is a small plant from the family Brassicaceae (see Figure II.2)
and serves as a model organism in biology due to a rapid generation cycle, convenient cultivation in
the laboratory and a relatively small genome of 125 million base pairs on 5 chromosomes. Today,
the complete sequence of the Arabidopsis genome is known (Arabidopsis Genome Initiative, 2000)
and whole genome gene chips are available.
The AtGenExpress project is a multinational effort to uncover the transcriptome of Arabidopsis
thaliana. As part of the project, which is coordinated by the German Arabidopsis Functional
Genomics Network (AFGN), gene expression experiments to measure the response of Arabidopsis
to abiotic stress treatments have been performed using the Affymetrix ATH1 chip. All abiotic stress
treatments were performed on the same day, in the same growth chambers as one experiment using
the same set of seedlings. The whole dataset was normalised using GCRMA (see II.1.2).
The nine different conditions the abiotic stress dataset comprises are cold, osmotic, salt, drought,
oxidative, genotoxic, wounding and heat stress, as well as UV-B radiation. For each condition, two
time series (one each for the tissues root and shoot) plus two corresponding control measurements are
available, each comprising between 6 and 9 time-points. The dataset is publicly available as a part of the
AtGenExpress data hosted by TAIR (Rhee et al., 2003).
Figure II.2: Arabidopsis thaliana. Image: Wikipedia, GNU Free Documentation License
II.3 Biological motivation: finding gene regulatory modules
With the Arabidopsis thaliana genome sequenced (Arabidopsis Genome Initiative, 2000) and whole-
genome chips available to uncover the transcriptome, a lot of questions on gene expression and its
regulation arise. Going beyond the deciphering of sequence data, the advent of systems biology has
shifted the focus of biological research towards regulatory mechanisms (Kitano, 2002).
Experiments like those conducted by the AtGenExpress consortium aim at understanding the
function of the genes that have already been sequenced. By measuring the plant’s transcriptome
under specific stress conditions, the genes involved in the response to the stress can be detected.
Genes exhibiting the same behaviour in response to a certain stress condition are likely to be co-
regulated. Similar reactions over a number of conditions hint at a general stress response, while
strikingly different behaviour under a single condition is a strong indicator for a specific function of
the genes involved.
The objective of this thesis is to detect gene regulatory modules. Algorithms which work towards
this aim are presented (see Chapter IV) and then applied to the AtGenExpress dataset (II.2).
The motivation behind this work is twofold: the three-dimensional dataset at hand, which raises interesting
computational questions, and the biological desire to uncover gene regulation in response to
abiotic stress conditions.
In the biological definition, the gene regulatory modules to be identified comprise genes that behave
similarly over a number of time-points under a number of conditions. By detecting such gene
regulatory modules, it becomes possible to infer the regulation mechanisms which induce this
common behaviour. Often, further analysis in the laboratory will be inevitable to reliably establish
a regulation mechanism; however, valuable information can be gained by detecting gene modules.
Finding previously unannotated genes to show a similar expression specific for a certain condition
allows for determining the general function of these genes. Furthermore, the genes contained in
a module can be scanned for common regulatory sequences or known transcription factors: For
example, a protein involved in the regulation of a certain gene is likely to be involved in the regulation
of the other genes in the same module. A brief review of the biology of gene expression control is
given in Figure II.3. For further information on gene regulation the inclined reader is referred to a
textbook, e.g. Alberts et al. (2002).
In summary, gene (regulatory) modules provide the basis for further research on gene regulatory
mechanisms. The modules obtained can be used to categorise genes of known function and
regulation, as well as to uncover function and regulation mechanisms of previously unannotated genes.
Figure II.3: Control region of a typical eucaryotic gene: expression starts at the promoter, the
beginning of which is marked by a TATA-box. At the promoter site, RNA polymerase II and a
couple of general transcription factors assemble in order for transcription of gene X into RNA to
begin. Upstream, as well as downstream of the gene, gene-specific regulatory sequences can be
found which are binding sites for gene regulatory proteins, the presence of which influences the rate
of transcription initiation (Alberts et al., 2002, p. 401).
Chapter III
Theoretical background
III.1 Notation
The algorithms referred to in this work are designed to operate on gene expression datasets. For ease
of comparison, a unified notation is introduced here, which can be employed for both the two- and
the three-dimensional methods. Two-dimensional gene expression datasets typically consist of a gene
and a condition dimension. The conditions can be exchanged for time or any other dimension without
affecting the notation. Throughout this work, however, the two-dimensional datasets comprise genes
as well as conditions and, in the three-dimensional case, we also encounter a time axis.
Formally, in the two-dimensional case, the gene expression matrix $E^{C}_{G}$ is composed of genes
$G = \{g_1, \dots, g_{|G|}\}$ and conditions $C = \{c_1, \dots, c_{|C|}\}$, where $|G|$ and $|C|$ denote the number of
genes and conditions, respectively. An element $E^{c}_{g}$ refers to the expression value of gene $g$ under
condition $c$. For a gene $g$, the vector $e^{C}_{g}$ is called its condition profile. It contains the expression
values for gene $g$ under all conditions. Analogously, the vector $e^{c}_{G}$ denotes the gene profile for
condition $c$:
$$e^{C}_{g} = \left(E^{c_1}_{g}, E^{c_2}_{g}, \dots, E^{c_{|C|}}_{g}\right), \qquad e^{c}_{G} = \left(E^{c}_{g_1}, E^{c}_{g_2}, \dots, E^{c}_{g_{|G|}}\right)^{\top} \tag{III.1}$$
In the three-dimensional case, the gene expression matrix $E^{C}_{GT}$ additionally comprises time-points
$T = \{t_1, \dots, t_{|T|}\}$, $|T|$ being the number of time-points. The elements $E^{c}_{gt}$ denote the expression
value of gene $g$ under condition $c$ at time-point $t$. The vector $e^{c}_{gT}$ describes a time-trajectory of
gene $g$ under condition $c$ over all points in time $T$:
$$e^{c}_{gT} = \left(E^{c}_{g t_1}, E^{c}_{g t_2}, \dots, E^{c}_{g t_{|T|}}\right) \tag{III.2}$$
Using the time trajectories just defined, a row (gene) $g$ of $E^{C}_{GT}$ is represented by the
vector $e^{C}_{gT}$, which contains the time-trajectories for gene $g$ under
all conditions $C = \{c_1, \dots, c_{|C|}\}$:
$$e^{C}_{gT} = \left(e^{c_1}_{gT}, e^{c_2}_{gT}, \dots, e^{c_{|C|}}_{gT}\right) \tag{III.3}$$
Analogously, a column (condition) $c$ of $E^{C}_{GT}$ corresponds to the vector $e^{c}_{GT}$:
$$e^{c}_{GT} = \left(e^{c}_{g_1 T}, e^{c}_{g_2 T}, \dots, e^{c}_{g_{|G|} T}\right)^{\top} \tag{III.4}$$
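To make the notation more tangible, the following minimal sketch stores a tiny three-dimensional dataset as a nested Python dictionary and extracts a time-trajectory, a gene row and a condition column. The condition and gene names, the expression values and the helper names `time_trajectory`, `gene_row` and `condition_column` are assumptions chosen purely for illustration; they are not taken from the AtGenExpress data.

```python
# Toy dataset: E[c][g] holds the expression values of gene g under
# condition c over all time-points. All names and numbers are invented.
E = {
    "cold": {"g1": [0.1, 0.5, 0.9], "g2": [0.2, 0.6, 1.0]},
    "heat": {"g1": [1.0, 0.4, 0.1], "g2": [0.9, 0.5, 0.2]},
}
CONDITIONS = ["cold", "heat"]
GENES = ["g1", "g2"]


def time_trajectory(c, g):
    """Time-trajectory of gene g under condition c over all time-points."""
    return list(E[c][g])


def gene_row(g):
    """Row of the dataset: the trajectories of gene g under all conditions."""
    return [time_trajectory(c, g) for c in CONDITIONS]


def condition_column(c):
    """Column of the dataset: the trajectories of all genes under condition c."""
    return [time_trajectory(c, g) for g in GENES]
```

A nested dictionary keeps the sketch free of dependencies; for real datasets a three-dimensional array would be the more natural choice.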
III.2 Clusters and Biclusters
III.2.1 Defining clusters
Clustering algorithms are categorized as being unsupervised classification methods which divide a
set of elements into partitions such that the elements within these partitions (clusters) are more
similar to each other than to the remaining elements of the set (Jain et al., 1999). The definition of
a cluster thus involves a similarity or distance measure, the choice of which depends on the clustering
Usually, clustering assumes a one-dimensional space, i.e. a set of measurements. In the case of
microarray analysis, however, datasets comprise at least two dimensions, e.g. genes and conditions.
Hence, rows (genes) and columns (conditions) are treated as atomic elements and are not subdivided
further. A cluster comprises either a set of rows, i.e. a number of genes viewed under all conditions,
or a set of columns. In the latter case all genes are viewed under one condition. Proceeding in this
manner, one-dimensional clusters can be found on two-dimensional data.
In the notation introduced previously (Section III.1), a cluster of rows is a vector $e^{C}_{G_c}$ with $G_c \subseteq G$.
Alternatively, we can regard $e^{C_c}_{G}$ with $C_c \subseteq C$ as a cluster of columns.
Clustering methods will often return a partitioning of the dataset regardless of whether the data
actually is structured. It is thus advisable to evaluate the quality of a cluster, i.e. the degree of
similarity of the cluster elements. To this end, one can define a significant cluster as a vector $e^{C}_{G_c}$
which meets a similarity criterion: after applying $f_{criterion}$, a significant cluster is assigned a
value which falls beneath the significance threshold $\tau$. The set of significant clusters of rows $C^{signif}_{row}$
can thus be expressed by the following term:
$$C^{signif}_{row} = \left\{ G_c \subseteq G \mid f_{criterion}\left(e^{C}_{G_c}\right) < \tau \right\} \tag{III.5}$$
The same applies for the set of significant clusters of columns $C^{signif}_{col}$:
$$C^{signif}_{col} = \left\{ C_c \subseteq C \mid f_{criterion}\left(e^{C_c}_{G}\right) < \tau \right\} \tag{III.6}$$
The choice of the criterion is either dependent on the algorithm employed, or, if the algorithm does
not supply cluster scores, has to be specified by the user. A common measure of cluster quality is for
example the average distance of the cluster elements. The distance measures employed in this work
are described in Section III.4.1, whereas the exact cluster criteria are specific for the algorithms and
dealt with in the respective sections.
III.2.2 Defining biclusters
While clustering finds one-dimensional clusters even if the dataset comprises two dimensions,
biclustering methods are able to extract subsets of rows and columns from a two-dimensional gene
expression dataset. Hence, biclusters are two-dimensional clusters. Formally, a bicluster is a vector
$e^{C_c}_{G_c}$ with $C_c \subseteq C$ and $G_c \subseteq G$. Similar to the definition for clusters, we can define the set of
significant biclusters $B^{signif}$:
$$B^{signif} = \left\{ (G_c, C_c) \subseteq (G, C) \mid f_{criterion}\left(e^{C_c}_{G_c}\right) < \tau \right\} \tag{III.7}$$
III.2.3 (Bi)cluster types and clustering structures
III.2.3.1 (Bi)cluster types
Madeira and Oliveira (2004) name four major types of biclusters one might wish to search for:
Biclusters with constant values (Figure III.1, example a)
Biclusters with constant values on rows or columns (Figure III.1, example b)
Biclusters with coherent values (Figure III.1, examples c and d)
Biclusters with coherent evolutions (Figure III.1, example e)
Only a few biclustering algorithms actually mine constant biclusters. According to the review by
Madeira and Oliveira (2004), the vast majority of algorithms search for biclusters with
coherent values or coherent evolutions.
The concept of coherent values biclusters reflects the fact that genes may very well be co-regulated
even though they do not exhibit the same level of expression. One commonly assumes either an
additive or a multiplicative model which accounts for the differences in the expression level. Consider
example d) in Figure III.1: The underlying model is an additive one, such that for example by adding
one to all values of the first column, the second column can be obtained. A multiplicative model
can equally well be employed (see example c). Additive and multiplicative patterns are sometimes
referred to as shifting and scaling patterns, respectively1. These patterns are of interest especially in
the biological application as expression changes through co-regulation are often relative to a gene’s
normal level of expression, reflecting the influence of a condition-specific regulation factor.
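The additive and multiplicative models can be checked mechanically. The sketch below tests whether every row of a small matrix equals the first row plus a row-specific constant (additive) or times a row-specific factor (multiplicative). The function names, the tolerance and the example matrices are assumptions made for illustration; the matrices are not the ones shown in Figure III.1.

```python
def is_additive(rows, tol=1e-9):
    """True if every row equals the first row plus a row-specific constant."""
    ref = rows[0]
    for row in rows[1:]:
        offsets = [x - r for x, r in zip(row, ref)]
        if max(offsets) - min(offsets) > tol:
            return False
    return True


def is_multiplicative(rows, tol=1e-9):
    """True if every row equals the first row times a row-specific factor.

    Assumes that the reference row contains no zero entries.
    """
    ref = rows[0]
    for row in rows[1:]:
        factors = [x / r for x, r in zip(row, ref)]
        if max(factors) - min(factors) > tol:
            return False
    return True


# Second column = first column + 1, third column = first column + 4.
additive = [[1, 2, 5], [2, 3, 6], [4, 5, 8]]
# Second row = 2 * first row, third row = 3 * first row.
multiplicative = [[1, 2, 5], [2, 4, 10], [3, 6, 15]]
```

Because an additive pattern on the rows implies one on the columns (and likewise for the multiplicative case), checking the rows is sufficient.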
The concept of coherent evolutions does not regard the actual expression values but expression states
which cover a range of expression values. Again, the notions of constant and coherent biclusters can
be applied, this time, however, with states instead of the actual expression values.
Figure III.1: a) Bicluster with constant values, b) Bicluster with constant values on the rows, c)
Bicluster with coherent values assuming a multiplicative model, d) Bicluster with coherent values
assuming an additive model, e) Bicluster with coherent evolution on the rows
An interesting extension to all of the above bicluster classes is allowing time-shifting (see Figure
III.2, example a): The same gene expression response may occur with a time delay for different
All of these classes, except for the time-shifting, apply for clusters, too, with one difference: In
the case of biclusters we encounter subsets of columns and rows (Figure III.2, example c), whereas
clusters only comprise either columns or rows (Figure III.2, example b).
1In order to avoid confusion with the time-shifting patterns, these terms will, however, not be employed throughout
this work. The respective patterns will be referred to as additive and multiplicative patterns.
Figure III.2: a) Example of a time-shifted coherent pattern b) The boxed area highlights a cluster
with constant values on the rows c) A bicluster with constant values
III.2.3.2 Clustering structures
Apart from choosing the type of cluster one wishes to mine, it is necessary to define the structure
of the clustering in advance. The cluster structure determines for example whether the clustering
is exhaustive. If this is the case, all genes are assigned to a cluster, even if this amounts to just
choosing the lesser of evils.
Fuzzy clustering algorithms, such as fuzzy k-means (Gasch and Eisen, 2002), circumvent this problem
by assigning each gene probabilities of belonging to a specific cluster. In a post-processing step one
might then choose to assign only those genes to clusters which have a probability of belonging to
any one of the clusters greater than a predefined threshold.
Another strategy for achieving a non-exhaustive clustering is pursued in Section IV.1.2 by allowing
only such genes into the clustering, which show similarity to at least one other gene. The genes
which do not meet this criterion - usually a maximum allowed distance - are left unclustered.
Additionally, one must specify whether clusters are allowed to overlap. Common clustering
techniques such as k-means or hierarchical clustering (see Section III.4.2) return an exhaustive
clustering of the input set with exclusive clusters, i.e. each gene belongs to exactly one cluster.
Fuzzy k-means, on the other hand, allows multiple cluster affiliations for a gene. Regarding biclusters,
Madeira and Oliveira (2004) mention that some methods search for either exclusive rows or exclusive
columns biclusters. The former, for example, still require that each gene belong exclusively to a
bicluster. Conditions, however, are allowed to appear in more than one bicluster.
Yet restrictions of this kind are often due to algorithmic rather than biological reasons. From
the biological point of view, genes should be able to belong to different clusters under different
conditions and some genes might not cluster at all. Consequently, one should assume non-exclusive
as well as non-exhaustive biclusters to be present in the dataset. Several prominent biclustering
methods actually do follow this objective, such as the algorithms of Cheng and Church (2000) or
Tanay et al. (2002).
III.3 Triclusters and gene regulatory modules
III.3.1 Triclusters
Numerous methods are available that mine clusters and biclusters from two-dimensional gene
expression datasets (see Section III.4). For three-dimensional datasets, the clustering concept can be
extended which amounts to mining triclusters, i.e. subsets of genes under subsets of conditions and
time-points. As the topic of triclustering is fairly new to the community, there is currently only one algorithm
searching for triclusters: the TriCluster algorithm of Zhao and Zaki (2005), which is examined
more closely in Section III.4.4.3.
Furthermore, it is also a valid approach to look for biclusters in a three-dimensional dataset by
regarding whole time-trajectories and thus not partitioning the time dimension. In fact,
biclustering on three-dimensional data is the only approach followed hitherto, as the TriCluster
algorithm also searches for gene-condition biclusters over the whole time span T and then tries
to cut back on the number of time-points in order to form a tricluster.
Using the notation from Section III.1, a tricluster is a vector $e^{C_c}_{G_c T_c}$ with $G_c \subseteq G$, $C_c \subseteq C$ and
$T_c \subseteq T$. The set of significant triclusters $T^{signif}$ is defined as follows:
$$T^{signif} = \left\{ (G_c, C_c, T_c) \subseteq (G, C, T) \mid f_{criterion}\left(e^{C_c}_{G_c T_c}\right) < \tau \right\} \tag{III.8}$$
Note that the above definition comprises the "true" triclusters, which also restrict the time dimension,
as well as the "biclusters" on three-dimensional data, which are a special case with $T_c = T$. It would
even be possible to refer to clusters by setting, for example, $T_c = T$, $C_c = C$.
III.3.2 Gene regulatory modules
The triclusters are the most general cluster type so far and cover all the data dimensions we find in
the gene expression datasets available today. They come closest to the biological notion of a gene
regulatory module, or gene module for short. Given a significant tricluster $e^{C_c}_{G_c T_c}$, the genes $g \in G_c$
are thus said to form a gene regulatory module under the conditions $c \in C_c$ and for the time-points $t \in T_c$.
In order to fulfil the biological requirements on gene modules, the partitioning of $E^{C}_{GT}$ into gene
modules should be
1. non-exhaustive: The biological motivation behind this is that some genes might exhibit a
stationary expression value, either because they never change or because they respond to
conditions other than those covered by the dataset. Without further information at hand it
would be misleading to include these genes into a module.
2. non-exclusive: A gene might be regulated by different mechanisms when responding to different
conditions. It should thus be possible for a gene to belong to multiple modules.
As to the tricluster types we are looking for, additive and multiplicative coherent patterns appear
to be a reasonable choice, as they have been for clusters and biclusters (see Section III.2.3).
These patterns can, moreover, be easily handled using Pearson correlation (see Section III.4.1)
to compare time trajectories. Let thus criterion = coherent. However, the coherence requirement
may sometimes be too restrictive. Again motivated biologically, we expect three particular
kinds of modules:
1. Single modules, where $|C_c| = 1$ and $f_{coherent}\left(e^{C_c}_{G_c T_c}\right) < \tau$, i.e. genes that cluster together
under exactly one condition, but not under any other condition.
2. Coherent modules, where $|C_c| > 1$ and $f_{coherent}\left(e^{C_c}_{G_c T_c}\right) < \tau$, i.e. genes exhibiting the same
coherent trajectory pattern under all conditions.
3. Independent Response (IR) modules, where $|C_c| > 1$ and $f_{coherent}\left(e^{c}_{G_c T_c}\right) < \tau \;\; \forall c \in C_c$,
i.e. genes which show different coherent trajectory patterns under multiple conditions.
While single modules make it possible to associate genes with the response to a specific condition,
coherent modules may reveal co-regulation under multiple conditions. The genes involved therein
are potentially controlled by the same transcription factors and display a more general response. IR
modules are examples of a more complex type of co-regulation, i.e. they hint at the existence of stress
regulation specific for every condition alongside a common transcriptional control. Examples
for these three module types are shown in Figure III.3.
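The distinction between the three module types can be sketched in a few lines of Python. The sketch assumes that the coherence criterion is the maximum pairwise Pearson distance between trajectories, which is only one possible choice and not necessarily the criterion used by the algorithms in this work. The names `coherent` and `module_type`, the threshold value and the toy trajectories are hypothetical. The sketch also does not verify that a single module fails to cluster under the remaining conditions.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation of two equally long, non-constant trajectories."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def coherent(trajectories, tau):
    """Assumed criterion: all pairwise Pearson distances fall below tau."""
    return all(pearson_distance(x, y) < tau
               for x, y in combinations(trajectories, 2))


def module_type(module, tau=0.1):
    """Classify a candidate module given as {condition: [trajectory per gene]}."""
    per_condition = list(module.values())
    # Every module type requires coherence within each single condition.
    if not all(coherent(trajs, tau) for trajs in per_condition):
        return None
    if len(per_condition) == 1:
        return "single"
    # Pool the trajectories of all conditions to test for one common pattern.
    pooled = [t for trajs in per_condition for t in trajs]
    return "coherent" if coherent(pooled, tau) else "independent response"


rising = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
falling = [[3.0, 2.0, 1.0], [5.0, 3.0, 1.0]]
```

With these toy trajectories, a module containing only `rising` is a single module, two conditions that both rise form a coherent module, and a module that rises under one condition but falls under another is an IR module.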
Figure III.3: a) Single module just for the first column (condition) b) Coherent module over all
columns c) Independent response module
III.4 Clustering algorithms
III.4.1 Distances
Recall that the definition of a cluster implies that the elements inside a cluster are more similar
to each other than to the remaining elements. In order to perform a clustering, it is thus essential
to measure similarity. The standard approach to measuring similarity is based on dissimilarity:
elements inside a cluster are required to have small distances to the other members of the cluster.
A simple and well-known distance measure is the Minkowski metric (see e.g. Jain et al., 1999, p.8):
$$d_p(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^{p} \right)^{1/p} \tag{III.9}$$
where $d$ denotes the number of components of the vectors. The Euclidean metric, an even more common
distance measure, is a special case of the Minkowski
metric for $p = 2$. In the three-dimensional space of gene × conditions × time expression experiments,
the Euclidean distance between two elements $E^{c_1}_{g_1 t_1}$ and $E^{c_2}_{g_2 t_2}$ (after Quackenbush, 2001, p.4) is:
$$d_{euclidean} = \sqrt{(g_1 - g_2)^2 + (c_1 - c_2)^2 + (t_1 - t_2)^2} \tag{III.10}$$
The Euclidean metric is suited for single data points. It can be applied to trajectory data as well.
However, the common distance measure used on trajectories is the Pearson distance, which accounts
for scaling and shifting: multiplying a trajectory $\theta = e^{c}_{gT}$ by a positive constant factor or adding a constant
to the trajectory will result in a trajectory $\theta'$ which has zero distance to $\theta$.
The Pearson distance is widely used in the analysis of gene expression data. For example, Eisen
et al. (1998) employ a distance based on the Pearson correlation. The actual empirical Pearson
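correlation coefficient reads as follows (Hartung, 2005, p.546):
$$\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)}\,\sqrt{Var(Y)}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{III.11}$$
The Pearson correlation itself is a similarity measure, which returns values between $-1$ and $1$, the
latter signifying perfect correlation. Common distance measures are $1 - \rho(X, Y)$ and $2 - \rho(X, Y)$.
The former is the Pearson distance employed throughout this work.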
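The difference between the two kinds of distance can be illustrated with a short sketch, which computes the Minkowski distance and the Pearson distance (one minus the empirical correlation) for a trajectory and a scaled and shifted copy of it. The function names and the example trajectory are assumptions for illustration only.

```python
from math import sqrt


def minkowski(x, y, p=2):
    """Minkowski distance; p = 2 yields the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def pearson_correlation(x, y):
    """Empirical Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pearson_distance(x, y):
    """Pearson distance: 0 for perfect correlation, 2 for perfect anti-correlation."""
    return 1.0 - pearson_correlation(x, y)


theta = [1.0, 3.0, 2.0, 5.0]
scaled_shifted = [2.0 * v + 4.0 for v in theta]  # positive factor plus offset
mirrored = [-v for v in theta]                   # negative factor
```

The Pearson distance between `theta` and `scaled_shifted` is zero up to rounding, whereas their Euclidean distance is clearly positive. A negative factor, as in `mirrored`, flips the sign of the correlation and yields the maximum distance of two.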
III.4.2 Clustering techniques
Given distances of elements such as gene expression values, numerous methods do exist which perform
a clustering based on these distances. A couple of simpler methods such as hierarchical clustering
(Sokal and Michener, 1958) or k-means clustering (McQueen, 1967) date back to the 1950s and 1960s.
More recently they have been introduced to the analysis of gene expression data, e.g. hierarchical
clustering by Weinstein et al. (1997) or k-means by Tavazoie et al. (1999). A variety of approaches
have been taken to tackle the clustering problem, such as self organizing maps (SOM) (Kohonen,
1990), spectral clustering (Ng et al., 2002) , model based clustering (Yeung et al., 2001) and others.
Jain et al. (1999) present a general summary of clustering techniques, while Thalamuthu et al. (2006)
review the clustering attempts in the field of microarray analysis.
In this work, clustering is employed as a substep for the algorithms in Section IV. The two clustering
algorithms used are described in more detail below:
III.4.2.1 Hierarchical clustering
Hierarchical clustering methods have been around for some time and have found numerous applications.
They became known to gene expression analysis through the works of Weinstein et al. (1997)
and Eisen et al. (1998), who cite the average-linkage method by Sokal and Michener (1958) as the
first hierarchical clustering approach. Hierarchical clustering algorithms are agglomerative methods
that join neighbouring elements. For example, the distances between gene expression values are
computed according to a distance measure (see Section III.4.1) and then the expression values are
grouped together with their immediate neighbours. In the following steps, the distances between
the groups are computed and this time the groups are merged to form larger groups. The process
is repeated until all expression values are merged into one large cluster. The merging structure
and thus the relationships of the clustered elements to each other are visualised in the form of a
dendrogram (see Figure III.4).
Figure III.4: The data-points on the left are clustered with a single-linkage hierarchical clustering
algorithm based on Euclidean distances. The resulting tree structure is displayed on the right. The
process of construction can be derived from the dendrogram: first, C and D are merged, as well as
A and B. Then, E and F are merged and finally the groups {C, D} and {E, F} with the elements
{A, B} on the left branch.
Several variants of the hierarchical clustering technique do exist, which differ in the calculation of
distances. Quackenbush (2001) lists, among others:
Single-linkage clustering: The distance between two clusters i and j is defined as the minimum
distance between a member of i and a member of j.
Complete-linkage clustering: The distance between the clusters i and j is defined as the
maximum distance between a member of i and a member of j.
Average-linkage clustering: The distance between i and j is defined as the average distance
between the clusters. This method, which is also known as unweighted pair-group method
average (UPGMA), averages over the distances of each data-point in i to all data-points in j.
Variants are methods employing median or centroid computations instead of the average and
the weighted pair-group average, which uses cluster sizes as weights.
While the single linkage allows for loose clusters, which is often an undesired result, the complete
linkage favours tight clusters. Average linkage ensures that for example single outliers are not able
to determine the distance between two clusters.
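The agglomerative procedure and the three linkage variants can be sketched as follows. This is a naive implementation meant to mirror the description above, not an efficient one; the function names, the one-dimensional example points and the use of the absolute difference as distance are assumptions for illustration.

```python
from itertools import product


def linkage_distance(a, b, dist, linkage):
    """Distance between clusters a and b under the chosen linkage."""
    d = [dist(x, y) for x, y in product(a, b)]
    if linkage == "single":
        return min(d)
    if linkage == "complete":
        return max(d)
    if linkage == "average":
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, dist, linkage="average"):
    """Merge the two closest clusters until one remains; return the merges in order."""
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_distance(
                clusters[ij[0]], clusters[ij[1]], dist, linkage),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [0.0, 1.0, 5.0, 6.5, 20.0]
merges = agglomerate(points, lambda x, y: abs(x - y), linkage="single")
```

The returned list of merges corresponds to reading the dendrogram from the leaves to the root: here the points 0.0 and 1.0 are joined first, then 5.0 and 6.5, then the two resulting groups, and finally the outlier 20.0.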
III.4.2.2 K-means clustering
The k-means algorithm (McQueen, 1967) is a standard clustering technique which is also widely
used in the analysis of microarray data (see e.g. Tavazoie et al., 1999). The algorithm partitions the
data into a pre-defined number of k clusters. Initially, all genes (gene expression values) are assigned
to one of k random cluster centres. Typically, the centres are not actually random but rather spread
equally over the range of possible values. Given the centres, each gene is assigned to the closest
centre according to the chosen distance measure, which results in k clusters of genes. Then, for each
cluster the centroid, i.e. the average gene, is computed and defined as a new centre. In an iterative
process, the assigning of genes to the closest centre and the subsequent updating of the centres is
repeated until convergence. An example can be found in Figure III.5.
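The alternation of assignment and centre update can be sketched as follows. The sketch assumes squared Euclidean distance and explicitly supplied starting centres, so that the result is deterministic; the function names and the two-dimensional toy points are hypothetical.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(members):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def kmeans(points, centres, max_iter=100):
    """Alternate assignment and update steps until the centres stop moving."""
    centres = [tuple(c) for c in centres]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: every point joins its closest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous centre.
        new_centres = [centroid(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, clusters = kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy input the procedure converges after two iterations and separates the three points near the origin from the three points near (8, 8).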
As the algorithm always returns k clusters, no matter how many clusters actually are present in
the data, the choice of k deserves some attention. Fortunately, the algorithm has linear complexity
(Jain et al., 1999, p.279), so that we can afford multiple runs with different choices of k. The most
appropriate value for k is determined by assessing the goodness of the clustering: the value of k is
chosen for which the overall cluster density is the lowest. The cluster density is usually the average
distance inside a cluster, either pairwise or to the centroid.
Another valid approach to estimating k involves a principal component analysis (PCA) (e.g. Raychaudhuri et al., 2000) in
a previous step, as suggested by Quackenbush (2001). The parameter k is then set to the number of
principal components found for the dataset.
Figure III.5: Example showing the result of a converged k-means clustering. The set of data-points
has been partitioned into k= 2 clusters. Cluster affiliation is symbolised by circles or squares,
respectively. The star symbols represent the cluster centres.
III.4.3 Biclustering methods
III.4.3.1 Review of biclustering methods
While clustering methods have frequently found application in the analysis of two-dimensional
datasets, they are unable to find biclusters (see Section III.2.2) and thus cannot retrieve all the
information contained in these datasets. As in the case of clustering techniques, the concept of
biclustering was already discovered several decades ago. The direct clustering approach of
Hartigan (1972) was the first to cluster both dimensions simultaneously.
With the advent of gene expression analysis, attention was also drawn to two-dimensional clustering,
spawning further development in this field. Alon et al. (1999) proposed a two way clustering on
both genes and conditions separately. They employed the deterministic annealing algorithm (Rose,
1998), a clustering technique, to obtain ordered cluster trees for genes and for conditions. Using this
clustering information, they rearranged the rows and columns of the data matrix in such a manner
that correlated genes and tissues were placed next to each other.
A year later, Cheng and Church (2000) presented the first "real" biclustering application on gene
expression data. They formalised the biclustering problem, showed it to be NP-hard and devised
a polynomial time heuristic, which clusters genes and conditions simultaneously. Their biclustering
algorithm is a matrix size reduction method, which aims at discarding rows and columns until a
subset with a low mean squared residue remains. A more detailed description of the algorithm is
given below (Section III.4.3.2).
In the meantime numerous biclustering algorithms have been proposed, which are based on diverse
principles. Comprehensive summaries of the biclustering methods available today are given by
Madeira and Oliveira (2004) and Tanay et al. (2005). Prelic et al. (2006) evaluate the performance
of several prominent biclustering methods.
A number of algorithms follow the general principle of the work of Alon et al. (1999) in applying
standard clustering techniques on rows and columns and combining the results afterwards. A
well-known example is the coupled two-way clustering (CTWC) by Getz et al. (2000). Note that methods
of this kind do not actually cluster genes and conditions simultaneously, a criterion which is
sometimes required to be met by a biclustering algorithm. However, this definition is controversial.
Frequently, these methods, too, are referred to as biclustering algorithms, as they definitely do find biclusters.
A huge variety of approaches originating from different fields of computer science have been taken
to address the problem of biclustering, e.g. spectral biclustering (Kluger et al., 2003) in extension
of the spectral clustering method by Ng et al. (2002) or the recently published non-smooth non-
negative matrix factorization (Carmona-Saez et al., 2006). The approach is based on a dimensionality
reduction technique, which was originally designed for use in face recognition (Lee and Seung, 1999).
In the comparative study of Prelic et al. (2006), besides the algorithm by Cheng and Church
(2000), the following widely used biclustering algorithms are included, for which software is publicly
available: The xMotif algorithm (Murali and Kasif, 2003), the Order Preserving Submatrix Algorithm
(OPSM) (Ben-Dor et al., 2003), the Statistical Algorithmic Method for Bicluster Analysis (SAMBA)
(Tanay et al., 2002) and the Iterative Signature Algorithm (ISA) (Bergmann et al., 2003), which will
be discussed in detail below (Section III.4.3.3). In contrast to the algorithms of Cheng and Church
(2000) and Getz et al. (2000), which aim at finding coherent values biclusters, the named algorithms
all search for coherent evolution biclusters.
xMotif The algorithm of Murali and Kasif (2003) finds xMotifs. An xMotif is defined as a subset
of genes which are simultaneously conserved under a subset of conditions, i.e. which are in the same
state for each condition. Being a coherent evolution method, the algorithm discretises the data and
assigns states to the genes (cf. Section III.2.3). If all genes of the xMotif are in the same state
under a given condition, this condition is said to match the motif and can be included. The authors
present a probabilistic algorithm for the NP-complete problem of finding the largest xMotif. In
contrast to the other biclustering methods, the results of the xMotif algorithm are comparable to
the concept of IR clusters (see Section III.3.2): genes are conserved under multiple conditions but
are in different states under different conditions.
OPSM The OPSM by Ben-Dor et al. (2003) focuses on submatrices which are order-preserving,
i.e. for which a permutation $\pi = (c_1, c_2, \dots, c_{|C_c|})$ of the columns $C_c$ exists, such that the values in
every row of the submatrix are monotonically increasing. If a submatrix contains only genes which
are co-regulated, there will always be such a permutation of the columns. Consider a bicluster
$e^{C_c}_{G_c}$. The authors define $(C_c, \pi)$ as a complete model of the bicluster, and each row $e^{C_c}_{g}$ for which
the $|C_c|$ values are monotonically increasing when ordered according to $\pi$ is said to support the
model. Such a gene $g$ can then be added to $G_c$. They aim at finding such complete models by employing a
heuristic which builds up complete models from smaller sets called partial models. Modifications of
the OPSM have been proposed by Liu et al. (2004) and Bleuler and Zitzler (2005), who relax the
ordering criterion and thus also accept imperfect submatrices, i.e. submatrices with slight deviations
from the order preservation criterion.
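Whether a row supports a given complete model is easy to check, as the following sketch shows. It assumes strict monotonic increase and represents the permutation as a tuple of column indices; the function names and the toy rows are hypothetical, and the heuristic search for the model itself is not covered.

```python
def supports(row, permutation):
    """True if the row's values strictly increase when the columns are
    visited in the order given by the permutation."""
    ordered = [row[c] for c in permutation]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


def supporting_genes(rows, permutation):
    """Names of all genes whose rows support the model."""
    return [g for g, row in rows.items() if supports(row, permutation)]


rows = {
    "g1": [5.0, 9.0, 1.0],
    "g2": [2.0, 7.0, 0.5],
    "g3": [4.0, 3.0, 0.0],
}
permutation = (2, 0, 1)  # column 2 < column 0 < column 1
```

Here `g1` and `g2` support the model, although their absolute values differ, while `g3` violates the ordering between the first and second column.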
SAMBA Tanay et al. (2002) with their SAMBA method make use of the fact that a gene
expression matrix can be viewed as a weighted bipartite graph G= (V , E). The set of vertices
V is partitioned into row (gene) nodes on one side and column (condition) nodes on the other side of
the bipartite graph, while edges E between elements from the two partitions stand for a significant
expression change of a gene when responding to a condition. The SAMBA method aims at finding
heavy sub-cliques in the weighted bipartite graph, which correspond to biclusters, the weights being
significance statements according to a statistical background model. Their data representation is a
discrete one with two states for up- and down-regulation, respectively. The degree of row vertices
is bounded by a parameter d, which, for computational complexity reasons, restricts the number of
genes in a bicluster. Employing the named restriction enables the SAMBA method to exhaustively
enumerate all biclusters inside the bounds given, whereas the other methods presented here all use
greedy search heuristics.
In the following, selected methods are examined more closely: the seminal paper by Cheng and
Church (2000), who introduced biclustering to gene expression analysis and a more recent approach,
the iterative signature algorithm (ISA) by Bergmann et al. (2003). One of the algorithms in Section
IV, the EDISA, is an extension of the latter.
III.4.3.2 The biclustering algorithm of Cheng & Church
The algorithm of Cheng and Church (2000) is a local method based on the assumption that the
expression values in a perfect gene-condition bicluster are equal. It allows, however, additive
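modifications (see Section III.2.3) for specific rows or columns. With the row and column averages, as
well as the average of the current submatrix (bicluster) removed, the remaining so-called residue
score is low for an element which belongs to the current bicluster, i.e. it does not show any
considerable variation of its own, apart from the variations covered by the row, column and submatrix averages.
Given a bicluster $e^{C_c}_{G_c}$ determined by a row subset $G_c \subseteq G$ and a column subset $C_c \subseteq C$, the
residue score $r^{c}_{g}$ of an element $e^{c}_{g}$ of the bicluster is defined as follows:
$$r^{c}_{g} = e^{c}_{g} - e^{C_c}_{g} - e^{c}_{G_c} + e^{C_c}_{G_c} \tag{III.12}$$
where $e^{C_c}_{g}$, $e^{c}_{G_c}$ and $e^{C_c}_{G_c}$ here denote the row, column and submatrix means:
$$e^{C_c}_{g} = \frac{1}{|C_c|} \sum_{c \in C_c} e^{c}_{g}, \qquad e^{c}_{G_c} = \frac{1}{|G_c|} \sum_{g \in G_c} e^{c}_{g}, \qquad e^{C_c}_{G_c} = \frac{1}{|G_c|\,|C_c|} \sum_{g \in G_c,\, c \in C_c} e^{c}_{g}$$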
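In order to evaluate a bicluster $e^{C_c}_{G_c}$, the mean squared residue score is employed:
$$S(C_c, G_c) = \frac{1}{|C_c|\,|G_c|} \sum_{c \in C_c,\, g \in G_c} \left(r^{c}_{g}\right)^2 \tag{III.13}$$
$e^{C_c}_{G_c}$ is called a $\delta$-bicluster if $S(C_c, G_c) \le \delta$ for a given threshold $\delta \ge 0$.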
Following a greedy search strategy, columns and rows are discarded step by step from the full matrix:
at each step, the row or column with the highest contribution to the mean squared residue score
is removed, as long as the overall score $S(C_c, G_c)$ is still larger than the threshold $\delta$. The
contribution of a row $d_{row}(g)$ or column $d_{col}(c)$ is computed analogously to Equation III.13: for a
row, the mean is taken over all columns $c \in C_c$, and, respectively, for a column, the mean over all
rows $g \in G_c$ is computed:
$$d_{row}(g) = \frac{1}{|C_c|} \sum_{c \in C_c} \left(r^{c}_{g}\right)^2, \qquad d_{col}(c) = \frac{1}{|G_c|} \sum_{g \in G_c} \left(r^{c}_{g}\right)^2 \tag{III.14}$$
The algorithm discovers one δ-bicluster at a time and masks the biclusters already found by
assigning random values to the entries contained therein. Thus, during further runs of the algorithm,
which, again, start with the full matrix, the masked entries are likely to have a high residue score
and a different bicluster will be found.
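The greedy deletion described above can be sketched as follows. The sketch covers only the step-by-step removal of the single worst row or column and is not a reproduction of the full algorithm of Cheng and Church (2000); in particular, masking and repeated runs are omitted. The function names, the guard against emptying the matrix and the toy matrix are assumptions for illustration.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def residues(E, rows, cols):
    """Residue of every element of the submatrix given by rows x cols."""
    row_mean = {g: mean(E[g][c] for c in cols) for g in rows}
    col_mean = {c: mean(E[g][c] for g in rows) for c in cols}
    total = mean(E[g][c] for g in rows for c in cols)
    return {(g, c): E[g][c] - row_mean[g] - col_mean[c] + total
            for g in rows for c in cols}


def mean_squared_residue(E, rows, cols):
    return mean(r ** 2 for r in residues(E, rows, cols).values())


def find_delta_bicluster(E, delta):
    """Remove the worst row or column until the score drops to delta or below."""
    rows = list(range(len(E)))
    cols = list(range(len(E[0])))
    # The size guard is an added safety net against rounding effects.
    while (len(rows) > 1 and len(cols) > 1
           and mean_squared_residue(E, rows, cols) > delta):
        r = residues(E, rows, cols)
        d_row = {g: mean(r[g, c] ** 2 for c in cols) for g in rows}
        d_col = {c: mean(r[g, c] ** 2 for g in rows) for c in cols}
        worst_row = max(rows, key=d_row.get)
        worst_col = max(cols, key=d_col.get)
        if d_row[worst_row] >= d_col[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Rows 0 to 2 follow an additive pattern, row 3 does not.
E = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0, 7.0],
    [9.0, 0.0, 7.0, 1.0],
]
rows, cols = find_delta_bicluster(E, delta=0.01)
```

On this toy matrix a single deletion suffices: the last row has by far the highest contribution, and after its removal the remaining submatrix is perfectly additive, so its mean squared residue is zero up to rounding.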
As the authors proved the biclustering problem to be NP-hard (Cheng and Church, 2000), a heuristic
method like the above is necessary. To overcome the typical difficulties associated with greedy search,
it is advisable to employ alternative search strategies, as done for example by Bryan et al. (2005),
who use a Simulated Annealing approach combined with the mean squared residue score.
III.4.3.3 The iterative signature algorithm (ISA)
A more recent biclustering method which scored well in the comparative study performed by Prelic
et al. (2006) is the ISA (Bergmann et al., 2003). Previously, the authors had developed the signature
algorithm (SA) (Ihmels et al., 2002). For most applications (e.g. Ihmels et al., 2005), however, an
extended version called the iterative signature algorithm (ISA) is used (Bergmann et al., 2003),
which repeatedly applies the signature algorithm. The latter is also the version of the algorithm
presented below.
Concept In contrast to the biclustering method of Cheng and Church (2000), the ISA does not
remove rows and columns from the whole expression matrix, but rather draws random samples with
replacement and aims at finding the most evident signal therein, i.e. the signature of a putative
bicluster. This, again, is a process of removing columns and rows - this time from the sample
submatrix - until the submatrix is either empty, or a bicluster coherent over both rows and columns
does remain.
The discarding of rows and columns is performed using a threshold function, such that genes which
are not sufficiently aligned with the condition profile of the other genes contained in the cluster
are removed. Analogously, conditions which, regarding the genes, exhibit a pattern dissimilar to
the other conditions in the cluster (i.e. which have dissimilar gene profiles) are discarded. For a
definition of the terms gene profile and condition profile see Section III.1.
In the ISA extension of the signature algorithm, the process is repeated over several iterations until
convergence, removing genes which do not match the condition profile and then conditions which
do not match the gene profile in an alternating fashion.
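Mathematical formalism The ISA works on two normalised copies of the gene expression matrix
$E$: $E_G$, a collection of gene profiles, is normalised with respect to the rows (genes) and $E_C$, a
collection of condition profiles, is normalised with respect to the columns (conditions).
\[ E_G = (\hat{e}_{Gc_1}, \hat{e}_{Gc_2}, \ldots, \hat{e}_{Gc_{|C|}}), \quad E_C = (\hat{e}_{g_1C}, \hat{e}_{g_2C}, \ldots, \hat{e}_{g_{|G|}C}) \tag{III.15} \]
The terms $\hat{e}_{Gc}$ and $\hat{e}_{gC}$ refer to the normalised gene and condition profiles:
\[ \hat{e}_{Gc} = \frac{e_{Gc} - \langle e_{gc} \rangle_{g \in G}}{|e_{Gc} - \langle e_{gc} \rangle_{g \in G}|}, \quad \hat{e}_{gC} = \frac{e_{gC} - \langle e_{gc} \rangle_{c \in C}}{|e_{gC} - \langle e_{gc} \rangle_{c \in C}|} \tag{III.16} \]
Through the above normalisation, the vectors $\hat{e}_{Gc}$ and $\hat{e}_{gC}$ have zero mean and unit length, which
allows for comparing different genes or different conditions with each other.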
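The normalisation of Equation III.16 can be illustrated with a minimal Python sketch. It assumes that the expression matrix is given as a list of rows (genes) over columns (conditions). The function names `normalise` and `normalised_copies` are hypothetical, and the treatment of constant profiles is an assumption that the text does not cover.

```python
import math

def normalise(vec):
    """Shift a profile to zero mean and scale it to unit Euclidean length."""
    mean = sum(vec) / len(vec)
    centred = [v - mean for v in vec]
    norm = math.sqrt(sum(v * v for v in centred))
    if norm == 0.0:
        # Assumption: a constant profile carries no signal and is left at zero.
        return [0.0] * len(vec)
    return [v / norm for v in centred]

def normalised_copies(E):
    """Return (E_G, E_C) for a genes x conditions matrix E given as a list of rows."""
    # E_C: the vector of each gene over all conditions is normalised.
    E_C = [normalise(row) for row in E]
    # E_G: the vector of each condition over all genes is normalised.
    cols = [normalise(list(col)) for col in zip(*E)]
    E_G = [list(row) for row in zip(*cols)]
    return E_G, E_C

# Toy data, invented for illustration only.
E = [[1.0, 2.0, 3.0], [2.0, 4.0, 9.0], [0.0, 1.0, 5.0]]
E_G, E_C = normalised_copies(E)
```

In this sketch every row of `E_C` and every column of `E_G` has zero mean and unit length, so that scalar products between them behave like correlation coefficients.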
The ISA always works on submatrices of $E_G$ and $E_C$. Initially, random samples are drawn and
during the iterations rows and columns are removed, resulting in smaller submatrices. Both
operations can be realised using linear transformations which project the binary gene vector
$g_m = (g_m^1, g_m^2, \ldots, g_m^{|G|})$ and the binary condition vector $c_m = (c_m^1, c_m^2, \ldots, c_m^{|C|})$ onto the respective
matrices $E_G$ and $E_C$.
An entry $c_m^i = 1$ or $g_m^i = 1$ signifies that gene or condition $i$ is to remain in the matrix. For example,
under element-wise multiplication $(1,2,3,4) \cdot (0,0,1,1) = (0,0,3,4)$: here, the first two entries have been discarded.
The second important step during execution of the ISA is applying the threshold function and
thereby determining the values for $g_m$ and $c_m$. The threshold function $f_t(x)$ involves a weight
function $w(x)$ and a step function $\Theta(x)$. While the weight function is optional, the actual threshold
is applied by the step function, which sets all negative entries to zero, i.e. all profiles $x$ that do
not exceed $\mu(x)$ by at least $t \cdot \sigma(x)$, where $t$ is the threshold. Conversely, the positive entries are
set to 1. If the weight function is not specified, as no previous knowledge on the importance of the
elements is available, all weights are set to 1.
\[ f_t(x) = \begin{pmatrix} w(x_1)\,\Theta(\tilde{x}_1 - t) \\ \vdots \\ w(x_{|X|})\,\Theta(\tilde{x}_{|X|} - t) \end{pmatrix} \quad \text{with} \quad \tilde{x}_i = \frac{x_i - \mu(x)}{\sigma(x)} \]
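The threshold function described above can be sketched in a few lines of Python. This is an illustration under assumptions: the name `threshold` is hypothetical, the population standard deviation is used for $\sigma(x)$, an entry lying exactly $t$ standard deviations above the mean is kept, and a vector without any variation is mapped to zero.

```python
import statistics

def threshold(x, t, weights=None):
    """Keep entries that exceed the mean of x by at least t standard deviations."""
    if weights is None:
        # No prior knowledge on the importance of the elements: all weights are 1.
        weights = [1.0] * len(x)
    mu = statistics.fmean(x)
    sigma = statistics.pstdev(x)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0] * len(x)     # assumption: no variation, nothing passes
    # Step function applied to the standardised entries, scaled by the weights.
    return [w * (1.0 if (xi - mu) / sigma >= t else 0.0)
            for xi, w in zip(x, weights)]

kept = threshold([1.0, 1.0, 1.0, 10.0], t=1.0)
```

With the default weights the result is a binary vector that can serve directly as $g_m$ or $c_m$.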
Using the above threshold function, the values for $c_m$ and $g_m$ can be set. Different thresholds can
be employed for conditions ($t_C$) and genes ($t_G$).
The ISA now proceeds by sampling a sufficiently large number of seeds $\{g^{(0)}\}$ and iteratively applying
the following update equations, which, in an alternating fashion, assign a gene set to a set of
conditions and vice versa, refining the selection at each step.
\[ c^{(n)} = f_{t_C}(E_G \cdot g^{(n)}), \quad g^{(n+1)} = f_{t_G}(E_C \cdot c^{(n)}) \]
As, with every iteration, the set of genes becomes more homogeneous regarding the threshold
criterion, the series $\{g^{(0)}, g^{(1)}, \ldots\}$ usually converges to a fixed point vector $g^{(*)}$, where all genes
meet the criterion. Such a fixed point is reached when the following holds for all $n$ above a certain
iteration number:
\[ \frac{|g^{(*)} - g^{(n)}|}{|g^{(*)} + g^{(n)}|} < \varepsilon \tag{III.21} \]
The above process is then repeated until convergence for each of the seeds. If a sufficiently large
number of seeds was chosen, all biclusters can be found using this approach. Varying the threshold
parameters $t_G$ and $t_C$ allows for detecting loosely as well as tightly co-regulated modules.
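The alternating iteration can be sketched for a single seed as follows. This is a simplified illustration, not the reference implementation: the names `unit`, `step` and `isa_from_seed` are hypothetical, all weights are 1, the gene and condition vectors are kept binary, and the fixed point is detected by the gene vector no longer changing rather than by the $\varepsilon$ criterion. The toy matrix is invented.

```python
import math
import statistics

def unit(vec):
    """Zero mean, unit length (constant vectors are mapped to zero)."""
    mean = sum(vec) / len(vec)
    centred = [v - mean for v in vec]
    norm = math.sqrt(sum(v * v for v in centred))
    return [v / norm for v in centred] if norm > 0 else [0.0] * len(vec)

def step(scores, t):
    """Binary vector: 1 where a score lies at least t std. deviations above the mean."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return [0] * len(scores)
    return [1 if (s - mu) / sigma >= t else 0 for s in scores]

def isa_from_seed(E, seed, t_g, t_c, max_iter=100):
    """Iterate one binary gene seed to a fixed point; returns (genes, conditions)."""
    n_g, n_c = len(E), len(E[0])
    E_C = [unit(row) for row in E]                   # each gene normalised over conditions
    E_G_cols = [unit(list(col)) for col in zip(*E)]  # each condition normalised over genes
    g, c = list(seed), [0] * n_c
    for _ in range(max_iter):
        # Score every condition against the current gene set, then threshold.
        c_scores = [sum(E_G_cols[j][i] * g[i] for i in range(n_g)) for j in range(n_c)]
        c = step(c_scores, t_c)
        # Score every gene against the selected conditions, then threshold.
        g_scores = [sum(E_C[i][j] * c[j] for j in range(n_c)) for i in range(n_g)]
        g_new = step(g_scores, t_g)
        if g_new == g:  # fixed point reached
            break
        g = g_new
    return g, c

E = [[5, 5, 0, 0], [5, 5, 0, 0], [5, 5, 0, 0],
     [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
genes, conditions = isa_from_seed(E, seed=[1, 1, 0, 0, 0, 0], t_g=0.5, t_c=0.5)
```

In practice many random seeds would be drawn (for instance with `random.sample`) and each of them iterated in this way, so that identical fixed points found from different seeds can be merged afterwards.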
Extending the ISA An extension of the ISA was presented by Kloster et al. (2005). While
the original algorithm does not make use of the modules previously identified, the extension called
the Progressive Iterative Signature Algorithm (PISA) avoids finding the same module twice. Due
to the sampling process employed for creating the initial seeds, strong signal modules will appear
and be detected several times, while weaker modules might remain undiscovered. As pursued by
Lazzeroni and Owen (2002), a possible way of preventing this behaviour is to subtract the modules
already found from the expression matrix. For the PISA, a similar approach is taken in requiring the
condition vector of the current module to be orthogonal to the condition vectors of all the modules
that have already been discovered.
III.4.4 Clustering on three-dimensional datasets
As of today, only few methods for clustering three-dimensional data exist. Recently, an approach
for clustering gene-sample-time microarray data (Jiang et al., 2004) and the TriCluster algorithm
(Zhao and Zaki, 2005) have been published, which are the first to address this problem. While
Jiang et al.’s approach regards the time dimension as a single, indivisible trajectory given a subset
of the gene and condition space, the latter method is a triclustering algorithm in the sense that it
partitions the gene, as well as the condition and time dimension. Bleuler and Zitzler (2005) present
a variation of the OPSM (Ben-Dor et al., 2003) employing an evolutionary algorithm as a search
strategy. They also give a brief outlook on the application of their method on three-dimensional
data. The approach is, however, not yet fully developed and not published as of today.
A number of algorithms for time-series data have been devised which are based on conventional full
space clustering methods. These are not triclustering methods, but, naturally, they do operate on
gene-time data, the special properties of which are worth considering when designing algorithms for
three-dimensional datasets: multiple data points from a time-series are usually not independent, as opposed to multiple
conditions or samples. A comprehensive review of time-series data algorithms is given by Bar-Joseph
(2004). Ernst et al. (2005) point out that, for short time-series, many of the trajectories found as a
result of clustering techniques may be due to chance. In order to address this problem they devise an
algorithm which is able to tell significant from chance trajectories. In the following, the algorithms
just mentioned are examined more closely.
III.4.4.1 The algorithm of Ernst et al.
Due to the limited number of time-points in most time-series experiments, it becomes possible to
enumerate all possible gene-time profiles, i.e. the possible shapes of trajectories. The actual gene
trajectories are then mapped against the predefined profiles, enriched profiles being regarded as
significant: if more genes are assigned to a profile than expected, the profile is deemed significant
with regard to the underlying null hypothesis. Significant profiles can either be regarded as centroids
for a single cluster or they can be combined in order to form larger clusters. An implementation of
the algorithm is available in the STEM software package (Ernst and Bar-Joseph, 2006).
III.4.4.2 The algorithm of Jiang et al.
Jiang et al.’s algorithm is similar to the two-step clustering approach (Section IV.1.2) in that it uses
a Pearson measure to score time trajectories in the two-dimensional condition × gene space. Based
on this measure it then proceeds by enumerating all combinations of conditions and determining
the maximal subsets of genes that give rise to coherent clusters when their expression is monitored
in these samples. Besides this so-called sample-gene search, the authors also apply a gene-sample
search which works in a similar fashion: In the first step, all gene combinations are enumerated and
afterwards subsets of samples are tried such that maximal coherent clusters are formed together
with the genes.
III.4.4.3 Zhao & Zaki’s TriCluster algorithm
Even though the TriCluster algorithm allows sub-partitions of all three dimensions and is thus
capable of finding triclusters, it does not cluster all dimensions at the same time. Rather, the
authors devise a biclustering method for gene × sample time-slices, one slice for each time-point
measured, and construct a graph on the clusters found in these slices. From this graph they
finally extract triclusters of the form $C = X \times Y \times Z = \{c_{ijk}\}$ with $X \subseteq G$, $Y \subseteq S$ and $Z \subseteq T$.
Step 1: biclustering The biclustering step is performed by the BiCluster algorithm developed
by the same authors. Being a graph-based approach, the algorithm aims at finding maximal cliques
in a multigraph with conditions as nodes $s_i$ and edges labelled with sets of genes $\{g_j\}$. The weight
of an edge $e$ connecting nodes $c_1$ and $c_2$ corresponds to the ratio of the expression values of the
genes $\{g_j\}$ under conditions $c_1$ and $c_2$, respectively. The authors introduce so-called ratio ranges
which represent a class of similar ratios and are treated as a single ratio. Each edge is labelled with
a set of genes which fall into the same class. For an example see Figure III.6. The algorithm is also
capable of finding multiplicative and additive patterns (see Section III.2.3).
Figure III.6: Example for a multigraph as used by the TriCluster algorithm. Nodes are labelled
with conditions $s_i$, while edges are labelled with a set of genes $\{g_j\}$. Edge weights correspond to
the ratio ranges mentioned in Section III.4.4.3. For example, the genes $\{g_1, g_4, g_8\}$ fall into a gene
expression ratio class (range) of 2/1 with respect to conditions $s_4$ and $s_6$. The figure is taken from
Zhao and Zaki’s paper.
Step 2: intersecting the biclusters Using the BiCluster algorithm described above, a set of
biclusters is computed for each time-slice. An example adapted from Zhao and Zaki (2005) illustrates
the process of mining triclusters. A path which intersects biclusters from the different time-slices
finally results in a bicluster which is restricted to a subset of time-slices, i.e. a tricluster.
Figure III.7: The example, adapted from Zhao and Zaki (2005), illustrates how the TriCluster
algorithm harvests triclusters from the biclusters $C_i$ found in the time-slices $t_j$:
intersecting the bicluster $C_1^{t_0}$ with biclusters from the time-slices $t_3$ and $t_8$
yields the gene and sample sets $\{g_1, g_4, g_8\} \times \{s_0, s_1, s_4\}$. Each
intersection step is performed only if constraints, e.g. on the minimum number of genes, are still met
afterwards. The path described above results in the tricluster $\{g_1, g_4, g_8\} \times \{s_0, s_1, s_4\} \times \{t_0, t_3, t_8\}$.
Likewise, all possible paths are tried and only maximal triclusters are kept.
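The intersection along a single path can be sketched as follows. This is only an illustration of the intersection idea, not of the TriCluster graph search itself: the function name `intersect_path`, the constraint parameters and the three input biclusters are hypothetical, the latter chosen so that the result matches the tricluster named in the caption. Enumerating all paths and keeping only maximal triclusters is not shown.

```python
def intersect_path(path, min_genes=2, min_samples=2):
    """Intersect biclusters along one path of time-slices.

    path: list of (time_point, gene_set, sample_set).
    A slice is only added if the constraints still hold after the intersection.
    """
    genes, samples, times = None, None, []
    for time_point, g, s in path:
        if genes is None:
            new_g, new_s = set(g), set(s)
        else:
            new_g, new_s = genes & set(g), samples & set(s)
        if len(new_g) >= min_genes and len(new_s) >= min_samples:
            genes, samples = new_g, new_s
            times.append(time_point)
    return genes, samples, times

# Invented biclusters for three time-slices.
path = [
    ("t0", {"g1", "g4", "g8", "g10"}, {"s0", "s1", "s4", "s6"}),
    ("t3", {"g0", "g1", "g4", "g8"}, {"s0", "s1", "s4"}),
    ("t8", {"g1", "g4", "g8"}, {"s0", "s1", "s4", "s5"}),
]
genes, samples, times = intersect_path(path)
```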
III.5 Motivation for the algorithms
The objective of this work is to develop methods which are capable of detecting gene regulatory
modules and to apply these methods to the three-dimensional Arabidopsis thaliana dataset presented
in Section II.2. As mentioned previously, the biological notion of a gene regulatory module closely
corresponds to the computational notion of a tricluster. Notwithstanding other approaches, which might
make use of different computational concepts for modelling a gene module, clustering techniques appear
to be the obvious solution to the problem of finding gene modules which satisfy the definition given
in Section III.3.2.
A lot of work has been devoted to developing clustering and biclustering techniques (see the previous
sections for details); however, these methods are usually not applicable to three-dimensional datasets
without adapting the algorithms. Moreover, most algorithms search for coherent patterns only and
do not differentiate between the module types. Recall that there are three
biologically motivated types of modules, namely single modules, coherent modules and independent
response (IR) modules (see Section III.3.2).
In contrast, only very few publications deal with clustering on three-dimensional data. As
sufficiently large three-dimensional datasets have only recently become available, there
are no established standard techniques. Only the approaches by Jiang et al. (2004) and
Zhao and Zaki (2005) could be applied to the AtGenExpress dataset; however, no publicly available
implementations exist. Furthermore, both methods, like the majority of biclustering algorithms,
look for coherent modules only, and no further applications of these algorithms apart from the original
publications can be found in the literature as of today.
Consequently, in order to properly analyse the given dataset and to search for the three different
kinds of modules, new module finding algorithms need to be developed. The strategy followed is twofold:
1. As pursued by the biclustering method of Getz et al. (2000) or the three-dimensional approach
by Jiang et al. (2004), a possibility of dealing with the three-dimensional case is the application
of conventional clustering or biclustering techniques to lower-dimensional subsets. Inspired by
the named algorithms and the treatment of time-series data in the work of Ernst and Bar-
Joseph (2006), the two-step clustering uses k-means clustering on two-dimensional subsets
combined with an approach to assess the clusters’ significance.
The two-step clustering is designed as an explorative approach, which allows for user interaction
and visual inspection of the intermediate steps. An automated version is supplied but an
additional feature of the method lies in the combination of data mining and the human eye.
A detailed description is given in Section IV.1.
2. The second strategy is of different nature. It extends a known biclustering method in a manner
that makes it applicable to three-dimensional datasets. As the performance of the triclustering
algorithm will depend to some extent on the biclustering method it is based on, it is advisable
to choose a biclustering algorithm which performs well.
In the comparative study of Prelic et al. (2006), the ISA (Bergmann et al., 2003), SAMBA
(Tanay et al., 2002) and OPSM (Ben-Dor et al., 2003) methods scored far better than the
algorithm of Cheng and Church (2000) and the xMotif algorithm by Murali and Kasif (2003).
According to the study, the xMotif algorithm, though the only method able to identify IR
modules, fails to cope appropriately with additive patterns and, moreover, its greedy search
strategy prevents it from achieving better results. This is also the case for the biclustering
algorithm by Cheng and Church (2000) which frequently gets stuck in local maxima, that often
correspond to biologically uninteresting stationary expression biclusters.
Among the remaining algorithms, the ISA proved to be more robust against noise than the
competing methods. Moreover, the structure of the ISA lends itself to application on the
three-dimensional dataset. On the other hand, SAMBA’s restriction to just two states for
up- and down-regulation makes it appear unsuitable for the comparison of time-trajectories as
we encounter them in three-dimensional gene expression datasets. In contrast, the framework
of the ISA is of a general nature and need not be changed. The introduction of
Pearson distances, which are best suited to compare time-trajectories, and modifications of the
merit function are sufficient for the adaptation. The extended version of the ISA, the EDISA,
is described in Section IV.2.
Chapter IV
The algorithms
In this work two main algorithms were designed to tackle the problem of finding gene expression
modules in three-dimensional data. If not stated otherwise, the dataset the algorithms were run on
is the three-dimensional AtGenExpress data mentioned in Section II.2.
To begin with, an exploratory approach based on a combination of conventional clustering techniques
was employed, which, though of a basic nature, already gave rise to numerous interesting results
(see Section V.1). Going beyond the scope of this two-step clustering presented in Section IV.1, the
EDISA algorithm (see Section IV.2) modifies a biclustering technique to work on three-dimensional
data. The results of the latter algorithm are presented in Section V.2.
IV.1 Two-step clustering exploratory approach
IV.1.1 Assumptions and introductory remarks
Utilising the properties of the three-dimensional dataset, the two-step clustering approach presented
here is based on the following assumptions: Firstly, it seems reasonable to assume that genes, which
form modules, i.e. cluster together simultaneously under a number of different conditions, should
also cluster within a single one of these. Going beyond this supposition, it is also advisable to relax
the constraint of requiring coherent clusters across all dimensions, demanding only that genes should
cluster under at least one condition. Proceeding in this way allows for detecting single, as well as
coherent and independent response (IR) modules. These module types can be found by using a
cluster as a seed. Viewing the genes contained in this initial cluster under the remaining conditions
will then reveal whether it is specific for the initial condition, whether it turns out to be a coherent
module for a number of conditions or whether the genes cluster together under several conditions
but with a different time trajectory each, thus forming an IR module.
Secondly, the time dimension cannot be partitioned like the gene and sample dimensions, a property
which reduces the search space substantially. Future extensions of this algorithm might
distinguish between an early and a late response trajectory and thus partition the time axis; even
then, however, the order of time-points should not be permuted.
Consequently, for the exploratory approach, the three-dimensional dataset at hand is mapped onto
$|C|$ two-dimensional slides on which gene expression is plotted against the course of time, $|C|$ being
the number of different stress conditions (see Figure IV.1).
Employing the above condition slide visualisation of the search space, the first step of the algorithm
consists of a k-means clustering under a single condition. The significance of the clusters returned
is evaluated with the help of a permutation analysis, giving rise to p-values associated with the
clusters. Subsequently iterating through all conditions and the significant clusters found in the
corresponding slides, it is possible to enumerate all modules, which can, potentially, be obtained
using this approach.
Figure IV.1: Here, an example for the condition slide visualisation of the three-dimensional dataset
is given. For each experimental condition (in this example: responses to cold, salt and osmotic
stress), time is plotted on the x-axis, whereas gene expression is plotted on the y-axis.
IV.1.2 The two-step clustering algorithm
IV.1.2.1 Definitions
To begin with, we need a definition for significant clusters and modules in the context of the
two-step clustering approach. Remember from Section III.3.2 that for a module to be significant, the
application of a criterion function must result in a value below some threshold $\tau$:
\[ T_{\mathrm{signif}} = \{ (G_c, C_c, T_c) \subseteq (G, C, T) \mid f_{\mathrm{criterion}}(e_{C_c G_c T_c}) < \tau \} \tag{IV.1} \]
As the algorithm makes use of classical clustering techniques, we need to define a significant cluster
of genes $G_c$ under one condition $c$ first.
\[ \rho(e_{c G_c T}, \zeta) < \tau \quad \text{with} \quad \zeta = \langle e_{c G_c T} \rangle_{g \in G_c} \]
Thus, a significant cluster in the context of the two-step clustering approach has a density, i.e. an
average Pearson distance to the centroid $\zeta$, which does not exceed a threshold $\tau$. The latter is
determined by a permutation analysis (see IV.1.2.2).
As outlined above, modules are found by extending initial clusters found under one condition.
Consequently, for a significant module $e_{C_m G_m T}$ the following must hold:
\[ \forall c \in C_m : \; \| e_{c G_m T} \|_{\rho} < \tau \tag{IV.3} \]
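The density criterion can be made concrete with a short Python sketch. The names `pearson_distance`, `density` and `is_significant` are hypothetical, the centroid is taken to be the point-wise mean of the trajectories, and assigning a distance of 1 to a constant trajectory is an assumption made here to avoid a division by zero.

```python
import math

def pearson_distance(a, b):
    """1 - Pearson correlation of two trajectories of equal length."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    na = math.sqrt(sum(x * x for x in da))
    nb = math.sqrt(sum(y * y for y in db))
    if na == 0.0 or nb == 0.0:
        return 1.0  # assumption: a constant trajectory is treated as uncorrelated
    return 1.0 - sum(x * y for x, y in zip(da, db)) / (na * nb)

def density(trajectories):
    """Average Pearson distance of the trajectories to their centroid."""
    centroid = [sum(v) / len(trajectories) for v in zip(*trajectories)]
    return sum(pearson_distance(t, centroid) for t in trajectories) / len(trajectories)

def is_significant(trajectories, tau):
    """A cluster counts as significant if its density stays below the threshold tau."""
    return density(trajectories) < tau

tight = [[1, 2, 3, 4], [2, 4, 6, 8]]   # same shape, different scale
loose = [[1, 2, 3, 4], [4, 1, 3, 2]]   # dissimilar shapes
```

Because the Pearson distance ignores offset and scale, the two trajectories in `tight` have a density of practically zero although their absolute expression values differ.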
IV.1.2.2 Evaluation of Statistical Significance
Before the clustering, the computations needed to estimate the significance of a cluster have to be
performed. These computations, which are part of a permutation analysis, have to be
conducted only once.
A common criterion of goodness for a cluster is its density, which is, usually, a measure of deviance
for the genes contained therein. For this approach, the standard deviation is employed. It is,
however, insufficient only to look for clusters with low densities, as drawing genes at random could
equally well result in such a low density. In order to obtain a statistical assessment of whether a
dense cluster does reflect a true biological pattern or whether it could possibly be the result of a
chance permutation of the dataset, 50,000 runs of the algorithm (see Algorithm 1) were performed
on randomly shuffled data. Performing this permutation analysis, a null hypothesis distribution
could be computed. For each cluster size, a probability distribution was fitted to the histograms of
the cluster densities obtained for clusters of the respective size (see Figure IV.2). During this step,
the normal, log-normal, beta and gamma distributions were considered. Each fit was validated by a
χ² test such that the best distribution, with respect to the χ² test, was defined as the null distribution
for clusters of the respective size. Using the null distributions obtained that way, a p-value can be
computed for each cluster, given its size and density.
Figure IV.2: Histogram of cluster densities for clusters of size 5 obtained during the permutation
analysis (x-axis: cluster density, from 0 to 0.45; y-axis: number of clusters). Here, a log-normal
distribution can be fitted to the histogram.
The permutation analysis results in continuous distributions of densities, one for each cluster size.
Naturally, as the number of cluster findings decreases with increasing cluster size, at some point a
computational barrier will be reached beyond which the data will be too sparse for a sufficiently
dense histogram. In such a case one can resort to a discrete distribution, comprising just the p-values
0 and 1, corresponding to a significant or insignificant cluster, respectively. The minimum of the
densities recorded for a certain cluster size $s$ is taken as a threshold for significance. All $s$-clusters
with below-threshold densities are deemed significant.
Given the results of the permutation analysis, the threshold $\tau$ is set to the density value which
corresponds to a p-value of $p = 0.05$, or, in case we have discrete p-values, $p = 0$. Thus, all $s$-clusters
are deemed significant whose density would arise under the background distribution with a probability of at most 5%.
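The idea of the permutation analysis can be sketched in Python under clearly simplifying assumptions. Instead of running the clustering on shuffled data and fitting a parametric distribution per cluster size, as described above, the sketch draws random gene groups of a given size from time-shuffled trajectories and computes an empirical p-value. All function names are hypothetical and the random input data is invented.

```python
import math
import random

def pearson_distance(a, b):
    """1 - Pearson correlation; constant vectors get distance 1 (assumption)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [y - mb for y in b]
    na = math.sqrt(sum(x * x for x in da))
    nb = math.sqrt(sum(y * y for y in db))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(da, db)) / (na * nb)

def density(trajectories):
    """Average Pearson distance to the centroid."""
    centroid = [sum(v) / len(trajectories) for v in zip(*trajectories)]
    return sum(pearson_distance(t, centroid) for t in trajectories) / len(trajectories)

def null_densities(trajectories, size, n_perm, rng):
    """Densities of random gene groups whose time points were shuffled per gene."""
    result = []
    for _ in range(n_perm):
        group = rng.sample(trajectories, size)
        shuffled = [rng.sample(t, len(t)) for t in group]  # permute the time points
        result.append(density(shuffled))
    return result

def empirical_p_value(observed, null):
    """Share of null densities at least as small as the observed one (add-one corrected)."""
    hits = sum(1 for d in null if d <= observed)
    return (hits + 1) / (len(null) + 1)

rng = random.Random(0)
data = [[rng.random() for _ in range(6)] for _ in range(30)]  # invented trajectories
null = null_densities(data, size=5, n_perm=200, rng=rng)
```

A cluster of size 5 with density `d` would then be called significant if `empirical_p_value(d, null)` does not exceed 0.05. With far fewer permutations than the 50,000 runs used in this work, such an empirical estimate is only a rough stand-in for the fitted null distributions.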
IV.1.2.3 Initial Clustering Step
Given a set of suitable null distributions, the data can be partitioned into condition slides, which
are then searched for significant clusters. Formally, a condition slide for condition $c$ corresponds
to the vector $e_{cGT}$. For the algorithm, initially one condition slide $e_{c_{\mathrm{initial}}GT}$ with $c_{\mathrm{initial}} \in C$ is
selected, i.e. all gene expression values at all points in time are extracted for which the condition is
$c_{\mathrm{initial}}$.
The data extracted in this manner is then clustered with a conventional clustering technique, which
returns a set $L = \{G_{c(1)}, G_{c(2)}, \ldots, G_{c(|L|)}\}$ of gene clusters. For this work, k-means is the method
of choice. Alternatively, the user is free to choose hierarchical clustering, which is also implemented.
Either clustering method works on Pearson distances (see III.4.1).
The basic idea behind this approach is to search for significant clusters in one condition and then to
follow them along the condition dimension. Thus, the quality of the results depends crucially on the
initial clustering. It is a desirable property of the initial clustering that only dense and meaningful
clusters be created in this step. Consequently, avoiding large, meaningless clusters of genes which do
not fit particularly well into a specific cluster, an incomplete partitioning of the genes is performed
at this point. To achieve this, only those genes are allowed into the initial clustering, which have a
Pearson distance of $\delta = 0.01$ or lower to at least one other gene. As we search for dense clusters,
rather than large ones, this appears to be a reasonable constraint. Such a threshold will separate out
genes which do not show a significant affiliation with a specific cluster, thus providing an incomplete
clustering of the current condition slide with relatively few but dense clusters. Formally, $G = L \cup \bar{L}$,
i.e. a set of genes $\bar{L}$ remains unclustered. Setting $\delta = 0.01$ results in clear clusters; however, trading
off density for size, the user is free to vary the distance threshold $\delta$ to fit the purpose.
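The pre-filtering of genes can be sketched as follows. The function name `filter_dense_genes` and the toy trajectories are hypothetical. The quadratic pairwise comparison is chosen for clarity only and would be slow on a full dataset.

```python
import math

def pearson_distance(a, b):
    """1 - Pearson correlation; constant vectors get distance 1 (assumption)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [y - mb for y in b]
    na = math.sqrt(sum(x * x for x in da))
    nb = math.sqrt(sum(y * y for y in db))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(da, db)) / (na * nb)

def filter_dense_genes(slide, delta=0.01):
    """Keep genes that lie within Pearson distance delta of at least one other gene.

    slide: dict mapping gene name -> time trajectory under one condition.
    """
    kept = set()
    names = list(slide)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson_distance(slide[a], slide[b]) <= delta:
                kept.update((a, b))
    return kept

# Invented condition slide with two co-expressed genes and one outlier.
slide = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2]}
clusterable = filter_dense_genes(slide)
```

Only the genes returned here would enter the k-means step; the remaining genes stay unclustered.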
The number of partitions, i.e. the parameter $k$ for the k-means clustering, is determined by a
principal component analysis (PCA) such that $k$ is set to the number of principal components that
explain $(1 - \alpha) \cdot 100\,\%$ of the variation in the data, where a default value of $\alpha = 0.05$ yields the best
results.
Once the initial clustering step has provided a number of seed clusters for the condition currently
viewed, the significant ones are selected and followed along the condition dimension. Given a
significant gene cluster $G_{c(i)}$ for one stress condition $c_j$, the expression data of the genes participating
in the cluster is viewed under all other conditions $C \setminus \{c_j\}$ (for examples, see V.1). The initial gene cluster
$G_{c(i)}$, which is by definition a cluster under at least one condition, is then extended to as many other
conditions as found to be significant.
Formally, a significant module based on the gene cluster $G_{c(i)}$ is a vector $e_{C_m G_{c(i)} T}$ with $\| e_{c G_{c(i)} T} \|_{\rho} < \tau$
for all conditions $c \in C_m$. Whether a module is a single, coherent or IR module can be determined
using the rules from Section III.3.2.
The process described is repeated for all possible condition slides and for all clusters found in these
condition slides (see Algorithm 1).
IV.1.2.5 Post-processing
Consider a module obtained by clustering $e_{c_{\mathrm{initial}}GT}$. As the actual k-means clustering has only been
performed for the initial condition $c_{\mathrm{initial}}$, the clusters might not meet the density requirements we
demand in order to call the cluster significant for the other conditions, even though there might
exist a cluster structure which is obscured by noise. Consequently, two kinds of post-processing
operations are applied, if necessary.
It is a common observation that a number of outliers appear in the clusters under the conditions
$C \setminus \{c_{\mathrm{initial}}\}$, which narrowly made it into the initial cluster $G_{c(i)}$, but which, obviously, do not fit
into the other clusters. To address this problem, a number of pruning steps can be performed, each
removing the gene with the greatest contribution to the standard deviation of the cluster to be
pruned. For an example see Figure IV.4.
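A pruning step can be sketched as follows. Note the assumption made here: the contribution of a gene to the spread of the cluster is measured by its Pearson distance to the centroid, which stands in for the standard-deviation-based contribution mentioned above. The names `prune`, `tau` and `max_steps` as well as the toy cluster are hypothetical.

```python
import math

def pearson_distance(a, b):
    """1 - Pearson correlation; constant vectors get distance 1 (assumption)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da, db = [x - ma for x in a], [y - mb for y in b]
    na = math.sqrt(sum(x * x for x in da))
    nb = math.sqrt(sum(y * y for y in db))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(da, db)) / (na * nb)

def prune(cluster, tau, max_steps=10):
    """Remove the strongest outlier gene until the density drops below tau.

    cluster: dict mapping gene name -> trajectory under the condition to be pruned.
    """
    kept = dict(cluster)
    for _ in range(max_steps):
        if len(kept) < 3:  # do not prune below a minimal cluster size
            break
        centroid = [sum(v) / len(kept) for v in zip(*kept.values())]
        dist = {g: pearson_distance(t, centroid) for g, t in kept.items()}
        if sum(dist.values()) / len(dist) < tau:
            break  # cluster is dense enough
        worst = max(dist, key=dist.get)  # gene farthest from the centroid
        del kept[worst]
    return kept

# Invented cluster: three genes with similar shape and one outlier.
cluster = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 2, 3, 5], "out": [4, 1, 3, 2]}
pruned = prune(cluster, tau=0.05)
```

Limiting the number of pruning steps keeps the procedure from whittling a cluster without real structure down to a trivially dense pair of genes.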
Another common case is the existence of a discernible sub-cluster under one or more conditions.
Apparently, in these cases, a number of genes co-cluster over several conditions, while other,
condition-specific genes are also included in the initial cluster $G_{c(i)}$. These condition-specific genes
may very well fit into the initial cluster. However, under the other conditions they may obscure
the signal, as they do not belong to the pattern all conditions have in common. To extract such
a sub-cluster, another k-means clustering is applied for the conditions under which a sub-partition
can be spotted visually (see Figure IV.3). The most significant sub-cluster is then treated as the new initial
cluster $G'$.
Figure IV.3: Starting from a cluster initially obtained for the condition cold, the two-step clustering
approach yields a cluster for the heat condition (shown on the left). The cluster is shaded, indicating
that it is not considered significant (p-value: 0.34). However, a sub-partition seems to be
a good cluster candidate. By post-processing, the sub-partition can be extracted, resulting in the
significant module on the right. (Both panels: x-axis, time points 0.5 h, 1 h, 3 h, 6 h, 12 h and 24 h;
y-axis, gene expression.)
Figure IV.4: Example for cluster pruning: the solid lines correspond to the genes with the greatest
contribution to the cluster’s standard deviation. Removing them results in a denser cluster and thus
improves the score. In each pruning step, the strongest outlier gene is removed until a score below
the significance threshold is reached. (x-axis: time points from 0.5 h to 24 h.)
IV.1.2.6 GO mapping
The two-step clustering approach is aimed at finding relatively small, dense and biologically
meaningful modules, in the sense that the genes are co-regulated when responding to a number
of distinct conditions, as opposed to huge and rather loose modules which are likely not to respond
to any particular condition. It should then be possible to annotate the dense modules with the
biological function of the genes contained therein. Therefore, a mapping to the gene ontology (GO)
is performed for all genes contained in the module. The publicly accessible NetAffx™ web frontend
(Liu et al., 2003) is used for this purpose. The genes in a module are annotated with information
about the biological process they are involved in, the cellular component and the molecular function.
A measure of significance for the mapping is given in the form of p-values returned by NetAffx™.
Algorithm 1 Two-step clustering approach
Input: Gene × Time × Condition dataset. The function GetSlide(c, g) extracts the condition slide
for condition c, containing all genes from g (see Figure IV.1).
Output: Results, a set of clusters annotated with the conditions the clusters comprise.
for all Condition(i) ∈ C do
    CurrentSlide ← GetSlide(Condition(i), Genes);
    for all Cluster(j) do
        if (significant) then
            Cluster(j).conditions ← Cluster(j).conditions ∪ Condition(i);
            for all m ∈ C \ Condition(i) do
                ClusterUnderConditionM ← GetSlide(Condition(m), Cluster(j));
                if significant(ClusterUnderConditionM) then
                    Cluster(j).conditions ← Cluster(j).conditions ∪ Condition(m);
                end if
            end for
            Results ← Results ∪ Cluster(j);
        end if
    end for
end for
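The control flow of Algorithm 1 can be mirrored in a short Python sketch. The clustering and the significance test are passed in as functions, because their details (k-means on Pearson distances, p-values from the permutation analysis) are described elsewhere. All names, the data layout and the toy stand-ins for these two functions are assumptions made for illustration.

```python
def get_slide(data, condition, genes):
    """Extract the trajectories of the given genes under one condition."""
    return {g: data[condition][g] for g in genes}

def two_step(data, genes, cluster_fn, significant_fn):
    """data: dict condition -> dict gene -> trajectory.

    cluster_fn(slide) returns a list of gene sets; significant_fn(slide) returns a bool.
    Returns a list of (gene set, list of conditions) pairs.
    """
    results = []
    for c_init in data:
        current_slide = get_slide(data, c_init, genes)
        for cluster in cluster_fn(current_slide):
            if not significant_fn(get_slide(data, c_init, cluster)):
                continue
            conditions = [c_init]
            # Follow the seed cluster through all other conditions.
            for c_other in data:
                if c_other != c_init and significant_fn(get_slide(data, c_other, cluster)):
                    conditions.append(c_other)
            results.append((frozenset(cluster), conditions))
    return results

# Invented toy data and deliberately trivial stand-ins for clustering and testing.
data = {
    "cold": {"g1": [1, 2, 3], "g2": [1, 2, 3], "g3": [3, 1, 2]},
    "heat": {"g1": [1, 2, 3], "g2": [3, 2, 1], "g3": [0, 0, 1]},
}
fixed_clusters = lambda slide: [{"g1", "g2"}]
all_identical = lambda slide: len({tuple(t) for t in slide.values()}) == 1
modules = two_step(data, ["g1", "g2", "g3"], fixed_clusters, all_identical)
```

In this toy run the seed cluster is significant under cold only, so it would correspond to a module specific to a single condition. Classifying a module as single, coherent or IR according to Section III.3.2 is not part of the sketch.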
IV.2 Embedded Dimension ISA
IV.2.1 Concept
The Embedded Dimension Iterative Signature Algorithm (EDISA) is an extension of the ISA
(Bergmann et al., 2003), which is described in detail in Section III.4.3.3. While the ISA operates
only on two-dimensional data, the EDISA is capable of mining gene modules in a three-dimensional
dataset like the AtGenExpress data used throughout this work. Furthermore, the EDISA is able to
mine coherent, as well as IR modules (see Section III.3) and to explicitly return only modules of the
desired type.
Commonly, two-dimensional gene expression datasets comprise a gene and either a time or a
condition dimension. In a three-dimensional dataset, we encounter both the time and the condition
axis. Again, it is desirable to change the order of conditions and genes to obtain a cluster (module).
The time dimension, however, should not be permuted, as the time-points do not correspond to
distinct experimental designs as in the case of conditions. The time-points measured are actually
discrete snapshots of an underlying continuous time-series. A gene’s behaviour over time under a
given condition can thus best be visualised by a time-trajectory.
Hence, a distance measure suitable for comparing time trajectories is employed: the Pearson
distance is used to modify the merit function of the ISA. The time dimension is embedded into
the gene and condition profiles of the ISA, respectively, giving rise to gene-time and condition-time
profiles. A condition or a gene in the module is thus characterised by a condition-time or gene-time
profile, each comprising multiple time series (see Figure IV.5).
Figure IV.5: A part of the three-dimensional dataset on the left is represented as a condition-time
profile for gene 1 on the right. The curves are symbolic time trajectories for gene 1 under conditions
1, 2 and 3, ranging from time point $t_0$ to $t_n$.
Instead of applying thresholds on the gene expression values under a certain condition, the EDISA
judges time-trajectories and applies thresholds on Pearson distances to other time trajectories. For
example, only genes exhibiting time trajectories sufficiently aligned with the trajectories of the
other genes in the module are kept, whereas the remaining ones are discarded. Employing the above
procedure, the recursion formula of the original ISA algorithm can be used and is repeated until
convergence, i.e. until no further genes or conditions are removed from the module. The resulting
fixed point corresponds to a module.
In order to explicitly search for either coherent or IR modules, the step which removes columns from
the current sample is adjusted to fit the purpose. The idea is presented in the following section
along with an example, while a complete mathematical description follows in Section IV.2.3.
IV.2.2 An example
Leaving the mathematical formalism to the next section, the EDISA is introduced based on a simple
example. The algorithm being of a randomized nature, a large number of small, random samples is
taken in order to cover the search space. Each of the samples is broken down to its signature. Rows
and columns are removed according to a threshold criterion.
Assume that we have a 5 × 4 matrix filled with trajectory data (Figure IV.6). This is the sample
which serves as input for the algorithm. The EDISA now searches either for coherent or for IR
modules. In order to find coherent modules, all columns which have a Pearson distance greater than
a threshold τC to the average column trajectory are removed, and likewise all rows which have a
Pearson distance greater than a threshold τG to the average row trajectory are discarded. For
the example module, columns C3 and C4 are removed, as well as row G5, which results in a coherent
module (see Figure IV.7 and Table IV.1). A typical sample size would be 100 genes instead of five,
so that the procedure of removing rows and columns has to be repeated several times until a module
homogeneous over both rows and columns remains. The thresholds τC and τG correspond to
the largest acceptable distance to the average column or row, respectively.
Figure IV.6: Time-trajectories for five genes (G1, G2, G3, G4, G5) and four conditions (C1, C2,
C3, C4)
                 C1     C2     C3     C4
average column   0.13   0.14   0.77*  0.45*

      average row
G1    0.027
G2    0.027
G3    0.027
G4    0.027
G5    0.67*
Table IV.1: EDISA for coherent modules: Pearson distances (1 - Pearson correlation) to the average
row or column trajectory. Based on these distances, the EDISA reduces the sample to a coherent
module (see Figure IV.7). The values marked with an asterisk are greater than the thresholds τC and τG,
respectively. Averages are computed according to Equations IV.5 and IV.6, which will be discussed
in Section IV.2.3.
Figure IV.7: The EDISA finds a coherent module in the sample by applying the thresholds
τG = τC = 0.2. Based on the distances from Table IV.1, columns C3 and C4 are discarded because
they do not meet the threshold criterion. In the second step, the bottommost row G5 is removed,
which leads to a coherent module.
In order to search for IR modules, columns need to be treated differently. We still apply a threshold
on the distances to the average row trajectory; however, differences between the column trajectories
must be allowed. Thus, the threshold τC is applied on the average intra-column distance, i.e. the
average distance of the trajectories inside a column to the average trajectory of that column. A
perfectly homogeneous column has zero intra-column distance and remains in the module.
As a consequence of this modification a different module is obtained, which comprises one column
more than the coherent module (see Table IV.2 and Figure IV.8).
                          C1     C2     C3     C4
intra-column distance     0.27   0.12   0.19   0.73*

      average row
G1    0.04
G2    0.04
G3    0.04
G4    0.04
G5    0.95*
Table IV.2: EDISA for IR modules: On the left, average Pearson distances inside the columns are
shown (intra-column distances). On the right, Pearson distances to the average row are listed (see
Figure IV.8). Again, the values marked with an asterisk exceed the respective threshold values. For details see
Equation IV.7.
Figure IV.8: By applying the thresholds τC = τG = 0.3, first column C4 is removed and then row G5.
The remaining module is a perfect IR module with neither inter-row nor intra-column deviations.
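The difference between the two column criteria can be sketched in Python on a small toy sample. The data layout (a dictionary mapping each condition to a list of gene trajectories), all function names and the numbers are assumptions made for illustration; they do not reproduce Figure IV.6. In particular, the coherent criterion is read here as the distance between a column's average trajectory and the average trajectory over the whole sample, which is one possible interpretation of the "average column trajectory".

```python
import math

def pearson_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Assumption: flat trajectories count as uncorrelated (distance 1).
    return 1.0 if vx == 0 or vy == 0 else 1.0 - cov / math.sqrt(vx * vy)

def mean_trajectory(trajectories):
    """Point-wise average of a list of equally long trajectories."""
    return [sum(values) / len(values) for values in zip(*trajectories)]

def coherent_column_distance(sample, condition):
    """Distance of one column's average trajectory to the overall average."""
    column_mean = mean_trajectory(sample[condition])
    overall_mean = mean_trajectory([t for col in sample.values() for t in col])
    return pearson_distance(column_mean, overall_mean)

def intra_column_distance(sample, condition):
    """Average distance of a column's trajectories to that column's average."""
    column = sample[condition]
    column_mean = mean_trajectory(column)
    return sum(pearson_distance(t, column_mean) for t in column) / len(column)

# Hypothetical sample: three genes (list positions) under four conditions.
sample = {
    "C1": [[0, 1, 2, 3], [1, 2, 3, 4], [0, 2, 4, 6]],  # all rising
    "C2": [[0, 1, 2, 3], [2, 3, 4, 5], [0, 1, 2, 3]],  # all rising
    "C3": [[3, 2, 1, 0], [6, 4, 2, 0], [4, 3, 2, 1]],  # all falling
    "C4": [[0, 1, 2, 3], [3, 2, 1, 0], [3, 0, 3, 0]],  # no common shape
}

tau_c = 0.3
coherent_columns = [c for c in sample if coherent_column_distance(sample, c) < tau_c]
ir_columns = [c for c in sample if intra_column_distance(sample, c) < tau_c]
print(coherent_columns, ir_columns)
```

In this toy sample the homogeneous but differently shaped column C3 fails the coherent criterion and passes the IR criterion, whereas the heterogeneous column C4 fails both.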
IV.2.3 Mathematical description and algorithm
IV.2.3.1 Definitions of module types
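In Section III.1 a vector-based notation for trajectories $e_{cgT}$, columns $e_{cGT}$ and rows $e_{CgT}$
has been introduced:
\[ e_{cgT} = (E_{cgt_1}, E_{cgt_2}, \ldots, E_{cgt_{|T|}}) \]
\[ e_{CgT} = (e_{c_1gT}, e_{c_2gT}, \ldots, e_{c_{|C|}gT}) \]
\[ e_{cGT} = (e_{cg_1T}, e_{cg_2T}, \ldots, e_{cg_{|G|}T}) \]
Further, define the average trajectory $\bar{e}_{CGT}$ over all conditions $C_m$ and all genes $G_m$, as well as
the average trajectory over all genes for one condition $\bar{e}_{cGT}$:
\[ \bar{e}_{CGT} = \frac{1}{|G_m|\,|C_m|} \sum_{c \in C_m} \sum_{g \in G_m} e_{cgT}, \qquad \bar{e}_{cGT} = \frac{1}{|G_m|} \sum_{g \in G_m} e_{cgT} \]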
The EDISA is able to find the three different kinds of modules described in Section III.3.2. Single
modules can be seen as a special case of the coherent modules, where the number of conditions is
set to one. The independent response modules, however, are different and thus need to be defined
separately.
For coherent modules, the inter-column as well as the inter-row distances must not exceed the
thresholds $\tau_C$ and $\tau_G$, respectively. This definition requires all time-trajectories in a module to have
a low Pearson distance to each other. The set of coherent modules is defined as:
\[ TM_{\mathrm{coherent}} := \left\{ (G_m, C_m) \;\middle|\; \begin{array}{l} \forall c \in C_m : \rho\left(\bar{e}_{cG_mT}, \bar{e}_{C_mG_mT}\right) < \tau_C \\ \forall g \in G_m : \rho\left(e_{C_mgT}, \bar{e}_{C_mG_mT}\right) < \tau_G \end{array} \right\} \]
In case we wish to mine IR modules instead of coherent ones, the threshold on the inter-column
distances has to be replaced with a threshold on the average distance inside the columns, i.e. the
intra-column distance. This allows for different time trajectories under different conditions, while
within one condition all genes still must have similar trajectories. The set of IR modules is defined
as:
\[ TM_{\mathrm{IR}} := \left\{ (G_m, C_m) \;\middle|\; \begin{array}{l} \forall c \in C_m : \left\langle \rho\left(e_{cgT}, \bar{e}_{cG_mT}\right) \right\rangle_{g \in G_m} < \tau_C \\ \forall g \in G_m : \rho\left(e_{C_mgT}, \bar{e}_{C_mG_mT}\right) < \tau_G \end{array} \right\} \tag{IV.7} \]
IV.2.3.2 Threshold functions
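The EDISA operates on sets of genes $G_n$ and sets of conditions $C_n$. These sets are iteratively refined.
At iteration step $n$, a filter is applied to remove those genes and conditions from the current sets
which do not meet the module criteria. This results in new gene and condition sets $G_{n+1}$ and $C_{n+1}$
for the next iteration step $n+1$.
Assume we are searching for coherent modules. Then, given the current $G_n$ and $C_n$, $C_{n+1}$ is
computed using the threshold function $f^{\tau_C}_{\mathrm{thr}}$:
\[ C_{n+1} = \left\{ c \in C_n \mid f^{\tau_C}_{\mathrm{thr}}(e_{cG_nT}) \right\} \]
\[ \text{with } f^{\tau_C}_{\mathrm{thr}}(e_{cG_nT}) = c : \text{ if } \rho\left(e_{cG_nT}, \left\langle e_{C_nG_nT} \right\rangle\right) < \tau_C \]
Given $G_n$ and $C_{n+1}$, $G_{n+1}$ can be computed:
\[ G_{n+1} = \left\{ g \in G_n \mid f^{\tau_G}_{\mathrm{thr}}(e_{C_{n+1}gT}) \right\} \]
\[ \text{with } f^{\tau_G}_{\mathrm{thr}}(e_{C_{n+1}gT}) = g : \text{ if } \rho\left(e_{C_{n+1}gT}, \left\langle e_{C_{n+1}G_nT} \right\rangle\right) < \tau_G \]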
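In order to search for IR modules, the threshold criteria have to be adjusted according to Equation
IV.7, i.e. a modified version of the threshold function $f^{\tau_C}_{\mathrm{thr}}$ is needed:
\[ C_{n+1} = \left\{ c \in C_n \mid f^{\tau_C}_{\mathrm{thr}}(e_{cG_nT}) \right\} \]
\[ \text{with } f^{\tau_C}_{\mathrm{thr}}(e_{cG_nT}) = c : \text{ if } \left\langle \rho\left(e_{cgT}, \bar{e}_{cG_nT}\right) \right\rangle_{g \in G_n} < \tau_C \]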
IV.2.3.3 The EDISA algorithm
The EDISA algorithm searches for either coherent or IR modules and, during this process, refines
the set of conditions $C_n$ and the set of genes $G_n$ at each iteration step $n$. Initially, the sets $G_0$ and
$C_0$ are provided. Typically, the number of conditions is relatively small compared to the number of
genes. Thus, we can set $C_0 = C$. For $G_0$, $m$ random genes from $G$ are sampled: $G_0 = \mathrm{rand}_G(m)$.
Given these initial sets, the following update formulas are repeated until convergence:
\[ \begin{aligned} C_{n+1} &= \left\{ c \in C_n \mid f^{\tau_C}_{\mathrm{thr}}(e_{cG_nT}) \right\} \\ G_{n+1} &= \left\{ g \in G_n \mid f^{\tau_G}_{\mathrm{thr}}(e_{C_{n+1}gT}) \right\} \end{aligned} \tag{IV.11} \]
Convergence is reached at iteration $n$ if $G_n = G_{n-1}$ and $C_n = C_{n-1}$.
Again, for IR modules the threshold function for the conditions has to be replaced by its modified version:
\[ C_{n+1} = \left\{ c \in C_n \mid f^{\tau_C}_{\mathrm{thr}}(e_{cG_nT}) \right\} \]
The above iteration process reduces one random gene sample of size $m$ to its signature, i.e. it returns
a module. It is left to the post-processing steps (see Section IV.5) to score the module. The EDISA
repeatedly applies the iteration process on different random samples as seeds, generating a module
each time. The number of runs needed to cover the dataset depends on the choice of the seeds,
which will be discussed in the following section.
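The update formulas can be sketched as a small fixed-point iteration in Python. This is a minimal sketch under stated assumptions, not the Matlab implementation: the nested-dictionary layout `data[condition][gene]`, the function names, the `max_iter` safeguard and the toy data are all hypothetical. The column criterion compares a condition's average trajectory with the average over the current module (coherent mode) or uses the average intra-column distance (IR mode); the row criterion compares a gene's concatenated condition-time profile with the average profile of the current genes.

```python
import math

def pearson_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Assumption: flat profiles count as uncorrelated (distance 1).
    return 1.0 if vx == 0 or vy == 0 else 1.0 - cov / math.sqrt(vx * vy)

def mean_vector(vectors):
    return [sum(v) / len(v) for v in zip(*vectors)]

def column_value(data, c, genes, conditions, mode):
    column_mean = mean_vector([data[c][g] for g in genes])
    if mode == "ir":  # average intra-column distance
        return sum(pearson_distance(data[c][g], column_mean) for g in genes) / len(genes)
    overall = mean_vector([data[k][g] for k in conditions for g in genes])
    return pearson_distance(column_mean, overall)

def row_distance(data, g, genes, conditions):
    def profile(gene):  # concatenated trajectories over the given conditions
        return [v for c in conditions for v in data[c][gene]]
    return pearson_distance(profile(g), mean_vector([profile(h) for h in genes]))

def refine(data, genes, conditions, tau_c, tau_g, mode="coherent", max_iter=100):
    """Repeat the condition and gene filters until neither set changes."""
    genes, conditions = list(genes), list(conditions)
    if not genes or not conditions:
        return [], []
    for _ in range(max_iter):
        new_conditions = [c for c in conditions
                          if column_value(data, c, genes, conditions, mode) < tau_c]
        if not new_conditions:
            return [], []
        new_genes = [g for g in genes
                     if row_distance(data, g, genes, new_conditions) < tau_g]
        if not new_genes:
            return [], []
        if new_genes == genes and new_conditions == conditions:
            break  # fixed point reached
        genes, conditions = new_genes, new_conditions
    return genes, conditions

# Hypothetical sample: G1-G3 rise under C1 and C2 and fall under C3; G4 does the opposite.
data = {
    "C1": {"G1": [0, 1, 2, 3], "G2": [1, 2, 3, 4], "G3": [0, 2, 4, 6], "G4": [3, 2, 1, 0]},
    "C2": {"G1": [0, 1, 2, 3], "G2": [2, 3, 4, 5], "G3": [0, 1, 2, 3], "G4": [3, 2, 1, 0]},
    "C3": {"G1": [3, 2, 1, 0], "G2": [6, 4, 2, 0], "G3": [4, 3, 2, 1], "G4": [0, 1, 2, 3]},
}
all_genes, all_conditions = ["G1", "G2", "G3", "G4"], ["C1", "C2", "C3"]
coherent = refine(data, all_genes, all_conditions, tau_c=0.3, tau_g=0.3)
ir = refine(data, all_genes, all_conditions, tau_c=0.6, tau_g=0.3, mode="ir")
print(coherent, ir)
```

On this toy sample the coherent search drops condition C3 and gene G4, while the IR search, run with a higher condition threshold, keeps all three conditions and drops only G4.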
IV.3 Choosing the seeds
A known problem of the ISA, which also affects the EDISA, lies in its predilection for the strongest
signals, which are found hundreds of times before a weaker signal is detected, if at all. If a strong
signal makes it into an initial sample, it dominates the average, which is therefore drawn towards
the respective signal. Lazzeroni and Owen (2002) address a similar problem by subtracting signals
which are contained in the modules already detected. Kloster et al. (2005) extend the ISA by
demanding that the condition vector of each new module be orthogonal to the condition vectors of
the previously identified modules.
For the EDISA, the data is pre-processed in such a way that only selected genes are presented to the
algorithm instead of drawing totally random samples. Two different seed-choosing methods ensure
that the seeds are of better quality than random samples and that genes which were drawn in one step
are not immediately returned to the pool, thus increasing the probability that other genes are
contained in a module.
IV.3.1 Guide tree
The first seed-choosing method involves a hierarchical clustering on the whole matrix, which, since
it does not take into account the three-dimensionality of the dataset, is not a particularly good
clustering approach. It is, however, easy and quick to perform as a standard pre-processing step and
provides a coarse-resolution picture of the signal-to-noise distribution in the matrix. The hierarchical
clustering returns a dendrogram, from whose leaves the samples for the EDISA are drawn. Let s be
the sample size passed on to the EDISA as a parameter. Then, a leaf is picked at random and genes
are drawn from it and from the neighbouring leaves until we have selected s of them. This selection
is then treated as a seed for the EDISA instead of a random sample.
In order not to present the same genes to the algorithm over and over, the hierarchical clustering
operates on a different set of genes each time. In the Matlab implementation, every 100 runs a
new, randomly selected 50 % partition of the dataset is chosen to which the hierarchical clustering is
applied.
The second approach proceeds by randomly sampling a single gene and then choosing its s − 1
nearest neighbours, where s is the desired sample size. Pearson distances between complete rows in
the matrix, i.e. vectors eCgT, are employed for this step.
Again, we are interested in not seeing the same genes too frequently. Therefore, the nearest neighbour
choosing operates on a subset of the data. Good results were obtained by selecting a third of the
dataset at random without replacement and choosing 100 seeds from the subset. Then, the remaining
two thirds are drawn in turn and, each time, 100 seeds are selected from them. When the gene pool has
been drained, it is refilled and the process starts all over again. This also ensures that a gene which
could dominate a module will not be chosen again too soon.
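The nearest-neighbour seeding with a draining gene pool might be sketched as follows. The names `nearest_neighbour_seed` and `seed_stream`, the dictionary of gene profiles and the small example sizes are assumptions for illustration; the sketch splits a shuffled gene list into roughly three subsets, draws a fixed number of seeds from each, and reshuffles once all subsets have been used.

```python
import itertools
import math
import random

def pearson_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Assumption: flat profiles count as uncorrelated (distance 1).
    return 1.0 if vx == 0 or vy == 0 else 1.0 - cov / math.sqrt(vx * vy)

def nearest_neighbour_seed(profiles, pool, sample_size, rng):
    """Pick one gene from the pool at random and add its s - 1 nearest neighbours."""
    start = rng.choice(pool)
    others = sorted((g for g in pool if g != start),
                    key=lambda g: pearson_distance(profiles[start], profiles[g]))
    return [start] + others[:sample_size - 1]

def seed_stream(profiles, sample_size, seeds_per_subset=100, rng=None):
    """Yield seeds indefinitely, draining the gene pool about one third at a time."""
    rng = rng or random.Random()
    genes = list(profiles)
    if not genes:
        return
    subset_size = math.ceil(len(genes) / 3)
    while True:
        rng.shuffle(genes)  # refill the pool in a new random order
        for start in range(0, len(genes), subset_size):
            subset = genes[start:start + subset_size]
            for _ in range(seeds_per_subset):
                yield nearest_neighbour_seed(profiles, subset, sample_size, rng)

# Hypothetical data: nine genes with random complete-row profiles of six values.
rng = random.Random(0)
profiles = {f"G{i}": [rng.random() for _ in range(6)] for i in range(9)}
stream = seed_stream(profiles, sample_size=3, seeds_per_subset=2, rng=rng)
seeds = list(itertools.islice(stream, 6))
print(seeds)
```

With nine genes, subsets of three and a sample size of three, every seed is a complete subset, so the first, third and fifth seeds together cover the whole pool before it is refilled.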
IV.4 Choosing the parameters
Basically, two important parameters, the thresholds τC and τG, have to be specified for running the
EDISA. Their values depend on the variance of the dataset and the seed-choosing method employed.
If the thresholds are too low, none of the rows or columns will meet the criterion, which results in
empty clusters. If the thresholds are too high, they will not be able to separate signals from
background and the unmodified initial samples are returned. In order to obtain good results, it is essential
to find the threshold which separates out the signals. The location of this threshold depends on the
composition of the dataset and will be relatively high for datasets with a considerable amount of
variation in the background.
Generally, a good strategy for setting the thresholds is to start with relatively high levels, which are
then decreased at every iteration. By using this step-down procedure, thresholds are adjusted to
the composition of the modules, which become more homogeneous with every iteration. The Matlab
implementation of the EDISA supports such step-down thresholds. Good results on the artificial
dataset were obtained with initial values (1, 1) and a step-down of 0.2.
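A step-down schedule of this kind could look like the following sketch. The function name and the lower bound `floor` are hypothetical; the text only states the initial value and the step, so the point at which the threshold stops decreasing is an assumption.

```python
def step_down_schedule(initial, step, floor, n_iterations):
    """Thresholds for iterations 0 .. n_iterations - 1, never below `floor`."""
    # round() removes floating-point noise such as 0.3999999999999999.
    return [max(floor, round(initial - i * step, 10)) for i in range(n_iterations)]

# Initial value 1 and step-down 0.2 as reported for the artificial dataset;
# the floor of 0.2 and the number of iterations are illustrative choices.
tau_g_schedule = step_down_schedule(initial=1.0, step=0.2, floor=0.2, n_iterations=6)
print(tau_g_schedule)
```

In an iterative run, the threshold at position n of the schedule would be used for the filter at iteration step n, so that early iterations are permissive and later ones strict.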
A common strategy, which is already known from the ISA, requires several passes over the data
at different resolutions. Starting with a low value of τG = 0.01 for the very dense modules, the
threshold is gradually increased to cover the fuzzier modules. Often, on biological data, the same
module is found at different resolutions, i.e. comprising a dense core in the first pass or as an
extended, fuzzier version with the higher threshold settings. In several cases, however, modules were
only found for a particular parameter setting.
The threshold τC is less sensitive to variation. A range of values is acceptable, inside which the
modules are appropriately reduced in size along their condition dimension. Typically, the values of
τC need to be higher when searching for IR modules, as the intra-column distances scale differently
in comparison to the distances between the columns.
IV.5 Post-processing
As the algorithm involves drawing a large number of random samples, not all of the results will
be useful. In fact, we have to extract the few well scoring modules. Due to the properties of the
algorithm, all modules are optimised, given their initial gene set. As the latter, however, is randomly
sampled, it may in many cases not contain genes that form a well-scoring module. As a
post-processing step we thus need to score the modules in order to be able to extract the good ones.
Furthermore, however well we may pre-process the data (see IV.3), it is inevitable that the
randomised approach will yield the same module a number of times. Also, a maximal module
will be found along with numerous copies of its submodules. Consequently, for a proper evaluation
of the results, a merging step is advisable (IV.5.2), that removes duplicates and thus allows a much
clearer view on the results.
IV.5.1 Module scores
In order to assign scores to the modules, one must distinguish between coherent and IR modules.
While good coherent modules score well when regarded as IR modules, an IR module, on the other
hand, is a very bad example for a coherent module. In both cases, however, the score relies on the
average distances to the module’s centroid.
For IR modules, Pearson distances to the average trajectory under each condition $c \in C_m$ are
computed and averaged afterwards. This ensures that modules which are homogeneous inside the
conditions receive a good score, regardless of the differences between the conditions:
\[ S_m = \left\langle \rho\left(e_{cgT}, \bar{e}_{cG_mT}\right) \right\rangle_{c \in C_m,\, g \in G_m} \]
Coherent modules are scored differently. Here, we have to ensure that the trajectory follows the
same shape under each condition. We thus, in contrast to the scoring of IR modules, compare all
trajectories against the same average trajectory over all conditions:
\[ S_m = \left\langle \rho\left(e_{cgT}, \bar{e}_{C_mG_mT}\right) \right\rangle_{c \in C_m,\, g \in G_m} \]
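The two scores can be contrasted in a short Python sketch. The module layout (condition mapped to a list of gene trajectories), the function names and the toy modules are assumptions for illustration; lower scores indicate denser modules.

```python
import math

def pearson_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Assumption: flat trajectories count as uncorrelated (distance 1).
    return 1.0 if vx == 0 or vy == 0 else 1.0 - cov / math.sqrt(vx * vy)

def mean_trajectory(trajectories):
    return [sum(values) / len(values) for values in zip(*trajectories)]

def ir_score(module):
    """Average distance of each trajectory to the average of its own condition."""
    distances = []
    for column in module.values():
        column_mean = mean_trajectory(column)
        distances += [pearson_distance(t, column_mean) for t in column]
    return sum(distances) / len(distances)

def coherent_score(module):
    """Average distance of each trajectory to the average over all conditions."""
    trajectories = [t for column in module.values() for t in column]
    overall_mean = mean_trajectory(trajectories)
    return sum(pearson_distance(t, overall_mean) for t in trajectories) / len(trajectories)

# Hypothetical modules with two genes under two conditions each.
coherent_module = {"A": [[0, 1, 2, 3], [1, 3, 5, 7]], "B": [[0, 2, 4, 6], [1, 2, 3, 4]]}
ir_module = {"A": [[0, 1, 2, 3], [1, 3, 5, 7]], "B": [[3, 2, 1, 0], [4, 3, 2, 1]]}

print(ir_score(coherent_module), coherent_score(coherent_module))
print(ir_score(ir_module), coherent_score(ir_module))
```

The coherent toy module scores close to zero under both measures, whereas the IR toy module scores close to zero only under the IR measure and poorly under the coherent one.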
IV.5.2 Merging
Prior to the actual merging process, module scores are filtered. To this end, a threshold τscore is
applied, which helps to distinguish the dense from the less dense modules and thus greatly reduces
the size of the result set. The filtered result set is then sorted according to the conditions the modules
comprise. In the merging step, all modules which share a common condition set $C_m$ are considered
for merging. In order to compare the modules, for each of them the module's average trajectory,
the centroid $\zeta$, is computed:
\[ \zeta(e_{C_mG_mT}) = \left\langle e_{C_mgT} \right\rangle_{g \in G_m} \tag{IV.15} \]
Now, a k-means clustering is performed on all modules which share a common condition set. The
clustering operates on the pairwise Pearson distances of the centroids, so that similar centroids are
clustered. Thus, the p modules which are dense enough to pass the filter are merged into k modules.
Usually, k < p, as the number of overlapping modules is reduced and all duplicates are removed.
The parameter k is set to the number of principal components which explain 95 % of the variation
in the centroid distance matrix (see Section III.4.2.2).
IV.5.3 Adding genes
As another post-processing step one might wish to extend a module as far as possible. Due to the
randomised approach the EDISA takes, it cannot guarantee that a module contains the maximum
number of genes given the threshold τG: a gene which did not make it into the initial gene sample
can never be contained in the module generated from that sample. Merging similar modules, as
indicated in the previous section, helps resolving this, but still does not guarantee gene maximality.
For selected modules with interesting features it is thus advisable to check for maximality by adding
all genes, one at a time, and checking whether the module score Sm improves or is only slightly
worse, i.e. Δ(Sm) ≤ τext, where the threshold τext determines the maximum allowed deviation.
The adding procedure should, however, only be applied to modules which change substantially over
time, as trying to add genes to a nearly stationary module will blow up the module with background
noise.
On the AtGenExpress dataset the adding procedure proves to be a simple but nonetheless valuable
tool, as it allows for extending dense module cores found by the EDISA. It thus becomes possible
to achieve gene maximality for a module while, at the same time, the degree of fuzziness remains
controllable. An example for the application of the adding procedure is presented in Section VII.1.1.
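The adding procedure can be sketched as follows. This is a minimal sketch under assumptions: each candidate gene is tested against the original module rather than cumulatively, the IR-style score (average distance to the per-condition average) is used, a lower score is taken to be better, and the names `module_score` and `extend_module` as well as the toy data are hypothetical.

```python
import math

def pearson_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Assumption: flat trajectories count as uncorrelated (distance 1).
    return 1.0 if vx == 0 or vy == 0 else 1.0 - cov / math.sqrt(vx * vy)

def module_score(data, genes, conditions):
    """Average distance of every trajectory to the average of its condition."""
    distances = []
    for c in conditions:
        column = [data[c][g] for g in genes]
        column_mean = [sum(v) / len(v) for v in zip(*column)]
        distances += [pearson_distance(t, column_mean) for t in column]
    return sum(distances) / len(distances)

def extend_module(data, genes, conditions, candidates, tau_ext):
    """Accept each candidate whose addition worsens the score by at most tau_ext."""
    base = module_score(data, genes, conditions)
    accepted = []
    for g in candidates:
        if g in genes:
            continue
        delta = module_score(data, genes + [g], conditions) - base
        if delta <= tau_ext:
            accepted.append(g)
    return genes + accepted

# Hypothetical data: one condition, a dense core (G1, G2) and two candidates.
data = {"A": {"G1": [0, 1, 2, 3], "G2": [1, 3, 5, 7],
              "G3": [2, 3, 4, 5], "G4": [3, 2, 1, 0]}}
extended = extend_module(data, ["G1", "G2"], ["A"], ["G1", "G2", "G3", "G4"], tau_ext=0.05)
print(extended)
```

Here G3 follows the pattern of the core and is added, while G4 runs in the opposite direction, would raise the score well beyond the allowed deviation, and is rejected.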
IV.6 In silico dataset
The actual partitioning in the two-step clustering approach is subject to the clustering technique
employed as a substep. Both k-means and hierarchical clustering are standard techniques and need
no further validation. For detailed discussions on these methods, see e.g. the review by Jain et al.
In contrast to the first algorithm, the usage of Pearson distances in an iterative procedure on three-
dimensional data is entirely new. Thus, an in silico dataset with implanted modules was designed to
validate the EDISA and to gain experience on how to choose the parameters. A three-dimensional
matrix similar to the AtGenExpress data was constructed, containing one thousand genes measured
over 9 conditions with 6 time-points each. Background noise was computed based on a normal
distribution. As each condition corresponds to an independent experiment, the background was
computed separately for each condition. The starting point of a trajectory was drawn from a uniform
distribution, modelling different levels of expression for different genes. Therefore, no expression level
is overrepresented, which could accidentally give rise to an artefact module. The actual oscillation
of the background genes was realised with a normal distribution with variance 0.4 centred around
the starting point.
Then, two IR modules and two coherent modules were implanted into the matrix. For each module
type, a module of size 10 and a bigger one containing 20 genes were chosen. An example of a module
embedded into background noise is shown in Figure IV.9.
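A dataset of this kind might be generated as in the following sketch. Only the dimensions (1000 genes, 9 conditions, 6 time-points), the background variance of 0.4 and the module sizes are taken from the description; the range of the uniform distribution, the module shapes, the conditions each module spans, the absence of noise inside the implanted modules and all function names are assumptions made for illustration.

```python
import math
import random

def make_background(n_conditions, n_genes, n_timepoints, rng,
                    variance=0.4, level_range=(0.0, 10.0)):
    """Background noise, generated separately for each condition."""
    sigma = math.sqrt(variance)
    data = []
    for _ in range(n_conditions):
        condition = []
        for _ in range(n_genes):
            start = rng.uniform(*level_range)  # expression level of this trajectory
            condition.append([rng.gauss(start, sigma) for _ in range(n_timepoints)])
        data.append(condition)
    return data

def implant_module(data, genes, patterns, rng, level_range=(0.0, 10.0)):
    """Overwrite background trajectories; `patterns` maps condition index to a shape."""
    for c, pattern in patterns.items():
        for g in genes:
            offset = rng.uniform(*level_range)
            data[c][g] = [offset + value for value in pattern]

rng = random.Random(42)
data = make_background(n_conditions=9, n_genes=1000, n_timepoints=6, rng=rng)

# Hypothetical response shapes over six time points.
rise = [0, 1, 2, 3, 2, 1]
fall = [3, 2, 1, 0, 1, 2]
peak = [0, 3, 0, 0, 0, 0]

# Coherent modules: the same shape under every condition of the module.
implant_module(data, range(0, 10), {0: rise, 1: rise, 2: rise}, rng)
implant_module(data, range(10, 30), {3: fall, 4: fall}, rng)
# IR modules: a different shape under each condition of the module.
implant_module(data, range(30, 40), {0: rise, 1: fall, 2: peak}, rng)
implant_module(data, range(40, 60), {5: fall, 6: peak}, rng)

print(len(data), len(data[0]), len(data[0][0]))
```

Since the implanted genes differ only by an offset within a condition, their pairwise Pearson distances are zero there, which makes it easy to check whether a clustering run recovers them.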
Figure IV.9: Example module from the artificial dataset. All conditions were filled up with noise and
afterwards the modules, like this IR module over three conditions, were planted. As the background
noise was computed separately for each condition, the transitions between the conditions are visible,
which does, however, not affect the clustering: the transitions are not considered during the clustering
process in order to avoid the detection of false signals.
Chapter V
V.1 Two-step clustering
The two-step clustering approach (see IV.1.2) was implemented in Matlab. The program provides a
GUI which allows for exploring the dataset. Moreover, pruning and sub-partitioning operations on
the modules are supported. Screenshots can be found in the Appendix. Here, results are presented
along with details concerning the pre-processing steps.
V.1.1 Pre-processing and significance issues
For the evaluation of the two-step clustering approach, the algorithm was run on the AtGenExpress
dataset, i.e. on 2 × 9 different condition slides. As a pre-processing step the log2 of the expression
values was taken and a filter on the fold-change was applied. The condition slides ecG′T contained
only those genes G′ ⊆ G which show a two-fold expression change with respect to control for at least
one time-point for the condition c. In preliminary experiments, a two-fold change proved to be a
good indicator of differential expression. Without applying the fold-change filter, the data contained
too much background noise for the k-means or hierarchical clustering methods employed.
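The fold-change filter can be sketched in a few lines of Python. The function name, the dictionary layout and the example values are hypothetical, and the sketch assumes that a two-fold change counts in both directions (up- and down-regulation); on the log2 scale, a two-fold change corresponds to an absolute difference of 1.

```python
import math

def differentially_expressed(treated, control, fold=2.0):
    """Genes with at least a `fold`-fold change versus control at one or more time points."""
    threshold = math.log2(fold)
    selected = []
    for gene, values in treated.items():
        log_ratios = [math.log2(x) - math.log2(y) for x, y in zip(values, control[gene])]
        if any(abs(r) >= threshold for r in log_ratios):
            selected.append(gene)
    return selected

# Hypothetical raw expression values for one condition and three time points.
treated = {"geneA": [100, 210, 100], "geneB": [100, 150, 120], "geneC": [100, 40, 100]}
control = {"geneA": [100, 100, 100], "geneB": [100, 100, 100], "geneC": [100, 100, 100]}

print(differentially_expressed(treated, control))
```

Only geneA (up-regulated 2.1-fold) and geneC (down-regulated 2.5-fold) pass the filter; geneB never reaches a two-fold change and would be left out of the condition slide.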
Both clustering techniques result in different initial cluster sets. Often, k-means finds more
medium-size clusters, whereas the hierarchical clustering tends to produce one or two large clusters plus a
few clusters of two or three genes, which are rarely significant. However, as both methods found
different significant clusters, the algorithm was run twice, once using k-means and once using hierarchical clustering.
As indicated before (see IV.1.2.2), p-values are assigned to the clusters, given cluster size and density,
in order to assess the statistical significance of the finding. The p-values result from a probability
distribution fitted to the histogram of densities obtained by a permutation analysis.
Due to constraints regarding computation time, sufficiently dense histograms were only computed
for cluster sizes up to 14. For cluster sizes equal to or greater than 15, a discrete distribution was
used instead, giving rise to p-values 0 (significant) and 1 (insignificant). As the smaller
clusters appear much more frequently in permuted data, it is important to have reliable null
distributions at hand for these cases. For the larger cluster sizes, histograms are computationally
expensive and, moreover, these clusters have a low rate of occurrence (e.g. 1 cluster of size
29 found in 50,000 runs). Thus, it is sufficient to employ the discrete distribution for these cases,
defining a threshold density which a significant cluster must not exceed (cp. IV.1.2.2).
As no clusters of size greater than 29 were reported from the 50,000 permutation runs,
clusters larger than this size are required to meet the criterion for cluster size 29. Evidently, this is
a conservative method, which might unjustly reject big clusters. It is, however, aimed at minimizing
the number of false positives, and no blatant misjudgements were discovered during the evaluation
which would justify the additional computational effort needed to sample clusters of larger sizes.
V.1.2 Results for the two-step clustering approach
A large number of modules can potentially be obtained with the two-step clustering approach
introduced above. Depending on the parameter settings, such as the distance threshold δ for the
incomplete clustering, either relatively small and dense or larger and fuzzier clusters emerge. For
either case, meaningful clusters could be found. The algorithm was run for all possible combinations
of initial conditions, tissue types and distance thresholds, where thresholds of δ = 0.01 for the dense
and δ = 0.1 for the fuzzier clusters were employed.
Due to the pre-filtering of relevant, i.e. differentially expressed, genes by computing the fold-change,
only few significant initial clusters could be found for the conditions genotoxic and oxidative. In
these cases, too few genes fulfil the criterion (2-fold change with respect to control). Thus, without
sufficient data for a meaningful initial clustering, no modules can be discovered. Reducing the
fold-change requirement would allow more genes into the clustering, however at the cost of losing
biological relevance, as we are only interested in responses distinct from the control measurement.
Biological validation of the clustering results was performed with NetAffx™ (Liu et al., 2003). An
example of the GO results for the salt/osmotic bicluster from Figure V.5 is given in Figure V.1.
Considering several of the obtained clusters as examples, an overview of the variety of module
types that can be discovered using this method will be given here. For a biological, as well as
methodological evaluation of the results, the reader is referred to the discussion in Section VI.
V.1.2.1 Single modules
Due to the nature of the approach, which derives all modules from the initial clusters, the most
abundant module type is the single module. Examples for this condition-specific module type are
given in Figure V.2 for the UV-B condition, as well as in Figure V.4 for wounding stress.
V.1.2.2 Coherent modules
The next most prevalent module type is the coherent module. Again, examples are presented, one
for osmotic and salt stress in Figure V.5 and one comprising all stress conditions in Figure V.6.
In contrast to all the other examples in this section, which were obtained by using k-means, the
salt/osmotic module results from a hierarchical clustering in the first step. In this case, hierarchical
clustering yields a cluster which can be approximated but not reproduced by k-means.
The all-conditions coherent module with cold stress as initial condition in Figure V.6 was obtained
by extracting a significant sub-partition from a larger, insignificant cluster under heat stress. As
only the extracted subset forms a coherent module, the other genes add a considerable amount of
noise, thus rendering the full cluster insignificant.
Numerous other coherent modules were discovered, a selection of which is presented in Table V.1.
V.1.2.3 IR modules
Moreover, independent response (IR) modules can be obtained using the two-step clustering
approach. The IR module in Figure V.7 spans most of the conditions. Another example can be
found in Figure V.8. This three-condition IR module contains a coherent module for the conditions
drought and wound.
Figure V.1: A subgraph of the GO annotation graph (biological process) for the module shown in
Figure V.5. The purple and red coloured boxes highlight small p-values, i.e. they correspond to
highly significant findings. Starting from the root node "biological process", the nodes "response
to water" and "abscisic acid stimulus" can be reached by traversing the annotation graph. These
nodes confirm the module under the conditions salt and osmotic. Note that the p-values are those
returned by NetAffx™ and refer only to the GO classification.
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.2: Module obtained by starting with UV-B (shoot tissue) as initial condition. Obviously,
the genes participating in the initial cluster do not cluster together for the other conditions. This is
an example of a condition-specific single module, UV-B being the highlighted condition. However,
when data for the root is included, another significant response becomes visible (see Figure V.3).
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.3: The same module as depicted in Figure V.2 is now drawn for the root tissue. Surprisingly,
the genes which show a significant coherent response under the UV-B stress in the shoot tissue,
cluster together under salt stress in the root tissue.
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.4: Single module obtained by starting with wound (shoot tissue) as initial condition. This
module is specific for wounding stress. The same module shows no particular response in the root.
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.5: Coherent module generated with salt (shoot tissue) as initial condition, using hierarchical
clustering. The shaded conditions are not included into the module, whereas all conditions with a
p-value below 5 % are. Hence, this module only contains the conditions salt and osmotic.
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.6: Module generated with cold (shoot tissue) as initial condition. All conditions are
included into the cluster, as they have a p-value below 5 %. With the exception of cold stress, the
module is coherent over all conditions.
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.7: Module generated with heat (root tissue) as initial condition. The shaded conditions
are not included into the module, whereas all conditions with a p-value below 5 % are. Except for
UV-B and cold, all conditions are part of this IR module. The osmotic and salt slides are coherent.
[Plot: one panel per condition (recovered panel titles: cold, osmotic, salt, drought, genotoxic, oxidative); y-axis: gene expression; x-axis: time points 0.5 h, 1 h, 3 h, 6 h, 12 h, 24 h]
Figure V.8: Starting from a cluster originally obtained for the wound condition in the shoot tissue,
an IR module can be found, which also comprises the conditions drought and cold. Again, condition
slides with a p-value < 5 % are drawn in solid black lines, whereas the others are shaded.
c_initial tissue δ single IR coherent
cold
shoot 0.01 4 - -
shoot 0.1 3 (c u) -
root 0.01 4 - -
root 0.1 3 - -
osmotic
shoot 0.01 1 - 2x (os s) (os s u)
shoot 0.1 4 - -
root 0.01 - (os s d ox h) (c os s ox w) 2x (os s)
root 0.1 1 (os s d ox u w h) (c os s) (os s)
salt
shoot 0.01 2 (c os s) -
shoot 0.1 1 - 2x (os s)
root 0.01 2 (s d u) (c os s ox h) -
root 0.1 4 - -
drought
shoot 0.01 1 - -
shoot 0.1 2 (c os s d u) -
root 0.01 - - -
root 0.1 4 (d g) -
genotoxic
shoot 0.01 - - -
shoot 0.1 - - -
root 0.01 - - -
root 0.1 2 - -
oxidative
shoot 0.01 3 - -
shoot 0.1 3 - -
root 0.01 - - -
root 0.1 - - -
UV-B
shoot 0.01 3 (ox u) (d u) -
shoot 0.1 1 2x (d u w) (d ox u w) (d go ox u) (ox u) -
root 0.01 - - -
root 0.1 - - -
wound
shoot 0.01 5 - -
shoot 0.1 3 (d w) -
root 0.01 - - -
root 0.1 - (os s d g ox w h) -
heat
shoot 0.01 3 - -
shoot 0.1 2 (os h) (os s ox h) -
root 0.01 4 (os s d g ox w h) -
root 0.1 2 (os s d g ox w h) (s h) -
Table V.1: An exhaustive search of the dataset yields the results listed above. All combinations
of initial conditions and tissues were considered, as well as two values for the distance threshold δ
for each of these combinations. Thus, an overview of the denser (δ = 0.01) and fuzzier (δ = 0.1)
modules is given. The table contains the number of single, IR and coherent modules obtainable for
the respective initial condition. Abbreviations for the conditions are as follows: cold (c), osmotic
(os), salt (s), drought (d), genotoxic (g), oxidative (ox), UV-B (u), wound (w), heat (h). The
two-step clustering was employed with k-means as internal clustering method and default parameter
values, namely a fold change of 2 to pre-select only genes differentially expressed for the respective
condition and α = 0.05. No pruning steps have been applied, although pruning allows for extending
some of the modules to cover more conditions. Only significant modules were included in the
table. In several cases, the modules found for δ = 0.1 are fuzzier versions of the ones found for
δ = 0.01. For example, the single modules found for osmotic/shoot/0.1 correspond to the (os s)
modules obtained with δ = 0.01, but are not dense enough to be deemed significant. In the case of
osmotic stress, a great number of genes meet the fold change criterion and therefore a lower value
for δ is better suited for obtaining dense modules.
The EDISA was applied to the AtGenExpress dataset using the nearest neighbours (NN) and guide
tree (GT) methods for choosing the seeds. As GT detected only a subset of the modules which NN
could find, the latter method was selected for application on the biological dataset. Thus employing
the NN method, the AtGenExpress dataset was scanned at different resolutions from τG = 0.01
to τG = 0.2. For background information on the seed-choosing methods and a motivation of the
parameter settings see Sections IV.3 and VI.2.1.
Many of the modules found by the EDISA exist in different versions, ranging from a rather small
but dense core of nearly perfectly correlating genes to a large but rather fuzzy module. Merging all
similar modules (see Section IV.5.2) is a good strategy for identifying the principal components in
the dataset, but sometimes results in large and rather diffuse modules. A good way of obtaining
a gene-maximal module with a clear pattern is to take a small and dense module found by a
high-resolution scan and to extend it using the adding procedure (see IV.5.3). It is thus possible to
control the module’s degree of fuzziness.
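The extension step described above can be illustrated with a minimal sketch. It is not the EDISA implementation: the score used here (mean pairwise Pearson distance, where lower means denser), the function names and the toy profiles are all assumptions made for illustration. Each candidate gene is tested against the dense core and kept only if it worsens the core's score by at most a tolerance, which is the knob that controls the fuzziness of the resulting module.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # assumption: a flat profile counts as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)


def module_score(profiles):
    """Assumed score: mean pairwise Pearson distance (needs >= 2 profiles)."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson_distance(a, b) for a, b in pairs) / len(pairs)


def extend_module(core, candidates, tolerance=0.01):
    """Return the core plus every candidate that worsens its score by <= tolerance."""
    core_profiles = list(core.values())
    base = module_score(core_profiles)
    extended = dict(core)
    for gene, profile in candidates.items():
        # Each candidate is judged against the original core, not cumulatively.
        if module_score(core_profiles + [profile]) - base <= tolerance:
            extended[gene] = profile
    return extended


# Toy profiles, made up for illustration only.
core = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8]}
candidates = {"g3": [1.1, 2.0, 3.2, 3.9], "g4": [4, 3, 2, 1]}
extended = extend_module(core, candidates)
print(sorted(extended))
```

In this toy example the well-correlated gene g3 is added, while the anti-correlated gene g4 is rejected. Raising the tolerance admits more genes at the cost of a more diffuse module.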
Proceeding as described above, numerous modules at different resolutions could be mined from the
AtGenExpress dataset. As the two-step clustering approach was able to discover the most important
patterns in the data, many of the modules are already known. However, through performing the scans
at different resolutions, several modules could be reduced to a module core with better correlation
or, alternatively, extended by numerous genes which fit into the pattern but were not considered by
the two-step clustering.
As mentioned above, the merging and extending procedures of the EDISA allow for easily identifying
the principal components of the dataset:
Heat shock related: The best correlation based on Pearson distances is attained by a module
consisting of heat shock related proteins. A condition profile for the module which has, in a
smaller version, already been identified by the two-step clustering (Figure V.7) is plotted in
Figure V.9. The pattern for the first three conditions with a sharp lower peak under cold stress
and two more pronounced peaks under osmotic and salt stress is a very common observation
in the dataset and shared by a lot of other modules. The especially strong signal under heat
stress, however, is specific for the genes in this module.
Circadian rhythm: A cycling pattern following a 24-hour rhythm is the most obvious coherent
pattern in the dataset. A relatively clear pattern is shown in Figure V.10. By merging with
similar findings, much larger and more diffuse versions of this module can be obtained. The
module overlaps with the one shown in Figure V.6. The relation to the circadian rhythm is
discussed in Section VII.1.1, where a small module of clock genes also found by the EDISA is
Cold/Osmotic/Salt: The most common strong signals can be found under the first three
conditions. Even specific modules, like the heat shock module, additionally contain a
cold/osmotic/salt component. Frequently, however, signals occur only under these three
conditions. A variety of patterns can be identified, including early and late responses, as well
as responses for all three conditions or just two out of three. Usually, responses for osmotic and
salt stress are in unison. In one case, however, genes react to salt, but not to osmotic stress
(module f in Figure V.13). Moreover, the latter module responds to UV-B stress in the shoot,
whereas the other cold/salt/osmotic modules usually are specific for the tissue they were found
in. This hints at the presence of at least two different regulation mechanisms. An overview
of the cold/osmotic/salt modules identified by the EDISA is given in Figure V.13, while the
unusual response to salt in the root and UV-B in the shoot tissue is depicted in Figures V.12
and V.11.
UV-B related: Further very pronounced responses can be found under exposure to UV-B stress.
Several single modules are identical to the ones detected by the two-step clustering. One of
the most striking patterns in this respect is a module which reacts to UV-B stress in the shoot
tissue and to salt in the root. It is nearly identical to the salt stress module f) from Figure
V.13. This overlap shows that the same response mechanism can be active under different
conditions. Further information is given in the discussion (Section VI).
Figure V.9: This profile shows an extended version of the IR module from Figure V.7. The EDISA
detects multiple variants of this module. The clearest could be obtained by extending a dense module
consisting of 8 genes. Using the adding procedure, all genes which decreased the score by at most
0.01 were included in the module, resulting in 31 genes which exhibit a similar behaviour over most
Figure V.10: The largest coherent module in the dataset appears to follow a circadian rhythm. The
profile oscillates over all conditions, each of which spans 24 hours. There is apparently no response
specific to a particular stress condition, but rather a stable circadian program, which remains largely
unaffected by external influences.
Figure V.11: Condition profile for the genes from a module identified by the EDISA for conditions
cold and salt in the root tissue (see also Figure V.13, f). Indeed, the most obvious expression
changes in the root tissue take place under cold and salt stress. Also, a downward slope for drought
stress is visible. The response of the same genes in the shoot tissue is depicted in Figure V.12.
Figure V.12: Condition profile in the shoot tissue for the module from Figure V.11. As the module
originally was found in the root tissue, it appears a bit fuzzier. Still, the main components are
clearly visible. In contrast to the response in the root tissue, here the most obvious expression
change occurs for UV-B stress.
Figure V.13: The EDISA identified numerous modules comprising two or three conditions out of
the set {cold, osmotic, salt}. As these signals belong to the most common in the whole dataset, an
overview of the responses for these three conditions is given here. The first four rows (a-d) depict
modules found in the shoot tissue. Except for module d), they show no particular response to
the three stresses in control or in the root tissue. The genes from d) form a module in both tissues,
but with different trajectories: the response in the root tissue sets on earlier (see Figure B.5 in the
appendix). Sometimes, as for example in row b), a condition which is not part of the module has
been plotted just for comparison. The two bottommost modules (e-f) were found in the root tissue.
While e) shows no particular response in the shoot, f) responds mainly to UV-B in the shoot tissue.
See also Figures V.12 and V.11 for the last one. Generally, the responses to osmotic and salt stress
are very similar in shape, except for the module shown in row f). The marked rows (b, d and e)
appear to be exclusively co-regulated just for osmotic and salt stress, whereas the others also involve
a response under cold stress.
Chapter VI
VI.1 Results of the two-step clustering approach
VI.1.1 Single modules
Condition-specific single modules are abundant in the results returned by the two-step clustering
approach. An example is given in Figure V.2, showing a module specific just for exposure
to UV-B radiation. This module is enriched with GO annotations like "ripening" (p-value:
7.56 · 10^-13), "respiratory gaseous exchange" (p-value: 2.84 · 10^-36) or "ethylene biosynthesis"
(p-value: 3.17 · 10^-12), ethylene being involved in biotic and abiotic stress signalling responses.
Moreover, it stimulates the ripening of fruit and the opening of flowers.
Apparently, under this condition, the plant responds to a light stimulus. However, no specific
reaction to UV-B is visible. This may be due to insufficient GO data, as only about half of the
genes contained in the module are annotated. On the other hand, this module might constitute
a general light stimulus response, UV-B being a natural part of the sunlight spectrum. Thus, we
need not necessarily encounter a stress response containing for example DNA repair mechanisms.
Interestingly, when viewing the genes from this module in the root tissue, a specific response to salt
stress is visible (see Figure V.3). A similar observation was made using the EDISA and is discussed
in Section VI.2.2.
Another single module is the one presented in Figure V.4, which is specific for wounding stress.
Significant GO mappings for this module are "response to wounding" (p-value: 2.5114 · 10^-5),
"antibiotic synthesis" (p-value: 1.47605 · 10^-3), "response to abscisic acid" (p-value: 1.33876 · 10^-4),
"defense response" (p-value: 5.04438 · 10^-2) and "actin cytoskeleton organization and biosynthesis"
(p-value: 1.77632 · 10^-3). All of these GO terms can easily be related to desiccation as caused
by wounding the plant (abscisic acid, see also VI.1.2) or defense mechanisms with immunological
function. Even genes related to "gibberellic acid mediated signalling synthesis" (p-value:
1.28917 · 10^-5) can be related to the wounding stress as this plant hormone can, among other functions,
promote tissue growth. In contrast to the UV-B module, the response described here is specific to
wounding stress in the shoot tissue, as no particular response can be seen in the root.
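The text does not state here how the GO p-values were computed. A common choice for this kind of over-representation analysis is a one-sided hypergeometric test, and the following sketch shows that calculation under this assumption. The function name, its parameters and the toy numbers are hypothetical and are not taken from the thesis.

```python
from math import comb


def enrichment_p_value(population, annotated, module_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, module_size).

    population  -- number of genes in the background set
    annotated   -- background genes carrying the GO term
    module_size -- number of genes in the module
    hits        -- module genes carrying the GO term
    """
    total = comb(population, module_size)
    upper = min(annotated, module_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, module_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 background genes, 4 annotated, module of 3 with 2 hits.
p = enrichment_p_value(population=10, annotated=4, module_size=3, hits=2)
print(p)
```

A small p-value only says that the term is over-represented relative to the chosen background. It does not account for unannotated genes in the module or for testing many GO terms at once.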
VI.1.2 Coherent modules
In Figure V.5, a coherent module for the conditions salt and osmotic is presented. Obviously, there
is a close relationship between the two conditions involved. The most significant GO mappings are
"response to water" (p-value: 3.25 · 10^-14), "chitin catabolism" (p-value: 2.63 · 10^-16) and "response
to abscisic acid stimulus" (p-value: 6.94 · 10^-30), abscisic acid being responsible for closing the plant's
stomata to prevent loss of water. Apparently, the GO mapping confirms the clustering result.
For the other example in Figure V.6, cold stress applied to the root tissue was used as an initial
condition. As mentioned in the results section (V.1.2.2), this module is a sub-partition of a larger
cluster, which does not show especially stress specific GO annotations in the first instance. The
result of the sub-clustering step, however, is a significant module for all conditions and can be
related to general stress responses, such as "response to cold" (p-value: 3.84 · 10^-25), "hypersensitivity
response" (p-value: 1.56 · 10^-15) and "cellular response to water deprivation" (p-value: 5.05 · 10^-106).
Although stress-specific annotations could be found, the 24-hour rhythm and a slight overlap with
the module from Figure V.10 hint at a circadian rhythm also behind this trajectory pattern.
VI.1.3 IR modules
The module shown in Figure V.7 is a small but especially dense IR module originating from the
heat condition slide. Out of the 6 genes contained in the module, 5 map to the GO term "response
to heat" (p-value: 9.83 · 10^-233). Furthermore, all genes map to the GO term "response to unfolded
protein" (p-value: 3.69 · 10^-201), as well as to "protein folding" (p-value: 1.67 · 10^-55). Such an
individual response module with low density clusters under all included conditions, which, however,
have different time trajectories with respect to each other, is a strong indicator of possible co-
Finally, another IR module is depicted in Figure V.8. Comprising the conditions cold, wound
and drought, it is confirmed by a number of GO annotations: "response to wounding" (p-value:
3.0803 · 10^-10), "growth" (p-value: 4.61915 · 10^-16) and "response to pathogen" (p-value:
1.08789 · 10^-6) correspond to wounding stress, while "response to cold" (p-value: 5.23054 · 10^-16)
obviously confirms the inclusion of the cold slide. The drought stress response is reflected by GO
annotations "osmotic stress" (p-value: 3.545 · 10^-11) and "response to water deprivation" (p-value:
4.336 · 10^-9). Furthermore, plant hormones are present, as indicated by "jasmonic acid biosynthesis"
(p-value: 1.6173 · 10^-45) and "gibberellic acid catabolism" (p-value: 2.64809 · 10^-31).
Interestingly, the highly significant annotation "regulation of timing of transition from vegetative to
reproductive phase" (p-value: 1.00654 · 10^-120) is assigned to two genes in this module.
VI.1.4 Overview of the stress responses
Table V.1 gives an overview of the abundance of modules in the dataset. Single modules can be
obtained for all conditions, while coherent or IR modules are much more frequent under salt/osmotic,
heat or UV-B stress. For osmotic stress, numerous dense initial clusters are discovered with δ= 0.01,
which can be extended to modules. Also under the conditions salt, heat and UV-B, several significant
initial clusters and, subsequently, modules can be found which meet the distance threshold criterion
of δ= 0.01. For the drought and wound stress conditions, modules are only detected after relaxing
the criterion to δ= 0.1. The modules found under these conditions are thus more fuzzy, which is
reflected by larger standard deviations. Regarding the conditions genotoxic and oxidative as initial
condition slides, no modules are discovered. Only a few genes pass the filter on the fold change for
these conditions.
Generally, the more genes were allowed into the clustering as being differentially expressed, the more
clusters and modules could be found. Usually, compensation for a small number of genes can be
achieved by relaxing the distance threshold filter, as done for the conditions drought and wound.
The most prevalent coherent module type is a two-condition module comprising salt and osmotic
stress. Naturally, these conditions and the responses to them are similar. Several different trajectory
shapes for responses to salt and osmotic stress occur. As the EDISA was able to identify more
patterns, a summary of the different trajectory shapes can be found in the EDISA results section
(Figure V.13).
A small module initially obtained from the heat slide spans almost all conditions. This module
consists of heat shock proteins, which, besides being active during the heat shock response, are also
common responders to general stress stimuli. Another common case is the combination of drought
with wound or salt and sometimes UV-B. As wounding the plant leads to a loss of fluids, the close
relationship of the drought and salt stress responses is comprehensible.
However, the results from the GO mapping should be interpreted with care. Often, a large number
of genes in a module are unannotated and, moreover, it is not always clear how reliable the GO data
really is. Still, the combination of GO data and response patterns as they were identified by the
clustering approaches is a valid and helpful way of explaining and structuring the Arabidopsis
stress response.
VI.2 Results for the EDISA
VI.2.1 Choosing seeds and parameters
The EDISA was applied to the in silico dataset with different parameter settings using random
seeds (RS) and the two pre-processing steps described in Section IV.3, the guide tree (GT) and the
nearest neighbour method (NN). With appropriately set threshold values, both GT and NN were
able to discover all four modules.
Although preliminary experiments on the AtGenExpress dataset had produced some results using
random seeds, none of the artificial modules could be found as a whole by just randomly sampling
the seeds. In contrast to the biological data, which has already passed a fold change filter and thus is
enriched with numerous signals, the artificial dataset consists mostly of noise with just the relatively
small implanted modules as signals.
A known problem of the ISA, as well as of the EDISA, is the tendency to find the strongest signal
very frequently while weaker signals are often overlooked (cp. Section IV.3). Apparently, the RS
method is attracted to modules consisting of background noise, which happen to have a low Pearson
distance, such as the one shown in Figure VI.3. These spurious signals are almost absent in fold
change filtered biological data. The in silico dataset, on the other hand, demands robustness against
noise and the ability to detect small signals, which, according to the test runs, can only be achieved
with the pre-processing methods.
While both GT and NN were able to detect all four modules, each still had a predilection for two
of the modules, which were found far more often than the others.
As expected from previous runs on biological data, all modules can only be detected by scanning
over a range of parameter settings. Bergmann et al. (2003) also chose the multiple run strategy.
For a given combination of thresholds, the EDISA usually finds one or two of the modules without
noise (see Figure VI.1). If the thresholds are not set appropriately, modules might not be detected
or they are not correctly reduced to the signal pattern (see Figure VI.2).
In summary, with both pre-processing methods the EDISA is able to detect small signals against
a noisy background. False positives do occur, which are, however, not a fault of the method, as
they definitely contain patterns with low density based on Pearson distances. This is a problem
every clustering method encounters and it is usually accounted for by filtering out the genes which
are not differentially expressed. On artificial data, both seed choosing methods performed equally
well. However, when applied to the AtGenExpress dataset, NN turned out to be superior to GT.
Apparently, hierarchical structures are well suited to the in silico dataset with its rather uniform
noise distribution and perfectly homogeneous modules, whereas the NN method is of more general
nature and able to cope with fuzzier modules.
Like the ISA, the EDISA is sensitive to parameter changes. Choosing the right parameters has
therefore been discussed in Section IV.4. For the AtGenExpress dataset, appropriate threshold
values are τC= 0.3 for coherent modules and τC= 0.5 when searching for IR modules. The modules
were scanned at resolutions between τG= 0.01 and τG= 0.2. Prior to the clustering with the
EDISA, a two-fold filter on the fold change was applied in order to reduce the noise level and to
ensure comparability with the results of the two-step clustering. Typically, 20000 iterations are
sufficient to cover a dataset of this size (ca. 1000 genes for the filtered data), which takes
only a couple of minutes to compute.
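The two-fold filter applied before clustering can be sketched as follows. The exact definition used in this work is not restated here, so the sketch rests on assumptions: intensities are on a linear scale, each treatment value is compared with a time-matched control, and a gene passes if at least one time point changes by the given factor in either direction. All names and values are hypothetical.

```python
def passes_fold_change(treatment, control, min_fold=2.0):
    """True if any time point differs from its control by at least min_fold."""
    for t, c in zip(treatment, control):
        if t <= 0 or c <= 0:
            continue  # assumption: skip non-positive intensities
        ratio = t / c
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            return True
    return False


def filter_genes(expression, min_fold=2.0):
    """Keep genes whose (treatment, control) series pass the fold change filter."""
    return {
        gene: series
        for gene, series in expression.items()
        if passes_fold_change(series[0], series[1], min_fold)
    }


# Toy data, made up for illustration: gene -> (treatment series, control series).
expression = {
    "up": ([10, 25, 30], [10, 10, 10]),
    "down": ([10, 4, 9], [10, 10, 10]),
    "flat": ([10, 12, 9], [10, 10, 10]),
}
kept = filter_genes(expression)
print(sorted(kept))
```

Here the genes "up" and "down" pass the filter and "flat" is discarded. Under these assumptions, a stricter filter lowers the noise level but also leaves fewer genes from which modules can be formed.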
Figure VI.1: Some of the artificial modules, like this IR module found using seeds provided by a
guide tree, are perfectly reconstructed. No genes are missing and all other conditions were spliced
Figure VI.2: The quality of the results depends crucially on the threshold settings. In this example,
both τG and τC are set to suboptimal values, which results in some noise genes and an extra condition
to be included into the module (from the in silico dataset).
Figure VI.3: In some cases, when working on the in silico dataset, the EDISA returns a module
consisting of background noise. This is a property