Cluster analysis for gene expression data: a survey

Dept. of Comput. Sci. & Eng., State Univ. of New York, USA
IEEE Transactions on Knowledge and Data Engineering (Impact Factor: 1.89). 12/2004; 16(11):1370-1386. DOI: 10.1109/TKDE.2004.68
Source: IEEE Xplore

ABSTRACT DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms aimed specifically at gene expression data have recently been proposed. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.
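
The partitioning goal stated in the abstract (points within a group more similar to each other than to points in other groups) can be illustrated with a minimal sketch. It uses plain k-means, chosen here only as one familiar conventional algorithm and not as the survey's own method. The toy expression values, the function names, and the deterministic "first k profiles" initialization are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(profiles, k, iterations=20):
    """Tiny k-means sketch: returns one cluster label per profile.

    Assumption: the first k profiles serve as initial centroids so the
    example is deterministic; practical implementations usually use
    randomized or smarter initialization.
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [2.1, 2.0, 0.1, 0.2],  # gene A: high in conditions 1-2
    [0.1, 0.3, 1.9, 2.2],  # gene B: high in conditions 3-4
    [1.9, 2.2, 0.0, 0.1],  # gene C: resembles gene A
    [0.2, 0.1, 2.1, 2.0],  # gene D: resembles gene B
]

labels = kmeans(expression, k=2)
print(labels)
```

Clustering the rows groups genes with similar profiles; transposing the matrix and clustering its columns would group samples instead.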

  • ABSTRACT: Clustering is a technique whose significance and applications span many fields. Because clustering is an unsupervised process in data mining, properly evaluating the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as a cluster validity measure. Different types of indices are used to solve different types of problems, and the selection of indices depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on real-time air pollution, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performance of these clustering algorithms according to the validity assessment. It also identifies which of these algorithms is most desirable for producing compact clusters on these particular real-life datasets. It deals with the behaviour of these clustering algorithms with respect to the validation indices and presents the evaluation results in mathematical and graphical form.
    International Scholarly Research Notices. 10/2014; Volume 2014 (2014).
  • ABSTRACT: This survey first introduces how gene expression data are produced and represented, and then discusses the state-of-the-art clustering algorithms applied to gene expression data. According to the goals of clustering, clustering algorithms are divided into three categories: gene-based clustering, sample-based clustering, and biclustering. Basic biological principles and challenges for each category are presented. For each category, the basic principle is discussed in detail, along with its advantages and drawbacks. This paper concludes with a summary of the field and a discussion of future trends.
    ACTA AUTOMATICA SINICA 01/2008; 34(2).
