Cluster Analysis for Gene Expression Data: A Survey

Dept. of Comput. Sci. & Eng., State Univ. of New York, USA
IEEE Transactions on Knowledge and Data Engineering (Impact Factor: 2.07). 12/2004; 16(11):1370-1386. DOI: 10.1109/TKDE.2004.68
Source: IEEE Xplore


DNA microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have recently been proposed specifically for gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.
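As a rough illustration of the partitioning idea described in the abstract (points within a group are more similar to each other than to points in other groups), the following is a minimal sketch of gene-based clustering with Lloyd's k-means algorithm. It is not taken from the survey: the toy expression values, the gene names, the per-gene standardization step, and the fixed choice of starting centroids are all assumptions made for the example.

```python
import math

def standardize(profile):
    """Shift a profile to mean 0 and scale it to unit standard deviation."""
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    return [(x - mean) / sd if sd > 0 else 0.0 for x in profile]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, init_indices, max_iter=100):
    """Lloyd's algorithm; init_indices selects the starting centroids."""
    centroids = [list(profiles[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy data: 6 genes measured across 4 samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [10.0, 20.0, 30.0, 40.0],
    "geneC": [2.0, 3.1, 3.9, 5.2],
    "geneD": [4.0, 3.0, 2.0, 1.0],
    "geneE": [40.0, 31.0, 19.0, 10.0],
    "geneF": [5.0, 4.2, 2.9, 2.0],
}
names = list(expression)
profiles = [standardize(expression[g]) for g in names]
labels = kmeans(profiles, init_indices=[0, 3])

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

Standardizing each gene first means that genes are grouped by the shape of their profile rather than by absolute magnitude, so in this toy data geneA and geneB are treated as identical. Whether that is appropriate depends on the biological question.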

    • "Clustering algorithms have been studied extensively in the last three decades, with many traditional clustering techniques successfully applied or adapted to gene expression data, which led to the discovery of biologically relevant groups of genes or samples [6]. Traditional clustering algorithms usually process data on the full feature space while emerging attention has been paid to subspace clustering."
    ABSTRACT: With modern technologies such as microarrays, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which have been mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure Distance Based on Conditional Ordered List (DCOL), which we introduced previously, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed the traditional linear K-means algorithm but also performed significantly better than our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profiles clustering generated biologically meaningful results.
    09/2015; 2015(1):918954. DOI:10.1155/2015/918954
    • "However, these distance functions are not appropriate to measure the object correlation in the gene matrix [1]. Moreover, only a small subset of genes participate in any cellular process of interest, and a cellular process occurs only in a subset of the samples, requiring biclustering or the subspace clustering to capture clusters formed by a subset of genes across a subset of samples [2]. Table 1 shows an example of the original 5 × 6 data matrix and the corresponding graph is shown in Figure 1(a)."
    ABSTRACT: Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, because the problem is NP-complete, most existing methods are heuristic algorithms that cannot reveal all OPSMs. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to find all common subsequences (ACS) between every two row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure was used to store and traverse the ACS, and the Apriori principle was employed to mine frequent sequential patterns efficiently. Finally, experiments were conducted on gene and synthetic datasets. Results demonstrated the effectiveness and efficiency of this method.
    Computational and Mathematical Methods in Medicine 07/2015; 2015:1-11. DOI:10.1155/2015/680434 · 0.77 Impact Factor
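To make the order-preserving submatrix idea in the abstract above more concrete, here is a minimal sketch of the underlying definition only, not of the exact mining method that paper proposes. A column pattern is supported by a row when the row's values strictly increase along the pattern, which is equivalent to the pattern being a subsequence of the row's columns sorted by value. The matrix values and the helper names (`column_order`, `is_subsequence`, `supporting_rows`) are hypothetical, and the sketch assumes no tied values within a row.

```python
def column_order(row):
    """Column indices of a row, sorted by increasing expression value."""
    return sorted(range(len(row)), key=lambda j: row[j])

def is_subsequence(pattern, sequence):
    """True if pattern appears in sequence in order (not necessarily adjacent)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items must be found after earlier ones.
    return all(item in remaining for item in pattern)

def supporting_rows(matrix, pattern):
    """Rows whose values strictly increase along the given column pattern."""
    return [
        i for i, row in enumerate(matrix)
        if is_subsequence(pattern, column_order(row))
    ]

# Hypothetical 4 x 4 expression matrix (rows: genes, columns: samples).
matrix = [
    [1, 5, 3, 9],
    [2, 8, 4, 7],
    [9, 1, 6, 3],
    [0, 6, 2, 8],
]

short_pattern = [0, 2, 1]    # value in column 0 < column 2 < column 1
long_pattern = [0, 2, 1, 3]  # the same order, extended by column 3

print(supporting_rows(matrix, short_pattern))
print(supporting_rows(matrix, long_pattern))
```

The rows returned for a pattern, together with the pattern's columns, form one order-preserving submatrix. Extending a pattern can only keep or shrink its set of supporting rows, which is the property that Apriori-style pruning relies on.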
    • "Clustering and classification are two tasks that help to reveal natural structures and patterns in complex gene expression data. Clustering groups genes into similar categories (gene clustering), or similar samples, for example in identifying cancer subtypes (sample clustering) or simultaneously groups both genes and samples (co-clustering or biclustering) [1]."
    ABSTRACT: Gene expression data generated from microarray experiments are characterized by a large number of genes or dimensions. Selecting informative genes for clustering to discover useful phenotypes is a major issue, as no class information is available. In this paper, we propose a wrapper-based feature selection approach to perform sample-based clustering on gene expression data. The proposed work uses Particle Swarm Optimization (PSO) for best subset generation and k-means as the wrapper algorithm for evaluating the subsets. Experimental results show that the features selected by this method are able to produce clusters of good quality. Clustering accuracies of 70-80% were obtained for different datasets.