Cluster analysis for gene expression data: a survey

Dept. of Comput. Sci. & Eng., State Univ. of New York, USA
IEEE Transactions on Knowledge and Data Engineering (Impact Factor: 1.82). 12/2004; 16(11):1370-1386. DOI: 10.1109/TKDE.2004.68
Source: IEEE Xplore

ABSTRACT: DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms aimed specifically at gene expression data have recently been proposed. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest promising trends in this field.
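
As a rough illustration of the partitioning idea described in the abstract, the sketch below groups a few gene expression profiles so that profiles within a group are closer to each other than to profiles in the other group. It uses plain k-means with Euclidean distance only as a familiar example; the gene names, expression values, and the choice of initial centers are hypothetical and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centers, iterations=20):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy data: each gene has expression levels measured in 3 samples.
genes = {
    "geneA": [1.0, 1.2, 0.9],
    "geneB": [1.1, 1.0, 1.0],
    "geneC": [5.0, 5.2, 4.9],
    "geneD": [5.1, 4.8, 5.0],
}
names = list(genes)
profiles = [genes[n] for n in names]

# Assumption: two clusters, seeded with the profiles of geneA and geneC.
labels, centers = kmeans(profiles, centers=[profiles[0], profiles[2]])
clusters = dict(zip(names, labels))
print(clusters)
```

With this toy input, the two low-expression genes end up in one group and the two high-expression genes in the other. Real expression data would typically need normalization and a considered choice of distance measure and cluster count, which this sketch does not attempt.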

  • ABSTRACT: In this paper, a new framework based on multiobjective optimization (MOO), namely FeaClusMOO, is proposed that is capable of identifying the correct partitioning as well as the most relevant set of features from a data set. A newly developed multiobjective simulated annealing based optimization technique, namely archived multiobjective simulated annealing (AMOSA), is used as the underlying optimization strategy. Here, features and cluster centers are encoded in the form of a string. Three objective functions are used: two internal cluster validity indices measuring the goodness of the obtained partitioning using Euclidean distance and point symmetry based distance, respectively, and a count of the number of features. These three objectives are optimized simultaneously using AMOSA in order to detect the appropriate subset of features, the appropriate number of clusters, and the appropriate partitioning. Points are allocated to different clusters using a point symmetry based distance. Mutation changes the feature combination as well as the set of cluster centers. Since AMOSA, like any other MOO technique, provides a set of solutions on the final Pareto front, a technique based on the concept of semi-supervised classification is developed to select a solution from the given set. The effectiveness of the proposed FeaClusMOO is shown on seven higher dimensional real-life data sets in comparison with other clustering techniques: its Euclidean distance based version, where Euclidean distance is used for cluster assignment; a genetic algorithm based automatic clustering technique (VGAPS-clustering) using point symmetry based distance with all the features; and the K-means clustering technique with all features.
    Applied Soft Computing 04/2015; 29. DOI:10.1016/j.asoc.2014.12.009 · 2.68 Impact Factor
  • Xitong Gongcheng Lilun yu Shijian/System Engineering Theory and Practice 01/2014; 34(9):2417-2431.
  • ABSTRACT: Case-based reasoning as a concept covers a wide range of technologies and techniques, including knowledge management, artificial intelligence, machine learning, and database technology. Used together, these technologies can aid in the early detection of breast cancer and help decision makers take the right decision at the right time. One of the main topics currently concerning executive managers and decision makers is measuring the similarity between objects. For better performance, most organizations need semantic similarity and similarity measures. This article presents, in mathematical form, different distance metrics used for measuring the binary similarity between quantitative data within cases. The case study uses quantitative data on breast cancer patients from the Faculty of Medicine, Cairo University. The experimental results show that the squared chord distance yields the best results, with 96.76% without normalization, and correlates more closely with human assessments than the other distance measures used in this study.
    Information Systems Design and Intelligent Applications, 01/2015: pages 449-456; Springer India, ISBN: 978-81-322-2246-0
