BIOINFORMATICS Computational cluster validation in post-genomic data analysis

School of Chemistry, University of Manchester, Faraday Building, Sackville Street, PO Box 88, Manchester M60 1QD, UK.
Bioinformatics (Impact Factor: 4.98). 09/2005; 21(15):3201-12. DOI: 10.1093/bioinformatics/bti517
Source: PubMed


The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics.
This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.
The software used in the experiments is available at
Enlarged colour plots are provided in the Supplementary Material, which is available at

Available from: Joshua Damian Knowles, Feb 25, 2014
    • "It may be based on the intrinsic properties of the data (internal validation) or when true class memberships is a priori known (external validation) using a large number of internal and external indexes. A comprehensive description and comparison analysis of an extensive list of those validation techniques may be found in recent literature (e.g., Bennett et al., 2013; Deborah et al., 2010; Halkidi et al., 2001; Handl et al., 2005; Rendón et al., 2011). "
    ABSTRACT: This study focuses on the use of space-time permutation scan statistics (STPSS) to assess both the existence and the statistical significance of clusters on aggregated datasets. The investigated case study is represented from the Portuguese Rural Fire Database (PRFD) where the fire occurrences are georeferenced to an administrative unit level. The main goals are: (i) assessing the robustness of the STPSS to correctly detect clusters on aggregated datasets; (ii) testing the existence of space-time clustering in the PRFD; and (iii) characterizing the detected clusters. A synthetic database was designed to assess the potential bias introduced by aggregation of the data on the performance of the STPSS method. Results confirmed the ability of the STPSS to correctly identify clusters, regarding their number, location, and spatio-temporal dimensions and provided recommendations about the parameters setting of the scanning window. Finally, a discussion of the identified clusters on the PRFD is presented.
    Full-text · Article · Oct 2015 · Environmental Modelling and Software
    • "Due to the rapid development of new molecular biological techniques, cluster analysis from bio-molecular data becomes increasingly important in medicine, biology and related areas [1], [2], [3], [4], [5], [68], [69], [70]. "
    ABSTRACT: Performing clustering analysis is one of the important research topics in cancer discovery using gene expression profiles, which is crucial in facilitating the successful diagnosis and treatment of cancer. While there are quite a number of research works which perform tumor clustering, few of them consider how to incorporate fuzzy theory together with an optimization process into a consensus clustering framework to improve the performance of clustering analysis. In this paper, we first propose a random double clustering based cluster ensemble framework (RDCCE) to perform tumor clustering based on gene expression data. Specifically, RDCCE generates a set of representative features using a randomly selected clustering algorithm in the ensemble, and then assigns samples to their corresponding clusters based on the grouping results. In addition, we also introduce the random double clustering based fuzzy cluster ensemble framework (RDCFCE), which is designed to improve the performance of RDCCE by integrating the newly proposed fuzzy extension model into the ensemble framework. RDCFCE adopts the normalized cut algorithm as the consensus function to summarize the fuzzy matrices generated by the fuzzy extension models, partition the consensus matrix, and obtain the final result. Finally, adaptive RDCFCE (A-RDCFCE) is proposed to optimize RDCFCE and improve the performance of RDCFCE further by adopting a self-evolutionary process (SEPP) for the parameter set. Experiments on real cancer gene expression profiles indicate that RDCFCE and A-RDCFCE work well on these data sets, and outperform most of the state-of-the-art tumor clustering algorithms.
    Full-text · Article · Sep 2015 · IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
    • "Consistent with the recommendation of Brown et al. (2011) and Calvert et al. (2014), it would be advantageous to investigate the suitability of other validation statistics to determine if any tend to perform better at identifying clusters within highly correlated and noisy AR data. The effectiveness of cluster validation techniques relies on the degree of cluster separation in the input dataset (Handl et al. 2005) and therefore reliably clustering a complex AR dataset, such as Point Cloates, would be a significant challenge. An additional, and potentially fruitful, area of future research could be a comparison of multiple AR datasets to see if particular clustering and validation routines can consistently outperform others across a range of seabed environments. "
    ABSTRACT: Acoustic backscatter from the seafloor is a complex function of signal frequency, seabed roughness, grain size distribution, benthos, bioturbation, volume reverberation, and other factors. Angular response is the variation in acoustic backscatter with incident angle and is considered to be an intrinsic property of the seabed. An unsupervised classification technique combining a self-organising map (SOM) and hierarchical clustering was used to create an angular response facies map and explore the relationships between acoustic facies and ground truth data. Cluster validation routines indicated that a two cluster solution was optimal and separated sediment dominated environments from mixtures of sediment and hard ground. Low cluster separation limited cluster validation routines from identifying fine cluster structure visible with an AR density plot. Cluster validation, aided by a visual comparison with an AR density plot, indicated that a 14 cluster solution was also a suitable representation of the input dataset. Clusters that were a mixture of hard and unconsolidated substrates displayed an increase in backscatter with an increase in the occurrence of hard ground and highlighted the sensitivity of AR curves to the presence of even modest amounts of hard ground. Remapping video observations and sediment data onto the SOM matrix is innovative and depicts the relationship between ground truth data and cluster structure. Mapping environmental variables onto the SOM matrix can show broad trends and localised peaks and troughs and display the variability of ground truth data within designated clusters. These variables, when linked to AR curves via clusters, can indicate how environmental factors influence the shape of the curves. Once these links are established they can be incorporated into improved geoacoustic models that replicate field observations.
    No preview · Article · Sep 2015 · Geo-Marine Letters