BIOINFORMATICS Computational cluster validation in post-genomic data analysis

School of Chemistry, University of Manchester, Faraday Building, Sackville Street, PO Box 88, Manchester M60 1QD, UK.
Bioinformatics (Impact Factor: 4.98). 09/2005; 21(15):3201-12. DOI: 10.1093/bioinformatics/bti517
Source: PubMed

ABSTRACT The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics.
This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.
The software used in the experiments is available at
Enlarged colour plots are provided in the Supplementary Material, which is available at

Available from: Joshua Damian Knowles, Feb 25, 2014
    • "Consistent with the recommendation of Brown et al. (2011) and Calvert et al. (2014), it would be advantageous to investigate the suitability of other validation statistics to determine if any tend to perform better at identifying clusters within highly correlated and noisy AR data. The effectiveness of cluster validation techniques relies on the degree of cluster separation in the input dataset (Handl et al. 2005) and therefore reliably clustering a complex AR dataset, such as Point Cloates, would be a significant challenge. An additional, and potentially fruitful, area of future research could be a comparison of multiple AR datasets to see if particular clustering and validation routines can consistently outperform others across a range of seabed environments. "
    ABSTRACT: Acoustic backscatter from the seafloor is a complex function of signal frequency, seabed roughness, grain size distribution, benthos, bioturbation, volume reverberation, and other factors. Angular response is the variation in acoustic backscatter with incident angle and is considered to be an intrinsic property of the seabed. An unsupervised classification technique combining a self-organising map (SOM) and hierarchical clustering was used to create an angular response facies map and explore the relationships between acoustic facies and ground truth data. Cluster validation routines indicated that a two cluster solution was optimal and separated sediment dominated environments from mixtures of sediment and hard ground. Low cluster separation limited cluster validation routines from identifying fine cluster structure visible with an AR density plot. Cluster validation, aided by a visual comparison with an AR density plot, indicated that a 14 cluster solution was also a suitable representation of the input dataset. Clusters that were a mixture of hard and unconsolidated substrates displayed an increase in backscatter with an increase in the occurrence of hard ground and highlighted the sensitivity of AR curves to the presence of even modest amounts of hard ground. Remapping video observations and sediment data onto the SOM matrix is innovative and depicts the relationship between ground truth data and cluster structure. Mapping environmental variables onto the SOM matrix can show broad trends and localised peaks and troughs and display the variability of ground truth data within designated clusters. These variables, when linked to AR curves via clusters, can indicate how environmental factors influence the shape of the curves. Once these links are established they can be incorporated into improved geoacoustic models that replicate field observations.
    Geo-Marine Letters 09/2015; 35(5):387-403. DOI:10.1007/s00367-015-0415-5 · 2.12 Impact Factor
    • "Considering the data size in our study, k-means is more convenient and effective than other algorithms (Handl et al., 2005). When k-means clustering is processing, more than 3,000 calculation cycles were run to achieve a stable and reliable result. "
    ABSTRACT: Hepatocellular carcinoma (HCC) is one of the most deadly cancers in the world due to its high metastatic potential. By using the isobaric tags for relative and absolute quantitation (iTRAQ)-based quantitative N-glycoproteomic analysis, 26 differentially expressed serum glycoproteins derived from defined stages in orthotopic xenograft tumor model were identified. Among them, expression level of soluble EGFR (sEGFR) was verified in HCC cell lines. We found that non-metastasis HCC cell lines express significantly more sEGFR than HCC cell lines with metastasis potential both in cell lysates and culture media. Serum samples from 28 non-metastatic HCC patients and 28 metastatic HCC patients were assayed. Compared with the non-metastatic HCC group, serum level of sEGFR in metastatic HCC group was statistically lower (p<0.01). All these results provide evidence that sEGFR is a potential candidate for metastasis-associated biomarkers of HCC. The related molecular mechanism deserves to be further explored.
    Discovery medicine 05/2015; 19(106):333-41. · 3.63 Impact Factor
    • "It may be based on the intrinsic properties of the data (internal validation) or when true class memberships is a priori known (external validation) using a large number of internal and external indexes. A comprehensive description and comparison analysis of an extensive list of those validation techniques may be found in recent literature (e.g., Bennett et al., 2013; Deborah et al., 2010; Halkidi et al., 2001; Handl et al., 2005; Rendón et al., 2011). "
    ABSTRACT: This study focuses on the use of space-time permutation scan statistics (STPSS) to assess both the existence and the statistical significance of clusters on aggregated datasets. The investigated case study is represented from the Portuguese Rural Fire Database (PRFD) where the fire occurrences are georeferenced to an administrative unit level. The main goals are: (i) assessing the robustness of the STPSS to correctly detect clusters on aggregated datasets; (ii) testing the existence of space-time clustering in the PRFD; and (iii) characterizing the detected clusters. A synthetic database was designed to assess the potential bias introduced by aggregation of the data on the performance of the STPSS method. Results confirmed the ability of the STPSS to correctly identify clusters, regarding their number, location, and spatio-temporal dimensions and provided recommendations about the parameters setting of the scanning window. Finally, a discussion of the identified clusters on the PRFD is presented.
    Environmental Modelling and Software 05/2015; 72:239-249. DOI:10.1016/j.envsoft.2015.05.016 · 4.42 Impact Factor