BIOINFORMATICS Computational cluster validation in post-genomic data analysis

School of Chemistry, University of Manchester, Faraday Building, Sackville Street, PO Box 88, Manchester M60 1QD, UK.
Bioinformatics (Impact Factor: 4.98). 09/2005; 21(15):3201-12. DOI: 10.1093/bioinformatics/bti517
Source: PubMed


The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge--whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics.
This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation.
The software used in the experiments is available at
Enlarged colour plots are provided in the Supplementary Material, which is available at


Available from: Joshua Damian Knowles, Feb 25, 2014
    • "Consistent with the recommendation of Brown et al. (2011) and Calvert et al. (2014), it would be advantageous to investigate the suitability of other validation statistics to determine if any tend to perform better at identifying clusters within highly correlated and noisy AR data. The effectiveness of cluster validation techniques relies on the degree of cluster separation in the input dataset (Handl et al. 2005) and therefore reliably clustering a complex AR dataset, such as Point Cloates, would be a significant challenge. An additional, and potentially fruitful, area of future research could be a comparison of multiple AR datasets to see if particular clustering and validation routines can consistently outperform others across a range of seabed environments. "
    ABSTRACT: Acoustic backscatter from the seafloor is a complex function of signal frequency, seabed roughness, grain size distribution, benthos, bioturbation, volume reverberation, and other factors. Angular response (AR) is the variation in acoustic backscatter with incident angle and is considered to be an intrinsic property of the seabed. An unsupervised classification technique combining a self-organising map (SOM) and hierarchical clustering was used to create an angular response facies map and explore the relationships between acoustic facies and ground truth data. Cluster validation routines indicated that a two-cluster solution was optimal and separated sediment-dominated environments from mixtures of sediment and hard ground. Low cluster separation limited the ability of cluster validation routines to identify fine cluster structure visible with an AR density plot. Cluster validation, aided by a visual comparison with an AR density plot, indicated that a 14-cluster solution was also a suitable representation of the input dataset. Clusters that were a mixture of hard and unconsolidated substrates displayed an increase in backscatter with an increase in the occurrence of hard ground and highlighted the sensitivity of AR curves to the presence of even modest amounts of hard ground. Remapping video observations and sediment data onto the SOM matrix is innovative and depicts the relationship between ground truth data and cluster structure. Mapping environmental variables onto the SOM matrix can show broad trends and localised peaks and troughs and display the variability of ground truth data within designated clusters. These variables, when linked to AR curves via clusters, can indicate how environmental factors influence the shape of the curves. Once these links are established they can be incorporated into improved geoacoustic models that replicate field observations.
    Geo-Marine Letters 09/2015; 35(5):387-403. DOI:10.1007/s00367-015-0415-5 · 2.12 Impact Factor
    • "We applied three internal measures—Dunn's index, connectivity, and silhouette width—using the package clValid in R to validate the most probable number of clusters, which we varied between two and six. Merits and applications of these measures have been discussed elsewhere (Handl et al. 2005). "
    ABSTRACT: Pervasive introductions of non-native taxa are behind processes of homogenization of various types affecting the global flora and fauna. Chile’s freshwater ecosystems encompass a diverse and highly endemic fish fauna that might be sensitive to the introduction of non-native species, an ongoing process that started two centuries ago, but has to date received little attention. Using historical (native) and present-day (native and non-native) presence-absence data sets of compositional similarity, our goal was twofold: (1) evaluate patterns of taxonomic homogenization at various spatial scales and (2) identify clusters of widely versus narrowly distributed species to assess their relative role in compositional changes. We expect that non-native species with wide distributions might have a larger influence in taxonomic homogenization than those with narrow distributions. Chile’s fish assemblages have become increasingly homogenized during the last two centuries when evaluating changes in compositional similarity among 201 watersheds (65.3% of total comparisons showed homogenization) distributed among six defined biotic units. Taxonomic differentiation was significantly more prevalent than taxonomic homogenization within biotic units. Among biotic units, comparisons between historical and current compositional similarity were all significantly different. We identified one cluster of non-native fishes that were distributed across five or six of the biotic units. This cluster included Brown Trout (Salmo trutta) and Rainbow Trout (Oncorhynchus mykiss) as the two most representative species. A second cluster we identified included fishes that on average spanned only one or two biotic units. We provide the first evidence for an ongoing and large-scale process of taxonomic homogenization among Chile’s watersheds occurring at various scales. Our findings provide taxonomic and biogeographic baseline information for management plans and courses of action for conservation of native fishes, many of which are endemic. We also discuss management guidelines of non-native fishes in Chile. Baseline information of both native and non-native fish taxa might be applicable to other isolated regions elsewhere.
    Revista Chilena de Historia Natural 09/2015; 88(1-1):16. DOI:10.1186/s40693-015-0046-2 · 0.65 Impact Factor
    • "Considering the data size in our study, k-means is more convenient and effective than other algorithms (Handl et al., 2005). When k-means clustering is processing, more than 3,000 calculation cycles were run to achieve a stable and reliable result. "
    ABSTRACT: Hepatocellular carcinoma (HCC) is one of the most deadly cancers in the world due to its high metastatic potential. By using the isobaric tags for relative and absolute quantitation (iTRAQ)-based quantitative N-glycoproteomic analysis, 26 differentially expressed serum glycoproteins derived from defined stages in orthotopic xenograft tumor model were identified. Among them, expression level of soluble EGFR (sEGFR) was verified in HCC cell lines. We found that non-metastasis HCC cell lines express significantly more sEGFR than HCC cell lines with metastasis potential both in cell lysates and culture media. Serum samples from 28 non-metastatic HCC patients and 28 metastatic HCC patients were assayed. Compared with the non-metastatic HCC group, serum level of sEGFR in metastatic HCC group was statistically lower (p<0.01). All these results provide evidence that sEGFR is a potential candidate for metastasis-associated biomarkers of HCC. The related molecular mechanism deserves to be further explored.
    Discovery medicine 05/2015; 19(106):333-41. · 3.63 Impact Factor
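
Two of the internal validation measures named in the citing excerpts above, Dunn's index and silhouette width, can be illustrated with a minimal Python sketch. This is not the code used in the review or in the clValid package; the function names, the use of Euclidean distance, and the specific Dunn variant (smallest between-cluster point distance divided by largest cluster diameter) are assumptions made for illustration.

```python
import math
from itertools import combinations


def silhouette_width(points, labels):
    """Mean silhouette width over all points (between -1 and 1; higher is better).

    Assumes Euclidean distance. Points in singleton clusters contribute 0,
    which is a common convention rather than the only possible one.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    min_between = math.inf
    max_within = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    if max_within == 0:
        return math.inf  # every cluster is a single point or coincident points
    return min_between / max_within
```

For four points forming two well-separated pairs, such as (0, 0), (0, 1), (10, 0), (10, 1), the natural partition gives a Dunn index of 10 and a silhouette width of about 0.9, whereas a partition that splits each pair gives a Dunn index of 0.1 and a negative silhouette width. Scoring candidate partitions for several cluster counts and comparing the values is the usual way such measures are applied.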