Davoud Moulavi’s research while affiliated with University of Alberta and other places


Publications (7)


Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection
  • Article

July 2015 · 1,020 Reads · 687 Citations · ACM Transactions on Knowledge Discovery from Data

Ricardo J. G. B. Campello · Davoud Moulavi · Arthur Zimek

An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan's classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. As a result, it provides a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of "outlierness" can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a "flat" (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraint violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, in terms of both running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art density-based clustering and (global and local) outlier detection methods.
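
A hedged aside: the framework described above corresponds to what is now widely known as HDBSCAN* together with the GLOSH outlier score. The sketch below assumes the third-party Python package hdbscan, which implements these ideas; the data and parameter values are placeholders, not taken from the article.

    import numpy as np
    import hdbscan

    X = np.random.rand(500, 2)                  # placeholder data

    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(X)           # flat clustering from local cuts; -1 marks noise
    glosh = clusterer.outlier_scores_           # per-object "outlierness" (GLOSH) scores
    clusterer.condensed_tree_.plot()            # hierarchy visualization (requires matplotlib)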


On strategies for building effective ensembles of relative clustering validity criteria

June 2015 · 39 Reads · 30 Citations · Knowledge and Information Systems

Davoud Moulavi · [...]

Evaluation and validation are essential tasks for achieving meaningful clustering results. Relative validity criteria are measures usually employed in practice to select and validate clustering solutions, as they enable the evaluation of single partitions and the comparison of partition pairs in relative terms based only on the data under analysis. There is a plethora of relative validity measures described in the clustering literature, thus making it difficult to choose an appropriate measure for a given application. One reason for such a variety is that no single measure can capture all different aspects of the clustering problem and, as such, each of them is prone to fail in particular application scenarios. In the present work, we take advantage of the diversity in relative validity measures from the clustering literature. Previous work showed that when randomly selecting different relative validity criteria for an ensemble (from an initial set of 28 different measures), one can expect with great certainty to only improve results over the worst criterion included in the ensemble. In this paper, we propose a method for selecting measures with minimum effectiveness and some degree of complementarity (from the same set of 28 measures) into ensembles, which show superior performance when compared to any single ensemble member (and not just the worst one) over a variety of different datasets. One can also expect greater stability in terms of evaluation over different datasets, even when considering different ensemble strategies. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
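
As a hedged illustration only (this is not the authors' ensemble-construction procedure), the following sketch rank-aggregates three relative validity criteria available in scikit-learn to choose a number of clusters; Davies-Bouldin is negated because lower values are better.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import (silhouette_score,
                                 calinski_harabasz_score,
                                 davies_bouldin_score)

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    candidates = range(2, 9)

    scores = []
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores.append([silhouette_score(X, labels),
                       calinski_harabasz_score(X, labels),
                       -davies_bouldin_score(X, labels)])  # negate: lower is better

    # Rank each criterion separately (higher rank = better) and average the ranks.
    ranks = np.argsort(np.argsort(np.array(scores), axis=0), axis=0)
    best_k = list(candidates)[int(np.mean(ranks, axis=1).argmax())]
    print("ensemble choice of k:", best_k)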


Density-Based Clustering Validation
  • Conference Paper
  • Full-text available

April 2014 · 5,058 Reads · 229 Citations

One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases, relative validity indices proposed for globular cluster validation may fail. In this paper, we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.
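
A short usage sketch, assuming the third-party hdbscan package, whose validity_index function provides an implementation of the DBCV measure proposed here; the data are synthetic placeholders.

    import numpy as np
    import hdbscan
    from hdbscan.validity import validity_index
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    X = X.astype(np.float64)                    # validity_index expects double precision
    labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)

    score = validity_index(X, labels)           # DBCV in [-1, 1]; higher is better
    print("DBCV:", score)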


A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

November 2013 · 376 Reads · 71 Citations · Data Mining and Knowledge Discovery

We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem and a linear complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as in unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.
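
The core optimization can be pictured with a toy recursion over a cluster tree: each cluster is either selected itself or replaced by the best selection among its descendants, whichever yields the higher total stability. The tree and stability values below are hypothetical, and real implementations (such as the framework above) additionally handle the root cluster and semi-supervised constraint terms that this sketch omits.

    def best_selection(node):
        """Return (best total stability, selected cluster ids) for the subtree."""
        stability, children = tree[node]
        if not children:                        # a leaf can only select itself
            return stability, [node]
        child_total, child_sel = 0.0, []
        for c in children:
            t, s = best_selection(c)
            child_total += t
            child_sel += s
        # Keep this cluster, or keep the best selection among its descendants.
        if stability >= child_total:
            return stability, [node]
        return child_total, child_sel

    # cluster id -> (stability, child cluster ids); hypothetical values
    tree = {
        "root": (0.2, ["A", "B"]),
        "A":    (0.9, ["A1", "A2"]),
        "A1":   (0.3, []),
        "A2":   (0.4, []),
        "B":    (0.5, []),
    }
    print(best_selection("root"))               # -> (1.4, ['A', 'B'])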


Density-Based Clustering Based on Hierarchical Density Estimates

April 2013 · 4,296 Reads · 1,953 Citations · Lecture Notes in Computer Science

We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.
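
Not the paper's implementation, but the backbone of such a density hierarchy can be sketched from first principles: compute core distances, derive mutual reachability distances, and build a minimum spanning tree over them. A minimal NumPy/SciPy sketch, with an illustrative choice of min_pts:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree

    def mutual_reachability_mst(X, min_pts=5):
        # Pairwise Euclidean distances; sorting each row gives the distance to
        # the min_pts-th nearest neighbour (column 0 is the point itself).
        dist = squareform(pdist(X))
        core = np.sort(dist, axis=1)[:, min_pts]
        # Mutual reachability distance: max(core(x), core(y), d(x, y)).
        mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
        return minimum_spanning_tree(mreach)    # backbone of the density hierarchy

    mst = mutual_reachability_mst(np.random.rand(100, 2))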


A Simpler and More Accurate AUTO-HDS Framework for Clustering and Visualization of Biological Data

November 2012 · 62 Reads · 4 Citations · IEEE/ACM Transactions on Computational Biology and Bioinformatics

In earlier work, the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter complements that framework by showing that a user-defined parameter can be eliminated, allowing the clustering stage to be implemented more accurately and with reduced computational complexity.


Combining gene expression and interaction network data to improve kidney lesion score prediction

[Figure I: Performance of bagging models built over statistically selected genes and biologically selected genes, and fusion of bagged hub-gene and bagged STT models.]

March 2012 · 150 Reads · 6 Citations · International Journal of Bioinformatics Research and Applications

The current method of diagnosing kidney rejection, based on the histopathology of renal biopsies in the form of lesion scores, is error-prone. Researchers use gene expression microarrays in combination with machine learning to build better kidney rejection predictors. However, the high dimensionality of the data makes this task challenging and compels the application of feature selection methods. We present a method for predicting lesions using a combination of statistical and biological feature selection methods along with an ensemble learning technique. Results show that combining highly interacting genes (hub genes) from a protein-protein interaction network with genes selected by the squared t-test method yields the most accurate kidney lesion score predictor.
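
As a rough, hypothetical illustration of this kind of pipeline (toy data and placeholder names, not the paper's data or exact method), one might combine a univariate statistical filter (for two classes, the F-score equals the squared t statistic), hub genes taken as the highest-degree nodes of a protein-protein interaction graph, and a bagged classifier:

    import numpy as np
    import networkx as nx
    from sklearn.ensemble import BaggingClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 500))              # 80 biopsies x 500 genes (toy data)
    y = rng.integers(0, 2, size=80)             # toy lesion labels

    # Statistical selection: top genes by univariate F-score (squared t statistic).
    stat_idx = SelectKBest(f_classif, k=20).fit(X, y).get_support(indices=True)

    # Biological selection: highest-degree "hub" genes in a toy PPI network.
    ppi = nx.gnm_random_graph(500, 2000, seed=0)
    hub_idx = [g for g, _ in sorted(ppi.degree, key=lambda d: d[1], reverse=True)[:20]]

    genes = sorted(set(stat_idx) | set(hub_idx))
    model = BaggingClassifier(n_estimators=50, random_state=0)
    print(cross_val_score(model, X[:, genes], y, cv=5).mean())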

Citations (7)


... Examples of common external validity indices for clustering include the F-measure, Rand index, and Jaccard coefficient. However, the applicability of external indices may be restricted in real-world clustering scenarios where external information is unavailable (Jaskowiak et al., 2016). Internal validity indices are used to evaluate the quality of clustering results solely based on the data itself, without relying on any external information or prior knowledge. ...

Reference:

DEA-based internal validity index for clustering
On strategies for building effective ensembles of relative clustering validity criteria
  • Citing Article
  • June 2015

Knowledge and Information Systems

... Outlierness is computed as the average distance to the x-closest points in the model. • GLOSH (global-local outlier scores from hierarchies) (Campello et al. 2015) summarizes the input space by means of a density-based hierarchical clustering structure; later, it uses the closest cluster of a given point and the referential density of such cluster to estimate point outlierness. This way, GLOSH is able to overcome the global-local dichotomy and take both approaches into account at the same time. ...

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection
  • Citing Article
  • July 2015

ACM Transactions on Knowledge Discovery from Data

... For this embedding, we utilize UMAP (Uniform Manifold Approximation and Projection) [27], a dimensionality reduction technique that preserves local and some global structure. Given that we have no a priori knowledge about the number of clusters, we employ HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), which can handle clusters of varying densities [4]. HDBSCAN builds a hierarchy of clusters based on density, represented by a condensed tree, and allows for robust handling of noise, making it suitable for the possibly intricate structure of feature representation spaces. ...

Density-Based Clustering Based on Hierarchical Density Estimates
  • Citing Conference Paper
  • April 2013

Lecture Notes in Computer Science
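
The snippet above describes a common pipeline: embed the features with UMAP, then cluster the embedding with HDBSCAN. A brief sketch assuming the third-party umap-learn and hdbscan packages; the data and parameters are illustrative only.

    import numpy as np
    import umap
    import hdbscan

    X = np.random.rand(1000, 64)                # placeholder feature vectors

    embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                          n_components=5, random_state=0).fit_transform(X)
    labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(embedding)
    print("clusters:", labels.max() + 1, "| noise points:", int((labels == -1).sum()))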

... We evaluate how well the embedding on which we perform the clustering preserves the distances by measuring the mean squared error between the distance matrices in the original representation and its embedding (RMSE). To evaluate the clustering, we compute a density-based validity index (DBCV) [30], which measures intra- vs. inter-cluster density. Further, we report the rate of points classified as noise by HDBSCAN. ...

Density-Based Clustering Validation

... A flat solution is extracted automatically based on local cuts. In [27], Campello et al. introduce the Framework for Optimal Selection of Clusters (FOSC), which formalizes cluster selection through local cuts as an optimization problem. ...

A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies
  • Citing Article
  • November 2013

Data Mining and Knowledge Discovery

... Hierarchical Clustering (Chezhian et al., 2011; Campello et al., 2012) used Lloyd's k-means and progressive greedy k-means, density-based clustering techniques, for clustering and visualization of DNA sequences. A randomly generated dataset was used with k (number of clusters) = 2, 3, 4, and 5. ...

A Simpler and More Accurate AUTO-HDS Framework for Clustering and Visualization of Biological Data
  • Citing Article
  • November 2012

IEEE/ACM Transactions on Computational Biology and Bioinformatics

... Novel tools to advance rejection diagnosis Transcriptomics, metabolomics, and proteomics have shown great potential in developing sensitive and specific diagnostic tools for monitoring early changes in cell signal transduction, regulation, and biochemical pathways (45,46). Their role in complementing histological Banff criteria for the diagnosis of rejection after both solid organ transplantation and VCA holds great promise. ...

Combining gene expression and interaction network data to improve kidney lesion score prediction

International Journal of Bioinformatics Research and Applications