Conference Paper

An empirical comparison of fast and efficient tools for mining textual data

Conference: ISCSE 2010, 1st International Symposium on Computing in Science and Engineering


To effectively manage and retrieve the information contained in vast amounts of text documents, powerful text mining tools and techniques are essential. In this paper, we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans. Several experiments were conducted on three benchmark datasets, and the results are analysed in terms of clustering quality, memory consumption, and CPU time. We empirically show that Gmeans offers high scalability at the cost of clustering quality, while Cluto delivers better clustering quality at the expense of memory and CPU time.


Available from: Volkan Tunalı