Data mining is a long-established research area that is enjoying renewed attention with the advent of Big Data, and clustering is one of its core techniques. As we enter the Big Data era, where many real-world datasets consist of multi-dimensional features, clustering has been gaining importance within this field. Traditional clustering algorithms often fail to detect meaningful clusters in high-dimensional data sets and become computationally expensive as the number of dimensions grows. In this paper, we propose a modified technique that performs well on high-dimensional data sets: we apply Principal Component Analysis for dimension reduction before running the standard EM algorithm. The performance of the proposed combination is evaluated on the basis of the silhouette index and execution time.
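A minimal sketch of the pipeline this abstract describes (PCA for dimension reduction, then EM-based clustering scored by the silhouette index), assuming scikit-learn, the digits data as a stand-in high-dimensional set, and placeholder choices for the numbers of components and clusters:

```python
# Sketch: PCA for dimension reduction, then EM clustering (Gaussian mixture),
# evaluated with the silhouette index and wall-clock time.
import time
from sklearn.datasets import load_digits          # stand-in high-dimensional data (assumption)
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture       # fitted by the EM algorithm
from sklearn.metrics import silhouette_score

X, _ = load_digits(return_X_y=True)               # 64-dimensional feature vectors

start = time.time()
X_reduced = PCA(n_components=10).fit_transform(X)  # 10 components is an illustrative choice
labels = GaussianMixture(n_components=10, random_state=0).fit_predict(X_reduced)
elapsed = time.time() - start

print(f"silhouette index: {silhouette_score(X_reduced, labels):.3f}")
print(f"execution time:   {elapsed:.2f}s")
```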
Clustering algorithms have emerged as a powerful alternative meta-learning tool for accurately analyzing the massive volumes of data generated by modern applications. Their main goal is to categorize data into clusters such that objects in the same cluster are similar according to specific metrics. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize clustering algorithms for a large number of applications. However, one of the major issues in using clustering algorithms for big data, and a source of confusion among practitioners, is the lack of consensus on the definition of their properties as well as the lack of a formal categorization. With the intention of alleviating these problems, this paper introduces concepts and algorithms related to clustering, provides a concise survey of existing clustering algorithms, and offers a comparison from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments in which we compared the most representative algorithm from each category using a large number of real (big) data sets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, as well as stability, runtime, and scalability tests. In addition, we highlight the set of clustering algorithms that perform best for big data.
This paper evaluates several methods of discretisation (binning) within a k-Nearest Neighbour predictor. Our k-NN is constructed using binary neural networks, which require continuous-valued data to be discretised so that it can be mapped onto the binary neural framework. Our approach couples discretisation with robust encoding to map data sets onto the binary neural network. In this paper, we compare seven unsupervised discretisation methods for retrieval accuracy (prediction accuracy) across a range of well-known prediction data sets comprising time-series data, and we analyse whether there is an optimal discretisation configuration for our k-NN. The analyses demonstrate that the optimal configuration is data-specific. Hence, we recommend evaluating a number of configurations on a test data set, varying both the discretisation method and the number of discretisation bins; this evaluation will pinpoint the optimum configuration for a new data set.
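As an illustration of the evaluation the abstract recommends, the sketch below varies both the unsupervised binning method and the number of bins around a k-NN predictor. The binary-neural-network k-NN of the paper is replaced here by scikit-learn's standard k-NN regressor, and the time series is synthetic, so this shows only the shape of the evaluation loop, not the authors' framework.

```python
# Sketch: comparing unsupervised discretisation (binning) configurations for a
# k-NN predictor on time-series data.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Toy time series turned into a sliding-window prediction data set (assumption).
series = np.sin(np.linspace(0, 60, 1500)) + 0.1 * np.random.default_rng(0).normal(size=1500)
window = 8
X = np.lib.stride_tricks.sliding_window_view(series[:-1], window)
y = series[window:]
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

# Vary both the discretisation method and the number of bins, as recommended.
for strategy in ("uniform", "quantile", "kmeans"):
    for n_bins in (5, 10, 20):
        binner = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
        knn = KNeighborsRegressor(n_neighbors=5)
        knn.fit(binner.fit_transform(X_train), y_train)
        err = mean_absolute_error(y_test, knn.predict(binner.transform(X_test)))
        print(f"{strategy:>8} bins={n_bins:<3} MAE={err:.4f}")
```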
Principal Component Analysis (PCA) is a mathematical procedure widely used in exploratory data analysis, signal processing, etc. However, it is often treated as a black-box operation whose results and inner workings are difficult to understand. The goal of this paper is to provide a detailed explanation of PCA based on a purpose-built visual analytics tool that visualizes the results of principal component analysis and supports a rich set of interactions to assist the user in better understanding and utilizing PCA. The paper begins by describing the relationship between PCA and singular value decomposition (SVD), the method used in our visual analytics tool. A detailed explanation of the interactive visual analytics tool, including its advantages and limitations, is then provided.
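The PCA–SVD relationship the paper explains can be summarised in a few lines: PCA follows from the SVD of the mean-centred data matrix. The sketch below assumes NumPy and purely illustrative random data.

```python
# Sketch: PCA obtained from the singular value decomposition (SVD) of the
# mean-centred data matrix. The principal axes are the right singular vectors,
# and the explained variances are s**2 / (n - 1).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # illustrative data

Xc = X - X.mean(axis=0)                  # centre each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                          # principal directions (rows)
explained_variance = s**2 / (Xc.shape[0] - 1)
scores = Xc @ Vt.T                       # projection of the data onto the components
# equivalently: scores = U * s

print(explained_variance)
```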
Due to the incredible growth of high-dimensional datasets, conventional database querying methods are inadequate for extracting useful information, so researchers are now forced to develop new techniques to meet the raised requirements. Such large expression data gives rise to a number of new computational challenges, not only because of the increase in the number of data objects but also because of the increase in the number of features/attributes. Hence, to improve the efficiency and accuracy of mining tasks on high-dimensional data, the data must be preprocessed by an efficient dimensionality reduction method. Cluster analysis has recently become a popular data analysis method in a number of areas. K-means is a well-known partitioning-based clustering technique that attempts to find a user-specified number of clusters represented by their centroids. However, its output is quite sensitive to the initial positions of the cluster centers, and the number of distance calculations increases rapidly with the dimensionality of the data. Hence, in this paper we propose to use Principal Component Analysis (PCA) as a first phase for K-means clustering, which simplifies the analysis and visualization of multi-dimensional data sets. We also propose a new method for finding the initial centroids to make the algorithm more effective and efficient. Comparing the original and the new approach, we found that the results obtained are more accurate and easier to understand and, above all, that the time taken to process the data is substantially reduced.
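The abstract does not detail the proposed initial-centroid method, so the sketch below uses a placeholder initialisation (evenly spaced points along the first principal component) purely to show where explicit initial centroids enter a PCA-then-K-means pipeline; the data and cluster count are illustrative.

```python
# Sketch: PCA as a first phase before K-means, with explicit initial centroids.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=1000, n_features=50, centers=4, random_state=0)  # illustrative data

X_reduced = PCA(n_components=2).fit_transform(X)     # dimensionality reduction phase

# Placeholder initialisation: evenly spaced points along the sorted first component
# (an assumption standing in for the paper's initial-centroid method).
order = np.argsort(X_reduced[:, 0])
init_centroids = X_reduced[order[np.linspace(0, len(order) - 1, 4, dtype=int)]]

km = KMeans(n_clusters=4, init=init_centroids, n_init=1, random_state=0).fit(X_reduced)
print(km.inertia_, np.bincount(km.labels_))
```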
Cross-validation is a tried and tested approach for selecting the number of components in principal component analysis (PCA); however, its main drawback is its computational cost. In a regression (or non-parametric regression) setting, criteria such as generalized cross-validation (GCV) provide convenient approximations to leave-one-out cross-validation. They are based on the relation between the prediction error and the residual sum of squares weighted by elements of a projection matrix (or a smoothing matrix). Such a relation is then established in PCA using an original presentation of PCA with a unique projection matrix. This enables the definition of two cross-validation approximation criteria: the smoothing approximation of the cross-validation criterion (SACV) and the GCV criterion. The method is assessed with simulations and gives promising results.
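For reference, a standard form of the GCV criterion in the regression/smoothing setting the abstract builds on; the notation (H for the projection or smoothing matrix, y for the observations, n for the sample size) is assumed here and is not taken from the paper.

```latex
% Standard GCV criterion in the regression/smoothing setting (notation assumed):
% H is the projection (hat) or smoothing matrix, Hy the fitted values, n the sample size.
\mathrm{GCV} \;=\; \frac{\tfrac{1}{n}\,\lVert y - H y \rVert^{2}}
                        {\bigl(1 - \tfrac{1}{n}\,\operatorname{tr}(H)\bigr)^{2}}
```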
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data, and it is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. Even though K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, it is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
Practical statistical clustering algorithms typically center on an iterative refinement optimization procedure that computes a locally optimal clustering solution maximizing the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
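The decomposition the abstract alludes to can be illustrated with the sufficient statistics of a Gaussian mixture component: a block (region) of records is summarised by its weighted zeroth, first, and second moments, and these summaries are additive across scans. The sketch below is a minimal illustration of that idea with toy data and responsibilities, not the authors' full scalable framework.

```python
# Sketch: the "basic statistics" a scalable EM needs per region/block of records
# are just (N, sum x, sum x x^T). Because they are additive, the M-step can run
# over compressed block summaries instead of rescanning raw rows.
import numpy as np

def block_stats(X, resp):
    """Sufficient statistics of one block of records for one mixture component.
    X: (n, d) records, resp: (n,) responsibilities from the E-step."""
    n_k = resp.sum()                          # effective count
    s1 = resp @ X                             # weighted sum of records
    s2 = (X * resp[:, None]).T @ X            # weighted sum of outer products
    return n_k, s1, s2

def merge(a, b):
    # Statistics are additive, so blocks scanned at different times can be merged.
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def m_step(stats):
    # Component mean and covariance recovered from the merged statistics alone.
    n_k, s1, s2 = stats
    mu = s1 / n_k
    cov = s2 / n_k - np.outer(mu, mu)
    return mu, cov

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(500, 3)), rng.normal(size=(500, 3))   # two "scans"
r1, r2 = rng.random(500), rng.random(500)                        # toy responsibilities
mu, cov = m_step(merge(block_stats(X1, r1), block_stats(X2, r2)))
print(mu, np.diag(cov))
```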