Clustering Big Urban Dataset
Ahmad Al Shami, Weisi Guo, Ganna Pogrebna
Warwick Institute for the Science of Cities, School of Engineering, The University of Warwick
Coventry, CV4 7AL, United Kingdom
Email: a.al-shami@warwick.ac.uk, weisi.guo@warwick.ac.uk
Abstract—Cities are producing and collecting massive amounts of data from various sources such as transportation networks, the energy sector, smart homes, tax records, surveys, LIDAR data, and mobile phone sensors. All of the aforementioned data, when connected via the Internet, fall under the Internet of Things (IoT) category. To use such a large volume of data for potential scientific computing benefits, it is important to store and analyze this urban data using efficient computing resources and algorithms. However, this can be problematic due to many challenges. This article explores some of these challenges and tests the performance of two partitional algorithms for clustering Big Urban Datasets: K-Means and Fuzzy c-Means (FCM). Clustering Big Urban Data into a compact format preserves the information of the whole dataset and allows researchers to work with the reorganised data far more efficiently. Our experiments conclude that FCM outperformed K-Means when presented with this type of dataset, although the latter is lighter on hardware utilisation.
Index terms— Big Data; LIDAR; Fuzzy c-Means; K-Means; Hardware Utilisation; Smart City
I. INTRODUCTION
The challenges of Big Data stem from the 5Vs: Volume, Velocity, Variety, Veracity, and the Value to be gained from the analysis of Big Data [1]. Many researchers deal with different types of data sets; the concern here is whether to introduce new algorithms or to adapt existing ones to suit large datasets. Currently, two approaches predominate. The first, known as Scaling-Up, focuses effort on enhancing the available algorithms. This approach risks them becoming useless tomorrow as the data continues to grow; hence, to deal with datasets that grow continuously in size, it will be necessary to scale up algorithms again and again as time moves on. The second approach is to Scale-Down, that is, to reduce the data itself and apply existing algorithms to the skimmed version of the data after its size has been reduced. Scaling down the data risks the loss of valuable information through the summarising and size-reduction techniques; still, it is argued that scaling down may only risk information that is comparatively unimportant or redundant. Since there is still great scope for research in both areas, this article focuses on the scale-down of data sets by comparing clustering techniques. Clustering is defined as the process of grouping a set of items or objects that share the same attributes or characteristics.
II. K-MEANS VS. FUZZY C-MEANS
To highlight the advantages to scientific computing for Big Data, and to avoid the above-mentioned disadvantages of hierarchical clustering techniques, this article focuses on comparing two popular and computationally attractive partitional techniques:
1) K-Means: This is a widely used clustering algorithm that partitions a data set into K clusters (C_1, C_2, ..., C_K), each represented by its arithmetic mean, called the centroid, which is calculated as the average of all data points (records) belonging to that cluster.
2) Fuzzy c-Means (FCM): Introduced by [16] and derived from the K-Means concept, FCM differs in that an object may belong to more than one cluster at the same time, with a certain degree of belonging to each cluster.
FCM clustering is obtained by iteratively minimising an objective function whose value determines the best locations for the cluster centres. Fuzzy clusters can be characterised by a class membership function matrix; cluster centres are determined first at the learning stage, and the classification is then made by comparing the Euclidean distance between incoming features and each cluster centre [17].
For a data set X = \{x_1, x_2, ..., x_j, ..., x_n\} \subset \mathbb{R}^s partitioned into c clusters, where 1 < c < n, the fuzzy clusters can be characterised by a c \times n membership function matrix U whose entries satisfy the following conditions:

\sum_{i=1}^{c} u_{i,j} = 1, \quad j = 1, 2, ..., n \qquad (1)

0 < \sum_{j=1}^{n} u_{i,j} < n, \quad i = 1, 2, ..., c \qquad (2)

where u_{i,j} is the grade of membership of the data entry x_j in the i-th cluster. Cluster centres are determined initially at the learning stage. Then, the classification is made by comparing the distances between the data points and the cluster centres. Clusters are obtained by minimising the following cost function via an iterative scheme:

J(U, V) = \sum_{j=1}^{n} \sum_{i=1}^{c} (u_{i,j})^2 \, \| x_j - v_i \|^2 \qquad (3)

where V = \{v_1, v_2, ..., v_i, ..., v_c\} are the c vectors of cluster centres, with v_i representing the centre of the i-th cluster. To calculate the centre of each cluster, the following iterative algorithm is used (a minimal implementation sketch is given at the end of this section):

a) Estimate the class membership matrix U.
b) Calculate the vectors of cluster centres V = \{v_1, v_2, ..., v_i, ..., v_c\} using the following expression:

v_i = \frac{\sum_{j=1}^{n} (u_{i,j})^2 \, x_j}{\sum_{j=1}^{n} (u_{i,j})^2}, \quad i = 1, 2, ..., c \qquad (4)

c) Update the class membership matrix U with:

u_{i,j} = \frac{1}{\sum_{r=1}^{c} \left( \| x_j - v_i \| / \| x_j - v_r \| \right)^2}, \quad i = 1, ..., c; \; j = 1, ..., n \qquad (5)

d) If the control error, defined as the difference between two consecutive iterations of the membership matrix U, is less than a pre-specified value, the process stops; otherwise it repeats from step (b).

After a number of iterations, the cluster centres will satisfy the minimisation of the cost function J to a local minimum [17].
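To make the iteration concrete, the following is a minimal NumPy sketch of the update scheme in Eqs. (4) and (5), with the membership exponent fixed at 2 as in the cost function above. The function name, tolerance, and seed are our own illustrative choices, not taken from the paper's implementation.

```python
import numpy as np

def fcm(X, c, tol=1e-5, max_iter=300, seed=0):
    """Fuzzy c-Means with membership exponent m = 2, per Eqs. (1)-(5)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)        # enforce Eq. (1): columns sum to 1
    for _ in range(max_iter):
        W = U ** 2                           # (u_ij)^2 weights
        V = (W @ X) / W.sum(axis=1, keepdims=True)    # Eq. (4): cluster centres
        # Squared Euclidean distances ||x_j - v_i||^2, shape (c, n).
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)           # avoid division by zero at a centre
        inv = 1.0 / d2
        U_new = inv / inv.sum(axis=0, keepdims=True)  # Eq. (5), rearranged
        if np.abs(U_new - U).max() < tol:    # step (d): control-error check
            return U_new, V
        U = U_new
    return U, V
```

Each column of the returned U gives one data point's degrees of belonging to all c clusters; this soft assignment is precisely what distinguishes FCM from the hard assignments of K-Means.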
III. EXPERIMENT SETUP
A LIDAR dataset was used for this experiment, as LIDAR is gaining importance for urban-planning applications such as floodplain mapping, hydrology, geomorphology, landscape ecology, coastal engineering, survey assessments, and volumetric calculations. The experiments were carried out to compare how the candidate K-Means and FCM clustering techniques cope with clustering a big urban dataset on mid-range computer hardware. The experiments were performed using an AMD 8320, 4.1 GHz, 8-core processor with 8 GB of RAM, running a 64-bit Windows 8.1 operating system. The algorithms were run against LIDAR data points [3] taken for our campus location at Latitude 52.23°–52.22° and Longitude 1.335°–1.324°. This location represents the University of Warwick main campus, with an initialisation of 1,000,000 × 1,000 digital surface data points.
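The paper does not state which clustering implementation was used. As an illustration only, the following hedged Python sketch shows how such timing and resource figures might be reproduced, with scikit-learn's KMeans standing in for whatever implementation the authors used and psutil sampling CPU and RAM. The synthetic array is a stand-in for the licensed LIDAR data and is scaled down from the paper's dimensions so the sketch runs comfortably on an 8 GB machine.

```python
import time

import numpy as np
import psutil  # third-party: pip install psutil scikit-learn
from sklearn.cluster import KMeans

# Synthetic stand-in for the digital surface points; the real experiment
# used a 1,000,000 x 1,000 LIDAR array [3], reduced here for memory.
X = np.random.default_rng(0).random((100_000, 100))

for k in (5, 10, 15, 20, 25):
    psutil.cpu_percent(interval=None)           # reset the CPU-usage counter
    t0 = time.perf_counter()
    KMeans(n_clusters=k, n_init=1, random_state=0).fit(X)
    elapsed = time.perf_counter() - t0
    cpu = psutil.cpu_percent(interval=None)     # mean CPU % since the reset
    ram = psutil.virtual_memory().percent       # RAM % at this instant
    print(f"K={k}: {elapsed:.3f} s, CPU {cpu:.1f}%, RAM {ram:.1f}%")
```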
IV. COMPARATIVE ANALYSIS
1) K-Means Clustering: This clustering technique was applied to the specified dataset, starting with a small cluster number K = 5 and gradually increasing to K = 25 clusters. Fig. 1 shows how, on average, the hardware fared in obtaining the desired number of K clusters, and Table I summarises the elapsed time and resources the K-Means algorithm used to converge.
2) FCM Clustering: This clustering technique was also applied to the same generated dataset, with the cluster number starting at 5 and gradually increasing to 25 clusters. Fig. 2a and Fig. 2b show the CPU and RAM usage while clustering the large dataset with the FCM function, and Table II summarises the time and resources the FCM algorithm took to converge for the different numbers of assigned clusters.
Fig. 1: Average CPU and memory usage during K-Means execution. (a) CPU, (b) RAM.
TABLE I: Time elapsed and resources used for K-Means clustering.

Clusters count | Time (seconds) | CPU used       | RAM used
5              | 161.178        | 21% of 4.0 GHz | 36% of 8.0 GB
10             | 244.642        | 27% of 4.0 GHz | 42% of 8.0 GB
15             | 338.345        | 36% of 4.0 GHz | 47% of 8.0 GB
20             | 409.618        | 48% of 4.0 GHz | 53% of 8.0 GB
25             | 484.013        | 55% of 4.0 GHz | 58% of 8.0 GB
Average        | 327.558        | 37.4%          | 47.2%
TABLE II: Time elapsed and resources used for FCM clustering.

Clusters count | Time (seconds) | CPU used       | RAM used
5              | 42.190         | 56% of 4.0 GHz | 65% of 8.0 GB
10             | 83.577         | 59% of 4.0 GHz | 67% of 8.0 GB
15             | 127.848        | 65% of 4.0 GHz | 75% of 8.0 GB
20             | 168.994        | 67% of 4.0 GHz | 87% of 8.0 GB
25             | 214.995        | 69% of 4.0 GHz | 91% of 8.0 GB
Average        | 127.520        | 63.2%          | 77.0%
Comparing the results in Table I and Table II, the average time measured for FCM to regroup the data was 127.520 seconds, while K-Means took an average of 327.558 seconds, roughly 2.6 times longer, to form the same numbers of clusters.
Fig. 2: Average CPU and memory usage during FCM execution. (a) CPU, (b) RAM.
On average, FCM used between 5 and 7 of the eight available cores, consuming 63.2 percent of the CPU processing power and 77 percent of the RAM. K-Means, on the other hand, utilised between 4 and 6 cores, with the rest remaining idle, averaging 37.4 percent of the CPU processing power and 47.2 percent of the RAM.
Overall, both algorithms scale to deal with Big Data, but FCM is faster and would make an excellent clustering algorithm for everyday computing. In addition, it offers added advantages, such as the ability to handle different data types [18]. Moreover, owing to its fuzzy partitioning capability, FCM can produce clustering output of better quality [19], which could benefit many data analysts.
V. CONCLUSIONS AND FUTURE WORK
A comparative case study of clustering a Big Urban Dataset using simple, readily available techniques was presented. K-Means and FCM were tested on a Big Dataset hosted on a PC intended for everyday computing. The presented techniques can be mobilised immediately as robust methods for partitional clustering of large datasets. However, FCM would be the better choice when speed and quality are priorities. In the near future, we plan to focus our attention on the quality of the clusters produced here and to compare more clustering techniques against other types of big datasets.
REFERENCES
[1] Zhai, Y., Ong, Y.-S., and Tsang, I., "The Emerging 'Big Dimensionality'," IEEE Computational Intelligence Magazine, 9(3), 14-26, 2014.
[2] Cull, B., "3 ways big data is transforming government," 2013.
[3] LIDAR Digital Terrain Model. Available upon licence from the Environment Agency: http://data.gov.uk/dataset/lidar-digital-surface-model. Last accessed 2nd April 2015.
[4] Koyuturk, M., Grama, A., and Ramakrishnan, N., "Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, April 2005.
[5] Jain, A. K., Murty, M. N., and Flynn, P. J., "Data clustering: a review," ACM Computing Surveys (CSUR), 31(3), 264-323, 1999.
[6] Yadav, C., Wang, S., and Kumar, M., "Algorithm and approaches to handle large Data - A Survey," arXiv preprint arXiv:1307.5437, 2013.
[7] Hall, L. O., Chawla, N., and Bowyer, K. W., "Decision Tree Learning on Very Large Data Sets," IEEE, Oct. 1998.
[8] Patil, D. V. and Bichkar, R. S., "A Hybrid Evolutionary Approach to Construct Optimal Decision Trees with Large Data Sets," IEEE, 2006.
[9] Sánchez-Díaz, G. and Ruiz-Shulcloper, J., "A Clustering Method for Very Large Mixed Data Sets," IEEE, 2001.
[10] Tan, P., Steinbach, M., and Kumar, V., "Cluster Analysis: Basic Concepts and Algorithms," in Introduction to Data Mining, Addison-Wesley, Boston, 2005.
[11] Namey, E., Guest, G., Thairu, L., and Johnson, L., "Data Reduction Techniques for Large Qualitative Data Sets," 2007.
[12] Looks, M., Levine, A., Covington, G. A., Loui, R. P., Lockwood, J. W., and Cho, Y. H., "Streaming Hierarchical Clustering for Concept Mining," IEEE, 2007.
[13] Lu, Y.-L. and Fahn, C.-S., "Hierarchical Artificial Neural Networks for Recognizing High Similar Large Data Sets," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Aug. 2007.
[14] Wang, S., Gan, W., Li, D., and Li, D., "Data Field for Hierarchical Clustering," International Journal of Data Warehousing and Mining, Dec. 2011.
[15] Karpinets, T. V., Park, B. H., and Uberbacher, E. C., "Analyzing large biological datasets with association networks," Nucleic Acids Research, 2012.
[16] Bezdek, J. C., Ehrlich, R., and Full, W., "FCM: The Fuzzy c-Means Clustering Algorithm," Computers and Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
[17] Al Shami, A., Lotfi, A., and Coleman, S., "Intelligent synthetic composite indicators with application," Soft Computing, 17(12), 2349-2364, 2013.
[18] Maimon, O. and Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook, Vol. 1, Springer, New York, 2005.
[19] Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Zomaya, A. Y., Khalil, I., et al., "A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis," IEEE, 2014.
[20] Guha, S., Rastogi, R., and Shim, K., "CURE: an efficient clustering algorithm for large databases," ACM SIGMOD Record, vol. 27, no. 2, pp. 73-84, 1998.
[21] Moretti, C., Steinhaeuser, K., Thain, D., and Chawla, N. V., "Scaling up Classifiers to Cloud Computers," Eighth IEEE International Conference on Data Mining (ICDM '08), pp. 472-481, Dec. 2008.
[22] Esteves, R. M., Pais, R., and Rong, C., "K-means Clustering in the Cloud - A Mahout Test," IEEE Workshops of the International Conference on Advanced Information Networking and Applications (WAINA), pp. 514-519, March 2011.
The Data Mining process encompasses many different specific techniques and algorithms that can be used to analyze the data and derive the discovered knowledge. An important problem regarding the results of the Data Mining process is the development of efficient indicators of assessing the quality of the results of the analysis. This, the quality assessment problem, is a cornerstone issue of the whole process because: i) The analyzed data may hide interesting patterns that the Data Mining methods are called to reveal. Due to the size of the data, the requirement for automatically evaluating the validity of the extracted patterns is stronger than ever. ii)A number of algorithms and techniques have been proposed which under different assumptions can lead to different results. iii)The number of patterns generated during the Data Mining process is very large but only a few of these patterns are likely to be of any interest to the domain expert who is analyzing the data. In this chapter we will introduce the main concepts and quality criteria in Data Mining. Also we will present an overview of approaches that have been proposed in the literature for evaluating the Data Mining results.