Clustering Big Urban Dataset

Ahmad Al Shami, Weisi Guo, Ganna Pogrebna

Warwick Institute for the Science of Cities, School of Engineering, The University of Warwick

Coventry, CV4 7AL, United Kingdom

Email: a.al-shami@warwick.ac.uk, weisi.guo@warwick.ac.uk

Abstract—Cities are producing and collecting massive amounts of data from various sources such as transportation networks, the energy sector, smart homes, tax records, surveys, LIDAR data, and mobile phone sensors. All of the aforementioned data, when connected via the Internet, fall under the Internet of Things (IoT) category. To use such a large volume of data for potential scientific computing benefits, it is important to store and analyse urban data using efficient computing resources and algorithms. However, this can be problematic due to many challenges. This article explores some of these challenges and tests the performance of two partitional algorithms for clustering Big Urban Datasets, namely the K-Means and the Fuzzy c-Means (FCM). Clustering Big Urban Data into a compact format that represents the information of the whole dataset can help researchers deal with the reorganised data far more efficiently. Our experiments conclude that FCM outperformed K-Means when presented with this type of dataset; however, the latter is lighter on hardware utilisation.

Index terms—Big Data; LIDAR; Fuzzy c-Means; K-Means; Hardware Utilisation; Smart City

I. INTRODUCTION

The challenges of Big Data stem from the 5Vs: Volume, Velocity, Variety, Veracity, and the Value to be gained from the analysis of Big Data [1]. Many researchers deal with different types of datasets, and the concern here is whether to introduce a new algorithm or to adapt existing ones to suit large datasets. Currently, two approaches are predominant. The first, known as Scaling-Up, focuses efforts on enhancing the available algorithms. This approach risks the algorithms becoming obsolete as the data continues to grow; to cope with datasets that grow continuously in size, it would be necessary to scale up the algorithms again and again as time moves on. The second approach is to Scale-Down, i.e. to reduce the data itself and to use existing algorithms on the skimmed version of the data after its size has been reduced. Scaling down the data risks losing valuable information through the summarising and size-reduction techniques. Still, it is argued that scaling down may only risk information that is comparatively unimportant or redundant. Since there is still great scope for research in both areas, this article focuses on the scale-down of datasets by comparing clustering techniques. Clustering is defined as the process of grouping a set of items or objects which share the same attributes or characteristics.

II. K-MEANS VS. FUZZY C-MEANS

To highlight the advantages for scientific computing on Big Data, and to avoid the above-mentioned disadvantages of hierarchical clustering techniques, this article focuses on comparing two popular and computationally attractive partitional techniques:

1) K-Means: This is a widely used clustering algorithm that partitions a data set into K clusters (C_1, C_2, ..., C_K), each represented by its arithmetic mean, called the centroid, which is calculated as the average of all data points (records) belonging to that cluster.

2) Fuzzy c-Means (FCM): Introduced by [16], FCM is derived from the K-Means concept for the purpose of clustering datasets, but it differs in that an object may belong to more than one cluster at the same time, with a certain degree of belonging to each cluster.

The FCM clustering is obtained by minimising an objective function at each iteration: the objective function is minimised to find the best locations for the clusters, and its value is tracked across iterations. Fuzzy clusters can be characterised by a class membership function matrix; the cluster centres are determined first at the learning stage, and the classification is then made by comparing the Euclidean distances between the incoming features and each cluster centre [17]. For a data set represented as X = {x_1, x_2, ..., x_j, ..., x_n} ⊂ R^s partitioned into c clusters, where 1 < c < n, the fuzzy clusters can be characterised by a c × n membership function matrix U, whose entries satisfy the following conditions:

\[
\sum_{i=1}^{c} u_{i,j} = 1, \qquad j = 1, 2, \ldots, n \tag{1}
\]

\[
0 < \sum_{j=1}^{n} u_{i,j} < n, \qquad i = 1, 2, \ldots, c \tag{2}
\]

where u_{i,j} is the grade of membership of the data entry x_j in the i-th cluster. Cluster centres are determined initially at the learning stage. Then, the classification is made by comparing the distances between the data points and the cluster centres. Clusters are obtained by minimising the following cost function via an iterative scheme:

\[
J(U, V) = \sum_{j=1}^{n} \sum_{i=1}^{c} (u_{i,j})^{2} \, \lVert x_j - v_i \rVert^{2} \tag{3}
\]

where V = {v_1, v_2, ..., v_i, ..., v_c} are the c vectors of cluster centres, with v_i representing the centre of the i-th cluster. To calculate the centre of each cluster, the following iterative algorithm is used.

a) Estimate the class membership matrix U.

b) Calculate the vectors of cluster centres V = {v_1, v_2, ..., v_i, ..., v_c} using the following expression:

\[
v_i = \frac{\sum_{j=1}^{n} (u_{i,j})^{2} x_j}{\sum_{j=1}^{n} (u_{i,j})^{2}}, \qquad i = 1, 2, \ldots, c \tag{4}
\]

c) Update the class membership matrix U with:

\[
u_{i,j} = \frac{1}{\sum_{r=1}^{c} \left( \lVert x_j - v_i \rVert / \lVert x_j - v_r \rVert \right)^{2}}, \qquad i = 1, \ldots, c; \; j = 1, \ldots, n \tag{5}
\]

d) If the control error, defined as the difference between two consecutive iterations of the membership matrix U, is less than a pre-specified value, then the process can stop. Otherwise, the process repeats from step (b).

After a number of iterations, the cluster centres will satisfy the minimisation of the cost function J to a local minimum [17]. A minimal worked sketch of this iteration is given below.
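To make the iteration concrete, the following is a minimal NumPy sketch of Eqs. (1)-(5), with the fuzzifier fixed at m = 2 to match the squared memberships in Eqs. (3) and (4). It is an illustrative implementation under those assumptions, not the authors' code; the function name fcm and the parameters tol, max_iter, and seed are our own choices.

import numpy as np
from scipy.spatial.distance import cdist

def fcm(X, n_clusters, tol=1e-5, max_iter=300, seed=0):
    """Fuzzy c-Means on X of shape (n, s); returns memberships U and centres V."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step (a): random initial membership matrix U (c x n); normalising the
    # columns enforces Eq. (1), i.e. each point's memberships sum to one.
    U = rng.random((n_clusters, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        # Step (b), Eq. (4): centres as membership-weighted means of the data.
        W = U ** 2                                     # (u_ij)^2 since m = 2
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Step (c), Eq. (5): memberships from Euclidean distances to centres.
        d = np.fmax(cdist(V, X), np.finfo(float).eps)  # ||x_j - v_i||, shape (c, n)
        inv = 1.0 / d ** 2
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Step (d): stop when U changes by less than the pre-specified value.
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V

Hard labels comparable to a K-Means assignment can then be read off with U.argmax(axis=0).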

III. EXPERIMENT SETUP

A LIDAR dataset was used for this experiment, as LIDAR data is gaining importance for urban planning applications such as floodplain mapping, hydrology, geomorphology, landscape ecology, coastal engineering, survey assessments, and volumetric calculations. The experiments were carried out to compare how the candidate K-Means and FCM clustering techniques cope with clustering a big urban dataset on mid-range computer hardware. They were performed using an AMD 8320 4.1 GHz, 8-core processor with 8 GB of RAM, running a 64-bit Windows 8.1 operating system. The algorithms were run against LIDAR data points [3] taken for our campus location at latitude 52.23°-52.22° and longitude 1.335°-1.324°. This location represents the University of Warwick main campus, with an initialization of 1,000,000 × 1,000 digital surface data points. A hypothetical sketch of such a timing run is given below.
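The paper does not include the experiment harness, so the following is only a hypothetical sketch of how such runs could be timed. The random matrix is a stand-in for the licensed LIDAR tiles [3], which cannot be redistributed; its shape is reduced so the sketch runs on modest hardware; the fcm helper is the sketch from Section II; and scikit-learn's KMeans is assumed for the K-Means baseline rather than the authors' implementation.

import time
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the LIDAR digital surface matrix [3]; the paper initialises
# 1,000,000 x 1,000 points, reduced here so the sketch fits in ordinary RAM.
X = np.random.default_rng(1).random((100_000, 100)).astype(np.float32)

for k in (5, 10, 15, 20, 25):        # cluster counts used in the paper
    t0 = time.perf_counter()
    KMeans(n_clusters=k, n_init=1, random_state=0).fit(X)
    print(f"K-Means, k={k:2d}: {time.perf_counter() - t0:7.1f} s")

    t0 = time.perf_counter()
    fcm(X, n_clusters=k)             # FCM sketch from Section II
    print(f"FCM,     k={k:2d}: {time.perf_counter() - t0:7.1f} s")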

IV. COMPARATIVE ANALYSIS

1) K-Means clustering: This clustering technique was applied to the specified dataset, starting with a small cluster number, K = 5, and gradually increasing to K = 25 clusters. Fig. 1 shows how, on average, the hardware fared in obtaining the desired number of K clusters, and Table I lists summary statistics of the elapsed time and resources used by the K-Means algorithm to converge.

2) FCM clustering: This clustering technique was also applied to the same generated dataset, with the cluster number starting at 5 and gradually increasing to 25. Fig. 2(a) and Fig. 2(b) show the CPU and RAM usage while executing the large dataset with the FCM clustering function, and Table II lists a summary of the time and resources it took the FCM algorithm to converge for the different numbers of assigned clusters.

Fig. 1: Average CPU and memory usage during K-Means execution. (a) CPU, (b) RAM.

TABLE I: Time elapsed and resources used for K-Means clustering.

Clusters | Time (s) | CPU used       | RAM used
5        | 161.178  | 21% of 4.0 GHz | 36% of 8.0 GB
10       | 244.642  | 27% of 4.0 GHz | 42% of 8.0 GB
15       | 338.345  | 36% of 4.0 GHz | 47% of 8.0 GB
20       | 409.618  | 48% of 4.0 GHz | 53% of 8.0 GB
25       | 484.013  | 55% of 4.0 GHz | 58% of 8.0 GB
Average  | 327.558  | 37.4%          | 47.2%

TABLE II: Time elapsed and resources used for FCM clustering.

Clusters | Time (s) | CPU used       | RAM used
5        | 42.190   | 56% of 4.0 GHz | 65% of 8.0 GB
10       | 83.577   | 59% of 4.0 GHz | 67% of 8.0 GB
15       | 127.848  | 65% of 4.0 GHz | 75% of 8.0 GB
20       | 168.994  | 67% of 4.0 GHz | 87% of 8.0 GB
25       | 214.995  | 69% of 4.0 GHz | 91% of 8.0 GB
Average  | 127.520  | 63.2%          | 77.0%

By comparing the results in Table I and Table II, it is clear that the average time measured for FCM to regroup the data was 127.520 seconds, while it took K-Means an average of 327.558 seconds to form the same numbers of clusters.

Fig. 2: Average CPU and memory usage during FCM execution. (a) CPU, (b) RAM.

On average, FCM used between 5 and 7 of the eight available cores, with 63.2 percent of the CPU processing power and 77 percent of the RAM. K-Means, on the other hand, utilised between 4 and 6 cores, with the rest remaining idle, averaging 37.4 percent of the CPU processing power and 47.2 percent of the RAM. A sketch of how such utilisation figures could be sampled is given below.
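The paper does not state how the CPU and RAM figures were obtained. One plausible approach, sketched below, polls system-wide utilisation from a background thread while a clustering call runs; the third-party psutil package and the run_with_monitor helper are assumptions, not the authors' tooling.

import threading
import psutil

def run_with_monitor(fn, interval=1.0):
    """Run fn() while sampling system CPU and RAM; return result and averages."""
    samples = []
    stop = threading.Event()

    def poll():
        # psutil.cpu_percent blocks for `interval` seconds per sample.
        while not stop.is_set():
            samples.append((psutil.cpu_percent(interval=interval),
                            psutil.virtual_memory().percent))

    monitor = threading.Thread(target=poll, daemon=True)
    monitor.start()
    result = fn()                      # e.g. lambda: fcm(X, n_clusters=25)
    stop.set()
    monitor.join()
    avg_cpu = sum(c for c, _ in samples) / max(len(samples), 1)
    avg_ram = sum(r for _, r in samples) / max(len(samples), 1)
    return result, avg_cpu, avg_ram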


Overall, both algorithms are scalable enough to deal with Big Data, but FCM is fast and would make an excellent clustering algorithm for everyday computing. In addition, it offers added advantages, such as the ability to handle different data types [18]. Moreover, owing to its fuzzy partitioning capability, FCM can produce a better quality of clustering output [19], which could benefit many data analysts.

V. CONCLUSIONS AND FUTURE WORK

A comparative case study of clustering a Big Urban Dataset using handy and simple techniques has been presented. The K-Means and FCM algorithms were tested on a Big Dataset hosted on an everyday-computing PC. The presented techniques can be mobilised instantly as robust methods for handling partitional clustering of a large dataset with ease. However, FCM would be the better choice if speed and quality are priorities. In the near future we plan to focus our attention on the quality of the clusters produced here and to compare more clustering techniques against other types of big datasets.

REFERENCES

[1] Y. Zhai, Y. Ong, and I. Tsang, "The Emerging 'Big Dimensionality'," IEEE Computational Intelligence Magazine, 9(3), 14-26, 2014.

[2] B. Cull, "3 ways big data is transforming government," 2013.

[3] LIDAR Digital Terrain Model. Available upon license from the Environment Agency: http://data.gov.uk/dataset/lidar-digital-surface-model. Last accessed 2nd April 2015.

[4] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, April 2005.

[5] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR), 31(3), 264-323, 1999.

[6] C. Yadav, S. Wang, and M. Kumar, "Algorithm and approaches to handle large Data: A Survey," arXiv preprint arXiv:1307.5437, 2013.

[7] L. O. Hall, N. Chawla, and K. W. Bowyer, "Decision Tree Learning on Very Large Data Sets," IEEE, Oct. 1998.

[8] D. V. Patil and R. S. Bichkar, "A Hybrid Evolutionary Approach to Construct Optimal Decision Trees with Large Data Sets," IEEE, 2006.

[9] G. Sánchez-Díaz and J. Ruiz-Shulcloper, "A Clustering Method for Very Large Mixed Data Sets," IEEE, 2001.

[10] P. Tan, M. Steinbach, and V. Kumar, "Cluster Analysis: Basic Concepts and Algorithms," in Introduction to Data Mining, Addison-Wesley, Boston, 2005.

[11] E. Namey, G. Guest, L. Thairu, and L. Johnson, "Data Reduction Techniques for Large Qualitative Data Sets," 2007.

[12] M. Looks, A. Levine, G. A. Covington, R. P. Loui, J. W. Lockwood, and Y. H. Cho, "Streaming Hierarchical Clustering for Concept Mining," IEEE, 2007.

[13] Y.-L. Lu and C.-S. Fahn, "Hierarchical Artificial Neural Networks for Recognizing Highly Similar Large Data Sets," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, August 2007.

[14] S. Wang, W. Gan, D. Li, and D. Li, "Data Field for Hierarchical Clustering," International Journal of Data Warehousing and Mining, Dec. 2011.

[15] T. V. Karpinets, B. H. Park, and E. C. Uberbacher, "Analyzing large biological datasets with association networks," Nucleic Acids Research, 2012.

[16] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: The Fuzzy c-Means Clustering Algorithm," Computers and Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.

[17] A. Al Shami, A. Lotfi, and S. Coleman, "Intelligent synthetic composite indicators with application," Soft Computing, 17(12), 2349-2364, 2013.

[18] O. Z. Maimon and L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, Vol. 1, Springer, New York, 2005.

[19] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, A. Y. Zomaya, I. Khalil, ... and A. Bouras, "A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis," IEEE, 2014.

[20] S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in ACM SIGMOD Record, vol. 27, no. 2, pp. 73-84, ACM, 1998.

[21] C. Moretti, K. Steinhaeuser, D. Thain, and N. V. Chawla, "Scaling up Classifiers to Cloud Computers," in Data Mining, 2008 (ICDM '08), Eighth IEEE International Conference on, pp. 472-481, 15-19 Dec. 2008.

[22] R. M. Esteves, R. Pais, and C. Rong, "K-means Clustering in the Cloud – A Mahout Test," in Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on, pp. 514-519, 22-25 March 2011.