Cruncher: Distributed In-Memory Processing for
Location-Based Services
Ahmed S. Abdelhamid, Mingjie Tang, Ahmed M. Aly, Ahmed R. Mahmood, Thamir Qadah,
Walid G. Aref, Saleh Basalamah
Purdue University, West Lafayette, IN, USA
Umm Al-Qura University, Makkah, KSA
Abstract—Advances in location-based services (LBS) demand
high-throughput processing of both static and streaming data.
Recently, many systems have been introduced to support
distributed main-memory processing in order to maximize query
throughput. However, these systems are not optimized for
spatial data processing. In this demonstration, we showcase
Cruncher, a distributed main-memory spatial data warehouse and
streaming system. Cruncher extends Spark with adaptive query
processing techniques for spatial data. Cruncher uses dynamic
batch processing to distribute the queries and the data streams
over commodity hardware according to an adaptive partitioning
scheme. The batching technique also groups and orders the
overlapping spatial queries to enable inter-query optimization.
Both the data streams and the offline data share the same
partitioning strategy, which allows for data co-locality
optimization. Furthermore, Cruncher uses an adaptive caching
strategy to maintain the frequently used location data in main
memory. Cruncher maintains operational statistics to optimize
query processing, data partitioning, and caching at runtime. We
demonstrate two LBS applications over Cruncher using real
datasets from OpenStreetMap and two synthetic data streams, and
show that Cruncher achieves order(s)-of-magnitude throughput
improvement over Spark when processing spatial data.
I. INTRODUCTION
The popularity of location-based services (LBS, for short)
has resulted in an unprecedented increase in the volume
of spatial information. In addition to the location attributes
(e.g., longitude and latitude), the created data may include a
temporal component (e.g., timestamp), and other application-
driven attributes (e.g., check-in data, identifiers for moving
objects, and associated textual content) [1]. Applications span a
wide range of services, e.g., tracking moving objects, location-
based advertisement, online gaming, etc. Although these LBSs
vary according to the nature of the underlying application, they
share the need for high-throughput processing, low latency,
support for adaptivity due to changes in location data
distribution over time, and efficient utilization of the computing
resources. This demands efficient processing of high-rate
spatial data streams as well as huge amounts of static
spatial data, e.g., OpenStreetMap. Moreover, the worldwide
use of LBS applications requires processing of spatial queries
at an unprecedented scale. For instance, LBSs are required to
maintain information for tens if not hundreds of millions of
users in addition to huge amounts of other service-associated
data (e.g., maps and road networks), while processing millions
of user requests and data updates per second.
Cloud computing platforms, where hardware cost is asso-
ciated with usage rather than ownership, call for enhancing
the query processing and storage efficiency. Furthermore, the
dynamic nature of location data, especially spatial data streams
and workloads, renders the conventional optimize-then-execute
model inefficient and calls for adaptive query processing
techniques, where statistics are collected to fine-tune the query
processing and storage at runtime (e.g., see [2]).
One aspect that distinguishes LBSs is query complexity. In
contrast to enterprise data applications, LBS queries are more
sophisticated and can involve combinations of spatial, temporal,
and relational operators, e.g., see [3], [4]. Some of these
operators are expensive, e.g., k-nearest-neighbor (kNN) [5].
To address these challenges, various parallel and distributed
systems have been customized to handle location data, e.g.,
MD-HBase [1], Hadoop-GIS [6], SpatialHadoop [7], Parallel
Secondo [8], and Tornado [9]. These systems share a common goal:
to store and query big spatial data over shared-nothing
commodity machines. However, they suffer from disk bottlenecks
and provide no provisions for adaptive query processing.
Recently, the significant drop in main-memory cost has
initiated a wave of distributed main-memory processing systems.
Spark [10] exemplifies this computing paradigm. Spark provides
a shared-memory abstraction using Resilient Distributed
Datasets (RDDs, for short) [11]. RDDs are immutable and support
only coarse-grained operations (referred to as transformations).
RDDs keep the history of transformations (referred to as
lineage) for fault tolerance. RDDs are lazily evaluated and
ephemeral: an RDD transformation is computed only upon data
access (referred to as actions), and data is kept in memory
only upon deliberate request. In addition, Spark supports
near-real-time data stream processing through small batches
represented as RDDs (referred to as Discretized Streams) [12].
However, Spark is not optimized for spatial data processing and
makes no assumptions about the underlying data or query types.
This demonstration presents Cruncher, a distributed spatial
data warehouse and streaming system. Cruncher provides
high-throughput processing of online and offline spatial data.
Cruncher extends Spark with adaptive query processing
techniques. By default, Spark processes data stream records in
order of arrival. However, processing a batch of data elements
or queries offers an opportunity for optimization and renders
a fixed batch ordering sub-optimal. Hence, Cruncher introduces
a new batching technique in which the system dynamically
reorders the batch content to update the RDDs efficiently. In
addition, processing a batch of multiple queries offers an
opportunity for multi-query optimization, and hence Cruncher
introduces an inter-query optimization technique for range and
kNN queries. Furthermore, Spark speeds up data processing by
partitioning the data in main memory.
[Fig. 1. An overview of Cruncher: the Master hosts the SQL parser,
query optimizer, batching manager, partitioning manager, cache
manager, and garbage collector, along with a global index (k-d
tree), a data catalog (fine grid), and the lineage graph; Workers
1..N provide distributed in-memory caching over incoming stream
batches and offline data (e.g., maps), and produce the query
answers.]
However, static partitioning of spatial data over a distributed
memory is not robust in case of dynamic workloads and
dynamic data distributions. Hence, Cruncher uses an adaptive
partitioning technique that recognizes the query/data hot spots,
and incrementally updates the data partitioning to minimize
redundant processing. Finally, Cruncher introduces a garbage
collector to remove outdated and obsolete RDDs from the
main memory. Cruncher uses three types of runtime statistics:
a global index of location data partitions, a fine-granularity
grid that maintains count statistics for the location-based
data and queries, and a lineage graph that tracks the RDD
transformations. We demonstrate Cruncher’s capabilities using
data streams of moving objects from BerlinMod [5] and real
static datasets from OpenStreetMap [13].
II. OVERVIEW OF CRUNCHER
A. Supported Features
Below, we summarize the main features of Cruncher:
High Throughput Processing. Cruncher uses dynamic batching
and inter-query optimization to achieve high query
throughput given the underlying resources. Cruncher achieves
orders of magnitude throughput improvement over Spark.
Online and Offline Processing. In addition to offline data
(e.g., maps), Cruncher handles three types of location data
streams, namely, a stream of user queries, a stream of data
updates (to be applied to the offline data), and application data
streams. Cruncher can process queries with relational and spatial
predicates against combinations of all these data sources.
Adaptive Main-Memory Data Partitioning. RDDs make no
assumptions about the properties of the underlying data or the
incoming queries. In contrast, Cruncher adaptively updates the
RDD partitioning at runtime to cope with changes in the query
workload and data distribution, and hence consistently
maximizes the query throughput.
Efficient Memory Evacuation Policy. Because Cruncher
relies on RDD transformations for query processing and data
updates, multiple RDDs may co-exist in memory carrying
data for the same spatial range, fully or partially. For efficient
memory use, only a single copy of the data, reflecting the most
recent updates, should be maintained per spatial region. Cruncher
utilizes an efficient garbage collector with spatial awareness to
eliminate in-memory duplicates.
Interactive Map-Assisted GUI. Cruncher provides an
interactive GUI that extends Apache Zeppelin [14] to support
SQL-like and map-based querying.
Light-Weight Fault Tolerance. Cruncher extends the lineage-
based fault tolerance mechanism of the RDD model [11]. By
associating runtime statistics with the RDD transformations,
Cruncher persists the updated RDDs efficiently on disk.
B. Supported Queries
Cruncher aims to support queries that include both spatial
and relational predicates, where multiple spatial predicates
can appear within a single query. The supported spatial
predicates include range, kNN select, kNN join, and spatial
join. Cruncher also supports temporal and textual predicates.
Queries can run against online streams or offline (i.e., static)
data, and can be snapshot or continuous. Examples of the
supported queries are presented in Section IV.
C. Data Model
An LBS can store data about stationary as well as moving
objects and queries. The following updates are continuously
received by Cruncher: 1) periodic updates for the locations of
the moving objects/queries and their associated data, e.g., the
time of the update and the text associated with the new location
(e.g., tweets from the new location); 2) service updates for
the stationary data; and 3) queries that include spatial, temporal,
and textual predicates. The stationary and moving objects have
the format {object-identifier, location, timestamp, relational
data, free text}. The queries can be expressed in SQL,
from which the following format is extracted: {query-identifier,
timestamp, location, predicate-list}.
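For concreteness, the following Scala sketch shows one possible
encoding of these two record formats; the type and field names are
illustrative assumptions and do not reflect Cruncher's internal
representation.

case class Location(lon: Double, lat: Double)

// One possible shape of a stationary/moving-object record:
// {object-identifier, location, timestamp, relational data, free text}.
case class ObjectRecord(
  objectId: Long,
  location: Location,
  timestamp: Long,
  relational: Map[String, String], // application-driven attributes
  text: String                     // free text, e.g., an attached tweet
)

// The form extracted from a SQL query:
// {query-identifier, timestamp, location, predicate-list}.
case class QueryRecord(
  queryId: Long,
  timestamp: Long,
  location: Location,              // e.g., a kNN focal point
  predicates: List[String]
)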
III. IN-MEMORY PROCESSING IN CRUNCHER
Cruncher employs a set of techniques that achieve efficient
distributed main-memory processing of spatial data. This sec-
tion highlights each of these techniques.
A. Adaptive Partitioning with On-Demand Indexing
Cruncher dynamically partitions in-memory data to
redistribute it over the cluster. The objective is to repartition the
RDDs based on the query workload and data updates, such
that a query operates on a minimal set of data required to
retrieve its answer. Cruncher’s Partitioning Manager extends
our work in [2]. As in [2], two global indices are maintained.
Refer to Fig. 1. A k-d tree index represents the current
data partitioning scheme, and a fine-grained grid maintains
the count of data points and queries at each grid cell.
[Fig. 2. Dynamic Batch Processing: a worker's portion of the global
k-d tree index routes stream sub-batches of data updates [D1, D2,
..., DN] and queries [Q1, Q2, ..., QN] into interleaved groups of
single updates and single queries.]
The
repartitioning is incrementally triggered based on a cost model
that minimizes redundant data processing. The cost model
integrates the number of points and queries per partition. In
Cruncher, we introduce two extensions. First, we make the
cost model aware of the data updates. Second, when applying
a batch of designated queries on a particular data partition, we
consider the nature of the underlying operators. For example,
in the case of extensive use of kNN operators in a certain
partition, we can dynamically build a suitable index for this
partition to facilitate the execution of the kNN operators within
this partition. Subsequently, the index can be invalidated and
removed upon data updates or partitioning changes, or simply
to preserve memory when the index is not needed anymore.
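As a simplified illustration of this strategy, the following Scala
sketch gives one way the cost-based split trigger and the on-demand
indexing decision could look; the cost function, thresholds, and
names are assumptions for exposition, not Cruncher's actual code.

case class PartitionStats(numPoints: Long, numQueries: Long, numUpdates: Long)

// Cost model: redundant work in a partition grows with both its data
// volume and the number of queries/updates that scan it.
def cost(s: PartitionStats): Double =
  (s.numQueries + s.numUpdates) * s.numPoints.toDouble

// Incrementally split a partition (e.g., a k-d tree median split) only
// when its cost clearly dominates the average; the factor is illustrative.
def shouldSplit(s: PartitionStats, avgCost: Double, factor: Double = 2.0): Boolean =
  cost(s) > factor * avgCost

// On-demand indexing: build a per-partition kNN index only when kNN
// operators dominate the partition's recent workload.
def needsKnnIndex(knnOps: Long, totalOps: Long, ratio: Double = 0.5): Boolean =
  totalOps > 0 && knnOps.toDouble / totalOps > ratio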
B. Dynamic Batch Processing
Cruncher partitions the incoming batches of data updates
and queries into small sub-batches according to the data
partitioning scheme. Each sub-batch is sent to the machine
responsible for processing and caching the data for its spatial
region. This partitioning is dynamic and can differ for
every batch, as discussed in Section III-A. Furthermore, Cruncher
groups and sorts the sub-batches for processing. Recall that in
the RDD model, every update triggers the creation of a new
RDD. To handle frequent updates common in LBSs, Cruncher
minimizes the number of transformations required to update
the data in order to reduce the writing overhead. Cruncher
supports two modes of operation: consistent and greedy. In
the former, updates and queries are grouped. Correct
evaluation is guaranteed by ensuring that a query never
processes an item for which an update with a timestamp
preceding that of the query is pending; such an update is
applied first. Fig. 2 shows how a batch of queries/updates
is sorted and grouped as a series of transformations. The
grouping and sorting are based on the timestamps and the
spatial regions of the updates/queries. In the greedy mode,
the sub-batch is divided into two transformations only (i.e.,
queries vs. updates). The queries are applied first, and then
the updates. This mode is suitable for LBS applications with
relaxed correctness criteria but that are sensitive to latency.
Minimizing the number of RDD creations leads to shorter
lineage, and hence reduces the overhead during crash recovery.
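The following Scala sketch contrasts the two modes over a single
sub-batch; the event types and the run-length grouping are
simplifying assumptions based on the description above.

sealed trait Event { def timestamp: Long }
case class Update(timestamp: Long, partition: Int) extends Event
case class Query(timestamp: Long, partition: Int) extends Event

// Consistent mode: sort a sub-batch by timestamp, then collapse
// consecutive runs of the same kind into groups, so a query never
// runs before an earlier update has been applied.
def consistentGroups(subBatch: Seq[Event]): Seq[Seq[Event]] = {
  val sorted = subBatch.sortBy(_.timestamp)
  sorted.foldLeft(Vector.empty[Vector[Event]]) { (groups, e) =>
    groups.lastOption match {
      case Some(g) if g.head.getClass == e.getClass =>
        groups.init :+ (g :+ e)     // extend the current run
      case _ => groups :+ Vector(e) // start a new update/query group
    }
  }
}

// Greedy mode: exactly two groups per sub-batch, queries first, then
// updates, trading strict ordering for fewer RDD transformations.
def greedyGroups(subBatch: Seq[Event]): Seq[Seq[Event]] = {
  val (queries, updates) = subBatch.partition(_.isInstanceOf[Query])
  Seq(queries, updates).filter(_.nonEmpty)
}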
C. Multi-Query Optimization
Batch query processing creates opportunities for multi-query
optimization. Cruncher applies location-specific inter-query
optimization, e.g., based on the containment in minimum
bounding rectangles (MBRs, for short).
[Fig. 3. Multi-query Optimization: three MBR cases for a query group
(perfect containment, major partial containment with an introduced
container MBR, and distinct groups); the queries are evaluated either
in ascending order of MBR containment as chained map stages, or via
a grouped execution that joins all the queries with the data of the
mutual containing MBR.]
Consider a location-based query, say Q1, whose MBR is contained
within the MBR of another query, say Q2. Cruncher sorts the queries
within a batch based on MBR containment. Fig. 3 illustrates three
cases for sorting a query group based on their MBRs. The queries
are executed using one of two approaches. First, queries can be
executed sequentially as a series of transformations in descending
order of MBR size. Alternatively, we
can join the predicates with the subset of the data that covers
all these predicates. In other words, we apply one transformation
to get the data of the biggest MBR, and then join with all the
predicates. In the first approach, an RDD is defined per query,
whereas in the latter approach, there is one and only one RDD,
where each tuple is tagged with the queries it satisfies.
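The following Scala sketch contrasts the two approaches for a
containment-ordered query group; the MBR type and the use of
Spark's RDD API are illustrative assumptions, and the chained
variant is correct only when each MBR contains the next.

import org.apache.spark.rdd.RDD

case class MBR(xMin: Double, yMin: Double, xMax: Double, yMax: Double) {
  def contains(x: Double, y: Double): Boolean =
    x >= xMin && x <= xMax && y >= yMin && y <= yMax
  def area: Double = (xMax - xMin) * (yMax - yMin)
}
case class Point(x: Double, y: Double)

// Approach 1: chain one filter transformation per query, largest MBR
// first, so each query scans the (smaller) output of the previous one.
def chained(data: RDD[Point], queries: Seq[(Long, MBR)]): Seq[(Long, RDD[Point])] = {
  val byArea = queries.sortBy { case (_, m) => -m.area }
  var current = data
  byArea.map { case (qid, mbr) =>
    current = current.filter(p => mbr.contains(p.x, p.y))
    (qid, current)
  }
}

// Approach 2: a single transformation over the enclosing MBR that tags
// each tuple with every query it satisfies.
def tagged(data: RDD[Point], queries: Seq[(Long, MBR)],
           enclosing: MBR): RDD[(Point, Seq[Long])] =
  data.filter(p => enclosing.contains(p.x, p.y))
      .map(p => (p, queries.collect { case (qid, m) if m.contains(p.x, p.y) => qid }))

The chained variant yields one RDD per query, while the tagged
variant yields a single RDD whose tuples carry the identifiers of
the queries they satisfy, mirroring the trade-off described above.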
D. Distributed Workload-aware Caching
Continuously applying RDD transformations can result in
multiple copies of the same data in main memory (i.e., in
different RDDs). Recall that updating an RDD creates a new
RDD with the update. Also, applying a spatial query on an
RDD creates a new RDD for the query answer, holding a
subset of the original RDD. To avoid these multiple copies,
Cruncher applies a caching mechanism that: 1) increases the
memory hit-ratio by keeping the frequently accessed data in
main memory, and 2) reduces the replication factor of data in
memory. Fig. 4 shows how a spatial index (a grid with fine
granularity) maintains access-counters in the grid cells as well
as coverage relationships among the RDDs and the grid cells.
For each grid cell, Cruncher keeps a usage counter and the
time of last access. This information supports LRU or ARC
cache replacement policies when the data does not fit in main
memory. In addition, the Garbage Collector uses the coverage
relationship to keep in memory only the most updated RDDs
for each spatial region and evacuate outdated RDDs.
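The following Scala sketch outlines the per-cell statistics and the
decisions they enable; the structure and method names are
assumptions based on the description above.

import scala.collection.mutable

case class CellStats(
  var objects: Long = 0,
  var queries: Long = 0,
  var updates: Long = 0,
  var lastAccessed: Long = 0L,
  // identifiers of RDDs covering this cell, newest version last
  coveringRdds: mutable.ArrayBuffer[Int] = mutable.ArrayBuffer.empty
)

class CacheGrid(rows: Int, cols: Int) {
  private val cells = Array.fill(rows, cols)(CellStats())

  // Record an access to a cell, e.g., when a query touches it.
  def touch(r: Int, c: Int, now: Long): Unit = {
    val s = cells(r)(c)
    s.queries += 1
    s.lastAccessed = now
  }

  // LRU victim selection when the data does not fit in main memory:
  // the least-recently accessed cell's data is evicted first.
  def lruVictim(): CellStats =
    cells.flatten.minBy(_.lastAccessed)

  // Garbage collection: keep only the most recent RDD covering a cell
  // and report the stale, outdated versions for eviction.
  def staleRdds(r: Int, c: Int): Seq[Int] = {
    val covering = cells(r)(c).coveringRdds
    if (covering.size <= 1) Seq.empty else covering.init.toSeq
  }
}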
E. Fault Tolerance
Cruncher relies on the lineage-based fault-tolerance mech-
anism of RDDs. However, applying many transformations on
the RDDs results in long lineage chains. It is vital to keep the
lineage graph manageable by forcing periodic persistence of
RDDs (i.e., saving them to disk) when necessary; otherwise,
the re-computation of the RDDs in case of failure can be
more expensive than reading from disk.
[Fig. 4. Spatial-aware Caching: the global spatial index (k-d tree)
adaptively maintains the MBR of each data partition; each cell of
the fine-grid catalog keeps track of the RDDs covering its spatial
region (the spatially partitioned original data and query-answer
RDDs, e.g., spatial ranges) and maintains the number of objects,
queries, and updates, and the last accessed time.]
Cruncher uses a
simple, yet effective, technique to maintain a manageable
lineage chain. Cruncher keeps track of the processing time
of all the transformations. When the total processing time of
a sequence of transformations exceeds the expected reading
time from disk, checkpointing is triggered to save the RDD
to disk. Observe that Cruncher keeps track of the computed
transformations only. The transformations yet to be computed
are not counted.
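The following Scala sketch captures this checkpointing rule; the
lineage representation and the disk-read estimate are simplifying
assumptions, not Cruncher's actual code.

// One lineage node per computed transformation, with its measured
// processing time; the chain ends at the last checkpoint.
case class LineageNode(computeTimeMs: Long, parent: Option[LineageNode])

// Sum the measured compute times of all computed transformations
// back to the last checkpoint (transformations not yet computed
// never appear in the chain and are thus not counted).
def recomputeCostMs(node: LineageNode): Long =
  node.computeTimeMs + node.parent.map(recomputeCostMs).getOrElse(0L)

// Estimate the disk read time from the RDD's size and the cluster's
// aggregate read bandwidth, both assumed known from runtime statistics.
def diskReadCostMs(sizeBytes: Long, bandwidthBytesPerMs: Long): Long =
  sizeBytes / bandwidthBytesPerMs

// Checkpoint the RDD once recomputing its lineage would cost more
// than re-reading it from disk.
def shouldCheckpoint(node: LineageNode, sizeBytes: Long,
                     bandwidthBytesPerMs: Long): Boolean =
  recomputeCostMs(node) > diskReadCostMs(sizeBytes, bandwidthBytesPerMs)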
IV. DEMO SCENARIO
We demonstrate Cruncher’s capabilities using two applica-
tions, where we use a real dataset of points of interest from
OpenStreetMap [13] and synthetic datasets of moving objects
from the BerlinMod Benchmark [5]. We append a textual
description to each moving object. We generate a synthetic
data stream that simulates offers (e.g., coupons) made by the
restaurants in the OpenStreetMap dataset.
Online Data Processing: In this scenario, the user locates her
nearby friends who have certain text associated with them. This
query runs against a stream of moving objects, and can be
expressed as follows:
RUN QUERY q1 AS
SELECT kNN FROM Friends AS F
WHERE CONTAINS(F.text, 'Sam')
and kNN.k=3 and kNN.Focal(@Current_Location);
Online and Offline Data Processing: In this scenario, the
user gets notified of offers (e.g., coupons, sales) applicable to
the restaurants that are inside a specific spatial region. This is a
continuous query that requires hybrid processing of an online
stream (i.e., offers) and offline data in the map. The query can
be expressed as follows:
REGISTER QUERY q2 AS
SELECT * FROM OSM_Data AS O, OFFERS AS F
WHERE INSIDE(O, @Spatial_Range)
and CONTAINS(F.text, 'Coupon', 'Sale')
and OVERLAPS(O.text, F.text)
and F.type = 'Restaurant';
[Fig. 5. Cruncher SQL and Map User Interface.]
We will show the performance of Cruncher under different
rates of online data streams and user queries. We will use
different batch sizes for processing various combinations of
the above queries. We will visualize how the global data
partitioning scheme adapts given the change in the query
workload and data updates. We will compare the throughput of
Cruncher against that of the original Spark. Our experiments show
how Cruncher achieves orders of magnitude improvement in
query throughput.
ACKNOWLEDGMENT
This research was supported in part by the National Science
Foundation under Grant Number IIS-1117766.
REFERENCES
[1] S. Nishimura, S. Das, D. Agrawal, and A. E. Abbadi, “MD-HBase:
A scalable multi-dimensional data infrastructure for location aware
services,” in IEEE MDM, 2011.
[2] A. M. Aly, A. S. Abdelhamid, A. R. Mahmood, W. G. Aref, M. S.
Hassan, H. Elmeleegy, and M. Ouzzani, “A demonstration of AQWA:
Adaptive query-workload-aware partitioning of big spatial data,” in
VLDB, 2015.
[3] A. M. Aly, W. G. Aref, and M. Ouzzani, “Spatial queries with k-nearest-
neighbor and relational predicates,” in SIGSPATIAL, 2015.
[4] ——, “Spatial queries with two kNN predicates,” in VLDB, 2012.
[5] C. Düntgen, T. Behr, and R. H. Güting, “BerlinMOD: A benchmark
for moving object databases,” VLDB J., 2009.
[6] A. Aji, F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. H. Saltz,
“Hadoop-GIS: A high performance spatial data warehousing system
over MapReduce,” PVLDB, vol. 6, no. 11, 2013.
[7] A. Eldawy and M. F. Mokbel, “SpatialHadoop: A MapReduce
framework for spatial data,” in ICDE, 2015.
[8] J. Lu and R. H. Güting, “Parallel Secondo: A practical system for
large-scale processing of moving objects,” in ICDE, 2014.
[9] A. R. Mahmood, A. M. Aly, T. Qadah, E. K. Rezig, A. Daghistani,
A. Madkour, A. S. Abdelhamid, M. S. Hassan, S. Basalamah, and
W. G. Aref, “Tornado: A distributed spatio-textual stream processing
system,” in VLDB, 2015.
[10] “Apache Spark,” https://spark.apache.org/, 2015.
[11] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,
M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets:
A fault-tolerant abstraction for in-memory cluster computing,” in NSDI,
2012.
[12] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica,
“Discretized streams: Fault-tolerant streaming computation at scale,”
in SOSP, 2013.
[13] “OpenStreetMap,” http://www.openstreetmap.org/, 2015.
[14] “Apache Zeppelin,” https://zeppelin.incubator.apache.org, 2015.