Hierarchical Semantics Matching For Heterogeneous
Spatio-temporal Sources
Daniel Glake
Norbert Ritter
daniel.glake@uni-hamburg.de
ritter@informatik.uni-hamburg.de
Universität Hamburg
Germany
Florian Ocker
Nima Ahmady-Moghaddam
Daniel Osterholz
Ula Lenfers
Thomas Clemen
florian.ocker@haw-hamburg.de
nima.ahmady-moghaddam@haw-hamburg.de
daniel.osterholz@haw-hamburg.de
ula.lenfers@haw-hamburg.de
thomas.clemen@haw-hamburg.de
Hamburg University of Applied Sciences
Germany
ABSTRACT
Spatio-temporal data are semantically valuable information used for various analytical tasks to identify spatially relevant and temporally limited correlations within a domain. The increasing availability of and data acquisition from multiple sources with typically high heterogeneity are receiving more and more attention. However, these sources often lack interconnecting shared keys, making their integration a challenging problem. For example, publicly available parking data that consist of point data on parking facilities with fluctuating occupancy and static location data on parking spaces cannot be directly correlated. Both data sets describe two different aspects from distinct sources in which parking spaces and fluctuating occupancy are part of the same semantic model object. Especially for ad hoc analytical tasks on integrated models, these missing relationships cannot be handled using join operations as usual in relational databases. The reason lies in the lack of equijoin relationships comparing strings for equality, and in the additional overhead of loading data before processing. This paper addresses the optimization problem of finding suitable partners in the absence of equijoin relations for heterogeneous spatio-temporal data, applicable to ad hoc analytics. We propose a graph-based approach that achieves good recall and performance scaling via hierarchically separating the semantics along spatial, temporal, and domain-specific dimensions. We evaluate our approach using public data, showing that it is suitable for many standard join scenarios and highlighting its limitations.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM '21, November 1–5, 2021, Virtual Event, QLD, Australia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11...$15.00
https://doi.org/10.1145/3459637.3482350
CCS CONCEPTS
• Information systems → Entity resolution; Join algorithms.
KEYWORDS
matching, processing, datasets, spatial, temporal, integration
ACM Reference Format:
Daniel Glake, Norbert Ritter, Florian Ocker, Nima Ahmady-Moghaddam,
Daniel Osterholz, Ula Lenfers, and Thomas Clemen. 2021. Hierarchical
Semantics Matching For Heterogeneous Spatio-temporal Sources. In Proceed-
ings of the 30th ACM International Conference on Information and Knowledge
Management (CIKM ’21), November 1–5, 2021, Virtual Event, QLD, Australia.
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3459637.3482350
1 INTRODUCTION
Spatio-temporal data are nowadays more ubiquitous than ever before; examples include biological, meteorological, and urban socio-economic data, agricultural data, geo-tagged public data, and sensor data. Owing to more and more publicly available data portals and services, the volume of such data generated per day increases at a staggering and unprecedented rate. This provides opportunities for many different analytical tasks. But often, the data come with heterogeneous representations and without any specified semantic relationship.
Before applying analytical tasks, applications need to retrieve and prepare the data. This additional integration overhead restricts ad hoc analytics, which is getting increasingly critical for short-term decision making (e.g., traffic-flow redirection or construction site planning). As a concrete motivating example, redacted from an actual use case, suppose an analytical task helps to plan new e-loading stations by figuring out the density of parking occupation. The analysis has a collection of data layers with parking spaces and a stream of past and upcoming occupations for existing e-mobility stations. An e-mobility station, represented by a point, preserves the actual occupation and is not directly related to affected parking spaces. Relationships can be inferred only from their spatial proximity to a given number field describing how many parking spaces are associated with the station. The occupation data changes over time and correlates changes with an existing parking layer, aggregating it across the considered spatial extent. The main difficulty is the lack of equijoin relationships and of a public mapping between e-mobility stations and parking spaces. The matching has to rely on other descriptive aspects to deduce semantic relationships, considering not only the domain data but also the temporal and spatial dimensions.
Regarding this integration problem, we propose an iterative hierarchical solution for automatic integration, implemented in our existing analytical framework MARS. The approach matches data objects from heterogeneous sources with spatio-temporal and unknown domain dimensions. Our contributions in this paper are:

• Discussion of the problem of matching data with spatial, temporal, and domain-specific semantics.
• Proposal of approximated spatial, temporal, and general-purpose matching mechanisms for heterogeneous data.
• Discussion of the technical integration.
• Recall, precision, and performance evaluation on real data sets.
The remainder of this paper is structured as follows: In Section 3, we describe the problem in abstract terms and discuss conflicts in spatial, temporal, and common data matching. Pursuant to the problem statement defined in Section 3.1, we discuss temporal matching in Section 4.1, spatial matching with multi-dimensional index-based access in Section 4.2, and a combined semantic similarity matching in Section 4.3 to capture unknown domains. In Section 5, we describe our solution for retrieving heterogeneous sources and how data are loaded to apply ad hoc analytics. In Section 6, we evaluate our approach against two scenarios; we discuss the results with the benefits and drawbacks of our approach in Section 7 and offer a conclusion and an overview of topics for future work in Section 8.
2 RELATED WORK
The problem of resolving matching partners in the absence of key candidates has been characterized in various works for different domains. Commonly used approaches such as join paths, manually managed or derived from the data set in Data Civilizer [13] or Clio [18], describe possible joinable tables, requiring knowledge of foreign key relationships and the schema. Alternatively, finding linkable objects from distinct sources can be viewed as a similarity problem. In information retrieval, this has been studied well, focusing on set similarity with small comparison sets, such as keywords and short texts [14, 39, 53]. They become applicable to larger domains by adapting deployed similarity metrics such as Jaccard to containment scores in [43] or to extended Jaccard in order to accommodate flexible token divergences. Other approaches such as [21, 52] consider heterogeneous attributes or combine multiple similarity metrics as in [37], tuning a weight for each metric accordingly. SEMA-JOIN [27], Auto-Join [60], the Mannheim Search Join Engine [38], and previous approaches to table extension [7, 56] on WebTables [8], on the other hand, are methods that statistically derive correlations from given corpora and identify pseudo-key relationships. However, these inference methods require large amounts of corpus data, such as the 100 million tables used in SEMA-JOIN [27]. Such volumes are not always available and are not readily manageable with regular personal infrastructure. In the context of spatial data, existing work on spatial matching addresses, in addition to runtime features such as GPU utilization [1] or streaming of spatio-temporal data [34], the refinement of matching quality. The 𝜖-distance approaches [5] establish links to nearby objects. However, distance alone is not always sufficient for matching since nearby coordinate support points cannot always be determined for a large spanning area. In contrast, partition-based or minimum-bounding-box-based approaches decompose the space into regular grids [46] or use the widely adopted R-tree join algorithms [28], and link objects based on geometric predicates (such as overlap). However, in the case of point data with minor spacing, data may also be excluded. Other approaches such as the Spatial Hash Join [41] or the Scalable Sweeping-Based Spatial Join [2] are multi-assignment joins that assign multiple data objects to a target object. A good discussion of previous and current spatial join methods can be found in [31]. Work combining both into a spatio-temporal composite so far focuses on runtime refinement by using distributed computation such as Apache Spark [55], in-memory databases [33], or better index structures [59]. Previous traditional spatial, temporal, and spatio-temporal joins were studied concerning geometric or temporal dimensions, while other semantic data attached to the objects was widely ignored. The spatio-textual similarity join [5], a hybrid of the spatial intersection join and set similarity, the k-distance join [48], and the GeoRDF join [40] are among the first approaches to include semantic information along with spatial data, but none of them has considered temporal properties yet.
3 MATCHING PROBLEMS
Before looking into the distinct aspects of matching, we formulate the matching problem in Section 3.1. Then, we discuss spatial, temporal, and general-purpose problems in Section 3.2, framing the boundary conditions of this work.
3.1 Problem Statement
Consider an existing set $E = \{e_1, e_2, \ldots, e_i\}$ of data objects with a set $M = \{m_1, m_2, \ldots, m_j\}$ of possible matching candidates in the scope of a temporal-invariant ad hoc analysis, denoted as $\mathcal{A}$. Each object $e \in E$ (or $m \in M$) contains a spatial instance $\mathcal{G}$, a temporal valid time period $\mathcal{T}$, and an associated data set $\mathcal{D}$ with attributes and values. The spatial instance $\mathcal{G}$ is represented as a Minimum Bounding Rectangle (MBR) (or simply bounding box) and covers all information within a specific area, including points, lines, and polygons. We denote the MBR as $\mathcal{G} = [p_{bl}, p_{ur}]$, in which $p_{bl} = (p_{bl}.x, p_{bl}.y)$ and $p_{ur} = (p_{ur}.x, p_{ur}.y)$ specify the bottom-left and upper-right coordinate, respectively. The valid time period $\mathcal{T}$ is denoted as $\mathcal{T} = [t_s, t_e)$, describing an inclusive starting time $t_s$ and an exclusive end time $t_e$. The associated data $\mathcal{D}$ is a set of values $\mathcal{D} = \{v_1, v_2, \ldots, v_k\}$ with different attribute types. A value $v_i$ of an object $e \in E$ (or $m \in M$) can be linear (e.g., e-mobility station Price or Availability) or nominal (symbolic) (e.g., parking space with Occupation or Kind), and a linear value can be continuous or discrete. To quantify the similarity between two objects $e_i \in E$ and $m_j \in M$, we consider each of the three dimensions of interest (temporal, spatial, and domain-specific semantics) separately. For temporal data, we consider an intersection of data objects to occur when their time periods overlap.
Denition 3.1. Given two objects
𝑒
and
𝑚
, their temporal inter-
section is dened as:
𝑇(𝑒, 𝑚)=T
𝑒∩ T
𝑚(1)
For the spatial dimension, the MBR extent of an object needs to intersect with that of a matching partner, applying set similarity with a given $\epsilon$-distance:

Definition 3.2. Given two objects $e$ and $m$, a precision $\epsilon$, and the prefix function $\rho$, the spatial similarity is given by the set similarity of matching prefixes of coordinates with length $\epsilon$. Their spatial instances $\mathcal{G}$ correlate by $\epsilon$-distance on the prefix:

$$S(e, m) = \sum_{i=0}^{\min(\epsilon,\, \rho(\mathcal{G}_e),\, \rho(\mathcal{G}_m))} \frac{|\rho(\mathcal{G}_e)_i \cap \rho(\mathcal{G}_m)_i|}{|\rho(\mathcal{G}_e)_i \cup \rho(\mathcal{G}_m)_i|} \quad (2)$$
For the domain-specific dimensions, the feature-based similarity is used to conclude the relationships, using the weight function $f_w: \mathcal{D} \to \mathbb{R}$, as follows:

Definition 3.3. Given two objects $e$ and $m$ and a threshold $\tau_D$, $0 \le \tau_D \le 1$, the goal is to satisfy $sim(e, m) \ge \tau_D$, identifying pairs $(a_i, b_j)$, $a_i \in e$ and $b_j \in m$, applicable for similarity.
Given the denitions above and object
𝑒
and
𝑚
, we concern:
(1) Temporal constraints: their temporal intersection is
. (2) Spatial
constraints: their spatial similarity is larger than a spatial thresh-
old
𝜏𝑠
in which
𝑆(𝑒, 𝑚)>𝜏𝑠
or the minimized object is selected
𝑚𝑖𝑛(Ð𝑘
𝑖=0Ð𝑙
𝑗=0𝑆(𝑒𝑖,𝑚𝑗))
if it exists. (3) Semantic constraints: their
similarity of heterogeneous associated data is larger than a semantic
similarity threshold
𝜏𝑑
in which
𝐷(𝑒, 𝑚)>𝜏𝑑
. Given the constraints,
we formulate the matching problem as follows.
Denition 3.4. Given two collection
𝐸={𝑒1, 𝑒2, . .., 𝑒𝑛}
and
𝑀=
{𝑚1,𝑚2, ..., 𝑚𝑘}
and two similarity thresholds
𝜏𝑠
and
𝜏𝑑
, a similarity
join nds all matching partners
(𝑒𝑖,𝑚𝑗)
where
T
𝑒T
𝑚
,
𝑆(𝑒, 𝑚)>𝜏𝑠
and 𝐷(𝑒, 𝑚) 𝜏𝑑.
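To make the three constraints tangible, the following minimal Python sketch (not part of the paper's implementation; the predicates $T$, $S$, and $D$ are passed in as callables, an assumption of this illustration) enumerates all pairs satisfying Definition 3.4 naively:

```python
def similarity_join(E, M, T, S, D, tau_s, tau_d):
    """Naive O(|E| x |M|) enumeration of Definition 3.4: keep pairs
    whose valid times intersect and whose spatial and semantic
    similarities clear the thresholds."""
    return [(e, m)
            for e in E for m in M
            if T(e, m)               # temporal intersection is non-empty
            and S(e, m) > tau_s      # spatial constraint
            and D(e, m) >= tau_d]    # semantic constraint
```

HSTM avoids exactly this quadratic enumeration by pruning candidates hierarchically, as Section 4 develops.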
Figure 1 shows an example of domain-specific spatio-temporal objects, each associated with a set of possible heterogeneous data values (e.g., $\{a, b, \ldots, k, l, \ldots, x, y, z\}$). The set of rectangles $R_i$ represents MBRs that are valid at different times (e.g., $R_i'$ or $R_i''$) and for which a temporal change can affect either the position of the geometric instance (e.g., $R_3 \mapsto R_3'$) or its values. Considering objects $R_2''$ or $R_5$, changes can express themselves by making new objects valid at specified points in time, denoted as $'$ or $''$, exposing semantic data $\{x\}$ or nothing ($\emptyset$) to an ad hoc analysis. When an object's semantic data change, they can be affected in one of the following ways: (1) simple attribute value changes ($\{x\}_{R_1} \mapsto \{x\}_{R_1'}$); (2) added or removed attributes ($\{x, y\}_{R_3} \mapsto \{x, y, z\}_{R_3'} \mapsto \{x, y\}_{R_3''}$); (3) combined changing actions in one step ($\{k, l, m\}_{R_7} \mapsto \{k, l\}_{R_7''}$).
3.2 Common Matching Problems
Many of the challenges of identifying matching data objects via attribute comparison can be classified into a few kinds of heterogeneity [17]. Differences between the structural properties of data sources restrict how their data objects can be related to each other. Therefore, the bulk of many data integration endeavours is devoted to minimizing heterogeneity, thereby increasing the integrability of the data sources and the comparability of the data objects at hand.

Figure 1: Example spatio-temporal data objects with different associated data values ($R_1 = \{x\}$, $R_1' = \{x\}$, $R_2'' = \{a, b\}$, $R_3 = \{x, y\}$, $R_3' = \{x, y, z\}$, $R_3'' = \{x, y\}$, $R_3''' = \{x\}$, $R_4 = \{a, b\}$, $R_4' = \{a, b\}$, $R_5 = \emptyset$, $R_6 = \{a, b, c\}$, $R_6'' = \{a, b, c\}$, $R_6''' = \{x, y, z\}$, $R_7 = \{k, l, m\}$, $R_7'' = \{k, l\}$).
Such exercises are part of data cleanup, pre-processing, homogenization, and feature preparation and are prerequisites to any meaningful search for matches.
A set of data sources is syntactically heterogeneous when any of its pairwise data objects are unequal in their forms of representation. For instance, the comparability of data objects from relational, aggregate-, object-, and matrix-oriented data sources is compromised due to the different structural composition of the data storage systems. However, even if the integration of heterogeneous sources is possible, the search for matches might be hindered by syntactic heterogeneity among the data objects to be compared. Given two data objects with a comparable attribute that has a cardinal scale of measure, semantic heterogeneity might arise due to the representation of the numbers. For instance, the attribute values of one data object might be normalized, whereas those of the other data object might not be. Similarly, when integrating external data into a model, conformity issues between the schema of the external source and the data representation in the model might impede comparability. This occurs when, for example, the nominal values of two data objects are represented literally in one case and encoded numerically in the other.

Content-based heterogeneity describes the occurrence of attribute values from different data objects that are semantically valid and, therefore, equally likely to be true. When faced with conflicting data objects, one object's values must be chosen over the other's, effectively rejecting those of the latter as false. How is such an ambiguity best resolved? In the decision-making process, the quality and integrity of the data sources is likely a consideration [15]. Ideally, these conflict resolutions should occur within an automated and configurable data integration pipeline, as making manual case-by-case decisions is not feasible.
3.3 Spatial and Temporal Matching Problems
In addition to the general challenges that arise from different types of data heterogeneity, more specific issues apply when analyzing spatial and temporal data. For instance, geospatial data encoded in different coordinate reference systems (e.g., WGS:84 or UTM) are not comparable because the geographic position of the objects is described differently. In addition, even within the same reference system, comparability might be impeded by the different standards employed in representing data. These standards typically arise from the domain (e.g., aviation), resulting in semantics degradation, in which different instances describe the same object on multiple abstraction levels [6]. For example, spatial geometry types (point, cell, line, polygon) describe individual parking lots occupied by vehicles or areas occupied by multiple cars. These objects cannot, or can only partially, be compared.

Concerning temporal data, the issue of outliers on the time axis is often of particular concern [25, 58]. A disruption in the temporal continuity of a time series whose data objects are ordered on a time axis occurs when there exists at least one interval on the time axis for which no data object's period is valid. In case of such anomalies, temporal and other data attributes need to be corrected (if they are erroneous) while maintaining as much data integrity as possible.
4 INTEGRATING SPATIO-TEMPORAL DATA
Due to the heterogeneity of the data sources and the absence of comparability via equijoins, we introduce Hierarchical Spatio-temporal Matching for heterogeneous matching, denoted as HSTM. The approach is hierarchical because the integration of spatio-temporal data must address spatial and temporal aspects separately. Links between objects are constructed within a multidimensional space by their spatial proximity, temporal validity, and semantic similarity. This results in a graph structure, which is successively built up according to the analysis progress. Spatial integration addresses the challenge of finding matches above a given precision threshold, distinguishing them from similar objects, whereas temporal integration requires populating data with regard to their validity time frame.
4.1 Temporal Validity
The temporal validity of data and temporal data referring to the same object requires a well-defined (dis-)integration mechanism based on selection. In order to find new valid and invalid values, we propose a definition of two types of joins linking new temporally valid data $M$ of a temporal catalogue to an existing collection $E$. More precisely, the definition is based on a single data model which has its roots in relational implementations [32]. HSTM assigns data objects to an interval, and the time axis is divided into intervals of minimum duration. Each interval is called a chronon [16], the smallest unit possible, formed by enclosing start as well as end chronons. For this purpose, we imply a data schema for temporal objects $M$ in conjunction with our existing collection $E$ described in Section 3.1. Each data object has its own schema $S = (a_1, \ldots, a_n, \mathcal{T})$ with an arbitrary set of data attributes $a_i$, $1 \le i \le n$, and a time period $\mathcal{T} = [t_s, t_e)$. It captures the time during which the information recorded by the attributes applies (or applied, or will apply). We use $S_E$ and $S_M$ as shorthand for $\{a_1, \ldots, a_n\}$ and $\{b_1, \ldots, b_m\}$ and define $e$ as an instance of $E$ as well as $m$ of $M$.

Figure 2: HSTM update and remove process with valid and invalid window.

The time interval of the current analysis is defined as $\mathcal{A} = [t_s, t_c)$, with start $t_s$ and current time $t_c$ of the analytical task.
Figure 2 shows the analysis scope overlapping with or intersecting valid and invalid values. HSTM updates the $e_i$ with new data when there is an overlap between the validity period and the analysis time $t_c$. In contrast to overlapping, the intersection of the periods identifies invalid data, with subsequent removal of $e_j$ when no more changes were provided. Therefore, we define two auxiliary functions for intersecting and overlapping. For a more precise formulation, let $f$ (short for first) and $l$ (short for last) denote the smaller and larger, respectively, of two argument chronons. Also, let $U_s$ and $U_e$ be the start and end chronons of $U$, and likewise $V_s$ and $V_e$ of $V$. Equation 3 shows $intersect(U, V)$ with $U$ and $V$ as intervals, returning $true$ exactly when the start and end of $U$ lie within the scope of $V$:

$$intersect(U, V) = U_s \ge V_s \wedge U_e < V_e \quad (3)$$

Equation 4 shows $overlap(U, V)$, which returns the maximum common interval of the two arguments. If there is no overlap, $\emptyset$ is returned:

$$overlap(U, V) = \begin{cases} [l(U_s, V_s), f(U_e, V_e)] & \text{if } l(U_s, V_s) \le f(U_e, V_e) \\ \emptyset & \text{otherwise} \end{cases} \quad (4)$$
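As a minimal illustration, the two chronon-interval helpers of Eq. 3 and Eq. 4 could be sketched in Python as follows (half-open integer intervals assumed; this is not the MARS implementation):

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start, end) chronons, end exclusive

def intersect(u: Interval, v: Interval) -> bool:
    """Eq. 3: true exactly when U lies within the scope of V."""
    return u[0] >= v[0] and u[1] < v[1]

def overlap(u: Interval, v: Interval) -> Optional[Interval]:
    """Eq. 4: maximum common interval of U and V, or None (the empty
    interval) when the two intervals do not overlap."""
    start, end = max(u[0], v[0]), min(u[1], v[1])  # l(...), f(...)
    return (start, end) if start <= end else None
```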
The equation $E \bowtie^{T}_{P(e_\mathcal{T}, m_\mathcal{T})} M$ describes a $\theta$-join and $E \ltimes^{T}_{P(e_\mathcal{T}, m_\mathcal{T})} M$ describes a semi-join between the object instance $e \in E$ on the left side and the new data object $m \in M$ on the right side, satisfying a condition $P: E \times M \to \{\text{true}, \text{false}\}$. $P$ is a custom predicate, delegating further matching conditions to the underlying hierarchy (spatial and semantic). The temporal selection reduces the number of possible matching candidates between objects in $M$ and existing ones in $E$ to a limited set of candidates, which then forms the basis for the subsequent steps of HSTM. We use $e_\mathcal{T} \in S_E$ and $m_\mathcal{T} \in S_M$ as the explicit join attributes. The semi-join, in contrast, retrieves the reduced set of data objects from the left side that fully intersect with the actual analysis interval, marking all such objects as invalid for the analysis.
In addition to preventing duplicate values, we collapse mutually equal non-timestamp values and non-valued tuples with equal timestamps into a single data object [32]. The coalesce operation omits missing values or those which are equal to data of upcoming points in time. The definition consists of merging objects whose time is value-equal, preventing the return of invalid times outside the scope of $M$. We separate the definition into the auxiliary functions 5 and 6. The first function looks for candidates of an arbitrary data object $x$ for which the attribute value of the specified $b$ is value-equal and duplicate entries in time exist, intersecting the period with the existing data object $z$:

$$\text{collapse}(z, r) = \exists x \in r\,\big(z[b] = x[b] \wedge intersect(x[\mathcal{T}], z[\mathcal{T}]) \wedge (\forall x' \in r\,(x'[b] = x[b] \Rightarrow intersect(x'[\mathcal{T}], z[\mathcal{T}])))\big) \quad (5)$$

The second auxiliary function (see Eq. 6) ensures that no invalid chronon is returned that is outside the valid time scope of the considered data object $z$ when two elements are equal:

$$\text{invalid}(z, r) = \forall t \in z[\mathcal{T}]\ \exists x \in r\,(z[b] = x[b] \wedge t \ge x[t_s] \wedge t < x[t_e]) \quad (6)$$

Equation 7 describes the collapsing of equal values for timestamped data and incorrect times:

$$coalesce(r) = \{z \mid \text{collapse}(z, r) \wedge \text{invalid}(z, r)\} \quad (7)$$

With the definitions above, the temporal selection can be defined as follows:

$$\sigma^{T}_{P(\mathcal{A}, m)} M = \{x \mid \exists m \in coalesce(M): x[\mathcal{T}] = overlap(m[\mathcal{T}], \mathcal{A}) \wedge x[\mathcal{T}] \neq \emptyset\} \quad (8)$$

Only collapsed data objects in correlation with $\mathcal{A}$ are used.
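A hedged Python sketch of the temporal selection in Eq. 8, reusing overlap from the previous sketch and assuming a coalesce implementation per Eq. 7 (objects represented as dicts with a "T" interval, an assumption of this illustration):

```python
def temporal_selection(M, analysis):
    """Sketch of Eq. 8: keep only coalesced objects whose valid time
    overlaps the analysis interval A, clipping T to that overlap."""
    selected = []
    for m in coalesce(M):                    # Eq. 7, assumed given
        clipped = overlap(m["T"], analysis)  # Eq. 4
        if clipped is not None:              # x[T] != empty set
            selected.append({**m, "T": clipped})
    return selected
```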
4.2 Spatial Subdivision
Spatial similarity is determined by a prefix filter method [9]. Two objects $a$ and $b$ are considered to be similar when they share a common set of prefix tokens. Therefore, a multidimensional database consists of points belonging to connected and fine-granular multidimensional objects within a space $S$ over dimension $d$. A spatial database, denoted $\mathcal{U}_{spatial} \subseteq \mathcal{U}$, is a special case of the multidimensional database over the entire comparison universe $\mathcal{U}$, where $d = 2$. As mentioned in Section 3.1, spatial objects are considered as rectangular spaces containing the MBRs of the comparison objects. We use prefix filtering with 2D linearization, decomposing the space into individual bounding boxes. HSTM assigns spatial data $\mathcal{G}$ to one of the spaces having an $\epsilon$-common prefix, which is entered into a Patricia trie. If two objects $e$ and $m$ share a prefix valid up to $\epsilon$ elements, they are also adjacent within $\epsilon$-distance. Therefore, we associate the spatial space $S$ with a hierarchy consisting of a set of nodes $V = \{v_1, v_2, \ldots, v_n\}$ whose elements each constitute a subset of the space $S_i \subseteq S$ for which the transitive property holds. The set of leaf nodes $\mathcal{L} \subseteq V$ in the spatial hierarchy points to concrete coordinates of the input space $S$. Each parent $v_k$ of a node $v_i$ points to $S_k$ such that $S_i \subseteq S_k$ holds. Conversely, all children $v_i$ of a node $v_k$ in the hierarchy subsume the whole space of $S_k$:

$$\forall v_j \in \{v \mid v \in children(v_i) \wedge v \neq v_k\}: S_j \subseteq S_i \quad (9)$$

For the induction of the parent space $S_k$ from its child spaces $S_i$, the following condition holds:

$$S_k = \bigcup_{v_i \in children(v_k)} S_i \quad (10)$$
Each branch node $v_i$ within the hierarchy that is not part of the leaf node set $\mathcal{L}$ constitutes a subspace of the total space. To match objects, HSTM applies a spatial range query, so that range conditions are satisfied by the mutual intersection of MBRs. Formally, a spatial query, denoted as $q$, has a range specification of the form $q_{range} = [q_{bl}, q_{ur}]$ in which $q_{bl} = (q_{bl}.x, q_{bl}.y)$ and $q_{ur} = (q_{ur}.x, q_{ur}.y)$ satisfy the conditions $q_{bl}.x \le q_{ur}.x$ and $q_{bl}.y \le q_{ur}.y$. The specification is equivalent to the description in Section 3.1. We define an auxiliary function $space: V \times \mathbb{N} \to S$ which returns the space for a particular level within the hierarchy.
A given range query $q$ with specification $q_{range} = [q_{bl}, q_{ur}]$ contains an associated two-dimensional point $p \in \mathcal{U}_{spatial}$ if and only if $q_{bl}.x \le p.x \le q_{ur}.x$ and $q_{bl}.y \le p.y \le q_{ur}.y$. In conjunction with the range in $q$, two considered spatial objects match when they share their MBR. Therefore, an $intersect$ equivalent to Equation 3 is introduced. To simplify readability, for Equation 11 we briefly assume $a = \mathcal{G}_e$ and $b = \mathcal{G}_m$:

$$intersect(a, b) = (a_{bl}.x \le b.x \le a_{ur}.x \wedge a_{bl}.y \le b.y \le a_{ur}.y) \vee (b_{bl}.x \le a.x \le b_{ur}.x \wedge b_{bl}.y \le a.y \le b_{ur}.y) \quad (11)$$

The spatial matching predicate $P_S: E \times M \to \{true, false\}$ on the two data sets $E$ and $M$ correlates the spatial instances $\mathcal{G}$ with each other:

$$P_S(e_\mathcal{G}, m_\mathcal{G}) = intersect(e_\mathcal{G}, m_\mathcal{G}) \quad (12)$$
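A minimal Python sketch of the MBR test in Eq. 11, under the assumption that the bare $b.x$ and $b.y$ in the equation denote the bottom-left corner of the other box:

```python
Point = tuple  # (x, y)
MBR = tuple    # (bottom_left, upper_right), each a Point

def mbr_intersect(a: MBR, b: MBR) -> bool:
    """Sketch of Eq. 11: two MBRs match when the bottom-left corner
    of one box lies within the extent of the other, in either order."""
    (a_bl, a_ur), (b_bl, b_ur) = a, b
    b_in_a = a_bl[0] <= b_bl[0] <= a_ur[0] and a_bl[1] <= b_bl[1] <= a_ur[1]
    a_in_b = b_bl[0] <= a_bl[0] <= b_ur[0] and b_bl[1] <= a_bl[1] <= b_ur[1]
    return b_in_a or a_in_b

# Example: two overlapping boxes match in either direction.
print(mbr_intersect(((0, 0), (4, 4)), ((2, 2), (6, 6))))  # True
```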
HSTM considers geometry on the spatial scale. Together with the temporal and semantic aspects, new data is temporally constrained and summarized over similar geometries using approximate rasterization. Therefore, we linearize the MBR geometric objects and construct a space-filling curve for efficient use of range queries [45]. Range queries on $Z$-curves are relevant for efficient implementations of multi-dimensional indexes such as the Universal B-tree (UB-tree) [3, 42, 50], the BUB-tree [19], and the PH-tree [57]. These trees interleave some or all bits of each dimension of a stored k-dimensional point $p = p_0, p_1, \ldots, p_{k-1}$ into a bit string, called the $Z$-address $z$, representing a coordinate of a space-filling $Z$-curve. For example, the lexicographic ordering of values encoded in $Z$-addresses is the $Z$-ordering; see [51] for more discussion. A comparison of $Z$-curves with other space-filling curves is given in [44]. For our 2-dimensional space, we only need to consider $k = 2$:

Definition 4.1. Given a dimension $d$ and a point $p = (p_i)_{0 \le i < d}$, a $Z$-address $z$ is a bit string of this point with prefix length $\epsilon$, consisting of $\epsilon \times d$ bits, where $0 \le p_i < 2^\epsilon$ and $\epsilon$ describes the highest bit of each value $p_i$.
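For illustration, a $Z$-address per Definition 4.1 can be computed by bit interleaving; the following Python sketch (a hypothetical helper, with $d = 2$ as in our setting) shows how nearby points share a common $Z$-address prefix:

```python
def z_address(p, epsilon, d=2):
    """Sketch of Definition 4.1: interleave the top `epsilon` bits of
    each of the d coordinates of point p into one epsilon*d-bit value."""
    z = 0
    for bit in range(epsilon - 1, -1, -1):       # from highest bit down
        for i in range(d):
            z = (z << 1) | ((p[i] >> bit) & 1)   # take bit of dimension i
    return z

# Example: two nearby points share a long common Z-address prefix.
print(format(z_address((5, 9), epsilon=4), "08b"))  # 01100011
print(format(z_address((4, 8), epsilon=4), "08b"))  # 01100000
```

This prefix-sharing property is what lets HSTM treat $\epsilon$-common prefixes as spatial adjacency within $\epsilon$-distance.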
For a given spatial database $\mathcal{U}_{spatial}$ with hierarchy $H$, we create a space-filling $Z$-curve and place each MBR of a spatial object in this space. Figure 3 shows an exemplary subdivided space of multiple MBRs, denoted as $R_i$, and a $Z$-curve range query $q_{range}$ intersecting a subset of objects. To locate matching points of MBRs inside this range, HSTM traverses the tree, locating nodes that intersect a $Z$-value range (e.g., $[15, 60]$ in Figure 3). Once reaching the end of a node, it returns to the parent and moves on to the next one, stopping when the upper range limit of the $Z$-value is reached. In this manner, the approach approximately retrieves all objects with intersecting MBRs of an element $m$ by performing a window query box defined by $q_{lr}$ and $q_{ur}$. The window query is transformed into its $z$-bit string range. Starting at the root $r$, we look for the first $\epsilon$ subspaces given by $children(r)$ and check whether they overlap.

Figure 3: $Z$-curve and $z$-values of an MBR sample in 2D space. Leaf nodes (green), intermediate nodes (orange), and root (green).
4.3 Content-Based Similarity
For semantic similarity, we consider two kinds of well-known metrics. For linear data in $\mathcal{D}$, we utilize the Cosine distance, a popular similarity measure treating two linear inputs as vectors in space and computing the cosine between them:

$$\cos(\mathcal{D}_e, \mathcal{D}_m) = \frac{\mathcal{D}_e \cdot \mathcal{D}_m}{\|\mathcal{D}_e\|\,\|\mathcal{D}_m\|} \quad (13)$$

For nominal data, in contrast, we utilize the Jaccard similarity and propose an extension concerning the distinction of contained nominal values between $\mathcal{D}_e$ and $\mathcal{D}_m$ and marginal differences in tokens, making soft approaches necessary. Therefore, we compute the common and distinctive token differences and apply a given weighting function. Each value $v \in \mathcal{D}$ is important to a different degree, expressed by an assigned weight $f_w: \mathcal{D} \to \mathbb{R}$ (e.g., via latent Dirichlet allocation [4]). The weighted sum over common values $dv(\mathcal{D}_e, \mathcal{D}_m) = \sum_{v \in \mathcal{D}_e \cap \mathcal{D}_m} f_w(v)$ and the weighted sum over all values $cv(\mathcal{D}_e, \mathcal{D}_m) = \sum_{v \in \mathcal{D}_e \cup \mathcal{D}_m} f_w(v)$ are set in proportion according to the Jaccard coefficient:

$$sim(\mathcal{D}_e, \mathcal{D}_m) = \frac{dv(\mathcal{D}_e, \mathcal{D}_m)}{cv(\mathcal{D}_e, \mathcal{D}_m)} \quad (14)$$
Both metrics are widely used in practice but come with some major drawbacks. The sole usage of Jaccard or Cosine cannot handle heterogeneous attributes consisting of nominal and linear data in $\mathcal{D}$ for objects $e$ and $m$. Jaccard is well-suited for nominal data but becomes imprecise when string representations are marginally different from each other and equality cannot satisfy the cut and union operations. In addition, Jaccard is less suitable since it finds frequent correlations when comparing high-dimensional with low-dimensional data in $\mathcal{D}$. Cosine fits better when values have a linear attribute type but suffers in co-rating items even if there exists a high difference in one value. Therefore, we propose a hybrid in which linear and nominal attribute values are considered separately for a continuous input of the analytical model.

With the functions $dv$ and $cv$, we define the domain-specific similarity $sim_H$ as a fraction that returns the similarity between two input attribute-value sets $e$ and $m$ in the interval $[0, 1]$ as a loss, based on the idea of the soft Jaccard [49]:

$$sim_H(e, m) = 1 - \frac{sim(e[\mathcal{D}], m[\mathcal{D}])}{\|e[\mathcal{D}]\|_1 + \|m[\mathcal{D}]\|_1 - sim(e[\mathcal{D}], m[\mathcal{D}])} \quad (15)$$

where $\mathcal{D}_e$ and $\mathcal{D}_m$ are the sets of all attributes of the two objects. The function $sim(\mathcal{D}_e, \mathcal{D}_m)$ returns the similarity between two divergent attribute sets, using the flexible comparison defined above. The result is aggregated and reduced to a loss, used to define the similarity constraint discussed in Section 3.1:

$$D(e, m) = sim_H(e, m) \le \tau_D \quad (16)$$
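The following Python sketch (illustrative only; the $L_1$ norms in Eq. 15 are taken here as set cardinalities, which is an assumption) combines Eqs. 13 to 15:

```python
import math

def cosine(u, v):
    """Eq. 13: cosine similarity of two linear value vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def weighted_jaccard(d_e, d_m, f_w):
    """Eq. 14: weighted common tokens in proportion to all tokens."""
    dv = sum(f_w(v) for v in d_e & d_m)
    cv = sum(f_w(v) for v in d_e | d_m)
    return dv / cv if cv else 0.0

def sim_h(d_e, d_m, f_w):
    """Eq. 15: soft-Jaccard loss in [0, 1]; 0 means identical."""
    s = weighted_jaccard(d_e, d_m, f_w)
    denom = len(d_e) + len(d_m) - s
    return 1.0 - s / denom if denom else 1.0

# Example with unit weights: loss ~0.91, above a typical tau_D.
print(sim_h({"public", "garage"}, {"public", "lot"}, lambda v: 1.0))
```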
4.4 Hierarchical Matching
Finally, all three dimensions of the different matching approaches have to be combined into a coherent procedure in which the number of possible matching candidates is successively reduced to one or zero. The order in which the matching algorithms are executed depends on the data availability of the individual parts.

$$E \bowtie^{D}_{sim_H(e[\mathcal{D}],\, m[\mathcal{D}]) \le \tau_D} \Big( E \bowtie^{S}_{P(e[\mathcal{G}],\, m[\mathcal{G}])} \big( \sigma^{T}_{P(\mathcal{A},\, m[\mathcal{T}])} M \big) \Big) \quad (17)$$
First, temporal matching is executed, reducing matching objects to valid partners based on the current timestamp. It has no direct dependency on existing objects. Next, the remaining matching partners are joined with existing entities using the spatial matching prefix. The candidates are now reduced by temporal validity and spatial proximity. Finally, the previous set of objects is semantically joined with existing objects in $E$. Property information, nominal and linear, is used to determine whether a candidate can be matched to an $e$ (duplicate) or integrated as a new one. When updating an existing entity, the question of merging previous and new information has to be faced. From a temporal perspective, the lifetime of $e$ is prolonged. Concerning the spatial dimension, $e$ can either adopt the new geometry or merge geometries. Finally, data integration is a task that depends even more than spatial integration on the requirements of the model at hand. A simple update of $e$ is possible, but without knowledge of the quality and purpose of the sources, it remains questionable without any insight into the domain.
Precisely, with the separate steps in Eq. 17, the definition of $overlap(U, V)$ in Eq. 4, the predicates $P_S$ and $P_D$ over $E \times M \to \{\text{true}, \text{false}\}$, and the auxiliary function $coalesce(r)$ in Eq. 7, we define the complete match and fusion. The process of getting all next valid data is described in Equation 18 and consists of finding those matches that intersect the actual analytical progress, whereas the process of retrieving invalid data objects is described in Equation 19. Again, we use $S_E = \{a_i, \ldots, a_n\}$ and $S_M = \{b_i, \ldots, b_m\}$ for the attributes of elements in $E$ and $M$, respectively:

$$\begin{aligned} E \bowtie_{HSTM} M = \{x \mid\ & \exists e \in E, m \in coalesce(M)\ \big(P(e, m)_{S_\epsilon} \wedge P(e, m)_{D_{\tau_D}}\ \wedge \\ & x[\mathcal{T}] = overlap(m[\mathcal{T}], \mathcal{A}) \wedge x[\mathcal{T}] \neq \emptyset \wedge x[S_E] = e[S_E]\ \wedge \\ & (x[S_M] = m[S_M] \vee x[S_M] = null)\big)\ \vee \\ & \nexists e \in E, \exists m \in coalesce(M)\ \big(\neg(P(e, m)_{S_\epsilon} \wedge P(e, m)_{D_{\tau_d}}) \Rightarrow \\ & x[S_E] = e[S_E] \wedge x[S_M] = null \wedge overlap(x[\mathcal{T}], \mathcal{A}) \neq \emptyset\ \wedge \\ & x[\mathcal{T}] \neq \emptyset\big)\} \end{aligned} \quad (18)$$
The rst three lines handle the case in which the data object
𝑥
derives from
𝑒
and
𝑚
by applying the inner matching condition
for
𝑒
and
𝑚
, given
𝜖
-distance for spatial-based
𝑃(𝑒, 𝑚)𝑆𝜖
and
𝜏𝐷
threshold for semantic-based decisions
𝑃(𝑒, 𝑚)𝐷𝜏𝐷
. The attribute
values
𝑒[𝑆𝐸]
of the existing object are used for resolved partners,
including all remaining attributes
𝑚[𝑆𝑀]
from the match. The valid
time of this data update is expanded by overlap with the analysis
interval Aand associated time in 𝑥[T ] .
The last three lines handle the cases in which no matching was
found, but the valid period overlaps with the actual analysis time,
describing a new object. The result 𝑥is lled with values of 𝑚.
The invalid join retrieves all data which are now invalid for the analysis, described in Equation 19:

$$E \ltimes^{T}_{P(e, m)} M = \{e \mid \exists e \in E, m \in coalesce(M)\ \wedge\ intersect(m[\mathcal{T}], \mathcal{A})\} \quad (19)$$

The semi-join retrieves all pending objects contained entirely within the current analysis progress time range $\mathcal{A}$, which thus are no longer applicable for the task because their valid time is over.
5 TECHNICAL INTEGRATION
Since the multiple data sets are heterogeneous, provide only a subset of the information required for the considered scenario, and do not contain validity periods (specifically, only a temporal marker is assigned, for example, by corresponding sensors or taken from the hub system time), all static and dynamic sources are equally relevant.
5.1 Source Inclusion
Given a scenario description, the implementation of HSTM refers to multiple static and dynamic sources, contiguous with a polystore architecture [22]. Source entries contain implicit or explicit temporal references (such as a validity period or validity marker), applied to all data objects contained in the source, as well as spatial references in multiple formats (point, line, and polygon). In contrast to static sources, dynamic sources are configured by the given scenario description¹ containing (push-based) queries and mappings, resolving naming (1:1) and structural (1:n) conflicts. The system registers a query at the real-time endpoint, returning the complete state of objects when any matching element changes. Figure 4 shows the loading structure of multiple data sources in which the $\sigma$-layer allocates sources using the individual wrapper implementations. Each wrapper implements the native interface of the source and exposes a scan operation and a selection ($\sigma$) filter. The selection predicate is defined on an associative array, decoupling it from native formats and providing transparency.

¹https://mars.haw-hamburg.de/articles/core/model-configuration/

Figure 4: Data loading layer, transforming input data objects to a common data model.
The system uses the Volcano model [24], iterating over data objects and selecting those that satisfy the predicate in $\sigma_P$. The system transforms each result into a hierarchical associative array according to a given mapping. Due to this Volcano-based lookup, the $\sigma$-layer returns an iterator, making post-processing operations such as $top_k$ or $count$ available. Temporal data objects are forwarded into two intermediate, distinct catalogue stores to encapsulate the model's potentially heterogeneous sources. The non-temporal catalogue contains values that are valid for the entire analysis time $\mathcal{A}$, whereas the temporal catalogue preserves multiple groups with time series, each concerning a sequence of data changes. When adding new temporal data to the catalogue, the system first checks for type conformity and temporal overlap with the current time $t_c$ in $\mathcal{A}$, preventing the system from acquiring errors in the data [11, 20, 30] or from being occupied by temporal outliers [25, 58]. New time series entries lead to the creation of new data objects. Insertion of data into an existing sequence causes the invalidation of the previously oldest value and its subsequent replacement by the valid time of the new one. These constructs inherit the valid periods of the data [36, 47].
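The $\sigma$-layer described above can be pictured with a minimal Volcano-style iterator sketch in Python (names hypothetical, not the MARS API): each operator pulls records lazily from its child, so selection and post-processing compose without materializing intermediate results.

```python
import itertools
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]  # hierarchical associative array per data object

def scan(source: Iterable[Record]) -> Iterator[Record]:
    """Wrapper scan: pull records from the source in its native order."""
    yield from source

def select(child: Iterator[Record], p: Callable[[Record], bool]) -> Iterator[Record]:
    """Sigma operator: emit only records satisfying the predicate P."""
    return (rec for rec in child if p(rec))

def top_k(child: Iterator[Record], k: int) -> Iterator[Record]:
    """Post-processing composed on the returned iterator, e.g. top-k."""
    return itertools.islice(child, k)

# Usage: operators compose lazily, one record pulled at a time.
rows = [{"kind": "parking", "free": 3}, {"kind": "station", "free": 0}]
for rec in top_k(select(scan(rows), lambda r: r["free"] > 0), k=1):
    print(rec)  # {'kind': 'parking', 'free': 3}
```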
Since external data are integrated, the approach needs to deal with divergent transmission latency, unequal clocks of internal and external scopes, and duplicates. As discussed in Section 3, this includes data cleaning by filtering spatial [10, 35] and temporal [25, 58] outliers according to the scenario mapping, removing errors in the attribute data such as typos or dependency violations [11, 20, 30].
5.2 Matching Population
The complete matching approach is shown in Algorithm 1, illustrating each step in which valid and invalid temporal-invariant objects of $E$ and $M$ are created, updated, or removed during the discrete execution of the analysis.
Algorithm 1 Valid/Invalid Data Population
Require: $E$, $\tau$, $M$: entity set, entity type, and data source
1: Initialize update set $U = \emptyset$
2: $Valid_\mathcal{T} \leftarrow \sigma^{T}_{P(m[\mathcal{T}],\, \mathcal{A})} M$ (see Eq. 17)
3: for $m \in Valid_\mathcal{T}$ do
4:   $Valid_\mathcal{G} \leftarrow \sigma_{P(m[\mathcal{G}],\, e[\mathcal{G}])} E$ (spatial matching, see Eq. 11)
5:   for $e_g \in Valid_\mathcal{G}$ do
6:     if $sim_H(e_g, m) \le \tau_d$ then
7:       $U \leftarrow U \cup \{\text{update}(m)\}$
8:     else
9:       $U \leftarrow U \cup \{\text{create}_\tau(m)\}$
10:    end if
11:  end for
12: end for
13: $Invalid \leftarrow E \ltimes^{T}_{P} M$ (see Eq. 19)
14: for $e \in Invalid$ do
15:   if $e \notin U$ then
16:     $E \leftarrow E \setminus \{e\}$  // remove
17:   end if
18: end for
19: $A \leftarrow A \cup U$
The population algorithm collects new and updated data objects from $M$ (line 2) in an intermediate set $U$ (lines 7 and 9). Matching candidates in $E$ receive an update notification in their model implementation (e.g., the parking space), whereas the absence of matches with $M$ results in the creation of new elements for $E$. Precisely, matches are retrieved by applying $\sigma_{P(m[\mathcal{G}],\, e[\mathcal{G}])} E$ (line 4) with newly valid $m$ on the $Z$-curve discussed in Section 4.2. For each possible spatial match $e_g$, we decide in which manner the contained semantics are similar to the new $m$. When $sim_H(e_g, m) > \tau_D$, the algorithm creates an instance of the given model type $\tau$ and inserts it into the update set; otherwise an update with $m$ is applied to the model. The invalidation removes the remaining $e$ from $E$ and preserves those that were created or received an update (line 13).

HSTM uses symmetric matches between the present data and the validity window on the temporal catalogue, in which the left side of the join is already loaded and matching partners on the right side are located. Due to the lack of conformity between the underlying models of the data sets, we utilize a left-outer bind-join [26].
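Algorithm 1 can be pictured with the following Python sketch, reusing the hypothetical helpers from the earlier sketches (temporal_selection, mbr_intersect, sim_h); invalid_entities stands in for the semi-join of Eq. 19 and is assumed to return a separate list:

```python
def populate(E, M, analysis, tau_d, f_w, make_entity, invalid_entities):
    """Sketch of Algorithm 1: collect valid data, update or create
    entities, then remove entities whose valid time has ended."""
    updated = []
    for m in temporal_selection(M, analysis):               # line 2 (Eq. 8)
        for e in [e for e in E if mbr_intersect(e["G"], m["G"])]:  # line 4
            if sim_h(e["D"], m["D"], f_w) <= tau_d:         # small loss: duplicate
                e["D"] |= m["D"]                            # update notification
                updated.append(e)
            else:                                           # dissimilar: new entity
                new = make_entity(m)
                E.append(new)
                updated.append(new)
    for e in invalid_entities(E, M, analysis):              # line 13 (Eq. 19)
        if e not in updated:
            E.remove(e)                                     # valid time is over
    return updated
```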
6 EVALUATION
Hierarchical matching is implemented as a prototype using our existing multi-agent framework MARS and is available online². The MARS framework already has an implementation for efficient graph structures [54] and was modified to realize the spatial hierarchy. Calculations and measurements were performed on Windows 10 without active virtualization, using .NET Core v5.0.4 on an AMD Ryzen 9 3900X 12-core CPU with 32.0 GB RAM and a 3 TB SATA HD with 64 MB buffer, with compiler optimization enabled. Data were obtained from the Hamburg Urban Data Hub Geoportal³ (GP) and the Global Database of Events, Language, and Tone (GDELT). GDELT represents a single-source set, containing >200M geopolitical spatially and temporally invariant records, with 34 categorically and numerically typed attributes in a homogeneous schema. In contrast to GDELT, GP is a public multi-source set with over 150 spatially and temporally heterogeneous data subsets for the city of Hamburg, describing parking without equijoin relationships among sources. We evaluate HSTM with two different scenarios. The first scenario concerns the problem of identifying matches between parking lots and temporal occupation data from heterogeneous sources. No occupation data is provided for the parking lots themselves (only spatial data). It needs to be derived from existing e-charging stations, which provide real-time occupation data without relations to the given parking lots or their spatial extent (insufficient spatial information). This merge is relevant regarding the correctness of a digital twin in the domain of traffic over time [23]. We investigate Precision and Recall of the matching approach for a one-step correlation of parking lots and compare different selections of $\epsilon$. Parking and loading stations are matched with each other on a subset of 34 sources near the city centre. A given equijoin proves semantically correct linkage on parking spaces as well as the distance relationship with e-stations from the stream input. The e-charging stations act as proxies, containing a count of nearby affected parking lots. The second scenario operates on data with equijoin relationships, which allows the automatic comparison of larger data sets. Incorporating GDELT, this scenario evaluates the matching performance and Recall of HSTM. We compare HSTM with existing index-based methods, comparing K-ANN clustering with our similarity function. We decided to use the vantage-point tree (VP-tree), the k-dimensional tree (KD-tree), and the navigable small world (NSW) graph because they are the most prominent solutions for general-purpose K-ANN clustering in practice.

²https://git.haw-hamburg.de/mars/model-deployments
³Public data sets available at https://geoportal-hamburg.de/geo-online/
7 DISCUSSION
We presented a matching approach whose decisions for spatial, temporal, and semantic properties were considered differentially within ad hoc analyses. Figure 5 shows that varying $\epsilon$ affects the Precision, in which categorical data and missing consideration of spatial or temporal restrictions are significantly more applicable for the range $[0, 5]$, whereas combining linear values holds a Recall > 0.87 and a Precision > 0.92. Using HSTM on polluted data sets shows significant drawbacks, in which the matching is similar to using only Jaccard or Cosine, respectively, and only increasing the spatial $\epsilon$ helps find suitable candidates. Figure 6 shows comparison results with leading K-ANN clustering on the GDELT data set. Using a naive K-ANN approach shows lower Recall with lower processing performance for the VP-tree, KD-tree, and NSW approximations. For $\epsilon = 8$, we show that a few precise spatial quantizations with less computational cost tend to the same behaviour as $\epsilon = 12$, where only NSW performs better. In total, results were classified properly compared to query performance but mostly come with limitations. Due to the spatial abstraction, selecting nearby objects using the MBR leads to invalidly intersecting matches (e.g., curves of roads) with points or small polygons. Large extensions of these geometries correlate with failing curves and their MBRs, even if, spatially, the objects have nothing in common.
Figure 5: Comparison of Recall and Precision of the HSTM approach using the similarity function $sim_H$ versus Jaccard and Cosine only, varying the $\epsilon$-distance. Recall comparison on the full parking data set (1) and on the polluted parking-space set (2); Precision comparison on the full parking data set (3) and on the polluted set (4).

Figure 6: Comparison of Recall measures on subsets of GDELT using the HSTM approach ($\epsilon = 8$ and $\epsilon = 12$) and public K-ANN lookup ($K = 1$) implementations (NSW, KD-tree, VP-tree). Computation on top-50K elements (1), top-100K elements (2), and top-1M elements (3). X-axis (processing time in ms) in logarithmic scale.
Comparing our technical integration, we differ from other spatio-temporal joins [12] in that matching applies to a setting of heterogeneous data sources. HSTM is a nested-loop match in which intermediate results from external sources are passed to the internal objects in $E$, using the filter predicate $P(E[a], M[b])$. HSTM sends out subqueries multiple times. Pushing the selection combination to the data source [26] alleviates limited access patterns arising from query capability restrictions, such as in MQTT⁴ and its extensions [29]. Though this approach provides good runtimes and logical data independence, the process is rather trivial because the objects are discretized and sorted along the time axis with a fixed intersection. Merging spatial objects cannot be provided because this is read-only access, which delegates merging to the model. Given the soft consistency requirements, a flexible temporal distance $t_s + \alpha$ and $t_e + \alpha$ would be better suited, increasing or decreasing the candidate set. Similarly, it is unclear whether and to what extent objects should be valid when $t_s = t_e$ holds, since the algorithm drops them before the analysis continues. The semantic approach differentiates heterogeneous objects and allows for their weighting in the comparison metric. In addition, we did not consider normalization for the temporal input without prior knowledge of the value space, which is relevant in identifying outliers. Instead, we proposed discretizing values along with time, subsuming the value range step by step.

⁴Message Queuing Telemetry Transport
8 CONCLUSION
This paper studied the similarity problem in matching spatio-temporal, domain-specific data for ad hoc analytical tasks from heterogeneous sources. We devised a prefix approach called HSTM for matching objects, respecting temporal validity, spatial locality, and domain-specific similarity under heterogeneity. Utilizing a bind-join, we implemented a data loading technique to retrieve data from multiple kinds of sources, supporting time-discrete ad hoc analysis. The evaluation shows that HSTM achieves high performance with good Recall results. We are now integrating our approach into a practical solution concerning traffic demands, comparing its specific matching results on a broader scale, and improving transparency for the ad hoc task by decoupling more configuration from the user.
ACKNOWLEDGMENTS
This research was supported in part by the City of Hamburg (SmartOpenHamburg project).
REFERENCES
[1]
Danial Aghajarian, Satish Puri, and Sushil Prasad. 2016. GCMF: an ecient
end-to-end spatial join system over large polygonal datasets on GPGPU platform.
In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances
in Geographic Information Systems. 1–10.
[2]
Lars Arge, Octavian Procopiuc, Sridhar Ramaswamy, Torsten Suel, and Jef-
frey Scott Vitter. 1998. Scalable sweeping-based spatial join. In VLDB, Vol. 98.
Citeseer, 570–581.
[3]
Rudolf Bayer. 1997. The universal B-tree for multidimensional indexing: General
concepts. In International Conference on WorldwideComputing and Its Applications.
Springer, 198–209.
[4]
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation.
the Journal of machine Learning research 3 (2003), 993–1022.
[5]
Panagiotis Bouros, Shen Ge, and Nikos Mamoulis. 2012. Spatio-textual similarity
joins. Proceedings of the VLDB Endowment 6, 1 (2012), 1–12.
[6]
Moisés Castelo Branco, Javier Troya, Krzysztof Czarnecki, Jochen Küster, and Ha-
gen Völzer. 2012. Matching Business Process Workowsacross Abstraction Levels.
In Proceedings of the 15th International Conference on Model Driven Engineering
Languages and Systems (Innsbruck, Austria) (MODELS’12). Springer-Verlag, Berlin,
Heidelberg, 626–641. https://doi.org/10.1007/978-3- 642-33666-9_40
[7]
Michael J Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data inte-
gration for the relational web. Proceedings of the VLDB Endowment 2, 1 (2009),
1090–1101.
[8]
Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang.
2008. Webtables: exploring the power of tables on the web. Proceedings of the
VLDB Endowment 1, 1 (2008), 538–549.
[9]
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive
operator for similarity joins in data cleaning. In 22nd International Conference on
Data Engineering (ICDE’06). IEEE, 5–5.
[10]
Sanjay Chawla and Pei Sun. 2006. SLOM: A New Measure for Local Spatial
Outliers. Knowl. Inf. Syst. 9, 4 (2006), 412–429. https://doi.org/10.1007/s10115-
005-0200- 2
[11]
Yao-Yi Chiang, Bo Wu, Akshay Anand, Ketan Akade, and Craig A. Knoblock.
2014. A System for Ecient Cleaning and Transformation of Geospatial Data
Attributes. In Proceedings of the 22nd ACM SIGSPATIAL International Conference
on Advances in Geographic Information Systems. ACM, 577–580. https://doi.org/
10.1145/2666310.2666373
[12]
James Cliord and Albert Croker. 1987. The historical relational data model
(HRDM) and algebra based on lifespans. In 1987 IEEE Third International Confer-
ence on Data Engineering. IEEE, 528–537.
[13]
Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael
Stonebraker, Ahmed K Elmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouz-
zani, and Nan Tang. 2017. The Data Civilizer System.. In Cidr.
[14]
Dong Deng, Guoliang Li, Jianhua Feng, and Wen-Syan Li. 2013. Top-k string
similarity search with edit-distance constraints. In 2013 IEEE 29th International
Conference on Data Engineering (ICDE). IEEE, 925–936.
[15]
Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. 2009. Integrating
Conicting Data: The Role of Source Dependence. Proc. VLDB Endow. 2, 1 (Aug.
2009), 550–561. https://doi.org/10.14778/1687627.1687690
[16] Curtis E. Dyreson and Richard Thomas Snodgrass. 1993. Timestamp Semantics
and Representation. Information Systems 18, 3 (April 1993), 143–166. https:
//doi.org/10.1016/0306-4379(93)90034- X
[17]
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007.
Duplicate Record Detection: A Survey. IEEE Trans. on Knowl. and Data Eng. 19, 1
(Jan. 2007), 1–16.
[18]
Ronald Fagin, Laura M Haas, Mauricio Hernández, Renée J Miller, Lucian Popa,
and Yannis Velegrakis. 2009. Clio: Schema mapping creation and data exchange.
In Conceptual modeling: foundations and applications. Springer, 198–236.
[19]
Robert Fenk. 2002. The BUB-tree. In VLDB’02, Proceedings of 28th International
Conference on Very Large Data Bases. Citeseer.
[20]
Venkatesh Ganti and Anish Das Sarma. 2013. Data Cleaning: A Practical Perspec-
tive. Morgan & Claypool Publishers.
[21]
Mike Gashler, Christophe Giraud-Carrier, and Tony Martinez. 2008. Decision
tree ensemble: Small heterogeneous is better than large homogeneous. In 2008
Seventh International Conference on Machine Learning and Applications. IEEE,
900–905.
[22]
Daniel Glake, Fabian Panse, Norbert Ritter, Thomas Clemen, and Ula Lenfers.
2021. Data Management in Multi-Agent Simulation Systems. In BTW 2021, Kai-
Uwe Sattler, Melanie Herschel, and Wolfgang Lehner (Eds.). Gesellschaft für
Informatik, Bonn, 423–436. https://doi.org/10.18420/btw2021-22
[23]
Daniel Glake, Norbert Ritter, and Thomas Clemen. 2020. Utilizing Spatio-
Temporal Data in Multi-Agent Simulation. In Proceedings of the 2020 Winter
Simulation Conference, K.-H. Bae, B. Feng, S. Kim, S. Lazarova-Molnar, Z. Zheng,
T. Roeder, and R. Thiesing (Ed.). Society for Computer Simulation International.
[24]
Goetz Graefe. 1994. Volcano/spl minus/an extensible and parallel query evaluation
system. IEEE Transactions on Knowledge and Data Engineering 6, 1 (1994), 120–
135.
[25]
Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. 2014. Outlier
Detection for Temporal Data: A Survey. IEEE Trans. Knowl. Data Eng. 26, 9 (2014),
2250–2267. https://doi.org/10.1109/TKDE.2013.184
[26]
Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. 1997. Optimiz-
ing queries across diverse data sources. (1997).
[27]
Yeye He, Kris Ganjam, and Xu Chu. 2015. Sema-join: joining semantically-related
tables using big table corpora. Proceedings of the VLDB Endowment 8, 12 (2015),
1358–1369.
[28]
Yun-Wu Huang, Ning Jing, Elke A Rundensteiner, et al
.
1997. Spatial joins using R-
trees: Breadth-rst traversal with global optimizations. In VLDB, Vol. 97. Citeseer,
25–29.
[29] Urs Hunkeler, Hong Linh Truong, and Andy Stanford-Clark. 2008. MQT T-S—A
publish/subscribe protocol for Wireless Sensor Networks. In 2008 3rd International
Conference on Communication Systems Software and Middleware and Workshops
(COMSWARE’08). IEEE, 791–798.
[30]
Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM. https://doi.org/10.1145/
3310205
[31]
Edwin H Jacox and Hanan Samet. 2007. Spatial join techniques. ACM Transactions
on Database Systems (TODS) 32, 1 (2007), 7–es.
[32]
Christian S Jensen, Curtis E Dyreson, Michael Böhlen, James Cliord, Ramez
Elmasri, Shashi K Gadia, Fabio Grandi, Pat Hayes, Sushil Jajodia, Wolfgang Käfer,
et al
.
1998. The consensus glossary of temporal database concepts—February
1998 version. In Temporal Databases: Research and Practice. Springer, 367–405.
[33]
Martin Kaufmann, Panagiotis Vagenas, Peter M Fischer, Donald Kossmann, and
Franz Färber. 2013. Comprehensive and interactive temporal query processing
with SAP HANA. Proceedings of the VLDB Endowment 6, 12 (2013), 1210–1213.
[34]
Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Peter Boncz,
Thomas Neumann, and Alfons Kemper. 2018. Adaptive geospatial joins for
modern hardware. arXiv preprint arXiv:1802.09488 (2018).
[35]
Yufeng Kou and Chang-Tien Lu. 2017. Outlier Detection, Spatial. In Encyclopedia
of GIS. Springer, 1539–1546. https://doi.org/10.1007/978- 3-319- 17885-1_945
[36]
Ioannis K. Koumarelas, Lan Jiang, and Felix Naumann. 2020. Data Preparation
for Duplicate Detection. ACM J. Data Inf. Qual. 12, 3 (2020), 15:1–15:24. https:
//dl.acm.org/doi/10.1145/3377878
[37]
Georgia Koutrika, Benjamin Bercovitz, and H FlexRecs Garcia-Molina. [n.d.].
Expressing and combining exible recommendations. In Proceedings of the 35th
SIGMOD International Conference on Management of Data (SIGMOD’09), Provi-
dence, RI, USA, Vol. 29.
[38]
Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paul-
heim, and Christian Bizer. 2015. The mannheim search join engine. Journal of
Web Semantics 35 (2015), 159–166.
[39]
Chen Li, Jiaheng Lu, and Yiming Lu. 2008. Ecient merging and ltering algo-
rithms for approximate string searches. In 2008 IEEE 24th International Conference
on Data Engineering. IEEE, 257–266.
[40]
John Liagouris, Nikos Mamoulis, Panagiotis Bouros, and Manolis Terrovitis. 2014.
An eective encoding scheme for spatial RDF data. Proceedings of the VLDB
Endowment 7, 12 (2014), 1271–1282.
[41] Ming-Ling Lo and Chinya V Ravishankar. 1996. Spatial hash-joins. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. 247–258.
[42] Volker Markl. 2000. Mistral: Processing relational queries using a multidimensional access technique. In Ausgezeichnete Informatikdissertationen 1999. Springer, 158–168.
[43] Renée J Miller. 2018. Open data integration. Proceedings of the VLDB Endowment 11, 12 (2018), 2130–2139.
[44] Mohamed F Mokbel, Walid G Aref, and Ibrahim Kamel. 2003. Analysis of multi-dimensional space-filling curves. GeoInformatica 7, 3 (2003), 179–209.
[45] Jack A Orenstein and Tim H Merrett. 1984. A class of data structures for associative searching. In Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. 181–190.
[46] Jignesh M Patel and David J DeWitt. 1996. Partition based spatial-merge join. ACM SIGMOD Record 25, 2 (1996), 259–270.
[47] Dorian Pyle. 1999. Data Preparation for Data Mining. Morgan Kaufmann.
[48] Shuyao Qi, Panagiotis Bouros, and Nikos Mamoulis. 2013. Efficient top-k spatial distance joins. In International Symposium on Spatial and Temporal Databases. Springer, 1–18.
[49] Md Atiqur Rahman and Yang Wang. 2016. Optimizing intersection-over-union in deep neural networks for image segmentation. In International Symposium on Visual Computing. Springer, 234–244.
[50] Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. 2000. Integrating the UB-tree into a database system kernel. In VLDB, Vol. 2000. Citeseer, 263–272.
[51] Hans Sagan. 1994. Hilbert's space-filling curve. In Space-Filling Curves. Springer, 9–30.
[52] Craig Stanfill and David Waltz. 1986. Toward memory-based reasoning. Commun. ACM 29, 12 (1986), 1213–1228.
[53] Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, and Jianhua Feng. 2015. Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 519–530.
[54] Julius Weyl, Ula A Lenfers, Thomas Clemen, Daniel Glake, Fabian Panse, and Norbert Ritter. 2019. Large-scale traffic simulation for smart city planning with MARS. In Proceedings of the 2019 Summer Simulation Conference. 1–12.
[55] Randall T Whitman, Michael B Park, Bryan G Marsh, and Erik G Hoel. 2017. Spatio-temporal join on Apache Spark. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 1–10.
[56] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 97–108.
[57] Tilmann Zäschke, Christoph Zimmerli, and Moira C Norrie. 2014. The PH-tree: A space-efficient storage structure and multi-dimensional index. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 397–408.
[58] Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S. Yu. 2017. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. Proc. VLDB Endow. 10, 10 (2017), 1046–1057. https://doi.org/10.14778/3115404.3115410
[59] Donghui Zhang, Vassilis J Tsotras, and Bernhard Seeger. 2002. Efficient temporal join processing using indices. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 103–113.
[60] Erkang Zhu, Yeye He, and Surajit Chaudhuri. 2017. Auto-join: Joining tables by leveraging transformations. Proceedings of the VLDB Endowment 10, 10 (2017), 1034–1045.