Hierarchical Semantics Matching For Heterogeneous
Spatiotemporal Sources
Daniel Glake
Norbert Ritter
daniel.glake@uni-hamburg.de
ritter@informatik.uni-hamburg.de
Universität Hamburg
Germany
Florian Ocker
Nima Ahmady-Moghaddam
Daniel Osterholz
Ula Lenfers
Thomas Clemen
florian.ocker@haw-hamburg.de
nima.ahmady-moghaddam@haw-hamburg.de
daniel.osterholz@haw-hamburg.de
ula.lenfers@haw-hamburg.de
thomas.clemen@haw-hamburg.de
Hamburg University of Applied Sciences
Germany
ABSTRACT
Spatiotemporal data are semantically valuable information used for various analytical tasks to identify spatially relevant and temporally limited correlations within a domain. The increasing availability and acquisition of data from multiple sources with their typically high heterogeneity are receiving more and more attention. However, these sources often lack interconnecting shared keys, making their integration a challenging problem. For example, publicly available parking data that consist of point data on parking facilities with fluctuating occupancy and static location data on parking spaces cannot be directly correlated. Both data sets describe two different aspects from distinct sources in which parking spaces and fluctuating occupancy are part of the same semantic model object. Especially for ad hoc analytical tasks on integrated models, these missing relationships cannot be handled using join operations as usual in relational databases. The reason lies in the lack of equijoin relationships, which compare strings for equality, and in the additional overhead of loading data before processing. This paper addresses the optimization problem of finding suitable partners in the absence of equijoin relations for heterogeneous spatiotemporal data, applicable to ad hoc analytics. We propose a graph-based approach that achieves good recall and performance scaling via hierarchically separating the semantics along spatial, temporal, and domain-specific dimensions. We evaluate our approach using public data, showing that it is suitable for many standard join scenarios and highlighting its limitations.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
CIKM ’21, November 1–5, 2021, Virtual Event, QLD, Australia
©2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8446-9/21/11...$15.00
https://doi.org/10.1145/3459637.3482350
CCS CONCEPTS
• Information systems → Entity resolution; Join algorithms.
KEYWORDS
matching, processing, datasets, spatial, temporal, integration
ACM Reference Format:
Daniel Glake, Norbert Ritter, Florian Ocker, Nima Ahmady-Moghaddam, Daniel Osterholz, Ula Lenfers, and Thomas Clemen. 2021. Hierarchical Semantics Matching For Heterogeneous Spatiotemporal Sources. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM ’21), November 1–5, 2021, Virtual Event, QLD, Australia. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3459637.3482350
1 INTRODUCTION
Spatiotemporal data are nowadays more ubiquitous than ever before; examples include biological, meteorological, and urban socio-economic data, agricultural data, geotagged public data, and sensor data. With more and more public data portals and services becoming available, the volume of such data generated per day increases at a staggering and unprecedented rate. This provides opportunities for many different analytical tasks. But often, the data come with heterogeneous representations and without any specified semantic relationship.
Before applying analytical tasks, applications need to retrieve and prepare the data. This additional integration overhead restricts ad hoc analytics, which are becoming increasingly critical for short-term decision making (e.g., traffic flow redirection or construction site planning). As a concrete motivating example, adapted from an actual use case, consider an analytical task that helps to plan new e-loading stations by figuring out the density of parking occupation. The analysis has a collection of data layers with parking spaces and a stream of past and upcoming occupations for existing e-mobility stations. The e-mobility station, represented by a point, holds the actual occupation and is not directly related to the affected parking spaces. Relationships can be inferred only from spatial proximity in combination with a given number field describing how many parking spaces are associated with the station. The occupation data changes over time
and correlates with an existing parking layer, which is aggregated across the considered spatial extent. The main difficulty is the lack of equijoin relationships and of a public mapping between e-mobility stations and parking spaces. The matching has to rely on other descriptive aspects to deduce semantic relationships, considering not only the domain data but also the temporal and spatial dimensions.
To address this integration problem, we propose an iterative hierarchical solution for automatic integration, implemented in our existing analytical framework MARS. The approach matches data objects from heterogeneous sources with spatiotemporal and unknown domain dimensions. Our contributions in this paper are:
• Discussion of the problem of matching data with spatial, temporal, and domain-specific semantics.
• Proposal of approximate spatial, temporal, and general-purpose matching mechanisms for heterogeneous data.
• Discussion of technical integration.
• Recall, precision, and performance evaluation on real data sets.
The remainder of this paper is structured as follows: In Section 3, we describe the problem in abstract terms and discuss conflicts in spatial, temporal, and common data matching. Pursuant to the problem statement defined in Section 3.1, we discuss temporal-related matching in Section 4.1, spatial matching with multidimensional index-based access in Section 4.2, and a combined semantic similarity matching in Section 4.3 to capture unknown domains. In Section 5, we describe our solution for retrieving heterogeneous sources and how data are loaded to apply ad hoc analytics. In Section 6, we evaluate our approach against two scenarios. We discuss the results with the benefits and drawbacks of our approach in Section 7 and offer a conclusion and an overview of topics for future work in Section 8.
2 RELATED WORK
The problem of resolving matching partners in the absence of key candidates has been characterized in various works for different domains. Commonly used approaches such as join paths, manually managed or derived from the data set as in Data Civilizer [13] or Clio [18], describe possible joinable tables but require knowledge of foreign key relationships and the schema. Alternatively, finding linkable objects from distinct sources can be viewed as a similarity problem. In information retrieval, this has been studied well, focusing on set similarity with small comparison sets, such as keywords and small texts [14, 39, 53]. These approaches become applicable to larger domains by adapting deployed similarity metrics such as Jaccard to containment scores [43] or to extended-Jaccard in order to tolerate flexible token divergences. Other approaches such as [21, 52] consider heterogeneous attributes or combine multiple similarity metrics as in [37], tuning a weight for each metric accordingly. SEMA-JOIN [27], Auto-Join [60], the Mannheim Search Join Engine [38], and previous approaches to table extension [7, 56] on WebTables [8], on the other hand, are methods that statistically derive correlations from given corpora and identify pseudo-key relationships. However, these inference methods require large amounts of corpus data, such as the 100 million tables used in SEMA-JOIN [27]. Such volumes are not always available and are not readily manageable with regular personal infrastructure.
In the context of spatial data, existing work on spatial matching addresses, in addition to runtime aspects such as GPU utilization [1] or streaming of spatiotemporal data [34], the refinement of matching quality. The $\epsilon$-distance approaches [5] establish links to nearby objects. However, distance alone is not always sufficient for matching since nearby coordinate support points cannot always be determined for a large spanning area. In contrast, partition-based or minimum-bounding-box-based approaches decompose the space into regular grids [46] or rely on widely used R-tree join algorithms [28], and link objects based on geometric predicates (such as overlap). However, in the case of point data with minor spacing, data may also be excluded. In contrast, other approaches such as the Spatial-Hash Join [41] or the Scalable Sweeping-Based Spatial Join [2] are multi-assignment joins that assign multiple data objects to a target object. A good discussion of previous and current spatial join methods can be found in [31]. Work on the combination as a spatiotemporal composite has so far focused on runtime improvements through distributed computation such as Apache Spark [55], the use of in-memory databases [33], or better index structures [59]. Previous traditional spatial, temporal, and spatiotemporal joins were studied with respect to geometric or temporal dimensions, while other semantic data attached to the objects were widely ignored. The spatio-textual similarity join [5], a hybrid of the spatial intersection join and set similarity, the k-distance join [48], and the GeoRDF join [40] are among the first approaches to include semantic information along with spatial data, but none of them has considered temporal properties yet.
3 MATCHING PROBLEMS
Before looking into the distinct aspects of matching, we formulate the matching problem in Section 3.1. Then, we discuss general-purpose problems in Section 3.2 and spatial and temporal problems in Section 3.3, setting out the boundary conditions of this work.
3.1 Problem Statement
Consider an existing set $E = \{e_1, e_2, \ldots, e_i\}$ of data objects with a set $M = \{m_1, m_2, \ldots, m_j\}$ of possible matching candidates in the scope of a temporal-invariant ad hoc analysis, denoted as $\mathcal{A}$. Each object $e \in E$ (or $m \in M$) contains a spatial instance $\mathcal{G}$, a temporal valid time period $\mathcal{T}$, and an associated data set $\mathcal{D}$ with attributes and values. The spatial instance $\mathcal{G}$ is represented as a Minimum Bounding Rectangle (MBR) (or simply bounding box) and covers all information within a specific area, including points, lines, and polygons. We denote the MBR as $\mathcal{G} = [p_{bl}, p_{ur}]$, in which $p_{bl} = (p_{bl}.x, p_{bl}.y)$ and $p_{ur} = (p_{ur}.x, p_{ur}.y)$ specify the bottom-left and upper-right coordinates, respectively. The valid time period $\mathcal{T}$ is denoted as $\mathcal{T} = [t_s, t_e)$, describing an inclusive starting time $t_s$ and an exclusive end time $t_e$. The associated data $\mathcal{D}$ is a set of values $\mathcal{D} = \{v_1, v_2, \ldots, v_k\}$ with different attribute types. A value $v_i$ of an object $e \in E$ (or $m \in M$) can be linear (e.g., e-mobility station Price or Availability) or nominal (or symbolic) (e.g., parking space with Occupation or Kind), and a linear value can be continuous or discrete. To quantify the similarity between two objects $e_i \in E$ and $m_j \in M$, we consider each of the three dimensions of interest (temporal, spatial, and domain-specific semantics) separately. For
temporal data, we consider an intersection of data objects to occur when their time periods overlap.
Definition 3.1. Given two objects $e$ and $m$, their temporal intersection is defined as:
$$T(e, m) = \mathcal{T}_e \cap \mathcal{T}_m \neq \emptyset \qquad (1)$$
For the spatial dimension, the MBR extent of an object needs to intersect with that of a matching partner, applying set similarity with a given $\epsilon$-distance:
Definition 3.2. Given two objects $e$ and $m$, a precision $\epsilon$, and the prefix function $\rho$, the spatial similarity is given by the set similarity of matching prefixes of coordinates with length $\epsilon$. Their spatial instances $\mathcal{G}$ correlate by $\epsilon$-distance on prefix:
$$S(e, m) = \sum_{i=0}^{\min(\epsilon,\, |\rho(\mathcal{G}_e)|,\, |\rho(\mathcal{G}_m)|)} \frac{|\rho(\mathcal{G}_e)_i \cap \rho(\mathcal{G}_m)_i|}{|\rho(\mathcal{G}_e)_i \cup \rho(\mathcal{G}_m)_i|} \qquad (2)$$
For the domain-specific dimensions, a feature-based similarity is used to infer the relationships, using the weight function $f_w: \mathcal{D} \to \mathbb{R}$ as follows:
Definition 3.3. Given two objects $e$ and $m$ and a threshold $\tau_D$ with $0 \le \tau_D \le 1$, the goal is to identify the pairs $(a_i, b_j)$, $a_i \in e$ and $b_j \in m$, that are applicable to similarity and satisfy $sim(e, m) \ge \tau_D$.
Given the definitions above and objects $e$ and $m$, we are concerned with: (1) Temporal constraints: their temporal intersection is not $\emptyset$. (2) Spatial constraints: their spatial similarity is larger than a spatial threshold $\tau_s$, i.e., $S(e, m) > \tau_s$, or the minimized object $\min\left(\bigcup_{i=0}^{k} \bigcup_{j=0}^{l} S(e_i, m_j)\right)$ is selected if it exists. (3) Semantic constraints: the loss $D(e, m)$ derived from the similarity of their heterogeneous associated data does not exceed a semantic similarity threshold $\tau_d$, i.e., $D(e, m) \le \tau_d$ (see Eq. 16). Given the constraints, we formulate the matching problem as follows.
Definition 3.4. Given two collections $E = \{e_1, e_2, \ldots, e_n\}$ and $M = \{m_1, m_2, \ldots, m_k\}$ and two similarity thresholds $\tau_s$ and $\tau_d$, a similarity join finds all matching partners $(e_i, m_j)$ where $\mathcal{T}_e \cap \mathcal{T}_m \neq \emptyset$, $S(e, m) > \tau_s$, and $D(e, m) \le \tau_d$.
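To make the problem statement concrete, the following is a minimal sketch of the object model and of the similarity join read as a naive nested loop. The class name `SpatioTemporalObject`, the helper `periods_intersect`, and the function `similarity_join` are hypothetical names chosen for illustration; the spatial similarity and the semantic loss are passed in as plain callables because their concrete form is defined later in the paper. The temporal condition is read here as a non-empty intersection of half-open periods, in line with Definition 3.1.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SpatioTemporalObject:
    """Illustrative object with an MBR, a valid period, and associated data."""
    mbr: Tuple[float, float, float, float]   # (x_bl, y_bl, x_ur, y_ur)
    period: Tuple[int, int]                  # half-open [t_s, t_e)
    data: Dict[str, object] = field(default_factory=dict)


def periods_intersect(e: SpatioTemporalObject, m: SpatioTemporalObject) -> bool:
    # Two half-open periods share at least one chronon.
    return max(e.period[0], m.period[0]) < min(e.period[1], m.period[1])


def similarity_join(
    E: List[SpatioTemporalObject],
    M: List[SpatioTemporalObject],
    spatial_sim: Callable[[SpatioTemporalObject, SpatioTemporalObject], float],
    semantic_loss: Callable[[SpatioTemporalObject, SpatioTemporalObject], float],
    tau_s: float,
    tau_d: float,
) -> List[Tuple[SpatioTemporalObject, SpatioTemporalObject]]:
    # Naive O(|E| * |M|) evaluation of the three constraints; the order of
    # the checks mirrors the temporal, spatial, semantic sequence.
    return [
        (e, m)
        for e in E
        for m in M
        if periods_intersect(e, m)
        and spatial_sim(e, m) > tau_s
        and semantic_loss(e, m) <= tau_d
    ]
```

The nested loop is only a reference point for the semantics of the join; it says nothing about how the approach described in Section 4 avoids the quadratic comparison.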
Figure 1 shows an example of domain-specific spatiotemporal objects, each associated with a set of possible heterogeneous data values (e.g., $\{a, b, \ldots, k, l, \ldots, x, y, z\}$). The rectangles $R_i$ represent MBRs that are valid at different times (e.g., $R_i'$ or $R_i''$) and for which a temporal change can affect the position of the geometric instance (e.g., $R_3 \mapsto R_3'$), their values, or both. As objects $R_2''$ and $R_5'$ show, changes can also express themselves by making new objects valid at specified points in time, denoted as $'$ or $''$, exposing semantic data $\{x\}$ or nothing ($\emptyset$) to an ad hoc analysis. When an object's semantic data change, they can be affected in one of the following ways: (1) Simple attribute value changes ($\{x\}_{R_1} \mapsto \{x'\}_{R_1'}$). (2) Added or removed attributes ($\{x, y, z\}_{R_3} \mapsto \{x, y\}_{R_3'} \mapsto \{x, y, z\}_{R_3'}$). (3) Combined changing actions in one step ($\{k, l, m\}_{R_7'} \mapsto \{k', l'\}_{R_7''}$).
3.2 Common Matching Problems
Many of the challenges of identifying matching data objects via attribute comparison can be classified into a few kinds of heterogeneity [17]. Differences between structural properties of data sources restrict how their data objects can be related to each other. Therefore, the bulk of many data integration endeavours is devoted to minimizing heterogeneity, thereby increasing the integrability of the data sources and the comparability of the data objects at hand. Such exercises are part of data cleanup, preprocessing, homogenization, and feature preparation and are a prerequisite to any meaningful search for matches.
[Figure 1 object labels: $R_1 = \{x\}$, $R_1' = \{x'\}$, $R_2'' = \{a, b\}$, $R_3 = \{x, y\}$, $R_3' = \{x, y, z\}$, $R_3'' = \{x, y\}$, $R_3''' = \{x\}$, $R_4 = \{a, b\}$, $R_4' = \{a, b\}$, $R_5' = \emptyset$, $R_6' = \{a, b, c\}$, $R_6'' = \{a, b, c\}$, $R_6''' = \{x, y, z\}$, $R_7' = \{k, l, m\}$, $R_7'' = \{k', l'\}$]
Figure 1: Example spatiotemporal data objects with different associated data objects.
A set of data sources is syntactically heterogeneous when any of its pairwise data objects are unequal in their forms of representation. For instance, the comparability of data objects from relational, aggregate-, object-, and matrix-oriented data sources is compromised due to the different structural composition of the data storage systems. However, even if the integration of heterogeneous sources is possible, the search for matches might be hindered by syntactic heterogeneity among the data objects to be compared.
Given two data objects with a comparable attribute that has a cardinal scale of measure, semantic heterogeneity might arise due to the present representation of the numbers. For instance, the attribute values of one data object might be normalized, whereas those of the other data object might not be. Similarly, when integrating external data into a model, conformity issues between the schema of the external source and the data representation in the model might impede comparability. This occurs when, for example, the nominal values of two data objects are represented literally in one and encoded numerically in the other case.
Content-based heterogeneity describes the occurrence of attribute values from different data objects that are semantically valid and, therefore, equally likely to be true. When faced with conflicting data objects, one object's values must be chosen over the other, effectively rejecting those of the latter as false. How is such an ambiguity best resolved? In the decision-making process, the quality and integrity of the data sources are likely a consideration [15]. Ideally, these conflict resolutions should occur within an automated and configurable data integration pipeline, as making manual case-by-case decisions is not feasible.
3.3 Spatial and Temporal Matching Problems
In addition to the general challenges that arise from different types of data heterogeneity, more specific issues apply when analyzing spatial and temporal data. For instance, geospatial data encoded in different coordinate reference systems (e.g., WGS 84 or UTM) are not directly comparable because the geographic position of the objects is described differently. In addition, even within the same reference system, comparability might be impeded by different standards employed in representing data. These standards typically arise from the domain (e.g., aviation), resulting in semantic degradation, in which different instances describe the same object on multiple abstraction levels [6]. For example, spatial geometry types (point, cell, line, polygon) describe individual parking lots occupied by vehicles or areas occupied by multiple cars. These objects are not, or only partially, comparable.
Concerning temporal data, the issue of outliers on the time axis is often of particular concern [25, 58]. A disruption in the temporal continuity of a time series whose data objects are ordered on a time axis occurs when there exists at least one interval on the time axis for which no data object's period is valid. In case of such anomalies, temporal and other data attributes need to be corrected (if they are erroneous) while maintaining as much data integrity as possible.
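The notion of a disruption in temporal continuity can be illustrated with a small sketch that scans a set of validity periods and reports every stretch of the time axis, between the earliest start and the latest end, that no period covers. The function name `find_gaps` and the representation of periods as half-open integer pairs are assumptions made for this illustration; how such gaps are subsequently corrected is not shown.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # half-open [t_s, t_e)


def find_gaps(periods: List[Period]) -> List[Period]:
    """Return the maximal uncovered intervals between the first start and the last end."""
    gaps: List[Period] = []
    covered_until = None
    for start, end in sorted(periods):
        if covered_until is not None and start > covered_until:
            # No period is valid between covered_until and start.
            gaps.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps
```

For example, the periods `(0, 5)`, `(3, 6)`, `(5, 8)`, and `(10, 12)` leave the interval `(8, 10)` uncovered.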
4 INTEGRATING SPATIOTEMPORAL DATA
Due to the heterogeneity of the data sources and the absence of comparability via equijoins, we introduce Hierarchical Spatiotemporal Matching for heterogeneous sources, denoted as HSTM. The approach is hierarchical because the integration of spatiotemporal data must address spatial and temporal aspects separately. Links between objects are constructed within a multidimensional space by their spatial proximity, temporal validity, and semantic similarity. This results in a graph structure, which successively builds up according to the analysis progress. Spatial integration addresses the challenge of finding matches above a given precision threshold and distinguishing them from merely similar objects, whereas temporal integration requires populating data according to their validity time frame.
4.1 Temporal Validity
The temporal validity of data and temporal data referring to the same object require a well-defined (dis)integration mechanism based on selection. In order to find new valid and invalid values, we propose a definition of two types of joins linking new temporally valid data $M$ of a temporal catalogue to an existing collection $E$. More precisely, the definition is based on a single data model which has its roots in relational implementations [32]. HSTM assigns data objects to an interval, and the time axis is divided into intervals of minimum duration. Each such interval is called a chronon [16], the smallest unit possible; a period is formed by its enclosing start and end chronons. For this purpose, we imply a data schema for temporal objects $M$ in conjunction with our existing collection $E$ described in Section 3.1. Each data object has its own schema $S = (a_i, \ldots, a_n, \mathcal{T})$ with an arbitrary set of data attributes $a_i, \ldots, a_n$, $1 \le i \le n$, and a time period $\mathcal{T} = [t_s, t_e)$. The time period captures the time during which the information recorded by the attributes applies (or applied or will apply). We use $S_E$ and $S_M$ as shorthand for $\{a_1, \ldots, a_n\}$ and $\{b_1, \ldots, b_m\}$ and define $e$ as an instance of $E$ as well as $m$ of $M$. The time interval of the current analysis is defined as $\mathcal{A} = [t_s, t_c)$, with start $t_s$ and current time $t_c$ of the analytical task.
[Figure 2 labels: next invalid object, next valid object, current simulation step, finished steps]
Figure 2: HSTM update and remove process with valid and invalid window.
Figure 2 shows the analysis scope overlapping with or intersecting valid and invalid values. HSTM updates the objects $e_i$ with new data when there is an overlap between the validity period and the analysis time $t_c$. In contrast to overlapping, the intersection of the periods identifies invalid data, with subsequent removal of $e_j$ when no more changes are provided. Therefore, we define two auxiliary functions for intersecting and overlapping. For a more precise formulation, let $f$ (short for first) and $l$ (short for last) return the smallest and largest, respectively, of two argument chronons. Also, let $U_s$ and $U_e$ be the start and end chronons of $U$, likewise $V_s$ and $V_e$ of $V$. Equation 3 shows $intersect(U, V)$ with $U$ and $V$ as intervals, returning $true$ exactly when the start and end of $U$ are within the scope of $V$:
$$intersect(U, V) = U_s \ge V_s \wedge U_e < V_e \qquad (3)$$
Equation 4 shows $overlap(U, V)$, which returns exactly the maximal interval common to the two arguments. If there is no overlap, $\emptyset$ is returned:
$$overlap(U, V) = \begin{cases} [l(U_s, V_s), f(U_e, V_e)] & \text{if } l(U_s, V_s) \le f(U_e, V_e) \\ \emptyset & \text{otherwise} \end{cases} \qquad (4)$$
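The two auxiliary functions of Equations 3 and 4 translate almost directly into code. The sketch below assumes that chronons are integers and that an interval is a `(start, end)` pair; `None` stands in for the empty result $\emptyset$. These representation choices are assumptions of the example, not part of the definitions.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start chronon, end chronon)


def intersect(u: Interval, v: Interval) -> bool:
    """Eq. 3: true exactly when U starts no earlier than V and ends before V ends."""
    return u[0] >= v[0] and u[1] < v[1]


def overlap(u: Interval, v: Interval) -> Optional[Interval]:
    """Eq. 4: the common interval of U and V, or None if they do not overlap."""
    start = max(u[0], v[0])  # l(U_s, V_s): the later of the two starts
    end = min(u[1], v[1])    # f(U_e, V_e): the earlier of the two ends
    return (start, end) if start <= end else None
```

Note that the condition in Equation 4 uses $\le$, so two intervals that merely touch, such as `(1, 4)` and `(4, 6)`, yield the degenerate result `(4, 4)` rather than `None`.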
The expression $E \bowtie^{T}_{P(e_{\mathcal{T}}, m_{\mathcal{T}})} M$ describes a $\theta$-join and $E \ltimes^{T}_{P(e_{\mathcal{T}}, m_{\mathcal{T}})} M$ describes a semijoin between the object instance $e \in E$ on the left side and the new data object $m \in M$ on the right side, satisfying a condition $P: E \times M \to \{true, false\}$. $P$ is a custom predicate, delegating further matching conditions to the underlying hierarchy (spatial and semantic). The temporal selection reduces the number of possible matching candidates between objects in $M$ and existing ones in $E$ to a limited set of candidates, which then form the basis for subsequent steps of HSTM. We use $e_{\mathcal{T}} \subseteq S_E$ and $m_{\mathcal{T}} \subseteq S_M$ as the explicit join attributes. The semijoin, in contrast, retrieves the reduced set of data objects from the left side that fully intersect with the actual analysis interval, marking all of these objects as invalid for the analysis.
In addition to preventing duplicate values, we collapse mutually equal non-timestamp values and non-valued tuples with equal timestamps into a single data object [32]. The coalesce operation omits missing values or those which are equal to data of upcoming points in time. The definition consists of merging objects whose time is value-equal, preventing the return of invalid times outside the scope of $M$. We separate the definition into the auxiliary functions in Equations 5 and 6. The first function looks for candidates of an arbitrary data object $x$ in which the attribute value for the specified $b'$ is value-equal and duplicate entries in time exist, by intersecting the period with the existing data object $z$:
$$collapse(z, r) = \exists x \in r\, \big(z[b'] = x[b'] \implies intersect(x[\mathcal{T}], z[\mathcal{T}]) \wedge (\forall x' \in r\, (x[b'] = x'[b'] \implies intersect(x'[\mathcal{T}], z[\mathcal{T}])))\big) \qquad (5)$$
The second auxiliary function (see Eq. 6) ensures that no invalid chronon is returned that is outside the valid time scope of the considered data object $z$ when two elements are equal:
$$invalid(z, r) = \forall t \in z[\mathcal{T}]\ \exists x \in r\, (z[b'] = x[b'] \wedge t \ge x[t_s] \wedge t < x[t_e]) \qquad (6)$$
Equation 7 describes the collapsing of equal values for timestamped data and incorrect times:
$$coalesce(r) = \{z \mid collapse(z, r) \wedge invalid(z, r)\} \qquad (7)$$
With the definitions above, the temporal selection can be defined as follows:
$$\sigma^{T}_{P(\mathcal{A}, m)} M = \{x \mid \exists m \in coalesce(M): x[\mathcal{T}] = overlap(m[\mathcal{T}], \mathcal{A}) \wedge x[\mathcal{T}] \neq \emptyset\} \qquad (8)$$
Only collapsed data objects in correlation with $\mathcal{A}$ are used.
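A simplified reading of the coalescing and temporal selection steps can be sketched as follows. This is not a literal transcription of Equations 5 to 8: the sketch assumes that each record is a `(value, period)` pair with a hashable value, that value-equal records whose half-open periods overlap or meet are merged into one, and that the selection clips each merged period to the analysis window and drops empty results. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Period = Tuple[int, int]            # half-open [t_s, t_e)
Record = Tuple[Hashable, Period]    # (attribute value, valid period)


def coalesce(records: List[Record]) -> List[Record]:
    """Merge value-equal records whose periods overlap or are adjacent."""
    by_value: Dict[Hashable, List[Period]] = defaultdict(list)
    for value, period in records:
        by_value[value].append(period)
    merged: List[Record] = []
    for value, periods in by_value.items():
        periods.sort()
        cur_s, cur_e = periods[0]
        for ts, te in periods[1:]:
            if ts <= cur_e:                 # overlapping or meeting: extend
                cur_e = max(cur_e, te)
            else:                           # disjoint: emit and start anew
                merged.append((value, (cur_s, cur_e)))
                cur_s, cur_e = ts, te
        merged.append((value, (cur_s, cur_e)))
    return merged


def temporal_selection(records: List[Record], analysis: Period) -> List[Record]:
    """Clip coalesced records to the analysis window and drop empty periods."""
    a_s, a_c = analysis
    selected: List[Record] = []
    for value, (ts, te) in coalesce(records):
        start, end = max(ts, a_s), min(te, a_c)
        if start < end:                     # non-empty half-open interval
            selected.append((value, (start, end)))
    return selected
```

With an analysis window of `(4, 11)`, two adjacent `"free"` records over `(0, 3)` and `(3, 6)` collapse into one and are clipped to `(4, 6)`, so the analysis sees a single object instead of a duplicate.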
4.2 Spatial Subdivision
Spatial similarity is determined by the prefix filter method [9]. Two objects $a$ and $b$ are considered to be similar when they share a common set of prefix tokens. A multidimensional database consists of points belonging to connected and fine-granular multidimensional objects within a space $S$ over dimension $d$. A spatial database, denoted $\mathcal{U}_{spatial} \subseteq \mathcal{U}$, is a special case of the multidimensional database, the entire comparison universe $\mathcal{U}$, where $d = 2$. As mentioned in Section 3.1, spatial objects are considered as rectangular spaces, containing the MBRs of the comparison objects. We use prefix filtering with 2D linearization, decomposing the space into individual bounding boxes. HSTM assigns spatial data $\mathcal{G}$ to one of the spaces having an $\epsilon$ common prefix and enters it into a Patricia trie. If two objects $e$ and $m$ share a prefix valid up to $\epsilon$ elements, they are also adjacent within $\epsilon$-distance. Therefore, we associate the space $S$ with a hierarchy consisting of a set of nodes $\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$ whose elements each constitute a subset of the space $S_i \subseteq S$ for which the transitive property holds. The set of leaf nodes $\mathcal{L} \subseteq \mathcal{V}$ in the spatial hierarchy points to concrete coordinates of the input space $S$. Each parent $v_k$ of a node $v_i$ points to $S_k$ such that $S_i \subseteq S_k$ holds. Conversely, all children $v_i$ of a node $v_k$ in the hierarchy subsume the whole space of $S_k$:
$$\forall v_j \in \{v_h \mid v_h \in children(v_i) \wedge v_h \neq v_k\}: S_j \cap S_i \neq \emptyset \qquad (9)$$
For the induction of the parent space $S_k$ from its child spaces $S_i$, the following condition holds:
$$S_k = \bigcup_{v_i \in children(v_k)} S_i \qquad (10)$$
Each branch node $v_i$ within the hierarchy that is not part of the leaf node set $\mathcal{L}$ constitutes a subspace of the total space. To match the objects, HSTM applies a spatial range query, so that range conditions are satisfied by the intersection of the MBRs with each other. Formally, a spatial query, denoted as $q$, has a range specification of the form $q_{range} = [q_{bl}, q_{ur}]$, in which $q_{bl} = (q_{bl}.x, q_{bl}.y)$ and $q_{ur} = (q_{ur}.x, q_{ur}.y)$ satisfy the conditions $q_{bl}.x \le q_{ur}.x$ and $q_{bl}.y \le q_{ur}.y$. The specification is equivalent to the description in Section 3.1. We define an auxiliary function $space: \mathcal{V} \times \mathbb{N} \to S$ which returns each space for a particular level within the hierarchy.
Given a range query $q$ with specification $q_{range} = [q_{bl}, q_{ur}]$, an associated two-dimensional point $p \in \mathcal{U}_{spatial}$ is contained within the query $q$ if and only if $q_{bl}.x \le p.x \le q_{ur}.x$ and $q_{bl}.y \le p.y \le q_{ur}.y$. In conjunction with the range in $q$, two considered spatial objects match when they share their MBR. Therefore, the $intersect$ equivalent to Equation 3 is introduced. To simplify readability, for Equation 11 we briefly assume $a = \mathcal{G}_e$ and $b = \mathcal{G}_m$:
$$intersect(a, b) = (a_{bl}.x \le b.x \le a_{ur}.x \wedge a_{bl}.y \le b.y \le a_{ur}.y) \vee (b_{bl}.x \le a.x \le b_{ur}.x \wedge b_{bl}.y \le a.y \le b_{ur}.y) \qquad (11)$$
The spatial matching predicate $P_S: E \times M \to \{true, false\}$ on two data sets $E$ and $M$ correlates the spatial instances $\mathcal{G}$ with each other:
$$P_S(e_{\mathcal{G}}, m_{\mathcal{G}}) = intersect(e_{\mathcal{G}}, m_{\mathcal{G}}) \qquad (12)$$
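The point containment test and the spatial predicate can be sketched with a small rectangle type. The `MBR` tuple and the function names are assumptions of this example. For the rectangle test, the sketch uses the standard per-axis overlap check for axis-aligned boxes rather than a literal transcription of Equation 11; it accepts every pair in which a representative point of one box lies inside the other, and additionally pairs that overlap without containing each other's representative point.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle [p_bl, p_ur]."""
    x_bl: float
    y_bl: float
    x_ur: float
    y_ur: float


def contains_point(q: MBR, x: float, y: float) -> bool:
    # A point lies in the range query iff it is within both axis ranges.
    return q.x_bl <= x <= q.x_ur and q.y_bl <= y <= q.y_ur


def mbr_intersect(a: MBR, b: MBR) -> bool:
    # Two boxes intersect iff their projections overlap on both axes.
    return (a.x_bl <= b.x_ur and b.x_bl <= a.x_ur
            and a.y_bl <= b.y_ur and b.y_bl <= a.y_ur)


def spatial_predicate(e_mbr: MBR, m_mbr: MBR) -> bool:
    """P_S in the sense of Eq. 12: delegate to the intersection test."""
    return mbr_intersect(e_mbr, m_mbr)
```

Boxes that only touch along an edge count as intersecting under this test, because all comparisons are inclusive.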
HSTM considers geometry on the spatial scale. Together with the temporal and semantic aspects, new data is temporally constrained and summarized over similar geometries using approximate rasterization. Therefore, we linearize the MBR geometric objects and construct a space-filling curve for efficient use of range queries [45]. Range queries on $Z$-curves are relevant for efficient implementations of multidimensional indexes such as the Universal B-tree (UB-tree) [3, 42, 50], the BUB-tree [19], and the PH-tree [57]. These trees interleave some or all bits of each dimension of a stored $k$-dimensional point $p = (p_0, p_1, \ldots, p_{k-1})$ into a bit string, called the $Z$-address $z$, representing a coordinate on a space-filling $Z$-curve. The lexicographic ordering of values encoded in $Z$-addresses is the $Z$-ordering; see [51] for more discussion. A comparison of $Z$-curves with other space-filling curves is given in [44]. For our 2-dimensional space, we only need to consider $k = 2$:
Definition 4.1. Given a dimension $d$ and a point $p_{0 \le i \le d}$, a $Z$-address $z$ with length $\epsilon$ is a bit string of this point consisting of $\epsilon \times d$ bits, where $0 \le p_i \le 2^{\epsilon}$ and $\epsilon$ describes the highest bit of each value $p_i$.
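Bit interleaving for the two-dimensional case can be sketched as follows. The function names, the choice to place the x bits on even positions and the y bits on odd positions, and the restriction to non-negative integer grid coordinates below $2^{bits}$ are assumptions of this example; real coordinates would first have to be quantized onto such a grid. The helper `common_prefix_length` illustrates why shared prefixes indicate proximity: two cells whose bit strings agree on a long prefix lie in the same subdivided quadrant.

```python
def z_address(x: int, y: int, bits: int) -> int:
    """Interleave the lowest `bits` bits of x and y into one Z-address."""
    limit = 1 << bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("coordinates must fit into the given number of bits")
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bit goes to even position
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bit goes to odd position
    return z


def z_bitstring(x: int, y: int, bits: int) -> str:
    """Z-address as a fixed-width bit string of 2 * bits characters."""
    return format(z_address(x, y, bits), "0{}b".format(2 * bits))


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters on which two bit strings agree."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n
```

With three bits per dimension, the cells `(2, 2)` and `(3, 3)` map to `001100` and `001111` and share a prefix of length four, which places them in the same quadrant two levels below the root.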
For a given spatial database $\mathcal{U}_{spatial}$ with hierarchy $H$, we create a space-filling $Z$-curve and place each MBR of a spatial object in this space. Figure 3 shows an exemplary subdivided space of multiple MBRs, denoted as $R_i$, and a $Z$-curve range query $q_{range}$ intersecting a subset of objects. To locate matching points of MBRs inside this range, HSTM traverses the tree, locating nodes that intersect a $Z$-value range (e.g., $[15, 60]$ in Figure 3). Once it reaches the end of a node, it returns to the parent and moves on to the next one, stopping when the upper range limit of the $Z$-value is reached. In this manner, the approach approximately retrieves all objects with an intersecting MBR of an element $m$ by performing a window query with the box defined by $q_{bl}$ and $q_{ur}$. The window query is transformed into its $z$ bit-string range. Starting at the root, we look for the first $\epsilon$ subspaces given by $children(r)$ and check whether they overlap.
Figure 3: $Z$-curve and $z$-values of an MBR sample in 2D space. Leaf nodes (green), intermediate nodes (orange), and root (green).
4.3 Content-Based Similarity
For semantic similarity, we consider two kinds of well-known metrics. For linear data in $\mathcal{D}$, we utilize the Cosine distance, which is a popular similarity measure treating two linear inputs as vectors in space and computing the cosine between them:
$$\cos(\mathcal{D}_e, \mathcal{D}_m) = \frac{\mathcal{D}_e \cdot \mathcal{D}_m}{\|\mathcal{D}_e\| \, \|\mathcal{D}_m\|} \qquad (13)$$
For nominal data, in contrast, we utilize the Jaccard similarity and propose an extension that accounts for nominal values contained in both $\mathcal{D}_e$ and $\mathcal{D}_m$ and for marginal differences in tokens, which make soft approaches necessary. Therefore, we compute the common and distinctive token differences and apply a given weighting function. Each value $v \in \mathcal{D}$ is important to a different degree, expressed by an assigned weight $f_w: \mathcal{D} \to \mathbb{R}$ (e.g., obtained via Latent Dirichlet Allocation [4]). The weighted sums $dv(\mathcal{D}_e, \mathcal{D}_m) = \sum_{v \in \mathcal{D}_e \cap \mathcal{D}_m} f_w(v)$ and $cv(\mathcal{D}_e, \mathcal{D}_m) = \sum_{v \in \mathcal{D}_e \cup \mathcal{D}_m} f_w(v)$ are put in proportion according to the Jaccard coefficient:
$$sim(\mathcal{D}_e, \mathcal{D}_m) = \frac{dv(\mathcal{D}_e, \mathcal{D}_m)}{cv(\mathcal{D}_e, \mathcal{D}_m)} \qquad (14)$$
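Both measures are short enough to write out. The sketch below assumes that linear data are given as equally long numeric sequences and nominal data as sets of hashable values; the function names and the default unit weight are assumptions of the example. The weighted Jaccard variant divides the summed weights of the shared values by the summed weights of all values, following Equation 14.

```python
import math
from typing import Callable, Hashable, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Eq. 13: cosine of the angle between two numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0   # zero vectors: define as 0


def weighted_jaccard(d_e: Set[Hashable], d_m: Set[Hashable],
                     weight: Callable[[Hashable], float] = lambda v: 1.0) -> float:
    """Eq. 14: weight of shared values over weight of all values."""
    shared = sum(weight(v) for v in d_e & d_m)
    total = sum(weight(v) for v in d_e | d_m)
    return shared / total if total else 0.0
```

With unit weights, the sets `{"a", "b"}` and `{"b", "c"}` score 1/3; if `"b"` is weighted twice as heavily as the other values, the score rises to 0.5, which shows how the weight function shifts the result towards the values deemed important.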
Both metrics are widely used in practice but come with some major drawbacks. Used on its own, neither Jaccard nor Cosine can handle heterogeneous attributes, consisting of nominal and linear data in $\mathcal{D}$ for objects $e$ and $m$. Jaccard is well-suited for nominal data but becomes imprecise when string representations are marginally different from each other, so that the equality required by the intersection and union operations is not satisfied. In addition, Jaccard is less suitable since it finds frequent correlations when comparing high-dimensional with low-dimensional data in $\mathcal{D}$. Cosine fits better when values have a linear attribute type but suffers from co-rating items even if there exists a high difference in one value. Therefore, we propose a hybrid in which linear and nominal attribute values are considered separately for a continuous input of the analytical model.
With the functions $dv$ and $cv$, we define the domain-specific similarity $sim_H$ as a fraction that returns the similarity between two input attribute-value sets $e$ and $m$ in the interval $[0, 1]$ as a loss, based on the idea of soft-Jaccard [49]:
$$sim_H(e, m) = 1 - \frac{sim(e[\mathcal{D}], m[\mathcal{D}])}{\|e[\mathcal{D}]\|_1 + \|m[\mathcal{D}]\|_1 - sim(e[\mathcal{D}], m[\mathcal{D}])} \qquad (15)$$
where $\mathcal{D}_e$ and $\mathcal{D}_m$ are the sets of all attributes of the objects. The function $sim(\mathcal{D}_e, \mathcal{D}_m)$ returns the similarity between two divergent attribute sets, using the flexible comparison defined above. The result is aggregated and reduced to a loss, used to define the similarity constraints discussed in Section 3.1:
$$D(e, m) = sim_H(e, m) \le \tau_D \qquad (16)$$
4.4 Hierarchical Matching
Finally, the three matching dimensions have to be combined into a coherent procedure in which the number of possible matching candidates is successively reduced to one or zero. The order in which the matching algorithms are executed depends on the data availability of the individual part.
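$$E \rtimes^{D}_{sim_H(e[\mathcal{D}],\, m[\mathcal{D}]) \le \tau_D} \Big( E \rtimes^{S}_{P(e[\mathcal{G}],\, m[\mathcal{G}])} \big( \sigma^{T}_{P(\mathcal{A},\, m[\mathcal{T}])}\, M \big) \Big) \quad (17)$$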
First, temporal matching is executed, reducing the matching objects to valid partners based on the current timestamp. It has no direct dependency on existing objects. Next, the remaining matching partners are joined with existing entities using the spatial matching prefix. The candidates are now reduced by temporal validity and spatial proximity. Finally, the previous set of objects is semantically joined with an existing object in $E$. Property information, nominal and linear, is used to determine whether a candidate can be matched to an $e$ (duplicate) or is integrated as a new one. When an existing entity is updated, the question of merging previous and new information has to be faced. From a temporal perspective, the lifetime of $e$ is prolonged. Concerning the spatial dimension, $e$ can either adopt the new geometry or merge geometries. Finally, data integration is a task that depends even more than spatial integration on the requirements of the model at hand. A simple update of $e$ is possible, but it remains questionable without knowledge of the quality and purpose of the sources or insight into the domain.
Precisely, with the separate steps in Eq. 17, the definition of $overlap(U, V)$ in Eq. 4, the predicates $P_S$ and $P_D$ over $E \times M \to \{true, false\}$, and the auxiliary function $coalesce(r)$ in Eq. 7, we define the complete match and fusion. The process of getting all next valid data is described in Equation 18 and consists of finding those matches that intersect the actual analytical progress, whereas
the process of retrieving invalid data objects is described in Equation 19. Again, we use $S_E = \{a_i, \ldots, a_n\}$ and $S_M = \{b_i, \ldots, b_m\}$ for the attributes of the elements in $E$ and $M$, respectively:
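$$
\begin{aligned}
E \bowtie^{HSTM} M = \{\, x \mid\ & \exists e \in E,\ \exists m \in coalesce(M) \\
& \big( P(e, m)^{S}_{\epsilon} \wedge P(e, m)^{D}_{\tau_D}\ \wedge \\
& x[\mathcal{T}] = overlap(m[\mathcal{T}], \mathcal{A}) \wedge x[\mathcal{T}] \neq \emptyset \wedge x[S_E] = e[S_E]\ \wedge \\
& (x[S_M] = m[S_M] \vee x[S_M] = null) \big)\ \vee \\
& \exists e \in E,\ \forall m \in coalesce(M)\ \big( \neg ( P(e, m)^{S}_{\epsilon} \vee P(e, m)^{D}_{\tau_D} ) \Rightarrow \\
& x[S_E] = e[S_E] \wedge x[S_M] = null \wedge overlap(x[\mathcal{T}], \mathcal{A}) \neq \emptyset\ \wedge \\
& x[\mathcal{T}] \neq \emptyset \big) \,\}
\end{aligned}
\quad (18)
$$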
The first three lines handle the case in which the data object $x$ derives from $e$ and $m$ by applying the inner matching condition for $e$ and $m$, given the distance $\epsilon$ for the spatial predicate $P(e, m)^{S}_{\epsilon}$ and the threshold $\tau_D$ for the semantic predicate $P(e, m)^{D}_{\tau_D}$. The attribute values $e[S_E]$ of the existing object are used for resolved partners, together with all remaining attributes $m[S_M]$ from the match. The valid time of this data update is expanded by the overlap between the analysis interval $\mathcal{A}$ and the associated time in $x[\mathcal{T}]$.

The last three lines handle the cases in which no match was found, but the valid period overlaps with the actual analysis time, describing a new object. The result $x$ is filled with the values of $m$.
The invalid join retrieves all data that are now invalid for the analysis, as described in Equation 19:

$$E \ltimes^{T}_{P(e, m)} M = \{\, e \mid \exists e \in E,\ \exists m \in coalesce(M) : intersect(m[\mathcal{T}], \mathcal{A}) \,\} \quad (19)$$

The semi-join retrieves all pending objects that are contained entirely within the time range $\mathcal{A}$ of the current analysis progress and are thus no longer applicable for the task because their valid time is over.
5 TECHNICAL INTEGRATION
Since the data sets are heterogeneous, each provide only a subset of the information required for the considered scenario, and do not contain validity periods (specifically, they carry only a temporal marker assigned, for example, by the corresponding sensors or filled in with the hub system time), all static and dynamic sources are equally relevant.
5.1 Source Inclusion
Given a scenario description, the implementation of HSTM refers to multiple static and dynamic sources, in line with a polystore architecture [22]. Source entries contain implicit or explicit temporal references (such as a validity period or validity marker), which apply to all data objects contained in the source, as well as spatial references in multiple formats (point, line, and polygon). In contrast to static sources, dynamic sources are configured by the given scenario description¹, which contains (push-based) queries and mappings that resolve naming (1:1) and structural (1:n) conflicts. The system registers a query at the real-time endpoint, returning the complete state of
1https://mars.hawhamburg.de/articles/core/modelconguration/
Figure 4: Data loading layer, transforming input data objects
to common data model.
objects when any matching element changes. Figure 4 shows the loading structure for multiple data sources, in which the $\sigma$ layer allocates data using the individual wrapper implementation. Therefore, each wrapper implements the native interface of the source and exposes a scan operation and a selection ($\sigma$) filter. The selection predicate is defined on associative arrays, decoupling it from native formats and providing transparency.
The system uses the Volcano model [24], iterating over data objects and selecting those that satisfy the predicate in $\sigma_P$. The system transforms each result into a hierarchical associative array according to a given mapping. Due to this Volcano-based lookup, the $\sigma$ layer returns an iterator, making post-processing operations such as top-$k$ or count available. Temporal data objects are forwarded into two intermediate, distinct catalogue stores to encapsulate the model's potentially heterogeneous sources. The non-temporal catalogue contains values that are valid for the entire analysis time $\mathcal{A}$, whereas the temporal catalogue preserves multiple groups of time series, each concerning a sequence of data changes. When adding new temporal data to the catalogue, the system first checks for type conformity and temporal overlap with the current time $t_c$ in $\mathcal{A}$, preventing the system from taking in erroneous data [11, 20, 30] or from being occupied by temporal outliers [25, 58]. New time series entries lead to the creation of new data objects. Inserting data into an existing sequence invalidates the previously oldest value, whose valid time is subsequently replaced by that of the new one. These constructs inherit the valid periods of the data [36, 47].
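The iterator-based loading described above can be sketched with Python generators, where each operator pulls one record at a time from its input. This is only an illustration of the Volcano-style pattern, not the MARS implementation; the operator names (`scan`, `select`, `remap`, `first_k`, `count`) and the sample records are assumptions, and `first_k` simply takes the first k results in iteration order.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def scan(source: Iterable[Row]) -> Iterator[Row]:
    """Wrapper-side scan: yields one native record at a time."""
    for row in source:
        yield row


def select(rows: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Selection filter: passes on only the records satisfying the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def remap(rows: Iterator[Row], mapping: Dict[str, str]) -> Iterator[Row]:
    """Rename native fields to model fields (1:1 naming); unmapped fields are dropped."""
    for row in rows:
        yield {target: row[name] for name, target in mapping.items() if name in row}


def first_k(rows: Iterator[Row], k: int) -> List[Row]:
    """Post-processing: consume at most k records from the iterator."""
    return list(islice(rows, k))


def count(rows: Iterator[Row]) -> int:
    """Post-processing: consume the iterator and count its records."""
    return sum(1 for _ in rows)


# Hypothetical native records of a charging-station source.
native = [
    {"id": 1, "free": 3, "ts": 10},
    {"id": 2, "free": 0, "ts": 11},
    {"id": 3, "free": 7, "ts": 12},
]
pipeline = remap(
    select(scan(native), lambda r: r["free"] > 0),
    {"id": "station", "free": "available"},
)
first = first_k(pipeline, 1)
remaining = count(pipeline)  # the same iterator continues where first_k stopped
```

Because nothing is materialized until a consumer pulls records, the same pipeline can serve a bounded lookup and a later full pass without reloading the source.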
Since external data are integrated, the approach needs to deal with divergent transmission latency, unequal clocks of internal and external scopes, and duplicates. As discussed in Section 3, this includes data cleaning by filtering spatial [10, 35] and temporal [25, 58] outliers according to the scenario mapping, as well as removing errors in the attribute data such as typos or dependency violations [11, 20, 30].
5.2 Matching Population
The complete matching approach is shown in Algorithm 1, illustrating each step in which valid and invalid temporal-invariant objects
of $E$ and $M$ are created, updated, or removed during the discrete execution of the analysis.
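Algorithm 1 Valid/Invalid Data Population
Require: $E$, $\tau$, $M$: entity set, entity type, and data source
1: Initialize update set $U = \emptyset$
2: $Valid_{\mathcal{T}} \leftarrow \sigma_{P(m[\mathcal{T}], \mathcal{A})} M$ (see Eq. 17)
3: for $m \in Valid_{\mathcal{T}}$ do
4:   $Valid_{\mathcal{G}} \leftarrow \sigma_{P(m[\mathcal{G}], e[\mathcal{G}])} E$ (spatial matching, see Eq. 11)
5:   for $e_g \in Valid_{\mathcal{G}}$ do
6:     if $sim_H(e_g, m) \le \tau_D$ then
7:       $U \leftarrow U \cup \{update(m)\}$
8:     else
9:       $U \leftarrow U \cup \{create_{\tau}(m)\}$
10:     end if
11:   end for
12: end for
13: $Invalid \leftarrow E \ltimes^{T}_{P} M$ (see Eq. 19)
14: for $e \in Invalid$ do
15:   if $e \notin U$ then
16:     $E \leftarrow E \setminus \{e\}$ // remove
17:   end if
18: end for
19: $A \leftarrow A \cup U$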
The population algorithm collects new and updated data objects from $M$ (line 2) in an intermediate set $U$ (lines 7 and 9). Matching candidates in $E$ receive an update notification in their model implementation (e.g., the parking space), whereas the absence of matches with $M$ results in the creation of new elements for $E$. Precisely, matches are retrieved by applying $\sigma_{P(m[\mathcal{G}], e[\mathcal{G}])} E$ (line 4) with the new valid $m$ on the Z-curve discussed in Section 4.2. For each possible spatial match $e_g$, we decide to what extent its semantics are similar to the new $m$. When $sim_H(e_g, m) > \tau_D$ holds, the algorithm creates an instance of the given model type $\tau$ and inserts it into the update set; otherwise, an update is applied to the model with $m$. The invalidation removes the remaining $e$ from $E$ and preserves those that were created or received an update (line 13).
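The population step can be sketched in Python under several simplifying assumptions that are not part of the paper: a plain Euclidean distance filter stands in for the Z-curve lookup, an unweighted Jaccard loss stands in for $sim_H$, one decision is taken per incoming record (update the first similar candidate, otherwise create), and invalidation drops untouched entities whose valid time no longer overlaps the analysis window. The names `Record`, `populate`, `eps`, and `tau_d` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import dist


@dataclass
class Record:
    key: str
    point: tuple[float, float]
    tokens: frozenset[str]
    valid: tuple[float, float]  # half-open valid-time interval [start, end)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    return max(a[0], b[0]) < min(a[1], b[1])


def jaccard_loss(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def populate(entities: dict[str, Record], incoming: list[Record],
             window: tuple[float, float], eps: float, tau_d: float):
    """Return the new entity dict plus the keys that were updated and created."""
    updated, created = set(), set()
    result = dict(entities)
    # Temporal step: keep only incoming records valid in the analysis window.
    for m in (r for r in incoming if overlaps(r.valid, window)):
        # Spatial step: distance filter (assumed stand-in for the Z-curve lookup).
        near = [e for e in entities.values() if dist(e.point, m.point) <= eps]
        # Semantic step: first spatial candidate whose loss is within tau_d.
        match = next((e for e in near if jaccard_loss(e.tokens, m.tokens) <= tau_d), None)
        if match is not None:
            # Update: keep the entity's attributes and prolong its lifetime.
            span = (min(match.valid[0], m.valid[0]), max(match.valid[1], m.valid[1]))
            result[match.key] = Record(match.key, match.point, match.tokens, span)
            updated.add(match.key)
        else:
            result[m.key] = m
            created.add(m.key)
    # Invalidation: drop untouched entities whose valid time is outside the window.
    for key, e in entities.items():
        if key not in updated and not overlaps(e.valid, window):
            del result[key]
    return result, updated, created


entities = {
    "lot-1": Record("lot-1", (0.0, 0.0), frozenset({"parking", "lot"}), (0.0, 10.0)),
    "lot-2": Record("lot-2", (50.0, 50.0), frozenset({"parking", "lot"}), (0.0, 5.0)),
}
incoming = [
    Record("obs-1", (1.0, 1.0), frozenset({"parking", "lot", "charging"}), (8.0, 12.0)),
    Record("obs-2", (100.0, 100.0), frozenset({"bike", "rack"}), (9.0, 11.0)),
    Record("obs-3", (0.0, 0.0), frozenset({"parking", "lot"}), (20.0, 30.0)),
]
state, updated, created = populate(entities, incoming, window=(8.0, 12.0), eps=5.0, tau_d=0.5)
```

In this example, `obs-1` updates `lot-1`, `obs-2` has no spatial partner and becomes a new entity, `obs-3` is not yet valid, and `lot-2` is removed because its valid time ended before the window.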
HSTM uses symmetric matches between the present data and the validity window on the temporal catalogue, in which the left side of the join is already loaded and matching partners on the right side are located. Due to the lack of conformity between the underlying models of the data sets, we utilize a left-outer bind-join [26].
6 EVALUATION
Hierarchical matching is implemented as a prototype using our existing multi-agent framework MARS and is available online². The MARS framework already has an implementation for efficient graph structures [54] and was modified to realize the spatial hierarchy. Calculations and measurements were performed on Windows 10 without active virtualization, using .NET Core v5.0.4 on an AMD Ryzen 9 3900X 12-core CPU with 32.0 GB RAM and a 3 TB SATA HD with a 64 MB
2https://git.hawhamburg.de/mars/modeldeployments
buffer, with compiler optimization enabled. Data were obtained from the Hamburg Urban Data Hub Geoportal³ (GP) and the Global Database of Events, Language, and Tone (GDELT). GDELT represents a single-source set containing > 200M geopolitical, spatial- and temporal-invariant records with 34 categorical and numerical typed attributes in a homogeneous schema. In contrast to GDELT, GP is a public multi-source set with over 150 spatially and temporally heterogeneous data subsets for the city of Hamburg, describing parking without equi-join relationships among sources. We evaluate HSTM with two different scenarios. The first scenario concerns the problem of identifying matches between parking lots and temporal occupation data from heterogeneous sources. No occupation data is provided for the parking lots themselves (only spatial data). It needs to be derived from existing e-charging stations, which provide real-time occupation data without relations to the given parking lots or a spatial extent (insufficient spatial information). This merge is relevant for the correctness of a digital twin in the domain of traffic over time [23]. We investigate Precision and Recall of the matching approach for the one-step correlation of parking lots and compare different selections of $\epsilon$. Parking and loading stations are matched with each other on a subset of 34 sources near the city centre. A given equi-join proves the semantically correct linkage on parking spaces as well as the distance relationship with e-stations from the stream input. The e-charging stations act as proxies, containing a count of nearby affected parking lots. The second scenario operates on data with equi-join relationships, which allows the automatic comparison of larger data sets. Incorporating GDELT, this scenario evaluates the matching performance and Recall of HSTM. We compare HSTM with existing index-based methods, comparing K-ANN clustering with our similarity function. We decided to use the vantage-point tree (VP-tree), the k-dimensional tree (KD-tree), and the navigable small world (NSW) because they are the most prominent solutions for general-purpose K-ANN clustering in practice.
7 DISCUSSION
We presented a matching approach whose decisions for spatial, temporal, and semantic properties were considered deferentially with ad hoc analyses. Figure 5 shows that varying $\epsilon$ affects the Precision: categorical data and missing considerations of spatial or temporal restrictions are significantly more applicable for the range $[0, 5]$, whereas combining linear values holds a Recall > 0.87 and a Precision > 0.92. Using HSTM on polluted data sets shows significant drawbacks, in which the matching is similar to only using Jaccard or Cosine, respectively, and only increasing the spatial $\epsilon$ helps find suitable candidates. Figure 6 shows comparison results with leading K-ANN clustering on the GDELT data set. The naive K-ANN approach shows lower Recall with lower processing performance for the VP-tree, KD-tree, and NSW approximation. For $\epsilon = 8$, we show that the few precise spatial quantizations with lower computational cost tend towards the same behaviour as for $\epsilon = 12$, where only NSW performs better. In total, results were classified properly compared to query performance but mostly come with limitations. Due to the spatial abstraction, selecting nearby objects using the MBR leads to invalidly intersecting matches (e.g.,
3Used public data sets available here https://geoportalhamburg.de/geo online/
[Figure 5 plots: Recall (panels 1 and 2) and Precision (panels 3 and 4) over the $\epsilon$ distance precision (0 to 10) for HSTM-$sim_H$, HSTM-Jaccard, and HSTM-Cosine.]
Figure 5: Comparison of Recall and Precision of the HSTM approach using the similarity function $sim_H$ and using Jaccard and Cosine only, varying the $\epsilon$ distance. Recall comparison on the complete parking data set (1) and on the polluted set of parking spaces (2). Precision comparison on the complete parking data set (3) and on the polluted set (4).
[Figure 6 plots: Recall over processing time in ms (logarithmic x-axis) in panels (1) to (3) for HSTM $\epsilon = 8$, HSTM $\epsilon = 12$, NSW, KD-tree, and VP-tree.]
Figure 6: Comparison of Recall measures on subsets of GDELT using the HSTM approach and public K-ANN lookup ($K = 1$) implementations. Computation on the top 50K elements (1), top 100K elements (2), and top 1M elements (3). The x-axis has a logarithmic scale.
curves of roads) with points or small polygons. Large extensions of these geometries correlate with failing curves and their MBR, even if the objects have nothing in common spatially.
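The MBR limitation can be made concrete with a small sketch, assuming MBR denotes the minimum bounding rectangle of a geometry. An L-shaped road has a large bounding box, so a point far away from the road can still fall inside it. The helper names and coordinates below are hypothetical.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbr(points: List[Point]) -> Box:
    """Minimum bounding rectangle of a list of points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def in_mbr(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from p to the closest point on segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
        t = max(0.0, min(1.0, t))
    return hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def distance_to_polyline(p: Point, line: List[Point]) -> float:
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))


# Hypothetical L-shaped road and a point object inside the road's bounding box.
road = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
station = (2.0, 8.0)

mbr_hit = in_mbr(station, mbr(road))
true_distance = distance_to_polyline(station, road)
```

Here the bounding-box test reports a hit although the point lies eight units away from the nearest road segment, which is why an exact distance check after the MBR filter helps to discard such candidates.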
Comparing our technical integration, we differ from other spatio-temporal joins [12] in that matching applies to a setting of heterogeneous data sources. HSTM is a nested-loop match in which intermediate results from external sources are passed to the internal objects in $E$, using the filter predicate $P(E[a'], M[b'])$. HSTM sends out subqueries multiple times. Pushing the selection combination down to the data source [26] alleviates limited access patterns with restricted query capabilities, such as MQTT⁴ and its extensions [29]. Though this approach provides good runtimes and logical data independence, the process is very trivial because the objects are discretized and sorted along the time axis with a fixed intersection. Merging spatial objects cannot be provided because access is read-only; this task is delegated to the model. Given the soft consistency requirements, a flexible temporal distance $t_a + \alpha$ and $t_e + \alpha$ would be better suited, increasing or decreasing the candidate set. Similarly, it is unclear whether and to what extent objects should be valid when $t_s = t_e$ holds since the algorithm drops them before the analysis continues. The semantic approach differentiates heterogeneous objects and allows for their weights in the comparison
4Message Queuing Telemetry Transport
metric. In addition, we did not consider normalization for the temporal input without prior knowledge of the value space, which is relevant for identifying outliers. Instead, we proposed discretizing values along with time, subsuming the value range step by step.
8 CONCLUSION
This paper studied the similarity problem in matching spatio-temporal, domain-specific data from heterogeneous sources for ad hoc analytical tasks. We devised a prefix approach called HSTM for matching objects, respecting valid time, spatial locality, and domain-specific similarity with regard to heterogeneity. Utilizing the bind-join, we have implemented a data loading technique to retrieve data from multiple kinds of sources, supporting time-discrete ad hoc analysis. The evaluation shows that HSTM achieves high performance with good Recall results. We are now integrating our approach into a practical solution concerning traffic demands, comparing the specific matching results on a broader scale, and improving transparency for the ad hoc task by decoupling more configuration from the user.
ACKNOWLEDGMENTS
This research was supported in part by the City of Hamburg (Smart
OpenHamburg project).
Full Paper Track
CIKM ’21, November 1–5, 2021, Virtual Event, Australia
573
REFERENCES
[1] Danial Aghajarian, Satish Puri, and Sushil Prasad. 2016. GCMF: an efficient end-to-end spatial join system over large polygonal datasets on GPGPU platform. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 1–10.
[2] Lars Arge, Octavian Procopiuc, Sridhar Ramaswamy, Torsten Suel, and Jeffrey Scott Vitter. 1998. Scalable sweeping-based spatial join. In VLDB, Vol. 98. Citeseer, 570–581.
[3] Rudolf Bayer. 1997. The universal B-tree for multidimensional indexing: General concepts. In International Conference on Worldwide Computing and Its Applications. Springer, 198–209.
[4] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[5] Panagiotis Bouros, Shen Ge, and Nikos Mamoulis. 2012. Spatio-textual similarity joins. Proceedings of the VLDB Endowment 6, 1 (2012), 1–12.
[6] Moisés Castelo Branco, Javier Troya, Krzysztof Czarnecki, Jochen Küster, and Hagen Völzer. 2012. Matching Business Process Workflows across Abstraction Levels. In Proceedings of the 15th International Conference on Model Driven Engineering Languages and Systems (Innsbruck, Austria) (MODELS'12). Springer-Verlag, Berlin, Heidelberg, 626–641. https://doi.org/10.1007/9783 642336669_40
[7] Michael J Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proceedings of the VLDB Endowment 2, 1 (2009), 1090–1101.
[8] Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment 1, 1 (2008), 538–549.
[9] Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A primitive operator for similarity joins in data cleaning. In 22nd International Conference on Data Engineering (ICDE'06). IEEE, 5–5.
[10] Sanjay Chawla and Pei Sun. 2006. SLOM: A New Measure for Local Spatial Outliers. Knowl. Inf. Syst. 9, 4 (2006), 412–429. https://doi.org/10.1007/s10115 0050200 2
[11] Yao-Yi Chiang, Bo Wu, Akshay Anand, Ketan Akade, and Craig A. Knoblock. 2014. A System for Efficient Cleaning and Transformation of Geospatial Data Attributes. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 577–580. https://doi.org/10.1145/2666310.2666373
[12] James Clifford and Albert Croker. 1987. The historical relational data model (HRDM) and algebra based on lifespans. In 1987 IEEE Third International Conference on Data Engineering. IEEE, 528–537.
[13] Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K Elmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang. 2017. The Data Civilizer System. In CIDR.
[14] Dong Deng, Guoliang Li, Jianhua Feng, and Wen-Syan Li. 2013. Top-k string similarity search with edit-distance constraints. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). IEEE, 925–936.
[15] Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. 2009. Integrating Conflicting Data: The Role of Source Dependence. Proc. VLDB Endow. 2, 1 (Aug. 2009), 550–561. https://doi.org/10.14778/1687627.1687690
[16] Curtis E. Dyreson and Richard Thomas Snodgrass. 1993. Timestamp Semantics and Representation. Information Systems 18, 3 (April 1993), 143–166. https://doi.org/10.1016/03064379(93)90034 X
[17] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. on Knowl. and Data Eng. 19, 1 (Jan. 2007), 1–16.
[18] Ronald Fagin, Laura M Haas, Mauricio Hernández, Renée J Miller, Lucian Popa, and Yannis Velegrakis. 2009. Clio: Schema mapping creation and data exchange. In Conceptual modeling: foundations and applications. Springer, 198–236.
[19] Robert Fenk. 2002. The BUB-tree. In VLDB'02, Proceedings of 28th International Conference on Very Large Data Bases. Citeseer.
[20] Venkatesh Ganti and Anish Das Sarma. 2013. Data Cleaning: A Practical Perspective. Morgan & Claypool Publishers.
[21] Mike Gashler, Christophe Giraud-Carrier, and Tony Martinez. 2008. Decision tree ensemble: Small heterogeneous is better than large homogeneous. In 2008 Seventh International Conference on Machine Learning and Applications. IEEE, 900–905.
[22] Daniel Glake, Fabian Panse, Norbert Ritter, Thomas Clemen, and Ula Lenfers. 2021. Data Management in Multi-Agent Simulation Systems. In BTW 2021, Kai-Uwe Sattler, Melanie Herschel, and Wolfgang Lehner (Eds.). Gesellschaft für Informatik, Bonn, 423–436. https://doi.org/10.18420/btw202122
[23] Daniel Glake, Norbert Ritter, and Thomas Clemen. 2020. Utilizing Spatio-Temporal Data in Multi-Agent Simulation. In Proceedings of the 2020 Winter Simulation Conference, K.-H. Bae, B. Feng, S. Kim, S. Lazarova-Molnar, Z. Zheng, T. Roeder, and R. Thiesing (Eds.). Society for Computer Simulation International.
[24] Goetz Graefe. 1994. Volcano - an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering 6, 1 (1994), 120–135.
[25] Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. 2014. Outlier Detection for Temporal Data: A Survey. IEEE Trans. Knowl. Data Eng. 26, 9 (2014), 2250–2267. https://doi.org/10.1109/TKDE.2013.184
[26] Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. 1997. Optimizing queries across diverse data sources. (1997).
[27] Yeye He, Kris Ganjam, and Xu Chu. 2015. Sema-join: joining semantically-related tables using big table corpora. Proceedings of the VLDB Endowment 8, 12 (2015), 1358–1369.
[28] Yun-Wu Huang, Ning Jing, Elke A Rundensteiner, et al. 1997. Spatial joins using R-trees: Breadth-first traversal with global optimizations. In VLDB, Vol. 97. Citeseer, 25–29.
[29] Urs Hunkeler, Hong Linh Truong, and Andy Stanford-Clark. 2008. MQTT-S - A publish/subscribe protocol for Wireless Sensor Networks. In 2008 3rd International Conference on Communication Systems Software and Middleware and Workshops (COMSWARE'08). IEEE, 791–798.
[30] Ihab F. Ilyas and Xu Chu. 2019. Data Cleaning. ACM. https://doi.org/10.1145/3310205
[31] Edwin H Jacox and Hanan Samet. 2007. Spatial join techniques. ACM Transactions on Database Systems (TODS) 32, 1 (2007), 7–es.
[32] Christian S Jensen, Curtis E Dyreson, Michael Böhlen, James Clifford, Ramez Elmasri, Shashi K Gadia, Fabio Grandi, Pat Hayes, Sushil Jajodia, Wolfgang Käfer, et al. 1998. The consensus glossary of temporal database concepts - February 1998 version. In Temporal Databases: Research and Practice. Springer, 367–405.
[33] Martin Kaufmann, Panagiotis Vagenas, Peter M Fischer, Donald Kossmann, and Franz Färber. 2013. Comprehensive and interactive temporal query processing with SAP HANA. Proceedings of the VLDB Endowment 6, 12 (2013), 1210–1213.
[34] Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Peter Boncz, Thomas Neumann, and Alfons Kemper. 2018. Adaptive geospatial joins for modern hardware. arXiv preprint arXiv:1802.09488 (2018).
[35] Yufeng Kou and Chang-Tien Lu. 2017. Outlier Detection, Spatial. In Encyclopedia of GIS. Springer, 1539–1546. https://doi.org/10.1007/978 3319 178851_945
[36] Ioannis K. Koumarelas, Lan Jiang, and Felix Naumann. 2020. Data Preparation for Duplicate Detection. ACM J. Data Inf. Qual. 12, 3 (2020), 15:1–15:24. https://dl.acm.org/doi/10.1145/3377878
[37] Georgia Koutrika, Benjamin Bercovitz, and H Garcia-Molina. [n.d.]. FlexRecs: Expressing and combining flexible recommendations. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD'09), Providence, RI, USA, Vol. 29.
[38] Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and Christian Bizer. 2015. The Mannheim search join engine. Journal of Web Semantics 35 (2015), 159–166.
[39] Chen Li, Jiaheng Lu, and Yiming Lu. 2008. Efficient merging and filtering algorithms for approximate string searches. In 2008 IEEE 24th International Conference on Data Engineering. IEEE, 257–266.
[40] John Liagouris, Nikos Mamoulis, Panagiotis Bouros, and Manolis Terrovitis. 2014. An effective encoding scheme for spatial RDF data. Proceedings of the VLDB Endowment 7, 12 (2014), 1271–1282.
[41] Ming-Ling Lo and Chinya V Ravishankar. 1996. Spatial hash-joins. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data. 247–258.
[42] Volker Markl. 2000. Mistral: Processing relational queries using a multidimensional access technique. In Ausgezeichnete Informatikdissertationen 1999. Springer, 158–168.
[43] Renée J Miller. 2018. Open data integration. Proceedings of the VLDB Endowment 11, 12 (2018), 2130–2139.
[44] Mohamed F Mokbel, Walid G Aref, and Ibrahim Kamel. 2003. Analysis of multi-dimensional space-filling curves. GeoInformatica 7, 3 (2003), 179–209.
[45] Jack A Orenstein and Tim H Merrett. 1984. A class of data structures for associative searching. In Proceedings of the 3rd ACM SIGACT-SIGMOD Symposium on Principles of Database Systems. 181–190.
[46] Jignesh M Patel and David J DeWitt. 1996. Partition based spatial-merge join. ACM Sigmod Record 25, 2 (1996), 259–270.
[47] Dorian Pyle. 1999. Data Preparation for Data Mining. Morgan Kaufmann.
[48] Shuyao Qi, Panagiotis Bouros, and Nikos Mamoulis. 2013. Efficient top-k spatial distance joins. In International Symposium on Spatial and Temporal Databases. Springer, 1–18.
[49] Md Atiqur Rahman and Yang Wang. 2016. Optimizing intersection-over-union in deep neural networks for image segmentation. In International symposium on visual computing. Springer, 234–244.
[50] Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. 2000. Integrating the UB-tree into a database system kernel. In VLDB, Vol. 2000. Citeseer, 263–272.
[51] Hans Sagan. 1994. Hilbert's space-filling curve. In Space-filling curves. Springer, 9–30.
[52] Craig Stanfill and David Waltz. 1986. Toward memory-based reasoning. Commun. ACM 29, 12 (1986), 1213–1228.
[53] Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, and Jianhua Feng. 2015. Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 519–530.
[54] Julius Weyl, Ula A Lenfers, Thomas Clemen, Daniel Glake, Fabian Panse, and Norbert Ritter. 2019. Large-scale traffic simulation for smart city planning with MARS. In Proceedings of the 2019 Summer Simulation Conference. 1–12.
[55] Randall T Whitman, Michael B Park, Bryan G Marsh, and Erik G Hoel. 2017. Spatio-temporal join on Apache Spark. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. 1–10.
[56] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. Infogather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 97–108.
[57] Tilmann Zäschke, Christoph Zimmerli, and Moira C Norrie. 2014. The PH-tree: a space-efficient storage structure and multi-dimensional index. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 397–408.
[58] Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S. Yu. 2017. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. Proc. VLDB Endow. 10, 10 (2017), 1046–1057. https://doi.org/10.14778/3115404.3115410
[59] Donghui Zhang, Vassilis J Tsotras, and Bernhard Seeger. 2002. Efficient temporal join processing using indices. In Proceedings 18th International Conference on Data Engineering. IEEE, 103–113.
[60] Erkang Zhu, Yeye He, and Surajit Chaudhuri. 2017. Auto-join: Joining tables by leveraging transformations. Proceedings of the VLDB Endowment 10, 10 (2017), 1034–1045.