
Heterogeneous Stream Processing and Crowdsourcing for Urban Traffic Management

Alexander Artikis1, Matthias Weidlich2, Francois Schnitzler3, Ioannis Boutsis4,
Thomas Liebig5, Nico Piatkowski5, Christian Bockermann5, Katharina Morik5,
Vana Kalogeraki4, Jakub Marecek6, Avigdor Gal3, Shie Mannor3, Dermot Kinane7 and
Dimitrios Gunopulos8
1Institute of Informatics & Telecommunications, NCSR Demokritos, Athens, Greece,
2Imperial College London, United Kingdom, 3Technion - Israel Institute of Technology, Haifa, Israel,
4Department Informatics, Athens University of Economics and Business, Greece,
5Technical University Dortmund, Germany, 6IBM Research, Dublin, Ireland,
7Dublin City Council, Ireland,
8Department of Informatics and Telecommunications, University of Athens, Greece
ABSTRACT

Urban traffic gathers increasing interest as cities become bigger, more crowded and “smart”. We present a system for heterogeneous stream processing and crowdsourcing supporting
intelligent urban traffic management. Complex events related
to traffic congestion (trends) are detected from heterogeneous
sources involving fixed sensors mounted on intersections and
mobile sensors mounted on public transport vehicles. To deal
with data veracity, a crowdsourcing component handles and
resolves sensor disagreement. Furthermore, to deal with data
sparsity, a traffic modelling component offers information in
areas with low sensor coverage. We demonstrate the system
with a real-world use-case from Dublin city, Ireland.
Categories and Subject Descriptors

H.2.4 [Information Systems]: Systems—query processing, rule-based databases
1. INTRODUCTION

The recent development of innovative technologies related to mobile computing combined with smart city infrastructures is generating massive, heterogeneous data and creating opportunities for novel applications. In traffic monitoring,
the data sources include traditional ones (sensors) as well as
novel ones such as micro-blogging applications like Twitter;
these provide a new stream of textual information that can
(c) 2014, Copyright is with the authors. Published in Proc. EDBT 2014
on Distribution of this paper is permitted under the
terms of the Creative Commons license CC-by-nc-nd 4.0
EDBT-2014 Athens, Greece
be utilized to capture events, or allow citizens to constantly
interact using mobile sensors.
Detecting complex events from heterogeneous data streams
is a promising vehicle to support applications for monitoring,
detection and online response [11, 20]. Consider e.g. an urban
monitoring system that identifies traffic congestions (in-the-
make) and (proactively) changes traffic light priorities and
speed limits to reduce ripple effects. Such a system may use
traffic flow and density information measured by fixed sensors
mounted in selected intersections, together with reports from
public transport vehicles (buses, trams, etc).
Our work is motivated by an existing traffic monitoring
application in Dublin City, Ireland. We present the general
framework of a system that has been designed in this context,
and the challenges that come up from a real installation and
application. The long term goal of the related INSIGHT project is to enable traffic managers to detect with a high
degree of certainty unusual events throughout the network.
We report on the design of a monitoring system that takes
input from a set of traffic sensors, both static (intersection
located, traffic flow and density monitoring sensors) and mo-
bile (GPS equipped public transportation buses). We explore
the advantages of having such an infrastructure available and
address its limitations.
Some of the main challenges when dealing with large traffic
monitoring data streams are that of veracity and sparsity.
Data arriving from multiple heterogeneous sources, may be
of poor quality and in general requires pre-processing and
cleaning when used for analytics and query answering. In
particular, sensor networks introduce uncertainty into the sys-
tem due to reasons that range from inaccurate measurements
through network local failures to unexpected interference of
mediators. While the first two reasons are well recorded in
the literature, the latter is a new phenomenon that stems
from the distribution of sensor sources. Sensor data may go
through multiple mediators en route to our systems. Such
BibTeX :
mediators apply filtering and aggregation mechanisms, most
of which are unknown to the system that receives the data.
Hence, the uncertainty that is inherent to sensor data is
multiplied by the factor of unknown aggregation and filtering
treatments. In addition, data present a sparsity problem,
since the traffic in several locations in the city is either never
monitored due to lack of sensors, or infrequently monitored
(e.g. when a bus passes by).
In [3], we outlined the principle of using a variety of input
data to effectively handle veracity. Streams from multiple
sources were leveraged to generate common complex events.
A complex event processing component matched these events
against each other to identify mismatches that indicate un-
certainty regarding the event sources. Temporal regions
of uncertainty were identified from which point the system
autonomously decided on how to manage this uncertainty.
In this paper we present a holistic view of traffic monitoring; we present approaches to address (i) the variety of the data problem, (ii) the veracity of the data problem, and (iii) the sparsity of the data problem. In addition, the streaming architecture we develop is scalable, and therefore capable of addressing the volume of the data problems that arise as the available data sources increase. We integrate the
respective techniques in the context of a unified system for a
concrete application. To build the system, we significantly
extend our previous work by incorporating a crowdsourcing
component to facilitate further uncertainty handling and
a component for traffic modelling. The first component
queries volunteers close to the sensors that disagree and
estimates what has actually happened given the participants’
reliability. The benefits of this approach are two-fold. First,
more accurate information is directly given to end users.
Second, the event processing component of our system makes
use of the crowdsourced information to minimise the use of
unreliable sources. The traffic modelling component may also
use the crowdsourced information to resolve data sparsity.
We illustrate our approach using large, heterogeneous data
streams concerning urban traffic management in the city of
Dublin. We describe the requirements that come up including
data sources, analysis methods and technology, and visualisa-
tion. The data we use
come from the Sydney Coordinated
Adaptive Traffic System (SCATS) sensors, i.e. fixed sensors
deployed on intersections to measure traffic flow and density,
and bus probe data stating, among others, the location of
each bus as well as traffic congestions.
The remainder of this paper is organised as follows. Sec-
tion 2 describes the architecture of our system. Then, Sec-
tions 3–6 present each of the main components. Section 7
presents our empirical evaluation, showing the method feasi-
bility. Finally, Section 8 summarises our work.
2. SYSTEM ARCHITECTURE

The general architecture of our system for urban traffic management is given in Figure 1. In this section, we describe the input and output of the system, the individual components that perform the data analysis, and the stream processing middleware that connects them.
Two types of sensor are considered as event sources.
Buses transmit information about their position and conges-
tion and vehicle detectors of a SCATS system are installed at
intersections and report on traffic flow and density.

Figure 1: Overview of the system architecture.

The raw input from these sensors is not directly processed though.
Instead, mediators are involved that filter and aggregate
the raw data. A lack of control over these pre-processing
steps that are interwoven with the communication infrastruc-
ture, therefore, induces uncertainty for the low-level events
that are actually processed by the system. This aspect is
highlighted by the notion of a simple, derived event (SDE),
which is the result of applying a computational derivation
process to some other event, such as an event coming from
a sensor [21]. A stream of such time-stamped SDEs is the
primary input of our system.
Additionally, the system may solicit input from citizens
using a connected crowdsourcing component. The output of
crowdsourcing is fed to the computing components of the
system, to improve the accuracy of the results.
The system helps an operator manage the traffic
situation, by integrating available traffic information from the
different sources, which can then be used to issue alerts when
issues that may impact traffic are identified. An important
requirement is to have a simple, intuitive interactive map to
present all traffic information and alerts.
Stream Processing Component:
The backbone of our
solution is a stream processing component, which couples
the output from the sensors with further data analysis com-
ponents. Stream processing is realized with the Streams
framework [4]. It provides a language for the description of
data flow graphs, which are then compiled into a computation
graph for a stream processing engine.
Data Analysis Components:
The system uses compo-
nents for traffic modelling, complex event processing and
crowdsourcing. Collectively, these components implement
the monitoring logic of the system. The crowdsourcing com-
ponent has two independent parts: the query modelling part
whose objective is to select the humans that will be answering
a question, and a query execution engine which deploys and
executes the question.
Using Streams, SDEs are forwarded to a traffic modelling
component that deals with data sparsity, i.e. makes conges-
tion estimates in areas with low or non-existent sensor cover-
age. SDEs are also forwarded to a complex event processing
engine that identifies complex events (CE) of interest. A CE
is a collection of events that satisfies a certain specification
comprising temporal and, possibly, atemporal constraints
on its deriving events, either SDEs or other CEs. Identified
CEs may then be directly forwarded to end users (city opera-
tors) in order to gain insights on the current traffic situation.
However, the aforementioned uncertainty stemming from
the pre-processing of sensor readings may lead to situations
that cannot be clearly identified. Instead, CEs that relate
to inconsistencies in the event sources are detected. These
CEs are then forwarded to the crowdsourcing component,
which aims at reducing the uncertainty by human input. For
a source disagreement event emitted by the complex event
processing component, the crowdsourcing component selects
one or more humans that act as system participants. By
answering a specific question, they allow for resolving source
disagreements. These results are used in two ways. On the
one hand, they are fed into the complex event processing
component and the traffic modelling component, thereby
supporting adaptability of these components. On the other
hand, CEs are labelled with the details obtained from the
participants and forwarded to city operators, allowing for
deeper insights on the traffic situation.
3. STREAM PROCESSING

The Streams framework [4] that is the backbone of our system provides an XML-based language for the description of
data flow graphs that work on sequences of data items which
are represented by sets of key-value pairs, i.e. event attributes
and their values. The actual processing logic, i.e. the nodes
of the data flow graph, is realised by processes that comprise
a sequence of processors. Processes take a stream or a queue
as input and processors, in turn, apply a function to the
data items in a stream. All these concepts are implemented
in Java, so that adding customized processors is realised by
implementing the respective interfaces of the Streams API.
In addition, Streams allows for the specification of services,
i.e. sets of functions that are accessible throughout the stream
processing application.
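As an illustration of this description language, the following fragment sketches a minimal Streams configuration. The element names (container, stream, process) follow typical Streams examples, but the source class, URL and processor shown here are hypothetical placeholders, not taken from the INSIGHT deployment:

```xml
<container>
  <!-- a stream of SDEs read from a CSV file (hypothetical source) -->
  <stream id="bus-sdes" class="stream.io.CsvStream"
          url="file:/data/bus-sdes.csv" />
  <!-- a process applies its processors to each data item in order -->
  <process input="bus-sdes">
    <!-- a custom processor implemented against the Streams API -->
    <my.processors.DelayFilter threshold="300" />
  </process>
</container>
```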
Using these concepts, our stream processing component
includes the following parts:
Input handling processes: all SDEs emitted by buses
form one stream, while the SDEs emitted by vehicle
detectors of a SCATS system are referenced by four
streams, one per region of Dublin city.
Event processing processes: the definitions of complex
events (CE)s are wrapped by specific processors that
realise an embedding of the complex event processing
component in the Streams environment.
Crowdsourcing processes: the selection of participants
from which feedback should be sought, the generation
of the actual queries, and the processing of responses
are also implemented by specific processors.
Traffic modelling processes: the procedure for mak-
ing congestion estimates at locations with low sensor
coverage is wrapped as a Streams service.
For complex event processing, our solution relies on the
Event Calculus for Run-Time reasoning (RTEC)
[2], a
Prolog-based engine, which is detailed below. We integrated
RTEC via a dedicated processor in Streams that forwards the received SDEs to an RTEC instance using a bidirectional Java-Prolog interface. Then, the actual event pro-
cessing is triggered asynchronously and the derived CEs are
emitted to a queue in the Streams framework.
Crowdsourcing essentially involves two steps, the genera-
tion of the queries and the processing of participant responses.
In our solution, each of these steps is implemented by a ded-
icated processor. That is, upon the reception of a respective
event (source disagreement) indicating that feedback should
be sought, a first processor takes events as input and queries
actual participants via an interface. Responses to these
queries, in turn, represent an event stream. The responses
are merged by a second processor to come up with an approx-
imation of the probabilities for the different possible answers.
This second processor also estimates participant reliability.
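As a sketch of how such a response-merging processor might operate, consider the following simplified illustration. The function names and the reliability update rule are assumptions for exposition, not the actual INSIGHT implementation:

```python
from collections import defaultdict

def merge_responses(responses, reliability):
    """Approximate the probability of each possible answer by
    weighting every response with its participant's reliability."""
    weights = defaultdict(float)
    for participant, answer in responses:
        weights[answer] += reliability.get(participant, 0.5)
    total = sum(weights.values())
    return {answer: w / total for answer, w in weights.items()}

def update_reliability(responses, consensus, reliability, rate=0.2):
    """Nudge reliability towards 1 for participants agreeing with the
    consensus answer, towards 0 otherwise."""
    updated = dict(reliability)
    for participant, answer in responses:
        r = updated.get(participant, 0.5)
        target = 1.0 if answer == consensus else 0.0
        updated[participant] = r + rate * (target - r)
    return updated

responses = [("p1", "congestion"), ("p2", "congestion"),
             ("p3", "no congestion")]
reliability = {"p1": 0.9, "p2": 0.6, "p3": 0.3}
probs = merge_responses(responses, reliability)
consensus = max(probs, key=probs.get)
reliability = update_reliability(responses, consensus, reliability)
```

Here the dissenting participant p3 loses reliability, so future responses from p3 contribute less weight to the aggregated answer.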
4. COMPLEX EVENT PROCESSING

Our CE recognition component is based on the Event Calculus for Run-Time reasoning (RTEC) [2]. The Event
Calculus [15] is a logic programming language for represent-
ing and reasoning about events and their effects. The benefits
of a logic programming approach to CE recognition are well-
documented: such an approach has a formal, declarative
semantics, and direct routes to machine learning for con-
structing and refining CE definitions in an automated way.
The use of the Event Calculus has additional advantages:
the process of CE definition development is considerably
facilitated, as the Event Calculus includes built-in rules for
complex temporal representation and reasoning, including
the formalisation of inertia. With the use of the Event Cal-
culus, one may develop intuitive, succinct CE definitions,
facilitating the interaction between CE definition developer
and domain expert, and allowing for code maintenance.
To make the paper self-contained, we summarise the essen-
tials of the CE recognition model based on [2, 3]. We adopt
the common logic programming convention that variables
start with upper-case letters and are universally quantified,
while predicates and constants start with lower-case letters.
4.1 Representation
In RTEC, event types are represented as n-ary predi-
cates event(Attribute1,. . . ,AttributeN), such that the pa-
rameters define the attribute values of an event instance
event(value1,. . . ,valueN). An example from the Dublin traf-
fic management scenario is the type of SDE emitted by buses,
move(Bus, Line, Operator, Delay), which states that bus Bus, running in line Line and operated by Operator, moves with a Delay. Thus, a specific event instance is an instantiation of this predicate, e.g. move(33009, r10, o7, 400).
Time is assumed to be linear and discrete, represented by integer time-points. The occurrence of an event E at time T is modelled by the predicate happensAt(E, T). The effects of events are expressed by means of fluents, i.e. properties that may have different values at different points in time. The term F = V denotes that fluent F has value V. holdsAt(F = V, T) represents that fluent F has value V at a particular time-point T. Interval-based semantics are obtained with the predicate holdsFor(F = V, I), where I is a list of maximal intervals for which fluent F has value V continuously. holdsAt and holdsFor are defined in such a way that, for any fluent F, holdsAt(F = V, T) iff time-point T belongs to one of the maximal intervals of I for which holdsFor(F = V, I). Table 1 presents the main RTEC predicates.
Fluents are simple or statically determined. For a simple fluent F, F = V holds at time-point T if F = V has been initiated by an event at some time-point earlier than T (using predicate initiatedAt), and has not been terminated in the meantime (using predicate terminatedAt), which implements the law of inertia. Statically determined fluents are defined using interval manipulation constructs, such as union_all, intersect_all and relative_complement_all (cf. Table 1).

Table 1: Main predicates of RTEC.

happensAt(E, T): Event E occurs at time T.
holdsAt(F = V, T): The value of fluent F is V at time T.
holdsFor(F = V, I): I is the list of the maximal intervals for which F = V holds continuously.
initiatedAt(F = V, T): At time T a period of time for which F = V is initiated.
terminatedAt(F = V, T): At time T a period of time for which F = V is terminated.
relative_complement_all(I0, L, I): I is the list of maximal intervals produced by the relative complement of the list of maximal intervals I0 with respect to every list of maximal intervals of list L.
union_all(L, I): I is the list of maximal intervals produced by the union of the lists of maximal intervals of list L.
intersect_all(L, I): I is the list of maximal intervals produced by the intersection of the lists of maximal intervals of list L.
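For intuition, the interval manipulation constructs can be illustrated with a small re-implementation over lists of (start, end) maximal intervals. This is an explanatory sketch, not RTEC's actual Prolog code:

```python
def union_all(lists):
    """Merge several lists of maximal intervals into one list of
    maximal (non-overlapping) intervals."""
    merged = []
    for s, e in sorted(iv for lst in lists for iv in lst):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def intersect_all(lists):
    """Intersection of several lists of maximal intervals."""
    result = lists[0]
    for lst in lists[1:]:
        out = []
        for s1, e1 in result:
            for s2, e2 in lst:
                s, e = max(s1, s2), min(e1, e2)
                if s < e:
                    out.append((s, e))
        result = sorted(out)
    return result

def relative_complement_all(i0, lists):
    """Intervals of i0 minus the union of all intervals in lists."""
    minus = union_all(lists)
    result = []
    for s, e in i0:
        cur = s
        for ms, me in minus:
            if me <= cur or ms >= e:
                continue
            if ms > cur:
                result.append((cur, ms))
            cur = max(cur, me)
        if cur < e:
            result.append((cur, e))
    return result
```

For example, relative_complement_all([(1, 10)], [[(2, 3)], [(5, 7)]]) yields [(1, 2), (3, 5), (7, 10)], which mirrors how source disagreement intervals are computed later in the paper.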
The input SDE streams are represented by logical facts that define event instances, with the use of the happensAt predicate, or the values of fluents, with the use of the holdsAt predicate. Taking up the earlier example, facts of the following structure model the bus data stream:

happensAt(move(Bus, Line, Operator, Delay), T)
holdsAt(gps(Bus, Lon, Lat, Direction, Congestion) = true, T)

gps(Bus, Lon, Lat, Direction, Congestion) states the location (Lon, Lat) of the bus Bus, as well as its Direction. Further, the gps fluent provides information about congestion (0 or 1) in the given location.
CEs, in turn, are modelled as logical rules defining event instances, with the use of happensAt, the effects of events, with the use of initiatedAt and terminatedAt, or the values of fluents, with the use of holdsFor. For illustration, consider an instantaneous CE that expresses a sharp increase in the delay of a bus:

happensAt(delayIncrease(Bus, Lon0, Lat0, Lon, Lat), T) ←
  happensAt(move(Bus, _, _, Delay0), T0),
  holdsAt(gps(Bus, Lon0, Lat0, _, _) = true, T0),
  happensAt(move(Bus, _, _, Delay), T),
  holdsAt(gps(Bus, Lon, Lat, _, _) = true, T)

'_' is a 'free' Prolog variable that is not bound in a rule. The delayIncrease(Bus, Lon0, Lat0, Lon, Lat) CE is recognised when the delay value of a bus increases by more than a specified number of seconds in two SDEs emitted less than a specified number of seconds apart (the corresponding threshold conditions are omitted from the rule above). A CE of this type may indicate a congestion in-the-make between (Lon0, Lat0) and (Lon, Lat). This indication may be reinforced by instances of this CE type concerning other buses operating in the same area.
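The detection logic can be sketched procedurally as follows. Since the paper's concrete threshold values were not preserved here, the 60-second delay increase and 120-second window below are illustrative assumptions:

```python
# last observation per bus: (time, delay, lon, lat)
last_seen = {}

def delay_increase(bus, t, delay, lon, lat,
                   delay_threshold=60, time_window=120):
    """Emit a delayIncrease CE when the delay reported by `bus` grows
    by more than `delay_threshold` seconds between two SDEs emitted
    less than `time_window` seconds apart; otherwise return None."""
    ce = None
    if bus in last_seen:
        t0, delay0, lon0, lat0 = last_seen[bus]
        if t - t0 < time_window and delay - delay0 > delay_threshold:
            ce = ("delayIncrease", bus, lon0, lat0, lon, lat, t)
    last_seen[bus] = (t, delay, lon, lat)
    return ce

# two consecutive SDEs from a hypothetical bus
first = delay_increase("bus33009", 0, 100, -6.26, 53.35)
second = delay_increase("bus33009", 30, 400, -6.27, 53.36)
```

The first SDE only records the bus state; the second, whose delay jumped by 300 seconds within 30 seconds, triggers the CE.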
Figure 2: Event recognition in RTEC.
4.2 Reasoning
CE recognition is performed as follows. RTEC computes and stores the maximal intervals of fluents and the time-points in which events occur at specified query times Q1, Q2, . . .. At each query time Qi, only the SDEs that fall within a specified interval—the 'working memory' (WM) or 'window'—are taken into consideration: all SDEs that took place before or on Qi − WM are discarded. This way, the cost of CE recognition depends only on the size of WM and not on the complete SDE history. As a consequence, 'windowing' will potentially change the answer to some queries. Some of the stored sub-computations may have to be checked and possibly recomputed. Much of the detail of the RTEC algorithms is concerned with this requirement.

The size of WM, and the temporal distance between two consecutive query times—the 'step' (Qi − Qi−1)—are tuning parameters that can be either chosen by the end user or optimized for performance. In the common case that SDEs arrive at RTEC with delays, it is preferable to make WM longer than the step. This way, it becomes possible to compute, at Qi, the effects of SDEs that took place in (Qi − WM, Qi−1], but arrived after Qi−1. This is illustrated in Figure 2. The figure displays the occurrences of SDEs as dots and a Boolean fluent as line segments. For event recognition at Q138, only the events marked in black are considered, whereas the greyed out events are neglected. Assume that all events marked in bold arrived only after Q136. Then, we observe that two SDEs were delayed, i.e. they occurred before Q136, but arrived only after Q136. In our setting, the window is larger than the step. Hence, these events are not lost but considered as part of the recognition step at Q138.
Note that increasing the WM size decreases recognition efficiency. This issue is illustrated in Section 7, where we empirically evaluate RTEC.
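The interplay of window and step can be sketched as follows. The query times, window size and SDE timestamps below are illustrative values chosen to mirror the Figure 2 scenario, not parameters from the actual deployment:

```python
def recognition_inputs(sdes, q, wm):
    """SDEs considered at query time q: those whose occurrence time
    lies in (q - wm, q] and that have arrived by q."""
    return [s for s in sdes
            if q - wm < s["time"] <= q and s["arrival"] <= q]

# An SDE occurring at t = 135 whose arrival is delayed until t = 137:
# with a window (wm = 5) larger than the step (Q136 -> Q138), the SDE
# is missed at Q136 but recovered at Q138 instead of being lost.
sdes = [{"time": 135, "arrival": 137}]
at_q136 = recognition_inputs(sdes, 136, 5)
at_q138 = recognition_inputs(sdes, 138, 5)
```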
4.3 Event Recognition for
Urban Traffic Management
The input to RTEC consists of SDEs that come from two heterogeneous data streams with different time granularity. First, buses transmit information about their position and congestion every 20-30 sec. The structure of the bus SDEs was presented in Section 4.1. Second, static sensors mounted on various junctions—SCATS sensors—transmit every 6 minutes information about traffic flow and density. This instantaneous SDE expresses density D and traffic flow F measured by SCATS sensor S mounted on a lane with approach A into the intersection Int.
In collaboration with domain experts, several CEs have
been defined over the input streams. These CEs relate to,
among others, traffic congestion (in-the-make), and traffic
flow and density trends for proactive decision-making. Traffic
congestion is reported by SCATS sensors as well as buses.
The former is captured as follows:

initiatedAt(scatsCongestion(Int, A, S) = true, T) ←
  D ≥ upper density threshold,
  F ≤ lower flow threshold

terminatedAt(scatsCongestion(Int, A, S) = true, T) ←
  D < upper density threshold

terminatedAt(scatsCongestion(Int, A, S) = true, T) ←
  F > lower flow threshold
scatsCongestion(Int, A, S) = true is a CE expressing congestion at a SCATS sensor. It is initiated when the density D reported by SCATS sensor S, which is mounted on approach A of intersection Int, is above some threshold and the traffic flow F is below some other threshold (see the fundamental diagram of traffic flow). Otherwise, scatsCongestion(Int, A, S) = true is terminated. The maximal intervals for which scatsCongestion(Int, A, S) = true holds continuously are computed by the above rule-set and the domain-independent holdsFor predicate.
Given the above formalisation, we may define congestion
with respect to a SCATS intersection, i.e. an intersection
with at least one SCATS sensor. For example, we may define
that a SCATS intersection is congested if at least n(n > 1) of
its sensors are congested, or we may have a more structured
intersection congestion definition that depends on approach
congestion which in turn would depend on sensor congestion.
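The computation of a simple fluent's maximal intervals from its initiation and termination points (the law of inertia) can be sketched as follows; this is an illustrative re-implementation, not RTEC's internal algorithm:

```python
def holds_for(initiations, terminations, horizon):
    """Maximal intervals during which a simple fluent holds: once
    initiated, the fluent holds until the next termination (law of
    inertia); an open interval is closed at the query horizon."""
    events = sorted([(t, "init") for t in initiations] +
                    [(t, "term") for t in terminations])
    intervals, start = [], None
    for t, kind in events:
        if kind == "init" and start is None:
            start = t
        elif kind == "term" and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, horizon))
    return intervals
```

For instance, a scatsCongestion-style fluent initiated at times 3 and 12 and terminated at time 7 holds over [(3, 7), (12, horizon)].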
Congestion is also reported by buses—this is very useful
as there are numerous areas in the city that do not have
SCATS sensors. Consider the following formalisation:
initiatedAt(busCongestion(Lon, Lat) = true, T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 1) = true, T),
  close(LonB, LatB, Lon, Lat)

terminatedAt(busCongestion(Lon, Lat) = true, T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 0) = true, T),
  close(LonB, LatB, Lon, Lat)

(Lon, Lat) are the coordinates of some area of interest, while (LonB, LatB) are the current coordinates of a bus. The gps fluent, like the move event, is given by the dataset. close is an atemporal predicate computing the distance between two points and comparing it against a threshold. busCongestion(Lon, Lat) = true starts being true when a bus moves close to the location (Lon, Lat) for which we are interested in detecting congestions, and (the bus) reports a congestion (represented by 1 in the gps fluent). Moreover, busCongestion(Lon, Lat) = true stops being true when a (possibly different) bus moves close to (Lon, Lat) and reports no congestion (represented by 0 in gps).
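A possible realisation of the close predicate is a great-circle distance check. The paper does not specify the distance function or threshold, so the haversine formula and the 100-metre threshold below are assumptions:

```python
from math import radians, sin, cos, asin, sqrt

def close(lon1, lat1, lon2, lat2, threshold_m=100.0):
    """True if the haversine great-circle distance between two
    (lon, lat) points is within threshold_m metres.
    The threshold value is a hypothetical choice."""
    dlon, dlat = radians(lon2 - lon1), radians(lat2 - lat1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    distance = 2 * 6371000 * asin(sqrt(a))  # Earth radius ~6371 km
    return distance <= threshold_m
```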
The two data sources, buses and SCATS sensors, do not al-
ways agree on congestion. Disagreement of the event sources
is captured with the following formalisation:
holdsFor(sourceDisagreement(LonInt, LatInt) = true, I) ←
  holdsFor(busCongestion(LonInt, LatInt) = true, I1),
  holdsFor(scatsIntCongestion(LonInt, LatInt) = true, I2),
  relative_complement_all(I1, [I2], I)

scatsIntCongestion(LonInt, LatInt) is a CE expressing congestion in the SCATS intersection located at (LonInt, LatInt). relative_complement_all is an interval manipulation construct of RTEC (see Table 1). In relative_complement_all(I0, L, I), I is the list of maximal intervals produced by the relative complement of the list of maximal intervals I0 with respect to every list of maximal intervals of list L. The maximal intervals for which sourceDisagreement(LonInt, LatInt) = true holds are computed only for the locations of SCATS intersections. A disagreement between the two data sources is said to take place as long as some buses report a congestion in the location (LonInt, LatInt) of a SCATS intersection, and according to the SCATS sensors of that intersection there is no congestion.
The detection of a sourceDisagreement CE indicates a veracity issue in the data sources. There are several ways to deal with this issue. Probabilistic event recognition techniques may
be employed in order to deal with this type of uncertainty.
Consider, for example, probabilistic graphical models [28],
Markov Logic Networks [9, 26], probabilistic logic program-
ming [25], and fuzzy set and possibility theory [19]. Although
there is considerable work on optimising probabilistic rea-
soning techniques, the imposed overhead in the presence of
large data streams, such as those of Dublin, does not allow
for real-time event recognition [1].
In [3], we used a variety of input data to handle veracity.
The events detected on the bus data stream were matched
against the events detected on the SCATS stream to iden-
tify mismatches that indicate uncertainty regarding the data
sources. Temporal regions of uncertainty were identified from
which the system autonomously decided to adapt its sources
in order to deal with uncertainty, without compromising effi-
ciency. More precisely, we assumed that SCATS sensors are
more trustworthy than buses and used these sensors to eval-
uate the information offered by buses. A bus was considered
unreliable when it disagreed with a SCATS sensor on con-
gestion, and remained unreliable as long as it did not agree
during its operation with some other SCATS sensor. The
congestion information offered by unreliable buses, whether
close to a SCATS sensor or not, was discarded.
In this paper, instead, we rely on crowdsourcing techniques
to resolve unreliability in data sources. These techniques
are presented in the following section. The benefits of this
approach are two-fold. First, more accurate information is
directly given to city operators in the case of source disagree-
ment. Second, RTEC takes advantage of the crowdsourced
information to minimise the use of unreliable sources. The
rules below illustrate how this is achieved:
happensAt(disagree(Bus, LonInt, LatInt, positive), T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 1) = true, T),
  close(LonB, LatB, LonInt, LatInt),
  not holdsAt(scatsIntCongestion(LonInt, LatInt) = true, T)

happensAt(disagree(Bus, LonInt, LatInt, negative), T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 0) = true, T),
  close(LonB, LatB, LonInt, LatInt),
  holdsAt(scatsIntCongestion(LonInt, LatInt) = true, T)

happensAt(agree(Bus), T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 1) = true, T),
  close(LonB, LatB, LonInt, LatInt),
  holdsAt(scatsIntCongestion(LonInt, LatInt) = true, T)

happensAt(agree(Bus), T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 0) = true, T),
  close(LonB, LatB, LonInt, LatInt),
  not holdsAt(scatsIntCongestion(LonInt, LatInt) = true, T)
According to the first two rules above, an event disagree(Bus, LonInt, LatInt, Val) takes place when a bus moves close to the location (LonInt, LatInt) of a SCATS intersection and disagrees on congestion with the SCATS sensors of that intersection: Val = positive if the bus states that there is a congestion, and Val = negative otherwise. Similarly, according to the last two rules above, an event agree(Bus) takes place when a bus moves close to the location (LonInt, LatInt) of a SCATS intersection and agrees on congestion with the sensors of that intersection.
A bus is considered unreliable/noisy when it disagrees
on congestion with the SCATS sensors of an intersection
and the information offered by the SCATS sensors is correct
according to the crowdsourced information:
initiatedAt(noisy(Bus) = true, T) ←
  happensAt(disagree(Bus, LonInt, LatInt, BusVal), T),
  happensAt(crowd(LonInt, LatInt, CrowdVal), T0),
  BusVal ≠ CrowdVal

terminatedAt(noisy(Bus) = true, T) ←
  happensAt(agree(Bus), T)

terminatedAt(noisy(Bus) = true, T) ←
  happensAt(disagree(Bus, LonInt, LatInt, Val), T),
  happensAt(crowd(LonInt, LatInt, Val), T0)
crowd(LonInt, LatInt, Val) is an event produced by the crowdsourcing component (details are given in Section 5). It states whether there was a congestion at the SCATS intersection located at (LonInt, LatInt) according to the human crowd: Val = positive if there was a congestion, and Val = negative otherwise. noisy(Bus) = true is initiated when a bus disagrees on congestion both with the SCATS sensors of some intersection and the crowdsourced information. A further condition of the initiating rule, not shown above, requires that the crowdsourced information is used for evaluating the reliability of a bus only if it arrives within a specified period from the time of the source disagreement. noisy(Bus) = true is terminated when the bus agrees with the SCATS sensors of some other intersection, or when it disagrees with SCATS sensors but the crowdsourced information proves the Bus correct.
An alternative definition of noisy(Bus ) is the following:
initiatedAt(noisy(Bus) = true, T) ←
  happensAt(disagree(Bus, _, _, _), T)

terminatedAt(noisy(Bus) = true, T) ←
  happensAt(agree(Bus), T)

terminatedAt(noisy(Bus) = true, T0) ←
  happensAt(disagree(Bus, LonInt, LatInt, Val), T),
  happensAt(crowd(LonInt, LatInt, Val), T0)
According to the above rules, noisy(Bus) = true is initiated when a bus disagrees on congestion with the SCATS sensors of some intersection, even when there is no crowdsourced information to identify the accurate data source. In other words, in the absence of information to the contrary, the SCATS sensors are considered more trustworthy than buses. noisy(Bus) = true is terminated, however, when there is crowdsourced information that proves the bus correct. As before, noisy(Bus) = true is also terminated when there is source agreement.
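The bookkeeping behind this alternative definition can be sketched as a small event-driven state machine. The event tuples and the "crowd_confirms" shorthand (standing for a crowd report matching the bus's own value) are simplifications for illustration:

```python
noisy = set()

def on_event(event):
    """Maintain the set of noisy buses: any disagreement marks a bus
    noisy; agreement with SCATS sensors, or crowdsourced confirmation
    of the bus's report, clears it."""
    kind = event[0]
    if kind == "disagree":            # ("disagree", bus, lon, lat, val)
        noisy.add(event[1])
    elif kind == "agree":             # ("agree", bus)
        noisy.discard(event[1])
    elif kind == "crowd_confirms":    # ("crowd_confirms", bus)
        noisy.discard(event[1])

for e in [("disagree", "bus1", -6.26, 53.35, "positive"),
          ("disagree", "bus2", -6.27, 53.34, "negative"),
          ("agree", "bus1")]:
    on_event(e)
```

After this event sequence only bus2 remains noisy: bus1 was cleared by its subsequent agreement.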
Given either definition of noisy(Bus), the busCongestion definition that reports congestion from bus data is adapted as follows:

initiatedAt(busCongestion(Lon, Lat) = true, T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 1) = true, T),
  not holdsAt(noisy(Bus) = true, T),
  close(LonB, LatB, Lon, Lat)

terminatedAt(busCongestion(Lon, Lat) = true, T) ←
  happensAt(move(Bus, _, _, _), T),
  holdsAt(gps(Bus, LonB, LatB, _, 0) = true, T),
  not holdsAt(noisy(Bus) = true, T),
  close(LonB, LatB, Lon, Lat)
According to this new formalisation, the congestion informa-
tion offered by a bus, whether close to a SCATS intersection
or not, is discarded as long as the bus is considered unreliable,
i.e. as long as the disagreements with SCATS sensors are resolved in favour of those sensors (when noisy(Bus) is defined by the former rule-set) or remain unresolved (when noisy(Bus) is defined by the alternative rule-set (5)).
Given the crowdsourced information, we can also evaluate
the reliability of SCATS sensors. The formalisation is similar
and omitted to save space.
5. CROWDSOURCING

In this section we present the mechanisms we introduce to ameliorate the veracity problem of the data. Our main
advance is the development of a novel crowdsourcing mecha-
nism whose goal is to supplement the data sources through
querying human volunteers, also called “participants”, about
the true state of the system. To minimise the impact on
the participants, the crowdsourcing component is invoked by
the complex event processing engine (RTEC) when a signifi-
cant disagreement in the data sources is detected. Crowd-
sourcing relies on labels produced by imperfect experts—the
participants—rather than on an oracle (e.g. a city employee).
Crowdsourcing has enjoyed a recent rise in popularity due to the development of dedicated online tools, such as Amazon Mechanical Turk, and has been used for many complex tasks such as labelling galaxies [16], real-time worker selection [5] and solving various biological problems [13].
The main appeal of crowdsourcing is the reduced cost of
label acquisition. Typically, the lower quality of the labels
is compensated by acquiring several labels for each data
item and combining them to produce a more accurate label.
E.g. it is well known that the error of the average answer is usually smaller than the average error of each individual answer [12]. Developing increasingly better strategies to
aggregate individual answers is an open research area. Many
approaches try to model how reliable each participant is,
and use participant reliability to improve the aggregation
of answers. To this end, the Expectation-Maximization
(EM) algorithm [23], Bayesian uncertainty scores [24] and
sequential Bayesian estimation [10] have been used.
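The benefit of aggregating several imperfect answers can be checked numerically with synthetic data (an illustrative sketch, not part of the system; the ground-truth value and noise level are arbitrary):

```python
import random

random.seed(0)
truth = 10.0
# 100 noisy participants, each with a large individual error
answers = [truth + random.gauss(0, 3) for _ in range(100)]

avg_individual_error = sum(abs(a - truth) for a in answers) / len(answers)
error_of_average = abs(sum(answers) / len(answers) - truth)

# The error of the average answer is far below the average individual error
print(error_of_average < avg_individual_error)  # True
```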
We present a crowdsourcing component that queries par-
ticipants close to the location of a source disagreement event
whenever requested by the CE processing component. The
output of the crowdsourcing component is used for the res-
olution of the disagreement and sent to the end user/city
operator, the CE processing component and the traffic mod-
elling component.
In what follows, we describe our crowdsourcing model and
briefly review the process of reliability estimation with the
classical Expectation-Maximization (EM) algorithm [8, 22].
This algorithm operates in batch mode, which is not acceptable for our large, streaming problem. Consequently,
we then discuss an online version of the EM algorithm that
supports online crowdsourcing task processing.
5.1 Crowdsourced Model
We model a source disagreement event as an unobserved categorical variable X_t, where t is an index. Each variable X_t has a true value x_t ∈ Val(X_t), where Val(X_t) denotes the set of possible realizations or labels of X_t. Moreover, X_t ⊥ X_{t'} for t ≠ t', where ⊥ denotes probabilistic independence. We assume that we have access to a prior distribution P(X_t) over the possible values of the variable for every t. This distribution can either be provided by the CE processing component, or be the uniform distribution. E.g. if only 1 out of 4 buses at a given location indicates a congestion, the prior distribution could assign a lower prior probability to the congestion than if 3 out of 4 buses reported a congestion.
We denote by y_{i,t} the answer given by participant i if he is queried about X_t, and by Y_{i,t} the associated variable. Moreover, we assume each participant i has a constant but unknown probability p_i to answer with a wrong label x ≠ x_t when he is queried about an event X_t. When a participant does not give the true answer, he chooses another one at random. We also assume that Val(Y_{i,t}) = Val(X_t), i.e. a participant queried about X_t is presented with all possible answers and none other. More formally,

P(Y_{i,t} = x_t | X_t = x_t) = 1 − p_i,  ∀ i, t                       (6)
P(Y_{i,t} = x | X_t = x_t) = p_i / (|Val(X_t)| − 1),  ∀ x ≠ x_t       (7)
We also assume that Y_{i,t} ⊥ Y_{i',t'} except if t = t' and i = i'. For each source disagreement event, we observe a set A_t ≡ {y_{i,t} : i ∈ u_t} of answers, where u_t is the subset of participants queried based on the location. Our goal is to obtain the best prediction ˆx_t of the true value x_t.
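The participant answer model above (correct with probability 1 − p_i, otherwise uniform over the wrong labels) can be sketched as follows; the label names and parameter values are illustrative:

```python
import random

def sample_answer(true_label, labels, p_i, rng):
    """Sample participant i's answer: correct with probability 1 - p_i,
    otherwise uniformly at random among the wrong labels."""
    if rng.random() < 1.0 - p_i:
        return true_label
    wrong = [x for x in labels if x != true_label]
    return rng.choice(wrong)

rng = random.Random(42)
labels = ["congestion", "no congestion", "accident", "roadworks"]  # example label set
answers = [sample_answer("congestion", labels, p_i=0.2, rng=rng) for _ in range(1000)]
frac_correct = answers.count("congestion") / 1000
print(round(frac_correct, 2))  # close to 0.8, i.e. 1 - p_i
```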
Modifying the assumption on the parameterization of the
conditional distribution of the answers or on the indepen-
dence of the answers of different participants about the same
event would not require big modifications to our approach.
On the other hand, if other independence relationships no
longer hold, the EM algorithm presented below may need
to be altered significantly. E.g. consider two sensor disagreements X_t and X_{t'} caused by the same bus during the same working memory. In our crowdsourcing model, we assume these two events to be independent. A more complex model could exploit the relationship between these events. This would require processing them together in the EM algorithm.
5.2 Estimation
If the parameters Θ ≡ {p_1, p_2, ...} (the probability that each participant lies when queried) are known, inferring a posterior distribution P(X_t | A_t, Θ) is straightforward using Bayes rule. However, estimating these parameters is difficult. E.g. the maximum likelihood estimate ˆΘ of these parameters based on a crowdsourced data set of T unobserved events X_{1:T} ≡ {X_1, ..., X_T} and associated answers A_{1:T} ≡ {y_{i,t}} is the solution of the equation below:

ˆΘ = arg max_Θ P(A_{1:T} | Θ)                                         (8)
   = arg max_Θ E_{x_{1:T}} P(x_{1:T}, A_{1:T} | Θ)                    (9)

Solving this equation is not analytically possible, because of the expectation over the hidden variables.
The Expectation-Maximization (EM) algorithm [8, 22] is a well-known method to solve this problem. It computes a sequence of parameters Θ_k that converges to a maximum. The algorithm alternates between computing an expectation of the likelihood, based on the observations and the current estimate of the value of the parameters, and maximizing this expectation to update the parameters:

ˆQ_k(Θ) = E_{x_{1:T} | Θ_k, A_{1:T}} log P(x_{1:T}, A_{1:T} | Θ)      (10)
Θ_{k+1} = arg max_Θ ˆQ_k(Θ)                                           (11)
This algorithm operates in batch mode, which is problem-
atic for stream processing. We could periodically evaluate
the parameters Θ based on the full crowdsourced data set
collected so far, but this would create scaling issues as this
data set keeps growing. We could limit the number of events
we work with to a manageable number, but such a strat-
egy may induce the loss of all the answers provided by a
participant. Indeed, we are only observing the answers of a
(probably small) subset of participants for each event. Hence,
if we operate on a subset of events, there is a risk that we may discard all the answers of a participant.
Therefore, we use instead an online EM algorithm [6]. This algorithm can operate on one source disagreement event at a time, and both the event and the associated answers can be forgotten once this event has been processed. Discarding this information means that we cannot come back later and provide a more educated guess about the true value of the event. This is, however, only a minor drawback in our application. These events have a finite and short duration, so obtaining the true label is only relevant for a short time that depends on the working memory of the CE processing component. Moreover, as opposed to many crowdsourcing applications, we can no longer ask questions about an event once it is over.
The online EM algorithm uses a stochastic approximation step to update the function ˆQ(Θ) with a new event X_t rather than recomputing everything. Equation (10) of the EM algorithm therefore becomes:

ˆQ_t(Θ) = (1 − γ_t) ˆQ_{t−1}(Θ) + γ_t E_{x_t | Θ_k, A_t} log P(x_t, A_t | Θ)    (12)

where the sequence γ_1, γ_2, ... is such that lim_{T→∞} Σ_{t=1}^T γ_t = ∞ and lim_{T→∞} Σ_{t=1}^T γ_t² < ∞. As in the classical EM algorithm, Θ is then estimated by maximizing ˆQ_t(Θ).
In urban traffic management, we do not receive an answer
from every participant for each source disagreement event.
Therefore, we use a different stochastic approximation for
every participant. In other words, we update each participant using a specific γ_{t_i}, where t_i is the number of times this participant has been queried so far. Applying this to the model described in Section 5.1 results in Algorithm 1.

Algorithm 1 Crowdsourcing
Require: {p_1, p_2, ...} and {γ_1, γ_2, ...}
 1: t_i = 1, ∀i
 2: for all P(X_t), A_t, Lon_t, Lat_t, T received do
 3:   for all x ∈ Val(X_t) do   {compute sufficient statistics}
 4:     ˆα(x) = P(X_t = x) ∏_{i ∈ u_t} P(Y_{i,t} = y_{i,t} | X_t = x)
 5:   end for
 6:   for all x ∈ Val(X_t) do
 7:     α(x) = ˆα(x) / Σ_{x' ∈ Val(X_t)} ˆα(x')
 8:   end for
 9:   Val = (“Traffic congestion” == arg max_x α(x))
10:   send happensAt(crowd(Lon_t, Lat_t, Val), T)
11:   for all i ∈ u_t do   {update parameters}
12:     p_i = (1 − γ_{t_i}) p_i + γ_{t_i} (1 − α(y_{i,t}) / Σ_{x ∈ Val(X_t)} α(x))
13:     t_i = t_i + 1
14:   end for
15: end for

Every function ˆQ_t(Θ) is a sum in which each term corresponds to one source disagreement event. The parameters maximizing ˆQ_t(Θ) are therefore also a sum where each term corresponds to one event and depends on the posterior probability P(X_t | A_t, {p_1, p_2, ...}) of the event. Algorithm 1 first computes these terms (lines 3 to 8), and then performs the stochastic approximation update of the parameter estimates (lines 11 to 14).
At line 10, the posterior distribution on the labels of the
event is used to generate a message to the CE processing
component, the traffic modelling component and/or the city
operators. More precisely, we inform the interested parties
whether the most likely label is a congestion or not.
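A compact Python rendering of Algorithm 1 may look as follows (a sketch under assumptions: answers and priors arrive as dictionaries, γ_t = 1/(t+1), and the happensAt(crowd(...)) message of line 10 is represented by the returned label; the deployed component consumes RTEC events instead):

```python
class OnlineEMCrowd:
    """Online EM over participant error rates (sketch of Algorithm 1).
    p[i] is the estimated probability that participant i answers wrongly."""

    def __init__(self, participants, p0=0.25):
        self.p = {i: p0 for i in participants}   # biased towards trustful
        self.t = {i: 1 for i in participants}    # per-participant step count

    def process_event(self, prior, answers, labels):
        """prior: dict label -> P(X_t = label); answers: dict i -> y_{i,t}."""
        # E-step: posterior over labels given the observed answers (lines 3-8)
        alpha = {}
        for x in labels:
            a = prior[x]
            for i, y in answers.items():
                pi = self.p[i]
                a *= (1 - pi) if y == x else pi / (len(labels) - 1)
            alpha[x] = a
        z = sum(alpha.values())
        alpha = {x: a / z for x, a in alpha.items()}
        # Stochastic-approximation update of each queried participant (lines 11-14)
        for i, y in answers.items():
            gamma = 1.0 / (self.t[i] + 1)        # gamma_t = 1/(t+1)
            self.p[i] = (1 - gamma) * self.p[i] + gamma * (1 - alpha[y])
            self.t[i] += 1
        return max(alpha, key=alpha.get), alpha

em = OnlineEMCrowd(participants=[1, 2, 3])
labels = ["congestion", "free"]
label, post = em.process_event({"congestion": 0.5, "free": 0.5},
                               {1: "congestion", 2: "congestion", 3: "free"}, labels)
print(label, post)  # two of three answers, so "congestion" is most likely
```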
5.3 Query Execution Engine
Having defined the query model, the next step is to employ
a crowdsourcing query execution engine to communicate the
queries to the participants while dealing with the challenges of
the mobile setting: real-time performance and reliability. The functions we pursue are: (i) the provision of a communication backbone that reaches the user without effort on his part, and (ii) adaptive mechanisms that achieve real-time and reliable query execution.
To maximize parallelism, the crowdsourcing component
employs the MapReduce programming model [7, 14] to com-
municate the queries to the selected participants and enable
them to do local processing. MapReduce is a computational
paradigm that allows processing parallelizable tasks across
distributed nodes. The model requires that the computa-
tional process is decomposed into two steps, namely map
and reduce, where the following functions are used:

map(key1, value1) → [(key2, value2)]
reduce(key2, [value2]) → [final value]

Each map function processes a key/value pair and produces intermediate key/value pairs. The input of the map function has the form (key1, value1) and the output is another pair (key2, value2). Each map function can be executed in parallel on different nodes. Each reduce function is used to merge all the individual pairs with the same key to produce the final output. Hence, it computes the final output by processing the list of values with the same intermediate key2.
Figure 3: Crowdsourcing application.
In our system, the crowdsourcing query execution engine
communicates the queries to workers—the participants—to
answer specific questions about an event (map task), and
aggregate the results (reduce task). The worker node receives
the assigned task, processes the task and returns the answer,
denoted as intermediate result, through the map function.
After the crowdsourcing component collects all the answers
of the subproblems (intermediate results), it combines them
to form the output, which is the answer to the original query.
This is achieved using the reduce function that is executed
for all the intermediate results with the same key.
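The map/reduce decomposition described above can be sketched as a toy, single-process illustration (in the deployed system, map tasks run on the participants' devices and reduce workers aggregate the intermediate results; the majority-vote reducer and query ids are illustrative):

```python
from collections import defaultdict

def map_fn(query_id, answer):
    # Each worker emits an intermediate (key2, value2) pair
    return [(query_id, answer)]

def reduce_fn(query_id, answers):
    # Merge all values sharing the same intermediate key: majority vote
    return max(set(answers), key=answers.count)

worker_answers = [("q1", "congestion"), ("q1", "congestion"), ("q1", "free")]

intermediate = defaultdict(list)
for key, value in worker_answers:
    for k2, v2 in map_fn(key, value):
        intermediate[k2].append(v2)

final = {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
print(final)  # {'q1': 'congestion'}
```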
Each participant i ∈ U registers with the query execution engine using a mobile device. The connection to the system requires the participant to: (1) connect to the Google Cloud Messaging (GCM) service to retrieve Push Notifications, and (2) connect to the Crowdsourcing Server using his id and identify himself as a Map Worker. Then the participant can leave the application running in the background, subscribing to and retrieving tasks only when needed (see Figure 3). Note that the GCM service enables us to reach the participant even if he changes his connection type (e.g. from WiFi to 3G), or when he remains behind a Network Address Translation-based routing device.
The query execution engine retrieves queries from the crowdsourcing component, along with a list of Worker ids. In order to disseminate a query q, the crowdsourcing component: (1) retrieves the registered online participants from the Crowdsourcing server, (2) selects the list of workers L_q to be queried based on the selected policy (e.g. location, reliability, etc.), and (3) sends L_q and the query q to the Crowdsourcing Server and waits for the answers.
In case we have real-time response requirements for query q, i.e. in the form of a time interval deadline_q, we should ensure that the time it takes to compute the query and communicate it to each selected participant does not exceed the deadline requirement, i.e.:

comm_{i,q} + comp_{i,q} < deadline_q,  ∀ i ∈ L_q

Both the computation and the communication times can be estimated from historical data. The expected computation time comp_{i,q} of each individual participant i to process a task q can be computed from the past executed tasks, and the communication time comm_{i,q} can be estimated from the communication time of the tasks executed previously in the participant's current location, since it depends on the network connection in that area—e.g. 2G or 3G.
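The deadline feasibility check can be sketched as follows (the per-network communication estimates reuse the average communication times reported in Section 7.2; the worker ids and computation times are hypothetical):

```python
# Keep only workers whose estimated comm_iq + comp_iq stays below deadline_q.
COMM_MS = {"2G": 423, "3G": 171, "WiFi": 182}  # historical estimates (ms)

def feasible_workers(workers, deadline_ms):
    """workers: list of (worker_id, network_type, expected_comp_ms)."""
    selected = []
    for wid, network, comp_ms in workers:
        if COMM_MS[network] + comp_ms < deadline_ms:
            selected.append(wid)
    return selected

workers = [("w1", "2G", 500), ("w2", "3G", 400), ("w3", "WiFi", 900)]
print(feasible_workers(workers, deadline_ms=800))  # ['w2']
```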
The Crowdsourcing server disseminates the query q to the selected workers L_q by sending them a Push Notification that appears on their screen and notifies them with a vibration and a ringtone sound. Each worker can open the Map task by touching the notification, upon which the participant's device connects with the Crowdsourcing server and retrieves the query q. For instance, in traffic monitoring, the Map task is displayed on the participant's screen and he can select the answer (see Figure 3). After the Crowdsourcing Server
has received answers from all Map workers or the reply time
interval has expired, the Server selects a number of Reduce
workers based on the selected policy. The Reduce workers
retrieve the intermediate data, which are the answers of
the Map workers, and aggregate them. Finally, the aggre-
gated data are returned to the Crowdsourcing component.
Although in the presented traffic monitoring example the
computation is simple, we employ the MapReduce infrastruc-
ture to be able to additionally assign more complex queries.
For instance, we could employ the sensors of the smartphones
to extract data, such as their current speed or local humidity,
as a Map task, and aggregate the intermediate data based
on their density at the Reduce phase.
6 Traffic Modelling

In this section, we describe our approach to solving the data sparsity problem in our setting. Since data come from fixed installations (SCATS data) and specific routes (bus GPS-coded data), there are large parts of the city that are not
covered. However, from a city monitoring view, it is impor-
tant to offer the operator a current picture on the entire
city area. We present a modelling technique that generalises
the current observations to produce estimates for locations
without sensors. A major requirement for such a technique
is to be scalable to city-sized areas, and key to the scalability
of our approach is focusing on modelling the usual, average
case. The model is currently using SCATS data, and is
trained using past data. The technique is designed to be
general enough that any additional sources that can provide
congestion information at specific locations can be incorpo-
rated in the training, including, specifically, the results of
the crowdsourcing component.
The traffic network contains prior knowledge on movement through the city of Dublin. We model the edge oriented quantities within a Gaussian Process regression framework, similar to the approach in [18]. In the traffic graph G, each junction corresponds to one vertex. To each vertex v_i in the graph, we introduce a latent variable f_i which represents the true traffic flow at v_i. The observed traffic flow values y_i are conditioned on the latent function values with Gaussian noise ε_i:

y_i = f_i + ε_i,  ε_i ∼ N(0, σ²)                                      (13)
We assume that the random vector of all latent function values follows a Gaussian Process (GP); in turn, any finite set of function values {f_i : i = 1, ..., M} has a multivariate Gaussian distribution with mean and covariances computed by the mean and covariance functions of the GP. The multivariate Gaussian prior distribution of the function values f is written as

P(f | X) = N(0, ˆK)                                                   (14)

where ˆK is the so-called kernel and denotes the M×M covariance matrix; zero mean is assumed without loss of generality.
For traffic flow values at unmeasured locations u, the predictive distribution can be computed as follows. Based on the properties of the GP, the vector of observed traffic flows y (at locations v) and unobserved traffic flows f_u (at locations u) follows a Gaussian distribution

(y, f_u) ∼ N( 0, [ ˆK_{v,v} + σ²I   ˆK_{v,u} ; ˆK_{u,v}   ˆK_{u,u} ] )    (15)

where ˆK_{u,v} contains the entries of ˆK between the unobserved vertices u and the observed ones v, and ˆK_{v,v}, ˆK_{u,u} and ˆK_{v,u} are defined equivalently. I is an identity matrix of appropriate size.
Finally, the conditional distribution of the unobserved traffic flows is still Gaussian, with mean m and covariance matrix Σ:

m = ˆK_{u,v} (ˆK_{v,v} + σ²I)⁻¹ y                                     (16)
Σ = ˆK_{u,u} − ˆK_{u,v} (ˆK_{v,v} + σ²I)⁻¹ ˆK_{v,u}                   (17)
Since the latent variables f are linked together in a graph G, the covariances are closely related to the network structure: the variables are highly correlated if they are adjacent in G, and vice versa. Therefore we can employ graph kernels [27] to define the covariance function k(x_i, x_j) among the locations x_i and x_j, and thus the covariance matrix ˆK.
The work in [18, 17] describes methods to incorporate knowledge on preferred routes in the kernel matrix. Lacking this information, we opt for the commonly used regularized Laplacian kernel function, where α and β are hyperparameters. L denotes the combinatorial Laplacian, which is computed as L = D − A, where A denotes the adjacency matrix of the graph G, and D is a diagonal matrix with entries d_{i,i} = Σ_j A_{i,j}.
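The resulting estimation pipeline can be sketched on a toy four-vertex path graph, assuming one common parameterization of the regularized Laplacian kernel, ˆK = (β(L + I/α²))⁻¹ (the paper's exact parameterization is not shown above, so this form, like the hyperparameter values, is an assumption):

```python
import numpy as np

# Toy path graph 0-1-2-3; vertices 0 and 3 are "SCATS" (observed).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                 # combinatorial Laplacian L = D - A

alpha, beta, sigma2 = 1.0, 1.0, 0.1            # illustrative hyperparameters
K = np.linalg.inv(beta * (L + np.eye(4) / alpha**2))  # regularized Laplacian kernel

v, u = [0, 3], [1, 2]                          # observed / unobserved vertices
y = np.array([10.0, 2.0])                      # observed traffic flows at v

Kvv = K[np.ix_(v, v)]
Kuv = K[np.ix_(u, v)]
Kuu = K[np.ix_(u, u)]
inv = np.linalg.inv(Kvv + sigma2 * np.eye(len(v)))

m = Kuv @ inv @ y                              # predictive mean,       Eq. (16)
Sigma = Kuu - Kuv @ inv @ Kuv.T                # predictive covariance, Eq. (17)
print(m)  # flow estimates at the unobserved vertices
```

As expected, the estimate at vertex 1 (adjacent to the high-flow observation) exceeds the estimate at vertex 2 (adjacent to the low-flow observation).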
7 Experimental Evaluation

In this section we present the experimental evaluation of the main components of our system—complex event processing, crowdsourcing and traffic modelling. We used real data
streams coming from the buses and SCATS sensors of Dublin
city. The streams were collected between 1-31 January 2013
and comprise 13GB of data. The bus dataset includes 942
buses. Each operating bus emits SDEs every 20-30 seconds—
on average, the bus dataset has a new SDE every 2 seconds.
The SCATS dataset includes 966 sensors. SCATS sensors
transmit information every six minutes. Both datasets are
publicly available2.
7.1 Complex Event Processing
We recognise CEs concerning traffic flow and density
trends, traffic congestions and congestions in-the-make. Ad-
ditionally, we compute the maximal intervals for which there
is source disagreement, for resolution by means of crowdsourc-
ing, and the intervals for which buses and SCATS sensors
are considered unreliable.

Figure 4: Event recognition performance. Average CE recognition times (sec) are shown against working memory size (10 min ≈ 12,5K SDEs up to 110 min ≈ 152K SDEs) for static and self-adaptive event recognition.

The experiments were run on a
computer with Intel i7 950@3.07GHz processors and 12GB
RAM, running Ubuntu Linux 12.04 and YAP Prolog 6.2.2.
We present two sets of experiments. In the first, we per-
formed ‘static’ recognition, that is, CE recognition that al-
ways takes into consideration all event sources. Then, we per-
formed ‘self-adaptive event recognition’ where noisy sources
are detected at run-time and the system discards them until
they resume offering reliable information. CE recognition
for traffic management, as defined here, is straightforward to
distribute. E.g. in Dublin, SCATS sensors are placed at the intersections of four geographical areas: central city, north city, west city and south city. We distributed CE recognition accordingly. We used four processors of the computer on which we performed the experiments—each processor computed CEs concerning the SCATS sensors of one of the four areas of Dublin, as well as CEs concerning the buses that go through that area. Figure 4 displays the average CE
recognition times in CPU seconds. The working memory
ranges from 10 min, including on average 12,500 SDEs, to
110 minutes, including 152,000 SDEs.
Figure 4 shows that self-adaptive CE recognition has a
minimal overhead compared to static recognition. The over-
head is due to computing and storing the maximal intervals
of additional CEs, capturing the intervals for which some
sources are considered unreliable. Figure 4 also shows that
RTEC performs real-time CE recognition both in the static
and the self-adaptive setting.
7.2 Crowdsourcing
The crowdsourcing component was simulated to evaluate the performance of the online Expectation-Maximisation (EM) algorithm. We simulated 10 participants modelled as described in Section 5. We parameterized these participants using

{p_i}_{i=1}^{10} = {0.05, 0.15, 0.2, 0.25, 0.25, 0.38, 0.4, 0.5, 0.75, 0.9}
as their respective error probabilities. There are 4 possible
answers. The first 7 participants are more likely to answer
truthfully. The 8th participant has the same probability
to give the true answer as one of the wrong ones. The
9th participant selects one of the 4 answers according to a
uniform distribution. The last one is trying to mislead the
system and is more likely to give a wrong answer than the
9th participant.
Figure 5: The estimation of the quality of each participant, as a function of the number of queries to participant i: estimates of p_i (top) and relative estimation error of p_i (bottom).
We used γ_t = 1/(t + 1) for the stochastic approximation parameters. We initialize each p_i to 0.25, so we bias the initial parameters towards trustful participants. Using an unbiased initial estimate (p_i = 0.75) would prevent the parameters from being updated if the prior probability distributions P(X_t) over
the event labels were also uniform. All participants were
queried about each sensor disagreement signalled by the CE
processing component. Figure 5 illustrates the estimation
of the quality of each participant (the probability that he
provides a wrong answer when queried). The values of the
estimation are displayed for each participant at the top and
the relative estimation error at the bottom. Both estimations
are functions of the number of calls to the crowdsourcing component.
A first observation is that the estimated values converge to
the true value of the corresponding parameters. After processing approximately 100 calls, the ordering of the participants by
quality is more or less correct, except for participants whose
error probabilities are close (participants 2-3 and partici-
pants 6-7). Correctly estimating the quality of participants
leads to a better assessment of the sensor disagreement, but
it is also important for rewarding a participant. Indeed, a
participant’s quality may be a factor in the computation of
the reward he receives for his contribution.
Most of the time (94% in this experiment) the posterior
probability distribution is very peaked: the probability of one
of the 4 explanations is greater than 0.99. Rarely, the answers
provided by the participants are not sufficient to remove
the uncertainty. E.g. the following posterior distribution [0.49, 0.41, 0.09, <0.01] does not provide a clear explanation for the source disagreement. In general, however, crowdsourcing is able to resolve an overwhelming number of source disagreements.

Figure 6: Crowdsourcing query execution engine latency for 2G, 3G and WiFi connections (latency in ms, broken down into triggering the task, sending the Push Notification, and communication time).
Query Execution Engine.
Figure 6 shows the latency
of the individual steps of the crowdsourcing query execution
engine using different connection types. The presented times
are averages over 10 executions of crowdsourcing tasks for
each connection type. We do not present the latency of
the human responses, i.e. the latency to open the task and
select an answer. We have observed that these times are
typically a lot higher than the other steps. Figure 6 shows the
latency to trigger a task, including the selection of the workers
and the task assignment in the query execution engine, is
minimal in all cases, since there is no communication with
the participant devices, and ranges from 38 to 55 ms. On
the other hand, sending a Push Notification to the participant device takes 467 ms on a 2G connection, while the 3G and WiFi connections only need 169 ms and 184 ms respectively. Note that a Push Notification requires the query execution engine to send the notification to the Google Cloud Messaging server, which then forwards the notification to the device. Finally, Figure 6 shows the
communication time that involves the communication to
retrieve the task, once the task is selected, and send the
answer back to the query execution engine. The 2G network experiences a larger latency of 423 ms, while the 3G network takes 171 ms and the WiFi connection 182 ms. Hence, although the end-to-end latency depends on the available network, even when only the 2G network is available, less than a second is needed to select a worker and communicate with him.
7.3 Traffic Modelling
For the traffic modelling experiments, the traffic network is generated using OpenStreetMap—see Figure 7. In the pre-processing step, the network is restricted to a bounding window of the size of the city. Next, every street is split at every junction in order to retrieve street segments. Thus, we obtain a graph that represents the street network—see Figure 8. The SCATS locations, depicted as black dots in Figure 8, are mapped to their nearest neighbours within this street network. The sensor readings are aggregated within fixed time intervals. The hyperparameters are chosen in advance using grid search within the interval [0, ..., 10]. Using
the pre-processed measurements, the Gaussian Process esti-
mate is computed for the unobserved locations as described
in Section 6. This step is repeated continuously. The results
are plotted on a visual display—see Figure 9—and shaded
according to their value. High values obtain a red colour, while low values obtain a green colour.
Figure 7: Map of Dublin, Ireland (from OpenStreetMap).
Figure 8: Street network and SCATS locations (black dots)
in Dublin.
Figure 9: Traffic Flow estimates obtained by Gaussian Pro-
cess Regression. Green dots correspond to low traffic whereas
red dots indicate congested locations.
8 Conclusion

We presented a system for heterogeneous stream processing and crowdsourcing supporting intelligent urban traffic management. Complex events related to traffic congestions
(in-the-make) are detected from heterogeneous sources in-
volving fixed sensors mounted on intersections and mobile
sensors mounted on public transport vehicles. To deal with
the inherent data veracity, a crowdsourcing component han-
dles and resolves source disagreement. Furthermore, to deal
with data sparsity, a traffic modelling component makes con-
gestion estimates in areas with low or non-existent sensor
coverage. Our empirical evaluation on data streams from
Dublin city showed the feasibility of the proposed system.
Acknowledgements

This work is funded by the EU FP7 INSIGHT project
(318225), the ERC IDEAS NGHCS project, and the Deutsche
Forschungsgemeinschaft within the CRC SFB 876 “Provid-
ing Information by Resource-Constrained Data Analysis”,
projects A1 and C1.
References

[1] A. Artikis, O. Etzion, Z. Feldman, and F. Fournier.
Event processing under uncertainty. In DEBS, pages
32–43. ACM, 2012.
[2] A. Artikis, M. Sergot, and G. Paliouras. Run-time
composite event recognition. In DEBS, pages 69–80.
ACM, 2012.
[3] A. Artikis, M. Weidlich, A. Gal, V. Kalogeraki, and
D. Gunopulos. Self-adaptive event recognition for
intelligent transport management. In Big Data. IEEE, 2013.
[4] C. Bockermann and H. Blom. The streams framework. Technical Report 5, TU Dortmund University, December 2012.
[5] I. Boutsis and V. Kalogeraki. Crowdsourcing under
real-time constraints. In IPDPS, pages 753–764, 2013.
[6] O. Cappé and E. Moulines. On-line
expectation–maximization algorithm for latent data
models. Journal of the Royal Statistical Society: Series
B (Statistical Methodology), 71(3):593–613, 2009.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data
processing on large clusters. In OSDI, 2004.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin.
Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society.
Series B (Methodological), 39:1–38, 1977.
[9] P. Domingos and D. Lowd. Markov Logic: An Interface
Layer for Artificial Intelligence. Morgan & Claypool
Publishers, 2009.
[10] P. Donmez, J. G. Carbonell, and J. G. Schneider. A
probabilistic framework to learn from multiple
annotators with time-varying accuracy. In SDM, 2010.
[11] O. Etzion and P. Niblett. Event Processing in Action.
Manning Publications Company, 2010.
[12] F. Galton. Vox populi. Nature, 75:450–451, 1907.
[13] B. M. Good and A. I. Su. Crowdsourcing for
bioinformatics. Bioinformatics, 2013.
[14] T. Kakantousis, I. Boutsis, V. Kalogeraki,
D. Gunopulos, G. Gasparis, and A. Dou. Misco: A
system for data analysis applications on networks of
smartphones using mapreduce. In MDM12, 2012.
[15] R. Kowalski and M. Sergot. A logic-based calculus of
events. New Generation Computing, 4(1):67–96, 1986.
[16] K. Land, A. Slosar, C. Lintott, D. Andreescu,
S. Bamford, P. Murray, R. Nichol, M. J. Raddick,
K. Schawinski, A. Szalay, D. Thomas, and
J. Vandenberg. Galaxy Zoo: the large-scale spin
statistics of spiral galaxies in the Sloan Digital Sky
Survey. Monthly Notices of the Royal Astronomical
Society, 388:1686–1692, Aug. 2008.
[17] T. Liebig, Z. Xu, and M. May. Incorporating mobility
patterns in pedestrian quantity estimation and sensor
placement. In Citizen in Sensor Networks, volume
LNCS 7685, pages 67–80. Springer, 2013.
[18] T. Liebig, Z. Xu, M. May, and S. Wrobel. Pedestrian
quantity estimation with trajectory patterns. In
Machine Learning and Knowledge Discovery in
Databases, volume LNCS 7524, pages 629–643.
Springer, 2012.
[19] H. Liu and H.-A. Jacobsen. Modeling uncertainties in
publish/subscribe systems. In ICDE, 2004.
[20] D. Luckham. The Power of Events: An Introduction to
Complex Event Processing in Distributed Enterprise
Systems. Addison-Wesley, 2002.
[21] D. Luckham and R. Schulte. Event processing glossary
— version 1.1. Event Processing Technical Society, July 2008.
[22] G. McLachlan and T. Krishnan. The EM algorithm and
extensions, volume 382. John Wiley and Sons, 2008.
[23] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez,
C. Florin, L. Bogoni, and L. Moy. Learning from
crowds. The Journal of Machine Learning Research,
99:1297–1322, 2010.
[24] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get
another label? improving data quality and data mining
using multiple, noisy labelers. In KDD. ACM, 2008.
[25] A. Skarlatidis, A. Artikis, J. Filippou, and G. Paliouras.
A probabilistic logic programming event calculus.
Theory and Practice of Logic Programming, 2013.
[26] A. Skarlatidis, G. Paliouras, G. Vouros, and A. Artikis.
Probabilistic event calculus based on markov logic
networks. In RuleML America, pages 155–170, 2011.
[27] A. Smola and R. Kondor. Kernels and regularization on
graphs. In Proc. Conf. on Learning Theory and Kernel
Machines, pages 144–158, 2003.
[28] S. Wasserkrug, A. Gal, O. Etzion, and Y. Turchin.
Efficient processing of uncertain events in rule-based
systems. IEEE Trans. Knowl. Data Eng., 2011.
... Crowdsourcing has been popular for more than a decade since the term was first introduced (Howe, 2006), and has been widely applied in various contexts: writing (Bernstein et al., 2010), translation (Zaidan and Callison-Burch, 2011), emergency response (Zook et al., 2010;, traffic monitoring (Yan et al., 2009;Artikis et al., 2014), classification (Ho et al., 2013;Shamir et al., 2014), etc. In most of these applications, microtask crowdsourcing has been used. ...
Microtask crowdsourcing has been applied in many fields in the past decades, but there are still important challenges not fully addressed, especially in task/workflow design and aggregation methods to help produce a correct result or assess the quality of the result. This research took a deeper look at crowdsourcing classification tasks and explored how task and workflow design can impact the quality of the classification result. This research used a large online knowledge base and three citizen science projects as examples to investigate workflow design variations and their impacts on the quality of the classification result based on statistical, probabilistic, or machine learning models for true label inference, such that design principles can be recommended and applied in other citizen science projects or other human-computer hybrid systems to improve overall quality. It is noticeable that most of the existing research on aggregation methods to infer true labels focus on simple single-step classification though a large portion of classification tasks are not simple single-step classification. There is only limited research looking into such multiple-step classification tasks in recent years and each has a domain-specific or problem-specific focus making it difficult to be applied to other multiple-steps classifications cases. This research focused on multiple-step classification, modeling the classification task as a path searching problem in a graph, and explored alternative aggregation strategies to infer correct label paths by leveraging established individual algorithms from simple majority voting to more sophisticated algorithms like message passing, and expectation-maximisation. This research also looked at alternative workflow design to classify objects using the DBpedia entity classification as a case study and demonstrated the pros and cons of automatic, hybrid, and completely humanbased workflows. 
As a result, it is able to provide suggestions to the task requesters for crowdsourcing classification task design and help them choose the aggregation method that will achieve a good quality result.
... Recent innovations in information and communication technologies have led to an increase in the adoption of real-time crowd-sourced data feeds in transport modeling. This includes applications of location data for emissions estimations (Hirschmann et al., 2010), building origin and destination matrices (Toole et al., 2015) and general urban traffic management applications (Artikis et al., 2014). This study investigates the use of novel real-time crowd-sourced data feeds that have wider spatial coverage and are not generated specifically for estimating volume-delay functions. ...
Full-text available
Traffic congestion across the world has reached chronic levels. Despite many technological disruptions, one of the most fundamental and widely used functions within traffic modeling, the volume–delay function has seen little in the way of change since it was developed in the 1960s. Traditionally macroscopic methods have been employed to relate traffic volume to vehicular journey time. The general nature of these functions enables their ease of use and gives widespread applicability. However, they lack the ability to consider individual road characteristics (i.e., geometry, presence of traffic furniture, road quality, and surrounding environment). This research investigates the feasibility to reconstruct the model using two different data sources, namely the traffic speed from Google Maps’ Directions Application Programming Interface (API) and traffic volume data from automated traffic counters (ATC). Google’s traffic speed data are crowd-sourced from the smartphone Global Positioning System (GPS) of road users, able to reflect real-time, context-specific traffic condition of a road. On the other hand, the ATCs enable the harvesting of the vehicle volume data over equally fine temporal resolutions (hourly or less). By combining them for different road types in London, new context-specific volume–delay functions can be generated. This method shows promise in selected locations with the generation of robust functions. In other locations, it highlights the need to better understand other influencing factors, such as the presence of on-road parking or weather events.
... The main areas identified from the set of 50 resulting papers were: sensor networks, advertising, finance, telecommunication, social networks, synthetic applications, and network monitoring. The percentage of papers in our set belonging to the different areas is depicted in Figure 2 and the complete list can be found in Table 2. Area Papers Finance [25], [26], [27], [28] Network Monitoring [29], [30], [31], [32] Synthetic [33], [34], [35], [36], [37], [38] Traffic Monitoring [39], [40], [41], [42], [43] Advertising [29], [30], [15], [44], [45], [46], [47] Sensor Network [48], [49], [50], [51], [52], [53], [11], [54] Social Network [55], [56], [57], [58], [59], [60], [61], [62], [63], [18] Telecom [64], [65], [66], [67], [68] Gaming [69] The choice of the applications requires the use of a suitable workload characterization in order to discard those with very similar behavior. In past works [70], two techniques are employed to collect information enabling such characterization: performance measurement instrumentation and source code analysis. ...
Full-text available
Systems enabling the continuous processing of large data streams have recently attracted the attention of the scientific community and industrial stakeholders. Data Stream Processing Systems (DSPSs) are complex and powerful frameworks able to ease the development of streaming applications in distributed computing environments like clusters and clouds. Several systems of this kind have been released and currently maintained as open source projects, like Apache Storm and Spark Streaming. Some benchmark applications have often been used by the scientific community to test and evaluate new techniques to improve the performance and usability of DSPSs. However, the existing benchmark suites lack of representative workloads coming from the wide set of application domains that can leverage the benefits offered by the stream processing paradigm in terms of near real-time performance. The goal of this article is to present a new benchmark suite composed of 15 applications coming from areas like Finance, Telecommunications, Sensor Networks, Social Networks and others. This article describes in detail the nature of these applications, their full workload characterization in terms of selectivity, processing cost, input size and overall memory occupation. In addition, it exemplifies the usefulness of our benchmark suite to compare real DSPSs by selecting Apache Storm and Spark Streaming for this analysis.
... Recent innovations in the information and communication technologies have led to an increase in the adoption of real-time crowd-sourced data feeds in transport modelling. This includes applications of location data for emissions estimations (Hirschmann et al., 2010), building origin and destination matrices (Toole et al., 2015) and general urban traffic management applications (Artikis et al., 2014). This study investigates the use of a novel real-time crowd-sourced data feeds that have wider spatial coverage and are not generated specifically for estimating volume-delay functions. ...
Full-text available
Traffic congestion across the world has reached chronic levels. Despite many technological disruptions, one of the most fundamental and widely used functions within traffic modelling, the volume delay function, has seen little in the way of change since it was developed in the 1960's. Traditionally macroscopic methods have been employed to relate traffic volume to vehicular journey time. The general nature of these functions enables their ease of use and gives widespread applicability. However, they lack the ability to consider individual road characteristics (i.e. geometry, presence of traffic furniture, road quality and surrounding environment). This research investigates the feasibility to reconstruct the model using two different data sources, namely the traffic speed from Google Maps' Directions Application Programming Interface (API) and traffic volume data from automated traffic counters (ATC). Google's traffic speed data are crowd-sourced from the smartphone Global Positioning System (GPS) of road users, able to reflect real-time, context-specific traffic condition of a road. On the other hand, the ATCs enable the harvesting of the vehicle volume data over equally fine temporal resolutions (hourly or less). By combining them for different road types in London, new context-specific volume-delay functions can be generated. This method shows promise in selected locations with the generation of robust functions. In other locations it highlights the need to better understand other influencing factors, such as the presence of on road parking or weather events.
Monitoring of streamed data to detect abnormal behaviour (variously known as event detection, anomaly detection, change detection, or outlier detection) underlies many applications, especially within the Internet of Things. There, one often collects data from a variety of sources, with asynchronous sampling, and missing data. In this setting, one can detect abnormal behavior using low-rank techniques. In particular, we assume that normal observations come from a low-rank subspace, prior to being corrupted by a uniformly distributed noise. Correspondingly, we aim to recover a representation of the subspace, and perform event detection by running point-to-subspace distance query for incoming data. We use a variant of low-rank factorisation, which considers interval uncertainty sets around “known entries”, on a suitable flattening of the input data to obtain a low-rank model. On-line, we compute the distance of incoming data to the low-rank normal subspace and update the subspace to keep it consistent with the seasonal changes present. For the distance computation, we consider subsampling. We bound the one-sided error as a function of the number of coordinates employed. In our computational experiments, we test the proposed algorithm on induction-loop data from Dublin, Ireland.
Due to the ubiquitous nature of smartphones, opportunistic phone-based crowdsensing has emerged as an important sensing modality. Since fine-grain ambient temperature measurements are a pre-requisite for energy-efficient operation of heating and cooling (HVAC) systems in buildings, in this paper, we use mobile phone sensing in conjunction with a web-based crowdsensing system to obtain detailed ambient temperature estimates inside buildings. We present a machine learning approach based on a random forest ensemble learning model that uses the phone battery temperature sensor to infer the ambient air temperature. We also present a few-shot transfer learning method to quickly learn and deploy our model onto new phones with modest training overheads. Our crowdsensing web service enables predictions made by multiple phones to be aggregated in an opportunistic fashion, extending our approach from an individual level to a community level. We evaluate our ML-based model for a range of devices, operating scenarios, and ambient temperatures, and see mean errors of less than ±0.5°F for our temperature predictions. More generally, our results show the feasibility of using an on-device ML model for ambient temperature predictions in mobile phones. This allows buildings – new and old, with and without sensing systems – to benefit from a new class of ubiquitous temperature sensors, enabling more sustainable operation.
Conference Paper
Full-text available
In the recent years we are experiencing the rapid growth of crowdsourcing systems, in which “human workers” are enlisted to perform tasks more effectively than computers, and get compensated for the work they provide. The common belief is that the wisdom of the “human crowd” can greatly complement many computer tasks which are assigned to machines. A significant challenge facing these systems is determining the most efficient allocation of tasks to workers to achieve successful completion of the tasks under real-time constraints. This paper presents REACT, a crowdsourcing system that seeks to address this challenge and proposes algorithms that aim to stimulate user participation and handle dynamic task assignment and execution in the crowdsourcing system. The goal is to determine the most appropriate workers to assign incoming tasks, in such a way so that the realtime demands are met and high quality results are returned. We empirically evaluate our approach and show that REACT meets the requested real-time demands, achieves good accuracy, is efficient, and improves the amount of successful tasks that meet their deadlines up to 61% compared to traditional approaches like AMT.
Conference Paper
Full-text available
The recent years have seen a proliferation of community sensing or participatory sensing paradigms, where individuals rely on the use of smart and powerful mobile devices to collect, store and analyze data from everyday life. Due to this massive collection of the data, a key challenge to all such developments, is to provide a simple but efficient way to facilitate the programming of distributed applications on the embedded devices. We will demonstrate a novel system that provides a principled approach to developing distributed data clustering applications on networks of smartphones and other mobile devices. The system comprises three components: (a) a distributed framework, implemented on mobile phones that eases the programmability and deployment of applications on the devices using simple programming primitives, (b) a data gathering component that tracks the movement of wireless device users and collects sensor data (i.e., GPS and accelerometer sensor data), and (c) a distributed data clustering algorithm that allows users to combine their individual data, that is distributed and energy efficient. Using a road traffic monitoring application we demonstrate how MISCO can efficiently identify anomalies in the road surface conditions and illustrate that our system is practical and has low energy and resource overhead.
Conference Paper
Full-text available
Intelligent transport management involves the use of voluminous amounts of uncertain sensor data to identify and effectively manage issues of congestion and quality of service. In particular, urban traffic has been in the eye of the storm for many years now and gathers increasing interest as cities become bigger, crowded, and “smart”. In this work we tackle the issue of uncertainty in transportation systems stream reporting. The variety of existing data sources opens new opportunities for testing the validity of sensor reports and self-adapting the recognition of complex events as a result. We report on the use of a logic-based event reasoning tool to identify regions of uncertainty within a stream and demonstrate our method with a real-world use-case from the city of Dublin. Our empirical analysis shows the feasibility of the approach when dealing with voluminous and highly uncertain streams.
Full-text available
Events are particularly important pieces of knowledge, as they represent activities of special significance within an organisation: the automated recognition of events is of utmost importance. We present RTEC, an Event Calculus dialect for run-time event recognition and its Prolog implementation. RTEC includes a number of novel techniques allowing for efficient run-time recognition, scalable to large data streams. It can be used in applications where data might arrive with a delay from, or might be revised by, the underlying event sources. We evaluate RTEC using a real-world application.
Full-text available
Big data is recognized as one of the three technology trends at the leading edge a CEO cannot afford to overlook in 2012. Big data is characterized by volume, velocity, variety and veracity ("data in doubt"). As big data applications, many of the emerging event processing applications must process events that arrive from sources such as sensors and social media, which have inherent uncertainties associated with them. Consider, for example, the possibility of incomplete data streams and streams including inaccurate data. In this tutorial we classify the different types of uncertainty found in event processing applications and discuss the implications on event representation and reasoning. An area of research in which uncertainty has been studied is Artificial Intelligence. We discuss, therefore, the main Artificial Intelligence-based event processing systems that support probabilistic reasoning. The presented approaches are illustrated using an example concerning crime detection.
We outline an approach for reasoning about events and time within a logic programming framework. The notion of event is taken to be more primitive than that of time and both are represented explicitly by means of Horn clauses augmented with negation by failure. The main intended applications are the updating of databases and narrative understanding. In contrast with conventional databases which assume that updates are made in the same order as the corresponding events occur in the real world, the explicit treatment of events allows us to deal with updates which provide new information about the past. Default reasoning on the basis of incomplete information is obtained as a consequence of using negation by failure. Default conclusions are automatically withdrawn if the addition of new information renders them inconsistent. Because events are differentiated from times, we can represent events with unknown times, as well as events which are partially ordered and concurrent.
. We propose a generic on-line (also sometimes called adaptive or recursive) version of the expectation–maximization (EM) algorithm applicable to latent variable models of independent observations. Compared with the algorithm of Titterington, this approach is more directly connected to the usual EM algorithm and does not rely on integration with respect to the complete-data distribution. The resulting algorithm is usually simpler and is shown to achieve convergence to the stationary points of the Kullback–Leibler divergence between the marginal distribution of the observation and the model distribution at the optimal rate, i.e. that of the maximum likelihood estimator. In addition, the approach proposed is also suitable for conditional (or regression) models, as illustrated in the case of the mixture of linear regressions model.
Motivation: Bioinformatics is faced with a variety of problems that require human involvement. Tasks like genome annotation, image analysis, knowledge-base population and protein structure determination all benefit from human input. In some cases, people are needed in vast quantities, whereas in others, we need just a few with rare abilities. Crowdsourcing encompasses an emerging collection of approaches for harnessing such distributed human intelligence. Recently, the bioinformatics community has begun to apply crowdsourcing in a variety of contexts, yet few resources are available that describe how these human-powered systems work and how to use them effectively in scientific domains. Results: Here, we provide a framework for understanding and applying several different types of crowdsourcing. The framework considers two broad classes: systems for solving large-volume 'microtasks' and systems for solving high-difficulty 'megatasks'. Within these classes, we discuss system types, including volunteer labor, games with a purpose, microtask markets and open innovation contests. We illustrate each system type with successful examples in bioinformatics and conclude with a guide for matching problems to crowdsourcing solutions that highlights the positives and negatives of different approaches.