Event Stream Processing with Multiple Threads
Sylvain Hallé, Raphaël Khoury, and Sébastien Gaboury
Laboratoire d'informatique formelle
Université du Québec à Chicoutimi, Canada
(This paper is an extended version of the paper of the same title published at the 2017 International Conference on Runtime Verification.)
Abstract.
Current runtime verification tools seldom make use of multi-threading
to speed up the evaluation of a property on a large event trace. In this paper, we
present an extension to the BeepBeep 3 event stream engine that allows the use of
multiple threads during the evaluation of a query. Various parallelization strategies
are presented and described on simple examples. The implementation of these
strategies is then evaluated empirically on a sample of problems. Compared to the
previous, single-threaded version of the BeepBeep engine, the allocation of just
a few threads to specific portions of a query provides dramatic improvement in
terms of running time.
1 Introduction
Since its inception, the field of Runtime Verification (RV) has undergone a significant
growth, both in the expressiveness of its specification languages, and in the range of
problems it has addressed. Beyond finite-state machines and propositional temporal
logics such as LTL, many runtime monitoring systems support specification languages extending these formalisms in various ways. Moreover, it is not uncommon for
RV use cases to involve systems generating tens of thousands of events per second. This
puts pressure on the capacity of existing systems to efficiently process this flow of data,
leading to the development of numerous optimization techniques.
Among these techniques, the leveraging of parallelism in existing computer systems
has seldom been studied. Indeed, the prospect of parallel processing of temporal
constraints in general, and LTL formulæ in particular, is held back precisely because of
the sequential nature of the properties to verify: since the validity of an event may depend
on past and future events, the handling of parts of the trace in parallel and independent
processes seems to be disqualified at the onset. Case in point, a review of available
solutions in Section 2 of this paper observes that most existing trace validation tools are
based on algorithms that do not take advantage of parallelism. While a few solutions do
make use of threads, GPUs, distributed infrastructures or dedicated hardware, in virtually
all cases parallelism is not employed for the evaluation of the property itself.
Current popular trace analysis software, such as the entrants to the latest Competition
on Runtime Verification (CRV) [36], are by and large single-threaded tools. This is certainly
true of the authors’ BeepBeep 3, described in Section 3, whose version entered at CRV
2016 did not use any multi-threading. Analysis of the available source code for MarQ [35],
the single other system entered in the competition’s offline track, did not reveal the use of
any multi-threading either. This is supported by visual inspection of the CPU load graph during the execution of the tool, which shows a high load on only one of the available cores at any point in time.
Figure 1 gives an example of this situation. The execution of the single-thread system
is shown in the left part of the graph. One can see that, while the high CPU usage
alternates between the four available cores of the host machine (a result of the operating
system’s load balancing), only one at a time exhibits a high load. This results in much
available CPU power that is not used to speed up the system. On the contrary, a system
harnessing all the available computing resources would result in a load graph similar to
the right part of Figure 1: this time, all four cores are used close to their full capacity.
This figure also presents visual confirmation that the property, in this latter case, takes
much less time to evaluate than in the single-thread scenario.
Fig. 1: An actual CPU load graph from the BeepBeep 3 system evaluating a query on a
large event trace. The left part of the graph shows the query in single-threaded mode; the
right part shows the same query running with multi-threading enabled.
In this paper, we present a set of patterns that allow the use of multiple threads in
the evaluation of queries on traces of events. Section 4 shows how these techniques
leverage the architecture of the BeepBeep 3 event stream processor, whose computation
of queries is implemented as the composition of multiple simple transducers called
processors. The techniques we propose are based on the idea of “wrapping” a processor
or a group of processors into one of a few special, thread-aware containers. This has for
effect that any part of an event stream query can be parallelized, without requiring the
wrapped processors to be aware of the availability of multiple threads. Thanks to this
architecture, any existing computation can be made to leverage parallelism with very
minimal modifications to the original query.
Section 5 then shows experimentally the potential impact of parallelism. Based on a
sample of properties taken from the latest Competition on Runtime Verification (CRV), it
reveals that five out of six queries show improved throughput with the use of parallelism,
sometimes running as much as 4.4 times faster.
2 Parallelism in Runtime Verification
A distinction must be made between the runtime verification of parallel and concurrent
systems [22,27,30, 34,37,38] and the use of parallelism for runtime verification. This
paper is concerned with the latter question. Related literature can be split into two families,
depending on what is being parallelized. We shall call these two families “second-hand”
and “first-hand” parallelism.
2.1 Second-Hand Parallelism
A first set of works provide improvements to elements that are peripheral to the evaluation
of the property, such as the communication between the monitor and the program, or the
use of a separate chip to run the monitor. We call this form of parallelism “second-hand”.
The prospect of using physical properties of hardware to boost the performance of
runtime verification has already been studied in the recent past. For example, Pellizzoni
et al. [33] utilized dedicated commercial-off-the-shelf (COTS) hardware [13] to facilitate
the runtime monitoring of critical embedded systems whose properties were expressed in Past-time Linear Temporal Logic (PTLTL).
As the number of cores (in GPUs or multi-core CPUs) in commodity hardware keeps increasing, exploiting the available processors or cores to parallelize tasks presents both a challenge and an opportunity for improving the architecture of runtime verification. For example, Ha et al. [18] introduced a
buffering design called Cache-friendly Asymmetric Buffering (CAB) to improve the
communications between application and runtime monitor by exploiting the shared cache
of the multi-core architecture; Berkovich et al. [9] proposed a GPU-based solution that
effectively utilizes the available cores of the GPU, so that the monitor designed and
implemented with their method can run in parallel with the target program and evaluate
LTL properties.
2.2 First-Hand Parallelism
While the previous works all claim improvements in the efficiency of the monitoring
process, they do not address directly the issue of performing the evaluation of the property
itself using parallelism. Therefore, a second family of related works address the issue of
what we shall call “first-hand” parallelism. In various ways, these works attempt to split
the computation steps of evaluating a query on a trace into blocks that can be executed
independently over multiple processes.
In Runtime Verification When trace properties are expressed as Linear Temporal Logic,
Kuhtz and Finkbeiner showed that the path checking problem (i.e. verifying that a given
trace fulfills the property) belongs to the complexity class AC¹(logDCFL) [25]. This result
entails that the process can be efficiently split by evaluating entire blocks of events in
parallel. Rather than sequentially traversing the trace, their work considers the circuit
that results from “unrolling” the formula over the trace. However, while the evaluation of
this unrolling can be done in parallel, a specific type of Boolean circuit needs to be built in advance, which depends on the length of the trace to evaluate. Moreover, the
argument consists of a formal proof; deriving a concrete, usable implementation of this
technique remains an open question.
In an offline context, the leveraging of a cluster infrastructure was first suggested
by Hallé et al., who introduced an algorithm for the automated verification of Linear
Temporal Logic formulæ on event traces, using an increasingly popular cloud computing
framework called MapReduce [19]. The algorithm can process multiple, arbitrary
fragments of the trace in parallel, and compute its final result through a cycle of runs
of MapReduce instances. A different architecture using MapReduce was proposed by
Basin et al. [7]. These techniques only apply in a batch (i.e. offline) context. Moreover,
the setup of a MapReduce infrastructure is a heavy process that only pays off for massive
amounts of data to be evaluated against the same property. At least in the case of the
first approach, it was discovered experimentally that the anticipated speed-up caused by
parallelism was offset by the increased amount of communication and the sheer volume
of tuples that needed to be manipulated.
A second downside of these approaches is that both are tailored to one specific query
language —in both cases an extension of Linear Temporal Logic. It remains unclear
how computing other types of results over an event trace could be done in the same
framework.
In Event Stream Processing Much more work on parallelism and distribution was
undertaken from the database point of view. The evaluation of SQL queries using
multiple cores has spawned a large line of works [5,15, 17,24, 26,32]. However, relational
database queries differ significantly from runtime verification properties, where the order
of events is an essential matter. This rules out the possibility of simply sharding the source data into independent partitions that can be processed in isolation, as sequential relationships between events across multiple shards can go missing.
Much closer to RV’s concerns is the field of event stream processing (ESP). An ESP
system generally consists of one or more unbounded sequences of events (the streams),
which are sent to processing units connected according to a directed acyclic graph (DAG).
Such a graph, also called a query, can run continuously on incoming streams or by
reading pre-recorded data. The similarities between runtime verification and ESP have
been highlighted in a recent tutorial [20].
Some recent ESP systems, like Siddhi [39], appear to be single-machine, single-thread
query engines. However, many others allow processing units to be distributed over multiple machines that exchange data over a network. By manually
assigning fragments of the query to different machines, distribution of computation can
effectively be achieved. Such systems include Aurora, Borealis [2], Cayuga [10], Apache
Storm [4], Apache S4 [31] and Apache Samza [3]. This form of parallelism is called
“inter-host”, as the computation is performed on physically distinct devices.
Some systems also support “intra-host” parallelism, this time by allotting fragments
of a task to multiple cores (or threads) within a single host. One notable example is
Esper [1], which has documented multi-threading facilities. Of special interest is a system
called PIPES, which provides a three-level multi-threaded architecture [11]. On the first
level, operators are directly connected, forming virtual operators; on the second level,
virtual operators are combined by inserting buffers in-between, and separate threads control the subgraphs thus formed. Finally, on the top level, a scheduler enables concurrent execution
of these subgraphs.
3 The BeepBeep 3 Event Stream Query Engine
BeepBeep 3 is an event stream processing engine. It can be used either as a Java library embedded in another application's source code, or as a stand-alone query interpreter running from the command-line. Releases of BeepBeep 3 are publicly available for download under an open source license (https://liflab.github.io/beepbeep-3). In this section, we briefly describe the basic principles
underlying the architecture of BeepBeep. The reader is referred to a recent tutorial
for more details, such as the complete formal semantics of the language and some
examples [20].
3.1 Processors
Let T be an arbitrary set of elements. An event trace of type T is a sequence e = e0e1..., where ei ∈ T for all i. Event types can be as simple as single characters or numbers, or as complex as matrices, XML documents, plots, logical predicates, polynomials or any other user-defined data structure.
A function is an object that takes zero or more events as its input, and produces zero or more events as its output. In BeepBeep, functions are first-class objects; they all descend from an abstract ancestor named Function, which declares a method called evaluate() so that outputs can be produced for a given array of inputs. A processor is an object that takes zero or more event traces, and produces zero or more event traces as its output. A function is stateless, and operates on individual events: given an input, it immediately produces an output, and the same output is always returned for the same inputs. On the contrary, a processor is a stateful device: for a given input, its output may depend on events received in the past. Any processors with compatible types can be freely composed.
A processor produces its output in a streaming fashion: it does not wait to read its
entire input trace before starting to produce output events. However, a processor can
require more than one input event to create an output event, and hence may not always
output something when given an input. Processors can then be composed (or “piped”)
together, by letting the output of one processor be the input of another. When a processor
has an input arity of 2 or more, the processing of its input is done synchronously. This
means that a computation step will be performed if and only if an event can be consumed
from each input trace. This entails that processors must implicitly manage buffers to
store input events until a result can be computed. This buffering is implicit: it is absent
from both the formal definition of processors and any graphical representation of their
piping. Nevertheless, the concrete implementation of a processor must take care of these
buffers in order to produce the correct output. In BeepBeep, this is done with the abstract class SingleProcessor; descendants of this class simply need to implement a method named compute(), which is called only when an event is ready to be consumed at each input.
3.2 Built-In Processors and Palettes
BeepBeep is organized along a modular architecture. The main part of BeepBeep is
called the engine, which provides the basic classes for creating processors and functions,
and contains a handful of general-purpose processors for manipulating traces. The rest
of BeepBeep’s functionalities is dispersed across a number of optional palettes, which
will be discussed later.
A first way to create a processor is by lifting any m:n function f into an m:n processor. This is done by applying f successively to each input event, producing the output events. A few processors can be used to alter the sequence of events received. The CountDecimate processor returns every n-th input event and discards the others. Another operation that can be applied to a trace is trimming its beginning: given a trace, the Trim processor returns the trace starting at its n-th input event.
Events can also be discarded from a trace based on a condition. The Filter processor is an n : n−1 processor; the events are let through on its n−1 outputs if the corresponding event of input trace n is the Boolean value true (⊤); otherwise, no output is produced.
BeepBeep also allows users to define their own processors directly as Java objects,
using no more than a few lines of boilerplate code. The simplest way to do so is to extend
the SingleProcessor class, which takes care of most of the "plumbing" related to
event management: connecting inputs and outputs, looking after event queues, etc. All
that is left to do is to define its input and output arity, and to write the actual computation
that should occur, i.e. what output event(s) to produce (if any), given an input event. A
detailed description of this extension mechanism has recently been published [23].
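To give an idea of what this looks like, the following is a hypothetical sketch of a custom 1:1 processor that doubles every incoming number. It should be read as an illustration only: the exact signature of compute() (and the other methods a descendant must provide, such as one for duplicating the processor) varies across BeepBeep versions, so the details below are assumptions rather than verbatim API.

import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of a custom processor; the signature of compute()
// shown here is an assumption about SingleProcessor's API.
public class Doubler extends SingleProcessor {
  public Doubler() {
    super(1, 1); // input arity 1, output arity 1
  }

  @Override
  protected Queue<Object[]> compute(Object[] inputs) {
    Queue<Object[]> out = new ArrayDeque<Object[]>();
    Number n = (Number) inputs[0];                 // one event from the single input
    out.add(new Object[] { n.floatValue() * 2 });  // one event on the single output
    return out;
  }
}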
4 Multi-Threading Patterns
In this section, we now present a set of patterns that can be applied to processors to
make use of multiple threads. Special care has been given to make these multi-threading
capabilities transparent. For the developer, this means that new processors can be created
in a completely sequential manner; multi-threading can be applied at a later stage by
simply wrapping them into special constructs that make use of threads. For the user, this
means that queries are constructed in the same way, whether multi-threading is used or
not: multi-threaded elements have the same interface as regular ones, and both types of
elements can be freely mixed in a single query.
4.1 Thread Management Model
A first feature is the global thread management model. In BeepBeep, the instantiation of
all threads is controlled by one or more instances of the ThreadManager class. Each
thread manager is responsible for the allocation of a pool of threads of bounded size. A
processor wishing to obtain a new thread instance asks permission from its designated
thread manager. If the maximum number of live threads reserved to this manager is not
exceeded, the manager provides a new thread instance to the processor. The processor
can then associate any task to this thread and start it.
So far, this scheme is identical to traditional thread pooling found in various systems,
or, in an equivalent manner, to the Executor pattern found in recent versions of Java. The main difference resides in the actions taken when the manager refuses to provide a new thread to the caller. Typically, the call for a new thread would become blocking until a thread finally becomes available. In BeepBeep, on the contrary, the call simply returns null; this indicates that whatever actions the processor wished to execute in parallel in a
new thread should instead be executed sequentially within the current thread.
In the extreme case, when given a thread manager that systematically refuses all thread creation, the processor's operation is completely sequential; no separate thread instance is ever created. (Note that this is different from a thread manager that would dispense only one thread at a time: in such a case, two threads are active, the calling thread and the one executing the task.) Moreover, various parts of a query can be assigned to different thread manager instances with their own thread pool, giving flexible control over how much parallelism is allowed, and over what fragments of the computation. Hence the amount of threading can be easily modulated, or even completely turned off, quite literally at the flick of a switch, by changing the value of a single parameter. Since thread instances are requested dynamically and are generally short-lived, changes in the threading strategy can also be made during the processing of an event stream.
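The following plain-Java sketch illustrates this policy. It is not BeepBeep's actual ThreadManager class, only a minimal illustration (under assumed names) of the "grant a thread or return null" behaviour and of the sequential fallback on the caller's side.

import java.util.concurrent.atomic.AtomicInteger;

// Minimal illustration of the policy: a bounded pool that either grants a
// new thread for a task or returns null, in which case the caller runs the
// task sequentially in its own thread.
class SimpleThreadManager {
  private final int maxThreads;
  private final AtomicInteger live = new AtomicInteger(0);

  SimpleThreadManager(int maxThreads) {
    this.maxThreads = maxThreads;
  }

  /** Returns a started thread running the task, or null if the cap is reached. */
  Thread tryNewThread(Runnable task) {
    if (live.incrementAndGet() > maxThreads) {
      live.decrementAndGet();
      return null; // no thread available: the caller must run the task itself
    }
    Thread t = new Thread(() -> {
      try {
        task.run();
      } finally {
        live.decrementAndGet(); // frees a slot once the task has finished
      }
    });
    t.start();
    return t;
  }
}

A caller would then write something like Thread t = manager.tryNewThread(task); if (t == null) task.run();, executing the task in place whenever the manager declines to provide a thread.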
The upper cap on the number of threads imposed by each thread manager only applies to running threads. Hence, if a set of 10 tasks T1, ..., T10 is to be started in parallel using a thread manager with a limit of 3, the first three tasks will each be started in a new thread. However, if any of these threads finishes before a thread is requested for T4, then T4 will also be granted a new thread; the same applies for the remaining tasks. Finished threads are kept by the thread manager until their dispose() method is called, indicating they can be safely destroyed.
Note that while this technique can be compared to thread scheduling, it is an arguably
simpler form of scheduling. A thread is either created or not, depending solely on whether
the maximum number of live threads assigned to the manager has been reached at the
moment of the request. The balancing of threads across various parts of a query graph is
achieved by assigning these different parts to a different manager.
4.2 Non-Blocking Push/Pull
In a classical, single-thread computation, a call to any of a processor’s Pullable or
Pushable methods is blocking. For example, calling
pull()
on one of a processor’s
Pushables involves a call to the processor’s input pullables
pull()
in order to retrieve
input events, followed by the computation of the processor’s output event. The original
call returns only when this chain of operations is finished. The same is true of all other
operations.
The Pushable and Pullable interfaces also provide a non-blocking version of the push and pull operations, respectively called pushFast() and pullFast(). These methods perform the same task as their blocking counterparts, but may return control to the caller before the operation is finished. In order to "catch up" with the end of the operation, the caller must, at a later time, call method waitFor() on the same instance. This time, this method blocks until the push or pull operation that was started is indeed finished.
Following the spirit of transparency explained earlier, pushFast() and pullFast() are not required to be non-blocking. As a matter of fact, the default behaviour of these two methods is to act as a proxy to their blocking equivalents; similarly, the default behaviour of waitFor() is to return immediately. Thus, for a processor that does not wish to implement non-blocking operations, calling e.g. pushFast() followed by waitFor() falls back to standard, blocking processing. Only when a processor wishes to implement special non-blocking operations does it need to override these defaults.
Rather than implementing non-blocking operations separately for each type of processor, an easier way consists of enclosing an existing processor within an instance of the NonBlockingProcessor class. When push() or pull() is called on this processor, a new thread is requested from its designated thread manager. The actual call to push() (resp. pull()) on the underlying processor is started in that thread, and control is immediately returned to the caller. Using a simple Boolean semaphore, a call to method waitFor() of the NonBlockingProcessor sleep-loops until that thread stops, indicating that the operation is finished. We remind the reader that the thread manager may not provide a new thread; in such a case, the call to the underlying processor is made within the current thread, and the processor falls back to a blocking mode.
Non-blocking push/pull does not provide increased performance in itself. As a matter of fact, calling pushFast() immediately followed by waitFor() may end up being slower than simply calling push(), due to the overhead of creating a thread and watching its termination. However, it can prove useful in situations where one calls push() or pull() on a processor, performs some other task T, and retrieves the result of that push() (resp. pull()) at a later time. If the call is done in a non-blocking manner, the computation of that operation can be done in parallel with the execution of T in the current thread.
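The following fragment sketches this usage pattern. The names pushFast() and waitFor() are the ones described above; how the Pushable instance is obtained from a processor (here a hypothetical getPushableInput() accessor) and the surrounding object names are assumptions, not verbatim API.

// Sketch: overlap a push operation with some other task T.
Pushable p = phi.getPushableInput(0); // phi: some processor; accessor name assumed
p.pushFast(e);       // returns immediately if a thread could be obtained
doSomeOtherTask();   // task T executes while phi processes e in another thread
p.waitFor();         // blocks until the push started above has completed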
It turns out that a couple of commonly used processors in BeepBeep’s palettes
operate in this fashion, and can hence benefit from the presence of non-blocking push/pull
methods. We describe a few.
Sliding Window Given a processor ϕ and a window width n, the sliding window processor returns the output of a copy of ϕ on events e0e1...en−1, followed by the output of a fresh copy of ϕ on events e1e2...en, and so on. One possible implementation of this mechanism is to keep in memory up to n−1 copies of ϕ, such that copy ϕi has been fed the last i events received. Upon processing a new input event, the window pushes this event to each of the ϕi, and retrieves the output of ϕn−1. This processor copy is then destroyed, the index of the remaining copies is incremented by 1, and a new copy ϕ0 is created.
Figure 2a shows a sequence diagram of these operations when performed in a blocking way. Figure 2b shows the same operations, this time using non-blocking calls. The window processor first calls push() on each copy in rapid-fire succession. Each copy of ϕ can update its state in a separate thread, thus exhibiting parallel processing. The processor then waits for each of these calls to be finished, by calling waitFor() on each copy of ϕ. The remaining operations are then performed identically.
This figure provides graphical evidence that, under the assumption that each call to push() occurs truly in parallel, the total time spent on the window's push() call is shorter than its sequential version. If Tϕ is the time for each call to ϕ and TC is the time taken for the remaining tasks, then the cost of this method call goes down from nTϕ + TC to Tϕ + TC. If fewer than n threads are available, the value is situated somewhere between these two bounds.
Fig. 2: Sequence diagram for the Window processor: (a) using blocking calls to ϕ; (b) using non-blocking calls running in separate threads.
Trace Slicing The Slicer is a 1:1 processor that separates an input trace into different "slices". It takes as input a processor ϕ and a function f : T → U, called the slicing function. There exists potentially one instance of ϕ for each value in the image of f. If T is the domain of the slicing function, and V is the output type of ϕ, the slicer is a processor whose input trace is of type T and whose output trace is of type 2^V.
When an event e is to be consumed, the slicer evaluates c = f(e). This value determines to what instance of ϕ the event will be dispatched. If no instance of ϕ is associated to c, a new copy of ϕ is initialized. Event e is then given to the appropriate instance of ϕ. Finally, the last event output by every instance of ϕ is collected into a set, and that set is the output event corresponding to input event e. The function f may return a special value #, indicating that no new slice must be created, but that the incoming event must be dispatched to all slices. In this latter case, a task similar to the Window processor can be done: each slice is put in a separate thread, so that it can process the input event in parallel with the others.
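The dispatch logic just described can be pictured with the following plain-Java sketch, for the simple case where the slicing function maps every event to a key (the special value # is omitted). It illustrates the principle only and is not the library's Slicer class; all names are hypothetical.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

// Illustration of slicing: one copy of the "phi" computation per value of f.
class SimpleSlicer<T, U, V> {
  private final Function<T, U> f;                    // the slicing function
  private final Supplier<SliceState<T, V>> factory;  // creates a fresh copy of phi
  private final Map<U, SliceState<T, V>> slices = new HashMap<>();

  SimpleSlicer(Function<T, U> f, Supplier<SliceState<T, V>> factory) {
    this.f = f;
    this.factory = factory;
  }

  Set<V> push(T event) {
    U key = f.apply(event);                                       // c = f(e)
    slices.computeIfAbsent(key, k -> factory.get()).push(event);  // dispatch to the slice
    Set<V> out = new HashSet<>();
    for (SliceState<T, V> s : slices.values()) {
      out.add(s.lastOutput());                                    // collect last output of each slice
    }
    return out;
  }
}

interface SliceState<T, V> {
  void push(T event);
  V lastOutput();
}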
Logical Connectives BeepBeep comes with a palette providing processors for evaluating all operators of Linear Temporal Logic (LTL), in addition to the first-order quantification defined in LTL-FO+ (and present in previous versions of BeepBeep) [21]. Boolean processors are called Globally, Eventually, Until, Next, ForAll and Exists, and carry their usual meaning. For example, if a0a1a2... is an input trace, the processor Globally produces an output trace b0b1b2... such that bi = ⊥ if and only if there exists j ≥ i such that aj = ⊥. In other words, the i-th output event is the two-valued verdict of evaluating G ϕ on the input trace, starting at the i-th event. (Another set of processors, called the "Trooleans", mirror each of these operators but use a three-valued semantics; the reader is referred to the BeepBeep 3 tutorial for more details [20].)
Concretely, this is implemented by creating one copy of the ϕ processor upon each new input event. This event is then pushed into all the current copies, and their resulting output event (if any) is collected. If any of them is ⊥ (false), the processor returns ⊥; if any of them is ⊤ (true), the corresponding copy of ϕ is destroyed. This is another example where the processing of multiple copies of ϕ can be done in a separate thread, in a way similar to the principles described earlier. The same can be done with first-order quantifiers; hence for the expression ∀x ∈ π : ϕ(x), the evaluation of ϕ for each value of x can be done in parallel.
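As an illustration of this last point, the following sketch shows how the values of a quantified variable could be checked in parallel using Java's parallel streams. This is only an illustration of the idea; it is not how BeepBeep's ForAll and Exists processors are actually implemented, and the class and method names are hypothetical.

import java.util.List;
import java.util.function.Predicate;

// Sketch: evaluate "for all x in pi : phi(x)" by testing phi on every value
// of x in parallel; anyMatch() would play the same role for an existential
// quantifier.
final class QuantifierSketch {
  static <T> boolean forAll(List<T> pi, Predicate<T> phi) {
    return pi.parallelStream().allMatch(phi);
  }
}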
4.3 Pre-emptive Pulling
A second strategy consists of continuously pulling for new outputs on a processor P, and storing these events in a queue. When a downstream processor P′ calls P's pull() method, the next event is simply popped from that queue, rather than being computed on-the-fly. If P′ is running in a different thread from the process that polls P, each can compute a new event at the same time.
Figure 3a shows the processing of an event when done in a sequential manner. A call to pull() on ψ results in a pull on ϕ, which produces some output event e. This event is then processed by ψ, which produces some other output e′. If Tϕ and Tψ correspond to the computation times of ϕ and ψ, respectively, then the total time to fetch each event from ψ is their sum, Tϕ + Tψ.
Fig. 3: Sequence diagram for pre-emptive pulling: (a) no pre-emptive pulling; (b) W performs pre-emptive pulling on ϕ in a separate thread.
On the contrary, Figure 3b shows the same process, with pre-emptive pulling on ϕ occurring in a separate thread. One can see that in this case, ϕ produces a new output event while ψ is busy doing its computation on the previous one. The first output event still takes Tϕ + Tψ to be produced, but later ones can be retrieved in max{Tϕ, Tψ}.
In a manner similar to the NonBlockingProcessor, pre-emptive pulling is enabled by enclosing a group of processors inside a PullGroup. This processor behaves like a GroupProcessor: a set of connected processors can be added to the group, and this group can then be manipulated, piped and copied as if it were a single "black box". The difference is that a call to the start() method of a PullGroup creates a new thread where the repeated polling of its outputs occurs. To avoid needlessly producing too many events that are not retrieved by downstream calls to pull(), the polling stops when the queue reaches a predefined size; polling resumes when some events of that queue are finally pulled. As with the other constructs presented in this paper, the PullGroup takes into account the possibility that no thread is available; in such a case, output events are computed only upon request, like the regular GroupProcessor.
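The mechanism can be illustrated with the following plain-Java sketch, where a bounded blocking queue plays the role of the output buffer: a poller thread keeps pulling events from an upstream computation, and the downstream consumer simply takes them from the queue. This is an illustration of the principle only, not the actual PullGroup implementation.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Illustration of pre-emptive pulling: events from "upstream" are computed
// ahead of time in a separate thread and stored in a bounded queue.
class PreemptivePuller<E> {
  private final BlockingQueue<E> buffer;
  private final Thread poller;

  PreemptivePuller(Supplier<E> upstream, int capacity) {
    this.buffer = new ArrayBlockingQueue<>(capacity);
    this.poller = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          buffer.put(upstream.get()); // blocks when the queue is full
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
  }

  void start() {
    poller.start();
  }

  E pull() throws InterruptedException {
    return buffer.take(); // pops a pre-computed event, or waits for one
  }
}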
In our analysis of computing time, we can see that the speed gain is maximized when Tϕ = Tψ; otherwise, either ϕ produces events faster than ψ can consume them (Tϕ < Tψ), or ψ wastes time waiting for the output of ϕ (Tϕ > Tψ). Therefore, an important part of using this strategy involves breaking a processor graph into connected regions of roughly equal computing load.
4.4 Pipelining
Pipelining is the process of reading n input events e1e2...en, creating n copies of a given processor, and launching each of them on one of the input events. A pipeline then waits until the processor assigned to e1 produces output events; these events are made available at the output of the pipeline as they are collected. Once the e1 processor has no more output events to produce, it is discarded, the collection resumes on the processor for e2, and so on.
Note that, once the e1 processor is discarded, there is now room for creating a new processor copy and assigning it to the next input event, en+1. This rolling process goes on until no input event is available. In such a scheme, the order of the output events is preserved: as in sequential processing, the batch of output events produced by reading event e1 comes before any output event resulting from processing e2.
Fig. 4: Sequence diagram for pipelining: (a) no pipelining: ϕ requests events from S on demand and computes its output event afterwards; (b) ϕ pulls multiple events from S and evaluates each of them on a copy of ϕ in a separate thread.
Although pipelining borrows features from both pre-emptive pulling and non-blocking pull, it is actually distinct from these two techniques. As in non-blocking pull, it sends input events in parallel to multiple copies of a processor; however, rather than sending the same event to multiple, independent instances of ϕ, it sends events that should be processed in sequence by a single processor instance each to a distinct copy of ϕ and collects their result. In the sequence e1e2..., this means that one copy of ϕ processes the subtrace e1, while another one processes (in parallel) the subtrace e2, and so on.
Obviously, this “trick” does not guarantee the correct result on all processors, as
some of them have an output that depends on the complete trace. As a simple example,
one can consider the processor that computes the cumulative sum of all input events;
it is clear that pipelining this processor will return an incorrect result, as each copy of
the processor will receive (and hence add) a unique event of the input trace. However,
there do exist processors for which pipelining can be applied; this is the case of all
FunctionProcessors, which by definition apply a stateless function to each of their
input events. While this might seem limiting, it turns out that, in the sample of queries
evaluated experimentally later in this paper, a large part of the computing load comes
from the application of a few such functions, and that pipelining proves very effective in
these situations.
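For the stateless case, the core idea can be illustrated with the following plain-Java sketch: every input event is handed to its own task applying the same function, and the results are collected in submission order, which preserves the ordering of the output. This only illustrates the principle; it is not BeepBeep's pipelining code.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Pipelining a stateless function: events are processed in parallel,
// but outputs are collected in the same order as the inputs.
final class PipelineSketch {
  static <I, O> List<O> pipeline(List<I> events, Function<I, O> f, int nThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    List<Future<O>> futures = new ArrayList<>();
    for (I e : events) {
      futures.add(pool.submit(() -> f.apply(e)));  // one task per input event
    }
    List<O> outputs = new ArrayList<>();
    for (Future<O> fut : futures) {
      outputs.add(fut.get());  // waiting in submission order preserves ordering
    }
    pool.shutdown();
    return outputs;
  }
}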
We also remark that the process could be generalized in at least two ways. First,
a stateful processor may become stateless after reading a trace prefix; for example, a
processor that computes the sum of input events until it reaches 10, or a Moore machine
that reaches a sink state. Therefore, an “adaptive” pipeline could query a processor for
such a condition, and start pipelining once the processor becomes stateless. Second,
some processors may be stateful, but only with respect to a bounded number n of past events. If this bound is known, pipelining can be applied by sending to each copy of the processor the window of the last n events; the pipeline presented here is the simple case where n = 1.
5 Implementation and Discussion
All the concepts described in the previous section have been implemented in the BeepBeep
3 event processing engine and are available in its latest downloadable version. In this
section, we discuss some of the advantages of this architecture and present experimental
results indicating the potential speedup it can bring.
5.1 Experiments
In order to showcase the usage (and potential advantages) of using multiple threads according to our proposed architecture, we set up a set of experiments where a modified version of BeepBeep is run on a set of example queries and input traces. The queries and input traces are taken from the Offline track of the 2016 Competition on Runtime Verification [36], in which the single-thread version of BeepBeep 3 had participated. For each property and each trace, BeepBeep is run in two configurations: (1) the original query, without any thread-aware processors; (2) the same query, modified with thread-aware processors inserted in the "best" way possible. These modifications were done with the help of intuition and manual trial and error, so they may not represent the absolute best way of using threads. The queries use a single thread manager, whose number of threads is set to be equal to the host machine's number of cores. No other modifications to the query have been made.
The experiment measures two elements. The first is throughput, measured in Hz, and
which corresponds to the average number of input events consumed per second. The
main goal of multi-threading is to achieve faster computation; it is therefore natural to
get a hint on the amount of speed-up one can get by involving more than one processing
thread. The second is CPU load, measured in percentage. For a system with n cores, let f_i(t) be a function giving the instantaneous load (between 0 and 1) of core i at time t. If T_S and T_E are the start and end time of the execution of a query, then the load resulting from the execution of that query, noted λ, is defined as:

\[
\lambda \triangleq \frac{1}{n(T_E - T_S)} \sum_{i=1}^{n} \int_{T_S}^{T_E} f_i(t)\,dt
\]
Intuitively, the load represents how much of the available cores was used during the
execution of the query. For example, a query resulting in a constant usage of 50% on two
cores of a 4-core machine and 0% on the remaining two would have a load of 0.25. Note
that load is a value between 0 and 1 that is not affected by the duration of the computation.
In the experiments, instantaneous load is approximated as a 1-second wide rectangle,
whose height corresponds to the CPU usage as reported by the SIGAR API (https://support.hyperic.com/display/SIGAR/Home). Note that
this usage includes that of all applications running on the host machine. Extraneous
activity was minimized by having the machine boot into runlevel 3 (command-line
without X server) with a limited number of running services and daemons.
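As an illustration, the following sketch computes this approximation of λ from per-core usage samples taken once per second; the sampling itself (done through the SIGAR API in the experiments) is not shown, and the method name is hypothetical.

// Approximate the load λ from 1-second samples: samples[i][t] is the usage
// (between 0 and 1) of core i during the t-th second of the run.
final class LoadSketch {
  static double load(double[][] samples) {
    int n = samples.length;           // number of cores
    int duration = samples[0].length; // T_E - T_S, in seconds
    double sum = 0;
    for (int i = 0; i < n; i++) {
      for (int t = 0; t < duration; t++) {
        sum += samples[i][t];         // 1-second rectangle approximating the integral
      }
    }
    return sum / (n * duration);      // 1/(n(T_E - T_S)) * sum_i of the integral of f_i
  }
}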
As is now customary in LIF research projects, all experiments and data are available
for download as a self-contained application that can be launched and controlled from a
web interface. This bundle has been created using the LabPal testing library (https://liflab.github.io/labpal), and allows anyone to easily re-run the same queries on the same input data. The actual lab is available at https://datahub.io/dataset/beepbeep-mt-lab.
The results of these experiments are summarized in Table 1. One of the queries
(SQL Injection) is implemented in BeepBeep as a single finite-state machine that simply
updates the contents of a set upon each incoming event. None of the patterns introduced
in this paper can be applied to this processor, and so no experiment was conducted on it.
(We remind the reader that our proposed solution does not claim to be universal.)
As one can see, the use of multi-threading has increased the throughput and load
of every remaining problem instance. CPU load has increased by a factor of 1.7, and
throughput by a factor of 2.0. In some cases, such as the Auction property, the impact of
parallelism is dramatic. This is due to the fact that this property evaluates a first-order
logic formula with multiple nested quantifiers, such as
∀a ∈ A : ∀b ∈ B : ∀c ∈ C : ϕ. When the cardinality of sets B and C is large, evaluating ∀b ∈ B : ∀c ∈ C : ϕ for each value a ∈ A can efficiently be done in parallel.
5.2 Discussion
We shall now discuss the pros and cons of the multi-threading architecture proposed in
this paper.
Query                        Threading strategy   Throughput (Hz)   Load
Auction bidding              Multi-threaded       30,487.8          0.345
                             None                 28,449.5          0.601
Candidate selection          Multi-threaded          127.9          0.906
                             None                     28.8          0.315
Endless bashing              Multi-threaded          742.5          0.593
                             None                    511.8          0.374
Spontaneous Pingu creation   Multi-threaded          782.4          0.467
                             None                    742.1          0.308
Turn around                  Multi-threaded           30.9          0.772
                             None                     16.6          0.353
Table 1: Throughput and load for each query, for the single-thread and multi-thread versions of BeepBeep.
Simplicity A first advantage of the proposed architecture is its high simplicity. Using
either of the multi-threading strategies described earlier can generally be done using a
handful of instructions and very shallow modifications to an existing query.
For example, the piece of code in Figure 5 shows how to enable non-blocking
push/pull within a window processor. The regular, mono-threaded version would include
only lines 2 and 4: one creates a cumulative sum processor, which is then given to the
Window processor with a specific window width (10 in this case). The multi-threaded
version adds lines 1 and 3. The first instruction creates a new thread manager and gives it
an upper limit of four threads (one could also reuse an existing manager instance instead
of creating a new one). The third instruction wraps a non-blocking processor around the
original sum processor; it is this processor that is given to the window instead of the
original. The end result is that the calls to sum's push() and pull() will be split across four threads, and that w will operate like in Figure 2b.
1 ThreadManager manager = new ThreadManager(4);
2 FunctionProcessor sum = new FunctionProcessor(new CumulativeFunction(Addition.instance));
3 NonBlockingProcessor nbp = new NonBlockingProcessor(sum, manager);
4 Window w = new Window(nbp, 10);
Fig. 5: Enabling non-blocking calls inside a window processor for parallel processing.
Note how adding multi-threading to this example involves very minimal modifications
to the original query; it merely amounts to wrapping a processor instance around another
one. Apart from the way it handles push/pull internally, that processor behaves exactly like
the original. Thanks to the blocking fallback semantics of pullFast() and pushFast(),
the window processor that uses it does not even need to be aware that it is handed a
multi-threaded processor.
Separation of Concerns This observation brings us to the second important benefit of
this architecture: multi-threaded code remains very localized. For example, the
Window
processor does not contain any special code to handle multi-threading; the same can be
said of all other processor objects. Hence a user is not required to take multi-threading
into account when writing a new, custom processor. Parallelism occurs only by virtue of
enclosing some processor instances inside thread-aware wrappers.
A telling indicator of this separation of concerns can be obtained from analyzing
BeepBeep’s code: outside the
concurrency
package that provides the objects introduced
in this paper (the thread manager and a handful of processor wrappers, which amount
to roughly 2,000 lines of code), no reference to threads is ever made across the rest of
BeepBeep’s code, including all its palettes (18,000 lines). As a matter of fact, BeepBeep
can even be compiled with this package deleted without affecting any of its functionalities.
Manual Definition of Parallel Regions On the other hand, the current implementation of multi-threading in BeepBeep requires the user to explicitly define the regions of a query
graph that are to be executed using multiple threads. This requires some knowledge of
the underlying computations that are being executed in various parts of the query, and
some intuition as to which of them could best benefit from the availability of more than
one thread. Doing so improperly can actually turn out to be detrimental to the overall
throughput of the system.
Therefore, the architecture proposed in this paper should be taken as a first step. It
breaks new ground by providing a simple way to add multi-threading to the evaluation of
an arbitrary query; however, in the medium term, higher-level techniques for selecting the
best regions of a query suitable for parallelism should be developed. It shall be noted that
this issue is not specific to BeepBeep, and that the automated fine tuning of parallelism
in query evaluation is a long-standing research problem [8,16,29,40]. As an anecdotal proof, Oracle devotes a whole chapter of its documentation to intricate tuning tips for the evaluation of SQL queries (http://docs.oracle.com/cd/A84870_01/doc/server.816/a76994/tuningpe.htm). Readers should not expect that BeepBeep, or any other monitoring system, could transparently (one might say magically) set up threads in the best way for all possible queries.
6 Conclusion
In this paper, we have introduced a few simple patterns for the introduction of parallelism
in the evaluation of queries over streams of events. These patterns can be applied in
systems whose computation is expressed as the composition of small units of processing
arranged into a graph-like structure, as is the case in the BeepBeep 3 event stream
engine. By surrounding appropriate regions of such a graph with special-purpose thread
managers, parts of a query can be evaluated using multiple cores of a CPU, therefore
harnessing more of a machine’s computing power.
Thanks to the simplicity of these patterns and to BeepBeep’s modular design,
the introduction of parallelism to an existing query requires very limited changes,
which amount to the insertion of two or three lines of code at most. To the best of
our knowledge, this makes BeepBeep the first runtime verification tool with such
parallelization capabilities. However, as we have seen, not all queries are amenable to
parallelization, at least not using the patterns introduced in this paper. Therefore, the
introduction of multi-threading should also be complemented with advances on other
fronts; for example, recent works have shown how a clever use of internal data structures
can also provide important speed gains [12].
Nevertheless, experiments on the properties of the 2016 Competition on Runtime
Verification have shown promising results. As we have observed, a majority of the tested
properties benefit from a speed boost of 2× and more through the careful application of the
patterns presented in this paper. These encouraging numbers warrant further development
of this line of research in multiple directions. First, modifications to the existing patterns
could be developed to create longer-lived threads and reduce the overhead incurred by
their creation and destruction. Second, one could relax the way in which properties are
evaluated, and tolerate that finite prefixes of an output trace be different from the exact
result. This notion of “eventual consistency” could allow multiple slices of an input to be
evaluated without the need for lock-step synchronization. Finally, special care should be
given to the way queries are expressed; when several equivalent representations of the
same computation exist, one should favor those that lend themselves to parallelization.
References
1. Esper, http://espertech.com
2.
Abadi, D.J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.H., Lindner,
W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.B.: The design of the
Borealis stream processing engine. In: CIDR. pp. 277–289 (2005)
3.
Apache Software Foundation: Samza,
http://samza.incubator.apache.org
, retrieved
December 1st, 2016
4.
Apache Software Foundation: Storm incubation status,
http://incubator.apache.org/
projects/storm.html, retrieved December 1st, 2016
5.
Apers, P.M.G., van den Berg, C.A., Flokstra, J., Grefen, P.W.P.J., Kersten, M.L., Wilschut,
A.N.: PRISMA/DB: A parallel main memory relational DBMS. IEEE Trans. Knowl. Data
Eng. 4(6), 541–554 (1992), http://dx.doi.org/10.1109/69.180605
6.
Babcock, B., Babu, S., Datar, M., Motwani, R.: Chain : Operator scheduling for memory
minimization in data stream systems. In: Halevy, A.Y., Ives, Z.G., Doan, A. (eds.) Proceedings
of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego,
California, USA, June 9-12, 2003. pp. 253–264. ACM (2003),
http://doi.acm.org/10.
1145/872757.872789
7.
Basin, D.A., Caronni, G., Ereth, S., Harvan, M., Klaedtke, F., Mantel, H.: Scalable offline
monitoring of temporal specifications. Formal Methods in System Design 49(1-2), 75–108
(2016), http://dx.doi.org/10.1007/s10703-016- 0242-y
8.
Belknap, P., Dageville, B., Dias, K., Yagoub, K.: Self-tuning for SQL performance in
oracle database 11g. In: Ioannidis, Y.E., Lee, D.L., Ng, R.T. (eds.) Proceedings of the 25th
International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009,
Shanghai, China. pp. 1694–1700. IEEE Computer Society (2009),
http://dx.doi.org/10.
1109/ICDE.2009.165
9.
Berkovich, S., Bonakdarpour, B., Fischmeister, S.: Runtime verification with minimal intrusion
through parallelism. Formal Methods in System Design 46(3), 317–348 (2015),
http:
//dx.doi.org/10.1007/s10703-015- 0226-3
10.
Brenna, L., Demers, A.J., Gehrke, J., Hong, M., Ossher, J., Panda, B., Riedewald, M., Thatte,
M., White, W.M.: Cayuga: a high-performance event processing engine. In: Chan, C.Y.,
Ooi, B.C., Zhou, A. (eds.) Proceedings of the ACM SIGMOD International Conference
on Management of Data, Beijing, China, June 12-14, 2007. pp. 1100–1102. ACM (2007),
http://doi.acm.org/10.1145/1247480.1247620
11.
Cammert, M., Heinz, C., Krämer, J., Markowetz, A., Seeger, B.: Pipes: A multi-threaded
publish-subscribe architecture for continuous queries over streaming data sources. Tech. rep.
(2003)
12.
Decker, N., Harder, J., Scheffel, T., Schmitz, M., Thoma, D.: Runtime monitoring with union-
find structures. In: Chechik, M., Raskin, J. (eds.) Tools and Algorithms for the Construction
and Analysis of Systems - 22nd International Conference, TACAS 2016, Held as Part of the
European Joint Conferences on Theory and Practice of Software, ETAPS 2016, Eindhoven,
The Netherlands, April 2-8, 2016, Proceedings. Lecture Notes in Computer Science, vol. 9636,
pp. 868–884. Springer (2016), http://dx.doi.org/10.1007/978-3-662-49674- 9_54
13.
Emerson, E.A.: Temporal and modal logic. Handbook of Theoretical Computer Science,
Volume B: Formal Models and Sematics (B) 995(1072), 5 (1990)
14.
Falcone, Y., Sánchez, C. (eds.): Runtime Verification - 16th International Conference, RV 2016,
Madrid, Spain, September 23-30, 2016, Proceedings, Lecture Notes in Computer Science, vol.
10012. Springer (2016), http://dx.doi.org/10.1007/978-3-319-46982- 9
15.
Ganguly, S., Hasan, W., Krishnamurthy, R.: Query optimization for parallel execution. In:
Stonebraker, M. (ed.) Proceedings of the 1992 ACM SIGMOD International Conference on
Management of Data, San Diego, California, June 2-5, 1992. pp. 9–18. ACM Press (1992),
http://doi.acm.org/10.1145/130283.130291
16.
Gedik, B., Schneider, S., Hirzel, M., Wu, K.: Elastic scaling for data stream processing. IEEE
Trans. Parallel Distrib. Syst. 25(6), 1447–1463 (2014),
http://dx.doi.org/10.1109/
TPDS.2013.295
17.
Graefe, G.: Parallel query execution algorithms. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia
of Database Systems, pp. 2030–2035. Springer US (2009),
http://dx.doi.org/10.1007/
978-0- 387-39940- 9_1078
18.
Ha, J., Arnold, M., Blackburn, S.M., McKinley, K.S.: A concurrent dynamic analysis framework
for multicore hardware. In: ACM SIGPLAN Notices. vol. 44, pp. 155–174. ACM (2009)
19.
Hallé, S., Soucy-Boivin, M.: MapReduce for parallel trace validation of LTL properties. J.
Cloud Comp. 4 (2015)
20. Hallé, S.: When RV meets CEP. In: Falcone and Sánchez [14], pp. 68–91, http://dx.doi.
org/10.1007/978-3- 319-46982- 9_6
21.
Hallé, S., Villemaire, R.: Runtime enforcement of web service message contracts with data.
IEEE Trans. Services Computing 5(2), 192–206 (2012),
http://dx.doi.org/10.1109/
TSC.2011.10
22.
Harrow, J.J.: Runtime checking of multithreaded applications with visual threads. In: Havelund,
K., Penix, J., Visser, W. (eds.) SPIN Model Checking and Software Verification, 7th In-
ternational SPIN Workshop, Stanford, CA, USA, August 30 - September 1, 2000, Pro-
ceedings. Lecture Notes in Computer Science, vol. 1885, pp. 331–342. Springer (2000),
http://dx.doi.org/10.1007/10722468_20
23.
Khoury, R., Gaboury, S., Hallé, S.: A glue language for event stream processing. In: Lin,
J.J., Pei, J., Hu, X., Chang, W., Nambiar, R., Aggarwal, C., Cercone, N., Honavar, V., Huan,
J., Mobasher, B., Pyne, S. (eds.) Workshop Proceedings of the 2016 IEEE International
Conference on Big Data, Big Data 2016, Washington, DC, USA, December 5-9, 2016. IEEE
(2016)
24.
Krikellas, K., Viglas, S., Cintra, M.: Modeling multithreaded query execution on
chip multiprocessors. In: Bordawekar, R., Lang, C.A. (eds.) International Work-
shop on Accelerating Data Management Systems Using Modern Processor and
Storage Architectures - ADMS 2010, Singapore, September 13, 2010. pp. 22–
33 (2010),
http://www.vldb.org/archives/workshop/2010/proceedings/files/
vldb_2010_workshop/ADMS_2010/adms10-krikellas.pdf
25.
Kuhtz, L., Finkbeiner, B.: Efficient parallel path checking for linear-time temporal logic with
past and bounds. Logical Methods in Computer Science 8(4) (2012),
http://dx.doi.org/
10.2168/LMCS-8(4:10)2012
26.
Li, W., Kavi, K.M., Naz, A., Sweany, P.H.: Speculative thread execution in a multithreaded
dataflow architecture. In: Peterson, G.D. (ed.) Proceedings of the ISCA 19th International
Conference on Parallel and Distributed Computing Systems, September 20-11, 2006, San
Francisco, California, USA. pp. 102–107. ISCA (2006)
27.
Luo, Q., Rosu, G.: EnforceMOP: a runtime property enforcement system for multithreaded
programs. In: Pezzè, M., Harman, M. (eds.) International Symposium on Software Testing
and Analysis, ISSTA ’13, Lugano, Switzerland, July 15-20, 2013. pp. 156–166. ACM (2013),
http://doi.acm.org/10.1145/2483760.2483766
28.
Madden, S., Franklin, M.J.: Fjording the stream: An architecture for queries over streaming
sensor data. In: Agrawal, R., Dittrich, K.R. (eds.) Proceedings of the 18th International
Conference on Data Engineering, San Jose, CA, USA, February 26 - March 1, 2002. pp. 555–
566. IEEE Computer Society (2002),
http://dx.doi.org/10.1109/ICDE.2002.994774
29.
Mühlbauer, T., Rödiger, W., Seilbeck, R., Kemper, A., Neumann, T.: Heterogeneity-conscious
parallel query execution: getting a better mileage while driving faster! In: Kemper, A.,
Pandis, I. (eds.) Tenth International Workshop on Data Management on New Hardware,
DaMoN 2014, Snowbird, UT, USA, June 23, 2014. pp. 2:1–2:10. ACM (2014),
http:
//doi.acm.org/10.1145/2619228.2619230
30.
Nazarpour, H., Falcone, Y., Bensalem, S., Bozga, M., Combaz, J.: Monitoring multi-threaded
component-based systems. In: Ábrahám, E., Huisman, M. (eds.) Integrated Formal Methods -
12th International Conference, IFM 2016, Reykjavik, Iceland, June 1-5, 2016, Proceedings.
Lecture Notes in Computer Science, vol. 9681, pp. 141–159. Springer (2016),
http://dx.
doi.org/10.1007/978-3- 319-33693- 0_10
31.
Neumeyer, L., Robbins, B., Nair, A., Kesari, A.: S4: distributed stream computing platform.
In: Fan, W., Hsu, W., Webb, G.I., Liu, B., Zhang, C., Gunopulos, D., Wu, X. (eds.) ICDMW
2010, The 10th IEEE International Conference on Data Mining Workshops, Sydney, Australia,
13 December 2010. pp. 170–177. IEEE Computer Society (2010),
http://dx.doi.org/10.
1109/ICDMW.2010.172
32.
Paes, M., Lima, A.A.B., Valduriez, P., Mattoso, M.: High-performance query processing of
a real-world OLAP database with pargres. In: Palma, J.M.L.M., Amestoy, P., Daydé, M.J.,
Mattoso, M., Lopes, J.C. (eds.) High Performance Computing for Computational Science -
VECPAR 2008, 8th International Conference, Toulouse, France, June 24-27, 2008. Revised
Selected Papers. Lecture Notes in Computer Science, vol. 5336, pp. 188–200. Springer (2008),
http://dx.doi.org/10.1007/978-3- 540-92859- 1_18
33.
Pellizzoni, R., Meredith, P., Caccamo, M., Rosu, G.: Hardware runtime monitoring for
dependable cots-based real-time embedded systems. In: Real-Time Systems Symposium, 2008.
pp. 481–491. IEEE (2008)
34. Qadeer, S., Tasiran, S.: Runtime verification of concurrency-specific correctness criteria. STTT 14(3), 291–305 (2012), http://dx.doi.org/10.1007/s10009-011-0210-1
35. Reger, G., Cruz, H.C., Rydeheard, D.E.: MarQ: Monitoring at runtime with QEA. In: Baier, C., Tinelli, C. (eds.) Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, Held as Part of ETAPS 2015. Proceedings. Lecture Notes in Computer Science, vol. 9035, pp. 596–610. Springer (2015), http://dx.doi.org/10.1007/978-3-662-46681-0
36. Reger, G., Hallé, S., Falcone, Y.: Third international competition on runtime verification - CRV 2016. In: Falcone and Sánchez [14], pp. 21–37, http://dx.doi.org/10.1007/978-3-319-46982-9_3
37. Savage, S., Burrows, M., Nelson, G., Sobalvarro, P., Anderson, T.E.: Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst. 15(4), 391–411 (1997), http://doi.acm.org/10.1145/265924.265927
38. Sen, K., Rosu, G., Agha, G.: Runtime safety analysis of multithreaded programs. In: Paakki, J., Inverardi, P. (eds.) Proceedings of the 11th ACM SIGSOFT Symposium on Foundations of Software Engineering 2003 held jointly with 9th European Software Engineering Conference, ESEC/FSE 2003, Helsinki, Finland, September 1-5, 2003. pp. 337–346. ACM (2003), http://doi.acm.org/10.1145/940071.940116
39. Suhothayan, S., Gajasinghe, K., Loku Narangoda, I., Chaturanga, S., Perera, S., Nanayakkara, V.: Siddhi: A second look at complex event processing architectures. In: Proceedings of the 2011 ACM Workshop on Gateway Computing Environments. pp. 43–50. GCE ’11, ACM, New York, NY, USA (2011), http://doi.acm.org/10.1145/2110486.2110493
40. Viglas, S.: A comparative study of implementation techniques for query processing in multicore systems. IEEE Trans. Knowl. Data Eng. 26(1), 3–15 (2014), http://dx.doi.org/10.1109/TKDE.2012.243