Article

Abstract

Recognising patterns that correlate multiple events over time is becoming increasingly important in applications that exploit the Internet of Things, ranging from urban transportation through surveillance monitoring to business workflows. In many real-world scenarios, however, timestamps of events may be erroneously recorded, and events may be dropped from a stream due to network failures or load shedding policies. In this work, we present SimpMatch, a novel simplex-based algorithm for probabilistic evaluation of event queries using constraints over event orderings in a stream. Our approach avoids learning probability distributions for time-points or occurrence intervals. Instead, we employ the abstraction of segmented intervals and compute the probability of a sequence of such segments using the notion of order statistics. The algorithm runs in time linear in the number of lost events and shows high accuracy, yielding exact results if event generation is based on a Poisson process and a good approximation otherwise. We demonstrate empirically that SimpMatch enables efficient and effective reasoning over event streams, outperforming state-of-the-art methods for probabilistic evaluation of event queries by up to two orders of magnitude.
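The order-statistics argument behind the Poisson claim can be made explicit. For a homogeneous Poisson process it is a standard fact that, conditional on the number of arrivals in a window, the arrival times are distributed as uniform order statistics; the sketch below states this in our own notation, not the paper's:

\[
  \Pr\bigl[t_{(1)} \in I_1, \dots, t_{(n)} \in I_n \mid N(T) = n\bigr]
  \;=\; \frac{n!}{T^{n}} \int_{I_1} \!\cdots\! \int_{I_n}
  \mathbf{1}\{u_1 < u_2 < \cdots < u_n\} \; du_n \cdots du_1 ,
\]

where \(t_{(1)} < \cdots < t_{(n)}\) are the ordered occurrence times of the \(n\) unobserved events in \([0, T]\) and \(I_1, \dots, I_n\) are the segments the query constrains them to. The right-hand side is the normalized volume of a sub-region of the \(n\)-dimensional simplex, which is why a simplex-based computation can be exact in the Poisson case.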


... More recently, Van der Aa et al. [1] illustrated a method for inferring a linear extension, i.e., a compliant total order, of events in partially ordered traces, based on examples of correct orderings extracted from other traces in the log. Busany et al. [4] estimated probabilities for partially ordered events in IoT event streams. ...
Conference Paper
Process mining is a scientific discipline that analyzes event data, often collected in databases called event logs. Recently, uncertain event logs have become of interest; these contain non-deterministic and stochastic event attributes that may represent many possible real-life scenarios. In this paper, we present a method to reliably estimate the probability of each such scenario, allowing them to be analyzed. Experiments show that the probabilities calculated with our method closely match the true chances of occurrence of specific outcomes, enabling more trustworthy analyses on uncertain data.
... For this reason, various survey papers on IoT and big data for multiple applications have been presented; see Table 1. A number of investigative papers related to different aspects of IoT data have been published to date, covering various definitions of IoT, core technologies, architectures, and different IoT applications, for example [22][23][24][25]. • Identify and evaluate the main data indexing techniques in the IoT system. ...
Article
The past decade has been characterized by the growing volumes of data due to the widespread use of the Internet of Things (IoT) applications, which introduced many challenges for efficient data storage and management. Thus, the efficient indexing and searching of large data collections is a very topical and urgent issue. Such solutions can provide users with valuable information about IoT data. However, efficient retrieval and management of such information in terms of index size and search time require optimization of indexing schemes which is rather difficult to implement. The purpose of this paper is to examine and review existing indexing techniques for large-scale data. A taxonomy of indexing techniques is proposed to enable researchers to understand and select the techniques that will serve as a basis for designing a new indexing scheme. The real-world applications of the existing indexing techniques in different areas, such as health, business, scientific experiments, and social networks, are presented. Open problems and research challenges, e.g., privacy and large-scale data mining, are also discussed.
... Event logs may be derived from sensed data as recorded by real-time locating systems (RTLS). Then, the construction of discrete events from raw signals is inherently uncertain and grounded in probabilistic inference [19,20]. For instance, deriving treatment events in a hospital based on RTLS positions of patients and staff members does not yield fully accurate traces [21]. ...
Article
While supporting the execution of business processes, information systems record event logs. Conformance checking relies on these logs to analyze whether the recorded behavior of a process conforms to the behavior of a normative specification. A key assumption of existing conformance checking techniques, however, is that all events are associated with timestamps that allow inferring a total order of events per process instance. Unfortunately, this assumption is often violated in practice. Due to synchronization issues, manual event recordings, or data corruption, events are only partially ordered. In this paper, we put forward the problem of partial order resolution of event logs to close this gap. It refers to the construction of a probability distribution over all possible total orders of events of an instance. To cope with the order uncertainty in real-world data, we present several estimators for this task, incorporating different notions of behavioral abstraction. Moreover, to reduce the runtime of conformance checking based on partial order resolution, we introduce an approximation method that comes with a bounded error in terms of accuracy. Our experiments with real-world and synthetic data reveal that our approach improves accuracy over the state-of-the-art considerably.
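To make partial order resolution concrete, here is a minimal Python sketch of the brute-force baseline: enumerate every linear extension of a partially ordered trace and assign them equal probability. The function names are ours; the paper's estimators replace the uniform weighting with behavioral information from the log and avoid exhaustive enumeration.

from itertools import permutations

def linear_extensions(events, precedes):
    """Yield every total order of `events` that is consistent with the
    partial order given as a set of (earlier, later) pairs."""
    for perm in permutations(events):
        pos = {e: i for i, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in precedes):
            yield perm

def uniform_resolution(events, precedes):
    """Baseline estimator: a uniform distribution over all compliant total
    orders (exponential in the trace length; illustration only)."""
    exts = list(linear_extensions(events, precedes))
    return {ext: 1.0 / len(exts) for ext in exts}

# Three events whose timestamps tie, with one known ordering constraint:
print(uniform_resolution(("a", "b", "c"), {("a", "b")}))
# three compliant orders, each with probability 1/3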
Article
Complex Event Recognition applications exhibit various types of uncertainty, ranging from incomplete and erroneous data streams to imperfect complex event patterns. We review Complex Event Recognition techniques that handle, to some extent, uncertainty. We examine techniques based on automata, probabilistic graphical models and first-order logic, which are the most common ones, and approaches based on Petri Nets and Grammars, which are less frequently used. A number of limitations are identified with respect to the employed languages, their probabilistic models and their performance, as compared to the purely deterministic cases. Based on those limitations, we highlight promising directions for future work.
Article
We present a system for online monitoring of maritime activity over streaming positions from numerous vessels sailing at sea. It employs an online tracking module for detecting important changes in the evolving trajectory of each vessel across time, and thus can incrementally retain concise, yet reliable summaries of its recent movement. In addition, thanks to its complex event recognition module, this system can also offer instant notification to marine authorities regarding emergency situations, such as risk of collisions, suspicious moves in protected zones, or package picking at open sea. Not only did our extensive tests validate the performance, efficiency, and robustness of the system against scalable volumes of real-world and synthetically enlarged datasets, but its deployment against online feeds from vessels has also confirmed its capabilities for effective, real-time maritime surveillance.
Article
Pattern queries are widely used in complex event processing (CEP) systems. Existing pattern matching techniques, however, can provide only limited performance for expensive queries in real-world applications, which may involve Kleene closure patterns, flexible event selection strategies, and events with imprecise timestamps. To support these expensive queries with high performance, we begin our study by analyzing the complexity of pattern queries, with a focus on the fundamental understanding of which features make pattern queries more expressive and at the same time more computationally expensive. This analysis allows us to identify performance bottlenecks in processing those expensive queries, and provides key insights for us to develop a series of optimizations to mitigate those bottlenecks. Microbenchmark results show superior performance of our system for expensive pattern queries while most state-of-the-art systems suffer from poor performance. A thorough case study on Hadoop cluster monitoring further demonstrates the efficiency and effectiveness of our proposed techniques.
Article
In diverse applications ranging from stock trading to traffic monitoring, data streams are continuously monitored by multiple analysts for extracting patterns of interest in real time. These analysts often submit similar pattern mining requests yet customized with different parameter settings. In this work, we present shared execution strategies for processing a large number of neighbor-based pattern mining requests of the same type yet with arbitrary parameter settings. Such neighbor-based pattern mining requests cover a broad range of popular mining query types, including detection of clusters, outliers, and nearest neighbors. Given the high algorithmic complexity of the mining process, serving multiple such queries in a single system is extremely resource intensive. The naive method of detecting and maintaining patterns for different queries independently is often infeasible in practice, as its demands on system resources increase dramatically with the cardinality of the query workload. In order to maximize the efficiency of system resource utilization for executing multiple queries simultaneously, we analyze the commonalities of the neighbor-based pattern mining queries, and identify several general optimization principles which lead to significant system resource sharing among multiple queries. In particular, as a preliminary sharing effort, we observe that the computation needed for the range query searches (the process of searching the neighbors for each object) can be shared among multiple queries, which saves CPU consumption. Then we analyze the interrelations between the patterns identified by queries with different parameter settings, including both pattern-specific and window-specific parameters. For that, we first introduce an incremental pattern representation, which represents the patterns identified by queries with different pattern-specific parameters within a single compact structure. This enables integrated pattern maintenance for multiple queries. Second, by leveraging the potential overlaps among sliding windows, we propose a metaquery strategy which utilizes a single query to answer multiple queries with different window-specific parameters. By combining these three techniques, namely range query search sharing, integrated pattern maintenance, and the metaquery strategy, our framework realizes fully shared execution of multiple queries with arbitrary parameter settings. It achieves significant savings of computational and memory resources due to shared execution. Our comprehensive experimental study, using real data streams from the domains of stock trades and moving object monitoring, demonstrates that our solution is significantly faster than the independent execution strategy, while using only a small portion of the memory required by independent execution. We also show that our solution scales to large numbers of queries, on the order of hundreds or even thousands, under high input data rates.
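The range-query sharing observation translates into a few lines of code. The sketch below (hypothetical names, one-dimensional points, and a linear scan in place of the index a real system would maintain) runs one neighbor search at the largest requested radius and lets every query filter that shared superset:

def shared_neighbors(window, center, radii):
    """One range search at the largest radius serves all queries; queries
    with smaller radii filter the shared result instead of rescanning the
    window. A sketch of the sharing principle, not the paper's framework."""
    r_max = max(radii)
    dists = [(p, abs(p - center)) for p in window]          # single scan
    superset = [(p, d) for p, d in dists if d <= r_max]
    return {r: [p for p, d in superset if d <= r] for r in radii}

# Three queries with radii 1, 2, and 5 share a single scan of the window:
print(shared_neighbors([0.5, 1.5, 3.0, 4.5, 9.0], center=2.0, radii=[1, 2, 5]))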
Article
Symbolic event recognition systems have been successfully applied to a variety of application domains, extracting useful information in the form of events, allowing experts or other systems to monitor and respond when significant events are recognised. In a typical event recognition application, however, these systems often have to deal with a significant amount of uncertainty. In this paper, we address the issue of uncertainty in logic-based event recognition by extending the Event Calculus with probabilistic reasoning. Markov Logic Networks are a natural candidate for our logic-based formalism. However, the temporal semantics of the Event Calculus introduce a number of challenges for the proposed model. We show how and under what assumptions we can overcome these problems. Additionally, we study how probabilistic modelling changes the behaviour of the formalism, affecting its key property, the inertia of fluents. Furthermore, we demonstrate the advantages of the probabilistic Event Calculus through examples and experiments in the domain of activity recognition, using a publicly available dataset for video surveillance.
Article
We have been developing a system for recognising human activity given a symbolic representation of video content. The input of our system is a set of time-stamped short-term activities detected on video frames. The output of our system is a set of recognised long-term activities, which are pre-defined temporal combinations of short-term activities. The constraints on the short-term activities that, if satisfied, lead to the recognition of a long-term activity, are expressed using a dialect of the Event Calculus. We illustrate the expressiveness of the dialect by showing the representation of several typical complex activities. Furthermore, we present a detailed evaluation of the system through experimentation on a benchmark dataset of surveillance videos.
Article
We present a system for recognising human activity given a symbolic representation of video content. The input of our system is a set of time-stamped short-term activities (STA) detected on video frames. The output is a set of recognised long-term activities (LTA), which are pre-defined temporal combinations of STA. The constraints on the STA that, if satisfied, lead to the recognition of a LTA, have been expressed using a dialect of the Event Calculus. In order to handle the uncertainty that naturally occurs in human activity recognition, we adapted this dialect to a state-of-the-art probabilistic logic programming framework. We present a detailed evaluation and comparison of the crisp and probabilistic approaches through experimentation on a benchmark dataset of human surveillance videos.
Conference Paper
A major problem in detecting events in streams of data is that the data can be imprecise (e.g. RFID data). However, current state-of-the-art event detection systems such as Cayuga (14), SASE (46) or SnoopIB (1), assume the data is precise. Noise in the data can be captured using techniques such as hidden Markov models. Inference on these models creates streams of probabilistic events which cannot be directly queried by existing systems. To address this challenge we propose Lahar, an event processing system for probabilistic event streams. By exploiting the probabilistic nature of the data, Lahar yields a much higher recall and precision than deterministic techniques operating over only the most probable tuples. By using a novel static analysis and novel algorithms, Lahar processes data orders of magnitude more efficiently than a naïve approach based on sampling. In this paper, we present Lahar's static analysis and core algorithms. We demonstrate the quality and performance of our approach through experiments with our prototype implementation and comparisons with alternate methods.
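The gap between querying probabilistic events and keeping only the most probable tuples is visible even in a toy setting. The dynamic program below (our illustration, not Lahar's algorithm) computes the exact probability that a sequence pattern SEQ(A, B) matches a stream of independent probabilistic events:

def seq_match_prob(stream):
    """P(some A occurs strictly before some B) over a stream of events,
    where each event is A with probability p_a, B with probability p_b,
    and neither otherwise, independently across events. Toy DP over the
    states {nothing seen, A seen, matched}; not Lahar's algorithm."""
    p_none, p_seen_a, p_matched = 1.0, 0.0, 0.0
    for p_a, p_b in stream:
        p_matched += p_seen_a * p_b                  # a B completes a match
        p_seen_a = p_seen_a * (1 - p_b) + p_none * p_a
        p_none *= (1 - p_a)
    return p_matched

# Four uncertain readings; thresholding each event to its most probable
# value ("neither", prob 0.6) would report no match at all, yet:
print(seq_match_prob([(0.4, 0.0), (0.0, 0.4), (0.4, 0.0), (0.0, 0.4)]))  # 0.352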
Conference Paper
Complex event processing (CEP) over event streams has become increasingly important for real-time applications ranging from health care and supply chain management to business intelligence. These monitoring applications submit complex queries to track sequences of events that match a given pattern. As these systems mature, the need for increasingly complex nested sequence query support arises, while state-of-the-art CEP systems mostly support the execution of flat sequence queries only. To assure real-time responsiveness and scalability for pattern detection even on huge-volume, high-speed streams, efficient processing techniques must be designed. In this paper, we first analyze the prevailing nested pattern query processing strategy and identify several serious shortcomings. Not only are substantial subsequences first constructed just to be subsequently discarded, but opportunities for shared execution of nested subexpressions are also overlooked. As foundation, we introduce NEEL, a CEP query language for expressing nested CEP pattern queries composed of sequence, negation, AND and OR operators. To overcome the deficiencies, we design rewriting rules for pushing negation into inner subexpressions. Next, we devise a normalization procedure that employs these rules for flattening a nested complex event expression. To conserve CPU and memory consumption, we propose several strategies for efficient shared processing of groups of normalized NEEL subexpressions. These strategies include prefix caching, suffix clustering and customized “bit-marking” execution strategies. We design an optimizer to partition the set of all CEP subexpressions in a NEEL normal form into groups, each of which can then be mapped to one of our shared execution operators. Lastly, we evaluate our technologies by conducting a performance study to assess the CPU processing time using real-world stock trades data. Our results confirm that our NEEL execution in many cases performs 100-fold faster than the traditional iterative nested execution strategy for real stock market query workloads.
Conference Paper
In recent years, there has been a growing need for active systems that can react automatically to events. Some events are generated externally and deliver data across distributed systems, while others are materialized by the active system itself. Event materialization is hampered by uncertainty that may be attributed to unreliable data sources and networks, or the inability to determine with certainty whether an event has actually occurred. Two main obstacles exist when designing a solution to the problem of event materialization with uncertainty. First, event materialization should be performed efficiently, at times under a heavy load of incoming events from various sources. The second challenge involves the generation of a correct probability space, given uncertain events. We present a solution to both problems by introducing an efficient mechanism for event materialization under uncertainty. A model for representing materialized events is presented and two algorithms for correctly specifying the probability space of an event history are given. The first provides an accurate, albeit expensive method based on the construction of a Bayesian network. The second is a Monte Carlo sampling algorithm that heuristically assesses materialized event probabilities. We experimented with both the Bayesian network and the sampling algorithms, showing the latter to be scalable under an increasing rate of explicit event delivery and an increasing number of uncertain rules (while the former is not). Finally, our sampling algorithm accurately and efficiently estimates the probability space.
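Under strong simplifying assumptions (independent explicit events and deterministic rules with known bodies; the paper's setting additionally covers uncertain rules and the construction of a correct probability space), the sampling idea fits in a few lines:

import random

def mc_event_probability(explicit, bodies, trials=10_000):
    """Monte Carlo estimate of the probability that a derived event is
    materialized. `explicit` maps explicit event names to occurrence
    probabilities; `bodies` lists alternative rule bodies, each a set of
    explicit events whose joint occurrence derives the event. A toy
    sketch of the sampling approach, not the paper's algorithm."""
    hits = 0
    for _ in range(trials):
        world = {e for e, p in explicit.items() if random.random() < p}
        if any(body <= world for body in bodies):
            hits += 1
    return hits / trials

# P(alarm), where alarm is derived from {smoke, heat} or {smoke, camera};
# the exact value is 0.3 * (1 - 0.5 * 0.2) = 0.27:
probs = {"smoke": 0.3, "heat": 0.5, "camera": 0.8}
print(mc_event_probability(probs, [{"smoke", "heat"}, {"smoke", "camera"}]))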
Conference Paper
For public transport authorities, the most important aspects for operations are passenger satisfaction and safety. Based on surveys, passengers regard punctuality as the most important aspect, which makes it an important factor of passenger satisfaction for the operators [10]. In the City of Helsinki, the public transport vehicles' movements and status are monitored through a real-time information system, which provides authorities information about punctuality of vehicles. Furthermore, the gathered information is used for route planning and scheduling. Due to the existing information systems in Helsinki, the focus in operation planning has gradually moved from real-time timetables to other factors affecting passenger satisfaction and safety, aiming to find ways to further improve public transport operations. The PRONTO research project focuses on two demonstration cases. One is emergency rescue operations in Dortmund, Germany. The other one is public transport in Helsinki, where the research focuses on developing methods for improving passengers' travel experience through monitoring and analyzing vehicle events in real-time. Through these methods, it is possible to further improve public transport punctuality and especially the driving style that would lead to improved passenger safety and satisfaction, as well as better vehicle endurance. A variety of sensors and connections to existing systems have been implemented in order to provide the PRONTO system with valuable data about both demonstration cases. This paper presents the activities and achievements so far of the public transport demonstration case.
Article
CQL, a continuous query language, is supported by the STREAM prototype data stream management system (DSMS) at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and stored relations. We begin by presenting an abstract semantics that relies only on "black-box" mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, interoperator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for DSMSs. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL. The relative ease of capturing these applications in CQL is one indicator that the language contains an appropriate set of constructs for data stream processing.
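As a concrete instance of a relation-to-stream operator, CQL's Istream emits a tuple at the instant it first appears in the instantaneous relation (Dstream and Rstream are its two companions). A minimal Python sketch of that standard semantics:

def istream(relation_sequence):
    """Istream(R): at each time t, emit the tuples of R(t) that were not
    in R(t-1). Minimal sketch of the standard CQL semantics over a list
    of successive instantaneous relations (sets of tuples)."""
    prev = set()
    for t, rel in enumerate(relation_sequence):
        for tup in sorted(rel - prev):
            yield t, tup
        prev = rel

# A growing relation: each insertion is reported exactly once.
print(list(istream([{1}, {1, 2}, {1, 2, 5}])))  # [(0, 1), (1, 2), (2, 5)]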
Article
Event processing will play an increasingly important role in constructing enterprise applications that can immediately react to business critical events. Various technologies have been proposed in recent years, such as event processing, data streams and asynchronous messaging (e.g. pub/sub). We believe these technologies share a common processing model and differ only in target workload, including query language features and consistency requirements. We argue that integrating these technologies is the next step in a natural progression. In this paper, we present an overview and discuss the foundations of CEDR, an event streaming system that embraces a temporal stream model to unify and further enrich query language features, handle imperfections in event delivery and define correctness guarantees. We describe specific contributions made so far and outline next steps in developing the CEDR system.
Conference Paper
Urban data management is already an essential element of modern cities. The authorities can build on the variety of automatically generated information and develop intelligent services that improve citizens' daily life, save environmental resources or aid in coping with emergencies. From a data mining perspective, urban data introduce a lot of challenges. Data volume, velocity and veracity are some obvious obstacles. However, there are even more issues of equal importance like data quality, resilience, privacy and security. In this paper we describe the development of a set of techniques and frameworks that aim at effective and efficient urban data management in real settings. To do this, we collaborated with the city of Dublin and worked on real problems and data. Our solutions were integrated in a system that was evaluated and is currently utilized by the city.
Article
Timestamps are often found to be dirty in various scenarios, e.g., in distributed systems with clock synchronization problems or unreliable RFID readers. Without cleaning the imprecise timestamps, temporal-related applications such as provenance analysis or pattern queries are not reliable. To evaluate the correctness of timestamps, temporal constraints could be employed, which declare the distance restrictions between timestamps. Guided by such constraints on timestamps, in this paper, we study a novel problem of repairing inconsistent timestamps that do not conform to the required temporal constraints. Following the same line of data repairing, the timestamp repairing problem is to minimally modify the timestamps towards satisfaction of temporal constraints. This problem is practically challenging, given the huge space of possible timestamps. We tackle the problem by identifying a concise set of promising candidates, where an optimal repair solution can always be found. Repair algorithms with efficient pruning are then devised over the identified candidates. Experiments on real datasets demonstrate the superiority of our proposal compared to the state-of-the-art approaches.
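A deliberately simplified illustration of constraint-guided repair (a one-pass heuristic of ours, not the paper's candidate-based algorithm with minimality guarantees): enforce a minimum-gap temporal constraint by shifting each violating timestamp forward to the smallest admissible value.

def forward_repair(timestamps, min_gap):
    """Enforce t[i] >= t[i-1] + min_gap by moving each violating timestamp
    forward as little as possible. One forward pass, illustration only;
    the paper searches a concise candidate set for a repair with minimal
    total modification."""
    repaired = list(timestamps)
    for i in range(1, len(repaired)):
        earliest = repaired[i - 1] + min_gap
        if repaired[i] < earliest:
            repaired[i] = earliest
    return repaired

# The dirty reading 11 is pushed to 15, the closest value satisfying the
# constraint against its predecessor:
print(forward_repair([10, 11, 30], min_gap=5))  # [10, 15, 30]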
Conference Paper
Real-time analytics of anomalous phenomena on streaming data typically relies on processing a large variety of continuous outlier detection requests, each configured with different parameter settings. The processing of such complex outlier analytics workloads is resource consuming due to the algorithmic complexity of the outlier mining process. In this work we propose a sharing-aware multi-query execution strategy for outlier detection on data streams called SOP. A key insight of SOP is to transform the problem of handling a multi-query outlier analytics workload into a single-query skyline computation problem. We prove that the output of the skyline computation process corresponds to the minimal information needed for determining the outlier status of any point in the stream. Based on this new formulation, we design a customized skyline algorithm called K-SKY that leverages the domination relationships among the streaming data points to minimize the number of data points that must be evaluated for supporting multi-query outlier detection. Based on this K-SKY algorithm, our SOP solution achieves minimal utilization of both computational and memory resources for the processing of these complex outlier analytics workloads. Our experimental study demonstrates that SOP consistently outperforms the state-of-the-art solutions by three orders of magnitude in CPU time, while only consuming 5% of their memory footprint - a clear win-win. Furthermore, SOP is shown to scale to large workloads composed of thousands of parameterized queries.
Conference Paper
Complex Event Processing (CEP) has emerged as a technology of choice for high performance event analytics in time-critical decision-making applications. Yet it is becoming increasingly difficult to support high-performance event processing due to the rising number and complexity of event pattern queries and the increasingly high velocity of event streams. In this work we design the SPASS framework that successfully tackles these demanding CEP workloads. Our SPASS optimizer identifies opportunities for effective shared processing among CEP queries by leveraging time-based event correlations among queries. The problem of pattern sharing is shown to be NP-hard by reducing the Minimum Substring Cover problem to our CEP pattern sharing problem. The SPASS optimizer finds a shared pattern plan in polynomial time that covers all sequence patterns while still guaranteeing an optimality bound. To execute this shared pattern plan, the SPASS runtime employs stream transactions that assure concurrent shared maintenance and re-use of sub-patterns across queries. Our experimental study confirms that the SPASS framework achieves over 16-fold performance improvement for a wide range of experiments compared to the state-of-the-art solution.
Article
Urban mobility impacts urban life to a great extent. To enhance urban mobility, much research was invested in traveling time prediction: given an origin and destination, provide a passenger with an accurate estimation of how long a journey lasts. In this work, we investigate a novel combination of methods from Queueing Theory and Machine Learning in the prediction process. We propose a prediction engine that, given a scheduled bus journey (route) and a 'source/destination' pair, provides an estimate for the traveling time, while considering both historical data and real-time streams of information that are transmitted by buses. We propose a model that uses natural segmentation of the data according to bus stops and a set of predictors, some use learning while others are learning-free, to compute traveling time. Our empirical evaluation, using bus data that comes from the bus network in the city of Dublin, demonstrates that the snapshot principle, taken from Queueing Theory, works well yet suffers from outliers. To overcome the outliers problem, we use Machine Learning techniques as a regulator that assists in identifying outliers and propose prediction based on historical data.
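The snapshot principle mentioned above has a very compact form: predict the remaining journey as the sum of the most recently observed traversal times of the intervening segments. A hypothetical sketch (names and data layout are ours):

def snapshot_prediction(latest_segment_times, src_stop, dst_stop):
    """Snapshot principle: assume the network stays as it is right now, so
    the journey time from src_stop to dst_stop is the sum of the latest
    observed per-segment traversal times. The paper pairs this estimator
    with learned regulators to catch the outliers it suffers from."""
    return sum(latest_segment_times[src_stop:dst_stop])

# Minutes per segment between consecutive stops, as reported just now:
print(snapshot_prediction([4.0, 6.5, 3.0, 5.5], src_stop=1, dst_stop=4))  # 15.0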
Chapter
Information systems are becoming more and more intertwined with the operational processes they support. As a result, multitudes of events are recorded by today’s information systems. Nevertheless, organizations have problems extracting value from these data. The goal of process mining is to use event data to extract process-related information, e.g., to automatically discover a process model by observing events recorded by some enterprise system. To show the importance of process mining, this chapter discusses the spectacular growth of event data and links this to the limitations of classical approaches to business process management. To explain the basic concepts, a small example is used. Finally, it is shown that process mining can play an important role in realizing the promises made by contemporary management trends such as SOX and Six Sigma.
Article
A growing number of enterprises use complex event processing for monitoring and controlling their operations, while business process models are used to document working procedures. In this work, we propose a comprehensive method for complex event processing optimization using business process models. Our proposed method is based on the extraction of behavioral constraints that are used, in turn, to rewrite patterns for event detection, and select and transform execution plans. We offer a set of rewriting rules that is shown to be complete with respect to the all, seq, and any patterns. The effectiveness of our method is demonstrated in an experimental evaluation with a large number of processes from an insurance company. We illustrate that the proposed optimization leads to significant savings in query processing. By integrating the optimization in state-of-the-art systems for event pattern matching, we demonstrate that these savings materialize in different technical infrastructures and can be combined with existing optimization techniques.
Conference Paper
Much research attention has been given to delivering high-performance systems that are capable of complex event processing (CEP) in a wide range of applications. However, many current CEP systems focus on efficiently processing data with a simple structure, and are otherwise limited in their ability to efficiently support complex continuous queries on structured or semi-structured information. Yet XML streams represent a very popular form of data exchange, comprising large portions of social network and RSS feeds, financial records, configuration files, and similar applications requiring advanced CEP queries. In this paper, we present the XSeq language and system that support CEP on XML streams, via an extension of XPath that is both powerful and amenable to an efficient implementation. Specifically, the XSeq language extends XPath with natural operators to express sequential and Kleene-* patterns over XML streams, while remaining highly amenable to efficient implementation. XSeq is designed to take full advantage of recent advances on Visibly Pushdown Automata (VPA), where higher expressive power can be achieved without compromising efficiency (whereas amenability to efficient implementation was not demonstrated in previously proposed XPath extensions). We illustrate XSeq's power for CEP applications through examples from different domains, and provide formal results on its expressiveness and complexity. Finally, we present several optimization techniques for XSeq queries. Our extensive experiments indicate that XSeq brings outstanding performance to CEP applications: two orders of magnitude improvement are obtained over the same queries executed in general-purpose XML engines.
Conference Paper
Regular expression matching over sequences in real time is a crucial task in complex event processing on data streams. Given that such data sequences are often noisy and errors have temporal and spatial correlations, performing regular expression matching effectively and efficiently is a challenging task. Instead of the traditional approach of learning a distribution of the stream first and then processing queries, we propose a new approach that efficiently does the matching based on an error model. In particular, our algorithms are based on the realistic Markov chain error model, and report all matching paths to trace relevant basic events that trigger the matching. This is much more informative than a single matching path. We also devise algorithms to efficiently return only top-k matching paths, and to handle negations in an extended regular expression. Finally, we conduct a comprehensive experimental study to evaluate our algorithms using real datasets.
Article
Complex Event Processing (CEP) is a stream processing model that focuses on detecting event patterns in continuous event streams. While the CEP model has gained popularity in the research communities and commercial technologies, the problem of gracefully degrading performance under heavy load in the presence of resource constraints, or load shedding, has been largely overlooked. CEP is similar to "classical" stream data management, but addresses a substantially different class of queries. This unfortunately renders the load shedding algorithms developed for stream data processing inapplicable. In this paper we study CEP load shedding under various resource constraints. We formalize broad classes of CEP load-shedding scenarios as different optimization problems. We demonstrate an array of complexity results that reveal the hardness of these problems and construct shedding algorithms with performance guarantees. Our results shed some light on the difficulty of developing load-shedding algorithms that maximize utility.
Conference Paper
Data streaming systems are becoming essential for monitoring applications such as financial analysis and network intrusion detection. These systems often have to process many similar but different queries over common data. Since executing each query separately can lead to significant scalability and performance problems, it is vital to share resources by exploiting similarities in the queries. In this paper we present ways to efficiently share streaming aggregate queries with differing periodic windows and arbitrary selection predicates. A major contribution is our sharing technique that does not require any up-front multiple query optimization. This is a significant departure from existing techniques that rely on complex static analyses of fixed query workloads. Our approach is particularly vital in streaming systems where queries can join and leave the system at any point. We present a detailed performance study that evaluates our strategies with an implementation and real data. In these experiments, our approach gives us as much as an order of magnitude performance improvement over the state of the art.
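One standard way to realize such sharing, offered here as an illustration rather than the paper's exact technique, is pane-based partial aggregation: the stream is aggregated once into small non-overlapping panes, and every periodic window whose length is a multiple of the pane size sums the panes it covers.

def pane_partials(values, pane_size):
    """Aggregate the raw stream once into non-overlapping panes."""
    return [sum(values[i:i + pane_size])
            for i in range(0, len(values), pane_size)]

def window_sums(partials, panes_per_window):
    """Each query reuses the shared pane partials instead of rescanning
    raw tuples; only the window length (in panes) differs per query."""
    return [sum(partials[i:i + panes_per_window])
            for i in range(len(partials) - panes_per_window + 1)]

values = [1, 2, 3, 4, 5, 6, 7, 8]
shared = pane_partials(values, pane_size=2)      # [3, 7, 11, 15]
print(window_sums(shared, panes_per_window=2))   # 4-tuple windows: [10, 18, 26]
print(window_sums(shared, panes_per_window=3))   # 6-tuple windows: [21, 33]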
Conference Paper
Composite (or Complex) event processing (CEP) systems search sequences of incoming events for occurrences of user-specified event patterns. Recently, they have gained more attention in a variety of areas due to their powerful and expressive query language and performance potential. Sequentiality (temporal ordering) is the primary way in which CEP systems relate events to each other. In this paper, we present a CEP system called ZStream to efficiently process such sequential patterns. Besides simple sequential patterns, ZStream is also able to detect other patterns, including conjunction, disjunction, negation and Kleene closure. Unlike most recently proposed CEP systems, which use non-deterministic finite automata (NFAs) to detect patterns, ZStream uses tree-based query plans for both the logical and physical representation of query patterns. By carefully designing the underlying infrastructure and algorithms, ZStream is able to unify the evaluation of sequence, conjunction, disjunction, negation, and Kleene closure as variants of the join operator. Under this framework, a single pattern in ZStream may have several equivalent physical tree plans, with different evaluation costs. We propose a cost model to estimate the computation costs of a plan. We show that our cost model can accurately capture the actual runtime behavior of a plan, and that choosing the optimal plan can result in a factor of four or more speedup versus an NFA-based approach. Based on this cost model and using a simple set of statistics about operator selectivity and data rates, ZStream is able to adaptively and seamlessly adjust the order in which it detects patterns on the fly. Finally, we describe a dynamic programming algorithm used in our cost model to efficiently search for an optimal query plan for a given pattern.
Conference Paper
In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into new composite events for the use of external monitoring applications. Stream-based execution of these queries enables time-critical actions to be taken in environments such as supply chain management, surveillance and facility management, healthcare, etc. We first propose a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. We then describe a query plan-based approach to efficiently implementing this language. Our approach uses native operators to efficiently handle query-defined sequences, which are a key component of complex event processing, and pipelines such sequences to subsequent operators that are built by leveraging relational techniques. We also develop a large suite of optimization techniques to address challenges such as large sliding windows and intermediate result sizes. We demonstrate the effectiveness of our approach through a detailed performance analysis of our prototype implementation as well as through a comparison to a state-of-the-art stream processor.
Conference Paper
We describe the design and implementation of the Cornell Cayuga System for scalable event processing. We present a query language based on Cayuga Algebra for naturally expressing complex event patterns. We also describe several novel system design and implementation issues, focusing on Cayuga's query processor, its indexing approach, how Cayuga handles simultaneous events, and its specialized garbage collector.
Conference Paper
Treating instants of time as primitive not only is conceptually implausible but also has encountered grave practical difficulties. A satisfactory theory of time seems to be one which is based on the common-sense idea that events or periods are the primitive entities of time while instants are constructed from them. In this paper we present one such common-sense theory of time. We start from a structure of events, construct instants out of the events, and then show that these instants have the properties we normally expect of them. Views discussed here include the view of Allen and Hayes, and that of Russell.
Conference Paper
In this work, we study the event pattern matching mechanism over streams with interval-based temporal semantics. An expressive language to represent the required temporal patterns among streaming interval events is introduced and the corresponding temporal operator ISEQ is designed. For further improving the interval event processing performance, a punctuation-aware stream processing strategy is provided. Experimental studies illustrate that the proposed techniques bring significant performance improvement in both memory and CPU usage with little overhead.
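Interval-based sequencing differs from point-based sequencing: event a precedes event b only when a's entire interval ends before b's begins (Allen's before relation). Below is a nested-loop Python sketch of just that semantics, with a tuple layout of our choosing; the ISEQ operator evaluates it incrementally with punctuations instead.

def interval_seq_matches(events_a, events_b):
    """Pairs (a, b) such that interval a ends strictly before interval b
    starts. Events are (name, start, end) triples; semantics only, no
    incremental state or punctuation handling."""
    return [(a, b)
            for a in events_a
            for b in events_b
            if a[2] < b[1]]

a_events = [("a1", 0, 3), ("a2", 2, 6)]
b_events = [("b1", 4, 7)]
print(interval_seq_matches(a_events, b_events))
# [(('a1', 0, 3), ('b1', 4, 7))] -- a2 overlaps b1, so that pair is excluded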
Conference Paper
The nature of data in enterprises and on the Internet is changing. Data used to be stored in a database first and queried later. Today timely processing of new data, represented as events, is increasingly valuable. In many domains, complex event processing (CEP) systems detect patterns of events for decision making. Examples include processing of environmental sensor data, trades in financial markets and RSS web feeds. Unlike conventional database systems, most current CEP systems pay little attention to query optimisation. They do not rewrite queries to more efficient representations or make decisions about operator distribution, limiting their overall scalability. This paper describes the NEXT CEP system that was especially designed for query rewriting and distribution. Event patterns are specified in a high-level query language and, before being translated into event automata, are rewritten in a more efficient form. Automata are then distributed across a cluster of machines for detection scalability. We present algorithms for query rewriting and distributed placement. Our experiments on the Emulab test-bed show a significant improvement in system scalability due to rewriting and distribution.
Book
Most subfields of computer science have an interface layer via which applications communicate with the infrastructure, and this is key to their success (e.g., the Internet in networking, the relational model in databases, etc.). So far this interface layer has been missing in AI: applications involve high degrees of complexity and uncertainty, and while first-order logic handles complexity well and probabilistic graphical models do the same for uncertainty, neither can cope effectively with both. Markov logic is a powerful new language that seamlessly combines the two by attaching weights to first-order formulas and treating them as templates for features of Markov random fields. Most statistical models in wide use are special cases of Markov logic, and first-order logic is its infinite-weight limit. Inference algorithms for Markov logic combine ideas from satisfiability, Markov chain Monte Carlo, belief propagation, and resolution. Learning algorithms make use of conditional likelihood, convex optimization, and inductive logic programming. Markov logic has been successfully applied to problems in information extraction and integration, natural language processing, robot mapping, social networks, computational biology, and others, and is the basis of the open-source Alchemy system.
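The core of Markov logic fits in one standard formula: a possible world is scored by the weighted counts of true groundings of the first-order formulas.

\[
  P(X = x) \;=\; \frac{1}{Z} \exp\Bigl( \sum_i w_i \, n_i(x) \Bigr),
  \qquad
  Z \;=\; \sum_{x'} \exp\Bigl( \sum_i w_i \, n_i(x') \Bigr),
\]

where \(n_i(x)\) is the number of true groundings of formula \(F_i\) in world \(x\) and \(w_i\) is its weight. Letting \(w_i \to \infty\) turns \(F_i\) into a hard constraint, which is the sense in which first-order logic is the infinite-weight limit mentioned above.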
Article
Complex Event Detection (CED) is emerging as a key capability for many monitoring applications such as intrusion detection, sensor-based activity & phenomena tracking, and network monitoring. Existing CED solutions commonly assume centralized availability and processing of all relevant events, and thus incur significant overhead in distributed settings. In this paper, we present and evaluate communication-efficient techniques that can efficiently perform CED across distributed event sources. Our techniques are plan-based: we generate multi-step event acquisition and processing plans that leverage temporal relationships among events and event occurrence statistics to minimize event transmission costs, while meeting application-specific latency expectations. We present an optimal but exponential-time dynamic programming algorithm and two polynomial-time heuristic algorithms, as well as their extensions for detecting multiple complex events with common sub-expressions. We characterize the behavior and performance of our solutions via extensive experimentation on synthetic and real-world data sets using our prototype implementation.
Article
Large-scale event systems are becoming increasingly popular in a variety of domains. Event pattern evaluation plays a key role in monitoring applications in these domains: it identifies matches of user-defined patterns on high-volume event streams. Existing work on pattern evaluation, however, assumes that the occurrence time of each event is known precisely and the events from various sources can be merged into a single stream with a total or partial order. We observe that in real-world applications event occurrence times are often unknown or imprecise. Therefore, we propose a temporal model that assigns a time interval to each event to represent all of its possible occurrence times and revisit pattern evaluation under this model. In particular, we propose the formal semantics of such pattern evaluation, two evaluation frameworks, and algorithms and optimizations in these frameworks. Our evaluation results using both real traces and synthetic systems show that the event-based framework always outperforms the point-based framework and, with optimizations, it achieves high efficiency for a wide range of workloads tested.
Article
The need to search for complex and recurring patterns in database sequences is shared by many applications. In this paper, we investigate the design and optimization of a query language capable of expressing and supporting efficiently the search for complex sequential patterns in database systems. Thus, we first introduce SQL-TS, an extension of SQL to express these patterns, and then we study how to optimize the queries for this language. We take the optimal text search algorithm of Knuth, Morris and Pratt, and generalize it to handle complex queries on sequences. Our algorithm exploits the interdependencies between the elements of a pattern to minimize repeated passes over the same data. Experimental results on typical sequence queries, such as double bottom queries, confirm that substantial speedups are achieved by our new optimization techniques.
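The Knuth-Morris-Pratt core that the paper generalizes is the failure (prefix) function, which lets matching resume after a mismatch without re-reading input. The classic construction on plain strings, shown here for orientation rather than as the paper's predicate-based generalization:

def kmp_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that
    is also a suffix of it; on a mismatch the search resumes at that
    position instead of rescanning the data. SQL-TS's optimizer lifts
    this idea from equality on characters to predicates on tuples."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

print(kmp_failure("ababc"))  # [0, 0, 1, 2, 0]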
Conference Paper
Pattern matching over event streams is increasingly being employed in many areas including financial services, RFID-based inventory management, click stream analysis, and electronic health systems. While regular expression matching is well studied, pattern matching over streams presents two new challenges: Languages for pattern matching over streams are significantly richer than languages for regular expression matching. Furthermore, efficient evaluation of these pattern queries over streams requires new algorithms and optimizations: the conventional wisdom for stream query processing (i.e., using selection-join-aggregation) is inadequate. In this paper, we present a formal evaluation model that offers precise semantics for this new class of queries and a query evaluation framework permitting optimizations in a principled way. We further analyze the runtime complexity of query evaluation using this model and develop a suite of techniques that improve runtime efficiency by exploiting sharing in storage and processing. Our experimental results provide insights into the various factors on runtime performance and demonstrate the significant performance gains of our sharing techniques.
Article
This paper specifies the Linear Road Benchmark for Stream Data Management Systems (SDMS). Stream Data Management Systems process streaming data by executing continuous and historical queries while producing query results in real-time. This benchmark makes it possible to compare the performance characteristics of SDMSs relative to each other and to alternative (e.g., Relational Database) systems. Linear Road has been endorsed as an SDMS benchmark by the developers of both the Aurora [1] (out of Brandeis University, Brown University and MIT) and STREAM [8] (out of Stanford University) stream systems.
Article
In this paper we establish the correspondence between the point- and interval-based views of temporal databases and the corresponding first-order temporal query languages. This correspondence shows that all first-order queries can be conveniently asked using the point-based query language (which allows for a much more declarative and natural way of asking temporal queries) and then mechanically translated to an interval-based query language (e.g., TSQL2). Such an approach combines the ease of formulating queries in first-order logic (temporal relational calculus) with efficient query evaluation algorithms developed for interval-based temporal databases.
J. Filipou, A. Artikis, A. Skarlatidis, and G. Paliouras. A probabilistic logic programming event calculus. TPLP, 15(2):213-245, 2015.
K. Sugiura and Y. Ishikawa. Top-k pattern matching using an information-theoretic criterion over probabilistic data streams. In APWeb-WAIM, LNCS 10366, pp. 511-526. Springer, 2017.
R. Conforti, M. La Rosa, and A. ter Hofstede.
S. Krishnamurthy, C. Wu, and M. J. Franklin. On-the-fly sharing for streamed aggregation. In SIGMOD, 2006.
R. S. Barga, J. Goldstein, M. Ali, and M. Hong. Consistent streaming through time: A vision for event stream processing. In CIDR, 2007.