Article

PA-SPS: A predictive adaptive approach for an elastic stream processing system

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

ResearchGate has not been able to resolve any citations for this publication.
Thesis
Full-text available
Nowadays, a significant part of computing systems and real-world applications demand parallelism to accelerate their executions. Although high-level and structured parallel programming aims to facilitate parallelism exploitation, there are still issues to be addressed to improve existing parallel programming abstractions. For instance, usually, application developers still have to set non-intuitive or complex parallelism configurations. In this context, self-adaptation is a potential alternative to provide a higher-level of autonomic abstractions and runtime responsiveness in parallel executions. However, it is challenging to provide highly efficient and abstracted self-adaptive solutions. In this work, we are interested in abstractions achievable with self-adaptation transparently managing the executions while the parallel programs are running (at run-time). Our main goals are to increase the adaptation space to be more representative of real-world applications and make self-adaptation more efficient with comprehensive evaluation methodologies, which can provide use-cases demonstrating the true potentials of self-adaptation. Therefore, this doctoral dissertation provides the following scientific contributions: I) An SLR providing a taxonomy of the state-of-the-art. II) A conceptual framework to support designing and abstracting the decision-making process within self-adaptive solutions, such a conceptual framework is then employed in the technical contributions to assist in making the solutions more modular and potentially generalizable. III) Mechanisms and strategies for self-adaptive replicas in applications with single and multiple parallel stages, supporting multiple customizable non-functional requirements. IV) Mechanism, strategy, and optimizations for self-adaptation of Parallel Patterns/applications' graphs topologies. We apply the proposed solutions to the context of stream processing applications, a representative paradigm present in several real-world applications that compute data flowing in the form of streams (e.g., video feeds, image, and data analytics). A part of the proposed solutions is evaluated with SPar and another part with the FastFlow programming framework. The results demonstrate that self-adaptation can provide efficient parallelism abstractions and autonomous responsiveness at run-time, yet achieve a competitive performance w.r.t. the best static executions. Moreover, when appropriate, we compare state-of-the-art solutions and demonstrate that our highly optimized decision-making strategies achieve significant performance and efficiency gains.
Conference Paper
Full-text available
Data stream processing is an attractive paradigm for analyzing IoT data at the edge of the Internet before transmitting processed results to a cloud. However, the relative scarcity of fog computing resources combined with the workloads' nonstationary properties make it impossible to allocate a static set of resources for each application. We propose Gesscale, a resource auto-scaler which guarantees that a stream processing application maintains a sufficient Maximum Sustainable Throughput to process its incoming data with no undue delay, while not using more resources than strictly necessary. Gesscale derives its decisions about when to rescale and which geo-distributed resource(s) to add or remove on a performance model that gives precise predictions about the future maximum sustainable throughput after reconfiguration. We show that this auto-scaler uses 17% less resources, generates 52% fewer reconfigurations, and processes more input data than baseline auto-scalers based on threshold triggers or a simpler performance model. Index Terms-Stream processing, auto-scaling, fog computing.
Article
Full-text available
Elastic distributed stream processing systems are able to dynamically adapt to changes in the workload. Often, these systems react to the rate of incoming data, or to the level of resource utilization, by scaling up or down. The goal is to optimize the system's resource usage, thereby reducing its operational cost. However, such scaling operations consume resources on their own, introducing a certain overhead of resource usage, and therefore cost, for every scaling operation. In addition, migrations caused by scaling operations inevitably lead to brief processing gaps. Therefore, an excessive number of scaling operations should be avoided. We approach this problem by preventing unnecessary scaling operations and over-compensating reactions to short-term changes in the workload. This allows to maintain elasticity, while also minimizing the incurred overhead cost of scaling operations. To achieve this, we use advanced filtering techniques from the field of signal processing to pre-process raw system measurements, thus mitigating superfluous scaling operations. We perform a real-world testbed evaluation verifying the effects, and provide a break-even cost analysis to show the economic feasibility of our approach.
Article
Full-text available
Elasticity is highly desirable for stream processing systems to guarantee low latency against workload dynamics, such as surges in data arrival rate and fluctuations in data distribution. Existing systems achieve elasticity following a resource-centric approach that uses dynamic key partitioning across the parallel instances, i.e. executors, to balance the workload and scale operators. However, such operator-level key repartitioning needs global synchronization and prohibits rapid elasticity. To address this problem, we propose an executor-centric approach, whose core idea is to avoid operator-level key repartitioning while implementing each executor as the building block of elasticity. Following this new approach, we design the Elasticutor framework with two level of optimizations: i) a novel implementation of executors, i.e., elastic executors, that perform elastic multi-core execution via efficient intra-executor load balancing and executor scaling and ii) a global model-based scheduler that dynamically allocates CPU cores to executors based on the instantaneous workloads. We implemented a prototype of Elasticutor and conducted extensive experiments. Our results show that Elasticutor doubles the throughput and achieves an average processing latency up to 2 orders of magnitude lower than previous methods, for a dynamic workload of real-world applications.
Article
Full-text available
Distributed stream processing frameworks are designed to perform continuous computation on possibly unbounded data streams whose rates can change over time. Devising solutions to make such systems elastically scale is a fundamental goal to achieve desired performance and cut costs caused by resource over-provisioning. These systems can be scaled along two dimensions: the operator parallelism and the number of resources. In this paper, we show how these two dimensions, as two symbiotic entities, are independent but must mutually interact for the global benefit of the system. On the basis of this observation, we propose a fine-grained model for estimating the resource utilization of a stream processing application that enables the independent scaling of operators and resources. A simple, yet effective, combined management of the two dimensions allows us to propose ELYSIUM, a novel elastic scaling approach that provides efficient resource utilization. We implemented the proposed approach within Apache Storm and tested it by running two real-world applications with different input load curves. The outcomes backup our claims showing that the proposed symbiotic management outperforms elastic scaling strategies where operators and resources are jointly scaled.
Article
Full-text available
The Internet of Things (IoT) is an emerging technology paradigm where millions of sensors and actuators help monitor and manage physical, environmental, and human systems in real time. The inherent closed-loop responsiveness and decision making of IoT applications make them ideal candidates for using low latency and scalable stream processing platforms. Distributed stream processing systems (DSPS) hosted in cloud data centers are becoming the vital engine for real-time data processing and analytics in any IoT software architecture. But the efficacy and performance of contemporary DSPS have not been rigorously studied for IoT applications and data streams. Here, we propose RIoTBench, a real-time IoT benchmark suite, along with performance metrics, to evaluate DSPS for streaming IoT applications. The benchmark includes 27 common IoT tasks classified across various functional categories and implemented as modular microbenchmarks. Further, we define four IoT application benchmarks composed from these tasks based on common patterns of data preprocessing, statistical summarization, and predictive analytics that are intrinsic to the closed-loop IoT decision-making life cycle. These are coupled with four stream workloads sourced from real IoT observations on smart cities and smart health, with peak streams rates that range from 500 to 10 000 messages/second from up to 3 million sensors. We validate the RIoTBench suite for the popular Apache Storm DSPS on the Microsoft Azure public cloud and present empirical observations. This suite can be used by DSPS researchers for performance analysis and resource scheduling, by IoT practitioners to evaluate DSPS platforms, and even reused within IoT solutions.
Article
Full-text available
Paradigms like Internet of Things and the most recent Internet of Everything are shifting the attention towards systems able to process unbounded sequences of items in the form of data streams. In the real world, data streams may be highly variable, exhibiting burstiness in the arrival rate and non-stationarities such as trends and cyclic behaviors. Furthermore, input items may be not ordered according to timestamps. This raises the complexity of stream processing systems, which must support elastic resource management and autonomic QoS control through sophisticated strategies and run-time mechanisms. In this paper we present Elastic-PPQ, a system for processing spatial preference queries over dynamic data streams. The key aspect of the system design is the existence of two adaptation levels handling workload variations at different time-scales. To address fast time-scale variations we design a fine regulatory mechanism of load balancing supported by a control-theoretic approach. The logic of the second adaptation level, targeting slower time-scale variations, is incorporated in a Fuzzy Logic Controller that makes scale in/out decisions of the system parallelism degree. The approach has been successfully evaluated under synthetic and real-world datasets.
Article
Full-text available
Apache Flink 1 is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph analysis) can be expressed and executed as pipelined fault-tolerant dataflows. In this paper, we present Flink's architecture and expand on how a (seemingly diverse) set of use cases can be unified under a single execution model.
Thesis
Full-text available
Stream-based systems are representative of several application domains including video, audio, networking, graphic processing, etc. Stream programs may run on different kinds of parallel architectures (desktop, servers, cell phones, and supercomputers) and represent significant workloads on our current computing systems. Nevertheless, most of them are still not parallelized. Moreover, when new software has to be developed, programmers often face a trade-off between coding productivity, code portability, and performance. To solve this problem, we provide a new Domain-Specific Language (DSL) that naturally/on-the-fly captures and represents parallelism for stream-based applications. The aim is to offer a set of attributes (through annotations) that preserves the program's source code and is not architecture-dependent for annotating parallelism. We used the C++ attribute mechanism to design a "de-facto" standard C++ embedded DSL named SPar. However, the implementation of DSLs using compiler-based tools is difficult, complicated, and usually requires a significant learning curve. This is even harder for those who are not familiar with compiler technology. Therefore, our motivation is to simplify this path for other researchers (experts in their domain) with support tools (our tool is CINCLE) to create high-level and productive DSLs through powerful and aggressive source-to-source transformations. In fact, parallel programmers can use their expertise without having to design and implement low-level code. The main goal of this thesis was to create a DSL and support tools for high-level stream parallelism in the context of a programming framework that is compiler-based and domain-oriented. Thus, we implemented SPar using CINCLE. SPar supports the software developer with productivity, performance, and code portability while CINCLE provides sufficient support to generate new DSLs. Also, SPar targets source-to-source transformation producing parallel pattern code built on top of FastFlow and MPI. Finally, we provide a full set of experiments showing that SPar provides better coding productivity without significant performance degradation in multi-core systems as well as transformation rules that are able to achieve code portability (for cluster architectures) through its generalized attributes.
Conference Paper
Full-text available
Key grouping is a technique used by stream processing frameworks to simplify the development of parallel stateful operators. Through key grouping a stream of tuples is partitioned in several disjoint sub-streams depending on the values contained in the tuples themselves. Each operator instance target of one sub-stream is guaranteed to receive all the tuples containing a specific key value. A common solution to implement key grouping is through hash functions that, however, are known to cause load imbalances on the target operator instances when the input data stream is characterized by a skewed value distribution. In this paper we present DKG, a novel approach to key grouping that provides near-optimal load distribution for input streams with skewed value distribution. DKG starts from the simple observation that with such inputs the load balance is strongly driven by the most frequent values; it identifies such values and explicitly maps them to sub-streams together with groups of less frequent items to achieve a near-optimal load balance. We provide theoretical approximation bounds for the quality of the mapping derived by DKG and show, through both simulations and a running prototype, its impact on stream processing applications.
Conference Paper
Full-text available
We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.
Article
Full-text available
In this paper, we describe ZooKeeper, a service for co-ordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, ZooKeeper aims to provide a simple and high performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a repli-cated, centralized service. The interface exposed by Zoo-Keeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet pow-erful coordination service. The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all re-quests that change the ZooKeeper state. These design de-cisions enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers. We show for the target workloads, 2:1 to 100:1 read to write ratio, that ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used exten-sively by client applications.
Conference Paper
Full-text available
Borealis is a second-generation distributed stream pro-cessing engine that is being developed at Brandeis Uni-versity, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora [14] and distribution functionality from Medusa [51]. Bo-realis modifies and extends both systems in non-trivial and critical ways to provide advanced capabilities that are commonly required by newly-emerging stream pro-cessing applications. In this paper, we outline the basic design and function-ality of Borealis. Through sample real-world applica-tions, we motivate the need for dynamically revising query results and modifying query specifications. We then describe how Borealis addresses these challenges through an innovative set of features, including revi-sion records, time travel, and control lines. Finally, we present a highly flexible and scalable QoS-based opti-mization model that operates across server and sensor networks and a new fault-tolerance model with flexible consistency-availability trade-offs. 1 Introduction
Conference Paper
Full-text available
Evaluating the resiliency of stateful Internet services to significant workload spikes and data hotspots requires realistic workload traces that are usually very difficult to obtain. A popular approach is to create a workload model and generate synthetic workload, however, there exists no characterization and model of stateful spikes. In this paper we analyze five workload and data spikes and find that they vary significantly in many important aspects such as steepness, magnitude, duration, and spatial locality. We propose and validate a model of stateful spikes that allows us to synthesize volume and data spikes and could thus be used by both cloud computing users and providers to stress-test their infrastructure.
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
Full-text available
Cloud computing focuses on delivery of reliable, secure, fault-tolerant, sustainable, and scalable infrastructures for hosting Internet-based application services. These applications have different composition, configuration, and deployment requirements. Quantifying the performance of scheduling and allocation policy on a Cloud infrastructure (hardware, software, services) for different application and service models under varying load, energy performance (power consumption, heat dissipation), and system size is an extremely challenging problem to tackle. To simplify this process, in this paper we propose CloudSim: a new generalized and extensible simulation framework that enables seamless modelling, simulation, and experimentation of emerging Cloud computing infrastructures and management services. The simulation framework has the following novel features: (i) support for modelling and instantiation of large scale Cloud computing infrastructure, including data centers on a single physical computing node and java virtual machine; (ii) a self-contained platform for modelling data centers, service brokers, scheduling, and allocations policies; (iii) availability of virtualization engine, which aids in creation and management of multiple, independent, and co-hosted virtualized services on a data center node; and (iv) flexibility to switch between space-shared and time-shared allocation of processing cores to virtualized services.
Conference Paper
Full-text available
Distributed and parallel computing environments are becoming cheap and commonplace. The availability of large numbers of CPU's makes it possible to process more data at higher speeds. Stream-processing systems are also becoming more important, as broad classes of applications require results in real-time. Since load can vary in unpredictable ways, exploiting the abundant processor cycles requires effective dynamic load distribution techniques. Although load distribution has been extensively studied for the traditional pull-based systems, it has not yet been fully studied in the context of push-based continuous query processing. In this paper, we present a correlation based load distribution algorithm that aims at avoiding overload and minimizing end-to-end latency by minimizing load variance and maximizing load correlation. While finding the optimal solution for such a problem is NP-hard, our greedy algorithm can find reasonable solutions in polynomial time. We present both a global algorithm for initial load distribution and a pair-wise algorithm for dynamic load migration.
Conference Paper
Elasticity is highly desirable for stream systems to guarantee low latency against workload dynamics, such as surges in arrival rate and fluctuations in data distribution. Existing systems achieve elasticity using a resource-centric approach that repartitions keys across the parallel instances, i.e., executors, to balance the workload and scale operators. However, such operator-level repartitioning requires global synchronization and prohibits rapid elasticity. We propose an executor-centric approach that avoids operator-level key repartitioning and implements executors as the building blocks of elasticity. By this new approach, we design the Elasticutor framework with two level of optimizations: i) a novel implementation of executors, i.e., elastic executors, that perform elastic multi-core execution via efficient intra-executor load balancing and executor scaling and ii) a global model-based scheduler that dynamically allocates CPU cores to executors based on the instantaneous workloads. We implemented a prototype of Elasticutor and conducted extensive experiments. We show that Elasticutor doubles the throughput and achieves up to two orders of magnitude lower latency than previous methods for dynamic workloads of real-world applications.
Chapter
In the last decade, stream processing has become a very active research domain motivated by the growing number of stream-based applications. These applications make use of continuous queries, which are processed by a stream processing engine (SPE) to generate timely results given the ephemeral input data. Variations of input data streams, in terms of both volume and distribution of values, have a large impact on computational resource requirements. Dynamic and Automatic Balanced Scaling for Storm (DABS-Storm) is an original solution for handling dynamic adaptation of continuous queries processing according to evolution of input stream properties, while controlling the system stability. Both fluctuations in data volume and distribution of values within data streams are handled by DABS-Storm to adjust the resources usage that best meets processing needs. To achieve this goal, the DABS-Storm holistic approach combines a proactive auto-parallelization algorithm with a latency-aware load balancing strategy.
Article
Data Stream Processing (DSP) applications are widely used to develop new pervasive services, which require to seamlessly process huge amounts of data in a near real-time fashion. To keep up with the high volume of daily produced data, these applications need to dynamically scale their execution on multiple computing nodes, so to process the incoming data flow in parallel. In this paper, we present a hierarchical distributed architecture for the autonomous control of elastic DSP applications. It consists of a two-layered hierarchical solution, where a centralized per-application component coordinates the run-time adaptation of subordinated distributed components, which, in turn, locally control the adaptation of single DSP operators. Thanks to its features, the proposed solution can efficiently run in large-scale Fog computing environments. Exploiting this framework, we design several distributed self-adaptation policies, including a popular threshold-based approach and two reinforcement learning solutions. We integrate the hierarchical architecture and the devised self-adaptation policies in Apache Storm, a popular open-source DSP framework. Relying on the DEBS 2015 Grand Challenge as a benchmark application, we show the benefits of the presented self-adaptation policies, and discuss the strengths of reinforcement learning based approaches, which autonomously learn from experience how to optimize the application performance.
Article
For the task of analyzing survival data to derive risk factors associated with mortality, physicians, researchers, and biostatisticians have typically relied on certain types of regression techniques, most notably the Cox model. With the advent of more widely distributed computing power, methods which require more complex mathematics have become increasingly common. Particularly in this era of "big data" and machine learning, survival analysis has become methodologically broader. This paper aims to explore one technique known as Random Forest. The Random Forest technique is a regression tree technique which uses bootstrap aggregation and randomization of predictors to achieve a high degree of predictive accuracy. The various input parameters of the random forest are explored. Colon cancer data (n = 66,807) from the SEER database is then used to construct both a Cox model and a random forest model to determine how well the models perform on the same data. Both models perform well, achieving a concordance error rate of approximately 18%.
Article
This study aims to investigate the influence of the amount of clustering [intraclass correlation (ICC) = 0%, 5%, or 20%], the number of events per variable (EPV) or candidate predictor (EPV = 5, 10, 20, or 50), and backward variable selection on the performance of prediction models. Researchers frequently combine data from several centers to develop clinical prediction models. In our simulation study, we developed models from clustered training data using multilevel logistic regression and validated them in external data. The amount of clustering was not meaningfully associated with the models' predictive performance. The median calibration slope of models built in samples with EPV = 5 and strong clustering (ICC = 20%) was 0.71. With EPV = 5 and ICC = 0%, it was 0.72. A higher EPV related to an increased performance: the calibration slope was 0.85 at EPV = 10 and ICC = 20% and 0.96 at EPV = 50 and ICC = 20%. Variable selection sometimes led to a substantial relative bias in the estimated predictor effects (up to 118% at EPV = 5), but this had little influence on the model's performance in our simulations. We recommend at least 10 EPV to fit prediction models in clustered data using logistic regression. Up to 50 EPV may be needed when variable selection is performed. Copyright © 2015 Elsevier Inc. All rights reserved.
Conference Paper
Scalable stream processing and continuous dataflow systems are gaining traction with the rise of big data due to the need for processing high velocity data in near real time. Unlike batch processing systems such as MapReduce and workflows, static scheduling strategies fall short for continuous data flows due to the variations in the input data rates and the need for sustained throughput. The elastic resource provisioning of cloud infrastructure is valuable to meet the changing resource needs of such continuous applications. However, multi-tenant cloud resources introduce yet another dimension of performance variability that impacts the application's throughput. In this paper we propose Plastic, an adaptive scheduling algorithm that balances resource cost and application throughput using a prediction-based look-ahead approach. It not only addresses variations in the input data rates but also the underlying cloud infrastructure. In addition, we also propose several simpler static scheduling heuristics that operate in the absence of accurate performance prediction model. These static and adaptive heuristics are evaluated through extensive simulations using performance traces obtained from Amazon AWS IaaS public cloud. Our results show an improvement of up to 20% in the overall profit as compared to the reactive adaptation algorithm.
Conference Paper
Storm has emerged as a promising computation platform for stream data processing. In this paper, we first show inefficiencies of the current practice of Storm scheduling and challenges associated with applying traffic-aware online scheduling in Storm via experimental results and analysis. Motivated by our observations, we design and implement a new stream data processing system based on Storm, namely, T-Storm. Compared to Storm, T-Storm has the following desirable features: 1) based on runtime states, it accelerates data processing by leveraging effective traffic-aware scheduling for assigning/re-assigning tasks dynamically, which minimizes inter-node and inter-process traffic while ensuring no worker nodes are overloaded, 2) it enables fine-grained control over worker node consolidation such that T-Storm can achieve better performance with even fewer worker nodes, 3) it allows hot-swapping of scheduling algorithms and adjustment of scheduling parameters on the fly, and 4) it is transparent to Storm users (i.e., Storm applications can be ported to run on T-Storm without any changes). We conducted real experiments in a cluster using well-known data processing applications for performance evaluation. Extensive experimental results show that compared to Storm (with the default scheduler), T-Storm can achieve over 84% and 27% speedup on lightly and heavily loaded topologies respectively (in terms of average processing time) with 30% less number of worker nodes.
Conference Paper
In this paper, we propose a framework for adaptive admission control and management of a large number of dynamic input streams in parallel stream processing engines. The framework takes as input any available information about input stream behaviors and the requirements of the query processing layer, and adaptively decides how to adjust the entry points of streams to the system. As the optimization decisions propagate early from input management layer to the query processing layer, the size of the cluster is mini- mized, the load balance is maintained, and latency bounds of queries are met in a more effective and timely manner. Declarative integration of external meta-data about data sources makes the system more robust and resource-efficient. Additionally, exploiting knowledge about queries moves data partitioning to the input management layer, where better load balance for query processing can be achieved. We im- plemented these techniques as a part of the Borealis stream processing system and conducted experiments showing the performance benefits of our framework.
Article
In order to explain many secret events of natural phenomena, analyzing non-stationary series is generally an attractive issue for various research areas. The wavelet transform technique, which has been widely used last two decades, gives better results than former techniques for the analysis of earth science phenomena and for feature detection of real measurements. In this study, a new technique is offered for streamflow modeling by using the discrete wavelet transform. This new technique depends on the feature detection characteristic of the wavelet transform. The model was applied to two geographical locations with different climates. The results were compared with energy variation and error values of models. The new technique offers a good advantage through a physical interpretation. This technique is applied to streamflow regression models, because they are simple and widely used in practical applications. However, one can apply this technique to other models.
Conference Paper
Traffic-aware real-time route planning has recently been an application of increasing interest for metropolitan cities with busy traffic. This paper approaches the problem from a stream processing point of view and proposes a general architecture to solve it. This work is inspired by a real use case and is implemented on an industry-strength stream processing engine. Experimental results on this implementation demonstrate the scalability of this approach in terms of increasing data and query rates.
Conference Paper
The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. The scalability of these dataflows is limited by their constituent, stateful operators - e.g. windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the flux mechanism and these policies can provide several factors improvement in throughput and orders of magnitude improvement in average latency over the static case.
Article
This paper discusses a quasi-likelihood (QL) approach to regression analysis with time series data. We consider a class of Markov models, referred to by Cox (1981, Scandinavian Journal of Statistics 8, 93-115) as "observation-driven" models in which the conditional means and variances given the past are explicit functions of past outcomes. The class includes autoregressive and Markov chain models for continuous and categorical observations as well as models for counts (e.g., Poisson) and continuous outcomes with constant coefficient of variation (e.g., gamma). We focus on Poisson and gamma data for illustration. Analogous to QL for independent observations, large-sample properties of the regression coefficients depend only on correct specification of the first conditional moment.
Article
This paper specifies the Linear Road Benchmark for Stream Data Management Systems (SDMS). Stream Data Management Systems process streaming data by executing continuous and historical queries while producing query results in real-time. This benchmark makes it possible to compare the performance characteristics of SDMS' relative to each other and to alternative (e.g., Relational Database) systems. Linear Road has been endorsed as an SDMS benchmark by the developers of both the Aurora [1] (out of Brandeis University, Brown University and MIT) and STREAM [8] (out of Stanford University) stream systems.
Darts: user-friendly modern machine learning for time series
  • Herzen
Getting Started with Storm -Continuous Streaming Computation with Twitter's Cluster Technology
  • J Leibiusky
  • G Eisbruch
  • D Simonassi
J. Leibiusky, G. Eisbruch, D. Simonassi, Getting Started with Storm -Continuous Streaming Computation with Twitter's Cluster Technology, O'Reilly, 2012.
Making Sense of Stream Processing
  • M Kleppmann
M. Kleppmann, Making Sense of Stream Processing, O'Reilly Media, Incorporated, 2016.
Darts: User-friendly modern machine learning for time series
  • J Herzen
  • F Lässig
  • S G Piazzetta
  • T Neuer
  • L Tafti
  • G Raille
  • T V Pottelbergh
  • M Pasieka
  • A Skrodzki
  • N Huguenin
  • M Dumonal
  • J Koscisz
  • D Bader
  • F Gusset
  • M Benheddi
  • C Williamson
  • M Kosinski
  • M Petrik
  • G Grosch
J. Herzen, F. Lässig, S. G. Piazzetta, T. Neuer, L. Tafti, G. Raille, T. V. Pottelbergh, M. Pasieka, A. Skrodzki, N. Huguenin, M. Dumonal, J. Koscisz, D. Bader, F. Gusset, M. Benheddi, C. Williamson, M. Kosinski, M. Petrik, G. Grosch, Darts: User-friendly modern machine learning for time series, J. Mach. Learn. Res. 23 (2022) 124:1-124:6.
Multi layer perceptron
  • M Riedmiller
  • A Lernen
M. Riedmiller, A. Lernen, Multi layer perceptron, Machine Learning Lab Special Lecture, University of Freiburg (2014) 7-24.
  • A Gruzd
  • P Mai
  • Covid
A. Gruzd, P. Mai, COVID-19 Twitter Dataset, 2020. URL: https:// doi.org/10.5683/SP2/PXF2CU. doi:10.5683/SP2/PXF2CU.
  • S He
  • J Zhu
  • P He
  • M R Lyu
S. He, J. Zhu, P. He, M. R. Lyu, Loghub: A large collection of system log datasets towards automated log analytics, CoRR abs/2008.06448 (2020). URL: https://arxiv.org/abs/2008.06448. arXiv:2008.06448.
Fastflow: highlevel and efficient streaming on multi-core, Programming multi-core and many-core computing systems
  • M Aldinucci
  • M Danelutto
  • P Kilpatrick
  • M Torquati
M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Torquati, Fastflow: highlevel and efficient streaming on multi-core, Programming multi-core and many-core computing systems, parallel and distributed computing (2017).