Apache Flink: Stream and Batch Processing in a Single Engine
Paris Carbone (KTH & SICS Sweden)
Stephan Ewen (data Artisans)
Seif Haridi (KTH & SICS Sweden)
Asterios Katsifodimos (TU Berlin & DFKI)
Volker Markl (TU Berlin & DFKI)
Kostas Tzoumas (data Artisans)
KTH & SICS Sweden: parisc,haridi@kth.se
data Artisans: first@data-artisans.com
TU Berlin & DFKI: first.last@tu-berlin.de
Abstract
Apache Flink¹ is an open-source system for processing streaming and batch data. Flink is built on the
philosophy that many classes of data processing applications, including real-time analytics, continu-
ous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph
analysis) can be expressed and executed as pipelined fault-tolerant dataflows. In this paper, we present
Flink’s architecture and expand on how a (seemingly diverse) set of use cases can be unified under a
single execution model.
1 Introduction
Data-stream processing (e.g., as exemplified by complex event processing systems) and static (batch) data pro-
cessing (e.g., as exemplified by MPP databases and Hadoop) were traditionally considered as two very different
types of applications. They were programmed using different programming models and APIs, and were executed
by different systems (e.g., dedicated streaming systems such as Apache Storm, IBM Infosphere Streams,
Microsoft StreamInsight, or Streambase versus relational databases or execution engines for Hadoop, including
Apache Spark and Apache Drill). Traditionally, batch data analysis made up the lion’s share of the use cases,
data sizes, and market, while streaming data analysis mostly served specialized applications.
It is becoming more and more apparent, however, that a huge number of today’s large-scale data processing
use cases handle data that is, in reality, produced continuously over time. These continuous streams of data come
for example from web logs, application logs, sensors, or as changes to application state in databases (transaction
log records). Rather than treating the streams as streams, today’s setups ignore the continuous and timely nature
of data production. Instead, data records are (often artificially) batched into static data sets (e.g., hourly, daily, or
monthly chunks) and then processed in a time-agnostic fashion. Data collection tools, workflow managers, and
schedulers orchestrate the creation and processing of batches, in what is actually a continuous data processing
pipeline. Architectural patterns such as the “lambda architecture” [21] combine batch and stream processing
systems to implement multiple paths of computation: a streaming fast path for timely approximate results, and a
batch offline path for late accurate results. All these approaches suffer from high latency (imposed by batches),
Copyright 2015 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for
advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any
copyrighted component of this work in other works must be obtained from the IEEE.
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering
¹The authors of this paper make no claim to being the sole inventors or implementers of the ideas behind Apache Flink; rather, they are a
group of people attempting to accurately document Flink’s concepts and their significance. Consult Section 7 for acknowledgements.
high complexity (connecting and orchestrating several systems, and implementing business logic twice), as well
as arbitrary inaccuracy, as the time dimension is not explicitly handled by the application code.
Apache Flink follows a paradigm that embraces data-stream processing as the unifying model for real-time
analysis, continuous streams, and batch processing both in the programming model and in the execution engine.
In combination with durable message queues that allow quasi-arbitrary replay of data streams (like Apache
Kafka or Amazon Kinesis), stream processing programs make no distinction between processing the latest
events in real-time, continuously aggregating data periodically in large windows, or processing terabytes of
historical data. Instead, these different types of computations simply start their processing at different points
in the durable stream, and maintain different forms of state during the computation. Through a highly flexible
windowing mechanism, Flink programs can compute both early and approximate, as well as delayed and accu-
rate, results in the same operation, obviating the need to combine different systems for the two use cases. Flink
supports different notions of time (event-time, ingestion-time, processing-time) in order to give programmers
high flexibility in defining how events should be correlated.
At the same time, Flink acknowledges that there is, and will be, a need for dedicated batch processing
(dealing with static data sets). Complex queries over static data are still a good match for a batch processing
abstraction. Furthermore, batch processing is still needed both for legacy implementations of streaming use
cases, and for analysis applications where no efficient algorithms are yet known that perform this kind of pro-
cessing on streaming data. Batch programs are special cases of streaming programs, where the stream is finite,
and the order and time of records does not matter (all records implicitly belong to one all-encompassing win-
dow). However, to support batch use cases with competitive ease and performance, Flink has a specialized API
for processing static data sets, uses specialized data structures and algorithms for the batch versions of opera-
tors like join or grouping, and uses dedicated scheduling strategies. The result is that Flink presents itself as a
full-fledged and efficient batch processor on top of a streaming runtime, including libraries for graph analysis
and machine learning. Originating from the Stratosphere project [4], Flink is a top-level project of the Apache
Software Foundation that is developed and supported by a large and lively community (consisting of over 180
open-source contributors as of the time of this writing), and is used in production in several companies.
The contributions of this paper are as follows:
• we make the case for a unified architecture of stream and batch data processing, including specific optimizations that are only relevant for static data sets,
• we show how streaming, batch, iterative, and interactive analytics can be represented as fault-tolerant streaming dataflows (in Section 3),
• we discuss how we can build a full-fledged stream analytics system with a flexible windowing mechanism (in Section 4), as well as a full-fledged batch processor (in Section 5) on top of these dataflows.
2 System Architecture
In this section we lay out the architecture of Flink as a software stack and as a distributed system. While Flink’s
stack of APIs continues to grow, we can distinguish four main layers: deployment, core, APIs, and libraries.
Flink’s Runtime and APIs. Figure 1 shows Flink’s software stack. The core of Flink is the distributed dataflow
engine, which executes dataflow programs. A Flink runtime program is a DAG of stateful operators connected
with data streams. There are two core APIs in Flink: the DataSet API for processing finite data sets (often
referred to as batch processing), and the DataStream API for processing potentially unbounded data streams
(often referred to as stream processing). Flink’s core runtime engine can be seen as a streaming dataflow engine,
and both the DataSet and DataStream APIs create runtime programs executable by the engine. As such, it serves
as the common fabric to abstract both bounded (batch) and unbounded (stream) processing. On top of the core
Figure 1: The Flink software stack. Deployment layer: Local (single JVM, embedded), Cluster (standalone, YARN), and Cloud (Google Compute Engine, EC2). Core: the runtime, a distributed streaming dataflow engine. APIs & libraries: the DataSet API (batch processing) with FlinkML (machine learning), Gelly (graph API/library), and the Table API (batch); and the DataStream API (stream processing) with CEP (complex event processing) and the Table API (streaming).
Figure 2: The Flink process model. The Flink client (graph builder and optimizer) translates the program into a dataflow graph and submits it to the Job Manager (scheduler, checkpoint coordinator); Task Managers execute tasks in task slots, manage memory/IO and network resources, exchange data streams with one another, and report task status, heartbeats, and statistics to the Job Manager, which in turn triggers checkpoints. The example Flink program shown in the figure:

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Create initial IterativeDataSet
IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);
DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer i) throws Exception {
        double x = Math.random();
        double y = Math.random();
        return i + ((x * x + y * y < 1) ? 1 : 0);
    }
});
APIs, Flink bundles domain-specific libraries and APIs that generate DataSet and DataStream API programs:
currently, FlinkML for machine learning, Gelly for graph processing, and Table for SQL-like operations.
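To illustrate how both core APIs target the same runtime, the following is a minimal sketch (not taken from the paper; imports are omitted, following the paper’s own listings) of the same word-count logic written once against the DataStream API and once against the DataSet API. Tokenizer is a hypothetical user-defined FlatMapFunction that splits lines into (word, 1) pairs, and the host/port and file path are placeholders.

// Streaming word count via the DataStream API (unbounded input).
StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
senv.socketTextStream("localhost", 9999)      // placeholder source
    .flatMap(new Tokenizer())                 // hypothetical FlatMapFunction<String, Tuple2<String, Integer>>
    .keyBy(0)                                 // partition the stream by the word field
    .sum(1)                                   // continuously updated counts
    .print();
senv.execute("streaming word count");

// Batch word count via the DataSet API (bounded input).
ExecutionEnvironment benv = ExecutionEnvironment.getExecutionEnvironment();
benv.readTextFile("hdfs:///path/to/input")    // placeholder path
    .flatMap(new Tokenizer())
    .groupBy(0)
    .sum(1)
    .print();

Both programs are translated by the client into dataflow graphs for the same runtime engine, as described next.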
As depicted in Figure 2, a Flink cluster comprises three types of processes: the client, the Job Manager, and
at least one Task Manager. The client takes the program code, transforms it to a dataflow graph, and submits
that to the JobManager. This transformation phase also examines the data types (schema) of the data exchanged
between operators and creates serializers and other type/schema specific code. DataSet programs additionally
go through a cost-based query optimization phase, similar to the physical optimizations performed by relational
query optimizers (for more details see Section 5.1).
The JobManager coordinates the distributed execution of the dataflow. It tracks the state and progress of each
operator and stream, schedules new operators, and coordinates checkpoints and recovery. In a high-availability
setup, the JobManager persists a minimal set of metadata at each checkpoint to a fault-tolerant storage, such that
a standby JobManager can reconstruct the checkpoint and recover the dataflow execution from there. The actual
data processing takes place in the TaskManagers. A TaskManager executes one or more operators that produce
streams, and reports on their status to the JobManager. The TaskManagers maintain the buffer pools to buffer or
materialize the streams, and the network connections to exchange the data streams between operators.
3 The Common Fabric: Streaming Dataflows
Although users can write Flink programs using a multitude of APIs, all Flink programs eventually compile down
to a common representation: the dataflow graph. The dataflow graph is executed by Flink’s runtime engine, the
common layer underneath both the batch processing (DataSet) and stream processing (DataStream) APIs.
3.1 Dataflow Graphs
The dataflow graph as depicted in Figure 3 is a directed acyclic graph (DAG) that consists of: (i) stateful
operators and (ii) data streams that represent data produced by an operator and are available for consumption
by operators. Since dataflow graphs are executed in a data-parallel fashion, operators are parallelized into
one or more parallel instances called subtasks and streams are split into one or more stream partitions (one
partition per producing subtask). The stateful operators, which may be stateless as a special case, implement
all of the processing logic (e.g., filters, hash joins and stream window functions). Many of these operators
are implementations of textbook versions of well known algorithms. In Section 4, we provide details on the
implementation of windowing operators. Streams distribute data between producing and consuming operators
in various patterns, such as point-to-point, broadcast, re-partition, fan-out, and merge.
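As a hedged illustration (not taken from the paper) of how API calls map onto this graph, the short DataStream sketch below produces a source, a parallelized stateless operator, and a sink, with an explicit round-robin re-partitioning and a broadcast pattern between them; the socket source is a placeholder and imports are omitted.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 9999)       // source operator: one stream partition per source subtask
   .rebalance()                               // round-robin re-partitioning of the intermediate stream
   .filter(new FilterFunction<String>() {     // stateless operator (a special case of a stateful one)
       @Override
       public boolean filter(String line) {
           return !line.isEmpty();
       }
   })
   .setParallelism(4)                         // the filter operator is split into 4 parallel subtasks
   .broadcast()                               // every downstream subtask receives all records
   .print();                                  // sink operator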
Figure 3: A simple dataflow graph with sources (SRC1, SRC2), a stateful operator (OP1), sinks (SNK1, SNK2), and intermediate data streams (IS1–IS3), illustrating blocking (materialized) and pipelined (transient) data exchange, operator state, data records, and control events.
Figure 4: The effect of buffer timeout (in milliseconds) on the throughput and 99th-percentile latency of record delivery.
3.2 Data Exchange through Intermediate Data Streams
Flink’s intermediate data streams are the core abstraction for data-exchange between operators. An intermediate
data stream represents a logical handle to the data that is produced by an operator and can be consumed by one
or more operators. Intermediate streams are logical in the sense that the data they point to may or may not be
materialized on disk. The particular behavior of a data stream is parameterized by the higher layers in Flink
(e.g., the program optimizer used by the DataSet API).
Pipelined and Blocking Data Exchange. Pipelined intermediate streams exchange data between concurrently
running producers and consumers resulting in pipelined execution. As a result, pipelined streams propagate
back pressure from consumers to producers, modulo some elasticity via intermediate buffer pools, in order
to compensate for short-term throughput fluctuations. Flink uses pipelined streams for continuous streaming
programs, as well as for many parts of batch dataflows, in order to avoid materialization when possible. Blocking
streams, on the other hand, are applicable to bounded data streams. A blocking stream buffers all of the producing
operator’s data before making it available for consumption, thereby separating the producing and consuming
operators into different execution stages. Blocking streams naturally require more memory, frequently spill to
secondary storage, and do not propagate backpressure. They are used to isolate successive operators against
each other (where desired) and in situations where plans with pipeline-breaking operators, such as sort-merge
joins, may cause distributed deadlocks.
Balancing Latency and Throughput. Flink’s data-exchange mechanisms are implemented around the ex-
change of buffers. When a data record is ready on the producer side, it is serialized and split into one or more
buffers (a buffer can also fit multiple records) that can be forwarded to consumers. A buffer is sent to a consumer
either i) as soon as it is full or ii) when a timeout condition is reached. This enables Flink to achieve high
throughput by setting the size of buffers to a high value (e.g., a few kilobytes), as well as low latency by setting
the buffer timeout to a low value (e.g., a few milliseconds). Figure 4 shows the effect of buffer timeouts on the
throughput and latency of delivering records in a simple streaming grep job on 30 machines (120 cores). Flink
can achieve an observable 99th-percentile latency of 20ms. The corresponding throughput is 1.5 million events
per second. As we increase the buffer timeout, we see an increase in latency with an increase in throughput,
until full throughput is reached (i.e., buffers fill up faster than the timeout expiration). At a buffer timeout of
50ms, the cluster reaches a throughput of more than 80 million events per second with a 99th-percentile latency
of 50ms.
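The buffer timeout is exposed as a configuration knob of the DataStream API’s execution environment; the following is a small illustrative sketch (not from the paper).

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Flush outgoing buffers at least every 5 ms: lower latency at the cost of peak throughput.
env.setBufferTimeout(5);

// A timeout of 0 flushes after every record (lowest latency);
// a negative timeout flushes only when a buffer is full (highest throughput).
// env.setBufferTimeout(0);
// env.setBufferTimeout(-1);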
Control Events. Apart from exchanging data, streams in Flink communicate different types of control events.
These are special events injected in the data stream by operators, and are delivered in-order along with all other
Figure 5: Asynchronous Barrier Snapshotting.
data records and events within a stream partition. The receiving operators react to these events by performing
certain actions upon their arrival. Flink uses many special types of control events, including:
• checkpoint barriers that coordinate checkpoints by dividing the stream into a pre-checkpoint and a post-checkpoint part (discussed in Section 3.3),
• watermarks signaling the progress of event-time within a stream partition (discussed in Section 4.1),
• iteration barriers signaling that a stream partition has reached the end of a superstep, in Bulk/Stale-Synchronous-Parallel iterative algorithms on top of cyclic dataflows (discussed in Section 5.3).
As mentioned above, control events assume that a stream partition preserves the order of records. To this end,
unary operators in Flink that consume a single stream partition guarantee a FIFO order of records. However,
operators receiving more than one stream partition merge the streams in arrival order, in order to keep up with
the streams’ rates and avoid back pressure. As a result, streaming dataflows in Flink do not provide ordering
guarantees after any form of repartitioning or broadcasting, and the responsibility of dealing with out-of-order
records is left to the operator implementation. We found that this arrangement gives the most efficient design, as
most operators do not require deterministic order (e.g., hash-joins, maps), and operators that need to compensate
for out-of-order arrivals, such as event-time windows, can do that more efficiently as part of the operator logic.
3.3 Fault Tolerance
Flink offers reliable execution with strict exactly-once-processing consistency guarantees and deals with failures
via checkpointing and partial re-execution. The general assumption the system makes to effectively provide
these guarantees is that the data sources are persistent and replayable. Examples of such sources are files and
durable message queues (e.g., Apache Kafka). In practice, non-persistent sources can also be incorporated by
keeping a write-ahead log within the state of the source operators.
The checkpointing mechanism of Apache Flink builds on the notion of distributed consistent snapshots
to achieve exactly-once-processing guarantees. The possibly unbounded nature of a data stream makes re-
computation upon recovery impractical, as possibly months of computation will need to be replayed for a long-
running job. To bound recovery time, Flink regularly takes a snapshot of the state of the operators, including the
current position in the input streams.
The core challenge lies in taking a consistent snapshot of all parallel operators without halting the execution
of the topology. In essence, the snapshot of all operators should refer to the same logical time in the computation.
The mechanism used in Flink is called Asynchronous Barrier Snapshotting (ABS [7]). Barriers are control
records injected into the input streams that correspond to a logical time and logically separate the stream into the
part whose effects will be included in the current snapshot and the part that will be snapshotted later.
An operator receives barriers from upstream and first performs an alignment phase, making sure that the
barriers from all inputs have been received. Then, the operator writes its state (e.g., contents of a sliding window,
or custom data structures) to durable storage (e.g., the storage backend can be an external system such as HDFS).
Once the state has been backed up, the operator forwards the barrier downstream. Eventually, all operators will
register a snapshot of their state and a global snapshot will be complete. For example, in Figure 5 we show that
snapshot t2 contains all operator states that are the result of consuming all records before the t2 barrier. ABS bears
resemblance to the Chandy-Lamport algorithm for asynchronous distributed snapshots [11]. However, because
of the DAG structure of a Flink program, ABS does not need to checkpoint in-flight records, but solely relies on
the alignment phase to apply all their effects to the operator states. This guarantees that the data that needs to be
written to reliable storage is kept to the theoretical minimum (i.e., only the current state of the operators).
Recovery from failures reverts all operator states to their respective states taken from the last successful snap-
shot and restarts the input streams starting from the latest barrier for which there is a snapshot. The maximum
amount of re-computation needed upon recovery is limited to the amount of input records between two consecu-
tive barriers. Furthermore, partial recovery of a failed subtask is possible by additionally replaying unprocessed
records buffered at the immediate upstream subtasks [7].
ABS provides several benefits: i) it guarantees exactly-once state updates without ever pausing the computation;
ii) it is completely decoupled from other forms of control messages (e.g., from events that trigger the computation
of windows), and thereby does not restrict the windowing mechanism to multiples of the checkpoint interval; and
iii) it is completely decoupled from the mechanism used for reliable storage, allowing state to be backed up to
file systems, databases, etc., depending on the larger environment in which Flink is used.
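In the APIs, this mechanism is enabled by choosing a checkpoint interval and a state backend; the following is an illustrative sketch (not from the paper) using a file-system state backend with a placeholder HDFS path.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Inject checkpoint barriers (and thus draw a global snapshot) every 10 seconds.
env.enableCheckpointing(10000);

// Back up registered operator state to a durable file system; the path is a placeholder.
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));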
3.4 Iterative Dataflows
Incremental processing and iterations are crucial for applications, such as graph processing and machine learn-
ing. Support for iterations in data-parallel processing platforms typically relies on submitting a new job for
each iteration, on adding additional nodes to a running DAG [6, 25], or on feedback edges [23]. Iterations in
Flink are implemented as iteration steps, special operators that themselves can contain an execution graph (Fig-
ure 6). To maintain the DAG-based runtime and scheduler, Flink allows for iteration “head” and “tail” tasks
that are implicitly connected with feedback edges. The role of these tasks is to establish an active feedback
channel to the iteration step and provide coordination for processing data records in transit within this feedback
channel. Coordination is needed for implementing any type of structured parallel iteration model, such as the
Bulk Synchronous Parallel (BSP) model, and is implemented using control events. We explain how iterations are
implemented in the DataStream and DataSet APIs in Section 4.4 and Section 5.3, respectively.
4 Stream Analytics on Top of Dataflows
Flink’s DataStream API implements a full stream-analytics framework on top of Flink’s runtime, including the
mechanisms to manage time such as out-of-order event processing, defining windows, and maintaining and
updating user-defined state. The streaming API is based on the notion of a DataStream, a (possibly unbounded)
immutable collection of elements of a given type. Since Flink’s runtime already supports pipelined data transfers,
continuous stateful operators, and a fault-tolerance mechanism for consistent state updates, overlaying a stream
processor on top of it essentially boils down to implementing a windowing system and a state interface. As
noted, these are invisible to the runtime, which sees windows as just an implementation of stateful operators.
4.1 The Notion of Time
Flink distinguishes between two notions of time: i) event-time, which denotes the time when an event originates
(e.g., the timestamp associated with a signal arising from a sensor, such as a mobile device) and ii) processing-
time, which is the wall-clock time of the machine that is processing the data.
In distributed systems there is an arbitrary skew between event-time and processing-time [3]. This skew
may mean arbitrary delays for getting an answer based on event-time semantics. To avoid arbitrary delays, these
systems regularly insert special events called low watermarks that mark a global progress measure. In the case
of time progress, for example, a watermark includes a time attribute t indicating that all events lower than t have
Figure 6: The iteration model of Apache Flink.
already entered an operator. The watermarks aid the execution engine in processing events in the correct event
order and serialize operations, such as window computations via a unified measure of progress.
Watermarks originate at the sources of a topology, where we can determine the time inherent in future
elements. The watermarks propagate from the sources throughout the other operators of the data flow. Operators
decide how they react to watermarks. Simple operations, such as map or filter just forward the watermarks they
receive, while more complex operators that do calculations based on watermarks (e.g., event-time windows)
first compute results triggered by a watermark and then forward it. If an operation has more than one input, the
system forwards only the minimum of the incoming watermarks to the operator, thereby ensuring correct results.
Flink programs that are based on processing-time rely on local machine clocks, and hence possess a less
reliable notion of time, which can lead to inconsistent replays upon recovery. However, they exhibit lower
latency. Programs that are based on event-time provide the most reliable semantics, but may exhibit latency
due to event-time-processing-time lag. Flink includes a third notion of time as a special case of event-time
called ingestion-time, which is the time that events enter Flink. That achieves a lower processing latency than
event-time and leads to more accurate results in comparison to processing-time.
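As a hedged sketch of how these notions surface in the DataStream API (class and method names follow a recent Flink release and may differ from the version described here), a program selects a time characteristic and, for event time, attaches timestamps and watermarks at or near the sources; MyEvent and MyEventSource are hypothetical, and imports are omitted.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Choose which notion of time drives windows and other time-based operations.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// Alternatives: TimeCharacteristic.IngestionTime, TimeCharacteristic.ProcessingTime

DataStream<MyEvent> events = env
    .addSource(new MyEventSource())                          // hypothetical source
    .assignTimestampsAndWatermarks(
        // Emit watermarks that tolerate up to 10 seconds of out-of-order arrival.
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(MyEvent e) {
                return e.timestamp;                          // the event-time origin of the record
            }
        });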
4.2 Stateful Stream Processing
While most operators in Flink’s DataStream API look like functional, side-eect-free operators, they provide
support for ecient stateful computations. State is critical to many applications, such as machine-learning
model building, graph analysis, user session handling, and window aggregations. There is a plethora of dierent
types of states depending on the use case. For example, the state can be something as simple as a counter or
a sum or more complex, such as a classification tree or a large sparse matrix often used in machine-learning
applications. Stream windows are stateful operators that assign records to continuously updated buckets kept in
memory as part of the operator state.
In Flink state is made explicit and is incorporated in the API by providing: i) operator interfaces or an-
notations to statically register explicit local variables within the scope of an operator and ii) an operator-state
abstraction for declaring partitioned key-value states and their associated operations. Users can also configure
how the state is stored and checkpointed using the StateBackend abstractions provided by the system, thereby
allowing highly flexible custom state management in streaming applications. Flink’s checkpointing mechanism
(discussed in Section 3.3) guarantees that any registered state is durable with exactly-once update semantics.
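The following is an illustrative sketch (not taken from the paper; names follow a recent Flink release, and imports are omitted) of the partitioned key-value state abstraction: a rich user function registers a ValueState holding a running sum per key, and the registered state is covered by the checkpointing mechanism of Section 3.3.

public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;      // partitioned (per-key) operator state

    @Override
    public void open(Configuration parameters) {
        // Register the state with a descriptor; the name "runningSum" is arbitrary.
        sum = getRuntimeContext().getState(
            new ValueStateDescriptor<>("runningSum", Long.class, 0L));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long updated = sum.value() + in.f1;
        sum.update(updated);                     // durable with exactly-once update semantics
        out.collect(Tuple2.of(in.f0, updated));
    }
}

// Usage on a keyed stream: input.keyBy(0).flatMap(new RunningSum())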
4.3 Stream Windows
Incremental computations over unbounded streams are often evaluated over continuously evolving logical views,
called windows. Apache Flink incorporates windowing within a stateful operator that is configured via a flexible
declaration composed out of three core functions: a window assigner and optionally a trigger and an evictor.
All three functions can be selected among a pool of common predefined implementations (e.g., sliding time
windows) or can be explicitly defined by the user (i.e., user-defined functions).
More specifically, the assigner is responsible for assigning each record to logical windows. For example,
this decision can be based on the timestamp of a record when it comes to event-time windows. Note that in
the case of sliding windows, an element can belong to multiple logical windows. An optional trigger defines
when the operation associated with the window definition is performed. Finally, an optional evictor determines
which records to retain within each window. Flink’s window assignment process is uniquely capable of covering
all known window types such as periodic time- and count-windows, punctuation, landmark, session and delta
windows. Note that Flink’s windowing capabilities incorporate out-of-order processing seamlessly, similarly
to Google Cloud Dataflow [3] and, in principle, subsume these windowing models. For example, below is a
window definition with a range of 6 seconds that slides every 2 seconds (the assigner). The window results are
computed once the watermark passes the end of the window (the trigger).
stream
.window(SlidingTimeWindows.of(Time.of(6, SECONDS), Time.of(2, SECONDS)))
.trigger(EventTimeTrigger.create())
A global window creates a single logical group. The following example defines a global window (i.e., the
assigner) that invokes the operation on every 1000 events (i.e., the trigger) while keeping the last 100 elements
(i.e., the evictor).
stream
.window(GlobalWindow.create())
.trigger(Count.of(1000))
.evict(Count.of(100))
Note that if the stream above is partitioned on a key before windowing, the window operation above is local
and thus does not require coordination between workers. This mechanism can be used to implement a wide
variety of windowing functionality [3].
4.4 Asynchronous Stream Iterations
Loops in streams are essential for several applications, such as incrementally building and training machine
learning models, reinforcement learning and graph approximations [9, 15]. In most such cases, feedback loops
need no coordination. Asynchronous iterations cover the communication needs for streaming applications and
differ from parallel optimisation problems that are based on structured iterations on finite data. As presented in
Section 3.4 and Figure 6, the execution model of Apache Flink already covers asynchronous iterations, when
no iteration control mechanism is enabled. In addition, to comply with fault-tolerance guarantees, feedback
streams are treated as operator state within the implicit-iteration head operator and are part of a global snapshot
[7]. The DataStream API allows for an explicit definition of feedback streams and can trivially subsume support
for structured loops over streams [23] as well as progress tracking [9].
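A hedged sketch of an explicit feedback stream in the DataStream API follows (not from the paper; ‘env’ is a StreamExecutionEnvironment as in the earlier sketches): values are repeatedly decremented, positive values are fed back through the iteration head, and non-positive values leave the loop.

DataStream<Long> input = env.generateSequence(1, 5);

IterativeStream<Long> loop = input.iterate();               // opens the implicit iteration head/tail pair

DataStream<Long> minusOne = loop.map(new MapFunction<Long, Long>() {
    @Override
    public Long map(Long v) {
        return v - 1;
    }
});

// Records routed back over the feedback edge (asynchronously, with no superstep coordination).
DataStream<Long> feedback = minusOne.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long v) {
        return v > 0;
    }
});
loop.closeWith(feedback);

// Records that exit the loop.
DataStream<Long> output = minusOne.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long v) {
        return v <= 0;
    }
});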
5 Batch Analytics on Top of Dataflows
A bounded data set is a special case of an unbounded data stream. Thus, a streaming program that inserts all of
its input data in a window can form a batch program, and batch processing should be fully covered by Flink’s
features that were presented above. However, i) the syntax (i.e., the API for batch computation) can be simplified
(e.g., there is no need for artificial global window definitions), and ii) programs that process bounded data sets are
amenable to additional optimizations, more efficient book-keeping for fault-tolerance, and staged scheduling.
Flink approaches batch processing as follows:
• Batch computations are executed by the same runtime as streaming computations. The runtime executable may be parameterized with blocking data streams to break up large computations into isolated stages that are scheduled successively.
• Periodic snapshotting is turned off when its overhead is high. Instead, fault recovery can be achieved by replaying the lost stream partitions from the latest materialized intermediate stream (possibly the source).
• Blocking operators (e.g., sorts) are simply operator implementations that happen to block until they have consumed their entire input. The runtime is not aware of whether an operator is blocking or not. These operators use managed memory provided by Flink (either on or off the JVM heap) and can spill to disk if their inputs exceed their memory bounds.
• A dedicated DataSet API provides familiar abstractions for batch computations, namely a bounded fault-tolerant DataSet data structure and transformations on DataSets (e.g., joins, aggregations, iterations); a brief sketch follows this list.
• A query optimization layer transforms a DataSet program into an efficient executable.
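The following is a brief, hedged sketch of the DataSet API (not from the paper; file paths and schemas are placeholders, imports omitted): two bounded inputs are read, joined, and written back, with the optimizer free to pick the physical join and data-exchange strategies.

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<Integer, String>> users = env
    .readCsvFile("hdfs:///path/users.csv")          // placeholder path: (userId, name)
    .types(Integer.class, String.class);

DataSet<Tuple2<Integer, Double>> orders = env
    .readCsvFile("hdfs:///path/orders.csv")         // placeholder path: (userId, amount)
    .types(Integer.class, Double.class);

users.join(orders)                                  // physical strategy chosen by the optimizer
     .where(0).equalTo(0)                           // equi-join on userId
     .projectFirst(1).projectSecond(1)              // keep (name, amount)
     .writeAsCsv("hdfs:///path/joined");            // placeholder output path

env.execute("batch join example");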
Below we describe these aspects in greater detail.
5.1 Query Optimization
Flink’s optimizer builds on techniques from parallel database systems such as plan equivalence, cost modeling
and interesting-property propagation. However, the arbitrary UDF-heavy DAGs that make up Flink’s dataflow
programs do not allow a traditional optimizer to employ database techniques out of the box [17], since the
operators hide their semantics from the optimizer. For the same reason, cardinality and cost-estimation methods
are equally dicult to employ. Flink’s runtime supports various execution strategies including repartition and
broadcast data transfer, as well as sort-based grouping and sort- and hash-based join implementations. Flink’s
optimizer enumerates different physical plans based on the concept of interesting properties propagation [26],
using a cost-based approach to choose among multiple physical plans. The cost includes network and disk I/O
as well as CPU cost. To overcome the cardinality estimation issues in the presence of UDFs, Flink’s optimizer
can use hints that are provided by the programmer.
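For example, size hints can be attached to a join whose inputs are hidden behind UDFs; the sketch below is illustrative only (imports omitted, MyJoinFunction is a hypothetical JoinFunction), using the joinWithTiny shortcut and an explicit JoinHint.

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> users = env.fromElements(Tuple2.of(1, "alice"), Tuple2.of(2, "bob"));
DataSet<Tuple2<Integer, String>> pages = env.fromElements(Tuple2.of(1, "/home"));

// Hint that 'pages' is small enough to broadcast to all parallel instances of the join.
users.joinWithTiny(pages)
     .where(0).equalTo(0)
     .with(new MyJoinFunction());                   // hypothetical JoinFunction combining the two sides

// The same intent expressed as an explicit strategy hint.
users.join(pages, JoinHint.BROADCAST_HASH_SECOND)
     .where(0).equalTo(0)
     .with(new MyJoinFunction());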
5.2 Memory Management
Building on database technology, Flink serializes data into memory segments, instead of allocating objects in the
JVM’s heap to represent buffered in-flight data records. Operations such as sorting and joining operate as much
as possible on the binary data directly, keeping the serialization and deserialization overhead at a minimum and
partially spilling data to disk when needed. To handle arbitrary objects, Flink uses type inference and custom
serialization mechanisms. By keeping the data processing on a binary representation and off-heap, Flink manages
to reduce the garbage collection overhead, and to use cache-efficient and robust algorithms that scale gracefully
under memory pressure.
5.3 Batch Iterations
Iterative graph analytics, parallel gradient descent and optimisation techniques have been implemented in the
past on top of Bulk Synchronous Parallel (BSP) and Stale Synchronous Parallel (SSP) models, among others.
Flink’s execution model allows for any type of structured iteration logic to be implemented on top, by using
iteration-control events. For instance, in the case of a BSP execution, iteration-control events mark the begin-
ning and the end of supersteps in an iterative computation. Finally, Flink introduces further novel optimisation
techniques such as the concept of delta iterations [14], which can exploit sparse computational dependencies.
Delta iterations are already exploited by Gelly, Flink’s Graph API.
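A minimal, hedged sketch of a delta iteration in the DataSet API follows (not from the paper; MyUpdateFunction is a hypothetical FlatJoinFunction emitting only changed (id, value) pairs, and ‘env’ is an ExecutionEnvironment): the workset shrinks as elements converge, so later supersteps touch only the sparse, still-changing part of the solution set.

DataSet<Tuple2<Long, Long>> initial = env.fromElements(
    Tuple2.of(1L, 1L), Tuple2.of(2L, 2L), Tuple2.of(3L, 3L));

// Solution set and initial workset coincide; key field 0, at most 100 supersteps.
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    initial.iterateDelta(initial, 100, 0);

// Compute the per-superstep changes (the delta) by joining workset and solution set.
DataSet<Tuple2<Long, Long>> delta = iteration.getWorkset()
    .join(iteration.getSolutionSet())
    .where(0).equalTo(0)
    .with(new MyUpdateFunction());                  // hypothetical FlatJoinFunction emitting changed entries

// The delta updates the solution set and also forms the next workset.
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(delta, delta);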
6 Related work
Today, there is a wealth of engines for distributed batch and stream analytical processing. We categorise the
main systems below.
Batch Processing. Apache Hadoop is one of the most popular open-source systems for large-scale data analy-
sis that is based on the MapReduce paradigm [12]. Dryad [18] introduced embedded user-defined functions in
general DAG-based dataflows and was enriched by SCOPE [26], which adds a language and an SQL optimizer on
top of it. Apache Tez [24] can be seen as an open source implementation of the ideas proposed in Dryad. MPP
databases [13], and recent open-source implementations like Apache Drill and Impala [19], restrict their API
to SQL variants. Similar to Flink, Apache Spark [25] is a data-processing framework that implements a DAG-
based execution engine, provides an SQL optimizer, performs driver-based iterations, and treats unbounded
computation as micro-batches. In contrast, Flink is the only system that incorporates i) a distributed dataflow
runtime that exploits pipelined streaming execution for batch and stream workloads, ii) exactly-once state con-
sistency through lightweight checkpointing, iii) native iterative processing, iv) sophisticated window semantics,
supporting out-of-order processing.
Stream Processing. There is a wealth of prior work on academic and commercial stream processing systems,
such as SEEP, Naiad, Microsoft StreamInsight, and IBM Streams. Many of these systems are based on research
in the database community [1, 5, 8, 10, 16, 22, 23]. Most of the above systems are either i) academic prototypes,
ii) closed-source commercial products, or iii) systems that do not scale the computation horizontally on clusters of
commodity servers. More recent approaches in data streaming enable horizontal scalability and compositional
dataflow operators with weaker state consistency guarantees (e.g., at-least-once processing in Apache Storm and
Samza). Notably, concepts such as “out-of-order processing” (OOP) [20] gained significant traction and were
adopted by MillWheel [2], Google’s internal version of the later commercially offered executor of Apache
Beam/Google Dataflow [3]. MillWheel served as a proof of concept for exactly-once, low-latency stream
processing and OOP, and was thus very influential in the evolution of Flink. To the best of our knowledge, Flink
is the only open-source project that: i) supports event-time and out-of-order event processing; ii) provides
consistent managed state with exactly-once guarantees; and iii) achieves high throughput and low latency,
serving both batch and streaming workloads.
7 Acknowledgements
The development of the Apache Flink project is overseen by a self-selected team of active contributors to the
project. A Project Management Committee (PMC) guides the project’s ongoing operations, including com-
munity development and product releases. At the time of writing, the list of Flink committers
is: Márton Balassi, Paris Carbone, Ufuk Celebi, Stephan Ewen, Gyula Fóra, Alan Gates, Greg Hogan,
Fabian Hueske, Vasia Kalavri, Aljoscha Krettek, ChengXiang Li, Andra Lungu, Robert Metzger, Maximilian
Michels, Chiwan Park, Till Rohrmann, Henry Saputra, Matthias J. Sax, Sebastian Schelter, Kostas Tzoumas,
Timo Walther and Daniel Warneke. In addition to these individuals, we want to acknowledge the broader Flink
community of more than 180 contributors.
8 Conclusion
In this paper, we presented Apache Flink, a platform that implements a universal dataflow engine designed to
perform both stream and batch analytics. Flink’s dataflow engine treats operator state and logical intermediate
results as first-class citizens and is used by both the batch and the data stream APIs with different parameters. The
streaming API, built on top of Flink’s streaming dataflow engine, provides the means to keep recoverable
state and to partition, transform, and aggregate data stream windows. While batch computations are, in theory,
a special case of streaming computations, Flink treats them specially, by optimizing their execution using a
query optimizer and by implementing blocking operators that gracefully spill to disk in the absence of memory.
References
[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin,
E. Ryvkina, et al. The design of the Borealis stream processing engine. CIDR, 2005.
[2] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and
S. Whittle. Millwheel: fault-tolerant stream processing at Internet scale. PVLDB, 2013.
[3] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills,
F. Perry, E. Schmidt, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in
massive-scale, unbounded, out-of-order data processing. PVLDB, 2015.
[4] A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl,
F. Naumann, M. Peters, A. Rheinlaender, M. J. Sax, S. Schelter, M. Hoeger, K. Tzoumas, and D. Warneke. The
stratosphere platform for big data analytics. VLDB Journal, 2014.
[5] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. Stream:
The Stanford data stream management system. Technical Report, 2004.
[6] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters.
PVLDB, 2010.
[7] P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas. Lightweight asynchronous snapshots for distributed
dataflows. arXiv:1506.08603, 2015.
[8] B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, D. Fisher, J. C. Platt, J. F. Terwilliger, and J. Wernsing. Trill:
a high-performance incremental query processor for diverse analytics. PVLDB, 2014.
[9] B. Chandramouli, J. Goldstein, and D. Maier. On-the-fly progress detection in iterative stream queries. PVLDB,
2009.
[10] S. Chandrasekaran and M. J. Franklin. Psoup: a system for streaming queries over streaming data. VLDB Journal,
2003.
[11] K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM TOCS,
1985.
[12] J. Dean et al. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
[13] D. J. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H.-I. Hsiao, R. Rasmussen, et al. The gamma database
machine project. IEEE TKDE, 1990.
[14] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning Fast Iterative Data Flows. PVLDB, 2012.
[15] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model.
Theoretical Computer Science, 2005.
[16] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo. SPADE: the System S declarative stream processing engine.
ACM SIGMOD, 2008.
[17] F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the Black
Boxes in Data Flow Optimization. PVLDB, 2012.
[18] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential
building blocks. ACM SIGOPS, 2007.
[19] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs,
et al. Impala: A modern, open-source SQL engine for Hadoop. CIDR, 2015.
[20] J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, and D. Maier. Out-of-order processing: a new architecture
for high-performance stream systems. PVLDB, 2008.
[21] N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning
Publications Co., 2015.
[22] M. Migliavacca, D. Eyers, J. Bacon, Y. Papagiannis, B. Shand, and P. Pietzuch. Seep: scalable and elastic event
processing. ACM Middleware’10 Posters and Demos Track, 2010.
[23] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. ACM
SOSP, 2013.
[24] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino. Apache Tez: A unifying framework for
modeling and building data processing applications. ACM SIGMOD, 2015.
[25] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets.
USENIX HotCloud, 2010.
[26] J. Zhou, P.-A. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the scope optimizer. IEEE
ICDE, 2010.
... 1. Horizontal resource scaling: Here we find SQL and NoSQL analytics frameworks whose distributed design reduces estimation latency through horizontal scaling of server resources (e.g., Spark [99], Hive [87], Hadoop [85], Dremel [74], Druid [96], Flink [35]). These frameworks scale their clusters with input data and can provide precisely exact estimations. ...
... Apache Spark [99] leverages a DAG-based execution engine and treats unbounded computation as microbatches. Apache Flink [35] enables pipelined streaming execution for batched and streaming data, offers exactly-one semantics and High-dimensional Data Cubes: Data cubes have been an integral part of online analytics frameworks and enable pre-computing and storing statistics for multidimensional aggregates so that queries can be answered on the fly. However, data cubes suffer from the same scalability challenges as Hydra. ...
... out-of-order processing. Hydra could be built on top of Apache Flink.Stream Processing Frameworks: This line of research focuses on the architecture of stream processing systems, answering questions about out-of-order data management, fault tolerance, highavailability, load management, elasticity etc.[5,14,15,21,23,27,35,60,66,76]. Fragkoulis et al. analyze the state of the art of stream processing engines[48]. ...
Preprint
Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable costs. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics we need to monitor simultaneously. We present Hydra, an efficient framework for multidimensional analytics that presents a novel combination of using a ``sketch of sketches'' to avoid the overhead of monitoring exponentially-many subpopulations and universal sketching to ensure accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra can achieve robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing frameworks (e.g., Spark, Druid) while ensuring interactive estimation times.
... Panthera [246] is a storage abstraction for Spark [262] that identifies data access patterns and moves data between DRAM and NVM accordingly. Li and Li propose data spilling for Flink [38]. They follow the anti-caching approach [54] and evict cold data from DRAM to disk. ...
... Neither OS paging, nor application library or tiering-specific data structures provide the functionality needed for efficient data tiering in in-memory databases. 38 In this chapter, we present the research DBMS Hyrise, which is used as the foundation for our work on automatic tiering. We describe its fundamental architecture and dive into selected implementation details. ...
Thesis
Full-text available
A decade ago, it became feasible to store multi-terabyte databases in main memory. These in-memory databases (IMDBs) profit from DRAM's low latency and high throughput as well as from the removal of costly abstractions used in disk-based systems, such as the buffer cache. However, as the DRAM technology approaches physical limits, scaling these databases becomes difficult. Non-volatile memory (NVM) addresses this challenge. This new type of memory is persistent, has more capacity than DRAM (4x), and does not suffer from its density-inhibiting limitations. Yet, as NVM has a higher latency (5-15x) and a lower throughput (0.35x), it cannot fully replace DRAM. IMDBs thus need to navigate the trade-off between the two memory tiers. We present a solution to this optimization problem. Leveraging information about access frequencies and patterns, our solution utilizes NVM's additional capacity while minimizing the associated access costs. Unlike buffer cache-based implementations, our tiering abstraction does not add any costs when reading data from DRAM. As such, it can act as a drop-in replacement for existing IMDBs. Our contributions are as follows: (1) As the foundation for our research, we present Hyrise, an open-source, columnar IMDB that we re-engineered and re-wrote from scratch. Hyrise enables realistic end-to-end benchmarks of SQL workloads and offers query performance which is competitive with other research and commercial systems. At the same time, Hyrise is easy to understand and modify as repeatedly demonstrated by its uses in research and teaching. (2) We present a novel memory management framework for different memory and storage tiers. By encapsulating the allocation and access methods of these tiers, we enable existing data structures to be stored on different tiers with no modifications to their implementation. Besides DRAM and NVM, we also support and evaluate SSDs and have made provisions for upcoming technologies such as disaggregated memory. (3) To identify the parts of the data that can be moved to (s)lower tiers with little performance impact, we present a tracking method that identifies access skew both in the row and column dimensions and that detects patterns within consecutive accesses. Unlike existing methods that have substantial associated costs, our access counters exhibit no identifiable overhead in standard benchmarks despite their increased accuracy. (4) Finally, we introduce a tiering algorithm that optimizes the data placement for a given memory budget. In the TPC-H benchmark, this allows us to move 90% of the data to NVM while the throughput is reduced by only 10.8% and the query latency is increased by 11.6%. With this, we outperform approaches that ignore the workload's access skew and access patterns and increase the query latency by 20% or more. Individually, our contributions provide novel approaches to current challenges in systems engineering and database research. Combining them allows IMDBs to scale past the limits of DRAM while continuing to profit from the benefits of in-memory computing.
... Big data analytics platforms [1], [2], [3], [4], [5], [6] have played a critical role in the unprecedented success of datadriven applications. Such platforms are typically deployed in server clusters and datacenters to support applications that analyze big data ranging from personal data to web visit logs and purchase histories [7], [8]. ...
... In this section, we present the general computational architecture adopted by three of the most popular big data analytics systems, namely, Apache Storm [1], [2], Apache Spark [3], [4], and Apache Flink [5], [6]. We also discuss the implementation differences among the three systems. ...
Article
Full-text available
Big data analytics platforms have played a critical role in the unprecedented success of data-driven applications. However, real-time and streaming data applications, and recent legislation, e.g., GDPR in Europe, have posed constraints on exchanging and analyzing data, especially personal data, across geographic regions. To address such constraints data has to be processed and analyzed in-situ and aggregated results have to be exchanged among the different sites for further processing. This introduces additional network delays due to the geographic distribution of the sites and potentially affecting the performance of analytics platforms that are designed to operate in datacenters with low network delays. In this paper, we show that the three most popular big data analytics systems (Apache Storm, Apache Spark, and Apache Flink) fail to tolerate round-trip times more than 30 milliseconds even when the input data rate is low. The execution time of distributed big data analytics tasks degrades substantially after this threshold, and some of the systems are more sensitive than others. A closer examination and understanding of the design of these systems show that there is no winner in all wide-area settings. However, we show that it is possible to improve the performance of all these popular big data analytics systems significantly amid even transcontinental delays (where inter-node delay is more than 30 milliseconds) and achieve performance comparable to this within a datacenter for the same load.
... For a more detailed description, please refer to our original publication (Henning and Hasselbring 2021c). We evaluate implementations of these tasks samples with the two stream processing engines Apache Kafka Streams and Apache Flink (Carbone et al. 2015). ...
Article
Full-text available
Cloud-native applications constitute a recent trend for designing large-scale software systems. However, even though several cloud-native tools and patterns have emerged to support scalability, there is no commonly accepted method to empirically benchmark their scalability. In this study, we present a benchmarking method, allowing researchers and practitioners to conduct empirical scalability evaluations of cloud-native applications, frameworks, and deployment options. Our benchmarking method consists of scalability metrics, measurement methods, and an architecture for a scalability benchmarking tool, particularly suited for cloud-native applications. Following fundamental scalability definitions and established benchmarking best practices, we propose to quantify scalability by performing isolated experiments for different load and resource combinations, which asses whether specified service level objectives (SLOs) are achieved. To balance usability and reproducibility, our benchmarking method provides configuration options, controlling the trade-off between overall execution time and statistical grounding. We perform an extensive experimental evaluation of our method’s configuration options for the special case of event-driven microservices. For this purpose, we use benchmark implementations of the two stream processing frameworks Kafka Streams and Flink and run our experiments in two public clouds and one private cloud. We find that, independent of the cloud platform, it only takes a few repetitions (≤ 5) and short execution times (≤ 5 minutes) to assess whether SLOs are achieved. Combined with our findings from evaluating different search strategies, we conclude that our method allows to benchmark scalability in reasonable time.
... With the development of distributed computing systems, Apache Flink [7] has gradually replaced Hadoop [8] and Spark [9] as the most popular system, which is characterized by batch-stream integration and higher efficiency. Compared with the traditional Hadoop MapReduce programming model, Flink has a variety of advanced Operators, which enable users to complete logically more complex big data jobs with less code. ...
Article
Full-text available
In the era of intelligent Internet, the management and analysis of massive spatio-temporal data is one of the important links to realize intelligent applications and build smart cities, in which the interaction of multi-source data is the basis of realizing spatio-temporal data management and analysis. As an important carrier to achieve the interactive calculation of massive data, Flink provides the advanced Operator Join to facilitate user program development. In a Flink job with multi-source data connection operations, the selection of join sequences and the data communication in the repartition phase are both key factors that affect the efficiency of the job. However, Flink does not provide any optimization mechanism for the two factors, which in turn leads to low job efficiency. If the enumeration method is used to find the optimal join sequence, the result will not be obtained in polynomial time, so the optimization effect cannot be achieved. We investigate the above problems, design and implement a more advanced Operator joinTree that can support multi-source data connection in Flink, and introduce two optimization strategies into the Operator. In summary, the advantages of our work are highlighted as follows: (1) the Operator enables Flink to support multi-source data connection operation, and reduces the amount of calculation and data communication by introducing lightweight optimization strategies to improve job efficiency; (2) with the optimization strategy for join sequence, the total running time can be reduced by 29% and the data communication can be reduced by 34% compared with traditional sequential execution; (3) the optimization strategy for data repartition can further enable the job to bring 35% performance improvement, and in the average case can reduce the data communication by 43%.
... Although traditional batch-oriented systems can process large chunks of static data, they can not be used to handle such modern applications where data is generated continuously. Thus, stream computing frameworks such as Storm [1], Spark [2], Flink [3], and Heron [4] are used with these applications to process real-time streaming data. In stream computing, continuous data streams are generally discretized to apply computations on subsets of data. ...
Chapter
The distributed stream processing system suffers from the rate variation and skewed distribution of input stream. The scaling policy is used to reduce the impact of rate variation, but cannot maintain high performance with a low overhead when input stream is skewed. To solve this issue, we propose Alps, an Adaptive Load Partitioning Scaling system. Alps exploits adaptive partitioning scaling algorithm based on the willingness function to determine whether to use a partitioning policy. To our knowledge, this is the first approach integrates scaling policy and partitioning policy in an adaptive manner. In addition, Alps achieves the outstanding performance of distributed stream processing system with the least overhead. Compared with state-of-the-art scaling approach DS2, Alps reduces the end-to-end latency by 2 orders of magnitude on high-speed skewed stream and avoids the waste of resources on low-speed or balanced stream.
Article
This paper introduces Trill - a new query processor for analytics. Trill fulfills a combination of three requirements for a query processor to serve the diverse big data analytics space: (1) Query Model: Trill is based on a tempo-relational model that enables it to handle streaming and relational queries with early results, across the latency spectrum from real-time to offline; (2) Fabric and Language Integration: Trill is architected as a high-level language library that supports rich data-types and user libraries, and integrates well with existing distribution fabrics and applications; and (3) Performance: Trill's throughput is high across the latency spectrum. For streaming data, Trill's throughput is 2-4 orders of magnitude higher than comparable streaming engines. For offline relational queries, Trill's throughput is comparable to a major modern commercial columnar DBMS. Trill uses a streaming batched-columnar data representation with a new dynamic compilation-based system architecture that addresses all these requirements. In this paper, we describe Trill's new design and architecture, and report experimental results that demonstrate Trill's high performance across diverse analytics scenarios. We also describe how Trill's ability to support diverse analytics has resulted in its adoption across many usage scenarios at Microsoft.
Article
Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Those approaches suffer from two main drawbacks. First, they often stall the overall computation which impacts ingestion. Second, they eagerly persist all records in transit along with the operation states which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots.
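ABS is the snapshotting algorithm behind Flink's checkpoints, so on the user side it surfaces only as configuration. A minimal sketch in the Flink 1.x Java DataStream API follows; the interval and timeout values are arbitrary, and the pipeline is a placeholder.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject checkpoint barriers every 10 seconds; operator state is then
        // snapshotted asynchronously as the barriers flow through the dataflow.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpoints = env.getCheckpointConfig();
        checkpoints.setMinPauseBetweenCheckpoints(500L); // breathing room between snapshots
        checkpoints.setCheckpointTimeout(60_000L);       // give up on snapshots that stall

        // Placeholder pipeline; any stateful job is covered the same way.
        env.fromElements(1, 2, 3).map(x -> x * 2).print();

        env.execute("checkpointed job sketch");
    }
}
```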
Article
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.
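The kind of rewriting such an optimizer discovers automatically can be illustrated by hand. In the hypothetical pipeline below, the filter reads only field f0 while the map rewrites only field f1, so the two operators commute and the filter can be evaluated first to shrink the map's input; the contribution of the work above is inferring exactly this read/write-set information from black-box UDF code rather than relying on the developer to do it.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReorderingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("keep", 1), Tuple2.of("drop", 2), Tuple2.of("keep", 3));

        // Original order: every record is mapped, then most are thrown away.
        DataStream<Tuple2<String, Integer>> mappedFirst = input
                .map(t -> Tuple2.of(t.f0, t.f1 * 10))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .filter(t -> "keep".equals(t.f0));

        // Reordered: the filter only reads f0 and the map only writes f1, so the
        // operators commute and filtering first reduces the map's input.
        DataStream<Tuple2<String, Integer>> filteredFirst = input
                .filter(t -> "keep".equals(t.f0))
                .map(t -> Tuple2.of(t.f0, t.f1 * 10))
                .returns(Types.TUPLE(Types.STRING, Types.INT));

        mappedFirst.print();
        filteredFirst.print();
        env.execute("manual operator reordering sketch");
    }
}
```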
Chapter
Traditional database management systems are best equipped to run one-time queries over finite stored data sets. However, many modern applications such as network monitoring, financial analysis, manufacturing, and sensor networks require long-running, or continuous, queries over continuous unbounded streams of data. In the STREAM project at Stanford, we are investigating data management and query processing for this class of applications. As part of the project we are building a general-purpose prototype Data Stream Management System (DSMS), also called STREAM, that supports a large class of declarative continuous queries over continuous streams and traditional stored data sets. The STREAM prototype targets environments where streams may be rapid, stream characteristics and query loads may vary over time, and system resources may be limited.
Article
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems. We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost. In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.
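Flink's DataStream API exposes essentially these trade-offs: event-time windows for correctness, watermarks as the latency/completeness knob, and allowed lateness for records that arrive after results were emitted. A minimal sketch follows, assuming Flink 1.11 or later for the WatermarkStrategy API; the keys, timestamps, window size, and lateness bound are invented.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, epoch-millis event timestamp): toy data standing in for an unbounded source.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("a", 1_000L), Tuple2.of("a", 2_500L), Tuple2.of("b", 4_000L));

        events
            // Event time plus bounded out-of-orderness watermarks: the latency knob.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((e, ts) -> e.f1))
            .keyBy(e -> e.f0)
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            // Keep window state around so late records update the result instead of
            // being dropped: the correctness/cost knob.
            .allowedLateness(Time.minutes(1))
            .sum(1)
            .print();

        env.execute("event-time windowing sketch");
    }
}
```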
Conference Paper
Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on multiple platforms, at the expense of efficiency, maintainability, and simplicity. Naiad resolves the complexities of combining these features in one framework. A new computational model, timely dataflow, underlies Naiad and captures opportunities for parallelism across a wide class of algorithms. This model enriches dataflow computation with timestamps that represent logical points in the computation and provide the basis for an efficient, lightweight coordination mechanism. We show that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining. Naiad outperforms specialized systems in their target application domains, and its unique features enable the development of new high-performance applications.
Article
MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees. This paper describes MillWheel's programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel's features are used. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
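MillWheel's combination of per-key persistent state and timers over logical time has a close analogue in Flink's KeyedProcessFunction. The sketch below uses an invented event type and a one-minute timeout: it counts events per key and emits the count one minute of event time after the last event for that key. It assumes timestamps and watermarks are assigned upstream, and would be wired in as stream.keyBy(e -> e.f0).process(new CountWithTimeout()).

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class PerKeyTimerSketch {

    // Per-key count with an event-time timeout: state survives failures because it
    // is part of Flink's checkpoints, much like MillWheel's persistent state.
    static class CountWithTimeout
            extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void processElement(Tuple2<String, Long> event, Context ctx,
                                   Collector<Tuple2<String, Long>> out) throws Exception {
            Long current = count.value();
            count.update(current == null ? 1L : current + 1);
            // Assumes event-time timestamps were assigned upstream; fire one minute later.
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000L);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx,
                            Collector<Tuple2<String, Long>> out) throws Exception {
            out.collect(Tuple2.of(ctx.getCurrentKey(), count.value()));
        }
    }
}
```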
Article
Continuous streams of event data are generated in many application domains including financial trading, fraud detection, website analytics and system monitoring. An open challenge in data management is how to analyse and react to large volumes of event data in real-time. As centralised event processing systems reach their computational limits, we need a new class of event processing systems that support deployments at the scale of thousands of machines in a cloud computing setting. In this poster we present SEEP, a novel architecture for event processing that can scale to a large number of machines and is elastic in order to adapt dynamically to workload changes.
Article
Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that, when exploited, those aspects lead to speedups of up to two orders of magnitude in algorithm runtime. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
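These iterations are embedded in the dataflow itself rather than driven by a client-side loop, so the optimizer sees the whole iterative program. A minimal bulk-iteration sketch in the legacy DataSet API follows (the API this line of work targeted; the values and iteration count are arbitrary); delta iterations additionally expose a workset and a solution set via iterateDelta.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Toy bulk iteration: add 1 to every element, ten times, entirely inside the
        // dataflow (no driver-side loop).
        IterativeDataSet<Integer> loop = env.fromElements(0, 5, 10).iterate(10);

        DataSet<Integer> result = loop.closeWith(loop.map(x -> x + 1));

        result.print();   // prints 10, 15, 20
    }
}
```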