Hardware Acceleration Landscape for Distributed
Real-time Analytics: Virtues and Limitations
Mohammadreza Najafi, Kaiwen Zhang, Hans-Arno Jacobsen
Technical University of Munich
Mohammad Sadoghi
Purdue University
Abstract—We are witnessing a technological revolution with
a broad impact ranging from daily life (e.g., personalized
medicine and education) to industry (e.g., data-driven health-
care, commerce, agriculture, and mining). At the core of this
transformation lies “data”. This transformation is facilitated
by embedded devices, collectively known as Internet of Things
(IoT), which produce real-time feeds of sensor data which are
collected and processed to produce a dynamic physical model
used for optimized real-time decision making. At the infrastruc-
ture level, there is a need to develop a scalable architecture
for processing massive volumes of present and historical data
at an unprecedented velocity to support the IoT paradigm. To
cope with such extreme scale, we argue for the need to revisit
the hardware and software co-design landscape in light of two
key technological advancements. First is the virtualization of
computation and storage over highly distributed data centers
spanning across continents. Second is the emergence of a variety
of specialized hardware accelerators that complement traditional
general-purpose processors. Further efforts are required to unify
these two trends in order to harness the power of big data.
In this paper, we present a formulation and characterization of
the hardware acceleration landscape geared towards real-time
analytics in the cloud. Our goal is to assist both researchers
and practitioners navigating the newly revived field of software
and hardware co-design for building next generation distributed
systems. We further present a case study to explore software and
hardware interplay for designing distributed real-time stream processing.

Traditionally, the data management community has devel-
oped specialized solutions that focus on either extreme of a
data spectrum between volume and velocity, yielding large-volume batch processing systems, e.g., Hadoop [1], Spark [2],
and high-velocity stream processing systems, e.g., Flink [3],
Storm [4], Spark Streaming [5], respectively. However, the cur-
rent trend, led by the Internet of Things (IoT) paradigm, leans
towards the large-volume processing of rich data produced by
distributed sensors in real-time at a high velocity for reactive
and automated decision making. While the aforementioned
specialized platforms have proven to achieve a certain balance
between speed and scale, their performance is still inadequate
in light of the emerging real-time applications.
This remarkable shift towards big data presents an interest-
ing opportunity to study the interplay of software and hardware
in order to understand the limitations of the current co-design
space for distributed systems, which must be fully explored
before resorting to specialized systems such as ASICs, FPGAs,
and GPUs. Each hardware accelerator has a unique perfor-
mance profile with enormous potential for speed and size
to compete with or to complement CPUs, as envisioned in
Figure 1.

Fig. 1: Envisioned acceleration technology outlook: processing time (lower is better) versus data size (terabyte to petabyte) for application-specific integrated chips, field-programmable gate arrays, graphic processors, and general-purpose processors, with envisioned latencies ranging from under 100 microseconds to over 10 minutes.

In this paper, we primarily focus on cost-effective,
power-efficient hardware acceleration solutions which excel
at analytical computations by tapping into inherent low-
level hardware parallelism. To motivate the adoption of these
emerging hardware accelerators, we describe the four primary
challenges faced by today’s general-purpose processors.
Large & complex control units. The design of general-
purpose processors is based on the execution of consecutive
operations on data residing in main memory. This architecture
must guarantee a correct sequential order execution. As a
result, processors include complex logic to increase perfor-
mance (e.g., super pipelining and out-of-order execution). Any
performance gain comes at the cost of devoting up to 95% of
resources (i.e., transistors) to these control units [6].
Memory wall & von Neumann bottleneck. The current
computer architecture suffers from the limited bandwidth
between CPU and memory. This issue is referred to as the
memory wall and is becoming a major scalability limitation
as the gap between CPU and memory speed increases. To
mitigate this issue, processors have been equipped with large
cache units, but their effectiveness heavily depends on the
memory access patterns. Additionally, the von Neumann bot-
tleneck further contributes to the memory wall by sharing the
limited memory bandwidth between both instructions and data.
Redundant memory accesses. The present-day system en-
forces that data arriving from an I/O device is first read/written
to main memory before it is processed by the CPU, resulting
in a substantial loss of memory bandwidth. Consider a simple
data stream filtering operation that does not require the incoming data stream to be first written to main memory. In theory, the data arriving from the I/O should be streamed directly through the processor; however, today's computer architecture prevents this basic modus operandi.

Fig. 2: General-purpose processor architecture (large control unit, small execution engine, high power consumption, redundant memory accesses) vs. custom hardware architecture (small control unit, large execution engine, selective external memory access, low power consumption).
Power consumption. Manufacturers often aim at increasing
the transistor density together with higher clock speed, but
the increase in the clock speed (leading to a higher driving voltage for the chip) results in a superlinear increase in power
consumption and a greater need for heat dissipation [7].
These performance limiting factors in today’s computer
architecture have generated a growing interest in accelerating
distributed data management and data stream processing using
custom hardware solutions [8], [9], [10], [11], [12], [13],
[14], [15], [16] and more general hardware acceleration at
cloud-scale [17], [18], [19], [20], [21], [22]. In our past work
on designing custom hardware accelerators for data streams
(see Figure 2), we have demonstrated the use of a much
simpler control logic with better usage of chip area, thereby
achieving higher performance per transistor ratio [13], [15].
We can substantially reduce the memory wall overhead by
coupling processor and local memory, instantiating as many
of these components as necessary. Through this coupling, redundant memory accesses are reduced by avoiding copying and re-reading memory whenever possible. The use of many low-frequency but specialized chips may circumvent the need for generalized, high-frequency processors. Despite the great potential of hardware accelerators, there is a pressing need to
navigate this complex and growing landscape before solutions
can be deployed in a large-scale environment.
However, today's hardware accelerators come in a variety of forms, as demonstrated in Figure 3, ranging from embedded hardware features (e.g., hardware threading and Single-Instruction Multiple Data (SIMD)) in general-purpose processors (e.g., CPUs) to more specialized processors such as
GPUs, FPGAs, and ASICs.

Fig. 3: Accelerator spectrum, from hardware multi-threading (Intel Hyper-Threading) and SIMD in CPUs, to discrete and integrated GPUs (Nvidia, AMD, Intel), FPGAs (Xilinx and Altera), and ASICs.

For example, graphics processing units (GPUs) offer massive parallelism and floating-point computation based on an architecture equipped with thousands of
lightweight cores coupled with substantially higher memory
bandwidth compared to CPUs. Due to the fixed architecture
of GPUs, only specific types of applications or algorithms can
benefit from its superior parallelism, such as tasks that follow a
regular pattern (e.g., matrix computation). In contrast, a field-
programmable gate array (FPGA) is an integrated circuit that
can be configured to encode any algorithm, even irregular
ones. FPGAs can be configured to construct an optimal archi-
tecture of custom cores engineered for a targeted algorithm.
FPGAs contain compute components which are configurable
logic blocks (CLBs) consisting of Lookup Tables (LUTs) and
SRAM; memory components including registers, distributed
RAMs, and Block RAMs (BRAMs); and communication com-
ponents that consist of configurable interconnects and wiring.
Any circuit design developed for FPGAs can be directly
burned onto application-specific integrated circuits (ASICs),
which provide greater performance in exchange for flexibility.
Although the remainder of this paper focuses on FPGAs, much of the content discussed is equally applicable to other forms of acceleration.
To further narrow down our discussion, we focus mostly on
distributed real-time analytics over continuous data streams
using FPGAs. To cope with the lack of flexibility of custom hardware solutions, the state of the art assumes that the set of queries is known in advance. Essentially, past works rely
on the compilation of static queries and fixed stream schemas
onto hardware designs synthesized to configure FPGAs [9].
Depending on the complexity of the design, this synthesis step
can last minutes or even hours. This is not suitable for modern-
day stream processing needs, which require fast, on the fly,
reconfiguration. Furthermore, many of the existing approaches
assume a complete halt during any synthesis modification [9],
[12]. While synthesis and stream processing may overlap to
some extent, a significant amount of time and effort is still
required to reconfigure, which may take several minutes up
to hours. More importantly, this approach requires additional
logic for buffering, handling of dropped tuples, requests for
re-transmissions, and additional data flow controlling tasks,
which renders this style of processing difficult, if not im-
possible, in practice. These concerns are often ignored in the
approaches listed above, which assumed that processing stops
entirely before a new query-stream processing cycle starts.
In this paper, we make the following contributions. First,
we present a comprehensive formalization of the acceleration
landscape over distributed heterogeneous hardware in the
context of real-time analytics over data streams. In our formal-
ization, we will incorporate recent attempts to address these
aforementioned shortcomings of custom hardware solutions.
Fig. 4: Acceleration design landscape. System model: standalone, co-placement, co-processor. Programming model: hardware description languages (VHDL, Verilog, SystemC, TLM), general-purpose languages (C, C++, Java), parallel programming languages and APIs (CUDA, OpenCL, OpenMP), and SQL-based declarative languages with static (Glacier) or dynamic compilers. Representational model: static circuits, parametrized circuits (Skeleton Automata, fpga-ToPSS, OPB, FQP, Ibex, IBM Netezza), parametrized data segments, and temporal/spatial parametrized topology. Algorithmic model: data, task, and pipeline parallelism; bi-directional flow (handshake join) and uni-directional flow; multi-query optimization (Rete-like network); Boolean formula precomputation.
Second, we study the fundamental strategy in finding the right
balance between the data and control flow on hardware for de-
signing highly distributed stream joins, which are some of the
most expensive operations in SQL-based stream processing.
Our experience working with stream joins on FPGAs represents a case study whose findings and challenges are applicable to other areas of our acceleration landscape.
To formalize the acceleration landscape, we develop a broad classification derived from a recent comprehensive tutorial on FPGAs and GPUs [23]. As presented in Figure 4, we divide the design space into four categories: system model, programming model, representational model, and algorithmic model.
In the system model layer, we focus on a holistic view of
the entire distributed compute infrastructure, possibly hosted
virtually in the cloud, in order to identify how accelerators can
be employed when considering the overall system architecture.
We identify three main deployment categories: standalone,
co-placement, and co-processor in distributed systems. The standalone mode is the simplest architecture, where the entire software stack is embedded in hardware. Past works on event [24], [25], [26] and stream processing [9], [12], [13], [15], [16], where the processing tasks are carried out entirely on FPGAs, can be categorized as standalone mode. The co-processor design can be viewed as a co-operative method
in order to offload and/or distribute (partial) computation
from the host system (e.g., running on CPUs) to accelerators
(e.g., FPGAs or GPUs). For instance, FPGA and CPU co-operation is exploited in [27], where the Shore-MT database is
modified to offload regular expression computations, expressed
as user-defined functions in SQL queries, to FPGAs. In
contrast, the co-placement design is focused on identifying
the best placement strategy to place the accelerators on the
data path in a distributed system. This mode of computation
can be characterized as performing a partial or best-effort
computational model. For instance, this model is applied in
IBM Netezza [28].
We refer to the active data path as an alternative data-centric
view of the system model. In a distributed setting, each piece
of data travels from a source (data producer) to a destination
(data consumer), passing through the network and temporar-
ily residing in storage and memory of intermediate nodes.
Usually, the actual data computation task is performed close
to the destination using CPUs. Instead, an active data path
distributes processing tasks along the entire length to various
network, storage, and memory components by making them
“active”, i.e., coupled with an accelerator. Thus, a wide range
of partial or best-effort computation can be employed using
co-placement or co-processor strategies, which are building
blocks towards creating active distributed systems.
The programming model focuses on the usability of het-
erogeneous hardware accelerators. Akin to procedural pro-
gramming languages such as C and Java, there are numerous
hardware description languages such as VHDL, Verilog, Sys-
temC, and TLM that expose key attributes of hardware in a
C-like syntax. Unlike the software mindset which is focused
on algorithm implementation (e.g., using conditional and loop
statements), the programming in hardware must also consider
the data and control flow mapping onto logic gates and
wiring. To simplify the usability of hardware programming,
recent works have shifted towards SQL-based declarative
languages using static (Glacier) and dynamic (FQP) compilers.
Essentially, these compilers generate a hardware description
language when given SQL queries as input. Glacier converts a
query into a final circuit design by composing operator-based
logic blocks.

Fig. 5: Query execution on a general-purpose processor (software) vs. a flexible hardware solution: the FQP hardware model (query assigner, distributor, OP-Blocks and custom blocks, bridge, result collector) contrasted with a traditional processor pipeline (fetch, decode, execute, memory) over main memory.

Once synthesized to FPGA, the design cannot
be altered. In contrast, FQP starts with a fixed topology of
programmable cores (where each core executes a variety of SQL operators such as selection, projection, or join) that are first synthesized on FPGA. The FQP compiler then generates a dynamic mapping of queries onto the FQP
topology at runtime.
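To make the contrast concrete, a dynamic compiler of this kind can be viewed as a runtime assignment of query operators onto an already-synthesized topology of programmable cores. The sketch below is our own simplification in Python; the data structures and names are hypothetical, not FQP's actual interface.

```python
# Illustrative sketch of dynamic query-to-core mapping in the spirit of a
# dynamic compiler; identifiers here are our own simplification.
def map_query(operators, topology):
    """Assign each operator of a parsed query to a free programmable core
    that supports it, without re-synthesizing the fabric."""
    assignment, used = {}, set()
    for op in operators:
        core = next((c for c, kinds in topology.items()
                     if c not in used and op["kind"] in kinds), None)
        if core is None:
            raise RuntimeError(f"no free core for operator {op['kind']}")
        assignment[op["name"]] = core
        used.add(core)
    return assignment

# a fixed topology of OP-Blocks, each able to run several SQL operators
topology = {"opb0": {"select", "project"}, "opb1": {"join"}, "opb2": {"select", "join"}}
plan = [{"name": "sigma1", "kind": "select"}, {"name": "join1", "kind": "join"}]
```

A static compiler, by contrast, would bake `plan` directly into the circuit, so any change to `plan` would require a new synthesis run.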
The data and control flow can be realized in hardware using either static or dynamic circuits, which we refer to as the representational model. The most basic form, and also the best performing, is a static circuit: fixed logic and hard-coded wiring that cannot be altered at runtime. However, there is
a wide spectrum when moving from static to a fully dynamic
design. The first level of dynamism is named parametrized
circuits. In the context of stream processing, this translates
to the ability of changing the selection or join conditions at
runtime without the need to re-synthesize the design.
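In software terms, a parametrized circuit behaves like the following sketch (illustrative only): the comparison datapath is fixed once, while the field, operator, and constant act as writable parameter registers.

```python
import operator

# Sketch of a "parametrized circuit": the comparison logic is fixed
# (synthesized once), while operator and constant live in parameter
# registers that can be rewritten at runtime; names are illustrative.
class ParametrizedSelect:
    OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
           ">": operator.gt, ">=": operator.ge}

    def __init__(self, field, op, constant):
        self.configure(field, op, constant)

    def configure(self, field, op, constant):
        # equivalent to writing new parameter registers; no re-synthesis needed
        self.field, self.op, self.constant = field, self.OPS[op], constant

    def process(self, tup):
        return self.op(tup[self.field], self.constant)

sel = ParametrizedSelect("age", ">", 25)
```

Reconfiguring `sel` at runtime stands in for rewriting the parameter memory of the synthesized block, as opposed to generating and synthesizing a new circuit.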
To accelerate complex event processing, fpga-ToPSS [29],
[24], [26] focuses on hiding the latency of accessing off-chip
memory (used for storing dynamic queries) by exploiting fast
on-chip memory (for storing static queries). Furthermore, the
on-chip memory is coupled with computation using horizontal
data partitioning, which eliminates global communication cost.
To accelerate XML processing, the notion of skeleton automata [30] allows the decomposition of non-deterministic finite-state automata encoding XPath queries into a static structural skeleton embedded in logic gates, while query conditions are dynamically stored in memory at runtime.
Ibex [31] provides co-operation between hardware (to par-
tially push and compute queries on FPGAs) and software
(by pre-computing and passing arbitrary Boolean selection
conditions onto hardware).

Fig. 6: Comparing standard vs. flexible query execution pipelines on a reconfigurable computing fabric. Common FPGA-based solutions must apply changes in the hardware model (time-consuming and costly: hours to months, or minutes to days), halt normal system operation, reprogram the FPGA (seconds to minutes), and resume operation with data flow control; FQP instead maps new operators (microseconds to milliseconds) and applies them (microseconds) at runtime.

Online-Programmable Blocks (OP-Blocks) [13], [15] introduce dynamic re-configurable logic
blocks to implement selection, projection, and join operations,
where the conditions of each operator can seamlessly be ad-
justed at runtime. A collection of OP-Blocks and general Cus-
tom Blocks (CBs) together form the Flexible Query Processor
(FQP) architecture [13], [15]. FQP is a general streaming
architecture that can be implemented in fully standalone mode
and replace traditional software solutions as demonstrated in
Figure 5. Q100 [14] supports query plans of arbitrary size
by horizontally partitioning them into fixed sets of pipelined
stages of SQL operators using the proposed temporal and
spatial instructions. IBM Netezza [28] is a major commercial
product that exploits parametrizable circuits in order to offload
query computation.
The next level of dynamism is to support schemas of varying size, as far as the wiring width is concerned, in order to connect and transfer tuples among computation cores on FPGAs. One solution is to simply provision the wiring for the maximum possible schema size, but this option is impractical given the cost of wiring and routing. To support dynamic schemas,
parametrized data segments are proposed in FQP through
vertical partitioning of both query and data given a fixed wiring
and routing budget [32].
The final level of dynamism, which we call parametrized
topology, is the ability to modify both the structure of queries (i.e., macro changes) and the conditions of each operator within a query (i.e., micro changes). This provides support
for the real-time additions and modifications of queries (see
Figure 6). FQP [13], [15] supports parametrized topologies
which allow the construction of complex queries by composing
re-configurable logic blocks in the form of OP-Blocks, where
the composition itself is also re-configurable (see the example
in Figure 7).
The algorithmic model is the lowest layer of our accel-
eration landscape. To achieve higher throughput and lower
latency, the following goals must be met. First, we must
exploit the available massive parallelism by minimizing the
design control flow. Generally speaking, there are three design patterns: data parallelism, task parallelism, and pipeline
parallelism. Simply put, data parallelism focuses on executing
the same task concurrently over partitioned data (e.g., SIMD)
while task parallelism is focused on executing concurrent and
independent tasks over replicated/partitioned data (e.g., Intel
hyper-threading). The pipeline parallelism, arguably the most
important design pattern on hardware, is to break a task into a
sequence of sub-tasks, where the data flows from one sub-task
to the next.

Fig. 7: An example of query plan assignment in FQP: a customer stream filtered by selections (Age > 25, Gender = female) is joined with a product stream over product ID using window sizes 1536 and 2048, producing two outputs mapped onto OP-Blocks.

The second algorithmic goal is to minimize the
wiring and routing overhead to maximize the data flow. The
third algorithmic goal is to maximize the memory bandwidth
through tight coupling of memory and logic and exploiting
the memory hierarchy effectively starting from the fastest to
slowest memory, namely, registers, on-chip memory, and off-chip memory. Notably, faster memory often equates to lower density, implying that the amount of memory available in registers is substantially smaller than that of on-chip memory, and likewise for on-chip versus off-chip memory.
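The three parallelism patterns can be contrasted with a small cycle-level simulation of pipeline parallelism (a toy model, not a hardware description): splitting a task into sub-tasks adds a few cycles of latency per tuple, but once the pipeline is full a finished result emerges every cycle.

```python
# Toy cycle-level simulation of pipeline parallelism: a task is split into
# sub-tasks (stages); every cycle each stage processes one item and passes
# it to the next stage through a pipeline register.
def simulate_pipeline(stages, inputs):
    n = len(stages)
    regs = [None] * n            # pipeline registers holding each stage's output
    outputs = []                 # (cycle, value) pairs as results leave the pipe
    pending = list(inputs)
    cycle = 0
    while pending or any(r is not None for r in regs):
        if regs[-1] is not None:             # a finished result leaves the pipeline
            outputs.append((cycle, regs[-1]))
        for i in range(n - 1, 0, -1):        # shift items one stage forward
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        cycle += 1
    return outputs

# two sub-tasks: latency is 2 cycles, but one result completes per cycle
results = simulate_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
```

Here the first result appears after two cycles (the pipeline depth), and subsequent results appear in consecutive cycles, which is exactly the throughput-over-latency trade-off described above.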
For instance, to support multi-query optimization, a global
query plan based on Rete-like network is constructed to exploit
both inter- and intra-query parallelism, essentially leveraging
all forms of parallelisms mentioned earlier including data,
task, and pipeline parallelism. For faster data pruning in order
to improve memory bandwidth, hash-based indexing methods
are employed to efficiently identify all relevant queries given
incoming tuples [12]. To avoid designing complex adaptive
circuitry, Ibex [31] proposes precomputation of a truth table
for Boolean expressions in software first and transfer the truth
table into hardware during FPGA configuration when a new
query is inserted. There also exist a number of works focused on stream joins, one of the most expensive operations in real-time analytics. For example, handshake join [33] introduced a new data flow that naturally directs streams in opposite directions in order to simplify the control flow, which we categorize as bi-directional flow (bi-flow), while SplitJoin [34] introduced a novel top-down flow, which we refer to as uni-directional flow (uni-flow), that completely eliminates control flow among join cores and allows them to operate independently. The rest of the paper focuses on stream joins as a case study. Our experience working with stream joins on FPGAs reveals conclusions that can be generalized to other algorithms and the entire acceleration landscape.
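The Ibex-style Boolean precomputation mentioned above can be illustrated as follows (a sketch of the idea, not Ibex's actual interface): software enumerates an arbitrary selection predicate into a truth table once, when the query is inserted, and the hardware then evaluates it with a single table lookup.

```python
from itertools import product

# Sketch of truth-table precomputation: software builds the table for an
# arbitrary Boolean predicate; the "circuit" then needs only a memory read.
def build_truth_table(predicate, num_inputs):
    return [predicate(*bits) for bits in product((False, True), repeat=num_inputs)]

def hw_lookup(table, *bits):
    # pack the input bits into an index, most significant bit first
    index = 0
    for b in bits:
        index = (index << 1) | int(b)
    return table[index]

# (a AND b) OR NOT c, precomputed once in software when the query is inserted
table = build_truth_table(lambda a, b, c: (a and b) or not c, 3)
```

The table grows exponentially in the number of predicate inputs, which is why this technique targets small selection conditions rather than arbitrary expressions.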
The stream join operation inherits relational database join semantics by introducing a sliding window to transform unbounded streams, R and S, into finite sets of tuples (i.e., relations). Although sliding window semantics provide a robust abstraction to deal with the unbounded nature of data streams [35], [33], [36], [37], [38], [39], [13], [12], it remains a challenge to improve parallelism within stream join processing, especially when leveraging many-core systems.
Fig. 8: Parallel stream join models: (a) bi-flow and (b) uni-flow.
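The sliding-window semantics can be sketched as a count-based equi-join over two streams (the window size and key function below are illustrative choices, not mandated by the cited papers):

```python
from collections import deque

# Toy count-based sliding-window equi-join over two unbounded streams:
# a new tuple is compared against the entire opposite window, then stored
# in its own window, whose oldest tuple expires automatically.
class StreamJoin:
    def __init__(self, window_size, key):
        self.window_r = deque(maxlen=window_size)  # bounded: eviction on overflow
        self.window_s = deque(maxlen=window_size)
        self.key = key

    def insert(self, tup, stream):
        own, other = ((self.window_r, self.window_s) if stream == "R"
                      else (self.window_s, self.window_r))
        matches = [(tup, o) for o in other if self.key(tup) == self.key(o)]
        own.append(tup)
        return matches
```

Parallelizing this operator amounts to splitting `window_r` and `window_s` into sub-windows across join cores, which is precisely the coordination challenge the bi-flow and uni-flow models address.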
For instance, a single sliding window could conceptually be
divided into many smaller sub-windows, where each sub-
window could be assigned to a different join core.1 However,
distributing a single logical stream into many independent
cores introduces a new coordination challenge: to guarantee
that each incoming tuple in one stream is compared exactly
once with all tuples in the other stream. To cope with paral-
lelism in this context, two key approaches have been proposed:
handshake join [33] and SplitJoin [34].
Bi-directional Data Flow For Stream Join: The coordination challenge is addressed by handshake join [33] (and its equivalent hardware realization as OP-Chain, presented in [15]), which transforms the stream join into a data flow (bi-flow) problem (evolved from Kang's three-step procedure [40]): tuples flow from left to right (for the S stream) and from right to left (for the R stream) and pass through each join core. We refer to this data flow as bi-directional, since tuples pass in two directions through the join cores.
We broadly classify all join techniques based on similar designs as belonging to the bi-flow model; this data flow model provides scalable parallelism to increase processing throughput. The bi-flow design ensures that every tuple is compared exactly once by construction (shown in Figure 8). This data flow model offers greater processing throughput by increasing parallelism, yet suffers from increased latency since the processing of a single incoming tuple requires a sequential flow through the entire processing pipeline. To
improve latency, a central coordination is introduced to fast-
forward tuples through the linear chain of join cores in the
low-latency handshake join [36] (the coordination module is
depicted in the left side of Figure 8). In order to reduce latency,
each tuple of each stream is replicated and forwarded to the
next join core before the join computation is carried out by
the current core [36].
Figure 10 demonstrates the internal blocks and their connections in a join core needed to realize the bi-flow model. The Window Buffers (-R & -S) are sliding window buffers, and the Buffer Managers (-R & -S) are responsible for sending/receiving tuples to/from neighboring join cores and storing/expiring tuples to/from the window buffers. The
Coordinator Unit controls permissions and priorities to
manage data communication requests to ensure query pro-
cessing correctness; finally, the Processing Unit is the
execution core of the join core. It is activated when a new
tuple is inserted. Then, based on programmed queries (here, a
join operator), it processes the new tuple by comparing it to
the entire window of the other stream.
1A join core is an abstraction that could apply to a processor’s core, a
compute node in a cluster, or a custom-hardware core on FPGAs (e.g., [33],
[12], [13].)
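The bi-flow behavior of a chain of such join cores can be approximated in software as follows. This is a coarse sketch that keeps the opposite-direction traversal and exactly-once comparison while omitting window expiry and the cycle-level buffering and coordination described above; the names are ours.

```python
# Coarse software sketch of the bi-flow model: join cores form a chain,
# S tuples traverse it left to right and R tuples right to left, so each
# tuple meets the opposite sub-window of every core exactly once.
class BiFlowCore:
    def __init__(self, capacity):
        self.window_r, self.window_s = [], []
        self.capacity = capacity

def biflow_insert(cores, tup, stream, key):
    matches = []
    order = list(cores) if stream == "S" else list(reversed(cores))
    target = None
    for core in order:
        # probe the opposite stream's sub-window in this core
        other = core.window_r if stream == "S" else core.window_s
        matches += [(tup, o) for o in other if key(tup) == key(o)]
        own = core.window_s if stream == "S" else core.window_r
        if target is None and len(own) < core.capacity:
            target = own  # first core along the path with room stores the tuple
    if target is not None:
        target.append(tup)
    return matches
```

The latency cost of bi-flow is visible in the loop: a tuple's result is complete only after it has visited every core in the chain.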
Fig. 9: Uni-flow parallel stream join hardware architecture: a hierarchical distribution network of DNodes broadcasts incoming tuples to join cores holding sub-windows of R and S, and a result gathering network of GNodes collects their outputs.
Uni-directional Data Flow For Stream Join: A funda-
mentally new approach for computing a stream join, referred
to as SplitJoin, has been proposed in [34]. SplitJoin essentially
changes the way tuples enter and leave the sliding windows,
namely, by dropping the need to have separate left and right
data flows (i.e., bi-directional flow). SplitJoin departs from the bi-directional, data-flow-oriented processing of existing approaches [33], [36]. As illustrated in Figure 8, SplitJoin introduces a single top-down data flow that transforms the overall distributed architecture. First, the join cores are no longer chained linearly (avoiding linear latency overhead); in fact, they are now completely independent (also avoiding inter-core communication overhead). Second, both streams travel through a single path entering each join core, eliminating the complexity of potential race conditions caused by in-flight tuples and of ensuring the correct tuple-arrival order: by relying on the FIFO property of a single (logical) path, the ordering requirement is trivially satisfied. Therefore, SplitJoin does not rely on
any central coordination for propagating and ordering the
input/output streams. Third, the communication path can be
fully utilized to sustain the maximum throughput and each
tuple no longer needs to pass through every join core. SplitJoin
introduces what we, here, refer to as the unidirectional (top-
down) data-flow model, uni-flow, in contrast to the bi-flow
model, introduced earlier.
To parallelize join processing, the uni-flow model divides
the sliding window into smaller sub-windows, and each sub-window is assigned to a join core. In the uni-flow model, all join cores receive each new incoming tuple. In each join core, depending on the tuple origin, i.e., whether it is from the R or S stream, the distributed processing and storage steps are orchestrated accordingly. For example, if the incoming tuple belongs to the R stream, TR, then all processing cores compare TR to all tuples in the S sub-windows. At the same time, TR is stored in exactly one join core's sub-window for the R stream. The join core that stores a newly arriving tuple is chosen based on a round-robin scheme. Each join core independently counts (separately for each stream) the number of tuples received and, based on its position among the other join cores, determines its turn to store an incoming tuple. This storage mechanism eliminates the need for a central coordinator for tuple assignment, which is a key contributing factor in achieving (almost) linear scalability in the uni-flow model as the number of join cores increases.
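The uni-flow storage and processing steps can be sketched in software as follows (our simplification; identifiers are illustrative): every core sees every tuple, probes its own sub-window of the opposite stream, and uses a local counter to decide, round-robin, whether the tuple is its turn to store.

```python
from collections import deque

# Sketch of the uni-flow model: all join cores receive each incoming tuple,
# probe their opposite sub-window, and exactly one core stores the tuple,
# chosen by an independent per-core round-robin count (no coordinator).
class UniFlowCore:
    def __init__(self, core_id, num_cores, sub_window):
        self.core_id, self.num_cores = core_id, num_cores
        self.win = {"R": deque(maxlen=sub_window), "S": deque(maxlen=sub_window)}
        self.seen = {"R": 0, "S": 0}   # local tuple counters, one per stream

    def receive(self, tup, stream, key):
        other = "S" if stream == "R" else "R"
        matches = [(tup, o) for o in self.win[other] if key(tup) == key(o)]
        # store only when the local counter says it is this core's turn
        if self.seen[stream] % self.num_cores == self.core_id:
            self.win[stream].append(tup)
        self.seen[stream] += 1
        return matches

def broadcast(cores, tup, stream, key):
    results = []
    for core in cores:          # the distribution network delivers to all cores
        results += core.receive(tup, stream, key)
    return results
```

Because every core reaches the same storage decision from its own counter, no messages pass between cores, which is the property behind the model's near-linear scalability.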
We focus on the design and realization of a parallel and distributed stream join in hardware based on these flow-based models. Furthermore, we compare the join core internal architecture for both the uni-flow and bi-flow models. Our parallel uni-flow hardware stream join architecture comprises three main
parts: (1) distribution network, (2) processing (join cores), and
(3) result gathering network, as shown in Figure 9.
Distribution Network: The distribution network is respon-
sible for transferring incoming tuples from the system input to
all join cores. In this work, we present two alternatives for this
network: (1) a lightweight design and (2) a scalable design.
The lightweight network distributes incoming tuples to all join cores at once without extra components, which is preferable for comparatively small deployments, while the scalable variant uses a hierarchical architecture for distribution.
Here, we only present the design of the scalable network, while
our experiments include evaluations for both.
In the scalable distribution network, we use DNodes to build a hierarchical network. A DNode receives a tuple on its input port and broadcasts it to all its output ports. All DNodes rely on the same communication protocol, making it straightforward to scale the design by cascading them. DNodes store incoming tuples as long as their internal buffer is not full. As output, each DNode sends out the stored tuples, one tuple per clock cycle, provided the next DNodes are not full. The upper part of Figure 9 demonstrates the distribution network comprised of DNodes. Here, we see a 1-to-2 fan-out from each level to the next, from top to bottom. Other fan-out sizes (e.g., 1-to-4) could be interesting to explore since they reduce the height of the distribution network and lower communication latency.
The scalable distribution network consumes more resources
(i.e., DNodes pipeline buffers) than the lightweight variant
and adds a few clock cycles, depending on its height, to the
distribution latency (though it does not affect the tuple inser-
tion throughput). On the other hand, the scalable distribution
network pays off as the number of join cores increases, since
it does not suffer from the clock frequency drop (degrading
the performance) as observed in the lightweight design.

Fig. 10: Bi-flow join core design.
Fig. 11: Uni-flow join core design.
Fig. 12: Storage core.
Fig. 13: Processing core.
The DNodes arrangement, as shown in Figure 9, forms
a pipelined distribution network. Utilizing more join cores
logarithmically increases the number of pipeline stages. This
means it takes more clock cycles for a tuple to reach the join
cores. Nonetheless, a constant transfer rate of tuples from one
pipeline stage to the next keeps the distribution throughput
intact, regardless of the number of stages.
Join Cores: In our parallel hardware design, the actual
stream join processing is performed in join cores. Each join
core individually implements the original join operator (with-
out posing any limitation on the chosen join algorithm, e.g.,
nested-loop join or hash join) but on a fraction of the original
sliding window.
The internal architecture of our hardware join core based
on the uni-flow model is shown in Figure 11, and it has
Fetcher, Storage Core, and Processing Core as
its main building blocks. Fetcher is an intermediate buffer
that separates a join core from the distribution network. This
reduces the communication path latency and improves the
maximum clock frequency. The Storage Core is responsi-
ble for storing (and subsequently expiring) tuples from the R or S
streams into their corresponding sub-window. The distribution
task of assigning tuples to each Storage Core is performed
in a round-robin fashion. The Storage Core remembers
the number of tuples received from each stream and (by
knowing the number of join cores and its own position among
them) stores a tuple only on its turn.
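The round-robin turn test each Storage Core applies can be sketched as below; this is an illustrative model, and the function name is ours.

```python
def is_my_turn(tuples_seen: int, num_join_cores: int, my_position: int) -> bool:
    """A Storage Core counts the tuples received from a stream and
    stores a tuple only when the round-robin turn matches its own
    position among the join cores."""
    return tuples_seen % num_join_cores == my_position

# With 4 join cores, tuples 0, 4, 8, ... land in core 0; 1, 5, 9, ... in core 1:
stored_by_core_1 = [t for t in range(12) if is_my_turn(t, 4, 1)]
print(stored_by_core_1)  # [1, 5, 9]
```

Because every core sees every tuple, no central dispatcher is needed: each core can decide locally whether a tuple falls into its own sub-window.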
The state diagram for the Storage Core controller is
presented in Figure 12. A join operator can be dynamically
programmed without the need for synthesis (individually for
each join core) by an instruction which has two segments. The
first segment defines join parameters such as the number of
join cores and the current join core position among them, while
the second segment carries the join operator conditions. The
join operator programming is performed in Operator Store
1 and Operator Store 2. This makes it possible to update
the current join operator in real-time. After programming,
Storage Core is ready to accept tuples from both streams.
When a tuple is received from the S stream and it is the
current join core's turn to store, the tuple is stored in the
corresponding sub-window in the Store in Window S state;
otherwise, the storage task is skipped by moving to the S Store
Done state. Tuples received from the R stream are handled
identically.
The Processing Core performs the actual join
execution: each new tuple (or a chunk of tuples) is compared
against all tuples in the sub-window of the other stream, which
are pulled out one at a time.
Figure 13 presents the state diagram of the Processing
Core controller. In the initial step, it reads the join operator
in Operator Read 1 and Operator Read 2 states. The actual
comparison is performed in the Join Processing state, where
tuples are read from their corresponding sub-window, one read
per cycle. In case a match is found, it is emitted in Emit Result.
Finally, at the end of processing, the controller waits in the
Join Wait state for another tuple to process. After a new tuple
reception, the whole execution repeats itself by means of a
direct transition to the Join Processing state.
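The combined store-and-probe behavior of one join core can be sketched as follows; this is a behavioral model for illustration (class and method names are ours), not the VHDL design, and it abstracts the state machines of Figures 12 and 13 into two steps.

```python
from collections import deque

class JoinCore:
    """Behavioral sketch of one uni-flow join core: it keeps a fraction
    of each sliding window and probes the opposite sub-window one tuple
    at a time, as the Processing Core does in the Join Processing state."""

    def __init__(self, sub_window_size: int):
        # deque(maxlen=...) expires the oldest tuple automatically
        self.sub_window = {'R': deque(maxlen=sub_window_size),
                           'S': deque(maxlen=sub_window_size)}

    def process(self, stream: str, tup, predicate, store: bool):
        other = 'S' if stream == 'R' else 'R'
        # one comparison per "clock cycle" against the opposite sub-window
        matches = [(tup, old) for old in self.sub_window[other]
                   if predicate(tup, old)]
        if store:  # only on this core's round-robin turn
            self.sub_window[stream].append(tup)
        return matches

core = JoinCore(sub_window_size=4)
core.process('R', 7, lambda a, b: a == b, store=True)
print(core.process('S', 7, lambda a, b: a == b, store=True))  # [(7, 7)]
```

Note that every tuple is probed against every core's sub-window, but stored in only one of them, so the union of the sub-windows reconstructs the full sliding window.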
Result Gathering Network: The result gathering network
is responsible for collecting result tuples from join cores. Sim-
ilar to the distribution network, we propose (1) a lightweight
and (2) a scalable alternative for this network. We focus on
the latter in our description, while comparing both in the
evaluation section.
The lower part of Figure 9 demonstrates the design of
the result gathering network using GNodes. Each GNode
collects resulting tuples from two sources connected to its
two upper ports using a Toggle Grant mechanism that toggles
the collection permission for its previous nodes in each clock
cycle. As a result, each source (i.e., a join core or a previous
GNode) pushes out a resulting tuple to the next GNode once
every two clock cycles.
The Toggle Grant mechanism simplifies the design of the
result collection; instead of using a bidirectional handshake
between two connected GNodes to transfer a resulting tuple,
we use unidirectional signaling, in which each GNode only
checks the permission to push out one of the results stored
in its buffer. The destination (next) GNode simply toggles this
permission each cycle without the need for any special control
logic. The GNodes configuration, as shown in Figure 9, provides a
pipelined result gathering network where, in each stage, tuples
from two sources are merged into one and are pushed to the
next pipeline stage from top to bottom (respecting our uni-
directional top-down flow model). The pipelining mechanism
reduces the effective fan-in size in each pipeline stage and
thereby prevents clock frequency drop.
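The Toggle Grant merge performed by a GNode can be sketched as below; this is an illustrative cycle-by-cycle model (the function name is ours), not the hardware itself.

```python
from collections import deque

def gnode_merge(left: deque, right: deque) -> list:
    """Sketch of a GNode with the Toggle Grant mechanism: the grant
    simply alternates between the two upper ports every cycle, so each
    source pushes at most one result every two cycles and no handshake
    is needed."""
    out, grant_left = [], True
    while left or right:
        if grant_left and left:
            out.append(left.popleft())
        elif not grant_left and right:
            out.append(right.popleft())
        grant_left = not grant_left  # toggle the permission each cycle
    return out

print(gnode_merge(deque(['r1', 'r3']), deque(['r2', 'r4'])))
# ['r1', 'r2', 'r3', 'r4']
```

If the grant points at an empty source, that cycle is simply wasted, which is consistent with each source emitting at most once every two clock cycles.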
In Figure 9, arrows in the distribution and result gathering
network are data buses that define the width of received tuples,
Fig. 14: Throughput measurements for flow-based stream join on hardware: (a) uni-flow (V5, F: 100MHz); (b) JCs: 16, V5, F: 100MHz; (c) uni-flow (V7, F: 300MHz); (d) uni-flow on software.
including their 2-bit headers. The header defines whether we
are dealing with a new join operator or a tuple belonging to
either the R or S stream. The width of the data bus for result
tuples is twice the size of the input data bus (not counting the
header), since a result is comprised of two input tuples that
have met the join condition(s).
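As a concrete illustration, a header-plus-payload bus word could be packed as below. The particular header encodings and helper names are our assumptions for illustration; the paper only specifies a 2-bit header over 64-bit tuples.

```python
# 2-bit header distinguishing a new join operator from R- and S-stream
# tuples. These encodings are illustrative assumptions, not the paper's.
HDR_NEW_OP, HDR_R, HDR_S = 0b00, 0b01, 0b10
PAYLOAD_BITS = 64  # tuple width used in the experiments

def pack(header: int, payload: int) -> int:
    """Prepend the 2-bit header to a 64-bit tuple, as on the input bus."""
    return (header << PAYLOAD_BITS) | (payload & ((1 << PAYLOAD_BITS) - 1))

def unpack(word: int):
    """Split a bus word back into its (header, payload) fields."""
    return word >> PAYLOAD_BITS, word & ((1 << PAYLOAD_BITS) - 1)

word = pack(HDR_R, 0xDEADBEEF)
print(unpack(word))  # (1, 3735928559)
```

A result-bus word would carry two such 64-bit payloads side by side, which is why the result bus is twice the input width plus the header.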
Flow-based Hardware Design Comparison: In the uni-
flow model, data passes in a single top-down flow, in which
each join core receives tuples directly from the distributor and
operates independently from other join cores (Figure 8b). The
uni-flow design offers full utilization of the communication
channel’s bandwidth, specifically from the input to each join
core, since all tuples travel over the same path to each join
core. Therefore, regardless of the incoming tuple rate for each
stream, every tuple has access to the full bandwidth. To clarify
this issue for the bi-flow model, assume we are receiving tuples
only from stream R; then, all communication channels for
stream S are left unutilized. Even with an equal tuple rate
for both streams, it is impossible to achieve simultaneous
transmission of both T_R and T_S between two neighboring
join cores due to the locks needed to avoid race conditions.
Furthermore, comparing the internal design of a join core
based on the bi-flow model (Figure 10) with one based on the
uni-flow model (Figure 11), we see a significant reduction in the number of
internal components that correspondingly reduces the design
complexity. Neighbor-to-neighbor tuple traveling circuitries
for two streams are eliminated from Buffer Manager-R
& -S and Coordination Unit, as they are reduced and
merged to form Fetcher and Storage Core in the join
core based on the uni-flow model. This improvement also re-
duces the number of I/Os from five to two, which significantly
reduces the hardware complexity, as the number of I/Os is
often an important indication of complexity and final cost of
a hardware design.
For hardware experiments, we synthesized and programmed
our solution on a ML505 evaluation platform featuring a
Virtex-5 XC5VLX50T FPGA. Additionally, we synthesized
our solution on a more recent VC707 Evaluation board featur-
ing a Virtex-7 XC7VX485T FPGA. For software experiments,
we used a 32-core Dell PowerEdge R820 featuring 4× Intel
E5-4650 (TDP: 130W) processors and 32× 16GB (PC3-
12800) memory, running Ubuntu 14.04.2 LTS.
We realized the parallel stream join based on the bi-flow model
using a simplified OP-Chain topology from FQP², proposed
in [15]. We used the Xilinx synthesis tool chain to synthe-
size, map, place, and route both of the bi-flow and uni-flow
parallel hardware realizations and loaded the resulting bit file
onto our FPGA using a JTAG interface. This bit file contains
all required information to configure the FPGA.
The input streams consist of 64-bit tuples that are joined
against each other using an equi-join, though there is no
limitation on the condition(s) used. Both realizations can
accommodate larger tuples, whose width is defined by pre-
synthesis parameters.
Throughput Evaluation: Throughput measurements for bi-
flow and uni-flow parallel stream join realizations are presented
in Figure 14. For the uni-flow version, we were able to
instantiate 16 join cores on our platform with up to a 2^13
window size (per stream), as we see in Figure 14a. We observe
a linear speedup with respect to the number of join cores, as
expected. We were not able to realize window sizes larger
than 2^11 when instantiating 32 and 64 join cores due to the
extra consumption of memory resources in the distribution and
result gathering networks and auxiliary components.
Figure 14b presents the comparison between the input
throughput in a parallel stream join based on uni-flow and
bi-flow models as we change the window size. We observe
nearly an order of magnitude speedup when using a uni-
flow compared to a bi-flow model. Although in theory, both
models are similar in their parallelization concept, the simpler
architecture in uni-flow brings superior performance. We were
not able to instantiate 16 join cores with 2^13 windows in the
bi-flow hardware, unlike the uni-flow one, because each bi-flow
join core is more complex and requires more resources to realize.
Figure 14c presents extracted (from a synthesis re-
port) throughput on a mid-size, but more recent, Virtex-7
(XC7VX485T) FPGA. We were able to realize a uni-flow
parallel stream join with as many as 512 join cores and
window sizes as large as 2^18. We used a 300MHz clock
frequency for this evaluation as provided by the synthesis
report. As a result of having more join cores and a higher
clock frequency, we see acceleration of around two orders of
magnitude when we utilize a window size of 2^13 compared to
the realization on Virtex-5 (Figure 14a).

²FQP is available in VHSIC Hardware Description Language (VHDL).

Fig. 15: Latency reports for uni-flow parallel stream join on hardware.
Fig. 16: Latency reports for uni-flow on software.
Fig. 17: Uni-flow clock frequencies on Virtex-5 and 7.
We ran our experiments on the software realization of a uni-
flow parallel stream join (available from [34]) and the through-
put results are presented in Figure 14d for 16 and 28 join cores.
Similar to the experimental setup in [34], the maximum input
throughputs were achieved while using 28 cores out of 32 on
our platform since some internal components in SplitJoin, i.e.,
the distribution and result gathering network, also consume a
portion of the processors’ capacity.
Although the operating frequency of the Virtex-7 FPGA is
significantly lower than that of the processors used in our
system, 300MHz compared to the processor base frequency
of 2.7GHz and a max turbo of 3.3GHz, we still observed
around 15× acceleration compared to the software realization
(28 join cores) while using the same window size (2^18) on
both platforms (Figures 14c and 14d). Two factors are the main
contributors to this throughput gain: (1) the ability to instantiate
more join cores than in the software version, which operate in
parallel and increase the processing throughput linearly with
their count; and (2) the use of internal BRAMs in the FPGA,
essentially coupling data and processing in each join core,
while in the software variant, the sliding window data resides
in main memory and has to move back and forth through the
memory hierarchy for each incoming tuple.
Latency Evaluation: We refer to latency as the time it
takes to process and emit all results for a newly inserted
tuple. We mainly focus on latency comparisons between the
hardware and software realizations of the uni-flow model for a
parallel stream join. The measurements for these comparisons
are shown in Figures 15 and 16.
Figure 15 captures the latency observed with respect to
the number of clock cycles and the execution time (in µseconds).
For realization on Virtex-5 (V5), we used the lightweight
distribution and result gathering networks since the system is
relatively small; however, for the synthesized design on Virtex-
7 (V7), we have reports for both the lightweight and scalable
(denoted by an 's' in the figures) variants of the distribution
and result gathering networks.
As we increase the number of join cores, we do not observe
a significant difference in the number of cycles required to
process a tuple in either realization. The distribution network
in the lightweight design requires fewer cycles to transfer in-
coming tuples to all join cores, while in the scalable version, a
tuple has to travel through multiple distribution levels (log2 N,
where N is the number of join cores) to reach the join cores;
however, this advantage is neutralized by the greater latency
in the lightweight result gathering network. The cost of round-
robin collection from join cores, one after another, quickly
becomes dominant as we approach larger numbers of join
cores. However, by taking into account the clock frequency
drop in the lightweight solution as we increase the number of
join cores, the actual difference in latency becomes significant,
as shown in Figure 15.
Utilizing a window size of 218 for each stream, the hardware
version (Figure 15) shows around two orders of magnitude
improvement in latency compared to the software variant
(Figure 16), mainly due to massive parallelism and memory
and processing coupling.
Scalability Evaluation: The scalability of a hardware de-
sign is determined by how the maximum operating clock
frequency is affected as we scale up the system. Here, we scale
up our solution by increasing the number of join cores, which
translates to a linear increase in processing throughput.
Figure 17 shows how clock frequency changes as we
increase the number of join cores for lightweight versions
on Virtex-5 (V5) and Virtex-7 (V7) and scalable version on
Virtex-7 (V7s). For the realization on our Virtex-5 FPGA, we
do not see any significant drop as we increase the number of
join cores. Although we are using the lightweight version, the
system size (number of join cores) is too small to expose its
effect; in fact, we even see an increase in the clock frequency
when utilizing 16 join cores, due to the heuristic mapping
algorithms adopted by the synthesis tool³.
For larger uni-flow-based realizations with more join cores,
we see how the clock frequency of the lightweight version
drops as we increase the number of join cores. Since the
Virtex-7 FPGA supports higher clock frequencies, compared
to the Virtex-5, it is more sensitive to large fanout sizes and
longer signal paths; therefore, we see this effect even when
using 8 and 16 join cores. For the hardware realization based
on uni-flow with scalable distribution and result gathering
networks, we observe no significant variations in the clock
frequency as we scale up the system.
Power Consumption Evaluation: The extracted power
consumption reports when using 16 join cores with a total
³Using more restrictions (such as a higher clock constraint, e.g.,
190MHz), it is possible to achieve higher clock frequencies when utilizing
fewer join cores.
Fig. 18: Heterogeneous hardware virtualization.
window size of 2^13 (for each stream) indicate 1647.53mW
and 800.35mW for the parallel stream joins based on the bi-
flow and uni-flow models, respectively. As expected, the simpler
design and correspondingly smaller circuit size resulted in more
than 50% power savings for uni-flow compared to bi-flow.
The adoption of hardware accelerators, in particular FPGAs,
is gaining momentum in both industry and academia. Cloud
providers such as Amazon are building new distributed in-
frastructure that offers FPGAs⁴ connected to general-purpose
processors via PCIe [41]. Furthermore,
the employed FPGAs share the same memory address space
with CPUs, which could give rise to many novel applications
of FPGAs by using co-placement and co-processor strategies.
Microsoft’s Configurable Cloud [42] also uses a layer of
FPGAs between network switches and servers to filter and ma-
nipulate data flows at line-rate. Another prominent deployment
example is IBM FPGA-Based Acceleration within SuperVessel
OpenPOWER Development Cloud [43]. Google’s Tensor Pro-
cessing Unit (TPU) designed for distributed machine learning
workloads is also gaining traction in its data centers, although
current TPUs are based on ASIC chips [44].
Given that the availability of FPGAs and other forms of
accelerators (e.g., GPUs and ASICs) is now becoming the
norm even in distributed cloud infrastructures, our broader
vision (first discussed in [32]) is to identify key opportunities
to exploit the strength of available hardware accelerators given
the unique characteristics of distributed real-time analytics. As
a first step to fulfilling this vision, we start from our FQP
fabric [13], [15], a generic streaming architecture composed
of dynamically re-programmable stream processing elements
that can be chained together to form a customizable processing
topology, exhibiting a “Lego-like” connectable property. We
argue that our proposed architecture may serve as a basic
framework to explore and study the entire life-cycle of accel-
erating distributed stream processing on hardware. Here we
identify a list of important short- and long-term problems that
can be studied within our FQP framework:
1) What is the complexity of query assignment to a set of
custom hardware blocks (including but not limited to OP-
Blocks)? Note that a poorly chosen query assignment may
⁴Xilinx UltraScale+ VU9P, fabricated using a 16nm process, with
approximately 2.5 million logic elements and 6,800 Digital Signal Processing
(DSP) engines.
increase query execution time, leave some blocks un-
utilized, negatively affect energy use, and degrade the
overall processing performance.
2) How to formalize query assignment algorithmically (e.g.,
develop a cost model), and what is the relationship
between query optimization on hardware and classical
query optimization in databases? Unlike classical
query optimization and physical plan generation, on
hardware we are not limited to join reordering and
physical operator selection: there is a whole new
perspective on how to apply instruction-level and fine-
grained memory-access optimizations (through different
OP-Block implementations or processor-memory cou-
plings). What is the most efficient method for wiring
custom operators to minimize the routing distance? How
can statistics be collected and stored during query
execution while minimizing the impact on query evaluation?
And how can dynamic rewiring and movement of data be
introduced given a fixed FQP topology?
3) What is the best initial topology given a sample query
workload and a set of application requirements known
a priori? How to construct an optimal initial topology
which maximizes on-chip resource utilization (reduces
the number of OP-Blocks) or minimizes latency (reduces
routing and wiring complexity among OP-Blocks)?
4) Given the topology and the query assignment formalism,
is it then possible to generalize the query mapping from
single-query optimization to multi-query optimization to
amortize the execution cost across the shared processing
of several queries? Given the dynamic nature of
FQP and a lightweight statistics collection capability, the
natural question is: How to dynamically re-optimize
and re-map queries onto the FQP topology (in light of
observed statistics) in order to maximize inter- and intra-
query optimizations that are inspired by the capabilities
of custom stream processors?
5) How do we extend query execution on hardware to co-
placement and/or co-processor designs by distributing
and orchestrating query execution over heterogeneous
hardware (each with unique characteristics) such as
CPUs, FPGAs, and GPUs? The key insight is how to
superimpose FQP abstraction over these heterogeneous
compute nodes in order to hide their intricacy and to
virtualize the computation over them, as illustrated in
Figure 18. An important design decision arises as to how
these distributed and virtualized devices communicate
and whether they are placed on a single board, thus
having at least a shared external memory space or
distributed shared memory through RDMA; placed on
multiple boards connected through interfaces such as
PCIe; or spread across entire data centers connected
through the network.
This research was supported by the Alexander von Hum-
boldt Foundation.
[1] “Apache Hadoop,” Apache Software Foundation, 2006. [Online].
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
“Spark: Cluster computing with working sets,” HotCloud, 2010.
[3] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and
K. Tzoumas, “Apache Flink: Stream and batch processing in a single
engine,” Data Engineering, 2015.
[4] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulka-
rni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and
D. Ryaboy, “Storm@Twitter,” SIGMOD, 2014.
[5] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica,
“Discretized streams: Fault-tolerant streaming computation at scale,”
SOSP, 2013.
[6] “Symmetric key cryptography on modern graphics hardware,” Ad-
vanced Micro Devices, Inc., 2008.
[7] “Enhanced Intel SpeedStep technology for the Intel Pentium M proces-
sor,” Intel, Inc., 2004.
[8] R. Mueller, J. Teubner, and G. Alonso, “Data processing on FPGAs,”
VLDB, 2009.
[9] ——, “Streams on wires: a query compiler for FPGAs,” VLDB, 2009.
[10] A. Hagiescu, W.-F. Wong, D. Bacon, and R. Rabbah, “A computing
origami: Folding streams in FPGAs,” DAC, 2009.
[11] T. Honjo and K. Oikawa, “Hardware acceleration of Hadoop MapRe-
duce,” BigData, 2013.
[12] M. Sadoghi, R. Javed, N. Tarafdar, H. Singh, R. Palaniappan, and H.-A.
Jacobsen, “Multi-query stream processing on FPGAs,” ICDE, 2012.
[13] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “Flexible query processor
on FPGAs,” PVLDB, 2013.
[14] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, “Q100:
The architecture and design of a database processing unit,” ASPLOS,
2014.
[15] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “Configurable hardware-
based streaming architecture using online programmable-blocks,” ICDE,
2015.
[16] Z. Wang, J. Paul, H. Y. Cheah, B. He, and W. Zhang, “Relational query
processing on OpenCL-based FPGAs,” FPL, 2016.
[17] O. Segal, M. Margala, S. R. Chalamalasetti, and M. Wright, “High level
programming framework for FPGAs in the data center,” FPL, 2014.
[18] S. Breß, H. Funke, and J. Teubner, “Robust query processing in co-
processor-accelerated databases,” SIGMOD, 2016.
[19] T. Gomes, S. Pinto, T. Gomes, A. Tavares, and J. Cabral, “Towards an
FPGA-based edge device for the Internet of Things,” ETFA, 2015.
[20] C. Wang, X. Li, and X. Zhou, “SODA: Software defined FPGA based
accelerators for big data,” DATE, 2015.
[21] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Hasel-
man, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus,
E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger,
“A reconfigurable fabric for accelerating large-scale datacenter services,”
ISCA, 2014.
[22] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers,
M. Haselman, S. Heil, M. Humphrey, P. Kaur, J. Y. Kim, D. Lo,
T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka,
D. Chiou, and D. Burger, “A cloud-scale acceleration architecture,”
MICRO, 2016.
[23] R. R. Bordawekar and M. Sadoghi, “Accelerating database workloads
by software-hardware-system co-design,” ICDE, May 2016.
[24] M. Sadoghi, M. Labrecque, H. Singh, W. Shum, and H.-A. Jacobsen,
“Efficient event processing through reconfigurable hardware for algo-
rithmic trading,” PVLDB, 2010.
[25] L. Woods, J. Teubner, and G. Alonso, “Complex event detection at wire
speed with FPGAs,” PVLDB, 2010.
[26] M. Sadoghi, H. Singh, and H.-A. Jacobsen, “Towards highly parallel
event processing through reconfigurable hardware,” DaMoN, 2011.
[27] D. Sidler, Z. Istvan, M. Ewaida, and G. Alonso, “Accelerating pattern
matching queries in hybrid CPU-FPGA architectures,” SIGMOD, 2017.
[28] “IBM Netezza 1000 data warehouse appliance.” IBM, Inc., 2009.
[29] M. Sadoghi, H. P. Singh, and H.-A. Jacobsen, “fpga-ToPSS: Line-speed
event processing on FPGAs,” DEBS, 2011.
[30] J. Teubner, L. Woods, and C. Nie, “Skeleton automata for FPGAs:
reconfiguring without reconstructing,” SIGMOD, 2012.
[31] L. Woods, Z. István, and G. Alonso, “Ibex - an intelligent storage engine
with support for advanced SQL off-loading,” PVLDB, 2014.
[32] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “The FQP vision: Flexible
query processing on a reconfigurable computing fabric,” SIGMOD, 2015.
[33] J. Teubner and R. Mueller, “How soccer players would do stream joins,”
SIGMOD, 2011.
[34] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “SplitJoin: A scalable,
low-latency stream join architecture with adjustable ordering precision,”
ATC, 2016.
[35] B. Gedik, R. R. Bordawekar, and P. S. Yu, “CellJoin: A parallel stream
join operator for the cell processor,” VLDBJ, 2009.
[36] P. Roy, J. Teubner, and R. Gemulla, “Low-latency handshake join,”
PVLDB, 2014.
[37] S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main
memory hash join algorithms for multi-core CPUs,” SIGMOD, 2011.
[38] J. Teubner, G. Alonso, C. Balkesen, and M. T. Ozsu, “Main-memory
hash joins on multi-core CPUs: Tuning to the underlying hardware,”
ICDE, 2013.
[39] V. Leis, P. Boncz, A. Kemper, and T. Neumann, “Morsel-driven paral-
lelism: A NUMA-aware query evaluation framework for the many-core
age,” SIGMOD, 2014.
[40] J. Kang, J. Naughton, and S. Viglas, “Evaluating window joins over
unbounded streams,” ICDE, 2003.
[41] “Announcing Amazon EC2 F1 instances with custom FPGAs,” Ama-
zon, Inc., 2016.
[42] A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Hasel-
man, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, D. Lo, T. Massengill,
K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and
D. Burger, “A cloud-scale acceleration architecture,” MICRO, 2016.
[43] “OpenPOWER cloud: Accelerating cloud computing,” IBM Research.
[44] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga,
S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden,
M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale
machine learning,” OSDI, 2016.
... Thus, considering performance alone, RAPID is not ranked among the top solutions. Additionally, some customized systems were proposed for distributed data processing systems [18,19]. ...
Full-text available
The CPU-Accelerator heterogeneous systems have demonstrated performance and efficiency benefits on DBMSs. However, the CPU-Cache-DRAM architecture can not fully utilize the bandwidth of DRAMs such that in-memory approach get limited improvement. To overcome this drawback, it is non-trivial to customize efficient domain-specific accelerators and efficiently shuttle data between the host memory space and accelerator. But even if high-performance accelerators are available for DBMS, it is challenging to integrate the software with accelerator non-intrusively. To address these problems, we propose a hardware-software co-designed system, database offloading engine (DOE), which contains hardware accelerator architecture—Conflux for effective SQL operation offloading, and a software DOE programming platform—DP2 for application integration and seamless harness of the computing power. We subtly partition each well-known relational operator, such as filter, join, group by, aggregate, and sort, and dynamically map these operators on multiple kernels in parallel. The DOE kernels work in streaming processing mode, over which the microarchitecture aggressively exploits data parallelism and memory bandwidth. The experiment results show that DOE achieves more than 100x and 10x performance improvement compared with PostgreSQL and MonetDB respectively.
... In the design of landscape architecture, art, and technology are usually integrated, which not only involves many fields, but also requires designers to have a rich variety of professional knowledge and technology, so that more different styles of landscape design can be completed with high quality and efficiency. However, in the current landscape design, designers do not have more professional expertise and technology, and cannot integrate art and technology [10][11][12]. erefore, it will have a certain impact on the quality and effect of landscape design. In addition, due to the relatively backward construction technology of gardening landscape in our country at this stage, even if a large number of high-quality designs are developed, it is difficult to implement them in gardens, which will lead to the setback of gardening landscape design. ...
Full-text available
Modern landscape greening plays an important role in the construction of modern cities and plays a positive role in improving the natural environment of cities and building a good image of cities. With the continuous progress of society and technology, human’s understanding of artificial intelligence is deepening, and intelligent technology is gradually integrated into all aspects of life. Because media technology has rich design elements and can carry out rich design structures, it will be more intuitive to use multimedia means for garden landscape design. Therefore, for meeting people’s requirements for the diversification of modern urban gardening construction, this study makes a deep analysis of the current status and problems of landscape design and tries to study the effective application methods of artificial intelligence technology in landscape design, to promote the combination of landscape design and artificial intelligence design. At the same time, the combination of AI lighting planning, AI water landscape planning, AI sprinkler planning, and AI paving planning is used to illustrate the application of AI in the specific project design of landscape design. The use of artificial intelligence not only promotes the innovation and optimization of landscape design, but also ensures the quality of modern landscape design and effectively improves the efficiency of modern landscape design.
High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding-window during processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This article addresses this problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding-windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed-key distributions in the incoming data stream. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7.5 × and 22 × higher throughput, while improving the effective memory capacity by up to 5 × and 23 ×, respectively.
Machine Learning is the driving force behind much of today's Artificial Intelligence and has the power to greatly simplify our lives. Improvements in speech recognition and language understanding help the community interact more naturally with technology. The popularity of machine learning opens up opportunities for optimizing the design of computing platforms using well-defined hardware accelerators. In the upcoming few years, cameras will be utilized as sensors for several applications. For ease of use and due to privacy restrictions, the required image processing should be limited to a local embedded computing platform and performed with high accuracy; furthermore, less energy should be consumed. Dedicated acceleration of Convolutional Neural Networks can achieve these targets with high flexibility to perform multiple vision tasks. However, due to growing technology constraints (especially in terms of energy), which could lead to heterogeneous multi-cores and an increasing number of defects, defect-tolerant accelerators for heterogeneous multi-cores may become a main micro-architecture research issue. Current accelerators still face performance issues such as memory limitations, bandwidth, and speed. This survey summarizes recent work on accelerators, including their advantages and disadvantages, to make it easier for developers with an interest in neural networks to further improve what has already been established.
Image and video processing algorithms are crucial for many applications. Hardware implementation of these algorithms provides higher speed for computation-intensive applications. In addition, noise removal is a typical pre-processing step that enhances the results of later analysis and processing. The median filter is a nonlinear filter commonly used for impulse-noise elimination in digital image processing. This paper proposes a low-energy median filter hardware design for battery-powered applications. An approximate solution with high accuracy is investigated to speed up the filtering operation, reduce the area, and consume less power and energy. Pipelining and parallelism are used to optimize the speed and power of this technique. A non-pipelined version, two different pipelined structures, and two parallel architectures are designed. The design versions are implemented first on a Virtex-5 LX110T FPGA and then in UMC 130 nm standard-cell ASIC technology. Selection-based and even-odd sorting-based median filters are also implemented for an equitable comparison with standard median filtering techniques. The proposed non-pipelined median filter design improves throughput by 35% over the best investigated state-of-the-art design, and pipelining more than doubles it. Additionally, the parallel architecture decreases the area and power consumption by around 40%. The simulation results reveal that one of the proposed designs significantly decreases area, at the same speed as the fastest design in the literature, without noticeably degrading accuracy, while reducing energy consumption by about 60%.
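As a reference point for what the hardware accelerates, a plain software median filter over a 3x3 window looks as follows (a minimal Python sketch of the exact filter, not the paper's approximate pipelined hardware design; border handling by replication is an illustrative choice):

```python
import statistics

def median_filter(img, k=3):
    """Apply a k x k median filter to a 2-D list of pixel values.
    Each output pixel is the median of its neighborhood; isolated
    impulse-noise pixels are replaced by a typical local value.
    Borders are handled by clamping (edge replication)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = []
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row
                    xx = min(max(x + dx, 0), w - 1)  # clamp column
                    window.append(img[yy][xx])
            out[y][x] = statistics.median_low(window)
    return out

# an impulse-noise pixel (255) in a flat region is removed
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
print(median_filter(noisy)[1][1])  # -> 10
```

The sorting of each window is the costly step that the paper's hardware attacks with approximate, pipelined, and parallel sorting networks.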
Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph analysis), can be expressed and executed as pipelined fault-tolerant dataflows. In this paper, we present Flink's architecture and expand on how a (seemingly diverse) set of use cases can be unified under a single execution model.
There is rising interest in accelerating stream processing through modern parallel hardware, yet it remains a challenge to exploit the available resources to achieve higher throughput without sacrificing latency, due to the increased length of the processing pipeline and communication path and the need for central coordination. To achieve these objectives, we introduce a novel top-down data flow model for stream join processing (arguably one of the most resource-intensive operators in stream processing), called SplitJoin, which operates by splitting the join operation into independent storing and processing steps that gracefully scale with the number of cores. Furthermore, SplitJoin eliminates the need for global coordination while preserving the order of input streams by rethinking how streams are channeled into distributed join computation cores, and maintains the order of output streams with a novel distributed punctuation technique. In our experimental analysis, SplitJoin offered up to 60% higher throughput while reducing latency by up to 3.3X compared to state-of-the-art solutions.
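The split of the join into independent storing and processing steps can be sketched in a few lines of Python (a toy single-process model; the core count, round-robin storage policy, and tuple layout are illustrative assumptions, and the ordering/punctuation machinery is omitted):

```python
class JoinCore:
    """One of N independent join cores: keeps a partition of each
    stream and probes incoming tuples against the opposite side."""
    def __init__(self):
        self.r_store, self.s_store = [], []

    def process(self, side, tup, store_here):
        # processing step: probe the opposite side's local partition
        opposite = self.s_store if side == 'R' else self.r_store
        matches = [(tup, t) if side == 'R' else (t, tup)
                   for t in opposite if t[0] == tup[0]]  # equi-join on key
        # storing step: only the designated owner core keeps the tuple
        if store_here:
            (self.r_store if side == 'R' else self.s_store).append(tup)
        return matches

cores = [JoinCore() for _ in range(4)]
seq = 0

def split_join(side, tup):
    """Every core processes (probes) the tuple; exactly one stores it."""
    global seq
    owner = seq % len(cores)  # round-robin storage assignment
    seq += 1
    out = []
    for i, c in enumerate(cores):
        out += c.process(side, tup, store_here=(i == owner))
    return out

split_join('R', ('k1', 'a'))
print(split_join('S', ('k1', 'b')))  # [(('k1', 'a'), ('k1', 'b'))]
```

Because storage and probing are decoupled, adding cores shrinks each core's partition without requiring a central coordinator on the hot path.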
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production; we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
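The core dataflow idea, computation expressed as a graph of operations rather than eager code, can be reduced to a toy sketch (this is not the TensorFlow API; the `Node` class and its `eval` method are invented for illustration, and device placement is not modeled):

```python
class Node:
    """A node in a dataflow graph: an operation plus its input nodes.
    Evaluation walks the graph, computing each node at most once."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self, cache=None):
        cache = {} if cache is None else cache
        if self not in cache:  # memoize: shared subgraphs run once
            args = [n.eval(cache) for n in self.inputs]
            cache[self] = self.op(*args)
        return cache[self]

# y = (a + b) * b, built as a graph first, executed afterwards --
# the separation that lets a real system place nodes on devices
a = Node(lambda: 2.0)
b = Node(lambda: 3.0)
add = Node(lambda x, y: x + y, a, b)
mul = Node(lambda x, y: x * y, add, b)
print(mul.eval())  # 15.0
```

Building the graph before executing it is what gives a system like TensorFlow the freedom to partition nodes across CPUs, GPUs, and TPUs.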
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.
Technology limitations are making the use of heterogeneous computing devices much more than an academic curiosity. In fact, the use of such devices is widely acknowledged to be the only promising way to achieve the application speedups that users urgently need and expect. However, building a robust and efficient query engine for heterogeneous co-processor environments is still a significant challenge. In this paper, we identify two effects that limit performance when co-processor resources become scarce. Cache thrashing occurs when the working set of queries does not fit into the co-processor's data cache, resulting in performance degradations of up to a factor of 24. Heap contention occurs when multiple operators run in parallel on a co-processor and their accumulated memory footprint exceeds the co-processor's main memory capacity, slowing down query execution by up to a factor of six. We propose solutions for both effects. Data-driven operator placement avoids data movements when they might be harmful; query chopping limits co-processor memory usage and thus avoids contention. The combined approach, data-driven query chopping, achieves robust and scalable performance on co-processors. We validate our proposal with our open-source GPU-accelerated database engine CoGaDB and the popular star schema and TPC-H benchmarks.
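Query chopping, admitting operators to the co-processor only while their accumulated memory footprint fits, can be sketched as a simple batching policy (illustrative Python, not CoGaDB's scheduler; operator names and footprint units are invented for the example):

```python
def chop_and_run(operators, mem_limit, run):
    """Admit operators in batches whose accumulated footprint stays
    under the co-processor's memory capacity, avoiding heap
    contention. Each operator is a (name, footprint) pair; `run`
    executes one admitted batch and returns its result."""
    batch, used, results = [], 0, []
    for name, footprint in operators:
        if used + footprint > mem_limit and batch:
            results.append(run(batch))   # flush: next op won't fit
            batch, used = [], 0
        batch.append(name)
        used += footprint
    if batch:
        results.append(run(batch))       # flush the final batch
    return results

ops = [("scan", 4), ("filter", 3), ("join", 6), ("agg", 2)]
print(chop_and_run(ops, mem_limit=8, run=list))
# [['scan', 'filter'], ['join', 'agg']]
```

Serializing batches trades some parallelism for a bounded footprint, which is exactly the robustness-versus-peak-throughput trade-off the paper argues for.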
Modern data appliances face severe bandwidth bottlenecks when moving vast amounts of data from storage to the query processing nodes. A possible solution to mitigate these bottlenecks is query off-loading to an intelligent storage engine, where partial or whole queries are pushed down to the storage engine. In this paper, we present Ibex, a prototype of an intelligent storage engine that supports off-loading of complex query operators. Besides increasing performance, Ibex also reduces energy consumption, as it uses an FPGA rather than conventional CPUs to implement the off-load engine. Ibex is a hybrid engine, with dedicated hardware that evaluates SQL expressions at line rate and a software fallback for tasks that the hardware engine cannot handle. Ibex supports GROUP BY aggregation, as well as projection- and selection-based filtering. GROUP BY aggregation has a higher impact on performance but is also a more challenging operator to implement on an FPGA.
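Functionally, the pushed-down GROUP BY amounts to a single-pass hash aggregation over tuples as they stream off storage. A minimal Python sketch of the semantics (the FPGA evaluates this at line rate in hardware, which this sketch does not model; the function and its arguments are invented for illustration):

```python
def groupby_sum(rows, key_col, val_col, predicate=lambda r: True):
    """Single-pass GROUP BY ... SUM with selection pushdown: each
    row is inspected exactly once as it streams by, so only the
    (much smaller) aggregate table travels to the query node."""
    groups = {}
    for row in rows:
        if predicate(row):                     # selection-based filtering
            k = row[key_col]
            groups[k] = groups.get(k, 0) + row[val_col]
    return groups

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(groupby_sum(rows, 0, 1, predicate=lambda r: r[1] > 1))
# {'b': 6, 'a': 3}
```

The bandwidth win comes from the output size: the storage engine ships one row per group instead of every base tuple.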
Heterogeneous computing offers a promising solution for energy-efficient computing in the data center. FPGA-based heterogeneous computing is an especially promising direction, since it allows the creation of custom hardware solutions for data-centric parallel applications. One of the main issues delaying widespread adoption of FPGAs as mainstream high-performance computing devices is the difficulty of programming them. OpenCL was meant to address the difficulties and non-uniformity of programming heterogeneous devices; unfortunately, its complexity sets the bar high for many software programmers, preventing them from directly benefiting from the computing power and energy efficiency that OpenCL and heterogeneous computing have to offer. This work presents an effort to bridge the gap by extending an existing OpenCL-based Java programming framework (APARAPI) so that it can be used to program FPGAs at a high level of abstraction and with increased ease of programmability. We run several real-world algorithms to assess the performance of the APARAPI framework on both a low-end and a high-end system. On the low-end and high-end systems, respectively, we find up to 78-80 percent power reduction and 4.8X-5.3X speedup running an NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speedup for a K-Means MapReduce algorithm running on top of the Hadoop framework and APARAPI.
FPGAs have become an emerging field in novel big data architectures and systems, due to their high efficiency and low power consumption. They enable researchers to deploy massive accelerators within a single chip. In this paper, we present a software-defined FPGA-based accelerator framework for big data, named SODA, which can reconstruct and reorganize its acceleration engines according to the requirements of various data-intensive applications. SODA decomposes large and complex applications into coarse-grained, single-purpose RTL code libraries that perform specialized tasks in out-of-order hardware. We built a prototype system with constrained shortest path finding (CSPF) case studies to evaluate the SODA framework. SODA achieves up to 43.75X speedup on a 128-node application. Furthermore, the hardware cost of the SODA framework demonstrates that it can achieve high speedup with moderate hardware utilization.
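The CSPF case study solves shortest path subject to an additive budget. A minimal software sketch of the problem SODA accelerates (the graph encoding, edge attributes, and function name are illustrative assumptions, not SODA's interface):

```python
import heapq

def cspf(graph, src, dst, max_cost):
    """Constrained shortest-path finding: minimize total weight
    subject to an additive cost budget. graph[u] is a list of
    (neighbor, weight, cost) edges. Returns the best weight,
    or None if no path satisfies the constraint."""
    pq = [(0, src, 0)]          # (weight so far, node, cost so far)
    best = {}                   # best weight seen per (node, cost)
    while pq:
        w, u, c = heapq.heappop(pq)
        if u == dst:
            return w            # first pop of dst is optimal
        if best.get((u, c), float('inf')) <= w:
            continue
        best[(u, c)] = w
        for v, ew, ec in graph[u]:
            if c + ec <= max_cost:      # prune paths over budget
                heapq.heappush(pq, (w + ew, v, c + ec))
    return None

g = {'A': [('B', 1, 5), ('C', 2, 1)],
     'B': [('D', 1, 5)],
     'C': [('D', 2, 1)],
     'D': []}
print(cspf(g, 'A', 'D', max_cost=4))  # 4: A->C->D fits the budget
```

The extra constraint dimension blows up the state space relative to plain Dijkstra, which is why the problem benefits from hardware parallelism.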