The FQP Vision: Flexible Query Processing on a
Reconfigurable Computing Fabric
Mohammadreza Najafi¹, Mohammad Sadoghi², Hans-Arno Jacobsen¹
¹Technical University Munich
²IBM T.J. Watson Research Center
The Flexible Query Processor (FQP) constitutes a family of hardware-based data stream processors that support dynamic changes to queries and streams, as well as static changes to the processor-internal fabric in order to maximize performance for given workloads. FQP is prototyped on field-programmable gate arrays (FPGAs). To this end, FQP supports select, project and window-join queries over data streams. While processing incoming tuples, FQP can accept new queries, a key characteristic distinguishing FQP from related approaches employing FPGAs for stream processing. In this paper, we present our vision of FQP, focusing on a few internal details to support the flexibility dimension, in particular, the segment-at-a-time mechanism to realize processing of tuples of variable sizes. While many of these features are readily available in software, their hardware-based realizations have been one of the main shortcomings of existing research efforts.
There is rising interest in accelerating stream processing through FPGAs (e.g., [3, 4, 7, 8]). Many of these approaches are based on “compiling” static queries and fixed stream schemas into hardware designs that are synthesized to configure FPGAs. It is not uncommon for this synthesis step to take on the order of minutes to hours (depending on the complexity of the design). This is the norm for FPGAs, but it is too inflexible for modern-day stream processing needs, which require the application to be able to change the query and the schema on the fly, without having to wait an extended period of time for the synthesis computation to be completed.
Furthermore, existing approaches to accelerating stream processing through FPGAs [4, 7, 8] assume that, to process query and stream modifications, the arriving stream is halted, the hardware design updates are synthesized into configuration information, and the new information is uploaded onto the FPGA before processing of the event stream can resume. While synthesis and stream processing may overlap, a significant amount of time and effort is still required to halt, re-configure, and resume the operation, which may take up to several minutes in and of itself. More importantly, this modus operandi requires logic for buffering, handling of dropped tuples, requests for re-transmissions, and additional data flow control tasks, which renders this style of processing difficult in practice. These concerns are often ignored in the approaches listed above, which assume that processing stops entirely before a new query-stream processing cycle starts.
In this paper, we aim to fill the gap between software solutions, which provide the greatest degree of flexibility for query modification, and hardware solutions, which offer massive performance gains, by designing an FQP that accepts new queries in an online fashion without disrupting the processing of incoming event streams. While supporting query modifications at runtime is almost trivial for software-based techniques, it is highly uncommon for custom hardware-based approaches, such as FPGAs, and has so far not received much attention in the growing body of work on accelerating data processing with FPGAs.
FQP is comprised of a parameterizable number of “online programmable blocks” (referred to as OPBs) that are inter-connected into a customizable topology. Together with a number of auxiliary components for query and tuple buffering, routing, and dispatching, the OPBs form an instance of the FQP that operates entirely on the FPGA. The inter-connection topology for the OPBs can be chosen in the manner most advantageous for the queries to be processed. For example, if the query workload lends itself to parallelism, a parallel topology can be chosen, whereas for workloads with more data dependency, a pipelined topology can be chosen. The choice of topology is made statically, and an instance of FQP is synthesized that realizes this topology. The OPB is the processing core that implements the actual query operators. It enables online changes to queries based on a number of parameters, including variable tuple size, projection attributes, selection conditions, join conditions, and join-window size.
In the design of FQP, we dealt with a number of challenges: First, a static FPGA-based query processor must over-provision resources to handle the largest expected (intermediate) tuple size, which under-utilizes system resources. Second, the change in tuple size between the join operation’s inputs and output adds new challenges, especially when there is the need to use the join result as input for other operations. Third, we must determine a minimal processing core that can efficiently handle a variety of query operators, some of which are stateless, while others are stateful.
SIGMOD Record, June 2015 (Vol. 44, No. 2) 5
The contributions of this work are manifold: (1) We outline our vision for stream processing on hardware in the context of our proposed FQP architecture. (2) We develop FQP, which, unlike the state of the art, enables online changes of queries and stream schema without interrupting query processing over incoming streams and without the need to re-synthesize the design. (3) We unify and share the underlying storage buffer for both data and operator parameters of a query. (4) We support variable tuple sizes by proposing the segment-at-a-time processing model, namely, an abstraction that divides a tuple into smaller chunks that are streamed and processed as a consecutive set of segments. This strategy avoids the need for over-provisioning of hardware resources. (5) We design FQP as a family for instantiating stream processors that are based on the different inter-connection topologies most suitable for the expected workload.
Over the past few years, several projects on accelerating stream processing with FPGAs have been undertaken, many by researchers in the broader data management community. This work effectively demonstrated that FPGAs are a viable option for accelerating certain data management tasks in general, and stream processing in particular (e.g., [3, 4, 5, 7, 8]).
Lockwood et al. [3] present an FPGA library to accelerate the development of streaming applications in hardware. Similarly, Mueller et al. [4] present Glacier, a component library and compiler that compiles continuous queries into logic circuits on an operator-level basis. These approaches (including [7, 8]) are characterized by the goal of representing queries in logic circuits to achieve the best possible performance. While performance is a major design goal for us as well, we additionally aim to offer the application the flexibility of updating queries at runtime.
The prototype of a hardware stream processor was recently presented [5, 6]. Najafi et al. [5, 6] showed the viability of building a stand-alone stream processor with FPGAs. The work presented in this paper builds upon this prototype, elaborating on the full-blown Flexible Query Processor systems project. For example, here, details are provided about the segment-at-a-time processing mechanism to support the processing of stream tuples with variable tuple sizes. But, more importantly, in the current work, we offer our broader and long-term vision of the FQP project.
To present our vision of stream processing acceleration, we first need to better understand the design space that must be navigated when resorting to FPGAs in order to complement or to replace general purpose processors. We begin by describing the challenges faced by today’s general purpose processors.
Large & complex control units. The design of general purpose processors is based on the execution of consecutive operations on data residing in the system’s main memory. The design must guarantee a correct sequential order of execution. In the presence of such a strict execution order, the processor includes complex logic to increase performance, e.g., super pipelining; out-of-order execution; single instruction, multiple data; and hyper-threading. As a result, the performance gain comes at the cost of having to devote resources (i.e., transistors) to large and complex control units, which could occupy up to 95% of chip area [1].
Memory wall & von Neumann bottleneck. The current computer architecture suffers from the limited bandwidth between CPU and memory. This bandwidth is small compared to the rate at which the CPU itself can process data. This issue is often referred to as the memory wall, and is becoming a major scalability limitation as the gap between CPU and memory speed increases [2]. To mitigate this issue, processors have been equipped with large cache units. However, the effectiveness of these units depends on the memory access patterns of executing programs. Additionally, the von Neumann bottleneck also contributes to the memory wall by sharing the limited memory bandwidth between instructions and data.
Redundant memory accesses. The current computer architecture enforces that data arriving from an I/O device is first written to and then read from main memory before it is processed by the CPU, which causes a great deal of memory bandwidth loss. Consider a simple data stream filtering operation that would not require the incoming data stream to be first written to main memory. If the data arrives from the network interface, then in theory the data could stream directly through the processor. However, today’s computer architecture prevents this modus operandi. Essentially, any I/O to and from the computer system is channeled through memory, which is a potential bottleneck even for high-speed inter-connects such as InfiniBand, which introduce latency on the order of microseconds.¹
These performance limiting factors in today’s computer architecture have resulted in a growing interest in accelerating data management and data stream processing on FPGAs. By designing custom hardware accelerators tailored for stream processing, we can afford to get by with simpler control logic and better usage of chip area, i.e., we can achieve a higher performance-per-transistor ratio. Furthermore, we can mitigate the memory wall issue by coupling processor and local memory and instantiating many of these coupled processors and memories as needed. Redundant memory accesses can be reduced by avoiding copying and reading memory whenever possible.
¹Emerging InfiniBand adaptors, such as Mellanox ConnectX-3 Pro, have the potential to enable FPGAs to process data in-line and avoid the data movement overhead between I/O and memory.
Additionally, stream processing has a set of unique characteristics that enables us to further exploit the underlying hardware. For instance, by its nature, there is a higher chance of repeated execution of the same set of queries over a given, potentially unbounded stream with a known stream data schema. Such repetition creates an opportunity to customize the data path of our processors for optimal computation (e.g., to avoid a deep instruction execution pipeline). Essentially, we execute the “data over the instructions” and not the other way around; namely, instructions are implemented in hardware as custom logic. Another distinct feature of streaming applications is their I/O nature, where tuples go from input to processing to output without the need for storage in off-chip memory for an extended period of time; e.g., stateless operations such as selections and projections do not require any write to (external) memory and can simply be processed online in a pure streaming fashion. Also, given the custom hardware implementation of queries, there is a greater degree of predictability, essentially eliminating costly branch mispredictions and page misses and reducing instruction fetches. All of these properties have motivated us to rethink the stream processor and to develop a radically different architecture that avoids today’s computer architecture challenges.
The Flexible Query Processor (FQP) is a customized hardware solution we designed for building stream processors that can be specifically tailored to a given set of queries executed over a data stream. Furthermore, new queries can be inserted at run-time without requiring expensive re-synthesis, as is commonplace today in related FPGA-based processing approaches.² Essentially, FQP represents a set of components that can be assembled in various ways to give rise to a whole family of stream processors. The basis of FQP is formed by Online Programmable-Blocks (OPBs), i.e., the processing cores. The OPB itself is a simple stream processing element that supports a number of basic query operators over stream tuples. An OPB can be dynamically programmed by interspersing the input data stream with new or updated queries, represented by a simple instruction set. The structure of the input data stream distribution, OPB arrangement, and query result collection components defines the connection topology of FQP. As a result, aside from the flexibility of FQP to be reprogrammed with new queries, the topology can be tailored to a specific query set (application) to maximize processing performance. OPBs are designed such that they can be connected to each other serially (i.e., a pipeline arrangement), in parallel (i.e., a parallel arrangement), and in a mixed manner (i.e., a hybrid arrangement).
²This is a unique limitation of FPGA-based approaches not found in software-based approaches, in which dynamic query insertion and update is hardly an issue worth underlining.
Figure 1: Partially parallel FQP topology. [Figure omitted; recoverable labels: OP-Blocks #1-1, #1-3, #6-1, #6-3, each with Filter Units; Result Aggregation Buffer.]
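To make the idea of in-band programming more concrete, the following minimal Python sketch models an OPB as a software loop that receives queries and tuples on the same stream. It is an illustration under stated assumptions only: the names `QueryUpdate` and `ToyOPB`, the dictionary-based tuples, and the use of Python predicates in place of the OPB instruction set are all hypothetical and do not reflect the actual hardware design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple, Union

Record = Dict[str, int]


@dataclass
class QueryUpdate:
    """Hypothetical in-band message that installs or replaces a selection query."""
    query_id: int
    predicate: Callable[[Record], bool]


@dataclass
class ToyOPB:
    """Software stand-in for an OPB with a small query buffer."""
    queries: Dict[int, Callable[[Record], bool]] = field(default_factory=dict)

    def process(self, stream: Iterable[Union[QueryUpdate, Record]]) -> List[Tuple[int, Record]]:
        results: List[Tuple[int, Record]] = []
        for item in stream:
            if isinstance(item, QueryUpdate):
                # A query arriving in the stream only updates the query buffer;
                # tuple processing continues with the next item.
                self.queries[item.query_id] = item.predicate
            else:
                for query_id, predicate in self.queries.items():
                    if predicate(item):
                        results.append((query_id, item))
        return results


stream = [
    QueryUpdate(1, lambda t: t["age"] > 25),
    {"id": 164, "age": 29},
    {"id": 170, "age": 19},
    QueryUpdate(2, lambda t: t["age"] <= 25),  # inserted while tuples keep flowing
    {"id": 171, "age": 22},
]
results = ToyOPB().process(stream)
print(results)
```

In this toy model, the second query takes effect for every tuple that follows it in the stream, and no tuple has to be buffered or re-sent while the query set changes.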
4.1 FQP Overview
The processing performance of an instance of FQP is determined by its internal connection topology, the performance of each individual OPB, and the assignment of queries to OPBs. Figure 1 shows a small-scale partially parallel FQP topology comprised of 24 OPBs. The four consecutive OPBs in each row are arranged in a pipelined fashion. All rows are arranged in parallel. Other topologies are possible, as deemed appropriate by a pre-synthesis-time software-based configuration component that assembles an instance of an FQP, specifically tailored for the given or expected query workload. After synthesis, during query assignment, a query compiler determines a mapping of the input queries onto the given FQP instance. Queries can be inserted dynamically without requiring a resynthesis of the FQP instance, a major differentiation of our work from related approaches.
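As a rough illustration of what a pre-synthesis configuration step might describe, the sketch below builds the row-and-chain layout of a topology in Python. The function `build_topology` and the label format are hypothetical; the sketch only captures the arrangement (rows in parallel, OPBs within a row in a pipeline) and says nothing about how the real configuration component works.

```python
from typing import List


def build_topology(rows: int, chain_length: int) -> List[List[str]]:
    """Return one list of OPB labels per row.

    OPBs within a row form a pipeline; rows operate in parallel.
    The label format is hypothetical.
    """
    if rows < 1 or chain_length < 1:
        raise ValueError("rows and chain_length must be positive")
    return [
        [f"OPB-{row}-{pos}" for pos in range(1, chain_length + 1)]
        for row in range(1, rows + 1)
    ]


# Partially parallel layout as described for Figure 1: 24 OPBs, four per row.
hybrid = build_topology(rows=6, chain_length=4)
# The two extremes built from the same number of OPBs.
fully_parallel = build_topology(rows=24, chain_length=1)
fully_pipelined = build_topology(rows=1, chain_length=24)

for name, topo in [("hybrid", hybrid), ("parallel", fully_parallel), ("pipelined", fully_pipelined)]:
    total = sum(len(row) for row in topo)
    print(f"{name}: {len(topo)} rows x {len(topo[0])} OPBs = {total}")
```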
4.2 FQP Internal Architecture
Data stream distribution circuitry. In the FQP instance shown above, we opted for a pipelined data stream distribution architecture, where each incoming tuple is inserted from the top and passes to the next pipeline stage (cf. hashed blocks in the figure) in each clock cycle until it reaches the end of the path.
In each stage, the tuple is fed to the corresponding chain of OPBs. Depending on the configuration, more OPB-chains could be connected to a single stage. However, the number of attached chains is limited by a maximum fan-out. Attaching more chains to a single stage can result in a decrease of the FQP’s clock frequency, leading to a performance degradation of the processor. The maximal fan-out is device dependent and has to be determined experimentally for a given FPGA. While feeding tuples through the pipeline of chains, some chains could be busy processing previous tuples. This imposes unwanted stalls in the distribution circuitry, leaving further chains idle. To address this issue, we use a Buffer/DMUX component which stores incoming tuples in its internal buffer and feeds them to the connected chain. This component also contains a de-multiplexer which splits streams so they pass through the shared data stream distribution circuitry.
Figure 2: Stream processing element. [Figure omitted; recoverable labels: Query Buffer, Window Buffer-S, Window Buffer-R, Final Result.]
Figure 3: Result aggregation buffer (RAB). [Figure omitted.]
Figure 4: Segment-at-a-time at entry to OPB. [Figure omitted; recoverable labels: Tuple (128-bit); Seg-1 (64-bit), Seg-2 (64-bit); R-Seg-1 (64-bit), R-Seg-2 (64-bit); Final Result; Filter Unit.]
Figure 5: Customer segregation query. [Figure omitted; recoverable labels: Customer Tuple (Customer ID: 164, Age: 29); Sel (Age > 25); Resulting Tuple (Customer ID: 164, Age: 29).]
Online-programmable block (OPB). FQP is comprised of a number of OPBs as basic stream processing elements that realize various query operators (e.g., select, project, and join). Figure 2 shows a high-level diagram illustrating the ports of an OPB.
The OPB itself is comprised of several components which work in parallel to maximize throughput. Here, we opt to briefly present the Processing Unit (PU) and the two window buffers as important components of our design.
Designed to execute complex operators such as a window-join, the OPB includes two window buffers, with the maximum window sizes as pre-synthesis-time configurable parameters. Window Buffer-R is dedicated to input Stream-R (R), while Window Buffer-L is dedicated to input Stream-S (S) and to the reception of dynamically inserted queries, and is therefore also referred to as the Query Buffer.
The PU is the actual execution unit of the OPB. Upon insertion of a new tuple, the PU fetches instructions from the Query Buffer and executes them against the tuples from one or both window buffers (depending on the query semantics). At the end of execution, the resulting tuples are emitted via the Final Result port or via the Stream-R output port for further processing by neighbouring OPBs (i.e., for larger queries).
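The role of the two window buffers in a window-join can be pictured with the following Python sketch. It assumes count-based sliding windows and symmetric probing (each new tuple is compared against the opposite window and then stored in its own), which is a common textbook formulation and not necessarily the exact semantics of the OPB. The class name `ToyWindowJoin` and the dictionary-based tuples are hypothetical, and the sharing of one buffer between stream tuples and queries is not modelled.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Record = Dict[str, int]
Pair = Tuple[Record, Record]


class ToyWindowJoin:
    """Count-based sliding-window join over two streams R and S (illustrative)."""

    def __init__(self, window_size: int, condition: Callable[[Record, Record], bool]) -> None:
        # deque(maxlen=...) evicts the oldest tuple once the window is full.
        self.window_r: Deque[Record] = deque(maxlen=window_size)
        self.window_s: Deque[Record] = deque(maxlen=window_size)
        self.condition = condition

    def insert_r(self, r: Record) -> List[Pair]:
        """Probe the S window with a new R tuple, then store the R tuple."""
        matches = [(r, s) for s in self.window_s if self.condition(r, s)]
        self.window_r.append(r)
        return matches

    def insert_s(self, s: Record) -> List[Pair]:
        """Probe the R window with a new S tuple, then store the S tuple."""
        matches = [(r, s) for r in self.window_r if self.condition(r, s)]
        self.window_s.append(s)
        return matches


join = ToyWindowJoin(window_size=2, condition=lambda r, s: r["key"] == s["key"])
first = join.insert_r({"key": 1, "a": 10})    # no S tuples buffered yet
second = join.insert_s({"key": 1, "b": 20})   # matches the buffered R tuple
join.insert_s({"key": 2, "b": 21})
join.insert_s({"key": 3, "b": 22})            # evicts the S tuple with key 1
third = join.insert_r({"key": 1, "a": 11})    # its partner has left the window
print(first, second, third)
```

Because the window size bounds the number of comparisons per incoming tuple, a model like this also makes it plausible that join throughput depends closely on the window size.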
Results at the Final Result port are gathered by the Filter Unit and, after validation, are fed to the Result Aggregation Buffer (RAB) for transmission to the output port of FQP. Validation includes tasks such as computing the validity of an entire result tuple from its constituent parts produced by the PU (cf. the segment-at-a-time mechanism discussed below).
Result collection circuitry. After tuple processing, validated results are collected by the Result Aggregation Buffer (RAB), shown in Figure 3. The RAB is comprised of a structure of connected buffers (i.e., Buffer Nodes) that are responsible for collecting results from two sources and guiding them from the OPBs to the output port of the FQP. In this collection step, a fairness granting mechanism makes sure that both sources are treated equally to avoid starvation. Appropriate tuning of Buffer Node parameters (e.g., buffer size) and connectivity architecture is important as this affects overall FQP performance. In other words, poor assignment of parameters could result in bottlenecks in the transmission of resulting tuples, while the majority of buffers would be under-utilized.
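One simple way to picture a fairness mechanism between two sources is alternating priority, sketched below in Python. The text does not specify which policy the Buffer Nodes use, so the round-robin choice here is an assumption, as are the class name `ToyBufferNode` and the unbounded queues (a real Buffer Node has a configurable, finite buffer size).

```python
from collections import deque
from typing import Deque, Optional, Tuple


class ToyBufferNode:
    """Merges results from two sources, alternating priority between them."""

    def __init__(self) -> None:
        self.sources: Tuple[Deque[str], Deque[str]] = (deque(), deque())
        self.last_served = 1  # so that source 0 is tried first

    def push(self, source: int, result: str) -> None:
        self.sources[source].append(result)

    def pop(self) -> Optional[str]:
        # Prefer the source that was not served last; fall back to the other
        # one so that a single busy source is never blocked by an idle one.
        for offset in (1, 2):
            idx = (self.last_served + offset) % 2
            if self.sources[idx]:
                self.last_served = idx
                return self.sources[idx].popleft()
        return None


node = ToyBufferNode()
for item in ("a1", "a2", "a3"):
    node.push(0, item)
node.push(1, "b1")

order = []
while (item := node.pop()) is not None:
    order.append(item)
print(order)  # ['a1', 'b1', 'a2', 'a3']
```

Even though source 0 has three pending results, the single result of source 1 is forwarded second, which is the starvation-avoidance property the paragraph above describes.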
4.3 Segment-at-a-time Processing
Not only do queries change throughout the life of an application, but the streams themselves evolve as well. Their properties, such as schema, tuple size, and input rate, change continuously. These features are at odds with today’s FPGA-based stream processing solutions, which have, for the most part, been tailored to process one specific tuple width and require re-synthesis when the tuple size changes, if such changes are permitted at all. This degree of flexibility poses a severe challenge for a hardware-based solution, as opposed to its software counterpart. Our design has been specifically built to afford this flexibility. The OPB-based design of FQP supports tuples of varying size, thus allowing for evolving data streams.
Generally speaking, hardware systems have fixed size input ports, internal communication buses, and output ports. FQP is no exception. However, flexibility in the face of varying size data streams stems from the way an OPB processes incoming tuples. The parametrized design of the OPB allows us to define its ports’ width prior to design synthesis. By default, we configure FQP with a 64-bit port width. As a result, any tuple larger than 64 bits is divided into 64-bit segments at the entry point to FQP. The tuple segments arrive at the input port of an OPB as shown in Figure 4. Then, the OPB processes each segment, one at a time, and hands over the resulting segments to the Filter Unit through its Final Result port.
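The division of a tuple into port-width segments can be sketched in a few lines of Python. The sketch treats a tuple as an unsigned integer of a known bit width; the function names, the most-significant-segment-first order, and the zero padding of a partially filled leading segment are assumptions made for illustration.

```python
from typing import List


def split_into_segments(value: int, tuple_width: int, port_width: int = 64) -> List[int]:
    """Split a tuple of `tuple_width` bits into `port_width`-bit segments.

    Segments are returned most significant first (an assumed ordering).
    """
    if value < 0 or value >> tuple_width:
        raise ValueError("value does not fit into tuple_width bits")
    n_segments = max(1, (tuple_width + port_width - 1) // port_width)
    mask = (1 << port_width) - 1
    return [(value >> (port_width * i)) & mask for i in reversed(range(n_segments))]


def join_segments(segments: List[int], port_width: int = 64) -> int:
    """Inverse of split_into_segments."""
    value = 0
    for segment in segments:
        value = (value << port_width) | segment
    return value


# A 128-bit tuple becomes two 64-bit segments, as in Figure 4.
tuple_128 = (0x1111 << 64) | 0x2222
segments = split_into_segments(tuple_128, tuple_width=128)
print([hex(s) for s in segments])  # ['0x1111', '0x2222']
```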
Figure 6 shows the segment-at-a-time processing mechanism in more detail. Prior to processing a segmented tuple, queries also need to be updated to handle the segments. In our example, the query consists of two segments, of which the first corresponds to the first segment of the tuple, while the second corresponds to the second segment of the tuple. Segmentation of queries is performed in software outside of FQP.
The PU fetches the first segment of the tuple from the Window Buffer-R as well as the first segment of the query. Then, the PU executes the segment of the query and produces a result segment with an additional flag which shows if the first segment of the tuple satisfies the (query) conditions in the first segment of the query. This process is repeated for the second segment of the tuple.
Figure 6: OPB Segment-at-a-time. [Figure omitted; recoverable labels: Query-Seg-1, Window Buffer-S, Window Buffer-R, Seg-1, Seg-2, R-Seg-1, R-Seg-2, flags 1 and 0, Final Result, Drop Result.]
Figure 7: Segregation query. [Figure omitted; recoverable labels: Customer Tuple (Customer ID: 165, Age: 26, Height: 178, Weight: 76) shown as Segment-1 (Customer ID, Age) and Segment-2 (Height, Weight); Sel (Age > 25); Sel (Height < 180); flags 1 and 1; Filter Unit; Resulting Tuple.]
Figure 8: Effect of tuple size. [Plot omitted; x-axis: Tuple Size (bits) at 64, 128, 256, 512, 1024; y-axes: Input Tuple Rate (Million tuples/second) and Input Throughput (MB/second).]
All resulting tuple segments are transmitted to and stored in internal buffers of the Filter Unit (FU), which evaluates the validity of the entire resulting tuple. For example, in a selection operator, one of the tuple segments may not pass the selection condition, while others do,³ which would render the entire tuple invalid. After receiving the final segment and positively validating the result, the FU hands the tuple (a segment at a time) over to the RAB to transfer it to the output port of FQP. Otherwise, the FU drops all result segments.
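The validation step described above amounts to buffering result segments and combining their flags. The following Python sketch models that behaviour in software; the class name `ToyFilterUnit`, the fixed number of segments per tuple, and the `dropped` counter are hypothetical conveniences and not part of the hardware description.

```python
from typing import List, Optional


class ToyFilterUnit:
    """Buffers result segments and emits a tuple only if every flag is set."""

    def __init__(self, segments_per_tuple: int) -> None:
        self.segments_per_tuple = segments_per_tuple
        self.buffer: List[int] = []
        self.all_valid = True
        self.dropped = 0

    def receive(self, segment: int, flag: bool) -> Optional[List[int]]:
        """Accept one result segment with its validation flag.

        Returns the complete list of segments once the final segment of a
        valid tuple arrives; returns None while the tuple is incomplete or
        when the tuple is dropped.
        """
        self.buffer.append(segment)
        self.all_valid = self.all_valid and flag
        if len(self.buffer) < self.segments_per_tuple:
            return None
        segments, valid = self.buffer, self.all_valid
        self.buffer, self.all_valid = [], True  # reset for the next tuple
        if valid:
            return segments
        self.dropped += 1
        return None


fu = ToyFilterUnit(segments_per_tuple=2)
pending = fu.receive(0xA, True)     # first segment: nothing to emit yet
emitted = fu.receive(0xB, True)     # both flags set: tuple is forwarded
fu.receive(0xC, True)
rejected = fu.receive(0xD, False)   # flags 1 and 0: whole tuple is dropped
print(pending, emitted, rejected, fu.dropped)
```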
Segment-at-a-time for join operator. Query assignment is a task performed in software that maps the input queries onto the available blocks of the FQP configuration. This task determines the placement of operators, which is not known a priori (i.e., we do not know where a join operator executes). Segment-at-a-time processing is necessary to support the further processing of tuples that result from a join operation, as the join result is often comprised of both input tuples (unless attributes are projected out).
Segment-at-a-time tuple size limit. The maximally accepted tuple size is determined by the size of Window Buffer-(R&L) in the OPB. From a conceptual point of view, the size of Window Buffer-R is not limited for stateless operators (i.e., select and project), while this is not the case for stateful operators (i.e., join). The actual limit depends on the resources available on the FPGA, which are highly device-specific and will only increase in future FPGAs. With today’s technology, we have synthesized blocks with window sizes of up to 4K bytes.
Here, we give an example to illustrate the segment-at-a-time mechanism realized by the OPBs. Assume a Customer stream with Customer ID and Age fields. Furthermore, assume a query to segregate customers into two groups, those who are older and those who are younger than 25 years of age (e.g., a retailer wanting to compute recommendations based on age):
    FROM Customer Stream
    WHERE Age > 25
This query is programmed onto the OPB and executed over the customer tuples as shown in Figure 5. As the Customer stream evolves over time, new attributes, such as Height and Weight, are added (e.g., for the retailer to better differentiate recommendations). Thus, the query is re-written as follows:
    FROM Customer Stream
    WHERE Age > 25, Height < 180
Through the segment-at-a-time mechanism, the OPB can execute the new query over the larger tuples without any changes, as shown in Figure 7.
³E.g., the higher order bits pass the condition, while the lower order ones do not (for two segments).
The processing of the updated (larger) tuple is done in four steps. In Step 1, after the query for the updated tuple schema has been re-programmed onto its (target) OPB, the updated (enlarged) tuple is divided into two segments at the entry point of the FQP. In Step 2, that is, after the segments arrive at the target OPB, the Processing Unit fetches the first part of the query (Age > 25) and executes it on the first segment of the tuple. In Step 3, the same process is repeated for the second part of the query (Height < 180) and the second segment of the tuple. Each one of these steps produces a resulting tuple segment together with a validation flag. Finally, in Step 4, the resulting tuple segments are processed jointly by the Filter Unit. In case all segments have satisfied the query conditions, they are handed over to the RAB for transfer to the output port of the FQP. In this example, for illustration purposes, we have kept the data stream simple. In practice, segment-at-a-time is applicable to larger tuples with more attribute-value fields.
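The four steps can be traced end to end with a short Python sketch. It assumes, purely for illustration, that each field occupies 32 bits so that two fields fill one 64-bit segment (Customer ID and Age in the first, Height and Weight in the second), and it represents each query segment as a Python predicate; all helper names are hypothetical.

```python
from typing import Callable, List, Optional

FIELD_BITS = 32  # assumed field width: two fields per 64-bit segment
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack(high_field: int, low_field: int) -> int:
    """Pack two 32-bit fields into one 64-bit segment."""
    return (high_field << FIELD_BITS) | low_field


def high(segment: int) -> int:
    return segment >> FIELD_BITS


def low(segment: int) -> int:
    return segment & FIELD_MASK


def segment_customer(cid: int, age: int, height: int, weight: int) -> List[int]:
    """Step 1: divide the enlarged tuple into two segments at the entry point."""
    return [pack(cid, age), pack(height, weight)]


# The query is segmented in software: one condition per tuple segment.
query_segments: List[Callable[[int], bool]] = [
    lambda seg: low(seg) > 25,     # Age > 25 on segment 1
    lambda seg: high(seg) < 180,   # Height < 180 on segment 2
]


def run_query(segments: List[int]) -> Optional[List[int]]:
    # Steps 2 and 3: evaluate each query segment on its tuple segment,
    # producing a result segment together with a validation flag.
    flagged = [(seg, cond(seg)) for seg, cond in zip(segments, query_segments)]
    # Step 4: forward the tuple only if all segments satisfied their condition.
    if all(flag for _, flag in flagged):
        return [seg for seg, _ in flagged]
    return None


accepted = run_query(segment_customer(165, 26, 178, 76))
rejected = run_query(segment_customer(166, 26, 185, 80))
print(accepted is not None, rejected is None)  # True True
```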
We developed all FQP components in VHDL and configured and synthesized them on our Xilinx ML505 development board. In our experiments, the input was generated by a workload generator and passed through an Ethernet component and pipelined reception buffers to the FQP stream processor. The input streams consist of 64-bit long tuples (i.e., a 32-bit attribute and a 32-bit value).
Raw processing power evaluation. We first present the raw processing power for various queries by focusing on the number of operators on a topology similar to Figure 1, where the window size is 16 and the clock frequency is 125 MHz. For the selection and projection operators, an OPB is capable of supporting |Window Buffer-L|/2 independent selection operators or |Window Buffer-L| independent projection operators. Each OPB is capable of realizing a single join operator. OPBs connected in a chain (OP-Chain) can realize join operators with even larger window buffers. For example, utilizing two, three, or four OPBs increases the window size two, three, or four times, respectively. The processing performance of each OPB for the join operator tightly depends on its window buffers’ sizes. For a window size of 16 tuples, the current version of the OPB is capable of processing 1.44 million tuples per second. The raw processing power of the topology given in Figure 1 is summarized in Table 1.
Table 1: Tuple processing rate.
Operators          # Operators   Million Tuples/s
Selection          24×8          230.6
Projection         24×16         272.6
Join               24            34.5
Chained Join (4)   6             8.6
Each OPB is capable of processing at the rate of 9.61M, 11.36M, or 1.44M tuples per second for the selection, projection, and join operator, respectively, which translates to 230.6M, 272.6M, or 34.5M tuples per second for the topology in Figure 1. By chaining 4 OPBs we have 6 OP-Chains, each with a window size of 4×16, and a total processing rate of 8.6M tuples per second.
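The aggregate figures follow from multiplying the per-OPB rates by the number of OPBs (or chains), which the short Python calculation below reproduces. It assumes that every OPB or chain runs independently at the quoted rate and that a chain of four OPBs sustains the single-OPB join rate, which is what the reported 8.6M for six chains suggests; the variable names are hypothetical.

```python
# Per-OPB rates in million tuples per second, as quoted in the text.
PER_OPB_RATE_M = {"selection": 9.61, "projection": 11.36, "join": 1.44}
NUM_OPBS = 24
CHAIN_LENGTH = 4

# Aggregate rate when all 24 OPBs work independently.
aggregate = {op: rate * NUM_OPBS for op, rate in PER_OPB_RATE_M.items()}

# Chaining four OPBs yields six chains (assumed to run at the per-OPB join rate).
num_chains = NUM_OPBS // CHAIN_LENGTH
chained_join = num_chains * PER_OPB_RATE_M["join"]

for op, rate in aggregate.items():
    print(f"{op}: {rate:.2f} M tuples/s")
print(f"chained join ({num_chains} chains): {chained_join:.2f} M tuples/s")
```

The products come to 230.64, 272.64, 34.56, and 8.64 million tuples per second; Table 1 lists them with one decimal place (the join entry appears as 34.5, presumably because the per-OPB rate itself is rounded).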
Segment-at-a-time evaluation. To evaluate the segment-at-a-time feature of the OPB and to study its influence on the input rate, we utilized a data stream in which we varied the number of attribute-value pairs per tuple (1 to 16), where the size of each attribute-value pair is 64 bits. The clock frequency in this experiment was 125 MHz. Figure 8 demonstrates the input tuple rate achieved as we feed larger tuples to an OPB. By feeding larger tuples, the sustainable input tuple rate decreases as expected, since the size of the tuple and the number of attribute-value pairs double each time. Interestingly, however, for a tuple of double the size, the processing time does not necessarily double, as seen in this figure. This is due to the reduction in the amortized cost of tuple handling, which is incurred mostly for the first segment and decreases for the subsequent segments. These results are for the selection operator, but they are applicable to other operators, including the projection and join operators.
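The amortization argument can be illustrated with a simple linear cost model: a fixed per-tuple handling cost plus a per-segment cost. The cycle counts below are hypothetical and are not measurements; they were merely chosen so that a one-segment tuple costs 13 cycles, which is roughly what a 125 MHz clock and the reported 9.61M selection tuples per second would imply. The actual split between fixed and per-segment cost is not given in the text.

```python
CLOCK_HZ = 125_000_000     # clock frequency reported in the text
PORT_BITS = 64             # default port width reported in the text
FIXED_CYCLES = 9           # hypothetical per-tuple handling overhead
CYCLES_PER_SEGMENT = 4     # hypothetical cost of processing one segment


def cycles_per_tuple(n_segments: int) -> int:
    """Linear cost model: fixed overhead plus a cost per segment."""
    return FIXED_CYCLES + CYCLES_PER_SEGMENT * n_segments


def tuple_rate(n_segments: int) -> float:
    """Sustainable input rate in tuples per second under the model."""
    return CLOCK_HZ / cycles_per_tuple(n_segments)


def throughput_mb_per_s(n_segments: int) -> float:
    """Input throughput in MB/s (1 MB = 10**6 bytes) under the model."""
    return tuple_rate(n_segments) * n_segments * PORT_BITS / 8 / 1e6


for n in (1, 2, 4, 8, 16):
    print(f"{n * PORT_BITS:5d} bits: {tuple_rate(n) / 1e6:6.2f} M tuples/s, "
          f"{throughput_mb_per_s(n):7.1f} MB/s")
```

Under any such model with a non-zero fixed cost, doubling the tuple size less than halves the tuple rate, so the input throughput in bytes grows with tuple size, in line with the qualitative behaviour described for Figure 8.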
Our broader vision is to identify key opportunities to exploit the strengths of available hardware accelerators given the unique characteristics of stream processing. As a first step towards fulfilling this goal, we have developed FQP, a generic streaming architecture composed of dynamically re-programmable stream processing elements (i.e., OPBs) that can be chained together to form a customizable processing topology (exhibiting a “Lego-like” connectable property). We argue that our proposed architecture may serve as a basic framework for both academic and industry research to explore and study the entire life-cycle of accelerating stream processing on hardware. Here we identify a list of important short- and long-term problems that can be tackled within our FQP framework.
What is the complexity of query assignment to a set of custom hardware blocks (including but not limited to OPBs)? Note that a poorly chosen query assignment may increase query execution time, leave some blocks un-utilized, negatively affect energy use, and degrade the overall processing performance.
How can query assignment be formalized algorithmically (e.g., by developing a cost model), and what is the relationship between query optimization on hardware and classical query optimization in databases? Unlike classical query optimization and plan generation, we are not limited to join reordering and physical plan selection; there is a whole new perspective on how to apply instruction-level and fine-grained memory-access optimization (through custom hardware implementation, e.g., different OPB implementations). For example, what is the most efficient method for wiring custom operators to minimize the routing distance? How can statistics be collected during query execution, and how can dynamic re-wiring and movement of data be introduced given a fixed FQP topology?
What is the best initial topology given a query workload as a prior? For example, one can construct a topology in order to reduce routing (i.e., to reduce the wiring complexity) or to minimize chip area overhead (i.e., to reduce the number of OPBs).
Given the topology and the query assignment formalism,
is it possible to generalize from single-query optimization
to multi-query optimization, where we amortize
the execution cost across the shared processing of the
n queries and explore inter- and intra-query optimizations
that are inspired by the capabilities of custom stream
Finally, how do we extend query execution on hardware
to co-processor design by distributing and orchestrating
query execution over different hardware with unique
features such as CPUs, FPGAs, and GPUs? An important
design decision arises as to how these various
devices communicate and whether they are placed
on a single board, thus having at least a shared external
memory space, or placed on multiple boards and connected
through interfaces such as PCIe.
[1] Symmetric key cryptography on modern graphics hardware.
Advanced Micro Devices, Inc., 2008.
[2] J. L. Hennessy and D. A. Patterson. Computer Architecture,
Fourth Edition: A Quantitative Approach. Morgan Kaufmann
Publishers Inc., 2006.
[3] J. Lockwood, A. Gupte, N. Mehta, M. Blott, T. English, and
K. Vissers. A low-latency library in FPGA hardware for
high-frequency trading (HFT). In HOTI, 2012.
[4] R. Mueller, J. Teubner, and G. Alonso. Streams on wires: a query
compiler for FPGAs. VLDB, 2009.
[5] M. Najafi, M. Sadoghi, and H.-A. Jacobsen. Flexible query
processor on FPGAs. VLDB, 2013.
[6] M. Najafi, M. Sadoghi, and H.-A. Jacobsen. Configurable
hardware-based streaming architecture using online
programmable-blocks. In ICDE, 2015.
[7] M. Sadoghi, H.-A. Jacobsen, M. Labrecque, W. Shum, and
H. Singh. Efficient event processing through reconfigurable
hardware for algorithmic trading. In VLDB, 2010.
[8] M. Sadoghi, R. Javed, N. Tarafdar, R. Palaniappan, H. P. Singh,
and H.-A. Jacobsen. Multi-query stream processing on FPGAs.
In ICDE, 2012.
10 SIGMOD Record, June 2015 (Vol. 44, No. 2)