Hybrid OLTP and OLAP
Jana Giceva¹ and Mohammad Sadoghi²
¹Department of Computing, Imperial College London, London, UK
²University of California, Davis, CA, USA
Synonyms
HTAP; Hybrid transactional and analytical processing; Operational analytics; Transactional analytics
Definitions
Hybrid transactional and analytical processing
(HTAP) refers to system architectures and tech-
niques that enable modern database management
systems (DBMS) to perform real-time analytics
on data that is ingested and modified in the
transactional database engine. It is a term that was
originally coined by Gartner, where Pezzini et al.
(2014) highlight the need for enterprises to close
the gap between analytics and action for better
business agility and trend awareness.
Overview
The goal of running transactions and analytics on
the same data has been around for decades, but
has not fully been realized due to technology lim-
itations. Today, businesses can no longer afford
to miss the real-time insights from data that is in
their transactional system as they may lose com-
petitive edge unless business decisions are made
on the latest data (analytics on the latest data implies allowing the query to run at any desired isolation level, including dirty read, read committed, snapshot, repeatable read, or serializable) or on fresh data (analytics on fresh data implies running queries on a recent snapshot of the data, which may not necessarily be the latest possible, or even a consistent, snapshot at the time query execution began). As a result, in recent years
in both academia and industry, there has been
an effort to address this problem by designing
techniques that combine the transactional and
analytical capabilities and integrate them in a sin-
gle hybrid transactional and analytical processing
(HTAP) system.
Online transaction processing (OLTP) systems
are optimized for write-intensive workloads.
OLTP systems employ data structures that are designed for a high volume of point-access queries, with the goal of maximizing throughput and minimizing latency. Transactional DBMSs typically store
data in a row format, relying on indexes and
efficient mechanisms for concurrency control.
Online analytical processing (OLAP) systems
are optimized for heavy read-only queries that
touch large amounts of data. The data structures
used are optimized for storing and accessing large
volumes of data to be transferred between the
storage layer (disk or memory) and the process-
ing layer (e.g., CPUs, GPUs, FPGAs). Analytical
DBMSs store data in column stores with fast
scanning capability that is gradually eliminating
the need for maintaining indexes. Furthermore, to maintain high performance and to avoid the overhead of concurrency control, these systems apply updates only in batches at predetermined intervals. This, however, limits
the data freshness visible to analytical queries.
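To make the layout trade-off concrete, the following minimal Python sketch (illustrative only; the relation, attribute names, and values are hypothetical) contrasts a row-oriented and a column-oriented representation of the same data: a point update touches a single contiguous record in the row store, whereas an aggregate over one attribute scans a single dense array in the column store.

```python
# Minimal sketch of row-store vs. column-store layouts (illustrative only).

rows = [  # row store (NSM): one contiguous record per tuple
    {"id": 1, "customer": "alice", "amount": 30.0},
    {"id": 2, "customer": "bob",   "amount": 75.0},
    {"id": 3, "customer": "alice", "amount": 20.0},
]

columns = {  # column store (DSM): one dense array per attribute
    "id":       [1, 2, 3],
    "customer": ["alice", "bob", "alice"],
    "amount":   [30.0, 75.0, 20.0],
}

# OLTP-style point update: the row store touches exactly one record ...
rows[1]["amount"] = 80.0
# ... while the column store must update the matching position in every
# affected column (here only one attribute changes, but inserts touch all).
columns["amount"][1] = 80.0

# OLAP-style aggregate: the column store scans a single dense array, which
# compresses well and is cache friendly; the row store must read every full
# record even though only one attribute is needed.
total_row_store = sum(r["amount"] for r in rows)
total_col_store = sum(columns["amount"])
assert total_row_store == total_col_store == 130.0
```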
Therefore, given the distinct properties of
transactional data and (stale) analytical data,
most enterprises have opted for a solution that
separates the management of transactional and
analytical data. In such a setup, analytics is
performed as part of a specialized decision
support system (DSS) in isolated data ware-
houses. The DSS executes complex long running
queries on data at rest and the updates from
the transactional database are propagated via
an expensive and slow extract-transform-load
(ETL) process. The ETL process transforms data
from transactional-friendly to analytics-friendly
layout, indexes the data, and materializes selected
pre-aggregations. Today’s industry requirements
are in conflict with such a design. Applications
want to interface with fewer systems and try to
avoid the burden of moving the transactional data
with the expensive ETL process to the analytical
warehouse. Furthermore, systems try to reduce
the amount of data replication and the cost that
it brings. More importantly, enterprises want to
improve data freshness and perform analytics
on operational data. Ideally, systems should
enable applications to immediately react on
facts and trends learned by posing an analytical
query within the same transactional request. In
short, the choice of separating OLTP and OLAP
is becoming obsolete in light of the exponential increase in data volume and velocity and the necessity for enterprises to operate based on real-time insights.
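The batch-oriented path that HTAP tries to eliminate can be sketched as follows; the snippet is a deliberately simplified, hypothetical ETL step (table name, attributes, and transformation logic are invented for illustration) that extracts committed rows from the operational store, transforms them into a columnar layout, and materializes one pre-aggregation for the warehouse.

```python
from collections import defaultdict

# Hypothetical operational (row-store) table of committed orders.
oltp_orders = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob",   "amount": 75.0},
    {"order_id": 3, "customer": "alice", "amount": 20.0},
]

def etl(rows):
    """One simplified ETL cycle: extract rows, transform them into a
    columnar layout, and load a materialized pre-aggregation."""
    # Extract: take a snapshot of the committed rows.
    snapshot = list(rows)
    # Transform: pivot the row layout into analytics-friendly columns.
    warehouse_columns = {
        "customer": [r["customer"] for r in snapshot],
        "amount":   [r["amount"] for r in snapshot],
    }
    # Load: materialize a pre-aggregation (revenue per customer).
    revenue = defaultdict(float)
    for customer, amount in zip(warehouse_columns["customer"],
                                warehouse_columns["amount"]):
        revenue[customer] += amount
    return warehouse_columns, dict(revenue)

columns, revenue_per_customer = etl(oltp_orders)
# Until the next ETL run, any new OLTP writes are invisible to analytics,
# which is precisely the data-freshness gap HTAP architectures aim to close.
print(revenue_per_customer)   # {'alice': 50.0, 'bob': 75.0}
```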
In the rest of this entry, we discuss design de-
cisions, key research findings, and open questions
in the database community revolving around ex-
citing and imminent HTAP challenges.
Key Research Findings
Among existing HTAP solutions, we differen-
tiate between two main design paradigms: (1)
operating on a single data representation (e.g.,
row or column store format) vs. multiple rep-
resentations of data (e.g., both row and column
store formats) and (2) operating on a single copy
of data (i.e., single data replica) vs. maintaining
multiple copies of data (through a variety of data
replication techniques).
Examples of systems that operate on a single
data representation are SAP HANA by Sikka
et al. (2012), HyPer by Neumann et al. (2015),
and L-Store by Sadoghi et al. (2016a, b, 2013, 2014). They provide a unified data representation and enable users to analyze the latest (not just fresh) data or any version of the data. Here we must
note that operating on a single data representation
does not necessarily mean avoiding the need for
data replication. For instance, the initial version
of HyPer relied on a copy-on-write mechanism
from the underlying operating system, while SAP
HANA keeps the data in a main and a delta,
which contains the most recent updates still not
merged with the main. L-Store maintains the his-
tory of changes to the data (the delta) alongside
the lazily maintained latest copy of data in a
unified representation form.
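As a rough illustration of the main-plus-delta organization mentioned above (a toy sketch under simplifying assumptions, not SAP HANA's actual implementation), the snippet below buffers recent updates in a small write-optimized delta and periodically merges them into the read-optimized main; readers consult the delta first, so the latest values remain visible between merges.

```python
class MainDeltaStore:
    """Toy main+delta store: writes go to a small delta dictionary, reads
    overlay the delta on the read-optimized main, and a periodic merge
    folds the delta back into the main."""

    def __init__(self):
        self.main = {}    # read-optimized, rebuilt only at merge time
        self.delta = {}   # write-optimized buffer of recent updates

    def put(self, key, value):
        self.delta[key] = value          # cheap, append-style write

    def get(self, key):
        # The delta holds the most recent version, so check it first.
        if key in self.delta:
            return self.delta[key]
        return self.main.get(key)

    def merge(self):
        # Fold the accumulated delta into the main and start a new delta.
        self.main.update(self.delta)
        self.delta = {}

store = MainDeltaStore()
store.put("row-1", {"qty": 5})
store.put("row-1", {"qty": 7})            # update lands in the delta
assert store.get("row-1") == {"qty": 7}   # visible before any merge
store.merge()                             # background consolidation
assert store.get("row-1") == {"qty": 7}
```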
Other systems follow the more classical ap-
proach of data warehousing, where multiple data
representations are exploited, either within a sin-
gle instance of the data or by dedicating a sep-
arate replica for different workloads. Example
systems are SQL Server’s column-store index
proposed by Larson et al. (2015), Oracle’s in-
memory dual-format by Lahiri et al. (2015), SAP
HANA SOE’s architecture described by Goel
et al. (2015), and the work on fractured mirrors
by Ramamurthy et al. (2002), to name a few.
Unified Data Representation
The first main challenge for single data rep-
resentation HTAP systems is choosing (or de-
signing) a suitable data storage format for hy-
brid workloads. As noted earlier, there are many
systems whose storage format was optimized to
support efficient execution of a particular class
of workloads. Two prominent example formats
are row store (NSM) and column store (DSM),
but researchers have also shown the benefit of
alternative stores like partition attributes across (PAX), proposed by Ailamaki et al. (2001), that
strike a balance between the two extremes. OLTP
engines typically use a row-store format with
highly tuned data structures (e.g., lock-free skip
lists, latch-free BW trees), which provide low
latencies and high throughput for operational
workloads. OLAP engines have been repeatedly
shown to perform better if data is stored in col-
umn stores. Column stores provide better support
for data compression and are more optimized for
in-memory processing of large volumes of data.
Thus, the primary focus was on developing new
storage layouts that can efficiently support both
OLTP and OLAP workloads.
The main challenge is the conflict in the prop-
erties of data structures suitable to handle a large volume of update-heavy requests on the one hand
and fast data scans on the other hand. Researchers
have explored where to put the updates (whether
to apply them in-place, append them in a log-
structured format, or store them in a delta and
periodically merge them with the main), how
to best arrange the records (row-store, column-
store, or a PAX format), and how to handle mul-
tiple versions and when to do garbage collection.
Sadoghi et al. (2016a, b, 2013, 2014) proposed
L-Store (lineage-based data store) that combines
the real-time processing of transactional and an-
alytical workloads within a single unified engine.
L-Store bridges the gap between managing the
data that is being updated at a high velocity and
analyzing a large volume of data by introducing
a novel update-friendly lineage-based storage ar-
chitecture (LSA). This is achieved by develop-
ing a contention-free and lazy staging of data
from write-optimized into read-optimized form
in a transactionally consistent manner without
the need to replicate data, to maintain multiple
representations of data, or to develop multiple
loosely integrated engines that limit real-time
capabilities.
The basic design of LSA consists of two
core ideas, as captured in Fig. 1. First, the base
data is kept in read-only blocks, where a block
is an ordered set of objects of any type. The
modification to the data blocks (also referred
to as base pages) is accumulated in the cor-
responding lineage blocks (also referred to as
tail pages). Second, a lineage mapping links an
object in the data blocks to its recent updates in
a lineage block. This essentially decouples the
updates from the physical location of objects.
Therefore, via the lineage mapping, both the
base and updated data are retrievable. A lazy,
contention-free background process merges the
recent updates with their corresponding read-
only base data in order to construct new con-
solidated data blocks while data is continuously
being updated in the foreground without any
interruption. The merge is necessary to ensure
optimal performance for the analytical queries.
Furthermore, each data block internally main-
tains the lineage of the updates consolidated thus
far. By exploiting the decoupling and lineage
tracking, the merge process, which only creates
a new set of read-only consolidated data blocks,
is carried out completely independently from
update queries, which only append changes to
lineage blocks and update the lineage mapping.
Hence, there is no contention in the write paths
of update queries and the merge process. That
is a fundamental property necessary to build a
highly scalable distributed storage layer that is
updatable.
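A highly simplified sketch of the lineage-based idea follows (illustrative only, not L-Store's actual implementation; record identifiers and the merge policy are invented): updates are appended to a lineage (tail) block and registered in the lineage mapping, readers follow the mapping to the latest version, and a background merge builds a new consolidated read-only block without blocking writers.

```python
class LineageStore:
    """Toy lineage-based store: read-only base blocks, append-only tail
    blocks, and a lineage mapping from record id to its latest version."""

    def __init__(self, base_records):
        self.base = dict(base_records)   # read-only base block(s)
        self.tail = []                   # append-only lineage block
        self.lineage = {}                # record id -> index into tail

    def update(self, rid, new_value):
        # Writers never touch the base block: append and remap only.
        self.tail.append((rid, new_value))
        self.lineage[rid] = len(self.tail) - 1

    def read_latest(self, rid):
        # Follow the lineage mapping if the record has pending updates.
        if rid in self.lineage:
            return self.tail[self.lineage[rid]][1]
        return self.base[rid]

    def merge(self):
        # Lazy, contention-free consolidation: build a NEW read-only block
        # from base + consolidated updates; writers keep appending to tail.
        snapshot = list(self.tail)       # stable prefix; later appends ignored
        consolidated = dict(self.base)
        for rid, value in snapshot:
            consolidated[rid] = value
        return consolidated              # would replace the base atomically

store = LineageStore({"r1": 10, "r2": 20})
store.update("r1", 11)
assert store.read_latest("r1") == 11 and store.read_latest("r2") == 20
new_base = store.merge()                 # {'r1': 11, 'r2': 20}
```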
Neumann et al. (2015) proposed a novel MVCC implementation, which performs updates in place and stores prior versions as before-image deltas, enabling both efficient scan execution and the fine-grained serializability validation needed for fast processing of point-access transactions. From the NoSQL
side, Pilman et al. (2017) demonstrated how
scans can be efficiently implemented on a key-
value store (KV store) enabling more complex
analytics to be done on large and distributed
KV stores. Wildfire from IBM by Barber et al.
(2016, 2017) uses Apache Parquet to support
both analytical and transactional requests. IBM
Wildfire is a variant of IBM DB2 BLU that is
integrated into Apache Spark to support fast
ingest of updates. The authors adopt the relaxed
last-writer-wins semantics and offer an efficient
snapshot isolation on a recent (but possibly stale) view of the data by relying on periodic shipment and writing of the logs onto a distributed file system.

[Figure omitted] Hybrid OLTP and OLAP, Fig. 1 Lineage-based storage architecture (LSA): read-only data blocks and append-only lineage blocks are linked through the lineage mapping and the primary and secondary indexes; updates are appended to lineage blocks and lazily consolidated into new read-only data blocks, each of which tracks the lineage of the updates consolidated so far.
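Returning to the update-in-place MVCC scheme with before-image deltas described above, a minimal sketch (illustrative only and not HyPer's actual implementation; the timestamp handling is heavily simplified) shows the core idea: scans read the newest value directly in place, while older snapshots are reconstructed by walking the chain of before images.

```python
import itertools

_ts = itertools.count(1)          # toy global commit-timestamp generator

class VersionedRecord:
    """Toy MVCC record: the newest value lives in place; prior values are
    kept as (commit_ts, before_image) deltas, newest delta first."""

    def __init__(self, value):
        self.value = value        # in-place, scan-friendly current version
        self.deltas = []          # list of (commit_ts, before_image)

    def update(self, new_value):
        commit_ts = next(_ts)
        # Save the before image so older snapshots can be reconstructed.
        self.deltas.insert(0, (commit_ts, self.value))
        self.value = new_value
        return commit_ts

    def read(self, snapshot_ts):
        # A reader sees the in-place value unless an update committed after
        # its snapshot; then it walks back through the before images.
        value = self.value
        for commit_ts, before_image in self.deltas:
            if commit_ts <= snapshot_ts:
                break
            value = before_image
        return value

rec = VersionedRecord(100)
old_snapshot = 0                  # a reader that started before any update
rec.update(150)                   # commits at ts = 1
assert rec.read(snapshot_ts=next(_ts)) == 150     # fresh reader sees 150
assert rec.read(snapshot_ts=old_snapshot) == 100  # old snapshot reconstructed
```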
Manifold of Data Representations
An alternative approach comes from systems that
propose using hybrid stores to keep the data in
two or more layouts, either by partitioning the
data based on the workload properties or by doing
partial replication. They exploit manifold special-
ization techniques that are aligned with the tenet
“one size does not fit all” as explained by Stone-
braker and Cetintemel (2005). Typically the hy-
brid mode consists of storing data in row and
column formats by collocating attributes that are
accessed together within a query. The main differ-
entiators among the proposed solutions come by
addressing the following challenges: How does a
system determine which data to store in which
layout – is it specified manually by the DB
administrator, or is it derived by the system itself?
Is the data layout format static or can it change
over time? If the latter, which data is affected by
the transformations – is it the hot or cold data?
Does the system support update propagation or
is the data transformation only recommended for
read-only data? If the former, when does the
data transformation take place – is it incremental
or immediate? Does it happen as part of query
execution or is it done as part of a background
process?
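One way a system-driven answer to these questions can be sketched (purely illustrative thresholds and statistics, not any specific product's policy) is a simple advisor that picks a layout per partition from observed access patterns: update-heavy, point-access partitions stay row-oriented, while scan-heavy, rarely updated ones become candidates for a columnar or PAX-style layout.

```python
from dataclasses import dataclass

@dataclass
class PartitionStats:
    """Hypothetical per-partition access statistics collected at runtime."""
    updates_per_sec: float
    scans_per_sec: float
    attrs_touched_per_scan: int
    total_attrs: int

def choose_layout(stats: PartitionStats) -> str:
    """Toy layout advisor: the thresholds are invented for illustration."""
    scan_heavy = stats.scans_per_sec > 10 * max(stats.updates_per_sec, 1e-9)
    narrow_scans = stats.attrs_touched_per_scan <= stats.total_attrs // 3
    if scan_heavy and narrow_scans:
        return "column"          # cold / analytical partition
    if scan_heavy:
        return "pax"             # wide scans: group co-accessed attributes
    return "row"                 # hot / transactional partition

hot = PartitionStats(updates_per_sec=5000, scans_per_sec=2,
                     attrs_touched_per_scan=8, total_attrs=10)
cold = PartitionStats(updates_per_sec=0.1, scans_per_sec=50,
                      attrs_touched_per_scan=2, total_attrs=10)
assert choose_layout(hot) == "row" and choose_layout(cold) == "column"
```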
Grund et al. (2010) with their HYRISE system
propose the use of the partially decomposed
storage model and automatically partition
tables into vertical partitions of varying depth,
depending on how the columns are accessed.
Unfortunately, the bandwidth savings achieved
with this storage model come with an increased
CPU cost. A follow-up work by Pirk et al. (2013)
improves the performance by combining the
partially decomposed storage model with just-
in-time (JiT) compilation of queries, which
eliminates the CPU-inefficient function calls.
Based on the work in HYRISE, Plattner (2009) at SAP developed HANA, where tuples start out in a row store and are then migrated to a compressed column-store storage manager. Similarly,
MemSQL, Microsoft’s SQL Server, and Oracle
support the two storage formats. One of the
main differentiators among these systems is who
decides on the partitioned layout: i.e., whether the
administrator manually specifies which relations
and attributes should be stored as a column store
or the system derives it automatically. The latter
can be achieved by monitoring and identifying
hot vs. cold data or which attributes are accessed
together in OLAP queries. There are a few
research systems that have demonstrated the
benefits of using adaptive data stores that can
transform the data from one format to another
with an evolving HTAP workload. Dittrich and
Jindal (2011) show how to maintain multiple
copies of the data in different layouts and use a
logical log as a primary storage structure before
creating a secondary physical representation
from the log entries. In H2O, Alagiannis et al.
(2014) maintain the same data in different storage
formats and improve performance of the read-
only workload by leveraging multiple processing
engines. The work by Arulraj et al. (2016)
performs data reorganization on cold data. The
reorganization is done as part of an incremental
background process that does not impact the
latency-sensitive transactions but improves the
performance of the analytical queries.
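The incremental reorganization of cold data can be sketched as below (a toy illustration in the spirit of the approach, not the actual algorithm of Arulraj et al. (2016); the chunking and cold-set selection are invented): each background step migrates a bounded number of cold tuples from the row-oriented hot store into columnar segments, so latency-critical transactions are never stalled by a bulk conversion.

```python
def reorganize_step(hot_rows, cold_ids, column_segments, batch_size=2):
    """Move at most `batch_size` cold tuples from the row-oriented hot
    store into the columnar representation (one small step of a
    continuous background process)."""
    moved = 0
    for rid in sorted(cold_ids):
        if moved == batch_size:
            break                      # bound the work per invocation
        row = hot_rows.pop(rid)        # remove from the hot row store
        for attr, value in row.items():
            column_segments.setdefault(attr, []).append(value)
        cold_ids.remove(rid)
        moved += 1
    return moved

hot_rows = {1: {"a": 1, "b": "x"}, 2: {"a": 2, "b": "y"}, 3: {"a": 3, "b": "z"}}
cold_ids = {1, 3}                      # identified as cold by access tracking
column_segments = {}
while reorganize_step(hot_rows, cold_ids, column_segments):
    pass                               # background loop; transactions proceed
assert set(hot_rows) == {2} and column_segments["a"] == [1, 3]
```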
Data Replication
Operating on a single data replica may come at
a cost due to an identified trade-off between de-
gree of data freshness, system performance, and
predictability of throughput and response time for
a hybrid workload mix. Thus, the challenges for
handling HTAP go beyond the layout in which
the data is stored. A recent study by Psaroudakis
et al. (2014) shows that systems like SAP HANA
and HyPer, which rely on partial replication,
experience interference problems between the
transactional and analytical workloads. The ob-
served performance degradation is attributed to
both resource sharing (physical cores and shared
CPU caches) and synchronization overhead when
querying and/or updating the latest data.
This observation motivated researchers to re-
visit the benefits of data replication and address
the open challenge of efficient update propaga-
tion. One of the first approaches for address-
ing hybrid workloads was introduced by Rama-
murthy et al. (2002), where the authors proposed
replicating the data and storing it in two data
formats (row and columnar). The key advantage
is that the data layouts and the associated data
structures as well as the data processing tech-
niques can be tailored to the requirements of
the particular workload. Additionally, the system
can make efficient use of the available hardware
resources and their specialization. This is often
viewed as a loose form of an HTAP system. The
main challenge for this approach is maintaining
the OLAP replica up to date and avoiding the ex-
pensive ETL process. BatchDB by Makreshanski et al. (2017) relies on a primary-secondary form
of replication with the primary replica handling
the OLTP workload and the updates being prop-
agated to a secondary replica that handles the
OLAP workload (Fig. 2). BatchDB successfully
addresses the problem of performance interfer-
ence by spatially partitioning resources to the
two data replicas and their execution engines
(e.g., either by allocating a dedicated machine
and connecting the replicas with RDMA over In-
finiBand or by allocating separate NUMA nodes
within a multi-socket server machine). The sys-
tem supports high-performance analytical query
processing on fresh data with snapshot isolation
guarantees. It achieves that by queuing queries
and updates on the OLAP side and schedul-
ing them in batches, executing one batch at a
time. BatchDB ensures that the OLAP queries see the latest snapshot of the data by using a lightweight propagation algorithm, which incurs only a small overhead on the transactional engine to extract the updates and applies them on the analytical side at a rate faster than the highest transactional throughput reported in the literature so far. Such a system design solves
the problem of high performance and isolation for
hybrid workloads, and most importantly enables
data analytics on fresh data.

[Figure omitted] Hybrid OLTP and OLAP, Fig. 2 Replication-based HTAP architecture: an OLTP dispatcher routes OLTP requests to the OLTP replica; OLTP updates are propagated to the OLAP replica, where an OLAP dispatcher executes OLAP queries against the latest single snapshot version.

The main problem is that it limits the functionality of what an HTAP application may do with the data. For instance, it
does not support interleaving of analytical queries
within a transaction and does not allow analysts
to go back in time and explore prior versions of
the data to explain certain trends.
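The batch-oriented scheduling on the analytical replica can be sketched as follows (a minimal illustration of the idea, not BatchDB's implementation; the queue structure and snapshot counter are invented): update batches extracted from the OLTP side and incoming OLAP queries are queued, and the replica alternates between applying the pending updates, which advances the snapshot version, and executing the queued queries against that single snapshot.

```python
from collections import deque

class AnalyticalReplica:
    """Toy OLAP replica: applies queued update batches, then runs queued
    queries against the resulting single snapshot version."""

    def __init__(self):
        self.data = {}                 # replica state (e.g., key -> value)
        self.snapshot_version = 0
        self.pending_updates = deque() # batches shipped from the OLTP side
        self.pending_queries = deque() # analytical requests

    def ship_updates(self, batch):
        self.pending_updates.append(batch)

    def submit_query(self, query_fn):
        self.pending_queries.append(query_fn)

    def run_one_cycle(self):
        # 1) Apply all shipped update batches to advance the snapshot.
        while self.pending_updates:
            for key, value in self.pending_updates.popleft():
                self.data[key] = value
            self.snapshot_version += 1
        # 2) Execute the queued queries: one batch, one snapshot.
        results = [q(self.data, self.snapshot_version)
                   for q in self.pending_queries]
        self.pending_queries.clear()
        return results

replica = AnalyticalReplica()
replica.ship_updates([("acct-1", 100), ("acct-2", 250)])
replica.submit_query(lambda data, v: (v, sum(data.values())))
print(replica.run_one_cycle())         # [(1, 350)]
```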
Oracle GoldenGate and SAP HANA SOE also
leverage data replication and rely on specialized
data processing engines for the different replicas.
In GoldenGate, Oracle uses a publish/subscribe
mechanism and propagates the updates from the
operational database to the other replicas if they
have subscribed to receive the update log.
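A publish/subscribe style of update propagation can be approximated in a few lines (the interfaces below are hypothetical and purely for illustration): the operational side publishes each committed change to an update log, and every replica that has subscribed applies the entries it receives.

```python
class UpdateLogPublisher:
    """Toy publish/subscribe channel for propagating an update log."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, replica):
        self.subscribers.append(replica)

    def publish(self, log_entry):
        # Push each committed change to every subscribed replica.
        for replica in self.subscribers:
            replica.apply(log_entry)

class Replica:
    def __init__(self, name):
        self.name, self.state = name, {}

    def apply(self, log_entry):
        op, key, value = log_entry
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

channel = UpdateLogPublisher()
reporting = Replica("reporting")
channel.subscribe(reporting)            # only subscribers receive the log
channel.publish(("put", "order-7", {"amount": 42.0}))
assert reporting.state["order-7"]["amount"] == 42.0
```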
Examples of Applications
There are many possible applications of HTAP
for modern businesses. With today’s dynamic
data-driven world, real-time advanced analytics
(e.g., planning, forecasting, and what-if analy-
sis) becomes an integral part of many business
processes rather than a separate activity after the
events have happened. One example is online
retail. The transactional engine keeps track of
data including the inventory list and products
and the registered customers and manages all the
purchases. An analyst can use this online data to
understand customer behavior and come up with
better strategies for product placement, optimized
pricing, discounts, and personalized recommen-
dations as well as to identify products which are
in high demand and do proactive inventory refill.
Another business use for HTAP comes from the
financial industry sector. The transactional engine
already supports millions of banking transactions
per second and real-time fraud detection and
action which can save billions of dollars. Yet
another example of real-time detection and action
is inspired by content delivery networks (CDNs),
where businesses need to monitor the network of
web servers that deliver and distribute the content
at real time. An HTAP system can help identify
distributed denial-of-service (DDoS) attacks, find
locations with spikes to redistribute the traffic
bandwidth, get the top-k customers in terms of
traffic usage, and be able to act on it immediately.
Future Directions of Research
At the time of writing this manuscript, we argue that there is not yet any system or technique that fulfills all HTAP promises to successfully interleave analytical processing within transactions. This is
sometimes referred to as true or in-process HTAP,
i.e., real-time analytics that is not limited to latest
committed data and posed as a read-only request
but incorporates the analytics as part of the same
write-enabled transaction request. Supporting in-
process HTAP is an ongoing effort in the research
community and a much sought-after feature for many enterprise systems.
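To make the notion of in-process HTAP concrete, the sketch below shows the kind of interface such a system would have to expose (the API is entirely hypothetical and not taken from any existing system): an analytical aggregate is evaluated inside the same write-enabled transaction, over that transaction's consistent view including its own uncommitted writes, and the result feeds directly into the transaction's next write.

```python
class InProcessTransaction:
    """Hypothetical transaction that can run analytics over its own
    consistent view, including its not-yet-committed writes."""

    def __init__(self, committed_store):
        self.committed = committed_store   # shared committed state
        self.writes = {}                   # this transaction's local writes

    def put(self, key, value):
        self.writes[key] = value

    def analyze(self, aggregate_fn):
        # Analytical read over the transaction's view: committed data
        # overlaid with the transaction's own uncommitted writes.
        view = {**self.committed, **self.writes}
        return aggregate_fn(view.values())

    def commit(self):
        self.committed.update(self.writes)
        self.writes = {}

store = {"acct-1": 100.0, "acct-2": 250.0}
txn = InProcessTransaction(store)
txn.put("acct-3", 500.0)                   # transactional write
avg_balance = txn.analyze(lambda vals: sum(vals) / len(vals))  # analytics in-txn
txn.put("flagged", avg_balance > 200.0)    # decision based on the analytics
txn.commit()
assert store["flagged"] is True
```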
Another open question is to investigate
whether HTAP systems could and should support
different types of analytical processing in
addition to traditional OLAP analytics, e.g., as suggested by the emergence of polystores by Duggan et al. (2015).
Training and inferring from machine learning
models that describe fraudulent user behavior
could supplement the existing knowledge of
OLAP analytics. Furthermore, with the support
of graph processing, one can finally uncover
fraudulent rings and other sophisticated scams
that can be best identified using graph analytics
queries. Therefore, it would be of great use if the
graph analytical engine can process online data
and be integrated in a richer and more expressive
HTAP system (e.g., Hassan et al. 2017).
Finally, there is the question of how hard-
ware heterogeneity could impact the design of
HTAP systems. With the end of Moore’s Law,
the hardware landscape has been shifting to-
wards specialization of computational units (e.g.,
GPUs, Xeon Phi, FPGAs, near-memory comput-
ing, TPUs) (Najafi et al. 2017,2015; Teubner
and Woods 2013). Hardware has always been an
important game changer for databases, and as in-
memory processing enabled much of the tech-
nology for approaching true HTAP, there is dis-
cussion that the coming computing heterogeneity
is going to significantly influence the design of
future systems as highlighted by Appuswamy
et al. (2017).
Cross-References
Blockchain Transaction Processing
In-memory Transactions
Active Storage
Hardware-Assisted Transaction Processing:
NVM
References
Ailamaki A, DeWitt DJ, Hill MD, Skounakis M (2001)
Weaving relations for cache performance. In: VLDB,
pp 169–180
Alagiannis I, Idreos S, Ailamaki A (2014) H2O: a hands-
free adaptive store. In: SIGMOD, pp 1103–1114
Appuswamy R, Karpathiotakis M, Porobic D, Ailamaki A
(2017) The case for heterogeneous HTAP. In: CIDR
Arulraj J, Pavlo A, Menon P (2016) Bridging the
archipelago between row-stores and column-stores for
hybrid workloads. In: SIGMOD, pp 583–598
Barber R, Huras M, Lohman G, Mohan C, Mueller R,
Özcan F, Pirahesh H, Raman V, Sidle R, Sidorkin O,
Storm A, Tian Y, Tözun P (2016) Wildfire: concurrent
blazing data ingest and analytics. In: SIGMOD’16,
pp 2077–2080
Barber R, Garcia-Arellano C, Grosman R, Müller R,
Raman V, Sidle R, Spilchen M, Storm AJ, Tian Y,
Tözün P, Zilio DC, Huras M, Lohman GM, Mohan
C, Özcan F, Pirahesh H (2017) Evolving databases for
new-gen big data applications. In: Online Proceedings
of CIDR
Dittrich J, Jindal A (2011) Towards a one size fits all
database architecture. In: CIDR
Duggan J, Elmore AJ, Stonebraker M, Balazinska M,
Howe B, Kepner J, Madden S, Maier D, Mattson T,
Zdonik S (2015) The BigDAWG polystore system.
SIGMOD Rec 44(2):11–16
Goel AK, Pound J, Auch N, Bumbulis P, MacLean S,
Färber F, Gropengiesser F, Mathis C, Bodner T, Lehner
W (2015) Towards scalable real-time analytics: an
architecture for scale-out of OLxP workloads. PVLDB
8(12):1716–1727
Grund M, Krüger J, Plattner H, Zeier A, Cudré-Mauroux
P, Madden S (2010) HYRISE – a main memory hybrid
storage engine. In: PVLDB, pp 105–116
Hassan MS, Kuznetsova T, Jeong HC, Aref WG,
Sadoghi M (2017) Empowering in-memory relational
database engines with native graph processing. CoRR
abs/1709.06715
Lahiri T, Chavan S, Colgan M, Das D, Ganesh A, Gleeson
M, Hase S, Holloway A, Kamp J, Lee TH, Loaiza J,
Macnaughton N, Marwah V, Mukherjee N, Mullick A,
Muthulingam S, Raja V, Roth M, Soylemez E, Zait M
(2015) Oracle database in-memory: a dual format in-
memory database. In: ICDE, pp 1253–1258. https://doi.org/10.1109/ICDE.2015.7113373
Larson PA, Birka A, Hanson EN, Huang W, Nowakiewicz
M, Papadimos V (2015) Real-time analytical process-
ing with SQL server. PVLDB 8(12):1740–1751
Makreshanski D, Giceva J, Barthels C, Alonso G
(2017) BatchDB: efficient isolated execution of hybrid
OLTP+OLAP workloads for interactive applications.
In: SIGMOD’17, pp 37–50
Najafi M, Sadoghi M, Jacobsen H (2015) The FQP vision:
flexible query processing on a reconfigurable comput-
ing fabric. SIGMOD Rec 44(2):5–10
Najafi M, Zhang K, Sadoghi M, Jacobsen H (2017)
Hardware acceleration landscape for distributed real-
time analytics: virtues and limitations. In: ICDCS,
pp 1938–1948
Neumann T, Mühlbauer T, Kemper A (2015) Fast seri-
alizable multi-version concurrency control for main-
memory database systems. In: SIGMOD, pp 677–689
Pezzini M, Feinberg D, Rayner N, Edjali R (2014)
Hybrid transaction/analytical processing will foster opportunities for dramatic business innovation. https://www.gartner.com/doc/2657815/hybrid-transactionanalytical-processing-foster-opportunities
Pilman M, Bocksrocker K, Braun L, Marroquín R, Koss-
mann D (2017) Fast scans on key-value stores. PVLDB
10(11):1526–1537
Pirk H, Funke F, Grund M, Neumann T, Leser U, Mane-
gold S, Kemper A, Kersten ML (2013) CPU and cache
efficient management of memory-resident databases.
In: ICDE, pp 14–25
Plattner H (2009) A common database approach for OLTP
and OLAP using an in-memory column database. In:
SIGMOD, pp 1–2
Psaroudakis I, Wolf F, May N, Neumann T, Böhm A,
Ailamaki A, Sattler KU (2014) Scaling up mixed
workloads: a battle of data freshness, flexibility, and
scheduling. In: TPCTC 2014, pp 97–112
Ramamurthy R, DeWitt DJ, Su Q (2002) A case for
fractured mirrors. In: VLDB’02, pp 430–441
Sadoghi M, Ross KA, Canim M, Bhattacharjee B (2013)
Making updates disk-I/O friendly using SSDs. PVLDB
6(11):997–1008
Sadoghi M, Canim M, Bhattacharjee B, Nagel F, Ross KA
(2014) Reducing database locking contention through
multi-version concurrency. PVLDB 7(13):1331–1342
Sadoghi M, Bhattacherjee S, Bhattacharjee B, Canim M
(2016a) L-store: a real-time OLTP and OLAP system.
CoRR abs/1601.04084
Sadoghi M, Ross KA, Canim M, Bhattacharjee B (2016b)
Exploiting SSDs in operational multiversion databases.
VLDB J 25(5):651–672
Sikka V, Färber F, Lehner W, Cha SK, Peh T, Bornhövd C
(2012) Efficient transaction processing in SAP HANA
database: the end of a column store myth. In: SIG-
MOD’12, pp 731–742
Stonebraker M, Cetintemel U (2005) “One size fits all”: an
idea whose time has come and gone. In: ICDE, pp 2–11
Teubner J, Woods L (2013) Data processing on FPGAs.
Synthesis lectures on data management. Morgan &
Claypool Publishers, San Rafael