Faster cloud Star Joins with reduced
disk spill and network communication
Jaqueline Joice Brito1, Thiago Mosqueiro2, Ricardo Rodrigues Ciferri3, and
Cristina Dutra de Aguiar Ciferri1
1Department of Computer Science, University of São Paulo at São Carlos, Brazil
{jjbrito, cdac}@icmc.usp.br
2BioCircuits Institute, University of California San Diego, United States
thiago.mosqueiro@usp.br
3Department of Computer Science, Federal University of São Carlos, Brazil
ricardo@dc.ufscar.br
Abstract
Combining powerful parallel frameworks and on-demand commodity hardware, cloud computing has made both analytics and decision support systems canonical to enterprises of all sizes. Associated with the unprecedented volumes of data stacked by such companies, filtering and retrieving them are pressing challenges. This data is often organized in star schemas, in which Star Joins are ubiquitous and expensive operations. Excessive data spill and network communication are tight bottlenecks for all proposed MapReduce or Spark solutions. Here, we propose two efficient solutions that drop the computation time by at least 60%: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). Conversely, a direct Spark implementation of a sequence of joins renders poor performance, showcasing the importance of further filtering for minimal disk spill and network communication. Finally, while SBJ outperforms SBFCJ by 2× when memory per executor is large enough, SBFCJ appears remarkably resilient to low-memory scenarios. Both algorithms pose very competitive solutions to Star Joins.
Keywords: Star join; Spark; MapReduce; Bloom filter; Data warehouse.
1 Introduction
Over the past decade, data-centric trends have emerged to aid decision-making processes [19, 4]. Due to the high costs of maintaining up-to-date computational resources, a new model based on commodity hardware and non-local infrastructures was proposed: cloud computing [7]. Consequently, there is an increasing demand for flexible and scalable solutions to store and efficiently filter large datasets [6]. Aligned with these needs, MapReduce [5] has gained attention in recent years, delivering fast batch processing in the cloud. More recently, Spark [22] was proposed based on slightly different premises, and has been shown to suit machine learning tasks well [12].
(a) Star schema: a fact table F (primary key pkF; foreign keys fk1, fk2, fk3, fk4; measures m1, m2) linked to four dimension tables D1, D2, D3 and D4, each with a primary key pki and attributes ai,1, ai,2, ...

(b) Query Q:
SELECT a2,1, a3,2, SUM(m1)
FROM F, D1, D2, D3
WHERE pk1 = fk1
AND pk2 = fk2
AND pk3 = fk3
AND a1,1 < 100
AND a3,1 BETWEEN 5 AND 10
GROUP BY a2,1, a3,2
ORDER BY a2,1, a3,2

Figure 1: Running example: star schema in (a) and query Q in (b).
In the context of decision support systems, databases are often organized in star schemas
[4], where information is conveyed by a central table linked to several satellite tables (see
Figure 1). This is especially common in Data Warehouses [21, 11]. Star Joins are typical
and remarkably expensive operations in star schemas, required for most applications. Many MapReduce strategies have been proposed [1, 10, 16, 23], yet it is challenging to avoid excessive disk access and cross-communication among different jobs within MapReduce's inner structure [17].
A naive Spark implementation also fails at the same points (see Section 5.2). Therefore, our
goal is to propose better strategies to reduce disk spill and network communication.
In this paper, we propose two efficient algorithms that jointly optimize disk spill, network communication and query performance of Star Joins: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). Compared with some of the most prominent approaches, both SBFCJ and SBJ reduce computation time by at least 60% against the best options available. We show that both strategies incur substantially less disk spill and network communication. It is important to stress that a simple cascade of joins implemented in Spark is far from being among the best options, showcasing the importance of Bloom filtering or broadcast techniques. Furthermore, SBFCJ and SBJ scale linearly with the database size, which is necessary for any solution devoted to larger databases. Finally, while SBJ outperforms SBFCJ when memory per executor is large enough (bigger than 512MB in our test case), SBFCJ is remarkably resilient to lack of memory. Both solutions contribute efficient and competitive tools to filter and retrieve data from, for instance, Cloud Data Warehouses.
2 Background
2.1 Star Joins
A star schema consists of a central table, the fact table, referencing several satellite dimension tables, thus resembling a star [18]. This organization offers several advantages over normalized transactional schemas, such as fast aggregations and simplified business-reporting logic. For this reason, star schemas are widely used in online analytical processing (OLAP) systems. Figure 1(a) shows an example of a star schema with one fact table F and four dimensions. Queries issued over this schema often involve joins between the fact and dimension tables; this operation is called a Star Join. Figure 1(b) shows an example of a Star Join accessing three dimension tables. Real-life applications usually have a large fact table, rendering such operations very expensive. A considerable share of this cost stems from the substantial number of cross-table comparisons. Even in non-distributed systems, it induces
massive readouts from a wide range of points in the hard drive.
2.2 Bloom Filters
Bloom filters are compact data structures built over the elements of a data set and then used for membership queries [20]. For instance, Bloom filters have been used to process star-join query predicates, discarding unnecessary tuples from the fact table [23]. Several variations exist, but the most basic implementation consists of a bit array of m positions associated with n hash functions. An element is inserted by evaluating the n hash functions, which yields n positions of the array to set to 1.
Checking whether an element is in the represented data set is performed by evaluating the n hash functions: if all corresponding positions are 1, the element may belong to the data set. False positives may happen due to collisions in the limited number of positions of the bit array. If the number of inserted elements is known a priori, the false-positive rate can be controlled by setting an appropriate number of hash functions and array positions.
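For reference, the standard Bloom filter analysis (a textbook result [20], not derived here) gives the expected false-positive rate after inserting N elements into a bit array of m positions using n hash functions, together with the rate-minimizing number of hash functions:

p ≈ (1 − e^{−nN/m})^n,  n_opt = (m/N) ln 2.

The minimal Java sketch below illustrates this basic m-bit, n-hash design; the seeded multiplicative hash is illustrative only and is not the hashing scheme of the paper's implementation [3].

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;  // number of bit positions
    private final int n;  // number of hash functions

    public SimpleBloomFilter(int m, int n) {
        this.bits = new BitSet(m);
        this.m = m;
        this.n = n;
    }

    // Derive the i-th hash by mixing the key with the seed i.
    private int position(long key, int i) {
        long h = key * 0x9E3779B97F4A7C15L + i;  // simple multiplicative mix
        h ^= (h >>> 32);
        return (int) Math.floorMod(h, (long) m);
    }

    public void insert(long key) {
        for (int i = 0; i < n; i++) bits.set(position(key, i));
    }

    // May return false positives, never false negatives.
    public boolean contains(long key) {
        for (int i = 0; i < n; i++)
            if (!bits.get(position(key, i))) return false;
        return true;
    }
}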
2.3 MapReduce and Spark
The Apache Hadoop framework has been successfully employed to process star-join queries (see Section 3). Its processing framework (Hadoop MapReduce) is based on the MapReduce programming model [5]: jobs are composed of map and reduce procedures that manipulate key-value pairs. While mappers filter and sort the data, reducers summarize them. For Star Joins in particular, this processing schedule often results in several sequential jobs and excessive disk access, defeating the initial purpose of concurrent computation and hindering the performance of even the simplest star-join queries.
Conversely, the Apache Spark initiative provides a flexible framework based on in-memory computation. Its Resilient Distributed Dataset (RDD) [13] abstraction represents collections of data spread across machines. Data manipulation is performed with predefined operations over the RDDs, which can also work with key-value pairs. Applications are managed by a driver program, which obtains computation resources from a cluster manager (e.g., YARN). RDD operations are translated into directed acyclic graphs (DAGs) that represent RDD dependencies. The DAG scheduler takes a graph and transforms it into sets of smaller tasks, called stages, which are then sent to executors. The RDD abstraction has shown remarkable advantages; in particular, Spark has been demonstrated to excel at machine learning tasks [24], as opposed to the batch processing of the original Hadoop MapReduce design. However, to the best of our knowledge, star-join processing in Spark has not been addressed in the literature, which is the goal of this paper.
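As a minimal illustration of this key-value abstraction (a generic sketch, not the paper's code), the snippet below joins two pair RDDs keyed by the join attribute using Spark's Java API:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KeyValueJoin {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("kv-join"));
        // Two RDDs of key-value pairs, keyed by the join attribute.
        JavaPairRDD<Integer, String> fact = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "m1=10"), new Tuple2<>(2, "m1=20")));
        JavaPairRDD<Integer, String> dim = sc.parallelizePairs(
                Arrays.asList(new Tuple2<>(1, "a1,1=5")));
        // join keeps only keys present on both sides, pairing the matching values.
        JavaPairRDD<Integer, Tuple2<String, String>> joined = fact.join(dim);
        joined.collect().forEach(System.out::println);
        sc.stop();
    }
}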
3 Related work
Star schemas are notably important in decision support systems to solve business-reporting queries, given the volume of data in such applications [4]. OLAP operations applied on efficient data organizations are combined with state-of-the-art data mining techniques to produce accurate predictions and trends. For instance, OLAP over spatial data has gained much attention with the increasing access to GPS devices in day-to-day life, and many solutions have been proposed using (Geographical) Data Warehouses [11]. More recently, NoSQL solutions have also been proposed to extend to less conventional data architectures [8].
With on-demand hardware and the advent of convenient parallel frameworks, cloud computing has been extensively applied to implement decision support systems [2]. Cloud versions of star-schema-based systems, such as Hive [9], have been proposed in recent years. In particular, several algorithms were recently proposed to solve Star Joins using MapReduce. Because of the sequence of joins required, high network communication and several sequential jobs are challenging bottlenecks that motivated much of the recent work in this area [1]. In fact, some attempts to optimize (star) joins even propose changes to the MapReduce framework itself.
Next, we revisit two strategies [15]. The MapReduce Broadcast Join (MRBJ, for short) first applies the predicates on the dimensions and then broadcasts the results to all nodes, solving the join operations locally. It thus requires all filtered dimension tables to fit into each node's memory, which may be a limitation. The MapReduce Cascade Join (MRCJ, for short) maps tuples from two tables according to their join keys and performs the join operation on the reducers. This algorithm can be applied multiple times across sequential jobs to process Star Joins, requiring N − 1 jobs to join N relations, which considerably decreases its performance, especially for high-dimensional star schemas.
4 Star-Join Algorithms in Spark
In this section, we present our algorithms to solve Star Joins: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). In the following subsections, we discuss each approach in detail. For brevity, fact RDD stands for the RDD derived from the fact table, and dimension RDD for one derived from a dimension table. Both algorithms solve query Q as defined in Figure 1(b).
4.1 Bloom-Filtered Cascade Join
The Cascade Join is the most straightforward approach to solve Star Joins. The Spark framework performs binary joins through the join operator using the key-value abstraction. Fact and dimension RDDs are lists of key-value pairs containing the attributes of interest, so the Star Join can be computed as a sequence of binary joins between these RDDs. For simplicity, we refer to this approach as the Spark Cascade Join (SCJ).
Based on the Cascade Join, we now add Bloom filters. This optimization avoids the transmission of unnecessary data from the fact RDD through the cascade of join operations. A filter is built for each dimension RDD, containing the primary keys that meet the query predicates; the fact RDD is then filtered based on the containment of its foreign keys in these Bloom filters. We refer to this approach as the Spark Bloom-Filtered Cascade Join (SBFCJ).
Algorithm 1 exemplifies SBFCJ solving query Q. RDDs for each table are created in lines 1, 4, 7 and 10. Then, the filter operator solves the predicates of Q in place (lines 2 and 8). For each RDD, the attributes of interest are mapped into key-value pairs by the mapToPair operator. Dimension RDD keys are also inserted into Bloom filters broadcast to every executor (lines 3, 6 and 9). The filter operator uses these Bloom filters over the fact RDD, discarding unnecessary key-value pairs in line 11; this is where SBFCJ gains in performance compared to SCJ. The fact RDD then joins with the resulting dimension RDDs in lines 13–15. Finally, reduceByKey and sortByKey perform, respectively, the aggregation and sorting of the results (line 16).
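As a hedged illustration of SBFCJ's central step (a sketch under assumptions, not the implementation released with this paper [3]), the snippet below builds a Bloom filter over the surviving keys of one dimension, broadcasts it, and prunes the fact RDD before any join. It assumes Guava's BloomFilter and a fact RDD of long[] rows whose first column is the foreign key; extending it to several dimensions repeats the same pattern per filter.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BloomFilteredScan {
    public static JavaRDD<long[]> filterFact(JavaSparkContext sc,
                                             JavaRDD<long[]> fact,
                                             JavaRDD<Long> dimKeys,
                                             long expectedKeys) {
        // Build a Bloom filter over the dimension keys that met the predicates.
        BloomFilter<Long> bf = BloomFilter.create(Funnels.longFunnel(), expectedKeys, 0.01);
        for (Long pk : dimKeys.collect()) {
            bf.put(pk);
        }
        // Broadcast the compact filter once to every executor.
        final Broadcast<BloomFilter<Long>> bcast = sc.broadcast(bf);
        // Discard fact tuples whose foreign key cannot join with the dimension.
        return fact.filter(row -> bcast.value().mightContain(row[0]));
    }
}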
4.2 Broadcast Join
The Spark Broadcast Join (SBJ, for short) assumes that all dimension RDDs fit into the executor memory.
Algorithm 1 SBFCJ for Q
input: F, D1, D2, and D3
output: result of Q
1: RDD1 = D1
2: RDD1.filter( a1,1 < 100 ).mapToPair( pk1, null )
3: BF1 = broadcast( RDD1.collect() )
4: RDD2 = D2
5: RDD2.mapToPair( pk2, a2,1 )
6: BF2 = broadcast( RDD2.collect() )
7: RDD3 = D3
8: RDD3.filter( 5 ≤ a3,1 ≤ 10 ).mapToPair( pk3, a3,2 )
9: BF3 = broadcast( RDD3.collect() )
10: RDDF = F
11: RDDF.filter( BF1.contains(fk1) and BF2.contains(fk2) and BF3.contains(fk3) )
12: RDDF.mapToPair( fk1, [fk2, fk3, m1] )
13: RDDresult = RDDF.join( RDD1 ).mapToPair( fk2, [fk3, m1] )
14: RDDresult = RDDresult.join( RDD2 ).mapToPair( fk3, [a2,1, m1] )
15: RDDresult = RDDresult.join( RDD3 ).mapToPair( [a2,1, a3,2], m1 )
16: FinalResult = RDDresult.reduceByKey( v1 + v2 ).sortByKey()
Algorithm 2 SBJ for Q
input: F, D1, D2, and D3
output: result of Q
1: RDD1 = D1
2: RDD1.filter( a1,1 < 100 ).mapToPair( pk1, null )
3: H1 = broadcast( RDD1.collect() )
4: RDD2 = D2
5: RDD2.mapToPair( pk2, a2,1 )
6: H2 = broadcast( RDD2.collect() )
7: RDD3 = D3
8: RDD3.filter( 5 ≤ a3,1 ≤ 10 ).mapToPair( pk3, a3,2 )
9: H3 = broadcast( RDD3.collect() )
10: RDDresult = F
11: RDDresult.filter( H1.hasKey(fk1) and H2.hasKey(fk2) and H3.hasKey(fk3) )
12: RDDresult.mapToPair( [H2.get(fk2), H3.get(fk3)], m1 )
13: FinalResult = RDDresult.reduceByKey( v1 + v2 ).sortByKey()
Note that each node may run more than one executor, which may constrain the application of SBJ depending on the dataset specifics. Dimension RDDs are broadcast to all executors, where their data are kept in separate hash maps. Then, all joins are performed locally in parallel. Since no explicit join operation is needed, SBJ is expected to deliver the fastest query times. Note that, in general, Bloom filters are much smaller data structures than hash maps; memory-wise, there is thus likely a balance between the cases in which SBJ and SBFCJ perform optimally, which we verify in the experiments section. This approach is the Spark counterpart of MRBJ, introduced in Section 3.
Algorithm 2 details this approach applied to query Q. Hash maps are broadcast variables created for each dimension RDD in lines 3, 6 and 9, corresponding to lists of key-value pairs. It is important to note that these hash maps contain not only the dimension primary keys but all the needed attributes, and they are kept in the executors' primary memory. The filter operator then accesses the hash maps to select the data that should be joined (line 11). Since all the necessary dimension data are replicated over all executors, the mapToPair operator solves the select clause of Q in line 12. As a consequence, there is no need to use the join operator at all, saving a considerable amount of computation.
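The sketch below illustrates SBJ's map-side join for two dimensions (an example under assumptions, not the code released with this paper [3]): broadcast hash maps from primary key to the projected attribute replace the join operator entirely, so the fact RDD is filtered, mapped, and aggregated locally. Fact rows are assumed to be long[] arrays of the form [fk2, fk3, m1].

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

public class BroadcastJoinSketch {
    public static JavaPairRDD<String, Long> run(JavaSparkContext sc,
                                                JavaRDD<long[]> fact,   // rows [fk2, fk3, m1]
                                                Map<Long, String> d2,   // pk2 -> a2,1
                                                Map<Long, String> d3) { // pk3 -> a3,2
        final Broadcast<Map<Long, String>> h2 = sc.broadcast(d2);
        final Broadcast<Map<Long, String>> h3 = sc.broadcast(d3);
        return fact
            // Keep only fact tuples whose foreign keys hit both hash maps.
            .filter(row -> h2.value().containsKey(row[0]) && h3.value().containsKey(row[1]))
            // Resolve the SELECT attributes locally: no join operator needed.
            .mapToPair(row -> new Tuple2<>(
                    h2.value().get(row[0]) + "," + h3.value().get(row[1]), row[2]))
            // Aggregate SUM(m1) per (a2,1, a3,2) group.
            .reduceByKey((a, b) -> a + b);
    }
}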
Figure 2: Time performance in terms of (a) shuffled data and (b) disk spill. MapReduce approaches are shown in red symbols and Spark approaches in blue; the orange line shows the general trend.
5 Performance Analyses
Next, we present performance analyses of our Spark solutions, the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ).
5.1 Methodology and Experimental Setup
We used a cluster of 21 identical commodity computers (1 master and 20 slaves) running Hadoop 2.6.0 and Apache Spark 1.4.1 over a GNU/Linux installation (CentOS 5.7), each node with two 2GHz AMD CPUs and 8GB of memory. For the more intensive tests (i.e., Section 5.3), we used Microsoft Azure with 21 A4 instances (1 master and 20 slaves), each with eight 2.4GHz Intel CPUs and 14GB of memory. In both clusters, we used YARN as the cluster manager. We used the Star Schema Benchmark (SSB) [14] to generate and query our datasets. Unless stated otherwise, each result represents the average and standard deviation over 5 runs, with mean confidence intervals within ±100s. All implementations used in the following sections are available in Java on GitHub [3].
5.2 Disk spill, network communication and performance
Figure 2 shows how our approaches compare to the MapReduce strategies (see Section 2.3) in terms of disk spill and network communication. To simplify the notation, every MapReduce-based solution starts with MR, and the respective references are cited in the figure legends. Notice that all points referring to MapReduce define a trend correlating time performance with network communication and disk spill (orange line in Figure 2). Although Spark is known to outperform MapReduce in a wide range of applications, a direct implementation of a sequence of joins (referred to as SCJ) delivers poor performance and follows the same trend as the MapReduce approaches. SBFCJ and SBJ, however, are complete outliers (more than 3σ) and present remarkably higher performance. This highlights the need for additional optimizations applied on top of (or instead of) the cascade of joins. In particular, notice that both our strategies are closely followed by the MapReduce Broadcast Join (MRBJ in the figure) and the MapReduce approach proposed by Zhang et al., which processes Star Joins in two jobs using Bloom filters [23]. Next, we investigate each of these strategies in more detail.
Figure 2(a) shows that SBFCJ and SBJ both optimize network communication and computation time, using query Q4.1 with SF 100. As mentioned, excessive cross-communication among different nodes is one of the major drawbacks of MapReduce algorithms. Compared with the best MapReduce approaches, SBJ shuffles 200 times less data than the MRBJ method, and SBFCJ achieves a reduction of almost 96% against MRCJ.
Figure 3: Impact of the scale factor SF on the performance of SBFCJ and SBJ.
Finally, although the solution proposed by Zhang et al. does deliver 25% less shuffled data than SBFCJ, its performance is still nearly 40% slower. Moreover, the test in Figure 2(b) demonstrates one of the main advantages of Spark's in-memory methodology: while both of the best MapReduce options have low spills (4GB for MRCJ and 0.5GB for MRBJ), SBFCJ and SBJ show no disk spill at all. In this test, we set Spark with 20 executors (on average, 1 per node) and 3GB of memory per executor. If we lower the memory, then we start to see some disk spill from Spark and its performance drops; we examine this effect further in Section 5.4.
Yet SCJ, which simply implements a sequence of joins, presents considerably higher disk spill and computation time than SBJ and SBFCJ. Specifically, not only does SCJ have non-zero disk spill, its spill is larger than that of the best MapReduce options, although 18% lower than its counterpart, MRCJ. SBFCJ and SBJ shuffle, respectively, 23 and over 10^4 times less data than SCJ. Therefore, the reduced disk spill and time observed with the SBJ and SBFCJ strategies are not due only to Spark's internal structure minimizing cross-talk among nodes and disk spill. The bottom line is that the application of additional techniques (Bloom filters or broadcasting) is indeed essential.
It is important to note that this analysis assumed the best performance of the MapReduce approaches. We observed no scenario in which either SBJ or SBFCJ trailed the other strategies. More details on MapReduce approaches to star-join processing will be discussed elsewhere.
5.3 Scaling the dataset
In this section, we investigate the effect of the dataset volume on the performance of SBFCJ and SBJ. Methods in general must be somewhat resilient and scale well with the database size. Especially in the context of larger datasets, where solutions such as MapReduce and Spark really make sense, linear scaling is simply essential.
Figure 3 shows the effect of the scale factor SF on the computation time for three different SSB queries. To test both strategies, we selected queries with considerably different workloads. Especially at low SFs, as shown in Figure 3(b), a constant baseline was observed: from SF 50 to 200 the elapsed time simply does not change, revealing other possible bottlenecks. However, such datasets are rather small and do not reflect the applications in which cloud approaches excel. SBFCJ and SBJ performances grow linearly with SF for all larger scale factors tested.
5.4 Impact of Memory per Executor
As mentioned in Section 4.2, broadcast methods usually demand memory to allocate the dimension tables. While SBJ has outperformed every other solution so far, scenarios with low memory per executor might compromise its performance. Next, we study how the memory available to each executor impacts both SBFCJ and SBJ. We tested query Q4.1 with SF 200.
Figure 4: Comparison of SBJ and SBFCJ performance with (a) 1GB and (b) 512MB of memory per executor. SBJ appears reasonably sensitive to low-memory cases.
Figure 5: Comparison of SBJ and SBFCJ with 20 executors and variable memory. Panel (a) shows that SBJ's performance is severely impaired.
Parallelization in Spark is carried out by executors. For a fixed executor memory, we studied how the number of executors changes SBFCJ and SBJ performance. If enough memory is provided, performance usually follows the trend shown in Figure 4(a), where 1GB was used. However, as the memory decreases, SBJ may be severely impacted, while SBFCJ seems quite resilient. In our cluster, with this particular dataset, 512MB was a turning point: Figure 4(b) shows SBJ losing performance gradually, and below 512MB the difference becomes more pronounced. It is important to note that the specific value of this turning point (512MB in our tests) likely changes with the cluster, the nodes' configuration and, possibly, the dataset.
To explore this effect in detail, we tested the performance using 20 executors while decreasing their memory. The results in Figure 5(a) show that SBJ drops in performance until a point where SBFCJ actually outperforms it (again, around 512MB). Furthermore, in Figure 5(a), from 450 to 400MB there is a stunning increase in computation time: it suddenly becomes more than three times slower. To make sure that this increase was not an artifact of a specific executor or node, we analyzed the average time elapsed across all tasks run by all executors. Figure 5(b) clearly shows that the elapsed time becomes approximately five times larger in this transition, suggesting that tasks in general require more computation time. At 256MB, there was not enough memory to broadcast all dimensions. Comparatively, however, SBFCJ seems more resilient to the amount of memory per executor than SBJ, a feature that could be exploited depending on the available resources and the application.
Finally, in Figure 6 we investigated a slightly more realistic scenario: with the total memory fixed, the number of executors increases and they share the memory equally. Thus, although the memory per executor decreases, all memory resources are always in use. As expected, with enough memory for each executor, the performance of both SBJ and SBFCJ improves (see Figure 6(b)). Yet, similarly to Figure 4(b), if the number of executors is blindly increased without increasing resources, SBJ is severely impaired while SBFCJ's performance remains remarkably stable.
Figure 6: Comparison of SBJ and SBFCJ with fixed total memory while increasing the number of executors.
In conclusion, all results in this section point towards a trade-off between the two approaches and define a clear guideline for choosing between them based on the cluster resources and dataset. Regardless of the number of executors, if their memory is enough to fit the dimension RDDs, SBJ may deliver query times up to twice as fast; if memory is not enough, SBFCJ is the best solution available. It is important to stress that all results presented in this section should scale up with the SF: for larger SFs, the turning point at which SBJ slows down should be larger than 512MB.
6 Concluding remarks
In this paper, we have proposed two approaches to efficiently process star-join queries while reducing excessive data spill and network communication: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). All tested MapReduce options trail both of these algorithms by at least 60% in terms of query execution time. It is important to stress that simply implementing a cascade of joins in Spark, namely the Spark Cascade Join (SCJ), was not enough to beat the MapReduce options, showcasing the importance of using Bloom filters or broadcast techniques. We have also shown that both SBFCJ and SBJ scale linearly with the database volume, which poses them as competitive solutions for Star Joins in the cloud. While SBJ is usually faster (by 20–50%) when memory resources are abundant, SBFCJ was remarkably resilient in scenarios with scarce memory; in fact, with enough resources available, SBJ has no disk spill at all. To summarize, all of our results point towards a simple guideline: regardless of the number of executors, if their memory is enough to fit the dimension RDDs, SBJ may deliver query times up to twice as fast; if memory is an issue, SBFCJ is the best solution and is remarkably robust to low-memory infrastructures. Therefore, SBFCJ and SBJ were both shown to be competitive candidates to solve Star Joins in the cloud.
Acknowledgments. The authors thank Dr. Hermes Senger for allowing us to use his laboratory cluster infrastructure. This work was supported by FAPESP, CAPES, CNPq, INEP, and
FINEP. JJ Brito acknowledges FAPESP grant #2012/13158-9. T Mosqueiro acknowledges sup-
port from CNPq 234817/2014-3. JJ Brito, T Mosqueiro and CDA Ciferri acknowledge Microsoft
Azure Research Award MS-AZR-0036P.
References
[1] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT 2010,
pages 99–110, 2010.
9
Faster cloud Star Joins with reduced disk spill and network communication Brito et al.
[2] V. S. Agneeswaran. Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm,
Spark, and More Hadoop Alternatives. Pearson FT Press, 2014.
[3] J. J. Brito. Star joins in Spark. https://github.com/jaquejbrito/star-join-spark, 2015. [Online;
accessed 31-January-2016].
[4] S. Chaudhuri, U. Dayal, and V. Ganti. Database technology for decision support systems. IEEE
Computer, 34(12):48–55, 2001.
[5] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communica-
tions of the ACM, 51(1):107–113, 2008.
[6] H. Demirkan and D. Delen. Leveraging the capabilities of service-oriented decision support systems:
Putting analytics and big data in cloud. Decision Support Systems, 55(1):412–421, 2013.
[7] A. Khajeh-Hosseini et al. Decision support tools for cloud migration in the enterprise. In IEEE
CLOUD 2011, pages 541–548, 2011.
[8] A. Schätzle et al. Cascading map-side joins over hbase for scalable join processing. In Joint
Workshop on Scalable and High-Performance Semantic Web Systems, page 59, 2012.
[9] A. Thusoo et al. Hive - a petabyte scale data warehouse using hadoop. In ICDE 2010, pages
996–1005, 2010.
[10] H. Han et al. Scatter-gather-merge: An efficient star-join query processing algorithm for data-
parallel frameworks. Cluster Computing, 14(2):183–197, 2011.
[11] J. J. Brito et al. Efficient processing of drill-across queries over geographic data warehouses. In
DaWak 2011, pages 152–166, 2011.
[12] M. Li et al. Sparkbench: a comprehensive benchmarking suite for in memory data analytic platform
spark. In Conf. Computing Frontiers 2015, pages 53:1–53:8, 2015.
[13] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster
computing. In NSDI 2012, pages 15–28, 2012.
[14] P. E. O’Neil et al. The star schema benchmark and augmented fact table indexing. In TPCTC
2009, pages 237–252, 2009.
[15] S. Blanas et al. A comparison of join algorithms for log processing in mapreduce. In SIGMOD
2010, pages 975–986, 2010.
[16] Y. Tao et al. Optimizing multi-join in cloud environment. In HPCC/EUC 2013, pages 956–963,
2013.
[17] D. Jiang, A. K. H. Tung, and G. Chen. MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 23(9):1299–1311, 2011.
[18] R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. Wiley Computer Publishing, 2 edition, 2002.
[19] J. P. Shim, M. Warkentin, J. F. Courtney, D. J. Power, R. Sharda, and C. Carlsson. Past, present,
and future of decision support technology. Decision Support Systems, 33(2):111 – 126, 2002.
[20] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and practice of bloom filters for
distributed systems. IEEE Communications Surveys and Tutorials, 14(1):131–155, 2012.
[21] H. J. Watson and P. Gray. Decision Support in the Data Warehouse. Prentice Hall Professional Technical Reference, 1997.
[22] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing
with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing, 2010.
[23] C. Zhang, L. Wu, and J. Li. Efficient processing distributed joins with bloomfilter using mapreduce.
Int. Journal of Grid and Distributed Computing, 6(3):43–58, 2013.
[24] B. Zhu, A. Mara, and A. Mozo. CLUS: parallel subspace clustering algorithm on spark. In ADBIS
(Short Papers and Workshops) 2015, pages 175–185, 2015.