Faster cloud Star Joins with reduced
disk spill and network communication
Jaqueline Joice Brito1, Thiago Mosqueiro2, Ricardo Rodrigues Ciferri3, and
Cristina Dutra de Aguiar Ciferri1
1Department of Computer Science, University of São Paulo at São Carlos, Brazil
{jjbrito, cdac}@icmc.usp.br
2BioCircuits Institute, University of California San Diego, United States
thiago.mosqueiro@usp.br
3Department of Computer Science, Federal University of São Carlos, Brazil
ricardo@dc.ufscar.br
Abstract
Combining powerful parallel frameworks and on-demand commodity hardware, cloud comput-
ing has made both analytics and decision support systems commonplace in enterprises of all sizes.
Given the unprecedented volumes of data accumulated by such companies, filtering and retrieving these data are pressing challenges. This data is often organized in star schemas, in which Star
Joins are ubiquitous and expensive operations. Excessive data spill and network communication
are tight bottlenecks for all proposed MapReduce or Spark solutions. Here, we propose two
efficient solutions that drop the computation time by at least 60%: the Spark Bloom-Filtered
Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). Conversely, a direct Spark im-
plementation of a sequence of joins renders poor performance, showcasing the importance of
further filtering for minimal disk spill and network communication. Finally, while SBJ outperforms SBFCJ by up to 2× when memory per executor is large enough, SBFCJ appears remarkably resilient to low-memory scenarios. Both algorithms pose very competitive solutions to Star Joins.
Keywords: Star join; Spark; MapReduce; Bloom filter; Data warehouse.
1 Introduction
Over the past decade, data-centric trends have emerged to aid decision-making processes [19, 4]. Due to the high cost of maintaining up-to-date computational resources, a new model was proposed based
on commodity hardware and non-local infrastructures: cloud computing [7]. Consequently,
there is an increasing demand for flexible and scalable solutions to store and efficiently filter
large datasets [6]. Aligned with these needs, MapReduce [5] has gained attention in recent years,
delivering fast batch processing in the cloud. More recently, Spark [22] was proposed based on
slightly different premises, and has been shown to suit machine learning tasks well [12].
Figure 1: Running example: star schema in (a) and query Q in (b). Panel (a): fact table F with primary key pkF, foreign keys fk1, fk2, fk3, fk4 and measures m1, m2, linked to dimensions D1, D2, D3, D4, each with primary key pki and attributes ai,1, ai,2, .... Panel (b): query Q:
SELECT a2,1, a3,2, SUM(m1)
FROM F, D1, D2, D3
WHERE pk1 = fk1 AND pk2 = fk2 AND pk3 = fk3
AND a1,1 < 100 AND a3,1 BETWEEN 5 AND 10
GROUP BY a2,1, a3,2
ORDER BY a2,1, a3,2
In the context of decision support systems, databases are often organized in star schemas
[4], where information is conveyed by a central table linked to several satellite tables (see
Figure 1). This is especially common in Data Warehouses [21, 11]. Star Joins are typical
and remarkably expensive operations in star schemas, required for most applications. Many
MapReduce strategies were proposed [1, 10, 16, 23], yet it is challenging to avoid excessive disk
access and cross-communication among different jobs within MapReduce’s inner structure [17].
A naive Spark implementation also fails at the same points (see Section 5.2). Therefore, our
goal is to propose better strategies to reduce disk spill and network communication.
In this paper, we propose two efficient algorithms that jointly optimize disk spill, network
communication and query performance of Star Joins: Spark Bloom-Filtered Cascade Join
(SBFCJ) and Spark Broadcast Join (SBJ). Comparing our solutions with some of the most
prominent approaches, both SBFCJ and SBJ reduce computation time at least to 60% against
the best options available. We show both of strategies present outstanding less disk spill and
network communication. It is important to stress that implementing a simple cascade of joins
in Spark is far from being among the best options, showcasing the importance of using Bloom filters or broadcasting techniques. Furthermore, SBFCJ and SBJ scale linearly with the
database size, which is necessary for any solution devoted to larger databases. Finally, while
SBJ outperforms SBFCJ when memory per executor is large enough – bigger than 512MB, in
our test case –, SBFCJ seems remarkably resilient to lack of memory. Both solutions contribute
as efficient and competitive tools to filter and retrieve data from, for instance, Cloud Data
Warehouses.
2 Background
2.1 Star Joins
A star schema consists of a central table, namely fact table, referencing several satellite di-
mension tables, thus resembling a star [18]. This organization offers several advantages over
normalized transactional schemas, as fast aggregations and simplified business-reporting logic.
For this reason, star schemas are used in online analytical processing (OLAP) systems. We
show in Figure 1(a) an example of a star schema with one fact table F and four dimensions.
Queries issued over this schema often involve joins between fact and dimensions tables. This
operation is also called Star Join. In Figure 1(b), we show an example of a Star Join accessing
three dimension tables. Real-life applications usually have a large fact table, rendering such
operations very expensive. A considerable share of this operation's complexity stems from the substantial number of cross-table comparisons. Even in non-distributed systems, it induces
massive readouts from a wide range of points in the hard drive.
2.2 Bloom Filters
Bloom filters are compact data structures built over the elements of a data set, and then used for
membership queries [20]. For instance, Bloom filters have been used to process star-join query
predicates to discard unnecessary tuples from the fact table [23]. Several variations exist, but
the most basic implementation consists of a bit array of m positions associated with n hash functions. Elements are inserted by evaluating the hash functions, which yield n values indicating the positions to be set to 1.
Checking whether an element is in the represented data set is performed by evaluating the n
hash functions, and, if all corresponding positions are 1, then the element may belong to the
data set. Collisions may happen due to the limited number of positions in the bit array. If
the number of inserted elements is known a priori, the ratio of false positives can be controlled by setting an appropriate number of hash functions and array positions.
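To make this concrete: with m bit positions, n hash functions and N inserted elements, the standard estimate of the false-positive rate is (1 − e^(−nN/m))^n, which is minimized by choosing n ≈ (m/N) ln 2. The following minimal Java sketch illustrates the structure; the double-hashing scheme used to derive the n hash functions is our own choice for illustration, not a detail prescribed by the works cited here.

import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array of m positions and n hash
// functions (simulated here by double hashing). Illustrative only.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // number of bit positions
    private final int n;   // number of hash functions

    public SimpleBloomFilter(int m, int n) {
        this.bits = new BitSet(m);
        this.m = m;
        this.n = n;
    }

    // i-th hash function: h_i = h1 + i * h2 (Kirsch-Mitzenmacher scheme).
    private int hash(long key, int i) {
        int h1 = Long.hashCode(key);
        int h2 = Long.hashCode(Long.rotateLeft(key, 17) * 0x9E3779B97F4A7C15L);
        return Math.floorMod(h1 + i * h2, m);
    }

    // Insertion sets the n positions indicated by the hash functions to 1.
    public void add(long key) {
        for (int i = 0; i < n; i++) bits.set(hash(key, i));
    }

    // Membership test: may return false positives, never false negatives.
    public boolean mightContain(long key) {
        for (int i = 0; i < n; i++) {
            if (!bits.get(hash(key, i))) return false;
        }
        return true;
    }
}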
2.3 MapReduce and Spark
The Apache Hadoop framework has successfully been employed to process star-join queries
(see Section 3). Its processing framework (Hadoop MapReduce) is based on the MapReduce
programming model [5]: jobs are composed of map and reduce procedures that manipulate
key-value pairs. While mappers filter and sort the data, reducers summarize them. Especially
for Star Joins, its processing schedule often results in several sequential jobs and excessive
disk access, defeating the initial purpose of concurrent computation. Thus, it hinders the
performance of even the simplest star-join queries.
Conversely, the Apache Spark initiative provides a flexible framework based on in-memory
computation. Its Resilient Distributed Dataset (RDD) [13] abstraction represents collections
of data spread across machines. Data manipulation is performed with predefined operations
over the RDDs, which are also able to work with key-value pairs. Applications are managed
by a driver program which gets computation resources from a cluster manager (e.g., YARN).
RDD operations are translated into directed acyclic graphs (DAGs) that represent RDD de-
pendencies. The DAG scheduler gets a graph and transforms it into sets of smaller tasks called
stages, which are then sent to executors. The RDD abstraction of Spark has shown remarkable advantages. In particular, Spark has been demonstrated to excel at machine learning tasks [24], as opposed to the batch-processing orientation of the original Hadoop MapReduce design. However, to the best of our
knowledge, the star-join processing in Spark has not been addressed in the literature, which is
the goal of this paper.
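For illustration, the following minimal sketch shows the key-value RDD abstraction described above in the Spark Java API; the data and names are purely illustrative and unrelated to the implementations discussed in this paper.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class RddSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rdd-sketch"));

        // A key-value RDD: (key, measure) pairs distributed across executors.
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, Integer>("a", 1),
                new Tuple2<String, Integer>("b", 2),
                new Tuple2<String, Integer>("a", 3)));

        // Transformations only build the DAG of dependencies; the aggregation
        // runs when an action (collect) is invoked and the DAG scheduler
        // breaks the work into stages of tasks sent to executors.
        for (Tuple2<String, Integer> t : pairs.reduceByKey((x, y) -> x + y).collect()) {
            System.out.println(t._1() + " -> " + t._2());
        }

        sc.stop();
    }
}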
3 Related work
Star schemas are notably important in decision support systems for solving business-report queries, given the volume of data in such applications [4]. OLAP operations applied on efficient
data organizations are combined with state-of-the-art data mining techniques to produce accurate
predictions and trends. For instance, OLAP over spatial data has gained much attention with
the increasing access to GPSs in day-to-day life, and many solutions have been proposed using
(Geographical) Data Warehouses [11]. More recently, NoSQL solutions were also proposed to
extend to more non-conventional data architectures [8].
With on-demand hardware and the advent of convenient parallel frameworks, cloud computing has been extensively applied to implement decision support systems [2]. Cloud versions
of star-schema based systems have been proposed in the last years, such as Hive [9]. Particu-
larly, several algorithms were recently proposed to solve Star Joins using MapReduce. Because
of the subsequent joins needed, high network communication and several sequential jobs are
challenging bottlenecks and have motivated much of the recent work in this area [1]. In fact, some attempts to optimize (star) joins even propose changes to the MapReduce framework itself.
Next, we revisit two strategies [15]. The MapReduce Broadcast Join (MRBJ, for short)
first applies predicates on dimensions, and then broadcasts the results to all nodes, solving join
operations locally. Thus, it requires all filtered dimension tables to fit into the nodes' memory, which may be a limitation. Second, the MapReduce Cascade Join (MRCJ, for short) maps
tuples from two tables according to their join keys, and performs the join operation on the
reducers. This algorithm can be applied multiple times on sequential jobs to process star joins,
requiring N − 1 jobs to join N relations, which considerably decreases its performance, especially for high-dimensional star schemas.
4 Star-Join Algorithms in Spark
In this section, we present our algorithms to solve Star Joins: Spark Bloom-Filtered Cascade
Join (SBFCJ) and Spark Broadcast Join (SBJ). In the following sections, we discuss each
approach in detail. For short, fact RDD stands for the RDD derived from the fact table, and dimension RDD for an RDD derived from a dimension table. Both algorithms solve query Q as defined in Figure 1(b).
4.1 Bloom-Filtered Cascade Join
Cascade Join is the most straightforward approach to solve Star Joins. The Spark framework performs binary joins through the join operator using the key-value abstraction. Fact and dimension
RDDs are lists of key-value pairs containing attributes of interest. Thus, the Star Join can be
computed as a sequence of binary joins between these RDDs. For simplicity, we shall refer to
this approach as Spark Cascade Join (SCJ).
Based on the Cascade Join, we now add Bloom filters. This optimization avoids the trans-
mission of unnecessary data from the fact RDD through the cascade of join operations. These
filters are built for each dimension RDD, containing their primary keys that meet the query
predicates. Therefore, the fact RDD is filtered based on the containment of its foreign keys in
these Bloom filters. We refer to this approach as Spark Bloom-Filtered Cascade Join (SBFCJ).
Algorithm 1 exemplifies SBFCJ for solving query Q. RDDs for each table are created in
lines 1, 4, 7 and 10. Then, the filter operator solves the predicates of Q in place (lines 2 and 8). For
each RDD, attributes of interest are mapped into key-value pairs by the mapToPair operator.
Dimension RDD keys are also inserted in Bloom filters broadcast to every executor (lines 3, 6
and 9). The filter operator uses these Bloom filters over the fact RDD, discarding unnecessary
key-value pairs in line 11. This is where SBFCJ should gain in performance compared to SCJ.
Then, the fact RDD is joined with the resulting dimension RDDs in lines 13–15. Finally, reduceByKey and sortByKey perform, respectively, the aggregation and sorting of the results (see line 16).
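To give a concrete sense of how these steps map onto the Spark Java API, the condensed sketch below shows a single dimension (D1) and a single join of the cascade; the input paths, field positions, line parsing and the use of Guava's BloomFilter are our own assumptions for illustration (the actual implementation [3] may differ), and the full algorithm repeats the join pattern for D2 and D3 before grouping on a2,1 and a3,2.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import scala.Tuple2;

public class SbfcjSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sbfcj-sketch"));

        // Dimension D1: apply the predicate a1,1 < 100 and keep (pk1, attribute) pairs.
        JavaPairRDD<Long, String> rdd1 = sc.textFile("hdfs:///ssb/dim1")          // hypothetical path
                .map(line -> line.split("\\|"))
                .filter(f -> Long.parseLong(f[1]) < 100)                          // f[1] ~ a1,1
                .mapToPair(f -> new Tuple2<Long, String>(Long.parseLong(f[0]), f[2])); // f[0] ~ pk1

        // Build a Bloom filter over the surviving primary keys and broadcast it.
        BloomFilter<Long> bf = BloomFilter.create(Funnels.longFunnel(), 1000000, 0.01);
        for (Long pk : rdd1.keys().collect()) bf.put(pk);
        Broadcast<BloomFilter<Long>> bf1 = sc.broadcast(bf);

        // Fact table: discard tuples whose foreign key cannot match a filtered
        // dimension key, then key the remainder by fk1 for the first join.
        JavaPairRDD<Long, Long> fact = sc.textFile("hdfs:///ssb/fact")            // hypothetical path
                .map(line -> line.split("\\|"))
                .filter(f -> bf1.value().mightContain(Long.parseLong(f[1])))      // f[1] ~ fk1
                .mapToPair(f -> new Tuple2<Long, Long>(Long.parseLong(f[1]), Long.parseLong(f[5]))); // f[5] ~ m1

        // One join of the cascade (pk1 = fk1); the full algorithm repeats this
        // pattern for D2 and D3 before grouping on a2,1 and a3,2.
        fact.join(rdd1)                                                           // (fk1, (m1, attribute))
            .mapToPair(t -> new Tuple2<String, Long>(t._2()._2(), t._2()._1()))   // re-key by the attribute
            .reduceByKey((a, b) -> a + b)
            .sortByKey()
            .saveAsTextFile("hdfs:///ssb/out");                                   // hypothetical path

        sc.stop();
    }
}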
4.2 Broadcast Join
Spark Broadcast Join (SBJ for short) assumes that all dimension RDDs fit into the executor
memory. Note that each node may have much more than one executor running, which may
Algorithm 1 SBFCJ for Q
input: F, D1, D2 and D3
output: result of Q
1: RDD1 = D1
2: RDD1.filter( a1,1 < 100 ).mapToPair( pk1, null )
3: BF1 = broadcast( RDD1.collect( ) )
4: RDD2 = D2
5: RDD2.mapToPair( pk2, a2,1 )
6: BF2 = broadcast( RDD2.collect( ) )
7: RDD3 = D3
8: RDD3.filter( 5 ≤ a3,1 ≤ 10 ).mapToPair( pk3, a3,2 )
9: BF3 = broadcast( RDD3.collect( ) )
10: RDDF = F
11: RDDF.filter( BF1.contains(fk1) and BF2.contains(fk2) and BF3.contains(fk3) )
12: RDDF.mapToPair( fk1, [fk2, fk3, m1] )
13: RDDresult = RDDF.join( RDD1 ).mapToPair( fk2, [fk3, m1] )
14: RDDresult = RDDresult.join( RDD2 ).mapToPair( fk3, [a2,1, m1] )
15: RDDresult = RDDresult.join( RDD3 ).mapToPair( [a2,1, a3,2], m1 )
16: FinalResult = RDDresult.reduceByKey( v1 + v2 ).sortByKey( )
Algorithm 2 SBJ for Q
input: F, D1, D2 and D3
output: result of Q
1: RDD1 = D1
2: RDD1.filter( a1,1 < 100 ).mapToPair( pk1, null )
3: H1 = broadcast( RDD1.collect( ) )
4: RDD2 = D2
5: RDD2.mapToPair( pk2, a2,1 )
6: H2 = broadcast( RDD2.collect( ) )
7: RDD3 = D3
8: RDD3.filter( 5 ≤ a3,1 ≤ 10 ).mapToPair( pk3, a3,2 )
9: H3 = broadcast( RDD3.collect( ) )
10: RDDresult = F
11: RDDresult.filter( H1.hasKey(fk1) and H2.hasKey(fk2) and H3.hasKey(fk3) )
12: RDDresult.mapToPair( [H2.get(fk2), H3.get(fk3)], m1 )
13: FinalResult = RDDresult.reduceByKey( v1 + v2 ).sortByKey( )
constrain the application of SBJ depending on the dataset specifics. Dimension RDDs are
broadcast to all executors, where their data are kept in separate hash maps. Then, all joins are
performed locally in parallel. Since no explicit join operation is needed, SBJ is expected to deliver faster query times. Note that, in general, Bloom filters are much smaller data structures than hash maps. Thus, memory-wise there is likely a trade-off between the cases in which SBJ and SBFCJ perform optimally, which we verify in the experiments section. This approach
is the Spark counterpart of the MRBJ, introduced in Section 3.
Algorithm 2 details this approach applied to query Q. Hash maps are broadcast variables
created for each dimension RDD in lines 3, 6 and 9, corresponding to lists of key-value pairs. It
is important to note that these hash maps contain not only the dimension primary keys, but all
the needed attributes. These hash maps are kept in the executors' primary memory. Then, the filter operator accesses the hash maps to select the data that should be joined in line 11. Since all the
necessary dimension data are replicated over all executors, in line 12 the mapToPair operator
solves the select clause of Q. As a consequence, there is no need to use the join operator at all,
saving a considerable amount of computation.
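As an illustration, the following condensed sketch captures the core of SBJ in the Spark Java API, showing only the two dimensions that contribute output attributes (D2 and D3); the input paths, column positions and the plain java.util.HashMap used as the broadcast structure are assumptions made for this sketch (the actual implementation [3] may differ), and the dimension predicates of Q would be applied before each map is collected, as in Algorithm 2.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SbjSketch {
    // Collect a dimension as a pk -> attribute hash map; the predicates of Q
    // on that dimension would be applied here before collecting (Algorithm 2).
    static Map<Long, String> collectAsMap(JavaSparkContext sc, String path, int keyCol, int attCol) {
        Map<Long, String> map = new HashMap<>();
        List<String[]> rows = sc.textFile(path).map(l -> l.split("\\|")).collect();
        for (String[] f : rows) map.put(Long.parseLong(f[keyCol]), f[attCol]);
        return map;
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sbj-sketch"));

        // Broadcast the (filtered) dimensions as hash maps; paths and columns are hypothetical.
        Broadcast<Map<Long, String>> h2 = sc.broadcast(collectAsMap(sc, "hdfs:///ssb/dim2", 0, 1));
        Broadcast<Map<Long, String>> h3 = sc.broadcast(collectAsMap(sc, "hdfs:///ssb/dim3", 0, 2));

        // The fact RDD is filtered and projected locally on each executor:
        // no explicit join operator is needed.
        sc.textFile("hdfs:///ssb/fact")                                    // hypothetical path
          .map(l -> l.split("\\|"))
          .filter(f -> h2.value().containsKey(Long.parseLong(f[2]))        // f[2] ~ fk2
                    && h3.value().containsKey(Long.parseLong(f[3])))       // f[3] ~ fk3
          .mapToPair(f -> new Tuple2<String, Long>(
                  h2.value().get(Long.parseLong(f[2])) + "|" +
                  h3.value().get(Long.parseLong(f[3])),                    // group key ~ (a2,1, a3,2)
                  Long.parseLong(f[5])))                                   // f[5] ~ m1
          .reduceByKey((a, b) -> a + b)
          .sortByKey()
          .saveAsTextFile("hdfs:///ssb/out");                              // hypothetical path

        sc.stop();
    }
}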
Figure 2: Time performance in terms of (a) shuffled data and (b) disk spill. MapReduce strategies are shown in red and Spark strategies in blue; the orange line shows the general trend.
5 Performance Analyses
Next, we present performance analyses of our Spark solutions Spark Bloom-Filtered Cascade
Join (SBFCJ) and Spark Broadcast Join (SBJ).
5.1 Methodology and Experimental Setup
We used a cluster of 21 (1 master and 20 slaves) identical, commercial computers running
Hadoop 2.6.0 and Apache Spark 1.4.1 over a GNU/Linux installation (CentOS 5.7), each node
with two 2GHz AMD CPUs and 8GB of memory. For more intensive tests (i.e., Section 5.3), we
used Microsoft Azure with 21 A4 instances (1 master and 20 slaves), each with eight 2.4GHz
Intel CPUs and 14GB of memory. In both clusters, we have used YARN as our cluster manager.
We have used the Star Schema Benchmark (SSB) [14] to generate and query our datasets.
Unless stated otherwise, each result represents the average and standard deviation over 5 runs, keeping the confidence interval of the mean within ±100 s. All implementations used in the following
sections are available in Java at GitHub [3].
5.2 Disk spill, network communication and performance
We show in Figure 2 how our approaches compare to MapReduce strategies (see Section 2.3)
in terms of disk spill and network communication. To simplify the notation, every MapReduce-based solution starts with MR, and the respective references are cited in the figure legends.
Notice that all points referring to MapReduce define a trend correlating time performance to
network communication and disk spill (orange line in Figure 2). Although Spark is known to
outperform MapReduce in a wide range of applications, a direct implementation of sequences
of joins (referred to as SCJ) delivers poor performance and follows the same trend as the MapReduce approaches. SBFCJ and SBJ, however, are clear outliers (more than 3σ) and present remarkably higher performance. This highlights the need for additional optimizations applied on top of (or instead of) the cascade of joins. In particular, notice that both our strategies are
closely followed by MapReduce Broadcast Join (MRBJ in the figure) and a MapReduce ap-
proach proposed by Zhang et al., which processes Star Joins in two jobs using Bloom Filters
[23]. Next, we investigate in more detail each of these strategies.
Figure 2(a) shows that SBFCJ and SBJ optimize both network communication and computation time for query Q4.1 with SF 100. As mentioned, excessive cross-communication among different
nodes is one of the major drawbacks in MapReduce algorithms. When compared to the best
MapReduce approaches, SBJ shuffles 200 times less data than the MRBJ method, and SBFCJ achieves a reduction of almost 96% compared to MRCJ. Finally, although the solution proposed by
Figure 3: Impact of the Scale Factor SF on the performance of SBFCJ and SBJ.
Zhang et al. does deliver ≈25% less shuffled data than SBFCJ, it still delivers a performance
nearly 40% slower. Moreover, the test in Figure 2(b) demonstrates one of the main advantages of Spark's in-memory methodology: although both of the best MapReduce options have low spills (4GB for MRCJ and 0.5GB for MRBJ), both SBFCJ and SBJ show no disk spill at all. In this test, we have set Spark with 20 executors (on average, 1 per node) and 3GB of memory per executor. If we lower the memory, then we start seeing some disk spill from Spark and its performance drops. We study this effect further in Section 5.4.
Yet, SCJ, which simply implements a sequence of joins, presents considerably higher disk spill and computation time when compared to SBJ and SBFCJ. Specifically, not only does SCJ have non-zero disk spill, its spill is larger than that of the best MapReduce options, although 18% lower than that of its counterpart, MRCJ. SBFCJ and SBJ shuffle, respectively, 23 and over 10^4 times less data than SCJ. Therefore, the reduced disk spill and time observed with the SBJ and SBFCJ strategies are not only due to Spark's internal structure, which minimizes cross-talk among nodes and disk spill. The bottom line is that the application of additional techniques (Bloom filters or broadcasting) is indeed essential.
It is important to note that this analysis assumed the best performance of the MapReduce approaches. No scenario was observed in which either SBJ or SBFCJ trailed the other strategies.
More details on MapReduce approaches to star-join processing will be discussed elsewhere.
5.3 Scaling the dataset
In this section, we investigate the effect of the dataset volume on the performance of SBFCJ
and SBJ. Methods in general must be somewhat resilient and scale well with the database size.
Especially in the context of larger datasets, where solutions such as MapReduce and Spark
really make sense, having a linear scaling is simply essential.
In Figure 3 we show the effect of the scale factor SF on the computation time considering three different queries of SSB. To test both strategies, we have selected queries with considerably different workloads. Especially at low SFs, as shown in Figure 3(b), a constant baseline was
observed: from SF 50 to 200 the elapsed time simply does not change, revealing other possible
bottlenecks. However, such datasets are rather small, and do not reflect the applications in
which cloud approaches excel. The performance of SBFCJ and SBJ grows linearly with SF for all larger scale factors tested.
5.4 Impact of Memory per Executor
As mentioned in Section 4.2, broadcast methods usually demand memory to allocate dimension
tables. While SBJ has outperformed any other solution so far, scenarios with low memory per
executor might compromise its performance. Next, we study how the memory available to each
Figure 4: Comparing SBJ and SBFCJ performances with (a) 1GB and (b) 512MB of memory
per executor. SBJ seems reasonably sensitive to low memory cases.
Figure 5: Comparing SBJ and SBFCJ with 20 executors and variable memory. Panel (a) shows
that SBJ’s performance is severely impaired.
executor impacts both SBFCJ and SBJ. We have tested query Q4.1 with SF 200.
Parallelization in Spark is carried out by executors. For a fixed executor memory, we studied how the number of executors changes SBFCJ and SBJ performance. If enough memory is provided, performance usually follows the trends shown in Figure 4(a), where 1GB was used. However, as
the memory decreases, SBJ may be severely impacted while SBFCJ seems to be quite resilient.
In our cluster, with this particular dataset, 512MB was a turning point: Figure 4(b) shows
SBJ losing performance gradually. Below 512MB the difference becomes more pronounced. It
is important to note that the specific value of this turning point (512MB in our tests) likely
changes depending on the cluster, nodes’ configuration and, possibly, dataset.
To explore this effect in detail, we tested the performance using 20 executors while decreasing
their memory. Results in Figure 5(a) show that SBJ's performance drops until a point where SBFCJ actually outperforms it (again, around 512MB). Furthermore, in Figure 5(a), from 450 to 400MB
there is a stunning increase in computation time: it suddenly becomes more than three times
slower. To make sure that this increase in time was not an artifact of a specific executor or
node, we analyzed the average time elapsed by all tasks run by all executors. Figure 5(b)
clearly shows that the elapsed time becomes approximately five times larger in this transition,
suggesting that tasks in general require more computation time. With 256MB, there was not enough memory to broadcast all dimensions. Comparatively, however, SBFCJ seems more resilient to the amount of memory per executor than SBJ, a feature that could be exploited depending on the available resources and the application.
Finally, in Figure 6 we investigated a slightly more realistic scenario: while fixing the total memory, the number of executors increases and the executors share the memory equally. Thus, although the memory per executor is decreasing, all memory resources are always in use. As expected, with enough memory for each executor, the performance of both SBJ and SBFCJ improves (see Figure 6(b)). Yet, similarly to Figure 4(b), if the number of executors is blindly increased without increasing resources, SBJ is severely impaired while SBFCJ's performance remains remarkably stable.
Figure 6: Comparing SBJ and SBFCJ with fixed total memory while increasing the number of
executors.
In conclusion, all results in this section point towards a trade-off between these two approaches and define a clear guideline on how to choose between them depending on the cluster resources and the dataset. Regardless of the number of executors, if their memory is enough to fit the dimension RDDs, SBJ may deliver query times up to twice as fast; however, if memory is not enough, SBFCJ is the best solution available. It is important to stress that all results presented in this section are expected to scale up with the SF. Thus, for larger SFs the turning point at which SBJ slows down should be larger than 512MB.
6 Concluding remarks
In this paper, we have proposed two approaches to efficiently process Star Join queries, reducing
excessive data spill and network communication: Spark Bloom-Filtered Cascade Join (SBFCJ)
and Spark Broadcast Join (SBJ). All tested MapReduce options trail both of these algorithms
by at least 60% in terms of query execution time. It is important to stress that simply im-
plementing a cascade of joins in Spark, namely the Spark Cascade Join (SCJ), was not enough to beat the MapReduce options, showcasing the importance of using Bloom filters or broadcasting
techniques. We have also shown that both SBFCJ and SBJ scale linearly with the database
volume, which poses them as competitive solutions for Star Joins in the cloud. While SBJ is
usually faster (between 20% and 50%) when memory resources are abundant, SBFCJ was remarkably
resilient in scenarios with scarce memory. In fact, with enough resources available, SBJ has no
disk spill at all. To summarize, all of our results point towards a simple guideline: regardless
of the number of executors, if their memory is enough to fit the dimension RDDs, SBJ may deliver query times up to twice as fast; however, if memory is an issue, SBFCJ is the best solution, being remarkably robust to low-memory infrastructures. Therefore, SBFCJ and SBJ were both shown to be competitive candidates to solve Star Joins in the cloud.
Acknowledgments. Authors thank Dr. Hermes Senger for allowing us to use his labora-
tory cluster infrastructure. This work was supported by FAPESP, CAPES, CNPq, INEP, and
FINEP. JJ Brito acknowledges FAPESP grant #2012/13158-9. T Mosqueiro acknowledges sup-
port from CNPq 234817/2014-3. JJ Brito, T Mosqueiro and CDA Ciferri acknowledge Microsoft
Azure Research Award MS-AZR-0036P.
References
[1] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT 2010,
pages 99–110, 2010.
[2] V. S. Agneeswaran. Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm,
Spark, and More Hadoop Alternatives. Pearson FT Press, 2014.
[3] J. J. Brito. Star joins in Spark. https://github.com/jaquejbrito/star-join-spark, 2015. [Online;
accessed 31-January-2016].
[4] S. Chaudhuri, U. Dayal, and V. Ganti. Database technology for decision support systems. IEEE
Computer, 34(12):48–55, 2001.
[5] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communica-
tions of the ACM, 51(1):107–113, 2008.
[6] H. Demirkan and D. Delen. Leveraging the capabilities of service-oriented decision support systems:
Putting analytics and big data in cloud. Decision Support Systems, 55(1):412–421, 2013.
[7] A. Khajeh-Hosseini et al. Decision support tools for cloud migration in the enterprise. In IEEE
CLOUD 2011, pages 541–548, 2011.
[8] A. Schätzle et al. Cascading map-side joins over hbase for scalable join processing. In Joint
Workshop on Scalable and High-Performance Semantic Web Systems, page 59, 2012.
[9] A. Thusoo et al. Hive - a petabyte scale data warehouse using hadoop. In ICDE 2010, pages
996–1005, 2010.
[10] H. Han et al. Scatter-gather-merge: An efficient star-join query processing algorithm for data-
parallel frameworks. Cluster Computing, 14(2):183–197, 2011.
[11] J. J. Brito et al. Efficient processing of drill-across queries over geographic data warehouses. In
DaWak 2011, pages 152–166, 2011.
[12] M. Li et al. Sparkbench: a comprehensive benchmarking suite for in memory data analytic platform
spark. In Conf. Computing Frontiers 2015, pages 53:1–53:8, 2015.
[13] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster
computing. In NSDI 2012, pages 15–28, 2012.
[14] P. E. O’Neil et al. The star schema benchmark and augmented fact table indexing. In TPCTC
2009, pages 237–252, 2009.
[15] S. Blanas et al. A comparison of join algorithms for log processing in mapreduce. In SIGMOD
2010, pages 975–986, 2010.
[16] Y. Tao et al. Optimizing multi-join in cloud environment. In HPCC/EUC 2013, pages 956–963,
2013.
[17] David Jiang, Anthony K. H. Tung, and Gang Chen. MAP-JOIN-REDUCE: toward scalable and
efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering,
23(9):1299–1311, 2011.
[18] R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. Wiley Computer Publishing, 2 edition, 2002.
[19] J. P. Shim, M. Warkentin, J. F. Courtney, D. J. Power, R. Sharda, and C. Carlsson. Past, present,
and future of decision support technology. Decision Support Systems, 33(2):111 – 126, 2002.
[20] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and practice of bloom filters for
distributed systems. IEEE Communications Surveys and Tutorials, 14(1):131–155, 2012.
[21] Hugh J. Watson and Paul Gray. Decision Support in the Data Warehouse. Prentice Hall Profes-
sional Technical Reference, 1997.
[22] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing
with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing, 2010.
[23] C. Zhang, L. Wu, and J. Li. Efficient processing distributed joins with bloomfilter using mapreduce.
Int. Journal of Grid and Distributed Computing, 6(3):43–58, 2013.
[24] B. Zhu, A. Mara, and A. Mozo. CLUS: parallel subspace clustering algorithm on spark. In ADBIS
(Short Papers and Workshops) 2015, pages 175–185, 2015.