CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. (2015)
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3702
SPECIAL ISSUE PAPER
MEMoMR: Accelerate MapReduce via reuse of
intermediate results
Hong Yao1, Jinlai Xu2, Zhongwen Luo2 and Deze Zeng1,*,†
1School of Computer Science, China University of Geosciences, Wuhan, China
2School of Information Engineering, China University of Geosciences, Wuhan, China
SUMMARY
MapReduce has been widely regarded as a flexible, scalable, and easy-to-use distributed programming paradigm for big data processing, such as social network data analysis, on cloud computing platforms. To embrace the upcoming big data era, many efforts have been devoted to accelerating MapReduce performance from different aspects, especially intermediate result reusing as in Dache. In this paper, we observe that the existing intermediate result reusing mechanism is not efficient enough, as many I/O operations are wasted; more efficient reusing of the intermediate results could potentially improve MapReduce performance. Inspired by this fact, we propose a framework named MEMoMR (more efficient intermediate result reusing for MapReduce), which introduces a novel reusing mechanism that substantially reduces the I/O overhead. To this end, we devise a new metadata description method and apply it in the reusing phase. We realize MEMoMR and evaluate its performance by implementing it in a real cluster. The experiment results show that MEMoMR can improve system performance by as much as 23.4% compared with Dache. Copyright © 2015 John Wiley & Sons, Ltd.
Received 5 April 2015; Revised 17 August 2015; Accepted 5 September 2015
KEY WORDS: MapReduce; Hadoop; performance acceleration; intermediate result reusing
1. INTRODUCTION
With the recent data explosion, big data processing, for example, social network analysis, provides new impetus for productivity growth, innovation, and consumer surplus, and is becoming more and more important for modern enterprises pursuing higher profits [1]. Cloud computing platforms provision a pool of computation, storage, and communication resources to host big data processing tasks. To use these resources efficiently, many novel programming paradigms have been proposed. One representative, MapReduce, invented by Google [2, 3], can automatically and transparently parallelize data processing tasks across different servers in data centers. Thanks to its advantages in scalability, flexibility, and ease of use, it has become the de facto standard framework for big data processing and is widely applied. For example, for social network analysis, Chen et al. [4] implement Surfer for large graph processing based on the MapReduce framework, and Zhong et al. [5] develop ComSoc, which uses MapReduce to accurately predict user behaviors by combining data from several social networks. Because of its importance in big data processing, MapReduce has also become a hot research topic, as many researchers and engineers intend to improve it from various aspects [6–14].
*Correspondence to: Deze Zeng, School of Computer Science, China University of Geosciences, No.388, Lumo Road,
Wuhan 430074, China.
E-mail: deze@cug.edu.cn
Figure 1. Original MapReduce procedure.
Figure 1 illustrates the normal procedure of a MapReduce process, which mainly consists of the map, shuffle, and reduce phases. Input data are split into several parts before they are distributed to the workers in the map phase. Then, the intermediate results generated by the workers in the map phase are shuffled, sorted, and transferred to the workers in the reduce phase. The final results are calculated by the workers in the reduce phase and written to the distributed file system (DFS). As the demand for big data processing increases significantly, speed means everything. Many applications built on big data processing systems are attracting attention, such as social network analysis and geographic data analysis. Most of these analyses share a characteristic: they are composed of many incremental or duplicated jobs, like the WordCount computations in social network analysis and the range query computations in geographic data analysis. Such incremental and duplicated jobs can be accelerated by automatically reusing the intermediate results. It can be seen that many intermediate results are generated between the map and reduce phases. Pioneering researchers have found that it is possible to optimize MapReduce performance by considering the intermediate results. For example, Palanisamy et al. [8] and Hammoud et al. [15] propose Purlieus and CoGRS, respectively, to minimize the network traffic incurred by intermediate results. Jiang et al. [13] propose MATE, which heavily modifies the MapReduce procedure to avoid shuffling and achieves higher performance. On the other hand, it has also been noticed that MapReduce performance could be accelerated by reusing the intermediate results, which are normally discarded in the traditional MapReduce framework. Inspired by this concept, many novel frameworks, for example, Dache [16], ReStore [17], and Spark [18], have been proposed in the literature or even released in industry.
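For readers less familiar with the model, the following minimal WordCount job, written against the stock Hadoop 2.x MapReduce API, shows where these intermediate results arise. This listing is our own illustration of the programming model, not code from MEMoMR or any of the frameworks above.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit <word, 1> for every token in the input split.
        public static class Map extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE); // intermediate result, normally discarded after the job
                }
            }
        }

        // Reduce phase: sum the counts shuffled to this reducer for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(Map.class);
            job.setCombinerClass(Reduce.class); // local combine before the shuffle
            job.setReducerClass(Reduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Every <word, 1> pair the mapper emits is an intermediate result that the traditional framework discards once the job finishes; the mechanisms discussed in this paper aim to keep and reuse such results.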
To reuse the intermediate results in MapReduce, several challenges must be tackled. Firstly, MapReduce is unable to identify duplication across jobs' procedures, which is why many operations are repeatedly and unnecessarily conducted. Furthermore, the intermediate process is transparent to developers, who cannot directly manipulate the intermediate results. Dache has successfully proved the benefit of intermediate result reusing. However, we notice that Dache often requires data transfer between the mappers and the database hosting the intermediate results. This incurs unnecessary I/O overhead and limits the performance acceleration.
In this paper, to address these challenges, we propose MEMoMR (more efficient intermediate result reusing for MapReduce) – an intelligent MapReduce intermediate result reusing system, which improves MapReduce performance by making full use of the intermediate results and avoiding unnecessary I/O operations. The contributions of this paper are as follows:
We design MEMoMR, which is able to efficiently reuse the intermediate results with much lower I/O overhead. We invent a virtual fetching mechanism for reusing intermediate results in the map phase. Correspondingly, a novel metadata description is invented for the identification of reusable intermediate results, avoiding unnecessary data transfer.

We implement MEMoMR on a real cluster platform and evaluate its efficiency by comparing it against the traditional MapReduce framework and Dache. Performance evaluation results show
that MEMoMR can improve the performance by as much as 95.6% and 23.4% over MapReduce and Dache, respectively.
This paper is organized as follows. Section 2 summarizes recent related work on MapReduce performance optimization. Section 3 gives our measurement results for the original MapReduce and Dache as the motivation of this work. Section 4 details the design of MEMoMR. Section 5 reports the evaluation results. Finally, Section 6 concludes our work.
2. RELATED WORK
Since its original publication in 2004 by Google [2], MapReduce has become dominant in data analytics and has attracted much interest from researchers seeking to optimize its performance from various aspects. For example, Jahani et al. [19] propose MANIMAL, which places an analyzer and an optimizer between the user's program and the running process to optimize execution time and minimize the queried catalog. Recently, Nykiel et al. [20] invented the MRShare framework, which merges a batch of queries into a new batch to improve query efficiency. Vernica et al. [21] partition the key-value pairs and carefully control the memory usage to speed up join execution. Okcan et al. [22] use a randomized algorithm named 1-Bucket-Theta to accelerate arbitrary joins based on MapReduce. Metwally et al. [23] propose a two-stage algorithm to minimize the join time in MapReduce.
Besides the aforementioned work focusing on specific issues, some studies have also been conducted towards universal MapReduce optimization. Trading storage for computation has been regarded as a promising direction, and some representative works have been proposed. For example, ReStore [17] keeps the job results of a workflow and reuses them when the same workflows are submitted in the future; however, ReStore is constrained to the map phase's results only. Peng et al. [24] propose Percolator, which requires the programmer to write the program based on an event-driven model. Popa et al. [25] propose DryadInc, an extension to Dryad that reuses executed tasks, but its range of application is limited by a mechanism that only reuses identical sub-DAGs. Bhatotia et al. [26] invent Incoop, aiming at improving the performance of incremental jobs, but it has several limitations: (1) it requires fetching and storing the results in the map phase, which can be skipped with an appropriate mechanism; and (2) it adds 'content-based markers' to HDFS splits, which increases the burden on HDFS and potentially degrades performance, especially for I/O-intensive jobs. Another famous framework, Spark [27], uses resilient distributed datasets to hold the intermediate results in memory for possible future reusing, but Spark requires the developers to manually declare the data that shall be kept persistently in memory. The most similar work to ours is Dache [16], which has a cache manager to control and reuse the intermediate results between the map phase and the reduce phase.
However, we notice that Dache often requires fetching the intermediate results from the nodes that store them, as shown by the red lines in Figure 2 (the mapper in the bottom left and the reducer in the bottom right are the nodes storing the intermediate results, which were calculated before).

Figure 2. The reuse procedure of Dache.

In our design, instead of transferring actual data between the workers, either mappers or reducers, we use metadata to reduce this I/O consumption and thereby accelerate the MapReduce performance.
Furthermore, Dache suffers from a single-point-of-failure problem, as there is only one cache manager. Once the cache manager fails, the whole system may collapse because no intermediate result can be fetched any more. Our work addresses this problem by proposing a distributed intermediate result management scheme.
3. MOTIVATION
In this section, we give a statistical analysis of two representative benchmark MapReduce jobs, namely WordCount and TeraSort, on both the original MapReduce and Dache, to expose our motivation. Based on the analysis, we also discuss the feasibility and benefit of reusing intermediate results for performance acceleration.
We conduct measurement experiments on WordCount and TeraSort using three physical machines, as shown in Table I, and 11 virtual machines (VMs), as shown in Table II. The testing input data for TeraSort are generated by the TeraGen tool, with sizes ranging from 1 to 10 GB. The input data for WordCount are generated by the RandomTextWriter tool, with sizes ranging from 5 to 50 GB. Figure 3 presents our measurement results on the CPU time consumption of the three main MapReduce phases. It can be seen from Figure 3(a) that the map phase in WordCount consumes the most CPU time, accounting for more than 85% in most cases. We also notice that increasing the data size slightly decreases the map phase's proportion, for example, from 95% to 85% when the data size increases from 5 to 50 GB. The reduce phase requires only a little CPU time, less than 1%, and is therefore hardly visible in the figure. On the other hand, the measurement results for TeraSort present a different phenomenon, as shown in Figure 3(b), where the map phase is not as dominant as in the WordCount case. The time consumed in the map phase and in the reduce phase shows as a decreasing and an increasing function of the input data size, respectively: when the input data size is 1 GB, the map phase accounts for 80% of the time but drops to 25% when the size increases to 10 GB, whereas the reduce phase requires 17% of the time for the 1-GB input and grows to 60% for the 10-GB input. For both WordCount and TeraSort, the shuffle phase requires at most 15% of the time. Based on the measurement results in Figure 3, it can easily be derived that if the time required by the map and reduce phases is reduced, the overall system performance can be significantly accelerated.
Intermediate result reusing has been proven an efficient way to accelerate MapReduce performance. To further validate this in our system, we conduct more measurement experiments based on Dache [16] and present the results in Figure 4. We implemented the ideas in Dache [16] and deployed them in the real cluster described previously. It can be seen from the figure that checking the availability of previous intermediate results, that is, 'map cache' and 'shuffle cache' in the figure, consumes much time because of the high overhead of I/O operations. For example, 'map cache' accounts for around 20% and 18% of the time for TeraSort and WordCount, respectively. We argue that this part is unnecessary and can be alleviated to further accelerate the overall system performance.

Table I. The hardware configuration of the cluster.

No.  Physical machine  CPU                                 Memory (GB)
1    Sugon I620-G20    2 Xeon(R) E5-2650 CPUs at 2.6 GHz   64
2    Sugon I620-G20    2 Xeon(R) E5-2650 CPUs at 2.6 GHz   64
3    Sugon I620-G20    2 Xeon(R) E5-2650 CPUs at 2.6 GHz   64

Table II. The virtual machine configuration of the cluster.

No.   Role    On  CPU         Memory (GB)  Instance
1     Master  1   Four vCPUs  8            NameNode, ResourceManager, and JobHistory
2     Slave   1   Two vCPUs   4            DataNode, SecondaryNameNode, and NodeManager
3–5   Slave   1   Two vCPUs   4            DataNode and NodeManager
6–8   Slave   2   Two vCPUs   4            DataNode and NodeManager
9–11  Slave   3   Two vCPUs   4            DataNode and NodeManager

Figure 3. Proportion of CPU time in each phase of the original MapReduce.

Figure 4. The time proportion of each phase with intermediate result reusing in 5-GB WordCount and TeraSort based on Dache.

Therefore, we are inspired to find a more efficient way to reuse the intermediate results for MapReduce performance acceleration and design MEMoMR, as detailed in the next section.
4. DESIGN OF MEMoMR
In this section, we describe the design of MEMoMR. We first show our concept and then detail the
warm-up stage and reuse stage, respectively.
Figure 5. MEMoMR system overview.
4.1. Design concept
Our measurement results clearly show that intermediate result reusing is indeed a promising way to accelerate MapReduce performance. The most advanced representative work based on this concept, that is, Dache, is still not efficient enough because of unnecessary I/O operations. Potentially, we can further improve the performance by avoiding this redundant data transfer.
An overview of our MEMoMR process is illustrated in Figure 5. In order to avoid unnecessary I/O operations, we propose a 'virtual' fetching mechanism in the map phase and introduce two metadata cache modules (gray in the figure) for the intermediate results after the map phase and the reduce phase, respectively. The two modules are responsible for checking the availability of intermediate results. Each metadata cache module can be any database system that supports querying; considering scalability, a distributed database system is recommended, which also prevents the single-point-of-failure problem.
No actual intermediate results are transferred between the workers and the two modules, only small pieces of metadata. A DFS-based module is introduced to store the actual intermediate results. Thus, instead of actually transferring the intermediate results when a cache hit occurs in the map phase, as Dache does, we do this virtually by notifying the corresponding mapper that the intermediate result is available in the cache. The mapper virtually puts the intermediate results in its output queue (see the light orange rectangle for results '1' and '2' in the figure). The actual data transfer happens only in the shuffle phase: if the data have already been virtually put in the output queue of a mapper, the actual results are fetched from the DFS and put into the input queue of the corresponding reducer according to the key values, following the original MapReduce framework. To enable the virtual fetching mechanism, a metadata description is designed according to the intermediate result reusing requirements in the map and reduce phases, respectively. We detail our design for the two phases in the following sections.
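As a sketch of the role the two modules play, a metadata cache could expose an interface along the following lines. The interface and all of its names are purely our illustration under the assumptions above, not MEMoMR's actual API.

    import java.util.Optional;

    // Illustrative interface of a MEMoMR-style metadata cache module.
    // Queries and answers carry only small metadata strings, while the
    // actual intermediate results stay in the DFS until the shuffle phase.
    public interface MetadataCache {
        // Returns the DFS location of a reusable intermediate result, if any.
        Optional<String> lookup(String metadataKey);

        // Registers metadata for a newly produced intermediate result in the DFS.
        void register(String metadataKey, String dfsLocation);
    }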
4.2. Warm-up stage
To reuse the intermediate results, it is first required that they have been generated and stored in the
database. Hence, MEMoMR first has a warm-up stage for intermediate result generation.
4.2.1. Map phase. Let us first look at the warm-up stage in the map phase. Recall that an intermediate result after the map phase is generated by a mapper according to its assigned data split and the processing logic defined in the map() function. If a future map task has the same operation and the same data split, the result can be reused.
Inspired by this fact, it is natural to index an intermediate result by exactly this information. Therefore, we describe the intermediate result after the map phase as metadata in the form

<split_input, map_function>,

where

split_input denotes the input file split. It can be described straightforwardly using the file split's name and pointer (with the start position and size). Besides, in order to react when the split's content is modified, we may also checksum or hash the split to obtain a checksum value or a hash code for the input split description.

map_function is the description of the map function and can be identified by features such as the map class's name.

Figure 6. Map phase process for reusing the intermediate results.
In the warm-up stage, as no map task has been conducted in the system before, the 'check' operation, which verifies whether corresponding intermediate results exist, must return 'miss', as for the map task for 'Split1' in Figure 6. Therefore, the normal map process, including the map function, the local sort operation, and the combine operation, is executed. Different from the traditional MapReduce framework, to enable future intermediate result reusing, MEMoMR uploads the metadata described previously, together with the corresponding intermediate results, into the database once all the map-phase computations are completed.
Let us illustrate the aforementioned framework with an example for further understanding of MEMoMR's map phase in the warm-up stage. Consider a map task described by the tuple

<"hdfs://hdfs-dir/", "part-01", "0+1024", "0x24"; examples.WordCount$Map>

The first item indicates the split file on the HDFS to be processed by the mapper. The input file split description consists of four parts in the metadata: the HDFS path, the file name, the split pointer (with the start position and size), and the checksum, for example, "hdfs://hdfs-dir/", "part-01", "0+1024", and "0x24", respectively, in the example. In certain data analytic tasks, the input data may be incremental, that is, with new content appended at the end of the input file; in this situation, the first three features (the HDFS path, the file name, and the split pointer) suffice to check whether the intermediate result is available. The second item in the metadata example indicates that the map function is in the package "examples" and the class "WordCount$Map", where $ is a symbol indicating that 'Map' is a subclass of 'WordCount'. The metadata are stored in a distributed database for future querying. As shown in Figure 5, the intermediate result produced by the map task will be sent to the reducers for the reduce phase's computation when there is a metadata hit.
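As an illustration, the following sketch assembles such a metadata record for the example task above. The field layout follows the four-part description; the CRC32 checksum, the string encoding of the key, and all class and field names are our own assumptions, not a prescribed MEMoMR layout.

    import java.util.zip.CRC32;

    // Illustrative map-phase metadata record, <split_input, map_function>.
    public final class MapMetadata {
        final String hdfsPath;   // e.g. "hdfs://hdfs-dir/"
        final String fileName;   // e.g. "part-01"
        final long start;        // split pointer start, e.g. 0
        final long length;       // split pointer size, e.g. 1024
        final String checksum;   // e.g. "0x24"; guards against modified split content
        final String mapClass;   // e.g. "examples.WordCount$Map"

        MapMetadata(String hdfsPath, String fileName, long start, long length,
                    byte[] splitContent, String mapClass) {
            this.hdfsPath = hdfsPath;
            this.fileName = fileName;
            this.start = start;
            this.length = length;
            CRC32 crc = new CRC32();   // CRC32 is just one possible checksum choice
            crc.update(splitContent);
            this.checksum = "0x" + Long.toHexString(crc.getValue());
            this.mapClass = mapClass;
        }

        // Serialized key used to query the metadata database. For incremental
        // inputs, a lookup could drop the checksum and match on the first
        // three features only, as described above.
        String toKey() {
            return hdfsPath + fileName + ":" + start + "+" + length
                    + ";" + checksum + ";" + mapClass;
        }
    }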
4.2.2. Reduce phase. When it comes to the reduce phase in the warm-up stage, the intermediate results of the reduce phase are not yet ready for reusing. Before the reduce phase, there is the shuffle phase, which is responsible for transferring the results from the mappers to the reducers according to the key values. In the map phase, the partitioner function calculates the partition number from the key of each key-value pair and the total number of reducers. Generally, developers do not need to explicitly control the partition operation: the default partitioner function is a hash function defined by the original MapReduce, and the map task outputs are transferred to the corresponding reducers decided by the partition number. The shuffle phase merges and sorts the key-value pairs transferred from the map tasks to generate the input for the reduce function. It can be seen from the aforementioned procedure that the shuffle output for a reducer is determined by a set of factors: the partitioner function, the reducer's index, the total number of reducers, and the map tasks' outputs. Accordingly,
to represent an intermediate result in the reduce phase, we define the metadata description in the form

<partitioner, reducer_index, N_reduce, {<split_input, map_function>_1, ..., <split_input, map_function>_M}>,

where

partitioner is the partition function defined by the user in the MapReduce model; it defines how to distribute the key-value pairs;

reducer_index is the index of the reducer;

N_reduce is the number of reducers in the job; and

{<split_input, map_function>_1, ..., <split_input, map_function>_M} is the set of inputs from the shuffle phase, each in the form of the map intermediate result metadata defined in Section 4.2.1.
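To make the tuple concrete, here is an illustrative sketch of the reduce-phase metadata record; the static helper shows the default hash partitioning logic of stock Hadoop's HashPartitioner, which applies when no custom partitioner is set. The record layout mirrors the tuple above, but the class and field names are ours.

    import java.util.List;

    // Illustrative reduce-phase metadata record:
    // <partitioner, reducer_index, N_reduce, {map metadata}>.
    public final class ReduceMetadata {
        final String partitionerClass;    // e.g. "examples.HashPartitioner"
        final int reducerIndex;           // reducer_index
        final int numReducers;            // N_reduce
        final List<String> mapInputKeys;  // map-phase metadata keys feeding this reducer

        ReduceMetadata(String partitionerClass, int reducerIndex,
                       int numReducers, List<String> mapInputKeys) {
            this.partitionerClass = partitionerClass;
            this.reducerIndex = reducerIndex;
            this.numReducers = numReducers;
            this.mapInputKeys = mapInputKeys;
        }

        // Default partitioner in stock Hadoop (HashPartitioner): mask the
        // sign bit of the key's hash, then take it modulo the reducer count.
        // A key-value pair such as <ABC, 2> is routed to the reducer whose
        // index equals this partition number.
        static int defaultPartition(Object key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }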
Intermediate result reusing in the reduce phase is more complicated because of its complex process. For ease of understanding, we illustrate the procedure with an incremental WordCount example, with the warm-up stage shown in Figure 7 and the reuse stage in Figure 8. In Figure 7, we have two mappers and two reducers. Because this is the warm-up stage, all the tasks are brand-new: all the intermediate results need to be produced by the mappers and transferred to the reducers. Each key-value pair generated in the map phase contains three parts: the partition number, the key, and the value, for example, <0, ABC, 2> in Figure 7. These pairs are then transferred to the corresponding reducers, for example, <0, ABC, 2> to Reducer0 and <1, DEF, 4> to Reducer1.

Figure 7. Warm-up stage illustration.
Figure 8. Reuse stage illustration.
After all the aforementioned results are merged and sorted in the shuffle phase, MEMoMR uploads the metadata and results of the shuffle phase. The metadata in Reducer0 could be

<"examples.HashPartitioner", 0, 2, {Mapper0, Mapper1}>,

where "examples.HashPartitioner" is the partitioner, "0" is the reducer index, and "2" is the total number of reducers in the job. {Mapper0, Mapper1} indicates that the reduce intermediate result is related to the inputs described by the metadata of Mapper0 and Mapper1. The intermediate results of the shuffle phase are uploaded to the DFS for future reusing. After that, the reduce process proceeds normally as in the original MapReduce framework, as shown in Figure 7.
4.3. Reuse stage
After the warm-up stage, the intermediate results are ready for reusing in both the map phase and the reduce phase. The reusing procedures are described as follows.
4.3.1. Map phase. When a mapper is invoked with a certain task, instead of starting actual processing immediately, it first checks the availability of previous results with the same metadata in the distributed database.

We detail the map process with intermediate result reusing in Figure 6. As shown in the figure, once a map task begins, the check module first sends a query to the database, which then searches for any metadata matching the query. If a match exists, it means that the same split has been processed by the same map function before, so there is no need to actually process the split any more. Note that we do not actually fetch the intermediate result to the mapper either: we simply put a flag in the mapper to indicate that the intermediate result is ready for the shuffle phase, and we set the map task as 'complete' in the MapReduce schedule. We view this operation as 'virtual' fetching, as shown by the map task for 'Split0' in Figure 6. Otherwise, if no matching item is found, the map phase proceeds normally as defined in the description of the warm-up stage in Section 4.2.1. According to this procedure, one shall notice that we never need to actually fetch the intermediate results, which removes the unnecessary I/O overheads of Dache and potentially improves the overall performance.
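A minimal sketch of this check-then-virtually-fetch logic is given below. It is our illustration only: MetadataCache and MapMetadata are the hypothetical types sketched earlier, and the two private methods stand in for the unmodified Hadoop code paths.

    import java.util.Optional;

    // Sketch of the reuse-stage map path: check the metadata cache first and
    // 'virtually' fetch on a hit.
    public class ReuseAwareMapLauncher {
        private final MetadataCache mapCache;

        public ReuseAwareMapLauncher(MetadataCache mapCache) {
            this.mapCache = mapCache;
        }

        // Returns the DFS location the shuffle phase should later read from.
        public String launch(MapMetadata meta) {
            Optional<String> hit = mapCache.lookup(meta.toKey());
            if (hit.isPresent()) {
                // Virtual fetch: no intermediate data moves now. We only flag
                // the result as ready and mark the task 'complete' in the
                // schedule; the shuffle phase pulls the bytes from the DFS.
                markTaskComplete(meta);
                return hit.get();
            }
            // Cache miss (warm-up path): run map -> local sort -> combine,
            // then register metadata and DFS location for future jobs.
            String dfsLocation = runNormalMapTask(meta);
            mapCache.register(meta.toKey(), dfsLocation);
            return dfsLocation;
        }

        private void markTaskComplete(MapMetadata meta) {
            // Scheduler hook, elided in this sketch.
        }

        private String runNormalMapTask(MapMetadata meta) {
            // Normal map processing, elided in this sketch.
            return "dfs://intermediate/" + meta.toKey().hashCode();
        }
    }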
4.3.2. Reduce phase. When it comes to the reduce phase in the reuse stage with reusable intermediate results, let us again use an example to explain the procedure. In this example, the input file is appended with some new data compared with a previous job (i.e., the one in Section 4.2.2). Suppose that we have one more mapper than the previous job, that is, three mappers and two reducers in the new job. Two mappers have intermediate results in the database, and therefore no actual map processing is required for them because we have already carried it out before. One mapper has to go through normal map processing and output intermediate results according to the map() function. Remember that if a cache hit happens in the map phase, we do not actually fetch the intermediate result to the corresponding mapper; the actual fetching happens only in the shuffle phase. To check the intermediate results in the big table in the shuffle phase, MEMoMR first finds all the items matching the first three features, that is, the partitioner function, the reducer's index, and the number of reducers. The feature describing the map tasks' outputs is matched with a different strategy in order to make the intermediate results reusable in jobs with incremental data: the maximum matching subset in the big table is regarded as the matching output. For example, Reducer0 in Figure 8 finds the maximum subset of <"examples.HashPartitioner", 0, 2, {Mapper0, Mapper1, Mapper2}> as <"examples.HashPartitioner", 0, 2, {Mapper0, Mapper1}>. The metadata is the key stored in the distributed database for querying, and the value is the location where the intermediate result is stored in the DFS. As long as there is a metadata hit, Reducer0 fetches the stored results from the DFS and then merges them with the output from the other mappers, in this example the only brand-new map task, Mapper2. Upon collecting the outputs from the mappers, either directly from the actual mappers or virtually from the DFS, the reducer proceeds to the reduce phase defined by the reduce() function, which takes the map outputs as parameters. The metadata and the input file for the reduce function will also be uploaded to the database and the DFS, respectively, because no exactly matching metadata is stored in the database.
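The maximum-subset matching described above could be sketched as follows. This is again our illustration: ReduceMetadata is the hypothetical record from Section 4.2.2, and we assume the cache returns every stored entry for inspection.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of the maximum-subset matching used for shuffle-phase reuse.
    public class ShuffleCacheMatcher {

        // Among stored entries whose partitioner, reducer index, and reducer
        // count match the new job, pick the one covering the largest subset
        // of the new job's map inputs. Uncovered mappers run normally and
        // their outputs are merged with the reused shuffle result.
        static ReduceMetadata maximumMatch(ReduceMetadata query,
                                           List<ReduceMetadata> stored) {
            Set<String> wanted = new HashSet<>(query.mapInputKeys);
            ReduceMetadata best = null;
            int bestSize = 0;
            for (ReduceMetadata c : stored) {
                boolean sameConfig = c.partitionerClass.equals(query.partitionerClass)
                        && c.reducerIndex == query.reducerIndex
                        && c.numReducers == query.numReducers;
                // Reusable only if ALL of the candidate's map inputs appear in
                // the new job (incremental data adds mappers; it never
                // invalidates old ones whose splits are unchanged).
                if (sameConfig && wanted.containsAll(c.mapInputKeys)
                        && c.mapInputKeys.size() > bestSize) {
                    best = c;
                    bestSize = c.mapInputKeys.size();
                }
            }
            return best; // null: no reusable shuffle result, fall back to normal path
        }
    }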
5. PERFORMANCE EVALUATION
5.1. Implementation and experiment environments
We implement our prototype system on CDH (Cloudera's Distribution including Apache Hadoop) and PostgreSQL [28]. We test the performance of MEMoMR in a cluster with two physical machines, as shown in Table III, and seven VMs with the following configurations: the master node has 4-GB memory, the other worker nodes have 3-GB memory, and each node has four virtual 2.5-GHz CPUs, as shown in Table IV. We again use TeraSort and WordCount with different time requirements as the testing benchmarks; the testing cases of TeraSort and WordCount are also generated by TeraGen and RandomTextWriter, respectively. We then evaluate the performance of our MEMoMR framework by comparing it against the original MapReduce framework as well as Dache under different settings. We are mainly interested in the performance when the 'virtual fetch' mechanism is enabled, as this is the key difference between MEMoMR and Dache. The other optimizations introduced by Dache, for example, data locality and fully distributed data storage, are complementary to MEMoMR and therefore are not considered in the experiments.

Table III. The hardware configuration of the cluster.

No.  Physical machine      CPU                                 Memory (GB)
1    Dell PowerEdge R720   2 Xeon(R) E5-2600 CPUs at 2.5 GHz   32
2    Dell PowerEdge R510   2 Xeon(R) E5540 CPUs at 2.53 GHz    24

Table IV. The virtual machine configuration of the cluster.

No.  Role    On  CPU         Memory (GB)  Instance
1    Master  1   Four vCPUs  4            NameNode, DataNode, ResourceManager, NodeManager, and JobHistory
2    Slave   1   Four vCPUs  3            DataNode, SecondaryNameNode, and NodeManager
3–4  Slave   1   Four vCPUs  3            DataNode and NodeManager
5–7  Slave   2   Four vCPUs  3            DataNode and NodeManager
5.2. On the performance acceleration
We first conduct a group of experiments to check how MEMoMR accelerates MapReduce performance. We perform both WordCount and TeraSort tasks as described previously on the original MapReduce framework ('MapReduce'), the Dache framework ('Dache'), and MEMoMR ('MEMoMR'). The input data are 1 GB for TeraSort and 22 GB after compression (70 GB in the original) for WordCount. We plot the evaluation results in Figure 9. Obviously, reusing the intermediate results can significantly reduce the task completion time. For example, Dache reduces the task completion time by 95.6% and 8.7% for WordCount and TeraSort, respectively. Thanks to the novel intermediate result reusing mechanism design, our work further reduces the task completion
time by 18.4% and 23.4% compared with Dache. This validates our expectation that avoiding unnecessary I/O operations can further promote the system performance.

Figure 9. The task completion time comparison.

Another interesting observation from the figure is that intermediate result reusing yields more benefit for WordCount than for TeraSort. This is attributed to the fact that the time consumption of the WordCount task is dominated by the map phase, as reported in our measurement results in Figure 3.
5.3. Comparison of stand-alone and cluster environments
We then further validate the efficiency of MEMoMR under different environments, that is, a stand-alone server and a seven-server cluster. With the same data input as in the last section, the results in terms of speed-up ratio compared with the original MapReduce framework are reported in Table V. Once again, we notice that different tasks exhibit different speed-up ratios. This is because the intermediate results generated by the WordCount mappers, that is, the numbers of occurrences of certain phrases, are much smaller than the original dataset. Without doubt, reusing the intermediate results can substantially reduce the time consumed in the map phase. On the other hand, for TeraSort, we notice that increasing the cluster size even reduces the speed-up ratio from 3.6 to 1.1. This is because TeraSort requires much time in the reduce phase for merging and sorting; increasing the cluster size may increase the time spent on I/O and therefore may even degrade the speed-up ratio. The results in Table V also indicate that reusing intermediate results benefits map-phase-dominated jobs like WordCount more.
5.4. On different input data sizes
To evaluate the efficiency of MEMoMR more extensively, we conduct a group of experiments with different input data sizes, using the same configurations as in Section 3. We increase the input data size from 1 to 10 GB for TeraSort jobs, similar to our measurement studies. Each mapper is assigned 128 MB of data, and each reducer processes approximately 1 GB of data; for 1-GB TeraSort, there are eight mappers and one reducer. The number of tasks increases linearly with the data volume, so we have 80 mappers and 10 reducers for 10-GB TeraSort. The evaluation results are reported in Figure 10. We first notice that the input data size does not have much impact on the speed-up ratio: MEMoMR accelerates the performance by around 30%, with variation of less than 10% across the different input data sizes, compared with the original MapReduce framework. On the other hand, the advantage of MEMoMR over Dache is also always observed for any input data size, with a speed-up ratio as high as 20%. This further validates the correctness of our design.
Table V. The speed-up ratio in the stand-alone and cluster environments.

Speedup      TeraSort  WordCount
Stand-alone  3.6       6.4
Cluster      1.1       22.9
Figure 10. The completion time in TeraSort jobs with data size from 1 to 10 GB.
5.5. On different task numbers
Finally, we evaluate the performance under different numbers of map tasks by increasing the map task number from 10 to 100 for both WordCount with 10-GB uncompressed input data and TeraSort with 1-GB input data. In this experiment, we vary the number of mappers from 10 to 100 while fixing the number of reducers at 1. Task failure is inevitable in MapReduce [2]; a task failure results in rescheduling on the master node and severely influences the performance. When the task number is large, some tasks may fail unexpectedly at runtime, so the task completion time is not stable. To address this problem, the aforementioned experiments on both TeraSort and WordCount are conducted 10 times with the same input data on the same cluster, and the average speedup is calculated after excluding the maximum and minimum values. Figure 11 gives the performance evaluation results. Obviously, different jobs exhibit different relationships between the map task number and the speed-up ratio. From Figure 11(a), we can see that the speed-up ratio of WordCount decreases non-linearly with the map task number: it first decreases quickly and then converges as the map task number increases. This is because increasing the map task number introduces overheads in the split operation and task rotation time, and hence the speed-up ratio degrades. However, the evaluation results for TeraSort, shown in Figure 11(b), exhibit a totally different phenomenon, as the speed-up ratio is not much influenced by the map task number. This is because the map phase is not dominant in TeraSort according to our measurement studies. These observations tell us that the map task number plays an important role in jobs whose map phase is time dominant.
Figure 11. The speedup in WordCount and TeraSort jobs with map task numbers from 10 to 100.
Figure 12. Performance comparison in an incremental job.
5.6. On incremental jobs
In order to show how MEMoMR performs in general cases, we consider an incremental job whose input data are increased step by step. The experiment was conducted on a cluster with three servers and 11 VMs. A WordCount job with a total data size of 10 GB is considered in this experiment, and 1 GB of data is added each time to emulate an incremental job. The performance of the unmodified MapReduce, Dache, and our MEMoMR is investigated. The comparison results are reported in Figure 12, from which we can see that MEMoMR clearly outperforms both MapReduce and Dache at every stage of the experiment. In addition, the gap becomes even bigger when the total dataset size is larger. We attribute this advantage to the 'virtual' fetch mechanism of MEMoMR: the data transfer time increases with the total dataset size, while MEMoMR saves much I/O time with 'virtual' fetch.
6. CONCLUSION AND FUTURE WORK
In this paper, we propose MEMoMR, which is able to efficiently reuse the intermediate results of MapReduce jobs with the same map or reduce tasks to accelerate the system performance. To avoid repeatedly fetching the available intermediate results, we invent a metadata description method and propose a 'virtual' fetching mechanism that is transparent to developers. We also successfully implement MEMoMR and deploy it onto a cluster. Through extensive performance evaluations, we show that MEMoMR outperforms the existing intermediate result reusing framework Dache by as much as 18.4% and 23.4% for TeraSort and WordCount, respectively. Some interesting phenomena are also discovered from the experiment results, which further prove that MEMoMR provides a promising performance acceleration solution and is applicable in practice. Our future work includes further I/O performance improvement for larger datasets in MEMoMR. When dealing with large datasets, the bottleneck may lie in data transfer between physical machines with low I/O performance (e.g., hard-disk reading). In this case, I/O performance is often the bottleneck and may cause many problems; for example, some subtasks may become stragglers because of slow I/O operations, and the runtime may also become more unpredictable. As a result, it is important to address how to improve the I/O performance for larger datasets. We believe that keeping the most frequently used intermediate results in memory would potentially improve the system performance, and we will investigate this issue in our future work. Further, we also plan to include a mechanism that evaluates the applicability of the reuse strategy by collecting information (e.g., transfer rate and computation speed) from previous jobs, with which we can estimate the overhead of reusing the intermediate results: if the reusing effort is worthwhile, intermediate result reusing is applied; otherwise, the normal MapReduce procedure is followed.
ACKNOWLEDGEMENTS
This research was supported by the NSF of China (grant nos. 61272470, 61402425, 61305087, 41404076,
and 61440060), the Fundamental Research Funds for National University, China University of Geosciences,
Wuhan (grant nos. CUG14065 and CUGL150829), and the Provincial Natural Science Foundation of Hubei
(grant no. 2015CFA065).
REFERENCES
1. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH. Big data: the next frontier for innovation,
competition, and productivity, The McKinsey Global Institute, 2011. Available from: http://www.mckinsey.com/
insights/business_technology/big_data_the_next_frontier_for_innovation [Accessed on January 2015].
2. Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Communications of the ACM 2008;
51(1):107–113.
3. Apache Hadoop. Available from: http://hadoop.apache.org/ [Accessed on January 2015].
4. Chen R, Weng X, He B, Yang M. Large graph processing in the cloud. Proceedings of the 2010 International
Conference on Management of Data – SIGMOD ’10, Indianapolis, IN, USA, 2010; 1123–1126.
5. Zhong E, Fan W, Wang J, Xiao L, Li Y. ComSoc: adaptive transfer of user behaviors over composite social network.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing,
China, 2012; 696–704.
6. Le Y, Liu J, Ergün F, Wang D. Online load balancing for mapreduce with skewed data input. In 2014 Proceedings
IEEE INFOCOM. IEEE: New York, NY, USA, 2014; 2004–2012.
7. Zhang L, Li Z, Wu C, Chen M. Online algorithms for uploading deferrable big data to the cloud. In 2014 Proceedings
IEEE INFOCOM. IEEE: New York, NY, USA, 2014; 2022–2030.
8. Palanisamy B, Singh A, Liu L, Jain B. Purlieus: locality-aware resource allocation for mapreduce in a cloud. In
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
ACM: New York, NY, USA, 2011; 58.
9. Zhu Y, Jiang Y, Wu W, Ding L, Teredesai A, Li D, Lee W. Minimizing makespan and total completion time in
mapreduce-like systems. In 2014 Proceedings IEEE INFOCOM. IEEE: New York, NY, USA, 2014; 2166–2174.
10. Yuan Y, Wang D, Liu J. Joint scheduling of mapreduce jobs with servers: performance bounds and experiments. In
2014 Proceedings IEEE INFOCOM. IEEE: New York, NY, USA, 2014; 2175–2183.
11. Okcan A, Riedewald M. Anti-combining for mapreduce. In Proceedings of the 2014 ACM SIGMOD International
Conference on Management of Data. ACM: New York, NY, USA, 2014; 839–850.
12. Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. Sharing across multiple mapreduce jobs. ACM Transactions
on Database Systems (TODS) 2014; 39(2):12.
13. Jiang W, Ravi VT, Agrawal G. A map-reduce system with an alternate API for multi-core environments. In Proceed-
ings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE Computer
Society: New York, NY, USA, 2010; 84–93.
14. Wang Y, Jiang W, Agrawal G. Scimate: a novel mapreduce-like framework for multiple scientific data formats. In
2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE: New York,
NY, USA, 2012; 443–450.
15. Hammoud M, Rehman MS, Sakr MF. Center-of-gravity reduce task scheduling to lower mapreduce network traffic.
In 2012 IEEE 5th International Conference on Cloud Computing (CLOUD). IEEE: New York, NY, USA, 2012;
49–58.
16. Zhao Y, Wu J. Dache: a data aware caching for big-data applications using the mapreduce framework. In 2013
Proceedings IEEE INFOCOM. IEEE: New York, NY, USA, 2013; 35–39.
17. Elghandour I, Aboulnaga A. Restore: reusing results of mapreduce jobs. Proceedings of the VLDB Endowment 2012;
5(6):586–597.
18. Zaharia M, Das T, Li H, Shenker S, Stoica I. Discretized streams: an efficient and fault-tolerant model for stream
processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing.
USENIX Association: Berkeley, CA, USA, 2012; 10–10.
19. Jahani E, Cafarella MJ, Ré C. Automatic optimization for MapReduce programs. Proceedings of the VLDB
Endowment, Vol. 4, Seattle, WA, USA, 2011; 385–396.
20. Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. Sharing across multiple MapReduce jobs. ACM Transactions
on Database Systems, Vol. 39, New York, NY, USA, 2014; 1–46.
21. Vernica R, Carey MJ, Li C. Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 ACM
SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 2010; 495–506.
22. Okcan A, Riedewald M. Processing theta-joins using MapReduce. Proceedings of the 2011 ACM SIGMOD Interna-
tional Conference on Management of Data, Athens, Greece, 2011; 949–960. DOI: 10.1145/1989323.1989423.
23. Metwally A, Faloutsos C. V-SMART-join: a scalable Mapreduce framework for all-pair similarity joins of multisets
and vectors. Proceedings of the VLDB Endowment, Vol. 5, Istanbul, Turkey, 2012; 704–715.
24. Peng D, Dabek F. Large-scale incremental processing using distributed transactions and notifications. In OSDI,
Vol. 10. USENIX: Vancouver, BC, Canada, 2010; 1–15.
25. Popa L, Budiu M, Yu Y, Isard M. Dryadinc: reusing work in large-scale computations. In USENIX Workshop on Hot
Topics in Cloud Computing. USENIX: San Diego, CA, USA, 2009.
26. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R. Incoop: Mapreduce for incremental computations. In
Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM: New York, NY, USA, 2011; 7.
27. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. In
Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing: Boston, MA, USA, 2010; 10–10.
28. The PostgreSQL Global Development Group. Available from: http://www.postgresql.org/ [Accessed on January
2015].