ABSTRACT: Several research works have focused on supporting index access in MapReduce
systems. These works have allowed users to significantly speed up selective
MapReduce jobs by orders of magnitude. However, all these proposals require
users to create indexes upfront, which might be a difficult task in certain
applications (such as in scientific and social applications) where workloads
are evolving or hard to predict. To overcome this problem, we propose LIAH
(Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive approach for
indexing at minimal costs for MapReduce systems. The main idea of LIAH is to
automatically and incrementally adapt to users' workloads by creating clustered
indexes on HDFS data blocks as a byproduct of executing MapReduce jobs. Besides
distributing indexing efforts over multiple computing nodes, LIAH also
parallelises indexing with both map task computation and disk I/O, all without
any additional data copy in main memory and with minimal synchronisation. The
beauty of LIAH is that it piggybacks index creation on map tasks, which read
the relevant data from disk into main memory anyway. Hence, LIAH does not
introduce any additional read I/O costs and exploits free CPU cycles.
As a result, and in contrast to existing adaptive indexing works, LIAH has a
very low (or even invisible) indexing overhead, usually paid only for the very first job.
Still, LIAH can quickly converge to a complete index, i.e. all HDFS data blocks
are indexed. In particular, LIAH can trade early job runtime improvements for
fast convergence to a complete index. We compare LIAH with HAIL, a state-of-the-art
indexing technique, as well as with standard Hadoop with respect to indexing
overhead and workload performance.
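The core mechanism described above, creating a clustered index on a data block only when a job reads that block anyway, can be illustrated with a small sketch. This is not LIAH code; the `Block` and `LazyStore` classes and their methods are hypothetical stand-ins for HDFS blocks and the per-block index logic:

```python
# Illustrative sketch (not LIAH code): lazy, piggybacked per-block indexing.
# A block is clustered (sorted by key) only when a selective job first reads
# it, so the sorting cost is hidden inside work the job performs anyway.
from bisect import bisect_left, bisect_right

class Block:
    def __init__(self, records):
        self.records = records          # list of (key, value) pairs
        self.indexed = False            # no index until the first read

    def scan(self, key):
        """Full scan; piggybacks clustered-index creation on the read."""
        result = [v for k, v in self.records if k == key]
        if not self.indexed:            # data is already in memory: sort now
            self.records.sort(key=lambda kv: kv[0])
            self.indexed = True
        return result

    def lookup(self, key):
        """Index access once the block is clustered on the key."""
        keys = [k for k, _ in self.records]
        lo, hi = bisect_left(keys, key), bisect_right(keys, key)
        return [v for _, v in self.records[lo:hi]]

class LazyStore:
    def __init__(self, blocks):
        self.blocks = blocks

    def query(self, key):
        out = []
        for b in self.blocks:           # indexed blocks skip the full scan
            out += b.lookup(key) if b.indexed else b.scan(key)
        return out
```

The first selective query pays (at most) the sort, which happens on data already resident in memory; every later query on an indexed block uses binary search, mirroring the incremental convergence to a complete index that the abstract describes.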
ABSTRACT: Yellow elephants are slow. A major reason is that they consume their inputs
entirely before responding to an elephant rider's orders. Some clever riders
have trained their yellow elephants to only consume parts of the inputs before
responding. However, the teaching time to make an elephant do that is high. So
high that the teaching lessons often do not pay off. We take a different
approach. We make elephants aggressive; only this will make them very fast. We
propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and
Hadoop MapReduce that dramatically improves runtimes of several classes of
MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create
different clustered indexes on each data block replica. An interesting feature
of HAIL is that we typically create a win-win situation: we improve both data
upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of
data upload, HAIL improves over HDFS by up to 60% with the default replication
factor of three. In terms of query execution, we demonstrate that HAIL runs up
to 68x faster than Hadoop. In our experiments, we use six clusters including
physical and EC2 clusters of up to 100 nodes. A series of scalability
experiments also demonstrates the superiority of HAIL.
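The idea of keeping a different clustered index on each block replica can also be sketched briefly. This is not HAIL code; `make_replicas` and `select` are hypothetical names, and the sketch ignores replica placement and failure handling, which a real HDFS deployment must address:

```python
# Illustrative sketch (not HAIL code): each block replica is clustered on a
# different attribute; a query is routed to the replica whose sort order
# matches its selection predicate, falling back to a scan otherwise.
from bisect import bisect_left, bisect_right

def make_replicas(rows, attributes):
    """One sorted copy ('replica') per attribute, built at upload time."""
    return {a: sorted(rows, key=lambda r: r[a]) for a in attributes}

def select(replicas, attribute, value):
    replica = replicas.get(attribute)
    if replica is None:                 # no matching index: full scan
        any_copy = next(iter(replicas.values()))
        return [r for r in any_copy if r[attribute] == value]
    keys = [r[attribute] for r in replica]
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return replica[lo:hi]
```

Since HDFS writes each replica anyway, sorting each copy by a different attribute at upload time adds index variety without extra storage, which is the win-win the abstract refers to.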
Proceedings of the 27th International Conference on Data Engineering (ICDE 2011), Hannover, Germany, April 11-16, 2011.
Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2011), Athens, Greece, June 12-16, 2011.
PVLDB 3: 518-529 (2010).
PVLDB 3: 460-471 (2010).