In next-generation cloud computing clusters, the performance of data-intensive applications will be limited, among other factors, by disk data transfer rates. To mitigate this performance impact, cloud systems offering hierarchical storage architectures are becoming commonplace. The Hadoop Distributed File System (HDFS) offers a collection of storage policies that exploit different storage types such as RAM_DISK, SSD, HDD, and ARCHIVE. However, developing algorithms that leverage heterogeneous storage through efficient data placement has been challenging. This work presents an intelligent algorithm based on genetic programming that makes it possible to find the optimal mapping of input datasets to storage types on a Hadoop file system.
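As an illustration of the kind of search described above, the following is a minimal sketch, not the algorithm from this work. It uses a plain genetic algorithm over fixed-length assignments rather than genetic programming proper, and it runs outside Hadoop. The tier costs and capacities, the `(size_gb, reads)` dataset description, and the function names `cost` and `evolve` are all assumptions made for the example.

```python
import random

# Hypothetical storage tiers: (relative read cost per GB, capacity in GB).
# These numbers are illustrative assumptions, not measurements.
TIERS = {
    "RAM_DISK": (1.0, 8),
    "SSD": (4.0, 64),
    "HDD": (10.0, 512),
    "ARCHIVE": (40.0, 4096),
}
TIER_NAMES = list(TIERS)


def cost(mapping, datasets):
    """Total access cost of a mapping; over-capacity tiers are penalized."""
    used = {name: 0.0 for name in TIER_NAMES}
    total = 0.0
    for tier, (size_gb, reads) in zip(mapping, datasets):
        used[tier] += size_gb
        total += TIERS[tier][0] * size_gb * reads
    for name, gb in used.items():
        overflow = gb - TIERS[name][1]
        if overflow > 0:
            total += 1e6 * overflow  # large penalty makes infeasible mappings lose
    return total


def evolve(datasets, generations=200, pop_size=40, mutation_rate=0.1, seed=0):
    """Search for a low-cost dataset-to-tier mapping with a simple GA."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    n = len(datasets)
    population = [[rng.choice(TIER_NAMES) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: cost(m, datasets))
        survivors = population[: pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n) if n > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: occasionally reassign a dataset to a random tier.
            child = [rng.choice(TIER_NAMES) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = survivors + children
    return min(population, key=lambda m: cost(m, datasets))


if __name__ == "__main__":
    # Each dataset is (size in GB, expected number of reads); values are made up.
    example = [(4, 100), (50, 10), (300, 1)]
    best = evolve(example)
    print(best, cost(best, example))
```

In a real deployment the resulting mapping would still have to be translated into HDFS storage policies for the corresponding paths, and the cost model would need to come from measured transfer rates rather than the placeholder constants used here.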
As the access speed gap between DRAM and storage devices such as hard disk drives keeps widening, the I/O subsystem increasingly becomes the dominant system bottleneck. Meanwhile, the map-reduce parallel programming model has been actively studied for the last few years. In this paper, we show empirically that flash-memory-based SSDs (Solid State Drives) are very beneficial when used as local storage devices in I/O-intensive map-reduce applications (e.g. sorting) on the open-source Hadoop platform. Specifically, we show that the external sorting algorithm in Hadoop with SSD can outperform the same algorithm run on hard disk by more than 3. In addition, we demonstrate that power consumption can be drastically reduced when SSDs are used.
The Hadoop framework is a successful option for industry and academia to handle Big Data applications. Large input data sets are split into smaller chunks, distributed among the cluster nodes, and processed on the same nodes where they are stored. However, some data-intensive Hadoop applications write a very large volume of intermediate data to the local file system of each node. The large amount of data spilled to disk, combined with concurrent accesses from different tasks executing on the same node, overloads the input/output system. We propose to extend Shared Input Policy, a Hadoop job scheduler policy developed by our research group, by adding a RAMDISK for temporary storage of intermediate data. Shared Input Policy schedules batches of data-intensive jobs that share the same input data set. A RAMDISK has high throughput and low latency, which allows quick access to intermediate data and relieves the hard disk. Experimental results show that our approach outperforms the default Hadoop policy by 40% to 60% for data-intensive applications.