Thesis

Developing an efficient query processing technique in cloud based MapReduce system

Authors:
  • S S Jain Subodh PG College

Abstract

In the digital age, global data is growing at a very high rate due to the increased use of social networking sites, online shopping, the Internet of Things, the Cloud, smartphones, and other handheld devices. A large volume of digital content is generated continuously through photos, videos, tweets, emails, text, and documents, and it includes both structured and unstructured data. Database management systems such as Oracle, DB2, and MS SQL Server were previously used for the storage and processing of data, but these approaches now fail to handle such large volumes. Traditional Relational Database Management Systems (RDBMS) cannot handle large volumes of data because they are designed for steady data retention and have inflexible schemas (Bachhav et al., 2017; Nanda, 2015). They have also grown overly complex and are therefore difficult to manage. As a result, there has been a strong focus on developing techniques that can store and process enormous amounts of data. MapReduce has emerged as a prominent solution for processing this unprecedented growth of data efficiently. It offers distributed processing of data on clusters of machines and can process both structured and unstructured data. It has become a popular computing model for cloud platforms and helps process gigabytes or terabytes of data in parallel, achieving quicker results. Because the current design of MapReduce does not consider virtualization, cloud computing provides it with a virtualized environment in which multiple virtual machines (VMs) share resources such as disks, networks, and main memory (Hwang et al., 2018; Tripathi et al., 2018). Hence, running MapReduce on cloud computing is becoming popular, as it provides a reliable, available, and scalable environment for processing these large data queries.
Several techniques exist in the cloud-based environment that can further boost MapReduce query performance, such as indexing, caching, and joining. Researchers have already been working on these techniques. The present study, however, focuses on caching techniques, as they have not been much explored in cloud systems. Caching can boost MapReduce performance by reducing the cost of I/O operations, the server load, query response time, and CPU usage. The present study attempts to enhance the performance of MapReduce by proposing an algorithm named MapReduce with cache (MRC), based on a caching scheme. The main objective of the MRC algorithm is to reduce the job execution time of MapReduce tasks by retrieving the results from the cache memory. It reduces the number of disk I/O operations and hence the overall job execution time of the queries fired on the MapReduce system. It increases the throughput of the system and also avoids the overhead of processing duplicate or similar datasets. To measure the performance of the proposed MRC system, Hadoop benchmarks are run on a single-node Hadoop cluster (pseudo-distributed mode) as well as on a heterogeneous Hadoop cluster (fully distributed mode) formed in Amazon Web Services (AWS). To evaluate the performance of MRC, a comprehensive set of four experiments is conducted. The dataset used in the experiments is Gutenberg. The WordCount benchmark from the HiBench benchmark suite is used to analyze the behavior and performance improvement of the MRC system, comparing the non-MRC system with the proposed MRC system in terms of MapReduce job execution time. It is observed that the algorithm proposed in the thesis is capable of reducing the average job execution time of MapReduce tasks in a cloud-based environment.
The MRC (3-node cluster) system shows a performance enhancement of 48.01% and an average reduction in job execution time of -51.99%. The MRC (5-node cluster) system shows a performance enhancement of 48.86% and an average reduction in job execution time of -51.14%. The MRC (7-node cluster) system shows a performance enhancement of 50.09% and an average reduction in job execution time of -49.91%. The MRC (9-node cluster) system shows a performance enhancement of 51.38% and an average reduction in job execution time of -48.61%. The experimental results show that MRC achieves a significant performance improvement over the non-MRC system.
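
The caching idea described in the abstract (serving a repeated job from cache instead of recomputing it) can be illustrated with a minimal sketch. This is not the thesis's MRC algorithm, whose details are not given here; it is a generic in-memory result cache keyed by job name and a hash of the input. The names `ResultCache`, `word_count_job`, and `run` are hypothetical, and a plain Python function stands in for a real MapReduce WordCount job.

```python
import hashlib
from collections import Counter


def word_count_job(text):
    """Stand-in for a WordCount job (assumption: not a real Hadoop job)."""
    # "Map" each word to itself and "reduce" by counting occurrences.
    return dict(Counter(text.lower().split()))


class ResultCache:
    """Hypothetical in-memory cache of job results, keyed by job and input."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(job_name, data):
        # Identical inputs for the same job produce the same cache key.
        digest = hashlib.sha256(data.encode("utf-8")).hexdigest()
        return f"{job_name}:{digest}"

    def run(self, job_name, job_fn, data):
        k = self.key(job_name, data)
        if k in self._store:
            # Cache hit: skip the job and return the stored result.
            self.hits += 1
            return self._store[k]
        # Cache miss: run the job once and remember its result.
        self.misses += 1
        result = job_fn(data)
        self._store[k] = result
        return result


if __name__ == "__main__":
    cache = ResultCache()
    sample = "to be or not to be"
    first = cache.run("wordcount", word_count_job, sample)
    second = cache.run("wordcount", word_count_job, sample)  # served from cache
    print(first, cache.hits, cache.misses)
```

In this sketch the second call with the same input does no counting work at all, which is the kind of saving a result cache aims for when duplicate datasets are submitted; a cloud deployment would additionally need cache placement, eviction, and invalidation policies that are outside the scope of this illustration.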
