In the digital age, the volume of global data is growing rapidly, driven by the increased use of
social networking sites, online shopping, the Internet of Things, the Cloud, smartphones, and
other handheld devices. A large volume of digital content is generated continuously in the form
of photos, videos, tweets, emails, text, and documents, and it includes both structured and
unstructured data. Database Management Systems such as Oracle, DB2, and MS SQL Server were
previously used to store and process data, but these approaches now fail to handle data at this
scale. Traditional Relational Database Management Systems (RDBMSs) cannot handle large volumes
of data because they are designed for steady data retention and have inflexible schemas
(Bachhav et al., 2017; Nanda, 2015). They have also grown overly complex and are therefore
difficult to manage. Consequently, there is a strong focus on developing techniques that can
handle and process enormous amounts of data.
MapReduce has emerged as a prominent solution for processing this unprecedented growth of data
efficiently. It offers distributed processing of data on clusters of machines and can process
both structured and unstructured data. It has become a popular computing model for cloud
platforms and helps to process gigabytes or terabytes of data in parallel, achieving quicker
results. The current design of MapReduce does not itself consider virtualization; cloud
computing provides it with a virtualized environment in which multiple virtual machines (VMs)
share resources such as disks, networks, and main memory (Hwang et al., 2018; Tripathi et al.,
2018). Hence, running MapReduce on cloud platforms is becoming popular, as the Cloud provides a
reliable, available, and scalable environment for processing these large data queries.
Several techniques in a Cloud-based environment can further boost MapReduce query performance,
such as Indexing, Caching, and Joining, and researchers have already been working on them. The
present study focuses on caching because it has not been explored much in cloud systems.
Caching can boost MapReduce performance by reducing the cost of I/O operations, the server
load, the query response time, and CPU usage.
The present study attempts to enhance the performance of MapReduce by proposing an algorithm
named MapReduce with cache (MRC), based on a caching scheme. The main objective of the MRC
algorithm is to reduce the job execution time of MapReduce tasks by retrieving
the results from the cache memory. This reduces the number of disk I/O operations and, in turn,
the overall job execution time of the queries fired on the MapReduce system. It also increases
the throughput of the system and avoids the overhead of processing duplicate or similar
datasets.
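
The general idea of serving a repeated job from a cache can be illustrated with a small
single-process sketch. It is not the MRC algorithm itself, whose details are not given here. It
assumes a hypothetical in-memory dictionary as the cache, a hypothetical fingerprint() helper that
hashes the job name and input, and a toy WordCount with separate map and reduce phases; all names
are illustrative.

```python
import hashlib
from collections import Counter

# Hypothetical in-memory result cache: fingerprint of (job, input) -> result.
# Illustrative only; this is not the thesis's MRC implementation.
_result_cache = {}


def fingerprint(job_name, lines):
    """Hash the job name and input so identical requests share one key."""
    digest = hashlib.sha256(job_name.encode("utf-8"))
    for line in lines:
        digest.update(b"\n")
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def map_phase(lines):
    """Emit (word, 1) pairs, as a WordCount mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Sum the counts per word, as a WordCount reducer would."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_wordcount(lines, stats):
    """Return word counts, reusing a cached result when the input repeats."""
    key = fingerprint("wordcount", lines)
    if key in _result_cache:
        stats["hits"] += 1          # cache hit: skip the map and reduce phases
        return _result_cache[key]
    stats["misses"] += 1            # cache miss: compute and remember the result
    result = reduce_phase(map_phase(lines))
    _result_cache[key] = result
    return result
```

Calling run_wordcount twice on the same input runs the map and reduce phases only once; the
second call is answered from the cache, which is the effect that a caching scheme relies on to
cut repeated disk I/O and computation.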
To measure the performance of the proposed MRC system, Hadoop benchmarks are run on a
single-node Hadoop cluster (pseudo-distributed mode) as well as on a heterogeneous Hadoop
cluster (fully distributed mode) set up on Amazon Web Services (AWS).
To evaluate the performance of MRC, a comprehensive set of four experiments is conducted. The
dataset used in the experiments is Gutenberg. The WordCount workload of the HiBench Benchmark
Suite is used to analyze the behavior and performance improvement of the MRC system, and to
compare the non-MRC system with the proposed MRC system in terms of MapReduce job execution
time. The MRC (3-node cluster) system shows a performance enhancement of 48.01% and an average
change in job execution time of -51.99%. The MRC (5-node cluster) system shows a performance
enhancement of 48.86% and an average change in job execution time of -51.14%. The MRC (7-node
cluster) system shows a performance enhancement of 50.09% and an average change in job
execution time of -49.91%. The MRC (9-node cluster) system shows a performance enhancement of
51.38% and an average change in job execution time of -48.61%. These results show that the
algorithm proposed in the thesis is capable of reducing the average job execution time of
MapReduce tasks in a cloud-based environment, and that MRC offers a significant performance
improvement over the non-MRC system.
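
The sign convention behind a negative percentage in job execution time can be made explicit with
a short calculation. The sketch below assumes the usual definition of relative change, (new time
- baseline time) / baseline time, which may or may not be the exact formula used in the
experiments. The percent_change() helper and the timings are hypothetical and are not measured
results.

```python
def percent_change(baseline_seconds, new_seconds):
    """Relative change in execution time, in percent (negative means faster)."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (new_seconds - baseline_seconds) / baseline_seconds * 100.0


# Hypothetical timings, chosen only to illustrate the arithmetic.
baseline = 200.0    # seconds for a job without caching
with_cache = 100.0  # seconds for the same job served with caching
change = percent_change(baseline, with_cache)
print(f"change in job execution time: {change:.2f}%")
```

Under this convention, a job that takes half as long as the baseline has a change of -50%, so a
negative value denotes a shorter execution time.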