Ciritoglu et al. J Big Data (2019) 6:94
https://doi.org/10.1186/s40537-019-0256-6

RESEARCH

HaRD: a heterogeneity-aware replica deletion for HDFS

Hilmi Egemen Ciritoglu1*, John Murphy1 and Christina Thorpe2
*Correspondence: hilmi.egemen.ciritoglu@ucdconnect.ie

Abstract

The Hadoop distributed file system (HDFS) is responsible for storing very large data-sets reliably on clusters of commodity machines. The HDFS takes advantage of replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and higher disk usage. Recent studies propose different data replication management frameworks that alter the replication factor of files dynamically in response to the popularity of the data, keeping more replicas for in-demand data to enhance the overall performance of the system. When data gets less popular, these schemes reduce the replication factor, which changes the data distribution and leads to unbalanced data distribution. Such an unbalanced data distribution causes hot spots, low data locality and excessive network usage in the cluster. In this work, we first confirm that reducing the replication factor causes unbalanced data distribution when using Hadoop's default replica deletion scheme. Then, we show that even keeping a balanced data distribution using WBRD (data-distribution-aware replica deletion scheme) that we proposed in previous work performs sub-optimally on heterogeneous clusters. In order to overcome this issue, we propose a heterogeneity-aware replica deletion scheme (HaRD). HaRD considers the nodes' processing capabilities when deleting replicas; hence it stores more replicas on the more powerful nodes. We implemented HaRD on top of HDFS and conducted a performance evaluation on a 23-node dedicated heterogeneous cluster. Our results show that HaRD reduced execution time by up to 60%, and 17%, when compared to Hadoop and WBRD, respectively.

Keywords: Hadoop distributed file system (HDFS), Replication factor, Replica management framework, Software performance
Introduction
In recent years, the number of data sources has been increasing exponentially (e.g., IoT devices and social media applications), and data is incessantly produced every second. Thus, the volume of data is growing rapidly. Moreover, processing these ever-growing data-sets is of paramount importance for businesses as it helps to determine mission-critical objectives and discover opportunities. Consequently, processing large data-sets in order to extract meaningful information has become vital for business success and has created the demand for large-scale distributed data-intensive systems [1–3].

Apache Hadoop [4] is the de facto framework for large-scale distributed data-intensive computing that employs the MapReduce paradigm [5]. The Hadoop project is composed of 4 main components: (i) Hadoop distributed file system (HDFS) [6], (ii) resource
management framework (YARN) [7], (iii) execution engine, and (iv) Hadoop common.
The component-based approach of Hadoop helps to use the infrastructure more effectively by making use of more sophisticated components, e.g., Apache Spark [8] can be used instead of the MapReduce engine as it allows in-memory processing of the data.

HDFS proved to be a highly scalable, robust distributed storage system in the big data ecosystem. Therefore, companies trust HDFS to store their petabytes of data reliably on distributed nodes. HDFS not only serves as a reliable storage system but also provides high throughput for thousands of clients' concurrent queries. Data stored in HDFS can be retrieved by simple MapReduce jobs or complex graph processing jobs. Thus, the performance of HDFS is a critical matter for the whole big data ecosystem that stands on HDFS.

The key idea behind the robustness and efficiency of HDFS is the distributed placement of replicated data. Any file stored on HDFS is divided into fixed-size blocks (chunks). Each block is stored by replicating three times (by default). Moreover, each replica is distributed among different nodes in the cluster. This strategy advances system performance through effective load-balancing and provides fault-tolerance [9, 10]. Hence, different replica management frameworks have been proposed in the literature to improve the system performance by adapting the replication factor either proactively [11], or dynamically [12–14] depending on the popularity of data. Existing replica management frameworks increase the replication factor for the in-demand data once a particular data becomes popular. On the contrary, if the data loses its popularity over time, replica management frameworks adapt the replication factor back to the default level. Changing the replication factor also changes the block distribution on the cluster. The
influence of increasing the replication factor has been widely studied [9, 15, 16]. How-
ever, our previous work [17] was the first to identify that the current replica deletion
algorithm of Hadoop can be the cause of performance degradation. Consequently, we
proposed Workload-aware Balanced Replica Deletion (WBRD). WBRD achieves up to
48% improvement in job completion time compared to HDFS by balancing the num-
ber of stored blocks for a particular data-set rather than the disk usage in each node
[17]. WBRD’s even block distribution strategy does not take nodes’ processing capabili-
ties into consideration. However, current Hadoop clusters are highly scaled systems composed of numerous racks (sets of nodes), and generally each rack contains nodes with
the same characteristics. Racks can be upgraded or replaced separately. Hence, hetero-
geneity occurs in highly scaled Hadoop clusters [18]. WBRD is limited and results in
sub-optimal performance for the case of heterogeneous Hadoop clusters.
In this paper, we propose a novel cost-effective Heterogeneity-aware Replica Dele-
tion algorithm (HaRD) to cover the case of heterogeneous clusters. The primary goal of HaRD is to balance the ratio of block distribution to the computing capabilities for each node. Therefore, HaRD tries to enhance the system by placing more blocks on powerful machines. HaRD determines the computing capability of each node by calculating the number of containers it can run simultaneously. We implemented HaRD on top of HDFS and conducted a comprehensive set of experiments with representative benchmarks to evaluate the performance of HaRD against WBRD, as well as Hadoop. Experimental results on a heterogeneous 23-node Hadoop cluster show that HaRD speeds up system performance for single queries, reducing execution time by 40% and 8%
on average when compared to HDFS and WBRD, respectively. Moreover, improvements
become more compelling when the system is highly-utilised by a large number of con-
current requests, and increase to 60% and 17% compared to HDFS and WBRD, respec-
tively. e present study makes the following contributions:
1. We show the current replica deletion algorithms(both Hadoop and WBRD) do not
consider the processing capability of nodes, and thus heterogeneous clusters become
an edge case.
2. We extend the formal definition of the replica deletion problem to heterogeneous
clusters.
3. We propose a novel cost-effective Heterogeneity-aware Replica Deletion algorithm
(HaRD). In order to consider heterogeneity in the cluster, HaRD uses a container-
based approach to calculate the computing ratio of each machine.
4. We implement the proposed approach and evaluate both the performance improve-
ment and its overhead by conducting an extensive set of experiments on a heteroge-
neous 23-node Hadoop cluster.
The remainder of this paper is organised as follows: "Background" section provides background information. The related work is reviewed in "Related work" section. "Improving performance of replica management system through heterogeneity-aware replica deletion" section identifies the replica deletion problem, models the problem in the context of heterogeneous clusters, and details the novel HaRD algorithm. "Methods" section describes the experimental environment. "Results and discussion" section presents the results of our evaluation. Finally, "Conclusion" section concludes this paper.
Background
HDFS [6] is one of four core modules of the Hadoop Project [4] and is responsible for
storing data in a distributed fashion. The design principle behind HDFS is to develop a distributed mass-storage system as a main pillar for the Hadoop ecosystem [6]. Therefore, HDFS is highly scalable and capable of storing tremendous data-sets on a large
number of commodity machines. On such a scale, node failures are more than a theo-
retical probability and can occur for various reasons, e.g., hardware failure, power losses.
Hence, HDFS’s architecture strengthens fault-tolerance by benefiting from the technique
of replication and distributed storage of replicated data.
HDFS has a master-slave model and is composed of two primary daemons: NameNode
(NN) and DataNode (DN). NameNode, the master, is responsible for storing meta-data
and operations related to meta-data. NN keeps track of DNs by checking their heart-
beat messages periodically. When a DN fails or becomes unavailable, the NN marks it as
dead and coordinates data re-replication. Moreover, the NN manages data requests and
directs them to relevant DNs. DataNode, the slave, is responsible for storing blocks and
serving blocks for data requests. The number of DNs can easily scale to thousands and
can store tens of petabytes [6].
HDFS organises the stored files in a traditional, hierarchical file structure. The main directory of the system, Root, is at the top of the hierarchy. Therefore, any stored file on
HDFS is a part of Root’s branches. While uploading the data into a Hadoop cluster, first,
data is divided into fixed-size blocks. The fixed block size is 64 MB in Hadoop 1; however, it has been increased to 128 MB in Hadoop 2. The blocks are replicated three times by default and placed among the nodes in the cluster. In order to place replicas over different nodes, HDFS leverages a data pipeline rather than using one centralised node to transfer all of the replicas. In the pipeline, replicas are passed from one DN to another DN, as shown in Fig. 1. This decentralised strategy improves the efficiency of the replica transmission by sharing the network load among the nodes and reduces the chance of a possible network bottleneck.

Fig. 1 Data uploading to the cluster
Block placement is performed according to Hadoop's data placement policy [19]. The policy prefers to place the first replica on the DN that sent the request (the client); otherwise it is put on a DN that is on the same rack as the client node [19]. The second replica is placed on a node that is on a different rack from the first replica. The last (third) replica is placed on the same rack as the second replica but on a different node. The default placement policy is rack-aware as it tries to place replicas on at least two different racks in a multi-rack environment. There are two advantages of using rack-aware replica placement. Firstly, it enhances the fault-tolerance of the system. Thus, submitted jobs can be completed even if a rack fails during the execution. Secondly, the default placement algorithm benefits from having multiple racks and improves network usage by reducing off-rack traffic. The reason is that network transfers are significantly faster between nodes on the same rack than between nodes on different racks.
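The rack-aware choice just described can be mimicked with a small self-contained sketch; this is an illustration of the stated policy, not HDFS's actual implementation, and the node and rack names are invented:

```java
import java.util.*;

public class RackAwarePlacementSketch {
    // Pick three target nodes for a new block following the rule described above:
    // 1st replica on the writer's node, 2nd on a different rack, 3rd on the 2nd replica's rack.
    static List<String> placeReplicas(String writer, Map<String, String> rackOf) {
        List<String> targets = new ArrayList<>();
        targets.add(writer);                                   // first replica: local node

        String writerRack = rackOf.get(writer);
        String second = rackOf.keySet().stream()
                .filter(n -> !rackOf.get(n).equals(writerRack))
                .findFirst().orElseThrow();                    // second replica: remote rack
        targets.add(second);

        String third = rackOf.keySet().stream()
                .filter(n -> rackOf.get(n).equals(rackOf.get(second)) && !n.equals(second))
                .findFirst().orElseThrow();                    // third replica: same rack as second
        targets.add(third);
        return targets;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = Map.of(
                "r1-slave1", "rack1", "r1-slave2", "rack1",
                "r2-slave1", "rack2", "r2-slave2", "rack2");
        System.out.println(placeReplicas("r1-slave1", rackOf));
    }
}
```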
Data locality means processing data where that data is stored and is the fundamen-
tal idea behind data-intensive computing. In data-intensive computing, data-sets are
immense and thus, moving the data from one machine to another requires significant
network traffic. Conversely, the code that needs to be executed is much smaller than
the data itself. Therefore, moving the computation to the data is easier than the opposite. The strategy, "moving computation is cheaper than moving data" [4], is employed by HDFS to improve the efficiency of the system. When a job is submitted to the Resource Manager, the job is first divided into smaller tasks. Then, each task is associated with a
split (i.e., a specific portion of data). Most of the time, the splits are created based on the
HDFS block size. However, this is not always the case, as it completely depends on the
job’s getSplits method. Created splits are associated with map tasks. Hadoop prefers to
schedule split-associated map tasks on the node that keeps the split. Moreover, Hadoop
can even delay the start of tasks to reach better data locality [20]. Any map tasks that
cannot be scheduled to run in data-local mode require extra data transmission, increasing the network utilisation and, consequently, the total execution time. There are three different task execution types for data locality, as shown in Fig. 2 (a minimal classification sketch follows this list):

Local access: the same node stores the data and executes the task, e.g., R1-Slave 1 needs to process Block 3.

Same rack access: the processing node does not store the data split and requests it from another node located on the same rack in order to start the task, e.g., when R2-Slave 4 needs to process Block 2, it requests the block from R2-Slave 3 (which is in the same rack).

Off rack access: the processing node does not store the data split and requests it from a node located on another rack, e.g., when R1-Slave 2 needs to process Block 1, it requests the block from R2-Slave 4.
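To make the three categories concrete, here is a minimal classification sketch (node and rack names are invented; this is not Hadoop scheduler code):

```java
import java.util.Map;
import java.util.Set;

public class LocalityClassifier {
    enum Locality { LOCAL, SAME_RACK, OFF_RACK }

    // Classify a task given the node it runs on, the nodes holding replicas of its
    // input split, and a node-to-rack mapping.
    static Locality classify(String taskNode, Set<String> replicaNodes, Map<String, String> rackOf) {
        if (replicaNodes.contains(taskNode)) {
            return Locality.LOCAL;                       // data is on the executing node
        }
        String taskRack = rackOf.get(taskNode);
        boolean sameRack = replicaNodes.stream().anyMatch(n -> rackOf.get(n).equals(taskRack));
        return sameRack ? Locality.SAME_RACK : Locality.OFF_RACK;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = Map.of("r1-s1", "rack1", "r1-s2", "rack1", "r2-s4", "rack2");
        System.out.println(classify("r1-s2", Set.of("r2-s4"), rackOf)); // OFF_RACK
    }
}
```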
In the event that running a task in local access mode is not possible after multiple attempts, Hadoop's task scheduler gives priority to running the task on a node that is located on the same rack where the data is stored. The reasoning behind this is the same as the benefit of using multiple racks: network traffic between nodes on the same rack is significantly faster compared to nodes on different racks. Therefore, the scheduler exploits on-rack access rather than off-rack access to reduce slow inter-rack traffic. The worst-case scenario is the last option of task scheduling, allocating a node that is on a completely different rack and that requires off-rack access.

Fig. 2 Data locality types in Hadoop jobs
Related work
Even though large-scale Hadoop clusters can store a tremendous amount of data,
the demand for each stored data-set is not the same. Moreover, the data-set demand
changes over time. Hence, several studies have been conducted to understand the work-
load of Hadoop clusters [11, 21]. Ananthanarayanan et al. [11] underlined that the 12% most popular files are more in demand and received ten times more requests than the bottom third of the data (based on their analysis of logs from Bing production clusters). Another study [21] was conducted by analysing three different workload traces (i.e., OpenCloud, M45, WebMining) with various cluster sizes (from 9 nodes to 400 nodes). The authors [21] draw attention to load-balancing problems in the Hadoop cluster. Furthermore, the same study showed that despite the data distribution being well-balanced, the task distribution remains unbalanced. Consequently, an unbalanced cluster leads to poor data locality and performance degradation for the cluster.
Data replication is a prominent method to improve fault-tolerance and load-balanc-
ing [9, 15, 16]. However, increasing the number of copies stored in the cluster comes
with the price of extra storage. Considering the fact that not all data-sets have the same
demand, there is no one-size-fits-all solution for the replication factor. Therefore, various approaches have been proposed in the literature for adapting the replication factor according to the access pattern of data-sets [11–14, 22]. All of these strategies alter the replication factor either proactively [11] or dynamically [12–14, 22] based on the 'hotness' of the data. Wei et al. [12] propose a cost-effective dynamic replication management scheme for large-scale cloud storage systems (CDRM). With the intention of developing such a system, the authors built a model relating data availability and replication factor. Ananthanarayanan et al. [11] present Scarlett for adapting the replication factor by calculating a storage budget. Abad et al. [13] propose an adaptive data replication scheme for efficient cluster scheduling (DARE). DARE aims to identify the replication factor dynamically based on probabilistic sampling techniques. Cheng et al. [14] introduce an active/standby storage model and propose an elastic replication management system (ERMS) based on the model. ERMS places new replicas of in-demand data on active nodes in order to increase data availability. Lin et al. [22] approach the problem of adapting the replication factor from an energy-efficiency perspective and propose an energy-efficient adaptive file replication system (EAFR). EAFR places 'cold' files into
‘cold’ servers to reach energy efficiency.
In addition to adapting the replication factor, the placement of blocks is another factor in achieving good load-balancing. Eltabakh et al. [15] propose CoHadoop to co-locate related files based on information gathered from the application level. CoHadoop leverages data pre-partitioning against expensive shuffles. Xie et al. [23] and Lea et al. [24] propose placing blocks based on the computing ratio of each node. Liao et al. [25] describe a new approach to the block placement problem based on block access frequency. The authors investigated the history of block access sequences and used the k-partition algorithm to separate blocks into different groups according to their access load. Moreover, placement in hybrid storage systems [26, 27] and smart caching approaches for remote data accesses [28] have also been proposed in the literature. There is a
considerable amount of research about the block placement because the block place-
ment is decisive for the system performance. However, the connection between replica
management systems and the block placement is missing. For instance, which replica
should be deleted when the framework decides to reduce the replication factor? One
simple approach would be to use HDFS’s deletion algorithm.
But altering the replication factor changes the block density on each node. The framework that adapts the replication factor should also be aware of how the replicas
are distributed. Otherwise, the cluster ends up with unbalanced data distribution and
consequently unbalanced load distribution. In our previous work [17], we identified that
decreasing the replication factor leads to data unbalancing in HDFS and we proposed
Workload-aware Balanced Replica Deletion (WBRD) to balance the data-set distribution among the nodes. As a result, WBRD achieves up to 48% improvement in execution time on average. However, WBRD does not fully exploit different nodes' processing capabilities as it is designed for homogeneous clusters. One approach to determine nodes' processing capability is to measure computing ratios for each different application on each node [23, 24]. However, as the workload of the cluster is highly dynamic and contains multiple ad-hoc queries, we prefer a more flexible and cost-effective approach. Therefore,
instead of following previous approaches, the present work employs a novel cost-effec-
tive container-based approach.
Improving performance ofreplica management system
throughheterogeneity‑aware replica deletion
Replica management
Files stored on HDFS are replicated according to the cluster's default RF value configuration: dfs.replication. However, the replication factor (RF) is a file-level setting; different values can be set for different files. Moreover, the RF can be altered anytime after the creation of a file through the command: hadoop fs -setrep [-R] [-w] <numReplicas> <path>. Keeping more copies of files increases data availability and the chance of running tasks in data-local mode. Hence, replication provides better load-balancing, data locality and, ultimately, reduces jobs' execution time. Since a tremendous amount of data is stored on a data-intensive cluster, keeping a few extra copies for all of the data is clearly an extravagant solution.
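The same per-file adjustment can also be made programmatically through the public FileSystem API. The sketch below is not part of the paper's method; the file path and the chosen replication values are arbitrary assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/datasets/example/part-00000");  // hypothetical file
            fs.setReplication(file, (short) 5);  // keep more replicas while the file is 'hot'
            fs.setReplication(file, (short) 3);  // later, drop back; HDFS then deletes the excess replicas
        }
    }
}
```

Note that setReplication applies per file; the -R flag of the shell command handles recursion over a directory.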
Consequently, replication management frameworks were proposed to identify the
‘best’ RF for each file individually to achieve better performance while minimising the
extra storage overhead of an increased replication factor. In addition to identifying the RF, placing these replicas is another crucial problem. Even though all of the proposed replica management frameworks strive to enhance performance, adapting the RF changes the block distribution. If replica creation/deletion algorithms do not consider balancing the data-set, they end up with a skewed (unbalanced) distribution.

Typically, an even block distribution helps to utilise all nodes equally during the execution of tasks in the cluster and performs better than a skewed distribution in homogeneous clusters. In the case of skewed data distribution, some nodes keep more data than others. Consequently, these nodes can become 'hot spots' of the cluster, as data needs to be constantly transferred from them to other nodes during jobs' processing. Thus, data locality decreases, network utilisation burgeons and processing
takes more time due to the waiting time that occurs in data transmission.
In our previous work [17], we already showed that the current deletion algorithm in Hadoop does not perform well and consequently proposed a workload-aware balanced replica deletion algorithm. The deletion algorithm in Hadoop only concerns itself with balancing the overall cluster. More importantly, it does not update the state of utilisation metrics after each deletion. Unlike the HDFS default policy, WBRD aims to balance the data-set distribution rather than the overall cluster and achieves better performance. The
purpose of the present study is to highlight the limitation of WBRD for a heterogeneous
cluster and to propose Heterogeneity-aware Replica Deletion (HaRD) to address the
shortfall of WBRD.
Motivational example
In this section, we would like to illustrate the replica deletion problem empirically and discuss the limitation of WBRD to motivate the work. Figure 3 shows the evolution of block distribution on the 23-node Hadoop cluster while the RF is altered. More particularly, Fig. 3a, b report the block distribution using the default (HDFS) placement policy when the replication factor is increased and decreased, respectively. On the other hand, Fig. 3c reports the block distribution during replica deletion using WBRD.
When the replication factor is increased, as shown in Fig. 3a, the number of blocks that are stored on each node varies. However, the range is small and thus each node stores a similar number of blocks. Therefore, the standard deviation (SD) is not substantial and leads to a narrow inter-quartile range. As a result, the block distribution is well-balanced. On the contrary, Fig. 3b presents the distribution when the RF is decreased. After the first deletion, the block distribution range starts from zero, which means that at least one of the nodes does not participate in data storage. Moreover, the maximum value of the range remains largely the same and shows that at least one of the nodes keeps the majority of the data. Consequently, both the overall range and the inter-quartile range increase, leading to an imbalanced data distribution.
Figure 3c shows block distributions when the replication factor is reduced by using
WBRD. Unlike the HDFS deletion approach, WBRD tries to balance the overall data-set
during the replica deletion. erefore, we can see the inter-quartile range is small and does
not vary. Subsequently, WBRD achieves greater performance compared to default HDFS.
Albeit, WBRD is limited as it does not consider processing capabilities. If each node has
different processing capabilities (e.g., heterogeneous clusters), an even block distribu-
tion would not be an optimal case for efficiency. In such a scenario, powerful nodes finish
their tasks before slower nodes. As a result, either the task in the scheduler queue waits
for slower nodes until slower nodes become available for processing while powerful nodes
are idle or data needs to be transferred from slower nodes to powerful nodes in order to
0
20
40
60
80
3 3->4 4->5 5->6 6->7
Block Count
Replication Factor
(a) RF is increased step by
step on HDFS
0
20
40
60
80
7 7->6 6->5 5->4 4->3
Block Count
Replication Factor
(b) RF is decreased step by
step on HDFS
0
20
40
60
80
7 7->6 6->5 5->4 4->3
Block Count
Replication Factor
(c) RF is decreased step by
step on WBRD
Fig. 3 The block distribution when the replication factor (RF) is altered
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 9 of 21
Ciritogluetal. J Big Data (2019) 6:94
continue processing. In both cases, jobs are delayed (considering the fact that the job is only
completed when all sub-tasks are processed fully).
Hence, WBRD’s even distribution performs sub-optimally, and heterogeneous clusters
become an edge case. With the intention of improving the performance even further, our
hypothesis is that keeping more replicas on the nodes that have greater processing capability makes the workload distribution more balanced. We modelled the replica deletion problem in the context of heterogeneous clusters and propose a heterogeneity-aware replica deletion algorithm to achieve the modelled objectives.
Formal denition
We assume a cluster
C
is composed of a set of racks Racks. Each rack
RacknRacks
contains a set of machines
mM
such that Rack(m) =
Rackn
. A set of files
is stored
in the cluster
C
over the machines
mM
. Every file
FiF
is divided into fixed-
sized blocks
Bi
(128MB by default) as stated in Eq. (1) and stored in a hierarchical file
organisation.
Root path, Root, is the ‘highest’ level of the hierarchy and every file
FiF
is placed into
a certain path P which is a branch of Root. Each block
bij Bi
is replicated
RF iN
times (i.e., the replication factor of file
Fi
). We denote
bu
ij
the replica number u of the
block
bij
where
0<u=<RF i
. Each replica
bu
ij
is stored a particular machine
M
(
bu
ij)
.
We want to reduce the replication factor from
RFi
to
RF
i
such that
RFi
>RF
i
. We
introduce a binary variable
xu
ij
which takes the value 1 if the replica u of the block
bij
exists after reducing the replication factor, or 0 otherwise.
To strengthen fault-tolerance and promote data availability, replicas are distributed over
different racks according to a rack-awareness condition in the default block placement
policy as expressed in Eq. (3). e proposed algorithm continues the rack-awareness
block placement after a successful deletion for
FiP,j∈{1, ..., |Bi|}
:
We defined a variable for the partial block count
PBCmN
for each machine
mM
.
For a given path,
PBCm
is computed as a sum of all replicas for
FiP
that are stored
on the machine m as expressed in Eq.(4):
Each m has finite resources: (e.g., CPU and RAM) denoted by vCore(m), RAM(m)
respectively. e network connection between machines
mi
,m
j
M
in the same rack
(1)
|
Bi|=
File Size
Block Size
(2)
RF
i
u=1
xu
ij =RF
i,FiP,j∈{1, ..., |Bi
|}
(3)
Rack(M(bu
ij)) |u∈{1, ..., RF
i}and xu
ij =1
2
(4)
PBC
m=
i∈{1,...,|Fi|}
j∈{1,...,|Bi|}
u∈{1, ..., RFi}
M(bu
ij
)=m
x
u
ij
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Page 10 of 21
Ciritogluetal. J Big Data (2019) 6:94
(i.e.,
Rack
(
mi
)=
Rack
(
mj)
) is faster compare to machines are in the different rack
(i.e.,
Rack(mi) = Rack(mj)
). Any submitted job (a.k.a., task) runs on a container allo-
cated to particular node m with resource requirements
vCoreCont
and
RAMCont
. Note
that
vCoreCont
and
RAMCont
both are global properties of the scheduler [29]. A machine
mM
can run a number of containers
Km
concurrently.
Km
is determined by com-
position of available machines’ resources (i.e., vCore(m) and RAM(m)) and containers’
resource requirements (i.e.,
vCoreCont
and
RAMCont
) as expressed in Eq. (5):
While deleting replicas, our main objective is to minimise the maximum ratio of
PBCm
to
Km
for a given path
PP
as shown in Eq.(6). erefore, in every deletion iteration
our algorithm will select a replica that has the biggest division value. Hence, we expect to
see the number of replica stored on a node become dependent on
Km
after our replica-
tion deletion algorithm.
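To make the quantities in Eqs. (4)-(6) concrete, the following self-contained sketch (machine specifications, container requirements and block counts are invented for illustration) computes K_m per node and evaluates the objective of Eq. (6) for an evenly distributed path:

```java
import java.util.Map;

public class DeletionObjectiveSketch {
    // K_m: how many containers a machine can host, given global container requirements (Eq. 5).
    static int containers(int vcores, int ramGb, int vcoreCont, int ramContGb) {
        return Math.min(ramGb / ramContGb, vcores / vcoreCont);
    }

    public static void main(String[] args) {
        int vcoreCont = 1, ramContGb = 2;                     // assumed global container size

        // vCore(m), RAM(m) for three hypothetical node types, and PBC_m for one path.
        Map<String, int[]> specs = Map.of(
                "slave-1", new int[]{4, 8},
                "slave-2", new int[]{2, 4},
                "slaveXL", new int[]{12, 48});
        Map<String, Integer> pbc = Map.of("slave-1", 70, "slave-2", 70, "slaveXL", 70);

        double worst = 0;
        for (Map.Entry<String, int[]> e : specs.entrySet()) {
            int km = containers(e.getValue()[0], e.getValue()[1], vcoreCont, ramContGb);
            double ratio = (double) pbc.get(e.getKey()) / km;  // PBC_m / K_m (Eq. 6 term)
            worst = Math.max(worst, ratio);
            System.out.printf("%s: K_m=%d, PBC_m/K_m=%.1f%n", e.getKey(), km, ratio);
        }
        System.out.printf("objective value (max ratio) = %.1f%n", worst);
    }
}
```

With an even PBC_m, the weakest nodes dominate the maximum ratio, which is exactly the situation the heterogeneity-aware objective is designed to avoid.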
Heterogeneity‑aware replica deletion (HaRD)
To address the problem detailed in "Motivational example" section, we propose Heterogeneity-aware Replica Deletion (HaRD), as shown in Algorithm 1. The primary objective of HaRD is to attain a uniform distribution of the ratio of block distribution to computing resources for a given path while satisfying the stated constraints.

HaRD starts with the determination of the computing capability K_m of each machine. One existing approach to define the computing ratio is to measure the performance of each job on each node [23, 24]. However, measuring the performance of each job on different types of nodes is not efficient, as multiple users concurrently query the system with various ad-hoc queries in real Hadoop clusters. Therefore, we put forward a new approach that determines the computing capability (ratio) of each node from how many containers can run simultaneously on each node manager. Since the introduction of YARN (announced in Hadoop 2.0), submitted jobs in a Hadoop cluster run in containers allocated by the node manager. The main idea of YARN is to bring flexibility to the map/reduce task scheduling which was statically defined in Hadoop 1.0. The resource manager of YARN organises the allocation of containers by coordinating with node managers and schedules an application based on a node's resource usage. Consequently, computing ratios can be used as expressed in Eq. (5). Our YARN-based approach provides flexibility and extensibility since new processing features (e.g., the use of GPUs in the Hadoop cluster is becoming mainstream [30]) are implemented on top of YARN. We would like to underline that HaRD is based on YARN and therefore creates minimal overhead. We are aware that our present work depends on a correct YARN configuration. Such an assumption is not a strict constraint as YARN is generally configured during the deployment process [29, 31].
After K_m is determined, HaRD can be used to decrease the replication factor. When a user or a replica management framework alters RF_i to RF'_i such that RF_i > RF'_i for a particular path, HaRD is executed. HaRD starts with the calculation of PBC_m for each node in the cluster. For this, HaRD retrieves the replica list by iterating over every block of the files stored under the path P. If the environment is multi-rack, HaRD uses the removeNonRackAware method to remove the set of replicas that would violate rack-awareness constraints. Therefore, HaRD ensures that the state after the deletion satisfies the rack-awareness constraints. HaRD scans through every replica in the list R and finds the replica that is stored on the most-utilised node by comparing the ratio of PBC_m to K_m. Finally, it removes replicas from the list by using the method deleteBlockAtMachine. Deletion iterations run for all blocks of each file. If the data distribution is uniform at the beginning, HaRD starts deletions from the least powerful nodes (i.e., min(K_m)). After a few iterations, HaRD balances the nodes' ratio of PBC_m to K_m. Then, the rest of the deletion iterations continue by maintaining the ratio until the last data block is processed.
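Algorithm 1 itself is not reproduced here, but the greedy selection loop described above can be sketched as follows. This is an illustrative restatement under simplifying assumptions (plain in-memory data structures, rack-awareness filtering assumed done beforehand), not the actual HDFS patch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HaRDSketch {
    static class Replica {
        final String blockId;
        final String node;
        Replica(String blockId, String node) { this.blockId = blockId; this.node = node; }
        public String toString() { return blockId + "@" + node; }
    }

    // Greedy deletion: while a block still has more replicas than the target RF, delete the
    // replica held by the node with the largest PBC_m / K_m ratio. Replicas whose removal
    // would violate rack-awareness are assumed already filtered out (removeNonRackAware).
    static List<Replica> chooseDeletions(Map<String, List<Replica>> candidates,
                                         Map<String, Integer> pbc,   // PBC_m per node
                                         Map<String, Integer> km,    // K_m per node
                                         int targetRf) {
        List<Replica> toDelete = new ArrayList<>();
        for (List<Replica> blockReplicas : candidates.values()) {
            List<Replica> replicas = new ArrayList<>(blockReplicas);
            while (replicas.size() > targetRf) {
                Replica victim = Collections.max(replicas, Comparator.comparingDouble(
                        (Replica r) -> (double) pbc.get(r.node) / km.get(r.node)));
                replicas.remove(victim);
                pbc.merge(victim.node, -1, Integer::sum);  // keep utilisation state up to date
                toDelete.add(victim);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        Map<String, Integer> km = new HashMap<>();
        km.put("slow-node", 2);
        km.put("fast-node", 12);
        Map<String, Integer> pbc = new HashMap<>();
        pbc.put("slow-node", 4);
        pbc.put("fast-node", 4);
        Map<String, List<Replica>> candidates = new HashMap<>();
        candidates.put("blk_1", List.of(
                new Replica("blk_1", "slow-node"), new Replica("blk_1", "fast-node"),
                new Replica("blk_1", "slow-node"), new Replica("blk_1", "fast-node")));
        System.out.println(chooseDeletions(candidates, pbc, km, 2));  // deletes the 'slow-node' replicas
    }
}
```

Note that, mirroring the description above, the sketch decrements PBC_m after every deletion, unlike Hadoop's default scheme, which does not update its utilisation state between deletions.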
We would like to note that the value of K_m would be the same for every node if the cluster is homogeneous. In such a case, HaRD works in the same way as WBRD. For homogeneous clusters, we already found that WBRD achieves up to 48% improvement in execution time when compared to HDFS [17]. Therefore, our experiments in this paper do not consider the case of homogeneous clusters.
Implementation
Whenever the replication factor is altered for a path, the deletion request is made to the NN by calling the setReplication method in FSNamesystem.java with a path and a number of replicas. If the NN is not in safe mode, the requested operation is started by invoking the setReplication method in FSDirAttrOp.java. The method returns true if the operation completed successfully, or false if any problem occurs during the operation. If RF_i is less than RF'_i (i.e., the replication factor is increased) in the setReplication method, the order for allocating new replicas is placed into the priority queue of under-replicated blocks. On the contrary, if RF_i is bigger than RF'_i (i.e., the replication factor is decreased), then processOverReplicatedBlock is executed for the replica deletion. In order to select the next replica for deletion, the method collaborates with the chooseReplicasToDelete method in the block placement strategy class.
Hadoop supports the use of customised block placement policies by including a pluggable interface for the block placement [19]. For this reason, Hadoop contains fundamental methods for the placement in the abstract pluggable policy. We implemented HaRD by modifying the source code of HDFS on top of Hadoop (version 2.7.3). To implement the proposed deletion strategy, we first created a new block placement policy for HaRD by inheriting the existing block placement policy. Then, we overrode the chooseReplicasToDelete method in HaRD's placement strategy. Moreover, we also modified the block manager class to retrieve K_m and pass it to HaRD's placement policy.
We prefer to use the pluggable block placement policy; thus the placement policy can be changed by altering the dfs.block.replicator.classname configuration in hdfs-site.xml without changing the source code. We are aware that HaRD's implementation brings extra operations and can lead to overhead on the system. However, all of the newly implemented code is only executed when replica deletion occurs. Otherwise, it has no impact on the system performance. We evaluated the overhead using different data-set sizes as well as different numbers of nodes. The scalability of HaRD is discussed in "Overhead analysis" section.
Methods
Our experiments were conducted on the Performance Engineering Laboratory's research cluster (in University College Dublin). The cluster consists of 23 dedicated machines (1 master and 22 slaves). In this cluster, 20 slaves are identical. Therefore, we used cgroups, a Linux kernel feature, to limit computing resources and create heterogeneity in the cluster. We limited 10 nodes' CPU to 2 virtual cores and RAM to 4 GB. Overall, the cluster is composed of 3 different types of nodes. We detail the resource specification in Table 1 for each type of machine. All nodes are equipped with a 1 TB hard-drive and connected with a Gigabit Ethernet switch.

Table 1 Resource specifications for the cluster

Computer set   CPU type       Allocated VCore   Allocated RAM (GB)   Number of machines
Master         i7-6700        8                 32                   1
Slaves-1       i5-6500        4                 8                    10
Slaves-2       i5-6500        2                 4                    10
SlavesXL-1     Xeon E5-2430   12                48                   2
Total                         92                248                  23
The operating system selected was Lubuntu, which runs on Linux kernel 4.4.0-31-generic, and Java version 1.8.0_131 was installed. All tests were run on
Hadoop version 2.7.3 (native, WBRD and HaRD). Hive version 1.2.2 was selected for
concurrency tests on TPC-H. Ganglia [32] was used for monitoring the cluster.
Testing methodology
In this section, we detail the testing methodology for the experiments. Each test starts
with a new Hadoop cluster deployment. After the successful deployment, the bench-
mark’s data-set is uploaded to the cluster. It is important to note that both TestDFSIO
and Terasort benchmark suites can create their data-set with any given size. So, we
populated them only one time and we repeated tests by using the same data-set during
experiments of both TestDFSIO and Terasort. Hence, we ensured the input is the same
for all algorithms under-test. After the data loading phase, we increased the replication
factor from 3 to 10; consecutively, decreased to three unless otherwise stated. We would
like to note that even though we used 10 as a higher replication factor, any value above 3
creates the similar distribution. Every benchmark is run ten times for statistical signifi-
cance. We normalised results of execution time by using the average of these runs. e
plotted graphs presented indicate the range of results.
Benchmarks
Hadoop tasks can have different bottlenecks: excessive usage of disk I/O, network uti-
lisation, or CPU utilisation. To carry out a reliable test, we selected three well-known
benchmarks [33, 34]. Each benchmark has a different characteristic and focuses on stressing a different part of the system, and all come out-of-the-box with the Hadoop release: (i) TestDFSIO, (ii) Grep and (iii) Terasort. Hadoop clusters are large-scale distributed and multi-tenant systems. Therefore, numerous queries can be executed by many users at the same time. The usage of query-like frameworks is common in production, as highlighted by previous studies [35, 36]. Hence, in addition to the three popular benchmarks, we include a concurrency test on Hive [37] to represent production domains and test concurrency.
TestDFSIO
TestDFSIO is a well-known benchmark to measure the distributed I/O throughput of
HDFS. TestDFSIO stresses the disk performance and reports both read and write perfor-
mance of the system. The benchmark is a highly representative test for tasks that suffer
from I/O bottlenecks. During the experiment, we reported only the reading part of the
test, since we assessed the effect of data distribution on the reading performance. To be
fair in each test case, first, we populated a 100 GB data-set by using the benchmark’s
write functionality and conducted reading throughput tests by using the same data-set.
Grep
Grep is another standard Hadoop benchmark and evaluates the system performance by
searching and counting the number of times a given keyword appears in the text. The benchmark has a read-intensive characteristic and also stresses the CPU by sorting data. Grep runs two jobs sequentially: the first job calculates the number of times a matching string appears and passes it to the second job. The second job sorts the result of the first job according to the matching string's frequency. We run the Grep benchmark on the NOAA data-set from the National Centres for Environmental Information. The data-set was composed of 8 years of collected data (between 2008/05 and 2016/04, 47.3 GB). In our test, we were looking for data generated in January 2011 as a condition; thus, the keyword was chosen as '2010,1,'.
Terasort
Terasort is a well-known standard benchmark used to stress the whole system. The benchmark assesses the performance of Hadoop clusters by sorting data. The task is
not only read-intensive but also network-intensive as it requires expensive data shuffles
while passing data from map tasks to reduce tasks. We created a 50 GB data-set by using
TeraGen and used the same data-set to test each algorithm.
Concurrency test onTPC‑H
TPC-H is a decision support benchmark and is used for assessing the performance of relational databases [38]. The reason for including tests with TPC-H is that SQL-on-Hadoop systems (e.g., Hive [37], Impala [39], VectorH [40]) have brought the comfort and flexibility of SQL to Hadoop for querying 'big' data; thus, SQL-on-Hadoop has become mainstream in the industry for big data analytics [41]. To represent the SQL-on-Hadoop domain, we conducted the concurrency test on a 30 GB TPC-H data set. The concurrency test was performed on Hive version 1.2.2 with different numbers of users, {25, 50, 75, 100, 125}, and a 1-second interval between each query run, using Q6.
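A concurrency driver of this kind can be sketched as follows; it is not the harness used in the paper, and the HiveServer2 URL, credentials and the simplified stand-in query are assumptions (the Hive JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentHiveQueries {
    public static void main(String[] args) throws Exception {
        int users = 25;                                   // e.g., 25, 50, 75, 100 or 125
        String url = "jdbc:hive2://master:10000/tpch";    // hypothetical HiveServer2 endpoint
        String query = "SELECT count(*) FROM lineitem";   // stand-in for TPC-H Q6

        ExecutorService pool = Executors.newFixedThreadPool(users);
        for (int i = 0; i < users; i++) {
            pool.submit(() -> {
                long start = System.currentTimeMillis();
                try (Connection c = DriverManager.getConnection(url, "hive", "");
                     Statement s = c.createStatement()) {
                    s.executeQuery(query);
                    System.out.println("query finished in "
                            + (System.currentTimeMillis() - start) + " ms");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            Thread.sleep(1000);                           // 1-second interval between submissions
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```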
Results anddiscussion
We evaluated the performance of HaRD on the 23-node heterogeneous Hadoop cluster
through conducting the following experiments: (i) analysed the data distribution after
the replica deletion with three different data-set sizes: {64 GB, 128 GB, 256 GB}, (ii) exe-
cuted three different well-known benchmarks with various replication factors, (iii) per-
formed concurrency test on TPC-H with numerous concurrent users: {25, 50, 75, 100,
125}, and (iv) conducted in depth-analysis to understand improvements in the aspects of
data locality and network utilisation.
Block distribution
Performance of distributed systems is highly dependent on how the data is distributed among the nodes. Additionally, it is even more important if the distributed system is running many data-intensive jobs. Therefore, our first comparison is the block distribution. Figure 4 reports the comparison of block distribution after the RF is reduced from 10 to 3 using the different deletion algorithms. Each cross-mark in the figures indicates the number of blocks stored on a particular node. Since the cluster is composed of three different types of nodes (namely, Slaves-1, Slaves-2 and SlavesXL-1), we used marks with three different colours to indicate the processing capability of nodes; yellow, orange and red identify the marked node as belonging to Slaves-1, Slaves-2 and SlavesXL-1, respectively. It can be seen from the figure that Hadoop's deletion algorithm has a high SD for the number of blocks per node (91.3 for 64 GB, 182.1 for 128 GB and 368.6 for 256 GB) compared to the mean value (74.2 for 64 GB, 141.8 for 128 GB and 283.6 for 256 GB) and causes a skewed data distribution in every case, as it tries to balance the overall cluster's disk utilisation. Unlike Hadoop, WBRD tries to balance PBC_m for every node in the cluster. Thus, WBRD achieves an evenly balanced block distribution with a low SD (2.3 for 64 GB, 3.4 for 128 GB and 4.9 for 256 GB) compared to the mean value (70.9 for 64 GB, 141.8 for 128 GB and 283.6 for 256 GB); however, the even block distribution is not fair in terms of workload distribution since heterogeneity exists. Consequently, WBRD causes an unbalanced workload in the heterogeneous cluster. On the other hand, HaRD aims to balance the ratio of PBC_m to K_m for every node in the cluster. Thus, HaRD stores more blocks on more powerful computers; it creates three different groups in the block distribution, as the cluster is composed of three different machine types.

Fig. 4 Block distribution after the RF is decreased back to 3: (a) 64 GB, (b) 128 GB, (c) 256 GB
Average execution time
We conducted our experiment using three fundamental, well-known benchmarks and compared the performance of each algorithm according to their average execution time. Figure 5 presents the result of three different benchmarks (namely,
TestDFSIO, Terasort and Grep) with the RF of 3 for three different deletion algo-
rithms. During experiments, we observed that WBRD achieves notable improvements
against Hadoop. However, the system performance improves even further with HaRD due to the balanced workload distribution, as the computing capability of each node is taken into account during replica deletion. As a result, HaRD reduces average execution time by 7% for TestDFSIO, 6.1% for Terasort and 9.4% for Grep compared to WBRD. When we compared the performance of HaRD against HDFS, the improvements become remarkable: 60.3% for Grep, 22.8% for TestDFSIO and 25% for Terasort. Even though each test benchmark has a different bottleneck, HaRD consistently performs best in all tests.

While conducting our experiments, we also focused on the performance evaluation under a lower RF value due to the strong dependency between RF and job execution time. Having more replicas has a significant effect on performance when the system scales [9]. So, tests with an RF of 1 act as tests on bigger clusters. Figure 6 shows the result of the same test benchmarks, but this time with an RF of 1. Similar to the results of performance tests with an RF of 3, HaRD outperforms
both WBRD and Hadoop with a single replica. HaRD reaches better performance by
reducing job execution time: 18.1% for TestDFSIO, 9.2% for Terasort and 30.6% for Grep
compared to WBRD. Moreover, the performance gain of HaRD over default Hadoop, in
terms of execution time, is 55.7% for TestDFSIO, 41.4% for Terasort and 77.6% for Grep.
Fig. 5 Test benchmarks with RF: 3 (average execution time for (a) TestDFSIO, (b) Terasort, (c) Grep)

Fig. 6 Test benchmarks with RF: 1 (average execution time for (a) TestDFSIO, (b) Terasort, (c) Grep)
Testing withconcurrent users
Hadoop clusters are designed to serve as multi-tenant systems, and the cluster is queried
by numerous users at the same time. erefore, we include the concurrent user test by
using TPC-H Q6. Figure7 reports the average execution time for the concurrent users
test and demonstrates that Hadoop performs worst in every case and also shows HaRD
performs better than WBRD. Improvements of HaRD compared to WBRD in job’s exe-
cution time starts from 14% for 25 concurrent users and increases up to 17% as we stress
the system with more concurrent users. Furthermore, the enhancement in execution
time is around 60% for the all different number of users compared to default HDFS. We
want to note that there is no difference observed between HaRD and WBRD while test-
ing with single TPC-H queries since single queries do not fully stress the system; but,
both HaRD and WBRD still perform significantly better than Hadoop. is experiment
underlines that the performance improvements become more significant when the sys-
tem is fully utilised under the heavy load of concurrent users.
In‑depth analysis
We performed an in-depth analysis of the 125-user concurrency test to understand and observe improvements in data locality and network utilisation. The system was monitored using Ganglia for the network metrics and Hadoop's HistoryServer for data locality during the experiment. We measured the data locality as (|DataLocalTasks| * 100) / |AllTasks|. For 125 concurrent users, we found that approximately 85% of all jobs are data-local for HaRD; however, the data locality drops to 81% for WBRD and 73% for Hadoop jobs. So, WBRD transfers 376 more splits during the test compared to HaRD. Running more data-local jobs reduces the number of blocks that need to be transferred during execution and in turn leads to less network usage. We inspected the network bandwidth usage and plotted network graphs in Fig. 8. Figure 8a, b show the aggregated network utilisation, bytes in and bytes out, respectively. Average network bandwidth usage is 402 Mbps for HaRD, 432 Mbps for WBRD and 378 Mbps for Hadoop. We can see that the proposed algorithm, HaRD, reduces the average network utilisation by approximately 30 Mbps (6.9%) compared to WBRD. When we compared the three algorithms for the overall network
usage, Hadoop performs worst due to its high execution time. Interestingly, Hadoop has the lowest average network utilisation even though it has the lowest value for the percentage of data-local jobs.

To understand default Hadoop's network behaviour, we carried out further investigation by observing the network usage on each node individually. We found that data is in a continuous flow from 'hot spots' to other nodes due to the fact that the majority of blocks were located on the 'hot spots'. Thus, we can see higher values for data out on 'hot spots' and lower data-out values on other nodes. Conversely, the trend is reversed for data in; data in is low on 'hot spots' and high on the rest of the nodes. Therefore, the network utilisation is not well-balanced on the cluster. Moreover, Hadoop's network bandwidth usage is not stable due to the high SD of the block distribution; more importantly, it reaches higher peaks compared to WBRD and HaRD. On the contrary, we identified that the network bandwidth usage in both WBRD and HaRD is balanced on each node. The results show that the default Hadoop deletion algorithm causes an imbalance in the network bandwidth usage in the cluster.

Fig. 8 Network bandwidth usage during the test with 125 concurrent users: (a) bytes in, (b) bytes out
Overhead analysis
Overhead is another critical aspect of the feasibility of our approach. Therefore, we conducted an experiment to compare the performance gain against the implementation overhead. It is important to note that HaRD does not create any overhead in scenarios other than when replica deletion occurs (i.e., setReplication is triggered with RF'_i less than RF_i). Thus, we only measured the overhead during the replica deletion. In order to measure the overhead, we injected nanosecond-precision time counters at the beginning and the end of our implementation. Then the implementation overhead is calculated as the difference between the time counters. The implementation of HaRD uses WBRD's code as a base. WBRD already achieved insignificant overhead (less than 1.75% of the total time spent in reducing replication). During the development of HaRD, we improved the implementation of WBRD; consequently, the efficiency of WBRD increased. Decreasing the replication factor from 10 to 3 consumes 302 s for a 50 GB data set. In addition to Hadoop's 302-second overhead, HaRD introduces a 10.8-millisecond overhead. Figure 9a, b present HaRD's computational overhead
for various data-set sizes and numbers of nodes, respectively. In both experiments, we see that HaRD's implementation overhead is significantly less than the achieved gain. Moreover, the figures show a linear increase in time for the overhead. Thus, it indicates that HaRD is highly scalable.

Fig. 9 Overhead test: HaRD's computational overhead for various data-set sizes and numbers of nodes
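The nanosecond-precision instrumentation described in this section can be reproduced with a pattern like the one below; it is a generic sketch with a placeholder workload, not the actual counters injected into the HDFS source:

```java
public class OverheadTimer {
    // Placeholder for the replica-selection logic whose cost is being measured.
    static void chooseReplicasToDelete() {
        // ... deletion-selection work would happen here ...
    }

    public static void main(String[] args) {
        long start = System.nanoTime();              // counter at the beginning of the implementation
        chooseReplicasToDelete();
        long elapsedNs = System.nanoTime() - start;  // counter at the end; difference = overhead
        System.out.printf("implementation overhead: %.3f ms%n", elapsedNs / 1_000_000.0);
    }
}
```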
Conclusion
Current replica management systems adapt the replication factor for ‘hot’ data in order
to increase the data locality and achieve better performance, while keeping fewer copies
for less frequently accessed data. However, altering the replication factor changes the
data distribution. Our previous work identified that replica deletion in Hadoop can be
the cause of imbalance in the data distribution and proposed a deletion algorithm for
balancing data overall (WBRD). However, WBRD does not consider nodes’ comput-
ing capabilities and consequently, leads to sub-optimal performance in heterogeneous
clusters. In this paper, we extend the formal definition of the replica deletion problem
to heterogeneous clusters. Therefore, we propose a novel cost-effective Heterogeneity-aware Replica Deletion (HaRD) algorithm to use system resources more efficiently. We implemented HaRD on top of HDFS and carried out a comprehensive experimental study to investigate HaRD's improvements. Experiments show that HaRD improves the system performance by reducing the average execution time by 40% and 8% when compared to HDFS and WBRD, respectively. With more concurrent users, the system is fully utilised and the average gains increase up to 60% and 17% compared to HDFS and WBRD, respectively. During tests we observed that HaRD's implementation overhead is significantly less than the achieved gain, at only 10.8 ms. Moreover, experimental evaluations
showed that HaRD’s overhead scales linearly. As future work, we will develop an adap-
tive replication management framework using the proposed deletion algorithm.
Abbreviations
HDFS: Hadoop distributed file system; WBRD: workload-aware balanced replica deletion; HaRD: a heterogeneity-aware
replica deletion for HDFS; YARN: Yet another resource negotiator—resource management framework; NN: NameNode;
DN: DataNode; RF: replication factor; m: machine; PBC_m: partial block count on a machine; K_m: the number of
containers that can be executed simultaneously on a machine; Cont: container; F: file; B: block; TPC: The Transaction
Processing Performance Council; NOAA: National Centres for Environmental Information.
[Fig. 9 Overhead test: (a) computational overhead (s) vs. data-set size (50, 100, 200, 400 GB); (b) computational overhead (s) vs. number of nodes (10, 14, 18, 22)]
Acknowledgements
This work was supported, in part, by Science Foundation Ireland Grant 13/RC/2094 and co-funded under the European
Regional Development Fund through the Southern & Eastern Regional Operational Programme to Lero - the Irish Soft-
ware Research Centre (www.lero.ie).
Authors’ contributions
HEC conceived the research idea. Then, HEC implemented the present work, conducted extensive sets of experiments
and wrote the initial draft paper. Both JM and CT guided the research idea and reviewed the manuscript. All authors read
and approved the final manuscript.
Funding
This study was supported by Science Foundation Ireland (Grant No. 13/RC/2094).
Availability of data and materials
TPC-H Benchmark: http://www.tpc.org/information/benchmarks.asp. NOAA: https://www.ncdc.noaa.gov/cdo-web/datasets. TestDFSIO and Terasort: https://hadoop.apache.org/.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Performance Engineering Laboratory, School of Computer Science, University College Dublin, Dublin, Ireland. 2 Techno-
logical University Dublin, Dublin, Ireland.
Received: 18 July 2019 Accepted: 3 October 2019
References
1. Sakr S, Liu A, Batista DM, Alomari M. A survey of large scale data management approaches in cloud environments.
IEEE Commun Surv Tutor. 2011;13(3):311–36.
2. Sohangir S, Wang D, Pomeranets A, Khoshgoftaar TM. Big data: deep learning for financial sentiment analysis. J Big
Data. 2018;5(1):3.
3. Tsai CW, Lai CF, Chao HC, Vasilakos AV. Big data analytics: a survey. J Big data. 2015;2(1):21.
4. Apache Hadoop. http://hadoop.apache.org (2018). Accessed 27 June 2019.
5. Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.
6. Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: 2010 IEEE 26th symposium on
mass storage systems and technologies (MSST). New York: IEEE; 2010. p. 1–10.
7. Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, et al. Apache
hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing.
New York: ACM; 2013. p. 5.
8. Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, et al. Apache
spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65.
9. Ciritoglu HE, Batista de Almeida L, Cunha de Almeida E, Buda TS, Murphy J, Thorpe C. Investigation of replication
factor for performance enhancement in the hadoop distributed file system. In: Companion of the 2018 ACM/SPEC
international conference on performance engineering. New York: ACM; 2018. p. 135–40.
10. Mazumdar S, Seybold D, Kritikos K, Verginadis Y. A survey on data storage and placement methodologies for cloud-
big data ecosystem. J Big Data. 2019;6(1):15.
11. Ananthanarayanan G, Agarwal S, Kandula S, Greenberg A, Stoica I, Harlan D, Harris E. Scarlett: coping with skewed
content popularity in mapreduce clusters. In: Proceedings of the sixth conference on computer systems. New York:
ACM; 2011. p. 287–300.
12. Wei Q, Veeravalli B, Gong B, Zeng L, Feng D. Cdrm: a cost-effective dynamic replication management scheme for
cloud storage cluster. In: 2010 IEEE international conference on cluster computing (CLUSTER). New York: IEEE; 2010.
p. 188–96.
13. Abad CL, Lu Y, Campbell RH. Dare: adaptive data replication for efficient cluster scheduling. In: 2011 IEEE interna-
tional conference on cluster computing (CLUSTER). New York: IEEE; 2011. p. 159–68.
14. Cheng Z, Luan Z, Meng Y, Xu Y, Qian D, Roy A, Zhang N, Guan G. Erms: an elastic replication management system
for hdfs. In: 2012 IEEE international conference on cluster computing workshops (CLUSTER WORKSHOPS). New York:
IEEE; 2012. p. 32–40.
15. Eltabakh MY, Tian Y, Özcan F, Gemulla R, Krettek A, McPherson J. Cohadoop: flexible data placement and its exploita-
tion in hadoop. VLDB Endow. 2011;4(9):575–85.
16. Milani BA, Navimipour NJ. A systematic literature review of the data replication techniques in the cloud environ-
ments. Big Data Res. 2017;10:1–7.
17. Ciritoglu HE, Saber T, Buda TS, Murphy J, Thorpe C. Towards a better replica management for hadoop distributed file
system. In: 2018 IEEE international congress on Big Data (BigData Congress). New York: IEEE; 2018. p. 104–11.
18. Zaharia M, Konwinski A, Joseph AD, Katz RH, Stoica I. Improving MapReduce performance in heterogeneous envi-
ronments. In: Osdi, 2008; 8:7.
19. Pluggable interface for block placement of hadoop. https://issues.apache.org/jira/browse/HDFS-385 (2014).
Accessed 27 June 2019.
20. Zaharia M, Borthakur D, Sen Sarma J, Elmeleegy K, Shenker S, Stoica I. Delay scheduling: a simple technique for
achieving locality and fairness in cluster scheduling. In: EuroSys. New York: ACM; 2010. p. 265–78.
21. Ren K, Kwon Y, Balazinska M, Howe B. Hadoop’s adolescence: an analysis of hadoop usage in scientific workloads.
Proc VLDB Endow. 2013;6(10):853–64.
22. Lin Y, Shen H. Eafr: an energy-efficient adaptive file replication system in data-intensive clusters. IEEE Trans Parallel
Distrib Syst. 2017;28(4):1017–30.
23. Xie J, Yin S, Ruan X, Ding Z, Tian Y, Majors J, Manzanares A, Qin X. Improving mapreduce performance through data
placement in heterogeneous hadoop clusters. In: 2010 IEEE international symposium on parallel & distributed
processing, workshops and Phd forum (IPDPSW). New York: IEEE; 2010. p. 1–9.
24. Lee CW, Hsieh KY, Hsieh SY, Hsiao HC. A dynamic data placement strategy for hadoop in heterogeneous environ-
ments. Big Data Res. 2014;1:14–22.
25. Liao J, Cai Z, Trahay F, Peng X. Block placement in distributed file systems based on block access frequency. IEEE
Access. 2018;6:38411–20.
26. Islam NS, Lu X, Wasi-ur Rahman M, Shankar D, Panda DK. Triple-h: a hybrid approach to accelerate HDFS on HPC
clusters with heterogeneous storage architecture. In: 2015 15th IEEE/ACM international symposium on cluster,
cloud and grid computing. New York: IEEE; 2015. p. 101–10.
27. Krish K, Anwar A, Butt AR. hats: a heterogeneity-aware tiered storage for hadoop. In: 2014 14th IEEE/ACM interna-
tional symposium on cluster, cloud and grid computing. New York: IEEE; 2014. p. 502–11.
28. Jalaparti V, Douglas C, Ghosh M, Agrawal A, Floratou A, Kandula S, Menache I, Naor JS, Rao S. Netco: Cache and i/o
management for analytics over disaggregated stores. In: Proceedings of the ACM symposium on cloud computing.
New York: ACM; 2018. p. 186–98.
29. Yarn container configuration. https://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/ (2013).
Accessed 27 June 2019.
30. Using GPU On YARN. https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
(2018). Accessed 27 June 2019.
31. Yarn Tuning. https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html (2018).
Accessed 27 June 2019.
32. Massie ML, Chun BN, Culler DE. The ganglia distributed monitoring system: design, implementation, and experi-
ence. Parallel Comput. 2004;30(7):817–40.
33. Huang S, Huang J, Dai J, Xie T, Huang B. The hibench benchmark suite: characterization of the mapreduce-based
data analysis. In: 2010 IEEE 26th international conference on data engineering workshops (ICDEW 2010). New York:
IEEE; 2010. p. 41–51.
34. Ahmad F, Lee S, Thottethodi M, Vijaykumar T. Puma: Purdue MapReduce benchmarks suite 2012.
35. Chen Y, Alspaugh S, Katz R. Interactive analytical processing in big data systems: a cross-industry study of mapre-
duce workloads. Proc VLDB Endow. 2012;5(12):1802–13.
36. Costa E, Costa C, Santos MY. Evaluating partitioning and bucketing strategies for hive-based big data warehousing
systems. J Big Data. 2019;6(1):34.
37. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution
over a map-reduce framework. Proc VLDB Endow. 2009;2(2):1626–29.
38. Poess M, Floyd C. New TPC benchmarks for decision support and web commerce. ACM Sigmod Rec. 2000;29:64–71.
39. Bittorf M, Bobrovytsky T, Erickson C, Hecht MGD, Kuff M, Leblang DKA, Robinson N, Rus DRS, Wanderman JRDTS,
Yoder MM. Impala: a modern, open-source SQL engine for hadoop. In: Proceedings of the 7th biennial conference
on innovative data systems research; 2015.
40. Costea A, Ionescu A, Răducanu B, Switakowski M, Bârca C, Sompolski J, Łuszczak A, Szafrański M, De Nijs G, Boncz P.
Vectorh: taking SQL-on-hadoop to the next level. In: SIGMOD/PODS. New York: ACM; 2016. p. 1105–17.
41. Floratou A, Minhas UF, Özcan F. SQL-on-hadoop: full circle back to shared-nothing database architectures. Proc VLDB
Endow. 2014;7(12):1295–306.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.