Big Data and Cloud Computing: Trends and Challenges
Samir A. El-Seoud
British University in Egypt-BUE, Cairo, Egypt
Hosam F. El-Sofany
Cairo Higher Institute for Engineering, Computer Science, and Management, Cairo
Mohamed Abdelfattah
British University in Egypt-BUE, Cairo, Egypt
Reham Mohamed
British University in Egypt-BUE, Cairo, Egypt
Abstract: Big data is currently one of the most critical emerging technologies. Big Data is used as a concept that refers to the inability of traditional data architectures to efficiently handle the new data sets. The 4V's of big data (volume, velocity, variety and veracity) make data management and analytics challenging for traditional data warehouses. It is important to think of big data and analytics together. Big data is the term used to describe the recent explosion of different types of data from disparate sources. Analytics is about examining data to derive interesting and relevant trends and patterns, which can be used to inform decisions, optimize processes, and even drive new business models. Cloud computing seems to be a perfect vehicle for hosting big data workloads. However, working on big data in the cloud brings its own challenge of reconciling two contradictory design principles. Cloud computing is based on the concepts of consolidation and resource pooling, but big data systems (such as Hadoop) are built on the shared-nothing principle, where each node is independent and self-sufficient. By integrating big data with cloud computing technologies, businesses and education institutes can have a better direction for the future. The capability to store large amounts of data in different forms and process it all at very high speeds will result in data that can guide businesses and education institutes in developing fast. Nevertheless, there is a large concern regarding privacy and security issues when moving to the cloud, which is one of the main reasons why businesses and educational institutes hesitate to move to the cloud. This paper introduces the characteristics, trends and challenges of big data. In addition, it investigates the benefits and the risks that may arise out of the integration between big data and cloud computing.
Keywords: Big data, cloud computing, analytics.
1 Introduction
Big Data is a data analysis methodology enabled by a new generation of technolo-
gies and architecture which support high-velocity data capture, storage, and analysis.
Data sources extend beyond the traditional corporate database to include email, mo-
bile device output, sensor-generated data, and social media output [1]. Data are no
longer restricted to structured database records but include unstructured data, i.e., data having no standard formatting [2].
Big Data requires huge amounts of storage space. While the price of storage continues to decline, the resources needed to leverage big data can still pose financial difficulties for small to medium-sized businesses. A typical big data storage and analysis infrastructure will be based on clustered network-attached storage (NAS). A clustered NAS infrastructure requires the configuration of several NAS "pods", with each NAS "pod" comprising several storage devices connected to an NAS device. The series of NAS devices are then interconnected to allow massive sharing and searching of data [3].
Cloud computing is an extremely successful paradigm of service oriented compu-
ting, and has revolutionized the way computing infrastructure is abstracted and used.
Three most popular cloud paradigms include: Infrastructure as a Service (IaaS), Plat-
form as a Service (PaaS), and Software as a Service (SaaS). The concept however can
also be extended to Database as a Service or Storage as a Service. Elasticity, pay-per-use, low upfront investment, low time to market, and transfer of risks are some of the major enabling features that make cloud computing a universal paradigm for deploying novel applications that were not economically feasible in a traditional enterprise infrastructure setting. Scalable database management systems, both for update-intensive application workloads and for decision support systems, are thus a critical part of the cloud infrastructure. Scalable and distributed data management has been the vision of the database research community for more than three decades. Much research has focused on designing scalable systems for both update-intensive workloads and ad-hoc analysis workloads. Initial designs include distributed databases for update-intensive workloads [5], and parallel database systems for analytical workloads [6]. Parallel databases grew beyond prototype systems to large commercial systems, but distributed database systems were not very successful and were never commercialized; instead, various ad-hoc approaches to scaling were used.
Changes in the data access patterns of applications and the need to scale out to
thousands of commodity machines led to the birth of a new class of systems referred
to as Key-Value stores which are now being widely adopted by various enterprises. In
the domain of data analysis, the MapReduce paradigm [7] and its open-source implementation Hadoop [9] have also seen widespread adoption in industry and academia
alike. Solutions have also been proposed to improve Hadoop based systems in terms
of usability and performance [10].
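The map-reduce division of labour can be sketched in a few lines of Python. The following single-process simulation of the map, shuffle and reduce phases is only illustrative (a framework such as Hadoop runs the same three phases distributed across many nodes); the documents and function names are invented for the example:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word, independently per document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: each key is summed independently, so keys can be spread over nodes."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data in the cloud", "big data analytics"]
result = reduce_phase(shuffle_phase(map_phase(docs)))
print(result)   # {'big': 2, 'data': 2, 'in': 1, 'the': 1, 'cloud': 1, 'analytics': 1}
```

Because no reducer needs data held under another reducer's key, the reduce phase scales horizontally, which is precisely the shared-nothing property discussed above.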
On the other hand, data warehouses have been used to manage large amounts of data. The warehouses and solutions built around them are unable to provide reasonable response times in handling expanding data volumes. One can either perform analytics on big volumes once in days, or perform transactions on small amounts
of data in seconds. With the new requirements, one needs to ensure real-time or near-real-time response for huge amounts of data. The 4V's of big data (volume, velocity, variety and veracity) make data management and analytics challenging for traditional data warehouses. Big data can be defined as data that exceeds the processing capacity of conventional database systems. It implies that the data count is too large, and/or data values change too fast, and/or the data do not follow the rules of conventional database management systems (e.g., consistency). One requires new expertise in the areas of data management and systems management: people who understand how to model the data and prepare them for analysis, and who understand the problem deeply enough to perform the analytics. As data are massive and/or fast changing, comparatively many more CPU and memory resources are needed, which are provided by distributed processors and storage in cloud settings [11].
Data storage using cloud computing is a viable option for small to medium-sized businesses considering the use of Big Data analytic techniques. Cloud computing is on-demand network access to computing resources, which are often provided by an outside entity and require little management effort by the business. A number of architectures and deployment models exist for cloud computing, and these architectures and models can be used with other technologies and design approaches. Owners of small to medium-sized businesses who are unable to afford adoption of clustered NAS technology can consider a number of cloud computing models to meet their big data needs. Small to medium-sized business owners need to choose the correct cloud computing model in order to remain both competitive and profitable [4].
The paper is organized as follows: in section two, we present the literature review done in the field. In section three, we present general concepts and a definition of Big Data. In section four, we present the movement of data management to cloud computing. In section five, we introduce the management of Big Data in cloud computing environments. In section six, we introduce some solutions and methodologies for data storage. In section seven, we present MapReduce and Hadoop, a free programming framework that supports the processing of large sets of data in a distributed computing environment. In section eight, we introduce the advantages of Big Data applications. In section nine, we introduce the advantages, risks and challenges of integrating big data with cloud computing. The paper is finally concluded in section ten.
2 Literature Review
Big Data and Cloud computing are major trends that are rapidly growing, and new challenges and solutions are being published every day.
In 2014, a publication [35] defined the best methods to deploy big data analytics applications to the cloud, as a series of stages:
1. Develop the business case, by focusing on how deep business value will be achieved by moving to the cloud and identifying the drivers to achieve this. Following this, the stakeholders' requirements are aligned with the case in order to gain their support. Finally, the case needs to be feasible, by identifying the key advantages that outweigh other available solutions.
2. Assess your application workload. Depending on the functional requirements set and the business case, the cloud service should have the ability to support the workload, and to rapidly optimize as new workloads come online.
3. Develop a technical approach to the big data platform, by focusing on both the topologies and the data platforms.
4. Address governance, security, privacy, risk and accountability requirements. The big data platform should adopt tools to maintain the security of the data and architecture while operating under the policies provided by the organization.
5. Deploy, integrate, and operationalize the infrastructure. Here the cloud platform has been set up, and the data has been provided and integrated. The platform is then deployed, and the organization starts taking advantage of it in its normal day-to-day tasks.
Another paper, published in 2014 [33], addressed the integration of big data and cloud computing technologies. The integration of big data and cloud computing has long-term benefits for both insights and performance. The large amounts of data collected need to be analyzed, otherwise the data are useless; cloud services can handle these extensive amounts of data with rapid response times and real-time processing. There are currently a few integrated cloud environments for big data analytics. ConPaaS is an environment developed by Vrije Universiteit Amsterdam, the University of Rennes 1, Zuse-Institut Berlin and XLAB to simplify the process of creating scalable cloud applications without the need to consider the complexity of these applications. The environment also provides a handful of resources for hosting web applications written in PHP or Java, as well as managing different database management systems, including SQL or NoSQL. Another environment is MapReduce, provided by Apache, which aids parallel computing by enabling distributed file storage and processing power. Task farming is another approach, used mainly for batch processing, which allows the execution of a large number of unrelated tasks simultaneously. The benefits of cloud computing include parallel computing, scalability, elasticity, and low cost. In terms of security and privacy, data is secured using advanced encryption techniques, but this causes high processing overheads, hence other solutions are available. In regard to performance, some challenges do arise, including data transfer limitations, data retention, isolation management, and disaster recovery.
In 2015, a paper [34] discussed the efficiency of big data and cloud computing and why the two technologies complement each other. Big data and cloud computing complement each other and are both among the fastest growing technologies emerging today. The cloud provides large computing power by aggregating resources together and offering a single system view to manage these resources and applications, so why should big data be placed on the cloud? The following reasons provide a sufficient answer: cost reduction, where organizations can make use of the pay-per-use model instead of making a major investment to set up servers and clusters to manage big data, which can become obsolete and require upgrades; reduced overheads, by acquiring any new components needed automatically; rapid provisioning/time to market, where organizations can easily change the scale of the environments based on the processing requirements; flexibility/scalability, where environments can be set up at any time, at any scale, in just a few minutes.
iJIM Vol. 11, No. 2, 2017
3 Big Data: General Concept and Definition
Big Data is used as a concept that refers to the inability of traditional data architectures to efficiently handle the new data sets. Characteristics that force a new architecture to achieve efficiencies are the data-at-rest characteristics of volume and variety of data from multiple domains or types, and the data-in-motion characteristics of velocity, or rate of flow, and variability (principally referring to a change in velocity). Each of these results in different architectures or different data lifecycle process orderings to achieve needed efficiencies. A number of other terms (often starting with the letter 'V') are also used, but a number of these refer to the analytics and not big data architectures, as shown in Figure 1 [12].
Fig. 1. Some Vs of Big Data.
The new big data paradigm occurs when the scale of the data at rest or in motion forces the management of the data to be a significant driver in the design of the system architecture. Fundamentally, the big data paradigm represents a shift in data system architectures from monolithic systems with vertical scaling (faster processors or disks) to a horizontally scaled system that integrates a loosely coupled set of resources. This shift occurred some 20 years ago in the simulation community, when scientific simulations began using massively parallel processing (MPP) systems. Through different combinations of splitting the code and data across independent processors, computational scientists were able to greatly extend their simulation capabilities. This of course introduced a number of complications in such areas as message passing, data movement, latency in the consistency across resources, load balancing, and system inefficiencies while waiting on other resources to complete their tasks. In the same way, the big data paradigm represents this same shift, again using different mechanisms to distribute code and data across loosely coupled resources in order to provide the scaling in data handling that is needed to match the scaling in the data.
The purpose of storing and retrieving large amounts of data is to perform analysis
that produces additional knowledge about the data. In the past, the analysis was
generally accomplished on a random sample of the data.
PaperBig Data and Cloud Computing: Trends and Challenges
3.1 Big Data Engineering
Big Data Engineering comprises the storage and data manipulation technologies that leverage a collection of horizontally coupled resources to achieve nearly linear scalability in performance.
New engineering techniques in the data layer have been driven by the growing prominence of data types that cannot be handled efficiently in a traditional relational model. The need for scalable access to structured and unstructured data has led to software built on name-value/key-value pairs or columnar (big table), document-oriented, and graph (including triple-store) paradigms.
3.2 Non-Relational Model
Non-relational models are logical data models, such as document, graph, key-value and others, that are used to provide more efficient storage of, and access to, non-tabular data sets.
3.3 NoSQL
NoSQL (alternately called "no SQL" or "not only SQL") refers to data stores and interfaces that are not tied to strict relational approaches.
3.4 Big Data Models
Big Data Models refers to the logical data models (relational and non-relational) and processing/computation models (batch, streaming, and transactional) for the storage and manipulation of data across horizontally scaled resources.
3.5 Schema-on-read
With schema-on-read, big data are often stored in a raw form based on how they were produced, and the schema needed for organizing (and often cleansing) the data is discovered and applied as the data are queried. This is critical since, in order for many analytics to run efficiently, the data must be structured to support the specific algorithms or processing frameworks involved.
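The contrast with schema-on-write can be made concrete with a short, hypothetical Python sketch: the records below are persisted exactly as produced, and parsing, casting and cleansing are applied only inside the query:

```python
import json

# Data stored raw, exactly as produced -- no schema imposed at write time.
raw_store = [
    '{"ts": "2017-01-01", "temp_c": "21.5", "site": "A"}',
    '{"ts": "2017-01-02", "temp_c": "bad", "site": "A"}',   # dirty record kept as-is
]

def query_mean_temp(raw_lines):
    """Schema-on-read: parse, cast and cleanse only when the data are queried."""
    temps = []
    for line in raw_lines:
        rec = json.loads(line)
        try:
            temps.append(float(rec["temp_c"]))   # the schema is applied here
        except ValueError:
            continue                             # cleansing also happens at read time
    return sum(temps) / len(temps)

print(query_mean_temp(raw_store))   # 21.5
```

Storing raw data defers the cost of schema design but, as noted above, analytics that need efficient access must still restructure the data for their specific algorithms.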
3.6 Big Data Analytics
Big Data Analytics is rapidly evolving, both in terms of functionality and of the underlying programming model. Such analytical functions support the integration of results derived in parallel across distributed pieces of one or more data sources.
3.7 Big Data Paradigm
The big data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive data sets.
With the new Big Data Paradigm, analytical functions can be executed against the
entire data set or even in real-time on a continuous stream of data. Analysis may even
integrate multiple data sources from different organizations. For example, consider the question: "What is the correlation between insect-borne diseases, temperature, precipitation, and changes in foliage?" To answer this question, an analysis would need to integrate data about the incidence and location of diseases, weather data, and aerial photography.
While we certainly expect a continued evolution in the methods to achieve efficient scalability across resources, this paradigm shift (in analogy to the prior shift in the simulation community) is a one-time occurrence, at least until a new paradigm shift occurs beyond this crowdsourcing of processing or data systems across multiple horizontally coupled resources.
The Big Data paradigm has other implications from these technical innovations.
The changes are not only in the logical data storage, but in the parallel distribution
of data and code in the physical file system and direct queries against this storage.
The shift in thinking causes changes in the traditional data lifecycle. One description of the end-to-end data lifecycle categorizes the steps as collection, preparation, analysis and action. Different big data use cases can be characterized in terms of the data set characteristics at rest or in motion, and in terms of the time window for the end-to-end data lifecycle. Data set characteristics change the data lifecycle processes in different ways, for example in the point of the lifecycle at which the data are placed in persistent storage. In a traditional relational model, the data are stored after preparation (for example, after the extract-transform-load and cleansing processes). In a high-velocity use case, the data are prepared and analysed for alerting, and only then are the data (or aggregates of the data) given persistent storage. In a volume use case, the data are often stored in the raw state in which they were produced, before the application of the preparation processes to cleanse and organize the data. The consequence of persisting data in its raw state is that a schema or model for the data is only applied when the data are retrieved, known as schema-on-read.
A third consequence of big data engineering is often referred to as moving the processing to the data, not the data to the processing. The implication is that the data are too extensive to be queried and transmitted into another resource for analysis, so the analysis program is instead distributed to the data-holding resources, with only the results being aggregated on a different resource. Since I/O bandwidth is frequently the limiting resource in moving data, another approach is to embed query/filter programs within the physical storage medium.
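A minimal, invented sketch of this principle: each list below stands in for a node-local partition, the filter function is shipped to the data, and only three small counts (rather than 3,000 records) cross the simulated network:

```python
# Each "node" holds a local partition of the data set (hypothetical readings).
partitions = [
    list(range(0, 1000)),
    list(range(1000, 2000)),
    list(range(2000, 3000)),
]

def local_filter_count(partition, threshold):
    """Runs *on the node holding the data*; only a small result leaves the node."""
    return sum(1 for x in partition if x > threshold)

# Move the processing to the data: ship the function, aggregate tiny results.
results = [local_filter_count(p, 2500) for p in partitions]
total = sum(results)
print(total)   # 499 readings above the threshold, found without moving any raw data
```

The network carries one integer per node instead of every record, which is the bandwidth saving the paragraph above describes.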
At its heart, Big Data refers to the extension of data repositories and processing across horizontally scaled resources, much in the same way the compute-intensive simulation community embraced massively parallel processing two decades ago. In the past, classic parallel computing applications utilized a rich set of communication and synchronization constructs and created diverse communication topologies. In contrast, today, with data sets growing into the Petabyte and Exabyte scales, distributed processing frameworks offering patterns such as map-reduce offer a reliable, high-level, commercially viable compute model based on commodity computing resources, dynamic resource scheduling, and synchronization techniques [12].
Definition: Big Data is a data set (or sets) with characteristics (e.g. volume, velocity, variety, variability, veracity, etc.) that, for a particular problem domain at a given point in time, cannot be efficiently processed using current/existing/established/traditional technologies and techniques in order to extract value.
The above definition distinguishes Big Data from business intelligence (BI) and traditional transactional processing, while alluding to a broad spectrum of applications that includes them. The ultimate goal of processing Big Data is to derive differentiated value that can be trusted (because the underlying data can be trusted). This is done through the application of advanced analytics against the complete corpus of data, regardless of scale. Parsing this goal helps frame the value discussion for Big Data use cases:
1. Deep data: Analyzing all relevant information, rather than just samples or subsets. It's also about unifying all decision-support time-horizons (past, present, and future) through statistically derived insights into deep data sets in all those dimensions.
2. Trustworthy data: Deriving valid insights either from a single-version-of-truth consolidation and cleansing of deep data, or from statistical models that sift haystacks of dirty data to find the needles of valid insight.
3. Advanced analytics: Faster insights through a variety of analytic and mining techniques for finding data patterns, such as "long tail" analyses, micro-segmentations, and others, that are not feasible if you're constrained to smaller volumes, slower velocities, narrower varieties, and undetermined veracities.
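As a toy illustration of one such technique (the purchase log below is fabricated and not from the paper): a "long tail" analysis asks whether the many individually rare items are collectively significant, which can only be answered against the full data set rather than a small sample:

```python
from collections import Counter

# Fabricated purchase log: two popular items plus sixty niche items.
purchases = ["hit1"] * 50 + ["hit2"] * 30 + [f"niche{i}" for i in range(60)]
ranked = Counter(purchases).most_common()

head = sum(count for _, count in ranked[:2])   # the two best sellers
tail = sum(count for _, count in ranked[2:])   # everything else, one sale each
print(head, tail)   # 80 60: each tail item is tiny, but together they matter
```

A small sample would likely miss most niche items entirely, which is why such analyses demand the volume and variety described above.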
A difficult question is what makes "Big Data" big; that is, how large does a data set have to be in order to be called big data? The answer is an unsatisfying "it depends". Part of the issue is that "big" is a relative term, and with the growing density of computational and storage capabilities (e.g. more power in smaller, more efficient form factors), what is considered big today will likely not be considered big tomorrow. Data are considered "big" if the use of the new scalable architectures provides improved business efficiency over traditional architectures; in other words, if the functionality cannot be achieved on something like a traditional relational database platform. Big data essentially rests on the self-referencing viewpoint that data are big because they require scalable systems to handle them, and architectures with better scaling have come about because of the need to handle big data [12].
4 Moving Data Management to Cloud
A data management system covers various stages of the data lifecycle, such as data ingestion, extract-transform-load (ETL), data processing, data archival, and deletion. Before moving one or more stages of the data lifecycle to the cloud, one has to consider the following factors:
1. Availability guarantees: Each cloud computing provider can ensure only a certain level of availability. Transactional data processing requires quick real-time answers, whereas for data warehouses long-running queries are used to generate reports. Hence, an enterprise may not want to put its transactional data in the cloud, but may be ready to put its analytics infrastructure there.
2. Reliability of cloud services: Before offloading data management to the cloud, enterprises want to ensure that the cloud provides the required level of reliability for the data services. By creating multiple copies of application components, the cloud can deliver the service with the required reliability.
3. Security: Data that is bound by strict privacy regulations, such as medical infor-
mation covered by the Health Insurance Portability and Accountability Act
(HIPAA), will require that users log in to be routed to their secure database server.
4. Maintainability: Database administration is a highly skilled activity which involves
deciding how data should be organized, which indices and views should be main-
tained, etc. One needs to carefully evaluate whether all these maintenance opera-
tions can be performed over the cloud data.
The cloud has given enterprises the opportunity to fundamentally shift the way data is created, processed and shared. This approach has been shown to be superior in sustaining the performance and growth requirements of analytical applications and, combined with cloud computing, offers significant advantages [13].
5 Manage Big Data in Cloud Computing Environments
Cloud Computing is an environment based on using and providing services. There
are different categories in which the service-oriented systems can be clustered. One of
the most used criteria to group these systems is the abstraction level that is offered to
the system user. In this way, three different levels are often distinguished: Infrastruc-
ture as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service
(SaaS) as shown in Figure 2, [14].
Fig. 2. Illustration of the layers for the Service-Oriented Computing
Cloud Computing offers scalability with respect to the use of resources, low ad-
ministration effort, flexibility in the pricing model and mobility for the software user.
Under these assumptions, it is obvious that the Cloud Computing paradigm benefits
large projects, such as the ones related with Big Data and BI [15].
In particular, a common Big Data analytics framework is depicted in Figure 3. Fo-
cusing on the structure of the data management sector we may define, as the most
suitable management organization architecture, one based on a four-layer architecture,
which includes the following components [16]:
1. A file system for the storage of Big Data, i.e., a large collection of archives of great size. This layer is implemented at the IaaS level, as it defines the basic architectural organization for the remaining tiers.
2. A DBMS for organizing the data and accessing them in an efficient way. It can be viewed as sitting between the IaaS and PaaS levels, as it shares common characteristics of both schemes. Developers use it to access the data, but its implementation lies at the hardware level. Indeed, a PaaS acts as an interface that offers its functionality at the upper side and, at the bottom side, holds the implementation for a particular IaaS. This feature allows applications to be deployed on different IaaS offerings without rewriting them.
3. An execution tool to distribute the computational load among the computers of the cloud. This layer is clearly related to PaaS, as it acts as a kind of 'software API' for the codification of Big Data and BI applications.
4. A query system for the knowledge and information extraction required by the sys-
tem’s users, which is in between the PaaS and SaaS layers.
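The four layers can be caricatured in a few lines of Python. This is purely a conceptual sketch (the class names, table and data are ours, not from any real system): a user-facing query enters at layer 4 and is answered by the execution, DBMS and file-system layers beneath it:

```python
class FileSystemLayer:                       # layer 1: raw block storage (IaaS)
    def __init__(self):
        self.blocks = {"users.part0": [("ann", 34), ("bob", 51)],
                       "users.part1": [("eve", 29)]}
    def read(self, block):
        return self.blocks[block]

class DBMSLayer:                             # layer 2: organized access (IaaS/PaaS)
    def __init__(self, fs):
        self.fs = fs
    def scan(self, table):
        for block in sorted(self.fs.blocks):
            if block.startswith(table):
                yield from self.fs.read(block)

class ExecutionLayer:                        # layer 3: distributes the load (PaaS)
    def __init__(self, dbms):
        self.dbms = dbms
    def run(self, table, predicate):
        return [row for row in self.dbms.scan(table) if predicate(row)]

class QueryLayer:                            # layer 4: user-facing queries (PaaS/SaaS)
    def __init__(self, engine):
        self.engine = engine
    def users_older_than(self, age):
        return self.engine.run("users", lambda r: r[1] > age)

stack = QueryLayer(ExecutionLayer(DBMSLayer(FileSystemLayer())))
print(stack.users_older_than(30))   # [('ann', 34), ('bob', 51)]
```

Each layer only talks to the one directly below it, which is what allows, for instance, the DBMS layer to be redeployed over a different IaaS without rewriting the layers above.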
Fig. 3. Big Data framework [14].
6 Solutions and Methodologies for Data Storage
Several solutions have been proposed to store and retrieve the large amounts of data demanded by Big Data, some of which are currently used in Clouds. Internet-scale file
systems such as the Google File System (GFS) attempt to provide the robustness,
scalability, and reliability that certain Internet services need [17]. Other solutions
provide object-store capabilities where files can be replicated across multiple geo-
graphical sites to improve redundancy, scalability, and data availability. Examples
include Amazon Simple Storage Service (S3), Nirvanix Cloud Storage, OpenStack
Swift and Windows Azure Binary Large Object (Blob) storage. Although these solutions provide the scalability and redundancy that many Cloud applications require, they sometimes do not meet the concurrency and performance needs of certain analytics applications.
One key aspect in providing performance for Big Data analytics applications is data locality, because the volume of data involved makes it prohibitive to transfer the data in order to process it. Transferring data was the preferred option in typical high performance computing systems: such systems usually perform CPU-intensive calculations over a moderate volume of data, so it is feasible to move the data to the computing units because the ratio of data transfer time to processing time is small. In the context of Big Data, however, moving data to computation nodes would generate a large ratio of data transfer time to processing time. Thus, a different approach is preferred, where computation is moved to where the data is. Data locality was exploited in a similar way previously in scientific workflows [18] and in Data Grids [19].
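To make the trade-off concrete, the following sketch compares the two regimes. All figures (network bandwidth, scan rates, data volumes) are assumptions chosen for illustration, not measurements from any real system:

```python
# Illustrative comparison of "move data to compute" vs. "move compute to data".
# All numbers below are assumptions for the sake of the example.

def transfer_time_s(data_gb: float, network_gbps: float) -> float:
    """Time to ship the data over the network (GB converted to Gb)."""
    return data_gb * 8 / network_gbps

def processing_time_s(data_gb: float, scan_rate_gb_per_s: float) -> float:
    """Time to process the data once, e.g. a filter/aggregate pass."""
    return data_gb / scan_rate_gb_per_s

# HPC-style job: moderate data, CPU-intensive -> transfer is cheap relatively.
hpc_ratio = transfer_time_s(10, 10) / processing_time_s(10, 0.1)

# Big Data job: huge data, light per-byte work -> transfer dominates.
bd_ratio = transfer_time_s(100_000, 10) / processing_time_s(100_000, 1.0)

print(f"HPC transfer/processing ratio:      {hpc_ratio:.2f}")
print(f"Big Data transfer/processing ratio: {bd_ratio:.2f}")
```

Under these assumed numbers, the ratio of transfer time to processing time is an order of magnitude higher for the Big Data job, which is why moving computation to the data is preferred.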
In the context of Big Data analytics, MapReduce presents an interesting model in which data locality is exploited to improve the performance of applications. Hadoop, an open source MapReduce implementation, allows for the creation of clusters that use the Hadoop Distributed File System (HDFS) to partition and replicate data sets to the nodes where they are most likely to be consumed by mappers. In addition to exploiting the concurrency of large numbers of nodes, HDFS minimizes the impact of failures by replicating data sets to a configurable number of nodes. It has been used by [20] to develop an analytics platform to process Facebook's large data sets. The platform uses Scribe to aggregate logs from Web servers, exports them to HDFS files, and uses a Hive-Hadoop cluster to execute analytics jobs. The platform includes replication and compression techniques and the columnar compression of Hive to store large amounts of data.
Although a large part of the data produced nowadays is unstructured, relational databases have been the choice most organizations have made to store data about their customers, sales, and products, among other things. As data managed by traditional DBMSs ages, it is moved to data warehouses for analysis and for sporadic retrieval. Models such as MapReduce are generally not the most appropriate for analyzing such relational data, so attempts have been made to provide hybrid solutions that incorporate MapReduce to perform some of the queries and data processing required by DBMSs. In [22] the researchers provide a parallel database design for analytics that supports SQL and MapReduce scripting on top of a DBMS to integrate multiple data sources. A few providers of analytics and data mining solutions, by exploiting models such as MapReduce, are migrating some of the processing tasks closer to where the data is stored, thus trying to minimize the overheads of data preparation, storage, and processing. Data processing and analytics capabilities are moving towards Enterprise Data Warehouses (EDWs), or are being deployed in data hubs to facilitate reuse across various data sets [23].
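As a toy illustration of the hybrid idea, the same aggregation can be expressed both relationally and as a map/reduce-style pass over the rows. The schema and data are assumptions, and SQLite stands in for a parallel DBMS:

```python
# Toy illustration of hybrid SQL + MapReduce-style processing: one
# aggregation run as SQL in a relational engine (SQLite as a stand-in),
# and again as a map/reduce pass over the exported rows.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 7.5)])

# 1. Relational style: the DBMS performs the grouping.
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# 2. MapReduce style: map emits (region, amount); reduce sums per key.
mr_totals = defaultdict(float)
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    mr_totals[region] += amount  # shuffle and reduce folded together

assert sql_totals == dict(mr_totals)  # both approaches agree
print(sql_totals)
```

The point is only that the two models can answer the same question; a real hybrid system such as the one in [22] integrates the two far more deeply.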
Another distinctive trend in Cloud computing is the increasing use of NoSQL databases as the preferred method for storing and retrieving information. NoSQL adopts a non-relational model for data storage. Leavitt argues that non-relational models have been available for more than 50 years in forms such as object-oriented, hierarchical, and graph databases, but that this paradigm has recently started to attract more attention with models such as key-value, column-oriented, and document-based stores. The causes of this rise in interest, according to Leavitt, are better performance, the capacity to handle unstructured data, and suitability for distributed environments [24].
In [25] the researchers presented a survey of NoSQL databases with emphasis on their advantages and limitations for Cloud computing. The survey classifies NoSQL systems according to their capacity to address different pairs of the CAP properties (consistency, availability, partition tolerance), and also explores the data models that the studied NoSQL systems support.
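The non-relational data model can be illustrated with a minimal in-memory sketch. This shows only the idea of key-addressed, schema-free records; it is not modeled on any specific NoSQL product:

```python
# Minimal in-memory sketch of a document-style NoSQL store: values are
# schema-free dicts addressed by key, with no joins and no fixed columns.
# An illustration of the data model only, not a real database.

class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = dict(document)  # store a schema-free record

    def get(self, key):
        return self._docs.get(key)

    def find(self, **criteria):
        """Scan-style query: documents matching all given fields."""
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
# Documents need not share a schema: one has 'dept', the other does not.
store.put("u1", {"name": "Alice", "dept": "analytics"})
store.put("u2", {"name": "Bob", "visits": 17})

print(store.find(dept="analytics"))  # [{'name': 'Alice', 'dept': 'analytics'}]
```

Because there is no fixed schema, the store happily holds records with different fields, which is exactly the flexibility that makes NoSQL attractive for unstructured data.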
7 Data Processing and Resource Management
When attempting to understand the concept of Big Data, terms such as "MapReduce" and "Hadoop" cannot be avoided.
7.1 Hadoop
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. A Hadoop cluster uses a Master/Slave structure. Using Hadoop, large data sets can be processed across a cluster of servers, and applications can be run on systems with thousands of nodes involving thousands of terabytes. The distributed file system in Hadoop enables rapid data transfer rates and allows the system to continue normal operation even when some nodes fail. This approach lowers the risk of an entire system failure, even in the case of a significant number of node failures. Hadoop enables a computing solution that is scalable, cost-effective, flexible, and fault-tolerant. The Hadoop framework is used by popular companies such as Google, Yahoo, Amazon, and IBM to support their applications involving huge amounts of data. Hadoop has two main sub-projects: MapReduce and the Hadoop Distributed File System (HDFS) [26].
7.2 Map Reduce
Hadoop MapReduce is a framework used to write applications that process large amounts of data in parallel on clusters of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job first divides the input data into individual chunks, which are processed by the map tasks in parallel. The outputs of the maps, sorted by the framework, are then input to the reduce tasks. Generally, both the input and the output of the job are stored in a file system. Scheduling tasks and re-executing failed tasks are taken care of by the framework.
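The data flow described above can be sketched in a single process. This toy word count shows only the map, shuffle/sort, and reduce phases; real Hadoop distributes these phases across a cluster and also handles scheduling and re-execution of failed tasks:

```python
# Single-process sketch of the MapReduce data flow
# (map -> shuffle/sort -> reduce) using word count as the job.
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for each word in an input chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework's shuffle does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts emitted for each word."""
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data in the cloud", "big data analytics"]
intermediate = [pair for c in chunks for pair in map_phase(c)]
result = reduce_phase(shuffle(intermediate))
print(result["big"], result["data"])  # 2 2
```

Each chunk plays the role of an input split processed by one mapper, and each key group plays the role of one reducer's input.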
7.3 Hadoop Distributed File System (HDFS)
HDFS is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on the local nodes to form one large file system. HDFS improves reliability by replicating data across multiple nodes so that the system can overcome individual node failures.
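A toy sketch of this replication idea follows. The block size, file size, and node names are assumptions; real HDFS uses rack-aware placement policies rather than the round-robin scheme shown here:

```python
# Toy sketch of HDFS-style block placement: a file is split into fixed-size
# blocks and each block is replicated on `replication` distinct nodes, so
# the loss of any single node leaves every block readable elsewhere.

def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Simple round-robin placement (HDFS itself is rack-aware).
        placement[b] = [nodes[(b + r) % len(nodes)]
                        for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(file_size_mb=640, block_size_mb=128, nodes=nodes)

# Every block lives on 3 distinct nodes, so any one node can fail safely.
assert all(len(set(replicas)) == 3 for replicas in layout.values())
print(len(layout))  # 5 blocks
```

The replication factor is configurable, mirroring the configurable number of replicas mentioned above.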
8 Applications of Big Data
Big data applications refer to large-scale distributed applications that usually work on large data sets. Data exploration and analysis have turned into difficult problems in many sectors because of the span of big data. With large and complex data, computation becomes too difficult to be handled by traditional data processing applications, which triggers the development of big data applications. Google's MapReduce framework and Apache Hadoop are the de facto systems for big data applications, and these applications generate huge amounts of data. Manufacturing and bioinformatics are the two major areas of big data applications.
Big data provides an infrastructure for transparency in the manufacturing industry, which has the ability to unravel uncertainties such as inconsistent component performance and availability. In these big data applications, a conceptual framework of predictive manufacturing begins with data acquisition, where there is a possibility to acquire different types of sensory data such as vibration, acoustics, voltage, current, and controller data. The combination of such sensory data constructs the big data in manufacturing, which then acts as the input to predictive tools and preventive strategies such as prognostics and health management.
Another important application for Hadoop is bioinformatics, which covers genome sequencing and other biological domains. Bioinformatics, which requires large-scale data analysis, uses Hadoop, and cloud computing brings parallel and distributed computing together with computer clusters and web interfaces.
In big data, the software packages provide a rich set of tools and options through which an individual can map the entire data landscape across the company, thus allowing the individual to analyze the threats he or she faces internally. This is considered one of the main advantages, as big data helps keep the data safe: an individual is able to detect potentially sensitive information that is not protected in an appropriate manner and make sure it is stored according to the regulations.
There are some common characteristics of big data:
1. Big data integrates both structured and unstructured data.
2. It addresses speed and scalability, mobility and security, flexibility and stability.
3. In big data, the realization time from data to information is critical for extracting value from various data sources, including mobile devices, radio frequency identification, the web, and a growing list of automated sensory technologies.
All organizations and businesses would benefit from the speed, capacity, and scalability of cloud storage. Moreover, end users can visualize the data, and companies can find new business opportunities. Another notable advantage of big data is data analytics, which allows an individual to personalize the content or the look and feel of a website in real time so that it suits each customer entering the website. When big data is combined with predictive analytics, it opens up opportunities for many industries. The combination makes it possible to:
1. Calculate risks on large portfolios.
2. Detect, prevent, and re-audit financial fraud.
3. Improve delinquent collections.
4. Execute high-value marketing campaigns.
9 Advantages, Risks, and Challenges of Big Data and Cloud Computing Integration
9.1 Advantages
There are many advantages to integrating cloud computing with big data. Big data raises the need for multiple servers because of the massive volume of data it works on, and it demands high velocity and variability; these multiple servers work in parallel to meet big data's high requirements. Cloud computing already uses multiple servers and allows resource allocation. Because of that, it is a great fit to build big data on these cloud multi-servers and make use of the resource allocation availability provided by cloud environments, which results in better efficiency for big data analysis [34]. Using a cloud system as storage for big data would improve the performance of both: since cloud systems are mainly based on remote multi-servers, it is feasible to handle massive amounts of data simultaneously, and this feature allows advanced analytics techniques to deal with such volumes. The integration between cloud computing and big data also results in cost reduction. While big data requires clusters of servers and volumes to support the massive amount of data, cloud computing systems can serve as the infrastructure for all of these servers and volumes instead of new ones being created for big data, which also provides more flexibility and scalability and eliminates the huge investments that would otherwise be made in big data computers and servers [33].
In addition, cloud computing provides faster provisioning for big data, as provisioning servers in the cloud is easy and feasible, so the cloud environment can be scaled according to the processing requirements of the big data. This fast provisioning is really important for big data because the value of the data decreases rapidly over time. In general, cloud computing complements big data and provides a convenient, on-demand, shared computing environment with minimal management effort and reduced overhead. It also makes the environment more robust and automated and provides multi-tenancy. The integration between the two also makes big data resources more controllable, monitored, and reported, while reducing complexity and improving productivity. All of these advantages make cloud-based approaches the right models for deploying big data.
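The scaling argument can be made concrete with a back-of-envelope sizing calculation. All capacities and data volumes below are illustrative assumptions, not figures from any provider:

```python
# Back-of-envelope cluster sizing for a Big Data workload in the cloud.
# Per-node capacity and data volumes are assumptions for illustration.

def nodes_needed(data_tb, usable_tb_per_node, replication=3):
    """Nodes required to hold the data at the given replication factor."""
    raw_tb = data_tb * replication
    return int(-(-raw_tb // usable_tb_per_node))  # ceiling division

# Elastic provisioning: re-evaluate as the data grows, instead of
# buying hardware ahead of demand.
for data_tb in (50, 200, 800):
    print(data_tb, "TB ->", nodes_needed(data_tb, usable_tb_per_node=20),
          "nodes")
```

In an on-premises setting each step up would mean a new hardware purchase; in the cloud it is a provisioning request, which is exactly the fast-provisioning advantage described above.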
9.2 Risks and Challenges
Despite all of these advantages of the integration between cloud computing and big data, there are some challenges and risks that should be considered when deploying big data in a cloud environment. The fundamental issue is the security of the Big Data cloud environment: some security vulnerabilities arise from the integration of the two and the creation of a new, unfamiliar platform. One of the best-known Big Data cloud security vulnerabilities is platform heterogeneity. Many Big Data deployments require deploying a new platform in the cloud, while the cloud's existing security tools and practices will not work for such platforms, so new security tools need to be developed for these new Big Data platforms. These security tools could include authentication, access control, encryption, intrusion detection, and event logging and monitoring. In addition to the security policies, the Big Data consolidation plans should be taken into consideration during integration with the cloud environment.
Another challenge is the nature of the data and its location: in Big Data, the data can be in different locations, which the cloud environment may or may not include. The type of processing that should be applied to the data, the parallelism of that processing, and where the processing should take place (whether the data is moved to a processing environment or the processing is performed where the data resides) are all challenges that should be taken into consideration when deploying Big Data to a cloud environment. A further challenge is the optimization of the Big Data cloud topology, which specifies the configuration and the size of the clouds, clusters, and nodes that should be included to reach the optimal Big Data cloud model [32].
On the other hand, these challenges have been overcome, if not in a direct manner, allowing the integration of the cloud and big data to become the more practical answer, instead of investing countless thousands of dollars in creating an environment suitable for the large amount of processing required and for accommodating terabytes of data.
10 Conclusions
Big data is currently one of the most critical emerging technologies. The 4V's of big data (volume, velocity, variety, and veracity) make data management and analytics challenging for traditional data warehouses. Cloud computing seems to be a perfect vehicle for hosting big data workloads; however, working on big data in the cloud brings its own challenge of reconciling two contradictory design principles. By integrating big data with cloud computing technologies, businesses and education institutes can have a better direction for the future. The capability to store large amounts of data in different forms and process it all at very high speeds will result in data that can guide businesses and education institutes in developing fast. This paper presented the general concepts and definition of Big Data and illustrated the movement of data management to cloud computing. It presented MapReduce and Hadoop as Big Data systems that support the processing of large sets of data in a distributed computing environment. The paper also introduced the characteristics, trends, and challenges of big data, and investigated the benefits and risks that may arise out of the integration between big data and cloud computing. The major advantage of the cloud computing and big data integration is the availability of data storage and processing power: the cloud has access to a large pool of resources and various forms of infrastructure that can accommodate this integration in the most suitable way possible, so that with minimum effort the environment can be set up and managed to provide an excellent workspace for all big data needs, i.e., data analytics. This in turn provides low complexity with high productivity. A couple of challenges cross the path of this integration, but none so major that today's technology and expansion in the field have not already overcome them, giving the cloud a much-needed edge as the most practical solution for hosting and processing big data environments.
11 References
[1] Villars, R. L., Olofson, C. W., & Eastwood, M (June, 2011). Big data: What it is and why
you should care. IDC White Paper. Framingham, MA: IDC.
[2] Coronel, C., Morris, S., & Rob, P. (2013). Database Systems: Design, Implementation, and
Management, (10th. Ed.). Boston: Cengage Learning.
[3] White, C. (2011). Data Communications and Computer Networks: A business user’s ap-
proach, (6th ed.). Boston: Cengage Learning.
[4] IOS Press. (2011). Guidelines on security and privacy in public cloud computing. Journal
of E-Governance, 34, 149–151. DOI: 10.3233/GOV-2011-0271
[5] J. B. Rothnie Jr., P. A. Bernstein, S. Fox, N. Goodman, M. Hammer, T. A. Landers, C. L.
Reeve, D. W. Shipman, and E. Wong. Introduction to a System for Distributed Databases
(SDD-1). ACM Trans. Database Syst., 5(1):1–17, 1980.
[6] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. I. Hsiao, and R. Rasmussen. The Gamma Database Machine Project. IEEE Trans. on Knowl. and Data Eng., 2(1):44–62, 1990.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A.
Fikes, and R. E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In
OSDI, pages 205–218, 2006.
[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In
OSDI, pages 137–150, 2004.
[9] Apache Hadoop Project, 2009.
[10] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R.
Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB,
2(2):1626–1629, 2009
[11] Rajeev Gupta, Himanshu Gupta, and Mukesh Mohania, "Cloud Computing and Big Data
Analytics: What Is New from Databases Perspective?". S. Srinivasa and V. Bhatnagar
(Eds.): BDA 2012, LNCS 7678, pp. 42–61. Springer-Verlag Berlin Heidelberg, 2012.
[12] ISO/IEC JTC 1. Information technology: Big data, Preliminary Report 2014. ISO/IEC.
[13] Curino, C., Jones, E.P.C., Popa, R.A., Malviya, N., Wu, E., Madden, S., Balakrishnan, H.,
Zeldovich, N.: Relational Cloud: A Database-as-a-Service for the Cloud. In: Proceedings
of Conference on Innovative Data Systems Research, CIDR- 2011.
[14] Alberto Fernández, Sara del Río, Victoria López, Abdullah Bawakid, María J. del Jesus, José
M. Benítez, and Francisco Herrera. "Big Data with Cloud Computing: an insight on the
computing environment, MapReduce, and programming frameworks". doi:
10.1002/widm.1134. WIREs Data Mining Knowl Discov, 4:380–409, 2014.
[15] Shim K, Cha SK, Chen L, Han W-S, Srivastava D, Tanaka K, Yu H, Zhou X. Data man-
agement challenges and opportunities in cloud computing. In: 17th International Confer-
ence on Database Systems for Advanced Applications (DASFAA’2012). Ber-
lin/Heidelberg: Springer 323; 2012.
[16] Kambatla K, Kollias G, Kumar V, Grama A. Trends in big data analytics. J Parallel Distrib Comput, 74(7):2561–2573, 2014.
[17] S. Ghemawat, H. Gobioff, S.-T. Leung, The google file system, in: Proceedings of the 9th
ACM Symposium on Operating Systems Principles (SOSP 2003), ACM, New York, USA,
pp. 29–43, 2003.
[18] E. Deelman, A. Chervenak, Data management challenges of data-intensive scientific workflows, in: Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'08), IEEE Computer Society, pp. 687–692, 2008.
[19] S. Venugopal, R. Buyya, K. Ramamohanarao, A taxonomy of data grids for distributed data sharing, management and processing, ACM Comput. Surv. 38(1):1–53, 2006.
[20] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. S. Sarma, R. Murthy, H. Liu, Data warehousing and analytics infrastructure at Facebook, in: Proceedings of the 2010 International Conference on Management of Data, ACM, New York, NY, USA, pp. 1013–1020, 2010.
[21] D.J. Abadi, Data management in the cloud: Limitations and opportunities, IEEE Data En-
gineering Bulletin 32 (1) 3–12, 2009.
[22] J. Cohen, B. Dolan, M. Dunlap, J.M. Hellerstein, C. Welton, MAD skills: new analysis
practices for big data, Proceedings of the VLDB Endow 2 (2) 1481–1492, 2009.
[23] D. Jensen, K. Konkel, A. Mohindra, F. Naccarati, E. Sam, Business Analytics in the
Cloud, White paper IBW03004-USEN-00, IBM (April 2012).
[24] N. Leavitt, Will NoSQL Databases Live Up to Their Promise? Computer 43(2):12–14, 2010.
[25] J. Han, H. E, G. Le, J. Du, Survey on NoSQL database, in: 6th International Conference on
Pervasive Computing and Applications (ICPCA 2011), IEEE, Port Elizabeth, South Afri-
ca, pp. 363–366., 2011.
[26] Lu, Huang, Ting-tin Hu, and Hai-shan Chen. "Research on Hadoop Cloud Computing
Model and its Applications." Hangzhou, China: 2012, pp. 59–63, 21-24 Oct. 2012.
[27] Jiang, Wei, Ravi V.T., and Agrawal G. "A Map-Reduce System with an Alternate API for
Multi-core Environments.". Melbourne, VIC: 2010, pp. 84-93, 17-20 May. 2010.
[28] Chitharanjan K., and Kala Karun A. "A review on Hadoop - HDFS infrastructure extensions." JeJu Island: 2013, pp. 132–137, 11-12 Apr. 2013.
[29] F.C.P, Muhtaroglu, Demir S, Obali M, and Girgin C. "Business model canvas perspective
on big data applications." Big Data, 2013 IEEE International Conference, Silicon Valley,
CA, Oct 6-9, pp. 32–37, 2013.
[30] Zhao, Yaxiong , and Jie Wu. "Dache: A data aware caching for big-data applications using
the MapReduce framework." INFOCOM, 2013 Proceedings IEEE, Turin, Apr 14-19, pp. 35–39, 2013.
[31] Xu-bin, LI, JIANG Wen-rui, JIANG Yi, ZOU Quan. "Hadoop Applications in Bioinfor-
matics." Open Cirrus Summit (OCS), 2012 Seventh, Beijing, Jun 19-20, pp. 48–52, 2012.
[32] Castelino, C., Gandhi, D., Narula, H. G., & Chokshi, N. H. (2014). Integration of Big Data
and Cloud Computing. International Journal of Engineering Trends and Technology
(IJETT), 100-102.
[33] Chandrashekar, R., Kala, M., & Mane, D. (2015). Integration of Big Data in Cloud compu-
ting environments for enhanced data processing capabilities. International Journal of Engi-
neering Research and General Science, 240-245.
[34] James Kobielus, I., & Bob Marcus, E. S. (2014). Deploying Big Data Analytics Applica-
tions to the Cloud: Roadmap for Success. Cloud Standards Customer Council.
12 Authors
Samir A. EL-SEOUD is with British University in Egypt-BUE, Cairo, Egypt.
Hosam F. El-SOFANY is with Cairo Higher Institute for Engineering, Computer
Science, and Management, Cairo, Egypt.
Mohamed ABDELFATTAH (corresponding author) is with British University in Egypt-BUE, Cairo, Egypt.
Reham MOHAMED is with British University in Egypt-BUE, Cairo, Egypt.
This article is a revised version of a paper presented at the BUE International Conference on Sustainable
Vital Technologies in Engineering and Informatics, held Nov 07, 2016 - Nov 09, 2016 , in Cairo, Egypt.
Article submitted 21 December 2016. Published as resubmitted by the authors 23 February 2017.
... Based on Table A1 Top 10 highest cited papers showed the contribution of Big Data in Education and STEM Education in the Appendix, it can be seen that research related to Big Data has great potential and opportunities, but there are challenges in the field of education, so it is necessary to improve learning Big Data analytics and platforms to support education infrastructure [82]- [85]. In analysing Big Data, it is also necessary to develop learning analytics and Big Data business analysis to support and improve success that can be implemented in education [86]. So that by utilising Big Data can help understand various learning problems [87]- [90]. ...
Full-text available
Big Data research has rapidly developed in the last 10 years, making it an interesting topic to investigate to understand the trends and developments of Big Data in Education. This study aims to analyze source documents and state contributions, and visualize research trends on the subject of Big Data in education during the last 10 years and identify potential research topics related to Big Data in Education in the future. This study uses a literature review and Meta-Analyses (PRISMA) method associated with bibliometric analysis using the Scopus and VOSViewer databases, through which 1,076 documents were obtained. The results of the bibliometric analysis show that Big Data research in education has increased significantly in the last 10 years. However, in the last year, it has decreased, this is the next challenge, and an opportunity for future research. The most common types of documents are conference papers, the source of most documents is conference proceedings, and the country that contributes the most is China. English is the most widely spoken language, and the authors with the highest contribution are Daniel, B. K. Further research related to Big Data can be used and implemented in the business world for Big Data analysis in education. Big Data can also be integrated with Science, technology, engineering, and mathematics (STEM) which can be a further research opportunity that can be applied to education because analyzing Big Data requires STEM learning skills. Thus, this research recommends finding updates in the study of Big Data in education by integrating STEM in education because analyzing Big Data again requires STEM learning expertise. This study has limitations in using only one database, namely Scopus, to obtain research data. Therefore, it is recommended that Big Data in Education research be conducted using other databases besides Scopus to obtain more extensive data.
... Businesses, therefore, tend to conduct big data processing in the cloud computing environment. Besides, the cloud allows incorporating information from various sources and places (Abou El-Seoud et al., 2017). ...
... Cloud includes multiple remote servers which makes it possible to deal with mighty sized big data sets at the same time [45] ...
Full-text available
Big data stands for sheer amount of data that is growing unceasingly at a rapid pace. Big Data demands high-powered, robust, reliable, fault-tolerant tools and techniques in order to make it convenient to process, analyse and uproot new insights from Big Data. Big data refers to huge, heterogeneous amount of details, facts and data generating at constantly rising rate. The data sets in Big Data are too bulky or extensive, as a result classical data handling application software are not competent enough to administer them. On the other hand, Cloud computing is a resourceful technology providing high computing power, scalability, computing resources as and when required for processing, storage, analytics and visualization of Big Data. Therefore, cloud computing can be regarded as a feasible and applicable technology which promises to handle Big Data challenges and also provides here and now infrastructures with all the mandatory resources. This paper will mainly review processing of big data using Hadoop and spark in cloud, advantages of driving Big Data using cloud computing and applications of Big data in Cloud.
Full-text available
Cloud computing has revolutionized organizational operations by providing convenient, on-demand access to resources. The emergence of the Internet of Things (IoT) has introduced a new paradigm for collaborative computing, leveraging sensors and devices that generate and process vast amounts of data, thereby resulting in challenges related to scalability and security, making the significance of conventional security methods even more pronounced. Consequently, in this paper, we propose a novel Scalable and Secure Cloud Architecture (SSCA) that integrates IoT and cryptographic techniques, aiming to develop scalable and trustworthy cloud systems, thus enabling multi-user systems and facilitating simultaneous access to cloud resources by multiple users. The design adopts a decentralized approach, utilizing multiple cloud nodes to handle user requests efficiently and incorporates Multicast and Broadcast Rekeying Algorithm (MBRA) to ensure the privacy and confidentiality of user information, utilizing a hybrid cryptosystem that combines MBRA, Post Quantum Cryptography (PQC) and blockchain technology. Leveraging IoT devices, the architecture gathers data from distributed sensing resources and ensures the security of collected information through robust MBRA-PQC encryption algorithms, while the blockchain ensures that the confidential data is stored in distributed and immutable records. The proposed approach is applied to several datasets and the effectiveness is validated through various performance metrics, including response time, throughput, scalability, security, and reliability. The results highlight the effectiveness of the proposed SSCA, showcasing a notable reduction in response time by 1.67 seconds and 0.97 seconds for 250 and 1000 devices, respectively, in comparison to the MHE-IS-CPMT. 
Likewise, SSCA demonstrated significant improvements in the AUC values, exhibiting enhancements of 6.30%, 6.90%, 7.60%, and 7.30% at the 25-user level, and impressive gains of 5.20%, 9.30%, 11.50%, and 15.40% at the 50-user level when compared to the MHE-IS-CPMT, EAM, SCSS, and SHCEF models, respectively.
Full-text available
Traditional database management systems are inadequate for handling uncertain data due to design limitations. Various methods address database uncertainty, often oversimplifying and restricting representable uncertainties. Rapid data growth from technologies like IoT, multimedia, and social media has led to massive diverse data, known as big data, categorized into structured, semi-structured, and unstructured types. Coping with big data's challenges is encapsulated in the "Vs Model" with volume, velocity, and variety dimensions. It aims to gather variable information for causal understanding and informed decisions. Big data offers competitive advantages: smart decisions (69%), operational control (54%), customer insights (52%), and cost cuts (47%). This research studies big data management and cloud computing, where data and apps are accessed via the internet, saving resources globally. Cloud's simplicity and power facilitate tasks and storage. Risks include bankruptcy or breaches, yet cloud remains valuable for preserving big data. Vital factors in big data's value creation involve converting IT costs to assets and connecting big data outcomes to performance. This abstract summarizes exploration of traditional limitations, big data's essence, cloud's potential, and value creation nuances.
Cloud computing is a potent tool for sophisticated and massive-scale computation. It removes the need for expensive hardware, specialized space, and software maintenance. It has been noticed that cloud computing has resulted in a massive increase in the volume of data, or big data. Managing massive amounts of data is a complex and time-consuming operation that requires an extensive computer infrastructure for effective data processing and analysis. Many industries, including minor and major organizations, healthcare, education, and many more, are attempting to harness the potential of big data. In healthcare, for example, big data is used to reduce treatment costs, predict pandemic outbreaks, and prevent infections, among other things. This article discusses comprehensive data processing strategies from system and application perspectives to offer an orderly picture of the issues that application developers and database management system (DBMS) designers face while designing and deploying internet-scale applications. While big data has various uses in various industries, it has challenges.
Computation offloading has effectively solved the problem of terminal devices computing resources limitation in hospitals by shifting the medical image diagnosis task to the edge servers for execution. Appropriate offloading strategies for diagnostic tasks are essential. However, the risk awareness of each user and the multiple expenses associated with processing tasks have been ignored in prior works. In this article, a multi-user multi-objective computation offloading for medical image diagnosis is proposed. First, the prospect theoretic utility function of each user is designed considering the delay, energy consumption, payment, and risk awareness. Second, the computation offloading problem including the above factors is defined as a distributed optimization problem, which with the goal of maximizing the utility of each user. The distributed optimization problem is then transformed into a non-cooperative game among the users. The exact potential game proves that the non-cooperative game has Nash equilibrium points. A low-complexity computation offloading algorithm based on best response dynamics finally is proposed. Detailed numerical experiments demonstrate the impact of different parameters and convergence in the algorithm on the utility function. The result shows that, compare with four benchmarks and four heuristic algorithms, the proposed algorithm in this article ensures a faster convergence speed and achieves only a 1.14% decrease in the utility value as the number of users increases.
The buzzword big data refers to large-scale distributed applications that work on unprecedentedly large data sets. Google's MapReduce framework and Apache's Hadoop, its open-source implementation, are the de facto software systems for big-data applications. One observation regarding these applications is that they generate a large amount of intermediate data, and this abundant information is discarded once processing finishes. Motivated by this observation, we propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to a cache manager, and a task, before initiating its execution, queries the cache manager for potentially matching processing results, which can accelerate its execution or even eliminate it entirely. A novel cache description scheme and a cache request-and-reply protocol are designed. We implement Dache by extending the relevant components of the Hadoop project. Testbed experiments demonstrate that Dache significantly improves the completion time of MapReduce jobs and saves a significant amount of CPU execution time.
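A minimal sketch of the caching idea behind Dache: a task derives a description of its input partition and operation, queries the cache manager for a match before executing, and submits its result afterwards. The class and method names below are illustrative only; the real Dache description scheme and request/reply protocol are more elaborate.

```python
import hashlib

class CacheManager:
    """Toy in-process cache manager; Dache's manager is a separate service."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def describe(input_partition, operation):
        # Cache description: hash of the input data plus the applied operation,
        # so identical (input, operation) pairs map to the same entry.
        h = hashlib.sha256()
        h.update(repr(input_partition).encode())
        h.update(operation.__name__.encode())
        return h.hexdigest()

    def query(self, input_partition, operation):
        return self._store.get(self.describe(input_partition, operation))

    def submit(self, input_partition, operation, result):
        self._store[self.describe(input_partition, operation)] = result

def run_task(cache, partition, operation):
    """Query the cache first; only execute the operation on a miss."""
    cached = cache.query(partition, operation)
    if cached is not None:
        return cached, True              # cache hit: execution skipped
    result = operation(partition)
    cache.submit(partition, operation, result)
    return result, False
```

Running the same task twice makes the second invocation a pure cache lookup, which is the source of the job-completion-time savings the abstract reports.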
As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques, and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
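Data-parallel statistical techniques of the kind mentioned above typically compute a partial aggregate on each database segment and then merge the partials at a coordinator. The histogram below (a crude density estimate) is a toy stand-in for the paper's density methods, chosen because histograms merge with a simple sum.

```python
from collections import Counter
from functools import reduce

def partial_histogram(values, bin_width):
    """Per-segment partial aggregate: count values falling into each bin.
    Each parallel worker would run this on its local partition."""
    return Counter(int(v // bin_width) for v in values)

def merge_histograms(parts):
    # Histograms are additive, so segment results combine with a plain sum;
    # this merge step is what the coordinator performs.
    return reduce(lambda a, b: a + b, parts, Counter())
```

The same split-then-merge pattern underlies most aggregates that parallel databases and MapReduce jobs compute efficiently.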
The term 'Big Data' has spread rapidly in the framework of Data Mining and Business Intelligence. This new scenario can be defined by means of those problems that cannot be effectively or efficiently addressed using our current standard computing resources. We must emphasize that Big Data does not just imply large volumes of data but also the necessity for scalability, i.e., ensuring a response in an acceptable elapsed time. When scalability is considered, traditional parallel solutions are usually contemplated, such as the Message Passing Interface or high-performance, distributed Database Management Systems. Nowadays there is a new paradigm that has gained popularity over the latter due to the number of benefits it offers: Cloud Computing, whose main features include elasticity in the use of computing resources and space, less management effort, and flexible costs. In this article, we provide an overview of Big Data and how the problem can be addressed from the perspective of Cloud Computing and its programming frameworks. In particular, we focus on systems for large-scale analytics based on the MapReduce scheme and Hadoop, its open-source implementation. We identify several libraries and software projects that have been developed to aid practitioners in addressing this new programming model. We also analyze the advantages and disadvantages of MapReduce in contrast to the classical solutions in this field. Finally, we present a number of programming frameworks that have been proposed as alternatives to MapReduce, developed under the premise of solving the shortcomings of this model in certain scenarios and platforms. WIREs Data Mining Knowl Discov 2014, 4:380–409. doi: 10.1002/widm.1134
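The MapReduce scheme discussed above can be illustrated with the classic word-count example. The sketch below is a single-process rendering of the programming model (map, framework-managed shuffle, reduce), not a distributed implementation; in Hadoop the three phases run across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    # Apply the user-supplied mapper to each input record,
    # emitting (key, value) pairs.
    return chain.from_iterable(mapper(r) for r in records)

def shuffle(pairs):
    # Group intermediate values by key; in Hadoop the framework
    # performs this step between the map and reduce phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    return {k: reducer(k, vs) for k, vs in groups.items()}

# The canonical word-count job expressed in this scheme:
def wc_mapper(doc):
    return [(word, 1) for word in doc.split()]

def wc_reducer(word, counts):
    return sum(counts)
```

Because the mapper and reducer are pure functions over independent records and key groups, the framework is free to parallelize both phases, which is the essence of MapReduce's scalability.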
Analyzing large data sets is a challenging problem today, as applications are increasingly expected to deal with volumes of data that do not fit in the main memory of a single machine. For such data-intensive applications, the database research community has started to investigate cloud computing as a cost-effective option for building scalable parallel data management systems capable of serving petabytes of data to millions of users. The goal of this panel is to initiate an open discussion within the community on data management challenges and opportunities in cloud computing. Potential topics include: the MapReduce framework, shared-nothing architecture, parallel query processing, security, analytical data management, transactional data management, and fault tolerance.
Hadoop is an open-source software platform for distributed computing that supports parallel processing of large data sets and has been widely used in cloud computing. This paper describes the three most crucial parts of Hadoop: HDFS, the distributed file system; MapReduce, the data processing model; and HBase, the distributed structured data table. The application status, main research directions, and open problems of the Hadoop data processing platform are analyzed, and some performance optimization suggestions are given.
Apache's Hadoop is already capable, but there is considerable scope for extensions and enhancements, and a large number of improvements have been proposed for this open-source implementation of Google's Map/Reduce framework. Hadoop enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. Hadoop stores data in the Hadoop Distributed File System (HDFS), an open-source implementation of the Google File System (GFS), and Map/Reduce applications mainly use HDFS for storage. HDFS is a very large distributed file system that runs on commodity hardware and provides high throughput as well as fault tolerance. Many big enterprises believe that within a few years more than half of the world's data will be stored in Hadoop. HDFS stores files as a series of blocks, which are replicated for fault tolerance. Strategic partitioning, processing, layout, replication, and placement of data blocks can increase the performance of Hadoop, and much research is ongoing in this area. This paper reviews some of the major enhancements suggested for Hadoop, especially in data storage, processing, and placement.
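The block-based storage and replica placement that this line of work optimizes can be sketched as follows. The round-robin placement is a deliberately naive stand-in for HDFS's rack-aware policy; it is shown only to illustrate the idea of spreading each block's replicas across distinct nodes for fault tolerance.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file into fixed-size blocks, as HDFS does (the real default
    block size is large, e.g. 128 MB; a tiny size is used for illustration)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Toy placement: assign each block's replicas to `replication`
    consecutive nodes, round-robin. Requires replication <= len(nodes)
    so the replicas of one block land on distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement
```

Losing any single node then costs at most one replica of each affected block, which is why block layout and placement strategies have such a direct effect on Hadoop's fault tolerance and read throughput.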
Large and complex data that is difficult to handle with traditional data processing applications has triggered the development of big data applications, which have become more pervasive than ever before. In the era of big data, data exploration and analysis have become difficult problems in many sectors, such as smart routing and health care. Companies that can adapt their businesses to leverage big data have significant advantages over those that lag in this capability, and the need to explore new approaches to the challenges of big data forces companies to shape their business models accordingly. In this paper, we summarize and share our findings regarding the business models deployed in big data applications in different sectors. We analyze existing big data applications by considering the core elements of a business (via the business model canvas) and present how these applications provide value to their customers by profiting from the use of big data.
One of the major applications of future-generation parallel and distributed systems is big-data analytics. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size. Beyond their sheer magnitude, these datasets and the associated application considerations pose significant challenges for method and software development. Datasets are often distributed, and their size and privacy considerations warrant distributed techniques. Data often resides on platforms with widely varying computational and network capabilities. Considerations of fault tolerance, security, and access control are critical in many applications [52,8,38]. Analysis tasks often have hard deadlines, and data quality is a major concern in yet other applications. For most emerging applications, data-driven models and methods capable of operating at scale are as yet unknown. Even when known methods can be scaled, validation of results is a major issue. Characteristics of hardware platforms and the software stack fundamentally impact data analytics. In this article, we provide an overview of the state of the art and focus on emerging trends to highlight the hardware, software, and applications landscape of big-data analytics.