Beyond Batch Processing:
Towards Real-Time and Streaming Big Data
Saeed Shahrivari
Computer Engineering Department,
Tarbiat Modares University (TMU), Tehran, Iran
saeed.shahrivari@gmail.com
Abstract: Today, big data is generated from many sources, and there is a huge demand for storing, managing, processing, and querying big data. The MapReduce model and its open source implementation, Hadoop, have proven themselves as the de facto solution to big data processing. Hadoop is inherently designed for batch and high-throughput processing jobs. Although Hadoop is very suitable for batch jobs, there is an increasing demand for non-batch processing of big data, such as interactive jobs, real-time queries, and big data streams. Since Hadoop is not suited to these non-batch workloads, new solutions have been proposed for these challenges. In this article, we discuss two categories of these solutions: real-time processing and stream processing for big data. For each category, we discuss paradigms and strengths and explain the differences from Hadoop. We also introduce some practical systems and frameworks for each category. Finally, we present some simple experiments showing the effectiveness of some of these solutions compared to available Hadoop-based solutions.
Keywords: Big data, MapReduce, Real-time processing, Stream processing
1. Introduction
The “Big Data” paradigm has been gaining popularity recently. The term “Big Data” is generally used for datasets that are so huge that they cannot be processed and managed using classical solutions such as relational database management systems (RDBMSs). Besides volume, high velocity and variety are further challenges of big data. Numerous sources generate big data: the Internet, the Web, online social networks, digital imaging, and new sciences such as bioinformatics, particle physics, and cosmology are some examples [1].
To date, the most notable solution proposed for managing and processing big data is the MapReduce framework [2], initially introduced and used by Google. MapReduce offers three major features in a single package: a simple and easy programming model, automatic and linear scalability, and built-in fault tolerance. Google announced its MapReduce framework as three major components: a MapReduce execution engine, a distributed file system called the Google File System (GFS) [3], and a distributed NoSQL database called BigTable [4].
After Google’s announcement of its MapReduce framework, the Apache foundation started counterpart open source implementations. Hadoop MapReduce and Hadoop YARN as execution engines, the Hadoop Distributed File System (HDFS), and HBase as a replacement for BigTable were the major projects [5]. Apache has also gathered some additional projects: Cassandra, a distributed data management system resembling Amazon’s Dynamo; Zookeeper, a high-performance coordination service for distributed applications; Pig and Hive for data warehousing; and Mahout for scalable machine learning.
From its inception, the MapReduce framework has made complex large-scale data processing easy and efficient. Nevertheless, MapReduce is designed for batch processing of large volumes of data, and it is not suitable for recent demands like real-time and online processing. MapReduce is inherently designed for high-throughput batch processing of big data jobs that take several hours or even days, while recent demands center on jobs and queries that should finish in seconds or, at most, minutes.
In this article, we focus on two new aspects: real-time processing and stream processing solutions for big data. An example of real-time processing is fast, interactive querying of big data warehouses, in which the user wants query results in seconds rather than minutes or hours. Principally, the goal of real-time processing is to provide solutions that can process big data very fast and interactively. Stream processing deals with problems whose input data must be processed without being stored in its entirety. There are numerous use cases for stream processing, such as online machine learning and continuous computation. These new trends need systems that are more elaborate and agile than the currently available MapReduce solutions like the Hadoop framework. Hence, new systems and frameworks have been proposed for these new demands, and we discuss these new solutions.
We have divided the article into three main sections. First, we discuss the strengths, features, and shortcomings of the standard MapReduce framework and its de facto open source implementation, Hadoop. Then, we discuss real-time processing solutions. Afterwards, we discuss stream processing systems. At the end of the article, we give some experimental results comparing the discussed paradigms, and finally we draw a conclusion.
2. The MapReduce framework
Essentially, MapReduce is a programming model that enables large-scale and distributed processing of big data on a set of commodity machines. MapReduce defines computation as two functions: map and reduce. The input is a set of key/value pairs, and the output is a list of key/value pairs. The map function takes an input pair and returns a set of intermediate key/value pairs (which can be empty). The reduce function takes an intermediate key and a list of intermediate values associated with that key as its input, and returns a set of final key/value pairs as the output. Execution of a MapReduce program involves two phases. In the first phase, each input pair is given to a map function and a set of intermediate pairs is produced. In the second phase, all of the intermediate values that have the same key are aggregated into a list, and each intermediate key with its associated list of intermediate values is given to a reduce function. More explanation and examples are available in [2].
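To make the model concrete, the following minimal, framework-free sketch expresses the classic word-count example as a pair of map and reduce functions. The Java types and names here are ours, chosen only for illustration; they are not part of any particular MapReduce implementation.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountModel {

    // map: takes one input pair (e.g., byte offset -> line of text) and
    // emits an intermediate pair (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: takes an intermediate key (a word) and the list of all its
    // intermediate values, and emits one final pair (word, total count).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new AbstractMap.SimpleEntry<>(word, sum);
    }
}
```

Between the two phases, the framework groups all intermediate pairs by key, so each word reaches exactly one reduce call together with all of its counts.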
The execution of a MapReduce program obeys the same two-phase procedure. Distributed MapReduce is usually implemented using a master/slave architecture, in which the master machine is responsible for assigning tasks to and controlling the slave machines. A schematic for the execution of a MapReduce program is given in Figure 1. The input is stored on shared storage, such as a distributed file system, and is split into chunks. First, a copy of the map and reduce function code is sent to all workers. Then, the master assigns map and reduce tasks to workers. Each worker assigned a map task reads the corresponding input split, passes all of its pairs to the map function, and writes the results of the map function into intermediate files. After the map phase is finished, the reducer workers read the intermediate files and pass the intermediate pairs to the reduce function; finally, the pairs produced by the reduce tasks are written to the final output files.
2.1. Apache Hadoop
There are several MapReduce-like implementations for distributed systems, such as Apache Hadoop, Disco from Nokia, HPCC from LexisNexis, Dryad from Microsoft [6], and Sector/Sphere. However, Hadoop is the most well-known and popular open source implementation of MapReduce. Hadoop uses a master/slave architecture and follows the same overall execution procedure shown in Figure 1. By default, Hadoop stores input and output files on its distributed file system, HDFS. However, Hadoop provides pluggable input and output sources; for example, it can use NoSQL databases like HBase and Cassandra, or even relational databases, instead of HDFS.
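For concreteness, the listing below condenses the canonical word-count program from the Hadoop documentation, written against the org.apache.hadoop.mapreduce Java API. The input and output paths are placeholders taken from the command line; on a default installation they resolve to HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map task: tokenizes each input line and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce task: sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on mappers
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner is an optional local reduce step that shrinks the intermediate data before it is shuffled to the reducers.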
Figure 1: Execution of a MapReduce program [7]
Hadoop has numerous strengths, some of which come from the MapReduce model: an easy programming model, near-linear speedup and scalability, and fault tolerance are three major features. Beyond these, Hadoop itself provides extra features such as different schedulers, more sophisticated and complex job definitions using YARN, highly available master machines, and pluggable I/O facilities. Hadoop provides the basic platform for big data processing; for additional functionality, other solutions can be mounted on top of it [5]. Major examples are HBase for storing structured data in very large tables, Pig and Hive for data warehousing, and Mahout for machine learning.
Although Hadoop, i.e. standard MapReduce, has numerous strengths, it has several shortcomings too. MapReduce is inherently unable to execute recursive or iterative jobs [8]. Its totally batch behavior is another problem: all of the input must be ready before the job starts, which rules out online and stream processing use cases. The framework’s overhead for starting a job, such as copying code and scheduling, is another problem that prevents it from executing interactive jobs and near-real-time queries. MapReduce cannot run continuous computations and queries either. These shortcomings have triggered the creation of new solutions. Next, we discuss two types of solutions: i) solutions that try to add real-time processing capabilities to MapReduce, and ii) solutions that try to provide stream processing of big data.
3. Real-time big data processing
Solutions in this area can be classified into two major categories: i) solutions that try to reduce the overhead of MapReduce and make it fast enough to execute jobs within seconds, and ii) solutions that focus on providing means for real-time queries over structured and unstructured big data using new, optimized approaches. We discuss both categories in turn.
3.1. In-memory computing
The slowness of Hadoop is rooted in two major causes. First, Hadoop was initially designed for batch processing, so starting the execution of a job is not optimized for speed: scheduling, task assignment, code transfer to slaves, and job startup procedures are not designed to finish within seconds. The second cause is the HDFS file system. HDFS is designed for high-throughput data I/O rather than low-latency I/O. Data blocks in HDFS are very large and are stored on hard disk drives, which with current technology deliver transfer rates between 100 and 200 megabytes per second.
The first problem can be solved by redesigning the job startup and task execution modules. The file system problem, however, is inherently caused by hardware. Even if each machine is equipped with several hard disk drives, the I/O rate is still only several hundred megabytes per second. This means that if we store 1 terabyte of data on 20 machines, even a simple search over the data takes minutes rather than seconds. An elegant solution to this problem is in-memory computing. In a nutshell, in-memory computing is based on using a distributed main memory system to store and process big data in real time.
Main memory delivers much higher bandwidth, more than 10 gigabytes per second compared to a hard disk’s 200 megabytes per second, and far better access latency, nanoseconds versus milliseconds. The price of RAM is also affordable: currently, 1 TB of RAM can be bought for less than $20,000. This performance superiority, combined with the dropping price of RAM, makes in-memory computing a promising alternative to disk-based big data processing. A few in-memory computing solutions are available, such as Apache Spark [9], GridGain, and XAP. Among them, Spark is both open source and free, while the others are commercial.
We must mention that in-memory computing does not mean the whole dataset must be kept in memory. Even if a distributed pool of memory is used only for caching frequently used data, overall job execution performance can improve significantly. Efficient caching is especially effective when an iterative job is being executed. Both Spark and GridGain support this caching paradigm. Spark uses a primary abstraction called the Resilient Distributed Dataset (RDD), a distributed collection of items [10]. Spark can be easily integrated with Hadoop, and RDDs can be generated from data sources like HDFS and HBase. GridGain has its own in-memory file system, the GridGain File System (GGFS), which can work either as a standalone file system or in combination with HDFS, acting as a caching layer. In-memory caching can also help in handling huge streaming data that can easily overwhelm disk-based storage.
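As an illustration, the sketch below scans the same input twice using Spark’s Java API, in the Java 8 lambda style of releases later than the Spark 0.8 used in our experiments. The HDFS path and search words are placeholders; the point is that cache() pins the RDD in the distributed memory pool, so every pass after the first runs from memory.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachedGrep {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("cached-grep");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Build an RDD from an HDFS file and mark it for in-memory caching.
    JavaRDD<String> lines = sc.textFile("hdfs:///input/pages.txt").cache();

    // The first action reads from disk and fills the distributed cache ...
    long first = lines.filter(line -> line.contains("storm")).count();
    // ... later passes (as in iterative jobs) are served from memory.
    long second = lines.filter(line -> line.contains("spark")).count();

    System.out.println(first + " / " + second);
    sc.stop();
  }
}
```

This is exactly the pattern that benefits iterative algorithms: the expensive disk read is paid once, and every subsequent iteration works at memory speed.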
Another important point is the difference between in-memory computing and in-memory databases and data grids. Although in-memory databases like Oracle TimesTen and VMware GemFire, and in-memory data grids like Hazelcast, Oracle Coherence, and JBoss Infinispan, are fast and have important use cases today, they differ from in-memory computing. In-memory computing is a paradigm rather than a product. As its name implies, in-memory computing deals with computing too, in contrast to in-memory data solutions, which deal only with data. Hence, it must also take care of problems like efficient scheduling and moving code to data rather than moving data to code. Nevertheless, in-memory data grids can be used as building blocks of in-memory computing solutions.
3.2. Real-time queries over big data
The first work in the area of real-time ad hoc queries over big data is Dremel by Google [11]. Dremel uses two major techniques to achieve real-time queries over big data: i) a novel columnar storage format for nested structures, and ii) scalable aggregation algorithms for computing query results in parallel. These two techniques enable Dremel to process complex queries in real time. Cloudera Impala is an open source counterpart that tries to implement Dremel’s techniques. For this purpose, Impala has developed an efficient columnar binary storage format for Hadoop called Parquet and uses techniques from parallel DBMSs to process ad hoc queries in real time. Impala claims considerable performance gains over Apache Hive for queries with joins. Although Impala shows promising improvements over Hive, for long-running analytics and queries Hive remains a stable solution.
There are more solutions in this area. Apache Drill is another Dremel-like solution; however, Drill is not designed to be a Hadoop-only solution, and it provides real-time queries against other storage systems like Cassandra. Shark is another solution, built on top of Spark [12]. Shark is designed to be compatible with Apache Hive and can execute all queries that Hive can. Using the in-memory computing capability and fast execution engine of Spark, Shark claims up to 100x faster response times compared to Hive [12]. We should also mention the Stinger project by Hortonworks, an effort to deliver a 100x performance improvement and add SQL semantics to future versions of Apache Hive. A final mentionable solution is Amazon Redshift, a proprietary solution from Amazon that aims to provide a very fast platform for petabyte-scale warehousing.
4. Streaming big data
Data streams are now very common: log streams, click streams, message streams, and event streams are good examples. But the standard MapReduce model and its implementations like Hadoop are completely focused on batch processing. That is to say, before any computation starts, all of the input data must be completely available on the input store, e.g. HDFS. The framework processes the input data, and the output results become available only when all of the computation is done. Moreover, a MapReduce job execution is not continuous. In contrast to these batch properties, today’s applications have more stream-like demands, in which the input data is not completely available at the beginning but arrives continuously. Sometimes an application should also run continuously, e.g. a query that detects special anomalies in incoming events.
Although MapReduce does not support stream processing, it can partially handle streams using a technique known as micro-batching. The idea is to treat the stream as a sequence of small batch chunks of data: at short intervals, the incoming stream is packed into a chunk and delivered to the batch system for processing. Some MapReduce implementations, especially real-time ones like Spark and GridGain, support this technique. However, micro-batching does not satisfy all the demands of a true stream system, and the MapReduce model itself is not suitable for stream processing.
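Spark Streaming is a concrete implementation of this micro-batching idea, and the sketch below shows it through Spark’s Java API: the stream is cut into one-second chunks, and each chunk is processed by ordinary batch operators. The socket source, host, and port are placeholders for whatever feeds the stream.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class MicroBatchWordCount {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("micro-batch-wordcount");
    // Every second, buffered input is packed into one small batch (an RDD).
    JavaStreamingContext ssc =
        new JavaStreamingContext(conf, Durations.seconds(1));

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

    // Each operator below runs as a small batch job on every one-second
    // chunk, printing a word count per batch.
    lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
         .mapToPair(word -> new Tuple2<>(word, 1))
         .reduceByKey(Integer::sum)
         .print();

    ssc.start();               // begin consuming the stream
    ssc.awaitTermination();    // run until stopped
  }
}
```

The batch interval is the knob that trades latency for throughput: shorter intervals approach stream-like latencies but increase per-batch scheduling overhead.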
Currently, there are a few stream processing frameworks that are inherently designed for big data streams. Two notable ones are Storm from Twitter and S4 from Yahoo [13]. Both frameworks run on the Java Virtual Machine (JVM) and both process keyed streams, but their programming models differ. In S4, a program is defined as a graph of Processing Elements (PEs), and S4 instantiates one PE per key. In Storm, on the other hand, a program is defined by two abstractions: spouts and bolts. A spout is a source of a stream; spouts can read data from an input queue or even generate data themselves. A bolt processes one or more input streams and produces a number of output streams, and most of the processing logic is expressed in bolts. Each Storm program is a graph of spouts and bolts, called a topology [14], as the sketch below illustrates.
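The minimal sketch below wires these two abstractions into a topology using the backtype.storm Java API of the Storm releases contemporary with this article; the spout generates sentences itself, and all names and parallelism settings are illustrative.

```java
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordTopology {

  // Spout: a source of the stream; this one generates tuples itself.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
      this.collector = c;
    }

    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values("the quick brown fox"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: consumes the sentence stream and emits one tuple per word.
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getString(0).split(" ")) {
        collector.emit(new Values(word));
      }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    // shuffleGrouping spreads sentences randomly across bolt tasks; a
    // fieldsGrouping on "word" would key the stream by word instead.
    builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

    new LocalCluster().submitTopology("word-topology", new Config(),
                                      builder.createTopology());
  }
}
```

A production deployment would submit the same topology to a cluster rather than the in-process LocalCluster used here for testing.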
Considering the programming models, we can say that in S4 the program is expressed per key, while in Storm the program is expressed for the whole stream. Hence, programming for S4 has simpler logic, while programming for Storm is more complex but also more versatile. A major strength of Storm over S4 is fault tolerance. In S4, at any stage of processing, if the input buffer of a PE gets full, incoming messages are simply dropped. S4 also uses a checkpointing strategy for fault tolerance: if a node crashes, its PEs are restarted on another node from their latest checkpointed state, so any processing after that state is lost [13]. Storm, in contrast, guarantees processing of each tuple once the tuple successfully enters Storm. Storm does not store state for bolts and spouts, but if a tuple does not traverse the Storm topology within a predefined period, the spout that generated that tuple will replay it.
In fact, S4 and Storm take two different strategies. S4 proposes a simpler programming model and restricts the programmer in declaring the processing, but in exchange provides simplicity and more automated distributed execution, for example automatic load balancing. In contrast, Storm gives the programmer more power and freedom in declaring the processing, but in exchange the programmer must take care of things like load balancing, tuning buffer sizes, and parallelism levels to reach optimum performance. Each of Storm and S4 has its own strengths and weaknesses, but currently Storm is more popular and has a larger user community than S4. Given the demands and applications, streaming big data will certainly grow much more in the future.
5. Experimental results
We performed some simple experiments to better illustrate the discussed concepts. The experiments are aimed at showing the improvements of some of the mentioned systems over Hadoop. We do not intend to compare the performance of the technologies in detail; we just want to show that recent systems in the area of real-time big data are more suitable than Hadoop. We should mention that stream processing has its own paradigms, and it is not meaningful to compare stream processing solutions to Hadoop; therefore, we did not consider stream processing solutions in the experiments. For the case of real-time in-memory computing, we selected Spark, executed simple programs like WordCount and Grep, and compared the performance results to Hadoop. For the case of real-time queries over big data, a comprehensive benchmark has already been done by the Berkeley AMP Lab [15]; hence, for this category we just report a summary of that benchmark.
WordCount counts the occurrences of each word in a given text, and Grep extracts the strings matching a given pattern. WordCount is CPU intensive, while Grep is I/O intensive when a simple pattern is being searched. We searched for a simple word; hence, Grep is totally I/O intensive here. For this experiment, we used a cluster of 5 machines, each having two 4-core 2.4 GHz Intel Xeon E5620 CPUs and 20 GB of RAM. We installed Hadoop 1.2.0 and Spark 0.8 on the cluster. For running WordCount and Grep, we used an input text file of size 40 GB containing the text of about 4 million Persian web pages. For better comparison, we also executed the standard wc (version 8.13) and grep (version 2.10) programs of Linux on a single machine and report their times, too. Note that the wc command of Linux just counts the total number of words, not the occurrences of each word; hence, it performs a much simpler job. The results are reported in Figure 2.
Figure 2: Hadoop performance compared to Spark (execution time in seconds for WordCount and Grep, running Hadoop, Spark on disk, Spark in memory, and the standalone Linux wc/grep commands)
As the diagrams show, Spark outperforms Hadoop in both experiments, both when the input is on disk and when it is totally cached in RAM. In the WordCount problem, there is little difference between in-memory Spark and on-disk Spark. For the Grep problem, on the other hand, there is a significant difference between the in-memory and on-disk cases: when the input file is totally cached in memory, Spark executes Grep in about a second, while Hadoop takes about 160 seconds.
The Berkeley AMP Lab has compared several frameworks and benchmarked their response times on different types of queries, such as scans, aggregations, and joins, on different data sizes. The benchmarked solutions are Amazon Redshift, Hive, Shark, and Impala. The input data set is a set of HTML documents and two SQL tables. For better benchmarking, queries were executed with varying result sizes: i) BI-like results, which easily fit in a BI tool; ii) intermediate results, which may not fit in the memory of a single node; and iii) ETL-like results, which are so large that they require several nodes to store.
Two different clusters were used for this experiment: a cluster of 5 machines with a total of 342 GB of RAM, 40 CPU cores, and 10 hard disks for running Impala, Hive, and Shark, and a cluster of 10 machines with a total of 150 GB of RAM, 20 CPU cores, and 30 hard disks for running Redshift. Both clusters were launched on the Amazon EC2 cloud computing infrastructure. The Berkeley AMP Lab benchmark is very comprehensive [15]; in this article, we report only the aggregation query results. The executed aggregation query is similar to the SQL statement “SELECT * FROM foo GROUP BY bar”. The results are given in Figure 3.
As the results show, the new-generation solutions deliver promising performance improvements compared to the classic MapReduce-based solution, Hive. The only exception is the ETL-like query, on which Hive performs the same as Impala; this is because the result is so large that Impala cannot handle it properly. Nevertheless, the other solutions, Shark and Redshift, perform better than Hive for all result sizes.
Figure 3: The Berkeley AMP Lab's benchmark results for real-time queries (response time in seconds for Redshift, Impala on disk and in memory, Shark on disk and in memory, and Hive on disk, on the BI-like query with 2M groups and the ETL-like query with 253M groups)
6. Conclusion
Big data has become a trend, and several solutions have been provided for managing and processing it. The most popular are MapReduce-based solutions, and among them the Apache Hadoop framework is the most well-known. However, Hadoop is inherently designed for batch, high-throughput job execution and is suitable for jobs that process large volumes of data over a long time. In contrast, there are new demands, such as interactive jobs, real-time queries, and stream data, that cannot be handled efficiently by batch-based frameworks like Hadoop. These non-batch demands have resulted in the creation of new solutions. In this article, we discussed two categories: real-time processing and streaming big data.
In the real-time processing sector, there are two major kinds of solutions: in-memory computing, and real-time queries over big data. In-memory computing uses distributed memory storage that can serve either as a standalone input source or as a caching layer for disk-based storage. Especially when the input totally fits in distributed memory, or when the job makes multiple iterations over the input, in-memory computing can significantly reduce execution time. Solutions for real-time querying over big data mostly use custom storage formats and well-known techniques from parallel DBMSs for joins and aggregation, and hence can respond to queries within seconds. In the stream processing sector, there are two popular frameworks: Storm and S4. Each has its own programming model, strengths, and weaknesses. We discussed both frameworks and their superiority to MapReduce-based systems for stream processing.
We believe that solutions for batch and high-throughput processing of big data, like Hadoop, have reached an acceptable level of maturity. However, they are not suitable for non-batch requirements. Considering the high demand for interactive queries and big data streams, in-memory computing shines as a notable approach that can handle both real-time and stream requirements. Among the discussed frameworks, Spark is a good example:
it supports in-memory computing using RDDs, real-time and interactive querying through Shark, and stream processing through fast micro-batching. However, the future will tell which approaches become popular in practice.
7. References
[1] A. Jacobs, “The pathologies of big data,” Communications of the ACM, vol. 52, no. 8, pp. 36–44, 2009.
[2] J. Dean and S. Ghemawat, “MapReduce: a flexible data processing tool,” Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[3] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” ACM SIGOPS Operating Systems Review, vol. 37, no. 5, pp. 29–43, 2003.
[4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: a distributed storage system for structured data,” ACM Transactions on Computer Systems, vol. 26, no. 2, p. 4, 2008.
[5] T. White, Hadoop: The Definitive Guide. Yahoo Press, 2012.
[6] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” SIGOPS Operating Systems Review, vol. 41, no. 3, pp. 59–72, 2007.
[7] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” in Proceedings of the 6th Symposium on Operating Systems Design and Implementation, 2004.
[8] F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman, “Map-reduce extensions and recursive queries,” in Proceedings of the 14th International Conference on Extending Database Technology, 2011, pp. 1–8.
[9] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[11] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, “Dremel: interactive analysis of web-scale datasets,” Proceedings of the VLDB Endowment, vol. 3, no. 1–2, pp. 330–339, 2010.
[12] C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, “Shark: fast data analysis using coarse-grained distributed memory,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 689–692.
[13] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: distributed stream computing platform,” in Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), 2010, pp. 170–177.
[14] “Storm Homepage.” [Online]. Available: http://storm-project.net/. [Accessed: 01-Dec-2013].
[15] Berkeley AMP Lab, “Big Data Benchmark.” [Online]. Available: https://amplab.cs.berkeley.edu/benchmark/. [Accessed: 01-Dec-2013].