
BigDataStack: A Holistic Data-Driven Stack for Big Data Applications and Operations

Dimosthenis Kyriazisa, Christos Doulkeridisa, Panagiotis Gouvasb, Ricardo Jimenez-Perisc, Ana Juan Ferrerd,
Leonidas Kallipolitise, Pavlos Kranasc, George Kousiourisa, Craig Macdonaldf, Richard McCreadief, Apostolos
Papageorgioug, Marta Patino-Martinezh, Stathis Plitsosi, Dimitris Poulopoulosa, Antonio Paradelld, Paula Ta-
Shmaj, Constantinos Vassilakise, Valerio Vianellog
a University of Piraeus, Piraeus, Greece
b Danaos Shipping, Athens, Greece
c LeanXcale, Madrid, Spain
d Atos Research and Innovation, Madrid, Spain
e Athens Technology Center, Athens, Greece
f University of Glasgow, Scotland, United Kingdom
g NEC Laboratories Europe, Heidelberg, Germany
h Universidad Politécnica de Madrid, Madrid, Spain
i Danaos Shipping, Athens, Greece
j IBM Research, Haifa, Israel
Abstract—The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. In this context, emerging innovative solutions exploit several underlying infrastructure and cluster management systems. However, these systems have not been designed and implemented in a "big data context"; rather, they emphasize and address the computational needs and aspects of the applications and services to be deployed. In this paper we present the architecture of a complete stack (namely BigDataStack), based on a frontrunner infrastructure management system that drives decisions according to data aspects, thus being fully scalable, runtime adaptable and high-performant to address the needs of big data operations and data-intensive applications. Furthermore, the stack goes beyond purely infrastructural elements by introducing techniques for the dimensioning of big data applications, modelling and analysis of processes, as well as the provision of data-as-a-service exploiting a proposed seamless analytics framework.

Keywords—big data architectures; infrastructure management; big data as a service
I. INTRODUCTION

It is a fact that we live in a data-driven society. By 2020, the world will generate 50 times today's information, creating the "Digital Universe" [1], which grows exponentially since the data creation growth rate is between 40% and 60% according to the OECD [2]. The result of this data explosion is
apparent in all domains of everyday life, ranging from user-
generated content of around 2.5 quintillion bytes every day
[3] to applications in healthcare [4], transportation, logistics
and retail. In all domains, data is the key, not only by adding
value and increasing the efficiency of existing solutions but
also by opening new opportunities and facilitating new
functionalities in these domains. What is more, the value of
data goes beyond their utilization in data-driven applications:
data per se has value and thus emerging business models
aim at exploiting them, with Data Exchange [5] being a
representative example. In this data-driven world,
infrastructures play a critical role. It is expected that more data will be uploaded to and downloaded from various infrastructures in the coming years [6], which surprisingly is not the case nowadays. In this context, enhanced infrastructure capabilities are "a must in data centres" for most customers going forward in 2017 and beyond [7]. The goal for infrastructures is clear: go beyond storing, processing and offering data, to enabling optimum data service provisioning by turning the underlying infrastructures into enhanced data-driven and data-oriented environments.
These environments should also include mechanisms for
runtime adaptations across the complete data path and
service lifecycle, since as the data evolve (including sources,
formats, rates, etc), the needs will also change (likely
increase), triggering required adjustments to the
infrastructure [8]. These items have also been characterised as key priorities in the BDVA Strategic Research and Innovation Agenda (SRIA), in the context of the "Data Processing Architectures" priority [9].
Furthermore, the data explosion is the result of a
continuous increase in devices located at the periphery of the
network including internet connected objects, embedded
sensors, smartphones, and tablets [1]. Thus, future data-
driven architectures should also account for that given the
potential incompleteness of data as well as the need for real-
time cross-stream processing. Additionally, given the value of data for applications in different domains as well as the business value of the data per se, the need is for solid end-to-end data environments providing a reliable source of data to guide (business) decisions [8], as well as for business and process analytics frameworks. Such frameworks will exploit analytics and processing frameworks for predictive and prescriptive analytics, facilitating event and pattern discovery as well as deep learning for business intelligence. The latter has been identified as a key priority in the BDVA SRIA, in the context of the "Data analytics" priority [9].
Another relevant key priority is “Data Management” [9]:
Besides the aforementioned underlying holistic infrastructure
/ environment, an important aspect relates to the data
provisioning (both as data sets and as data-intensive
applications). Raw data has limited value. The need is for
data functions enabling data to be cleaned, modelled,
represented, stored and analysed. For example, regarding data cleaning, it is a fact that without attention to data quality, the right data are not available in the right place at the right time. Modelling and interoperability are considered additional challenges, given that data are to be analysed by different processing frameworks, both in flight and at rest.
Based on the above, we propose a data-driven
architecture, which aims at ensuring that infrastructure
management will be fully efficient and optimized for data
operations and data-intensive applications. As a holistic
solution, the architecture also incorporates approaches that
range from data-focused application analysis and
dimensioning, process modelling, management and runtime
optimization, to information-driven networking. Moreover, the architecture introduces a toolkit, which allows the specification of analytics tasks in a declarative way and their efficient integration and execution on top of the proposed infrastructure management system.
The remainder of the paper is structured as follows: Section II introduces the key elements of the proposed architecture, while Section III reviews related work and highlights the advancements of our approach. Section IV describes the phases of the overall architecture, which is presented thereafter in Section V. The paper concludes with a summary of the presented architecture and a discussion on future work and potential.
II. KEY ELEMENTS OF THE PROPOSED ARCHITECTURE

While the majority of the approaches for data operations and data-intensive applications (e.g. Hadoop, Spark, Hive, etc.) "run on top" of typical infrastructure management systems (e.g. Mesos, Docker, OpenStack, etc.), BigDataStack
provides a data-driven infrastructure management system
that is fully efficient and optimized for data operations,
managing resources according to data-based decisions. The
goal is to provide Data as a Service as an optimum offering
on top of an environment being managed through data-
driven decisions, turning raw data into valuable knowledge
through data functions across the complete data path.
BigDataStack offerings are depicted through a full
“stack” that aims at not only facilitating the needs of data
operations and applications (all of which tend to be data-
intensive) but facilitating these needs in an optimum way. It
is based on an infrastructure management system that bases
the management and deployment decisions on data aspects.
A representative example would be that a service-defined deployment decision (the current approach) may result in VMs being deployed in the same physical host for time-critical operations (e.g. real-time stream processing). The BigDataStack approach will instead base the decision on data aspects (e.g. generation rates, transfer bottlenecks, etc.), which in this example could be that the data to be aggregated and processed emerge from a large number of distributed sources. Thus, the bottleneck is in retrieving the data from the sources. To this end, the BigDataStack infrastructure management system would propose a data-driven deployment decision resulting in containers/VMs placed in geographically distributed physical hosts. This simple case shows that the trade-off between service- and data-based decisions on the management layer
should be re-examined nowadays due to the increasing
volumes of data. This is the first core element of
BigDataStack: efficient and optimized infrastructure
management (including all aspects of management for the
computing, storage and networking resources).
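The service- versus data-driven trade-off above can be pictured as a simple decision rule. The following sketch is purely illustrative: the data characteristics, threshold and placement labels are assumptions, not BigDataStack's actual policy.

```python
from dataclasses import dataclass

@dataclass
class DataProfile:
    """Hypothetical data aspects a data-driven scheduler could inspect."""
    n_sources: int              # number of distributed data sources
    generation_rate_mbps: float # aggregate data generation rate
    transfer_bottleneck: bool   # is retrieving the data the bottleneck?

def place(profile: DataProfile, time_critical: bool) -> str:
    """Data-driven rule: many distributed sources plus a transfer bottleneck
    favour geographically distributed hosts, even for time-critical work;
    a purely service-driven rule would co-locate regardless."""
    if profile.transfer_bottleneck and profile.n_sources > 10:
        return "distributed-hosts"
    if time_critical:
        return "co-located-host"
    return "default-scheduler"

# Many distributed sources with a transfer bottleneck -> distribute the workers.
print(place(DataProfile(500, 80.0, True), time_critical=True))  # distributed-hosts
```

The same inputs under a service-driven policy would yield co-location; the data profile is what flips the decision.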
The second core element of BigDataStack exploits the
underlying data-driven infrastructure management system in
order to provide Data as a Service in a performant, efficient
and scalable way. Data as a Service will incorporate a set of
technologies addressing the complete data path: modelling
and representation, cleaning, aggregation, and data
processing (including seamless analytics, real-time Complex
Event Processing - CEP, and process mining). The
distributed storage is realized through a layer enabling data to be fragmented and stored according to different access patterns, allowing a logical database to express how it should be split into fragments. Advanced modelling will be provided to enable the definition of flexible schemas for both data in flight and at rest, which can be exploited across multiple processing frameworks. These schemas will then be
utilized by the introduced seamless data analytics framework
that analyses data in a holistic fashion across multiple data
stores and locations, and operates on data irrespective of
where and when it arrives to the framework. A cross-stream
processing engine will be provided that can be executed in
federated environments. The engine will consider the
latencies across data centres, the locality of data sources and
data sinks and produce a partitioned topology that will
maximize the performance.
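As a rough illustration of how such an engine might weigh inter-data-centre latencies and source locality, the greedy sketch below assigns an operator to the candidate site with the lowest total latency to its data sources. The site names and latency figures are invented for the example; the actual engine's partitioning logic is not specified here.

```python
# Hypothetical inter-data-centre latencies in milliseconds.
latency = {
    ("eu", "eu"): 1, ("eu", "us"): 90,
    ("us", "eu"): 90, ("us", "us"): 1,
}

def assign_operator(source_sites, candidate_sites):
    """Greedy stand-in for a partitioned-topology computation: place the
    operator at the site minimising total latency to all of its sources."""
    def cost(site):
        return sum(latency[(src, site)] for src in source_sites)
    return min(candidate_sites, key=cost)

# Two of three sources are in "eu", so the operator lands there.
print(assign_operator(["eu", "eu", "us"], ["eu", "us"]))  # eu
```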
The third core element of the proposed architecture,
namely the Data Toolkit, aims at openness, extensibility and
wide adoption. The toolkit will allow the ingestion of data
analytics functions and the definition of analytics in a
declarative way, providing at the same time “hints” towards
the infrastructure / cluster management system for the
optimized management of these analytics tasks. Furthermore,
the toolkit will allow data scientists and administrators to
specify requirements and preferences both for the
infrastructure management (e.g. application requirements)
and for the data management such as data quality goals, or
information aggregation “levels”.
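A declarative task description with "hints" and requirements, as described above, might look like the following. The field names and values are hypothetical, sketched only to make the idea concrete; they are not the actual Data Toolkit schema.

```python
import json

# Hypothetical declarative analytics task: the analytics function, scheduling
# hints for the cluster manager, and data-management requirements.
task = {
    "name": "churn-prediction",
    "analytics": {"function": "gradient_boosting", "inputs": ["customers", "events"]},
    "hints": {"parallelism": 8, "memory_gb": 16, "prefer_colocation_with": "customers"},
    "requirements": {"data_quality": {"completeness": 0.95}, "aggregation_level": "daily"},
}

# The toolkit could serialise this for ingestion into the analytics catalogue.
print(json.dumps(task, indent=2))
```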
The Process Modelling element provides a framework
allowing for flexible modelling of process analytics in order
to enable their execution. Functionality-based process
modelling will then be concretized to technical-level process
mining analytics, while a feedback loop will be implemented
towards overall process optimization and adaptation.
Finally, the architecture includes the Dimensioning
Workbench, which aims at enabling the dimensioning of
applications in terms of predicting the required data services,
their interdependencies with the application micro-services
and the required underlying resources.
III. RELATED WORK

A. Application performance analysis and dimensioning
Application analysis is a key step for enabling any type
of optimization on the infrastructure level. As a result,
several models have been proposed for analysing application
behaviour. In particular, in order to meet scaling requirements, an analysis model is presented in [11], which considers an application as a set of parallel or pipelined tasks that construct a set of jobs (a term related to the concept of a workflow), while in [12] a probability-theory model is based mostly on the application deployment architecture, working together with a queueing model for handling the incoming requests. To this end, probabilistic models are mostly helpful in cases where predicting behaviours, requests or performance-related metrics is required, and are thus used in various other works, as in [13], where semi-Markov models are deployed to represent an aspect of the application under analysis. Regarding edge applications, in
[14] an abstract model of application execution is introduced,
while various performance metrics and trade-offs amongst
them are examined. For analysis of data-intensive
applications, [15] suggests a two-way modelling method in
which models are considered for both the data-intensive
computing environment (in terms of resources), as well as
the application itself (in terms of concurrent jobs), for
applying a scheduling logic. For a similar type of application, [16] analyses different execution logs and footprints from the developers, while using framework-related models to obtain a description of an application. [17] attempts to solve the problem of allocating scientific data-intensive workloads with respect to data transfer and execution times, by considering a workload to be a set of tasks that construct a directed acyclic graph. Authors in [18]
present an optimization approach that doesn’t rely on the
infrastructure resources. The construction of an adaptive
framework built on the basis of MapReduce is proposed,
using feedback and stochastic learning controls for
parameterizing splitters, mappers and reducers which are
runtime adaptive. This idea of automatically tuning the parameters of MapReduce implementations is followed in [19], where statistics-based techniques (regression models) are used to forecast performance under different Hadoop configurations. Thus, monitored metrics are used to
model workloads, while causal relations between workload
metrics and configurations are statistically identified. In this
paper, we propose an application dimensioning workbench
to identify dependencies between application components,
between such components and data services, as well as
between data services. The workbench includes a load generator / injector to facilitate benchmarking with different sets of input parameters and, based on these, to identify and capture the dependencies and their impact. The workbench
also incorporates performance prediction techniques for the
required infrastructure resources related to the application as
well as to the data services (e.g. cleaning). For the
dimensioning of the latter, analysers of data generation
patterns and query executions are also part of the workbench.
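A minimal sketch of the kind of performance prediction the workbench could apply, assuming the load injector produces (load, resource-usage) pairs and a simple linear model is fitted to them, in the spirit of the regression techniques of [19]. The benchmark numbers are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b: a stand-in for the
    statistics-based performance models trained on benchmark runs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic load-injector results: injected requests/s vs. CPU cores consumed.
loads = [100, 200, 400, 800]
cores = [1.1, 2.0, 4.1, 7.9]
a, b = fit_line(loads, cores)

# Extrapolate the resource need at a load not yet benchmarked.
print(round(a * 1000 + b, 1))
```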
B. Infrastructure and cluster management
Cloud-based infrastructures, containers, microservices,
and new programming platforms are dominating the media
and sweeping across IT departments around the world [20].
Containers as a Service (CaaS) [21] is becoming the new
Platform as a Service (PaaS). With the increased interest in
containers and microservices among developers, cloud
providers are capitalizing on the opportunity through hosted
container management services, as there exist clear
indications of the growth of container technology in
organizations. In general, containers are a more lightweight
virtualization concept [22] which can be seen as more
flexible tools for packaging, delivering and orchestrating
both software services and applications. However,
controlling a vast deployment of containers presents some
complications, such as that containers must be matched with
resources [22]. These challenges led to a concurrent demand
for management and orchestration tools, addressing the
needs for the management of a group of clusters, nodes’
monitoring, better services configuration and management of
the entire cluster server. Among the most popular cluster management products is Docker Swarm [23], which clusters a number of Docker engines into one virtual engine. In short, every host runs a Swarm agent and a manager
that handles the operation and scheduling of containers. The
Swarm manager creates several masters and specific rules for
leader election, which are implemented in the event of a
primary master failure. CoreOS [24] leverages Linux
containers to handle services at a higher abstraction level,
providing advantages similar to virtual machines, but with
the concentration on applications rather than complete
virtualized hosts. Thus, every machine in the cluster runs an agent and an engine that is active at all times.
Kubernetes [25] manages containerized applications across
many different hosts, providing tools for deployment,
scalability and applications’ maintenance. It uses pods that
act as groups of containers and are scheduled and deployed
at the same time, while most pods have up to five containers
that make up a service. In addition, Apache Mesos [26]
focuses on the effective isolation of resources and sharing of
applications across distributed networks or frameworks. It
operates as an abstraction layer for computing elements, running on every machine, with one machine designated as the master managing all the others. Mesos uses a system of agent nodes to run tasks. The agents send a list of available resources to a master that distributes tasks to the agents.
Furthermore, Bright [27] enables the management of a private infrastructure as a single entity, provisioning the hardware, operating system, and cloud framework from a single interface. Continuuity Loom [28] provisions, manages, and scales clusters. In short, clusters created with Loom utilize templates of any hardware and software stack, from simple standalone servers and traditional application servers to full Apache Hadoop clusters comprising thousands of nodes.
Google's Borg system [29] combines admission control,
efficient task-packing, over-commitment, and machine
sharing with process-level performance isolation. It supports
high-availability applications with runtime features that
minimize fault-recovery time, and scheduling policies that
reduce the probability of correlated failures. Quasar [30] is
able to increase resource utilization while providing
consistently high application performance. In short, users can
express performance constraints for each workload, letting
Quasar determine the right amount of resources to meet these
constraints at any point. In [31], the authors present an architecture that increases the persistence and reliability of automated infrastructure management, built upon the Chef configuration management system and infrastructure-as-a-service resources from Amazon Web Services. Authors in
[32] present a container-based cluster management platform, where virtualization technology is assimilated with a resource and job management system to expand its applicability, with Docker and HTCondor interlocked with each other. To this end, in [33] the authors present a framework for rapid deployment and management of clusters. Instead of focusing only on pure computation tasks on homogeneous clusters (i.e. clusters with identically set up
nodes), this framework aims to ease the configuration of
heterogeneous clusters and to provide an object-oriented API
for low-latency distributed computing.
In the proposed infrastructure management system all
decisions including resource allocation, admission control,
scaling, orchestration, runtime migration and adaptations,
will be data-driven. This data-driven approach will
incorporate the interdependencies between compute, storage
and networking resources to drive resource management
decisions. For different target resource characteristics a
flexible infrastructure deployment approach for containers,
small-footprint or full-fledged VMs will be triggered.
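The flexible deployment approach described above could be triggered by a rule along the following lines. The thresholds and flavour labels are assumptions made for illustration only.

```python
def deployment_flavor(isolation_required: bool, footprint_mb: int) -> str:
    """Map target resource characteristics to a deployment flavour:
    containers when strong isolation is not needed, small-footprint VMs
    for light isolated workloads, full-fledged VMs otherwise."""
    if not isolation_required:
        return "container"
    return "small-footprint-vm" if footprint_mb <= 512 else "full-fledged-vm"

print(deployment_flavor(isolation_required=True, footprint_mb=256))  # small-footprint-vm
```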
C. Dynamic orchestration of resources & data operations
Orchestration (as an infrastructure management service)
refers to the enablement and the coordinated handling of
various optimizations inside the platform. Examples of such
optimizations are the placement (or allocation) of tasks to
computing resources, decisions regarding parallelization
degrees of parallelizable tasks/services, load balancing,
algorithm selection, and more. In the state of the art, such
optimizations are handled in a way that we call “service-
driven”. This means that the optimization functions, the
criteria, and the system setup are built around basic features
such as CPU power, network bandwidth, and task/service
requirements in order to optimize certain metrics (e.g.
latency) with regard to the examined service. For example,
Pietzuch et al [34] transform latency- and load-related
metrics into distances in a “cost space” and then apply a
placement algorithm which minimizes the cost in this cost
space, while Cardellini et al [35] attempt to optimize
placement in Stream Processing Frameworks based on
topology traffic and node capabilities. Similarly, Xing et al
[36] approach the load-balancing optimization problem
based on the goal of minimizing overload situations and end-
to-end latency. Various other works for such optimizations
can be found in the survey of [37], which focuses on Stream
Processing but is applicable to many other platforms. All
these works exploit obvious known synergies such as the fact
that running tasks on low-CPU nodes can increase
processing time or the fact that overloading certain links can
create bottlenecks. Apart from the fact that such concrete
optimizations have rarely been handled homogeneously or
investigated in a common context, there are also gaps
towards making them data-driven rather than service-driven.
To support data-driven overall orchestration and data-
driven solutions of specific optimization problems (e.g.
placement), we propose techniques to identify the synergies
between characteristics of data analytics and system KPIs,
e.g. functions that represent how data I/O volumes affect the
CPU-intensity of certain tasks, etc. BigDataStack will provide the basis for using such data-related synergies for all system orchestration aspects, by defining machine-readable profile specifications for homogeneously profiling all entities (e.g. nodes, algorithms, network links, application tasks) that are involved in data-driven orchestration. A rule-based approach
for triggering runtime optimizations based on the monitored
data will also be delivered.
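The rule-based triggering of runtime optimizations could be sketched as a list of (predicate, action) pairs evaluated against monitored metrics. The metric names, thresholds and action labels below are illustrative assumptions.

```python
# Each rule pairs a predicate over monitored metrics with an adaptation action.
rules = [
    (lambda m: m["cpu"] > 0.9, "scale-out"),
    (lambda m: m["input_rate_mbps"] > m["processing_rate_mbps"], "increase-parallelism"),
    (lambda m: m["cpu"] < 0.2, "scale-in"),
]

def triggered_actions(metrics):
    """Return every optimization whose triggering condition holds."""
    return [action for predicate, action in rules if predicate(metrics)]

# Overloaded CPU plus an input rate exceeding the processing rate
# triggers both scale-out and increased parallelism.
print(triggered_actions({"cpu": 0.95, "input_rate_mbps": 120,
                         "processing_rate_mbps": 80}))
```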
IV. PHASES OF THE OVERALL ARCHITECTURE

The concept of BigDataStack is reflected in three main phases: Dimensioning, Deployment and Operation. Given
the data services (e.g. data cleaning, data aggregation, etc)
and the application components (micro-services of a data-
intensive application), the goal of the dimensioning phase is
to provide insights regarding the required infrastructure
resources for these data services and application components.
Additional information during the dimensioning phase may be obtained from the big data practitioners and programmers, regarding potential preferences and constraints they specify through the BigDataStack Data Toolkit, as well as the modelling of processes (that will provide input to the process mining tasks) through the Process Modelling Framework.
The goal of the deployment phase is to deliver the optimum
deployment patterns for the data and application services, by
considering the resources and the interdependencies between
application components and data services (based on the
dimensioning phase outcomes). The operation phase
facilitates the provision of data services including
technologies for resource management, monitoring and
evaluation towards runtime adaptations.
A. Dimensioning phase
The dimensioning approach of BigDataStack aims at
optimizing the provision of data services and data-intensive
applications by understanding not only their data-related
requirements (e.g. related data sources, storage needs, etc)
but also the data services requirements across the data path
(i.e. what services are required for data representation, data
aggregation, etc). In this context, the dimensioning approach
includes a two-step phase that is realized through the
BigDataStack Application Dimensioning Workbench. In the
first step, the composite application (consisting of a set of
micro-services) is analysed to identify the required data
services. The example illustrated in Fig. 1 shows that 3 out of 5 application components require specific data services for aggregation and analytics. The second step is to dimension these identified / required data services as well as all the application components in terms of their infrastructure resource needs, exploiting a load injector that generates different loads to benchmark the services and analyse their resource and data requirements (e.g. volume, generation rate, legal constraints, etc.).
Figure 1. Dimensioning of application and data services.
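The first step of the dimensioning phase can be pictured as a scan over component descriptors to find those needing data services. The component names and service needs below are invented, merely echoing the 3-out-of-5 example; the second step would then benchmark each surviving entry with the load injector.

```python
# Hypothetical component -> required data services mapping (step 1).
components = {
    "ingest-api":  ["aggregation"],
    "dashboard":   [],
    "recommender": ["aggregation", "analytics"],
    "billing":     [],
    "alerting":    ["analytics"],
}

# Keep only the components that require at least one data service.
requiring = {name: svcs for name, svcs in components.items() if svcs}
print(f"{len(requiring)} of {len(components)} components require data services")
```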
B. Deployment phase
The deployment approach of BigDataStack aims at
identifying the optimum deployment patterns and practices
for both data services and applications. The need for such
optimization emerges from the fact that all services to be
deployed have interdependencies that need to be considered
to increase the speed of all data-related operations. To this
end, the deployment approach of BigDataStack includes a
four-step phase and is realized through the Deployment
mechanism of BigDataStack infrastructure / cluster
management system. In the first three steps different
interdependencies are identified and captured: (i) between
different application components, (ii) between application
components and data services, and (iii) between data services
(e.g. between aggregation and storage services). Following
the identification and analysis of these interrelations and
their impact in terms of computation, storage and networking
resources, optimum deployment patterns (considering data
characteristics such as volumes, application components and
data services I/O rates) will be compiled as shown in Fig. 2.
Figure 2. Deployment patterns compilation.
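Assuming the three kinds of captured interdependencies form an acyclic graph, compiling a deployment pattern can be sketched as a topological ordering that deploys dependencies first. The example graph and service names are illustrative, not a BigDataStack artefact.

```python
from collections import defaultdict

# Edges capture the three interdependency kinds identified in steps (i)-(iii).
edges = [
    ("web", "recommender"),        # app component -> app component
    ("recommender", "analytics"),  # app component -> data service
    ("analytics", "storage"),      # data service  -> data service
]

def deployment_order(edges):
    """Depth-first topological sort: each service appears after everything
    it depends on, yielding a deployable sequence for the pattern."""
    deps, nodes = defaultdict(set), set()
    for a, b in edges:
        deps[a].add(b)
        nodes |= {a, b}
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for d in deps[n]:
            visit(d)
        order.append(n)
    for n in sorted(nodes):
        visit(n)
    return order

print(deployment_order(edges))  # ['storage', 'analytics', 'recommender', 'web']
```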
C. Operations phase
The operation approach of BigDataStack is realized
through different components of the BigDataStack
infrastructure management system and aims at the
management of the complete physical infrastructure
resources in an optimized way for data-intensive
applications. The approach includes a seven-step process, as depicted in Fig. 3: (i) computing and storage resources
are identified (i.e. sizing) following the dimensioning phase
outcomes, (ii) distributed storage resources are allocated
considering also the computing resources allocated in the
previous step, (iii) data-driven networking functions are
compiled and deployed in order to facilitate the diverse
networking needs between different computing and storage
resources, (iv) the application components and data services
are deployed and orchestrated based on “combined” data-
and application- aware orchestration templates, (v) data
analytics tasks are distributed across different data stores,
(vi) monitoring data is collected and evaluated for the
resources (computing, storage, network), the data services
(e.g. query execution status) and the application components,
and (vii) runtime adaptations take place for all elements of
the environment including resource re-allocation, storage and
analytics re-distribution, re-compilation of network functions
and deployment patterns, and live-migration.
Figure 3. Operation and runtime adaptation of the complete environment.
V. OVERALL ARCHITECTURE

The phases described above are realized through a set of mechanisms, which are presented in Fig. 4. The raw data are
ingested through the Gateway & Unified API component to
the Storage engine of BigDataStack, which enables storage
across different resources and supports data migration across
the infrastructure. The engine includes both stores for
relational and non-relational data, as well as an object store
and the proposed CEP engine for streaming data processing.
The raw data are obtained by the Data Cleaning component to enhance their quality in terms of completeness, accuracy and volatility; thereafter, the cleaned data are obtained by the Data Modelling framework in order to be modelled (using flexible schemas to describe streaming and stored data) and annotated with metadata.
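The cleaning-to-modelling hand-off can be sketched minimally: drop incomplete records, then wrap the cleaned batch with a flexible schema and metadata. All record fields, schema entries and metadata keys are illustrative assumptions.

```python
# Hypothetical raw sensor readings; one record is incomplete.
raw = [
    {"ts": 1, "temp": 21.5},
    {"ts": 2, "temp": None},   # incomplete -> removed by cleaning
    {"ts": 3, "temp": 22.1},
]

def clean(records, required=("ts", "temp")):
    """Data Cleaning stand-in: keep only records with all required fields."""
    return [r for r in records if all(r.get(f) is not None for f in required)]

def model(records):
    """Data Modelling stand-in: annotate a batch with a schema and metadata."""
    return {"schema": {"ts": "int", "temp": "float"},
            "metadata": {"count": len(records)},
            "data": records}

batch = model(clean(raw))
print(batch["metadata"]["count"])  # 2
```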
Given the stored data, data owners and decision makers
are able to model their complete application / data processing
chains through the Process modelling framework that
incorporates two main components. The first of them is the
Process modelling, which provides an interface to business
users to model their business processes and workflows as
well as to obtain recommendations for their optimization
following the execution of process mining tasks on the
BigDataStack analytics framework. The outcome of the
component is a model in a structural representation (e.g.
YAML or JSON). The second component is the Process
mapping that maps in an automated way the specified
business processes to specific mining and analytics tasks to
be deployed and executed (obtained by a Catalogue of
analytics tasks). The outcome of this component is a specific
list of analytics tasks (in the form of a graph) that is passed
to the Application dimensioning workbench in order to
identify their resource requirements prior to execution.
Moreover, this list / graph is provided to the Data toolkit.
While the aim of the toolkit is to enable data scientists to
specify their analytics tasks and ingest them in the catalogue
of analytics tasks, it also includes a component that exploits
the preferences and requirements by the users and along with
the mapped list / graph of processes creates a “playbook” as
a deployable concrete graph of services. The playbook also includes pre-deployment configuration of the services (e.g. the "k" for k-means clustering).
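A playbook of the kind described above, i.e. a deployable concrete graph of services carrying pre-deployment configuration such as the "k" for k-means, might be serialised as follows. The structure and field names are assumptions for illustration.

```python
import json

# Hypothetical playbook: services form a graph via "next" links, and each
# service may carry pre-deployment configuration.
playbook = {
    "services": [
        {"name": "cleaning", "next": ["clustering"]},
        {"name": "clustering",
         "config": {"algorithm": "k-means", "k": 5},
         "next": []},
    ],
    "preferences": {"data_quality": "high"},
}

# Round-trip through JSON, as the toolkit might hand it to the deployment layer.
print(json.loads(json.dumps(playbook))["services"][1]["config"]["k"])  # 5
```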
Figure 4. BigDataStack overall conceptual architecture.
The flow continues with the Application dimensioning
workbench, given that BigDataStack infrastructure
management needs to obtain information about the resource
needs of the application and the corresponding data services.
Initially, the required data services are identified (by the Data services identification component) and the interdependencies among them, as well as among the application components, are determined (by the Interdependencies identification and analysis mechanism).
The second step is to dimension the application components
and the identified data services (through the corresponding
Application & Data services dimensioning), in order to
identify the necessary infrastructure resources in terms of
computing, storage and networking.
The outcome of the application dimensioning workbench
is relayed to the Realization engine, which includes a set of
sub-components for deployment and orchestration. Based on
the obtained dimensioning outcomes (i.e. resource needs and
dependencies) and the availability of resources (information
received from the Resource management engine),
deployment patterns are compiled and used for deployment
on the selected resources (through the Application & data
services deployment mechanism), along with potential
configuration parameters obtained by the Holistic services
configuration component. The Dynamic orchestrator
performs orchestrations of the application and data services
on the allocated resources. The execution of the data
analytics tasks (triggered by the Dynamic orchestrator), is
performed on the Seamless data analytics framework that
facilitates analytics across multiple resources and locations
for both data in flight and at rest.
During runtime, several types of monitoring data are
collected and analysed. The latter is performed by the Triple
monitoring engine, which collects data from different
sources: infrastructure resources (e.g. resource utilization,
data source generation rates and windows), application
components (e.g. application metrics, data flows across
application components) and data functions / operations
(e.g. progress of data analytics and queries, storage
distribution).
The collected monitoring data are evaluated through a
QoS evaluation component to identify events / facts that
affect the overall quality of service. These events / facts are
exploited by the Runtime adaptation engine, which includes
a set of components (i.e. Cluster resources re-allocation,
Storage and analytics re-distribution, Network functions re-
compilation, Live-migration, Application and data services
re-deployment, Dynamic orchestration patterns) to trigger
the corresponding runtime adaptations for all infrastructure
elements. These adaptations are aligned with the elasticity
model compiled from the Application dimensioning
workbench during the analysis of the application.
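The chain from triple monitoring, through QoS evaluation, to the selection of a runtime adaptation can be sketched as below. The metric names, SLO thresholds and the event-to-adaptation mapping are assumptions chosen for illustration; only the adaptation component names come from the architecture:

```python
# Illustrative sketch: evaluate QoS over samples from the three
# monitoring levels and map violations to runtime adaptations.
def evaluate_qos(samples, slo):
    """Return an event for every monitored metric exceeding its SLO."""
    return [
        {"source": s["source"], "metric": s["metric"], "value": s["value"]}
        for s in samples
        if s["metric"] in slo and s["value"] > slo[s["metric"]]
    ]

ADAPTATIONS = {  # hypothetical event -> adaptation component mapping
    "cpu_util": "Cluster resources re-allocation",
    "query_latency_ms": "Storage and analytics re-distribution",
}

samples = [  # one sample per monitoring level (triple monitoring)
    {"source": "infrastructure", "metric": "cpu_util", "value": 0.95},
    {"source": "application", "metric": "request_rate", "value": 410},
    {"source": "data-operations", "metric": "query_latency_ms", "value": 1200},
]
slo = {"cpu_util": 0.85, "query_latency_ms": 500}

events = evaluate_qos(samples, slo)
actions = [ADAPTATIONS[e["metric"]] for e in events]
```

In the actual architecture the chosen adaptation would additionally be constrained by the elasticity model produced by the Application dimensioning workbench, which this sketch omits.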
Finally, the architecture includes the so-called Global
decision tracker, which stores information from all
components about the decisions they take, for use in future
optimizations. The key rationale for the introduction of this
component is the fact that decisions have a cascading effect
in the proposed architecture. For example, a dimensioning
decision affects the deployment patterns compilation, the
distribution of storage and analytics, etc. Information on
whether these decisions are altered during runtime is
exploited, through the decision tracker, to optimize future
decisions across all components. Thus, the tracker provides
the basis for cross-component optimization.
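A minimal sketch of the decision-tracker idea follows: record each component's decisions, flag those later altered at runtime, and expose an alteration rate that future decisions could consult. The interface and metrics are assumptions, not the tracker's actual API:

```python
# Illustrative Global decision tracker sketch: log decisions per
# component and track which were overturned by runtime adaptations.
class DecisionTracker:
    def __init__(self):
        self.log = []

    def record(self, component, decision):
        entry = {"component": component, "decision": decision,
                 "altered": False}
        self.log.append(entry)
        return entry

    def mark_altered(self, entry):
        entry["altered"] = True  # a runtime adaptation changed this decision

    def alteration_rate(self, component):
        """Fraction of a component's decisions later altered at runtime."""
        relevant = [e for e in self.log if e["component"] == component]
        return sum(e["altered"] for e in relevant) / len(relevant)

tracker = DecisionTracker()
d1 = tracker.record("dimensioning", {"analytics_replicas": 10})
tracker.record("deployment", {"analytics": "core-dc"})
tracker.mark_altered(d1)  # e.g. replicas were scaled up at runtime
```

Because decisions cascade (dimensioning affects deployment, which affects storage distribution), such a shared log is what enables the cross-component optimization described above.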
The emerging requirements and wide exploitation of data
operations and data-intensive applications highlight the need
for innovative offerings across the complete data lifecycle. In
this paper we have introduced BigDataStack as a complete
stack based on an infrastructure management system that
drives decisions according to data aspects, thus being fully
scalable, runtime adaptable and performant for big data
operations and data-intensive applications. BigDataStack
promotes automation and quality, and ensures that the
provided data are meaningful, of value and fit-for-purpose
through its Data as a Service offering that addresses the
complete data path with approaches for data cleaning,
modelling, data layout optimization, and distributed storage.
The architecture also incorporates approaches for data-
focused application analysis and dimensioning, and process
modelling towards increased performance, agility and
efficiency, while a toolkit allows the specification of
analytics tasks and their integration in the data path. As next
steps, we will provide an implementation of the proposed
architecture based on Kubernetes and validate it through
three different scenarios from maritime, retail and
The research leading to the results presented in this paper
has received funding from the European Union-funded
project BigDataStack under grant agreement no. 779747.
[1] IDC, “The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things”.
[2] OECD, “New sources of growth - knowledge-based capital”.
[3] IBM, “Bringing big data to the enterprise”, http://www-
[4] MIT Technology Review, “Data-driven Health Care”.
[5] The Data Exchange, “Selling Data”.
[6] Where Did The 'Data Explosion' Come From?
[7] Data Center Knowledge, “2016: The Year of The Data Center”.
[8] J. Baer, “The 4 keys to running a data-driven business”.
[9] Big Data Value Association, “Strategic Research and Innovation
Agenda”.
[10] M. Mao and M. Humphrey, “Auto-scaling to minimize cost and meet
application deadlines in cloud workflows”, Proc. International
Conference for High Performance Computing, Networking, Storage
and Analysis, Seattle, Washington, pp. 12-18, 2011.
[11] J. Jiang, J. Lu, G. Zhang, and G. Long, “Optimal Cloud Resource
Auto-Scaling for Web Applications”, 13th IEEE/ACM International
Symp. Cluster, Cloud and Grid Computing (CCGrid), May 2013.
[12] Y. Xie, and S-Z. Yu, “A large-scale hidden semi-Markov model for
anomaly detection on user browsing behaviors”, IEEE/ACM
Transactions on Networking, vol. 17, Feb. 2009, pp. 54-65.
[13] S. Venugopal, and R. Buyya, “A Set Coverage-based Mapping
Heuristic for Scheduling Distributed Data-Intensive Applications on
Global Grids”, Proc. 7th IEEE/ACM International Conference on
Grid Computing (GRID '06), IEEE Computer Society, Washington,
DC, USA, 238-245, 2006.
[14] G. Lee, N. Tolia, P. Ranganathan, and R. H. Katz. “Topology-aware
resource allocation for data-intensive workloads”, Proc. ACM Asia-
Pacific Workshop on systems (APSys '10), ACM, New York, NY,
USA, pp. 1-6, 2010.
[15] C. Szabo, Q.Z. Sheng, T. Kroeger, Y. Zhang, J. Yu, “Science in the
cloud: Allocation and execution of data-intensive scientific
workflows”, Journal of Grid Computing, vol. 12, 2014, pp. 245-264.
[16] M. Koehler and S. Benkner, “Design of an Adaptive Framework for
Utility-Based Optimization of Scientific Applications in the Cloud”,
Proc. IEEE/ACM Fifth International Conference on Utility and Cloud
Computing (UCC '12), IEEE Computer Society, Washington, DC,
USA, 303-308, 2012.
[17] F. Zhang, J. Cao, X. Song, H. Cai, and C. Wu, “AMREF: An
Adaptive MapReduce Framework for Real Time Applications,” Proc.
International Conference on Grid and Cloud Computing (GCC '10),
IEEE Computer Society, Washington, DC, USA, pp. 157-162, 2010.
[18] H. Yang, Z. Luan, W. Li, D. Qian, G. Guan, “Statistics-based
Workload Modelling for MapReduce”, Proc. IEEE 26th International
Parallel and Distributed Processing Symposium Workshops & PhD
Forum (IPDPSW '12), IEEE Computer Society, Washington, DC,
USA, pp. 2043-2051, 2012.
[19] Four Cluster Management Tools to Compare,
[20] With Microsoft Azure Container Service, 'Containers As A Service'
Trend Picks up Momentum,
[21] C. Pahl and B. Lee, “Containers and clusters for edge cloud
architectures - a technology review”, 3rd International Conference on
Future Internet of Things and Cloud (FiCloud), IEEE, 2015.
[22] R. Peinl, F. Holzschuher, and F. Pfitzer, “Docker cluster management
for the cloud - survey results and own solution”, Journal of Grid
Computing, vol. 14, 2016, pp. 265-282.
[23] Swarm mode overview,
[24] CoreOS,
[25] Kubernetes,
[26] Apache Mesos,
[27] Bright OpenStack,
[28] Continuuity Weaves Hadoop Cluster Management With Loom,
[29] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and
J. Wilkes, “Large-scale cluster management at Google with Borg”,
Proc. 10th European Conference on Computer Systems (EuroSys),
ACM, 2015.
[30] C. Delimitrou and C. Kozyrakis, “Quasar: resource-efficient and
QoS-aware cluster management”, ACM SIGPLAN Notices, vol. 49,
2014.
[31] D. Duplyakin, M. Haney, and H. Tufo, “Highly Available Cloud-
Based Cluster Management”, 15th IEEE/ACM International Symp.
Cluster, Cloud and Grid Computing (CCGrid), IEEE, 2015.
[32] J. Park, and J. Hahm, "Container-based Cluster Management Platform
for Distributed Computing," Proc. International Conference on
Parallel and Distributed Processing Techniques and Applications
(PDPTA), 2015.
[33] D. Malysiak and U. Handmann, “An efficient framework for
distributed computing in heterogeneous beowulf clusters and cluster-
management," IEEE 15th International Symp. Computational
Intelligence and Informatics (CINTI), IEEE, 2014.
[34] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh,
and M. Seltzer, “Network-Aware Operator Placement for Stream-
Processing Systems”, 22nd International Conference on Data
Engineering (ICDE ’06), pp. 49-53, IEEE Computer Society, 2006.
[35] V. Cardellini, V. Grassi, F. Lo Presti, and M. Nardelli, “Distributed
QoS-aware Scheduling in Storm”, 9th ACM International Conference
on Distributed Event-Based Systems, pp. 344-347, ACM, 2015.
[36] Y. Xing, S. Zdonik, and J.-H. Hwang, “Dynamic Load Distribution in
the Borealis Stream Processor”, 21st International Conference on
Data Engineering (ICDE ’05), pp. 791-802, IEEE Computer Society,
2005.
[37] M. Hirzel, R. Soulé, S. Schneider, B. Gedik, and R. Grimm, “A
Catalog of Stream Processing Optimizations”, ACM Computing
Surveys, vol. 46, Mar. 2014, pp. 1-34.