Modern Data Platform
with DataOps, Kubernetes,
and Cloud-Native Ecosystem
Building a resilient big data platform
based on Data Lakehouse architecture
Author
Ahmed AbouZaid
Dissertation for Master's Degree
Edinburgh Napier University
Master of Science in Data Engineering
April 2023
Supervisor
Dr. Peter Barclay
Internal Examiner
Dr. Nikolaos Pitropakis
Modern Data Platform
with DataOps, Kubernetes, and Cloud-Native Ecosystem
Building a resilient big data platform
based on Data Lakehouse architecture
Ahmed AbouZaid
Submitted in partial fulfilment of the requirements of
Edinburgh Napier University
for the Degree of
Master of Science in Data Engineering
School of Computing
April 2023
MSc Dissertation Checklist
Learning outcome 1: Conduct a literature search using an appropriate range of information sources and produce a critical review of the findings.
The markers will assess: range of materials; list of references; the literature review/exposition/background information chapter.
Pages¹: 22-44. Hours spent: 180 hours.
Learning outcome 2: Demonstrate professional competence by sound project management and (a) by applying appropriate theoretical and practical computing concepts and techniques to a non-trivial problem, or (b) by undertaking an approved project of equivalent standard.
The markers will assess: evidence of project management (Gantt chart, diary, etc.); depending on the topic, chapters on design, implementation, methods, experiments, results, etc.
Pages¹: 16-21, 45-103. Hours spent: 240 hours.
Learning outcome 3: Show a capacity for self-appraisal by analysing the strengths and weaknesses of the project outcomes with reference to the initial objectives, and to the work of others.
The markers will assess: the chapter on evaluation (assessing your outcomes against the project aims and objectives); discussion of your project's output compared to the work of others.
Pages¹: 104-112. Hours spent: 110 hours.
Learning outcome 4: Provide evidence of meeting learning outcomes 1-3 in the form of a dissertation which complies with the requirements of the School of Computing both in style and content.
The markers will assess: whether the dissertation is well-written (academic writing style, grammatical), spell-checked, free of typos, and neatly formatted; whether it contains all relevant chapters, appendices, title and contents pages, etc.; the style and content of the dissertation.
Hours spent: 120 hours.
Learning outcome 5: Defend the work orally at a viva voce examination.
The markers will assess: performance; confirmation of authorship.
Hours spent: 1 hour.
Have you previously uploaded your dissertation to Turnitin? Yes/No
Has your supervisor seen a full draft of the dissertation before submission? Yes/No
Has your supervisor said that you are ready to submit the dissertation? Yes/No
¹ Please note the page numbers where evidence of meeting the learning outcome can be found in your dissertation.
Authorship Declaration
I, Ahmed AbouZaid, confirm that this dissertation and the work presented in it
are my own achievement.
Where I have consulted the published work of others, this is always clearly
attributed;
Where I have quoted from the work of others, the source is always given.
With the exception of such quotations, this dissertation is entirely my own
work;
I have acknowledged all main sources of help;
If my research follows on from previous work or is part of a larger
collaborative research project, I have made clear exactly what was done by
others and what I have contributed myself;
I have read and understand the penalties associated with Academic
Misconduct.
I also confirm that I have obtained informed consent from all people I have
involved in the work of this dissertation, following the School's ethical guidelines.
Signed:
Date: 04.04.2023
Matriculation no: 40497354
Data Protection Declaration
Under the 1998 Data Protection Act, The University cannot disclose your grade to
an unauthorised person. However, other students benefit from studying
dissertations that have their grades attached.
Please write your name below one of the following options to state your preference.
The University may make this dissertation, with an indicative grade, available to
others.
The University may make this dissertation available to others, but the grade may
not be disclosed.
The University may not make this dissertation available to others.
Preface
This dissertation is the culmination of my Master of Science in Data Engineering
at Edinburgh Napier University. It was a unique learning experience by all means.
My intention to enrol in a master’s program was not recent; it began in my final
college year. In 2010, I heard that a university-wide research competition had
started; hence, I formed and led a research group of three members and
participated in the competition. Our research achieved third place, only four
marks behind first place.
At that time, I knew that I wanted to have a similar experience again, but I
decided first to gain hands-on experience in the technology industry. Later, in
2020, a decade after that research competition, I enrolled in this program
at Edinburgh Napier University, where I experienced significant growth in my
personal, academic, and professional skills.
In my journey, I faced many personal challenges, such as balancing my job and
family responsibilities, the unprecedented impact of the COVID-19 pandemic,
being a father for the first time, maintaining growth in my career, and
transitioning to a Tech lead position. Although it was a challenging combination,
it presented a chance to step up and be a better version of myself. At this
particular moment, I am happy to say that I persevered and achieved my goals.
Finally, I would like to conclude with one of my favourite quotes (usually
attributed to the writer Mark Twain, although I have never found a reliable reference):
The secret of getting ahead is getting started. The secret of getting
started is breaking your complex overwhelming tasks into small
manageable tasks and starting on the first one.
Ahmed AbouZaid
Berlin, Germany, April 2023
لا يَحمِلُ الحِقدَ مَن تَعلو بِهِ الرُتَبُ وَلا يَنالُ العُلا مَن طَبعُهُ الغَضَبُ
A man of high standing should not be malevolent,
and who is irascible shall not achieve excellence.
Antara Ibn Shaddad (an ancient Arabic poet)
Acknowledgements
Academically, I would like to express my heartfelt gratitude to my dissertation
supervisor Dr. Peter Barclay, for his invaluable guidance and support throughout
my research journey. Despite his busy schedule, he always made time for me,
providing timely and thoughtful feedback that allowed me to make significant
progress in my research. I am also grateful to Dr. Nikolaos Pitropakis for his vast
knowledge and detailed technical feedback. His expertise and advice shaped this
dissertation and made it the best possible. Both professors have helped me
develop a deeper understanding of my research topic, refine my research
methodology, and improve my writing skills.
Personally, I am grateful to my parents for sowing the seeds of knowledge in me,
to my wife, who has been with me during this dissertation, and to everyone who
contributed to my growth to evolve into a better version of myself. I also would
like to express my appreciation to the people with whom I had discussions on
software or data engineering topics in my journey, namely, Amr Ali, Amr Atya,
Ashraf Aaref, Hesham Eid, Moustafa Mansour, and Youssef Mamdouh.
Finally, I want to take this opportunity to express my sincere gratitude to the
countless individuals across the world who have contributed their time, skills, and
expertise to the development and maintenance of free (as in freedom) and
open-source software. Their contributions have had a transformative impact on
the world of technology, opening up new possibilities for innovation,
collaboration, and progress. To all the free and open-source software
contributors, thank you for your invaluable contributions that made the world a
better place.
Abstract
In today’s Big Data world, organisations can gain a competitive edge by adopting
data-driven decision-making. However, that requires a modern data platform that
is portable, resilient, and efficient to manage organisations’ data and support
their growth. Building such a platform requires a deep understanding of the
latest data approaches, namely, Data Lakehouse architecture. This newly
emerging architecture combines the best of Data Warehouse and
Data Lake characteristics. Furthermore, the change in the data management
architectures was accompanied by changes in storage formats, particularly open
standard formats like Apache Hudi, Apache Iceberg, and Delta Lake. This research
investigates capabilities provided by Kubernetes and other Cloud-Native
software, using DataOps methodologies to build a generic data platform that
follows the Data Lakehouse architecture. As a result, the research project defines
the data platform specification, architecture, and core components to build a
proof of concept for the platform. Moreover, the project implemented the core
of the proposed platform, namely infrastructure (Kubernetes), ingestion and
transport (Argo Workflows), storage (MinIO), and finally, query and processing
(Dremio). Subsequently, after the implementation verification and validation, a
performance benchmark was conducted using an industry-standard benchmark
suite to assess Dremio’s caching capabilities, which demonstrated a 33% median
improvement in query duration. In the end, the project concludes that Data
Lakehouse architecture is young and still evolving. Hence, the data management
landscape is expected to undergo a fundamental transformation in the near future,
especially with the rapid adoption of the open table formats and the leveraging of
Cloud-Native software, particularly Kubernetes, as a unified orchestration platform.
Henceforth, the boundaries between the Cloud and on-premises could fade over
time, offering greater flexibility to handle vast amounts of data for organisations
of different sizes and use cases.
Contents
Chapter One: Introduction ............................................................................................ 16
1.1 Background 17
1.2 Aims and Scope 18
1.3 Methodology 18
1.4 Dissertation Structure 20
Chapter Two: Literature Review .................................................................................. 22
2.1 Overview 23
2.2 Big Data 23
2.3 DataOps 25
2.4 Data Management Architecture 27
2.4.1 Data Warehouse 27
2.4.2 Data Lake 28
2.4.3 Data Lakehouse 29
2.4.4 Comparison 32
2.5 Cloud Computing 34
2.6 Cloud-Native Software 36
2.7 Modern Data Platform 40
2.8 Summary 44
Chapter Three: Specifications ...................................................................................... 45
3.1 General Goals 46
3.2 Requirements 47
3.2.1 Functional 47
3.2.2 Non-functional 48
3.3 Focus Areas 49
Chapter Four: Architecture ........................................................................................... 52
4.1 Holistic View 53
4.2 Core Components 55
4.2.1 Infrastructure 55
4.2.2 Data Ingestion 59
4.2.3 Data Storage 61
4.2.4 Data Processing 62
4.3 Initial Design 65
Chapter Five: Implementation ..................................................................................... 66
5.1 Approach 67
5.2 Software 68
5.3 Development 69
5.3.1 Kubernetes Cluster 69
5.3.2 Data Applications 70
5.3.3 Data Pipeline 72
5.4 Initial Model 74
Chapter Six: Technical Evaluation ................................................................................ 78
6.1 Overview 79
6.2 Verification 79
6.3 Validation 81
6.3.1 Prerequisites 81
6.3.2 Deployment 82
6.3.3 Data Ingestion 85
6.3.4 Data Processing 86
Chapter Seven: Benchmarking ..................................................................................... 90
7.1 Overview 91
7.2 Framework 91
7.3 Resources 92
7.4 Preparation 93
7.5 Execution 94
7.6 Outcome 96
7.7 Summary 103
Chapter Eight: Results and Discussion ..................................................................... 104
8.1 Overview 105
8.2 Results 105
8.3 Project Reflection 107
8.4 Discussion 108
8.5 Future Work 111
8.6 Conclusion 112
References ..................................................................................................................... 114
Appendix A: Research Proposal ................................................................................. 124
Appendix B: Deployment Screenshots ..................................................................... 128
Appendix C: Benchmarking Plots Code .................................................................... 134
List of Tables
1. Table 2.1: High-level comparison among Data Warehouse, Data Lake, and Data
Lakehouse (E. Janssen, 2022, p. 44).
2. Table 4.1: The definitions of the Unified Data Infrastructure v2.0 (Bornstein et al.,
2022).
3. Table 4.2: Comparison between different ingestion and transport solutions based
on the inclusion and exclusion criteria.
4. Table 4.3: Comparison between different object storage solutions based on the
inclusion and exclusion criteria.
5. Table 4.4: Comparison between different query and processing solutions based on
the inclusion and exclusion criteria.
6. Table 4.5: The core items of the data platform’s initial architecture.
7. Table 5.1: Data applications versions used in the project.
8. Table 5.2: Development tools versions used in the project.
9. Table 6.1: Verify the implementation against the focus areas using the MoSCoW
prioritisation method.
10. Table 7.1: Benchmarking compute resources for Dremio and MinIO.
List of Figures
1. Figure 1.1: The DSRM process model (adapted from Peffers et al., 2007).
2. Figure 2.1: A Venn diagram illustrates how Big Data overlaps directly or indirectly
with different fields (Data Fields Overlap, 2020).
3. Figure 2.2: The three roots of DataOps: Agile, DevOps, and Lean Manufacturing
(Strod, 2019b).
4. Figure 2.3: The aspirations for a data-driven enterprise (Thusoo & Sen Sarma, 2017,
figs. 1.6).
5. Figure 2.4: The structure of the Apache Iceberg table (Iceberg Table’s
Architecture, 2022).
6. Figure 2.5: How data update strategies like Copy-On-Write (CoW) and
Merge-On-Read (MoR) affect open table formats performance (Jain et al., 2023,
fig. 2).
7. Figure 2.6: A high-level architectural comparison of Data Warehouse, Data Lake,
and Data Lakehouse (Lorica et al., 2020).
8. Figure 2.7: Management responsibility in different Cloud service models (IaaS Vs.
PaaS Vs. SaaS, 2020).
10. Figure 2.8: The increase in the number of application workloads versus auxiliary
workloads on Kubernetes between 2021 and 2022 (Mayr, 2023).
10. Figure 2.9: The top seven categories in which Kubernetes experienced growth
during the period between 2021 and 2022 (CNCF Annual Survey 2022, 2023).
11. Figure 2.10: Kubernetes Operators’ capability levels (Jump Start Using the
Operator-SDK, n.d.).
12. Figure 2.11: High-level architecture of a modern functional data platform (Strod,
2019a).
13. Figure 2.12: The core job role profiles of the data team in a data-centric business
(Thusoo & Sen Sarma, 2017, figs. 4-2).
14. Figure 3.1: The Golden Triangle can be envisioned as a structure comprising three
pillars, where the equilibrium is contingent on the stability of each individual pillar
(Simon, 2019).
15. Figure 4.1: The Unified Data Infrastructure v2.0 architecture diagram (Bornstein et
al., 2022).
16. Figure 4.2: Kubernetes high-level architecture (Kubernetes Architecture, How
Kubernetes Works, 2019).
17. Figure 4.3: Kubernetes Operator flow which takes action inside and/or outside the
cluster.
18. Figure 4.4: Kubernetes Operator reconciliation loop.
19. Figure 4.5: Kustomize traverses a Kubernetes manifest to add, remove or update
configuration options (Kubernetes Native Configuration Management, n.d.).
20. Figure 4.6: The components of the Argo Workflows architecture (Argo Workflows
Architecture, n.d.).
21. Figure 4.7: MinIO architecture on Kubernetes with multi-tenant (MinIO
High-Performance Multi-Cloud Object Storage, 2022, p. 16, adapted version).
22. Figure 4.8: Dremio deployment architecture (Dremio Architecture Guide, 2020, p. 7).
23. Figure 4.9: Dremio functional architecture (Dremio Architecture Guide, 2020, p. 7).
24. Figure 4.10: The initial high-level architecture shows the core parts of the Modern
Data Platform using Kubernetes and Cloud-Native solutions.
25. Figure 5.1: KinD configuration for local Kubernetes cluster.
26. Figure 5.2: The Kustomize directory structure of the MinIO application.
27. Figure 5.3: Kustomize files to deploy MinIO application.
28. Figure 5.4: Data ingestion pipeline using Argo Workflows and Python.
29. Figure 5.5: Dremio query to create an Apache Iceberg table for JSON file.
30. Figure 5.6: The content of the platform Git repository in a tree-like view.
31. Figure 5.7: The hierarchical abstractions of the C4 model (Brown, 2023, sec. Static
structure).
32. Figure 5.8: The data platform’s initial model interactions following the C4 model
guidelines.
33. Figure 6.1: The initial implementation of the UDI v2.0 architecture.
34. Figure 6.2: Install validation prerequisites tools.
35. Figure 6.3: Deploying local Kubernetes cluster and verifying the access to it.
36. Figure 6.4: Install the applications into the Kubernetes cluster.
37. Figure 6.5: Install the applications into the Kubernetes cluster.
38. Figure 6.6: Applying Argo Workflow data pipeline.
39. Figure 6.7: The data pipeline as shown in the Argo Workflows interface.
40. Figure 6.8: The data pipeline transformed JSON files as shown in the MinIO
interface.
41. Figure 6.9: View the ingested data as plain JSON files in Dremio.
42. Figure 6.10: Import the ingested data as plain JSON files in Dremio.
43. Figure 6.11: Query the ingested data as plain JSON files in Dremio.
44. Figure 6.12: Creating an Apache Iceberg table for one of the JSON files In Dremio.
45. Figure 6.13: View the created Apache Iceberg table as stored on MinIO.
46. Figure 7.1: TPC-DS dataset entity relationship diagram contains two fact tables:
Store_Sales and Store_Returns, with other dimension tables (TPC-DS Dataset ERD,
2020).
47. Figure 7.2: Creating GKE Autopilot cluster using “gcloud”, which does not require
defining the Kubernetes cluster resources in advance.
48. Figure 7.3: Generating 10 GB of TPC-DS data schema using the “dsdgen” tool.
49. Figure 7.4: TPC-DS data and metadata creation duration as shown in Dremio.
50. Figure 7.5: Execute JMeter’s test plan to assess the performance of cold and warm
queries in Dremio.
51. Figure 7.6: A sample of Dremio’s TPC-DS benchmark after the clean-up.
52. Figure 7.7: The status of the TPC-DS 99 queries by Dremio.
53. Figure 7.8: Test execution time in seconds with a decrease observed in execution
duration when the cache is available.
54. Figure 7.9: TPC-DS queries performance with cache enabled.
55. Figure 7.10: Overview of queries performance with cache enabled.
56. Figure 7.11: Detailed view of queries performance with cache enabled.
57. Figure 7.12: Top 10 best-performing queries when cache enabled.
58. Figure 7.13: Top 10 worst-performing queries when cache enabled.
59. Figure 7.14: A benchmark query accesses 50.9 million rows in 2.17 seconds.
60. Figure 7.15: A benchmark query accesses 34.2 million rows in 8.41 seconds.
61. Figure 7.16: CPU utilisation percentages for MinIO (top) and Dremio (bottom).
62. Figure 7.17: Memory utilisation percentages for MinIO (top) and Dremio (bottom).
List of Appendix Figures
1. Figure B.1: MinIO deployed Kubernetes resources.
2. Figure B.2: Argo Workflows deployed Kubernetes resources.
3. Figure B.3: Dremio deployed Kubernetes resources.
4. Figure B.4: MinIO interface after the login.
5. Figure B.5: Argo Workflows interface after the login.
6. Figure B.6: Dremio interface after the login.
7. Figure B.7: Adding MinIO as a data source in Dremio - General.
8. Figure B.8: Adding MinIO as a data source in Dremio - Advanced Options.
9. Figure C.1: Python code for the benchmarking plots.
Chapter One:
Introduction
1.1 Background
Undoubtedly, we live in the “Big Data” era, where data size has multiplied several
times in the last twenty years (Cisco Visual Networking Index, 2019, p. 5). Not
only have data sizes changed, but also new data types and formats have emerged
from different sources. The exponential increase in data brought many new
challenges, as the old methods could not cope with data that differs in both
quantity and quality. Moreover, what makes handling that data
challenging is not just being “big” but all of its properties or what is known as the
“4 Vs”, which are Volume, Velocity, Variety, and Veracity (Etzion & Aragon-Correa,
2016, p. 148).
The massive increase in the data brings many new opportunities, yet with more
challenges (Etzion & Aragon-Correa, 2016, p. 152). For example, “the challenges
multiply when data from multiple sources are combined, particularly where the
identifiers used by the underlying systems are not the same. The risks are
particularly serious where the data are sensitive” (Clarke, 2015, p. 82).
Furthermore, utilising external services like the public Cloud to handle such data
could increase the risk of vendor lock-in (Kratzke, 2014, p. 2). Thus, the old
methods might not be able to deal with that change, and finding new ways to
handle the new challenges of Big Data is mandatory. This is especially true for data
management, which is the first gate to all other data-driven activities like
Artificial Intelligence, Machine Learning, and Business Intelligence. For example,
the 2018 Kaggle Machine Learning & Data Science survey showed that data
professionals spent most of their time cleaning data (23%) (Hayes, 2019). As a
result, investing in building a modern data platform, as well as defining the
characteristics of that platform, like the architecture, technologies, and
methodologies of building it, are equally crucial for many organisations.
1.2 Aims and Scope
This project aims to investigate the current Big Data challenges and how building
a modern data platform using Kubernetes and Cloud-Native software in harmony
with DataOps methodologies could address some of those challenges and
provide an efficient solution for data management. To achieve that goal, this
project aims to answer the following research questions:
RQ1: What are the main current challenges of managing Big Data?
RQ2: How can DataOps methodologies help to manage Big Data?
RQ3: How can Kubernetes and Cloud-Native software help to build
an efficient data platform for Big Data?
The significant change in the data management landscape in recent years,
influenced by Big Data, requires an organisational transformation that affects the
“Golden Triangle”: People, Process, and Technology (Simon, 2019). Nevertheless,
this project focuses on two sides of the Golden Triangle, process and technology,
to build a blueprint and a proof-of-concept of a modern data platform using
DataOps guidelines and Kubernetes and Cloud-Native software. After building
the initial version of the platform, it will be verified and validated against the
defined focus areas using real-world data for COVID-19 cases between 2020 and
2022. Finally, the platform performance will be assessed using an
industry-standard benchmark suite, namely, the Transaction Processing
Performance Council Decision Support benchmark (TPC-DS).
1.3 Methodology
To achieve the project’s objectives and answer the research questions, the Design
Science Research Methodology (DSRM) by Peffers et al. (2007) will be used. The
DSRM model has six stages:
1. Identify the Problem and Motivate.
2. Define the Objectives of a Solution.
3. Design and Development.
4. Demonstration.
5. Evaluation.
6. Communication.
Although the model has a nominal sequential order that might look somewhat
rigid at first glance, the model’s creators do not expect researchers to
always work in that manner. In addition, the model uses an iterative method,
allowing researchers to fall back to previous stages. For these reasons, the model fits
this project’s use case. Figure 1.1 summarises the sections relevant to each
process stage in agreement with the DSRM model.
Identify Problem and Motivate: the lack of a reference to build a data platform that follows the state of the art in technology, process, and architecture to handle Big Data challenges.
(Inference)
Define the Objectives of a Solution: define the specifications and requirements to build a modern data platform.
(Theory)
Design and Development: design the architecture and set the focus areas of the modern data platform.
(How-to Knowledge)
Demonstration: build the core of the modern data platform based on DataOps, Kubernetes, and the Cloud-Native ecosystem.
(Metrics and Analysis Knowledge)
Evaluation: verify, validate, and benchmark the modern data platform proof of concept according to the specifications and focus areas.
(Disciplinary Knowledge)
Communication: write the dissertation following academic practices.
Figure 1.1: The DSRM process model (adapted from Peffers et al., 2007).
1.4 Dissertation Structure
In addition to the introduction chapter, the remainder of this dissertation is
organised as follows:
Chapter Two - Literature review
This chapter provides a comprehensive overview of the existing research and
literature relevant to the current project topic. It critically evaluates the theories,
concepts, and findings of previous studies that have addressed similar research
questions and identifies the gaps and limitations that need to be addressed.
Chapter Three - Specifications
This chapter outlines the platform specifications, functional and non-functional
requirements, and the project’s high-level plans. It also includes the prioritisation
methodology to define the focus areas of the platform’s initial architecture.
Chapter Four - Architecture
This chapter defines the platform’s architecture and design, which will also be
developed in this project. It describes the platform’s infrastructure, components,
functionalities, and how they interact. Further, it will cover the initial design of
the architecture which will be implemented.
Chapter Five - Implementation
This chapter provides details about implementing the platform developed for
this project. It discusses the development approach, the software used to build
the platform, and the workflow of the implemented components.
Chapter Six - Technical Evaluation
This chapter presents the evaluation of the platform’s initial design. It uses
standard evaluation practices, precisely verification and validation, to analyse and
compare the implementation with the defined specifications and architecture.
Chapter Seven - Benchmarking
This chapter further extends the evaluation of the platform’s initial design since
the performance assessment is an essential part of the evaluation. Therefore, it
conducts a performance benchmark focusing on caching capabilities using an
industry-standard benchmark suite.
Chapter Eight - Results and Discussion
This chapter summarises the project’s key findings and highlights the research
project’s contributions. Then it discusses the implications and suggests areas for
future work.
Chapter Two:
Literature Review
2.1 Overview
This chapter aims to discover why building a modern data platform based on
DataOps, Kubernetes, and the Cloud-Native ecosystem could be vital for dealing
with Big Data challenges.
2.2 Big Data
A variety of data types has grown rapidly due to the increase in data sources such
as social networks, the semantic web, Internet of Things sensors, and location-based
services. As previously stated in the introduction, the “Big Data” characteristic is
not only about being “big” but about the “4 Vs”, which are Volume, Velocity, Variety, and
Veracity (Etzion & Aragon-Correa, 2016, p. 148). The increase in data has provided
many opportunities, such as (Travica, 2017, p. 4):
Improved decision-making: By analysing large amounts of data, organisations
can gain insights and make more informed decisions.
Personalisation: Big Data can be used to personalise products, services,
and experiences for individual customers or users.
Fraud detection: Large datasets can be analysed to identify patterns of
fraudulent activity and help organisations prevent losses.
Risk assessment: Big Data can be used to assess and mitigate risks in a
variety of industries, including finance, insurance, and healthcare.
Supply chain optimisation: Big Data can be used to optimise the efficiency
of supply chain operations and reduce costs.
Predictive maintenance: By analysing vast amounts of data from sensors and
other sources, organisations can predict when equipment will fail and take
preventive action.
Improved customer experience: By analysing customer data, organisations
can improve the customer experience and increase customer loyalty.
New product development: Big Data can identify new product or service
opportunities to meet customer needs and preferences.
Figure 2.1: A Venn diagram illustrates how Big Data overlaps directly or indirectly
with different fields (Data Fields Overlap, 2020).
Figure 2.1 shows the interactions between Big Data and other data fields. Data
management is crucial because it is the first gate to all other data-driven activities
like Artificial Intelligence, Machine Learning, and Business Intelligence. While
“data is the new gold” is a common phrase, every opportunity comes with its own
set of challenges, and traditional methods often fall short of addressing them.
Hence, to get that gold, there are obstacles to overcome, chiefly
finding an efficient way to manage the data (Yugal, 2022, p. 237). That is why data
management platforms have evolved over time to deal with the changes, both
qualitatively and quantitatively. For example, data quality could severely
affect the opportunities of Big Data (Clarke, 2015, pp. 78–80), and high quality
requires a strong process to achieve it. That is what led to the emergence of
“DataOps”.
2.3 DataOps
The term “DataOps” was first coined in 2014 by Lenny Liebmann (3
Reasons Why DataOps Is Essential for Big Data Success | the Big Data Hub, 2014),
although the practices it encompasses have been around for much longer.
Based on the changes in the nature of data in the last two decades, it is clear that
“data management approaches have changed drastically in the past few years
due to improved data availability and increasing interest in data analysis (e.g.,
artificial intelligence). Data volume, velocity, and variety require novel and
automated ways to ‘operate’ this data. In accordance with software
development, where DevOps is the de-facto standard to operate code, DataOps
is an emerging approach advocated by practitioners to tackle data management
challenges for analytics” (Mainali et al., 2021, p. 61).
DataOps provides a process-oriented approach on top of normal data pipelines,
which is a crucial element in handling Big Data (Munappy et al., 2020, p. 182).
Following the Agile path, the DataOps Manifesto was released in 2017 to provide a
framework for data professionals to work together more effectively and
efficiently. The Manifesto sets out eighteen principles as a guide to applying DataOps
methodologies: (1) Continually satisfy your customer, (2) Value working
analytics, (3) Embrace change, (4) It is a team sport, (5) Daily interactions, (6)
Self-organize, (7) Reduce heroism, (8) Reflect, (9) Analytics is code, (10)
Orchestrate, (11) Make it reproducible, (12) Disposable environments, (13)
Simplicity, (14) Analytics is manufacturing, (15) Quality is paramount, (16) Monitor
quality and performance, (17) Reuse, (18) Improve cycle times (DataOps
Manifesto, 2017).
Furthermore, DataOps combines the Agile, DevOps, and Lean Manufacturing
principles with the integration of people, processes, and technology to create a
more efficient and effective data processing pipeline. It is often seen as a natural
evolution of DevOps, a set of practices that aims to improve collaboration,
communication, and integration between software developers and IT
professionals (Strod, 2019b). “It is the combination of proven methodologies that
helped to grow other industries: DevOps and Agile methodology from the
software industry and Lean Manufacturing from the automotive/manufacturing
industry” (Mainali et al., 2021, p. 63). Figure 2.2 illustrates the roots of DataOps.
Figure 2.2: The three roots of DataOps: Agile, DevOps, and Lean Manufacturing (Strod, 2019b).
Moreover, DataOps builds on those principles and applies them to data
processing and analytics, where it utilises orchestration, workflow management,
and automation tools to enable flexible and customised data transformations as
needed (Mainali et al., 2021, p. 65). It is often seen as a way to bridge the gap
between data engineers, who are responsible for building and maintaining the
infrastructure and pipelines that handle data, and data scientists, who are
responsible for analysing and interpreting the data to generate insights and
predictions (Mainali et al., 2021, pp. 64-65). Figure 2.3 summarises the
data-driven aspirations in the manner of the DataOps model.
Figure 2.3: The aspirations for a
data-driven enterprise (Thusoo & Sen
Sarma, 2017, figs. 1.6).
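To make principles such as “Analytics is code”, “Make it reproducible”, and “Orchestrate” more concrete, the sketch below shows what a single containerised pipeline step might look like when driven by an orchestrator such as Argo Workflows. It is a minimal illustration rather than code from this project’s repository; the file paths, the JSON structure, and the transformation itself are hypothetical.

```python
# A minimal, self-contained sketch of an idempotent pipeline step that an
# orchestrator (e.g. Argo Workflows) could run as a single container task.
# The input/output paths and the JSON layout are hypothetical examples.
import argparse
import json
from pathlib import Path

import pandas as pd


def transform(records: list[dict]) -> pd.DataFrame:
    """Normalise raw JSON records into a flat table."""
    df = pd.json_normalize(records)
    # Sort columns so repeated runs produce a consistent layout.
    return df.sort_index(axis="columns")


def main() -> None:
    parser = argparse.ArgumentParser(description="JSON-to-Parquet pipeline step")
    parser.add_argument("--input", required=True, help="Path to a raw JSON file")
    parser.add_argument("--output", required=True, help="Path for the Parquet output")
    args = parser.parse_args()

    records = json.loads(Path(args.input).read_text())
    frame = transform(records)
    # Writing Parquet requires the optional pyarrow dependency.
    frame.to_parquet(args.output, index=False)
    print(f"Wrote {len(frame)} rows to {args.output}")


if __name__ == "__main__":
    main()
```

Because every parameter enters through the command line, the same step can be versioned in Git, replayed in a disposable environment, and scheduled by a workflow engine, which is the kind of operational discipline the Manifesto’s principles describe.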
In terms of adoption, DataOps is considered a highly intuitive approach
to building a data environment. Consequently, newer companies lean
toward adopting DataOps with greater ease and speed than established ones,
for which adoption requires a significant overhaul of existing data practices and mindset
(Thusoo & Sen Sarma, 2017, Chapter 1). That is clear in one of the surveys about
DataOps (Gür, 2021), which showed that the top challenges are:
Establishing formal processes (55%).
Orchestrating code and data across tools (53%).
Staff capacity (50%).
Monitoring the end-to-end environment (50%).
This shows that, in the real world, purely technical challenges, such as working with
code and data across tools and having full visibility of the environment, also influence
the adoption of process-oriented practices like DataOps. Therefore, to adopt agile
processes like DataOps and deal with Big Data challenges, a change in data
management platforms at the technical level is inevitable.
2.4 Data Management Architecture
A Data Management Platform (DMP) is a central place to manage data from one
or more sources. The definition of the DMP varies, depending on the domain
(Boch et al., 2022, pp. 16–17). Generally speaking, DMPs could be specialised like
DMP to manage research data as well as could be generic to manage different
kinds of data (typically, data associated with the business, regardless of its type).
The current research focuses on the generic DMP, which ideally should work with
structured, semi-structured, and unstructured data. As the data grows, the DMPs
architecture evolved over time to provide added value to businesses, namely
Data Warehouse, Data Lake, and finally, Data Lakehouse.
2.4.1 Data Warehouse
Data Warehouse (DWH) started in the 1980s and has been used as a suitable
location to store analytical data in a predefined format (Inmon, 2005, p. 2). As
specified by Ralph Kimball, who has made significant contributions to defining
and polishing the DWH foundation, a data warehouse is “a system that extracts,
cleans, conforms, and delivers source data into a dimensional data store and then
supports and implements querying and analysis for decision making” (Kimball &
Caserta, 2004, p. 22).
DWHs have been used for decades, so they are battle-tested and have had the
best optimisations for a long time. Hence, they provide many advantages, such as acting
as a source of truth, providing high-quality data and effective querying, and, above all,
supporting businesses in embracing data-driven decision-making (Watson et al., 2002, p. 492).
On the other hand, DWH also has downsides. Firstly, it works only with structured
data, which is insufficient to handle the increase in data types and sizes. Also,
operationally, DWHs are very expensive; in addition to proprietary software
licensing, the data preparation and clean-up processes require a lot of time and
resources. These systems have historically run on on-premises hardware,
adding many operational and maintenance overheads. For these reasons, many
enterprises moved to the next generation to deal with these disadvantages,
which is Data Lake architecture.
2.4.2 Data Lake
Data Lake (DL) started around 2010; the term was used for the first time by
James Dixon, who described it as a place where “large amounts of heterogeneous data are
added from a single source, and users can access them for a variety of analytical
use cases” (Giebler et al., 2019, p. 180). Furthermore, Fang (2015, p. 820) defined
the term as “a methodology enabled by a massive data repository based on
low-cost technologies that improve the capture, refinement, archival, and
exploration of raw data within an enterprise. A data lake contains the mess of
raw unstructured or multi-structured data that for the most part have
unrecognised value for the firm”.
With the rise of Cloud Computing and cheap commodity storage solutions, DL
was able to handle the new Big Data cases (where data comes in different types
and sizes); it was also able to store structured and semi-structured formats, and
at the same time, put the cost under control. Furthermore, business-wise,
another advantage of DL is that it can store more data without knowing the use case
in advance. At the same time, it provides an easy way to mine the data to find new business
cases since it is in a raw format. Nevertheless, turning to DL
disadvantages, because of its ability to store different types of data, it
easily turns from a “lake” into a “swamp”, where the DL is more or less used as a
place to dump data. Moreover, due to the variety of stored data and the lack of a
unified way to create a metadata layer, it is hard to govern and provide
fine-grained access to DL resources. Finally, the DL performance is not unified
because the underlying data vary in size, type, and format (Giebler et al., 2019, p.
180). For these reasons, a new generation emerged to overcome the DL
disadvantages, which is Data Lakehouse architecture.
2.4.3 Data Lakehouse
Data Lakehouse (DLH) is a relatively new hybrid architecture, first presented in
2020, which combines the capabilities of a DL and a DWH simultaneously. It
is designed to handle both structured and unstructured data in a scalable,
flexible, and cost-effective way. The research by Armbrust et al. defines DLH as “a
data management system based on low-cost and directly-accessible storage that
also provides traditional analytical DBMS management and performance features
such as ACID transactions, data versioning, auditing, indexing, caching, and query
optimisation. Lakehouses thus combine the key benefits of data lakes and data
warehouses: low-cost storage in an open format accessible by various systems
from the former, and powerful management and optimisation features from the
latter” (2021, p. 3). Inspecting that definition shows how the DLH architecture
promises to fix DL weaknesses while providing all DWH strengths, as described in
the previous sections.
Additionally, the change in the data management architectures was accompanied
by changes in storage systems since the traditional Big Data storage formats like
Parquet and ORC are optimised for read-heavy workloads and lack support for
updates and deletes. Thus, there was a need for a layer on top of them to
maintain data quality and consistency. Hence, the new DLH architectures “center
around open storage formats such as Delta Lake, Apache Hudi and Apache
Iceberg that implement transactions, indexing and other DBMS functionality on
top of low-cost data lake storage (e.g., Amazon S3) and are directly readable from
any processing engine. Lakehouse systems are quickly replacing traditional data
lakes” (Jain et al., 2023, p. 1). Figure 2.4 illustrates a high-level structure of the
Apache Iceberg table as an example of an open table format (Iceberg Table’s
Architecture, 2022).
Figure 2.4: The structure of the Apache Iceberg table (Iceberg Table’s Architecture, 2022).
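To make the layered structure in Figure 2.4 more tangible, the sketch below shows how an Iceberg table could be created and appended to from PySpark, with the data files and the metadata tree (snapshots and manifests) stored on S3-compatible object storage. This is only an illustrative example and not part of the platform built in this project: the catalog name, warehouse bucket, endpoint, and table schema are hypothetical, and it assumes a Spark 3.x session with the Apache Iceberg runtime and the Hadoop S3A connector available on the classpath.

```python
# Illustrative sketch: creating and appending to an Apache Iceberg table from
# PySpark on S3-compatible object storage (e.g. MinIO). All names below
# (catalog "demo", bucket "lakehouse", endpoint, schema) are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Hadoop-type Iceberg catalog named "demo" (hypothetical name).
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://lakehouse/warehouse")
    # Point the S3A connector at a self-hosted, S3-compatible endpoint.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# The table definition becomes Iceberg metadata (snapshots, manifest lists,
# manifests), while the rows land as Parquet data files under the warehouse.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.covid_cases (
        report_date date,
        country     string,
        cases       bigint
    ) USING iceberg
    PARTITIONED BY (country)
""")

rows = [("2021-01-01", "DE", 15000), ("2021-01-01", "UK", 22000)]
df = (
    spark.createDataFrame(rows, ["report_date", "country", "cases"])
    .selectExpr("CAST(report_date AS date) AS report_date", "country", "cases")
)

# Each append commits a new snapshot to the table's metadata layer.
df.writeTo("demo.db.covid_cases").append()
```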
Open table formats like Apache Hudi, Apache Iceberg, and Delta Lake are
designed to improve data management for Big Data workloads where these
formats provide a separation between computing and data. Overall, these open
table formats provide organisations with advanced data management capabilities
that help improve data quality and consistency, increase productivity, and reduce
costs. To deal with Big Data workloads, the new formats provide fundamental
advantages like transactional processing, schema evolution, data versioning, and
time travel (Armbrust et al., 2020; Späti, 2022; Jain et al., 2023). As of early 2023,
all three open table formats (Apache Hudi, Apache Iceberg, and Delta Lake) have
seen significant usage in production environments and adoption across the
majority of vendors like AWS, Google, Azure, Snowflake, Databricks, Dremio, and
Cloudera, which is a promising indicator of wide support for the new open formats
(Clark, 2023). However, it is worth pointing out that even though the open table
formats have evolved and provide approximately the same features, one of the latest
research studies on DLH storage systems by Jain et al. (2023) showed that the open
table formats’ behaviour and performance vary with different factors.
Hence, it is essential to pay attention to the properties of each format and how
suitable they are to the nature of the project or the use case. Figure 2.5 shows
how data update strategies like Copy-On-Write (CoW) and Merge-On-Read (MoR)
affect the performance of different open table formats.
Figure 2.5: How data update strategies like Copy-On-Write (CoW) and Merge-On-Read (MoR)
affect open table formats performance (Jain et al., 2023, fig. 2).
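Building on the previous sketch, the queries below illustrate the data versioning and time travel capabilities mentioned above against the same hypothetical demo.db.covid_cases table. They assume Spark 3.3 or later, where the VERSION AS OF and TIMESTAMP AS OF syntax is available for Iceberg tables; the snapshot id and timestamp are placeholder values.

```python
# Illustrative time-travel queries against the hypothetical Iceberg table from
# the previous sketch; "spark" is the same configured SparkSession.

# Every commit (append, overwrite, delete) is recorded as a snapshot in the
# table metadata, which Iceberg also exposes as a queryable metadata table.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM demo.db.covid_cases.snapshots"
).show(truncate=False)

# Read the table as of a specific snapshot id (data versioning)...
snapshot_id = 1234567890123456789  # placeholder value
spark.sql(
    f"SELECT * FROM demo.db.covid_cases VERSION AS OF {snapshot_id}"
).show()

# ...or as it looked at a given point in time (time travel).
spark.sql(
    "SELECT * FROM demo.db.covid_cases TIMESTAMP AS OF '2021-06-01 00:00:00'"
).show()
```

Whether each of those snapshots is produced by rewriting whole data files (Copy-On-Write) or by writing delta files that are merged at read time (Merge-On-Read) is precisely the trade-off whose performance impact is compared in Figure 2.5.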
2.4.4 Comparison
Figure 2.6 and Table 2.1 show a comparison among DWH, DL, and DLH, which
illustrates how DLH provides the best-of-breed based on DWH and DL properties.
However, in addition to all the pure technical advantages provided by DLH, one of
the main strategic advantages for many enterprises is using standard open data
formats (e.g., Apache Hudi, Apache Iceberg, and Delta Lake), which provide
multiple benefits for different use cases, like enabling seamless access to the
data by various analytics engines, including machine learning systems. Also,
avoiding vendor lock-in is a pivotal aspect of data management, allowing data to move
across different systems and platforms, primarily where the Cloud vendors’
capabilities are used to handle Big Data.
Figure 2.6: A high-level architectural comparison of
Data Warehouse, Data Lake, and Data Lakehouse (Lorica et al., 2020).
Data Types
  Data Warehouse: Structured data and processed data.
  Data Lake: Structured, semi-structured, and unstructured raw data.
  Data Lakehouse: Structured, semi-structured, and unstructured data, both processed and raw.
Data Format
  Data Warehouse: Closed, proprietary format.
  Data Lake: Open format.
  Data Lakehouse: Open format.
Purpose
  Data Warehouse: Optimal for data analytics and business intelligence use cases.
  Data Lake: Suitable for machine learning and artificial intelligence workloads.
  Data Lakehouse: Suitable for all use cases (data analytics, business intelligence, machine learning, and artificial intelligence).
Cost
  Data Warehouse: Storage is costly and time-consuming.
  Data Lake: Storage is cost-effective, fast, and flexible.
  Data Lakehouse: Storage is cost-effective, fast, and flexible.
Users
  Data Warehouse: Business professionals.
  Data Lake: Business analysts, data scientists, data engineers.
  Data Lakehouse: Everyone in the business environment.
Scalability
  Data Warehouse: Scaling might be difficult because of tightly coupled storage and compute.
  Data Lake: Scaling is easy and cost-effective because of the separation of storage and compute.
  Data Lakehouse: Scaling is easy and cost-effective.
Agility
  Data Warehouse: Less agile, fixed configuration.
  Data Lake: Highly agile, adjustable configuration.
  Data Lakehouse: Highly agile, adjustable configuration.
Analytics
  Data Warehouse: Reporting, BI, dashboards.
  Data Lake: Advanced analytics.
  Data Lakehouse: Suitable for all types of analytics workflows.
Ease of use
  Data Warehouse: The fixed schema makes data easy to locate, access, and query.
  Data Lake: Time and effort are required to organise and prepare data for use; extensive coding is involved.
  Data Lakehouse: Simple interfaces are provided that are similar to traditional data warehouses, together with in-built AI support.
Processing
  Data Warehouse: Schema-on-write.
  Data Lake: Schema-on-read.
  Data Lakehouse: Schema-on-write and schema-on-read.
ACID compliance
  Data Warehouse: Records data in an ACID-compliant manner to ensure the highest level of integrity.
  Data Lake: Non-ACID compliant; updates and deletes are complex operations.
  Data Lakehouse: ACID-compliant to ensure consistency as multiple parties concurrently read or write data.
Table 2.1: High-level comparison among Data Warehouse, Data Lake, and Data Lakehouse
(E. Janssen, 2022, p. 44).
2.5 Cloud Computing
Cloud computing refers to a service delivery model for information technology in
which resources such as software, storage, and computing power are made
available over the internet to users on demand. In this model, users do not own
or manage the physical infrastructure that supports these resources but instead
access them as a utility provided by a third-party service provider. Therefore,
Cloud computing allows users to access and use these resources on demand
rather than purchasing and maintaining their physical hardware and software
(Ruparelia, 2016, pp. 1-4). Cloud computing can be divided into three
categories (Ruparelia, 2016, pp. 30-33):
Public Cloud is a computing environment owned and operated by a
third-party Cloud services provider, such as Amazon Web Services (AWS),
Microsoft Azure, or Google Cloud Platform. Public clouds are designed to
be accessed over the internet and are generally available to any
organisation or individual that wants to use them. They offer a variety of
services, including Infrastructure as a Service (IaaS), Platform as a Service
(PaaS), and Software as a Service (SaaS). Public clouds are generally more
cost-effective and easier to use than private clouds, but they may not offer
the same level of security or control.
Private Cloud is a computing environment owned and operated by a single
organisation for exclusive use. Private clouds can be implemented
on-premises or hosted by a third-party provider. They offer the benefits of
Cloud computing, such as scalability and flexibility, but with the added
security and control of an on-premises environment. Private clouds are
generally more expensive and require more resources than public clouds,
but they may be necessary for organisations with strict security or
compliance requirements.
Hybrid Cloud is a computing environment that combines both public and
private clouds. It allows organisations to use the best of both worlds by
using public clouds for certain workloads and private clouds for others. For
example, an organisation might use a public Cloud for non-critical
workloads that do not require high security and a private Cloud for
sensitive workloads that need to be kept in-house. Hybrid clouds offer the
benefits of both public and private clouds, but they can be more complex
to manage and require careful planning to ensure that workloads are
placed in the appropriate environment.
In addition, when it comes to the management models, there are three main
types of Cloud computing services, whose management responsibilities vary
depending on the characteristics of each, as follows (Ruparelia, 2016, pp. 27–29):
Infrastructure as a Service (IaaS) provides users with access to virtualised
computing resources, such as virtual machines, storage, and networking.
Platform as a Service (PaaS) provides users access to a platform for
developing, testing, and deploying applications.
Software as a Service (SaaS) provides users with access to software
applications that can be used over the internet without the need to install
and maintain the software on their computers.
Figure 2.7: Management responsibility in different Cloud service models
(IaaS Vs. PaaS Vs. SaaS, 2020).
Because of all the benefits of the Cloud, the adoption rate has soared over time,
especially in recent years (More Than Half of Enterprise IT Spending in Key Market
Segments Will Shift to the Cloud by 2025, 2022). Nevertheless, many enterprises
have challenges and obstacles using Cloud services, especially data-related ones.
Research conducted on UK organisations showed that “data
portability and interoperability concerns were the most discussed theme in
relation to vendor lock-in” (Opara-Martins et al., 2016, p. 4) and “while security
and governance concerns often can be answered by encryption, and cost
concerns can be answered by cost-based decision-making models, vendor lock-in
problems stay” (Kratzke, 2014, p. 2). The same research from Opara-Martins et al.
(2016, p. 10) showed how using a data architecture that relies on standard open
data formats could play a key role in avoiding vendor lock-in, where “overall, the
results indicate that these challenges closely relate to interoperability and data
portability issues prevalent in the Cloud environment”. Furthermore, when
questioned about the best ways to minimise the risks of vendor lock-in during
Cloud migration, the majority of business participants identified the following
strategies as the most effective means of mitigation:
A. Making well-informed decisions before selecting vendors and signing
Cloud contracts (66.4%).
B. The need for an open environment for continuous competition between
providers in the Cloud service market (52.3%).
C. Using standard software with industry-proven interfaces (39.3%).
2.6 Cloud-Native Software
Cloud-Native refers to the architecture and design of an application built
specifically to take advantage of the Cloud computing model. In other words, a
Cloud-Native application is designed to be scalable, resilient, and take advantage
of the automatic provisioning of resources provided by Cloud platforms. Thus,
Cloud-Native applications are often designed to be run on a distributed system
and are built using microservices and containers, which allow them to be easily
deployed and scaled (Domingus & Arundel, 2022, p. 17).
As a result, Cloud-Native software can be beneficial for data management
because it is designed to take advantage of the benefits of Cloud computing,
which include (Domingus & Arundel, 2022, p. 16):
Elasticity: It enables automatic scaling up or down depending on demand,
thus ensuring high availability and efficient handling of high traffic volumes.
Resilience: It is designed to be fault-tolerant and recover quickly from
failures. Therefore, it is ideal for mission-critical applications that should
be highly available.
Cost effectiveness: It can reduce costs by only using the needed resources
at any given time because it can automatically scale up and down in
keeping with demand.
Flexibility: It can be easily deployed to a variety of different environments,
including on-premises, public Cloud, or in a hybrid Cloud environment.
That makes it easy to use in different use cases.
Innovation: It can be developed and deployed faster than traditional
software, allowing organisations to be more agile and responsive to
changing business needs.
In general, Cloud-Native software works best on Cloud-Native platforms like
Kubernetes, which is an open-source platform for automating the deployment,
scaling, and management of containerised applications. Kubernetes provides a
way to deploy predictable and scalable applications, making managing and
maintaining large and complex applications more accessible. Moreover,
Kubernetes has become the de-facto standard for container orchestration and is
widely adopted in the private and public sectors. That is confirmed, in particular, by
a survey from the Cloud Native Computing Foundation (CNCF), which found that Kubernetes is
used by more than 50% of organisations that deploy containerised applications in
production (Cloud Native Survey 2021, 2021, p. 5). In addition, many major Cloud
providers, including Amazon Web Services (AWS), Microsoft Azure, and Google
Cloud, offer managed Kubernetes services, further contributing to its
widespread adoption (Cloud Native Survey 2021, 2021, p. 8).
Another recent report by Mayr (2023) shows the rapid embrace of Kubernetes,
where the key discovery highlights that “as Kubernetes adoption increases and it
continues to advance technologically, Kubernetes has emerged as the ‘operating
system’ of the cloud”. The same report found that from 2021 to 2022, the
application workloads increased by 30% and auxiliary workloads by 211%. The
“auxiliary workload” generally refers to software that supports the main
functionality of an application or system, such as logging, monitoring, and
backup, which are important for the proper operation and maintenance of the
software but are not directly related to the primary functions that the software is
designed to perform. The increase in the number of auxiliary workloads number
confirms that organisations are increasingly adopting sophisticated Kubernetes
technologies, such as security controls, messaging systems, observability tools,
and service meshes. Furthermore, they use Kubernetes for broader use cases,
such as building pipelines and scheduled utility workloads. As a result,
Kubernetes has become the primary platform for running almost any type of
workload. Figure 2.8 illustrates the year-over-year increase in workload numbers.
Figure 2.8: The increase in the number of main workloads versus auxiliary workloads
on Kubernetes between 2021 and 2022 (Mayr, 2023).
The newest report at the time of writing, published by CNCF in 2023, showed
that the general increase in Kubernetes adoption was also combined with a
usage increase in specific categories, namely, the core of data platforms like
databases (48%) and Big Data applications (35%). Moreover, “across all
categories, open source projects rank among the most frequently used
solutions”, demonstrating how open source software and open standards play a
critical role in shaping the data management landscape (CNCF Annual Survey
2022, 2023). Figure 2.9 represents the Kubernetes growth areas by category
between 2021 and 2022.
Figure 2.9: The top seven categories in which Kubernetes experienced growth
during the period between 2021 and 2022 (CNCF Annual Survey 2022, 2023).
Finally, one of the most crucial Kubernetes features is that it was built with
extensibility in mind. Thus, Kubernetes provides a powerful mechanism to extend
its API to manage complex applications and services as if they were native
Kubernetes resources. That pattern in Kubernetes is called “Operator”, which
enables automating “Day 2” operations and is designed to encapsulate the best
practices, workflows, and knowledge of a particular application or service,
allowing it to be managed in a Kubernetes-native way (Operator Pattern, 2023).
Day 2 operations refer to the ongoing management and maintenance of a
software application or system after the installation (Day 1). By utilising
Kubernetes Operators, the effort and time required to maintain application
availability, performance, and security can be significantly reduced, allowing
Kubernetes administrators and users to concentrate on business requirements
instead of reinventing the wheel and creating redundant solutions (Dobies &
Wood, 2020, Chapters 1, 10). Figure 2.10 shows the five Operators’ capability
levels according to OperatorHub.io.
Figure 2.10: Kubernetes Operators’ capability levels (Jump Start Using the Operator-SDK, n.d.).
For those reasons, Kubernetes and the Cloud-Native ecosystem are promising solutions to handle current Big Data challenges, since they provide many advantages not only on the technical level but also on the strategic level, such as portability and the use of open standards to counter vendor lock-in, as Cloud-Native platforms and software “truly deliver digital capabilities anywhere and everywhere” (Costello & Rimol, 2021).
2.7 Modern Data Platform
There have been different attempts to architect and build a data platform that
can cope with Big Data and fit different workloads and use cases; the so-called “Modern Data Platform” (MDP) is one of them. The term in the technology
industry refers to a set of technologies and practices used to collect, store,
process, and analyse large amounts of data, including Cloud computing, Big Data
processing frameworks, data warehousing and business intelligence tools, and
machine learning and artificial intelligence algorithms. An MDP aims to enable
organisations to gain insights from their data to make data-driven decisions
(Foote, 2022; LaPlante, 2020). Nevertheless, the term “Modern Data Platform” is not commonly used in academic literature, even though the concepts and technologies that make up an MDP are widely studied and discussed in academic research. For example, research on Big Data processing, Cloud Computing, Data Warehousing, and Data Analysis is all relevant to developing and implementing
an MDP.
As shown in the previous sections, many data technologies are relatively new; the Data Lakehouse, for instance, is a hybrid architecture first presented in 2020. Hence, the review of MDP research has been limited to work published between 2020 and 2023. Using a mix-and-match technique with keywords like “modern”, “data”, “platform”, “management”, “lakehouse”, and “architecture”, searches on Google Scholar showed that most of the related work consists of conference papers covering an overview of the Data Lakehouse architecture without many details about actual methods of implementing such an architecture, which is a clear gap in the academic research.
For example, the article “From Data Warehouse to Lakehouse: A Comparative
Review” by Harby and Zulkernine (2022) provided a comparative review of Data
Warehouses and Data Lakehouses architectures. The authors highlighted the
advantages and disadvantages of each architecture, as well as their suitability for
different types of data and use cases. Furthermore, they compared the two
architectures in terms of data integration, data modelling, data processing,
scalability, and security. Even though they found that Data Lakehouse has several
advantages over data warehouses, they also noted that managing it could be
cumbersome because of its many moving parts.
Another related work is the paper titled “Data Lakehouse - a Novel Step in Analytics Architecture” by Orescanin and Hlupic (2021), which covered the concept of a Data Lakehouse that, as mentioned, combines the strengths of data warehouses and data lakes in a single architecture, and how it provides a more scalable, flexible, and cost-effective solution for modern data analytics. The paper also covered a high-level architecture of the Data Lakehouse and the advantages of each component, such as the ability to handle real-time data ingestion and processing, support for Big Data analytics, and improved data governance. The paper then provided examples of companies that created successful Data Lakehouse products, such as Databricks, Snowflake, and Amazon Web Services.
Finally, the authors argued that Data Lakehouses are a novel step in analytics
architecture that can provide significant benefits for organisations seeking to
leverage their data for competitive advantage.
One interesting detailed piece of research about Data Lakehouses is the master’s
thesis “The Evolution of Data Storage Architectures: Examining the Value of the
Data Lakehouse” by E. Janssen (2022), which explored in-depth the data storage
architectures and investigated the value of Data Lakehouse architecture. The
thesis began with an introduction to the history of data storage architectures, covering data warehouse and data lake architectures along with their advantages and disadvantages. The author then introduced the Data Lakehouse architecture, which combines the strengths of both. Furthermore, the thesis included a case study of a company that had implemented a Data Lake and was evaluating a move to a Data Lakehouse architecture. The study concluded that
the company would likely experience significant benefits from the Data
Lakehouse architecture, including improved data quality, faster data processing,
and increased team collaboration. Also, the research reviewed multiple neutral
and vendor-specific architectures for Data Lakehouse; however, it did not address
the implementation phase nor use Kubernetes as a platform for the architecture.
Finally, a paper titled “Xel: A cloud-agnostic data platform for the design-driven
building of high-availability data science services” by Barron-Lugo et al. (2023) is indirectly related to the current project, as the paper shares the same goals and likewise uses state-of-the-art tools and applications to build a portable data platform for data science services. “Agnosticism is a term used in Cloud computing to define the property of a Cloud service/application that does not depend on a given platform or infrastructure to be successfully executed” (Barron-Lugo et al., 2023, p. 88); hence, the Xel project could benefit from the MDP as infrastructure for the data science services.
According to the previous research review, no study covered building a generic
data platform using DataOps, Kubernetes, and Cloud-Native ecosystem,
especially a platform focusing on specific aspects like openness, portability, and
averting vendor lock-in (cloud-agnostic). Hence, this research tries to cover this
gap and gives an overview of the current state of the art in building a data platform that leverages modern practices and technology. It is essential to
realise that the ultimate goal of MDP is not the technology but to build a
data-driven and self-service culture within the organisation to serve the business
needs; hence, building a data platform helps to create a data-centric business
where different personas smoothly interact with the data platform. Figure 2.11
illustrates a high-level architecture of a modern functional data platform, and
Figure 2.12 shows the major personas that are part of the data team in a
data-centric business.
Figure 2.11: High-level architecture of a modern functional data platform (Strod, 2019a).
Figure 2.12: The core job role profiles of
the data team in a data-centric business
(Thusoo & Sen Sarma, 2017, figs. 4-2).
2.8 Summary
In the last two decades, Big Data has been a major driver of the paradigm shift in
data management, and with the increasing amount of data generated, traditional
data management systems and approaches need to be improved or even
replaced. Hence, new technologies and architectures, such as Data Lakehouse,
have emerged to handle the volume, velocity, and variety of data. These
technologies enable organisations to store and process large amounts of data
cost-effectively, allowing them to gain new insights and make more informed
decisions. Consequently, organisations can effectively manage the challenges
posed by Big Data with greater flexibility and efficiency by leveraging different
technologies like Cloud computing, DataOps, and Cloud-Native software and platforms such as Kubernetes. This approach facilitates the extraction of value
from Big Data, leading to improved data quality, enhanced collaboration, and
faster time-to-market for data-driven products and services. In light of that,
building a tailored data platform solution based on modern technologies will enable many companies and organisations, regardless of their size or domain, to manage and benefit from Big Data.
Chapter Three: Specifications
3.1 General Goals
This section aims to define the specifications related to the modern data
platform, like how it should work, the core requirements, and the focus areas for
the initial implementation. In order to do that, first, we need to shed light on the
high-level goals of such a platform. The aspiration is commonly driven by various
business objectives, including enhancing decision-making, uncovering customer
preferences and patterns, and increasing operational efficiency. The vast majority
of enterprises (97%) have developed a documented data strategy to become
data-driven, showing a widespread acknowledgement of the value of data
strategies. However, only a small fraction of businesses (31%) have succeeded in
transforming into data-driven organisations or built a “data culture” (28%). In
fact, firms tend to view themselves as failing to transform their businesses in all
areas except driving innovation with data. Specifically, over half of the businesses
(53.1%) admit to not yet treating data as a business asset, and an even more
significant percentage (52.4%) declare that they are not competing on data and
analytics. The lack of success in creating a data-driven culture is highlighted by
the fact that over two-thirds of firms (69%) admit to not having achieved this
objective, with a similar proportion (71.7%) stating that they have not forged a
data culture (LaPlante & Safari, 2020).
Nevertheless, adopting a data-driven approach touches many areas of an organisation and requires organisational transformation and change management to achieve efficiency. There are multiple frameworks to handle such a
change; one of them is the “Golden Triangle” framework, which Harold Leavitt
invented in the 1960s. The Golden Triangle, or the People, Process, Technology
(PPT) framework, highlights the importance of balancing all three components to
achieve success and optimal outcomes. Even though the PPT framework is a
popular model used in business management and IT, it can be applied to any
industry to help organisations improve their business processes and achieve their
strategic objectives (Simon, 2019). Figure 3.1 conceptualises the relation
between the triangle’s sides where the change in one side affects the others.
Figure 3.1: The Golden Triangle can be
envisioned as a structure comprising three
pillars, where the equilibrium is contingent on
the stability of each individual pillar
(Simon, 2019).
This research’s primary focus is on two sides of the framework, namely Technology (Kubernetes and Cloud-Native software) and the related Process (DataOps), to build a proof of concept of a modern data platform according to the typical business drivers while, at the same time, overcoming the concerns raised in previous research regarding Cloud adoption (Opara-Martins et al., 2016), vendor lock-in (Kratzke, 2014), and data management processes (Gür, 2021).
3.2 Requirements
Reviewing previous attempts to build a data management platform, as discussed
in section 2.7 Modern Data Platform, showed that most data platforms share
some functional and non-functional requirements. Functional requirements
describe what actions a system must perform, and non-functional requirements
describe qualities a system must possess (Robertson & Robertson, 2012, p. 9).
3.2.1 Functional
Data Ingestion: The platform must have the capability to ingest data from
various sources, including structured, semi-structured, and unstructured
data, in real-time and batch modes. The platform must support multiple data formats and be able to handle high volumes of data.
Data Storage: The platform must have efficient and scalable data storage capabilities, including different types of databases such as SQL and NoSQL databases, and must provide data indexing, search, and retrieval capabilities. Also, the data platform must support multiple storage technologies and systems, including classic plain-text formats like JSON and CSV, in addition to modern table formats like Apache Hudi, Apache Iceberg, and Delta Lake.
Data Processing: The platform must have robust data processing
capabilities, including batch processing, real-time processing, and stream
processing. Furthermore, the platform must support multiple data
processing frameworks, including Apache Spark, Apache Flink, and Apache
Storm, and provide a flexible and extensible processing pipeline.
Data Analytics and Collaboration: The platform must have advanced data
analytics capabilities, including machine learning, predictive analytics, and
data visualisation. Also, the platform must support business analytics visualisation tools, such as charts, graphs, and dashboards, to help users understand and analyse data. It must also support scientific analytics tools, including multiple machine learning frameworks such as TensorFlow, PyTorch, and scikit-learn, and provide a library of pre-built models and algorithms.
Data Governance: The platform must have robust data governance
capabilities, including data discovery, data quality, data lineage, and data
auditing, to ensure the accuracy, consistency, and security of the data
being processed and stored.
3.2.2 Non-functional
Portability: The platform should mitigate the risks associated with vendor
lock-in, enabling the user to seamlessly transition to alternative data
processing technologies or vendors with minimum disruption to the
existing infrastructure. Furthermore, the platform should support open
data standards and provide APIs for easy integration with other tools and
systems. Also, it should be designed for Cloud computing environments,
leveraging Cloud computing capabilities and easily moving across different
Cloud providers.
Extensibility: The platform should be easy to extend or modify to
accommodate new functionality or components to handle changing
requirements without requiring significant redesign or refactoring, which is necessary for any rapidly evolving organisation to be data-driven and remain competitive. Furthermore, the platform should be designed with
modularity in mind, using open standards and APIs, separating concerns,
and prioritising interoperability. Therefore, organisations can build a data
platform that adapts to changing business needs and scales as the volume
of data grows.
Scalability: The platform should be scalable and handle the increasing data
as the demand grows. Also, it should support horizontal scaling, allowing
users to add more resources as needed, and vertical scaling, allowing users
to add more processing power and storage capacity to existing resources.
Performance: The platform should provide fast and efficient processing
and retrieval of Big Data, with response times that meet the user’s needs.
It should also be able to process large amounts of data in parallel,
leveraging distributed computing capabilities to optimise performance.
Availability: The platform should be highly available, ensuring that data is
always accessible, even during hardware or network failures. It should
support disaster recovery and business continuity strategies and be able to
recover from failures automatically to ensure data availability.
Reliability: The platform should be reliable, with minimal downtime and
data loss. It should have built-in fault tolerance and resilience mechanisms
and implement best practices for data backup and recovery.
Cost-Effectiveness: The platform should be cost-effective, providing a
good return on investment and flexible pricing options for cost
optimisation as demand changes. It should leverage Cloud computing capabilities to optimise costs and be able to automatically provision and de-provision resources depending on demand.
Usability: The platform should be easy to use, with a user-friendly
interface that is intuitive and accessible to all users. It should support
self-service capabilities, allowing users to manage their own data
processing workflows without relying on IT support.
3.3 Focus Areas
Taking into consideration the mentioned high-level goals and requirements of
the data platform as a starting point, the main focus of this project will be (a)
building a resilient infrastructure for the data platform, (b) applying at least one of the DataOps principles, and (c) implementing the core of the DLH architecture. Accordingly, an iterative approach will be used to build a Minimum Viable Product (MVP) to achieve better quality and reduce failure rates. This
approach involves developing the system’s essential features first and then
gradually adding additional functionalities in subsequent iterations.
To achieve that, the “Must, Should, Could, and Would haves” (MoSCoW)
prioritisation method will be used to prioritise which features to include in each
iteration. This method categorises features into four groups: Must-haves,
Should-haves, Could-haves, and Would-haves (also known as Will-not-haves).
Must-haves are essential features for the system to function, while Should-haves
are necessary but not essential. Could-haves are desirable but not vital, and
Would-haves are features that are needed but explicitly not included in the
current iteration (Del Sagrado & Del Águila, 2020, p. 171).
The MoSCoW prioritisation method will ensure that the most important features
are included in the first MVP, and at the same time, less critical features can be
deferred to later iterations. That allows for a more focused and efficient
development process while ensuring that the MVP meets the minimum
requirements for the system to be usable. As a result, the following are features
of the initial version of the data platform in keeping with the MoSCoW method.
Must Haves:
MH1: Cloud-Native architecture.
MH2: Scalable and Cloud-Agnostic infrastructure orchestration system.
MH3: Open-source software and open standard formats.
MH4: Data Lakehouse solution as a core of the data platform.
Should Haves:
SH1: A declarative approach for configuration management.
SH2: A data pipeline to ingest data from an external source into the
platform in plain text formats like JSON or CSV.
SH3: Self-service capabilities.
Could Haves:
CH1: An easy way to rebuild the system with minimal manual actions.
CH2: An open format table like Apache Iceberg from one of the ingested
JSON/CSV files.
Would Haves:
WH1: Production grade quality like high availability, security hardening, or
data quality validation.
WH2: The rest of the components not directly related to the core of the
Data Lakehouse.
In light of the defined focus area in line with the MoSCoW method, the next
chapter will cover architecting the prioritised specifications.
Chapter Four: Architecture
4.1 Holistic View
The literature review showed the evolution of data management architectures like DWH, DL, and finally, DLH, where the DLH architecture is considered the new recommended setup to cope with Big Data. To architect the initial version of the data platform, the following criteria were defined: the architecture should be (a) provider agnostic, (b) generic enough to cover different use cases, (c) detailed, and (d) up-to-date. A review of six data platform architectures created between 2020 and 2023 (DataLakeHouse, 2020; Ma et al., 2020, p. 3; MongoDB, 2021; Desai et al., 2022; Bornstein et al., 2022; Oppermann, 2022) showed that the “Unified Data Infrastructure v2.0” (UDI v2.0) architecture by Bornstein et al. is the best match for the defined criteria. Figure 4.1 shows that
UDI v2.0 architecture is detailed, modular, and covers all functional requirements,
which could work as a general-purpose data platform and allows the creation of
different blueprints in line with different use cases (e.g., artificial intelligence,
machine learning, multimodal data processing, and business intelligence).
Moreover, Table 4.1 lists the definitions of the UDI v2.0 layers, which are sources, ingestion and transport, storage, query and processing, transformation, and analysis and output.
Figure 4.1: The Unified Data Infrastructure v2.0 architecture diagram (Bornstein et al., 2022).
Sources: Generate relevant business and operational data.
Ingestion and Transport: Extract data from operational systems (E). Deliver to storage, aligning schemas between source and destination (L). Transport analysed data back to operational systems as needed.
Storage: Store data in a format accessible to query and processing systems and optimise for consistency, performance, cost, and scale.
Query and Processing: Translate high-level code (usually written in SQL, Python, or Java/Scala) into low-level data processing jobs. Execute queries and data models against stored data, often using distributed computing. Includes historical analysis (describing what happened) and predictive analysis (describing expectations for the future).
Transformation: Transform data into a structure ready for analysis (T). Orchestrate processing resources for this purpose.
Analysis and Output: Provide an interface for analysts and data scientists to derive insights and collaborate. Present analysis results to internal and external users and embed data models into user-facing applications.
Table 4.1: The definitions of the Unified Data Infrastructure v2.0 (Bornstein et al., 2022).
4.2 Core Components
To have an MVP for the data platform based on
the Unified Data Infrastructure v2.0, the
following components should be implemented,
and more components could be added later.
The core components covered in this section
are infrastructure, data ingestion, storage, and
processing. In addition, the initial architecture
will be defined afterwards.
4.2.1 Infrastructure
As stated in the literature review, Kubernetes is a container orchestration
platform that automates containerised applications’ installation, expansion, and
administration. Given the fact that it is used by more than 50% of organisations
that deploy containerised applications in production (Cloud Native Survey 2021,
2021, p. 5), it is considered the most popular container management system. Its
features include self-healing abilities, service discovery and load balancing,
storage orchestration, configuration and sensitive data management, resource
management, and batch execution.
The extensible capabilities offered by Kubernetes can be tailored to meet the
unique requirements of the application or organisation, aiming to make
managing and scaling containerised applications simple and efficient in
distributed environments. Since Kubernetes is Cloud-Native and Cloud-Agnostic
by design, it is the ideal infrastructure orchestration system for data platforms
because it offers a shared layer of abstraction that hides the underlying details to
deploy and manage containerised applications easily. Kubernetes can run in various environments, whether on-premises or on Cloud providers, including Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), and many others. Figure 4.2 shows the Kubernetes architecture, which runs the same way on any Cloud, on-premises, or local setup to manage containerised applications.
Figure 4.2: Kubernetes high-level architecture
(Kubernetes Architecture, How Kubernetes Works, 2019).
In section 2.6 Cloud-Native Software, it was noted that one of Kubernetes’ most prominent features is its extensibility, which offers a unified approach that allows workloads to leverage its capabilities effectively. Namely, the Operator pattern, which encapsulates operational knowledge as code, simplifies the deployment process, improves resource utilisation, increases flexibility, and enhances the security of applications (What Is a Kubernetes Operator?, 2022).
For those reasons, applications that have a Kubernetes Operator should be
preferred whenever possible. Figure 4.3 illustrates the flow of the Operator
within the Kubernetes cluster, which takes actions depending on the Operator
logic. Then, Figure 4.4 illustrates the Operator reconciliation loop, which
observes, analyses, and acts to match the current state with the desired state for
a specific resource.
Figure 4.3: Kubernetes Operator flow which takes action inside and/or outside the cluster.
Figure 4.4: Kubernetes Operator reconciliation loop.
Another key feature of Kubernetes and its ecosystem is using declarative
configuration as code. In this approach, the applications’ configuration is defined
as code declaratively rather than imperatively, providing many benefits like
consistency, version control, automation, simplicity, and portability. Hence, it
helps to streamline software development processes, increase reliability, and
reduce errors and downtime. The declarative approach is a powerful way to manage complex systems and applications, and it is widely used in modern software development (Declarative Management of Kubernetes Objects
Using Configuration Files, 2023).
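To make the declarative approach concrete, the following is a minimal sketch of a Kubernetes manifest, a hypothetical Deployment that is not part of the data platform itself; it describes the desired state rather than the steps to reach it, and Kubernetes’ control loops continuously reconcile the cluster towards that state.

```yaml
# Minimal, illustrative Deployment manifest (hypothetical example).
# It declares the desired state; Kubernetes reconciles the cluster to match it,
# for instance after running: kubectl apply -f deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  labels:
    app: demo-app
spec:
  replicas: 3                # Desired state: three replicas at all times.
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:1.25  # Image tag pinned for reproducibility.
          ports:
            - containerPort: 80
```

Applying the same file repeatedly is idempotent, which is what makes the declarative style well suited to version control and automation.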
Accordingly, Kustomize, the official Kubernetes native declarative configuration
management tool, will be used in this project to manage the Kubernetes
manifests. Moreover, to reduce toil and repetitive work, the Kustomize built-in
plugin “HelmChartInflationGenerator” will be used whenever possible to
consume the upstream Helm packages known as “Helm charts”. Helm is a package manager for Kubernetes that makes deploying, managing, and upgrading applications on Kubernetes clusters easier by providing a standardised way to package and distribute them. Consuming Helm charts instead of creating all Kubernetes manifests from scratch saves time and effort and avoids reinventing the wheel. Figure 4.5 presents the core idea of Kustomize, which provides a template-free way to customise Kubernetes manifests as layers. By convention, Kubernetes manifests
are in YAML format, which is a human-friendly data serialisation language, but
JSON is also accepted (Managing Resources, 2022).
Figure 4.5: Kustomize traverses a Kubernetes manifest to add, remove or update
configuration options (Kubernetes Native Configuration Management, n.d.).
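As an illustration of how Kustomize and the Helm chart inflation can be combined, the following is a minimal sketch of a kustomization.yaml; the chart name, repository URL, version, and values shown are illustrative assumptions and would need to match the actual upstream chart in use.

```yaml
# kustomization.yaml - illustrative sketch only; chart name, repository,
# version, and values are assumptions to be adjusted to the real upstream chart.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: data-platform

# Inflate an upstream Helm chart instead of writing all manifests from scratch.
# Requires building with Helm support enabled, e.g.:
#   kustomize build --enable-helm .
helmCharts:
  - name: minio                      # Assumed chart name.
    repo: https://charts.min.io      # Assumed chart repository URL.
    version: 5.0.7                   # Assumed chart version.
    releaseName: platform-minio
    valuesInline:
      mode: standalone               # Assumed chart value.

# Additional plain manifests and patches can be layered on top of the chart.
resources:
  - namespace.yaml
```

Keeping such a file in version control gives a single, declarative source of truth for the component, in line with the DataOps principles discussed earlier.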
After covering the infrastructure orchestration platform that matches the goals
and requirements, now moving to the data platform application components
starting with data ingestion.
4.2.2 Data Ingestion
In order to select a solution for ingestion and transport that matches the goals
and requirements, the following inclusion and exclusion criteria were formulated:
IC1: The solution provides modern ETL and workflow capabilities.
IC2: The solution is Cloud-Agnostic and uses Cloud-Native approaches.
IC3: The solution is customisable and fits different use cases.
EC1: The solution is not easy to deploy on Kubernetes.
EC2: The solution does not use a declarative style.
EC3: The solution is complex or has a high learning curve.
The solutions compared were Apache Airflow, Argo Workflows, Dagster, Kubeflow Pipelines, Prefect, and Tekton, each assessed against the inclusion criteria (IC1-IC3) and exclusion criteria (EC1-EC3) above.
Table 4.2: Comparison between different ingestion and transport solutions based on the inclusion and exclusion criteria.
According to the focus areas in the specifications section and the matching of the
inclusion and exclusion criteria in Table 4.2, Argo Workflows appears to be the
best fit for this project as an ingestion and transport solution. Argo Workflows is
an open-source Kubernetes-native workflow engine for orchestrating parallel and
distributed tasks. It provides a simple and powerful way to define, manage, and
execute complex workflows, including data processing, machine learning, and
continuous integration pipelines. By default, it uses the Kubernetes Operator
pattern to define workflows as declarative YAML files, making it easy to follow
DataOps principles. Furthermore, it supports programmatically writing workflows
via its Software Development Kit (SDK) and built-in templates to define common
workflow patterns for flexibility and extensibility. Figure 4.6 illustrates Argo
Workflows’ full architecture.
Figure 4.6: The components of the Argo Workflows architecture
(Argo Workflows Architecture, n.d.).
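To illustrate the declarative style, the following is a minimal sketch of an Argo Workflow for a simple ingestion step; the container image, dataset URL, bucket, and Secret names are hypothetical placeholders, and a production pipeline would add retries, scheduling, and proper credential handling.

```yaml
# Illustrative Argo Workflow sketch: fetch a CSV file and upload it to the
# platform's object storage. Image, URL, bucket, and secret names are
# hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ingest-csv-          # Argo appends a random suffix per run.
  namespace: data-platform
spec:
  entrypoint: fetch-and-upload
  templates:
    - name: fetch-and-upload
      container:
        image: minio/mc:latest       # Placeholder; assumed to provide curl and the MinIO client.
        command: [sh, -c]
        args:
          - |
            curl -sSL -o /tmp/data.csv https://example.com/dataset.csv
            mc alias set lake http://minio.data-platform.svc:9000 "$ACCESS_KEY" "$SECRET_KEY"
            mc cp /tmp/data.csv lake/raw/dataset.csv
        env:
          - name: ACCESS_KEY
            valueFrom:
              secretKeyRef: {name: minio-creds, key: accesskey}   # Hypothetical Secret.
          - name: SECRET_KEY
            valueFrom:
              secretKeyRef: {name: minio-creds, key: secretkey}
```

For recurring ingestion, the same template could be wrapped in an Argo CronWorkflow and kept in version control alongside the rest of the platform configuration.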
4.2.3 Data Storage
In order to select the storage solution that matches the goals and requirements,
the following inclusion and exclusion criteria were formulated:
IC1: The solution provides scalable, high-performance, and distributed
object storage capabilities.
IC2: The solution is Cloud-Agnostic and uses Cloud-Native approaches.
IC3: The solution is compatible with the standard Amazon S3 APIs.
EC1: The solution is not easy to deploy on Kubernetes.
EC2: The solution is not easy to manage and maintain.
EC3: The solution does not have a large community and support.
The solutions compared were Ceph, MinIO, OpenIO, Riak CS, and Zenko, each assessed against the inclusion criteria (IC1-IC3) and exclusion criteria (EC1-EC3) above.
Table 4.3: Comparison between different object storage solutions based on the inclusion and exclusion criteria.
According to the focus areas in the specifications section and the matching of the
inclusion and exclusion criteria in Table 4.3, MinIO appears to be the best fit as a
storage solution for this project. MinIO provides several key features, making it a
popular choice for building object storage servers for Cloud-Native applications
and infrastructure. Specifically, it provides a highly scalable and distributed
storage system to store massive amounts of structured and unstructured data.
Moreover, MinIO is Amazon S3 compatible and built on top of the Amazon S3 API,
which means it can be easily integrated with various applications and tools.
Figure 4.7 illustrates the architecture of a multi-tenant MinIO deployment on Kubernetes.
Figure 4.7: MinIO architecture on Kubernetes with multi-tenant
(MinIO High-Performance Multi-Cloud Object Storage, 2022, p. 16, adapted version).
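As a sketch of how such a multi-tenant setup can be described declaratively through the MinIO Operator, the following Tenant resource is illustrative only; the exact API version and field names depend on the Operator release, and the pool size, storage class, and Secret name are assumptions.

```yaml
# Illustrative MinIO Tenant sketch managed by the MinIO Operator. API version,
# field names, sizes, and secret/storage-class names are assumptions and must
# be checked against the installed Operator version.
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: lakehouse-storage
  namespace: data-platform
spec:
  configuration:
    name: minio-creds              # Hypothetical Secret holding the root credentials.
  pools:
    - name: pool-0
      servers: 4                   # Number of MinIO server pods in the pool.
      volumesPerServer: 2          # Persistent volumes attached to each server.
      volumeClaimTemplate:
        spec:
          storageClassName: standard    # Assumed storage class.
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
  requestAutoCert: false           # TLS handling simplified for the sketch.
```

Because the Tenant is just another declarative resource, it can be versioned and reconciled by the Operator in the same way as the rest of the platform manifests.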
4.2.4 Data Processing
In order to select the query and processing solution that matches the goals and
requirements, the following inclusion and exclusion criteria were formulated:
IC1: The solution provides query and processing capabilities.
IC2: The solution is Cloud-Agnostic and uses Cloud-Native approaches.
IC3: The solution has an open-source and enterprise version.
EC1: The solution does not position itself as a DLH platform.
EC2: The solution is not easy to deploy on Kubernetes.
EC3: The solution does not support modern open standards like Apache
Hudi, Apache Iceberg, and Delta Lake.
The solutions compared were Cloudera Data Platform, Dremio, Presto, and Trino, each assessed against the inclusion criteria (IC1-IC3) and exclusion criteria (EC1-EC3) above.
Table 4.4: Comparison between different query and processing solutions based on the inclusion and exclusion criteria.
In line with the focus areas in the specifications section and the matching of the
inclusion and exclusion criteria in Table 4.4, Dremio appears to be the best fit as a
query engine solution for this project, and it will be the cornerstone of the DLH.
The building blocks of Dremio’s Cloud-Native architecture are open-source technologies such as Apache Arrow, Gandiva, Apache Arrow Flight, and Apache Iceberg. These technologies provide several key features for building a DLH, including a high-performance query engine, self-service data access, and built-in data governance and security. Both open-source and enterprise versions of Dremio are offered,
with the latter option offering more features and functionalities for enterprises.
For those reasons, Dremio is a popular choice for organisations looking to build a
self-service DLH platform that can provide fast and efficient access to a wide
range of data sources. Figures 4.8 and 4.9 show Dremio deployment and
functional architecture (respectively).
Figure 4.8: Dremio deployment architecture (Dremio Architecture Guide, 2020, p. 7).
Figure 4.9: Dremio functional architecture (Dremio Architecture Guide, 2020, p. 7).
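To give a sense of how the deployment architecture in Figure 4.8 translates into declarative configuration, the following is a hypothetical Helm values overlay for a Dremio chart; all key names and sizes are illustrative assumptions, and the actual values schema depends on the chart used (for example, Dremio’s community Helm charts).

```yaml
# Hypothetical values overlay for a Dremio Helm chart. All key names and sizes
# are illustrative assumptions; the real schema depends on the chart in use.
coordinator:
  count: 1          # Coordinator node: query planning, metadata, and the web UI.
  cpu: 4
  memory: 8192

executor:
  count: 3          # Executor nodes: distributed query execution.
  cpu: 4
  memory: 16384

# Distributed storage for query acceleration data, assumed to point at the
# platform's S3-compatible MinIO service (bucket name is illustrative).
distStorage:
  type: aws
  aws:
    bucketName: dremio-dist
```

Scaling the executor count up or down is then a one-line declarative change, consistent with the scalability requirement defined in Chapter Three.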
4.3 Initial Design
This section covers the initial design after defining the core components that match the goals and focus areas, namely portability to avoid vendor lock-in, scalability to deal with massive amounts of structured and unstructured data, and extensibility through open standards. Table 4.5 lists the items of the initial
design, and Figure 4.10 illustrates the initial data platform’s architecture to be
implemented in the next chapter.
DataOps: Practices framework (Process)
Kubernetes: Infrastructure orchestration (Technology)
MinIO: Data object storage (Technology)
Dremio: Data Lakehouse platform (Technology)
Argo Workflows: Data ingestion pipeline (Technology)
Table 4.5: The core items of the data platform’s initial architecture.
Figure 4.10: The initial high-level architecture shows the core parts of the Modern Data Platform
using Kubernetes and Cloud-Native solutions.
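To tie the initial design together, the components in Table 4.5 could be composed under a single top-level Kustomize entry point, as in the following sketch; the directory layout is a hypothetical convention rather than a prescribed structure.

```yaml
# Hypothetical top-level kustomization.yaml composing the platform's core
# components; the directory names are an assumed convention, and each directory
# would contain its own kustomization (Helm chart inflation or raw manifests).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ./namespaces          # Shared namespaces and base configuration.
  - ./argo-workflows      # Ingestion and transport (Argo Workflows).
  - ./minio               # Storage (MinIO object storage).
  - ./dremio              # Query and processing (Dremio as the DLH engine).

commonLabels:
  app.kubernetes.io/part-of: modern-data-platform
```

Building and applying this single entry point, for example with `kustomize build --enable-helm . | kubectl apply -f -`, would recreate the whole stack, which supports the “easy way to rebuild the system” could-have (CH1) defined in the specifications.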