Conference Paper

An Overview of Current Data Lake Architecture Models

Authors:
  • Algebra University College

Abstract

As Data Lakes have gained a significant presence in the data world over the previous decade, several main approaches to building Data Lake architectures have been proposed. From the initial architecture towards the novel ones, omnipresent layers have become established, while at the same time new architecture layers are evolving. The evolution of the Data Lake is mirrored in its architectures, giving each layer a distinctive role in data processing and consumption. Moreover, evolving architectures tend to incorporate established approaches, such as Data Vaults, into their layers for more refined usage. In this article, several well-known architecture models will be presented and compared with the goal of identifying their advantages. Alongside the architecture models, the topic of Data Governance in terms of the Data Lake will be covered in order to expound its impact on Data Lake modeling.

... The idea of "smart cities" is a new way to run cities that uses smart information and communication technologies (ICT) to make city services work better and improve the health and happiness of the people who live there. Smart cities use datadriven methods to make the best use of their resources, protect the environment, and provide better services in areas like public safety, energy management, transportation, energy use, and environmental tracking [1], [2]. ...
... The architecture focuses on important parts of city management, like public safety, transportation, energy efficiency, waste and water management, environmental monitoring, and air quality control. By using cutting edge data processing and management technologies, the framework aims to make cities more sustainable and improve people's quality of life generally [1]. ...
... The original fidelity of the data is preserved by the data lake, which enables flexible and in-depth analyses by preserving it in its raw form. Additionally, the application of sophisticated data processing capabilities, including predictive analytics and machine learning, can be employed to identify concealed patterns, optimize resource allocation, and predict future trends [1]. ...
Conference Paper
Full-text available
The rapid advancement of smart cities, driven by innovative communication and information technologies (ICT), has transformed urban management. This paper introduces the robust SmartCityAI Lakehouse, a hybrid framework specifically designed to implement smart city solutions in IKN, the New Capital City of Indonesia. The proposed architecture seamlessly integrates diverse data sources and supports a wide range of applications, including real-time AI-driven transportation management, energy optimization, public safety enhancement, waste and water management, environmental monitoring, and air quality control. By leveraging both on-premises processing and scalable cloud infrastructure, this framework enhances urban sustainability and improves the quality of life for citizens. The paper also explores the key benefits and challenges of deploying this architecture, providing practical strategies for its implementation in complex urban environments.
... Several Data Lake architectures have been developed and described over time [3], but none has addressed specific requirements related to personal identifiable information. In addition, many different metadata models for Data Lakes have been proposed over time [4], but none has addressed the management of metadata specific to personal identifiable information processing and storage requirements. ...
... An overview of the state and challenges of Data Lake architecture was presented in [9]. A recent study [3] is the result of a comprehensive survey that covers the fundamentals of Data Lakes, their architecture, data storage, and functional tiers. ...
... The comparison between described architectures can be found in [3]. ...
Article
Full-text available
Privacy is a fundamental human right according to the Universal Declaration of Human Rights of the United Nations. The adoption of the General Data Protection Regulation (GDPR) in the European Union in 2018 was a turning point in the management of personal data, specifically personal identifiable information (PII). Although many privacy laws existed before, GDPR has brought the privacy topic into the regulatory spotlight. The two most important novelties are the seven basic principles related to the processing of personal data and the huge fines defined for violation of the regulation. Many other countries have followed the EU with the adoption of similar legislation. Personal data management processes in companies, especially in analytical systems and Data Lakes, must comply with the regulatory requirements. In Data Lakes, there are no standard architectures or solutions for the need to discover personal identifiable information, match data about the same person from different sources, or remove expired personal data. It is necessary to upgrade the existing Data Lake architectures and metadata models to support these functionalities. The goal is to study the current Data Lake architecture and metadata models and to propose enhancements to improve the collection, discovery, storage, processing, and removal of personal identifiable information. In this paper, a new metadata model that supports the handling of personal identifiable information in a Data Lake is proposed.
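The PII-discovery requirement described above lends itself to a small illustration. The following is a minimal sketch, not the paper's actual metadata model, of how an ingestion step might tag columns that appear to contain personal identifiable information; the regular expressions, column names, and the `pii_tags` structure are illustrative assumptions.

```python
import re

# Illustrative patterns only; a production PII scanner would use far richer rules.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s\-]{7,}\d"),
}

def tag_pii_columns(rows):
    """Return a metadata dict mapping column name -> detected PII categories."""
    pii_tags = {}
    for row in rows:
        for column, value in row.items():
            for category, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    pii_tags.setdefault(column, set()).add(category)
    return pii_tags

sample = [
    {"customer": "Ana K.", "contact": "ana.k@example.com"},
    {"customer": "Ivan H.", "contact": "+385 91 234 5678"},
]
print(tag_pii_columns(sample))  # e.g. {'contact': {'email', 'phone'}}
```

Such tags could then be stored as metadata alongside the raw data, which is the kind of enhancement the proposed model targets.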
... To comprehend the dynamics of urban safety, we turn our attention to critical case studies that form the bedrock of our analytical framework [28]. The metropolis, with its ever-changing landscape and diverse demographics, poses a myriad of challenges. ...
... Our analytical journey culminates in the examination of the scoring system derived from case studies of 'perfect cities.' By assigning an arbitrary score of 100 to the cluster with the highest number of these idealized cities, we establish a comparative framework (similar to [8]). Case studies of cities considered the safest serve as benchmarks, allowing for a robust comparative analysis of other clusters, following established data integration practices for public safety enhancement [28]. Through these comparative analyses, we discern the relative safety levels of diverse urban regions, building upon previous works in crime analysis and hotspot identification [4]. The case studies not only validate the effectiveness of our scoring system but also provide a nuanced understanding of the contextual factors influencing safety clusters, similar to existing studies that delve into the intricate interplay between socioeconomic factors, infrastructural elements, and crime rates [16]. ...
Article
Full-text available
This research addresses the critical issue of women’s safety in urban environments, emphasizing the need for innovative solutions to establish secure pathways. SafeRoutes presents a holistic approach, integrating advanced clustering methodologies and GPS technology, detailing its relevance, ideation, methodology, and anticipated results. During ideation, the team prioritized integrating cutting-edge technologies—artificial intelligence, data analytics, and cloud computing. Emphasizing the constraints of existing safety solutions, the focus was on crafting a sophisticated framework for detailed assessments and real-time risk detection during transit. SafeRoutes aims to redefine women’s safety, providing actionable insights for urban planning and law enforcement. The methodology comprises three integral components. Firstly, a robust data ingestion pipeline connects to public and government data sources, ensuring near real-time models enriched with the latest data. The second component uses unsupervised machine learning models, comparing and employing various clustering algorithms. Parameters like crime rates, police presence, and infrastructure are utilized to cluster regions based on women’s safety. Lastly, integration with map APIs and cab service vendors addresses the travel aspect, facilitating real-time alerts for deviations into unsafe areas. Results encompass a nuanced correlation matrix classifying regions based on safety clusters, offering valuable insights for urban planning and law enforcement. Integration with cab services ensures SafeRoutes not only identifies safe paths but actively contributes to enhancing women’s safety during transit. The anticipated outcome positions SafeRoutes as a pioneering solution, contributing substantially to the discourse on urban safety and establishing a benchmark for future research.
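As a rough, assumed illustration of the clustering-and-scoring idea described above (not the SafeRoutes implementation), the sketch below clusters regions on a few invented safety-related features with k-means and rescales cluster scores so that the best cluster receives 100.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-region features: [crime_rate, police_density, lighting_index]
regions = np.array([
    [0.9, 0.2, 0.30],
    [0.2, 0.8, 0.90],
    [0.3, 0.7, 0.80],
    [0.8, 0.3, 0.20],
    [0.1, 0.9, 0.95],
])

X = StandardScaler().fit_transform(regions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Score each cluster by a simple safety proxy (low crime, high police presence and
# lighting), then rescale so the best cluster is assigned 100, mirroring the benchmark idea.
safety = (1 - regions[:, 0]) + regions[:, 1] + regions[:, 2]
cluster_safety = {c: safety[labels == c].mean() for c in set(labels)}
best = max(cluster_safety.values())
scores = {c: round(100 * s / best, 1) for c, s in cluster_safety.items()}
print(labels, scores)
```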
... Another survey in [41] identified two major architectures: the data pond and the data zone. In [42], several architecture models are taken into consideration, including the basic two-layered architecture, multi-layered model, and data-lakehouse model. As more architecture models prove valuable to evaluate, our survey will expand the classification to include more data-lake architectures. ...
... While first-generation data lakes struggled with these inherent conflicts, the underlying principle of the data lake represented a step change in data management. As data lakes matured, they soon became perceived as a major pillar in modern data architecture because they enabled enterprises to handle the vastness of the data they wished to capture, store, and process [42]. Data were landing in their native unstructured format, thus maintaining the full integrity of the data and avoiding unnecessary data loss due to pre-conceived pre-processing and transformation steps. ...
Article
Full-text available
This paper presents a comprehensive literature review on the evolution of data-lake technology, with a particular focus on data-lake architectures. By systematically examining the existing body of research, we identify and classify the major types of data-lake architectures that have been proposed and implemented over time. The review highlights key trends in the development of data-lake architectures, identifies the primary challenges faced in their implementation, and discusses future directions for research and practice in this rapidly evolving field. We have developed diagrammatic representations to highlight the evolution of various architectures. These diagrams use consistent notations across all architectures to further enhance the comparative analysis of the different architectural components. We also explore the differences between data warehouses and data lakes. Our findings provide valuable insights for researchers and practitioners seeking to understand the current state of data-lake technology and its potential future trajectory.
... Accordingly, data lakehouses were proposed as a new solution that can carry this amount of data and constitute a new architecture that outperforms traditional storage systems. To be more precise, it proposes a new architecture combining the flexibility and scalability of a data lake with the data structures and data management capabilities of a data warehouse [1]. Nevertheless, such architecture may encompass a vast amount of data in different structures (i.e., structured, semi-structured, unstructured) collected from internal and external sources, which may contain erroneous, duplicate, and missing data. ...
... Currency = Age + DeliveryTime − InputTime (1), where: ...
Article
Full-text available
Recently, we have noticed the emergence of several data management architectures to cope with the challenges imposed by big data. Among them, data lakehouses are receiving much interest from industrial and academic fields due to their ability to hold disparate multi-structured batch and streaming data sources in a single data repository. Thus, the heterogeneous and complex nature of the data requires a dedicated process to improve its quality and retrieve value from it. Therefore, data curation encompasses several tasks that clean and enrich data to ensure it continues to fit user requirements. Nevertheless, most existing data curation approaches lack the dynamism, flexibility, and customization needed to constitute a data curation pipeline aligned with end-user requirements, which may vary according to the user's decision context. Moreover, they are dedicated to curating only a single type of structure of batch data sources (e.g., semi-structured). Considering the changing requirements of the user and the need to build a customized data curation pipeline according to the users and the data source characteristics, we propose a service-based framework for adaptive data curation in data lakehouses that encompasses five modules: data collection, data quality evaluation, data characterization, curation service composition, and data curation. The proposed framework is built upon a new data characterization and evaluation modular ontology and a curation service composition approach that we detail in this paper. The experimental findings validate the contributions' performance in terms of effectiveness and execution time.
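To make the data-currency measure quoted in the excerpt above concrete, here is a minimal sketch computing Currency = Age + (DeliveryTime − InputTime) from timestamps. The interpretation of the three terms follows common data-quality definitions and the field names are assumptions; the cited framework's exact semantics may differ.

```python
from datetime import datetime

def currency_hours(age_at_receipt_h: float, delivery_time: datetime,
                   input_time: datetime) -> float:
    """Currency = Age + (DeliveryTime - InputTime), expressed in hours.

    age_at_receipt_h : how old the data already was when the system received it
    delivery_time    : when the data is delivered to the consumer
    input_time       : when the data entered the system
    """
    in_system_h = (delivery_time - input_time).total_seconds() / 3600.0
    return age_at_receipt_h + in_system_h

print(currency_hours(
    age_at_receipt_h=1.0,
    delivery_time=datetime(2024, 5, 1, 12, 0),
    input_time=datetime(2024, 5, 1, 9, 0),
))  # 1.0 + 3.0 = 4.0 hours
```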
... A data lake is a centralized repository that stores vast amounts of raw data in its native format (structured, semi-structured, or unstructured), enabling flexible analytics. Data Lakes were designed to host vast volumes of varied data in their raw, unchanged form and to process data almost in real time [2]. While data lakes offer greater flexibility than data warehouses, data quality still plays a key role. ...
Article
Full-text available
Data quality is paramount in data-driven decision-making processes, especially when dealing with large volumes of data in environments like data warehouses and data lakes. These systems store vast amounts of raw and processed data from multiple sources, making data management and quality assurance complex yet critical. With the growing adoption of Artificial Intelligence (AI), new techniques and tools have emerged that can significantly enhance data quality. This paper discusses how AI can improve the quality of data within both data warehouses and data lakes by automating data cleansing, validation, anomaly detection, and ensuring consistency. It explores the benefits, challenges, and methodologies for integrating AI tools into these systems. Keywords: Data quality, AI in data quality, data warehouse, data lakes, big data, data processing, data cleansing, data profiling.
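As a small, hedged illustration of the kind of automated checks the paper above discusses (not its actual toolset), the snippet below profiles a table for missing values, duplicate rows, and simple statistical outliers with pandas; the column names, sample data, and thresholds are arbitrary.

```python
import pandas as pd

def profile_quality(df: pd.DataFrame, numeric_col: str, z_threshold: float = 3.0) -> dict:
    """Return basic data-quality indicators for one table."""
    col = df[numeric_col]
    z_scores = (col - col.mean()) / col.std(ddof=0)
    return {
        "row_count": len(df),
        "null_ratio": df.isna().mean().round(3).to_dict(),  # per-column missing-value share
        "duplicate_rows": int(df.duplicated().sum()),
        "outlier_rows": int((z_scores.abs() > z_threshold).sum()),
    }

df = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "b", "b", "c", "c", "d", "e"],
    "reading":   [10.1, 10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0, None],  # 55.0 is a likely anomaly
})
print(profile_quality(df, numeric_col="reading", z_threshold=2.0))
```

In an AI-assisted setting, indicators like these would feed models that decide which records to cleanse, quarantine, or flag for review.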
... The integration of Flink with Iceberg [5] is a recent development aimed at bridging the gap between real-time processing and scalable, high-performance data storage [6]. Previous studies have explored individual aspects of Flink's stream processing capabilities and Iceberg's advantages in data lake [11] management, but there is a growing body of work focusing on how these technologies can be combined to deliver high-performance real-time data lakes [8]. ...
Article
Full-text available
Real-time data lakes, which aggregate and process both streaming and batch data, have emerged as key enablers of this capability. This paper explores the integration of Apache Flink, a powerful stream processing engine, and Apache Iceberg, an open table format, to build a high-performance real-time data lake. The combination of these technologies allows for seamless handling of both real-time and historical data, ensuring low-latency queries and efficient storage. We delve into the architectural design, key challenges, and optimizations required to implement a robust system capable of handling diverse workloads. Furthermore, the paper highlights best practices for managing schema evolution, optimizing data partitioning, and ensuring transactional consistency. The integration of Flink and Iceberg not only enhances data accessibility and reliability but also offers a scalable solution for organizations seeking to leverage real-time analytics. Our findings demonstrate the efficacy of this approach in improving data processing speed, accuracy, and overall system performance. In the era of big data, organizations increasingly rely on real-time analytics to gain timely insights and maintain competitive advantage. This paper presents a comprehensive approach to designing and implementing a high-performance real-time data lake using Apache Flink and Apache Iceberg. We explore how Flink, as a robust stream processing engine, can handle real-time data ingestion, processing, and analytics, while Iceberg provides an efficient and scalable data lake storage format. The integration of these technologies is examined to address key challenges such as data consistency, schema evolution, and system scalability. Through practical case studies and performance benchmarks, we demonstrate how this architecture supports low-latency querying, reliable data management, and seamless integration with existing data infrastructure. Our findings provide valuable insights into optimizing real-time data lakes for large-scale data operations and highlight best practices for leveraging Flink and Iceberg in a modern data ecosystem.
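The Flink-plus-Iceberg pattern described above can be sketched with PyFlink's Table API. This is a hedged, minimal example, not the paper's deployment: it assumes the Iceberg Flink runtime connector jar is on the classpath, uses a local Hadoop-catalog warehouse path, and stands in a bounded datagen source for a real stream such as Kafka.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming table environment; assumes the iceberg-flink-runtime jar is available.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register an Iceberg catalog backed by a (here local) Hadoop warehouse path.
t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 'file:///tmp/iceberg_warehouse'
    )
""")

# An Iceberg table that will receive the streaming events.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lake.`default`.events (
        event_time TIMESTAMP(3),
        device_id  STRING,
        reading    DOUBLE
    )
""")

# A toy bounded source stands in for a real stream in this sketch.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE raw_events (
        event_time TIMESTAMP(3),
        device_id  STRING,
        reading    DOUBLE
    ) WITH ('connector' = 'datagen', 'number-of-rows' = '100')
""")

# Append the (here bounded) stream into the Iceberg table.
t_env.execute_sql("INSERT INTO lake.`default`.events SELECT * FROM raw_events").wait()
```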
... Given typical data-lake architectures (Hlupić et al. 2022), the metadata-lake corresponds to the two-layered architecture with a transient landing zone and a second zone that, diverging from the usual pattern, holds not only the raw (meta)data but raw and refined metadata together. This directly results from the EtLT process choice. ...
Preprint
Full-text available
Metadata management for distributed data sources is a long-standing but ever-growing problem. To counter this challenge in a research-data and library-oriented setting, this work constructs a data architecture, derived from the data-lake: the metadata-lake. A proof-of-concept implementation of this proposed metadata system is presented and evaluated as well.
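A toy sketch of the two-layered pattern referenced in the excerpt above: a transient landing zone from which (meta)data records are moved, lightly transformed, into a second zone holding both raw and refined copies (the small "t" of EtLT). The paths, file format, and refinement step are illustrative assumptions, not the metadata-lake's actual design.

```python
import json
import shutil
from pathlib import Path

LANDING = Path("/tmp/metalake/landing")        # transient landing zone
RAW = Path("/tmp/metalake/zone/raw")           # raw metadata, kept verbatim
REFINED = Path("/tmp/metalake/zone/refined")   # lightly refined copies

def ingest(landing_file: Path) -> None:
    """Move a landed metadata record into the second zone, keeping raw and refined forms."""
    for target in (RAW, REFINED):
        target.mkdir(parents=True, exist_ok=True)
    record = json.loads(landing_file.read_text())

    # The raw copy is preserved unchanged.
    shutil.copy(landing_file, RAW / landing_file.name)

    # A tiny "t": normalize keys to lower case before any heavier downstream "T".
    refined = {k.lower(): v for k, v in record.items()}
    (REFINED / landing_file.name).write_text(json.dumps(refined, indent=2))

    # The landing zone is transient, so the original file is removed after ingestion.
    landing_file.unlink()

LANDING.mkdir(parents=True, exist_ok=True)
sample = LANDING / "dataset_0001.json"
sample.write_text(json.dumps({"Title": "Survey data", "Creator": "Lab A"}))
ingest(sample)
```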
... The real-time and batch layers merge for searches performed through the service layer, which includes a massively parallel processing query engine. Access to this combined dataset enables accurate reporting at all times with low latency (Ding et al., 2022; Hlupić et al., 2022). ...
Article
Full-text available
In today’s data-driven world, the volume of information produced daily is staggering. Without a robust data engineering strategy, companies face the risk of prolonged delays, decreased productivity, dissatisfied customers, and strained business relationships. Effective data management and data modelling are critical for transforming this vast amount of information into valuable insights that drive business growth and provide a competitive edge. By gathering and analysing data through these methods, businesses can make informed decisions that significantly impact their growth and success. Data modelling and data management are both critical components of working with data, but they focus on different aspects of handling and utilising data within an organisation. Understanding the distinction between these two areas is crucial for effectively managing data within an organisation and ensuring that data systems are well-designed and properly maintained. By delving into the specifics of data modelling and data management, this paper aims to provide a comprehensive understanding of how these practices can be leveraged to enhance organisational efficiency, productivity, and decision-making. We provide a comprehensive and insightful exploration of data modelling and data management while highlighting their critical roles in modern business environments.
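The batch/real-time merge mentioned in the excerpt before this article (a Lambda-style serving layer) can be illustrated with a few lines of pandas: the batch view is authoritative up to its load time, and the speed layer supplies everything newer. Table names, keys, and timestamps are invented for the example.

```python
import pandas as pd

batch_view = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount":   [100.0, 250.0, 80.0],
    "loaded_at": pd.Timestamp("2024-05-01 00:00"),
})

speed_view = pd.DataFrame({   # events that arrived after the last batch load
    "order_id": [3, 4],
    "amount":   [85.0, 40.0],
    "loaded_at": pd.Timestamp("2024-05-01 06:30"),
})

# Serving-layer merge: take the freshest record per key across both layers.
merged = (
    pd.concat([batch_view, speed_view])
      .sort_values("loaded_at")
      .drop_duplicates(subset="order_id", keep="last")
      .sort_values("order_id")
)
print(merged)
```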
... To support domain experts operating on data lakes, one approach to this problem is to bring some sort of order to the plethora of available data. For this purpose, there are different data partitioning models that group the data into different partitions, e.g., based on structural or semantic similarities [10]. This enables domain experts to specifically access a partition that is suitable for their purposes. ...
Article
Full-text available
As a result of the paradigm shift away from rather rigid data warehouses to general-purpose data lakes, fully flexible self-service analytics is made possible. However, this also increases the complexity for domain experts who perform these analyses, since comprehensive data preparation tasks have to be implemented for each data access. For this reason, we developed BARENTS, a toolset that enables domain experts to specify data preparation tasks as ontology rules, which are then applied to the data involved. Although our evaluation of BARENTS showed that it is a valuable contribution to self-service analytics, a major drawback is that domain experts do not receive any semantic support when specifying the rules. In this paper, we therefore address how a recommender approach can provide additional support to domain experts by identifying supplementary datasets that might be relevant for their analyses or additional data processing steps to improve data refinement. This recommender operates on the set of data preparation rules specified in BARENTS—i.e., the accumulated knowledge of all domain experts is factored into the data preparation for each new analysis. Evaluation results indicate that such a recommender approach further contributes to the practicality of BARENTS and thus represents a step towards effective and efficient self-service analytics in data lakes.
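As a rough illustration of grouping data-lake datasets by structural similarity (one of the partitioning ideas mentioned in the excerpt above, not the BARENTS implementation), the sketch below greedily groups tables whose column-name sets have a high Jaccard overlap; the schemas and threshold are made up.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def group_by_schema(schemas: dict, threshold: float = 0.5) -> list:
    """Greedily group datasets whose column-name sets overlap above the threshold."""
    groups = []
    for name, columns in schemas.items():
        for group in groups:
            representative = schemas[group[0]]
            if jaccard(columns, representative) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

schemas = {
    "sales_2023":  {"order_id", "customer_id", "amount", "date"},
    "sales_2024":  {"order_id", "customer_id", "amount", "date", "channel"},
    "sensor_feed": {"device_id", "timestamp", "reading"},
}
print(group_by_schema(schemas))  # [['sales_2023', 'sales_2024'], ['sensor_feed']]
```

Semantic similarity (e.g., over column descriptions or value distributions) could replace the Jaccard measure without changing the overall grouping scheme.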
... The major benefit of a DL is the centralization of content from disparate sources. A DL can have tens of thousands of tables or files and billions of records, requiring scalable data storage, management, and analysis [32] [33]. Once gathered in the DL, data from multiple sources can be correlated, integrated, and processed using state-of-the-art Big Data search and analysis techniques that would otherwise be impossible. ...
Article
Full-text available
Web server access log files are text files containing important data about server activities, client requests addressed to a server, server responses, etc. Large-scale analysis of these data can contribute to various improvements in different areas of interest. The main problem lies in storing these files in their raw form, over a long time, to allow analysis processes to be run at any time, enabling information to be extracted as a foundation for high-quality decisions. Our research focuses on offering an economical, secure, and high-performance solution for the storage of large amounts of raw log files. The proposed system implements a Data Lake (DL) architecture in the cloud using Azure Data Lake Storage Gen2 (ADLS Gen2) for extract–load–transform (ELT) pipelines. This architecture allows large volumes of data to be stored in their raw form. Afterwards, they can be subjected to transformation and advanced analysis processes without the need for a structured writing scheme. The main contribution of this paper is to provide an affordable and more accessible solution for web server access log data ingestion, storage, and transformation on the newest technology, the Data Lake. As a derivative contribution, we propose the use of an Azure Blob Trigger Function to implement the algorithm for transforming log files into parquet files, leading to a 90% reduction in storage space compared to their original size. That means much lower storage costs than if they had been stored as log files. A hierarchical data storage model has also been proposed for shared access to data over different layers in the DL architecture, on top of which Data Lifecycle Management (DLM) rules have been proposed for storage cost efficiency. We propose ingesting log files into a Data Lake deployed in the cloud due to ease of deployment and low storage costs. The aim is to maintain these data in the long term, to be used in future advanced analytics processes by cross-referencing with other organizational or external data. That could bring important benefits. While the proposed solution is explicitly based on ADLS Gen2, it represents an important benchmark in approaching a cloud DL solution offered by any other vendor.
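The log-to-Parquet transformation behind the storage savings described above can be sketched, independently of Azure, with pandas and pyarrow; the log format and column names are assumptions, and in the cited solution the equivalent step runs inside an Azure Blob Trigger Function rather than a local script.

```python
import pandas as pd

# Parse a few Common Log Format lines into a typed, columnar table.
log_lines = [
    '203.0.113.7 - - [01/May/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.9 - - [01/May/2024:10:00:02 +0000] "POST /api/login HTTP/1.1" 401 320',
]

def parse(line: str) -> dict:
    host, _, _, rest = line.split(" ", 3)
    timestamp = rest[rest.index("[") + 1 : rest.index("]")]
    request = rest[rest.index('"') + 1 : rest.rindex('"')]
    status, size = rest.rsplit(" ", 2)[-2:]
    return {"host": host, "timestamp": timestamp, "request": request,
            "status": int(status), "size_bytes": int(size)}

df = pd.DataFrame(parse(line) for line in log_lines)

# Columnar storage with compression is what yields the large size reduction vs. raw text.
df.to_parquet("/tmp/access_log.parquet", compression="snappy")  # requires pyarrow or fastparquet
```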
... For dataset location, discovery mechanisms are used. Those mechanisms employ graph or semantic databases that are used to implement metadata management and governance systems, often associated with the use of data lakes ( [10], [11], [12]). To simplify the queries, the solution proposed by the EFFECTOR Project uses a semantic layer which will also be described in section 3.2. ...
Article
Full-text available
Establishing an efficient information-sharing network among national agencies in the maritime domain is of essential importance in enhancing operational performance, increasing situational awareness, and enabling interoperability among all involved maritime surveillance assets. Based on various data-driven technologies and sources, the EU initiative of the Common Information Sharing Environment (CISE) enables the networked participants to exchange information in a timely manner concerning vessel traffic, joint SAR & operational missions, emergency situations, and other events at sea. In order to host and process the vast amounts of vessel and related maritime data consumed from heterogeneous sources (e.g. SAT-AIS, UAV, radar, METOC), the deployment of big data repositories in the form of Data Lakes is of great added value. The different layers in the Data Lakes, with capabilities for aggregating, fusing, routing, and harmonizing data, are assisted by decision support tools with combined reasoning modules with semantics, aiming at providing a more accurate Common Operational Picture (COP) among maritime agencies. Based on these technologies, the aim of this paper is to present an end-to-end interoperability framework for maritime situational awareness in strategic and tactical operations at sea, developed in the EFFECTOR EU-funded project, focusing on the multilayered Data Lake capabilities. Specifically, a case study presents the important sources and processing blocks, such as the SAT-AIS, CMEMS, and UAV components, enabling maritime information exchange in CISE format and communication patterns. Finally, the technical solution is validated in the project's recently implemented maritime operational trials and the respective results are documented.
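To give a flavour of the graph/semantic metadata lookup mentioned in the excerpt preceding this article (a generic sketch, not the EFFECTOR semantic layer), the example below registers two datasets in an RDF graph with rdflib and discovers those covering a given source type via SPARQL; the vocabulary, dataset names, and source labels are invented.

```python
from rdflib import Graph, Literal, Namespace, RDF

CAT = Namespace("http://example.org/catalog#")   # hypothetical metadata vocabulary
g = Graph()

for name, source in [("vessel_tracks", "SAT-AIS"), ("weather_grids", "METOC")]:
    dataset = CAT[name]
    g.add((dataset, RDF.type, CAT.Dataset))
    g.add((dataset, CAT.sourceType, Literal(source)))

# Discover all datasets ingested from SAT-AIS sources.
query = """
    PREFIX cat: <http://example.org/catalog#>
    SELECT ?dataset WHERE {
        ?dataset a cat:Dataset ;
                 cat:sourceType "SAT-AIS" .
    }
"""
for row in g.query(query):
    print(row.dataset)   # http://example.org/catalog#vessel_tracks
```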
Article
Full-text available
Data management, particularly in industrial environments, is increasingly vital due to the necessity of handling ever-growing volumes of information, commonly referred to as big data. This survey delves into various papers to comprehend the practices employed within industrial settings concerning data management, by searching for relevant keywords in Q1 Journals related to data management in manufacturing in the databases of WebOfScience, Scopus and IEEE. Additionally, a contextual overview of core concepts and methods related to different aspects of the data management process was conducted. The survey results indicate a deficiency in methodology across implementations of data management, even within the same types of industry or processes. The findings also highlight several key principles essential for constructing an efficient and optimized data management system.
Conference Paper
Full-text available
During the COVID-19 pandemic, the traditional emergency healthcare systems faced unprecedented strain due to the sharp rise in demands for urgent care, scarcity of resources, and increased risks of people getting infected while waiting at the emergency care facility. We present Triage-Bot, an online medical triage provisioning service that can revolutionize emergency care by decreasing the load on emergency departments, reducing healthcare expenses, and improving the quality of care. Empowered by artificial intelligence and natural language processing, the Triage-Bot service assesses and prioritizes patients' needs based on symptoms, medical history, and perceived conditions from multimodal video, audio, and text data captured during patients' interactions. The captured, summarized information with a severity ranking is sent to a human expert to suggest the next action on the user's part. The diverse data types used by the Triage-Bot in communication, authentication, data collection, storage, and analytics require a robust and scalable system architecture for online service provisioning. In this paper, we specifically focus on the system design and architecture of the Triage-Bot for emergency healthcare settings. With integrated electronic medical records (EMR) and online platforms, the bot fosters collaboration among healthcare professionals and enables swift and informed decision-making even in the face of crises. By partially automating and offering a hybrid triage process, the Triage-Bot improves resource allocation, reduces healthcare management costs for emergency care, minimizes patient waiting times, and improves wellbeing. To address the complexities and demands of healthcare data management, our proposed system incorporates a MongoDB database for flexibility, scalability, and versatility in supporting different types of data. Additionally, we implement a data linking and analytics pipeline utilizing a data Lakehouse system to effectively ingest, manage, process, and generate knowledge from heterogeneous data sources.
Chapter
The Durban University of Technology is undertaking a project to develop a data lakehouse system for a South African government-sector training authority. This system is considered critical to enhance the monitoring and evaluation capabilities of the training authority and ensure service delivery. Key to the successful deployment of the data lakehouse is the implementation of suitable data governance for the system. This chapter identifies the key components of data governance relevant to the system through a systematic literature review process. Thereafter, the components of data governance are mapped against the technical architecture of the data lakehouse and the governance mechanisms for all lakehouse system components are defined. A practitioner expert evaluation is presented to assess the data governance mechanisms. Overall, the data governance framework and resulting mechanisms were found to be sufficient, except regarding ensuring data quality. The need for separate studies focused on ensuring data quality for the data lakehouse system was identified as future work.
Article
Full-text available
This paper offers a comprehensive examination of the process involved in developing and automating supervised end-to-end machine learning workflows for forecasting and classification purposes. It offers a complete overview of the components (i.e., feature engineering and model selection), principles (i.e., bias–variance decomposition, model complexity, overfitting, model sensitivity to feature assumptions and scaling, and output interpretability), models (i.e., neural networks and regression models), methods (i.e., cross-validation and data augmentation), metrics (i.e., Mean Squared Error and F1-score) and tools that rule most supervised learning applications with numerical and categorical data, as well as their integration, automation, and deployment. The end goal and contribution of this paper is the education and guidance of the non-AI expert academic community regarding complete and rigorous machine learning workflows and data science practices, from problem scoping to design and state-of-the-art automation tools, including basic principles and reasoning in the choice of methods. The paper delves into the critical stages of supervised machine learning workflow development, many of which are often omitted by researchers, and covers foundational concepts essential for understanding and optimizing a functional machine learning workflow, thereby offering a holistic view of task-specific application development for applied researchers who are non-AI experts. This paper may be of significant value to academic researchers developing and prototyping machine learning workflows for their own research or as customer-tailored solutions for government and industry partners.
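As a tiny, generic illustration of the workflow components listed above (model evaluation via cross-validation and an F1 metric), rather than the paper's own pipeline, the snippet below scores a baseline classifier on synthetic data with scikit-learn; the data and model choice are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; real workflows start from domain features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
f1_scores = cross_val_score(model, X, y, cv=5, scoring="f1")

print(f1_scores.round(3), "mean F1:", f1_scores.mean().round(3))
```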
Article
Full-text available
Vast amounts of medical data are generated every day, and constitute a crucial asset to improve therapy outcomes, medical treatments and healthcare costs. Data lakes are a valuable solution for the management and analysis of such a variety and abundance of data, yet to date there is no data lake architecture specifically designed for the healthcare domain. Moreover, benchmarking the underlying infrastructure of data lakes is fundamental for optimizing resource allocation and performance, increasing the potential of this kind of data platforms. This work describes a data lake architecture to ingest, store, process, and analyze heterogeneous medical data. Also, we present a benchmark for infrastructures supporting healthcare data lakes, focusing on a variety of analysis tasks, from relational analysis to machine learning. The benchmark is tested on a virtualized implementation of our data lake architecture, and on two external cloud-based infrastructures. Our results highlight distinctions between infrastructures and tasks of different nature, according to the machine learning techniques, data sizes and formats involved.
Preprint
Full-text available
This paper offers a comprehensive examination of the process involved in developing and automating supervised end-to-end machine learning workflows for forecasting and classification purposes. It offers a complete overview of the components (i.e. feature engineering, model selection, etc), principles (i.e. bias-variance decomposition, model complexity, overfitting, model sensitivity to feature assumptions and scaling, output interpretability, etc), models (i.e. neural networks, regression models, etc), methods (i.e. Cross-Validation, data augmentation, etc), metrics (i.e. Mean Squared Error, F1-score, etc) and tools that rule most supervised learning applications with numerical and categorical data, as well as their integration, automation, and deployment. The end goal and contribution of this paper is the education and guidance of the non-AI expert academic community over complete and rigorous machine learning pipelines and data science practices, from problem scoping to design and state-of-the-art automation tools, including basic principles and reasoning in the choice of methods. The paper delves into the critical stages of supervised machine learning workflow development, many of which are often omitted by researchers due to brevity, and covers foundational concepts essential for understanding and optimizing a functional machine learning workflow, thereby offering a holistic view of task-specific application development for applied researchers who are not AI experts.
Conference Paper
Full-text available
During recent years, data lakes emerged as a way to manage large amounts of heterogeneous data for modern data analytics. Although various work on individual aspects of data lakes exists, there is no comprehensive data lake architecture yet. Concepts that describe themselves as a “data lake architecture” are only partial. In this work, we introduce the data lake architecture framework. It supports the definition of data lake architectures by defining nine architectural aspects, i.e., perspectives on a data lake, such as data storage or data modeling, and by exploring the interdependencies between these aspects. The included methodology helps to choose appropriate concepts to instantiate each aspect. To evaluate the framework, we use it to configure an exemplary data lake architecture for a real-world data lake implementation. This final assessment shows that our framework provides comprehensive guidance in the configuration of a data lake architecture.
Article
Full-text available
The realm of big data has brought new avenues for knowledge acquisition, but also major challenges including data interoperability and effective management. The great volume of miscellaneous data renders the generation of new knowledge a complex data analysis process. Presently, big data technologies provide multiple solutions and tools towards the semantic analysis of heterogeneous data, including their accessibility and reusability. However, in addition to learning from data, we are faced with the issue of data storage and management in a cost-effective and reliable manner. This is the core topic of this paper. A data lake, inspired by the natural lake, is a centralized data repository that stores all kinds of data in any format and structure. This allows any type of data to be ingested into the data lake without any restriction or normalization. This could lead to a critical problem known as data swamp, which can contain invalid or incoherent data that adds no value for further knowledge acquisition. To deal with the potential avalanche of data, some legislation is required to turn such heterogeneous datasets into manageable data. In this article, we address this problem and propose some solutions concerning innovative methods, derived from a multidisciplinary science perspective, to manage the data lake. The proposed methods imitate the supply chain management and natural lake principles with an emphasis on the importance of the data life cycle, to implement responsible data governance for the data lake.
Article
Full-text available
The advances in smart grids are enabling huge amounts of data to be aggregated and analyzed for various smart grid applications. However, the traditional smart grid data management systems cannot scale and provide sufficient storage and processing capabilities. To address these challenges, this paper presents a smart grid big data eco-system based on the state-of-the-art Lambda architecture that is capable of performing parallel batch and real-time operations on distributed data. Further, the presented eco-system utilizes a Hadoop Big Data Lake to store various types of smart grid data, including smart meter, image, and video data. An implementation of the smart grid big data eco-system on a cloud computing platform is presented. To test the capability of the presented eco-system, real-time visualization and data mining applications were performed on real smart grid data. The results of those applications on top of the eco-system suggest that it is capable of performing numerous smart grid big data analytics tasks.
Chapter
More and more, asset management organizations are introducing data science initiatives to support predictive maintenance and anomaly detection. Asset management organizations are by nature data-intensive, managing assets like bridges, dykes, railways and roads. For this, they often implement data lakes using a variety of architectures and technologies to store big data and facilitate data science initiatives. However, the decision outcomes of data science models are often highly reliant on the quality of the data. The data in the data lake therefore has to be of sufficient quality to develop trust by decision-makers. Not surprisingly, organizations are increasingly adopting data governance as a means to ensure that the quality of data entering the data lake is and remains sufficient, and to ensure the organization remains legally compliant. The objective of the case study is to understand the role of data governance as a success factor for data science. For this, a case study regarding the governance of data in a data lake in the asset management domain is analyzed to test three propositions contributing to the success of using data science. The results show that unambiguous ownership of the data, monitoring the quality of the data entering the data lake, and a controlled overview of standard and specific compliance requirements are important factors for maintaining data quality and compliance and building trust in data science products.
Chapter
The digital transformation leads to massive amounts of heterogeneous data challenging traditional data warehouse solutions in enterprises. In order to exploit these complex data for competitive advantages, the data lake recently emerged as a concept for more flexible and powerful data analytics. However, existing literature on data lakes is rather vague and incomplete, and the various realization approaches that have been proposed neither cover all aspects of data lakes nor do they provide a comprehensive design and realization strategy. Hence, enterprises face multiple challenges when building data lakes. To address these shortcomings, we investigate existing data lake literature and discuss various design and realization aspects for data lakes, such as governance or data models. Based on these insights, we identify challenges and research gaps concerning (1) data lake architecture, (2) data lake governance, and (3) a comprehensive strategy to realize data lakes. These challenges still need to be addressed to successfully leverage the data lake in practice.
Conference Paper
The concept of a data lake is emerging as a popular way to organize and build the next generation of systems to master new big data challenges, but there are lots of concerns and questions for large enterprises to implement data lakes. The paper discusses the concept of data lakes and shares the author's thoughts and practices of data lakes.
Conference Paper
This paper articulates data governance as one of the key issues in building an Enterprise Data Warehouse. The key goals of this document are to: define the strategy for Data Governance processes and procedures; define the scope of and identify major components of the data governance processes; adhere to enterprise Data Management standards, principles and guidelines; and articulate a vision for building, managing and safeguarding the enterprise data foundation. The client-centric focus of business organizations, coupled with aggressive attention to the bottom line, propelled initiatives such as Data Governance to the top of the list for IT and business executives. The recent financial crisis, which spawned the worldwide economic meltdown, has been to a great extent blamed on non-trustworthy and non-transparent data. It is becoming progressively and patently evident that data MUST be managed like other assets such as financial and human resources. It has to have a defined and mandated set of controls where compliance can be objectively measured and reported.