Conference Paper

An Overview of Current Trends in Data Ingestion and Integration

... The batch technique is used for data that does not need to be consumed in real time, such as a database snapshot. Streaming data ingestion, on the other hand, is used in cases where there is a need for real-time data consumption and specific technologies are used to support this type of need [7]. ...
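To make the distinction concrete, the following minimal Python sketch (not taken from the cited work; the snapshot file name and the record source are assumptions) contrasts batch ingestion of a database snapshot with record-by-record streaming ingestion.

# Minimal sketch contrasting the two ingestion styles; file name and source are placeholders.
import csv

# Batch ingestion: a database snapshot exported to CSV is loaded in one pass.
def ingest_batch(path="snapshot.csv"):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Streaming ingestion: each record is handled as soon as it arrives.
def ingest_stream(source):
    for record in source:        # `source` could be a message-queue consumer
        yield transform(record)  # hypothetical per-record transformation

def transform(record):
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}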
... Paper [7] analyzes the state of the art of data ingestion and integration, motivated by the wide range of existing tools and methods for these types of operations. Throughout the paper, the main features and technologies used are analyzed and divided into three main groups: data integration tools, stream data ingestion tools, and cloud data processing tools. ...
... However, the emergence of technology has triggered a paradigm shift, prompting researchers to explore innovative methodologies grounded in data analytics and artificial intelligence (AI). Studies like [2][3] have emphasized the effectiveness of real-time data in crime analysis, highlighting its crucial role in guiding targeted interventions. Additionally, research in the realm of crime prediction and hotspot identification [4] has laid the groundwork for applying similar methodologies to address women's safety concerns. ...
Article
Full-text available
This research addresses the critical issue of women’s safety in urban environments, emphasizing the need for innovative solutions to establish secure pathways. SafeRoutes presents a holistic approach, integrating advanced clustering methodologies and GPS technology, detailing its relevance, ideation, methodology, and anticipated results. During ideation, the team prioritized integrating cutting-edge technologies—artificial intelligence, data analytics, and cloud computing. Emphasizing the constraints of existing safety solutions, the focus was on crafting a sophisticated framework for detailed assessments and real-time risk detection during transit. SafeRoutes aims to redefine women’s safety, providing actionable insights for urban planning and law enforcement. The methodology comprises three integral components. Firstly, a robust data ingestion pipeline connects to public and government data sources, ensuring near real-time models enriched with the latest data. The second component uses unsupervised machine learning models, comparing and employing various clustering algorithms. Parameters like crime rates, police presence, and infrastructure are utilized to cluster regions based on women’s safety. Lastly, integration with map APIs and cab service vendors addresses the travel aspect, facilitating real-time alerts for deviations into unsafe areas. Results encompass a nuanced correlation matrix classifying regions based on safety clusters, offering valuable insights for urban planning and law enforcement. Integration with cab services ensures SafeRoutes not only identifies safe paths but actively contributes to enhancing women’s safety during transit. The anticipated outcome positions SafeRoutes as a pioneering solution, contributing substantially to the discourse on urban safety and establishing a benchmark for future research.
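As a purely illustrative aside (this is not the SafeRoutes implementation, and the feature values are invented), the clustering step described above could look roughly like the following scikit-learn sketch, grouping regions by safety-related parameters such as crime rate and police presence.

# Illustrative sketch only: clustering regions by assumed safety-related features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is a region: [crime_rate, police_stations_per_km2, streetlight_density]
regions = np.array([
    [4.2, 0.8, 12.0],
    [1.1, 2.5, 30.0],
    [6.7, 0.3, 8.0],
    [0.9, 3.1, 35.0],
])

features = StandardScaler().fit_transform(regions)   # put features on a common scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)   # cluster id per region, e.g. separating safer from less safe areas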
... A batch pipeline processes data from source to destination at certain predefined times or under certain conditions. Streaming data pipelines are an improvement over batch processing [9]: data is processed in real time from source to destination as it is received, which is necessary to meet demands for quick information and results. ...
Preprint
Full-text available
Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues associated with data and accomplish some data engineering tasks from data ingestion through data preparation to utilization as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT), pipelines for Data Integration, Ingestion, and Transformation, Data Pipeline Orchestration and Workflow Management, and Machine Learning Pipelines. The survey also provides a broad outline of the utilization with examples within these broad groups and finally, a discussion is presented with case studies indicating the usage of pipeline tools for data engineering. The studies present some first-user application experiences with sample data, some complexities of the applied pipeline, and a summary note of approaches to using these tools to prepare data for machine learning.
... # Kafka Producer Example
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='your_kafka_brokers')
producer.send('topic_name', value=b'your_data')  # value must be bytes unless a value_serializer is configured
Apache Flink: Apache Flink [7] excels in stream processing, offering stateful computations and event-time processing. It integrates seamlessly with Azure Event Hubs, enabling efficient and scalable data streaming workflows. ...
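As a hedged counterpart to the quoted producer snippet (topic name and broker address are the same placeholders, not real endpoints), a minimal kafka-python consumer that reads the published messages back could look as follows.

# Minimal kafka-python consumer sketch; topic and brokers are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'topic_name',
    bootstrap_servers='your_kafka_brokers',
    auto_offset_reset='earliest',   # start from the oldest retained message
)
for message in consumer:            # blocks, yielding messages as they arrive
    print(message.value)            # raw bytes, e.g. b'your_data'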
Article
Full-text available
Demand for efficient large-scale heterogeneous distributed data ingestion pipelines for transforming and publishing data that is essential for advanced analytics and machine learning models has gained substantial importance. Services increasingly rely on near-real-time signals to accurately identify or predict customer behavior, sentiments, and anomalies, facilitating data-driven decision-making. This paper delves into the forefront of distributed and parallel computing, examining the latest advancements in storage, query, and ingestion methodologies. Furthermore, it systematically assesses cutting-edge tools designed for both periodic and real-time analysis of heterogeneous data. “The data quality is more important than the Machine Learning model itself.” Achieving precision in decision-making or generating precise output from Machine Learning models necessitates a keen focus on input data quality and consistency. Building a robust ingestion platform for handling hundreds of Gigabytes/Petabytes per day involves a comprehensive understanding of the overarching architecture, the intricacies of the involved components, and the unique challenges within these ecosystems. Building a service platform demands thoughtful consideration and resolution of key aspects, including a scalable ingestion handler, a flexible and fault-tolerant data processing library, a highly scalable and resilient event system, an analytics/reporting platform, a machine learning platform, and robust application health and security measures. This paper delves into the overall architecture, explicates design choices, and imparts insights into best practices for implementing and maintaining such a platform, leveraging contemporary tools. The discussion encompasses critical aspects of the platform's functionality, emphasizing the need for scalability, flexibility, resilience, and security to meet the demands of modern data-driven decision-making scenarios.
... Flink provides a high-throughput, low-latency streaming engine with support for event processing and state management. It is fault-tolerant in the event of machine failure and supports exactly-once semantics [13,21,39,40]. ...
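A minimal PyFlink sketch, assuming the apache-flink Python package is installed, illustrates the checkpointing mechanism that underpins the fault tolerance and exactly-once behavior mentioned above; the job itself is a toy pipeline, not a production configuration.

# Minimal PyFlink sketch: enable checkpointing so operator state can be restored after failure.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(5000)      # snapshot operator state every 5 seconds

# Toy pipeline; a production job would read from Kafka or another durable source.
env.from_collection(["event-1", "event-2", "event-3"]) \
   .map(lambda e: e.upper()) \
   .print()

env.execute("checkpointed_demo_job")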
Article
Full-text available
Today, advanced websites serve as robust data repositories that constantly collect various user-centered information and prepare it for subsequent processing. The data collected can include a wide range of important information from email addresses, usernames, and passwords to demographic information such as age, gender, and geographic location. User behavior metrics are also collected, including browsing history, click patterns, and time spent on pages, as well as different preferences like product selection, language preferences, and individual settings. Interactions, device information, transaction history, authentication data, communication logs, and various analytics and metrics contribute to the comprehensive range of user-centric information collected by websites. A method to systematically ingest and transfer such differently structured information to a central message broker is thoroughly described. In this context, a novel tool—Dataphos Publisher—for the creation of ready-to-digest data packages is presented. Data acquired from the message broker are employed for data quality analysis, storage, conversion, and downstream processing. A brief overview of the commonly used and freely available tools for data ingestion and processing is also provided.
... We present a generic architecture of a data pipeline, shown in Fig. 1. Unless otherwise stated, the remaining description in this section is based on Munappy et al. (2020b), Hapke and Nelson (2020), Munappy et al. (2020c), García et al. (2016), Chapman et al. (2020), Hlupić and Puniš (2021), and Malley et al. (2016). ...
... For dataset location, discovery mechanisms are used. Those mechanisms employ graph or semantic databases that are used to implement metadata management and governance systems, often associated with the use of data lakes ( [10], [11], [12]). To simplify the queries, the solution proposed by the EFFECTOR Project uses a semantic layer which will also be described in section 3.2. ...
Article
Full-text available
Establishing an efficient information-sharing network among national agencies in the maritime domain is of essential importance in enhancing operational performance, increasing situational awareness, and enabling interoperability among all involved maritime surveillance assets. Based on various data-driven technologies and sources, the EU initiative of a Common Information Sharing Environment (CISE) enables the networked participants to exchange timely information concerning vessel traffic, joint SAR & operational missions, emergency situations, and other events at sea. In order to host and process vast amounts of vessel and related maritime data consumed from heterogeneous sources (e.g. SAT-AIS, UAV, radar, METOC), the deployment of big data repositories in the form of Data Lakes is of great added value. The different layers of the Data Lakes, with capabilities for aggregating, fusing, routing, and harmonizing data, are assisted by decision support tools combining semantic reasoning modules, aiming to provide a more accurate Common Operational Picture (COP) among maritime agencies. Based on these technologies, the aim of this paper is to present an end-to-end interoperability framework for maritime situational awareness in strategic and tactical operations at sea, developed in the EU-funded EFFECTOR project, focusing on the multilayered Data Lake capabilities. Specifically, a case study presents the important sources and processing blocks, such as the SAT-AIS, CMEMS, and UAV components, enabling maritime information exchange in CISE format and communication patterns. Finally, the technical solution is validated in the project's recently implemented maritime operational trials and the respective results are documented.
... Their key features are on-premise availability with a Graphical User Interface (GUI), batch processing, and a certain level of extensibility, with support for processing structured data and, to some extent, semi-structured and unstructured data. Stream data ingestion tools come in a variety of designs and support structured, semi-structured, and unstructured data processing [9]. ...
Conference Paper
Full-text available
In this paper, we present the core applications of data lakes and other big data infrastructure technologies for the purpose of enhancing the maritime interoperability framework and ensuring resilient collaboration among agencies. The approach is based on the deployment of multi-layered & semantically enabled Data Lakes for storing various maritime data collected from heterogeneous sensors, and on the information exchange process through the Common Information Sharing Environment (CISE) network using advanced Command and Control (C2) platforms. The results of this paper are derived from the EU-funded project EFFECTOR, highlighting the significant contribution of advanced solutions using Artificial Intelligence algorithms and supporting UAV and C2 technologies to various operations at sea. The validation survey results collected from end-users after the execution of the maritime trials are presented as well.
... The data ingest layer [14] processes the incoming data, prioritizing sources and validating the data. The data extraction can be done in a single large batch or split into several smaller batches. ...
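The split into smaller batches can be sketched as follows; the SQLite database path and the events table are assumptions made purely for illustration.

# Sketch of splitting one large extraction into several smaller batches.
import sqlite3

def extract_in_batches(db_path="source.db", batch_size=1000):
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT * FROM events")
        while True:
            rows = cursor.fetchmany(batch_size)   # one smaller batch at a time
            if not rows:
                break
            yield rows
    finally:
        conn.close()

# Usage: for batch in extract_in_batches(): load(batch)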
Conference Paper
Full-text available
Ensuring a high level of vessel traffic surveillance and maritime safety is determined by exploiting innovative ICT technologies and international cooperation among maritime authorities. Therefore, initiatives for maritime surveillance, global and regional integrations are realised through a collaborative, cost-effective and interoperable Common Information Sharing Environment (CISE). Consisting of the institutional network of maritime authorities that cooperate on various domains like safety, border control, environmental and rescue missions at sea, CISE enables the efficient transfer and economic exchange of maritime data and information via different interoperable systems using modern digital technologies. The ever-increasing amount of data received from heterogeneous data sources requires specific processing through the adoption of a Big Data framework which hosts, manages and distributes data to maritime users, contributing with great overall benefits to the CISE network core functionality. Specifically, this paper analyses the advantages of the Data Lake infrastructure, including its processes, techniques, tools and applications used to enhance maritime surveillance and safety across the CISE network. This part contains the deployment and interoperability achieved through the components of the participating command and control (C2) systems. Last, as a case study, an overview of the EU project EFFECTOR is provided which aims to demonstrate an end-to-end interoperability framework of data-driven solutions for maritime situational awareness at strategic and tactical operations.
Article
Full-text available
There are currently many pipeline tools available for data engineering, which data scientists use to tackle data wrangling issues and perform tasks ranging from data ingestion and preparation to its use as input for machine learning (ML). These tools either come with essential built-in components or can be integrated with others to achieve the desired data engineering outcomes. While some of these tools are fully or partially commercial, there is also a range of open-source options capable of handling expert-level data engineering tasks. This survey explores various categories and examples of these pipeline tools based on their design and intended data engineering functions. The categories covered include Extract Transform Load/Extract Load Transform (ETL/ELT), Data Integration, Ingestion, and Transformation pipelines, Data Pipeline Orchestration and Workflow Management, and Machine Learning Pipelines. Additionally, the survey outlines how these tools are used within these categories, providing examples and discussing case studies that showcase real-world applications. These case studies illustrate user experiences with sample data, the complexities involved in the pipelines, and summarize approaches for preparing data for machine learning.
Chapter
The Durban University of Technology is now engaged in a project to create a data lakehouse system for a Training Authority in the South African Government sector. This system is crucial for improving the monitoring and evaluation capacities of the training authority and ensuring efficient service delivery. Ensuring the high quality of the data being fed into the lakehouse is crucial, since low data quality negatively impacts the effectiveness of the lakehouse system. This chapter examines quality control methods for ingestion-layer pipelines in order to present a framework for ensuring data quality. The metrics taken into account for assessing data quality were completeness, accuracy, integrity, correctness, and timeliness. The efficiency of the framework was assessed by implementing it on a sample semi-structured dataset. Suggestions for future development include integrating data from a wider range of sources and providing triggers for incremental data ingestion.
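As a rough sketch (not the chapter's framework), two of the listed metrics, completeness and timeliness, can be computed over a small semi-structured sample like this; the field names and freshness threshold are assumptions.

# Sketch of two data-quality metrics over sample records; fields and threshold are assumed.
from datetime import datetime, timedelta, timezone

records = [
    {"id": 1, "name": "Alice", "updated_at": "2024-05-01T10:00:00+00:00"},
    {"id": 2, "name": None,    "updated_at": "2024-05-01T09:00:00+00:00"},
]

def completeness(rows, field):
    # Fraction of rows where the field is populated.
    return sum(r.get(field) is not None for r in rows) / len(rows)

def timeliness(rows, field, max_age, now=None):
    # Fraction of rows updated within the allowed freshness window.
    now = now or datetime.now(timezone.utc)
    fresh = sum(now - datetime.fromisoformat(r[field]) <= max_age for r in rows)
    return fresh / len(rows)

print(completeness(records, "name"))                           # 0.5
print(timeliness(records, "updated_at", timedelta(days=365)))  # depends on the current date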
Article
Full-text available
The need for effective large-scale heterogeneous distributed data ingestion pipelines, crucial for transforming and processing data essential to advanced analytics and machine learning models, has seen a significant surge in importance. Modern services increasingly depend on near-real-time signals to precisely identify or predict customer behavior, sentiments, and anomalies, thereby facilitating informed, data-driven decision-making [1]. In the rapidly evolving landscape of large-scale enterprise data applications, the demand for efficient data ingestion and stream processing solutions has never been more critical [2]. This technical paper introduces a groundbreaking autonomous self-schedulable library designed for recurring jobs, addressing the challenges faced by enterprises in orchestrating complex data workflows seamlessly. Leveraging authoritative expertise in building robust enterprise applications, this library provides a paradigm shift in how organizations manage and execute recurring tasks within data pipelines [3].
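For orientation only (this is not the paper's library), a self-rescheduling recurring job can be sketched with the Python standard library as follows; the 60-second interval and the ingestion task are placeholders.

# Minimal self-rescheduling recurring job built on threading.Timer.
import threading

def schedule_recurring(job, interval_seconds):
    def run():
        job()
        timer = threading.Timer(interval_seconds, run)   # reschedule itself
        timer.daemon = True                              # dies with the main thread
        timer.start()
    run()

# Example: re-run a (hypothetical) ingestion task every 60 seconds;
# a long-running service keeps the main thread alive so the timers persist.
schedule_recurring(lambda: print("ingesting latest batch..."), 60)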
Conference Paper
Full-text available
Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
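A toy Python sketch of the "unified view" idea follows; the two source schemas are invented, and real data integration additionally involves schema mappings and query rewriting, which the sketch does not address.

# Toy unified view over two assumed source schemas.
crm_customers = [{"customer_id": 1, "name": "Acme Ltd"}]
billing_accounts = [{"cust": 1, "balance": 250.0}]

def unified_customer_view():
    balances = {row["cust"]: row["balance"] for row in billing_accounts}
    for c in crm_customers:
        # Map both source schemas onto one global schema.
        yield {"id": c["customer_id"], "name": c["name"],
               "balance": balances.get(c["customer_id"])}

print(list(unified_customer_view()))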
Article
Full-text available
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the problem of the definition of ETL activities and provide formal foundations for their conceptual representation. The proposed conceptual model is (a) customized for the tracing of inter-attribute relationships and the respective ETL activities in the early stages of a data warehouse project; (b) enriched with a 'palette' of a set of frequently used ETL activities, like the assignment of surrogate keys, the check for null values, etc.; and (c) constructed in a customizable and extensible manner, so that the designer can enrich it with his own re-occurring patterns for ETL activities.
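Two of the 'palette' activities mentioned above, surrogate-key assignment and the null-value check, can be sketched as plain Python functions; this is an illustration, not the paper's formal model, and the row fields are assumptions.

# Sketch of two common ETL activities: surrogate-key assignment and a null check.
from itertools import count

_surrogate = count(start=1)

def assign_surrogate_key(row):
    return {**row, "sk": next(_surrogate)}      # warehouse-side key, independent of the source key

def check_not_null(row, fields):
    missing = [f for f in fields if row.get(f) is None]
    if missing:
        raise ValueError(f"null values in {missing}")
    return row

source_rows = [{"natural_key": "C-100", "amount": 42.0}]
loaded = [assign_surrogate_key(check_not_null(r, ["natural_key", "amount"]))
          for r in source_rows]
print(loaded)   # [{'natural_key': 'C-100', 'amount': 42.0, 'sk': 1}]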
Talend - A Cloud Data Integration Leader (modern ETL). Retrieved May 29, 2021, from https://www.talend.com/
Sqoop - The Apache Software Foundation. Retrieved May 29, 2021, from https://sqoop.apache.org/
Armbrust, M., Ghodsi, A., Xin, R., & Zaharia, M. (2021). Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. 11th Annual Conference on Innovative Data Systems Research (CIDR '21).