Article

Building LinkedIn's real-time activity data pipeline

... However, in severe physical network failure events, such network-layer rerouting can at times become impossible. Therefore, in this paper, an additional restoration mechanism at the application layer, running the Apache Kafka [6] message broker, has been proposed. The resulting mechanism forms a two-layered restoration framework, i.e., at the network layer and at the application layer. ...
... To reduce cost, a cost-effective computing board, i.e., a Raspberry Pi with a Raspberry Pi camera, has been used as the image-sending node. Road traffic images are sent periodically from the Raspberry Pis to an adaptively selectable local broker at the nearest local traffic police controller box using the Apache Kafka (hereinafter called Kafka) framework [6]. Incoming image sequences are also forwarded to the traffic data cloud, enabling the traffic police command center to monitor overall road traffic conditions and the local traffic police in neighboring areas to pull relevant traffic data for operating area-coordinated traffic controls. ...
... Kafka is designed for high throughput and low latency [6]. The typical architecture of Kafka is shown in Fig. 2. Kafka can provide delivery guarantees [9]. ...
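As an illustration of the throughput/latency/delivery-guarantee trade-off described in the snippet above, a minimal producer sketch, assuming the kafka-python client, a broker at localhost:9092, and an illustrative topic name:

    # Sketch: producer tuned for the throughput/latency/delivery trade-off (kafka-python assumed).
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        acks="all",            # delivery guarantee: wait for all in-sync replicas
        linger_ms=5,           # small batching delay to raise throughput
        batch_size=64 * 1024,  # bytes per partition batch
        retries=3,
    )
    future = producer.send("traffic-images", b"<jpeg bytes>")  # illustrative topic and payload
    record_metadata = future.get(timeout=10)  # blocks until acknowledged, or raises on failure
    producer.flush()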
Article
Full-text available
In this paper, the authors have designed and implemented a prototype for a near real-time wireless image sequence streaming cloud with two-layered restoration for a road traffic monitoring application on a small-scale network. Since the proposed design is targeted at outdoor deployment, where link or node failures can occur, fault-tolerant capability must be considered, and having only one layer of restoration may not provide good quality of service. Therefore, a two-layer restoration framework is designed in the proposed system: restoration at the network layer using the underlying software-defined wireless mesh network (SDWMN) capability, and restoration at the application layer through local broker selection over the Apache Kafka framework. The monitoring application's performance has been investigated in terms of end-to-end average latency and image loss percentage through outdoor testing for 13 hours, from 5:40 P.M. on 17th November 2020 to 6:40 A.M. on 18th November 2020. The end-to-end average latency and image loss percentage were found to be within acceptable limits, i.e., less than 5 seconds on average with approximately 10% image loss. The proposed system has also been compared with a traditional ad-hoc network running an OLSR-based network layer in terms of rerouting time, restoration time and end-to-end average latency. Based on an emulated wireless network in a controlled laboratory environment, the proposed SDWMN-based system outperforms the conventional OLSR-based system, with potentially faster rerouting/restoration time due to SDN central controllability and only marginally increased end-to-end average latency after rerouting/restoration completes. Algorithm complexity analysis has also been given for both systems. Both the experimental and the complexity analysis results thus suggest the practical applicability of the proposed system. Given these promising results, future work is recommended to develop the prototype design into an actual deployment for daily traffic monitoring operations.
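A minimal sketch of the application-layer half of such a restoration scheme, i.e. adaptively picking a reachable local broker before (re)creating the producer; the broker addresses and the connect-time heuristic are assumptions for illustration, not the paper's algorithm:

    # Sketch: adaptively (re)selecting the nearest reachable local broker (kafka-python assumed).
    import socket
    import time
    from kafka import KafkaProducer

    CANDIDATE_BROKERS = ["10.0.0.11:9092", "10.0.0.12:9092", "10.0.0.13:9092"]  # illustrative

    def connect_time(addr, timeout=1.0):
        # Crude proximity/health probe: time to open a TCP connection, inf if unreachable.
        host, port = addr.split(":")
        start = time.monotonic()
        try:
            socket.create_connection((host, int(port)), timeout=timeout).close()
            return time.monotonic() - start
        except OSError:
            return float("inf")

    def make_producer():
        best = min(CANDIDATE_BROKERS, key=connect_time)  # application-layer broker choice
        return KafkaProducer(bootstrap_servers=best, acks=1)

    producer = make_producer()
    # On send errors the application can call make_producer() again (restoration at the
    # application layer), while the SDWMN reroutes underneath at the network layer.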
... Permissioned blockchains, on the other hand, assume the existence of a trusted membership management service, and that the nodes participating in the infrastructure are all certified by this authority. This allows the use of classical byzantine fault-tolerant consensus algorithms that assume identifiable participants, called orderers, such as PBFT [11], Zyzzyva [32], or BFT-Smart [7], or CFT protocols such as primary-backup replication [22], [26]. Examples of permissioned blockchains are Hyperledger Fabric [2], Tendermint [10] and Corda [24]. ...
... Fabric partitions the introduction of new blocks to the chain, which is handled by orderers, from the validation and execution of chaincodes, handled by peers. This partition enables orderers to use pluggable consensus protocols such as a crash fault-tolerant service based on Apache Kafka [22], or a byzantine fault-tolerant service using BFT-Smart [7], [45]. ...
... We use version 1.2 of Fabric. We set up a network of 100 peers belonging to a single organization, one client, and a CFT ordering service consisting of four Kafka [22] nodes and three Zookeeper [26] nodes, corresponding to the default configuration for a Kafka-based setup. All nodes are deployed on a cluster of 15 servers equipped with 8-core L5420 Intel® Xeon® CPUs at 2.5 GHz and 8 GB of RAM, interconnected using 1 Gbps Ethernet. ...
Preprint
Full-text available
Permissioned blockchains are supported by identified but individually untrustworthy nodes, collectively maintaining a replicated ledger whose content is trusted. The Hyperledger Fabric permissioned blockchain system targets high-throughput transaction processing. Fabric uses a set of nodes tasked with the ordering of transactions using consensus. Additional peers endorse and validate transactions, and maintain a copy of the ledger. The ability to quickly disseminate new transaction blocks from ordering nodes to all peers is critical for both performance and consistency. Broadcast is handled by a gossip protocol, using randomized exchanges of blocks between peers. We show that the current implementation of gossip in Fabric leads to heavy tail distributions of block propagation latencies, impacting performance, consistency, and fairness. We contribute a novel design for gossip in Fabric that simultaneously optimizes propagation time, tail latency and bandwidth consumption. Using a 100-node cluster, we show that our enhanced gossip allows the dissemination of blocks to all peers more than 10 times faster than with the original implementation, while decreasing the overall network bandwidth consumption by more than 40%. With a high throughput and concurrent application, this results in 17% to 36% fewer invalidated transactions for different block sizes.
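To make concrete why fan-out and random peer selection drive dissemination time and tail latency, a small push-gossip simulation; this is pure Python for illustration, not Fabric's gossip implementation, and peer count and fan-out are made up:

    # Sketch: rounds of push gossip until every peer holds the new block (simulation only).
    import random

    def gossip_rounds(n_peers=100, fanout=3, seed=1):
        random.seed(seed)
        have_block = {0}                      # the block starts at one peer (fed by the orderers)
        rounds = 0
        while len(have_block) < n_peers:
            rounds += 1
            for p in list(have_block):
                # Each informed peer pushes the block to `fanout` randomly chosen peers.
                for q in random.sample(range(n_peers), fanout):
                    have_block.add(q)
        return rounds

    print(gossip_rounds())  # number of rounds until full dissemination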
... Machine learning algorithms are sensitive to data on the fly, i.e., streaming data; non-stationary data is a challenge for machine learning algorithms. In [43] the author gives a comparative overview of batch processing paradigms versus streaming processing theory and also reviews the mechanisms intended to verify the effectiveness of streaming technology. Representative streaming processing systems include "Borealis" [44], S4 [45], "Kafka" [46], and several other contemporary architectures proposed to provide real-time analytics on Big Data [46]-[48]. ▪ With Big Data come uncertainty and incompleteness, as it is collected from incongruent sources, making it difficult and practically impossible to process with machine learning algorithms, which in the past were usually provided with fairly clean data from recognized and quite restricted sources, making the learning outcomes dependable. ...
... For example, Sun et al. [51] formulated a local-learning-based feature selection algorithm for supporting multifaceted dimensional data analysis. The prevailing representative machine learning algorithms for data dimensionality reduction include "Principal Component Analysis (PCA)", "Linear Discriminant Analysis (LDA)", "Locally Linear Embedding (LLE)", and "Laplacian Eigenmaps" [46]. Low-rank matrix methods are also employed to manage wide-scale data analysis and dimensionality reduction [52], [53]. ...
Article
Big Data is a buzzword affecting nearly every domain and providing a new set of opportunities for the development of the knowledge discovery process. It comes with challenges such as abundance, extensiveness and diversity, timeliness and dynamism, messiness and vagueness, and with uncertainty, since not all generated data relates to a specific question and may be associated with other processes or activities. These challenges certainly cannot be handled by traditional infrastructures, platforms and frameworks. New analytical techniques and high-performance computing architectures came into the picture to handle this explosion. These platforms and architectures give a cutting edge to the Big Data knowledge discovery process by using Artificial Intelligence, Machine Learning and Expert Systems. This study encompasses a comprehensive review of Big Data analytical platforms and frameworks with their comparative analysis. A knowledge discovery architecture for Big Data analytics is also proposed, considering the fundamental aspect of gaining insights from Big Data sets, and the focus of this analysis is to present the open challenges associated with these techniques and future research directions.
... Organizations are increasingly using data for decision-making, generating reports, training Machine Learning (ML) and deep learning models, and gaining insights (Goodhope et al., 2012;Jovanovic et al., 2021). However, the effectiveness of data solutions relies heavily on the quality of the data (Cai & Zhu, 2015). ...
... Therefore, merging data from various sources and analyzing them with Data Mining (DM) and ML techniques increases the possibility for a teacher to get a holistic understanding of the academic progress of students. ML techniques, such as classification and clustering (Geron, 2017; Mohseni et al., 2020), can be applied to identify patterns and trends within the data that may not be readily apparent through manual analysis alone and to predict students' outcomes. By leveraging DM algorithms (Hand et al., 2001) to sift through educational datasets, teachers can gain deeper insights into student performance, learning preferences, and potential areas for intervention or enrichment. ...
Article
Full-text available
There is a significant amount of data available about students and their learning activities in many educational systems today. However, these datasets are frequently spread across several different digital services, making it challenging to use them strategically. In addition, there are no established standards for collecting, processing, analyzing, and presenting such data. As a result, school leaders, teachers, and students do not capitalize on the possibility of making decisions based on data. This is a serious barrier to the improvement of work in schools, teacher and student progress, and the development of effective Educational Technology (EdTech) products and services. Data standards can be used as a protocol on how different IT systems communicate with each other. When working with data from different public and private institutions simultaneously (e.g., different municipalities and EdTech companies), having a trustworthy data pipeline for retrieving the data and storing it in a secure warehouse is critical. In this study, we propose a technical solution containing a data pipeline by employing a secure warehouse—the Swedish University Computer Network (SUNET), which is an interface for information exchange between operational processes in schools. We conducted a user study in collaboration with four municipalities and four EdTech companies based in Sweden. Our proposal involves introducing a data standard to facilitate the integration of educational data from diverse resources in our SUNET drive. To accomplish this, we created customized scripts for each stakeholder, tailored to their specific data formats, with the aim of merging the students’ data. The results of the first four steps show that our solution works. Once the results of the next three steps are in, we will contemplate scaling up our technical solution nationwide. With the implementation of the suggested data standard and the utilization of the proposed technical solution, diverse stakeholders can benefit from improved management, transportation, analysis, and visualization of educational data.
... For the modeling of fault-tolerant data pipelines, this step is useful to inform about the actual and desired characteristics of data pipelines. At the same time, we did a literature review on fault-tolerant data pipelines implemented at large-scale industries like Google [18], Microsoft [19], Facebook [6] and LinkedIn [20]. Learning: Customers return the products from Company A to a screening center when they detect issues. ...
... Dirty data: statistical methods, data imputation techniques. The study in [20] by K. Goodhope et al. describes the building of a real-time activity data pipeline at LinkedIn. ...
... Helu et al. [12] propose a scalable DPS for IIoT, handling large, high-velocity data. Goodhope et al. [11] describe LinkedIn's real-time DPS using Apache Kafka for high-throughput and low-latency processing, emphasizing fault tolerance and scalability. Poojara et al. [15] propose a serverless DPS for IoT in fog and cloud computing, improving latency. ...
Conference Paper
Full-text available
This paper introduces the design of a data pipeline system (DPS) integrated with artificial intelligence functions (AIFs) to support continuous AI learning and operations for network automation in 5G/6G systems. We design the DPS as a chain of functions, namely the ingress and egress Network Data Broker Functions (iNDBF and eNDBF) and the Network Data Preprocessing Function (NDPPF), to support in-network learning operations. To take into account the distributed nature of the network architecture of 5G systems and beyond, we conceive the DPS to be integrated seamlessly with distributed learning frameworks such as federated learning (FL). We performed a realistic evaluation, employing a real dataset from a national mobile operator to simulate the network architecture. Additionally, an FL framework for anomaly detection is integrated with the DPS to assess the effectiveness of our proposal. Evaluation results show that delays in end-to-end data transmission and preprocessing to the AIF locations can cause distributed-learning AIFs to work with stale data. The results also highlight how the DPS can counterbalance the delays that lead to desynchronisation of the distributed learning process, resulting in AIFs with higher accuracy.
... This streaming processing technology and theory have been studied for decades. Representative open source systems include S4 [16], Storm [17], and Kafka [18]. The streaming processing paradigm is applied for online applications, usually at the second, or even millisecond, level. ...
Article
Full-text available
Big data has drawn great attention from researchers in information sciences as well as from policy and decision makers in enterprises and government. There is much potential and useful hidden value in the enormous volume of data. Big data is tremendously valuable for driving productivity and evolutionary breakthroughs in scientific disciplines and businesses, and it opens up many opportunities for great progress in many domains. However, big data also comes with many challenges, including data collection, data analysis, data storage and data visualization. This paper provides a review of big data definitions, layered architecture and common big data challenges. There is no doubt that the future of technologies and business productivity will converge on the exploration of big data.
... There are different distributed messaging systems: Apache Kafka [35], RabbitMQ [36], JMS (Java Message Service) [37], ActiveMQ [38], ZeroMQ [39], and Kestrel [40]. Apache Kafka is a messaging platform designed to handle more than 10 billion message writes per day, with a peak load of 172,000 messages per second [41]. Although Kafka's original purpose was log processing, it is also used in different scenarios. ...
Article
Full-text available
Vehicle tracking and license plate recognition (LPR) over video surveillance cameras are essential components of intelligent traffic monitoring systems. Due to the enormous amount of data collected each day, it would be difficult to track vehicles by license plate in a real-world traffic setting. Large volumes of data processing, real-time request responses, and emergency scenario response may not be possible using conventional approaches. By combining license plate recognition with a docker container-based Apache Kafka node ecosystem, the suggested solution takes a novel approach to vehicle tracking. The primary components of our suggested framework for reading license plates are the identification of license plates and text data queries. License plate localization is performed with You Only Look Once version 3 (YOLOv3) and character recognition with Optical Character Recognition (OCR). The detected vehicle images with license plate results are published on related topics with Apache Kafka. Apache Kafka is a publish-subscribe (producers-consumers) messaging system and one of the most popular architectures used for streaming data. For each license plate search, a topic is created in the framework, where producers publish and consumers receive data. Thus, the workload of the operators is reduced and they can pay attention to more important events in traffic.
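A hedged sketch of the per-search-topic pattern this abstract describes, assuming the kafka-python client; the topic naming convention, broker address, and message fields are illustrative, not the paper's implementation:

    # Sketch: publish plate detections and consume a per-search topic (kafka-python assumed).
    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    detection = {"plate": "34ABC123", "camera": "cam-07", "ts": 1700000000}  # illustrative fields
    producer.send("plate-search-34ABC123", detection)   # one topic per searched plate (assumed scheme)
    producer.flush()

    consumer = KafkaConsumer(
        "plate-search-34ABC123",
        bootstrap_servers="localhost:9092",
        group_id="operators",                            # consumer group of traffic operators
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        print(msg.value)   # notify the operator watching this plate
        break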
... The data pipeline utilizes activity data in the form of log or event messages that capture user and server activity. These messages are vital for various internet systems, including advertising, relevance, search, recommendation systems, security, and analytics [15]. ...
Article
Full-text available
Data pipelines are crucial for processing and transforming data in various domains, including finance, healthcare, and e-commerce. Ensuring the reliability and accuracy of data pipelines is of utmost importance to maintain data integrity and make informed business decisions. In this paper, we explore the significance of continuous monitoring in data pipelines and its contribution to data observability. This work discusses the challenges associated with monitoring data pipelines in real time, proposes a framework for real-time monitoring, and highlights its benefits in enhancing data observability. The findings of this work emphasize the need for organizations to adopt continuous monitoring practices to ensure data quality, detect anomalies, and improve overall system performance.
... While many contributions focused on specific application domains, e.g., manufacturing (O'Donovan et al., 2015;Frye and Schmitt, 2020), others took a more generic approach (Von Landesberger et al., 2017;Munappy et al., 2020a). Further, there are a number of studies that share experiences (e.g., lessons learned, challenges) about engineering data pipelines (Goodhope et al., 2012;Tiezzi et al., 2020;Munappy et al., 2020b). ...
... Apache Kafka [13] is a distributed message queue which aims to provide unified, high-throughput, low-latency real-time data management. Intuitively, producers emit messages which are categorized into appropriate topics. ...
Article
Full-text available
Reasoning over semantically annotated data is an emerging trend in stream processing aiming to produce sound and complete answers to a set of continuous queries. It usually comes at the cost of finding a trade-off between data throughput, latency and the cost of expressive inferences. Strider R proposes such a trade-off and combines a scalable RDF stream processing engine with an efficient reasoning system. The main reasoning services are based on a query rewriting approach for SPARQL that benefits from an intelligent encoding of an extension of the RDFS (i.e., RDFS with owl:sameAs) ontology elements. Strider R runs in production at a major international water management company to detect anomalies from sensor streams. The system is evaluated along different dimensions and over multiple datasets to emphasize its performance.
... Their requirement was a system capable of processing 10 billion messages per day, with demand peaking at 172,000 messages per second. Initially, they considered ActiveMQ, but decided to develop Kafka [12], which was subsequently made open source. ...
Article
Full-text available
Within a supply chain organisation, where millions of messages are processed, reliability and performance of message throughput are important. Problems can occur with the ingestion of messages; if they arrive more quickly than they can be processed, they can cause queue congestion. This paper models Electronic Data Interchange (EDI) messages. We sought to understand how DevOps should best model these messages for performance testing and how best to apply smart EDI content awareness that enhances the realms of Ambient Intelligence (AmI) within a business-to-business (B2B) supply chain organisation. We considered key performance indicators (KPIs) for over- or under-utilisation of these queueing systems. We modelled message service and inter-arrival times, partitioned data along various axes to facilitate statistical modelling and used continuous parametric and non-parametric techniques. Our results include the best fit for parametric and non-parametric techniques. We noted that a one-size-fits-all model is inappropriate for this heavy-tailed enterprise dataset. Our results showed that parametric distribution models were suitable for modelling the distribution's tail, whilst non-parametric kernel density estimation models were better suited for modelling the head of the distribution. Depending on how we partitioned our data along the axes, our data suffer from quantisation noise.
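As a sketch of the head/tail split these results describe, one could fit a kernel density model to the head and a parametric tail model to synthetic service times; the threshold, the generalized Pareto choice, and the libraries (scipy, scikit-learn) are assumptions, not the paper's exact models:

    # Sketch: non-parametric head (KDE) plus parametric tail fit on synthetic service times.
    import numpy as np
    from scipy import stats
    from sklearn.neighbors import KernelDensity

    service_times = np.random.lognormal(mean=0.0, sigma=1.0, size=10_000)  # synthetic, heavy-tailed
    threshold = np.quantile(service_times, 0.9)                            # head/tail split (illustrative)

    head = service_times[service_times <= threshold]
    tail = service_times[service_times > threshold]

    # Non-parametric head: Gaussian kernel density estimate.
    kde = KernelDensity(kernel="gaussian", bandwidth=0.1).fit(head.reshape(-1, 1))

    # Parametric tail: generalized Pareto fit on exceedances over the threshold.
    c, loc, scale = stats.genpareto.fit(tail - threshold)
    print(f"tail shape={c:.3f}, scale={scale:.3f}")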
... The data pipeline's components can automate the operations of extracting, processing, integrating, validating, and loading data. Data pipelines can process different types of data such as continuous, intermittent, and batch data (Goodhope et al., 2012). Moreover, data pipelines eliminate errors and accelerate the end-to-end data processes, which in turn reduces the latency in the development of data products. ...
Article
Deep learning (DL) based software systems are difficult to develop and maintain in industrial settings due to several challenges. Data management is one of the most prominent challenges which complicates DL in industrial deployments. DL models are data-hungry and require high-quality data. Therefore, the volume, variety, velocity, and quality of data cannot be compromised. This study aims to explore the data management challenges encountered by practitioners developing systems with DL components, identify the potential solutions from the literature and validate the solutions through a multiple case study. We identified 20 data management challenges experienced by DL practitioners through a multiple interpretive case study. Further, we identified 48 articles through a systematic literature review that discuss the solutions for the data management challenges. With the second round of multiple case study, we show that many of these solutions have limitations and are not used in practice due to a combination of four factors: high cost, lack of skill-set and infrastructure, inability to solve the problem completely, and incompatibility with certain DL use cases. Thus, data management for data-intensive DL models in production is complicated. Although the DL technology has achieved very promising results, there is still a significant need for further research in the field of data management to build high-quality datasets and streams that can be used for building production-ready DL systems. Furthermore, we have classified the data management challenges into four categories based on the availability of the solutions.
... The data pipeline's components can automate the operations of extracting, processing, integrating, validating, and loading data [88]. Data pipelines can process different types of data such as continuous, intermittent, and batch data [89]. Moreover, data pipelines eliminate errors and accelerate the end-to-end data processes, which in turn reduces the latency in the development of data products. ...
Preprint
Deep learning (DL) based software systems are difficult to develop and maintain in industrial settings due to several challenges. Data management is one of the most prominent challenges which complicates DL in industrial deployments. DL models are data-hungry and require high-quality data. Therefore, the volume, variety, velocity, and quality of data cannot be compromised. This study aims to explore the data management challenges encountered by practitioners developing systems with DL components, identify the potential solutions from the literature and validate the solutions through a multiple case study. We identified 20 data management challenges experienced by DL practitioners through a multiple interpretive case study. Further, we identified 48 articles through a systematic literature review that discuss the solutions for the data management challenges. With the second round of multiple case study, we show that many of these solutions have limitations and are not used in practice due to a combination of four factors: high cost, lack of skill-set and infrastructure, inability to solve the problem completely, and incompatibility with certain DL use cases. Thus, data management for data-intensive DL models in production is complicated. Although the DL technology has achieved very promising results, there is still a significant need for further research in the field of data management to build high-quality datasets and streams that can be used for building production-ready DL systems. Furthermore, we have classified the data management challenges into four categories based on the availability of the solutions.
... One of these circumstances could be system growth, or real-time monitoring, in which some other way of transmitting data from CA to SA should be considered. Some authors suggest implementing Kafka technology, originally developed by LinkedIn to monitor and build its user activity pipeline and to make decisions about real-time user content and ads [21]. Also, various back-end services for real-time processing and monitoring of user activities can process the stream as data arrives. ...
Conference Paper
In this paper, a possible solution for the architecture of a remote employee monitoring system is proposed. Development of such a system is particularly interesting at the time of the Covid-19 pandemic, when many employees work from home. Today's monitoring systems are mainly based on time tracking and screen monitoring features, without further in-depth analysis of the collected data. Therefore, the aim of this paper is to present an architecture solution that will support not only data collection, processing and visualization, but also the application of Machine learning algorithms that will perform more complex and deeper analyses. As a result, the proposed architecture will provide a clear insight into the structure of the system before the beginning of its development.
... Once researchers identify a target outcome to forecast, features must be acquired and amalgamated from independent corpora into a single corpus. To manage the copious amounts of data acquired by different industries, computational scientists rely on the development of scalable and efficient architectures to expedite the process of querying and processing data for performance-based operational pipelines (Goodhope et al. 2012; Yang et al. 2017). Moreover, researchers seek to accumulate diverse data types from different stakeholders and sources to improve model performance and make data-driven decisions (Erwin et al. 2014; Radiuk 2017). ...
Thesis
Monte Carlo simulation studies are used to examine how eight factors impact predictions of a binary target outcome in data science pipelines: (1) the choice of four DMMs [Logistic Regression (LR), Elastic Net Regression (GLMNET), Random Forest (RF), Extreme Gradient Boosting (XGBoost)], (2) the choice of three filter preprocessing feature selection techniques [Correlation Attribute Evaluation (CAE), Fisher’s Scoring Algorithm (FSA), Information Gain Attribute Evaluation (IG)], (3) number of training observations, (4) number of features, (5) error of measurement, (6) class imbalance magnitude, (7) missing data pattern, and (8) feature selection cutoff. The findings are consistent with literature about which data properties and algorithms perform best. Measurement error negatively impacted pipeline performance across all factors, DMMs, and feature selection techniques. A larger number of training observations ameliorated the decrease in predictive efficacy resulting from measurement error, different class imbalance magnitudes, missing data patterns, and feature selection cutoffs. GLMNET significantly outperformed all other DMMs, while CAE and FSA enhanced the performance of LR and GLMNET. A consensus ranking methodology integrating feature selection with cross-validation is presented. As an application, the data pipeline was used to forecast the performance of 3,225 students enrolled in a collegiate biology course using a corpus of 57 university and course-specific features at four time points (pre-course, weeks 3, 6, and 9). Borda’s method applied during cross-validation identified collegiate academic attributes and performance on concept inventory assessments as the primary features impacting student success. Performance variability of the pipeline was generally consistent with the results of the simulation studies. GLMNET exhibited the highest predictive efficacy with the least amount of variability in the area under the curve (AUC) metric. However, increasing the number of training observations did not always significantly enhance pipeline performance. The benefits of developing interpretable data pipelines are also discussed.
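A minimal sketch of a Borda-style consensus ranking over per-fold feature rankings, as mentioned in the abstract above; the feature names and fold rankings are invented, and the thesis's actual cross-validation integration is more elaborate:

    # Sketch: Borda-count consensus ranking of features across cross-validation folds.
    from collections import defaultdict

    fold_rankings = [                      # best-to-worst feature ranking per CV fold (illustrative)
        ["gpa", "concept_inventory", "attendance", "age"],
        ["concept_inventory", "gpa", "age", "attendance"],
        ["gpa", "attendance", "concept_inventory", "age"],
    ]

    scores = defaultdict(int)
    for ranking in fold_rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - 1 - position    # Borda points: n-1 for first place, 0 for last

    consensus = sorted(scores, key=scores.get, reverse=True)
    print(consensus)                               # -> ['gpa', 'concept_inventory', 'attendance', 'age']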
... They needed a system that could process 10 billion messages per day with a peak demand of 172,000 messages per second. They initially explored ActiveMQ, but instead, they developed Kafka [45]. ...
Conference Paper
Full-text available
The electronic exchange of business documents between a customer and their trading partner uses Electronic Data Interchange (EDI) standards that incorporate various document types, from purchase orders to shipping notices, between the engaging parties. Simulating the behaviour of EDI messages within a supply chain network's queuing system has many purposes, from understanding the efficiency and effectiveness of queue behaviour to process re-engineering and queue optimisation stratification. These different types of EDI transactions are heterogeneous in nature and challenging to model. This research investigates whether a parametric or non-parametric approach is appropriate to model message service times (ST). Our results show that parametric distribution models are suitable for modelling the distribution's tail, whilst non-parametric Kernel Density Estimation (KDE) models are ideal for modelling the head.
... A number of studies (Molina et al., 2018;Shen et al., 2019) employ Apache Kafka (Goodhope et al., 2012;Kafka, 2014) as a storage layer for events, and the same approach is taken to use Kafka as a fast, append-only log for the events generated by Debezium. For this purpose, Kafka is configured with a single topic in the OLTP database, in turn having one partition. ...
Article
Full-text available
In database management systems (DBMSs), query workloads can be classified as online transactional processing (OLTP) or online analytical processing (OLAP). These often run within separate DBMSs. In hybrid transactional and analytical processing (HTAP), both workloads may execute within the same DBMS. This article shows that it is possible to run separate OLTP and OLAP DBMSs, and still support timely business decisions from analytical queries running off fresh transactional data. Several setups to manage OLTP and OLAP workloads are analysed. Then, benchmarks on two industry standard DBMSs empirically show that, under an OLTP workload, a row-store DBMS sustains a 1000 times higher throughput than a columnar DBMS, whilst OLAP queries are more than 4 times faster on a columnar DBMS. Finally, a reactive streaming ETL pipeline is implemented which connects these two DBMSs. Separate benchmarks show that OLTP events can be streamed to an OLAP database within a few seconds.
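A hedged sketch of the streaming-ETL idea described above: draining change events from a single-partition Kafka topic and batch-loading them into an analytical store. Here sqlite3 merely stands in for the columnar OLAP DBMS, and the topic name, schema, and kafka-python client are illustrative assumptions:

    # Sketch: micro-batch ETL step from a Kafka change-event log into an analytical table.
    import json
    import sqlite3
    from kafka import KafkaConsumer

    olap = sqlite3.connect("olap.db")                 # stand-in for the OLAP database
    olap.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")

    consumer = KafkaConsumer(
        "oltp.changes",                               # single topic, single partition: ordered log
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest",
        consumer_timeout_ms=1000,                     # stop iterating once the log is drained
    )

    batch = [(e.value["id"], e.value["amount"]) for e in consumer]   # drain available events
    olap.executemany("INSERT INTO orders VALUES (?, ?)", batch)      # bulk load into OLAP side
    olap.commit()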
... Data analysis techniques like neural networks, data mining, machine learning, signal processing, and visualization methods demand high-quality data [2]. Recently, many organizations have begun implementing advanced, data-driven, and real-time analytics for both operational and strategic decision making [3,9]. Machine learning algorithms are the foundation for such initiatives [4]. ...
Chapter
Data pipelines play an important role throughout the data management process whether these are used for data analytics or machine learning. Data-driven organizations can make use of data pipelines for producing good quality data applications. Moreover, data pipelines ensure end-to-end velocity by automating the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. However, the robustness of data pipelines is equally important since unhealthy data pipelines can add more noise to the input data. This paper identifies the essential elements for a robust data pipeline and analyses the trade-off between data pipeline robustness and complexity.
... In cloud computing, scalable Pub/Sub systems are used to connect multiple servers with low latency [11-13]. Apache Kafka [11], which is developed by LinkedIn, can allocate multiple brokers for each topic and supports consumer groups. Each record published to a topic is delivered to one consumer in a group. ...
Article
Full-text available
The importance of real‐time notification has been growing for social services and Intelligent Transporting System (ITS). As an advanced version of Pub/Sub systems, publish‐process‐subscribe systems, where published messages are spooled and processed on edge servers, have been proposed to achieve data‐driven intelligent notifications. In this paper, we present a system that allows a topic to be managed on multiple edge servers so that messages are processed near the publishers, even when publishers spread over a wide area. Duplicating messages on geographically distributed servers could enable immediate notification to neighboring subscribers. However, the duplicated message spool may cause exhaustion of resources. We prepare a formal model of our publish‐process‐subscribe system and formulate the topic allocation as an optimization problem under the resource constraints of edge servers. As the optimization problem is NP‐hard, we propose heuristics leveraging the locality and the pub/sub relationships observed between clients to use the edge server resources efficiently. Our performance evaluation shows that our method reduces the delay to deliver notifications and the effectiveness of the strategy exploiting the relationships between clients.
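A much-simplified greedy sketch of allocating topics to edge servers under capacity constraints, preferring the server nearest each topic's publishers; all loads, capacities, and the heuristic itself are illustrative stand-ins for the paper's optimization formulation:

    # Sketch: greedy topic-to-edge-server allocation under resource constraints.
    topics = {"t1": {"load": 4, "near": "edge-A"},
              "t2": {"load": 3, "near": "edge-B"},
              "t3": {"load": 5, "near": "edge-A"}}
    capacity = {"edge-A": 6, "edge-B": 8}             # spool capacity per edge server (made up)

    allocation = {}
    for name, t in sorted(topics.items(), key=lambda kv: -kv[1]["load"]):  # heaviest topics first
        preferred = t["near"]
        # Fall back to any server with spare capacity if the preferred one is full.
        candidates = [preferred] + [s for s in capacity if s != preferred]
        for server in candidates:
            if capacity[server] >= t["load"]:
                allocation[name] = server
                capacity[server] -= t["load"]
                break

    print(allocation)   # -> {'t3': 'edge-A', 't1': 'edge-B', 't2': 'edge-B'}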
... In the same way, individual microservices communicate with each other asynchronously. Apache Kafka as the selected messaging system is proven for high throughput and low latency (Goodhope et al. 2012). Within microservices, we process data using stream processing techniques (Cugola and Margara 2012). ...
Article
Full-text available
The Internet of Things adoption in the manufacturing industry allows enterprises to monitor their electrical power consumption in real time and at machine level. In this paper, we follow up on such emerging opportunities for data acquisition and show that analyzing power consumption in manufacturing enterprises can serve a variety of purposes. In two industrial pilot cases, we discuss how analyzing power consumption data can serve the goals reporting, optimization, fault detection, and predictive maintenance. Accompanied by a literature review, we propose to implement the measures real-time data processing, multi-level monitoring, temporal aggregation, correlation, anomaly detection, forecasting, visualization, and alerting in software to tackle these goals. In a pilot implementation of a power consumption analytics platform, we show how our proposed measures can be implemented with a microservice-based architecture, stream processing techniques, and the fog computing paradigm. We provide the implementations as open source as well as a public show case allowing to reproduce and extend our research.
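A minimal sketch of the temporal-aggregation and anomaly-detection measures on a stream of per-machine power readings; the values are synthetic, and a real deployment would run this logic inside a stream-processing microservice rather than a plain loop:

    # Sketch: one-minute aggregation plus a simple 3-sigma anomaly check on power readings.
    from collections import deque
    from statistics import mean, pstdev

    window = deque(maxlen=60)          # last 60 one-second readings (watts)

    def on_reading(watts):
        window.append(watts)
        if len(window) < window.maxlen:
            return None                                  # not enough history yet
        avg, sd = mean(window), pstdev(window)
        minute_aggregate = {"avg_w": avg, "max_w": max(window)}
        anomaly = sd > 0 and abs(watts - avg) > 3 * sd   # crude spike detector
        return minute_aggregate, anomaly

    for w in [500, 510, 495] * 20 + [900]:               # a spike at the end
        result = on_reading(w)
    print(result)                                        # aggregate plus anomaly flag for the spike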
... In cloud computing, scalable Pub/Sub systems are used to connect multiple servers with low latency [11-13]. Apache Kafka [11], which is developed by LinkedIn, can allocate multiple brokers for each topic and supports consumer groups. Each record published to a topic is delivered to one consumer in a group. ...
Preprint
Full-text available
The importance of real-time notification has been growing for social services and Intelligent Transporting System (ITS). As an advanced version of Pub/Sub systems, publish-process-subscribe systems, where published messages are spooled and processed on edge servers, have been proposed to achieve data-driven intelligent notifications. In this paper, we present a system that allows a topic to be managed on multiple edge servers so that messages are processed near the publishers, even when publishers are spread over a wide area. Duplicating messages on geographically distributed servers could enable immediate notification to neighboring subscribers. However, the duplicated message spool may cause exhaustion of resources. We prepare a formal model of our publish-process-subscribe system and formulate the topic allocation as an optimization problem under the resource constraints of edge servers. As the optimization problem is NP-hard, we propose heuristics leveraging the locality and the pub/sub relationships observed between clients to use the edge server resources efficiently. Our performance evaluation shows that our method reduces the delay to deliver notifications and the effectiveness of the strategy exploiting the relationships between clients.
... Many organizations, including optical network operators, are building in-house data analytics teams and the infrastructure needed to acquire, store, and manage huge datasets. These often include technologies such as Kafka [36] and concepts such as data lakes [37]. Vendor-specific software modules will be needed to interface with each vendor's management system until the network information models and interfaces to the physical optical network provide sufficient access in a standardized way [7]. ...
Article
Full-text available
The network operator’s call for open, disaggregated optical networks to accelerate innovation and reduce cost, make progress in the standardization of interfaces, and raise telemetry capabilities in optical network systems has created an opportunity to adopt a new paradigm for optical network design. This new paradigm is driven by direct measurement and continuous learning from the actual optical hardware deployed in the field. We report an approach towards practical, vendor-agnostic, real-time optical network design and network management using a combination of two learning models. We generalize our physics-based optical model parameter estimation algorithm using the extended Kalman state estimation theory and, for the first time, to the best of our knowledge, present results using real optical network field data. An observed 0.3 dB standard deviation of the difference between typical predicted and measured signal quality appears mostly attributable to transponder performance variance. We further propose using the physics-based optical model parameter values as inputs to a second learning model with a recurrent neural network such as a gated recurrent unit (GRU) to allocate the appropriate required optical margin relative to the typical signal quality predicted by the physics-based optical model. A proof of concept shows that for a dataset of 3000 optical connections with a wide variety of amplified spontaneous emission noise and nonlinear noise limited conditions, a 10-hidden-unit 2-layer GRU was sufficient to realize a margin prediction error standard deviation below 0.2 dB. This approach of measurement data-driven automated network design will simplify deployment and enable efficient operation of open optical networks.
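A hedged sketch of a 2-layer, 10-hidden-unit GRU regressor of the shape described in the abstract, mapping sequences of physics-model parameters to a margin estimate; the dimensions and data are invented, and PyTorch is an assumed framework, not the authors' implementation:

    # Sketch: small GRU regressor for margin prediction (PyTorch assumed; shapes illustrative).
    import torch
    import torch.nn as nn

    class MarginGRU(nn.Module):
        def __init__(self, n_features=8, hidden=10):
            super().__init__()
            self.gru = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
            self.head = nn.Linear(hidden, 1)       # predicted required margin in dB

        def forward(self, x):                      # x: (batch, time, n_features)
            out, _ = self.gru(x)
            return self.head(out[:, -1, :])        # regress from the last time step

    model = MarginGRU()
    x = torch.randn(32, 24, 8)                     # 32 connections, 24 samples, 8 model parameters
    print(model(x).shape)                          # torch.Size([32, 1])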
... At the same time, the state and the industry also put forward strict requirements for the design, manufacture, installation, use, inspection, transformation and repair, and life assessment of boilers, pressure vessels, the four major pipelines, turbine metal equipment and generator metal equipment in thermal power plants, and implement whole-process technical supervision and management. Finding problems in a timely manner and taking effective technical supervision measures can reduce and avoid failures of monitored metal parts in the above processes [1]. ...
Article
Full-text available
Thermal power plants contain a large amount of metal equipment. The data from the construction, overhaul, maintenance and operation of metal equipment can fully reflect the safety status of that equipment. Targeting the large volume of metal supervision data from thermal power plants in Inner Mongolia, and in accordance with the requirements of national and industry standards, a metal supervision and management information system is established, covering data from the design, manufacturing, installation, use, inspection, transformation and repair, and life assessment stages of metal equipment, to realize whole life cycle management of supervised equipment in thermal power plants. At the same time, the system's big data retrieval function enables big data comparison of the same type of equipment or the same type of fault, providing sufficient information for technical supervision and failure analysis personnel and further improving the quality of technical supervision and technical services.
... Components in the data pipeline are capable of automating processes involved in extracting, transforming, combining, validating, and loading data [8]. Data pipelines can process different types of data such as continuous, intermittent and batch data [9]. Moreover, data pipelines eliminate errors and accelerate the end-to-end data processes which in turn reduces the latency in the development of data products. ...
... In the same way, individual microservices communicate with each other asynchronously. Apache Kafka as the selected messaging system is proven for high throughput and low latency [52]. Within microservices, we process data using stream processing techniques [53]. ...
Preprint
Full-text available
The Internet of Things adoption in the manufacturing industry allows enterprises to monitor their electrical power consumption in real time and at machine level. In this paper, we follow up on such emerging opportunities for data acquisition and show that analyzing power consumption in manufacturing enterprises can serve a variety of purposes. Apart from the prevalent goal of reducing overall power consumption for economical and ecological reasons, such data can, for example, be used to improve production processes. Based on a literature review and expert interviews, we discuss how analyzing power consumption data can serve the goals reporting, optimization, fault detection, and predictive maintenance. To tackle these goals, we propose to implement the measures real-time data processing, multi-level monitoring, temporal aggregation, correlation, anomaly detection, forecasting, visualization, and alerting in software. We transfer our findings to two manufacturing enterprises and show how the presented goals reflect in these enterprises. In a pilot implementation of a power consumption analytics platform, we show how our proposed measures can be implemented with a microservice-based architecture, stream processing techniques, and the fog computing paradigm. We provide the implementations as open source as well as a public demo allowing to reproduce and extend our research.
... On the other hand, the feasible space of application parameters may depend on the parallelization paradigm used. Examples include the number of workers used in each iteration, task allocation and granularity of problem decomposition in the master/worker (master/slave) paradigm, the number of tasks per core in the SPMD paradigm [6], the incurred latency in the pipelining paradigm [7], or the execution tree depth and thread limit for the divide-and-conquer paradigm [8]. Paper [9] proposes tuning complex parallel applications using a combination of different parallel programming paradigms such as master-worker and pipelining. ...
Conference Paper
Auto-tuning of configuration and application parameters allows to achieve significant performance gains in many contemporary compute-intensive applications. Feasible search spaces of parameters tend to become too big to allow for exhaustive search in the auto-tuning process. Expert knowledge about the utilized computing systems becomes useful to prune the search space and new methodologies are needed in the face of emerging heterogeneous computing architectures. In this paper we propose an auto-tuning methodology for hybrid CPU/GPU applications that takes into account previous execution experiences, along with an automated tool for iterative testing of chosen combinations of configuration, as well as application-related parameters. Experimental results, based on a parallel similarity search application executed on three different CPU + GPU parallel systems, show that the proposed methodology allows to achieve execution times worse by only up to 8% compared to a search algorithm that performs a full search over combinations of application parameters, while taking only up to 26% time of the latter.
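A toy sketch of a pruned parameter search that reuses earlier execution times to skip unpromising configurations; the measure() stand-in and the pruning rule are illustrative, not the paper's methodology:

    # Sketch: pruned search over (workers, batch) configurations using past execution experiences.
    import random

    random.seed(0)
    history = {}                                   # (workers, batch) -> runtime in seconds

    def measure(workers, batch):                   # stand-in for actually running the application
        return 100.0 / workers + batch * 0.01 + random.random()

    best = None
    for batch in [64, 256, 1024]:
        for workers in [2, 4, 8, 16]:
            # Prune: if the same batch size with half the workers already ran far slower
            # than the best runtime seen so far, skip this combination.
            prev = history.get((workers // 2, batch))
            if best is not None and prev is not None and prev > 2 * best[0]:
                continue
            runtime = measure(workers, batch)
            history[(workers, batch)] = runtime
            if best is None or runtime < best[0]:
                best = (runtime, workers, batch)

    print("best (runtime, workers, batch):", best)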
... Kafka is a scalable publish-subscribe messaging system, suitable for both offline and online message consumption [20][21]. In recent years, it has become a widely used data ingestion tool for ingesting streams of data into processing platforms [22][23]. ...
Conference Paper
During a criminal investigation, the evidence collection process produces an enormous amount of data. These data are present in many media and IoT devices that are collected as crime evidence (USB flash drives, smartphones, hard drives, computers, drones, smartwatches, AI speakers, sensors, etc.). Due to this data volume, manual analysis is slow and costly. This work fills this gap by presenting a data extraction and processing platform for crime evidence analysis. Our proposed platform leverages a lambda architecture and uses a set of tools and frameworks such as Hadoop HDFS, Kafka, Spark and Docker to analyze a big volume of data in an acceptable time. We also present an example of the proposed platform in use by the State Attorney Office of Rio Grande do Norte (Brazil), where some evaluative tests have been carried out.
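A hedged sketch of the speed-layer ingestion step in such a lambda architecture, reading evidence records from Kafka with Spark Structured Streaming; it assumes PySpark with the spark-sql-kafka connector available, and the topic, servers, and paths are illustrative:

    # Sketch: Spark Structured Streaming reading from Kafka and landing raw records in HDFS.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("evidence-stream").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")   # assumed brokers
              .option("subscribe", "evidence-extractions")           # illustrative topic
              .load())

    # Kafka rows expose key/value as binary; cast the payload to a string column.
    records = stream.selectExpr("CAST(value AS STRING) AS record")

    query = (records.writeStream
             .format("parquet")                                      # land raw records for batch layer
             .option("path", "hdfs:///evidence/raw")
             .option("checkpointLocation", "hdfs:///evidence/_checkpoints")
             .start())
    query.awaitTermination()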
... Though a great idea, it appears this work was not continued, as we could not find how to apply it to our specific data sources. LinkedIn tried to overcome some of the challenges mentioned before by engineering a data model based on combining Apache Avro [26] and Pegasus [12,13]. In contrast, our data pipeline is offline, so using Avro in our case does not require the integration of Kafka [28,8]. ...
Article
Full-text available
The adoption of the Internet of Things (IoT) in industry provides the opportunity to gather valuable data. Nevertheless, this data must be analyzed to identify patterns, model the behavior of equipment and enable prediction. Although big data emerged some years ago, there are still many challenges to be solved; for example, metadata representation and management are still a research topic. The big data architecture of the RISC data analytics framework relies on the combination of big data technologies with semantic approaches to process and store large volumes of data from heterogeneous sources, provided by FILL, a key machine tool provider. The proposed architecture is capable of handling sensor data using big data technologies such as Spark on Hadoop, InfluxDB and Elasticsearch. The metadata representation and management approach is adopted in order to define the structure and the relations (i.e., the connections) between the various data sources provided by the sensors and the logging information system. On the other hand, using a metadata approach in our big data environment enhances the RISC data analytics framework by making it generic, reusable and responsive to changes, thus keeping the data lakes up-to-date and ensuring the validity of the analytics results. The work presented here is part of an ongoing project (BOOST 4.0) currently addressed under the EU H2020 program.
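A minimal sketch of declaring an Avro schema and writing records against it, as referenced in the snippet above; it assumes the fastavro library, and the field names are illustrative rather than the project's actual metadata model:

    # Sketch: Avro schema definition plus write/read round trip (fastavro assumed).
    from fastavro import parse_schema, writer, reader

    schema = parse_schema({
        "name": "SensorReading",
        "type": "record",
        "fields": [
            {"name": "machine_id", "type": "string"},
            {"name": "timestamp", "type": "long"},
            {"name": "vibration", "type": "double"},
        ],
    })

    records = [{"machine_id": "fill-01", "timestamp": 1700000000, "vibration": 0.42}]

    with open("readings.avro", "wb") as out:
        writer(out, schema, records)            # the schema travels with the data file

    with open("readings.avro", "rb") as src:
        for rec in reader(src):
            print(rec)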
... Apache Kafka was originally developed at LinkedIn to process real-time log data with delays of no more than a few seconds, and it is designed to be a distributed, scalable, durable and fault-tolerant messaging system with high throughput [1], [2]. LinkedIn relies heavily on the scalability and reliability of Kafka for use cases like monitoring system metrics, traditional messaging or website activity tracking [3]. Kafka handles more than 1.4 trillion messages per day across over 1400 brokers at LinkedIn [4]. ...
Conference Paper
Full-text available
Apache Kafka is a highly scalable distributed messaging system that provides high throughput with low latency. Various cloud vendors provide Kafka as a service for users who need a messaging system. Given a certain hardware environment, how to set the configuration of Kafka properly is the first concern of users. In this paper, we analyze the structure and workflow of Kafka and propose a queueing-based packet flow model to predict performance metrics of Kafka cloud services. The input configuration parameters of this model comprise the number of brokers in the Kafka cluster, the number of partitions in a topic and the batch size of messages. Through this model users can obtain the impact of certain configuration parameters on performance metrics including producer throughput, the relative payload and overhead, and the change of disk storage usage over time. We use queueing theory to evaluate the end-to-end latency of packets. In the experimental validation we see a strong correlation between packet sizes and packet send intervals, and the service time of packets fits a phase-type distribution. The correlation and fitting results are substituted into the essential constants of the model. Experiments are performed with various configurations to observe their effects on performance metrics. The results show that our model achieves high accuracy in predicting throughput and latency.
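For intuition only, a far simpler queueing relation than the paper's packet-flow model: the mean time in an M/M/1 system as the arrival rate approaches a broker's service rate. All rates here are made up and the real model uses phase-type service times:

    # Sketch: mean time in an M/M/1 system, W = 1 / (mu - lambda), for illustrative rates.
    def mm1_mean_latency(arrival_rate, service_rate):
        if arrival_rate >= service_rate:
            raise ValueError("queue is unstable: arrival rate must stay below service rate")
        return 1.0 / (service_rate - arrival_rate)

    service_rate = 180_000                 # messages per second a broker can absorb (assumed)
    for arrival_rate in (90_000, 150_000, 175_000):
        w_ms = mm1_mean_latency(arrival_rate, service_rate) * 1000
        print(f"lambda={arrival_rate:>7} msg/s -> mean time in system ~ {w_ms:.3f} ms")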
Article
Publish/subscribe is a messaging pattern where message producers, called publishers, publish messages which they want to be distributed to message consumers, called subscribers. Subscribers are required to subscribe to messages of interest in advance to be able to receive them upon the publishing. In this paper, we discuss a special type of publish/subscribe systems, namely geospatial publish/subscribe systems (GeoPS systems), in which both published messages (i.e., publications) and subscriptions include a geospatial object. Such an object is used to express both the location information of a publication and the location of interest of a subscription. We argue that there is great potential for using GeoPS systems for the Internet of Things and Sensor Web applications. However, existing GeoPS systems are not applicable for this purpose since they are centralized and cannot cope with multiple highly frequent incoming geospatial data streams containing publications. To overcome this limitation, we present a distributed GeoPS system in the cluster which efficiently matches incoming publications in real-time with a set of stored subscriptions. Additionally, we propose four different (distributed) replication and partitioning strategies for managing subscriptions in our distributed GeoPS system. Finally, we present results of an extensive experimental evaluation in which we compare the throughput, latency and memory consumption of these strategies. These results clearly show that they are both efficient and scalable to larger clusters. The comparison with centralized state-of-the-art approaches shows that the additional processing overhead of our distributed strategies introduced by the Apache Spark is almost negligible.
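A minimal sketch of the core matching step in a geospatial publish/subscribe system, testing point publications against stored subscription regions; it assumes the shapely library, and in the distributed system described above this matching would be partitioned or replicated across the cluster rather than run in a single process:

    # Sketch: matching a geospatial publication against stored subscriptions (shapely assumed).
    from shapely.geometry import Point, Polygon

    subscriptions = {
        "downtown-alerts": Polygon([(0, 0), (0, 10), (10, 10), (10, 0)]),    # illustrative regions
        "airport-zone":    Polygon([(20, 20), (20, 30), (30, 30), (30, 20)]),
    }

    def match(publication_location, payload):
        hits = [sid for sid, area in subscriptions.items()
                if area.contains(publication_location)]
        return [(sid, payload) for sid in hits]          # deliver to matching subscribers

    print(match(Point(3, 4), {"sensor": "temp-17", "value": 21.5}))
    # -> [('downtown-alerts', {'sensor': 'temp-17', 'value': 21.5})]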
Chapter
In recent years, a series of real-life problems are being solved through the leading role of sensors and the Internet. Smart towns, smart health structures, smart construction, smart landscapes, and smart transport are all part of the applications. However, IoT sensor data involves a variety of problems in real time, including dilution of unclean sensor data and extraordinary resource costs. In addition to normal clinical practices, information and communications technology (ICT) that enables the Internet of Things supports the development of mechanisms to monitor elderly behavior, allowing geriatrics to detect changes in behavior related to such conditions early on. The data capture layer is a discreet low-cost infrastructure that abstracts the heterogeneity of the physical system, while the data processing layer easily handles the huge amount and semantics of the sensed knowledge. Details are accessible with wired or wireless Internet access. These create enormous amounts of fresh, organized, unstructured, real-time, and big data. IoT data is very comprehensive and nuanced, with information on the circumstances and the environment. The IoT will never remain idle, as thousands of Internet objects become information collectors and produce huge data. Today, the bulk of large data comes from IoT devices and grows exponentially per year. The analysis of such data needs innovative IoT techniques and data processing. IoT, however, requires handling an abundance of facts which continuously flow from various objects, and conventional technology here is also critical. IoT data are very detailed and nuanced, capable of delivering real-time information on actual events or the environment. A smart IoT system has been developed in this manuscript to handle the multidimensional data generated by IoT sensors. The suggested model showed that the accuracy and speed of data handling are high when compared to traditional and current models. Keywords: IoT gadgets, Sensor data, Intelligent framework, Clustering, Classification, Data integrity, Big data
Chapter
Modern business intelligence relies on efficient processing of very large amounts of stream data, such as event logs and data collected by sensors. To meet the great demand for stream processing, many stream data storage systems have been implemented and widely deployed, such as Kafka, Pulsar and DistributedLog. These systems differ in many aspects, including design objectives, target application scenarios, access semantics, user APIs, and implementation technologies. Each system uses a dedicated tool to evaluate its performance, and different systems measure different performance metrics under different loads. For infrastructure architects, it is important to compare the performance of different systems under diverse loads using the same benchmark. Moreover, for system designers and developers, it is critical to study how different implementation technologies affect performance. However, no benchmark tool yet exists that can evaluate the performance of these different systems, and given the wide diversity among them, designing such a tool is challenging. In this paper, we present SSBench, a benchmark tool designed for stream data storage systems. SSBench abstracts the data and operations in different systems as "data streams" and "reads/writes" to data streams. By translating stream read/write operations into the specific operations of each system using its own APIs, SSBench can evaluate different systems with the same loads. In addition to measuring simple read/write performance, SSBench provides several performance measurements specific to stream data, including end-to-end read latency, performance under imbalanced loads, and performance of transactional loads. The paper also presents a performance evaluation of four typical systems, Kafka, Pulsar, DistributedLog and ZStream, using SSBench, and discusses the causes of their performance differences from the perspective of their implementation techniques.
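To make the abstraction idea concrete, the sketch below shows one way such a layer could be structured: a generic stream-storage interface that a benchmark driver exercises, with one adapter per system. The interface and the Kafka adapter are hypothetical examples, not SSBench's actual API; the adapter assumes the kafka-python package and a reachable broker.

```python
# Hypothetical abstraction layer in the spirit of the abstract:
# the benchmark sees only "streams" with write/read operations, and each
# storage system (Kafka, Pulsar, DistributedLog, ...) supplies an adapter.
from abc import ABC, abstractmethod


class StreamStorage(ABC):
    """Generic view of a stream data storage system: append-only writes
    and sequential reads on named streams."""

    @abstractmethod
    def write(self, stream: str, payload: bytes) -> None: ...

    @abstractmethod
    def read(self, stream: str, max_records: int) -> list: ...


class KafkaAdapter(StreamStorage):
    """Maps the generic operations onto Kafka's producer/consumer API;
    other systems would get adapters of their own."""

    def __init__(self, bootstrap_servers: str):
        from kafka import KafkaProducer
        self._bootstrap = bootstrap_servers
        self._producer = KafkaProducer(bootstrap_servers=bootstrap_servers)

    def write(self, stream: str, payload: bytes) -> None:
        # A stream maps directly to a Kafka topic.
        self._producer.send(stream, payload)

    def read(self, stream: str, max_records: int) -> list:
        from kafka import KafkaConsumer
        consumer = KafkaConsumer(stream,
                                 bootstrap_servers=self._bootstrap,
                                 auto_offset_reset="earliest",
                                 consumer_timeout_ms=1000)
        records = []
        for message in consumer:
            records.append(message.value)
            if len(records) >= max_records:
                break
        consumer.close()
        return records
```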
Chapter
Data pipelines involve a complex chain of interconnected activities that starts with a data source and ends in a data sink. Data pipelines are important for data-driven organizations since a pipeline can process data in multiple formats from distributed data sources with minimal human intervention, accelerate data life cycle activities, and enhance productivity in data-driven enterprises. However, while implementing data pipelines presents both challenges and opportunities, practical industry experiences are seldom reported. The findings of this study are derived from a qualitative multiple-case study, including interviews with representatives of three companies. The challenges include data quality issues, infrastructure maintenance problems, and organizational barriers. On the other hand, data pipelines are implemented to enable traceability and fault tolerance and to reduce human errors by maximizing automation, thereby producing high-quality data. Based on multiple-case study research covering five use cases from three case companies, this paper identifies the key challenges and benefits associated with the implementation and use of data pipelines.
Conference Paper
Full-text available
Modern stream processing systems need to process large volumes of data in real time. Various stream processing frameworks have been developed, and messaging systems are widely used to transfer streaming data among applications. As a distributed messaging system of growing popularity, Apache Kafka processes streaming data in small batches for efficiency. However, the robustness of Kafka's batching method under variable operating conditions is not known. In this paper we study the impact of the batch size on the performance of Kafka. Both configuration parameters, the spatial and the temporal batch size, are considered. We build a Kafka testbed using Docker containers to analyze the distribution of Kafka's end-to-end latency. The experimental results indicate that evaluating the mean latency alone is unreliable in the context of real-time systems. In experiments where network faults are injected, we find that the batch size affects the message loss rate in the presence of an unstable network connection. However, allocating resources for message processing and delivery in a way that violates the reliability requirements, implemented as latency constraints of a real-time system, is inefficient. To address these challenges we propose a reactive batching strategy. We evaluate our batching strategy under both good and poor network conditions. The results show that the strategy is powerful enough to meet both latency and throughput constraints even when network conditions vary.
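In Kafka's standard producer configuration, the spatial and temporal batch sizes studied above correspond to the batch.size and linger.ms settings. The sketch below shows how these are set with the kafka-python client; the broker address, topic name, and values are placeholders, and the paper's reactive strategy (adjusting them at runtime based on observed latency) is not reproduced here.

```python
# Minimal producer sketch exposing the two batching knobs the paper studies:
#   batch.size  -> spatial batch size, in bytes
#   linger.ms   -> temporal batch size, in milliseconds
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    batch_size=16384,   # send a batch once it reaches 16 KiB ...
    linger_ms=10,       # ... or after waiting at most 10 ms
    acks="all",         # wait for full acknowledgement for reliability
)

# Publish a few placeholder messages; they are grouped into batches
# according to the settings above before being sent to the broker.
for i in range(1000):
    producer.send("sensor-readings", f"reading-{i}".encode())
producer.flush()
```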
Chapter
Recent advances in Artificial Intelligence (AI) have fostered widespread adoption of Machine Learning (ML) capabilities within many products and services. However, most organizations are not well positioned to fully exploit the strategic advantages of AI. Implementing ML solutions is still a complex endeavor due to the fast-paced evolution and the intrinsically exploratory nature of state-of-the-art ML techniques. In many respects, the evolution of data platforms through highly parallel or high-performance technologies has focused on the capacity to massively process the elements consumed by these ML models. This separate consideration renders reference architectures suited either for analytics consumption or for raw storage, with no joint consideration of the complete cycle of data management, model development, and serving with feedback and human-in-the-loop requirements. This paper introduces design criteria conceived to help organizations architect and implement data platforms that effectively exploit their ML capabilities. The main objective of this work is to expedite the development of data platforms for ML by avoiding common implementation mistakes. The proposed guideline is a methodical articulation of the empirical knowledge acquired over recent years designing, developing, evolving and maintaining a broad spectrum of industry-oriented Data and AI solutions. We evaluate our proposal by assessing the functionality and usability of the architectures and implementations derived from our design criteria.
Conference Paper
During a criminal investigation, the evidence collection process produces an enormous amount of data. These data reside on the many media extracted as crime evidence (USB flash drives, smartphones, hard drives, computers, etc.). Because of this data volume, manual analysis is slow and costly. This work fills this gap by presenting a data extraction and processing platform for crime evidence analysis, named INSIDE. Our proposed platform leverages a lambda architecture and uses a set of tools and frameworks such as Hadoop HDFS, Kafka, Spark and Docker to analyze a large volume of data in an acceptable time. We also present an example of the proposed platform in use by the Public Ministry of Rio Grande do Norte (Brazil), where evaluative tests have been carried out.
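As a hedged illustration of how Kafka and Spark typically fit together in a lambda-style pipeline like the one described above, the sketch below consumes records from a Kafka topic with Spark Structured Streaming and writes them to HDFS. The topic name, schema, and paths are assumptions, not the INSIDE platform's actual configuration, and the spark-sql-kafka connector package must be on the Spark classpath.

```python
# Possible ingestion step: Kafka topic -> Spark Structured Streaming -> HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("evidence-ingest").getOrCreate()

# Read the raw Kafka records and cast key/value from bytes to strings.
evidence = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "evidence-metadata")   # assumed topic
            .load()
            .select(col("key").cast("string"),
                    col("value").cast("string"),
                    col("timestamp")))

# Continuously append the stream to Parquet files on HDFS (assumed paths).
query = (evidence.writeStream
         .format("parquet")
         .option("path", "hdfs:///evidence/raw")
         .option("checkpointLocation", "hdfs:///evidence/checkpoints")
         .start())
query.awaitTermination()
```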
Thesis
Full-text available
Recently, a new application domain characterized by the continuous, low-latency processing of large volumes of data has been gaining attention. The growing number of applications of this kind has led to the creation of Stream Processing Systems (SPSs), systems that abstract the details of real-time applications from the developer. More recently, the ever-increasing volumes of data to be processed gave rise to distributed SPSs. Several distributed SPSs are currently on the market; however, the existing benchmarks designed for evaluating this kind of system cover only a few applications and workloads, while these systems support a much wider set of applications. In this work, a benchmark for stream processing systems is proposed. Based on a survey of papers describing real-time and stream applications, the most frequently used applications and areas were outlined, as well as the metrics most commonly used in their performance evaluation. With this information, the benchmark metrics were selected, along with a list of candidate applications; these applications then underwent a workload characterization in order to select a diverse set. To ease the evaluation of SPSs, a framework was created with an API that generalizes application development and collects metrics, with the possibility of extending it to support other platforms in the future. To demonstrate the usefulness of the benchmark, a subset of the applications was executed on Storm and Spark using the Azure Platform, and the results demonstrated the value of the benchmark suite in comparing these systems.
Article
Full-text available
In this article, we propose a data framework for edge computing that allows developers to easily achieve efficient data transfer between mobile devices or users. We propose a distributed key-value storage platform for edge computing and an explicit data distribution management method that follows the publish/subscribe relationships specific to each application. In this platform, edge servers organize the distributed key-value storage in a uniform namespace. To enable fast data access to a record in edge computing, the allocation strategy of the record and its cache on the edge servers is important. Our platform offers distributed objects that can dynamically change their home server and proactively allocate cache objects following user-defined rules. A rule is defined declaratively and specifies where to place cache objects depending on the status of the target record and its associated records. The system reflects record modifications in the cached records immediately. We also integrate a push notification system using WebSocket to notify clients of events on a specified table. We introduce a messaging service application between mobile appliances, along with several other applications, to show how cache rules apply to them. We evaluate the performance of our system using some sample applications.
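The sketch below is a loose, hypothetical rendering of the declarative cache-placement idea described above: a rule attached to a table decides, from the content of an updated record, which edge servers should receive a cache copy. The rule structure, callbacks, and names are illustrative assumptions, not the platform's actual rule syntax or API.

```python
# Hypothetical declarative cache-placement rule for an edge key-value store.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CacheRule:
    table: str
    # Given the updated record, return the edge servers that should
    # hold a cache copy of it.
    place_on: Callable[[dict], list]


@dataclass
class EdgeStore:
    rules: list = field(default_factory=list)
    caches: dict = field(default_factory=dict)   # server -> {key: record}

    def put(self, table: str, key: str, record: dict) -> None:
        # On every write, apply the matching rules and push the record
        # to the caches they select, so cached copies stay up to date.
        for rule in self.rules:
            if rule.table == table:
                for server in rule.place_on(record):
                    self.caches.setdefault(server, {})[key] = record


# Rule: cache each chat message on the edge server nearest its recipient.
store = EdgeStore(rules=[
    CacheRule(table="messages",
              place_on=lambda rec: [f"edge-{rec['recipient_region']}"])
])
store.put("messages", "m1", {"text": "hello", "recipient_region": "north"})
print(store.caches)   # {'edge-north': {'m1': {...}}}
```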