Preprint

Design Guidelines for Apache Kafka Driven Data Management and Distribution in Smart Cities


Abstract

Smart city management is going through a remarkable transition in terms of the quality and diversity of services provided to end-users. The stakeholders that deliver pervasive applications are now able to address fundamental challenges in the big data value chain, from data acquisition through data analysis and processing and data storage and curation to data visualisation in real scenarios. Industry 4.0 is pushing this trend forward, demanding the servitization of products and data, also in the smart city sector, where humans, sensors and devices operate in close collaboration. The data produced by ubiquitous devices must be processed quickly to enable reactive services such as situational awareness, video surveillance and geo-localization, while always ensuring the safety and privacy of the citizens involved. This paper proposes a modular architecture to (i) leverage innovative technologies for data acquisition, management and distribution (such as Apache Kafka and Apache NiFi), (ii) develop a multi-layer engineering solution for revealing valuable and hidden societal knowledge in smart city environments, and (iii) tackle the main issues in tasks involving complex data flows and provide general guidelines to solve them. We derived these guidelines from an experimental setting, carried out together with leading industrial technical departments, for an efficient system for the monitoring and servitization of smart city assets; the resulting scalable platform has confirmed its usefulness in numerous smart city use cases with different needs.
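To make the data-distribution layer concrete, the following minimal sketch (an illustration only, not the authors' implementation; the broker address, topic name, and field names are assumptions) shows how a smart-city sensor event could be published to and consumed from Apache Kafka with the kafka-python client:

```python
# Hedged sketch: one sensor event flows through a Kafka topic.
# Assumptions: a broker at localhost:9092 and a hypothetical topic "city.sensors".
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# A hypothetical geo-localized event from a surveillance camera.
producer.send("city.sensors",
              {"sensor_id": "cam-042", "lat": 45.07, "lon": 7.69, "event": "motion"})
producer.flush()

consumer = KafkaConsumer(
    "city.sensors",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.value)  # a reactive service (e.g., situational awareness) would hook in here
    break
```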


References
Conference Paper
Full-text available
Distributed networks and real-time systems are becoming the most important components for the new computer age - the Internet of Things (IoT), with huge data streams generated from sensors and data sets generated from existing legacy systems. The data generated offers the ability to measure, infer and understand environmental indicators, from delicate ecologies and natural resources to urban environments. This can be achieved through the analysis of heterogeneous data sources (structured and unstructured). In this paper, we propose a distributed framework - Event STream Processing Engine for Environmental Monitoring Domain (ESTemd) - for the application of stream processing on heterogeneous environmental data. Our work in this area demonstrates the useful role big data techniques can play in an environmental decision support system, early warning and forecasting systems. The proposed framework addresses the challenges of data heterogeneity from heterogeneous systems and offers real-time processing of huge environmental datasets through a publish/subscribe method via a unified data pipeline with the application of Apache Kafka for real-time analytics.
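The unified-pipeline idea can be illustrated with a short, hedged sketch (the topic name and envelope fields are assumptions, not part of ESTemd): heterogeneous readings are normalized into a common envelope before being published on a single Kafka topic:

```python
# Hedged sketch: heterogeneous sources share one Kafka topic via a common
# envelope, so downstream stream processors see a uniform record shape.
import json, time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(source, payload):
    # The envelope hides source heterogeneity from downstream consumers.
    envelope = {"source": source, "ts": time.time(), "payload": payload}
    producer.send("env.monitoring", envelope)  # assumed unified topic

publish("air-quality-sensor", {"pm25": 12.4})            # structured reading
publish("legacy-scada", "<xml><level>3.1</level></xml>")  # unstructured passes through
producer.flush()
```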
Conference Paper
Full-text available
Born from a need for a pure "pay-per-use" model and a highly scalable platform, the "Serverless" paradigm emerged and has the potential to become a dominant way of building cloud applications. Although it was originally designed for cloud environments, Serverless is finding its position in the Edge Computing landscape, aiming to bring computational resources closer to the data source. That is, Serverless is crossing cloud borders to assess its merits in Edge computing, whose principal partner will be Internet of Things (IoT) applications. This move sounds promising, as Serverless brings particular benefits such as eliminating always-on services that cause high electricity usage, for instance. However, the community is still hesitant to adopt Serverless Edge Computing because of the cloud-driven design of current Serverless platforms and the distinctive characteristics of the edge landscape and IoT applications. In this paper, we evaluate both sides to shed light on this new Serverless territory. Our in-depth analysis promotes a broad vision for bringing Serverless to Edge Computing. It also identifies major challenges that Serverless must meet before entering Edge computing.
Conference Paper
Full-text available
The number of devices that the edge of the Internet accommodates and the volume of the data these devices generate are expected to grow dramatically in the years to come. As a result, managing and processing such massive data amounts at the edge becomes a vital issue. This paper proposes "Store Edge Networked Data" (SEND), a novel framework for in-network storage management realized through data repositories deployed at the network edge. SEND considers different criteria (e.g., data popularity, data proximity to processing functions at the edge) to intelligently place different categories of raw and processed data at the edge based on system-wide identifiers of the data context, called labels. We implement a data repository prototype on top of the Google file system, which we evaluate based on real-world datasets of images and Internet of Things device measurements. To scale up our experiments, we perform a network simulation study based on synthetic and real-world datasets evaluating the performance and trade-offs of the SEND design as a whole. Our results demonstrate that SEND achieves data insertion times of 0.06ms-0.9ms, data lookup times of 0.5ms-5.3ms, and on-time completion of up to 92% of user requests for the retrieval of raw and processed data.
Article
Full-text available
For the storage and recovery requirements of large-scale seismic waveform data at the National Earthquake Data Backup Center (NEDBC), a distributed cluster processing model based on Kafka message queues is designed to optimize the inbound efficiency of seismic waveform data stored in HBase at NEDBC. Firstly, the characteristics of big data storage architectures are compared with those of traditional disk array storage architectures. Secondly, seismic waveform data analysis and periodic truncation are realized, and the data are written to HBase in NoSQL record form through a Spark Streaming cluster. Finally, the read/write performance of the proposed big data platform is compared and tested against that of traditional storage architectures. Results show that the Kafka-based seismic waveform data processing architecture designed and implemented in this paper achieves a higher read/write speed than the traditional architecture while preserving the redundancy capability of NEDBC data backup, which verifies the validity and practicability of the proposed approach.
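As a rough illustration of such a pipeline (hedged: the paper's system writes to HBase, for which a console stand-in is used here; the topic name and schema are assumptions), a PySpark Structured Streaming job could consume waveform records from Kafka as follows (run with the spark-sql-kafka package on the classpath):

```python
# Hedged sketch: Kafka -> Spark micro-batches -> per-batch sink.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("waveform-ingest").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "seismic.waveforms")  # assumed topic name
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

def write_batch(df, epoch_id):
    # In the real pipeline each micro-batch would be written to HBase as
    # NoSQL records; printing stands in for that side effect here.
    df.show(truncate=False)

query = stream.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```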
Article
Full-text available
The Internet of Things (IoT) has recently advanced from an experimental technology to what will become the backbone of future customer value for both product and service sector businesses. This underscores the cardinal role of IoT on the journey towards the fifth generation (5G) of wireless communication systems. IoT technologies augmented with intelligent and big data analytics are expected to rapidly change the landscape of myriads of application domains ranging from health care to smart cities and industrial automation. The emergence of Multi-Access Edge Computing (MEC) technology aims at extending cloud computing capabilities to the edge of the radio access network, hence providing real-time, high-bandwidth, low-latency access to radio network resources. IoT is identified as a key use case of MEC, given MEC's ability to provide cloud platform and gateway services at the network edge. MEC will inspire the development of myriads of applications and services with demand for ultra-low latency and high Quality of Service (QoS) due to its dense geographical distribution and wide support for mobility. MEC is therefore an important enabler of IoT applications and services which require real-time operations. In this survey, we provide a holistic overview of the exploitation of MEC technology for the realization of IoT applications and their synergies. We further discuss the technical aspects of enabling MEC in IoT and provide some insight into various other integration technologies therein.
Article
Full-text available
The virtually unlimited available resources and wide range of services provided by the cloud have resulted in the emergence of new cloud-based applications, such as smart grids, smart building control, and virtual reality. These developments, however, have also been accompanied by a problem for delay-sensitive applications that have stringent delay requirements. The current cloud computing paradigm cannot realize the requirements of mobility support, location awareness, and low latency. Hence, to address the problem, an edge computing paradigm that aims to extend the cloud resources and services and enable them to be nearer the edge of an enterprise's network has been introduced. In this article, we highlight the significance of edge computing by providing real-life scenarios that have strict constraint requirements on application response time. From the previous literature, we devise a taxonomy to classify the current research efforts in the domain of edge computing. We also discuss the key requirements that enable edge computing. Finally, current challenges in realizing the vision of edge computing are discussed.
Article
Full-text available
The Internet of Things (IoT) makes smart objects the ultimate building blocks in the development of cyber-physical smart pervasive frameworks. The IoT has a variety of application domains, including health care. The IoT revolution is redesigning modern health care with promising technological, economic, and social prospects. This paper surveys advances in IoT-based health care technologies and reviews the state-of-the-art network architectures/platforms, applications, and industrial trends in IoT-based health care solutions. In addition, this paper analyzes distinct IoT security and privacy features, including security requirements, threat models, and attack taxonomies from the health care perspective. Further, this paper proposes an intelligent collaborative security model to minimize security risk; discusses how different innovations such as big data, ambient intelligence, and wearables can be leveraged in a health care context; addresses various IoT and eHealth policies and regulations across the world to determine how they can facilitate economies and societies in terms of sustainable development; and provides some avenues for future research on IoT-based health care based on a set of open issues and challenges.
Article
Full-text available
This paper provides an overview of the Internet of Things (IoT) with emphasis on enabling technologies, protocols, and application issues. The IoT is enabled by the latest developments in RFID, smart sensors, communication technologies, and Internet protocols. The basic premise is to have smart sensors collaborate directly without human involvement to deliver a new class of applications. The current revolution in Internet, mobile, and machine-to-machine (M2M) technologies can be seen as the first phase of the IoT. In the coming years, the IoT is expected to bridge diverse technologies to enable new applications by connecting physical objects together in support of intelligent decision making. This paper starts by providing a horizontal overview of the IoT. Then, we give an overview of some technical details that pertain to the IoT enabling technologies, protocols, and applications. Compared to other survey papers in the field, our objective is to provide a more thorough summary of the most relevant protocols and application issues to enable researchers and application developers to get up to speed quickly on how the different protocols fit together to deliver desired functionalities without having to go through RFCs and the standards specifications. We also provide an overview of some of the key IoT challenges presented in the recent literature and provide a summary of related research work. Moreover, we explore the relation between the IoT and other emerging technologies including big data analytics and cloud and fog computing. We also present the need for better horizontal integration among IoT services. Finally, we present detailed service use-cases to illustrate how the different protocols presented in the paper fit together to deliver desired IoT services.
Article
Full-text available
Internet of Things (IoT) has provided a promising opportunity to build powerful industrial systems and applications by leveraging the growing ubiquity of radio-frequency identification (RFID), and wireless, mobile, and sensor devices. A wide range of industrial IoT applications have been developed and deployed in recent years. In an effort to understand the development of IoT in industries, this paper reviews the current research of IoT, key enabling technologies, major IoT applications in industries, and identifies research trends and challenges. A main contribution of this review paper is that it summarizes the current state-of-the-art IoT in industries systematically.
Article
Full-text available
The Internet of Things (IoT) shall be able to incorporate transparently and seamlessly a large number of different and heterogeneous end systems, while providing open access to selected subsets of data for the development of a plethora of digital services. Building a general architecture for the IoT is hence a very complex task, mainly because of the extremely large variety of devices, link layer technologies, and services that may be involved in such a system. In this paper, we focus specifically on urban IoT systems that, while still being quite a broad category, are characterized by their specific application domain. Urban IoTs, in fact, are designed to support the Smart City vision, which aims at exploiting the most advanced communication technologies to support added-value services for the administration of the city and for the citizens. This paper hence provides a comprehensive survey of the enabling technologies, protocols, and architecture for an urban IoT. Furthermore, the paper will present and discuss the technical solutions and best-practice guidelines adopted in the Padova Smart City project, a proof-of-concept deployment of an IoT island in the city of Padova, Italy, performed in collaboration with the city municipality.
Article
Full-text available
The Internet of Things (IoT) is part of the Internet of the future and will comprise billions of intelligent communicating "things" or Internet Connected Objects (ICO) which will have sensing, actuating, and data processing capabilities. Each ICO will have one or more embedded sensors that will capture potentially enormous amounts of data. The sensors and related data streams can be clustered physically or virtually, which raises the challenge of searching and selecting the right sensors for a query in an efficient and effective way. This paper proposes a context-aware sensor search, selection and ranking model, called CASSARAM, to address the challenge of efficiently selecting a subset of relevant sensors out of a large set of sensors with similar functionality and capabilities. CASSARAM takes into account user preferences and considers a broad range of sensor characteristics, such as reliability, accuracy, location, battery life, and many more. The paper highlights the importance of sensor search, selection and ranking for the IoT, identifies important characteristics of both sensors and data capture processes, and discusses how semantic and quantitative reasoning can be combined. This work also addresses challenges such as efficient distributed sensor search and relational-expression based filtering. CASSARAM testing and performance evaluation results are presented and discussed.
Chapter
The modeling and simulation community can benefit from building and running (co-)simulation models on-demand. In order to drive innovation further, it should be easy to set up, orchestrate, and execute simulations – even for researchers with no background in computer science. Additionally, open interfaces should be the default to enable a variety of applications. The possibility to connect external components smoothly with a simulation run allows for elaborate experiments. In this work, the details of the architecture and design of a simulation platform enabling evaluations of future mobility scenarios are presented. While the explanations are based on mobility applications, the concept can be generalized and applied to other domains. The novelty lies in the use of a big data stream processing platform called Apache Kafka. All kinds of required communication are handled by the Kafka middleware. Even the coupling of different simulators is realized on top of it. As a foundation, existing works regarding Modeling and Simulation as a Service are presented. The approach is then described by its architecture, its interfaces, and the life-cycle of a simulation run. Further information is given on the distribution of simulations and the topic-based clock synchronization mechanism. A performance evaluation shows that the approach is on a par with state-of-the-art co-simulation middlewares.
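A hedged sketch of the topic-based clock synchronization idea follows (the topic names and the acknowledgment protocol are assumptions, not the platform's actual interface): a coordinator broadcasts the next simulation time step on one topic and waits for every coupled simulator to acknowledge on another:

```python
# Hedged sketch of topic-based clock synchronization over Kafka.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

def coordinate(num_simulators, horizon, step=1):
    """Broadcast time steps and wait for all simulators to catch up."""
    acks = KafkaConsumer("sim.clock.ack", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode()))
    t = 0
    while t < horizon:
        t += step
        producer.send("sim.clock", {"advance_to": t})  # next global time step
        producer.flush()
        ready = set()
        for msg in acks:  # block until every simulator has acknowledged t
            if msg.value.get("advance_to") == t:
                ready.add(msg.value["simulator"])
            if len(ready) == num_simulators:
                break
```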
Article
Serverless computing is becoming widely adopted among cloud providers, thus making increasingly popular the Function-as-a-Service (FaaS) programming model, where the developers realize services by packaging sequences of stateless function calls. The current technologies are very well suited to data centers, but cannot provide equally good performance in decentralized environments, such as edge computing systems, which are expected to be typical for Internet of Things (IoT) applications. In this paper, we fill this gap by proposing a framework for efficient dispatching of stateless tasks to in-network executors so as to minimize the response times while exhibiting short- and long-term fairness, also leveraging information from a virtualized network infrastructure when available. Our solution is shown to be simple enough to be installed on devices with limited computational capabilities, such as IoT gateways, especially when using a hierarchical forwarding extension. We evaluate the proposed platform by means of extensive emulation experiments with a prototype implementation in realistic conditions. The results show that it is able to smoothly adapt to the mobility of clients and to the variations of their service request patterns, while coping promptly with network congestion.
Article
Wireless edge networks in smart industrial environments increasingly operate using advanced sensors and autonomous machines interacting with each other and generating huge amounts of data. Those huge amounts of data are bound to make data management (e.g., for processing, storing, computing) a big challenge. Current data management approaches, relying primarily on centralized data storage, might not be able to cope with the scalability and real time requirements of Industry 4.0 environments, while distributed solutions are increasingly being explored. In this paper, we introduce the problem of distributed data access in multi-hop wireless industrial edge deployments, whereby a set of consumer nodes needs to access data stored in a set of data cache nodes, satisfying the industrial data access delay requirements and at the same time maximizing the network lifetime. We prove that the introduced problem is computationally intractable and, after formulating the objective function, we design a two-step algorithm in order to address it. We use an open testbed with real devices for conducting an experimental investigation on the performance of the algorithm. Then, we provide two online improvements, so that the data distribution can dynamically change before the first node in the network runs out of energy. We compare the performance of the methods via simulations for different numbers of network nodes and data consumers, and we show significant lifetime prolongation and increased energy efficiency when employing the method that uses only decentralized low-power wireless communication instead of the method that also uses centralized local-area wireless communication.
Article
Driven by great demands on low-latency services of the edge devices (EDs), mobile edge computing (MEC) has been proposed to enable the computing capacities at the edge of the radio access network. However, conventional MEC servers suffer some disadvantages such as limited computing capacity, preventing computation-intensive tasks from being processed on time. To relieve this issue, we propose the heterogeneous multi-layer MEC (HetMEC) where data that cannot be timely processed at the edge are allowed to be offloaded to the upper-layer MEC servers, and finally to the cloud center (CC) with more powerful computing capacity. We aim to minimize the system latency, i.e., the total computing and transmission time on all layers for the data generated by the EDs. We design the latency minimization algorithm by jointly coordinating the task assignment, computing and transmission resources among the EDs, multi-layer MEC servers, and the CC. Simulation results indicate that our proposed algorithm can achieve a lower latency and higher processing rate than the conventional MEC scheme.
Conference Paper
In recent years there has been a surge in applications focusing on streaming data to generate insights in real-time. Both academia, as well as industry, have tried to address this use case by developing a variety of Stream Processing Engines (SPEs) with a diverse feature set. On the other hand, Big Data applications have started to make use of High-Performance Computing (HPC), which possesses superior memory, I/O, and networking resources compared to typical Big Data clusters. Recent studies evaluating the performance of SPEs have focused on commodity clusters. However, exhaustive studies need to be performed to profile individual stages of a stream processing pipeline and how to optimize each of these stages to best leverage the resources provided by HPC clusters. To address this issue, we profile the performance of a big data streaming pipeline using Apache Flink as the SPE and Apache Kafka as the intermediate message queue. We break the streaming pipeline into two distinct phases and evaluate percentile latencies for two different networks, namely 40GbE and InfiniBand EDR (100Gbps), to determine if a typical streaming application is network intensive enough to benefit from a faster interconnect. Moreover, we explore whether the volume of the input data stream has any effect on the latency characteristics of the streaming pipeline, and if so how it compares for different stages in the streaming pipeline and different network interconnects. Our experiments show an increase of over 10x in 98th percentile latency when input stream volume is increased from 128MB/s to 256MB/s. Moreover, we find the intermediate stages of the stream pipeline to be a significant contributor to the overall latency of the system.
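A percentile-latency probe of the kind used in such evaluations can be sketched as follows (hedged: the topic name and the convention that each record carries its send timestamp are assumptions):

```python
# Hedged latency probe: consume timestamped records from an assumed topic
# "bench" and report the 98th percentile end-to-end latency.
import json, time
from kafka import KafkaConsumer

consumer = KafkaConsumer("bench", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode()))

latencies = []
for msg in consumer:
    latencies.append(time.time() - msg.value["sent_at"])  # end-to-end delay
    if len(latencies) >= 10_000:
        break

latencies.sort()
p98 = latencies[int(0.98 * len(latencies)) - 1]  # 98th percentile
print(f"p98 end-to-end latency: {p98 * 1000:.2f} ms over {len(latencies)} records")
```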
Article
Due to their promise of delivering real-time network insights, today's streaming analytics platforms are increasingly being used in communications networks where the impact of the insights goes beyond sentiment and trend analysis to include real-time detection of security attacks and prediction of network state (i.e., is the network transitioning towards an outage). Current streaming analytics platforms operate under the assumption that arriving traffic is to the order of kilobytes produced at very high frequencies. However, communications networks, especially telecommunication networks, challenge this assumption because some of the arriving traffic in these networks is to the order of gigabytes, but produced at medium to low velocities. Furthermore, these large datasets may need to be ingested in their entirety to render network insights in real-time. Our interest is to subject today's streaming analytics platforms, constructed from state-of-the-art software components (Kafka, Spark, HDFS, ElasticSearch), to traffic densities observed in such communications networks. We find that filtering on such large datasets is best done in a common upstream point instead of being pushed to, and repeated in, downstream components. To demonstrate the advantages of such an approach, we modify Apache Kafka to perform limited native data transformation and filtering, relieving the downstream Spark application from doing this. Our approach outperforms four prevalent analytics pipeline architectures with negligible overhead compared to standard Kafka. (Our modifications to Apache Kafka are publicly available at https://github.com/Esquive/queryable-kafka.git)
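The paper's modification pushes filtering into Kafka itself; as a simplified consumer-side stand-in (topic names and the filter predicate are assumptions), the idea of filtering once at a common upstream point before fan-out can be sketched as:

```python
# Hedged stand-in for upstream filtering: a plain consumer filters once and
# republishes only matching records, so downstream jobs (e.g., a Spark
# application) never ingest the bulk of the stream.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("telecom.raw", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode()))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

for msg in consumer:
    record = msg.value
    if record.get("severity") == "critical":       # filter once, upstream
        producer.send("telecom.filtered", record)  # downstream consumes this topic
```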
Conference Paper
The Internet of Things (IoT) is set to occupy a substantial component of the future Internet. The IoT connects sensors and devices that record physical observations to applications and services of the Internet[1]. As a successor to technologies such as RFID and Wireless Sensor Networks (WSN), the IoT has stumbled into vertical silos of proprietary systems, providing little or no interoperability with similar systems. As the IoT represents the future state of the Internet, an intelligent and scalable architecture is required to provide connectivity between these silos, enabling discovery of physical sensors and interpretation of messages between the things. This paper proposes a gateway and Semantic Web enabled IoT architecture to provide interoperability between systems, which utilizes established communication and data standards. The Semantic Gateway as Service (SGS) allows translation between messaging protocols such as XMPP, CoAP and MQTT via a multi-protocol proxy architecture. Utilization of broadly accepted specifications such as W3C's Semantic Sensor Network (SSN) ontology for semantic annotations of sensor data provides semantic interoperability between messages and supports semantic reasoning to obtain higher-level actionable knowledge from low-level sensor data.
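The protocol-translation idea can be sketched as a minimal MQTT-to-Kafka bridge (hedged: broker addresses and topic names are assumptions, the paho-mqtt 1.x callback style is used, and the SGS additionally covers XMPP/CoAP and semantic annotation):

```python
# Hedged sketch: bridge raw MQTT sensor messages into Kafka topics.
import paho.mqtt.client as mqtt
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def on_message(client, userdata, msg):
    # Re-publish the raw MQTT payload on a Kafka topic derived from the MQTT
    # topic; semantic annotation (e.g., with SSN terms) would happen here.
    producer.send("iot." + msg.topic.replace("/", "."), msg.payload)

bridge = mqtt.Client()  # paho-mqtt 1.x constructor
bridge.on_message = on_message
bridge.connect("localhost", 1883)
bridge.subscribe("sensors/#")
bridge.loop_forever()
```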
Article
Fog/edge computing has been proposed for integration with the Internet of Things (IoT) to enable computing services on devices deployed at the network edge, aiming to improve the user's experience and the resilience of services in case of failures. With the advantages of a distributed architecture and closeness to end-users, fog/edge computing can provide faster response and greater quality of service for IoT applications. Thus, fog/edge computing-based IoT becomes the future infrastructure for IoT development. To develop fog/edge computing-based IoT infrastructure, the architecture, enabling techniques, and issues related to IoT should be investigated first, and then the integration of fog/edge computing and IoT should be explored. To this end, this paper conducts a comprehensive overview of IoT with respect to system architecture, enabling technologies, and security and privacy issues, and presents the integration of fog/edge computing and IoT, and applications. Particularly, this paper first explores the relationship between Cyber-Physical Systems (CPS) and IoT, both of which play important roles in realizing an intelligent cyber-physical world. Then, existing architectures, enabling technologies, and security and privacy issues in IoT are presented to enhance the understanding of state-of-the-art IoT development. To investigate fog/edge computing-based IoT, this paper also investigates the relationship between IoT and fog/edge computing, and discusses issues in fog/edge computing-based IoT. Finally, several applications, including the smart grid, smart transportation, and smart cities, are presented to demonstrate how fog/edge computing-based IoT can be implemented in real-world applications.
Article
Apache Kafka is a scalable publish-subscribe messaging system with its core architecture as a distributed commit log. It was originally built at LinkedIn as its centralized event pipelining platform for online data integration tasks. Over the years of developing and operating Kafka, we have extended its log-structured architecture as a replicated logging backbone for much wider application scopes in the distributed environment. In this abstract we discuss our design and engineering experience in replicating Kafka logs for various distributed data-driven systems at LinkedIn, including source-of-truth data storage and stream processing.
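From the operator's side, the replicated-log model can be illustrated with a short, hedged sketch (the topic name and partition/replica counts are examples): a topic is created with a replication factor of three, and the producer requires acknowledgment from all in-sync replicas:

```python
# Hedged sketch: each partition's commit log is replicated across three
# brokers, and writes wait for all in-sync replicas to acknowledge.
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="events.source-of-truth",
                              num_partitions=6, replication_factor=3)])

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
producer.send("events.source-of-truth", b"payload")
producer.flush()
```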
S. Becker, F. Schmidt, and O. Kao, "EdgePier: P2P-based container image distribution in edge computing environments," IEEE, Oct. 2021, pp. 1-8. [Online]. Available: https://ieeexplore.ieee.org/document/9679447/
H. Gupta, Z. Xu, and U. Ramachandran, "DataFog: Towards a holistic data management platform for the IoT age at the network edge," USENIX Workshop on Hot Topics in Edge Computing (HotEdge 2018), co-located with USENIX ATC 2018, 2018.
M. Payne, "Processing one billion events per second with NiFi," https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/, 2020. [Online; accessed 21-June-2022].