Figure 6 - uploaded by Seif Haridi
The iteration model of Apache Flink.

Source publication
Article
Full-text available
Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph analysis) can be expressed and executed as...

Contexts in source publication

Context 1
... for iterations in data-parallel processing platforms typically relies on submitting a new job for each iteration, on adding additional nodes to a running DAG [6,25], or on feedback edges [23]. Iterations in Flink are implemented as iteration steps, special operators that can themselves contain an execution graph (Figure 6). To maintain the DAG-based runtime and scheduler, Flink allows for iteration "head" and "tail" tasks that are implicitly connected with feedback edges. ...
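For a concrete picture of how such an iteration step is expressed by a user program, here is a minimal sketch based on the bulk-iteration example from the Flink DataSet API documentation (assuming a Flink 1.x setup); iterate() opens the loop and closeWith() closes it, while the head and tail tasks and the feedback edge are inserted implicitly by the runtime:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // iterate(10000) marks the start of the iteration step; the scheduler keeps a DAG
        // by inserting implicit head/tail tasks connected with a feedback edge.
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10000);

        // Step function executed in every superstep: one Monte-Carlo sample (estimates pi).
        DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer i) {
                double x = Math.random();
                double y = Math.random();
                return i + ((x * x + y * y < 1) ? 1 : 0);
            }
        });

        // closeWith() connects the loop tail back to the head, ending the iteration step.
        DataSet<Integer> count = initial.closeWith(iteration);

        count.print(); // print() acts as the sink and triggers execution
    }
}
```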
Context 2
... iterations cover the communication needs of streaming applications and differ from parallel optimisation problems that are based on structured iterations over finite data. As presented in Section 3.4 and Figure 6, the execution model of Apache Flink already covers asynchronous iterations when no iteration control mechanism is enabled. In addition, to comply with fault-tolerance guarantees, feedback streams are treated as operator state within the implicit iteration head operator and are part of a global snapshot [7]. ...
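As an illustrative sketch of such an asynchronous streaming iteration (assuming the Flink 1.x DataStream API; the decrement/filter logic is arbitrary and only serves to create a feedback stream), iterate() and closeWith() again create the implicit loop, and the records travelling on the feedback edge are the ones the snapshotting mechanism described above must account for:

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingIterationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing can be enabled via env.enableCheckpointing(...); with the snapshotting
        // mechanism of [7], records in transit on the feedback edge are then logged as operator
        // state. (Some Flink versions require explicitly forcing checkpoints for iterative jobs.)

        DataStream<Long> input = env.generateSequence(0, 1000);

        // iterate() introduces the implicit iteration "head"; records fed back via
        // closeWith() travel along the hidden feedback edge to the "tail".
        IterativeStream<Long> iteration = input.iterate();

        DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value - 1;
            }
        });

        // Values still greater than zero are fed back for another pass through the loop...
        DataStream<Long> feedback = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value > 0;
            }
        });
        iteration.closeWith(feedback);

        // ...while the remaining values leave the loop as regular output.
        minusOne.filter(value -> value <= 0).print();

        env.execute("Streaming iteration sketch");
    }
}
```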

Similar publications

Article
Full-text available
Background: Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widely used data preprocessing techniques. Although we can find many proposals for static Big Data preprocessing, there is little research devoted to the continuous Big Data problem. Apache Flink is...

Citations

... Due to this reason, trend calculation is used as the main procedure to evaluate the scalability of DIRAP. Other proposals behave similarly, whereas extensive results on the scalability of streaming outlier detection using Flink have appeared in Toliopoulos et al. (2020b); more generic insights into Flink behaviour can be found in publications such as Carbone et al. (2015). Trend calculation requires two parameters, namely the window size and the window slide. ...
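To make those two parameters concrete, here is a hedged sketch of a sliding window in Flink's DataStream API; the stream contents, key, and aggregate are hypothetical stand-ins, not DIRAP's actual trend computation. The window size and the window slide map directly onto the two arguments of the window assigner:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlidingWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (sensorId, reading) stream standing in for the monitored values.
        DataStream<Tuple2<String, Double>> readings = env.fromElements(
                Tuple2.of("s1", 1.0), Tuple2.of("s1", 2.0), Tuple2.of("s1", 3.0));

        // Window size = 60 s, window slide = 10 s: each result covers the last minute
        // of data and is recomputed every ten seconds.
        readings
                .keyBy(value -> value.f0)
                .window(SlidingProcessingTimeWindows.of(Time.seconds(60), Time.seconds(10)))
                .sum(1) // placeholder aggregate; a real trend would use a custom window function
                .print();

        env.execute("Sliding window sketch");
    }
}
```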
Article
Full-text available
This work describes a novel end-to-end data ingestion and runtime processing pipeline, which is a core part of a technical solution aiming to monitor frailty indices of patients during and after treatment and improve their quality of life. The focus of this work is on the technical architectural details and the functionalities provided, which have been developed in a manner that is extensible, scalable and fault-tolerant by design. Extensibility refers to both data sources and the exact specification of analysis techniques. Our platform can combine data not only from multiple sensor types but also from electronic health records. Also, the analysis component can process the patient data both individually and in combination with data from other patients, while exploiting both cloud and edge resources. We have shown concrete examples of advanced analytics and evaluated the scalability of the system, which has been fully prototyped.
... Modern big-data applications need to operate on large data sets, often using well-known big-data frameworks, such as Apache Spark [37], Flink [1] or Cassandra [11]. Many of these systems are written in Java, relying on Java NIO. ...
Preprint
Many big-data frameworks are written in Java, e.g. Apache Spark, Flink and Cassandra. These systems use the networking framework netty, which is based on Java NIO. While this allows for fast networking on traditional Ethernet networks, it cannot fully exploit the performance of modern interconnects such as InfiniBand, which provide bandwidths of 100 Gbit/s and more. In this paper we propose netty support for hadroNIO, a Java library providing transparent InfiniBand support for Java applications based on NIO. hadroNIO is based on UCX, which supports several interconnects, including InfiniBand. We present hadroNIO extensions and optimizations for supporting netty. The evaluations with microbenchmarks, covering single- and multi-threaded scenarios, show that it is possible for netty applications to reach round-trip times as low as 5 µs and fully utilize the 100 Gbit/s bandwidth of high-speed NICs, without changing the application's source code. We also compare hadroNIO with traditional sockets as well as libvma, and the results show that hadroNIO offers a substantial improvement over plain sockets and can outperform libvma in several scenarios.
    ... Finally, spatiotemporal databases based on traditional RDBMS can consider using distributed file systems such as HDFS to support distributed processing capabilities for themselves, or use in-memory computing frameworks such as Spark [86] and Flink [87] to accelerate computing. ...
    Article
    Full-text available
Ocean data exhibits interesting yet critical features affecting all creatures around the world. Studies in hydrology and oceanology form the root of many disciplines, including global resource management, macroeconomics, environmental protection, climate prediction, etc., which motivates further exploration of the underlying features behind ocean data. However, with high dimensionality, large quantities, heterogeneous sources and, especially, a spatiotemporal nature, the gap between the specific knowledge required and the massive data poses unique challenges for effective data representation and knowledge mining. This paper aims to provide a summary of studies on these issues, including data representation, data processing, knowledge discovery, and algorithms for finding unique patterns in ocean environment changes, such as temperature, tide height, waves, salinity, etc. In detail, we comprehensively discuss ocean spatiotemporal data processing techniques. We further summarize related work on the representation of ocean spatiotemporal data, the construction of an ocean knowledge graph, and the management of ocean spatiotemporal data. Finally, we collect and compare the evolution of, and multiple state-of-the-art approaches to, ocean spatiotemporal data processing.
... Click on an item or like an item etc.) and another Kafka queue for features. At the core of the engine is a Flink [4] streaming job acting as the online feature joiner. The online joiner concatenates features with labels from user actions and produces training examples, which are then written to a Kafka queue. ...
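The joiner pattern described here can be sketched in Flink's DataStream API. The following is only an illustrative Java sketch, not the cited system's code, and every name (the request id key, the feature strings, the pendingFeatures state) is a hypothetical stand-in: two keyed streams, one for features and one for user-action labels, are connected and matched per key in a KeyedCoProcessFunction, and the concatenated training examples are emitted (printed here in place of the training-example Kafka queue):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class OnlineJoinerSketch {

    // Key both streams by a shared request id (first tuple field).
    private static KeySelector<Tuple2<String, String>, String> byRequestId() {
        return new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) {
                return value.f0;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical stand-ins for the two Kafka queues: (requestId, features) and (requestId, label).
        DataStream<Tuple2<String, String>> features = env.fromElements(Tuple2.of("r1", "f=0.3,0.7"));
        DataStream<Tuple2<String, String>> labels = env.fromElements(Tuple2.of("r1", "click"));

        features
            .connect(labels)
            .keyBy(byRequestId(), byRequestId())
            .process(new KeyedCoProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>, String>() {

                private transient ValueState<String> pendingFeatures;

                @Override
                public void open(Configuration parameters) {
                    pendingFeatures = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("pendingFeatures", String.class));
                }

                @Override
                public void processElement1(Tuple2<String, String> feature, Context ctx,
                                            Collector<String> out) throws Exception {
                    // Buffer the features until the matching user-action label arrives.
                    pendingFeatures.update(feature.f1);
                }

                @Override
                public void processElement2(Tuple2<String, String> label, Context ctx,
                                            Collector<String> out) throws Exception {
                    String f = pendingFeatures.value();
                    if (f != null) {
                        out.collect(f + " label=" + label.f1); // one training example
                        pendingFeatures.clear();
                    }
                }
            })
            .print(); // stand-in for the training-example Kafka sink

        env.execute("Online joiner sketch");
    }
}
```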
    Preprint
    Full-text available
Building a scalable and real-time recommendation system is vital for many businesses driven by time-sensitive customer feedback, such as short-video ranking or online ads. Despite the ubiquitous adoption of production-scale deep learning frameworks like TensorFlow or PyTorch, these general-purpose frameworks fall short of business demands in recommendation scenarios for various reasons: on one hand, tweaking systems based on static parameters and dense computations for recommendation with dynamic and sparse features is detrimental to model quality; on the other hand, such frameworks are designed with the batch-training stage and the serving stage completely separated, preventing the model from interacting with customer feedback in real time. These issues led us to reexamine traditional approaches and explore radically different design choices. In this paper, we present Monolith, a system tailored for online training. Our design has been driven by observations of our application workloads and production environment that reflect a marked departure from other recommendation systems. Our contributions are manifold: first, we crafted a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint; second, we provide a production-ready online training architecture with high fault tolerance; finally, we proved that system reliability can be traded off for real-time learning. Monolith has successfully landed in the BytePlus Recommend product.
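As a purely illustrative aside (this is not Monolith's implementation; all names, parameters, and the eviction policy below are assumptions), a collisionless embedding table with frequency filtering and expirable entries can be sketched in plain Java as one slot per feature ID, admitted only after a minimum number of occurrences and dropped after a period of inactivity:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;

/** Toy sketch: one embedding slot per feature ID (no hash collisions), admitted only
 *  after minFrequency occurrences, evicted after ttlMillis of inactivity. */
public class CollisionlessEmbeddingSketch {
    private final Map<Long, float[]> table = new HashMap<>();
    private final Map<Long, Integer> occurrences = new HashMap<>();
    private final Map<Long, Long> lastAccess = new HashMap<>();
    private final int dim;
    private final int minFrequency;
    private final long ttlMillis;
    private final Random rng = new Random(42);

    public CollisionlessEmbeddingSketch(int dim, int minFrequency, long ttlMillis) {
        this.dim = dim;
        this.minFrequency = minFrequency;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the embedding for a feature ID, or null while it is still filtered out. */
    public float[] lookup(long featureId) {
        long now = System.currentTimeMillis();
        int seen = occurrences.merge(featureId, 1, Integer::sum);
        if (seen < minFrequency) {
            return null; // frequency filtering: rare IDs get no embedding yet
        }
        lastAccess.put(featureId, now);
        return table.computeIfAbsent(featureId, id -> randomEmbedding());
    }

    /** Expirable embeddings: drop entries not touched within the TTL. */
    public void expire() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<Long, Long>> it = lastAccess.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> e = it.next();
            if (now - e.getValue() > ttlMillis) {
                table.remove(e.getKey());
                occurrences.remove(e.getKey());
                it.remove();
            }
        }
    }

    private float[] randomEmbedding() {
        float[] v = new float[dim];
        for (int i = 0; i < dim; i++) {
            v[i] = (float) rng.nextGaussian() * 0.01f;
        }
        return v;
    }
}
```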
    ... Based on this model, we analyze individual tasks and the combination of tasks to form workflows, their progress, their bottlenecks, and the actual resource utilization in Sect. 3. We discuss the practical application of the approach and the challenges and tradeoffs involved in Sect. 4. ...
... Declarative data analysis frameworks such as Spark [19] and Flink [3] try to assign resources where they are most needed through the concept of lazy evaluation. The required data flow to compute particular outputs is defined, and when the output is needed, the input dependencies are followed backward and the corresponding computational task is triggered to produce it. ...
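A minimal sketch of this lazy-evaluation behaviour, using Flink's DataSet API with arbitrary example data: the transformation calls only declare the dataflow, and nothing is computed until the final count() requests an output, at which point the dependencies are followed backward and the tasks are executed:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LazyEvaluationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // These calls only build the dataflow graph; no data is processed yet.
        DataSet<Long> numbers = env.generateSequence(1, 1_000_000);
        DataSet<Long> evens = numbers.filter(n -> n % 2 == 0);
        DataSet<Long> squared = evens.map(n -> n * n);

        // Only now, when the output is requested, are the input dependencies
        // followed backward and the corresponding tasks actually executed.
        System.out.println(squared.count());
    }
}
```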
    Preprint
    Full-text available
In recent years, scientific workflows have gained more and more popularity. In scientific workflows, tasks are typically treated as black boxes. Dealing with their complex interrelations to identify optimization potentials and bottlenecks is therefore inherently hard. The progress of a scientific workflow depends on several factors, including the available input data, the available computational power, and the I/O and network bandwidth. Here, we tackle the problem of predicting the workflow progress with very low overhead. To this end, we look at suitable formalizations for the key parameters and their interactions which are sufficiently flexible to describe the input data consumption, the computational effort and the output production of the workflow's tasks. At the same time, they allow for computationally simple and fast performance predictions, including a bottleneck analysis over the workflow runtime. A piecewise-defined bottleneck function is derived from the discrete intersections of the task models' limiting functions. This allows estimating the potential performance gains from overcoming the bottlenecks and can be used as a basis for optimized resource allocation and workflow execution.
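A hedged way to read that construction (the symbols and the three limiting terms below are illustrative assumptions, not notation from the paper): a task's attainable progress rate is the minimum of its limiting functions, the bottleneck at time t is the term attaining that minimum, and the bottleneck function is therefore piecewise-defined with breakpoints at the intersections of the limiting functions:

```latex
% Illustrative only: r_in, r_cpu, r_out are hypothetical limiting functions for
% input consumption, computational effort, and output production of one task.
r(t) = \min\bigl(r_{\mathrm{in}}(t),\, r_{\mathrm{cpu}}(t),\, r_{\mathrm{out}}(t)\bigr),
\qquad
\mathrm{bottleneck}(t) = \operatorname*{arg\,min}_{k \in \{\mathrm{in},\, \mathrm{cpu},\, \mathrm{out}\}} r_k(t)
```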
... Data analysis systems, such as batch processing systems [19,30,59] and stream processing systems [17,35], guarantee consistent output regardless of the allocated resources. A deep learning framework, which analyzes training data samples to build neural network models and can be viewed as a data analysis system for artificial intelligence, should likewise produce a consistent model after iterative training. ...
    Preprint
Distributed synchronized GPU training is commonly used for deep learning. The constraint of using a fixed set of GPUs makes large-scale deep learning training jobs suffer and also lowers cluster utilization. However, incorporating resource elasticity often introduces non-determinism in model accuracy, which is mainly due to the lack of capability to isolate the model training procedure from hardware resources. We introduce EasyScale, an elastic framework that scales distributed training on heterogeneous GPUs while producing deterministic deep learning models. EasyScale follows the data-parallel training flow strictly, traces the accuracy-relevant factors carefully, and utilizes deep learning characteristics for efficient context switching, thus achieving elastic, accuracy-consistent model training. To saturate the computation capability of heterogeneous GPUs, EasyScale dynamically assigns workers based on our intra-job and inter-job scheduling policies, minimizing GPU idle time and maximizing aggregated job throughput. Deployed in an online serving cluster of CompanyA, EasyScale allows elastic deep learning training jobs to utilize free GPUs opportunistically, improving overall cluster utilization by 62.1% without violating the SLA.
... wherein K denotes the node of the label-attribute distribution for the adaptive recommendation of online ideological and political teaching resources, and p_k is the spatial state feature measure of the data link layer [15]. Based on the above analysis, an audience behaviour feature extraction model for the adaptive recommendation of online ideological and political teaching resources is constructed, and the recommendation is optimized according to the feature extraction results [12]. ...
    Article
    Full-text available
The structure of the online ideological and political teaching resource management system is established from the perspective of information management in colleges and universities, and the level of online ideological and political teaching information is improved by combining it with an optimized design of the resource recommendation model. In this paper, an adaptive recommendation system and algorithm for online ideological and political teaching resources, designed on deep reinforcement learning, is proposed. A cost-relationship model between online ideological and political teaching resources and learning profitability is constructed. A multidimensional constraint index parameter analysis method is adopted, and an adaptive matching model of online ideological and political teaching resources is established. According to online ideological and political teaching norms, combined with the analysis of the high-quality educational resources of audience groups, a dynamic evaluation of online ideological and political teaching resources and an adaptive matching model of interest preferences are established. Finally, the deep reinforcement learning method is adopted: by analyzing the characteristics of the resource structure model of online ideological and political teaching resources, through benefit evaluation, resource supply-and-demand balance management analysis, and balanced game control, the online ideological and political teaching resource management system can be improved and recommendations made adaptively. The simulation results indicate that this approach has good adaptability and high accuracy in recommending online ideological and political teaching resources.
... In the KRR research field, the Answer Set Programming (ASP) declarative formalism [6,7] has been acknowledged as a particularly attractive basis for SR [1], and a number of SR solutions relying on ASP have recently been proposed [8,9,10,11,12,13,4,14,15,16]. Among these, I-DLV-sr [16] is an SR system that efficiently scales over real-world application domains thanks to a proper integration of the well-established stream processor Apache Flink [17] and the incremental ASP reasoner I²-DLV [18]. Its input language, called LDSR (the Language of I-DLV for Stream Reasoning), inherits the highly declarative nature and ease of use of ASP, while being extended with new constructs that are relevant for practical SR scenarios. ...
    Preprint
    The paper investigates the relative expressiveness of two logic-based languages for reasoning over streams, namely LARS Programs -- the language of the Logic-based framework for Analytic Reasoning over Streams called LARS -- and LDSR -- the language of the recent extension of the I-DLV system for stream reasoning called I-DLV-sr. Although these two languages build over Datalog, they do differ both in syntax and semantics. To reconcile their expressive capabilities for stream reasoning, we define a comparison framework that allows us to show that, without any restrictions, the two languages are incomparable and to identify fragments of each language that can be expressed via the other one.
    ... Due to diverse system requirements, such as managing states beyond main memory, elastic scaling, and migrating states among shared-nothing architectures, SPEs were designed to be fully aware of states, to relieve the burden on developers. SPEs with built-in state management support are known as stateful SPEs [36]. For more information about state management, see the following survey [115]. ...
... They offer scalability and provide fault tolerance. Over the past decade, many SPEs [12,13,36,57] have been proposed by academia and industry. ...
    ... They are both scalable over a cluster of commodity machines and highly fault-tolerant. Among the modern SPEs are S4 [82], Storm [117], Heron [69], Flink [36], Spark Streaming [130], Samza [84], and the recent Kafka Streams [3]. The emergence of massively parallel processors with multi-terabyte storage capacity and network bandwidths exceeding several gigabytes/second has led to a burst of activity over the past five years, with the objective to enable hardware-conscious stream processing [58,67,80,103,114,131,132,136]. ...
    Preprint
    Full-text available
Transactional stream processing (TSP) has been increasingly gaining traction. TSP aims to provide a single unified model that offers both transaction- and stream-oriented guarantees. Over the past decade, considerable efforts have resulted in the development of alternative TSP systems, enabling us to explore the commonalities and differences across these solutions. However, a widely accepted standard approach to the integration of transactional functionality with stream processing is still lacking. Existing TSP systems typically focus on a limited number of application features with non-trivial design trade-offs. This survey initially examines diverse transaction models over streams and TSP-specific transactional properties, followed by a discussion on the consequences of certain design decisions for system implementations. Subsequently, we highlight a set of representative scenarios where TSP is employed, and discuss some open problems. The aim of this survey is twofold: first, to provide insight into disparate TSP requirements and techniques; second, to encourage the design and development of novel TSP systems.
... Compared with existing KGC systems such as DeepDive [80] and T2KG [36], our system features high scalability and high performance, extends capabilities by adopting multi-cloud computing, and provides more functions for KGC. Compared with existing distributed computing systems such as Spark [77] and Flink [13], our system designs and implements a large number of specific operators and built-in, well-trained information extraction models specifically for KGC scenarios, and designs distributed scheduling tailored to the characteristics of information extraction applications. Compared with distributed deep learning (DDL) schedulers such as Tiresias [27], BytePS [32], etc., our system focuses on a coarser granularity of scheduling, targeting ensembles of deep learning models with different architectures. ...
... Algorithm. Building on the above ideas, we can propose a heuristic algorithm that optimizes the objective in Equation 13 and copes with the hardness of the flowline scheduling problem. To this end, we design a two-step algorithm: first, ensure that compounded tasks are not separated by treating them as unpartitionable clusters; then, determine the partition strategy and the number of clusters based on a greedy approach. ...
... Step 3. Greedy Graph Partition. Motivated by Assumption 2, we construct a greedy graph partition strategy for minimizing the overall cost in Equation 13. This phase targets the compounded clusters and the remaining orphaned nodes that were not assigned to any cluster in the previous steps. ...
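The two-step shape described in these excerpts can be sketched roughly as follows. This is not the authors' algorithm, and the cost function below merely stands in for their Equation 13, which is not reproduced here; all names and the load-based cost are hypothetical. Compounded task groups seed unpartitionable clusters, and each remaining orphaned node is then greedily placed where the cost increase is smallest:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GreedyPartitionSketch {

    /** Hypothetical stand-in for the cost objective ("Equation 13" in the cited paper). */
    static double cost(List<Integer> cluster, int candidate, Map<Integer, Double> load) {
        double sum = load.getOrDefault(candidate, 1.0);
        for (int n : cluster) {
            sum += load.getOrDefault(n, 1.0);
        }
        return sum; // e.g. total load of the cluster after adding the candidate
    }

    /**
     * Steps 1-2: seed clusters with the compounded (inseparable) task groups.
     * Step 3: greedily place each orphaned node where the cost increase is smallest.
     */
    static List<List<Integer>> partition(List<Set<Integer>> compoundedGroups,
                                         List<Integer> orphanedNodes,
                                         Map<Integer, Double> load) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (Set<Integer> group : compoundedGroups) {
            clusters.add(new ArrayList<>(group)); // unpartitionable seeds
        }
        for (int node : orphanedNodes) {
            List<Integer> best = null;
            double bestCost = Double.MAX_VALUE;
            for (List<Integer> cluster : clusters) {
                double c = cost(cluster, node, load);
                if (c < bestCost) {
                    bestCost = c;
                    best = cluster;
                }
            }
            if (best == null) {           // no clusters yet: start a new one
                best = new ArrayList<>();
                clusters.add(best);
            }
            best.add(node);
        }
        return clusters;
    }
}
```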
    Preprint
    Full-text available
We design a user-friendly and scalable knowledge graph construction (KGC) system for extracting structured knowledge from unstructured corpora. Different from existing KGC systems, gBuilder provides a flexible and user-defined pipeline to embrace the rapid development of IE models. More built-in template-based or heuristic operators, as well as programmable operators, are available for adapting to data from different domains. Furthermore, we also design cloud-based self-adaptive task scheduling for gBuilder to ensure its scalability on large-scale knowledge graph construction. Experimental evaluation not only demonstrates the ability of gBuilder to organize multiple information extraction models for knowledge graph construction in a uniform platform, but also confirms its high scalability on large-scale KGC tasks.