Technical Report

Towards Distributed Processing on Event-sourced Graphs


Processing large-scale data sets and streaming data challenges traditional computing platforms, which lack increasingly relevant features such as data lineage and inherent support for retrospective and predictive analytics. We suggest a novel processing approach that embraces graph-based computing with built-in support for application history. It combines concepts from event processing and graph computing, an Actor-related programming model, and an event-based, time-aware persistence approach into a unified distributed processing solution.

... Most of the literature on this pattern is found online, in blog posts, presentations, or software documentation. The academic literature is relatively limited and mainly concerns the study of how graphs evolve over time [Erb, 2015; Erb et al., 2017]. This section provides an overview of the different definitions given for ES and of the vocabulary associated with this design pattern. ...
The technological evolution of the web in recent years has fostered the arrival of collaborative virtual environments for large-scale 3D modeling. While collaboration brings geographically distant users together in a single shared space toward a common goal, the hardware resources they contribute (computation, storage, 3D ...) along with their knowledge are still too rarely used, and this constitutes a challenge. The aim is to offer a system that is simple, efficient, and transparent for users, enabling effective collaboration both on the computational side and, of course, on the domain side of 3D modeling on the web. To scale effectively, many systems use a so-called "hybrid" network architecture that combines client-server and peer-to-peer. Optimistic replication is well suited to the properties of these distributed environments: the dynamicity and number of users, the type of data processed (3D), and its size. This thesis presents a model for collaborative 3D editing systems on the web. The client architecture (3DEvent) moves the domain aspects of 3D as close as possible to the user in the form of events. This event-oriented architecture rests on the observation that there is a strong need for traceability and history of 3D data while a model is being assembled. This aspect is inherently provided by the event-sourcing design pattern. The model is completed by the definition of a peer-to-peer middleware. On this last point, we propose using WebRTC technology, which offers an API familiar to developers of cloud services. An evaluation based on two user studies on the acceptance of the proposed model was conducted with several groups of users performing 3D model assembly tasks.
Conference Paper
Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
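The iteration scheme described above can be illustrated with a small single-process sketch. It is not the framework's API: the function name `run_supersteps`, the dictionary-based graph representation, and the choice of "propagate the maximum value" as the vertex program are assumptions made for illustration, and every message target is assumed to be a key of `values`.

```python
from collections import defaultdict


def run_supersteps(values, edges, max_supersteps=50):
    """Propagate the largest vertex value along directed edges.

    values: dict mapping vertex -> initial numeric value (hypothetical layout)
    edges:  dict mapping vertex -> list of out-neighbours
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    inbox = defaultdict(list)   # messages sent in the previous superstep
    active = set(values)        # every vertex is active in superstep 0
    superstep = 0
    while superstep < max_supersteps and (active or inbox):
        outbox = defaultdict(list)
        # A vertex computes if it is active or has received messages.
        for v in active | set(inbox):
            old = values[v]
            new = max([old] + inbox.get(v, []))
            values[v] = new
            # Send only in the first superstep or when the value changed.
            if superstep == 0 or new > old:
                for neighbour in edges.get(v, []):
                    outbox[neighbour].append(new)
        # Every vertex votes to halt; an incoming message reactivates it.
        active = set()
        inbox = outbox          # synchronous barrier between supersteps
        superstep += 1
    return values, superstep


if __name__ == "__main__":
    ring = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
    final, steps = run_supersteps({"a": 3, "b": 6, "c": 2, "d": 1}, ring)
    print(final, steps)
```

Swapping `inbox` for `outbox` only at the end of the loop is what makes the sketch synchronous: a message sent in one iteration becomes visible only in the next.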
A foundational model of concurrency is developed in this thesis. We examine issues in the design of parallel systems and show why the actor model is suitable for exploiting large-scale parallelism. Concurrency in actors is constrained only by the availability of hardware resources and by the logical dependence inherent in the computation. Unlike dataflow and functional programming, however, actors are dynamically reconfigurable and can model shared resources with changing local state. Concurrency is spawned in actors using asynchronous message-passing, pipelining, and the dynamic creation of actors. This thesis deals with some central issues in distributed computing. Specifically, problems of divergence and deadlock are addressed. For example, actors permit dynamic deadlock detection and removal. The problem of divergence is contained because independent transactions can execute concurrently and potentially infinite processes are nevertheless available for interaction.
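As a rough illustration of asynchronous message passing between actors with changing local state, the following sketch runs actors on a single-threaded scheduler. The class names `ActorSystem`, `Counter`, and `Collector` are hypothetical, and the single FIFO queue is a simplification chosen so that the example is deterministic; it does not model real concurrency.

```python
from collections import deque


class ActorSystem:
    """Toy scheduler: delivers queued messages one at a time."""

    def __init__(self):
        self.queue = deque()  # pending (actor, message) deliveries

    def spawn(self, actor_cls, *args):
        # Dynamic creation of an actor; the instance serves as its address.
        return actor_cls(self, *args)

    def send(self, actor, message):
        # Asynchronous send: enqueue and return without waiting.
        self.queue.append((actor, message))

    def run(self):
        while self.queue:
            actor, message = self.queue.popleft()
            actor.receive(message)


class Counter:
    """Actor whose local state changes in response to messages."""

    def __init__(self, system):
        self.system = system
        self.count = 0

    def receive(self, message):
        kind, reply_to = message
        if kind == "increment":
            self.count += 1
        elif kind == "get":
            self.system.send(reply_to, ("value", self.count))


class Collector:
    """Actor that records every message it receives."""

    def __init__(self, system):
        self.system = system
        self.seen = []

    def receive(self, message):
        self.seen.append(message)


if __name__ == "__main__":
    system = ActorSystem()
    counter = system.spawn(Counter)
    collector = system.spawn(Collector)
    system.send(counter, ("increment", None))
    system.send(counter, ("get", collector))
    system.run()
    print(collector.seen)
```

State is touched only inside `receive`, so the counter needs no locks: all interaction goes through messages.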
Most computer systems are built on a command-and-control scheme: one method calls another method and instructs it to perform some action or to retrieve some required information. But often the real world works differently. A company receives a new order; a web server receives a request for a Web page; the right front wheel of my car locks up. In none of these cases did the system (order processing, web server, anti-lock brake control) schedule or request the action. Instead the event occurred based on external action or activity, caused either by the physical world or another, connected computer system. Could we change the architecture of our system to relinquish control and instead respond to events as they arrive? What would such a system look like?
Events Everywhere
The real world is full of events. The alarm goes off; the phone rings; the "gas low" warning light in the car comes on. Many computer systems, especially embedded systems, are designed to respond to events. The engine control computer in your car receives an event every time the crankshaft is at the zero position and starts the timer for another round of ignitions. As of now, many of the systems that function based on external events live in a rather small universe, most of them even invisible to the user. However, as computer systems become more and more interconnected they start to publish and receive an increasing number of events. An order management system may receive orders from a Web site or an order entry application and notify other systems of the new order. Systems interested in new orders might be the financial system, which will see whether the order is backed with a credit line or a valid credit card to charge, and the warehouse, which verifies that inventory to fulfill the order is present. Each of these systems might then publish another event to any interested party. The shipping system in turn might wait for both an Inventory Allocated and Payment Processed message and in response prepare the goods for shipment. This event-based style of interaction is notably different from the traditional command-and-control style that would have the warehouse ask for the inventory status, wait for an answer, and then ask the financial system to process the payment. Next, the order management system would wait for a positive answer and lastly instruct the shipping system to send the goods.
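The order example can be sketched with a minimal in-process publish/subscribe bus. Everything named here (`EventBus`, `ShippingSystem`, `wire_order_flow`, and the event type strings) is hypothetical and stands in for whatever messaging infrastructure a real system would use. The warehouse and financial system are reduced to handlers that always succeed.

```python
from collections import defaultdict


class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, order_id):
        for handler in list(self.subscribers[event_type]):
            handler(event_type, order_id)


class ShippingSystem:
    """Ships an order once both required events have arrived."""

    REQUIRED = {"InventoryAllocated", "PaymentProcessed"}

    def __init__(self, bus):
        self.seen = defaultdict(set)  # order id -> event types received
        self.shipped = []
        for event_type in self.REQUIRED:
            bus.subscribe(event_type, self.on_event)

    def on_event(self, event_type, order_id):
        self.seen[order_id].add(event_type)
        if self.seen[order_id] == self.REQUIRED and order_id not in self.shipped:
            self.shipped.append(order_id)


def wire_order_flow(bus):
    """Warehouse and finance react to new orders and publish follow-up events."""
    bus.subscribe("OrderReceived",
                  lambda _type, oid: bus.publish("InventoryAllocated", oid))
    bus.subscribe("OrderReceived",
                  lambda _type, oid: bus.publish("PaymentProcessed", oid))


if __name__ == "__main__":
    bus = EventBus()
    shipping = ShippingSystem(bus)
    wire_order_flow(bus)
    bus.publish("OrderReceived", 42)
    print(shipping.shipped)
```

No component calls another directly. The shipping system never asks the warehouse or the financial system for status; it reacts once both events for the same order have been observed, in whichever order they arrive.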
This paper suggests that input and output are basic primitives of programming and that parallel composition of communicating sequential processes is a fundamental program structuring method. When combined with a development of Dijkstra's guarded command, these concepts are surprisingly versatile. Their use is illustrated by sample solutions of a variety of familiar programming exercises.
Conference Paper
Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction, which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction, we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order-of-magnitude gains.
The transition from sequential to parallel computation is an area of critical concern in today's computer technology, particularly in architecture, programming languages, systems, and artificial intelligence. This book addresses issues in concurrency, and by producing both a syntactic definition and a denotational model of Hewitt's actor paradigm - a model of computation specifically aimed at constructing and analyzing distributed large-scale parallel systems - it advances the understanding of parallel computation.
The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
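A logical clock of the kind described can be sketched in a few lines. The class and method names are assumptions for illustration. The receive rule uses the common `max(local, received) + 1` formulation, and ties between equal timestamps are broken by process id to obtain a total order.

```python
class LamportClock:
    """Logical clock for one process (illustrative sketch)."""

    def __init__(self, pid):
        self.pid = pid   # process id, used only to break timestamp ties
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time):
        """Move past both the local time and the message's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time

    def stamp(self):
        """(time, pid) pairs compare lexicographically, giving a total order."""
        return (self.time, self.pid)


if __name__ == "__main__":
    p, q = LamportClock(1), LamportClock(2)
    p.tick()                 # local event on p
    t = p.send()             # p sends a message stamped t
    send_stamp = p.stamp()
    q.receive(t)             # q receives it
    print(send_stamp, q.stamp(), send_stamp < q.stamp())
```

If one event happened before another, its stamp is smaller. The converse does not hold: two causally unrelated events still receive comparable stamps, so the total order extends the partial order rather than revealing it.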
A distributed system can be characterized by the fact that the global state is distributed and that a common time base does not exist. However, the notion of time is an important concept in everyday life of our decentralized "real world" and helps to solve problems like getting a consistent population census or determining the potential causality between events. We argue that a linearly ordered structure of time is not (always) adequate for distributed systems and propose a generalized non-standard model of time which consists of vectors of clocks. These clock-vectors are partially ordered and form a lattice. By using timestamps and a simple clock update mechanism the structure of causality is represented in an isomorphic way. The new model of time has a close analogy to Minkowski's relativistic spacetime and leads among others to an interesting characterization of the global state problem. Finally, we present a new algorithm to compute a consistent global snapshot of a distributed system where messages may be received out of order.
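The clock-vector idea can be illustrated with plain lists, one entry per process. The function names below are assumptions for illustration, and the sketch covers only the timestamp update and comparison rules, not the snapshot algorithm.

```python
def vc_new(n):
    """Fresh clock vector for a system of n processes."""
    return [0] * n


def vc_tick(clock, i):
    """Process i performs a local or send event: increment its own entry."""
    updated = list(clock)
    updated[i] += 1
    return updated


def vc_merge(local, received, i):
    """Process i receives a message: component-wise maximum, then tick."""
    merged = [max(a, b) for a, b in zip(local, received)]
    merged[i] += 1
    return merged


def vc_leq(a, b):
    """Component-wise comparison, which defines the partial order."""
    return all(x <= y for x, y in zip(a, b))


def happened_before(a, b):
    return vc_leq(a, b) and a != b


def concurrent(a, b):
    return a != b and not vc_leq(a, b) and not vc_leq(b, a)


if __name__ == "__main__":
    sent = vc_tick(vc_new(3), 0)              # process 0 sends a message
    received = vc_merge(vc_new(3), sent, 1)   # process 1 receives it
    other = vc_tick(vc_new(3), 2)             # unrelated event on process 2
    print(happened_before(sent, received), concurrent(received, other))
```

Unlike a single scalar clock, the vectors distinguish causally related events from concurrent ones: two timestamps are ordered exactly when one is component-wise less than or equal to the other, and the component-wise maximum used in `vc_merge` is their least upper bound.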