Thesis (PDF available)

A Microservice Architecture for the Processing of Large Geospatial Data in the Cloud

Abstract

With the growing number of devices that can collect spatiotemporal information, as well as the improving quality of sensors, the geospatial data volume increases constantly. Before the raw collected data can be used, it has to be processed. Currently, expert users are still relying on desktop-based Geographic Information Systems to perform processing workflows. However, the volume of geospatial data and the complexity of processing algorithms exceed the capacities of their workstations. There is a paradigm shift from desktop solutions towards the Cloud, which offers virtually unlimited storage space and computational power, but developers of processing algorithms often have no background in computer science and hence no expertise in Cloud Computing. Our research hypothesis is that a microservice architecture and Domain-Specific Languages can be used to orchestrate existing geospatial processing algorithms, and to compose and execute geospatial workflows in a Cloud environment for efficient application development and enhanced stakeholder experience. We present a software architecture that contains extension points for processing algorithms (or microservices), a workflow management component for distributed service orchestration, and a workflow editor based on a Domain-Specific Language. The main aim is to provide both users and developers with the means to leverage the possibilities of the Cloud, without requiring them to have a deep knowledge of distributed computing. In order to conduct our research, we follow the Design Science Research Methodology. We perform an analysis of the problem domain and collect requirements as well as quality attributes for our architecture. To meet our research objectives, we design the architecture and develop approaches to workflow management and workflow modelling. We demonstrate the utility of our solution by applying it to two real-world use cases and evaluate the quality of our architecture based on defined scenarios. Finally, we critically discuss our results. Our contributions to the scientific community can be classified into three pillars. We present a scalable and modifiable microservice architecture for geospatial processing that supports distributed development and offers high availability. Further, we present novel approaches to service integration and orchestration in the Cloud as well as rule-based and dynamic workflow management without a priori design-time knowledge. For workflow modelling, we create a Domain-Specific Language based on a novel language design method.
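To give a flavour of what such workflow composition can look like, the following is a minimal, hypothetical Python sketch of chaining processing services into a workflow. The `Step` and `Workflow` classes and the service names are illustrative only, since the thesis defines its own Domain-Specific Language for this purpose.

```python
# Hypothetical sketch of composing a geospatial processing workflow.
# `Step`, `Workflow`, and the service names are illustrative; the thesis
# defines a Domain-Specific Language for this purpose.

class Step:
    def __init__(self, service, **params):
        self.service = service  # name of the processing microservice
        self.params = params    # service-specific parameters
        self.next = []          # successor steps

    def then(self, step):
        """Chain another step after this one; return it for fluent use."""
        self.next.append(step)
        return step

class Workflow:
    def __init__(self, root):
        self.root = root

    def steps(self):
        """Yield all steps in breadth-first order."""
        queue, seen = [self.root], set()
        while queue:
            step = queue.pop(0)
            if id(step) in seen:
                continue
            seen.add(id(step))
            yield step
            queue.extend(step.next)

# Example: import a point cloud, remove outliers, derive a terrain model.
root = Step("import", source="s3://bucket/pointcloud.laz")
root.then(Step("remove_outliers", stddev=2.5)).then(Step("dtm", resolution=1.0))

for s in Workflow(root).steps():
    print(s.service, s.params)
```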
... With the continuing growth in global data, the need for automated data processing becomes more and more evident. This applies to areas such as Bioinformatics [1], Geology [2], and Geoinformatics [3,4] but also Astronomy [5] (see also "Use case 1: computing astronomical image mosaics" section) and Engineering ("Use case 2: shape optimisation via structural analysis" section). Applications in these areas often employ a number of processing steps (hereafter referred to as actions) that need to be run in a specific order to transform a set of input data into a desired output. ...
... However, scientists with no knowledge of distributed computing may be unable to use these alternative programming models correctly. A workflow management system can provide benefits to them in terms of usability, scalability, and flexibility [3]. In this paper, we present an algorithm and a software architecture for such a scientific workflow management system. ...
... In this sense, we are continuously exploring new areas and improving our approach. The research results presented have evolved from a larger group of previous works: a microservice architecture for the processing of large geospatial data [3], workflow modelling with domain-specific languages [65], as well as capability-based task scheduling of process chains in a distributed environment [66]. As mentioned earlier, an implementation of our algorithm and our software architecture is available with the open-source workflow management system Steep [15]. ...
Article
Full-text available
We present an algorithm and a software architecture for a cloud-based system that executes cyclic scientific workflows whose structure may change during run time. Existing approaches either rely on workflow definitions based on directed acyclic graphs (DAGs) or require workarounds to implement cyclic structures. In contrast, our system supports cycles natively, avoids workarounds, and as such reduces the complexity of workflow modelling and maintenance. Our algorithm traverses workflow graphs and transforms them iteratively into linear sequences of executable actions. We call these sequences process chains. Our software architecture distributes the process chains to multiple compute nodes in the cloud and oversees their execution. We evaluate our approach by applying it to two practical use cases from the domains of astronomy and engineering. We also compare it with two existing workflow management systems. The evaluation demonstrates that our algorithm is able to execute dynamically changing workflows with cycles and that design and maintenance of complex workflows is easier than with existing solutions. It also shows that our software architecture can run process chains on multiple compute nodes in parallel to significantly speed up the workflow execution. An implementation of our algorithm and the software architecture is available with the Steep Workflow Management System that we released under an open-source license. The resources for the first practical use case are also available as open source for reproduction.
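The following Python sketch illustrates the core idea under simplifying assumptions: in each iteration, all actions whose inputs are available are turned into process chains, so cyclic or dynamically changing workflows can be handled by repeated calls. For brevity, every executable action starts its own chain here, whereas the actual algorithm in Steep also concatenates linearly dependent actions into longer chains.

```python
# Simplified sketch: iteratively turn a workflow graph into process chains
# of actions whose inputs are already available. Data structures are
# illustrative; the real algorithm is implemented in Steep.

def build_process_chains(actions, available):
    """actions: dicts with 'id', 'inputs', 'outputs' (sets of dataset names)
    available: set of dataset names that already exist"""
    chains, remaining = [], list(actions)
    progress = True
    while progress:
        progress = False
        for action in list(remaining):
            if action["inputs"] <= available:   # all inputs ready?
                chains.append([action["id"]])   # one (here: trivial) chain
                available |= action["outputs"]  # outputs become available
                remaining.remove(action)
                progress = True
    return chains, remaining  # remaining actions wait for a later iteration

actions = [
    {"id": "resample", "inputs": {"raw"},  "outputs": {"grid"}},
    {"id": "mosaic",   "inputs": {"grid"}, "outputs": {"map"}},
    {"id": "publish",  "inputs": {"map"},  "outputs": set()},
]
chains, pending = build_process_chains(actions, {"raw"})
print(chains, pending)  # [['resample'], ['mosaic'], ['publish']] []
```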
... Even though it attracts developers because of its ease-of-use capabilities and cost efficiency, performing a workflow consisting of multiple steps is still complex for most geospatial data processing scenarios. Each step in a geospatial data processing workflow might require different computing power and amounts of data [5]. A significant advantage of serverless applications is that they do not use any resources when idle. ...
Article
Full-text available
Geospatial data and related technologies have become an increasingly important aspect of data analysis, playing a prominent role in most analysis processes. The serverless paradigm has become one of the most popular and frequently used technologies within cloud computing. This paper reviews the serverless paradigm and examines how it could be leveraged for geospatial data processes by using open standards in the geospatial community. We propose a system design and architecture to handle complex geospatial data processing jobs with minimum human intervention and resource consumption using serverless technologies. In order to define and execute workflows in the system, we also propose new models for both workflow and task definitions. Moreover, the proposed system provides web services based on the Open Geospatial Consortium (OGC) API Processes specification to ensure interoperability with other geospatial applications, with the anticipation that the specification will be more commonly used in the future. We implemented the proposed system on one of the public cloud providers as a proof of concept and evaluated it with sample geospatial workflows and cloud architecture best practices.
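As a rough illustration of how a client could talk to such a service, the following Python sketch submits an asynchronous job to an OGC API Processes endpoint. The base URL, process identifier, and inputs are hypothetical; the request shape follows the OGC API Processes Part 1 execute operation.

```python
# Hypothetical client for an OGC API Processes endpoint (pip install requests).
import requests

BASE = "https://example.com/ogcapi"  # hypothetical deployment
PROCESS_ID = "reproject"             # hypothetical process

response = requests.post(
    f"{BASE}/processes/{PROCESS_ID}/execution",
    json={
        "inputs": {
            "data": {"href": "https://example.com/data/input.tif"},
            "targetCrs": "EPSG:3857",
        },
    },
    headers={"Prefer": "respond-async"},  # request asynchronous execution
    timeout=30,
)
response.raise_for_status()

# For asynchronous execution the server responds with a job status URL in
# the Location header, which the client polls until the job has finished.
job_url = response.headers["Location"]
print(requests.get(job_url, timeout=30).json().get("status"))
```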
... With the growing amount of global data, it becomes more and more necessary to automate data processing and analysis. Specialised task automation systems are used in areas such as Bioinformatics [31], Geology [18], Geoinformatics [25], and Astronomy [5] to transform data and to extract or derive knowledge. One special kind of such task automation system is the scientific workflow management system. ...
Chapter
Distributed scientific workflow management systems processing large data sets in the Cloud face the following challenges: (a) workflow tasks require different capabilities from the machines on which they run, but at the same time, the infrastructure is highly heterogeneous, (b) the environment is dynamic and new resources can be added and removed at any time, (c) scientific workflows can become very large with hundreds of thousands of tasks, (d) faults can happen at any time in a distributed system. In this paper, we present a software architecture and a capability-based scheduling algorithm that cover all these challenges in one design. Our architecture consists of loosely coupled components that can run on separate virtual machines and communicate with each other over an event bus and through a database. The scheduling algorithm matches capabilities required by the tasks (e.g. software, CPU power, main memory, graphics processing unit) with those offered by the available virtual machines and assigns the tasks accordingly for processing. Our approach utilises heuristics to distribute the tasks evenly in the Cloud. This reduces the overall run time of workflows and makes efficient use of available resources. Our scheduling algorithm also implements optimisations to achieve high scalability. We perform a thorough evaluation based on four experiments and test whether our approach meets the challenges mentioned above. The paper finishes with a discussion, conclusions, and future research opportunities. An implementation of our algorithm and software architecture is publicly available with the open-source workflow management system “Steep”.
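The following Python sketch illustrates the matching step under simplifying assumptions: a task may only be assigned to a virtual machine that offers all of its required capabilities, and a round-robin tie-breaker stands in for the heuristics that distribute tasks evenly. All names are illustrative; the actual algorithm ships with Steep.

```python
# Illustrative capability-based matching: a task goes only to a VM that
# offers all required capabilities; round-robin spreads tasks evenly.
from itertools import cycle

def assign(tasks, vms):
    """tasks: list of (task_id, required capability set)
       vms:   list of (vm_id, offered capability set)"""
    assignments = {}
    rr = cycle(range(len(vms)))  # round-robin over VM indices
    for task_id, required in tasks:
        candidates = [vm_id for vm_id, offered in vms if required <= offered]
        if not candidates:
            assignments[task_id] = None  # no suitable VM available
            continue
        # pick the next matching candidate in round-robin order
        for _ in range(len(vms)):
            vm_id = vms[next(rr)][0]
            if vm_id in candidates:
                assignments[task_id] = vm_id
                break
    return assignments

tasks = [("render", {"gpu"}), ("convert", {"gdal"}), ("train", {"gpu", "highmem"})]
vms = [("vm1", {"gdal"}), ("vm2", {"gpu", "gdal"}), ("vm3", {"gpu", "highmem"})]
print(assign(tasks, vms))  # {'render': 'vm2', 'convert': 'vm1', 'train': 'vm3'}
```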
... Therefore, lambda architecture patterns could also be combined with the microservice architecture style. First approaches for using service-oriented architectures exist for processing large geospatial data in the cloud [31] and for data analytics as a service [32]. Furthermore, the microservice architecture style has been chosen for entity matching approaches within the data integration phase of data science workflows [33]. ...
Article
Full-text available
In order to support fast development cycles and the deployment of software components in productive environments, there are three crucial trends in data science: agile process models, the development of many new technologies, and the increasing usage of cloud platforms. Therefore, effective architectures are needed to support these trends in the data science context. This paper explores and evaluates first approaches to why and how the microservice architectural style can support fast development cycles for data science workflows. Microservices are becoming a popular architectural style for designing modern applications due to several advantages like scalability, reliability, and maintainability. First, this paper points out the research gap on why microservices could be a suitable way to design data science workflows. Second, it defines relevant research questions for future research that address challenges of the microservice architectural style in the data science context. An essential prerequisite for this architectural style is to identify the “right context” of a microservice for data science workflows.
... These patterns are a modular approach that could be combined with the microservice architecture style. First approaches for using service-oriented approaches or microservices in a data-intensive context exist for the processing of large geospatial data in the cloud [17] or for entity matching approaches for data integration [13]. However, those approaches concentrate on specific tasks, domains, or technologies. ...
Chapter
Microservices have become a widely used and discussed architectural style for designing modern applications due to advantages like granular scalability and maintainability. However, decomposing an application into microservices is still a complex task. Software architects often design architectures manually. In this paper we give a state-of-the-art overview of current approaches to identifying microservices. To this end, we conduct a literature review and classify the content based on the software development process.
... Scientific workflow management systems are used in a wide range of areas including (but not limited to) Bioinformatics (Oinn et al., 2004), Geology (Graves et al., 2011), Geoinformatics (Krämer, 2018), and Astronomy (Berriman et al., 2004) to automate the processing of very large data sets. A scientific workflow is typically represented by a directed acyclic graph that describes how an input data set is processed by certain tasks in a given order to produce a desired outcome. ...
... Scientific workflow management systems such as Pegasus [10] or Kepler [25] support this style of processing. The same applies to the architecture presented in our earlier work [18,17] and to the MapReduce programming paradigm [9]. ...
Article
Full-text available
We present a cloud-based approach to transform arbitrarily large terrain data to a hierarchical level-of-detail structure that is optimized for web visualization. Our approach is based on a divide-and-conquer strategy. The input data is split into tiles that are distributed to individual workers in the cloud. These workers apply a Delaunay triangulation with a maximum number of points and a maximum geometric error. They merge the results and triangulate them again to generate less detailed tiles. The process repeats until a hierarchical tree of different levels of detail has been created. This tree can be used to stream the data to the web browser. We have implemented this approach in the frameworks Apache Spark and GeoTrellis. Our paper includes an evaluation of our approach and the implementation. We focus on scalability and runtime but also investigate bottlenecks, possible reasons for them, as well as options for mitigation. The results of our evaluation show that our approach and implementation are scalable and that we are able to process massive terrain data.
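A toy version of this divide-and-conquer loop is sketched below in Python. The naive split and the point-thinning step are stand-ins for the spatial tiling and error-bounded simplification of the actual Spark/GeoTrellis implementation.

```python
# Toy divide-and-conquer LOD pyramid: split points into tiles, triangulate
# each tile, then merge neighbouring tiles with fewer points for coarser
# levels. Thinning replaces the paper's error-bounded simplification.
import numpy as np
from scipy.spatial import Delaunay

def build_lod_pyramid(points, tile_count=8, max_points=500):
    """points: (N, 2) array of x/y coordinates. Returns one point set per level."""
    tiles = np.array_split(points, tile_count)  # naive (non-spatial) split
    levels = []
    while len(tiles) > 1:
        for tile in tiles:           # triangulate each tile at this level
            if len(tile) >= 3:
                Delaunay(tile)       # triangulation result not stored here
        levels.append(np.vstack(tiles))
        # merge pairs of neighbouring tiles and thin them for the next level
        merged = [np.vstack(tiles[i:i + 2]) for i in range(0, len(tiles), 2)]
        tiles = [t[:: max(1, len(t) // max_points)] for t in merged]
    levels.append(np.vstack(tiles))
    return levels  # levels[0] is the most detailed, levels[-1] the coarsest

rng = np.random.default_rng(42)
pyramid = build_lod_pyramid(rng.random((10_000, 2)))
print([len(level) for level in pyramid])  # e.g. [10000, 2000, 1000, 500]
```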
... GEE gives an example of how spatial data processing workflows can be realized without the use of local hardware. Krämer (2018) demonstrates a software architecture that uses microservices for processing large sets of spatial data. The architecture is suitable for deployment in a cloud environment. ...
Conference Paper
Full-text available
Cloud Computing is becoming increasingly important in information technology. Many applications and data sets have already been transferred to a cloud environment. This technical development is therefore also relevant for spatial data applications. Existing approaches that outline the search and processing of geospatial data optimized for cloud computing are addressed and compared with the characteristics of current spatial data infrastructures. Further, a software architecture is proposed which covers approaches for geodata discovery and geodata processing workflows in cloud computing environments.
... In our previous work, we investigated the possibilities of using the cloud and microservice architectures to process large amounts of heterogeneous geospatial data [7,8]. We focused on use cases from various domains such as land management, urban planning, and marine applications [9,10] where we could show that geospatial data can be of great value given there is sufficient computational power, enough storage resources, and suitable software. ...
Article
Full-text available
We present GeoRocket, a software system for the management of very large geospatial datasets in the cloud. GeoRocket employs a novel way to handle arbitrarily large datasets by splitting them into chunks that are processed individually. The software has a modern reactive architecture and makes use of existing services including Elasticsearch and storage back ends such as MongoDB or Amazon S3. GeoRocket is schema-agnostic and supports a wide range of heterogeneous geospatial file formats. It is also format-preserving and does not alter imported data in any way. The main benefits of GeoRocket are its performance, scalability, and usability, which make it suitable for a number of scientific and commercial use cases dealing with very high data volumes, complex datasets, and high velocity (Big Data). GeoRocket also provides many opportunities for further research in the area of geospatial data management.
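The chunking principle can be illustrated with a few lines of Python: a GeoJSON FeatureCollection is split into individual features that can then be indexed and stored independently. This is only a sketch; GeoRocket itself works on streams, supports further formats such as GML, and preserves the original input byte for byte.

```python
# Minimal illustration of chunking: split a GeoJSON FeatureCollection into
# individual features ("chunks") that can be processed independently.
import json

def split_into_chunks(geojson_text):
    """Yield one chunk (a single feature) per element of the collection."""
    collection = json.loads(geojson_text)
    for feature in collection.get("features", []):
        yield json.dumps(feature)  # each chunk is indexed and stored on its own

doc = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "a"},
         "geometry": {"type": "Point", "coordinates": [8.6, 50.1]}},
        {"type": "Feature", "properties": {"name": "b"},
         "geometry": {"type": "Point", "coordinates": [13.4, 52.5]}},
    ],
})

for chunk in split_into_chunks(doc):
    print(chunk)
```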
Article
Full-text available
The author shows the possibility of using a microservice architecture in designing the backend of the National Atlas of Russia (NAR) website, as part of the National Spatial Data System. The relevance of the study stems from the need to reliably represent spatial information about the Russian Federation in the information space. The experience of foreign countries that periodically published national atlases and partially switched from printed versions to websites and geographic information systems on the Internet is considered. The creation of the backend, the server-side part of the NAR website, was divided into stages; a programming language was chosen, and a simplified architectural scheme of the service with its main components was proposed. Free and open-source software is considered for these components.
Article
Full-text available
We present a dynamic searchable symmetric encryption scheme allowing users to securely store geospatial data in the cloud. Geospatial data sets often contain sensitive information, for example, about urban infrastructures. Since clouds are usually provided by third parties, these data need to be protected. Our approach allows users to encrypt their data in the cloud and make them searchable at the same time. It does not require an initialization phase, which enables users to dynamically add new data and remove existing records. We design multiple protocols that differ in their levels of security and performance. All of them support queries containing boolean expressions, as well as geospatial queries based on bounding boxes, for example. Our findings indicate that although the search in encrypted data requires more runtime than in unencrypted data, our approach is still suitable for real-world applications. We focus on geospatial data storage, but our approach can also be applied to applications from other areas dealing with keyword-based searches in encrypted data. We conclude the paper with a discussion on the benefits and drawbacks of our approach.
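The following Python toy example conveys the basic mechanism behind keyword search on encrypted data: records are encrypted with a symmetric cipher, while deterministic HMAC tokens over keywords let the server match queries without seeing plaintext. It deliberately omits the dynamic updates, boolean expressions, and bounding-box queries of the actual scheme.

```python
# Toy searchable-encryption sketch (pip install cryptography). Records are
# encrypted symmetrically; HMAC keyword tokens allow server-side matching.
import hmac, hashlib, os
from cryptography.fernet import Fernet

enc_key = Fernet.generate_key()
mac_key = os.urandom(32)
cipher = Fernet(enc_key)

def token(keyword):
    """Deterministic search token; the server only ever sees these."""
    return hmac.new(mac_key, keyword.encode(), hashlib.sha256).hexdigest()

# client side: encrypt records and attach keyword tokens
index = {}  # token -> list of ciphertexts (this is what the cloud stores)
records = [("bridge inspection", ["bridge", "urban"]),
           ("water pipeline",    ["pipeline", "urban"])]
for text, keywords in records:
    ct = cipher.encrypt(text.encode())
    for kw in keywords:
        index.setdefault(token(kw), []).append(ct)

# query: the client sends only the token for "urban" to the server
for ct in index.get(token("urban"), []):
    print(cipher.decrypt(ct).decode())
```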
Conference Paper
Full-text available
Scientific workflows consisting of a high number of dependent tasks represent an important class of complex scientific applications. Recently, a new type of serverless infrastructures has emerged, represented by such services as Google Cloud Functions or AWS Lambda. In this paper we take a look at such serverless infrastructures, which are designed mainly for processing background tasks of Web applications. We evaluate their applicability to more compute- and data-intensive scientific workflows and discuss possible ways to repurpose serverless architectures for the execution of scientific workflows. A prototype workflow executor function has been developed using Google Cloud Functions and coupled with the HyperFlow workflow engine. The function can run workflow tasks on the Google infrastructure, and features such capabilities as data staging to/from Google Cloud Storage and execution of custom application binaries. We have successfully deployed and executed the Montage astronomical workflow, often used as a benchmark, and we report on initial results of the performance evaluation. Our findings indicate that the simple mode of operation makes this approach easy to use, although there are costs involved in preparing portable application binaries for execution in a remote environment. While our evaluation uses a preproduction (alpha) version of the Google Cloud Functions platform, we find the presented approach highly promising. We also discuss possible future steps related to the execution of scientific workflows in serverless infrastructures, and the implications with regard to resource management for scientific applications in general.
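A hedged sketch of such an executor function is shown below in Python: an HTTP-triggered cloud function stages an input object from storage, runs an application binary, and uploads the result. The bucket layout, the JSON job format, and the `mProject` binary (a Montage tool) are assumptions for illustration; the actual prototype is coupled with HyperFlow.

```python
# Hypothetical executor function for serverless workflow tasks
# (pip install google-cloud-storage). Job format and binary are assumed.
import subprocess
from google.cloud import storage

def execute_task(request):
    """Entry point for an HTTP-triggered function. Expects JSON like:
    {"bucket": "...", "input": "in.fits", "output": "out.fits"}"""
    job = request.get_json()
    client = storage.Client()
    bucket = client.bucket(job["bucket"])

    # stage in: download the input object to the function's local disk
    bucket.blob(job["input"]).download_to_filename("/tmp/input")

    # run an application binary shipped with the function deployment
    subprocess.run(["./mProject", "/tmp/input", "/tmp/output"], check=True)

    # stage out: upload the result back to object storage
    bucket.blob(job["output"]).upload_from_filename("/tmp/output")
    return "done"
```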
Conference Paper
Full-text available
Large Internet companies like Amazon, Netflix, and LinkedIn are using the microservice architecture pattern to deploy large applications in the cloud as a set of small services that can be developed, tested, deployed, scaled, operated and upgraded independently. However, aside from gaining agility, independent development, and scalability, infrastructure costs are a major concern for companies adopting this pattern. This paper presents a cost comparison of a web application developed and deployed using the same scalable scenarios with three different approaches: 1) a monolithic architecture, 2) a microservice architecture operated by the cloud customer, and 3) a microservice architecture operated by the cloud provider. Test results show that microservices can help reduce infrastructure costs in comparison to standard monolithic architectures. Moreover, the use of services specifically designed to deploy and scale microservices reduces infrastructure costs by 70% or more. Lastly, we also describe the challenges we faced while implementing and deploying microservice applications.
Conference Paper
Full-text available
Cloud computing provides new opportunities to deploy scalable applications in an efficient way, allowing enterprise applications to dynamically adjust their computing resources on demand. In this paper we analyze and test the microservice architecture pattern, used during the last years by large Internet companies like Amazon, Netflix, and LinkedIn to deploy large applications in the cloud as a set of small services that can be developed, tested, deployed, scaled, operated, and upgraded independently. This approach has allowed these companies to gain agility, reduce complexity, and scale their applications in the cloud more efficiently. We present a case study where an enterprise application was developed and deployed in the cloud using a monolithic approach and a microservice architecture using the Play web framework. We show the results of performance tests executed on both applications, and we describe the benefits and challenges that existing enterprises can get and face when they implement microservices in their applications.
Article
Smart Cities make use of information and communication technology (ICT) to address the challenges of modern urban management. The cloud provides an efficient and cost-effective platform on which they can manage, store and process data, as well as build applications performing complex computations and analyses. The quickly changing requirements in a Smart City require flexible software architectures that let these applications scale in a distributed environment such as the cloud. Smart Cities have to deal with huge amounts of data including sensitive information about infrastructure and citizens. In order to leverage the benefits of the cloud, in particular in terms of scalability and cost-effectiveness, this data should be stored in a public cloud. However, in such an environment, sensitive data needs to be encrypted to prevent unauthorized access. In this paper, we present a software architecture design that can be used as a template for the implementation of Smart City applications. The design is based on the microservice architectural style, which provides properties that help make Smart City applications scalable and flexible. In addition, we present a hybrid approach to securing sensitive data in the cloud. Our architecture design combines a public cloud with a trusted private environment. To store data in a cost-effective manner in the public cloud, we encrypt metadata items with CP-ABE (Ciphertext-Policy Attribute-Based Encryption) and actual Smart City data with symmetric encryption. This approach allows data to be shared across multiple administrations and makes efficient use of cloud resources. We show the applicability of our design by implementing a web-based application for urban risk management. We evaluate our architecture based on qualitative criteria, benchmark the performance of our security approach, and discuss it regarding honest-but-curious cloud providers as well as attackers trying to access user data through eavesdropping. Our findings indicate that the microservice architectural style fits the requirements of scalable Smart City applications while the proposed security approach helps prevent unauthorized access.
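A simplified sketch of the hybrid scheme follows in Python: the bulk record is protected with fast symmetric encryption, while the per-record key and metadata would be wrapped with CP-ABE under an attribute policy. Since CP-ABE requires a dedicated library, `cp_abe_encrypt` is only a placeholder here; just the symmetric part is real code.

```python
# Simplified hybrid encryption sketch (pip install cryptography).
# Bulk data: symmetric encryption. Key and metadata: CP-ABE (stubbed).
from cryptography.fernet import Fernet

def cp_abe_encrypt(policy, plaintext):
    """Placeholder: a real implementation would use a CP-ABE scheme so that
    only users whose attributes satisfy `policy` can decrypt."""
    return {"policy": policy, "ciphertext": plaintext}  # NOT secure

# symmetric part: encrypt the bulk data with a fresh random key
data_key = Fernet.generate_key()
record = Fernet(data_key).encrypt(b"sensor reading: bridge strain 0.42")

# CP-ABE part: wrap the key and metadata under an attribute policy, so the
# record can be shared across administrations without sharing data_key
envelope = cp_abe_encrypt("department:infrastructure AND role:engineer",
                          {"key": data_key, "district": "Mitte"})

print(record[:20], envelope["policy"])
```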
Article
Microservices can be broadly defined as the design of service-oriented software using a set of small services. In a microservice architecture, application complexity is distributed among narrowly focused and independently deployable units of computation. Such complexity can result in security vulnerabilities. Trustworthiness is also an issue when dealing with microservices. Moreover, there may be gaps in existing legal frameworks with regard to this technology. Solutions to these issues must seek balance between security and performance.
Article
The authors propose that an organic workflow-process technology will power the evolution of information system architectures. They outline three likely stages of architectural evolution in the context of a networked economy and discuss critical gaps in the current technology with respect to their envisioned future.